Refactoring A Data Science Project Part 1 - Abstraction and Composition

Views: 75,278


This is the first part of a 3-part miniseries in which I refactor a hand-written digit recognition data science project based on the MNIST dataset to improve the software design so it's easier to reuse and adapt. In this first part I cover using abstract classes and protocols to better separate the various aspects of the application, and I talk about function composition as a generic solution to dealing with data pipelines.
Thanks to Mark Todisco for helping out with preparing the example. The code I worked on in this video is available here: github.com/Arj....
Links to Pytorch and Scikit learn functional composition tools:
- pytorch.org/do...
- scikit-learn.o...
Part 1: • Refactoring A Data Sci...
Part 2: • Refactoring A Data Sci...
💡 Here's my FREE 7-step guide to help you consistently design great software: arjancodes.com....
🚀If you want to take a quantum leap in your software development career, check out my course The Software Design Mindset: www.arjancodes....
🎓 Courses:
The Software Designer Mindset: www.arjancodes...
The Software Designer Mindset Team Packages: www.arjancodes...
The Software Architect Mindset: Pre-register now! www.arjancodes...
Next Level Python: Become a Python Expert: www.arjancodes...
The 30-Day Design Challenge: www.arjancodes...
🛒 GEAR & RECOMMENDED BOOKS: kit.co/arjancodes.
💬 Join my Discord server here: discord.arjan....
🐦Twitter: / arjancodes
🌍LinkedIn: / arjancodes
🕵Facebook: / arjancodes
👀 Channel code reviewer board:
- Yoriz
- Ryan Laursen
- Sybren A. Stüvel
🔖 Chapters:
0:00 Intro
1:29 Explaining the code
6:41 About data science
7:35 Separating experiment tracking from the rest of the code
16:52 Improving data type consistency
19:44 Improving the way variables are handled
22:26 About function composition
29:03 Final thoughts
#arjancodes #softwaredesign #python
DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!

Comments: 217

@ArjanCodes 10 months ago

💡 Here's my FREE 7-step guide to help you consistently design great software: arjancodes.com/designguide.

@deez_gainz 2 years ago

I think it's the data science, natural science and non-IT engineering people who would actually benefit the most from your software-design-centric videos. I'm one of them, and we literally code spaghetti on a daily basis without ever being taught the SOLID principles =). Thanks, and you're making those who listen better!

@althayrL 2 years ago

I'm a professional data scientist and I'm following the channel since the beginning. It was essential to me in learning to be a better software engineer, even if this is not my main job requirement but my every day tool...

@selimrbd 2 years ago

Same here, data scientist greatly benefitting from this channel

@TheMightyOprah 2 years ago

Agreed - working as a data scientist who is proficient in data wrangling, ML, etc., but definitely lacking in solid software development principles, more videos like these would help me a ton!

@ArjanCodes 2 years ago

Thanks! It’s definitely an area I’d like to do more videos on in the future.

@Jordan-bi4tn 2 years ago

Same, very happy to see Arjan covering this topic as it’s what I was looking for few months ago when I first discovered his channel

@tunapedia 2 years ago

I am a senior data scientist, and I benefit from all your videos. Building architecture, productionizing and scaling up ML models is challenging. It requires good software engineering practices and a good understanding of the full software development stack. Good work as usual Arjan.

@ArjanCodes 2 years ago

Thank you, glad you liked it!

@DanielTobi00 7 months ago

Hello Tunapedia, I came across your insightful comments on this video. I'm currently deepening my skills in data science and recently secured second place in an NLP competition on Zindi. I admire your expertise and would appreciate any guidance or insights you can provide on potential job opportunities in the field. Thank you.

@ZaneSelvans 2 years ago

Yes PLEASE do more videos like this at the intersection of data science / ETL pipelines and software engineering. It's extremely helpful for those of us who have come into building software from another adjacent field and are now struggling with big messes of our own making :)

@ArjanCodes 2 years ago

Thank you Zane, will do!

@kevon217 1 year ago

i second this request!

@shopsmartin5851 2 years ago

All data science programming I’ve ever seen is usually written for a one-off experiment with very little principles applied, whether SOLID or reproducibility. The code is often not object oriented and is more functional - and written in declarative linear steps in one script. Even this code you are starting with is in better shape. I’ll be watching for sure to see these software development principles applied to that sort of programming style.

@mhmdjouni3669 2 years ago

I'm a data scientist and machine learning researcher, and looking into code design and refactoring from your perspective is very helpful for me in terms of coding! Thanks a lot

@Tobbzn 2 years ago

Some feedback: While seeing your face is always a bright point of any day, I still felt that you would often cut to a fullscreen camera view of yourself while talking about the code you just cut away from, which made it a bit hard to follow the structure of the code. Like, at 3:10 you said "You can see this happening here" during a cut where we literally can't see it happening, which caused a weird disconnect in my brain where I felt like I had to switch gears with each cut, trying to take in as much information as possible before the next cut would interrupt the reading. It's an interesting video, but these cuts made it hard to follow.

@cristopherfreitas762 2 years ago

I totally agree with this.

@ArjanCodes 2 years ago

Yes, I also noticed this a bit too late. Will make sure this is better in the next videos.

@BBB-zy6er 2 years ago

@ArjanCodes Your other videos, editing-wise, have excellent pace and I don't notice the cuts at all, making it easy to follow along. This one felt like the cat was standing on the "cut" key.

@ArjanCodes 2 years ago

Haha, I did start working with a cat (read: video editor ;) ) since a few weeks. It’s clear we still need to fix a few things in the process, but I’m on it.

@leestoddart7014 2 years ago

absolutely - this was really stopping me understand the process. Stay in the small box if you are talking about the specific code

@loumote 2 years ago

The "Unsatisfying cliffhanger" is me realizing I now have to go through a lot of refactoring because I've done this lazy single-variable function chains waaay too much... Great job as always, thank you Arjan !

@anelm.5127 2 years ago

Learned the most from your refactoring videos. Really enjoy them. Seeing the SOLID principles in practice especially made them super easy to understand.

@ArjanCodes 2 years ago

Great to hear, thanks!

@sdar1988 2 years ago

I always used coding as a tool to test my hypotheses. Your videos put into perspective why and how writing code is much more than that. I am not a trained software engineer but a data scientist by profession. I feel your videos are really helping me fill glaring gaps in the software design process when conceiving my data projects, and this is important for the data science community, as most of us do not come from a software engineering background. Please make more videos in this series. Godspeed.

@ArjanCodes 2 years ago

Hi Arjun, thank you, I'll definitely continue in this direction. I think there are a lot of things to cover, so stay tuned!

@michaelt6922 2 years ago

Thank you for your content Arjan, I have intermediate python skills but have been learning a lot from your refactoring videos. Moving to OOP for my projects has been a steep but rewarding curve. Thanks again!

@pawelkubik 2 years ago

It's worth pointing out that those single-variable function calls are often preferred, because network composition is rarely purely sequential. In general, it is a DAG. For experimenting it's important to be able to quickly access intermediate results of the network and a chain of calls make it much easier. In practice it's more important to detect repeatable and meaningful patterns in the network and split them into separate classes, e.g. a network may consist of a sequence of 12 layers, but it could be conceptually easier to view it as a sequence of 4 blocks - 3 layers each. tl;dr - don't refactor out all single-variable function calls right away

@ArjanCodes 2 years ago

Good to know, thanks!

@pawelkubik 2 years ago

In my experience, almost every new ML engineer start the journey from solving a very simple problem like classification and implement kind of a "Trainer" object. There is a lot of inversion of control to adjust certain parts of the experiments. It seems like a stable framework, but collapses pretty quickly when they try to do something more complicated.

@pawelkubik 2 years ago

There are few popular frameworks that approach this a bit more maturely. I think would be interesting to see an analysis and comparison of libraries like Keras, Ignite and Pytorch Lightning from perspective of an experienced programmer. They all invent some kind of callback or hook mechanism to control data loading and model training.

@VikasGuptacherie 2 years ago

I really liked this novel method of "Code Refactoring" & "Code-Roast" to look things from software best practices and see how to correct these common mistakes. I would like to see more such video.

@MCRuCr 2 years ago

You shouldn't make pure data science/machine learning content, because there is already plenty of that. A sort of "Software design for data scientists [Dummies]" could be a great contribution!

@TheMightyOprah 2 years ago

100% agree with a series on Software Design for Data Scientists!

@ArjanCodes 2 years ago

I agree - I also wouldn't feel very comfortable doing pure data science / ML stuff since that's not my main area of expertise. But I'll definitely think more about how design principles and patterns can be used in this setting!

@sergeiparshin9488 2 years ago

@peterdowdy174 Kedro could probably be useful for combining notebooks with the code itself. P.S. Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code.

@alchemication 2 years ago

@peterdowdy174 Hey Peter, I have been struggling with this topic for a few years and ended up here: Notebooks are great for local/quick/dirty experiments, but not for proper, production-grade code. For many, many reasons... Once I accepted this - my life is a happier place ;) Greetings and all the best!

@alonyariv8999 2 years ago

Yes please, that is such an important content to have

@Michallote 1 year ago

Arjan, I'm in awe of how easily you rework things just by looking at them. And it works every time! I recently followed all your advice in a program I'm developing, and it took me a day just to get the thing running again in the new format. We are incredibly lucky to have you teaching us this stuff. Most courses repeat the design principles over and over, but getting to see them applied so naturally really makes them stick. Thank you so much

@anzei331 2 years ago

Great vid, was looking forward to this for a while since you mentioned on Reddit that you had plans to get into ML/DS from software engineering perspective. Much better to refactor a project which is a real world scenario, rather than simple hypothetical examples which are abundant.

@leif_p 2 years ago

Worth pointing out that both sklearn's Pipeline and torch's Sequential compose _classes_ satisfying certain interfaces and return _classes_ (with possibly different capabilities). Which is a bit more complicated than function composition, but usually necessary in real-world situations where the aggregate process needs more capabilities than just being Callable.

@alchemication 2 years ago

This is actually what I do at work - working in a Data Science team as a Software Engineer with some prior ML knowledge. I have to tell you that the code you received for refactoring here is actually what I would consider a state of the art design ;- ) No offence to Data Scientists, I totally understand how complex their world is!! Hopefully as the discipline matures a bit more, and sadly more projects fail due to quick & dirty solutions - we will be all in a better place. Thank you for your work.

@ArjanCodes 2 years ago

You're most welcome and I absolutely agree with you - data science is a very complex field and it makes total sense that data science education programs have to spend all their time on data science concepts, leaving little room for software engineering practices!

@sai1921 2 years ago

I'm a simple man. I see Arjan post, I hit like button. As a DS student, this actually helps a bunch. Thanks brother!

@aliwelchoo 2 years ago

As a data scientist that was already watching your content, definitely looking forward to this series!

@ArjanCodes 2 years ago

Thanks!

@AbhirupMishra 2 years ago

I really loved this video. I work in Quantitative Finance, where we have to write a lot of code (usually in a scientific programming language, a.k.a Python), and I've benefited a lot from these videos. A lot of a code that I've encountered is usually a spaghetti code, and just starting to think of solving the problems from good design principles has really helped in increasing the flexibility, maintainability and readability of my code. I always look forward to watching these videos! Hopefully, you'd cover more advanced topics of Python and designing systems in the future.

@ArjanCodes 2 years ago

Thanks, I'll definitely do more videos like this in the future!

@programmertheory 2 years ago

I remember dealing with MNIST data sets in college when I was learning Machine Learning. I was taking an OOP course at the same time and my first ML (Machine Learning) assignment was a single-layered neural network with 10 perceptrons. Even though I went object-oriented with the assignment it took forever to go through the training data and testing data, 12+ hours in total in runtime. It wasn't that accurate either, like 75-80%. However, I redid the assignment, abandoning most, if not all, OOP principles and going towards something more procedural and mathematical (linear algebra to be precise). There was a huge difference in my experience. The code was easier to read, easier to understand, and a lot faster, when going through the training and testing data in less than 1 second and was reaching 92-96% accuracy.

@visualapproach7155 2 years ago

I love these refactoring series. So informative. Thanks, not only to Arjan, but to the people who submit their code to literally be picked apart and rebuilt.

@joaopedrorocha5693 1 year ago

This helper function to compose is a gold nugget . I think it should go into the functools module so we could simply import it. The idea is so intuitive that it wouldn't be a problem if it wasn't explicitly defined on the codebase.
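For readers who have not reached the 22:26 chapter yet, a helper of this kind could be sketched with `functools.reduce` roughly as follows. This is a minimal sketch and not necessarily the exact code from the video: the alias `ComposableFunction`, the `identity` default, and the example functions `add_three` and `mul_two` are assumptions made for illustration.

```python
from functools import reduce
from typing import Callable

# Hypothetical alias: any function that takes and returns a single float.
ComposableFunction = Callable[[float], float]


def compose(*functions: ComposableFunction) -> ComposableFunction:
    """Combine single-argument functions into one, applied left to right."""

    def identity(x: float) -> float:
        return x

    # reduce folds the functions pairwise: (f1 then f2) then f3, and so on.
    # Starting from identity means compose() with no arguments is still valid.
    return reduce(lambda f, g: lambda x: g(f(x)), functions, identity)


# Illustrative functions, not taken from the video's code base.
def add_three(x: float) -> float:
    return x + 3


def mul_two(x: float) -> float:
    return x * 2


pipeline = compose(add_three, add_three, mul_two, mul_two)
print(pipeline(12))  # ((12 + 3) + 3) * 2 * 2 = 72
```

The composed `pipeline` is itself an ordinary callable, so it can be passed around or composed again with other functions.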

@1oglop1 2 years ago

I love this: the video and the comments save me a lot of time I would otherwise spend returning code reviews to data people over and over! Now I can just send them here to learn what is not spaghetti!

@DrPizza92 2 years ago

I’m a JS guy but have learned so much from watching your videos. Thanks!

@red_cape. 2 years ago

I'm a newb in Python, and coming from experience in other languages it is hard to flip the switch to a new one. Arjan's videos have been crucial to my understanding of the "Pythonic" way. Thanks man! Keep 'em coming ... I don't know if it is your focus here, but I would love to see you talk about a project using PyQt5 ;)

@ArjanCodes 2 years ago

Thank you, glad you like the videos and good topic suggestion!

@niklase5901 2 years ago

I am really interested in design for data science applications. I used to be a programmer but did other stuff for a lot of years; the reason I am back in programming is data science. But I find that practices I am used to from programming applications are lacking in the world of data science. So this is a great one!

@gregorybutcher2647 2 years ago

How on earth does this man not have more subscribers. I mean most people would benefit it's their problem if they don't watch these lmao I'm just glad I'm one of the first to hear his wisdom.

@astronemir 2 years ago

Hi Arjan, I’m an astronomer learning to code more properly, and I work exactly with code like this often. This was so unbelievably helpful. Thank you for starting this series and I’m looking forward to more like it. It’s difficult to prototype things in a Jupyter notebook, get it running, then refactor to something shareable and useable and understandable by others that may need to work with it. You’re teaching me a lot, keep it up!

@joaopedrorocha5693 1 year ago

I'm proto astronomer, passing through the same process as you :D

@MateuszModrzejewski 2 years ago

Fantastic video, I'm eager to watch the two next parts. From my PhD studies in AI I can tell the majority of research code in ML and AI is terribly written and barely readable, even with published works. The guidelines for clean ML code are just starting to emerge and at times I feel there's even more confusing ML config / scheduling / architecture tools released every day than confusing JS frontend tools (and there's a JS framework released almost every day lol). Good to see plain old good design being used in this context. Content like this is VERY valuable, hope to see more ML refactoring videos! All the best!

@ArjanCodes 2 years ago

Thanks and glad to hear you enjoyed the video! Let me know what you think of the other two. I'll certainly revisit more data science oriented content focused on design. Doing this miniseries was a lot of fun.

@MateuszModrzejewski 2 years ago

@ArjanCodes So I've already watched the other two and really enjoyed them as well. Very clean, understandable and applicable approach, and I think your channel really nicely fills a gap in intermediate to advanced programming topics. I really appreciate the references to Dijkstra, Hoare, SOLID, GRASP etc. - super rare to see that on YT. I've also watched your Hydra video and I really like how it complements this miniseries - Hydra is getting lots of interest in the community these days. Another tool that's growing in popularity and could also be interesting for a future video is PyTorch Lightning - it introduces an opinionated design into PyTorch and also aims to clean up some of the clutter that can be found in 90% of AI code.

@jeancerrien3016 2 years ago

Wonderful video! 🙏 Among many other things, you've shown me three nice ways to compose a sequence of functions: 1) with a torch network 2) with a scikit-learn pipeline 3) with functools.reduce I agree the third is very attractive. Some may find it a bit strange that the order of the functions switches, but that's not a defect in my eye.
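The remark about the order switching can be made concrete with a small sketch. It assumes a `reduce`-based `compose` that applies functions left to right (an assumption about the helper, with the illustrative functions `increment` and `double`): nested calls are read inside-out, while the composed version lists the steps in execution order.

```python
from functools import reduce


def compose(*functions):
    """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
    return reduce(lambda f, g: lambda x: g(f(x)), functions)


# Illustrative single-argument functions.
def increment(x):
    return x + 1


def double(x):
    return x * 2


# Nested call: the function written first (double) runs last.
nested = double(increment(5))

# Composed call: the function written first (increment) runs first.
composed = compose(increment, double)(5)

# Swapping the arguments changes the result, so the order matters.
swapped = compose(double, increment)(5)

print(nested, composed, swapped)  # 12 12 11
```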

@xxshogunflames 2 years ago

Looking forward to part two! Learned a lot and will be rewatching

@ArjanCodes 2 years ago

Thanks Jonathan, glad you liked it!

@Bakobiibizo 1 year ago

maybe not when this came out, but now is a helluva time to start doing data science material

@TheGagman2000 2 years ago

Reiterating the others, very useful video for data scientists! I liked the idea of replacing the nested call with the compose function, but what about an "apply" function instead?

    def apply_composition(x, *functions):
        for func in functions:
            x = func(x)
        return x

For me, this seems easier to read than the functools solution... and it's similar to the idea of a torch.nn.ModuleList container in PyTorch.

@jessicameneguel4954 2 years ago

This way you are replacing x as f(x) in the same fashion as the original implementation.

@jessehalliday2948 2 years ago

I just love watching you delete lines of code, keep up the great and informative videos

@MichaelTVickers 2 years ago

I’ve been hunting for a nice way to do function composition in standard-library python for awhile and this version with type hints is 👍

@ShaderKite 2 years ago

I'm loving it! Please continue doing videos like this one :D I'm learning a lot from it - your videos are one of the most valuable/useful ones I've seen for Python or software design in general

@ArjanCodes 2 years ago

Glad to hear it, thank you!

@justfoundit 2 years ago

Using the Sequential is 1 way, and it works nicely when the model has a linear flow, however if you want to build a model with - for example - 2 outputs that's sitting on different levels of the model you need to use the non-sequential way, and then the X for all intermediate stage starts to make sense :)

@ArjanCodes 2 years ago

In this case I would prefer to have a class for defining an Acyclic Directed Graph. Perhaps PyTorch also has this... I didn't check.

@brunosompreee 2 years ago

Thanks! I'm a Data Engineer and this helps a lot!

@ArjanCodes 2 years ago

Thanks so much Bruno, glad it was helpful!

@iliqnew 2 years ago

Once more. A very useful and nice video! Thank you!

@ArjanCodes 2 years ago

Glad it was helpful!

@kevon217 1 year ago

Really cool compose function. Going to use that.

@AdeelEjaz 2 years ago

Really good video, very well explained, and I can see in comments below you have noted the jump cuts away from code. Really will make the video perfect! Thank you

@drhilm 2 years ago

I wish I have seen this video two years ago. I write this kind of project all the time. I learned the hard way to do it like that.

@SupernovaGiacomo 2 years ago

Wow thanks Senpai! Will definitely share on my linkedin and with my data engineering team

@ArjanCodes 2 years ago

Thank you, happy you like it!

@marwensallem1397 2 years ago

Nice video 😊 Hope it reaches all my data scientist colleagues. There are many similarities in machine learning projects, this makes me think of why there is no custom Design Patterns for ML projects ?

@ArjanCodes 2 years ago

Thanks! I'll try to come up with a few ideas for this and cover that in future videos.

@amir3515 2 years ago

Very stimulating and educational video. Love the pace. Thank you.

@sergioquijanorey7426 2 years ago

Really nice video. When working with ML / DS problems, I always end up using ugly designs / hacks that get the job done. And then refactoring is such a pain. Thank you for this advice :D

@ArjanCodes 2 years ago

Thank you Sergio, glad you liked it!

@coert 2 years ago

Once again, excellent stuff Arjan. Definitely going to work with the function composition!

@ArjanCodes 2 years ago

Thanks so much Coert! :)

@benjaminthorand9569 1 year ago

PLEASE give us more from just this very content! Awesome videos, going to spread the word! : ]

@ArjanCodes 1 year ago

Thanks! Will do!

@garrywreck4291 2 years ago

Great video! IMHO, a simple loop over a list of functions is much easier and more readable:

    x = 12
    for func in (add_three, add_three, mul_two, mul_two):
        x = func(x)

@matthewtaruno 2 years ago

One point to consider from a data scientist: a lot of the times we like quick and dirty iterations to our exploratory and predictive insights. Many times (especially under time constraints) quick and dirty is better than slow and beautiful. That's why I personally love notebooks. As long as it is idempotent (notebook runs from start to end without issues) and the environment is containerized, it is reproducible. But I see the merit for both. There is a lot of power in writing scalable and reusable code in this space to organize to complex pipelines that supercharge society's solutions. This is why, over time, I now have learned to use a hybrid of both - but maybe not in the most optimal or well-principled way. Which leads to my suggestion! Would you be able to make a video on how you would use Jupyter Notebooks/Kaggle Kernel Notebooks/Google Collab Notebooks in tandem with with an internal packaged up repository as you have it in the video for DS projects? Maybe this means just maintaining your currently directory structure as shown in this video but adding a "notebooks" folder to the root folder where all that type of analysis is done since we can call your modules from that notebooks folder (not sure how this would be manifested, you probably have a better idea). You use .py scripts for most things that you can install these scripts as modules for use in other scripts or even notebooks, and that is what I have been doing to keep my notebooks cleaner. But I am sure your perspective on how to have fast iteration times to high value insights, maintain a scalable pipeline, yet keep everything reusable in doing this kind of work - even maybe some sort of generalized approach shown through a video example - would be invaluable. I think this would be a game changer for myself and a lot of people in DS and ML. 
As for this video, your other content has been useful, but seeing it directly applied to the type of work I do on a regular basis brings your concepts to life for me. Please keep these software design principles applied to DS crossover content coming! Thank you for what you do :)

@ArjanCodes 2 years ago

Thanks and great suggestion regarding the combination of notebooks with running python scripts in a repository. I'll look into it!

@iliqnew 2 years ago

Yes please! More of these

@vladimirtchuiev2218 11 months ago

This looks more like a deep-learning project than a data-science one (using Torch and TensorBoard to follow the network training, instead of something like Pandas), which is actually exactly what I need right now: I work a lot with PyTorch and PyTorch Lightning and I'm looking to improve my code. The issue I have with torch.nn.Sequential is that it's annoying to debug when you have an error in your network-building lego, but if you are sure the lego is correct, Sequential is cleaner to use.

@kobebyrant9483 1 year ago

Function composition is really cool and make the code very concise and clean. However, I feel like we achieve it at the cost of readability of the code and additionally make it hard to debug intermediate calculation/steps if suspect something is wrong(in reality this happens very often when there is too much math involved in the code). Some (picky) managers might not like it during code review/pull request for the reasons stated
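One possible way to soften the debugging concern is a composition helper that records every intermediate value while still exposing a single callable. The sketch below is only an illustration of that idea; `compose_with_trace` and the example functions are hypothetical names and are not part of the video's code.

```python
from typing import Any, Callable, List, Tuple

Step = Callable[[Any], Any]
Trace = List[Tuple[str, Any]]


def compose_with_trace(*functions: Step) -> Callable[[Any], Tuple[Any, Trace]]:
    """Compose functions left to right and keep each intermediate result."""

    def run(x: Any) -> Tuple[Any, Trace]:
        trace: Trace = []
        for func in functions:
            x = func(x)
            # Store the step name next to its output for later inspection.
            trace.append((func.__name__, x))
        return x, trace

    return run


# Illustrative steps.
def add_three(x: float) -> float:
    return x + 3


def mul_two(x: float) -> float:
    return x * 2


result, trace = compose_with_trace(add_three, mul_two)(1)
print(result)  # 8
print(trace)   # [('add_three', 4), ('mul_two', 8)]
```

When a step misbehaves, the trace shows which function produced the first suspicious value, without having to unroll the pipeline back into separate statements.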

@greatfate 1 year ago

Exactly what I was thinking

@nicolabombace2004 2 years ago

As always a great video! The only suggestion I would add is maybe to turn off Intellisense for the video, because all the red squiggly lines are a bit overwhelming and actually useless because the code works!

@ArjanCodes 2 years ago

Thanks for the tip! I might do that for future refactorings (at least in the beginning :) ).

@sombrero7935 2 years ago

The one issue I have with this design is that is based solely on pytorch, so if you like to go to another framework such as tensorflow, this will require quite a bit of refactoring (without taking into account the new framework coding stuff), thus most likely making breaking changes to consumers that use the project

@ArjanCodes 2 years ago

In general, this is a really hard problem to solve. Especially since most frameworks like Pytorch, TensorFlow, etc. ask you to "marry" the framework and use their data types all over the place, which then makes it hard to replace the framework with something else. I'll look into this and try to come up with some ideas to do a video about this.

@igordemetriusalencar5861 2 years ago

The most important thing I've learned (I'm still learning) is to write good, cleaner, and reproducible data science code was: "Functional programming paradigm". R (with tidyverse, and tidymodel approach), and Julia programming language made me code almost like I was using a "General System Theory" from Bertalanffy, (ins -> transformations -> outs). With this approach, I can change the ins without break all the code, or I can change the functions (transformations, each one with its own rule) without break all code logic. Since I use Python only for NLP tasks I do not use a functional programming paradigm with it, but I know it is possible, maybe easier in Python (function composition was good to know it). The OO paradigm for Data Science that some data scientists use does not make any sense to me, of course, I am not a professional programmer, maybe for not having ground on computer science, I think that way. By the way, I'm learning a lot with you! Thank you very much!!!

@ArjanCodes 2 years ago

Thanks Igor, glad you like the content! Using pure functions is certainly a great starting point. What OO programming brings to the table is that it provides a nice mechanism for structuring data representations via (data)classes and collection objects such as lists, dicts, and so on. Ideally, you'd have a marriage of both that provides a clear structure of the data, and has data manipulation pipelines with very limited coupling and side effects.

@igordemetriusalencar5861 2 years ago

@ArjanCodes Thank you! I will try to apply this approach to my NLP study codes, I know I have a lot to learn to be able to understand OO stuff, classes, dataclasses, but your videos are helping me a lot.

@ingovb6155 1 year ago

Thanks for making this (and similar) videos. They are very helpful and insightful

@ArjanCodes 1 year ago

Thank you Ingo, glad you liked the video!

@jimogren6306 2 years ago

Great video! One thing that I did not quite understand: when you changed the ExperimentTracker from an abstract base class into a protocol then the TensorboardExperiment no longer inherits from ExperimentTracker. I do not see the connection between the two classes anymore. After the refactor, to me ExperimentTracker seems like an unused class. Or am I missing something?

@ArjanCodes 2 years ago

After changing the ExperimentTracker to a Protocol class, the inheritance relationship between it and TensorboardExperiment is indeed gone. However, ExperimentTracker is used in the Runner class where it defines the interface that is expected for connecting the Runner with the experiment tracker. The result is that you can now create other experiment tracking classes that integrate seamlessly with the Runner class, as long as they implement the methods defined in ExperimentTracker.
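The structural typing described in this reply can be sketched as follows. `ExperimentTracker` and `Runner` are names from the discussion, but the method `add_batch_metric`, the `InMemoryTracker` class and the `run_batch` signature are assumptions chosen for illustration, not the project's actual interface.

```python
from typing import List, Protocol, Tuple


class ExperimentTracker(Protocol):
    """Interface the runner expects; no class has to inherit from it."""

    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        ...


class InMemoryTracker:
    """Hypothetical tracker: it matches the protocol purely by its methods."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, float, int]] = []

    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        self.records.append((name, value, step))


class Runner:
    """Depends only on the protocol, not on any concrete tracker class."""

    def __init__(self, tracker: ExperimentTracker) -> None:
        self.tracker = tracker

    def run_batch(self, step: int, accuracy: float) -> None:
        # A real runner would compute the metric; here it is passed in.
        self.tracker.add_batch_metric("accuracy", accuracy, step)


tracker = InMemoryTracker()
runner = Runner(tracker)
runner.run_batch(step=0, accuracy=0.5)
print(tracker.records)  # [('accuracy', 0.5, 0)]
```

A static type checker accepts `InMemoryTracker` wherever an `ExperimentTracker` is expected because the method signatures line up, which is the only place the protocol is "used".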

@gustavojuantorena 2 years ago

Awesome! I think there are few tutorials about software design topics for data science.

@Astana1337 2 years ago

I like to use multiple inheritance for string Enum classes. For example:

    class MyEnum(str, Enum):
        RED = 'RED'
        BLUE = 'BLUE'
        GREEN = 'GREEN'

*Make sure the str comes first. Then you can use the class like normal, MyEnum.RED, and you can also use a string literal. It avoids the need to use the 'name' attribute. Lastly you also get equality if you are comparing the enum to a string literal.
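A runnable version of the pattern above might look like the sketch below. It only relies on equality, lookup by value and the `value` attribute; how such members render inside f-strings has differed between Python versions (the behaviour of `format()` for mixed-in enums changed in Python 3.11), so that part of the claim is worth checking against the interpreter in use.

```python
from enum import Enum


class MyEnum(str, Enum):
    """String-valued enum; str must come before Enum in the bases."""

    RED = "RED"
    BLUE = "BLUE"
    GREEN = "GREEN"


# Members are real str instances, so they compare equal to string literals.
is_equal = MyEnum.RED == "RED"

# Looking a member up by its string value returns the same singleton.
looked_up = MyEnum("BLUE")

# The underlying string is always available through .value.
raw = MyEnum.GREEN.value

print(is_equal, looked_up is MyEnum.BLUE, raw)  # True True GREEN
```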

@BjarneThorsted 2 years ago

Next time, you should definitely do a tensorflow/keras project. Would love to see how you would go about cleaning up the code in a project like that. full disclosure: I've written a very convoluted DL project with tf.keras and I'm 100% positive it can be written better

@ArjanCodes 2 years ago

Great suggestion! Feel free to submit your code as a Code Roast, and I'd be happy to take a look if it's something I can cover on the channel.

@BjarneThorsted 2 years ago

@ArjanCodes I will try and see if I can package it up in a meaningful way. Right now it is split across two private github repos and trains on a rather large and proprietary image dataset

@mhFFFFFF 2 years ago

Maybe already answered, but does Pandas have function composition (aka network or sequential)? IMO this is a huge benefit of using the R tidyverse, the %>% command is called a “pipe” but it seems to work exactly like function composition and is extremely well-supported and flexible.

@_shikh4r_ 2 years ago

I'm taking notes 📝

@DistortedV12 2 years ago

Okay this video is gonna blow up imo

@tonyli7014 2 years ago

Great topic!

@canvasbagfight 2 years ago

I’ve written a lot of spaghetti code to process scientific data. It’s usually so bad that it just stays as a notebook that’s copied over and laboriously edited for each new time I repurpose it. Really think this is useful content. More please.

@supratikchowdhury2107 2 years ago

Yes to more Data Science!

@doublegdog 2 years ago

Great video. What do you think of folder refactoring? In some repos, I have seen people putting files/classes in a separate folder called "commons" for utility files that are used agnostically across the project. I think this would be a great idea to touch on in a future video. Nonetheless, the best Python videos on YouTube, hands down! Keep up the great content!

@tehdusto 1 year ago

27:07 yo dog I heard you like lambda functions, so I put a lambda function in your lambda function so you can function while you function. ...but really this function composition business is actually breaking my mind. I'll need to practice this one.

@RichardVodden1 2 years ago

Would you ever consider overriding `__str__` on an Enum to return `self.name`? That would avoid having to add `stage.name` in all those f-strings. Feels neat to me from a code repetition perspective, but it does violate the "Explicit is better than implicit" guidance of the Zen of Python. I'd be really interested in your opinion.
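The suggestion in this comment could be sketched as follows. The `Stage` members and the `metric_tag` helper are assumptions made up for illustration; only the `__str__` override is the technique being discussed.

```python
from enum import Enum, auto


class Stage(Enum):
    TRAIN = auto()
    TEST = auto()
    VAL = auto()

    def __str__(self) -> str:
        # Render just the member name ("TRAIN") instead of "Stage.TRAIN".
        return self.name


def metric_tag(stage: Stage, metric: str) -> str:
    # Hypothetical helper: no `.name` needed inside the f-string anymore.
    return f"{stage}/batch/{metric}"


print(metric_tag(Stage.VAL, "accuracy"))
```

The trade-off raised in the comment still applies: a reader of `metric_tag` has to know that `Stage` overrides `__str__`.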

@ArjanCodes 2 years ago

Great suggestion, and I think it works really well in this particular case.

@kazmkazm9676 1 year ago

Thanks for your great content. However, I didn't find your custom composition function useful; PyTorch's Sequential or Scikit Learn's Pipeline seem more appropriate.

@hudabdulwahab2499 1 year ago

this video is amazing - can we please get another data science / ml pipeline refactor?

@ilyaster42 2 years ago

That's a great video! Thanks a lot!

@TimGrob 2 years ago

Overriding the 'forward' function in the Torch Model and updating the state (tensor) of the neural network at each step is actually the way PyTorch recommends doing it.

@esteenbrink 2 years ago

At 14:25 you decide to remove the protocol inheritance, making it implicit. There is no difference to the working of the code, though it does make life harder for anyone needing to change and understand this class, for it is not clear anymore that it should adhere to the protocol.
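For readers weighing this point: Python also allows a class to inherit from a Protocol explicitly, which keeps the intent visible in the class definition. The sketch below contrasts the two options; all class and method names are assumptions for illustration, not the code from the video.

```python
from typing import Protocol


class ExperimentTracker(Protocol):
    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        ...


class ImplicitTracker:
    """Conforms structurally; nothing in the class header says so."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        self.seen.append(name)


class ExplicitTracker(ExperimentTracker):
    """Same behaviour, but the protocol is named as a base class."""

    def __init__(self) -> None:
        self.seen: list[str] = []

    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        self.seen.append(name)


def log_loss(tracker: ExperimentTracker, loss: float, step: int) -> None:
    tracker.add_batch_metric("loss", loss, step)


# Both variants are accepted wherever an ExperimentTracker is expected.
for tracker in (ImplicitTracker(), ExplicitTracker()):
    log_loss(tracker, 0.5, 0)
    print(type(tracker).__name__, tracker.seen)
```

Which variant to prefer is a readability judgment; a type checker treats both as valid trackers.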

@felipealvarez1982 2 years ago

I would love to know about the VS Code keyboard shortcuts you love the most.

@songokussj4cz 2 years ago

Hi Arjan. Love your stuff. Would you be able to create a comprehensive video about how to structure a bigger project? I've got a task to create a PySide2 application with at least 3 windows (Main, Settings, Results) and I'm not sure how to structure it so it's not all inside one file, because that's just too much chaos. How do I connect signals to functions and where do I write them, should each window's code be an individual file, how do I connect everything, and how do I pass variables from one window to another?

@zeki7540 2 years ago

Thanks Arjan!!

@ArjanCodes 2 years ago

You're welcome Zeki, glad you liked the video!

@smalltimer666 2 years ago

Hi Arjan, I write a lot of models and I wanted to ask if you have tips regarding what I imagine is a very simple issue. Version hell. I write code on multiple machines, using multiple styles: jupyter notebooks, org buffers, and of course scripts. Everything is almost always contained in a pipenv environment. But when I try to pipenv install on different machines I keep getting all sorts of version-related errors. I think I am missing some key insight here. There is no way python has such a sloppy design :D Any tips will be really appreciated!

@cajmrn1 2 years ago

DVC, MLflow, and/or Kedro will change your life. They changed mine :).

@esteenbrink 2 years ago

Sponsored by 'basically'. Just kidding, great content. Keep it up.

@Glitchiz57 2 years ago

Great video, thanks! See you next week.

@ArjanCodes 2 years ago

Thanks, glad you liked it!

@EW-mb1ih 2 years ago

Except for using Protocol instead of ABC, your video is nice :) Protocol makes things less clear. Silly question: why do we need to avoid storing intermediate results in the same variable?

@rshelansky 2 years ago

Thanks for these videos; they have been fun to watch. I see the benefit of function composition. However, in practice (data science), when composing functions I have never not had a whole slew of unique parameters and contexts to pass to each function along the chain. Is there an equally elegant solution to this problem?

@ArjanCodes 2 years ago

Hi Robert, good question. I like using either closures or partial functions (from functools) for this. For example, with closures you can define a function (with parameters, contexts, etc.) that returns another function, and that returned function is what you pass to the composition. In terms of the example at the end of this video, you could do the following, where n is an extra parameter and add_n returns a closure:
    def add_n(n: int):
        def add(x: int):
            return x + n
        return add
    ...
    compose(add_n(5), add_n(12), multiplyByTwo, ...)
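Both options mentioned in this reply can be sketched in a few lines. The `compose` helper below is a simple loop-based stand-in written for this example (an assumption, not necessarily the implementation from the video), and `multiply` is a hypothetical two-argument function used to show `functools.partial`.

```python
from functools import partial
from typing import Callable

IntFunction = Callable[[int], int]


def compose(*functions: IntFunction) -> IntFunction:
    """Chain functions left to right: compose(f, g)(x) == g(f(x))."""

    def pipeline(value: int) -> int:
        for function in functions:
            value = function(value)
        return value

    return pipeline


def add_n(n: int) -> IntFunction:
    """Closure: captures n and returns a one-argument function."""

    def add(x: int) -> int:
        return x + n

    return add


def multiply(x: int, factor: int) -> int:
    return x * factor


# partial() fixes `factor`, leaving a one-argument function for the chain.
process = compose(add_n(5), add_n(12), partial(multiply, factor=2))
print(process(1))  # ((1 + 5) + 12) * 2 = 36
```

Closures keep the extra parameters next to the function definition, while `partial` lets you reuse an existing multi-argument function without wrapping it by hand.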

@mathmo 2 years ago

@ArjanCodes Robert, not sure whether @ArjanCodes would approve of this, but you could define a Callable ABC base class for your functions with a __rmul__ (or something like that) method that implements function composition for the __call__ methods, and initialize the instances with whatever parameters you want that are not part of the functional input data. And if you make the __call__ method accept and return a dict, you can also compose functions of different arities.
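One way to read this suggestion is sketched below. It swaps the `__rmul__` mentioned in the comment for `__rshift__` (the `>>` operator), purely because that reads left to right; the `Step`, `Chain`, `Scale` and `Shift` names are all hypothetical.

```python
from abc import ABC, abstractmethod


class Step(ABC):
    """Callable base class; instances can be chained with `>>`."""

    @abstractmethod
    def __call__(self, data: dict) -> dict:
        ...

    def __rshift__(self, other: "Step") -> "Step":
        # `a >> b` builds a new step that runs a first, then b.
        return Chain(self, other)


class Chain(Step):
    def __init__(self, first: Step, second: Step) -> None:
        self.first = first
        self.second = second

    def __call__(self, data: dict) -> dict:
        return self.second(self.first(data))


class Scale(Step):
    def __init__(self, key: str, factor: float) -> None:
        # Parameters live on the instance, not in the data being passed.
        self.key = key
        self.factor = factor

    def __call__(self, data: dict) -> dict:
        return {**data, self.key: data[self.key] * self.factor}


class Shift(Step):
    def __init__(self, key: str, offset: float) -> None:
        self.key = key
        self.offset = offset

    def __call__(self, data: dict) -> dict:
        return {**data, self.key: data[self.key] + self.offset}


pipeline = Scale("x", 2.0) >> Shift("x", 1.0)
print(pipeline({"x": 3.0, "label": "a"}))  # x becomes 3.0 * 2.0 + 1.0 = 7.0
```

Because every step takes and returns a dict, steps that need different inputs can still be chained, at the cost of losing precise type information about the keys.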

@jakobullmann7586 1 year ago

It’s an interesting video, but I think it’s actually misguided advice for Data Science/ML projects. Data Science projects have different dynamics from software engineering projects, hence the need for MLOps platforms. Tracking is needed in the experimentation stage, when things change quickly, and writing abstractions to become independent of a particular experiment tracking platform is not creating value for anyone. What’s actually important is that the experimentation code is decoupled from the model code (which is why TensorFlow and LightGBM use callbacks… PyTorch doesn’t, but PyTorch Lightning does, which is why I would always use PyTorch Lightning and not raw PyTorch). Moreover, where I feel abstractions are really powerful is for the model itself, because in order to do model selection I may have to apply a fair evaluation to models that utilize different frameworks (e.g. PyTorch vs LightGBM) or even different problem framings. The first point is what MLflow Models tries to accomplish.

2 years ago

I loved this video. It was the best moment to apply the SOLID design principles to data science, because I work with it on a daily basis. Could you apply the SOLID principles to the pandas library, since it is the most used library for data processing? Again, thank you very much!!

@mnsosa 2 years ago

Where can I learn professional Machine Learning project design? All I've found is Jupyter Notebooks, but I want to do it more professionally.

@vlplbl85 2 years ago

Great stuff

@ArjanCodes 2 years ago

Thank you Vladimir!

@davidoh6342 2 years ago

How do you handle errors if one of the composed functions raises an error?

@Booyah 1 year ago

Why do you switch from showing the code you're discussing, to showing yourself full screen and removing the code from view?

@some84884 2 years ago

Debugging function composition is painful. It's much better to have variables with unique names between calls.

@gercius 1 year ago

You are the Bob Ross of coding

@ArjanCodes 1 year ago

Thanks Gercius, happy you’re enjoying the content!

@ravenecho2410 2 years ago

okay catching up on vids 😋

@_veikkomies 2 years ago

How can Tensorboard do anything using the experiment tracker class since you removed the inheritance and I can't see how the two classes are linked any more. What's the point of the experiment tracker class now?

@ArjanCodes 2 years ago

That’s the whole idea of protocols. The relationship no longer exists between superclasses and subclasses, but you use protocols to define the interface at the place where it’s needed and Python’s structural typing system then does the type checks. So in this example, the goal of the experiment tracker protocol class is not to act as a superclass, but to act as an interface of the part of the code that uses it, here that’s the main file and the Runner class.

@_veikkomies 2 years ago

@ArjanCodes Ahh thank you

@atillakoseoglu4089 2 years ago

Dear Arjan, I'm a rookie with 3 months of Python (learned the basics of classes, functions, etc.) and I'm interested in data work, not development 🙀 Do you think that's a problem, in terms of finding a job and career-wise? Thanks for your kind answers and advice 🙏

@christiencodes3086 2 years ago

Do you have Kite installed for autocomplete?

@Gosu9765 2 years ago

At the end, do you really believe that replacing the nested function calls with the functional programming paradigm and that lambda function made the code more readable? In my opinion that's way less readable, since you don't work with a native construct anymore and have to know what the compose function does. Looking at that compose function, it takes WAY more time to deconstruct what it does compared to simply keeping track of parentheses in nested calls. You went way too far there in my opinion.

@ArjanCodes 2 years ago

Absolutely. Every time you add another function to the list, you’re going to get another pair of parentheses, leading to completely unreadable code in the end. This is also the reason that libraries like PyTorch and Scikit Learn have built-in mechanisms for this. Of course you have to write the compose function only once and then you can use it anywhere, without having to understand how it works internally.

@Gosu9765 2 years ago

@ArjanCodes Makes sense, but in that specific case it definitely was overkill.

@ArjanCodes 2 years ago

The example with the addThree and multiplyByTwo functions is intended to explain the concept. In practice, I really think using a function composition tool like Sequential from PyTorch or Pipeline from Scikit Learn is a good idea.
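For anyone following this thread without the video open, the two styles being compared look roughly like this. The `compose` helper is a sketch built on `functools.reduce`; its exact form is an assumption and may differ from the one shown in the video, and it expects at least one function.

```python
from functools import reduce
from typing import Callable

ComposableFunction = Callable[[float], float]


def compose(*functions: ComposableFunction) -> ComposableFunction:
    """Apply functions left to right: compose(f, g)(x) == g(f(x))."""
    return reduce(lambda f, g: lambda x: g(f(x)), functions)


def add_three(x: float) -> float:
    return x + 3


def multiply_by_two(x: float) -> float:
    return x * 2


# Nested calls read inside-out, and every extra step adds parentheses.
nested = multiply_by_two(add_three(add_three(1)))

# The composed version lists the steps in the order they run.
pipeline = compose(add_three, add_three, multiply_by_two)
composed = pipeline(1)

print(nested, composed)  # both are (1 + 3 + 3) * 2 = 14
```

With only three steps the difference is small, which is the objection raised above; the composed form mainly pays off as the list of steps grows or is built dynamically.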

@_akuma06 2 years ago

@ArjanCodes Sequential is great when you want to build a simple model. However, when you want to build a complex network with earlier layers that are called again later, or even just a simple autoencoder network, you have to use the functional API.

@DS-tj2tu 2 years ago

Thank you

@carlosg1535 2 years ago

9:10 Why do you think abstract base classes should only have abstract methods and not attributes?

@ArjanCodes 2 years ago

Overall, I find this gives more flexibility and offers a better separation of responsibilities. In this case, there are several responsibilities of the original abstract class: defining what the interface is between the experiment tracking and the rest of the code, keeping track of the experiment stage, and providing helper methods. I prefer to keep the single responsibility of the abstract class to define the interface and then use either inheritance or composition to provide the other features you need. For example here, I moved the set_stage implementation to the Tensorboard experiment tracker. Alternatively, if you want to be able to reuse the basic implementation of handling the experiment stage, you could create a subclass "BasicExperimentTracker" that provides that implementation, and then your more specific experiment trackers could inherit from that class.
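The layering described in this reply could look like the sketch below: an abstract class that only defines the interface, a `BasicExperimentTracker` that adds reusable stage handling, and a concrete tracker on top. The `Stage` members, the method names and `PrintTracker` are assumptions for illustration.

```python
from abc import ABC, abstractmethod
from enum import Enum, auto


class Stage(Enum):
    TRAIN = auto()
    VAL = auto()


class ExperimentTracker(ABC):
    """Interface only: abstract methods, no state and no helpers."""

    @abstractmethod
    def set_stage(self, stage: Stage) -> None:
        ...

    @abstractmethod
    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        ...


class BasicExperimentTracker(ExperimentTracker):
    """Reusable stage handling; still abstract, since metrics are missing."""

    def __init__(self) -> None:
        self.stage = Stage.TRAIN

    def set_stage(self, stage: Stage) -> None:
        self.stage = stage


class PrintTracker(BasicExperimentTracker):
    """Hypothetical concrete tracker that only adds metric logging."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []

    def add_batch_metric(self, name: str, value: float, step: int) -> None:
        self.lines.append(f"{self.stage.name}/{name}={value} @ {step}")


tracker = PrintTracker()
tracker.set_stage(Stage.VAL)
tracker.add_batch_metric("loss", 0.5, 3)
print(tracker.lines)
```

Trackers that want different stage handling can inherit from `ExperimentTracker` directly and skip the basic implementation.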