7 Tips To Structure Your Python Data Science Projects

111,059 views

ArjanCodes


In this video, I’ll cover 7 tips to streamline the structure of your Python data science projects. With the right setup and thoughtful software design, you'll be able to modify and enhance your projects more efficiently.
Check out Taipy here: github.com/avaiga/taipy.
The Cookiecutter: github.com/drivendata/cookiec...
👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis
💻 ArjanCodes Blog: www.arjancodes.com/blog
✍🏻 Take a quiz on this topic: www.learntail.com/quiz/okaeaa
Try Learntail for FREE ➡️ www.learntail.com/
🎓 Courses:
The Software Designer Mindset: www.arjancodes.com/mindset
The Software Architect Mindset: Pre-register now! www.arjancodes.com/architect
Next Level Python: Become a Python Expert: www.arjancodes.com/next-level...
The 30-Day Design Challenge: www.arjancodes.com/30ddc
🛒 GEAR & RECOMMENDED BOOKS: kit.co/arjancodes.
👍 If you enjoyed this content, give this video a like. If you want to watch more of my upcoming videos, consider subscribing to my channel!
Social channels:
💬 Discord: discord.arjan.codes
🐦Twitter: / arjancodes
🌍LinkedIn: / arjancodes
🕵Facebook: / arjancodes
📱Instagram: / arjancodes
♪ Tiktok: / arjancodes
👀 Code reviewers:
- Yoriz
- Ryan Laursen
- Dale Hagglund
🎥 Video edited by Mark Bacskai: / bacskaimark
🔖 Chapters:
0:00 Intro
0:50 Tip #1: Use a common structure
1:55 Tip #2: Use existing libraries
4:59 Tip #3: Log your results
5:55 Tip #4: Use intermediate data representations
8:09 Tip #5: Move reusable code to a shared editable package
9:24 Tip #6: Move configuration to a separate file
11:45 Tip #7: Write unit tests
14:09 Final thoughts
#arjancodes #softwaredesign #python
DISCLAIMER - The links in this description might be affiliate links. If you purchase a product or service through one of those links, I may receive a small commission. There is no additional charge to you. Thanks for supporting my channel so I can continue to provide you with free content each week!

Comments: 121
@ArjanCodes · 7 months ago
👷 Join the FREE Code Diagnosis Workshop to help you review code more effectively using my 3-Factor Diagnosis Framework: www.arjancodes.com/diagnosis
@digiryde · 7 months ago
You talk about unit tests in several videos, and I agree completely. The problem is that for most developers, unit testing is still a box of voodoo that they hope they got right. How would you feel about doing a series of videos (or one big one) that goes from simple unit tests to writing a unit test "package" for your hypothetical banking system? It could start with how to discover and define what needs to be tested, move on to how to write those tests, capture the output in a to-do tool, turn that into a development board, and finally automate it for every build where the tests need to run (different companies define "need" differently). Testing before deployment is one of the most important tools that the average developer does not use as effectively as they should, if at all. Thank you for the great content!
@dylanloader3869 · 7 months ago
@digiryde I would love an Arjan take on this process as well. If you're looking for a decent introduction to share (since it sounds like you have a good understanding yourself), I would recommend "Coding Habits for Data Scientists" by David Tan; it's a playlist on YouTube.
@digiryde · 7 months ago
@dylanloader3869 "since it sounds like you have a good understanding yourself" — when it comes to me knowing anything, the one thing I think I know is that there is nothing I don't have more to learn about. :)
@mikefochtman7164 · 7 months ago
One thing I learned long before I started learning Python: bugs seem to be inevitable, and when one happens I ask myself, "How did this get past my unit tests?" So I go back and modify the test suite to catch the bug. Not quite test-driven development, but really helpful with any sort of iterative development or refactoring of a project.
@andrewglick6279 · 7 months ago
If you use notebooks, I _highly_ recommend enabling autoreload. I find myself using notebooks / VS Code interactive sessions frequently. One of my biggest frustrations with notebooks was that if I changed a function, I would have to rerun that cell every time to update the function definition. It was also less conducive to separating my code out into submodules (which are quite convenient). It was a total game changer to add "%load_ext autoreload" and "%autoreload 2" to my Jupyter startup commands. In a way, this workflow promotes putting functions in submodules, because any time you call a function, it will reload that file with any changes you have made.
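For reference, the autoreload setup mentioned above is just two IPython magics at the top of a notebook or interactive session:

```python
# Run once at the start of a Jupyter/IPython session:
%load_ext autoreload
%autoreload 2  # re-import all modules before executing each cell
```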
@henrivlot · 7 months ago
Woah, that's actually great! Never knew this was a thing.
@robertcannon3190 · 7 months ago
I rewrote cookiecutter and turned it into a programmable configuration language. It's called `tackle`, and I am almost done with the last rewrite before I start promoting it. It does everything cookiecutter does plus a ton of other features, like modular code generators (e.g. a license module you can integrate into your code generator), flow control within the prompts (e.g. if you select one option, expose a subset of other options), and schema validation (not really related to cookiecutter, but the main use case of the tool). It's more comparable to Jsonnet, CUE, and Dhall, but you can easily make code generators like cookiecutter as well. Almost done with it (a four-year personal project that I've rewritten five times), and hopefully it gets traction.
@Krigalishnikov · 7 months ago
How do I follow for updates?
@robertcannon3190 · 7 months ago
@Krigalishnikov When I'm done with this last rewrite I'll definitely be writing articles. I should also start doing the Twitter thing at some point (not one for social media). I set up a Discord channel a while back but, again, haven't promoted it, so it is dead. Since it is a language, it needs really good tutorials and examples, so those will be coming soon. I code-generated the API docs with the tool and am trying to replace tox and make with it as well, so all those pieces should make for some cool examples. I also use it all across my own stack managing Kubernetes templates, so that will come with its own press. Any recommendations for how to promote it are welcome, though.
@bimaadi6194 · 7 months ago
@robertcannon3190 GitHub link?
@sergioquijanorey7426 · 7 months ago
Biggest tip is to combine Python scripts with notebooks. Notebooks allow for fast and visual exploration of the data/problem. Then move pieces of code to a ./lib folder and use it from there. And start adding tests. Most of the time you are performing the same operations on the data, so you can end up with a shared lib across multiple projects. And that is very handy when starting a new project.
@DaveGaming99 · 7 months ago
Love these videos! I found your channel from your Hydra config tutorial and all of your videos have been full of invaluable knowledge that I've already been using in my projects at work! Thank you and keep it up!
@ArjanCodes · 7 months ago
Thank you for the kind words, Dave! Will do.
@haldanesghost · 7 months ago
Official request for a full Taipy video 🙌
@williamduvall3167 · 7 months ago
The fastest data storage I have found with Python is Arrow, but I usually use CSV or JSON, although I have used quite a few databases. Also, I have been slowly learning tip number 5 over the past 10 years or so. Once I force myself to make code that others can use, I find myself being much more proficient, and I can see why the top coders generally seem to make tools that others can use in one way or another. Thanks!
@ChrisTian-uw9tq · A month ago
I have had a semi-emotional experience listening to this :D I have been absolutely solitary on a project these last 8 months, having decided at the start that instead of tackling it with my old knowledge, with Excel and SQL, I would do it while learning Python along the way, with GPT. Hearing these tips, I realize I arrived at them just through trial and error or logic! "I have not done so bad" is the feeling, but there is also lots to learn, because this is just the base. And I for sure got a foot in the door, even if just slightly.
Tip 1 - Common structure: I totally tried to apply a common structure, but then under some external stresses and pressures I crumbled, and it was visible in my code thereafter :/ Next time this is a must.
Tip 2 - Existing libraries: learning with GPT can be limiting. Specifically asking GPT for alternative libraries for the same solution, asking for pros and cons, etc., helps you address as many issues as possible with fewer packages, or find the perfect package directly. GPT didn't point that out as much as it could, so it's on you. But learning in more depth, after identifying which packages I use all over, would help me understand what's under the hood. Regarding the pipeline tool: when I started, I wanted something like this. The head dev guy in the office said nothing like it exists, so I made a non-GUI module of functions for handling data load and export for CSV, XLSX, and SQL. Seeing Taipy... uff, can't wait.
Tip 3 - Log results: as in the Excel days, copy after copy, different folders, filenames, tracking every transformation so that, exactly as you said, you can backtrack for verification. Plus it makes it easier to answer external questions as to how and why.
Tip 4 - This was just forced along the way: given different data sources and data formats, I refused to hard-code sources, so I dedicated a phase just to data loading and to how outputs anywhere in the whole project would be presented, stored, and displayed, whether in memory, SQL, or CSV, logging where and when everything is stored for easier recall downstream... it was like a matrix.
Tip 5 - Exactly: once you are done with code in a Jupyter notebook, pop it into Spyder and call it, whether as a function import or, as I learned last week, by running .py files and loading the results back into the notebook. The notebook is a sandbox; .py files are the aim.
Tip 6 - Exactly again, it just had to happen: anything you see repeating, think about the context, the occurrence, the variations, and pop it in a centralized place for easier management, like lookup tables in Excel :) Just trial and error, but thinking of it in advance is a game changer for bigger projects...
Tip 7 - No idea; this was the biggest wow moment for me, because right now I am asked to hand over the code I wrote, and I put plenty of test break points in the code, with comments for what change I made and why, so it can be reverted to a different approach. I didn't want to delete any of my test points but had a feeling it would look terrible handing that over... I am sure the way you talk about it is much neater than what I have formulated, so if there is an "official approach" I would gladly revamp mine, because I literally just used user-changeable boolean variables such as sample_mode = True to trigger df.head(1000) throughout the script, for example.
Next week I find out which team they move me to, as I just came in to clean data and present results, and instead I now have code that takes any address, validates it, and proceeds to secondary data-point evaluations to consolidate data across foreign sources, to spot what is missing and shouldn't be, and also to validate more granular data, in this case ownership and tax rate. It's the most intense bit of work I have done alone; every single step needed something created and addressed, with no consistent normalization anywhere in a single field or data point, crucial or otherwise. In the end, whatever you do, it's only as good as the data you get. But the stories and scenarios the data can tell afterwards... if more people were interested in that and understood the influence of data diligence, it would be like a civilization booster in awareness :D
@TheSwissGabber · 7 months ago
#6 is such a great tip. Here's how I do it:
- a YAML file shipped with the source code for the defaults
- (maybe a user/machine-dependent YAML file for some test system)
- a YAML file with the data to override any values in evaluation
YAML is great because it's human-readable and has comments. So you can tell the user "put a config.yaml with the keys X, Y and Z next to the data and off you go".
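A minimal sketch of that layering, assuming PyYAML and the hypothetical file names defaults.yaml and config.yaml:

```python
# Load packaged defaults, then shallow-merge user overrides on top.
import yaml

def load_config(defaults_path="defaults.yaml", override_path="config.yaml"):
    with open(defaults_path) as f:
        config = yaml.safe_load(f) or {}
    try:
        with open(override_path) as f:
            config.update(yaml.safe_load(f) or {})  # user values win
    except FileNotFoundError:
        pass  # no overrides present; defaults apply
    return config
```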
@JeremyLangdon1 · 7 months ago
I'm a big fan of validating data I'm bringing in via Pandera. It is like defining a contract, and if the data coming in breaks that contract (data types, column names, nullability, validation checks, etc.) I want to know about it BEFORE I start processing it. I also use Pandera typing heavily to define my function arguments and return types, to make it clear that the data going in and out of my functions validates against a schema, which is way better than a generic "pd.DataFrame", e.g. def myfunc(df: DataFrame[my_input_schema]) -> DataFrame[my_output_schema]:
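A runnable sketch of that pattern (schema and column names are made up; in Pandera >= 0.14 the base class is DataFrameModel, while older versions call it SchemaModel):

```python
import pandera as pa
from pandera.typing import DataFrame, Series

class InputSchema(pa.DataFrameModel):
    user_id: Series[int] = pa.Field(ge=0)
    amount: Series[float] = pa.Field(nullable=False)

class OutputSchema(pa.DataFrameModel):
    user_id: Series[int]
    total: Series[float]

@pa.check_types  # validates the argument and the return value at runtime
def summarize(df: DataFrame[InputSchema]) -> DataFrame[OutputSchema]:
    out = df.groupby("user_id", as_index=False)["amount"].sum()
    return out.rename(columns={"amount": "total"})
```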
@thomasbrothaler4963 · 7 months ago
This was a really great video. I have been a data scientist for over 2 years, and it was great to see that I had already developed a habit of using some of these points (probably because of other videos of yours 😂❤), but I also learned something new! Could you do the same thing for data engineering? That would be awesome!
@Vanessa-vz5cw · 7 months ago
Great video! As a data scientist myself, I would love to see you work through an example that uses something like MLFlow. It's a very common tool at every DS-job I've worked since it's open source but also part of many managed solutions. Specifically, I'd love to see how you build an MLFlow client, how you structure experiments/runs and when/where you feel it is best in the flow of an ML pipeline to log to/load from MLFlow. Most MLFlow tutorials I've seen are notebook based, which is great for learning the basics but there isn't much guidance out there on how to structure a project that leverages it.
@philscosta · 7 months ago
As always, thank you for the great video! I'd love to see more DS content, as well as some content on mlflow and dvc.
@obsoletepowercorrupts · 7 months ago
Some good points in the video to get people thinking about different aspects. Scalability leads to a temptation for some to use multiple APIs and thereby an API management tool which in turn costs time and increases the probability of a ML library being used like a sledgehammer to crack a nut _(especially if time lack inclines a planner to be avoidant of dependency tree challenges)._ No matter the scalability of a software system _(whilst not seeing it as something to be seen as "regardless" of scale)_ databasing to keep track can become a bottleneck, and so retries and dead letter queues are worth it. Your mention of workflows is wise and jumping from one database to another _(e.g. MySQL to Postgres)_ is very likely to incur thoughts spent on workflows for that very task. You can optimise all you like _(which is noble)_ but these days people are more incentivised to "build-things" and so somebody might pip that "optimiser person" to the post by throwing computer horsepower at the challenge, thereby forcing something to be big rather than scalable in ways other than unidirectionally. Logs tend to mean hash tables. There are advantages to storing in a database like the choices available for DHT versus round robin. Environmental variables can be for ENV metadata to set up a virtual machine. If you own that server, like you suggest, it's an extra thing to secure _(for example against IGMP snooping)._ Containers and sandboxes are an extra layer of security rather than a replacement for security. Multiple BSD jails for example can be managed with ansible for instance. My comment has no hate in it and I do no harm. I am not appalled or afraid, boasting or envying or complaining... Just saying. Psalms23: Giving thanks and praise to the Lord and peace and love. Also, I'd say Matthew6.
@FlisB · 7 months ago
Great video. I feel data science projects are rarely examined in terms of design/structure quality. I hope to see more videos about it in the future, perhaps on writing tests; I sometimes lack ideas about how to test data science code.
@rafaelagd0 · 7 months ago
Amazing video! I am very happy to see that these are the bits of advice I have been pushing in my work environments. I hope these things become the norm soon.
@ArjanCodes · 7 months ago
Glad you enjoyed the content, Rafael! I hope so too :)
@chazzwhu · 7 months ago
Great vid! Would love to see your approach to doing a data project end to end, e.g. download data, use Airflow to process it, train a model, and host it via an API endpoint.
@luukvanbeusichem7652 · 7 months ago
Love these Data Science videos
@KA3AHOBA94 · 7 months ago
Thank you for the good videos. Will there be any examples of design patterns using a GUI as an example?
@spicybaguette7706 · 7 months ago
As an intermediate data storage format, I typically use DuckDB because I'm quite comfortable with SQL, and DuckDB allows me to query large sets of data very quickly.
@loic1665 · 7 months ago
Great video, and very good advice! I'll be giving a course on software development next semester, and I think some of the points you talked about are worth mentioning!
@ArjanCodes · 7 months ago
I'm glad it was helpful! Good luck on your course next semester :)
@ErikS- · 7 months ago
Arjan can now fill the Dutch city of Tilburg with his 200k subs! Impressive, since he only passed the city of Breda (150k) a couple of months ago! Congratulations Arjan!
@ErikS- · 7 months ago
And next up will of course be Eindhoven, the Dutch city where Royal Philips was founded, now with around 250k inhabitants!
@ArjanCodes · 7 months ago
Thanks Erik. It’s nuts when you think about what those numbers actually mean!
@NostraDavid2 · 7 months ago
A .py file can be a notebook too! Just add "# %%" to create a cell; VS Code will detect it automatically. Downside: the output isn't saved like it is with a regular notebook. Upside: the code is easier to test, no visual fluff, etc.
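A tiny illustration of such a cell-annotated script (file and data names are hypothetical):

```python
# analysis.py — a plain .py file that VS Code renders as interactive cells.
# %%
import pandas as pd
df = pd.read_csv("data.csv")

# %%
df.describe()
```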
@askii3 · 6 months ago
I love these interactive sessions! Another plus is auto-reloading of imported functions I just edited, without re-running the whole script. Just add "%load_ext autoreload" and "%autoreload 2" at the top to enable reloading of edited functions.
@d_b_ · 7 months ago
In defense of notebooks: the exploratory aspect makes it really nice and quick to find problems within the data. You can look deeper at the objects or dataframes where it halts. If there were a way to combine this ability with a well-structured set of scripts, it would be fantastic.
@joewyndham9393 · 7 months ago
Have you used an IDE with a good debugger, like PyCharm? You can set breakpoints to interrogate data, you can evaluate expressions, etc.
@harry8175ritchie · 7 months ago
@joewyndham9393 That isn't a bad idea, but notebooks are great for exploratory data analysis for a few reasons: they combine markdown + Python, allowing you to explain analyses in detail; you see various plots on the fly; and you can separate analyses by cells. However, notebooks are an environment that gets messy quickly. Sharing notebooks has been essential during my career, specifically for research/exploratory projects that require explaining your analyses and thinking along the way.
@rfmiotto · 7 months ago
I used to be a notebook user myself, but over time I adopted pure Python scripts, because you can structure your code better using principles of clean code and clean architecture (plus the benefits of having a linter and a formatter). You made a valid point, though, which is the ability to inspect object values easily in notebooks. I overcome this with the VS Code debugger. Having well-structured code makes it easy to inspect variables in their particular scope. And I believe there might be some VS Code extension that helps display the variables in a friendlier way...
@FortunOfficial · 7 months ago
@rfmiotto I had a similar path. I tried notebooks for a while and loved the quick analysis. Especially with PySpark it's handy, since you don't always have to start the runtime context again, which takes a couple of seconds. BUT somehow notebooks nudge me toward bad practices, such as putting all operations into the global scope instead of using functions. The VS Code debugger is pretty decent; it is a good replacement for the lost interactivity.
@isodoubIet · 6 months ago
What works best for me is to write a script and then run it in a REPL like IPython using %run. That way I still get to do things interactively, persist data sets in memory, etc., but don't have to deal with any of the annoyances of notebooks.
@TomatePerita · 7 months ago
It's a shame the pipeline suggestion came from a sponsor, as it would be very interesting to see you compare different tools like Snakemake and Nextflow. It's a very niche field, and choosing a tool is difficult since you have to commit a lot. Great video though; would love to see that pipeline video eventually.
@askii3 · 6 months ago
One thing I've begun doing is using a flatter directory hierarchy and using SQLite to catalog file paths along with useful, project-specific metadata. This way I write a SQL query to pre-filter data, fetching only the relevant Parquet file paths to pass into Dask for reading and analyzing.
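One way that catalog idea could look, sketched with sqlite3 and Dask (the table and column names are invented for illustration):

```python
import sqlite3
import dask.dataframe as dd

# Query the catalog for just the Parquet files relevant to one experiment.
conn = sqlite3.connect("catalog.db")
paths = [
    row[0]
    for row in conn.execute(
        "SELECT path FROM files WHERE experiment = ?", ("exp_42",)
    )
]
conn.close()

ddf = dd.read_parquet(paths)  # lazy: nothing is read until .compute()
counts = ddf.groupby("label").size().compute()
```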
@joshuacantin514 · 7 months ago
Regarding Jupyter notebooks: a lot of things still work when opening the notebook in VS Code, such as code formatters. You just may need to trigger them specifically in each cell (Alt+Shift+F for the Black formatter, I believe).
@Andrumen01 · 7 months ago
I started doing test-driven development and it has saved me from more than one huge headache! Good advice!
@shouldb.studying4670 · 7 months ago
That fight to the death line caught me mid sip and now I need to clean my monitor 😂
@Will29295 · 11 hours ago
Good overview. Would really appreciate some practical examples.
@ivansavchuk7956 · 7 months ago
Thanks sir, great video! Can you recommend a book on software design?
@Soltaiyou · 7 months ago
Great content as always. I’ve used csv, json, pickle, parquet, and sql files. I would argue there is no “standard” data science project. Once you get past boilerplate stats, you’ll inevitably have to write ad hoc functions to match the idiosyncrasies of your data either for deeper analysis or visualizations.
@ArpadHorvathSzfvar · 7 months ago
I use CSV many times 'cause it's simple and compact! I've also used Parquet when I wanted to be sure about the data types when loading the data back.
@teprox7690 · 7 months ago
Thanks for the content. Quick feedback on the changing camera angles: it may look nice, but it disturbs the flow of thought. Thank you very much.
@Michallote · 7 months ago
I agree, perhaps it's simply not well executed
@viniciusqueiroz2713 · 7 months ago
I constantly use the Parquet data format. It makes loading data WAY faster. In Python it works just like CSV (e.g., with pandas, instead of read_csv() you use read_parquet()). It comes with an intelligent way of compressing repeated values, so it has a much smaller disk footprint than CSV or JSON. It stores data in a columnar fashion, so if you only need some columns for one project and other columns for another, you can avoid pulling unwanted columns into memory. It also works well with big data environments (such as Apache Spark). Having a smaller disk footprint means you can transfer it to other people easily as well, and store it in cloud solutions at a lower cost. And honestly, as a data scientist, you would almost never open the CSV or JSON file and check it yourself; 99% of the time we use a library like pandas or software like Tableau to visualize and work with the data. So being human-readable is not really an advantage for data scientists the way it is for backend and frontend developers.
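The swap the comment describes is nearly a one-word change in pandas (Parquet support requires pyarrow or fastparquet; file names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("events.csv")
df.to_parquet("events.parquet")  # compressed, typed, columnar
# Column pruning: only the requested columns are read from disk.
subset = pd.read_parquet("events.parquet", columns=["user_id", "amount"])
```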
@MicheleHjorleifsson · 7 months ago
Data formats: Parquet and pickle, because they are lightweight and easily adapted to pipelines. Request: would love to see a Taipy project video :)
@philscosta · 7 months ago
It would be great to hear some ideas on how to write good tests for data-related projects. Lately I've been using syrupy to write regression/snapshot tests to at least ensure that the results don't change when I do a refactor. However, this is not very robust. A challenge with all that is creating and managing good test data.
@DeltaXML_Ltd · 7 months ago
Great video 😁
@ArjanCodes · 7 months ago
Thank you so much!
@murilopalomosebilla2999 · 7 months ago
Great content!!
@ArjanCodes · 7 months ago
Thank you so much, happy you're enjoying the content!
@drewmailman1965 · 7 months ago
For tip 5, nbdev from fastai is a great package for exporting cells from a Jupyter notebook to a script. From my notes, YMMV:
At the top of the notebook: #| default_exp folder_name_if_desired.file_name
In each cell to be exported: #| export
To export, add this to a cell and run it:
import nbdev
nbdev.export.nb_export("Notebook Name.ipynb", "./")  # second argument is the output directory
@slavrine · 6 months ago
Can you go over using unit tests in data science projects? My team does not actively use them; we don't see the value when we add new features and change processing code very quickly.
@abdelghafourfid8216 · 7 months ago
For caching and storing intermediate results, the fastest formats I've tried are msgspec for JSON-like data and Feather for table-like data.
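For the curious, a hedged sketch of both formats mentioned above (field names are made up):

```python
import msgspec
import pandas as pd

# msgspec: fast (de)serialization for JSON-like records
payload = msgspec.json.encode({"run": 7, "scores": [0.91, 0.88]})
record = msgspec.json.decode(payload)

# Feather: fast DataFrame round-trips on disk (requires pyarrow)
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df.to_feather("cache.feather")
df2 = pd.read_feather("cache.feather")
```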
@buchi8449 · 7 months ago
Useful list of tips, but here are additional tips we can derive by combining them. Tip 1 + tip 6: use a common way of externalizing configurations. If each project externalizes configurations differently, for example one uses a YAML file and another uses a .env file, it will be a nightmare for other people, particularly for engineers working on deployment and operations.
@buchi8449 · 7 months ago
Tip 4 + tip 7: implement data science logic as pure functions. In other words, don't persist intermediate data in the same code where the data science logic is implemented. The same goes for reading input data. Implement a DS logic as a pure function taking a pandas DataFrame and other parameters as input and returning a pandas DataFrame as output, for example. File I/O should be done in different code that calls this function. This separation of data science logic and file I/O makes unit testing data science code easier.
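A minimal sketch of that separation (column and file names are hypothetical):

```python
import pandas as pd

def add_rolling_mean(df: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Pure: no file I/O, trivial to unit test with a small in-memory frame."""
    out = df.copy()
    out["rolling_mean"] = out["value"].rolling(window).mean()
    return out

def run(in_path: str, out_path: str) -> None:
    """Impure shell: reads input, applies the pure function, persists output."""
    df = pd.read_parquet(in_path)
    add_rolling_mean(df).to_parquet(out_path)
```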
@lamhintai · 7 months ago
I seem to run into this issue with the common advice of storing DB credentials in .env to avoid leakage via version control. I prefer YAML for other things (non-credentials/secrets) due to its structure. And reading in dotenv seems to be a bit messy when using type hints (it can theoretically return None, so warnings appear all over the place downstream). But this means I'm using both formats and not a centralized config…
@MicheleHjorleifsson · 7 months ago
BTW, Jupyter notebooks in VS Code with git are nice, as you get the simplicity of a notebook plus performance data and versioning.
@isodoubIet · 6 months ago
The general point that you can learn from libraries is of course very good, and I've used sklearn as a design guide many times. That said, pandas specifically is probably a bad example for that purpose, since a lot of its design is weird/messed up. For example, the default settings for saving and loading dataframes via CSV are all wrong: if you write a dataframe and then read it back, you should get the same thing; saving to disk should round-trip, right? Well, with pandas it doesn't, both because of indexing issues and because it rounds values by default. There are lots of little corner cases in pandas like that, so IMHO, while it's a powerful package that I probably couldn't live without, it's best used _after_ you've acquired the relevant domain knowledge, not as a teaching tool for it.
@ringpolitiet · 7 months ago
Polars scales great. Read the CSV and query, lazily if needed; Parquet for intermediate file-system storage, polars.write_database if needed. "If you have to ask, Polars is enough."
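The lazy flow the comment describes might look like this (column names are invented; recent Polars spells the method group_by, older releases used groupby):

```python
import polars as pl

result = (
    pl.scan_csv("big.csv")            # lazy: nothing is read yet
      .filter(pl.col("amount") > 0)
      .group_by("user_id")
      .agg(pl.col("amount").sum())
      .collect()                      # the whole query runs here
)
result.write_parquet("summary.parquet")  # intermediate storage
```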
@walterppk1989 · 7 months ago
For config, I like to have a centralized config.py with a Config class. It has attributes that read env vars with sensible defaults, e.g. environ.get('myvar', 'some_sensible_default').
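A minimal sketch of that config.py (the variable names are hypothetical):

```python
# config.py — one centralized place for environment-driven settings.
from os import environ

class Config:
    db_url = environ.get("DB_URL", "sqlite:///local.db")
    log_level = environ.get("LOG_LEVEL", "INFO")
    batch_size = int(environ.get("BATCH_SIZE", "1000"))

config = Config()  # import this single instance everywhere
```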
@riptorforever2 · 7 months ago
6:50 To query a JSON file or a collection of JSON files, there is the TinyDB lib.
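A quick TinyDB illustration (the document fields are made up):

```python
from tinydb import TinyDB, Query

db = TinyDB("results.json")  # the whole "database" is one JSON file
db.insert({"experiment": "exp_42", "accuracy": 0.93})
Run = Query()
hits = db.search(Run.experiment == "exp_42")
```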
@askii3 · 6 months ago
I love using Parquet as an intermediate format, then using Dask to read the files and do processing as lazily as possible until the pipeline is forced out of Dask.
@alivecoding4995 · 7 months ago
Is Taipy running completely locally, or does it use their web services?
@blooberrys · 6 months ago
Can you do an in-depth logging video? :)
@s.yasserahmadi7846 · 7 months ago
Which video should I watch first?! There's no ordering in this playlist; it's confusing.
@user-ml5em9eo2e · 4 months ago
I've been working on a project where files contain raw NumPy arrays and pandas DataFrames. I was really struggling to save the data to simple files, as pandas is slow, and I didn't know what to do with acquisition parameters. I first made my own serializer and deserializer to store the data in huge JSON files (using orjson for performance), but then I stumbled upon pydantic. It's relatively easy to implement compatibility for NumPy and pandas, and now it's still JSON files, but the class objects are very compact, and using ABCs it's easy to create inheritance and apply these traits. Now I'm looking at alternatives to make these files compatible outside of Python, using SQL and/or HDF5. I'm quite surprised that this is not a solved problem. I found xarray, which could work, but it would need a complete rewrite. So, yeah, maybe another day.
@MrPennywise1540 · 7 months ago
I have a Python program that uses Tkinter to make a GUI for editing images. Now I'm developing my webpage with Django, and I want to run the GUI code in my page. I hoped I could solve it with PyScript, but it's not compatible. I'm at a dead end. Can someone give me advice?
@barmalini · 7 months ago
It might be a silly question in this context, but does anyone know of a similarly high-quality channel with a focus on Java? Arjan is such a great educator that I am genuinely considering switching to Python, but it's hard because I must learn Java too.
@joewyndham9393 · 7 months ago
Can someone outline for me what benefits notebooks have over IDE development? I've recently switched from doing data science with an IDE in a typical software dev environment to using Databricks notebooks (due to a job change). I honestly can't see any benefit, but I can see a lot of drawbacks. In an IDE like PyCharm I can rapidly create experiments, I can visualise data, AND I can write clean, safe software. Notebooks put so many obstacles in the way of good development. What am I missing?
@machoo55 · 7 months ago
In an IDE, isn't it slower when one has multiple longish steps in a pipeline and has to rerun everything each time as one iterates? I'd be keen to learn how you get around that.
@sukawia · 7 months ago
@machoo55 Tip #4 can get you pretty far in many cases. You can even set it up to work like a cache (create a decorator that saves the output of the function in a file; next time, it loads directly from it. Then you can put the decorator on your preprocess, load_data, etc. functions).
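One possible shape of that decorator, as a pickle-based sketch (deliberately simplified: the cache key is just a file path, so it ignores argument changes):

```python
import functools
import pickle
from pathlib import Path

def cached_to_file(path: str):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            cache = Path(path)
            if cache.exists():
                return pickle.loads(cache.read_bytes())
            result = func(*args, **kwargs)
            cache.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator

@cached_to_file("preprocessed.pkl")
def preprocess():
    ...  # expensive step; runs only while the cache file is absent
```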
@joewyndham9393 · 7 months ago
@machoo55 Yeah, that's a reasonable concern, but I think it's overcome by good coding practices like abstraction, separation of concerns, and good personal process. Let me flip the question and ask why you need to run steps A, B, C and D in a pipeline to be able to write step E. I'm assuming your answer might be that you need to know what the data looks like at step E. Then my question would be: why don't you know what is coming out of steps A, B, C and D? What I'm getting at here is that well-structured, clean code makes it easy to understand what goes into and what comes out of functions and classes. So you can write huge amounts of good code without actually pushing any real data through it. I also want to ask what you are doing when you are "iterating". Are you debugging? Are you trying new things in your model? Or are you doing both at once? If you are trying to do both at once, then I can see why you like notebooks. They really encourage you to bounce around your code, slipping changes in here and there. And this is one issue I have with notebooks and the way lots of data scientists work: they don't separate the different tasks they are doing. If I'm writing code, that's all I'm doing. If I'm debugging, I'm only debugging. And if I'm extending or modifying a model, it's only after finishing the first version. That gets me to the debugger: in PyCharm I can use the debugger to pause at any moment in the pipeline, to see all the variables, to evaluate expressions, etc. So the functionality you actually want is there, but only at the right time: when you are debugging. And it encourages you to write cleanly, because debugging is waaaay easier in tightly written functions and classes with limited namespaces.
@machoo55 · 7 months ago
Thanks for the suggestions! Often, early in projects (I work in a domain that isn't well established), I do need to see what the data is like to decide on the choice and order of steps. After that, I refactor everything into a proper class-based pipeline. But the helpful things suggested here have given me ideas for how I might start with, and stay in, an IDE.
@joewyndham9393 · 7 months ago
@machoo55 I agree it is always important to do an ad hoc scan of your data, and for that step you're not necessarily writing code that will live on in your codebase, so you can relax the rules of clean code a fair bit. But in my opinion you can do that in a good IDE, which allows inspection of variables and interactive plots. You also get all of the other super productive tools of the IDE. Happy coding!
@pj18739 · 3 months ago
When would I rather use Taipy than Dagster?
@MicheleHjorleifsson · 7 months ago
A different approach to config variables: use a MyConfig class and store it in a pickle file. This way the file isn't clear text when stored.
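A sketch of that idea; note that pickle only obfuscates, it is not encryption, so it shouldn't guard real secrets:

```python
import pickle

class MyConfig:
    def __init__(self, db_url: str, api_key: str):
        self.db_url = db_url
        self.api_key = api_key

with open("config.pkl", "wb") as f:  # binary on disk, not plain text
    pickle.dump(MyConfig("postgres://localhost/db", "not-a-real-key"), f)

with open("config.pkl", "rb") as f:
    config = pickle.load(f)
```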
@joshuacantin514 · 7 months ago
HDF5 seems to be a rather useful data format, for both metadata and data.
@suvidani · 7 months ago
Use schemas to describe and validate your data.
@walterppk1989 · 7 months ago
My tip for teams that run many ML pipelines in particular: don't make many projects based on a single cookiecutter. Instead, nest all of your pipelines in a monolithic repo that is ONLY in charge of ML code, and separate the projects into subfolders. That way, you don't have to maintain many different CI/CD pipelines and Docker images (they tend to be large, but that's the dependencies, not the application code).
@ali-om4uv · 7 months ago
That's good and horrible advice at the same time.
@abomayeeniatorudabo8203 · 7 months ago
Notebooks are great for experiments.
@hubstrangers3450 · 7 months ago
Thank you. Could you please return to LLMs for a short series on MemGPT, OS, and function calls (YT, v=rxjsbUiuOFo, robot-to-robot interaction)? If time permits, could you come up with a demo and the thought process: how futuristic is the scenario, and will it be a cost-effective consideration, on-prem or cloud platform... Thank you
@ali-om4uv · 7 months ago
Redo the video and add data version control (DVC); that is a must once you work in an organisation. It has rudimentary pipelining for model training as well. Everybody should know MLflow. And... avoid tools like Kubeflow if you do not have sufficient manpower to run them.
@scottmiller2591 · 7 months ago
Taipy seems to have abandoned the pipeline as a user concept - it no longer appears in the docs. I assume it's still in the mechanism, but no longer explicitly exposed. Rather, the emphasis seems to be on building GUIs with data nodes. My experience with graphical programming like this has been that they are extremely difficult to review, as one has to unfold a lot of nodes to actually get to the code - maybe they've gotten around this somehow, or maybe they want you to assume their code is foolproof, a bad sign.
@truniolado · 2 months ago
Hey man, where tf did you get that amazing t-shirt?
@knolljo · 7 months ago
A bear t-shirt and no mention of Polars?
@jaimehernanmartinezsilva3996 · 7 months ago
What brand is that bear tee shirt?
@user-cr3ti1vj6f · 7 months ago
Aryan codes? Based.
@TheSwissGabber · 7 months ago
Pandas: every time I come back to an old (6+ month) project, it does not work because they changed the API. That never happened to me with any other library (numpy, matplotlib, scipy, etc.). So I would only use pandas if it REALLY benefits you; otherwise you'll have a guaranteed refactor in 6 months...
@suvidani · 7 months ago
Each project should have its own defined environment; then this should not be a problem.
@Jugular1st · 7 months ago
... And if your abstraction is good, it should only impact a small part of your code.
@isodoubIet · 6 months ago
What's amazing is that despite all the breaking changes, pandas still has a bad api with wrong defaults. Very useful, but not a library anyone should emulate.
@abomayeeniatorudabo8203 · 7 months ago
You are wearing a pandas shirt.
@alexloftus8892 · A month ago
As a professional data scientist, I disagree with a lot of this advice. With exploratory analysis, you are often writing one-off notebooks that nobody will read or reuse. Spending the extra time to write tests in this situation is wasted effort. A good middle ground is including `assert` statements in your functions to make sure they're doing what you think they're doing. Pull code you're going to reuse out of your notebooks, and then write tests for it.
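The assert middle ground could look like this (the column name is hypothetical):

```python
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    # Cheap sanity checks instead of a full test suite for one-off analysis:
    assert "value" in df.columns, "expected a 'value' column"
    assert df["value"].notna().all(), "unexpected NaNs before normalization"
    return df.assign(value=(df["value"] - df["value"].mean()) / df["value"].std())
```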
@juancarlospizarromendez3954 · 7 months ago
As always, start from scratch. It's from zero, nothing, etc.
@prison9865 · 7 months ago
By the time you said what I can do with Taipy, I had already lost interest. Perhaps tell people what Taipy can do for them first, and then how to install it and so on...
@lukekurlandski7653 · 7 months ago
Tip Number 0: Don't use Notebooks
@slayerdrum · 7 months ago
I think they are fine as long as you use them for what they are most suitable for: exploratory analysis with text, not for creating production-ready code (which is often not the way a project starts anyway).
@sergeys.8830 · 5 months ago
Why?
@dinoscheidt · A month ago
Exactly.
@EdwinCarrenoAI · 8 days ago
They are really useful for proofs of concept, exploratory analysis, or basically testing an idea. But they are not a good idea for deployments and production code.
@MarkTrombonee · 2 months ago
Wow! The top comments sound AI-generated: usernames like name+numbers, and long text comments that no one asked for.
@ardenthebibliophile · 7 months ago
For point 5, I've started to get our teams to think: reusable code/functions go in .py files; analyses go in .ipynb. They love Jupyter notebooks, and this helps keep them readable. Bonus: functions that are used a lot can be packaged more easily.