How to work with big data files (5gb+) in Python Pandas!

  37,705 views

TechTrek by Keith Galli

Days ago

In this video, we quickly go over how to work with large CSV/Excel files in Python Pandas. Instead of trying to load the full file at once, you should load the data in chunks. This is especially useful for files that are a gigabyte or larger. Let me know if you have any questions :).
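
A minimal sketch of the chunked pattern the video walks through (the file name "events.csv" and the 1,000,000-row chunk size are placeholder choices; the brand/category_code/event_type columns come from the Kaggle dataset):

import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file into memory at once.
pieces = []
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    # Aggregate each chunk down to something small before keeping it.
    chunk["count"] = 1
    counts = chunk.groupby(
        ["brand", "category_code", "event_type"], as_index=False
    )["count"].sum()
    pieces.append(counts)

# Combine the per-chunk aggregates, then sum again so groups that
# span multiple chunks collapse into a single row.
summary = (
    pd.concat(pieces)
    .groupby(["brand", "category_code", "event_type"], as_index=False)["count"]
    .sum()
)
print(summary.head())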
Source code on Github:
github.com/Kei...
Raw data used (from Kaggle):
www.kaggle.com...
I want to start uploading data science tips & exercises to this channel more frequently. What should I make videos on??
-------------------------
Follow me on social media!
Instagram | / keithgalli
Twitter | / keithgalli
TikTok | / keithgalli
-------------------------
If you are curious to learn how I make my tutorials, check out this video: • How to Make a High Qua...
Practice your Python Pandas data science skills with problems on StrataScratch!
stratascratch....
Join the Python Army to get access to perks!
YouTube - www.youtube.co....
Patreon - / keithgalli
*I use affiliate links on the products that I recommend. I may earn a commission on purchases or a referral bonus from the use of these links.
-------------------------
Video timeline!
0:00 - Overview
1:25 - What not to do
2:16 - Python code to load in large CSV file (read_csv & chunksize)
8:00 - Finalizing our data

Comments: 49
@Hossein118 2 years ago
The end of the video was so fascinating to see how that huge amount of data was compressed to such a manageable size.
@TechTrekbyKeithGalli 2 years ago
I agree! So satisfying :)
@fruitfcker5351 1 year ago
If (and only if) you only want to read a few columns, just specify the columns you want to process by adding *usecols=["brand", "category_code", "event_type"]* to the *pd.read_csv* call. Took about 38 seconds to read on an M1 MacBook Air.
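
For reference, a sketch of that suggestion (the file name is a placeholder for the Kaggle CSV used in the video):

import pandas as pd

# Reading only the needed columns cuts memory use and parse time,
# since pandas never materializes the columns you skip.
df = pd.read_csv(
    "events.csv",
    usecols=["brand", "category_code", "event_type"],
)
print(df.memory_usage(deep=True).sum() / 1e9, "GB in memory")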
@mjacfardk 2 years ago
In my 3 years in the field of data science, this is the best course I've ever watched. Thank you brother, keep going.
@TechTrekbyKeithGalli 2 years ago
Glad you enjoyed!
@CaribouDataScience 1 year ago
Since you are working with Python, another approach would be to import the data into a SQLite db, then create some aggregate tables and views ...
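
A sketch of that alternative, assuming illustrative file and table names ("events.db", "events") and the same columns as the video:

import sqlite3
import pandas as pd

# Stream the CSV into SQLite in chunks, then let SQL do the aggregation.
conn = sqlite3.connect("events.db")
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    chunk.to_sql("events", conn, if_exists="append", index=False)

# An aggregate query equivalent to the video's summary table.
summary = pd.read_sql(
    "SELECT brand, category_code, event_type, COUNT(*) AS count "
    "FROM events GROUP BY brand, category_code, event_type",
    conn,
)
conn.close()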
@dhssb999 2 years ago
Never used chunksize in read_csv before, it helps a lot! Great tip, thanks
@TechTrekbyKeithGalli 2 years ago
Glad it was helpful!!
@michaelhaag3367 2 years ago
Glad you are back my man. I am currently in a data science bootcamp and you are way better than some of my teachers ;)
@TechTrekbyKeithGalli 2 years ago
Glad to be back :). I appreciate the support!
@jacktrainer4387 2 years ago
No wonder I've had trouble with Kaggle datasets! "Big" is a relative term. It's great to have a reasonable benchmark to work with! Many thanks!
@TechTrekbyKeithGalli 2 years ago
Definitely, "big" very much means different things to different people and circumstances.
@Nevir202 1 year ago
Ya, I've been trying to process a book in Sheets; at 100k words (so a few MB), the way I'm trying to do it is already too much lol.
@agnesmunee9406 1 year ago
How would I go about it if it was a JSON Lines (jsonl) data file?
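
One possible approach, sketched under the assumption of a placeholder file name: pandas can stream JSON Lines the same way read_csv streams CSVs.

import pandas as pd

# lines=True treats each line as one JSON record; chunksize makes
# read_json return an iterator of DataFrames instead of one DataFrame.
reader = pd.read_json("events.jsonl", lines=True, chunksize=100_000)

for chunk in reader:
    # Each chunk is a regular DataFrame; aggregate it as you would a CSV chunk.
    print(chunk.head())
    break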
@andydataguy 1 year ago
Great video! Hope you start making more soon
@TechTrekbyKeithGalli 1 year ago
Thank you! More on the way soon :)
@AshishSingh-753 2 years ago
Pandas has capabilities I didn't know about - secret Keith knows everything
@TechTrekbyKeithGalli 2 years ago
Lol I love the nickname "secret keith". Glad this video was helpful!
@abhaytiwari5991 2 years ago
Well-done Keith 👍👍👍
@TechTrekbyKeithGalli 2 years ago
Thank you :)
@spicytuna08 1 year ago
Thanks for the great lesson! Wondering what the performance difference would be between output = pd.concat([output, summary]) and output.append(summary)?
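
A sketch of the difference (file name and columns are illustrative): both repeated pd.concat and DataFrame.append, which was removed in pandas 2.0, copy every row accumulated so far on each iteration, so the loop cost grows quadratically; collecting pieces in a list and concatenating once is linear.

import pandas as pd

# Quadratic pattern: each iteration copies the whole accumulated frame.
output = pd.DataFrame()
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    summary = chunk.groupby(["brand", "event_type"]).size().reset_index(name="count")
    output = pd.concat([output, summary])

# Linear pattern: collect the pieces, copy once at the end.
pieces = []
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    summary = chunk.groupby(["brand", "event_type"]).size().reset_index(name="count")
    pieces.append(summary)
output = pd.concat(pieces)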
@manyes7577 2 years ago
I have an error message on this one. It says 'DataFrame' object is not callable. Why is that and how do I solve it? Thanks.
for chunk in df:
    details = chunk[['brand', 'category_code', 'event_type']]
    display(details.head())
    break
@TechTrekbyKeithGalli 2 years ago
How did you define "df"? I think that's where your issue lies.
@rishigupta2342 1 year ago
Thanks Keith. Please do more videos on EDA in Python.
@lesibasepuru8521 1 year ago
You are a star my man... thank you
@JADanso 2 years ago
Very timely info, thanks Keith!!
@machinelearning1822 1 year ago
I have tried and followed each step, however it gives this error: OverflowError: signed integer is greater than maximum
@TechTrekbyKeithGalli 1 year ago
How big is the data file you are trying to open?
@rokaskarabevicius 2 months ago
This works fine if you don't have any duplicates in your data. Even if you de-dupe every chunk, aggregating it makes it impossible to know whether there are any dupes between the chunks. In other words, do not use this method if you're not sure whether your data contains duplicates.
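
One workaround for exact-duplicate rows, sketched under the assumption that the de-duplication key columns are small enough to fit in memory even when the full file is not:

import pandas as pd

# Per-chunk drop_duplicates misses rows that repeat in *different* chunks,
# so keep the (small) key columns from every chunk and de-dupe once at the end.
key_cols = ["brand", "category_code", "event_type"]  # illustrative key

pieces = []
for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
    pieces.append(chunk[key_cols].drop_duplicates())

unique_rows = pd.concat(pieces).drop_duplicates()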
@ahmetsenol6104 1 year ago
It was quick and straight to the point. Very good one thanks.
@elu1 2 years ago
great short video! nice job and thanks!
@TechTrekbyKeithGalli 2 years ago
Glad you enjoyed!
@firasinuraya7065 2 years ago
OMG..this is gold..thank you for sharing
@DataAnalystVictoria 9 months ago
Why and how do you use 'append' with a DataFrame? I get an error when I do the same thing. Only if I use a list instead, and then concat all the dfs in the list, do I get the same result as you do.
@CS_n00b 10 months ago
Why not groupby(...).size() instead of groupby(...).sum() on a column of 1's?
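
For what it's worth, the two are equivalent here; a tiny sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"brand": ["a", "a", "b"],
                   "event_type": ["view", "view", "cart"]})

# The video's approach: add a column of 1s and sum it.
df["count"] = 1
by_sum = df.groupby(["brand", "event_type"])["count"].sum()

# Equivalent: size() counts rows per group directly, no helper column needed.
by_size = df.groupby(["brand", "event_type"]).size()

assert by_sum.equals(by_size.rename("count"))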
@lukaschumchal6676 2 years ago
Thank you for the video, it was really helpful. But I am still a little confused. Do I have to read every big file in chunks because it's necessary, or is it just a quicker way of working with large files?
@TechTrekbyKeithGalli 2 years ago
The answer really depends on the amount of RAM that you have on your machine. For example, I have 16 GB of RAM on my laptop. No matter what, I would never be able to load in a 16 GB+ file all at once because I don't have enough RAM (memory) to do that. Realistically, my machine is probably using about half the RAM for miscellaneous tasks at all times, so I wouldn't even be able to open up an 8 GB file all at once. If you are on Windows, you can open up your Task Manager --> Performance to see details on how much memory is available. You could technically open up a file as long as you have enough memory available for it, but performance will decrease as you get closer to your total memory limit. As a result, my general recommendation would be to load files in chunks basically any time the file is greater than 1-2 GB in size.
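
That rule of thumb can also be checked from Python; a sketch assuming the third-party psutil package (pip install psutil) and an illustrative file size:

import psutil

# Compare the file against the memory actually free, not total RAM.
avail_gb = psutil.virtual_memory().available / 1e9
print(f"{avail_gb:.1f} GB currently available")

file_gb = 5.0  # size of the CSV you plan to load (illustrative)
if file_gb > avail_gb * 0.5:
    print("Load it in chunks.")
else:
    print("Loading it all at once is probably fine.")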
@lukaschumchal6676 2 years ago
@TechTrekbyKeithGalli Thank you very much. I cannot even describe how helpful this is to me :).
@oscararmandocisnerosruvalc8503 1 year ago
Cool videos bro. Can you cover load and dump for JSON please :)?
@TechTrekbyKeithGalli 1 year ago
No guarantees, but I'll put that on my idea list!
@dicloniusN35 6 months ago
But the new file has only 100,000 rows, not all the info. Are you ignoring the other data?
@konstantinpluzhnikov4862 2 years ago
Nice video! Working with big files when the hardware is not at its best means there is plenty of time to make a cup of coffee, discuss the latest news...
@TechTrekbyKeithGalli 2 years ago
Haha yep
@vickkyphogat 1 year ago
What about .SAV files?
@oscararmandocisnerosruvalc8503 1 year ago
Why did you use the count there?
@TechTrekbyKeithGalli 1 year ago
If you want to aggregate data (make it smaller), counting the number of occurrences of events is a common method to do that. If you are wondering why I added an additional 'count' column and summed it, instead of just doing something like value_counts(), that's just my personally preferred method of doing it. Both work correctly.
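
A small sketch of the two methods side by side, with made-up data:

import pandas as pd

df = pd.DataFrame({"brand": ["a", "a", "b"],
                   "event_type": ["view", "view", "cart"]})

# The video's approach: explicit helper column, then sum.
df["count"] = 1
via_sum = df.groupby(["brand", "event_type"])["count"].sum()

# The value_counts alternative mentioned above: one call, same counts,
# but sorted by frequency instead of by group key.
via_vc = df.value_counts(["brand", "event_type"])

print(via_sum.sort_index().equals(via_vc.sort_index().rename("count")))  # True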
@oscararmandocisnerosruvalc8503 1 year ago
@TechTrekbyKeithGalli Thanks a lot for your videos, bro!!!!