Reading Parquet Files in Python

53,642 views

DataEng Uncomplicated

A day ago

This video is a step-by-step guide to reading Parquet files in Python. Using the pandas library, we can read the data into Python without needing PySpark or a Hadoop cluster. The walkthrough also covers how to install the prerequisites you will need in Python.
Buy me a coffee: www.buymeacoffee.com/dataengu
Sample data from GitHub: github.com/Teradata/kylo/tree...
Medium blog post with code: / how-to-read-parquet-fi...
#python

Comments: 50
@industryrule-4080 9 months ago
Bang on. Thanks for even including the error portion for installing pyarrow. Helpful.
@mdaburafsan5727 9 months ago
Thanks for the easy-to-follow guide.
@the_dev_life 3 years ago
Good stuff! Concise with great info.
@DataEngUncomplicated 3 years ago
Thank you for the kind words!
@jozzalex5055 2 years ago
You saved my life!! Thx for the tutorial!!!
@DataEngUncomplicated 2 years ago
You're welcome Jozz, I'm glad it was helpful!
@hasnaazizah9392 1 year ago
love it, thanks a lot
@multitaskprueba1 2 years ago
You are a genius! Thank you!
@DataEngUncomplicated 2 years ago
Haha you're welcome!
@BeABetterDev 3 years ago
Great video!
@DataEngUncomplicated 3 years ago
Thanks!
@gustavomagro9934 1 year ago
Very helpful video, thanks
@DataEngUncomplicated 1 year ago
You're welcome!
@KhalilYasser 3 years ago
Thanks a lot. I found that the Jupyter kernel died, and after restarting the kernel and trying again I got the same problem. I even tried putting the code in a .py file and running it from the terminal, but nothing was printed by `print(df.head())`.
@DataEngUncomplicated 3 years ago
Strange, I haven't encountered that error. What version of Python were you running?
@Nearnface 5 months ago
wowowowow
@itsevennow 2 years ago
Very helpful tutorial. Newbie question: I am able to load my parquet file in the notebook. It has 130 columns, but only 20 are shown. How can I see all the columns? Even just 1 or 2 rows would be fine.
@DataEngUncomplicated 2 years ago
Thank you! Check out this article, it might help you with this: www.geeksforgeeks.org/how-to-get-column-names-in-pandas-dataframe/
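For readers with the same question, here is a small sketch of two ways to see every column. It assumes only pandas; the 130-column frame and its `col_N` names are invented stand-ins for the commenter's file.

```python
import pandas as pd

# Hypothetical wide frame standing in for the 130-column file.
column_names = [f"col_{i}" for i in range(130)]
df = pd.DataFrame([list(range(130))] * 2, columns=column_names)

# Option 1: list every column name without printing any data.
all_columns = df.columns.tolist()
print(len(all_columns))

# Option 2: temporarily lift the column display limit and show two rows.
with pd.option_context("display.max_columns", None):
    full_view = repr(df.head(2))
print(full_view)
```

`pd.option_context` restores the previous display setting when the block ends; `pd.set_option("display.max_columns", None)` would make the change stick for the whole session.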
@akshatahabbu5015 2 years ago
Is there a way to load a parquet file into an Oracle DB directly using Python scripts?
@DataEngUncomplicated 2 years ago
Yes, this can be done with the cx_Oracle library, which can read and write data.
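One way the loading step might look is sketched below. It is an assumption-heavy illustration rather than a tested Oracle recipe: `insert_dataframe` and the `people` table are hypothetical, and the demo runs against an in-memory `sqlite3` database purely so it works without an Oracle server. The Oracle-specific parts appear only as comments.

```python
import sqlite3

import pandas as pd


def insert_dataframe(conn, table, df, placeholder):
    """Bulk-insert df into an existing table over a DB-API 2.0 connection.

    placeholder(i) returns the bind marker for column i (1-based), because
    marker syntax differs between drivers. Table and column names are
    interpolated into the SQL, so pass trusted names only.
    """
    columns = ", ".join(df.columns)
    markers = ", ".join(placeholder(i) for i in range(1, len(df.columns) + 1))
    sql = f"INSERT INTO {table} ({columns}) VALUES ({markers})"
    # Convert to plain Python objects so the driver receives native types.
    rows = [tuple(row) for row in df.astype(object).values.tolist()]
    cursor = conn.cursor()
    cursor.executemany(sql, rows)
    conn.commit()
    return len(rows)


# Stand-in for: df = pd.read_parquet("your_file.parquet")
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# sqlite3 is used only to keep the sketch runnable; it uses "?" markers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
inserted = insert_dataframe(conn, "people", df, placeholder=lambda i: "?")
stored = conn.execute("SELECT id, name FROM people ORDER BY id").fetchall()
print(inserted, stored)

# With Oracle, the connection would instead come from cx_Oracle, e.g.
#   conn = cx_Oracle.connect(user="...", password="...", dsn="...")
# and the markers would be positional binds: placeholder=lambda i: f":{i}"
```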
@MetallicSiren 1 year ago
@DataEng Uncomplicated I'm getting a NameError for parquet_file, but it has been defined as shown in the video. Please help, thanks
@DataEngUncomplicated 1 year ago
Hello, a NameError occurs when you try to use a variable, function, or module that doesn't exist. Can you make sure the name exists?
@neelrama3946 3 years ago
I get the error OSError: Passed non-file path:. Have you had this?
@DataEngUncomplicated 3 years ago
Sounds like there is an error with your file path. Perhaps you have special characters in it. Try using a raw string such as r"some\path".
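To illustrate why the raw string can matter, the sketch below compares a normal and a raw string literal. The Windows-style path is hypothetical and no file is opened.

```python
# In a normal string literal, "\n" and "\t" are escape sequences,
# so this path silently gains a newline and a tab character.
plain = "C:\new_data\table.parquet"

# A raw string keeps every backslash exactly as typed.
raw = r"C:\new_data\table.parquet"

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))
```

Forward slashes or `pathlib.Path` avoid the problem as well, since neither relies on backslashes.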
@mihaelacostea5783 1 year ago
What should you do when you have a parquet file somewhere on someone else's cloud? Is it possible to feed it to pandas without saving it locally? I am not seeing a way to save it locally. It's a coding challenge that simply gives you the link to the cloud location of the parquet data.
@DataEngUncomplicated 1 year ago
Hi Mihaela, is the cloud in AWS? If so, yes: with the AWS SDK for pandas library you can read the file directly into Python without having to save it locally. See this video for an example using a CSV: kzfaq.info/get/bejne/mNx0nLVk0Ky5pHk.html
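A rough sketch of that approach is below, assuming the AWS SDK for pandas is installed as `awswrangler` and that AWS credentials are already configured. The wrapper name `read_parquet_from_s3` and the bucket URI are hypothetical, and the call is left commented out because it needs a real object to read.

```python
def read_parquet_from_s3(s3_uri):
    """Return a DataFrame read straight from S3, with no local copy on disk."""
    # Imported lazily so this module still loads where awswrangler is absent.
    import awswrangler as wr  # pip install awswrangler

    return wr.s3.read_parquet(path=s3_uri)


# Hypothetical URI; needs valid AWS credentials and read access to the object.
# df = read_parquet_from_s3("s3://example-bucket/data/sample.parquet")
```

If the challenge link is a plain HTTP(S) URL instead, `pd.read_parquet` also accepts URLs directly, which may be enough on its own.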
@nishaddhamne773 3 years ago
Can you let me know the commands to edit parquet metadata information?
@DataEngUncomplicated 3 years ago
Hi Nishad, it sounds like you want to edit the data schema of the file in Python. I think this thread has the answer you are looking for: stackoverflow.com/questions/41567081/get-schema-of-parquet-file-in-python
@gauravanand6410 2 years ago
How do I read a list of parquet files as a single dataframe?
@DataEngUncomplicated 2 years ago
Hi Gaurav, I looked at the read_parquet method documentation and it doesn't look like it accepts a list as an option. What you could do is loop through each file in your list, read it into Python, and append the dataframes together, assuming your data has the same schema. I think this is a good topic for another tutorial video I could make in the future.
@hiyoungsun 2 years ago
Thanks a lot! This video helps a lot! Could you also let us know how to convert a parquet file to a .csv file in Python?
@DataEngUncomplicated 2 years ago
Hi Envfash, thank you for your feedback. I really appreciate it. You can convert parquet to CSV using the pandas library. I will make a video on this next weekend; I think others might benefit from your question!
@DataEngUncomplicated 2 years ago
Hi Envfash, I made a video just for you on how to convert a parquet file to CSV: kzfaq.info/get/bejne/ob-Xm6mFy6q8nok.html
@enzopablofranciscocarratal5262 1 year ago
I get the error: name 'pd' is not defined. Any advice?
@DataEngUncomplicated 1 year ago
Yes, when importing pandas, it is common to alias it as pd, as in `import pandas as pd`. The error means that import line is missing or has not been run.
@rahayutrifurwani2294 1 year ago
cool, thanks
@DataEngUncomplicated 1 year ago
No problem!
@keithmosaic3703 1 year ago
How do you read the metadata?
@DataEngUncomplicated 1 year ago
I think you're looking for the DataFrame.info method in pandas.
@shubhammural4760 3 years ago
How do I read a parquet file from Azure Blob Storage?
@DataEngUncomplicated 3 years ago
Hi Shubham, this would be a good idea for a future video. Taking a quick look, it seems you need to use the azure.storage.blob library and read the file into a stream: stackoverflow.com/questions/63351478/how-to-read-parquet-files-from-azure-blobs-into-pandas-dataframe
@shubhammural4760 3 years ago
@DataEngUncomplicated I installed all the required libraries, but I'm facing an issue while reading because my parquet file is 350 MB.
@shubhammural4760 3 years ago
@DataEngUncomplicated I used the same code but it gives me an error. Can you please help me with that?
@DataEngUncomplicated 3 years ago
@shubhammural4760 pandas creates the dataframe in memory, so it's possible that you ran out of memory when trying to read a file of this size. You can try reading the data in smaller chunks so your machine won't run out of memory.
@shubhammural4760 3 years ago
@DataEngUncomplicated When reading a parquet file, chunks aren't supported. I explored this too, so I can't iterate through chunks.
@diy__diy 1 year ago
FileNotFoundError: [Errno 2] No such file or directory .... My directory path and file name are correct.
@diy__diy 1 year ago
I stuck with pd.read_parquet, but uploaded the file instead, and it worked!!
@DataEngUncomplicated 1 year ago
Hello, if your path is correct, try making the string a raw string by prefixing it with r, like r'yourpath'.
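When a path looks right but still raises FileNotFoundError, it can help to check how Python actually resolves it. The helper below is a hypothetical diagnostic using only the standard library; the file name is made up and deliberately does not exist.

```python
from pathlib import Path


def describe_path(path_text):
    """Report how Python resolves a path before handing it to a reader."""
    path = Path(path_text).expanduser()
    return {
        "resolved": str(path.resolve()),
        "exists": path.exists(),
        "is_file": path.is_file(),
        "working_dir": str(Path.cwd()),
    }


# Hypothetical file name; relative paths are resolved against working_dir.
report = describe_path("definitely_missing_file.parquet")
print(report)
```

In hosted notebooks the working directory is on the remote machine, which would explain why uploading the file made the same call succeed.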
@z1mt0n1x2 1 year ago
man why can't they just use a zip file...
@DataEngUncomplicated 1 year ago
Sorry, what do you mean exactly?