What is Apache Parquet file?

  73,175 views

Riz Ang

A day ago

Today's video discusses what a Parquet file is and why you should consider using it.
0:00 Introduction
0:50 Row vs. Columnar data
1:42 Parquet under the hood
3:11 Parquet encoding examples
5:14 Parquet compression
6:35 Query time comparison
7:12 Integration with other frameworks
7:28 Closing
Further reading:
- databricks.com/glossary/what-...
- www.microsoft.com/en-us/p/apa...
- google.github.io/snappy
- gzip.org
- blog.datasyndrome.com/python-...

Comments: 86
@cidrisonly 2 years ago
Thanks for the amazing explanation of the Parquet file format. Coming from the wood business, parquet as flooring is not new to me. I have done many parquet installation projects. Interesting to see it coming back in Big Data and Data Engineering.
@patrickbateman7665 2 years ago
No one on KZfaq has explained this better than you, Riz. Thank you for making such a great video.
@RizAngD 2 years ago
Wow, thanks
@ashishdukare1313 2 years ago
Thanks from India. Love the way you explain. Very simple and concise information.
@farzadshams3260 24 days ago
Thank you Riz. Very helpful video for getting a high-level understanding of Parquet files!
@RizAngD 10 days ago
Glad to hear that!
@royteicher 2 years ago
Love this video! Less than 10 minutes and in depth on the topic. Thank you!
@RizAngD 2 years ago
Glad it was helpful!
@AmitDileepKulkarni 2 years ago
Lovely explanation, Riz, and thank you for the video! I will recommend your channel to all my colleagues who do database-related jobs!
@RizAngD 2 years ago
Thanks for sharing!
@higiniofuentes2551 1 month ago
Thank you for this very useful video!
@RizAngD 10 days ago
Glad it was helpful!
@harryocallaghan6393 1 month ago
Really great explanation! Thank you so much
@RizAngD 10 days ago
Glad you enjoyed it!
@IamDanish99 2 years ago
Thank you Riz for the wonderful explanation!
@RizAngD 2 years ago
My pleasure!
@nicknick-71 2 years ago
Thanks mate. A very good and quick explanation. Really good work.
@RizAngD 2 years ago
Glad you liked it!
@ecmiguel 2 months ago
Great!!! Greetings from Peru
@RizAngD 10 days ago
thanks!
@dhavaldalasaniya 1 year ago
Really well explained and really nice. Keep going, Riz!!!
@RizAngD 1 year ago
Thanks, will do!
@Van_Verder 2 years ago
Very helpful, thanks!🙏🏽
@roadtrippingwithmihir 28 days ago
Excellent and crisp explanation
@RizAngD 10 days ago
Glad you liked it
@MarkF-ix5mo 3 months ago
Great video. Loved the fact that you used Physical Graffiti - one of my fave albums of all time.
@RizAngD 10 days ago
thanks!!
@lcsxwtian 2 years ago
You make some excellent content my man!
@RizAngD 2 years ago
Glad you think so!
@devarapallivamsi7064 4 months ago
Good and to the point.
@RizAngD 10 days ago
thanks!
@paul1113-zw5pn 6 months ago
Encoding and compression were very well explained. So I have a question: Delta versus Dictionary encoding, how would one decide between them, given that Dictionary seems so much more efficient? But then I suppose it depends on repetition.
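On the Delta versus Dictionary question, a minimal sketch in plain Python may help show why the answer depends on the shape of the data. This is not Parquet's actual implementation; the function names and sample values are made up for illustration. Dictionary encoding pays off when a column has few distinct values, while delta encoding pays off when neighbouring values are close together, as with timestamps or increasing IDs.

```python
def dictionary_encode(values):
    """Replace each value with an index into a table of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def delta_encode(values):
    """Keep the first value, then store only the difference to the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by summing the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Few distinct values, lots of repetition: dictionary encoding shines.
countries = ["PE", "IN", "IN", "PE", "AU", "IN", "PE", "IN"]
dictionary, indices = dictionary_encode(countries)
# dictionary == ["PE", "IN", "AU"], indices == [0, 1, 1, 0, 2, 1, 0, 1]

# Every value distinct but evenly spaced: delta encoding shines.
timestamps = [1_700_000_000, 1_700_000_005, 1_700_000_010, 1_700_000_015]
deltas = delta_encode(timestamps)
# deltas == [1700000000, 5, 5, 5]
```

Dictionary-encoding the `timestamps` column would produce a dictionary as long as the column itself, so nothing is saved, whereas the deltas are tiny numbers that pack into very few bits.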
@munibabu5566 2 years ago
Thank you.. Very well explained.. Crystal clear :)
@RizAngD 2 years ago
Glad it was helpful!
@dylanalbertazzi 2 years ago
Wonderful overview, thank you!
@RizAngD 2 years ago
Glad it was helpful!
@masblogger 2 years ago
I really like this video, very useful. Can't wait for the next video ;)
@RizAngD 2 years ago
Thank you! 😃
@multitaskprueba1 2 months ago
You are a genius! Fantastic video! Thanks!
@RizAngD 10 days ago
Glad it helped!
@reddyroopesh7 2 years ago
Hi Riz, I am doing development from Parquet to Delta Lake. In Parquet, in-line, we have change data capture, which only reads the data if it has a change from the previous one. How good is it? Do you recommend using it for our SCDs? Do you see value?
@LuisRomaUSA 2 years ago
I can def see your channel exploding in a few months. Good quality content on difficult topics, often covered in other videos that last 1 hr, with poor sound quality and no logical flow. You are going places my dude.
@RizAngD 2 years ago
Those are very kind words, Luis! I'm still learning to be a better KZfaqr myself :)
@kalkanserdar 1 year ago
Nice summary. Although, it would help to explain why querying Parquet files is more efficient compared to CSV, especially for select * queries (where a row store format is usually much more efficient). Is it because of the type definitions and metadata features of Parquet? Thanks
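One part of the answer is likely the metadata: Parquet files keep per-column statistics such as minimum and maximum values, which lets a reader skip whole blocks of rows for a filtered query, whereas a CSV reader has to parse every line as text. The sketch below imitates that idea in plain Python. The data structure, `build_row_groups`, and `scan_greater_than` are hypothetical stand-ins, not a real Parquet reader.

```python
def build_row_groups(values, rows_per_group):
    """Split one column into blocks and record min/max for each block."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and how many blocks actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # the statistics prove nothing in this block can match
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


order_ids = list(range(1, 101))                  # 1..100, already sorted
groups = build_row_groups(order_ids, rows_per_group=25)
matches, groups_read = scan_greater_than(groups, threshold=90)
# Only the last of the four blocks is read; matches are 91..100.
```

The skipping only helps when the data is reasonably sorted or clustered on the filtered column; for a plain `select *` with no filter, every block still has to be read.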
@nachetdelcopet 2 years ago
Nice video🎉
@pourmog 2 years ago
nice overview. thank you.
@RizAngD 2 years ago
Thanks for watching!
@praveenravi6014 1 year ago
Hi brother, I have an issue sending a Parquet file to Snowflake. The .parquet file is loaded into the Snowflake table, but the date column shows one day earlier, i.e. if the date is 12-01-2022, then Snowflake shows it as 11-01-2022. I'm looking for help. I appreciate your time reading this. Thanks in advance!
@sreelakshmia6762 2 years ago
Subtitles are covering the content. Please enable the option to switch off captions
@RizAngD 2 years ago
thanks for the feedback!
@elarboldeundj4383 2 years ago
Great video, it cleared everything up for me
@RizAngD 2 years ago
Thanks!
@Village_Crystal_Stone 1 year ago
How do I retrieve the latest file into the destination folder? Can you please explain?
@subarnashrestha7009 1 year ago
Great video. How do I combine multiple snappy.parquet files into a single file and load it into Snowflake?
@nonstopPKFR 2 years ago
Hi! I would like to start a personal project creating a data warehouse in Azure Synapse Analytics. Do you have any suggestions for how I can do so without having to pay a minimum of hundreds of dollars a month to provision a dedicated SQL pool in Azure for my project (as per the pricing I've seen)? Thanks so much! I hope I simply misunderstood Azure's DW pricing.
@ravitalaviya1576 9 months ago
I am currently capturing live data in CSV format. But for the storage benefit, I want the live data to be saved directly in Parquet format. Is that possible or not?
@neelbanerjee7875 1 year ago
Sir, thanks for this detailed content. I have the query below, which I haven't been able to get clarified anywhere. People say to use ORC for Hive and Parquet for Spark. I don't understand the deeper logic behind this. If ORC is more efficient, why can't we use ORC instead of Parquet?
@srinivasa1187 2 years ago
Hi Riz, thanks for this info, one of the best explanations I have seen. One doubt: 1) When you give a table to Parquet, does it first partition by rows, then convert each partition to columnar and store it inside the Parquet file, OR does it directly store the data as columnar in the Parquet file? And could you please explain ORC and the difference between Parquet, ORC and Avro, and when to use what?
@RizAngD 2 years ago
That's a good Q and I don't know the answer, please do let me know when you do! I'm really full with "Life" at the moment, but yeah, I already plan to create videos about ORC and Avro. Stay tuned!
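On the row-partitioning question: as far as the Parquet format documentation describes it, a file is first split horizontally into row groups, and inside each row group the values are stored column by column (one column chunk per column), which matches the first option in the question. A minimal sketch of that layout in plain Python follows; `to_row_groups` and `rows_per_group` are hypothetical names, and a real writer sizes row groups by bytes rather than by a fixed row count.

```python
def to_row_groups(rows, column_names, rows_per_group):
    """Split rows into batches, then pivot each batch into per-column lists."""
    row_groups = []
    for start in range(0, len(rows), rows_per_group):
        batch = rows[start:start + rows_per_group]
        # One "column chunk" per column: all values of that column in this batch.
        chunk = {
            name: [row[i] for row in batch]
            for i, name in enumerate(column_names)
        }
        row_groups.append(chunk)
    return row_groups


rows = [
    (1, "apple", 0.5),
    (2, "banana", 0.25),
    (3, "cherry", 3.0),
    (4, "date", 7.5),
    (5, "elder", 2.0),
]
layout = to_row_groups(rows, ["id", "name", "price"], rows_per_group=2)
# layout[0] == {"id": [1, 2], "name": ["apple", "banana"], "price": [0.5, 0.25]}
# Five rows with two rows per group give three row groups; the last holds one row.
```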
@leoxiaoyanqu 2 years ago
Thanks for the video Riz! I was curious what the practical use case for LZO is (00:05:14), because when comparing with Snappy, assuming we're dealing with hot data, the only advantage of LZO I see would be faster decompression. Anything I'm missing? Thanks in advance
@RizAngD 2 years ago
that's also my understanding :)
@finedinerest 2 years ago
Can you please elaborate more on what repetition levels and definition levels are, with a simpler example? It would really help. Thanks in advance! 😊
@RizAngD 2 years ago
I suggest referring to this blog, a very comprehensive explanation :) www.waitingforcode.com/apache-parquet/nested-data-representation-parquet/read
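For a simpler example of repetition and definition levels, here is a deliberately simplified sketch that handles only one case: a single repeated integer field with no further nesting. Under that assumption, the repetition level is 0 when a value starts a new record and 1 when it continues the current list, and the definition level is 1 when a value is present and 0 when the list is empty. The function names are made up for illustration, and real Parquet schemas with deeper nesting use higher level numbers.

```python
def shred_repeated(records):
    """Flatten a list of lists into (repetition, definition, value) triples."""
    triples = []
    for record in records:
        if not record:
            # Empty list: no value is defined, but the record still needs a marker.
            triples.append((0, 0, None))
            continue
        for position, value in enumerate(record):
            repetition = 0 if position == 0 else 1  # 0 means "new record starts here"
            triples.append((repetition, 1, value))
    return triples


def assemble(triples):
    """Rebuild the original list of lists from the triples."""
    records = []
    for repetition, definition, value in triples:
        if repetition == 0:
            records.append([])         # start a new record
        if definition == 1:
            records[-1].append(value)  # only defined values are stored
    return records


records = [[10, 20], [], [30]]
triples = shred_repeated(records)
# triples == [(0, 1, 10), (1, 1, 20), (0, 0, None), (0, 1, 30)]
```

The two levels are what let a flat column of values be turned back into the original nested records, including the empty one.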
@ConaillSoraghan 2 years ago
Very useful overview Riz. As a total noob to this format, I have a simple question: how do you convert data into the Parquet format? Is that possible?
@RizAngD 2 years ago
Thanks Conaill! You can convert data into Parquet with many tools on the market these days; some notable examples in the Azure world are Spark (via Databricks or Synapse) and Data Factory (as part of the integration).
@reddyroopesh7 2 years ago
Thanks boss
@RizAngD 2 years ago
Welcome
@karthikeyanbalasubramaniam598 2 years ago
Riz, the presentation looks good. I use Parquet files through a Cognos Analytics dataset. Are Parquet files column-based by default?
@RizAngD 2 years ago
Yes they are, Karthikeyan
@arpanmistry3900 1 year ago
Hey Riz, I want your help. Can you please provide me with one sample Parquet file with LZO compression? I am stuck; I tried a lot of things in PySpark and PyArrow to convert but was unable to create a Parquet file with LZO compression. Please provide me with one sample file if you can.
@kennylaikl299 2 years ago
Hi Riz, can you do a video on the use case for Avro compared to Parquet?
@RizAngD 2 years ago
Already in my backlog, I've just been too busy procrastinating!! :P
@SriRam-yq4id 2 years ago
Thanks for the Parquet video Riz. What is the difference between Parquet and Avro?
@RizAngD 2 years ago
There are a few main differences, Sri Ram. Notably, Parquet is a column-based format while Avro is row-based (like Excel), so Parquet is better if you're querying the data column by column (e.g. analytics), whereas Avro would be better (compared to Parquet) if you want to scan or query whole records. Plus, Avro's schema is defined in JSON (more human readable), although the data itself is stored in binary, while Parquet comes with its own binary format and is not as readable.
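The row-based versus column-based difference can be illustrated with a small plain-Python sketch. The records and helper names below are invented for the example, and neither structure is the real Avro or Parquet layout; they only show which access pattern each orientation favours.

```python
records = [
    {"id": 1, "city": "Lima", "amount": 12.5},
    {"id": 2, "city": "Pune", "amount": 7.5},
    {"id": 3, "city": "Perth", "amount": 30.0},
]

# Row-oriented (Avro-like, CSV-like): each record's fields are kept together.
row_store = [tuple(r.values()) for r in records]

# Column-oriented (Parquet-like): all values of one field are kept together.
column_store = {key: [r[key] for r in records] for key in records[0]}


def total_amount_rows(rows):
    """Analytics on a row store: every full record is visited to read one field."""
    return sum(row[2] for row in rows)


def total_amount_columns(columns):
    """Analytics on a column store: only the 'amount' column is touched."""
    return sum(columns["amount"])


def get_record_columns(columns, i):
    """Whole-record lookup on a column store: one read per column is needed."""
    return {name: values[i] for name, values in columns.items()}


# Both layouts give the same total, 50.0, but touch different amounts of data.
# row_store[1] returns a whole record in one step; the column store needs
# get_record_columns(column_store, 1) to stitch it back together.
```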
@jeevan999able 2 years ago
Hello, I need a data federation tool that has a Python client. I need to be able to connect to and query data from a wide variety of data storage platforms. As of now I store data in ADLS and SQL Server on Azure. What would you recommend?
@RizAngD 2 years ago
Can you help explain what you mean by a data federation tool?
@jeevan999able 2 years ago
@RizAngD So, a platform which can connect to many data storage places (S3, ADLS, MySQL, MSSQL, etc.) so that regardless of where the data is stored, I have a central platform through which I can access all of it
@michaelshoemaker5635 2 years ago
What tools are used to query files (CSV, Parquet) directly? I've never heard of doing this.
@RizAngD 2 years ago
Assuming you're using the Azure cloud, you can use PolyBase to query CSV and Parquet files directly (i.e. by creating an external table) from Azure Blob Storage (or Data Lake) within Azure SQL Database or Azure Synapse SQL :)
@michaelshoemaker5635 2 years ago
@RizAngD Ah, thank you! Never used Azure before. Understood.
@sigkalbar 1 year ago
DuckDB
@michaelshoemaker5635 1 year ago
@sigkalbar Thank You!
@CaribouDataScience 1 year ago
It's not butter, it's Parquet..
@kamkhan7509 2 years ago
Not a very helpful video without a practical example.
@RizAngD 2 years ago
Sorry to hear that. Tx
Рет қаралды 6 МЛН