Master Reading Spark Query Plans

Views: 26,282

Afaque Ahmad

A day ago

Spark Performance Tuning
Dive deep into Apache Spark Query Plans to better understand how Apache Spark operates under the hood. We'll cover how Spark creates logical and physical plans, as well as the role of the Catalyst Optimizer in utilizing optimization techniques such as filter (predicate) pushdown and projection pushdown.
The video covers intermediate Apache Spark concepts in depth, with detailed explanations of how to read the Spark UI and how to understand Spark's query plans through code snippets of various narrow and wide transformations: reading files, select, filter, join, group by, repartition, coalesce, hash partitioning, HashAggregate, round-robin partitioning, range partitioning, and sort-merge join. Understanding them will give you a grasp of Spark's step-by-step thought process and help you identify performance issues and possible optimizations.
📄 Complete Code on GitHub: github.com/afaqueahmad7117/sp...
🎥 Full Spark Performance Tuning Playlist: • Apache Spark Performan...
🔗 LinkedIn: / afaque-ahmad-5a5847129
Chapters:
00:00 Introduction
01:30 How does Spark generate logical and physical plans?
04:46 Narrow transformations (filter, select, add or update columns) query plan explanation
09:02 Repartition query plan explanation
12:57 Coalesce query plan explanation
17:32 Joins query plan explanation
23:23 Group by count query plan explanation
27:04 Group by sum query plan explanation
28:05 Group by count distinct query plan explanation
33:59 Interesting observations on Spark’s query plans
36:56 When will predicate pushdown not work?
39:07 Thank you
#ApacheSpark #SparkPerformanceTuning #DataEngineering #SparkDAG #SparkOptimization

Comments: 119
@afaqueahmad7117 11 months ago
🔔🔔 Please remember to subscribe to the channel folks. It really motivates me to make more such videos :)
@snehitvaddi 2 days ago
Buddy! You got a new sub here. Loved your detailed explanation. I see no one explaining the query plan in this much detail, and I believe this is the right way of learning. But I would love to see an entire Spark series.
@1994salahuddin 11 months ago
Proud of you brother, looking forward to more of such videos. Great job!
@SidharthanPV 11 months ago
This is one of the best videos about Spark I have seen recently!
@user-ue4ul1ru2n 8 months ago
Thanks for such an in-depth overview!! helps a lot to grow!!
@shubhamwaingade4144 6 months ago
One of the best videos I have seen on Spark, waiting for your Spark Architecture Video
@OmairaParveen-uy7qt 11 months ago
Explained the concept really well!
@YoSoyWerlix 5 months ago
Afaque, THANK YOU SO MUCH FOR THESE VIDEOS!! They are so amazing for a fast paced learning experience. Hope you soon upload much more!!
@roksig3823 8 months ago
Thanks a bunch. To my knowledge, no one has explained Spark's explain function at this level of detail. Very in-depth information.
@psicktrick7667 8 months ago
rare content! please don't stop making these
@saravananvel2365 11 months ago
Very useful, and it explains complex things in an easy manner. Thanks, and I expect more videos from you.
@saptorshidana7903 11 months ago
Amazing content. I am a newbie to Spark but I am hooked. Sir, please post the rest of the series; I am waiting for your videos. Amazing teacher.
@adityasingh8553 11 months ago
This takes me back to my YaarPadhade times. Great work, bhai, much love!
@abhishekmohanty9971 10 months ago
Beautifully explained. Many concepts got cleared up. Thanks a lot. Keep going.
@anirbansom6682 9 months ago
40 minutes well spent today. Thanks for sharing the knowledge.
@GuruBala 8 months ago
It's great to see such useful content on Spark, and your notes make it easier to understand clearly! You rock. Thankless thanks!!
@sandeepchoudhary3355 5 months ago
Great content with practical knowledge. Hats off to you !!!
@ridewithsuraj-zz9cc 12 days ago
This is the most detailed explanation I have ever seen.
@jnana1985 11 months ago
Great explanation!! Keep uploading such quality content, bro.
@yashwantdhole7645 1 month ago
You are a gem bro. The content that you bring here is terrific. ❤❤❤
@afaqueahmad7117 25 days ago
Thanks man, @yashwantdhole7645. This means a lot!
@dawidgrzeskow987 4 months ago
After looking for some time for the best material that truly explains this topic and digs deep enough, you clearly delivered. Thanks, Afaque.
@afaqueahmad7117 4 months ago
Glad it was helpful, appreciate it :)
@sudeepbehera5921 5 months ago
Thank you so much for making this video. This is really very helpful.
@maheshbongani 10 months ago
It's a great video with a great explanation. Awesome. Thank you for such a detailed explanation. Please keep doing such content.
@garydiaz8886 9 months ago
This is pure gold. Congrats, bro, keep up the good work.
@afaqueahmad7117 9 months ago
Thank you @garydiaz8886, really appreciate it! :)
@piyushjain5852 9 months ago
Very useful video, man. Thanks for explaining things in so much detail. Keep up the good work.
@sanjayplays5010 7 months ago
This is really good, thanks so much for this explanation!
@myl1566 2 months ago
One of the best videos I have come across on Spark query plan explanation. Thank you! :)
@afaqueahmad7117 2 months ago
Appreciate it @myl1566, thank you!
@PavanKalyan-vw2cp 4 months ago
Bro, you dropped this👑
@AmitBhadra 11 months ago
Great content brother. Please post more 😁
@satyajitmohanty5039 13 days ago
Explanation is so good
@neelbanerjee7875 1 month ago
Absolute gem ❤❤ I would like to have a video on handling real-world scenarios (handling a slow-running job, OOM, etc.).
@crazypri8 3 months ago
Amazing content! Thank you for sharing!
@afaqueahmad7117 3 months ago
Thank you @crazypri8, appreciate it :)
@ujvadeeppatil8135 10 months ago
By far the best content I have seen on explaining query plans!!! Keep it up, brother. Good luck!
@afaqueahmad7117 10 months ago
Glad you liked it, thank you! :)
@remedyiq8034 6 months ago
"God bless you! Great video! Learned a lot"
@dishant_22 9 months ago
Great explanation.
@sarfarazmemon2429 4 months ago
Underrated pro max!
@RahulGhosh-yl7hl 6 months ago
This was awesome!
@iamexplorer6052 8 months ago
No one teaches complex things in as much detail as you do. No matter what, please keep spreading your knowledge to the world. I am sure there are people like me, settled in their jobs, who learn from you and will remember you as a master lifelong.
@user-meowmeow1 3 months ago
this is gold. Thank you very much!
@afaqueahmad7117 3 months ago
@user-meowmeow1 Glad you found it helpful :)
@prasadrajupericharla5545 2 months ago
Excellent job 🙌
@afaqueahmad7117 2 months ago
Thanks @prasadrajupericharla5545, appreciate it :)
@shaheelsahoo8535 2 months ago
Great Content. Nice and Detailed!!
@afaqueahmad7117 2 months ago
Thank you @shaheelsahoo8535, appreciate it :)
@Wonderscope1 7 months ago
Great video, thanks for sharing. I definitely subscribed.
@suman3316 11 months ago
Very Good explanation...Keep Going
@afaqueahmad7117 11 months ago
Thank you!
@venkatyelava8043 10 days ago
One of the cleanest explanations I have ever come across on the internals of Spark. I really appreciate all the effort you are putting into making these videos. If you don't mind, may I know which text editor you are using when pasting the physical plan?
@vikasverma2580 11 months ago
Brother, my brother 😍 Now thousands of students will come to you, but don't forget your very first student 😜 Very proud of you, brother. And I can guarantee everyone here that he is the best teacher there is ❤️
@jjayeshpawar 1 month ago
Great Video!
@afaqueahmad7117 1 month ago
Appreciate it @jjayeshpawar, thank you!
@tahiliani22 7 months ago
This is really informative, such details are not even present in the O'Reilly Learning Spark Book. Please continue to make such content. Needless to say but I have already subscribed.
@nikhilc8611 8 months ago
You are awesome man❤
@ManishKumar-qw3ft 5 months ago
Brother, you make really great content. Love your videos. Please keep it up. You have great teaching skills.
@afaqueahmad7117 5 months ago
Thank you very much, brother!
@thecodingmind9319 6 months ago
Bro, I am a beginner but I was able to understand everything. Really great content, and your explanations were also amazing. Please continue making such great videos. Thanks a lot for sharing.
@afaqueahmad7117 6 months ago
@thecodingmind9319 Thanks for the kind words, means a lot :)
@VenuuMaadhav 1 month ago
After watching just the first 15 minutes of your video, I am awed beyond words. What a great explanation, @afaqueahmad. Kudos to you! Please make more videos on solving real-world scenarios using PySpark, and on cluster configuration. Again, BIG THANKS!
@afaqueahmad7117 1 month ago
Hey @VenuuMaadhav, thank you for the kind words, means a lot. More coming soon :)
@CoolGuy 9 months ago
I am sure that down the line, in a few years, you will cross 100k subscribers. Great content BTW.
@afaqueahmad7117 9 months ago
Hey @CoolGuy , thanks man! Means a lot to me :)
@varunparuchuri9544 2 months ago
Please do more videos, bro. Love this one.
@afaqueahmad7117 2 months ago
Thank you @varunparuchuri9544, really appreciate it :)
@MuhammadAhmad-do1sk 3 months ago
Excellent content, please make more videos like this with a deep understanding of "how stuff works". Highly appreciate it. Love from 🇵🇰
@afaqueahmad7117 3 months ago
Thank you @MuhammadAhmad-do1sk for the appreciation, love from India :)
@niladridey9666 11 months ago
quality content
@afaqueahmad7117 11 months ago
Thank you!
@crystalllake3158 11 months ago
Thank you for taking the time to create such an in-depth video on Spark plans. This is very helpful! Would you also be able to explain Spark memory tuning? How do we decide how many resources to allocate (driver memory, executor memory, number of executors, etc.) for a spark-submit? Also data structure tuning and garbage collection tuning! Thanks again!
@afaqueahmad7117 11 months ago
Thanks for the kind words @crystalllake3158 and the suggestion; currently the focus of the series is to cover all possible code level optimization. Resource level optimisations will come in much later, but no plans for the upcoming few months :)
@crystalllake3158 11 months ago
Thanks ! Please do keep uploading, love your videos !
@Shrawani18 11 months ago
You were too good!
@afaqueahmad7117 11 months ago
Thank you!
@chidellasrinivas 8 months ago
I loved your explanation and understood it very well. Could you help me understand something at 23 minutes: if we have cid as the join key and group by region, how does the hash partitioning work? Will it consider both?
@kvin007 8 months ago
Great explanation! I love the simplicity of it! I wonder what is the app you use for having your Mac as a screenshot that you can edit with your iPad?
@afaqueahmad7117 8 months ago
Thanks @kvin007! So, basically I join a zoom meeting with my own self and annotate, haha!
@venkateshkannan7398 2 months ago
Great explanation man! Thank you! What's the editor that you use in the video to read query plans?
@afaqueahmad7117 2 months ago
Thanks @venkateshkannan7398, appreciate it. Using Notion :)
@sahilmahale7657 3 months ago
Bro please make more videos !!!
@udaymmmmmmmmmm 7 months ago
Can you please prepare a video showing the storage anatomy of data during the job execution cycle? I am sure there are many aspiring Spark students who may be confused about the idea of an RDD or DataFrame and how it accesses data through APIs (since Spark does in-memory computation) during job execution. It will help many upcoming Spark developers.
@afaqueahmad7117 4 months ago
Hey @udaymmmmmmmmmm, I added this video recently on Spark Memory Management. It talks about storage and the responsibilities of each of the memory components during job execution. You may want to have a look at it :) Link here: kzfaq.info/get/bejne/qb58ZNSY17bdo5s.html
@mohitupadhayay1439 1 month ago
Just 10 minutes into this notebook and I am awed beyond my words. What a great explanation Afaque. Kudos to you! Please make more videos of solving real time scenarios using Spark UI and one on Cluster configuration too. Again BIG THANKS!
@afaqueahmad7117 1 month ago
Hi @mohitupadhayay1439, really appreciate the kind words, it means a lot. A lot coming soon :)
@mohitupadhayay1439 1 month ago
Hi Afaque. Do we have any library, or can we create a UDF, for understanding why some records got corrupted while reading a file? I have a nested XML file with a large number of columns, and I want to understand why some columns end up in the corrupt record. I couldn't find anything helpful online. A video on this would be greatly appreciated.
@nijanthanvijayakumar 11 months ago
Hello @afaqueahmad7117, thanks for the great video. While explaining repartition, you mentioned you’ve a video on the AQE. Please can you link that as well?
@afaqueahmad7117 11 months ago
Thanks @nijanthanvijayakumar, yes that video is upcoming in the next few days :)
@nijanthanvijayakumar 11 months ago
Can't wait for that, @afaqueahmad7117. These KZfaq videos are so much more helpful. Hands down one of the best ones that explain Spark performance tuning and internals in the simplest form possible. Cheers!
@Pratik0917 7 months ago
Fab content
@tahiliani22 4 months ago
At the very end of the video (38:36), we see that the cast("int") filter is present in the parsed logical plan and the analyzed logical plan. I am a little confused as to when we refer to those plans. Can you please explain?
@mission_possible 10 months ago
Thanks for the content. When can we expect a new video?
@afaqueahmad7117 10 months ago
Coming soon, in the next few days! :)
@rajubyakod8462 5 months ago
If it does local aggregation before shuffling the data, then why does it throw an out-of-memory error when taking the count of each key on a column with a huge number of distinct values?
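A toy model in plain Python (not Spark code, and not an explanation of any specific out-of-memory error) may help frame the question above. It sketches the general idea of partial aggregation: each partition keeps one entry per distinct key it sees, so the local step shrinks the data only when keys repeat. The function names `partial_count` and `final_count` and the sample data are hypothetical.

```python
from collections import Counter

def partial_count(partition):
    """Local (partial) aggregation: one (key, count) entry per distinct key in this partition."""
    return Counter(partition)

def final_count(partials):
    """Final aggregation: merge the partial counts that arrive for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Low cardinality: keys repeat, so each partial result is much smaller than its input.
low_card = [["a", "b", "a", "a"], ["b", "b", "a", "b"]]
# High cardinality: every key is unique, so each partial result is as large as its input.
high_card = [[1, 2, 3, 4], [5, 6, 7, 8]]

low_partials = [partial_count(p) for p in low_card]
high_partials = [partial_count(p) for p in high_card]

print([len(p) for p in low_partials])   # entries held per partition, few distinct keys
print([len(p) for p in high_partials])  # entries held per partition, all keys distinct
print(final_count(low_partials))
```

In this sketch the high-cardinality partitions gain nothing from the local step: the aggregation state grows with the number of distinct keys, not with the number of rows.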
@TechnoSparkBigData 11 months ago
You mentioned that for coalesce(2) shuffle will happen, but later you mentioned that shuffle will not happen in case of coalesce hence no partitioning scheme. Could you please explain it in detail?
@afaqueahmad7117 11 months ago
So, coalesce will only incur a shuffle if it's a very aggressive situation. If the objective can be achieved by merging (reducing) the partitions on the same executor, it will go ahead with that. In the case of coalesce(2), it's an aggressive reduction in the number of partitions, meaning that Spark has no other option but to move the partitions. As there were 3 executors (in the example I referenced in the video), even if it reduced the partitions on each executor to a single partition, it would end up with 3 partitions in total; therefore it incurs a shuffle to have 2 final partitions :)
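The arithmetic in the reply above can be sketched as a toy model in plain Python. This is only an illustration of the reasoning (merging within an executor versus moving data between executors), not Spark's actual coalesce implementation; the function names and the layout of 4 partitions per executor are assumptions made for the example.

```python
def min_partitions_without_moving(partitions_per_executor):
    """Smallest partition count reachable by merging only partitions on the same executor."""
    return sum(1 for count in partitions_per_executor if count > 0)

def coalesce_must_move_data(partitions_per_executor, target):
    """True if reaching `target` partitions requires combining data from different executors."""
    return target < min_partitions_without_moving(partitions_per_executor)

# Hypothetical layout: 3 executors holding 4 partitions each (12 in total).
layout = [4, 4, 4]

print(coalesce_must_move_data(layout, 6))  # False: merge pairs locally on each executor
print(coalesce_must_move_data(layout, 3))  # False: one merged partition per executor
print(coalesce_must_move_data(layout, 2))  # True: fewer partitions than executors
```

With 3 executors, local merging bottoms out at 3 partitions, which is why a target of 2 forces data to move in this model.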
@TechnoSparkBigData 11 months ago
@afaqueahmad7117 Thanks for the clarification.
@TechnoSparkBigData 11 months ago
Hi Sir, you mentioned that you referred AQE before. Can I get that link ? I want to know about AQE
@afaqueahmad7117 11 months ago
Yes, I will be releasing the video in the next few days. :)
@TechnoSparkBigData 11 months ago
@afaqueahmad7117 Thank you sir.
@TechnoSparkBigData 11 months ago
In Exchange hashpartitioning, what is the significance of the number 200? What does it mean?
@afaqueahmad7117 11 months ago
200 is the default number of shuffle partitions. You can find the number in this table under the property name "spark.sql.shuffle.partitions": spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
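To make the role of the 200 concrete, here is a toy sketch in plain Python of what hash partitioning into a fixed number of partitions does: every row is routed to the partition given by the hash of its key modulo the partition count, so equal keys always land together. Spark uses its own hash function internally; Python's built-in `hash`, the name `target_partition`, and the integer customer ids are assumptions for illustration only.

```python
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 200  # mirrors the default of spark.sql.shuffle.partitions

def target_partition(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Route a key to a partition id in [0, num_partitions)."""
    # Python's % with a positive modulus never returns a negative number.
    return hash(key) % num_partitions

# Hypothetical integer join keys.
customer_ids = range(1, 1001)
rows_per_partition = Counter(target_partition(cid) for cid in customer_ids)

print(len(rows_per_partition))                       # how many partitions received rows
print(target_partition(42) == target_partition(42))  # the same key always maps to the same partition
```

Changing `NUM_SHUFFLE_PARTITIONS` in this sketch changes how many buckets the keys are spread over, which is the effect the configuration property has on the number after `hashpartitioning` in the plan.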
@user-dv1ry5cs7e 3 months ago
I am doing coalesce(1) and getting this error: Unable to acquire 65536 bytes of memory, got 0. But when I do repartition(1), it works. Can you please explain what happens internally in this case?
@sangu2227 4 months ago
I have a doubt: when is the data distributed to the executors, before or after the tasks are scheduled? And who assigns the data to the executors?
@afaqueahmad7117 4 months ago
Hey @sangu2227, this requires an understanding of transformations/actions and lazy evaluation in Spark. Spark doesn't do anything (either scheduling a task or distributing data) until an action is called. The moment an action is invoked, Spark creates a logical -> physical plan and Spark's scheduler divides the work into tasks. Spark's driver and cluster manager then distribute the data to the executors for processing :)
@TJ-hs1qm 1 month ago
What drawing board are you using for those notes?
@afaqueahmad7117 25 days ago
Using "Notion" for text, "Nebo" on iPad for the diagrams
@TJ-hs1qm 25 days ago
@afaqueahmad7117 cool, thx!
@ZafarDorna 6 months ago
Hi Afaque, how can I download the data files you are using? I want to try it hands on :)
@afaqueahmad7117 6 months ago
Should be available here: github.com/afaqueahmad7117/spark-experiments :)
@bhargaviakkineni 2 months ago
Hi sir, I have a doubt. Consider an executor size of 1 GB per executor. We have 3 executors, and initially 3 GB of data gets distributed across them, so each executor holds a 1 GB partition. After various transformations, we have a requirement to reduce the number of partitions to 1, for which we use repartition(1) or coalesce(1). In this scenario all 3 partitions merge into 1 partition; each is approximately 1 GB, so together they are approximately 3 GB. With repartition(1) or coalesce(1), all 3 GB of data has to sit in 1 executor that has a capacity of only 1 GB. The data exceeds the executor size, so what happens in this scenario? Could you please make a video on this, sir?
@afaqueahmad7117 2 months ago
Hi @bhargaviakkineni, in the scenario you described above, where the resulting partition size (3 GB) exceeds the memory available on a single executor (1 GB), Spark will attempt to spill data to disk. The spill to disk helps keep the application from crashing due to out-of-memory errors; however, there is a performance impact, because disk I/O is slower. On a side note, as a best practice, it's best to re-evaluate the need to write to a single partition. Avoid writing to a single partition, because it generally creates a bottleneck if the sizes are large. Try to balance out the partitions with the resources of the cluster (executors/cores). Hope that clarifies :)
@NiranjanAnandam 1 month ago
The local distinct on cust id doesn't make sense to me and I couldn't understand it. How does it do the distinct count globally if the count is already computed? The reasoning behind why the cast doesn't push down the predicate is not clearly explained; it is just stated as it's mentioned in the docs.
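One way to think about the first question above is a toy model in plain Python of a distributed count distinct. It is a sketch of the general technique, assumed here for illustration rather than taken from the video or from Spark's source: the local distinct only removes duplicates inside each partition, and no count is final until identical (region, cust_id) pairs from different partitions have been brought together and deduplicated again. The sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (region, cust_id) rows spread over two partitions.
partitions = [
    [("EU", 1), ("EU", 1), ("US", 2)],
    [("EU", 1), ("EU", 3), ("US", 2)],
]

# Step 1: local distinct. Duplicates inside a partition disappear, nothing is counted yet.
local = [set(p) for p in partitions]

# Naive (wrong) idea: count per partition and add up. Pairs seen in both partitions are counted twice.
naive = defaultdict(int)
for part in local:
    for region, _ in part:
        naive[region] += 1

# Step 2: "shuffle" on the whole (region, cust_id) pair so identical pairs meet, then distinct again.
num_partitions = 2
shuffled = [set() for _ in range(num_partitions)]
for part in local:
    for pair in part:
        shuffled[hash(pair) % num_partitions].add(pair)

# Step 3: only now count customers per region over the globally deduplicated pairs.
distinct_counts = defaultdict(int)
for part in shuffled:
    for region, _ in part:
        distinct_counts[region] += 1

print(dict(naive))
print(dict(distinct_counts))
```

In this model the local distinct is purely a data-reduction step before the shuffle; the counting happens after the global deduplication, which is why the two results differ.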
@Precocious_Pervez 11 months ago
Great Work buddy keep it up .... love your content, very simple to understand @Afaque Ahmed
@afaqueahmad7117 11 months ago
Thanks a ton!