I built data pipelines at Netflix that ran 2000 TBs per day, here’s what I learned about huge data!

357,812 views

Data with Zach

3 months ago

Check out my boot camp / course at DataExpert.io where you can learn all this in much more detail!
Use code PROMOTION15 at checkout by April 7th to get 15% off!
#dataengineering
#netflix

Comments: 358
@sevrantw8931 3 months ago
I’m so glad I found this video, I was just sitting here with 60 million gigabytes and was figuring out what joins to use so this was perfect timing.
@aripapas1098 3 months ago
if all u registered was 60 mil gb & joins ur not flowing
@smackastan5697 3 months ago
You're kidding, but somehow I just started a data analysis project of two terabytes and this video shows up.
@hi-mn5rg 3 months ago
@aripapas1098 if you think comments must indicate a user registered every aspect of a video, ur not following
@derickd6150 3 months ago
@aripapas1098 this is a sad comment
@00Tenrai00 3 months ago
Sarcasm ???? 😂
@bilbobeutlin3405 3 months ago
Can't wait to build hyperscale pipelines for my startup with 0 users
@92kosta 3 months ago
But it sounds powerful when you say it, like you mean business.
@npc-drew 3 months ago
Based
@vikingthedude 3 months ago
1 user (me)
@JGComments 3 months ago
If you build it, they will come.
@abhilashpatel6852 3 months ago
I have 1k TB data just sitting around in my backyard. Glad your video came up to get me started on at least something.
@subhasishsarkar5106 3 months ago
What I absolutely love about your videos is that as a beginner in the data engineering field, you often talk about things that I had no conception of. In this video for example, I have never heard of SMBs or broadcast joins. This gives me an opportunity to learn these things, even hearing them be mentioned from someone as widely experienced as you. You need not necessarily have to even go into detail, but these short form videos act as beacons of knowledge that I can throw myself into learning about. Thanks a lot, and keep these coming Zach!
@EcZachly_ 3 months ago
Really appreciate this comment! It reminds me that the value I'm putting out there is important!
@vasudevreddy3527 3 months ago
@EcZachly_ ✌
@eric.batdorff 3 months ago
Great summation! I was thinking the exact same thing while watching. It's nice hearing even the specialized lingo from technical experts in their fields, it piques my curiosity.
@MrAmitkr007 3 months ago
@EcZachly_ thanks
@prawtism 3 months ago
@EcZachly_ did you already know the importance of these two before Netflix or did you learn that while working at Netflix?
@supercompooper 3 months ago
In the future a wrist watch will have a little blinking light that will have 60 million gigabytes of data in it
@dhillaz 3 months ago
You mean an Electron app?
@aripapas1098 3 months ago
yeah okay crack smoker
@mrevilducky 3 months ago
And it will still lag and hit 99% singularities
@Ivan-Bagrintsev 3 months ago
@dhillaz that will just show current time
@supercompooper 2 months ago
@Ivan-Bagrintsev Yes it will show the time, but with full DRM. Unless you have a license to view certain minutes it will be denied.
@lucas.p.f 3 months ago
Boyfriend simulator: you sit with your bf and he starts talking about this nerdy stuff you have no idea about but need to keep listening because you love him
@EcZachly_ 2 months ago
This is exactly correctly
@CU.SpaceCowboy 2 months ago
aww 🥰
@heykike 1 month ago
After marriage they no longer pretend to listen to
@rajns8643 1 month ago
If only a girl would fall for me when I speak nerdy stuff 🫠
@lucas.p.f 1 month ago
@rajns8643 are you kidding me? This is what most people like the most! Intelligent people are extremely attractive
@supafiyalaito 3 months ago
Thanks Zach, hopefully one day I will understand what all of that means
@Bostonaholic 3 months ago
I love that you kept it short and to the point.
@tobiastho9639 3 months ago
He sure wanted to save some data… 😅
@RichardOles 3 months ago
Holy crap. I’m currently learning about data science, the various roles, etc. -with the hope of one day switching careers. But the current state of learning is all about the languages and software used etc, not about the infrastructure and what to do with massive datasets. So this just 🤯
@samuelisaacs7557 1 month ago
its really about math but no one talks about it. get at least 1 year university math comprehension and then get into the python and tech tools. the most competent and successful data engineers are always people with a good STEM background. for example Zach has a Bachelor's Degree in Applied Mathematics and a Bachelor's Degree in Computer Science so he is a heavy numbers guy. That's what most of Data Science / Engineering YouTubers don't tell their viewers cause that will cause them to lose viewers.
@byRoyalty 1 month ago
learning the tools can be very different from solving real world problems.
@rajns8643 1 month ago
@samuelisaacs7557 True asf
@stevess7777 1 month ago
@samuelisaacs7557 Yep, even a business administration bachelors will have a lot of maths and it's nowhere near data science which is 3x that.
@WM-eg4gh 3 months ago
Thank you Zach for taking the time to give us the hard truth and hands down your experience. It helps a lot of enthusiastic students/people to know how we can in some way support or help others in the subjects we like. I don't imagine myself processing 2000TBs per day, but it helps give a bigger picture. Once again, appreciate the short video and thank you for sharing
@mohammedaamer4201 3 months ago
Just started following you. Really appreciate you for sharing your knowledge with the community.
@rembautimes8808 3 months ago
Great content, an honour to be able to listen to someone who has handled that volume of data.
@stifflery 3 months ago
literally 🎉
@codecaine 2 months ago
Have ChatGPT or some other LLM explain it to you.
@Adhanks91 3 months ago
Informative and straight to the point, great stuff as usual
@JT-zb6vi 3 months ago
instant subscribe - really appreciate the concise explanation and clear examples
@LambOverSpicyRice 3 months ago
Excellent video, thanks Zach!
@rohanbhakat2922 3 months ago
Thanks for the info Zach. Could you please make a more detailed video on the SMB join?
@jacobp8294 3 months ago
I am a regional IT installer who runs Cat6 Ethernet pipelines for managing 1gb loads on HP laptops, this video is really awesome and breaks down your workflow and mindset in a complicated field really efficiently. I would love to get more short videos about the industry like this.
@EcZachly_ 2 months ago
I'll keep them coming. I make much more on Tiktok and Instagram since I like making vertical content!
@jacobp8294 2 months ago
@EcZachly_ I'll check it out! Keep it up!
@tanujkhochare3498 3 months ago
Hey Zach, your content is consistently amazing! As a newcomer to the field, I'm considering diving into data engineering. What roadmap would you recommend, and are there any certifications that could enhance my journey? I already have a solid grasp of Python and SQL in data analysis.
@sharpsrain8302 3 months ago
I just found ur stuff but thanks for the content mang keep it up 🙏
@SahilKashyap64 3 months ago
I've never heard of these terms, thank you for sharing your real case scenarios (the FB notification example)
@oakleyorbit 22 days ago
Half of what you said I had no idea what you were talking about but I was very engaged and now I’m gonna look all this stuff up for centering my div!
@souravghosh358 3 months ago
Very important concept in such short time.. thank u so very much ❤
@vinit.khandelwal 3 months ago
Thanks, looking forward to more such content
@ArjunRajaS 2 months ago
If you come across a scenario where you need to join 2 large datasets, you could do an iterative broadcast join. Basically, you break one of the DataFrames into multiple smaller DataFrames and join each of them against the other DataFrame in a loop until all the pieces are joined.
@jordanmessec5332 2 months ago
You’ll require a lot of memory and have long start times, no?
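The iterative broadcast join described in this thread can be sketched in plain Python. This is a toy illustration of the idea, not Spark code: the function names (`chunks`, `iterative_broadcast_join`), the row layout, and the `chunk_size` parameter are all assumptions made for the example. One side is split into pieces small enough to hold in memory, each piece is turned into a lookup table, and the other side is scanned once per piece.

```python
from collections import defaultdict
from typing import Iterator

Row = dict  # hypothetical row type: one dict per record


def chunks(rows: list[Row], size: int) -> Iterator[list[Row]]:
    """Yield consecutive slices of `rows`, each holding at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def iterative_broadcast_join(big: list[Row], medium: list[Row],
                             key: str, chunk_size: int) -> list[Row]:
    """Inner-join `big` with `medium` on `key`, one chunk of `medium` at a time."""
    joined = []
    for piece in chunks(medium, chunk_size):
        # Build the in-memory lookup for this chunk only (the "broadcast" side).
        lookup = defaultdict(list)
        for row in piece:
            lookup[row[key]].append(row)
        # Scan the big side once per chunk and emit the matches.
        for row in big:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


events = [{"user_id": 1, "event": "play"},
          {"user_id": 2, "event": "pause"},
          {"user_id": 3, "event": "stop"}]
users = [{"user_id": 1, "country": "US"},
         {"user_id": 2, "country": "BR"},
         {"user_id": 4, "country": "DE"}]
result = iterative_broadcast_join(events, users, "user_id", chunk_size=2)
```

In this sketch only one chunk's lookup table is held in memory at a time, and the price is that the larger side is rescanned once per chunk, which is the trade-off the reply above is asking about.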
@dazzassti 3 months ago
In the 37 years I’ve been working in data, I’ve never heard anyone call it Peter 😂. PETA
@anotherguy9402 2 months ago
What's wrong with a Peter bite?
@divinecomedian2 2 months ago
Heya Peeda
@Starmast3rmusic 2 months ago
Could be an accent or a slip 😂
@ChrisMPerry 3 months ago
Insightful as always.💯
@EcZachly_ 3 months ago
Appreciate that!
@RyanSaplanPT 3 months ago
Please more data stuff!!! I hardly understood what you said, but it sounds interesting
@nikolagrkovic8769 2 months ago
The amount of knowledge you shared here is astonishing
@arbol41 2 months ago
Thanks Zach, but I have a question: a broadcast join is used when a small dimension table is joined with a big table. Is this your case? Or did you use a hash join with two large tables?
@Jc12x06 3 months ago
Dude has beef with Bezos😂
@theAnupamAnandoriginal 3 months ago
you can make a BIOS optimized for throughput and without interrupts, to speed it up 67x and more
@maggiejetson7904 20 days ago
Honestly, 2000 TB per day isn't the problem. The problem is the cost and how much of the data is burst. If it is not burst it is pretty much always cheaper to do it in-house with your own hardware than to pay and rent the cloud to do it.
@Llanowyn 3 months ago
I would be interested in the architecture and content delivery for pre and post cdn from a network design perspective. Are there any examples or presentations regarding networking at netflix?
@solitary200 14 days ago
Great points to remember! There are a lot more underlying abstraction layers you can add at these different points to further optimize the second network hop. Caching is a simple one. Can you implement an efficient snapshot system with delta encoding of entities and compress the message? Would be a cool video for you to implement!
@dungenwalkerr619 1 month ago
Thanks for sharing, now I can finally put some good numbers on my resume 🎉
@ATX_Engineer 23 days ago
Ah yes, data structures and sorting… but with the “can you even scale bro” tick enabled.
@JGComments 3 months ago
2 pita bites a day, the same as me when I’m on a diet.😊
@theactualslimshady 3 months ago
Please keep up the great content!
@explosivecl 3 months ago
Thanks for the video
@internetcancer1672 3 months ago
My problem is how do people even find out about the careers that they go into?
@joshi1q2w3e 1 month ago
Did Facebook use Databricks or did they have HPC Clusters for you to run Spark on?
@remo 1 month ago
Damn I just wanted to shuffle like there’s no tomorrow and then I found this video.
@earthling_parth 3 months ago
Imma wait for Primeagen to confirm this as well when he reacts to this video inevitably 😁
@vikrampandit2174 3 months ago
Never thought broadcast join is a Netflix saviour
@john_paul 2 months ago
I love how you acronym Sorted Bucket Merge as SMB. Think you may have had Super Mario Bros on the mind 😂
@IAmAlpharius14 1 month ago
Sir this is a Wendy's.
@OurNewestMember 1 month ago
Interesting! I would have thought something like sharding (or partitioning and clustering) so data processing and access can scale horizontally.
@EcZachly_ 1 month ago
Bucketing and clustering are similar
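To make the bucketing idea in this exchange concrete, here is a minimal plain-Python sketch of a bucketed, sorted merge join (the SMB join other commenters mention). Everything in it is an assumption for illustration: integer keys, `key % num_buckets` standing in for a real hash function, and the helper names `bucketize`, `merge_join`, and `bucketed_join`. Both tables are split into the same number of buckets by key and sorted inside each bucket, so bucket i of one table only ever needs to meet bucket i of the other.

```python
Row = dict  # hypothetical row type: one dict per record


def bucketize(rows: list[Row], key: str, num_buckets: int) -> list[list[Row]]:
    """Assign rows to buckets by key, then sort each bucket by key."""
    buckets: list[list[Row]] = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[key] % num_buckets].append(row)  # modulo stands in for a hash
    for bucket in buckets:
        bucket.sort(key=lambda r: r[key])
    return buckets


def merge_join(left: list[Row], right: list[Row], key: str) -> list[Row]:
    """Inner-join two lists that are already sorted by `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out


def bucketed_join(left: list[Row], right: list[Row],
                  key: str, num_buckets: int) -> list[Row]:
    """Join matching buckets pairwise; no row is compared across buckets."""
    out = []
    for lb, rb in zip(bucketize(left, key, num_buckets),
                      bucketize(right, key, num_buckets)):
        out.extend(merge_join(lb, rb, key))
    return out


left_rows = [{"id": 3, "a": "x"}, {"id": 1, "a": "y"}, {"id": 3, "a": "z"}]
right_rows = [{"id": 3, "b": 1}, {"id": 2, "b": 2}, {"id": 1, "b": 3}]
joined = bucketed_join(left_rows, right_rows, "id", num_buckets=2)
```

In a real engine the bucketing and sorting would be done once when the tables are written, so the join itself reduces to the pairwise merge step; this sketch redoes that work on every call only to stay self-contained.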
@aamadmi5848 3 months ago
Thanks Zach for the video
@seegreen6484 2 months ago
I love that I’m only a software engineer but I can understand all of this
@rashshawn779 3 months ago
Very nice. Short and sweet.
@EcZachly_ 3 months ago
Glad you enjoyed it
@TheInterestingInformer 2 months ago
I’m trying to get into data analytics and most of this went over my head but this still sounds lit 🔥
@hearhaw 23 days ago
I'd like to learn more about these pitabytes. What are they? What do they taste like?
@TLOGhx 3 months ago
Insanely valuable content
@uwize5897 2 months ago
optimizing selling personal data to minimize cost is something i never thought about
@MFsyrup 2 months ago
Thank you Tony Hawk, very cool!
@liamvstech 3 months ago
When I was hired to do data engineering, it was always data that could fit on a single hard drive and it was boring af. I hated it. This sounds way more challenging and interesting.
@ChuckNorris-lf6vo 3 months ago
Hi, what about replacing torrents with IPFS? That's data pipelining, right ?
@TheDa6781 28 days ago
Managing retention, storage and flow is always important. Im sitting on a toilet as im writing this.
@narbwow8168 3 months ago
Pretty interesting, even though I had no idea about most of what he was talking about.
@user-op5vc9qw6o 3 months ago
That's cool bro. Will it fix the Netflix app where it shows the title of one show but the preview and description of another?
@EcZachly_ 3 months ago
It was to look at network traffic to keep your credit card data secure
@SamCyanide 3 months ago
My medical science clients called, they need an 800tb imaging data set parsed by end of day (thank you kubernetes)
@dark_lord98 3 months ago
Are those joins available in MySQL, or are they specific to the DBMS you worked with at Meta?
@juanbrekesgregoris4405 3 months ago
I think they're not available on MySQL because it's an OLTP database. Those joins are used for analytics
@jordanmessec5332 2 months ago
These are not database joins, they are processing joins. Frameworks such as Flink and Spark would leverage broadcasts. It basically boils down to a single coordinator instance that publishes a small, often changing dataset to all parallel processors. Usually used to enrich, prune, or map the main dataset.
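The broadcast pattern described in the comment above can be shown with a toy, single-process Python sketch. It is not Flink or Spark code: the partitions are plain lists, the "workers" are ordinary function calls, and the names `broadcast_join`, `join_partition`, `partitions`, and `titles` are hypothetical. The small dataset is turned into one lookup table that every partition reuses, so the large dataset never has to be reshuffled by key.

```python
Row = dict  # hypothetical row type: one dict per record


def broadcast_join(partitions: list[list[Row]], small_table: list[Row],
                   key: str) -> list[list[Row]]:
    """Enrich every partition of the large dataset from one shared lookup."""
    # Built once; in a real cluster this is the copy shipped to each worker.
    lookup = {row[key]: row for row in small_table}

    def join_partition(partition: list[Row]) -> list[Row]:
        # Each partition is processed independently, using only local data
        # plus the broadcast lookup. Rows without a match are pruned.
        return [{**row, **lookup[row[key]]}
                for row in partition if row[key] in lookup]

    return [join_partition(p) for p in partitions]


# The large side, already split across two hypothetical workers.
partitions = [
    [{"title_id": 10, "watched_min": 42}, {"title_id": 11, "watched_min": 7}],
    [{"title_id": 10, "watched_min": 90}, {"title_id": 99, "watched_min": 3}],
]
# The small side that fits comfortably in memory.
titles = [{"title_id": 10, "genre": "drama"}, {"title_id": 11, "genre": "comedy"}]

enriched = broadcast_join(partitions, titles, "title_id")
```

The sketch assumes the small table has unique keys and does an inner join, so the row with `title_id` 99 is dropped; that corresponds to the "enrich" and "prune" uses mentioned in the comment.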
@bacfjib9874 3 months ago
Very informative, I wanna ask you, which certification can help me as a fresh graduate, is AWS data engineer Certification worth it or not? And thanks a lot Zach
@EcZachly_ 3 months ago
It’s pretty great!
@_sonicfive 2 months ago
Whenever I hold on to more than 60 petabytes I just call the assistant to the regional manager and he runs a fix from his mainframe.
@iloos7457 3 months ago
Hey are you familiar with Cosmos DB from Azure? It's a DB like Mongo but claims to be able to scale infinitely... What are your thoughts on that?
@orppranator5230 3 months ago
Bro can figure out how to send my entire homework folder in 1/500th of a second but can’t flip the camera sideways
@sneakybutpirate 2 months ago
Oh yeah that’s really great and insightful, now what’s a join?
@schwarzie2478 3 months ago
I just felt like drinking from the fountain of knowledge and instantly drowning. Definitely haven't had to deal with these kind of volumes yet...
@GnomeEU 25 days ago
Now I just need a billion dollar company to have these kinda problems. My question would be, why do you have a table that big? Can't you distribute or cluster your data? I'm thinking like 10000 users per server. Only stuff around those 10k users gets stored. No magic needed to query stuff.
@EcZachly_ 25 days ago
Gotta analyze it all together though
@theAnupamAnandoriginal 3 months ago
: multiple streams across entire ddrs directly accessible
@GameCyborgCh 1 month ago
gotta love a good pita byte
@xasm83 2 months ago
my data pipeline usually processes one pitabyte every other day and one shawarmabyte every week
@emerald42481 1 month ago
Very useful and interesting, even to a layman
@GeneralKenobi69420 3 months ago
The Venn diagram of people who use TikTok and data scientists is two circles my dude lol
@EcZachly_ 3 months ago
I have 66k followers on TikTok and this video did 375k views there.
@TheGoodContent37 1 month ago
Love the way you tried to make it sound more complicated than it actually is and failed.
@LucTaylor 2 months ago
I might get 5 users on my site this month so this will come in handy
@3dilson 3 months ago
"FNA developer" I'm sorry, my brain couldn't let go of it
@phitsf5475 2 months ago
The internet is not something you just dump something on, it's not a big truck. It's a series of tubes.
@picdu2891 1 month ago
I love technology and I know more than your average user, yet I have no IT qualifications and I am light years away from this knowledge, but for some reason, I love watching these videos as if I was ever going to use the information 😂
@49erman2 3 months ago
Quality content!
@bandanaboii3136 2 months ago
Interviewer: name 5 data types Me:
@cry2love 2 months ago
I still bite my gigas when my man hustling meta in peta
@nat.serrano 1 month ago
This guy earned his half a million salary. I tried to do this myself and failed
@ungeschaut 3 months ago
I use just a database with just value as field (long string) and nothing else
@Hishamhh93 27 days ago
Bro is the PewDiePie of data Engineering
@Kusagrass 3 months ago
People don’t know the data they collect is very volatile, unless you are paying for it.
@chrism3790 1 month ago
What engine were you using to do these massive joins? Spark?
@EcZachly_ 1 month ago
Yep!
@DxWangZ 3 months ago
I don't quite understand why Netflix needs data pipelines.
@tschaderdstrom2145 2 months ago
I love pita bites as much as the next guy, but I don't think I can take more than 35 before I'm full
@AkhilSharmaTech 3 months ago
Yes but why does he look like a French model
@manh9105 3 months ago
ok, so how to do that ...can you make a screencast and show us how to do it!
@mikishwagg 3 months ago
Me watching this not knowing anything hes talking about makes me feel like starting a big tech company 😀
@Manhunternew 3 months ago
How do you deal with log data
@YishuaiLiu 3 months ago
Short and informative
@EcZachly_ 3 months ago
Thank you! What other videos would you like to see from me?
@PySnek 3 months ago
That's around 160 Gbit/s. Enough for 30K 1080p streams or 10K in 4K.
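The conversion behind an estimate like this is easy to check. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a perfectly flat rate over 24 hours; the function name `tb_per_day_to_gbit_per_s` is made up for the example. Under those assumptions 2000 TB per day works out to roughly 185 Gbit/s, somewhat above the figure quoted in the comment.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tb_per_day_to_gbit_per_s(tb_per_day: float) -> float:
    """Convert a daily volume in decimal terabytes to an average rate in Gbit/s."""
    bits_per_day = tb_per_day * 1e12 * 8  # bytes -> bits
    return bits_per_day / SECONDS_PER_DAY / 1e9


average_rate = tb_per_day_to_gbit_per_s(2000)
print(f"{average_rate:.1f} Gbit/s")
```

This is an average only; a pipeline that processes the same volume in daily batches would see much higher peaks.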
@Kvltklassik 1 month ago
I built data pipelines at Netflix that ran 2000000000 MBs per day
@dexnow 26 days ago
I suddenly feel like pita bread...
@aarjunpp 2 months ago
1. Are you a data engineer? 2. What tech is this? AWS, Snowflake?
@sergeikulikov4412 3 months ago
You shouldn't write "s" in Terabyte per hour, just TB/hr "TBs/hr" looks like "Terabyte*second / hour" 😅
@user-to4md9xm2d 3 months ago
Hey absolutely curious about the content you are doing. In my company we are working with dbt and Snowflake. I can't find a possibility to work with broadcast joins there. Do you see a possibility to replicate this process?
@EcZachly_ 3 months ago
Snowflake isn’t suitable for volumes >100 TB in my opinion. Clustering is an option in Snowflake that helps though
@tlalepm 3 months ago
My tech lead keeps talking about bucketing as our integration solution tends to get overloaded sometimes. This kinda puts things into perspective. Definitely don't need most of what he’s talking about but just to know the terms and how to implement them