AWS Glue ETL Vs EMR - Which one should I use?

36,078 views

Johnny Chivers

ℹ️ aws.amazon.com/emr/
❔www.thequestionbank.io
ℹ️ johnnychivers.co.uk
☕ www.buymeacoffee.com/johnnych...
00:00 - Intro
00:36 - What is EMR?
01:26 - What is AWS Glue?
02:11 - When do I use EMR? When do I use Glue?
In this video we take a look at the use cases for AWS Glue and AWS EMR. Which one to use, and when, is a common question, so I attempt to answer it in this highly requested video.
😎 About me
I have spent the last decade immersed in the world of big data, working as a consultant for some of the globe's biggest companies. My journey into the world of data was not the most conventional. I started my career working as a performance analyst in professional sport at the top levels of both rugby and football. I then transitioned into a career in data and computing. This journey culminated in the study of a Master's degree in Software Development, alongside many a professional certification in AWS and MS SQL Server.

Comments: 49
@jasper5016 8 months ago
Wow this is a fantastic video on EMR and Glue. Thanks.
@nickg9650 1 year ago
What a fantastic explanation - all killer, no filler. Thanks!
@johndanson4427 3 months ago
All his videos work. Is this channel in a parallel universe?
@kingsabru 1 year ago
Damn. You're good. I understood the use cases for both in one swing. Thanks 🙏
@SiarheiKarko 1 year ago
Thanks a lot Johnny, awesome explanation as always!
@endpermia 9 months ago
Thanks for the clear explanation!
@ryanshuell 6 months ago
Excellent! Keep 'em coming!!
@leoxiaoyanqu 2 years ago
Thanks a lot! I got my answer so I think it's a great video!
@JohnnyChivers 2 years ago
Thanks for watching Leo!
@desloubser5678 2 years ago
Thanks for the video!! It really helps a lot
@JohnnyChivers 2 years ago
Thanks for watching
@AVISH747 7 months ago
Great stuff mate. Subscribed and Liked..!
@bobhaffner5902 2 years ago
Great job comparing the two options, Johnny
@JohnnyChivers 2 years ago
Thanks as always bob!
@jriosfer 2 years ago
Thanks for the explanation! good comparison
@JohnnyChivers 2 years ago
Thanks for watching Jorge
@AlexXavier 3 months ago
So clear! Thank you!
@GiasoneP 2 years ago
I don’t know how you only have ~3k subscribers. What a trove of knowledge. Thank you
@JohnnyChivers 2 years ago
Thanks for watching Jason.
@andregomesdasilva 1 year ago
Just a matter of time before he gets many more subscribers. The content is absolutely great
@channuangadi7504 10 months ago
Crystal clear 🔮 explanation
@Alex-cn9ot 1 year ago
I write almost the same code in AWS Glue as in EMR. I mean, I consume from external sources via Spark JDBC connectors and publish the results to other warehouses via JDBC. I only have crawlers to detect the intermediate files that are generated in the data lake at the staging or business layer, but I don't use the studio or editor. I feel AWS Glue is more integrated in terms of managing the workflows and the status (CPU/RAM, etc.) than an EMR-based service.
@whocares_today 1 month ago
amazing work
@shared_xp 2 months ago
I have not heard PIG in forever, really enjoyed that language.
@jiezhu9593 2 years ago
I think you can request AWS to increase your quota in Glue to have more than 100 DPU enabled per glue job.
@marian6040 1 year ago
Great explanation. How about choosing between Glue and EMR Serverless?
@georgeognyanov 1 year ago
Was just thinking that as well. I think his points will still be valid, since EMR Serverless will be more expensive to run than the self-managed EMR and we still have the case of not utilizing it fully.
@hotpeppermovie 2 years ago
Would love if you could do a simple industry grade project starting from beginning to end!
@JohnnyChivers 2 years ago
Definitely something I can look into. What would give the most benefit? Glue or EMR? Bearing in mind it would be a very long video as it would be at industry standard.
@hotpeppermovie 2 years ago
@JohnnyChivers Glue would be nice, since you mentioned it's easier for beginner data engineers to learn and use. But yeah, I agree it could be a whole course in itself. Perhaps you could split it up into smaller sections/videos if you do decide to do it
@groundingtiming 5 months ago
@JohnnyChivers Hey John, great stuff, has there been an update on this please?
@kaushalroonwal4279 6 months ago
Hi Johnny, since EMR Serverless is available now, do you think that the operational overhead is still one of the differentiators between the two? What do you recommend based on EMR Serverless?
@jarosawsmiejczak1138 2 years ago
BUY THIS MAN A COFFEE. Thanks Johnny!
@JohnnyChivers 2 years ago
thanks for watching.
@echezonaazubike8054 11 months ago
I love your Scottish accent
@JiyuKim-sr1mi 11 months ago
Which one is a better option when building a transactional data lake?
@hellorsanjeev11 1 year ago
ETL code in PySpark or Scala? Can I have it in Java instead?
@drewhunt3328 2 years ago
For data wrangling only, what are the differences between AWS Glue ETL and AWS SageMaker Data Wrangler? Great videos!
@JohnnyChivers 2 years ago
SageMaker Data Wrangler helps you build a workflow using pre-created libraries, mainly with the intention of using the data for ML. Glue ETL is where you write the code and the logic yourself from scratch. Of course you have any Spark and Python library at your disposal. And whilst Glue ETL can run ML algos, that doesn't have to be the aim of your wrangling, unlike SageMaker Data Wrangler.
@RohitPal-lz1wf 2 years ago
I have a requirement to copy data from one DynamoDB table to another DynamoDB table within the same account. Data in the source table is of the 2017 version while the target has the 2019 version. Can you please suggest which service will fit best with no downtime?
@abhijitcaps 1 year ago
You can use AWS DMS in co-ordination with AWS Schema Conversion Tool
@alanaugust6733 2 years ago
Does it make sense to do your proof of concept ETL code in Glue, then have EMR run that process at scale?
@JohnnyChivers 2 years ago
Hi Alan, it certainly does. The one factor to be conscious of is load. If you use a subset of data to develop your script in Glue, there could be performance issues later down the line in EMR once the full dataset is used.
@alan2a1l 2 years ago
@JohnnyChivers Thanks, Johnny, for the response! Got it! Performance is always an issue, both in volume and composition. The test case would have to include the full range of expected inputs, but volume... well, I guess you just have to run it at full volume and fix as necessary. Or parallelize with multiple Producers?... I'm sure you've dealt with it.
@omgleowtf 1 year ago
They now have EMR Serverless so I guess you don't need a cluster up and running 24/7 when you only need it every now and then
@vitaliryumshin6174 1 year ago
Yes, it would be interesting to get a comment from Johnny on how close those are to each other in cost
@himalayasaikia5762 1 year ago
Hey, I'm new to AWS... just wondering... even without serverless EMR, can't we use a transient EMR cluster to run the job and kill it once the job is completed? That way we will not have to keep the cluster up and running
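A minimal sketch of the transient-cluster idea raised in this comment, assuming boto3's EMR `run_job_flow` call is used to launch the cluster. The cluster name, release label, instance types and counts, and the script location are all placeholders rather than anything from the video. The part that matters is `KeepJobFlowAliveWhenNoSteps` set to `False`, which asks EMR to terminate the cluster once its steps have finished.

```python
"""Sketch: request parameters for an EMR cluster that shuts itself down."""


def transient_cluster_request(script_s3_uri: str) -> dict:
    """Build keyword arguments for boto3's EMR `run_job_flow` call.

    Every name, size and label below is a placeholder for illustration.
    """
    return {
        "Name": "transient-etl-example",  # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",     # assumed release; choose your own
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False = terminate the cluster when no steps are left to run.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "run-spark-job",
                # Also shut the cluster down if the step fails.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default role names; yours may differ
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(script_s3_uri: str) -> str:
    """Submit the request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported here so the request can be built without boto3

    emr = boto3.client("emr")
    response = emr.run_job_flow(**transient_cluster_request(script_s3_uri))
    return response["JobFlowId"]
```

Calling `launch("s3://<your-bucket>/job.py")` would start the cluster, run the one Spark step and let the cluster terminate afterwards, so you pay only for the time the job actually needs.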
@joshi1q2w3e 10 months ago
So why do people use either of these when you can just use Databricks? Especially EMR seems like it can be replaced by Databricks.
@shresthaditya2950 11 months ago
AWS Glue is for ETL purposes, and performing ETL operations is much easier with it: you get the data into the catalog and create jobs in Scala or Python, and AWS runs them without you needing to manage clusters, infrastructure or the Apache engine. EMR requires knowledge of clustered computing, so it may carry a lot of infrastructure cost. AWS Glue has a 20-40% overhead cost, but 1) in AWS Glue we pay only for what we use, that is, it is an on-demand service; 2) on the other hand, you have to pay for AWS EMR all the time, and in most companies (around 80%) there isn't any need to run an Amazon EMR cluster. Glue gives 100 DPUs (16 GB RAM and 4 CPUs per DPU). EMR is better than AWS Glue for looking at data, because Glue requires jobs for everything.
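The pay-per-use versus always-on trade-off in this comment can be sketched as simple arithmetic. This is only an illustration: the hourly rates passed in are hypothetical placeholders, not real AWS prices, and the only figures taken from the comment are the DPU size (16 GB RAM and 4 CPUs) and the 100-DPU limit.

```python
"""Sketch: on-demand (per DPU-hour) cost versus an always-on cluster."""

HOURS_PER_MONTH = 730  # rough average number of hours in a month


def dpu_capacity(dpus: int, ram_gb_per_dpu: int = 16, cpus_per_dpu: int = 4) -> tuple:
    """Total (RAM in GB, CPUs) for a number of DPUs, using the sizes quoted above."""
    return dpus * ram_gb_per_dpu, dpus * cpus_per_dpu


def monthly_cost_on_demand(rate_per_dpu_hour: float, dpus: int, job_hours: float) -> float:
    """Cost when you pay only while jobs are running."""
    return rate_per_dpu_hour * dpus * job_hours


def monthly_cost_always_on(cluster_rate_per_hour: float) -> float:
    """Cost of a cluster left running for the whole month."""
    return cluster_rate_per_hour * HOURS_PER_MONTH


def break_even_job_hours(rate_per_dpu_hour: float, dpus: int,
                         cluster_rate_per_hour: float) -> float:
    """Job hours per month at which both options cost the same."""
    return monthly_cost_always_on(cluster_rate_per_hour) / (rate_per_dpu_hour * dpus)


if __name__ == "__main__":
    # Hypothetical rates chosen only to make the arithmetic easy to follow.
    print(dpu_capacity(100))
    print(break_even_job_hours(rate_per_dpu_hour=0.5, dpus=10, cluster_rate_per_hour=5.0))
```

Below the break-even number of job hours the on-demand option is cheaper, and above it the always-on cluster wins, which is the point the comment makes about clusters that sit idle most of the time.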