AWS Tutorials - Building ETL Pipeline using AWS Glue and Step Functions

23,827 views

AWS Tutorials

2 years ago

The script URL - github.com/aws-dojo/analytics...
In AWS, ETL pipelines can be built using AWS Glue Jobs and Glue Crawlers. Glue Jobs are responsible for data transformation, while Crawlers are responsible for the data catalog. AWS Step Functions is one way to orchestrate such pipelines. In this tutorial, learn how to use Step Functions to build an ETL pipeline in AWS.
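As a rough sketch of the idea, a Step Functions state machine can run a Glue job and then start a crawler to re-catalog the output. The definition below is built as a Python dict (as it would be serialized and passed to Step Functions, e.g. via boto3's create_state_machine); the job and crawler names are illustrative, not taken from the video.

```python
import json

# Hypothetical pipeline: run a Glue job, then start a crawler on its output.
definition = {
    "StartAt": "RunCleanJob",
    "States": {
        "RunCleanJob": {
            "Type": "Task",
            # the .sync suffix makes Step Functions wait for job completion
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "etl-clean-job"},  # illustrative name
            "Next": "StartCrawler",
        },
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "cleansed-crawler"},  # illustrative name
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```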

Comments: 76
@arunr2265
@arunr2265 2 years ago
your channel is gold for data engineers. thanks for sharing the knowledge
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
So nice of you
@mranaljadhav8259
@mranaljadhav8259 1 year ago
Well said!
@vaishalikankanala6499
@vaishalikankanala6499 2 years ago
Clear and concise. Great work, thank you very much!
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
You're welcome!
@coldstone87
@coldstone87 2 years ago
This is amazing. Glad I found this on YouTube. A million thanks.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
You're very welcome!
@harsh2014
@harsh2014 2 years ago
Thanks for your session, it helped me!
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
You are welcome!
@pravakarchaudhury1623
@pravakarchaudhury1623 2 years ago
It is really awesome. A million thanks to you.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
I'm glad you like it
@veerachegu
@veerachegu 2 years ago
Really helpful, and no institute offers training on this. Thank you so much!
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Always welcome
@akhilnooney534
@akhilnooney534 1 year ago
Very Well Explained!!!!
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
Glad you liked it
@anuradha6892
@anuradha6892 1 year ago
Thanks 🙏 it was a great video.
@najmehforoozani
@najmehforoozani 2 years ago
Great work
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Thanks
@4niceguy
@4niceguy 2 years ago
Great! I really appreciate it!!!!!
@simij851
@simij851 2 years ago
thank you a ton for doing this!!!
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Our pleasure!
@chatchaikomrangded960
@chatchaikomrangded960 2 years ago
Good one.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Thanks!
@kamrulshuhel7126
@kamrulshuhel7126 1 year ago
Thank you so much for your nice tutorial. I would be grateful if you could respond; I have an issue. When I use the condition not ($.state == "READY") in a Step Functions workflow, I get this error: An error occurred while executing the state 'Choice' (entered at the event id #13). Invalid path '$.state': The choice state's condition path references an invalid value.
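[Editor's note: this error typically means the Choice state's input document has no top-level "state" field at run time, often because the previous Task's ResultPath put the result somewhere else. A hedged sketch of a guard using the IsPresent comparison; the field and state names here are illustrative, not from the video.]

```python
# A Choice state that only compares $.state when the field actually exists,
# avoiding "Invalid path '$.state'" on inputs that lack it.
choice_state = {
    "Type": "Choice",
    "Choices": [
        {
            "And": [
                {"Variable": "$.state", "IsPresent": True},
                {"Not": {"Variable": "$.state", "StringEquals": "READY"}},
            ],
            "Next": "WaitAndRetry",   # illustrative state name
        }
    ],
    "Default": "CrawlerReady",        # illustrative state name
}

print(choice_state)
```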
@terrcan1008
@terrcan1008 2 years ago
Thanks for these tutorials. Could you please share some scenarios for AWS Glue jobs along with Sessions, as well as for AWS Lambda? I would also like to understand incremental load scenarios in AWS Glue using the Hudi dataset, and other scenarios on the same topic.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Yes, sure
@picklu1079
@picklu1079 2 years ago
Thanks for the video. If I use Step Functions to orchestrate Glue workflows, will that slow the whole process down?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Please tell me more. Why do you want to orchestrate Glue workflows?
@BradThurber
@BradThurber 2 years ago
It looks like Step Functions Workflow Studio includes AWS Glue Start Crawler and AWS Glue Get Crawler states. Could these be used directly instead of the lambdas?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Definitely, you can use them.
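A sketch of what that could look like: a crawler polling loop built from the aws-sdk service integrations that Workflow Studio exposes, with no Lambda functions involved. The state and crawler names are illustrative.

```python
# Start the crawler, then poll GetCrawler until its State is READY.
poll_crawler_states = {
    "StartCrawler": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
        "Parameters": {"Name": "raw-crawler"},  # illustrative crawler name
        "Next": "Wait30s",
    },
    "Wait30s": {"Type": "Wait", "Seconds": 30, "Next": "GetCrawler"},
    "GetCrawler": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
        "Parameters": {"Name": "raw-crawler"},
        "Next": "CrawlerDone",
    },
    "CrawlerDone": {
        "Type": "Choice",
        "Choices": [
            # GetCrawler returns the crawler description under $.Crawler
            {"Variable": "$.Crawler.State", "StringEquals": "READY",
             "Next": "NextStep"}  # illustrative next state
        ],
        "Default": "Wait30s",  # still running: loop back and wait again
    },
}

print(list(poll_crawler_states))
```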
@johnwilliam9310
@johnwilliam9310 1 year ago
Which one would you recommend for automating the ETL process? I have seen the AWS Glue Workflow video as well, and this video does something similar, which is automating the ETL process. I am not able to decide which one I should use: Workflow or Step Functions?
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
Glue Workflow is good for a simple workflow of Glue Jobs and Crawlers. However, if you want to build a complex workflow where you want to reuse the same job / crawler and also call other AWS services, then you should choose Step Functions. Hope it helps.
@johnwilliam9310
@johnwilliam9310 1 year ago
@@AWSTutorialsOnline Thank you for providing clarity to me.
@simij851
@simij851 2 years ago
What would you advise if we have 150 tables to move from MySQL into S3 (no business transformation, just a raw dump load)? Should we have them all in one Step Function running in parallel, or create individual pipelines to reduce the risk that if one fails, all fail because they are clubbed together?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
If you are just dumping, use DMS. One task per table.
@PipatMethavanitpong
@PipatMethavanitpong 2 years ago
Thank you. This is a nice ETL demo. I wonder how you handle previously extracted and cleaned data. Glue jobs do append-only writes, so the raw bucket will contain both old and new extracts, and the cleaning job will process both. I think there should be some logic to separate old files from new files.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
You can enable job bookmarks on the Glue job; that way the job will not process already-processed data.
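Job bookmarks are switched on through Glue's special job parameter --job-bookmark-option; with it enabled, a rerun skips input the job has already processed. A minimal sketch of the run request, assuming an illustrative job name (these arguments would go to glue.start_job_run(...) via boto3, or be set as the job's default arguments):

```python
# Run-request payload enabling the job bookmark for this Glue job run.
start_job_run_args = {
    "JobName": "raw-to-cleansed-job",  # illustrative job name
    "Arguments": {
        # real Glue special parameter; other values are
        # job-bookmark-disable and job-bookmark-pause
        "--job-bookmark-option": "job-bookmark-enable",
    },
}

print(start_job_run_args)
```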
@PipatMethavanitpong
@PipatMethavanitpong 2 years ago
@@AWSTutorialsOnline Sounds nice. I'll check it out. Thank you.
@veerachegu
@veerachegu 2 years ago
Really awesome video; this content is not available anywhere else. A small request: can you do a lab where files are uploaded daily or hourly into S3, and an S3 trigger runs the Step Functions pipeline through to the end of the job?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Sure, adding it to the backlog
@rishubhanda1084
@rishubhanda1084 2 years ago
Amazing video!! Could you please go over how to build something like this with the CDK? The visual editor is helpful, but I find it easier to provision resources with code.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Hi - yes. Planning a CDK video for setting up a data platform.
@rishubhanda1084
@rishubhanda1084 2 years ago
@@AWSTutorialsOnline Thank you so much! I just watched all your videos on Glue, and I think the event-driven pipeline with EventBridge would be the most helpful.
@nlopedebarrios
@nlopedebarrios 5 months ago
Considering the continuous evolution of AWS Glue, what do you think is more suitable for a newbie: orchestrating the ETL pipeline with Glue Workflows or Step Functions?
@ravitejatavva7396
@ravitejatavva7396 2 months ago
@AWSTutorialsOnline, I appreciate your good work. AWS Glue has evolved so much now; how can we incorporate data quality checks into the pipelines, send email notifications to users with DQ failure results (such as rules_succeeded, rules_skipped, rules_failed), and publish the data to a QuickSight dashboard? Do we still need Step Functions? Any thoughts / suggestions, please?
@veerachegu
@veerachegu 2 years ago
Please can you explain what job takes place between the raw crawler and the cleansed crawler?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
The raw layer is immutable. It presents the data in the format it is ingested. From the raw to the cleansed layer, you do cleaning operations such as handling missing values and format standardization for dates, currency, column naming, etc.
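To make those cleaning operations concrete, here is a tiny illustrative transform on one record: normalize column names, default a missing value, and standardize a date format. The field names and formats are invented for the example; in a real Glue job the same logic would be applied over a DynamicFrame or Spark DataFrame.

```python
from datetime import datetime

def clean_record(rec: dict) -> dict:
    """Raw-to-cleansed sketch: naming, missing values, date format."""
    out = {}
    for key, value in rec.items():
        # column naming standardization: "Order Date" -> "order_date"
        out[key.strip().lower().replace(" ", "_")] = value
    # handle a missing value with a default (illustrative rule)
    out.setdefault("currency", "USD")
    # date standardization: "31/12/2023" -> ISO "2023-12-31"
    out["order_date"] = (
        datetime.strptime(out["order_date"], "%d/%m/%Y").date().isoformat()
    )
    return out

print(clean_record({"Order Date": "31/12/2023", "Amount": 100}))
```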
@anmoljm5799
@anmoljm5799 2 years ago
My data source is CSV files dropped into an S3 bucket, which is crawled; I trigger the crawler using a Lambda that detects when an object has been dropped into the bucket. How do I trigger the start of a pipeline of Glue jobs upon completion of the first crawler, which crawls my source S3 bucket? I could use Workflows, which is part of Glue, but I have a Glue DataBrew job that needs to be part of the pipeline.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
You need to use an event-based mechanism. I have a tutorial for it here - kzfaq.info/get/bejne/ZppylaZ9qdLaeX0.html
@anmoljm5799
@anmoljm5799 2 years ago
@@AWSTutorialsOnline Thank you for the reply and the awesome video!
@sriadityab4794
@sriadityab4794 2 years ago
How do I handle multiple files dropped in S3 at the same time when we need to trigger one Glue job using Lambda? I see some limitations where it throws an error when multiple files arrive at once. How should we handle the Lambda here? Any help is appreciated.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Yeah, it is a real pain if you drop multiple files at ingestion time (into the raw layer) and you want the Glue job to start after all the drops have completed. Past the raw stage, you can hook into Glue and Crawler events to run the pipeline, but at ingestion time you rely on S3 file-drop events. In that case, the best method is to drop a token file after all the data files have been dropped. The S3 event can be configured on the put/post event of this token file. The crawler will be configured to exclude the token file. Similarly, the Glue job, if doing file-based operations, will also exclude the token file. Hope it helps.
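The token-file pattern above can be sketched as an S3 bucket notification configuration that fires only on the token object's key. The bucket prefix, token name, and Lambda ARN are placeholders, not values from the video; this dict matches the shape accepted by S3's put_bucket_notification_configuration.

```python
# Fire the pipeline-start Lambda only when the token file lands,
# i.e. after all the data files have already been dropped.
notification_config = {
    "LambdaFunctionConfigurations": [
        {
            "LambdaFunctionArn": (
                "arn:aws:lambda:us-east-1:123456789012"
                ":function:start-pipeline"  # placeholder ARN
            ),
            "Events": ["s3:ObjectCreated:Put"],
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "raw/"},      # placeholder
                        {"Name": "suffix", "Value": "_SUCCESS"},  # token file
                    ]
                }
            },
        }
    ]
}

print(notification_config["LambdaFunctionConfigurations"][0]["Events"])
```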
@abeeya13
@abeeya13 9 days ago
Can we combine batch processing with Step Functions?
@nlopedebarrios
@nlopedebarrios 5 months ago
If the purpose of the ETL pipeline is to move data around, and the sources, stages and destination are already cataloged, why would you need to run the crawlers after each Glue job finishes?
@user-lq6gc1tw2v
@user-lq6gc1tw2v 1 year ago
Hello, good video. Maybe someone knows when to use Glue Workflows and when to use Step Functions?
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
Use a Glue workflow when you want to orchestrate Glue Jobs and Crawlers only. Use Step Functions when you want to orchestrate Glue Jobs and Crawlers plus other services as well.
@Draco-pu4ro
@Draco-pu4ro 1 year ago
How do we run this as an automated flow in the real world, like in a productionized environment?
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
You can automate in two ways: event based or schedule based. Event based would be running the Step Function when data lands in an S3 bucket. Schedule based would be running the Step Function at a scheduled time (configured via Amazon EventBridge).
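The schedule-based option can be sketched as an EventBridge rule targeting the state machine; these dicts match the shapes taken by the events client's put_rule and put_targets calls. All names, ARNs, and the 02:00 UTC schedule are placeholders.

```python
# EventBridge rule: start the ETL state machine every day at 02:00 UTC.
rule = {
    "Name": "nightly-etl-trigger",               # placeholder rule name
    "ScheduleExpression": "cron(0 2 * * ? *)",   # EventBridge cron syntax
    "State": "ENABLED",
}

# Target: the Step Functions state machine, with a role that allows
# events.amazonaws.com to call states:StartExecution.
targets = {
    "Rule": "nightly-etl-trigger",
    "Targets": [
        {
            "Id": "etl-state-machine",
            "Arn": ("arn:aws:states:us-east-1:123456789012"
                    ":stateMachine:etl-pipeline"),          # placeholder
            "RoleArn": ("arn:aws:iam::123456789012"
                        ":role/eventbridge-sfn-role"),      # placeholder
        }
    ],
}

# e.g. events = boto3.client("events");
#      events.put_rule(**rule); events.put_targets(**targets)
print(rule["ScheduleExpression"])
```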
@anirbandatta2037
@anirbandatta2037 2 years ago
Hi, could you please share some CI/CD scenarios using AWS services?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Sure - I will plan some. Thanks for the feedback.
@veerachegu
@veerachegu 2 years ago
One doubt: is the crawler operation mandatory for going from raw data to cleansed? Can we transfer the raw data directly to cleansed with the help of a Glue job?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
It is not mandatory, but cataloging data at each stage is recommended practice. It makes data searchable and discoverable at each stage.
@veeru2310
@veeru2310 1 year ago
Hi sir, I am passing Glue job arguments in Step Functions to call parallel Glue job operations, but unfortunately my jobs succeed while no records are transferred, and the source path and destination are correct. Please help; the job is not taking the parameters from Step Functions.
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
Can you show the syntax you use to pass parameters when calling the Glue job?
@veeru2310
@veeru2310 1 year ago
@@AWSTutorialsOnline I am going to load 18 tables, so I need to pass 18 table parameters, right? Is that a good way? Can you please suggest?
@InvestorKiddd
@InvestorKiddd 1 year ago
How do I create a Glue job using AWS Lambda?
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
Do you want to create a Glue job or run a Glue job?
@InvestorKiddd
@InvestorKiddd 1 year ago
@@AWSTutorialsOnline Create a Glue job using AWS Lambda or AWS Step Functions.
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
@@InvestorKiddd I can probably explain, but I want to understand more. Generally, people have a job configured and they want to run it using Lambda / Step Functions. Why do you need to create a job using Lambda / Step Functions? What is the use case?
@InvestorKiddd
@InvestorKiddd 1 year ago
@@AWSTutorialsOnline I am scraping some files based on cities, and then I want to convert them into Parquet and use Athena queries to get insights. I can use the same job for the mapping and conversion, but the input and output path names change. Say the input file name is mumbai.csv (city.csv); the input path will change when we go to bangalore.csv. To solve this, my idea was to create a new job per city - or, if we can change the input and output paths programmatically, that also works for me. I want to automate this process.
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
@@InvestorKiddd In this case, you should create one job and, at run time, pass the source and destination locations as job parameters. Please check my videos - I talked about it in one of them.
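A sketch of that approach: one job reused for every city, with the paths supplied as run-time arguments. The argument names (--SOURCE_PATH, --TARGET_PATH), bucket names, and job name are illustrative; inside the Glue script they would be read back with awsglue.utils.getResolvedOptions.

```python
def job_run_request(city: str) -> dict:
    """Build a start_job_run payload for one city's CSV file."""
    return {
        "JobName": "csv-to-parquet-job",  # one shared job, illustrative name
        "Arguments": {
            # run-time parameters; placeholder bucket names
            "--SOURCE_PATH": f"s3://my-raw-bucket/{city}.csv",
            "--TARGET_PATH": f"s3://my-parquet-bucket/{city}/",
        },
    }

# e.g. boto3.client("glue").start_job_run(**job_run_request("mumbai"))
print(job_run_request("mumbai"))
```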