AWS Tutorials - Building Event Based AWS Glue ETL Pipeline

  9,718 views

AWS Tutorials

2 years ago

Python handler code - github.com/aws-dojo/analytics...
AWS Glue pipelines are responsible for ingesting data into the data platform or data lake and for managing the data transformation lifecycle from raw to cleansed to curated state. There are many ways to build such pipelines. In this video, you learn how to build an event-based ETL pipeline.
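The Lambda handler that wires the pipeline together is in the GitHub link above. Purely as a rough, hedged sketch of what such an event handler can look like (the DynamoDB table name, key, and attribute names below are illustrative assumptions, not necessarily what the video uses):

    import boto3

    glue = boto3.client("glue")
    dynamodb = boto3.client("dynamodb")

    def lambda_handler(event, context):
        # EventBridge delivers the Glue event; the finished job or crawler name sits in "detail"
        detail = event.get("detail", {})
        source = detail.get("jobName") or detail.get("crawlerName")

        # Look up the next pipeline step for this source in a DynamoDB pipeline table
        ddresp = dynamodb.query(
            TableName="etl-pipeline",
            KeyConditionExpression="#s = :s",
            ExpressionAttributeNames={"#s": "source"},
            ExpressionAttributeValues={":s": {"S": source}},
        )
        if not ddresp["Items"]:
            return {"status": "no next step configured", "source": source}

        target = ddresp["Items"][0]["target"]["S"]
        targettype = ddresp["Items"][0]["targettype"]["S"]

        # Start the next step: either another Glue job or a crawler
        if targettype == "job":
            glue.start_job_run(JobName=target)
        else:
            glue.start_crawler(Name=target)
        return {"status": "started", "target": target, "targettype": targettype}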

Comments: 39
@danieldani5495 · 1 year ago
Excellent video, you saved my time.
@veerachegu · 2 years ago
Good to see this demo. Please also do a demo on incremental data upload into an S3 bucket.
@AWSTutorialsOnline · 2 years ago
Sure. Could you please suggest a few scenarios, like the data sources involved, and I will try to make one.
@veerachegu · 2 years ago
@@AWSTutorialsOnline Can you share your email id, please?
@hirendra83 · 2 years ago
Thanks, very helpful tutorial. Please continue your good work. Sir, can you cover how to create a monitoring or observability dashboard for such a pipeline using CloudWatch Logs?
@AWSTutorialsOnline · 2 years ago
Sure. I will plan for it.
@user-fn3zs2wq5c · 10 months ago
You've explained the execution flow well, but you haven't explained the creation of the Glue database, the catalog tables, the DynamoDB table, the Lambda function, or the EventBridge rule. You've created the backend beforehand and are just explaining the flow. Please explain the creation part as well.
@ballusaikumar873 · 1 year ago
Thank you for making useful videos on AWS. I have learnt a lot by watching your videos. I have a use case where I need your input. A job writes multiple Parquet files (usually a single dataset split into multiple files due to Spark partitions) to an S3 bucket. I want to send an event to EventBridge when all files have been written successfully. How do I implement this using S3 and EventBridge? Currently I see multiple events getting triggered.
@AWSTutorialsOnline · 1 year ago
I did create a video on that. Please check it out - hope you find it useful. Link - kzfaq.info/get/bejne/i6iYesKGstuqdaM.html
@DanielWeikert · 2 years ago
Great work. You got my sub, you deserved it. I highly appreciate your work. Could you do a workshop exercise for setting up such a pipeline? Can you also do a tutorial/workshop on setting up Glue job pipelines with CloudFormation? Thanks and best regards.
@AWSTutorialsOnline · 2 years ago
Thanks for the appreciation. What do you mean by workshop? Also, I am planning to do a tutorial on using CDK / CloudFormation for setting up such a pipeline.
@DanielWeikert · 2 years ago
@@AWSTutorialsOnline Thanks, I was referring to the step-by-step exercises you provide on your homepage.
@AWSTutorialsOnline · 2 years ago
@@DanielWeikert Ah ok. I will plan for it.
@suneelkumar-kn4ds · 1 year ago
Hi Sir, would you clarify one query? I have this doubt about the data pipeline you explain at 3:20: why are we using the Data Catalog here?
@spp3607 · 2 years ago
Thank you for the tutorials. I have a question on deployment: after developing this pipeline (Glue, crawler, Lambda, and EventBridge) in the development environment, how do we move/deploy all of it to production?
@AWSTutorialsOnline · 2 years ago
You should not create these resources manually. Rather, use CloudFormation or CDK as infrastructure-as-code services to script the resource creation and move between environments.
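As a hedged illustration of that advice (not the exact resources from the video), a CDK stack in Python could declare the Glue job, the handler Lambda, and the EventBridge rule together so the whole pipeline can be deployed into any environment; the role ARN, script location, and asset path below are placeholders:

    from aws_cdk import (
        Stack,
        aws_events as events,
        aws_events_targets as targets,
        aws_glue as glue,
        aws_lambda as _lambda,
    )
    from constructs import Construct

    class EtlPipelineStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)

            # Glue job (role ARN and script location are placeholders)
            glue.CfnJob(self, "RawToCleansedJob",
                name="raw-to-cleansed",
                role="arn:aws:iam::123456789012:role/glue-job-role",
                command=glue.CfnJob.JobCommandProperty(
                    name="glueetl",
                    script_location="s3://my-bucket/scripts/raw_to_cleansed.py",
                ),
                glue_version="4.0",
            )

            # Lambda handler that decides the next pipeline step
            handler = _lambda.Function(self, "PipelineHandler",
                runtime=_lambda.Runtime.PYTHON_3_11,
                handler="handler.lambda_handler",
                code=_lambda.Code.from_asset("lambda"),
            )

            # Invoke the handler whenever any Glue job finishes successfully
            rule = events.Rule(self, "GlueJobSucceeded",
                event_pattern=events.EventPattern(
                    source=["aws.glue"],
                    detail_type=["Glue Job State Change"],
                    detail={"state": ["SUCCEEDED"]},
                ),
            )
            rule.add_target(targets.LambdaFunction(handler))

Deploying the same stack against different accounts or stages is then how the pipeline moves between development and production.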
@poojakarthik93 · 2 years ago
Hello. Thanks a lot for this video, it is really helpful. I have one question: to run your second Glue job, how will we know that all our files have been copied to S3?
@AWSTutorialsOnline · 2 years ago
Good question - you need to watch this video - kzfaq.info/get/bejne/i6iYesKGstuqdaM.html
@aniket9602 · 1 year ago
Can someone please explain the code below, which is written in the Lambda script:
    target = ddresp["Items"][0]["target"]["S"]
    targettype = ddresp["Items"][0]["targettype"]["S"]
What should be the expected output of these lines?
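In case it helps other readers: that snippet uses the low-level boto3 DynamoDB client, whose query response wraps every attribute value in a type descriptor such as "S" for string, so ["S"] unwraps the plain value. A rough illustration (the table name and key are assumptions; the "target" and "targettype" attributes follow the snippet above):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Query the pipeline table for the step that just finished (names assumed)
    ddresp = dynamodb.query(
        TableName="etl-pipeline",
        KeyConditionExpression="#s = :s",
        ExpressionAttributeNames={"#s": "source"},
        ExpressionAttributeValues={":s": {"S": "raw-to-cleansed-job"}},
    )

    # The response looks roughly like:
    # {"Items": [{"source": {"S": "raw-to-cleansed-job"},
    #             "target": {"S": "cleansed-crawler"},
    #             "targettype": {"S": "crawler"}}], ...}
    # so ["Items"][0]["target"]["S"] pulls out the plain string value,
    # i.e. the name of the next Glue job or crawler to start.
    target = ddresp["Items"][0]["target"]["S"]          # e.g. "cleansed-crawler"
    targettype = ddresp["Items"][0]["targettype"]["S"]  # e.g. "crawler" or "job"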
@coldstone87 · 2 years ago
Hello. Thanks for the tutorial. I have a small clarification. So basically every Glue job and Glue crawler by default writes an event to the default bus of EventBridge, and then, based on rule filtering, we invoke the Lambda. Correct? Because I don't see any code or configuration in the job or crawler to publish an event into EventBridge. Please confirm my understanding.
@AWSTutorialsOnline · 2 years ago
You are right. Most AWS services (including Glue jobs and crawlers) automatically publish events to the EventBridge default event bus. You then use rules to hook into a particular event and do what you want when that event is raised.
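As a small sketch of that idea (the rule name and Lambda ARN are placeholders), a rule on the default bus that fires when any Glue job reaches the SUCCEEDED state could be created like this:

    import json
    import boto3

    events = boto3.client("events")

    # Rule on the default event bus matching Glue job success events
    events.put_rule(
        Name="glue-job-succeeded",
        EventPattern=json.dumps({
            "source": ["aws.glue"],
            "detail-type": ["Glue Job State Change"],
            "detail": {"state": ["SUCCEEDED"]},
        }),
        State="ENABLED",
    )

    # Point the rule at the handler Lambda (ARN is a placeholder); the Lambda
    # also needs a resource policy allowing events.amazonaws.com to invoke it.
    events.put_targets(
        Rule="glue-job-succeeded",
        Targets=[{"Id": "pipeline-handler",
                  "Arn": "arn:aws:lambda:us-east-1:123456789012:function:pipeline-handler"}],
    )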
@arunt741 · 2 years ago
Thank you very much for your excellent work with this channel. If I have multiple Glue jobs but I want to publish to EventBridge only for some of them, how do I handle that in the event pattern? If I am not wrong, with this event pattern the completion of any Glue job will trigger the Lambda, correct? Can we use some tokens in the event pattern, e.g. Glue job name starts with GJ_%? Thanks in advance.
@AWSTutorialsOnline · 2 years ago
Hi, it seems there is no way to filter on job name in the EventBridge rule. You will have to filter at the handler level. You can build a two-step handler: EventBridge to an SNS topic (filter step) to Lambda (actual handler). At SNS, you can configure a subscription filter on messages to stop processing for certain Glue jobs.
@arunt741 · 2 years ago
@@AWSTutorialsOnline Thank you very much. It is a great suggestion.
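A related, hedged aside on the name-filtering question above: EventBridge event patterns also support content filtering on fields inside "detail", including prefix matching, so depending on what is available in your setup it may be possible to match only job names that start with "GJ_". This is not the approach shown in the video, and the rule name and prefix below are assumptions:

    import json
    import boto3

    events = boto3.client("events")

    # Sketch: match only Glue jobs whose name starts with "GJ_" (prefix matching
    # is an EventBridge content-filtering feature; verify it behaves as expected
    # in your account before relying on it).
    events.put_rule(
        Name="gj-jobs-succeeded",
        EventPattern=json.dumps({
            "source": ["aws.glue"],
            "detail-type": ["Glue Job State Change"],
            "detail": {
                "state": ["SUCCEEDED"],
                "jobName": [{"prefix": "GJ_"}],
            },
        }),
        State="ENABLED",
    )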
@canye1662 · 2 years ago
Nice video, but I would like to know if you have code that can be embedded in the Glue job script to prevent duplicate data when the job runs every hour. I know a job bookmark will help, but I am looking for code that can be included in the script section.
@AWSTutorialsOnline · 2 years ago
Do you want to use a job bookmark, or do you want to build custom business logic for the incremental data?
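In case a concrete pattern helps, one generic approach (not from the video; the dataset, key column, and paths are placeholders) is to de-duplicate the incoming batch on a business key and anti-join it against what is already in the target before appending:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Incoming hourly batch and what has already been written (paths are placeholders)
    incoming = spark.read.parquet("s3://my-bucket/raw/orders/")
    existing = spark.read.parquet("s3://my-bucket/curated/orders/")

    # Drop duplicates within the batch, then drop rows whose key already exists in the target
    deduped = (incoming.dropDuplicates(["order_id"])
                       .join(existing.select("order_id"), on="order_id", how="left_anti"))

    deduped.write.mode("append").parquet("s3://my-bucket/curated/orders/")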
@veerachegu · 2 years ago
Can we use S3 instead of DynamoDB to store the Lambda execution data?
@AWSTutorialsOnline · 2 years ago
Technically you can. Use a JSON document so that it is easy to query.
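A rough sketch of that alternative, with the bucket, key, and field names as assumptions: the handler would read the pipeline definition as a JSON document from S3 instead of querying DynamoDB:

    import json
    import boto3

    s3 = boto3.client("s3")

    def load_pipeline_config(bucket="my-pipeline-bucket", key="config/pipeline.json"):
        # pipeline.json might map each finished step to the next one, e.g.
        # {"raw-to-cleansed-job": {"target": "cleansed-crawler", "targettype": "crawler"}}
        obj = s3.get_object(Bucket=bucket, Key=key)
        return json.loads(obj["Body"].read())

    config = load_pipeline_config()
    next_step = config.get("raw-to-cleansed-job", {})
    print(next_step.get("target"), next_step.get("targettype"))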
@veerachegu · 2 years ago
One more query from my end. In my project we need to pull files into S3 through an API, and those files contain about 25k records per day; the next day they are updated along with the existing records. Will Lambda work for this scenario? I think it supports up to 15 minutes, but the API calls may not complete in that time. Please guide me on the best way to store the data into S3 without any conflicts. Is SQS or Step Functions the way to go, or is some other service better? Please suggest.
@MrErPratikParab · 1 year ago
How useful would Airflow be here?
@AWSTutorialsOnline · 1 year ago
I see Apache Airflow as another workflow engine which can deliver the same result.
@pvchennareddy · 2 years ago
Can you please share the Glue job code?
@AWSTutorialsOnline · 2 years ago
I added the job code at the same link.
@pvchennareddy · 2 years ago
Thanks
@pvchennareddy · 2 years ago
Thanks for the code. I was working on the application side, and now I need to work on a data lake setup, which is new to me. As per my understanding, the industry is moving towards the data lakehouse. I am new to this and want to know the difference between a data lake and a lakehouse, and when I should go for a data lake versus a lakehouse. Let me know if you have done anything on this, or drop me a note if you do in the future. Thanks.
@AWSTutorialsOnline · 2 years ago
@@pvchennareddy A data lake is more about managing and governing data in a single place. A data lakehouse goes beyond that. The following links would help you understand it - aws.amazon.com/blogs/big-data/harness-the-power-of-your-data-with-aws-analytics/ and aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/
@abhijeetjain8228 · 2 months ago
The demo part is not good; things are not properly explained. You are just reading, not showing how to create them. Please focus on the practical part instead of the theory.
@DanielWeikert · 2 years ago
When I triggered a Glue workflow with Lambda to write CSV to another folder as Parquet, I received this error: "Unsupported case of DataType: com.amazonaws.services.glue.schema.types.LongType@538fb895 and DynamicNode: stringnode". I did not find any help on Google. Any ideas?
@AWSTutorialsOnline · 2 years ago
I cannot figure it out unless I see the data and work with it a little.
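A hedged, generic note for anyone who hits the same error: it suggests a type conflict between the schema Glue expects (a long column, for example from the catalog) and what the CSV actually contains (a string). One thing to try, purely as a sketch with placeholder database, table, column, and path names, is to resolve the ambiguous column to a single type before writing Parquet:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the CSV source via the Data Catalog (database/table names are placeholders)
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders_csv")

    # Force the conflicting column to a single type before writing Parquet
    dyf = dyf.resolveChoice(specs=[("order_id", "cast:long")])

    glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/orders/"},
        format="parquet",
    )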