AWS Tutorials - ETL Pipeline with Multiple Files Ingestion in S3

13,369 views

AWS Tutorials

2 years ago

The code link - github.com/aws-dojo/analytics...
Handling multiple file ingestion in a Glue ETL pipeline is a challenge if you want to process all the ingested files at once. Learn how to build a pipeline that can handle processing of multiple files.
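A minimal sketch of the pattern the video builds, under assumed names (the `token.done` indicator file and the `process-raw-files` job are hypothetical): the upstream export writes its data files and then a final indicator file, and a Lambda subscribed to S3 events starts the Glue ETL job only when that indicator arrives, so the whole batch is processed in one run.

```python
# Sketch of the token-file pattern; bucket layout and names are hypothetical.
# The Lambda is subscribed to S3 ObjectCreated events on the raw prefix.
TOKEN_SUFFIX = "token.done"          # indicator file written last by the export job
GLUE_JOB_NAME = "process-raw-files"  # hypothetical Glue ETL job name

def is_token_file(key: str) -> bool:
    """True only for the indicator file, not for the data files."""
    return key.endswith(TOKEN_SUFFIX)

def lambda_handler(event, context):
    import boto3  # imported here so the pure helper above is testable without the AWS SDK
    glue = boto3.client("glue")
    started = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if is_token_file(key):
            # All data files have landed; process the whole batch in one run.
            started.append(glue.start_job_run(JobName=GLUE_JOB_NAME)["JobRunId"])
    return started
```

Data-file uploads still invoke the Lambda, but they fail the `is_token_file` check and return without starting anything.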

Comments: 36
@darkcodecamp1678
@darkcodecamp1678 17 days ago
What we use in production: when the Glue job puts data into the raw S3 bucket, it publishes an SNS notification, which an SQS queue subscribes to; the queue then triggers a Lambda :)
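For reference, a sketch of the unwrapping such a Lambda has to do. The nesting follows the standard S3 → SNS → SQS envelope (the SQS record body carries the SNS envelope as a JSON string, whose `Message` field carries the original S3 event); nothing here is specific to this video.

```python
import json

def s3_keys_from_sqs_event(event: dict) -> list:
    """Unwrap the SQS -> SNS -> S3 notification layers to get the object keys."""
    keys = []
    for sqs_record in event.get("Records", []):
        sns_envelope = json.loads(sqs_record["body"])   # SQS body carries the SNS envelope
        s3_event = json.loads(sns_envelope["Message"])  # SNS Message carries the S3 event
        for s3_record in s3_event.get("Records", []):
            keys.append(s3_record["s3"]["object"]["key"])
    return keys
```

If the SNS subscription has "raw message delivery" enabled, the outer envelope is skipped and the SQS body is the S3 event itself, so the middle `json.loads` would be dropped.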
@prakashr9221
@prakashr9221 1 year ago
I was looking for this use case, and this is helpful. Thank you.
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
Glad it was helpful!
@swapnilkulkarni6719
@swapnilkulkarni6719 2 years ago
Really good. Thanks a lot for making such nice videos. Lots of learning from them.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
It's my pleasure
@imtiyazali7003
@imtiyazali7003 11 months ago
Great info and a great tutorial. Thank you!!
@arunasingh8617
@arunasingh8617 2 years ago
You are doing an excellent job! Get going :)
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Thank you! 😃
@markkinuthia6178
@markkinuthia6178 1 year ago
Thank you very much, Sir. I love how you teach with use cases. My question is: can this approach be used in production, and can the same also be used with Redshift? Thanks.
@ladakshay
@ladakshay 2 years ago
Good, smart solution. We can also orchestrate the entire flow using Glue Workflows or Step Functions so we don't have to depend on S3 events and Lambda.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Indeed you can. If you search my channel, I made two more videos about building pipelines using Glue Workflows and Step Functions. But some of the audience asked about handling S3 events in the case of multiple file ingestion, so I made this video.
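To illustrate the Step Functions alternative mentioned here, a minimal state machine that runs a Glue job as a single synchronous step looks roughly like this (the job name is hypothetical; `.sync` makes the state wait until the job run finishes before moving on):

```json
{
  "Comment": "Minimal sketch: run one Glue job synchronously; job name is hypothetical",
  "StartAt": "RunEtlJob",
  "States": {
    "RunEtlJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "process-raw-files" },
      "End": true
    }
  }
}
```

A real pipeline would chain further states (crawlers, quality checks, notifications) after `RunEtlJob`.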
@ladakshay
@ladakshay 2 years ago
@@AWSTutorialsOnline Yes, this use case can come up in any pipeline where we want to trigger the next step after data is written to S3.
@saravninja
@saravninja 2 years ago
Thanks for the great explanation!
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
You're welcome!
@akshaybaura
@akshaybaura 1 year ago
This is acceptable if you have control over the first Glue process that is dumping files for you. What is the intended solution if you can't create a token/indicator file?
@misekerbirega3510
@misekerbirega3510 2 years ago
Thanks a lot, Sir.
@lakshminarayanau3989
@lakshminarayanau3989 2 years ago
Thanks for your videos; this channel is a good learning source. Is there any video that covers JSON files with multiple nested arrays, i.e. arrays within arrays, flattening them and moving the data to Redshift?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
I have the following videos on nested JSON; hope they help. kzfaq.info/get/bejne/aqemddlet97Wpmg.html kzfaq.info/get/bejne/aKmYnLSQl8ydZ4k.html kzfaq.info/get/bejne/hrhhaLeHv6rLqWg.html
@udaynayak4788
@udaynayak4788 9 months ago
Thank you for the valuable information. Can you please cover incremental loads, where RDS is the source and Redshift the target, with the SCD2 approach? The PySpark script under Glue should handle SCD2.
@abhijeetjain8228
@abhijeetjain8228 2 months ago
That would be nice to cover!
@deep6858
@deep6858 2 years ago
Excellent. I am new to AWS and its services. A related question: with multiple files in S3 we trigger Lambda, and Lambda then calls the Glue job, and we have set the concurrency of both Lambda and the Glue job to 1. Will this work the same way or differently? Thanks.
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Not sure about your question. But with concurrency 1 as well, the Lambda will trigger for each file upload. The only difference is that executions will queue up because of the concurrency limit.
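One detail worth noting alongside this answer: when the Glue job itself has `MaxConcurrentRuns` set to 1, a second `start_job_run` while a run is in flight raises `ConcurrentRunsExceededException` rather than queuing. A sketch of handling that, with the client passed in so the logic stays testable (job name hypothetical):

```python
def start_glue_job_once(glue, job_name: str):
    """Try to start the Glue job; return the run id, or None if a run is
    already in flight. Glue raises ConcurrentRunsExceededException when the
    job's MaxConcurrentRuns limit is hit. Pass boto3.client("glue") as `glue`.
    """
    try:
        return glue.start_job_run(JobName=job_name)["JobRunId"]
    except glue.exceptions.ConcurrentRunsExceededException:
        # A run is already executing; returning None drops this event, so a
        # caller that must not lose events should requeue (e.g. via SQS) instead.
        return None
```

With Lambda concurrency 1 in front, events serialize before reaching Glue, so this exception path becomes rare but is still possible if a Glue run outlasts several Lambda invocations.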
@vivekjacobalex
@vivekjacobalex 2 years ago
Good video 👍. I have one doubt: while pulling data from PostgreSQL to the raw folder, where did it specify splitting the output into files by employee records?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
I did not. Once you choose Parquet format with Snappy compression, it partitions the output automatically based on size.
@cloudcomputingpl8102
@cloudcomputingpl8102 2 years ago
How do you run a Glue job only on new files and not the full data? If, for example, you have 700 GB, running a multi-hour job for every file will take ages. Can anyone point me to a resource?
@tan2784
@tan2784 1 year ago
Interesting. Is there an alternative way to create a single Lambda function without the token? Suppose a user doesn't have control over how data is loaded into S3, but has to work with files loaded regularly, i.e. every hour, at the individual S3 object level.
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
There has to be some trigger to know that all files have arrived. It could be total file size or file count. You can configure an event to log all new/updated files arriving and let Lambda check their count/total size; if the threshold is reached, trigger the pipeline.
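The count/size check described here can be sketched as a pure helper. The input is shaped like the `Contents` entries that S3 `ListObjectsV2` returns; the thresholds are hypothetical and depend on what the upstream export actually produces.

```python
def batch_ready(objects, min_count: int = 5, min_bytes: int = 0) -> bool:
    """Decide whether the full batch has landed, by file count and/or total size.

    `objects` is a list of dicts with at least 'Key' and 'Size', i.e. the
    shape of the 'Contents' entries from S3 ListObjectsV2. Both thresholds
    must be met; min_bytes defaults to 0 so a pure count check also works.
    """
    total_bytes = sum(obj["Size"] for obj in objects)
    return len(objects) >= min_count and total_bytes >= min_bytes
```

Inside the Lambda you would list the prefix, e.g. `resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)`, and pass `resp.get("Contents", [])` to this helper, starting the Glue job only when it returns True.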
@thegeekyreview2916
@thegeekyreview2916 8 days ago
What happens to the S3 data in the next run? Is it overwritten or appended?
@helovesdata8483
@helovesdata8483 1 year ago
Why write five files from the database? Is that just to show how separate files would work in this example?
@AWSTutorialsOnline
@AWSTutorialsOnline 1 year ago
Yes.
@sivaprasanth5961
@sivaprasanth5961 2 years ago
How can I select my state machine as the destination?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Sorry, I could not get your question. Can you please elaborate a bit?
@SandeepKumar-ne1ln
@SandeepKumar-ne1ln 2 years ago
Given Glue is serverless, is it really a problem having multiple Glue jobs triggered for individual files in the raw zone?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
Not really. But sometimes, when you are doing aggregation-based processing, you want all the files to land before processing. Also, multiple instances of Glue will increase cost.
@SandeepKumar-ne1ln
@SandeepKumar-ne1ln 2 years ago
@@AWSTutorialsOnline Another question I have is: if multiple files are being created, then instead of an S3 event triggering the Lambda function, can't we trigger Lambda on a Glue event (when the Glue job completes writing all files to S3)?
@AWSTutorialsOnline
@AWSTutorialsOnline 2 years ago
@@SandeepKumar-ne1ln You can, using an EventBridge-based event. I talked about it in some other videos.
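For reference, the EventBridge rule pattern for this looks roughly as follows (the `jobName` is hypothetical): it matches the "Glue Job State Change" event Glue emits when a job run succeeds, and the rule's target can be the Lambda or a Step Functions state machine.

```json
{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "jobName": ["export-raw-files"],
    "state": ["SUCCEEDED"]
  }
}
```

This removes the need for per-file S3 events entirely: the downstream step fires once, after the upstream job has finished writing all of its files.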