Tracking Processed Data Using AWS Glue Job Bookmarks | Incremental ETL In-depth intuition

6,438 views

2 years ago

Why only incremental ingestion? Why not a complete incremental pipeline covering ingestion, curation, and publishing?
If the business logic implemented in the curation layer does not depend on previously processed data, then the complete pipeline, not just ingestion, can be made incremental. AWS Glue makes this possible through one of its most powerful features: job bookmarks 😊
In this video, I discuss the job bookmark concept in Glue.
For details, you can refer to this documentation:
docs.aws.amazon.com/glue/late...
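As a rough illustration of how bookmarking is switched on for a job run: Glue reads the special job parameter `--job-bookmark-option`, whose documented values are `job-bookmark-enable`, `job-bookmark-disable`, and `job-bookmark-pause`. The helper name `bookmark_arguments` below is hypothetical and only builds the arguments dictionary; it does not call AWS.

```python
# Documented values of the Glue special parameter --job-bookmark-option.
BOOKMARK_OPTIONS = {
    "enable": "job-bookmark-enable",    # track state, process only new data
    "disable": "job-bookmark-disable",  # ignore state, process everything
    "pause": "job-bookmark-pause",      # read state but do not update it
}


def bookmark_arguments(mode: str) -> dict:
    """Hypothetical helper: build the Arguments dict for a Glue job run."""
    if mode not in BOOKMARK_OPTIONS:
        raise ValueError(f"unknown bookmark mode: {mode!r}")
    return {"--job-bookmark-option": BOOKMARK_OPTIONS[mode]}


if __name__ == "__main__":
    # A dict like this could be passed as Arguments when starting a job run.
    print(bookmark_arguments("enable"))
```

Inside the job script itself, bookmarks also rely on the job being initialized and committed and on each source having a `transformation_ctx`; see the documentation linked above for those details.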
V.V.I. Note:
-----------------
To identify which files stored on S3 to process, job bookmarks check the last modified time of the objects, not the file names. If your input objects changed since the last time the job ran, then they are reprocessed when the job runs again.
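To make the note above concrete, here is a deliberately simplified model of the behaviour it describes. This is not Glue's actual implementation (the real bookmark keeps more state than a single timestamp); the names `S3Object`, `select_unprocessed`, and `advance_bookmark` are made up for illustration. It only shows why an overwritten file with the same name is picked up again: selection depends on the last modified time, not the key.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class S3Object:
    """Hypothetical stand-in for an S3 object listing entry."""
    key: str
    last_modified: datetime


def select_unprocessed(objects: List[S3Object],
                       bookmark: Optional[datetime]) -> List[S3Object]:
    """Return objects modified after the bookmark (all of them on a first run)."""
    if bookmark is None:
        return list(objects)
    return [o for o in objects if o.last_modified > bookmark]


def advance_bookmark(processed: List[S3Object],
                     bookmark: Optional[datetime]) -> Optional[datetime]:
    """Move the bookmark to the newest modification time just processed."""
    if not processed:
        return bookmark
    return max(o.last_modified for o in processed)


def day(n: int) -> datetime:
    return datetime(2023, 1, n, tzinfo=timezone.utc)


# Run 1: two files exist, no bookmark yet.
listing = [S3Object("a.csv", day(1)), S3Object("b.csv", day(2))]
run1 = select_unprocessed(listing, None)
bookmark = advance_bookmark(run1, None)

# Before run 2: a.csv is overwritten (same name, newer timestamp), c.csv arrives.
listing = [S3Object("a.csv", day(3)), S3Object("b.csv", day(2)),
           S3Object("c.csv", day(4))]
run2 = select_unprocessed(listing, bookmark)

if __name__ == "__main__":
    print([o.key for o in run1])
    print([o.key for o in run2])
```

In this model the second run selects `a.csv` again along with `c.csv`, while the unchanged `b.csv` is skipped.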
Prerequisite:
------------------------
An automated data pipeline using Lambda, S3 and Glue - Big Data with Cloud Computing
• An automated data pipe...
How to Use AWS Glue with Snowflake | PySpark-Snowflake Connectivity
• How to Use AWS Glue wi...
Set up the necessary AWS services to query the data inside an Amazon S3 (Datalake) using AWS Athena
• Set up the necessary A...
Transform Data Using AWS Glue and Amazon Athena
• Transform Data Using A...
Schema Evolution in AWS Glue using Glue Crawler | AWS Athena
• Schema Evolution in AW...
Simplify Amazon DynamoDB data extraction and analysis by using AWS Glue and Amazon Athena
• Simplify Amazon Dynamo...
AWS Glue as Hive catalog
• Using the AWS Glue Dat...
A very frequent technical requirement in the big data domain:
You have to write a Spark DataFrame with KMS encryption. If you are using Glue, this is one approach you can try to improve the security of your pipeline by enabling server-side encryption.
• Security Configuration...
Incremental Glue crawling using Amazon S3 Event Notifications
• Incremental Glue crawl...
Check this playlist for more Data Engineering related videos:
• Demystifying Data Engi...

Comments: 16

@yashgangrade5460 2 months ago

I ran the Glue crawler but it's giving this error: HIVE_INVALID_METADATA: Hive metadata for table raw is invalid: Table descriptor contains duplicate columns.

@manojt7012 2 years ago

Ur consistency is just inspiring... Fan of ur contents 👌🏻

@KnowledgeAmplifier1 2 years ago

Thank you Manoj T for your continuous support ! Happy Learning :-)

@FRUXT 6 months ago

How does the job bookmark know what to increment? Do we need to tell it to track a specific column?

@basavapn6487 2 months ago

Can you please make a video for this requirement: I am getting files into an S3 bucket daily, and I want to process the last 90 days of data present in S3 using Glue.

@balasakiran 1 year ago

Nice explanations, crisp and clear. I have a quick question: over a period of time, say after 2 months, if there is a need to do a history load (process all files), how can this be achieved?

@tcsanimesh 1 year ago

Superb explanation!! However, I have one question. Say we enable bookmarks for an incremental load, but the load is weekly rather than daily. Will this concept still work in that case? I mean, does AWS Glue only read back a fixed duration from the bookmarked timestamp, or does it read all files after the last bookmarked timestamp?

@farookshaik7462 2 years ago

Really useful. Keep going..

@KnowledgeAmplifier1 2 years ago

Thank you Farook Shaik! Happy Learning :-)

@MatheusRibeiro-or2hq 1 year ago

Great Video!

@KnowledgeAmplifier1 1 year ago

Thank you Matheus Ribeiro! Happy Learning

@trinath89 1 year ago

Hi, great video. Thanks for taking the time to create it. Please share the link for the incremental data load from RDS. Thanks

@ravikreddy7470 1 year ago

What's the difference between incremental job bookmarking and incremental crawling?

@KnowledgeAmplifier1 1 year ago

Ravi K R, incremental crawls avoid recrawling the same data from source systems; the crawler picks up only new data and makes it available in the Glue Data Catalog for processing. AWS Glue job bookmarks avoid reprocessing old data. One helps with crawling incrementally, the other with processing incrementally. Hope this gives you some idea. For more details, you can refer to these links:
Incremental crawls in AWS Glue: docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
Tracking processed data using job bookmarks: docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
Happy Learning

@ravikreddy7470 1 year ago

@@KnowledgeAmplifier1 So crawling and processing are two different things?

@KnowledgeAmplifier1 1 year ago

@@ravikreddy7470 Yes, the crawler creates the metadata that allows Glue jobs and services such as Athena to view the S3 data as a database with tables and process it.