AWS Tutorials - Using Job Bookmarks in AWS Glue Jobs

11,946 views

3 years ago

The exercise URL - aws-dojo.com/excercises/excer...
AWS Glue uses job bookmarks to track which data has been processed, so that data handled in a previous job run is not processed again. Job bookmarks let AWS Glue keep state between runs and avoid reprocessing old data.
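A minimal sketch of switching bookmarks on for a single run, assuming a boto3 Glue client is passed in and using a hypothetical job name. The `--job-bookmark-option` job argument and its three values are the documented ones; the helper function itself is only an illustration.

```python
# Sketch only: glue_client is assumed to be a boto3 Glue client,
# e.g. boto3.client("glue"); "my-etl-job" in the usage note is hypothetical.

VALID_BOOKMARK_OPTIONS = {
    "job-bookmark-enable",
    "job-bookmark-disable",
    "job-bookmark-pause",
}


def start_run_with_bookmark(glue_client, job_name, option="job-bookmark-enable"):
    """Start a Glue job run with the given bookmark setting; return the run ID."""
    if option not in VALID_BOOKMARK_OPTIONS:
        raise ValueError(f"unknown bookmark option: {option}")
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--job-bookmark-option": option},
    )
    return response["JobRunId"]
```

With a real client this would be called as `start_run_with_bookmark(boto3.client("glue"), "my-etl-job")`. The same setting can also be chosen in the job's properties in the console.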

Comments: 50

@howards5205 9 months ago

This is a great video. The visualization helped a lot also. Thank you so much!

@pulakhazra5792 2 years ago

Very clear and helpful.

@AWSTutorialsOnline 2 years ago

Glad it was helpful!

@victorfeight9644 a year ago

Best explanation of maxBand I have heard.

@AWSTutorialsOnline a year ago

Thanks

@harishnttdata2325 3 years ago

Very useful video. A real time saver.

@AWSTutorialsOnline 3 years ago

Glad to hear that

@sivahanuman4466 a year ago

Excellent, sir. Very useful.

@AWSTutorialsOnline a year ago

Thanks and welcome

@veerachegu 2 years ago

Thank you so much. The explanation is very clear-cut.

@AWSTutorialsOnline 2 years ago

Welcome 😊

@VishalSharma-hv6ks 2 years ago

Hi Sir, thanks a lot for this wonderful video. I have a question. I am using AWS Glue for ETL, reading data every day from an Oracle RDBMS. In Oracle I have updates and deletes as well as inserts. You mentioned that we can do incremental reads using bookmarks, but what about the deletes and updates on the Oracle side? How can we handle this situation? Thank you in advance, sir.

@tiktok4372 2 years ago

Thank you for the video. I have a question: do job bookmarks work with DataFrames? Suppose I use glueContext.create_data_frame_from_catalog, apply some transformations to the DataFrame, and then write the DataFrame to an S3 bucket.

@AWSTutorialsOnline 2 years ago

Yes, it does.

@mylikeskeyan2055 a year ago

Please add a demo of JDBC with bookmarking for a table, showing only the daily updated records in the output.

@yusnardo 2 years ago

Can I run the workflow recursively? I use boundedSize in my Glue job, so I need to run the job multiple times every month until the bookmark has caught up.

@AWSTutorialsOnline 2 years ago

A job can start another instance of the same job from the job code, as long as concurrency allows. But it is not a true recursive call, so think about the exit condition when doing so.
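The reply above could be sketched roughly as follows. This assumes a boto3 Glue client is passed in, a hypothetical `--run_number` job argument to count the chained runs, and a hypothetical `max_runs` cap as the exit condition. `boundedSize` is the option named in the question; the sketch assumes it is passed as a string number of bytes in `additional_options` when reading from S3.

```python
# Sketch only: glue_client is assumed to be boto3.client("glue").
# "--run_number" and max_runs are hypothetical names used for illustration.

def bounded_read_options(max_bytes):
    """Build additional_options that cap how much data one run reads."""
    return {"boundedSize": str(max_bytes)}


def start_next_run_if_needed(glue_client, job_name, rows_read, run_number, max_runs=20):
    """Start another run of the same job, or return None when the chain should stop.

    Exit conditions: the current run read nothing (bookmark has caught up),
    or the safety cap on chained runs has been reached.
    """
    if rows_read == 0:
        return None
    if run_number >= max_runs:
        return None
    # The job's maximum concurrency must allow a second run to start
    # while this one is still finishing.
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--run_number": str(run_number + 1)},
    )
    return response["JobRunId"]
```

The cap matters because a chain with no stopping rule would keep launching runs even if the "nothing left to read" check never fires.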

@abdulhaseeb4980 2 years ago

Hi, I hope you are doing great. Currently I save entries for new files in SQS and then read them from Glue to process those files, but now I want to use the bookmark option. I'm using a Python shell job, and bookmarks are not supported there. I will move to a Spark job, but I will not use the Spark context in it. Can you please guide me on how to do this?

@AWSTutorialsOnline 2 years ago

In order to use job bookmarks, you have to program in a certain way using the Spark context. This link might help - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
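The "certain way" in the reply above comes down to three things in a Glue Spark script: call `job.init`, pass a `transformation_ctx` on each source you want tracked, and call `job.commit` at the end. A minimal sketch, assuming `job` and `glue_context` are the usual awsglue `Job` and `GlueContext` objects; the database and table names are hypothetical and supplied by the caller.

```python
# Sketch only. In a Glue Spark job the objects would be built roughly like:
#   import sys
#   from pyspark.context import SparkContext
#   from awsglue.context import GlueContext
#   from awsglue.job import Job
#   from awsglue.utils import getResolvedOptions
#   args = getResolvedOptions(sys.argv, ["JOB_NAME"])
#   glue_context = GlueContext(SparkContext.getOrCreate())
#   job = Job(glue_context)

def run_bookmarked_read(job, glue_context, args, database, table_name):
    """Read only data not yet processed, then commit the bookmark."""
    # job.init makes the stored bookmark state available to this run.
    job.init(args["JOB_NAME"], args)

    frame = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        # Bookmark state is tracked per transformation_ctx; without it,
        # this source is read in full on every run.
        transformation_ctx="source_" + table_name,
    )

    # ... transform and write the frame here ...

    # job.commit saves the new bookmark state for the next run.
    job.commit()
    return frame
```

If the script fails before `job.commit`, the bookmark is not advanced, so the same data is picked up again on the next run.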

@abir95571 24 days ago

How do job bookmarks scale on massive datasets?

@deepakbhutekar5450 a year ago

Sir, how do we handle updated records using job bookmarks? How does the job bookmark key identify that a given record has been updated? Once a particular record has been processed and bookmarked, if that record is later updated in the source table for some reason, how do we handle this with job bookmarks?

@creativeminds7397 2 years ago

Hello, your videos are simply superb 👌. I have PGP-encrypted files in S3 and I need to implement bookmarks. Can you tell me whether that will work? If not, is there another approach to follow?

@AWSTutorialsOnline 2 years ago

Hi, sorry, I have never worked with PGP files. Hard to say without testing.

@kumark3176 2 years ago

Hi Sir, thanks for sharing the information on bookmarks. I have a task to build bookmark functionality using PySpark, with the bookmark state kept in DynamoDB. I am new to big data frameworks, and we're moving from Glue bookmarking to our own custom code (written in PySpark or Java). Can you please suggest any material or sample code that I can use as a reference? We're trying to update based on lastUpdatedTime and DelayTime, as mentioned by you in this tutorial. Please reply and help me. Thank you.

@sukanyabanu6785 2 years ago

Hi, were you able to find a solution?

@user-gs5bl9jm9k 10 months ago

Hello, how can we reset the Glue job state?

@vishalrajmane7649 3 years ago

Do you have any video on incremental load in AWS Glue for newly inserted, updated, and deleted data from source to target?

@AWSTutorialsOnline 3 years ago

I don't have a video on this. But if you are ingesting data from a relational database, there are two methods that can work: 1) using a Lake Formation blueprint, or 2) using AWS Database Migration Service (DMS) to move data to S3. I have videos about blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel.

@vishalrajmane7649 3 years ago

Thanks for the help. I will check the options that you have suggested. 🙂

@tylerdurden8692 a year ago

When I try to specify multiple keys in jobBookmarkKeys, it does not work; it always takes only the primary key of the JDBC table. Also, when existing records are modified, they are not picked up, or everything is processed again. Is there anything I am missing here?

@AWSTutorialsOnline a year ago

You can use multiple keys as long as they are increasing or decreasing in value. Is that happening in the table?

@tylerdurden8692 a year ago

@AWSTutorialsOnline No. So you are saying the key field should be an auto-increment kind of field?

@AWSTutorialsOnline a year ago

@tylerdurden8692 Yes, increment or decrement. Please check this link, it has the rules for JDBC - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
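For the JDBC case discussed in this thread, the keys are passed through `additional_options` on the source read. A minimal sketch of building that dictionary; `jobBookmarkKeys` and `jobBookmarkKeysSortOrder` are the documented option names, while the helper and the column names in the usage note are hypothetical.

```python
# Sketch only: the helper name and any column names are illustrative.

def jdbc_bookmark_options(keys, sort_order="asc"):
    """Build additional_options for a JDBC source read with bookmark keys.

    The key columns together should only ever increase ("asc") or only
    ever decrease ("desc") as new rows arrive.
    """
    if not keys:
        raise ValueError("at least one bookmark key column is required")
    if sort_order not in ("asc", "desc"):
        raise ValueError("sort_order must be 'asc' or 'desc'")
    return {
        "jobBookmarkKeys": list(keys),
        "jobBookmarkKeysSortOrder": sort_order,
    }
```

The result would be passed as `additional_options=jdbc_bookmark_options(["order_id", "line_no"])` to `glueContext.create_dynamic_frame.from_catalog(...)`, together with a `transformation_ctx`. Since the bookmark only remembers how far the key has advanced, an update to an old row that leaves its key unchanged would not be picked up this way.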

@mohdshoeb5101 3 years ago

How can I manage a join across multiple tables with bookmarks? When joining the tables I don't have a unique key, so I concatenate multiple IDs to get one. I need to set a bookmark with multiple keys. Please tell me how we can do this.

@AWSTutorialsOnline 2 years ago

Apologies for the late response due to my summer break. Joining tables for a bookmark is not possible. You might want to create a Glue ETL job which merges these datasets and creates a primary key, then run bookmark-based processing on the merged dataset. Hope it helps.

@YogithaVenna 3 years ago

Where is the state information stored? Is it persisted in any data store? What happens behind the scenes?

@AWSTutorialsOnline 3 years ago

That information is not public, so I cannot say with confidence.

@selvaganesh2529 2 years ago

Hi, when I try to reset the bookmark I get "EntityNotFoundException, continuation for job not found". The source is S3, and I have not altered the transformation_ctx either. What might be causing the error?

@AWSTutorialsOnline 2 years ago

Not sure, I have never come across this error. Can you share more details about what you are doing, in a form I can use to reproduce it?

@selvaganesh2529 2 years ago

@AWSTutorialsOnline I fixed the issue. It was caused by the job_name I had passed as a parameter, which should not be passed according to the AWS documentation.
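For reference, resetting a bookmark from code can be sketched as below, assuming a boto3 Glue client is passed in. `reset_job_bookmark` is the boto3 call (the CLI equivalent is `aws glue reset-job-bookmark --job-name ...`); the helper and the job name in the usage note are hypothetical. The name given must be the job's name as registered in Glue.

```python
# Sketch only: glue_client is assumed to be boto3.client("glue").

def reset_bookmark(glue_client, job_name):
    """Reset a job's bookmark so the next run starts from the beginning.

    Returns the bookmark entry reported by Glue, or None if the response
    does not include one.
    """
    response = glue_client.reset_job_bookmark(JobName=job_name)
    return response.get("JobBookmarkEntry")
```

With a real client this would be called as `reset_bookmark(boto3.client("glue"), "my-etl-job")`. The console offers the same action on the job page.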

@deepakshrikanttamhane285 2 years ago

Hi Sir, it's very helpful, but how do I configure an S3 timestamp-based job bookmark instead of using a bookmark key?

@AWSTutorialsOnline 2 years ago

I think when you just enable job bookmarks without specifying any key, it uses the timestamp for the bookmark. Please check this link - docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

@deepakshrikanttamhane285 2 years ago

Great, it works.
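The timestamp behaviour for S3 sources can be pictured with a toy model: remember the newest last-modified time seen so far, and on the next run pick only objects modified after it. This is a conceptual illustration in plain Python, not Glue's internal implementation (which, as noted elsewhere in this thread, is not public); the function and the data passed to it are made up.

```python
# Toy model only: not how Glue stores or evaluates bookmarks internally.

def select_new_objects(objects, bookmark):
    """Pick objects modified after the bookmark.

    objects  -- iterable of (key, last_modified) pairs
    bookmark -- last_modified high-water mark from the previous run, or None
    Returns (sorted list of new keys, updated bookmark).
    """
    new = [(key, ts) for key, ts in objects if bookmark is None or ts > bookmark]
    new_keys = sorted(key for key, _ in new)
    # Keep the old bookmark when nothing new was found.
    new_bookmark = max([ts for _, ts in new], default=bookmark)
    return new_keys, new_bookmark
```

On a first run (bookmark `None`) every object is selected; on later runs only objects with a newer timestamp are, and a run that finds nothing new leaves the bookmark unchanged.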

@joseabzum3073 3 years ago

What if I want to delete a .csv? Can some process automatically delete the parquet file?

@AWSTutorialsOnline 3 years ago

You need to use the boto3 S3 API to delete the file. Please check this link - boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_object
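A minimal sketch of the call the reply points to, assuming a boto3 S3 client is passed in. `delete_object` with `Bucket` and `Key` is the boto3 API from the link; the helper and the bucket and key in the usage note are hypothetical.

```python
# Sketch only: s3_client is assumed to be boto3.client("s3").

def delete_output_file(s3_client, bucket, key):
    """Delete a single object from S3 and return the API response."""
    return s3_client.delete_object(Bucket=bucket, Key=key)
```

With a real client this would be called as `delete_output_file(boto3.client("s3"), "my-output-bucket", "parquet/part-0000.parquet")`. Working out which output file corresponds to a given input file is a separate problem that this call does not solve.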

@joseabzum3073 3 years ago

@AWSTutorialsOnline Hi, but how can I know which Parquet file belongs to a deleted .csv?

@vishalrajmane7649 3 years ago

If you have one, please send me the link.

@AWSTutorialsOnline 3 years ago

I don't have a video on incremental updates. But if you are ingesting data from a relational database, there are two methods that can work: 1) using a Lake Formation blueprint, or 2) using AWS Database Migration Service (DMS) to move data to S3. I have videos about blueprints and DMS, but they do not cover the incremental update scenario. You can check them on my channel and go through the AWS documentation to understand the incremental update part.