AWS Tutorials - Incremental Data Load from JDBC using AWS Glue Jobs

  11,105 views

AWS Tutorials

A day ago

AWS Glue Job Bookmark Tutorial - • AWS Tutorials - Using ...
AWS Glue and Lake Formation Tutorial - • AWS Tutorials - AWS Gl...
AWS Glue uses job bookmarks to track the processing of data and ensure that data processed in a previous job run does not get processed again. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. Learn to use job bookmarks with JDBC data sources in an ETL Glue job.
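Conceptually, a job bookmark stores the highest key value seen in the last successful run and filters subsequent reads against it. A minimal plain-Python sketch of that behavior (a simulation for illustration, not the actual Glue API; the `bookmark` value stands in for the state Glue persists between runs):

```python
def run_with_bookmark(rows, key, bookmark=None):
    """Return (new_rows, new_bookmark): only rows whose key value is
    greater than the persisted bookmark are processed this run."""
    new_rows = [r for r in rows if bookmark is None or r[key] > bookmark]
    if new_rows:
        bookmark = max(r[key] for r in new_rows)
    return new_rows, bookmark

# First run: no bookmark yet, so everything is new.
table = [{"id": 1}, {"id": 2}, {"id": 3}]
processed, bm = run_with_bookmark(table, "id")      # bm == 3

# Second run: two rows were inserted since; only they are processed.
table += [{"id": 4}, {"id": 5}]
delta, bm = run_with_bookmark(table, "id", bm)      # delta has ids 4 and 5
```

In a real Glue job this filtering happens automatically once bookmarks are enabled and the source read carries a `transformation_ctx`; the sketch only shows why already-processed rows are skipped.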

Пікірлер: 22
@gulo101 · a year ago
Great video, thank you! A couple of questions: I will be using the data I copy from the JDBC DB to S3 for staging before it is moved to Snowflake. After I move it to Snowflake, is it safe to delete it from the S3 bucket without any negative impact on the bookmark progress? Also, is there any way to see the current value of the bookmark, or to change it manually in case of load issues? Thank you.
@jnana1985 · 10 months ago
Is it only for inserting new records, or does it also work with updated and deleted records?
@federicocremer7677 · a year ago
Excellent tutorial and great explanation. Thank you, you got my sub! Just to be sure: I have an "updated_at" field in my schema, and in my data source (say, a JDBC - Postgres instance) rows are updated daily rather than new rows being inserted. Will those updated rows be caught by a new job run with the bookmark enabled? If so, do I have to add not only my "id" field but also my "updated_at" field to jobBookmarkKeys?
@AWSTutorialsOnline · a year ago
You can use key(s) for the job bookmark as long as they meet certain requirements. Here are the rules: For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data, and the bookmark keys combine to form a single compound key. You can specify the columns to use as bookmark keys. If you don't specify bookmark keys, AWS Glue by default uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps). If user-defined bookmark keys are used, they must be strictly monotonically increasing or decreasing; gaps are permitted. AWS Glue doesn't support using case-sensitive columns as job bookmark keys.
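The compound-key rules above can be sketched in plain Python (illustrative only; column names are hypothetical): the bookmark keys combine into one tuple, the combined key must be strictly increasing or decreasing, and new rows are those whose compound key exceeds the last stored value.

```python
def compound_key(row, keys):
    # Bookmark keys combine to form a single compound key (a tuple).
    return tuple(row[k] for k in keys)

def check_monotonic(rows, keys):
    # User-defined bookmark keys must be strictly monotonically
    # increasing or decreasing; gaps are permitted.
    vals = [compound_key(r, keys) for r in rows]
    increasing = all(a < b for a, b in zip(vals, vals[1:]))
    decreasing = all(a > b for a, b in zip(vals, vals[1:]))
    if not (increasing or decreasing):
        raise ValueError("bookmark keys are not strictly monotonic")

rows = [{"id": 1, "batch": 10}, {"id": 2, "batch": 10}, {"id": 4, "batch": 11}]
check_monotonic(rows, ["batch", "id"])     # gaps (id 3 missing) are fine
last = compound_key(rows[-1], ["batch", "id"])
new = [r for r in rows + [{"id": 5, "batch": 12}]
       if compound_key(r, ["batch", "id"]) > last]   # only the id-5 row is new
```

This is why an "updated_at"-style column alone is a poor bookmark key unless it strictly increases across runs: a value that moves backwards breaks the monotonicity requirement.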
@brockador_93 · 4 months ago
Hello, how are you? One question: I created a job with a bookmark based on the primary key of the source table. When I update an already-processed record, the change is not reflected in the destination file. How can I make the job detect that this record changed? For example, the table key is the ID field, and the changed field was the "name" field.
@fredygerman_ · 8 months ago
Great video, but can you show an example where you connect to an external database using a JDBC connection, e.g. a database from Supabase?
@manishchaturvedi7908 · 5 months ago
Please add a video that leverages a timestamp column in the source table to incrementally load data.
@rajatpathak4499 · a year ago
Great tutorial, keep bringing us more videos on real-time scenarios. If you can, please cover a video on Glue workflows, which could include a source, then a Lambda invocation that triggers a Glue job for cataloging, then another trigger for transformation, and after that an insert into a DB, which would then trigger a Lambda for archiving.
@AWSTutorialsOnline · a year ago
Please check my video on event-based pipelines. I have explained there what you are talking about.
@susilpadhy9553 · a year ago
Please make a video on how to handle incremental load using a timestamp column; that would be really helpful. Thanks in advance. I have watched so many of your videos and they really help.
@hitesh1907 · 8 months ago
Please create one.
@basavapn6487 · 3 months ago
Can you please make a video on delta files to achieve SCD type 1? In this scenario it was a full file, but I want to process incremental files.
@canye1662 · a year ago
awesome vid...100%
@AWSTutorialsOnline · a year ago
Glad you enjoyed it
@user-on5zy2gc2u · a year ago
Great content. I'm facing an issue while loading data from MS SQL into Redshift using Glue. The scenario is: I have multiple tables about customers, with customer ID as the primary key. When we update any phone number or address for a customer ID, I have to write it into Redshift as a new row, and if any new entry comes it should also be inserted as a new row. Is there any solution for this?
@AWSTutorialsOnline · a year ago
You can create a job that filters data from RDS based on the last run datetime and picks records whose created/modified date is greater than the last run datetime, then inserts the picked records into the target database.
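That last-run-datetime approach can be sketched in plain Python (the `modified_at` column name is hypothetical; in a Glue job the same filter would typically be pushed down as a SQL predicate or applied to the DataFrame after the read):

```python
from datetime import datetime

def pick_incremental(records, last_run):
    """Select rows created or modified after the last successful run,
    and return them with the new watermark to persist for the next run."""
    picked = [r for r in records if r["modified_at"] > last_run]
    new_last_run = max((r["modified_at"] for r in picked), default=last_run)
    return picked, new_last_run

records = [
    {"id": 1, "modified_at": datetime(2023, 5, 1, 8, 0)},
    {"id": 2, "modified_at": datetime(2023, 5, 2, 9, 30)},
]
picked, watermark = pick_incremental(records, datetime(2023, 5, 1, 12, 0))
# picked contains only id 2; the watermark advances to its modified_at
```

Persisting the watermark (e.g. in a parameter store or a small control table) is what replaces the bookmark state when you roll your own incremental logic on a timestamp column.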
@tcsanimesh · a year ago
Beautiful video!! Can you please add a use case for update and delete as well?
@AWSTutorialsOnline · a year ago
In a data lake, you generally do not perform updates and deletes; you only insert. But if you want CRUD operations, you should consider using Iceberg, Hudi, or Delta Lake on S3.
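The difference can be illustrated with a toy key-based merge (upsert) in plain Python; table formats like Iceberg, Hudi, and Delta Lake provide this kind of MERGE semantics on S3, which plain append-only writes cannot:

```python
def merge(target, changes):
    """Apply insert/update/delete changes to a table keyed by 'id' —
    a toy stand-in for the MERGE that Iceberg/Hudi/Delta provide."""
    table = {row["id"]: row for row in target}
    for change in changes:
        if change.get("op") == "delete":
            table.pop(change["id"], None)
        else:                         # insert or update (upsert)
            table[change["id"]] = {k: v for k, v in change.items() if k != "op"}
    return sorted(table.values(), key=lambda r: r["id"])

target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
changes = [{"id": 2, "name": "b2"},            # update existing row
           {"id": 3, "name": "c"},             # insert new row
           {"id": 1, "op": "delete"}]          # delete existing row
result = merge(target, changes)
# result: [{"id": 2, "name": "b2"}, {"id": 3, "name": "c"}]
```

With an append-only data lake you would instead land all three change records as new rows and resolve the latest state at query time; the table formats do this resolution for you at write time.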
@shrishark · 7 months ago
What is the best approach to read a huge volume of data from any on-prem SQL DB, identify sensitive data, replace it with fake data, and push it to an AWS S3 bucket for specific criteria?
@victoriwuoha3081 · 4 months ago
Redact the data using KMS during processing, before storage.
@helovesdata8483 · a year ago
I can't get my JDBC data source to connect with Glue. The only error I get is "test connection failed".
@AWSTutorialsOnline · a year ago
A test connection can fail for many reasons: 1) not using the right VPC, subnet, and security group associated with the JDBC source; 2) the security group is not configured with the right rules; 3) not having VPC endpoints (an S3 gateway endpoint and a Glue interface endpoint) in the VPC of the JDBC source.