ETL | Incremental Data Load from Amazon S3 Bucket to Amazon Redshift Using AWS Glue | Datawarehouse

23,611 views

Cloud Quick Labs

6 months ago

===================================================================
1. SUBSCRIBE FOR MORE LEARNING :
/ @cloudquicklabs
===================================================================
2. CLOUD QUICK LABS - CHANNEL MEMBERSHIP FOR MORE BENEFITS :
/ @cloudquicklabs
===================================================================
3. BUY ME A COFFEE AS A TOKEN OF APPRECIATION :
www.buymeacoffee.com/cloudqui...
===================================================================
Explore the intricacies of efficient data synchronization between Amazon S3 and Amazon Redshift using AWS Glue in this comprehensive tutorial. Learn the art of incremental data loading, allowing you to seamlessly update Redshift tables with new or modified data sourced from your S3 bucket.
This tutorial covers the entire process, from setting up your AWS environment and configuring Amazon Redshift for data ingestion to crafting and executing AWS Glue jobs tailored for incremental data loading. Discover best practices for script customization, job scheduling, and monitoring to ensure smooth data flow and integrity within Amazon Redshift.
Whether you're a seasoned AWS user or just beginning your journey in data management, this guide equips you with the skills needed to automate and optimize data workflows within the AWS ecosystem. Subscribe for more tutorials and expert insights on leveraging AWS services for efficient data handling.
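For orientation before the walkthrough, here is a minimal sketch of what such a Glue (PySpark) job can look like: it reads the catalog table built by the crawler and writes to Redshift through a staging table plus a MERGE postaction, so re-running the job upserts new or changed rows instead of duplicating them. All database, table, column, connection, and bucket names below are placeholders, not the exact ones used in the video.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the Data Catalog table the crawler created from the S3 CSV files
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="s3_input_db",        # placeholder catalog database
    table_name="input_csv",        # placeholder catalog table
    transformation_ctx="source_dyf",
)

# Load into a staging table, then MERGE into the target table so only
# new or changed rows end up in the final table on every run.
pre_sql = (
    "CREATE TABLE IF NOT EXISTS public.employee_stage (LIKE public.employee); "
    "TRUNCATE public.employee_stage;"
)
post_sql = (
    "MERGE INTO public.employee USING public.employee_stage s "
    "ON public.employee.id = s.id "
    "WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city "
    "WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.city); "
    "DROP TABLE public.employee_stage;"
)

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source_dyf,
    catalog_connection="redshift-connection",       # placeholder Glue connection
    connection_options={
        "database": "dev",
        "dbtable": "public.employee_stage",
        "preactions": pre_sql,
        "postactions": post_sql,
    },
    redshift_tmp_dir="s3://my-glue-temp-bucket/redshift/",  # placeholder temp path
    transformation_ctx="redshift_sink",
)

job.commit()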
Repo link : github.com/RekhuGopal/PythonH...
#s3 #redshift #glue #etl #aws #awsglue #amazons3 #amazon #amazonredshift #dataengineering #dataloading #technology #dataanalysis #coding #programming #aws #cloudcomputing #datamanagement #analytics #databases #etl #incrementalloading #automation #awservices #tutorial #learning #informationtechnology #computerscience #development #techhelp #awscloud #bigdata #dataengineering

Comments: 82
@akshaymuktiramrodge3233
@akshaymuktiramrodge3233 6 months ago
This is what I wanted.... Thank u so much 🙏
@cloudquicklabs
@cloudquicklabs 6 months ago
Thank you for watching my videos. Glad that it helped you.
@kalyanishende618
@kalyanishende618 5 months ago
Thank you so much, this was a good revision and gave me good ideas.
@cloudquicklabs
@cloudquicklabs 5 months ago
Thank you for watching my videos. Glad that it helped you.
@user-sz8hw4dr1v
@user-sz8hw4dr1v 5 months ago
This video helped a lot. Is there a way you can add the trust relationships used during IAM role creation to the repo? Thank you
@cloudquicklabs
@cloudquicklabs 5 months ago
Thank you for watching my videos. Did you see the dedicated video here: kzfaq.info/get/bejne/qd1dlJufyt64iWw.html
@awaise92
@awaise92 2 months ago
Great content! I have a question on this: can we run the MERGE operation against an external database as well, like Oracle, Snowflake, etc.?
@cloudquicklabs
@cloudquicklabs 2 months ago
Thank you for watching my videos. Yes, we can do this. I shall create a new video on this topic soon.
@prabhathkota107
@prabhathkota107 4 months ago
Very useful, thanks. Subscribed now for more interesting content.
@cloudquicklabs
@cloudquicklabs 4 months ago
Thank you for watching my videos. Happy learning.
@prabhathkota107
@prabhathkota107 1 month ago
@@cloudquicklabs Some issue with glueContext.write_dynamic_frame.from_catalog, whereas glueContext.write_dynamic_frame.from_jdbc_conf works perfectly fine. Getting the error below while writing to the Redshift catalog table: "Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o130.pyWriteDynamicFrame. Exception thrown in awaitResult: SQLException thrown while running COPY query; will attempt to retrieve more information by querying the STL_LOAD_ERRORS table". Could you please guide?
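When that COPY step fails, the underlying cause usually lands in Redshift's load-error system table. Below is a rough sketch of pulling the latest entries with the Redshift Data API; the workgroup and database names are assumptions, and on Redshift Serverless the equivalent view is SYS_LOAD_ERROR_DETAIL rather than STL_LOAD_ERRORS.

import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# Placeholder workgroup/database; use ClusterIdentifier=... instead for a provisioned cluster.
resp = client.execute_statement(
    WorkgroupName="my-redshift-workgroup",
    Database="dev",
    Sql=(
        "SELECT query, filename, line_number, colname, err_reason "
        "FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;"
    ),
)

# Wait for the statement to finish, then print whatever rows came back.
status = None
while status not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
    status = client.describe_statement(Id=resp["Id"])["Status"]

if status == "FINISHED":
    for record in client.get_statement_result(Id=resp["Id"])["Records"]:
        print(record)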
@nitikjain993
@nitikjain993 5 months ago
Much-awaited video, thank you so much. Could you attach the code that was generated after the Glue job configuration?
@cloudquicklabs
@cloudquicklabs 5 months ago
Thank you for watching my videos. Glad that the video is useful to you. I have not copied the Glue job code, but I have attached the relevant files in the description; please use those and follow the same steps and it should work for you.
@tejpatta3641
@tejpatta3641 1 month ago
Thank you, very useful, great video 👍 Small query: why is the table in Redshift not in the same order as it is in the CSV file?
@cloudquicklabs
@cloudquicklabs 1 month ago
Thank you for watching my videos. Glad that it helped you. If you mean the column order of the table, then I would say it does not matter, as long as the right records are getting updated in the respective columns.
@saikrishna8338
@saikrishna8338 2 months ago
Thanks for the valuable inputs. Small query: what if my incremental file lands in a different folder in the same bucket? How is the crawler going to handle that?
@cloudquicklabs
@cloudquicklabs 1 month ago
Thank you for watching my videos. The crawler would still work (it scans all data underneath the defined folder).
@snakhil90
@snakhil90 4 days ago
In case of SCD, how can we define the SCD logic for the merge and load? Which option provides this?
@zzzzzzzmr9759
@zzzzzzzmr9759 4 months ago
Great video! I have two questions: 1. Why is the table in Redshift not in the same order as it is in the CSV file? 2. Why does the ETL job configuration choose the Data Catalog table as the S3 source type instead of the S3 location? Does that mean we could complete the incremental data load from S3 to Redshift by just choosing the S3 location and not using the crawler? Thanks in advance.
@cloudquicklabs
@cloudquicklabs 4 months ago
Thank you for watching my videos. 1. It's purely the behaviour of the ETL job, which extracts the data from the S3 bucket. 2. We can use the S3 bucket directly as the source when you have a single CSV file, but when the bucket has multiple CSV files it's better to use the Data Catalog so that you can map the schema from source to destination.
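To make the two source options from that answer concrete, here is a rough sketch of both reads inside a Glue script; it assumes the glueContext set up in the job sketch above, and the database, table, and bucket names are placeholders.

# Option A: read via the Data Catalog table that the crawler created.
# The schema comes from the catalog, which makes column mapping easier.
catalog_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="s3_input_db",     # placeholder database
    table_name="input_csv",     # placeholder table
)

# Option B: read the CSV files straight from S3 without a crawler;
# the schema is then inferred from the files at run time.
s3_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-input-bucket/data/"]},  # placeholder path
    format="csv",
    format_options={"withHeader": True},
)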
@rahulsood81
@rahulsood81 2 months ago
Can you please explain why I need to run the crawler again if there are no changes to the file location or the fields/structure of the source (S3 file)?
@cloudquicklabs
@cloudquicklabs 1 month ago
Thank you for watching my videos. Indeed, you don't need to re-run it if all files and folders have already been crawled.
@nitikjain993
@nitikjain993 5 months ago
Could you please make a video on how to build fact and dimension tables in Redshift? It would be great if you made that one as S3 to Redshift using Glue as well.
@cloudquicklabs
@cloudquicklabs 5 months ago
Thank you for watching my videos. I shall work on this concept and create a video.
@thihakyaw6189
@thihakyaw6189 5 months ago
I want to know why row data shows up when you choose the Data Catalog table as the S3 source in your Glue job, because I thought the crawler only copies metadata from the S3 CSV to the Data Catalog, not the data, right? When I try the same in my case, it says there is no data to display because only schema information is updated in the Data Catalog table. Please let me know.
@cloudquicklabs
@cloudquicklabs 5 months ago
Thank you for watching my videos. It's a default preview from Glue, which displays some sample data when you are mapping tables. In your case you must be missing one step. Please watch the video again and follow the steps as shown.
@jnana1985
@jnana1985 2 months ago
For the incremental load to work, do we need to enable the job bookmark in Glue? Or is it not required?
@cloudquicklabs
@cloudquicklabs 2 months ago
Thank you for watching my videos. It's not required; what's important is choosing the right option while declaring the destination in the ETL job. The configuration shown should be fair enough here.
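If you do want bookmark-based incremental reads instead of (or in addition to) the MERGE on the destination, the rough pattern is sketched below: enable job bookmarks in the job properties, tag the source with a transformation_ctx, and commit the job so the bookmark state is saved. Names are placeholders, continuing the earlier job sketch.

# With job bookmarks enabled in the job properties, the transformation_ctx
# is the key Glue uses to remember which S3 objects were already processed.
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="s3_input_db",          # placeholder
    table_name="input_csv",          # placeholder
    transformation_ctx="bookmarked_source",
)

# ... transform and write to Redshift as before ...

job.commit()  # without commit, the bookmark never advances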
@alphamaryfrancis862
@alphamaryfrancis862 4 months ago
Amazing content!!! When I'm trying to read data from S3 in AWS Glue it gives me an error. Can you please guide?
@cloudquicklabs
@cloudquicklabs 4 months ago
Thank you for watching my videos. Could you please tell me what error you are getting?
@hoangng16
@hoangng16 5 months ago
I have multiple large tables in CSV format in AWS S3. Should I: 1. load them from S3 to RDS (MySQL) => do my queries using MySQL Workbench => export the expected data to S3 => sync the expected data to SageMaker for visualization and other analysis using a Notebook instance? Or 2. load them from S3 to Redshift => do the queries using Redshift? Actually I'm not quite sure what to do next in this direction; the goal is to have some filtered data for visualization and analysis. Thank you.
@cloudquicklabs
@cloudquicklabs 5 months ago
Thank you for watching my videos. Is your requirement here just to visualize the data, or do you want to run ML on it (since you mentioned SageMaker, which is an ML tool and costly)? You could check Amazon QuickSight if you are looking for visualization.
@hoangng16
@hoangng16 5 months ago
Thank you, @@cloudquicklabs. I actually want to do some ML, but that part can be done on a local machine. The primary goal now is to load data from S3 into something I can query to analyze the data better.
@user-cj3oy5ki3g
@user-cj3oy5ki3g 4 months ago
Hello, please, do you hold classes anywhere, and do you provide project support?
@cloudquicklabs
@cloudquicklabs 4 months ago
Thank you for watching my videos. I don't provide any classes, but I do provide project support and project work.
@user-cj3oy5ki3g
@user-cj3oy5ki3g 4 months ago
@@cloudquicklabs How can I go about getting project help?
@nlopedebarrios
@nlopedebarrios 5 months ago
Why did you run the crawler after the Glue job finished and before running it for the second time? Is that required for the MERGE to succeed?
@cloudquicklabs
@cloudquicklabs 4 months ago
Thank you for watching my videos. I wanted to show that when your source data grows, the ETL pipeline copies only the incremental data, not duplicates.
@nlopedebarrios
@nlopedebarrios 4 months ago
@@cloudquicklabs Yes, I understand that, but before running the job for the second time, to illustrate that the merge works, you ran the crawler. Is that needed/recommended?
@basavarajpn4801
@basavarajpn4801 4 months ago
It's to identify any change in the schema, so we run the crawler every time before running the job to pick up the latest changes from the source.
@cloudquicklabs
@cloudquicklabs 4 months ago
Indeed, it's needed to load the data from the file. Note it's the same file in the same path.
@gauravrai4398
@gauravrai4398 3 months ago
But it is a data change, not a schema change... We can verify that by running an Athena query on the catalog table... Anyway, nice use-case explanation.
@RajYadav-eb6pp
@RajYadav-eb6pp 25 days ago
I have two questions: 1. Why have you run the crawler twice? 2. If files arrive continuously (with different names), then how can we use the Glue job for incremental load?
@cloudquicklabs
@cloudquicklabs 25 days ago
Thank you for watching my videos. I ran it twice to demo the incremental load on the second run. And when the file names are continuously different, you would depend on the folder path of the files; also note that at least the columns should be the same, and in the job you need to map them to the destination target tables.
@mahir14_
@mahir14_ 1 month ago
Can you make a whole use case of OracleDB (source with CDC) --> S3 --> Glue (some transformation) --> Redshift warehouse, please?
@cloudquicklabs
@cloudquicklabs 1 month ago
Thank you for watching my videos. Indeed, I shall add this to my to-do list and make a video on it.
@ranidalvi1064
@ranidalvi1064 28 days ago
Yes, I am also waiting for this type of project with e-commerce data.
@kakaji_cloud8106
@kakaji_cloud8106 2 months ago
What if the primary key is not in increasing or decreasing order in the data? Will the incremental data load still work?
@cloudquicklabs
@cloudquicklabs 2 months ago
Thank you for watching my videos. Indeed it should load, as the primary key is unique in the tables and records are identified by it.
@lipi1004
@lipi1004 6 months ago
Is incremental loading available with SQL Server and Postgres as the target?
@cloudquicklabs
@cloudquicklabs 6 months ago
Thank you for watching my videos. I shall create a new video on incremental loading from RDS (SQL Server/Postgres) to Amazon Redshift. Meanwhile, there is still a workaround: first load to an S3 bucket, and then load to Amazon Redshift.
@rahulpanda9256
@rahulpanda9256 4 months ago
@@cloudquicklabs She is referring to RDS as the target, loading from S3.
@archanvyas4891
@archanvyas4891 25 days ago
Nice video. I have a question: I am sending data from a Raspberry Pi to S3, and it updates whenever I run it on the Raspberry Pi. Now, after my job succeeded, when I update that S3 data and run the same job again I get an error. So what's the process?
@cloudquicklabs
@cloudquicklabs 25 days ago
Thank you for watching my videos. You should not get any error when you run it twice. Did you check the logs for the error message?
@archanvyas4891
@archanvyas4891 19 days ago
@@cloudquicklabs How do I check the logs for it? In the CloudWatch log groups, when I look at the live tail it says it is 100% displayed. How do I resolve it?
@rahulpanda9256
@rahulpanda9256 4 months ago
How do we ensure the primary key sequence stays intact with the source?
@cloudquicklabs
@cloudquicklabs 4 months ago
Thank you for watching my videos. I believe you need to use Data Quality rules here to check whether your primary keys are in sequence. Maybe watch my latest video here kzfaq.info/get/bejne/erOBhamox8u9fZc.htmlsi=mu-t_cNUIvXzHIXv, which might give you ideas.
@RupeshKumar-kw7zw
@RupeshKumar-kw7zw 6 months ago
Hi, I'm getting an invalid input exception error; can you please help resolve it?
@cloudquicklabs
@cloudquicklabs 6 months ago
Thank you for watching my videos. Could you please watch the video again?
@GaneshBabu-vr2lg
@GaneshBabu-vr2lg 10 days ago
My question is: why didn't you schedule the job in this video? The incremental data can land in S3 at any time, so the job could be scheduled (or triggered) to run automatically whenever data is uploaded. And since you ran it manually, how many times do you have to run the crawler and the job?
@cloudquicklabs
@cloudquicklabs 10 days ago
Thank you for watching my videos. Indeed, it should be scheduled; I have mentioned that in the video. As this is a demo, I showed it with a manual trigger. I am working on v3 of this video, where I cover the missing points.
@GaneshBabu-vr2lg
@GaneshBabu-vr2lg 10 days ago
@@cloudquicklabs OK, is this uploaded yet?
@cloudquicklabs
@cloudquicklabs 10 days ago
Not yet, I am still working on it.
@yugalbawankar9039
@yugalbawankar9039 6 months ago
Which IAM role is given to the Redshift workgroup?
@cloudquicklabs
@cloudquicklabs 6 months ago
Thank you for watching my videos. I am using admin permissions as it is a demo. Did you check the reference documents shared at: github.com/RekhuGopal/PythonHacks/tree/main/AWS_ETL_Increamental_Load_S3_to_RedShift
@yugalbawankar9039
@yugalbawankar9039 5 months ago
@@cloudquicklabs Which IAM role is given to the Redshift workgroup? Please create and upload a basic video on it. I want to build this project but don't understand which IAM role to give the Redshift workgroup.
@kalyanishende618
@kalyanishende618 5 months ago
How much cost is expected if I try this with my personal AWS account?
@cloudquicklabs
@cloudquicklabs 5 months ago
Thank you for watching my videos. All the resources involved in this solution are costly. Please make use of the AWS pricing calculators to estimate your cost here. As this was a lab session for me, I set it up and cleaned it up once the lab was completed, so it cost me very little.
@user-ft5ow9mb5z
@user-ft5ow9mb5z 2 months ago
How do we load incremental data using a Python script?
@cloudquicklabs
@cloudquicklabs 2 months ago
Thank you for watching my videos. I shall do one video using Python scripts for the #ETL pipeline.
@supriyakulkarni8378
@supriyakulkarni8378 3 months ago
Tried the Redshift test connection, but it failed with this error: "ERROR StatusLogger Unrecognized format specifier [d]"
@cloudquicklabs
@cloudquicklabs 3 months ago
Thank you for watching my videos. Are you getting the error while creating the connection or while executing the ETL pipeline?
@supriyakulkarni8378
@supriyakulkarni8378 3 months ago
While testing the connection to Redshift.
@swapnil_jadhav1
@swapnil_jadhav1 2 months ago
I'm getting "Target node is not supported". What should I do?
@cloudquicklabs
@cloudquicklabs 2 months ago
Did you follow the video here?
@swapnil_jadhav1
@swapnil_jadhav1 2 months ago
@@cloudquicklabs yes
@swapnil_jadhav1
@swapnil_jadhav1 2 months ago
@@cloudquicklabs I am using DMS to transfer data from RDS MySQL to S3. Then, using Glue, I am transferring the data from S3 to Redshift, and in Glue I am getting the error.
@user-ft5ow9mb5z
@user-ft5ow9mb5z 1 month ago
How can I contact you?
@cloudquicklabs
@cloudquicklabs 1 month ago
You can reach me over email: vrchinnarathod@gmail.com
@saikrishna8338
@saikrishna8338 2 months ago
Thanks for the valuable inputs. Small query: what if my incremental file lands in a different folder in the same bucket? How is the crawler going to handle that?
@cloudquicklabs
@cloudquicklabs 1 month ago
Thank you for watching my videos. While defining the crawler you give it a path, so choose the right path accordingly, and the crawler will scan all data from all files and folders underneath the defined folder.
@saikrishna8338
@saikrishna8338 1 month ago
@@cloudquicklabs Thanks for the reply. What if my folder structure is like this: input_bucket/year={current_year}/month={current_month}/day={current_date}/file.txt. How can I define my crawler to pick up files based on the date and load data incrementally, not as a full refresh? Any idea?
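One way to handle date-partitioned prefixes like that (a sketch under assumptions, not something shown in the video): point the crawler at the prefix above the year=/month=/day= folders so they become Hive-style partitions of a single catalog table, and set the recrawl policy to crawl only new folders. The crawler name, role ARN, and database below are placeholders.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="incremental-input-crawler",                       # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="s3_input_db",                             # placeholder database
    # Crawl the level above year=/month=/day= so they become table partitions.
    Targets={"S3Targets": [{"Path": "s3://input_bucket/"}]},
    # Only visit folders added since the last run instead of a full re-scan.
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    # Required with CRAWL_NEW_FOLDERS_ONLY: schema changes are only logged.
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)

glue.start_crawler(Name="incremental-input-crawler")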