ETL | AWS Glue | AWS S3 | Load Data from AWS S3 to Amazon Redshift

77,424 views

Cloud Quick Labs

2 years ago

===================================================================
1. SUBSCRIBE FOR MORE LEARNING:
/ @cloudquicklabs
===================================================================
2. CLOUD QUICK LABS - CHANNEL MEMBERSHIP FOR MORE BENEFITS:
/ @cloudquicklabs
===================================================================
3. BUY ME A COFFEE AS A TOKEN OF APPRECIATION:
www.buymeacoffee.com/cloudqui...
===================================================================
This video shows a complete end-to-end demo of ETL with the AWS Glue service, where data is loaded from an AWS S3 bucket into Amazon Redshift. It includes a clean architectural explanation with a part-by-part walkthrough, plus a demo of configuring the complete automation setup.
Please find the relevant files in the repo: github.com/RekhuGopal/PythonHacks/tree/main/AWSBoto3Hacks/AWS-ETL-S3-Glue-RedShift

Comments: 129
@funforkids9927 1 year ago
This is a very useful video with a deep level of understanding. Thank you so much for this kind of video. Looking forward to upcoming videos.
@cloudquicklabs 1 year ago
Thank you for watching my videos, and thank you very much for the encouraging words. Keep watching and keep learning from my videos.
@olorunfemirichard5976 2 years ago
Great presentation. You speak exactly like someone I met at Generation & I.
@cloudquicklabs 2 years ago
Thank you for watching my videos.
@senthilkumar5129 2 years ago
Clean explanation. Thank you, keep uploading more.
@cloudquicklabs 2 years ago
Thank you for watching my videos. Keep watching, lots of videos are coming ahead.
@behnamvakili1638 2 years ago
It was a great demo. Thank you.
@cloudquicklabs 2 years ago
You are welcome. Thank you for watching my videos.
@nikhilrao1266 2 years ago
This is awesome. I subscribed. Hope to see more videos on AWS.
@cloudquicklabs 2 years ago
Thank you very much for the encouraging words and for watching my videos. Looking forward to contributing more value to the AWS community.
@sriharsharompicherla9704 2 years ago
Awesome session, very helpful 💐
@cloudquicklabs 2 years ago
Thank you for watching my videos. Please do share it with your friends as well.
@90daner 8 months ago
A wonderful video! Thanks, man!
@cloudquicklabs 8 months ago
Thank you for watching my videos. Glad that it helped you.
@canye1662 2 years ago
Nice video and explanation.
@cloudquicklabs 2 years ago
Thank you for watching my videos.
@DanielWeikert 2 years ago
Hi, when exactly do I need to set up a route table and NAT gateway for using Glue? Can you elaborate on that? Can I also use Glue and crawlers without the route table as long as the roles are correct? Thanks
@cloudquicklabs 2 years ago
Thank you for watching my videos. Yes, please make sure that all traffic is allowed from the internet, then create the Glue crawlers and jobs; it works as expected.
@artijadon9734 1 year ago
Thanks for the demo. Have you created the (Setting the ETL Infrastructure using Terraform) demo for the same?
@cloudquicklabs 1 year ago
Thank you for watching my videos. I have set up the infrastructure manually here, but it's a good idea to manage this infrastructure with Infrastructure as Code, i.e. Terraform. I shall keep this in mind for my next video.
@gilbertyoungjr4898 2 years ago
What if you have multiple tables? Do you need to create a Redshift crawler for each table?
@cloudquicklabs 2 years ago
Thank you for watching my videos. With the current AWS Glue crawler capability, you need a separate crawler include path for each table when you have multiple tables.
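A minimal boto3 sketch of that setup (all names below are assumptions, not from the video): one crawler definition can carry one JDBC include path per Redshift table, so the paths scale with the tables rather than the crawlers.

```python
import boto3

glue = boto3.client("glue")

# One crawler, one JDBC include path per Redshift table (names are placeholders).
glue.create_crawler(
    Name="redshift-tables-crawler",
    Role="ETLlambdaAccessRole",  # role name used in the video
    DatabaseName="redshift_catalog_db",
    Targets={
        "JdbcTargets": [
            {"ConnectionName": "redshift-connection", "Path": "dev/public/table1"},
            {"ConnectionName": "redshift-connection", "Path": "dev/public/table2"},
        ]
    },
)
glue.start_crawler(Name="redshift-tables-crawler")
```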
@amanchoudhary9518 1 year ago
Could you please explain the creation of the S3 endpoint in detail, and what policy is attached while creating the S3 endpoint?
@cloudquicklabs 1 year ago
Thank you for watching my videos. The S3 endpoint here is a VPC endpoint for S3; it is needed so that the connection to Redshift stays securely on the AWS backbone network inside the VPC. You need VPC and S3 bucket access to create it.
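As a rough illustration, creating that gateway endpoint with boto3 might look like the sketch below (region, VPC and route table IDs are placeholders). By default the endpoint attaches a full-access policy; it can be restricted via the PolicyDocument parameter.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Gateway endpoint: S3 traffic from the VPC stays on the AWS backbone network.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",            # assumed VPC ID
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],  # assumed route table ID
)
```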
@shadabbarmare7797 6 months ago
It was a good demo... thank you so much. Just a small question: why didn't we use an S3 trigger to the Lambda function when a new file is added or updated, instead of EventBridge?
@cloudquicklabs 6 months ago
Thank you for watching my videos. This is the custom solution designed here, but indeed we can use the S3 object-created event to trigger the workflow as well.
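A minimal sketch of that alternative (bucket name and function ARN are assumptions): point the bucket's object-created notification straight at the Lambda function. The function must first grant S3 permission to invoke it.

```python
import boto3

# The Lambda function must allow S3 to invoke it before the notification is set.
boto3.client("lambda").add_permission(
    FunctionName="start-etl-workflow",  # assumed function name
    StatementId="allow-s3-invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-etl-source-bucket",  # assumed bucket ARN
)

boto3.client("s3").put_bucket_notification_configuration(
    Bucket="my-etl-source-bucket",  # assumed bucket
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:start-etl-workflow",
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)
```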
@rohitmukherjee4271 1 year ago
Hey, quick question: why is there no trigger at the target point? Is it because the destination would be a one-time activity? Please revert.
@cloudquicklabs 1 year ago
Thank you for watching my videos. Yes. In this lab I have shown one table at the Redshift cluster, which is the destination, and that is a one-time activity. But source files keep coming, and they need to be loaded into the Redshift cluster table.
@nachiket26 1 year ago
This demo is good, but there are two things missing here. Only the incremental data should be loaded into Redshift when the job runs again; in your case there is duplication. Another thing you could have shown is making changes on the fly in the spreadsheet, which invokes the crawler and automatically runs a job.
@cloudquicklabs 1 year ago
Thank you for watching my videos. These are indeed valuable inputs. I am planning to make the next version of this video where these inputs are taken care of. Keep watching and happy learning.
@ManojKumar-vp1zj 1 year ago
right
@akshaymuktiramrodge3233 7 months ago
Hi brother, if you have any update on how we can load only incremental data, please share it with me.
@cloudquicklabs 7 months ago
Thank you for the reminder. I am on it; I shall upload it soon.
@akshaymuktiramrodge3233 7 months ago
@cloudquicklabs OK
@harikaanasuya8976 1 year ago
We can also load data into Redshift using the COPY command from S3. So what is the difference between the above method and using the COPY command?
@cloudquicklabs 1 year ago
Thank you for watching my videos. This is really a good question. The method shown in this video uses fully managed AWS services (Glue crawlers and jobs) to extract, transform and load, whereas the COPY command is a SQL approach that connects to Redshift and loads the data.
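For comparison, the COPY route might look like this sketch using the Redshift Data API (cluster, database, table, bucket and IAM role are all placeholders):

```python
import boto3

# COPY pulls the CSV from S3 into an existing Redshift table in one SQL call.
boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="redshift-cluster-1",  # assumed cluster
    Database="dev",                          # assumed database
    DbUser="awsuser",                        # assumed DB user
    Sql=(
        "COPY public.sales "
        "FROM 's3://my-etl-source-bucket/data.csv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Read' "
        "CSV IGNOREHEADER 1;"
    ),
)
```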
@kishlayamourya3141 1 year ago
Awesome video!! I have a query: I want to push S3 data (CSV) to Redshift tables. Can I somehow use the table schema created by the crawler to create the table in Redshift? In every tutorial the instructor first creates a table in Redshift by hand, then uses a crawler to create the schema in Glue, then pushes the data to Redshift... so what is the use of creating the schema with the crawler?
@cloudquicklabs 1 year ago
Thank you for watching my videos. This is a really good question. Glue crawlers work on a mapping between source (pre-configured details about the source) and destination (pre-configured details about the destination), and they expect both parts to be present before the job starts pumping the data. Your requirement is a more custom and dynamic one; in that case you might need to rely on AWS Lambda, where you can dynamically create the table and upload the data.
@MrMadmaggot 1 year ago
There must be a way to get the schema easily without having to use many resources, but I haven't found it yet; what about you? I know Pandas has a function that helps with that; maybe I could download the data, read a chunk and create the schema, but what if the data is really big, like terabytes? No way I'm going to download that. As far as I'm concerned, the crawler gives you a data schema from the CSV which you can use. Right now I will try uploading a CSV to S3; the Redshift cluster is connected to that S3, which means I can copy the CSV from S3 into a table in Redshift, but I want to see if the CSV already appears there in the DB of the Redshift cluster.
@veerachegu 2 years ago
We invoked the crawler via Lambda, but it crawled the same table again. What is the purpose of pushing the same table again? We already have the table in Redshift from the source.
@cloudquicklabs 2 years ago
Two aspects are shown there: 1. A Glue-job-based ETL pipeline (manually triggered once) which creates the table for the first time in Redshift. 2. Next, we trigger the same Glue pipeline using an event rule (when a file is uploaded to the S3 bucket); this is the automation scenario. Hope this helps you understand it.
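The Lambda side of step 2 can be very small; a sketch, under the assumption that the crawler is named s3-source-crawler:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Fired by the event rule on S3 upload: re-crawl the bucket so the
    # catalog picks up the new file, which lets the Glue job load it.
    glue.start_crawler(Name="s3-source-crawler")  # assumed crawler name
    return {"status": "crawler started"}
```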
@damienspectre4231 1 year ago
Can you do the same with Firehose? It supports Redshift from S3 automatically using the COPY command...
@cloudquicklabs 1 year ago
Thank you for watching my videos. Indeed I shall create a new video on it, thank you for giving new ideas.
@gourbhoi722 11 months ago
Thanks man
@cloudquicklabs 11 months ago
Thank you for watching my videos. Glad that it helped you.
@vijaykumarreddyt3287 22 days ago
I am getting "Account *** is denied access" while creating crawlers. I tried signing in afresh; it's not helping. Also, while modifying inbound rules for security groups I selected allow-all for the types.
@cloudquicklabs 21 days ago
Thank you for watching my videos. AWS might have disabled the crawler or ETL job in your account; raising the service quota limit or opening a support case should help you.
@monishbhanushali3214 2 years ago
@Cloud Quick Labs How do I configure CloudWatch and a Lambda function to trigger when the job succeeds? I tried, but my Lambda is not getting invoked.
@cloudquicklabs 2 years ago
Thank you for watching my videos. Hope you have followed all the steps mentioned in the video to invoke Lambda. In that case, could you please check the permissions assigned to the Lambda role?
@monishbhanushali3214 2 years ago
@cloudquicklabs Hi, thanks for replying, really appreciated. I checked my Lambda role: the permission applied was administrative access, and I have given the trust policy for Lambda, AWS Glue and S3 by adding each in the JSON.
@cloudquicklabs 2 years ago
Thanks for coming back on this. Hope this helped you.
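For anyone hitting the same symptom, a hedged sketch of the wiring (job name, function name and ARN are placeholders): an EventBridge rule matching Glue "Job State Change" events with state SUCCEEDED, targeting the Lambda, plus the invoke permission that is easy to forget.

```python
import json
import boto3

events = boto3.client("events")

# Rule: fire when the named Glue job reaches SUCCEEDED.
events.put_rule(
    Name="glue-job-succeeded",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"jobName": ["s3-to-redshift-job"], "state": ["SUCCEEDED"]},
    }),
)
events.put_targets(
    Rule="glue-job-succeeded",
    Targets=[{"Id": "1",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:on-job-success"}],
)

# Without this resource policy, EventBridge matches the event but the
# Lambda is never invoked -- a common cause of the symptom above.
boto3.client("lambda").add_permission(
    FunctionName="on-job-success",
    StatementId="allow-eventbridge-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
)
```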
@pallavkan 1 year ago
Appreciate your efforts! But you have not explained the significance of creating the Redshift crawler, or why a security group is required with the same name... I would request you to go step by step to make these things easier to understand. Patience!
@cloudquicklabs 1 year ago
Thank you for watching my videos. I appreciate your patience, and the valuable feedback is taken. Keep learning. :)
@Alexmontoyaangel 2 years ago
Thanks for the tutorial. I would like your view on two doubts that I have: What is the best way to debug the code that does the Glue ETL process? And is it possible to build all this infrastructure using IaC? Do you have any example?
@cloudquicklabs 2 years ago
Thank you for watching my videos. Please find my answers below. 1. To debug the AWS Glue configuration, we have to check the logs and follow the standard methods. 2. Yes, the complete configuration done above can be maintained as Infrastructure as Code using Terraform; for examples, please check my AWS Infrastructure as Code using Terraform videos.
@Parker_Perry 2 years ago
At 9:11, how did you add both Lambda and Glue as trusted entities? I'm only allowed to add one, and then I'm unsure how to edit the JSON to add Glue.
@cloudquicklabs 2 years ago
Thank you for watching my videos. This can be added when you are defining the role; put the trust value as 'glue.amazonaws.com'.
@Parker_Perry 2 years ago
@cloudquicklabs It doesn't allow you to put multiple trust values; I couldn't add both Lambda and Glue. I had to edit the JSON to add Glue.
@Parker_Perry 2 years ago
Great video though!!
@cloudquicklabs 2 years ago
Exactly, this is the way we need to add it.
@cloudquicklabs 2 years ago
Thank you for watching my videos. Happy learning.
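For reference, the JSON edit described in this thread can also be done at role-creation time; a sketch (role name as used in the video, everything else standard IAM trust-policy shape):

```python
import json
import boto3

# One role trusted by both services: list them in the Service principal.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": ["glue.amazonaws.com", "lambda.amazonaws.com"]},
        "Action": "sts:AssumeRole",
    }],
}

boto3.client("iam").create_role(
    RoleName="ETLlambdaAccessRole",  # role name used in the video
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
```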
@sandeepreddy9689 5 days ago
Hi @cloudquicklabs, what was the use of the second crawler? Is it only run once to get the schema of Redshift into the temp database, and after that it won't run at all?
@cloudquicklabs 5 minutes ago
Thank you for watching my videos. It's to load the data from S3 into the AWS Glue Data Catalog and then to Amazon Redshift. Did you watch the second version of this video? kzfaq.info/get/bejne/iK2Dfsh8nt7Tp2Q.htmlsi=FB_1BXVQp-SnfUtq
@suryatejasingasani256 1 year ago
Hi bro, I have a doubt: I have a DataStage job converted into an XML file, and I want to convert that XML file into a Glue job. How can I do that?
@cloudquicklabs 1 year ago
Thank you for watching my videos. Do you mean that you want to load the XML file data into the Redshift cluster DB?
@mindset873 2 years ago
Hi, I am having a problem with the IAM role step at minute 37:54: "No roles found with service principal aws-glue" is the UI I see after clicking the IAM roles. I really need your help to solve my problem.
@cloudquicklabs 2 years ago
Thank you for watching my videos. In my video I suggested using one IAM role with an administrative policy attached and with trust for the AWS Glue and Lambda services. Please set up the IAM role this way and use the same role while creating the Glue crawlers, the job and the Lambda. It should work.
@manjindersingh8677 2 years ago
Hello Sir, I hope you are doing well. I have a confusion while creating a new role. At step 1 of creating the role, it gives an option to choose a use case. What should we choose there in order to keep both (Glue and Lambda) as trusted entities? Your guidance will be appreciated. Thanks
@cloudquicklabs 2 years ago
Thank you for watching my videos. I am doing good; hope you are doing well too. You could add 'AWS Glue' and 'AWS Lambda' as trusted entities so that the same role can be used on either side.
@himansusekharmishra5114 2 years ago
Hi, thank you for this beautiful video. Unfortunately, when we run the Glue job it puts duplicate data into Redshift. Suppose we have new data in S3 and we run the crawler; after the crawler succeeds it triggers the Glue job, the new data is loaded into Redshift, and the data in Redshift gets doubled because it duplicates the past data. Can you please help with this?
@cloudquicklabs 2 years ago
Thank you for watching my videos and reaching out on this. The solution here is not intelligent enough to deduplicate the data, or rather to load only the differential data into Redshift, so I would suggest cleaning up the data already uploaded to Redshift and keeping only the required data on the S3 bucket side when a file is uploaded.
@himansusekharmishra5114 2 years ago
@cloudquicklabs Thank you again. Can we do it with the Glue job itself?
@cloudquicklabs 2 years ago
I believe we can't do it with the Glue job itself, but rather by making the S3 file contain only unique data, and perhaps by cleaning the temp table created by the crawlers on the source side.
@himansusekharmishra5114 2 years ago
@cloudquicklabs Hi, I got the solution: I think all we need is to enable bookmarks on the Glue job from the console, and it will not repeat the old data.
@cloudquicklabs 2 years ago
Great to hear back from your side. Happy that you explored it, and thank you very much for sharing it here.
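For readers who want the bookmark fix in code form, a minimal sketch (job name assumed): with bookmarks enabled, Glue tracks what it has already processed and skips it on the next run.

```python
import boto3

# Job bookmarks make reruns incremental instead of reloading everything.
boto3.client("glue").start_job_run(
    JobName="s3-to-redshift-job",  # assumed job name
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```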
@sabaakram9310 2 years ago
What are all the policies you have attached to ETLlambdaAccessRole?
@cloudquicklabs 2 years ago
Thank you for watching my videos. I have attached the 'AdministratorAccess' policy to the role. You need to fine-grain it (e.g., give only download access on S3, CloudWatch log group creation and log put access, plus AWS Lambda, EventBridge, AWS Glue and Redshift access) in production, to follow the least-privilege method.
@sabaakram9310 2 years ago
@cloudquicklabs Thank you 😊
@cloudquicklabs 2 years ago
You are welcome.
@user-qx4iy6nq9x 8 months ago
Running the crawler, the watchdog and Lambda again creates duplicate rows, right? How will new data from source to target keep getting updated automatically? Kindly explain, please... Other than that, your full demo session is wonderful for beginners, bro. Thanks.
@cloudquicklabs 8 months ago
Thank you for watching my videos. I am currently looking to create a v2 video of this lab; I shall solve this problem in that video.
@asishb 9 months ago
Hello! Wonderful video. You got a sub. Thanks for saving me hours of research. However, I have two questions/requests. 1) When I use a crawler in the console, the Glue catalog is able to create a table automatically, but when I write the code in PySpark, I am required to provide a table. Why is that? 2) Can you show how to transfer data from S3 (a CSV file) to Redshift after some transformation using Glue, in the visual editor of the Glue console?
@cloudquicklabs 9 months ago
Thank you for watching my videos. 1. It could be because PySpark does not have that capability; it is meant for transforming the data and provides highly scalable processing power. 2. Indeed, I am going to create a new video soon on how to migrate data from S3 (a CSV file) to a Redshift cluster.
@asishb 9 months ago
@cloudquicklabs Thanks for the info.
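To make question 1 concrete, a hedged sketch of a Glue (PySpark) job script for this S3-to-Redshift load — the catalog database, table names and temp directory are assumptions, not the video's exact code:

```python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the S3 CSV table that the source crawler registered in the catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="etl_catalog_db", table_name="s3_csv_table")

# Write to the Redshift table the second crawler registered; Glue stages
# the rows in S3 and issues COPY behind the scenes.
glue_context.write_dynamic_frame.from_catalog(
    frame=source, database="etl_catalog_db", table_name="redshift_table",
    redshift_tmp_dir="s3://my-temp-bucket/glue-tmp/")

job.commit()
```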
@Pavan-kn5pg 1 year ago
Good content, but the flow of your explanation is quite confusing.
@cloudquicklabs 1 year ago
Thank you for watching my video. Apologies that you find this video a bit confusing. Please watch it with patience; it should help you.
@jhonamorgan9194 2 years ago
Please try to add subtitles to your videos
@cloudquicklabs 2 years ago
Thank you for watching my videos. Yes, I am enabling them right away.
@udaykumar-tb5kn 1 year ago
How do you configure Redshift and the VPC? Your explanation is very high-level.
@cloudquicklabs 1 year ago
Thank you for watching my videos. Thank you for your encouraging words.
@udaykumar-tb5kn 1 year ago
@cloudquicklabs No comment; please answer my question. Provide your data.
@santaellal 2 years ago
Thanks for the tutorial; please upload it with Spanish subtitles :( !
@cloudquicklabs 2 years ago
Thank you for watching my videos. Yes, I shall try to update it.
@santaellal 2 years ago
@cloudquicklabs Thanks. Can I load .txt with AWS Glue, or only CSV?
@cloudquicklabs 2 years ago
It supports CSV (as it has columns and rows). I don't think it supports .txt.
@udaykumar-tb5kn 1 year ago
Can you share the files and script used here?
@cloudquicklabs 1 year ago
Thank you for watching my videos. The files are shared in the video description; you can find the repo at github.com/RekhuGopal/PythonHacks/tree/main/AWSBoto3Hacks/AWS-ETL-S3-Glue-RedShift
@kailashchand7596 1 year ago
Confusing in many places.
@cloudquicklabs 1 year ago
Thank you for watching my videos. Apologies that it confused you. Please watch again; maybe just pause and rewind it. Feel free to ask doubts as well.
@sauravverma6179 1 year ago
Dude, Administrator access!! SERIOUSLY!
@cloudquicklabs 1 year ago
Thank you for watching my videos. Administrator access is blanket access, but you could fine-grain your IAM role with a least-privilege approach and use it for the solution (I left this part as self-service).
@sauravverma6179 1 year ago
@cloudquicklabs Yes. People who are new to DevOps will just use this administrator role. You could have provided an IAM policy for people to use.
@cloudquicklabs 1 year ago
Thank you for the valuable inputs. I shall consider these points in my next video.
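Until that video exists, a hedged starting point for a tighter inline policy (ARNs are placeholders; scope them to your own resources before use):

```python
import json
import boto3

# Narrower than AdministratorAccess: S3 read, Glue start/read, CloudWatch logs.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::my-etl-source-bucket",
                      "arn:aws:s3:::my-etl-source-bucket/*"]},
        {"Effect": "Allow",
         "Action": ["glue:StartCrawler", "glue:StartJobRun",
                    "glue:GetCrawler", "glue:GetJobRun"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["logs:CreateLogGroup", "logs:CreateLogStream",
                    "logs:PutLogEvents"],
         "Resource": "*"},
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="ETLlambdaAccessRole",  # role name used in the video
    PolicyName="etl-least-privilege",
    PolicyDocument=json.dumps(policy),
)
```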
@100derman 2 years ago
34 minutes to import a CSV into a table... good grief...
@cloudquicklabs 2 years ago
Thank you for watching my videos. Yes, it took that much time because it has a step-by-step explanation.
@eylesit9268 8 months ago
I tried to watch this video, but there were adverts every 2 or 3 minutes. For me this is unacceptable and unwatchable.
@cloudquicklabs 8 months ago
Thank you for watching my videos. Apologies that you dislike the frequent adverts while playing it, but note that this is how YouTube works.
@eylesit9268 8 months ago
@cloudquicklabs No other videos I watch have this many adverts. The overwhelming majority have one advert at the start and no others. Forcing viewers to watch adverts every few minutes disrupts the learning process.
@AshisRaj 5 months ago
Mr. Author, you could be more humble. Your tone should be modulated.
@cloudquicklabs 5 months ago
Thank you for watching my videos. I shall consider this input. Thank you again for your inputs.
@cloudquicklabs 5 months ago
I understand the pain point here. Apologies for inconvenience.
@jubrilbabatunde3040 2 years ago
Poor video quality.
@cloudquicklabs 2 years ago
Thank you for the valuable input; I am working on improving the quality soon. Till then, please do rely on this and keep learning.
@saurabhpuranik5474 2 years ago
The most confusing trainer I have ever seen. This guy just jumps here and there. Follow proper steps if you want to create proper content.
@cloudquicklabs 2 years ago
Thank you for giving valuable inputs; I shall take them on board to fine-tune myself. This lab scenario is unfortunately a bit complex, and the content was meant to reach a wider audience: starters, proficient users and experts.
@suneelkumarparvatala7527 7 days ago
Hi, thank you for the informational videos. Would you clarify this doubt I have: the Glue crawler here was created and run only once, I believe. Instead of creating a Glue crawler, is it possible to migrate data directly from the S3 source to the Redshift table? My intention in asking is that we run the Glue crawler only once, or only when we would like to see records in the Redshift table, since accessing records in Redshift is possible by querying in the query editor.
@cloudquicklabs 6 days ago
Thank you for watching my videos. We are migrating data from the source S3 to Amazon Redshift, which is the destination here. Crawlers can be scheduled or invoked on demand.