AWS Glue PySpark: Insert records into Amazon Redshift Table

10,733 views

DataEng Uncomplicated

1 year ago

This video is a step-by-step guide on how to write new records to an Amazon Redshift table with AWS Glue PySpark; a minimal sketch of the pattern is included after the links below.
add redshift table to AWS Glue Catalog: • Add Redshift Data Sour...
#aws #awsglue
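
A minimal sketch of the pattern the video walks through, assuming hypothetical catalog database and table names (the actual script in the video may differ):

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the new records from a source table registered in the Glue Catalog
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # hypothetical database name
    table_name="my_source_table",  # hypothetical source table
)

# Append the records to the Redshift table registered in the catalog;
# Glue stages the data in the S3 temp directory and loads it with a COPY
glueContext.write_dynamic_frame.from_catalog(
    frame=source_dyf,
    database="my_database",
    table_name="my_redshift_table",  # hypothetical cataloged Redshift table
    redshift_tmp_dir=args["TempDir"],
)

job.commit()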

Comments: 39
@sarfarazsiddiqui9683 1 year ago
Short and sweet explanation. Thanks for this useful post.
@DataEngUncomplicated 1 year ago
Thanks Sarfaraz!
@joggyjames 1 year ago
I am working on a project updating records in S3; this method is much more efficient than what I was doing. Thank you.
@DataEngUncomplicated 1 year ago
Good, I'm glad it was more efficient than your other method. Note this only works for inserting new records; I plan on making another video soon on how to update records with PySpark and AWS Glue.
@joggyjames 1 year ago
@@DataEngUncomplicated Oops, I meant adding new records, but I'm looking forward to that video as well.
@denisgomonov2320 1 year ago
@@DataEngUncomplicated I have a file (test_report.csv) that gets uploaded to S3 (say a new version every 24 hours) -> Glue crawlers pick it up, and eventually a job populates a table in Redshift. However, after creating a schedule for my job, I noticed that the Redshift table is not getting updated with the most current CSV file picked up from S3; is there no overwrite? Is there a way to update the table in Redshift dynamically? Otherwise, how is this service practical?
@DataEngUncomplicated 1 year ago
@@denisgomonov2320 Hi Denis, yes, this is absolutely possible. I am doing something similar in AWS Glue, where I am upserting records by a particular composite key. I am planning to make a video on how to do upserts from AWS Glue to Redshift.
@Streampax 1 year ago
Can we insert data with the same columns but different ordinal positions?
@LukInMaking 1 year ago
Thank you for the awesome content!
@DataEngUncomplicated 1 year ago
Thanks Luk!
@paritoshec 1 year ago
Nice video, it is very helpful, thanks. Is there a video on inserting complex JSON files into Redshift from an S3 bucket via a notebook?
@venkatramreddy-iq2zv 1 year ago
I need to execute a procedure in Redshift by passing parameters from a Glue ETL script, and then get the OUT parameter value to use later in the Glue job. Can anyone provide a sample code snippet?
@unw1718 1 year ago
Is it faster to insert into Redshift using a PySpark script or a COPY command? I know that COPY uses parallel processing; will the PySpark script also use it when inserting into Redshift? Is it possible to do a bulk insert with PySpark?
@DataEngUncomplicated 10 months ago
Hey, I believe PySpark uses COPY under the hood to write to Redshift. One of the parameters it asks for is a temp directory in S3; it writes temporary Parquet files there before copying them into Redshift.
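
For illustration, a minimal sketch of that write path using from_options, with hypothetical connection values; Glue stages the frame under the temp directory and then issues a COPY:

glueContext.write_dynamic_frame.from_options(
    frame=my_dyf,  # hypothetical DynamicFrame to load
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://my-cluster:5439/dev",  # hypothetical cluster
        "dbtable": "public.my_table",
        "user": "my_user",
        "password": "...",
        "redshiftTmpDir": args["TempDir"],  # S3 staging area for the COPY
        "aws_iam_role": "arn:aws:iam::123456789012:role/my-redshift-role",
    },
)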
@heysiri-hz6ko 1 year ago
I have tables in RDS and I need to load them into Redshift. My concern is that whenever new rows get added to the RDS tables, only the new rows should get appended to the Redshift tables, without deleting existing Redshift data. How can I do that? Thank you in advance.
@DataEngUncomplicated 1 year ago
You can use pre and post queries to perform what is basically an upsert to solve this (a sketch of the pattern is below). I'm going to make a video on this soon.
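
For reference, a sketch of that staging-table pattern with hypothetical table and key names: the pre-action creates an empty staging table, the job loads the new rows into it, and the post-action deletes matched rows from the target before inserting everything from staging.

# hypothetical staging-table upsert; "id" is the assumed key column
pre_query = (
    "DROP TABLE IF EXISTS public.stage_orders; "
    "CREATE TABLE public.stage_orders AS SELECT * FROM public.orders WHERE 1=2;"
)
post_query = (
    "DELETE FROM public.orders USING public.stage_orders "
    "WHERE public.orders.id = public.stage_orders.id; "
    "INSERT INTO public.orders SELECT * FROM public.stage_orders; "
    "DROP TABLE public.stage_orders;"
)

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=new_records_dyf,                 # DynamicFrame of incoming rows
    catalog_connection="redshift",         # hypothetical Glue connection name
    connection_options={
        "dbtable": "public.stage_orders",  # load staging, not the target
        "database": "dev",
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir=args["TempDir"],
)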
@blairnangle 1 year ago
What client are you using to view Redshift tables?
@DataEngUncomplicated 1 year ago
I'm using PyCharm Professional.
@varunsam9947 1 year ago
Can we update an existing row in Redshift using Glue PySpark?
@DataEngUncomplicated 1 year ago
Yes, this is possible; one of my next videos will be on this with Redshift.
@vishalkushwaha1075 1 year ago
Can we get data from API Gateway into SQS, send it to S3, and then load from S3 to Redshift using Spark?
@DataEngUncomplicated 1 year ago
Are you thinking of real-time loading or just a bulk load of all the files? But yes, it's definitely possible.
@gshan994 1 year ago
Instead, I'd say use S3 presigned URLs. Create an API and a Lambda: the user passes a filename to the API, and the Lambda provides a presigned URL. The user then makes a POST request to that URL. Up to 5 GB of data can be passed via a presigned URL (a sketch is below).
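
A minimal sketch of that Lambda, assuming a hypothetical bucket name and that API Gateway passes the filename as a query parameter:

import json

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # hypothetical: API Gateway passes the target file name as a query parameter
    filename = event["queryStringParameters"]["filename"]
    presigned = s3.generate_presigned_post(
        Bucket="my-upload-bucket",  # hypothetical bucket
        Key=filename,
        ExpiresIn=3600,  # URL and form fields valid for one hour
    )
    # the caller POSTs a multipart form with these fields plus the file itself
    return {"statusCode": 200, "body": json.dumps(presigned)}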
@fawadh 1 year ago
I'm having issues, I need help!
@Nayak_Bukya_08 14 days ago
As I am working on Glue, this is super urgent for me. The question is: how do I add the source file name to the dynamic frame? It would be great if you could respond on a priority basis. Thank you.
@DataEngUncomplicated 13 days ago
Hi, this is outside the scope of this video. I don't know if this is even possible, to be honest; with DataFrames the files are abstracted away from us. Please post your question on AWS re:Post to see if someone can help you out. Also, a Google search or reading the docs can be your friend!
@imdadood5705 1 year ago
Do you have any Udemy courses on Data Engineering?
@DataEngUncomplicated 1 year ago
Hey, unfortunately I do not. If I had more time, I would love to make an AWS data engineering course.
@imdadood5705 1 year ago
@@DataEngUncomplicated I have subscribed. If you have plans to do paid courses on Udemy, I hope you will make an announcement, yeah?
@shashikantkrishna4156 1 year ago
How do I make a Glue job faster?
@DataEngUncomplicated 1 year ago
Hi Shashikant, it's important to understand where the bottleneck is. If you are running a PySpark job, one option is to add more worker nodes when you are processing extremely large data sets.
@shashikantkrishna4156 1 year ago
@@DataEngUncomplicated Hi, thanks for your reply. Actually, I have created a Redshift JDBC connection and am trying to write the dynamic frame to a Redshift table:

datasink0 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=applymapping_for_ids,
    catalog_connection="redshift",
    connection_options={"dbtable": "track_order_ids_test", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink0",
)

This dynamic frame has one column (ID) with 8M rows at roughly 800 MB, and the write takes around 32 minutes. What's your suggestion? Thanks in advance.
@user-bk5oo2rs5x 1 year ago
Where is the tutorial for this part? kzfaq.info/get/bejne/e8uknqiXmJudgXk.html
@DataEngUncomplicated 1 year ago
Are you looking for the script? If so, sorry, I only started uploading the scripts a while after I made this video.
@prabhathkota107 1 month ago
I am able to run successfully with glueContext.write_dynamic_frame.from_options and glueContext.write_dynamic_frame.from_jdbc_conf, but I see an issue with glueContext.write_dynamic_frame.from_catalog. I am getting the error below:

Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o130.pyWriteDynamicFrame. Exception thrown in awaitResult: SQLException thrown while running COPY query; will attempt to retrieve more information by querying the STL_LOAD_ERRORS table

Could you please guide me?
@DataEngUncomplicated 1 month ago
These errors can be tricky and require you to look further into the logs to see what is causing them. It sounds specific to your data; querying STL_LOAD_ERRORS, as the message suggests, will usually surface the failing rows (a sketch is below).
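
A minimal sketch of that check using the redshift_connector library, with hypothetical connection details:

import redshift_connector

# hypothetical cluster endpoint and credentials
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="my_user",
    password="...",
)
cursor = conn.cursor()

# the most recent COPY failures, newest first
cursor.execute(
    "SELECT query, filename, line_number, colname, err_reason "
    "FROM stl_load_errors ORDER BY starttime DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)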
@sumitgupta1492 1 year ago
While using this code I am getting the error "while calling o90.getSink an error occurred".
@DataEngUncomplicated 1 year ago
Hi Sumit, hmm, are there any more error messages provided? That one seems very vague.