AWS Glue PySpark: Upserting Records into a Redshift Table

6,505 views

DataEng Uncomplicated

1 year ago

This video is a step-by-step guide on how to upsert records into a Redshift table from a PySpark DynamicFrame. It uses a file in S3 containing both new and existing records that we want to upsert into our Redshift table.
github: github.com/AdrianoNicolucci/d...
Related videos: • Add Redshift Data Sour...
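For orientation, here is a minimal sketch of the staging-table upsert pattern the video walks through. The bucket, database, table, and connection names are placeholders, and the merge SQL is only one way to express the upsert:

```python
# Hypothetical sketch of the upsert pattern: load new/changed records from S3,
# then use Redshift preactions/postactions to merge them into the target table
# through a staging table. All names and paths are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# 1. Read the incoming file from S3 into a DynamicFrame.
incoming_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/incoming/customers/"]},
    format="csv",
    format_options={"withHeader": True},
)

# 2. Upsert into Redshift: load into a staging table, then merge into the target.
pre_query = """
    DROP TABLE IF EXISTS public.customers_staging;
    CREATE TABLE public.customers_staging AS SELECT * FROM public.customers WHERE 1=2;
"""
post_query = """
    BEGIN;
    DELETE FROM public.customers
      USING public.customers_staging
      WHERE public.customers.id = public.customers_staging.id;
    INSERT INTO public.customers SELECT * FROM public.customers_staging;
    DROP TABLE public.customers_staging;
    END;
"""

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=incoming_dyf,
    catalog_connection="redshift-connection",   # Glue connection name (placeholder)
    connection_options={
        "database": "dev",
        "dbtable": "public.customers_staging",  # the write lands in the staging table
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir="s3://my-bucket/temp/",    # S3 staging dir required for Redshift copies
)
```

A plain DynamicFrame write to Redshift appends rows, so the preactions/postactions staging-table merge is the usual way to get upsert behavior.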

Comments: 25
@asfakmp7244
@asfakmp7244 1 month ago
Thanks for the video! I've tested the entire workflow, but I'm encountering an issue with the section on creating a DynamicFrame from the target Redshift table in the AWS Glue Data Catalog and displaying its schema. While I can see the updated schema reflected in the Glue catalog table, the code you provided still prints the old schema.
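For context, the step being discussed, reading the target Redshift table through the Glue Data Catalog and printing its schema, looks roughly like this (database, table, and S3 path names are placeholders):

```python
# Hypothetical reconstruction of the "read the target table from the Glue Data
# Catalog and inspect its schema" step; all names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

target_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="redshift_db",          # Glue Data Catalog database (placeholder)
    table_name="public_customers",   # catalog table pointing at the Redshift table (placeholder)
    redshift_tmp_dir="s3://my-bucket/temp/",
)

# Prints the schema of the DynamicFrame that was read; this is the output the
# comment above reports as still showing the old schema.
target_dyf.printSchema()
```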
@mariumbegum7325
@mariumbegum7325 1 year ago
Great explanation 😀
@DataEngUncomplicated
@DataEngUncomplicated 1 year ago
Thanks Marium!
@tuankyou9158
@tuankyou9158 9 months ago
Thanks for sharing your solution 😍😍
@DataEngUncomplicated
@DataEngUncomplicated 9 months ago
you're welcome!
@critical11creator
@critical11creator 1 year ago
Amazing tutorials! Truly haven't seen such drilled-down content in a while. Is there a native PySpark course, perhaps, in the making? :) I'm certain it would be much appreciated by many if such a course existed on this channel.
@DataEngUncomplicated
@DataEngUncomplicated 11 months ago
Thank you for your kind words! I have been slowly adding PySpark-related content, but I don't have a full course in the making; I wish I had more time!
@ashishsinha5338
@ashishsinha5338 1 year ago
Good explanation regarding staging.
@DataEngUncomplicated
@DataEngUncomplicated 1 year ago
Thanks Ashish!
@vivek2319
@vivek2319 1 year ago
Please make more diverse videos like this with what-if scenarios.
@DataEngUncomplicated
@DataEngUncomplicated 1 year ago
Hi Vivek, can you give me some examples of what you have in mind?
@rambandi4330
@rambandi4330 1 year ago
Thanks for the video. Does this work for RDS Oracle?
@DataEngUncomplicated
@DataEngUncomplicated 1 year ago
I'm not sure. I haven't worked with RDS Oracle, but in theory it should.
@rambandi4330
@rambandi4330 1 year ago
@@DataEngUncomplicated Thanks for the response👍
@datagufo
@datagufo 7 months ago
Hi Adriano, first of all thanks for the amazing series of tutorials. They are really clear and detailed. I am trying to implement the UPSERT into Redshift using AWS Glue, but I am running into what seems to be an odd problem. If I run my Glue script from the notebook (it is actually a copy-paste of your notebook, with minor adaptations to make it work with my data and setup), then when writing to Redshift the "preactions" and "postactions" are ignored, meaning that I end up with just a `staging` table that never gets deleted and to which data is simply appended, and no `target` table is ever created. Have you ever had this problem? I could not find any solution online, and I do not understand why your code would work for you but not in my case. Thanks again!
@DataEngUncomplicated
@DataEngUncomplicated 7 months ago
Ciao Alberto, thanks! Hmm, I think I might have had this happen to me before. Can you check to make sure you haven't misspelled any of the parameters? I think if there is an error in one, it just ignores the preactions.
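For anyone hitting the same issue, these are the option keys the Redshift write expects; a quick hypothetical check (SQL bodies elided):

```python
# The option keys must be spelled exactly "preactions" and "postactions".
# As the reply above suggests, a misspelled key appears to be silently skipped
# rather than raising an error, so the write succeeds but the SQL never runs.
connection_options = {
    "database": "dev",                      # placeholder database name
    "dbtable": "public.customers_staging",  # placeholder staging table
    "preactions": "CREATE TABLE IF NOT EXISTS ...;",                          # SQL run before the write
    "postactions": "BEGIN; DELETE ...; INSERT ...; DROP TABLE ...; END;",     # SQL run after the write
}
```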
@datagufo
@datagufo 6 months ago
Ciao Adriano (@@DataEngUncomplicated)! Thanks a lot for your reply. I also thought that might be the case, but it does not seem like it is; I really did copy and paste your code. Moreover, the same thing happens with code generated by the Visual Editor, which I assume has the correct syntax. I was wondering whether it could be related to the permissions of the role used to run the script, but I do not see why it would allow writing data to the table and not running the SQL preaction... In the meantime, I really enjoyed your other video about local development; it really helps keep dev costs down and significantly speeds up the development cycle.
@DataEngUncomplicated
@DataEngUncomplicated 6 months ago
Did you check to make sure your user in the database has permissions to create and drop a table? Maybe your user only has read/write access?
@mohammadfatha7740
@mohammadfatha7740 1 year ago
I followed the same steps, but it's throwing an error along the lines of: the id column is an integer and the value being queried is character varying.
@DataEngUncomplicated
@DataEngUncomplicated 1 year ago
Hey, it sounds like you might have mixed data types in your column. You may think it's all integers, but there are actually some strings in there.
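If that is the case, one hypothetical way to force a consistent type on the DynamicFrame before writing is resolveChoice (the frame and column names below are placeholders):

```python
# Hypothetical fix for a column with mixed types: cast every "id" value to int
# before writing to Redshift. Values that cannot be cast may end up as null,
# so it is worth inspecting those rows rather than loading them silently.
resolved_dyf = incoming_dyf.resolveChoice(specs=[("id", "cast:int")])
resolved_dyf.printSchema()
```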
@NasimaKhatun-jb7qo
@NasimaKhatun-jb7qo 4 months ago
Hi, where are you running the code?
@DataEngUncomplicated
@DataEngUncomplicated 4 months ago
Hi, I'm running my code locally using an interactive Glue session.
@NasimaKhatun-jb7qo
@NasimaKhatun-jb7qo 4 months ago
I am trying my hand at running the code locally. Can you create a video on how to run Glue jobs locally (notebook version), including setup and configuration?
@DataEngUncomplicated
@DataEngUncomplicated 4 months ago
I actually have many videos on this; for example, see this one: kzfaq.info/get/bejne/lcWaYLaq1Na6cqc.html. You can set up Docker to run Glue locally, or use interactive sessions from a Jupyter notebook, but interactive sessions will cost compute in AWS since you are just connecting to a cluster remotely.
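For reference, the first cell of a Glue interactive-sessions notebook usually sets a few session magics before creating the GlueContext; a rough sketch (the values are only examples):

```python
# Typical first cell of an AWS Glue interactive session run from a local
# Jupyter notebook. The magics configure the remote session; the values here
# are illustrative, not recommendations.
%idle_timeout 30
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
```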
@NasimaKhatun-jb7qo
@NasimaKhatun-jb7qo 3 months ago
Yes, I have seen that video and had the same impression about cost. I am trying to set up a local environment where I can use local Spark (AWS Glue) with, say, a Jupyter notebook, and also connect to S3 and other services from my machine. Do you recommend another way to work locally? Also, how can this setup be done? I have been trying for a long time without success.