AWS Glue PySpark: Insert records into Amazon Redshift Table

10,733 views

DataEng Uncomplicated

1 year ago

This video is a step-by-step guide on how to write new records to an Amazon Redshift table with AWS Glue PySpark; a minimal sketch of the pattern is included after the links below.
add redshift table to AWS Glue Catalog: • Add Redshift Data Sour...
#aws #awsglue
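
A minimal sketch of the pattern the video walks through, assuming hypothetical catalog database and table names (the actual script in the video may differ):

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the new records from a source table registered in the Glue Catalog
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # hypothetical database name
    table_name="my_source_table",  # hypothetical source table
)

# Append the records to the Redshift table registered in the catalog;
# Glue stages the data in the S3 temp directory and loads it with a COPY
glueContext.write_dynamic_frame.from_catalog(
    frame=source_dyf,
    database="my_database",
    table_name="my_redshift_table",  # hypothetical cataloged Redshift table
    redshift_tmp_dir=args["TempDir"],
)

job.commit()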

Comments: 39
@sarfarazsiddiqui9683 1 year ago
Short and sweet explanation. Thanks for this useful post.
@DataEngUncomplicated 1 year ago
Thanks Sarfaraz!
@joggyjames 1 year ago
I am working on a project updating records in S3; this method is much more efficient than what I was doing. Thank you.
@DataEngUncomplicated 1 year ago
Good, I'm glad it was more efficient than your other method. Note this only works for inserting new records; I plan on making another video soon on how to update records with PySpark and AWS Glue.
@joggyjames 1 year ago
@@DataEngUncomplicated Oops, I meant adding new records, but I'm looking forward to that video as well.
@denisgomonov2320 1 year ago
@@DataEngUncomplicated I have a file (test_report.csv) that gets uploaded to S3 (say a new version every 24 hours) -> Glue crawlers pick it up, and eventually a job populates a table in Redshift. However, after creating a schedule for my job, I noticed that the Redshift table is not getting updated with the most current CSV file picked up from S3; is there no overwrite? Is there a way to update the table in Redshift dynamically? Otherwise, how is this service practical?
@DataEngUncomplicated 1 year ago
@@denisgomonov2320 Hi Denis, yes, this is absolutely possible. I am doing something similar in AWS Glue, where I am upserting records by a particular composite key. I am planning to make a video on how to do upserts from AWS Glue to Redshift.
@Streampax 1 year ago
Can we insert data with the same columns but different ordinal positions?
@LukInMaking 1 year ago
Thank you for the awesome content!
@DataEngUncomplicated 1 year ago
Thanks Luk!
@paritoshec 1 year ago
Nice video, it is very helpful, thanks. Is there a video on inserting complex JSON files into Redshift from an S3 bucket via a notebook?
@venkatramreddy-iq2zv 1 year ago
I need to execute a procedure in Redshift by passing parameters from a Glue ETL script, and then get the OUT parameter value to use later in the Glue job. Can anyone provide a sample code snippet?
@unw1718 1 year ago
Is it faster to insert into Redshift using a PySpark script or a COPY command? I know that COPY uses parallel processing; will the PySpark script also use it when inserting into Redshift? Is it possible to do a bulk insert with PySpark?
@DataEngUncomplicated 10 months ago
Hey, I believe PySpark uses COPY under the hood to write to Redshift. One of the parameters it asks for is a temp directory in S3; it writes temporary Parquet files there before copying them into Redshift.
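
For illustration, a minimal sketch of that write path using from_options, with hypothetical connection values; Glue stages the frame under the temp directory and then issues a COPY:

glueContext.write_dynamic_frame.from_options(
    frame=my_dyf,  # hypothetical DynamicFrame to load
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://my-cluster:5439/dev",  # hypothetical cluster
        "dbtable": "public.my_table",
        "user": "my_user",
        "password": "...",
        "redshiftTmpDir": args["TempDir"],  # S3 staging area for the COPY
        "aws_iam_role": "arn:aws:iam::123456789012:role/my-redshift-role",
    },
)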
@heysiri-hz6ko 1 year ago
I have tables in RDS and I need to load them into Redshift. My concern is that whenever new rows get added to the RDS tables, only the new rows should get appended to the Redshift tables, without deleting existing Redshift data. How can I do that? Thank you in advance.
@DataEngUncomplicated 1 year ago
You can use pre and post queries to perform what is basically an upsert to solve this (a sketch of the pattern is below). I'm going to make a video on this soon.
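
For reference, a sketch of that staging-table pattern with hypothetical table and key names: the pre-action creates an empty staging table, the job loads the new rows into it, and the post-action deletes matched rows from the target before inserting everything from staging.

# hypothetical staging-table upsert; "id" is the assumed key column
pre_query = (
    "DROP TABLE IF EXISTS public.stage_orders; "
    "CREATE TABLE public.stage_orders AS SELECT * FROM public.orders WHERE 1=2;"
)
post_query = (
    "DELETE FROM public.orders USING public.stage_orders "
    "WHERE public.orders.id = public.stage_orders.id; "
    "INSERT INTO public.orders SELECT * FROM public.stage_orders; "
    "DROP TABLE public.stage_orders;"
)

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=new_records_dyf,                 # DynamicFrame of incoming rows
    catalog_connection="redshift",         # hypothetical Glue connection name
    connection_options={
        "dbtable": "public.stage_orders",  # load staging, not the target
        "database": "dev",
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir=args["TempDir"],
)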
@blairnangle 1 year ago
What client are you using to view Redshift tables?
@DataEngUncomplicated 1 year ago
I'm using PyCharm Professional.
@varunsam9947 1 year ago
Can we update an existing row in Redshift using Glue PySpark?
@DataEngUncomplicated 1 year ago
Yes, this is possible; one of my next videos will be on this with Redshift.
@vishalkushwaha1075 1 year ago
Can we get data from API Gateway into SQS, send it to S3, and then load from S3 to Redshift using Spark?
@DataEngUncomplicated 1 year ago
Are you thinking of real-time loading or just a bulk load of all the files? But yes, it's definitely possible.
@gshan994 1 year ago
Instead, I'd say use S3 presigned URLs. Create an API and a Lambda: the user passes a filename to the API, and the Lambda provides a presigned URL. The user then makes a POST request to that URL. Up to 5 GB of data can be passed via a presigned URL (a sketch is below).
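
A minimal sketch of that Lambda, assuming a hypothetical bucket name and that API Gateway passes the filename as a query parameter:

import json

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # hypothetical: API Gateway passes the target file name as a query parameter
    filename = event["queryStringParameters"]["filename"]
    presigned = s3.generate_presigned_post(
        Bucket="my-upload-bucket",  # hypothetical bucket
        Key=filename,
        ExpiresIn=3600,  # URL and form fields valid for one hour
    )
    # the caller POSTs a multipart form with these fields plus the file itself
    return {"statusCode": 200, "body": json.dumps(presigned)}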
@fawadh 1 year ago
I'm having issues, I need help!
@Nayak_Bukya_08 14 days ago
As I am working on Glue, this is super urgent for me. The question is: how do I add the source file name to the dynamic frame? It would be great if you could respond on a priority basis. Thank you.
@DataEngUncomplicated 13 days ago
Hi, this is outside the scope of this video. I don't know if this is even possible, to be honest; with DataFrames the files are abstracted away from us. Please post your question on AWS re:Post to see if someone can help you out. Also, a Google search or reading the docs can be your friend!
@imdadood5705 1 year ago
Do you have any Udemy courses on Data Engineering?
@DataEngUncomplicated 1 year ago
Hey, unfortunately I do not. If I had more time, I would love to make an AWS data engineering course.
@imdadood5705 1 year ago
@@DataEngUncomplicated I have subscribed. If you have plans to do paid courses on Udemy, I hope you will make an announcement, yeah?
@shashikantkrishna4156 1 year ago
How do I make a Glue job faster?
@DataEngUncomplicated 1 year ago
Hi Shashikant, it's important to understand where the bottleneck is. If you are running a PySpark job, one option is to add more worker nodes when you are processing extremely large data sets.
@shashikantkrishna4156 1 year ago
@@DataEngUncomplicated Hi, thanks for your reply. Actually, I have created a Redshift JDBC connection and am trying to write the dynamic frame to a Redshift table:

datasink0 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=applymapping_for_ids,
    catalog_connection="redshift",
    connection_options={"dbtable": "track_order_ids_test", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink0",
)

This dynamic frame has one column (ID) with 8M rows at roughly 800 MB, and the write takes around 32 minutes. What's your suggestion? Thanks in advance.
@user-bk5oo2rs5x 1 year ago
Where is the tutorial for this part? kzfaq.info/get/bejne/e8uknqiXmJudgXk.html
@DataEngUncomplicated 1 year ago
Are you looking for the script? If so, sorry, I only started uploading the scripts a while after I made this video.
@prabhathkota107 1 month ago
I am able to run successfully with glueContext.write_dynamic_frame.from_options and glueContext.write_dynamic_frame.from_jdbc_conf, but I see an issue with glueContext.write_dynamic_frame.from_catalog. I am getting the error below:

Error Category: UNCLASSIFIED_ERROR; An error occurred while calling o130.pyWriteDynamicFrame. Exception thrown in awaitResult: SQLException thrown while running COPY query; will attempt to retrieve more information by querying the STL_LOAD_ERRORS table

Could you please guide me?
@DataEngUncomplicated 1 month ago
These errors can be tricky and require you to look further into the logs to see what is causing them. It sounds specific to your data; querying STL_LOAD_ERRORS, as the message suggests, will usually surface the failing rows (a sketch is below).
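
A minimal sketch of that check using the redshift_connector library, with hypothetical connection details:

import redshift_connector

# hypothetical cluster endpoint and credentials
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="my_user",
    password="...",
)
cursor = conn.cursor()

# the most recent COPY failures, newest first
cursor.execute(
    "SELECT query, filename, line_number, colname, err_reason "
    "FROM stl_load_errors ORDER BY starttime DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)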
@sumitgupta1492 1 year ago
While using this code I am getting the error "while calling o90.getSink an error occurred".
@DataEngUncomplicated 1 year ago
Hi Sumit, hmm, are there any more error messages provided? That one seems very vague.