AWS Glue PySpark: Calculate Fields

2,203 views

DataEng Uncomplicated

1 day ago

This is a technical tutorial on how to calculate new fields in AWS Glue using PySpark. It covers three examples that leverage the Map transform on DynamicFrames: how to calculate a new field with a constant value, how to calculate a new field based on the values in another column, and how to calculate a datetime and write it to a new field.
Timeline:
00:00 Introduction
00:35 Create Field With Constant Value
03:30 Create Field from Existing Field
05:04 Calculate Datetime to Field
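
A minimal sketch of the approach described above, using the Glue Map transform on a DynamicFrame. The catalog database, table, and column names (price, quantity) are placeholders, not values from the video:

```python
import datetime

from awsglue.context import GlueContext
from awsglue.transforms import Map
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: a table registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="orders"
)

def add_fields(record):
    record["source"] = "glue_tutorial"                                # constant value
    record["total_amount"] = record["price"] * record["quantity"]     # derived from existing fields
    record["processed_at"] = datetime.datetime.now().isoformat()      # datetime written to a new field
    return record

# Map applies the function to every record in the DynamicFrame.
dyf_with_fields = Map.apply(frame=dyf, f=add_fields)
dyf_with_fields.toDF().show(5)
```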

Comments: 7
@jmclachlan 11 months ago
Thanks!
@DataEngUncomplicated 11 months ago
Wow, thanks for the super thanks and supporting the channel.
@VijayKumar-tr8ki 3 months ago
Thank you for the great work. I am new to Glue and your videos are a great help. I was able to create a derived column based on this video, e.g. a new column total_amount which is equal to price * quantity. Now in the next step I want to categorize the customers based on total_amount, i.e. if total_amount =300 and total_amount
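
One possible way to bucket customers by total_amount, continuing from the Map sketch above; the comment is cut off, so the threshold values and category labels here are assumptions, not from the original question:

```python
from awsglue.transforms import Map

def add_category(record):
    # Assumed thresholds -- adjust to the real business rules.
    if record["total_amount"] >= 300:
        record["customer_category"] = "high"
    elif record["total_amount"] >= 100:
        record["customer_category"] = "medium"
    else:
        record["customer_category"] = "low"
    return record

# dyf_with_fields is the DynamicFrame that already carries total_amount.
dyf_categorized = Map.apply(frame=dyf_with_fields, f=add_category)
```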
@joegenshlea6827 1 year ago
I've been doing on-premises ETL for a zillion years using a lot of SQL and Java, and am now finally moving to more modern tech. Anyway, the use case I'm struggling with is how to create a calculated field in one table based on data from a second (or third) table (data frame). For example, suppose there is a payment table and an order table (an order may have many payments). How would I add "total_payments" to the order data frame that is the sum of payments from the payment table? Easy in SQL, but PySpark is a steeper learning curve.
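
A hedged PySpark sketch of this use case, aggregating payments per order and joining the total back onto the order DataFrame. The table and column names (orders, payments, order_id, payment_amount) are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders_df = spark.table("orders")      # hypothetical order table
payments_df = spark.table("payments")  # hypothetical payment table

# Sum payments per order, then left-join the total onto the orders.
payment_totals = (
    payments_df
    .groupBy("order_id")
    .agg(F.sum("payment_amount").alias("total_payments"))
)

orders_with_totals = orders_df.join(payment_totals, on="order_id", how="left")
orders_with_totals.show(5)
```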
@viswanath311 1 year ago
Good video. Is there any way we can consume input from AWS Athena instead of S3 files? (We don't have permission to the S3 bucket.) Also, the pipeline runs for 30-60 mins. The task is to read from Athena and upsert to Snowflake. Is there any good cost-effective service in AWS that can be used? Can't use Lambda here because of the 15-minute time limit.
@DataEngUncomplicated 1 year ago
Hi viswanath, I think I have a solution that could work for your particular use case and limitations. You can use the AWS SDK for pandas to query AWS Athena within a Glue job: aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.athena.read_sql_query.html#awswrangler.athena.read_sql_query You can then write your results to Snowflake within your AWS Glue job. I'm not aware of a native PySpark connector for AWS Athena (I could be wrong). Hopefully this helps.
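
A rough sketch of the suggestion above, assuming the Glue job has the awswrangler and snowflake-connector-python packages available; the query, connection parameters, and table names are placeholders, and the upsert logic itself is left out:

```python
import awswrangler as wr
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Query Athena directly and get a pandas DataFrame back.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table",
    database="my_athena_database",
)

# Write the frame to a Snowflake staging table; merge/upsert into the
# target table would be a separate SQL step.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="MY_DB",
    schema="PUBLIC",
)
write_pandas(conn, df, table_name="STAGING_TABLE")
conn.close()
```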
@viswanath311 1 year ago
@@DataEngUncomplicated Thank you. I will try this.