AWS Glue PySpark: Calculate Fields

2,203 views

DataEng Uncomplicated

1 day ago

This is a technical tutorial on how to calculate new fields in AWS Glue using PySpark. It covers three examples that leverage the Map transform on DynamicFrames: how to calculate a new field with a constant value, how to calculate a new field based on the values in another column, and how to calculate a datetime and write it to a new field.
Timeline:
00:00 Introduction
00:35 Create Field With Constant Value
03:30 Create Field from Existing Field
05:04 Calculate Datetime to Field
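
A minimal sketch of the approach described above, using the Glue Map transform on a DynamicFrame. The catalog database, table, and column names (price, quantity) are placeholders, not values from the video:

```python
import datetime

from awsglue.context import GlueContext
from awsglue.transforms import Map
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: a table registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="orders"
)

def add_fields(record):
    record["source"] = "glue_tutorial"                                # constant value
    record["total_amount"] = record["price"] * record["quantity"]     # derived from existing fields
    record["processed_at"] = datetime.datetime.now().isoformat()      # datetime written to a new field
    return record

# Map applies the function to every record in the DynamicFrame.
dyf_with_fields = Map.apply(frame=dyf, f=add_fields)
dyf_with_fields.toDF().show(5)
```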

Comments: 7
@jmclachlan 11 months ago
Thanks!
@DataEngUncomplicated 11 months ago
Wow, thanks for the super thanks and supporting the channel.
@VijayKumar-tr8ki 3 months ago
Thank you for the great work. I am new to Glue and your videos are a great help. I was able to create a derived column based on this video, e.g. a new column total_amount which is equal to price * quantity. Now in the next step I want to categorize the customers based on total_amount, i.e. if total_amount =300 and total_amount
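
One possible way to bucket customers by total_amount, continuing from the Map sketch above; the comment is cut off, so the threshold values and category labels here are assumptions, not from the original question:

```python
from awsglue.transforms import Map

def add_category(record):
    # Assumed thresholds -- adjust to the real business rules.
    if record["total_amount"] >= 300:
        record["customer_category"] = "high"
    elif record["total_amount"] >= 100:
        record["customer_category"] = "medium"
    else:
        record["customer_category"] = "low"
    return record

# dyf_with_fields is the DynamicFrame that already carries total_amount.
dyf_categorized = Map.apply(frame=dyf_with_fields, f=add_category)
```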
@joegenshlea6827 1 year ago
I've been doing on-premises ETL for a zillion years using a lot of SQL and Java, and am now finally moving to more modern tech. Anyway, the use case I'm struggling with is how to create a calculated field in one table based on data from a second (or third) table (data frame). For example, suppose there is a payment table and an order table (an order may have many payments). How would I add "total_payments" to the order data frame that is the sum of payments from the payment table? Easy in SQL, but PySpark is a steeper learning curve.
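
A hedged PySpark sketch of this use case, aggregating payments per order and joining the total back onto the order DataFrame. The table and column names (orders, payments, order_id, payment_amount) are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders_df = spark.table("orders")      # hypothetical order table
payments_df = spark.table("payments")  # hypothetical payment table

# Sum payments per order, then left-join the total onto the orders.
payment_totals = (
    payments_df
    .groupBy("order_id")
    .agg(F.sum("payment_amount").alias("total_payments"))
)

orders_with_totals = orders_df.join(payment_totals, on="order_id", how="left")
orders_with_totals.show(5)
```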
@viswanath311 1 year ago
Good video. Is there any way we can consume input from AWS Athena instead of S3 files? (We don't have permission to the S3 bucket.) Also, the pipeline runs for 30-60 mins. The task is to read from Athena and upsert to Snowflake. Is there any good cost-effective service in AWS that can be used? Can't use Lambda here because of the 15-minute time limit.
@DataEngUncomplicated 1 year ago
Hi viswanath, I think I have a solution that could work for your particular use case and limitations. You can use the AWS SDK for pandas to query AWS Athena within a Glue job: aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.athena.read_sql_query.html#awswrangler.athena.read_sql_query You can then write your results to Snowflake within your AWS Glue job. I'm not aware of a native PySpark connector for AWS Athena (I could be wrong). Hopefully this helps.
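
A rough sketch of the suggestion above, assuming the Glue job has the awswrangler and snowflake-connector-python packages available; the query, connection parameters, and table names are placeholders, and the upsert logic itself is left out:

```python
import awswrangler as wr
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Query Athena directly and get a pandas DataFrame back.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table",
    database="my_athena_database",
)

# Write the frame to a Snowflake staging table; merge/upsert into the
# target table would be a separate SQL step.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="MY_DB",
    schema="PUBLIC",
)
write_pandas(conn, df, table_name="STAGING_TABLE")
conn.close()
```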
@viswanath311 1 year ago
@@DataEngUncomplicated Thank you. I will try this.