Manage AWS Glue Jobs with Step Functions

13,748 views

2 years ago

This video explains from scratch how to use AWS Step Functions to orchestrate multiple Glue ETL jobs.
Prerequisite:
------------------------
AWS Glue Workflow in-depth intuition with Lab
• AWS Glue Workflow in-d...
Build and automate Serverless DataLake using an AWS Glue , Lambda , Cloudwatch
• Build and automate Ser...
Step 1:
--------
Create a crawler
Step 2:
--------
Start the crawler and get the crawler state in Step Functions
Step 3:
--------
Inspect the JSON output of the GetCrawler component to build the if-else (Choice) condition
Step 4:
--------
Create a waiter block
Step 5:
--------
Add the Glue StartJobRun component, which runs the Glue job script below.
(Configure the block as a synchronous component, i.e. call the service and have Step Functions wait for the job to complete.)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read the source table from the Glue Data Catalog (fill in the database and table names).
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="{}", table_name="{}", transformation_ctx="datasource0")
# Write the data to S3 as Parquet (fill in the bucket and prefix).
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=datasource0, connection_type="s3",
    connection_options={"path": "s3://{}/{}/"}, format="parquet",
    transformation_ctx="datasink4")
job.commit()
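The script above leaves the database, table, and S3 path as "{}" placeholders. One way to avoid hard-coding them is to pass them as Glue job arguments. The sketch below is only an illustration, not the method shown in the video; the argument names database_name, table_name, and output_path and the helper build_job_arguments are hypothetical.

```python
def build_job_arguments(database_name, table_name, output_path):
    """Build the Arguments map for a Glue StartJobRun call.

    Glue expects each job argument key to start with "--".
    The three argument names are assumptions made for this sketch.
    """
    return {
        "--database_name": database_name,
        "--table_name": table_name,
        "--output_path": output_path,
    }


# Inside the Glue script, the same names (without "--") would be read with:
#   args = getResolvedOptions(
#       sys.argv, ["JOB_NAME", "database_name", "table_name", "output_path"])
#   database = args["database_name"]

if __name__ == "__main__":
    # Placeholder values; replace them with your own catalog and bucket names.
    print(build_job_arguments("my_database", "my_table", "s3://my-bucket/output/"))
```

In the state machine, the same map could go under "Arguments" inside the "Parameters" of the Glue StartJobRun state, next to "JobName".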
Reusable Step Functions JSON:
-------------------------------
{
  "Comment": "A description of my state machine",
  "StartAt": "StartCrawler",
  "States": {
    "StartCrawler": {
      "Type": "Task",
      "Parameters": {
        "Name": "{Write the Crawler name here}"
      },
      "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
      "Next": "GetCrawler"
    },
    "GetCrawler": {
      "Type": "Task",
      "Parameters": {
        "Name": "{Write the Crawler name here}"
      },
      "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
      "Next": "Choice"
    },
    "Choice": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Crawler.State",
          "StringEquals": "RUNNING",
          "Next": "Wait"
        }
      ],
      "Default": "Glue StartJobRun"
    },
    "Wait": {
      "Type": "Wait",
      "Seconds": 5,
      "Next": "GetCrawler"
    },
    "Glue StartJobRun": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "{Write the Job name here}"
      },
      "End": true
    }
  }
}
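The control flow of the state machine above can also be read as a plain polling loop. The following is a minimal sketch of that flow in Python, assuming a client object that behaves like boto3.client("glue"); the function name run_crawler_then_job and its parameters are hypothetical, and the client is passed in so the loop can be tried with a stand-in object.

```python
import time


def run_crawler_then_job(glue, crawler_name, job_name, wait_seconds=5, sleep=time.sleep):
    """Mirror the state machine: StartCrawler, poll GetCrawler, then StartJobRun.

    `glue` is expected to behave like boto3.client("glue").
    """
    glue.start_crawler(Name=crawler_name)  # StartCrawler state

    while True:
        response = glue.get_crawler(Name=crawler_name)  # GetCrawler state
        state = response["Crawler"]["State"]  # same value as $.Crawler.State
        if state != "RUNNING":  # Choice state: take the Default branch
            break
        sleep(wait_seconds)  # Wait state, then back to GetCrawler

    # Glue StartJobRun state. Unlike the .sync integration, this call returns
    # as soon as the run has started; it does not wait for the job to finish.
    return glue.start_job_run(JobName=job_name)["JobRunId"]
```

Glue crawlers also report a STOPPING state; like the Choice state above, this sketch treats anything other than RUNNING as finished.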
Learn AWS Step Functions from Scratch:
• AWS Step Functions Sim...
Check this playlist for more AWS Projects in Big Data domain:
• Demystifying Data Engi...

Comments: 17

@josemanuelgutierrez4095 1 year ago

I have a question, my friend. I have 2 CSV files in my bucket, and when I run my crawler, the tables I see are named after the two CSVs, not after my bucket as in your video. Do you think some steps are missing? Thx

@SimonLopez-hj2cj 2 months ago

How can I know the JSON output without executing the state machine?

@StephenNyatsine 4 months ago

Very helpful, but can anyone assist? I am getting the error below: { "error": "States.Runtime", "cause": "Invalid path '$.Crawler.State': The choice state's condition path references an invalid value." }

@FaresTabet 1 year ago

Great video! Well prepared with examples, it helped me a lot

@KnowledgeAmplifier1 1 year ago

Glad to know the video is helpful to you Fares Tabet! Happy Learning :-)

@youdontneedmyname2298 1 year ago

Thank you!

@KnowledgeAmplifier1 1 year ago

You are welcome, buddy! Happy Learning

@InvestorKiddd 1 year ago

Hi, very nice video, but is there any way to provide the database name and table in Glue as an input in Step Functions instead of hard-coding them inside the script? Same question for the crawler: can we provide an S3 object as an input?

@InvestorKiddd 7 months ago

@VinayGanesh-nk2lk yes, you need to create a script file, save it in an S3 bucket, and forward that key to Glue

@InvestorKiddd 7 months ago

@VinayGanesh-nk2lk will share on Monday, remind me once if I forget

@josemanuelgutierrez4095 1 year ago

Hi my friend, I have a question: the code that you put inside the Glue job converts CSV to Parquet, right?

@KnowledgeAmplifier1 1 year ago

@josemanuelgutierrez4095 yes correct

@josemanuelgutierrez4095 1 year ago

@KnowledgeAmplifier1 Thank you, my friend. I like your videos; they help me improve my skills a lot :v

@KnowledgeAmplifier1 1 year ago

@josemanuelgutierrez4095 glad to hear that. Happy Learning

@DanielWeikert 1 year ago

Do you know of (or use) good documentation that shows what the JSON response looks like? It is needed in order to refer to e.g. $.Crawler.State. Thx

@KnowledgeAmplifier1 1 year ago

Hello Daniel Weikert, you can check the AWS documentation ( docs.aws.amazon.com/step-functions/latest/dg/welcome.html ), or a simpler way is to use a Pass block to check the response and then code accordingly, as I explained in this video 😊
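The path lookup discussed in this thread can be tried locally. The sketch below resolves a simple "$.a.b" path against a sample document. Note that resolve_path is a hypothetical helper that handles only dotted paths, not the full JSONPath syntax that Step Functions accepts, and the sample output is hand-written and abbreviated to the two fields used here.

```python
def resolve_path(document, path):
    """Resolve a simple dotted path such as "$.Crawler.State".

    Only "$" followed by dot-separated keys is handled. A missing key raises
    KeyError, loosely comparable to the "Invalid path" runtime error.
    """
    if path != "$" and not path.startswith("$."):
        raise ValueError(f"unsupported path: {path!r}")
    current = document
    for key in path.split(".")[1:]:
        current = current[key]
    return current


# Abbreviated, hand-written sample of a GetCrawler result.
sample_output = {"Crawler": {"Name": "my-crawler", "State": "RUNNING"}}

print(resolve_path(sample_output, "$.Crawler.State"))
```

If the input reaching the Choice state has no top-level "Crawler" object (for example, because a ResultPath or OutputPath setting moved or dropped it), the lookup may fail in the same way.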