Build and automate a Serverless Data Lake using AWS Glue, Lambda, and CloudWatch

Views: 8,412

2 years ago

This video explains in depth, from scratch, how to create a fully automated data cataloging and ETL pipeline to transform your data.
Prerequisite:
-----------------------
Implement a CloudWatch Events Rule That Calls an AWS Lambda Function
Using AWS Lambda with Amazon CloudWatch Events | Send notification when ec2 stops
Pipeline design with monitoring and alert functionalities using Cloudwatch Alarm , EC2 & Lambda
Enable CloudWatch logs for API Gateway | Monitoring and Logging API Activity
Invoking State Machine with CloudWatch
AWS Glue Workflow in-depth intuition with Lab
An automated data pipeline using Lambda, S3 and Glue - Big Data with Cloud Computing
Lambda Code to trigger Glue Crawler:
---------------------------------------------------------------
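import json
import boto3

# Create the Glue client once, outside the handler, so warm invocations reuse it.
glue = boto3.client('glue')

def lambda_handler(event, context):
    # Start the crawler; replace the placeholder with your crawler's name.
    response = glue.start_crawler(
        Name='{Put the Name of the Glue Crawler here}'
    )
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }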
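If the crawler is already running when the Lambda fires (for example, when several files land in quick succession), start_crawler raises CrawlerRunningException and the invocation fails. A minimal sketch of one way to tolerate that, assuming a helper named start_crawler_safely and a crawler called "my-crawler" (both are illustrative names, not from the video):

```python
import json


def start_crawler_safely(glue_client, crawler_name):
    """Start a Glue crawler; report instead of failing if it is already running."""
    try:
        glue_client.start_crawler(Name=crawler_name)
        return "started"
    except glue_client.exceptions.CrawlerRunningException:
        # Glue rejects a second start while a crawl is still in progress.
        return "already_running"


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be exercised without AWS.
    import boto3

    glue = boto3.client("glue")
    status = start_crawler_safely(glue, "my-crawler")  # hypothetical crawler name
    return {"statusCode": 200, "body": json.dumps({"crawler": status})}
```

Passing the client in as a parameter keeps the helper easy to test with a stand-in object; whether to skip or retry when the crawler is busy is a design choice that depends on your pipeline.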
Lambda Code to trigger Glue Job:
----------------------------------------------------------
import json
import boto3

def lambda_handler(event, context):
    glue = boto3.client('glue')
    # Start the ETL job; replace the placeholder with your Glue job's name.
    response = glue.start_job_run(JobName="{Put the Glue ETL Job name here}")
    print("Lambda Invoke")
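Since this Lambda is meant to run when the crawler succeeds, it can also double-check the incoming event before starting the job. A rough sketch, assuming the event carries the same detail.state field that the crawler rule in this description matches on; the function name start_job_if_crawler_succeeded is illustrative only:

```python
def start_job_if_crawler_succeeded(glue_client, event, job_name):
    """Start the Glue job only when the event reports a successful crawl.

    Returns the JobRunId from start_job_run, or None if the job was not started.
    """
    detail = event.get("detail", {})
    if detail.get("state") != "Succeeded":
        # Not a crawler-success event, so leave the job alone.
        return None
    response = glue_client.start_job_run(JobName=job_name)
    return response["JobRunId"]
```

Inside a handler this could be called as start_job_if_crawler_succeeded(boto3.client('glue'), event, "<your job name>"). Logging the returned JobRunId makes it easier to find the run in the Glue console afterwards.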
Glue Code:
---------------------
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the table that the crawler registered in the Glue Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "{}", table_name = "{}", transformation_ctx = "datasource0")

# Write the data back to S3 in Parquet format.
datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3",
    connection_options = {"path": "s3://{}/{}/"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
CloudWatch rule to trigger the Lambda on success of the Glue Crawler:
-----------------------------------------------------------------------------------------------------------------------
{
  "source": [
    "aws.glue"
  ],
  "detail-type": [
    "Glue Crawler State Change"
  ],
  "detail": {
    "state": [
      "Succeeded"
    ],
    "crawlerName": [
      "{Put your Crawler Name here}"
    ]
  }
}
CloudWatch rule to trigger SNS on success of the Glue Job:
---------------------------------------------------------------------------------------------------------
{
  "source": [
    "aws.glue"
  ],
  "detail-type": [
    "Glue Job State Change"
  ],
  "detail": {
    "jobName": [
      "{Put your Job name here}"
    ],
    "state": [
      "SUCCEEDED"
    ]
  }
}
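The two event patterns above can also be built and registered from code instead of the console. A minimal sketch, assuming helper names crawler_success_pattern, job_success_pattern and create_rule (all illustrative); the only AWS call used is put_rule on the boto3 "events" client. Note the different casing of the state value in the two patterns ("Succeeded" for the crawler, "SUCCEEDED" for the job), exactly as in the JSON above:

```python
import json


def crawler_success_pattern(crawler_name):
    """Event pattern matching a successful run of one Glue crawler."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Crawler State Change"],
        "detail": {"state": ["Succeeded"], "crawlerName": [crawler_name]},
    }


def job_success_pattern(job_name):
    """Event pattern matching a successful run of one Glue job."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"jobName": [job_name], "state": ["SUCCEEDED"]},
    }


def create_rule(events_client, rule_name, pattern):
    """Create (or update) an enabled rule and return its ARN."""
    response = events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    return response["RuleArn"]
```

With real credentials this might be used as create_rule(boto3.client("events"), "crawler-done", crawler_success_pattern("<your crawler name>")). The rule still needs a target (the Lambda or the SNS topic) attached separately.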
Check this playlist for more AWS projects in the Big Data domain:
• Demystifying Data Engi...

Comments: 17

@sjdreams_13615 1 year ago

It’s a great job done by you explaining the serverless Glue ETL process. It’s the best video I found on KZfaq on this topic so far 👍🏻

@KnowledgeAmplifier1 1 year ago

Thank you so much for your positive feedback, Sravan Kumar Jalluri! I am glad to hear that my video was helpful to you. Happy Learning

@Someonner 1 year ago

Most underrated video in AWS.

@deepakrawat5065 1 year ago

Thank you Knowledge Amplifier for sharing your knowledge in a simple and clear way

@KnowledgeAmplifier1 1 year ago

You are welcome Deepak Rawat! Happy Learning

@rahulkakade1579 1 year ago

Hats off to you, brother. What a detailed explanation! Thanks for sharing this

@KnowledgeAmplifier1 1 year ago

Thank you Rahul kakade15 for your inspiring comment! Happy Learning

@adesuraj4649 1 year ago

Great explanation 🙂

@KnowledgeAmplifier1 1 year ago

Thank you Ade Suraj! Happy Learning

@SourabhDattalkar89 11 months ago

Great video! You have explained a 5-hour process in a few minutes 😂

@KnowledgeAmplifier1 10 months ago

Glad it helped!

@nagasabsreeshgontla4628 1 year ago

Running the crawler every time a CSV is uploaded is not required, right? It also increases the cost of the crawler

@rahulkakade1579 1 year ago

Can you please make a video on what SNS, SQS, and EventBridge are and when to use which 🙂

@KnowledgeAmplifier1 1 year ago

OK, sure, Rahul kakade15, noted in the backlog...

@rahulkakade1579 1 year ago

@KnowledgeAmplifier1 thanks buddy

@Ashisagrawall 2 years ago

Hello, I need to talk to you. Please let me know how to contact you. I need some help related to an application we are building, and we want to use Snowflake. Please let me know

@KnowledgeAmplifier1 2 years ago

Please post your doubt or requirements here, buddy. If I have knowledge in that domain, I will surely try to help you out :-)