AWS Tutorials - Using Concurrent AWS Glue Jobs

5,747 views

AWS Tutorials

Days ago

Script Example - github.com/aws-dojo/analytics...
Concurrent Glue job runs are a scalable and maintainable way to ingest data at scale. Learn how to configure and run Glue jobs for concurrent execution.
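A minimal sketch of the idea in the description, assuming boto3's Glue client and its `start_job_run` call: one job definition is started several times, each run with its own arguments. The job name and the `--source_table` / `--target_table` argument names are hypothetical and not taken from the linked script. The job's max concurrency setting must be at least the number of runs started at once.

```python
def start_concurrent_runs(glue, job_name, table_pairs):
    """Start one run of the same Glue job per (source, target) table pair.

    `glue` is expected to be a boto3 Glue client, e.g. boto3.client("glue").
    Returns the run IDs in the order the runs were started.
    """
    run_ids = []
    for source, target in table_pairs:
        response = glue.start_job_run(
            JobName=job_name,
            # Hypothetical argument names; use whatever your job script reads.
            Arguments={
                "--source_table": source,
                "--target_table": target,
            },
        )
        run_ids.append(response["JobRunId"])
    return run_ids


# Usage (requires AWS credentials and an existing job):
#   import boto3
#   glue = boto3.client("glue")
#   start_concurrent_runs(glue, "ingest-job", [("raw.orders", "clean.orders")])
```

The client is passed in rather than created inside the function so the same code can be exercised with a stub before it touches a real account.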

Comments: 30
@trinath89 1 year ago
Thanks a lot. Everything is presented in as simple a format as possible for us to understand.
@ashishvishwakarma8790 1 year ago
Excellent explanation. I'm working on a similar use case; however, I need to run the same job multiple times for the same table (writing to different partitions). The problem I'm facing is that the moment one of the parallel job executions finishes, it wipes the temporary directory (created by Spark) in the table directory. That deletes the temp data of the other executions that are still writing to the same table, which results in data loss. Do you have a solution to that problem?
2 years ago
Pleased to return again, this time to point out an additional limitation to take into account: the IPs available in the VPC. The Glue job occupies EC2 instances, and if there are not enough IPs the job will crash, so it is important to verify the available IPs before parallelizing.
@AWSTutorialsOnline 2 years ago
I agree. Glue will occupy IPs only if you are working with VPC-based resources.
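The IP check mentioned above can be sketched as simple arithmetic. This assumes roughly one IP address per worker for each run that connects to VPC resources; that ratio is an assumption for illustration, so it is left as a parameter. The function names are hypothetical.

```python
def ips_needed(concurrent_runs, workers_per_run, ips_per_worker=1):
    """Estimate how many subnet IPs a burst of concurrent runs may occupy.

    ips_per_worker=1 is an assumption, not a documented guarantee.
    """
    return concurrent_runs * workers_per_run * ips_per_worker


def fits_in_subnet(available_ips, concurrent_runs, workers_per_run, ips_per_worker=1):
    """Return True when the estimated demand does not exceed the free IPs."""
    return ips_needed(concurrent_runs, workers_per_run, ips_per_worker) <= available_ips


# Example: 4 concurrent runs with 10 workers each against 59 free addresses.
print(ips_needed(4, 10))          # 40
print(fits_in_subnet(59, 4, 10))  # True
print(fits_in_subnet(59, 8, 10))  # False
```

The free address count for a subnet can be read from the `AvailableIpAddressCount` field returned by the EC2 `describe_subnets` call.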
@hsz7338 2 years ago
Hello, thank you for the tutorial. It is fantastic as always. A question about where the concurrent (parallel) job runs actually execute: do they run in one serverless Glue compute cluster or in multiple serverless Glue compute clusters? If it is the former, it is concurrent but not pure parallelisation. If it is the latter, then the Glue job we create acts as a job definition, and that job definition can be deployed across multiple serverless compute clusters in parallel (within the max concurrency)?
@AWSTutorialsOnline 2 years ago
It is like one job definition that can run as more than one instance at the same time, no matter how you start the job.
@pulkitdikshit9474 1 year ago
Hi, I have a Lambda function to which I pass a list of tables, and the Lambda triggers a Glue job. The Glue job is configured with 2 workers and max concurrency = 1. Later I saw that only one element (one table in the list passed to the Lambda) gets executed. What is the reason for this? Will it cost more if I increase concurrency? In this case, is it important to keep max concurrency equal to the length of the list (the number of elements in the Python list)? If not, what is the best approach so that the Glue job processes every table in the list passed to the Lambda? FYI, I am storing the results in an S3 bucket. Please do reply. Thanks in advance :)
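One plausible reading of the question above: with max concurrency set to 1, the first `start_job_run` call succeeds and later calls are rejected while that run is still active. The sketch below starts one run per table and records which tables were rejected. It assumes a boto3 Glue client, whose `exceptions.ConcurrentRunsExceededException` is raised in that situation; the `--table_name` argument and the function name are hypothetical.

```python
def start_runs_for_tables(glue, job_name, tables):
    """Start one Glue job run per table.

    Returns (started, rejected): a dict of table -> run ID, and the list of
    tables whose run was refused because max concurrency was reached.
    """
    started = {}
    rejected = []
    for table in tables:
        try:
            response = glue.start_job_run(
                JobName=job_name,
                Arguments={"--table_name": table},  # hypothetical argument name
            )
        except glue.exceptions.ConcurrentRunsExceededException:
            # The job already has as many active runs as max concurrency allows.
            rejected.append(table)
            continue
        started[table] = response["JobRunId"]
    return started, rejected


# Usage inside a Lambda handler (requires AWS credentials):
#   import boto3
#   started, rejected = start_runs_for_tables(
#       boto3.client("glue"), "ingest-job", event["tables"])
```

Rejected tables can then be retried later, or max concurrency can be raised so that all runs start together. The alternative is a single run that loops over the whole list inside the job script.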
@rtzkdt 3 months ago
Nice tutorial, thanks. Can it run in sequence? I want to run the job with different parameters, but I want the second run to start only after the first one has finished, like a queue. Or must we set max concurrency to 1 and handle the retry ourselves if a max-concurrency error occurs?
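A queue-like sequence can be sketched from the caller's side by starting a run, polling its state, and starting the next run only after the previous one reaches a terminal state. This assumes a boto3 Glue client with `start_job_run` and `get_job_run`; the function name, polling interval and stop-on-failure behaviour are illustrative choices, not something shown in the video.

```python
import time

# Run states after which a Glue job run will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_in_sequence(glue, job_name, argument_sets, poll_seconds=30, sleep=time.sleep):
    """Run the same Glue job once per argument set, one run at a time.

    Returns a list of (run_id, final_state). Stops early if a run does not
    succeed, so later argument sets are not started after a failure.
    """
    results = []
    for arguments in argument_sets:
        run_id = glue.start_job_run(JobName=job_name, Arguments=arguments)["JobRunId"]
        while True:
            run = glue.get_job_run(JobName=job_name, RunId=run_id)
            state = run["JobRun"]["JobRunState"]
            if state in TERMINAL_STATES:
                break
            sleep(poll_seconds)  # wait before polling again
        results.append((run_id, state))
        if state != "SUCCEEDED":
            break
    return results
```

Because each run is awaited before the next starts, this works with max concurrency left at 1 and no retry logic. A Step Functions state machine that waits for each job task to complete is another way to get the same ordering.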
@zubinbal1880 4 months ago
Hi Sir, is it possible to enable job bookmarks for concurrent job runs of a single script with Step Functions?
@MahimDashoraHackR 11 months ago
What happens if the Python script itself uses multiprocessing to achieve concurrency?
@gatsbylee2773 2 years ago
I got some idea of what max concurrency = 4 is for. Based on your example, you still need to create multiple AWS Glue jobs (more precisely, 200 "runs" of one job), since you set "Source Table Name" and "Target Table Name" on the same Glue job. Basically, you can group jobs into one job by increasing max concurrency, but you still need to create 200 runs of that job. And you can still share code across 200 jobs or 200 runs. I really appreciate your video. It helped me understand what the parameter is for. Thank you.
@AWSTutorialsOnline 2 years ago
Yeah. It is one code base and one configuration for the job, but you are running multiple instances of it with different parameters.
@anti2117 1 year ago
Thank you for this video, very insightful. How does this work with job bookmarks (transformation_ctx)?
@victorgueorguiev6500 2 years ago
It turns out you can't really use Glue Workflows for running them in parallel. When trying to add a job multiple times in different nodes in the workflow, it throws an error that the "action contains duplicate job name", which prevents one from adding the same job more than once in sequence or in parallel. Really silly, since Glue inherently lets you have concurrent runs. Luckily Step Functions works fine, but really disappointing that Glue natively doesn't support this in Workflows. Maybe I'm doing something wrong?
@victorgueorguiev6500 2 years ago
Thank you for the video by the way! It was really informative
@AWSTutorialsOnline 2 years ago
You are right. Unfortunately workflow does not allow running parallel jobs.
2 years ago
Nice tutorial. I have just made 5 jobs, but I will try the third approach. My doubt is what happens when the size of the table is variable. Can the number of workers change?
@AWSTutorialsOnline 2 years ago
I don't think you can change job capacity at run time when calling the job from a Glue Workflow or Step Function. However, if you are calling the job using the CLI or code, then you do have the opportunity to change the allocated capacity, max capacity and worker type.
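When the job is started from code, the per-run override mentioned above can be sketched with the `WorkerType` and `NumberOfWorkers` parameters that boto3's `start_job_run` accepts. The size thresholds, worker counts, function names and the `--table_name` argument below are made-up illustrations, not recommendations.

```python
def pick_capacity(table_size_gb):
    """Map a table size to run-level capacity settings (illustrative thresholds)."""
    if table_size_gb < 10:
        return {"WorkerType": "G.1X", "NumberOfWorkers": 2}
    if table_size_gb < 100:
        return {"WorkerType": "G.1X", "NumberOfWorkers": 10}
    return {"WorkerType": "G.2X", "NumberOfWorkers": 10}


def start_sized_run(glue, job_name, table, table_size_gb):
    """Start one run whose capacity overrides the job's configured default."""
    capacity = pick_capacity(table_size_gb)
    response = glue.start_job_run(
        JobName=job_name,
        Arguments={"--table_name": table},  # hypothetical argument name
        **capacity,
    )
    return response["JobRunId"]


# Usage (requires AWS credentials):
#   import boto3
#   start_sized_run(boto3.client("glue"), "ingest-job", "orders", 250)
```

Keeping the sizing rule in its own function makes it easy to adjust the thresholds without touching the call that starts the run.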
@sheikirfan2652 1 year ago
Nice tutorial. One question: how do I configure the Glue job to run multiple SQL queries in parallel instead of reading from multiple tables?
@AWSTutorialsOnline 1 year ago
I think you are looking for this one - kzfaq.info/get/bejne/h65hfcZqvNjUaY0.html
@sheikirfan2652 1 year ago
@AWSTutorialsOnline Thanks, brother. I will check and let you know.
@sheikirfan2652 1 year ago
@AWSTutorialsOnline Thanks. I looked into it, and it seems that video explains how to have parallel runs corresponding to one column. My case is different: we need to pass a SQL query as a job parameter, and using that job parameter I should be able to pass more than one SQL query, either through the CLI or a Step Function. For example, if my job concurrency is 2, the job should run in parallel with queries like "select * from emp inner join students where std_id = 5" and "select * from emp inner join class where class_id = 10" and write the results to their respective S3 locations.
@sheikirfan2652 1 year ago
Also, I have a solution where I can run more than one SQL query in my Glue job, but that approach runs sequentially, not in parallel.
@veerachegu 2 years ago
Can you please clarify: I have 15 data sets in one source. How do I run them concurrently from the raw layer to the cleansed layer? The script may differ based on DQ. In this scenario, how do I run a concurrent job?
@AWSTutorialsOnline 2 years ago
Is each job doing the same thing between the raw and cleansed layers?
@veerachegu 2 years ago
@AWSTutorialsOnline Yes
@veerachegu 2 years ago
You are implementing this through Step Functions. Can you please suggest how to do concurrent runs in a Glue workflow?
@siddharthsatapathy1366 2 years ago
Hello Sir, in the case of concurrent runs, how are resources shared between the different runs?
@AWSTutorialsOnline 2 years ago
Each run is allocated the same capacity as configured in the job.
@IranianButterfly 2 years ago
But there is a drawback here in terms of pricing. Let's say you have 20 tables and you run them concurrently, and each run finishes in 1 minute. G.1X would bill for a minimum of 10 minutes, which means you pay for 20*10 minutes instead of 20*1 minutes.
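The arithmetic in the comment above can be sketched as follows. The minimum billed duration is left as a parameter because it depends on the Glue version; the 10-minute figure is the commenter's assumption, so check current pricing for the version you run. The function name is hypothetical, and the result is in billed minutes per run, before multiplying by workers and price.

```python
def billed_minutes(run_minutes, minimum_minutes):
    """Sum billed minutes when every run is charged at least the minimum."""
    return sum(max(minutes, minimum_minutes) for minutes in run_minutes)


# 20 separate runs of 1 minute each, with a 10-minute minimum per run.
print(billed_minutes([1] * 20, minimum_minutes=10))  # 200

# One run that processes all 20 tables back to back in 20 minutes.
print(billed_minutes([20], minimum_minutes=10))      # 20

# The same 20 short runs if the minimum were 1 minute instead.
print(billed_minutes([1] * 20, minimum_minutes=1))   # 20
```

The gap only appears when individual runs are shorter than the billing minimum, so it matters most for many small tables.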