AWS re:Invent 2020: Serverless data preparation with AWS Glue

Views: 14,563

A day ago

The first step in an analytics or machine learning project is to prepare your data to obtain quality results. AWS Glue is a serverless extract, transform, and load (ETL) service with a recent series of innovations that make data preparation simpler, faster, and cheaper. Join this session and listen to AWS Glue general manager Mehul A. Shah showcase the service’s new visual experience that makes it easier to author, debug, and manage your ETL jobs. This session dives deep on the new AWS Glue engine that offers 10 times faster job start times and improved support for data extraction, streaming ETL, orchestrating ETL workflows, and more.
Learn more about re:Invent 2020 at bit.ly/3c4NSdY
More AWS videos bit.ly/2O3zS75
More AWS events videos bit.ly/316g9t4
#AWS #AWSEvents

Comments: 13

@aruncp1980 1 year ago

Excellent material and presentation.

@nexus888 3 years ago

Great presentation, thank you.

@mangeshxjoshi 3 years ago

Excellent presentation. Very much appreciated.

@DJ-ws6je 3 years ago

lots of data is an understatement

@sukulmahadik0303 3 years ago

*Notes: Part 5:*
*AWS Glue custom connectors:* To be introduced in December 2020. This feature lets us create our own custom connectors for our data sources and use them in our Glue jobs. We can also easily deploy partner-developed connectors from AWS Marketplace.
*AWS Glue DataBrew:* A new interface for cleaning and normalizing our data. It profiles our data to detect patterns and anomalies, and we can choose from over 250 built-in cleaning transformations and apply them visually at scale.
*AWS Glue Schema Registry:* Centrally discover, control, and evolve our data schemas. It lets us enforce schemas and manage schema evolution to prevent downstream application failures. This helps improve data quality for our data streaming applications, and it integrates easily with Amazon MSK, Kinesis Data Streams, and Kinesis Data Analytics for Apache Flink.
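To make the schema enforcement and schema evolution idea in the notes above concrete, here is a minimal sketch in plain Python. It is not the AWS Glue Schema Registry API. The names `ToySchemaRegistry`, `is_compatible`, and `IncompatibleSchemaError` are hypothetical, and the compatibility rule (existing fields must keep their type, new fields may be added) is a simplifying assumption rather than one of the real service's compatibility modes.

```python
class IncompatibleSchemaError(Exception):
    """Raised when a new schema version would break existing consumers."""


def is_compatible(old: dict, new: dict) -> bool:
    # Assumed rule: every existing field must survive with the same type.
    # Extra fields in the new schema are allowed.
    return all(new.get(name) == type_name for name, type_name in old.items())


class ToySchemaRegistry:
    """In-memory stand-in for a central schema registry (illustration only)."""

    def __init__(self):
        self.versions = {}  # schema name -> list of schema dicts, oldest first

    def register(self, name: str, schema: dict) -> int:
        """Register a schema and return its version number (1-based)."""
        history = self.versions.setdefault(name, [])
        if history and history[-1] == schema:
            return len(history)  # unchanged schema, no new version
        if history and not is_compatible(history[-1], schema):
            raise IncompatibleSchemaError(f"{name}: breaking change rejected")
        history.append(dict(schema))
        return len(history)

    def validate(self, name: str, record: dict) -> bool:
        """Check a record against the latest registered schema."""
        latest = self.versions[name][-1]
        return all(
            field in record and type(record[field]).__name__ == type_name
            for field, type_name in latest.items()
        )
```

A producer would call `register` before publishing, and a consumer would call `validate` on each incoming record. A breaking change is rejected at registration time, before it can reach downstream applications.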

@maa1dz1333q2eqER 2 years ago

Great presentation, thanks. Still would prefer a full 60 minutes.

@sanooosai 2 years ago

great

@sukulmahadik0303 3 years ago

*Notes: Part 2:*
*AWS Glue Components:*
1) Serverless ETL engine:
a. Serverless ETL engine based on Apache Spark.
b. Apache Spark or Python Shell jobs: we provide Spark or Python scripts and Glue takes care of the entire lifecycle of the job execution. Glue spins up the necessary cluster, runs the script, and shuts down the machines. We only pay for what we use.
c. A visual tool (Glue Studio) is also provided to create jobs interactively, and Glue compiles those jobs into Apache Spark scripts.
2) AWS Glue Data Catalog:
a. Centralised metadata store.
b. Fully managed.
c. Hive metastore compatible.
d. Many services, third-party partners, and open source tools are integrated with this catalog.
3) Crawlers:
a. Used to load and maintain the Data Catalog.
b. They infer the metadata (schema) of our tables.
c. They also support schema evolution through versioning.
4) Workflow management:
a. Orchestrates triggers, crawlers, and jobs.
b. Helps us build and monitor complex workflows for our pipeline in a reliable fashion.
*What is Glue used for? (Glue use cases)*
1) Building data lakes: Customers take all their data and store it on Amazon S3 (a ubiquitous, low-cost, highly durable object store). They break their data silos and use AWS Glue jobs and workflows to ingest data from those silos into S3 and process it from stage to stage. AWS Glue crawlers are used to load and maintain the Data Catalog. Customers also use the AWS Lake Formation service to secure the data lake. The data lake can then be accessed for analysis, business intelligence, or machine learning using tools like Athena, QuickSight, EMR, Redshift, SageMaker, etc.
2) Loading data warehouses: AWS Glue is also used to load data warehouses using traditional ETL processing.
3) Data preparation for AI/ML and data science workloads: cleaning and enriching data, extracting features, building training sets, etc. Data scientists also use notebooks connected to Glue for data exploration and experimentation.
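The crawler behaviour described in the notes above (infer a table's schema, then track schema evolution through versioning) can be illustrated with a small toy in plain Python. This is only a sketch under stated assumptions, not how Glue crawlers are implemented and not a Glue API. `infer_schema` and `ToyCatalog` are hypothetical names, and falling back to `str` on conflicting types is an assumed rule.

```python
def infer_schema(records: list) -> dict:
    """Infer a field -> type-name mapping from sample records."""
    schema = {}
    for record in records:
        for key, value in record.items():
            if value is None:
                continue  # a null tells us nothing about the type
            type_name = type(value).__name__
            if key not in schema:
                schema[key] = type_name
            elif schema[key] != type_name:
                schema[key] = "str"  # assumed fallback for conflicting types
    return schema


class ToyCatalog:
    """In-memory stand-in for a metadata catalog with versioned schemas."""

    def __init__(self):
        self.tables = {}  # table name -> list of schema versions, oldest first

    def crawl(self, table: str, records: list) -> int:
        """Infer the schema and store a new version only if it changed.

        Returns the current version number (1-based).
        """
        schema = infer_schema(records)
        versions = self.tables.setdefault(table, [])
        if not versions or versions[-1] != schema:
            versions.append(schema)
        return len(versions)
```

Re-crawling unchanged data keeps the same version, while a new column produces a new version and leaves the earlier versions available for comparison.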

@mangeshxjoshi 3 years ago

Hi Sir, we have been comparing two cloud-based ETL tools, AWS Glue and Azure Data Factory. Scenario: 1) We need to process / extract S3 files, and the files are large (possibly millions of records, more than 100 MB). How do we process such large files through AWS Glue? Could you share some ideas? I believe AWS Glue 2.0 is the better fit here compared to Azure Data Factory. 2) How can S3 file encryption / decryption be handled through AWS Glue? We are looking for encryption key management through AWS Glue. How are encrypted S3 files processed into an AWS RDS PostgreSQL engine? We need some thoughts on the encryption mechanism in AWS Glue. Regards, Mangesh