AWS Glue: Read CSV Files From AWS S3 Without Glue Catalog

29,741 views

DataEng Uncomplicated

1 year ago

This video is about how to read CSV data files stored in AWS S3 with AWS Glue when your data is not defined in the AWS Glue Catalog. The video uses the create_dynamic_frame_from_options method.
AWS Documentation: docs.aws.amazon.com/glue/late...
Code example: github.com/AdrianoNicolucci/d...
#aws, #awsglue
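
A minimal sketch of the approach the video demonstrates, assuming a Glue Spark job environment; the bucket path and CSV options are placeholders, not the exact code from the linked repo:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read CSV files straight from S3; no Glue Catalog table required.
dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)

dyf.printSchema()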

Comments: 54
@Diminishstudioz · 1 year ago
I am so happy that I found this channel
@DataEngUncomplicated · 1 year ago
I'm happy you found it, Farsim! Thanks for subscribing!
@akshitha2110 · 11 months ago
Thank you. This is very helpful. My use case is to take CSV files from S3, perform data quality checks, and output in Parquet format. I was planning to use PySpark in AWS, and I think this is a simple procedure I can follow to do the same.
@DataEngUncomplicated · 10 months ago
No problem! Yup, this approach would work. Why do you need to use PySpark, though? Are you analyzing millions of records? If it's only thousands or hundreds of thousands, Lambda functions or a Glue Python shell job might be sufficient.
@priyanka2309 · 1 year ago
Excellent
@sumanranjan6597 · 7 months ago
Hi, I'm having an error while running the first default code. Please provide the IAM role used to launch the notebook in AWS Glue.
@vvkk-vl9jw · 1 year ago
Thank you very much for this video playlist. Please upload new videos on multiple conditions.
@DataEngUncomplicated · 1 year ago
Thanks, can you elaborate on what videos would be helpful on multiple conditions?
@vvkk-vl9jw · 1 year ago
@@DataEngUncomplicated Thank you for replying. I'd like new videos on: 1) using triggers for crawlers and connecting to the SNS service for messages, and similar things; 2) joining an Oracle database to Glue for querying. I really appreciate your efforts. 💟
@tiktok4372 · 1 year ago
What is the better option: reading via the Glue Catalog or directly from S3? I'm working on a project where new data files are loaded into an S3 bucket every day (right now mostly Parquet files, but in the future there may be other formats). When the files are in S3, we trigger an AWS Glue job to read (via the Glue Catalog), transform, and write the data to another S3 bucket. But before starting the Glue job, we need to run the related crawlers to crawl the new files (register new partitions, update the schema if there is any change, ...). Because of that, we need to create many crawlers and orchestrate them based on the event of the corresponding file being loaded into S3, and waiting for the crawlers to finish running also takes time and money. Do you think we should keep doing that, or just read the files directly from S3? Is there any risk or performance issue between the two methods, or any other recommendation? Thank you very much
@DataEngUncomplicated · 1 year ago
Hey, sorry for the late reply. Whether to read from the Glue Catalog or directly from S3 depends on the specific requirements and constraints of your project. Here are some factors to consider:
Performance: Reading data directly from S3 can be faster than reading through the Glue Catalog, as the catalog adds an additional layer of metadata management. However, the difference may not be significant, especially if you use partitioning and indexing in the Glue Catalog to optimize queries.
Schema evolution: If your data schema is likely to change frequently or unpredictably, the Glue Catalog provides a more flexible and automated way to manage schema evolution. It can automatically detect schema changes and update table definitions, which saves you from manually updating your code.
Cost: Using the Glue Catalog adds some cost to your AWS bill, since you are paying for the metadata management and indexing it provides. However, that cost may be small compared to the benefits for your specific use case.
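
For illustration, the two read paths look roughly like this; the database, table, and path names are hypothetical, and an initialized glueContext is assumed:

# Via the Glue Catalog: a crawler has already registered the table,
# so the schema and partitions come from catalog metadata.
dyf_catalog = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# Directly from S3: no crawler needed; the schema is inferred at read time.
dyf_s3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/data/"]},
    format="parquet",
)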
@PRI_Vlogs_Australia · 1 year ago
Thank you for this awesome explanation. Can I please request a video about how to implement Change Data Capture using Python? And secondly, how to automate Python pipelines to load data into the AWS cloud, say S3. Thanks.
@DataEngUncomplicated · 1 year ago
Thanks! Sure, I will add change data capture to my video suggestion list. I have a couple of videos on writing data to S3 using AWS Lambda and AWS Glue you can check out. Also see this AWS blog post on CDC with AWS Glue: aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/ It might be helpful if you can leverage the Iceberg table format.
@udaynayak4788 · 1 year ago
Just came across a new scenario: can you please cover creating a UDF in PySpark on AWS Glue? Needed the most.
@devanshaggarwal2627 · 11 months ago
What IAM role should I choose while creating an ETL job in a Jupyter notebook to write this code?
@DataEngUncomplicated · 10 months ago
There isn't an existing role that will give you everything. You also need to add permissions for your S3 bucket if you are using it for reading and writing data.
@joelluis4938 · 1 year ago
Is there any reason to avoid the Catalog? I'm just learning about Glue and I use the Catalog. I have another question: I've tried to run a crawler on one CSV file in my S3 bucket, but when I check the new tables, it doesn't recognize the column names. It shows col0, col1, col2, col3. Do you know why this happens, or how to solve it?
@DataEngUncomplicated · 1 year ago
No reason to avoid the Catalog; I made this video because you might need to read files with AWS Glue that have not already been configured in the Glue Catalog. Are your column names defined in the first row of your CSV files? A missing header row is one reason I can think of why they're not showing up in the catalog.
@alejandrasilva8008 · 1 year ago
Hello, great video, thank you. A question: when I run .printSchema(), the notebook prints only an empty schema ("root" with no fields) and an empty table, but I reviewed the file and it has a header. What happened? Thank you for your answer.
@DataEngUncomplicated · 1 year ago
Hey, thank you. I've seen this when no data is being returned... can you confirm whether any records are being returned?
@shashankreddy8390 · 1 year ago
Hi buddy, this is a nice video, but everyone creates videos on reading and writing from S3. 1. Can you create a video on how to use a Glue Studio notebook (interactive session) to read data from the Glue Catalog and write the results to S3? 2. Please include every step, i.e., what kind of permissions we need to create to read and write (I am getting a lot of permission-denied errors). I'd also recommend a video on the Athena notebook editor reading data from the Glue Catalog using PySpark (please also include detailed permission steps).
@DataEngUncomplicated · 1 year ago
Hi Shashank, these are great video suggestions; I will add them to my list. I have broken my videos down into smaller segments, but having an end-to-end video might be beneficial, especially with the permission challenges.
@shashankreddy8390 · 1 year ago
@@DataEngUncomplicated what number is my request on your list 😅😅😅😅
@powerspan · 5 months ago
Hello there, my CSV has a lot of non-UTF-8 characters. How can I ignore them while loading, since it's throwing the error "unable to parse the file"?
@DataEngUncomplicated · 4 months ago
In AWS Glue, you can use PySpark to read a CSV file and ignore non-UTF-8 characters. Here's an example if you convert your DynamicFrame into a PySpark DataFrame:

# Replace non-UTF-8 characters
for column in df.columns:
    df = df.withColumn(column, col(column).cast("string").alias(column))
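
Expanded into a self-contained sketch: the read options are placeholders, and the regexp_replace step that strips non-printable characters is an added assumption about what counts as junk, not part of the original reply:

from pyspark.context import SparkContext
from pyspark.sql.functions import col, regexp_replace
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/raw/"]},
    format="csv",
    format_options={"withHeader": True},
)
df = dyf.toDF()

# Cast every column to string, then drop characters outside the
# printable ASCII range (an assumption about the "junk" characters).
for column in df.columns:
    df = df.withColumn(column, col(column).cast("string"))
    df = df.withColumn(column, regexp_replace(col(column), "[^\\x20-\\x7E]", ""))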
@himanshusingh-nv5wn · 4 months ago
I am getting an iam:PassRole error ("failed to start the session"), even though I have the Glue console full-access policy attached to my IAM role.
@malvika2011 · 1 year ago
Thank you for this video. I am getting a "glueContext not defined" error, even though when starting a notebook in AWS Glue it should be imported automatically. Thank you
@DataEngUncomplicated · 1 year ago
Hi Malvika, it sounds like glueContext was not defined correctly. I would check to make sure you included the template Python code that comes when you first create a new Glue job.
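
For reference, the template that a new Glue Spark job generates begins roughly like this (interactive sessions set this up slightly differently); if these lines are missing, glueContext will be undefined:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# These lines are what actually define glueContext and spark.
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session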
@malvika2011 · 1 year ago
@@DataEngUncomplicated Thank you 😊 I will check and get back. Thank you for the response. Merry Christmas and a Happy New Year to you!
@DataEngUncomplicated · 1 year ago
Thanks! Merry Christmas and Happy New Year!
@jomymcet · 9 months ago
Can anyone please help me? I have some non-ASCII characters in a file stored in S3. How can I remove those junk characters from the file using AWS Glue? Please help.
@DataEngUncomplicated · 9 months ago
Hi, try posting on AWS re:Post; you might get a quicker response for this particular problem.
@patilharss · 1 year ago
How can I update the file and store it again in S3?
@DataEngUncomplicated · 1 year ago
Hey Harsh, do you want to replace the same data on AWS S3? There is a parameter on write that will overwrite the partition, which could be an option.
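
One way to do this with the Spark DataFrame writer is dynamic partition overwrite; a sketch assuming you already have a Spark DataFrame df with a "date" partition column (the column name and output path are hypothetical):

# Only the partitions present in df are replaced; the others are kept.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(df.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("s3://my-example-bucket/output/"))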
@muralichiyan · 1 month ago
Are Databricks and Glue the same?
@DataEngUncomplicated · 1 month ago
If you're asking whether Databricks and Glue are the same, then no, they definitely are not.
@yagnasivasai · 1 year ago
Do you have any course related to the content?
@DataEngUncomplicated · 1 year ago
Hey, unfortunately I don't have a formal course, but I am building out a YouTube playlist related to AWS Glue for reading, transforming, and writing data: kzfaq.info/sun/PL7bE4nSzLSWci0WpYafgTOBcqpdtO3cdY
@denmur77 · 1 year ago
Thanks for your valuable videos! I'm working on an interesting task: I need to use Kinesis Data Streams as a source in AWS Glue (without Lambda or other AWS services) and put the data into RDS Aurora PostgreSQL. I can NOT get that to work for some reason. Do you think it's possible?
@DataEngUncomplicated · 1 year ago
Yes you can! AWS Glue actually supports a streaming mode with Kinesis as a data source!
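
A rough sketch of what a Glue streaming job reading from Kinesis can look like, based on the documentation linked below; the stream ARN and paths are placeholders, and the Aurora write step is left as a stub rather than a tested pipeline:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Source: a Kinesis data stream (the ARN is hypothetical).
kinesis_df = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
    },
)

def process_batch(batch_df, batch_id):
    # Stub: write each micro-batch to Aurora PostgreSQL over JDBC, e.g.
    # batch_df.write.jdbc(url, table, mode="append", properties=props)
    if batch_df.count() > 0:
        batch_df.show(5)

glueContext.forEachBatch(
    frame=kinesis_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://my-example-bucket/checkpoints/",
    },
)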
@denmur77 · 1 year ago
@@DataEngUncomplicated did you try?
@denmur77 · 1 year ago
@@DataEngUncomplicated I can put data from KDS to S3, or grab data from S3 and put it in RDS, but I can't go directly from KDS to RDS.
@DataEngUncomplicated · 1 year ago
It says it supports it: docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html
@denmur77 · 1 year ago
@@DataEngUncomplicated I saw that. Unfortunately it doesn't work.
@shashankemani1609 · 3 months ago
Could you please let me know why you are using GlueContext when you are not using any of the Glue ETL functionalities, and why you are using a DynamicFrame when you are not dealing with semi-structured or unstructured data? Any specific reason?
@DataEngUncomplicated · 3 months ago
Hi there, although in this tutorial I am not using any Glue transformation methods, I am using the create_dynamic_frame_from_options method to load the data, which comes from the GlueContext class. This is why we need to use GlueContext. DynamicFrames can be used for structured data as well, not only semi-structured or unstructured data.
@bk3460 · 2 months ago
Sorry, what is wrong with df = spark.read.csv(path)?
@DataEngUncomplicated · 2 months ago
That works too, but it's not using the AWS Glue library to do it.
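
For comparison, a minimal sketch of both read paths, assuming an initialized SparkSession and GlueContext (the bucket path is a placeholder):

# Plain Spark: returns a Spark DataFrame; Glue is not involved.
df = spark.read.csv("s3://my-example-bucket/raw/", header=True)

# Glue library: returns a DynamicFrame, which adds Glue-specific features
# such as ResolveChoice transforms and job bookmarks.
dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/raw/"]},
    format="csv",
    format_options={"withHeader": True},
)
df_from_dyf = dyf.toDF()  # convert to a Spark DataFrame when needed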
@bk3460 · 2 months ago
@@DataEngUncomplicated Sorry, I'm new to Spark and Glue. Would you mind elaborating on which Glue library you are referring to? I know about the Glue Data Catalog, but it is not affected when I use df = spark.read.csv(path).
@DataEngUncomplicated · 2 months ago
Give the AWS Glue API and the transformations that come with it a read: docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html