AWS Tutorials - Data Quality Check using AWS Glue DataBrew

8,690 views

AWS Tutorials


2 years ago

The code link - github.com/aws-dojo/analytics...
Maintaining data quality is very important for a data platform. Bad data can break ETL jobs, crash dashboards and reports, and hurt the accuracy of machine learning models by introducing bias and error. AWS Glue DataBrew data profile jobs can be used for data quality checks: you can define data quality rules and validate data against them. This tutorial shows how to use data quality rules in AWS Glue DataBrew to validate data quality.

Comments: 33
@smmike 2 years ago
Thanks, a very comprehensive overview of quality checking in DataBrew.
@AWSTutorialsOnline 2 years ago
Glad it was helpful!
@MahmoudAtef 2 years ago
That was extremely helpful, thank you!
@AWSTutorialsOnline 2 years ago
Glad it was helpful!
@jeety5 2 years ago
Very impressive. I have been looking at data validation frameworks and think this would be a great fit. The two open-source libraries I checked are: 1) Great Expectations: found it tough to configure, has a steep learning curve. 2) PyDeequ: not as up to date as the Scala version (Deequ); also, the community is not very active. Having said that, I have a few queries about DataBrew; kindly share your thoughts:
1) We have thousands of ETL processes (both batch and real time). Do you think DataBrew can handle that scale?
2) Anomaly detection: can DataBrew handle this? If not, is there an alternative approach you could suggest?
3) As we onboard new sources, I want the data validation framework to be easily extensible. For example, just add the ruleset and it should be able to handle any new source. Do you think storing rules in some datastore (e.g. DynamoDB) is a better idea than defining them in DataBrew? DataBrew could just look up DynamoDB to check the rules defined and run them against the incoming data.
4) If a certain check is not available, can we customize it to handle that logic? In the case of an open-source library like Great Expectations, this can be handled. Another idea: if it can't be handled in DataBrew, use a Step Functions conditional statement to trigger DataBrew vs. some other validation mechanism.
@Rawnauk 4 months ago
Very nicely explained.
@shokyeeyong6469 2 years ago
Thank you for the tutorial, which gives a good overall understanding of the DQ part. Is it possible to view the detailed records that succeeded or failed?
@AWSTutorialsOnline 2 years ago
I don't think it links back to the records which pass or fail.
@ds12v123 2 years ago
Nice explanation and details
@AWSTutorialsOnline 2 years ago
Glad it was helpful!
@ladakshay 2 years ago
This is perfect. We have thousands of datasets where we need to perform DQ checks and send reports. Is it possible to automate or create the rules programmatically instead of using the console? Something like creating rules in a YAML/CSV file?
@AWSTutorialsOnline 2 years ago
You can use SDKs such as Python's boto3 to create rules dynamically. Please check this link - boto3.amazonaws.com/v1/documentation/api/latest/reference/services/databrew.html
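A minimal boto3 sketch of what that could look like (not code from the video; the dataset ARN, the names, and the check expression are illustrative assumptions -- verify the condition keywords against the DataBrew check-expression reference before using them):

import boto3

databrew = boto3.client("databrew")

databrew.create_ruleset(
    Name="orders-dq-ruleset",  # hypothetical ruleset name
    TargetArn="arn:aws:databrew:eu-west-1:111122223333:dataset/orders",  # hypothetical dataset ARN
    Rules=[
        {
            "Name": "age-in-range",
            "Disabled": False,
            # Illustrative expression; check the exact condition grammar in the DataBrew docs.
            "CheckExpression": ":col1 between :val1 and :val2",
            "SubstitutionMap": {":col1": "`age`", ":val1": "0", ":val2": "100"},
            # Rule passes only if every row satisfies the expression.
            "Threshold": {"Value": 100.0, "Type": "GREATER_THAN_OR_EQUAL", "Unit": "PERCENTAGE"},
        }
    ],
)

You could drive the same call from a YAML/CSV file of rule definitions and loop over create_ruleset per dataset.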
@spandans2049 2 years ago
This was very nicely explained! Thank you so much :) Is it possible to have a ruleset where a new data file is verified against some existing Redshift table data? For example, let's say we are getting orders data from Kinesis into S3 and we need to verify inventory information that is in DynamoDB, and whenever the inventory is lower than a threshold value we want to run a particular pipeline. Can we do this?
@AWSTutorialsOnline 2 years ago
Hi, it is not possible with the current features. Moreover, what you are describing is not a "data quality" problem but a "business logic/rule" problem. DataBrew DQ is not the right place to solve it. You should use code-based business logic running in a Glue job / Lambda to perform these checks. Hope it helps.
@spandans2049 2 years ago
Oh okay, got it. Thank you very much for clarifying :).
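A rough sketch of that code-based approach (purely an assumption, not from the video): a Lambda that looks up inventory in DynamoDB and starts a Step Functions pipeline when stock drops below a threshold. The table name, key schema, state machine ARN, and threshold are placeholders.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")

INVENTORY_TABLE = "inventory"  # placeholder table name
PIPELINE_ARN = "arn:aws:states:eu-west-1:111122223333:stateMachine:restock-pipeline"  # placeholder
THRESHOLD = 10  # placeholder threshold

def lambda_handler(event, context):
    table = dynamodb.Table(INVENTORY_TABLE)
    # Assumes the triggering event carries the item ids of the incoming orders.
    for item_id in event.get("itemIds", []):
        item = table.get_item(Key={"item_id": item_id}).get("Item", {})
        if int(item.get("quantity", 0)) < THRESHOLD:
            # Inventory below threshold: kick off the downstream pipeline for this item.
            sfn.start_execution(
                stateMachineArn=PIPELINE_ARN,
                input=json.dumps({"item_id": item_id}),
            )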
@vishalchavhan6731 2 years ago
Great. Do you have any plans to make a video on AWS Glue and Apache Hudi integration?
@AWSTutorialsOnline 2 years ago
Yes, soon
@sergiozavota7099 2 years ago
Thanks for the clear explanation! I've seen that there isn't a simple way to create a rule that allows a column to match a certain value and at the same time allows it to be null. For instance, consider a column "age" of int values. I would like to create a rule with the following checks: 1) age must be between 0 and 100; 2) age can have missing values. The problem is that check 1 fails if there is a missing value, and there isn't a selection in the DataBrew menu for check 2. Have you found some way to accomplish this task?
@AWSTutorialsOnline 2 years ago
When I try to use a "missing values" check along with a "numeric values" check within the same rule, it does not allow it and gives me the error "This column value data quality check can't be combined with the Missing values column statistic check." However, it does allow one rule with missing values and another rule with numeric values within the same ruleset. That way, at the ruleset level the quality check will fail if any of the rules has failed. Hope it helps.
@smmike 2 years ago
That's actually possible to do in a single rule. You need to select "...OR" as the rule success criteria and have two checks: one for the age to be between 0 and 100, and another check on the same column, "Value is missing" (this check can be found in the "Column values" group of checks). So you basically get: "Age is between 0 and 100 or it's missing".
@AWSTutorialsOnline 2 years ago
@smmike The between and missing combination works with OR?
@smmike 2 years ago
@AWSTutorialsOnline Yes, but be sure to use the "Value is missing" check, not "Missing values". The latter is an aggregate check for the whole column, while the former is a test on a particular value.
@AWSTutorialsOnline 2 years ago
@smmike Awesome!
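To make the logic of that rule concrete, here is a plain-pandas illustration (not DataBrew syntax): a value passes when it is between 0 and 100 or missing, which is what the "...OR" success criteria combined with a "Value is missing" check expresses.

import pandas as pd

df = pd.DataFrame({"age": [25, None, 150, 0]})

# Passes if age is in [0, 100] OR age is missing.
passes = df["age"].between(0, 100) | df["age"].isna()
print(passes.tolist())  # [True, True, False, True]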
@scotter 1 year ago
I'm looking for the most code-light way (a short Python Lambda function is okay and assumed) to set up a process so that when a CSV file is dropped into my S3 bucket/incoming folder, the file is automatically validated using a DQ ruleset I would manually build earlier in the console. For any given Lambda call (triggered, I assume, by a file dropped into our S3 bucket), I'd like the Lambda to instruct the DQ ruleset to run but not wait for it to finish (Step Functions?). I want to output a log file of which rows/columns failed to my S3 bucket/reports folder (using some kind of trigger that fires when a DQ ruleset finishes execution?). Again, it is important that the process be fully automated, because hundreds of files per day, with hundreds of thousands of rows, will be dropped into our S3 bucket/incoming folder via a different automated process. The end goal is merely to let the client know if their file does not fit the rules. No need to save or clean the data. I realize I may be asking a lot, so please feel free to only share the best high-level path of which AWS services to use in which order. Thank you!
@AWSTutorialsOnline 1 year ago
You can use a crawler to catalog the data stored in S3 and then define a DQ ruleset on it. Use an S3 event to call a Lambda which calls the start_data_quality_ruleset_evaluation_run method in the Glue API to start the DQ evaluation. The method has a parameter to specify the S3 location where the DQ evaluation results are stored. You might want to check the following video of mine as well - kzfaq.info/get/bejne/gcCPgtt8zJu2iXk.html
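A minimal Lambda sketch of that flow (the catalog database, table, IAM role, ruleset name, and results prefix below are placeholders): it starts a Glue Data Quality evaluation run against the crawled table and returns without waiting for the run to finish.

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Key of the file that landed in S3, in case you want to pick a ruleset per folder.
    key = event["Records"][0]["s3"]["object"]["key"]

    run = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={
            "GlueTable": {
                "DatabaseName": "incoming_db",   # placeholder catalog database
                "TableName": "incoming_files",   # placeholder crawled table
            }
        },
        Role="arn:aws:iam::111122223333:role/GlueDataQualityRole",  # placeholder role
        RulesetNames=["incoming-files-ruleset"],                    # placeholder ruleset
        AdditionalRunOptions={"ResultsS3Prefix": "s3://my-bucket/reports/"},  # placeholder prefix
    )
    # Return immediately; the results land under the S3 prefix when the run completes.
    return {"runId": run["RunId"], "objectKey": key}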
@veerachegu 2 years ago
Nice explanation. Do you offer any training? I am looking for training, please help me.
@AWSTutorialsOnline 2 years ago
Thanks Veera. Unfortunately, I don't do 1:1 trainings.
@user-pt5wy3mf1y 1 year ago
Where have you placed this code, and how is it connected with the DataBrew profile job?
@AWSTutorialsOnline 1 year ago
Link in the description. github.com/aws-dojo/analytics/blob/main/DataProfileCheck.py
@BounceBackTrader 1 year ago
Please make a video on PyDeequ with Glue, without using EMR.
@AWSTutorialsOnline 1 year ago
You still want to use PyDeequ :) Glue Data Quality uses Deequ behind the scenes.
@veerachegu 2 years ago
Can you please give training on AWS Glue? We are 5 members looking for training.
@AWSTutorialsOnline 2 years ago
Hi, I am not doing any training at this point in time. Please ping me at brajends@aws-dojo.com and we can discuss it.