The Missing Piece in Many Data Pipelines

3,672 views

Kahan Data Solutions

►► Establish a Well-Structured Data Warehouse for Your Small Team In 90 Days (Free Guide) → www.kahandatasolutions.com/guide
All data teams (large & small) have at least one thing in common.
Source data.
But not everyone handles it the same way in their pipelines.
Some reference raw source tables directly in many queries.
Others create ad-hoc custom tables to address subtle formatting changes,
but without any overarching strategy or consistent naming behind them.
Data modeling (e.g., Kimball, One Big Table) is the more popular topic.
But I believe an equally important area to consider is what you do BEFORE you start creating those core data models.
For many, this "before" layer doesn't exist at all.
In previous videos I've talked about a 3-Layered Data Model.
And today I want to focus solely on Layer 1, which addresses this concept.
It's called a "Staging" layer.
When done right, it can help you establish reliable pipelines from the very start.
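As a minimal sketch of what such a staging layer might look like, here is an example using Python's built-in sqlite3 module. The table raw_orders, its columns, and the view stg_orders are hypothetical names chosen for illustration, not taken from the video. The idea is only that renaming, casting, and light cleanup happen once, in one place, before any core model is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw source table, loaded as-is from a source system.
conn.execute("CREATE TABLE raw_orders (ID TEXT, CustName TEXT, amt TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [("1", "  Alice ", "10.50", "2024-01-05"), ("2", "bob", "7", "2024-01-06")],
)

# Staging view: one-to-one with the source; only renames, casts, and trims.
conn.execute("""
    CREATE VIEW stg_orders AS
    SELECT
        CAST(ID AS INTEGER) AS order_id,
        TRIM(CustName)      AS customer_name,
        CAST(amt AS REAL)   AS order_amount,
        DATE(created)       AS created_date
    FROM raw_orders
""")

# Downstream queries read from stg_orders, never from raw_orders directly.
staged_rows = conn.execute(
    "SELECT order_id, customer_name, order_amount, created_date "
    "FROM stg_orders ORDER BY order_id"
).fetchall()
print(staged_rows)
```

If the source system later renames a column, only the view definition changes; queries built on stg_orders keep working.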
Timestamps:
00:00 - Intro
00:52 - What is a Staging Layer?
03:23 - Reason #1: Modularity
05:03 - Reason #2: Consistency
07:21 - Reason #3: Clarity
Title & Tags:
The Missing Piece in Many Data Pipelines
#kahandatasolutions #dataengineering #datamodeling

Comments: 13
@andresarmua 18 days ago
Nice! I use a staging layer as a view and then 4 more layers for the pipeline until I get to the mart. I usually alternate between views and materialized tables, but I am not quite sure how to know the optimal way to decide between tables and views at each time. How do you compare performance, storage and other practical factors?
@bertjanvdberg 27 days ago
Nice! Question: Do you also use views in your warehouse and mart layers? I've been at companies where the marts were basically views based on views based on views times 10 which was terrible for the performance of getting the data.
@ramtadam1469 27 days ago
We always use tables as marts and then sometimes on top build views that do things with the materialized marts data.
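The pattern in this reply (marts as tables, with thin views on top) can be sketched as follows, again with Python's sqlite3. All table and view names here are hypothetical. SQLite has no materialized views, so a plain table rebuilt on each run stands in for the materialized mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staged data that the mart is built from.
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, customer_name TEXT, order_amount REAL)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [(1, "Alice", 10.5), (2, "Bob", 7.0), (3, "Alice", 4.5)],
)

# Mart as a table: the aggregation runs once per pipeline run, not once per query.
conn.execute("DROP TABLE IF EXISTS mart_customer_revenue")
conn.execute("""
    CREATE TABLE mart_customer_revenue AS
    SELECT customer_name,
           SUM(order_amount) AS total_revenue,
           COUNT(*)          AS order_count
    FROM stg_orders
    GROUP BY customer_name
""")

# Thin view on top: a cheap filter over data that is already materialized.
conn.execute("""
    CREATE VIEW v_top_customers AS
    SELECT customer_name, total_revenue
    FROM mart_customer_revenue
    WHERE total_revenue >= 10
""")

top_customers = conn.execute(
    "SELECT customer_name, total_revenue FROM v_top_customers ORDER BY customer_name"
).fetchall()
print(top_customers)
```

The view only ever reads one precomputed table, which avoids the deep view-on-view chains described in the parent comment.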
@johnpower1458 24 days ago
Do you truncate the staging data on each batch pipeline run and capture the cleaned data in snapshots? If not, how do you avoid duplicates downstream if you're using, say, SCD Type 2?
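One common way to keep a Type 2 history free of duplicates, sketched here as an assumption and not as the video author's answer, is to compare each staged row against the current version and append a new version only when a tracked attribute changed. The function apply_scd2 and the fields customer_id, email, valid_from, and valid_to are hypothetical.

```python
from datetime import date


def apply_scd2(history, staged, load_date):
    """Append a new version only when a staged row differs from the current one."""
    # Index the open (current) version of each key.
    current = {row["customer_id"]: row for row in history if row["valid_to"] is None}
    for new in staged:
        old = current.get(new["customer_id"])
        if old is not None and old["email"] == new["email"]:
            continue  # unchanged: re-running the same batch adds nothing
        if old is not None:
            old["valid_to"] = load_date  # close the previous version
        version = {
            "customer_id": new["customer_id"],
            "email": new["email"],
            "valid_from": load_date,
            "valid_to": None,
        }
        history.append(version)
        current[new["customer_id"]] = version
    return history


history = []
batch = [{"customer_id": 1, "email": "a@example.com"}]
apply_scd2(history, batch, date(2024, 1, 1))
apply_scd2(history, batch, date(2024, 1, 2))  # same batch again: no new row
apply_scd2(history, [{"customer_id": 1, "email": "a2@example.com"}], date(2024, 1, 3))
print(len(history))
```

With this kind of change check, whether staging is truncated or kept between runs does not affect the history table, because an unchanged row never produces a new version.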
@thedavidabides 28 days ago
Nice work! Where should the staging layer go when using a bronze, silver, gold medallion structure?
@muhammadbadar6089 28 days ago
From my understanding, you would use your bronze layer as a staging layer, pulling from all source systems.
@personalbranddata 27 days ago
It's the silver layer. Bronze = raw data in this video. Silver = "staging"/cleaned data in this video. Gold = Warehouse in this video. I don't like that he's using the term "staging" to refer to cleaned data because in traditional data warehousing a staging table typically refers to uncleaned data straight after you've loaded it from a source system and the cleaning happens later.
@ArmandsPutnis 27 days ago
It does not really matter what you call them if you have agreed on the purpose. The bronze layer can be raw_source or it can be staging. Personally, I like to keep the source out of the way and use bronze for staging (cleaning/transforming), silver for joining multiple bronze tables that I know can be reused for multiple use cases in a gold layer, and gold for the final solution/consumption, joining some silver and bronze tables.
@gatorpika 24 days ago
@ArmandsPutnis Yeah, this. Bronze, silver and gold is an abstraction to help you think about your structure, not something with set rules you have to follow dogmatically. Figure out what layers you need to solve your problems and then just structure your layers appropriately. Staging helps you shift the transforms left, so changes are easier down the road because they propagate through all your downstream transforms. Then transform on top of that, assuming the stage takes care of most of the cleaning/formatting for you. If your management makes you pick a metal, I suggest the titanium layer.
@williamchurch711 15 days ago
Would the staging layer be equivalent to a landing zone?
@senarl 11 days ago
Might be wrong, but I take it that the staging layer would be the bronze layer in the Medallion architecture. So we would have landing with raw data, bronze with cleaned raw data, silver with any new columns or other enhancements to the data, and gold with the joins and business logic. But that's just how I use it at work, and it can be changed to fit your needs.
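The layering described in this comment can be sketched in plain Python, one function per layer. Everything here (the sample rows, column names, and function names) is hypothetical and only illustrates which kind of logic lands in which layer under that particular convention.

```python
# Landing: raw rows exactly as they arrive from the source systems.
landing_orders = [
    {"ID": "1", "CustID": "10", "amt": " 10.50 "},
    {"ID": "2", "CustID": "11", "amt": "7"},
]
landing_customers = [
    {"CustID": "10", "Name": " Alice "},
    {"CustID": "11", "Name": "Bob"},
]


def bronze_orders(rows):
    # Bronze: cleaned raw data (renames and type casts only).
    return [
        {"order_id": int(r["ID"]), "customer_id": int(r["CustID"]), "order_amount": float(r["amt"])}
        for r in rows
    ]


def bronze_customers(rows):
    return [{"customer_id": int(r["CustID"]), "customer_name": r["Name"].strip()} for r in rows]


def silver_orders(rows):
    # Silver: enhancements such as derived columns.
    return [{**r, "is_large_order": r["order_amount"] >= 10} for r in rows]


def gold_revenue_by_customer(orders, customers):
    # Gold: joins and business logic.
    names = {c["customer_id"]: c["customer_name"] for c in customers}
    totals = {}
    for o in orders:
        name = names[o["customer_id"]]
        totals[name] = totals.get(name, 0.0) + o["order_amount"]
    return totals


gold = gold_revenue_by_customer(
    silver_orders(bronze_orders(landing_orders)),
    bronze_customers(landing_customers),
)
print(gold)
```

Renaming the layers (for example, calling bronze "staging") changes nothing in the code, which is the point several commenters make: the agreed purpose of each layer matters more than its label.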
@Milhouse77BS 28 days ago
Stage All the Things