
Modern Data Lakehouse? Where do you even start??? Azure Synapse of course!

41,965 views

Guy in a Cube

1 day ago

Trying to get started with a Modern Data Lakehouse? There are so many options today. Where do you even begin? Patrick and Stijn help you get a handle on it with Azure Synapse Analytics!
Stijn Wynants
/ sqlstijn
Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake
sql-stijn.com/...
Pipelines and activities in Azure Data Factory and Azure Synapse Analytics
docs.microsoft...
Integrate with pipelines
docs.microsoft...
#AzureSynapse #Pipeline #GuyInACube

Comments: 59
@rachidt2764 2 years ago
Would love a comparison with "the regular way" of Azure SQL and ADF in terms of performance and cost, highlighting which cost drivers would lead you down the lakehouse route and which performance considerations you'd have with Power BI on top.
@billmanassas7746 2 years ago
This could easily be done with Azure SQL and Azure Data Factory, unless you have something like >100M records going over the wire. It is obvious to me that Microsoft is trying to promote Synapse through every channel possible (Guy in a Cube is (was?) dedicated to Power BI) and is taking advantage of Guy in a Cube's popularity.
@B-Luv 2 years ago
Yoooooo! I didn't know you guys did Synapse videos too! I've been making the shift from developing an on-prem DW to trying to implement a more modern cloud data lakehouse architecture. What I was hoping to see was how to build out the different layers... and how to implement the delta loads on that. Trying to wrap my head around all of this. It is a paradigm shift to say the least.
@jananunn8244 2 years ago
Thank you for this video. It's been very helpful to see these possibilities using parameters! I'm looking forward to the rest of the series with Stijn!
@AessamL 2 years ago
It is really nice, but some of the workflow for the ADF pipeline is really unintuitive. I mean, creating the same parameter three times!?
@samaguire1 2 years ago
Wicked video 👍👍. A great intro into how easy DF can be. I'll be watching the rest in this series for sure!
@newbi4655 2 years ago
Wow, your enthusiasm is infectious, Patrick! I am not familiar with Azure, but your clear explanations help decipher this new maze of content. Thank you both. It all helps.
@aamoody81 2 years ago
I really like the new content. Please keep it coming.
@dbszepesi 2 years ago
Definitely looking forward to this series.
@topcataj 2 years ago
"Hey this thing is really quick and easy".... 3000 settings and configuration changes later.... "See!"
@xanthopsized 2 years ago
I've been happy doing this with SQL Server and SSIS and loading everything into a tabular model for analysis. But ever since we implemented Dynamics 365 in the cloud, the only way out for the data from Dynamics is into a data lake. This series has come at the right time because it's exactly what we're trying to do.
@FrostSpike 2 years ago
Are you using Synapse Link for Dataverse, or acquiring the data through Dynamics 365 APIs? In the past, you could have used Data Export Service to maintain a SQL Server mirror of the interesting Dataverse tables, but that's being deprecated in Nov 2022.
@xanthopsized 2 years ago
@FrostSpike I was referring to Dynamics 365 Finance, which is not based on Dataverse. It has its own Export to Data Lake service, which was released only a few months ago.
@davidmvg216146 2 years ago
Hey Fato, we are doing the same at the moment (F&O to data lake). Have you been able to distinguish deleted rows in the tables? Do you know if there is any documentation on how the feature deals with exports of deleted records?
@xanthopsized 2 years ago
@davidmvg216146 Hi David. We're still at the very beginning of getting it to work. Not handling deleted records is concerning, but the tables we're focusing on are ones that only grow (e.g. LedgerTrans). I'll post back if I learn something about the subject you raised.
@MDevion 1 year ago
@xanthopsized There are some standard solutions out there that cover this subject. The CDM structure is quite complex, and I would advise everyone not to build it yourself unless you have a deep understanding of D365FO and CDM.
@ahmedroberts4883 1 year ago
Amazing. I truly appreciate the content.
@sandeepbarge4699 2 years ago
Thanks for choosing the use case demonstrated. I am sure it will help a lot of us in building generic pipelines. In your next video, it would be nice if you could highlight the advantages of using a data lake vs. an RDBMS. I mean, why would someone give up an RDBMS for their data warehouse and go for a data lake, which stores data in a file system in .parquet format? Sorry if you covered it and I missed it.
@coding3438 1 year ago
Basically, storing data in a data warehouse is expensive. Ideally, you don't want to store TBs of your raw data in your DW. You would instead store it in a data lake in its raw format, where storage is cheap. You would then clean it using external compute, like Synapse serverless or Azure Databricks, and aggregate it so that each line item in a transaction table that was on an hourly basis now collapses into a daily basis, then load that aggregated data into your DW. That way, you're storing way less data in the DW, and you can use SQL to query the data instead of cleaning it, which is what SQL was originally intended for. Hope this helps.
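A minimal sketch of that raw-to-aggregated flow, assuming a Synapse or Databricks Spark session; the storage paths, container names, and columns below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Raw, hourly-grain transactions land cheaply in the lake as parquet files.
raw = spark.read.parquet("abfss://raw@mystorage.dfs.core.windows.net/sales/transactions/")

# Collapse the hourly line items to a daily grain before anything touches the warehouse.
daily = (
    raw.withColumn("sale_date", F.to_date("transaction_ts"))
       .groupBy("sale_date", "store_id")
       .agg(F.sum("amount").alias("daily_amount"),
            F.count("*").alias("line_items"))
)

# Write the much smaller curated result; a copy activity (or a dedicated-pool
# connector) can then load this aggregated data into the data warehouse.
daily.write.mode("overwrite").parquet(
    "abfss://curated@mystorage.dfs.core.windows.net/sales/daily_transactions/"
)
```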
@ramdasnaik6522 2 years ago
Hi, can you make a video on how to capture data from the Power BI service into a SharePoint list?
@culpritdesign 2 years ago
Amazing content!
@matthiask4602 2 years ago
All about pumping more data from on-premises into the cloud, is it not? But why?
@garymelhaff7779 1 year ago
First - I love you guys, great video as always, but... as they say, the devil is in the details, so I just wanted to provide some context for these technologies. Yes, this is the new landing-area pattern, and it's wonderful for dynamic source acquisition. No more BIML to do this! And the dynamic landing-area ELT is, of course, also wonderful. But (so many buts here)... I just wish there was more context provided, as it could be misleading about what it should be used for. First, you can do the same dynamic landing using ADF, so this is not unique to Synapse pipelines. And the dynamic capability in and of itself has nothing to do with the data lake per se. The data lakehouse is really the whole Synapse serverless SQL pool, not Synapse pipelines - two completely different things. And finally, the reality is that if you try to use the data lake as your final landing area, you will get a really nasty surprise when you try to join data lake files together when loading integrated targets such as a DW (whether you're reading straight from the data lake or via a serverless SQL pool), because the performance is bad... I mean really bad... compared to using SQL Server tables. Remember, in the end these are just flat files in the data lake, not indexed tables. Why oh why would you take a SQL Server source and put it in flat files? If you have a compelling reason, fine, but as a general rule that's a pretty bad idea, folks. We are using the data lake only because D365 sources now support syncing with data lake files. But then we pull it into our SQL Server landing area, where the real work is done. It's SUPER cool that the Synapse SQL pool can run SQL queries against flat files and treat them like a regular database source. But it's not practical from a performance standpoint to actually use that for anything other than a source for the real SQL Server landing area, as it just can't scale well with flat files.
@omakuaugustine3198 1 year ago
Q: Hello Patrick, how can I delta load from multiple on-prem SQL tables into Azure Data Lake?
@alexmahrou6434 2 years ago
How's the performance on the SQL Server source? For big instances, I'm imagining some long run times. Also, how are deltas and changes handled? Sorry to inundate you with questions, but it also looks like there's a prerequisite of having an on-prem integration runtime as well?
@coding3438 1 year ago
Yes. You need a "self-hosted integration runtime" if you're getting data from sources that are in private networks. That could be on-prem or VNets and VPCs. Deltas and changes have to be manually configured in the pipeline, using watermark columns or change data capture technologies in the source RDBMSs.
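A rough sketch of the watermark pattern mentioned here; in ADF/Synapse this logic typically lives in Lookup + Copy activities, and the table, column, and control-table details below are hypothetical:

```python
from datetime import datetime

def build_incremental_query(table: str, watermark_column: str,
                            last_watermark: datetime) -> str:
    """Source query a copy activity would run to fetch only new/changed rows."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > '{last_watermark:%Y-%m-%d %H:%M:%S}'"
    )

# 1. Lookup: read the high-water mark stored by the previous run (control table, file, etc.).
last_watermark = datetime(2022, 6, 1, 0, 0, 0)

# 2. Copy: only rows modified since that watermark go over the wire.
print(build_incremental_query("Sales.Orders", "ModifiedDate", last_watermark))

# 3. After a successful copy, persist MAX(ModifiedDate) of the copied rows as the
#    new watermark so the next run continues from where this one left off.
```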
@LandscapeInMotion 2 years ago
Nice! How did it build out fact and dimension folders in the lake? How would it know what table (dim or fact) to put under each folder?
@coding3438 1 year ago
That's because of parameterization. He is basically telling ADF to create a folder in the sink with the same name as the source schema, and to name the file in the sink after the source table.
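A small illustration of that parameterization, with hypothetical names; in the actual dataset this is expressed as dynamic content over the schema and table parameters rather than Python:

```python
def sink_path(container: str, schema_name: str, table_name: str, run_date: str) -> str:
    """Build the lake folder/file path for one copied table from the pipeline parameters."""
    return f"{container}/bronze/{schema_name}/{table_name}/{table_name}_{run_date}.parquet"

# e.g. the dbo.Customer table lands under a folder named after its schema and table
print(sink_path("abfss://lake@mystorage.dfs.core.windows.net", "dbo", "Customer", "2022-06-01"))
# -> abfss://lake@mystorage.dfs.core.windows.net/bronze/dbo/Customer/Customer_2022-06-01.parquet
```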
@JustBigdata 10 months ago
I am not sure if you covered the difference between a data lakehouse and a data lake. Would love to know about the difference, especially in an Azure environment.
@PowerBIUGLima 1 year ago
Thanks for this video and blog post, very useful. One question about the medallion architecture: is it mandatory that the gold layer resides in the lakehouse? In a "smaller" project I want to send data from silver in the lake to gold in an Azure SQL database. Technically, I know it's possible, but I want to hear opinions from the design side. What do you think?
@carexpertDATA 1 year ago
I've followed the steps exactly until 6:05, but then the "databasename" parameter doesn't show up. It still looks like 5:42. What have I missed? I guess there are some extra steps that are not shown here, or the cutting was bad. The source dataset suddenly has another name at 6:05 - what's going on there?
@moncefansseti1907 1 year ago
Quick question: for the merging strategy, do we need to use it on the bronze or silver layer?
@gpetrites-work 11 months ago
I'm a bit confused by the multiple files per table and how they would be used. At 13:54, you show three files in the Customer folder. Given the dates, I assume each is a complete snapshot. You remark that you don't care about the file names, but when it's time to use the file, you obviously only want to use a single file so as not to duplicate data. Programmatically, this would seem to be difficult without fixed file or folder names. For example, if you needed to build a pipeline to copy data into silver, how would you select the most recent file? Or if you built a notebook to analyze the data, how would you reference the most recent file regardless of when you used the notebook?
@and1178 1 year ago
If you're connecting to an on-premises DB, aren't you supposed to connect to SQL Server as your source dataset?
@radekou 1 year ago
Is the copy activity going to support "Delta" as a sink?
@martynsharp5936 2 years ago
How would you handle having to connect to multiple SQL Servers as your source data though?
@MrSARAZZ 2 years ago
Like Patrick, my mind has just been blown!!! WOW!!! I'm assuming the same can be done in ADF and a basic ADLS Gen2 storage account if you can't afford Azure Synapse?
@jmiguell.a.2693 2 years ago
Low-cost option: try Synapse SQL Serverless.
@stuartgoodrick1095 2 years ago
Great for the first cut of the data source... but what happens when a column is added, removed, or modified in the source? SELECT * gets the data, but won't the schema be different?
@coding3438 1 year ago
The schema will be different, but as we are storing data as flat files in the sink, the pipeline won't break. It's when you read from that sink that problems arise. This is one of the problems of writing to a flat-file data store.
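A minimal sketch of that read-side problem, assuming Spark reading the parquet snapshots; the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The copy into the lake keeps succeeding when source columns change; it is the
# reader that has to reconcile the differing schemas across the snapshot files.
df = (
    spark.read
         .option("mergeSchema", "true")  # union columns across all parquet files
         .parquet("abfss://lake@mystorage.dfs.core.windows.net/bronze/SalesLT/Customer/")
)
df.printSchema()  # newly added columns appear as nullable; dropped ones read back as nulls
```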
@bouseuxlatache4140 1 year ago
To make the example work, you needed to install the Azure integration runtime and the Java Runtime Environment on the computer - is that correct?
@coding3438 1 year ago
You need to install the "self-hosted integration runtime".
@KAshIf0o7 2 years ago
amazing
@MarkDiamond 1 year ago
Way too fast, guys... what's the rush?
@nomanqureshi2703 2 years ago
I would like to see the user access grants on the SQL Server end.
@selemanarefi1542 1 year ago
8:13 Batman, is that you?
@rahulpratap85 2 years ago
How will you version control this and deploy from one environment to another?
@B-Luv 2 years ago
We use Azure DevOps. It's integrated.
@datacentric 2 years ago
I wish MSFT would stop calling Synapse and a data lake a lakehouse. It's obvious it wants to take Databricks' market share. But Synapse and a data lake are not a lakehouse. It's a deliberate attempt to confuse.
@B-Luv 2 years ago
Please elaborate.
@datacentric 2 years ago
@B-Luv Synapse isn't a lakehouse. Hope that helps. Google it for more info, or if you're convinced by the claim, that's good also.
@B-Luv 2 years ago
Your statement lacks support. I'm not sure if your argument is merely semantic, in which case I will understand and agree with you. But I am at the stage where I am learning about lakehouse architecture, and from what I have learned so far, you are supposed to be able to build a lakehouse architecture with it. But when you don't explain why it does not fit your definition of a lakehouse and just tell me to Google it, your comment leaves me at a loss. I know this video does not go into the architecture, but I am still under the belief that such an architecture is possible with Synapse.
@datacentric 2 years ago
@B-Luv You're right, of course. I have in no way supported my position and just said it is so. I apologise - lack of time and motivation is to blame! Synapse is what I would describe as a two-tier (storage) architecture. You have the lake tier and you have the Synapse dedicated pool (SQL DWH) tier. Sure, you don't have to use dedicated pools, but you won't build a persisted DWH without them. Now you face all the challenges and complexity of managing data in multiple tiers. If you want to use dedicated pools, you'd usually move data from one tier to the other. And what's happening with the lake? Without a data management layer you will just have a massive file and object store, immutable with no business context. A swamp. OK, so now you could use open-source Delta and create a Delta Lake, which provides ACID transactions, schema enforcement, sharing, etc. This is great: you can now run workloads supporting AI/ML, streaming and maybe even BI and reporting. Is it performant? Not really. Can you provide row-level security? Not yet. OK, let's use Databricks, which provides performant compute over data lake storage, SQL endpoints, and Unity Catalog for row-level security. Now I can manage all my workloads off the data lake using Delta, which is what I'd call a lakehouse: data lake scalability with warehouse data management. That's the lakehouse. Now why do I need Synapse? And how is it a lakehouse? I don't think it is, so why do they think it is?
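For context on the Delta point above, a tiny sketch of what Delta adds over plain files, assuming a Spark environment with the Delta Lake libraries available and a hypothetical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, "2022-06-01", 100.0), (2, "2022-06-01", 250.0)],
    ["order_id", "order_date", "amount"],
)

path = "abfss://lake@mystorage.dfs.core.windows.net/silver/orders"

# Writing as Delta layers ACID transactions and schema enforcement on top of parquet files...
orders.write.format("delta").mode("append").save(path)

# ...and the result can be read like a table, including time travel to earlier versions.
latest = spark.read.format("delta").load(path)
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
```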
@thomasivarsson1291 2 years ago
This is nothing new. You have been able to do this before with on-prem file systems and an on-prem SQL Server - it's the tale of "The Emperor's New Clothes". You will also lose all the metadata from SQL Server when you put the data in a cloud file system.
@AessamL 2 years ago
Not entirely. You can use something like the Common Data Model and have the metadata in an organized model.json file, along with the query that created these tables, their locations in the data lake, and a description for each table, relationship, etc.
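A rough sketch of the kind of metadata such a model.json can carry, with entirely hypothetical entities and locations; the real CDM model.json schema has more required fields than shown here:

```python
import json

# Hypothetical entity, attribute, and relationship definitions for illustration only.
model = {
    "name": "SalesModel",
    "description": "Tables exported from the SQL source into the data lake",
    "entities": [
        {
            "name": "Customer",
            "description": "Customer master data",
            "location": "bronze/SalesLT/Customer/",
            "attributes": [
                {"name": "CustomerID", "dataType": "int64"},
                {"name": "CompanyName", "dataType": "string"},
            ],
        }
    ],
    "relationships": [
        {
            "fromEntity": "SalesOrderHeader",
            "fromAttribute": "CustomerID",
            "toEntity": "Customer",
            "toAttribute": "CustomerID",
        }
    ],
}

with open("model.json", "w") as f:
    json.dump(model, f, indent=2)
```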
@stuartgoodrick1095 2 years ago
Nothing to stop you using the information schema as a source and holding the metadata alongside the data in the lake?
@thomasivarsson1291 2 years ago
@AessamL Since the scenario is SQL Server sources, it's quite easy to load the tables into a SQL Server destination. You will get more query speed that way. I don't see the point of the demo scenario.
@shafa7668 2 years ago
Surely not the best video you have produced, mate. Expectations were higher. Not saying it was a complete waste of my time though!!
@jiashengfan2269 2 years ago
Too complicated, not good for beginners.