
Advancing Spark - External Tables with Unity Catalog

15,839 views

Advancing Analytics

A day ago

A very common pattern is for companies to have many different lakes, whether as part of a data mesh or simply due to the realities of large organisations. But with Unity Catalog expecting a single lake for each metastore, how do we manage access to other cloud stores?
In this video, Simon looks at the setup of storage credentials and managed identities, followed by the creation of external, unmanaged tables.
The primary steps for external tables can be found here: docs.microsoft...
The managed identity setup steps can be found here: docs.microsoft...
As always, if you're setting up a Lakehouse and need a helping hand, get in touch with Advancing Analytics
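For reference, the external-location steps covered in the video boil down to a couple of SQL statements, sketched below with spark.sql() from a notebook. This is only a rough sketch: the storage credential is assumed to already exist (created against the Azure Databricks access connector / managed identity), and all names and paths are placeholders.

    # Assumes a storage credential named 'lake_credential' already exists in the metastore
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS other_lake
        URL 'abfss://datalake@otheraccount.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL lake_credential)
    """)

    # Register an external (unmanaged) table over an existing Delta folder inside that location
    spark.sql("""
        CREATE TABLE IF NOT EXISTS my_catalog.my_schema.sales_external
        USING DELTA
        LOCATION 'abfss://datalake@otheraccount.dfs.core.windows.net/curated/sales'
    """)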

Comments: 34
@alexischicoine2072 (a year ago)
Creating tables as external has saved me many times when deleting tables by mistake. Having that extra step of having to delete the data in storage isn't so bad if the data is important and hard to recover. Then if you messed up, recreating the external table is very easy. Obviously, if you're in a mature organization where you don't do anything manual in prod, it's not as much of an issue.
@AyyyyyyyyyLmao (a year ago)
I came here looking for tips on how to build a table within the unity game engine. Not what I asked for but a great video nonetheless!
@paulroome3928 (2 years ago)
Great video! One suggestion - each external location can/should be used to create many external tables. Register an external location on a parent folder of your ADLS account, and when you create external tables in child directories, Unity will automatically figure out that you have access to do that!
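To illustrate Paul's suggestion, a minimal sketch (all names and paths are placeholders, and the child paths are assumed to hold existing Delta data): register one external location on the parent folder, then create several external tables under child directories, which Unity Catalog resolves against the parent location's grant.

    # One external location covering the container / parent folder
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS lake_root
        URL 'abfss://datalake@myaccount.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL lake_credential)
    """)

    # Many external tables under child paths of that one location
    for tbl in ["customers", "orders"]:
        spark.sql(f"""
            CREATE TABLE IF NOT EXISTS my_catalog.gold.{tbl}
            USING DELTA
            LOCATION 'abfss://datalake@myaccount.dfs.core.windows.net/gold/{tbl}'
        """)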
@AdvancingAnalytics (2 years ago)
Oh absolutely! The location I added was for the full lake container in this case, apologies if that wasn't clear in the vid!
@ShravanKumar-yv4be (a year ago)
Any suggestions as to when to use managed and external tables? Would it be a good idea to use managed for the bronze/silver layers and external for the gold layer?
@KarthikPaladugu-kz8rt (a year ago)
The articles don't show how to update external locations in existing workflows after enabling Unity Catalog. We can't use DBFS as per the recommendations and need to use external locations instead, so how do we update existing code to point to external locations? We'll use the upgrade option in Unity Catalog to migrate external tables, but how do we update the workflows to point at the external locations?
@RaiOkami (a year ago)
Great videos! I am relatively new to Databricks and even more so to UC, so your videos have been a really great help! I am interested in implementing it in our project for a client, just to get the permissions and governance stuff out of the way. But what exactly is the best practice for storing these tables? Is it really better to save them as external tables than managed? I was told on the forums that UC would handle the file saves in ADLS plus the actual table registration in the metastore. Yet, by default it is still a managed table.
@AdvancingAnalytics (a year ago)
So our /old/ advice was always to use external for full control over storage. Given that UC gives far more flexibility, that's less important. Now the question is how strongly coupled you want the SQL object to be to the underlying data. If someone deletes the table, do you want the data to be deleted too? It comes down to process/ownership.
@RaiOkami (a year ago)
@@AdvancingAnalytics Appreciate the response! I see. So for managed tables the real big drawback is that the data itself is tied to the table object in the metastore (be it Hive or UC). Unlike external tables, where you can drop the table "representation" while the files stay in ADLS, and you can pull them back at any time with all the original data untouched. Did I get the rationale?
@AdvancingAnalytics (a year ago)
@@RaiOkami yep, exactly
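To make the trade-off discussed in this thread concrete, a rough sketch (placeholder names; the external path is assumed to already hold Delta data):

    # Managed table: Unity Catalog owns the files, so dropping the table removes the data too
    spark.sql("CREATE TABLE my_catalog.my_schema.managed_sales (id INT, amount DOUBLE)")
    spark.sql("DROP TABLE my_catalog.my_schema.managed_sales")   # underlying files are cleaned up by UC

    # External table: dropping it only removes the catalog entry; the files stay in ADLS
    spark.sql("""
        CREATE TABLE my_catalog.my_schema.external_sales
        USING DELTA
        LOCATION 'abfss://datalake@myaccount.dfs.core.windows.net/sales'
    """)
    spark.sql("DROP TABLE my_catalog.my_schema.external_sales")  # files remain; re-run the CREATE to get the table back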
@dofa47 (a year ago)
Hi, late to the party but I have a question: can I have read-only access to the storage account with the data, so that no one can modify the prod data? In other words, where is the metadata of the external table saved?
@KarloKotarac (a year ago)
Thanks for the great video(s)! Unfortunately, I am not able to create an external location as you suggested in the video (root@storage_account_name), which should (I guess) point to the whole storage account - I need to do it container by container. Also, creating external tables by using storage credentials directly doesn't seem to be possible anymore (previously it was done via the WITH clause). Do you have any idea how to address this issue?
@alexanderowens4898 (2 years ago)
Always love your videos…a couple of questions: 1. How can I provide an external path using Scala saveAsTable() for UC? 2. Wouldn't the use of external tables limit the ability to get and use the lineage tracking if you load data from and then save data to external locations (unmanaged tables)?
@AdvancingAnalytics (2 years ago)
Hey! When using external tables traditionally, you can create the database with a location (CREATE DATABASE mydatabase LOCATION '/mnt/data/mydatabase/', etc.). Then if you use saveAsTable() it will inherit the external location of that database. I've not tried it with UC, so can't guarantee that it works! And with lineage, I've not looked into how well external tables are supported currently, but when lineage is GA, I'd hope that it supports both types of table!
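As a rough illustration of that traditional, Hive-metastore-style pattern (untested against UC, as noted above, and UC external locations don't accept mount paths, so treat the paths purely as placeholders):

    # Database created with an explicit storage location
    spark.sql("CREATE DATABASE IF NOT EXISTS mydatabase LOCATION '/mnt/data/mydatabase/'")

    # Tables written with saveAsTable() then land under that database location by default
    df = spark.range(10)
    df.write.format("delta").mode("overwrite").saveAsTable("mydatabase.mytable")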
@andrewli7542 (2 years ago)
Something like df.write.option("path", "s3://some/path").saveAsTable("t") should work. Lineage will support both managed and external tables.
@alexanderowens4898 (2 years ago)
@@andrewli7542 That was initially what I had tried, though we use an Azure mount path (/mnt/account/somepath) and I was receiving a "Missing Cloud file system scheme" error.
@andrewli7542 (2 years ago)
@@alexanderowens4898 You can't use mounts with external locations for now. A full path with abfss:// is needed.
@alexanderowens4898 (2 years ago)
@@andrewli7542 Great tip, this is what I ended up doing: .write.option("path", s"abfss://$container@$account.dfs.core.windows.net/$system/$table").saveAsTable(s"$container.$system.$table") - and it worked great! Thank you very much.
@BritonWells (a year ago)
Is there a solution/update to the issue around 15:00? To me that seems like a deal breaker - I want to expose the same external table to many curated catalogs. Do managed tables have the same limit?
@yashgoel5315 (a year ago)
Is there any video on how to set up Unity Catalog?
@palanithangaraj797 (a year ago)
Can you create persistent views from external tables within Unity Catalog?
@majetisaisowmya909 (11 months ago)
Can you please help me create a metastore, catalog, and table automatically using Python or PowerShell?
@AdvancingAnalytics (11 months ago)
So you'll still need to throw that over to the SQL side of things, but you can build up the SQL string programmatically. For example, using Python inside a Databricks notebook you could use:
    CatName = "myCatalog"
    SQLString = f"CREATE CATALOG {CatName}"
    spark.sql(SQLString)
That's going to write a SQL command for you, then execute that command. You can use that in a loop to programmatically create schemas, tables, whatever you want!
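Extending that pattern into a loop, a minimal sketch with placeholder catalog and schema names:

    # Build and execute CREATE statements for a set of catalogs and their schemas
    objects = {
        "bronze_catalog": ["sales", "finance"],
        "silver_catalog": ["sales"],
    }
    for catalog, schemas in objects.items():
        spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
        for schema in schemas:
            spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")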
@mauromi68 (a year ago)
Really a great video. I'm new to Databricks Unity Catalog and I tried to replicate these steps, but I still get the error "Error in SQL statement: UnauthorizedAccessException: PERMISSION_DENIED: request not authorized". It seems to me I did whatever I had to do:
- I created a Databricks access connector in Azure (which becomes a managed identity)
- I created an ADLS Gen2 storage account (data lake with hierarchical namespace) plus a container
- On my data lake container I assigned the Storage Blob Data Contributor role to the managed identity above
- I created a new Databricks Premium workspace
- I created a new metastore in Unity Catalog that "binds" the access connector to the data lake
- I bound the metastore to the premium Databricks workspace
- I gave my Databricks user admin permission on the above Databricks workspace
- I created a new cluster in the same premium workspace, choosing runtime 11.1 and "single user" access mode
- I ran the notebook, which correctly created a new catalog, assigned proper rights to it, created a schema, and confirmed that I am the owner of that schema
The only (but most important) SQL command of the same notebook that fails is the one that tries to create a managed Delta table and insert two records: CREATE TABLE IF NOT EXISTS quickstart_catalog_mauromi.quickstart_schema_mauromi.quickstart_table (columnA Int, columnB String) PARTITIONED BY (columnA);
When I run it, it starts working and in fact it starts creating the folder structure for this Delta table in my storage account, however it then fails with the following error: java.util.concurrent.ExecutionException: Failed to acquire a SAS token for list on /data/a3b9da69-d82a-4e0d-9015-51646a2a93fb/tables/eab1e2cc-1c0d-4ee4-9a57-18f17edcfabb/_delta_log due to java.util.concurrent.ExecutionException: com.databricks.sql.managedcatalog.acl.UnauthorizedAccessException: PERMISSION_DENIED: request not authorized
Please consider that I didn't have any folder created under the "unity-catalog" container before running the table creation command. So it seems that it can successfully create the folder structure, but after it creates the "table" folder, it can't acquire the SAS token. I can't understand this, since I am an admin in this workspace, the Databricks managed identity is assigned the contributor role on the storage container, and Databricks actually starts creating the other folders. What else should I configure?
@mauromi68 (a year ago)
I found it: it's not enough to assign the Storage Blob Data Contributor role to the Azure Databricks access connector at container level - you need to assign the same role to the same connector at STORAGE ACCOUNT level as well. I couldn't find this information in the documentation, and I frankly can't understand why it's needed since the Delta table path was created. However, this way it works.
@blackdeckerz0r (2 years ago)
The only problem with abfss is that Python-only code like pandas and open() doesn't work with that path. We are currently migrating from mounting the storage account to abfss, and we found this limitation.
@AdvancingAnalytics (2 years ago)
Unity Catalog expects you to work entirely through the registered SQL tables, so it's only for the initial table registration that you use the full abfss:// path... That said, it's harder to use pandas with the SQL tables ;)
@andrewli7542 (2 years ago)
For now, access to external location paths is limited to the Spark Dataset API or the dbutils.fs API. PySpark pandas APIs that use the Spark Dataset API under the hood, like ps.read_table(), should also work.
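A quick sketch of those two routes in a Databricks notebook (table and path names are placeholders):

    import pyspark.pandas as ps

    # pandas-on-Spark goes through the registered UC table rather than a raw storage path
    pdf = ps.read_table("my_catalog.my_schema.my_table")

    # dbutils.fs (available in Databricks notebooks) can still browse paths
    # under an external location you've been granted access to
    files = dbutils.fs.ls("abfss://datalake@myaccount.dfs.core.windows.net/raw/")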
@t0nyc0llins (a year ago)
Can I ask how you now access files in your data lake with pandas etc.? We are just starting to move to Unity Catalog and I am unsure how we do this without mounting.
@rakeshprasad4969 (2 years ago)
Is this managed identity for all the workspaces I have in my tenant, or is it for one specific one? If it's the latter, how do we know which workspace an MI belongs to?
@rakeshprasad4969 (2 years ago)
I should have watched a bit more - the step at 8:18 explains it. Thanks, good video. I think ADB should have this MI concept without Unity too; it's a good approach and puts it on par with ADF or any other Azure PaaS (from an identity perspective). Can anyone use an MI that's in the catalog, or has ADB released GRANT support for MIs too? I would like to control who/which code has access to which MI.
@AdvancingAnalytics (2 years ago)
@@rakeshprasad4969 Good question - I haven't looked at the permission model for who/what has access to the managed identity. I assume the access is delegated by the user who adds the credential to the workspace.
@t0nyc0llins (a year ago)
@@rakeshprasad4969 @Advancing Analytics I'd really like to understand this too. If I create an MI with contributor access to the root, can all analysts use this connection to read/write anything in the lake? Hopefully I am missing something key.