AWS re:Invent 2019: Deep dive and best practices for Amazon Redshift (ANT418)

Рет қаралды 52,733

Күн бұрын

In this session, we take an in-depth look at best practices for data warehousing with Amazon Redshift. We show you the best way to analyze all your data, inside and outside your data warehouse, without moving the data, which helps you gain deeper insights for running your business. We also cover best practices for how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.

Пікірлер: 27

@vaibhavdce123 3 жыл бұрын

One of best talks on Amazon Redshift

@DiogoNomura 3 жыл бұрын

Really the best content of Redshift!

@galeop Жыл бұрын

My notes: 19:34: ## Data distrubution strategies, for data distribution across slices (ie virtual nodes -1 physical node is made up of [2, 16] slices, acting as sharding virtual nodes). ## Note that data in a table is stored as follows : it is distributed according to the table's data-distribution strategy (eg according to their distribution key, if you chose "KEY based distribution"), and then data is sorted within each partition according to the sort key(s) (the sort key is relevant for zone maps). Data distribution has two goals : 1) *Avoid "hot partitions/shards"* , ie a partition of data where the request is centered, thus putting a lot of strain on the resources underlying the slice/node that hosts it, while the other slices in the cluster are under-utilized. 2) *Skip IO* : with a query like "SELECT * FROM my_table WHERE country='FRANCE'; " if this table was distributed according to the "country" column, then all the relevant data for this query was stored on the same partition, and can be returned quickly. Compute nodes can quickly identify the partitions where the data is and is not. (Note: "country" is probably not a good choice of partition key, as the resulting partitions will probably be too big, and therefore we'll result in a hot partition) 3) *Return relevant data together in few IO operations:* this is the same idea as 2) : when all relevant data is in the whole partition (or serveral partitions, all the data of those partitions is relevant), then all the data blocks of the partition are relevant, and the storage layer can read them quickly in just a few IO operations. For this reason, it makes sense to use as partition key a column that is often used as joining creteria (eg "JOIN mytable1 and mytable2 ON columnToUseAsPartitionKey") There are 4 data distribution options for a table: - *hash partitioning/sharding* (aka "distribution based on a *key*") : you choose a column (aka key) of your table that will be used as partition key. All the rows that share the same value for this column will be stored together in the same partition/shard. To get a good data dsitribution, and avoid having hot partitions, use a column with an even value distribution (to get even-sized partitions), and with high cardinality (to avoid having huge partitions, that can't be broken down in any smaller pieces), and on which your queries aren't focused on a single value (that's another reason why "country" is a bad partition key, because you'll typically query about facts regarding a single country, and therefore you'll put all the strain on the partition corresponding to that country, thus making it a hot partition ; for the same reason, don't use date as a partition key, because you'll typically query facts on a specific date, thus making the partition corresponding to that date a hot partition). With hash partitioning, if you choose your distribution key right, you'll avoid hot partitions (gaol 1), skip IO (goal 2) and read your data in fewer IO operations (goal 3). - *Even distribution:* the rows of your table are evenly spread among your slices, in a round robin fashion, row by row. The idea is to avoid hot partitions (goal 1), but you don't get goals 2 and 3. - *ALL*: your table is duplicated on the 1st slice of each of your compute nodes. This only makes sense for "small" tables (ie No I think it’s okay as the data model is typically for historizations, so inserts would be appended to the table, in the same HCC/ORC set of row. Sure, adding one by one new rows is a heavier process than doing so in a row-oriented DB, but it’s a heavy process taking place “at the margin”, affecting just one HCC/ORC set of rows per table at a time. 25:00 data is always at least twice redundant in the cluster. Redshift uses 2-phase commit for transactions.

@aksharjamgaonkar1677 2 жыл бұрын

this is pretty crisp and to the point, worth your time.

@andreyavdeev8181 3 жыл бұрын

fantastic session! thank you!

@raghukundurthi1288 4 жыл бұрын

Wow! An intelligent and insightful deep dive into the hidden pearls of Redshift! Love the topic of Advisor! Thanks Tony & Harshida!

@saravanangovindharaj6997 3 жыл бұрын

Awesome presentation!

@venkata.krishnan 3 жыл бұрын

Excellent presentation structure..

@liuruibnu 3 жыл бұрын

Excellent structure!

@vivien1252 3 жыл бұрын

Fantastic session

@galeop Жыл бұрын

13:40 so zone maps are like "storage indexes" in Oracle Exadata ; they allow to skip unecessary IO by telling the storage server what are the min and max values of a compressed 1MB storage block, thus allowing the server to know if the values it looks for NOT in that block. Sort keys in Redshift are meant to determine how your table will be sorted on disk, and therefore impact the zone maps that track the min/max of the sort keys in a 1MB block.You'd typically set sort keys for frequently filtered columns, and have fewer than 4 sort keys per table (the table being sorted based on column X, and then column Y when column X had identical vallues, etc.). Use in priority low cardinality columns for sort keys, as they'll allow to make zone maps more meaningful (to know where the filtered data is not)

@galeop Жыл бұрын

16:34 like all datawarehouses, you should use a denormalized data moldel ; something akin to a star schema or snowflake schema. In your star schema, the Fact table is a denormalised table where entities that belong to the same Fact have been pre-joinned ("materialized", ie they are stored joined together on the Fact table). You will also store in the Faact table pre-calculated aggregations (as you store those values on the table, they are "materialized").

@shrabanti84 2 жыл бұрын

I have two question, 1. from performance point of view do we have any major difference between temp table and ctas with backup no? 2. For alter table append command what if the the source datatypes are same but data lengths are not matching will it still work?

@ArtisticKalam 3 жыл бұрын

Best talk on redshift

@yasink18 Жыл бұрын

Best way to use update statement where I am getting slow response

@Thevisionaryaddy 3 жыл бұрын

how much concurrency and CPU usage if you are running a Decent Size Query which takes about 4-5 hours and is made joining 2-3 tables. It runs in the very morning

@upuldi Жыл бұрын

Very good presentation

@RPRAVEENKUMAR 4 жыл бұрын

Very vast for a beginner like me. But answered a lot of questions

@galeop Жыл бұрын

27:51 according to which criteria should the big S3 file to COPY be split into 16 smaller file ? Should we partition it according to the same distribution method as the one used in the table (eg same hash key)?

@galeop Жыл бұрын

43:40 for WLM and queues, what are these percentages of memory about? Leader node's memory, compute node's memory, or both?

@dsapraveen 3 жыл бұрын

I suppose, Advisor is not available in ap-south-1, I wonder why AWS is selective when releasing these kind add-on of services,.

@galeop Жыл бұрын

55:01 I suppose that altering the distribution key of a table means that all my table's partitions will be re-written from scratch, according to that new distribution key ; so it's a very heavy operation. Correct ?