Tame the small files problem and optimize data layout for streaming ingestion to Iceberg

1,950 views

Dremio

A year ago

In modern data architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lake table formats such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: the small files problem, which hurts read performance, and poor data clustering, which makes file pruning less effective.
In this session, we will discuss how data teams can address these problems by adding a shuffling stage to the Flink Iceberg streaming writer to intelligently group data via bin packing or range partitioning, reduce the number of concurrent files that each writer task keeps open, and improve data clustering. We will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
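To illustrate the general idea, the minimal sketch below configures the open-source Iceberg Flink sink to shuffle rows before the writers by setting a write distribution mode, so each writer subtask receives data for fewer partitions and keeps fewer data files open concurrently. This is a hedged example and not the exact implementation described in the talk; the source stub, table location, checkpoint interval, and parallelism value are placeholder assumptions you would replace in a real pipeline.

```java
// Minimal sketch: shuffle rows by partition key before the Iceberg writers so each
// writer subtask opens fewer concurrent files. Not the talk's exact implementation.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.DistributionMode;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class ShuffledIcebergIngestion {

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000); // Iceberg commits happen on Flink checkpoints.

    // Placeholder source: in practice this would be a Kafka/CDC source producing RowData.
    DataStream<RowData> rows = buildSource(env);

    // Placeholder table location; any supported catalog / TableLoader works here.
    TableLoader tableLoader =
        TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/events");

    FlinkSink.forRowData(rows)
        .tableLoader(tableLoader)
        // HASH routes rows with the same partition key to the same writer subtask,
        // so each subtask writes to fewer partitions at once. Newer connector
        // releases also offer RANGE distribution for sort-key clustering.
        .distributionMode(DistributionMode.HASH)
        .writeParallelism(4)
        .append();

    env.execute("Streaming ingestion to Iceberg with a shuffling stage");
  }

  private static DataStream<RowData> buildSource(StreamExecutionEnvironment env) {
    // Stub so the sketch compiles on its own; plug in a real streaming source here.
    throw new UnsupportedOperationException("Provide a real RowData source");
  }
}
```

The design point this sketch gestures at: without any shuffle, every writer subtask may receive rows for every partition, so the number of open files can grow with parallelism times the number of active partitions; hashing on the partition key (or range-shuffling on a sort key, as the talk covers in depth) bounds that fan-out and tends to produce larger, better-clustered files.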
