Exporting CSV files to Parquet with Pandas, Polars, and DuckDB

8,122 views

Learn Data with Mark

In this video, we'll learn how to convert bigger-than-memory CSV files to Parquet format. We'll look at how to do this using Pandas, Polars, and DuckDB.
#pandas #python #polars #duckdb
Resources
► Blog post - www.markhneedham.com/blog/202...

Comments: 15
@marycombinatoria7335 1 year ago
saved my day, thank youuuuuu!
@learndatawithmark 1 year ago
You're very welcome!
@awe8401 1 year ago
your content is great...
@learndatawithmark 1 year ago
Thank you!
@myyouaccounttube1024 1 year ago
Many thanks for this nice video. A question about the method you presented with DuckDB: when exporting a table from DuckDB to disk in Parquet format using COPY, is it possible to pass a partitioning parameter that specifies the keys (Hive style) by which the data should be split?
@learndatawithmark 1 year ago
You can't specify a partitioning key at the moment, but I think in version 0.7 you will be able to do hive style partitioning. I'll make a video showing how to do that once the new version is released!
@guocity 3 months ago
Pandas works much better with unclean data. How do you handle the pyarrow headache of data conversion errors? For example: ArrowInvalid: Could not convert '230' with type str: tried to convert to double. This makes many dependent steps unusable: to_parquet(), converting pandas to polars, opening a CSV in data wrangle, and saving as Parquet in data wrangle.
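
An error like the one quoted above typically appears when a single Pandas column of dtype `object` mixes strings and numbers. One possible workaround, sketched here under that assumption, is to give every such column a single type before handing the frame to pyarrow. The helper name `make_arrow_friendly` and its "numeric if lossless, otherwise string" rule are choices made for this example, not something from the video.

```python
import pandas as pd


def make_arrow_friendly(df):
    """Return a copy in which mixed-type object columns have a single type."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # already has one well-defined dtype
        numeric = pd.to_numeric(col, errors="coerce")
        # Keep the numeric version only if coercion created no new missing values,
        # i.e. every non-missing entry could be parsed as a number.
        if numeric.isna().sum() == col.isna().sum():
            out[name] = numeric
        else:
            # Otherwise fall back to text so that no value is lost.
            out[name] = col.astype("string")
    return out
```

A frame such as `pd.DataFrame({"a": [1.5, "230"]})` comes back with column `a` as floats, after which `to_parquet()` should no longer fail on that column for this particular reason.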
@PYG143 1 year ago
Many thanks. We have a requirement to convert a huge CSV file to Parquet. Is it possible using a C# console program?
@learndatawithmark 1 year ago
I think you should be able to do it using this library - www.codeproject.com/Articles/1145337/Cinchoo-ETL-CSV-Reader
@kpyoutuber4671 5 months ago
Thank you, Mark! Can you also explain the Parquet dataset? I used to create a partitioned Parquet dataset using Pandas and Polars. But I want to know how to read data from such partitioned Parquet datasets directly into a Polars lazy frame (not into Pandas, as the data size is larger than memory) to do some analytics.
    import polars as pl
    import pyarrow.parquet as pq
    # Read data written to parquet dataset
    pq_df = pq.read_table(r"C:\Users\test_pl", schema=pd_df_schema)
    pl_df = pl.from_pandas(pq_df.to_pandas()).lazy()
Is there any better way to do this?
@learndatawithmark 5 months ago
Can you explain a bit more? You can email me at m.h.needham@gmail.com if you like. Do you mean that you have multiple Parquet files and you want to read them all?
@kpyoutuber4671 5 months ago
@learndatawithmark Thanks for the kind reply. I just learned, through a response to my Stack Overflow question, that pl.from_arrow() or pl.scan_parquet() can be used to read from a partitioned Parquet dataset. It worked:
    pl_df = pl.from_arrow(pq_df).lazy()
and/or
    pl.scan_parquet(r"test_pl/*/*.parquet").collect(streaming=True)
@Phoenixspin 1 year ago
Why not a simple option in Excel to "save as" parquet? Why is this so hard?
@JardanySvidrigailov 1 year ago
It's SIMPLE... AUTOMATION!!!!
@learndatawithmark 1 year ago
I'm guessing not a high percentage of Excel users also use Parquet?!