Multi-threaded scan #47

freddie-freeloader · 2024-07-14T10:27:04Z

First findings

It seems that DuckDB might be able to read multiple Parquet files concurrently, but not a single file concurrently.

Thoughts

In theory, we could do this by copying with exactly the same number of threads that sheetreader used for parsing, and letting each copy thread use the location info of the corresponding sheetreader thread.
Would it be possible to partition the Excel sheet into blocks of 2048 / (number of threads) rows and make the buffers that size? Probably tricky, because we would have to know the number of columns beforehand (the number of rows that fit into one buffer is buffer size / number of columns).
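To make the arithmetic above concrete, here is a minimal sketch. The names `PlanBuffers`, `RowsPerBuffer`, `BufferPlan` and `kRowsPerChunk` are hypothetical and exist in neither DuckDB nor sheetreader-core; the sketch also assumes that buffer size is counted in cells and that the number of threads and columns is non-zero.

```cpp
#include <cstddef>

// Assumption: 2048 is the chunk size (in rows) mentioned in the note above.
const std::size_t kRowsPerChunk = 2048;

// Hypothetical result type, for illustration only.
struct BufferPlan {
    std::size_t rows_per_thread;   // rows each thread would handle per chunk
    std::size_t cells_per_buffer;  // cells a buffer must hold to fit those rows
};

// Hypothetical helper: sizes a buffer so that it holds exactly
// 2048 / num_threads rows. This needs num_columns up front, which is
// the tricky part noted above.
inline BufferPlan PlanBuffers(std::size_t num_threads, std::size_t num_columns) {
    // Integer division: if 2048 is not divisible by num_threads,
    // the leftover rows would need separate handling.
    const std::size_t rows = kRowsPerChunk / num_threads;
    return BufferPlan{rows, rows * num_columns};
}

// The inverse view: with a fixed buffer size, the number of complete rows
// that fit is buffer size / number of columns.
inline std::size_t RowsPerBuffer(std::size_t buffer_cells, std::size_t num_columns) {
    return buffer_cells / num_columns;
}
```

For example, with 4 threads and 10 columns, `PlanBuffers(4, 10)` gives 512 rows per thread and a buffer of 5120 cells. With 3 threads the division leaves a remainder (2048 = 3 * 682 + 2), so an even split only works for thread counts that divide 2048.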

TODO

A multi-threaded scan would be interesting, since our copy/scan function takes some time.

Have a look at:

https://github.com/duckdb/duckdb_delta/blob/main/src/functions/delta_scan.cpp

According to the README, it supports a multi-threaded scan. I suspect that they did not have to implement anything new for this, since the extension reads Parquet files.

Find out whether the multi-threaded scan is possible because of the Parquet files
Find out whether DuckDB also supports a multi-threaded scan of the Apache Arrow format
Have a look at how the multi-threaded scan is implemented
Find out whether we could copy concurrently. This might not be possible, because sheetreader-core saves the data in a special way: the data is stored per thread, some rows are split across multiple threads, and the order is only implicit.
