Multi-threaded scan #47

Open
4 tasks
freddie-freeloader opened this issue Jul 14, 2024 · 0 comments


freddie-freeloader commented Jul 14, 2024

First findings

  • It seems that DuckDB can read multiple Parquet files concurrently, but cannot read a single file concurrently.

Thoughts

  • In theory, we could do this with a copy that uses exactly the same number of threads as sheetreader, where each copy thread uses the location info of the corresponding sheetreader thread.
  • Would it be possible to partition the Excel sheet into chunks of 2048 / (number of threads) rows and make the buffers that size? Probably tricky, because we would have to know the number of columns beforehand (buffer size / number of columns is the number of rows that fit into one buffer).
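
The arithmetic behind the partitioning idea can be sketched as follows. This is only an illustration under assumptions: it takes 2048 to be DuckDB's default vector size, measures buffer size in cells rather than bytes, and uses hypothetical helper names (`rows_per_thread`, `rows_per_buffer`, `buffer_cells_for`) that do not exist in sheetreader-core or DuckDB.

```python
# Assumption: 2048 is DuckDB's default vector size (rows per chunk).
VECTOR_SIZE = 2048


def rows_per_thread(num_threads: int, vector_size: int = VECTOR_SIZE) -> int:
    """Rows each thread would contribute to one vector-sized chunk."""
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    return vector_size // num_threads


def rows_per_buffer(buffer_cells: int, num_columns: int) -> int:
    """Whole rows that fit into a buffer holding `buffer_cells` cells."""
    if num_columns <= 0:
        raise ValueError("num_columns must be positive")
    return buffer_cells // num_columns


def buffer_cells_for(num_threads: int, num_columns: int) -> int:
    """Buffer size (in cells) so that one buffer holds one thread's share of rows.

    Needs num_columns up front, which is the tricky part noted above.
    """
    return rows_per_thread(num_threads) * num_columns
```

With 4 threads and 10 columns, each thread would cover 512 rows, so a buffer would need 5120 cells. The buffer size cannot be chosen before the column count is known, which matches the concern in the second bullet.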

TODO

A multi-threaded scan would be interesting, since our copy/scan function takes some time.

Have a look at:

https://github.com/duckdb/duckdb_delta/blob/main/src/functions/delta_scan.cpp

According to the README, it supports a multi-threaded scan. I suspect that this didn't need any new implementation on their side, since they are reading Parquet files.

  • Find out whether the multi-threaded scan is due to the Parquet files
  • Find out whether DuckDB also supports a multi-threaded scan of the Apache Arrow format
  • Have a look at how the multi-threaded scan is implemented
  • Find out whether we could copy concurrently. This might not be possible, because sheetreader-core saves the data in a special way (per thread, with some rows split across multiple threads, and with only an implicit order)
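
To make the last point concrete, here is a toy model of per-thread storage with an implicit order. It is an assumption for illustration only, not sheetreader-core's actual layout: each thread holds a flat list of cell values, threads are ordered by the part of the sheet they parsed, and row boundaries are known only through the column count. The function name `reassemble_rows` is hypothetical.

```python
from typing import List, Sequence


def reassemble_rows(per_thread_values: Sequence[Sequence[str]],
                    num_columns: int) -> List[List[str]]:
    """Rebuild rows from per-thread cell lists that carry no row index.

    The order is implicit: thread 0's cells come first, then thread 1's,
    and so on. Row boundaries are derived from num_columns alone.
    """
    if num_columns <= 0:
        raise ValueError("num_columns must be positive")
    # Concatenate in thread order; this is the implicit global order.
    flat = [value for chunk in per_thread_values for value in chunk]
    if len(flat) % num_columns != 0:
        raise ValueError("cell count is not a multiple of num_columns")
    return [flat[i:i + num_columns] for i in range(0, len(flat), num_columns)]


# Three threads, two columns: the row ["c", "d"] is split between
# thread 0 and thread 1, and ["g", "h"] between thread 1 and thread 2.
threads = [["a", "b", "c"], ["d", "e", "f", "g"], ["h"]]
rows = reassemble_rows(threads, num_columns=2)
```

In this model a copy thread cannot emit complete rows from its own buffer alone, because the first and last row of a buffer may continue in a neighbouring thread's buffer. That is the kind of obstacle the bullet above describes.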