# Current SPYT development roadmap

As new versions are released, we will update tasks in this roadmap with corresponding versions and add new tasks.


## Integration with the YTsaurus shuffle service

The internal YTsaurus shuffle service will be used for sorting and for subsequent access to the sorted portions of data. We also plan to persist the sorted parts so that they survive unexpected executor aborts. This will not speed up operations when no interruptions occur, but it will reduce run time when several aborts happen, which is the norm on average.

## Support for dynamic allocation by changing the number of jobs in a running operation (as part of direct spark-submit)

Depending on the size of the table being processed, SPYT in direct spark-submit mode will be able to adjust the number of executors on the fly: adding executors for large tables and, conversely, reducing them for small ones.
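This would build on Spark's standard dynamic allocation machinery. The sketch below uses only standard `spark.dynamicAllocation.*` options; the master URL, deploy mode, and application jar are placeholders, and the final SPYT flags may differ.

```shell
# Sketch: direct spark-submit with standard Spark dynamic-allocation settings.
# Master URL and jar name are placeholders, not actual SPYT syntax.
spark-submit \
  --master ytsaurus://cluster.example.com \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=32 \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  my-app.jar
```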

## Support for column statistics with Spark 3.4.x

YTsaurus metadata stores statistics about tables, for example the number of values or the number of unique values in a column. Currently, Spark does not use this information in any way when building a query plan. By adding the ability to take it into account, we expect query plans to be built more efficiently.
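For reference, Spark's cost-based optimizer already consumes table and column statistics when they are available in the catalog. The fragment below shows the standard Spark settings involved, as a sketch of what such an integration would feed into; it is not SPYT-specific configuration.

```
# spark-defaults.conf fragment: standard Spark settings that make the
# planner use table and column statistics (cost-based optimization).
spark.sql.cbo.enabled                    true
spark.sql.statistics.histogram.enabled   true
```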

## Support Spark SQL via Query Tracker for working with dynamic tables

Currently, to read data from a dynamic table in a query via Query Tracker, you must append a suffix to the table name: a timestamp specifying the data slice to read. This is inconvenient, because the user has to find this timestamp and apply it to the table name, and when the query is rerun after some time the timestamp may have changed, so the user has to look up the current value again. We plan to resolve the latest timestamp of the table automatically from its "last_commit_timestamp" attribute.
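The resolution step can be sketched as a small helper: fetch the attribute, then attach it to the path. Note that the attribute name comes from the text above, but the `@<timestamp>` suffix format and the attribute-getter interface are illustrative assumptions, not the actual SPYT syntax.

```python
# Sketch of resolving the read timestamp automatically. The
# "<path>@<timestamp>" suffix format is illustrative, not actual SPYT syntax.
def with_latest_timestamp(table_path: str, get_attribute) -> str:
    """Append the table's last commit timestamp to its path."""
    ts = get_attribute(table_path + "/@last_commit_timestamp")
    return f"{table_path}@{ts}"

# Usage with a stubbed attribute getter standing in for a YTsaurus client:
fake_attrs = {"//home/project/events/@last_commit_timestamp": 1844674407370955161}
path = with_latest_timestamp("//home/project/events", fake_attrs.__getitem__)
print(path)  # //home/project/events@1844674407370955161
```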

## Support for Java 17 and Scala 2.13

Java 11 and Scala 2.12 are currently supported. Integrating with the upcoming Spark 4.x.x will require support for Java 17 and Scala 2.13.