Showing 12 changed files with 367 additions and 34 deletions.
@@ -13,3 +13,4 @@ practices, and helps users solve some common problems.

   loading_data
   storage_backend
   chunking
@@ -0,0 +1,77 @@
.. _chunking:

========
Chunking
========

Xorbits divides large datasets into multiple chunks, with each chunk executed independently using
single-node libraries such as pandas and NumPy. Chunking significantly impacts performance. Too
many chunks can lead to a large computation graph, causing the supervisor to spend excessive time
on scheduling. Conversely, too few chunks may result in OOM (out-of-memory) errors when individual
chunks exceed memory capacity. Therefore, a single chunk should be neither too large nor too small, and
chunking needs to align with both the computation and the available storage. Users familiar with
Dask know that Dask requires chunk shapes or sizes to be set manually using certain operations,
such as :code:`repartition()`.

Automatically
-------------

Unlike Dask, Xorbits does not require users to manually set chunk sizes or perform :code:`repartition()`
operations: chunking occurs automatically in the background, transparent to the user.
This automatic chunking mechanism simplifies the user interface (no more extra :code:`repartition()` code) and
optimizes performance (no more OOM issues). We call this process **Dynamic Tiling**. Interested
readers can refer to our `research paper <https://arxiv.org/abs/2401.00865>`_ for more detailed
information.

Xorbits' operator partitioning is referred to as tiling. A predefined option called
:code:`chunk_store_limit` controls the upper limit on the size of each chunk. During the tiling
process, Xorbits calculates the size of the data coming from upstream operators. Each chunk's data size
is at most :code:`chunk_store_limit`; any data exceeding the limit is
partitioned into a new chunk.
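As a rough illustration of the size-based splitting described above, the following minimal sketch packs a sequence of upstream piece sizes into chunks whose total size stays at or below a limit. It is a toy model and not Xorbits' actual tiling code: the helper ``pack_into_chunks`` and the example sizes are hypothetical.

```python
def pack_into_chunks(piece_sizes, chunk_store_limit):
    """Greedily group piece sizes (in bytes) so each group sums to <= the limit.

    Hypothetical helper for illustration only. A single piece larger than
    the limit ends up in a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for size in piece_sizes:
        # Start a new chunk when adding this piece would exceed the limit.
        if current and current_size + size > chunk_store_limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


MIB = 1024 ** 2
limit = 512 * MIB  # same value as the documented default
pieces = [200 * MIB, 200 * MIB, 200 * MIB, 300 * MIB, 100 * MIB]
chunks = pack_into_chunks(pieces, limit)
chunk_sizes_mib = [sum(chunk) // MIB for chunk in chunks]
print(chunk_sizes_mib)
```

Each printed number is the total size of one chunk in MiB; none exceeds the 512 MiB limit.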

The default value of :code:`chunk_store_limit` is :code:`512 * 1024 ** 2` bytes, which is equivalent to
512 MiB. Note that this value may not be optimal for all scenarios and workloads.
In CPU environments, setting this value higher may not yield substantial benefits even if you
have a large amount of RAM available. However, in GPU scenarios, it's advisable to set this value
higher to maximize the data size within each chunk, thereby minimizing data transfer between GPUs.
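To get a feel for how the limit affects the number of chunks, the short calculation below estimates a lower bound on the chunk count for a given data size. It assumes every chunk is filled up to the limit, which real workloads will not always achieve; ``estimated_chunk_count`` and the 10 GiB dataset are hypothetical.

```python
def estimated_chunk_count(total_bytes, chunk_store_limit):
    """Lower bound on the number of chunks, assuming fully packed chunks."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // chunk_store_limit)


GIB = 1024 ** 3
default_limit = 512 * 1024 ** 2  # 512 MiB, the documented default
dataset_bytes = 10 * GIB         # hypothetical dataset size

chunks_default = estimated_chunk_count(dataset_bytes, default_limit)
chunks_raised = estimated_chunk_count(dataset_bytes, GIB)
print(chunks_default, chunks_raised)
```

Doubling the limit halves the estimated number of chunks, which is the effect the GPU recommendation above relies on.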

You can set this value with a context: :code:`xorbits.pandas.option_context({"chunk_store_limit": 1024 ** 3})`.

.. code-block:: python

    import xorbits.pandas as xpd

    with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
        # your xorbits code
        ...

Or you can set this value at the beginning of your Python script: :code:`xorbits.pandas.set_option("chunk_store_limit", 1024 ** 3)`.

.. code-block:: python

    import xorbits.pandas as xpd

    xpd.set_option("chunk_store_limit", 1024 ** 3)
    # your xorbits code

Manually
--------

We recommend using either the :code:`xorbits.pandas.option_context()` or the :code:`xorbits.pandas.set_option()`
method mentioned above to configure chunking. If you wish to control how many chunks are created
(typically for debugging purposes), you can do so by specifying the :code:`chunk_size`
when creating a Xorbits DataFrame or array.

.. code-block:: python

    import numpy as np
    import pandas as pd
    import xorbits.numpy as xnp
    import xorbits.pandas as xpd

    # Array with an explicit chunk size
    a = xnp.ones((100, 100), chunk_size=30)

    # DataFrame built from a pandas DataFrame with an explicit chunk size
    data = pd.DataFrame(
        np.random.rand(10, 10), index=np.arange(10), columns=np.arange(3, 13)
    )
    xdf = xpd.DataFrame(data, chunk_size=5)
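The sketch below shows how an integer ``chunk_size`` could translate into chunk lengths along one axis. It assumes the same ``chunk_size`` is applied to every axis and that the last chunk takes the remainder; ``split_axis`` is a hypothetical helper, not part of the Xorbits API.

```python
def split_axis(length, chunk_size):
    """Return the chunk lengths along one axis of the given length."""
    full, rest = divmod(length, chunk_size)
    # `full` chunks of exactly chunk_size, plus one smaller trailing chunk if needed.
    return (chunk_size,) * full + ((rest,) if rest else ())


array_splits = split_axis(100, 30)  # one axis of the (100, 100) array
frame_splits = split_axis(10, 5)    # one axis of the 10 x 10 DataFrame
print(array_splits, frame_splits)
print(len(array_splits) ** 2, len(frame_splits) ** 2)  # chunks in a square 2-D grid
```

Under these assumptions, the array above would be split into a 4 x 4 grid of chunks and the DataFrame into a 2 x 2 grid.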
@@ -0,0 +1,86 @@
.. _configuration:

=============
Configuration
=============

In Xorbits, there are two levels of configuration:

- cluster-level: applied to the whole cluster when starting the supervisor and the workers.
- job-level: applied to a specific Xorbits job or Python script.

Cluster-Level Configuration
---------------------------

Cluster-level configurations are applied to the entire Xorbits cluster and affect all jobs
running on it. These settings are typically defined when starting the Xorbits cluster
(i.e., the supervisor and the workers) and remain constant throughout the cluster's lifetime.

Examples of cluster-level configurations include:

- Network: use TCP Socket or UCX.
- Storage: use Shared Memory or Filesystem.

These configurations are usually set through command-line arguments and configuration files
when launching the Xorbits cluster. Specifically, users should create a YAML configuration
file (e.g., ``config.yml``) and pass it with the ``-f config.yml`` option when starting the
supervisor and workers. Find more details on how to use ``-f`` in :ref:`custom configuration
in cluster deployment <cluster_custom_configuration>`. The default YAML file is
`base_config.yml <https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/_mars/deploy/oscar/base_config.yml>`_.
Write your own like this:

.. code-block:: yaml
    :caption: config.yml

    "@inherits": "@default"
    storage:
      default_config:
        transfer_block_size: 10 * 1024 ** 2
    cluster:
      node_timeout: 1200
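One way to think about ``"@inherits": "@default"`` is as a recursive merge of your file over the default configuration, so that only the keys you list are overridden. The sketch below models that idea with plain dictionaries. It is an assumption about the semantics, not Xorbits' actual configuration loader: ``merge_config`` is hypothetical, and the ``defaults`` dictionary uses placeholder keys and values rather than the real contents of ``base_config.yml``.

```python
import copy


def merge_config(base, override):
    """Recursively overlay `override` on top of `base` without mutating either."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if key == "@inherits":
            continue  # directive key, not a setting
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder defaults for illustration only.
defaults = {
    "storage": {"default_config": {"transfer_block_size": "placeholder"},
                "other_storage_key": "placeholder"},
    "cluster": {"node_timeout": "placeholder", "other_cluster_key": "placeholder"},
}
# Mirrors the config.yml shown above.
user_config = {
    "@inherits": "@default",
    "storage": {"default_config": {"transfer_block_size": "10 * 1024 ** 2"}},
    "cluster": {"node_timeout": 1200},
}

merged = merge_config(defaults, user_config)
print(merged["cluster"])
```

In this model, keys that the user file does not mention keep their default values.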

Job-Level Configuration
-----------------------

Job-level configurations are specific to individual Xorbits jobs or sessions. These settings
allow users to fine-tune the behavior of their specific workloads without affecting other
jobs running on the same cluster.

Job-level configurations can be set using the following methods:

1. Using ``xorbits.options.set_option()`` or ``xorbits.pandas.set_option()``.

   ``xorbits.options.set_option()`` and ``xorbits.pandas.set_option()`` are effective for all packages within Xorbits.

   .. code-block:: python

      from xorbits import options

      options.set_option("chunk_store_limit", 1024 ** 3)

   ``xorbits.pandas.set_option()`` can configure not only Xorbits settings
   but also pandas-native settings.

   .. code-block:: python

      import xorbits.pandas as xpd

      xpd.set_option("chunk_store_limit", 1024 ** 3)
      xpd.set_option("display.max_rows", 100)

2. Using ``xorbits.option_context()`` or ``xorbits.pandas.option_context()``.

   Note that the argument of ``option_context()`` is a ``dict``. These two ``option_context()`` methods are only effective within a specific
   context. Similar to ``xorbits.pandas.set_option()``, ``xorbits.pandas.option_context()`` can also be used to configure pandas-native settings.

   .. code-block:: python

      import xorbits.pandas as xpd

      with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
          # Your Xorbits code here.
          # chunk_store_limit is set to 1 GiB only within this context.
          ...