Merge branch 'main' into numpy2
mergify[bot] authored Oct 6, 2024
2 parents 6a02fe7 + 3304440 commit 7faf811
Showing 12 changed files with 367 additions and 34 deletions.
10 changes: 10 additions & 0 deletions doc/source/reference/xorbits/index.rst
Original file line number Diff line number Diff line change
@@ -17,6 +17,16 @@ Initialization
shutdown
deploy.kubernetes.client.new_cluster

Configuration
~~~~~~~~~~~~~

.. autosummary::
   :toctree: generated/

   option_context
   options.get_option
   options.set_option

Computation
~~~~~~~~~~~

1 change: 1 addition & 0 deletions doc/source/user_guide/best_practices.rst
@@ -13,3 +13,4 @@ practices, and helps users solve some common problems.

loading_data
storage_backend
chunking
77 changes: 77 additions & 0 deletions doc/source/user_guide/chunking.rst
@@ -0,0 +1,77 @@
.. _chunking:

========
Chunking
========

Xorbits divides large datasets into multiple chunks, with each chunk executed independently using
single-node libraries such as pandas and numpy. Chunking significantly impacts performance. Too
many chunks can lead to a large computation graph, causing the supervisor to spend excessive time
on scheduling. Conversely, too few chunks may result in OOM (Out-Of-Memory) issues for some chunks
that exceed memory capacity. Therefore, a single chunk should not be too large or too small, and
chunking needs to align with both the computation and the available storage. Users familiar with
Dask will know that it requires manually setting chunk shapes or sizes via operations such as :code:`repartition()`.

Automatically
-------------

Unlike Dask, Xorbits does not require users to manually set chunk sizes or perform :code:`repartition()`
operations, as our chunking process occurs automatically in the background, transparent to the user.
This automatic chunking mechanism simplifies user interfaces (no more extra :code:`repartition` code) and
optimizes performance (no more OOM issues). We call this process **Dynamic Tiling**. Interested
readers can refer to our `research paper <https://arxiv.org/abs/2401.00865>`_ for more detailed
information.

Xorbits' operator partitioning is referred to as tiling. We have a predefined option called
:code:`chunk_store_limit`. This option controls the upper limit of each chunk's size. During the
tiling process, Xorbits calculates the size of the data incoming from upstream operators, so that
each chunk's data size is at most the :code:`chunk_store_limit`. Any data exceeding the
:code:`chunk_store_limit` is partitioned into a new chunk.
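
As a back-of-the-envelope illustration (a simplified sketch in plain Python, not Xorbits'
actual tiling code, and :code:`estimate_num_chunks` is a hypothetical helper), the number of
chunks this rule produces can be estimated by dividing the incoming data size by the limit:

.. code-block:: python

   import math

   # Hypothetical helper: estimates how many chunks a dataset is split into,
   # assuming each chunk holds at most `chunk_store_limit` bytes.
   def estimate_num_chunks(data_size_bytes: int,
                           chunk_store_limit: int = 512 * 1024 ** 2) -> int:
       return max(1, math.ceil(data_size_bytes / chunk_store_limit))

   # A 4 GB dataset with the default 512 MB limit is split into 8 chunks.
   print(estimate_num_chunks(4 * 1024 ** 3))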

We have set this :code:`chunk_store_limit` option to :code:`512 * 1024 ** 2`, which is equivalent to
512 MB. It's important to note that this value may not be optimal for all scenarios and workloads.
In CPU environments, setting this value higher may not yield substantial benefits even if you
have a large amount of RAM available. However, in GPU scenarios, it's advisable to set this value
higher to maximize the data size within each chunk, thereby minimizing data transfer between GPUs.

You can set this value with a context: :code:`xorbits.pandas.option_context({"chunk_store_limit": 1024 ** 3})`.

.. code-block:: python

   import xorbits.pandas as xpd

   with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
       ...  # your xorbits code

Or you can set this value at the beginning of your Python script with :code:`xorbits.pandas.set_option("chunk_store_limit", 1024 ** 3)`:

.. code-block:: python

   import xorbits.pandas as xpd

   xpd.set_option("chunk_store_limit", 1024 ** 3)
   ...  # your xorbits code

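
Conceptually, :code:`option_context` behaves like a standard context manager that saves the
current option values on entry and restores them on exit. A minimal standalone sketch of that
pattern (plain Python for illustration, not the actual Xorbits implementation):

.. code-block:: python

   import contextlib

   _options = {"chunk_store_limit": 512 * 1024 ** 2}  # stand-in global defaults

   @contextlib.contextmanager
   def option_context(overrides):
       saved = {key: _options[key] for key in overrides}
       _options.update(overrides)
       try:
           yield
       finally:
           _options.update(saved)  # restore the previous values on exit

   with option_context({"chunk_store_limit": 1024 ** 3}):
       assert _options["chunk_store_limit"] == 1024 ** 3
   assert _options["chunk_store_limit"] == 512 * 1024 ** 2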
Manually
--------

We recommend using either the :code:`xorbits.option_context()` method or the :code:`xorbits.options`
attribute mentioned above to configure the setting. If you wish to specify the number of chunks
(typically for debugging purposes), you can do so by specifying :code:`chunk_size`
when creating a Xorbits DataFrame or Array.

.. code-block:: python

   import numpy as np
   import pandas as pd
   import xorbits.numpy as xnp
   import xorbits.pandas as xpd

   a = xnp.ones((100, 100), chunk_size=30)
   data = pd.DataFrame(
       np.random.rand(10, 10), index=np.arange(10), columns=np.arange(3, 13)
   )
   xdf = xpd.DataFrame(data, chunk_size=5)
86 changes: 86 additions & 0 deletions doc/source/user_guide/configuration.rst
@@ -0,0 +1,86 @@
.. _configuration:

=============
Configuration
=============

In Xorbits, there are two types of configuration and option setting approaches:

- cluster-level: applied to the whole cluster when starting the supervisor and the workers.
- job-level: applied to a specific Xorbits job or Python script.

Cluster-Level Configuration
---------------------------

Cluster-level configurations are applied to the entire Xorbits cluster and affect all jobs
running on it. These settings are typically defined when starting the Xorbits cluster
(i.e., the supervisor or the workers) and remain constant throughout the cluster's lifetime.

Examples of cluster-level configurations include:

- Network: use TCP Socket or UCX.
- Storage: use Shared Memory or Filesystem.

These configurations are usually set through command-line arguments and configuration files
when launching the Xorbits cluster. Specifically, users should create a YAML configuration
file (e.g., ``config.yml``) and pass it when starting the supervisor and workers using the
``-f config.yml`` option. Find more details on how to use ``-f`` in :ref:`custom configuration
in cluster deployment <cluster_custom_configuration>`. The default YAML file is
`base_config.yml <https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/_mars/deploy/oscar/base_config.yml>`_.
You can write your own like this:

.. code-block:: yaml
   :caption: config.yml

   "@inherits": "@default"
   storage:
     default_config:
       transfer_block_size: 10 * 1024 ** 2
   cluster:
     node_timeout: 1200

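The ``"@inherits": "@default"`` line means your file only needs to list the keys you want to
override; everything else comes from the default configuration. Conceptually this is a recursive
dictionary merge, sketched below (assumed semantics for illustration with made-up default values,
not the actual Xorbits loader):

.. code-block:: python

   def merge_config(base: dict, override: dict) -> dict:
       """Recursively merge `override` into a copy of `base`."""
       merged = dict(base)
       for key, value in override.items():
           if key == "@inherits":
               continue  # resolved by the loader when it picks the base file
           if isinstance(value, dict) and isinstance(merged.get(key), dict):
               merged[key] = merge_config(merged[key], value)
           else:
               merged[key] = value
       return merged

   # Hypothetical defaults standing in for base_config.yml.
   default = {
       "storage": {"default_config": {"transfer_block_size": 5 * 1024 ** 2}},
       "cluster": {"node_timeout": 600},
   }
   user = {"@inherits": "@default", "cluster": {"node_timeout": 1200}}
   merged = merge_config(default, user)
   # merged overrides node_timeout but keeps the default storage settings
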
Job-Level Configuration
-----------------------

Job-level configurations are specific to individual Xorbits jobs or sessions. These settings
allow users to fine-tune the behavior of their specific workloads without affecting other
jobs running on the same cluster.

Job-level configurations can be set using the following methods:

1. Using ``xorbits.options.set_option()`` or ``xorbits.pandas.set_option()``.

``xorbits.options.set_option()`` and ``xorbits.pandas.set_option()`` are effective for all packages within Xorbits.

.. code-block:: python

   from xorbits import options

   options.set_option("chunk_store_limit", 1024 ** 3)

You can also use ``xorbits.pandas.set_option()`` to configure both Xorbits and pandas,
since it accepts pandas-native settings in addition to Xorbits options.

.. code-block:: python

   import xorbits.pandas as xpd

   xpd.set_option("chunk_store_limit", 1024 ** 3)
   xpd.set_option("display.max_rows", 100)

2. Using ``xorbits.option_context()`` or ``xorbits.pandas.option_context()``.

Note that the argument of ``option_context()`` is a ``dict``. These two ``option_context()`` configuration methods are only effective within a specific
context. Similar to ``xorbits.pandas.set_option()``, ``xorbits.pandas.option_context()`` can also be used to configure pandas-native settings.

.. code-block:: python

   import xorbits.pandas as xpd

   # The chunk_store_limit is set to 1 GB only within this context.
   with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
       ...  # your Xorbits code here
14 changes: 7 additions & 7 deletions doc/source/user_guide/deployment_cluster.rst
@@ -125,6 +125,7 @@ Extra Options for Workers
| | devices |
+--------------------+----------------------------------------------------------------+

.. _cluster_custom_configuration:

Custom configuration
--------------------
@@ -134,18 +135,17 @@ Default configuration can be modified by specifying a ``-f`` flag. Provide the p
For example
~~~~~~~~~~~

If the user wants to modify ``transfer_block_size`` and ``node_timeout``, specify ``-f your-config.yml``.

.. code-block:: yaml
   :caption: your-config.yml

   "@inherits": "@default"
   storage:
     default_config:
       transfer_block_size: 10 * 1024 ** 2
   cluster:
     node_timeout: 1200
1 change: 1 addition & 0 deletions doc/source/user_guide/index.rst
@@ -19,5 +19,6 @@ Further information on any specific method can be obtained in the

deferred_execution
deployment
configuration
best_practices
logging
3 changes: 2 additions & 1 deletion python/xorbits/__init__.py
@@ -14,6 +14,7 @@


from . import _version
from .config import option_context, options
from .core import run
from .deploy import init, shutdown

@@ -44,4 +45,4 @@ def _install():

__version__ = _version.get_versions()["version"]

__all__ = ["init", "shutdown", "run", "options", "option_context"]
