Merge branch 'main' into numpy2
mergify[bot] authored Oct 6, 2024
2 parents 6a02fe7 + 3304440 commit 7faf811
Showing 12 changed files with 367 additions and 34 deletions.
10 changes: 10 additions & 0 deletions doc/source/reference/xorbits/index.rst
Original file line number Diff line number Diff line change
@@ -17,6 +17,16 @@ Initialization
shutdown
deploy.kubernetes.client.new_cluster

Configuration
~~~~~~~~~~~~~

.. autosummary::
   :toctree: generated/

   option_context
   options.get_option
   options.set_option

Computation
~~~~~~~~~~~

1 change: 1 addition & 0 deletions doc/source/user_guide/best_practices.rst
@@ -13,3 +13,4 @@ practices, and helps users solve some common problems.

loading_data
storage_backend
chunking
77 changes: 77 additions & 0 deletions doc/source/user_guide/chunking.rst
@@ -0,0 +1,77 @@
.. _chunking:

========
Chunking
========

Xorbits divides large datasets into multiple chunks, with each chunk executed independently using
single-node libraries such as pandas and numpy. Chunking significantly impacts performance. Too
many chunks can lead to a large computation graph, causing the supervisor to spend excessive time
on scheduling. Conversely, too few chunks may result in OOM (Out-Of-Memory) issues for some chunks
that exceed memory capacity. Therefore, a single chunk should not be too large or too small, and
chunking needs to align with both the computation and the available storage. Users familiar with
Dask will know that it requires manually setting chunk shapes or sizes via operations such as :code:`repartition()`.

Automatically
-------------

Unlike Dask, Xorbits does not require users to manually set chunk sizes or perform :code:`repartition()`
operations, as our chunking process occurs automatically in the background, transparent to the user.
This automatic chunking mechanism simplifies user interfaces (no more extra :code:`repartition` code) and
optimizes performance (no more OOM issues). We call this process **Dynamic Tiling**. Interested
readers can refer to our `research paper <https://arxiv.org/abs/2401.00865>`_ for more detailed
information.

Xorbits' operator partitioning is referred to as tiling. We have a predefined option called
:code:`chunk_store_limit`. This option controls the upper limit of each chunk's size. During the
tiling process, Xorbits calculates the size of the data incoming from upstream operators, so that
each chunk's data size is at most the :code:`chunk_store_limit`. Any data exceeding the
:code:`chunk_store_limit` is partitioned into a new chunk.
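
As a back-of-the-envelope illustration (a simplified sketch in plain Python, not Xorbits'
actual tiling code, and :code:`estimate_num_chunks` is a hypothetical helper), the number of
chunks this rule produces can be estimated by dividing the incoming data size by the limit:

.. code-block:: python

   import math

   # Hypothetical helper: estimates how many chunks a dataset is split into,
   # assuming each chunk holds at most `chunk_store_limit` bytes.
   def estimate_num_chunks(data_size_bytes: int,
                           chunk_store_limit: int = 512 * 1024 ** 2) -> int:
       return max(1, math.ceil(data_size_bytes / chunk_store_limit))

   # A 4 GB dataset with the default 512 MB limit is split into 8 chunks.
   print(estimate_num_chunks(4 * 1024 ** 3))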

We have set this :code:`chunk_store_limit` option to :code:`512 * 1024 ** 2`, which is equivalent to
512 MB. It's important to note that this value may not be optimal for all scenarios and workloads.
In CPU environments, setting this value higher may not yield substantial benefits even if you
have a large amount of RAM available. However, in GPU scenarios, it's advisable to set this value
higher to maximize the data size within each chunk, thereby minimizing data transfer between GPUs.

You can set this value with a context: :code:`xorbits.pandas.option_context({"chunk_store_limit": 1024 ** 3})`.

.. code-block:: python

   import xorbits.pandas as xpd

   with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
       ...  # your xorbits code

Or you can set this value at the beginning of your Python script with :code:`xorbits.pandas.set_option("chunk_store_limit", 1024 ** 3)`:

.. code-block:: python

   import xorbits.pandas as xpd

   xpd.set_option("chunk_store_limit", 1024 ** 3)
   ...  # your xorbits code

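
Conceptually, :code:`option_context` behaves like a standard context manager that saves the
current option values on entry and restores them on exit. A minimal standalone sketch of that
pattern (plain Python for illustration, not the actual Xorbits implementation):

.. code-block:: python

   import contextlib

   _options = {"chunk_store_limit": 512 * 1024 ** 2}  # stand-in global defaults

   @contextlib.contextmanager
   def option_context(overrides):
       saved = {key: _options[key] for key in overrides}
       _options.update(overrides)
       try:
           yield
       finally:
           _options.update(saved)  # restore the previous values on exit

   with option_context({"chunk_store_limit": 1024 ** 3}):
       assert _options["chunk_store_limit"] == 1024 ** 3
   assert _options["chunk_store_limit"] == 512 * 1024 ** 2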
Manually
--------

We recommend using either the :code:`xorbits.option_context()` method or the :code:`xorbits.options`
attribute mentioned above to configure the setting. If you wish to specify the number of chunks
(typically for debugging purposes), you can do so by specifying :code:`chunk_size`
when creating a Xorbits DataFrame or Array.

.. code-block:: python

   import numpy as np
   import pandas as pd
   import xorbits.numpy as xnp
   import xorbits.pandas as xpd

   a = xnp.ones((100, 100), chunk_size=30)
   data = pd.DataFrame(
       np.random.rand(10, 10), index=np.arange(10), columns=np.arange(3, 13)
   )
   xdf = xpd.DataFrame(data, chunk_size=5)
86 changes: 86 additions & 0 deletions doc/source/user_guide/configuration.rst
@@ -0,0 +1,86 @@
.. _configuration:

=============
Configuration
=============

In Xorbits, there are two types of configuration and option setting approaches:

- cluster-level: applied to the whole cluster when starting the supervisor and the workers.
- job-level: applied to a specific Xorbits job or Python script.

Cluster-Level Configuration
---------------------------

Cluster-level configurations are applied to the entire Xorbits cluster and affect all jobs
running on it. These settings are typically defined when starting the Xorbits cluster
(i.e., the supervisor or the workers) and remain constant throughout the cluster's lifetime.

Examples of cluster-level configurations include:

- Network: use TCP Socket or UCX.
- Storage: use Shared Memory or Filesystem.

These configurations are usually set through command-line arguments and configuration files
when launching the Xorbits cluster. Specifically, users should create a YAML configuration
file (e.g., ``config.yml``) and pass it when starting the supervisor and workers using the
``-f config.yml`` option. Find more details on how to use ``-f`` in :ref:`custom configuration
in cluster deployment <cluster_custom_configuration>`. The default YAML file is
`base_config.yml <https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/_mars/deploy/oscar/base_config.yml>`_.
You can write your own like this:

.. code-block:: yaml
   :caption: config.yml

   "@inherits": "@default"
   storage:
     default_config:
       transfer_block_size: 10 * 1024 ** 2
   cluster:
     node_timeout: 1200

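The ``"@inherits": "@default"`` line means your file only needs to list the keys you want to
override; everything else comes from the default configuration. Conceptually this is a recursive
dictionary merge, sketched below (assumed semantics for illustration with made-up default values,
not the actual Xorbits loader):

.. code-block:: python

   def merge_config(base: dict, override: dict) -> dict:
       """Recursively merge `override` into a copy of `base`."""
       merged = dict(base)
       for key, value in override.items():
           if key == "@inherits":
               continue  # resolved by the loader when it picks the base file
           if isinstance(value, dict) and isinstance(merged.get(key), dict):
               merged[key] = merge_config(merged[key], value)
           else:
               merged[key] = value
       return merged

   # Hypothetical defaults standing in for base_config.yml.
   default = {
       "storage": {"default_config": {"transfer_block_size": 5 * 1024 ** 2}},
       "cluster": {"node_timeout": 600},
   }
   user = {"@inherits": "@default", "cluster": {"node_timeout": 1200}}
   merged = merge_config(default, user)
   # merged overrides node_timeout but keeps the default storage settings
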
Job-Level Configuration
-----------------------

Job-level configurations are specific to individual Xorbits jobs or sessions. These settings
allow users to fine-tune the behavior of their specific workloads without affecting other
jobs running on the same cluster.

Job-level configurations can be set using the following methods:

1. Using ``xorbits.options.set_option()`` or ``xorbits.pandas.set_option()``.

``xorbits.options.set_option()`` and ``xorbits.pandas.set_option()`` are effective for all packages within Xorbits.

.. code-block:: python

   from xorbits import options

   options.set_option("chunk_store_limit", 1024 ** 3)

You can also use ``xorbits.pandas.set_option()`` to configure both Xorbits and pandas,
since it accepts pandas-native settings in addition to Xorbits options.

.. code-block:: python

   import xorbits.pandas as xpd

   xpd.set_option("chunk_store_limit", 1024 ** 3)
   xpd.set_option("display.max_rows", 100)

2. Using ``xorbits.option_context()`` or ``xorbits.pandas.option_context()``.

Note that the argument of ``option_context()`` is a ``dict``. These two ``option_context()`` configuration methods are only effective within a specific
context. Similar to ``xorbits.pandas.set_option()``, ``xorbits.pandas.option_context()`` can also be used to configure pandas-native settings.

.. code-block:: python

   import xorbits.pandas as xpd

   # The chunk_store_limit is set to 1 GB only within this context.
   with xpd.option_context({"chunk_store_limit": 1024 ** 3}):
       ...  # your Xorbits code here
14 changes: 7 additions & 7 deletions doc/source/user_guide/deployment_cluster.rst
@@ -125,6 +125,7 @@ Extra Options for Workers
| | devices |
+--------------------+----------------------------------------------------------------+

.. _cluster_custom_configuration:

Custom configuration
--------------------
@@ -134,18 +135,17 @@ Default configuration can be modified by specifying a ``-f`` flag. Provide the p
For example
~~~~~~~~~~~

If the user wants to modify ``transfer_block_size`` and ``node_timeout``, specify ``-f your-config.yml``.

.. code-block:: yaml
   :caption: your-config.yml

   "@inherits": "@default"
   storage:
     default_config:
       transfer_block_size: 10 * 1024 ** 2
   cluster:
     node_timeout: 1200
1 change: 1 addition & 0 deletions doc/source/user_guide/index.rst
@@ -19,5 +19,6 @@ Further information on any specific method can be obtained in the

deferred_execution
deployment
configuration
best_practices
logging
3 changes: 2 additions & 1 deletion python/xorbits/__init__.py
@@ -14,6 +14,7 @@


from . import _version
from .config import option_context, options
from .core import run
from .deploy import init, shutdown

@@ -44,4 +45,4 @@ def _install():

__version__ = _version.get_versions()["version"]

__all__ = ["init", "shutdown", "run", "options", "option_context"]
