Arun sqc interop #5

Open
wants to merge 47 commits into
base: Feature_Branch_for_QC
47 commits
4815bc3
FIX-#0000: Fix type hint (#7343)
ZhipengXue97 Jul 15, 2024
f5f9ae9
DOCS-#0000: Update RunLLM Ask AI widget script path (#7345)
likawind Jul 22, 2024
7c1dde0
FEAT-#7331: Initial Polars API (#7332)
devin-petersohn Jul 24, 2024
a40cef7
FEAT-#7340: Add more granular lazy flags to query compiler (#7348)
noloerino Jul 29, 2024
621f49e
FIX-#7351: Add ipython method calls to non-lookup list (#7352)
devin-petersohn Jul 31, 2024
24018db
FIX-#7134: Use a separate docstring class for BasePandasDataset. (#7353)
sfc-gh-mvashishtha Jul 31, 2024
22ed4d8
FIX-#7113: Fix docstring overrides for subclasses. (#7354)
sfc-gh-mvashishtha Aug 1, 2024
6dce30e
FIX-#7355: Cpu count would be set incorrectly on a cluster (#7356)
arunjose696 Aug 1, 2024
b236b76
FIX-#7357: Fix `NoAttributeError` on `DataFrame.copy` (#7358)
devin-petersohn Aug 2, 2024
05e5c48
FEAT-#7368: Add a new environment variable for using dynamic partitio…
Retribution98 Aug 19, 2024
8fc230a
FIX-#7373: Try a previous version of `motoserver/moto` service, pin t…
anmyachev Aug 26, 2024
da01571
FEAT-#4605: Add native query compiler (#7259)
arunjose696 Aug 26, 2024
8249915
FEAT-#7337: Using dynamic partitionning in `broadcast_apply` (#7338)
Retribution98 Aug 26, 2024
ba48d29
FEAT-modin-project#4605: Add small query compiler
arunjose696 Apr 29, 2024
40e1cd4
fixing tests
arunjose696 May 13, 2024
a574172
removing additional parameter from try_cast_to_pandas
arunjose696 May 15, 2024
6f9795b
test_iter passing
arunjose696 May 16, 2024
07d8d3a
fixing isin unique and clip
arunjose696 May 16, 2024
ac75854
Enable test_default.py and test_join_sort.py
YarShev May 16, 2024
52b3136
fixed test_map_metadata by adding set_frame_dtypes_cache and has_mate…
arunjose696 May 23, 2024
4aed823
Fix test_dot
YarShev May 23, 2024
7149ba3
test_udf passing
arunjose696 May 23, 2024
b0e0a82
All tests except one passing in modin/tests/pandas/dataframe
arunjose696 May 23, 2024
507d58a
All tests in modin/tests/pandas/dataframe/ passing
arunjose696 May 29, 2024
88a6207
PR comments
arunjose696 May 29, 2024
b01f7b9
renaming to PlainPandasQueryCompiler to NativeDataframeMode
arunjose696 Jun 5, 2024
1a61931
renaming to PlainPandasQueryCompiler to NativeDataframeMode
arunjose696 Jun 5, 2024
a105677
PR comments + changes
arunjose696 Jun 10, 2024
8e29fbe
Apply suggestions from code review
arunjose696 Jun 12, 2024
1620f82
fix conflict
arunjose696 Jun 24, 2024
4392165
Apply suggestions from code review
arunjose696 Jul 4, 2024
7a1a30f
PR comments
arunjose696 Jul 4, 2024
4e24de2
FEAT-modin-project#7308: Interoperability between DataFrames using di…
arunjose696 Jul 3, 2024
2308295
Started tests for interop
arunjose696 Jul 4, 2024
d240ac8
modified test_binary and test_default
arunjose696 Jul 10, 2024
a4c1697
Started adding testindexing
arunjose696 Jul 12, 2024
89dabf9
fixing envvars repeatation
arunjose696 Jul 12, 2024
b212c3b
adding tests
arunjose696 Aug 5, 2024
90ba1a4
PR comments and tests
arunjose696 Aug 14, 2024
e9060a7
Update modin/core/storage_formats/pandas/query_compiler_validator.py
arunjose696 Aug 14, 2024
60f9512
PR comments and tests
arunjose696 Aug 14, 2024
80ce01c
Doc updates
arunjose696 Aug 26, 2024
6a068dc
PR comments
arunjose696 Aug 28, 2024
cac89d2
Apply suggestions from code review
arunjose696 Aug 28, 2024
8e8ec46
PR changes
arunjose696 Aug 28, 2024
c1b0942
creating function for create_test_df_in_defined_mode
arunjose696 Aug 28, 2024
f488872
creating function for create_test_df_in_defined_mode
arunjose696 Aug 28, 2024
52 changes: 48 additions & 4 deletions .github/workflows/ci.yml
@@ -246,12 +246,16 @@ jobs:
unidist: ${{ steps.filter.outputs.unidist }}
engines: ${{ steps.engines.outputs.engines }}
experimental: ${{ steps.experimental.outputs.experimental }}
test-native-dataframe-mode: ${{ steps.filter.outputs.test-native-dataframe-mode }}
steps:
- uses: actions/checkout@v4
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
test-native-dataframe-mode:
- 'modin/core/storage_formats/pandas/native_query_compiler.py'
- 'modin/core/storage_formats/base/query_compiler.py'
shared: &shared
- 'modin/core/execution/dispatching/**'
ray:
@@ -293,7 +297,7 @@ jobs:
name: test-ubuntu (engine unidist ${{matrix.unidist-backend}}, python ${{matrix.python-version}})
services:
moto:
image: motoserver/moto
image: motoserver/moto:5.0.13
ports:
- 5000:5000
env:
@@ -382,7 +386,7 @@ jobs:
# Using workaround https://github.com/actions/runner/issues/822#issuecomment-1524826092
moto:
# we only need moto service on Ubuntu and for group_4 task or python engine
image: ${{ (matrix.os == 'ubuntu' && (matrix.engine == 'python' || matrix.test_task == 'group_4')) && 'motoserver/moto' || '' }}
image: ${{ (matrix.os == 'ubuntu' && (matrix.engine == 'python' || matrix.test_task == 'group_4')) && 'motoserver/moto:5.0.13' || '' }}
ports:
- 5000:5000
env:
@@ -462,6 +466,7 @@ jobs:
if: matrix.engine == 'python' || matrix.test_task == 'group_4'
- run: python -m pytest modin/tests/interchange/dataframe_protocol/pandas/test_protocol.py
if: matrix.engine == 'python' || matrix.test_task == 'group_4'
- run: python -m pytest modin/tests/polars/test_dataframe.py
- run: |
python -m pip install lazy_import
python -m pytest modin/tests/pandas/integrations/
@@ -507,7 +512,7 @@ jobs:
name: test-${{ matrix.os }}-sanity (engine ${{ matrix.execution.name }}, python ${{matrix.python-version}})
services:
moto:
image: ${{ matrix.os != 'windows' && 'motoserver/moto' || '' }}
image: ${{ matrix.os != 'windows' && 'motoserver/moto:5.0.13' || '' }}
ports:
- 5000:5000
env:
@@ -622,7 +627,7 @@ jobs:
name: test experimental
services:
moto:
image: motoserver/moto
image: motoserver/moto:5.0.13
ports:
- 5000:5000
env:
@@ -664,6 +669,45 @@ jobs:
python-version: ${{matrix.python-version}}
- run: python -m pytest modin/tests/experimental/spreadsheet/test_general.py

test-native-dataframe-mode:
needs: [ lint-flake8, execution-filter]
if: ${{ needs.execution-filter.outputs.test-native-dataframe-mode == 'true' }}
runs-on: ubuntu-latest
defaults:
run:
shell: bash -l {0}
strategy:
matrix:
python-version: ["3.9"]
env:
MODIN_NATIVE_DATAFRAME_MODE: "Pandas"
name: test-native-dataframe-mode (python ${{matrix.python-version}})
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/mamba-env
with:
environment-file: environment-dev.yml
python-version: ${{matrix.python-version}}
- run: python -m pytest modin/tests/pandas/dataframe/test_binary.py
- run: python -m pytest modin/tests/pandas/dataframe/test_default.py
- run: python -m pytest modin/tests/pandas/dataframe/test_indexing.py
- run: python -m pytest modin/tests/pandas/dataframe/test_iter.py
- run: python -m pytest modin/tests/pandas/dataframe/test_join_sort.py
- run: python -m pytest modin/tests/pandas/dataframe/test_map_metadata.py
- run: python -m pytest modin/tests/pandas/dataframe/test_pickle.py
- run: python -m pytest modin/tests/pandas/dataframe/test_reduce.py
- run: python -m pytest modin/tests/pandas/dataframe/test_udf.py
- run: python -m pytest modin/tests/pandas/dataframe/test_window.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_binary.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_default.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_indexing.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_iter.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_join_sort.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_map_metadata.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_pickle.py
- run: python -m pytest modin/tests/pandas/native_df_mode/test_window.py
- uses: ./.github/actions/upload-coverage
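The new job runs only when the `test-native-dataframe-mode` filter matches a changed file. A minimal Python model of that gating decision is sketched below. It is not how `dorny/paths-filter` is implemented; the function name is hypothetical, and exact string comparison is used only because the two patterns in this diff contain no wildcards.

```python
# Toy model of the CI gating added in this diff (illustrative only).
# The two paths are copied from the `test-native-dataframe-mode` filter above.
WATCHED_PATHS = {
    "modin/core/storage_formats/pandas/native_query_compiler.py",
    "modin/core/storage_formats/base/query_compiler.py",
}


def should_run_native_mode_tests(changed_files):
    """Return True if any changed file is one of the watched paths.

    Hypothetical helper: exact matching is enough here because the
    filter patterns in the diff are plain file paths without globs.
    """
    return any(path in WATCHED_PATHS for path in changed_files)


touches_compiler = should_run_native_mode_tests(
    ["README.md", "modin/core/storage_formats/base/query_compiler.py"]
)
docs_only = should_run_native_mode_tests(["docs/index.rst"])
```

Under this model, a docs-only change would skip the job, while any change to either query compiler file would trigger it.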

merge-coverage-artifacts:
needs: [test-internals, test-api-and-no-engine, test-defaults, test-all-unidist, test-all, test-experimental, test-sanity]
if: always() # we need to run it regardless of some job being skipped, like in PR
2 changes: 1 addition & 1 deletion .github/workflows/push-to-main.yml
@@ -19,7 +19,7 @@ jobs:
shell: bash -l {0}
services:
moto:
image: motoserver/moto
image: motoserver/moto:5.0.13
ports:
- 5000:5000
env:
3 changes: 1 addition & 2 deletions docs/_static/custom.js
@@ -3,9 +3,8 @@ document.addEventListener("DOMContentLoaded", function () {
script.type = "module";
script.id = "runllm-widget-script"

script.src = "https://cdn.jsdelivr.net/npm/@runllm/search-widget@stable/dist/run-llm-search-widget.es.js";
script.src = "https://widget.runllm.com";

script.setAttribute("version", "stable");
script.setAttribute("runllm-keyboard-shortcut", "Mod+j"); // cmd-j or ctrl-j to open the widget.
script.setAttribute("runllm-name", "Modin");
script.setAttribute("runllm-position", "BOTTOM_RIGHT");
33 changes: 33 additions & 0 deletions docs/usage_guide/optimization_notes/index.rst
@@ -37,6 +37,38 @@ Range-partitioning is not a silver bullet, meaning that enabling it is not alway
a link to the list of operations that have support for range-partitioning and practical advice on when one should
enable it: :doc:`operations that support range-partitioning </usage_guide/optimization_notes/range_partitioning_ops>`.

Dynamic-partitioning in Modin
"""""""""""""""""""""""""""""

The Ray engine slows down when it runs a large number of small remote tasks at the same time; Ray Core recommends that users `avoid tiny task`_.
When a Modin DataFrame has a large number of partitions, some functions produce a large number of remote tasks, which can cause slowdowns.
To mitigate this problem, Modin offers dynamic partitioning. This approach reduces the number of remote tasks
by combining multiple partitions into a single virtual partition and performing one common remote task on them.

Dynamic partitioning is typically used for operations that are fully or partially executed on all partitions separately.

.. code-block:: python

import modin.pandas as pd
from modin.config import context

df = pd.DataFrame(...)

with context(DynamicPartitioning=True):
df.abs()

Dynamic partitioning is not always beneficial; it is usually used for medium-sized DataFrames with a large number of columns.
If the number of columns is small, the number of partitions will be close to the number of CPUs, and Ray will not experience this slowdown.
If the DataFrame has too many rows, dynamic partitioning is also a poor fit, since each task is no longer tiny and performing
the combined tasks carries more overhead than assigning them separately.

Whether dynamic partitioning helps depends on factors such as data size, number of CPUs, and the operations performed,
so it is up to the user to determine whether it gives a boost in their case.
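The idea of combining partitions into virtual partitions can be illustrated with a small pure-Python model. This is only a sketch of the task-count arithmetic, not Modin's implementation: the function names, the `group_size` parameter, and the use of plain lists as "partitions" are all assumptions made for illustration, and each loop iteration stands in for one remote task.

```python
def run_per_partition(partitions, func):
    """Simulate one remote task per partition; return results and task count."""
    tasks = 0
    results = []
    for part in partitions:
        tasks += 1  # each partition costs one (simulated) remote task
        results.append(func(part))
    return results, tasks


def run_with_virtual_partitions(partitions, func, group_size):
    """Simulate one remote task per group of `group_size` partitions."""
    tasks = 0
    results = []
    for start in range(0, len(partitions), group_size):
        group = partitions[start:start + group_size]
        tasks += 1  # the whole group is handled by a single (simulated) task
        results.extend(func(part) for part in group)
    return results, tasks


def absolute(part):
    """Element-wise absolute value, standing in for `df.abs()`."""
    return [abs(value) for value in part]


partitions = [[-1, 2], [3, -4], [-5, 6], [7, -8], [9, -10]]
plain_results, plain_tasks = run_per_partition(partitions, absolute)
grouped_results, grouped_tasks = run_with_virtual_partitions(
    partitions, absolute, group_size=2
)
```

Both variants produce the same results, but the grouped variant issues fewer simulated tasks. The trade-off described above still applies: each grouped task does more work, so grouping only pays off while individual tasks are tiny.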

..
TODO: Define heuristics to automatically enable dynamic partitioning without performance penalty.
`Issue #7370 <https://github.com/modin-project/modin/issues/7370>`_

Understanding Modin's partitioning mechanism
""""""""""""""""""""""""""""""""""""""""""""

@@ -311,3 +343,4 @@ an inner join you may want to swap left and right DataFrames.
Note that result columns order may differ for first and second ``merge``.

.. _range-partitioning: https://www.techopedia.com/definition/31994/range-partitioning
.. _`avoid tiny task`: https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-2-avoid-tiny-tasks
1 change: 1 addition & 0 deletions environment-dev.yml
@@ -70,3 +70,4 @@ dependencies:
- git+https://github.com/modin-project/modin-spreadsheet.git@49ffd89f683f54c311867d602c55443fb11bf2a5
# The `numpydoc` version should match the version installed in the `lint-pydocstyle` job of the CI.
- numpydoc==1.6.0
- polars
4 changes: 4 additions & 0 deletions modin/config/__init__.py
@@ -23,6 +23,7 @@
CpuCount,
DaskThreadsPerWorker,
DocModule,
DynamicPartitioning,
Engine,
EnvironmentVariable,
GithubCI,
@@ -39,6 +40,7 @@
MinPartitionSize,
MinRowPartitionSize,
ModinNumpy,
NativeDataframeMode,
NPartitions,
PersistentPickle,
ProgressBar,
@@ -68,6 +70,7 @@
"CpuCount",
"GpuCount",
"Memory",
"NativeDataframeMode",
# Ray specific
"IsRayCluster",
"RayRedisAddress",
@@ -95,6 +98,7 @@
"AsyncReadMode",
"ReadSqlEngine",
"IsExperimental",
"DynamicPartitioning",
# For tests
"TrackFileLeaks",
"TestReadFromSqlServer",
52 changes: 52 additions & 0 deletions modin/config/envvars.py
@@ -332,6 +332,24 @@ class CpuCount(EnvironmentVariable, type=int):

varname = "MODIN_CPUS"

@classmethod
def _put(cls, value: int) -> None:
"""
Put specific value if CpuCount wasn't set by a user yet.

Parameters
----------
value : int
Config value to set.

Notes
-----
This method is used to set CpuCount from cluster resources internally
and should not be called by a user.
"""
if cls.get_value_source() == ValueSource.DEFAULT:
cls.put(value)
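The "set only if the user has not set it" behaviour of `_put` can be sketched with a self-contained stand-in. The classes below are hypothetical and do not reproduce Modin's `EnvironmentVariable` machinery; in particular, the two-member `ValueSource` enum and the way `put` records its source are assumptions made so the example runs on its own.

```python
from enum import IntEnum


class ValueSource(IntEnum):
    """Stand-in for the value-source enum checked by `_put` (assumed shape)."""

    DEFAULT = 0
    SET_BY_USER = 1


class ToyCpuCount:
    """Hypothetical config holder mimicking the guard in `CpuCount._put`."""

    _value = 1
    _source = ValueSource.DEFAULT

    @classmethod
    def put(cls, value):
        # A public `put` marks the value as explicitly set.
        cls._value = value
        cls._source = ValueSource.SET_BY_USER

    @classmethod
    def get(cls):
        return cls._value

    @classmethod
    def _put(cls, value):
        # Internal setter: only applies while the value is still the default.
        if cls._source == ValueSource.DEFAULT:
            cls.put(value)


class FreshConfig(ToyCpuCount):
    """Scenario 1: nothing has been set before the cluster is inspected."""


class UserConfig(ToyCpuCount):
    """Scenario 2: the user sets a value first."""


FreshConfig._put(16)   # accepted, value was still the default
UserConfig.put(4)      # explicit user choice
UserConfig._put(16)    # ignored, the user's value is kept
```

In this toy model the cluster-derived value is applied only in the first scenario, which matches the intent stated in the docstring above.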

@classmethod
def _get_default(cls) -> int:
"""
@@ -874,6 +892,18 @@ class DaskThreadsPerWorker(EnvironmentVariable, type=int):
default = 1


class DynamicPartitioning(EnvironmentVariable, type=bool):
"""
Set to true to use Modin's dynamic-partitioning implementation where possible.

Please refer to the documentation for cases where enabling this option would be beneficial:
https://modin.readthedocs.io/en/stable/usage_guide/optimization_notes/index.html#dynamic-partitioning-in-modin
"""

varname = "MODIN_DYNAMIC_PARTITIONING"
default = False


def _check_vars() -> None:
"""
Check validity of environment variables.
@@ -913,4 +943,26 @@ def _check_vars() -> None:
)


class NativeDataframeMode(EnvironmentVariable, type=str):
"""
Configures the query compiler to process Modin data.

When this config is set to ``Default``, ``PandasQueryCompiler`` is used,
which leads to Modin executing dataframes in distributed fashion.
When set to another supported value (e.g., ``Pandas``), ``NativeQueryCompiler`` is used,
which handles the dataframes without distributing them,
falling back to native library functions (e.g., ``pandas``).

This could be beneficial for handling relatively small dataframes
without involving additional overhead of communication between processes.
"""

varname = "MODIN_NATIVE_DATAFRAME_MODE"
choices = (
"Default",
"Pandas",
)
default = "Default"
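The mapping described in the docstring can be sketched as a small standalone function. This is an illustration only: both helper names are hypothetical, the compiler names are returned as plain strings rather than real classes, and exact, case-sensitive matching against the choices is an assumption (the real config class may normalize values differently).

```python
import os

# Values copied from the `choices` tuple in the diff above.
CHOICES = ("Default", "Pandas")


def read_native_dataframe_mode(environ=None):
    """Read MODIN_NATIVE_DATAFRAME_MODE, falling back to "Default".

    Hypothetical helper; `environ` is injectable so the example can be
    exercised without touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("MODIN_NATIVE_DATAFRAME_MODE", "Default")
    if value not in CHOICES:
        raise ValueError(f"Unsupported mode {value!r}; expected one of {CHOICES}")
    return value


def query_compiler_name(mode):
    """Return the compiler named in the docstring for the given mode."""
    return "PandasQueryCompiler" if mode == "Default" else "NativeQueryCompiler"


default_compiler = query_compiler_name(read_native_dataframe_mode({}))
native_compiler = query_compiler_name(
    read_native_dataframe_mode({"MODIN_NATIVE_DATAFRAME_MODE": "Pandas"})
)
```

With no variable set, the sketch selects the distributed compiler; setting the variable to `Pandas` selects the native one, as the CI job in this PR does through its `env` block.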


_check_vars()
8 changes: 5 additions & 3 deletions modin/core/dataframe/algebra/groupby.py
@@ -655,9 +655,11 @@ def aggregate_on_dict(grp_obj, *args, **kwargs):
)

native_res_part = [] if native_agg_res is None else [native_agg_res]
result = pandas.concat(
[*native_res_part, *custom_results], axis=1, copy=False
)
parts = [*native_res_part, *custom_results]
if parts:
result = pandas.concat(parts, axis=1, copy=False)
else:
result = pandas.DataFrame(columns=result_columns)
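The guard is needed because `pandas.concat` raises a `ValueError` when given an empty list. A minimal reproduction of the pattern is sketched below; the helper name `concat_or_empty` is hypothetical, and the `result_columns` argument simply stands in for the variable used in the diff.

```python
import pandas


def concat_or_empty(parts, result_columns):
    """Concatenate column-wise, or return an empty frame with known columns.

    Hypothetical helper mirroring the guard in the diff: an empty `parts`
    list would otherwise make `pandas.concat` raise a ValueError.
    """
    if parts:
        return pandas.concat(parts, axis=1)
    return pandas.DataFrame(columns=result_columns)


empty = concat_or_empty([], ["a", "b"])
combined = concat_or_empty(
    [pandas.DataFrame({"a": [1, 2]}), pandas.DataFrame({"b": [3, 4]})],
    ["a", "b"],
)
```

The empty case still carries the expected column labels, so downstream code that reorders or selects columns can run without special handling.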

# The order is naturally preserved if there's no custom aggregations
if preserve_aggregation_order and len(custom_aggs):