Implement first-class List type #60629

Draft · wants to merge 21 commits into base: main
Conversation

@WillAyd (Member) commented Dec 30, 2024

Quick POC for now. There's a lot to do here but hoping to work in pieces. This currently assumes pyarrow is installed.

The blocks / formatting stuff is not super familiar to me, so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach it. I think the main problem I am having is that the Block seems to want to infer the type from the values contained. That works for NumPy, but doesn't work with PyArrow, for example when you have an array of all nulls that is separately paired with a type of list[string]

@mroeschke (Member)

Does this assume moving forward with the logical type system PDEP? i.e. List type backed by multiple (in theory) implementations

@WillAyd (Member, Author) commented Dec 30, 2024

PDEP-14 would be nice, but I don't think it's required here. If we do not revert PDEP-10, then we can assume pyarrow is required and just build off of that. This can fit logically into the extension type system.

We may just want to start referring to that as something else besides "numpy_nullable", but there is already an issue open for that: #59032

@jbrockmendel (Member)

The blocks / formatting stuff is not super familiar to me so hoping @mroeschke or @jbrockmendel might have some ideas on how to better approach. I think the main problem I am having is the Block seems to want to infer the type from the values contained.

Yeah, I'd really rather avoid the changes this makes in that part of the code. Will comment in-line and see if we can find alternatives.

try:
    return self.values.dtype
except AttributeError:  # PyArrow fallback
    return self.values.type
Member:

This doesn't make sense to me. self.values should be the EA, and the EA.dtype should be the right thing here.

Member Author:

Ah OK thanks. I think this is a holdover from an intermediate state and I didn't recognize the requirement here. Reverting this fixes a lot of the other comments you've made here as well - thanks!

if dtype:
    klass = get_block_type(dtype)
else:
    klass = get_block_type(values.dtype)
Member:

as above, values.dtype should be the ListDtype already. I don't see why passing dtype separately is necessary.

@@ -505,7 +505,7 @@ def __init__(
            data = data.copy()
        else:
            data = sanitize_array(data, index, dtype, copy)
-       data = SingleBlockManager.from_array(data, index, refs=refs)
+       data = SingleBlockManager.from_array(data, dtype, index, refs=refs)
Member:

if dtype is your ListDtype, then data.dtype should be ListDtype at this point so the new argument should be unnecessary

@@ -1103,7 +1103,11 @@ def format_array(
    List[str]
    """
    fmt_klass: type[_GenericArrayFormatter]
-   if lib.is_np_dtype(values.dtype, "M"):
+   if hasattr(values, "type") and values.type == "null":
Member:

can we do something more explicit than hasattr checks? i.e. isinstance(dtype, ListDtype) or whatever?


return ListArray(data)
class TestListArray(BaseConstructorsTests): ...
Member:

I think we moved away from this pattern to just ExtensionTests

Member Author:

Yea - I think we can move to that as this gets more production ready. I just wanted to start with something really small while in draft mode

@jbrockmendel
Copy link
Member

The thing I've always assumed would be a PITA with a ListDtype is __setitem__ distinguishing whether df.iloc[x, :3] = [a, b, c] is to behave like

df.iloc[x, 0] = a
df.iloc[x, 1] = b
df.iloc[x, 2] = c

vs

df.iloc[x, 0] = [a, b, c]
df.iloc[x, 1] = [a, b, c]
df.iloc[x, 2] = [a, b, c]

have you given any thought to that?

@WillAyd
Copy link
Member Author

WillAyd commented Dec 31, 2024

For a List data type the first option wouldn't be possible, since those are scalar values. So I think the latter is correct; if you wanted unpacking, I think you'd need to provide a list of lists
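For reference, a quick sketch of how the first interpretation behaves today with ordinary NumPy-backed dtypes (hypothetical values; under a List dtype the second, broadcasting interpretation would apply instead, since each cell holds a list scalar):

```python
import pandas as pd

# With NumPy-backed dtypes, a row-slice assignment unpacks element-wise.
df = pd.DataFrame([[0, 0, 0]])
df.iloc[0, :3] = [10, 20, 30]
print(df.iloc[0].tolist())  # -> [10, 20, 30]
```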

@WillAyd force-pushed the implement-list-type branch from ad6fa08 to e25c0d4 on December 31, 2024
@@ -460,6 +461,8 @@ def treat_as_nested(data) -> bool:
        len(data) > 0
        and is_list_like(data[0])
        and getattr(data[0], "ndim", 1) == 1
+       # TODO(wayd): hack so pyarrow list elements don't expand
+       and not isinstance(data[0], pa.ListScalar)
Member:

I think having is_list_like return False for pyarrow scalars is less hacky?

Member Author:

That's probably true in this particular case, although I'm not sure how it will generalize to all uses of is_list_like. Will do more research

@@ -494,7 +495,7 @@ def __init__(
        if not is_list_like(data):
            data = [data]
            index = default_index(len(data))
-       elif is_list_like(data):
+       elif is_list_like(data) and not isinstance(dtype, ListDtype):
Member:

What about nested list?

Member Author:

Yea this is a tough one to handle. I'm not sure if something like:

pd.Series([1, 2, 3], index=range(3), dtype=pd.ListDtype())

should raise or broadcast. I think the tests currently want it to broadcast, but we could override that expectation for this array

Member:

Good example. Also, indexing is going to be a pain point: a lot of checking for nested listlikes happens in indexing.py, and I'm wary of stumbling blocks there. Also indexing in general with a ListDtype index. I think the solution is to interpret lists as sequences and only treat them as a scalar if explicitly wrapped in pa.ListScalar (or something equivalent)

Member Author (@WillAyd, Jan 4, 2025):

I originally started down that path, but I think the problem with expecting pa.ListScalar is that when you select from a ListArray, most users probably expect a plain Python list back. So to use that same selection as an indexer, I think it's hard to avoid the built-in type

Member:

So to use that same selection as an indexer

I think we have to just disallow that. Otherwise we have a long tail of places where we expect labels to be non-listlike and nested objects to mean 2D. e.g. df.set_index(list_column).T.set_index([1, 2, 3]) is ambiguous whether it wants three columns to be set as an index or a single column with label [1, 2, 3]. I'm particularly worried about indexing.py code where we check for nested data getting way more complicated if the behavior has to depend on dtypes.
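The existing list-of-labels interpretation is easy to demonstrate (a small sketch with hypothetical integer column labels): today a list passed to set_index names multiple columns, which is exactly the reading a List-valued label would collide with.

```python
import pandas as pd

df = pd.DataFrame({1: [7, 8], 2: [9, 10], 3: [11, 12]})
# [1, 2, 3] is read as three column labels, producing a MultiIndex ...
out = df.set_index([1, 2, 3])
print(out.index.nlevels)  # -> 3
# ... not as a single column with the scalar label [1, 2, 3].
```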

@@ -428,7 +428,7 @@ def _box_pa_scalar(cls, value, pa_type: pa.DataType | None = None) -> pa.Scalar:
        """
        if isinstance(value, pa.Scalar):
            pa_scalar = value
-       elif isna(value):
+       elif not is_list_like(value) and isna(value):
Member:

Can you add a comment on why this is necessary? E.g. off the top of my head I don't know if pa.ListScalar subclasses pa.Scalar

Member Author:

In this particular case the isna(value) check fails when value is a list, since it doesn't return a boolean.

The "not is_list_like" check was a quick way to prevent this branch from raising, but I'm open to better ways of expressing that

Member Author:

As to your question, pa.ListScalar does inherit from pa.Scalar (all scalars in pyarrow do), but that is not the type hitting this branch, since it is caught by the preceding one
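The behavior being worked around can be reproduced directly (a minimal sketch): pd.isna is elementwise on list-likes, so the bare isna(value) branch cannot serve as a boolean guard when value may be a list.

```python
import numpy as np
import pandas as pd

scalar_check = pd.isna(None)      # a plain boolean: True
list_check = pd.isna([1, None])   # elementwise: an ndarray, not a bool

assert isinstance(list_check, np.ndarray)
# Using list_check in an `if` raises "The truth value of an array ... is ambiguous"
```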

This should be three items [B, NA, A] with
A < B and NA missing.
"""
pytest.skip("ListArray does not support sorting")
Member:

Why not? Don't Python lists have order semantics?

Member Author:

PyArrow doesn't directly support sorting lists, and if it did I'm not sure it would use the same semantics as Python (I don't know how other languages would care for that).

I suppose we could implement something ourselves to cast all elements to Python and use the language's list semantics, but that's likely a big change better left to a follow-up
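That follow-up idea could look roughly like this (a sketch using plain Python semantics; the missing-values-last convention is an illustrative assumption, not something the PR specifies):

```python
# Cast elements to plain Python lists, then rely on Python's
# lexicographic list comparison; place missing values last.
values = [[2], [1, 3], None, [1]]

non_null = sorted(v for v in values if v is not None)
result = non_null + [None] * values.count(None)
print(result)  # -> [[1], [1, 3], [2], None]
```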

Member:

OK

@@ -834,6 +834,9 @@ cpdef ndarray[object] ensure_string_array(
        if isinstance(val, bytes):
            # GH#49658 discussion of desired behavior here
            result[i] = val.decode()
+       elif isinstance(val, np.ndarray):
Member:

In the Cython code we use util.is_array for this check

return fmt_values


class _ListFormatter(_GenericArrayFormatter):
Member:

Doesn't look like this is used?

Member Author:

Yep dead code - thanks!

@@ -1,138 +0,0 @@
"""
Member:

Not a huge deal, but I'd prefer if we got the tests passing with the implementation still living here, then moved the implementation in a separate PR

Member Author:

Sure that's not a problem

Member Author (@WillAyd, Jan 4, 2025):

Actually, thinking through this some more: some of the modules like pandas.core end up importing the ListDtype and ListArray. For that, maybe it's better to move this out of tests?

Member:

OK, no big deal if it's not easy

@WillAyd force-pushed the implement-list-type branch from b4252c3 to 5a2b113 on January 4, 2025
        name="a",
    )
    actual = ser.list.len()
-   expected = Series([3, 2, None], dtype=ArrowDtype(pa.int32()), name="a")
+   expected = Series([3, 2, None], dtype=ArrowDtype(pa.int64()), name="a")
Member Author (@WillAyd, Jan 4, 2025):

Not sure how we feel about the changes in this test module. I swapped us over to the first-class type instead of the ArrowDtype list for all the tests, but we can also parametrize.

The tricky thing about the logical type here is that, without other logical types, we get back the implementation detail of a pyarrow type for things like list.len().

Alternatively we could decide on just returning pd.Int64Dtype() here. That would abstract away the implementation details of the ListArray and not lose any data, although the mask of the Int64Array requires more memory than pyarrow. This approach would also be more in line with what PDEP-13 proposes

Member Author (@WillAyd, Jan 4, 2025):

Thinking this over some more, it should probably return pd.Int64Dtype() for length. That would match what .str.len() returns for pd.StringDtype
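That precedent is easy to check (a small sketch): .str.len() on the nullable string dtype already returns the masked Int64 dtype rather than an implementation-specific integer type.

```python
import pandas as pd

# StringDtype's .str.len() returns the nullable Int64 dtype,
# with <NA> preserved for missing entries.
s = pd.Series(["abc", "de", None], dtype="string")
lengths = s.str.len()
print(lengths.dtype)  # Int64
```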

@jbrockmendel
Copy link
Member

Is the idea for this to replace ArrowDtype(pa.list_etc()) and the corresponding ArrowEA or for both to exist? Having multiple implementations of things has been a pain point for the StringDtype(s).
