ValueError: Not all columns are categoricals #19

Open
YarShev opened this issue Jan 12, 2024 · 1 comment


YarShev commented Jan 12, 2024

  • condastats version: 0.2.1
  • Python version: 3.9.18
  • Operating System: Ubuntu 22.04.3 LTS

Description

I wanted to collect some statistics for a package with condastats but encountered the error below.


What I Did

$ conda install -c conda-forge condastats
$ condastats overall pandas
ValueError: Not all columns are categoricals

Is there something I am doing wrong?
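
For reference, the CLI call above should be equivalent to the Python API call below (a minimal sketch; condastats.cli.overall is the entry point that shows up in the traceback in the comment below):

import condastats.cli

# Equivalent of `condastats overall pandas` via the Python API; on the
# affected versions this raises the same
# ValueError: Not all columns are categoricals.
print(condastats.cli.overall("pandas"))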


athewsey commented Sep 3, 2024

I'm seeing the same error via the Python API; stack trace below:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 3
      1 import condastats.cli
----> 3 condastats.cli.overall(package_name)

File /opt/conda/lib/python3.10/site-packages/condastats/cli.py:62, in overall(package, month, start_month, end_month, monthly, complete, pkg_platform, data_source, pkg_version, pkg_python)
     55     df = df.query(f'pkg_name in ("{package}")')
     57 # if all optional arguments are None, read in all
     58 # the data for a certain package
     59 else:
     60     # if all optional arguments are None, read in
     61     # all the data for a certain package
---> 62     df = dd.read_parquet(
     63         "s3://anaconda-package-data/conda/monthly/*/*.parquet",
     64         storage_options={"anon": True},
     65         engine="pyarrow"
     66     )
     67     df = df.query(f'pkg_name in ("{package}")')
     69 if complete:

File /opt/conda/lib/python3.10/site-packages/dask_expr/_collection.py:5433, in read_parquet(path, columns, filters, categories, index, storage_options, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, engine, arrow_to_pandas, **kwargs)
   5410         raise NotImplementedError(
   5411             "engine is not supported when using the pyarrow filesystem."
   5412         )
   5414     return new_collection(
   5415         ReadParquetPyarrowFS(
   5416             path,
   (...)
   5429         )
   5430     )
   5432 return new_collection(
-> 5433     ReadParquetFSSpec(
   5434         path,
   5435         columns=_convert_to_list(columns),
   5436         filters=filters,
   5437         categories=categories,
   5438         index=index,
   5439         storage_options=storage_options,
   5440         calculate_divisions=calculate_divisions,
   5441         ignore_metadata_file=ignore_metadata_file,
   5442         metadata_task_size=metadata_task_size,
   5443         split_row_groups=split_row_groups,
   5444         blocksize=blocksize,
   5445         aggregate_files=aggregate_files,
   5446         parquet_file_extension=parquet_file_extension,
   5447         filesystem=filesystem,
   5448         engine=_set_parquet_engine(engine),
   5449         kwargs=kwargs,
   5450         _series=isinstance(columns, str),
   5451     )
   5452 )

File /opt/conda/lib/python3.10/site-packages/dask_expr/_core.py:57, in Expr.__new__(cls, *args, **kwargs)
     55 inst = object.__new__(cls)
     56 inst.operands = [_unpack_collections(o) for o in operands]
---> 57 _name = inst._name
     58 if _name in Expr._instances:
     59     return Expr._instances[_name]

File /opt/conda/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
    979 val = cache.get(self.attrname, _NOT_FOUND)
    980 if val is _NOT_FOUND:
--> 981     val = self.func(instance)
    982     try:
    983         cache[self.attrname] = val

File /opt/conda/lib/python3.10/site-packages/dask_expr/io/parquet.py:776, in ReadParquet._name(self)
    770 @cached_property
    771 def _name(self):
    772     return (
    773         self._funcname
    774         + "-"
    775         + _tokenize_deterministic(
--> 776             funcname(type(self)), self.checksum, *self.operands[:-1]
    777         )
    778     )

File /opt/conda/lib/python3.10/site-packages/dask_expr/io/parquet.py:782, in ReadParquet.checksum(self)
    780 @property
    781 def checksum(self):
--> 782     return self._dataset_info["checksum"]

File /opt/conda/lib/python3.10/site-packages/dask_expr/io/parquet.py:1375, in ReadParquetFSSpec._dataset_info(self)
   1372 dataset_info["checksum"] = tokenize(checksum)
   1374 # Infer meta, accounting for index and columns arguments.
-> 1375 meta = self.engine._create_dd_meta(dataset_info)
   1376 index = dataset_info["index"]
   1377 index = [index] if isinstance(index, str) else index

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py:1280, in ArrowDatasetEngine._create_dd_meta(cls, dataset_info)
   1273         raise ValueError(
   1274             "categories not in available columns.\n"
   1275             "categories: {} | columns: {}".format(categories, list(all_columns))
   1276         )
   1278     # Make sure all categories are set to "unknown".
   1279     # Cannot include index names in the `cols` argument.
-> 1280     meta = clear_known_categories(
   1281         meta,
   1282         cols=[c for c in categories if c not in meta.index.names],
   1283         dtype_backend=dtype_backend,
   1284     )
   1286 if partition_obj:
   1287     # Update meta dtypes for partitioned columns
   1288     for partition in partition_obj:

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/utils.py:294, in clear_known_categories(x, cols, index, dtype_backend)
    292     cols = mask[mask].index
    293 elif not mask.loc[cols].all():
--> 294     raise ValueError("Not all columns are categoricals")
    295 for c in cols:
    296     x[c] = x[c].cat.set_categories([UNKNOWN_CATEGORIES])

ValueError: Not all columns are categoricals
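
Reading the trace, the failure is in dask's clear_known_categories: the parquet files' pandas metadata lists certain columns as categorical, but the meta dask constructs no longer carries them as categorical dtypes, so the mask.loc[cols].all() check fails. As a stopgap, a single month of the public data can be read directly with pandas, bypassing dask's categorical meta handling. A minimal sketch; the per-month file name and the counts column are assumptions (only pkg_name and the bucket layout appear in the trace above):

import pandas as pd

# Read a single month directly (anonymous S3 access via s3fs).
# The file naming is assumed from the monthly/*/*.parquet glob above.
df = pd.read_parquet(
    "s3://anaconda-package-data/conda/monthly/2024/2024-01.parquet",
    storage_options={"anon": True},
)

# `pkg_name` comes from the query in the traceback; `counts` is assumed
# from the anaconda-package-data schema.
print(df.loc[df["pkg_name"] == "pandas", "counts"].sum())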

Other potentially-relevant deps:

  • condastats = 0.2.1
  • dask = 2024.5.0
  • fsspec = 2023.12.2
  • numpy = 1.26.4
  • s3fs = 2023.12.2

It seems like it's been a while since condastats was updated, so I would guess some incompatibilities have crept into its dependencies. If anybody has found a solution already, it would be great to know.
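
One experiment that might be worth trying (untested, and a guess rather than a confirmed fix, since the categorical check itself lives in shared dask code): opting out of the dask-expr query planner that all of the failing frames above go through.

# Untested sketch: disable query planning *before* dask.dataframe is imported.
import dask

dask.config.set({"dataframe.query-planning": False})

import condastats.cli  # must be imported after the config change

print(condastats.cli.overall("pandas"))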
