ValueError: Not all columns are categoricals #19

Open
YarShev opened this issue Jan 12, 2024 · 1 comment


YarShev commented Jan 12, 2024

  • condastats version: 0.2.1
  • Python version: 3.9.18
  • Operating System: Ubuntu 22.04.3 LTS

Description

I wanted to collect some statistics for a package with condastats but encountered the error below.


What I Did

$ conda install -c conda-forge condastats
$ condastats overall pandas
ValueError: Not all columns are categoricals

Is there something I am doing wrong?
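
For reference, the CLI call above should be equivalent to the Python API call below (a minimal sketch; condastats.cli.overall is the entry point that shows up in the traceback in the comment below):

import condastats.cli

# Equivalent of `condastats overall pandas` via the Python API; on the
# affected versions this raises the same
# ValueError: Not all columns are categoricals.
print(condastats.cli.overall("pandas"))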


athewsey commented Sep 3, 2024

I'm seeing the same error via the Python API; stack trace below:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[14], line 3
      1 import condastats.cli
----> 3 condastats.cli.overall(package_name)

File /opt/conda/lib/python3.10/site-packages/condastats/cli.py:62, in overall(package, month, start_month, end_month, monthly, complete, pkg_platform, data_source, pkg_version, pkg_python)
     55     df = df.query(f'pkg_name in ("{package}")')
     57 # if all optional arguments are None, read in all
     58 # the data for a certain package
     59 else:
     60     # if all optional arguments are None, read in
     61     # all the data for a certain package
---> 62     df = dd.read_parquet(
     63         "s3://anaconda-package-data/conda/monthly/*/*.parquet",
     64         storage_options={"anon": True},
     65         engine="pyarrow"
     66     )
     67     df = df.query(f'pkg_name in ("{package}")')
     69 if complete:

File /opt/conda/lib/python3.10/site-packages/dask_expr/_collection.py:5433, in read_parquet(path, columns, filters, categories, index, storage_options, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, engine, arrow_to_pandas, **kwargs)
   5410         raise NotImplementedError(
   5411             "engine is not supported when using the pyarrow filesystem."
   5412         )
   5414     return new_collection(
   5415         ReadParquetPyarrowFS(
   5416             path,
   (...)
   5429         )
   5430     )
   5432 return new_collection(
-> 5433     ReadParquetFSSpec(
   5434         path,
   5435         columns=_convert_to_list(columns),
   5436         filters=filters,
   5437         categories=categories,
   5438         index=index,
   5439         storage_options=storage_options,
   5440         calculate_divisions=calculate_divisions,
   5441         ignore_metadata_file=ignore_metadata_file,
   5442         metadata_task_size=metadata_task_size,
   5443         split_row_groups=split_row_groups,
   5444         blocksize=blocksize,
   5445         aggregate_files=aggregate_files,
   5446         parquet_file_extension=parquet_file_extension,
   5447         filesystem=filesystem,
   5448         engine=_set_parquet_engine(engine),
   5449         kwargs=kwargs,
   5450         _series=isinstance(columns, str),
   5451     )
   5452 )

File /opt/conda/lib/python3.10/site-packages/dask_expr/_core.py:57, in Expr.__new__(cls, *args, **kwargs)
     55 inst = object.__new__(cls)
     56 inst.operands = [_unpack_collections(o) for o in operands]
---> 57 _name = inst._name
     58 if _name in Expr._instances:
     59     return Expr._instances[_name]

File /opt/conda/lib/python3.10/functools.py:981, in cached_property.__get__(self, instance, owner)
    979 val = cache.get(self.attrname, _NOT_FOUND)
    980 if val is _NOT_FOUND:
--> 981     val = self.func(instance)
    982     try:
    983         cache[self.attrname] = val

File /opt/conda/lib/python3.10/site-packages/dask_expr/io/parquet.py:776, in ReadParquet._name(self)
    770 @cached_property
    771 def _name(self):
    772     return (
    773         self._funcname
    774         + "-"
    775         + _tokenize_deterministic(
--> 776             funcname(type(self)), self.checksum, *self.operands[:-1]
    777         )
    778     )

File /opt/conda/lib/python3.10/site-packages/dask_expr/io/parquet.py:782, in ReadParquet.checksum(self)
    780 @property
    781 def checksum(self):
--> 782     return self._dataset_info["checksum"]

File /opt/conda/lib/python3.10/site-packages/dask_expr/io/parquet.py:1375, in ReadParquetFSSpec._dataset_info(self)
   1372 dataset_info["checksum"] = tokenize(checksum)
   1374 # Infer meta, accounting for index and columns arguments.
-> 1375 meta = self.engine._create_dd_meta(dataset_info)
   1376 index = dataset_info["index"]
   1377 index = [index] if isinstance(index, str) else index

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/io/parquet/arrow.py:1280, in ArrowDatasetEngine._create_dd_meta(cls, dataset_info)
   1273         raise ValueError(
   1274             "categories not in available columns.\n"
   1275             "categories: {} | columns: {}".format(categories, list(all_columns))
   1276         )
   1278     # Make sure all categories are set to "unknown".
   1279     # Cannot include index names in the `cols` argument.
-> 1280     meta = clear_known_categories(
   1281         meta,
   1282         cols=[c for c in categories if c not in meta.index.names],
   1283         dtype_backend=dtype_backend,
   1284     )
   1286 if partition_obj:
   1287     # Update meta dtypes for partitioned columns
   1288     for partition in partition_obj:

File /opt/conda/lib/python3.10/site-packages/dask/dataframe/utils.py:294, in clear_known_categories(x, cols, index, dtype_backend)
    292     cols = mask[mask].index
    293 elif not mask.loc[cols].all():
--> 294     raise ValueError("Not all columns are categoricals")
    295 for c in cols:
    296     x[c] = x[c].cat.set_categories([UNKNOWN_CATEGORIES])

ValueError: Not all columns are categoricals
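
Reading the trace, the failure is in dask's clear_known_categories: the parquet files' pandas metadata lists certain columns as categorical, but the meta dask constructs no longer carries them as categorical dtypes, so the mask.loc[cols].all() check fails. As a stopgap, a single month of the public data can be read directly with pandas, bypassing dask's categorical meta handling. A minimal sketch; the per-month file name and the counts column are assumptions (only pkg_name and the bucket layout appear in the trace above):

import pandas as pd

# Read a single month directly (anonymous S3 access via s3fs).
# The file naming is assumed from the monthly/*/*.parquet glob above.
df = pd.read_parquet(
    "s3://anaconda-package-data/conda/monthly/2024/2024-01.parquet",
    storage_options={"anon": True},
)

# `pkg_name` comes from the query in the traceback; `counts` is assumed
# from the anaconda-package-data schema.
print(df.loc[df["pkg_name"] == "pandas", "counts"].sum())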

Other potentially-relevant deps:

  • condastats = 0.2.1
  • dask = 2024.5.0
  • fsspec = 2023.12.2
  • numpy = 1.26.4
  • s3fs = 2023.12.2

It seems like it's been a while since condastats was updated, so I would guess some incompatibilities have crept into its dependencies. If anybody has found a solution already, it would be great to know.
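
One experiment that might be worth trying (untested, and a guess rather than a confirmed fix, since the categorical check itself lives in shared dask code): opting out of the dask-expr query planner that all of the failing frames above go through.

# Untested sketch: disable query planning *before* dask.dataframe is imported.
import dask

dask.config.set({"dataframe.query-planning": False})

import condastats.cli  # must be imported after the config change

print(condastats.cli.overall("pandas"))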
