Clean up CEMS handling of datatypes #3221
Labels
data-types
Dtype conversions, standardization and implications of data types
epacems
Integration and analysis of the EPA CEMS dataset.
metadata
Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
parquet
Issues related to the Apache Parquet file format which we use for long tables.
Is your feature request related to a problem? Please describe.
Right now, CEMS datatypes are not handled like FERC or EIA data. Datatypes are not defined using codes.py or fields.py, but rather there are a few constraints imposed using apply_pudl_dtypes. As of #3187, we now apply some dtypes when we read in raw CEMS data to reduce memory usage, but we don't enforce a set of categories for categorical columns, or define metadata for these fields.
Describe the solution you'd like
Reconfigure CEMS column dtype handling to be more similar to other datasets. CEMS data is stored only as Parquet, so column-level descriptions may not be required. In particular, categoricals should be defined in codes.py.
Describe alternatives you've considered
Currently, columns are assigned dtypes in pudl.extract.epacems on read-in, using a dictionary. See #3187.
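To make the difference between the two approaches concrete, here is a minimal pandas sketch, not the actual PUDL code. It reads a small CSV with a dtype dictionary (the current approach) and then enforces a fixed category set on a categorical column (the requested behavior). The column names, the category list, and the enforce_categories helper are all hypothetical placeholders; in PUDL the category list would presumably live in codes.py.

```python
import io

import pandas as pd

# Hypothetical category list standing in for a definition in codes.py.
EXAMPLE_MEASUREMENT_CODES = ["Measured", "Calculated", "Substitute"]

# Hypothetical dtype dictionary of the kind applied on read-in.
# The categorical column is read as a string first so that unexpected
# values can be detected before conversion.
EXAMPLE_READ_DTYPES = {
    "example_unit_id": "string",
    "example_measurement_code": "string",
    "example_mass_tons": "float32",
}

RAW_CSV = """example_unit_id,example_measurement_code,example_mass_tons
1,Measured,10.5
2,Calculated,3.25
3,Substitute,7.0
"""


def enforce_categories(series: pd.Series, categories: list[str]) -> pd.Series:
    """Convert a series to a categorical with a fixed set of categories.

    Raises ValueError on values outside the allowed set. A plain astype()
    would silently turn those values into NaN instead.
    """
    unexpected = set(series.dropna().unique()) - set(categories)
    if unexpected:
        raise ValueError(f"Unexpected categories: {sorted(unexpected)}")
    return series.astype(pd.CategoricalDtype(categories=categories))


# Step 1: apply dtypes on read-in to reduce memory usage.
df = pd.read_csv(io.StringIO(RAW_CSV), dtype=EXAMPLE_READ_DTYPES)

# Step 2: enforce the allowed categories on the categorical column.
df["example_measurement_code"] = enforce_categories(
    df["example_measurement_code"], EXAMPLE_MEASUREMENT_CODES
)
```

Raising on unexpected values is one possible design choice; mapping them to a null value or logging a warning would also work, depending on how strictly the categories should be enforced.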