Clean up CEMS handling of datatypes #3221

e-belfer · 2024-01-08T21:30:30Z

Is your feature request related to a problem? Please describe.
Right now, CEMS datatypes are not handled like FERC or EIA data. They are not defined using codes.py or fields.py; instead, a few constraints are imposed using apply_pudl_dtypes. As of #3187, we apply some dtypes when we read in raw CEMS data to reduce memory usage, but we don't enforce a set of categories for categorical columns or define metadata for these fields.

Describe the solution you'd like
Reconfigure CEMS column dtype handling to be more similar to other datasets. CEMS data is stored only as Parquet, so column-level descriptions may not be required. In particular, categoricals should be defined in codes.py.

Describe alternatives you've considered
Currently columns are assigned dtypes in pudl.extract.epacems on read-in, using a dictionary. See #3187.
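
To make the gap concrete, here is a minimal sketch of a dictionary-based dtype mapping that also enforces a fixed category set. It is not the code in pudl.extract.epacems: the column names, the `MEASUREMENT_CODES` list, and the `apply_cems_dtypes` helper are all hypothetical and only illustrate the idea. The explicit check matters because pandas silently converts values outside the declared categories to NaN when casting to a categorical dtype.

```python
import pandas as pd

# Hypothetical code list; in the proposal this would be defined in codes.py.
MEASUREMENT_CODES = ["Calculated", "Measured", "Substitute"]

# Hypothetical column -> dtype mapping, similar in spirit to a read-in dict.
CEMS_DTYPES = {
    "state": pd.StringDtype(),
    "gross_load_mw": "float32",
    "so2_mass_measurement_code": pd.CategoricalDtype(categories=MEASUREMENT_CODES),
}


def apply_cems_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Cast known columns to their declared dtypes.

    Raises ValueError if a categorical column holds a value outside its
    declared categories, instead of letting astype() turn it into NaN.
    """
    out = df.copy()
    for col, dtype in CEMS_DTYPES.items():
        if col not in out.columns:
            continue
        if isinstance(dtype, pd.CategoricalDtype):
            unexpected = set(out[col].dropna()) - set(dtype.categories)
            if unexpected:
                raise ValueError(f"{col}: unexpected values {sorted(unexpected)}")
        out[col] = out[col].astype(dtype)
    return out


raw = pd.DataFrame(
    {
        "state": ["CO", "TX"],
        "gross_load_mw": [10.5, 20.0],
        "so2_mass_measurement_code": ["Measured", None],
    }
)
typed = apply_cems_dtypes(raw)
```

With a shared code list, the same categories could be reused wherever the data is read or written, rather than being inferred from whatever values appear in a given file.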


zaneselvans · 2024-01-08T21:40:07Z

There's a complication here in that we have a standardized way of applying PUDL dtypes to tables when we write to or read from SQLite, but IIRC we aren't currently using an IO Manager for the EPA CEMS Parquet outputs, and that's where we would need to apply these dtypes.

Also, because we're writing to Parquet, we probably want to implement these dtypes through the Resource.to_pyarrow() method, which would read in whatever metadata (e.g. ENUM constraints, nullability) has been associated with the fields and table, and translate it into a valid PyArrow schema. This is also something that we need to do for #3102.

e-belfer added the epacems (Integration and analysis of the EPA CEMS dataset), metadata (Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.), and data-types (Dtype conversions, standardization and implications of data types) labels Jan 8, 2024

github-project-automation bot added this to Catalyst Megaproject Jan 8, 2024

github-project-automation bot moved this to New in Catalyst Megaproject Jan 8, 2024

zaneselvans added the parquet (Issues related to the Apache Parquet file format which we use for long tables) label Jan 8, 2024