Make CEMS extraction handle new listed year_quarter partitions #3187

Merged: 30 commits, Jan 8, 2024
Changes from 15 commits
Commits
4fb33e2
WIP: add list partitions to _matches
e-belfer Dec 22, 2023
a4cbf89
Fix csv file name
e-belfer Dec 22, 2023
aaecdfa
revert fast etl settings
e-belfer Dec 22, 2023
ee9b3bf
update the 860m doi
cmgosnell Dec 22, 2023
61bc757
Fix docs build
e-belfer Dec 26, 2023
542ce85
Merge pull request #3189 from catalyst-cooperative/eia860m-extraction
e-belfer Dec 26, 2023
cf0454d
Merge branch 'dev' into cems-extraction
e-belfer Dec 26, 2023
d26b4aa
Update to non-broken CEMS archive
e-belfer Dec 26, 2023
0b08162
Try adding datastore to CI
e-belfer Dec 29, 2023
e1215fe
Update docker to point at actually right year
e-belfer Dec 29, 2023
4f50183
Actually fix in GH action
e-belfer Dec 29, 2023
65af95d
Move pudl_datastore call
e-belfer Dec 29, 2023
2cbd7dd
Fix typo
e-belfer Dec 29, 2023
5587b2e
Fix partition option
e-belfer Dec 29, 2023
36752c3
Merge branch 'dev' into cems-extraction
e-belfer Jan 2, 2024
19dfb7b
Add so many logs to ID CI failure
e-belfer Jan 3, 2024
b8782cf
Add gcs cache to gh workflow
e-belfer Jan 3, 2024
71f548f
Merge branch 'dev' into cems-extraction
e-belfer Jan 3, 2024
35211db
fix gcs flag
e-belfer Jan 3, 2024
cc0ebc9
Remove gcs cache from GHA
e-belfer Jan 4, 2024
e2c77bc
Add even more logs
e-belfer Jan 4, 2024
28d50df
Switch debug logs to info
e-belfer Jan 4, 2024
36b823d
Add dtypes on readin
e-belfer Jan 4, 2024
5c01dd4
Try to reduce memory usage when reading EPACEMS CSVs.
zaneselvans Jan 5, 2024
f3833c3
Merge branch 'main' into cems-extraction
zaneselvans Jan 5, 2024
fdffa26
Reduce record linkage test threshold to 80%
zaneselvans Jan 5, 2024
69d40d2
Merge branch 'main' into cems-extraction
zaneselvans Jan 5, 2024
b472667
Clean up logging statements
e-belfer Jan 8, 2024
efb8ac3
Merge branch 'main' into cems-extraction
e-belfer Jan 8, 2024
cf8e64c
Merge branch 'main' into cems-extraction
e-belfer Jan 8, 2024
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -172,6 +172,7 @@ jobs:
 - name: Run integration tests, trying to use GCS cache if possible
   run: |
     pip install --no-deps --editable .
+    pudl_datastore --dataset epacems --partition year_quarter=2022q1
     make pytest-integration
 - name: Upload coverage
3 changes: 1 addition & 2 deletions src/pudl/extract/epacems.py
@@ -117,7 +117,7 @@ def get_filters(self):
     def get_quarterly_file(self) -> Path:
         """Return the name of the CSV file that holds annual hourly data."""
         return Path(
-            f"epacems-{self.year}-{pd.to_datetime(self.year_quarter).quarter}.csv"
+            f"epacems-{self.year}q{pd.to_datetime(self.year_quarter).quarter}.csv"
         )
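
The hunk above changes the CSV name inside each archive from `epacems-<year>-<quarter>.csv` to `epacems-<year>q<quarter>.csv`. A minimal, dependency-free sketch of the new naming follows. The function name `quarterly_csv_name` and the regex-based parsing are assumptions made for illustration; the PR itself derives the quarter with `pd.to_datetime(self.year_quarter).quarter`.

```python
import re
from pathlib import Path


def quarterly_csv_name(year_quarter: str) -> Path:
    """Build the per-quarter CSV name from a partition such as "2022q1".

    Hypothetical helper: it mirrors the new f-string in the diff, but parses
    the partition with a regex instead of pandas.
    """
    match = re.fullmatch(r"(\d{4})q([1-4])", year_quarter.lower())
    if match is None:
        raise ValueError(f"Expected a partition like '2022q1', got {year_quarter!r}")
    year, quarter = match.groups()
    # New naming scheme: year and quarter joined by "q" rather than "-".
    return Path(f"epacems-{year}q{quarter}.csv")


print(quarterly_csv_name("2022q1"))  # epacems-2022q1.csv
```

Under this scheme the file stem matches the `year_quarter` partition value, so the partition `2022q1` maps to a file whose name contains `2022q1`.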


@@ -138,7 +138,6 @@ def get_data_frame(self, partition: EpaCemsPartition) -> pd.DataFrame:
         archive = self.datastore.get_zipfile_resource(
             "epacems", **partition.get_filters()
         )
-
         with archive.open(str(partition.get_quarterly_file()), "r") as csv_file:
             df = self._csv_to_dataframe(
                 csv_file, ignore_cols=API_IGNORE_COLS, rename_dict=API_RENAME_DICT
16 changes: 13 additions & 3 deletions src/pudl/workspace/datastore.py
@@ -104,6 +104,13 @@ def _matches(self, res: dict, **filters: Any):
                     f"Resource filter values should be all lowercase: {k}={v}"
                 )
         parts = res.get("parts", {})
+        # If partitions are list, match whole list if it contains desired element
+        if set(map(type, parts.values())) == {list}:
+            return all(
+                any(part.lower() == str(v).lower() for part in parts.get(k))
+                for k, v in filters.items()
+            )
+        # Otherwise return matches to int/str partitions
         return all(
             str(parts.get(k)).lower() == str(v).lower() for k, v in filters.items()
         )
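
To make the list-partition branch of `_matches` concrete, here is a standalone sketch of the same matching rule. It is not the PUDL method itself: the free function `matches` and the resource dictionaries are hypothetical, and it deviates from the diff in two labeled ways (it wraps each list element in `str()` and treats a missing partition key as an empty list).

```python
from typing import Any


def matches(res: dict, **filters: Any) -> bool:
    """Return True if a resource's partitions satisfy every filter."""
    parts = res.get("parts", {})
    # List-valued partitions: a filter matches if the list contains the value.
    if parts and all(isinstance(v, list) for v in parts.values()):
        return all(
            # Deviation from the diff: str(part) tolerates non-string items,
            # and a missing key is treated as an empty list (no match).
            any(str(part).lower() == str(v).lower() for part in parts.get(k, []))
            for k, v in filters.items()
        )
    # Scalar (int/str) partitions: compare the single value directly.
    return all(str(parts.get(k)).lower() == str(v).lower() for k, v in filters.items())


# Hypothetical resources: one annual archive listing four quarters,
# and one resource with a scalar partition.
annual = {"parts": {"year_quarter": ["2022q1", "2022q2", "2022q3", "2022q4"]}}
scalar = {"parts": {"year": 2022}}

print(matches(annual, year_quarter="2022q1"))  # True
print(matches(annual, year_quarter="2021q4"))  # False
print(matches(scalar, year=2022))              # True
```

The effect is that a single archive can be selected by any one of the quarters it lists, which is what lets `--partition year_quarter=2022q1` resolve to a file covering several quarters.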
@@ -134,7 +141,10 @@ def get_partitions(self, name: str = None) -> dict[str, set[str]]:
             if name and res["name"] != name:
                 continue
             for k, v in res.get("parts", {}).items():
-                partitions[k].add(v)
+                if isinstance(v, list):
+                    partitions[k] |= set(v)  # Add all items from list
+                else:
+                    partitions[k].add(v)
         return partitions

     def get_partition_filters(self, **filters: Any) -> Iterator[dict[str, str]]:
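
The `get_partitions` change flattens list-valued partitions into the set of known values instead of trying to add an unhashable list to a set. A small sketch of that behavior, assuming a plain list of resource dictionaries in place of the datapackage descriptor (the function name `collect_partitions` and the sample data are hypothetical):

```python
from collections import defaultdict


def collect_partitions(resources: list[dict]) -> dict[str, set]:
    """Gather every partition value seen across resources, keyed by partition name."""
    partitions: dict[str, set] = defaultdict(set)
    for res in resources:
        for k, v in res.get("parts", {}).items():
            if isinstance(v, list):
                partitions[k] |= set(v)  # Add each item of a list partition
            else:
                partitions[k].add(v)  # Scalar partition: add the value itself
    return dict(partitions)


# Hypothetical resources mixing list and scalar partitions.
resources = [
    {"name": "a.zip", "parts": {"year_quarter": ["2022q1", "2022q2"]}},
    {"name": "b.zip", "parts": {"year_quarter": ["2022q2", "2022q3"]}},
    {"name": "c.zip", "parts": {"year": 2021}},
]

result = collect_partitions(resources)
print(sorted(result["year_quarter"]))  # ['2022q1', '2022q2', '2022q3']
print(result["year"])                  # {2021}
```

Without the `isinstance` check, `partitions[k].add(v)` would raise `TypeError` for a list value, because lists are not hashable.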
@@ -172,12 +182,12 @@ class ZenodoDoiSettings(BaseSettings):

     censusdp1tract: ZenodoDoi = "10.5281/zenodo.4127049"
     eia860: ZenodoDoi = "10.5281/zenodo.10067566"
-    eia860m: ZenodoDoi = "10.5281/zenodo.10204686"
+    eia860m: ZenodoDoi = "10.5281/zenodo.10423813"
     eia861: ZenodoDoi = "10.5281/zenodo.10204708"
     eia923: ZenodoDoi = "10.5281/zenodo.10067550"
     eia_bulk_elec: ZenodoDoi = "10.5281/zenodo.7067367"
     epacamd_eia: ZenodoDoi = "10.5281/zenodo.7900974"
-    epacems: ZenodoDoi = "10.5281/zenodo.10306114"
+    epacems: ZenodoDoi = "10.5281/zenodo.10425497"
     ferc1: ZenodoDoi = "10.5281/zenodo.8326634"
     ferc2: ZenodoDoi = "10.5281/zenodo.8326697"
     ferc6: ZenodoDoi = "10.5281/zenodo.8326696"