Update epacamd_eia to properly use latest version #478

Open
wants to merge 5 commits into
base: main
51 changes: 32 additions & 19 deletions src/pudl_archiver/archivers/epacamd_eia.py
@@ -1,7 +1,5 @@
"""Download EPA CAMD data."""

from pathlib import Path

from pudl_archiver.archivers.classes import (
AbstractDatasetArchiver,
ArchiveAwaitable,
@@ -15,23 +13,38 @@ class EpaCamdEiaArchiver(AbstractDatasetArchiver):
name = "epacamd_eia"

async def get_resources(self) -> ArchiveAwaitable:
"""Download EPA CAMD to EIA crosswalk resources."""
for year in [2018, 2021]:
yield self.get_crosswalk_zip(year)

async def get_crosswalk_zip(self, year: int) -> tuple[Path, dict]:
"""Download entire repo as a zipfile from github.

For the version of the crosswalk using 2018 data, download the base EPA repo. For 2021 outputs
use our fork. If we decide to archive more years we can add infrastructure to dynamically run
the crosswalk and only archive the outputs, but for now this is the simplest way to archive
the years in use.
The EPA developed the original version of the crosswalk, but this has been dormant
for several years and only uses EIA data from 2018. We have a fork of this repo,
which we've modified slightly to run with later years of data. For now, the
simplest solution is to use the 2018 data from the EPA repo and the latest data
from our fork as static outputs. At some point it would be best to either
integrate the notebook into our ETL so we can dynamically run it with all years
of interest, or develop our own linkage.
"""
crosswalk_urls = {
2018: "https://github.com/USEPA/camd-eia-crosswalk/archive/refs/heads/master.zip",
2021: "https://github.com/catalyst-cooperative/camd-eia-crosswalk-2021/archive/refs/heads/main.zip",
}
download_path = self.download_directory / f"epacamd_eia_{year}.zip"
await self.download_zipfile(crosswalk_urls[year], download_path)

return ResourceInfo(local_path=download_path, partitions={"year": year})
yield self.get_2018()
yield self.get_latest_years()

async def get_latest_years(self) -> ResourceInfo:
"""Get latest version from our forked repo."""
resources = []
for year in [2021, 2023]:
url = f"https://github.com/catalyst-cooperative/camd-eia-crosswalk-latest/archive/refs/tags/v{year}.zip"
Member

If we're not just pulling from main and are going to refer to these past tags we should make sure we document that in the README for our fork of the crosswalk repo, so we don't forget next time someone goes in to update it.

Member

Are the changes not cumulative? Does the 2023 update only cover 2023, and the 2021 update only cover 2021? What happens to the 2022 data? Or the 2019-2020 data?

download_path = self.download_directory / f"epacamd_eia_{year}.zip"
await self.download_zipfile(url, download_path)

resources.append(
ResourceInfo(local_path=download_path, partitions={"year": year})
)
return resources

async def get_2018(self) -> ResourceInfo:
"""Get 2018 data from EPA repo."""
url = (
"https://github.com/USEPA/camd-eia-crosswalk/archive/refs/heads/master.zip"
)
download_path = self.download_directory / "epacamd_eia_2018.zip"
await self.download_zipfile(url, download_path)

return ResourceInfo(local_path=download_path, partitions={"year": "2018"})
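
In the diff above, `get_latest_years` is annotated `-> ResourceInfo` but returns a list, and `get_2018` uses the string `"2018"` as its partition value while the other years use integers. One possible way to tidy this up is to keep one awaitable per year, as the pre-change code did. The following is only a minimal sketch, assuming each awaitable yielded from `get_resources` should resolve to a single `ResourceInfo`. The names `CrosswalkSketch`, `crosswalk_url`, `get_year`, and `collect` are hypothetical, and `ResourceInfo` and `download_zipfile` are local stand-ins so the example runs without `pudl_archiver` or network access.

```python
import asyncio
from dataclasses import dataclass
from pathlib import Path


# Hypothetical stand-in for pudl_archiver's ResourceInfo, so the sketch runs alone.
@dataclass
class ResourceInfo:
    local_path: Path
    partitions: dict


# URLs taken from the diff: the EPA repo for 2018, tagged releases of the fork otherwise.
EPA_URL = "https://github.com/USEPA/camd-eia-crosswalk/archive/refs/heads/master.zip"
FORK_TAG_URL = (
    "https://github.com/catalyst-cooperative/camd-eia-crosswalk-latest"
    "/archive/refs/tags/v{year}.zip"
)


def crosswalk_url(year: int) -> str:
    """Return the archive URL for a crosswalk year (2018 comes from the EPA repo)."""
    return EPA_URL if year == 2018 else FORK_TAG_URL.format(year=year)


class CrosswalkSketch:
    """Illustrative archiver; not the real AbstractDatasetArchiver subclass."""

    def __init__(self, download_directory: Path):
        self.download_directory = download_directory
        self.requested = []

    async def download_zipfile(self, url: str, path: Path) -> None:
        # Stub: record the request instead of downloading anything.
        self.requested.append((url, path))

    async def get_year(self, year: int) -> ResourceInfo:
        """Download one year's zipfile and describe it as a single resource."""
        download_path = self.download_directory / f"epacamd_eia_{year}.zip"
        await self.download_zipfile(crosswalk_url(year), download_path)
        # Integer partition values for every year, including 2018.
        return ResourceInfo(local_path=download_path, partitions={"year": year})

    async def get_resources(self):
        # Yield one awaitable per year; each resolves to exactly one ResourceInfo.
        for year in [2018, 2021, 2023]:
            yield self.get_year(year)


async def collect(archiver: CrosswalkSketch) -> list:
    """Await every yielded coroutine in order and gather the results."""
    return [await awaitable async for awaitable in archiver.get_resources()]
```

Running `asyncio.run(collect(CrosswalkSketch(Path("downloads"))))` would give three resources, one per year. Keeping the year-to-URL mapping in a single helper also gives one obvious place to note the tag naming convention that the first review comment asks to have documented.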