Add new archiver for EPA eGRID #549

Merged
merged 20 commits into from
Jan 29, 2025

Conversation

@cmgosnell cmgosnell commented Jan 27, 2025

Overview

Closes #517.

What problem does this address?

What did you change in this PR?

  • This one is weird because the historical data is all on one page, but the recent data is spread across many pages. There is also a separate PM 2.5 page with data from multiple years.
  • Because the most recent year of data already required searching over multiple URLs, it was easy to add the PM base URL and include that data in the year-based zips as well.

current working sandbox deposition:
https://sandbox.zenodo.org/uploads/159880
https://zenodo.org/uploads/14765659

Question:

  • The way I'm grabbing the PM data is a little funky. The one-year files go into their year-based zips. The one file that spans multiple years is added to every zip file it pertains to. This is hardcoded right now. I could generalize it a bit (i.e., if they update with more years but keep the same file name pattern with YEAR-YEAR, it should be doable), but they may not, so I haven't done that yet.
  • Right now I'm not getting the two PDFs on this page from 2018 and 2019. Those PDFs describe methodology and also include subregional tables that are included within the multi-year file.
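
A rough sketch of what the YEAR-YEAR generalization could look like, for discussion only: parse a year span out of a file name and expand it into the list of years whose zips should receive the file. The function name `years_from_filename` and the example file names are hypothetical, and this assumes EPA keeps a four-digit `YYYY-YYYY` pattern in multi-year file names, which is not guaranteed.

```python
import re

# Assumed pattern: two four-digit years joined by a hyphen, e.g. "2018-2021".
YEAR_RANGE = re.compile(r"(?<!\d)((?:19|20)\d{2})-((?:19|20)\d{2})(?!\d)")
# Fallback: a single four-digit year not embedded in a longer number.
SINGLE_YEAR = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def years_from_filename(filename: str) -> list[int]:
    """Return every year a file pertains to, based only on its name."""
    range_match = YEAR_RANGE.search(filename)
    if range_match:
        start, end = (int(y) for y in range_match.groups())
        # A reversed span is treated as unparseable rather than guessed at.
        return list(range(start, end + 1)) if start <= end else []
    single_match = SINGLE_YEAR.search(filename)
    return [int(single_match.group(1))] if single_match else []
```

A file with no recognizable year returns an empty list, so cases like the current multi-year PM file (whose name carries no years) would still need the hardcoded list.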

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list


@cmgosnell cmgosnell self-assigned this Jan 27, 2025
@cmgosnell cmgosnell marked this pull request as ready for review January 27, 2025 21:01
@cmgosnell cmgosnell changed the title Draft: Add new archiver for EPA eGRID Add new archiver for EPA eGRID Jan 28, 2025
@e-belfer e-belfer self-requested a review January 28, 2025 17:03
@e-belfer e-belfer left a comment

I do think we want those two method PDFs to get added to their respective years, even if we're hard-coding the file names for now (which don't seem to have a clear year pattern). It doesn't look like the NH3 data etc. are getting included in the 2021 file when I run this currently.

Comment on lines 45 to 61
    async def _download_add_unlink(self, link: str, filename: str, zip_path: str):
        """Download the file, add it to a zip file in the archive and unlink.

        Little helper function because we are doing this same pattern several times
        for this dataset within :meth:`get_year_resource` because the data is stored
        across several pages or has bespoke patterns.
        """
        download_path = self.download_directory / filename
        await self.download_file(link, download_path)
        self.add_to_archive(
            zip_path=zip_path,
            filename=filename,
            blob=download_path.open("rb"),
        )
        # Don't want to leave multiple files on disk, so delete
        # immediately after they're safely stored in the ZIP
        download_path.unlink()

Member Author

Because the bespoke step that gets the PM methodology PDFs added a third iteration of this pattern, I pulled it up into its own little helper function.

Member

Non-blocking: Do you want to just add this in helpers.py? Seems like some version of this also wound up in #534 and maybe it just wants to be a common shared function.

Member Author

Okay, I moved it! I assume you meant I should add this into the base class in classes.py.
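
For readers unfamiliar with the pattern being discussed, here is a minimal standalone sketch of download, add to zip, then unlink. It is not the pudl_archiver implementation: it is synchronous, uses the standard library `zipfile` module directly, and takes a hypothetical `fetch` callable in place of the archiver's own async download method.

```python
import zipfile
from pathlib import Path
from typing import Callable


def download_add_unlink(
    fetch: Callable[[str, Path], None],  # hypothetical: writes `link` to a path
    link: str,
    filename: str,
    zip_path: Path,
    download_directory: Path,
) -> None:
    """Download one file, append it to a zip archive, then delete the local copy."""
    download_path = download_directory / filename
    fetch(link, download_path)
    # Mode "a" creates the archive if it is missing and appends otherwise,
    # so repeated calls accumulate files in the same zip.
    with zipfile.ZipFile(zip_path, mode="a", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(download_path, arcname=filename)
    # Only one downloaded file sits on disk at a time.
    download_path.unlink()
```

Keeping this in one shared place means each archiver that pulls files from several pages into the same zip can call it once per file instead of repeating the three steps.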

@cmgosnell cmgosnell requested a review from e-belfer January 29, 2025 15:01
cmgosnell commented Jan 29, 2025

okay @e-belfer if these look good to you i will publish then add the dois and add to the script docs and move this module into the epa dir

@e-belfer e-belfer left a comment

Otherwise all looks as expected.

@cmgosnell
Member Author

good catches! i added in the methodology pdf manually and fixed the pattern to catch the 2021 other emissions

new draft archives here:
https://zenodo.org/uploads/14767236
https://sandbox.zenodo.org/uploads/159997

@cmgosnell cmgosnell requested a review from e-belfer January 29, 2025 19:38
@e-belfer e-belfer left a comment

Some non-blocking questions, but everything looks ok to me!

pm_combo_years = [2018, 2019, 2020, 2021]
if year in pm_combo_years:
    url = "https://www.epa.gov/system/files/documents/2024-06/egrid-draft-pm-emissions.xlsx"
    filename = f"epaegrid-{year}-pm-emissions.xlsx"

Member

Non-blocking: Do we want to keep draft in this filename?

Member Author

For some reason many of the PM files have "draft" in their names. I decided not to systematically carry that over into these archive file names.

@cmgosnell cmgosnell enabled auto-merge January 29, 2025 21:01
@cmgosnell cmgosnell merged commit 265427c into main Jan 29, 2025
3 checks passed
@cmgosnell cmgosnell deleted the epaegrid branch January 29, 2025 22:12
Status: Done
Development

Successfully merging this pull request may close these issues.

Archive EPA eGrid
2 participants