Add new archiver for EPA eGRID #549

cmgosnell · 2025-01-27T17:50:07Z

Overview

Closes #517.

What problem does this address?

What did you change in this PR?

this one is weird bc the historical data is all on one page, but the recent data is spread across many pages. also there is a separate PM 2.5 page with data from multiple years
because the most recent year of data required searching over multiple urls, it was actually really easy to add the PM base url and add that data to the year-based zips as well

current working sandbox deposition:
https://sandbox.zenodo.org/uploads/159880
https://zenodo.org/uploads/14765659

Question:

the way I'm grabbing the PM data is a little funky. The one-year files all go into their year-based zips. The one file that spans multiple years I just added to every zip file it pertains to. This is hardcoded rn. I could generalize it a bit (i.e. if they update with more years but keep the same YEAR-YEAR file name pattern it should be doable, but idk, maybe they won't, so I didn't do that yet)
Right now I'm not getting the two PDFs on this page from 2018 and 2019. Those PDFs describe methodology and also include subregional tables that are included within the multi-year file.
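
A minimal sketch of the generalization floated in the first question, assuming multi-year files carry a `YEAR-YEAR` span somewhere in their name. The function name `years_for_file` and the patterns are hypothetical and are not taken from the archiver or from the EPA site:

```python
import re

# Hypothetical pattern: two four-digit years joined by a hyphen or underscore.
YEAR_RANGE = re.compile(r"(?<!\d)((?:19|20)\d{2})[-_]((?:19|20)\d{2})(?!\d)")
# Fallback for files that name a single year.
SINGLE_YEAR = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def years_for_file(filename: str) -> list[int]:
    """Return every data year whose zip a file should be added to."""
    span = YEAR_RANGE.search(filename)
    if span:
        start, end = int(span.group(1)), int(span.group(2))
        # Inclusive range; a reversed span (start > end) yields no years.
        return list(range(start, end + 1))
    single = SINGLE_YEAR.search(filename)
    return [int(single.group(1))] if single else []
```

With something like this, the hardcoded list of years would only be needed for files whose names carry no year at all, such as the two methodology PDFs.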

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

Tasks


add other TODO items here if necessary! questions that need to answered, decisions that need to be made, tests that need to be run, etc.
Update relevant documentation - like comments, docstrings, README, release notes, etc.
Review the PR yourself and call out any questions or issues you have

e-belfer

I do think we want those two method PDFs to get added to their respective years, even if we're hard-coding the file names for now (which don't seem to have a clear year pattern). It doesn't look like the NH3 data etc. are getting included in the 2021 file when I run this currently.


cmgosnell · 2025-01-29T15:00:44Z

src/pudl_archiver/archivers/epaegrid.py

+    async def _download_add_unlink(self, link: str, filename: str, zip_path: str):
+        """Download the file, add it to a zip file in the archive and unlink.
+
+        Little helper function because we are doing this same pattern several times
+        for this dataset within :meth:`get_year_resource` because the data is stored
+        across several pages or follows bespoke patterns.
+        """
+        download_path = self.download_directory / filename
+        await self.download_file(link, download_path)
+        self.add_to_archive(
+            zip_path=zip_path,
+            filename=filename,
+            blob=download_path.open("rb"),
+        )
+        # Don't want to leave multiple files on disk, so delete
+        # immediately after they're safely stored in the ZIP
+        download_path.unlink()


bc the bespoke step to get the pm methodology pdfs added a third iteration of this pattern, i pulled it up into its own little helper function.

Non-blocking: Do you want to just add this in helpers.py? Seems like some version of this also wound up in #534 and maybe it just wants to be a common shared function.

okay i moved! i assume you meant i should add this into the base class in classes.py
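
For readers outside the codebase, the add-then-delete half of this helper can be sketched with only the standard library. This is an illustrative stand-in, not the `pudl_archiver` implementation: `add_and_unlink` is a hypothetical name, it skips the async download step, and it writes directly with `zipfile` instead of going through `add_to_archive`.

```python
import zipfile
from pathlib import Path


def add_and_unlink(download_path: Path, zip_path: Path, filename: str) -> None:
    """Append a downloaded file to a ZIP archive, then delete the local copy."""
    # Mode "a" creates the archive if it is missing and appends otherwise.
    with zipfile.ZipFile(zip_path, mode="a", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(download_path, arcname=filename)
    # Only remove the file once the archive has been closed without error.
    download_path.unlink()
```

Keeping the delete after the `with` block means a failed write leaves the download on disk for inspection.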

cmgosnell · 2025-01-29T15:02:27Z

okay @e-belfer if these look good to you i will publish then add the dois and add to the script docs and move this module into the epa dir


e-belfer

2020 data is still missing a PM methodology PDF: https://www.epa.gov/system/files/documents/2022-12/eGRID2020%20DRAFT%20PM%20Memo.pdf
2021 is still missing NH3 and VOC data

Otherwise all looks as expected.

e-belfer · 2025-01-29T16:08:43Z

Non-blocking: Do you want to just add this in helpers.py? Seems like some version of this also wound up in #534 and maybe it just wants to be a common shared function.

cmgosnell · 2025-01-29T19:11:43Z

good catches! i added in the methodology pdf manually and fixed the pattern to catch the 2021 other emissions

new draft archives here:
https://zenodo.org/uploads/14767236
https://sandbox.zenodo.org/uploads/159997


e-belfer

Some non-blocking questions, but everything looks ok to me!

e-belfer · 2025-01-29T20:42:41Z

src/pudl_archiver/archivers/epaegrid.py

+        pm_combo_years = [2018, 2019, 2020, 2021]
+        if year in pm_combo_years:
+            url = "https://www.epa.gov/system/files/documents/2024-06/egrid-draft-pm-emissions.xlsx"
+            filename = f"epaegrid-{year}-pm-emissions.xlsx"


Non-blocking: Do we want to keep draft in this filename?

for some reason many of the pm files have draft in their names... I decided not to systematically keep that in these names

draft version of archiving egrid

883bb97

cmgosnell self-assigned this Jan 27, 2025

cmgosnell added 3 commits January 27, 2025 15:41

fix download link and make recent year work

098001b

add the pm emissions files

1527f13

remove is_recent bool arg

dba6515

cmgosnell marked this pull request as ready for review January 27, 2025 21:01

cmgosnell changed the title ~~Draft: Add new archiver for EPA eGRID~~ Add new archiver for EPA eGRID Jan 28, 2025

e-belfer self-requested a review January 28, 2025 17:03

e-belfer added the new-data label Jan 28, 2025

e-belfer requested changes Jan 28, 2025


cmgosnell and others added 6 commits January 29, 2025 09:12

Merge branch 'main' into epaegrid

30e26bb

Merge branch 'main' into epaegrid

3945df7

Update src/pudl_archiver/archivers/epaegrid.py

c1098af

Co-authored-by: E. Belfer <[email protected]>

Update src/pudl_archiver/archivers/epaegrid.py

35c39cd

Co-authored-by: E. Belfer <[email protected]>

Merge branch 'epaegrid' of github.com:catalyst-cooperative/pudl-archi…

d9e6151

…ver into epaegrid

add bespoke pm methodologies and add a bb helper function

80bca99

cmgosnell commented Jan 29, 2025

cmgosnell requested a review from e-belfer January 29, 2025 15:01

cmgosnell and others added 5 commits January 29, 2025 10:07

add valid year and to script docs

be0a4a6

move the underscore replace to the table name and add TODOs into the …

13cb3d8

…doi yml

[pre-commit.ci] auto fixes from pre-commit.com hooks

5b391c0

For more information, see https://pre-commit.ci

docs udpates

f686bd4

Merge branch 'epaegrid' of github.com:catalyst-cooperative/pudl-archi…

7a36e5d

…ver into epaegrid

e-belfer requested changes Jan 29, 2025

Merge branch 'main' into epaegrid

e34342e

cmgosnell added 2 commits January 29, 2025 14:12

mannually grab methodology and generalize link pattern to include man…

6daf46e

…y _s

migrate little helper method into ABC

bcd0c31

cmgosnell requested a review from e-belfer January 29, 2025 19:38

e-belfer reviewed Jan 29, 2025

cmgosnell added 2 commits January 29, 2025 15:56

add dois, add to GHA, move to new epa dir

6655161

Merge branch 'main' into epaegrid

ebae22e

cmgosnell enabled auto-merge January 29, 2025 21:01

e-belfer approved these changes Jan 29, 2025

cmgosnell merged commit 265427c into main Jan 29, 2025
3 checks passed

cmgosnell deleted the epaegrid branch January 29, 2025 22:12
