
520: Write an archiver for DOE Low Income Energy Affordability Data (LEAD) #536

Conversation

@krivard (Contributor) commented Jan 22, 2025

Overview

Closes #520.

What problem does this address?

There is currently no archiver for DOE LEAD Tool data.

What did you change in this PR?

  • Add an optional headers argument to get_hyperlinks, passed through to session.get, and improve its error messages
  • Add a new archiver for DOE LEAD

See published production archive here: https://zenodo.org/records/14758685

Arguable design decisions

  • The base URL is the DOE page, which does not list past data releases even if they are still available on OIE.
    • Pros: We'll automatically get the next OIE data release when the DOE page points to a new one
    • Cons: No seamless way to archive past releases. There's only one past release at the moment, but still
  • Switched to using hard-coded DOIs and checking the DOE page for new releases (and if we find one, calling for help; this is expected to happen far less than annually)
  • The per-state zip files are not unpacked before adding them to the partition zip
    • Pros: The partition zip contains 60 files even without unpacking; with unpacking there would be 8 + (52 × 8) = 424 files, which seems unwieldy
    • Cons: Compression will be less efficient, and an extra unzip step is needed at the point of use
  • The logging is chatty. We can move most or all of it to DEBUG depending on group habit, but with a running time of 40m-2h, I needed some reassurance
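The "don't unpack the per-state zips" decision amounts to nesting the downloaded zips inside each partition zip. A minimal sketch, with hypothetical names, using a stored (uncompressed) outer archive since the members are already compressed:

```python
# Sketch of the nested-zip partition layout: each year partition is a zip
# that directly contains the downloaded per-state zip files. The function
# name and the in-memory dict interface are illustrative, not the archiver's.
import io
import zipfile


def build_partition_zip(state_zips: dict[str, bytes]) -> bytes:
    """Bundle already-zipped per-state downloads into one partition archive."""
    buffer = io.BytesIO()
    # ZIP_STORED: the members are already deflated, so recompressing buys little.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as partition:
        for name, payload in state_zips.items():
            partition.writestr(name, payload)
    return buffer.getvalue()
```

This keeps the partition's file count at one entry per state file, at the cost of a second unzip at the point of use.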

Testing

How did you make sure this worked? How can a reviewer verify this?

mkdir -p /tmp/doelead
pudl_archiver --datasets doelead \
	--initialize \
	--summary-file doelead-summary.json \
	--depositor fsspec \
	--deposition-path /tmp/doelead/


Kathryn Mazaitis added 2 commits January 22, 2025 16:31
Todo:
- Crawl resulting oie link to fetch state-year links
- Download state-year links and zip into year partitions
- Return ResourceInfo
@krivard krivard linked an issue Jan 22, 2025 that may be closed by this pull request
pre-commit-ci bot and others added 6 commits January 22, 2025 21:35
…bility-data-lead' of github.com:catalyst-cooperative/pudl-archiver into 520-write-an-archiver-for-doe-low-income-energy-affordability-data-lead
@krivard krivard marked this pull request as ready for review January 23, 2025 22:56
@krivard krivard requested review from zschira and e-belfer January 23, 2025 22:56
@e-belfer (Member) left a comment

I didn't run this entirely but it did succeed in starting to download the data, so no problems there so far! I do have a request to reconfigure the way that you're downloading the data though (see below), and some other very minor requests re: docstrings to consider after that. The changes to the get_hyperlinks() method look completely good and reasonable to me!

The base URL is the DOE page, which does not list past data releases even if they are still available on OIE.
    Pros: We'll automatically get the next OIE data release when the DOE page points to a new one
    Cons: No seamless way to archive past releases. There's only one past release atm, but still

See my comment in the code. I'd suggest manually archiving the data at each DOI, then comparing the link you find on the main page to the known last data link and making sure that nothing has changed. That way we get the best of both worlds - we grab all the data from stable locations, and get to check it against the latest release (if there is one). The method you already have for finding the submission link should work just fine for the latter part!
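The suggested check could look roughly like this; the constant, the placeholder URLs, and the function name are all hypothetical illustrations of the idea, not the code that landed:

```python
# Hypothetical sketch of the reviewer's suggestion: archive the known, stable
# release locations directly, then verify that the link scraped from the main
# page still matches the newest one we know about. URLs are placeholders.
KNOWN_RELEASE_LINKS = [
    "https://data.openei.org/submissions/example-2018",
    "https://data.openei.org/submissions/example-2022",
]


def latest_release_is_known(scraped_link: str) -> bool:
    """True if the DOE page still points at the newest known release."""
    return scraped_link == KNOWN_RELEASE_LINKS[-1]
```

If the check returns False, the archiver can fail loudly so a human adds the new release's DOI before the next run.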

The per-state zip files are not unpacked before adding them to the partition zip
    Pros: The partition zip contains 60 files even without unpacking; with unpacking there would be 8 + (52x8) = 424 files, which seems unwieldy
    Cons: Compression will be less efficient; needs an extra unzip step at point of use

I think you made the right call here! If we need to reformat the data down the line we certainly can cross that bridge then.

The logging is chatty. We can move most/all to DEBUG depending on group habit, but with a running time of 40m-2h, I needed some reassurance

I err towards this as well, it's no problem for me.

(6 outdated review threads on src/pudl_archiver/archivers/doelead.py, all resolved)
Kathryn Mazaitis and others added 5 commits January 24, 2025 16:58
…age for new releases, add better docstrings
…bility-data-lead' of github.com:catalyst-cooperative/pudl-archiver into 520-write-an-archiver-for-doe-low-income-energy-affordability-data-lead
@e-belfer (Member) left a comment

This ran as expected for me, I just ran out of time to make a sandbox archive to inspect it with my slow internet connection. All is looking good so far!

@@ -253,8 +260,8 @@ async def get_hyperlinks(
         # Warn if no links are found
         if not hyperlinks:
             self.logger.warning(
-                f"The archiver couldn't find any hyperlinks that match {filter_pattern}."
-                f"Make sure your filter_pattern is correct or if the structure of the {url} page changed."
+                f"The archiver couldn't find any hyperlinks{('that match' + filter_pattern) if filter_pattern else ''}."

This change may be causing some unexpected problems? https://github.com/catalyst-cooperative/pudl-archiver/actions/runs/13018457111
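For reference, one way the quoted f-string could be repaired (restoring the missing spaces and letting the f-string stringify a compiled pattern instead of concatenating it with +) is sketched below. This is a possible fix, not necessarily the one that landed; the helper function is invented for illustration.

```python
# Possible repair for the quoted warning message. The original new line did
# ('that match' + filter_pattern), which drops spaces and raises TypeError
# when filter_pattern is a compiled regex; an f-string sidesteps both.
# warning_message is a hypothetical helper, not a function in the archiver.
import re


def warning_message(filter_pattern, url: str) -> str:
    match_clause = f" that match {filter_pattern}" if filter_pattern else ""
    return (
        f"The archiver couldn't find any hyperlinks{match_clause}. "
        f"Make sure your filter_pattern is correct or check if the structure "
        f"of the {url} page changed."
    )
```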

)
from pudl_archiver.frictionless import ZipLayout

TOOL_URL = "https://www.energy.gov/scep/low-income-energy-affordability-data-lead-tool"

This site has gone down, so we should just refer to the DOIs

@e-belfer (Member) left a comment

@krivard Note that the main tool page is no longer online, so we'll just want to archive the DOIs (and quickly, if possible!)

@cmgosnell (Member) left a comment

Looks good to me! Sad that the actual DOE page went down 😢 ...

@e-belfer e-belfer enabled auto-merge January 29, 2025 16:57
@e-belfer e-belfer disabled auto-merge January 29, 2025 16:58
@@ -6,7 +6,7 @@ on:
   inputs:
     datasets:
       description: 'Comma-separated list of datasets to archive (e.g., "ferc2","ferc6").'
-      default: '"doeiraec","eia176","eia191","eia757a","eia860","eia860m","eia861","eia923","eia930","eiaaeo","eiawater","eia_bulk_elec","epacamd_eia","epacems","epapcap","ferc1","ferc2","ferc6","ferc60","ferc714","gridpathratoolkit","mshamines","nrelatb","phmsagas","usgsuspvdb","vcerare"'
+      default: '"doeiraec","doelead","eia176","eia191","eia757a","eia860","eia860m","eia861","eia923","eia930","eiaaeo","eiamecs","eiawater","eia_bulk_elec","epacamd_eia","epacems","epapcap","ferc1","ferc2","ferc6","ferc60","ferc714","gridpathratoolkit","mshamines","nrelatb","phmsagas","usgsuspvdb","vcerare"'

Sneaking in eiamecs which didn't land in #516.

@e-belfer e-belfer enabled auto-merge January 29, 2025 17:03
@e-belfer e-belfer dismissed their stale review January 29, 2025 17:43

Changes addressed!

@e-belfer e-belfer merged commit b2ac313 into main Jan 29, 2025
3 checks passed
@e-belfer e-belfer deleted the 520-write-an-archiver-for-doe-low-income-energy-affordability-data-lead branch January 29, 2025 17:43
Successfully merging this pull request may close these issues.

Write an archiver for DOE Low Income Energy Affordability Data (LEAD)