520 write an archiver for doe low income energy affordability data lead #536
Conversation
… gets passed thru to session.get
Todo:
- Crawl resulting OIE link to fetch state-year links
- Download state-year links and zip into year partitions
- Return ResourceInfo
For more information, see https://pre-commit.ci
…bility-data-lead' of github.com:catalyst-cooperative/pudl-archiver into 520-write-an-archiver-for-doe-low-income-energy-affordability-data-lead
…rgy-affordability-data-lead
I didn't run this entirely, but it did succeed in starting to download the data, so no problems there so far! I do have a request to reconfigure the way you're downloading the data, though (see below), and some other very minor requests re: docstrings to consider after that. The changes to the get_hyperlinks() method look completely good and reasonable to me!
The base URL is the DOE page, which does not list past data releases even if they are still available on OIE.
- Pros: We'll automatically get the next OIE data release when the DOE page points to a new one.
- Cons: No seamless way to archive past releases. There's only one past release atm, but still.
See my comment in the code. I'd suggest manually archiving the data at each DOI, then comparing the link you find on the main page to the known last data link and making sure that nothing has changed. That way we get the best of both worlds - we grab all the data from stable locations, and get to check it against the latest release (if there is one). The method you already have for finding the submission link should work just fine for the latter part!
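The suggested "best of both worlds" check could be sketched roughly as follows. This is a hypothetical illustration, not the actual pudl-archiver API; the function name, the dict, and the URLs are all placeholders:

```python
# Hypothetical sketch: archive from a manually curated list of stable
# DOI-backed release links, then verify that the link scraped from the
# main DOE page is already among them. URLs below are placeholders.

KNOWN_RELEASE_LINKS = {
    "2018": "https://example.org/oedi/lead-2018",  # placeholder, not a real DOI link
    "2022": "https://example.org/oedi/lead-2022",  # placeholder, not a real DOI link
}


def is_known_release(scraped_link: str) -> bool:
    """Return True if the main page still points at a known release.

    A False return means a new release has probably been published and
    KNOWN_RELEASE_LINKS needs a manual update before archiving.
    """
    return scraped_link in KNOWN_RELEASE_LINKS.values()
```

With this shape, the archiver always downloads from the stable curated links, and the scraped link only serves as a freshness check.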
The per-state zip files are not unpacked before adding them to the partition zip.
- Pros: The partition zip contains 60 files even without unpacking; with unpacking there would be 8 + (52x8) = 424 files, which seems unwieldy.
- Cons: Compression will be less efficient; needs an extra unzip step at point of use.
I think you made the right call here! If we need to reformat the data down the line we certainly can cross that bridge then.
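The "don't unpack" choice amounts to writing each downloaded per-state zip into the year-partition zip as an opaque member. A minimal sketch, with illustrative names (the real archiver's interfaces may differ):

```python
# Minimal sketch of packing already-downloaded per-state zips into one
# year-partition zip without unpacking them first. Names are illustrative.
import io
import zipfile


def build_year_partition(state_zips: dict[str, bytes]) -> bytes:
    """Pack per-state zip payloads, as-is, into a single partition zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in state_zips.items():
            # payload is itself a zip; store it without unpacking, so the
            # partition holds ~60 members instead of ~424 loose files.
            archive.writestr(name, payload)
    return buffer.getvalue()
```

The trade-off noted above falls out directly: already-compressed payloads won't deflate much further, and a consumer has to open two layers of zip.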
The logging is chatty. We can move most/all of it to DEBUG depending on group habit, but with a running time of 40m-2h, I needed some reassurance.
I err towards this as well, it's no problem for me.
…age for new releases, add better docstrings
…bility-data-lead' of github.com:catalyst-cooperative/pudl-archiver into 520-write-an-archiver-for-doe-low-income-energy-affordability-data-lead
…rgy-affordability-data-lead
This ran as expected for me; I just ran out of time to make a sandbox archive to inspect, given my slow internet connection. All is looking good so far!
…rgy-affordability-data-lead
@@ -253,8 +260,8 @@ async def get_hyperlinks(
         # Warn if no links are found
         if not hyperlinks:
             self.logger.warning(
-                f"The archiver couldn't find any hyperlinks that match {filter_pattern}."
-                f"Make sure your filter_pattern is correct or if the structure of the {url} page changed."
+                f"The archiver couldn't find any hyperlinks{('that match' + filter_pattern) if filter_pattern else ''}."
This change may be causing some unexpected problems? https://github.com/catalyst-cooperative/pudl-archiver/actions/runs/13018457111
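One likely culprit is the conditional f-string in the added line: it drops the spaces around "that match" and concatenates filter_pattern directly, which fails if filter_pattern is a compiled regex rather than a string. A possible fix, sketched as a standalone helper (hypothetical name; the real code builds this message inline in get_hyperlinks):

```python
# Hypothetical sketch of a fixed warning message: keep the spaces the
# conditional expression was dropping, format filter_pattern instead of
# concatenating it (it may be a compiled regex), and keep the url hint.
def format_no_hyperlinks_warning(url: str, filter_pattern=None) -> str:
    match_clause = f" that match {filter_pattern}" if filter_pattern else ""
    return (
        f"The archiver couldn't find any hyperlinks{match_clause} on {url}. "
        "Make sure your filter_pattern is correct, and check whether the "
        "structure of the page has changed."
    )
```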
)
from pudl_archiver.frictionless import ZipLayout

TOOL_URL = "https://www.energy.gov/scep/low-income-energy-affordability-data-lead-tool"
This site has gone down, so we should just refer to the DOIs.
@krivard Note that the main tool page is no longer online, so we'll just want to archive the DOIs (and quickly, if possible!)
…rgy-affordability-data-lead
Looks good to me! Sad that the actual DOE page went down 😢...
…rgy-affordability-data-lead
@@ -6,7 +6,7 @@ on:
   inputs:
     datasets:
       description: 'Comma-separated list of datasets to archive (e.g., "ferc2","ferc6").'
-      default: '"doeiraec","eia176","eia191","eia757a","eia860","eia860m","eia861","eia923","eia930","eiaaeo","eiawater","eia_bulk_elec","epacamd_eia","epacems","epapcap","ferc1","ferc2","ferc6","ferc60","ferc714","gridpathratoolkit","mshamines","nrelatb","phmsagas","usgsuspvdb","vcerare"'
+      default: '"doeiraec","doelead","eia176","eia191","eia757a","eia860","eia860m","eia861","eia923","eia930","eiaaeo","eiamecs","eiawater","eia_bulk_elec","epacamd_eia","epacems","epapcap","ferc1","ferc2","ferc6","ferc60","ferc714","gridpathratoolkit","mshamines","nrelatb","phmsagas","usgsuspvdb","vcerare"'
Sneaking in eiamecs, which didn't land in #516.
Overview
Closes #520.
What problem does this address?
Lack of an archiver for DOE LEAD Tool data
What did you change in this PR?
Update get_hyperlinks to accept headers to pass to session.get, and improve error messages.

See published production archive here: https://zenodo.org/records/14758685
Arguable design decisions
The base URL is the DOE page, which does not list past data releases even if they are still available on OIE.
- Pros: We'll automatically get the next OIE data release when the DOE page points to a new one.
- Cons: No seamless way to archive past releases. There's only one past release atm, but still.

Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list
Tasks