520 write an archiver for doe low income energy affordability data lead #536
Merged
e-belfer merged 27 commits into main from 520-write-an-archiver-for-doe-low-income-energy-affordability-data-lead on Jan 29, 2025
Commits (27, changes from all commits):
9504deb  [wip] feat: permit get_hyperlinks to accept a 'headers' argument that…
7fe9ce4  [wip] feat: add new archiver for DOE LEAD
c5ba88e  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
fe0d44e  the rest of the owl
89937dd  resolve conflicts
1170476  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
51667b2  [fix] Add missing import
c6e1245  Merge branch '520-write-an-archiver-for-doe-low-income-energy-afforda…
020b3cd  [docs] Add more detail to doelead docstring (krivard)
3583b4e  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
cb77f65  Merge branch 'main' into 520-write-an-archiver-for-doe-low-income-ene… (krivard)
de0ba16  [fix] switch to hard-coded DOIs for known releases, check LEAD Tool p…
96b064d  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
e46f6d9  [fix] missing refactor in fstring
7fd864b  Merge branch '520-write-an-archiver-for-doe-low-income-energy-afforda…
15cee78  Merge branch 'main' into 520-write-an-archiver-for-doe-low-income-ene… (krivard)
2a3ef26  Merge branch 'main' into 520-write-an-archiver-for-doe-low-income-ene… (e-belfer)
ac73e65  Drop site that no longer exists, fix class (e-belfer)
9a2a149  Download methodology PDFs (e-belfer)
99ca4f4  Add PDF metadata to archive, add placeholder DOI (e-belfer)
283db91  Restore entire archiving workflow (e-belfer)
1f73afb  Update production DOI (e-belfer)
9a98f7d  Merge branch 'main' into 520-write-an-archiver-for-doe-low-income-ene… (e-belfer)
aded9cf  Merge branch 'main' into 520-write-an-archiver-for-doe-low-income-ene… (e-belfer)
f87e9f7  Add to GHA (e-belfer)
285479d  Oops also MECS (e-belfer)
b9cbbe5  Fix bad merge resolution (e-belfer)
Diff (new file):
@@ -0,0 +1,138 @@
"""Download DOE LEAD data. | ||
|
||
Each partition includes: | ||
- Data Dictionary | ||
- Census Tracts List | ||
- Cities List | ||
- Counties List | ||
- States List | ||
- Tribal Areas List | ||
- Cities Census Track Overlaps | ||
- Tribal Areas Tract Overlaps | ||
- One .zip file per state, each of which includes: | ||
- AMI Census Tracts | ||
- SMI Census Tracts | ||
- LLSI Census Tracts | ||
- FPL Census Tracts | ||
- LLSI Counties | ||
- SMI Counties | ||
- FPL Counties | ||
- AMI Counties | ||
""" | ||
|
||
import re | ||
|
||
from pudl_archiver.archivers.classes import ( | ||
AbstractDatasetArchiver, | ||
ArchiveAwaitable, | ||
ResourceInfo, | ||
) | ||
from pudl_archiver.frictionless import ZipLayout | ||
|
||
# This site is no longer online as of 01/28/2025. | ||
# TOOL_URL = "https://www.energy.gov/scep/low-income-energy-affordability-data-lead-tool" | ||
|
||
YEARS_DOIS = { | ||
2022: "https://doi.org/10.25984/2504170", | ||
2018: "https://doi.org/10.25984/1784729", | ||
} | ||
|
||
# verified working 2025-01-22 via | ||
# $ wget "https://www.energy.gov/scep/low-income-energy-affordability-data-lead-tool" -O foo.html -U "Mozilla/5.0 Catalyst/2025 Cooperative/2025" | ||
HEADERS = {"User-Agent": "Mozilla/5.0 Catalyst/2025 Cooperative/2025"} | ||
|
||
|
||
class DoeLeadArchiver(AbstractDatasetArchiver): | ||
"""DOE LEAD archiver.""" | ||
|
||
name = "doelead" | ||
|
||
async def get_resources(self) -> ArchiveAwaitable: | ||
"""Download DOE LEAD resources. | ||
|
||
The DOE LEAD Tool is down as of 01/28/2025. It didn't provide direct access | ||
to the raw data, but instead linked to the current raw data release hosted on | ||
OEDI. It did not provide links to past data releases. So, we hard-code the | ||
DOIs for all known releases and archive those. Based on the removal of the main | ||
page, it's safe to assume this won't be updated any time soon. If it is, we'll | ||
need to manually update the DOIs. | ||
""" | ||
# e.g.: https://data.openei.org/files/6219/DC-2022-LEAD-data.zip | ||
# https://data.openei.org/files/6219/Data%20Dictionary%202022.xlsx | ||
# https://data.openei.org/files/6219/LEAD%20Tool%20States%20List%202022.xlsx | ||
data_link_pattern = re.compile(r"([^/]+(\d{4})(?:-LEAD-data.zip|.xlsx))") | ||
"""Regex for matching the data files in a release on the OEDI page. Captures the year, and supports both .zip and .xlsx file names.""" | ||
|
||
for year, doi in YEARS_DOIS.items(): | ||
self.logger.info(f"Processing DOE LEAD raw data release for {year}: {doi}") | ||
filenames_links = {} | ||
for data_link in await self.get_hyperlinks(doi, data_link_pattern): | ||
matches = data_link_pattern.search(data_link) | ||
if not matches: | ||
continue | ||
link_year = int(matches.group(2)) | ||
if link_year != year: | ||
raise AssertionError( | ||
f"We expect all files at {doi} to be for {year}, but we found: {link_year} from {data_link}" | ||
) | ||
filenames_links[matches.group(1)] = data_link | ||
if filenames_links: | ||
self.logger.info(f"Downloading: {year}, {len(filenames_links)} items") | ||
yield self.get_year_resource(filenames_links, year) | ||
|
||
# Download LEAD methodology PDF and other metadata separately | ||
metadata_links = { | ||
"lead-methodology-122024.pdf": "https://www.energy.gov/sites/default/files/2024-12/lead-methodology_122024.pdf", | ||
"lead-tool-factsheet-072624.pdf": "https://www.energy.gov/sites/default/files/2024-07/lead-tool-factsheet_072624.pdf", | ||
} | ||
for filename, link in metadata_links.items(): | ||
yield self.get_metadata_resource(filename=filename, link=link) | ||
|
||
async def get_year_resource(self, links: dict[str, str], year: int) -> ResourceInfo: | ||
"""Download all available data for a year. | ||
|
||
Resulting resource contains one zip file of CSVs per state/territory, plus a handful of .xlsx dictionary and geocoding files. | ||
|
||
Args: | ||
links: filename->URL mapping for files to download | ||
year: the year we're downloading data for | ||
""" | ||
host = "https://data.openei.org" | ||
zip_path = self.download_directory / f"doelead-{year}.zip" | ||
data_paths_in_archive = set() | ||
for filename, link in sorted(links.items()): | ||
self.logger.info(f"Downloading {link}") | ||
download_path = self.download_directory / filename | ||
await self.download_file(f"{host}{link}", download_path) | ||
self.add_to_archive( | ||
zip_path=zip_path, | ||
filename=filename, | ||
blob=download_path.open("rb"), | ||
) | ||
data_paths_in_archive.add(filename) | ||
# Don't want to leave multiple giant files on disk, so delete | ||
# immediately after they're safely stored in the ZIP | ||
download_path.unlink() | ||
return ResourceInfo( | ||
local_path=zip_path, | ||
partitions={"year": year}, | ||
layout=ZipLayout(file_paths=data_paths_in_archive), | ||
) | ||
|
||
    async def get_metadata_resource(self, filename: str, link: str) -> ResourceInfo:
        """Download metadata resource.

        Resulting resource contains one PDF file with metadata about the LEAD dataset.

        Args:
            filename: name to save the downloaded file under
            link: URL of the file to download
        """
        self.logger.info(f"Downloading {link}")
        download_path = self.download_directory / filename
        await self.download_file(url=link, file_path=download_path, headers=HEADERS)

        return ResourceInfo(
            local_path=download_path,
            partitions={},
        )
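A quick way to sanity-check data_link_pattern outside the archiver is to run it against the example OEDI URLs quoted in the "# e.g.:" comment above. The sketch below is illustrative only (plain standard-library re, not part of this diff); group(1) is the filename key and group(2) is the year the archiver compares against the release year:

import re

# Same regex as data_link_pattern in DoeLeadArchiver.get_resources; the URLs are
# the examples quoted in the "# e.g.:" comment. Illustration only, not part of the PR.
data_link_pattern = re.compile(r"([^/]+(\d{4})(?:-LEAD-data.zip|.xlsx))")

for url in [
    "https://data.openei.org/files/6219/DC-2022-LEAD-data.zip",
    "https://data.openei.org/files/6219/Data%20Dictionary%202022.xlsx",
]:
    match = data_link_pattern.search(url)
    if match:
        # group(1) becomes the filenames_links key; group(2) is the year checked
        # against the release year in get_resources.
        print(match.group(1), match.group(2))

# Expected output:
# DC-2022-LEAD-data.zip 2022
# Data%20Dictionary%202022.xlsx 2022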
Review comment: Sneaking in eiamecs, which didn't land in #516.