
fix: parse and clean archive badges and markdown links to URL #243

Open · wants to merge 9 commits into main
Conversation

@banesullivan commented Dec 15, 2024

This will parse markdown links/badges to consistently capture a URL from the archive and JOSS DOI fields.

Note that this adds https://github.com/papis/python-doi as a dependency.

The solution I landed on is to always coerce the link to a single URL, since that's what the review templates expect here:

https://github.com/pyOpenSci/pyopensci.github.io/blob/d0b561cc493f6e4691f171b4460b0f7d4793267a/_includes/package-grid.html#L42

Since this now validates the DOI links, I needed to use a real/valid DOI for the test data, so I used PyVista's DOI for these 😄

I also noticed that some review issues had the JOSS archive key as `JOSS DOI` while others had it as `JOSS`. The changes here account for that and ensure the data are normalized to `JOSS`.
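
To make the coercion concrete, here's a minimal sketch of the idea with a hand-rolled regex (illustrative only; the real helper in `utils_clean.py` may differ):

```python
import re

# Illustrative pattern for a plain markdown link: [label](URL)
LINK = re.compile(r"\[[^\]]*\]\((?P<url>https?://[^)\s]+)\)")

def coerce_to_url(value: str) -> str:
    """Return the bare URL from a markdown link, or the stripped value as-is."""
    match = LINK.search(value)
    return match.group("url") if match else value.strip()

# Example format from the utility's docstring:
assert coerce_to_url("[my archive](https://doi.org/10.1234/zenodo.12345678)") == "https://doi.org/10.1234/zenodo.12345678"
assert coerce_to_url("https://doi.org/10.1234/zenodo.12345678") == "https://doi.org/10.1234/zenodo.12345678"
```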


codecov bot commented Dec 15, 2024

Codecov Report

Attention: Patch coverage is 82.50000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 75.25%. Comparing base (b6179f3) to head (2b9db98).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/pyosmeta/utils_clean.py | 78.57% | 3 Missing and 3 partials ⚠️ |
| src/pyosmeta/models/base.py | 90.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #243      +/-   ##
==========================================
+ Coverage   74.06%   75.25%   +1.18%     
==========================================
  Files          10       10              
  Lines         671      699      +28     
  Branches       82       89       +7     
==========================================
+ Hits          497      526      +29     
+ Misses        166      162       -4     
- Partials        8       11       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@banesullivan changed the title from "feat: parse and clean archive badges and markdown links to URL" to "fix: parse and clean archive badges and markdown links to URL" on Dec 19, 2024
@lwasser (Member) commented Dec 19, 2024

Hey @banesullivan, I'll leave some more specific feedback here, but it looks like:

  1. This is erroring on a review (bibat), which means there is an outlier archive format that isn't being handled as we'd like.
update-reviews
/Users/leahawasser/Documents/GitHub/pyos/pyosMeta/src/pyosmeta/parse_issues.py:489: UserWarning: ## Community Partnerships not found in the list
  warnings.warn(f"{section_str} not found in the list")
Error in review at url: https://api.github.com/repos/pyOpenSci/software-submission/issues/83
Traceback (most recent call last):

  File "/Users/leahawasser/Documents/GitHub/pyos/pyosMeta/src/pyosmeta/parse_issues.py", line 310, in parse_issues
    review = self.parse_issue(issue)

  File "/Users/leahawasser/Documents/GitHub/pyos/pyosMeta/src/pyosmeta/parse_issues.py", line 284, in parse_issue
    return ReviewModel(**model)

  File "/Users/leahawasser/mambaforge/envs/pyosmeta/lib/python3.10/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)

pydantic_core._pydantic_core.ValidationError: 1 validation error for ReviewModel
joss
  Value error, Invalid archive URL:  [type=value_error, input_value='', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/value_error

--------------------
http://www.sunpy.org 'http://' replacing w 'https://'
http://sourmash.readthedocs.io/en/latest/ 'http://' replacing w 'https://'
http://movingpandas.org 'http://' replacing w 'https://'
  2. When running it locally, it hangs. I suspect it's slower because adding python-doi means we are parsing and checking URLs/DOIs for each package. For now, we might consider adding more output so a user knows it's doing something; maybe print the name of the package being processed in the terminal (see the sketch below). I almost thought it was broken, and then I saw it was just processing, but slower than before.
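
A minimal sketch of what that progress output could look like (the loop and names here are hypothetical, not the actual code in `parse_issues.py`):

```python
# Hypothetical progress reporting around the review-parsing loop.
# `issues` and `parse_issue` stand in for the real objects in parse_issues.py.
def parse_all(issues, parse_issue):
    reviews = []
    for issue in issues:
        # flush=True so the name appears immediately during slow DOI validation
        print(f"Processing review: {issue['title']}", flush=True)
        try:
            reviews.append(parse_issue(issue))
        except ValueError as err:
            print(f"  Failed DOI/URL validation: {err}", flush=True)
    return reviews
```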

@lwasser (Member) commented Dec 19, 2024

I'm also noticing some inconsistency in JOSS:

For sleplet, the joss field links to the paper:
issue_link: pyOpenSci/software-submission#149
joss: https://joss.theoj.org/papers/10.21105/joss.05221

For nbcompare, it links to the DOI, which resolves to the paper!
issue_link: pyOpenSci/software-submission#146
joss: https://doi.org/10.21105/joss.06490

Both will work! My question is: should we be consistent in how we save the DOI and always link to the DOI rather than the paper, in terms of the data we store in our "database," aka the YML file?
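
A sketch of the normalization in question, using the two URLs above (the function name is illustrative):

```python
import re

# Rewrite a JOSS paper URL to its canonical doi.org form; leave other values alone.
JOSS_PAPER = re.compile(r"https://joss\.theoj\.org/papers/(?P<doi>10\.21105/joss\.\d+)")

def normalize_joss_url(url: str) -> str:
    match = JOSS_PAPER.match(url)
    return f"https://doi.org/{match.group('doi')}" if match else url

assert normalize_joss_url("https://joss.theoj.org/papers/10.21105/joss.05221") == "https://doi.org/10.21105/joss.05221"
assert normalize_joss_url("https://doi.org/10.21105/joss.06490") == "https://doi.org/10.21105/joss.06490"
```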

The archive value is inconsistent too, but I think it's OK as is, especially because sometimes it's a GitHub link, other times Zenodo, and sometimes the Zenodo link points to the "latest" record rather than the actual archived VERSION that we approved. So let's not worry about archive and focus on the JOSS DOI, as it hopefully isn't a huge amount of work. If it is, we're OK as is and can open an issue for a future iteration that cleans it up a bit more.

So, the takeaways here are:

  1. The one issue that is erroring: let's fix that (comment above).
  2. Let's fill in tests and use pytest.raises with match= for those try/except blocks.
  3. Let's make the JOSS DOI consistent!
  4. Let's make sure there is some information in the terminal when it's processing reviews, so a user running it knows it's not stalled. A simple fix would be to write out the name of the package being processed, and to note when it fails on a DOI URL (potentially?). I'll leave that up to you.

Thank you so much. This looks really great!! 🚀 If you'd like, we could merge this as is today, and you could work on the comments above in another PR. That would allow us to update the website tomorrow.

If you'd like to complete all of the work in this PR, say the word, and we can hold off until January to get things merged.

@lwasser (Member) left a comment

Oops, it looks like I left blocks of feedback but not the line-by-line comments. I'm just making sure those are visible now.

@@ -7,6 +7,8 @@
from datetime import datetime
from typing import Any

import doi

Since we've added a new dep, we should make sure that it is noted in the changelog and also document why we added it.
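
For context, a rough sketch of how the new dependency comes into play (the function names are from python-doi's README as I recall them; treat them as assumptions rather than pyosmeta's actual calls):

```python
import doi  # the new dependency: papis/python-doi

# A real JOSS DOI taken from the discussion above.
text = "[paper](https://doi.org/10.21105/joss.05221)"
found = doi.find_doi_in_text(text)  # assumed helper: returns "10.21105/joss.05221"
if found is not None:
    # assumed helper: resolves the DOI via doi.org and raises ValueError if invalid
    doi.validate_doi(found)
```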

@@ -4,6 +4,8 @@

## [Unreleased]

* Fix: Parse archive and JOSS links to handle markdown links and validate DOI links are valid (@banesullivan)

[v1.4] - 2024-11-22

Suggested change:
```diff
-[v1.4] - 2024-11-22
+## [v1.4] - 2024-11-22
```

let's fix my mistake in the changelog :)

@@ -4,6 +4,8 @@

## [Unreleased]

* Fix: Parse archive and JOSS links to handle markdown links and validate DOI links are valid (@banesullivan)

Suggested change:
```diff
-* Fix: Parse archive and JOSS links to handle markdown links and validate DOI links are valid (@banesullivan)
+* Fix: Parse archive and JOSS links to handle markdown links and validate DOI links are valid. Added python-doi as a dependency (@banesullivan)
```

I don't know how you handle an add and a fix in one line, but I think it's good to note that change here.

This utility will attempt to parse the DOI link from the various formats
that are commonly present in review metadata. This utility will handle:

* Markdown links in the format `[label](URL)`, e.g., `[my archive](https://doi.org/10.1234/zenodo.12345678)`

It looks like the issue that failed when I ran it locally uses a Zenodo badge:

pyOpenSci/software-submission#83

Honestly, I may have updated and added that (it is possible). But it would be good to parse a markdown badge URL too.

Let me know if you don't see that error, but I saw it when running locally.
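
For reference, a badge nests an image link inside an outer link, and the outer target is the URL we want; a quick illustrative sketch (placeholder DOI, not the fix itself):

```python
import re

# Illustrative pattern for a markdown badge: [![alt](image-url)](target-url).
# The target URL in the final parentheses is the one to keep.
BADGE = re.compile(r"\[!\[[^\]]*\]\([^)]*\)\]\((?P<url>https?://[^)\s]+)\)")

badge = "[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.0000000.svg)](https://doi.org/10.5281/zenodo.0000000)"
match = BADGE.search(badge)
assert match and match.group("url") == "https://doi.org/10.5281/zenodo.0000000"
```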

archive = archive.replace("http://", "https://")
# Validate that the URL resolves
if not check_url(archive):
    raise ValueError(f"Invalid archive URL: {archive}")

I think we want to use pytest.raises(ValueError, match=...) to hit these missing lines in our coverage.
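
Something like this minimal sketch, matching on the error message from the code above (the import path and function name are illustrative):

```python
import pytest

from pyosmeta.utils_clean import clean_archive  # illustrative name/path

def test_invalid_archive_url_raises():
    # An unresolvable URL should trip the check_url() guard above.
    with pytest.raises(ValueError, match="Invalid archive URL"):
        clean_archive("https://this-url-does-not-resolve.invalid")
```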

@lwasser (Member) commented Jan 7, 2025

Hey @banesullivan, I'm checking in on this PR. We are running into the issue that every PR on the website fails CI because of the DOIs! I think we have two options:

  1. We can merge this almost as is and push a new release if it fixes things; then the website will at least be green.
  2. You can work on the tests and other smaller items in a separate PR.

The above approach is great if you are busy during this first full week back!!
Alternatively, we can leave this open for a bit longer and fix things here.

Please let me know what you prefer!
