Add new authors.csv, adjust published columns and tests #99

jacobthill · 2024-09-26T16:51:20Z

@peetucket made some changes to the RIALTO orgs app that add primary_school and primary_department to the authors.csv file and change how orgs are assigned to researchers. This PR adds the new authors.csv, publishes those columns, adds the openalex_open_access column, and removes some unneeded columns.

jacobthill · 2024-09-26T16:54:26Z

I noticed test_openalex.py::test_dois_from_orcid_paging and test_openalex.py::test_publications_from_dois are failing as well, but I don't think that's related to any changes in this PR.

dois_from_orcid() now returns a list of unique DOIs instead of an iterator of possibly unique DOIs. This allows a failing test to pass. The test for openalex_publications_from_dois() was relaxed a bit to look for 231 or more publications, since a lookup for an individual DOI can sometimes pull back multiple works. The number of columns for OpenAlex is now 53 because we added `institution_assertions`.
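To illustrate the change described above, here is a minimal sketch of collapsing an iterator of possibly repeated DOIs into a list of unique ones. The names `unique_dois` and `fake_lookup` and the sample DOIs are hypothetical stand-ins, not the code in this PR, and the real function may deduplicate differently.

```python
from typing import Iterable, Iterator, List


def unique_dois(dois: Iterable[str]) -> List[str]:
    """Return the DOIs in first-seen order with duplicates removed."""
    # dict keys keep insertion order (Python 3.7+), so this dedupes
    # without reordering the results.
    return list(dict.fromkeys(dois))


def fake_lookup() -> Iterator[str]:
    """Hypothetical stand-in for a paged API lookup that can repeat a DOI."""
    yield "10.1000/example-a"
    yield "10.1000/example-b"
    yield "10.1000/example-a"  # repeated on a later page


dois = unique_dois(fake_lookup())
```

Returning a concrete list also lets a test check the length directly, and a lower-bound assertion (for example `len(pubs) >= 231`) tolerates a single DOI resolving to more than one work.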

edsu · 2024-09-26T19:33:24Z

rialto_airflow/harvest/dimensions.py

@@ -16,10 +16,10 @@
 def dois_from_orcid(orcid):
    logging.info(f"looking up dois for orcid {orcid}")
    q = """
-        search publications where researchers.orcid_id = "{}"
+        search publications where researchers.orcid_id = "{}" and year in [2018:{}]


Why are we limiting to publications between 2018 and 2024? This seems like something we will forget about, and then we will quietly stop getting new data.

The thought was that, since Rochelle only cared about pubs after 2017, we could speed up the run time by filtering in the query. I don't think it actually sped things up, though, so I pulled it out.

edsu · 2024-09-26T19:47:41Z

rialto_airflow/harvest/merge_pubs.py

@@ -86,7 +82,7 @@ def openalex_pubs_df(openalex_pubs):
            "publication_year",
            "title",
            "type",
-            "best_oa_location",


We found it useful to return this in the Easy Deposit workcycle, because we wanted to get the URLs for OpenAccess PDFs. But I guess it can be removed. Did you find it was not needed or redundant?

My thinking was that this DAG would drive the open access dashboard, and it's not needed for that dashboard. I think we will want another DAG for a publishers data dashboard, and others for other purposes. The other option would be to get all of the fields for all dashboards in one DAG and then ignore the fields we don't need, but I recall us having concerns about the size of the file with that approach.

I'm open to either approach. Let me know if you think we should discuss.

It seems like we should minimize the time we have to go out and get metadata from OpenAlex and Dimensions. As long as we have the data in the interim files, there will be an opportunity to use it in other datasets, like you said.
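A rough sketch of that idea, assuming the interim records keep every harvested field and each dashboard only selects the columns it needs at publish time. The `DASHBOARD_COLUMNS` mapping, the `project` and `to_csv` helpers, the `doi` column, and the sample row are all hypothetical; only `title`, `openalex_open_access`, and `best_oa_location` are names from this PR.

```python
import csv
import io
from typing import Dict, List

# Hypothetical per-dashboard column sets; the real lists would live in the DAG.
DASHBOARD_COLUMNS: Dict[str, List[str]] = {
    "open_access": ["doi", "title", "openalex_open_access"],
    "publishers": ["doi", "title", "best_oa_location"],
}


def project(rows: List[dict], columns: List[str]) -> List[dict]:
    """Keep only the requested columns; missing values become None."""
    return [{col: row.get(col) for col in columns} for row in rows]


def to_csv(rows: List[dict], columns: List[str]) -> str:
    """Serialize the projected rows as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# One harvested record that carries fields for more than one dashboard.
interim = [
    {
        "doi": "10.1000/example-a",
        "title": "Example A",
        "openalex_open_access": "gold",
        "best_oa_location": "https://example.org/a.pdf",
    }
]

oa_columns = DASHBOARD_COLUMNS["open_access"]
oa_rows = project(interim, oa_columns)
oa_csv = to_csv(oa_rows, oa_columns)
```

With this shape the harvest runs once, and the size concern applies only to the interim file, not to what each dashboard publishes.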

Add new authors.csv, adjust published columns and tests

d992d2c

jacobthill and others added 4 commits September 26, 2024 12:56

changing quotes

f388581

ruff format

5898235

Revert filtering Dimensions by year

ccf6816

edsu approved these changes Sep 26, 2024

Reformatted

fd68077

edsu merged commit a0cfdb1 into main Sep 26, 2024
1 check passed

edsu deleted the data-fields branch September 26, 2024 19:59
