There are a few datasets that we are actively working on to enable https://github.com/evamaxfield/rs-graph.
The first dataset and model continue the work of the paper "GitHub repositories with links to academic papers" (https://doi.org/10.1016/j.jss.2021.111117). That paper introduces a dataset of 20,000 repositories that each link to an academic paper, along with manual annotations for 400 of those repository-paper linkages (i.e., "is the linked paper our own paper / related to this repository?"). We hope to use the 400 manual annotations to train a model and then work our way through the rest of the 20,000. We may need to set up an iterative annotation pass, using the model's predicted probabilities to find the examples we should annotate next.
The second dataset and model are new work required for rs-graph to really work: a "GitHub User to Author Entity Matching" model. I have put together a dataset of ~11,000 GitHub users and ~8,000 authors, which we will manually annotate as matches or not. The dataset consists of the authors of (and contributors to) each paper and its associated repository, for papers from JOSS and SoftwareX. To help us along the way, I have calculated the five (5) most similar developers for each author for comparison. A perfect annotation set would mean one in five of the candidate author-developer pairs is a true match; however, I suspect the rate will land somewhere around 1/7 or 1/8. We also won't use the full dataset for annotation, but rather a smaller sample that we can annotate and then train and test on.
Combined, these two models and datasets will lead to a much better "product" from rs-graph, but on their own they could also make a nice paper ("Two models for research software and author linkage").
cc @nniiicc