paper: re-added the parts that were removed for brevity
proycon committed Jan 6, 2025
1 parent 589f47a commit b32fdea
Showing 1 changed file with 40 additions and 45 deletions.
85 changes: 40 additions & 45 deletions papers/tooldiscovery.tex
@@ -5,8 +5,8 @@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% This paper is a work in progress and still pending submission!
% It has not yet been peer-reviewed!
% An extended-abstract version of this paper was submitted to, accepted for and presented at the CLARIN conference 2024.
% The post-proceedings paper is a work in progress and still pending submission and further peer review.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5

@@ -115,12 +115,12 @@ \section{Introduction} \label{intro}
% {http://creativecommons.org/licenses/by/4.0/}
}

% Software is indispensible in a lot of modern-day research, including in sectors
% such as the Humanities and Social Sciences that may have traditionally been
% less focused on information technology. It is also appreciated more and more as
% valid research output, alongside more conventional output such as academic
% publications, presentations, and datasets. Scholars often have a need for
% research software to do their research efficiently.
Software is indispensable in much of modern-day research, including in sectors
such as the Humanities and Social Sciences that may have traditionally been
less focused on information technology. It is also increasingly recognised as
valid research output, alongside more conventional output such as academic
publications, presentations, and datasets. Scholars often need
research software to do their research efficiently.

For scholars it is important to be able to find and identify tools suitable for
their research; we call this process \emph{tool discovery}. We define
@@ -134,20 +134,16 @@ \section{Introduction} \label{intro}
researchers must have access to catalogues that relay \emph{accurate} software
metadata.

There is no shortage in existing initiatives in building such catalogues\footnote{CLARIN itself has one in the form of the Virtual Language Observatory (VLO): \url{https://vlo.clarin.eu} \citep{VLO}}.
, %<cut out part below for brevity>
but the system we describe in this paper is not an attempt to build another catalogue.
There is no shortage of existing initiatives to build such catalogues;
many research groups, projects or institutes have some kind of website featuring
their tools. Aggregation of software metadata from multiple such partners is
also not new; CLARIN itself already does this in the CLARIN Virtual Language
Observatory\footnote{\url{https://vlo.clarin.eu}} \citep{VLO}.
However, the system we describe in this paper is not an attempt to build another catalogue.
We developed a generic pipeline that harvests software metadata from the source, leveraging
various existing metadata formats and converting them to a uniform linked open data representation.
This data can then be used to feed catalogues.
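
To make the idea of a uniform linked open data representation concrete, the following is a minimal sketch, not the pipeline described in this paper. It assumes a Node.js-style package manifest as input and maps a handful of its fields onto schema.org terms in JSON-LD. The function name, the choice of fields and the example manifest are all hypothetical and chosen for illustration only.

```python
import json


def package_json_to_jsonld(pkg: dict) -> dict:
    """Map a small, illustrative subset of package.json fields to schema.org JSON-LD.

    Assumption: this mapping is a simplification for illustration; a real
    harvester would cover many more fields and vocabularies.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "name": pkg["name"],
    }
    # Copy simple optional fields only when the manifest provides them.
    simple_fields = {
        "description": "description",
        "version": "version",
        "license": "license",  # kept as the raw (typically SPDX) string here
        "homepage": "url",
    }
    for source_key, target_key in simple_fields.items():
        if source_key in pkg:
            doc[target_key] = pkg[source_key]
    # "repository" may be a plain string or an object with a "url" member.
    repo = pkg.get("repository")
    if isinstance(repo, dict):
        repo = repo.get("url")
    if repo:
        doc["codeRepository"] = repo
    if pkg.get("keywords"):
        doc["keywords"] = list(pkg["keywords"])
    return doc


# Hypothetical manifest, not taken from any real project.
example = {
    "name": "example-tool",
    "version": "1.2.3",
    "description": "An example tool",
    "license": "GPL-3.0-only",
    "repository": {"type": "git", "url": "https://example.org/example-tool.git"},
    "keywords": ["nlp", "corpus"],
}

if __name__ == "__main__":
    print(json.dumps(package_json_to_jsonld(example), indent=2))
```

Because the output is plain JSON-LD, any catalogue that understands schema.org could in principle consume it without knowing which manifest format it came from.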

%; many
%research groups, projects or institutes have some kind of website featuring
%their tools. Aggregation of software metadata from multiple such partners is
%also not new, as CLARIN itself already does in the CLARIN Virtual Language
%Observatory\footnote{\url{https://vlo.clarin.eu}} \citep{VLO}.


\section{The need for high-quality metadata}

Unlike most digital data, software is uniquely characterised as a constantly
@@ -170,11 +166,11 @@ \section{The need for high-quality metadata}
\emph{complete} metadata. If vital details are missing, the
end-user may not be able to make an informed judgment.

% A common pitfall we have observed in practice is that metadata is often
% manually collected at some stage and published in a catalogue, but never or
% rarely updated or revised. In best case, the software has moved on and the
% metadata covers a mere subset, in worst case, the software or the entire
% catalogue is unmaintained and out of date.
A common pitfall we have observed in practice is that metadata is often
manually collected at some stage and published in a catalogue, but never or
rarely updated or revised. In the best case, the software has moved on and the
metadata covers a mere subset; in the worst case, the software or the entire
catalogue is unmaintained and out of date.

\section{Bottom-up harvesting from the source}

@@ -186,41 +182,40 @@ \section{Bottom-up harvesting from the source}
software catalogues\footnote{for example, \url{https://research-software-directory.org/}
offers such a platform. Metadata can often be exported via OAI-PMH.}. Our approach has a number of important advantages:

%\begin{itemize}

%\item
\begin{itemize}

\item
First, source code is often already accompanied by software metadata in existing
schemas\footnote{for example \texttt{pyproject.toml}, \texttt{setup.py}, \texttt{pom.xml}, \texttt{package.json}, \texttt{CITATION.cff} and others. Even files such as \texttt{README} and \texttt{LICENSE} may be a source for certain metadata.} because many programming language ecosystems already either require or
recommend this. Our aim is to avoid any duplication of metadata and
\emph{reuse} these existing sources to the maximum extent possible.

%Consider
%for example \texttt{pyproject.toml} or \texttt{setup.py} for Python projects,
%\texttt{package.json} for javascript/npm/nodejs projects, \texttt{pom.xml} for
%Java/Maven and \texttt{Cargo.toml} for Rust. Aside from these, valuable machine
%parsable metadata may be extracted from other files such as a \texttt{LICENSE}
%file or , `README` file which often contains machine-interpretable
%badges\footnote{Badges are small images often included on top a \texttt{README}
%file to express certain properties of the software, such as links to
%documentation, continuous integration services, development status, packaging
%status, etc..}, or a \texttt{CITATION.cff} file, amongst others. Multiple such
%sources may be present and can be recombined.

%\item
Second, source code should be under version control (e.g. git) and published in a forge (e.g. GitHub, Gitlab, Sourcehut).
% is typically held in a version control system (usually git) and published in
%forges such as Github, Gitlab, Bitbucket, Codeberg or Sourcehut.
Consider
for example \texttt{pyproject.toml} or \texttt{setup.py} for Python projects,
\texttt{package.json} for JavaScript/npm/Node.js projects, \texttt{pom.xml} for
Java/Maven and \texttt{Cargo.toml} for Rust. Aside from these, valuable
machine-parsable metadata may be extracted from other files such as a \texttt{LICENSE}
file, a \texttt{README} file, which often contains machine-interpretable
badges\footnote{Badges are small images often included at the top of a \texttt{README}
file to express certain properties of the software, such as links to
documentation, continuous integration services, development status, packaging
status, etc.}, or a \texttt{CITATION.cff} file, amongst others. Multiple such
sources may be present and can be combined.

\item
Second, source code
is typically held in a version control system (usually Git) and published in
forges such as GitHub, GitLab, Bitbucket, Codeberg or Sourcehut.
This solves versioning issues and ensures metadata can exactly describe the version alongside which it is stored. It also enables the
harvester to properly identify the latest stable release\footnote{provided some kind of industry-standard versioning system like semantic versioning is adhered to}.
Software forges themselves may also provide an API that can serve as an extra source of software metadata (e.g. descriptions, keywords, links to issue trackers and/or continuous integration services).
%\item
\item
Third, the developers of the tool have full control and authorship over their metadata. There are no middlemen.
%\item
\item
Last, the forges were designed precisely for collaboration on open source software development, so mechanisms for any
third party to amend or correct the metadata are already in place (e.g. via a pull/merge request or patch via e-mail).
So while developers retain full authorship, outside contribution and curation remain possible.
%\end{itemize}
\end{itemize}
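
The second point above, identifying the latest stable release from version control, can be sketched as follows. This is a hedged illustration rather than the harvester's actual logic. It assumes that releases are marked by tags following semantic versioning, optionally prefixed with "v", and that any tag carrying a pre-release suffix is not stable. The function name and the example tags are hypothetical.

```python
import re

# MAJOR.MINOR.PATCH with optional "v" prefix, optional pre-release suffix
# (e.g. "-rc1") and optional build metadata (e.g. "+build5").
SEMVER = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def latest_stable(tags):
    """Return the tag with the highest stable semantic version, or None.

    Tags that do not parse as semantic versions, and tags with a
    pre-release suffix, are ignored.
    """
    stable = []
    for tag in tags:
        match = SEMVER.match(tag)
        if match is None or match.group(4) is not None:
            continue
        # Compare numerically, so that 1.10.1 sorts above 1.2.0.
        version = tuple(int(part) for part in match.group(1, 2, 3))
        stable.append((version, tag))
    if not stable:
        return None
    return max(stable)[1]


# Hypothetical tag list, e.g. as obtained from the output of "git tag".
example_tags = ["v0.9.0", "v1.2.0", "v1.10.1", "v2.0.0-rc1", "nightly"]

if __name__ == "__main__":
    print(latest_stable(example_tags))
```

Comparing the parsed numbers rather than the raw strings matters here: a plain string sort would rank "v1.2.0" above "v1.10.1".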

We do not harvest any metadata from intermediaries\footnote{from other catalogues,
for instance via the aforementioned OAI-PMH endpoints} as that would defeat our philosophy. We do