New Stat: Maintainers #23

Open
gundalow opened this issue Mar 8, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@gundalow
Contributor

gundalow commented Mar 8, 2021

What

Maintainers are a key part of the Ansible community. They are the people that can merge code (either directly or via ansibullbot)

Definition of a maintainer

  • Is listed as having triage (or higher) permissions for a specific repo in GitHub
  • We should ignore Ansible staff
  • We should include Red Hat staff that aren't part of Ansible
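
A minimal sketch of the definition above, assuming the crawler already holds collaborator records shaped like the GitHub REST "list repository collaborators" response (a `login` plus a `permissions` map of booleans). The function name, the sample logins, and the `ansible_staff` set are hypothetical; Red Hat staff outside Ansible are included simply by not being listed in that set.

```python
from typing import Iterable, List, Set

# Permission flags that count as "triage (or higher)".
MAINTAINER_LEVELS = ("triage", "push", "maintain", "admin")


def find_maintainers(collaborators: Iterable[dict], ansible_staff: Set[str]) -> List[str]:
    """Return sorted logins with triage or higher, excluding Ansible staff."""
    found = set()
    for person in collaborators:
        perms = person.get("permissions", {})
        has_level = any(perms.get(level) for level in MAINTAINER_LEVELS)
        if has_level and person["login"] not in ansible_staff:
            found.add(person["login"])
    return sorted(found)


# Hypothetical sample data, not taken from a real repository.
sample = [
    {"login": "alice", "permissions": {"pull": True, "triage": True, "push": False}},
    {"login": "bob", "permissions": {"pull": True, "triage": False, "push": False}},
    {"login": "carol", "permissions": {"pull": True, "triage": True, "admin": True}},
]
maintainers = find_maintainers(sample, ansible_staff={"carol"})
```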

Active maintainer

As a 2nd phase, we may wish to track how many of these maintainers are "active".

Someone interacting with the repository in any way should count as being an active maintainer, ie:

  • Creating an Issue or PR
  • Commenting (text or emoji response) on an Issue or PR
  • Reviewing a PR
  • Changing metadata: Adding/removing labels, assigning someone to review
  • Closing/merging an issue or PR

We would want to define some time limit, though given some repositories don't have much activity, maybe this limit should be fairly high, ie 6 months+?
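
As a rough illustration of the "active" check, the sketch below assumes every interaction type listed above has already been flattened into `(login, timestamp)` pairs. The function name and the 183-day default (roughly the 6 months floated above) are assumptions, not a settled rule.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Set, Tuple


def active_maintainers(
    maintainers: Iterable[str],
    events: Iterable[Tuple[str, datetime]],
    now: datetime,
    window_days: int = 183,  # assumed stand-in for "6 months+"
) -> Set[str]:
    """Return maintainers with at least one interaction inside the window."""
    cutoff = now - timedelta(days=window_days)
    recently_seen = {login for login, when in events if when >= cutoff}
    return recently_seen & set(maintainers)


# Hypothetical events: any issue, PR, comment, review, label change or merge.
now = datetime(2021, 4, 12, tzinfo=timezone.utc)
events = [
    ("alice", now - timedelta(days=10)),
    ("bob", now - timedelta(days=400)),
    ("dave", now - timedelta(days=1)),  # active, but not a maintainer
]
active = active_maintainers({"alice", "bob"}, events, now)
```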

Which repos

There are many collections on Galaxy.
Some of those are under gh/ansible-collections
The ansible package contains some collections from gh/ansible-collections, as well as some collections hosted elsewhere.

If we need to extract a list of maintainers, then I believe that will limit us to collections under gh/ansible-collections.

We may wish to filter this, to only collections that we include in the ansible package.

Special case repos with .github/BOTMETA.yml

As well as maintainers being defined by having direct permissions via GitHub, we use ansibullbot to delegate permissions to certain people for a specific directory example.

Given the vast number of BOTMETA maintainers, we may wish to track this separately to Collection owners.
"Active" status is harder, as it is possible that a repository may go many years before Ansibullbot needs to ping a specific maintainer, ie when a PR is raised against a specific module.

What would cause an increase

When we add a new collection into the ansible package, it's likely that will have some new maintainers, ie
ibm.ds8000 requests a new collection repo with 8 maintainers, which are all new to the ansible-collections GitHub Org

What would cause a decrease

  • Someone stepping down
  • Removal of a repository (unlikely)
  • Moving a repository from gh/ansible-collections to a different GH Org where we don't have permissions to see who is a maintainer

Presentations/questions we will ask the data

  1. Are we increasing the number of maintainers over time?
  2. Are the maintainers we have active, ie along with time-to-merge metrics, does a repo need help?
  3. Overall maintainers (as defined in GitHub)
    • How this changes over time
    • Ability to mark specific events on the graph (recruitment drive, new repository with large number of maintainers)
  4. Maintainers (as defined in GitHub) over time for a specific repository
  5. Overall BOTMETA maintainers
    • How this changes over time
    • Ability to mark specific events on the graph (recruitment drive, new repository with large number of maintainers)
  6. BOTMETA maintainers for a specific repository
    • How this changes over time
    • Ability to mark specific events on the graph (recruitment drive)
@gundalow gundalow added the enhancement New feature or request label Mar 8, 2021
@GregSutcliffe
Contributor

Good stuff, thanks @gundalow. Jotting down some musings while I read it through:

Definition of a maintainer

Note that a given person can be multiple things - an employed developer in one repo, but contributing on their own time in another. This will likely need to be managed per repo, which is an overhead - or at least we should verify whether it's an issue.

Active maintainer

This seems fair. IIRC one can get a public event stream for a user from the GH API, so we can figure out:

a) Have they been active in this repo
b) Have they been active elsewhere

We can probably also quantify if there is work outstanding in a repo - an inactive maintainer on a repo with 50 open PRs is probably worse than one with 2 open PRs.

Decreases: Moving a repository from gh/ansible-collections to a different GH Org where we don't have permissions to see who is a maintainer

We could probably at least infer this from who is committing (IIRC GH logs the author and the committer separately). Possibly for a later phase though.

Shower thoughts

This looks like a solid start point, although it will likely require a new crawler at the user level. Shouldn't be too hard to write.

That said, I'm slightly wary of creating metrics for people - they tend to lead to scoreboards, system gaming, and in rare cases, blame arguments. We run a risk of accidentally creating OKRs for the community when we're really trying to create them for our team (that is, we would like to know when a human needs to reach out with an offer of help, not create a table of names to guilt the lowest entries into trying harder).

Thus, what's the plan for what we do with the data? Is it just for our team? Are we planning to only show aggregate stats here? Or will it be public per repo? (Because the latter essentially de-anonymises the data, any given repo will have just a few maintainers). I'm not against any particular outcome here, I just want to be sure we understand how the end result can be used.

@Andersson007

My 50 cents in addition:

  1. The maintainer definition, plus the ability to release a collection (I think a person with commit rights can add tags; I mean that, in general, maintainers usually support the repository itself).

  2. We may wish to filter this, to only collections that we include in the ansible package. - sounds sensible

  3. We should extract a list of current maintainers (who have permissions you described) plus some statistics about their activities to grant more or revoke privileges

  4. Top active contributors (say 5 or 10) based on the described activities (if possible) to see new potential maintainers

  5. "Presentations/questions we will ask the data": it would be good if we could see this on https://stats.eng.ansible.com/app/collections_dash. Especially interesting, to me at least, are the current maintainer list and the top active contributors (as a starting point, the metric could be the number of commits).

  6. When collecting the stats, shouldn't we ignore "maintainers via BOTMETA.yml"? That's because:
    a) we now offer everyone who has contributed to a module to be added to BOTMETA.yml
    b) one shipit is not enough to get even some things merged
    c) the scope is mostly limited (one module or several)
    d) we are also interested in people who can maintain the repository / collection itself, e.g. by releasing on demand (frequent patch releases) and on schedule (e.g. aligning minor releases with Ansible minor releases and, most importantly, major releases with Ansible major releases).
    It's pretty time-consuming. I doubt our team can effectively release more than 5 collections per teammate, or somehow release 10 collections per teammate without heavily affecting other work.
    So, for the statistics, I suggest treating as a collection maintainer a person who has extended rights at collection scope (I'm not sure if commit implies tagging).

  7. On whether the stats are public or private: good question. I think the lists of current maintainers and top active contributors can, and maybe even should, be public (with no details). As I mentioned, they could be shown on https://stats.eng.ansible.com/app/collections_dash

@Andersson007

#25

@GregSutcliffe
Contributor

I think the lists of current maintainers and top active contributors can / and maybe even should be public (with no details).

Do you mean, just the names, in a random order? That I could support, I think. Anything more detailed would be for the community team and the steering committee (because otherwise you again have a public scoreboard that can be gamed).

Otherwise, good points I think. We'll need to resolve how we feel about BOTMETA but that's likely a second phase of work anyway.

@Andersson007

Andersson007 commented Mar 19, 2021

Do you mean, just the names, in a random order?

Works for me:) We could also sort them by how many commits they did (it's available to see in our repositories, e.g. there https://github.com/ansible-collections/community.mysql/graphs/contributors). But it could be random, it's also fine, imo

@GregSutcliffe
Contributor

GregSutcliffe commented Apr 12, 2021

OK, cycling around to this.

I already have a users table, so I think it makes sense to orient around that. The process looks something like:

  1. take a repo
  2. figure out a list of maintainers using the rules above
  3. optional, take a shot at the "active" list too
  4. loop over users and update their entries for this repo, recording today's date:
{ 'GregSutcliffe': {
    'maintainer': {
      'stats-collections': {
        'first_seen': '2021-04-12',
        'last_seen': '2021-04-12'
      },
      ...
    },
    'active': {
      'stats-crawler': {
        'first_seen': '2021-04-12',
        'last_seen': '2021-04-12'
      },
      ...
    }
  }
}

This means the last_seen date stops being updated once they no longer show up in the list of maintainers, allowing us to see how the number of maintainers is evolving across the repo set.
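
Step 4 and the counting it enables could be sketched roughly as below. This is only an illustration under the record layout shown above: the helper names `record_sighting` and `count_on` are hypothetical, dates are held as `datetime.date` objects rather than strings, and `example-user` is a made-up login.

```python
from datetime import date


def record_sighting(users: dict, login: str, role: str, repo: str, today: date) -> None:
    """Create or refresh the entry for one user/role/repo on a crawl day."""
    repos = users.setdefault(login, {}).setdefault(role, {})
    entry = repos.setdefault(repo, {"first_seen": today})
    entry["last_seen"] = today  # only moves while the user is still listed


def count_on(users: dict, role: str, day: date) -> int:
    """Count users whose first_seen..last_seen span covers `day` in any repo."""
    total = 0
    for roles in users.values():
        spans = roles.get(role, {}).values()
        if any(s["first_seen"] <= day <= s["last_seen"] for s in spans):
            total += 1
    return total


users = {}
# First crawl: both users are listed as maintainers.
record_sighting(users, "GregSutcliffe", "maintainer", "stats-collections", date(2021, 4, 12))
record_sighting(users, "example-user", "maintainer", "stats-collections", date(2021, 4, 12))
# Second crawl: only one of them is still listed.
record_sighting(users, "GregSutcliffe", "maintainer", "stats-collections", date(2021, 4, 19))
```

Counting per crawl date then gives the "maintainers over time" series without having to store a snapshot for every crawl.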

Once we have the data being collected by the crawler, we can figure out how to display it, but to comment on this one:

it's available to see in our repositories

That's true, but that's a single repo. If we're going to make it significantly easier to consume that across many repos, then we have a duty to consider how that new format might be used. Just because data already exists doesn't mean you are absolved of responsibility when you process it.

@Andersson007

FYI: When looking for new maintainers manually, I usually use the following metrics:

  • Number of merged contributions (should be at least 4-6 to go further) and number of open ones.
  • Open dates of the first merged PR and the last open / merged PR: the first one should have happened at least several months ago, the last one recently.
  • Regularity: ideally the merges should happen regularly with no long gaps. A good picture can be, say, several merged PRs every month during the previous 4-5 months. If merges happened during the last month but several PRs are still open, that's also fine.
  • Activity as a reviewer is a very, very good indicator (especially in PRs / issues where the person was not pinged by the bot).
  • For big things like c.general, contributions to several unrelated areas are also a good indicator.
  • Contributions to general things like purging ignore-*.txt files are also a good indicator.

Maybe it will give some ideas how to find potential maintainers, though not all the metrics can be used in scripts, mathematical models, etc.
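
For the parts that can be scripted, a first filter over the first two bullets might look like the sketch below. Only the "at least 4" merged threshold comes from the list above. The 90-day reading of "several months", the 30-day reading of "recently", and the function name are assumptions to tune.

```python
from datetime import date, timedelta
from typing import List


def looks_like_candidate(
    merged_dates: List[date],
    today: date,
    min_merged: int = 4,          # lower end of the "4-6" rule of thumb
    min_history_days: int = 90,   # assumed meaning of "several months ago"
    recent_days: int = 30,        # assumed meaning of "recently"
) -> bool:
    """Rough pre-filter: enough merges, a long enough history, still active."""
    if len(merged_dates) < min_merged:
        return False
    first, last = min(merged_dates), max(merged_dates)
    long_enough = (today - first) >= timedelta(days=min_history_days)
    still_active = (today - last) <= timedelta(days=recent_days)
    return long_enough and still_active


# Hypothetical merge dates for one contributor.
today = date(2021, 4, 12)
steady = [date(2020, 12, 1), date(2021, 1, 15), date(2021, 2, 20), date(2021, 4, 1)]
newcomer = [date(2021, 4, 5), date(2021, 4, 6), date(2021, 4, 7), date(2021, 4, 8)]
```

Reviewer activity, regularity and breadth of contributions are left out here and would still need a human look.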

Also, if we see people who are active but have not been around long enough to become maintainers, we should support / encourage / mentor them as future candidates. So it would be nice to have a banner saying something like "This person has opened 5 PRs in c.docker during the last week" to draw attention to such a person.
