Framewise transcription evaluation #231
Thanks for bringing this up. A few questions -
fwiw, in sound event detection (SED) there seems to be a growing preference for frame-based (or "segment"-based, i.e. over some fixed time duration) evaluation over event-based evaluation (which is equivalent to note-based), because the latter is very penalizing: consider the case where an algorithm returns two consecutive notes for a single reference note - the first note would only be a match if you ignore offsets, and the second would always be treated as wrong, even though both match the reference in pitch and time if you ignore the split. So regardless of what the trend in MIREX is (I'm abroad and can't seem to load the mirex website right now), I expect we'll see frame-level metrics used more and more in transcription papers. In this context I should mention that, precisely because of this issue, there was an interesting attempt by Molina et al. at introducing additional note-based transcription metrics to provide greater insight into system performance, though it was focused on singing transcription and I'm not sure whether it has been adopted by the community. With regards to @craffel's questions:
No, it seems to be down... One point of reference: relating to sklearn's function, there is still confusion on which aggregation function to use.
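(Not from the thread - a made-up example just to make the ambiguity concrete.) On a small frames x labels indicator matrix, scikit-learn's `precision_recall_fscore_support` already gives three different answers depending on the `average` argument:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Tiny made-up piano roll: rows are frames (samples), columns are pitches (labels).
targ = np.array([[1, 0, 0],
                 [1, 1, 0],
                 [0, 1, 0],
                 [0, 1, 1]])
pred = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [0, 1, 1],
                 [0, 1, 1]])

for avg in ("micro", "macro", "samples"):
    p, r, f, _ = precision_recall_fscore_support(targ, pred, average=avg)
    print(f"{avg:>8}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

`micro` pools counts over all cells, `macro` averages per pitch column, and `samples` averages per frame row, so they genuinely measure different things.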
That's a multi-f0 tracking paper though, not transcription. It's important not to confound these, as transcription is a more involved task requiring segmentation/quantization into discrete note events. Perhaps I should add a qualification to my SED analogy - for SED I think it can be less important to focus on discrete events (depending on the source!) and rather consider presence/absence over time. However, in music discrete notes are very much a thing, and music notation is a well established paradigm (as is piano-roll), so I'd be reluctant to abandon note-based eval for transcription altogether. I think the most complete option is to compute both frame and note-level metrics, as done by Sigtia et al., so it would be nice to support that.
Right! But I think frame-wise evaluation would work the same for both multi-f0 tracking and 'real' transcription. I don't know if it makes sense for note transcription, though. Anyways, following formulas 1, 2, and 3 in Bay et al., assuming we sampled predictions and targets at a specified frame rate, we get two bit vectors:

```python
# pred and targ are boolean arrays of the same shape, sampled at the chosen frame rate
tp = float((pred & targ).sum())
fp = float((pred & ~targ).sum())
fn = float((targ & ~pred).sum())
p = tp / (tp + fp)
r = tp / (tp + fn)
f1 = 2 * p * r / (p + r)
```

This corresponds to the
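(Editorial aside, not part of the original comment; the data here are made up.) Pooling the counts over every (pitch, frame) cell like this is what scikit-learn calls micro-averaging, so on the same boolean matrices the two should agree:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
targ = rng.random((88, 500)) > 0.9   # made-up reference piano roll (pitch x frame)
pred = rng.random((88, 500)) > 0.9   # made-up estimated piano roll

tp = float((pred & targ).sum())
fp = float((pred & ~targ).sum())
fn = float((targ & ~pred).sum())
p, r = tp / (tp + fp), tp / (tp + fn)
f1 = 2 * p * r / (p + r)

# Micro-averaging pools tp/fp/fn over all cells, so it matches the hand-rolled counts.
p_sk, r_sk, f_sk, _ = precision_recall_fscore_support(
    targ.T.astype(int), pred.T.astype(int), average="micro")
assert np.allclose([p, r, f1], [p_sk, r_sk, f_sk])
```

Whether to pool globally like this, average per frame (`samples`), or average per pitch (`macro`) is exactly the aggregation question raised above.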
Definitely.
I just tracked down the literature given by Sigtia et al. and I agree, defining the task is very important here. For me, music transcription involves the step of aggregating framewise candidates into a sequence of notes (which then matches the MIDI-like list of note events used by the transcription metrics). So what do you call the step where you only have the framewise candidates? Multi-f0 tracking?
actually, the formulas in the two papers linked by stefan are very likely the wrong ones, ... i wrote up the whole ugly mess here: TL;DR:
Btw, note-level eval is already implemented in
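(The module name got cut off above; assuming it refers to mir_eval's note-level transcription metrics, here is a rough sketch of my own, with made-up notes, of the split-note case mentioned earlier - one reference note against an estimate that splits it into two consecutive notes at the same pitch:)

```python
import numpy as np
import mir_eval

# One reference note (0-1 s at 440 Hz) vs. an estimate split into two back-to-back notes.
ref_intervals = np.array([[0.0, 1.0]])
ref_pitches = np.array([440.0])
est_intervals = np.array([[0.0, 0.5], [0.5, 1.0]])
est_pitches = np.array([440.0, 440.0])

# Ignoring offsets: the first half matches the reference, the second counts as a false positive.
p, r, f, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches, offset_ratio=None)
print(p, r, f)  # with default tolerances this should give P=0.5, R=1.0

# Requiring offsets as well: the early offset makes even the first half fail to match.
p, r, f, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches)
print(p, r, f)  # P=0.0, R=0.0
```

Which is exactly the penalization of note-based eval that was brought up at the start of the thread.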
Based on my understanding of the metrics being discussed here, this seems correct. What functionality is currently missing?
Ping. If there is any functionality missing, please make it clear; otherwise, I will close.
Last I heard we'd reached agreement on how this should be implemented, but I assumed @stefan-balke was the one who was actually going to do it?
Yep, on my list. @justinsalamon, see you at ICASSP then :) |
TL;DR: Basically all I'm asking for is taking frames as inputs to @rabitt's `mir_eval.multipitch` module.

Hi everyone,

framewise evaluation is used in recent transcription papers:

This is basically http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification using the `macro`/`samples` parameter, but scaled with the number of frames/labels. As it seems that people use it, would it be useful to have this in `mir_eval`? If we go with the `scikit-learn` implementation, which I would strongly suggest, this adds it back as a dependency. Opinions?
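For what it's worth, here is a rough sketch of what I understand the request to be; the hop size, the helper names, and the global pooling are my own assumptions for illustration, not an existing mir_eval API:

```python
import numpy as np

def notes_to_piano_roll(intervals, pitches, n_frames, hop=0.01, n_pitches=128):
    """Rasterize (onset, offset) intervals in seconds + MIDI pitches onto a binary roll.

    Hypothetical helper for illustration only, not an existing mir_eval function.
    """
    roll = np.zeros((n_pitches, n_frames), dtype=bool)
    for (onset, offset), pitch in zip(intervals, pitches):
        start, end = int(round(onset / hop)), int(round(offset / hop))
        roll[int(pitch), start:end] = True
    return roll

def framewise_prf(ref_roll, est_roll):
    """Pooled framewise precision/recall/F1 over all (pitch, frame) cells."""
    tp = float((est_roll & ref_roll).sum())
    fp = float((est_roll & ~ref_roll).sum())
    fn = float((ref_roll & ~est_roll).sum())
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy example: one reference note, an estimate that is slightly too short.
ref_roll = notes_to_piano_roll(np.array([[0.0, 1.0]]), [69], n_frames=200)
est_roll = notes_to_piano_roll(np.array([[0.0, 0.8]]), [69], n_frames=200)
print(framewise_prf(ref_roll, est_roll))  # (1.0, 0.8, ~0.889)
```

From the sampled rolls one could equally apply scikit-learn's `samples`/`macro` averaging as discussed above; the pooling choice here is just one option.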