Pyannote fails to detect segments correctly #16
I've been experimenting with it to understand the behavior. The code below works for most cases, including motivation.wav and some other test audio files that I have; however, when I try a 9-minute audio recording, it errors out after a while. Note that this is very rough and includes many debugging and sanity-check steps :-D

pub fn get_segments<P: AsRef<Path>>(
samples: &[i16],
sample_rate: u32,
model_path: P,
) -> Result<impl Iterator<Item = Result<Segment>> + '_> {
let session = session::create_session(model_path.as_ref())?;
let frame_size = 270;
let frame_start = 721;
let window_size = (sample_rate * 10) as usize;
let total_duration = samples.len() as f64 / sample_rate as f64;
let padded_length = if samples.len() % window_size != 0 {
samples.len() + (window_size - (samples.len() % window_size))
} else {
samples.len()
};
let total_windows = (padded_length + window_size - 1) / window_size;
println!("Audio stats:");
println!(" Total samples: {}", samples.len());
println!(" Sample rate: {} Hz", sample_rate);
println!(" Duration: {:.2} seconds", total_duration);
println!(
" Window size: {} samples ({} seconds)",
window_size,
window_size as f64 / sample_rate as f64
);
println!(" Total windows needed: {}", total_windows);
let padded_samples = {
let mut padded = Vec::from(samples);
let padding_needed = padded_length - samples.len();
println!(" Adding {} padding samples", padding_needed);
padded.extend(vec![0; padding_needed]);
padded
};
// An error occurs when we try to process a very small speech segment
// (only 270 samples in this case) in knf_rs::compute_fbank:
//   Processing segment: 448.80s - 448.81s (duration: 0.01s, samples: 270)
//   thread 'main' panicked at examples/infinite.rs:25:10:
//   called `Result::unwrap()` on an `Err` value: The frames array is empty. No features to compute.
//   Location:
//       crates/knf-rs/src/lib.rs:38:9
//   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
let min_segment_samples = (sample_rate as f64 * 0.1) as usize;
let mut window_count = 0;
let mut is_speeching = false;
let mut offset = frame_start;
let mut start_offset = 0.0;
let mut current_position = 0;
let mut window_segments: Vec<Segment> = Vec::new();
Ok(std::iter::from_fn(move || {
loop {
// Return any pending segments first
if !window_segments.is_empty() {
return Some(Ok(window_segments.remove(0)));
}
// Check if we've processed all samples
if current_position >= padded_samples.len() {
// Handle any final speech segment
if is_speeching {
let start = start_offset / sample_rate as f64;
let end = offset as f64 / sample_rate as f64;
let start_idx = (start * sample_rate as f64) as usize;
let end_idx = (end * sample_rate as f64) as usize;
// Check if segment is long enough
if (end_idx - start_idx) >= min_segment_samples {
window_segments.push(Segment {
start,
end,
samples: samples
[start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
.to_vec(),
});
} else {
println!(
" Skipping too short segment: {}s to {}s (duration: {:.3}s)",
start,
end,
end - start
);
}
is_speeching = false;
}
return None;
}
// Process next window
let window_end = (current_position + window_size).min(padded_samples.len());
let window = &padded_samples[current_position..window_end];
window_count += 1;
println!(
"Processing window {}/{} ({:.2}s to {:.2}s)",
window_count,
total_windows,
current_position as f64 / sample_rate as f64,
window_end as f64 / sample_rate as f64
);
// Convert and process window
let array = ndarray::Array1::from_iter(window.iter().map(|&x| x as f32));
let array = array.view().insert_axis(Axis(0)).insert_axis(Axis(1));
let inputs = match ort::inputs![array.into_dyn()] {
Ok(inputs) => inputs,
Err(e) => return Some(Err(eyre::eyre!("Failed to prepare inputs: {:?}", e))),
};
let ort_outs = match session.run(inputs) {
Ok(outputs) => outputs,
Err(e) => return Some(Err(eyre::eyre!("Failed to run the session: {:?}", e))),
};
let ort_out = match ort_outs.get("output").context("Output tensor not found") {
Ok(output) => output,
Err(e) => return Some(Err(eyre::eyre!("Output tensor error: {:?}", e))),
};
let ort_out = match ort_out.try_extract_tensor::<f32>() {
Ok(tensor) => tensor,
Err(e) => return Some(Err(eyre::eyre!("Tensor extraction error: {:?}", e))),
};
println!(" Processing tensor of shape: {:?}", ort_out.shape());
// Process segments in window
for row in ort_out.outer_iter() {
for sub_row in row.axis_iter(Axis(0)) {
let max_index = match find_max_index(sub_row) {
Ok(index) => index,
Err(e) => return Some(Err(e)),
};
if max_index != 0 {
if !is_speeching {
start_offset = offset as f64;
is_speeching = true;
println!(
" Speech started at {:.2}s",
start_offset / sample_rate as f64
);
}
} else if is_speeching {
println!(
" Speech ended at {:.2}s",
offset as f64 / sample_rate as f64
);
let start = start_offset / sample_rate as f64;
let end = offset as f64 / sample_rate as f64;
let start_idx = (start * sample_rate as f64) as usize;
let end_idx = (end * sample_rate as f64) as usize;
window_segments.push(Segment {
start,
end,
samples: samples
[start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
.to_vec(),
});
is_speeching = false;
}
offset += frame_size;
}
}
// Move to next window
current_position = window_end;
// If we found segments in this window, return the first one
if !window_segments.is_empty() {
return Some(Ok(window_segments.remove(0)));
}
}
}))
}
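As a sanity check on the window count printed above, the padding arithmetic rounds the sample count up to the next multiple of the window size. A minimal standalone sketch of that math (the helper name and the concrete values are mine, not from the crate):

fn padded_length(len: usize, window_size: usize) -> usize {
    // Round len up to the next multiple of window_size (no-op if already aligned).
    if len % window_size != 0 {
        len + (window_size - (len % window_size))
    } else {
        len
    }
}

fn main() {
    let sample_rate = 16_000usize;
    let window_size = sample_rate * 10; // 10-second windows, as in the code above
    // A 9-minute recording: 540 s of audio, which happens to align exactly.
    let len = sample_rate * 540;
    let padded = padded_length(len, window_size);
    let total_windows = (padded + window_size - 1) / window_size;
    assert_eq!(padded % window_size, 0);
    assert_eq!(total_windows, 54);
    // A length that is not an exact multiple gets padded up.
    assert_eq!(padded_length(5, 4), 8);
}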
I tried to hack around this: a segment must meet the minimum length requirement of 0.1 seconds before being added to window_segments!

} else if is_speeching {
println!(
" Speech ended at {:.2}s",
offset as f64 / sample_rate as f64
);
let start = start_offset / sample_rate as f64;
let end = offset as f64 / sample_rate as f64;
let start_idx = (start * sample_rate as f64) as usize;
let end_idx = (end * sample_rate as f64) as usize;
// Add minimum segment length check here too
if (end_idx - start_idx) >= min_segment_samples {
window_segments.push(Segment {
start,
end,
samples: samples
[start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
.to_vec(),
});
} else {
println!(
" Skipping too short segment: {}s to {}s (duration: {:.3}s)",
start,
end,
end - start
);
}
is_speeching = false;
}

So this currently works with all the audio files I have tried. 😊 Full version:

pub fn get_segments<P: AsRef<Path>>(
samples: &[i16],
sample_rate: u32,
model_path: P,
) -> Result<impl Iterator<Item = Result<Segment>> + '_> {
let session = session::create_session(model_path.as_ref())?;
let frame_size = 270;
let frame_start = 721;
let window_size = (sample_rate * 10) as usize;
let total_duration = samples.len() as f64 / sample_rate as f64;
let padded_length = if samples.len() % window_size != 0 {
samples.len() + (window_size - (samples.len() % window_size))
} else {
samples.len()
};
let total_windows = (padded_length + window_size - 1) / window_size;
println!("Audio stats:");
println!(" Total samples: {}", samples.len());
println!(" Sample rate: {} Hz", sample_rate);
println!(" Duration: {:.2} seconds", total_duration);
println!(
" Window size: {} samples ({} seconds)",
window_size,
window_size as f64 / sample_rate as f64
);
println!(" Total windows needed: {}", total_windows);
let padded_samples = {
let mut padded = Vec::from(samples);
let padding_needed = padded_length - samples.len();
println!(" Adding {} padding samples", padding_needed);
padded.extend(vec![0; padding_needed]);
padded
};
// An error used to occur when we tried to process a very small speech segment
// (only 270 samples in this case) in knf_rs::compute_fbank:
//   Processing segment: 448.80s - 448.81s (duration: 0.01s, samples: 270)
//   thread 'main' panicked at examples/infinite.rs:25:10:
//   called `Result::unwrap()` on an `Err` value: The frames array is empty. No features to compute.
//   Location:
//       crates/knf-rs/src/lib.rs:38:9
//   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
// So it's fixed!
let min_segment_samples = (sample_rate as f64 * 0.1) as usize;
let mut window_count = 0;
let mut is_speeching = false;
let mut offset = frame_start;
let mut start_offset = 0.0;
let mut current_position = 0;
let mut window_segments: Vec<Segment> = Vec::new();
Ok(std::iter::from_fn(move || {
loop {
// Return any pending segments first
if !window_segments.is_empty() {
return Some(Ok(window_segments.remove(0)));
}
// Check if we've processed all samples
if current_position >= padded_samples.len() {
// Handle any final speech segment
if is_speeching {
let start = start_offset / sample_rate as f64;
let end = offset as f64 / sample_rate as f64;
let start_idx = (start * sample_rate as f64) as usize;
let end_idx = (end * sample_rate as f64) as usize;
// Check if segment is long enough
if (end_idx - start_idx) >= min_segment_samples {
window_segments.push(Segment {
start,
end,
samples: samples
[start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
.to_vec(),
});
} else {
println!(
" Skipping too short segment: {}s to {}s (duration: {:.3}s)",
start,
end,
end - start
);
}
is_speeching = false;
}
return None;
}
// Process next window
let window_end = (current_position + window_size).min(padded_samples.len());
let window = &padded_samples[current_position..window_end];
window_count += 1;
println!(
"Processing window {}/{} ({:.2}s to {:.2}s)",
window_count,
total_windows,
current_position as f64 / sample_rate as f64,
window_end as f64 / sample_rate as f64
);
// Convert and process window
let array = ndarray::Array1::from_iter(window.iter().map(|&x| x as f32));
let array = array.view().insert_axis(Axis(0)).insert_axis(Axis(1));
let inputs = match ort::inputs![array.into_dyn()] {
Ok(inputs) => inputs,
Err(e) => return Some(Err(eyre::eyre!("Failed to prepare inputs: {:?}", e))),
};
let ort_outs = match session.run(inputs) {
Ok(outputs) => outputs,
Err(e) => return Some(Err(eyre::eyre!("Failed to run the session: {:?}", e))),
};
let ort_out = match ort_outs.get("output").context("Output tensor not found") {
Ok(output) => output,
Err(e) => return Some(Err(eyre::eyre!("Output tensor error: {:?}", e))),
};
let ort_out = match ort_out.try_extract_tensor::<f32>() {
Ok(tensor) => tensor,
Err(e) => return Some(Err(eyre::eyre!("Tensor extraction error: {:?}", e))),
};
println!(" Processing tensor of shape: {:?}", ort_out.shape());
// Process segments in window
for row in ort_out.outer_iter() {
for sub_row in row.axis_iter(Axis(0)) {
let max_index = match find_max_index(sub_row) {
Ok(index) => index,
Err(e) => return Some(Err(e)),
};
if max_index != 0 {
if !is_speeching {
start_offset = offset as f64;
is_speeching = true;
println!(
" Speech started at {:.2}s",
start_offset / sample_rate as f64
);
}
} else if is_speeching {
println!(
" Speech ended at {:.2}s",
offset as f64 / sample_rate as f64
);
let start = start_offset / sample_rate as f64;
let end = offset as f64 / sample_rate as f64;
let start_idx = (start * sample_rate as f64) as usize;
let end_idx = (end * sample_rate as f64) as usize;
// Add minimum segment length check here too
if (end_idx - start_idx) >= min_segment_samples {
window_segments.push(Segment {
start,
end,
samples: samples
[start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
.to_vec(),
});
} else {
println!(
" Skipping too short segment: {}s to {}s (duration: {:.3}s)",
start,
end,
end - start
);
}
is_speeching = false;
}
offset += frame_size;
}
}
// Move to next window
current_position = window_end;
// If we found segments in this window, return the first one
if !window_segments.is_empty() {
return Some(Ok(window_segments.remove(0)));
}
}
}))
}
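The fix above boils down to a minimum-duration guard applied before a segment is pushed. Extracted as a standalone sketch (the 0.1 s threshold matches the code above; the helper name is mine):

// A segment shorter than min_secs is dropped so downstream feature
// extraction (e.g. fbank computation) never sees an empty frame array.
fn keep_segment(start: f64, end: f64, sample_rate: u32, min_secs: f64) -> bool {
    let start_idx = (start * sample_rate as f64) as usize;
    let end_idx = (end * sample_rate as f64) as usize;
    let min_samples = (sample_rate as f64 * min_secs) as usize;
    end_idx.saturating_sub(start_idx) >= min_samples
}

fn main() {
    let sr = 16_000;
    // The problematic 0.01 s segment from the panic log is rejected...
    assert!(!keep_segment(448.80, 448.81, sr, 0.1));
    // ...while an ordinary segment of a second or more passes.
    assert!(keep_segment(7.52, 9.12, sr, 0.1));
}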
Looks like you found the issue. I forgot to collect all the segments and instead returned only one from the for loops.
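That bug pattern is easy to reproduce in miniature: an early return from inside nested loops yields only the first match, while collecting into a Vec and draining it yields them all. A simplified sketch (illustrative only, not the project's actual code):

// Buggy shape: the early return loses everything after the first hit.
fn first_only(rows: &[Vec<i32>]) -> Option<i32> {
    for row in rows {
        for &v in row {
            if v != 0 {
                return Some(v);
            }
        }
    }
    None
}

// Fixed shape: collect every hit, then hand them out one at a time.
fn collect_all(rows: &[Vec<i32>]) -> Vec<i32> {
    let mut found = Vec::new();
    for row in rows {
        for &v in row {
            if v != 0 {
                found.push(v);
            }
        }
    }
    found
}

fn main() {
    let rows = vec![vec![0, 1, 0], vec![2, 0, 3]];
    assert_eq!(first_only(&rows), Some(1));
    assert_eq!(collect_all(&rows), vec![1, 2, 3]);
}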
Yes, it's actually very poorly written and slow 🙂. I haven't analyzed it in detail yet; I've been dealing with this for a long time and started to get a little confused 😂. I just wanted to share it as soon as I figured it out 😊
Note that since the issue is fixed with v0.3.0-beta.0, feel free to close this. But I'm also keeping an eye on https://users.rust-lang.org/t/how-to-create-iterator-with-results/121834/3 in case we find a better way :-) Thank you again, really nice work!
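For anyone landing on that linked thread, the pattern in question is a lazy iterator whose items are Result values, which is exactly the shape of the get_segments return type. A minimal std::iter::from_fn sketch, independent of this crate:

// from_fn drives a closure that returns Some(Ok(..)) on success,
// Some(Err(..)) on a recoverable failure, and None when exhausted.
fn numbers(upto: u32) -> impl Iterator<Item = Result<u32, String>> {
    let mut n = 0;
    std::iter::from_fn(move || {
        if n >= upto {
            return None;
        }
        n += 1;
        if n == 3 {
            Some(Err(format!("failed at {}", n)))
        } else {
            Some(Ok(n))
        }
    })
}

fn main() {
    let out: Vec<_> = numbers(4).collect();
    assert_eq!(out.len(), 4);
    assert_eq!(out[0], Ok(1));
    assert!(out[2].is_err()); // the error is yielded in place, iteration continues
}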
Sorry, while experimenting with some real-world scenarios with the example: even though "6_speakers.wav" works, a very simple "wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O motivation.wav" gives the following output:
Running
target/debug/examples/infinite motivation.wav
start = 7.52, end = 9.12, speaker = 1

Running with cargo run --example infinite motivation.wav gives exactly the same result ("start_7.52_end_9.12.wav"). I don't understand why it behaves like that :(