
Pyannote fails to detect segments correctly #16

Closed
altunenes opened this issue Nov 29, 2024 · 5 comments
Comments

@altunenes
Contributor

Sorry, this came up while experimenting with some real-world scenarios using the example.

Even though “6_speakers.wav” works, a very simple test, for example “wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O motivation.wav”, gives the following output:

Running target/debug/examples/infinite motivation.wav
start = 7.52, end = 9.12, speaker = 1

Running `cargo run --example infinite motivation.wav` gives exactly the same single result ("start_7.52_end_9.12.wav"). I don't understand why it behaves like that :(

@altunenes
Contributor Author

altunenes commented Nov 29, 2024

I've been experimenting with it to understand the behavior; the code below works in most cases. However, when I try a 9-minute audio recording, it errors after a while:

start = 441.70, end = 443.24, speaker = 19
Processing segment: 443.29s - 448.51s (duration: 5.22s, samples: 167130)
start = 443.29, end = 448.51, speaker = 10
Processing window 46/59 (450.00s to 460.00s)
  Processing tensor of shape: [1, 1182, 7]
  Speech ended at 448.81s
  Speech started at 449.17s
Processing segment: 448.80s - 448.81s (duration: 0.01s, samples: 270)
thread 'main' panicked at examples/infinite.rs:25:10:
called `Result::unwrap()` on an `Err` value: The frames array is empty. No features to compute.
Location:
    crates/knf-rs/src/lib.rs:38:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Here is code that works in many cases, such as motivation.wav and some other test audio files that I have. Note that this is very ugly and includes many debugging steps as sanity checks :-D

pub fn get_segments<P: AsRef<Path>>(
    samples: &[i16],
    sample_rate: u32,
    model_path: P,
) -> Result<impl Iterator<Item = Result<Segment>> + '_> {
    let session = session::create_session(model_path.as_ref())?;
    let frame_size = 270;
    let frame_start = 721;
    let window_size = (sample_rate * 10) as usize;
    let total_duration = samples.len() as f64 / sample_rate as f64;
    let padded_length = if samples.len() % window_size != 0 {
        samples.len() + (window_size - (samples.len() % window_size))
    } else {
        samples.len()
    };
    let total_windows = (padded_length + window_size - 1) / window_size;
    println!("Audio stats:");
    println!("  Total samples: {}", samples.len());
    println!("  Sample rate: {} Hz", sample_rate);
    println!("  Duration: {:.2} seconds", total_duration);
    println!(
        "  Window size: {} samples ({} seconds)",
        window_size,
        window_size as f64 / sample_rate as f64
    );
    println!("  Total windows needed: {}", total_windows);
    let padded_samples = {
        let mut padded = Vec::from(samples);
        let padding_needed = padded_length - samples.len();
        println!("  Adding {} padding samples", padding_needed);
        padded.extend(vec![0; padding_needed]);
        padded
    };
    //An error occurs when we try to process a very small speech segment (only 270 samples in this case) in knf_rs::compute_fbank
    //Processing segment: 448.80s - 448.81s (duration: 0.01s, samples: 270)
    //thread 'main' panicked at examples/infinite.rs:25:10:
    //called Result::unwrap() on an Err value: The frames array is empty. No features to compute.
    //Location:
    // crates/knf-rs/src/lib.rs:38:9
    //note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
    let min_segment_samples = (sample_rate as f64 * 0.1) as usize;

    let mut window_count = 0;
    let mut is_speeching = false;
    let mut offset = frame_start;
    let mut start_offset = 0.0;
    let mut current_position = 0;
    let mut window_segments: Vec<Segment> = Vec::new();
    Ok(std::iter::from_fn(move || {
        loop {
            // Return any pending segments first
            if !window_segments.is_empty() {
                return Some(Ok(window_segments.remove(0)));
            }
            // Check if we've processed all samples
            if current_position >= padded_samples.len() {
                // Handle any final speech segment
                if is_speeching {
                    let start = start_offset / sample_rate as f64;
                    let end = offset as f64 / sample_rate as f64;
                    let start_idx = (start * sample_rate as f64) as usize;
                    let end_idx = (end * sample_rate as f64) as usize;

                    // Check if segment is long enough
                    if (end_idx - start_idx) >= min_segment_samples {
                        window_segments.push(Segment {
                            start,
                            end,
                            samples: samples
                                [start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
                                .to_vec(),
                        });
                    } else {
                        println!(
                            "  Skipping too short segment: {}s to {}s (duration: {:.3}s)",
                            start,
                            end,
                            end - start
                        );
                    }
                    is_speeching = false;
                }
                return None;
            }
            // Process next window
            let window_end = (current_position + window_size).min(padded_samples.len());
            let window = &padded_samples[current_position..window_end];
            window_count += 1;
            println!(
                "Processing window {}/{} ({:.2}s to {:.2}s)",
                window_count,
                total_windows,
                current_position as f64 / sample_rate as f64,
                window_end as f64 / sample_rate as f64
            );
            // Convert and process window
            let array = ndarray::Array1::from_iter(window.iter().map(|&x| x as f32));
            let array = array.view().insert_axis(Axis(0)).insert_axis(Axis(1));

            let inputs = match ort::inputs![array.into_dyn()] {
                Ok(inputs) => inputs,
                Err(e) => return Some(Err(eyre::eyre!("Failed to prepare inputs: {:?}", e))),
            };

            let ort_outs = match session.run(inputs) {
                Ok(outputs) => outputs,
                Err(e) => return Some(Err(eyre::eyre!("Failed to run the session: {:?}", e))),
            };

            let ort_out = match ort_outs.get("output").context("Output tensor not found") {
                Ok(output) => output,
                Err(e) => return Some(Err(eyre::eyre!("Output tensor error: {:?}", e))),
            };

            let ort_out = match ort_out.try_extract_tensor::<f32>() {
                Ok(tensor) => tensor,
                Err(e) => return Some(Err(eyre::eyre!("Tensor extraction error: {:?}", e))),
            };
            println!("  Processing tensor of shape: {:?}", ort_out.shape());
            // Process segments in window
            for row in ort_out.outer_iter() {
                for sub_row in row.axis_iter(Axis(0)) {
                    let max_index = match find_max_index(sub_row) {
                        Ok(index) => index,
                        Err(e) => return Some(Err(e)),
                    };

                    if max_index != 0 {
                        if !is_speeching {
                            start_offset = offset as f64;
                            is_speeching = true;
                            println!(
                                "  Speech started at {:.2}s",
                                start_offset / sample_rate as f64
                            );
                        }
                    } else if is_speeching {
                        println!(
                            "  Speech ended at {:.2}s",
                            offset as f64 / sample_rate as f64
                        );
                        let start = start_offset / sample_rate as f64;
                        let end = offset as f64 / sample_rate as f64;
                        let start_idx = (start * sample_rate as f64) as usize;
                        let end_idx = (end * sample_rate as f64) as usize;

                        window_segments.push(Segment {
                            start,
                            end,
                            samples: samples
                                [start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
                                .to_vec(),
                        });
                        is_speeching = false;
                    }
                    offset += frame_size;
                }
            }
            // Move to next window
            current_position = window_end;

            // If we found segments in this window, return the first one
            if !window_segments.is_empty() {
                return Some(Ok(window_segments.remove(0)));
            }
        }
    }))
}

@altunenes
Contributor Author

I tried to hack around this (segments must meet the minimum length requirement of 0.1 seconds before being added to window_segments!!):

} else if is_speeching {
    println!(
        "  Speech ended at {:.2}s",
        offset as f64 / sample_rate as f64
    );
    let start = start_offset / sample_rate as f64;
    let end = offset as f64 / sample_rate as f64;
    let start_idx = (start * sample_rate as f64) as usize;
    let end_idx = (end * sample_rate as f64) as usize;

    // Add minimum segment length check here too
    if (end_idx - start_idx) >= min_segment_samples {
        window_segments.push(Segment {
            start,
            end,
            samples: samples
                [start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
                .to_vec(),
        });
    } else {
        println!(
            "  Skipping too short segment: {}s to {}s (duration: {:.3}s)",
            start,
            end,
            end - start
        );
    }
    is_speeching = false;
}

So this currently works with all the audio files I've tested. 😊

pub fn get_segments<P: AsRef<Path>>(
    samples: &[i16],
    sample_rate: u32,
    model_path: P,
) -> Result<impl Iterator<Item = Result<Segment>> + '_> {
    let session = session::create_session(model_path.as_ref())?;
    let frame_size = 270;
    let frame_start = 721;
    let window_size = (sample_rate * 10) as usize;
    let total_duration = samples.len() as f64 / sample_rate as f64;
    let padded_length = if samples.len() % window_size != 0 {
        samples.len() + (window_size - (samples.len() % window_size))
    } else {
        samples.len()
    };
    let total_windows = (padded_length + window_size - 1) / window_size;
    println!("Audio stats:");
    println!("  Total samples: {}", samples.len());
    println!("  Sample rate: {} Hz", sample_rate);
    println!("  Duration: {:.2} seconds", total_duration);
    println!(
        "  Window size: {} samples ({} seconds)",
        window_size,
        window_size as f64 / sample_rate as f64
    );
    println!("  Total windows needed: {}", total_windows);
    let padded_samples = {
        let mut padded = Vec::from(samples);
        let padding_needed = padded_length - samples.len();
        println!("  Adding {} padding samples", padding_needed);
        padded.extend(vec![0; padding_needed]);
        padded
    };
    //An error occurs when we try to process a very small speech segment (only 270 samples in this case) in knf_rs::compute_fbank
    //Processing segment: 448.80s - 448.81s (duration: 0.01s, samples: 270)
    //thread 'main' panicked at examples/infinite.rs:25:10:
    //called Result::unwrap() on an Err value: The frames array is empty. No features to compute.
    //Location:
    // crates/knf-rs/src/lib.rs:38:9
    //note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
    //So its fixed!
    let min_segment_samples = (sample_rate as f64 * 0.1) as usize;

    let mut window_count = 0;
    let mut is_speeching = false;
    let mut offset = frame_start;
    let mut start_offset = 0.0;
    let mut current_position = 0;
    let mut window_segments: Vec<Segment> = Vec::new();
    Ok(std::iter::from_fn(move || {
        loop {
            // Return any pending segments first
            if !window_segments.is_empty() {
                return Some(Ok(window_segments.remove(0)));
            }
            // Check if we've processed all samples
            if current_position >= padded_samples.len() {
                // Handle any final speech segment
                if is_speeching {
                    let start = start_offset / sample_rate as f64;
                    let end = offset as f64 / sample_rate as f64;
                    let start_idx = (start * sample_rate as f64) as usize;
                    let end_idx = (end * sample_rate as f64) as usize;

                    // Check if segment is long enough
                    if (end_idx - start_idx) >= min_segment_samples {
                        window_segments.push(Segment {
                            start,
                            end,
                            samples: samples
                                [start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
                                .to_vec(),
                        });
                    } else {
                        println!(
                            "  Skipping too short segment: {}s to {}s (duration: {:.3}s)",
                            start,
                            end,
                            end - start
                        );
                    }
                    is_speeching = false;
                }
                return None;
            }
            // Process next window
            let window_end = (current_position + window_size).min(padded_samples.len());
            let window = &padded_samples[current_position..window_end];
            window_count += 1;
            println!(
                "Processing window {}/{} ({:.2}s to {:.2}s)",
                window_count,
                total_windows,
                current_position as f64 / sample_rate as f64,
                window_end as f64 / sample_rate as f64
            );
            // Convert and process window
            let array = ndarray::Array1::from_iter(window.iter().map(|&x| x as f32));
            let array = array.view().insert_axis(Axis(0)).insert_axis(Axis(1));

            let inputs = match ort::inputs![array.into_dyn()] {
                Ok(inputs) => inputs,
                Err(e) => return Some(Err(eyre::eyre!("Failed to prepare inputs: {:?}", e))),
            };

            let ort_outs = match session.run(inputs) {
                Ok(outputs) => outputs,
                Err(e) => return Some(Err(eyre::eyre!("Failed to run the session: {:?}", e))),
            };

            let ort_out = match ort_outs.get("output").context("Output tensor not found") {
                Ok(output) => output,
                Err(e) => return Some(Err(eyre::eyre!("Output tensor error: {:?}", e))),
            };

            let ort_out = match ort_out.try_extract_tensor::<f32>() {
                Ok(tensor) => tensor,
                Err(e) => return Some(Err(eyre::eyre!("Tensor extraction error: {:?}", e))),
            };
            println!("  Processing tensor of shape: {:?}", ort_out.shape());
            // Process segments in window
            for row in ort_out.outer_iter() {
                for sub_row in row.axis_iter(Axis(0)) {
                    let max_index = match find_max_index(sub_row) {
                        Ok(index) => index,
                        Err(e) => return Some(Err(e)),
                    };

                    if max_index != 0 {
                        if !is_speeching {
                            start_offset = offset as f64;
                            is_speeching = true;
                            println!(
                                "  Speech started at {:.2}s",
                                start_offset / sample_rate as f64
                            );
                        }
                    } else if is_speeching {
                        println!(
                            "  Speech ended at {:.2}s",
                            offset as f64 / sample_rate as f64
                        );
                        let start = start_offset / sample_rate as f64;
                        let end = offset as f64 / sample_rate as f64;
                        let start_idx = (start * sample_rate as f64) as usize;
                        let end_idx = (end * sample_rate as f64) as usize;

                        // Add minimum segment length check here too
                        if (end_idx - start_idx) >= min_segment_samples {
                            window_segments.push(Segment {
                                start,
                                end,
                                samples: samples
                                    [start_idx.min(samples.len() - 1)..end_idx.min(samples.len())]
                                    .to_vec(),
                            });
                        } else {
                            println!(
                                "  Skipping too short segment: {}s to {}s (duration: {:.3}s)",
                                start,
                                end,
                                end - start
                            );
                        }
                        is_speeching = false;
                    }
                    offset += frame_size;
                }
            }
            // Move to next window
            current_position = window_end;

            // If we found segments in this window, return the first one
            if !window_segments.is_empty() {
                return Some(Ok(window_segments.remove(0)));
            }
        }
    }))
}

@thewh1teagle
Owner

So this currently works with audio files that I tried to test. 😊

Looks like you found the issue. I forgot to collect all of the segments and instead returned only one from the for loops.
You can use VecDeque, which is more efficient when you need a queue-like structure.
The only issue with that is that it doesn't yield each segment immediately (you push them all first).
I want to yield each one as soon as possible, but that's not simple with Rust's generators/iterators: https://users.rust-lang.org/t/how-to-create-iterator-with-results/121834/3
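A minimal sketch of the VecDeque idea, using a simplified hypothetical `Segment` (start/end only; the real struct also carries samples). The point is that `pop_front` is O(1), whereas `Vec::remove(0)` shifts every remaining element:

```rust
use std::collections::VecDeque;

// Hypothetical simplified Segment for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    start: f64,
    end: f64,
}

// Drain pending segments front-first; pop_front is O(1),
// unlike Vec::remove(0), which shifts all remaining elements.
fn drain(mut pending: VecDeque<Segment>) -> Vec<Segment> {
    let mut out = Vec::new();
    while let Some(seg) = pending.pop_front() {
        out.push(seg);
    }
    out
}

fn main() {
    let mut pending = VecDeque::new();
    // Pretend a window produced two segments.
    pending.push_back(Segment { start: 0.5, end: 1.2 });
    pending.push_back(Segment { start: 2.0, end: 3.4 });
    let drained = drain(pending);
    assert_eq!(drained[0].start, 0.5); // FIFO order is preserved
    println!("drained {} segments", drained.len());
}
```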

@altunenes
Contributor Author

Yes, it's actually very poorly written and slow 🙂. I haven't analyzed it in detail yet; I've been dealing with this for a long time and started to get a little confused 😂. I just wanted to share it as soon as I figured it out 😊

@altunenes
Contributor Author

Note that since the issue is fixed as of v0.3.0-beta.0, feel free to close this. But I'm also keeping an eye on https://users.rust-lang.org/t/how-to-create-iterator-with-results/121834/3 in case we find a better way :-)

Thank you again, really nice work!
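For reference, the pattern from the linked rust-users thread (lazily yielding `Result` items without generators) can be sketched with `std::iter::from_fn`. The function and values below are illustrative, not the crate's real API; each failure is yielded as an `Err` item so the caller decides whether to stop or continue:

```rust
// Sketch: a lazy iterator over Result items built from a closure.
fn fallible_stream(limit: u32) -> impl Iterator<Item = Result<u32, String>> {
    let mut n = 0;
    std::iter::from_fn(move || {
        if n >= limit {
            return None; // iterator exhausted
        }
        n += 1;
        if n == 3 {
            // Errors are yielded as items rather than aborting iteration.
            Some(Err(format!("failed at {}", n)))
        } else {
            Some(Ok(n))
        }
    })
}

fn main() {
    for item in fallible_stream(4) {
        match item {
            Ok(v) => println!("ok: {}", v),
            Err(e) => println!("err: {}", e),
        }
    }
}
```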
