WIP: Adding Transcription/Subtitle Viewing Support to the Web Player (VTT) #2918

Draft: wants to merge 9 commits into base: master

mfcar
Contributor

@mfcar mfcar commented May 4, 2024

I have begun work on adding transcription support to the Web Player.
I've used Whisper to generate transcriptions for some audiobooks and podcasts. Many tools based on Whisper support exports in VTT and SRT formats.
For this pull request, I'm only supporting VTT as it is natively supported by browsers. Support for SRT can be added in a future pull request.

How does it work?

A new endpoint, api/items/:id/file/:fileid/transcript, has been created on the backend. This endpoint attempts to return a transcription for each audio track. For instance, if there's an audio file named adventuresherlockholmes_01_doyle_64kb.mp3, this endpoint will attempt to return the file adventuresherlockholmes_01_doyle_64kb.vtt.
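
The file-name mapping described above could be sketched roughly as follows. This is only an illustration, not the code from this PR: `transcriptPathFor` is a hypothetical helper name, and it assumes the `.vtt` file sits next to the audio file with the same base name.

```javascript
const path = require('path')

// Hypothetical helper (not from the PR): given the path of an audio file,
// return the path of the sibling .vtt file with the same base name.
function transcriptPathFor(audioPath) {
  // path.parse splits "dir/name.ext" into { dir, name, ext, ... }
  const { dir, name } = path.parse(audioPath)
  return path.join(dir, `${name}.vtt`)
}

console.log(transcriptPathFor('adventuresherlockholmes_01_doyle_64kb.mp3'))
```

A route handler would presumably still need to check that the derived file exists and respond with a 404 when it does not.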

On the frontend, when an audio file is set as the src attribute of the <audio> element, a <track> element is created and attached to that <audio> element. The src attribute of the <track> element is set to the URL of the endpoint described above.
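
A minimal sketch of that frontend step, under stated assumptions: `transcriptUrl` and `attachTranscriptTrack` are hypothetical names, the leading `/api` prefix is assumed, and the `document` object is passed in as `doc` so the function can also run outside a browser. The actual player code in this PR may be structured differently.

```javascript
// Hypothetical URL builder mirroring the route described above.
function transcriptUrl(itemId, fileId) {
  return `/api/items/${encodeURIComponent(itemId)}/file/${encodeURIComponent(fileId)}/transcript`
}

// Attach a <track> pointing at the transcript endpoint to an <audio> element.
// `doc` would be `document` in the browser; it is injected here for testability.
function attachTranscriptTrack(doc, audioEl, itemId, fileId) {
  // Drop any track left over from the previously loaded audio file.
  for (const old of Array.from(audioEl.querySelectorAll('track'))) {
    old.remove()
  }
  const track = doc.createElement('track')
  track.kind = 'captions'
  track.src = transcriptUrl(itemId, fileId)
  track.default = true
  audioEl.appendChild(track)
  return track
}
```

In a browser this would be called as `attachTranscriptTrack(document, audioEl, itemId, fileId)` each time the audio source changes.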

What does this PR support?

  • Show/Hide transcription block
  • Highlighting the current transcription line
  • Clicking on a line to seek the player to that time
  • Changing transcriptions when the audio file changes (supports audiobooks and podcasts)
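
The highlighting and click-to-seek behaviour listed above can be illustrated with a small sketch. It assumes the cues are available as objects with `startTime` and `endTime` in seconds (the same field names `VTTCue` uses), sorted by start time and not overlapping. `findActiveCueIndex` and `seekToCue` are hypothetical names, not functions from this PR.

```javascript
// Return the index of the cue that contains `currentTime`, or -1 if none does.
// Binary search over cues sorted by startTime (assumed non-overlapping).
function findActiveCueIndex(cues, currentTime) {
  let lo = 0
  let hi = cues.length - 1
  let candidate = -1
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    if (cues[mid].startTime <= currentTime) {
      candidate = mid // last cue seen so far that starts at or before currentTime
      lo = mid + 1
    } else {
      hi = mid - 1
    }
  }
  if (candidate !== -1 && currentTime < cues[candidate].endTime) return candidate
  return -1
}

// Clicking a transcription line: move playback to the start of that cue.
function seekToCue(audioEl, cue) {
  audioEl.currentTime = cue.startTime
}
```

Running the lookup against `audioEl.currentTime` on every `timeupdate` event, or right after a new track loads, would give the index of the line to highlight.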

Demo

Screen.Recording.2024-05-04.at.20.12.35.mov

What is still missing within the scope of this PR

  • Hiding the "Show transcription" button when no transcription is available for the audio file
  • Fixing the known issues listed below

Known issues

  • When playing an audio file with transcription, if you close the web player and reopen it, the transcription block is not displayed, even though the transcription is still available. Clicking the "Show transcription" button displays the block again. I think this is related to the MediaPlayerContainer.vue component not reloading the TranscriptionUi component.
Screen.Recording.2024-05-04.at.14.14.37.mov
  • When playing an audio file with transcription, if you change the audio file, the first line of the new file's transcription is marked as active. The highlight moves to the correct line only when the next line change occurs.

Related

@@ -116,6 +116,7 @@
this.router.post('/items/:id/chapters', LibraryItemController.middleware.bind(this), LibraryItemController.updateMediaChapters.bind(this))
this.router.get('/items/:id/ffprobe/:fileid', LibraryItemController.middleware.bind(this), LibraryItemController.getFFprobeData.bind(this))
this.router.get('/items/:id/file/:fileid', LibraryItemController.middleware.bind(this), LibraryItemController.getLibraryFile.bind(this))
this.router.get('/items/:id/file/:fileid/transcript', LibraryItemController.middleware.bind(this), LibraryItemController.getTranscriptionFile.bind(this))

Check failure

Code scanning / CodeQL

Missing rate limiting High

This route handler performs a file system access, but is not rate-limited.
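
One way such a finding is commonly addressed is to put rate-limiting middleware in front of the route. The sketch below is a deliberately simple, dependency-free fixed-window limiter written for illustration; `createRateLimiter` is a hypothetical name, the limits are arbitrary, and a maintained package such as express-rate-limit would likely be preferable in practice. Nothing here is taken from the PR.

```javascript
// Fixed-window, in-memory rate limiter (sketch only: single process,
// no eviction of stale keys). `now` is injectable to make testing easy.
function createRateLimiter({ windowMs, max, now = Date.now }) {
  const hits = new Map() // key -> { count, windowStart }
  return function rateLimit(req, res, next) {
    const key = req.ip
    const t = now()
    const entry = hits.get(key)
    if (!entry || t - entry.windowStart >= windowMs) {
      // First request from this key, or the previous window has expired.
      hits.set(key, { count: 1, windowStart: t })
      return next()
    }
    entry.count += 1
    if (entry.count > max) {
      return res.status(429).send('Too many requests')
    }
    return next()
  }
}

// Assumed usage on the new route (limits chosen arbitrarily):
// const transcriptLimiter = createRateLimiter({ windowMs: 60 * 1000, max: 120 })
// this.router.get('/items/:id/file/:fileid/transcript', transcriptLimiter, ...)
```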
@mfcar mfcar changed the title WIP: Adding Transcription Support to the Web Player (VTT) WIP: Adding Transcription Playing Support to the Web Player (VTT) May 4, 2024
@mfcar mfcar changed the title WIP: Adding Transcription Playing Support to the Web Player (VTT) WIP: Adding Transcription/Subtitle Viewing Support to the Web Player (VTT) May 5, 2024
@barolo

barolo commented May 28, 2024

The placement irks me for some reason. I think this feature calls for something like a "Now Playing" screen.
But even in this form I really really want this feature in.
My sister is hearing impaired and this would really help her.

@mfcar
Contributor Author

mfcar commented May 28, 2024

The placement irks me for some reason. I think this feature calls for something like a "Now Playing" screen. But even in this form I really really want this feature in. My sister is hearing impaired and this would really help her.

I also don't like the placement. I was thinking of putting it in a side panel or a floating, movable modal.
However, the side panel raises concerns about taking up too much space, especially if the user has a narrow display.
The floating, movable modal adds more complexity to the JavaScript and CSS.
I will run some tests with both approaches and post updates here.

Sidebar like Apple Music:

image


Floating transcription window:

image

@ashwinm4friends

Great job on the project! For UX improvement, please consider looking into word highlighting in Snipd, as shown in this video: https://www.youtube.com/watch?v=jBi-OId37Uw
https://www.youtube.com/watch?v=jzPekGpC4uw

@barolo

barolo commented Jun 2, 2024

Great job on the project! For UX improvement, please consider looking into word highlighting in Snipd, as shown in this video: https://www.youtube.com/watch?v=jBi-OId37Uw https://www.youtube.com/watch?v=jzPekGpC4uw

Snipd uses word-level timestamps. While such subs are easy to generate, the only sane format is SSA/ASS afaik (SRT can blow up into megabytes, which is insane), and it is not natively supported by browsers.
No idea if it's possible with VTT
(and I'm guessing that Snipd just uses raw JSON or something, since the subs can be generated on the fly)

@mfcar
Contributor Author

mfcar commented Jun 2, 2024

Great job on the project! For UX improvement, please consider looking into word highlighting in Snipd, as shown in this video: https://www.youtube.com/watch?v=jBi-OId37Uw https://www.youtube.com/watch?v=jzPekGpC4uw

Snipd uses word-level timestamps. While such subs are easy to generate, the only sane format is SSA/ASS afaik (SRT can blow up into megabytes, which is insane), and it is not natively supported by browsers. No idea if it's possible with VTT (and I'm guessing that Snipd just uses raw JSON or something, since the subs can be generated on the fly)

Take a look at WebVTT, which supports something similar to "karaoke style" using the :past and :future pseudo-classes. However, the VTT files need to be adapted for this as well, and I think it's uncommon to get a VTT file with this information.
I was using Whisper to generate transcriptions, but I'm not sure whether it can generate word-by-word transcriptions.
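
To illustrate what an "adapted" VTT file means here: WebVTT allows timestamp tags inside the cue text, and the `::cue(:past)` / `::cue(:future)` selectors then match the text before and after the current playback position. The sketch below builds one such cue from word-level timings. The input shape (`{ text, start, end }` in seconds) and the function names are assumptions for illustration; word text is assumed to contain no `<` or `&` characters.

```javascript
// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function formatTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000)
  const ms = totalMs % 1000
  const totalSec = Math.floor(totalMs / 1000)
  const s = totalSec % 60
  const m = Math.floor(totalSec / 60) % 60
  const h = Math.floor(totalSec / 3600)
  const pad = (n, width = 2) => String(n).padStart(width, '0')
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`
}

// Build one cue whose text carries an inline timestamp before every word
// except the first (the first word starts with the cue itself).
function buildKaraokeCue(words) {
  const start = words[0].start
  const end = words[words.length - 1].end
  const payload = words
    .map((w, i) => (i === 0 ? w.text : `<${formatTimestamp(w.start)}>${w.text}`))
    .join(' ')
  return `${formatTimestamp(start)} --> ${formatTimestamp(end)}\n${payload}`
}

console.log(buildKaraokeCue([
  { text: 'Hello', start: 1, end: 1.4 },
  { text: 'world', start: 1.5, end: 2 }
]))
```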

As for SSA/ASS and SRT support, I am still checking what the best approach is. I was considering converting them to VTT to keep the implementation consistent with how we show the transcriptions, but I'm not sure yet whether this is the best way.
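
For the SRT case, the conversion idea can be sketched in a few lines, since the two formats differ mainly in the `WEBVTT` header and the decimal separator of the timestamps. This is a rough illustration under simple assumptions (well-formed input, two-digit hours); `srtToVtt` is a hypothetical name, and SSA/ASS would need a real parser that this does not cover.

```javascript
// Convert SRT text to WebVTT text.
// The numeric SRT counters are kept, since WebVTT accepts them as cue identifiers.
function srtToVtt(srt) {
  const timing = /^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})(.*)$/
  const body = srt
    .replace(/^\uFEFF/, '') // strip a byte order mark, if present
    .replace(/\r\n?/g, '\n') // normalize line endings
    .trim()
    .split('\n')
    // Only timing lines change: "," becomes "." in both timestamps.
    .map((line) => line.replace(timing, '$1.$2 --> $3.$4$5'))
    .join('\n')
  return `WEBVTT\n\n${body}\n`
}
```

Converting on the server before responding would let the frontend keep using a single `<track>`-based code path.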

image

@barolo

barolo commented Jun 2, 2024

@mfcar I've used https://github.com/jianfch/stable-ts to generate ass/ssa karaoke style captions with custom style for my podcasts/books. I don't remember if vtt is one of the options.
Whisper.cpp can spit out word-level output too, but you have to process it with a script to get a valid subs file.

@ashwinm4friends

In the past, I have used stable-ts to create VTT files. I generated word-level timestamps with Whisper’s base.en model.

@Vito0912
Contributor

@mfcar Has there been any progress on this?
I would love to have a feature like this.


4 participants