This project extracts text transcriptions from video and audio recordings in the most popular formats.
- Supports transcription of the most popular video and audio formats
- Designed to handle long videos (up to several hours)
- Scalable architecture using Redis with Redis Queue (RQ) for job queuing and worker management
- Self-hosted solution using open-source Whisper AI model
- RESTful API for file upload and transcription status checking
- Separate worker processes for handling transcription tasks
- Everything is containerized using Docker and Docker Compose, including CUDA support setup
- Option to include word timestamps in transcriptions
The service consists of the following components:
- API Server: Handles incoming requests and manages the transcription queue.
- Redis: Acts as a message broker and job queue.
- Worker: Processes transcription jobs using the Whisper AI model.
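Under the hood, the API server enqueues transcription jobs on Redis via RQ and the workers pick them up. The following is a minimal sketch of that flow; the queue name and task function path are illustrative assumptions, not the exact identifiers used in this codebase:

```python
# Minimal RQ sketch -- the queue name ("transcriptions") and the task function
# path are illustrative assumptions, not the exact identifiers in this repo.
from redis import Redis
from rq import Queue

redis_conn = Redis(host="redis", port=6379, db=10)
queue = Queue("transcriptions", connection=redis_conn)

def enqueue_transcription(upload_path: str) -> str:
    """Enqueue a transcription job; the returned job id can be used for status polling."""
    job = queue.enqueue("transcription_service.worker.transcribe_file", upload_path)
    return job.id
```

A worker process subscribed to the same queue then pulls jobs off Redis and runs Whisper on them; in this project that role is played by the Worker component described above.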
- Docker and Docker Compose
- NVIDIA GPU with CUDA support (optional, for GPU acceleration)
- NVIDIA drivers and NVIDIA Container Runtime installed on the host system (optional, for GPU acceleration)
- Clone this repository:
git clone <org_path>/transcription_service.git
cd transcription_service
- The environment variables are already set in the docker-compose.yml file. If you need to modify any settings, you can do so directly in the compose file or by creating a .env file in the project root directory.
- Build and start the services using Docker Compose:
docker-compose up --build
Configuration of the different service components can be found in the Docker Compose file under docker/docker-compose.yml.
The following environment variables are configured in the docker-compose.yml file:
- REDIS_HOST: Hostname of the Redis server (set to redis unless one wants to use a different/external Redis server)
- REDIS_PORT: Port of the Redis server (set to 6379)
- REDIS_DB: Redis database number used for task/job orchestration (defaults to 10)
- UPLOADS_DIR: Directory for uploaded files (defaults to /app/uploads; accessible through the mounted volume)
- TRANSCRIPTIONS_DIR: Directory for storing transcriptions (set to /app/transcriptions; accessible through the mounted volume)
- WHISPER_MODEL_NAME: Whisper model to use (defaults to large-v3, the highest-quality option; see the alternative models below for resource-scarce scenarios)
- WHISPER_MODEL_DEVICE: Device to run the model on (set to cuda for GPU acceleration)
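For reference, a minimal sketch of how these variables might be consumed in Python; the project's actual settings module may be structured differently:

```python
# Illustrative settings sketch; defaults mirror the values documented above.
import os

REDIS_HOST = os.environ.get("REDIS_HOST", "redis")
REDIS_PORT = int(os.environ.get("REDIS_PORT", "6379"))
REDIS_DB = int(os.environ.get("REDIS_DB", "10"))
UPLOADS_DIR = os.environ.get("UPLOADS_DIR", "/app/uploads")
TRANSCRIPTIONS_DIR = os.environ.get("TRANSCRIPTIONS_DIR", "/app/transcriptions")
WHISPER_MODEL_NAME = os.environ.get("WHISPER_MODEL_NAME", "large-v3")
WHISPER_MODEL_DEVICE = os.environ.get("WHISPER_MODEL_DEVICE", "cuda")
```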
GPU support is enabled by default in the Docker Compose configuration. To use it:
- Ensure your host system has NVIDIA drivers and NVIDIA Container Runtime installed.
- The docker-compose.yml file already includes the necessary configuration:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [ gpu ]
runtime: nvidia
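To verify that the GPU is actually visible inside the worker container, a quick check like the following can help (it assumes PyTorch is available, which Whisper requires):

```python
# Sanity check for CUDA visibility; run it inside the worker container.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```

If this prints False, loading the model with WHISPER_MODEL_DEVICE set to cuda will fail, so double-check the NVIDIA drivers and NVIDIA Container Runtime setup on the host.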
To scale the number of worker processes horizontally:
- Use Docker Compose's --scale option:
docker-compose up --scale worker=3
This command starts 3 worker containers; adjust the number to whatever suits your setup.
- Alternatively, you can modify the docker-compose.yml file to include a deploy section for the worker service:
worker:
  # ... other configurations ...
  deploy:
    replicas: 3
Then run docker-compose up --build to apply the changes.
IMPORTANT: Make sure to adjust the number of workers based on the available resources on your host system – especially when using GPU acceleration.
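One way to gauge how many workers you need is to watch the backlog of pending jobs in Redis, for example with a small RQ snippet like this (the queue name is an illustrative assumption, as above):

```python
# Inspect the pending-job backlog; the queue name is an illustrative assumption.
from redis import Redis
from rq import Queue

queue = Queue("transcriptions", connection=Redis(host="localhost", port=6379, db=10))
print("Pending transcription jobs:", len(queue))
```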
By default, the service uses the large-v3 Whisper model, which requires approximately 10-11GB of GPU memory (VRAM).
You can choose a different model based on your hardware capabilities: smaller options such as tiny or base require far fewer resources and run quickly even on CPUs, at the cost of lower transcription quality.
For a full list of available models and their capabilities, visit: https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages
To change the model, update the WHISPER_MODEL_NAME environment variable in the docker-compose.yml file under the worker service.
Note: models are not embedded in the worker images; they are downloaded at runtime.
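For context, the worker presumably loads the configured model roughly like this (a sketch using the openai-whisper API; the actual worker code may differ):

```python
# Sketch of model loading with openai-whisper; not the project's exact worker code.
import os
import whisper

model = whisper.load_model(
    os.environ.get("WHISPER_MODEL_NAME", "large-v3"),
    device=os.environ.get("WHISPER_MODEL_DEVICE", "cuda"),
)
# word_timestamps=True corresponds to the "rich" transcription option.
result = model.transcribe("/app/uploads/example.mp4", word_timestamps=True)
print(result["text"])
```

The first load of a given model name triggers the runtime download mentioned above, so the first job after switching models will take noticeably longer.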
For local development follow the steps below:
- Clone this repository:
git clone <org_path>/transcription_service.git
cd transcription_service
- Create a virtual environment and install dependencies (Poetry must already be installed on your system):
poetry install
- Set up pre-commit hooks (pre-commit must already be installed on your system):
pre-commit install
- Start the service using Uvicorn:
uvicorn transcription_service.main:app --reload --env-file=.example.env
Workers can be started in a similar fashion, using the following command:
python -m transcription_service.worker
A Redis server should be running on localhost on the default port (6379).
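A quick way to confirm Redis is reachable before starting the API server and workers:

```python
# Should print True if Redis is up on localhost:6379.
from redis import Redis

print(Redis(host="localhost", port=6379).ping())
```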
To run the end-to-end transcription workflow, you can use the provided utility script:
python scripts/end_to_end_transcription.py
This script uploads a sample video file to the service, checks the transcription status, and downloads the transcription once it's ready.
Note: ensure the service is running before executing the script, and run the script from a virtual environment with all dependencies installed.
python scripts/end_to_end_transcription.py localhost:8001 https://www.youtube.com/watch?v=TX4s0X6FDcQ /path/to/local/sandbox/outputs/TX4s0X6FDcQ_transcription.txt --include-word-timestamps
The call above uploads the video from the provided YouTube link to the service running at localhost:8001 (both local file paths and YouTube links are supported, as the latter was a popular internal use case), checks the transcription status, and downloads the transcription to the specified path once it's ready.
The --include-word-timestamps flag switches the generated transcription to the "rich" format, which includes word timestamps.
All endpoints are documented using Swagger UI, which can be accessed at http://localhost:8001/docs.
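If you prefer to call the API directly instead of using the script, the flow looks roughly like this. Note that the endpoint paths and response fields below are placeholders; check the Swagger UI for the actual routes exposed by the service:

```python
# Placeholder endpoint paths and response fields -- consult http://localhost:8001/docs
# for the real routes exposed by this service.
import time
import requests

BASE_URL = "http://localhost:8001"

with open("sample_video.mp4", "rb") as f:
    upload = requests.post(f"{BASE_URL}/transcriptions", files={"file": f}).json()

while True:
    status = requests.get(f"{BASE_URL}/transcriptions/{upload['id']}/status").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(10)

if status["status"] == "completed":
    print(requests.get(f"{BASE_URL}/transcriptions/{upload['id']}").text)
```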
Future (possible) improvements:
- Add configurable file size restrictions for uploads and video lengths to manage system resources effectively.
- Implement user authentication and authorization for secure access to the API.
- Add support for speech-to-text models other than Whisper (Whisper was prioritized as the current state of the art).
Extracted from a larger system, this component was brought to you by 🐰 datarabbit.ai 🐰
It is licensed under the Apache License, Version 2.0, to allow flexibility of use and modification.
Do something cool with it! 🚀