Frame Extraction
- Uses OpenCV to extract frames from video
- Calculates frame differences to identify key moments
- Saves frames as JPEGs for LLM analysis
- Adaptive sampling based on video length and target frames per minute
Target Frame Calculation
- Calculates target frames based on video duration and frames_per_minute
- Respects optional max_frames limit
- Ensures at least 1 frame and no more than total video frames
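A minimal sketch of this calculation (the function and parameter names are illustrative, not taken from the actual implementation):

```python
from typing import Optional

def calculate_target_frames(duration_seconds: float,
                            total_frames: int,
                            frames_per_minute: int = 60,
                            max_frames: Optional[int] = None) -> int:
    # Scale the target with video length
    target = int(duration_seconds / 60 * frames_per_minute)
    # Respect the optional hard cap
    if max_frames is not None:
        target = min(target, max_frames)
    # Clamp: at least 1 frame, never more than the video contains
    return max(1, min(target, total_frames))
```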
Adaptive Sampling
- Uses sampling interval = total_frames / (target_frames * 2)
- Reduces processing load while maintaining coverage
- Samples more frequently than target to ensure enough candidates
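A rough sketch of the interval formula (names are illustrative); it amounts to sampling at twice the target rate:

```python
def sampling_interval(total_frames: int, target_frames: int) -> int:
    # Twice the target rate so there are enough candidates to rank later
    return max(1, total_frames // (target_frames * 2))
```

Only every interval-th frame is decoded and scored, which is what keeps processing load down.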
Frame Difference Analysis
- Converts frames to grayscale for efficient comparison
- Uses OpenCV's absdiff to calculate absolute difference
- Compares against FRAME_DIFFERENCE_THRESHOLD (default 10.0)
- Stores frame number, image data, and difference score
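A hedged sketch of the scoring step, assuming standard OpenCV BGR frames (the helper name is illustrative):

```python
import cv2
import numpy as np

FRAME_DIFFERENCE_THRESHOLD = 10.0  # default value

def frame_difference(prev_frame: np.ndarray, frame: np.ndarray) -> float:
    # Grayscale conversion keeps the comparison cheap
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Mean absolute pixel difference between consecutive sampled frames
    return float(cv2.absdiff(prev_gray, gray).mean())
```

A sampled frame becomes a candidate when its score exceeds FRAME_DIFFERENCE_THRESHOLD; the frame number, image data, and score are stored together.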
Final Selection Process
- Selects frames with highest difference scores
- Takes top N frames based on target frame count
- If max_frames specified, samples evenly across selected frames
- Ensures most significant changes are captured
Known limitations:
- Frames between sampling intervals may be missed
- Rapid sequences may only have one frame selected
- High-scoring frames may be excluded if outranked by others
- Even sampling with max_frames may skip some significant changes
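A sketch of the selection logic, which also illustrates where the limitations above come from (names and the candidate tuple shape are assumptions based on the description above):

```python
from typing import Any, List, Optional, Tuple

Candidate = Tuple[int, Any, float]  # (frame_number, image_data, difference_score)

def select_frames(candidates: List[Candidate],
                  target_frames: int,
                  max_frames: Optional[int] = None) -> List[Candidate]:
    # Keep the top N frames by difference score, then restore chronological order
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    selected = sorted(ranked[:target_frames], key=lambda c: c[0])
    # With a hard cap, sample evenly across the selection
    if max_frames is not None and len(selected) > max_frames:
        step = len(selected) / max_frames
        selected = [selected[int(i * step)] for i in range(max_frames)]
    return selected
```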
Audio Processing
- Extracts audio using FFmpeg
- Uses Whisper for transcription
- Handles poor quality audio by checking confidence scores
- Segments audio for better context in final analysis
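A minimal sketch of that pipeline using the FFmpeg CLI and the openai-whisper package (file paths, model size, and the helper name are assumptions, not the project's actual code):

```python
import subprocess
import whisper  # openai-whisper

def transcribe_audio(video_path: str, wav_path: str = "audio.wav",
                     model_size: str = "base") -> dict:
    # Extract a mono 16 kHz WAV track, which is what Whisper works with
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    model = whisper.load_model(model_size)
    result = model.transcribe(wav_path)
    # result["segments"] carries per-segment text and timestamps plus
    # avg_logprob / no_speech_prob, which can serve as confidence checks
    return result
```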
Frame Analysis
- Each frame is analyzed independently using vision LLM
- Uses frame_analysis.txt prompt to guide LLM analysis
- Captures timestamp, visual elements, and actions
- Maintains chronological order for narrative flow
Video Reconstruction
- Combines frame analyses chronologically
- Integrates audio transcript if available
- Uses the video_reconstruction.txt prompt to create a technical description
- Uses narrate_storyteller.txt to transform it into an engaging narrative
```python
import base64
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional

class LLMClient(ABC):
    def encode_image(self, image_path: str) -> str:
        # Common base64 encoding shared by all clients
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')

    @abstractmethod
    def generate(self,
                 prompt: str,
                 image_path: Optional[str] = None,
                 stream: bool = False,
                 model: str = "llama3.2-vision",
                 temperature: float = 0.2,
                 num_predict: int = 256) -> Dict[Any, Any]:
        pass
```
Ollama (ollama.py)
- Uses local Ollama API
- Sends images as base64 in "images" array
- Returns raw response from Ollama
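Building on the LLMClient base class above, a client for the local Ollama /api/generate endpoint could look roughly like this (a sketch, not the project's actual ollama.py):

```python
import requests

class OllamaClient(LLMClient):
    def __init__(self, url: str = "http://localhost:11434"):
        self.url = url

    def generate(self, prompt, image_path=None, stream=False,
                 model="llama3.2-vision", temperature=0.2, num_predict=256):
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream,
            "options": {"temperature": temperature, "num_predict": num_predict},
        }
        if image_path:
            # Ollama expects base64-encoded images in an "images" array
            payload["images"] = [self.encode_image(image_path)]
        response = requests.post(f"{self.url}/api/generate", json=payload)
        response.raise_for_status()
        return response.json()  # raw Ollama response
```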
Generic OpenAI API (generic_openai_api.py)
- Compatible with OpenAI-style APIs (OpenAI, OpenRouter, etc.)
- Configurable API URL (e.g. OpenRouter: https://openrouter.ai/api/v1, OpenAI: https://api.openai.com/v1)
- Sends images as content array with type "image_url"
- Requires API key and service URL
- Returns standardized response format
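For OpenAI-style APIs, the image travels as a data URL inside an image_url content part. A hedged sketch of the request (the function name is illustrative; /chat/completions is the standard OpenAI-compatible endpoint):

```python
import requests

def openai_style_generate(prompt: str, image_b64: str, api_url: str,
                          api_key: str, model: str) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    messages = [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }]
    response = requests.post(f"{api_url}/chat/completions", headers=headers,
                             json={"model": model, "messages": messages})
    response.raise_for_status()
    return response.json()
```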
Uses cascade priority (highest priority first):
- Command line args
- User config (config.json)
- Default config (default_config.json)
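A rough sketch of how such a cascade can be resolved (the merge helper is illustrative, not the project's actual loader):

```python
import json

def deep_merge(base: dict, override: dict) -> None:
    # Recursively overlay override onto base without dropping sibling keys
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value

def load_config(cli_overrides: dict) -> dict:
    with open("default_config.json") as f:
        config = json.load(f)
    try:
        with open("config.json") as f:
            deep_merge(config, json.load(f))
    except FileNotFoundError:
        pass
    deep_merge(config, cli_overrides)  # command line arguments win last
    return config
```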
Key configuration groups:
```json
{
  "clients": {
    "default": "ollama",
    "ollama": {
      "url": "http://localhost:11434",
      "model": "llama3.2-vision"
    },
    "openai_api": {
      "api_key": "",
      "api_url": "https://openrouter.ai/api/v1",
      "model": "meta-llama/llama-3.2-11b-vision-instruct:free"
    }
  },
  "frames": {
    "per_minute": 60,
    "analysis_threshold": 10.0,
    "min_difference": 5.0,
    "max_count": 30
  }
}
```
Two key prompts:
frame_analysis.txt
- Analyzes single frame
- Includes timestamp context
- Focuses on visual elements and actions
- Supports user questions through {prompt} token
describe.txt
- Combines frame analyses
- Uses 1 frame
- Integrates transcript
- Creates a description of the video based on all prior frame analyses
- Supports user questions through {prompt} token
Both prompts support user questions via the --prompt flag. When a question is provided, it is prefixed with "I want to know" and injected into the prompts using the {prompt} token. This allows users to ask specific questions about the video that guide both the frame analysis and final description.
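A minimal sketch of that substitution (the helper name is illustrative):

```python
from typing import Optional

def inject_question(template: str, user_question: Optional[str]) -> str:
    # The --prompt value is prefixed and substituted for the {prompt} token;
    # without a question the token is simply removed
    question = f"I want to know {user_question}" if user_question else ""
    return template.replace("{prompt}", question)
```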
The prompt loading system supports flexible prompt file locations and custom prompts:
Path Resolution:
- User-specified directory (via config prompt_dir):
  - Absolute paths: /path/to/prompts
  - User home paths: ~/prompts
  - Relative paths, checked against:
    - Current working directory
    - Package root directory
- Package resources (fallback)
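A sketch of that resolution order (the package layout and function name are assumptions):

```python
from pathlib import Path

PACKAGE_ROOT = Path(__file__).resolve().parent  # assumed package root

def resolve_prompt_path(prompt_dir: str, relative_path: str) -> Path:
    candidates = []
    if prompt_dir:
        base = Path(prompt_dir).expanduser()  # handles ~/prompts
        if base.is_absolute():
            candidates.append(base / relative_path)
        else:
            # Relative prompt_dir is checked against the CWD and the package root
            candidates.append(Path.cwd() / base / relative_path)
            candidates.append(PACKAGE_ROOT / base / relative_path)
    # Prompts bundled with the package are the fallback
    candidates.append(PACKAGE_ROOT / "prompts" / relative_path)
    for candidate in candidates:
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"Prompt not found: {relative_path}")
```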
Development Workflow:
- Install in dev mode: pip install -e .
- Modify prompt files directly
- Changes reflect immediately without reinstall
- Works from any directory
Configuration:
```json
{
  "prompt_dir": "/absolute/path/to/prompts",  // Absolute path
  // or "~/prompts"   // User home directory
  // or "prompts"     // Relative path
  // or ""            // Use package prompts only
  "prompts": [
    {
      "name": "Frame Analysis",
      "path": "frame_analysis/frame_analysis.txt"
    },
    {
      "name": "Video Reconstruction",
      "path": "frame_analysis/describe.txt"
    }
  ]
}
```
The system prioritizes user-specified prompts over package prompts, enabling customization while maintaining reliable fallbacks.
Frame Analysis Failures
- Ollama: Check if service is running and model is loaded
- OpenRouter: Verify API key and check response format
- Both: Ensure image encoding is correct for each API
Memory Usage
- Adjust frames_per_minute based on video length
- Clean up frames after analysis
- Use appropriate Whisper model size
Poor Analysis Quality
- Check frame extraction threshold
- Verify prompt templates
- Ensure correct model is being used
New LLM Provider
- Inherit from LLMClient
- Implement correct image format for API
- Add client config to default_config.json
- Update create_client() in video_analyzer.py
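The shape of a new client, following the LLMClient interface shown earlier (the provider name and request details are placeholders):

```python
class MyProviderClient(LLMClient):
    def __init__(self, api_key: str, api_url: str):
        self.api_key = api_key
        self.api_url = api_url

    def generate(self, prompt, image_path=None, stream=False,
                 model="some-vision-model", temperature=0.2, num_predict=256):
        image_b64 = self.encode_image(image_path) if image_path else None
        # Build the request in the image format this provider expects,
        # send it, and normalize the response before returning
        raise NotImplementedError
```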
Custom Analysis
- Add new prompt template
- Update VideoAnalyzer methods
- Modify output format in results