Releases: pipecat-ai/pipecat

v0.0.58

26 Feb 19:34
11383a8

Added

  • Added track-specific audio event on_track_audio_data to AudioBufferProcessor for accessing separate input and output audio tracks.
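
A minimal sketch of registering the new event. The handler signature (separate user and bot tracks plus the audio format) is assumed from the existing on_audio_data pattern, and save_track() is a hypothetical helper:

from pipecat.processors.audio.audio_buffer_processor import AudioBufferProcessor

audiobuffer = AudioBufferProcessor()

@audiobuffer.event_handler("on_track_audio_data")
async def on_track_audio_data(processor, user_audio, bot_audio, sample_rate, num_channels):
    # Separate input (user) and output (bot) tracks; signature assumed.
    await save_track("user.wav", user_audio, sample_rate, num_channels)
    await save_track("bot.wav", bot_audio, sample_rate, num_channels)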

  • Pipecat version will now be logged on every application startup. This will help us identify what version we are running in case of any issues.

  • Added a new StopFrame which can be used to stop a pipeline task while keeping the frame processors running, so they can be reused in a different pipeline. The difference between StopFrame and StopTaskFrame is that, as with EndFrame and EndTaskFrame, the StopFrame is pushed from the task while the StopTaskFrame is pushed upstream inside the pipeline by any processor.
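
For illustration, a hedged sketch assuming StopFrame is queued on a running PipelineTask (here, task) the same way as EndFrame:

from pipecat.frames.frames import StopFrame

# Stops the pipeline task; frame processors keep running and can be
# reused in a different pipeline afterwards (assumed usage).
await task.queue_frame(StopFrame())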

  • Added a new PipelineTask parameter observers that replaces the previous PipelineParams.observers.

  • Added a new PipelineTask parameter check_dangling_tasks to enable or disable checking for frame processors' dangling tasks when the Pipeline finishes running.

  • Added new on_completion_timeout event for LLM services (all OpenAI-based services, Anthropic and Google). Note that this event will only get triggered if LLM timeouts are set up and the timeout was reached. It can be useful to retrigger another completion and see if the timeout was just a blip.
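
A hedged sketch of reacting to the new event, assuming llm is an LLM service instance; the handler signature is an assumption:

@llm.event_handler("on_completion_timeout")
async def on_completion_timeout(service):
    # Only fires if LLM timeouts are configured and one is reached; a
    # retry here can check whether the timeout was just a blip.
    ...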

  • Added new log observers LLMLogObserver and TranscriptionLogObserver that can be useful for debugging your pipelines.
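
Putting the new PipelineTask parameters together with a log observer; the observer module path is an assumption, and keyword arguments are required as noted under Changed below:

from pipecat.observers.loggers.llm_log_observer import LLMLogObserver  # path assumed
from pipecat.pipeline.task import PipelineParams, PipelineTask

task = PipelineTask(
    pipeline,  # only the pipeline can be passed positionally now
    params=PipelineParams(),
    observers=[LLMLogObserver()],  # replaces PipelineParams.observers
    check_dangling_tasks=True,     # warn about frame processors' dangling tasks
)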

  • Added room_url property to DailyTransport.

  • Added addons argument to DeepgramSTTService.

  • Added exponential_backoff_time() to utils.network module.

Changed

  • ⚠️ PipelineTask now requires keyword arguments (except for the first one for the pipeline).

  • Updated PlayHTHttpTTSService to take a voice_engine and protocol input in the constructor. The previous method of providing a voice_engine input that contains the engine and protocol is deprecated by PlayHT.
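
A sketch of the new constructor shape; the credential parameter names are assumptions, while voice_engine and protocol follow this change:

tts = PlayHTHttpTTSService(
    api_key="...",                # assumed parameter name
    user_id="...",                # assumed parameter name
    voice_engine="Play3.0-mini",  # engine only, no protocol suffix
    protocol="http",              # protocol is now passed separately
)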

  • The base TTSService class now strips leading newlines before sending text to the TTS provider. This change is to solve issues where some TTS providers, like Azure, would not output text due to newlines.

  • GrokLLMService now uses grok-2 as the default model.

  • AnthropicLLMService now uses claude-3-7-sonnet-20250219 as the default model.

  • RimeHttpTTSService now needs an aiohttp.ClientSession to be passed to the constructor, like all the other HTTP-based services.

  • RimeHttpTTSService doesn't use a default voice anymore.

  • DeepgramSTTService now uses the new nova-3 model by default. If you want to use the previous model you can pass LiveOptions(model="nova-2-general").
    (see https://deepgram.com/learn/introducing-nova-3-speech-to-text-api)

stt = DeepgramSTTService(..., live_options=LiveOptions(model="nova-2-general"))

Deprecated

  • PipelineParams.observers is now deprecated; use the new PipelineTask parameter observers instead.

Removed

  • Removed TransportParams.audio_out_is_live since it was not being used at all.

Fixed

  • Fixed a GoogleLLMService issue that was causing an exception when sending inline audio in some cases.

  • Fixed an AudioContextWordTTSService issue that would cause an EndFrame to disconnect from the TTS service before audio from all the contexts was received. This affected services like Cartesia and Rime.

  • Fixed an issue that prevented passing an OpenAILLMContext when creating GoogleLLMService's context aggregators.

  • Fixed an ElevenLabsTTSService, FishAudioTTSService, LMNTTTSService and PlayHTTTSService issue that resulted in audio requested before an interruption being played after the interruption.

  • Fixed match_endofsentence support for ellipses.

  • Fixed an issue that would cause undesired interruptions via EmulateUserStartedSpeakingFrame when only interim transcriptions (i.e. no final transcriptions) were received.

  • Fixed an issue where EndTaskFrame was not triggering on_client_disconnected or closing the WebSocket in FastAPI.

  • Fixed an issue in DeepgramSTTService where the sample_rate passed to the LiveOptions was not being used, causing the service to use the pipeline's default sample rate.

  • Fixed a context aggregator issue that would not append the LLM text response to the context if a function call happened in the same LLM turn.

  • Fixed an issue that was causing HTTP TTS services to push TTSStoppedFrame more than once.

  • Fixed a FishAudioTTSService issue where TTSStoppedFrame was not being pushed.

  • Fixed an issue where start_callback was not invoked for some LLM services.

  • Fixed an issue that would cause DeepgramSTTService to stop working after an error occurred (e.g. sudden network loss); the service would not reconnect once the network recovered.

  • Fixed an STTMuteFilter issue that would not mute user audio frames, causing transcriptions to be generated by the STT service.

Other

  • Added Gemini support to examples/phone-chatbot.

  • Added foundational example 34-audio-recording.py showing how to use the AudioBufferProcessor callbacks to save merged and track recordings.

v0.0.57

15 Feb 02:58
b45f7fe

Added

  • Added new AudioContextWordTTSService. This is a TTS base class for TTS services that handle multiple separate audio requests.

  • Added new frames EmulateUserStartedSpeakingFrame and EmulateUserStoppedSpeakingFrame which can be used to emulate VAD behavior when VAD is not present or is not triggered.

  • Added a new audio_in_stream_on_start field to TransportParams.

  • Added a new method start_audio_in_streaming in the BaseInputTransport.

    • This method should be used to start receiving the input audio in case the field audio_in_stream_on_start is set to false.
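
A sketch of delayed audio streaming with a Daily transport; the event used to trigger streaming is illustrative:

transport = DailyTransport(
    room_url,
    token,
    "bot",
    params=DailyParams(audio_in_enabled=True, audio_in_stream_on_start=False),
)

@transport.event_handler("on_first_participant_joined")
async def on_first_participant_joined(transport, participant):
    # Start receiving the input audio only now.
    await transport.input().start_audio_in_streaming()
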
  • Added support for the RTVIProcessor to handle buffered audio in base64 format, converting it into InputAudioRawFrame for transport.

  • Added support for the RTVIProcessor to trigger start_audio_in_streaming only after the client-ready message.

  • Added new MUTE_UNTIL_FIRST_BOT_COMPLETE strategy to STTMuteStrategy. This strategy starts muted and remains muted until the first bot speech completes, ensuring the bot's first response cannot be interrupted. This complements the existing FIRST_SPEECH strategy which only mutes during the first detected bot speech.
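
A sketch of configuring the new strategy; the filter module path is an assumption:

from pipecat.processors.filters.stt_mute_filter import (  # path assumed
    STTMuteConfig,
    STTMuteFilter,
    STTMuteStrategy,
)

# Stay muted until the bot's first speech completes.
stt_mute = STTMuteFilter(
    config=STTMuteConfig(strategies={STTMuteStrategy.MUTE_UNTIL_FIRST_BOT_COMPLETE})
)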

  • Added support for Google Cloud Speech-to-Text V2 through GoogleSTTService.

  • Added RimeTTSService, a new WordTTSService. Updated the foundational example 07q-interruptible-rime.py to use RimeTTSService.

  • Added support for Groq's Whisper API through the new GroqSTTService and OpenAI's Whisper API through the new OpenAISTTService. Introduced a new base class BaseWhisperSTTService to handle common Whisper API functionality.

  • Added PerplexityLLMService for Perplexity NIM API integration, with an OpenAI-compatible interface. Also, added foundational example 14n-function-calling-perplexity.py.

  • Added DailyTransport.update_remote_participants(). This allows you to update remote participant's settings, like their permissions or which of their devices are enabled. Requires that the local participant have participant admin permission.

Changed

  • A colon : is no longer considered an end of sentence.

  • Updated DailyTransport to respect the audio_in_stream_on_start field, ensuring it only starts receiving the audio input if it is enabled.

  • Updated FastAPIWebsocketOutputTransport to send TransportMessageFrame and TransportMessageUrgentFrame to the serializer.

  • Updated WebsocketServerOutputTransport to send TransportMessageFrame and TransportMessageUrgentFrame to the serializer.

  • Enhanced STTMuteConfig to validate strategy combinations, preventing MUTE_UNTIL_FIRST_BOT_COMPLETE and FIRST_SPEECH from being used together as they handle first bot speech differently.

  • Updated foundational example 07n-interruptible-google.py to use all Google services.

  • RimeHttpTTSService now uses the mistv2 model by default.

  • Improved error handling in AzureTTSService to properly detect and log synthesis cancellation errors.

  • Enhanced WhisperSTTService with full language support and improved model documentation.

  • Updated foundation example 14f-function-calling-groq.py to use GroqSTTService for transcription.

  • Updated GroqLLMService to use llama-3.3-70b-versatile as the default model.

  • RTVIObserver doesn't handle LLMSearchResponseFrame frames anymore. For now, to handle those frames you need to create a GoogleRTVIObserver instead.

Deprecated

  • STTMuteFilter constructor's stt_service parameter is now deprecated and will be removed in a future version. The filter now manages mute state internally instead of querying the STT service.

  • RTVI.observer() is now deprecated, instantiate an RTVIObserver directly instead.

  • All RTVI frame processors (e.g. RTVISpeakingProcessor, RTVIBotLLMProcessor) are now deprecated, instantiate an RTVIObserver instead.

Fixed

  • Fixed a FalImageGenService issue that was causing the event loop to be blocked while loading the downloaded image.

  • Fixed a CartesiaTTSService service issue that would cause audio overlapping in some cases.

  • Fixed a websocket-based service issue (e.g. CartesiaTTSService) that was preventing a reconnection after the server disconnected cleanly, causing an infinite loop instead.

  • Fixed a BaseOutputTransport issue that was causing upstream frames to not be pushed upstream.

  • Fixed multiple issues where user transcriptions were not being handled properly. It was possible for short utterances to not trigger VAD, which would cause user transcriptions to be ignored. It was also possible for one or more transcriptions to be generated after VAD, in which case they would also be ignored.

  • Fixed an issue that was causing BotStoppedSpeakingFrame to be generated too late. This could then cause issues unblocking STTMuteFilter later than desired.

  • Fixed an issue that was causing AudioBufferProcessor to not record synchronized audio.

  • Fixed an RTVI issue that was causing bot-tts-text messages to be sent before being processed by the output transport.

  • Fixed an issue (#1192) in ElevenLabsTTSService where we were trying to reconnect/disconnect the websocket connection even when the connection was already closed.

  • Fixed an issue where has_regular_messages condition was always true in GoogleLLMContext due to Part having function_call & function_response with None values.

Other

  • Added new instant-voice example. This example showcases how to enable instant voice communication as soon as a user connects.

  • Added new local-input-select-stt example. This example allows you to play with local audio inputs by selecting them through a nice text interface.

v0.0.56

06 Feb 21:53
d4b2160

Changed

  • Use gemini-2.0-flash-001 as the default model for GoogleLLMService.

  • Improved foundational examples 22b, 22c, and 22d to support function calling. With these base examples, FunctionCallInProgressFrame and FunctionCallResultFrame will no longer be blocked by the gates.

Fixed

  • Fixed TkLocalTransport and LocalAudioTransport issues that were causing errors on cleanup.

  • Fixed an issue that was causing tests.utils import to fail because of logging setup.

  • Fixed a SentryMetrics issue that was preventing any metrics from being sent to Sentry and also preventing metrics frames from being pushed to the pipeline.

  • Fixed an issue in BaseOutputTransport where incoming audio would not be resampled to the desired output sample rate.

  • Fixed an issue with the TwilioFrameSerializer and TelnyxFrameSerializer where twilio_sample_rate and telnyx_sample_rate were incorrectly initialized to audio_in_sample_rate. Those values currently default to 8000 and should be set manually from the serializer constructor if a different value is needed.

Other

  • Added a new sentry-metrics example.

v0.0.55

05 Feb 19:40
99d3227

Added

  • Added a new start_metadata field to PipelineParams. The provided metadata will be set on the initial StartFrame being pushed from the PipelineTask.

  • Added new fields to PipelineParams to control audio input and output sample rates for the whole pipeline. This allows controlling sample rates from a single place instead of having to specify sample rates in each service. Setting a sample rate on a service is still possible and will override the value from PipelineParams.
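
For example, a pipeline-wide 8000 Hz telephony setup (field names assumed to mirror the transport's audio_in/audio_out naming):

task = PipelineTask(
    pipeline,
    PipelineParams(
        audio_in_sample_rate=8000,   # applies to the whole pipeline
        audio_out_sample_rate=8000,  # individual services can still override
    ),
)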

  • Introduced audio resamplers (BaseAudioResampler). This is just a base class to implement audio resamplers. Currently, two implementations are provided: SOXRAudioResampler and ResampyResampler. A new create_default_resampler() has been added (replacing the now-deprecated resample_audio()).
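
A sketch of the new resampler API, assuming the module path and that the resample call is async, taking raw audio bytes plus input and output rates:

from pipecat.audio.utils import create_default_resampler  # module path assumed

resampler = create_default_resampler()
# Resample raw 24000 Hz audio down to 16000 Hz; signature assumed.
resampled = await resampler.resample(audio, 24000, 16000)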

  • It is now possible to specify the asyncio event loop that a PipelineTask and all the processors should run on by passing it as a new argument to the PipelineRunner. This could allow running pipelines in multiple threads, each one with its own event loop.
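
A sketch of running a pipeline on a dedicated event loop in its own thread; the loop parameter name is an assumption:

import asyncio
import threading

def run_pipeline(task):
    loop = asyncio.new_event_loop()
    runner = PipelineRunner(loop=loop)  # parameter name assumed
    loop.run_until_complete(runner.run(task))

threading.Thread(target=run_pipeline, args=(task,)).start()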

  • Added a new utils.TaskManager. Instead of a global task manager we now have a task manager per PipelineTask. In the previous version the task manager was global, so running multiple simultaneous PipelineTasks could result in dangling task warnings which were not actually true. In order for all the processors to know about the task manager, we pass it through the StartFrame. This means that processors should create tasks when they receive a StartFrame but not before (because they don't have a task manager yet).

  • Added TelnyxFrameSerializer to support Telnyx calls. A full running example has also been added to examples/telnyx-chatbot.

  • Allow pushing silence audio frames before TTSStoppedFrame. This might be useful for testing purposes, for example, passing bot audio to an STT service which usually needs additional audio data to detect the utterance stopped.

  • TwilioFrameSerializer now supports transport message frames. With this we can create Twilio emulators.

  • Added a new transport: WebsocketClientTransport.

  • Added a metadata field to Frame which makes it possible to pass custom data to all frames.

  • Added test/utils.py inside the pipecat package.

Changed

  • GatedOpenAILLMContextAggregator now requires keyword arguments. Also, a new start_open argument has been added to set the initial state of the gate.

  • Added organization and project level authentication to OpenAILLMService.

  • Improved the language checking logic in ElevenLabsTTSService and ElevenLabsHttpTTSService to properly handle language codes based on model compatibility, with appropriate warnings when language codes cannot be applied.

  • Updated GoogleLLMContext to support pushing LLMMessagesUpdateFrames that contain a combination of function calls, function call responses, system messages, or just messages.

  • InputDTMFFrame is now based on DTMFFrame. There's also a new OutputDTMFFrame frame.

Deprecated

  • resample_audio() is now deprecated, use create_default_resampler() instead.

Removed

  • AudioBufferProcessor.reset_audio_buffers() has been removed, use AudioBufferProcessor.start_recording() and AudioBufferProcessor.stop_recording() instead.

Fixed

  • Fixed an AudioBufferProcessor issue that would cause crackling in some recordings.

  • Fixed an issue in AudioBufferProcessor where the user callback would not be called on task cancellation.

  • Fixed an issue in AudioBufferProcessor that would cause wrong silence padding in some cases.

  • Fixed an issue where ElevenLabsTTSService messages would return a 1009 websocket error by increasing the max message size limit to 16MB.

  • Fixed a DailyTransport issue that would cause events to be triggered before join finished.

  • Fixed a PipelineTask issue that was preventing processors from being cleaned up after cancelling the task.

  • Fixed an issue where queuing a CancelFrame to a pipeline task would not cause the task to finish. However, using PipelineTask.cancel() is still the recommended way to cancel a task.

Other

  • Improved the unit test helper run_test() to use PipelineTask and PipelineRunner. There's now also some control around StartFrame and EndFrame. The EndTaskFrame has been removed since it doesn't seem necessary with this new approach.

  • Updated twilio-chatbot with a few new features: it now uses an 8000 sample rate to avoid resampling, and includes a new client useful for stress testing and for testing locally without the need to make phone calls. Also, audio recording has been added on both the client and the server to make sure the audio sounds good.

  • Updated examples to use task.cancel() to immediately exit the example when a participant leaves or disconnects, instead of pushing an EndFrame. Pushing an EndFrame causes the bot to run through everything that is internally queued (which could take some seconds). Note that using task.cancel() might not always be the best option and pushing an EndFrame could still be desirable to make sure all the pipeline is flushed.

v0.0.54

27 Jan 22:59
cd5075e

Added

  • In order to create tasks in Pipecat frame processors it is now recommended to use FrameProcessor.create_task() (which uses the new utils.asyncio.create_task()). It takes care of uncaught exceptions, task cancellation handling and task management. To cancel or wait for a task there are FrameProcessor.cancel_task() and FrameProcessor.wait_for_task(). All of Pipecat's processors have been updated accordingly. Also, when a pipeline runner finishes, a warning about dangling tasks might appear, which indicates that some of the created tasks were never cancelled or awaited (using these new functions).
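
A sketch of the recommended pattern inside a custom processor; the background job is illustrative:

from pipecat.frames.frames import CancelFrame, EndFrame, StartFrame
from pipecat.processors.frame_processor import FrameProcessor

class MyProcessor(FrameProcessor):
    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)
        if isinstance(frame, StartFrame):
            # Managed task: uncaught exceptions and cancellation are handled.
            self._job = self.create_task(self._run_job())
        elif isinstance(frame, (CancelFrame, EndFrame)):
            await self.cancel_task(self._job)
        await self.push_frame(frame, direction)

    async def _run_job(self):
        ...  # illustrative background work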

  • It is now possible to specify the period of the PipelineTask heartbeat frames with heartbeats_period_secs.

  • Added DailyMeetingTokenProperties and DailyMeetingTokenParams Pydantic models for meeting token creation in get_token method of DailyRESTHelper.

  • Added enable_recording and geo parameters to DailyRoomProperties.

  • Added RecordingsBucketConfig to DailyRoomProperties to upload recordings to a custom AWS bucket.

Changed

  • Enhanced UserIdleProcessor with retry functionality and control over idle monitoring via new callback signature (processor, retry_count) -> bool. Updated the 17-detect-user-idle.py to show how to use the retry_count.
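
A sketch of the new callback signature; the frame choice and parameter values are illustrative, and the processor module path is an assumption:

from pipecat.frames.frames import TTSSpeakFrame
from pipecat.processors.user_idle_processor import UserIdleProcessor  # path assumed

async def handle_idle(processor: UserIdleProcessor, retry_count: int) -> bool:
    if retry_count < 3:
        await processor.push_frame(TTSSpeakFrame("Are you still there?"))
        return True   # keep monitoring for idleness
    return False      # stop idle monitoring

user_idle = UserIdleProcessor(callback=handle_idle, timeout=5.0)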

  • Added defensive error handling for OpenAIRealtimeBetaLLMService's audio truncation. Audio truncation errors during interruptions now log a warning and allow the session to continue instead of throwing an exception.

  • Modified TranscriptProcessor to use TTS text frames for more accurate assistant transcripts. Assistant messages are now aggregated based on bot speaking boundaries rather than LLM context, providing better handling of interruptions and partial utterances.

  • Updated foundational examples 28a-transcription-processor-openai.py, 28b-transcript-processor-anthropic.py, and 28c-transcription-processor-gemini.py to use the updated TranscriptProcessor.

Fixed

  • Fixed a GeminiMultimodalLiveLLMService issue that was preventing the user from pushing initial LLM assistant messages (using LLMMessagesAppendFrame).

  • Added missing FrameProcessor.cleanup() calls to Pipeline, ParallelPipeline and UserIdleProcessor.

  • Fixed a type error when using voice_settings in ElevenLabsHttpTTSService.

  • Fixed an issue where OpenAIRealtimeBetaLLMService function calling resulted in an error.

  • Fixed an issue in AudioBufferProcessor where the last audio buffer was not being processed, in cases where the _user_audio_buffer was smaller than the buffer size.

Performance

  • Replaced audio resampling library resampy with soxr. Resampling a 2:21s audio file from 24 kHz to 16 kHz took 1.41s with resampy and 0.031s with soxr, with similar audio quality.

Other

  • Added initial unit test infrastructure.

v0.0.53

18 Jan 22:51
a169e0c

Added

  • Added ElevenLabsHttpTTSService which uses ElevenLabs' HTTP API instead of the websocket one.

  • Introduced pipeline frame observers. Observers can view all the frames that go through the pipeline without the need to inject processors in the pipeline. This can be useful, for example, to implement frame loggers or debuggers among other things. The example examples/foundational/30-observer.py shows how to add an observer to a pipeline for debugging.
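
A sketch of a frame-logging observer; the on_push_frame signature and base class module path are assumptions, and observers are passed via PipelineParams here (the dedicated PipelineTask parameter arrives in v0.0.58):

from pipecat.observers.base_observer import BaseObserver  # path assumed

class DebugObserver(BaseObserver):
    async def on_push_frame(self, src, dst, frame, direction, timestamp):
        # Called for every frame pushed between two processors.
        print(f"{src} -> {dst}: {frame} @ {timestamp}")

task = PipelineTask(pipeline, PipelineParams(observers=[DebugObserver()]))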

  • Introduced heartbeat frames. The pipeline task can now push periodic heartbeats down the pipeline when enable_heartbeats=True. Heartbeats are system frames that are supposed to make it all the way to the end of the pipeline. When a heartbeat frame is received, the traversal time (i.e. the time it took to go through the whole pipeline) will be displayed (with TRACE logging); otherwise a warning will be shown. The example examples/foundational/31-heartbeats.py shows how to enable heartbeats and forces warnings to be displayed.
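
Enabling heartbeats is then a single parameter on the pipeline task:

task = PipelineTask(pipeline, PipelineParams(enable_heartbeats=True))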

  • Added LLMTextFrame and TTSTextFrame which should be pushed by LLM and TTS services respectively instead of TextFrames.

  • Added OpenRouterLLMService for OpenRouter integration with an OpenAI-compatible interface. Added foundational example 14m-function-calling-openrouter.py.

  • Added a new WebsocketService based class for TTS services, containing base functions and retry logic.

  • Added DeepSeekLLMService for DeepSeek integration with an OpenAI-compatible interface. Added foundational example 14l-function-calling-deepseek.py.

  • Added FunctionCallResultProperties dataclass to provide a structured way to control function call behavior (see the sketch after this list), including:

    • run_llm: Controls whether to trigger LLM completion
    • on_context_updated: Optional callback triggered after context update
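
A sketch of a function call handler using the new dataclass; the handler signature follows Pipecat's function calling convention, the module path is an assumption, and the weather payload is illustrative:

from pipecat.frames.frames import FunctionCallResultProperties  # path assumed

async def fetch_weather(function_name, tool_call_id, args, llm, context, result_callback):
    properties = FunctionCallResultProperties(
        run_llm=False,            # don't trigger an LLM completion for this result
        on_context_updated=None,  # optional callback after the context update
    )
    await result_callback({"conditions": "sunny"}, properties=properties)
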
  • Added a new foundational example 07e-interruptible-playht-http.py for easy testing of PlayHTHttpTTSService.

  • Added support for Google TTS Journey voices in GoogleTTSService.

  • Added 29-livekit-audio-chat.py as a new foundational example for LiveKitTransportLayer.

  • Added enable_prejoin_ui, max_participants and start_video_off params to DailyRoomProperties.

  • Added session_timeout to FastAPIWebsocketTransport and WebsocketServerTransport for configuring session timeouts (in seconds). Triggers on_session_timeout for custom timeout handling.
    See examples/websocket-server/bot.py.
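
A sketch with assumed parameter values; see examples/websocket-server/bot.py for the full version:

transport = FastAPIWebsocketTransport(
    websocket=websocket,
    params=FastAPIWebsocketParams(session_timeout=180),  # seconds
)

@transport.event_handler("on_session_timeout")
async def on_session_timeout(transport, client):
    # Custom timeout handling, e.g. say goodbye and end the pipeline.
    ...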

  • Added the new modalities option and helper function to set Gemini output modalities.

  • Added examples/foundational/26d-gemini-multimodal-live-text.py which uses Gemini in TEXT modality and another TTS provider for the TTS process.

Changed

  • Modified UserIdleProcessor to start monitoring only after first conversation activity (UserStartedSpeakingFrame or BotStartedSpeakingFrame) instead of immediately.

  • Modified OpenAIAssistantContextAggregator to support controlled completions and to emit context update callbacks via FunctionCallResultProperties.

  • Added aws_session_token to the PollyTTSService.

  • Changed the default model for PlayHTHttpTTSService to Play3.0-mini-http.

  • api_key, aws_access_key_id and region are no longer required parameters for the PollyTTSService (AWSTTSService).

  • Added session_timeout example in examples/websocket-server/bot.py to handle session timeout event.

  • Changed InputParams in src/pipecat/services/gemini_multimodal_live/gemini.py to support different modalities.

  • Changed DeepgramSTTService to send finalize event whenever VAD detects UserStoppedSpeakingFrame. This helps in faster transcriptions and clearing the Deepgram audio buffer.

Fixed

  • Fixed an issue where DeepgramSTTService was not generating metrics when using the pipeline's VAD.

  • Fixed UserIdleProcessor not properly propagating EndFrames through the pipeline.

  • Fixed an issue where websocket based TTS services could incorrectly terminate their connection due to a retry counter not resetting.

  • Fixed a PipelineTask issue that would cause a dangling task after stopping the pipeline with an EndFrame.

  • Fixed an import issue for PlayHTHttpTTSService.

  • Fixed an issue where languages couldn't be used with the PlayHTHttpTTSService.

  • Fixed an issue where OpenAIRealtimeBetaLLMService audio chunks were hitting an error when truncating audio content.

  • Fixed an issue where setting the voice and model for RimeHttpTTSService wasn't working.

  • Fixed an issue where IdleFrameProcessor and UserIdleProcessor were getting initialized before the start of the pipeline.

v0.0.52

24 Dec 16:24
386ba61

Added

  • Constructor arguments for GoogleLLMService to directly set tools and tool_config.

  • Smart turn detection example (22d-natural-conversation-gemini-audio.py) that leverages Gemini 2.0 capabilities.
    (see https://x.com/kwindla/status/1870974144831275410)

  • Added DailyTransport.send_dtmf() to send dial-out DTMF tones.

  • Added DailyTransport.sip_call_transfer() to forward SIP and PSTN calls to another address or number. For example, transfer a SIP call to a different SIP address or transfer a PSTN phone number to a different PSTN phone number.

  • Added DailyTransport.sip_refer() to transfer incoming SIP/PSTN calls from outside Daily to another SIP/PSTN address.

  • Added an auto_mode input parameter to ElevenLabsTTSService. auto_mode is set to True by default. Enabling this setting disables the chunk schedule and all buffers, which reduces latency.

  • Added KoalaFilter which implements on-device noise reduction using Koala Noise Suppression.
    (see https://picovoice.ai/platform/koala/)

  • Added CerebrasLLMService for Cerebras integration with an OpenAI-compatible interface. Added foundational example 14k-function-calling-cerebras.py.

  • Pipecat now supports Python 3.13. We had a dependency on the audioop package which was deprecated and now removed on Python 3.13. We are now using audioop-lts (https://github.com/AbstractUmbra/audioop) to provide the same functionality.

  • Added timestamped conversation transcript support:

    • New TranscriptProcessor factory provides access to user and assistant transcript processors.
    • UserTranscriptProcessor processes user speech with timestamps from transcription.
    • AssistantTranscriptProcessor processes assistant responses with LLM context timestamps.
    • Messages emitted with ISO 8601 timestamps indicating when they were spoken.
    • Supports all LLM formats (OpenAI, Anthropic, Google) via standard message format.
    • New examples: 28a-transcription-processor-openai.py, 28b-transcription-processor-anthropic.py, and 28c-transcription-processor-gemini.py.
  • Added support for more languages to ElevenLabs (Arabic, Croatian, Filipino, Tamil) and PlayHT (Afrikaans, Albanian, Amharic, Arabic, Bengali, Croatian, Galician, Hebrew, Mandarin, Serbian, Tagalog, Urdu, Xhosa).

Changed

  • PlayHTTTSService uses the new v4 websocket API, which also fixes an issue where text input to the TTS didn't return audio.

  • The default model for ElevenLabsTTSService is now eleven_flash_v2_5.

  • OpenAIRealtimeBetaLLMService now takes a model parameter in the constructor.

  • Updated the default model for the OpenAIRealtimeBetaLLMService.

  • Room expiration (exp) in DailyRoomProperties is now optional (None) by default instead of automatically setting a 5-minute expiration time. You must explicitly set expiration time if desired.

Deprecated

  • AWSTTSService is now deprecated, use PollyTTSService instead.

Fixed

  • Fixed token counting in GoogleLLMService. Tokens were summed incorrectly (double-counted in many cases).

  • Fixed an issue that could cause the bot to stop talking if there was a user interruption before getting any audio from the TTS service.

  • Fixed an issue that would cause ParallelPipeline to handle EndFrame incorrectly causing the main pipeline to not terminate or terminate too early.

  • Fixed an audio stuttering issue in FastPitchTTSService.

  • Fixed a BaseOutputTransport issue that was causing non-audio frames being processed before the previous audio frames were played. This will allow, for example, sending a frame A after a TTSSpeakFrame and the frame A will only be pushed downstream after the audio generated from TTSSpeakFrame has been spoken.

  • Fixed a DeepgramSTTService issue that was causing language to be passed as an object instead of a string resulting in the connection to fail.

v0.0.51

16 Dec 23:37

Fixed

  • Fixed an issue in websocket-based TTS services that was causing infinite reconnections (Cartesia, ElevenLabs, PlayHT and LMNT).

v0.0.50

11 Dec 19:51
8e140b2

Added

  • Added GeminiMultimodalLiveLLMService. This is an integration for Google's Gemini Multimodal Live API, supporting:

    • Real-time audio and video input processing
    • Streaming text responses with TTS
    • Audio transcription for both user and bot speech
    • Function calling
    • System instructions and context management
    • Dynamic parameter updates (temperature, top_p, etc.)
  • Added AudioTranscriber utility class for handling audio transcription with Gemini models.

  • Added new context classes for Gemini:

    • GeminiMultimodalLiveContext
    • GeminiMultimodalLiveUserContextAggregator
    • GeminiMultimodalLiveAssistantContextAggregator
    • GeminiMultimodalLiveContextAggregatorPair
  • Added new foundational examples for GeminiMultimodalLiveLLMService:

    • 26-gemini-multimodal-live.py
    • 26a-gemini-multimodal-live-transcription.py
    • 26b-gemini-multimodal-live-video.py
    • 26c-gemini-multimodal-live-video.py
  • Added SimliVideoService. This is an integration for Simli AI avatars.
    (see https://www.simli.com)

  • Added NVIDIA Riva's FastPitchTTSService and ParakeetSTTService.
    (see https://www.nvidia.com/en-us/ai-data-science/products/riva/)

  • Added IdentityFilter. This is the simplest frame filter that lets through all incoming frames.

  • Added a new STTMuteStrategy called FUNCTION_CALL which mutes the STT service during LLM function calls.

  • DeepgramSTTService now exposes two event handlers, on_speech_started and on_utterance_end, that can be used to implement interruptions. See the new example examples/foundational/07c-interruptible-deepgram-vad.py.

  • Added GroqLLMService, GrokLLMService, and NimLLMService for Groq, Grok, and NVIDIA NIM API integration, with an OpenAI-compatible interface.

  • New examples demonstrating function calling with Groq, Grok, Azure OpenAI, Fireworks, and NVIDIA NIM: 14f-function-calling-groq.py, 14g-function-calling-grok.py, 14h-function-calling-azure.py, 14i-function-calling-fireworks.py, and 14j-function-calling-nvidia.py.

  • In order to obtain the audio stored by the AudioBufferProcessor you can now also register an on_audio_data event handler. The on_audio_data handler will be called every time buffer_size (a new constructor argument) is reached. If buffer_size is 0 (default) you need to manually get the audio as before using AudioBufferProcessor.merge_audio_buffers().

@audiobuffer.event_handler("on_audio_data")
async def on_audio_data(processor, audio, sample_rate, num_channels):
    await save_audio(audio, sample_rate, num_channels)

  • Added a new RTVI message called disconnect-bot, which when handled pushes an EndFrame to trigger the pipeline to stop.

Changed

  • STTMuteFilter now supports multiple simultaneous muting strategies.

  • XTTSService language now defaults to Language.EN.

  • SoundfileMixer doesn't resample input files anymore to avoid startup delays. The sample rate of the provided sound files now needs to match the sample rate of the output transport.

  • Input frames (audio, image and transport messages) are now system frames. This means they are processed immediately by all processors instead of being queued internally.

  • Expanded the transcriptions.language module to support a superset of languages.

  • Updated STT and TTS services with language options that match the supported languages for each service.

  • Updated the AzureLLMService to use the OpenAILLMService. Updated the api_version to 2024-09-01-preview.

  • Updated the FireworksLLMService to use the OpenAILLMService. Updated the default model to accounts/fireworks/models/firefunction-v2.

  • Updated the simple-chatbot example to include a Javascript and React client example, using RTVI JS and React.

Removed

  • Removed AppFrame. This was used as a special user custom frame, but there's actually no use case for that.

Fixed

  • Fixed a ParallelPipeline issue that would cause system frames to be queued.

  • Fixed FastAPIWebsocketTransport so it can work with binary data (e.g. using the protobuf serializer).

  • Fixed an issue in CartesiaTTSService that could cause previous audio to be received after an interruption.

  • Fixed Cartesia, ElevenLabs, LMNT and PlayHT TTS websocket reconnection. Before, if an error occurred no reconnection was happening.

  • Fixed a BaseOutputTransport issue that was causing audio to be discarded after an EndFrame was received.

  • Fixed an issue in WebsocketServerTransport and FastAPIWebsocketTransport that would cause a busy loop when using audio mixer.

  • Fixed a DailyTransport and LiveKitTransport issue where connections were being closed in the input transport prematurely. This was causing frames queued inside the pipeline to be discarded.

  • Fixed an issue in DailyTransport that would cause some internal callbacks to not be executed.

  • Fixed an issue where other frames were being processed while a CancelFrame was being pushed down the pipeline.

  • AudioBufferProcessor now handles interruptions properly.

  • Fixed a WebsocketServerTransport issue that would prevent interruptions with TwilioSerializer from working.

  • DailyTransport.capture_participant_video now allows capturing user's screen share by simply passing video_source="screenVideo".

  • Fixed Google Gemini message handling to properly convert appended messages to Gemini's required format.

  • Fixed an issue with FireworksLLMService where chat completions were failing by removing the stream_options from the chat completion options.

v0.0.49

17 Nov 22:34
53f675f

Added

  • Added RTVI on_bot_started event which is useful in single-turn interactions.

  • Added DailyTransport events dialin-connected, dialin-stopped, dialin-error and dialin-warning. Needs daily-python >= 0.13.0.

  • Added RimeHttpTTSService and the 07q-interruptible-rime.py foundational example.

  • Added STTMuteFilter, a general-purpose processor that combines STT muting and interruption control. When active, it prevents both transcription and interruptions during bot speech. The processor supports multiple strategies: FIRST_SPEECH (mute only during bot's first speech), ALWAYS (mute during all bot speech), or CUSTOM (using provided callback).

  • Added STTMuteFrame, a control frame that enables/disables speech transcription in STT services.