Vocode - Build Voice-enabled AI Apps with this Amazing Open-source Python Framework
Vocode is an open-source library that simplifies building voice-enabled applications powered by large language models (LLMs). It allows developers to create real-time, voice-based conversations with LLMs and deploy them to phone calls, Zoom meetings, and beyond.
With Vocode, you can build interactive apps like personal assistants or voice-controlled games such as chess.
It provides easy-to-use integrations and abstractions that streamline the development of voice-enabled apps and make LLM interactions more accessible through voice interfaces.
Features
- 🗣 Real-time Conversations: Hold real-time, voice-based conversations with an LLM directly through your system audio.
- ➡️ 📞 Phone Number Setup: Set up phone numbers that interact with LLM-based agents for automated responses.
- 📞 ➡️ Outbound Phone Calls: Place outbound calls from your managed phone numbers, powered by LLM-based agents (see the sketch after this list).
- 🧑‍💻 Zoom Call Integration: Dial into Zoom meetings with voice-enabled LLMs.
- 🤖 Langchain Agent Integration: Make outbound calls to real phone numbers using LLMs via Langchain agents.
- Ultra-Realistic Voices: Choose from a wide selection of highly realistic voices to enhance customer experiences.
- Multilingual Capability: Configure bots to communicate in multiple languages, offering native language support for your customers.
- Custom Language Models: Utilize your own language models to create bots with personalized, context-specific communication.
- External Action Integration: Automate tasks such as scheduling, payments, and more by integrating with external systems.
- Phone Menu Navigation: Bots can navigate phone menus to reach the appropriate department or individual.
- Hold Waiting: Bots can wait on hold for you and transfer the call once a human agent is available.
- Knowledge Base Connection: Enhance bots with knowledge from your own resources to improve their responses and intelligence.
- Analytics and Monitoring: Track bot performance and customer interactions for valuable insights.
- Available 24/7: Bots are always active, scaling up or down as needed, eliminating the need for hiring, training, or managing human agents.
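To make the telephony features above concrete, here is a rough sketch of an outbound call, adapted from Vocode's telephony quickstart. It assumes a Vocode TelephonyServer already deployed and publicly reachable at BASE_URL, Twilio and OpenAI credentials configured in the environment, a Redis instance running for the config manager, and placeholder phone numbers:

import asyncio
import os

from vocode.streaming.models.agent import ChatGPTAgentConfig
from vocode.streaming.models.message import BaseMessage
from vocode.streaming.telephony.config_manager.redis_config_manager import (
    RedisConfigManager,
)
from vocode.streaming.telephony.conversation.outbound_call import OutboundCall

# Public URL of your deployed TelephonyServer (assumed to be set in the environment).
BASE_URL = os.environ["BASE_URL"]


async def main():
    # The config manager persists call state; here it is backed by Redis.
    config_manager = RedisConfigManager()

    outbound_call = OutboundCall(
        base_url=BASE_URL,
        to_phone="+15555555555",    # placeholder: the number to call
        from_phone="+15555555556",  # placeholder: your managed number
        config_manager=config_manager,
        agent_config=ChatGPTAgentConfig(
            initial_message=BaseMessage(text="Hello!"),
            prompt_preamble="Have a pleasant conversation about life",
        ),
    )
    await outbound_call.start()


if __name__ == "__main__":
    asyncio.run(main())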
Out-of-the-box Integrations
- Transcription Services:
  - AssemblyAI
  - Deepgram
  - Gladia
  - Google Cloud
  - Microsoft Azure
  - RevAI
  - Whisper
  - Whisper.cpp
- LLMs:
  - OpenAI
  - Anthropic
- Synthesis Services:
  - Rime.ai
  - Microsoft Azure
  - Google Cloud
  - Play.ht
  - Eleven Labs
  - Cartesia
  - Coqui (OSS)
  - gTTS
  - StreamElements
  - Bark
  - AWS Polly
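Swapping providers is mostly a config change. For instance, the quickstart below uses Azure for synthesis; here is a sketch of pointing it at Eleven Labs instead. The class and config paths mirror the Azure ones; the api_key and voice_id fields are assumptions based on Vocode's synthesizer config models, and the key/voice values are placeholders:

from vocode.helpers import create_streaming_microphone_input_and_speaker_output
from vocode.streaming.models.synthesizer import ElevenLabsSynthesizerConfig
from vocode.streaming.synthesizer.eleven_labs_synthesizer import ElevenLabsSynthesizer

# Same helper as in the quickstart below; grabs the default mic and speakers.
microphone_input, speaker_output = create_streaming_microphone_input_and_speaker_output(
    use_default_devices=True,
)

# Drop-in replacement for the AzureSynthesizer in the quickstart's
# StreamingConversation(...) call.
synthesizer = ElevenLabsSynthesizer(
    ElevenLabsSynthesizerConfig.from_output_device(
        speaker_output,
        api_key="ENTER_YOUR_ELEVEN_LABS_API_KEY_HERE",  # placeholder
        voice_id="ENTER_A_VOICE_ID_HERE",               # optional; placeholder
    )
)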
Install
pip install vocode
Usage
import asyncio
import signal

from pydantic_settings import BaseSettings, SettingsConfigDict

from vocode.helpers import create_streaming_microphone_input_and_speaker_output
from vocode.logging import configure_pretty_logging
from vocode.streaming.agent.chat_gpt_agent import ChatGPTAgent
from vocode.streaming.models.agent import ChatGPTAgentConfig
from vocode.streaming.models.message import BaseMessage
from vocode.streaming.models.synthesizer import AzureSynthesizerConfig
from vocode.streaming.models.transcriber import (
    DeepgramTranscriberConfig,
    PunctuationEndpointingConfig,
)
from vocode.streaming.streaming_conversation import StreamingConversation
from vocode.streaming.synthesizer.azure_synthesizer import AzureSynthesizer
from vocode.streaming.transcriber.deepgram_transcriber import DeepgramTranscriber

configure_pretty_logging()


class Settings(BaseSettings):
    """
    Settings for the streaming conversation quickstart.
    These parameters can be configured with environment variables.
    """

    openai_api_key: str = "ENTER_YOUR_OPENAI_API_KEY_HERE"
    azure_speech_key: str = "ENTER_YOUR_AZURE_KEY_HERE"
    deepgram_api_key: str = "ENTER_YOUR_DEEPGRAM_API_KEY_HERE"
    azure_speech_region: str = "eastus"

    # This means a .env file can be used to overload these settings
    # ex: "OPENAI_API_KEY=my_key" will set openai_api_key over the default above
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore",
    )


settings = Settings()


async def main():
    # Open the microphone and speaker; use_default_devices=False prompts you
    # to pick the input/output devices interactively.
    (
        microphone_input,
        speaker_output,
    ) = create_streaming_microphone_input_and_speaker_output(
        use_default_devices=False,
    )

    # Wire the transcriber (Deepgram), agent (ChatGPT), and synthesizer (Azure)
    # into a single streaming conversation.
    conversation = StreamingConversation(
        output_device=speaker_output,
        transcriber=DeepgramTranscriber(
            DeepgramTranscriberConfig.from_input_device(
                microphone_input,
                endpointing_config=PunctuationEndpointingConfig(),
                api_key=settings.deepgram_api_key,
            ),
        ),
        agent=ChatGPTAgent(
            ChatGPTAgentConfig(
                openai_api_key=settings.openai_api_key,
                initial_message=BaseMessage(text="What up"),
                prompt_preamble="""The AI is having a pleasant conversation about life""",
            )
        ),
        synthesizer=AzureSynthesizer(
            AzureSynthesizerConfig.from_output_device(speaker_output),
            azure_speech_key=settings.azure_speech_key,
            azure_speech_region=settings.azure_speech_region,
        ),
    )
    await conversation.start()
    print("Conversation started, press Ctrl+C to end")

    # Ctrl+C schedules a graceful shutdown of the conversation.
    signal.signal(
        signal.SIGINT, lambda _0, _1: asyncio.create_task(conversation.terminate())
    )

    # Pump microphone audio into the conversation until it ends.
    while conversation.is_active():
        chunk = await microphone_input.get_audio()
        conversation.receive_audio(chunk)


if __name__ == "__main__":
    asyncio.run(main())
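To run the quickstart, either fill in the placeholder keys above or put OPENAI_API_KEY, DEEPGRAM_API_KEY, AZURE_SPEECH_KEY (and optionally AZURE_SPEECH_REGION) in a .env file next to the script; the Settings class picks them up automatically. Start the script, speak into your microphone, and press Ctrl+C to end the conversation.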
License
MIT License