Architecture

Schakel is a middleware orchestrator that bridges a voice satellite to multiple backend services. It receives raw audio, processes it through a pipeline, and returns synthesized speech.

System Overview

```mermaid
flowchart TB
    subgraph Satellite["Voice Satellite"]
        mic[Microphone]
        speaker[Speaker]
    end

    subgraph Schakel["Schakel Middleware"]
        ws["WebSocket\n/ws/audio"]
        ww[Wake Word\nDetection]
        stt["STT\n(Whisper)"]
        router{{"Intent Router\n(Classifier LLM)"}}
        tts["TTS Engine"]

        ws --> ww --> stt --> router
        tts --> ws
    end

    subgraph Agents["Specialized Agents"]
        domotica["Domotica Agent\n(HA Translator)"]
        musica["Musica Agent\n(Spotify)"]
        general["General Agent\n(Conversational)"]
    end

    subgraph External["External Services"]
        ha["Home Assistant\nAPI"]
        spotify["Spotify\nAPI"]
        llm_local["Local LLM\n(Ollama)"]
        llm_cloud["Cloud LLM\n(OpenAI-compatible)"]
    end

    mic -- "audio bytes" --> ws
    ws -- "audio bytes" --> speaker

    router -- "DOMOTICA" --> domotica
    router -- "MUSICA" --> musica
    router -- "GENERAL" --> general

    domotica -- "JSON action" --> ha
    musica --> spotify
    general --> llm_local
    general --> llm_cloud

    domotica -- "confirmation" --> tts
    musica -- "response" --> tts
    general -- "response" --> tts
```
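
The routing edges in the diagram can be read as a dispatch table keyed by the classifier label. The following is a minimal sketch of that idea, not the actual router code: the agent functions are hypothetical stand-ins, and falling back to the general agent on an unknown label is an assumption.

```python
from enum import Enum
from typing import Callable, Dict


class Intent(str, Enum):
    DOMOTICA = "DOMOTICA"
    MUSICA = "MUSICA"
    GENERAL = "GENERAL"


# Hypothetical stand-ins for the three agents; each returns the text sent to TTS.
def domotica_agent(text: str) -> str:
    return f"[domotica] {text}"


def musica_agent(text: str) -> str:
    return f"[musica] {text}"


def general_agent(text: str) -> str:
    return f"[general] {text}"


AGENTS: Dict[Intent, Callable[[str], str]] = {
    Intent.DOMOTICA: domotica_agent,
    Intent.MUSICA: musica_agent,
    Intent.GENERAL: general_agent,
}


def dispatch(label: str, text: str) -> str:
    """Route the transcribed text to the agent matching the classifier label."""
    try:
        intent = Intent(label.strip().upper())
    except ValueError:
        # Assumption: unrecognized labels are treated as general conversation.
        intent = Intent.GENERAL
    return AGENTS[intent](text)
```

Calling `dispatch("musica", "pon jazz")` would hand the text to the music agent, while a label the classifier was never supposed to emit ends up with the general agent instead of raising.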

Audio Pipeline

Audio flows through a two-state WebSocket pipeline:

LISTENING State

The satellite streams audio chunks continuously over the WebSocket. The wake word detector (openwakeword) processes each chunk looking for the configured trigger word (default: "alexa"). All audio is discarded until the wake word is detected.

RECORDING State

Once the wake word is detected, the pipeline switches to recording mode and buffers 3 seconds of audio (48,000 samples, or 96,000 bytes, at 16 kHz, 16-bit mono). When the buffer is full, the complete pipeline runs:

  1. STT -- faster-whisper transcribes the audio buffer to text
  2. Intent Classification -- the router LLM classifies the text as DOMOTICA, MUSICA, or GENERAL
  3. Agent Dispatch -- the classified intent is routed to the appropriate agent
  4. Execution -- the agent processes the request (HA service call, Spotify command, or LLM conversation)
  5. TTS -- Piper synthesizes the response text to 16 kHz PCM audio
  6. Response -- the audio bytes are sent back through the WebSocket to the satellite speaker

The pipeline then resets to LISTENING and waits for the next wake word.
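
The two states can be sketched as a small state machine. This is an illustrative sketch, not the code in `app/main.py`: the class name, the two callables, and the choice to discard the chunk that triggers the wake word are all assumptions. It takes the 3-second window as the governing figure, which at 16 kHz, 16-bit mono works out to 16,000 × 2 × 3 = 96,000 bytes.

```python
from enum import Enum, auto
from typing import Callable, Optional

SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit PCM, mono
RECORD_SECONDS = 3
RECORD_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * RECORD_SECONDS


class State(Enum):
    LISTENING = auto()
    RECORDING = auto()


class AudioSession:
    """Hypothetical per-connection state for the audio WebSocket."""

    def __init__(
        self,
        detect_wake_word: Callable[[bytes], bool],
        run_pipeline: Callable[[bytes], bytes],
    ) -> None:
        self.state = State.LISTENING
        self.buffer = bytearray()
        self._detect = detect_wake_word
        self._run = run_pipeline

    def feed(self, chunk: bytes) -> Optional[bytes]:
        """Consume one chunk; return response audio once a recording completes."""
        if self.state is State.LISTENING:
            if self._detect(chunk):
                self.state = State.RECORDING
                self.buffer.clear()
            return None  # audio is discarded while listening

        self.buffer.extend(chunk)
        if len(self.buffer) < RECORD_BYTES:
            return None

        audio = bytes(self.buffer[:RECORD_BYTES])
        self.buffer.clear()
        self.state = State.LISTENING  # reset before the next wake word
        return self._run(audio)       # STT -> router -> agent -> TTS
```

A WebSocket handler would call `feed` for every incoming chunk and send back whatever non-`None` bytes it returns.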

Module Layout

| Module | Path | Responsibility |
| --- | --- | --- |
| Entry point | app/main.py | FastAPI app, WebSocket handler, service initialization |
| Configuration | app/core/config.py | YAML loading and Pydantic validation |
| Logging | app/core/logger.py | Logging setup |
| Schemas | app/schemas/models.py | Pydantic models: Intent, HAAction, MusicAction, RouterResponse |
| Wake word | app/services/audio/wakeword.py | openwakeword integration |
| STT | app/services/audio/stt.py | faster-whisper transcription |
| TTS | app/services/audio/tts.py | Piper synthesis with resampling to 16 kHz |
| Intent Router | app/services/llm/router.py | Intent classification and agent dispatch |
| Local LLM | app/services/llm/local.py | Ollama async client |
| Cloud LLM | app/services/llm/cloud.py | OpenAI/Anthropic/Mistral async client |
| HA Client | app/services/home_assistant/client.py | Entity discovery and service calls |
| Spotify Client | app/services/music/spotify.py | Spotipy async wrapper |

Key Design Decisions

All agents operate in Spanish. System prompts and agent output are written in Spanish because Schakel is a Spanish-language voice assistant.

Structured JSON output for action agents. The domotica and musica agents return structured JSON that is parsed and executed programmatically. Only the confirmation field reaches TTS. This keeps service calls reliable because nothing has to be extracted from free-form LLM text.
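
As an illustration of this decision, the sketch below parses a domotica reply and separates the part that is executed from the part that is spoken. It uses a standard-library dataclass as a stand-in for the Pydantic `HAAction` model, and the field names other than `confirmation` are assumptions rather than the project's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class HAAction:
    # Hypothetical fields; only "confirmation" is named in this document.
    domain: str
    service: str
    entity_id: str
    confirmation: str


def parse_ha_action(raw: str) -> Optional[HAAction]:
    """Parse the agent's JSON reply; return None if it is malformed or incomplete."""
    try:
        data = json.loads(raw)
        return HAAction(
            domain=str(data["domain"]),
            service=str(data["service"]),
            entity_id=str(data["entity_id"]),
            confirmation=str(data["confirmation"]),
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return None


raw_reply = (
    '{"domain": "light", "service": "turn_on", '
    '"entity_id": "light.cocina", '
    '"confirmation": "He encendido la luz de la cocina."}'
)
action = parse_ha_action(raw_reply)
```

Here `action.domain`, `action.service`, and `action.entity_id` would drive the Home Assistant call, and only `action.confirmation` would be passed to TTS. A reply that is not valid JSON yields `None`, so the caller can report an error instead of guessing at the model's intent.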

Confirmation from real results. For the musica agent, the TTS confirmation comes from the actual Spotify search result (real track/artist names), not from the LLM's guess. This prevents the assistant from announcing a song name that doesn't match what's actually playing.
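
A minimal sketch of that idea follows, assuming the search result is a dictionary shaped like a Spotify track object (a `name` plus a list of `artists`, each with a `name`). The function name and the Spanish phrasing are hypothetical.

```python
from typing import Any, Mapping, Optional


def build_confirmation(track: Optional[Mapping[str, Any]]) -> str:
    """Build the spoken confirmation from the search result, not from LLM output."""
    if track is None:
        # Assumption: an empty search is reported rather than guessed around.
        return "No he encontrado ninguna pista."
    artists = ", ".join(artist["name"] for artist in track["artists"])
    return f"Reproduciendo {track['name']} de {artists}."
```

Because the sentence is assembled from the returned track, a request the LLM misheard or misspelled still produces a confirmation that matches what is actually queued.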

Graceful degradation. If Spotify is not configured, the music agent responds politely instead of crashing. If the LLM call fails, the general agent returns a fallback message. If a Home Assistant service call fails, the domotica agent reports the error.
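
The fallback behavior might look roughly like the sketch below. The function names, the callables standing in for the LLM and Spotify clients, and the fallback sentences are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical fallback messages, in Spanish like the rest of the agent output.
FALLBACK_GENERAL = "Lo siento, ahora mismo no puedo responder."
SPOTIFY_MISSING = "Lo siento, Spotify no esta configurado."


def answer_general(text: str, ask_llm: Callable[[str], str]) -> str:
    """Return the LLM answer, or a fixed fallback if the call fails."""
    try:
        return ask_llm(text)
    except Exception:
        # Any client error (timeout, connection refused, bad response)
        # degrades to a spoken apology instead of breaking the session.
        return FALLBACK_GENERAL


def answer_music(text: str, spotify: Optional[Callable[[str], str]]) -> str:
    """Return the music agent's response, or a polite notice if unconfigured."""
    if spotify is None:
        return SPOTIFY_MISSING
    return spotify(text)
```

In both cases the user still hears a response, so a missing integration or a failed backend call never leaves the satellite silent.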

For details on each agent's behavioral contract, see Agents.