Schakel Voice

Orchestrator middleware for home automation and local voice assistance. Schakel sits between a voice satellite and Home Assistant, routing spoken commands through STT, an intent classifier, and specialized agents (home automation, music, general conversation) before responding via TTS.

Architecture

flowchart TB
    subgraph Satellite["Voice Satellite"]
        mic[Microphone]
        speaker[Speaker]
    end

    subgraph Schakel["Schakel Middleware"]
        ws["WebSocket\n/ws/audio"]
        ww[Wake Word\nDetection]
        stt["STT\n(Whisper)"]
        router{{"Intent Router\n(Classifier LLM)"}}
        tts["TTS Engine"]

        ws --> ww --> stt --> router
        tts --> ws
    end

    subgraph Agents["Specialized Agents"]
        domotica["Domotica Agent\n(HA Translator)"]
        musica["Musica Agent\n(Spotify)"]
        general["General Agent\n(Conversational)"]
    end

    subgraph External["External Services"]
        ha["Home Assistant\nAPI"]
        spotify["Spotify\nAPI"]
        llm_local["Local LLM\n(Ollama)"]
        llm_cloud["Cloud LLM\n(OpenAI-compatible)"]
    end

    mic -- "audio bytes" --> ws
    ws -- "audio bytes" --> speaker

    router -- "DOMOTICA" --> domotica
    router -- "MUSICA" --> musica
    router -- "GENERAL" --> general

    domotica -- "JSON action" --> ha
    musica --> spotify
    general --> llm_local
    general --> llm_cloud

    domotica -- "confirmation" --> tts
    musica -- "response" --> tts
    general -- "response" --> tts

How It Works

Audio flows through a two-state WebSocket pipeline:

LISTENING -- streams audio through wake word detection (openwakeword). All audio is discarded until the wake word is heard.
RECORDING -- buffers audio until the recording limit is reached (3 seconds), then runs the full pipeline: Whisper STT -> intent classification -> agent dispatch -> Piper TTS -> audio response back to the satellite.
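The two states can be sketched as a small per-connection state machine. This is only an illustration, not Schakel's actual code: the `AudioSession` class, the `detect_wake_word` and `run_pipeline` callables, and the 16 kHz, 16-bit mono audio format are all assumptions; only the two state names and the 3-second limit come from the description above.

```python
from enum import Enum, auto
from typing import Callable, Optional

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio
BYTES_PER_SAMPLE = 2   # assumption: 16-bit PCM samples
RECORD_SECONDS = 3     # recording limit stated in the text
RECORD_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * RECORD_SECONDS


class State(Enum):
    LISTENING = auto()
    RECORDING = auto()


class AudioSession:
    """Hypothetical per-connection state machine for the two states."""

    def __init__(
        self,
        detect_wake_word: Callable[[bytes], bool],
        run_pipeline: Callable[[bytes], bytes],
    ) -> None:
        self.state = State.LISTENING
        self.buffer = bytearray()
        self._detect = detect_wake_word  # stand-in for the wake word detector
        self._run = run_pipeline         # stand-in for STT -> intent -> agent -> TTS

    def feed(self, chunk: bytes) -> Optional[bytes]:
        """Consume one audio chunk; return response audio once a recording completes."""
        if self.state is State.LISTENING:
            # Nothing is buffered while listening; only the detector sees the chunk.
            if self._detect(chunk):
                self.state = State.RECORDING
            return None

        # RECORDING: accumulate audio until the limit is reached.
        self.buffer.extend(chunk)
        if len(self.buffer) < RECORD_BYTES:
            return None

        utterance = bytes(self.buffer[:RECORD_BYTES])
        self.buffer.clear()
        self.state = State.LISTENING
        return self._run(utterance)
```

In a WebSocket handler, each incoming binary frame would be passed to `feed`, and any non-`None` return value would be sent back to the satellite as the audio response.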

The intent router classifies each utterance into one of three categories:

Intent	Description	Example
DOMOTICA	Home automation device control	"enciende la luz del salon" (turn on the living room light)
MUSICA	Spotify playback commands	"pon Despacito" (play Despacito)
GENERAL	Questions and conversation	"que tiempo hace manana" (what will the weather be tomorrow)
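Because the classifier is an LLM, its raw text output usually needs to be normalized into one of the three labels. The sketch below shows one way to do that; the `Intent` enum, the `parse_intent` helper, and the fallback to GENERAL for unrecognized output are assumptions made for illustration, not behavior documented by Schakel.

```python
from enum import Enum


class Intent(Enum):
    DOMOTICA = "DOMOTICA"
    MUSICA = "MUSICA"
    GENERAL = "GENERAL"


def parse_intent(raw: str) -> Intent:
    """Map the classifier's raw text onto an Intent.

    Falling back to GENERAL for unrecognized output is an assumption
    of this sketch.
    """
    # Drop surrounding whitespace, quotes, and trailing periods, then uppercase.
    cleaned = raw.strip().strip("\"'.").upper()

    # Prefer an exact label match.
    for intent in Intent:
        if cleaned == intent.value:
            return intent

    # Tolerate a label embedded in a longer reply, e.g. "Intent: MUSICA".
    for intent in Intent:
        if intent.value in cleaned:
            return intent

    return Intent.GENERAL
```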

Each intent is handled by a specialized agent that produces a structured response, which is then synthesized to audio and sent back to the satellite.
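Dispatching an intent to its agent can be pictured as a lookup table from intent label to handler. The following is a minimal sketch under stated assumptions: the `AgentResponse` fields, the three placeholder agent functions, and the `dispatch` helper are hypothetical, and the real agents call Home Assistant, Spotify, or an LLM rather than returning canned text.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AgentResponse:
    """Hypothetical structured response; field names are illustrative."""
    intent: str
    speech: str  # text that would be handed to the TTS engine


def domotica_agent(text: str) -> AgentResponse:
    # Placeholder: a real agent would translate the text into a Home Assistant action.
    return AgentResponse("DOMOTICA", f"Home action requested: {text}")


def musica_agent(text: str) -> AgentResponse:
    # Placeholder: a real agent would issue a Spotify playback command.
    return AgentResponse("MUSICA", f"Playback requested: {text}")


def general_agent(text: str) -> AgentResponse:
    # Placeholder: a real agent would query a local or cloud LLM.
    return AgentResponse("GENERAL", f"Answering: {text}")


AGENTS: Dict[str, Callable[[str], AgentResponse]] = {
    "DOMOTICA": domotica_agent,
    "MUSICA": musica_agent,
    "GENERAL": general_agent,
}


def dispatch(intent: str, text: str) -> AgentResponse:
    """Route the transcribed text to the agent registered for the intent.

    Unknown labels go to the general agent; that fallback is an assumption.
    """
    agent = AGENTS.get(intent, general_agent)
    return agent(text)
```

The `speech` field of the returned response is what would then be synthesized to audio and streamed back to the satellite.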

Quick Links

Getting Started -- prerequisites and setup
Configuration -- config.yaml walkthrough
M5Stack Atom Echo -- flash your voice satellite
Agent Architecture -- how the agents work
Docker Deployment -- run with containers