Architecture
Schakel is a middleware orchestrator that bridges a voice satellite to multiple backend services. It receives raw audio, processes it through a pipeline, and returns synthesized speech.
System Overview
```mermaid
flowchart TB
    subgraph Satellite["Voice Satellite"]
        mic[Microphone]
        speaker[Speaker]
    end
    subgraph Schakel["Schakel Middleware"]
        ws["WebSocket\n/ws/audio"]
        ww[Wake Word\nDetection]
        stt["STT\n(Whisper)"]
        router{{"Intent Router\n(Classifier LLM)"}}
        tts["TTS Engine"]
        ws --> ww --> stt --> router
        tts --> ws
    end
    subgraph Agents["Specialized Agents"]
        domotica["Domotica Agent\n(HA Translator)"]
        musica["Musica Agent\n(Spotify)"]
        general["General Agent\n(Conversational)"]
    end
    subgraph External["External Services"]
        ha["Home Assistant\nAPI"]
        spotify["Spotify\nAPI"]
        llm_local["Local LLM\n(Ollama)"]
        llm_cloud["Cloud LLM\n(OpenAI-compatible)"]
    end
    mic -- "audio bytes" --> ws
    ws -- "audio bytes" --> speaker
    router -- "DOMOTICA" --> domotica
    router -- "MUSICA" --> musica
    router -- "GENERAL" --> general
    domotica -- "JSON action" --> ha
    musica --> spotify
    general --> llm_local
    general --> llm_cloud
    domotica -- "confirmation" --> tts
    musica -- "response" --> tts
    general -- "response" --> tts
```
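The routing edges in the diagram can be read as a dispatch table keyed by intent label. The following is a minimal sketch, not the project's actual router: the agent functions are hypothetical stubs, and falling back to `GENERAL` for an unrecognized label is an assumption made for illustration.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable, Dict


class Intent(str, Enum):
    DOMOTICA = "DOMOTICA"
    MUSICA = "MUSICA"
    GENERAL = "GENERAL"


# Hypothetical agent stubs. The real agents would call Home Assistant,
# Spotify, or an LLM and return the text that is handed to TTS.
async def domotica_agent(text: str) -> str:
    return f"[domotica] {text}"


async def musica_agent(text: str) -> str:
    return f"[musica] {text}"


async def general_agent(text: str) -> str:
    return f"[general] {text}"


AGENTS: Dict[Intent, Callable[[str], Awaitable[str]]] = {
    Intent.DOMOTICA: domotica_agent,
    Intent.MUSICA: musica_agent,
    Intent.GENERAL: general_agent,
}


def parse_intent(label: str) -> Intent:
    """Map the classifier's raw label to an Intent.

    Assumption: unknown labels fall back to GENERAL.
    """
    try:
        return Intent(label.strip().upper())
    except ValueError:
        return Intent.GENERAL


async def dispatch(label: str, text: str) -> str:
    """Route the transcribed text to the agent selected by the label."""
    return await AGENTS[parse_intent(label)](text)
```

With these stubs, `asyncio.run(dispatch("MUSICA", "pon jazz"))` returns the music stub's reply, and a label the classifier was never supposed to emit is handled by the general agent instead of raising.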
Audio Pipeline
Audio flows through a two-state WebSocket pipeline:
LISTENING State
The satellite streams audio chunks continuously over the WebSocket. The wake word detector (openwakeword) processes each chunk looking for the configured trigger word (default: "alexa"). All audio is discarded until the wake word is detected.
RECORDING State
Once the wake word is detected, the pipeline switches to recording mode and buffers 3 seconds of audio (48,000 samples, or 96,000 bytes, at 16 kHz, 16-bit mono). When the buffer is full, the complete pipeline runs:
- STT -- faster-whisper transcribes the audio buffer to text
- Intent Classification -- the router LLM classifies the text as `DOMOTICA`, `MUSICA`, or `GENERAL`
- Agent Dispatch -- the classified intent is routed to the appropriate agent
- Execution -- the agent processes the request (HA service call, Spotify command, or LLM conversation)
- TTS -- Piper synthesizes the response text to 16 kHz PCM audio
- Response -- the audio bytes are sent back through the WebSocket to the satellite speaker
The pipeline then resets to LISTENING and waits for the next wake word.
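The two states can be sketched as a small state machine. This is an illustrative sketch under stated assumptions, not the handler in `app/main.py`: `AudioPipeline`, `feed`, and the `detect_wake_word` callable are hypothetical names, and the wake word detector is reduced to a function that returns `True` or `False` per chunk. Three seconds of 16 kHz, 16-bit mono audio is 48,000 samples, or 96,000 bytes, so the buffer size is derived from those constants and left overridable.

```python
from enum import Enum, auto
from typing import Callable, Optional

SAMPLE_RATE = 16_000   # samples per second
BYTES_PER_SAMPLE = 2   # 16-bit mono
RECORD_SECONDS = 3
BUFFER_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * RECORD_SECONDS


class State(Enum):
    LISTENING = auto()
    RECORDING = auto()


class AudioPipeline:
    """Hypothetical two-state pipeline: discard audio, then record a fixed window."""

    def __init__(
        self,
        detect_wake_word: Callable[[bytes], bool],
        buffer_bytes: int = BUFFER_BYTES,
    ) -> None:
        self.detect_wake_word = detect_wake_word
        self.buffer_bytes = buffer_bytes
        self.state = State.LISTENING
        self.buffer = bytearray()

    def feed(self, chunk: bytes) -> Optional[bytes]:
        """Consume one chunk; return the recording once the buffer is full."""
        if self.state is State.LISTENING:
            # Audio is discarded here; only the detector sees the chunk.
            if self.detect_wake_word(chunk):
                self.state = State.RECORDING
            return None

        self.buffer.extend(chunk)
        if len(self.buffer) < self.buffer_bytes:
            return None

        # Buffer is full: hand it to STT and go back to LISTENING.
        recording = bytes(self.buffer[: self.buffer_bytes])
        self.buffer.clear()
        self.state = State.LISTENING
        return recording
```

A WebSocket handler would call `feed` for every incoming chunk and run STT, routing, and TTS whenever it returns a non-`None` recording.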
Module Layout
| Module | Path | Responsibility |
|---|---|---|
| Entry point | `app/main.py` | FastAPI app, WebSocket handler, service initialization |
| Configuration | `app/core/config.py` | YAML loading and Pydantic validation |
| Logging | `app/core/logger.py` | Logging setup |
| Schemas | `app/schemas/models.py` | Pydantic models: `Intent`, `HAAction`, `MusicAction`, `RouterResponse` |
| Wake word | `app/services/audio/wakeword.py` | openwakeword integration |
| STT | `app/services/audio/stt.py` | faster-whisper transcription |
| TTS | `app/services/audio/tts.py` | Piper synthesis with resampling to 16 kHz |
| Intent Router | `app/services/llm/router.py` | Intent classification and agent dispatch |
| Local LLM | `app/services/llm/local.py` | Ollama async client |
| Cloud LLM | `app/services/llm/cloud.py` | OpenAI/Anthropic/Mistral async client |
| HA Client | `app/services/home_assistant/client.py` | Entity discovery and service calls |
| Spotify Client | `app/services/music/spotify.py` | Spotipy async wrapper |
Key Design Decisions
All agents operate in Spanish. System prompts and agent output are written in Spanish because Schakel is a Spanish-language voice assistant.
Structured JSON output for action agents. The domotica and musica agents return structured JSON that is parsed and executed programmatically. Only the `confirmation` field reaches TTS. This keeps service calls reliable because nothing depends on parsing free-form LLM text.
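As a rough illustration of the structured-output idea, the sketch below parses a JSON action and separates the part that is executed from the part that is spoken. It uses stdlib dataclasses to stay dependency-free, whereas the project's schemas are Pydantic models. The field names `domain`, `service`, `entity_id`, and `data` are assumptions for illustration; only `confirmation` is named in this document.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

REQUIRED_FIELDS = {"domain", "service", "entity_id", "confirmation"}


@dataclass
class HAAction:
    # Hypothetical field set; the real HAAction schema may differ.
    domain: str
    service: str
    entity_id: str
    confirmation: str
    data: Dict[str, Any] = field(default_factory=dict)


def parse_ha_action(raw: str) -> HAAction:
    """Parse the agent's JSON output, rejecting anything malformed."""
    payload = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return HAAction(
        domain=payload["domain"],
        service=payload["service"],
        entity_id=payload["entity_id"],
        confirmation=payload["confirmation"],
        data=payload.get("data", {}),
    )
```

Everything except `confirmation` would feed the service call; `confirmation` alone would be passed to TTS.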
Confirmation from real results. For the musica agent, the TTS confirmation comes from the actual Spotify search result (real track/artist names), not from the LLM's guess. This prevents the assistant from announcing a song name that doesn't match what's actually playing.
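One way to picture this: the confirmation string is built from the track object returned by the search, never from the LLM's text. The sketch below assumes a track shaped like a Spotify Web API track object (a `name` plus an `artists` list whose entries have a `name`); the function name and the Spanish wording are hypothetical.

```python
from typing import Any, Dict, Optional


def confirmation_from_track(track: Optional[Dict[str, Any]]) -> str:
    """Build the spoken confirmation from the track that was actually found."""
    if not track:
        # Hypothetical wording for the no-result case.
        return "No he encontrado esa canción."
    title = track["name"]
    artists = ", ".join(artist["name"] for artist in track.get("artists", []))
    if artists:
        return f"Reproduciendo {title} de {artists}."
    return f"Reproduciendo {title}."
```

Because the title and artist come from the search result, a misheard or hallucinated song name in the LLM output cannot leak into what the assistant announces.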
Graceful degradation. If Spotify is not configured, the music agent responds politely instead of crashing. If the LLM call fails, the general agent returns a fallback message. If a Home Assistant service call fails, the domotica agent reports the error.
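The fallback behavior for a failed LLM call could look roughly like the wrapper below. It is a sketch only: `with_fallback` and the fallback wording are hypothetical, and the real agents may handle specific exception types rather than catching broadly.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)

# Hypothetical fallback text; the actual message is not given in this document.
FALLBACK_MESSAGE = "Lo siento, no puedo responder ahora mismo."


async def with_fallback(
    call: Callable[[], Awaitable[str]],
    fallback: str = FALLBACK_MESSAGE,
) -> str:
    """Await the backend call; on failure, log it and return a spoken fallback."""
    try:
        return await call()
    except Exception:
        # Task cancellation (asyncio.CancelledError) is a BaseException on
        # Python 3.8+ and is deliberately not swallowed here.
        logger.exception("backend call failed, returning fallback")
        return fallback
```

The pipeline still produces a response for TTS in the failure case, so the user hears an apology rather than silence.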
For details on each agent's behavioral contract, see Agents.