Hi all,
Sharing a Swift package I've been working on — a modular speech processing toolkit that runs entirely on-device using
MLX Swift and CoreML.
The package provides 8 protocols and 11 model implementations covering the full speech pipeline:
- SpeechRecognitionModel — Qwen3-ASR, Parakeet TDT (CoreML)
- SpeechGenerationModel — Qwen3-TTS, CosyVoice TTS (streaming)
- SpeechToSpeechModel — PersonaPlex 7B (full-duplex)
- VoiceActivityDetectionModel — Silero (streaming), Pyannote (overlap)
- SpeakerEmbeddingModel — WeSpeaker ResNet34
- SpeakerDiarizationModel — Pyannote + WeSpeaker + spectral clustering
- SpeechEnhancementModel — DeepFilterNet3 (CoreML Neural Engine)
- ForcedAlignmentModel — Qwen3-ForcedAligner (word-level timestamps)
Each model target is independent: import Qwen3ASR doesn't pull in TTS or anything else. Models download from
HuggingFace on first use and are cached locally.
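For anyone curious what "download on first use, cached locally" amounts to, here is a minimal sketch of the general pattern. It is not the package's actual loader: `cachedModelFile`, its parameters, and the injected `download` closure are all hypothetical names made up for illustration, and the real code may organise the cache differently.

```swift
import Foundation

/// Returns the local URL of a model file, calling `download` only when the
/// file is not already cached. Hypothetical helper, not the package's API.
func cachedModelFile(
    repo: String,
    file: String,
    cacheRoot: URL,
    download: (URL) throws -> Void
) throws -> URL {
    // One cache directory per repository, e.g. <cacheRoot>/<org>/<model>/
    let dir = cacheRoot.appendingPathComponent(repo, isDirectory: true)
    let destination = dir.appendingPathComponent(file)

    // Cache hit: return the existing file without touching the network.
    if FileManager.default.fileExists(atPath: destination.path) {
        return destination
    }

    // Cache miss: create the directory, then fetch the file once.
    try FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true)
    try download(destination)
    return destination
}
```

Passing the download step in as a closure keeps the caching logic testable offline; a real loader would do the HTTP fetch inside it.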
A few design decisions I'd appreciate feedback on:
- MLX vs CoreML split: Large models (ASR, TTS, PersonaPlex) run on MLX/GPU. Small models (Silero VAD, DeepFilterNet3)
run on CoreML/Neural Engine. This avoids ANE contention when running multiple models. Does this pattern resonate with
others doing on-device ML?
- Protocol design: All protocols use an AnyObject constraint (reference semantics for large weight buffers) and an
optional language: String?. No ModelLoadable protocol since each model has different loading parameters. See
Protocols.swift.
- Composed pipelines: Currently StreamingASR (VAD → ASR) and DiarizationPipeline exist as Layer 2 classes. Working on
MeetingTranscriber (diarize → per-segment ASR) next. What other compositions would be useful?
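To make the protocol and composition points concrete for feedback, here is a rough sketch of the shape described: class-only protocols with an optional language hint, and a Layer 2 class that chains VAD into ASR. Every name in it (`VoiceActivityDetecting`, `SpeechRecognizing`, `EnergyVAD`, `StubASR`, `VADThenASR`, `SpeechSegment`) is hypothetical and the models are toy stand-ins; the real signatures are in Protocols.swift and may well differ.

```swift
/// A span of samples judged to contain speech (end is exclusive).
struct SpeechSegment: Equatable {
    let start: Int
    let end: Int
}

/// Sketch of a VAD protocol. AnyObject restricts conformers to classes,
/// so large weight buffers are shared by reference rather than copied.
protocol VoiceActivityDetecting: AnyObject {
    func detectSpeech(in samples: [Float]) -> [SpeechSegment]
}

/// Sketch of an ASR protocol with an optional language hint.
protocol SpeechRecognizing: AnyObject {
    func transcribe(_ samples: [Float], language: String?) throws -> String
}

/// Toy amplitude-threshold VAD standing in for a real model.
final class EnergyVAD: VoiceActivityDetecting {
    let threshold: Float
    init(threshold: Float = 0.1) { self.threshold = threshold }

    func detectSpeech(in samples: [Float]) -> [SpeechSegment] {
        var segments: [SpeechSegment] = []
        var start: Int? = nil
        for (i, sample) in samples.enumerated() {
            let active = abs(sample) >= threshold
            if active, start == nil { start = i }          // segment opens
            if !active, let begin = start {                // segment closes
                segments.append(SpeechSegment(start: begin, end: i))
                start = nil
            }
        }
        if let begin = start {                             // still open at end of input
            segments.append(SpeechSegment(start: begin, end: samples.count))
        }
        return segments
    }
}

/// Stub recognizer: reports the segment length instead of running a model.
final class StubASR: SpeechRecognizing {
    func transcribe(_ samples: [Float], language: String?) throws -> String {
        "\(samples.count) samples [\(language ?? "auto")]"
    }
}

/// Layer 2 composition: run VAD first, then ASR on each detected segment.
final class VADThenASR {
    private let vad: any VoiceActivityDetecting
    private let asr: any SpeechRecognizing

    init(vad: any VoiceActivityDetecting, asr: any SpeechRecognizing) {
        self.vad = vad
        self.asr = asr
    }

    func transcribe(_ samples: [Float], language: String? = nil) throws -> [String] {
        try vad.detectSpeech(in: samples).map { segment in
            try asr.transcribe(Array(samples[segment.start..<segment.end]), language: language)
        }
    }
}
```

Because the composed class only sees the two protocols, swapping the toy VAD or stub recognizer for a real model target would not change `VADThenASR` itself. That is the property a diarize → per-segment ASR composition would presumably rely on too.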
Roadmap: v0.1 → v0.3 (ivan-digital/qwen3-asr-swift, GitHub Discussion #81)
Repo: ivan-digital/qwen3-asr-swift on GitHub (AI speech toolkit for Apple Silicon: ASR, TTS, speech-to-speech, VAD, and diarization powered by MLX and CoreML)
Requirements: Swift 5.9+, macOS 14+ / iOS 17+, Apple Silicon.