Kani TTS: Fast and Expressive Speech Generation

A modular, human-like text-to-speech model that generates high-quality speech from text input. With 450M parameters optimized for edge devices, Kani TTS delivers exceptional performance for real-time applications.

Kani TTS Demo
“Anyway, um, so, um, tell me, tell me all about her. I mean, what's she like? Is she really, you know, pretty?”

Key Features

450M Parameters

Optimized for edge devices and affordable servers, providing efficient processing without compromising quality.

High-Quality Speech

22kHz audio output with 0.6kbps compression, delivering clear and natural-sounding speech.
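The numbers above imply a large compression factor. As a back-of-the-envelope check (assuming raw output is 16-bit PCM, which is standard for WAV but not stated in this page), the uncompressed bitrate at 22kHz works out to roughly 588× the 0.6kbps token stream:

```python
sample_rate = 22050   # Hz, Kani TTS output rate
bit_depth = 16        # bits per sample for raw PCM (assumption)

raw_kbps = sample_rate * bit_depth / 1000  # uncompressed bitrate: 352.8 kbps
codec_kbps = 0.6                           # compressed token bitrate from the spec above
ratio = raw_kbps / codec_kbps              # ~588x compression

print(f"{raw_kbps} kbps raw vs {codec_kbps} kbps compressed: {ratio:.0f}x")
```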

Real-Time Performance

Ultra-fast generation suitable for interactive voice assistants, gaming, and live content.

Advanced Architecture

Kani TTS operates on a two-stage pipeline that combines a powerful language model with a highly efficient audio codec for optimal performance.

LiquidAI LFM2-350M Backbone

The first stage converts input text into compressed audio tokens. Trained on 50k hours of text and audio data, it analyzes semantic meaning, syntactic structure, and prosodic cues.

  • Processes raw text with punctuation and prosodic markers
  • Maps information to discrete audio tokens
  • Produces compact token sequences for fast processing

NVIDIA NanoCodec

The second stage serves as the vocoder, converting audio tokens into high-fidelity audio waveforms with near-instantaneous processing.

  • Lightweight generative model for real-time operation
  • Reconstructs full audio signal from compressed tokens
  • Delivers final WAV format output
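The two stages can be pictured as a simple function composition: text goes through the backbone to become compact token ids, and the codec expands those ids back into samples. The toy sketch below illustrates only the data flow, not the real models; both stage functions are stand-ins invented for this example:

```python
import numpy as np

def text_to_tokens(text):
    # Stage 1 stand-in for the LFM2-350M backbone: map each character
    # to a discrete "audio token" id. The real model autoregressively
    # predicts compressed NanoCodec tokens from the input text.
    return [ord(ch) % 256 for ch in text]

def tokens_to_waveform(tokens, sample_rate=22050, frame_len=220):
    # Stage 2 stand-in for NVIDIA NanoCodec: expand each compact token
    # into a short audio frame. The real codec reconstructs the full
    # 22 kHz waveform from the compressed token sequence.
    frames = [
        np.sin(2 * np.pi * (100 + t) * np.arange(frame_len) / sample_rate)
        for t in tokens
    ]
    return np.concatenate(frames)

# Full pipeline: text -> tokens -> waveform
waveform = tokens_to_waveform(text_to_tokens("Hello, Kani!"))
```

Because the token sequence is far shorter than the sample stream it encodes, the expensive stage-1 model only has to produce a few tokens per second of audio, which is where the speed advantage described below comes from.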

Try Kani TTS

Experience Kani TTS in action. Select a model, enter your text, and generate high-quality speech instantly.

Performance Benefits

Low Latency Design

The two-stage architecture provides significant speed advantages. Token generation is highly parallelizable, while NanoCodec decoding is near-instantaneous.

This makes Kani TTS ideal for applications requiring real-time responsiveness, such as interactive voice assistants and live content generation.

Multilingual Support

The model is trained primarily on English for robust core capabilities, with tokenizer support for multiple languages including Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

The base model can be continually pretrained on multilingual datasets for enhanced performance across different languages.

Model Variants

Choose from different model variants to match your specific voice characteristics and requirements.

Base Model

nineninesix/kani-tts-450m-0.1-pt

Default model that generates random voices with consistent quality.

Female Voice

nineninesix/kani-tts-450m-0.2-ft

Fine-tuned model optimized for female voice characteristics.

Male Voice

nineninesix/kani-tts-450m-0.1-ft

Fine-tuned model optimized for male voice characteristics.

Audio Examples

“Anyway, um, so, um, tell me, tell me all about her. I mean, what's she like? Is she really, you know, pretty?”

“No, that does not make you a failure. No, sweetie, no. It just, uh, it just means that you're having a tough time...”

“You make my days brighter, and my wildest dreams feel like reality. How do you do that?”

“Great, and just a couple quick questions so we can match you with the right buyer. Is your home address still 330 East Charleston Road?”

Use Cases

Voice Assistants

Real-time speech generation for interactive AI assistants and chatbots.

Gaming

Dynamic voice generation for characters and narration in games.

Content Creation

Audio narration for videos, podcasts, and multimedia content.

Accessibility

Text-to-speech solutions for accessibility applications and tools.

Frequently Asked Questions

What makes Kani TTS different from other TTS models?

Kani TTS uses a unique two-stage architecture combining LiquidAI LFM2-350M for token generation and NVIDIA NanoCodec for waveform synthesis. This design provides exceptional speed and quality while maintaining efficiency for edge device deployment.

What are the system requirements for running Kani TTS?

Kani TTS is optimized for edge devices and affordable servers. The 450M parameter model requires minimal computational resources while delivering high-quality 22kHz audio output. Core dependencies include PyTorch, librosa, soundfile, and numpy.
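The core dependencies listed above can be installed in one step (package names assumed to match their standard PyPI distributions):

```shell
pip install torch librosa soundfile numpy
```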

Can I customize the voice characteristics?

Yes, Kani TTS offers multiple model variants including base models for random voices, and fine-tuned models optimized for specific voice characteristics like female or male voices. You can modify the ModelConfig in config.py to use different models.
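As a sketch of what swapping variants via ModelConfig might look like: the field names below are illustrative assumptions, so check config.py in the Kani TTS repository for the actual definition; only the model identifiers come from the variant list above.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Field names are assumptions for illustration; see config.py
    # in the Kani TTS repo for the real schema.
    model_name: str = "nineninesix/kani-tts-450m-0.1-pt"  # base model
    sample_rate: int = 22050

# Select the fine-tuned female-voice variant instead of the base model
config = ModelConfig(model_name="nineninesix/kani-tts-450m-0.2-ft")
```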

What languages are supported?

While primarily trained on English, the tokenizer supports English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish. The base model can be continually pretrained on multilingual datasets for enhanced performance.

How fast is the speech generation?

Kani TTS is designed for real-time applications with ultra-low latency. The two-stage architecture allows for highly parallelizable token generation followed by near-instantaneous waveform synthesis, making it suitable for interactive applications.

Is there a web interface available?

Yes, Kani TTS includes a FastAPI-based web interface with interactive text input, parameter adjustment, real-time audio generation and playback, download functionality, and server health monitoring.