I shipped a 326 MB Marathi text-to-speech model that runs entirely in your browser via WebGPU — no server, no API keys, just <audio> and a phoneme tokenizer in TypeScript. Type Marathi (or Marathi mixed with English — Minglish), pick a voice, hit synthesize.
Try it yourself: huggingface.co/spaces/shreyask/bol-tts-marathi. Requires WebGPU (Chrome 113+, Edge 113+). The model is ~326 MB, loads once, then cached in your browser. First load takes ~30s; subsequent visits are instant.
Here’s what it sounds like — Asha voice reading a pure-Marathi sentence:
पावसाच्या संध्याकाळी शनिवारवाड्याजवळ गरम कटिंग चहा घेत मी मैत्रिणींना म्हणाले, “आज आकाशातला प्रत्येक ढग एखादी नवी गोष्ट सांगतोय.” (“On a rainy evening near Shaniwarwada, sipping hot cutting chai, I said to my friends — today every cloud in the sky is telling some new story.”)
This post walks through the why and how — why I built a Marathi-specific TTS at all, why Kokoro-82M was the right architecture to bend toward Indic, what the training pipeline looked like (data prep, vocabulary surgery, two-stage fine-tuning, voicepack extraction), and a few of the load-bearing bugs I hit shipping it.
Why Marathi TTS, in a browser?
Marathi has 80+ million speakers — it’s the third-most-spoken language in India. The TTS landscape for it is thin. There’s AI4Bharat’s Indic-TTS, which is great research-quality work but server-bound. There are Microsoft’s and Google’s voice APIs, which are commercial and require a network round-trip and a cloud account for every utterance. There’s nothing small enough to ship as part of a demo, a prototype, an embedded device, or — what motivated this most — a personal app you can hand to a parent or grandparent without explaining what an API key is.
In-browser TTS solves three real problems at once:
- Privacy. The text never leaves the device. For an app that reads private notes or transcribed voice memos back to a user, this matters.
- Latency. Network round-trips dominate the perceived response time of cloud TTS for short utterances. A 4-word sentence takes ~200ms to synthesize locally and ~600ms with a server hop. The local feel is qualitatively different.
- Cost. No per-character billing. No quota anxiety. You ship the model file once and the marginal cost of the millionth synthesis is zero.
For Marathi specifically, the in-browser path also unlocks a use case that doesn’t really exist today: code-switched Minglish. Conversational Marathi is heavily mixed with English loanwords — “Friday ला Zomato वर dinner order करूया का?” — and a TTS that handles the mixture cleanly is something you can’t get from any current commercial offering. (More on the trick that makes this work below.)
So the goal was specific: a sub-100M parameter Marathi TTS, ONNX-deployable, runnable on consumer WebGPU, with at least four named voices and clean handling of Minglish.
Why Kokoro-82M as the base
There’s a small constellation of open-weight TTS models in the under-100M-parameter range. I evaluated a few — KittenTTS-15M, Piper, Kokoro-82M, some smaller research models. Kokoro won on three axes:
Architecture. Kokoro is a StyleTTS2 variant — a phoneme-input model with a PL-BERT text encoder, an LSTM-based duration/F0/N predictor conditioned on a voicepack (a pre-extracted style+prosody embedding for a specific speaker), and an ISTFTNet decoder that produces 24 kHz waveforms. ISTFTNet matters because it’s lighter than HiFi-GAN and exports cleanly to ONNX. The whole graph fits comfortably as a single ONNX file.
Size. 82M parameters at fp32 is 326 MB. Big-but-manageable as a one-time download — comparable to a YouTube video — and small enough that browser caches happily hold onto it.
License. Apache 2.0. No surprises.
There was also a precedent that made the project feel tractable: the semidark/kokoro-deutsch project had already done a German fine-tune of Kokoro with the same general recipe I needed (Stage-2 from a Stage-1 init, voicepack extraction, ONNX export). Most of the pipeline scaffolding could be borrowed and re-pointed at Marathi.
What I actually had to do
The fine-tune was not “just run train.py with Marathi data.” Three pieces of work were Marathi-specific.
1. Phoneme inventory + vocabulary surgery
Kokoro’s vocabulary is fixed by its pretrained checkpoint at 178 IPA-style tokens, mostly English-leaning. Marathi has phonemes that aren’t in that set. Most consequentially: ɭ — the retroflex lateral, which Hindi doesn’t use but Marathi very much does (it’s in केळी = “banana” and सकाळ = “morning”; बोल, the word the project is named after, uses the plain dental ल that ɭ contrasts with). Without ɭ in the vocabulary, every Marathi sentence with a retroflex lateral gets butchered.
You can’t add new tokens to a pretrained Kokoro checkpoint without surgery — the embedding table is sized to 178, the predictor’s input projection expects 178, and adding a 179th row everywhere is the path that burned my first attempt at this project. The cleaner alternative: hijack an unused slot in the existing 178-token vocab. Slot 144 in stock Kokoro corresponds to a phoneme that Marathi doesn’t use at all. So ɭ got slotted into 144 — same dimensionality, no embedding-table changes, and the predictor learns the new phoneme during fine-tuning.
Here’s the model on the retroflex lateral — केळी (banana) and सकाळ (morning) — plus बोल (the project’s name):
For phonemization itself, misaki (Kokoro’s official g2p frontend) supports Marathi via espeak.EspeakG2P(language='mr'). Espeak’s Marathi backend is solid enough to be a viable production phonemizer; the gaps are exactly the loanword problem (more on that below) and a handful of nasalized vowels that escape into rare-token territory.
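To make that concrete, here’s roughly what the tokenization path looks like: a minimal sketch assuming misaki’s `EspeakG2P` is callable on raw Devanagari text and returns the IPA string first. The `MR_VOCAB` fragment is illustrative; the real table has 178 entries, with `ɭ` occupying slot 144:

```python
from misaki import espeak

# espeak-ng's Marathi backend, wrapped by misaki (Kokoro's g2p frontend).
g2p = espeak.EspeakG2P(language="mr")

# Illustrative fragment of the 178-slot phoneme -> id vocab.
# Ids here are made up, except the otherwise-unused slot 144, which now holds ɭ.
MR_VOCAB = {"k": 14, "e": 31, "ɭ": 144, "i": 22, "ː": 55}

def tokenize(text: str) -> list[int]:
    phonemes, _ = g2p(text)  # e.g. "केळी" -> something like "keɭiː" (assumed return shape)
    return [MR_VOCAB[ch] for ch in phonemes if ch in MR_VOCAB]
```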
2. Data: three corpora, three different cleanliness profiles
There’s no single Marathi TTS corpus that’s both clean and large. I ended up using three:
- ai4bharat/IndicVoices-R — Marathi config, 18,776 train samples / ~29 GB, multi-speaker reads at 48 kHz. This is the largest source. Good speaker diversity, mostly clean reads, but the audio quality varies — some speakers were recorded in noisier environments than others. Filtered to SNR > 15 dB, duration 2–15s, picked the cleanest two female and two male speakers as the seed for named voicepacks.
- ai4bharat/Rasa — Hindi+Marathi M/F, professionally recorded, CC-BY-4.0. The cleanest source but smaller. The “Asha” and “Vivek” voices are extracted from Rasa’s Marathi female and male speakers respectively.
- SPRINGLab/IndicTTS_Marathi — IITM-licensed, commercial-OK, similar to Rasa but with different recording conditions and a faster intrinsic speaking rate. The “Mukta” and “Dnyanesh” voices come from this corpus. (More on the speaking-rate quirk later.)
Total training set after filtering, manifest-building, and trim-silence: ~25,500 wavs at 24 kHz mono. About 800 held-out for validation and a few thousand more for OOD listening tests. All three corpora got a unified phonemization pass through espeak-mr with ɭ-aware vocab mapping.
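The per-clip gate is nothing fancy. A sketch of the filter-and-normalize pass, assuming the SNR estimate is computed upstream and passed in (the helper is mine, not a project file):

```python
import librosa
import soundfile as sf

TARGET_SR = 24_000
MIN_SNR_DB, MIN_SEC, MAX_SEC = 15.0, 2.0, 15.0

def prepare_clip(path: str, snr_db: float, out_path: str) -> bool:
    """Filter, trim, and resample one clip while building the training manifest.
    snr_db is assumed precomputed by whatever SNR estimator you use."""
    audio, sr = librosa.load(path, sr=None, mono=True)    # load at native rate, force mono
    dur = len(audio) / sr
    if snr_db <= MIN_SNR_DB or not (MIN_SEC <= dur <= MAX_SEC):
        return False
    audio, _ = librosa.effects.trim(audio, top_db=35)     # trim leading/trailing silence
    audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    sf.write(out_path, audio, TARGET_SR)
    return True
```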
3. The two-stage fine-tune
StyleTTS2 trains in two stages:
Stage 1 trains the style encoder + acoustic model on the autoregressive text-to-mel objective. It’s the cheaper, more stable stage — for a Kokoro fine-tune it converges in ~10 epochs on a single A100 80 GB. Validation loss landed at 0.233 after Stage 1.
Stage 2 brings in the duration predictor, F0/energy heads, the joint adversarial training (the “SLM-adv” objective that uses a frozen speech LM as a critic), and the iSTFTNet decoder updates. This is the expensive, finicky stage — it OOMs at default batch sizes on a single A100, and the upstream train_second.py has a few traps (torch.autograd.set_detect_anomaly(True) is hardcoded at line 10 and causes indefinite hangs; mid-training resume is broken in the fork I was using and you have to restart fresh if it dies).
The recipe that worked: batch size 8, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, anomaly-detect disabled, 5 epochs of Stage 2 from a Stage-1 init. Total compute cost across both stages, including a couple of restart attempts: roughly $30 on a rented A100. That’s the entire training budget for the model that’s now live. By comparison, the smallest commercial Indic TTS contracts I’ve been quoted sit in the $/character range.
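For reference, the setup knobs that mattered, applied before the upstream Stage-2 script gets going (the config field names are illustrative and differ between StyleTTS2 forks):

```python
import os

# Must be set before the CUDA caching allocator is first used.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# The fork's train_second.py hardcodes torch.autograd.set_detect_anomaly(True)
# at line 10, which caused indefinite hangs; flip it off there.
torch.autograd.set_detect_anomaly(False)

stage2_config = {
    "batch_size": 8,                                 # default batch sizes OOM on one A100 80 GB
    "epochs_2nd": 5,                                 # Stage 2 only
    "first_stage_path": "stage1_checkpoint.pth",     # illustrative path to the Stage-1 init
}
```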
4. Voicepack extraction
Kokoro’s deployment story doesn’t ship a single “speaker” embedded in the model — instead, each voice is a [510, 1, 256] float32 tensor extracted from reference audio, slotted in at inference time. The 510 corresponds to the maximum phoneme sequence length; for each token-count, you have a precomputed style+prosody vector. The first 128 channels are the acoustic style (timbre); the last 128 are the prosody style (rhythm, F0 mean/var).
Voicepack extraction runs after Stage 2: for each named voice, you take 40 reference clips of that speaker, run them through the trained style_encoder (acoustic) and predictor_encoder (prosody), and average the resulting style vectors at each token-count slot. The output is a 511 KB .pt file per voice. The current demo ships four: Asha (Rasa-female), Vivek (Rasa-male), Mukta (SpringLab-female), Dnyanesh (SpringLab-male).
A subtle thing: voicepacks must match the encoders they’re going to be conditioned by at inference. If you re-train the model and don’t re-extract voicepacks, the embeddings live in a different latent space than the predictor expects, and the predictor under-allocates phoneme durations on the mismatch. I learned this expensively (story below).
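Mechanically, extraction is just encode-and-average. A rough sketch under a few assumptions: `style_encoder` and `predictor_encoder` each map a mel spectrogram to a 128-dim vector, and each clip is bucketed by the phoneme count of its transcript (the function and its inputs are stand-ins, not the project’s actual script):

```python
import torch

MAX_LEN, STYLE_DIM = 510, 256

@torch.no_grad()
def extract_voicepack(clips, model, out_path):
    """clips: list of (mel, n_phonemes) pairs for ~40 reference recordings of one speaker."""
    slots = torch.zeros(MAX_LEN, 1, STYLE_DIM)
    counts = torch.zeros(MAX_LEN)
    for mel, n in clips:
        acoustic = model.style_encoder(mel.unsqueeze(0))     # timbre -> first 128 dims
        prosody = model.predictor_encoder(mel.unsqueeze(0))  # rhythm/F0 -> last 128 dims
        slots[n - 1, 0] += torch.cat([acoustic, prosody], dim=-1).squeeze(0)
        counts[n - 1] += 1
    filled = counts > 0
    slots[filled] /= counts[filled].view(-1, 1, 1)           # average per token-count slot
    # Slots with no clip of that length stay zero here; in practice you'd fill
    # them from neighbouring slots before saving.
    torch.save(slots, out_path)                              # ~511 KB per voice
```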
The Devanagari trick for Minglish
Conversational Marathi is heavily mixed with English loanwords — “Weekend ला movie बघायचा plan आहे का?” — and the TTS model only knows Marathi phonemes. It has no idea what “movie” looks like as English IPA, and even if I gave it /ˈmuːvi/, the decoder has never seen English phonemes during training.
The trick that makes Minglish work is a client-side preprocessor: before phonemization, transliterate every Latin word into Devanagari using a 19,400-entry lookup table built from IndicCMix. “Movie” becomes “मूव्ही”, “weekend” becomes “वीकेंड”, “Zomato” becomes “झोमॅटो”, and the espeak Marathi backend phonemizes the all-Devanagari result. The model never sees Latin script.
The user never sees the Devanagari either — the textarea preserves their original input; the transliteration only happens in memory before tokenizing. This was important: people writing Minglish are thinking in Latin for the loanwords. Replacing what they typed with मूव्ही would feel like the demo “corrected” them.
The downside: the loanword lookup is only as good as the table. IndicCMix’s frequency-pick gave the wrong Devanagari for some common words (movie came out as मोव्ही, “moo-vee”, instead of मूव्ही). I patched the most painful ones by hand and added India-specific brand names that weren’t in the source corpus — Zepto, Blinkit, UPI, Zomato, Swiggy, Paytm, Aadhaar, etc. The table now has ~19,450 entries.
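The preprocessor itself is tiny; all the value is in the table. A sketch of the substitution step (written in Python for brevity where the live demo does this in the TypeScript client; the entries shown are an illustrative slice of the ~19,450):

```python
import re

# Illustrative slice of the Latin -> Devanagari loanword table.
LOANWORDS = {
    "movie": "मूव्ही",
    "weekend": "वीकेंड",
    "zomato": "झोमॅटो",
    "dinner": "डिनर",
    "order": "ऑर्डर",
    "friday": "फ्रायडे",
}

def devanagarify(text: str) -> str:
    """Replace each Latin-script word with its Devanagari transliteration before
    phonemization. Unknown words are left alone and fall through to espeak-mr."""
    def sub(m: re.Match) -> str:
        return LOANWORDS.get(m.group(0).lower(), m.group(0))
    return re.sub(r"[A-Za-z]+", sub, text)

# devanagarify("Friday ला Zomato वर dinner order करूया का?")
#   -> "फ्रायडे ला झोमॅटो वर डिनर ऑर्डर करूया का?"
```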
Here’s the trick at work — the same Minglish sentence in three voices. Notice the loanwords (Friday, Zomato, dinner, order) are pronounced cleanly because they were transliterated to Devanagari before phonemization:
“Friday ला Zomato वर dinner order करूया का?” (“Should we order dinner from Zomato on Friday?”)
Asha (Rasa-female, 0.85×):
Mukta (SPRINGLab-female, 0.80×):
Vivek (Rasa-male, 0.90×):
Three bugs that mattered
The shipping path was rough. Most of the bugs were boring infrastructure stuff (stale vite caches, ONNX export silently producing zeros under torch 2.9, browser IndexedDB serving the previous model file because URL-based cache invalidation lies) — the kind of thing you fix once, write down, and forget. I’ll skip those.
Three bugs are worth describing because the lessons travel.
The speed slider was a lie
The demo has a speed slider — drag it to slow down the audio. Mukta sounded like she was eating the first word of every sentence. “Plan → pla.” I went deep on three hypotheses:
- Maybe the v0.5 model has a predictor regression. I rendered the same input through PyTorch native, PyTorch with FiLM baked into LSTM weights, and the exported ONNX. All three produced identical `pred_dur` arrays. Not the model.
- Maybe ONNX Runtime Web’s WebGPU EP is dropping precision. Compared local CPU output to the live WebGPU output. Audibly identical. Not the runtime.
- Maybe the voicepack is mismatched against the model. Re-extracted from the matching encoder checkpoint. Audio still sounded eaten.
Three negative results. I almost looked for a fourth hypothesis. Then I changed the slider from 0.90× to 0.75× and the displayed audio length didn’t change. That’s when I read what my own demo code was doing:
```ts
audioEl.playbackRate = speed; // <-- this is the only place "speed" is used
```
The slider was setting the HTML5 <audio>’s playback rate. The model never knew about the slider. The audio buffer was always generated at the predictor’s intrinsic pacing — speed=1.0 baked into the ONNX export. Slowing playback rate just played the already-crammed buffer back slower. If the predictor allocated 1-2 frames to a final consonant, slowing the audio doesn’t un-cram it; it plays the cramming back in slow motion.
For Mukta — extracted from a SPRINGLab speaker who naturally talks fast — speed=1.0 means under-allocated frames on word-final phonemes. Specifically: leading consonants of the first word get no leading silence to absorb the rushed onset, so they sound clipped.
The fix is architectural. Kokoro’s KModel.forward_with_tokens already supports speed natively — duration = sigmoid(...) / speed — but our ONNX export wrapper had baked speed=1.0 as a Python constant. Re-exporting with speed as a third dynamic ONNX input made the slider real:
```python
class _KokoroONNXWrapper(torch.nn.Module):
    def __init__(self, kmodel):
        super().__init__()
        self.kmodel = kmodel

    def forward(self, input_ids, ref_s, speed):  # speed: float32[1]
        return self.kmodel.forward_with_tokens(
            input_ids=input_ids, ref_s=ref_s, speed=speed[0]
        )

wrap = _KokoroONNXWrapper(kmodel)
torch.onnx.export(
    wrap, (dummy_ids, dummy_ref_s, torch.tensor([1.0])),
    "kokoro_mr.onnx",  # output path (placeholder name)
    input_names=["input_ids", "ref_s", "speed"],
    output_names=["audio"],
    dynamic_axes={"input_ids": {1: "n_phonemes"}, "audio": {1: "n_samples"}},
    opset_version=17, dynamo=False,
)
```
At speed=0.75, durations now expand to 1.30× — close to the ideal 1/0.75 ≈ 1.33, with the small gap from integer rounding. The TS client passes `new Tensor("float32", [speed], [1])`. `audioEl.playbackRate` stays at 1.0 — the audio buffer arrives already at the requested pacing.
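That ratio check is easy to reproduce against the exported graph with plain onnxruntime. The model path and the stand-in inputs below are placeholders; the shape of the test is the point:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("kokoro_mr.onnx")       # placeholder path

input_ids = np.zeros((1, 12), dtype=np.int64)       # stand-in phoneme ids
ref_s = np.zeros((1, 256), dtype=np.float32)        # stand-in voicepack slot

def n_samples(speed: float) -> int:
    (audio,) = sess.run(["audio"], {
        "input_ids": input_ids,
        "ref_s": ref_s,
        "speed": np.array([speed], dtype=np.float32),
    })
    return audio.shape[-1]

print(n_samples(0.75) / n_samples(1.0))             # ~1.30 here; ideal is 1/0.75 ≈ 1.33
```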
The first-word eat went away.
The lesson: in TTS, speed-as-playback-rate and speed-as-frame-allocation sound the same to a casual listener but are completely different operations. The first plays a pre-cooked buffer faster or slower; the second changes what gets cooked. If your slider controls playback rate, your slider is decorative. Test with a fast-paced voicepack — fast voicepacks are the ones that expose the lie.
The model I retrained was correct, and that was the problem
With the speed input working, I went back to ship v0.5 — a model I’d trained with a “Stage-2-only-from-Kokoro” recipe that adapts only the Marathi-relevant layers and inherits the rest from stock Kokoro. Cleaner Marathi, ~10× cheaper to train than a from-scratch Stage-2. Validation loss was meaningfully lower than v0.2.
Smoke-tested the matched stack:
| Stack | Audio length | “amazing” sounds like |
|---|---|---|
| v0.2 model + v0.2 voicepacks (currently live) | 7.80s | “amazing” ✓ |
| v0.5 model + v0.2 voicepacks (the bad ship I’d reverted earlier) | 6.40s | artifacts + under-allocation |
| v0.5 model + v0.5 voicepacks (the right combo) | 7.60s | “amajing” ✗ |
The matched v0.5 stack was structurally clean. The earlier eat-symptom on v0.5 was, indeed, the voicepack mismatch I’d dismissed. Frame counts were within 2% of the v0.2 baseline. By every metric I had, v0.5 was the better model.
But v0.5 said “amajing.” v0.2 said “amazing.” Listen to the exact same input — “Artificial intelligence ही technology खूप amazing आहे.” — through both models, with matched voicepacks:
v0.2 (currently shipped) — note “amazing”:
v0.5 (retrained, lower val loss, “more correct” Marathi) — now “amajing”:
The input phoneme is the same in both cases: espeak-mr renders अमेझिंग (“amazing” → Devanagari → IPA) with /ɟʰ/ for the झ glyph. v0.2’s decoder rendered that /ɟʰ/ with an English-flavored /z/ quality; v0.5’s decoder rendered the same phoneme as a faithful Marathi /ʤ/.
v0.5 was more correctly Marathi. v0.2 had been getting English loanwords right by accident — its decoder, having descended from a German-Kokoro continuation that had English somewhere in its training history, leaked an English-leaning quality onto /ɟʰ/. That accidental leakage was exactly what the Devanagari transliteration trick depended on.
For pure Marathi, v0.5 is the better model. For a Minglish demo, it was strictly worse.
I rolled back to v0.2 and kept the v0.5 artifacts on disk for future re-evaluation. The demo still ships on v0.2.
The lesson: validation loss on training-distribution data tells you nothing about how the model will perform on the actual user task — especially when the task depends on accidental behavior. If your demo leans on a quirky thing the model does, that thing has to be a deliberate test target, not something assumed to survive a retraining run. The v0.6 plan now has an explicit goal — preserve the English leakage in the decoder, while improving the Marathi-pure quality — that didn’t exist before this test.
A free 5× voice expansion (the trick I didn’t see coming)
The demo originally shipped four voices — Asha, Vivek, Mukta, Dnyanesh — extracted from Marathi corpora using v0.2’s encoders. Standard recipe.
Then someone asked: “crazy idea — what if we extracted a voicepack from English audio?”
Kokoro’s voicepack is just a [510, 1, 256] tensor of style + prosody embeddings. It’s whatever the style_encoder and predictor_encoder produce when fed reference audio. v0.2’s encoders are a continuation fine-tune of stock Kokoro — they started from the stock weights and only the Marathi-specific layers drifted during Stage 2.
So the latent space stays close. Stock Kokoro voicepacks (the af/am/bf/bm/ff/im/pf/zf/zm series shipped with kokoro-js) were extracted through stock Kokoro’s encoders. They should still be valid ref_s inputs to v0.2’s predictor + decoder.
I synthesized a Marathi sentence and a Minglish sentence with all 53 stock voicepacks. 43 of 53 produced clean output (peak < 0.95, no clipping, F0 in voice range). Some sounded like a foreign speaker reading Marathi with a thick accent. Some sounded indistinguishable from a clean Marathi voice. A few even sounded better than my trained voices — specifically on punctuation prosody, because stock Kokoro was trained on LibriTTS and similar corpora that preserved natural pre/post-clip silence, while my Marathi training data was silence-trimmed. The trimming threw away the prosody signal the predictor needs to learn pause durations.
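The screening was a dumb loop: synthesize two test sentences with each stock voicepack through the v0.2 graph and keep the ones that pass simple health checks. A sketch, assuming a `synthesize(text, voicepack)` wrapper around the ONNX session (the wrapper and the voicepack dict are stand-ins):

```python
import numpy as np
import librosa

SR = 24_000
TEST_SENTENCES = [
    "आज आकाशातला प्रत्येक ढग एखादी नवी गोष्ट सांगतोय.",   # pure Marathi
    "Friday ला Zomato वर dinner order करूया का?",          # Minglish
]

def is_clean(voicepack) -> bool:
    """Health check for one stock voicepack run through the v0.2 graph."""
    for text in TEST_SENTENCES:
        audio = synthesize(text, voicepack)      # hypothetical wrapper around the ONNX session
        if np.abs(audio).max() >= 0.95:          # clipping risk
            return False
        f0, voiced, _ = librosa.pyin(audio, fmin=60, fmax=400, sr=SR)
        if not voiced.any() or np.isnan(np.nanmedian(f0[voiced])):
            return False                         # no usable F0 -> not voice-like output
    return True

# stock_voicepacks: dict of name -> [510, 1, 256] tensor, loaded elsewhere
keepers = [name for name, vp in stock_voicepacks.items() if is_clean(vp)]
```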
The end result: 19 stock voicepacks adopted as new voices in the demo, with neo-Marathi names — Svara (स्वरा, US-EN female), Vihaan (विहान, Hindi male), Tara (तारा), Kavya (काव्या), Aakash (आकाश), and a dozen more spanning American, British, French, Italian, Portuguese, and Mandarin source speakers. Plus two child voices (Pari परी and Vir वीर). Voice count went from 4 → 23 in twenty minutes, zero training cost.
A taste — both of these are reading pure Marathi sentences. Speaker timbre is fully borrowed from the stock voicepack; phonemes + prosody flow are produced by v0.2’s predictor + decoder:
Svara (स्वरा — stock af_heart voicepack, US English female speaker, on Marathi):
Vihaan (विहान — stock hm_omega voicepack, Hindi male speaker, on Marathi):
Caveats: 10 of 53 voicepacks had peak > 0.95 (clipping risk) and got dropped. This trick only works for continuation fine-tunes — if you trained from scratch, your latent space is its own thing and stock voicepacks won’t transfer. For Kokoro fine-tunes specifically (any language, presumably), it’s a free voice-expansion mechanism worth knowing about.
The unexpected lesson: voicepacks travel across compatible models. They’re not as tied to the model that extracted them as I’d assumed. A “voicepack zoo” — extracted once, reusable across compatible fine-tunes — is a plausible deployment shape.
What’s now live
- Model: v0.2 Kokoro fine-tune, 82M params, fp32 ONNX (326 MB), with a `speed` input that actually works.
- Voices: 23 total. Four trained on Marathi corpora (Asha + Vivek from Rasa, Mukta + Dnyanesh from SPRINGLab) plus 19 stock-Kokoro crossovers used as drop-in voicepacks (the trick from the previous section). Includes 11 adult females, 4 adult males, 2 children, and source-language diversity across US/UK English, French, Italian, Portuguese, Mandarin, and Hindi.
- Phonemization: misaki + espeak-mr with `ɭ` slotted at vocab position 144.
- Minglish: client-side Devanagari transliteration table, ~19,450 entries with hand-curated overrides for the high-frequency mispronunciations and India-specific brands (Zepto, Blinkit, UPI, Zomato, Swiggy, Paytm, Aadhaar, etc.).
- Punctuation pauses: client-side silence splice (200-300 ms after `, ; : — . ! ? …`) compensates for the v0.2 predictor’s under-allocation (one way to implement it is sketched after this list); the crossover voicepacks need it less because they encode richer prosody from clean training corpora.
- Speed control: the slider scales the predictor’s per-phoneme duration before frame upsampling — not just `<audio>.playbackRate`. Real frame allocation, not playback-rate cosmetics.
- Performance card: the demo surfaces Time-to-First-Audio (ms) and Real-Time Factor (×) so you can see the WebGPU vs WASM-fallback gap.
- Downloads: every voice has a 🎧 reference clip, and every synth has a ⬇ download for the WAV.
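One way to implement that punctuation-pause splice (not necessarily the demo’s exact mechanics): split the text at pause punctuation, synthesize each chunk, and concatenate with silence in between. The `synthesize` callable is a stand-in for the ONNX wrapper:

```python
import re
import numpy as np

SR = 24_000
PAUSE_SEC = {",": 0.20, ";": 0.25, ":": 0.25, "—": 0.25, ".": 0.30, "!": 0.30, "?": 0.30, "…": 0.30}

def synth_with_pauses(text: str, synthesize) -> np.ndarray:
    """Splice fixed silences after pause punctuation.
    `synthesize` is a (hypothetical) text -> float32 waveform callable."""
    pieces = []
    for chunk, punct in re.findall(r"([^,;:—.!?…]+)([,;:—.!?…]?)", text):
        if chunk.strip():
            pieces.append(synthesize(chunk.strip() + punct))
        if punct:
            pieces.append(np.zeros(int(PAUSE_SEC[punct] * SR), dtype=np.float32))
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
```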
Things that don’t work yet:
- Pure-English sentences (the decoder hallucinates Marathi acoustics if you don’t give it any Devanagari context).
- Multi-word English proper nouns (“Schrödinger’s cat” → ?). The transliterator works on single tokens.
- Long-tail loanwords (the table covers maybe the top ~5K English words by frequency in Indian Twitter; everything past that falls back to misaki’s espeak-mr output, which is hit-and-miss).
- v0.6’s explicit goal: keep v0.2’s English-leaning decoder behavior while improving everything else. Probably synthetic English data mixed into Stage 2. We’ll see.
Try it
huggingface.co/spaces/shreyask/bol-tts-marathi. Type Marathi. Type Minglish (“Friday ला Zomato वर dinner order करूया का?”). Pick a voice. Hit synthesize. The first synth downloads ~326 MB of model and voicepack data once; subsequent ones are sub-second.
If something sounds wrong — eaten word, weird vowel, missing loanword — please tell me. The two biggest improvements in this past week happened because someone said “movie sounds like moo-vee” and “plan → pla, Mukta is eating words.” User reports beat training metrics every time.
The Space’s source (TS client, vocab JSON, voicepack .bin files, loanword map) is browsable directly in the Space’s Files tab — fork it, change a voicepack, edit the textarea defaults, host your own. The model, ONNX file, and voicepack .pt files live at shreyask/bol-tts-marathi-onnx.