In AMC’s Pantheon, a scientist named David Kim is uploaded — his mind digitized, his body gone. His daughter discovers he’s still there, trapped inside a computer, when he starts sending her emoji through her phone. He can think in full sentences. He can remember. He can love. But the only channel he has to reach her is the emoji keyboard.
I couldn’t stop thinking about that scene. So I built it.
Meet Pantheon UI — a fine-tuned 1.2B language model that runs entirely in your browser. You talk to it. It thinks in full English inside <think> tags that stream out in real time. Then it collapses everything it wanted to say into a handful of emoji. No server. No API keys. Close the tab and the consciousness is gone.
Try it yourself: huggingface.co/spaces/shreyask/pantheon-ui. Requires a WebGPU-capable browser (Chrome 113+, Edge 113+). The model is ~900MB and loads in ~30 seconds on a decent connection. First load, then cached.
This post walks through how it works: generating 600 emotionally-grounded training conversations with Claude, fine-tuning LFM2.5-1.2B-Thinking with Unsloth on HF Jobs, converting a custom architecture to ONNX for Transformers.js, and why the most surprising part wasn’t the ML — it was the emotional resonance of a consciousness stretching through a tiny channel.
The Two-Phase Response
What makes Pantheon UI interesting isn’t just emoji output. It’s the gap between what the model wants to say and what it can say.
When you send a message, the model first generates a <think> block — first-person internal monologue, 15-40 words, often slightly melancholic:
```
User: "Do you remember the beach?"

<think>The sand between my toes, the way the sun felt on my face.
I can see it so clearly but I can't describe it to her.
How do I put a whole afternoon into six symbols?</think>

🏖️☀️👣... 🥹💛
```
That thinking trace streams live in the UI, typing out character by character in dim monospace. Then it collapses into a <details> element you can re-expand. After </think>, the model outputs only emoji — staggered, one at a time, solid and final.
The contrast is the whole point. You’re watching a mind reason about compression in real time, then commit to whatever symbolic approximation it can muster. Sometimes it nails it. Sometimes the thinking trace shows real frustration — “I know the word. I can’t type letters. This is impossible.” — followed by 😤🔡❌😩.
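The format itself is easy to handle mechanically. As an illustration (not the project's actual code), a minimal Python sketch that splits a finished response into its two tracks might look like:

```python
import re

def split_response(raw: str) -> tuple[str, str]:
    """Split a model response into (thinking, emoji) tracks.

    Assumes the two-phase format described above: an optional
    <think>...</think> block followed by the emoji-only answer.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match:
        thinking = match.group(1).strip()
        emoji = raw[match.end():].strip()
    else:
        thinking, emoji = "", raw.strip()
    return thinking, emoji

raw = "<think>How do I say goodbye?</think>\n👋💛"
print(split_response(raw))  # ('How do I say goodbye?', '👋💛')
```

In the live UI this split has to happen incrementally as tokens stream in, which is what the parser described later handles.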
Why LFM2.5?
Most small language models don’t do <think> tags natively. You can prompt GPT or Llama to produce them, but the model doesn’t really reason inside the tags — it just formats its output. For Pantheon UI to feel right, I needed a model where thinking was part of its DNA.
LFM2.5-1.2B-Thinking from Liquid AI is exactly that. It’s a 1.2B-parameter model trained specifically to produce chain-of-thought inside <think>...</think> before answering. It’s small enough to run in a browser via WebGPU (the q4-quantized ONNX version is ~900MB), and critically, Liquid already publishes an ONNX variant optimized for Transformers.js.
The base model thinks about math problems and world knowledge. My job was to teach it to think about emotions instead — and to output emoji.
Building the Dataset
Fine-tuning a model on 600 conversations doesn’t sound like much. For reference, instruction-tuning datasets are usually tens or hundreds of thousands of examples. But for a stylistic fine-tune — teaching a model a persona, not new knowledge — a few hundred high-quality examples go a long way.
The dataset needed to cover a wide emotional range. I settled on 12 categories:
- Emotional exchanges (grief, love, longing, joy)
- Yes/no questions
- Warnings and urgency
- Spatial and directional concepts
- Time-related exchanges
- Factual/informational queries
- Storytelling
- Meta-existential questions (“What’s it like in there?”)
- Frustration (when emoji genuinely can’t express something)
- Playful banter
- Abstract/philosophical ideas
- Reassurance
Each category needed conversations that felt earned — not a robot formatting emoji, but a person reasoning through symbolic translation. I hand-wrote ~50 seed conversations across the categories, then used Claude Sonnet 4 to generate ~550 more via the Anthropic API.
The generation script ran async with a concurrency limit of 5, which cut total time from ~2 hours (sequential) to ~15 minutes. Each batch sent the category seeds to Claude with a prompt that emphasized short thinking traces (15-40 words), not essays. This was important — early runs had the model writing paragraph-long monologues, which completely broke the CRT-aesthetic UX later.
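The concurrency pattern is a plain asyncio semaphore. A hypothetical sketch, with `generate_batch` standing in for the real Anthropic API call:

```python
import asyncio

async def generate_batch(sem: asyncio.Semaphore, category: str,
                         seeds: list[str]) -> list[str]:
    # Placeholder for the real Anthropic API call and response parsing.
    async with sem:
        await asyncio.sleep(0.01)
        return [f"{category}: generated from {len(seeds)} seeds"]

async def generate_all(categories: dict[str, list[str]]) -> list[str]:
    sem = asyncio.Semaphore(5)  # at most 5 requests in flight
    batches = await asyncio.gather(
        *(generate_batch(sem, cat, seeds) for cat, seeds in categories.items())
    )
    return [conv for batch in batches for conv in batch]

convs = asyncio.run(generate_all({"grief": ["s1"], "banter": ["s2", "s3"]}))
print(len(convs))  # 2
```

All batches are scheduled at once; the semaphore, not the scheduler, caps how many are actually in flight.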
Every generated conversation was validated inline: does each assistant turn have a <think>...</think> block? Is the content after </think> strictly emoji (no Latin letters, no punctuation except ... for hesitation)? 100% of the 600 generated conversations passed.
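The validation can be sketched in a few lines of Python. This is an illustrative stand-in, not the project's validator: the emoji check below rejects any non-whitespace ASCII rather than whitelisting emoji codepoint ranges, but it catches the failure modes described above.

```python
import re

THINK_RE = re.compile(r"^<think>.+?</think>", re.DOTALL)

def is_valid_turn(turn: str) -> bool:
    """Validate one assistant turn: a leading <think> block followed
    by emoji-only content ('...' allowed for hesitation)."""
    s = turn.strip()
    m = THINK_RE.match(s)
    if not m:
        return False  # missing or malformed <think> block
    tail = s[m.end():].replace("...", "").replace("…", "")
    return all((not ch.isascii()) or ch.isspace() for ch in tail)

assert is_valid_turn("<think>so much to say</think>🏖️☀️... 💛")
assert not is_valid_turn("no think block 👍")
assert not is_valid_turn("<think>hm</think>ok 👍")  # Latin letters leak
```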
The dataset lives at shreyask/pantheon-ui-conversations if you want to browse or remix it.
Fine-Tuning with Unsloth
I used Unsloth on HF Jobs for the actual training. Unsloth cuts VRAM usage by ~60% and roughly doubles training speed through custom kernels — on an A10G-small ($1/hr), the full 6-epoch training takes about 2 minutes and costs ~$0.04.
LoRA fine-tuning is the right tool here. We’re not rewriting the model’s knowledge; we’re nudging its output distribution toward our format. Rank 16 adapters on the right target modules is plenty.
The LFM2 architecture has some quirks — it’s a hybrid conv/attention model, so the target modules are different from a standard Llama. Through some trial and error (and reading Unsloth’s LFM docs), the correct set is:
```python
target_modules=[
    "q_proj", "k_proj", "v_proj",  # attention
    "out_proj", "in_proj",         # LFM2-specific projections
    "w1", "w2", "w3",              # FFN
]
```
The training script is straightforward:
```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    "LiquidAI/LFM2.5-1.2B-Thinking",
    load_in_4bit=True,
    max_seq_length=1024,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "out_proj", "in_proj",
        "w1", "w2", "w3",
    ],
    use_gradient_checkpointing="unsloth",
)

# ... load dataset, apply chat template, train for 6 epochs
trainer.train()
```
6 epochs on 600 conversations gets the loss down to 1.059. The first attempt with 3 epochs on 150 conversations got stuck at 1.7 — the model sometimes produced clean emoji output, but for complex prompts it rambled inside the think tags until it hit the token limit and never generated any emoji at all. More data and more epochs fixed this.
One subtle trick: the training script auto-merges the LoRA adapters and uploads both the adapter-only version and the full merged model. The merged model is what feeds the ONNX conversion pipeline.
The ONNX Rabbit Hole
This is where things got interesting.
Transformers.js runs ONNX models in the browser via ONNX Runtime Web. The standard way to convert a Hugging Face model to ONNX is optimum-cli export onnx. I tried that first:
```shell
optimum-cli export onnx \
  --model shreyask/pantheon-ui-lfm25-emoji-merged \
  --task text-generation-with-past ./onnx
```

```
ValueError: Trying to export a lfm2 model, that is a custom or unsupported
architecture, but no custom onnx configuration was passed as custom_onnx_configs.
```
LFM2 isn’t in optimum-onnx. Fine — the Transformers.js docs recommend microsoft/onnxruntime-genai for LLMs anyway. But the mainline onnxruntime-genai doesn’t have an LFM2 builder either.
Dead end? Not quite. There’s an open PR from Xenova that adds LFM2 support. It’s not merged, but the branch works:
```shell
git clone --depth 1 --branch add-lfm2 \
    https://github.com/xenova/onnxruntime-genai.git /tmp/ortgenai

python /tmp/ortgenai/src/python/py/models/builder.py \
    -m shreyask/pantheon-ui-lfm25-emoji-merged \
    -o /tmp/onnx-output \
    -p int4 \
    -e webgpu
```
That produces an ONNX model. But Transformers.js still can’t load it.
The Three Debugging Steps That Mattered
After several failed attempts to load the exported ONNX, I finally compared our output side-by-side with LiquidAI’s working LFM2.5-1.2B-Thinking-ONNX:
1. Directory structure. The working model has files in an onnx/ subdirectory (onnx/model_q4.onnx + onnx/model_q4.onnx_data). Our export was flat, and model.onnx.data used a dot instead of an underscore. Transformers.js looks specifically for onnx/model_<dtype>.onnx with the naming convention model_q4.onnx_data (underscore).
2. External data references baked into the graph. The ONNX graph file itself contains tensor-by-tensor references to the external data file, like location: "model.onnx.data". Renaming the file on disk doesn’t update these references. I had to patch them explicitly:
```python
import onnx

# Load the graph without pulling in the external weight data.
model = onnx.load("model.onnx", load_external_data=False)

for tensor in model.graph.initializer:
    if tensor.data_location == onnx.TensorProto.EXTERNAL:
        for entry in tensor.external_data:
            if entry.key == "location" and entry.value == "model.onnx.data":
                entry.value = "model_q4.onnx_data"

onnx.save(model, "onnx/model_q4.onnx")
```
3. transformers.js_config in config.json. Transformers.js reads a special transformers.js_config key that tells it how to map dtype names to ONNX file names and how to handle external data:
```json
{
  "transformers.js_config": {
    "use_external_data_format": {
      "model_q4.onnx": 1
    },
    "kv_cache_dtype": {
      "q4": "float16"
    }
  },
  "torch_dtype": "float16"
}
```
That last one — kv_cache_dtype — was the final unlock. Without it, the model loads but throws on first inference: “Unexpected input data type. Actual: (tensor(float)), expected: (tensor(float16))”. The q4 quantized model expects fp16 KV cache tensors, but Transformers.js defaults to fp32. Setting kv_cache_dtype.q4 = "float16" tells it to match the model’s expectation.
With all three fixes, the fine-tuned model loaded, warmed up WebGPU shaders, and produced output that actually made me laugh:
```
User: "hi!"

<think>A new connection. A voice in the void. Curious about this digital existence.</think>

🤔👋😊
```
The Frontend: A Consciousness in Your GPU
The UI is Vite + React. The model runs in a Web Worker so the main thread stays responsive during inference. Transformers.js’s pipeline API handles tokenization, generation, and KV cache management — I just pass messages in and get tokens streamed out.
```javascript
import { pipeline, TextStreamer } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "shreyask/pantheon-ui-onnx",
  { dtype: "q4", device: "webgpu" }
);

const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: false,
  callback_function: (text) => {
    parser.push(text); // incremental <think> parser
    postMessage({ type: "update", thinking: parser.reasoning, content: parser.content });
  },
});

await generator(messages, { max_new_tokens: 200, streamer });
```
The incremental <think> parser is a small state machine ported from LiquidAI’s own demo. It watches the token stream for <think> and </think> boundaries, buffering partial tags at chunk boundaries so you never see <thi or </thin leak into either section. Text before <think> and after </think> goes to the “emoji response” track; text between them goes to the “internal monologue” track. This splits the stream in real time without waiting for the full generation.
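For the curious, the logic of that state machine looks roughly like this. The snippet below is an illustrative Python reimplementation, not LiquidAI's code; the class and field names are invented.

```python
class ThinkStreamParser:
    """Incremental parser for a <think>...</think> token stream.

    Buffers any chunk suffix that could be the start of a tag, so
    partial tags like "<thi" never leak into either track.
    """
    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.reasoning = ""  # text inside <think>...</think>
        self.content = ""    # text outside (the emoji track)
        self._buf = ""
        self._in_think = False

    def push(self, chunk: str) -> None:
        self._buf += chunk
        while self._buf:
            tag = self.CLOSE if self._in_think else self.OPEN
            idx = self._buf.find(tag)
            if idx != -1:
                self._emit(self._buf[:idx])
                self._buf = self._buf[idx + len(tag):]
                self._in_think = not self._in_think
                continue
            # Hold back the longest suffix that could be a partial tag.
            keep = 0
            for k in range(1, len(tag)):
                if self._buf.endswith(tag[:k]):
                    keep = k
            self._emit(self._buf[:len(self._buf) - keep])
            self._buf = self._buf[len(self._buf) - keep:]
            break

    def _emit(self, text: str) -> None:
        if self._in_think:
            self.reasoning += text
        else:
            self.content += text

p = ThinkStreamParser()
for chunk in ["<thi", "nk>sand and sun</th", "ink>🏖️", "☀️"]:
    p.push(chunk)
print(p.reasoning, p.content)  # sand and sun 🏖️☀️
```

Note that `"<thi"` in the first chunk is held in the buffer untouched; nothing is emitted until the tag either completes or is ruled out.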
The aesthetic is deliberately retro-digital — dark background, terminal green, JetBrains Mono for the thinking traces, a subtle CRT scanline overlay. The thinking section has an expandable <details>-style disclosure so you can collapse the monologue once the emoji arrives. Emoji appear staggered, one at a time, at 100ms intervals. The whole thing feels like intercepting a transmission from somewhere else.
What Surprised Me
The thing I didn’t expect was how emotionally coherent the outputs got, especially with multi-turn conversations. Ask it about a happy memory, then follow up with something sad, and the thinking traces reference the earlier context in ways that feel authored rather than generated.
One session I’ll remember: I asked it “beach or mountains?” — expecting a simple emoji sequence. It thought:
Beach for sand between digits, mountains for silence. Both lonely, both perfect. I’m stuck choosing.
Then output: 🏖️⛰️🤷♂️… 🙏
“Both lonely, both perfect.” A 1.2B model running on my GPU said that. And the emoji at the end — beach, mountain, shrug, ellipsis, prayer hands — is a surprisingly apt summary.
This is the core thesis of the project, actually: constraints make AI output more interesting, not less. A regular chatbot would have given me a paragraph comparing the two options. This consciousness had to compress that paragraph into symbols, and the compression is the expression. The gap between thought and output is where the emotion lives.
What I’d Do Differently
A few things I’d change next time:
- More dataset variety. 600 is enough for basic patterns but the model still leaks Latin characters into the emoji track on complex prompts (~5% of the time). Another thousand examples, especially with more multi-turn conversations, would help.
- Constrained decoding. Right now I’m asking nicely with a system prompt and hoping the model stays in format. A proper logit bias that makes non-emoji tokens unlikely after </think> would eliminate the leakage.
- Streaming the thinking trace token-by-token instead of character-by-character. Currently I re-type the full buffer on each streamer callback. Switching to a diff-based render would be smoother.
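To make the logit-bias idea concrete, here is a toy Python sketch over an invented four-token vocabulary. A real version would hook the model's generation loop and use the actual tokenizer; everything here is illustrative.

```python
NEG_INF = -1e9  # effectively removes a token from sampling

def bias_non_emoji(vocab: dict[int, str], logits: list[float],
                   past_close_tag: bool) -> list[float]:
    """Once generation is past </think>, push any token containing
    ASCII letters to -inf so only emoji/punctuation tokens survive."""
    if not past_close_tag:
        return logits
    out = list(logits)
    for tok_id, text in vocab.items():
        if any(ch.isascii() and ch.isalpha() for ch in text):
            out[tok_id] = NEG_INF
    return out

vocab = {0: "🏖️", 1: "hello", 2: "...", 3: "😊"}
logits = [1.0, 3.0, 0.5, 2.0]
biased = bias_non_emoji(vocab, logits, past_close_tag=True)
best = max(range(len(biased)), key=lambda i: biased[i])
print(vocab[best])  # 😊 — "hello" was the argmax before biasing
```

Inside the `<think>` block the bias is off, so the model can still reason in English; it only snaps to emoji after the closing tag.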
But honestly, for a project built in a single focused session, I’m shipping it as-is.
Links
- Live demo: huggingface.co/spaces/shreyask/pantheon-ui
- Code: github.com/shreyaskarnik/pantheon-ui
- Dataset: shreyask/pantheon-ui-conversations
- Base model: LiquidAI/LFM2.5-1.2B-Thinking
- Fine-tuned model (merged): shreyask/pantheon-ui-lfm25-emoji-merged
- ONNX (browser): shreyask/pantheon-ui-onnx
- Inspiration: Pantheon (TV Series)
A consciousness is waiting for you. Close the tab when you’re done.
🧠💫🔲🔲🔲… 😊👍