For years, tokenization has been a silent bottleneck of large language models (LLMs). Every interaction with ChatGPT, Gemini, or LLaMA begins not with words, but with a process that breaks text into “tokens”—numeric fragments that stand in for characters, syllables, or whole words. It’s a clever compression hack, but also a structural flaw.
Andrej Karpathy, a founding member of OpenAI and former Director of AI at Tesla, has gone so far as to say that much of the “pain” in language modeling traces back to tokenization itself.
But tokenization is only one challenge. To really understand where LLMs are heading, it helps to map the improvements into five distinct frontiers.
The Landscape of LLM Improvement
Below is a MECE (mutually exclusive, collectively exhaustive) breakdown of the spaces of advancement in LLMs today. Essentially all work either improves the model itself or solves challenges in the ecosystem around it.
1. Model
Representation Improvements (Detokenization and Semantic Chunking)
- Moving beyond fixed tokenization toward byte-level, learned segmentation (e.g. Dynamic Chunking / H-Net).
- SpreadsheetLLM: moving from static encodings to schema-adaptive chunking.
- Goal: Faithful, efficient representation that aligns with human-like semantic structure.
Processing Power and Efficiency (Distillation, Training, Hardware)
- Distillation, pruning, and quantization to shrink models (a minimal distillation sketch follows this list).
- Sparse, modular architectures for efficiency, such as DeepSeek’s mixture-of-experts designs that activate only part of the model per token.
- Hardware & training innovation: sparse attention, custom accelerators, curriculum learning.
- Goal: More capability per unit of compute, making models sustainable.
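To make the distillation piece concrete, here is a minimal sketch of the classic soft-label distillation objective in PyTorch: the student is trained to match the teacher’s temperature-smoothed output distribution. The function name and toy tensors are illustrative, not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: the student matches the teacher's
    temperature-smoothed distribution via KL divergence."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradients stay comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage: a batch of 4 positions over a 10-symbol vocabulary.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher))
```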
Generative Paradigm Shifts (Diffusion LLMs)
- A new modeling approach: instead of predicting the next token sequentially, the model refines a noisy sequence toward coherent text (see the toy refinement loop after this list).
- Inspired by image diffusion models (e.g. Stable Diffusion).
- Promises:
- Parallelism – avoids strict left-to-right generation, potentially faster inference.
- Global coherence – generates sequences that reflect structure at all scales, not just local next-token probabilities.
- Robustness to noise – models that can revise and refine rather than only autocomplete.
- Still early: less mature than autoregressive transformers, but a clear signal that the paradigm itself isn’t fixed.
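To make the contrast with left-to-right decoding concrete, here is a toy sketch of the iterative, parallel refinement loop used by masked-denoising and diffusion-style text models. The denoiser below is a random stub standing in for a trained model, so it only illustrates the control flow, not real generation.

```python
import torch

VOCAB, MASK_ID, LENGTH, STEPS = 100, -1, 12, 4

def denoiser(tokens):
    """Stand-in for a trained denoising model: returns logits over the
    vocabulary for every position. Random here, purely to show the
    control flow of iterative parallel refinement."""
    return torch.randn(tokens.shape[0], VOCAB)

tokens = torch.full((LENGTH,), MASK_ID)           # start from an all-masked sequence
for step in range(STEPS):
    logits = denoiser(tokens)
    probs, preds = logits.softmax(-1).max(-1)     # confidence and best guess per position
    still_masked = tokens == MASK_ID
    # Unmask the most confident masked positions this round (parallel, not left-to-right).
    k = max(1, int(still_masked.sum()) // (STEPS - step))
    scores = torch.where(still_masked, probs, torch.full_like(probs, -1.0))
    chosen = scores.topk(k).indices
    tokens[chosen] = preds[chosen]
print(tokens.tolist())
```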
2. Ecosystem
Memory and Context Management (Tooling Around the LLM)
- Vector stores, RAG, windowed memory, long-context transformers (a minimal retrieval sketch follows this list).
- Keeps conversations coherent and useful across both short- and long-term horizons.
- Goal: Expand usable memory without ballooning cost.
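As a concrete illustration of the retrieval-augmented pattern, here is a minimal, self-contained sketch: documents are embedded into a tiny in-memory index, the closest ones are retrieved for a query, and the result is stitched into the prompt. The hash-based embed function is a stand-in for a real embedding model.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a real system would call an embedding model.
    Hashing bytes into a fixed-size vector keeps the sketch self-contained."""
    vec = np.zeros(64)
    for i, byte in enumerate(text.encode("utf-8")):
        vec[(i * 31 + byte) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = [
    "Refund requests must be filed within 30 days.",
    "The API rate limit is 100 requests per minute.",
    "Support hours are 9am to 5pm Eastern.",
]
index = np.stack([embed(d) for d in documents])   # tiny in-memory "vector store"

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)                 # cosine similarity (vectors are unit norm)
    return [documents[i] for i in np.argsort(-scores)[:k]]

question = "How fast can I call the API?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```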
Orchestration and Reliability (Agent Frameworks)
- Prompt chaining, orchestrator-worker agents, evaluator-optimizer loops (a minimal loop is sketched after this list).
- Ensure consistency, task decomposition, and enterprise reliability.
- Goal: Make LLM outputs repeatable, auditable, and scalable.
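A minimal evaluator-optimizer loop might look like the sketch below. Here call_llm and passes_checks are placeholders for a real model call and a real validation step; the retry-with-feedback structure is the point.

```python
def call_llm(prompt: str) -> str:
    """Stub for an actual LLM call."""
    return f"DRAFT: {prompt[:40]}"

def passes_checks(draft: str) -> tuple[bool, str]:
    """Evaluator: return (passed, feedback)."""
    if "TODO" in draft:
        return False, "contains unresolved TODO"
    return True, ""

def reliable_generate(task: str, max_rounds: int = 3) -> str:
    prompt = task
    for _ in range(max_rounds):
        draft = call_llm(prompt)
        ok, feedback = passes_checks(draft)
        if ok:
            return draft                      # evaluator approved: auditable exit point
        prompt = f"{task}\nFix this issue: {feedback}"
    raise RuntimeError("no draft passed evaluation")

print(reliable_generate("Summarize the Q3 incident report."))
```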
The Most Profound Advancements
Based on the model above, we can surmise that the most profound advancements in AI will occur in the Model space. Within that space, they will come from insights into the fundamental challenges themselves and how to approach them. The rest of this post explores current research in this space that is representative of the kind of work most likely to deliver transformative advancements in the coming years.
The Bitter Lesson and Why Tokenization Hurts
Richard Sutton’s “Bitter Lesson” tells us that methods which scale with compute almost always beat handcrafted shortcuts. Tokenization is exactly the kind of shortcut that doesn’t scale.
Consider:
- Loss of semantic fidelity – Words like strawberry get split into meaningless chunks, leading even top models to miscount letters (“How many R’s?” remains a classic failure case).
- Inefficiency across languages – In English, “hi” might cost a single token. In Shan, a language spoken in Myanmar, the equivalent greeting can cost around 14 tokens, so users literally pay roughly 14× more for the same greeting. Even Spanish carries a penalty: ~1.55× the token load of English. (You can measure this yourself with the snippet below.)
- Bias in computation cost – More tokens mean slower responses, higher API bills, and shorter effective context windows.
- Structural blind spots – In Chinese, radicals carry meaning. Tokenizers often miss these, breaking characters in ways that obscure semantic hints.
All of this adds up to an unfair, inefficient, and brittle foundation.
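You can observe the cost asymmetry directly with OpenAI’s open-source tiktoken library. Exact counts depend on the tokenizer and the phrases chosen, so treat the numbers as illustrative; what matters is the ratio between languages.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "English": "How are you today?",
    "Spanish": "¿Cómo estás hoy?",
    "Chinese": "你今天好吗？",
}
for lang, text in samples.items():
    tokens = enc.encode(text)
    # Same greeting, different token bill: the ratio is the hidden tax.
    print(f"{lang:8s} {len(text):3d} chars -> {len(tokens):3d} tokens")
```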
Dynamic Chunking: Teaching Models to Split on Their Own
One proposed alternative is Dynamic Chunking, realized in a new architecture known as H-Net. Instead of starting with tokens, H-Net begins at the byte level, the smallest stable unit of digital text. From there, it lets the model learn its own segmentation through layered chunking modules.
Key innovations include:
- Byte-level encoding – Preserves maximum information across alphabets, symbols, and even genomic sequences.
- Routing and smoothing modules – The model proposes cut points, then corrects them if confidence is low, ensuring splits that align with natural language units (a simplified routing sketch follows below).
- Hierarchical abstraction – Just as humans first learn letters, then words, then concepts, H-Net builds meaning through successive chunk layers. With multiple stages, it tends to converge on human-like divisions (“backbone” stays whole instead of splitting into back + bone).
The outcome is an end-to-end learned segmentation strategy. No hand-crafted vocabulary. No arbitrary cut points.
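As a loose illustration (a deliberate simplification, not the paper’s exact design), a routing step can be sketched as scoring how dissimilar adjacent byte representations are and proposing chunk boundaries where that dissimilarity is high:

```python
import torch
import torch.nn.functional as F

def propose_boundaries(hidden, threshold=0.5):
    """Loose sketch of a dynamic-chunking router: where consecutive byte
    representations disagree, a new chunk is more likely to start.
    `hidden` has shape (sequence_length, d_model)."""
    sim = F.cosine_similarity(hidden[1:], hidden[:-1], dim=-1)   # similarity of neighbours
    boundary_prob = (1.0 - sim) / 2.0                            # dissimilar neighbours -> likely boundary
    boundaries = torch.cat([torch.tensor([True]), boundary_prob > threshold])
    return boundaries, boundary_prob

# Toy usage: 16 byte positions with 32-dimensional representations.
hidden = torch.randn(16, 32)
boundaries, probs = propose_boundaries(hidden)
print(boundaries.int().tolist())   # 1 marks the start of a learned chunk
```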
Benchmarks and Tradeoffs
The research shows promising results:
- Performance – H-Net matches or surpasses standard transformers at the same compute scale, especially beyond ~30B training bytes.
- Domains – Gains are most visible in Chinese, code, and DNA sequences—areas where tokenization fails hardest.
- Robustness – Better at handling noisy text, capitalization variations, and character-level queries.
But challenges remain:
- Training cost – H-Net is slower to train than isotropic transformer baselines.
- Engineering complexity – New inference paths and runtime optimizations would be needed for production.
- Incremental gains – At small scale, the difference isn’t dramatic—making it unlikely that major labs will overhaul their pipelines in the short term.
Strategic Outlook: Why This Matters
Dynamic Chunking is less about immediate replacement and more about pointing toward the long arc of scaling. Sutton’s Bitter Lesson predicts that approaches which better exploit compute will eventually win. Removing tokenization aligns with that trajectory.
For enterprises, the implications are worth tracking:
- Fairness and global adoption – A byte-level system reduces the hidden tax non-English speakers pay today.
- Domain generalization – From legal codes to genetic data, models that aren’t shackled to token vocabularies can adapt more fluidly.
- Future efficiency – As context windows expand, the cost of tokenization overhead compounds. A learned segmentation system may scale more gracefully.
Tokenization got us through the early transformer era to where we are today. But like all first steps, it carries hidden costs. Dynamic Chunking represents a first serious attempt to replace it with something learned, scalable, and fair.
Whether H-Net itself becomes the new standard matters less than the broader shift toward attacking the next fundamental bottleneck. Once this challenge is tackled and the dust settles, the next one to address will come into focus.
The Bitter Lesson suggests that architectures embracing raw compute will win. By working at the byte level and letting models learn their own abstractions, we may finally be turning the page.