In the evolving landscape of Retrieval-Augmented Generation (RAG), where large language models are enhanced by context-relevant information from a vector store, much of the focus falls on cloud-scale infrastructure and proprietary APIs. But what if you could build a fully private, zero-dependency MVP of RAG—entirely in the browser?
Using the WebAssembly-enabled all-MiniLM-L6-v2 model from Xenova, this whitepaper walks through an implementation of the MVP of the MVP: a fully client-side embedding generator and vector search interface. We explore what makes this approach valuable, how far it can take you, and where its limitations begin.
1. Introduction: What Is RAG at Its Core?
RAG is a technique that combines language generation with vector-based retrieval. It consists of:
- Embedding content into high-dimensional vectors
- Storing those vectors in a retrievable structure
- Querying new inputs as embeddings
- Matching them to stored vectors via similarity (e.g. cosine)
- Feeding top matches back into a language model as context
While most implementations rely on server-hosted embedding models and vector databases like Pinecone or Weaviate, the essence of RAG can be distilled into a far simpler form for accessibility and learning.
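To see those fundamentals as code before diving into the browser specifics, here is a minimal sketch of the whole loop in JavaScript. The `embed`, `generate`, and `cosineSimilarity` functions are placeholders; concrete versions appear later in this paper.

```js
// The five RAG steps as one function. embed(), generate(), and
// cosineSimilarity() are placeholders that later sections make concrete.
async function ragAnswer(query, vectorStore) {
  const queryVector = await embed(query);              // embed the query
  const matches = vectorStore                          // stored vectors
    .map(item => ({ ...item, score: cosineSimilarity(queryVector, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3);                                      // top matches by similarity
  const context = matches.map(m => m.text).join('\n');
  return generate(`Context:\n${context}\n\nQuestion: ${query}`); // feed back as context
}
```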
2. The Core Tool: MiniLM for the Browser
all-MiniLM-L6-v2 is a compact 384-dimensional embedding model distilled from larger Transformer architectures. It balances performance with size, making it ideal for client-side use.
Key features:
- Model size: ~30MB
- Embedding dimension: 384
- Latency: <1s for short texts
- Normalization: Built-in cosine compatibility via `normalize: true`
- Hosting: Runs entirely in-browser via WebAssembly (no API keys or cloud calls)
This enables a fully offline, privacy-preserving semantic search tool.
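In practice, loading the model takes only a few lines with Xenova's Transformers.js. A minimal sketch; the CDN URL is illustrative and should be version-pinned in real use:

```js
// Inside a <script type="module"> block in the browser.
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

// First call downloads and caches the WASM runtime and ~30MB of weights.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean-pool token embeddings and normalize to unit length for cosine search.
const output = await extractor('a heartfelt drama about memory', {
  pooling: 'mean',
  normalize: true,
});
const vector = Array.from(output.data); // Float32Array -> plain array of 384 floats
```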
3. Browser MVP: The Working Implementation
The MVP consists of a single HTML file that:
- Loads the Xenova MiniLM model using JS modules
- Accepts text input and a pasted vector store
- Generates a query embedding and computes cosine similarity
- Sorts and displays the top N results
Sample Vector Store Format
Each entry pairs the source text with its 384-dimensional embedding (vectors truncated here for readability):

```json
[
  { "text": "Movie 1: a short plot summary…", "embedding": [0.02, -0.04, 0.01, …] },
  { "text": "Movie 2: a short plot summary…", "embedding": [-0.01, 0.03, -0.02, …] },
  { "text": "Movie 3: a short plot summary…", "embedding": [0.005, -0.016, 0.04, …] }
]
```
These three vectorized films are embedded locally and used as a toy dataset for querying, comparison, and interactive learning.
Query Flow
- User enters a search phrase
- Embedding is generated client-side via MiniLM
- Vector is compared to all stored vectors
- Top results are displayed with similarity scores
No API keys. No servers. Just raw semantic search.
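A sketch of that flow, assuming the `extractor` pipeline from section 2 and a `vectorStore` array in the format shown earlier:

```js
// Compare two embeddings by the cosine of the angle between them.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed the query client-side, score every stored vector, return the top N.
async function search(query, vectorStore, topN = 5) {
  const output = await extractor(query, { pooling: 'mean', normalize: true });
  const queryVector = Array.from(output.data);
  return vectorStore
    .map(item => ({ text: item.text, score: cosineSimilarity(queryVector, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}
```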
4. The MVP App
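A minimal single-file sketch of the app described in section 3. The element IDs, layout, and top-3 cutoff are illustrative choices, not requirements:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <title>Browser RAG MVP</title>
</head>
<body>
  <textarea id="store" rows="8" cols="80"
    placeholder="Paste vector store JSON here"></textarea><br />
  <input id="query" size="60" placeholder="Enter a search phrase" />
  <button id="go">Search</button>
  <ol id="results"></ol>

  <script type="module">
    import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

    // Load once; the browser caches the model after the first visit.
    const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

    function cosineSimilarity(a, b) {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    document.getElementById('go').addEventListener('click', async () => {
      const store = JSON.parse(document.getElementById('store').value);
      const query = document.getElementById('query').value;

      // Generate the query embedding entirely client-side.
      const output = await extractor(query, { pooling: 'mean', normalize: true });
      const queryVector = Array.from(output.data);

      // Score, sort, and show the top 3 matches.
      const top = store
        .map(item => ({ text: item.text, score: cosineSimilarity(queryVector, item.embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, 3);

      document.getElementById('results').innerHTML = top
        .map(r => `<li>${r.text} (score: ${r.score.toFixed(3)})</li>`)
        .join('');
    });
  </script>
</body>
</html>
```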
5. Why Start Here?
This minimalist approach provides:
- Privacy: All processing is local
- Speed: Instant feedback on small datasets
- Portability: Single-file deployment
- Transparency: Easy to inspect and debug
It’s ideal for:
- Local knowledge bases (e.g., Obsidian, Zettelkasten)
- Prototyping embedded interfaces
- Educational tools for understanding semantic similarity
- MVP validation before investing in backend infrastructure
6. Tiers of Vector Search and Model Use
Embedding Model Tiers
Tier | Model | Dim | Hosting | Strengths
---|---|---|---|---
1 | MiniLM (Browser) | 384 | WebAssembly | Lightweight, instant, private
2 | Sentence Transformers | 768 | Node.js | Stronger abstraction, server-deployable
3 | OpenAI, Cohere | 1536+ | Cloud | Domain-tuned, high-performance
Vector Store Tiers
Tier | Storage Type | Access Scope | Use Case
---|---|---|---
1 | In-file JSON | Local only | Prototyping, demos, education
2 | Remote JSON endpoint | Internal API | MVPs with small teams
3 | Cloud DB / Elastic | Scalable API | Production-grade applications
This flexibility means you can scale both model and storage independently as your app matures.
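One way to preserve that independence in code is to hide the model and the store behind two small interfaces, so either tier can be swapped without touching the search logic. A hypothetical sketch; `extractor`, `vectorStore`, and `cosineSimilarity` come from the earlier sections:

```js
// Hypothetical seams between tiers: anything with an embed() can replace
// the model; anything with a nearest() can replace the store.
const browserEmbedder = {
  // Tier 1 model: MiniLM running in WebAssembly.
  embed: async (text) => {
    const out = await extractor(text, { pooling: 'mean', normalize: true });
    return Array.from(out.data);
  },
};

const inFileStore = {
  // Tier 1 storage: the pasted JSON array, scored in memory.
  nearest: async (vector, topN) =>
    vectorStore
      .map(item => ({ text: item.text, score: cosineSimilarity(vector, item.embedding) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topN),
};

// The search logic never changes, whichever tiers sit behind it.
async function searchWith(embedder, store, query, topN = 5) {
  return store.nearest(await embedder.embed(query), topN);
}
```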
7. Cosine Similarity vs. Other Search Methods
At the core of vector search lies the idea of comparing high-dimensional embeddings to find relevance. Cosine similarity is the most widely used method in RAG pipelines—especially in compact, low-resource setups like our in-browser MiniLM MVP.
What Is Cosine Similarity?
Cosine similarity compares the angle between two vectors rather than their distance. It’s ideal when the direction of a vector matters more than its magnitude—true for most embedding models that represent semantic meaning.
Formula:
```text
cosine_similarity(a, b) = (a · b) / (||a|| * ||b||)
```
This produces a value between -1 and 1:
- 1 → vectors are identical in direction (perfect semantic match)
- 0 → vectors are orthogonal (unrelated)
- -1 → opposite meaning
Why Cosine Works for MiniLM
- MiniLM embeddings are already normalized to unit length
- Focuses on semantic similarity, not raw distance
- Performs well even on small, dense datasets like the 3-movie local JSON
- Requires only a dot product, which is cheap and fast in JavaScript (see the sketch after this list)
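Since the vectors are unit length, the denominator of the cosine formula is 1 and the comparison collapses to a bare dot product:

```js
// For unit vectors, ||a|| = ||b|| = 1, so:
//   cosine_similarity(a, b) = (a · b) / (1 * 1) = a · b
const dot = (a, b) => a.reduce((sum, ai, i) => sum + ai * b[i], 0);

// With embeddings generated using normalize: true, dot(a, b) returns
// the same score as the full cosineSimilarity(a, b) from section 3.
```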
Other Metrics
Metric | Description | Pros | Cons |
---|---|---|---|
Cosine | Angle between vectors | Best for semantics | Less useful for unnormalized data |
Euclidean (L2) | Straight-line distance | Good for spatial layouts | Sensitive to magnitude |
Dot Product | Raw projection of one vector on another | Very fast | Biases toward longer vectors |
Manhattan (L1) | Sum of absolute differences | Robust to outliers | Less intuitive in high-D space |
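For reference, minimal JavaScript versions of the two distance metrics (the dot product appears just above). Unlike the similarity scores, lower values here mean closer matches:

```js
// Euclidean (L2): straight-line distance; sensitive to vector magnitude.
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Manhattan (L1): sum of absolute differences; robust to outliers.
function manhattan(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return sum;
}
```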
When to Use What
- Use cosine when you’re matching semantic meaning and vectors are normalized (like in MiniLM).
- Use L2 or inner product when you’re working with approximate nearest neighbor systems (e.g., FAISS).
- Consider hybrid scoring (e.g., vector + keyword) for production-grade retrieval; a toy version is sketched below.
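A toy illustration of that hybrid idea, blending the vector score with naive keyword overlap. The 0.7/0.3 split is arbitrary; production systems typically pair embeddings with BM25 and tuned weights:

```js
// Fraction of query words that literally appear in the document text.
function keywordScore(query, text) {
  const words = query.toLowerCase().split(/\s+/).filter(Boolean);
  const haystack = text.toLowerCase();
  const hits = words.filter(w => haystack.includes(w)).length;
  return words.length ? hits / words.length : 0;
}

// Weighted blend of semantic (vector) and lexical (keyword) evidence.
function hybridScore(queryVector, query, item, alpha = 0.7) {
  return alpha * cosineSimilarity(queryVector, item.embedding)
       + (1 - alpha) * keywordScore(query, item.text);
}
```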
8. Beyond Search: Context Injection into LLMs
The final step of the RAG loop is feeding your best-matched context back into a language model. Even within the browser MVP, this can be demonstrated by:
- Displaying top matched results with accompanying text
- Using those results as a preamble to a prompt for a local or cloud LLM (sketched below)
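Assembling that preamble is plain string work. A sketch reusing the `search` function from section 3; the prompt template is illustrative:

```js
// Build an augmented prompt from the top matches, ready to hand to
// whatever LLM is available (local or cloud).
async function buildPrompt(query, vectorStore) {
  const matches = await search(query, vectorStore, 3);
  const context = matches
    .map((m, i) => `[${i + 1}] ${m.text} (score: ${m.score.toFixed(3)})`)
    .join('\n');
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${query}`;
}
```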
Future versions can integrate:
- GPT-4 or Claude via API
- LLaMA or Mistral models in Node.js
- Client-based LLMs like WebLLM for a full local RAG pipeline
The power of the system isn’t just in the matching—it’s in what happens next. Injecting retrieved knowledge improves fluency, factual grounding, and relevance.
9. Conclusion
If RAG is the future of personalized, context-rich AI interaction, then MiniLM-in-the-browser is the future of accessible prototyping. It empowers developers, tinkerers, and curious minds to understand and deploy semantic search without any infrastructure at all.
A minimal vector store. A 30MB model. Three embedded movies. And a single HTML file. That’s all it takes to start building the future—today.
From here, you can expand upward: from in-file vector stores to remote APIs and databases, from MiniLM to stronger and larger embedding models, and from simple scoring to powerful AI-enhanced context generation.
Start small. Grow smart. Build forward.