
Minimum Viable RAG: Embeddings and Vector Search in the Browser with MiniLM

In the evolving landscape of Retrieval-Augmented Generation (RAG), where large language models are enhanced by context-relevant information from a vector store, much of the focus falls on cloud-scale infrastructure and proprietary APIs. But what if you could build a fully private, zero-dependency MVP of RAG—entirely in the browser?

Using the WebAssembly-enabled all-MiniLM-L6-v2 model from Xenova, this whitepaper walks through an implementation of the MVP of the MVP: a fully client-side embedding generator and vector search interface. We explore what makes this approach valuable, how far it can take you, and where its limitations begin.


1. Introduction: What Is RAG at Its Core?

RAG is a technique that combines language generation with vector-based retrieval. It consists of the following steps, sketched in code after this list:

  • Embedding content into high-dimensional vectors
  • Storing those vectors in a retrievable structure
  • Querying new inputs as embeddings
  • Matching them to stored vectors via similarity (e.g. cosine)
  • Feeding top matches back into a language model as context
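In code, the entire loop fits in a handful of lines. Here is a minimal JavaScript sketch of those five steps, where embed and llm stand in for any embedding model and any language model (both are assumptions, not a specific API):

async function rag(question, store, embed, llm, topK = 3) {
  const queryVec = await embed(question);                       // embed the query
  const ranked = store
    .map(e => ({ ...e, score: cosine(queryVec, e.vector) }))    // score every entry
    .sort((a, b) => b.score - a.score);                         // rank by similarity
  const context = ranked.slice(0, topK)                         // keep the top matches
    .map(e => e.metadata.about)
    .join("\n");
  return llm(`Context:\n${context}\n\nQuestion: ${question}`);  // generate with context
}

// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

Everything that follows in this whitepaper is a concrete, in-browser instantiation of exactly this sketch.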

While most implementations rely on server-hosted embedding models and vector databases like Pinecone or Weaviate, the essence of RAG can be distilled into a much simpler form, one that is ideal for accessibility and learning.


2. The Core Tool: MiniLM for the Browser

all-MiniLM-L6-v2 is a compact sentence-embedding model, distilled from larger Transformer architectures, that maps text to 384-dimensional vectors. It balances performance with size, making it ideal for client-side use.

Key features:

  • Model size: ~30MB
  • Embedding dimension: 384
  • Latency: <1s for short texts
  • Normalization: unit-length output via the normalize: true option, so cosine similarity reduces to a dot product
  • Hosting: Runs entirely in-browser via WebAssembly (no API keys or cloud calls)

This enables a fully offline, privacy-preserving semantic search tool.
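To see these properties in action, here is a minimal snippet, using the same Xenova CDN import as the full app in Section 4, that loads the model and confirms the output is a unit-length, 384-dimensional vector (the sample sentence is arbitrary):

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.2';

// First call downloads the ~30MB weights; later calls hit the browser cache.
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean-pool the token embeddings and normalize the result to unit length.
const output = await embedder('an underdog jousting story', { pooling: 'mean', normalize: true });
const vec = Array.from(output.data);

console.log(vec.length);          // 384
console.log(Math.hypot(...vec));  // ~1.0, thanks to normalize: true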


3. Browser MVP: The Working Implementation

The MVP consists of a single HTML file:

  • Loads the Xenova MiniLM model using JS modules
  • Accepts text input and a pasted vector store
  • Generates a query embedding and computes cosine similarity
  • Sorts and displays the top N results

Sample Vector Store Format

[
  {
    "id": "movie1",
    "vector": [0.12, 0.34, ...],
    "metadata": {
      "title": "A Knight's Tale",
      "about": "Peasant-born William Thatcher (Heath Ledger) begins a quest to change his stars, win the heart of an exceedingly fair maiden (Shannyn Sossamon) and rock his medieval world. With the help of friends (Mark Addy, Paul Bettany, Alan Tudyk), he faces the ultimate test of medieval gallantry -- tournament jousting -- and tries to discover if he has the mettle to become a legend."
    }
  },
  {
    "id": "movie2",
    "vector": [0.11, 0.35, ...],
    "metadata": {
      "title": "No Country For Old Men",
      "about": "While out hunting, Llewelyn Moss (Josh Brolin) finds the grisly aftermath of a drug deal. Though he knows better, he cannot resist the cash left behind and takes it with him. The hunter becomes the hunted when a merciless killer named Chigurh (Javier Bardem) picks up his trail. Also looking for Moss is Sheriff Bell (Tommy Lee Jones), an aging lawman who reflects on a changing world and a dark secret of his own, as he tries to find and protect Moss."
    }
  },
  {
    "id": "movie3",
    "vector": [0.14, 0.32, ...],
    "metadata": {
      "title": "Happy Gilmore",
      "about": "All Happy Gilmore (Adam Sandler) has ever wanted is to be a professional hockey player. But he soon discovers he may actually have a talent for playing an entirely different sport: golf. When his grandmother (Frances Bay) learns she is about to lose her home, Happy joins a golf tournament to try and win enough money to buy it for her. With his powerful driving skills and foulmouthed attitude, Happy becomes an unlikely golf hero -- much to the chagrin of the well-mannered golf professionals."
    }
  }
]

These three movie descriptions, embedded as vectors, serve as a toy dataset for querying, comparison, and interactive learning.

Query Flow

  1. User enters a search phrase
  2. Embedding is generated client-side via MiniLM
  3. Vector is compared to all stored vectors
  4. Top results are displayed with similarity scores

No API keys. No servers. Just raw semantic search.


4. The MVP App

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Vector Search</title>
    <style>
      body { font-family: sans-serif; padding: 2rem; max-width: 600px; margin: auto; }
      textarea, input { width: 100%; margin: 1rem 0; padding: 0.5rem; }
      button { padding: 0.5rem 1rem; }
      .result { margin-top: 1rem; padding: 0.5rem; border: 1px solid #ccc; }
    </style>
    <script type="module">
      import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.2';

      let embedder;

      // Lazily load the MiniLM pipeline; the ~30MB of weights are
      // downloaded on first use and cached by the browser.
      async function initModel() {
        embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
      }

      // Mean-pool token embeddings and normalize to unit length,
      // so cosine similarity behaves as expected.
      async function getEmbedding(text) {
        if (!embedder) await initModel();
        const output = await embedder(text, { pooling: 'mean', normalize: true });
        return output.data;
      }

      function cosineSimilarity(a, b) {
        let dot = 0, normA = 0, normB = 0;
        for (let i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }

      window.embedAndSearch = async function () {
        const queryText = document.getElementById("queryText").value;

        if (!queryText) {
          alert("Please enter a search term.");
          return;
        }

        // Embed the query and show the raw vector for inspection.
        let queryVec;
        try {
          queryVec = await getEmbedding(queryText);
          document.getElementById("queryVector").value = Array.from(queryVec).join(", ");
        } catch (e) {
          alert("Failed to generate embedding.");
          return;
        }

        // Parse the pasted vector store (the JSON format from Section 3).
        let vectorStore;
        try {
          const rawData = document.getElementById("vectorStore").value;
          vectorStore = JSON.parse(rawData);
          console.log({ vectorStore });
        } catch (e) {
          alert("Invalid JSON in vector store.");
          return;
        }

        // Score every entry against the query and rank by similarity.
        const results = vectorStore.map(entry => ({
          ...entry,
          similarity: cosineSimilarity(queryVec, entry.vector)
        })).sort((a, b) => b.similarity - a.similarity);
        console.log({ results });

        const resultBox = document.getElementById("results");
        resultBox.innerHTML = '<h2>Top Matches</h2>' + results.slice(0, 5).map(r => `
          <div class="result">
            <strong>${r.metadata?.title || r.id}</strong><br>
            Similarity: ${r.similarity.toFixed(4)}
          </div>
        `).join("");
      }
    </script>
  </head>
  <body>
    <h1>Vector Similarity Search</h1>

    <label for="queryText">Search Term:</label>
    <input id="queryText" placeholder="Enter your search phrase..." />

    <label for="vectorStore">Vector Store (paste the JSON from Section 3):</label>
    <textarea id="vectorStore" rows="8" placeholder='[{ "id": "...", "vector": [...], "metadata": { ... } }]'></textarea>

    <label for="queryVector">This is the calculated embedding for your search input</label>
    <textarea id="queryVector" readonly></textarea>

    <button onclick="embedAndSearch()">Search</button>

    <div id="results"></div>
  </body>
</html>


5. Why Start Here?

This minimalist approach provides:

  • Privacy: All processing is local
  • Speed: Instant feedback on small datasets
  • Portability: Single-file deployment
  • Transparency: Easy to inspect and debug

It’s ideal for:

  • Local knowledge bases (e.g., Obsidian, Zettelkasten)
  • Prototyping embedded interfaces
  • Educational tools for understanding semantic similarity
  • MVP validation before investing in backend infrastructure

6. Tiers of Vector Search and Model Use

Embedding Model Tiers

Tier | Model                 | Dim   | Hosting     | Strengths
-----|-----------------------|-------|-------------|---------------------------------
1    | MiniLM (Browser)      | 384   | WebAssembly | Lightweight, instant, private
2    | Sentence Transformers | 768   | Node.js     | Stronger abstraction, serverable
3    | OpenAI, Cohere        | 1536+ | Cloud       | Domain-tuned, high-performance

Vector Store Tiers

Tier | Storage Type         | Access Scope | Use Case
-----|----------------------|--------------|---------------------------------
1    | In-file JSON         | Local only   | Prototyping, demos, educational
2    | Remote JSON endpoint | Internal API | MVPs with small teams
3    | Cloud DB / Elastic   | Scalable API | Production-grade applications

This flexibility means you can scale both model and storage independently as your app matures.
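As a concrete example, moving the vector store from Tier 1 to Tier 2 changes only where the JSON comes from; the search code stays the same. A minimal sketch, with a hypothetical endpoint URL:

// Tier 1 reads pasted JSON from a textarea (as in the MVP).
// Tier 2 fetches the same [{ id, vector, metadata }] array from an endpoint.
async function loadVectorStore() {
  const res = await fetch('https://example.com/api/vectors.json'); // hypothetical URL
  if (!res.ok) throw new Error(`Vector store fetch failed: ${res.status}`);
  return res.json();
}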


7. Cosine Similarity vs. Other Search Methods

At the core of vector search lies the idea of comparing high-dimensional embeddings to find relevance. Cosine similarity is the most widely used method in RAG pipelines—especially in compact, low-resource setups like our in-browser MiniLM MVP.

What Is Cosine Similarity?

Cosine similarity compares the angle between two vectors rather than their distance. It’s ideal when the direction of a vector matters more than its magnitude—true for most embedding models that represent semantic meaning.

Formula:

cosine_similarity(a, b) = (a · b) / (||a|| * ||b||)

This produces a value between -1 and 1:

  • 1 → vectors are identical in direction (perfect semantic match)
  • 0 → vectors are orthogonal (unrelated)
  • -1 → opposite meaning

Why Cosine Works for MiniLM

  • MiniLM embeddings are already unit length (thanks to normalize: true)
  • Focuses on semantic similarity, not raw distance
  • Performs well even on small, dense datasets like the 3-movie local JSON
  • Requires only a dot product—cheap and fast in JavaScript (see the sketch below)
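Because the vectors are unit length, the denominator in the cosine formula is always 1, so the whole comparison collapses to a dot product:

// For unit-length vectors (normalize: true), cosine similarity
// is just the dot product: no square roots required.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

The full cosineSimilarity function in the MVP keeps the normalization terms anyway, which makes it safe for vectors that were not normalized.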

Other Metrics

Metric         | Description                              | Pros                      | Cons
---------------|------------------------------------------|---------------------------|----------------------------------
Cosine         | Angle between vectors                    | Best for semantics        | Less useful for unnormalized data
Euclidean (L2) | Straight-line distance                   | Good for spatial layouts  | Sensitive to magnitude
Dot Product    | Raw projection of one vector on another  | Very fast                 | Biases toward longer vectors
Manhattan (L1) | Sum of absolute differences              | Robust to outliers        | Less intuitive in high-D space
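For reference, here are straightforward JavaScript versions of the distance-based metrics from the table (sketches for comparison, not tuned for performance):

// Euclidean (L2): straight-line distance between two points.
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Manhattan (L1): sum of absolute coordinate differences.
function manhattan(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return sum;
}

Note that these are distances, not similarities: smaller means closer, so rankings sort ascending rather than descending.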

When to Use What

  • Use cosine when you’re matching semantic meaning and vectors are normalized (like in MiniLM).
  • Use L2 or inner product when you’re working with approximate nearest neighbor systems (e.g., FAISS).
  • Consider hybrid scoring (e.g., vector + keyword) for production-grade retrieval.

8. Beyond Search: Context Injection into LLMs

The final step of the RAG loop is feeding your best-matched context back into a language model. Even within the browser MVP, this can be demonstrated by:

  • Displaying top matched results with accompanying text
  • Using those results as a preamble to a prompt for a local or cloud LLM, as sketched below
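A sketch of that preamble step, reusing the ranked results array from the search code (the prompt format is illustrative, not tied to any particular model):

// Turn the top matches into a context block for an LLM prompt.
function buildPrompt(queryText, results, topK = 3) {
  const context = results
    .slice(0, topK)
    .map(r => `- ${r.metadata.title}: ${r.metadata.about}`)
    .join("\n");
  return `Answer using only this context:\n${context}\n\nQuestion: ${queryText}`;
}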

Future versions can integrate:

  • GPT-4 or Claude via API
  • LLaMA or Mistral models in Node.js
  • Client-based LLMs like WebLLM for a full local RAG pipeline

The power of the system isn’t just in the matching—it’s in what happens next. Injecting retrieved knowledge improves fluency, factual grounding, and relevance.


9. Conclusion

If RAG is the future of personalized, context-rich AI interaction, then MiniLM-in-the-browser is the future of accessible prototyping. It empowers developers, tinkerers, and curious minds to understand and deploy semantic search without any infrastructure at all.

A minimal vector store. A 30MB model. Three embedded movies. And a single HTML file. That’s all it takes to start building the future—today.

From here, you can expand upward: from in-file vector stores to remote APIs and databases, from MiniLM to stronger and larger embedding models, and from simple scoring to powerful AI-enhanced context generation.

Start small. Grow smart. Build forward.