
Minimum Viable RAG: Embeddings and Vector Search in the Browser with MiniLM

In the evolving landscape of Retrieval-Augmented Generation (RAG), where large language models are enhanced by context-relevant information from a vector store, much of the focus falls on cloud-scale infrastructure and proprietary APIs. But what if you could build a fully private, zero-dependency MVP of RAG—entirely in the browser?

Using the WebAssembly-enabled all-MiniLM-L6-v2 model from Xenova, this whitepaper walks through an implementation of the MVP of the MVP: a fully client-side embedding generator and vector search interface. We explore what makes this approach valuable, how far it can take you, and where its limitations begin.


1. Introduction: What Is RAG at Its Core?

RAG is a technique that combines language generation with vector-based retrieval. It consists of the following steps, sketched in code after this list:

  • Embedding content into high-dimensional vectors
  • Storing those vectors in a retrievable structure
  • Querying new inputs as embeddings
  • Matching them to stored vectors via similarity (e.g. cosine)
  • Feeding top matches back into a language model as context
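In code, the entire loop fits in a handful of lines. Here is a minimal JavaScript sketch of those five steps, where embed and llm stand in for any embedding model and any language model (both are assumptions, not a specific API):

async function rag(question, store, embed, llm, topK = 3) {
  const queryVec = await embed(question);                       // embed the query
  const ranked = store
    .map(e => ({ ...e, score: cosine(queryVec, e.vector) }))    // score every entry
    .sort((a, b) => b.score - a.score);                         // rank by similarity
  const context = ranked.slice(0, topK)                         // keep the top matches
    .map(e => e.metadata.about)
    .join("\n");
  return llm(`Context:\n${context}\n\nQuestion: ${question}`);  // generate with context
}

// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

Everything that follows in this whitepaper is a concrete, in-browser instantiation of exactly this sketch.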

While most implementations rely on server-hosted embedding models and vector databases like Pinecone or Weaviate, the essence of RAG can be distilled into a much simpler form, one that is ideal for accessibility and learning.


2. The Core Tool: MiniLM for the Browser

all-MiniLM-L6-v2 is a compact sentence-embedding model, distilled from larger Transformer architectures, that maps text to 384-dimensional vectors. It balances performance with size, making it ideal for client-side use.

Key features:

  • Model size: ~30MB
  • Embedding dimension: 384
  • Latency: <1s for short texts
  • Normalization: unit-length output via the normalize: true option, so cosine similarity reduces to a dot product
  • Hosting: Runs entirely in-browser via WebAssembly (no API keys or cloud calls)

This enables a fully offline, privacy-preserving semantic search tool.
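To see these properties in action, here is a minimal snippet, using the same Xenova CDN import as the full app in Section 4, that loads the model and confirms the output is a unit-length, 384-dimensional vector (the sample sentence is arbitrary):

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.2';

// First call downloads the ~30MB weights; later calls hit the browser cache.
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean-pool the token embeddings and normalize the result to unit length.
const output = await embedder('an underdog jousting story', { pooling: 'mean', normalize: true });
const vec = Array.from(output.data);

console.log(vec.length);          // 384
console.log(Math.hypot(...vec));  // ~1.0, thanks to normalize: true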


3. Browser MVP: The Working Implementation

The MVP consists of a single HTML file:

  • Loads the Xenova MiniLM model using JS modules
  • Accepts text input and a pasted vector store
  • Generates a query embedding and computes cosine similarity
  • Sorts and displays the top N results

Sample Vector Store Format

[
  {
    "id": "movie1",
    "vector": [0.12, 0.34, ...],
    "metadata": {
      "title": "A Knight's Tale",
      "about": "Peasant-born William Thatcher (Heath Ledger) begins a quest to change his stars, win the heart of an exceedingly fair maiden (Shannyn Sossamon) and rock his medieval world. With the help of friends (Mark Addy, Paul Bettany, Alan Tudyk), he faces the ultimate test of medieval gallantry -- tournament jousting -- and tries to discover if he has the mettle to become a legend."
    }
  },
  {
    "id": "movie2",
    "vector": [0.11, 0.35, ...],
    "metadata": {
      "title": "No Country For Old Men",
      "about": "While out hunting, Llewelyn Moss (Josh Brolin) finds the grisly aftermath of a drug deal. Though he knows better, he cannot resist the cash left behind and takes it with him. The hunter becomes the hunted when a merciless killer named Chigurh (Javier Bardem) picks up his trail. Also looking for Moss is Sheriff Bell (Tommy Lee Jones), an aging lawman who reflects on a changing world and a dark secret of his own, as he tries to find and protect Moss."
    }
  },
  {
    "id": "movie3",
    "vector": [0.14, 0.32, ...],
    "metadata": {
      "title": "Happy Gilmore",
      "about": "All Happy Gilmore (Adam Sandler) has ever wanted is to be a professional hockey player. But he soon discovers he may actually have a talent for playing an entirely different sport: golf. When his grandmother (Frances Bay) learns she is about to lose her home, Happy joins a golf tournament to try and win enough money to buy it for her. With his powerful driving skills and foulmouthed attitude, Happy becomes an unlikely golf hero -- much to the chagrin of the well-mannered golf professionals."
    }
  }
]

These three movie descriptions, embedded as vectors, serve as a toy dataset for querying, comparison, and interactive learning.

Query Flow

  1. User enters a search phrase
  2. Embedding is generated client-side via MiniLM
  3. Vector is compared to all stored vectors
  4. Top results are displayed with similarity scores

No API keys. No servers. Just raw semantic search.


4. The MVP App

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Vector Search</title>
    <style>
      body { font-family: sans-serif; padding: 2rem; max-width: 600px; margin: auto; }
      textarea, input { width: 100%; margin: 1rem 0; padding: 0.5rem; }
      button { padding: 0.5rem 1rem; }
      .result { margin-top: 1rem; padding: 0.5rem; border: 1px solid #ccc; }
    </style>
    <script type="module">
      import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers@2.5.2';

      let embedder;

      // Lazily load the MiniLM pipeline; the ~30MB of weights are
      // downloaded on first use and cached by the browser.
      async function initModel() {
        embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
      }

      // Mean-pool token embeddings and normalize to unit length,
      // so cosine similarity behaves as expected.
      async function getEmbedding(text) {
        if (!embedder) await initModel();
        const output = await embedder(text, { pooling: 'mean', normalize: true });
        return output.data;
      }

      function cosineSimilarity(a, b) {
        let dot = 0, normA = 0, normB = 0;
        for (let i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }

      window.embedAndSearch = async function () {
        const queryText = document.getElementById("queryText").value;

        if (!queryText) {
          alert("Please enter a search term.");
          return;
        }

        // Embed the query and show the raw vector for inspection.
        let queryVec;
        try {
          queryVec = await getEmbedding(queryText);
          document.getElementById("queryVector").value = Array.from(queryVec).join(", ");
        } catch (e) {
          alert("Failed to generate embedding.");
          return;
        }

        // Parse the pasted vector store (the JSON format from Section 3).
        let vectorStore;
        try {
          const rawData = document.getElementById("vectorStore").value;
          vectorStore = JSON.parse(rawData);
          console.log({ vectorStore });
        } catch (e) {
          alert("Invalid JSON in vector store.");
          return;
        }

        // Score every entry against the query and rank by similarity.
        const results = vectorStore.map(entry => ({
          ...entry,
          similarity: cosineSimilarity(queryVec, entry.vector)
        })).sort((a, b) => b.similarity - a.similarity);
        console.log({ results });

        const resultBox = document.getElementById("results");
        resultBox.innerHTML = '<h2>Top Matches</h2>' + results.slice(0, 5).map(r => `
          <div class="result">
            <strong>${r.metadata?.title || r.id}</strong><br>
            Similarity: ${r.similarity.toFixed(4)}
          </div>
        `).join("");
      }
    </script>
  </head>
  <body>
    <h1>Vector Similarity Search</h1>

    <label for="queryText">Search Term:</label>
    <input id="queryText" placeholder="Enter your search phrase..." />

    <label for="vectorStore">Vector Store (paste the JSON from Section 3):</label>
    <textarea id="vectorStore" rows="8" placeholder='[{ "id": "...", "vector": [...], "metadata": { ... } }]'></textarea>

    <label for="queryVector">This is the calculated embedding for your search input</label>
    <textarea id="queryVector" readonly></textarea>

    <button onclick="embedAndSearch()">Search</button>

    <div id="results"></div>
  </body>
</html>


5. Why Start Here?

This minimalist approach provides:

  • Privacy: All processing is local
  • Speed: Instant feedback on small datasets
  • Portability: Single-file deployment
  • Transparency: Easy to inspect and debug

It’s ideal for:

  • Local knowledge bases (e.g., Obsidian, Zettelkasten)
  • Prototyping embedded interfaces
  • Educational tools for understanding semantic similarity
  • MVP validation before investing in backend infrastructure

6. Tiers of Vector Search and Model Use

Embedding Model Tiers

Tier | Model                 | Dim   | Hosting     | Strengths
-----|-----------------------|-------|-------------|---------------------------------
1    | MiniLM (Browser)      | 384   | WebAssembly | Lightweight, instant, private
2    | Sentence Transformers | 768   | Node.js     | Stronger abstraction, serverable
3    | OpenAI, Cohere        | 1536+ | Cloud       | Domain-tuned, high-performance

Vector Store Tiers

Tier | Storage Type         | Access Scope | Use Case
-----|----------------------|--------------|---------------------------------
1    | In-file JSON         | Local only   | Prototyping, demos, educational
2    | Remote JSON endpoint | Internal API | MVPs with small teams
3    | Cloud DB / Elastic   | Scalable API | Production-grade applications

This flexibility means you can scale both model and storage independently as your app matures.
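As a concrete example, moving the vector store from Tier 1 to Tier 2 changes only where the JSON comes from; the search code stays the same. A minimal sketch, with a hypothetical endpoint URL:

// Tier 1 reads pasted JSON from a textarea (as in the MVP).
// Tier 2 fetches the same [{ id, vector, metadata }] array from an endpoint.
async function loadVectorStore() {
  const res = await fetch('https://example.com/api/vectors.json'); // hypothetical URL
  if (!res.ok) throw new Error(`Vector store fetch failed: ${res.status}`);
  return res.json();
}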


7. Cosine Similarity vs. Other Search Methods

At the core of vector search lies the idea of comparing high-dimensional embeddings to find relevance. Cosine similarity is the most widely used method in RAG pipelines—especially in compact, low-resource setups like our in-browser MiniLM MVP.

What Is Cosine Similarity?

Cosine similarity compares the angle between two vectors rather than their distance. It’s ideal when the direction of a vector matters more than its magnitude—true for most embedding models that represent semantic meaning.

Formula:

cosine_similarity(a, b) = (a · b) / (||a|| * ||b||)

This produces a value between -1 and 1:

  • 1 → vectors are identical in direction (perfect semantic match)
  • 0 → vectors are orthogonal (unrelated)
  • -1 → opposite meaning

Why Cosine Works for MiniLM

  • MiniLM embeddings are already unit length (thanks to normalize: true)
  • Focuses on semantic similarity, not raw distance
  • Performs well even on small, dense datasets like the 3-movie local JSON
  • Requires only a dot product—cheap and fast in JavaScript (see the sketch below)
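Because the vectors are unit length, the denominator in the cosine formula is always 1, so the whole comparison collapses to a dot product:

// For unit-length vectors (normalize: true), cosine similarity
// is just the dot product: no square roots required.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

The full cosineSimilarity function in the MVP keeps the normalization terms anyway, which makes it safe for vectors that were not normalized.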

Other Metrics

Metric         | Description                              | Pros                      | Cons
---------------|------------------------------------------|---------------------------|----------------------------------
Cosine         | Angle between vectors                    | Best for semantics        | Less useful for unnormalized data
Euclidean (L2) | Straight-line distance                   | Good for spatial layouts  | Sensitive to magnitude
Dot Product    | Raw projection of one vector on another  | Very fast                 | Biases toward longer vectors
Manhattan (L1) | Sum of absolute differences              | Robust to outliers        | Less intuitive in high-D space
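For reference, here are straightforward JavaScript versions of the distance-based metrics from the table (sketches for comparison, not tuned for performance):

// Euclidean (L2): straight-line distance between two points.
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Manhattan (L1): sum of absolute coordinate differences.
function manhattan(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
  return sum;
}

Note that these are distances, not similarities: smaller means closer, so rankings sort ascending rather than descending.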

When to Use What

  • Use cosine when you’re matching semantic meaning and vectors are normalized (like in MiniLM).
  • Use L2 or inner product when you’re working with approximate nearest neighbor systems (e.g., FAISS).
  • Consider hybrid scoring (e.g., vector + keyword) for production-grade retrieval.

8. Beyond Search: Context Injection into LLMs

The final step of the RAG loop is feeding your best-matched context back into a language model. Even within the browser MVP, this can be demonstrated by:

  • Displaying top matched results with accompanying text
  • Using those results as a preamble to a prompt for a local or cloud LLM, as sketched below
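A sketch of that preamble step, reusing the ranked results array from the search code (the prompt format is illustrative, not tied to any particular model):

// Turn the top matches into a context block for an LLM prompt.
function buildPrompt(queryText, results, topK = 3) {
  const context = results
    .slice(0, topK)
    .map(r => `- ${r.metadata.title}: ${r.metadata.about}`)
    .join("\n");
  return `Answer using only this context:\n${context}\n\nQuestion: ${queryText}`;
}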

Future versions can integrate:

  • GPT-4 or Claude via API
  • LLaMA or Mistral models in Node.js
  • Client-based LLMs like WebLLM for a full local RAG pipeline

The power of the system isn’t just in the matching—it’s in what happens next. Injecting retrieved knowledge improves fluency, factual grounding, and relevance.


9. Conclusion

If RAG is the future of personalized, context-rich AI interaction, then MiniLM-in-the-browser is the future of accessible prototyping. It empowers developers, tinkerers, and curious minds to understand and deploy semantic search without any infrastructure at all.

A minimal vector store. A 30MB model. Three embedded movies. And a single HTML file. That’s all it takes to start building the future—today.

From here, you can expand upward: from in-file vector stores to remote APIs and databases, from MiniLM to stronger and larger embedding models, and from simple scoring to powerful AI-enhanced context generation.

Start small. Grow smart. Build forward.