<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link><description>Digital workplace, artificial intelligence, cloud, security, automation, and enterprise technology notes by Ajay Walia.</description><language>en-au</language><managingEditor>Ajay Walia</managingEditor><webMaster>Ajay Walia</webMaster><copyright>Copyright 2026 Ajay Walia</copyright><lastBuildDate>Sun, 21 Jun 2026 05:46:10 +0000</lastBuildDate><atom:link href="https://curiousbit.netlify.app/tags/vector-search/index.xml" rel="self" type="application/rss+xml"/><image><url>https://curiousbit.netlify.app/images/og-default.png</url><title>Ajay Walia</title><link>https://curiousbit.netlify.app/</link></image><item><title>RAG, Graph RAG, Agentic RAG — and How to Make Any of Them 32× Memory Efficient</title><link>https://curiousbit.netlify.app/rag-graph-rag-agentic-rag-and-how-to-make-any-of-them-32-memory-efficient/</link><guid isPermaLink="true">https://curiousbit.netlify.app/rag-graph-rag-agentic-rag-and-how-to-make-any-of-them-32-memory-efficient/</guid><pubDate>Thu, 28 May 2026 00:00:00 +0000</pubDate><dc:creator>Ajay Walia</dc:creator><description>&lt;style&gt;
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&amp;family=JetBrains+Mono:wght@400;600&amp;display=swap');
.rag-art {
--bg: #0a1220;
--bg2: #0f1a2e;
--bg3: #142442;
--line: #1f3358;
--text: #e2e8f0;
--muted: #8aa0c0;
--accent: #22d3ee;
--accent2: #00e5a8;
--warn: #f59e0b;
--danger: #ef4444;
--purple: #a78bfa;
font-family: 'Inter', sans-serif;
color: var(--text);
background: var(--bg);
padding: 2rem 2.25rem;
border-radius: 16px;
box-shadow: 0 4px 30px rgba(0,0,0,0.55);
line-height: 1.75;
}
.rag-art * { box-sizing: border-box; }
.rag-art .section { padding: 40px 0; border-bottom: 1px solid var(--line); }
.rag-art .section:last-child { border-bottom: none; }
.rag-art .label { font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .22em; text-transform: uppercase; color: var(--accent); font-weight: 600; margin-bottom: 8px; }
.rag-art h2 { font-family: 'Inter', sans-serif; font-size: clamp(1.5rem, 1.2rem + 1vw, 2.2rem); font-weight: 700; letter-spacing: -.02em; color: #fff; margin: 0 0 18px; }
.rag-art h3 { font-family: 'Inter', sans-serif; font-size: 1.15rem; font-weight: 600; color: #fff; margin: 28px 0 10px; }
.rag-art p { margin: 0 0 16px; font-size: 1rem; color: var(--text); }
.rag-art ul { list-style: none; padding: 0; margin: 0 0 18px; }
.rag-art ul li { padding: 4px 0 4px 22px; position: relative; font-size: .95rem; }
.rag-art ul li::before { content: "▸"; position: absolute; left: 0; color: var(--accent); }
.rag-art strong { color: #fff; }
.rag-art code { background: var(--bg3); padding: 1px 6px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-size: .88rem; color: var(--accent2); }
.rag-visual { margin: 28px 0; border-radius: 8px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--line); background: var(--bg2); }
.rag-visual svg { display: block; width: 100%; height: auto; min-width: 700px; }
.rag-cap { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin: -10px 0 24px; }
.rag-figure { margin: 28px 0; }
.rag-figure img, .rag-figure video { display: block; width: 100%; height: auto; border-radius: 8px; border: 1px solid var(--line); }
.rag-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; }
@keyframes ragPulseSoft { 0%,100% { opacity: .35; } 50% { opacity: 1; } }
@keyframes ragPulseStrong{ 0%,100% { opacity: .55; r: 9; } 50% { opacity: 1; r: 11; } }
@keyframes ragMissed { 0%,100% { opacity: .25; } 50% { opacity: .9; } }
@keyframes ragFlow { from { stroke-dashoffset: 24; } to { stroke-dashoffset: 0; } }
@keyframes ragDraw { from { stroke-dashoffset: 800; } to { stroke-dashoffset: 0; } }
.rag-flow-line { stroke-dasharray: 6 4; animation: ragFlow 1.2s linear infinite; }
.rag-traversal { stroke-dasharray: 6 4; animation: ragFlow 1.5s linear infinite; }
.rag-glow { animation: ragPulseStrong 2.4s ease-in-out infinite; }
.rag-missed { animation: ragMissed 3s ease-in-out infinite; }
/* Sequential agent tool highlight (5 tools) */
@keyframes ragAgentT { 0%,18% { opacity: 1; } 25%,100% { opacity: .35; } }
.rag-tool-1 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 0s; }
.rag-tool-2 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 1s; }
.rag-tool-3 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 2s; }
.rag-tool-4 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 3s; }
.rag-tool-5 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 4s; }
/* Sequential float→binary cell transform (8 cells) */
@keyframes ragBitTransform { 0%,8% { opacity: 0; } 14%,100% { opacity: 1; } }
.rag-bit-1 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 0s; }
.rag-bit-2 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .35s; }
.rag-bit-3 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .7s; }
.rag-bit-4 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.05s; }
.rag-bit-5 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.4s; }
.rag-bit-6 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.75s; }
.rag-bit-7 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.1s; }
.rag-bit-8 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.45s; }
@media (prefers-reduced-motion: reduce) {
.rag-flow-line, .rag-traversal, .rag-glow, .rag-missed,
.rag-tool-1,.rag-tool-2,.rag-tool-3,.rag-tool-4,.rag-tool-5,
.rag-bit-1,.rag-bit-2,.rag-bit-3,.rag-bit-4,.rag-bit-5,.rag-bit-6,.rag-bit-7,.rag-bit-8 { animation: none !important; opacity: 1 !important; }
}
.rag-cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 14px; margin: 24px 0; }
.rag-card { background: var(--bg2); border: 1px solid var(--line); border-radius: 10px; padding: 18px 18px; }
.rag-card h4 { font-size: 1rem; font-weight: 600; color: #fff; margin: 0 0 8px; }
.rag-card p { font-size: .88rem; color: var(--muted); margin: 0; }
.rag-pull { border-left: 3px solid var(--accent); padding: 6px 0 6px 18px; margin: 28px 0; font-size: 1.08rem; color: var(--accent); font-style: italic; }
.rag-pull.warn { border-color: var(--warn); color: var(--warn); }
.rag-table { width: 100%; border-collapse: collapse; margin: 18px 0 26px; font-size: .9rem; }
.rag-table th, .rag-table td { padding: 11px 14px; text-align: left; border-bottom: 1px solid var(--line); vertical-align: top; }
.rag-table th { color: var(--accent); font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .14em; text-transform: uppercase; border-bottom: 1px solid var(--accent); font-weight: 600; }
.rag-table td.k { color: #fff; font-weight: 600; width: 25%; }
.rag-table td.v { color: var(--text); }
.rag-table td.m { color: var(--muted); font-size: .85rem; }
.rag-stat { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--line); border: 1px solid var(--line); margin: 28px 0; border-radius: 6px; overflow: hidden; }
.rag-stat &gt; div { background: var(--bg2); padding: 18px 14px; text-align: center; }
.rag-stat .num { font-family: 'Inter', sans-serif; font-weight: 700; font-size: 1.7rem; color: var(--accent); display: block; }
.rag-stat .lbl { font-family: 'JetBrains Mono', monospace; font-size: .66rem; letter-spacing: .1em; color: var(--muted); text-transform: uppercase; margin-top: 4px; display: block; }
@media (max-width: 600px) {
.rag-art { padding: 1.25rem; }
.rag-stat { grid-template-columns: repeat(2,1fr); }
.rag-visual svg { min-width: 600px; }
}
&lt;/style&gt;
&lt;div class="rag-art"&gt;
&lt;!-- ── OPENING ── --&gt;
&lt;div class="section" style="padding-top:0;border-bottom:none"&gt;
&lt;p&gt;Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.&lt;/p&gt;</description><content:encoded>&lt;![CDATA[<img src="https://curiousbit.netlify.app/images/rag-variants/hero.png" alt="Vector-Search" style="max-width:100%;height:auto;margin-bottom:1.5em;"/><style>
@import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;600&display=swap');
.rag-art {
--bg: #0a1220;
--bg2: #0f1a2e;
--bg3: #142442;
--line: #1f3358;
--text: #e2e8f0;
--muted: #8aa0c0;
--accent: #22d3ee;
--accent2: #00e5a8;
--warn: #f59e0b;
--danger: #ef4444;
--purple: #a78bfa;
font-family: 'Inter', sans-serif;
color: var(--text);
background: var(--bg);
padding: 2rem 2.25rem;
border-radius: 16px;
box-shadow: 0 4px 30px rgba(0,0,0,0.55);
line-height: 1.75;
}
.rag-art * { box-sizing: border-box; }
.rag-art .section { padding: 40px 0; border-bottom: 1px solid var(--line); }
.rag-art .section:last-child { border-bottom: none; }
.rag-art .label { font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .22em; text-transform: uppercase; color: var(--accent); font-weight: 600; margin-bottom: 8px; }
.rag-art h2 { font-family: 'Inter', sans-serif; font-size: clamp(1.5rem, 1.2rem + 1vw, 2.2rem); font-weight: 700; letter-spacing: -.02em; color: #fff; margin: 0 0 18px; }
.rag-art h3 { font-family: 'Inter', sans-serif; font-size: 1.15rem; font-weight: 600; color: #fff; margin: 28px 0 10px; }
.rag-art p { margin: 0 0 16px; font-size: 1rem; color: var(--text); }
.rag-art ul { list-style: none; padding: 0; margin: 0 0 18px; }
.rag-art ul li { padding: 4px 0 4px 22px; position: relative; font-size: .95rem; }
.rag-art ul li::before { content: "▸"; position: absolute; left: 0; color: var(--accent); }
.rag-art strong { color: #fff; }
.rag-art code { background: var(--bg3); padding: 1px 6px; border-radius: 4px; font-family: 'JetBrains Mono', monospace; font-size: .88rem; color: var(--accent2); }
.rag-visual { margin: 28px 0; border-radius: 8px; overflow-x: auto; overflow-y: hidden; border: 1px solid var(--line); background: var(--bg2); }
.rag-visual svg { display: block; width: 100%; height: auto; min-width: 700px; }
.rag-cap { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin: -10px 0 24px; }
.rag-figure { margin: 28px 0; }
.rag-figure img, .rag-figure video { display: block; width: 100%; height: auto; border-radius: 8px; border: 1px solid var(--line); }
.rag-figure figcaption { text-align: center; font-size: .78rem; color: var(--muted); font-style: italic; margin-top: 8px; }
@keyframes ragPulseSoft { 0%,100% { opacity: .35; } 50% { opacity: 1; } }
@keyframes ragPulseStrong{ 0%,100% { opacity: .55; r: 9; } 50% { opacity: 1; r: 11; } }
@keyframes ragMissed { 0%,100% { opacity: .25; } 50% { opacity: .9; } }
@keyframes ragFlow { from { stroke-dashoffset: 24; } to { stroke-dashoffset: 0; } }
@keyframes ragDraw { from { stroke-dashoffset: 800; } to { stroke-dashoffset: 0; } }
.rag-flow-line { stroke-dasharray: 6 4; animation: ragFlow 1.2s linear infinite; }
.rag-traversal { stroke-dasharray: 6 4; animation: ragFlow 1.5s linear infinite; }
.rag-glow { animation: ragPulseStrong 2.4s ease-in-out infinite; }
.rag-missed { animation: ragMissed 3s ease-in-out infinite; }
/* Sequential agent tool highlight (5 tools) */
@keyframes ragAgentT { 0%,18% { opacity: 1; } 25%,100% { opacity: .35; } }
.rag-tool-1 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 0s; }
.rag-tool-2 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 1s; }
.rag-tool-3 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 2s; }
.rag-tool-4 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 3s; }
.rag-tool-5 { animation: ragAgentT 5s ease-in-out infinite; animation-delay: 4s; }
/* Sequential float→binary cell transform (8 cells) */
@keyframes ragBitTransform { 0%,8% { opacity: 0; } 14%,100% { opacity: 1; } }
.rag-bit-1 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 0s; }
.rag-bit-2 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .35s; }
.rag-bit-3 { animation: ragBitTransform 4s ease-out infinite; animation-delay: .7s; }
.rag-bit-4 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.05s; }
.rag-bit-5 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.4s; }
.rag-bit-6 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 1.75s; }
.rag-bit-7 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.1s; }
.rag-bit-8 { animation: ragBitTransform 4s ease-out infinite; animation-delay: 2.45s; }
@media (prefers-reduced-motion: reduce) {
.rag-flow-line, .rag-traversal, .rag-glow, .rag-missed,
.rag-tool-1,.rag-tool-2,.rag-tool-3,.rag-tool-4,.rag-tool-5,
.rag-bit-1,.rag-bit-2,.rag-bit-3,.rag-bit-4,.rag-bit-5,.rag-bit-6,.rag-bit-7,.rag-bit-8 { animation: none !important; opacity: 1 !important; }
}
.rag-cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(240px, 1fr)); gap: 14px; margin: 24px 0; }
.rag-card { background: var(--bg2); border: 1px solid var(--line); border-radius: 10px; padding: 18px 18px; }
.rag-card h4 { font-size: 1rem; font-weight: 600; color: #fff; margin: 0 0 8px; }
.rag-card p { font-size: .88rem; color: var(--muted); margin: 0; }
.rag-pull { border-left: 3px solid var(--accent); padding: 6px 0 6px 18px; margin: 28px 0; font-size: 1.08rem; color: var(--accent); font-style: italic; }
.rag-pull.warn { border-color: var(--warn); color: var(--warn); }
.rag-table { width: 100%; border-collapse: collapse; margin: 18px 0 26px; font-size: .9rem; }
.rag-table th, .rag-table td { padding: 11px 14px; text-align: left; border-bottom: 1px solid var(--line); vertical-align: top; }
.rag-table th { color: var(--accent); font-family: 'JetBrains Mono', monospace; font-size: .72rem; letter-spacing: .14em; text-transform: uppercase; border-bottom: 1px solid var(--accent); font-weight: 600; }
.rag-table td.k { color: #fff; font-weight: 600; width: 25%; }
.rag-table td.v { color: var(--text); }
.rag-table td.m { color: var(--muted); font-size: .85rem; }
.rag-stat { display: grid; grid-template-columns: repeat(4,1fr); gap: 1px; background: var(--line); border: 1px solid var(--line); margin: 28px 0; border-radius: 6px; overflow: hidden; }
.rag-stat > div { background: var(--bg2); padding: 18px 14px; text-align: center; }
.rag-stat .num { font-family: 'Inter', sans-serif; font-weight: 700; font-size: 1.7rem; color: var(--accent); display: block; }
.rag-stat .lbl { font-family: 'JetBrains Mono', monospace; font-size: .66rem; letter-spacing: .1em; color: var(--muted); text-transform: uppercase; margin-top: 4px; display: block; }
@media (max-width: 600px) {
.rag-art { padding: 1.25rem; }
.rag-stat { grid-template-columns: repeat(2,1fr); }
.rag-visual svg { min-width: 600px; }
}</style><div class="rag-art"><div class="section" style="padding-top:0;border-bottom:none"><p>Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.</p><p>This piece does two things. <strong>First</strong>, it breaks down RAG, Graph RAG, and Agentic RAG visually: how each works, where each one breaks, and which query type it's the right fit for. <strong>Second</strong>, it shows how a single technique called <em>binary quantization</em> can shrink the vector index inside any of these architectures by a factor of 32 without breaking retrieval quality. This is the trick Perplexity, Azure, and HubSpot use in production.</p><div class="rag-stat"><div><span class="num">3</span><span class="lbl">Architectures<br>compared</span></div><div><span class="num">32×</span><span class="lbl">Memory<br>reduction</span></div><div><span class="num">1-hop / N-hop</span><span class="lbl">When each<br>variant wins</span></div><div><span class="num">0 code</span><span class="lbl">All concept,<br>all visual</span></div></div></div><div class="section"><div class="label">Part 1A · Standard RAG</div><h2>The Default Pipeline — and What It's Actually Good At</h2><p>Standard RAG is what most engineers mean when they say "RAG". Documents are split into chunks, each chunk is embedded into a high-dimensional vector, and those vectors are stored in a vector database. At query time, the user's question is embedded too, and the database returns the top-k chunks by similarity (usually cosine similarity). 
Those chunks are pasted into the LLM's prompt as context, and the model answers from them.</p><div class="rag-visual"><svg viewBox="0 0 860 240" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="240" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">STANDARD RAG · LINEAR PIPELINE</text><defs><marker id="rarr" markerWidth="8" markerHeight="8" refX="5" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#22d3ee"/></marker></defs><g font-family="Inter,sans-serif"><rect x="30" y="92" width="120" height="56" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="90" y="118" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">User Query</text><text x="90" y="135" text-anchor="middle" font-size="9" fill="#8aa0c0">natural language</text><line x1="150" y1="120" x2="195" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect x="200" y="92" width="120" height="56" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="260" y="118" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">Embed</text><text x="260" y="135" text-anchor="middle" font-size="9" fill="#8aa0c0">float32 vector</text><line x1="320" y1="120" x2="365" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect x="370" y="78" width="130" height="84" rx="6" fill="#1f3358" stroke="#00e5a8" stroke-width="1.3"/><text x="435" y="100" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">Vector DB</text><text x="435" y="116" text-anchor="middle" font-size="9" fill="#8aa0c0">cosine similarity</text><text x="435" y="140" text-anchor="middle" font-size="9" fill="#00e5a8" font-family="JetBrains Mono,monospace">top-k chunks</text><line x1="500" y1="120" x2="545" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect 
x="550" y="78" width="130" height="84" rx="6" fill="#142442" stroke="#a78bfa" stroke-width="1.3"/><text x="615" y="100" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">LLM</text><text x="615" y="116" text-anchor="middle" font-size="9" fill="#8aa0c0">prompt + context</text><text x="615" y="140" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">grounded answer</text><line x1="680" y1="120" x2="725" y2="120" stroke="#22d3ee" stroke-width="1.6" marker-end="url(#rarr)" class="rag-flow-line"/><rect x="730" y="92" width="100" height="56" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="780" y="118" text-anchor="middle" font-size="12" fill="#fff" font-weight="600">Answer</text><text x="780" y="135" text-anchor="middle" font-size="9" fill="#8aa0c0">to user</text><line x1="370" y1="178" x2="500" y2="178" stroke="#8aa0c0" stroke-width=".5" stroke-dasharray="3,2"/><text x="435" y="196" text-anchor="middle" font-size="9" fill="#8aa0c0" font-style="italic">retrieval happens here</text><text x="435" y="208" text-anchor="middle" font-size="9" fill="#8aa0c0" font-style="italic">— individual chunks, ranked by similarity</text></g></svg></div><h3>Where Standard RAG Wins</h3><ul><li><strong>Direct factual lookups.</strong> A single chunk contains the answer. "What is our refund policy?" → retrieves the refund policy chunk → done.</li><li><strong>Cost and latency.</strong> One embedding call, one similarity search, one LLM call. Easy to debug.</li><li><strong>Mature tooling.</strong> Pinecone, Weaviate, Qdrant, Milvus, pgvector — all production-ready for this pattern.</li></ul><h3>Where Standard RAG Breaks</h3><p>It retrieves <em>chunks</em>, never the relationships <em>between</em> chunks. 
The moment the answer requires combining facts that live in different documents — or even different sections of the same document — similarity search starts missing things.</p><div class="rag-pull warn">Similarity search will happily return two facts that sit close to the query in embedding space, while the missing third fact that connects them sits far away and never makes it into the context window.</div><p>Concretely, imagine a vector database storing three facts about your internal services:</p><div class="rag-visual"><svg viewBox="0 0 860 280" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="280" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#f59e0b" letter-spacing="2.5" font-weight="600">THE MULTI-HOP PROBLEM</text><rect x="40" y="44" width="780" height="200" rx="6" fill="#0a1220" stroke="#1f3358" stroke-width=".5" stroke-dasharray="3,3"/><text x="55" y="62" font-family="JetBrains Mono,monospace" font-size="8" fill="#8aa0c0" letter-spacing="1">EMBEDDING SPACE (conceptual)</text><g transform="translate(110,180)"><circle r="10" fill="#a78bfa"/><text x="0" y="-18" text-anchor="middle" font-family="Inter,sans-serif" font-size="9.5" fill="#a78bfa" font-weight="600">QUERY</text><text x="0" y="32" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Will checkout be affected</text><text x="0" y="42" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">by Friday's maintenance?"</text></g><g transform="translate(230,150)"><circle r="9" fill="#00e5a8" class="rag-glow"/><text x="0" y="-16" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#00e5a8" font-weight="600">Fact 1 · RETRIEVED</text><text x="0" y="30" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Checkout service uses</text><text x="0" y="40" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">the 
payments API."</text></g><g transform="translate(440,90)"><circle r="9" fill="#ef4444" opacity=".75" class="rag-missed"/><text x="0" y="-14" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#ef4444" font-weight="600">Fact 2 · MISSED</text><text x="0" y="26" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Payments API runs</text><text x="0" y="36" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">on cluster-3."</text><text x="0" y="48" text-anchor="middle" font-family="Inter,sans-serif" font-size="7.5" fill="#ef4444" font-style="italic">no "checkout" · no "maintenance"</text></g><g transform="translate(680,160)"><circle r="9" fill="#00e5a8" class="rag-glow" style="animation-delay:1.2s"/><text x="0" y="-16" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#00e5a8" font-weight="600">Fact 3 · RETRIEVED</text><text x="0" y="30" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">"Cluster-3 maintenance</text><text x="0" y="40" text-anchor="middle" font-family="Inter,sans-serif" font-size="8.5" fill="#fff">scheduled for Friday."</text></g><line x1="120" y1="180" x2="220" y2="156" stroke="#00e5a8" stroke-width="1" stroke-opacity=".6"/><line x1="120" y1="180" x2="670" y2="166" stroke="#00e5a8" stroke-width="1" stroke-opacity=".6"/><line x1="120" y1="180" x2="430" y2="96" stroke="#ef4444" stroke-width="1" stroke-opacity=".4" stroke-dasharray="4,3"/><text x="430" y="266" text-anchor="middle" font-family="Inter,sans-serif" font-size="10" fill="#f59e0b" font-style="italic">LLM gets facts 1 and 3 but can't link them. Answers "I don't know" or hallucinates a connection.</text></svg></div><p class="rag-cap">The bridge fact sits too far from the query in embedding space. 
Similarity search has no way to find it from where it started.</p></div><div class="section"><div class="label">Part 1B · Graph RAG</div><h2>Adding a Knowledge Graph on Top</h2><p>Graph RAG addresses the multi-hop problem by adding a structural layer over the documents. During indexing, an LLM extracts <strong>entities</strong> (services, people, places, concepts) and the <strong>relationships</strong> between them, building a knowledge graph alongside the vector index. At query time, the system traverses that graph instead of relying purely on embedding similarity.</p><div class="rag-visual"><svg viewBox="0 0 860 320" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="320" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">GRAPH RAG · GRAPH TRAVERSAL OVER LINKED ENTITIES</text><defs><marker id="garr" markerWidth="8" markerHeight="8" refX="5" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#00e5a8"/></marker></defs><g font-family="Inter,sans-serif"><text x="170" y="50" text-anchor="middle" font-size="10" fill="#8aa0c0" font-family="JetBrains Mono,monospace" letter-spacing="1.5">[ AT INDEXING TIME ]</text><rect x="60" y="70" width="220" height="50" rx="4" fill="#142442" stroke="#1f3358"/><text x="170" y="92" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">Documents</text><text x="170" y="108" text-anchor="middle" font-size="9" fill="#8aa0c0">chunks · prose · tables</text><line x1="170" y1="120" x2="170" y2="142" stroke="#00e5a8" stroke-width="1.2" marker-end="url(#garr)"/><rect x="60" y="148" width="220" height="50" rx="4" fill="#142442" stroke="#a78bfa"/><text x="170" y="170" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">LLM Entity Extractor</text><text x="170" y="186" text-anchor="middle" font-size="9" fill="#a78bfa">nodes + edges</text><line x1="170" y1="198" x2="170" y2="220" stroke="#00e5a8" 
stroke-width="1.2" marker-end="url(#garr)"/><rect x="60" y="226" width="220" height="50" rx="4" fill="#1f3358" stroke="#00e5a8" stroke-width="1.3"/><text x="170" y="248" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">Knowledge Graph</text><text x="170" y="264" text-anchor="middle" font-size="9" fill="#00e5a8">Neo4j · Memgraph · property graph</text></g><line x1="320" y1="50" x2="320" y2="290" stroke="#1f3358" stroke-width=".5" stroke-dasharray="3,3"/><text x="600" y="50" text-anchor="middle" font-size="10" fill="#8aa0c0" font-family="JetBrains Mono,monospace" letter-spacing="1.5">[ THE RESULTING GRAPH ]</text><g font-family="Inter,sans-serif"><circle cx="400" cy="120" r="28" fill="#142442" stroke="#22d3ee" stroke-width="1.4"/><text x="400" y="118" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">checkout</text><text x="400" y="130" text-anchor="middle" font-size="9" fill="#8aa0c0">service</text><circle cx="560" cy="170" r="28" fill="#142442" stroke="#22d3ee" stroke-width="1.4"/><text x="560" y="168" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">payments</text><text x="560" y="180" text-anchor="middle" font-size="9" fill="#8aa0c0">API</text><circle cx="720" cy="120" r="28" fill="#142442" stroke="#22d3ee" stroke-width="1.4"/><text x="720" y="118" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">cluster-3</text><text x="720" y="130" text-anchor="middle" font-size="9" fill="#8aa0c0">infra</text><circle cx="640" cy="260" r="28" fill="#142442" stroke="#f59e0b" stroke-width="1.4"/><text x="640" y="258" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Friday</text><text x="640" y="270" text-anchor="middle" font-size="9" fill="#f59e0b">maintenance</text><line x1="428" y1="120" x2="534" y2="166" stroke="#00e5a8" stroke-width="1.2"/><text x="475" y="135" font-size="8" fill="#00e5a8" font-style="italic">uses</text><line x1="588" y1="170" x2="692" y2="124" stroke="#00e5a8" 
stroke-width="1.2"/><text x="635" y="138" font-size="8" fill="#00e5a8" font-style="italic">runs_on</text><line x1="700" y1="148" x2="660" y2="232" stroke="#00e5a8" stroke-width="1.2"/><text x="700" y="195" font-size="8" fill="#00e5a8" font-style="italic">affects</text><path d="M 400 120 Q 480 70 560 170 Q 640 210 640 260" fill="none" stroke="#f59e0b" stroke-width="2.5" opacity=".85" class="rag-traversal" pathLength="100" stroke-dasharray="100"><animate attributeName="stroke-dashoffset" from="100" to="0" dur="2.5s" repeatCount="indefinite"/></path><text x="430" y="298" font-size="9" fill="#f59e0b" font-family="Inter,sans-serif" font-style="italic">↑ Traversal: checkout → uses → payments → runs_on → cluster-3 → affects → Friday maintenance</text></g></svg></div><h3>How a Graph RAG Query Actually Runs</h3><p>The user asks "Will checkout be affected by Friday's maintenance?". The system identifies the entities mentioned in the query (<code>checkout</code>, <code>Friday maintenance</code>), looks them up as nodes in the graph, and walks the edges between them. The traversal returns the chain of relationships, and that chain gets handed to the LLM as structured context — not as random chunks of prose.</p><div class="rag-cards"><div class="rag-card"><h4>Multi-hop reasoning</h4><p>Following <code>uses → runs_on → affects</code> recovers the bridge fact that pure similarity search missed.</p></div><div class="rag-card"><h4>Explainable context</h4><p>Every answer comes with a traversable path. Easier to audit than "the top-5 most similar chunks said so."</p></div><div class="rag-card"><h4>Heavier to build</h4><p>Entity extraction at index time is expensive. Schema design matters. Not a free lunch.</p></div><div class="rag-card"><h4>Less flexible than agents</h4><p>The graph schema is fixed at indexing time. 
Queries that need fresh tools or external sources still need help.</p></div></div></div><div class="section"><div class="label">Part 1C · Agentic RAG</div><h2>Letting the LLM Choose How to Retrieve</h2><p>Agentic RAG replaces the fixed retrieval pipeline with an LLM agent that decides — at query time — which tools to invoke, which sources to query, and in what order. The agent might call a vector search, then a SQL database, then a web fetch, then a graph traversal, all in service of one question.</p><div class="rag-visual"><svg viewBox="0 0 860 360" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="360" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">AGENTIC RAG · DYNAMIC TOOL ORCHESTRATION</text><defs><marker id="aarr" markerWidth="7" markerHeight="7" refX="4" refY="3.5" orient="auto"><path d="M0,0 L7,3.5 L0,7 Z" fill="#a78bfa"/></marker></defs><g font-family="Inter,sans-serif"><rect x="30" y="170" width="120" height="40" rx="4" fill="#142442" stroke="#22d3ee"/><text x="90" y="194" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">User Query</text></g><line x1="150" y1="190" x2="310" y2="190" stroke="#a78bfa" stroke-width="1.4" marker-end="url(#aarr)"/><g font-family="Inter,sans-serif"><circle cx="380" cy="190" r="64" fill="#1f3358" stroke="#a78bfa" stroke-width="1.8"/><text x="380" y="186" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">LLM Agent</text><text x="380" y="202" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">plan · choose · iterate</text><text x="380" y="216" text-anchor="middle" font-size="8" fill="#8aa0c0">reflects on partial results</text></g><g font-family="Inter,sans-serif"><rect x="540" y="40" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="605" y="60" text-anchor="middle" font-size="10" fill="#fff" 
font-weight="600">Vector DB</text><text x="605" y="74" text-anchor="middle" font-size="8.5" fill="#8aa0c0">unstructured docs</text><rect x="700" y="100" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="765" y="120" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">SQL Database</text><text x="765" y="134" text-anchor="middle" font-size="8.5" fill="#8aa0c0">structured rows</text><rect x="720" y="170" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="785" y="190" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Knowledge Graph</text><text x="785" y="204" text-anchor="middle" font-size="8.5" fill="#8aa0c0">linked entities</text><rect x="700" y="240" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="765" y="260" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Web Search</text><text x="765" y="274" text-anchor="middle" font-size="8.5" fill="#8aa0c0">fresh facts</text><rect x="540" y="300" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="605" y="320" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Code Interpreter</text><text x="605" y="334" text-anchor="middle" font-size="8.5" fill="#8aa0c0">compute · joins</text><rect x="380" y="300" width="130" height="44" rx="5" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="445" y="320" text-anchor="middle" font-size="10" fill="#fff" font-weight="600">Internal Systems</text><text x="445" y="334" text-anchor="middle" font-size="8.5" fill="#8aa0c0">Slack · Jira · ITSM</text><line x1="430" y1="155" x2="540" y2="65" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-1"/><line x1="444" y1="175" x2="700" y2="118" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-2"/><line x1="444" y1="194" x2="720" y2="192" stroke="#a78bfa" stroke-width="1.6" 
stroke-dasharray="3,2" class="rag-tool-3"/><line x1="440" y1="220" x2="700" y2="258" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-4"/><line x1="406" y1="248" x2="586" y2="300" stroke="#a78bfa" stroke-width="1.6" stroke-dasharray="3,2" class="rag-tool-5"/><line x1="378" y1="254" x2="426" y2="300" stroke="#a78bfa" stroke-width="1.2" stroke-dasharray="3,2" opacity=".5"/><circle cx="380" cy="190" r="64" fill="none" stroke="#a78bfa" stroke-width="1.4" opacity=".6"><animate attributeName="r" from="64" to="96" dur="2.8s" repeatCount="indefinite"/><animate attributeName="opacity" from=".55" to="0" dur="2.8s" repeatCount="indefinite"/></circle></g><text x="430" y="354" text-anchor="middle" font-family="Inter,sans-serif" font-size="9" fill="#8aa0c0" font-style="italic">No fixed pipeline. The agent picks the next tool based on what the last tool returned.</text></svg></div><h3>What "Dynamic" Actually Means</h3><p>A user asks: <em>"Has any customer raised a ticket about the checkout outage we had last Friday, and what was our response time on it?"</em> An agentic system might:</p><ul><li>Call the <strong>knowledge graph</strong> to confirm there was an outage on the checkout service last Friday.</li><li>Call <strong>SQL</strong> on the ticketing database to list tickets opened that day mentioning "checkout".</li><li>Call the <strong>vector DB</strong> over chat history to find related customer complaints in Slack.</li><li>Call the <strong>code interpreter</strong> to compute average first-response time on the matching tickets.</li><li>Compose the answer.</li></ul><p>None of that ordering was decided in advance. The agent chose it. That flexibility is the whole point — and the whole risk.</p><div class="rag-cards"><div class="rag-card"><h4>Flexible</h4><p>Handles open-ended tasks that touch multiple data sources and require fresh information.</p></div><div class="rag-card"><h4>Higher latency</h4><p>Several tool calls per question. 
A simple lookup that took 200ms in standard RAG now takes 4–8 seconds.</p></div><div class="rag-card"><h4>Harder to debug</h4><p>The agent's reasoning path is non-deterministic. Reproducing a failure mode can be slippery.</p></div><div class="rag-card"><h4>Can spiral</h4><p>Without tight tool authority and budgets, agents loop on themselves. Pair this with a state machine.</p></div></div></div><div class="section"><div class="label">Part 1D · Decision</div><h2>These Aren't Levels — They're Different Tools</h2><p>The most common mistake is treating these as a maturity ladder you have to climb. They aren't. They solve different query types. A good system often uses all three in different parts of the same product.</p><div class="rag-visual"><svg viewBox="0 0 860 280" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="280" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">PICK BY QUERY TYPE — NOT BY HYPE</text><g font-family="Inter,sans-serif"><rect x="40" y="56" width="240" height="200" rx="8" fill="#142442" stroke="#22d3ee" stroke-width="1.3"/><text x="160" y="86" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">Standard RAG</text><line x1="60" y1="98" x2="260" y2="98" stroke="#22d3ee" stroke-width=".5"/><text x="160" y="124" text-anchor="middle" font-size="10" fill="#22d3ee" font-family="JetBrains Mono,monospace">single-hop</text><text x="160" y="144" text-anchor="middle" font-size="10" fill="#22d3ee" font-family="JetBrains Mono,monospace">factual lookups</text><text x="160" y="174" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"What's our refund policy?"</text><text x="160" y="190" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"How do I reset my password?"</text><text x="160" y="206" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"Where is the SLA defined?"</text><text x="160" y="234" text-anchor="middle" 
font-size="9" fill="#00e5a8" font-style="italic">cost: low · latency: low</text><rect x="310" y="56" width="240" height="200" rx="8" fill="#142442" stroke="#00e5a8" stroke-width="1.3"/><text x="430" y="86" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">Graph RAG</text><line x1="330" y1="98" x2="530" y2="98" stroke="#00e5a8" stroke-width=".5"/><text x="430" y="124" text-anchor="middle" font-size="10" fill="#00e5a8" font-family="JetBrains Mono,monospace">multi-hop</text><text x="430" y="144" text-anchor="middle" font-size="10" fill="#00e5a8" font-family="JetBrains Mono,monospace">relationship queries</text><text x="430" y="174" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"Who depends on cluster-3?"</text><text x="430" y="190" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"What did this RCA reference?"</text><text x="430" y="206" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"What blocks this release?"</text><text x="430" y="234" text-anchor="middle" font-size="9" fill="#00e5a8" font-style="italic">cost: medium · build: heavy</text><rect x="580" y="56" width="240" height="200" rx="8" fill="#142442" stroke="#a78bfa" stroke-width="1.3"/><text x="700" y="86" text-anchor="middle" font-size="13" fill="#fff" font-weight="700">Agentic RAG</text><line x1="600" y1="98" x2="800" y2="98" stroke="#a78bfa" stroke-width=".5"/><text x="700" y="124" text-anchor="middle" font-size="10" fill="#a78bfa" font-family="JetBrains Mono,monospace">multi-source</text><text x="700" y="144" text-anchor="middle" font-size="10" fill="#a78bfa" font-family="JetBrains Mono,monospace">tool-using tasks</text><text x="700" y="174" text-anchor="middle" font-size="9.5" fill="#8aa0c0">"Did anyone open a ticket about</text><text x="700" y="190" text-anchor="middle" font-size="9.5" fill="#8aa0c0">last Friday's outage, and</text><text x="700" y="206" text-anchor="middle" font-size="9.5" fill="#8aa0c0">what was our response time?"</text><text x="700" y="234" 
text-anchor="middle" font-size="9" fill="#a78bfa" font-style="italic">cost: high · debug: hardest</text></g></svg></div><p>Once the right architecture is in place for the query type, the next leverage point is efficiency. Every one of these three depends on a vector index somewhere underneath — and that index is where most of the memory cost lives.</p></div><div class="section"><div class="label">Part 2 · Efficiency</div><h2>How to Make Any RAG 32× More Memory Efficient</h2><figure class="rag-figure"><video autoplay muted loop playsinline preload="metadata" poster="/images/rag-variants/binary-quantization-chip.png"><source src="/images/rag-variants/binary-quantization-chip.mp4" type="video/mp4"><img src="/images/rag-variants/binary-quantization-chip.png" alt="A glowing silicon chip with float values transforming into binary on its surface"/></video><figcaption>The 32× trick — float magnitudes compressed to a single sign bit per dimension.</figcaption></figure><p>Every RAG variant pays the same tax: it stores high-dimensional embeddings of every chunk it's ever indexed. That tax adds up fast. At ten million chunks, a standard 768-dimension float32 index needs about 30 GB just to hold the vectors — and that index has to sit in fast RAM if you want sub-second retrieval. Doubling your corpus doubles the bill.</p><p>The trick that Perplexity, Azure AI Search, and HubSpot all use in production is called <strong>binary quantization</strong>. It shrinks the stored vectors to 1/32 of their float32 size. 
The architecture above it doesn't change — Standard, Graph, or Agentic, the same trick applies.</p><h3>The Memory Bill, in Numbers</h3><div class="rag-visual"><svg viewBox="0 0 860 220" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="220" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">A SINGLE EMBEDDING — WHAT YOU'RE ACTUALLY PAYING FOR</text><g font-family="JetBrains Mono,monospace"><text x="60" y="60" font-size="9" fill="#8aa0c0">VECTOR:</text><g transform="translate(60,70)"><rect x="0" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="48" y="22" text-anchor="middle" font-size="9.5" fill="#fff">0.42</text><text x="48" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><rect x="96" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="144" y="22" text-anchor="middle" font-size="9.5" fill="#fff">-0.18</text><text x="144" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><rect x="192" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="240" y="22" text-anchor="middle" font-size="9.5" fill="#fff">0.93</text><text x="240" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><rect x="288" y="0" width="95" height="40" fill="#142442" stroke="#22d3ee" stroke-width=".5"/><text x="336" y="22" text-anchor="middle" font-size="9.5" fill="#fff">-0.05</text><text x="336" y="34" text-anchor="middle" font-size="7" fill="#8aa0c0">32 bits</text><text x="410" y="22" font-size="9" fill="#8aa0c0">…</text><text x="410" y="34" font-size="9" fill="#8aa0c0">…</text><rect x="450" y="0" width="280" height="40" fill="#0a1220" stroke="#1f3358" stroke-width=".5"/><text x="590" y="22" text-anchor="middle" font-size="9.5" fill="#22d3ee">764 more dimensions</text><text x="590" y="34" text-anchor="middle" 
font-size="7" fill="#8aa0c0">×32 bits each</text></g></g><g font-family="Inter,sans-serif"><text x="60" y="138" font-size="11" fill="#fff" font-weight="600">One vector = 768 dimensions × 32 bits = <tspan fill="#f59e0b" font-weight="700">3,072 bytes</tspan></text><text x="60" y="158" font-size="11" fill="#fff" font-weight="600">10 million vectors = <tspan fill="#f59e0b" font-weight="700">~30 GB</tspan>, all in hot RAM for real-time retrieval</text><text x="60" y="178" font-size="10" fill="#8aa0c0" font-style="italic">Most of those 32 bits per dimension are encoding magnitudes you never actually use for ranking.</text></g></svg></div><h3>The Trick: Throw Away Magnitudes, Keep the Sign</h3><p>Binary quantization is structurally simple. For every dimension of every vector, ask one question: <em>is the value positive or negative?</em> Positive becomes <code>1</code>; zero or negative becomes <code>0</code>. The 32-bit float is replaced by a single bit. Same dimensionality, 1/32nd the storage.</p><div class="rag-visual"><svg viewBox="0 0 860 280" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="280" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">FLOAT32 → BINARY · ONE BIT PER DIMENSION</text><g font-family="JetBrains Mono,monospace"><text x="60" y="58" font-size="9" fill="#8aa0c0">FLOAT32:</text><g transform="translate(60,68)"><g><rect x="0" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="43" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.42</text><rect x="90" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="133" y="22" text-anchor="middle" font-size="11" fill="#f59e0b">−0.18</text><rect x="180" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="223" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.93</text><rect x="270" y="0" 
width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="313" y="22" text-anchor="middle" font-size="11" fill="#f59e0b">−0.05</text><rect x="360" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="403" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.21</text><rect x="450" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="493" y="22" text-anchor="middle" font-size="11" fill="#f59e0b">−0.67</text><rect x="540" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="583" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.11</text><rect x="630" y="0" width="86" height="36" fill="#142442" stroke="#22d3ee" stroke-width=".6"/><text x="673" y="22" text-anchor="middle" font-size="11" fill="#fff">+0.88</text></g></g></g><text x="430" y="138" text-anchor="middle" font-family="Inter,sans-serif" font-size="10" fill="#00e5a8" font-style="italic">sign(x) — if x &gt; 0 then 1 else 0</text><line x1="380" y1="118" x2="380" y2="150" stroke="#00e5a8" stroke-width="1.5"/><line x1="380" y1="150" x2="384" y2="146" stroke="#00e5a8" stroke-width="1.5"/><line x1="380" y1="150" x2="376" y2="146" stroke="#00e5a8" stroke-width="1.5"/><g font-family="JetBrains Mono,monospace"><text x="60" y="178" font-size="9" fill="#8aa0c0">BINARY:</text><g transform="translate(60,188)"><g class="rag-bit-1"><rect x="0" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="43" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-2"><rect x="90" y="0" width="86" height="36" fill="#142442" stroke="#00e5a8" stroke-width=".8"/><text x="133" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">0</text></g><g class="rag-bit-3"><rect x="180" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="223" y="24" 
text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-4"><rect x="270" y="0" width="86" height="36" fill="#142442" stroke="#00e5a8" stroke-width=".8"/><text x="313" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">0</text></g><g class="rag-bit-5"><rect x="360" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="403" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-6"><rect x="450" y="0" width="86" height="36" fill="#142442" stroke="#00e5a8" stroke-width=".8"/><text x="493" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">0</text></g><g class="rag-bit-7"><rect x="540" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="583" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g><g class="rag-bit-8"><rect x="630" y="0" width="86" height="36" fill="#1f3358" stroke="#00e5a8" stroke-width=".8"/><text x="673" y="24" text-anchor="middle" font-size="14" fill="#00e5a8" font-weight="700">1</text></g></g></g><text x="60" y="252" font-family="Inter,sans-serif" font-size="11" fill="#fff" font-weight="600">8 floats (256 bits) → 8 bits. Apply to 768 dims: <tspan fill="#00e5a8">3,072 bytes → 96 bytes.</tspan></text><text x="60" y="270" font-family="Inter,sans-serif" font-size="10" fill="#00e5a8" font-style="italic">Exactly 32× reduction. Per vector. Across the whole index.</text></svg></div><h3>The Distance Metric Changes Too — Cosine Becomes Hamming</h3><p>Float32 vectors compare via <strong>cosine similarity</strong>, which is computed from dot products. Binary vectors compare via <strong>Hamming distance</strong>: count the number of bits that differ between two vectors. 
On modern CPUs this takes two instructions per 64-bit word, <code>XOR</code> then <code>popcount</code>, so comparing two 768-bit vectors costs 12 of each, far less work than a 768-term floating-point dot product.</p><div class="rag-visual"><svg viewBox="0 0 860 240" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="240" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">DISTANCE METRIC · BEFORE AND AFTER</text><g font-family="Inter,sans-serif"><rect x="40" y="50" width="380" height="170" rx="6" fill="#142442" stroke="#22d3ee" stroke-width="1"/><text x="230" y="76" text-anchor="middle" font-size="12" fill="#fff" font-weight="700">Float32 · Cosine Similarity</text><line x1="120" y1="180" x2="220" y2="110" stroke="#22d3ee" stroke-width="1.5"/><g transform-origin="120 180"><line x1="120" y1="180" x2="300" y2="160" stroke="#a78bfa" stroke-width="1.5"><animateTransform attributeName="transform" type="rotate" values="-8 120 180; 6 120 180; -8 120 180" dur="3.6s" repeatCount="indefinite"/></line></g><path d="M 165 158 A 50 50 0 0 1 195 175" fill="none" stroke="#f59e0b" stroke-width="1.5"/><text x="200" y="155" font-size="11" fill="#f59e0b" font-style="italic">θ</text><text x="230" y="206" text-anchor="middle" font-size="10" fill="#8aa0c0" font-family="JetBrains Mono,monospace">cos(θ) = (A · B) / (||A|| × ||B||)</text></g><g font-family="Inter,sans-serif"><rect x="440" y="50" width="380" height="170" rx="6" fill="#142442" stroke="#00e5a8" stroke-width="1"/><text x="630" y="76" text-anchor="middle" font-size="12" fill="#fff" font-weight="700">Binary · Hamming Distance</text><g font-family="JetBrains Mono,monospace" font-size="14"><text x="480" y="118" fill="#8aa0c0">A:</text><text x="510" y="118" fill="#00e5a8" font-weight="700">1</text><text x="530" y="118" fill="#00e5a8" font-weight="700">0</text><text x="550" y="118" fill="#00e5a8" font-weight="700">1</text><text x="570" y="118" fill="#00e5a8" font-weight="700">0</text><text x="590" 
y="118" fill="#00e5a8" font-weight="700">1</text><text x="610" y="118" fill="#00e5a8" font-weight="700">0</text><text x="630" y="118" fill="#00e5a8" font-weight="700">1</text><text x="650" y="118" fill="#00e5a8" font-weight="700">1</text><text x="480" y="142" fill="#8aa0c0">B:</text><text x="510" y="142" fill="#00e5a8" font-weight="700">1</text><text x="530" y="142" fill="#f59e0b" font-weight="700">1</text><text x="550" y="142" fill="#00e5a8" font-weight="700">1</text><text x="570" y="142" fill="#f59e0b" font-weight="700">1</text><text x="590" y="142" fill="#00e5a8" font-weight="700">1</text><text x="610" y="142" fill="#00e5a8" font-weight="700">0</text><text x="630" y="142" fill="#00e5a8" font-weight="700">1</text><text x="650" y="142" fill="#00e5a8" font-weight="700">1</text><text x="480" y="168" fill="#8aa0c0" font-size="11">XOR:</text><text x="510" y="168" fill="#22d3ee" font-size="14">0</text><text x="530" y="168" fill="#f59e0b" font-size="14">1</text><text x="550" y="168" fill="#22d3ee" font-size="14">0</text><text x="570" y="168" fill="#f59e0b" font-size="14">1</text><text x="590" y="168" fill="#22d3ee" font-size="14">0</text><text x="610" y="168" fill="#22d3ee" font-size="14">0</text><text x="630" y="168" fill="#22d3ee" font-size="14">0</text><text x="650" y="168" fill="#22d3ee" font-size="14">0</text></g><text x="630" y="200" text-anchor="middle" font-size="11" fill="#00e5a8" font-family="Inter,sans-serif">popcount(XOR) = 2 bits different = distance</text></g></svg></div><h3>The Trade-off — and the Fix</h3><p>Of course, throwing away the magnitudes throws away some information. A naive binary index loses roughly 5–10% of retrieval accuracy compared to the full float32 index. 
Production systems solve this with a <strong>two-stage search</strong>: use the cheap binary index to retrieve a wide net of candidates fast, then re-score the small candidate set using the original full-precision vectors.</p><div class="rag-visual"><svg viewBox="0 0 860 240" xmlns="http://www.w3.org/2000/svg"><rect width="860" height="240" fill="#0f1a2e"/><text x="430" y="22" text-anchor="middle" font-family="Inter,sans-serif" font-size="11" fill="#22d3ee" letter-spacing="2.5" font-weight="600">TWO-STAGE RETRIEVAL · SPEED FIRST, PRECISION SECOND</text><defs><marker id="harr" markerWidth="8" markerHeight="8" refX="5" refY="4" orient="auto"><path d="M0,0 L8,4 L0,8 Z" fill="#00e5a8"/></marker></defs><g font-family="Inter,sans-serif"><rect x="40" y="100" width="100" height="44" rx="5" fill="#142442" stroke="#22d3ee"/><text x="90" y="124" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">Query</text><text x="90" y="138" text-anchor="middle" font-size="8.5" fill="#8aa0c0">embedded</text><line x1="140" y1="122" x2="195" y2="122" stroke="#00e5a8" stroke-width="1.4" marker-end="url(#harr)"/><rect x="200" y="80" width="200" height="84" rx="5" fill="#1f3358" stroke="#00e5a8" stroke-width="1.3"/><text x="300" y="102" text-anchor="middle" font-size="11" fill="#fff" font-weight="700">Stage 1 · Binary Index</text><text x="300" y="118" text-anchor="middle" font-size="9" fill="#00e5a8" font-family="JetBrains Mono,monospace">10M vectors · 0.96 GB</text><text x="300" y="134" text-anchor="middle" font-size="9" fill="#8aa0c0">XOR + popcount</text><text x="300" y="150" text-anchor="middle" font-size="9" fill="#8aa0c0">→ top 500 candidates</text><line x1="400" y1="122" x2="455" y2="122" stroke="#00e5a8" stroke-width="1.4" marker-end="url(#harr)"/><rect x="460" y="80" width="200" height="84" rx="5" fill="#1f3358" stroke="#a78bfa" stroke-width="1.3"/><text x="560" y="102" text-anchor="middle" font-size="11" fill="#fff" font-weight="700">Stage 2 · Float Rescore</text><text 
x="560" y="118" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">500 vectors · cosine</text><text x="560" y="134" text-anchor="middle" font-size="9" fill="#8aa0c0">full precision</text><text x="560" y="150" text-anchor="middle" font-size="9" fill="#8aa0c0">→ top 10 results</text><line x1="660" y1="122" x2="715" y2="122" stroke="#00e5a8" stroke-width="1.4" marker-end="url(#harr)"/><rect x="720" y="100" width="100" height="44" rx="5" fill="#142442" stroke="#22d3ee"/><text x="770" y="124" text-anchor="middle" font-size="11" fill="#fff" font-weight="600">LLM</text><text x="770" y="138" text-anchor="middle" font-size="8.5" fill="#8aa0c0">answer</text><circle r="5" fill="#00e5a8" opacity=".9"><animateMotion dur="4s" repeatCount="indefinite" rotate="auto"><mpath href="#ragFlowPath"/></animateMotion></circle><path id="ragFlowPath" d="M 140 122 L 195 122 L 300 122 L 460 122 L 560 122 L 720 122" fill="none" stroke="none"/><rect x="200" y="180" width="200" height="40" rx="4" fill="#0a1220" stroke="#00e5a8" stroke-width=".6" stroke-dasharray="3,2"/><text x="300" y="196" text-anchor="middle" font-size="9" fill="#00e5a8" font-family="JetBrains Mono,monospace">hot RAM</text><text x="300" y="210" text-anchor="middle" font-size="8.5" fill="#8aa0c0">fits in RAM on one node</text><rect x="460" y="180" width="200" height="40" rx="4" fill="#0a1220" stroke="#a78bfa" stroke-width=".6" stroke-dasharray="3,2"/><text x="560" y="196" text-anchor="middle" font-size="9" fill="#a78bfa" font-family="JetBrains Mono,monospace">cold tier · SSD or compressed RAM</text><text x="560" y="210" text-anchor="middle" font-size="8.5" fill="#8aa0c0">accessed only for 500 candidates</text></g></svg></div><p>Stage 1 is where the 32× memory win lives. The binary index is small enough to sit entirely in RAM on a single node, far too large for CPU cache at this scale but cheap to keep hot, so stage 1 can scan millions of candidates within milliseconds. 
Stage 2 only ever touches a few hundred full-precision vectors, so the expensive cosine math is bounded.</p><div class="rag-pull">The recall lost in stage 1 is paid back in stage 2. End-to-end retrieval quality is typically within 1% of a full float32 search, at 1/32 the hot memory.</div><h3>Memory Bill at Scale — Before and After</h3><table class="rag-table"><thead><tr><th>Corpus size</th><th>Float32 only</th><th>Binary (stage 1)</th><th>Hybrid (stage 1 hot + stage 2 cold)</th></tr></thead><tbody><tr><td class="k">1 M vectors</td><td class="v">3 GB · hot RAM</td><td class="v">96 MB · hot RAM</td><td class="m">96 MB hot · 3 GB cold</td></tr><tr><td class="k">10 M vectors</td><td class="v">30 GB · hot RAM</td><td class="v">960 MB · hot RAM</td><td class="m">960 MB hot · 30 GB cold</td></tr><tr><td class="k">100 M vectors</td><td class="v">300 GB · multi-node</td><td class="v">9.6 GB · single node</td><td class="m">9.6 GB hot · 300 GB cold</td></tr><tr><td class="k">1 B vectors</td><td class="v">3 TB · cluster</td><td class="v">96 GB · single beefy node</td><td class="m">96 GB hot · 3 TB cold tier</td></tr></tbody></table><p>The shape of the curve is what matters: the hot index — the part that controls latency — stays manageable even as the corpus grows by orders of magnitude. The cold tier scales linearly but cheaply, because it only gets touched for the few hundred candidates surfaced by stage 1.</p><h3>When to Reach for This</h3><div class="rag-cards"><div class="rag-card"><h4>Above ~1 M vectors</h4><p>Below that scale, plain float32 is fine. The complexity of two-stage retrieval isn't worth the few GB at most you'd save.</p></div><div class="rag-card"><h4>Hot real-time queries</h4><p>If your retrieval p95 needs to stay under 100 ms, the binary first stage is what keeps you there as the index grows.</p></div><div class="rag-card"><h4>Cost-sensitive deployments</h4><p>Saving 30 GB of RAM × 3 replicas × 12 months adds up to real money. 
Especially on managed vector services.</p></div><div class="rag-card"><h4>Any of the three architectures</h4><p>Standard, Graph, Agentic — they all sit on a vector index somewhere. This optimisation applies everywhere they do.</p></div></div></div><div class="section" style="border-bottom:none"><div class="label">Putting It Together</div><h2>Architecture and Efficiency Are Orthogonal</h2><p>Two decisions, independent of each other. <strong>What kind of question does my system have to answer?</strong> — that's the architecture decision. Single-hop facts go to Standard RAG. Multi-hop relationship questions go to Graph RAG. Open-ended tool-using tasks go to Agentic RAG. <strong>How big is my index going to get?</strong> — that's the efficiency decision. Above a million chunks, binary quantization plus float rescoring buys you 32× hot-memory headroom for ~1% quality cost.</p><p>The same vector index sits underneath all three architectures. The same trick applies to all three. Pick the architecture for the query type. Apply the efficiency trick because the math works.</p><div class="rag-pull">RAG isn't one thing — it's a layered decision. Get the architecture right for the query, then make the index small enough to keep up.</div><p style="text-align:center;margin-top:32px;color:var(--muted);font-size:.85rem"><strong style="color:var(--accent2)">Ajay Walia</strong> · CuriousBit Knowledge Base · May 2026</p></div></div>
]]></content:encoded><media:content url="https://curiousbit.netlify.app/images/rag-variants/hero.png" medium="image"><media:title type="plain">Vector-Search</media:title></media:content><category>artificial-intelligence</category><category>rag</category><category>llm</category><category>vector-search</category><category>architecture</category><category>Knowledge Base</category></item></channel></rss>