RAG, Graph RAG, Agentic RAG — and How to Make Any of Them 32× Memory Efficient
A visual breakdown of three RAG architectures — when each one wins, where it breaks down, and how binary quantization can shrink the vector index by 32× without changing the architecture you picked.
Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.
This piece does two things. First, it breaks down RAG, Graph RAG, and Agentic RAG visually: how each works, where each one breaks, and which query type it's the right fit for. Second, it shows how a single technique called binary quantization can shrink the vector index inside any of these architectures by a factor of 32 without breaking retrieval quality. This is the trick Perplexity, Azure AI Search, and HubSpot use in production.
The Default Pipeline — and What It's Actually Good At
Standard RAG is what most engineers mean when they say "RAG". Documents are split into chunks, each chunk is embedded into a high-dimensional vector, and those vectors are stored in a vector database. At query time, the user's question is embedded too, and the database returns the top-k chunks by similarity (usually cosine similarity). Those chunks are pasted into the LLM's prompt as context, and the model answers from them.
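The pipeline described above can be sketched in a few lines. This is a minimal illustration, not a production setup: `embed` is a toy bag-of-words stand-in for a real embedding model, the three chunks are made-up sample text, and `top_k` and `build_prompt` are hypothetical helper names. The final LLM call is left out.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, context_chunks):
    """Paste the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up sample chunks, for illustration only.
chunks = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping policy: orders ship within 2 business days.",
    "Support hours: the help desk is open on weekdays.",
]
query = "What is our refund policy?"
retrieved = top_k(query, chunks, k=1)
prompt = build_prompt(query, retrieved)
print(prompt)
```

In a real system the `embed` call would go to an embedding model and `top_k` would be a query against the vector database, but the shape of the flow stays the same.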
Where Standard RAG Wins
- Direct factual lookups. A single chunk contains the answer. "What is our refund policy?" → retrieves the refund policy chunk → done.
- Cost and latency. One embedding call, one similarity search, one LLM call. Easy to debug.
- Mature tooling. Pinecone, Weaviate, Qdrant, Milvus, pgvector — all production-ready for this pattern.
Where Standard RAG Breaks
It retrieves chunks, never the relationships between chunks. The moment the answer requires combining facts that live in different documents — or even different sections of the same document — similarity search starts missing things.
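Concretely, imagine a vector database storing three facts about your internal services: one says which service checkout uses, one says what that service runs on, and one says what Friday's maintenance affects. Ask "Will checkout be affected by Friday's maintenance?" and the first and last facts look similar to the query, but the middle one does not.
The bridge fact sits too far from the query in embedding space. Similarity search has no way to find it from where it started.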
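The miss can be reproduced with a toy example. The three facts below are hypothetical (the service and cluster names are invented for illustration), and word-overlap cosine stands in for a real embedding model. A real model captures more meaning than word overlap, so the bridge fact would usually score low rather than exactly zero, but the ranking problem is the same.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical facts; the names are assumptions, not from a real system.
fact_uses = "The checkout service uses the payments API."
bridge = "The payments API runs on database cluster db-7."
fact_affects = "Maintenance on Friday affects database cluster db-7."
facts = [fact_uses, bridge, fact_affects]

query = "Will checkout be affected by Friday's maintenance?"
q = embed(query)
scores = {fact: cosine(q, embed(fact)) for fact in facts}

# Top-2 retrieval: the two facts that share words with the query win.
retrieved = sorted(facts, key=scores.get, reverse=True)[:2]
for fact in facts:
    marker = "retrieved" if fact in retrieved else "MISSED"
    print(f"{scores[fact]:.3f}  {marker:9s}  {fact}")
```

The bridge fact mentions neither checkout nor Friday nor maintenance, so it never makes the cut, and the model is left with two facts it cannot connect.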
Adding a Knowledge Graph on Top
Graph RAG addresses the multi-hop problem by adding a structural layer over the documents. During indexing, an LLM extracts entities (services, people, places, concepts) and the relationships between them, building a knowledge graph alongside the vector index. At query time, the system traverses that graph instead of relying purely on embedding similarity.
How a Graph RAG Query Actually Runs
The user asks "Will checkout be affected by Friday's maintenance?". The system identifies the entities mentioned in the query (checkout, Friday maintenance), looks them up as nodes in the graph, and walks the edges between them. The traversal returns the chain of relationships, and that chain gets handed to the LLM as structured context — not as random chunks of prose.
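A minimal sketch of that traversal, assuming the entities have already been extracted from the query and matched to graph nodes. The triples and node names are hypothetical, and a breadth-first search over an in-memory list stands in for a real graph store.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples produced at index time.
TRIPLES = [
    ("friday_maintenance", "affects", "db_cluster"),
    ("payments_api", "runs_on", "db_cluster"),
    ("checkout", "uses", "payments_api"),
    ("search", "uses", "catalog_api"),
]

def build_adjacency(triples):
    """Index each triple under both endpoints so edges can be walked either way."""
    adjacency = {}
    for subj, rel, obj in triples:
        adjacency.setdefault(subj, []).append((obj, (subj, rel, obj)))
        adjacency.setdefault(obj, []).append((subj, (subj, rel, obj)))
    return adjacency

def find_path(triples, start, goal):
    """Breadth-first search; returns the list of triples linking start to goal."""
    adjacency = build_adjacency(triples)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, triple in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [triple]))
    return None  # the two entities are not connected

path = find_path(TRIPLES, "checkout", "friday_maintenance")
context = "\n".join(f"{s} --{r}--> {o}" for s, r, o in path)
print(context)
```

The `context` string is the structured chain that would be handed to the LLM in place of loose chunks. Walking edges in both directions is a design choice here: the "affects" edge points from the maintenance event to the cluster, so a strictly directed walk starting at checkout would not reach it.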
- Multi-hop reasoning. Following uses → runs_on → affects recovers the bridge fact that pure similarity search missed.
- Explainable context. Every answer comes with a traversable path. Easier to audit than "the top-5 most similar chunks said so."
- Heavier to build. Entity extraction at index time is expensive, and schema design matters. Not a free lunch.
- Less flexible than agents. The graph schema is fixed at indexing time. Queries that need fresh tools or external sources still need help.
Letting the LLM Choose How to Retrieve
Agentic RAG replaces the fixed retrieval pipeline with an LLM agent that decides — at query time — which tools to invoke, which sources to query, and in what order. The agent might call a vector search, then a SQL database, then a web fetch, then a graph traversal, all in service of one question.
What "Dynamic" Actually Means
A user asks: "Has any customer raised a ticket about the checkout outage we had last Friday, and what was our response time on it?" An agentic system might:
- Call the knowledge graph to confirm there was an outage on the checkout service last Friday.
- Call SQL on the ticketing database to list tickets opened that day mentioning "checkout".
- Call the vector DB over chat history to find related customer complaints in Slack.
- Call the code interpreter to compute average first-response time on the matching tickets.
- Compose the answer.
None of that ordering was decided in advance. The agent chose it. That flexibility is the whole point — and the whole risk.
- Flexible. Handles open-ended tasks that touch multiple data sources and require fresh information.
- Higher latency. Several tool calls per question. A simple lookup that took 200ms in standard RAG can now take 4–8 seconds.
- Harder to debug. The agent's reasoning path is non-deterministic. Reproducing a failure mode can be slippery.
- Can spiral. Without tight tool authority and budgets, agents loop on themselves. Pair this with a state machine.
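One way to keep an agent from spiralling is a hard step budget plus a tool allowlist. The sketch below shows only that guardrail, not a full state machine. The planner functions are scripted stand-ins for an LLM, and the tool names and their canned return values are hypothetical.

```python
def run_agent(plan_next_step, tools, question, max_steps=5):
    """Tool-calling loop with an allowlist and a hard step budget."""
    history = []
    for _ in range(max_steps):
        action = plan_next_step(question, history)
        if action["tool"] == "finish":
            return {"answer": action["input"], "steps": history, "stopped": "finished"}
        if action["tool"] not in tools:
            # Disallowed tools are recorded but never executed.
            history.append((action["tool"], "error: tool not allowed"))
            continue
        history.append((action["tool"], tools[action["tool"]](action["input"])))
    return {"answer": None, "steps": history, "stopped": "budget_exhausted"}

# Hypothetical tools returning canned data; real ones would hit live systems.
TOOLS = {
    "graph_lookup": lambda q: "outage confirmed",
    "sql_query": lambda q: ["T-1", "T-2"],
    "vector_search": lambda q: [],
}

def scripted_planner(question, history):
    """Stand-in for an LLM that finishes after two tool calls."""
    script = [
        {"tool": "graph_lookup", "input": "checkout outage last Friday"},
        {"tool": "sql_query", "input": "tickets mentioning checkout"},
        {"tool": "finish", "input": "2 matching tickets found"},
    ]
    return script[min(len(history), len(script) - 1)]

def looping_planner(question, history):
    """Stand-in for an LLM that keeps repeating the same search."""
    return {"tool": "vector_search", "input": question}

ok = run_agent(scripted_planner, TOOLS, "Any tickets about the checkout outage?")
stuck = run_agent(looping_planner, TOOLS, "Any tickets about the checkout outage?", max_steps=3)
print(ok["stopped"], len(ok["steps"]))
print(stuck["stopped"], len(stuck["steps"]))
```

The looping planner is cut off after three steps instead of running indefinitely, and the recorded `steps` list gives a trace to inspect when reproducing a failure.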
These Aren't Levels — They're Different Tools
The most common mistake is treating these as a maturity ladder you have to climb. They aren't. They solve different query types. A good system often uses all three in different parts of the same product.
Once the right architecture is in place for the query type, the next leverage point is efficiency. Every one of these three depends on a vector index somewhere underneath — and that index is where most of the memory cost lives.
How to Make Any RAG 32× More Memory Efficient
Every RAG variant pays the same tax: it stores high-dimensional embeddings of every chunk it's ever indexed. That tax adds up fast. At ten million chunks, a standard 768-dimension float32 index needs about 30 GB just to hold the vectors — and that index has to sit in fast RAM if you want sub-second retrieval. Doubling your corpus doubles the bill.
The trick that Perplexity, Azure AI Search, and HubSpot all use in production is called binary quantization. It cuts the memory footprint of the stored vectors by a factor of 32. The architecture above it doesn't change: Standard, Graph, or Agentic, the same trick applies.
The Memory Bill, in Numbers
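The 30 GB figure quoted above can be checked with a few lines of arithmetic. This sketch counts raw vector storage only. It ignores index structures, metadata, and replication, and `index_bytes` is a hypothetical helper name.

```python
def index_bytes(num_vectors, dims, bits_per_dim):
    """Raw storage for the vectors alone, in bytes."""
    return num_vectors * dims * bits_per_dim // 8

num_vectors = 10_000_000
dims = 768

float32_bytes = index_bytes(num_vectors, dims, 32)  # 4 bytes per dimension
binary_bytes = index_bytes(num_vectors, dims, 1)    # 1 bit per dimension

print(f"float32: {float32_bytes / 1e9:.2f} GB")
print(f"binary:  {binary_bytes / 1e9:.2f} GB")
print(f"ratio:   {float32_bytes // binary_bytes}x")
```

The exact results (30.72 GB and 0.96 GB) sit slightly above the rounded 30 GB and 940 MB used in this piece, because those figures round before dividing by 32. The ratio is exactly 32 either way.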
The Trick: Throw Away Magnitudes, Keep the Sign
Binary quantization is structurally simple. For every dimension of every vector, ask one question: is the value positive or negative? Positive becomes 1, negative becomes 0. The 32-bit float is replaced by a single bit. Same dimensionality, 1/32nd the storage.
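As a minimal sketch, the sign test can be written as a loop that packs one bit per dimension. The example vector is made up and has only 8 dimensions for readability. Mapping an exact zero to 0 is an assumption of this sketch; any fixed convention works as long as documents and queries use the same one.

```python
def binarize(vector):
    """Pack one sign bit per dimension into an int (bit i = dimension i)."""
    bits = 0
    for i, value in enumerate(vector):
        if value > 0:  # zero and negative values both map to 0
            bits |= 1 << i
    return bits

# Made-up 8-dimensional embedding, for illustration only.
vec = [0.12, -0.83, 0.40, -0.05, 0.00, 0.91, -0.33, 0.27]
code = binarize(vec)

# Packed form: 8 dimensions fit in a single byte.
packed = code.to_bytes((len(vec) + 7) // 8, "little")
print(format(code, "08b"), len(packed))
```

Dimension 0 is the rightmost bit in the printed string. A production index would store these codes in contiguous byte arrays rather than Python integers, but the bit layout is the same idea.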
The Distance Metric Changes Too — Cosine Becomes Hamming
Float32 vectors are compared with cosine similarity, which is computed from dot products. Binary vectors are compared with Hamming distance: count the number of bits that differ between two vectors. On modern CPUs this takes two instructions per 64-bit word, XOR then popcount, so a 768-bit vector needs only twelve such pairs, and a server can run up to billions of comparisons per second.
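The XOR-then-popcount step looks like this in a short sketch. It assumes each binary vector is held as a non-negative Python integer, and the two 8-bit codes are made-up examples.

```python
def hamming(a, b):
    """Number of differing bits: XOR marks the differences, then count the 1s."""
    return bin(a ^ b).count("1")

a = 0b10100101
b = 0b10010101

print(format(a ^ b, "08b"))  # the bits where the two codes disagree
print(hamming(a, b))
```

Native implementations do the same thing with hardware popcount instructions over 64-bit words, which is where the speed comes from.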
The Trade-off — and the Fix
Of course, throwing away the magnitudes throws away some information. A naive binary index loses roughly 5–10% of retrieval accuracy compared to the full float32 index. Production systems solve this with a two-stage search: use the cheap binary index to retrieve a wide net of candidates fast, then re-score the small candidate set using the original full-precision vectors.
Stage 1 is where the 32× memory win lives. The binary index is small enough to stay entirely in RAM on a single node (about 940 MB for ten million 768-dimension vectors), so a scan over millions of candidates stays fast. Stage 2 only ever touches a few hundred full-precision vectors, so the expensive cosine math is bounded.
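The two-stage search can be sketched end to end on a tiny made-up dataset. The 4-dimensional vectors, the `oversample` parameter, and the helper names are assumptions for illustration. A real deployment would use a vector database's built-in quantization and rescoring options rather than hand-rolled loops.

```python
import math

def binarize(vector):
    """One sign bit per dimension, packed into an int."""
    return sum(1 << i for i, v in enumerate(vector) if v > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def stage1_candidates(query, codes, n):
    """Stage 1: cheap scan of the binary codes, keep the n closest by Hamming distance."""
    q_code = binarize(query)
    return sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))[:n]

def two_stage_search(query, vectors, codes, k=2, oversample=2):
    """Stage 2: rescore the oversampled candidates with full-precision cosine."""
    candidates = stage1_candidates(query, codes, k * oversample)
    return sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]

# Made-up full-precision vectors (the "cold" tier) and their codes (the "hot" tier).
vectors = [
    [0.9, 0.1, -0.2, 0.4],
    [0.1, 0.9, -0.8, 0.05],  # same signs as vector 0, very different direction
    [-0.7, 0.3, 0.5, -0.1],
    [0.8, -0.2, -0.1, 0.5],
    [-0.5, -0.6, 0.7, -0.2],
]
codes = [binarize(v) for v in vectors]
query = [1.0, 0.05, -0.1, 0.5]

binary_only = stage1_candidates(query, codes, 2)
rescored = two_stage_search(query, vectors, codes, k=2, oversample=2)
print("binary only:", binary_only)
print("two-stage:  ", rescored)
```

Vectors 0 and 1 share the same sign pattern, so the binary stage cannot tell them apart and ranks vector 1 in the top two. Rescoring with the original floats demotes it in favour of vector 3, which is the accuracy recovery the second stage exists for.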
Memory Bill at Scale — Before and After
| Corpus size | Float32 only | Binary (stage 1) | Hybrid (stage 1 hot + stage 2 cold) |
|---|---|---|---|
| 1 M vectors | 3 GB · hot RAM | 94 MB · hot RAM | 94 MB hot · 3 GB cold |
| 10 M vectors | 30 GB · hot RAM | 940 MB · hot RAM | 940 MB hot · 30 GB cold |
| 100 M vectors | 300 GB · multi-node | 9.4 GB · single node | 9.4 GB hot · 300 GB cold |
| 1 B vectors | 3 TB · cluster | 94 GB · single beefy node | 94 GB hot · 3 TB cold tier |
The shape of the curve is what matters: the hot index — the part that controls latency — stays manageable even as the corpus grows by orders of magnitude. The cold tier scales linearly but cheaply, because it only gets touched for the few hundred candidates surfaced by stage 1.
When to Reach for This
- Above ~1 M vectors. Below that scale, plain float32 is fine. The complexity of two-stage retrieval isn't worth the few hundred MB you'd save.
- Hot real-time queries. If your retrieval p95 needs to stay under 100ms, the binary first stage is what keeps you there as the index grows.
- Cost-sensitive deployments. Saving 30 GB of RAM × 3 replicas × 12 months adds up to real money, especially on managed vector services.
- Any of the three architectures. Standard, Graph, Agentic: they all sit on a vector index somewhere. This optimisation applies everywhere they do.
Architecture and Efficiency Are Orthogonal
Two decisions, independent of each other.
What kind of question does my system have to answer? That's the architecture decision. Single-hop facts go to Standard RAG. Multi-hop relationship questions go to Graph RAG. Open-ended tool-using tasks go to Agentic RAG.
How big is my index going to get? That's the efficiency decision. Above a million chunks, binary quantization plus float rescoring buys you 32× memory headroom in the hot index for ~1% quality cost.
The same vector index sits underneath all three architectures. The same trick applies to all three. Pick the architecture for the query type. Apply the efficiency trick because the math works.
Ajay Walia · CuriousBit Knowledge Base · May 2026
About the Author
Ajay Walia
AI and IT architect focusing on local-first multi-agent AI engineering and zero-data-egress systems. Ideator, creator, and executor on Curious Bit.