RAG, Graph RAG, Agentic RAG — and How to Make Any of Them 32× Memory Efficient

A visual breakdown of three RAG architectures — when each one wins, where it breaks down, and how binary quantization can shrink the vector index by 32× without changing the architecture you picked.

By Ajay Walia · May 28, 2026 · 11 min read

Retrieval-Augmented Generation has become the default way to give a language model access to your data. But "RAG" now covers at least three meaningfully different architectures, and most engineers only know the first one well. Pick the wrong one and your assistant answers confidently with information it never actually retrieved.

This piece does two things. First, it breaks down RAG, Graph RAG, and Agentic RAG visually: how each works, where each one breaks, and which query type it fits. Second, it shows how a single technique called binary quantization can shrink the vector index inside any of these architectures by a factor of 32 without breaking retrieval quality. This is the trick Perplexity, Azure AI Search, and HubSpot use in production.

3 architectures compared · 32× memory reduction · 1-hop / N-hop: when each variant wins

Part 1A · Standard RAG

The Default Pipeline — and What It's Actually Good At

Standard RAG is what most engineers mean when they say "RAG". Documents are split into chunks, each chunk is embedded into a high-dimensional vector, and those vectors are stored in a vector database. At query time, the user's question is embedded too, and the database returns the top-k chunks by similarity (usually cosine similarity). Those chunks are pasted into the LLM's prompt as context, and the model answers from them.

[Figure: Standard RAG, a linear pipeline. User query (natural language) → Embed (float32 vector) → Vector DB (cosine similarity, top-k chunks) → LLM (prompt + context, grounded answer) → Answer to user. Retrieval happens at the Vector DB step: individual chunks, ranked by similarity.]
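To make the pipeline concrete, here is a minimal sketch of the retrieve-then-prompt step in plain Python. Everything in it is an assumption for illustration: the three-dimensional "embeddings" are hand-picked toy values, the chunk texts are made up, and the in-memory list stands in for a real vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k (chunk_text, score) pairs most similar to the query.

    `index` is a list of (chunk_text, vector) pairs, a stand-in for a vector DB.
    """
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def build_prompt(question, chunks):
    """Paste the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {text}" for text, _ in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy data (hypothetical): a real embedding model would produce hundreds of dimensions.
index = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Passwords reset via the account page.", [0.1, 0.9, 0.1]),
    ("The SLA is defined in the master agreement.", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "What is our refund policy?"
hits = top_k(query_vec, index, k=1)
prompt = build_prompt("What is our refund policy?", hits)
```

The shape is what matters here: one embedding, one ranked scan, one prompt. That is also why this variant is the cheapest to run and the easiest to debug.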

Where Standard RAG Wins

  • Direct factual lookups. A single chunk contains the answer. "What is our refund policy?" → retrieves the refund policy chunk → done.
  • Cost and latency. One embedding call, one similarity search, one LLM call. Easy to debug.
  • Mature tooling. Pinecone, Weaviate, Qdrant, Milvus, pgvector — all production-ready for this pattern.

Where Standard RAG Breaks

It retrieves chunks, never the relationships between chunks. The moment the answer requires combining facts that live in different documents — or even different sections of the same document — similarity search starts missing things.

Similarity search will happily return two facts that sit close to the query in embedding space, while the missing third fact that connects them sits far away and never makes it into the context window.

Concretely, imagine a vector database storing three facts about your internal services:

[Figure: The multi-hop problem, shown in a conceptual embedding space.]

Query: "Will checkout be affected by Friday's maintenance?"

  • Fact 1 (retrieved): "Checkout service uses the payments API."
  • Fact 2 (missed): "Payments API runs on cluster-3." It mentions neither "checkout" nor "maintenance".
  • Fact 3 (retrieved): "Cluster-3 maintenance scheduled for Friday."

The LLM gets facts 1 and 3 but can't link them. It answers "I don't know" or hallucinates a connection.

The bridge fact sits too far from the query in embedding space. Similarity search has no way to find it from where it started.
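The failure can be reproduced with a toy ranker. The sketch below swaps in word overlap (Jaccard similarity) as a crude stand-in for embedding similarity, which is an assumption made purely for illustration; real embeddings capture more than shared words, but the shape of the miss is the same.

```python
import re

def tokens(text):
    """Lowercase word set; a crude lexical stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def overlap_score(query, fact):
    """Jaccard overlap between the query's and the fact's word sets."""
    q, f = tokens(query), tokens(fact)
    return len(q & f) / len(q | f)

facts = [
    "Checkout service uses the payments API.",
    "Payments API runs on cluster-3.",
    "Cluster-3 maintenance scheduled for Friday.",
]
query = "Will checkout be affected by Friday's maintenance?"

ranked = sorted(facts, key=lambda fact: overlap_score(query, fact), reverse=True)
retrieved = ranked[:2]  # top-k with k = 2
missed = ranked[2:]     # the bridge fact never reaches the prompt
```

With k = 2, the payments-to-cluster-3 fact scores zero against the query and is the one left out, even though it is the only fact that connects the other two.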

Part 1B · Graph RAG

Adding a Knowledge Graph on Top

Graph RAG addresses the multi-hop problem by adding a structural layer over the documents. During indexing, an LLM extracts entities (services, people, places, concepts) and the relationships between them, building a knowledge graph alongside the vector index. At query time, the system traverses that graph instead of relying purely on embedding similarity.

[Figure: Graph RAG, graph traversal over linked entities. At indexing time: Documents (chunks, prose, tables) → LLM entity extractor (nodes + edges) → Knowledge graph (Neo4j, Memgraph, property graph). The resulting graph links checkout (service) to payments (API) via "uses", payments to cluster-3 (infra) via "runs_on", and cluster-3 to Friday maintenance via "affects". Traversal: checkout → uses → payments → runs_on → cluster-3 → affects → Friday maintenance.]

How a Graph RAG Query Actually Runs

The user asks "Will checkout be affected by Friday's maintenance?". The system identifies the entities mentioned in the query (checkout, Friday maintenance), looks them up as nodes in the graph, and walks the edges between them. The traversal returns the chain of relationships, and that chain gets handed to the LLM as structured context — not as random chunks of prose.
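A minimal sketch of that walk, assuming the graph is held as a plain adjacency dictionary rather than in a graph database. The node and relation names mirror the example above; the `find_path` helper is hypothetical and uses an ordinary breadth-first search.

```python
from collections import deque

# Edges as (relation, target) lists keyed by source node, mirroring the example graph.
graph = {
    "checkout service": [("uses", "payments API")],
    "payments API": [("runs_on", "cluster-3")],
    "cluster-3": [("affects", "Friday maintenance")],
    "Friday maintenance": [],
}

def find_path(graph, start, goal):
    """Breadth-first search; returns alternating nodes and relations, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]  # paths always end on a node, never on a relation
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [relation, neighbor])
    return None

path = find_path(graph, "checkout service", "Friday maintenance")
context = " -> ".join(path)  # structured context to hand to the LLM
```

The returned chain includes the payments API hop, which is exactly the bridge that a pure similarity search has no way to reach from the query.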

Multi-hop reasoning

Following uses → runs_on → affects recovers the bridge fact that pure similarity search missed.

Explainable context

Every answer comes with a traversable path. Easier to audit than "the top-5 most similar chunks said so."

Heavier to build

Entity extraction at index time is expensive, and schema design matters. This is not a free lunch.

Less flexible than agents

The graph schema is fixed at indexing time. Queries that need fresh tools or external sources still need help.

Part 1C · Agentic RAG

Letting the LLM Choose How to Retrieve

Agentic RAG replaces the fixed retrieval pipeline with an LLM agent that decides — at query time — which tools to invoke, which sources to query, and in what order. The agent might call a vector search, then a SQL database, then a web fetch, then a graph traversal, all in service of one question.

[Figure: Agentic RAG, dynamic tool orchestration. The user query goes to an LLM agent that plans, chooses, iterates, and reflects on partial results. Tools available to the agent: Vector DB (unstructured docs), SQL database (structured rows), Knowledge graph (linked entities), Web search (fresh facts), Code interpreter (compute, joins), Internal systems (Slack, Jira, ITSM). There is no fixed pipeline: the agent picks the next tool based on what the last tool returned.]

What "Dynamic" Actually Means

A user asks: "Has any customer raised a ticket about the checkout outage we had last Friday, and what was our response time on it?" An agentic system might:

  • Call the knowledge graph to confirm there was an outage on the checkout service last Friday.
  • Call SQL on the ticketing database to list tickets opened that day mentioning "checkout".
  • Call the vector DB over chat history to find related customer complaints in Slack.
  • Call the code interpreter to compute average first-response time on the matching tickets.
  • Compose the answer.

None of that ordering was decided in advance. The agent chose it. That flexibility is the whole point — and the whole risk.
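The control loop behind this is small. The sketch below is one possible shape, under loud assumptions: `choose_next` stands in for the LLM and is replaced here by a scripted planner, the tools are stubs returning canned values, and `max_steps` is an illustrative budget rather than a recommended number.

```python
def run_agent(question, tools, choose_next, max_steps=6):
    """Call tools one at a time until the planner finishes or the budget runs out.

    `choose_next(question, history)` stands in for the LLM: it returns
    (tool_name, tool_input), or ("finish", answer) when it is done.
    """
    history = []  # (tool_name, tool_input, result) tuples the planner can inspect
    for _ in range(max_steps):
        tool_name, tool_input = choose_next(question, history)
        if tool_name == "finish":
            return tool_input, history
        if tool_name not in tools:  # tool authority: only registered tools may run
            raise ValueError(f"unknown tool: {tool_name}")
        result = tools[tool_name](tool_input)
        history.append((tool_name, tool_input, result))
    return "step budget exhausted", history

# Stub tools with canned results; real ones would query a graph, a SQL database, and so on.
tools = {
    "graph": lambda q: "outage confirmed on checkout service",
    "sql": lambda q: ["T-1", "T-2"],
    "compute": lambda q: "average first response computed",
}

def scripted_planner(question, history):
    """Hypothetical planner that follows a fixed script; an LLM would decide dynamically."""
    script = [
        ("graph", "checkout outage last Friday"),
        ("sql", "tickets mentioning checkout"),
        ("compute", "average first-response time"),
    ]
    if len(history) < len(script):
        return script[len(history)]
    return ("finish", f"answered after {len(history)} tool calls")

answer, trace = run_agent("Any tickets about Friday's checkout outage?", tools, scripted_planner)
```

The step budget and the tool registry are the two guard rails: the first bounds latency and cost, the second bounds what the agent is allowed to touch.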

Flexible

Handles open-ended tasks that touch multiple data sources and require fresh information.

Higher latency

Several tool calls per question. A simple lookup that took 200 ms in standard RAG can now take 4–8 seconds.

Harder to debug

The agent's reasoning path is non-deterministic. Reproducing a failure mode can be slippery.

Can spiral

Without tight tool authority and budgets, agents loop on themselves. Pair this with a state machine.

Part 1D · Decision

These Aren't Levels — They're Different Tools

The most common mistake is treating these as a maturity ladder you have to climb. They aren't. They solve different query types. A good system often uses all three in different parts of the same product.

[Figure: Pick by query type, not by hype.]

  • Standard RAG: single-hop factual lookups. "What's our refund policy?" "How do I reset my password?" "Where is the SLA defined?" Cost: low. Latency: low.
  • Graph RAG: multi-hop relationship queries. "Who depends on cluster-3?" "What did this RCA reference?" "What blocks this release?" Cost: medium. Build: heavy.
  • Agentic RAG: multi-source, tool-using tasks. "Did anyone open a ticket about last Friday's outage, and what was our response time?" Cost: high. Debug: hardest.

Once the right architecture is in place for the query type, the next leverage point is efficiency. Every one of these three depends on a vector index somewhere underneath — and that index is where most of the memory cost lives.

Part 2 · Efficiency

How to Make Any RAG 32× More Memory Efficient

The 32× trick — float magnitudes compressed to a single sign bit per dimension.

Every RAG variant pays the same tax: it stores high-dimensional embeddings of every chunk it's ever indexed. That tax adds up fast. At ten million chunks, a standard 768-dimension float32 index needs about 30 GB just to hold the vectors — and that index has to sit in fast RAM if you want sub-second retrieval. Doubling your corpus doubles the bill.

The trick that Perplexity, Azure AI Search, and HubSpot all use in production is called binary quantization. It cuts the memory footprint by 32 times. The architecture above it doesn't change — Standard, Graph, or Agentic, the same trick applies.

The Memory Bill, in Numbers

[Figure: A single embedding, and what you're actually paying for. Vector: 0.42, -0.18, 0.93, -0.05, plus 764 more dimensions, at 32 bits each. One vector = 768 dimensions × 32 bits = 3,072 bytes. 10 million vectors = ~30 GB, all in hot RAM for real-time retrieval. Most of those 32 bits per dimension encode magnitudes you never actually use for ranking.]
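The arithmetic is easy to check. This small sketch counts raw vector storage only; the `index_bytes` helper is hypothetical and ignores index overhead, metadata, and replication, all of which add to the real bill.

```python
def index_bytes(num_vectors, dims, bits_per_dim):
    """Raw vector storage in bytes, ignoring index overhead and metadata."""
    return num_vectors * dims * bits_per_dim // 8

float32_vector = index_bytes(1, 768, 32)          # one 768-dim float32 vector
float32_index = index_bytes(10_000_000, 768, 32)  # ten million of them
print(f"{float32_vector} bytes per vector, {float32_index / 1e9:.2f} GB for 10M vectors")
# prints: 3072 bytes per vector, 30.72 GB for 10M vectors
```

The exact figure is 30.72 GB, which the text rounds to about 30 GB.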

The Trick: Throw Away Magnitudes, Keep the Sign

Binary quantization is structurally simple. For every dimension of every vector, ask one question: is the value positive or negative? Positive becomes 1, negative becomes 0. The 32-bit float is replaced by a single bit. Same dimensionality, 1/32nd the storage.

[Figure: Float32 → binary, one bit per dimension.]
FLOAT32: +0.42 −0.18 +0.93 −0.05 +0.21 −0.67 +0.11 +0.88
sign(x): if x > 0 then 1 else 0
BINARY:  1 0 1 0 1 0 1 1
8 floats (256 bits) → 8 bits. Applied to 768 dims: 3,072 bytes → 96 bytes. Exactly 32× reduction, per vector, across the whole index.
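In code, the quantization step is a few lines. This is a minimal sketch that packs the sign bits into a Python integer, with the first dimension in the most significant bit; that bit order is an arbitrary choice made here, and production systems typically pack into byte arrays instead.

```python
def binarize(vector):
    """Pack the signs of a float vector into one int: bit is 1 where x > 0, else 0.

    The first dimension lands in the most significant bit.
    """
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

floats = [0.42, -0.18, 0.93, -0.05, 0.21, -0.67, 0.11, 0.88]
code = binarize(floats)
print(format(code, "08b"))  # 10101011
```

For a 768-dimension vector, `binarize(v).to_bytes(96, "big")` yields the 96-byte form that would actually be stored.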

The Distance Metric Changes Too — Cosine Becomes Hamming

Float32 vectors compare via cosine similarity, which is computed from dot products. Binary vectors compare via Hamming distance: count the number of bits that differ between two vectors. On modern CPUs this takes two instructions per 64-bit word, XOR then popcount, so a 768-bit vector needs only 12 of each. That is why a brute-force scan over millions of binary vectors stays fast.

[Figure: Distance metric, before and after.]
Float32, cosine similarity: cos(θ) = (A · B) / (||A|| × ||B||)
Binary, Hamming distance:
A:   10101011
B:   11111011
XOR: 01010000
popcount(XOR) = 2 bits different = distance
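The same comparison in Python, as a sketch that assumes each binary vector is held as an integer. `bin(...).count("1")` plays the role of the hardware popcount instruction here; it is far slower than the real thing but computes the same value.

```python
def hamming_distance(a, b):
    """Number of differing bits between two bit-packed vectors (stored as ints)."""
    return bin(a ^ b).count("1")  # XOR marks the differing bits, then count the 1s

a = 0b10101011
b = 0b11111011
print(hamming_distance(a, b))  # 2
```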

The Trade-off — and the Fix

Of course, throwing away the magnitudes throws away some information. A naive binary index loses roughly 5–10% of retrieval accuracy compared to the full float32 index. Production systems solve this with a two-stage search: use the cheap binary index to retrieve a wide net of candidates fast, then re-score the small candidate set using the original full-precision vectors.

[Figure: Two-stage retrieval, speed first, precision second. Query (embedded) → Stage 1, binary index: 10M vectors at 96 bytes each (~0.96 GB), XOR + popcount, returns the top 500 candidates; kept in hot RAM → Stage 2, float rescore: 500 vectors, cosine at full precision, returns the top 10 results; the full vectors live in a cold tier (SSD or compressed RAM) and are read only for the 500 candidates → LLM → answer.]

Stage 1 is where the 32× memory win lives. At 96 bytes per vector, the binary index for 10 million chunks is under 1 GB: too large for CPU cache, but small enough to stay resident in RAM on a single node, so an XOR-and-popcount scan over millions of candidates takes milliseconds rather than seconds. Stage 2 only ever touches a few hundred full-precision vectors, so the expensive cosine math is bounded.

The recall lost in stage 1 is paid back in stage 2. End-to-end retrieval quality is typically within 1% of a full float32 search, at 1/32 the hot memory.
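Here is a minimal end-to-end sketch of the two stages, assuming random toy vectors, a brute-force Hamming scan, and small illustrative values for `num_candidates` and `k` (far smaller than the 500 and 10 a production system might use). The function names are hypothetical, not any library's API.

```python
import heapq
import math
import random

def binarize(vector):
    """Sign bits packed into an int, first dimension in the most significant bit."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Count of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def cosine(a, b):
    """Full-precision cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query, float_vectors, binary_codes, num_candidates=50, k=5):
    """Stage 1: Hamming scan over binary codes. Stage 2: cosine rescore of the survivors."""
    query_code = binarize(query)
    candidates = heapq.nsmallest(
        num_candidates,
        range(len(binary_codes)),
        key=lambda i: hamming(query_code, binary_codes[i]),
    )
    # Only the candidates' full-precision vectors are read in this stage.
    rescored = sorted(candidates, key=lambda i: cosine(query, float_vectors[i]), reverse=True)
    return rescored[:k]

rng = random.Random(0)
float_vectors = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(1000)]  # "cold" tier
binary_codes = [binarize(v) for v in float_vectors]                              # "hot" index
query = float_vectors[42]  # a query identical to one stored vector
results = two_stage_search(query, float_vectors, binary_codes)
```

The ratio between `num_candidates` and `k` is the main tuning knob: a wider stage-1 net recovers more of the recall lost to quantization, at the cost of more full-precision reads in stage 2.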

Memory Bill at Scale — Before and After

| Corpus size | Float32 only | Binary (stage 1) | Hybrid (stage 1 hot + stage 2 cold) |
| --- | --- | --- | --- |
| 1 M vectors | ~3 GB · hot RAM | ~96 MB · hot RAM | ~96 MB hot · ~3 GB cold |
| 10 M vectors | ~30 GB · hot RAM | ~960 MB · hot RAM | ~960 MB hot · ~30 GB cold |
| 100 M vectors | ~300 GB · multi-node | ~9.6 GB · single node | ~9.6 GB hot · ~300 GB cold |
| 1 B vectors | ~3 TB · cluster | ~96 GB · single beefy node | ~96 GB hot · ~3 TB cold tier |

The shape of the curve is what matters: the hot index — the part that controls latency — stays manageable even as the corpus grows by orders of magnitude. The cold tier scales linearly but cheaply, because it only gets touched for the few hundred candidates surfaced by stage 1.

When to Reach for This

Above ~1 M vectors

Below that scale, plain float32 is fine. The complexity of two-stage retrieval isn't worth the few gigabytes at most you'd save.

Hot real-time queries

If your retrieval p95 needs to stay under 100ms, the binary first stage is what keeps you there as the index grows.

Cost-sensitive deployments

Saving 30 GB of RAM × 3 replicas × 12 months adds up to real money. Especially on managed vector services.

Any of the three architectures

Standard, Graph, Agentic — they all sit on a vector index somewhere. This optimisation applies everywhere they do.

Putting It Together

Architecture and Efficiency Are Orthogonal

Two decisions, independent of each other. What kind of question does my system have to answer? — that's the architecture decision. Single-hop facts go to Standard RAG. Multi-hop relationship questions go to Graph RAG. Open-ended tool-using tasks go to Agentic RAG. How big is my index going to get? — that's the efficiency decision. Above a million chunks, binary quantization plus float rescoring buys you 32× memory headroom for ~1% quality cost.

The same vector index sits underneath all three architectures. The same trick applies to all three. Pick the architecture for the query type. Apply the efficiency trick because the math works.

RAG isn't one thing — it's a layered decision. Get the architecture right for the query, then make the index small enough to keep up.

Ajay Walia · CuriousBit Knowledge Base · May 2026

About the Author

Ajay Walia

AI {IT Architect} focusing on local-first multi-agent AI engineering, zero-data-egress systems. Ideator, Creator and Executor on Curious Bit.
