Same answers.
One-fifth the tokens.

Most RAG pipelines pay the LLM to read noise. This page replays a real benchmark showing what happens when a Dell PowerEdge R470 reranks the retrieval — wider search, sharper selection, and an 80% smaller context bill with better recall.

77.3%
answers found · 10 reranked chunks
vs 74.0% with 50 dense chunks
2,560
context tokens per query
vs 12,800 in standard RAG
80%
lower context cost
on every query, every model
00 First, the basics

How an AI finds answers in your documents.

If you already live in vector databases, skip ahead. If not — here is the sixty-second version of what happens under the hood, so the rest of this page makes sense.

1

Documents are cut into chunks

Manuals, wikis, tickets — everything gets split into bite-size passages of about 256 tokens (roughly a paragraph). The chunk is the unit of search: small enough to point at a single idea.

2

Every chunk gets coordinates

An embedding model turns each chunk into a point on a map of meaning. Chunks about the same topic land near each other — millions of them, all searchable in milliseconds.

3

A question retrieves its neighbors

Your question lands on the same map, and the system grabs the K closest chunks — the "top-K." Close in meaning usually means relevant. Usually.

4

The LLM reads what you send — per token

Retrieved chunks get pasted into the prompt as context. The LLM bills every token it reads — each chunk costs the same whether it held the answer or was just nearby noise.

The whole game is choosing which chunks to send. Too few, and you miss the answer. Too many, and you pay to ship noise. Everything below is about beating that trade-off.
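The four steps above can be sketched in a few lines of Python. This is a toy illustration, not the benchmark's pipeline: whitespace-separated words stand in for tokens, a bag-of-words count vector stands in for a learned embedding model, and the function names (`chunk`, `embed`, `top_k`, `context_tokens`) are made up for this example.

```python
import math
import re
from collections import Counter

def chunk(text, max_tokens=256):
    """Split text into passages of at most max_tokens words (words approximate tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(question, chunks, k):
    """Return the k chunks closest in meaning to the question (the "top-K")."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def context_tokens(k, tokens_per_chunk=256):
    """Input tokens billed when k chunks are pasted into the prompt."""
    return k * tokens_per_chunk
```

With these definitions, `context_tokens(50)` gives 12,800 and `context_tokens(10)` gives 2,560: the bill depends only on how many chunks are sent, not on whether any of them held the answer.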
01 The problem

Standard RAG ships 50 chunks to the LLM.
Usually one matters.

Every query retrieves the top 50 chunks by similarity and sends all of them as context — relevant or not. At 256 tokens per chunk, that is 12,800 input tokens per query, paid on every single request.

In the benchmark, the chunk that actually contains the answer — the ★ gold chunk — is just one square in this wall. The other 49 are noise the LLM is paid to read past. And even paying for all 50, dense retrieval still misses the answer 26% of the time.

One query · top-50 retrieval: 12,800 tokens shipped, one ★ chunk that held the answer.
Every square = one 256-token chunk billed to your LLM
02 The fix

Search wider. Send less.

One Dell PowerEdge R470 drops in between your vector store and your LLM. It doesn't replace anything — it makes the retrieval smarter before a single token is billed.

1

Cast a wider net

Retrieve 100 candidates instead of 50. Doubling the pool raises the odds the answer is in the building: the recall ceiling jumps from 74% to 82%.

2

Rerank on the R470

A cross-encoder on Intel® Xeon 6 with AMX reads every candidate against the question and re-scores all 100 in milliseconds. No GPU. Pure CPU.

3

Send only the best 10

Only the top 10 reranked chunks go to the LLM — 2,560 tokens instead of 12,800. The answer rides in front, not buried at rank 40.
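The three steps above amount to "score every candidate against the question, sort, keep the best N." Here is a minimal sketch of that selection layer, assuming a pluggable scoring function. The `overlap_score` below is a deliberately crude word-overlap stand-in, not a cross-encoder, and both function names are hypothetical; in practice the score would come from the reranking model.

```python
from typing import Callable, List, Sequence

def rerank_top_n(
    question: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    n: int = 10,
) -> List[str]:
    """Re-score every candidate against the question and keep the n best.

    Ties keep their original retrieval order.
    """
    scored = [(score(question, c), i, c) for i, c in enumerate(candidates)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest score first, then original rank
    return [c for _, _, c in scored[:n]]

def overlap_score(question: str, chunk: str) -> float:
    """Stand-in scorer: fraction of question words that appear in the chunk."""
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

# 100 candidates from the wide retrieval; the useful one sits at rank 40.
candidates = [f"unrelated passage {i}" for i in range(100)]
candidates[39] = "hold the button to reset the router"

kept = rerank_top_n("reset the router", candidates, overlap_score, n=10)
```

After the call, `kept` holds 10 chunks and the passage that was at rank 40 is now first. Keeping the scorer as a parameter means the selection logic stays the same whichever model produces the scores.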

03 The proof

10 reranked chunks beat 50 dense chunks.

Measured on 300 queries with known answers. The curve shows the share of queries where the gold chunk made it into the context, at every cutoff K.
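For readers who want to reproduce this kind of curve on their own data, the metric can be computed from the rank at which each query's gold chunk appears. The sketch below assumes 1-based ranks and uses `None` for queries where the gold chunk was not retrieved at all; the function name and the sample ranks are illustrative, not taken from the benchmark.

```python
from typing import Optional, Sequence

def answer_found_rate(gold_ranks: Sequence[Optional[int]], k: int) -> float:
    """Share of queries whose gold chunk is ranked within the top k.

    gold_ranks holds one entry per query: the 1-based rank of the gold
    chunk, or None if it was not among the retrieved candidates.
    """
    if not gold_ranks:
        return 0.0
    hits = sum(1 for r in gold_ranks if r is not None and r <= k)
    return hits / len(gold_ranks)

# Made-up ranks for five queries, for illustration only.
sample_ranks = [1, 3, 12, None, 40]
rate_at_10 = answer_found_rate(sample_ranks, 10)
rate_at_50 = answer_found_rate(sample_ranks, 50)
```

Evaluating the function at each cutoff K, once for the dense ranks and once for the reranked ranks, yields the two curves being compared.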

Figure: answer-found rate vs. chunks sent (K), Reranked (R470) against Dense only.
At K = 10 chunks per query, tokens billed per query: 2,560.
04 Watch it happen

The benchmark, replayed query by query.

Real questions, real retrieval ranks, real rerank scores — replayed from the benchmark log. Both pipelines get the same question at the same moment. Follow the ★ gold chunk — the one that holds the answer.

05 What it saves

The same 80% — at your volume.

The per-query delta is fixed by the benchmark: 12,800 tokens down to 2,560. What it's worth depends on how many queries you run and which model reads the context.

Interactive calculator: set queries per day (10K to 1M, default 300,000) and pick a model to see dollars saved per year on input tokens, Standard vs. Intelligent.
Computed from the observed recall and your model pricing.
How these numbers are computed

Standard RAG. Dense retrieval at K=50 sends all 50 chunks to the LLM. At 256 tokens per chunk that is 12,800 input tokens per query — measured answer-found rate 74.0%.

Intelligent RAG. Dense retrieval at K=100 casts a wider net, the R470 reranks all 100, and only the top 10 go to the LLM — 2,560 input tokens per query, measured answer-found rate 77.3%.

The calculator multiplies the per-query token delta (10,240 tokens) by your daily volume, 365 days, and the selected model's input-token price. Output tokens, identical in both pipelines, are excluded.
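The multiplication described above is simple enough to check by hand. The sketch below is one way to write it, not the calculator's actual code; the function name is hypothetical, and the price is a placeholder to be replaced with your model's real input-token rate.

```python
TOKENS_PER_CHUNK = 256

def annual_input_savings(
    queries_per_day: int,
    price_per_million_tokens: float,
    standard_chunks: int = 50,
    reranked_chunks: int = 10,
    days: int = 365,
) -> float:
    """Yearly input-token spend avoided by sending fewer chunks per query."""
    # Per-query delta: (50 - 10) chunks * 256 tokens = 10,240 tokens.
    delta_tokens = (standard_chunks - reranked_chunks) * TOKENS_PER_CHUNK
    return delta_tokens * queries_per_day * days * price_per_million_tokens / 1_000_000

# Hypothetical price of $1.00 per million input tokens, for illustration only.
savings = annual_input_savings(queries_per_day=300_000, price_per_million_tokens=1.00)
```

At that placeholder price and 300,000 queries per day, the result is $1,121,280 per year. Because the formula is linear, the figure scales directly with your actual price and volume.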

06 The hardware

One server turns wide search into narrow spend.

The Dell PowerEdge R470 sits in exactly one place: between your vector store and your LLM. One hundred candidates flow in. Ten chunks flow out. Ninety never get billed.

Where the R470 fits in the retrieval flow: your app asks a question, and the vector store returns the top 100 candidate chunks by similarity to the Dell PowerEdge R470 reranker. The cross-encoder (Intel® Xeon 6 + AMX, milliseconds, no GPU) forwards only the top 10 chunks, 2,560 tokens, to any LLM, cloud or on-prem; the other 90 are never billed.

100 → 10 chunks per query
milliseconds per rerank pass
0 GPUs, pure Xeon 6 + AMX
−80% LLM input spend

Intel Xeon 6 with Intel® AMX

Advanced Matrix Extensions run the cross-encoder directly on CPU at production throughput. No GPU required for the selection layer.

Drop-in, vendor-neutral

Sits between any vector store and any LLM backend — no pipeline rewrite, no model migration. It only touches the chunk list.

Built for fleet throughput

High core counts and memory bandwidth keep rerank latency low at thousands of chunks per second — one server covers the whole RAG fleet.

A data-governance boundary

Retrieve and rerank against your corpus locally; only the trimmed 10-chunk context ever leaves. The other 90 chunks never cross the wire.

Powered by Dell Technologies and Intel Xeon

Spend tokens where they matter.

Turn "ship all 50 chunks" into "ship the 10 that earn their place" — and let the savings compound every day.