Same answers.
One-fifth the tokens.
Most RAG pipelines pay the LLM to read noise. This page shows what happens when a Dell PowerEdge R470 reranks the retrieved chunks: wider search, sharper selection, and an 80% smaller context bill with better recall.
New to RAG? Start here
A 60-second primer on how an AI finds answers in your documents.
Here is the short version of what happens under the hood in RAG (retrieval-augmented generation), so the rest of this page makes sense.
Documents are cut into chunks
Manuals, wikis, and tickets all get split into bite-size passages of about 256 tokens (roughly a paragraph). The chunk is the unit of search: small enough to point at a single idea.
Every chunk gets coordinates
An embedding model turns each chunk into a point on a map of meaning. Chunks about the same topic land near each other, all searchable in milliseconds.
A question retrieves its neighbors
Your question lands on the same map, and the system grabs the K closest chunks, the "top-K." Close in meaning usually means relevant, but sometimes it only means the chunk uses similar words.
The LLM reads what you send, and bills per token
Retrieved chunks get pasted into the prompt as context. The LLM bills every token it reads. Each chunk costs the same whether it holds the answer or is just nearby noise.
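For readers who think in code, the four steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: whitespace-separated words stand in for real tokenizer tokens, and the bag-of-words `embed` function is a hypothetical placeholder for a real embedding model. None of these names come from the benchmark itself.

```python
import math
from collections import Counter

CHUNK_TOKENS = 256  # chunk size quoted in the primer


def chunk(text, size=CHUNK_TOKENS):
    """Split text into passages of at most `size` whitespace-separated words.

    Assumption: words approximate tokens; a real pipeline would use the
    model's own tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k):
    """Return the k chunks closest to the question on the toy 'map'."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def context_tokens(chunks):
    """Approximate input tokens billed for a list of chunks (word count)."""
    return sum(len(c.split()) for c in chunks)
```

Calling `top_k(question, chunks, 50)` and passing the result to `context_tokens` mirrors the standard pipeline: every retrieved chunk is counted, whether or not it holds the answer.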
Standard RAG ships 50 chunks to the LLM. Usually only one matters.
Every query ships all 50 retrieved chunks, relevant or not. At 256 tokens each, you pay for the whole wall on every request.
The one ★ gold chunk that holds the answer is a single square in it. And even paying for all 50, dense retrieval still misses the answer more than 25% of the time.
Search wider. Send less.
One Dell PowerEdge R470 drops in between your vector store and your LLM, making retrieval smarter before a single token is billed.
Cast a wider net
Retrieve 100 candidates instead of 50. A bigger pool lifts the recall ceiling from 74% to 82%.
Rerank on the R470
A cross-encoder on Intel® Xeon 6 with AMX re-scores all 100 against the question in milliseconds, purely on CPU.
Send only the best 10
Only the top 10 go to the LLM: 2,560 tokens instead of 12,800, answer in front.
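The three steps above amount to "score every candidate against the question, sort, keep the best few." Below is a minimal sketch of that shape, assuming the wide candidate list has already been retrieved. The `overlap_score` function is a hypothetical placeholder so the example runs on its own; it is not a cross-encoder, and a production setup would swap in a real model's scoring call.

```python
from typing import Callable, List, Tuple

CHUNK_TOKENS = 256  # tokens per chunk, from the page
WIDE_K = 100        # candidates retrieved before reranking
KEEP_N = 10         # chunks sent to the LLM after reranking


def rerank(
    question: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    keep: int = KEEP_N,
) -> List[Tuple[float, str]]:
    """Score every candidate against the question and keep the best `keep`."""
    scored = [(score(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:keep]


def overlap_score(question: str, chunk: str) -> float:
    """Placeholder scorer (assumption): share of question words in the chunk."""
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


# Context size after reranking versus the standard 50-chunk pipeline.
reranked_tokens = KEEP_N * CHUNK_TOKENS  # 2,560
standard_tokens = 50 * CHUNK_TOKENS      # 12,800
```

Because `rerank` only takes and returns a list of chunks, it can sit between any retriever and any LLM call without changing either side.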
10 reranked chunks beat 50 dense chunks.
Measured on 300 queries with known answers. The curve shows the share of queries where the gold chunk made it into the context, at every cutoff K.
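The metric behind the curve can be written down directly. The sketch below assumes each query is logged with the 1-based rank at which its gold chunk was retrieved, or `None` if it was never retrieved; that log format and the sample ranks are made up for illustration and are not the benchmark data.

```python
def recall_at_k(gold_ranks, k):
    """Share of queries whose gold chunk appears within the top k results.

    gold_ranks: one entry per query, the 1-based rank of the gold chunk,
    or None when the gold chunk was not retrieved at all.
    """
    if not gold_ranks:
        raise ValueError("need at least one query")
    hits = sum(1 for rank in gold_ranks if rank is not None and rank <= k)
    return hits / len(gold_ranks)


# Illustrative ranks for five hypothetical queries (not benchmark results).
sample_ranks = [1, 3, 12, None, 60]
curve = {k: recall_at_k(sample_ranks, k) for k in (10, 50, 100)}
```

Evaluating `recall_at_k` at every cutoff K for each pipeline's ranking produces one curve per pipeline, which is how a 10-chunk reranked context can be compared with a 50-chunk dense one.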
The same lift, at your volume.
How these numbers are computed
Standard RAG. Dense retrieval at K=50 sends all 50 chunks to the LLM. At 256 tokens per chunk that is 12,800 input tokens per query, with a measured answer-found rate of 74.0%.
Intelligent RAG. Dense retrieval at K=100 casts a wider net, the R470 reranks all 100, and only the top 10 go to the LLM: 2,560 input tokens per query, with a measured answer-found rate of 77.3%.
The calculator multiplies the per-query token delta (10,240 tokens) by your daily volume, 365 days, and the selected model's input-token price. Output tokens, identical in both pipelines, are excluded.
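The calculator's arithmetic is simple enough to reproduce by hand. The sketch below assumes the model's input price is quoted per million tokens; the function name, the example volume, and the example price are hypothetical placeholders, not figures from this page.

```python
CHUNK_TOKENS = 256
STANDARD_TOKENS = 50 * CHUNK_TOKENS   # 12,800 input tokens per query
RERANKED_TOKENS = 10 * CHUNK_TOKENS   # 2,560 input tokens per query
TOKEN_DELTA = STANDARD_TOKENS - RERANKED_TOKENS  # 10,240 tokens saved per query


def annual_input_savings(queries_per_day, price_per_million_tokens):
    """Yearly input-token savings in the price's currency.

    Output tokens are excluded because they are identical in both pipelines.
    """
    tokens_saved_per_year = TOKEN_DELTA * queries_per_day * 365
    return tokens_saved_per_year * price_per_million_tokens / 1_000_000


# Hypothetical example: 10,000 queries a day at 1.00 per million input tokens.
example = annual_input_savings(10_000, 1.00)
```

The savings scale linearly in both volume and price, so doubling either one doubles the result.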
The benchmark, replayed query by query.
Real questions, real retrieval ranks, real rerank scores, replayed from the benchmark log. Both pipelines get the same question at the same moment. Follow the ★ gold chunk, the one that holds the answer.
One server turns wide search into narrow spend.
The Dell PowerEdge R470 sits between your vector store and your LLM and cuts the context sent per query from 50 chunks to 10.
Intel Xeon 6 with Intel® AMX
Advanced Matrix Extensions run the cross-encoder directly on CPU at production throughput.
Drop-in, vendor-neutral
Sits between any vector store and any LLM backend, with no pipeline rewrite and no model migration. It only touches the chunk list.
Built for fleet throughput
High core counts and memory bandwidth keep rerank latency low, providing a fast user experience as your document collection grows.
A data-governance boundary
Retrieve and rerank against your corpus locally, sending a smaller subset of data outside your walls.
Spend tokens where they matter.
Turn "ship everything" into "ship what matters," and let the savings compound every day.

