I thought we'd solved the hallucination problem with RAG.
Turns out we just created a new set of
challenges. 🔍
The Promise That Didn't Quite Deliver
Two years ago, Retrieval Augmented Generation
felt revolutionary. Finally, we could ground large language models in real
data. No more making things up. No more confidently wrong answers. Just
retrieve the relevant documents, pass them to the LLM, and let it synthesize an
accurate response.
Except it didn't work nearly as well as hoped.
I remember deploying our first RAG system. We
fed it thousands of documentation pages, set up a vector database, and watched
it... give decent but frustratingly incomplete answers. Sometimes it missed
obvious relevant information. Other times it retrieved the right documents but
failed to synthesize them properly. Occasionally it still hallucinated despite
having the correct information right in front of it. 🤦
The gap between RAG's promise and its reality
became impossible to ignore.
Understanding Where Basic RAG Falls Short
Let's be honest about what basic RAG actually
does. You split documents into chunks. You embed those chunks into vectors.
When a user asks a question, you embed their query, find the most similar
chunks, and stuff them into the LLM's context window.
Simple. Elegant. Limited.
The problems show up immediately in production. Your chunking strategy matters enormously, but there's no universal right answer. Chunk too small and you lose context. Chunk too large and you dilute semantic meaning. A 512-token chunk might work well for technical documentation but poorly for narrative content.
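To make that trade-off concrete, here's a minimal sketch of a fixed-size chunker with overlap. It splits on whitespace as a stand-in for a real tokenizer, and chunk_size and overlap are exactly the knobs you end up tuning per domain.
python
def chunk_text(text, chunk_size=512, overlap=64):
    # Whitespace split stands in for a real tokenizer in this sketch.
    # chunk_size must be larger than overlap for the window to advance.
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks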
Vector similarity doesn't capture everything
that makes information relevant. Two passages can be semantically similar
without one actually answering the question. Or the most relevant information
might be split across multiple chunks that individually don't seem related to
the query.
The LLM sees retrieved chunks in isolation. It
doesn't know where each chunk came from, how reliable the source is, or how the
chunks relate to each other. You're asking it to synthesize information without
providing crucial metadata and structural context. 📚
The Evolution to RAG Plus
RAG Plus isn't a single technology. It's a
collection of enhancements that address basic RAG's limitations. Think of it as
RAG that actually works in production.
The shift involves multiple layers of
improvement. Better retrieval strategies that go beyond simple vector
similarity. Smarter context management that preserves document structure and
relationships. Enhanced reasoning that lets the system decide what information
it actually needs. Real feedback loops that improve performance over time.
I've spent the last six months implementing
these enhancements across different projects. The results speak for themselves.
Answer accuracy improved by 40 to 60 percent. User satisfaction jumped. Support
tickets about wrong or incomplete answers dropped dramatically. 📈
Let me walk you through what actually works.
Hybrid Search: The Foundation
Pure vector search misses too much. You need to
combine it with traditional keyword search and metadata filtering.
The key insight here is that semantic
similarity and lexical matching capture different aspects of relevance. A
document might use completely different words but express the same concept
(where vector search shines). Or it might use the exact technical terms that
matter for a precise answer (where keyword search excels).
Here's the core logic for blending both
approaches:
python
def hybrid_search(query, top_k=10, alpha=0.5):
    # Get vector search results
    vector_results = vector_search(query, top_k * 2)
    # Get keyword search results
    keyword_results = keyword_search(query, top_k * 2)
    # Blend scores: alpha * vector + (1 - alpha) * keyword
    blended = blend_scores(vector_results, keyword_results, alpha)
    return sorted(blended, key=lambda x: x['score'], reverse=True)[:top_k]
The alpha parameter lets you tune the balance.
For technical queries where exact terms matter, lean toward keyword search. For
conceptual questions, favor vector similarity. I typically start with alpha at
0.5 and adjust based on domain testing. 🔧
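The blend_scores helper above is left undefined. Here's one possible sketch, assuming each retriever returns dicts with 'id' and 'score' fields and that both result lists are non-empty: min-max normalize each list so the two score scales are comparable, then take the weighted sum.
python
def blend_scores(vector_results, keyword_results, alpha):
    # Min-max normalize each result set to [0, 1] so the scales are comparable.
    def normalize(results):
        scores = [r['score'] for r in results]
        lo, hi = min(scores), max(scores)
        spread = (hi - lo) or 1.0
        return {r['id']: (r['score'] - lo) / spread for r in results}

    vec, kw = normalize(vector_results), normalize(keyword_results)

    # Documents found by only one retriever get 0 from the other.
    return [
        {'id': doc_id, 'score': alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)}
        for doc_id in set(vec) | set(kw)
    ]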
Query Expansion and Rewriting
Users don't always ask questions clearly. They
use ambiguous terms, provide incomplete context, or frame questions in ways
that don't match how your documents describe the answer.
Query expansion helps by generating multiple
variations of the user's question before retrieval. Instead of searching once
with the original query, you search with 3 to 5 variations that might match
different document styles.
python
def expand_query(original_query):
    # Use LLM to generate alternative phrasings
    prompt = f"Rephrase this question 3 different ways: {original_query}"
    alternatives = llm_generate(prompt)
    # Return original plus alternatives
    return [original_query] + parse_alternatives(alternatives)
This approach catches relevant documents that
might use different terminology than the user. When someone asks "How do I
reset my password?" your system also searches for "change
password," "forgot password," and "account recovery." 🔑
Contextual Chunk Enhancement
Basic RAG loses critical context by treating
chunks as isolated units. RAG Plus preserves document structure and
relationships.
Think about how you read documentation. You
don't just process random paragraphs in isolation. You understand which section
you're in, what came before, what comes after, and how it all fits into the
larger document structure.
Your RAG system should do the same:
python
class EnhancedChunk:
    def __init__(self, content, metadata):
        self.content = content
        self.document_title = metadata['title']
        # e.g. ["Chapter 2", "Authentication", "Password Reset"]
        self.section_path = metadata['sections']
        self.prev_context = metadata['prev_chunk']
        self.next_context = metadata['next_chunk']
When you pass enhanced chunks to your LLM,
include the structural context. Now your LLM sees where each chunk came from
and how it fits into the larger document structure. This dramatically improves
synthesis quality. 📖
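One way to render that structural context into the prompt, sketched with the EnhancedChunk fields from above:
python
def format_chunk_for_prompt(chunk):
    # Prepend the document title and section path so the model knows where this text lives.
    breadcrumb = " > ".join(chunk.section_path)
    return (
        f"[Document: {chunk.document_title} | Section: {breadcrumb}]\n"
        f"(preceding context: {chunk.prev_context})\n"
        f"{chunk.content}\n"
        f"(following context: {chunk.next_context})"
    )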
Agentic Retrieval: Let the AI Decide
The biggest leap in RAG Plus is giving the
system agency over its retrieval process. Instead of retrieving once and hoping
you got the right documents, let the LLM decide what information it needs and
when.
Traditional RAG follows a rigid pattern: query
arrives, retrieve documents, generate answer, done. Agentic RAG is dynamic:
query arrives, LLM thinks about what information it needs, retrieves targeted
information, evaluates if it has enough, retrieves more if needed, then
generates answer.
The implementation uses function calling to
give the LLM retrieval tools:
python
def agentic_rag(question, max_iterations=5):
    conversation = [{"role": "user", "content": question}]
    for _ in range(max_iterations):
        response = llm_with_tools(conversation)
        if response.wants_to_retrieve:
            # LLM decided it needs more info
            docs = retrieve(response.search_query)
            conversation.append(format_retrieval_results(docs))
        elif response.has_final_answer:
            return response.answer
    return "Need more iterations to answer completely"
The beauty of agentic retrieval is
adaptability. For simple questions, it retrieves once and answers. For complex
questions, it breaks the problem down, retrieves multiple times with different
queries, and synthesizes everything together.
I've seen this approach solve problems that
stumped basic RAG completely. A user asks "Compare our Q3 performance to
industry benchmarks." The agent retrieves company financials, then
separately retrieves industry data, then retrieves previous quarters for
context, then synthesizes a comprehensive comparison. 🎯
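The loop above assumes llm_with_tools exposes a retrieval tool the model can call. With an OpenAI-style function-calling API, the tool definition might look like this; the exact schema depends on your provider, so treat it as a sketch:
python
retrieve_tool = {
    "type": "function",
    "function": {
        "name": "retrieve",
        "description": "Search the knowledge base for passages relevant to a query.",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {
                    "type": "string",
                    "description": "A focused query for one piece of missing information.",
                }
            },
            "required": ["search_query"],
        },
    },
}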
Re-ranking for Precision
Retrieval gets you candidates. Re-ranking identifies the truly relevant ones.
Your initial retrieval should cast a wide net, pulling in maybe 20 to 30 candidate documents. Then a more sophisticated model re-evaluates each candidate specifically for the user's query.
python
def rerank_documents(query, documents, top_k=5):
    # Use cross-encoder to score query-document pairs
    scores = cross_encoder.predict([(query, doc) for doc in documents])
    # Sort by relevance score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
Re-ranking catches nuances that vector similarity misses. A document might contain your exact keywords but not actually answer the question. Or it might be semantically similar but from the wrong time period or context. Cross-encoders evaluate query-document relevance holistically. 🎓
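The rerank_documents function assumes a cross_encoder object already exists. With the sentence-transformers library, a pre-trained MS MARCO cross-encoder is a reasonable starting point (a sketch; swap in a domain-specific model if you have one):
python
from sentence_transformers import CrossEncoder

# Pre-trained MS MARCO cross-encoder from the sentence-transformers library.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Quick sanity check: higher score = more relevant to the query.
query = "How do I reset my password?"
candidates = [
    "Go to Settings > Account > Reset Password and follow the prompts.",
    "Our Q3 revenue grew 12% year over year.",
]
print(cross_encoder.predict([(query, doc) for doc in candidates]))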
Response Synthesis with Citations
Basic RAG often fails at synthesis. It either
copies chunks verbatim or ignores them and hallucinates. RAG Plus explicitly
instructs the model on how to synthesize and cite sources.
The key is being extremely explicit in your
prompt about how to use the retrieved information:
python
prompt = f"""Answer using ONLY the provided sources.
Cite each claim with [Source N].
If sources conflict, mention both perspectives.
If information is missing, state what's missing.

Question: {query}

Sources:
{formatted_documents}
"""
Explicit citation requirements force the model
to ground its responses. Users can verify claims by checking sources. You can
audit whether the system is actually using retrieved information or making
things up. 📝
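For that prompt to work, formatted_documents needs to number each passage so the [Source N] citations map back to something checkable. A sketch, assuming each retrieved document is a dict with 'title' and 'content' fields:
python
def format_sources(documents):
    # Number the sources so the model's [Source N] citations map back to passages.
    return "\n\n".join(
        f"[Source {i}] ({doc['title']})\n{doc['content']}"
        for i, doc in enumerate(documents, start=1)
    )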
Feedback Loops for Continuous Improvement
RAG Plus isn't static. It learns from user
interactions to improve over time.
After each interaction, collect feedback. Was
the answer helpful? What was missing? Which retrieved documents were actually
relevant?
python
def log_interaction(query, answer, docs, user_rating, comment):
    db.store({
        'query': query,
        'answer': answer,
        'retrieved_docs': docs,
        'rating': user_rating,
        'comment': comment,
        'timestamp': now()
    })
Then periodically analyze this feedback to
identify patterns. Are certain types of questions consistently rated poorly?
Are there common phrases in negative feedback that indicate missing
information? Are some document sources more helpful than others?
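As a starting point for that analysis, here's a rough sketch that surfaces terms recurring in poorly rated queries, assuming the logged records come back as dicts shaped like the ones stored above:
python
from collections import Counter

def low_rated_query_terms(interactions, max_rating=2, top_n=20):
    # Terms that recur in poorly rated queries hint at gaps in the knowledge base.
    counter = Counter()
    for record in interactions:
        if record['rating'] <= max_rating:
            counter.update(record['query'].lower().split())
    return counter.most_common(top_n)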
Use this feedback to identify gaps in your
knowledge base, tune retrieval parameters, or improve chunking strategies. The
systems that improve fastest are those that systematically learn from failures.
📊
Real-World Performance Gains
I deployed variations of this system across different
domains. The improvements were consistent.
For a technical documentation chatbot, answer
accuracy went from 62% to 89% as measured by human eval. Time to answer dropped
because the system retrieved the right information on the first try instead of
requiring follow-up questions.
For a legal research assistant, the agentic
retrieval approach reduced the number of queries where lawyers had to manually
search for additional context by 73%. The system learned to anticipate what
related information would be needed.
For an internal knowledge management system at
a mid-sized company, adoption increased by 4x after upgrading to RAG Plus.
People actually trusted the answers enough to rely on them instead of pinging
colleagues on Slack.
The pattern holds across domains. Better
retrieval plus smarter synthesis equals systems people actually use. 💡
Common Implementation Challenges
Implementing RAG Plus isn't trivial. Here are
the challenges that tripped me up and how to handle them.
Computational cost increases significantly. You're doing multiple retrievals, running re-rankers, and making multiple LLM calls per query. Budget accordingly. For high-traffic systems, consider caching aggressively and using smaller models where possible.
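Caching can be as simple as an exact-match lookup on the normalized query text; real systems often cache at the embedding and retrieval layers too. A minimal sketch:
python
import hashlib

answer_cache = {}

def cached_answer(query, answer_fn):
    # Exact-match cache keyed on the normalized query text.
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in answer_cache:
        answer_cache[key] = answer_fn(query)
    return answer_cache[key]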
Latency becomes a concern. Users expect sub-second responses for simple queries. Agentic retrieval might take 5 to 10
seconds for complex questions. Set expectations appropriately in your UI. Show
progress indicators. Let users see the system thinking.
Debugging gets harder. When a basic RAG system
fails, you can trace through one retrieval and one generation. With RAG Plus,
failures can occur at multiple stages. Build comprehensive logging from the
start. Track retrieval scores, re-rank scores, which documents were actually
used, and user feedback.
Tuning requires domain expertise. The right
balance between vector and keyword search varies by domain. The optimal number
of retrieval iterations depends on query complexity. The best chunking strategy
depends on your document structure. Expect to iterate based on real usage data.
🔧
The Philosophical Shift
Moving from RAG to RAG Plus represents more
than technical improvements. It's a shift in how we think about AI systems.
Basic RAG treats the LLM as a passive
responder. We retrieve information and force-feed it context. RAG Plus treats
the LLM as an active reasoner. We give it tools and let it decide what
information it needs.
This mirrors the broader evolution in AI from
narrow task completion to agentic behavior. Systems that can plan, reason about
their own knowledge gaps, and take actions to fill those gaps.
The implications extend beyond retrieval. If we
can build systems that know when they need more information and can
autonomously gather it, what else becomes possible? Systems that recognize when
they're uncertain and proactively seek validation. Systems that identify
knowledge gaps in your documentation and suggest what's missing. Systems that
learn not just from explicit feedback but from observing their own failure
patterns.
We're moving from AI as a tool you operate to
AI as a collaborator that operates alongside you. 🤝
Looking Ahead
RAG Plus is still evolving. Several exciting
directions are emerging.
Multimodal RAG that retrieves and synthesizes
across text, images, tables, and code. Imagine asking "Show me examples of
this design pattern" and getting relevant code snippets, architecture
diagrams, and explanatory text all properly integrated.
Hierarchical memory systems where the AI
maintains both short-term context from the current conversation and long-term
memory of past interactions. Your system remembers what information you've
asked about before and proactively suggests related content.
Collaborative retrieval where multiple AI
agents work together, each specializing in different knowledge domains. One
agent handles technical documentation, another handles business context, a
third handles regulatory requirements, and they coordinate to provide
comprehensive answers.
Self-improving systems that automatically generate synthetic training data from successful interactions, fine-tuning
retrieval and generation models without manual data curation. 🌟
Implementation Roadmap
If you're ready to upgrade your RAG system,
here's a practical path forward.
· Implement hybrid search combining vector and keyword retrieval. This alone will give you a 15 to 25 percent improvement in retrieval quality. Start with alpha at 0.5 and tune based on your domain.
· Add query expansion using an LLM to generate alternative phrasings. Test with 3 to 5 variations per query. Measure whether expanded queries retrieve different relevant documents than the original.
· Enhance your chunks with structural context. Add document titles, section hierarchies, and surrounding context. Update your prompts to make use of this additional information.
· Implement re-ranking on top of your retrieval. Start with a pre-trained cross-encoder model. Retrieve 20 to 30 candidates, rerank to top 5.
· Build agentic retrieval with function calling. Start simple with just a retrieve tool. Let the LLM decide when and what to retrieve. Monitor how many retrieval calls it makes per query.
· Deploy feedback collection and build analysis pipelines. Track what works and what doesn't. Use this data to tune all your previous enhancements.
Each step builds on the previous. You get
incremental improvements at each stage, and by the end you have a system that's
dramatically better than basic RAG. 🚀
The Technical Stack
For those getting started, here's the stack
that works well for RAG Plus systems:
Vector Database: PostgreSQL with pgvector, Pinecone, Weaviate, or Qdrant for vector storage and similarity search. For smaller projects, FAISS works fine.
Keyword Search: Elasticsearch for robust keyword search with advanced filtering. Or use built-in hybrid search capabilities if your vector DB supports them.
Embeddings: OpenAI's text-embedding-ada-002 for general-purpose use, or domain-specific models from Hugging Face for specialized applications.
Re-rankers: Cross-encoder models from Sentence Transformers. The ms-marco models work well out of the box.
LLMs: GPT-4 for agentic retrieval and synthesis. Claude for longer context windows. Llama for self-hosted requirements.
Observability: LangSmith or custom logging with PostgreSQL
for tracking interactions and debugging.
The specific tools matter less than the
architecture. Focus on the patterns: hybrid retrieval, query expansion,
contextual chunks, re-ranking, agentic behavior, and feedback loops. 🛠️
Measuring Success
How do you know if your RAG Plus upgrade
actually worked? Track these metrics:
Answer Accuracy: Have domain experts evaluate a sample of
answers. Compare before and after. Aim for 80%+ accuracy on your evaluation
set.
Retrieval Precision: What percentage of retrieved documents are
actually relevant? Should be 60%+ after re-ranking.
User Satisfaction: Direct feedback ratings. Track thumbs up/down
or 1 to 5 stars. Watch this trend over time.
Adoption Metrics: Are people actually using the system? Track
daily/weekly active users. Compare to alternative information sources (Slack
questions, support tickets, etc).
Time to Answer: Measure from query submission to answer
delivery. Balance thoroughness with speed. Most queries should complete in
under 5 seconds.
Iteration Count: For agentic systems, how many retrieval calls
does the average query require? Complex questions might need 3 to 5. If every
query needs 10+, your initial retrieval quality needs improvement.
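Two of these metrics are easy to compute directly from a labeled evaluation set; a sketch, assuming you have expert relevance judgments and per-answer correctness flags:
python
def retrieval_precision(retrieved_ids, relevant_ids):
    # Fraction of retrieved documents that the evaluator marked relevant.
    if not retrieved_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(retrieved_ids)

def answer_accuracy(correct_flags):
    # correct_flags: booleans from domain-expert review of sampled answers.
    return sum(correct_flags) / len(correct_flags)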
Set baselines before you start upgrading.
Measure at each stage. Celebrate wins and dig into failures. 📊
When NOT to Use RAG Plus
RAG Plus isn't always the answer. Sometimes
basic RAG is sufficient or even preferable.
If your knowledge base is small (under 1000
documents) and rarely changes, basic RAG probably works fine. The complexity of
RAG Plus isn't worth it.
If your queries are very simple and predictable
(like FAQ lookup), basic keyword search might be better than any RAG approach.
If latency is absolutely critical (sub-500ms requirements), the multiple retrieval calls and re-ranking steps might be too slow. Stick with single-shot retrieval or pre-compute common queries.
If your team lacks ML engineering experience,
starting with RAG Plus might be overwhelming. Build basic RAG first, understand
the fundamentals, then upgrade incrementally.
Know your requirements. Match your solution to
your actual needs, not to what's trendy. 🎯
The Bottom Line
Basic RAG was a good first step. RAG Plus is
what you need for production systems that people actually trust and use.
The upgrade requires investment. More complex
infrastructure, higher computational costs, longer development cycles. But the
returns justify it. Higher accuracy, better user satisfaction, reduced support
burden, and systems that improve over time instead of stagnating.
The organizations building RAG Plus systems now
will have a significant advantage as retrieval augmented generation becomes the
default interaction paradigm. The rest will wonder why their chatbots still
hallucinate despite having access to the right information.
Start with hybrid search and re-ranking. Those alone will dramatically improve your results. Add agentic retrieval when you need to handle complex multi-step questions. Build feedback loops from day one
so your system gets smarter with usage.
The future of AI isn't just about bigger
models. It's about smarter systems that know how to find and use information
effectively. That's what RAG Plus enables. 🎯
#RAG #artificialintelligence #machinelearning
#llm #retrieval #dougortiz