The Invisible Debugging Guide: Finding What Your LLM Didn't Tell You
Ever had that frustrating moment when code generated by an AI runs perfectly in testing but crashes spectacularly in production? You're not alone. After several of these experiences, I've learned to spot what's missing from AI-generated solutions before they cause real problems.
The Dangerous World of Missing Error Handlers
Recently, I deployed what seemed like perfectly functional code generated by my favorite LLM. The testing phase went smoothly, but late that night my phone started buzzing with alerts. What happened?
The code handled the happy path beautifully but contained zero error handling for network timeouts. When our third-party payment processor experienced hiccups, the entire checkout flow crashed rather than gracefully degrading.
# What the LLM gave me
def process_payment(payment_info):
    response = payment_gateway.charge(
        amount=payment_info.amount,
        card=payment_info.card_token,
        currency=payment_info.currency
    )
    return {
        "success": True,
        "transaction_id": response.transaction_id,
        "timestamp": datetime.now()
    }
No timeout handling. No network error catching. No validation for the response structure. In testing with our reliable staging environment, these issues never surfaced.
Here's what I should have asked for:
# What I needed
def process_payment(payment_info):
    try:
        response = payment_gateway.charge(
            amount=payment_info.amount,
            card=payment_info.card_token,
            currency=payment_info.currency,
            timeout=5.0  # Explicit timeout
        )
        # Validate response has expected fields
        if not hasattr(response, 'transaction_id'):
            logger.error("Invalid payment response structure")
            return {"success": False, "error": "invalid_gateway_response"}
        return {
            "success": True,
            "transaction_id": response.transaction_id,
            "timestamp": datetime.now()
        }
    except Timeout:
        logger.warning(f"Payment gateway timeout for amount {payment_info.amount}")
        return {"success": False, "error": "gateway_timeout", "retry_after": 15}
    except ConnectionError:
        logger.warning(f"Payment gateway connection error for amount {payment_info.amount}")
        return {"success": False, "error": "gateway_connection", "retry_after": 30}
    except Exception as e:
        logger.error(f"Unexpected payment error: {str(e)}")
        return {"success": False, "error": "unknown", "message": str(e)}
LLMs consistently skip error handling unless explicitly asked. They focus on the expected behavior and rarely address failure modes without prompting.
The Missing Edge Cases Pattern
Through painful experience, I've identified specific categories of edge cases that LLMs routinely overlook:
1. Empty Collections
AI models rarely handle empty lists, dictionaries, or sets properly. When I asked for code to calculate average order value, the LLM gave me:
def calculate_average_order(orders):
    total = sum(order.amount for order in orders)
    return total / len(orders)  # Boom! Division by zero if orders is empty
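The fix is a one-line guard, though the right behavior for the empty case (return zero, return None, or raise) is a design decision worth making explicit in your prompt:

def calculate_average_order(orders):
    if not orders:  # Guard against empty input
        return 0.0  # Or return None / raise ValueError, depending on your contract
    total = sum(order.amount for order in orders)
    return total / len(orders)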
2. Resource Cleanup
Many AI-generated code snippets neglect to release resources, particularly in error scenarios:
def process_large_file(filename):
    file = open(filename, 'rb')
    data = file.read()
    results = analyze_data(data)
    file.close()  # Never reached if analyze_data raises an exception
    return results
The fix is simple (use context managers), but consistently missed.
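For the record, here's the same function with a context manager, which guarantees the file is closed even when analyze_data blows up:

def process_large_file(filename):
    with open(filename, 'rb') as file:  # Closed automatically, even on exceptions
        data = file.read()
    return analyze_data(data)  # File is already closed before analysis runs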
3. Boundary Values
LLMs rarely address integer overflow, string length limitations, or other boundary conditions:
// Calculating time difference in milliseconds
const timeDiff = endDate.getTime() - startDate.getTime();
const daysDifference = timeDiff / (1000 * 60 * 60 * 24);
What happens when crossing daylight saving time boundaries? Or when dates are in different timezones? The model didn't consider these cases.
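Back in Python, one way to sidestep the DST trap is to compare calendar dates in an explicit timezone instead of dividing raw millisecond deltas. A minimal sketch, assuming both inputs are timezone-aware datetimes and that "America/New_York" stands in for whatever zone your users care about:

from datetime import datetime
from zoneinfo import ZoneInfo  # Standard library since Python 3.9

def days_between(start: datetime, end: datetime, tz: str = "America/New_York") -> int:
    # Convert both instants to the same zone, then compare calendar dates,
    # so a 23- or 25-hour DST day still counts as exactly one day
    zone = ZoneInfo(tz)
    return (end.astimezone(zone).date() - start.astimezone(zone).date()).days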
My Three-Step Gap Detection Process
After months of patching holes in AI-generated code, I've developed a system for quickly identifying what's missing:
Step 1: Ask "What If It Fails?"
For each external interaction (API calls, file operations, database queries), I explicitly ask:
- What happens if the connection fails?
- What if the operation times out?
- What if the returned data isn't in the expected format?
These simple questions uncover roughly 80% of the missing error handling.
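To make the questions concrete, I sometimes sketch the answers as a skeleton before asking the LLM to fill in the real logic. A hypothetical example around an HTTP call with the requests library (the endpoint and response fields are made up for illustration):

import requests

def fetch_user_profile(user_id, base_url="https://api.example.com"):
    # Q1: What if the connection fails?  Q2: What if it times out?
    # Q3: What if the returned data isn't in the expected format?
    try:
        response = requests.get(f"{base_url}/users/{user_id}", timeout=5.0)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        return {"ok": False, "error": "timeout"}
    except requests.exceptions.ConnectionError:
        return {"ok": False, "error": "connection"}
    except requests.exceptions.HTTPError as e:
        return {"ok": False, "error": f"http_{e.response.status_code}"}
    try:
        payload = response.json()
    except ValueError:  # Body wasn't valid JSON
        return {"ok": False, "error": "invalid_json"}
    if "id" not in payload:  # Q3 again: unexpected structure
        return {"ok": False, "error": "unexpected_format"}
    return {"ok": True, "profile": payload}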
Step 2: Feed It Empty or Extreme Inputs
I mentally trace code execution with:
- Empty collections ([], {}, "")
- Extremely large values
- Negative numbers (when only positive are expected)
- Unicode characters in string inputs
When reviewing an LLM-generated function that processed user comments, I noticed it would crash on emoji input, a failure mode the otherwise detailed code comments never mentioned.
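You can make this tracing mechanical instead of mental with a parametrized test. A quick sketch using pytest, with clean_comment standing in as a hypothetical name for whatever comment-processing function the LLM produced:

import pytest

from myapp.comments import clean_comment  # Hypothetical module and function

@pytest.mark.parametrize("raw", [
    "",                      # Empty string
    " " * 10_000,            # Extremely large input
    "-1 stars, do not buy",  # Leading negative number
    "Great product 👍🔥",     # Unicode / emoji
])
def test_clean_comment_survives_edge_inputs(raw):
    # Only assert it doesn't crash and returns a string;
    # stricter assertions come after behavior is pinned down
    result = clean_comment(raw)
    assert isinstance(result, str)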
Step 3: Check Resource Management
For any code that acquires resources (files, network connections, database handles), verify it properly releases them in all scenarios, including exceptions.
A colleague of mine found that an LLM-generated script that processed images would leave hundreds of temporary files behind when run in production, eventually filling disk space.
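The cure is the same pattern as the file example earlier: tie the resource's lifetime to a block. A sketch using Python's tempfile module (resize_and_save and collect_results are hypothetical stand-ins for the actual processing steps):

import tempfile

def process_image_batch(images):
    # TemporaryDirectory removes itself (and everything inside)
    # when the block exits, including via an exception
    with tempfile.TemporaryDirectory() as workdir:
        for image in images:
            resize_and_save(image, workdir)  # Hypothetical processing step
        return collect_results(workdir)      # Hypothetical aggregation step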
Real-World Example: The Project That Almost Failed
On a recent data-migration project, AI assistance saved us tremendous time, but it nearly cost us the project; the gap-detection process above is what caught the problems in time.
The LLM created elegant code for transferring customer records between database systems. It looked comprehensive and even included progress tracking. But when we ran our gap analysis, we discovered critical missing pieces:
- No validation that destination records matched source structure
- No handling for dropped connections during long-running transfers
- No mechanism to resume partially completed transfers
- No verification step to compare source and destination records
After addressing these gaps, we ran a pilot migration that encountered three of these exact issues! Had we deployed the original code, we would have ended up with corrupted or incomplete customer data.
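The resume mechanism we added boiled down to a checkpoint file: record the last successfully committed batch, and pick up from there on restart. A stripped-down sketch of the idea, with fetch_batch and write_batch standing in for the real database calls (the actual version also included the verification step mentioned above):

import json
import os

CHECKPOINT = "migration_checkpoint.json"

def transfer_records(batch_size=500):
    # Resume from the last committed batch if a previous run was interrupted
    offset = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            offset = json.load(f)["offset"]
    while True:
        batch = fetch_batch(offset, batch_size)  # Hypothetical: read from source DB
        if not batch:
            break
        write_batch(batch)  # Hypothetical: write to destination DB
        offset += len(batch)
        with open(CHECKPOINT, "w") as f:
            json.dump({"offset": offset}, f)  # Checkpoint only after a successful write
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)  # Clean up once the transfer completes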
Prompting Techniques That Force Completeness
I've found that changing how I prompt LLMs dramatically reduces these gaps:
- Explicitly request error handling: "Include comprehensive error handling for all external operations"
- Specify the environment: "This will run in a production environment with unreliable network connectivity"
- Ask for test cases: "Include sample test cases that would verify edge case handling"
- Request comments about limitations: "Add comments about any assumptions or limitations in this implementation"
Using these prompting techniques reduced the bug rate in our AI-generated code by approximately 70%.
The Future of Gap-Free AI Coding
As models continue to improve, I expect some of these issues to diminish, but the fundamental challenge remains: LLMs optimize for the happy path because that's what most code examples show.
The most successful developers I've seen using AI coding tools have each developed their own version of gap analysis. They treat the AI's output as a first draft that needs human review focused specifically on what's missing rather than on what's there.
By systematically looking for these gaps, you'll save yourself countless debugging hours and dramatically improve the reliability of AI-assisted code.
What gaps have you found in AI-generated code?
I'd love to hear about your experiences in the comments below!