Building RAG Applications: Retrieval-Augmented Generation from Scratch
Large language models are remarkably capable, but they have a fundamental limitation: they can only work with information baked into their parameters during training. Ask GPT-4 about your company's internal documentation, last quarter's sales data, or a research paper published after its training cutoff, and it will either hallucinate a plausible-sounding answer or honestly admit it does not know. Retrieval-Augmented Generation, or RAG, solves this problem by giving the model access to external knowledge at inference time, retrieving relevant documents and feeding them into the prompt alongside the user's question.
The concept is elegantly simple. The implementation is where things get interesting, and where most teams struggle. This guide walks through the complete architecture of a production RAG system, from document ingestion to evaluation, covering the design decisions that separate a working prototype from a reliable production application.
RAG Architecture Overview
Before diving into components, it helps to see the complete system at a high level. A RAG application has two main phases: an offline indexing pipeline and an online query pipeline.
The Indexing Pipeline (Offline)
Documents --> [Loader] --> Raw Text
Raw Text --> [Chunker] --> Text Chunks
Chunks --> [Embedding Model] --> Vector Embeddings
Embeddings --> [Vector Database] --> Indexed and Stored
This pipeline runs when new documents are added to the system. It processes raw content into searchable vector representations and stores them in a database optimized for similarity search.
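As a concrete sketch, the indexing pipeline can be reduced to a few dozen lines of Python. The embed function below is a toy stand-in (it hashes characters into a short normalized vector); a real system would call an embedding model API or a local model instead, and the chunker here is deliberately naive.

```python
from dataclasses import dataclass

# Toy stand-in for a real embedding model: hashes characters into a
# small fixed-size vector and normalizes it. Replace with an API call
# or a local model in a real pipeline.
def embed(text: str, dim: int = 8) -> list[float]:
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

@dataclass
class IndexedChunk:
    doc_id: str
    text: str
    vector: list[float]

def index_documents(docs: dict[str, str], chunk_size: int = 200) -> list[IndexedChunk]:
    """Load -> chunk -> embed -> store: the offline pipeline in miniature."""
    store: list[IndexedChunk] = []
    for doc_id, text in docs.items():
        # Naive fixed-size character chunking; see the chunking section
        # for better strategies.
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        for chunk in chunks:
            store.append(IndexedChunk(doc_id, chunk, embed(chunk)))
    return store
```

In production the "store" step would write to a vector database rather than an in-memory list, but the shape of the pipeline is the same.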
The Query Pipeline (Online)
User Query --> [Embedding Model] --> Query Vector
Query Vector --> [Vector Database] --> Top-K Relevant Chunks
Relevant Chunks + Query --> [Re-Ranker] --> Re-Ranked Chunks
Re-Ranked Chunks + Query --> [LLM Prompt] --> Generated Answer
When a user asks a question, the system embeds the query, searches for similar chunks, optionally re-ranks the results, and then constructs a prompt that includes both the retrieved context and the original question. The LLM generates an answer grounded in the retrieved documents.
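The online half of the system can be sketched just as compactly. This version assumes the store is a list of (chunk_text, vector) pairs and performs brute-force cosine search; a vector database replaces top_k with an approximate index, and the prompt template is illustrative, not prescriptive.

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Brute-force similarity search over (chunk_text, vector) pairs."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble retrieved context and the question into an LLM prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```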
Each component in this architecture involves non-trivial design decisions. Let us examine them in order.
Document Chunking Strategies
Chunking, the process of splitting documents into smaller pieces for embedding and retrieval, is arguably the most underappreciated component of a RAG system. Poor chunking decisions cascade through the entire pipeline, degrading retrieval quality in ways that no amount of downstream sophistication can compensate for.
Fixed-Size Chunking
The simplest approach splits text into chunks of a fixed number of tokens or characters, typically with overlap between consecutive chunks to avoid splitting important content across boundaries. A common configuration is 512-token chunks with 50-token overlap.
Fixed-size chunking is fast, predictable, and easy to implement. Its weakness is that it is completely indifferent to the semantic structure of the document. A chunk boundary might fall in the middle of a paragraph, splitting a coherent argument across two chunks and reducing the usefulness of both. For documents with clear structural elements, like headers, sections, and subsections, this approach leaves significant retrieval quality on the table.
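A minimal implementation makes the mechanics explicit. Here "tokens" are simply a pre-tokenized list (a real system would use the embedding model's tokenizer); the window advances by size minus overlap, so consecutive chunks share their boundary tokens.

```python
def fixed_size_chunks(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window already covers the tail
    return chunks
```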
Recursive Character Splitting
LangChain popularized recursive character splitting, which attempts to split on natural boundaries (paragraph breaks, then sentence breaks, then word breaks) while keeping chunks under a maximum size. This produces chunks that are somewhat more semantically coherent than fixed-size splitting, though the improvement varies with document structure.
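The idea can be sketched without the library. This simplified version (character-based rather than token-based, with a hard split as the last resort) tries each separator from coarsest to finest and recurses when an individual piece is still too long.

```python
def recursive_split(
    text: str,
    max_len: int = 500,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator present; recurse with finer ones as needed."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            pieces, chunks, current = text.split(sep), [], ""
            for piece in pieces:
                candidate = current + sep + piece if current else piece
                if len(candidate) <= max_len:
                    current = candidate  # greedily pack pieces into one chunk
                else:
                    if current:
                        chunks.append(current)
                    if len(piece) > max_len:
                        # piece itself is too long: recurse with finer separators
                        chunks.extend(recursive_split(piece, max_len, separators[i + 1:]))
                        current = ""
                    else:
                        current = piece
            if current:
                chunks.append(current)
            return chunks
    # no separator worked: hard split at max_len
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```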
Semantic Chunking
A more sophisticated approach uses embedding similarity to identify natural breakpoints. The algorithm embeds sliding windows of text and measures the cosine similarity between consecutive windows. Points where similarity drops sharply, indicating a topic transition, become chunk boundaries. This produces chunks that align with the document's actual semantic structure, regardless of formatting. The trade-off is computational cost: you need to embed the entire document at window granularity before you can determine where to split it.
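The breakpoint-detection step reduces to comparing consecutive window embeddings. In this sketch the window vectors are assumed to come from an embedding model; a fixed similarity threshold stands in for the more common percentile-based cutoff.

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def semantic_breakpoints(window_vectors: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Indices i where similarity between windows i and i+1 drops below the
    threshold, marking candidate chunk boundaries at topic transitions."""
    return [
        i for i in range(len(window_vectors) - 1)
        if cosine(window_vectors[i], window_vectors[i + 1]) < threshold
    ]
```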
Document-Structure-Aware Chunking
For structured documents like HTML, Markdown, or PDF with extractable headings, the most effective approach is to chunk based on the document's own hierarchy. Split at heading boundaries, preserve the heading hierarchy as metadata, and ensure that each chunk inherits contextual information from its parent sections. A chunk about "Performance Tuning" within a section about "Database Configuration" should carry both pieces of context, either as prepended text or as metadata that the retrieval system can use.
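For Markdown specifically, a heading-aware chunker is straightforward: split at heading lines and carry the full heading path as metadata. This is a minimal sketch that only handles ATX-style headings and ignores edge cases like headings inside code fences.

```python
import re

def markdown_chunks(text: str) -> list[dict]:
    """Split at heading lines; each chunk carries its heading path as metadata."""
    path: list[str] = []
    chunks: list[dict] = []
    body: list[str] = []

    def flush():
        joined = "\n".join(body).strip()
        if joined:
            chunks.append({"headings": list(path), "text": joined})

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            body.clear()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

A chunk under "Performance Tuning" inside "Database Configuration" comes back with both headings attached, so they can be prepended to the chunk text or stored as filterable metadata.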
In practice, the best chunking strategy depends on your document types, query patterns, and the embedding model's context window. There is no universally optimal approach, and the right answer usually emerges from experimentation against your specific data.
Embedding Models: Turning Text into Vectors
Embedding models transform text into dense vector representations that capture semantic meaning. Two pieces of text that discuss similar topics will produce vectors that are close together in the embedding space, enabling similarity search.
Key Model Comparisons
OpenAI's text-embedding-3-small and text-embedding-3-large are the most widely used commercial options. They offer strong performance, easy API integration, and adjustable dimensionality (you can truncate the vectors to reduce storage costs while maintaining most of the quality). The large model produces 3072-dimensional vectors and tops many retrieval benchmarks.
Cohere's embed-v3 models are competitive with OpenAI's offerings and include built-in support for different "input types" (search_document, search_query, classification, clustering), which adjusts the embedding approach based on the intended use case. This distinction between document and query embeddings can improve retrieval quality for asymmetric search tasks where the query is short and the document is long.
Among open-source options, the landscape has improved dramatically. BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence, E5 models from Microsoft, and GTE (General Text Embeddings) models from Alibaba all perform competitively on the MTEB (Massive Text Embedding Benchmark) leaderboard. The gte-large model, in particular, offers strong retrieval performance at a fraction of the cost of commercial APIs, since you can run it on your own infrastructure.
For multilingual applications, Cohere's multilingual embed model and the multilingual E5 variants provide cross-language retrieval capabilities, allowing queries in one language to retrieve documents written in another.
Practical Considerations
Beyond benchmark performance, several practical factors influence embedding model selection. Latency matters for online query embedding; a model that takes 200ms to embed a query adds 200ms to every user interaction. Dimensionality affects storage costs; 3072-dimensional vectors in a database with millions of documents require substantially more storage and memory than 384-dimensional vectors. And context window length matters for chunking strategy; a model with a 512-token context window forces shorter chunks than one with an 8192-token window.
One common mistake is using the same embedding approach for documents and queries without considering the asymmetry of the task. A search query like "how do I reset my password?" and a document paragraph that explains the password reset process in detail are semantically related but structurally very different. Models and configurations that account for this asymmetry typically produce better retrieval results.
Vector Database Options
The vector database stores embeddings and performs similarity search at query time. The choice of database affects latency, scalability, cost, and operational complexity.
Pinecone
Pinecone is a fully managed vector database that removes nearly all operational burden. You do not manage infrastructure, handle index optimization, or worry about scaling. The API is straightforward, and performance is consistently good. Pinecone supports metadata filtering, namespace isolation, and hybrid search (combining vector similarity with keyword matching).
The trade-off is cost and vendor lock-in. Pinecone's pricing scales with the number of vectors and the pod type, and at large scale the costs can become significant. You are also dependent on Pinecone's infrastructure and service availability, which may be unacceptable for some enterprise applications.
Weaviate
Weaviate is an open-source vector database that can be self-hosted or used as a managed service. It supports a rich GraphQL-based query API, built-in vectorization (it can call embedding models directly), and hybrid search that combines BM25 keyword scoring with vector similarity. Weaviate's multi-tenancy support makes it well-suited for SaaS applications where each customer's data needs to be isolated.
Weaviate's flexibility comes with operational complexity. Self-hosted deployments require attention to resource provisioning, backup strategies, and cluster management. The learning curve is steeper than Pinecone's but the configurability is substantially greater.
ChromaDB
ChromaDB positions itself as the "AI-native" database with the simplest developer experience. It can run in-process (embedded in your Python application), as a standalone server, or as a managed cloud service. For prototyping and small-scale applications, the in-process mode is exceptionally convenient: pip install chromadb and you have a working vector database in three lines of code.
ChromaDB's simplicity is its strength for getting started and its limitation at scale. It lacks some of the advanced features of Weaviate and Pinecone, and its performance characteristics at hundreds of millions of vectors are less well-established.
pgvector
For teams already running PostgreSQL, pgvector adds vector similarity search as a PostgreSQL extension. The appeal is operational simplicity: you do not need to deploy and manage a separate database system. Your vectors live alongside your relational data, and you can combine vector similarity search with traditional SQL queries in a single statement.
pgvector's performance has improved substantially with the addition of HNSW (Hierarchical Navigable Small World) indexes, which provide approximate nearest neighbor search with recall rates above 99% at query latencies measured in single-digit milliseconds for collections up to a few million vectors. For larger collections or latency-critical applications, purpose-built vector databases may still have an edge, but pgvector is a serious option for many production workloads.
Retrieval Strategies
Basic vector similarity search (retrieve the top-K chunks closest to the query embedding) is the foundation of RAG retrieval. But several enhancements can significantly improve the quality of retrieved results.
Hybrid Search
Combining vector similarity with keyword-based (BM25) search often outperforms either approach alone. Vector search excels at semantic matching: finding documents that discuss the same concept using different terminology. BM25 excels at exact match: finding documents that contain specific terms, names, or identifiers that embedding models might not distinguish well. Hybrid search combines both scores, typically using reciprocal rank fusion or a learned weighting, to get the benefits of both approaches.
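Reciprocal rank fusion, one of the combination methods mentioned above, is simple to implement: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in, with k conventionally set to 60.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks matter, RRF needs no score normalization between the vector and BM25 lists, which is why it is a popular default.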
Query Transformation
The user's original query is not always the best search query. Several techniques improve retrieval by transforming the query before searching. HyDE (Hypothetical Document Embeddings) asks the LLM to generate a hypothetical answer to the query, then uses that hypothetical answer as the search query. The intuition is that a document-like text will be closer in embedding space to actual documents than a short question would be. Multi-query retrieval generates multiple reformulations of the original query and retrieves results for each, merging the results to improve recall. Step-back prompting asks the LLM to generate a more general version of the query, which can retrieve broader context that helps answer specific questions.
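Multi-query retrieval, for example, is mostly plumbing around an LLM call. In this sketch both the reformulator (the LLM call) and the search function are passed in as callables, so the merge-and-deduplicate logic is all that is shown.

```python
from typing import Callable

def multi_query_retrieve(
    query: str,
    reformulate: Callable[[str, int], list[str]],  # stand-in for an LLM call
    search: Callable[[str, int], list[str]],       # returns ranked doc IDs
    n: int = 3,
    k: int = 5,
) -> list[str]:
    """Retrieve for the original query plus n reformulations, deduplicating
    results while preserving first-seen order."""
    queries = [query] + reformulate(query, n)
    seen: set[str] = set()
    merged: list[str] = []
    for q in queries:
        for doc_id in search(q, k):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```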
Contextual Compression
Retrieved chunks often contain relevant information buried within irrelevant context. Contextual compression uses an LLM or a smaller extraction model to distill retrieved chunks down to only the sentences or passages that are directly relevant to the query. This reduces the amount of context the generation model needs to process and can improve answer quality by removing distracting information.
Re-Ranking: The Critical Middle Step
Re-ranking is perhaps the single highest-leverage improvement you can make to a basic RAG system. The idea is simple: retrieve a larger initial set of candidates (say, top-20 or top-50) using fast but approximate vector search, then use a more accurate but slower model to re-order those candidates by relevance.
Cross-encoder re-rankers, such as Cohere Rerank, models from the sentence-transformers library, or BGE-reranker, process the query and each candidate document jointly, producing a relevance score that accounts for the specific interaction between the query and document. This is much more accurate than the independent embedding comparison used in the initial retrieval stage, but it is also much slower (since each query-document pair must be processed separately).
The two-stage retrieve-then-rerank approach gives you the speed of vector search for candidate generation and the accuracy of cross-encoder scoring for final ranking. In benchmarks and production systems, adding a re-ranker to a basic RAG pipeline typically improves answer quality by 10-25% as measured by standard metrics, a larger improvement than most other single interventions.
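The second stage is little more than "score every pair, then sort." Here overlap_score is a deliberately crude stand-in so the example runs anywhere; a real system would replace it with a cross-encoder such as a sentence-transformers model or a rerank API.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Stage two: score each (query, candidate) pair jointly and re-order."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

# Toy scorer based on token overlap. A real cross-encoder processes the
# query and document jointly through a transformer; this only mimics the
# interface, not the quality.
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)
```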
Evaluation Metrics
Evaluating RAG systems is notoriously difficult because you need to assess two things: whether the right documents were retrieved, and whether the generated answer is correct and well-grounded in those documents.
Retrieval Metrics
Traditional information retrieval metrics apply to the retrieval component. Recall@K measures the fraction of relevant documents that appear in the top-K results. Precision@K measures the fraction of top-K results that are relevant. Mean Reciprocal Rank (MRR) measures how high the first relevant result appears. Normalized Discounted Cumulative Gain (NDCG) provides a more nuanced measure that accounts for the position and graded relevance of all results.
Computing these metrics requires a labeled evaluation dataset: a set of queries paired with the documents that should be retrieved. Creating this dataset is labor-intensive but essential. Without it, you are optimizing blind.
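The simpler metrics are a few lines each, which makes it easy to wire them into a test suite once a labeled dataset exists. These follow the standard definitions given above; NDCG is omitted for brevity.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-K results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-K results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved) if all_retrieved else 0.0
```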
Generation Metrics
For the generation component, the RAGAS framework has become a popular evaluation toolkit. It defines several metrics: faithfulness (does the answer stick to the retrieved context, or does it hallucinate?), answer relevancy (does the answer actually address the question?), and context relevancy (are the retrieved documents relevant to the question?). These metrics use an LLM as a judge, which introduces its own biases but provides a scalable alternative to human evaluation.
Human evaluation remains the gold standard, particularly for detecting subtle hallucinations, assessing the quality of reasoning, and evaluating whether answers are genuinely helpful in context. A practical approach is to use automated metrics for rapid iteration during development and targeted human evaluation for validation before deployment and periodically during operation.
Production Considerations
Moving a RAG system from prototype to production introduces several challenges that do not arise in demo environments.
Document Freshness
Real document collections change over time. Pages are updated, new documents are added, old documents are deprecated. Your indexing pipeline needs to handle incremental updates without reprocessing the entire collection. This requires tracking document versions, detecting changes, and updating or replacing the corresponding chunks and embeddings in the vector database. It also requires a strategy for handling documents that have been deleted or superseded: stale results that reference outdated information are often worse than no results at all.
Access Control
In enterprise applications, not all users should see all documents. If your RAG system indexes HR policies, financial reports, and engineering documentation, a query from a marketing intern should not retrieve confidential financial projections. Implementing access control in a RAG system requires propagating permission metadata through the indexing pipeline and filtering search results based on the querying user's permissions. This sounds straightforward but becomes complex when permissions are inherited, role-based, or change dynamically.
Latency and Cost
Production RAG systems need to respond quickly enough for interactive use, typically under two seconds end-to-end. This budget must accommodate query embedding, vector search, optional re-ranking, prompt construction, and LLM generation. Each component contributes latency, and the LLM generation step typically dominates. Caching frequently asked queries and their answers, streaming LLM responses to the user, and pre-computing embeddings for common query patterns can all help manage latency.
Cost is equally important. Embedding API calls, vector database hosting, re-ranker inference, and LLM generation all incur costs that scale with usage. Understanding the cost per query and optimizing the most expensive components, often the LLM generation step, is essential for sustainable operation.
Observability and Debugging
When a RAG system produces a wrong answer, you need to determine whether the failure was in retrieval (the right documents were not found), in ranking (the right documents were found but ranked too low), or in generation (the right documents were provided but the LLM misinterpreted or ignored them). Logging the intermediate outputs of each pipeline stage, including the retrieved chunks, their scores, the constructed prompt, and the raw LLM output, is essential for diagnosing failures and driving iterative improvement.
Tools like LangSmith, Phoenix (from Arize), and Langfuse provide observability platforms specifically designed for LLM application debugging, offering trace visualization, evaluation dashboards, and feedback collection interfaces.
Common Pitfalls and How to Avoid Them
Several failure patterns recur frequently in RAG implementations, and awareness of them can save significant debugging time.
Chunks that are too small lose context and become ambiguous. A chunk containing only "The rate was increased to 5.25%" is useless without knowing which rate, when, and why. Including surrounding context, prepending section headers, or using parent-child chunk relationships (where the retrieved chunk is expanded to include its surrounding content before being passed to the LLM) can mitigate this problem.
Chunks that are too large dilute relevance. If each chunk is an entire document, vector search becomes less discriminating, and the LLM must process large amounts of irrelevant text to find the answer. Larger chunks also consume more of the LLM's context window, limiting the number of chunks you can include.
Embedding model mismatch is a subtle issue. If you embed documents with one model and queries with a different model (or even a different version of the same model), the vector spaces may not be compatible, and similarity scores become meaningless. Always use the same model and version for both document and query embedding.
Ignoring metadata is a common missed opportunity. Vector similarity search alone cannot distinguish between a document from 2020 and one from 2024, between a draft and a published version, or between a policy that applies to the US and one that applies to Europe. Storing and filtering on metadata (document date, source, type, department, and version) dramatically improves result relevance for many query types.
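Most vector databases expose metadata filtering natively; conceptually it is a pre-filter applied before (or alongside) similarity ranking. As an illustrative sketch, assuming each stored chunk is a dict with a metadata field:

```python
def metadata_filter(store: list[dict], **conditions) -> list[dict]:
    """Keep only chunks whose metadata matches every condition;
    similarity search then runs over the survivors."""
    return [
        item for item in store
        if all(item["metadata"].get(key) == value for key, value in conditions.items())
    ]
```

In a real system you would push these predicates down into the database query (e.g. a filter expression or a SQL WHERE clause) rather than filtering in application code, so the index can prune candidates efficiently.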
RAG is not a solved problem. The architecture described here represents current best practices, but the field is evolving rapidly. New embedding models, retrieval techniques, and evaluation methods appear regularly. The teams that build the best RAG systems are those that treat the pipeline as a continuous optimization problem, systematically measuring performance, identifying failure modes, and iterating on each component. The fundamentals (chunk well, embed well, retrieve well, generate well) remain constant even as the specific tools and techniques evolve.