Build Your Own RTFM For Me Agent
This challenge is to build your own AI-powered documentation assistant - a tool that can ingest technical documentation, answer questions about it using AI, and remember context across conversations.
If you’ve ever used an AI chatbot and wished it could answer questions specifically about your own documentation, that’s exactly what you’re building here. The technique behind it is called Retrieval-Augmented Generation (RAG). Instead of relying solely on what an AI model was trained on, you retrieve the specific documents relevant to a question and feed them to the model as context. The result is grounded, accurate answers with source citations rather than hallucinated guesses.
Redis is the backbone of this project. It handles vector search for finding relevant documents, semantic caching for avoiding redundant AI calls, session storage for conversation history, and long-term memory for remembering user context across sessions. Everything else - the AI model, the embedding provider, your programming language, and your framework - is entirely your choice. You can read all about Redis’ AI offerings here.
The Challenge - Building The RTFM For Me Agent
In this challenge you’re going to build an RTFM For Me Agent, a full-stack AI assistant that ingests documentation files, answers questions using retrieval-augmented generation, caches semantically similar queries to reduce costs, and maintains memory across sessions. By the end, you’ll have a system that gets more useful the more you interact with it and can search and read documentation for you.
Step Zero
In this introductory step you’re going to set up your environment, ready to begin developing and testing your solution.
You’ll need Docker and Docker Compose to run Redis Stack, which provides Redis with built-in vector search capabilities. Set up a docker-compose.yml that runs Redis Stack.
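A minimal docker-compose.yml might look like this (the image tag and ports are the Redis Stack defaults; the volume name is an arbitrary choice):

```yaml
services:
  redis:
    image: redis/redis-stack:latest   # Redis plus the search/JSON modules
    ports:
      - "6379:6379"   # Redis protocol
      - "8001:8001"   # RedisInsight web UI
    volumes:
      - redis-data:/data              # persist data across restarts
volumes:
  redis-data:
```

Run `docker compose up -d` and you have a Redis instance with vector search available on the standard port.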
Next, make three decisions that will shape the rest of your build:
Choose your LLM and embedding provider. You’ll need a language model for generating answers and an embedding model for converting text into vectors. Options include OpenAI, Anthropic, Google Gemini, Mistral, Cohere, or running models locally with Ollama. Whatever embedding model you choose, note its output dimensions - you’ll need this when creating your Redis vector index. If you’re new to this, an embedding model turns your text into a list of numbers (called a vector) that captures its meaning. This is what lets the system find relevant chunks of text later, by comparing how similar those number lists are.
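To make the “comparing number lists” idea concrete, here’s a toy cosine-similarity calculation in Python. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the first two "mean" similar things, the third does not.
auth_question = [0.9, 0.1, 0.2]
auth_docs = [0.8, 0.2, 0.3]
recipe_docs = [0.1, 0.9, 0.1]

print(cosine_similarity(auth_question, auth_docs))    # high similarity
print(cosine_similarity(auth_question, recipe_docs))  # much lower
```

This is the comparison your vector index performs at scale, just over many thousands of stored embeddings at once.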
Choose your programming language and framework. Redis has client libraries for Python, TypeScript, Java, Go, Rust, C#, and more. You can use an AI agent framework like PydanticAI, LangChain, or LlamaIndex - or skip the framework entirely and call your LLM’s API directly.
Choose your Redis client library. Python developers might want to look at the Python version of RedisVL, and Java developers can grab the new Java version of RedisVL, both of which provide high-level abstractions for vector search, caching, and sessions.
Prepare some sample documentation files (markdown, text, or HTML) that you’ll use to test your system throughout the challenge. Technical documentation with clear sections works well - API references, getting started guides, or architecture documents. A great example would be the Pro Git book (https://github.com/progit/progit2), which would let you create an agent that helps with Git commands.
Testing:
Verify Redis is running by connecting with redis-cli and running PING. You should receive PONG in response. Verify you can call your chosen LLM and embedding APIs successfully.
By the way, there is also a coding challenge that has you build your own Redis.
Step 1
In this step your goal is to build a document ingestion pipeline that loads documentation files, splits them into chunks, generates vector embeddings for each chunk, and stores everything in a Redis vector index.
Start by loading your sample documentation files. Then split the text into smaller chunks - roughly 500 tokens each with some overlap between consecutive chunks so you don’t lose context at the boundaries. Try to split on natural boundaries like paragraphs rather than cutting mid-sentence.
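The chunking step can be sketched in a few lines of Python. Here the budget is measured in characters rather than tokens (roughly 4 characters per token is a common rule of thumb), and the overlap is one paragraph - both numbers are assumptions to tune for your documents:

```python
def chunk_text(text: str, max_chars: int = 2000, overlap_paras: int = 1) -> list[str]:
    """Split text on blank-line (paragraph) boundaries into chunks of
    roughly max_chars characters, carrying overlap_paras paragraphs of
    overlap between consecutive chunks so boundary context isn't lost."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap_paras:]  # keep the tail as overlap
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the split happens only on paragraph boundaries, no chunk ever cuts mid-sentence, and the overlap means a fact straddling a boundary still appears whole in at least one chunk.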
For each chunk, generate a vector embedding using your chosen embedding provider. Then store the chunk in Redis along with its embedding and metadata: the source file name, the section heading, and the chunk’s position in the document. This metadata will become important later when you add filtering.
You’ll need to create a Redis vector index that supports similarity search over these embeddings. The index should make the chunk text full-text searchable, the metadata fields filterable, and the embedding vectors searchable by similarity. Refer to the Redis vector search documentation for details on creating indexes with the FT.CREATE command.
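As a concrete starting point, an index over hash documents might be created like this (shown wrapped for readability - enter it as a single command; the field names and the 1536 dimension are assumptions, the dimension must match your embedding model’s output):

```
FT.CREATE docs_idx ON HASH PREFIX 1 doc: SCHEMA
    content TEXT
    source TAG
    heading TEXT
    position NUMERIC
    embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE
```

The `6` after `HNSW` is the count of attribute arguments that follow (TYPE, FLOAT32, DIM, 1536, DISTANCE_METRIC, COSINE).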
Testing:
Ingest your sample documentation files and verify the data is in Redis:
Run FT.INFO on your index to confirm it exists and shows the correct number of documents.
Run HGETALL on one of your stored document keys to verify it contains the chunk text, metadata fields, and embedding vector.
Try ingesting the same files again and verify your pipeline handles duplicates sensibly.
Step 2
In this step your goal is to implement vector search and RAG-based answer generation. When a user asks a question, your system should find the most relevant document chunks and use them to generate a grounded answer.
The flow works like this: take the user’s question, convert it to an embedding using the same model you used for your documents, then search your Redis vector index for the most similar chunks. Take the top results and pass them to your LLM as context alongside the question.
Your system prompt should instruct the LLM to answer using only the provided context and to cite which source file each piece of information comes from. If the context doesn’t contain enough information to answer the question, the LLM should say so honestly rather than making something up.
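The retrieval-and-prompt-assembly flow above can be sketched in plain Python. In production the linear scan becomes a Redis KNN query, and the embeddings come from your provider - this in-memory version just shows the shape of the logic:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def top_k(query_vec, chunks, k=3):
    """chunks: list of (text, source, embedding). Returns the k nearest."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query_vec, c[2]))
    return ranked[:k]

def build_prompt(question: str, results) -> str:
    """Assemble the grounded prompt: context first, instructions, question."""
    context = "\n\n".join(f"[{source}]\n{text}" for text, source, _ in results)
    return (
        "Answer using ONLY the context below. Cite the source file in "
        "brackets for each fact. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A call like `build_prompt(question, top_k(embed(question), stored_chunks))` produces the string you hand to the LLM, where `embed()` is a stand-in for your embedding provider.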
Wrap this in a REST API with at least two endpoints: one for ingesting documents and one for asking questions. A streaming endpoint for the chat response is a nice addition if your framework supports server-sent events.
Testing: Ask questions that you know the answers to based on your sample documentation:
Ask a question that’s directly covered in your docs. The response should be accurate, cite the correct source file, and not include information that isn’t in the docs.
Ask a question that isn’t covered at all. The system should tell you it doesn’t have enough information rather than hallucinating an answer.
Ask a question that spans multiple documents. The system should pull context from several sources.
Test with curl to verify your API endpoints work correctly.
Step 3
In this step your goal is to add semantic caching so that repeated or similar questions get instant answers without an LLM call.
Traditional caching uses exact string matches, which means “how do I authenticate?” and “what’s the authentication process?” would be treated as completely different queries. Semantic caching embeds the question and checks whether any previously cached question is close enough in vector space. If it is, the cached answer is served without touching the LLM at all.
You’ll need a separate Redis vector index for your cache entries. Each entry stores the original question, its embedding, and the generated response. When a new question comes in, search this cache index first. If the closest match is within your similarity threshold, return the cached response. Otherwise, proceed with the full RAG pipeline and cache the result afterwards.
Start with a similarity threshold of around 0.15 (cosine distance) and tune from there. Too strict and you’ll rarely get cache hits. Too loose and you’ll serve wrong answers for questions that are only loosely related.
Python developers can use RedisVL’s SemanticCache or LangCache which handle much of this for you. In other languages, it’s straightforward to build yourself - it’s just a vector index with a similarity check.
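Stripped to its essentials, a semantic cache is just stored (embedding, answer) pairs and a distance check. Here’s a toy sketch in plain Python - the 0.15 threshold mirrors the suggestion above, the embeddings are supplied by the caller, and in production the linear scan becomes a Redis vector-index search:

```python
import math

class SemanticCacheSketch:
    """Toy semantic cache using a linear scan over stored embeddings."""

    def __init__(self, threshold: float = 0.15):
        self.threshold = threshold  # max cosine distance for a hit
        self.entries = []           # (embedding, question, answer)

    def _distance(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)

    def get(self, query_emb):
        """Return a cached answer if the nearest stored question is
        within the threshold, otherwise None (a cache miss)."""
        best = min(self.entries,
                   key=lambda e: self._distance(query_emb, e[0]),
                   default=None)
        if best and self._distance(query_emb, best[0]) <= self.threshold:
            return best[2]
        return None

    def put(self, query_emb, question, answer):
        self.entries.append((query_emb, question, answer))
```

The full flow becomes: embed the question, call `get()`; on a miss, run the RAG pipeline and `put()` the result before returning it.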
Track your cache metrics: hit rate, average latency for cached versus uncached responses, and estimated cost savings. Store these counters in Redis using INCR so they persist across restarts. Expose them through a /metrics endpoint.
Testing:
Ask the same question twice. The second time should be noticeably faster and your metrics should show a cache hit.
Rephrase the question slightly (e.g. “how does auth work?” then “what’s the authentication process?”). If your threshold is tuned correctly, the second should also be a cache hit.
Ask a completely different question and verify it’s a cache miss.
Check your /metrics endpoint to see hit rate and latency comparisons.
Add a cache flush endpoint and verify that clearing the cache causes previously cached queries to miss again.
Step 4
In this step your goal is to add session memory so your assistant can handle follow-up questions within a conversation.
Without session memory, each question is treated in isolation. If a user asks “what’s the authentication flow?” and then follows up with “how do I refresh the token?”, the system has no idea what “the token” refers to. Session memory fixes this by maintaining conversation history.
Store conversation messages in Redis, keyed by session ID. Each time a user sends a message, append it to the session’s history. When building the prompt for the LLM, include the recent conversation messages so the model has context for follow-up questions. Redis lists or streams both work well for this.
Set a time-to-live on your sessions so they clean up automatically after a period of inactivity - 24 hours is a reasonable default.
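In Redis commands, the list-based version looks roughly like this (the key naming and message format are assumptions; the JSON bodies are illustrative):

```
RPUSH session:abc123 '{"role":"user","content":"what is the auth flow?"}'
RPUSH session:abc123 '{"role":"assistant","content":"The auth flow is..."}'
EXPIRE session:abc123 86400          # 24-hour TTL, refreshed on each write
LRANGE session:abc123 -10 -1         # last 10 messages for the prompt
```

Calling EXPIRE after every write resets the countdown, so the session only disappears after 24 hours of genuine inactivity.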
Testing:
Start a new session and ask a question about a specific topic in your docs.
Ask a follow-up question that relies on context from the first answer (e.g. use “it”, “that”, or “the same endpoint” to refer back). The assistant should understand what you’re referring to.
Start a different session and verify it has no memory of the first conversation.
Wait for the session TTL to expire (or set a short TTL for testing) and verify the session data is cleaned up from Redis.
Step 5
In this step your goal is to add long-term agent memory so your assistant remembers user context across sessions and uses it to personalise answers.
Session memory disappears when a session ends. Long-term memory persists. If a user tells the assistant “I’m working on the payments microservice in Go” in one session, the assistant should remember that context in future sessions and tailor its answers accordingly.
Set up the Redis Agent Memory Server as a Docker container alongside your Redis instance. The memory server provides a REST API for storing and searching memories, with built-in support for topic extraction, entity recognition, and semantic search over stored memories. It supports over 100 LLM providers via LiteLLM, so whatever model you’re using for your main application will work here too.
Integrate the memory server into your chat flow. After each conversation, extract any important context - user preferences, project details, technical decisions - and store it as a long-term memory. Before generating answers, search for relevant memories and include them in the prompt.
Your LLM prompt should now assemble context from three sources: document chunks from vector search, recent messages from the session, and relevant long-term memories. The memories help the assistant give more relevant answers - if the user has previously mentioned they use Python, documentation examples should lean towards Python where possible.
Testing:
In one session, tell the assistant about your project context (e.g. “I’m building a payment service in Go”).
End the session and start a new one. Ask a general question. The assistant’s answer should reflect your project context even though it’s a new session.
Search the memory server’s REST API directly to verify memories were stored with the correct topics and entities.
Ask the assistant what it knows about your project - it should surface relevant stored memories.
Step 6
In this step your goal is to add hybrid search and production hardening to make your system more robust and precise.
Pure vector search works well for general questions, but sometimes users want answers from a specific document or section. Hybrid search combines vector similarity with metadata filtering. For example, a user might ask “how does authentication work in the API reference?” - the vector search finds semantically relevant chunks, and the metadata filter narrows results to only the API reference document.
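With the index schema from Step 1, a hybrid query combines a TAG filter with KNN in a single FT.SEARCH call (field names are the assumed ones from earlier; the binary embedding placeholder must be filled in by your client library):

```
FT.SEARCH docs_idx "(@source:{api\-reference\.md})=>[KNN 5 @embedding $vec AS score]"
    PARAMS 2 vec "<binary query embedding>" DIALECT 2
```

The tag filter narrows the candidate set before the KNN ranking runs, so the five results returned are the nearest chunks from that one document only. Note DIALECT 2 is required for the KNN query syntax.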
You can also use context from long-term memory to apply filters automatically. If the assistant knows the user is working on authentication, it can prioritise chunks from authentication-related sections without being asked.
Add conversation summarisation to handle long sessions gracefully. When the conversation history grows beyond a token threshold, summarise the older messages and keep only the recent ones intact. This prevents your context window from overflowing while preserving important information from earlier in the conversation.
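One way to sketch the trimming logic - the token count is approximated by whitespace-separated words, and `summarise()` is a stub standing in for an LLM call:

```python
def summarise(messages: list[str]) -> str:
    """Stub: in a real system this is an LLM call that condenses
    the older messages into a short summary."""
    return f"[summary of {len(messages)} earlier messages]"

def compact_history(messages: list[str],
                    max_words: int = 200,
                    keep_recent: int = 4) -> list[str]:
    """If the history exceeds max_words, replace everything but the
    last keep_recent messages with a single summary message."""
    total = sum(len(m.split()) for m in messages)
    if total <= max_words or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarise(older)] + recent
```

Run this before every prompt build: short conversations pass through untouched, while long ones shrink to a summary plus the most recent turns, keeping the context window bounded.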
Finally, make your system degrade gracefully when non-critical components fail. If the semantic cache is unavailable, skip it and call the LLM directly. If the memory server is down, answer without long-term context. Only the vector search and LLM are truly essential - everything else should fail silently with appropriate logging.
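The degradation pattern is just a guarded call with a fallback. A sketch, where `cache_lookup` and `rag_answer` are whatever callables wrap your real cache and pipeline:

```python
import logging

logger = logging.getLogger("rtfm")

def answer_with_fallbacks(question, cache_lookup, rag_answer):
    """Try the semantic cache first; on any cache failure, log a warning
    and fall through to the full RAG pipeline. Only rag_answer is
    treated as essential."""
    try:
        cached = cache_lookup(question)
        if cached is not None:
            return cached
    except Exception:
        logger.warning("semantic cache unavailable, falling back to RAG")
    return rag_answer(question)
```

The same wrapper shape applies to the memory server: catch the failure, log it, and build the prompt without long-term memories rather than returning an error to the user.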
Testing:
Ask a question scoped to a specific document (e.g. “based on the getting started guide, how do I...”). Verify the results come only from that document.
Have a long conversation (15+ messages) and verify the system still responds correctly as older messages get summarised.
Stop the memory server container and verify the chat still works, just without personalisation.
Stop and restart the semantic cache and verify the system recovers gracefully.
Check your observability metrics: response latency, cache hit rate, token usage, and estimated cost.
Going Further
You’ve built a documentation assistant with RAG, semantic caching, and persistent memory. Here are some ways to push further:
Semantic routing: Classify incoming queries before processing them. Is it a documentation question, an off-topic chat, or a request for an action? Route each type differently.
Multi-tenant support: Scope all indexes, caches, and memories by organisation or team using Redis key prefixes, so multiple teams can share one deployment.
Document versioning: Track document versions and warn users when answers are based on outdated documentation.
MCP integration: Expose your assistant as an MCP server so other AI agents can use it as a tool. The Agent Memory Server already supports MCP natively.
Evaluation suite: Build a test harness that measures retrieval precision, answer accuracy, and cache effectiveness across a standard set of questions.
Real-time updates: Use Redis Pub/Sub or Streams to notify a frontend when document ingestion completes or new memories are created.
Multi-model strategy: Use a cheaper model for memory extraction and caching, and a more capable model for final answers. Redis doesn’t care which model generates the content it stores.
Web crawl: find and ingest documentation from the web.
This coding challenge was sponsored by Redis.
Help Others by Sharing Your Solutions!
If you think your solution is an example other developers can learn from, please share it - put it on GitHub, GitLab or elsewhere. Then let me know - ping me a message on the Discord Server, via Twitter or LinkedIn, or just post about it there and tag me. Alternatively, please add a link to it in the Coding Challenges Shared Solutions Github repo.
Get The Challenges By Email
If you would like to receive the coding challenges by email, you can subscribe to the weekly newsletter on SubStack here: