RAG Chatbot from indexed public documentation

On this page

This article documents the design and implementation of a domain-specific Retrieval-Augmented Generation (RAG) conversational assistant. The project was completed as the Week 15 Graded Mini Project of the IITM Pravartak Professional Certificate Programme in Agentic AI and Applications. The brief required the construction of an assistant that retrieves content from a private document set and produces answers grounded in that retrieved context, while refusing to answer when supporting evidence is absent.

The implementation selects the Technology domain and indexes six official pages of the GitHub REST API documentation. The assistant accepts free-form questions in a command-line interface, performs context-aware follow-up handling, returns concise answers with source attribution, and emits a fixed refusal phrase when the retrieved context does not contain the answer.

Architecture

The pipeline is split into two phases. An offline ingestion stage prepares a reusable vector index; a runtime stage answers user turns by retrieving from that index and prompting a large language model under strict grounding instructions.

RAG architecture for the GitHub REST API assistant, showing ingestion and runtime phases

Public Data Sources

Six public pages from the official GitHub Docs were saved locally as plain-text files in data/raw/:

Quickstart for GitHub REST API
Getting started with the REST API
Authenticating to the REST API
Rate limits for the REST API
Best practices for using the REST API
Troubleshooting the REST API

These pages were chosen because they form a self-contained operational knowledge set: a developer integrating with the API typically needs to consult exactly this material when handling authentication, rate-limit pressure, and recovery from common errors.

Components and Concepts

The following components and ideas underpin the implementation. Each is summarised briefly to make the design choices easy to evaluate.

Retrieval-Augmented Generation (RAG). A pattern in which a language model is grounded on retrieved passages from an external corpus rather than relying solely on parametric memory. Retrieval reduces hallucination and lets the assistant cite source material it actually consulted.

LangChain. The orchestration framework used throughout the project. It provides composable abstractions for document loaders, text splitters, embeddings, vector stores and chat models, which keeps the ingestion and runtime code small and idiomatic.

RAG Chunking. Long documents are split into smaller passages before embedding. This implementation uses RecursiveCharacterTextSplitter with a 900-character chunk size and a 150-character overlap. Splitting prefers semantic boundaries (paragraph breaks and Markdown headings) so that each chunk is internally coherent and retrieval surfaces meaningful units rather than fragments.

OpenAI Embeddings. Each chunk is converted into a high-dimensional vector using the text-embedding-3-small model. The vector captures the semantic content of the chunk so that questions phrased differently from the source text still retrieve the right passages.

FAISS. Facebook AI Similarity Search is the local vector store used to index the embeddings. The index is persisted to disk under vectorstore/github_rest_api_faiss so that ingestion runs once and the runtime simply loads the existing index, satisfying the rubric’s reusable-index criterion.

Top-k Retrieval. At each user turn the retriever returns the k=4 chunks most similar to the query. A small k keeps the prompt focused and reduces the chance that off-topic passages dilute the grounded answer.

Follow-up Rewriter. A short, dedicated LLM call rewrites the user’s latest question into a standalone retrieval query using recent conversation history. This is the mechanism that allows ambiguous follow-ups such as “what about rate-limit headers?” to be embedded and retrieved correctly without the model having to guess the missing antecedent.

Grounded Answerer. The final answer is produced by gpt-4o-mini at temperature zero under a system prompt that instructs the model to answer only from the retrieved context, append a Sources: line naming the document titles consulted, and otherwise return the exact refusal phrase.

Refusal Behaviour. When retrieval returns no usable context, the runtime short-circuits and emits the required string verbatim: I don't have enough information in the provided documents. The refusal is also enforced by the system prompt, providing defence in depth against hallucinated answers.

Implementation Layout

The submission is organised as two thin Python modules and a Jupyter notebook. ingest.py loads the text files, removes GitHub Docs templating syntax, extracts a human-readable title from each document, chunks the text, embeds the chunks, and writes the FAISS index. chatbot.py loads the persisted index, manages conversation history, invokes the rewriter and the grounded answerer in sequence, and prints the response with source attribution. The accompanying notebook imports the same functions so that the notebook and the CLI execute identical code paths, eliminating drift between the two surfaces.

Configuration is environment-driven. The OpenAI API key is read from a .env file and an optional OPENAI_BASE_URL is honoured so that the same code runs against the Vocareum-provided proxy without modification. No secrets are committed to source.

Sample Conversation

A short transcript illustrates the three behaviours the rubric tests — grounded answer, follow-up resolution, and refusal.

You: What headers should I send when making requests to the GitHub REST API?
Assistant: GitHub REST API requests commonly use an Accept header such as
application/vnd.github+json. Authenticated requests send an Authorization
header with Bearer YOUR-TOKEN. GitHub also recommends a valid User-Agent
value, such as your GitHub username or application name.
Sources: Getting started with the REST API; Authenticating to the REST API

You: What about rate limit headers?
Search query: rate limit response headers GitHub REST API
Assistant: Response headers can include x-ratelimit-remaining and
x-ratelimit-reset, which indicate how many requests remain in the current
period and when the rate limit resets.
Sources: Getting started with the REST API; Rate limits for the REST API

You: How should an integration handle rate limit errors?
Assistant: If a retry-after header is present, wait that many seconds before
retrying. If x-ratelimit-remaining is 0, wait until the time given by
x-ratelimit-reset. Otherwise, wait at least one minute and then use
exponentially increasing waits for repeated secondary rate-limit failures.
Continuing to make requests while rate-limited may result in a ban.
Sources: Best practices for using the REST API; Troubleshooting the REST API

The second exchange demonstrates the follow-up rewriter at work: the original “what about rate limit headers?” is expanded into a self-contained search query before retrieval, which is what allows the retriever to surface the correct passages despite the missing antecedent.

Now, a deliberately off-topic question — one the index cannot possibly support — triggers the safety path:

⛔ Refusal path — out-of-scope question
You: What is the best laptop to buy for running the GitHub REST API?
Assistant: I don't have enough information in the provided documents.

No chunk in the index supports an opinion on hardware, so the assistant returns the mandated refusal verbatim rather than producing a plausible-sounding but ungrounded answer. The refusal is enforced twice — once procedurally when retrieval returns empty, and once in the system prompt — so the behaviour is stable even when retrieval surfaces weakly related chunks.

Closing Observations

Three observations stand out from this exercise.

Retrieval quality is set upstream, not by the vector store. The quality of retrieval is determined less by the choice of vector store than by the upstream decisions about cleaning and chunking. Stripping GitHub Docs templating syntax — the {% data ... %} Liquid blocks, the reusables, and the [AUTOTITLE] link macros — materially improved the relevance of returned chunks because the embeddings stopped clustering around boilerplate.
Follow-up handling must be engineered, not assumed. It is not a free behaviour of the language model. A separate rewriter step that converts conversational queries into standalone search queries is the smallest reliable mechanism, and it removes a category of retrieval failures that would otherwise be invisible in casual testing.
Refusal is a feature, not a fallback. Enforcing the refusal both procedurally (short-circuit on empty context) and in the system prompt is what gives the assistant a useful posture in production: it will say nothing it cannot support, and it will say so in a predictable way.