
In this article, we’ll walk through the process of building a Retrieval-Augmented Generation (RAG) based Command-Line Interface (CLI) Chatbot System using the LlamaIndex framework.
This system allows users to interact with a chatbot that can answer questions based on a corpus of documents, utilizing advanced natural language processing techniques.
Table of Contents
- Introduction to RAG and LlamaIndex
- System Architecture
- Setting Up the Environment
- Document Processing
- Indexing with Qdrant
- Implementing the Chatbot
- Main Application Logic
- Customization and Extension
- Conclusion
1. Introduction to RAG and LlamaIndex

Retrieval-Augmented Generation (RAG) is a powerful technique that combines the strengths of retrieval-based and generative AI models. In a RAG system, relevant information is first retrieved from a knowledge base, and then a language model uses this information to generate more accurate and contextually relevant responses.
LlamaIndex is a data framework designed to help developers build RAG systems. It provides tools for ingesting, structuring, and accessing data in LLM applications. LlamaIndex simplifies the process of connecting large language models with external data sources, making it easier to create context-aware AI applications.
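To make this concrete, the snippet below shows the typical LlamaIndex flow in its simplest form: load documents, build a vector index, and query it. This is a generic sketch using LlamaIndex defaults (which assume an LLM API key is available in your environment), not the project code; our system follows the same pattern but wires in its own embedding model, LLM, and vector store.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load files from a local folder into Document objects
documents = SimpleDirectoryReader("data").load_data()

# Embed the documents and build an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Ask a question; relevant chunks are retrieved and passed to the LLM
response = index.as_query_engine().query("What is this document about?")
print(response)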
2. System Architecture
Our RAG-based CLI Chatbot System consists of several key components:
project_structure/
├── Dockerfile
├── requirements.txt
├── README.md
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── document_processor.py
│   ├── indexer.py
│   ├── chatbot.py
│   └── config.py
└── data/
    └── sample_document.txt
- Document Processor: Responsible for loading documents from various sources.
- Indexer: Creates and manages the vector index using Qdrant.
- Chatbot: Implements the chat engine with memory and custom prompts.
- Main Application: Orchestrates the entire system and provides the CLI interface.
The system uses a Hugging Face embedding model for document encoding and Gemini as the language model for generating responses. Qdrant serves as the vector database for efficient similarity search.
3. Setting Up the Environment
Let’s start by setting up our project environment. We’ll use a requirements.txt file to manage our dependencies:
llama-index
sentence-transformers
llama-index-llms-huggingface
accelerate
bitsandbytes
llama-index-readers-web
llama-index-embeddings-huggingface
llama-index-llms-gemini
google-generativeai
llama-index-vector-stores-qdrant
qdrant-client
These dependencies include LlamaIndex and its various components, as well as the necessary libraries for our embedding model, language model, and vector database.
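One additional setup step: because we use Gemini as the language model, a Google AI API key must be available before the LLM is initialized. The Gemini integration typically picks it up from the GOOGLE_API_KEY environment variable; the snippet below is a minimal illustration (the key value is a placeholder), and exporting the variable in your shell or Dockerfile works just as well.

import os

# Make the Gemini API key available to the LLM integration.
# Replace the placeholder with your own key, or export GOOGLE_API_KEY in your shell.
os.environ["GOOGLE_API_KEY"] = "your-api-key-here"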
4. Document Processing
The document processing module is responsible for loading documents into our system. We’ll implement two methods: one for loading documents from web URLs and another for loading documents from a local directory.
# src/document_processor.py
from llama_index.readers.web import BeautifulSoupWebReader
from llama_index.core import SimpleDirectoryReader

from config import DATA_DIR

def load_documents_from_web(urls):
    # Scrape and parse the given URLs into LlamaIndex Document objects
    return BeautifulSoupWebReader().load_data(urls)

def load_documents_from_directory():
    # Load every file from the configured local data directory
    return SimpleDirectoryReader(DATA_DIR).load_data()
The load_documents_from_web function uses LlamaIndex’s BeautifulSoupWebReader to scrape and process web pages. The load_documents_from_directory function uses SimpleDirectoryReader to load documents from a local directory specified in the configuration.
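As a quick sanity check, you can call either loader directly and inspect the result; both return a list of LlamaIndex Document objects. The URL below is only a placeholder.

from document_processor import load_documents_from_web, load_documents_from_directory

# Load everything from the local data directory configured in config.py
docs = load_documents_from_directory()
print(f"Loaded {len(docs)} document(s) from disk")

# Or scrape one or more web pages (placeholder URL)
web_docs = load_documents_from_web(["https://www.example.com"])
print(f"Loaded {len(web_docs)} document(s) from the web")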
5. Indexing with Qdrant
The indexing module is crucial for creating and managing our vector index using Qdrant. We’ll use an in-memory Qdrant client for simplicity and ease of setup.
# src/indexer.py
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from qdrant_client import QdrantClient

from config import COLLECTION_NAME, EMBEDDING_MODEL

def create_index(documents, llm):
    # In-memory Qdrant instance: fast to set up, but vectors are lost on exit
    client = QdrantClient(":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

    # Embedding model used to encode both documents and queries
    embed_model = HuggingFaceEmbedding(model_name=EMBEDDING_MODEL)

    # Bundle the LLM and embedding model used for indexing and querying
    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
    )

    # Route the index's storage to the Qdrant vector store
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
        storage_context=storage_context,
    )
    return index
Let’s break down this process:
- We initialize an in-memory Qdrant client with QdrantClient(":memory:").
- We create a QdrantVectorStore using this client, specifying a collection name.
- We initialize our embedding model (Hugging Face’s “sentence-transformers/all-MiniLM-L6-v2”).
- We create a ServiceContext that combines our language model (passed as an argument) and the embedding model.
- We wrap the vector store in a StorageContext so the index writes its vectors to Qdrant.
- Finally, we create and return a VectorStoreIndex from our documents, using the service context and storage context.
This index allows for efficient similarity search when querying our documents.
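The constants imported above (COLLECTION_NAME, EMBEDDING_MODEL), along with DATA_DIR from the previous section, live in config.py, which is not shown elsewhere in this article. A minimal version could look like the following; the collection name and data path are illustrative choices, while the embedding model name matches the one mentioned above.

# src/config.py
# Central place for settings shared across modules.

DATA_DIR = "data"  # local folder with documents to index (illustrative path)
COLLECTION_NAME = "documents"  # illustrative Qdrant collection name
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Hugging Face embedding model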
6. Implementing the Chatbot
The chatbot module implements the core logic for our conversational interface. It uses LlamaIndex’s CondensePlusContextChatEngine for generating responses and includes a custom prompt template.
# src/chatbot.py
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core import PromptTemplate

def create_chat_engine(index, llm):
    # Conversation history buffer, capped at 4500 tokens
    memory = ChatMemoryBuffer.from_defaults(token_limit=4500)

    # Custom prompt: inject retrieved context and end every answer with a fixed sign-off
    template = (
        "We have provided context information below. \n"
        "---------------------\n"
        "{context_str}"
        "\n---------------------\n"
        "Given this information, please answer the question and each answer should end with Thank You for asking: {query_str}\n"
    )
    qa_template = PromptTemplate(template)

    chat_engine = CondensePlusContextChatEngine.from_defaults(
        index.as_retriever(),
        memory=memory,
        llm=llm,
        text_qa_template=qa_template,
    )
    return chat_engine
Key components of this module:
- We create a ChatMemoryBuffer to store conversation history, limited to 4500 tokens.
- We define a custom prompt template that includes context information and ensures each response ends with “Thank You for asking.”
- We create and return a CondensePlusContextChatEngine, which combines the index’s retrieval capabilities, our memory buffer, the language model, and our custom prompt template.
This chat engine will generate responses based on retrieved context and maintain conversation history.
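Because the engine keeps history in the memory buffer, follow-up questions can refer back to earlier turns. Here is a brief illustration, assuming an index and llm have already been created as in the previous sections; the questions themselves are placeholders.

chat_engine = create_chat_engine(index, llm)

# First turn establishes some context
print(chat_engine.chat("What topics does the sample document cover?"))

# Second turn: "it" is resolved using the stored conversation history
print(chat_engine.chat("Can you summarize it in one sentence?"))

# Clear the memory buffer to start a fresh conversation
chat_engine.reset()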
7. Main Application Logic
The main application ties all these components together and provides the CLI interface for user interaction.
# src/main.py
from llama_index.llms.gemini import Gemini

from document_processor import load_documents_from_web, load_documents_from_directory
from indexer import create_index
from chatbot import create_chat_engine

def main():
    # Initialize LLM
    llm = Gemini()

    # Load documents (you can switch between web and directory loading)
    # documents = load_documents_from_web(["https://www.example.com"])
    documents = load_documents_from_directory()

    # Create index
    index = create_index(documents, llm)

    # Create chat engine
    chat_engine = create_chat_engine(index, llm)

    print("Chatbot is ready. Type 'exit' to quit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            break
        response = chat_engine.chat(user_input)
        print(f"Chatbot: {response}")

if __name__ == "__main__":
    main()
This main script:
- Initializes the Gemini language model.
- Loads documents (from a directory in this case, but web loading is also available).
- Creates the vector index using our loaded documents.
- Initializes the chat engine.
- Enters a loop for user interaction, processing inputs and generating responses until the user types ‘exit’.
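One optional refinement: LlamaIndex chat engines also expose a stream_chat method, so the CLI can print the answer token by token instead of waiting for the full response. A minimal sketch of the modified loop body is shown below; it assumes the same chat_engine as above.

# Inside the while loop, replacing the chat() call:
streaming_response = chat_engine.stream_chat(user_input)
print("Chatbot: ", end="", flush=True)
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()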
8. Customization and Extension
Our system is designed to be easily customizable and extensible. Here are some ways you can modify it:
- Changing the embedding model: Update the EMBEDDING_MODEL variable in config.py.
- Using a different language model: Modify the llm initialization in main.py.
- Switching to web document loading: Uncomment the relevant line in main.py.
- Persistent storage: Replace the in-memory Qdrant client with a connection to a persistent Qdrant server (see the sketch after this list).
- Adding new document sources: Extend the document_processor.py module with new loading functions.
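For the persistent-storage option mentioned above, the only change needed is how the Qdrant client is constructed in indexer.py. Here is a minimal sketch, assuming a Qdrant server is reachable on its default port; the URL and API key in the commented line are placeholders.

from qdrant_client import QdrantClient

# Connect to a running Qdrant server instead of the in-memory instance
client = QdrantClient(host="localhost", port=6333)

# Or, for a remote or managed Qdrant deployment (placeholders):
# client = QdrantClient(url="https://your-qdrant-instance:6333", api_key="your-api-key")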
9. Conclusion
GitHub Link: RAG With LlamaIndex & Qdrant
We’ve built a powerful RAG-based CLI Chatbot System using LlamaIndex, combining advanced NLP techniques with efficient vector search. This system demonstrates how to create a context-aware chatbot that can answer questions based on a large corpus of documents.
The modular design allows for easy maintenance and extensibility, while the use of Docker ensures consistent deployment across different environments. By leveraging the power of retrieval-augmented generation, this chatbot can provide more accurate and contextually relevant responses compared to traditional chatbots.
As AI and NLP technologies continue to evolve, systems like this will become increasingly important for creating intelligent, context-aware applications that can understand and interact with large amounts of information.