
In this article, we’ll walk through the process of building a Retrieval-Augmented Generation (RAG) based Command-Line Interface (CLI) Chatbot System using the LlamaIndex framework.
This system allows users to interact with a chatbot that can answer questions based on a corpus of documents, utilizing advanced natural language processing techniques.
Table of Contents
- Introduction to RAG and LlamaIndex
- System Architecture
- Setting Up the Environment
- Document Processing
- Indexing with Qdrant
- Implementing the Chatbot
- Main Application Logic
- Customization and Extension
- Conclusion
1. Introduction to RAG and LlamaIndex

Retrieval-Augmented Generation (RAG) is a powerful technique that combines the strengths of retrieval-based and generative AI models. In a RAG system, relevant information is first retrieved from a knowledge base, and then a language model uses this information to generate more accurate and contextually relevant responses.
LlamaIndex is a data framework designed to help developers build RAG systems. It provides tools for ingesting, structuring, and accessing data in LLM applications. LlamaIndex simplifies the process of connecting large language models with external data sources, making it easier to create context-aware AI applications.
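To make this concrete, the snippet below shows the typical LlamaIndex flow in its simplest form: load documents, build a vector index, and query it. This is a generic sketch using LlamaIndex defaults (which assume an LLM API key is available in your environment), not the project code; our system follows the same pattern but wires in its own embedding model, LLM, and vector store.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load files from a local folder into Document objects
documents = SimpleDirectoryReader("data").load_data()

# Embed the documents and build an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Ask a question; relevant chunks are retrieved and passed to the LLM
response = index.as_query_engine().query("What is this document about?")
print(response)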
2. System Architecture
Our RAG-based CLI Chatbot System consists of several key components:
project_structure/
├── Dockerfile
├── requirements.txt
├── README.md
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── document_processor.py
│   ├── indexer.py
│   ├── chatbot.py
│   └── config.py
└── data/
    └── sample_document.txt
- Document Processor: Responsible for loading documents from various sources.
- Indexer: Creates and manages the vector index using Qdrant.
- Chatbot: Implements the chat engine with memory and custom prompts.
- Main Application: Orchestrates the entire system and provides the CLI interface.
The system uses a Hugging Face embedding model for document encoding and Gemini as the language model for generating responses. Qdrant serves as the vector database for efficient similarity search.
3. Setting Up the Environment
Let’s start by setting up our project environment. We’ll use a requirements.txt file to manage our dependencies:
llama-index
sentence-transformers
llama-index-llms-huggingface
accelerate
bitsandbytes
llama-index-readers-web
llama-index-embeddings-huggingface
llama-index-llms-gemini
google-generativeai
llama-index-vector-stores-qdrant
qdrant-client
These dependencies include LlamaIndex and its various components, as well as the necessary libraries for our embedding model, language model, and vector database.
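One additional setup step: because we use Gemini as the language model, a Google AI API key must be available before the LLM is initialized. The Gemini integration typically picks it up from the GOOGLE_API_KEY environment variable; the snippet below is a minimal illustration (the key value is a placeholder), and exporting the variable in your shell or Dockerfile works just as well.

import os

# Make the Gemini API key available to the LLM integration.
# Replace the placeholder with your own key, or export GOOGLE_API_KEY in your shell.
os.environ["GOOGLE_API_KEY"] = "your-api-key-here"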
4. Document Processing
The document processing module is responsible for loading documents into our system. We’ll implement two methods: one for loading documents from web URLs and another for loading documents from a local directory.
# src/document_processor.py
from llama_index.readers.web import BeautifulSoupWebReader
from llama_index.core import SimpleDirectoryReader

from config import DATA_DIR

def load_documents_from_web(urls):
    # Scrape and parse the given URLs into LlamaIndex Document objects
    return BeautifulSoupWebReader().load_data(urls)

def load_documents_from_directory():
    # Load every file from the configured local data directory
    return SimpleDirectoryReader(DATA_DIR).load_data()
The load_documents_from_web function uses LlamaIndex’s BeautifulSoupWebReader to scrape and process web pages. The load_documents_from_directory function uses SimpleDirectoryReader to load documents from a local directory specified in the configuration.
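As a quick sanity check, you can call either loader directly and inspect the result; both return a list of LlamaIndex Document objects. The URL below is only a placeholder.

from document_processor import load_documents_from_web, load_documents_from_directory

# Load everything from the local data directory configured in config.py
docs = load_documents_from_directory()
print(f"Loaded {len(docs)} document(s) from disk")

# Or scrape one or more web pages (placeholder URL)
web_docs = load_documents_from_web(["https://www.example.com"])
print(f"Loaded {len(web_docs)} document(s) from the web")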
5. Indexing with Qdrant
The indexing module is crucial for creating and managing our vector index using Qdrant. We’ll use an in-memory Qdrant client for simplicity and ease of setup.
# src/indexer.py
from llama_index.core import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from qdrant_client import QdrantClient

from config import COLLECTION_NAME, EMBEDDING_MODEL

def create_index(documents, llm):
    # In-memory Qdrant instance: fast to set up, but vectors are lost on exit
    client = QdrantClient(":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name=COLLECTION_NAME)

    # Embedding model used to encode both documents and queries
    embed_model = HuggingFaceEmbedding(model_name=EMBEDDING_MODEL)

    # Bundle the LLM and embedding model used for indexing and querying
    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
    )

    # Route the index's storage to the Qdrant vector store
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex.from_documents(
        documents,
        service_context=service_context,
        storage_context=storage_context,
    )
    return index
Let’s break down this process:
- We initialize an in-memory Qdrant client with QdrantClient(":memory:").
- We create a QdrantVectorStore using this client, specifying a collection name.
- We initialize our embedding model (Hugging Face’s “sentence-transformers/all-MiniLM-L6-v2”).
- We create a ServiceContext that combines our language model (passed as an argument) and the embedding model.
- We wrap the vector store in a StorageContext so the index writes its vectors to Qdrant.
- Finally, we create and return a VectorStoreIndex from our documents, using the service context and storage context.
This index allows for efficient similarity search when querying our documents.
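The constants imported above (COLLECTION_NAME, EMBEDDING_MODEL), along with DATA_DIR from the previous section, live in config.py, which is not shown elsewhere in this article. A minimal version could look like the following; the collection name and data path are illustrative choices, while the embedding model name matches the one mentioned above.

# src/config.py
# Central place for settings shared across modules.

DATA_DIR = "data"  # local folder with documents to index (illustrative path)
COLLECTION_NAME = "documents"  # illustrative Qdrant collection name
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Hugging Face embedding model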
6. Implementing the Chatbot
The chatbot module implements the core logic for our conversational interface. It uses LlamaIndex’s CondensePlusContextChatEngine for generating responses and includes a custom prompt template.
# src/chatbot.py
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core import PromptTemplate

def create_chat_engine(index, llm):
    # Conversation history buffer, capped at 4500 tokens
    memory = ChatMemoryBuffer.from_defaults(token_limit=4500)

    # Custom prompt: inject retrieved context and end every answer with a fixed sign-off
    template = (
        "We have provided context information below. \n"
        "---------------------\n"
        "{context_str}"
        "\n---------------------\n"
        "Given this information, please answer the question and each answer should end with Thank You for asking: {query_str}\n"
    )
    qa_template = PromptTemplate(template)

    chat_engine = CondensePlusContextChatEngine.from_defaults(
        index.as_retriever(),
        memory=memory,
        llm=llm,
        text_qa_template=qa_template,
    )
    return chat_engine
Key components of this module:
- We create a ChatMemoryBuffer to store conversation history, limited to 4500 tokens.
- We define a custom prompt template that includes context information and ensures each response ends with “Thank You for asking.”
- We create and return a CondensePlusContextChatEngine, which combines the index’s retrieval capabilities, our memory buffer, the language model, and our custom prompt template.
This chat engine will generate responses based on retrieved context and maintain conversation history.
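Because the engine keeps history in the memory buffer, follow-up questions can refer back to earlier turns. Here is a brief illustration, assuming an index and llm have already been created as in the previous sections; the questions themselves are placeholders.

chat_engine = create_chat_engine(index, llm)

# First turn establishes some context
print(chat_engine.chat("What topics does the sample document cover?"))

# Second turn: "it" is resolved using the stored conversation history
print(chat_engine.chat("Can you summarize it in one sentence?"))

# Clear the memory buffer to start a fresh conversation
chat_engine.reset()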
7. Main Application Logic
The main application ties all these components together and provides the CLI interface for user interaction.
# src/main.py
from llama_index.llms.gemini import Gemini

from document_processor import load_documents_from_web, load_documents_from_directory
from indexer import create_index
from chatbot import create_chat_engine

def main():
    # Initialize LLM
    llm = Gemini()

    # Load documents (you can switch between web and directory loading)
    # documents = load_documents_from_web(["https://www.example.com"])
    documents = load_documents_from_directory()

    # Create index
    index = create_index(documents, llm)

    # Create chat engine
    chat_engine = create_chat_engine(index, llm)

    print("Chatbot is ready. Type 'exit' to quit.")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            break
        response = chat_engine.chat(user_input)
        print(f"Chatbot: {response}")

if __name__ == "__main__":
    main()
This main script:
- Initializes the Gemini language model.
- Loads documents (from a directory in this case, but web loading is also available).
- Creates the vector index using our loaded documents.
- Initializes the chat engine.
- Enters a loop for user interaction, processing inputs and generating responses until the user types ‘exit’.
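One optional refinement: LlamaIndex chat engines also expose a stream_chat method, so the CLI can print the answer token by token instead of waiting for the full response. A minimal sketch of the modified loop body is shown below; it assumes the same chat_engine as above.

# Inside the while loop, replacing the chat() call:
streaming_response = chat_engine.stream_chat(user_input)
print("Chatbot: ", end="", flush=True)
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()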
8. Customization and Extension
Our system is designed to be easily customizable and extensible. Here are some ways you can modify it:
- Changing the embedding model: Update the EMBEDDING_MODEL variable in config.py.
- Using a different language model: Modify the llm initialization in main.py.
- Switching to web document loading: Uncomment the relevant line in main.py.
- Persistent storage: Replace the in-memory Qdrant client with a connection to a persistent Qdrant server (see the sketch after this list).
- Adding new document sources: Extend the document_processor.py module with new loading functions.
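For the persistent-storage option mentioned above, the only change needed is how the Qdrant client is constructed in indexer.py. Here is a minimal sketch, assuming a Qdrant server is reachable on its default port; the URL and API key in the commented line are placeholders.

from qdrant_client import QdrantClient

# Connect to a running Qdrant server instead of the in-memory instance
client = QdrantClient(host="localhost", port=6333)

# Or, for a remote or managed Qdrant deployment (placeholders):
# client = QdrantClient(url="https://your-qdrant-instance:6333", api_key="your-api-key")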
9. Conclusion
GitHub Link: RAG With LlamaIndex & Qdrant
We’ve built a powerful RAG-based CLI Chatbot System using LlamaIndex, combining advanced NLP techniques with efficient vector search. This system demonstrates how to create a context-aware chatbot that can answer questions based on a large corpus of documents.
The modular design allows for easy maintenance and extensibility, while the use of Docker ensures consistent deployment across different environments. By leveraging the power of retrieval-augmented generation, this chatbot can provide more accurate and contextually relevant responses compared to traditional chatbots.
As AI and NLP technologies continue to evolve, systems like this will become increasingly important for creating intelligent, context-aware applications that can understand and interact with large amounts of information.