Host Your Own LLM Using Gradio and Ollama

In this tutorial, we’ll walk through the process of creating a chat interface using Gradio and Ollama to interact with a large language model (LLM).

Gradio is a Python library that allows you to quickly create user interfaces for machine learning models, while Ollama is a tool that simplifies the deployment and management of LLMs.

In the previous article, Deploy LLM in HuggingFace Spaces For Free Using Ollama, we deployed an LLM on Hugging Face Spaces using Ollama and FastAPI. This time, we will use Gradio instead. One advantage of Gradio is easy deployment: a single command is enough to deploy your LLM to the cloud, which we will cover later in the article. Another advantage is its built-in API. One disadvantage is that the setup is somewhat more complex than before.

By the end of this tutorial, you’ll have a fully functional web-based chat interface that can generate responses using an LLM.

Prerequisites

Before we begin, make sure you have the following installed:

  1. Python 3.7 or higher
  2. Gradio: Install it using pip:
   pip install gradio
  3. Ollama: We’ll set up Ollama during the tutorial.
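
The scripts in this tutorial also use the requests library to talk to the Ollama server, so install it alongside Gradio if you don't already have it:

   pip install gradio requests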

Step 1: Setting Up Ollama

Ollama is a tool that allows you to easily run large language models locally. In this tutorial, we’ll use Ollama to serve the Qwen 2.5 1.5B model.

1.1 Download and Extract Ollama

First, we need to download and extract the Ollama binary. We’ll do this programmatically using Python:

import subprocess
import os
import shutil
import stat
import time
import requests

def setup_ollama():
    try:
        print("Starting Ollama setup process...")

        # Download Ollama tarball
        tarball_url = "https://ollama.com/download/ollama-linux-amd64.tgz"
        print(f"Downloading Ollama from {tarball_url}")
        subprocess.run(f"curl -L {tarball_url} -o ollama.tgz", shell=True, check=True)

        if not os.path.exists("ollama.tgz"):
            raise Exception("Failed to download ollama.tgz")

        # Create directory and extract
        print("Creating directory and extracting tarball...")
        if os.path.exists("ollama_dir"):
            shutil.rmtree("ollama_dir")
        os.makedirs("ollama_dir", exist_ok=True)

        # Extract with verbose output
        result = subprocess.run(
            "tar -xvzf ollama.tgz -C ollama_dir",
            shell=True,
            capture_output=True,
            text=True
        )
        print("Tar command output:", result.stdout)
        if result.stderr:
            print("Tar command errors:", result.stderr)

        # Check extraction
        print("Extracted directory contents:")
        subprocess.run("ls -la ollama_dir", shell=True)

        # Make sure ollama is executable
        ollama_path = "./ollama_dir/bin/ollama"
        if not os.path.exists(ollama_path):
            raise Exception(f"Ollama binary not found at {ollama_path}")

        os.chmod(ollama_path, stat.S_IRWXU)

        # Set LD_LIBRARY_PATH to include the lib directory
        lib_path = os.path.abspath("./ollama_dir/lib/ollama")
        os.environ["LD_LIBRARY_PATH"] = f"{lib_path}:{os.environ.get('LD_LIBRARY_PATH', '')}"

        # Start Ollama server
        print("Starting Ollama server...")
        ollama_process = subprocess.Popen(
            [ollama_path, "serve"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            env=os.environ
        )

        # Wait for Ollama to start
        max_attempts = 30
        for attempt in range(max_attempts):
            try:
                response = requests.get("http://localhost:11434/api/tags")
                print(f"Ollama server is running (attempt {attempt + 1})")
                break
            except Exception as e:
                print(f"Waiting for Ollama server... (attempt {attempt + 1}/{max_attempts})")
                # Check if process is still running
                if ollama_process.poll() is not None:
                    stdout, stderr = ollama_process.communicate()
                    print("Ollama process terminated unexpectedly")
                    print("stdout:", stdout.decode() if stdout else "None")
                    print("stderr:", stderr.decode() if stderr else "None")
                    raise Exception("Ollama process terminated unexpectedly")
                time.sleep(1)

        # Pull the model
        print("Pulling Qwen model...")
        subprocess.run([ollama_path, "pull", "qwen2.5:1.5b"], check=True)

        return ollama_process
    except Exception as e:
        print(f"Error setting up Ollama: {str(e)}")
        # Print stack trace
        import traceback
        traceback.print_exc()
        return None

We aren’t installing Ollama directly using the usual Linux command:

curl -fsSL https://ollama.com/install.sh | sh

We used the above command in the previous article, but this time we download the Ollama archive manually. The reason is that we can’t supply our own Dockerfile, since Gradio has its own built-in Docker setup; instead, we use Python code to download and extract Ollama, start the server, and pull the model.

1.2 Start the Ollama Server

The setup_ollama() function will download, extract, and start the Ollama server. It will also pull the Qwen 2.5 model, which we’ll use for generating responses.
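
Once setup_ollama() returns, it’s worth sanity-checking that the server and model are reachable before wiring anything into Gradio. A minimal check, assuming the default Ollama port 11434 and the qwen2.5:1.5b tag pulled above:

import requests

# Ask the Ollama server which models it currently has available
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
print("Available models:", [m["name"] for m in tags.get("models", [])])

# Run one non-streaming generation to confirm the model responds
reply = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:1.5b", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
).json()
print(reply["response"])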

Step 2: Creating the Gradio Interface

Now that Ollama is set up, we’ll create a Gradio interface to interact with the LLM.

2.1 Define the Query Function

We’ll define a function query_ollama() that sends a prompt to the Ollama server and streams the response back to the user.

import requests
import json
import hashlib
from functools import lru_cache


def query_ollama(prompt):
    try:
        # Check cache first
        cached_response = get_cached_response(prompt)
        if cached_response is not None:
            print("Using cached response")
            yield cached_response
            return

        # Initialize response text
        response_text = ""
        buffer = ""
        chunk_size = 0

        # Make streaming request to Ollama
        response = requests.post(
            'http://localhost:11434/api/generate',
            json={
                'model': 'qwen2.5:1.5b',
                'prompt': prompt,
                'stream': True,
                # Generation options must be nested under "options" for Ollama
                'options': {
                    'num_ctx': 2048,      # context window size
                    'num_predict': 1000,  # maximum tokens to generate
                },
            },
            stream=True
        )

        # Process the stream
        for line in response.iter_lines():
            if line:
                json_response = json.loads(line)
                if 'response' in json_response:
                    chunk = json_response['response']
                    buffer += chunk
                    chunk_size += len(chunk)

                    # Only yield when buffer reaches certain size or on completion
                    if chunk_size >= 20 or json_response.get('done', False):
                        response_text += buffer
                        yield response_text
                        buffer = ""
                        chunk_size = 0

                if json_response.get('done', False):
                    # Yield any remaining buffer
                    if buffer:
                        response_text += buffer
                        yield response_text
                    # Save the complete response to cache
                    save_to_cache(prompt, response_text)
                    break

    except Exception as e:
        yield f"Error: {str(e)}"

If you want to understand the code properly, ignore the caching calls for now and focus on the requests.post call. The Ollama server listens on port 11434, so we send a POST request to localhost:11434 to query the Qwen 2.5 1.5B model and get the response.
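
The caching helpers get_cached_response() and save_to_cache() are not shown above. Here is a minimal sketch of what they might look like, assuming a simple on-disk cache keyed by an MD5 hash of the prompt (the full code linked at the end of the article may implement this differently):

import os
import hashlib

CACHE_DIR = "response_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def _cache_path(prompt):
    # One file per prompt, named after the prompt's MD5 hash
    return os.path.join(CACHE_DIR, hashlib.md5(prompt.encode("utf-8")).hexdigest() + ".txt")

def get_cached_response(prompt):
    # Return the cached response text, or None if this prompt hasn't been seen before
    path = _cache_path(prompt)
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    return None

def save_to_cache(prompt, response_text):
    # Persist the complete response so repeated prompts are answered instantly
    with open(_cache_path(prompt), "w", encoding="utf-8") as f:
        f.write(response_text)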

2.2 Create the Gradio Interface

Gradio Chat Interface

Next, we’ll create the Gradio interface using the query_ollama() function.

import gradio as gr

def create_interface():
    iface = gr.Interface(
        fn=query_ollama,
        inputs=gr.Textbox(
            label="Prompt",
            placeholder="Enter your prompt here...",
            lines=3
        ),
        outputs=gr.Textbox(
            label="Response",
            lines=5,
            show_copy_button=True
        ),
        title="Qwen 2.5 Chat Interface",
        description="Chat with Qwen 2.5 model using Ollama backend",
        examples=[
            ["Tell me a short story about space exploration"],
            ["Explain how photosynthesis works"],
            ["Write a haiku about artificial intelligence"]
        ],
        cache_examples=False,  # Disable Gradio's example caching
        examples_per_page=10
    )
    return iface

2.3 Launch the Interface

Finally, we’ll launch the Gradio interface.

if __name__ == "__main__":
    print("Starting Ollama setup...")
    # Setup Ollama
    ollama_process = setup_ollama()

    if ollama_process:
        try:
            print("Starting Gradio interface...")
            # Launch Gradio interface
            iface = create_interface()
            iface.queue()
            iface.launch(
                server_name="0.0.0.0",
                server_port=7860,
                share=False,
                root_path="",
                show_error=True  # Show detailed error messages
            )
        finally:
            print("Cleaning up...")
            # Clean up Ollama
            ollama_process.terminate()
            ollama_process.wait()
            subprocess.run("rm -rf ollama_dir ollama.tgz", shell=True)
            # Don't clean up cache directory to persist cached responses
    else:
        print("Failed to start Ollama server")

Step 3: Running the Application

To run the application, simply execute the Python script:

python app.py

This will start the Ollama server and launch the Gradio interface. You can access the interface by navigating to http://localhost:7860 in your web browser.

Step 4: Deploy on Huggingface Spaces

You can easily deploy your code to the cloud (in our case, Hugging Face Spaces) with a single command:

gradio deploy

Performance

The performance is fast enough, but keep in mind that we are using a 1.5-billion-parameter model, so it won’t be able to handle complex tasks.

In the example below, we used Gradio’s default API, and the 282-word output was generated in around one minute.

ollama output in huggingface space
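
Gradio’s built-in API can also be called from Python with the gradio_client package (pip install gradio_client). A minimal sketch, assuming the app exposes the default /predict endpoint of a gr.Interface (check your app’s “Use via API” link for the exact endpoint name):

from gradio_client import Client

# Point the client at your Space URL or the local Gradio app
client = Client("http://localhost:7860")

# gr.Interface exposes a /predict endpoint by default
result = client.predict("Explain how photosynthesis works", api_name="/predict")
print(result)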

It is possible to get faster performance by using smaller models like Qwen 0.5B or TinyLlama; however, this reduces their ability to handle complex tasks.

Full code: Qwen Space

Conclusion

In this tutorial, we’ve built a chat interface using Gradio and Ollama to interact with a large language model. We’ve covered how to set up Ollama, create a Gradio interface, and handle streaming responses from the LLM. This setup can be extended to include more models, additional features, or even deployed as a web service.

Feel free to experiment with different models and prompts, and customize the interface to suit your needs. Happy coding!