How to host your own ChatGPT-like model?

Kevin Naidoo - Feb 9 - Dev Community

Want to run a model similar to ChatGPT on your own infrastructure?

With a huge push to build open-source models, Mixtral 8x7B is one of the best models available for free. It is also relatively efficient to run for its capability, since its mixture-of-experts design only activates a fraction of its parameters for each token.

Although Mixtral is not as powerful as ChatGPT, it is still powerful enough for most generation tasks. I use Mixtral for classifying products, labeling, and generating descriptions.

Setting up

You can probably get away with a decent-sized VPS or dedicated server, but I suggest getting a GPU box. These can be expensive; however, providers like Hetzner offer GPU servers for under $150 per month.

First things first, you will want to set up the NVIDIA graphics drivers and CUDA. These instructions are for Ubuntu 22.04 and may not work as-is on other Ubuntu versions.

# Add the graphics drivers PPA and install the headless NVIDIA driver
sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update -y
sudo apt-get install linux-headers-$(uname -r)
sudo ubuntu-drivers install --gpgpu

# Add NVIDIA's CUDA repository and install the CUDA 12.3 toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update -y
sudo apt-get -y install cuda-toolkit-12-3
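Once the packages are installed (a reboot may be needed for the new driver to load), it is worth checking that everything is visible. nvidia-smi should list your GPU, and nvcc (which may live under /usr/local/cuda/bin if it is not yet on your PATH) should report the CUDA version:

nvidia-smi
/usr/local/cuda/bin/nvcc --version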

Install Ollama

Ollama is a powerful tool, written in Go, that makes running large language models simple and efficient. I have tested various ways of running models, including llama.cpp, the Hugging Face inference API, and various other tools; Ollama tends to perform the best with a GPU.

If you are stuck on a CPU, llama.cpp may work better, although I managed to get Ollama running on a CPU just fine. I didn't run enough tests to draw a conclusion about which is better for CPU-only machines; on a GPU box, however, Ollama wins by a large margin.

To install Ollama:

curl https://ollama.ai/install.sh | sh
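You can confirm the install worked and check which version you got with:

ollama --version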

Now you should have Ollama installed. To set up Mixtral:

ollama pull mixtral:instruct

The above command will pull down the Mixtral model and configure it for you so that Ollama can run this model locally.
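You can verify that the model was downloaded (and see how much disk space it uses) with:

ollama list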

Running Mixtral 8x7B

Now that you have successfully configured Mixtral 8x7B with Ollama, running the model is as simple as:

ollama run mixtral:instruct

The command above opens an interactive prompt shell, where you can chat with the model much like you would with ChatGPT.
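For example, a quick session might look something like this (the answer shown is just illustrative); type /bye to leave the shell:

>>> Tag this book as programming, cooking, fishing, or young adult: Designing Data-Intensive Applications
programming

>>> /bye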

This is great for local testing but not very useful for integrating with web applications or other external apps. Next, we will look at running Ollama as an API server to solve this problem.

Running Ollama as an API Server

To run Ollama as an API server, you can use systemd. Systemd is the Linux init system and service manager, which lets you run and manage background services.

Here is an example systemd unit file:
mlapi.service

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollamauser
Group=ollamauser
Restart=always
RestartSec=3
# Listen on all interfaces, port 8000 (use 127.0.0.1 if only local apps need access)
Environment="OLLAMA_HOST=0.0.0.0:8000"

[Install]
WantedBy=default.target

In the above config, we run the process as "ollamauser", which is just an isolated system user I created for security purposes.

You can run Ollama as any available user on your server; however, I would avoid running the process as "root". Rather, create a new system user and keep the process as isolated as possible.
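If you want to do the same, something like the following should work to create an isolated system user (the name "ollamauser" simply matches the unit file above). Note that Ollama stores pulled models under the service user's home directory (~/.ollama) by default, so either pull models as this user or point OLLAMA_MODELS at a shared path:

sudo useradd --system --create-home --shell /usr/sbin/nologin ollamauser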

You will then need to place the config file in: "/etc/systemd/system/"

I called the file "mlapi.service". You can name it whatever you like; just be aware that when using the systemctl CLI, you need to reference the service by its file name.
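After copying the unit file into place, tell systemd to reload its unit definitions so it can see the new service:

systemctl daemon-reload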

To enable your service:

systemctl enable mlapi.service

Now start your service as follows:

systemctl start mlapi.service

To check that the service is up and running, you can use:

systemctl status mlapi.service
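If the service fails to start, or you want to watch the model load, you can follow its logs via journald:

journalctl -u mlapi.service -f

Once it's running, a quick curl against the generate endpoint (using the port we set in OLLAMA_HOST above) makes for an easy smoke test:

curl http://127.0.0.1:8000/api/generate -d '{"model": "mixtral:instruct", "prompt": "Hello", "stream": false}'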

Now that you are all set up, you can make an API call to the service as follows:

import requests

url = "http://127.0.0.1:8000/api/generate"

payload = {
  "model": "mixtral:instruct",
  "stream": False,
  "prompt": "Designing Data-Intensive Applications By Martin Kleppmann",
  "system": "Tag this book as one of the following: programming, cooking, fishing, young adult. Return only the tag exactly as per the tag list with no extra spaces or characters."
}

# Let requests serialize the payload to JSON and set the Content-Type header for us.
response = requests.post(url, json=payload)

# With stream=False, Ollama returns a single JSON object;
# the generated text lives in the "response" field.
print(response.json()["response"])


Sure enough, the model returns "programming" as the tag. Ollama supports various options; you can find more detailed information in the Ollama API documentation.

To get you started, here is a breakdown of the most common parameters:

  1. model (required) - Ollama can run multiple models from the same API, so we need to tell it which model to use.
  2. stream (optional) - Set this to "false" to return the entire response in a single JSON object. The default is to stream the response, in which case you receive a series of newline-delimited JSON objects, each containing a chunk of the generated text.
  3. prompt (required) - The actual chat prompt.
  4. system (optional) - The system prompt: any context or instructions you want to give the model before it processes your prompt.
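The generate endpoint also accepts an optional "options" object for model parameters such as temperature. Here is a small variation of the earlier tagging example; the values below are just illustrative, not tuned:

import requests

payload = {
  "model": "mixtral:instruct",
  "stream": False,
  "prompt": "Designing Data-Intensive Applications By Martin Kleppmann",
  "system": "Tag this book as one of the following: programming, cooking, fishing, young adult.",
  "options": {
    "temperature": 0.1,  # lower temperature = more deterministic tagging
    "num_predict": 10    # cap the number of tokens generated
  }
}

response = requests.post("http://127.0.0.1:8000/api/generate", json=payload)
print(response.json()["response"])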