How to have your AI stack locally (Vision, Chat, TTS, STT, Image Generation and RAG)

Choosing a title for this post was challenging. The possibilities for a well-implemented, self-hosted AI stack are limitless, so I had to select the most eye-catching option to generate interest and engagement.

For as long as I can remember, I've been an avid supporter of self-hosted applications. As someone who values privacy, I make it my mission to discover alternatives to cloud services so that I can keep control of my data and avoid recurring payments for services that are not mine.

Just like any emerging technology, AI demands either dedicated hardware (like Groq's LPU architecture, explained in detail here) or extremely powerful GPUs (as used by OpenAI, Microsoft, and others). Typically, this kind of hardware is not readily accessible to consumers. However, with certain caveats, that is no longer always the case.

GPU Requirements

Before we proceed, let's address the obvious: we need a robust GPU for our small-scale project. When I say robust, I mean something with 16GB of VRAM, like the RTX 4070 Ti Super. Alternatively, if you prefer excellent performance at less than half the cost, consider any 16GB GPU from AMD.

This class of GPU can (just about) handle loading our necessary models locally, preventing us from sharing our data and spending on cloud services. Additionally, it allows you to operate your AI system entirely offline, even if your internet connection is unexpectedly disrupted.

The Modeldrome: our AI stack

Now, our task is to outline exactly what a comprehensive, functional, and convenient AI stack requires. Let's look at what constitutes such a stack and how to implement it on a local machine.

System Requirements

Our system requirements are quite demanding but manageable if you're a PC enthusiast. If you’re a gamer or currently work with AI locally, it's likely that you already have the necessary setup in place.

Requirements

  • 24GB RAM (minimum, 32GB recommended)
  • 8-core modern CPU
  • 16GB VRAM GPU (the more, the better)
  • Docker and Docker Compose

You can use either Windows or Linux for this task. In theory, macOS should work similarly, but you may need to manually adjust the instructions for GPU usage in Docker, since I no longer own a Mac (unfortunately). The same applies to AMD GPUs, as I am using an NVIDIA GPU for this tutorial.
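
Before moving on, it is worth checking that Docker can actually see your GPU, since several of the containers below rely on it. Here is a minimal sanity check, assuming the NVIDIA Container Toolkit is already installed (the CUDA image tag is just an example, use whatever tag is current):

    # Should print the same table as running nvidia-smi on the host
    docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this fails, fix your Docker GPU setup first: everything GPU-related later in this guide depends on it.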

What Do We Need, and How Do We Get It?

First Things First: running LLMs locally with Ollama

The initial requirement is a means of operating our Large Language Models without an internet connection. Our preference is to avoid utilizing any cloud-based services, whether from OpenAI or other providers. It's crucial for us that our models are entirely under our control and ownership.

Installing Ollama is quite straightforward: simply visit their Downloads page, choose your operating system, and follow the provided steps. If you're using Windows, you will notice a new icon in your taskbar; Ollama launches at startup and begins listening for connections automatically. On Linux, Ollama operates as a service, allowing you to manage it via systemctl just like any other service. It initiates automatically and awaits connections on port 11434.

It's definitely feasible to set up Ollama via a Docker container, but I highly recommend installing the native edition instead. The native version is significantly easier to manage and customize, and you sidestep some unnecessary overhead, which will pay off in subsequent steps.
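
A quick way to confirm that the native install is actually listening, assuming the default port 11434:

    # Lists the models currently available to Ollama (an empty list on a fresh install)
    curl http://localhost:11434/api/tags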

Downloading a model

The last step to run our LLM locally is choosing a model. You can consult the Ollama library and pull a model directly from there, or follow along with my configuration, which runs seamlessly on a 16GB NVIDIA GeForce RTX 4070 Ti Super GPU.
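
If you go the library route, pulling and running a model is a one-liner each way (the model name below is only an example, pick whatever fits your VRAM):

    # Download a model from the Ollama library and chat with it interactively
    ollama pull mistral-small
    ollama run mistral-small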

I am using a model called Mistral Small Instruct Abliterated that you can download for free from HuggingFace.

NOTE: You can skip the following part and just install the model from here. Please note that that model is not abliterated (i.e. it is still censored).

Once downloaded, you can import it into Ollama very easily. Just create a text file named Modelfile in the same folder as the mistral-small-instruct-2409-abliterated-q4_k_m.gguf file you just downloaded and fill it with the following content:

FROM mistral-small-instruct-2409-abliterated-q4_k_m.gguf
TEMPLATE """{{- if .Messages }}
{{- range $index, $_ := .Messages }}
{{- if eq .Role "user" }}
{{- if and (le (len (slice $.Messages $index)) 2) $.Tools }}[AVAILABLE_TOOLS] {{ $.Tools }}[/AVAILABLE_TOOLS]
{{- end }} [INST] {{ if and $.System (eq (len (slice $.Messages $index)) 1) }}{{ $.System }}

{{ end }}{{ .Content }} [/INST]
{{- else if eq .Role "assistant" }}
{{- if .Content }} {{ .Content }}
{{- if not (eq (len (slice $.Messages $index)) 1) }}</s>
{{- end }}
{{- else if .ToolCalls }}[TOOL_CALLS] [
{{- range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{- end }}]</s>
{{- end }}
{{- else if eq .Role "tool" }}[TOOL_RESULTS] {"content": {{ .Content }}}[/TOOL_RESULTS]
{{- end }}
{{- end }}
{{- else }} [INST] {{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }} [/INST]
{{- end }}
{{- if .Response }} {{ end }}{{ .Response }}
{{- if .Response }}</s> {{ end }}
"""

Here we are using an abliterated model, meaning that it is uncensored and free from the usual LLM constraints.

Now simply run ollama create mistral_small_abliterated -f Modelfile from a terminal in that same folder and you are done.

Remember to test your setup by chatting a little with your model. You can do so by running ollama run yourmodelname.
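
You can also hit the same model over Ollama's HTTP API, which is exactly what Open WebUI will do later. A small sketch, assuming the default port and the model name created above:

    # Request a single, non-streaming completion from the local model
    curl http://localhost:11434/api/generate -d '{
      "model": "mistral_small_abliterated",
      "prompt": "Say hello in one short sentence.",
      "stream": false
    }'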

Let there be Voice: setting up openedai/speech TTS engine

Although OpenAI and numerous other firms provide highly competitive TTS APIs at attractive prices, we are hesitant to depend on external services. As digital libertarians, we prefer to maintain control over our resources. Furthermore, who can say with certainty what these companies might do with your voice data?

Fortunately, the open source community lends us a hand once again. Enter the openedai-speech project, an open source initiative that utilizes accessible and free text-to-speech models to deliver an API fully compatible with OpenAI's specifications.

For detailed installation steps and insight into how the software operates, consult their repository listed above. To swiftly deploy using default ports (specifically port 8000), follow their Docker Compose guidelines to get your openedai-speech server up and running without delay.

IMPORTANT: If your system has limited VRAM or GPU resources, it is advisable to follow the instructions for running the CPU-based version of openedai-speech. You might experience a slight reduction in speed, but this will alleviate the burden on your GPU.
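
Once the container is up, you can check that speech synthesis works before wiring anything else in. A minimal sketch, assuming the default port 8000 and the stock OpenAI-style voices (the output file name is arbitrary):

    # Request an MP3 from the OpenAI-compatible speech endpoint
    curl http://localhost:8000/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{"model": "tts-1", "voice": "alloy", "input": "The local stack is alive."}' \
      -o test.mp3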

I have ears and I can hear: setting up faster-whisper for STT

Once more, we find ourselves here. A while back, OpenAI developed an exceptional model known as Whisper, which excels in Speech-To-Text recognition with remarkably high quality. Its performance is so impressive that it can operate in real time.

Once again, the open source community comes to our rescue. The Whisper model has been successfully rebuilt and improved by numerous contributors, resulting in completely open source versions of both the model and its underlying engine.

Without delving into the background of models such as faster-whisper and their ilk, let's cut to the chase: we need a set of OpenAI-compatible APIs that leverage our own machine's processing capabilities. After all, if we can run an LLM, surely we can run an STT model too, right?

That's right! We will utilize faster-whisper-server as our backend to meet our Whisper requirements. The repository provides comprehensive guidelines on installing and operating the server. As is customary, opt for either CPU or GPU depending on your specific needs (though it’s worth noting that the GPU version performs significantly better without a substantial memory footprint).

On this occasion, it's recommended to use the Docker Compose image; it will spare you numerous frustrations. Nonetheless, you might run into a scenario where the server refuses to use your GPU. The root cause lies in an incorrect configuration in their compose.yml file, which we shall promptly rectify.

Once you clone the repository, open the compose.yml file and find this section:

    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]

Nuke it gently and replace it with:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # device_ids: ['0', '1'] # Select specific GPUs, or
              count: all
              capabilities: [gpu]

This will fix the misconfiguration and allow the Docker image to use your NVIDIA card.

Start the Docker Compose stack and you are good to go!
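
As a quick test, you can send an audio file to the OpenAI-compatible transcription endpoint. This sketch assumes you exposed the server on port 9000 (the port used in the Open WebUI configuration later) and that sample.wav exists in the current folder:

    # Transcribe a local audio file with the large-v3 Whisper variant
    curl http://localhost:9000/v1/audio/transcriptions \
      -F "file=@sample.wav" \
      -F "model=Systran/faster-whisper-large-v3"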

Do Your Own Research: searxng as a local, private search engine

Searching online is essential for any artificial intelligence system. Since LLMs are trained on a static dataset, their knowledge can quickly become obsolete. Consequently, AI systems adopt RAG techniques to gather and consume new data without requiring additional training or fine-tuning.

There is a multitude of hosted search services available, including privacy-focused options such as DuckDuckGo and Brave Search (which I strongly recommend for everyday use). However, we are selective and prefer that our search queries remain on our computer (to the greatest extent possible).

For our small-scale project, we will utilize searxng, an open-source metasearch engine that can be hosted on your own server for local use. This tool directs your search queries to customizable endpoints (such as Google, Bing, Brave, and DuckDuckGo) while maintaining your privacy. It then aggregates the results from these various sources into a single combined set of results.

The project provides a dedicated repository for running searxng in a Docker container via Docker Compose. Although all methods are valid, I opted for the "Method 2" approach because I already have my own reverse proxy. You may select either approach as long as your configuration works.

VERY IMPORTANT: You have to edit searxng/settings.yml, adding the following at the end (after the redis section):

search:
  formats:
    - html
    - csv
    - json
    - rss

This is mandatory to allow our system to get search results from the searxng API server.
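
You can verify that the JSON format is actually enabled with a quick query. This assumes your instance is reachable at http://localhost:8080; adjust host and port to however you exposed it (Caddy or your own reverse proxy):

    # Should return search results as JSON instead of an error page
    curl "http://localhost:8080/search?q=self-hosted+ai&format=json"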

BONUS: Imagine all the bots: SD.Next for Stable Diffusion

Image Generation has become ubiquitous. Although the initial frenzy has somewhat subsided, there's no denying that AI-produced images remain incredibly sought after. This is why we aim for our AI system to have the capability of generating images for us.

Hosting a Stable Diffusion server demands significant VRAM, so it's generally advisable not to run SD (Stable Diffusion) at the same time as the rest of the stack. However, if you have a high-end GPU with ample VRAM (24GB), this concern is alleviated.

Setting up SD.Next, an advanced fork of AUTOMATIC1111's web UI with extensive model support, is straightforward: simply follow the official guidelines, and your server will be operational promptly.
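
If you want to check image generation outside the web interface, SD.Next also exposes an AUTOMATIC1111-style API, which is what Open WebUI talks to. A rough sketch, assuming the API is enabled and the server listens on port 7860 (the response contains the generated image as base64):

    # Generate one image from a text prompt via the txt2img endpoint
    curl http://localhost:7860/sdapi/v1/txt2img \
      -H "Content-Type: application/json" \
      -d '{"prompt": "a lighthouse at dusk, oil painting", "steps": 20}'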

One WebUI to rule them all: installing Open WebUI and putting it all together

Now that our APIs and services are fully prepared for deployment, we require a straightforward method to utilize them. For this guide, we'll employ Open WebUI, an open source, user-friendly interface designed to facilitate interaction with our components.

You can set up Open WebUI using pip or Docker, with both methods offering a simple installation process. After following the provided steps, you will be presented with the WebUI interface.
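
For reference, this is roughly what the Docker route looked like at the time of writing; treat it as a sketch and check the official documentation for the current command and image tag:

    # Run Open WebUI on port 3000, persisting its data in a named volume
    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui \
      ghcr.io/open-webui/open-webui:main

With this port mapping, the interface is reachable at http://localhost:3000.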

By going to the /admin/settings page, you will be able to configure everything you need. Specifically:

  • Under "Connections", insert your Ollama URL endpoint (usually the one ending with 11434)
  • Under "Web Search", configure your SearXNG endpoint (either the Caddy link or the reverse proxy link based on your configuration)
  • Under "Audio", set STT to OpenAI and insert your openedai-speech endpoint URL (e.g. http://192.168.1.10:8000/v1). Remember to add v1 at the end of the URL. Set the model to Systran/faster-whisper-large-v3 or anything else you want to use
  • Still under "Audio", set TTS to OpenAI and insert your fasterwhisper endpoint URL (e.g. http://192.168.1.10:9000/v1). Remember to add v1 at the end of the URL. Set the Voice to alloy and the Model to tts-1 or anything else you want to use
  • (Optional) Under "Images", configure your SD.Next endpoint URL (e.g. http://192.168.1.10:7860) and enable the feature to be able to query images to your model

That's it! You should be able to just press "New Chat" and start using your system!

Conclusion

We now have a complete AI stack ready to run and use locally. Except for Web Search (naturally), everything also works offline. Given how quickly AI models and software evolve, this guide may become somewhat outdated over time. Feel free to share it and adapt it as needed.

Happy experiments!