Ollama¶
Ollama is a free and open source inference runtime for large language model (LLM) applications.
GPU acceleration is required for inference, so it must be run on a GPU partition.
Setup and basic operation¶
The container that provides the Ollama software is called the Ollama Shell Environment. When you run it, it starts the Ollama inference server in the background and, once the server is up, drops you into an Ubuntu 22.04 Bash shell where you can issue commands at the command line:
Terminal
[harsi12p@aoraki27 ~]$ ollama-env.sh
NOTICE: Starting Ollama server in the background.
NOTICE: Waiting for the server to come online (1/10)
## Ollama Container Shell Environment ##
Any missing packages or libraries? Send requests to:
Mail: rtis.solutions@otago.ac.nz
Subject: Additions to Ollama shell environment (container_ollama_shellenv)
Use the following environmental variables for this container instance:
* OLLAMA_HOST : 127.0.0.1:11444
* OLLAMA_BASE_URL : http://127.0.0.1:11444
* OLLAMA_MODELS : /home/harsi12p/.ollama/models
* HF_HOME : ~/.cache/huggingface
* OPENAI_URL_BASE : http://127.0.0.1:11444/v1
Install any extra Python packages with:
python install --user <<PACKAGE_NAME>>
Press [CTRL] + [D] to exit.
OllEnv harsi12p@aoraki27:~$
This has been done using a convenience script called ollama-env.sh. Useful files such as this can be extracted from the container by running the following in an empty directory:
Terminal
apptainer run /opt/apptainer_img/ollama_shellenv.sif --copy-execute-files
I recommend that you put ollama-env.sh in a directory that has been added to your PATH so you can call it regardless of where you are in the directory tree (one way of doing this is sketched after the command below). If you don't want to do this, you can run the container directly:
Terminal
apptainer run --nv /opt/apptainer_img/ollama_shellenv.sif
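If you do put ollama-env.sh on your PATH as recommended, one common approach (a sketch only; ~/bin is just one possible location) is to copy the extracted script into a personal bin directory:
Terminal
# Copy the extracted convenience script into a personal bin directory.
mkdir -p ~/bin
cp ollama-env.sh ~/bin/
chmod +x ~/bin/ollama-env.sh
# Make sure ~/bin is on PATH for future logins (skip if you already manage PATH elsewhere).
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc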
The container picks an unused TCP port (not the default) and then starts the server on it. This provides isolation between different container instances.
The container sets the OLLAMA_HOST environmental variable, which tells the ollama binary and the Ollama Python library where to send their requests.
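Because both the ollama binary and the Python library read OLLAMA_HOST automatically, you normally don't need to do anything else. If you want to confirm that the server in your container instance is reachable, you can query it directly (a minimal sketch, assuming curl is available in the environment):
Terminal
# The root endpoint reports that the server is up ("Ollama is running").
curl "$OLLAMA_BASE_URL"
# List the models currently available to this server instance.
curl "$OLLAMA_BASE_URL/api/tags"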
Home directory quota constraints¶
A quota system is in place on Aoraki that limits the data in a user's home directory to 15 GB. This can easily be exceeded with LLM models.
The possible solutions involve moving your data onto another storage medium (Ohau or HCS) and either setting environmental variables to point at the new location, or copying the data and then creating symlinks to it.
Other hidden directories, such as ~/.cache and ~/.local, can also accumulate large amounts of data. You can set the environmental variable OLLAMA_MODELS to a directory that is not in /home and Ollama will put any downloaded model files there.
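For example, assuming you have space on Ohau or HCS (the path below is a placeholder; substitute your own), you could relocate the model store and cache before starting the container:
Terminal
# Placeholder: replace with your actual Ohau/HCS project directory.
export MY_STORAGE=/path/to/your/project/space
# Tell Ollama to download and store model files outside /home.
mkdir -p "$MY_STORAGE/ollama_models"
export OLLAMA_MODELS="$MY_STORAGE/ollama_models"
# Optionally move an existing cache out of /home and symlink it back.
mv ~/.cache "$MY_STORAGE/cache" && ln -s "$MY_STORAGE/cache" ~/.cache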
Operation¶
Command line
Commands can be given at the command line. For example:
Terminal
OllEnv harsi12p@aoraki27:~$ ollama run llama3
>>> What is a cat? Give a response as a single sentence.
A cat is a small, typically furry, carnivorous mammal of the family Felidae that purrs, scratches, and curls up in adorable ways to delight its human companions.
>>> Send a message (/? for help)
It may take anywhere from 10 seconds to two minutes to initially load the code and weights into VRAM. Pressing [CTRL] + [D] exits Ollama. You can use standard Unix pipes and redirection as well:
Terminal
echo "What is a cat? Give a response as a single sentence." | ollama run llama3 > what-is-a-cat.txt
ipython
IPython provides syntax highlighting and completion in the terminal. For example:
Terminal
OllEnv harsi12p@aoraki27:~$ ipython
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.26.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import ollama
In [2]: res = ollama.generate("llama3", "What is a cat? Give a response as a single sentence.")
In [3]: print(res["response"])
A cat is a small, typically furry, carnivorous mammal that belongs to the family Felidae and is characterized by its agility, playful behavior, and distinctive vocalizations.
Pressing [CTRL] + [D] exits IPython. You have to have run ollama run llama3 or ollama pull llama3 before you run any Python code that uses that model.
jupyter-notebook
Terminal
OllEnv harsi12p@aoraki27:~$ jupyter-notebook
Jupyter Notebook opens Firefox, which is also included within the container. Because of this, you need to be able to display X11 programs on your local machine (through WSL on Windows or XQuartz on Mac). This requires a bit more setup to function correctly; the sketch below shows the usual starting point.
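The exact steps depend on how you connect to the cluster, but the usual starting point is SSH with X11 forwarding enabled (the username and hostname below are placeholders):
Terminal
# -X enables X11 forwarding; some setups need -Y (trusted forwarding) instead.
ssh -X <<USERNAME>>@<<LOGIN_NODE>>
# Inside the Ollama shell environment, DISPLAY should be set if forwarding is working.
echo "$DISPLAY"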
Batch mode
The container can also run in batch mode. This is done by giving the container a single parameter: the container starts the Ollama server, but instead of dropping you into an interactive Bash shell it runs your command and then exits. For example:
Terminal
apptainer run --nv /opt/apptainer_img/ollama_shellenv.sif 'echo "What is a cat? Give a response as a single sentence." | ollama run llama3 > what-is-a-cat-batched.txt'
This is useful if you have large inference jobs that you want to run through SLURM (for a significant increase in speed and available resources); a sketch of a batch script is shown below.
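A minimal sbatch script might look like the following (the partition name, GPU request syntax, and resource amounts are placeholders; adjust them to match your Aoraki allocation):
Terminal
#!/bin/bash
#SBATCH --job-name=ollama-batch
#SBATCH --partition=gpu          # Placeholder: use whichever GPU partition you have access to.
#SBATCH --gres=gpu:1             # GPU request syntax may differ depending on cluster configuration.
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=01:00:00

# Non-interactive (batch mode) run of the container, as described above.
apptainer run --nv /opt/apptainer_img/ollama_shellenv.sif \
  'echo "What is a cat? Give a response as a single sentence." | ollama run llama3 > what-is-a-cat-batched.txt'
Submit it with sbatch and monitor it with squeue as usual.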
Resources¶
LLM inference jobs are very heavy on VRAM (video card RAM) and on CUDA cores. Under testing it was found that, by default, Ollama would only keep one model in VRAM at a time. This makes using multiple models at the same time (for example, a standard inference model alongside an embedding model) excruciatingly slow, because the models have to be repeatedly unloaded and reloaded. Because of this, the maximum number of loaded models has been set to three. You can change this before you run the container by setting the following environmental variables:
Terminal
export OLLAMA_KEEP_ALIVE=5m          # How long to keep the model in VRAM before it is unloaded.
export OLLAMA_MAX_LOADED_MODELS=3    # How many models to have in VRAM at one time.
export OLLAMA_NUM_PARALLEL=1         # How many inference jobs to do in parallel.
You can check resource usage with nvidia-smi and ollama ps:
Terminal
OllEnv harsi12p@aoraki27:~$ nvidia-smi
Fri Aug 30 18:26:42 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:21:00.0 Off | 0 |
| N/A 27C P0 32W / 250W | 1MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:81:00.0 Off | 0 |
| N/A 26C P0 36W / 250W | 5477MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 1 N/A N/A 3034108 C ...unners/cuda_v11/ollama_llama_server 5468MiB |
+-----------------------------------------------------------------------------------------+
And for Ollama process information:
Terminal
OllEnv harsi12p@aoraki27:~$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3:latest 365c0bd3c000 5.4 GB 100% GPU 4 minutes from now