Harnessing Llama CPP for Efficient HTTP Server Deployment of LLMs

Vithushan Sylvester
6 min read · Oct 17, 2023

Image by author via DALL-E 3

In the rapidly advancing field of natural language processing (NLP), leveraging powerful tools can significantly streamline development and boost productivity. One such tool that stands out is Llama CPP, a project designed to run large language models (LLMs) efficiently. This article delves into the core components of the Llama CPP project, its models, and how developers can harness its potential to drive their NLP projects to success.

Llama CPP

Llama CPP is a robust framework built to handle the operational intricacies of running large language models. By offering a streamlined interface and optimized performance, it plays a pivotal role in simplifying the deployment of LLMs in various applications.

Llama 2

The Llama project encompasses a suite of models tailored for diverse NLP tasks. Among these, Llama 2 stands out for its enhanced capabilities and performance optimizations, making it a preferred choice for developers aiming to tackle complex NLP challenges.

Benefits of Using Llama Models:

  • Commercially Usable License: One of the notable advantages of Llama models is that their community license permits commercial use (subject to Meta's terms), encouraging a collaborative development environment while providing legal clarity for commercial deployments.
  • Community Support: Being open-source, Llama models enjoy robust community support, which is crucial for troubleshooting, optimizations, and continuous improvement.
  • Performance: Llama models are optimized for superior performance, ensuring that developers can run their NLP tasks efficiently without compromising on accuracy.

Run Llama Models using Llama CPP

Utilizing Llama CPP to run Llama models is a straightforward process. By following the well-documented steps provided in the project repository, developers can get their models up and running in no time.

Run the Llama Models as an HTTP Server

Running Llama models as an HTTP server is an essential feature for developers looking to integrate these models into web-based applications. Llama CPP provides the necessary tools and documentation to facilitate this setup, enabling seamless interactions with the models via HTTP requests.

Setting Up the Environment:

Open a terminal and execute the following commands to clone the Llama CPP repository and navigate to the cloned directory:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

This will create a directory called llama.cpp in your current directory and switch you into it.

Now, build the Llama CPP project by executing the following command:

make
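
The plain make command produces a CPU-only build. The Makefile also exposes optional acceleration flags; the two shown below were the documented options for Apple Silicon and NVIDIA GPUs as of late 2023, so check the repository README for the current ones:

# Parallel CPU-only build
make -j

# Build with Metal acceleration (Apple Silicon)
LLAMA_METAL=1 make

# Build with cuBLAS acceleration (NVIDIA GPUs)
LLAMA_CUBLAS=1 make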

Downloading Llama Models:

Downloading the GGML file for the model you intend to use is a must. You can do this in one of two ways.

  1. Request the models from Meta AI via the following request form: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ (these ship as raw PyTorch weights, so you will need to convert them to GGML yourself).
  2. Or, simply download a GGML-format model of your choice by browsing the Hugging Face Hub, as shown below.
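
For instance, you can pull a quantized Llama 2 chat model straight into the models directory. The repository and file name below are purely illustrative; pick whichever model and quantization level suit your hardware:

# Illustrative download of a 4-bit quantized Llama 2 7B chat model (GGML format)
# Run this from inside the llama.cpp directory
wget -P ./models/ https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin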

Placing the Models in Llama CPP Project:

Place the downloaded GGML file into the ~/llama.cpp/models/ directory of the Llama CPP project.
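
Assuming the file was saved to your downloads folder, moving it into place looks something like this (the file name is the illustrative one from above):

# Move the downloaded model into the project's models directory
mv ~/Downloads/llama-2-7b-chat.ggmlv3.q4_0.bin ~/llama.cpp/models/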

Run the Llama Models

Execute the following command to start the server:

./server -m ./models/<your downloaded model name>

Once you execute the command, you will see a confirmation in the console that the server is running on localhost at port 8080.
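
The server binary also accepts a handful of useful flags. For example, you can set the context size, bind address, and port explicitly (the model file name here is the illustrative one from earlier):

# -c sets the context size; --host and --port control where the server listens
./server -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin -c 2048 --host 127.0.0.1 --port 8080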

With that, you can start interacting with the server by sending HTTP requests. You can use the following curl command to test this out:

curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

In addition, by opening the same localhost address in your browser, you can access an interactive chatbot interface as well.

Optimizing the efficiency

Efficiency is paramount when working with large language models. Now that we have the server running locally, the real power of Llama CPP comes when you tune the parameters. Llama CPP offers a range of configurable parameters that developers can tweak to optimize performance based on their specific needs. Understanding these parameters and how they impact operational efficiency is key to unlocking the full potential of Llama CPP. However, it is very difficult to pick the perfect parameters without understanding the business context and infrastructural requirements.

The following is a list of parameters that can be used to tune the performance and efficiency of the models; a combined example request follows the list.

temperature: Adjust the randomness of the generated text (default: 0.8).

top_k: Limit the next token selection to the K most probable tokens (default: 40).

top_p: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

n_predict: Set the number of tokens to predict when generating text. Note: May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).

n_keep: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the initial prompt.

stream: Allows receiving each predicted token in real time instead of waiting for the completion to finish. To enable this, set it to true (default: false).

prompt: Provide a prompt as a string, or as an array of strings and numbers representing tokens. Internally, the prompt is compared against the cache: if part of it has already been evaluated, only the remaining part will be evaluated. If the prompt is a string, or an array with the first element given as a string, a space is inserted in front, as main.cpp does.

stop: Specify a JSON array of stopping strings. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration (default: []).

tfs_z: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).

typical_p: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled).

repeat_penalty: Control the repetition of token sequences in the generated text (default: 1.1).

repeat_last_n: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx-size).

penalize_nl: Penalize newline tokens when applying the repeat penalty (default: true).

presence_penalty: Repeat alpha presence penalty (default: 0.0, 0.0 = disabled).

frequency_penalty: Repeat alpha frequency penalty (default: 0.0, 0.0 = disabled).

mirostat: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).

mirostat_tau: Set the Mirostat target entropy, parameter tau (default: 5.0).

mirostat_eta: Set the Mirostat learning rate, parameter eta (default: 0.1).

grammar: Set grammar for grammar-based sampling (default: no grammar).

seed: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

ignore_eos: Ignore end of stream token and continue generating (default: false).

logit_bias: Modify the likelihood of a token appearing in the generated text completion. For example, use "logit_bias": [[15043,1.0]] to increase the likelihood of the token 'Hello', or "logit_bias": [[15043,-1.0]] to decrease its likelihood. Setting the value to false, "logit_bias": [[15043,false]] ensures that the token Hello is never produced (default: []).

n_probs: If greater than 0, the response also contains the probabilities of top N tokens for each generated token (default: 0).
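
As a worked illustration, the request below combines several of the parameters above in a single call. The values are plausible starting points rather than recommendations; the right settings depend on your model, hardware, and use case:

# Completion request overriding several sampling parameters at once
curl --request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{
  "prompt": "Explain what an HTTP server does in one paragraph:",
  "n_predict": 256,
  "temperature": 0.7,
  "top_k": 40,
  "top_p": 0.9,
  "repeat_penalty": 1.15,
  "seed": 42,
  "stop": ["\n\n"]
}'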

Conclusion

Llama CPP emerges as a valuable asset for developers venturing into the realms of LLMs. By providing a solid foundation to run large language models efficiently, it significantly lowers the entry barriers and operational complexities, paving the way for innovation and success in NLP projects. Whether you are a seasoned developer or a newcomer to the LLM landscape, leveraging Llama CPP and its suite of models can significantly accelerate your development journey.
