Benchmark LLMs with Ollama Python Library

APMonitor.com

The ollama and transformers libraries are two packages that integrate Large Language Models (LLMs) with Python to provide chatbot and text generation capabilities. This tutorial covers the installation and basic usage of the ollama library.
The first step is to install the ollama server. After the server is running, install the ollama Python package with pip:
pip install ollama
With the ollama server and Python package installed, retrieve the mistral LLM or any of the other models available in the ollama library. The mistral model is a relatively small (7B parameter) LLM that can run on most CPUs. Larger models such as mixtral work best on GPUs with sufficient processing power and VRAM.
import ollama
ollama.pull('mistral')
The package and model installations occur only once. List the locally installed models with:
ollama.list()
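The list function also returns metadata for each installed model. A minimal sketch that prints each model's name and approximate size, assuming the dictionary layout of the ollama REST API response (a models list with name and size keys):
import ollama

# print the name and approximate size of each installed model
for m in ollama.list()['models']:
    print(m['name'], round(m['size'] / 1e9, 1), 'GB')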
With the model installed, use the generate function to send a prompt to the LLM and print the response.
import ollama
q = 'How can LLMs be used in engineering?'
response = ollama.generate(model='mistral', prompt=q)
# the generated text is returned in the 'response' field
print(response['response'])
The chat function retains memory of prior prompts when the conversation history is passed back in the messages list, while the generate function is stateless.
import ollama

prompt1 = 'What is the capital of France?'
response = ollama.chat(model='mistral', messages=[
    {'role': 'user', 'content': prompt1},
])
r1 = response['message']['content']
print(r1)

prompt2 = 'and of Germany?'
# resend the prior exchange so the model has context for the follow-up
response = ollama.chat(model='mistral', messages=[
    {'role': 'user', 'content': prompt1},
    {'role': 'assistant', 'content': r1},
    {'role': 'user', 'content': prompt2},
])
r2 = response['message']['content']
print(r2)
The responses are:
🗣️ The capital city of France is Paris. Paris is one of the most famous cities in the world and is known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées. It is also home to many important cultural institutions and is a major European political, economic, and cultural center.
🗣️ The capital city of Germany is Berlin. Berlin is the largest city in Germany by both area and population, and it is one of the most populous cities in the European Union. It is located in northeastern Germany and serves as the seat of government and the main cultural hub for the country. Berlin is known for its rich history, diverse culture, and numerous landmarks including the Brandenburg Gate, the Reichstag Building, and the East Side Gallery.
The ollama library also supports streaming responses so that the text appears in pieces as it is generated instead of being printed all at once after completion. This improves the perceived responsiveness of the LLM, especially with limited computing resources.
import ollama

prompt = 'How can LLMs improve automation?'
stream = ollama.chat(model='mistral',
                     messages=[{'role': 'user', 'content': prompt}],
                     stream=True)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
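The generate function accepts the same stream=True option. A short sketch, assuming each chunk carries the text in a 'response' field as reported by the ollama REST API, rather than the 'message' dictionary used by chat:
import ollama

# stream a completion from generate; text arrives in the 'response' field
stream = ollama.generate(model='mistral',
                         prompt='How can LLMs improve automation?',
                         stream=True)
for chunk in stream:
    print(chunk['response'], end='', flush=True)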
The library API is designed to access the ollama REST API with functions like chat, generate, list, show, create, copy, delete, pull, push, and embeddings.
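As one example beyond text generation, the embeddings function returns a numeric vector for a text passage, which is useful for similarity search. A minimal sketch, assuming the response is a dictionary with an 'embedding' key as in the REST API:
import ollama

# request an embedding vector for a short text passage
e = ollama.embeddings(model='mistral', prompt='LLMs in engineering')
print(len(e['embedding']))  # vector dimensionality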
Applications in Engineering
The ollama Python library facilitates the use of LLMs in applications such as chatbots, customer support agents, and content generation tools. Code generation, debugging, and cross-language programming support can be accelerated with LLMs when used effectively. The ollama library simplifies interaction with advanced LLMs, enabling more sophisticated responses and capabilities.
One potential obstacle to using more sophisticated models is the size of the LLM and the speed of response without a high-end GPU. Cloud computing resources are a viable option for application deployment.
Activity: Evaluate Model Performance
It is important to understand the trade-offs between quality, cost, and speed for different LLMs. Several websites benchmark model performance across models and service providers. The purpose of this exercise is to compare model performance on your computer. Install 3 different models on your local ollama server and test the speed of response versus the size of the model with the script below.
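A minimal sketch of such a benchmark follows. The model names are examples; substitute any three models pulled to the local server. It assumes the generate response includes the eval_count (tokens) and eval_duration fields reported by the ollama REST API.
import time
import ollama

# example models; substitute any models pulled to the local server
models = ['mistral', 'phi', 'gemma']
q = 'How can LLMs be used in engineering?'

for m in models:
    start = time.time()
    r = ollama.generate(model=m, prompt=q)
    elapsed = time.time() - start
    # eval_count (tokens generated) comes from the REST API response
    tok = r.get('eval_count', 0)
    print(f'{m}: {elapsed:.1f} s, {tok/elapsed:.1f} tokens/s')
Compare the tokens per second and total response time across the three models to quantify the speed versus model-size trade-off on your hardware.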
