A vLLM wrapper that makes local LLMs easier to use.
- qwen-3: Qwen/Qwen3-1.7B
- qwen-3-next: Qwen/Qwen3-Next-80B-A3B-Instruct
- llama-4-scout: meta-llama/Llama-4-Scout-17B-16E-Instruct
- llama-3-2: meta-llama/Llama-3.2-1B-Instruct
To add a new model, add an entry to MODELS; see the sketch below.
The key is the alias used in the command-line options, and the value is the model name on Hugging Face.
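As a rough sketch (the exact shape of MODELS in this repository may differ), each entry maps a command-line alias to a Hugging Face model name:

```python
# Hypothetical sketch of MODELS: command-line alias -> Hugging Face model name.
MODELS = {
    'qwen-3': 'Qwen/Qwen3-1.7B',
    'qwen-3-next': 'Qwen/Qwen3-Next-80B-A3B-Instruct',
    'llama-4-scout': 'meta-llama/Llama-4-Scout-17B-16E-Instruct',
    'llama-3-2': 'meta-llama/Llama-3.2-1B-Instruct',
    # Add a new model here, e.g.:
    # 'my-model': 'my-org/My-Model-Instruct',
}
```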
- Python 3.10+
- PyTorch with CUDA support (if using a GPU)
Anaconda or Miniconda is strongly recommended.
- Note: Qwen recommends Python 3.11+ for best performance.
Install dependencies:

```bash
pip install -r requirements.txt
```

We highly recommend pre-downloading the model before running LLM-Server.
To download a model from the CLI:

```bash
huggingface-cli download <model>
```

Run the following command to start the server:

```bash
python ./run.py <options> <model_name>
```

Replace <model_name> with the desired model alias from the supported model list (e.g., qwen-3-next).
Options:
- `--gpu-id`: Space-separated list of GPU IDs to use, starting from 0. For example, `--gpu-id 0 1 2` uses the first three GPUs. (default: all available GPUs)
- `--gpu-mem-utilization <float>`: GPU memory utilization used by vLLM. For example, `0.8` uses 80% of the GPU memory. (default: 0.85)
- `-p`, `--port`: Port to run the server on. (default: 5000)
- `--max-tokens`: Maximum tokens for the model (prompt + response). (default: 8192)
- `--log-debug`: Enable debug logging in vLLM.
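For example, the following invocation (assuming at least two visible GPUs) starts the server for the qwen-3 alias using the flags documented above:

```bash
# Example invocation; flags as documented above, values chosen for illustration.
python ./run.py --gpu-id 0 1 --gpu-mem-utilization 0.8 --port 5000 --max-tokens 8192 qwen-3
```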
LLM-Server and vLLM use an OpenAI-style API.
For example, in Python:

```python
import openai

# Initialize the OpenAI API client; base_url points to LLM-Server.
client = openai.OpenAI(base_url='http://localhost:5000/v1', api_key='')
```

where `base_url` is the URL of LLM-Server. API key checking is currently disabled for the local server, so `api_key` can be left empty.
To run inference:

```python
response = client.chat.completions.create(
    model='<model_name>',
    messages=[
        {'role': 'system', 'content': SYSTEM_MESSAGE},
        {'role': 'user', 'content': prompt},
    ],
    temperature=<temperature>,
    max_tokens=<max_tokens>,
)
print(response.choices[0].message.content)  # Print the response
```

where <model_name> is the model name on Hugging Face, <temperature> is the sampling temperature, and <max_tokens> is the maximum number of tokens for inference (i.e., the length of the system message plus the prompt).
Note that <max_tokens> does not include the response.
Thus, it should be smaller than the --max-tokens value used to run LLM-Server.
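As a rough sketch for staying inside that budget, you can count the prompt tokens with the Hugging Face tokenizer of the chosen model before picking <max_tokens>. The tokenizer name and messages below are illustrative assumptions, not part of LLM-Server:

```python
from transformers import AutoTokenizer

# Assumed example: measure the system message + prompt so that <max_tokens>
# stays below the server's --max-tokens budget (8192 by default).
SYSTEM_MESSAGE = 'You are a helpful assistant.'
prompt = 'Hello!'

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-1.7B')
prompt_token_ids = tokenizer.apply_chat_template(
    [
        {'role': 'system', 'content': SYSTEM_MESSAGE},
        {'role': 'user', 'content': prompt},
    ],
    tokenize=True,
)
print(len(prompt_token_ids))  # Keep this (and max_tokens) below --max-tokens
```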