Adding scripts & Readme steps for vLLM based workloads over IBM LSF #23

arshabbir wants to merge 2 commits into IBMSpectrumComputing:master from
Conversation
> This repository shows how to run a long-running vLLM inference service under IBM LSF,
> validate it through a standard OpenAI-compatible API, access it from a Jupyter notebook,
> and reuse the same service from a downstream batch job.
A little wordsmithing:
In this repository we demonstrate how to deploy a large-language model inference service on an LSF cluster using vLLM. The service exposes an OpenAI-compatible API. We show how various clients can use the model for interactive or batch inference.
Addressed in the latest commit
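Since the thread above is about how clients talk to the service, a minimal client sketch may help readers. The endpoint path, port, model name, and demo key below mirror values quoted elsewhere in this PR; the function names (`build_chat_request`, `send_chat`) are hypothetical and not part of the repo.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8001/v1"  # port from the PR's PORT=8001 example
API_KEY = "local-vllm-key"             # demo key quoted in the README note

def build_chat_request(prompt, model="Qwen/Qwen3-0.6B"):
    """Assemble an OpenAI-compatible chat completion request (no I/O)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    return f"{BASE_URL}/chat/completions", headers, json.dumps(body).encode()

def send_chat(prompt):
    """Send the request to a running vLLM service (only works once the LSF
    job is up and the port is reachable); not invoked in this sketch."""
    url, headers, data = build_chat_request(prompt)
    req = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

url, headers, data = build_chat_request("Hello from LSF")
print(url)
```

The same request shape works from curl, the notebook, or the batch client, since vLLM exposes the standard OpenAI-compatible API.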
> - python3 installed
> - curl installed
> - network access from the execution host to pull the vLLM image and model
> - a single-node IBM LSF setup is sufficient for this implementation
I think we also require a shared $HOME directory, correct? That is not strictly necessary for LSF, but is a common deployment.
Added this in the latest README
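A small self-check could accompany the prerequisites list. This is a sketch, not part of the PR; note it can only confirm that `$HOME` is set on the current host, not that it is actually shared across nodes as the reviewer suggests.

```python
import os
import shutil

def check_prereqs():
    """Report on the README's prerequisites (python3, curl) plus the
    shared $HOME the review raises. Informational only: whether $HOME is
    shared across compute nodes cannot be verified from a single host."""
    return {
        "python3": shutil.which("python3") is not None,
        "curl": shutil.which("curl") is not None,
        "home_set": bool(os.environ.get("HOME")),
    }

report = check_prereqs()
print(report)
```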
> - scripts/batch_client.py
>   Reads a prompt corpus and sends requests to the registered vLLM service.
> - notebook/LSF_vLLM_Client.ipynb
>   Jupyter notebook for interactive validation against the IBM LSF-managed runtime.
There is no notebook subdirectory
Addressed in the latest commit
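Since `scripts/batch_client.py` is only described in the quoted README text, here is a sketch of the behavior it describes: turning a prompt corpus (one prompt per line) into request bodies. The function name and model default are assumptions, not taken from the repo.

```python
def build_batch(prompts, model="Qwen/Qwen3-0.6B"):
    """Turn corpus lines into OpenAI-style chat request bodies, skipping
    blank lines, mirroring what batch_client.py is described as doing."""
    bodies = []
    for line in prompts:
        prompt = line.strip()
        if not prompt:
            continue  # skip blank lines in the corpus
        bodies.append({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        })
    return bodies

corpus = ["What is LSF?\n", "\n", "Summarize vLLM in one line.\n"]
requests_out = build_batch(corpus)
print(len(requests_out))
```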
> - notebook/LSF_vLLM_Client.ipynb
>   Jupyter notebook for interactive validation against the IBM LSF-managed runtime.
> - corpus/prompts.txt
>   Sample prompt corpus for downstream batch validation.
There is no corpus subdirectory
Addressed in the latest commit
> Prerequisites
> -------------
> - IBM LSF installed and operational
> - podman installed
I guess this must be installed on all compute nodes of the cluster, right? I am not sure whether we need to use the LSF podman integration; I would guess not (which is fine).
Yeah, we don't need the LSF podman integration.
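To make the "plain podman, no LSF integration" point concrete, a submission could be an ordinary bsub of a script that runs the container. The job name, log pattern, and script path below are illustrative assumptions; the actual script in this PR may differ.

```python
def bsub_command(script="scripts/start_vllm_service.sh", job_name="vllm-service"):
    """Build an ordinary LSF submission command line; the submitted script
    is assumed to run `podman run ... vllm ...` itself, with no LSF
    container integration involved."""
    return [
        "bsub",
        "-J", job_name,              # job name, used later to find the service
        "-o", f"{job_name}.%J.log",  # per-job log file (%J expands to the job ID)
        script,
    ]

cmd = bsub_command()
print(" ".join(cmd))
```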
> ```bash
> cp corpus/prompts.txt ~/lsf_vllm_poc/corpus/prompts.txt
> ```
I suggest including a few lines telling readers to clone this repo and cd into the base directory. Just make it easy for people to cut-and-paste lines so that they can reproduce this without having to think too much.
Also, corpus needs to be updated to scripts.
I have addressed it and added the lines below. Hope this is fine; please verify:

git clone https://github.com/IBMSpectrumComputing/lsf-integrations.git
cd lsf-integrations/LSF-vLLM

After this, follow the step-by-step instructions given below.
> ```bash
> MODEL=Qwen/Qwen3-0.6B PORT=8001 API_KEY=local-vllm-key
> ```
How is this done? By grepping a line in one of the config files?
It sounds like the step should be to update API_KEY. Where do users get this key from? Should that be a prerequisite?
I have added the note below in the updated README:
NOTE:
Default demo API key: local-vllm-key
The service script uses this value unless API_KEY is explicitly set before submission.
If you choose a different value, update the curl commands, notebook cells, and batch client inputs accordingly.
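The "uses this value unless API_KEY is explicitly set" behavior from the note amounts to a simple environment-variable fallback. A sketch of that resolution logic (the function name is an assumption):

```python
import os

def resolve_api_key(env=None):
    """Mirror the README note: take API_KEY from the environment if set,
    otherwise fall back to the demo key local-vllm-key."""
    if env is None:
        env = os.environ
    return env.get("API_KEY", "local-vllm-key")

print(resolve_api_key({}))                     # falls back to the demo key
print(resolve_api_key({"API_KEY": "secret"}))  # explicit override wins
```

Whatever value wins here must match the key used in the curl commands, notebook cells, and batch client, as the note says.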
> ```
> http://127.0.0.1:8001/v1
> ```
For this one, it looks like you are starting the notebook on the cluster node and then connecting from the laptop through an ssh tunnel.
You should mention which host each command is run on (laptop vs. LSF compute host), and note that this URL is the one to use in the web browser.
Also mention that ssh access to a cluster node is a prerequisite.
Updated the README with these steps explaining where to run the commands
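The laptop-side tunnel the reviewer describes can be sketched as follows. The host and user names are placeholders, and the port matches the PORT=8001 example quoted in this PR; run the resulting ssh command on the laptop, not on the compute host.

```python
def tunnel_command(compute_host, user, local_port=8001, remote_port=8001):
    """Build the ssh port-forward run on the laptop so that the service
    listening on the LSF compute host becomes reachable locally."""
    return f"ssh -L {local_port}:127.0.0.1:{remote_port} {user}@{compute_host}"

cmd = tunnel_command("lsf-node01", "alice")  # placeholder host and user
print(cmd)
print("Then use http://127.0.0.1:8001/v1 from the laptop's browser or clients.")
```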
> ```
> bjobs
> bpeek ${BATCH_JOBID}
> cat ~/lsf_vllm_poc/results/batch_${JOBID}.jsonl
> ```
Overall, I suggest breaking this into a few sections:
1. Deploy the LLM
   - deploy
   - monitor
   - kill
2. Use the LLM
   - curl
   - Jupyter
   - LSF job
Please review the new restructured README file.
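The final step quoted above cats a per-job JSONL results file (batch_${JOBID}.jsonl). A sketch of reading such a file line by line; the record fields used here are assumptions, not taken from the repo.

```python
import json

def load_results(lines):
    """Parse JSONL results line by line, as produced per batch job
    (e.g. ~/lsf_vllm_poc/results/batch_<JOBID>.jsonl), skipping blanks.
    The prompt/completion field names are assumed for illustration."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records

sample = [
    '{"prompt": "What is LSF?", "completion": "A workload manager."}',
    "",
    '{"prompt": "What is vLLM?", "completion": "An inference engine."}',
]
records = load_results(sample)
print(len(records))
```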
76cc627 to cd5baca
No description provided.