Make the vllm-openai Docker container compatible with HuggingFace Inference Endpoints. Specifically, recent vLLM versions support vision-language models such as Phi-3-vision that Text Generation Inference (TGI) does not yet support, so this repo is useful for deploying VLMs that TGI cannot serve.
This repo was heavily inspired by https://github.com/philschmid/vllm-huggingface, but is simpler because it does not fork from vllm.
- Install dependencies with `poetry install`. If using `poetry` as your environment manager, run `poetry shell` to activate your environment.
- Add a `.env` file in the root directory with `HF_TOKEN` defined as a read/write token from HuggingFace. See `.env.example` for how to format.
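A `.env` file following that format might look like the sketch below; the token value is a placeholder, and `HF_ENDPOINT_URL` is filled in later once the endpoint is deployed:

```
# Read/write HuggingFace access token (placeholder value)
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

# Added after deployment, once the endpoint URL is known
# HF_ENDPOINT_URL=https://your-endpoint.endpoints.huggingface.cloud
```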
1. View/edit the details in `examples/deploy.py`. It is set up to deploy a HuggingFace Inference Endpoint for the Phi-3-vision model. Once you have set the necessary variables, run `python examples/deploy.py`.
2. Go to the link printed by the `deploy.py` script to watch the endpoint deployment status and to retrieve the inference base URL once deployment finishes.
3. Copy the endpoint URL from step 2 and add the env variable `HF_ENDPOINT_URL` with this copied value. Again, see `.env.example` for how to format.
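A deployment script along these lines can be written with `huggingface_hub.create_inference_endpoint`, pointing the endpoint at the vllm-openai container instead of the default TGI image. This is a minimal sketch, not the repo's actual `examples/deploy.py`: the endpoint name, cloud/instance choices, container image reference, and env variable names in `custom_image` are illustrative assumptions.

```python
import os


def build_endpoint_kwargs() -> dict:
    """Assemble keyword arguments for create_inference_endpoint.

    All values below are illustrative; adjust the repository, hardware,
    and image to match your account and quota.
    """
    return {
        "repository": "microsoft/Phi-3-vision-128k-instruct",
        "framework": "pytorch",
        "task": "custom",
        "accelerator": "gpu",
        "vendor": "aws",          # assumed cloud vendor
        "region": "us-east-1",    # assumed region
        "instance_size": "x1",    # assumed instance size
        "instance_type": "nvidia-a100",  # assumed instance type
        # Swap the default TGI image for the vllm-openai container.
        "custom_image": {
            "health_route": "/health",
            "url": "vllm/vllm-openai:latest",  # illustrative image reference
            "env": {"MODEL_NAME": "microsoft/Phi-3-vision-128k-instruct"},
        },
    }


if __name__ == "__main__":
    from huggingface_hub import create_inference_endpoint

    endpoint = create_inference_endpoint(
        "phi-3-vision-vllm",  # hypothetical endpoint name
        token=os.environ["HF_TOKEN"],
        **build_endpoint_kwargs(),
    )
    # Watch deployment progress in the Inference Endpoints UI.
    print(f"Endpoint created, current status: {endpoint.status}")
```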
- The endpoint you deployed above is OpenAI API compatible, meaning you can use the OpenAI library, and any other library built on top of it, with your endpoint. For an example of how to call inference using your new endpoint, see `examples/inference.py`.
- To run the inference example, run `python examples/inference.py`.
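Because the vllm-openai container exposes the OpenAI-compatible API under `/v1`, an inference call can be sketched with the `openai` client as below. This is an illustration, not the repo's actual `examples/inference.py`; the model id and prompt are assumptions, and the HF token is used as the API key.

```python
import os


def build_request(prompt: str) -> dict:
    """Build the chat-completion payload sent to the endpoint."""
    return {
        "model": "microsoft/Phi-3-vision-128k-instruct",  # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }


if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(
        # vLLM serves the OpenAI-compatible routes under /v1.
        base_url=os.environ["HF_ENDPOINT_URL"].rstrip("/") + "/v1",
        # The HuggingFace token doubles as the API key for a protected endpoint.
        api_key=os.environ["HF_TOKEN"],
    )
    response = client.chat.completions.create(
        **build_request("Describe what Phi-3-vision can do.")
    )
    print(response.choices[0].message.content)
```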