Provide ability to dynamically allocate all available CPU threads without affecting prior functionality #1364
Merged
abetlen merged 1 commit into abetlen:main from baileytec-labs:main on Apr 23, 2024
Conversation
Owner
Thank you @sean-bailey, I'll take a look at adding this to the …
xhedit pushed a commit to xhedit/llama-cpp-conv that referenced this pull request on Apr 30, 2024:
…desired using `-1` (abetlen#1364)
I have been using the llama-cpp-python server in AWS Lambda functions for some time now. The default behavior of allocating only half the available CPU threads is wise, since a multi-user deployment could otherwise lock things up, but on single-tenant systems like Lambda that concern does not apply.
No original functionality is changed: a user who runs the llama-cpp-python server today with their current configuration gets the same experience as before. If the user instead sets `n_threads` or `n_threads_batch` to -1, then, similar to `n_gpu_layers`, it defaults to using all CPU threads reported by `multiprocessing.cpu_count()`. This is especially effective on AWS Lambda, where the CPU count scales with the memory allocation: if a model needs more memory to perform, it will likely benefit from a higher CPU count as well. As an example that leverages this effectively:
https://github.com/baileytec-labs/llama-on-lambda/tree/main/llama_lambda/llama-cpp-server-container
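A minimal sketch of the intended behavior (the helper name below is illustrative, not the PR's actual code): -1 acts as a sentinel that expands to every available CPU thread, while any other value passes through untouched, which is what preserves the existing defaults.

```python
import multiprocessing

def resolve_thread_count(n_threads: int) -> int:
    # Hypothetical helper mirroring the PR's behavior: -1 means
    # "use every CPU thread"; any other value is passed through so
    # existing configurations behave exactly as before.
    if n_threads == -1:
        return multiprocessing.cpu_count()
    return n_threads

# On a machine (or Lambda memory size) exposing 6 vCPUs, this prints 6:
print(resolve_thread_count(-1))
# Explicit values are untouched, preserving prior behavior:
print(resolve_thread_count(4))  # -> 4
```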
This update adds functionality without interfering with any existing functionality.
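In practice, assuming the server's auto-generated CLI flags for these settings, launching with all threads would look something like `python3 -m llama_cpp.server --model ./model.gguf --n_threads -1 --n_threads_batch -1`; omitting the flags keeps the previous half-of-cores default. Reusing -1 as the "use everything" sentinel also mirrors the existing `n_gpu_layers=-1` convention, so no new configuration vocabulary is introduced.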