llama-cpp multi server support #316
We can already do parallel requests with vLLM, which anyone on Linux should be able to use. So, the goal here is to get higher performance strictly with llama.cpp for Mac users? And the premise behind these changes is that today we are unable to efficiently utilize a Mac with a single llama.cpp instance, so we need to run multiple of them? Is there any upstream llama.cpp discussion or documentation around this to show that a single server cannot saturate typical Mac hardware?
llama-cpp does not support batching, concurrent completion requests, or really anything else to speed our processes up. The only clear solution here is to create our own form of parallelism by running multiple servers at once. Via a `--num-servers` flag from the CLI, a user can spin up 2, 3, or even 4 of the `mistral 7b instruct` models, since they only take about 5GB of RAM. This allows us to split our dataset into batches like we do with vLLM and execute threads running each batch in parallel. Each server handles its own batch.

Signed-off-by: Charlie Doern <[email protected]>
So the idea here is pretty close to what you have captured. Though, it's not that a single llama server can't saturate Mac HW, it's that it can't take more than one large completion request at a time. I can run full SDG on a laptop with a 2-page markdown file in ~2 hours; using debug logging, I can see the completions come back pretty quickly since the chunks of data are quite small. Now, if I turn this up to a 50-page markdown file, gen_spellcheck and gen_knowledge alone take 10 hours.

I tried a couple of things, like seeing if the ThreadPoolExecutor could just work with llama-cpp, but it seems that llama servers can only take one completion request at a time (from what I could tell). So splitting the data across threads that all hit one server didn't work.

Since it seems llama-cpp-python can't natively support parallel completion requests, the only way to get close to the behavior we support with vLLM is to spin up a few servers and kick off threads, where each thread has its own client and works on a subsection of the overall dataset. This allows us to run something like 3 pipeline processes concurrently, each processing a subset of the data. The data is then returned, concatenated, and mixed as usual, cutting the time to roughly a third of what it once was!

If I am wrong about llama not taking multiple completions at once in threads, I can try that again, but I had no luck there.

I have an ilab branch here: https://github.com/cdoern/instructlab/tree/llama-batch showing how this would feed into SDG. @bbrowning
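For readers following along, the approach described in this comment can be sketched roughly as follows. This is an illustration only, not the code in this PR: `split_into_batches`, `run_pipeline`, and `generate` are hypothetical names, the endpoint URLs are assumed, and `run_pipeline` is a stand-in that returns strings where a real pipeline would create a client for its server and send completion requests.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(samples, num_batches):
    """Split samples into num_batches contiguous, near-equal slices."""
    size, extra = divmod(len(samples), num_batches)
    batches, start = [], 0
    for i in range(num_batches):
        # The first `extra` batches each take one leftover sample.
        end = start + size + (1 if i < extra else 0)
        batches.append(samples[start:end])
        start = end
    return batches


def run_pipeline(endpoint, batch):
    """Hypothetical stand-in for one pipeline run against one server.

    A real version would build a client pointed at `endpoint` and send
    one completion request per sample in `batch`.
    """
    return [f"{endpoint} -> {sample}" for sample in batch]


def generate(samples, endpoints):
    """Run one worker thread per server, each on its own slice of the data."""
    batches = split_into_batches(samples, len(endpoints))
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # map() yields results in submission order, so the output order is stable.
        results = pool.map(run_pipeline, endpoints, batches)
        # Concatenate the per-server results back into one dataset.
        return [row for batch_rows in results for row in batch_rows]
```

With three endpoints, each thread only ever talks to its own server, so no server sees more than one request at a time. The speedup is bounded by the slowest batch, so "a third of the time" is the best case for three servers.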
This pull request has merge conflicts that must be resolved before it can be |
I don't think this exactly overlaps with #358. This one is more about speeding up SDG by executing parallel requests against multiple llama-cpp servers.

I agree that SDG should be able to send multiple requests in parallel. I disagree with the approach in this PR of having multiple different OpenAI Clients hitting different endpoints, as that's not typically how servers are load-balanced. In the production case, we'd have a single load-balanced endpoint, and the user should have some knob to control how many SDG requests we execute in parallel against that backend.

Perhaps we need to separate this out into a couple of phases. One phase is providing a knob so users can control how many SDG requests we execute in parallel, which may require rethinking some of the concurrency primitives in use during the data generation loop. The other phase would be giving options in the CLI to spin up multiple llama-cpp-python servers load-balanced behind the uvicorn it's already managing. That would be multiple llama-cpp-python servers, but all behind a single endpoint, so that we're not having to juggle multiple endpoints and multiple separate OpenAI Client instances.
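As a rough sketch of the first phase suggested here (a single endpoint plus a user-controlled concurrency knob), something like the following could work. The names `send_request`, `generate_with_limit`, and `max_parallel_requests` are hypothetical and do not come from this repository, and `send_request` is a placeholder for a real completion call through one shared client.

```python
from concurrent.futures import ThreadPoolExecutor


def send_request(endpoint, sample):
    """Hypothetical stand-in for one completion request to the shared endpoint."""
    return f"{endpoint} -> {sample}"


def generate_with_limit(samples, endpoint, max_parallel_requests=4):
    """Send every sample to one endpoint, at most max_parallel_requests at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
        # The pool size is the knob: it caps the number of in-flight requests.
        return list(pool.map(lambda s: send_request(endpoint, s), samples))
```

In this shape the SDG code knows about only one endpoint. Whether that endpoint is vLLM or several llama-cpp-python servers behind a load balancer becomes a deployment detail.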