server: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless --reasoning-format none #11607
Conversation
@WangxuP Based on my (limited) understanding of the delta format used by OpenAI (incl. for tool calls), the "correct" way to stream thoughts back would be to hold off on anything that might be an opening …
| nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja | llama 3.x tool calls (w/ builtin tools) |
| openchat-openchat-3.5-0106.jinja | generic tool calls |
| teknium-OpenHermes-2.5-Mistral-7B.jinja | generic tool calls |
| Almawave-Velvet-14B.jinja | Hermes 2 Pro |
Just noting here (no need to take any action right now), but this README file is now too long and hard to follow for new users. I'm planning to break it into smaller files (like what we did with the docs directory). Potentially we will end up with main API docs, tool-calling docs and development docs.
Sounds great!! Happy to help with this (if only by reviewing).
// Distill Qwen 7B & 32B models seem confused re/ syntax of their tool call opening tag,
// so we accept common variants (then it's all constrained)
builder.add_rule("root",
    "( \"<|tool▁calls▁begin|>\" | \"<|tool_calls_begin|>\" | \"<|tool calls begin|>\" | \"<|tool\\\\_calls\\\\_begin|>\" ) "
Small nit: if you're doing multiple string concatenations, it's better to use std::ostringstream to reduce the number of copies.
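
For illustration, a minimal sketch of that suggestion (the build_root_rule helper and its variants parameter are made up here, not from the PR):

#include <sstream>
#include <string>
#include <vector>

// Collect the alternatives into a std::ostringstream instead of chaining
// std::string concatenations, each of which may allocate and copy.
static std::string build_root_rule(const std::vector<std::string> & variants) {
    std::ostringstream out;
    out << "( ";
    for (size_t i = 0; i < variants.size(); i++) {
        if (i > 0) {
            out << " | ";
        }
        out << "\"" << variants[i] << "\"";
    }
    out << " ) ";
    return out.str();
}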
Fair point; for now I've been favouring readability, but I'll keep this in mind when doing an optimization pass. Depending on how much this all ends up costing, we might want to cache various bits and/or create a grammar DSL that would bypass the string stage altogether; JSON schema conversion has lots of room for optimization, and I'd also like to take the llguidance stuff into account: exciting prospects!
common/chat.cpp (Outdated)
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
    std::vector<std::string> tool_rules;
    foreach_function(inputs.tools, [&](const json & tool) {
        const auto & function = tool["function"];
Better to use .at() instead of operator[] when possible, as explained in https://github.com/nlohmann/json:
"In function from_json, use function at() to access the object values rather than operator[]. In case a key does not exist, at throws an exception that you can handle, whereas operator[] exhibits undefined behavior."
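
A standalone illustration of the difference (not the PR's code; only assumes nlohmann/json):

#include <iostream>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

int main() {
    const json tool = {{"function", {{"name", "get_weather"}}}};

    // .at() throws json::out_of_range for a missing key, which can be handled;
    // operator[] on a const json with a missing key is undefined behavior.
    try {
        const auto & function = tool.at("function");
        std::cout << function.at("name") << "\n";   // prints "get_weather"
        std::cout << tool.at("parameters") << "\n"; // missing key -> throws
    } catch (const json::out_of_range & e) {
        std::cerr << "missing key: " << e.what() << "\n";
    }
}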
Updated, thanks!
Co-authored-by: Georgi Gerganov <[email protected]>
Yes, I would say this will be a hard approach, especially because each model has its own format, so we can't really rely on regex much in stream mode. Indeed, I assume that most of the complication will be about moving away from regex and using some kind of "state machine" to keep track of the generated text. From this perspective, I'm wondering: is it worth inventing our own implementation of regex? Regex is just a state machine under the hood, so by doing this we can fully manage the state on our own.

Written as (pseudo-)code, my idea looks like:

struct chat_regex tool_regex;
tool_regex.add_literal("<tool_name>")
tool_regex.add_string()
tool_regex.add_literal("</tool_name><tool_data>")
tool_regex.add_json()
tool_regex.add_literal("</tool_data>")
tool_regex.end()

Which will be compiled into:

flowchart LR
  F@{ shape: dbl-circ, label: "end" }
  A --<tool_name>--> B
  B --string--> C
  C --</tool_name><tool_data>--> D
  D --json--> E
  E --</tool_data>--> F

We create a "state" object each time we want to use it (i.e. store it into the server slot):

slot.chat_parser_state = chat_parser_state(tool_regex); // initial state A
slot.chat_parser_state << slot.generated_text; // where "generated_text" is the "delta" generated content
slot.chat_parser_state.get_matched(); // not sure yet what it should return
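
For what it's worth, a very simplified C++ sketch of that idea (every name here is hypothetical, and add_json() is reduced to free text to keep it short; a FREE_TEXT step is assumed to always be followed by a LITERAL step, as in the pattern below):

#include <string>
#include <utility>
#include <vector>

struct chat_step {
    enum kind_t { LITERAL, FREE_TEXT } kind;
    std::string lit; // literal text (unused for FREE_TEXT)
};

struct chat_parser_state {
    std::vector<chat_step> steps;
    size_t step    = 0;   // index of the current step
    size_t lit_pos = 0;   // how much of the current literal is already matched
    std::string captured; // free text captured so far (a real impl. would capture per step)

    explicit chat_parser_state(std::vector<chat_step> s) : steps(std::move(s)) {}

    // Feed a generated delta; returns false on a definite mismatch.
    bool feed(const std::string & delta) {
        for (char c : delta) {
            if (step >= steps.size()) {
                return false; // trailing text after the final state
            }
            if (steps[step].kind == chat_step::LITERAL) {
                if (steps[step].lit[lit_pos] != c) {
                    return false;
                }
                if (++lit_pos == steps[step].lit.size()) {
                    step++;
                    lit_pos = 0;
                }
            } else {
                // FREE_TEXT: accumulate until the tail of the capture equals the next literal.
                captured += c;
                const std::string & next = steps[step + 1].lit;
                if (captured.size() >= next.size() &&
                    captured.compare(captured.size() - next.size(), next.size(), next) == 0) {
                    captured.erase(captured.size() - next.size()); // keep only the free text
                    step += 2; // skip the FREE_TEXT step and the literal we just matched
                    lit_pos = 0;
                }
            }
        }
        return true;
    }
};

// Example, mirroring the pseudo-code above:
//   chat_parser_state st({
//       {chat_step::LITERAL,   "<tool_name>"},
//       {chat_step::FREE_TEXT, ""},
//       {chat_step::LITERAL,   "</tool_name><tool_data>"},
//       {chat_step::FREE_TEXT, ""},   // stands in for add_json()
//       {chat_step::LITERAL,   "</tool_data>"},
//   });
//   st.feed(delta); // call once per generated "delta" of text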
@ngxson Yesss!! 🫡🫡🫡🫡🫡 So, my original dream was to write a recursive descent / backtracking parser based on the existing GBNF grammars, and use a crude naming convention to extract data out of rules.

A bit ad hoc and magic, but it needs only very limited modifications in the tool call code (just stick to a consistent naming, and delete all the regexp code) and a pretty simple parser to implement (hooks can be added to make it generic w.r.t. the naming convention, to extract any kind of data). It would also make it trivial to support partial extractions / streaming by memorizing the parsing stack (& extracted variables) that consumed the longest text (when parsing fails).

(And +1 to keeping the state in the slot, although TBD whether that should be a parsing stack state - the first stack that failed because of an EOF? - or just the JSON tree of the last response returned, doing a React-style full-DOM diff at every step; much slower but might be safer, to be investigated.)

If we agree to explore this route, I might start by refactoring the grammar parsing code to output an easier intermediate grammar AST that can then be used directly by the recursive descent parser (and be trivially converted to the pointer-heavy sampling grammar structure).
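
Purely for illustration, a tiny sketch of what such a naming-convention hook might look like (the actual convention was not shown above; the "-capture" suffix and on_rule_matched are placeholders): when the recursive descent parser finishes a rule whose name carries the marker, record the text that rule consumed.

#include <map>
#include <string>

static void on_rule_matched(const std::string & rule_name, const std::string & consumed,
                            std::map<std::string, std::string> & captures) {
    static const std::string marker = "-capture";
    if (rule_name.size() > marker.size() &&
        rule_name.compare(rule_name.size() - marker.size(), marker.size(), marker) == 0) {
        captures[rule_name.substr(0, rule_name.size() - marker.size())] = consumed;
    }
}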
@ngxson We could also explore this kind of syntax to build a DSL that creates the dual-use GBNF grammar (possibly also an llguidance grammar). cc @mmoskal @HanClinto
I don't really understand the second part of your phrase about "first stack that failed because of an EOF", but IMO storing the parsing stack is fine. The React-style diff may sound intuitive/safer, but I think …
I had a quick look at all of the regexes you're currently using in … Most of your regex(es) use … And to make it look even nicer, we can use cpp operator overloading, for example:

tool_regex -> literal("<tool_name>") -> string() -> literal("</tool_name>");

Another benefit of this approach is that some expressions can also be optimized at compile time.

Edit: on second thought, using …
@ochafik right! llguidance already supports streaming and emitting capture groups for subgrammars; it will even know to only emit "foo" when the tokens so far are "foo<", but then emit "<text" when "text" is sampled (and not "tool"). There is also some code in there to support general stop regexes using a (lazy) state machine.

Note that as people develop tool calling more in models, they are likely to use special tokens for tool calling, JSON mode etc. Not sure GBNF handles that particularly well (that is the difference between "<|foo|>" and "<|" "foo" "|>").
Most of the grammar and tool functionalities are way over my head and I cannot provide very meaningful feedback, but overall I think the … One more thought is that long-term we can also think about moving some core functionality about grammars to …
@ochafik here's how the lazy matching is handled in llguidance (see also the docs):

import llguidance
import huggingface_hub
import json
lark_grm = """
start: "<tool_name>" name "<tool_data>" data "</tool_data>"
name[capture, suffix="</tool_name>"]: /.*/
data[capture]: %json {
"properties": {
"foo": { "type": "string" }
},
"required": ["foo"]
}
"""
def main():
    tok_name = huggingface_hub.hf_hub_download(
        "microsoft/Phi-3.5-mini-instruct", "tokenizer.json"
    )
    with open(tok_name, "r") as f:
        text = f.read()
    tok = llguidance.LLTokenizer(text)
    interp = llguidance.LLInterpreter(
        tok,
        json.dumps({"grammars": [{"lark_grammar": lark_grm}]}),
        enable_ff_tokens=False,
        enable_backtrack=False,
        log_level=1,
    )
    interp.start_without_prompt()
    toks = tok.tokenize_str("<tool_name>foo<bar></tool_name><tool_data>{\"foo\": \"bar\"}</tool_data>")
    for t in toks:
        mask, r = interp.compute_mask()
        obj = json.loads(r)
        for p in obj["progress"]:
            if p["object"] != "text":
                print("\n  ", end="")
                print(p)
        # feeding token now
        print(tok.dbg_tokens([t]), end=" ")
        interp.commit_token(t)
        print("\n")

if __name__ == "__main__":
    main()

When you run it, you get:
The captures are generated immediately after getting enough tokens.

If the model uses special tokens, you need to write the grammar slightly differently:

start: <|assistant|> name <|end|> /\s*/ data
name[capture]: /.*/
data[capture]: %json {
    "properties": {
        "foo": { "type": "string" }
    },
    "required": ["foo"]
}

Note the lack of …
Adds a `--reasoning-format FORMAT` flag that populates `message.reasoning_content` in the response, using the native `<think>` tags for DeepSeek R1 and `<|START_THINKING|>` for Command R7B if the format is `deepseek` (default), and otherwise leaving thinking traces as they are in `message.content` if the format is `none`.

The `tool_plan` field added temporarily in tool-call: support Command R7B (+ return tool_plan "thoughts" in API) #11585 is folded into the new `reasoning_content` (non-streaming api only for now).
Usage

- Get and build this PR's branch
- Run with (add `--verbose` to inspect prompt / grammars used):
- Call the API and profit
Show result w/ `DeepSeek-R1-Distill-Qwen-32B-GGUF:Q6_K_L`
Which is this code:
Not too bad, but it didn't do lower-case and word split is a bit poor.
Trying again w/ the following extra args to make the sampling greedy:
We have a winner:
And the thoughts:
Implementation notes

- sync: minja #11641
- … `<|tool▁output▁end|>` or `<|tool▁call▁end|>` (need to close the list of outputs / calls w/ the plural `<|tool▁outputs▁end|>` / `<|tool▁calls▁end|>`, respectively, and then missing end of sentence + optional add_generation_prompt); see `models/templates/llama-cpp-deepseek-r1.jinja`
- `--reasoning-format` flag, which controls output of `reasoning_content` in the API (see `test_thoughts`)
- `<think>...</think>` tags for DeepSeek R1 and `<|START_THINKING|>...<|END_THINKING|>` for Command R7B (see the sketch after this list)
- `tool_plan` field now flowing into `reasoning_content` (was added in tool-call: support Command R7B (+ return tool_plan "thoughts" in API) #11585)
- `test_calc_result` (checking models make some use of tool call results, which some struggle a bit with)
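
As an aside, a rough sketch of how a `<think>...</think>` block could be peeled off into `reasoning_content`, leaving the rest in `content` (the `split_reasoning` helper is hypothetical, not the PR's actual parser):

#include <string>
#include <utility>

// Split a raw completion into (reasoning_content, content) by peeling off a
// leading <think>...</think> block; anything else is left untouched.
static std::pair<std::string, std::string> split_reasoning(const std::string & text) {
    const std::string open  = "<think>";
    const std::string close = "</think>";
    if (text.rfind(open, 0) == 0) {                   // text starts with <think>
        const size_t end = text.find(close, open.size());
        if (end != std::string::npos) {
            std::string reasoning = text.substr(open.size(), end - open.size());
            std::string content   = text.substr(end + close.size());
            return {reasoning, content};
        }
    }
    return {"", text};                                // no complete think block
}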
TODOs:

- tool-call: allow `--chat-template chatml` w/ `--jinja`, default to chatml upon parsing issue, avoid double bos (#11616)
- tool-call: command r7b fix for normal responses (#11608)

Possible follow ups

- Document differences between stream & non-stream modes (thought & tool_plan not sent in stream)
- Look at the Llama distill more closely (see Eval bug: trivial grammar crashes (DeepSeek R1 Distill Llama 8B) #11591)
- Reintroduce forced thinking in the generic handler under some `--reasoning` flag (+ explore @ngxson's idea to support a `disabled` value that biases thinking tags)