>>> nmstoker
[February 22, 2021, 12:20am]
Further to the idea floated here: Adding custom words to language
model
I was wondering if I could submit a PR for this? I wanted to sound out
the team first to avoid wasting time (e.g. if my proposed approach isn't
optimal) and to hear if there's anything particular I should try to do
with it.
There are a few ways this could be handled, but I thought it might be
minimally disruptive if I simply let the user specify multiple input
texts, so it would cycle over all of them.
I've got a basic version hacked together. I keep the CLI parameters the
same but on the end I've added one to let you specify a delimiter (which
defaults to a comma).
Then all it does is split --input_text on the delimiter and create the
vocab and lower.txt files for each input text.
Right now I'm using the same top K for all, but if this looks worth
pursuing I'd make it so the top K was split similarly to the input text
parameter.
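To make the splitting idea concrete, here is a minimal sketch, not the actual generate_lm.py code. The helper name split_inputs and the rule that a single top K value is reused for every input are assumptions made for illustration only.

```python
def split_inputs(input_text, top_k, delimiter=","):
    """Pair each input text path with a top-k value.

    Hypothetical helper: a single top_k is reused for every input,
    otherwise one top_k per input text is required.
    """
    paths = [p.strip() for p in input_text.split(delimiter) if p.strip()]
    ks = [int(k) for k in str(top_k).split(delimiter) if k.strip()]
    if len(ks) == 1:
        # Same top-k for all inputs (the current behaviour described above).
        ks = ks * len(paths)
    if len(ks) != len(paths):
        raise ValueError(
            f"Got {len(paths)} input texts but {len(ks)} top_k values"
        )
    return list(zip(paths, ks))
```

Each (path, top_k) pair could then be fed through the existing vocab-building step in turn.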
With the example I'm testing with, it doesn't make a big difference that
the top K isn't input specific, as the second input text has far fewer
than 500k words (so all of its words get included). But I'm thinking
there could be scenarios where it's good to be able to set this per
input text.
The other minor thing I was going to do was have it check the KenLM
path up front. It's annoying to find you've messed up the path only
after it has processed the vocab, which takes a bit of time even on a
fast PC, and then have it crash.
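A rough sketch of that up-front check, under some assumptions: the function name missing_kenlm_bins is made up, and the tool names lmplz, filter and build_binary are the usual KenLM executables, which may not match exactly what the script calls.

```python
import os


def missing_kenlm_bins(kenlm_bins, names=("lmplz", "filter", "build_binary")):
    """Return the KenLM tools that are missing or not executable.

    Hypothetical helper; the tool names are an assumption.
    """
    return [
        name
        for name in names
        if not os.access(os.path.join(kenlm_bins, name), os.X_OK)
    ]
```

Calling this before the vocab is built, and exiting with a clear message if the returned list is non-empty, would surface a bad path within a second rather than after the slow step.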
Anything else I should consider?
Any objections if I switch the .format strings to f-strings? (or is that
best done separately/not at all?)
[This is an archived TTS discussion thread from discourse.mozilla.org/t/discuss-potential-pr-related-to-generate-lm-py]