Add count tokens function #17

Open
wants to merge 8 commits into
base: master
Changes from 6 commits
30 changes: 24 additions & 6 deletions Encoder.js
@@ -82,11 +82,11 @@ const byte_decoder = {}
 Object.keys(byte_encoder).map(x => { byte_decoder[byte_encoder[x]] = x })

 const bpe_ranks = dictZip(bpe_merges, range(0, bpe_merges.length))
-const cache = {}
+const cache = new Map;

 function bpe(token) {
-  if (token in cache) {
-    return cache[token]
+  if (cache.has(token)) {
+    return cache.get(token)
   }``

   let word = token.split('')
@@ -147,7 +147,7 @@ function bpe(token) {
   }

   word = word.join(' ')
-  cache[token] = word
+  cache.set(token, word)

   return word
 }
@@ -166,6 +166,23 @@ function encode(text) {
   return bpe_tokens
 }

+// This function works by iterating through the matches of the pat pattern in the input text,
+// encoding each match using the encodeStr function and the byte_encoder mapping,
+// and then applying the bpe function to the encoded token. The number of tokens produced by the bpe function is then added to the count variable.
+// Finally, the count variable is returned as the result.
+function countTokens(text) {
+  let count = 0
+  const matches = Array.from(text.matchAll(pat)).map(x => x[0])
+  for (let token of matches) {
+    token = encodeStr(token).map(x => {
+      return byte_encoder[x]
+    }).join('')
+
+    count += bpe(token).split(' ').length
+  }
+  return count
+}
+
 function decode(tokens) {
   let text = tokens.map(x => decoder[x]).join('')
   text = decodeStr(text.split('').map(x => byte_decoder[x]))
@@ -174,5 +191,6 @@ function decode(tokens) {

 module.exports = {
   encode,
-  decode
-};
+  decode,
+  countTokens
+};
9 changes: 8 additions & 1 deletion Encoder.test.js
@@ -34,4 +34,11 @@ test('emojis', () => {
   const str = "hello 👋 world 🌍";
   expect(encode(str)).toEqual([31373, 50169, 233, 995, 12520, 234, 235])
   expect(decode(encode(str))).toEqual(str)
-});
+});
+
+test('properties of Object',()=>{
+  const str = "toString constructor hasOwnProperty valueOf";
+
+  expect(encode(str)).toEqual([1462, 10100, 23772, 468, 23858, 21746, 1988, 5189]);
+  expect(decode(encode(str))).toEqual(str);
+})
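
The 'properties of Object' test appears to target the cache change in Encoder.js: with a plain object, `token in cache` is also true for names inherited from Object.prototype, so a token such as `constructor` looks cached before anything has been stored. Below is a minimal sketch of the difference. `objectCache`, `mapCache`, and the two lookup helpers are hypothetical stand-ins for illustration, not code from this PR.

```javascript
// Hypothetical stand-ins for the two cache styles; not the PR's actual code.
const objectCache = {}
const mapCache = new Map()

// Plain-object lookup: the `in` operator also sees keys inherited
// from Object.prototype (constructor, toString, valueOf, ...).
function lookupWithObject(token) {
  if (token in objectCache) {
    return objectCache[token]
  }
  return undefined
}

// Map lookup: has() only reports keys that were explicitly set().
function lookupWithMap(token) {
  if (mapCache.has(token)) {
    return mapCache.get(token)
  }
  return undefined
}

// Nothing has been stored in either cache yet.
const falseHit = lookupWithObject('constructor') // inherited function, not a cached string
const realMiss = lookupWithMap('constructor')    // undefined, a genuine cache miss

console.log(typeof falseHit)
console.log(realMiss)
```

With the object-based cache, a caller would receive an inherited function where it expects a cached string, which is presumably why the test encodes `toString`, `constructor`, `hasOwnProperty`, and `valueOf`.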
4 changes: 4 additions & 0 deletions README.md
@@ -1,3 +1,7 @@
+# This is a fork of https://github.com/latitudegames/GPT-3-Encoder. I made this fork so I could apply some PRs that had been sent to the upstream repo.

Another spot where we could revert this change to clean things up for merging.

Yep, we don't want to pull in the changes from my fork.

+
+~~~
+
 # GPT-3-Encoder
 Javascript BPE Encoder Decoder for GPT-2 / GPT-3