This is a fork of https://github.com/latitudegames/GPT-3-Encoder. I made this fork so I could apply some PRs that had been sent to the upstream repo.

Changelog:
    added countTokens function
    updated docs (npm run docs)

GPT-3-Encoder

JavaScript BPE Encoder/Decoder for GPT-2 / GPT-3

About

GPT-2 and GPT-3 use byte pair encoding (BPE) to turn text into a series of integers to feed into the model. This is a JavaScript implementation of OpenAI's original Python encoder/decoder, which can be found in the openai/gpt-2 repository.

Install with npm

npm install @syonfox/gpt-3-encoder

Usage

Compatible with Node >= 12

import {encode, decode, countTokens, tokenStats} from "@syonfox/gpt-3-encoder"
// or
const {encode, decode, countTokens, tokenStats} = require('@syonfox/gpt-3-encoder')

const str = 'This is an example sentence to try encoding out on!'
const encoded = encode(str)
console.log('Encoded this string looks like: ', encoded)

console.log('We can look at each token and what it represents')
for(let token of encoded){
  console.log({token, string: decode([token])})
}

// Example countTokens usage
if (countTokens(str) > 5) {
    console.log("String is over five tokens, inconceivable");
}

const decoded = decode(encoded)
console.log('We can decode it back into:\n', decoded)
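
The import above also exposes tokenStats. The snippet below is a minimal sketch of calling it, assuming it accepts the raw string like countTokens does; its exact signature and return shape aren't documented here, so check the generated docs (npm run docs) and only the raw result is logged:

// tokenStats is also exported by this package; log whatever it returns for the string.
const stats = tokenStats(str)
console.log('Token stats for the example string:\n', stats)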

Developers

git clone https://github.com/syonfox/GPT-3-Encoder.git
cd GPT-3-Encoder

npm install        # install dependencies

npm run test       # run the test suite
npm run docs       # regenerate the docs

less Encoder.js            # browse the main implementation
firefox ./docs/index.html  # view the generated docs

npm publish        # publish a new version to npm

Todo

More stats that work well with this token representation.

Clean up and keep it simple.

More tests.

Performance analysis.

There are several performance improvements that could be made to the encode function (these were suggested by GPT and still need to be vetted):

Cache the results of the encodeStr function to avoid unnecessary computation. You can do this by using a map or an object to store the results of encodeStr for each input string (a rough sketch of this idea appears after this list).
Use a regular expression to match the tokens in the input text instead of using the matchAll function. Regular expressions can be faster and more efficient than matchAll for certain types of patterns.
Use a different data structure to store the byte_encoder and encoder maps. Objects and maps can have different performance characteristics depending on the size and complexity of the data. You may want to experiment with different data structures to see which one works best for your use case.
Use a different data structure to store the bpe_tokens array. Arrays can be slower than other data structures for certain operations, such as appending new elements or concatenating multiple arrays. You may want to consider using a different data structure, such as a linked list or a queue, to store the bpe_tokens array.
Use a different algorithm to compute the BPE codes for the tokens. The current implementation of the bpe function may be inefficient for large datasets or for datasets with complex patterns. You may want to consider using a different algorithm, such as a divide-and-conquer or a hashing-based approach, to compute the BPE codes more efficiently.
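
As a rough illustration of the first suggestion, here is a minimal sketch that applies the same caching idea at the encode call site rather than inside encodeStr. It only assumes the encode export shown in the Usage section; encodeCache and cachedEncode are hypothetical helpers, not part of this library:

const {encode} = require('@syonfox/gpt-3-encoder')

// Hypothetical memoization helper: cache encode() results keyed by the input string.
const encodeCache = new Map()

function cachedEncode(str) {
  if (encodeCache.has(str)) return encodeCache.get(str)  // reuse a previous result
  const tokens = encode(str)
  encodeCache.set(str, tokens)
  return tokens
}

// Repeated calls with the same string now skip the BPE work entirely.
console.log(cachedEncode('This is an example sentence to try encoding out on!').length)
console.log(cachedEncode('This is an example sentence to try encoding out on!').length)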
