This is a fork of https://github.com/latitudegames/GPT-3-Encoder. I made this fork so I could apply some PRs that had been sent to the upstream repo.
Changelog:

- Added a `countTokens` function
- Added a `tokenStats` function
- Updated the docs (`npm run docs`)
JavaScript BPE Encoder/Decoder for GPT-2 / GPT-3

GPT-2 and GPT-3 use byte pair encoding (BPE) to turn text into a series of integers, which are then fed into the model. This is a JavaScript implementation of OpenAI's original Python encoder/decoder, which can be found in the openai/gpt-2 repository (https://github.com/openai/gpt-2).
Install with npm:

```sh
npm install @syonfox/gpt-3-encoder
```

Compatible with Node.js >= 12.
Usage:

```js
import {encode, decode, countTokens, tokenStats} from '@syonfox/gpt-3-encoder'
// or
const {encode, decode, countTokens, tokenStats} = require('@syonfox/gpt-3-encoder')

const str = 'This is an example sentence to try encoding out on!'

const encoded = encode(str)
console.log('Encoded this string looks like: ', encoded)

console.log('We can look at each token and what it represents')
for (let token of encoded) {
  console.log({token, string: decode([token])})
}

// Example countTokens usage
if (countTokens(str) > 5) {
  console.log('String is over five tokens, inconceivable')
}

const decoded = decode(encoded)
console.log('We can decode it back into:\n', decoded)
```
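The `tokenStats` export added by this fork is not shown above. Its exact return shape isn't documented here, so the sketch below, assuming it accepts a string like `countTokens` does, simply logs whatever summary it returns:

```js
// Hedged sketch: assumes tokenStats(str) accepts a string, like countTokens.
// The shape of the returned stats object is not documented here, so we just
// log it to inspect what it contains.
const stats = tokenStats(str)
console.log('Token stats:', stats)
```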
To develop locally:

```sh
git clone https://github.com/syonfox/GPT-3-Encoder.git
cd GPT-3-Encoder

npm install     # install dependencies
npm run test    # run the test suite
npm run docs    # build the documentation

less Encoder.js            # read the core implementation
firefox ./docs/index.html  # browse the generated docs
```

To publish a new version to npm:

```sh
npm publish --access public
```
TODO:

- More stats that work well with this token representation
- Clean up and keep it simple
- More tests
- Performance analysis
There are several performance improvements that could be made to the `encode` function (these recommendations came from GPT and still need to be vetted):
- Cache the results of the `encodeStr` function to avoid unnecessary recomputation, e.g. by using a `Map` or an object to store the result of `encodeStr` for each input string (see the memoization sketch after this list).
- Rework how tokens are matched in the input text. `String.prototype.matchAll` allocates a match object per token; a hand-rolled `RegExp.prototype.exec` loop can be faster and more memory-efficient for certain patterns.
- Experiment with the data structures used for the `byte_encoder` and `encoder` maps. Plain objects and `Map`s have different performance characteristics depending on the size and shape of the data, so it is worth benchmarking both for this use case.
- Reconsider how the `bpe_tokens` array is built. Arrays can be slow for certain operations, such as appending new elements or concatenating multiple arrays; an alternative structure such as a linked list or queue might perform better here.
- Use a different algorithm to compute the BPE merges. The current implementation of the `bpe` function may be inefficient for large inputs or inputs with complex patterns; a divide-and-conquer or hashing-based approach could compute the merges more efficiently.
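As a concrete illustration of the first suggestion, here is a minimal memoization sketch. It wraps the public `encode` function (the same idea would apply to the internal `encodeStr` helper); `cachedEncode` and `cache` are hypothetical names introduced for this example, not part of the library:

```js
const {encode} = require('@syonfox/gpt-3-encoder')

// Hypothetical wrapper (not part of the library): remembers encode() results
// so repeated calls with the same string skip the BPE computation entirely.
const cache = new Map()

function cachedEncode(str) {
  if (cache.has(str)) return cache.get(str)
  const tokens = encode(str)
  cache.set(str, tokens)
  return tokens
}

console.log(cachedEncode('This is an example sentence to try encoding out on!')) // computed
console.log(cachedEncode('This is an example sentence to try encoding out on!')) // cache hit
```

Note that an unbounded `Map` grows with every distinct input string, so a real implementation would want an eviction policy (e.g. LRU) to bound memory use.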