
How to develop a C++ tokenizer for MarianMT #418

Open
Zapotecatl opened this issue Sep 16, 2023 · 1 comment


Hi,

My intention is to develop a C++ project in Visual Studio (Windows) that runs the MarianMT model (exported to ONNX) to translate from Spanish to English: https://huggingface.co/Helsinki-NLP/opus-mt-es-en. For this reason, I want to develop a C++ tokenizer based on SentencePiece (https://github.com/google/sentencepiece).

I built the SentencePiece static library and configured it in my Visual Studio project, and I used the source.spm file as the model. My program and its output are below:

#include <iostream>
#include <string>
#include <vector>
#include <sentencepiece_processor.h>

int main()
{
    sentencepiece::SentencePieceProcessor processor;
    const auto status = processor.Load("D:\\SentencePiece\\source.spm");

    if (!status.ok()) {
        std::cerr << status.ToString() << std::endl;
        return 1;
    }

    // Encode into subword pieces.
    std::vector<std::string> pieces;
    processor.Encode("Hola mi amor", &pieces);
    for (const std::string& token : pieces) {
        std::cout << token << std::endl;
    }

    // Encode into SentencePiece IDs.
    std::vector<int> ids;
    processor.Encode("Hola mi amor", &ids);
    for (const int id : ids) {
        std::cout << id << std::endl;
    }
}

Output

▁Hola
▁mi
▁amor
868
64
866

The tokenization into pieces looks correct. However, my problem is with the IDs: my Python program produces the correct IDs.

from transformers import AutoTokenizer, MarianMTModel

src = "es"  # source language
trg = "en"  # target language
model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

sample_text = "Hola mi amor"
batch = tokenizer([sample_text], return_tensors="pt")
print(batch)

Output

{'input_ids': tensor([[2119, 155, 1821, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

Am I interpreting something wrong? Could you please give me a suggestion on how to proceed?


ZJaume commented Jan 27, 2025

I don't know if you solved this, but I ran into a similar issue that may explain yours. Some (if not all) OpusMT models do not use the SentencePiece vocabulary integrated into Marian; SentencePiece is used only as a tokenizer. Therefore the SentencePiece IDs may not match the Marian vocabulary IDs. The tokenizer in the Transformers library is probably returning the Marian vocabulary IDs, not the SentencePiece IDs.
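If that is the cause, one possible way forward is to use SentencePiece only to split the text into pieces and then map each piece to the Marian vocabulary yourself. Below is a minimal sketch of that idea, not a verified solution: it assumes you have downloaded vocab.json from the Helsinki-NLP/opus-mt-es-en repository (a flat piece-to-ID object), that the nlohmann/json library is available, and that 1 and 0 are reasonable fallbacks for <unk> and </s> (the trailing 0 in the Python output above is presumably the appended </s> token).

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>
#include <sentencepiece_processor.h>

int main()
{
    // Use SentencePiece only to produce pieces, not IDs.
    sentencepiece::SentencePieceProcessor processor;
    if (!processor.Load("D:\\SentencePiece\\source.spm").ok()) {
        return 1;
    }

    std::vector<std::string> pieces;
    processor.Encode("Hola mi amor", &pieces);

    // Look each piece up in the Marian vocabulary (vocab.json from the
    // Hugging Face model repo, assumed to be a flat {"piece": id} object).
    std::ifstream in("vocab.json");
    const nlohmann::json vocab = nlohmann::json::parse(in);

    std::vector<int> ids;
    for (const std::string& piece : pieces) {
        const auto it = vocab.find(piece);
        if (it != vocab.end()) {
            ids.push_back(it->get<int>());
        } else {
            ids.push_back(vocab.value("<unk>", 1));  // assumed <unk> fallback
        }
    }
    ids.push_back(vocab.value("</s>", 0));  // assumed EOS id appended by Marian

    for (const int id : ids) {
        std::cout << id << std::endl;
    }
}

If the assumption about the vocabulary holds, the printed IDs for "Hola mi amor" should match the Transformers output (2119, 155, 1821, 0) rather than the raw SentencePiece IDs (868, 64, 866).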
