Short translated sentences in training sets #1

Open
XinnuoXu opened this issue Mar 25, 2023 · 2 comments

@XinnuoXu

XinnuoXu commented Mar 25, 2023

Hi, I noticed that for br, cy, mt, and ga (ru is fine), the translated texts tend to be one sentence shorter than the original English ones. For example, here is one datapoint from the br training set:

<lex comment="good" lang="br" lid="Id1">E Saldevanahalli, Acharya Dr. Sarvapalli Radharrishnan Road, Hessarghatta Main Road, Bangalore - 560090 a zo lec'hiadur ar Institouenn-Tekoloù Acharya a-seizet e stad Karnataka, Indez e 2000.</lex>
<lex comment="good" lang="br" lid="Id2">Krouiñ e 2000 e c'hastouenn an arlañv an taktoniñ e 2000 e c'eo e c'hastral Saldevanahalli, Acharya e Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore, Karnataka, Indi, 560090.</lex>
<lex comment="good" lang="br" lid="Id3">Ar c'hwec'h a zo bet krouiñ e 2000 a zo bet ar Stitankad an takeladoù Acharya (moto : &quot;Derezh-eñvoudur&quot;) ha e plijout e Soldevanahalli, Acharya e Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore - 560090, Karnataka, Indezia.</lex>
<lex comment="" lang="en" lid="Id4">In Soldevanahalli, Acharya Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore – 560090 is the location of the Acharya Institute of Technology established in the state of Karnataka, India in the year 2000. The Institute, whose motto is &quot;Nurturing Excellence&quot; is affiliated with the Visvesvaraya Technological University in the city of Belgaum.</lex>
<lex comment="" lang="en" lid="Id4">The Acharya Institute of Technology was established in 2000. Its campus is located in Soldevanahalli, Acharya Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore, Karnataka, India, 560090. It is motto is &quot;Nurturing Excellence&quot; and it is affiliated with the Visvesvaraya Technological University in Belgaum.</lex>
<lex comment="" lang="en" lid="Id4">Acharya Institute of Technology (motto: &quot;Nurturing Excellence&quot;) was established in 2000 and is located at Soldevanahalli, Acharya Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore – 560090, Karnataka, India. The Institute is affiliated with Visvesvaraya Technological University of Belgaum.</lex>

For the examples containing more than 3 triples as input, the percentage of cases that are missing sentences is:

| Language | Percentage |
| --- | --- |
| BR | 46.03% |
| CY | 40.27% |
| GA | 53.09% |
| MT | 56% |
| RU | 12.2% |
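
For anyone who wants to run a similar check, here is a minimal sketch of how such a percentage could be estimated. It is not necessarily how the numbers above were produced. It assumes the `<lex>` texts have already been loaded and paired by position (first English reference with first translation, and so on), that entries were already filtered to inputs with more than 3 triples, and that a naive punctuation-based sentence counter is good enough. The names `ABBREVIATIONS`, `count_sentences`, and `missing_sentence_rate` are hypothetical helpers, not part of the dataset tooling.

```python
# Hypothetical abbreviation list so that "Dr." in an address is not
# counted as a sentence end; extend it for real data.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "st"}


def count_sentences(text):
    """Naively count sentences: a token ending in '.', '!' or '?' closes a
    sentence unless it is a known abbreviation in the middle of the text."""
    tokens = text.split()
    count = 0
    for i, token in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if token[-1] in ".!?":
            word = token.rstrip(".!?").lower()
            if word in ABBREVIATIONS and not is_last:
                continue  # e.g. "Dr." inside an address
            count += 1
        elif is_last:
            count += 1  # trailing text without terminal punctuation
    return count


def missing_sentence_rate(pairs):
    """Fraction of (english, translation) pairs whose translation has
    fewer sentences than the English text."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    shorter = sum(
        1 for en, tr in pairs if count_sentences(tr) < count_sentences(en)
    )
    return shorter / len(pairs)


# Toy placeholder pairs (not taken from the dataset).
toy_pairs = [
    ("Founded in 2000. Located on Dr. X Road.", "Founded in 2000."),
    ("Founded in 2000. Located on Dr. X Road.", "Founded in 2000. On Dr. X Road."),
]
print(f"{missing_sentence_rate(toy_pairs):.2%}")
```

A proper sentence splitter for each target language would give more reliable counts than this heuristic, especially for texts ending in quotation marks.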

Would you mind checking them? Cheers : )

@liamcripwell
Collaborator

Hi Xinnuo,

Thank you for reaching out. I had not previously noticed this phenomenon in the training data, but I think this is likely just a natural result of the multilingual NMT system we use to translate from English to the new low-resource languages (br, cy, ga, mt). Performance is quite bad for these languages and so it wouldn't surprise me if decoding stops too early for the longer sequences.

Be aware that this silver training data we provide is likely to be very noisy and we only include it as an optional starting point. If you can think of alternative sources/methods to produce higher quality training data for your system we highly encourage you to do so. Feel free to let me know if you have any further questions.

Cheers,

Liam

@XinnuoXu
Author

Hi Liam, many thanks for the fast response! I'm very appreciative : )
