Short translated sentences in training sets #1

XinnuoXu · 2023-03-25T14:46:35Z

Hi, I noticed that for br, cy, mt, and ga (ru is fine), the translated texts tend to be one sentence shorter than the original English ones. For example, one datapoint from the br training set:

<lex comment="good" lang="br" lid="Id1">E Saldevanahalli, Acharya Dr. Sarvapalli Radharrishnan Road, Hessarghatta Main Road, Bangalore - 560090 a zo lec'hiadur ar Institouenn-Tekoloù Acharya a-seizet e stad Karnataka, Indez e 2000.</lex>
<lex comment="good" lang="br" lid="Id2">Krouiñ e 2000 e c'hastouenn an arlañv an taktoniñ e 2000 e c'eo e c'hastral Saldevanahalli, Acharya e Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore, Karnataka, Indi, 560090.</lex>
<lex comment="good" lang="br" lid="Id3">Ar c'hwec'h a zo bet krouiñ e 2000 a zo bet ar Stitankad an takeladoù Acharya (moto : &quot;Derezh-eñvoudur&quot;) ha e plijout e Soldevanahalli, Acharya e Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore - 560090, Karnataka, Indezia.</lex>
<lex comment="" lang="en" lid="Id4">In Soldevanahalli, Acharya Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore – 560090 is the location of the Acharya Institute of Technology established in the state of Karnataka, India in the year 2000. The Institute, whose motto is &quot;Nurturing Excellence&quot; is affiliated with the Visvesvaraya Technological University in the city of Belgaum.</lex>
<lex comment="" lang="en" lid="Id4">The Acharya Institute of Technology was established in 2000. Its campus is located in Soldevanahalli, Acharya Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore, Karnataka, India, 560090. It is motto is &quot;Nurturing Excellence&quot; and it is affiliated with the Visvesvaraya Technological University in Belgaum.</lex>
<lex comment="" lang="en" lid="Id4">Acharya Institute of Technology (motto: &quot;Nurturing Excellence&quot;) was established in 2000 and is located at Soldevanahalli, Acharya Dr. Sarvapalli Radhakrishnan Road, Hessarghatta Main Road, Bangalore – 560090, Karnataka, India. The Institute is affiliated with Visvesvaraya Technological University of Belgaum.</lex>

For examples with more than 3 triples as input, the percentage of cases with missing sentences is:

Language	Percentage
BR	46.03%
CY	40.27%
GA	53.09%
MT	56%
RU	12.2%
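
For anyone who wants to reproduce a check like this, here is a minimal sketch of one way to estimate such percentages. It is not necessarily how the numbers above were computed. It assumes the English and translated texts have already been extracted from the `<lex>` elements and paired up, and it uses a naive sentence counter; `ABBREVIATIONS`, `count_sentences`, and `percent_shorter` are hypothetical names introduced only for illustration.

```python
# Abbreviations that end in a period but do not end a sentence.
# Illustrative only, not exhaustive.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "st."}


def _ends_sentence(token):
    """Return True if a whitespace-delimited token looks like a sentence end."""
    core = token.rstrip("\"')]")  # ignore closing quotes and brackets
    return core.endswith((".", "!", "?")) and core.lower() not in ABBREVIATIONS


def count_sentences(text):
    """Naively count sentences by counting sentence-final tokens."""
    tokens = text.split()
    if not tokens:
        return 0
    count = sum(1 for tok in tokens if _ends_sentence(tok))
    # Trailing text without final punctuation still counts as a sentence.
    if not _ends_sentence(tokens[-1]):
        count += 1
    return count


def percent_shorter(pairs):
    """Percentage of (english, translated) pairs where the translation
    has fewer sentences than the English text."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    shorter = sum(
        1 for en, tr in pairs if count_sentences(tr) < count_sentences(en)
    )
    return 100.0 * shorter / len(pairs)
```

A heuristic counter like this will miscount around abbreviations it does not know about, so a proper sentence segmenter would be a safer choice for anything beyond a rough estimate.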

Would you mind checking them? Cheers : )

liamcripwell · 2023-03-25T15:25:50Z

Hi Xinnuo,

Thank you for reaching out. I had not previously noticed this phenomenon in the training data, but I think this is likely just a natural result of the multilingual NMT system we use to translate from English to the new low-resource languages (br, cy, ga, mt). Performance is quite bad for these languages and so it wouldn't surprise me if decoding stops too early for the longer sequences.

Be aware that this silver training data we provide is likely to be very noisy and we only include it as an optional starting point. If you can think of alternative sources/methods to produce higher quality training data for your system we highly encourage you to do so. Feel free to let me know if you have any further questions.

Cheers,

Liam

XinnuoXu · 2023-03-25T15:36:36Z

Hi Liam, many thanks for the fast response! I'm very appreciative : )
