Top tips for getting better machine translation results

Authors

Tim Branton

PureFluent CEO

July 28, 2020

Machine Translation keeps getting better. If you’re interested in hearing more about this, take a look at our blog post “Is Neural Machine Translation good enough yet?”. In this post, I want to look at some of the practical steps you can take to get better machine translation results.

Customised engines deliver better machine translation output

Machine Translation Engines (MTEs) do a pretty good job nowadays, but they don’t know about your context, desired style and terminology. If you have a large enough Translation Memory, you can use it to train an engine just for you. You need a minimum of 10,000 sentence pairs for each language combination, and preferably more than that. This is your “training data” and it needs to be good quality human translation – garbage in, garbage out!
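
As a rough illustration of that size check, the sketch below counts sentence pairs per language combination and flags any combination that falls short of the 10,000 minimum. The data layout (tuples of source language, target language, source text, target text) and the function names are assumptions made for this example. A real Translation Memory would normally be exported from your translation tool first.

```python
from collections import Counter

MIN_PAIRS = 10_000  # minimum number of sentence pairs suggested in this post


def pairs_per_combination(segments):
    """Count sentence pairs for each (source language, target language)."""
    return Counter((src_lang, tgt_lang) for src_lang, tgt_lang, _, _ in segments)


def ready_for_training(segments, minimum=MIN_PAIRS):
    """Map each language combination to True if it meets the minimum."""
    counts = pairs_per_combination(segments)
    return {combo: n >= minimum for combo, n in counts.items()}


# Tiny made-up sample; a real memory would hold thousands of pairs.
sample = [
    ("en", "de", "Add to basket", "In den Warenkorb"),
    ("en", "de", "Free delivery on all orders", "Kostenlose Lieferung für alle Bestellungen"),
    ("en", "fr", "Add to basket", "Ajouter au panier"),
]

# A lower minimum is used here only so the tiny sample shows both outcomes.
print(ready_for_training(sample, minimum=2))
```

Counting per language combination matters because a memory that is large overall can still be too thin for one particular pair.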

Improve your training data

In the real world, your training data – those translated sentences – is likely to be a bit messy. You may have variations in capitalisation, extraneous bits of punctuation, bullet points or numbered lists. Good training data meets a few key criteria:

  • Normalised text – you want the plain vanilla version of the text, without ALL CAPS for instance.
  • Longer sentences – the best training data has more than 5 words per sentence.
  • Avoid overly long sentences – keep each sentence below 50 words.
  • No bullets or numbered lists – these will harm the training process.
  • Delete repeating characters – for example duplicate spaces or sequences like “……”
  • Avoid tabs – particularly prevalent where people have tried to manually create a Table of Contents in Word.
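
The criteria above can be sketched as a small cleaning pass over (source, target) sentence pairs. This is only an illustration under stated assumptions, not a description of any particular tool: the function names are hypothetical, ALL CAPS segments are dropped rather than re-cased (automatic re-casing would damage proper nouns), and word counts rely on spaces, so the length check would need adapting for languages written without them.

```python
import re

MIN_WORDS = 6   # "more than 5 words per sentence"
MAX_WORDS = 49  # "below 50 words"

# A leading bullet character, or list numbering such as "1." or "2)".
LIST_MARKER = re.compile(r"^\s*(?:[•*\-–]|\d+[.)])\s+")
# Runs of two or more dots, ellipsis characters or underscores.
REPEATS = re.compile(r"\.{2,}|…{2,}|_{2,}")
# Any run of whitespace, which includes tabs and duplicate spaces.
WHITESPACE = re.compile(r"\s+")


def clean(text):
    """Strip list markers, repeated characters, tabs and duplicate spaces."""
    text = LIST_MARKER.sub("", text)
    text = REPEATS.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def acceptable(text):
    """Check the length limits and reject ALL CAPS segments."""
    n_words = len(text.split())
    return MIN_WORDS <= n_words <= MAX_WORDS and not text.isupper()


def clean_pairs(pairs):
    """Return cleaned pairs where both source and target pass the checks."""
    cleaned = ((clean(src), clean(tgt)) for src, tgt in pairs)
    return [(s, t) for s, t in cleaned if acceptable(s) and acceptable(t)]


# Made-up examples: one messy but usable pair, one ALL CAPS, one too short.
pairs = [
    ("  • Free   delivery\ton all orders over fifty pounds......",
     "Kostenlose Lieferung für alle Bestellungen über fünfzig Pfund"),
    ("ADD TO BASKET NOW AND SAVE MONEY",
     "Jetzt in den Warenkorb legen und Geld sparen"),
    ("Add to basket", "In den Warenkorb"),
]

for source, target in clean_pairs(pairs):
    print(source, "|", target)
```

Only the first pair survives: the bullet, tab, duplicate spaces and trailing dots are removed, while the ALL CAPS and three-word segments are filtered out.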

We’re currently developing a tool, which we’re calling “Laundry”, that will “clean” your Translation Memory so that it provides higher quality training data. More news on that soon!

Incorporate your Terminology into the MT output

Some Machine Translation Engines, such as Google AutoML, allow you to incorporate custom terminology into the Machine Translation process. This is important because even with good training data, the MTE is likely to get specific items of terminology wrong and may do unhelpful things like translating brand names. The terminology process overlays the original MT output with your specific terminology preferences, reducing the work required by the translator.
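
To see the kind of error a glossary protects against, here is a minimal sketch that flags MT output where a required term is missing. It does not call any engine or reproduce any vendor's terminology feature. The glossary format, the brand name “Acme Glow” and the function name are all made up for illustration.

```python
def missing_terms(source, mt_output, glossary):
    """Return glossary entries whose source term occurs in `source`
    but whose required target term is absent from `mt_output`.

    Matching is a plain case-insensitive substring test, so "basket"
    would also match "basketball"; a real check would respect word
    boundaries and inflected forms.
    """
    src = source.casefold()
    out = mt_output.casefold()
    return [
        (term, required)
        for term, required in glossary.items()
        if term.casefold() in src and required.casefold() not in out
    ]


# Hypothetical English-to-German terminology preferences.
glossary = {
    "Acme Glow": "Acme Glow",  # brand name: must stay untranslated
    "basket": "Warenkorb",
}

source = "Add Acme Glow to your basket"
mt_output = "Fügen Sie Acme Glühen zu Ihrem Warenkorb hinzu"  # brand name was translated

print(missing_terms(source, mt_output, glossary))
```

Here the check reports the brand name, because the engine translated it, while the correctly rendered “Warenkorb” passes.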

PEMT = MTPE = Post-Editing Machine Translation

“The translator” I hear you cry! Yes, I’m afraid you still need a human translator for most of your translated content. Sometimes called the Post-Editor, this translator checks the MT output and corrects it where required. The goal of the steps above is to reduce the effort required by the Post-Editor as far as possible.
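
One rough way to see whether those steps are paying off is to compare the raw MT output with the post-edited version. The sketch below uses Python's standard `difflib` to score how much of the MT output survived post-editing. It is a simple proxy chosen for this example, not an industry-standard metric, and the function name is hypothetical.

```python
from difflib import SequenceMatcher


def post_edit_effort(mt_output, post_edited):
    """Return 0.0 when the post-editor changed nothing, rising towards 1.0
    as fewer words of the MT output survive in the final text."""
    mt_words = mt_output.split()
    pe_words = post_edited.split()
    similarity = SequenceMatcher(None, mt_words, pe_words, autojunk=False).ratio()
    return 1.0 - similarity


# Made-up segment: the post-editor replaced one word out of three.
print(round(post_edit_effort("the cat sat", "the dog sat"), 2))
```

Tracking a figure like this per engine or per content type over time gives a sense of whether customisation is reducing the post-editor's workload.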

Some Machine Translation Engines are better than others

Let me be more specific. Some Machine Translation Engines (MTEs) have better results for specific language combinations and domains. Memsource have an interesting approach to this conundrum. They assess the performance of different engines for specific language combinations and domains and use this to suggest the optimal engine for each specific project. The bottom line is: don’t assume that one engine is “the best”. Be prepared to use multiple engines for the best overall performance. If you’re interested in this subject, I really recommend reading the latest Machine Translation Report from Memsource.
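
The idea of choosing an engine per project can be sketched as a simple lookup. This is not how Memsource implements it. The engine names and quality scores below are invented, and in practice the scores would come from your own evaluation of each engine.

```python
# Hypothetical quality scores (higher is better) per language pair and domain.
SCORES = {
    ("en-de", "legal"): {"engine_a": 0.71, "engine_b": 0.78},
    ("en-de", "medical"): {"engine_a": 0.80, "engine_b": 0.74},
    ("en-ja", "legal"): {"engine_a": 0.62, "engine_b": 0.60},
}


def best_engine(scores, lang_pair, domain, default="engine_a"):
    """Pick the highest-scoring engine, or `default` when nothing was measured."""
    by_engine = scores.get((lang_pair, domain))
    if not by_engine:
        return default
    return max(by_engine, key=by_engine.get)


print(best_engine(SCORES, "en-de", "legal"))    # engine_b
print(best_engine(SCORES, "en-de", "medical"))  # engine_a
```

Even in this toy table no single engine wins everywhere, which is the point of keeping more than one available.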

Generic, domain-specific engines can deliver better Machine Translation

If you don’t have a large enough corpus of existing translations, don’t despair. Machine Translation providers like Microsoft and ModernMT have domain-specific engines which use large volumes of existing translations for domains such as legal, medical and industrial. This is a good shortcut to achieving better Machine Translation results, as the MTE is more likely to get the terminology right for that domain.

This is a big subject and can seem out of reach if you’re not a big translation buyer. The good news is that the entry point for Machine Translation is coming down all the time. If you want to discuss whether it might work for you, get in touch and we’ll be happy to explore the options with you.

About the authors

Tim Branton

Tim Branton is PureFluent's CEO and a passionate advocate for the role of technology in the language industry. He has 30 years of business experience across the chemicals, telecoms, business services and software sectors in the UK, Singapore, Japan, China and South Africa.

