Everyone knows that Machine Translation is getting better. If you’re interested in hearing more about this, take a look at our blog post “Is Neural Machine Translation good enough yet?”. In this post, I want to look at some of the practical steps you can take to get better machine translation results.
Customised engines deliver better machine translation output
Machine Translation Engines (MTEs) do a pretty good job nowadays, but they don’t know about your context, desired style and terminology. If you have a large enough Translation Memory, you can use it to train an engine just for you. You need a minimum of 10,000 sentence pairs for each language combination, and preferably more than that. This is your “training data” and it needs to be good quality human translation – garbage in, garbage out!
Improve your training data
In the real world, your training data – those translated sentences – is likely to be a bit messy. You may have variations in capitalisation, extraneous bits of punctuation, bullet points or numbered lists. Good training data meets a few key criteria:
- Normalised text – you want the plain vanilla version of the text, without ALL CAPS for instance.
- Longer sentences – sentences of more than 5 words make the best training data.
- Avoid overly long sentences – keep each sentence below 50 words.
- No bullets or numbered lists – these will harm the training process.
- Delete repeating characters – for example duplicate spaces or sequences like “……”
- Avoid tabs – particularly prevalent where people have tried to manually create a Table of Contents in Word.
We’re currently developing a tool, which we’re calling “Laundry”, that will “clean” your Translation Memory so that it provides higher quality training data. More news on that soon!
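In the meantime, the criteria above can be roughed out in a few lines of Python. This is only a minimal sketch and not the Laundry tool: the function names are made up for illustration, the word counts assume languages that separate words with spaces, and the thresholds simply mirror the 5 and 50 word limits in the list.

```python
import re

# Hypothetical helpers for illustration only; thresholds follow the list above.
MIN_WORDS = 5    # keep sentences longer than 5 words
MAX_WORDS = 50   # keep sentences below 50 words

# A leading bullet ("-", "*", "•", "–") or list number ("1.", "2)") plus spacing.
BULLET_RE = re.compile(r"^\s*(?:[-*•–]|\d+[.)])\s+")
# Any punctuation or symbol character repeated two or more times, e.g. "……".
REPEAT_RE = re.compile(r"([^\w\s])\1+")


def clean_segment(text: str) -> str:
    """Strip list markers, collapse repeated characters, tabs and spaces."""
    text = BULLET_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text)  # tabs and duplicate spaces become one space
    return text.strip()


def is_usable(text: str) -> bool:
    """Apply the length limits and reject ALL CAPS segments."""
    word_count = len(text.split())
    return MIN_WORDS < word_count < MAX_WORDS and not text.isupper()


def clean_pairs(pairs):
    """Clean each (source, target) pair and keep it only if both sides pass."""
    cleaned = []
    for source, target in pairs:
        source, target = clean_segment(source), clean_segment(target)
        if is_usable(source) and is_usable(target):
            cleaned.append((source, target))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("-  Press the green\tbutton to start the machine……",
         "•  Drücken Sie die grüne Taste, um die Maschine zu starten……"),
        ("WARNING", "WARNUNG"),
    ]
    for source, target in clean_pairs(sample):
        print(source, "|", target)
```

Dropping a pair is a deliberately blunt choice here: with 10,000 or more sentence pairs to work from, losing a few short or shouty segments usually costs less than training on them.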
Incorporate your Terminology into the MT output
Some Machine Translation Engines like Google AutoML allow you to incorporate custom terminology into the Machine Translation process. This is important because even with good training data, the MTE is likely to get specific items of terminology wrong and may do unhelpful things like translating brand names. The terminology process overlays the original MT output with your specific terminology preferences, reducing the work required by the translator.
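To make the idea concrete, here is a minimal sketch of a terminology check you could run over MT output. It is not how Google AutoML or any other engine applies a glossary; the function name, the glossary format and the example terms (including the brand name) are all invented for illustration, and the matching is a crude case-insensitive substring test.

```python
def find_terminology_issues(source, mt_output, glossary):
    """Return (source_term, expected_target_term) pairs the MT output missed.

    `glossary` is a hypothetical dict mapping a source term to the target
    term you want to see. A brand name that must stay untranslated simply
    maps to itself.
    """
    issues = []
    source_folded = source.casefold()
    output_folded = mt_output.casefold()
    for source_term, target_term in glossary.items():
        term_in_source = source_term.casefold() in source_folded
        term_in_output = target_term.casefold() in output_folded
        if term_in_source and not term_in_output:
            issues.append((source_term, target_term))
    return issues


if __name__ == "__main__":
    # Made-up glossary and sentences, purely for illustration.
    glossary = {"Acme Cloud": "Acme Cloud", "invoice": "Rechnung"}
    source = "Your Acme Cloud invoice is ready."
    mt_output = "Ihre Acme Wolke Faktura ist bereit."
    for source_term, expected in find_terminology_issues(source, mt_output, glossary):
        print(f"'{source_term}' should appear as '{expected}'")
```

A check like this only flags problems for the translator; an engine with real glossary support applies your terminology as part of the translation process instead.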
PEMT = MTPE = Post-Editing Machine Translation
“The translator?” I hear you cry! Yes, I’m afraid you still need a human translator for most of your translated content. Sometimes called the Post-Editor, this translator checks the MT output and corrects it where required. The goal of the steps above is to reduce the effort required by the Post-Editor as far as possible.
Some Machine Translation Engines are better than others
Let me be more specific. Some Machine Translation Engines (MTEs) have better results for specific language combinations and domains. Memsource have an interesting approach to this conundrum. They assess the performance of different engines for specific language combinations and domains and use this to suggest the optimal engine for each specific project. The bottom line is: don’t assume that one engine is “the best”. Be prepared to use multiple engines for the best overall performance. If you’re interested in this subject, I really recommend reading the latest Machine Translation Report from Memsource.
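If you keep your own quality scores, the same principle can be sketched in a few lines. This is an illustration of choosing an engine per language combination and domain, not a description of how Memsource does it; the engine names and the scores are placeholders I have invented, and you would replace them with figures from your own evaluations.

```python
def pick_engine(scores, language_pair, domain, default="generic_engine"):
    """Return the best-scoring engine for a language pair and domain.

    `scores` is a hypothetical dict keyed by (language_pair, domain), whose
    values map an engine name to a quality score (higher is better). If
    nothing has been measured for that combination, fall back to `default`.
    """
    candidates = scores.get((language_pair, domain))
    if not candidates:
        return default
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Placeholder numbers, not real measurements.
    scores = {
        ("en-de", "legal"): {"engine_a": 0.71, "engine_b": 0.64},
        ("en-de", "medical"): {"engine_a": 0.58, "engine_b": 0.69},
    }
    print(pick_engine(scores, "en-de", "legal"))
    print(pick_engine(scores, "en-fr", "legal"))
```

Keying the table on both language pair and domain is the point: the engine that wins for one combination may well lose for another.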
Generic, domain-specific engines can deliver better Machine Translation
If you don’t have a large enough corpus of existing translations, don’t despair. Machine Translation providers like Microsoft and ModernMT have domain-specific engines which are trained on large volumes of existing translations for legal, medical, industrial and other domains. This is a good shortcut to achieving better Machine Translation results, as the MTE is more likely to get the terminology right for that domain.
This is a big subject and can seem out of reach if you’re not a big translation buyer. The good news is that the entry point for Machine Translation is coming down all the time. If you want to discuss whether it might work for you, get in touch and we’ll be happy to explore the options with you.