What does it take to integrate a Custom MT into your company's localization flow?
Which options to choose when your budget is not unlimited?
What data to collect to make the right decision?
I asked myself these questions when I was preparing a budget for my first custom #mt project. I read many papers and case studies and watched numerous presentations but didn't find one that gave a complete overview of the essential expense items to consider and of the options to choose from in different cases.
That's why, once the project I conceived had been successfully deployed, I thought it would be fair to share some insights and takeaways.
In brief, to calculate the budget you need to take into account two main factors:
Labor cost: dedicated team or outsourcing
Computational resources: proprietary or third-party infrastructure
Let's elaborate on these two topics and see what we need to know to make the right decision.
A Custom MT Budget
In any project (regardless of your purpose and goals), there always (at least for now) will be people and machines, so when calculating any budget, one should consider these two main expense components.
On the human side, you would need a person or a team in your company (or a contractor) who does everything a machine can't do. And there is actually a lot to do. As for the computing resources, whatever custom MT option you choose, you will need to either pay an MT provider or invest in your own infrastructure.
Machine Translation Operationalization is still a new area of expertise, and finding the right person for your project can be challenging. It's no secret that in our machine-driven and AI-powered time, we still need people who can do the human part of the job. In some cases, you might already have a person on your team to whom you can entrust this complex and time-consuming process, but in most cases, you would need to find one or outsource the work to an external vendor.
Behind every efficient custom machine translation engine there is a highly skilled and creative human.
Human Role in an MT project
First, they should perform data analysis to evaluate the current state of the localization processes, identify the integrations needed to fit MT into the existing flows, and decide on the KPIs.
Then the MT Project would need to be set up (MT integration into the existing flow), providers to be chosen, and connections and processes arranged. Depending on your company's tech situation, it might require investing in additional development resources (also a human component of the costs).
When everything is ready, it's time to prepare the parallel data. In the best possible scenario, you would have TMX files in the correct format that you might want to clean of sensitive and other data that you do not wish to feed into an MT model. In the worst case, the data has to be scraped and aligned. I mention all this for you to understand that any MT training project might require hours of preparation, which translates into additional labor costs.
Then comes the MT engine training itself. In most cases, it doesn't take much time.
Educating staff / a PEMT course. If you have an in-house translation team or outsource your localization tasks to freelancers, you must train them on post-editing.
After you have your MT engine or engines up and running, your expert should measure the KPIs and adjust settings and processes when needed, retrain the engine if it's not performing well enough, explore additional options and add more languages to the scope when needed.
There might (and probably will) be more points on this list depending on your needs and environment, but even the basic tasks listed above are enough to occupy a dedicated employee in your company, if not a whole MT team, so the human-resource part of the budget would be the most important.
Thus, each company will need to decide whether to have an in-house person or team or outsource this type of project to an external vendor.
Let's see what to take into consideration.
In-House or Outsourcing?
Some companies need to translate the same types of content, with barely changing terminology, into the same set of languages. In this case, after the first MT model has been trained and deployed (and if the results are indeed satisfactory), you can use it for quite a long time without further retraining or maintenance.
Others regularly release new products with new terminology and keep adding new markets and languages. When a custom MT is needed for all or part of such a company's content, the project requires the continuous presence of an on-site expert.
So, to understand which option is best for your budget calculation, ask yourself two main questions:
Will it be an ongoing project?
Do we already have someone on the team who wants to and can take it on?
If the answer to the above questions is NO: consider outsourcing your MT project to a vendor (company or freelancer).
If the answer to the above questions is YES: Opt for a dedicated person within your localization team who will always be here to meet your evolving needs.
Either option can save you money, depending on your needs.
In the first case, you don't keep an employee who doesn't have much work to do, and you can get your MT needs met faster with ready-made solutions from external experts. You would also have the flexibility to choose the best offer on the market.
In the second case, on the contrary, it might be cheaper to have an in-house expert than to outsource your projects to an external vendor all the time. This removes an unnecessary dependency and improves process velocity.
I am not providing any numbers here, as the human resource budget largely depends on the situation in your market and the workforce cost varies from country to country, but now you know what to anticipate and what to add to your budget calculations.
Now that we know what to anticipate in terms of workforce, let's move on to the cost of computational resources.
As competition in the Artificial Intelligence and Machine Learning market grows, translation solutions become cheaper and accessible to a wider range of clients. Custom MT is no longer reserved for big players and institutions: even small companies and freelancers can allocate money to train their own MT models.
Custom Machine Translation costs money, though it is getting cheaper. If you don't have an unlimited budget, choose the candidates for it wisely.
The budget for the MT engine usage itself depends on the solution you choose. It can be either a CAT-tool integrated one (a cheaper but less flexible solution) or a custom engine from services such as #GoogleAutoML, #AmazonACT, etc., where you train MT models with your data.
Finally, it can also be your very own solution hosted on your servers and maintained by your team of AI experts - and the budget would correspondingly be the largest.
When buying computational resources, there are two main expenses to take into account.
MT models creation/training (counted in hours of training)
MT usage (measured in characters translated)
Not all providers use the same pricing model. Some, like Amazon Web Services, let you upload your parallel data for free and charge for MT usage only. Others, like Google AutoML, charge both per hour of model training and per character of usage. Calculating the budget you need to allocate for computational resources is easy if you know your current text volumes.
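To make the arithmetic concrete, here is a minimal sketch of how the two expenses (training hours and characters translated) could be combined into a yearly estimate. The `PricingModel` class, the function name, and every rate in the example are hypothetical placeholders, not actual provider prices; check your provider's current price list before using the numbers.

```python
from dataclasses import dataclass


@dataclass
class PricingModel:
    """Hypothetical provider pricing; both rates are placeholders."""
    training_rate_per_hour: float        # 0.0 if training is not charged
    usage_rate_per_million_chars: float  # price per 1,000,000 translated characters


def estimate_annual_cost(pricing, models, training_hours_per_model,
                         retrainings_per_year, chars_per_month):
    """Rough yearly cost: all training runs plus twelve months of usage."""
    # Each model is trained once and then retrained N times during the year.
    training_runs = models * (1 + retrainings_per_year)
    training_cost = (training_runs * training_hours_per_model
                     * pricing.training_rate_per_hour)
    usage_cost = (12 * chars_per_month / 1_000_000
                  * pricing.usage_rate_per_million_chars)
    return training_cost + usage_cost


# Two made-up pricing models: usage only vs. training plus usage.
usage_only = PricingModel(training_rate_per_hour=0.0,
                          usage_rate_per_million_chars=60.0)
training_and_usage = PricingModel(training_rate_per_hour=45.0,
                                  usage_rate_per_million_chars=80.0)

for name, pricing in [("usage only", usage_only),
                      ("training + usage", training_and_usage)]:
    cost = estimate_annual_cost(pricing, models=3, training_hours_per_model=4,
                                retrainings_per_year=1, chars_per_month=2_000_000)
    print(f"{name}: {cost:.2f}")
```

Plugging in your own monthly character volume and the number of language pairs you plan to train shows quickly which pricing model suits your case better.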
If you have a large budget and don't need to decide where to save money on your custom MT implementation, you can stop reading now and enjoy a cup of coffee.
For some companies, creating models for all the languages and/or subject matters could be financially challenging. In this case, they must decide which languages would benefit from a custom MT the most.
As we all know, although MT engines get better every year (or even faster), they don't cope with all language pairs equally well.
MT options for any budget
Choosing Language Candidates
Not all languages benefit equally from Machine Translation, be it generic or custom.
If you need to deploy a custom MT project for several languages, you might want to start with the languages that benefit least from generic MT, to bring their quality in line with the languages that already perform well with a generic engine.
From top to bottom, the quality of MT suggestions decreases.
* Some Asian languages perform better than others (as in the case of Chinese). Results also depend on the MT engine used (some Asian engines may give better results than commonly used ones).
When choosing language candidates for your first MT training project, ask yourself the following questions.
What are the priorities for my product/company? Consider markets, content volumes, and available human resources.
Which languages are not performing well enough with a generic MT? Look at edit distance and content velocity.
Do we have enough data for MT training? Check your existing parallel data (translation memories, glossaries).
Let's check a couple of cases:
"I need to localize my products from English to Japanese earlier than into French, but my JA team spends more time post-editing a generic MT than my FR team. We already have some 200K strings and a substantial glossary in our translation memory".
Then the answer would be YES: you would need to invest in the Japanese model first.
"I have a lot of parallel data in the FR to EN pair that I can train a custom MT with, but right now we don't have much content to localize in this pair, and our team copes with localization well enough*."
Here you had better NOT bother training a custom engine if you have other languages that need assistance. If this is your only language pair, you might still want to improve it.
You probably know a lot about the metrics used to measure custom MT quality and gain; if not, you can google them. One of the most reliable metrics is the edit distance: how much less a professional human translator needs to edit content translated by a trained MT engine compared to a generic MT.
When choosing which language pair to upgrade with a custom MT, I suggest measuring the current edit distance for your content.
You can do it in two main ways:
Many TMS tools provide an edit distance report, so you can check the current performance of the generic MT there.
If yours does not, you can run an experiment and measure these KPIs on a small chunk of content to see how a language pair or pairs perform.
You can also measure the time required for post-editing and the percentage of sentences that don't need any post-editing at all and are good to go.
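If you run the experiment yourself, the measurements above can be sketched as follows. This is a minimal example, assuming a character-level Levenshtein distance normalized by the length of the longer segment; TMS tools may compute their edit distance reports differently (for example, on words), so the numbers are not directly comparable. The function names and the sample pairs are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def post_editing_report(pairs):
    """pairs: list of (raw MT output, post-edited text) tuples for one language."""
    total_edits = 0
    total_chars = 0
    untouched = 0
    for mt_output, post_edited in pairs:
        distance = levenshtein(mt_output, post_edited)
        total_edits += distance
        total_chars += max(len(mt_output), len(post_edited))
        if distance == 0:
            untouched += 1  # segment accepted without any change
    return {
        "edit_distance_pct": 100 * total_edits / total_chars if total_chars else 0.0,
        "untouched_pct": 100 * untouched / len(pairs) if pairs else 0.0,
    }


# Made-up sample: one segment accepted as is, one segment edited.
sample = [("The file was saved.", "The file was saved."),
          ("Press the button for start.", "Press the button to start.")]
print(post_editing_report(sample))
```

Running the same report on generic MT output and on custom MT output for the same content gives you the gain per language pair.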
Knowing all this, you can decide which language pair requires a more urgent improvement.
Here is an example comparing Edit Distance for a generic MT vs a custom one:
As we see, German, Russian, and Italian benefited from the generic MT less than other languages, so it was natural to create custom models for those languages first and then go on with the rest.
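The same prioritization can be sketched in a few lines: rank languages by how much editing the generic MT output needs, and skip those without enough parallel data to train on. The language labels, the figures, and the `min_segments` threshold below are invented for illustration only and are not the results of the project described here.

```python
def rank_candidates(stats, min_segments=0):
    """Return languages ordered from the worst generic MT performance to the best.

    stats maps a language label to a dict with two keys:
      "generic_edit_distance_pct": edit distance measured on generic MT output
      "parallel_segments": number of aligned segments available for training
    Languages with fewer than min_segments segments are left out.
    """
    eligible = {lang: s for lang, s in stats.items()
                if s["parallel_segments"] >= min_segments}
    return sorted(eligible,
                  key=lambda lang: eligible[lang]["generic_edit_distance_pct"],
                  reverse=True)


# Invented numbers, for illustration only.
stats = {
    "lang_a": {"generic_edit_distance_pct": 42.0, "parallel_segments": 200_000},
    "lang_b": {"generic_edit_distance_pct": 18.0, "parallel_segments": 500_000},
    "lang_c": {"generic_edit_distance_pct": 55.0, "parallel_segments": 3_000},
}
print(rank_candidates(stats, min_segments=10_000))
```

In practice you would weigh this ranking against market priorities and content volumes rather than follow it blindly.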
To make the right decision in terms of the budget for a custom Machine Translation project, take into account the following three elements:
Whether you can afford an in-house team or would be better off outsourcing the project.
Whether you want to have a proprietary/third-party MT engine or to use ready-made models available on the market.
And finally, which languages to upgrade first.