Translation Memory Selection Guidelines
Phrase Custom AI uses translation memories (TMs) to create custom machine translation (MT) models that adhere to specific terminology and style. Compared to generic machine translation, this improves translation quality, and thus reduces post-editing time, for the content types the TMs cover.
The most important factor influencing the effectiveness of the customization process is the choice of translation memories. The following general guidelines help determine what data to use:
-
Single domain:
It is best if the dataset focuses on content with a single style and terminology. If the dataset contains a mixture of domains (e.g., both the legal terms of a website and its product descriptions), the model can fail to learn what the desired style is.
-
Unique content type:
The custom MT model builds on top of generic models trained on vast amounts of public data collected from the internet. If the translation memory contains data that is very similar to the data used to build the generic models, there is little to be gained from the customization process.
-
Data quality:
The model will assume that every sentence pair in the translation memory is an example of the output it will be expected to produce. The translation memory must be of good quality, ideally created from professional human translations. The data cleaning pipeline can help to filter out the most harmful parts of the dataset.
-
Expected volume:
For the customization to be impactful in terms of ROI, the dataset needs to be representative of the bulk of the content where MT quality will have the most impact. For example, if some of the MT output is to be post-edited by human translators, the data needs to be representative of the content that will be post-edited in order to maximize the ROI.
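As an illustration of the data quality guideline, the following is a minimal sketch of a check that could be run on exported segment pairs before adding a TM to a dataset. It is not the Phrase data cleaning pipeline. The `(source, target)` tuple representation, the function name `prefilter_segments`, and the length-ratio threshold are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation of one TM segment: (source text, target text).
SegmentPair = Tuple[str, str]


def prefilter_segments(
    pairs: Iterable[SegmentPair],
    max_length_ratio: float = 3.0,  # assumed threshold, not a Phrase setting
) -> List[SegmentPair]:
    """Drop segment pairs that are unlikely to be good training examples."""
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # An empty side gives the model nothing to learn from.
        if not source or not target:
            continue
        # Identical source and target usually means the segment was never
        # translated (legitimate cases such as brand names are dropped too).
        if source == target:
            continue
        # Very uneven lengths often indicate misaligned segments.
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        if longer / shorter > max_length_ratio:
            continue
        # Exact duplicates add no new information.
        if (source, target) in seen:
            continue
        seen.add((source, target))
        kept.append((source, target))
    return kept
```

A check like this only catches mechanical problems. It cannot tell whether a translation is accurate or matches the desired style, so it does not replace the requirement for professionally translated data.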
To create a dataset for the purpose of training a custom MT engine, follow these steps:
-
From the Train a custom MT engine.
page, click The
page opens. -
Provide a name for the dataset.
-
The language selectors allow for various options:
-
To create a general language dataset, select the same source and target languages in the source and target language and locale selectors.
-
To create a locale-specific dataset, select the source and target languages from the first dropdown list, then specify the source and target locales from the second dropdown list.
Multiple target locales can also be added to leverage more data sources.
-
To create a dataset with multiple source and target locales, select the source and target languages from the first dropdown list, specify the source and target locales from the second dropdown list (multiple target locales can be added) and click on + Add more locale pairs.
The
window appears. -
-
Click Add translation memories.
The
page opens with search functionality. -
To add a TM to the dataset, click the icon. The TM is added to the
column. Up to 200 TMs can be added to a dataset, and a dataset should ideally contain at least 10,000 segments.
Click a TM name to view the selected TM on the translation memory page.
Click the icon to remove the TM from the
column. -
Click Save.
The
page opens. -
Review the details and, if they are correct, click Continue.
The
page opens. -
Apply required filters and click Create.
The dataset is created and added to the list on the
page with the initial status of and the status of in the column.
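The limits mentioned in the steps above (a maximum of 200 TMs per dataset and ideally at least 10,000 segments) can be checked before starting the process. The following is a minimal sketch, assuming the segment count of each TM is already known. The function name `check_dataset_plan` and the dictionary input are hypothetical and not part of any Phrase API.

```python
from typing import Dict, List

MAX_TMS_PER_DATASET = 200          # maximum stated in the steps above
RECOMMENDED_MIN_SEGMENTS = 10_000  # recommended minimum stated in the steps above


def check_dataset_plan(tm_segment_counts: Dict[str, int]) -> List[str]:
    """Return warnings for a planned dataset; an empty list means no issues.

    tm_segment_counts maps a TM name to its number of segments
    (a hypothetical input format chosen for this example).
    """
    warnings = []
    tm_count = len(tm_segment_counts)
    if tm_count > MAX_TMS_PER_DATASET:
        warnings.append(
            f"{tm_count} TMs selected; the maximum is {MAX_TMS_PER_DATASET}."
        )
    total_segments = sum(tm_segment_counts.values())
    if total_segments < RECOMMENDED_MIN_SEGMENTS:
        warnings.append(
            f"{total_segments} segments in total; at least "
            f"{RECOMMENDED_MIN_SEGMENTS} are recommended."
        )
    return warnings


if __name__ == "__main__":
    plan = {"legal_en_de": 6000, "support_en_de": 2500}
    for message in check_dataset_plan(plan):
        print(message)
```

The counts here are raw segment totals. Filters applied during dataset creation may reduce the number of segments that are actually used, so it is safer to plan for a margin above the recommended minimum.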