Create a Dataset

Translation Memory Selection Guidelines

Phrase Custom AI uses translation memories (TMs) to create custom machine translation (MT) models that adhere to specific terminology and style. Compared to generic machine translation, this improves translation quality, and thus reduces post-editing time, for the content types the TMs cover.

The most important factor influencing the effectiveness of the customization process is the choice of translation memories. The following general guidelines can help determine which data to use:

Single domain:

It is best if the dataset focuses on content covering a single style and terminology. If the dataset contains a mixture of domains (e.g., both the legal terms of a website and the product descriptions), the model can fail to learn what the desired style is.

Unique content type:

The custom MT model builds on top of generic models trained on vast amounts of public data collected from the internet. If the translation memory contains data that is very similar to the data used to build the generic models, there is little to gain from the customization process.

Data quality:

The model will assume that every sentence pair in the translation memory is an example of the output it will be expected to produce. The translation memory must be of good quality, ideally created from professional human translations. The data cleaning pipeline can help to filter out the most harmful parts of the dataset.

Expected volume:

For the customization to be impactful in terms of ROI, the dataset needs to be representative of the bulk of the content where MT quality will have the most impact. For example, if some of the MT output is to be post-edited by human translators, the data should be representative of the content that will be post-edited in order to maximize ROI.
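
To make the data quality guideline concrete, the following is a minimal sketch of a local sanity check that could be run on sentence pairs before they are added to a translation memory. It is not the Phrase data cleaning pipeline: the function name `prefilter_pairs`, the `max_length_ratio` parameter, and the filter rules themselves (empty segments, untranslated copies, extreme length mismatch, exact duplicates) are assumptions chosen for illustration only.

```python
def prefilter_pairs(pairs, max_length_ratio=3.0):
    """Return the (source, target) pairs that pass a few basic checks.

    Hypothetical helper for illustration; the rules and the default
    threshold are assumptions, not Phrase's actual cleaning filters.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Drop pairs where either side is empty.
        if not source or not target:
            continue
        # Drop pairs where the target is an untranslated copy of the source.
        if source == target:
            continue
        # Drop pairs whose character lengths differ by an extreme ratio,
        # which often indicates misaligned segments.
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue
        # Drop exact duplicates, keeping the first occurrence.
        key = (source, target)
        if key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept
```

Some of these rules can discard legitimate pairs (for example, product names that are identical in both languages), so a check like this is best treated as a way to flag segments for review rather than as a definitive filter.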

Creating a dataset for automated asset curation has a slightly different process.

To create a dataset for the purpose of training a custom MT engine, follow these steps:

From the Datasets page, click Train a custom MT engine.

The Dataset details page opens.
Provide a name for the dataset.
The language selectors allow for various options:
1. To create a general language dataset, select the same language in both the language selector and the locale selector, for the source and for the target.
2. To create a locale-specific dataset, select the source and target languages from the first dropdown list then specify the source and target locales from the second dropdown list.
  
  Multiple target locales (i.e., different variants of the same language) can also be added to leverage more data sources.
3. To create a dataset with multiple source and target locales, select the source and target languages from the first dropdown list, specify the source and target locales from the second dropdown list (different variants of the same target language can be added) and click on + Add more locale pairs.
The Input data window appears.
Click Add translation memories.

The Choose translation memories page opens with search functionality.
To add a TM to the dataset, click the icon. The TM is added to the Selected column.

Multiple TMs can be added, up to a maximum of 200 TMs and 8 million segments. A dataset should ideally contain at least 10,000 segments.

Clicking the TM name opens the selected TM on the translation memory page.

Click the icon to remove the TM from the Selected column.
Click Save.

The Dataset details page opens.
Review the details and, if they are correct, click Continue.

The Cleaning filters page opens.
Apply required filters and click Create.

The dataset is created and added to the list on the Datasets page with an initial status of Cleaning. The Created for column shows Training MT.
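
The size limits stated in the steps above (at most 200 TMs, at most 8 million segments, ideally at least 10,000 segments) can be checked before a selection is made in the UI. The sketch below is a local planning aid only; the `TMInfo` class and the `check_dataset_limits` function are hypothetical names introduced here, and the segment counts are assumed to be known from elsewhere (for example, from a TM export).

```python
from dataclasses import dataclass

# Limits as stated in this article.
MAX_TMS = 200
MAX_SEGMENTS = 8_000_000
RECOMMENDED_MIN_SEGMENTS = 10_000


@dataclass
class TMInfo:
    """Hypothetical record describing one translation memory."""
    name: str
    segment_count: int


def check_dataset_limits(tms):
    """Return (errors, warnings) for a planned selection of TMs."""
    errors, warnings = [], []
    total = sum(tm.segment_count for tm in tms)
    if len(tms) > MAX_TMS:
        errors.append(f"{len(tms)} TMs selected; the maximum is {MAX_TMS}.")
    if total > MAX_SEGMENTS:
        errors.append(f"{total} segments selected; the maximum is {MAX_SEGMENTS}.")
    if total < RECOMMENDED_MIN_SEGMENTS:
        warnings.append(
            f"{total} segments selected; at least "
            f"{RECOMMENDED_MIN_SEGMENTS} are recommended."
        )
    return errors, warnings
```

For example, `check_dataset_limits([TMInfo("legal", 4_000), TMInfo("ui", 3_000)])` returns no errors and one warning, because the combined 7,000 segments fall below the recommended minimum.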