Custom AI

Create a Dataset

Translation Memory Selection Guidelines

Phrase Custom AI leverages translation memories (TMs) to create custom machine translation (MT) models that adhere to specific terminology and style. Compared to generic machine translation, this improves translation quality, and thus reduces post-editing time, for the content types the TMs cover.

The most important factor in the effectiveness of the customization process is the choice of translation memories. The following general guidelines help determine which data to use:

  • Single domain:

    It is best if the dataset focuses on content with a single style and terminology. If the dataset contains a mixture of domains (e.g., both the legal terms of a website and the product descriptions), the model can fail to learn what the desired style is.

  • Unique content type:

    The custom MT model builds on top of generic models trained on vast amounts of public data collected from the internet. If the translation memory contains data that is very similar to the data used to build the generic models, little will be gained from the customization process.

  • Data quality:

    The model will assume that every sentence pair in the translation memory is an example of the output it will be expected to produce. The translation memory must be of good quality, ideally created from professional human translations. The data cleaning pipeline can help to filter out the most harmful parts of the dataset.

  • Expected volume:

    For the customization to be impactful in terms of ROI, the dataset needs to be representative of the bulk of the content where MT quality has the most impact. For example, if some of the MT output is to be post-edited by human translators, the dataset should be representative of the content that will be post-edited in order to maximize ROI.
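
The data quality guideline can be partly checked before TMs are selected. The following is a minimal Python sketch, not the Phrase data cleaning pipeline: it assumes the TM content has already been exported into a list of (source, target) text pairs, and the function name screen_pairs, the rejection reasons, and the max_length_ratio threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def screen_pairs(pairs, max_length_ratio=3.0):
    """Split (source, target) pairs into kept pairs and a count of rejection reasons.

    Illustrative pre-check only; thresholds and reasons are assumptions.
    """
    kept, reasons, seen = [], Counter(), set()
    for source, target in pairs:
        src, tgt = source.strip(), target.strip()
        if not src or not tgt:
            # One side is missing, so the pair teaches the model nothing useful.
            reasons["empty"] += 1
        elif src == tgt:
            # Source copied into target usually means the segment was never translated.
            reasons["untranslated"] += 1
        elif (src, tgt) in seen:
            # Exact repeat of a pair that was already kept.
            reasons["duplicate"] += 1
        elif max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_length_ratio:
            # Very different lengths often indicate a misaligned pair.
            reasons["length_mismatch"] += 1
        else:
            seen.add((src, tgt))
            kept.append((src, tgt))
    return kept, reasons


if __name__ == "__main__":
    sample = [
        ("Hello", "Hallo"),
        ("Hello", "Hallo"),
        ("", "Leer"),
        ("OK", "OK"),
    ]
    kept, reasons = screen_pairs(sample)
    print(len(kept), dict(reasons))
```

A high share of rejected pairs in a TM suggests reviewing it before adding it to a dataset. Some identical pairs (brand names, codes) are legitimate, so the counts are a prompt for review rather than a verdict.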

To create a dataset for the purpose of training a custom MT engine, follow these steps:

  1. From the Datasets page, click Train a custom MT engine.

    The Dataset details page opens.

  2. Provide a name for the dataset.

  3. The language selectors allow for various options:

    1. To create a general language dataset, select the same source and target languages in the source and target language and locale selectors.

    2. To create a locale-specific dataset, select the source and target languages from the first dropdown list, then specify the source and target locales from the second dropdown list.

      Multiple target locales can also be added to leverage more data sources.

    3. To create a dataset with multiple source and target locales, select the source and target languages from the first dropdown list, specify the source and target locales from the second dropdown list (multiple target locales can be added), then click + Add more locale pairs.

    The Input data window appears.

  4. Click Add translation memories.

    The Choose translation memories page opens with search functionality.

  5. To add a TM to the dataset, click the add icon. The TM is added to the Selected column.

    Up to 200 TMs can be added to a dataset. A dataset should ideally contain at least 10,000 segments.

    Clicking a TM name opens that TM on the translation memory page.

    Click the remove icon to remove the TM from the Selected column.

  6. Click Save.

    The Dataset details page opens.

  7. Review the details as presented and if correct, click Continue.

    The Cleaning filters page opens.

  8. Apply required filters and click Create.

    The dataset is created and added to the list on the Datasets page with the initial status of Cleaning and the status of Training MT in the Created for column.
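
The limits given in step 5 (a maximum of 200 TMs, and ideally at least 10,000 segments) can be checked before starting the procedure. This is a minimal Python sketch under stated assumptions: the segment counts are supplied by hand as a mapping of TM name to segment count, and check_selection is a hypothetical helper, not part of any Phrase API.

```python
MAX_TMS = 200          # maximum number of TMs per dataset (step 5)
MIN_SEGMENTS = 10_000  # recommended minimum number of segments (step 5)


def check_selection(tm_segment_counts):
    """Return (errors, warnings) for a mapping of TM name -> segment count."""
    errors, warnings = [], []
    if not tm_segment_counts:
        errors.append("no translation memories selected")
    if len(tm_segment_counts) > MAX_TMS:
        errors.append(
            f"{len(tm_segment_counts)} TMs selected; the maximum is {MAX_TMS}"
        )
    total = sum(tm_segment_counts.values())
    if total < MIN_SEGMENTS:
        # The minimum is a recommendation, so it is reported as a warning only.
        warnings.append(
            f"only {total} segments; at least {MIN_SEGMENTS} is recommended"
        )
    return errors, warnings


if __name__ == "__main__":
    # Hypothetical TM names and counts for illustration.
    selection = {"legal_en_de": 6_000, "legal_en_de_archive": 5_000}
    print(check_selection(selection))
```

The counts here are raw totals. Because the cleaning filters applied in step 8 may remove segments, it is safer to aim for a margin above 10,000 rather than just meeting it.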
