Segmentation Rules (TMS)

Segmentation

Segmentation is the splitting of source texts into smaller parts. This improves the retrieval of previously translated text from a translation memory. Segments are presented in the editor and can be filtered. If a project has workflow steps, changes in segments are presented in the translation changes pane.

Default segmentation rules are tailored to the specifics of each supported language and can be customized.

Jobs imported with poor segmentation, caused for example by poorly formatted source documents or inappropriate segmentation customization, can lower TM match values. Spend some time reviewing and preparing the source file before import; a common problem is confusing line breaks with paragraph breaks.

Example:

Good segmentation:

Translation memories with multilingual target languages are supported and can be used bidirectionally.

Match value of 100%.

Poor segmentation:

Translation memories with multilingual target languages are supported.

Match value of 100%.

and can be used bidirectionally.

Match value of 63%.
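
The effect of a stray break on matching can be sketched with a generic string similarity. This is only an illustration: it uses Python's difflib, not Phrase's match algorithm, so the scores will not equal the percentages above, and the TM entry is assumed to hold the complete sentence.

```python
from difflib import SequenceMatcher

# Assumed TM entry: the sentence as it was stored with good segmentation.
TM_ENTRY = ("Translation memories with multilingual target languages "
            "are supported and can be used bidirectionally.")

def similarity(segment, tm_entry):
    """Generic similarity in [0, 1]; NOT the TM match value used by Phrase."""
    return SequenceMatcher(None, segment, tm_entry).ratio()

whole = similarity(TM_ENTRY, TM_ENTRY)
fragment = similarity("and can be used bidirectionally.", TM_ENTRY)

print(f"whole sentence: {whole:.0%}")
print(f"fragment after a stray line break: {fragment:.0%}")
```

The whole sentence is an exact match, while the fragment produced by the stray break can only partially match the stored entry.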

Customize Segmentation Rules

Customized segmentation rules can be applied to jobs and project templates. If a project requires a customized segmentation rule, a template will need to be created for that project. When set as primary, customized segmentation rules are applied to all new jobs imported for that source language.

There are two types of segmentation rules:

Abbreviations in an .XLSX file
Regular expressions in an .SRX file

To use customized rules, download the default rules, modify them, upload the modified file and then apply them to specified jobs.

Caution

When adding custom segmentation rules for a space-less CJK source language (where the target language uses spaces as a word delimiter), ensure leading or trailing spaces are added to the target segments split by the custom rule; this delimits words in the translation. While this happens automatically in segments formed by the default segmentation rules, no spaces are added in manually split segments or those formed by additional custom segmentation rules.
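
A minimal sketch of why the spaces matter. It assumes (this is not stated above) that the target segments of a paragraph are joined exactly as typed when the translated file is assembled; the segment texts and the assemble helper are hypothetical.

```python
# Hypothetical English targets for one CJK paragraph split by a custom rule.
targets_without_spaces = ["The file was saved.", "Please close the window."]
targets_with_spaces = ["The file was saved.", " Please close the window."]

def assemble(segments):
    """Join target segments exactly as typed (assumed behaviour, for illustration)."""
    return "".join(segments)

print(assemble(targets_without_spaces))
print(assemble(targets_with_spaces))
```

Without the leading space on the second segment, the two sentences run together in the assembled text.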

Download Default Segmentation Rules

To download the default segmentation rules, follow these steps:

From the Settings page, scroll down to the Project settings section and click on Segmentation.

The Segmentation page opens.
Select the language to be customized and click Export XLSX/SRX.

The Export XLSX/SRX window opens.
Select format:
- XLSX provides an abbreviation list.
- SRX provides regular expression rules.
Select a language from the dropdown list.
Click Download.

The file is downloaded to your system.

To download a segmentation rule that you uploaded previously, follow these steps:

From the Settings page, scroll down to the Project settings section and click on Segmentation.

The Segmentation page opens.
Click on the Settings icon on the right and choose Customize columns:
Enable the Filename column.
Click on a filename to download a pre-saved rule.

Edit Abbreviations in an .XLSX File

For individual languages, abbreviations can be specified after which a new segment should not be created.

To edit abbreviations, follow these steps:

Open the downloaded .XLSX file in an editor.
Change the contents with the following formatting:

The XLSX file must have two columns with no headings.
- Column 1: Abbreviation to be specified
- Column 2: Specification of segmentation behavior
  - ABBR_UPPER_NUM
    
    A new segment will not be created if the abbreviation is followed by white-space and then by a number, a symbol (math, currency signs, dingbats, etc.) or a word with the first letter in upper case.
  - ABBR_NUM
    
    A new segment will not be created if the abbreviation is followed by white-space and then by a number.
Save the edited .XLSX file.
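
The two behaviors in column 2 can be read as the following check. This is a sketch based only on the descriptions above, not the actual implementation; the function name suppresses_break is hypothetical, and mapping "number" and "symbol" to Unicode categories N* and S* is an assumption.

```python
import unicodedata

def suppresses_break(behavior, following_text):
    """Return True if no new segment should start after the abbreviation.

    following_text is whatever comes right after the abbreviation.
    """
    # Both behaviors require white-space directly after the abbreviation.
    if not following_text or not following_text[0].isspace():
        return False
    rest = following_text.lstrip()
    if not rest:
        return False
    category = unicodedata.category(rest[0])
    is_number = category.startswith("N")   # digits and other numeric characters
    is_symbol = category.startswith("S")   # math, currency, other symbols
    is_upper = category == "Lu"            # upper-case letter
    if behavior == "ABBR_NUM":
        return is_number
    if behavior == "ABBR_UPPER_NUM":
        return is_number or is_symbol or is_upper
    raise ValueError(f"unknown behavior: {behavior}")

print(suppresses_break("ABBR_NUM", " 5 items"))       # number follows
print(suppresses_break("ABBR_NUM", " Smith"))         # upper-case word follows
print(suppresses_break("ABBR_UPPER_NUM", " Smith"))   # upper-case word follows
```

Under this reading, ABBR_NUM suits abbreviations such as "No." that normally precede a number, while ABBR_UPPER_NUM suits titles such as "Dr." that precede a capitalized name.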

Edit Regular Expressions in an .SRX File

Editing .SRX files is a complex process suitable only for users experienced in using regular expressions.

There are several rules that can be changed in an SRX file:

Import text from an XLSX file without segmentation; one cell is equal to one segment.
Import text with a new line in order to split one segment into two.
Use a colon (or any other character) as a segment separator.
Forbid the use of a semicolon (or any other character) as a segment separator.
Remove an abbreviation from the list (the text will then be segmented after it).

These rules are character-based; only a single character can be used as a segment separator. Groups of characters (for example: <p>) cannot be used as a segment separator.

To edit an SRX file, follow these steps:

Open the file in a text editor such as Notepad++.
Edit using regular expressions or remove the inner segmentation completely.

Example:
- <rule break="no">
  
  The list of rules where the segment will not be broken, i.e. a list of abbreviations.
- <rule> <beforebreak>
  
  A regular expression for a character before a break (for example, the end-of-sentence characters . ? ! :). If, for example, you don't want to segment text after a colon, delete : from every <rule><beforebreak> element.
- <rule> <afterbreak>
  
  A regular expression for a character after a break (for example, at the start of a new sentence; a space and capital letter).
Save the modified SRX file.
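
How the beforebreak and afterbreak expressions interact can be sketched with a toy segmenter. This is an illustration only, not the engine used on import: the RULES lists, the segment function, and the sample sentence are hypothetical, and the expressions are simplified compared with a real SRX file.

```python
import re

# (is_break, beforebreak regex, afterbreak regex); the first matching rule wins.
RULES = [
    (False, r"\be\.g\.", r"\s"),   # like <rule break="no">: an abbreviation
    (True, r"[.?!:]", r"\s"),      # like <rule break="yes">: sentence enders
]

# Same rules with the colon deleted from the beforebreak expression.
RULES_NO_COLON = [
    (False, r"\be\.g\.", r"\s"),
    (True, r"[.?!]", r"\s"),
]

def segment(text, rules):
    """Split text at every position where a break rule matches first."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            before_ok = re.search(f"(?:{before})\\Z", text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides this position
    segments.append(text[start:].strip())
    return segments

sample = "Note: see e.g. Smith. Done!"
print(segment(sample, RULES))           # ['Note:', 'see e.g. Smith.', 'Done!']
print(segment(sample, RULES_NO_COLON))  # ['Note: see e.g. Smith.', 'Done!']
```

Because the break="no" rule is listed first, "e.g." never ends a segment, and deleting the colon from the beforebreak expression keeps "Note:" attached to the sentence that follows it.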

Upload New Segmentation Rules

To upload modified or new segmentation rules, follow these steps:

From the Settings page, scroll down to the Project Settings section and click on Segmentation.

The Segmentation page opens.
Click New.

The Upload custom XLSX or SRX segmentation file page opens.
Select a Language from the dropdown list.
Provide a Name for the rule.
Click Choose file.

A file selection window opens.
Select the modified rules file for upload.
Check Primary if the custom segmentation rules will be the primary segmentation rules for the selected language.
Click Create.

The Segmentation page opens and the rule has been added to the list.

Use Custom Segmentation Rules on Job Import

To use custom rules on a job import or configure target segment length, follow these steps:

At step 8 of creating a job, click Segmentation and segment length from the File import settings.

The Segmentation and segment length options dropdown opens.
Select the modified rules from the Source segmentation rules dropdown list.
Optionally, configure a limit for target segment length based on project requirements (e.g. subtitle translation):
- Select Max. target segment length in % of source and enter the preferred percentage to limit the segment length based on the source segment.
- Select Max. target segment length in characters and enter the character count to limit the segment length by number of characters.
Click Create.

The job is created and added to the list using the specified segmentation rules.
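
The two length limits from the steps above amount to the following check. This is a sketch under stated assumptions: the function name exceeds_limit and its parameters are hypothetical, and counting length with len() (Unicode code points) may differ from how the application counts characters.

```python
def exceeds_limit(source, target, max_percent=None, max_chars=None):
    """Return True if the target is longer than the configured limit allows."""
    # Limit expressed as a percentage of the source segment length.
    if max_percent is not None and len(target) > len(source) * max_percent / 100:
        return True
    # Limit expressed as an absolute number of characters.
    if max_chars is not None and len(target) > max_chars:
        return True
    return False

source = "Hello there"       # 11 characters
target = "Bonjour à tous"    # 14 characters

print(exceeds_limit(source, target, max_percent=120))  # 14 > 13.2
print(exceeds_limit(source, target, max_chars=42))     # 14 <= 42
```

A percentage limit scales with each source segment, while a character limit is a fixed ceiling, which is why the latter is the usual fit for fixed-width content such as subtitles.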

Changing Segmentation Example (1 Cell 1 Segment)

Remove all inner segmentation rules from an SRX file leaving only the basic segmentation of the whole paragraph, element, or cell being applied. This segmentation rule can be applied to every file type (MS Word, XML, HTML, Excel, etc.).

Example:

	A	B
1	Peter! Wait!
2	Hello.
3

This XLSX example imported with default segmentation will have 3 segments: Peter!, Wait!, and Hello.

If all inner segmentation is removed leaving only the basic segmentation based on the cell, then there are only two segments: Peter! Wait! and Hello.

Edit the SRX file to remove all the default segmentation rules, i.e. the code between the opening <languagerule> tag and </languagerule>.
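
The same edit can be scripted instead of done by hand. The sketch below uses Python's standard xml.etree.ElementTree; the SAMPLE_SRX content is a hypothetical, simplified stand-in for a downloaded file, which will contain more rules and may declare an XML namespace (in which case the serialized output may use different namespace prefixes).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a downloaded SRX file.
SAMPLE_SRX = r"""<srx version="2.0">
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no"><beforebreak>\betc\.</beforebreak><afterbreak>\s</afterbreak></rule>
        <rule break="yes"><beforebreak>[.?!]</beforebreak><afterbreak>\s</afterbreak></rule>
      </languagerule>
    </languagerules>
  </body>
</srx>"""

def local_name(tag):
    """Drop an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def strip_inner_rules(srx_text):
    """Remove everything between each opening <languagerule> tag and its close."""
    root = ET.fromstring(srx_text)
    # Materialize the iterator first so removing children is safe.
    for elem in list(root.iter()):
        if local_name(elem.tag) == "languagerule":
            for child in list(elem):
                elem.remove(child)
    return ET.tostring(root, encoding="unicode")

stripped = strip_inner_rules(SAMPLE_SRX)
print(stripped)
```

The languagerule element itself is kept, so the language entry remains in the file but no longer contains any break rules.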