Regex (TMS)

A regular expression (abbreviated as regex or regexp) is a sequence of characters that forms a search pattern, mainly used for matching patterns in strings. It works like a find and replace operation but allows more complex and specific searches, and it can also be used to exclude defined content. See the Wikipedia entry for a detailed description of regex and a table of used characters.

To use multiple regex at a time, insert a pipe character | between them.
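
As a small illustration of combining patterns, the sketch below joins two expressions from this article with |. It uses Python's `re` module as a stand-in, which is an assumption: Phrase itself uses Java or Lucene regex, although these simple patterns should behave the same way. The sample string is made up.

```python
import re

# Two patterns joined with | : an HTML-like tag OR a {variable} placeholder.
COMBINED = re.compile(r"<[^>]+>|\{[^\}]+\}")

# Hypothetical sample text, not taken from a real project.
sample = "Click <b>{button}</b> to continue."

# findall returns every non-overlapping match, left to right.
matches = COMBINED.findall(sample)
print(matches)
```

Either alternative can match at any position, so tags and placeholders are found in the order they appear in the text.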

Regex can be used in the filter, search and replace fields in the CAT editor, in the source and target fields of the search for content feature, for the Convert to tags feature in File import settings and for customizing segmentation rules. The converter and CAT desktop editor use Java regex, while the CAT web editor and search in TMS use the Lucene regex engine.

Tip

AI chatbots can be very effective at generating and verifying regex.

Use tools like Regex101 to test regex with different inputs.

Important

Phrase supports Java regex, but will reject complex regular expressions to protect the system from overload. Complex regex are those with quantifiers (except possessives) on groups which contain other quantifiers (except possessives).

General Examples

Examples for converting text into tags when importing files and using regex in the desktop editor for filtering and find and replace functions:

Example	Description
<[^>]+>	represents <html_tag>
\{[^\}]+\}	represents {variable}
\[[^\]]+\]	represents [variable]
\[\[.+?\]\]	represents [[aa[11]bb]]
\$[^\$]+\$	represents $operator_Name1$
\d+	represents numbers. Also, [0-9]+
[A-Za-z0-9]	represents any alphanumeric character.
.+\@.+\..+	email address name@domain.com
\d{4}[-]\d{2}[-]\d{2}	the date 2018-08-01
\s$	a whitespace at the end of the segment
^\s	a whitespace at the beginning of the segment
\s\s	a double whitespace
^\d	a digit at the beginning of the segment
\w+\s\s\w+	a double whitespace between words
\s\n	a newline preceded by any whitespace character
\S\n	a newline preceded by any non-whitespace character
<[^>]+>|\$[^=]+=	converts PHP variables and HTML code ($svariable['name'] =)
^\s*\'[^:]+:	converts a JavaScript field key with added whitespace at the beginning of the line ( 'key' :)
\{\{[^\}]+\}\}|\'[^']+\'	does not translate {{text here}} or 'text here' content and converts it to tags
\{\{[^\}]+\}\}	represents text in between {{}} brackets
\([^\(]+\)	represents text in between () brackets
\^[^\^]+\^	represents text in between ^ marks
\@[^\@]+\@	represents text in between @ marks
\^[^\^\?]+\?	represents text in between ^ and ? marks
\'[^']+\'	represents text in between ' ' apostrophes
\"[^"]+\"	represents text in between "" quotation marks
\%[^\%]+\%	represents text in between % symbols
\$\{[^}]*\}	represents text in between ${ and }, e.g. ${variable}
\$[a-zA-Z0-9\-_]+	represents a string that starts with $, e.g. $appName
(?<=\: ").*(?=")	represents text inside double quotes after a colon and space, e.g. `value` in the string `"key": "value"`
(?<=\: ').*(?=')	represents text inside single quotes after a colon and space, e.g. `JohnDoe` in the string `user: 'JohnDoe'`
(?<=\=).*(?=)	represents text after an equals sign and without space, e.g. key=value
(.*)=	represents text before an equals sign
=(.*)	represents text after an equals sign
\/\/\S*	represents hyperlinks. Also, https:\/\/\S*
</?mrk[^>]*>	represents HTML/XML open and closed `mrk` tags, e.g. <mrk id="abc"> and </mrk>
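
To see what a few of the expressions above actually capture, the following sketch runs them against short sample strings. It is only an illustration: Python's `re` module is assumed as a stand-in for Java regex (these particular patterns use syntax common to both), and the sample strings and the `first_match` helper are hypothetical.

```python
import re

# Patterns copied from the table above; sample strings are hypothetical.
EXAMPLES = {
    "html_tag": (r"<[^>]+>", "Press <br/> now"),
    "curly_variable": (r"\{[^\}]+\}", "Hello {name}!"),
    "iso_date": (r"\d{4}[-]\d{2}[-]\d{2}", "Released on 2018-08-01."),
    "trailing_space": (r"\s$", "Ends with a space "),
    "quoted_value": (r'(?<=\: ").*(?=")', '"key": "value"'),
}

def first_match(pattern: str, text: str):
    """Return the first matched substring, or None when nothing matches."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

results = {name: first_match(p, t) for name, (p, t) in EXAMPLES.items()}
for name, value in results.items():
    print(f"{name}: {value!r}")
```

Testing a pattern on representative strings like this, or in a tool such as Regex101, is a quick way to confirm it matches only the intended text before using it in import settings.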

TXT Import

Note

Since TXT files in TMS are processed line by line, certain regular expressions that work in other environments may not function as expected.

Examples of regular expressions when importing a specific text:

## ErrorMessage ##1## The number must be higher than 0. ##Z##

To import text between ##1## and ##Z##, use regex: (?<=##1## ).*(?= ##Z##)

ErrorMessage ("The number must be higher than 0.")

To import text between (" and "), use regex: (?<=\(").*(?="\))

'errorMessage' = 'The number must be higher than 0.'

To import text after the = sign and between ' and ', use regex: (?<=\= ').*(?=')

errorMessage = "this is to be translated"

To import text after the = sign and between " and ", use regex: (?<=\= ").*(?=")

msgstr "The number must be higher than 0."

To import msgstr strings in monolingual PO files using a TXT filter, use regex: (?<=msgstr ").*(?=")

# Note: This is a note

To exclude lines starting with #, use regex: (^[^#].*)

values '126', 'DCeT', 'Text (en)'

To import only text in quotes and with (en), such as 'Text (en)', use regex: (?<=')[^']*\(en\)(?=')
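
The lookbehind and lookahead expressions in this section can be tried out line by line, as in the sketch below. It assumes Python's `re` module as a stand-in for the Java regex used on import, and the `extract` helper is hypothetical; it only shows which part of each sample line the pattern would select.

```python
import re

# (sample line, regex) pairs based on the examples in this section.
CASES = [
    ("## ErrorMessage ##1## The number must be higher than 0. ##Z##",
     r"(?<=##1## ).*(?= ##Z##)"),
    ('ErrorMessage ("The number must be higher than 0.")',
     r'(?<=\(").*(?="\))'),
    ("'errorMessage' = 'The number must be higher than 0.'",
     r"(?<=\= ').*(?=')"),
    ('errorMessage = "this is to be translated"',
     r'(?<=\= ").*(?=")'),
    ("values '126', 'DCeT', 'Text (en)'",
     r"(?<=')[^']*\(en\)(?=')"),
]

def extract(line: str, pattern: str):
    """Return the text a pattern would pick out of one line, or None."""
    found = re.search(pattern, line)
    return found.group(0) if found else None

extracted = [extract(line, pattern) for line, pattern in CASES]
for text in extracted:
    print(text)

# Excluding comment lines: keep only lines that do not start with '#'.
lines = ["# Note: This is a note", 'errorMessage = "this is to be translated"']
kept = [line for line in lines if re.match(r"(^[^#].*)", line)]
print(kept)
```

Because the lookarounds are not part of the match, only the text between the delimiters is returned, which is the part that would be imported for translation.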

JSON Import

JSON structure example:

{
"list": {
        "id": "1",
        "value": "text 1 for translation."
        },
"text": {
        "id": "2",
        "value": "text 2 for translation."
        },
"menu": {
        "id": "3",
        "value": "text 3 for translation."
         },
"array": ["blue","green"],
"arrays": [
        {
        "color": "blue",
        "title": "BLUE"
        },
        {
        "color": "green",
        "title": "GREEN"
        }
    ]
}

for importing every value regardless of the level, use: (^|.*/)value
for importing only one value from a list, use: list/value
for importing a value from a list and/or menu, use the | (OR) operator: list/value|menu/value
for importing only the first instance of a value from a menu, use: menu\[1\]/value
for importing the content of a JSON array following a certain key, use: (^|.*/)array\[.*\]
to import the content of a specific array of objects, use: (^|.*/)arrays\[.*\].*
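
The sketch below shows one way to reason about which values these path expressions would select. It rests on several assumptions that are not confirmed by this article: keys are joined with /, array items are numbered from 1 in square brackets, and the regex must match the whole path. The `flatten` and `select` helpers are hypothetical and are not part of Phrase.

```python
import json
import re

DOC = json.loads("""
{
  "list": {"id": "1", "value": "text 1 for translation."},
  "text": {"id": "2", "value": "text 2 for translation."},
  "menu": {"id": "3", "value": "text 3 for translation."},
  "array": ["blue", "green"],
  "arrays": [{"color": "blue", "title": "BLUE"},
             {"color": "green", "title": "GREEN"}]
}
""")

def flatten(node, path=""):
    """Yield (path, value) pairs; the path syntax here is an assumption."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}/{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node, start=1):  # assumed 1-based index
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, node

def select(pattern):
    """Return values whose whole path matches the pattern (assumed semantics)."""
    return [value for path, value in flatten(DOC) if re.fullmatch(pattern, path)]

print(select(r"(^|.*/)value"))
print(select(r"list/value|menu/value"))
print(select(r"(^|.*/)array\[.*\]"))
print(select(r"(^|.*/)arrays\[.*\].*"))
```

Under these assumptions, `(^|.*/)array\[.*\]` selects only the items of `array` and not those of `arrays`, because the literal `\[` must follow the key name directly.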

YAML Import

YAML file example:

title: A
text: translate A
categories:
  title: B
  text: translate B
categories:
  title: C
  text: translate C
categories:
  content:
      title: D
      text: translate D

regex for importing:

only 'translate A' : text
only 'translate C': categories\[2\]/text
only 'translate D': categories\[\d+\]/content\[1\]/text
all text: text|categories\[\d+\]/text|categories\[\d+\]/content\[\d+\]/text
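
As a rough check of the expressions above, the sketch below matches them against hand-written key paths. The paths are an assumption about how the YAML example is addressed (repeated keys numbered from 1 in document order, whole-path matching); they are hypothetical and not produced by Phrase.

```python
import re

# Hypothetical key paths for the YAML example, assuming repeated keys
# are numbered from 1 in the order they appear in the file.
PATHS = {
    "title": "A",
    "text": "translate A",
    "categories[1]/title": "B",
    "categories[1]/text": "translate B",
    "categories[2]/title": "C",
    "categories[2]/text": "translate C",
    "categories[3]/content[1]/title": "D",
    "categories[3]/content[1]/text": "translate D",
}

def select(pattern):
    """Return values whose whole path matches the pattern (assumed semantics)."""
    return [value for path, value in PATHS.items() if re.fullmatch(pattern, path)]

print(select(r"text"))
print(select(r"categories\[2\]/text"))
print(select(r"categories\[\d+\]/content\[1\]/text"))
print(select(r"text|categories\[\d+\]/text|categories\[\d+\]/content\[\d+\]/text"))
```

Note that the square brackets must be escaped on both sides; an unescaped [ would start a character class instead of matching a literal bracket.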

Segmentation Rules

Okapi, Java and Unicode are used for segmentation rules in .SRX files.

Using regex in .SRX files is complex and a basic knowledge of regular expression use is recommended before attempting to work with them.

.SRX files contain no-break rules (abbreviations, etc.) and break rules (end of a sentence with a period, etc.).

Example	Description
[\p{C}]	Invisible control character.
[\p{Z}]	Whitespace
[\p{Lu}]	An uppercase letter that has a lowercase variant.
[\p{N}]	Any kind of numeric character.
\Q ... \E	Start and end of a quotation - (\QApprox.\E). This is used for Abbreviations.
\t	Tabulator
\n	Newline
\u2029	Paragraph separator
\u200B	Zero-width space
\u3002	Ideographic full stop
\ufe52	Small full stop
\uff0e	Fullwidth full stop
\uff61	Halfwidth ideographic full stop
\ufe56	Small question mark
\uff1f	Fullwidth question mark
\u203c	Double exclamation mark
\u2048	Question exclamation mark
\u2762	Heavy exclamation mark ornament
\u2763	Heavy heart exclamation mark ornament
\ufe57	Small exclamation mark
\uff01	Fullwidth exclamation mark
`[\u0080-\uFFFF]+`	Characters from the Unicode range \u0080 to \uFFFF
`[\u00a8\u00b9\u00c4]+`	One or more occurrences of the specified Unicode characters inside the square brackets, e.g. \u00a8 + \u00b9 + \u00c4
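
To illustrate how Unicode escapes from the table can act as a break rule, the sketch below splits text after a few sentence-final characters. It is a simplification and not an SRX implementation: Python's `re` module is assumed as a stand-in (it supports `\uXXXX` escapes but not `\p{...}` classes or `\Q...\E`), and the `segment` helper and sample text are hypothetical.

```python
import re

# Break after these sentence-final characters from the table above:
# \u3002 ideographic full stop, \uff1f fullwidth question mark,
# \uff01 fullwidth exclamation mark.
BREAK_AFTER = re.compile(r"(?<=[\u3002\uff1f\uff01])")

def segment(text: str):
    """Split text after each terminator, dropping empty pieces.

    Splitting on an empty (lookbehind-only) match needs Python 3.7+.
    """
    return [piece for piece in BREAK_AFTER.split(text) if piece]

# Hypothetical sample: three short "sentences" with different terminators.
segments = segment("A\u3002B\uff1fC\uff01")
print(segments)
```

A real .SRX file pairs each rule with text before and after the break position and lists no-break rules first, so this sketch only covers the simplest break case.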

Common Custom QA Checks

QA Check	Source regex	Target regex
Additional numbers in target	`\d`	`\d`
Tags order (unpaired, for segments with 3 tags). Adjust the regex according to the required number of tags.	`^.*\{1\}.*\{2\}.*\{3\}.*$`	`^.*\{1\}.*\{2\}.*\{3\}.*$`
Tags order (paired, for segments with 3 tags). Adjust the regex according to the required number of tags.	`^.*\{1\>.*\<1\}.*\{2\>.*\<2\}.*\{3\>.*\<3\}.*$`	`^.*\{1\>.*\<1\}.*\{2\>.*\<2\}.*\{3\>.*\<3\}.*$`
Spaces before tags	`\s(\{[1-9][0-9]*\}|\{[1-9][0-9]*>|<[1-9][0-9]*\}|\{[biu_\^]{1,4}>|<[biu_\^]{1,4}\})`	`\s(\{[1-9][0-9]*\}|\{[1-9][0-9]*>|<[1-9][0-9]*\}|\{[biu_\^]{1,4}>|<[biu_\^]{1,4}\})`
Spaces after tags	`(\{[1-9][0-9]*\}|\{[1-9][0-9]*>|<[1-9][0-9]*\}|\{[biu_\^]{1,4}>|<[biu_\^]{1,4}\})\s`	`(\{[1-9][0-9]*\}|\{[1-9][0-9]*>|<[1-9][0-9]*\}|\{[biu_\^]{1,4}>|<[biu_\^]{1,4}\})\s`
No space before tags	`\S(\{[1-9][0-9]*\}|\{[1-9][0-9]*>|<[1-9][0-9]*\}|\{[biu_\^]{1,4}>|<[biu_\^]{1,4}\})`	`\S(\{[1-9][0-9]*\}|\{[1-9][0-9]*>|<[1-9][0-9]*\}|\{[biu_\^]{1,4}>|<[biu_\^]{1,4}\})`
Non-whitespace characters after paired tags	`((\{[1-9][0-9]*>)|(<[1-9][0-9]*\}))\S`	`((\{[1-9][0-9]*>)|(<[1-9][0-9]*\}))\S`
Missing square brackets	`[^\[\]]*\[[^\[\]]*\][^\[\]]*`	`[^\[\]]*\[[^\[\]]*\][^\[\]]*`
Missing round brackets	`[^\(\)]*\([^\(\)]*\)[^\(\)]*`	`[^\(\)]*\([^\(\)]*\)[^\(\)]*`
Use the following regular expressions to check for the same count of identical decimal numbers, using the appropriate language-specific decimal separator.	`(?<n1>\d+)\.(?<n2>\d+)`	`(?<n1>\d+),(?<n2>\d+)`
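
The decimal number check in the last row can be pictured with the sketch below. It is an approximation under stated assumptions: Python's `re` module is used as a stand-in (named groups are written `(?P<name>...)` in Python rather than `(?<name>...)` as in Java), and the comparison logic in `same_decimals` is a hypothetical reading of "same count of identical decimal numbers", not Phrase's actual implementation.

```python
import re
from collections import Counter

# Python spells named groups (?P<name>...); Java uses (?<name>...).
SOURCE_DECIMAL = re.compile(r"(?P<n1>\d+)\.(?P<n2>\d+)")  # e.g. 3.50
TARGET_DECIMAL = re.compile(r"(?P<n1>\d+),(?P<n2>\d+)")   # e.g. 3,50

def decimals(pattern, text):
    """Count each decimal as an (integer part, fraction part) pair."""
    return Counter((m.group("n1"), m.group("n2")) for m in pattern.finditer(text))

def same_decimals(source: str, target: str) -> bool:
    """True when both segments hold the same decimals the same number of times."""
    return decimals(SOURCE_DECIMAL, source) == decimals(TARGET_DECIMAL, target)

# Hypothetical English source and German target segments.
print(same_decimals("Price: 3.50 or 12.75", "Preis: 3,50 oder 12,75"))
print(same_decimals("Price: 3.50", "Preis: 3,05"))
```

Capturing the integer and fraction parts in named groups lets the two sides be compared independently of the decimal separator used in each language.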