Audio transcription takes audio as input and uses Automated Speech Recognition and Automated Speaker Identification to generate text output. Specifically, the system uses a proprietary instance of the OpenAI Whisper Automated Speech Recognition system.
Monolingual term bases can be created in the page to improve AI transcription accuracy for specialized or difficult terms. Term bases are automatically shared with all users of the same organization in read-only mode.
Phrase Studio consumes Video Localization Hours.
Use cases
-
A 45-minute customer interview recorded as an MP4 file.
A text transcript is generated with speaker identification, which can be used to create a case study and pull quotes for a website.
To create an audio transcription project, follow these steps:
-
From Phrase Studio, click New Project.
The page opens.
-
Either drag a file onto the upload field or click Upload file to locate a file on your system.
The uploaded file is displayed.
-
Optionally, specify the number of speakers in the uploaded file.
-
To set the number of speakers manually, open the dropdown and select a value from 1 to 5. If the file includes more than five speakers, use the default option.
-
-
Provide a name for the project and set the project visibility as required:
-
New projects are public by default. Public projects are visible to all users in the organization who have access to Studio.
-
Deselect to create a private project that is visible only to the project owner. A private project can still be shared with selected users if needed.
-
-
Manually select the or enable for automatic detection.
-
If required, under , enable and select language(s) for the file to be translated into.
-
The translation engine is configurable.
-
If is selected, the file will be transcribed, translated and dubbed immediately without the opportunity to check the translation beforehand.
-
-
Select a to determine subtitle display rules.
Enable to select a profile for each language.
-
Optionally, enable to select existing pronunciations and related pairs for dubbing workflows.
-
If required, configure additional options:
-
Open the section to import existing subtitle files in SRT or VTT format for both source and target languages.
When subtitles are imported, the system skips automatic audio transcription and speaker identification and aligns the existing subtitles with the video. Speakers must be created and assigned manually, because SRT/VTT files do not include speaker information.
-
Open the section to override the account-level settings and select the preferred at the project level.
-
Open the section to select an existing term base or add terms that will be used to detect and match similar-sounding words during transcription.
-
Open the section to select the desired summaries and insights that will be generated for the uploaded recording, and the relevant AI models.
-
-
Click Create project.
The file is uploaded and is displayed on the page.
Click on the recording name to open it in the editor and view it in the and tabs. Both texts can be edited if required.
Click Download to select the transcription and the translations for download to your system. It is also possible to download audio-only tracks in MP3 format.
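The subtitle import option described in the steps above requires speakers to be assigned manually. The following minimal Python sketch illustrates why for the SRT case: each SRT cue carries only a sequence number, a timing line, and text, so there is no field a speaker could be read from. The names `Cue`, `parse_srt`, and `SAMPLE` are hypothetical and used for illustration only; this is not how Phrase Studio processes imported files.

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Cue:
    """One SRT cue: sequence number, timing, and text. No speaker field."""
    index: int
    start_ms: int
    end_ms: int
    text: str


def _to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timing parts to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content):
    """Parse SRT text into a list of Cue objects, skipping malformed blocks."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2:
            continue
        match = TIMING.match(lines[1].strip())
        if not match:
            continue
        parts = match.groups()
        cues.append(
            Cue(
                index=int(lines[0].strip()),
                start_ms=_to_ms(*parts[:4]),
                end_ms=_to_ms(*parts[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return cues


# Hypothetical sample content for illustration.
SAMPLE = """1
00:00:01,000 --> 00:00:04,000
Thanks for joining us today.

2
00:00:04,500 --> 00:00:07,250
Happy to be here.
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(cue.index, cue.start_ms, cue.end_ms, cue.text)
```

Checking an SRT file in this way before import can help estimate how many segments will need a speaker assigned in the editor.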
Extracts structured and meaningful insights such as summaries, sentiment, quality flags, or safety issues from subtitles using AI models.
Insights created in the page are automatically shared with all users of the same organization in read-only mode.
Use cases
-
Summarize customer support calls or identify potentially unsafe or low-quality communication. Phrase Studio returns a summary and flags sections for review.
Detects and labels different speakers in an audio file for clearer transcripts and subtitles.
Automatic speaker identification is not available for projects with imported subtitle files.
Use cases
-
A podcast with multiple participants is processed and each speaker is automatically tagged (e.g., "Speaker 1", "Speaker 2").
Click Manage Speakers under the menu to edit the speaker name or add other speakers.
Use the Combined/Speakers toggle at the bottom of the editor to switch between a single waveform and individual waveforms for each speaker. When multiple speakers are detected, segments can be dragged within a row to reflect overlapping speech, or moved to another row to change the assigned speaker.
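The relationship between rows, speakers, and overlapping speech can be pictured with a small data model. The Python sketch below is an illustration under stated assumptions, not Phrase Studio's implementation: `Segment`, `reassign`, `group_by_speaker`, and `overlapping_pairs` are hypothetical names. It treats each speaker as one row, models moving a segment to another row as changing its speaker, and reports pairs of segments from different speakers whose time ranges overlap.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcript segment with start and end times in seconds."""
    start: float
    end: float
    speaker: str
    text: str


def reassign(segment, new_speaker):
    """Moving a segment to another row corresponds to changing its speaker."""
    segment.speaker = new_speaker


def group_by_speaker(segments):
    """Build one row per speaker, in order of first appearance."""
    rows = {}
    for segment in segments:
        rows.setdefault(segment.speaker, []).append(segment)
    return rows


def overlapping_pairs(segments):
    """Return pairs of segments from different speakers that overlap in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            # Segments are sorted by start time, so no later one can overlap.
            if second.start >= first.end:
                break
            if first.speaker != second.speaker:
                pairs.append((first, second))
    return pairs


# Hypothetical example: all three segments were initially tagged "Speaker 1".
segments = [
    Segment(0.0, 4.0, "Speaker 1", "Welcome to the show."),
    Segment(3.5, 6.0, "Speaker 1", "Thanks for having me."),
    Segment(6.0, 9.0, "Speaker 1", "Let's get started."),
]

# Correct the second segment, which belongs to the guest.
reassign(segments[1], "Speaker 2")

rows = group_by_speaker(segments)
pairs = overlapping_pairs(segments)
print({speaker: len(items) for speaker, items in rows.items()})
print([(a.text, b.text) for a, b in pairs])
```

In this example the reassigned segment starts half a second before the first segment ends, so the two appear as one overlapping pair across two rows.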