SSML for TTS 

Cloud TTS Hub supports the use of Speech Synthesis Markup Language (SSML). SSML is an XML-based markup language that allows you to specify many aspects of how text is synthesized into speech. You can use it to fine-tune pronunciation, rate of speech, voice pitch, volume, and more.

SSML is a standard markup language, but individual TTS providers can differ in how they implement it. Use only the markup that your TTS provider supports in your TTS scripts. Markup from other providers may not work. Refer to your TTS service provider's documentation for information about any SSML variations or requirements specific to that provider.

To use SSML, text input must be:

  • Valid XML
  • Valid SSML
  • Contained within a set of <speak> </speak> tags
  • Marked up with tags that each have only one attribute (this includes the <speak> tag) 
  • Assigned to a dynamic data object or a variable in a SNIPPET action in your script. The object or variable you assign the marked up text to must be used later in your script in the appropriate place to be passed to the TTS service provider. See the examples later in this section. When working with SSML in snippets: 

A TTS script that includes SSML should be similar to the following example: 

Image of a script for TTS with SSML, with a BEGIN, CLOUD TTS, SNIPPET, and PLAY action linked together.

In this example, the CLOUD TTS action defines the Cloud TTS Hub TTS provider and voice. The SNIPPET action contains the marked up SSML text. The SSML text is assigned to a variable that's passed to the PLAY action as a prompt sequence. If you're using this option with a virtual agent, you would use a VOICEBOT EXCHANGE action instead of PLAY.

Example of One Attribute per Tag in SSML

This example shows SSML markup that has only one attribute per tag.

<speak xml:lang="en-US">

<voice name="en-US-JennyNeural"> Good morning Chris! </voice>

<voice name="en-US-ChristopherNeural"> Good morning to you too, Jenny! </voice>

</speak>
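
The requirements listed earlier can be checked before the text is pasted into a script. The following is a minimal sketch in Python, not part of Cloud TTS Hub. The function name check_ssml is hypothetical. It only tests that the text is well-formed XML, that the root element is speak, and that no tag has more than one attribute. It does not validate the text against the SSML specification or against a provider's supported tag set.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix that ElementTree adds to qualified names."""
    return tag.rsplit("}", 1)[-1]


def check_ssml(text):
    """Return a list of problems found in an SSML string (empty if none)."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if local_name(root.tag) != "speak":
        problems.append(f"root element is <{local_name(root.tag)}>, expected <speak>")

    # root.iter() visits the root and every nested element.
    # Note: ElementTree does not list xmlns declarations in attrib,
    # so they are not counted here.
    for element in root.iter():
        count = len(element.attrib)
        if count > 1:
            problems.append(f"<{local_name(element.tag)}> has {count} attributes")
    return problems


if __name__ == "__main__":
    sample = (
        '<speak xml:lang="en-US">'
        '<voice name="en-US-JennyNeural"> Good morning Chris! </voice>'
        "</speak>"
    )
    print(check_ssml(sample))
```

An empty list means that none of the three checks failed. The text can still be rejected by the TTS provider if it uses tags or values that the provider does not support.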

Example of Multiple Sentences with Different Markups in SSML

This example shows text marked up with SSML that contains multiple sentences between the <speak> </speak> tags.

<speak xml:lang="en-US">

Here are <say-as interpret-as="characters">SSML</say-as> samples. I can pause <break time="3s"/>.

I can say cardinal numbers. This number is <say-as interpret-as="cardinal">1135</say-as>.

Or I can say ordinal numbers. You are <say-as interpret-as="ordinal">1135</say-as> in line.

I can even say numbers as digits. The digits are <say-as interpret-as="characters">1135</say-as>.

I can also substitute phrases, like the <sub alias="World Wide Web Consortium">W3C</sub>.

</speak>

Example of a Dynamic Data Object with Marked Up Text in Snippet Code

This example shows how to assign text marked up with SSML to a dynamic data object.

DYNAMIC promptSSML
ASSIGN promptSSML.prompt[1].textToSpeech = "<speak>The SSML should be read in the TTS voice selected in the CLOUD TTS action.</speak>";
ASSIGN promptSSMLJSON = "{promptSSML.asjson()}";

Example of One Message Assigned to a Single Variable in Snippet Code

This example shows how to assign text marked up with SSML to a variable.

ASSIGN playSSML = "<speak xml:lang='en-US'>Here are the SSML samples. Here are <say-as interpret-as='characters'>SSML</say-as> samples. I can pause <break time='3s'/>. I can say cardinal numbers. This number is <say-as interpret-as='cardinal'>1135</say-as>. Or I can say ordinal numbers. You are <say-as interpret-as='ordinal'>1135</say-as> in line. I can even say numbers as digits. The digits are <say-as interpret-as='characters'>1135</say-as>. I can also substitute phrases, like the <sub alias='World Wide Web Consortium'>W3C</sub>. </speak>"
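
Because the snippet string is enclosed in double quotes, the attribute values inside it use single quotes. The following is a rough sketch in Python of one way to convert multi-line SSML with double-quoted attributes into that single-line form. The function name to_assign_literal is hypothetical, and the regular expression is an assumption that covers simple name="value" attributes only. It would also rewrite text content that happens to look like an attribute.

```python
import re
import xml.etree.ElementTree as ET

# Matches name="value" where the value contains no quote characters.
_ATTR = re.compile(r'([\w:.-]+)\s*=\s*"([^"\']*)"')


def to_assign_literal(ssml):
    """Collapse SSML to one line and single-quote its attribute values."""
    # Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    ET.fromstring(ssml)

    # Replace every run of whitespace, including line breaks, with one space.
    one_line = " ".join(ssml.split())

    converted = _ATTR.sub(r"\1='\2'", one_line)
    if '"' in converted:
        # A leftover double quote would end the snippet string early.
        raise ValueError("double quote remains in the converted text")
    return converted


if __name__ == "__main__":
    source = '<speak xml:lang="en-US">\n  I can pause <break time="3s"/>.\n</speak>'
    print(to_assign_literal(source))
```

The returned text can then be placed between the double quotes of an ASSIGN statement.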

Example of One Message Spread Across Multiple Variables in Snippet Code

This example shows using multiple variables to define pieces of the text that you want TTS to speak. The value of myText2 includes the text of myText. The text of myText3 includes the value of myText2, which includes the value of myText, and so on.

ASSIGN myTime = "2:30pm"
ASSIGN myText = "<speak> Here are some examples of what CXone Mpower can do with SSML and cloud TTS. CXone can include a break <break time='3s'/> in a spoken sentence as well as read back numbers in different ways."
ASSIGN myText2 = "{myText} For example, saying the number <say-as interpret-as='verbatim'>12345</say-as> as individual digits or reading it as a cardinal number like this: <say-as interpret-as='cardinal'>12345</say-as>."
ASSIGN myText3 = "{myText2} CXone can also read back words as words or as individual characters <say-as interpret-as='characters'>like this</say-as>."
ASSIGN myText4 = "{myText3} CXone can also use SSML to slow down spoken sentences <prosody rate='70%'> to help people better understand something that's being said </prosody> "
ASSIGN myText5 = "{myText4} or speed them up <prosody rate='170%'> where, for example, the fine print of an agreement can be read back in a short amount of time. </prosody> "
ASSIGN myText6 = "{myText5} Combining SSML and cloud TTS, CXone can also be used for many other things, like reading back time correctly like this. Currently, it's <say-as interpret-as='time' format='hms24' detail='2'>{myTime}</say-as></speak>"
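
When a message is built up this way, only the last variable holds a complete document. The opening <speak> tag is in the first variable and the closing </speak> tag is in the last one, so the intermediate variables are not well-formed XML on their own. The following Python sketch illustrates this with shortened stand-ins for the snippet variables. The names and the abbreviated text are illustrative only, and the f-strings stand in for the {variable} substitution in the ASSIGN statements.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Shortened stand-ins for the snippet variables (illustrative text only).
my_time = "2:30pm"
my_text = "<speak> CXone can include a break <break time='3s'/> in a sentence."
my_text2 = f"{my_text} It can slow down <prosody rate='70%'>like this</prosody>."
my_text3 = f"{my_text2} Currently, it's <say-as interpret-as='time'>{my_time}</say-as></speak>"

if __name__ == "__main__":
    for name, value in [("my_text", my_text), ("my_text2", my_text2), ("my_text3", my_text3)]:
        print(name, is_well_formed(value))
```

In this sketch only my_text3 parses, which corresponds to passing only the final variable (myText6 in the snippet) to the TTS service provider.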