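Having covered Voice commands and Speech recognition, this post moves on to Text-to-Speech. Compared with those two topics, this one is somewhat easier.

The idea is simply to have the application tell the Speech System to read out specified text. The Windows.Phone.Speech.Synthesis API generates synthesized speech, also called text-to-speech (TTS), which an application can use to prompt the user for input, read out message content, announce the current search results, and so on.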


(1) Prepare the required capabilities


(2) A basic TTS sample




private async void ButtonSimpleTTS_Click(object sender, RoutedEventArgs e)
{
  SpeechSynthesizer synth = new SpeechSynthesizer();
  await synth.SpeakTextAsync("You have a meeting with Peter in 15 minutes.");
}


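The Speech System uses an asynchronous mechanism so that the application can carry on with other work while the text is being spoken.

(3) Choosing the voice to speak with

WP 8 includes voices for multiple countries. Each voice (which generates the synthesized speech) is tied to one language, and the voices available depend on Settings / System / language+region.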










(1) only a female voice or only a male voice is returned;

(2) both a female and a male voice are returned;



The following example, taken from <Text-to-speech (TTS) for Windows Phone>, illustrates this:

// Declare the SpeechSynthesizer object at the class level.
SpeechSynthesizer synth;
// Handle the button click event.
private async void SpeakFrench_Click_1(object sender, RoutedEventArgs e)
{
  // Initialize the SpeechSynthesizer object.
  synth = new SpeechSynthesizer();
  // Query for a voice that speaks French.
  IEnumerable<VoiceInformation> frenchVoices = from voice in InstalledVoices.All
                     where voice.Language == "fr-FR"
                     select voice;
  // Set the voice as identified by the query.
  synth.SetVoice(frenchVoices.ElementAt(0));
  // Count in French.
  await synth.SpeakTextAsync("un, deux, trois, quatre");
}


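In addition, Speech Synthesis Markup Language (SSML) can be used to specify the voice for a required language;

see <Speech Synthesis Markup Language Reference>.



The Windows.Phone.Speech.Synthesis namespace defines classes for starting and configuring the speech synthesis engine, creating spoken prompts, responding to events, and modifying the characteristics of the voice.

SpeechSynthesizer provides the connection to the speech synthesis engine and its features, and can speak with a voice selected for a specific language;

The PromptBuilder class appends content for the speech synthesis engine, built from text, SSML markup, or prerecorded audio files;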


(a) SpeechSynthesizer

The main class responsible for text-to-speech (TTS) work. Its important events and methods are:

Type Name Description
Event BookmarkReached An event that fires when a <mark> element is reached in a Speech Synthesis Markup Language (SSML) file.
Event SpeechStarted An event that fires when the synthesized voice begins output.
Method CancelAll Cancels all asynchronous text-to-speech calls that are in the active queue.
Method Close Performs application-defined tasks associated with freeing, releasing, or resetting allocated resources.
Method SetVoice Sets the synthesized voice.
Method GetVoice Gets the active synthesized voice.
Method SpeakSsmlAsync(String) Asynchronously speaks a string of text with Speech Synthesis Markup Language (SSML) markup with a text-to-speech voice.
Method SpeakSsmlFromUriAsync(Uri) Asynchronously speaks the content of a standalone Speech Synthesis Markup Language (SSML) document with a text-to-speech voice.
Method SpeakTextAsync(String) Asynchronously speaks the content of a plain-text string.

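The synthesis API provides the three Speak methods listed above to start speech output. They speak plain text, a string containing SSML markup, or a complete SSML document loaded from a URI, respectively;

(b) VoiceInformation

Describes a text-to-speech voice. Its important properties are: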

Property Access-Type Description
Description Read-only Gets the description of a text-to-speech (TTS) voice.
DisplayName Read-only Gets the display name of the text-to-speech (TTS) voice.
Gender Read-only Gets the gender of the text-to-speech (TTS) voice.
Id Read-only Gets the identifier of the text-to-speech (TTS) voice.
Language Read-only Gets the language of the text-to-speech (TTS) voice.


(c) InstalledVoices

Provides access to the synthesis voices installed on the device under Settings / speech.

Property Access-Type Description
All Read-only Gets the full set of synthesized voices that are available to use as part of the Speech feature.
Default Read-only Gets the default synthesized voice.

Speech Synthesis Markup Language (SSML)

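SSML is an XML-based standard markup language designed for speech synthesis applications. It is also a recommendation of the W3C's Voice Browser Working Group.

It lets developers control many characteristics of synthesized speech, such as voice, language, and pronunciation. Microsoft's implementation is based on version 1.0 as defined by the World Wide Web Consortium (Speech Synthesis Markup Language (SSML) Version 1.0).




The rest of this post follows <Using SSML for advanced text-to-speech on Windows Phone 8> to explain the structure of SSML:

(1) An SSML document or string must be wrapped in a <speak /> element

<speak /> is the root element of the document. It can also be used on its own, without wrapping any other elements. For example: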

<speak version="1.0" 
       xml:lang="string"> </speak>



private async void SpeakBySsmlString()
{
    synth = new SpeechSynthesizer();
    // Define a simple <speak /> element with en-US as the spoken language.
    string ssmlText = "<speak version=\"1.0\" ";
    ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
    ssmlText += " Testing Windows Phone 8 TTS";
    ssmlText += "</speak>";
    await synth.SpeakSsmlAsync(ssmlText);
}
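
String concatenation is fine for short snippets, but it is easy to mistype an escaped quote or to forget to escape characters such as & in the spoken text. As a minimal sketch of an alternative (not taken from the referenced article), the same <speak /> wrapper could be built with LINQ to XML. SsmlHelper and Speak are hypothetical names introduced here for illustration; the namespace URI is the one defined by the W3C SSML 1.0 specification.

```csharp
using System.Xml.Linq;

// Hypothetical helper, not part of the Windows Phone SDK.
public static class SsmlHelper
{
    // SSML 1.0 namespace as defined by the W3C.
    private static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";

    // Wraps the given content (plain strings or SSML child elements)
    // in a <speak> root element for the given language.
    public static string Speak(string language, params object[] content)
    {
        XElement speak = new XElement(Ssml + "speak",
            new XAttribute("version", "1.0"),
            new XAttribute(XNamespace.Xml + "lang", language),
            content);

        // Text content is escaped automatically when serialized.
        return speak.ToString(SaveOptions.DisableFormatting);
    }
}
```

The result can be passed straight to synth.SpeakSsmlAsync(), for example SsmlHelper.Speak("en-US", "Testing Windows Phone 8 TTS"). Child elements added this way should be created in the same SSML namespace so that they are not serialized with an empty xmlns attribute.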

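(2) Adding sound files

Besides putting text directly inside <speak />, you can place an <audio /> element within the text to be spoken. For example:

given the sentence "this is a book.", if I want "book" to be played from my own audio file, I can write

"this is a <audio src="ms-appx:///Assets/book.wav">book</audio>".

However, not every audio format can be used with <audio />. The file must meet the following requirements:

‧PCM, a-law, or u-law format;

‧8-bit or 16-bit depth;

‧non-stereo (mono only);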

private async void SpeakByStringInAudio()
{
    ssmlText = "<speak version=\"1.0\" ";
    ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
    ssmlText += "Here comes the dog, ";
    // Specify the audio file to play; the element text is the spoken fallback.
    ssmlText += "<audio src=\"ms-appx:///Assets/cats.wav\">Dog </audio>";
    ssmlText += "</speak>";
    await synth.SpeakSsmlAsync(ssmlText);
}


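Also, although a relative location such as Assets/cats.wav may work for src in some cases, it is safer to write the location with a full URI scheme.

(3) Inserting pauses

The <break /> element inserts a pause into the speech, optionally for a specified length of time. It can be used with two attributes:

‧strength: optional; one of none, x-weak, weak, medium, strong, or x-strong;

‧time: optional; the length of the pause in seconds or milliseconds, for example 2s or 500ms;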


ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
// Define one pause by duration and another by strength.
ssmlText += "There is a pause <break time=\"500ms\" /> here, ";
ssmlText += "and another one <break strength=\"x-strong\" /> here";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);
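
Because strength only accepts a fixed list of values, a typo in a hand-written string is easy to miss until the synthesizer rejects it or ignores it. The following is a rough sketch, assuming LINQ to XML is used to build the markup, of two small factory methods for <break /> elements. SsmlBreaks, PauseFor, and PauseWithStrength are hypothetical names, not SDK APIs.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical helper, not part of the Windows Phone SDK.
public static class SsmlBreaks
{
    private static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";

    // Strength values listed in the section above.
    private static readonly string[] Strengths =
        { "none", "x-weak", "weak", "medium", "strong", "x-strong" };

    // Builds a <break> element whose time attribute is given in whole milliseconds.
    public static XElement PauseFor(TimeSpan length)
    {
        long milliseconds = (long)length.TotalMilliseconds;
        string time = milliseconds.ToString(CultureInfo.InvariantCulture) + "ms";
        return new XElement(Ssml + "break", new XAttribute("time", time));
    }

    // Builds a <break> element with a strength attribute, rejecting unknown values.
    public static XElement PauseWithStrength(string strength)
    {
        if (Array.IndexOf(Strengths, strength) < 0)
        {
            throw new ArgumentException("Unknown break strength: " + strength, "strength");
        }
        return new XElement(Ssml + "break", new XAttribute("strength", strength));
    }
}
```

For instance, SsmlBreaks.PauseFor(TimeSpan.FromMilliseconds(500)) yields an element carrying time="500ms", which can then be placed between text nodes inside a <speak /> element built in the same namespace.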

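(4) Defining or changing the pronunciation of a word

SSML provides two ways to tell the speech synthesizer to adjust the pronunciation of a word:

‧Define a one-off pronunciation inline for that word, using <phoneme />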




ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
// Define <phoneme /> and its attributes: ph is the pronunciation; alphabet is a fixed value.
ssmlText += "<phoneme alphabet=\"x-microsoft-ups\" ph=\"O L AA\">hello</phoneme>";
ssmlText += ", I mean hello";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);

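‧Define the pronunciation of several words in one place, using <lexicon />

=> Using <lexicon /> requires creating a separate lexicon file. This file is also XML-based and maps words to their pronunciations, as in the following example: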

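<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
        xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
        alphabet="x-microsoft-ups" xml:lang="en-US">
    <lexeme>
        <!-- grapheme inferred from the sample sentence used with this lexicon -->
        <grapheme>wife</grapheme>
        <phoneme> W AI F AI</phoneme>
    </lexeme>
</lexicon>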

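Each word is defined by one <lexeme /> element, which contains a <phoneme /> (how the word is pronounced) and a <grapheme /> (which word gets the custom pronunciation).

=> To use the lexicon file with SpeechSynthesizer.SpeakSsmlAsync(), add a <lexicon /> element inside <speak />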


ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
ssmlText += "<lexicon uri=\"ms-appx:///Assets/lexicon1.xml\"";
// Specify the type, which is the same as the MIME type.
ssmlText += " type=\"application/pls+xml\"/>";
ssmlText += "She is not my wife";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);

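Note that if an SSML string contains both <phoneme /> and <lexicon />, the speech synthesizer gives <phoneme /> the higher priority.

For more details, see <lexicon Element SSML> and <Speech Synthesis Markup Language Reference>.

(5) Changing voices


In code, the voice is changed by searching InstalledVoices for the required language and then setting it. In SSML, the <voice /> element does the same job. It has several attributes,

all of them optional, but at least one must be present. The speech synthesizer treats these attributes as preferred values.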


Attribute Description
name Optional. Specifies the name of the installed voice that will speak the contained text.
gender Optional. Specifies the preferred gender of the voice that will speak the contained text.

The allowed values are: male, female, and neutral.

age Optional. Specifies the preferred age in years of the voice that will speak the contained text.

The allowed values are: 10 (child), 15 (teen), 30 (adult), and 65(senior).

xml:lang Optional. Specifies the language that the voice must support.

The value may contain either a lower-case, two-letter language code, (such as en for English), or may optionally include an upper-case, country/region or other variation in addition to the language code, (such as zh-CN).

variant Optional. An integer that specifies a preferred voice when more than one voice matches the values specified in any of the xml:lang, gender, or age parameters.


ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
ssmlText += "<voice name=\"Microsoft Susan Mobile\" gender=\"female\" age=\"30\"";
ssmlText += " xml:lang=\"en-US\">";
ssmlText += "This is another test </voice>";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);

The voice can also be changed for part of the content with <p xml:lang="" /> and <s xml:lang="" />, as follows:

ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-GB\">";
// Use <p /> and <s /> to switch voices.
ssmlText += "<p>";
ssmlText += "<s>First sentence of a paragraph</s>";
ssmlText += "<s xml:lang=\"en-US\">And this is the second sentence</s>";
ssmlText += "</p>";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);

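(6) Changing the prosody of the speech

The <break /> element can pause the speech or adjust its pacing. Beyond that, <prosody /> offers more attributes for fine-tuning. For example: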

<prosody pitch="value" contour="value" 
         range="value" rate="value" 
         duration="value" volume="value"> </prosody>

Attribute Description
pitch Optional. Indicates the baseline pitch for the contained text.

This value may be expressed in one of three ways:

  • An absolute value, expressed as a number followed by "Hz" (Hertz). For example, 600Hz.
  • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch. For example +80Hz or -2st. The “st” indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.
  • An enumeration value, from among the following: x-low, low, medium, high, x-high, or default.
contour Optional. Represents changes in pitch for speech content as an array of targets at specified time positions in the speech output.

Each target is defined by sets of parameter pairs, for example:

<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">

The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the contained text (a number followed by "%").

The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch, see above.

range Optional. A value that represents the range of pitch for the contained speech content.

This value may be expressed using the same absolute values, relative values, or enumeration values used to describe pitch, see above.

rate Optional. Indicates the speaking rate of the contained text.

This value may be expressed in one of two ways:

  • A relative value, expressed as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the rate. A value of .5 results in a halving of the rate. A value of 3 results in a tripling of the rate.
  • An enumeration value, from among the following: x-slow, slow, medium, fast, x-fast, or default
duration Optional. A value in seconds or milliseconds for the period of time that should elapse while the speech synthesis (TTS) engine reads the contents of the element. For example 2s or 1800ms.
volume Optional. Indicates the volume level of the speaking voice.

This value may be expressed in one of three ways:

  • An absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. For example, 75. The default is 100.0.
  • A relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume. For example +10 or -5.5.
  • An enumeration value, from among the following: silent, x-soft, soft, medium, loud, x-loud, or default.


ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
ssmlText += "Testing the ";
// Define <prosody />.
ssmlText += "<prosody pitch=\"+100Hz\" volume=\"70.0\" >Prosody</prosody>";
ssmlText += " element. ";
ssmlText += "Normal,<prosody rate=\"2\"> Very Fast,</prosody>";
ssmlText += "<prosody rate=\"0.4\"> now slow,</prosody>";
ssmlText += "and normal again";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);
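
When the rate comes from a variable rather than a literal, one pitfall is number formatting: on a device whose region uses a decimal comma, calling ToString() on 0.4 without a culture produces "0,4", which is not a valid rate value. A small sketch of a culture-safe helper follows, assuming the markup is built with LINQ to XML. SsmlProsody and WithRate are hypothetical names used only for illustration.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical helper, not part of the Windows Phone SDK.
public static class SsmlProsody
{
    private static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";

    // Wraps text in <prosody rate="..."> where rate is a multiplier of the
    // default speed (1 = unchanged, 0.5 = half speed, 2 = double speed).
    public static XElement WithRate(double rate, string text)
    {
        if (rate <= 0)
        {
            throw new ArgumentOutOfRangeException("rate", "Rate must be positive.");
        }

        // InvariantCulture guarantees a '.' decimal separator on every device.
        string value = rate.ToString(CultureInfo.InvariantCulture);
        return new XElement(Ssml + "prosody", new XAttribute("rate", value), text);
    }
}
```

The same approach would apply to the volume attribute, which also takes a decimal number.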

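(7) Monitoring speech progress

If the application needs to react at specific points while text is being spoken, add a <mark /> element to the SSML at each point of interest. When the speech synthesizer reaches a <mark /> during playback, it raises the BookmarkReached event, whose arguments (SpeechBookmarkReachedEventArgs) carry the information about that <mark />. For example: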

public MainPage()
{
    synth = new SpeechSynthesizer();
    // Add the event handler for the speech progress events
    synth.BookmarkReached += new TypedEventHandler<SpeechSynthesizer,
        SpeechBookmarkReachedEventArgs>(synth_BookmarkReached);
}

private async void Button7_Click(object sender, RoutedEventArgs e)
{
    ssmlText = "<speak version=\"1.0\" ";
    ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
    // Mark each point to be reported with <mark />.
    ssmlText += "<mark name=\"START\"/>";
    ssmlText += "This is the first half of the speech.";
    ssmlText += "<mark name=\"HALF\"/>";
    ssmlText += "and this the second half. Ending now";
    ssmlText += "<mark name=\"END\"/>";
    ssmlText += "</speak>";
    await synth.SpeakSsmlAsync(ssmlText);
}

static void synth_BookmarkReached(object sender, SpeechBookmarkReachedEventArgs e)
{
    Debugger.Log(1, "Info", e.Bookmark + " mark reached\n");
}
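
To turn the reported bookmark names into a progress figure, the application needs to know which marks the SSML contains and in what order. One possible approach, sketched here under the assumption that the <speak /> element declares the standard SSML namespace, is to parse the SSML text once and look up each reported name. SsmlMarks, Names, and PercentReached are hypothetical names introduced for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the Windows Phone SDK.
public static class SsmlMarks
{
    private static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";

    // Returns the name of every <mark /> element in document order.
    public static List<string> Names(string ssmlText)
    {
        return XElement.Parse(ssmlText)
            .Descendants(Ssml + "mark")
            .Select(mark => (string)mark.Attribute("name"))
            .ToList();
    }

    // Returns how far through the list of marks the given bookmark is,
    // as a whole percentage; unknown bookmarks count as 0.
    public static int PercentReached(IList<string> names, string bookmark)
    {
        int index = names.IndexOf(bookmark);
        return index < 0 ? 0 : (index + 1) * 100 / names.Count;
    }
}
```

Inside synth_BookmarkReached, the value of e.Bookmark would be passed as the bookmark argument. The percentage counts marks, not elapsed time, so it is only as accurate as the spacing of the marks in the text.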

(8) Specifying content type and aliasing parts of a speech

Use <say-as /> to indicate a specific content type (for example, dates or numbers). Its format is:

<say-as interpret-as="string" format="digit string" detail="string"> </say-as>

Attribute Description
interpret-as Required. Indicates the content type of text contained in the element.

The SSML 1.0 say-as attribute values specification defines six content types.

format Optional. Provides additional information about the precise formatting of the contained text for content types that may have ambiguous formats. SSML defines formats for content types that use them.
detail Optional. Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks.

There are no standard values defined for the detail attribute. Support for this attribute depends on the individual speech synthesis engine.


Interpret-as Format Interpretation
date dmy, mdy, ymd, ym, my, md, dm, d, m, y

The contained text is a date in the specified format.

In the format designations, d=day, m=month, and y=year.

The format for date indicates which date components are represented and their sequence.

The following is an example of a say-as element that contains a date:

Today is <say-as interpret-as="date" format="mdy">10-19-2003</say-as>

The speech synthesizer should pronounce “Today is October nineteenth two thousand three”.

cardinal - The contained text should be spoken as a cardinal number.

The following is an example of a say-as element that contains a cardinal number:

There are <say-as interpret-as="cardinal">3</say-as> alternatives.

The speech synthesizer should pronounce “There are three alternatives”.

ordinal - The contained text should be interpreted as an ordinal number.

The following is an example of a say-as element that contains an ordinal number:

Select the <say-as interpret-as="ordinal">3rd</say-as> option.

The speech synthesizer should pronounce “Select the third option”.

characters - Indicates that each letter in the contained text should be pronounced individually (spelled out).

The following is an example of a say-as element that contains a word that should be spoken as individual letters:

<say-as interpret-as="characters">test</say-as>.

The speech synthesizer should pronounce each letter: “T E S T”.

time hms12, hms24


The contained text is a time. Time may be expressed using either a 12-hour clock (hms12) or a 24-hour clock (hms24).

The format attribute indicates which clock to use. The following is an example of a say-as element that contains a time:

The train departs at <say-as interpret-as="time" format="hms12">4:00am</say-as>.

The speech synthesizer should speak “The train departs at four A M”.

Use a colon to separate numbers representing hours, minutes, and seconds.

The following time strings are all valid examples: 12:35, 1:14:32, 08:15, and 02:50:45.

telephone digit string The contained text is a telephone number. The format attribute may contain digits that represent a country code, for example “1” for the United States or “39” for Italy.

The speech synthesis engine may use this information to guide its pronunciation of a phone number.

The country code may also be included in the phone number, and if so, takes precedence over the country code in the format attribute if there is a mismatch. The following is an example of a say-as element that contains a telephone number:

The number is <say-as interpret-as="telephone" format="1">(888) 555-1212</say-as>.

The speech synthesizer should speak “The number is area code eight eight eight five five five one two one two”.


ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
ssmlText += "<p>This is an ordinal number: <say-as interpret-as=\"ordinal\">121</say-as></p>";
ssmlText += "<p>This is a cardinal number: <say-as interpret-as=\"cardinal\">121</say-as></p>";
ssmlText += "<p>And these are just individual numbers: <say-as interpret-as=\"characters\">121</say-as></p>";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);
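
Since interpret-as is required and format only applies to some content types, a small factory method can keep the two attributes consistent. The sketch below is only an illustration built on LINQ to XML; it accepts just the six content types listed in the table above. SsmlSayAs and Create are hypothetical names.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper, not part of the Windows Phone SDK.
public static class SsmlSayAs
{
    private static readonly XNamespace Ssml = "http://www.w3.org/2001/10/synthesis";

    // The content types described in the table above.
    private static readonly string[] KnownTypes =
        { "date", "cardinal", "ordinal", "characters", "time", "telephone" };

    // Builds <say-as interpret-as="..." [format="..."]>text</say-as>.
    public static XElement Create(string interpretAs, string text, string format = null)
    {
        if (Array.IndexOf(KnownTypes, interpretAs) < 0)
        {
            throw new ArgumentException(
                "Unsupported interpret-as value: " + interpretAs, "interpretAs");
        }

        XElement element = new XElement(Ssml + "say-as",
            new XAttribute("interpret-as", interpretAs),
            text);

        // format is optional, so only emit it when the caller supplies one.
        if (format != null)
        {
            element.SetAttributeValue("format", format);
        }
        return element;
    }
}
```

For example, SsmlSayAs.Create("date", "10-19-2003", "mdy") corresponds to the date sample in the table, and SsmlSayAs.Create("cardinal", "121") omits the format attribute altogether.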

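In addition, <sub /> can be used to have a word read out as a different, fuller phrase. A typical case is text that contains an abbreviation which should be spoken in its full form.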



ssmlText = "<speak version=\"1.0\" ";
ssmlText += "xmlns=\"http://www.w3.org/2001/10/synthesis\" xml:lang=\"en-US\">";
// Define an alias so that "WP8" is spoken as "Windows Phone 8".
ssmlText += "This code runs on <sub alias=\"Windows Phone 8\">WP8</sub>";
ssmlText += "</speak>";
await synth.SpeakSsmlAsync(ssmlText);

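(9) Playing an SSML document

An SSML document is itself an XML file that gathers the elements and attributes introduced above into a single file.

Use SpeakSsmlFromUriAsync() to load an SSML document from the application by URI and speak it. A complete SSML document looks like this:

<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice gender="male" xml:lang="en-US">
        <prosody rate="0.8">
            <p>Thanks for reading the article, and thanks for trying the examples</p>
            <p>Now be creative, and create amazing applications for this fantastic platform</p>
            <voice gender="male" xml:lang="es">Adios</voice>
        </prosody>
    </voice>
</speak>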


await synth.SpeakSsmlFromUriAsync(new Uri("ms-appx:///Assets/SSML1.xml"));


