Text to speech in language education

This post is my response to the first assignment of the open teacher education course identified below.

Pronunciation: text to speech

Open Educational Resources (OER) and Automatic Language Processing for Language Learning


In up to 500 words, write down your experience with the OERs proposed. Have they been useful? Do you think they will help your students learn more successfully? Why?

The OERs are these:

http://text-to-speech.imtranslator.net (available in multiple languages)
http://www.fromtexttospeech.com (available in multiple languages)
https://text-to-speech-demo.mybluemix.net (with expressive SSML)

1. Natural readers https://www.naturalreaders.com/index.html

I pasted in this text from Google News and tried various accents.

A surprise victory for the government at this late stage seems unlikely and would be met with head-scratching in No 10, which has already conceded that parliament should be consulted at the end of the Brexit process.

Mike (US) and Graham (UK) both misread No 10 (“no ten” instead of “number ten”), and the stress on head-scratching is off (head-SCRATCHing instead of HEAD-scratching).

I’m not sure I understand why we would want to hear it read by French or Italian speakers. How is this engineered? Is it sampled from French speakers reading English text, or does it just apply algorithms for machine reading of French to English text? I suspect the second. I teach English to French speakers so am certainly used to French-accented English, but I have heard nothing like “Alain” reading about Brexit (to hear him, paste my inset text above into the tool and choose “Alain”). I defy anyone to understand “Juliette’s” version without a transcript.

I teach mainly French learners of English in higher education contexts in France. Some of them are future secondary school teachers of English facing national teacher entrance examinations which place high value on phonological and morphosyntactic accuracy in planned monologues. I have discussed some of the pronunciation problems I see in the post “Improving spoken English: intermediate/advanced”. I’m not at all sure how I would exploit text-to-speech tools with these students. They can get better information on phonemes and word stress from online dictionaries, and the suprasegmental information in the samples I’ve heard here doesn’t seem reliable enough to be useful.

2. IMtranslator http://text-to-speech.imtranslator.net/

I thought this was quite impressive. I typed in conversational French and the translation was pretty accurate, though the intonational contours were perhaps less so.


3. Text to speech http://www.fromtexttospeech.com/

Next, another resource, From Text to Speech, which I tried with the first paragraph of a CALL article (González-Lloret, 2011):

The potential of CMC for L2 development resides mainly in the possibility that learners have to engage with other speakers of the language, including L1 speakers, which is especially important for the acquisition of not only linguistic resources but also social and pragmatic competence. As Thorne (2006) states “the use of Internet technologies to encourage dialogue between distributed individuals and partner classes proposes a compelling shift in second (L2) and foreign language (FL) education, one that ideally moves learners from simulated classroom-based contexts towards actual interaction with expert speakers of the language they are studying” (p. 3).

This tool creates an mp3 which you can link to (how long is it stored?) or download. My WordPress won’t accept this file type so I put it on SoundCloud for convenience.

Of course you can just visit From Text to Speech and do your own cut-and-paste with your choice of speaker (mine was British “Emma”). Or try French “Gabriel” (set to fast; there are 4 speeds) for another surreal experience.

To my ear “Emma’s” is a pretty good rendering – no obvious errors in intonation that would mislead the listener. But I’m struggling to imagine uses in the language classroom. I might use it if I wanted to have an article read to me during a commute, for example, though the time and planning required to convert and save the file to a device might not be worthwhile. If learners wanted more aural input, better to use authentic sources, surely, of which there is no lack.

In our course assignment, we were encouraged to experiment with different versions of sentences “to see how grammar affects voice outputs.” Here we see that the US voice distinguishes between the lexical verb have and the modal have to /hæftə/:

  • learners have the possibility to engage with other speakers of the language mp3
  • learners have to engage with other speakers of the language mp3

You don’t seem to be able to retrieve this information from the Collins dictionary (have) so this gives the tool an advantage over a traditional learner dictionary in this case.

(On another note, you need to refresh the page (click the banner icon) for each new query; you can’t just cut and paste new text in the window.)

4. My blue mix https://text-to-speech-demo.mybluemix.net

The tool was developed by IBM presumably for commercial purposes (see description). Here I listened to British and American voices reading English, and a French voice for French. I thought the French sounded better; is French intonation easier to imitate, or is my ear for French less discriminating?

There’s a feature called “expressive SSML” that adjusts the prosody of the output; the example in the demo is geared to customer service.

The Apology mode seems to place more emphasis (volume, length, pausing). Uncertainty has more pausing, Good News more pitch variation.
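
For readers curious about what sits behind these modes, here is a minimal sketch of how such markup might be put together. It only builds the SSML text and does not contact any speech service. The `express-as` element and the three style spellings are assumptions based on the labels shown in the demo, and the helper name `express_as` is purely illustrative.

```python
from xml.sax.saxutils import escape

# Style names as labelled in the demo; the exact spellings accepted by the
# service are an assumption here.
STYLES = ("Apology", "Uncertainty", "GoodNews")


def express_as(text: str, style: str) -> str:
    """Wrap a sentence in SSML that requests one expressive style.

    Hypothetical helper: the <express-as> element name is assumed, not
    confirmed against the service documentation.
    """
    if style not in STYLES:
        raise ValueError(f"unknown style: {style}")
    # Escape &, < and > so the sentence cannot break the markup.
    return f'<speak><express-as type="{style}">{escape(text)}</express-as></speak>'


if __name__ == "__main__":
    sentence = "Your order has been delayed."
    # Print one SSML variant per style, ready to paste into a demo window.
    for style in STYLES:
        print(express_as(sentence, style))
```

Pasting the three variants of the same sentence into the demo, one after another, would make it easier to hear what each style changes while the wording stays constant.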

Another feature, Voice Transformation, shows variation along different parameters: glottal tension, breathiness, strength, pitch range. Eleven of the 13 voices are female, and only two (female) are transformable in this way. It feels a bit Ex Machina.

Developers can use the tools to customise their own voices and specific texts.

Again, it’s not obvious to me how either the demo or the tool could be used for language teaching and learning beyond awareness-raising. I suppose lower proficiency learners could compare intonation in native and target languages, and more advanced ones could record themselves and compare with the synthetic voices. The tools seem to be ranked in order of sophistication, with perhaps the IBM demo the most convincing. It’s certainly interesting to see how these tools have developed in the past decade or so.


González-Lloret, M. (2011). Conversation analysis of computer-mediated communication. CALICO Journal, 28(2), 308-325.