|
|
Text-To-Speech (TTS) -- Frequently Asked Questions
|
Table of Contents
- About the website.
- About the product.
- About the research group.
- About the technology.
|
|
|
How does the demo work?
|
|
This section addresses the mechanics of using the web page. For more information
on the Text-To-Speech technology behind the demo please see
How does TTS work?
The Text-To-Speech demo has three easy steps which are numbered on the page.
-
First, choose one of the available voices. This also selects the language.
Each voice was created from recordings of a native speaker of that language.
-
Second, enter some text. NOTE: the web demo limits the length of the text.
The text should match the chosen language, because no language translation
is done. If you enter German text for Italian voices the results will be
unpredictable at best.
-
Third and last, click "SPEAK" or "DOWNLOAD". The "SPEAK" button should start
downloading audio within a few seconds and play it with no further action (unless
your browser is set up to download it instead). The "DOWNLOAD" button displays
a new page containing a link to the audio file. In most browsers you can left-click
that link to play the file or right-click for a menu. The menu probably has an
option named "Save Link Target As..." or something to that effect which you can
use to save the file with a name and location of your choosing.
The HTML form uses ISO 8859-1 (Latin-1) character encoding so input text
generally works as expected, though there have been one or two exceptions.
TTS accepts diacritics appropriate to the language being synthesized, e.g.
"ñ" in Spanish and "ü" in German. Some browsers provide a way
to input special characters, or you may want to copy and paste from another
application.
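As a rough illustration of the encoding point above, the following Python sketch checks whether a piece of text can be represented in ISO 8859-1 before you paste it into the form. The function names are hypothetical helpers invented for this example; they are not part of the demo or the product.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


def unsupported_chars(text: str) -> list[str]:
    """List distinct characters Latin-1 cannot represent, in order of first appearance."""
    seen = []
    for ch in text:
        if not fits_latin1(ch) and ch not in seen:
            seen.append(ch)
    return seen
```

Characters such as "ñ" and "ü" pass this check, while curly quotes and the euro sign (common when pasting from a word processor) do not, which may explain some of the occasional exceptions.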
|
|
|
|
What if the demo does not work properly?
|
|
The demo page was designed to be as simple as possible. You submit
an HTML form containing text and voice selection. We return either an HTTP header redirection
to the completed audio file ("SPEAK" button) or a new page displaying the audio URL ("DOWNLOAD"
button). There is some minimal JavaScript but it does not affect website operations.
There are two categories of problems that sometimes arise, and it is helpful if
they are reported separately, as follows:
-
If a word or name is mispronounced or garbled, use the "SEND FEEDBACK" button
to let us know. This sort of feedback is very helpful for improving the
dictionaries and finding errors in the voice database. Please be sure that
the text and voice being submitted actually created the problem so that we
can replicate it. Any details you can provide are most appreciated, for example
"My name should sound like ...", or "should rhyme with ...".
Please, please, please check the spelling before you report a mispronunciation.
Less typing for you, less time reading email for us.
And, no, our software does not include spelling correction. That can be done
by an application before submitting text to the TTS engine, if appropriate.
Spelling correction can sometimes make outrageous changes to names and acronyms.
-
If you get no audio at all or only part of it, if it is clearly not speech,
or if it stops and starts in pulses, then something is wrong with the speech
creation, delivery, or playback. Please continue reading.
If something is wrong, there may be a problem with our web servers, with the internet itself,
or with the hardware or software on your computer. This checklist should help you determine
where the problem is.
If you don't solve the problem directly, you will at least have a better
idea what is wrong. If you believe that there is a problem with our website or TTS
software, please pass along as much detail as you have (scenario, error messages, etc.)
to tts-feedback.
Please note: if the problem is with your computer or browser, we cannot help you.
We're not qualified and it's not our job.
- Step 1. Try again.
Maybe there was a glitch in the network. Consistent problems are real problems.
- Step 2. Error message?
Was there an actual error message? From the web? From the audio player?
Please send the exact text of any error messages.
- Step 3. How far did you get?
Did you see the website demo page? Select a speaker and type some text?
Click the "SPEAK" or "DOWNLOAD" button? Did it appear to download any audio?
Was there a download progress indicator? Was an audio player launched or attempted?
If you get no audio download, there may be something wrong with our TTS server.
If you download audio but cannot play it, the problem is likely on your side
(but keep reading for hints).
The website's basic operation is as simple as possible. You submit voice name
and text. If all is well, our site will redirect your browser to an ordinary WAV file
containing the speech. That file will be available for about 5 minutes, after
which it will be deleted to recover disk space. You only need a standard browser
and the ability to play a WAV file.
Please note that one of the most common problems is caused by packages that block
pop-up windows. These packages also frequently block indirect audio downloads.
Test by clicking "DOWNLOAD" rather than "SPEAK".
Click the resulting audio link to listen. This should work even if HTTP redirections
to audio files are blocked. You'll have to reconfigure or disable your blocker to
use the "SPEAK" button.
- Step 4. Sound, but something is wrong?
Maybe you hear sound, or even speech, but it sounds wrong somehow. There are
several possibilities. First be sure that you can play audio files.
Try a sample audio file or test your browser's
audio capability on some multimedia test page.
Did everything sound fine but then stop early? There is currently a 300 character
limit on the website to reduce server load. The CGI script deletes the extra characters,
even in the middle of a word, which can yield odd results for the final word.
The limit may change without notice as server load varies.
Note that the TTS software itself will synthesize any length text.
Did it play at the wrong speed? We currently deliver only 16 kHz audio. You
may have to convert the sample rate if your audio card cannot handle this, but
that would be rare these days.
Did the speech sound choppy or stop too early? Some Microsoft audio players
seem to have problems on the initial download. If the player can replay, e.g.
with VCR-type buttons, it should sound OK on the second playing.
This is a common complaint but we're not sure how to avoid this.
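The character limit mentioned in Step 4 can be sketched as follows. This is only an illustration of the described behavior (a hard cut that may split the final word), plus a client-side alternative that trims at a word boundary before submitting. The function names and the constant are assumptions for this example; the actual CGI script is not shown in this FAQ and may differ.

```python
MAX_CHARS = 300  # limit quoted in this FAQ; the site may change it without notice


def truncate_for_demo(text: str, limit: int = MAX_CHARS) -> str:
    """Hard cut at `limit` characters, even mid-word, as the FAQ describes."""
    return text[:limit]


def truncate_at_word(text: str, limit: int = MAX_CHARS) -> str:
    """Client-side alternative: cut at the last space so no word is split."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the character right after the cut is whitespace, the last word is whole
    if text[limit].isspace():
        return cut.rstrip()
    head, sep, _ = cut.rpartition(" ")
    # With no space at all there is nothing sensible to trim to
    return head.rstrip() if sep else cut
```

Trimming your own text this way before submitting avoids the odd-sounding final word, since the server-side cut then has nothing left to remove.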
|
|
|
|
What audio format is used?
|
|
The speech output audio format is a simple WAV (RIFF) file containing 16-bit linear PCM
at a 16 kHz sample rate, i.e. 16,000 samples per second, each sample a 16-bit integer,
mono (not stereo). The website uses these wideband voices for best quality.
We also ship 8 kHz versions of the voices (8,000 samples per second, one 8-bit mu-law value
per sample) for a 4-times reduction in voice database size.
The 8 kHz voices are useful for telephony applications (where the phone line limits quality anyway)
and for platforms with storage limitations.
There is no option on the page for MP3 or similar encodings.
The server would likely be overloaded if we added audio compression.
If you need a different sample rate or audio format you can probably find free software to
convert what we deliver. But before you use the audio for something more than private listening
please check the website usage policy.
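If you want to confirm what a downloaded file contains, a minimal Python sketch using the standard `wave` module is shown below. The function names are invented for this example. Note the assumption that the file is linear PCM, as described above; the `wave` module is not suited to mu-law files.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format fields from a linear PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Raw audio data rate, ignoring the small file header."""
    return sample_rate_hz * (bits_per_sample // 8) * channels
```

The same arithmetic shows where the 4-times figure comes from: 16,000 samples of 2 bytes each is 32,000 bytes per second, while 8,000 samples of 1 byte each is 8,000 bytes per second.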
|
|
|
|
Are there restrictions on the use of this site?
|
|
Yes, both the site itself and any downloaded audio files have restrictions.
The website is for demonstration purposes only, and is not a free service.
AT&T Natural Voices™ is available
commercially at reasonable prices, so please consider
purchasing if these limitations affect you.
Natural Voices™ is available in desktop, server, and SDK editions.
The desktop voices are an inexpensive add-on to several PC packages
which read documents, convert to MP3 files, etc.
1. Randomized Numbers.
Unfortunately, we now find it necessary to randomize incoming digits to
discourage commercial use of the audio. This means that if you enter "1 2 3"
you might hear "9 3 5". Please note that this is an intentional behavior of
the website itself, and not of the synthesis software. If this is a
problem, please see the How to Buy section.
2. Limited Use of Audio Clips.
Audio files produced on our site are intended only for private, non-commercial use.
This is not legal advice (hey, we're researchers) but here are some scenarios.
A class project is probably OK. Bugging your friends is probably OK (with us).
The common thread here is temporary use with a very small distribution.
Any use that involves wide distribution or long lifetime is probably not OK,
whether or not it is commercial. Audio clips used in songs, videos or game levels
cannot be made publicly available on the internet. Building or prototyping
a software package using our audio rather than recording your own prompts is not OK.
If you aren't sure, ask. If you need the software or a license to use
the audio, you can find a link to Wizzard Software in the
How to Buy section of this FAQ.
3. Only Through This Demo Site.
Website resources are limited and are needed to support this site. Direct access
to the CGI scripts is not permitted. You may refer people to this site
but you may not have users enter text on your site and use our servers to provide
audio, even if you give us credit. To do this, you must install your own TTS server
with proper licenses from Wizzard Software.
4. Text Length is Limited.
Length of the input text is limited, typically to 300 characters. Anything longer
is chopped off, and may result in partial words or single letters at the end of
the speech. The length limit is a website feature and helps regulate server load.
If you need to synthesize longer text, the product can handle input of any length.
5. Number of Submissions Per Computer is Limited.
The number of submissions is limited. Access may be temporarily blocked if
there are too many submissions from one computer. This, together with the length limit,
allows more users to try the demo and shares the limited resources more fairly.
Limits on text length and submissions per day will be adjusted as needed to
keep the website functioning normally. The ultimate purpose of this site, after
all, is for as many people as possible to listen to AT&T Natural Voices™
text to speech.
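To make the "Randomized Numbers" restriction concrete, here is a small Python sketch of one way digits could be scrambled before synthesis. This FAQ does not say how the website actually does it, so the function name and the approach (replace each digit with a random one) are assumptions for illustration only.

```python
import random
import re
from typing import Optional


def randomize_digits(text: str, rng: Optional[random.Random] = None) -> str:
    """Replace every ASCII digit with a randomly chosen digit; leave other text alone."""
    rng = rng or random.Random()
    return re.sub(r"[0-9]", lambda _match: str(rng.randrange(10)), text)
```

With this approach "1 2 3" keeps its shape (three single digits) but the values change, and a digit may occasionally map to itself by chance. Words and punctuation pass through untouched.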
|
|
|
|
Can the synthesis be modified?
|
It is possible to change the way the speech sounds by altering the input text.
Liberal use of commas is the easiest way to get better phrasing,
especially in long complex sentences. Overall speed can be controlled
using XML-style tags from the SSML standard, e.g.
<prosody rate="slow"> this is speaking slowly </prosody>.
<prosody rate="fast"> this is speaking fast </prosody>.
<prosody rate="-50%"> this is 50% slower </prosody>.
Precise pauses can also be inserted using the <break/> tag, e.g.
Break for 100 milliseconds <break time="100ms"/> Okay, keep going.
Break for 3 seconds <break time="3s"/> Okay, keep going.
Voices and languages can be intermixed using the <voice> tag, e.g.
<voice name="crystal">Crystal, 1 2 3.</voice>
<voice name="mike">Mike.
<voice name="rosa">Rosa, 1 2 3.</voice>
Back to Mike.</voice>
The Speech Synthesis Markup Language,
or SSML, is defined by the W3C. Note that not all tags are supported. See the documentation
for specific product releases for more details.
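Because the tags above are XML, unbalanced or unescaped markup is an easy mistake to make when typing them by hand. The Python sketch below builds the prosody and break tags with the standard library and checks that a fragment is well-formed. The helper names are hypothetical, and the `<speak>` wrapper is used only so the parser has a single root element; whether a given product release expects that wrapper is not covered here.

```python
import xml.etree.ElementTree as ET


def prosody(text: str, rate: str) -> str:
    """Wrap text in an SSML <prosody> element with the given rate."""
    elem = ET.Element("prosody", {"rate": rate})
    elem.text = text  # special characters such as < and & are escaped on output
    return ET.tostring(elem, encoding="unicode")


def pause(duration: str) -> str:
    """Build an SSML <break/> element, e.g. pause("100ms") or pause("3s")."""
    return ET.tostring(ET.Element("break", {"time": duration}), encoding="unicode")


def is_well_formed(fragment: str) -> bool:
    """Check that a fragment parses as XML once wrapped in a single root."""
    try:
        ET.fromstring(f"<speak>{fragment}</speak>")
        return True
    except ET.ParseError:
        return False
```

For example, `is_well_formed('<voice name="mike">Mike.')` is false because the closing `</voice>` tag is missing, which is the kind of slip that is hard to spot in nested voice tags.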
|
|
|
|
Is this AT&T Natural Voices™?
|
|
Our Research group at AT&T Labs produced AT&T Natural Voices™.
The website demo runs a recent Research version of the synthesizer (and you
may note differences from the released product).
This TTS system was originally known as "Next-Generation TTS" or "Next-Gen"
and some published technical papers refer to it by that name. The "Natural Voices"
name came about when our system was introduced as a commercial product.
|
|
|
|
How can I get AT&T Natural Voices™?
|
|
For sales, licensing and support of released versions of
AT&T Natural Voices™ please contact
Wizzard Software.
Please note that results from the Research version in this demo may differ from the released product.
Natural Voices™ is available in desktop, server, and SDK editions.
The desktop voices are an inexpensive add-on to several PC packages
which read documents, convert to MP3 files, etc.
You can also find more details in the Product Support section below.
|
|
|
|
Who supports AT&T Natural Voices™?
|
|
If you already own AT&T Natural Voices™, your vendor is the first line of support.
They know their own products and common problems involving installation and interaction
with TTS, and so are most likely to be of immediate help.
The next stop is
Wizzard Software.
All questions about sales, licensing, updates, future releases, new languages,
supported platforms, etc. should go to Wizzard.
Our TTS research group at AT&T Labs cannot handle direct customer support. As a small
research group focused on the underlying science and technology, we just don't have sufficient
resources for field support. In particular, we haven't even seen half the applications that use
our TTS and have no idea what their error messages mean. Problems may be escalated to us by vendors,
in which case we can handle any serious issues once per vendor rather than per customer.
This is infrequent enough that we can continue our research to improve TTS. This also keeps vendors
in the loop and ready to handle the next customer.
|
|
|
|
Are there updates for AT&T Natural Voices™?
|
|
All questions about sales, licensing, updates, future releases, new languages,
supported platforms, etc. should be directed to
Wizzard Software.
See Product Support for further details.
|
|
|
|
Who works on TTS?
|
|
AT&T has a very long history in speech synthesis, beginning in Bell Labs and continuing
in the newly formed AT&T Labs following the Lucent spin-off in 1996. Initially at Murray Hill,
we are now located in Florham Park, NJ.
Though our research
team is relatively small, we have leveraged available resources to create a ground-breaking
TTS system in AT&T Natural Voices™. And although TTS has advanced considerably
in the past few years there is still much room for improvement, and research continues.
You can learn more about our work on the Publications page.
|
|
|
|
How can I contact the research group?
|
|
You can send questions, suggestions and problems to the research group at
tts-feedback.
We cannot promise to respond to every email
-- there are just too many --
but we do our best.
Please contact
Wizzard Software
with questions about sales and licensing, about what language will be released when,
about support for your favorite hardware platform, etc.
If you ask us, we'll just refer you to Wizzard anyway.
|
|
|
|
What is Text-To-Speech?
|
|
Text-To-Speech, or TTS for short, is computer software that converts text into audible
speech. You can try it yourself on our demo page. See our
Home page for more information.
TTS is separate from speech recognition. You can think of TTS as "talking" and speech
recognition as "listening". There is some shared technology, but neither is just the
reverse of the other. And the talking/listening analogy is limited too. Neither
technology really involves much language understanding.
TTS is also distinct from language translation, though voice to voice
translation would employ both speech recognition and TTS. Again, translation requires
significant understanding of the meaning.
People new to the idea of TTS often underestimate the difficulty of the task. After all,
humans can typically learn this stuff in early childhood. They talk, listen, understand, and even
translate without much apparent effort. Humans do all this work without even being
aware of it in most cases, but that doesn't make it easy.
If programmers could create software that really understands human language we could avoid most
of the guesswork in TTS, but that hasn't happened yet. Until then, TTS is more like
learning to read a foreign language aloud without ever understanding the words.
With a good dictionary, grammar rules, etc. you can get better and better but will still make
mistakes occasionally that are obvious to native speakers.
|
|
|
|
How does TTS work?
|
|
TTS is often described as two conceptual stages. In the first stage, it decides how
the text should be spoken, that is, how each word should be pronounced, what
length and pitch each phoneme should have, etc. In the second stage, the system does
its best to create audio that matches the specifications produced by stage one.
TTS software has little or no understanding of the text being read. It uses
rules, lists, dictionaries, etc. to make very sophisticated guesses about how a piece
of text should be read. While general performance can be quite good, some decisions
are intrinsically hard to make without some level of understanding.
For example, the word "bass" is pronounced differently in the phrases "bass drum" and "bass boat".
Intonation depends in many cases on the writer's intention, which often cannot be inferred in
short texts even by human readers. As a result, TTS systems will occasionally make mistakes
and can be fooled by carefully constructed texts.
These are challenging problems for all TTS systems, and we continue to improve ours as we are able.
The type of TTS we do is called a "concatenative" system, meaning that we record a human
speaker to make a voice database. We re-use small chunks of the recordings to create
new sentences containing words that were never recorded. Further, we do "unit selection"
synthesis. This means that we use large voice databases and do clever searches on-the-fly
to find chunks in the voice database that best match the requested sentences.
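The search described above is commonly framed as finding the cheapest path through a lattice of candidate chunks, balancing how well each chunk matches the request (a "target cost") against how smoothly neighboring chunks join (a "join cost"). The Python sketch below is a toy, textbook-style version of that idea using a Viterbi search. It is not the Natural Voices implementation; the `Unit` fields, the cost functions, and the penalty of 10.0 are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    phone: str    # phoneme label of the recorded chunk
    pitch: float  # average pitch in Hz (toy feature)
    index: int    # position of the chunk in the original recording


def target_cost(unit: Unit, wanted_pitch: float) -> float:
    """How far the unit is from the requested pitch."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left: Unit, right: Unit) -> float:
    """Chunks that were adjacent in the recording join for free."""
    if right.index == left.index + 1:
        return 0.0
    return 10.0 + abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Viterbi search for the cheapest unit sequence.

    targets: list of (phone, wanted_pitch); database: list of Unit.
    Returns (chosen units, total cost).
    """
    if not targets:
        return [], 0.0
    candidates = [[u for u in database if u.phone == phone] for phone, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for some phone")
    # best[i][j] = (cheapest total cost ending in candidate j, back-pointer)
    best = [[(target_cost(u, targets[0][1]), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(u, targets[i][1]), back))
        best.append(row)
    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path)), total
```

In this toy model a run of chunks that were recorded together tends to win even when their pitch is slightly off, because they join at no cost. That mirrors the intuition that longer stretches of original recording usually sound more natural than many small pieces.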
|
|
|
|
Who uses Text-To-Speech?
|
|
TTS is used in a wide variety of services and applications. Commercially, help desks
and voice response systems are probably the most important. On nights and weekends, when
people are scarce, customers can still get some basic information from computers, for
example, an account balance. The computer listens either to speech or touch tones and
responds with TTS. These apps are typically over the telephone but might also be at
kiosks or automated tellers.
TTS is also used on personal devices, e.g. on a PC to proof-read a document or to learn
a new language. This category also includes assistive technologies such as screen-readers
for the visually impaired and as a substitute voice for those who cannot speak.
As the quality of synthetic voices continues to improve, barriers to new applications
drop. Some applications, to guarantee high quality, record all the things that need to
be said. This can be expensive, impractical, or impossible depending on the task.
TTS is often a better option if the voices are good enough.
|
|
|
|
Can I use my own voice?
|
|
Not in AT&T Natural Voices™.
The reasons for wanting customized voices are varied. Some people just think it
would be cool. Some are losing their voices due to a medical condition or upcoming surgery
and would like to have their own synthetic voice rather than a generic one. Some people have
audio tapes of a late loved one.
(See the reference to ModelTalker
in the section on Assistive Technologies
below that may be useful for people soon to lose their voices.)
Creating high-quality voices
requires a good voice talent, a sound-proof room, professional audio equipment,
hours of written material with thorough coverage of phoneme combinations in the
language, and the time and expertise to turn those recordings into a decent synthetic
voice. Because of the expense involved, custom voice builds are usually done for
corporations that want to computerize an existing actor's voice, for example to
continue a brand image.
Since even professional actors reading well-chosen material don't always synthesize well,
another possibility is to get the highest quality recordings possible, and as much of
it as possible. Keep the recordings in a safe place until the technology improves for
transforming one voice to sound like another.
It may take far less material to build a transformation model than it does to
build a TTS voice from scratch. Eventually it may be possible to take a good TTS voice that
is roughly similar (e.g. mid-pitch-range male, same accent) and morph it to sound like the
desired person.
|
|
|
|
What about assistive technologies?
|
TTS systems have a long history in assistive technologies.
First, there are two basic classes of TTS software:
- General-purpose TTS software on a desktop or laptop PC, and
- Specialized packages for various disabilities.
Small, fast general-purpose TTS systems are available, and many come with simple
applications which allow text typed on the screen to be read.
There are multiple systems on the market (ours included), varying widely in voice
quality, hardware requirements, and price.
Specialized packages vary greatly in price and capability, and are sometimes customized
for particular users. These products are needed when general-purpose
hardware and software do not suit the user's needs.
Disability-related systems can be grouped into two basic types:
- User is the listener.
- User types, others listen.
Screen readers for the sight-impaired are an example of the first case.
The user listens to the voice long enough to adapt and reach high
intelligibility even for low quality voices. A convenient user
interface with good responsiveness is perhaps more important than
voice quality. The ability to vary the speaking rate, and particularly
to choose extremely fast speaking rates, is crucial.
Stephen Hawking is the classic example of the second case, typing on a PC
so that other people can listen. Here, the listeners come and go
frequently and so most have less opportunity to adapt to low-quality voices.
Fast speaking rates are not needed. If keyboard input is difficult
or impossible, touch screens or physical buttons are customized to
represent words or phrases. Buttons may be selected in a variety of ways,
depending on the user's abilities.
These systems range from a budget PC with inexpensive TTS software
to highly customized button boards controlled with exotic control
hardware.
Here are some links that may help. I'm sure that many other
relevant sites can be found by searching on the web.
-
A "beta" (free preview version) of a program called
ModelTalker is
available now (early 2008). This package allows a person
(e.g. someone soon to lose their voice) to record themselves
and create a synthetic voice that sounds more or less similar.
The examples from the website sound pretty good and there are
some good researchers behind it. We don't know what the
commercial version will cost or when that will come out,
but this is the first such program that we're aware of. It
may also be available with pre-existing voices.
-
The misc.handicap newsgroup sometimes
has discussions of commercial TTS products. This is especially useful
because you can ask questions of people who face similar challenges.
-
http://www.loc.gov/nls/reference/factsheets/index.html
Info on many specialized solutions.
-
http://www.speech.cs.cmu.edu/comp.speech/Section5/speechlinks.html
Links to many TTS systems, some commercial, some research.
|
|
|
|
How can I learn more about TTS?
|
|
A good starting point is "An Introduction to Text-to-Speech Synthesis"
by Thierry Dutoit, published by Kluwer Academic Publishers. For hands-on
experience, take a look at free TTS software called Festival from the
University of Edinburgh in Scotland. Downloads are available, and it
includes instructions for creating new voices and languages. Many
universities around the world use this software as a framework for their
own research. There are also some technical papers on our website that
might serve as entry points into the research literature, especially if
you follow up on the papers they reference.
|
|