Researchers
at Microsoft have made software that can learn the sound of your voice,
and then use it to speak a language that you don't. The system could be
used to make language tutoring software more personal, or to make tools
for travelers.
In a demonstration at Microsoft's Redmond, Washington, campus on Tuesday, Microsoft research scientist Frank Soong showed
how his software could read out text in Spanish using the voice of his
boss, Rick Rashid, who leads Microsoft's research efforts. In a second
demonstration, Soong used his software to grant Craig Mundie,
Microsoft's chief research and strategy officer, the ability to speak
Mandarin.
In
English, a synthetic version of Mundie's voice welcomed the audience to
an open day held by Microsoft Research, concluding, "With the help of
this system, now I can speak Mandarin." The phrase was repeated in
Mandarin Chinese, in what was still recognizably Mundie's voice.
"We will be able
to do quite a few scenario applications," said Soong, who created the
system with colleagues at Microsoft Research Asia, the company's
second-largest research lab, in Beijing, China.
"For a
monolingual speaker traveling in a foreign country, we'll do speech
recognition followed by translation, followed by the final text to
speech output [in] a different language, but still in his own voice,"
said Soong.
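The three stages Soong describes (speech recognition, then translation, then text-to-speech in the speaker's own voice) can be pictured as a simple chain. The Python sketch below is only an illustration of that ordering, not Microsoft's code: the class name, the function names, and the toy stand-ins for each stage are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToSpeechPipeline:
    """Chains the three stages named in the article (hypothetical interface)."""
    recognize: Callable[[bytes], str]    # audio -> text in the speaker's language
    translate: Callable[[str], str]      # source text -> target-language text
    synthesize: Callable[[str], bytes]   # target text -> audio in the speaker's voice

    def run(self, audio: bytes) -> bytes:
        text = self.recognize(audio)
        translated = self.translate(text)
        return self.synthesize(translated)


# Toy stand-ins so the sketch runs; real stages would be trained models.
def fake_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")


PHRASEBOOK = {"hello": "hola"}  # made-up one-entry dictionary


def fake_translate(text: str) -> str:
    return PHRASEBOOK.get(text.lower(), text)


def fake_synthesize(text: str) -> bytes:
    # Tags the text to stand in for audio rendered in the speaker's voice.
    return f"<speaker-voice>{text}</speaker-voice>".encode("utf-8")


pipeline = SpeechToSpeechPipeline(fake_recognize, fake_translate, fake_synthesize)
output = pipeline.run(b"hello")
```

Keeping each stage behind its own function makes the point of the quote visible: only the last stage needs to know anything about the speaker's voice.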
The new technique
could also be used to help students learn a language, said Soong.
Providing sample foreign phrases in a person's own voice could be
encouraging, or easier to imitate. Soong also showed how his new system
could improve a navigational directions phone app, allowing a stock
synthetic English voice to seamlessly read out text written on Chinese
road signs as it relayed instructions for a route in Beijing.
The system needs
around an hour of training to develop a model able to read out any text
in a person's own voice. That model is converted into one able to read
out text in another language by comparing it with a stock text-to-speech
model for the target language. Individual sounds used by the first
model to build up words using a person's voice in his or her own
language are carefully tweaked to give the new text-to-speech model a
full ability to sound out phrases in the second language.
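One way to picture the comparison step is as a matching problem: for each sound the stock target-language model uses, find the closest sound in the speaker's own model. The Python sketch below is a toy illustration under that assumption only. The two-number "feature vectors," the sound labels, and the nearest-neighbor rule are all invented for the example and are not a description of how Microsoft's system actually works.

```python
import math


def nearest_speaker_sound(target_vec, speaker_sounds):
    """Return the name of the speaker's sound closest to a target-language sound."""
    return min(
        speaker_sounds,
        key=lambda name: math.dist(speaker_sounds[name], target_vec),
    )


def build_mapping(stock_target_sounds, speaker_sounds):
    """Map every sound in the stock target-language model to a speaker sound."""
    return {
        target_name: nearest_speaker_sound(vec, speaker_sounds)
        for target_name, vec in stock_target_sounds.items()
    }


# Made-up 2-D features; real systems use far richer acoustic descriptions.
speaker_sounds = {"a": (1.0, 0.0), "i": (0.0, 1.0), "u": (-1.0, 0.0)}
stock_target_sounds = {"ah": (0.9, 0.1), "ee": (0.1, 0.8)}

mapping = build_mapping(stock_target_sounds, speaker_sounds)
```

In this toy setup, `mapping` pairs the stock sound "ah" with the speaker's "a" and "ee" with the speaker's "i". The article's "carefully tweaked" suggests the real system goes further and adjusts sounds instead of merely reusing the closest one.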
Soong says that this approach can convert between any pair of 26 languages, including Mandarin Chinese, Spanish, and Italian.
Preserving a
person's voice when synthesizing speech for them in another language
would likely be reassuring to a user, and could make interactions
reliant on translation software more meaningful, says Shrikanth Narayanan,
a professor at the University of Southern California, in Los Angeles,
who leads a research group working on systems to translate speech in
situations such as doctor-patient consultations.
"The word is just
one part of what a person is saying," he says, and to truly convey all
the information in a person's speech, translation systems will need to
be able to preserve voices and much more. "Preserving voice, preserving
intonation, those things matter, and this project clearly knows that,"
says Narayanan. "Our systems need to capture the expression a person is
trying to convey, who they are, and how they're saying it."
His research
group is investigating how features such as emphasis, intonation, and
the way people use pauses or hesitation affect the effectiveness and
perceived quality of a word-for-word translation. "We're asking if you
can build systems that can mediate between people as well as just
replacing the words," he says. "I view this [Microsoft research] as a
part of how you make this happen."