Apple’s latest post on its fledgling Machine Learning Journal offers some interesting insight into how the company is using artificial intelligence not only to improve Siri’s speech recognition, but also to make the personal assistant’s voice sound more natural, smoother, and more full of personality. The article provides technical details about the deep learning technology used to improve Siri behind the scenes: it describes the basics of speech synthesis, contrasts the differing approaches used to produce digitized and sampled speech, and explains how the previously lower-quality “parametric” approach to speech synthesis is being improved through deep learning.
Interestingly, Apple also reveals in the post that it is using a new female voice talent for iOS 11, “with the goal of improving the naturalness, personality, and expressivity of Siri’s voice,” chosen after an evaluation of hundreds of candidates. The article goes on to explain how engineers recorded over 20 hours of speech in a professional studio to build a new text-to-speech voice with the company’s latest deep learning technology, using a script that included audio-book passages, navigation instructions, prompted answers, and witty jokes. The result is a U.S. English Siri voice that sounds significantly better than the one in prior iOS versions. At the bottom of the article, Apple provides a series of sample audio files comparing Siri in iOS 9, iOS 10, and iOS 11, along with several cross-references to academic research papers the company has published detailing its efforts in speech synthesis and deep learning.