In a new post in Apple’s Machine Learning Journal, the company explains how personalization works behind the “Hey Siri” voice activation feature to reduce the number of false positives. The post builds on an earlier entry that described the general technical approach and implementation details of the “Hey Siri” detector and the broader, speaker-independent “key-phrase detection” problem. Taking that entry as its foundation, the latest paper focuses on the machine learning technologies that Apple has used to develop a rudimentary speaker recognition system, one intended to reduce the number of false positives triggered by other people in the vicinity saying phrases that sound similar to “Hey Siri.”
Apple introduced “Hey Siri” with the debut of the iPhone 6 in 2014, although the feature originally required the iPhone to be connected to a power source; it wasn’t until the debut of the iPhone 6s a year later that “always-on Hey Siri” became available, courtesy of a new lower-power coprocessor that could offer continuous listening without significant battery drain. At the same time, iOS 9 further improved the feature by adding a new “training mode” that helps personalize Siri to the voice of the specific iPhone user during initial setup.
The paper goes on to explain that the phrase “Hey Siri” was originally chosen to be as natural as possible, adding that even before the feature was introduced, Apple found many users were naturally beginning their Siri requests with “Hey Siri” after using the home button to activate it. However, the “brevity and ease of articulation” of the phrase is a double-edged sword, since it also has the potential to result in many more false positives; as Apple explains, early experiments showed an unacceptably high number of unintended activations that was disproportionate to the “reasonable rate” of correct invocations.
Apple’s goal has therefore been to use machine learning to reduce the number of “False Accepts,” so that Siri only wakes up when the primary user says “Hey Siri,” and in particular to avoid situations where a third party in the room says something that is misinterpreted as a call for Siri.
Apple adds that “the overall goal” of speaker recognition technology is to determine the identity of a person by voice, suggesting longer-term plans that may offer additional personalization and even authentication, particularly in light of multi-user devices such as Apple’s HomePod. The goal is to determine “who is speaking” rather than simply what is being spoken, and the paper goes on to explain the difference between “text-dependent” speaker recognition, where identification is based on a known phrase (like “Hey Siri”), and the more challenging task of “text-independent” speaker recognition, which involves identifying a user regardless of what they happen to be saying.
Perhaps most interestingly, the journal explains how Siri continues to “implicitly” train itself to identify a user’s voice, even after the explicit enrollment process (asking the user to say five different “Hey Siri” phrases during initial setup) has been completed. The implicit process continues to train Siri after the initial setup by analyzing additional “Hey Siri” requests and adding them to the user’s profile until a total of 40 samples (known as “speaker vectors”) have been stored, including the original five from the explicit training process.
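The enrollment flow described above can be sketched roughly as follows. This is a minimal illustration, not Apple’s implementation: the `SpeakerProfile` class, its method names, and the use of plain lists of floats as speaker vectors are all assumptions made for the example. Only the two figures, five explicit samples and a cap of 40 stored vectors, come from the paper as reported here.

```python
from dataclasses import dataclass, field

EXPLICIT_SAMPLES = 5   # phrases spoken during guided setup (figure from the article)
MAX_VECTORS = 40       # profile cap, including the explicit samples (figure from the article)


@dataclass
class SpeakerProfile:
    """Hypothetical container for one user's stored speaker vectors."""
    vectors: list = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.vectors) >= MAX_VECTORS

    def enroll_explicit(self, samples) -> None:
        """Store the vectors captured during the guided setup step."""
        if len(samples) != EXPLICIT_SAMPLES:
            raise ValueError(f"expected {EXPLICIT_SAMPLES} samples, got {len(samples)}")
        self.vectors = [list(s) for s in samples]

    def enroll_implicit(self, vector) -> bool:
        """Add a vector from a later accepted request.

        Returns False (and stores nothing) once the cap has been reached.
        """
        if self.is_full():
            return False
        self.vectors.append(list(vector))
        return True


# Usage: five explicit samples, then 50 later requests offered to the profile.
profile = SpeakerProfile()
profile.enroll_explicit([[0.1 * i, 1.0] for i in range(5)])
added = sum(profile.enroll_implicit([0.5, 0.5]) for _ in range(50))
print(len(profile.vectors), added)  # 40 35
```

In this sketch the profile simply stops growing at the cap; whether the real system replaces older vectors after that point is not stated in the article.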
This collection of speaker vectors is then compared against future “Hey Siri” requests to determine their validity. Apple also notes that the “Hey Siri” portion of each utterance waveform is stored locally on the iPhone so that user profiles can be rebuilt from those stored waveforms whenever improved transforms are incorporated into iOS updates. The paper also posits a future where no explicit enrollment step will be required, and users can just begin using the “Hey Siri” feature from an empty profile that will grow and update organically. At present, however, it seems that the explicit training is necessary to provide a baseline that ensures the accuracy of later implicit training.
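One way to picture the comparison step is as a similarity score between the speaker vector of an incoming request and the vectors already in the profile. The sketch below assumes cosine similarity averaged over the profile and an arbitrary threshold of 0.7; the article does not specify the scoring function or any threshold, so both are illustrative. `extract_vector` is a hypothetical stand-in for whatever transform turns a stored waveform into a speaker vector, and `rebuild_profile` shows why keeping the waveforms lets a profile be regenerated when that transform improves.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def profile_score(profile: Sequence[Vector], candidate: Vector) -> float:
    """Average similarity of the candidate to every stored vector (assumed rule)."""
    return sum(cosine(v, candidate) for v in profile) / len(profile)


def accept(profile: Sequence[Vector], candidate: Vector, threshold: float = 0.7) -> bool:
    """Accept the request when the score clears the (illustrative) threshold."""
    return profile_score(profile, candidate) >= threshold


def rebuild_profile(waveforms: Sequence[Sequence[float]],
                    extract_vector: Callable[[Sequence[float]], List[float]]) -> List[List[float]]:
    """Recompute every speaker vector from the stored waveforms with a new transform."""
    return [extract_vector(w) for w in waveforms]


# Usage with toy two-dimensional vectors.
profile = [[1.0, 0.0], [0.9, 0.1]]
print(accept(profile, [1.0, 0.05]))  # True: points the same way as the profile
print(accept(profile, [0.0, 1.0]))   # False: nearly orthogonal to the profile
```

Averaging over all stored vectors is only one possible design; comparing against a single mean vector would be cheaper, but the article gives no indication of which approach Apple uses.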
While not surprising considering Apple’s stance on privacy, it’s still worth noting that all of this computation and the storage of the user’s voice profile occur solely on each user’s iPhone, rather than on any of Apple’s servers, suggesting that such profiles are not currently synced between devices in any way.