Voice technology produced its first mainstream sales success with Amazon Echo, a home assistant released in November 2014. The financial services firm Morgan Stanley estimates the device had cumulative sales of over 11 million worldwide by December 2016.
Since then, voice user interfaces have created a lot of buzz in tech circles. Google joined the market in November 2016 with Google Home, which it describes as a hands-free smart speaker that can reply to questions (with the added promise of being able to do more in the future).
Analysts at VoiceLabs, a voice analytics company, predict that ‘between Apple, Samsung and Microsoft, two of the three will ship compelling voice-first devices in 2017’, bringing around 33 million ‘voice-first’ devices into circulation. Apple duly announced the ‘HomePod’ in June, which will be available this December. These home assistants are referred to as ‘voice-first’ devices since their main interface is voice, although they are not limited to it and can incorporate physical and screen interfaces. This surge is due to vast improvements in the machine learning that underpins voice recognition and natural language processing. Error rates in converting audio to text, and then to an understandable command, have fallen as low as 5%. The technological progress is impressive, yet an error in every 20 words spoken still makes voice a frustrating input for many.
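The 5% figure is a word error rate (WER): the number of substituted, deleted and inserted words divided by the length of the reference transcript, which is why 5% works out to roughly one error in every 20 words. A minimal sketch of how WER is typically computed, using word-level Levenshtein edit distance (the function name is illustrative, not any vendor’s API):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)
```

On a 20-word reference with a single misrecognised word, this returns 0.05 — exactly the one-error-in-20 threshold the article describes as still frustrating.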
Home assistants have become the take-off point for voice technology. Prior to this, mobile-phone-based virtual assistants, such as Siri (Apple), Cortana (Microsoft) and Google Assistant/Now (Google), had been the flag bearers. Six months before the release of Google Home, Google’s CEO Sundar Pichai said that 20% of queries made on Android were done by voice via Google Now. These recent gains have led to real excitement around the technology. ‘We believe that the next big platform is voice,’ said Dave Limp, vice-president of devices at Amazon. Can voice technology really become the next UX platform? Can it create a universal interface that could replace smartphone touchscreens for an ever-increasing number of interactions – the vision behind the movie ‘Her’?
Joaquin Phoenix in the movie 'Her', a story about a man who has a voice-based intimate relationship with an operating system.
Of course, voice user interfaces (VUIs) have a lot to offer. Yet they aren’t the silver UX bullet some of their promoters claim. There are occasions when voice technology helps the experience and others when it doesn’t. It is important to think through a product’s use cases in detail to determine how much value voice technology adds. However, it is not a simple choice between a voice interface and a touchscreen; modes can be combined in different ways, as we explore in the table below.
Here are three cases when voice does and doesn’t help.
When Voice makes more sense
Speed and multiple requests
A recent study at Stanford University showed that speech dictation is three times faster than typing in English and 2.8 times faster in Mandarin. Languages that are harder to type could benefit more than others from dictation, as could symbols that are cumbersome to input, such as mathematical notation and musical notes. Similarly, using voice for compound requests involving multiple simple inputs could save time. You could say, ‘Alexa, turn off the music in the kitchen, turn off the lounge lights and load House of Cards season 5 episode 1’ faster than you could laboriously make each query individually.
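The time saving comes from collapsing several round trips into one utterance that the assistant can break apart and dispatch. A minimal sketch of that idea, splitting a compound request on commas and ‘and’ (the wake-word handling and splitting rule are illustrative assumptions, not how any real assistant parses speech):

```python
import re

def split_compound_request(utterance: str) -> list[str]:
    """Split one compound utterance into individual commands, assuming
    sub-commands are joined by commas and/or the word 'and'. Real
    assistants use trained NLU models, not a regex like this."""
    # Strip an optional leading wake word such as "Alexa,"
    utterance = re.sub(r'^\s*alexa[,\s]+', '', utterance, flags=re.IGNORECASE)
    # Split on ", " (optionally followed by "and ") or a bare " and "
    parts = re.split(r',\s*(?:and\s+)?|\s+and\s+', utterance)
    return [p.strip() for p in parts if p.strip()]
```

Applied to the example above, this yields three separate commands (music off, lights off, load the episode), each of which would otherwise have been its own query.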
Convenience and safety
Some situations make inputting through a voice user interface (VUI) and receiving audio output much more practical than typing and looking at a screen. This is particularly useful when hands are required to operate something else safely. It is important to note, however, that although a VUI allows a user to remain hands-free, it doesn’t prevent the user from being distracted. A recent MIT study showed that drivers who use voice commands have a tendency to look at a screen for visual feedback to particular voice commands. This is why Apple’s CarPlay deactivates the in-car screen at times, to avoid exactly this kind of dangerous distraction.
Shared control of devices
Voice interfaces naturally encourage more social interactions. A device with a VUI turns the space around it into an interactive environment, where people can share the interface. Nearly 85% of Amazon Echo owners place their device in a communal area, either a kitchen or lounge, which allows everyone in a home to interact with it. In a similar vein, over 80% of households with an Amazon Echo have two or more people interacting with Alexa. Using a VUI is particularly useful in allowing anyone in a household to control smart devices around the home, such as lights and thermostats, without having to get a phone out every time.
When Voice makes less sense
Talking in public spaces
Whilst Iron Man talks openly to Jarvis, the rest of us are more reticent. That’s not to say this couldn’t change in the future as etiquette for interacting with tech evolves, but talking to a VUI in public is difficult to manage and potentially chaotic. Imagine speaking to your computer in an open-plan office while everyone else around you is doing the same. Privacy also becomes a bigger concern. Just as users avoid talking on their mobile phones in public about sensitive topics such as health problems, so they would be reluctant to do so with a VUI. Additionally, there is the potential for a private message to be read out loud to you.
Discoverability of features

What exactly can a VUI do? It seems obvious, but it is not always clear. Users don’t know what they can and can’t ask for, since VUI devices are not great at communicating the options available. For example, to learn about the breadth of Amazon Echo skills (applications that can be activated by voice), users need to browse Alexa’s skills via the Amazon app or website and enable the ones they want to use. Invariably, users end up enabling more skills than they can recall and tend to regularly use just a handful. According to The 2017 Voice Report by VoiceLabs, this is one of the reasons why only 3% of voice applications retain their users after two weeks.
Complex inputs and outputs
Some tasks require a number of exchanges, particularly when various inputs depend on others; in such cases, users are better off entering information through a visual interface rather than a vocal one. For example, scheduling multiple interdependent meetings is easier with a calendar on a screen than by speaking to a device. Audio output can be equally difficult to digest. Imagine being read a long list of items from which you have to choose one: ‘Which pizza do you want to order: Margherita, Funghi, Capricciosa, Quattro Stagioni, Hawaiian...?’ It’s also more difficult to compare multiple items when they are read out to you.
Also published on Medium.