Designing for voice assistants vs. chatbots

This article is by Masha Guermonprez, CX Lead for Voice Assistant at SPIX industry.

Marshall McLuhan coined the famous phrase “the medium is the message” in 1964, not long before ELIZA, the first-ever chatbot (playing the role of a therapist), appeared. Communication has certainly evolved since then, but the idea still stands: the channel we communicate through shapes how we perceive the information we receive or share.

In the 1960s, the media landscape was changing fast. The telephone was getting more and more popular (and mostly reliable), people were witnessing the arrival of the first satellite TV broadcasts, and radios were in almost every house. Still, people largely communicated in writing or face-to-face.

Fast-forward to the 2020s, and we’re living in a world where we can communicate not only with other human beings without distance stopping us, but also with creations of a digital nature.

(Isaac Asimov would totally enjoy that.)

The chatbot took the world by surprise in the 2010s, though it had a long history before that. Suddenly little chat windows were everywhere on the web, and we found ourselves asking peculiar questions like, “Am I speaking to a human, or... what am I speaking to?”

In the mid-2010s, the voicebot tried to pull off the same quiet coup d’état as its big cousin. And though conversations with voicebots are very different from those with chatbots (the former being still very purpose-driven and not necessarily “natural”), we’re definitely heading somewhere here.

Today we design a lot for both mediums. You can develop an Alexa skill in a few hours without any particular experience, or create a Facebook bot just by following a YouTube tutorial. Online builders for voicebots and chatbots offer appealing visual canvases with intuitive design, helping you quickly lay out great conversation flows. At the same time, that same visual canvas remains whether you’re designing for voice or chat, so we tend to design those experiences similarly.

But one thing to keep in mind:

The experience, though designed similarly, is not the same. Different mediums should convey the information differently.

Be particularly cautious if you already have a chatbot and want to add voice to it, or to build a voice assistant on top of it. The two mediums are not the same, and not only because our brains work differently when listening than when reading.

To understand why we design differently, we have to clearly distinguish the use cases surrounding chatbots and voicebots. Because written conversation is more intimate and more engaging, it’s only natural that “conversational” bots (designed to keep you company, like Mitsuku or Replika) are text-based. “Informational” bots like ChatGPT are also text-based, because they generate large amounts of text in their answers, and that’s not well adapted to voice, either.

Voicebots, on the other hand, tend to be more about “action.” Voice interactions with bots are more command-based, like “turn on the lights” or “set a five-minute timer.” They shine when you can’t type because you’re in the middle of something else or not in front of a screen (e.g. Alexa or Google), when you don’t have an interface at all (e.g. an industrial setting), or when the audio channel is the only one available (e.g. driving). As of today, voice is most useful when the user has a clear, short command to give and expects either an action or a short piece of information in return.

Note: there are certain voice assistants that are all about hospitality and keeping company (think ElliQ), but in practice those are the exception rather than the rule, with a very specific use case.

So, cutting to the point:

What are the unique challenges of voice, and how is it different from chat?

1. Engagement

The way I see it, one of the important distinctions between reading and listening is that reading is an action you take, while listening is something that happens to you. Reading requires engagement: you need to actively process the written information to make progress. A voice assistant makes progress whether or not you’re actively listening. Listening to a voice assistant can be a more passive experience, whereas a chatbot requires more active participation.

2. Cognitive load

Voice interactions don’t last. They are spoken, listened to, and then they disappear; the only places they may remain for a while are someone’s memory and the log file. In contrast, text-based chatbot interactions leave a written record that can be referred to later. This fundamental difference has several important implications, the most significant being that when designing a voice assistant, cognitive load must be taken into consideration and addressed, a point that is far less crucial for text-based chatbot interactions.

3. Timing

Another aspect to consider is the time frame. With a chatbot, you have the flexibility to read, ponder, research, and then respond to the bot. You have the luxury of taking your time. With voice interactions, there is less room for hesitation or long pauses, since you need to respond promptly to maintain the flow of the conversation. In current voice assistants, for example, the user is given a short window of time, usually a few seconds, to respond before the conversation ends. This can create an atmosphere of tension throughout the voice interaction, as the user finds themselves in a pressing “listen, get it, respond” situation.
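To make that concrete, here is a minimal sketch of a response window with a single reprompt. The speak() and listen() helpers and the three-second timeout are assumptions for illustration (stubbed here with print/input so the sketch runs on its own); they stand in for whatever TTS/ASR stack you actually use.

    import time
    from typing import Optional

    def speak(text: str) -> None:
        print(f"ASSISTANT: {text}")  # stand-in for text-to-speech

    def listen(timeout: float) -> Optional[str]:
        # Stand-in for streaming ASR with a response window: a real stack
        # would return None when the window elapses without speech.
        start = time.monotonic()
        reply = input("USER: ")
        return reply if time.monotonic() - start <= timeout else None

    def ask(question: str, timeout_s: float = 3.0) -> Optional[str]:
        """Ask once, reprompt once, then let the session end gracefully."""
        speak(question)
        reply = listen(timeout=timeout_s)
        if reply is None:
            speak("Sorry, are you still there? " + question)
            reply = listen(timeout=timeout_s)
        return reply  # still None -> close the conversation

    if __name__ == "__main__":
        answer = ask("Do you want me to set a five-minute timer?")
        speak("Okay." if answer else "Talk to you later.")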

In addition, in text-based interactions, users can be given links, images, and other multimedia elements to reference while they respond. This is not possible in voice-based interactions (unless a UI is provided), where the only means of communication are the user’s voice and the voice of the assistant.

Cognitively, all these factors alter the way we perceive information through different mediums of communication.

Things to consider when designing for voice assistants

Through smart design choices, we have an opportunity to successfully address those points. There are certain things to take into consideration while designing for voice: how we handle errors, how we present lists and options, how long our messages are, and how informative and simple they are.

Let’s see those in detail.

1. Error handling

Error handling and correction are more intricate and prevalent in voice interactions than in text-based ones. A user may make typing or spelling mistakes in text, but absent typos, the chatbot receives the message exactly as written.

In voice interactions, however, even if the words are pronounced correctly, there is still a possibility of being misunderstood. Some typical voice errors come from the ASR system, like “no match” cases caused by heavy background noise or an accent. Resolving these misunderstandings can be challenging, and it’s definitely something to think through with your voice assistant.

Technique tip: Implicit or explicit confirmation prompts can be used to verify the user’s intent and ensure the system understood the request correctly. This reduces the cognitive load on the user by lowering the risk of errors and misunderstandings.
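As a rough illustration, here is a minimal sketch of confidence-based confirmation. The confirmation_prompt() function and the 0.9/0.6 thresholds are hypothetical; real ASR/NLU stacks expose confidence scores in their own ways, so treat this as a pattern rather than an implementation.

    def confirmation_prompt(slot_value: str, confidence: float) -> str:
        """Pick a confirmation strategy from an ASR/NLU confidence score (0.0-1.0)."""
        if confidence >= 0.9:
            # High confidence: confirm implicitly by echoing what was understood.
            return f"Okay, {slot_value}. One moment."
        elif confidence >= 0.6:
            # Medium confidence: confirm explicitly with a yes/no question.
            return f"Did you say {slot_value}?"
        else:
            # Low confidence: treat it as a no-match and reprompt.
            return "Sorry, I didn't catch that. Could you repeat it?"

    # A flight-booking request heard with middling confidence:
    print(confirmation_prompt("a flight to Lyon", 0.72))
    # -> "Did you say a flight to Lyon?"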

2. Limit or optimize lists

Limiting or optimizing lists reduces the cognitive load on the user by making it easier for them to choose the right option.

In a chatbot, you can easily give a user a list of choices like this:

“Greetings! How can I assist you today? Please choose from the following options:

  1. Book a flight
  2. Make a hotel reservation
  3. Rent a car
  4. Purchase a vacation package
  5. Find travel deals and discounts”

And though I always preach no more than three options in a chatbot, the voice assistant takes it to a very different level. The user needs to be highly focused to process all the different options aurally and respond accordingly. The voice assistant needs to make the choice easier, either by allowing barge-in (so that when the user hears the right option, they can say “Okay, yes, I want this”) or by limiting the options presented at any one time.

Be sure you support different ways to select those options (“rent a car,” “option three,” “this one,” “the first one,” “the last one”). And keep the options as short as possible.

Technique tip: If you have a long list that you need to present by voice, don’t give more than three choices at a time, with the possibility to say “next” to skip to the next selection.
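Here is a minimal sketch of that pagination pattern, reusing the same hypothetical speak()/listen() stand-ins as in the earlier sketch (print/input again, so it runs on its own):

    def speak(text: str) -> None:
        print(f"ASSISTANT: {text}")  # stand-in for text-to-speech

    def listen() -> str:
        return input("USER: ").strip().lower()  # stand-in for speech recognition

    OPTIONS = [
        "book a flight",
        "make a hotel reservation",
        "rent a car",
        "purchase a vacation package",
        "find travel deals and discounts",
    ]

    def present_options(options: list[str], page_size: int = 3) -> str:
        """Offer options a few at a time; saying "next" advances to the next batch."""
        for start in range(0, len(options), page_size):
            batch = options[start:start + page_size]
            speak("You can say: " + ", ".join(batch) + ". Or say 'next' for more.")
            reply = listen()
            if reply != "next":
                # Hand off to intent matching ("rent a car", "the first one", ...).
                return reply
        speak("That was all of them. Which one would you like?")
        return listen()

    if __name__ == "__main__":
        choice = present_options(OPTIONS)
        speak(f"Okay: {choice}.")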

This brings us logically to the next point:

3. Limit the length

With voice, you have to keep it short.

The audio channel is more vulnerable to cognitive overload. Often the user has forgotten the beginning of the conversation by its end. That’s mostly because their attention isn’t solely focused on the virtual assistant: they’re trying to do something else, and that’s the whole point of a voice assistant.

The real pitfall with voice interactions is over-informing the user. Remember that in a voice interaction, the user usually can’t skip the assistant’s messages the way they can in a chatbot. Keep it short and stick to what needs to be said.

Technique tip 1:

One-breath test: speak your message out loud, and if you can’t say it in one breath, it’s too long.
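If you want a crude automated proxy for the one-breath test, you can estimate spoken duration from word count. The ~2.5 words-per-second pace and the 8-second breath budget below are assumptions, not standard constants; tune them to your TTS voice.

    WORDS_PER_SECOND = 2.5    # assumed conversational TTS pace
    ONE_BREATH_SECONDS = 8.0  # assumed comfortable single-breath utterance

    def fits_in_one_breath(message: str) -> bool:
        """Rough check: would this message be speakable in one breath?"""
        estimated_seconds = len(message.split()) / WORDS_PER_SECOND
        return estimated_seconds <= ONE_BREATH_SECONDS

    prompt = ("Greetings! How can I assist you today? Please choose from "
              "booking a flight, making a hotel reservation, renting a car, "
              "purchasing a vacation package, or finding travel deals.")
    print(fits_in_one_breath(prompt))  # False -> time to play Jenga with it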

Technique tip 2:

The Jenga technique, aimed at reducing the volume of messages, is a very useful thing in designing for chat and voice. It consists of taking out of your message, piece by piece, the information that doesn’t directly add any value to your message.

Or

The Jenga technique consists of taking out the information that doesn’t add value to your message.

See what I did there? ;)

On that note:

4. Skip the unnecessary

With voice, we can go even further: not every message has to have actual words. Sometimes sound notifications (earcons) are a more optimized way to respond.

Don’t repeat back a message if the user can clearly tell that something has been done. It’s only common sense that the voicebot doesn’t reply “I have switched off the light” if the user sees that the light is switched off, right? (For inclusivity purposes, we can always add a setting that allows the bot to be more verbal.) Sometimes a little notification sound is enough, and sometimes even that is redundant. Think about which parts of your dialogues can be cut or replaced by “success,” “failure,” or other sounds.
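One way to frame that decision in code: a small response policy that weighs the outcome, whether the user can see the result, and the verbosity setting. Everything here (the flags, the Response enum) is a hypothetical sketch of the rule of thumb above, not a prescribed API.

    from enum import Enum

    class Response(Enum):
        EARCON = "earcon"  # a short success/failure sound
        VERBAL = "verbal"

    def pick_response(action_succeeded: bool, has_visible_feedback: bool,
                      user_prefers_verbal: bool) -> Response:
        """Decide how to confirm a completed action."""
        if user_prefers_verbal:
            return Response.VERBAL  # the inclusivity setting always wins
        if not action_succeeded:
            return Response.VERBAL  # failures deserve an explanation
        if has_visible_feedback:
            return Response.EARCON  # the light is visibly off; a chime is plenty
        return Response.VERBAL      # no other feedback channel, so say it

    # "Switch off the light" succeeded and the user can see the result:
    print(pick_response(True, True, False))  # Response.EARCON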

5. Simplify

Use simple and clear language when designing the voice assistant’s responses. This can help reduce the cognitive load on the user by making it easier for them to understand and process the information.

Play Jenga with your scripts, then take them and play Jenga again. Replace long words with shorter synonyms. At the end, if you haven’t shortened your scripts by 50%, you’re not doing it right.

Make the instructions as clear as possible. Review the context, and read the script OUT LOUD. You will quickly spot the parts of your bot that are too long or not worth mentioning at all. Remember: chatbots are far more about politeness and small talk than voicebots are. Voice equals agility. Skip the “Hello honey, how are you doing today?” part, unless the voice use case is specifically about hospitality. That doesn’t mean you shouldn’t include the occasional joke or easter egg, but trigger them only on a prompt.

We’re still learning how to communicate with digital assistants. The use cases will evolve and the conversations will evolve at the same pace.

One might say that all the steps mentioned above make a bot sound less human and more robotic. I say, let humans be humans and robots be robots. Robotic doesn’t necessarily rule out “user-friendly.” On the contrary, it optimizes our interactions with voice assistants, making them more about purpose and less about noise.

And that’s a wrap.

Editor's note: This story was originally published on Medium. Header image by Volodymyr Hryshchenko.
