Are you familiar with the quote from Stephen Hawking about human speech? “For millions of years mankind lived just like the animals. Then something happened which unleashed the power of our imagination. We learned to talk.” These wise words also serve as the prelude to one of my favorite “Pink Floyd” songs (“Keep Talking”).
The quote couldn’t be more accurate, and we have long dreamed of extending our conversations to machines. I’m probably not surprising you with the news that this day has finally come, or rather that it arrived quite a few months – arguably even years – ago. Yes, we can talk to machines!
The wonderful advances in Artificial Intelligence (AI), and more specifically in Natural Language Understanding (NLU) and Natural Language Processing (NLP), have made the technology mature enough to support this ambition.
My goal in this article is simply to share my disappointment with many chatbot implementations. I’m not expecting a pitch-perfect tone, complex conversations or the ability to accurately discuss the meaning of life. Natural language technology is still in its infancy, and we need to learn to crawl before we start walking (not to mention running). The focus of this article is a very small frustration: the time delay between sending a question and receiving a response.
You are more than welcome to challenge my authority on this subject, and feel free to completely disagree. I’ve gained this insight by designing, building and maintaining a chatbot for youngsters: an implementation that has received over 150,000 questions to date, with conversations lasting up to 20 minutes.
The evolution to the voice interface, and beyond…
In the last 30 years, the interface of technology has made tremendous progress. I can still remember the Command Line Interface (CLI) on the ZX Spectrum, the Commodore 64 and later the 286, 386 and 486 personal computers. The CLI was the primary means of interaction with most computer systems since the mid-1960s, and continued to be used throughout the 1970s and 1980s. As a user, you were forced to issue commands in the form of successive lines of text (command lines) to launch programs.
The Graphical User Interface (GUI) introduced visual indicators and graphical icons to interact with computers. The GUI addressed the steep learning curve of the CLI, which required commands to be memorized and typed on the computer keyboard. I was personally introduced to the GUI by Microsoft’s Windows 3.x back in the early 1990s. I’m going to skip the debate over who invented the GUI, and simply state that neither Apple nor Microsoft did. As a user, it was now possible to manipulate graphical elements to launch and operate programs.
The Natural User Interface (NUI) is the latest evolution of the interface. The technology has been around for quite some time, but only really hit the mainstream when we started to use the touchscreen on our smartphones. As a user, it has become very natural to manipulate our electronic devices using single- or multi-touch gestures.
The ability “to touch” is really only the first (of four) natural interfaces. The evolution of the NUI will be as follows: “to touch”, “to speak”, “to see” and finally “to think”. We will skip the eye-tracking (“to see”) and brain (“to think”) interfaces for now, although you can respectively investigate the companies Eyefluence and Neuralink to see that the groundwork is in progress.
In the evolution of the NUI, the ability “to speak” highlights a first glimpse of the upcoming innovation cycle; a transition from a Mobile-First to an AI-First world. I personally believe that in five years, any technology will be useless if you can’t have a conversation with it.
The latency requirement for chat conversations with machines…
In order to truly speak to electronic devices, we are faced with two distinct challenges: the ability to understand language, and the capability to translate speech into text and vice versa.
This article is focused on a particular issue with “chatbots”: computer programs that can conduct a conversation using natural language. This technology uses a textual or “chat” interface for a dialog between man and machine.
A chatbot is typically exposed via a “conversational interface”. Implementations vary from integrations in websites and mobile apps to popular messaging platforms like Messenger, Telegram and soon probably even WhatsApp.
The most important business benefits are an improved – more natural or more familiar – user experience, and an increase in productivity. The (early) top use cases include customer advice and recommendations, customer service and support, or even conversational commerce.
In chatbots, the technical term “latency” refers to the time delay between receiving a question and providing an answer. You have probably experienced chatbots where you ask a simple question, and within a few milliseconds you receive a multi-sentence reply. This feels like receiving an instant out-of-office message. The reply arrives too quickly, and this makes your chat feel too much like you’re talking to a machine.
It’s important that a chatbot implementation respects a typing delay when talking to a human being. The average person types between 38 and 40 words per minute, or between 190 and 200 characters per minute. This slight delay makes (or breaks) the correct experience when talking to a chatbot.
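The typing-speed figures above translate directly into a per-character delay. A minimal sketch of that conversion (the function name is my own):

```python
# Convert a human typing speed into a per-character delay.
# The article cites 38-40 words per minute, i.e. roughly
# 190-200 characters per minute for an average typist.
def seconds_per_char(chars_per_minute: float) -> float:
    return 60.0 / chars_per_minute

print(round(seconds_per_char(200), 2))  # 0.3 seconds per character
```

At 200 characters per minute, each character "costs" about 0,3 seconds, which is the figure used in the worked example later in this article.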
In addition, in a chat interface we typically answer one sentence at a time. The goal should not be to fool your users (and make them believe they are talking to a human being), but to respect the “expected” experience when using a chat interface.
In our implementation we do the following:
- We receive a question.
- Tell me a joke!
- We mark a first time-stamp (for example “07:08:29,00”) and immediately display a typing indicator.
- The question is now being analyzed; more precisely we first correct any spelling mistakes (using Microsoft Azure Bing Spell Check API), and then send the corrected question off to our Natural Language Processing engine (using IBM Watson Conversation).
- We capture a second time-stamp (for example “07:08:30,10”) as soon as we receive the answer.
- A robot walks into a bar. “What can I get you?” the bartender asks. “I need something to loosen up,” the robot replies. So the bartender serves him a screwdriver.
- As this reply consists of multiple phrases, we now chop the reply into distinct sentences.
- A robot walks into a bar.
- “What can I get you?” the bartender asks.
- “I need something to loosen up,” the robot replies.
- So the bartender serves him a screwdriver.
- The first sentence of the reply, “A robot walks into a bar.”, is 25 characters long. In the case of a human, it would take 7,5 seconds (25 characters * 0,3 seconds per character) to type this reply. This is obviously a bit slow. We allow our bot to type at 0,1 seconds per character, or 2,5 seconds for this first sentence. As this is our first sentence, we still have to subtract the delay incurred while receiving the reply: 1,1 seconds (07:08:30,10 – 07:08:29,00).
- We send the first sentence after a further 1,4 seconds (the 1,1 seconds of processing time plus 1,4 seconds of simulated typing add up to the full 2,5 seconds), and then show the typing indicator again.
- We delay the reply of the second sentence by 4,1 seconds (41 characters * 0,1 seconds), and then show the typing indicator again.
Addendum: If a reply sentence is longer than 50 characters, we have learned to limit the maximum delay to 5 seconds. It’s our intention to make our user feel as if the machine is typing an answer, and not the goal to make this a waiting game.
We repeat this last step for each remaining sentence. After sending the last sentence, we obviously no longer display the typing indicator.
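The steps above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the function names are my own, the sentence splitter is deliberately naive (a real implementation would use a proper sentence tokenizer), and the spell-check and NLP calls are abstracted away into a single `processing_seconds` measurement.

```python
import re
import time

# Tunables from the article: the bot "types" at 0,1 s per character,
# capped at 5 s so long sentences don't become a waiting game.
SECONDS_PER_CHAR = 0.1
MAX_DELAY = 5.0


def split_sentences(reply: str) -> list:
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", reply) if s.strip()]


def typing_delay(sentence: str) -> float:
    # Simulated typing time for one sentence, with the 5-second cap.
    return min(len(sentence) * SECONDS_PER_CHAR, MAX_DELAY)


def send_reply(reply, processing_seconds, send=print, sleep=time.sleep,
               show_typing=lambda: None):
    """Send a multi-sentence reply one sentence at a time.

    processing_seconds is the time already spent analyzing the question
    (the gap between the two time-stamps); it is subtracted from the
    first sentence's delay so the user isn't penalized twice.
    """
    for i, sentence in enumerate(split_sentences(reply)):
        delay = typing_delay(sentence)
        if i == 0:
            delay = max(delay - processing_seconds, 0.0)
        show_typing()
        sleep(delay)
        send(sentence)
```

The `send`, `sleep` and `show_typing` hooks are injected so the same logic can drive any messaging platform (or a unit test); with the example numbers from the article, a 25-character first sentence and 1,1 seconds of processing yield a 1,4-second first delay.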
Enjoy chatbots in the slow lane…
This will probably be one of the only projects without the requirement for a quick and snappy service. In technical terms, you are making this a “high latency” implementation on purpose. You should however notice that this trick makes a big difference; it provides your users with a more natural experience when chatting with your chatbot.
I mentioned earlier that this issue is focused on solutions that use a textual or “chat” interface. A virtual assistant that uses an auditory interface – like Amazon Echo with Alexa, Google Home or Siri – doesn’t have this issue. In speech, the spoken words themselves naturally provide the delay.
As an afterthought, the “Keep Talking” song (by “Pink Floyd”) only partially quotes Stephen Hawking. The words of the brilliant English theoretical physicist and cosmologist continue: “Speech has allowed the communication of ideas, enabling human beings to work together to build the impossible. Mankind’s greatest achievements have come about by talking, and its greatest failures by not talking.”
And that’s exactly why I’m writing these words, and sharing this experience with you.