Mathew Sanders /

Hello, computer: conversational interfaces

I wanted to write about the idea of designing technology where the interface is based around a conversational experience.

I thought about it some more, and I realized that we’ve always spoken to machines, just most of the time it doesn’t seem like they’re listening.

And the humble command line, the first modern interface to digital computers, is also, in its own way, a conversational interface.

Computers are actually excellent at following instructions, and if asked, can do a task millions of times in succession without error.

We just have to be careful, when we ask computers to do something, that our instructions are clear about what they’re asking.

> Me: Put the box on the table in the kitchen.
> Computer: ☠ ☠ ☠

Human languages are actually pretty easy to misinterpret, so we create special languages for computers that allow us to be very specific or strict about what we’re instructing them to do.

These languages look strange to people who aren’t familiar with them (and often even to people who are), so when we write in these languages we call the result code.

Even with these strict languages, and our best intentions, we still sometimes end up instructing machines to do something weird.

To bypass this mess, geniuses at Xerox PARC pioneered the approach of direct manipulation.

Instead of needing to know the correct set of instructions to get a computer to do something, computers presented content as virtual objects that borrowed metaphors like the desktop, trash can, and file folder from the physical world, and used a pointing device allowing people to interact with those objects directly.

These metaphors created a bridge between the physical and digital worlds, allowing computers to move from an academic tool to a consumer product.

In terms of the metaphors we use in graphical user interfaces, not a lot has changed. A PARC researcher from the 1970s —while likely astounded by advances in capacity, miniaturization, and speed— would easily recognize any modern operating system.

The most interesting advances in this model have been in making the input methods in direct interaction more [ahem] direct.

In 2006 Jeff Han presented experiments demonstrating how multi-touch input can get us closer to our digital objects.

One year later, Apple launched the iPhone, one of the first consumer devices with multi-touch input.

A decade later and Apple have sold a total of 1 billion iOS devices, and sales of Android smartphones have reached 1 billion per year.

The combination of an easily understood interaction metaphor and intuitive touch-based input has made digital technology accessible to the very young, and also to older populations1.

While the manipulation of virtual objects can lead to interfaces with strong affordances, those interfaces also face the same issues as interfaces in the physical world.

Complex models require complex modes of manipulation, some of which may be poorly suited to direct manipulation.

And while space on a digital interface is technically unlimited, our cognitive capacity is not.

So for practical purposes, digital interfaces are still constrained by physical space with compromises being made between efficiency, usability, and cognitive load of the user.

Even in the paradigm of direct manipulation, it still makes sense to make use of a conversational approach. It’s no mistake that there is an entire class of digital object that we call a dialog2.

Yes, it’s a stretch to call interaction with a dialog a conversation, because you’re limited to a specific set of responses, but by requesting input framed as questions or prompts, you’re being engaged by a narrative device that places a machine as your partner in conversation.

Quartz takes this approach to the extreme in their iPhone app where the experience is based around a controlled conversation within the app.

Like a dialog, your responses are limited to only a couple of choices. But from the presentation of content, which clearly replicates a messaging app, to the casual tone of voice and size of the content itself, it’s easy to forget that this content isn’t directed to you individually, but actually curated for a mass audience.

This approach may feel constrained from the perspective of discovery and exploration of new content. It lacks the ability to search, and forgoes a navigation that guides people through some hierarchy of categories.

Search & browse are both features that are expected by default for a digital product serving journalism, but are also features that I expect are greatly underused by the majority of readers3.

Perhaps Quartz’s approach of a conversational interface will better match the way that people are already consuming journalism, and prove to be more successful than the traditional approach in getting people to read that second and third story.

Amazon’s Echo is unique as a consumer product in that it uses spoken conversation not as an ancillary mode of interaction, but as the primary interface.

Similar to Quartz’s iPhone app, or any dialog window, there are limits to what you can ask the Echo to do, but thanks to the combinatorial power of grammar, a small set of requests that can be applied to a large library of indexed data (locations, music, time, people, and facts) gives the illusion of a fairly convincing digital agent.
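To make the combinatorial point concrete, here is a minimal sketch of a few request templates applied to a small library of data. The templates, slot names, and library entries are all invented for illustration; this is not how the Echo is actually implemented.

```python
from itertools import product

# Hypothetical request templates, invented for illustration.
TEMPLATES = {
    "play": "play {music}",
    "weather": "what is the weather in {location}",
    "time": "what time is it in {location}",
    "call": "call {person}",
}

# Hypothetical library of indexed data that the templates can draw on.
LIBRARY = {
    "music": ["WNYC", "jazz", "The Beatles"],
    "location": ["New York", "Auckland", "London", "Tokyo"],
    "person": ["Sam", "Alex"],
}


def expand(template: str) -> list[str]:
    """Fill every slot in a template with every matching library entry."""
    slots = [name for name in LIBRARY if "{" + name + "}" in template]
    combos = product(*(LIBRARY[slot] for slot in slots))
    return [template.format(**dict(zip(slots, combo))) for combo in combos]


def all_requests() -> list[str]:
    """Every request the agent can understand, across all templates."""
    return [req for tmpl in TEMPLATES.values() for req in expand(tmpl)]
```

With these made-up values, four templates and nine library entries already yield 13 distinct requests (`len(all_requests())`), and every new station, city, or contact added to the library extends what can be asked without adding any new grammar.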

Each request has to be made in a rather stilted manner, and multiple requests must be made in awkward succession:

Alexa play WNYC. Alexa volume 4. Alexa set alarm for 7 am.

rather than the more natural request:

Alexa play WNYC at volume 4, and set alarm for 7 am.

Still, after a year of using the Echo I’m convinced that conversational interfaces, and devices like the Echo, will be commonplace in our near future.

Slack is another example of conversational interfaces in practice with the range of experiences that can be triggered through the messaging interface.

Chatbots aren’t a new idea, and like the Echo, the range of topics to discuss is tightly constrained, but there is a lot of potential in being able to interact with programs in the same place, and the same way, that you’d interact with other people.

Dan Grover reports that the WeChat messaging app in China has grown beyond a messaging app and become a platform where people can interact with companies through digital agents, which can either reply with an automated response or route the request to a human representative.

Skeptics will surely continue to dismiss the potential of conversational approaches for the same reason that many people still prefer command line tools over the direct manipulation of graphical user interfaces: that conversational interfaces lack the power and efficiency of direct manipulation.

Ignoring use cases where physical manipulation is awkward (for people who are cooking or occupied using another tool) or impossible (for people with impaired motor skills from age, injury, or a disability), a conversational interface has the potential to unlock the utility of computers for abstract problems that are difficult to represent and interact with through the manipulation of digital objects.

In the same way that graphical user interfaces unlocked desktop publishing with WYSIWYG text editors, conversational interfaces could unlock more general purpose computing.

And, in the same way that multi-touch input made direct manipulation interfaces accessible to a wider group of people, conversational interfaces can again help computing reach an even wider audience.

Early graphical user interfaces like the Xerox Star, and Apple Lisa were demonstrations of how direct manipulation of digital objects could transform how people use computers, but until touch interfaces decreased the distance between us and our digital objects, and miniaturization made computers hand-held, the true potential of direct manipulation wasn’t realized4.

Likewise, the conversational interfaces demonstrated by Quartz, Echo, and Slack integrations are just a peek at the much more interesting idea of being able to communicate with digital agents.

Both are approaches that avoid the complication of interacting with computers.

Direct manipulation avoids the issue of needing to give computers precise instructions by allowing you to make changes yourself. Computer interfaces using this approach take on the role of a tool.

Agent-based models of interaction avoid the issue of precise instructions by instead describing the end state, or desired result, and using programs that have the agency to reach such an outcome.

Depending on the complexity required to reach some outcome, digital agents may need some level of awareness of the world and general decision-making skills, or may simply be able to use simple pattern-recognition to complete a request.
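The difference between the two models can be sketched in a few lines. In the sketch below, which is purely illustrative, the device, its three actions, and the state fields are hypothetical. With direct manipulation you would call the actions yourself, one at a time; with an agent you describe the end state and let a simple search work out the steps.

```python
from collections import deque

# Hypothetical actions a device exposes; each maps a state to a new state.
ACTIONS = {
    "volume_up": lambda s: {**s, "volume": min(s["volume"] + 1, 10)},
    "volume_down": lambda s: {**s, "volume": max(s["volume"] - 1, 0)},
    "toggle_power": lambda s: {**s, "on": not s["on"]},
}


def satisfies(state, goal):
    """True when every field named in the goal has the desired value."""
    return all(state[key] == value for key, value in goal.items())


def plan(start, goal):
    """Breadth-first search for a shortest action sequence reaching the goal.

    Returns a list of action names, or None if the goal is unreachable.
    """
    queue = deque([(start, [])])
    seen = {tuple(sorted(start.items()))}
    while queue:
        state, steps = queue.popleft()
        if satisfies(state, goal):
            return steps
        for name, act in ACTIONS.items():
            nxt = act(state)
            key = tuple(sorted(nxt.items()))
            if key not in seen:
                seen.add(key)
                queue.append((nxt, steps + [name]))
    return None
```

Calling `plan({"on": False, "volume": 2}, {"on": True, "volume": 4})` returns a three-step plan (two `volume_up` steps and one `toggle_power`, in some order) without the caller ever spelling those steps out. A real agent would need far more than a brute-force search, but the shape of the interaction is the same: state the outcome, not the instructions.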

Natural language seems like a likely interface for digital agents, but in the same way that GUIs are one approach to direct manipulation, it’s likely that other interfaces will also make sense.

I’m excited by the potential of conversational interfaces, but I don’t believe that they are inherently better, or a replacement for direct manipulation.

Physical manipulation has much lower latency than dialog (tapping a button will always be faster than typing or speaking a command to do the same), and the throughput of rich data can also be a constraint (seeing a graph of data and interpreting a trend is a lot easier than hearing a description of a dataset).

In some cases neither direct manipulation nor a conversational interface may be ideal. Consider the scenario of using an elevator.

An elevator interface that uses direct manipulation will typically have two buttons on each floor to indicate that you either want to go to a floor above, or below your current floor.

When an elevator car arrives, you’ll be given feedback that it has arrived, and whether it is traveling up or down.

Within the elevator, another interface will have buttons that allow you to choose the specific floor you wish to travel to.

An elevator that uses a conversational interface could replace each panel of buttons with the option to receive spoken requests.

Despite the apparent ease of this approach as seen on shows like Star Trek, it’s hard to imagine this as an improvement on direct manipulation, especially in any situation where there is more than one person.

A third way is hinted at by designers like Golden Krishna and Adam Greenfield, who describe ambient interfaces that work in many cases without direct interaction at all.

At ustwo, our studio is on the 16th floor, and we each have a card that enables us to travel to that floor. An elevator that takes the approach of an ambient interface knows that if I’m standing outside the elevator on the 16th floor I want to travel to the ground floor (and vice versa), and requests an appropriate car to stop for me without any action on my part other than simply waiting.

A wider range of inputs, such as my current location in the building, past habits, and my calendar5, could even allow for predictive models that decrease the time it takes for an elevator car to be ready: for example, by requesting a car going down when I’m standing up from my desk, it’s 12:30, I normally travel to the ground floor between 12 and 1pm, and I have no meetings in my calendar.
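As a rough sketch of what such a predictive rule might look like, the snippet below hard-codes the lunchtime scenario. Everything in it is an assumption made for illustration: the `Context` fields, the `USUAL_TRIPS` table standing in for learned habits, and the idea that "standing up" is something a sensor could report.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass
class Context:
    floor: int          # floor I'm currently on
    standing_up: bool   # hypothetical signal that I've just left my desk
    now: time           # current time of day
    has_meeting: bool   # whether my calendar shows a meeting right now


# Hypothetical habit learned from past trips: (start, end, destination floor).
USUAL_TRIPS = [(time(12, 0), time(13, 0), 0)]


def predicted_destination(ctx: Context) -> Optional[int]:
    """Return the floor to pre-request a car for, or None if no prediction."""
    if not ctx.standing_up or ctx.has_meeting:
        return None
    for start, end, destination in USUAL_TRIPS:
        if start <= ctx.now <= end and destination != ctx.floor:
            return destination
    return None
```

For the scenario above, `predicted_destination(Context(floor=16, standing_up=True, now=time(12, 30), has_meeting=False))` returns `0`, the ground floor, while the same call with a meeting in the calendar returns `None` and the elevator does nothing.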

Notes

Thanks to Keith Kurson and Francisco Hui for bringing to my attention the conversational interfaces in the Quartz app and the Lifeline game. I’m hopelessly out of touch with interesting new things being made.

  1. As people age, fine control of motor skills decreases, making pointing devices like a mouse harder to use. The full-screen mode and focused nature of mobile apps probably also help.

  2. Reference for Dialogs in Apple’s Human Interface Guidelines.

  3. The New York Times report on … shows that less than 1/3 of digital visitors touch the newspaper’s homepage, suggesting that most people don’t treat the NYT as a destination site that they navigate, but instead as a secondary site linked from elsewhere (most likely Facebook). In addition, data from traffic analytics company comScore shows that in 2014 the average time on newspaper sites was 1.1 minutes per day, compared to 3.6 minutes on search sites and 33 minutes on social media.

  4. Not to suggest that we’ve reached a plateau in progress for this model. If augmented interfaces like those shown by Magic Leap can reach the mass market, then multi-touch input may one day look as quaint as the first mouse. Gestural interfaces, and the use of totems (physical objects acting as proxies for digital objects), are both areas of direct manipulation where much still remains to be explored.

  5. Obviously, I’m ignoring all implications of privacy in this scenario.