Speaking to software

March 2025

Typing is beginning to feel like a chore… much like handwriting does to so many today. Archaic and slightly painful.

It's obviously not going away, but as conversational AI improves (and it's already really, really good), I believe our physical and digital keyboards will gradually become a less common way to interact with software. We'll still type to preserve our privacy in social spaces - think waiting rooms, trains, planes, and meetings. We'll also type in places where noise levels need to be kept low. But in most scenarios, I think speaking will become the norm. We'll speak to software as much as, if not more than, we type - in our homes, on the go… even in our offices.

For what it's worth, I wouldn't be surprised if, over time, our workspaces are completely redesigned to accommodate this shift. If the open layouts of the past two decades were a byproduct of agile ways of working, it's worth asking: what should the optimal office layout look like in the AI era, where employees speak to software more frequently than to their human colleagues? Where employees collaborate with and manage agents instead of people? Does this push us back to cubicles? Or does it demand an entirely new working environment?

What makes all of this especially exciting to think about is how deeply rooted speaking is in who we are. It's the most human form of communication - or at least, it has been for the past 50,000+ years. Voice is information-dense; it conveys not just words but also tone and intonation - critical metadata that's difficult to capture with text alone. Hence, emojis. Now, with AI, we can seamlessly speak to software, transmitting emotion alongside words.

All of this represents a rare moment in time for builders. Innovation at the user interface layer in software has been largely incremental over the past decade - product A looks and feels like product B, which looks and feels like product C. And for good reason: we've largely converged on best practices for UI/UX within the current paradigm of computing. As a result, the surface area for startups to challenge incumbents has shrunk. But advancements in conversational AI and the shift to voice are expanding that surface area once again. The definition of "optimal" is being rewritten, creating an opening for startups to reimagine applications from the ground up and disrupt incumbents in novel ways.

For those looking to capitalise on this shift, there are important questions and thoughts to consider:

What defines a great user experience when much of a product's core experience lacks visuals?

What does the design and product development process look like for voice-only products? What does design even mean?

Which product categories will have to remain visual, which will be able to neglect the visual layer in favour of voice, and which will require both? I believe software will largely bifurcate into two key categories: utilities and work tools (e.g. email, messaging, note-taking), which I expect to shift heavily toward voice, and entertainment and social (e.g. Netflix, YouTube), which will remain screen-based. For what it's worth, I think this shift will invite a complete rethink of devices - laptops and phones. While screens won't disappear, I imagine (and hope) we'll engage with them far less. IMO, AirPods, Meta's Smart Glasses, etc. feel much more in line with next-gen computing. Our phones, on the other hand, will become modern-day Game Boys.

What's the ideal way to let users toggle between visual UIs and voice? ChatGPT and Grok do this elegantly with their calling feature - it takes a second to go from text to voice and back to text. I expect many more products across industries to follow in their footsteps and bake in similar calling features over the coming months. Think: calling your email client, your CRM, or your news aggregator.

When, if at all, should a voice product proactively engage its user? Some products will hang in the background, listening in. Others will wait to be spoken to. There will probably be a category that falls in between - listening, reacting, but also proactively jumping in when it makes sense.

What role do accents, intonation, and emotion play in the overall product experience? I think they'll be crucial. As we interact with software more through voice, we'll form deeper connections with the personas we engage with. Accents and intonation will matter - I already feel this with 'Maple' when using ChatGPT's Advanced Voice Mode. Software will start to feel human, and we'll develop real relationships with our favourite products. The interoperability of voice identities may also become increasingly important, allowing users to carry familiar voices across different platforms and applications.

Naturally, there are more questions than answers here. Lots to figure out, but the time to be playing around with voice is now. The tech is ripe, and it's only getting better. Latency keeps improving, costs are falling, and emotional expressiveness is becoming increasingly human-like. All of this presents an enormous opportunity for builders. We're witnessing a rare shift in how we interact with technology, and previously crowded spaces are starting to feel open again.

I really think voice is the most natural way to interact with AI, and as AI proliferates, it will soon become the most natural way to interact with almost everything.
