When engineers first came up with what we now call automobiles, they probably weren’t imagining that we’d be talking to them one day. And yet here we are, some hundred years later, relying on voice for everything from street directions, food orders, and shopping to the forecast and our favorite music or podcasts.
While some still argue about the inventor of the first “real” automobile (Leonardo da Vinci? Karl Benz? Nicolas-Joseph Cugnot?), we know for certain that the marriage between cars and audio technology happened in 2001 thanks to BMW. It wasn’t exactly a happy marriage, however, as the tech was rather unhelpful (if not borderline intrusive); it wasn’t until 2010 that humans and cars actually started talking.
Things started to change when the technology sector began to make its way into the automotive industry, most notably with Apple’s Siri (2011) and the Google Assistant (2016, née Google Now four years earlier), which spun off the car-dedicated platforms CarPlay (2014) and Android Auto (2015), respectively.
Siri and the Google Assistant, alongside Alexa and Nuance (now Cerence), have taken advantage of the cloud and the mobile internet infrastructure to create more capable and efficient voice recognition and voice synthesis applications than the ones built by car manufacturers.
The driver’s history of driving style, a suite of interconnected online services, and the vehicle’s position and direction all represent accrued data that the system uses to understand and anticipate the user’s needs. Someone driving from Rome to Milan may receive timely information on charging stations along the road, for instance, or about parking spots near the destination.
Technology plays a huge role in this: more robust, integrated Wi-Fi modules; dedicated SoCs; noise-canceling microphones; and so on. Some in-car systems actually add artificial noise during the “training phase” (i.e. when the driver helps the system recognize their voice) to replicate real-life situations like busy streets or high-speed stretches.
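To make the idea of adding artificial noise concrete, here is a minimal sketch of mixing a noise recording into a speech sample at a chosen signal-to-noise ratio (SNR). It is not how any particular in-car system does it: the function names, the synthetic sine-wave “speech”, the random “road noise”, and the 10 dB target are all assumptions made for illustration.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to hit snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        return list(speech)  # silent noise: nothing to mix in
    # SNR (dB) = 10 * log10(speech_power / noise_power), solved for the noise power.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for real recordings (hypothetical data).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# A lower SNR simulates a louder environment, such as a high-speed stretch.
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Training on several noisy copies of the same utterance, each at a different SNR, is one common way to expose a recognizer to conditions it would otherwise only meet on the road.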
In 2019, General Motors (GM) became the first manufacturer to fully integrate Amazon’s system, Alexa Auto. Unlike other players, which allow users to switch between systems, GM only lets customers choose between Alexa Auto and Cerence when purchasing a new car.
The assistant is an integral part of the machine, after all. And although audio is slowly becoming ubiquitous, cars appear to be one of the strongest proponents: an Automotive World estimate claims that 9 out of 10 vehicles sold by 2028 will be equipped with a voice assistant of some sort.
Before the introduction of the in-car voice experience, there would be an individual command to try and satisfy each of the user’s needs. If they wanted to open the window, for instance, all they could do was use the dedicated button. A perhaps rudimentary concept, but a very narrow — and thus hardly fallible — one.
This initially translated poorly to audio. In order to make sure that the machine would understand, users had to yell out unnatural-sounding commands. The only partial solution was brute force, i.e. creating an expanded list of commands the users could draw from, with no actual intelligence involved.
Over the course of the past decade, the concept of actual speech has become much more critical in car systems (much like it has in the world of AI in general): users want to be free to speak naturally and be understood — a problem tackled by technologists in the field of Natural Language Understanding (NLU).
How many ways are there to say “thank you”? “I’d like to thank you”, “thank you very much”, “thank you so much”, “thanks”, all the way to more implicit forms, such as “you were very helpful” or even others borrowed from other languages.
Such variety becomes even more complex when one considers that people tend not to spell out everything literally. When people use few words, assistants can trip up and ask follow-up questions to request more information, thus slowing down the overall process.
And so, to avoid “fixed phrases”, it is vital to train systems to recognize as many variations as possible, especially those that may sound ambiguous to a machine but are perfectly normal to humans. This kind of subtlety is precisely what can make machines more sophisticated: slowly but surely, we are moving from simple, direct commands to more natural speech interactions in which the machine understands intention.
Let’s use an example. If someone were to say, “It’s boiling in here”, someone else would immediately pick up on it, and understand the need for refreshment. What about a voice assistant? If it were only trained on fixed phrases, “It’s boiling” would probably be meaningless. NLU makes it possible to link the pieces together and actually make sense of what is being said.
However, natural language remains intrinsically confusing. How should an AI system interpret “It’s boiling”? Should it lower the window or turn the AC on? The answer is probably somewhere in the given dataset, which needs to be enriched with additional data points (location, past requests, etc.) to better serve the user. Failing that, the system can always fall back on asking.
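The “It’s boiling” dilemma can be sketched as a small scoring routine that weighs contextual signals and asks a follow-up question only when no option clearly wins. Everything in it is hypothetical: the `Context` fields, the 28 °C and 90 km/h thresholds, the weights, and the action names are assumptions chosen for illustration, not values taken from a real assistant.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical signals the car could consult before acting."""
    outside_temp_c: float
    speed_kmh: float
    past_choices: dict = field(default_factory=dict)  # e.g. {"turn_on_ac": 3}


def choose_cooling_action(ctx, margin=1.0):
    """Pick an action for "It's boiling in here", or ask when unsure."""
    scores = {"turn_on_ac": 0.0, "open_window": 0.0}

    # A window helps little when the air outside is hot.
    if ctx.outside_temp_c >= 28:
        scores["turn_on_ac"] += 1.0
    else:
        scores["open_window"] += 1.0

    # At motorway speed an open window is noisy, so favor the AC.
    if ctx.speed_kmh >= 90:
        scores["turn_on_ac"] += 1.0

    # Past requests nudge the decision toward the driver's habits.
    for action, count in ctx.past_choices.items():
        if action in scores:
            scores[action] += 0.5 * count

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked
    if best_score - runner_up_score < margin:
        return "ask_user"  # too close to call: fall back on a follow-up question
    return best


print(choose_cooling_action(Context(outside_temp_c=35, speed_kmh=120)))  # turn_on_ac
print(choose_cooling_action(Context(outside_temp_c=20, speed_kmh=100)))  # ask_user
```

The `margin` parameter captures the trade-off described above: a larger margin means more follow-up questions but fewer wrong guesses.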
As voice assistants evolve, so does user behavior toward them, making it easier to find a middle ground between what one wants and the best way to get it.
Several processes and technologies go into making a voice assistant. Starting from an audio signal, the Automatic Speech Recognition (ASR) technology guesses how to best interpret it and tries to convert it into actual, sensible words. Then things move to NLU, whose job is to make sense of the entire sentence using individual words as cornerstones. The result is a semantic representation of the original query.
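As a rough illustration of that flow, the sketch below chains a stubbed ASR step to a toy NLU step that returns a semantic representation (an intent plus its slots). The names `fake_asr`, `understand`, and `SemanticFrame` are invented for this example, and the regular expressions merely stand in for a trained NLU model to keep the code short; a production system would not work this way.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SemanticFrame:
    """Semantic representation of a query: what the user wants, and about what."""
    intent: str
    slots: dict = field(default_factory=dict)


def fake_asr(audio: bytes) -> str:
    """Stand-in for a real ASR engine: treats the 'audio' as UTF-8 text."""
    return audio.decode("utf-8").strip().lower()


# Hypothetical patterns; the first one that matches decides the intent.
PATTERNS = [
    ("navigate", re.compile(r"(?:take me|navigate|drive|directions) to (?P<destination>.+)")),
    ("call", re.compile(r"(?:call|phone|ring) (?P<contact>.+)")),
    ("play_music", re.compile(r"play (?P<track>.+)")),
]


def understand(text: str) -> SemanticFrame:
    """Toy NLU: map recognized words to an intent and its slots."""
    for intent, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return SemanticFrame(intent, match.groupdict())
    return SemanticFrame("unknown")


frame = understand(fake_asr(b"Take me to Milan"))
print(frame)  # SemanticFrame(intent='navigate', slots={'destination': 'milan'})
```

Whatever sits inside each stage, the useful property is the interface: downstream components only ever see the frame, never the raw audio or the raw text.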
NLU systems are usually built with machine learning algorithms trained on linguistic data (such as text and audio) that help them better understand the user’s language. This is an evolution over the old rule-based systems, which could only understand a few unnatural, fixed sentences.
Training machine learning algorithms broadly demands two types of skills: knowing how to choose and gather the right kind of data, and being able to fine-tune it in accordance with the given need. As for linguistics, algorithms demand massive training datasets with as many variations (read: natural formulations) of the same command as possible. Ambiguity has to be removed, so a profound knowledge of semantics is absolutely necessary.
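A toy example shows why many formulations of the same command matter. The sketch below trains a tiny naive Bayes classifier (with add-one smoothing) on a handful of labeled utterances; the dataset, intent names, and function names are made up for illustration, and real systems use far larger datasets and more capable models.

```python
import math
import re
from collections import Counter

# Hypothetical training data: several natural formulations per intent.
TRAINING = {
    "thank": [
        "thank you",
        "thanks a lot",
        "thank you very much",
        "you were very helpful",
    ],
    "cool_cabin": [
        "it is boiling in here",
        "i am too hot",
        "turn on the ac",
        "it is so hot in here",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count how often each word appears under each intent."""
    counts = {intent: Counter() for intent in examples}
    for intent, utterances in examples.items():
        for utterance in utterances:
            counts[intent].update(tokenize(utterance))
    vocab = set().union(*counts.values())
    total = sum(len(utterances) for utterances in examples.values())
    priors = {intent: len(utterances) / total for intent, utterances in examples.items()}
    return counts, vocab, priors


def classify(text, counts, vocab, priors):
    """Return the intent with the highest naive Bayes log-probability."""
    best_intent, best_score = None, -math.inf
    for intent, word_counts in counts.items():
        denominator = sum(word_counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(priors[intent])
        for token in tokenize(text):
            score += math.log((word_counts[token] + 1) / denominator)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent


model = train(TRAINING)
print(classify("thanks so much", *model))  # thank
print(classify("so hot in here", *model))  # cool_cabin
```

Neither test phrase appears verbatim in the training data; the classifier generalizes only because related formulations do, which is the point of collecting as many variations as possible.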
Another key skill lies in evaluating the assistant’s performance, to fix bugs and resolve major issues. The next steps consist of executing commands and/or keeping the dialogue going. In the latter case, another component comes into play: Natural Language Generation (NLG), which tries to formulate a response that takes the context of what was said into account. Text-to-speech (TTS) then gives those words a voice. This is also the phase in which a company gives its assistant a distinct voice to make it recognizable and suitable for the brand.
Our linguistic skills are in particularly high demand because voice assistants are now usually developed for multiple languages: major firms’ assistants cover most European, North and South American, Middle Eastern, and Asian languages. What’s more, languages with hundreds of millions of speakers (English, Spanish, Chinese) often get regionalized versions, to allow locals to speak even more freely.
Voice assistants are getting bigger and bigger. Just a few years ago, a voice assistant in a car could barely get us directions to a specific destination, or perhaps call somebody. Their capabilities are now expanding significantly, although the domain areas haven’t radically changed: navigation (with information on traffic, reroutes, etc.), phone calls (with message dictation, replies, etc.), and entertainment (music, radio, podcasts, etc.).
But there is more on the horizon, especially when it comes to commands that pertain to the car itself. People want to be able to tune the AC and heating, open and close windows, tweak the seats’ height, change the lights’ color, and so on. There’s also a greater push to integrate these assistants into the larger IoT ecosystem, to speak to the car and get things done at home (for instance, turning the heating on some ten minutes before getting there, and opening the garage door on arrival).
Voice assistants are undeniably getting better: more understanding, flexible, intelligent, and capable of giving information, but also making jokes and telling stories. Human, in a way. The robotic voice that keeps us going in circles at the roundabout without ever guessing the right exit is but a fading memory.