AI for Voice to Action – Part 2: Machine Learning Algorithms

My last post discussed the important step of automatically generating vast amounts of relevant content relating to commands, to which we then apply our machine learning algorithms. Here I want to delve into the design of those algorithms.

Given a command, our algorithms need to:

  1. Understand the meaning and intent behind the command
  2. Identify and extract parameters from it
  3. Determine which app action is most appropriate
  4. Execute the chosen action, passing it the relevant parameters

This post and the next one will address point 1. The other points will be covered in subsequent posts.

So how do we understand what a user means based on their command? Commands are typically short (3 or 4 terms), which makes it very difficult to disambiguate among the multiple meanings a term can have. If someone says “search for Boston”, do they want directions to a city or do they want to listen to a rock band on Spotify? In order to disambiguate among all the possibilities we need to know a) whether any of the command terms can have different meanings, b) what those meanings are, and finally c) which meaning is correct given the context.

Semiotics

In order to do this we developed a suite of algorithms that feed off the data we generated previously (see post #3). These algorithms are inspired by semiotics, the study of how meaning is communicated. Semiotics originated as a theory of how we interpret the meaning of signs and symbols. Given a sign in one context, for example a flag with a skull and crossbones on it, you would assign a particular meaning to it (i.e. pirates).

Pirate Symbol

Whereas if you change the context to a bottle, the meaning changes completely:

Poison – do not drink!

Linguists took these ideas and applied them to language: given a term (e.g. ‘window’), its meaning can change depending on the words around it in the sentence (a physical window in a room, a software window, a window of opportunity, etc.). By applying these ideas to our data we can understand the different meanings a term can have based on its context.
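
To make the idea concrete, here is a minimal, illustrative sketch of context-based disambiguation. This is not our actual algorithm: the sense lexicons, the term, and the `disambiguate` function are hypothetical, but they show how the words surrounding ‘window’ can select its meaning.

```python
# Minimal, illustrative sketch only (not Aiqudo's actual algorithm):
# pick the sense of an ambiguous term by measuring how much the other
# words in the command overlap with a small, hypothetical lexicon of
# context words for each sense.

SENSE_LEXICONS = {
    "window": {
        "room":     {"open", "close", "curtain", "glass", "house", "breeze"},
        "software": {"minimize", "maximize", "close", "browser", "app", "screen"},
    },
}

def disambiguate(term, command_words):
    """Return the sense whose context lexicon best overlaps the command words."""
    senses = SENSE_LEXICONS.get(term, {})
    if not senses:
        return "unknown"
    return max(senses, key=lambda sense: len(senses[sense] & command_words))

print(disambiguate("window", {"minimize", "the", "browser", "window"}))          # -> software
print(disambiguate("window", {"open", "the", "window", "to", "the", "breeze"}))  # -> room
```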

Discourse Communities

We also drew inspiration from discourse communities. A discourse community is a group of people involved in and communicating about a particular topic. They tend to use the same language for important concepts within their community (sometimes called jargon), and these terms have a specific, understood and agreed meaning within the community that makes communication easier. For example, members of a cycling community have their own set of terms, fairly unique to them, that they all understand and adhere to. If you want to see what I mean, go here and learn the meanings of such terms as an Athena, a Cassette, a Chamois (very important!) and many others. Similarly, motor enthusiasts have their own ‘lingo’. If you want to be able to differentiate your AWS from your ABS and your DDI from your DPF, then get up to speed here.

Our users use apps, so we would also expect to discover gaming discourses, financial discourses, music discourses, social media discourses and so on. Our goal was to develop a suite of machine learning algorithms that could automatically identify these communities through their important jargon terms. By identifying the jargon terms we can build a picture of the relationships between these terms and the other terms used by each discourse community within our data. A characteristic of jargon words is that they have a very narrow meaning within a discourse compared to other terms. The term ‘computer’, for example, is too general and broad to be a jargon term: it has meanings across many discourses – programming, desktop, laptop, tablet, phone, firmware, networks and so on. We want to identify narrow, specific terms that have a very precise meaning within a single discourse, e.g. a specific type of processor, or a motherboard. Our algorithms do a remarkable job of identifying these jargon terms, and they are foundational to our ability to extract meaning, precisely understand user commands and thereby the real intent behind them.
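
As a rough illustration of what “narrow meaning” can look like computationally (a simplified stand-in, not our production algorithm), one can score how concentrated a term’s usage is across discourse corpora. The discourses and counts below are invented for the example.

```python
# Illustrative sketch only: one simple way to quantify "narrow meaning"
# is to measure how concentrated a term's usage is across discourse
# corpora. A broad term like "computer" spreads across many discourses;
# a jargon term like "chamois" concentrates in one.

import math

def specificity(counts_per_discourse):
    """1.0 means the term occurs in a single discourse; values near 0 mean
    it is spread evenly (computed as 1 minus the normalised Shannon entropy)."""
    total = sum(counts_per_discourse.values())
    probs = [c / total for c in counts_per_discourse.values() if c > 0]
    if len(probs) <= 1:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(counts_per_discourse))

print(specificity({"cycling": 120, "finance": 2,  "gaming": 1}))   # high -> jargon-like
print(specificity({"cycling": 40,  "finance": 35, "gaming": 45}))  # low  -> general term
```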

In my next post I will go into the details behind the algorithms that enable us to identify these narrow-meaning, community-specific jargon terms and ultimately to build a model that understands the meaning and intent behind user queries.

AI for Voice to Action – Part 1: Data

At Aiqudo, two critical problems we solve in voice control are the action discovery problem and the cognitive load problem.

In my first post I discussed how using technology to overcome the challenges of bringing voice control into the mainstream motivated me to get out of bed in the morning. I get a kick out of seeing someone speaking naturally to their device and smiling when it does exactly what they wanted.

In our second post in the series we discussed how Aiqudo has built the largest (and growing) mobile app action index in the world and our process for on-boarding actions. On-boarding an action takes only minutes – there is no programming involved and we are not reliant on the app developer to set this up or provide an API. This enables enormous scalability of actions compared to the Amazon and Google approaches, which rely on a programming solution where developers are required to code to these platforms, add specific intents, and go through a painful approval process.

In this post I want to start to elaborate on our overall approach and discuss specifically how we create the large amounts of content that our patented machine learning algorithms analyze in order to understand a user’s intent. This is a significant achievement, since even large teams are facing challenges in solving this problem in a generic fashion – as the following quote from Amazon shows.

“The way we’re solving that is that you’ll just speak, and we will find the most relevant skill that can answer your query … The ambiguity in that language, and the incredible number of actions Alexa can take, that’s a super hard AI problem.” – Amazon

At Aiqudo, we have already solved the challenge that Amazon is working on. Our users don’t have to specify which app to use, and we automatically pick the right actions for their command, thereby reducing the cognitive load on the user.

The starting point for generating the content we need is the end of the action on-boarding process, when a few sample commands are added to the action. These training commands kick off the machine learning processes that enable us to:

  1. extract the correct meaning from the natural language command;
  2. understand the intent; and
  3. execute the correct action on the best app.
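
For illustration only, an on-boarded action together with its sample training commands might be represented something like this; all field names and values are hypothetical.

```python
# Purely hypothetical illustration of the starting point described above:
# an on-boarded action plus a handful of sample training commands.

from dataclasses import dataclass, field

@dataclass
class OnboardedAction:
    app: str
    action_id: str
    sample_commands: list = field(default_factory=list)
    parameters: list = field(default_factory=list)

navigate = OnboardedAction(
    app="Maps",
    action_id="navigate_to",
    sample_commands=[
        "navigate to the SAP Centre",
        "directions to the SAP Centre",
        "I want to drive to the SAP Centre",
    ],
    parameters=["destination"],
)
```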

The first step in this process is to gather content relating to each on-boarded command (command content). As is typical with machine learning approaches, we are data hungry – the more data we have, the better our performance. We therefore interrogate numerous data repositories specific to the on-boarded commands and apps to identify related content that can be used to augment the language used in the command.

Content Augmentation for Machine Learning

Content augmentation removes noise and increases the semantic coverage of terms

 

Teaching a machine to correctly understand what a user intends from just a few terms in a command is problematic (as it would be for a human) – there isn’t enough context to fully understand the command. Take ‘open the window’: is this a software-related command or a command related to a room? Augmenting the command with additional content gives the algorithms much more context with which to understand meaning and intent. This augmented content forms the basis of a lexicon of terms relating to each on-boarded command. Later, when we apply our machine learning algorithms, this provides the raw data that enables us to build and understand meaning – e.g. we can learn that a movie is similar to a film, that rain is related to weather, that the term ‘window’ has multiple meanings, and so on.
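
Here is a minimal sketch of the augmentation idea, assuming a placeholder function for querying content repositories; it is not our production pipeline, and the canned text it returns is invented. The result is a term lexicon (term → count) for one command.

```python
# Sketch of the augmentation idea, not the production pipeline:
# fetch_related_documents is a placeholder for interrogating content
# repositories, and the text it returns is invented for the example.

from collections import Counter

def fetch_related_documents(command):
    """Placeholder for querying content repositories; returns canned text here."""
    return [
        "get driving directions and navigate to a destination on the map",
        "turn by turn navigation with route traffic and travel time",
    ]

def build_lexicon(command):
    lexicon = Counter(command.lower().split())      # start from the command itself
    for doc in fetch_related_documents(command):    # augment with related content
        lexicon.update(doc.lower().split())
    return lexicon

lexicon = build_lexicon("navigate to the SAP Centre")
```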

It is equally important that each command’s lexicon is highly relevant to the command and low in noise, so we automatically assess each term in the lexicon for relevance and remove noise. The resulting low-noise lexicon becomes the final lexicon of terms for that command. We then generate multiple command documents from it: each command document is built by selecting terms based on their probability of occurrence within the command’s lexicon, so the more frequently a term occurs in the lexicon, the more likely it is to appear in a command document. Note that these synthetically created documents do not read as natural text to a human; they are a reflection of the probabilities of occurrence of terms in the command’s lexicon. It is these synthetic command documents that we use to train our machine learning algorithms to understand meaning and intent, and because they are generated, we can control how many we create to fine-tune the learning process.
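
A minimal sketch of this synthetic document generation, assuming a lexicon like the one built above: terms are sampled with probability proportional to their frequency in the (noise-filtered) lexicon. The counts, document length and document count below are invented.

```python
# Minimal sketch of synthetic command-document generation: sample terms
# with probability proportional to their frequency in the command lexicon.

import random
from collections import Counter

def generate_command_documents(lexicon, n_docs, doc_len, seed=0):
    rng = random.Random(seed)
    terms = list(lexicon.keys())
    weights = list(lexicon.values())
    # Each document is a bag of sampled terms, not grammatical text.
    return [rng.choices(terms, weights=weights, k=doc_len) for _ in range(n_docs)]

lexicon = Counter({"navigate": 40, "directions": 35, "drive": 20, "route": 15, "map": 10})
docs = generate_command_documents(lexicon, n_docs=3, doc_len=8)
```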

Once we have carefully created a relevant command lexicon and built a repository of documents which relate to each command that has been on-boarded, we are ready to analyze the content, identify topics and subtopics, disambiguate among the different meanings words have and understand contextual meaning.  Our innovative content augmentation approach allows us to quickly deploy updated machine learned models that can immediately match new command variants, so we don’t have to wait for large numbers of live queries for training as with other approaches.

The really appealing thing about this approach is that it is language agnostic – it allows us to support users speaking in any language by interrogating multilingual content repositories. We are currently live in 12 markets in 7 languages and are enabling new languages. We’re proud of this major accomplishment in such a short timeframe.

In my next post in this series, I will say a little more about the machine learning algorithms we have developed that have enabled us to build such a scalable, multi-lingual solution.

What motivates me to get out of bed in the morning?

A while back a friend bought an Alexa speaker. He was so excited about the prospect of speaking to his device and getting cool things done without leaving the comfort of his chair. A few weeks later, when I next saw him, I asked how he was getting on with it, and his reply was very insightful and typical of the problems current voice platforms pose.

When he first plugged it in, after asking the typical questions everyone does (‘what is the weather’ and ‘play music by Adele’), he set about seeing what other useful things he could do. He quickly discovered that it wasn’t easy to find out which third-party skills were integrated with Alexa (I call this the action discovery problem). When he found a resource that provided this information, he went about adding skills – local news headlines, a joke teller, Spotify (requiring registration), quiz questions and so on. Then he hit his next problem: in order to use these skills he had to learn a very specific set of commands to execute the functionality. This was fine for two or three skills, but it very soon became overwhelming. He found himself forgetting the precise language to use for each specific skill and soon became frustrated (the cognitive load problem).

Last week, when I saw him again, he had actually given the speaker to his son, who was using it as a music player in his bedroom. Once the initial ‘fun’ of the device wore off, it became apparent that there was very little real utility in it for him. While some skills had value, it was painful to find out about them in the first place, add them to Alexa and then remember the specific commands to execute them…

The reason I found this so interesting was that these are precisely the problems we have solved at Aiqudo. Our goal is to provide consumers with a truly natural voice interface to actions, starting with all the functionality in their phone apps, without their having to remember the specific commands needed to execute them. For example, if I want directions to the SAP Centre in San Jose to watch the Sharks, I might say ‘navigate to the SAP Centre’, ‘I want to drive to the SAP Centre’ or ‘directions to the SAP Centre’. Since a user can use any of these commands, or other variants, they should all just work. Constraining users to learn the precise form of a command just frustrates them and provides a poor user experience. To get the maximum utility from voice, we need to understand the meaning and intent behind the command, irrespective of what the user says, and be able to execute the right action.

So how do we do it?

There is no simple answer, so we plan to cover the main points in a series of blog posts over the coming weeks. These will focus, at a high level, on the processes, the technology, the challenges and the rationale behind our approach. Our process has two main steps:

  • Understand the functionality available in each app and on-board these actions into our Action Index.
  • Understand the intent of a user’s command and then automatically execute the correct action.

In step 1, by doing the ‘heavy lifting’ of understanding the functionality available within the app ecosystem, we overcome the action discovery problem my friend had with his Alexa speaker. Users can simply say what they want to do and we find the best action to execute automatically – the user doesn’t need to do anything. In fact, if they don’t have an appropriate app on their device for the command they have just issued, we recommend one and they can install it!

Similarly, in step 2, by allowing users the freedom to speak naturally and to phrase commands however they wish, we overcome the second problem with Alexa – the cognitive load problem: users no longer have to remember very specific commands to execute actions. Voice should be the most intuitive user interface – just say what you want to do. We built the Aiqudo platform to understand the wide variety of ways users might phrase their commands, allowing them to go from voice to action easily and intuitively. And did I mention that the Aiqudo platform is multilingual, enabling natural language commands in any language the user chooses to speak?

So getting back to my initial question – what motivates me to get out of bed in the morning? – well, I’m excited to use technology to bring the utility of the entire app ecosystem to users all over the world so they can speak naturally to their devices and get stuff done without having to think about it!

In the next post in this series, we’ll talk about step 1 – making the functionality in apps available to users.

Open or Walled?

Voice has the promise to be the next disruption, upending massive, established business models and putting search, commerce, messaging, and navigation up for grabs again. But a walled garden mentality could stifle that disruption.

Even over the Internet’s relatively short history, we see a pattern of behavior: some innovator creates a marketplace for consumers, helping to organize information (Yahoo and AOL in their first iterations), commerce (Amazon), or a place to keep in touch with our friends (Facebook). They create huge consumer value in bringing us together and providing us with tools that make it easy to navigate, buy, message, and so on. But as the audience grows, there is always a slide away from an open marketplace toward a walled garden, with the marketplace operators initially becoming toll takers and then moving toward ever greater control (and monetization) of their users’ experience and, more recently, their data.

Mobile carriers in the US tried to erect walled gardens around their users in the 1.0 version of mobile content — the carriers thought they had captive users and captive vendors, and so created closed platforms that forced their subscribers to buy content from them. Predictably, monopoly providers offered narrow product offerings at high prices and squeezed their vendors so hard that there was no free cash flow for innovation. Mobile content stagnated, as the carriers failed to cultivate fertile ecosystems in which vendors could make money and in which consumers had a growing variety of new and interesting content. When the iPhone came along (thankfully Steve Jobs could wave his magic wand over the guys at AT&T), consumers could finally use their phones to get to the Internet for the content they wanted, and the carriers went back to being dumb pipes.

Will voice platforms become walled gardens?

If you want to enable your users to reach you through Alexa, you have to create a Skill. Then you have to train your users to invoke your Skill using a precise syntax. Likewise Google Assistant. For Siri, your business has to fit into one of the handful of domains that SiriKit recognizes. There’s a reason we refer to them as voice platforms — their owners are in control.

Initially, there are good QA reasons for this – making sure we get a good user experience. But pretty quickly, the walls become constraints on who can be included in the garden (will Amazon and Facebook play nice together?) and, ultimately, on the tax that must be paid in order to offer services in the garden. For us as users, this means less openness, fewer choices, and constraints on our ability to quickly and easily do what we want to do, which typically involves using services from all of the different platform providers (does Tencent really think that blocking Alipay inside WeChat will stop users from using Alipay?).

The carriers’ experience should be a cautionary tale – walled gardens, with their limited choices and monopolist pricing, are bad for consumers. The Internet is a place of unlimited choice, and the world of mobile apps is vast and diverse, again allowing for broad consumer choice – this is what we expect, and if our horizons are constrained by a platform’s policies, we’ll abandon it. The carriers fumbled Mobile Content 1.0; their walled gardens never met their promise to become massive businesses, and today they don’t even exist.

Voice interfaces should be our gateway to everything we want to do, whether it’s in Alexa, in our mobile apps, or in our connected cars or homes. So will voice platforms be these open gateways that make our lives easier, or will they be cramped walled gardens that try to make our choices for us, funneling us to a narrow selection of preferred vendors?

Announcing Q Actions

It’s only been about 3 months since we formally started working @Aiqudo and we’re thrilled to announce the availability of Q Actions (Beta) on the Play Store.

You say it, we do it!

Q Actions allows you to use simple voice commands to instantly execute actions in your favorite Android apps.  Other voice assistants like Alexa or Google Assistant don’t do this!

We’ve solved a few hard problems:

  • Commands in natural language without specific syntax – Unlike systems like Alexa, where you need to invoke a skill by name and use a specific syntax for your command to be recognized, you can use natural commands with Q. You don’t have to mention an app in your command – we’ll automatically figure out the right action for it. In fact, our AI Search (AIS) uses sophisticated machine learning algorithms to perform high-quality fuzzy matching of commands across multiple apps in multiple verticals.
  • Action invocation in apps without APIs or developer work –  The Q Platform does not require app developers to expose specific APIs just for voice. We can enable key actions in apps without any APIs or code. You execute actions just as you normally would in the app, with the additional benefit that it is faster, and you don’t need to remember where the function resides deep within the app. Easier and faster.
  • Personal actions without registration or loss of privacy – Other assistant platforms expose only a few personal actions, and even these require the user to register third-party services on the platform. Since Q executes actions in apps directly, we don’t require registration, and you are using apps you already trust for messaging, banking, payments, stocks, etc.
  • Scalable action on-boarding – We have figured out how to on-board actions within apps directly. We on-board and maintain actions at our end, so neither you nor the app developer has to worry about making the actions available broadly.

All you have to do to get started with Q Actions is say “show my actions” – you’ll see a list of actions already available for your favorite apps out-of-the-box.

Download Q Actions now!

The Aiqudo Team

Day 1 at Aiqudo

Day 1

We’ve hit the ground running with our core team @Aiqudo.

Humble beginnings in 2 small rooms @Spaces in San Jose, near the beautiful Santana Row.

This is the day 1 team – the laptop represents yours truly!

We’re on a mission – make it super simple for users to get things done with simple and intuitive voice commands.

The state of the art is, shall we say, not good enough! Users should not have to learn skills – AI systems should be smarter!!

Voice to Action – You say it, we do it!

Rajat, for the Aiqudo Team!