
Enhancing Aiqudo’s Voice AI Natural Language Understanding with Deep Learning

By Artificial Intelligence, Deep Learning, Machine Learning, Natural Language, Neural Networks

Aiqudo provides the most extensive library of voice-triggered actions for mobile apps and other IoT devices. During the difficult time of Covid-19, voice is becoming essential as more organizations see the need for contactless interactions. To further improve the performance of Aiqudo voice, we enhanced our unique Intent Matching using Semiotics with Deep Learning (DL) for custom Named Entity Recognition (NER) and Part of Speech (POS) tagging.

The task in question was to recognize the relevant named entities in users’ commands, a task known as Named Entity Recognition (NER) in the Natural Language Processing (NLP) community. For example, ‘play Adele on YouTube’ involves two named entities, ‘Adele’ and ‘YouTube’. Extracting both entities correctly is critical for understanding the user’s intent, retrieving the right app and executing the correct action. Publicly available NER tools, such as NLTK, spaCy and Stanford NLP, proved unsuitable for our purposes for the following reasons:

  1. they often make mistakes, especially when processing the short sentences typical of user commands
  2. they assign generic labels such as ‘Organization’ for ‘YouTube’ and ‘Person’ for ‘Adele’, rather than the entity types we need within this command context, which are ‘App’ and ‘Artist’
  3. they don’t provide the granularity we need. Because we support a very broad set of verticals or domains, our granularity needs for parameter types are very high: we need to identify almost 70 different parameter types in total (and this continues to grow). It’s not enough for us to identify a parameter as an “Organization”; we need to know whether it is a “Restaurant”, a “Business” or a “Stock ticker”.

Part of Speech (POS) tagging is another essential aspect of both NER and action retrieval, but, again, public POS taggers such as NLTK, spaCy and Stanford NLP don’t work well for short commands. The situation is worse for words such as ‘show’, ‘book’, ‘email’ and ‘text’, which act as verbs in commands but are normally tagged as nouns by most existing POS taggers. We therefore needed to develop our own custom NER module that also produces more accurate POS information.

Fortunately, we already had a database of 13K+ commands relating to actions already on-boarded to our platform, and this provided the training data to build an integrated DL model. Example commands (with parameters extracted) in our database included ‘play $musicQuery on $mobileApp’, ‘show my $shoppingList’ and ‘navigate from $fromLocation to $toLocation’ (our named entity types start with ‘$’). For each entity, we created a number of realistic values, such as ‘grocery list’ and ‘DIY list’ for ‘$shoppingList’, and ‘New York’ and ‘Atlanta’ for ‘$fromLocation’. We created around 3.7 million instantiated queries, e.g., ‘play Adele on YouTube’, ‘show my DIY list’ and ‘navigate from New York to Atlanta’. We then used existing POS tools to label all words, chose the most popular POS pattern for each template, and finally labelled each relevant query accordingly.
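To make this concrete, here is a minimal sketch of the kind of template instantiation described above, using a few hypothetical templates and entity values and a BIO tagging scheme; the helper code and tag names are our own illustration, not Aiqudo’s actual tooling.

```python
import itertools

# Hypothetical command templates and entity values; the real corpus
# contains 13K+ templates and many more values per entity type.
TEMPLATES = [
    "play $musicQuery on $mobileApp",
    "navigate from $fromLocation to $toLocation",
    "show my $shoppingList",
]
ENTITY_VALUES = {
    "$musicQuery": ["Adele", "jazz"],
    "$mobileApp": ["YouTube", "Spotify"],
    "$fromLocation": ["New York", "Atlanta"],
    "$toLocation": ["Atlanta", "Boston"],
    "$shoppingList": ["grocery list", "DIY list"],
}

def instantiate(template):
    """Yield (tokens, BIO entity tags) pairs for one command template."""
    slots = [tok for tok in template.split() if tok.startswith("$")]
    for values in itertools.product(*(ENTITY_VALUES[s] for s in slots)):
        fill = dict(zip(slots, values))
        tokens, tags = [], []
        for tok in template.split():
            if tok.startswith("$"):
                words = fill[tok].split()
                tokens += words
                tags += ["B-" + tok[1:]] + ["I-" + tok[1:]] * (len(words) - 1)
            else:
                tokens.append(tok)
                tags.append("O")
        yield tokens, tags

for tokens, tags in instantiate(TEMPLATES[0]):
    print(list(zip(tokens, tags)))
# e.g. [('play', 'O'), ('Adele', 'B-musicQuery'), ('on', 'O'), ('YouTube', 'B-mobileApp')]
```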

To make the data understandable to a neural network, we then needed to represent each word or token digitally, i.e. as vectors of a certain dimension. This is called word embedding. We tried several embedding methods, including the Transformer tokenizer, ELMo, Google 300d, GloVe, and random embeddings of different dimensions. A pre-trained Transformer produced the best results but required the most expensive computing resources, such as a GPU. ELMo produced the second-best results but also needed a GPU for reasonable computing time. Random embeddings of 64 dimensions work well on a CPU and produce results comparable to ELMo, while being far less expensive. Such tradeoffs are critical when you move from a theoretical AI approach to rolling AI techniques into production at scale.

Our research and experiments were based on the state-of-the-art DL NER architecture of a residual bidirectional LSTM. We integrated two related tasks: POS tagging and multi-label, multi-class classification of potential entity types. Our present solution is therefore a multi-input, multi-output DL model. The neural architecture and data flow are illustrated in Fig. 1. The input module takes the user’s speech and transforms it into text; the embedding layer represents the text as a sequence of vectors; the two bidirectional layers capture important recurrent patterns in the sequence; the residual connection restores some lost features; and these patterns and features are then used for labelling named entities and producing POS tags, or are flattened for global classification of entity (parameter) types.

Deep Learning Architecture

Fig. 1 Neural architecture for Aiqudo Multitask Flow
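For readers who prefer code to diagrams, the following Keras sketch shows one possible realisation of the Fig. 1 architecture: a trainable 64-dimensional random embedding, two bidirectional LSTM layers with a residual connection, per-token NER and POS outputs, and a flattened branch for multi-label parameter-type classification. All layer sizes, vocabulary size and tag counts are placeholder assumptions rather than our production values.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Placeholder sizes; the real vocabulary, tag sets and layer widths differ.
VOCAB_SIZE, EMB_DIM, MAX_LEN = 20000, 64, 16
N_NER_TAGS, N_POS_TAGS, N_PARAM_TYPES = 70, 40, 70

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32", name="token_ids")

# Randomly initialised, trainable 64-d embeddings (the CPU-friendly option).
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)

# Two bidirectional LSTM layers; a residual connection around the second
# restores features that would otherwise be lost.
h1 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
h2 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(h1)
h = layers.Add()([h1, h2])

# Per-token outputs: named-entity labels and POS tags.
ner_out = layers.TimeDistributed(
    layers.Dense(N_NER_TAGS, activation="softmax"), name="ner")(h)
pos_out = layers.TimeDistributed(
    layers.Dense(N_POS_TAGS, activation="softmax"), name="pos")(h)

# Global branch: flatten and predict entity (parameter) types (multi-label).
flat = layers.Flatten()(h)
param_out = layers.Dense(N_PARAM_TYPES, activation="sigmoid",
                         name="param_types")(flat)

model = Model(inputs=tokens, outputs=[ner_out, pos_out, param_out])
model.compile(
    optimizer="adam",
    loss={"ner": "sparse_categorical_crossentropy",
          "pos": "sparse_categorical_crossentropy",
          "param_types": "binary_crossentropy"})
model.summary()
```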

One real-life scenario would be as follows: a user wants to greet his friend Rodrigo on WhatsApp. He issues the command ‘Whatsapp text Rodrigo good morning’ verbally to his phone (not a well-formed command, but this is common in practice). Each word in his speech is mapped to a token integer, which indexes a 64-dimensional vector; the resulting sequence of vectors passes through the two bidirectional LSTM layers and the residual connection; the network outputs parameter-value pairs and POS tags as a sequence and, on another branch, is flattened to output parameter types. Our platform now has all the information needed to pass on to the next Natural Language Understanding (NLU) component in our system (see Figure 2) to fully understand the user’s intent and execute the correct action.
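The structure handed to the downstream NLU component might look something like the following; the tag names and field layout here are purely illustrative assumptions, not our actual schema.

```python
# Hypothetical output for "Whatsapp text Rodrigo good morning"; the tag
# names and structure are illustrative only.
nlu_input = {
    "tokens":   ["Whatsapp", "text", "Rodrigo", "good", "morning"],
    "pos_tags": ["NNP",      "VB",   "NNP",     "JJ",   "NN"],
    "ner_tags": ["B-mobileApp", "O", "B-contactName",
                 "B-messageBody", "I-messageBody"],
    "parameters": {"mobileApp": "Whatsapp",
                   "contactName": "Rodrigo",
                   "messageBody": "good morning"},
    "param_types": ["mobileApp", "contactName", "messageBody"],  # global branch
}
```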

Online Intent Pipeline

Fig. 2 Aiqudo Online Intent Pipeline

Before we could go live in production, we needed to test the performance of the pipeline thoroughly. We devised 600k test scenarios spanning 114 parameter distributions and covering a range of command lengths, from very short 2-term commands to much longer 15-term commands. We also focused on out-of-vocabulary parameter terms (terms that do not occur in the training data, such as names of cities and movies) to ensure that the model could handle these as well.

Analysis of this approach in conjunction with the Aiqudo platform showed that it improved platform performance: the general entity recall ratio increased by over 10%. This integrated multitask model fits Aiqudo’s requirements particularly well:

  1. The model was trained on our own corpus and produces entities and POS tags compatible with our on-boarded mobile app commands
  2. The three relevant tasks share most hidden layers and better weight optimization can therefore be achieved very efficiently
  3. The system can be easily adapted to newly on-boarded actions by expanding or adjusting the training corpus and/or annotating tags
  4. The random embedding model runs fast enough even on CPUs and produces much better results than publicly available NLP tools

We plan to continue to use DL where appropriate within our platform to complement and augment our existing Semiotics-based NLU engine. Possible future work includes: 

  1. extending the solution to other languages (our system has commands on-boarded in several languages to use for training)
  2. leveraging tagging information and multi-label outputs, which haven’t been explicitly utilized yet, to further improve NER performance
  3. expanding the DL model further by integrating other subtasks, such as predicting relevant mobile apps from commands and/or actions.

This pipeline, with its flexible combination of Semiotics, Deep Learning and grammar-based algorithms, will power even more capable Aiqudo voice services in the future.

Xiwu Han, Hudson Mendes and David Patterson – Aiqudo R&D


A Classifier Tuned to Action Commands

By Artificial Intelligence, Command Matching, Machine Learning

One thing we have learned through our journey of building the Q Actions® Voice platform is that there are few things as unpredictable as what users will say to their devices. Utterances range from noise or nonsense queries (those with no obvious intent, such as “this is really great”) to genuine queries such as “when does the next Caltrain leave for San Francisco”. We needed a way to filter out the noise before passing genuine queries to Q Actions. As we thought about this further, we decided to categorize incoming commands into the following 4 classes:

  • Noise or nonsense commands
  • Action Commands that apps are best suited to answer (such as the Caltrain query above)
  • Queries that are informational in nature, such as “how tall is Tom Cruise”
  • Mathematical queries, such as “what is the square root of 2024”

This classifier would enable us to route each query internally within our platform to provide the best user experience, so we set about building a 4-class classifier for Noise, App, Informational and Math. Since we have the world’s largest mobile Action library, and Action commands are our specialty, it was critical to attain as high a classification accuracy as possible for the App class, so that we route as many valid user commands as possible to our proprietary Action execution engine.

We initially considered a number of different approaches when deciding on the best technology for this. These included convolutional and recurrent neural networks, a 3-layer Multilayer Perceptron (MLP), and Transformer models such as BERT and ALBERT, plus one Transformer we trained ourselves to allow us to assess the impact of different hyperparameters (number of heads, depth, etc.). We also experimented with different ways to embed the query information within the networks, such as word embeddings (Word2vec and GloVe) and sentence embeddings such as USE and NNLM.

We created a number of data sets with which to train and test the different models. Our goal was to identify the best classifier to deploy in production, as determined by its ability to accurately classify the commands in each test set. We used existing valid user commands for our App Action training and test data sets. Question datasets were gathered from sources such as Kaggle, Quora and Stanford QA. Mathematical queries were generated using a program written in house and from https://github.com/deepmind/mathematics_dataset. Noise data was obtained from actual noisy queries in our live traffic from powering Motorola’s Moto Voice Assistant. All this data was split into training and test sets and used to train and test each of our models. The following table shows the size of each data set.

Dataset         Training set size   Test set size
APP             1,794,616           90,598
Noise           71,201              45,778
Informational   128,180             93,900
Math            154,518             22,850

The result of our analysis was that the 3-layer MLP with USE embeddings provided the best overall classification accuracy across all 4 categories.

The architecture of this classifier is shown in the following schematic. It gives a posterior probabilistic classification for an input query.

Classifier Architecture

Figure 1  Overview of the model

In effect, the network consists of two components: the embedding layer followed by a 3-layer feed-forward MLP. The first layer consists of N dense units, the second of M dense units (where M < N), and the output is a softmax function, which is typically used for multi-class classification and assigns a probability to each class. As can be seen from Figure 1, the “APP” class has the highest probability and would be the model prediction for the command ‘Call Bill’.

The embedding layer relies on a Tensorflow hub module, which has two advantages:

  • we don’t have to worry about text preprocessing
  • we can benefit from transfer learning (utilizing a model pre-trained on a large volume of data, often based on transformer techniques, for text classification)

The hub module used is based on the Universal Sentence Encoder (USE), which gives us a rich semantic representation of queries and can also be fine-tuned for our task. USE is much more powerful than word-embedding approaches, as it can embed not only words but also phrases and sentences. It is trained on a variety of data sources and a variety of tasks, with the aim of dynamically facilitating a wide diversity of natural language understanding tasks. The output from this embedding layer is a 512-dimensional vector.

We expect similar sentences to have similar embeddings, as shown in the following heatmap, where the more similar two sentences are, the darker the color. Similarity is based on the cosine similarity of the vectors. We demonstrate the strong similarity between two APP commands (‘view my profile’, ‘view my Facebook profile’), two INFORMATIONAL queries (‘What is Barack Obama’s age’, ‘How old is Barack Obama’) and two MATH queries (‘calculate 2+2’, ‘add 2+2’).

Heatmap

Figure 2  Semantic similarity
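A heatmap like Figure 2 can be reproduced with a few lines of code; this sketch assumes the publicly hosted USE v4 module on TensorFlow Hub, which may differ from the exact module we use in production.

```python
import numpy as np
import tensorflow_hub as hub

# Assumed public USE module; the production module and version may differ.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

queries = ["view my profile", "view my Facebook profile",
           "What is Barack Obama's age", "How old is Barack Obama",
           "calculate 2+2", "add 2+2"]
vectors = embed(queries).numpy()                      # shape (6, 512)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = vectors @ vectors.T                      # cosine similarity matrix
print(np.round(similarity, 2))                        # values behind the heatmap
```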

The MLP’s two hidden layers consist of N=500 and M=100 units. If a model has more hidden units (a higher-dimensional representation space) and/or more layers, the network can learn more complex representations. However, this makes the network more computationally expensive and may lead to overfitting: learning patterns that improve performance only on the training data while degrading generalization (poorer performance on the test data). This is why it is important to choose the MLP settings based on performance on a range of unseen test sets.
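Putting the pieces together, a minimal Keras sketch of this classifier might look as follows; the hub URL, activations and training configuration are assumptions, while the layer sizes (N=500, M=100) and the 4-way softmax follow the description above.

```python
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers, Sequential

# Assumed public USE module; fine-tunable, outputs a 512-d sentence embedding.
use_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    input_shape=[], dtype=tf.string, trainable=True)

model = Sequential([
    use_layer,
    layers.Dense(500, activation="relu"),   # N = 500 dense units
    layers.Dense(100, activation="relu"),   # M = 100 dense units
    layers.Dense(4, activation="softmax"),  # APP, Informational, Math, Noise
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# e.g. model.predict(tf.constant(["Call Bill"])) -> probabilities over 4 classes
```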

In terms of overall performance, our model gives us an accuracy of 98.8% for APP, 86.9% for Informational, 83.5% for Mathematical and 52.3% for Noise. From this it can be seen that we achieved our goal of classifying almost all App Action commands correctly. Informational and Mathematical commands also had a high degree of accuracy, while Noise was the worst-performing class. The reason Noise performed worst is that noise is very difficult to define: it can range from grammatically correct sentences with no relevance to the other 3 categories (such as “the weather is hot today”) to complete random nonsense, which makes it very hard to build a good training set in advance. We are still working on this aspect of our classifier and plan to improve its performance on this category with better training data.

Niall Rooney and David Patterson

Q Actions 1.6.2 just released to App Store!

By App Actions, Artificial Intelligence, Conversation, Digital Assistants, Knowledge, Machine Learning, Natural Language, Voice Search

New Q Actions version now in the App Store

This version of Q Actions features contextual downstream actions and integration with your calendar, as well as under-the-bonnet improvements to our matching engines. Q Actions helps users power through their day by being more useful and thoughtful.

Contextual Awareness

Q Actions understands context when performing your actions. Let’s say you call a contact in your phonebook with the command “call Tiffany”. You can then follow up with the command “navigate to her house”. Q Actions is aware of the context from your previous command and is able to use that information in a downstream action.


  • say “call Tiffany”
    • then “navigate to her house”

Calendar Integration


Stay on top of your schedule and daily events with the recently added Calendar actions. Need to see what’s coming up next? Just ask “when is my next meeting?” and Q Actions will return a card with all the important event information. Need to quickly schedule something on your calendar? Say “create a new event” and after a few questions, your event is booked. On the go and need to join a video conferencing meeting? Simply say “join my next meeting” and Q Actions will take you directly to your meeting in Google Meet. All you have to do from there is confirm your camera/audio settings and join!

  • “when is my next meeting?”
  • “create a new event”
  • “join my next meeting”

Simply do more with voice! Q Actions is now available on the App Store.


Thought to Action!

By Artificial Intelligence, Machine Learning

Here at Aiqudo, we’re always working on new ways to drive Actions and today we’re excited to announce a breakthrough in human-computer interaction that facilitates these operations.  We’re calling it “Thought to Action™”. It’s in early-stage development, but shows promising results.

Here’s how it works. We capture user brainwave signals via implanted neural-synaptic receptors and transfer the resulting waveforms over BLE to our cloud, where advanced AI and machine learning models translate the user’s “thoughts” into specific app actions that are then executed on the user’s mobile device. In essence, we’ve transcended the use of voice to drive actions. Just think about the possibilities. Reduce messy and embarrassing moments when your phone’s speech recognizer gets your command wrong: “Tweet Laura, I love soccer” might end up as “Tweet Laura, I’d love to sock her”. With “Thought to Action™” we get it right every time. And it’s perfect for today’s noisy environments. Low on gas while driving your kids’ entire soccer team home from a winning match? Simply think “Find me the nearest gas station” and let Aiqudo do the rest. Find yourself in a boring meeting? Send a text to a friend using just your thoughts.

Stay tuned as we work to bring this newest technology to a phone near you.


AI for Voice to Action – Part 3: The importance of Jargon to understanding User Intent

By Artificial Intelligence, Command Matching, Machine Learning

In my last post I discussed how semiotics and observing how discourse communities interact had influenced the design of our machine learning algorithms. I also emphasized the importance of discovering jargon words as part of our process of understanding user commands and intents.

In this post, we describe in more depth how this “theory” behind our algorithms actually works. In the last post we also discussed what constitutes a good jargon word: “computer” is a poor example because it is too broad in meaning, whereas a term relating to a computer chip, e.g. “Threadripper” (a gaming processor from AMD), is a better example, as it is more specific in meaning and is used in fewer contexts.

Jargon terms and Entropy

So – how do we identify good jargon terms and what do we do with them in order to understand user commands?

To do this we use entropy. In general, entropy is a measure of chaos or disorder and, in an information-theory context, it can be used to determine how much information is conveyed by a term. Because jargon words have a very narrow and specific meaning within specific discourse communities, they have lower entropy (more information value) than broader, more general terms.

To determine entropy, we take each term in our synthetic documents (see this post for more information on how we create this data set) and build a probability profile of co-occurring terms. The diagram below shows an example (partial) probability distribution for the term ‘computer’.

Entropy

Figure 1: Entropy – probability distributions for jargon terms

These co-occurring terms can be thought of as the context for each potential jargon word. We then use this probability profile to determine the entropy of the word. If that entropy is low then we consider it to be a candidate jargon word.
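As a toy illustration of this step, the sketch below builds a co-occurrence probability profile for a term from a handful of made-up documents and computes its Shannon entropy; the document set and profile construction are simplified assumptions, but they show why a narrow term like ‘threadripper’ scores lower than a broad term like ‘computer’.

```python
import math
from collections import Counter

def cooccurrence_profile(term, documents):
    """Probability distribution of terms co-occurring with `term`."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        if term in tokens:
            counts.update(t for t in tokens if t != term)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def entropy(profile):
    """Shannon entropy of a co-occurrence profile: H = -sum(p * log2 p)."""
    return -sum(p * math.log2(p) for p in profile.values() if p > 0)

# Toy documents: 'threadripper' appears in a narrower set of contexts than
# 'computer', so its co-occurrence profile has lower entropy.
docs = ["computer desktop laptop network firmware",
        "computer tablet phone programming",
        "threadripper gaming processor benchmark",
        "threadripper gaming processor overclock"]
for term in ("computer", "threadripper"):
    print(term, round(entropy(cooccurrence_profile(term, docs)), 2))
```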

Having identified the low-entropy jargon words in our synthetic command documents, we then use their probability distributions as attractors for the documents themselves. In this way (as seen in the diagram below) we create a set of document clusters where each cluster relates semantically to a jargon term. (Note: in the interest of clarity, clusters are described in the figure below using high-level topics rather than the jargon words themselves.)

Clusters derived from Synthetic Documents

Figure 2: Using jargon words as attractors to form clusters

We then build a graph within each cluster that connects documents based on how similar they are in terms of meaning. We identify ‘neighborhoods’ within these graphs that relate to areas of high similarity. For example, a cluster may be about “cardiovascular fitness”, whereas a neighborhood may be more specifically about “High Intensity Training”, “rowing”, “cycling”, etc.

Clusters and Neighborhoods

Figure 3: Neighborhoods for the cluster “cardiovascular fitness”
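Many graph techniques could serve this step; as one heavily simplified, hypothetical sketch (not our actual algorithm), documents in a cluster can be connected whenever their tf-idf cosine similarity exceeds a threshold, with the connected components then acting as neighborhoods.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

def neighborhoods(cluster_docs, threshold=0.3):
    """Build a similarity graph over one cluster's documents and return
    its connected components as 'neighborhoods' (sub-topics)."""
    tfidf = TfidfVectorizer().fit_transform(cluster_docs)
    sim = (tfidf @ tfidf.T).toarray()   # cosine similarity (rows are L2-normalised)
    graph = nx.Graph()
    graph.add_nodes_from(range(len(cluster_docs)))
    for i in range(len(cluster_docs)):
        for j in range(i + 1, len(cluster_docs)):
            if sim[i, j] >= threshold:
                graph.add_edge(i, j)
    return list(nx.connected_components(graph))
```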

These neighborhoods can be thought of as sub-topics within the overall cluster topic. Within each sub-topic we can then extract important meaning-based phrases that precisely describe what that neighborhood is about, e.g. “HIIT”, “anaerobic high-intensity period”, “cardio session”, etc.

Meaning based phrases for sub-topics

Figure 4: Meaning based phrases for the “high intensity training” sub-topic

In this way we create meaning-based structure from completely unstructured content. Documents from the same cluster relate to the same discourse community. Documents from the same cluster that share similar important terms or phrases can be regarded as relating to the same sub-topic. If two clusters share a large number of important phrases, this represents a dialogue between two discourse communities; if multiple important phrases are shared among many clusters, this represents a dialogue among multiple communities.

So, having described a little about the algorithms themselves, how do they help us understand the correct meaning behind a user’s command? Given this contextual partitioning of the data into discourses based on jargon terms, we can disambiguate among the many different meanings a term can have. For example, if the user were to say ‘open the window’, we would understand that there is a meaning (discourse) relating both to buildings and to software, but if the user were to say ‘minimize the window’, we would understand that this could only have a software meaning and context. Fully understanding the nuances behind a user’s command is, of course, much more complicated than what I have just described, but the goal here is to give a high-level overview of the approach.
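As a toy illustration of this disambiguation (with made-up term distributions, not our production model), one can score a command’s terms against each discourse cluster’s term probabilities and compare the results:

```python
import math

def score(command, cluster_profile, floor=1e-6):
    """Log-probability of a command's terms under one discourse cluster's
    term distribution; the highest-scoring cluster suggests the likely context."""
    return sum(math.log(cluster_profile.get(t, floor))
               for t in command.lower().split())

# Toy term distributions for two discourse clusters (values are invented).
software  = {"open": 0.05, "minimize": 0.04, "window": 0.06, "tab": 0.05}
buildings = {"open": 0.05, "window": 0.06, "door": 0.07, "room": 0.05}

for cmd in ("open the window", "minimize the window"):
    print(cmd,
          "| software:", round(score(cmd, software), 1),
          "| buildings:", round(score(cmd, buildings), 1))
# "open the window" scores similarly in both discourses (ambiguous), while
# "minimize the window" scores far higher in the software discourse.
```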

In subsequent posts, we will discuss how we extract parameters from commands, accurately determine which app action to execute, and how we pass the correct parameters to that action.  

David Patterson and Vladimir Dobrynin


AI for Voice to Action – Part 2: Machine Learning Algorithms

By Artificial Intelligence, Command Matching, Machine Learning, Natural Language

My last post discussed the important step of automatically generating vast amounts of relevant content relating to commands to which we apply our machine learning algorithms. Here I want to delve into the design of our algorithms.

Given a command, our algorithms need to:

  1.   Understand the meaning and intent behind the command
  2.   Identify and extract parameters from it
  3.   Determine which app action is most appropriate
  4.   Execute the chosen action and pass the relevant parameters to the action

This post and the next one will address point 1. The other points will be covered in subsequent posts.

So how do we understand what a user means based on their command? Commands are typically short (3 or 4 terms), which makes it very difficult to disambiguate among the multiple meanings a term can have. If someone says “search for Boston”, do they want directions to a city or do they want to listen to a rock band on Spotify? In order to disambiguate among all the possibilities we need to know a) whether any of the command terms can have different meanings, b) what those meanings are and, finally, c) which is the correct one based on context.

Semiotics

In order to do this we developed a suite of algorithms which feed off the data we generated previously (See post #3). These algorithms are inspired by semiotics, the study of how meaning is communicated. Semiotics originated as a theory of how we interpret the meaning of signs and symbols. Given a sign in one context, for example a flag with a skull and crossbones on it, you would assign a particular meaning to it (i.e. Pirates).

Pirate Symbol

Whereas, if you change the context to a bottle, the meaning changes completely:

Poison Bottle

Poison – do not drink!

Linguists took these ideas and applied them to language: given a term (e.g. ‘window’), its meaning can change depending on the meanings of the words around it in the sentence (a physical window in a room, a software window, a window of opportunity, etc.). By applying these ideas to our data we can understand the different meanings a term can have based on its context.

Discourse Communities

We also drew inspiration from discourse communities. A discourse community is a group of people involved in and communicating about a particular topic. They tend to use the same language for important concepts within their community (sometimes called jargon), and these terms have a specific, understood and agreed meaning within the community to make communication easier. For example, members of a cycling community have their own set of terms, fairly unique to them, that they all understand and adhere to. If you want to see what I mean, go here and learn the meanings of such terms as an Athena, a Cassette, a Chamois (very important!) and many others. Similarly, motor enthusiasts have their own ‘lingo’: if you want to be able to differentiate your AWS from your ABS and your DDI from your DPF, then get up to speed here.

Our users use apps, so we would also expect to discover gaming discourses, financial discourses, music discourses, social media discourses and so on. Our goal was to develop a suite of machine learning algorithms that could automatically identify these communities through their important jargon terms. By identifying the jargon terms we can build a picture of the relationship between these terms and the other terms used by each discourse community within our data. A characteristic of jargon words is that they have a very narrow meaning within a discourse compared to other terms. For example, the term ‘computer’ is a very general term that can have multiple meanings across many discourses: programming, desktop, laptop, tablet, phone, firmware, networks, etc. ‘Computer’ isn’t a very good example of a jargon term, as it is too general and broad in meaning. We want to identify narrow, specific terms that have a very precise meaning within a single discourse, e.g. a specific type of processor, or a motherboard. Our algorithms do a remarkable job of identifying these jargon terms and are foundational to our ability to extract meaning, precisely understand user commands and thereby the real intent that lies behind them.

In my next post I will go into the details behind the algorithms that enable us to identify these narrow-meaning, community-specific jargon terms and ultimately to build a model that understands the meaning and intent behind user queries.


AI for Voice to Action – Part 1: Data

By Artificial Intelligence, Machine Learning, Voice Search

At Aiqudo, two critical problems we solve in voice control are the action discovery problem and the cognitive load problem.

In my first post I discussed how using technology to overcome the challenges of bringing voice control into the mainstream motivated me to get out of bed in the morning. I get a kick out of seeing someone speaking naturally to their device and smiling when it does exactly what they wanted.

In our second post in the series we discussed how Aiqudo has built the largest (and growing) mobile app action index in the world, and our process for on-boarding actions. On-boarding an action takes only minutes: there is no programming involved and we are not reliant on the app developer to set this up or provide an API. This enables enormous scalability of actions compared to the Amazon and Google approaches, which rely on a programming solution where developers are required to code to these platforms, add specific intents, and go through a painful approval process.

In this post I want to start elaborating on our overall approach and discuss specifically how we create the large amounts of content that our patented machine learning algorithms analyze in order to understand a user’s intent. This is a significant achievement, since even large teams are facing challenges in solving this problem in a generic fashion, as the following quote from Amazon shows.

“The way we’re solving that is that you’ll just speak, and we will find the most relevant skill that can answer your query … The ambiguity in that language, and the incredible number of actions Alexa can take, that’s a super hard AI problem.” – Amazon

At Aiqudo, we have already solved the challenge that Amazon is working on. Our users don’t have to specify which app to use, and we automatically pick the right actions for their command, thereby reducing the cognitive load on the user.

The starting point for generating the content we need is the end of the action on-boarding process, when a few sample commands are added to the action. These training commands kick off the machine learning processes that enable us to:

  1. extract the correct meaning from the natural language command
  2. understand the intent; and
  3. execute the correct action on the best app

The first step in this process is to gather content relating to each on-boarded command (command content). As is typical with machine learning approaches, we are data-hungry: the more data we have, the better our performance. We therefore use numerous data repositories specific to on-boarded commands and apps, and interrogate them to identify related content that can be used to augment the language used in the command.

Content Augmentation for Machine Learning

Content augmentation removes noise and increases the semantic coverage of terms

 

Teaching a machine to correctly understand what a user intends from just a few terms in a command is problematic (as it would be for a human): there isn’t enough context to fully understand the command. For example, is ‘open the window’ a software-related command or a command related to a room? Augmenting the command with additional content adds much more context, allowing the algorithms to better understand meaning and intent. This augmented content forms the basis of a lexicon of terms relating to each on-boarded command. Later, when we apply our machine learning algorithms, this provides the raw data that enables us to build and understand meaning; e.g. we can understand that a movie is similar to a film, that rain is related to weather, that the term ‘window’ has multiple meanings, and so on.

It is equally important that each command’s lexicon is highly relevant to the command and low in noise, so we automatically assess each term within the lexicon to determine its relevance and remove noise. The resulting low-noise lexicon becomes the final lexicon of terms relating to each command. We then generate multiple command documents from the lexicon for each command. Each command document is generated by selecting terms based on the probability of their occurrence within the command’s lexicon: the more likely a term is to occur in the command’s lexicon, the more likely it is to occur in a command document. Note that by doing this we are synthetically creating documents which do not make sense to a human, but which reflect the probabilities of occurrence of terms in the command’s lexicon. It is these synthetically created command documents that we use to train our machine learning algorithms to understand meaning and intent. Because they are synthetically generated, we can also control the number of command documents we create to fine-tune the learning process.
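A heavily simplified sketch of this generation step might look like the following; the lexicon, weights and document length are illustrative assumptions, not values from our pipeline.

```python
import random

def generate_command_documents(lexicon, n_docs=5, doc_len=20, seed=0):
    """Sample synthetic command documents from a command's lexicon, drawing
    each term with probability proportional to its weight in the lexicon."""
    rng = random.Random(seed)
    terms = list(lexicon)
    weights = [lexicon[t] for t in terms]
    return [" ".join(rng.choices(terms, weights=weights, k=doc_len))
            for _ in range(n_docs)]

# Toy lexicon for an on-boarded command such as "check the weather";
# terms and weights are illustrative only.
lexicon = {"weather": 0.25, "rain": 0.15, "forecast": 0.15, "temperature": 0.12,
           "sunny": 0.08, "umbrella": 0.05, "humidity": 0.05, "wind": 0.05,
           "cloudy": 0.05, "storm": 0.05}
for doc in generate_command_documents(lexicon, n_docs=2):
    print(doc)
```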

Once we have carefully created a relevant command lexicon and built a repository of documents relating to each on-boarded command, we are ready to analyze the content, identify topics and subtopics, disambiguate among the different meanings words have, and understand contextual meaning. Our innovative content augmentation approach allows us to quickly deploy updated machine-learned models that can immediately match new command variants, so we don’t have to wait for large numbers of live queries for training, as with other approaches.

The really appealing thing about this approach is that it is language-agnostic: it allows us to support users speaking in any language by interrogating multilingual content repositories. Currently we are live in 12 markets in 7 languages, and we are enabling new languages. We’re proud of this major accomplishment in such a short timeframe.

In my next post in this series, I will say a little more about the machine learning algorithms we have developed that have enabled us to build such a scalable, multi-lingual solution.


Q Actions Platform now powers App Actions in Moto Voice. #HelloMoto

By App Actions, Digital Assistants, Machine Learning, Voice

Our first official day at Aiqudo was in April 2017. One year later, we are excited to announce that our Q Actions platform is now live and powering app actions in Moto Voice. The experience is being rolled out, as we speak, to millions of users on Motorola phones in 7 languages in 12 markets, with more to come. Watch the coverage of the always-on voice capabilities during Motorola’s recent launch event.

Most of the app actions we power are not currently available on other digital assistant platforms: actions in apps like Facebook, WhatsApp, WeChat, Netflix, Spotify, Hulu and Waze, to mention a few. And we just got started…

On supported Motorola phones, you just say “Hello Moto” and issue simple commands – hands free.

Our solution provides high utility to users. You can get things done instantly within your favo(u)rite apps, privately and without having to register credentials. Check out the Voice-to-Action™ experience in the video below:

We’ve addressed several hard technical problems, including:

  • Command matching for simple, intuitive commands in multiple languages: You speak naturally – no need to learn a specific syntax. A single command can provide matching actions from multiple apps, providing user choice.
  • Action execution of personal app actions:  We execute actions in your favo(u)rite apps, including your private actions, without requiring registration or login credentials. We use several techniques for action execution, and can even execute tasks consisting of multiple actions in different apps.
  • Action on boarding operations: We support actions in multiple versions of apps simultaneously – in multiple locales. Our on boarding process takes minutes, does not mandate APIs, coding or developer engagement, enabling rapid scale. Our flexible Machine Learning systems are trained incrementally with simple exemplary commands.

We will be writing more about our contributions in these areas over the next few weeks.

For the most powerful, fully hands free experience, get a new phone with always on Moto Voice, and say “Hello Moto”!

Or, for other Android phones, you can download the Q Actions app from the Play Store.


Open or Walled?

By Artificial Intelligence, Digital Assistants, Machine Learning, Voice

Voice has the promise to be the next disruption, upending massive, established business models and putting search, commerce, messaging, and navigation up for grabs again. But a walled garden mentality could stifle that disruption.

Even over its relatively short history, we see a pattern of behavior on the Internet: some innovator creates a marketplace for consumers, helping to organize information (Yahoo and AOL in their first iterations), commerce (Amazon), or a place to keep in touch with our friends (Facebook), and creates huge consumer value by bringing us together and providing tools that make it easy to navigate, buy, message, etc. But as the audience grows, there is always a slide away from an open marketplace toward a walled garden, with the marketplace operators initially becoming toll takers and then moving toward ever greater control (and monetization) of their users’ experience and, more recently, their data.

Mobile carriers in the US tried to erect walled gardens around their users in the 1.0 version of mobile content — the carriers thought they had captive users and captive vendors, and so created closed platforms that forced their subscribers to buy content from them. Predictably, monopoly providers offered narrow product offerings at high prices and squeezed their vendors so hard that there was no free cash flow for innovation. Mobile content stagnated, as the carriers failed to cultivate fertile ecosystems in which vendors could make money and in which consumers had a growing variety of new and interesting content. When the iPhone came along (thankfully Steve Jobs could wave his magic wand over the guys at AT&T), consumers could finally use their phones to get to the Internet for the content they wanted, and the carriers went back to being dumb pipes.

Will voice platforms become walled gardens?

If you want to enable your users to reach you through Alexa, you have to create a Skill. Then you have to train your users to invoke your Skill using a precise syntax. Likewise with Google Assistant. For Siri, your business has to fit into one of the handful of domains that SiriKit recognizes. There’s a reason we refer to them as voice platforms: their owners are in control.

Initially, there are good QA reasons for this: making sure we get a good user experience. But pretty quickly, the walls become constraints on who can be included in the garden (will Amazon and Facebook play nice together?) and, ultimately, on the tax that must be paid in order to offer services in the garden. For users, this means less openness, fewer choices, and constraints on our ability to quickly and easily do what we want to do, which typically includes using services from all of the different platform providers. (Does Tencent really think that blocking Alipay inside WeChat will make users stop using Alipay?)

The carriers’ experience should be a cautionary tale: walled gardens, with their limited choices and monopolist pricing, are bad for consumers. The Internet is a place of unlimited choice, and the world of mobile apps is vast and diverse, again allowing for broad consumer choice. This is what we expect, and if our horizons are constrained by a platform’s policies, we’ll abandon it. The carriers fumbled Mobile Content 1.0; their walled gardens never met their promise to become massive businesses, and today they don’t even exist.

Voice interfaces should be our gateway to everything we want to do, whether it’s in Alexa, in our mobile apps, or in our connected cars or homes. So will voice platforms be these open gateways that make our lives easier, or will they be cramped walled gardens that try to make our choices for us, funneling us to a narrow selection of preferred vendors?


Announcing Q Actions

By Artificial Intelligence, Digital Assistants, Machine Learning, News, Voice

It’s only been about 3 months since we formally started working @Aiqudo and we’re thrilled to announce the availability of Q Actions (Beta) on the Play Store.

You say it, we do it!

Q Actions allows you to use simple voice commands to instantly execute actions in your favorite Android apps.  Other voice assistants like Alexa or Google Assistant don’t do this!

We’ve solved a few hard problems:

  • Commands in natural language without specific syntax – Unlike systems like Alexa, where you need to invoke a skill by name and use a specific syntax for your command to be recognized, you can use natural commands with Q. You don’t have to mention an app in your command – we automatically figure out the right action for it. In fact, our AI Search (AIS) uses sophisticated Machine Learning algorithms to perform high-quality fuzzy matching of commands across multiple apps in multiple verticals.
  • Action invocation in apps without APIs or developer work –  The Q Platform does not require app developers to expose specific APIs just for voice. We can enable key actions in apps without any APIs or code. You execute actions just as you normally would in the app, with the additional benefit that it is faster, and you don’t need to remember where the function resides deep within the app. Easier and faster.
  • Personal actions without registration or loss of privacy – Other assistant platforms expose only a few personal actions, and even these require the user to register third-party services on the platform. Since Q executes actions in apps directly, we don’t require registration, and you are using apps you already trust for messaging, banking, payments, stocks, etc.
  • Scalable Action on-boarding – We have figured out how to on-board actions within apps directly. We on-board and maintain actions at our end, so neither you nor the app developer has to worry about making the actions available broadly.

All you have to do to get started with Q Actions is say “show my actions” – you’ll see a list of actions already available for your favorite apps out-of-the-box.

Download Q Actions now!

The Aiqudo Team