Category: Command Matching

A Classifier Tuned to Action Commands


One thing we have learned through our journey of building the Q Actions® Voice platform is that there are few things as unpredictable as what users will say to their devices. Utterances range from noise or nonsense queries (those with no obvious intent, such as “this is really great”) to genuine queries such as “when does the next Caltrain leave for San Francisco”. We needed a way to filter out the noise before passing genuine queries to Q Actions. As we thought about this further, we decided to categorize incoming commands into the following four classes:

  • Noise or nonsense commands
  • Action commands that apps are best suited to answer (such as the Caltrain query above)
  • Informational queries, such as “how tall is Tom Cruise”
  • Mathematical queries, such as “what is the square root of 2024”

This classifier would enable us to route each query internally within our platform to provide the best user experience. So we set about building a 4-class classifier for Noise, App, Informational and Math. Since we have the world’s largest mobile Action library, and Action commands are our specialty, it was critical to attain as high a classification accuracy as possible for the App class so that we route as many valid user commands as possible to our proprietary Action execution engine.

We initially considered a number of different approaches when deciding on the best technology for the task. These included convolutional and recurrent networks, a 3-layer Multilayer Perceptron (MLP), and Transformer models such as BERT and ALBERT, plus a Transformer we trained ourselves so we could assess the impact of different hyperparameters (number of heads, depth, etc.). We also experimented with different ways of embedding the query information within the networks, such as word embeddings (Word2vec and GloVe) and sentence embeddings such as USE and NNLM.

We created a number of data sets with which to train and test the different models. Our goal was to identify the best classifier to deploy in production, as determined by its ability to accurately classify the commands in each test set. We used existing valid user commands for our App Action training and test sets. Informational question data was gathered from sources such as Kaggle, Quora and the Stanford QA dataset. Mathematical queries were generated by a program written in-house and taken from https://github.com/deepmind/mathematics_dataset. Noise data was obtained from actual noisy queries in our live traffic from powering Motorola’s Moto Voice Assistant. All of this data was split into training and test sets and used to train and test each of our models. The following table shows the size of each data set.

Dataset         Training set size    Test set size
App             1,794,616            90,598
Noise           71,201               45,778
Informational   128,180              93,900
Math            154,518              22,850

The result of our analysis was that the 3-layer MLP with USE embeddings provided the best overall classification accuracy across all four categories.

The architecture of this classifier is shown in the following schematic. It produces a posterior probability for each class given an input query.


Figure 1: Overview of the model

In effect, the network consists of two components: the embedding layer followed by a 3-layer feed-forward MLP. The first layer consists of N dense units, the second of M dense units (where M < N), and the output layer is a softmax, the standard choice for multi-class classification, which assigns a probability to each class. As can be seen in Figure 1, the “APP” class has the highest probability and would be the model’s prediction for the command ‘Call Bill’.

The embedding layer relies on a TensorFlow Hub module, which has two advantages:

  • we don’t have to worry about text preprocessing
  • we can benefit from transfer learning (utilizing a model pre-trained on a large volume of data, often based on transformer techniques for text classification)

The hub module used is based on the Universal Sentence Encoder (USE), which gives us a rich semantic representation of queries and can also be fine-tuned for our task. USE is much more powerful than word-embedding approaches because it can embed not only words but also phrases and whole sentences. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide diversity of natural language understanding tasks. The output of this embedding layer is a 512-dimensional vector.
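To make the architecture concrete, here is a minimal sketch of such a network in TensorFlow/Keras. The post only specifies a USE hub module feeding a two-hidden-layer MLP (N=500 and M=100 units, as given below) with a softmax output; the specific hub handle, optimizer and other settings here are illustrative assumptions rather than our production configuration.

```python
# Minimal sketch of the USE-embedding + 3-layer MLP classifier (illustrative
# settings only; the hub handle and optimizer are assumptions, not the
# production configuration).
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 4  # App, Informational, Math, Noise

model = tf.keras.Sequential([
    # Universal Sentence Encoder: maps a raw query string to a 512-d vector.
    hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                   input_shape=[], dtype=tf.string, trainable=True),
    tf.keras.layers.Dense(500, activation="relu"),              # N dense units
    tf.keras.layers.Dense(100, activation="relu"),              # M dense units (M < N)
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # class probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# e.g. model.predict(tf.constant(["call Bill"])) returns one probability per class.
```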

We expect similar sentences to have similar embeddings, as shown in the following heatmap, where the more similar two sentences are, the darker the color. Similarity is based on the cosine similarity of the embedding vectors. We demonstrate the strong similarity between two APP commands (‘view my profile’, ‘view my Facebook profile’), two INFORMATIONAL queries (‘What is Barack Obama’s age’, ‘How old is Barack Obama’) and two MATH queries (‘calculate 2+2’, ‘add 2+2’).


Figure 2: Semantic similarity
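As a rough illustration (not the exact code behind Figure 2), a similarity matrix like this can be reproduced by embedding the example sentences with USE and taking pairwise cosine similarities; the hub handle is again an assumption.

```python
# Sketch: pairwise cosine similarity of USE embeddings for the example
# sentences from the post.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "view my profile", "view my Facebook profile",             # APP
    "What is Barack Obama's age", "How old is Barack Obama",   # INFORMATIONAL
    "calculate 2+2", "add 2+2",                                # MATH
]

vectors = embed(sentences).numpy()                          # shape (6, 512)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit-normalize
similarity = vectors @ vectors.T                            # cosine similarity matrix
print(np.round(similarity, 2))
```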

The MLP’s two hidden layers consist of N=500 and M=100 units. If a model has more hidden units (a higher-dimensional representation space) and/or more layers, the network can learn more complex representations. However, this makes the network more computationally expensive and can lead it to learn unwanted patterns: patterns that improve performance only on the training data (overfitting) while degrading generalization (poorer performance on the test data). This is why it is important to choose the MLP settings based on performance across a range of unseen test sets.
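One common safeguard against this kind of overfitting is to hold out part of the training data and stop training once validation loss stops improving. The sketch below continues the model definition above; `train_queries` and `train_labels` are hypothetical arrays, and the split, patience, epochs and batch size are illustrative rather than our production settings.

```python
# Illustrative training loop for the model sketched earlier: hold out part of
# the training data and stop once validation loss stops improving, so the
# network does not keep fitting patterns specific to the training set.
# `train_queries` (query strings) and `train_labels` (integer class labels)
# are hypothetical arrays.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

history = model.fit(train_queries, train_labels,
                    validation_split=0.1,
                    epochs=30,
                    batch_size=256,
                    callbacks=[early_stop])
```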

In terms of overall performance, our model gives us an accuracy of 98.8% for App, 86.9% for Informational, 83.5% for Mathematical and 52.3% for Noise. From this it can be seen that we achieved our goal of correctly classifying almost all App Action commands. Informational and Mathematical commands were also classified with a high degree of accuracy, while Noise was the worst-performing class. The reason Noise performs poorest is that noise is very difficult to define: it ranges from grammatically correct sentences with no relevance to the other three categories (such as “the weather is hot today”) to complete random nonsense, which makes it very hard to build a representative training set in advance. We are still working on this aspect of our classifier and plan to improve its performance on this category through better training data.

Niall Rooney and David Patterson


Q Actions – Complex tasks through Compound Commands


In many cases, a single action does the job.

Say it. Do it!

Often, however, a task requires multiple actions to be performed across multiple independent apps. On the go, you just want things done quickly and efficiently without having to worry about which actions to run and which apps need to be in the mix.

Compound commands allow you to do just that – just say what you want to do, naturally, and, assuming it makes sense and you have access to the relevant apps, the right actions are magically executed. It’s not that complicated – just say “navigate to the tech museum and call Kevin”, firing off Maps and WhatsApp in the process. Driving, and in a hurry to catch the train? Just say “navigate to the Caltrain station and buy a train ticket”, launching Maps and the Caltrain app in sequence. Did you just hear the announcement that your plane is ready to board? Say “show my boarding pass and tell Susan I’m boarding now” (American, United, Delta, … paired with WhatsApp, Messenger, …) and you’re ready to get on the flight home – one, two … do!

Compound commands are … complex magic to get things done … simply!


AI for Voice to Action – Part 3: The importance of Jargon to understanding User Intent


In my last post I discussed how semiotics and observing how discourse communities interact had influenced the design of our machine learning algorithms. I also emphasized the importance of discovering jargon words as part of our process of understanding user commands and intents.

In this post, we describe in more depth how the “theory” behind our algorithms actually works. We also discussed what constitutes a good jargon word: “computer” is a poor example because it is too broad in meaning, whereas a term relating to a computer chip, e.g. “Threadripper” (a gaming processor from AMD), is a better example because it is more specific in meaning and is used in fewer contexts.

Jargon terms and Entropy

So – how do we identify good jargon terms and what do we do with them in order to understand user commands?

To do this we use entropy. In general, entropy is a measure of chaos or disorder; in an information-theory context, it can be used to determine how much information is conveyed by a term. Because jargon words have a very narrow and specific meaning within particular discourse communities, they have lower entropy (and hence higher information value) than broader, more general terms.

To determine entropy, we take each term in our synthetic documents (see this post for more information on how we create this data set) and build a probability profile of its co-occurring terms. The diagram below shows an example (partial) probability distribution for the term ‘computer’.


Figure 1: Entropy – probability distributions for jargon terms

These co-occurring terms can be thought of as the context for each potential jargon word. We then use this probability profile to determine the entropy of the word. If that entropy is low then we consider it to be a candidate jargon word.
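A simplified sketch of this test is shown below. The tokenization, function names and entropy threshold are illustrative; the production pipeline is considerably more involved.

```python
# Sketch: build a term's co-occurrence distribution over the synthetic
# documents and compute its Shannon entropy; low entropy marks a jargon
# candidate. The threshold below is illustrative, not a production value.
import math
from collections import Counter

def cooccurrence_distribution(term, documents):
    """Probability distribution of terms co-occurring with `term`.
    `documents` is an iterable of token lists."""
    counts = Counter()
    for doc in documents:
        if term in doc:
            counts.update(w for w in doc if w != term)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def entropy(distribution):
    """Shannon entropy H = -sum(p * log2 p) of a co-occurrence profile."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

ENTROPY_THRESHOLD = 5.0  # illustrative cut-off

def is_jargon_candidate(term, documents):
    # Narrow, predictable contexts -> low entropy -> jargon candidate.
    return entropy(cooccurrence_distribution(term, documents)) < ENTROPY_THRESHOLD
```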

Having identified the low-entropy jargon words in our synthetic command documents, we then use their probability distributions as attractors for the documents themselves. In this way (as seen in the diagram below) we create a set of document clusters in which each cluster relates semantically to a jargon term; a simplified sketch of this assignment step follows Figure 2. (Note: in the interest of clarity, the clusters in the figure below are labeled by high-level topic rather than by the jargon words themselves.)


Figure 2: Using jargon words as attractors to form clusters
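The post does not spell out the attractor mechanics, but one natural reading, sketched below, is to assign each document to the jargon term whose co-occurrence distribution best explains the document's own term distribution (lowest cross-entropy). The function names and smoothing constant are illustrative, not necessarily the production method.

```python
# Sketch (our reading, not necessarily the exact production method): cluster
# documents by assigning each one to the jargon-term attractor whose
# co-occurrence distribution gives it the lowest cross-entropy.
import math
from collections import Counter

def doc_distribution(doc):
    """Term probability distribution of a single tokenized document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cross_entropy(doc_dist, attractor_dist, eps=1e-9):
    # H(doc, attractor) = -sum_w p_doc(w) * log p_attractor(w);
    # eps smooths terms the attractor has never seen.
    return -sum(p * math.log(attractor_dist.get(w, eps)) for w, p in doc_dist.items())

def cluster_documents(documents, attractors):
    """`attractors` maps each jargon term to its co-occurrence distribution."""
    clusters = {term: [] for term in attractors}
    for doc in documents:
        dist = doc_distribution(doc)
        best = min(attractors, key=lambda term: cross_entropy(dist, attractors[term]))
        clusters[best].append(doc)
    return clusters
```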

We then build a graph within each cluster that connects documents based on how similar they are in meaning, and identify ‘neighborhoods’ within these graphs that correspond to areas of intense similarity; a simplified sketch follows Figure 3. For example, a cluster may be about “cardiovascular fitness”, whereas a neighborhood may be more specifically about “high-intensity training”, “rowing”, “cycling”, etc.


Figure 3: Neighborhoods for the cluster “cardiovascular fitness”
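As a simplified sketch of the neighborhood step: connect documents whose vectors are sufficiently similar and treat connected components of the resulting graph as candidate neighborhoods. The similarity measure, the threshold and the use of connected components are illustrative choices on our part, not a description of the exact production algorithm.

```python
# Sketch: within one cluster, connect documents whose (unit-normalized)
# vectors are sufficiently similar, then treat connected components of the
# resulting graph as candidate "neighborhoods".
import networkx as nx  # assumes networkx is available

def similarity_graph(doc_vectors, threshold=0.6):
    """`doc_vectors`: list of (doc_id, unit-normalized numpy vector) pairs."""
    graph = nx.Graph()
    graph.add_nodes_from(doc_id for doc_id, _ in doc_vectors)
    for i, (id_a, vec_a) in enumerate(doc_vectors):
        for id_b, vec_b in doc_vectors[i + 1:]:
            if float(vec_a @ vec_b) >= threshold:  # cosine similarity of unit vectors
                graph.add_edge(id_a, id_b)
    return graph

def neighborhoods(graph):
    """Connected components of the similarity graph, i.e. areas of intense similarity."""
    return [set(component) for component in nx.connected_components(graph)]
```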

These neighborhoods can be thought of as sub-topics within the overall cluster topic. Within each sub-topic we can then extract important meaning-based phrases that precisely describe what that neighborhood is about, e.g. “HIIT”, “anaerobic high-intensity period”, “cardio session”, etc.


Figure 4: Meaning based phrases for the “high intensity training” sub-topic

In this way we create meaning-based structure from completely unstructured content. Documents from the same cluster relate to the same discourse community, and documents from the same cluster that share similar important terms or phrases can be regarded as relating to the same sub-topic. If two clusters share a large number of important phrases, this represents a dialogue between two discourse communities; if multiple important phrases are shared among many clusters, this represents a dialogue among multiple communities.

So, having described a little about the algorithms themselves, how do they help us understand the correct meaning behind a user’s command? Given this contextual partitioning of the data into discourses based on jargon terms, we can disambiguate among the many different meanings a term can have. For example, if the user says ‘open the window’, we understand that there are relevant meanings (discourses) relating to both buildings and software, whereas if the user says ‘minimize the window’, we understand that this can only have a software meaning and context. Fully understanding the nuances behind a user’s command is, of course, much more complicated than what I have just described, but the goal here is to give a high-level overview of the approach.
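As a toy illustration of this kind of disambiguation (the scoring, smoothing and margin are all invented for the example; the real system is considerably richer):

```python
# Toy sketch: score a command against each discourse cluster's term
# distribution and keep only the discourses that plausibly explain it.
import math

def command_log_likelihood(tokens, cluster_dist, eps=1e-9):
    """Log-likelihood of the command tokens under one discourse's term distribution."""
    return sum(math.log(cluster_dist.get(tok, eps)) for tok in tokens)

def plausible_discourses(command, cluster_dists, margin=5.0):
    """Return every discourse whose score is within `margin` of the best one."""
    tokens = command.lower().split()
    scores = {name: command_log_likelihood(tokens, dist)
              for name, dist in cluster_dists.items()}
    best = max(scores.values())
    return [name for name, score in scores.items() if best - score <= margin]

# "open the window" might remain ambiguous between a "buildings" and a
# "software" discourse, while "minimize the window" should survive only
# for the "software" discourse.
```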

In subsequent posts, we will discuss how we extract parameters from commands, accurately determine which app action to execute, and how we pass the correct parameters to that action.  

David Patterson and Vladimir Dobrynin


AI for Voice to Action – Part 2: Machine Learning Algorithms


My last post discussed the important step of automatically generating vast amounts of relevant content relating to commands to which we apply our machine learning algorithms. Here I want to delve into the design of our algorithms.

Given a command, our algorithms need to:

  1.   Understand the meaning and intent behind the command
  2.   Identify and extract parameters from it
  3.   Determine which app action is most appropriate
  4.   Execute the chosen action and pass the relevant parameters to the action

This post and the next one will address point 1. The other points will be covered in subsequent posts.

So how do we understand what a user means from their command? Commands are typically short (3 or 4 terms), which makes it very difficult to disambiguate among the multiple meanings a term can have. If someone says “search for Boston”, do they want directions to a city or do they want to listen to a rock band on Spotify? In order to disambiguate among all the possibilities we need to know a) whether any of the command terms can have different meanings, b) what those meanings are, and finally c) which is the correct one based on context.

Semiotics

In order to do this we developed a suite of algorithms which feed off the data we generated previously (See post #3). These algorithms are inspired by semiotics, the study of how meaning is communicated. Semiotics originated as a theory of how we interpret the meaning of signs and symbols. Given a sign in one context, for example a flag with a skull and crossbones on it, you would assign a particular meaning to it (i.e. Pirates).

Pirate Symbol

Whereas if you change the context to a bottle, the meaning changes completely:

Poison Bottle

Poison – do not drink!

Linguists took these ideas and applied them to language: given a term (e.g. ‘window’), its meaning can change depending on the meanings of the words around it in the sentence (a physical window in a room, a software window, a window of opportunity, etc.). By applying these ideas to our data we can understand the different meanings a term can have based on its context.

Discourse Communities

We also drew inspiration from discourse communities. A discourse community is a group of people involved in and communicating about a particular topic. They tend to use the same language for important concepts (sometimes called jargon) within their community, and these terms have a specific, understood and agreed meaning within the community to make communication easier. For example members of a cycling community have their own set of terms that is fairly unique to them that they all understand and adhere to. If you want to see what I mean, go here and learn the meanings of such terms as an Athena, a Cassette, a Chamois (very important!) and many other terms. Similarly motor enthusiasts will have their own ‘lingo’. If you want to be able to differentiate your AWS from your ABS and your DDI from your DPF then get up to speed here.

Our users use apps, so we would expect to discover gaming discourses, financial discourses, music discourses, social media discourses and so on. Our goal was to develop a suite of machine learning algorithms that could automatically identify these communities through their important jargon terms. By identifying the jargon terms we can build a picture of the relationship between these terms and the other terms used by each discourse community within our data. A characteristic of jargon words is that they have a very narrow meaning within a discourse compared to other terms. The term ‘computer’, for example, is a very general term that can have multiple meanings across many discourses (programming, desktop, laptop, tablet, phone, firmware, networks, etc.), so it isn’t a good example of a jargon term: it is too general and broad in meaning. We want to identify narrow, specific terms that have a very precise meaning within a single discourse, e.g. a specific type of processor or motherboard. Our algorithms do a remarkable job of identifying these jargon terms, and they are foundational to our ability to extract meaning, precisely understand user commands and thereby the real intent that lies behind them.

In my next post I will go into the details behind the algorithms that enable us to identify these narrow-meaning, community-specific jargon terms and ultimately to build a model that understands the meaning and intent behind user queries.