One thing we have learned through our journey of building the Q Actions® Voice platform is that few things are as unpredictable as what users will say to their devices. Utterances range from noise or nonsense queries (those with no obvious intent, such as “this is really great”) to genuine queries such as “when does the next Caltrain leave for San Francisco”. We needed a way to filter the noise before passing genuine queries to Q Actions. As we thought about this further, we decided to categorize incoming commands into the following 4 classes:
- Noise or nonsense commands
- Action Commands that Apps were best suited to answer (such as the Caltrain query above)
- Queries that were informational in nature, such as “how tall is Tom Cruise”
- Mathematical queries – “what is the square root of 2024”.
This classifier would enable us to route each query internally within our platform to provide the best user experience. So we set about building a 4-class classifier for Noise, App, Informational & Math. Since we have the world’s largest mobile Action library, and Action commands are our specialty, it was critical to attain as high a classification accuracy as possible for the App type so that we could route as many valid user commands as possible to our proprietary Action execution engine.
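The routing step this classifier enables can be sketched as a simple dispatch on the predicted class. The handler names below are purely illustrative stand-ins, not the platform’s actual internals:

```python
# Hypothetical dispatch of a classified query to an internal handler.
# Handler names are illustrative; the real routing logic is proprietary.

def route_query(query: str, predicted_class: str) -> str:
    handlers = {
        "APP": lambda q: f"Action engine handles: {q}",
        "INFORMATIONAL": lambda q: f"QA service handles: {q}",
        "MATH": lambda q: f"Math evaluator handles: {q}",
        "NOISE": lambda q: "Query filtered as noise",
    }
    return handlers[predicted_class](query)

print(route_query("when does the next Caltrain leave for San Francisco", "APP"))
```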
We initially considered a number of different approaches when deciding on the best technology for this task. These included convolutional and recurrent Multilayer Perceptrons (MLPs), a 3-layer MLP, and Transformer models such as BERT and ALBERT, plus one we trained ourselves to allow us to assess the impact of different hyperparameters (number of heads, depth, etc.). We also experimented with different ways to embed the query information within the networks, such as word embeddings (Word2Vec and GloVe) and sentence embeddings such as USE and NNLM.
We created a number of data sets with which to train and test the different models. Our goal was to identify the best classifier to deploy in production, as determined by its ability to accurately classify the commands in each test set. We used existing valid user commands for our App Action training and test data sets. Question datasets were gathered from sources such as Kaggle, Quora and Stanford QA. Mathematical queries were generated using an in-house program and from https://github.com/deepmind/mathematics_dataset. Noise data was obtained from actual noisy queries in our live traffic from powering Motorola’s Moto Voice Assistant. All this data was split into training and test sets and used to train and test each of our models. The following table shows the size of each data set.
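A minimal sketch of the train/test split, assuming a simple random shuffle (the post does not state the actual split ratio or tooling, so the 80/20 fraction here is an assumption):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle and split a labeled dataset into train and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled commands standing in for the real data sets.
commands = [("call Bill", "APP"),
            ("how tall is Tom Cruise", "INFORMATIONAL"),
            ("what is the square root of 2024", "MATH"),
            ("this is really great", "NOISE")] * 25

train, test = train_test_split(commands)
print(len(train), len(test))  # 80 20
```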
| Dataset | Training set size | Test set size |
| --- | --- | --- |
The result of our analysis was that the 3-layer MLP with USE embedding provided us with the best overall classification accuracy across all 4 categories.
The architecture of this classifier is shown in the following schematic. It gives a posterior probabilistic classification for an input query.
Figure 1 Overview of the model
In effect, the network consists of two components: the embedding layer followed by a 3-layer feed-forward MLP. The first layer consists of N dense units and the second of M dense units (where M < N), and the output is a softmax function, which is typically used for multi-class classification and assigns a probability to each class. As can be seen from Figure 1, the “APP” class has the highest probability and would be the model’s prediction for the command ‘Call Bill’.
The embedding layer relies on a Tensorflow hub module, which has two advantages:
- we don’t have to worry about text preprocessing
- we can benefit from transfer learning (utilizing a model pre-trained on a large volume of data, often based on Transformer techniques for text classification)
The hub module used is based on the Universal Sentence Encoder (USE), which gives us a rich semantic representation of queries and can also be fine-tuned for our task. USE is much more powerful than word-embedding approaches as it can embed not only words but also phrases and sentences. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide diversity of natural language understanding tasks. The output from this embedding layer is a 512-dimensional vector.
We expect similar sentences to have similar embeddings, as shown in the following heatmap, where the more similar two sentences are, the darker the color. Similarity is based on the cosine similarity of the embedding vectors. The heatmap demonstrates the strong similarity between two APP commands (‘view my profile’, ‘view my Facebook profile’), two INFORMATIONAL queries (‘What is Barack Obama’s age’, ‘How old is Barack Obama’) and two MATH queries (‘calculate 2+2’, ‘add 2+2’).
Figure 2 Semantic similarity
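Cosine similarity between embedding vectors can be computed as below. The vectors here are random toy stand-ins for real 512-dimensional USE embeddings, constructed so that one pair is deliberately close:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 512-dim USE embeddings of three queries.
rng = np.random.default_rng(0)
view_profile = rng.normal(size=512)                           # 'view my profile'
view_fb_profile = view_profile + 0.1 * rng.normal(size=512)   # near-duplicate query
add_two = rng.normal(size=512)                                # unrelated query

print(cosine_similarity(view_profile, view_fb_profile))  # close to 1
print(cosine_similarity(view_profile, add_two))          # close to 0
```

With real USE embeddings, paraphrases like ‘What is Barack Obama’s age’ and ‘How old is Barack Obama’ would similarly score near 1.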
The MLP’s two hidden layers consist of N=500 and M=100 units. If a model has more hidden units (a higher-dimensional representation space) and/or more layers, the network can learn more complex representations. However, this makes the network more computationally expensive and may lead it to learn unwanted patterns: patterns that improve performance only on the training data (overfitting) but degrade generalization (poorer performance on the test data). This is why it is important to choose the MLP settings based on performance across a range of unseen test sets.
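The forward pass with the stated layer sizes (N=500, M=100) can be sketched as follows. The weights are randomly initialized purely to illustrate the shapes (real weights are learned in training), and ReLU hidden activations are an assumption since the post does not specify them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Randomly initialized weights, for shape illustration only.
W1, b1 = rng.normal(scale=0.05, size=(512, 500)), np.zeros(500)
W2, b2 = rng.normal(scale=0.05, size=(500, 100)), np.zeros(100)
W3, b3 = rng.normal(scale=0.05, size=(100, 4)), np.zeros(4)
CLASSES = ["APP", "INFORMATIONAL", "MATH", "NOISE"]

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def forward(embedding):
    """512-dim USE embedding -> posterior probability per class."""
    h1 = np.maximum(0, embedding @ W1 + b1)  # first hidden layer, N=500 units
    h2 = np.maximum(0, h1 @ W2 + b2)         # second hidden layer, M=100 units
    return softmax(h2 @ W3 + b3)             # softmax over the 4 classes

probs = forward(rng.normal(size=512))
print(dict(zip(CLASSES, probs.round(3))))
```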
In terms of overall performance, our model gives us an accuracy of 98.8% for App, 86.9% for Informational, 83.5% for Mathematical and 52.3% for Noise. From this it can be seen that we achieved our goal of correctly classifying almost all App Action commands. Informational and Mathematical commands were also classified with a high degree of accuracy, while Noise was the worst-performing class. The reason Noise performed poorest is that noise is very difficult to define: it can range from grammatically correct sentences with no relevance to the other 3 categories (such as “the weather is hot today”) to complete random nonsense. This makes it very hard to anticipate in advance and to build a good training set for. We are still working on this aspect of our classifier and plan to improve its performance on this category through improved training data.
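Per-class accuracy of the kind reported above can be computed from paired labels and predictions; the labels below are a toy example, not our evaluation data:

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of each class's examples that were predicted correctly."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {c: correct[c] / totals[c] for c in totals}

# Toy labels and predictions, purely for illustration.
y_true = ["APP", "APP", "MATH", "NOISE"]
y_pred = ["APP", "APP", "MATH", "APP"]
print(per_class_accuracy(y_true, y_pred))  # {'APP': 1.0, 'MATH': 1.0, 'NOISE': 0.0}
```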
Niall Rooney and David Patterson