Aiqudo provides the most extensive library of voice-triggered actions for mobile apps and other IoT devices. In the current Covid-19 era, voice is becoming essential as more organizations see the need for contactless interactions. To further improve the performance of Aiqudo voice, we enhanced our unique Semiotics-based Intent Matching with Deep Learning (DL) for custom Named Entity Recognition (NER) and Part-of-Speech (POS) tagging.
The task in question was to recognize the relevant named entities in users' commands, a task known as Named Entity Recognition (NER) in the Natural Language Processing (NLP) community. For example, 'play Adele on Youtube' contains two named entities, 'Adele' and 'Youtube'. Extracting both entities correctly is critical for understanding the user's intent, retrieving the right app and executing the correct action. Publicly available NER tools, such as NLTK, spaCy and Stanford NLP, proved unsuitable for our purposes for the following reasons:
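To make the distinction concrete, here is a minimal sketch of the difference between generic off-the-shelf labels and the domain-specific labels our platform needs. The lookup-based `extract_entities` function is purely illustrative; the real system uses the DL model described below, and the generic labels shown are typical off-the-shelf output, not output from any specific tool.

```python
# Generic labels a typical off-the-shelf NER tool produces (illustrative):
GENERIC = {"Adele": "Person", "Youtube": "Organization"}
# Domain-specific labels the platform actually needs for this command:
DOMAIN = {"Adele": "$musicQuery", "Youtube": "$mobileApp"}

def extract_entities(command: str, vocab: dict) -> list:
    """Toy lookup-based extractor: tags any known token in the command."""
    return [(tok, vocab[tok]) for tok in command.split() if tok in vocab]

print(extract_entities("play Adele on Youtube", DOMAIN))
# [('Adele', '$musicQuery'), ('Youtube', '$mobileApp')]
```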
- they often make mistakes, especially when processing the short sentences typical of user commands
- they assign generic labels such as 'Organization' for 'Youtube' and 'Person' for 'Adele', rather than the entity types we need in this command context: 'App' and 'Artist'
- these tools don't provide the granularity we need. Because we support a very broad set of verticals, or domains, our granularity requirements for parameter types are high: we need to identify almost 70 different parameter types in total (and this number continues to grow). It's not enough for us to identify a parameter as an 'Organization'; we need to know whether it is a 'Restaurant', a 'Business' or a 'Stock ticker'
Part-of-Speech (POS) tagging is another essential input for both NER and action retrieval, but, again, public POS taggers such as NLTK, spaCy and Stanford NLP don't work well on short commands. The situation is worst for words such as 'show', 'book', 'email' and 'text', which act as verbs in commands but are usually tagged as nouns by most existing POS taggers. We therefore needed to develop our own custom NER module that also facilitates and produces more accurate POS information.
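The point about verb/noun ambiguity can be sketched with a toy rule: in an imperative command, a word like 'text' in first position is almost always a verb, even though generic taggers prefer the noun reading. This is a deliberately naive illustration of the tagging problem, not our production tagger.

```python
# Words that are ambiguous noun/verb in general text but usually verbs
# when they open a spoken command (illustrative list, not exhaustive).
COMMAND_VERBS = {"show", "book", "email", "text", "play", "navigate"}

def pos_tag_command(tokens):
    """Toy command-aware tagger: force a verb reading in first position."""
    tags = []
    for i, tok in enumerate(tokens):
        if i == 0 and tok.lower() in COMMAND_VERBS:
            tags.append((tok, "VB"))   # imperative verb in command position
        else:
            tags.append((tok, "NN"))   # naive noun default; a real tagger does better
    return tags

print(pos_tag_command(["text", "Rodrigo", "good", "morning"]))
# [('text', 'VB'), ('Rodrigo', 'NN'), ('good', 'NN'), ('morning', 'NN')]
```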
Fortunately, we already had a database of 13K+ commands relating to actions on our platform, and this provided the training data to build an integrated DL model. Example commands (with parameters extracted) in our database included 'play $musicQuery on $mobileApp', 'Show my $shoppingList' and 'Navigate from $fromLocation to $toLocation'. (Our named entity types start with '$'.) For each entity, we created a number of realistic values, such as 'grocery list' and 'DIY list' for '$shoppingList', and 'New York' and 'Atlanta' for '$fromLocation'. We created around 3.7 million instantiated queries, e.g., 'play Adele on Youtube', 'Show my DIY list' and 'Navigate from New York to Atlanta'. We then used existing POS tools to label all words, chose the most frequently assigned POS pattern for each template, and finally labelled each relevant query accordingly.
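The instantiation step described above can be sketched as a cross-product over slot values. The templates and value lists below are tiny illustrative stand-ins for the real 13K-template, 3.7M-query corpus.

```python
import itertools

# Illustrative subset: command templates with '$'-prefixed entity slots,
# and a few realistic values per slot (not the actual Aiqudo corpus).
TEMPLATES = [
    "play $musicQuery on $mobileApp",
    "navigate from $fromLocation to $toLocation",
]
VALUES = {
    "$musicQuery": ["Adele", "jazz"],
    "$mobileApp": ["Youtube", "Spotify"],
    "$fromLocation": ["New York", "Boston"],
    "$toLocation": ["Atlanta", "Chicago"],
}

def instantiate(template):
    """Yield every command produced by filling the template's slots."""
    slots = [t for t in template.split() if t.startswith("$")]
    for combo in itertools.product(*(VALUES[s] for s in slots)):
        cmd = template
        for slot, val in zip(slots, combo):
            cmd = cmd.replace(slot, val, 1)
        yield cmd

queries = [q for t in TEMPLATES for q in instantiate(t)]
print(queries[0])      # 'play Adele on Youtube'
print(len(queries))    # 8  (2x2 values per template, 2 templates)
```

Because slot positions are known at generation time, every instantiated query comes with token-level entity labels for free, which is what makes supervised NER training possible at this scale.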
To make the data understandable to a neural network, we then needed to represent each word or token digitally, i.e. as a vector of fixed dimension. This is called word embedding. We tried several embedding methods, including a Transformer tokenizer, ELMo, Google 300d, GloVe, and random embeddings of various dimensions. A pre-trained Transformer produced the best results but required the most expensive compute resources (a GPU). ELMo produced the second-best results but also needed a GPU for acceptable computing time. Random embeddings of 64 dimensions ran well on a CPU and produced results comparable to ELMo at far lower cost. Such tradeoffs are critical when you go from a theoretical AI approach to rolling AI techniques into production at scale.
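A random-embedding lookup is simple enough to sketch directly: each vocabulary token indexes a fixed 64-dimensional random vector, which the downstream network can then tune. The vocabulary and initialization scale here are assumptions for illustration.

```python
import numpy as np

# 64-dimensional random embeddings: one fixed random vector per vocab token.
rng = np.random.default_rng(0)
vocab = {"<pad>": 0, "<unk>": 1, "play": 2, "adele": 3, "on": 4, "youtube": 5}
EMBED_DIM = 64
embeddings = rng.normal(0.0, 0.1, size=(len(vocab), EMBED_DIM))

def embed(tokens):
    """Map tokens to their embedding vectors; unknown words fall back to <unk>."""
    ids = [vocab.get(t.lower(), vocab["<unk>"]) for t in tokens]
    return embeddings[ids]          # shape: (sequence_length, EMBED_DIM)

vecs = embed("play Adele on Youtube".split())
print(vecs.shape)  # (4, 64)
```

No pre-trained weights, no GPU: the lookup is a single CPU-friendly array index, which is exactly the tradeoff favoring this approach in production.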
Our research and experiments were based on a state-of-the-art DL NER architecture: a residual bidirectional LSTM. We integrated two related tasks: POS tagging and multi-label, multi-class classification of potential entity types. Our present solution is therefore a multi-input, multi-output DL model. The neural architecture and data flow are illustrated in Fig. 1. The input module takes the user's speech and transforms it into text; the embedding layer represents the text as a sequence of vectors; the two bidirectional layers capture important recurrent patterns in the sequence; the residual connection restores some lost features; these patterns and features are then used to label named entities and create POS tags, or are flattened to make a global classification of entity (parameter) types.
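The data flow in Fig. 1 can be followed at the level of tensor shapes. The sketch below substitutes random projections for the trained Bi-LSTM layers (an LSTM's recurrence is not reproduced here), so only the shapes and the branching structure are meaningful; all dimensions are illustrative.

```python
import numpy as np

# Shape-level sketch of the multitask data flow (dimensions illustrative).
rng = np.random.default_rng(0)
T, D, H = 6, 64, 128                 # sequence length, embedding dim, hidden size
N_TAGS, N_POS, N_TYPES = 70, 20, 70  # entity labels, POS tags, global entity types

x = rng.normal(size=(T, D))                    # embedded command
W1 = rng.normal(size=(D, 2 * H)) * 0.01        # stand-in for bidirectional layer 1
W2 = rng.normal(size=(2 * H, 2 * H)) * 0.01    # stand-in for bidirectional layer 2
h1 = np.tanh(x @ W1)
h2 = np.tanh(h1 @ W2) + h1                     # residual connection restores features

# Per-token heads: named-entity labels and POS tags for each position.
ner_logits = h2 @ rng.normal(size=(2 * H, N_TAGS))
pos_logits = h2 @ rng.normal(size=(2 * H, N_POS))
# Flattened branch: one global multi-label prediction over entity types.
type_logits = h2.reshape(-1) @ rng.normal(size=(T * 2 * H, N_TYPES))

print(ner_logits.shape, pos_logits.shape, type_logits.shape)
# (6, 70) (6, 20) (70,)
```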
Fig. 1 Neural architecture for Aiqudo Multitask Flow
One real-life scenario would be as follows: a user wants to greet his friend Rodrigo on WhatsApp, so he speaks the command 'Whatsapp text Rodrigo good morning' to his phone (not a well-formed command, but this is common in practice). Each word in his speech is mapped to a token integer, which indexes a 64-dimensional vector; this sequence of vectors passes through the two bidirectional LSTM layers and the residual connection layer; one branch of the network outputs parameter-value pairs and POS tags over the token sequence, while another branch is flattened and outputs parameter types. Our platform now has all the information needed to pass on to the next Natural Language Understanding (NLU) component in our system (see Fig. 2), to fully understand the user's intent and execute the correct action for them.
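For this scenario, the structured result handed to the NLU component might look roughly like the record below. The field names and the `$contactName`/`$messageBody` parameter types are hypothetical stand-ins, not the production schema.

```python
# Hypothetical shape of the model output for one command (illustrative schema).
command = "Whatsapp text Rodrigo good morning"
model_output = {
    "tokens": command.split(),
    "entities": {                       # parameter-value pairs from the token branch
        "$mobileApp": "Whatsapp",
        "$contactName": "Rodrigo",      # hypothetical parameter type
        "$messageBody": "good morning", # hypothetical parameter type
    },
    "pos_tags": ["NNP", "VB", "NNP", "JJ", "NN"],  # one tag per token
    "entity_types": ["$mobileApp", "$contactName", "$messageBody"],  # flattened branch
}
assert len(model_output["pos_tags"]) == len(model_output["tokens"])
print(model_output["entities"]["$mobileApp"])  # Whatsapp
```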
Fig. 2 Aiqudo Online Intent Pipeline
Before we could go live in production, we needed to test the performance of the pipeline thoroughly. We devised 600K test scenarios spanning 114 parameter distributions and covering command lengths from very short 2-term commands to much longer 15-term commands. We also focused on out-of-vocabulary parameter terms (terms that do not occur in the training data, such as names of cities and movies) to ensure that the model could handle these as well.
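Entity recall, the metric cited below, measures the fraction of gold-standard entities the model recovers. A minimal sketch of how it can be computed over (value, type) pairs, with made-up gold and predicted sets:

```python
def entity_recall(gold, predicted):
    """Fraction of gold (value, type) entities recovered by the model."""
    gold, predicted = set(gold), set(predicted)
    return len(gold & predicted) / len(gold) if gold else 1.0

# Illustrative example: the model recovers one of two gold entities.
gold = [("Adele", "$musicQuery"), ("Youtube", "$mobileApp")]
pred = [("Adele", "$musicQuery")]
print(entity_recall(gold, pred))  # 0.5
```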
Analysis of this approach within the Aiqudo platform showed a clear improvement in performance: the overall entity recall ratio increased by over 10%. This integrated multitask model fits Aiqudo's requirements particularly well:
- The model was trained on our own corpus and produces entities and POS tags compatible with our on-boarded mobile app commands
- The three related tasks share most hidden layers, so better weight optimization can be achieved very efficiently
- The system can be easily adapted to newly on-boarded actions by expanding or adjusting the training corpus and/or annotating tags
- The random embedding model runs fast enough even on CPUs and produces much better results than publicly available NLP tools
We plan to continue to use DL where appropriate within our platform to complement and augment our existing Semiotics-based NLU engine. Possible future work includes:
- extending the solution to other languages (our system has commands on-boarded in several languages to use for training)
- the POS tagging information and multi-label outputs haven't been explicitly utilized yet; we plan to leverage this information to further improve NER performance
- expanding the DL model by integrating other subtasks, such as predicting relevant mobile apps from commands and/or actions
This pipeline, with its flexible combination of Semiotics, Deep Learning and grammar-based algorithms, will make Aiqudo voice services even more powerful in the future.
Xiwu Han, Hudson Mendes and David Patterson – Aiqudo R&D