Bad NLU Design
Let’s say we’re building a simple banking application with two intents: check balance and manage credit card. They have the following utterances:
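A minimal sketch of what that training data might look like (the exact utterance wording here is illustrative; only the overlap matters):

```python
# Illustrative training data for a simple banking assistant.
training_data = {
    "check_balance": [
        "What is my account balance?",
        "How much money is in my checking account?",
        "How much money do I owe on my credit card?",  # overlaps with manage_credit_card
    ],
    "manage_credit_card": [
        "I want to pay off my credit card",
        "Increase my credit card limit",
        "How much money do I owe on my credit card?",  # overlaps with check_balance
    ],
}
```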
We can see a problem right off the bat: both the check balance and manage credit card intents have a balance-checking utterance for the credit card! This will potentially confuse the NLU, since we don’t have many examples.
If we were thinking of it from a UI perspective, imagine your bank app had two screens for checking your credit card balance. That might seem convenient at first, but what if you could only perform an action from one of those screens? That would be annoying, and the UX would no longer be intuitive.
Likewise, in conversational design, activating a certain intent leads a user down a path, and if it’s the “wrong” path, it’s usually more cumbersome to navigate back than in a UI. We should be careful in our NLU designs, and while this spills into the conversational design space, thinking about user behaviour is still fundamental to good NLU design.
NLU Design Principles
In the previous section we covered one example of bad NLU design: utterance overlap. In this section we’ll discuss good NLU practices. Our end goal is to improve the quality of our conversational AI data.
- Minimize utterance overlap
- Intent balance
- Real world data
- Setting confidence thresholds
- Look for other patterns
Minimize utterance overlap
In the previous section we started with a training dataset that had some overlap between two intents.
We want to solve two potential issues: confusing the NLU and confusing the user. There are a few things we could do to fix this:
- Move the problematic utterance to a different intent
- Rephrase your utterances to be more precise
- Delete the utterance
- Create a new intent
Let’s try approach number one. We can move our “How much money do I owe on my credit card?” utterance to the manage credit card intent.
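After the move, the illustrative data from earlier would look like this:

```python
# Approach 1: the overlapping utterance now lives only under manage_credit_card.
training_data = {
    "check_balance": [
        "What is my account balance?",
        "How much money is in my checking account?",
    ],
    "manage_credit_card": [
        "I want to pay off my credit card",
        "Increase my credit card limit",
        "How much money do I owe on my credit card?",
    ],
}
```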
This would reduce our confusion problem, but it now potentially removes the purpose of our check balance intent.
Perhaps we can try to redesign our intents. To do this, we can make a list of actions we want to let the user perform and the best way to reach each one. We can divide our intents into a reader set and a writer set. A reader intent lets the user retrieve information but not take any action, while a writer intent lets the user complete an action but not see any information. With this structure, our intents might look like this:
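A sketch of that reader/writer split, with illustrative intent names and utterances:

```python
# Reader intents retrieve information; writer intents perform actions.
intents = {
    # Reader set
    "check_account_balance": ["What is my account balance?"],
    "check_credit_card_balance": ["How much money do I owe on my credit card?"],
    # Writer set
    "pay_credit_card": ["I want to pay off my credit card"],
    "increase_credit_limit": ["Increase my credit card limit"],
}
```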
This looks cleaner now, but we have changed how our conversational assistant behaves! Sometimes when we notice that our NLU model is broken we have to change both the NLU model and the conversational design. It’s part of the iterative UX process.
Our other two options, deleting and creating a new intent, give us more flexibility to re-arrange our data based on user needs.
Intent balance
To start this section off, let’s take a gameshow example. Say you’re invited to a gameshow where if you guess 9 out of 10 questions right, you win a prize! This might sound challenging, but you’re told that 95% of the time the answer is A. What will your strategy be?
To guess A all 10 times! You’d win the game 91% of the time with this strategy.
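As a quick sanity check of that 91% figure: always answering A wins whenever at least 9 of the 10 questions actually have A as the answer.

```python
from math import comb

p = 0.95  # probability that any given question's answer is A

# Win if at least 9 of 10 independent answers are A:
# P(exactly 9 are A) + P(all 10 are A)
win = comb(10, 9) * p**9 * (1 - p) + p**10
print(f"{win:.2%}")  # 91.39%
```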
Now if you train an NLU and give it 9 examples for intent A and 1 for intent B, how do you think your model will behave? It might just learn to guess intent A, since it will be right 90% of the time!
This dataset distribution is known as a prior, and it will affect how the NLU learns. Imbalanced datasets are a challenge for any machine learning model, and data scientists often go to great lengths to correct them. To avoid this pain, use your prior understanding to balance your dataset.
To measure the consequences of data imbalance, we can use a metric called the F1 score. An F1 score provides a more holistic representation of accuracy than a raw percentage of correct answers. We won’t go into depth in this article, but you can read more about it here.
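To see why accuracy alone can be misleading on imbalanced data, here’s a small sketch using scikit-learn (the labels are made up for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score

# 9 utterances belong to intent A and only 1 to intent B (an imbalanced prior),
# and a degenerate model has learned to always guess the majority intent.
y_true = ["A"] * 9 + ["B"]
y_pred = ["A"] * 10

print(accuracy_score(y_true, y_pred))             # 0.9   -- looks great
print(f1_score(y_true, y_pred, average="macro"))  # ~0.47 -- intent B is never caught
```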
Scores like these illustrate how a simple NLU can get trapped by poor data quality. With better data balance, your NLU should be able to learn better patterns to recognize the differences between utterances.
Real world data
An important part of NLU training is making sure that your data reflects the context in which your conversational assistant is deployed. This might include the channel, demographic, region, or social norms. Understanding your end user and analyzing live data will reveal key information that will help your assistant be more successful.
If we are deploying a conversational assistant as part of a commercial bank, the tone and audience of the CA will be much different from those of a digital-first banking app aimed at students. Likewise, the language used in a Zara CA in Canada will be different from one used in the UK.
When testing your conversational assistant it’s important to monitor how the data changes and if your customer persona is well represented in your training data. Some factors to look out for include:
- Tone
- Formality
- Grammar
- Spelling/Typos
- Slang and jargon
- Bot domain
- Comfort with technology
You can make assumptions during the initial stages, but only once the conversational assistant goes live in beta and is tested against real-world traffic will you know how those assumptions hold up.
Setting confidence thresholds
When a conversational assistant is live, it will run into data it has never seen before. Even Google sees 15% of its searches for the first time every day! With new requests and utterances, the NLU may be less confident in its ability to classify intents, so setting confidence thresholds will help you handle these situations.
A higher confidence threshold will help you be more sure that what a user says is what they mean. The downside is that the user might have to repeat themselves, which leads to a frustrating experience. The alternative is to set a lower value and potentially direct the user down an unintended path.
The complexity of your project will also affect what your confidence threshold for an intent should be. If you have 100 intents and one scores 50% confidence, that’s much better than an intent that scores 50% when there are only two options!
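In practice this often amounts to a simple guard in the dialogue logic. A minimal sketch, where the threshold value and the fallback behaviour are assumptions you would tune per project:

```python
CONFIDENCE_THRESHOLD = 0.7  # tune per project; more intents usually means lower scores

def route(intent: str, confidence: float) -> str:
    """Route to the predicted intent only when the NLU is confident enough."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return intent
    # Below the threshold, ask the user to rephrase instead of guessing a path.
    return "fallback_clarify"

print(route("check_balance", 0.92))  # check_balance
print(route("check_balance", 0.40))  # fallback_clarify
```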
Look for other patterns
One of the magical properties of NLUs is their ability to pattern match and learn representations of things quickly and in a generalizable way. Whether you’re classifying apples and oranges or automotive intents, NLUs find a way to learn the task at hand. Sometimes, they learn patterns that weren’t what we expected.
Let’s say we have two intents, yes and no, with the utterances below.
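Illustrative utterances for the two intents might look like this:

```python
# Illustrative yes/no training data.
training_data = {
    "yes": [
        "Yes, please!",
        "Sounds good.",
        "That works for me.",
    ],
    "no": [
        "No",
        "Nope",
        "Nah",
    ],
}
```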
Initially this data looks good: there’s no overlap between the intents and no strange phrases. But there are a couple of hidden patterns here:
- The no intent always starts with an “N”
- The no intent only uses one word
- The no intent has no punctuation
With only a couple of examples, the NLU might learn these patterns rather than the intended meaning! Depending on the NLU and the utterances used, you may run into this challenge. To address it, you can create more robust examples, taking some of the patterns we noticed and mixing them in.
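Mixing those surface patterns across both intents, the illustrative dataset might become:

```python
# More robust data: both intents now vary in length, punctuation, and
# leading letters, so the NLU must learn meaning, not surface form.
training_data = {
    "yes": [
        "Yes",
        "Yeah!",
        "Sounds good.",
        "Absolutely, go ahead",
    ],
    "no": [
        "No",
        "Nope, not today.",
        "That doesn't work for me",
        "Absolutely not!",
    ],
}
```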
We now have a more robust training dataset! While NLU choice is important, the data being fed in will make or break your model.
Conclusion
In this post we went through various techniques for improving the data for your conversational assistant. This process of NLU management is essential to training effective language models and creating amazing customer experiences.
Ready to chat about your own NLU management workflow? Chat with our team.
NLU Management Terms
NLU: Short for Natural Language Understanding. Commonly refers to a machine learning model that extracts intents and entities from a user’s phrase.
NLUM: Short for NLU Management. The process of managing the data and testing of your NLU.
Utterance Overlap/Conflict: When utterances from different intents overlap in their meaning. This causes confusion for NLU models.
Intent Balance: The ratio between the number of utterances per intent. Good intent balance means that each intent has a roughly equal distribution of member utterances.
NLU Accuracy: The percentage of correct answers from an NLU. Often associated with intent classification.
F1 Score: A more comprehensive metric for calculating NLU accuracy. Incorporates false negatives and false positives. A value between 0 and 1.
Confidence Thresholds: A level that an NLU’s prediction needs to reach to be considered valid. A value between 0 and 1.