You cannot just assume a visitor will talk to the bot in a language the bot understands. In order to reply, even if it’s to give a proper “I don’t speak XXX” message, your chatbot needs to integrate a language detector that it can use to classify the user language. In this post, we show how to add a language detector to your chatbot with Xatkit.
How is a language identified?
You may have the impression this is not exactly the most challenging NLP task, since we have been using tools like Google Translate (which can automatically detect the input language) for years, and we are used to it. Nevertheless, language detection in a chatbot can be challenging because:
- Mixed-language input: the user can mix more than one language when chatting with the bot
- Distinguishing among closely related languages: languages with very similar lexicon and/or syntax, e.g. Spanish and Catalan
- Language detection in short texts, such as single words or short sentences. This is actually our hardest challenge, since short interactions between the bot and the user are the most common situation in a conversational chatbot.
A simple solution (that doesn’t work) is to use complete dictionaries for every possible language and check in which dictionary the visitor’s words show up. But this solution does not scale, even less so when you consider all verb forms and all the other kinds of word derivations.
Instead, language detection is treated as a classification problem, similar to how other NLP problems (like intent recognition) are dealt with. Each language we want to consider becomes a category. Then, given an input text, the language model classifies it into one of these categories. Classification can be done:
- Based on the distance between the input text and (training) texts for the given languages. This is known as a mutual-information-based distance measure. “Distance” here refers to any value that is meaningful for the classification task, for instance the difference between the number of occurrences of a word in two languages: the higher it is, the clearer the actual language of the word should be.
- By creating n-grams for each of the languages. An n-gram model is a probability distribution that assigns probabilities to sequences of n characters (where n can be any natural number). These probabilities are estimated from training text for each language. Then, given the input text, we only need to multiply the probabilities of the n-grams that appear in the text, and we conclude that the text language is the one with the highest result. A simple way to assign probabilities is Maximum Likelihood Estimation: for instance, we would assign “ing” a probability of 25/1000 in an English 3-gram model if “ing” appears 25 times among the 1000 character triplets of the English training corpus, and 3/1000 if it appears 3 times in a Spanish corpus (assuming, in this example, that each training corpus contains 1000 groups of 3 characters).
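To make the n-gram idea concrete, here is a minimal sketch of Maximum Likelihood Estimation over character 3-grams, using tiny toy corpora of our own invention. This is only an illustration of the technique described above, not the model OpenNLP (or any production detector) actually ships:

```python
from collections import Counter

def ngram_probs(corpus: str, n: int = 3) -> dict:
    """MLE: the probability of each n-gram is its count divided by
    the total number of n-grams observed in the training corpus."""
    grams = [corpus[i:i + n] for i in range(len(corpus) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()}

def score(text: str, probs: dict, n: int = 3) -> float:
    """Multiply the probabilities of the n-grams appearing in the text.
    n-grams never seen in training get a tiny smoothing value, so one
    unseen n-gram does not zero out the whole product."""
    smoothing = 1e-6
    result = 1.0
    for i in range(len(text) - n + 1):
        result *= probs.get(text[i:i + n], smoothing)
    return result

# Toy training corpora (real models are trained on far more text).
english = ngram_probs("the king is singing and bringing things")
spanish = ngram_probs("el rey canta y trae muchas cosas buenas")

# The input is classified into the language with the highest score.
detected = "English" if score("singing", english) > score("singing", spanish) else "Spanish"
```

With such tiny corpora the scores are only meaningful relative to each other, which is all the classification needs: “singing” shares many 3-grams (“ing”, “gin”, …) with the English corpus and none with the Spanish one.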
But you don’t really need to worry about the nitty-gritty details unless you want to create your own language model, which you probably don’t want to. If you just want to add language detection to your chatbot take one of the pretrained language model detectors and integrate it into your chatbot. Even better, this feature is already integrated in Xatkit so there is nothing for you to do!
Language detection in Xatkit
We have been testing many libraries and models for language detection but, due to the challenges described above, many of them were not a good fit for a chatbot language detector. Some worked fine with short inputs, but were too slow. Others simply needed very long texts to actually make good predictions.
We chose Apache OpenNLP and its own language detector model as the best option for user language detection. This solution is fast enough for our purpose and makes good predictions for short inputs. We wrapped it in a new Language Detection Post-Processor that, if activated, gives you the language information (and the corresponding confidence) for every user input. In fact, given the complexity of the problem, we provide two different pieces of information every time:
- The best language detected for the last user utterance
- An accumulated value with our best language guess based on the last n user utterances, where n could be any number greater than 1
This way the user has a specific prediction for the last message, and a global prediction over a longer part of the conversation, which should be more accurate and reliable: the longer the text, the easier it is for the language model to figure out its language.
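One way to implement such an accumulated guess is to keep the per-language confidences of the last n utterances in a sliding window and sum them per language. The sketch below is our own illustration of that idea (the class and method names are ours, not Xatkit’s API):

```python
from collections import deque, defaultdict

class AccumulatedLanguageGuess:
    """Accumulates per-language confidence scores over the last n
    utterances and reports the language with the highest total."""

    def __init__(self, n: int = 10):
        # Once the window is full, the oldest utterance drops out.
        self.window = deque(maxlen=n)

    def add(self, predictions: dict) -> None:
        """predictions maps a language code to its confidence
        for a single user utterance."""
        self.window.append(predictions)

    def best(self) -> str:
        totals = defaultdict(float)
        for predictions in self.window:
            for lang, confidence in predictions.items():
                totals[lang] += confidence
        return max(totals, key=totals.get)

guess = AccumulatedLanguageGuess(n=10)
guess.add({"eng": 0.4, "spa": 0.6})  # short message: detector gets it wrong
guess.add({"eng": 0.9, "spa": 0.1})
guess.add({"eng": 0.8, "spa": 0.2})
# Accumulated totals: eng = 2.1, spa = 0.9 -> "eng" wins
```

Note how the accumulated guess recovers from the first, wrong single-message prediction, which is exactly why the global value tends to be more reliable than the per-message one.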
Example of a chatbot with language detection
You can do whatever you want with this information. For instance, you could use it to see whether the intent recognition failed because the user is talking to the bot in an unexpected language. If so, in the default fallback you could read the detected language and inform the user that your bot doesn’t understand that particular language. You could even reply in the same language the user is speaking, if you have this predefined error message translated into a number of languages. Or you could go one step further and use the language detector as the first step of an automatic translation process, or use it to start building a multilingual bot, or… There are a lot of different possibilities to choose from.
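As a small illustration of the “translated error message” idea, a fallback could simply look up the detected ISO 639-3 code in a table of pre-translated replies. This is a hypothetical sketch of ours, not part of Xatkit:

```python
# Pre-translated "I don't understand" messages, keyed by ISO 639-3 code.
ERROR_MESSAGES = {
    "eng": "Sorry, I don't understand that language.",
    "spa": "Lo siento, no entiendo ese idioma.",
    "fra": "Désolé, je ne comprends pas cette langue.",
}

def fallback_reply(detected_lang: str) -> str:
    """Reply to an unrecognized intent in the language the user
    seems to be speaking; fall back to English for anything else."""
    return ERROR_MESSAGES.get(detected_lang, ERROR_MESSAGES["eng"])
```

For example, `fallback_reply("spa")` returns the Spanish message, while an unlisted language such as `"deu"` falls back to the English one.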
You can see in the gifs below an example use of this language detection processor. This example chatbot tells you the language of your last message (which, as you can see, sometimes fails) and the language of the last 10 messages. This second prediction is built up along the conversation and, after a while, settles down into a better prediction (as long as the user sticks to the same language, of course!). As you can see, OpenNLP returns language codes following the ISO 639-3 standard.
In the next code excerpt, you can see how we defined the chatbot state that handles the user intent, getting the first language prediction (the one with the highest score according to the language model). The full code for this example is available in our examples repo.