Teaching Luxembourgish to my computer
How we taught a computer to understand Luxembourgish
Today we reveal a project that Kevin and I have been working on for the past 2 months, Liss. Liss is a sentiment analysis artificial intelligence; you can let Liss read single words or whole sentences, and Liss will tell you whether the overall sentiment is positive or negative. Such tools are used in marketing, for instance to determine how people perceive a certain brand or new products. What makes Liss original is that it works on Luxembourgish texts.
How to develop a basic sentiment analysis AI
Machine learning algorithms need data in order to be trained. Training a machine means showing it hundreds, thousands, or even more examples so that it can learn patterns; once the machine has learned the patterns, we can use it to make predictions. For example, we can train a machine learning algorithm to determine whether a given picture shows a cat or a dog. For this, we show the machine thousands of pictures of cats and dogs, until the machine has learned some patterns. For example, the machine will learn that cats are on average smaller than dogs, and that, unlike dogs, they have almond-shaped eyes. It is not always easy (albeit possible) to know which patterns, or features, the machine is using to tell cats and dogs apart. Then, once we show the machine a new picture, it will be able to predict, with a certain confidence, that the picture is of a dog, or a cat.
For sentiment analysis, you proceed in a similar manner; you show the algorithm thousands of texts that are labeled as either positive or negative, and then, once you show the machine new texts, it will predict the sentiment. However, where can you find such example data, also called the training set, to train your AI with? One solution is to scrape movie reviews. Movie reviews are texts written by humans, with a final score attached to them. A reviewer might write the following review about Interstellar:
“One of the best movies I have ever seen. The actors were great and the soundtrack amazing! 9/10”
Here, the AI will learn to associate the score of 9, which is very positive, with words such as best and amazing.
Another possible review, for, say, The Room could be:
“Wiseau put a lot of time and effort into this movie and it was utter crap. 2/10”
Here, the AI will learn to associate words such as crap with a low score, 2.
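To make the idea concrete, here is a minimal sketch in R of how review scores could be turned into positive/negative labels. The toy data frame, the `label_sentiment()` helper and the threshold of 6 are assumptions made for illustration, not the rule used for Liss.

```r
# Hypothetical toy data: review text plus the score out of 10 given by the reviewer.
reviews <- data.frame(
  text = c(
    "One of the best movies I have ever seen. The actors were great and the soundtrack amazing!",
    "Wiseau put a lot of time and effort into this movie and it was utter crap."
  ),
  score = c(9, 2),
  stringsAsFactors = FALSE
)

# Assumed labelling rule: scores of 6 or more count as positive, the rest as negative.
label_sentiment <- function(score, threshold = 6) {
  ifelse(score >= threshold, "positive", "negative")
}

reviews$sentiment <- label_sentiment(reviews$score)
print(reviews[, c("score", "sentiment")])
```

Where exactly to put the threshold is a judgment call; reviews with middling scores are often ambiguous and could also simply be dropped from the training set.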
This is the gist of it, but of course feeding all these training examples to the AI requires a lot of thought and data pre-processing. This blog post will not deal with the technicalities, but more with how we tackled a serious problem: where do we find movie reviews written in Luxembourgish?
Where we found Luxembourgish comments
Luxembourg is a small country with a small population. As such, the size of the internet in Luxembourgish is quite small. Add to that the fact that most people only speak Luxembourgish, and don’t know how to write it, and you have a big problem: as far as we are aware, it is not possible to find, say, movie reviews to train a machine on. So how did we tackle this problem? Because there were no comments in Luxembourgish lying around for us to work with, we scraped German comments. Linguistically, Luxembourgish is very close to West German dialects, with some French influences too. Putting German and Luxembourgish sentences side by side clearly shows the similarities:
Luxembourgish:
“Hallo, wéi geet et dir?” (Hello, how are you?)
“Ganz gudd, merci!” (Very well, thank you!)
German:
“Wie geht es dir?”
“Ganz gut, danke!”
The only word in the Luxembourgish sentences that comes from French is merci, meaning thank you. Of course, this is a simple example, but if we look at more complicated sentences, for example from the Bible, we still see a lot of similarities between Luxembourgish and German:
Wéi d’Elisabeth am sechste Mount war, ass den Engel Gabriel vum Herrgott an eng Stad a Galiläa geschéckt ginn, déi Nazareth heescht,
bei eng Jongfra, déi engem Mann versprach war, dee Jouseph geheescht huet an aus dem Haus vum David war. Dës Jongfra huet Maria geheescht.
Den Engel ass bei si eragaang a sot: “Free dech, [Maria], ganz an der Gnod! Den Här ass mat dir.”
Und im sechsten Monat ward der Engel Gabriel gesandt von Gott in eine Stadt in Galiläa, die heißt Nazareth,
zu einer Jungfrau, die vertraut war einem Manne mit Namen Joseph, vom Hause David; und die Jungfrau hieß Maria.
Und der Engel kam zu ihr hinein und sprach: Gegrüßet seist du, Hochbegnadete! Der Herr ist mit dir!
-Lk 1,26-38
In these sentences, most differences come from the use of different tenses or different choices in what to include in the translation. For instance, the first sentence in Luxembourgish starts with Wéi d’Elisabeth am sechste Mount war (As Elisabeth six months pregnant was) while the German translation starts with Und im sechsten Monat (And in the sixth month). Same meaning, but the reference to Elisabeth is implicit.
So German and Luxembourgish are very close, but what good does that do us? Training a model on German movie reviews and trying to predict sentiments of texts in Luxembourgish will not work. So the solution was to translate the comments we scraped from German to Luxembourgish. We scraped 50000 comments; obviously we could not translate them ourselves, so we used Google’s translate api to do it. There’s a nice R package that makes it easy to work with this api, called {translate}.
The translation quality is not bad, but the longer and more complicated the comments, the spottier the translation; overall, though, the quality seems to be good enough. Let’s translate the German sentences from above into Luxembourgish using Google Translate:
An am sechsten Mount huet de Engel Gabriel vu Gott geschéckt an eng Stad an Galiläa geschéckt, déi Nazareth genannt gëtt
zu engem Kiischte, dee vum Josephsjäreger bekannt gouf, aus dem Haus vum David; an den Numm vun der Jungfra Maria war .
De Engel ass si komm an huet gesot: “A Blann, geeschteg bass! Den Här ass mat iech!
The translation here is really not that great, but Bible verses are written in a pretty unusual way. What about a more standard text? Let’s try with the first paragraph on Luxembourg from the German version of Wikipedia:
German:
Das Großherzogtum Luxemburg ist ein Staat und eine Demokratie in Form einer parlamentarischen Monarchie im Westen Mitteleuropas. Es ist das letzte Großherzog- bzw. Großfürstentum (von einst zwölf) in Europa. Das Land gehört zum mitteldeutschen Sprachraum. Landessprache ist Luxemburgisch, Verwaltungs- und Amtssprachen sind Französisch, Deutsch und Luxemburgisch. Gemeinsam mit seinem Nachbarn Belgien und mit den Niederlanden bildet Luxemburg die Beneluxstaaten.
Luxembourgish (from Google Translate):
D’Groussherzogtum Lëtzebuerg ass e Staat an eng Demokratie an der Form vun enger parlamentarescher Monarchie an der Westeuropa. Et ass de leschte Grand-Duché oder Grand-Duché (e puer Joeren) an Europa. Dëst Land gehéiert zu der zentrale germanescher Sprooch. D’Nationalsprooch ass Lëtzebuergesch, administrativ an offizielle Sproochen sinn franséisch, däitsch a lëtzebuergesch. Zesumme mat sengem Noper Belgien an Holland ass Lëtzebuerg de Benelux.
English (from Google Translate):
(The Grand Duchy of Luxembourg is a state and a democracy in the form of a parliamentary monarchy in western Central Europe. It is the last Grand Duchy or Grand Duchy (once twelve) in Europe. The country belongs to the central German language area. The national language is Luxembourgish, administrative and official languages are French, German and Luxembourgish. Together with its neighbor Belgium and the Netherlands, Luxembourg is the Benelux.)
Knowing both German and Luxembourgish, I can tell that the translation is pretty good, and would require minimal human editing to make it perfect.
So we were pretty confident that this strategy was worth trying, and that’s what we did. We translated the comments using the Google Translate api from R.
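As a rough sketch only (this is not the exact code used for Liss), a translation loop over many comments could be organised in batches as below. `translate_fn` is a hypothetical stand-in for whatever function actually calls the translation service, and `fake_translate()` is a dummy that contacts nothing; the batch size is likewise an assumption.

```r
# translate_fn is a hypothetical stand-in for the function that calls the
# translation service. It must take a character vector and return a
# character vector of the same length.
translate_in_batches <- function(comments, translate_fn, batch_size = 100) {
  if (length(comments) == 0) return(character(0))
  # Split the comment indices into consecutive batches.
  batches <- split(seq_along(comments), ceiling(seq_along(comments) / batch_size))
  translated <- character(length(comments))
  for (idx in batches) {
    translated[idx] <- tryCatch(
      translate_fn(comments[idx]),
      # Mark a failed batch as missing so it can be retried later.
      error = function(e) rep(NA_character_, length(idx))
    )
  }
  translated
}

# Dummy translator used only to show the call; it does not contact any service.
fake_translate <- function(x) paste0("[lb] ", x)

german_comments <- c("Ganz gut, danke!", "Wie geht es dir?", "Der Film war toll.")
translate_in_batches(german_comments, fake_translate, batch_size = 2)
```

Working in batches and recording failures as `NA` means that a single failed request does not throw away the comments already translated.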
Once we had translated everything, we started training a model.
The sentiment analysis tool we built
To train the model, we used the R programming language and Keras, a deep learning library. The comments had to be preprocessed, which is what took the most time. Then, building a model with Keras is quite simple, and we did not do anything special to it; actually, we did not spend much time tuning the model and, to our astonishment, it worked quite well! To share the results with everyone, we also created a web app that you can access by clicking here.
Try writing words or sentences, and most importantly, give us feedback! See you for the next post.
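This post does not go into the preprocessing itself, but to give an idea of what it typically involves, here is a minimal base R sketch that turns comments into fixed-length integer sequences, the kind of input a neural network usually expects. The helper names (`tokenize()`, `build_vocab()`, `encode()`), the reserved ids and the sequence length are all assumptions made for illustration, not the pipeline behind Liss.

```r
# Lowercase, replace punctuation with spaces and split on whitespace.
tokenize <- function(text) {
  cleaned <- gsub("[[:punct:]]+", " ", tolower(text))
  tokens <- strsplit(trimws(cleaned), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Map every distinct word to an integer id.
# Id 0 is reserved for padding and id 1 for unknown words.
build_vocab <- function(token_lists) {
  words <- sort(unique(unlist(token_lists)), method = "radix")
  setNames(seq_along(words) + 1L, words)
}

# Turn one tokenized comment into a vector of exactly max_len ids.
encode <- function(tokens, vocab, max_len = 4) {
  ids <- unname(vocab[tokens])
  ids[is.na(ids)] <- 1L                    # word not in the vocabulary
  ids <- head(ids, max_len)                # truncate long comments
  c(ids, rep(0L, max_len - length(ids)))   # pad short ones with zeros
}

comments <- c("Ganz gudd, merci!", "Net gudd.")
token_lists <- lapply(comments, tokenize)
vocab <- build_vocab(token_lists)

# One row per comment, one column per position in the sequence.
encoded <- do.call(rbind, lapply(token_lists, encode, vocab = vocab, max_len = 4))
print(encoded)
```

Real comments need more care than this (accents, apostrophes, emojis, very rare words), which is a large part of why preprocessing takes so long.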