Jacques Mattheij

Technology, Coding and Business

Learning Romanian

I currently find myself living in Romania for a bit. One thing that bothers me is that if I spend time in a place where I don’t speak the language and make the locals speak English to me all the time (provided they can!), it feels like I’m being rude and not making an effort to meet them halfway. Why should they (many) be the ones to adapt rather than me (one)?

So I make it a point to make a strong attempt to learn the language of the place where I currently live. I’ve landed a short-term contract (which hopefully will not interfere with all my other plans too much) and currently live in an apartment in Bucharest. Bucharest, Romania is a place where younger people will all speak English in various degrees of fluency, but older people and most people you will interact with in stores and other day-to-day situations do not. So there is not just a sense of balance involved in this but also a simple need. In order to get through a normal day it would very much help if I spoke the language.

Romanian is best described as modern day Latin with a number of Slavic influences. It has all the complexities of Latin as well (conjugations, declensions) and has numerous exceptions to the regular rules of the language, as well as a whole bunch of extensions to the normal Latin alphabet indicated using diacritics. Romanian is hard. It’s certainly not the hardest thing I’ve learned but it is harder than I expected it to be (normally I start picking up the basics of a language after a few weeks). I’ve been here for two months now and am not making the progress that I would like to make. So sometime last week, after delivering the first part of the job to the customer (a program that spots spammers before they can become a nuisance to a community: tons of machine learning and interesting tricks in order to output a single bit of data for every new signup), I decided to get serious about this language business.

After reading this very fortunately timed thread on HackerNews I found out about MemRise (I first tried ‘Anki’ but it wouldn’t work for some reason and I gave up after getting stuck in dll hell and versioning conflicts for longer than I have patience for).

Memrise is an instance of Spaced Repetition Software (SRS). It sets you up with a set of flashcards that are presented at precisely timed intervals, with various ways of entering the expected answer. Optionally, flashcards are accompanied by audio files or pictures that you can upload yourself as memory hints. The whole thing has been gamified to the max (the MemRise people were coached by some guy that was responsible for the farmville cancer), though the leaderboard has been shut down because there was too much cheating (who on earth would cheat to fake learning a language just to be top of a leaderboard is a mystery to me). MemRise is very slick and I took to it with total abandon. They use a ‘gardening’ metaphor where you first ‘plant’ a word or sentence and then you ‘water’ it in subsequent reviews. It works very nicely in the first few days.
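To make the ‘precisely timed intervals’ idea concrete, here is a minimal sketch of a Leitner-style scheduler in Python. This is an assumption about how such a system could work, not MemRise’s actual algorithm: the `INTERVALS` values, the `Card` class and the `review` and `due_cards` functions are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed review intervals in days for each box; purely illustrative values.
INTERVALS = [1, 2, 4, 8, 16, 32]


@dataclass
class Card:
    prompt: str
    answer: str
    box: int = 0          # index into INTERVALS
    due: date = date.min  # a new card is due immediately


def review(card: Card, correct: bool, today: date) -> None:
    """Move the card up one box on a correct answer, back to box 0 otherwise."""
    if correct:
        card.box = min(card.box + 1, len(INTERVALS) - 1)
    else:
        card.box = 0
    card.due = today + timedelta(days=INTERVALS[card.box])


def due_cards(cards: list[Card], today: date) -> list[Card]:
    """Return the cards whose next review date has arrived."""
    return [c for c in cards if c.due <= today]
```

Each correct answer pushes the next review further into the future, while a mistake resets the card to the shortest interval. Real SRS packages tend to use more refined interval formulas, but the basic loop looks much like this.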

But after completing the first ‘course’ for Romanian (user contributed content) I noticed a few things. For one, the courses are supplied by the end-users and they do not appear to be ‘natives’ to me. Quite a few mistakes in the deck marred the experience and caused me to learn the wrong translations very rigorously. This is a real problem. The second issue with MemRise floated to the top as soon as I added a second course (deck). There was a lot of overlap with the first and yet MemRise wasn’t clever enough to realize I’d already learned a lot of those words in the previous deck (you can solve this using the ‘ignore’ option but that’s a ton of manual work that could easily be automated). Another problem that surfaced is that two decks do not necessarily agree on what a word means. Maybe both are right, maybe one of the two is wrong. Regardless, at a minimum it will cause you to not know what you’re supposed to answer because invariably the answer that will float to the top will be the one from the other deck (because that’s the one you saw last).
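The overlap and disagreement checks described above look easy to automate. Here is a rough Python sketch, assuming each deck can be exported as a simple mapping from a Romanian word to its English translation (I don’t know what MemRise stores internally; `normalize` and `compare_decks` are hypothetical names of my own):

```python
def normalize(word: str) -> str:
    """Fold case and surrounding whitespace so trivial variants compare equal."""
    return word.strip().casefold()


def compare_decks(known: dict[str, str], new: dict[str, str]):
    """Split the new deck into already-known, conflicting, and unseen words.

    Both decks map a word in the target language to its translation.
    """
    known_norm = {normalize(w): normalize(t) for w, t in known.items()}
    duplicates, conflicts, unseen = [], [], []
    for word, translation in new.items():
        key = normalize(word)
        if key not in known_norm:
            unseen.append(word)        # not in the first deck at all
        elif known_norm[key] == normalize(translation):
            duplicates.append(word)    # same word, same meaning: skip it
        else:
            conflicts.append(word)     # same word, different meaning: review
    return duplicates, conflicts, unseen
```

Duplicates could be ignored automatically, and conflicts could be shown to the learner (or a native speaker) to decide which translation to keep.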

Another issue is that the MemRise courses vary greatly in usefulness and quality but the courses are not graded in any way, nor is there any indication of whether or not they have been made by someone fluent in the target language. There is also no way to flip or reverse the deck. Decks (courses) may contain lots of words of questionable usability, and of course you’d like to learn the most useful words first (why would I have to learn ‘sobolanul’ (the rat in Romanian) before learning a useful word like ‘toothpaste’ or ‘ticket’ in the first few hundred words in a new language?). I can’t search for a word in the courses that I’m learning (for instance, while setting up a new deck, to check if a definition interferes with a previous deck) and I have no quick way to compare two decks side-by-side. So it somewhat works but is starting to frustrate me and I feel that the more time I spend on MemRise the stronger those frustrations will become (and none of them seem to be too hard to fix).

This is written after I’ve been playing with MemRise for only a few days (but full-time, I cranked up about half a million ‘points’ in their system so I think I have some right to speak here) but still I already find myself wondering if there is a way to improve on MemRise.

MemRise is made by ‘Neuroscientists, GrandMasters of Memory and software developers’ and they are reinventing learning. I’m a-ok with that and I think it is a great product to get a first grasp on a new language but I really think they could do a much better job of this than they are doing right now when it comes to learning a new language (which is a very large part of the use-case of memrise, but when you’re on a mission to ‘reinvent learning’ you might miss that little detail).

Spaced Repetition Software is not exactly a new concept (even though memrise is a nice and slickly packaged example of it, the first SRS dates back decades). Currently there are a lot of packages out there that attempt to do this, with mobile phone and tablet support and all kinds of goodies. But most of them look pretty arcane (MemRise definitely looks good), and I wonder about the quality of what’s in there. There is the Pimsleur language school, of course, but after being spammed to death by them about two years ago I have made an expensive vow to never ever do business with them. One thing you have to keep in mind when using techniques like this for learning is that you are memorizing things: you are basically cramming facts but you are not in any way aiding your actual understanding, although of course those facts can later be used when you do reach a certain level of understanding. Patterns, shortcuts and other deeper modes of understanding are not easily conveyed using flash cards, so the method appears to be somewhat limited in scope, though you might be able to get around this by crafting your flashcards in a specific manner (I haven’t seen much of that to date though).

One thing I have in common with the MemRise guys is that I’m a software developer too. And that is where I will look for my solution.

User interface issues can be resolved once you have the codebase under your own control, and one way of doing this is to go to work for MemRise. Since that’s not an option I’ll settle for door #2, write my own little flashcard program (it doesn’t look all that complicated from the outside and the basic principle of SRS does not look as though it is impossible to implement either).

The first issue to resolve though is the quality of the content: how to go about setting up a solid language course. Over the last two days I’ve been toying (more and more seriously as I go along) with the problems of building a thing like this.

One of the first concrete issues you run into is that nobody seems to agree on what the most useful words are. So it looks as if everybody in the ‘flashcards’ industry is doing the same thing: they copy the first 20 pages out of the Berlitz language guides and call it a day (and in the case of MemRise they make their users copy that data). This is likely why you end up with a ton of identical word sets and yet miss out on lots of important words for everyday use. And it is also why most or all of these courses top out at relatively low word counts and focus on teaching you words, not a language.

Language is a tremendously interesting phenomenon. It is the verbalization of concepts that can exist independently in our heads. ‘Blue’ and ‘Brother’ are concepts (the colour of the sky (a concept in its own right), and a man who has the same parents as me). They do not even need a sound to be associated with them to exist. And that sound does not necessarily have to have a written representation either. But you could attach such a sound to either one of these concepts and then you’d have the beginnings of a language. Another person might assign different sounds to them and then you’d have to learn the concepts behind the words in order to learn their language and to translate your thoughts into sounds that they would understand. So the concepts are the key.

The next issue is that concepts do not map 1:1 from one language to another. Maybe some concepts don’t exist in another language (try to translate ‘schadenfreude’ from German to another language, or the Dutch word ‘Gezellig’). The more languages I learn (Dutch, German, English, Polish, a bit of French, a bit of Latin, a bit of Spanish), the more frequently I run into words that I can’t easily translate from one language to another.

Then there are synonyms (concepts with multiple representations) and homonyms (multiple concepts that map onto the same spelling), homophones (multiple concepts that sound the same) and so on. These complicate the structure of language quite a bit. Furthermore, not every language treats its words as ‘immutable elements’; many languages will modify words depending on where they are used in a sentence or upon some other contextual element (or who the speaker is, what their relative social status is and so on). Fascinating!!

So, consider my interest engaged :) To find a starting point in all this I’d like to know what the most useful words are. That’s a suspiciously simple question, and like most questions like that the answer is extremely complex and depends on a ton of outside factors. For a lawyer visiting a university abroad to learn about the local legal system that answer is different than for a tourist visiting a museum or even a tourist in a restaurant. Context is everything. And yet, there must be some subset of language that everybody would agree is useful. And beyond mere words, there must be sentences that are more useful than others (and ditto sentence fragments).

Searching around the web for a nice corpus of spoken English I found this beautiful collection of spoken text. After downloading it and cleaning it up a bit, here are the 30 most spoken words in English and their frequency in that corpus, listed from least to most frequent (as used in those transcripts; of course this is not necessarily representative of English as a whole):

   1508 she
   1535 on
   1542 there
   1613 just
   1632 but
   1826 what
   1841 yeah
   1928 so
   1981 this
   2050 we
   2133 are
   2224 like
   2339 he
   2370 have
   2598 know
   2662 do
   2663 they
   2872 in
   3001 was
   3257 of
   4257 not
   4821 a
   4832 to
   5702 that
   5844 it
   6669 you
   7219 and
   8126 is
   8181 the
   8895 I

You can immediately see how (at least in the English language) the most frequently used words are short: language has optimized itself over time to be economical with the bandwidth between the speakers and the listeners. Longer words are typically used less frequently.
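A frequency list like the one above takes only a few lines to produce. Here is a sketch in Python, assuming the cleaned-up transcripts are available as one plain-text string; the function names are my own, and the tokenizing rule (letters plus inner apostrophes, everything lowercased) is just one reasonable choice, not necessarily the one behind the numbers above.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count case-insensitive word occurrences, keeping inner apostrophes."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


def top_words(text: str, n: int = 30) -> list[tuple[str, int]]:
    """Return the n most frequent words, most frequent first."""
    return word_frequencies(text).most_common(n)
```

Note that lowercasing turns ‘I’ into ‘i’, and that this version only handles plain ASCII letters, so it would need a different pattern for a Romanian corpus with diacritics.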

So if I were to use this list that I have generated as a basis for learning individual words on flashcards, that would be a first step in the right direction. And that list could be cross-checked and corrected using resources such as this list of the top 1500 nouns. Next I need to translate them and come up with a data structure that will hold the concepts and the representations in the various languages, attributes of words and so on.
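As a first stab at that data structure, something along these lines might do. This is only a sketch in Python; the class names, the language codes and the attribute keys are placeholders I made up, not a finished design.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Representation:
    language: str   # e.g. "en" or "ro" (hypothetical language codes)
    text: str       # the written form of the word in that language
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Concept:
    key: str                              # language-neutral identifier
    frequency_rank: Optional[int] = None  # position in a frequency list, if known
    representations: list[Representation] = field(default_factory=list)

    def add(self, language: str, text: str, **attributes: str) -> None:
        """Attach one more word for this concept; synonyms are simply added twice."""
        self.representations.append(Representation(language, text, dict(attributes)))

    def in_language(self, language: str) -> list[str]:
        """All known words for this concept in one language (possibly none)."""
        return [r.text for r in self.representations if r.language == language]
```

For example, a ‘brother’ concept could carry ‘brother’ for English and ‘frate’ (with gender ‘masculine’) for Romanian. A concept that does not exist in a language simply returns an empty list, and synonyms show up as several entries for the same language.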

This is starting to be a fun project :)! It is also of course a very nice way to procrastinate but I think that in the process of setting this up I might actually learn some Romanian too. And meanwhile I’ll continue to use MemRise, don’t throw away your old shoes before you’ve got your new ones.