The problem of Not-So-Simple-Measures

There are a couple of reasons why you might want to assess how difficult a text is to read and understand more objectively, or without reading it yourself. You could, for example, want to know which level of proficiency in a language is necessary to understand a given text. This is what readability scores were created for. They are mostly used for evaluating medical instructions or for selecting suitable reading materials for foreign language students. The majority of these tests use certain linguistic properties to calculate readability, such as the average word length, the average sentence length and the count of polysyllabic words (i.e. words with three or more syllables). And this is where the problem begins, because counting the syllables of a word is not an easy task and depends very much on the language in question. Some readability scores, most notably SMOG (Simple Measure of Gobbledygook), which is sometimes considered to be the "gold standard" of readability scores, are accessible online or implemented in an existing program. But they almost always support only the English language, which can be a good reminder that they should not be applied to a language they were not designed for. But if you want to evaluate their usefulness across different languages, develop your own readability score or just tinker around with them, you need to find a way to count the syllables in a given word.
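To make the dependence on syllable counts concrete, here is a minimal sketch of the SMOG grade as it is usually stated (McLaughlin's formula, calibrated for English). The function name `smog_grade` and its arguments are illustrative assumptions, and the sketch presupposes that you already have a syllable count for every word, which is exactly the missing piece.

```python
import math

def smog_grade(syllable_counts, n_sentences):
    """Estimate the SMOG grade from per-word syllable counts.

    syllable_counts: one integer per word in the sample (hypothetical input).
    n_sentences: number of sentences in the sample.
    """
    if n_sentences <= 0:
        raise ValueError("n_sentences must be positive")
    # SMOG counts polysyllabic words, i.e. words with three or more syllables.
    polysyllables = sum(1 for count in syllable_counts if count >= 3)
    # The polysyllable count is normalised to a sample of 30 sentences.
    return 1.0430 * math.sqrt(polysyllables * 30 / n_sentences) + 3.1291
```

Everything except the syllable counts is trivial to obtain, which is why the rest of this post is about estimating them.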

There are basically two ways to do this: the right way and the pragmatic way. From a linguistic standpoint you would almost certainly have no choice but to teach a computer all the specific rules governing the phonetics of a word and its associated syllables. This can be done and has been done. Yet it is generally very time-consuming, and you might not have the linguistic expertise to do so. Moreover, the difficulty of this approach depends very much on the language in question. In my case it was my own, namely German. You see, in the German language we have some very nasty rules concerning syllables, which is especially true for compound words. Compounds are generally considered to be a beautiful part of our language, but they can get quite long and are therefore really hard to capture in the form of a general rule.

Then there is the pragmatist's way, which some would consider cheating. It consists of using a Naive Bayes classifier to do just that: naively estimate the number of syllables in a given word. The naivety of this approach refers to treating the features as if they were statistically independent of each other given the class; in text classification these features are normally the words of a sentence (or a longer text). This assumption is certainly false (also with regard to the features used for the syllable count), but it is practical, easy to do and performs quite well in most real-world applications. In addition, the general principle on which the Naive Bayes Classifier is built is well suited for simple one-dimensional classification tasks.

Naive Bayes Classification tries to estimate the probability of an attribute for an entity given the absence or presence of specific features of that entity. For example: we know a word has 3 vowels and a total length of 7 characters. What are the chances of it having three syllables? What are the chances of two? And so on and so forth. For this to work we need a training corpus from which we can extract the specific features as well as the attributes we want to know. The Naive Bayes Classifier is therefore considered a supervised classification approach, because it first has to be "taught" which probabilities to assign. Much of the quality you get out of this technique depends on the corpora used in training the classifier.
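The question "3 vowels, 7 characters, how many syllables?" can be worked through by hand on a tiny example. The following is a minimal sketch in plain Python, not NLTK's implementation: the function names, the made-up toy data and the add-one smoothing are all assumptions chosen to keep the arithmetic easy to follow.

```python
from collections import Counter, defaultdict

def train_toy_bayes(samples):
    """Count labels and feature values in a list of (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature name) -> value counts
    seen_values = defaultdict(set)        # feature name -> all values seen
    for features, label in samples:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            seen_values[name].add(value)
    return label_counts, value_counts, seen_values

def posterior(model, features):
    """Return P(label | features) under the naive independence assumption."""
    label_counts, value_counts, seen_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = n / total                 # prior P(label)
        for name, value in features.items():
            seen = value_counts.get((label, name), Counter())[value]
            # Add-one smoothing keeps unseen values from zeroing the product.
            score *= (seen + 1) / (n + len(seen_values[name]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

# Made-up toy corpus: (features, syllable count).
toy = [
    ({'vowels': 1, 'length': 3}, 1),
    ({'vowels': 1, 'length': 4}, 1),
    ({'vowels': 2, 'length': 5}, 2),
    ({'vowels': 2, 'length': 6}, 2),
    ({'vowels': 3, 'length': 7}, 3),
]
model = train_toy_bayes(toy)
probs = posterior(model, {'vowels': 3, 'length': 7})
best = max(probs, key=probs.get)   # the most probable syllable count
```

Each feature contributes one factor to the product, and the label with the largest product wins. With real data the same idea applies, only with many more words behind each count.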

What you need for this to work:

Basic knowledge of the Python programming language.
An installation of the NLTK package for Python.
A text corpus of words and their associated syllable counts in a specific language.

A simpler solution (not necessarily the most accurate):

Thanks to the Natural Language Toolkit (NLTK) for Python, you do not have to program your own Naive Bayes Classifier. It comes fresh out of the box, with excellent documentation and as part of a larger toolkit, which really helps with almost any task concerning textual data.

The installation is pretty easy and is described in greater detail here. After a successful installation you just fire up Python and type in your usual import statement:

import nltk

There are also a couple of other modules you will need: the re module for using regular expressions (always nice when dealing with texts), the random module to prepare the training corpus and the pickle module to serialize the resulting classifier, so you do not have to go through the entire process every time you want to classify some words.

import re

import random

import pickle

Now you need a training corpus,

which means you need a bunch of words in a language of your choice and the number of syllables for each of these words. In my case there was no readily available and nicely formatted corpus, so I ended up copying all the available examples for the number of syllables in the German language from the website of the "Duden", Germany's foremost dictionary. In addition, I added some special cases and lots of randomly selected words from the Duden. Since the Duden does not want to share its data freely, I will not post the complete list here, nor will I show you how to get it. Suffice it to say that it was not that hard. If you want to do students of the German language a solid, please do write a nice letter to the Duden asking them to provide academics with a licensed API to directly access their content. It would make life so much easier.

The resulting list looked something like this:

wordlist = """Freun-de
Män-ner
Mül-ler
Mül-le-rin ...

To generate the corpus to train our classifier, we need to create a list which contains each entity we want to classify (i.e. the words) and the known classification for that entity (i.e. its syllable count) in the form of a tuple.

syllables = [(word.replace('-', '').lower(), word.count('-')+1)
             for word in wordlist.split('\n')]
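One thing to watch out for with the plain split is that an empty line (for example a trailing newline at the end of the list) turns into an empty "word" with a syllable count of 1. A minimal sketch of a slightly more defensive version, using only the four example words shown above; the helper name `parse_hyphenated` is an assumption for illustration:

```python
def parse_hyphenated(wordlist):
    """Turn lines like 'Mül-le-rin' into (word, syllable count) tuples."""
    pairs = []
    for line in wordlist.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines instead of counting them as words
        # Hyphens mark syllable boundaries, so n hyphens mean n + 1 syllables.
        pairs.append((entry.replace('-', '').lower(), entry.count('-') + 1))
    return pairs

sample = """Freun-de
Män-ner
Mül-ler
Mül-le-rin
"""
print(parse_hyphenated(sample))
# → [('freunde', 2), ('männer', 2), ('müller', 2), ('müllerin', 3)]
```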

Extracting the features:

We need to define the features we want to use for the prediction of the syllable count. This requires a bit of tinkering and at least some basic knowledge about the structure of the language you are dealing with. Since you can evaluate the accuracy with which your model predicts the right syllable count for words (with a known count), you can use a trial-and-error approach here: define some features, test their accuracy, change the features, keep what has worked, rinse and repeat. After a couple of tries, this is what I came up with as a list of word features:

Length of the word.
Number of vowel groups (runs of consecutive vowels) in the word.
Number of consonant groups (runs of consecutive consonants) in the word.

This does not seem like much, but there is a certain problem with defining too many features which are not independent of each other. As mentioned earlier, this would violate the basic assumptions of the model (too much) and lead to "overfitting" of the model and possibly statistical artifacts. Therefore you want as few features as possible that still provide the best model for your data. These are some of the features I toyed with, which did not make it into the final model:

Presence of double consonants.
Presence of double vowels.
Percentage of vowels as part of a single word.
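For readers who want to experiment with these discarded candidates, here is a minimal sketch of how they could be computed. The function name, the feature keys and the reading of "double" as the same letter twice in a row are all assumptions; the original versions of these features are not shown in this post.

```python
import re

VOWELS = 'aeiouäöüy'

def extra_features(word):
    """Candidate features that did not make it into the final model."""
    word = word.lower()
    n_vowels = sum(1 for ch in word if ch in VOWELS)
    return {
        # The same consonant twice in a row, e.g. 'll' in 'Müller'.
        'double_consonant': bool(re.search(r'([bcdfghjklmnpqrstvwxzß])\1', word)),
        # The same vowel twice in a row, e.g. 'oo' in 'Boot'.
        'double_vowel': bool(re.search(r'([aeiouäöüy])\1', word)),
        # Share of vowel letters in the word (0.0 for an empty string).
        'vowel_share': n_vowels / len(word) if word else 0.0,
    }

print(extra_features('Boot'))
# → {'double_consonant': False, 'double_vowel': True, 'vowel_share': 0.5}
```

Note that the vowel share is derived from the word length and the vowel count, which is a good illustration of the kind of dependence between features mentioned above.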

There is a way to choose between those features, which I will talk about later.

First we shall take a look at the specific data format those features need to have. In order to be accepted by the NLTK classifier they have to come in the form of a Python dictionary. Since you might want to redo the classification on different entities (i.e. words that are not contained within the training corpus), the extraction of the features should be done via a function.

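def syllables_features(word):
    # Build the feature dictionary that the NLTK classifier expects.
    D = {}
    D['wordlength'] = len(word)
    # The '+' makes each run of consecutive vowels count as one match.
    D['vowels'] = len(re.findall(r'[aeiouäöüy]+', word))
    # Likewise, each run of consecutive consonants counts once.
    D['consonants'] = len(re.findall(r'[bcdfghjklmnpqrstvwxz]+', word))
    return D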
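To see what the extractor actually returns, it helps to run it on a few words. The sketch below repeats the function so that it runs on its own; the example words are my own choice. It also shows two properties of the character classes: they only contain lowercase letters, so an uppercase initial is not counted, and "ß" is in neither class.

```python
import re

def syllables_features(word):
    D = {}
    D['wordlength'] = len(word)
    D['vowels'] = len(re.findall(r'[aeiouäöüy]+', word))
    D['consonants'] = len(re.findall(r'[bcdfghjklmnpqrstvwxz]+', word))
    return D

# 'eu' is one vowel run and 'fr' / 'nd' are one consonant run each.
print(syllables_features('freunde'))
# → {'wordlength': 7, 'vowels': 2, 'consonants': 2}

# The uppercase 'A' is not matched, so lowercase the word first.
print(syllables_features('Apfel')['vowels'])          # → 1
print(syllables_features('Apfel'.lower())['vowels'])  # → 2
```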

A corpus for training:

The corpus also needs to come in a certain shape. It has to be a list consisting of tuples, which have to hold the features of a given entity (our single words) and the category (syllable count) one wishes to estimate. So the final tuples need to look like this:

({'feature 1':False ... }, category)

As a first step the list of words needs to be cleaned up a little bit (all lower case, count of syllables):

syllables = [(word.replace('-', '').lower(), word.count('-')+1)
             for word in wordlist.split('\n')]

Because the content of the list is almost certainly not randomly distributed, we need to give it a quick shuffle:

random.shuffle(syllables)

Then the features are extracted and placed in a similar list in the above mentioned format, to provide us with the featuresets to train the classifier:

featuresets = [(syllables_features(word), anzahl) for (word, anzahl) in syllables]

Training the classifier:

Before the training can begin, a small subset of the featuresets should be held back to test the performance and quality of the classifier later on:

train_set, test_set = featuresets[20:], featuresets[:20]

The rest is as easy as it gets:

classifier = nltk.NaiveBayesClassifier.train(train_set)

With this trained classifier, you can now enter the features of new words (i.e. words not in the training corpus) and estimate their syllable count, which looks something like this:

classifier.classify(syllables_features('Großmutter'.lower()))

For this to work, the word needs to be lowercased like the training words and wrapped in the same feature extraction function that was used before in the construction of the featureset.
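The pickle module was imported earlier so that the trained classifier does not have to be rebuilt every time. A minimal sketch of saving and restoring it could look like this; the helper names and the file name are assumptions, and a plain dictionary stands in for the classifier so that the example runs without a trained model.

```python
import os
import pickle
import tempfile

def save_classifier(classifier, path):
    """Serialize a trained classifier (or any picklable object) to disk."""
    with open(path, 'wb') as handle:
        pickle.dump(classifier, handle)

def load_classifier(path):
    """Restore a previously saved classifier. Only load files you trust."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# Round trip with a stand-in object; pass the trained classifier instead.
stand_in = {'stand-in': 'classifier'}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'syllable_classifier.pickle')  # hypothetical name
    save_classifier(stand_in, demo_path)
    restored = load_classifier(demo_path)
```

After loading, the restored object can be used exactly like the original, as long as the same feature extraction function is available.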

Trial and Error

So how can you evaluate how well the trained classifier actually works? This is where the retained featureset (the test set) comes in handy. Since we know the true number of syllables for those words, we can easily check how well the classifier works on them. We can also use this test to evaluate different featuresets, as long as the features used for training and for testing are the same. NLTK provides a function for this as well.

nltk.classify.accuracy(classifier, test_set)

This rating can differ from run to run because of the selection of words into the test_set (remember the shuffle()). To get a more reliable figure one could either use cross-validation, meaning we split the data a couple of times into different non-overlapping training and test sets, or use a kind of Monte Carlo procedure and run the estimation a couple of times on fresh random splits. In both cases the average accuracy can be used as a more stable measure.
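Both procedures can be sketched in a few lines. The two helpers below are assumptions for illustration, not part of NLTK: they take the training function and the accuracy function as arguments, so they run without NLTK and work with any classifier that follows the same pattern.

```python
import random

def k_fold_accuracy(featuresets, train_fn, accuracy_fn, k=5):
    """Average accuracy over k non-overlapping test folds."""
    scores = []
    for i in range(k):
        # Fold i holds every k-th item; the rest is used for training.
        test_set = featuresets[i::k]
        train_set = [fs for j, fs in enumerate(featuresets) if j % k != i]
        scores.append(accuracy_fn(train_fn(train_set), test_set))
    return sum(scores) / len(scores)

def repeated_holdout_accuracy(featuresets, train_fn, accuracy_fn,
                              runs=20, test_size=20, seed=None):
    """Average accuracy over several random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = featuresets[:]      # copy, so the caller's list is untouched
        rng.shuffle(shuffled)
        test_set, train_set = shuffled[:test_size], shuffled[test_size:]
        scores.append(accuracy_fn(train_fn(train_set), test_set))
    return sum(scores) / len(scores)
```

With the objects from this post, a call would presumably look like k_fold_accuracy(featuresets, nltk.NaiveBayesClassifier.train, nltk.classify.accuracy). Running the same call with featuresets built from different feature functions is one way to compare candidate features.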


Counting Word Syllables with a Naive Bayes Classifier (the quick and dirty way)