By Gerben de Vries and Rosa Stern
At Wizenoze we have developed a state-of-the-art algorithm that accurately predicts the reading level of a text on a five-level scale. In this blog post we want to give some insight into the nuts and bolts of our technology.
Measuring the readability of a text has a long history, mainly in the educational domain. The standard way to do this is with hand-crafted readability formulas. For instance, the popular online tool Readability-Score uses these kinds of formulas to perform its readability assessment.
One of the most well-known examples of such a formula is the Flesch-Kincaid grade level score, which looks like this:

0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59
At its core, the score is a combination of the average number of words in a sentence and the average number of syllables in a word.
Essentially all readability formulas look like this. They combine a few superficial textual properties, e.g. the number of words and the number of sentences, in a relatively simple mathematical formula. This formula is manually tuned using a small set of example documents on different reading levels.
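To make this concrete, here is a minimal Python sketch of such a formula. The Flesch-Kincaid coefficients are the published ones; the syllable counter is our own naive stand-in (it just counts groups of consecutive vowels), not an official syllabification rule:

```python
import re

def count_syllables(word):
    # Naive heuristic: one syllable per run of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Split into sentences and words with simple punctuation rules.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # The published Flesch-Kincaid grade level formula.
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Very easy text can even score below grade 0.
print(flesch_kincaid_grade("The cat sat on the mat."))
```

Note how little of the text this actually looks at: two counts and a ratio, with everything about vocabulary, grammar, and meaning ignored.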
We think that such a simple model can never capture all the complexities involved in determining the readability of a text. Accurate readability analysis requires both more linguistic knowledge and a more complex model.
The Power of Machine Learning
Hand-crafting complex formulas (or models) is extremely difficult, especially in domains where it is hard to make the human knowledge explicit. Readability is such a domain. Finding good predictive models in these domains is one of the strengths of Machine Learning (ML).
Learning predictive models with ML requires training data, the more the better. In our case this means documents for which we know the reading level. So, we collected over 100,000 documents from all kinds of different sources and reading levels: schoolbooks, news articles, web texts, etc. The labelling for these documents is very heterogeneous: sometimes very precise, sometimes very coarse. Therefore, we mapped all these different labels to our own 5-point reading level scale.
For each document in our training data we extract a large number of features using our NLP pipeline, which we describe below. These features include the traditional readability formulas, like the Flesch-Kincaid score, but also many others. Our machine learning algorithm learns how each feature contributes to making an accurate readability prediction for a new document.
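The core idea can be sketched with a toy classifier. The snippet below is illustrative only, not Wizenoze's actual model: it represents each document as a feature vector, computes one centroid per reading level from labelled examples, and assigns a new document to the nearest centroid. The two features shown are hypothetical placeholders for the much richer feature set described in this post:

```python
# Toy nearest-centroid classifier: each document is a feature vector,
# and the "model" is learned from labelled examples per reading level.

def centroid(vectors):
    # Mean of a list of equal-length feature vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(examples):
    # examples: {level: [feature_vector, ...]} -> one centroid per level.
    return {level: centroid(vecs) for level, vecs in examples.items()}

def predict(model, vector):
    # Assign the level whose centroid is nearest (squared Euclidean distance).
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vector))
    return min(model, key=lambda level: dist(model[level]))

# Features: [avg words per sentence, avg syllables per word] (made up here).
training = {
    1: [[6.0, 1.2], [7.5, 1.3]],    # easy texts: short sentences, simple words
    5: [[24.0, 1.9], [28.0, 2.1]],  # hard texts: long sentences, complex words
}
model = train(training)
print(predict(model, [8.0, 1.25]))  # a short, simple document
```

A real system would of course use far more features, far more data, and a much more powerful learning algorithm, but the principle is the same: the model, not a human, decides how to weigh the features.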
The Power of Natural Language Processing
From a linguistic perspective, the readability of a text is determined by much more than a few superficial textual features. For example, does the reader know most of the words? Does the text contain complex grammatical structures? Are there enough connectives to explain the flow of the text? Is the text about a lot of different concepts?
We use modern Natural Language Processing (NLP) techniques to automatically extract a rich set of linguistic features that directly and indirectly relate to readability. As an example, consider the following sentence (taken and adapted from a New York Times article):
Months after Britain voted to leave the European Union, the first tangible victim of that decision is identified: Marmite, a sludgy and odd-tasting breakfast spread.
To compute the Flesch-Kincaid grade level score for this sentence, we only need the following information:
- the text has 1 sentence,
- the text has 30 words,
- and the text has 43 syllables.
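Plugging these three counts into the formula gives the grade level directly:

```python
# Flesch-Kincaid grade for the example sentence, from the counts above.
sentences, words, syllables = 1, 30, 43

grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
print(round(grade, 1))  # roughly grade 13, i.e. college-level reading
```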
On the other hand, here is a partial view of the internal representation of the text used by our machine-learning-based model.
Amongst a lot of other things, we identify:
- grammatical structure (a passive construction is used),
- named entities (Britain is a location, Marmite is a product),
- and the part of speech for each word (months is a noun, voted a verb).
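As an illustration of what such a richer representation might look like, here is a sketch in Python. The field names and label set are our own for this example, not the actual format our pipeline produces:

```python
# Illustrative sketch of an annotated sentence: tokens with parts of
# speech, named entities, and sentence-level flags such as passive voice.
annotated = {
    "tokens": [
        {"text": "Months", "pos": "NOUN"},
        {"text": "after", "pos": "ADP"},
        {"text": "Britain", "pos": "PROPN"},
        {"text": "voted", "pos": "VERB"},
        # ... remaining tokens elided ...
    ],
    "entities": [
        {"text": "Britain", "label": "LOCATION"},
        {"text": "Marmite", "label": "PRODUCT"},
    ],
    "sentence_flags": {"passive_construction": True},
}

# Annotations like these can be aggregated into numeric features, e.g.
# the fraction of tokens that are nouns:
tokens = annotated["tokens"]
noun_ratio = sum(t["pos"] in ("NOUN", "PROPN") for t in tokens) / len(tokens)
print(noun_ratio)
```

Each such aggregate becomes one more feature for the machine learning model to weigh.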
This means that NLP allows us to use far more linguistic knowledge in our readability analysis than traditional readability formulas.
Using both natural language processing and machine learning, we created a model for readability prediction that is far more accurate and insightful than standard readability formulas. Try it at wizescan.com, use our Chrome plugin, or call our API directly, and you will see that this blog post was written at level 4!
Gerben de Vries
Applied Scientist PhD
Researcher, machine learning, data mining, semantic web, kernel methods, analytical thinker, java programmer, sailing, vegetarian cooking, vinyl records, merino wool clothing, playing and watching football, Amsterdam, blizzard games.