Centering Resonance Analysis Using NLTK and NetworkX

Veröffentlicht am 9. März 2015 von Jan Riebling

Centering Resonance Analysis¶

CRA is a network-based method for the untrained classification of documents and entire corpora, which was proposed bei Corman et al. in their 2002 article „Studying Complex Discursive System“. It is inspired by linguistic ideas about the textual „centers“, i.e. the words (mostly nouns) who contribute the most to the specific meaning of the text. To find those centers they apply structural network measures, which allow for the identification of the most central noun phrases in a text. The „resonance“ of a text is therefore defined as the measure of similarity between two texts in terms of their respective centers.

Python provides two packages, which make the implementation of this method pretty straightforward. The NLTK-Package (Natural Language Toolkit) provides realtively fast methods to deal with the linguistical side of things. Most importantly, algorithms for part-of-speech tagging, which are necessary to identify the central noun phrases of the text. The second important package is NetworkX. It implements a powerful network class as well as algorithms to analyse those networks in terms of centrality.

Corman et al. describe their method as a four-step process, which consists of selection, indexing, linking and the application of this model. Putting the actual apllication last, serves as a reminder that theoretical and methodological issues are to be considered on each step. Since our focus here lies on implementation instead of interpretation we do discuss some of these issues but are generally more concerned with showing how to apply this method using python. We use a portion of the Reuters Newswire Corpus which comes with NLTK and consists of roughly 10.000 short news for illustration.

Selection¶

In the selection stage, the corpus is reduced to its noun phrases:

CRA parses an utterance into its component noun phrases. A noun phrase is a noun plus zero or more additional nouns and/or adjectives, which serves as the subject or object of a sentence. Determiners (the, an, a, etc.), which can also be parts of noun phrases, are dropped in CRA analyses. Thus, a noun phrase is represented by one or more words, and a sentence can consist of one or more noun phrases. (Corman et al. 2002:174)

This makes it necessary to find the right part-of-speech tag for each word in the corpus. Unfortunately, the Reuters News Corpus does not come with already tagged words. Instead a POS-Tager is needed to identify nouns and adjectives. NLTK provides the Penn Treebank Tagger for these tasks. Before the tagging itself stopwords should be removed from the text because they carry very little semantic information and the reduction of the text makes computing the pos-tags much easier. Additionally only a portion of the Reuters corpus (the first 1000 documents) will be used, since both the pos tagging as well as the network analytics can become very computationally expensive.
Furthermore, the selection step as well as linking (i.e. network construction) need to be wrapped in a function definition, to preserve the ability to differentiate the documents according to the fileids or categories. This function will be introduced after a short exemplary look at one document and the single steps leading up to the network construction stage.

In [53]:

from nltk.corpus import reuters
from nltk.corpus import stopwords

stopwords = stopwords.words() + ['.', ',', '"', "'", '-', '.-']
first_doc = reuters.sents(reuters.fileids()[0])
stopd_sents = [[token.lower() for token in sent if token.lower() not in stopwords] 
               for sent in first_doc]

for token in stopd_sents[0]:
    print token

asian
exporters
fear
damage
japan
rift
mounting
trade
friction
japan
raised
fears
among
many
asia
exporting
nations
row
could
inflict
far
reaching
economic
damage
businessmen
officials
said

Using the Penn-Treebank Tagger on the sentences is rather straightforward:

In [41]:

import nltk

## In case you have NLTK 3, the pos_tag_sentence() function 
## for a nested text structure will not work.
## You can install the old Nltk 2.0.4 as described here: 
## http://www.pitt.edu/~naraehan/python2/faq.html
## Or you can use the list comprehension below:

tagged_sents = [nltk.pos_tag(sentence) for sentence in stopd_sents]

Using the tagger results in a list-of-tuples, where the first entry gives us the word and the second one contains the part-of-speech tag. NLTK provides a help function (nltk.help.upenn_tagset(tagpatter)) which supports regular expressions to look up the meaning of the pos-tags. Since we are only interested in nouns and adjectives, we only take a look at them.

In [42]:

print nltk.help.upenn_tagset('NN?|JJ?')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
None

We can now tag all sentences in the corpus and extract their compound noun phrases.

In [43]:

import re
for token, tag in tagged_sents[0]:
    if re.match(r'NN*|JJ*', tag):
        print token, tag

asian JJ
exporters NNS
damage NN
japan NN
rift NN
trade NN
friction NN
japan NN
fears NNS
many JJ
asia NN
nations NNS
row NN
economic JJ
damage NN
businessmen NNS
officials NNS

In [44]:

noun_phrases = [[token for token, tag in sent if re.match(r'NN*|JJ*', tag)] 
                for sent in tagged_sents]

Network Construction (Linking)¶

During this step the network is constructed from the co-occurrence of adjectives and nouns within the same sentence.

To begin with, all words comprising the centers in the utterance are linked sequentially. In the majority of cases, where noun phrases contain one or two words, the sequential connections capture all the linkage intended by the author because no higher-order connections are possible without crossing the boundaries of the noun phrases. However, there are cases where three or more words are contained in a single noun phrase. In that case, the sequential links do not exhaust the connectedness possible in the set created by the author. Hence, we link all possible pairs of words within the noun phrases. For example, the phrase “complex discursive system” would generate the links: complex-discursive, discursive-system, and complex-system. (Corman et al. 2002: 176)

In Python we can realize this with the itertools-module, This module provides methods for iterator construction, meaning a fast way to generate combinations without repetitions from a given input sequence.

In [52]:

import itertools as it
edgelist = [edge for phrase in noun_phrases for edge in it.combinations(phrase, 2)]

The result is a list of tuples which contain the edges of the textual centers. These edges can now be made into a network by simply initialising them as a NetworkX Graph class.

In [46]:

import networkx as nx
G = nx.Graph(edgelist)

Finding the Linguistic Centers (Indexing)¶

The next step in Centering Resonance Analysis uses structural network measures to characterise and analyse the content of the texts. To find the most central words within the network Corman et al. propose the usage of the betweenness centrality measure.

In [47]:

index = nx.betweenness_centrality(G)

The result of the NetworkX algorithm is a dictionary with the nouns and adjectives as key. We can then take a look at the ranking of the words according to their betweenness centrality. Because Python dictionaries cannot be sorted itself we have to transfer them to a list-of-tuples which then can be sorted.

In [48]:

sorted_index = sorted(index.items(), key=lambda x:x[1], reverse=True)

# Top 10 noun phrases by betweenness centrality:
for word, centr in sorted_index[:10]:
        print word, centr

trade 0.256197406309
japan 0.184346707864
exports 0.12946650583
japanese 0.075589651494
imports 0.0667662329149
minister 0.0623348220094
year 0.0477963334009
electronics 0.0472099550373
tariffs 0.0444189296598
kong 0.0354607093611

The resulting network can easily be plotted:

In [50]:

%pylab inline
%config InlineBackend.figure_format = 'png'
plt.rc('figure', figsize=(12, 7))
G.remove_nodes_from([n for n in index if index[n] == .0])
node_size = [index[n]*10000 for n in G]
pos = nx.spring_layout(G)
nx.draw_networkx(G, pos, node_size=node_size, edge_color='y', alpha=.4, linewidths=0)

As mentioned earlier, the steps leading up to the construction of the network can be generalized into a single function.

In [14]:

def cra_network(doc, tags='NN*|JJ*'):
    tagged_doc = [nltk.pos_tag(sentence) for sentence in doc]
    noun_phrases = [[token.lower() for token, tag in sent if re.match(tags, tag)] 
                    for sent in tagged_doc]
    edgelist = [edge for phrase in noun_phrases for edge in it.combinations(phrase, 2)]
    graph = nx.Graph(edgelist)
    return graph

def cra_centered_graph(doc, tags='NN*|JJ*'):
    tagged_doc = [nltk.pos_tag(sentence) for sentence in doc]
    noun_phrases = [[token.lower() for token, tag in sent if re.match(tags, tag)] 
                    for sent in tagged_doc]
    edgelist = [edge for phrase in noun_phrases for edge in it.combinations(phrase, 2)]
    graph = nx.Graph()
    for edge in set(edgelist):
        graph.add_edge(*edge, frequency=edgelist.count(edge))
    betweenness = nx.betweenness_centrality(graph)
    for n in graph:
        graph.node[n]['betweenness'] = betweenness[n]
    return graph

The first function takes a document consisting of a list of sentences, whereby each sentence is itself a list of word tokens. In keeping with the example the tag keyword argument defaults to adjectives and nouns. This can easily be changed via regular expressions. The second function utilizes the networkx class to a greater extent, by including the betweenness centrality as a node attribute and the frequency of every edge as an attribute of that edge. This will be useful in later steps of the analysis, because it allows us to store the constructed networks more easily and keep informations, which would otherwise be lost in the transformation process.

Saving the resulting graph as a file including all its attributes can be done via the many read_ and write_ functions supplied by NetworkX. For example as a GEXF file (write_gexf()) which is essentially a modified XML representation of the graph and can be understood by a wide range of other network analysis tools.

Computing the Resonance¶

Last but not least we need a way to compute the resonance, which is a similarity measure based on comparing the different texts according to their respective centers. Corman et al. 2002 present two ways to calculate the resonance between texts. We start by looking at the simple way first, which claims to be espescially suited for broader comparisons:

The more two texts frequently use the same words in influential positions, the more word resonance they have. The more word resonance they have, the more the communicators used the same words, and the more those words were prominent in structuring the text’s coherence. Word resonance is a more general measure of the mutual relevance of two texts, and has applications in the modeling of large corpora. (Corman et al. 2002:178)

They give the following formula:

\begin{equation*} WR_{AB} = \sum_{i=1}^{N(A)} \sum_{j=1}^{N(B)} I_i^A \cdot I_j^B \cdot \alpha_{ij}^{AB} \end{equation*}

Where \(I_i^A\) represents the betweenness centrality of the \(i\)-th word in text \(A\). \(I_j^B\) is the same for text \(B\) and \(\alpha_{ij}^{AB}\) is a indicator function, which is \(0\) if the \(i\)-th word of text \(A\) and the corresponding word of the same rank in text \(B\) are not the same. If they are the same, \(\alpha\) becomes \(1\). Therefore only words with the same rank are compared.

In [2]:

def simple_resonance(index_a, index_b):
    sorted_a = sorted(index_a, key=lambda x:x[1], reverse=True)
    sorted_b = sorted(index_b, key=lambda x:x[1], reverse=True)
    scores = [a[1]*b[1] for a, b in zip(sorted_a, sorted_b) if a[0] == b[0]]
    return sum(scores)

To give an example, we need to create at least two centered network graphs to compare. These graphs do not necessarily need to be constraint to one specific document in the corpus. Instead we could treat larger (or smaller) collections of text as a single unit of analysis. When the chosen text is too small, chances are there will be a resonance of zero because there are too few similarities in the used words. Yet one has to keep in mind, that calculating the betweenness centrality of all noun phrases can become very computational expensive. For illustration purposes and the sake of simplicity a subset of the reuters corpus is chosen. The categories in the Reuters News Corpus are categorized according to economic goods and services which are mentioned in them. It therefore seems worthwhile to compare the different linguistic centers according to those economic categories. To keep it manageable, just two economic areas are chosen, namely „Finance“ and „Food“. Also, since the methods tends to become computationally expensive when confronted with larger corpora, only a subsample of the categories is used.

Finance: „income“,“trade“, „money-supply“
Food: „wheat“, „coffee“, „rice“, „sugar“

In [21]:

finance = ['income','trade', 'money-supply']
food = ['sugar', 'wheat', 'rice', 'coffee']

finance_docs = reuters.sents(categories=finance)
food_docs = reuters.sents(categories=food)

## As usual stopwords are removed from the documents and 
## lower case is used for all strings:

finance_stopd = [[token.lower() for token in sent if token.lower() not in stopwords] 
                 for sent in finance_docs]
food_stopd = [[token.lower() for token in sent if token.lower() not in stopwords] 
              for sent in food_docs]

finance_net = cra_centered_graph(finance_stopd)
food_net = cra_centered_graph(food_stopd)

## To store the data uncomment the following lines:
#nx.write_gexf(finance_net, 'finance.gexf')
#nx.write_gexf(food_net, 'food.gexf')

In [31]:

# To store the data uncomment the following lines:
nx.write_gexf(finance_net, 'finance.gexf')
nx.write_gexf(food_net, 'food.gexf')

The function cra_centered_graph() was used, because the time it took to create the graphs and calculate the measurements made it seem reasonable to save the data before calculating the resonance. From the resulting networks we can calculate the simple, unstandardized measure of resonance. But since the data is now part of the graph object, we need to retrieve it first and transform it into a tuple. This is made easy by the get_node_attributes() function which returns a Python dictionary of the chosen attribute for a given graph.

In [3]:

finance_index = nx.get_node_attributes(finance_net, 'betweenness').items()
food_index = nx.get_node_attributes(food_net, 'betweenness').items()

print simple_resonance(finance_index, food_index)

0.000703481101996

As Corman et al. note:

This measure is unstandardized in the sense that resonance will increase naturally as the two texts become longer in length and contain more common words. There are cases, however, where a standardized measure is more appropriate. For example, in positioning documents relative to one another (as described below), one does not necessarily want to overemphasize differences in document length, number of words, and so on. (Corman et al. 2002:178)

They therefore propose a standardized measure of resonance:

\begin{equation*} WR’_{AB} = WR_{AB} / \sqrt{ \sum_{i=1}^{N(A)} (I_i^A)^2 \cdot \sum_{j=1}^{N(B)} (I_j^B)^2 } \end{equation*}

Which transfered to Python code looks something like this:

In [4]:

import math

def standardized_sr(index_a, index_b):
    sum_a_squared = sum([centr**2 for word, centr in index_a])
    sum_b_squared = sum([centr**2 for word, centr in index_b])
    norm = math.sqrt((sum_a_squared * sum_b_squared))
    standardized = simple_resonance(index_a, index_b) / norm
    return standardized

In [30]:

print standardized_sr(finance_index, food_index)

0.0337743368476

As we can now see the standardized resonance lies between 0, which means two texts have no structural similarity regarding their centers and 1 meaning they are essentially the same (which they are in this example).

The second way to compare texts is their pair resonance. This measure uses co-occuring word pairs for its comparison, while the simple resonance score is computed on the basis of co-occuring words. For this we have to calculate the influence score of every word-pair weighted by the frequency with which it occurs in a given text:

\begin{equation*} P_{ij}^T = I_i^T \cdot I_j^T \cdot F_{ij}^T \end{equation*}

The influence score of the pairing of the \(i\)-th and \(j\)-th word in a text \(T\) is given by the product of there respective influence scores \(I\) times the frequency \(F\) of there co-occurence. The pair resonance \(PR_{AB}\) between to texts is then given by the following formula:

\begin{equation*} PR_{AB} = \sum_{i=1}^{N(A)-1} \left( \sum_{j=i+1}^{N(A)} \left( \sum_{k=1}^{N(B)-1} \left( \sum_{l=k+1}^{N(B)}\left[ P_{ij}^A \cdot P_{kl}^B \cdot \beta_{ijkl}^{AB} \right] \right) \right) \right) \end{equation*}

wherein \(\beta\) is a indicator function similar to \(\alpha\) in the case of the simple resonance. It is 1 if the word-pair exists in both text and 0 otherwise.

Since we have constructed the above network as a simple, undirected graph it contains no information on the frequency of specific word pairs, because the nx.Graph() class does not allow multiple edges between the same nodes. We could have used an nx.MultiGraph() instead. However many algorithms for the calculation of network measures are not well defined on graphs with multiple edges. Such is the case for the shortest path and therefore the betweenness centrality. There are basically two ways this problem can be solved. Because we stored the frequency of the edges already as an edge attribute we can easily retrieve this information and calculate the influence score for every word-pair (\(P_{ij}^T\)) from it. This can be done by calling the nx.get_edge_attributes() function on the graph and supplying it with the name of the attribute in question. This will return a dictionary whose keys are the tuples representing the graphs edges. .

In [13]:

finance_edgelist = nx.get_edge_attributes(finance_net, 'frequency')

In keeping with the general theme the influence scores can also be stored as graph attributes once they are derived. Making it much easier doing other calculations later on. Therefore the next function assumes a „centered“ graph with a frequency and betweenness attribute.

In [12]:

finance_edgelist.keys()[:10]

Out[12]:

[('republican', 'amendment'),
 ('confident', 'money'),
 ('problems', 'develop'),
 ('alleged', 'senators'),
 ('line', 'firms'),
 ('states', 'various'),
 ('initial', 'issue'),
 ('council', 'party'),
 ('officials', 'abe'),
 ('trend', 'growth')]

In [11]:

def pair_influence(graph, betweenness='betweenness', 
                   frequency='frequency', name='pair_i'):
    for u, v in graph.edges():
        index_u = graph.node[u][betweenness]
        index_v = graph.node[v][betweenness]
        score = index_u * index_v * graph[u][v][frequency]
        graph[u][v][name] = score

pair_influence(finance_net)
pair_influence(food_net)

In [17]:

for u,v in finance_edgelist.keys()[:10]:
    print finance_net[u][v]['pair_i']

2.48007651351e-08
7.42404500114e-07
3.31735558283e-07
5.06553406534e-10
7.78568005463e-07
5.12540823929e-06
6.71969228382e-08
6.97322850628e-07
5.52356746456e-08
1.27828019844e-05

Given the pair influence scores in the form of dictionaries of edges, the pair resonance can be computed like this:

In [18]:

finance_iscore = nx.get_edge_attributes(finance_net, 'pair_i')
food_iscore = nx.get_edge_attributes(food_net, 'pair_i')

def pair_resonance(index_a, index_b):
    edges = set(index_a.keys()) & set(index_b.keys())
    pr_score = sum([index_a[edge] * index_b[edge] for edge in edges])
    return pr_score

print pair_resonance(finance_iscore, food_iscore)

0.167702955524

As with the simple resonance score, there also exists a standardized version of the pair resonance. It is given by the formula:

\begin{equation*} {PR‘}_{AB} = PR_{AB} / \sqrt{\sum_{i=1}^{N(A)-1} \sum_{j=i+1}^{N(A)} (P_{ij}^A)^2 } \cdot \sqrt{\sum_{k=1}^{N(B)-1} \sum_{l=k+1}^{N(B)} (P_{kl}^B)^2 } \end{equation*}

As a function:

In [27]:

def standardized_pr(index_a, index_b):
    a_sqrt = math.sqrt(sum([index_a[edge]**2 for edge in index_a]))
    b_sqrt = math.sqrt(sum([index_b[edge]**2 for edge in index_b]))
    standardized = pair_resonance(index_a, index_b) / (a_sqrt * b_sqrt)
    return standardized

Which leaves us with the following pair resonance in our example:

In [29]:

print standardized_pr(finance_iscore, food_iscore)

0.0183041662733

Application (and Discussion)¶

The very low standardized resonance scores (simple as well as pairwise) suggest both corpora differ widely in the centers which describe them best. This notion is confirmed, when we take a closer look at the top 10 (ranked by their pair resonance scores) of noun phrases which make up those centers:

In [39]:

print 'Finance pair resonance scores:'
for edge, data in sorted(finance_iscore.items(), 
                         key=lambda x:x[1], reverse=True)[:10]:
    print edge, data

print '\nFood pair resonance scores:'
for edge, data in sorted(food_iscore.items(), 
                         key=lambda x:x[1], reverse=True)[:10]:
    print edge, data

Finance pair resonance scores:
('trade', 'trade') 6.89701830257
('japan', 'trade') 1.11887911679
('pct', 'pct') 0.338976169443
('trade', 'last') 0.258786723967
('trade', 'foreign') 0.248529931983
('trade', 'pct') 0.233836832373
('countries', 'trade') 0.229985306327
('japanese', 'trade') 0.227791385679
('year', 'trade') 0.196730820058
('year', 'pct') 0.186173644625

Food pair resonance scores:
('wheat', 'wheat') 0.758335541145
('sugar', 'sugar') 0.630456399413
('tonnes', 'tonnes') 0.417327032293
('mln', 'mln') 0.371389060241
('wheat', 'tonnes') 0.315850370804
('coffee', 'coffee') 0.21921416715
('mln', 'tonnes') 0.207453124546
('sugar', 'tonnes') 0.168802320836
('sugar', 'year') 0.157361672756
('sugar', 'mln') 0.115221229

One of the first things that can be noticed here is the strong influence of noun phrases which seem to be essentially „self loops“ in the network, i.e. an edge connecting to the same node from which it originated. These edges are no artifacts, but stem from the fact that the sentences in the Reuters News Corpus are defined as ending with a colon. Therefore subordinate clauses and sentences ending in semicolons are considered to be part of the overall sentence. This can be corrected by dropping all self loops from the network or tokenizing the text differently. Yet both methods lose some information about the text and depend strongly on the information one wants to retrieve from the corpora. Since self loops are an attribute of the edges of a network, they do not interfere with the betweenness centrality of a node. Therefore their influence on the resonance scores seems to be negligble in most cases. However if two texts were written in a very particular style, where for example one would use lots of subordinate clauses and long winded sentences and the other would be very concise and use only very short sentences the error could become very pronounced. This serves as a reminder to be careful in the application of computational methods of text analysis. A lot depends on ones understanding of the internal structure of the documents in question as well as ones own research interest.

Since the texts used here served only the purpose of illustrating the method, a more extensive interpretation of the findings in terms of the resonance scores or the networks properties would require more data and preferably a solid research question. Therefore we leave the discussion of the specific properties of those networks and instead discuss the method itself.

To discuss the merits of their approach Corman et al. differentiate between three goals of computational text analysis: inference, positioning and representation. Inference refers to the idea of modelling the text according to the statistical properties it has and thus deriving informed guesses on unknown properties, i.e. text classification. For example Naive Bayes Classification or similar (un-)supervised machine learning algorithms. Corman et al. criticise those approaches for their narrow domain of application. Since they must be trained on corpora with known properties they cannot easily be transfered to documents coming from a different context. Furthermore, the authors see a lack of linguistic and semiotic background in the case of the inference models. The second approach tries to position documents in relation to other documents, by placing them in a common semantic space. Prime examples would be LSA or LDA. According to Corman et al. the problem here is the reduction of the text to a vector within the semantic space, whose position depends very much on the quality of the spatial construction and is also only defined in terms of a narrow domain. It also becomes clear that the authors see the loss of information by reducing the documents to vectors as deeply problematic. Finally, they discuss representational techniques which try to preserve most of the linguistic structure of the text. Yet these techniques can also be problematic, which leads them to propose three criteria for a „good“ representational approach:

In particular, we see a need for a representational method that meets three criteria. First, it should be based on a network representation of associated words to take advantage of the rich and complex data structure offered by that form. Second, it should represent intentional, discursive acts of competent authors and speakers. Units of analysis and rules for linking words should be theoretically grounded in discursive acts. Third, it should be versatile and transportable across contexts. Making text representations independent of dictionaries, corpora, or collections of other texts would provide researchers maximum flexibility in analyzing and comparing different groups, people, time periods, and contexts. (Corman et al. 2002:172)

The Centering Resonance Analysis seems to meet those three criteria rather well. The network representation makes it indeed easy to use different methods to analyse the structure of text. For example, instead of the betweenness centrality other measures could have been used to index the network nodes. Yet choosing between different measures and analytical technique seems problematic, for the very same reason Corman et al. had in criticising inference and positional approaches. There is no linguistic or semiotic theory which would make it possible to map the properties of the text to the network in a way that would allow for a clear interpretation of the structural properties of the network. This problem stems mostly from the fact that the network representation assumes specific tokens to be fixed nodes whose edges are constructed via co-occurences in a fixed context. This makes it hard to distinguish between different contextual usages of the same tokens. Therefore synonyms and polynyms can be considered more problematic for the representational methods than for the positional ones. We also know, that words can have very different meanings and informational values depending on the sentence they are placed in as well as the rest of the document and even in relation to the entire corpus. This is the main reason for the use of weighting schemes in topic modelling, e.g. TFIDF which weighs the distribution of words according to local as well as global parameters. Because CRA also assumes the same weight for all tokens as part of a given context, all edges are considered essentially the same across all documents with the exception of their frequency.

The second point they make is the accurate representation of discursive acts in form of a graph consisting of the central noun phrases of a given text. And while this is clearly the main strength of this method it comes at somewhat high computational costs. Finding the correct part-of-speech tags can take quite some time and generally depends on the availability of a high quality pos-tagger for this specific language. It should also be noted, that the representation of linguistic centers is only one of many discoursive acts and construction principles which play a role in the production of documents, albeit a very important one. Other discoursive acts include for example the ordering of sentences according to their priority in an argument or the represenatation of a temporal process. These are commonly reffered to as the narrative structure.

The appeal of a flexible data structure which allows for an easy exchange and cooperation among researchers without the need to transfer the entire corpus, as mentioned in the third criteria, is not lost on the author. This was basically the main reason for implementing the CRA on top of the networkx graph class. As mentioned before, this make it easy to store the data as one coherent data structure as well as expanding upon the analysis. Using inheritance and other more advanced object oriented programming techniques would have made this integration even more rewarding. Yet those techniques are beyond the capabilities of the author and would have had required much more programming knowledge from the reader. It should also be noted, that for didactic reasons this code is more geared towards readability than performance. Using sparse arrays instead of lists would result in a substantial increase of speed for the calculation of the resonance measures. However, since the real bottleneck is the pos-tagging and the calculation of the betweenness centrality, the gain in computational speed seems to be negligible.

Counting Word Syllables with a Naive Bayes Classifier (the quick and dirty way)

Veröffentlicht am 2. September 2014 von Jan Riebling

The problem of Not-So-Simple-Measures

There are a couple of reasons why you could want to assess the difficulty in reading and understanding a text more objectively or without reading the text. You could for example want to know which level of proficiency in a language is necessary to understand a given text. This is what readability scores were created for. They are mostly used in evaluating medical instructions or selecting suitable reading materials for foreign language students. The majority of these tests use certain linguistic properties to calculate the readbility. Such as the average word length, the average length of sentences and the count of polysyllabic words (i.e. words with more than three syllables). And this is where the problem begins, because counting the syllables of a word is not an easy task and depends very much on the language in question. Some readability scores – most notably SMOG (Simple Measure of Gobledigook) which is sometimes considered to be the „gold standard“ of readability scores – are accessible online or implemented in an existing programm. But they almost always only provide support for the english language. Which can be a good reminder that they should not be applied a language which they were not designed for. But if you want to evaluate there usefullness across different languages, develop your own readability score or just tinker around with them, you need to find a way to count the syllables in a given word.

There are basically two ways to do this. The right way and the pragmatic way. From a linguistics standpoint you would almost certainly have no choice but to teach a computer all the specific rules governing phonetics of a word and its associated syllables. This can and has been done. Yet it is in generally very timeconsuming and you might not have the linguistic expertise to do so. Moreover, the difficulty of this approach depends very much on the language in question. In my case it was my own, namely german. You see, in the german language we have some very nasty rules concerning syllables, which is espescially true for composite words. They are generally considered to be a beautiful part of our language, but they can get quite long and therefore are reallly hard to implement in the form of a general rule.

Then there is the pragmatists way, which some would consider cheating. It consists of using a Naive Bayes Estimation to do just that, naively estimate the number of syllables in a given word. Normally the naivete of this approach refers to treating the words of a sentence (or a longer text) as if they were statistically independent of each other. This assumption is certainly false (also with regard to the syllable count) but it is practical, easy to do and performs quite well in most real-world applications. In addition, the general principle on which the Naive Bayes Classifier is built is perfectly suited for simple one-dimensional classification tasks.

Naive Bayes Classification tries to estimate the probability of an attribute for an entity given the absence or presence of specific features of that entity. For example: we know a word has 3 vowels and a total length of 7 characters, what are the chances for it having three syllables? What are the chances for two? And so on and so fourth. For this to work we need a training corpus from which we can extract the specific features as well as the attributes we want to know. The Naive Bayes Classifier is therefore considered a supervised classification approach, because it has first to be „taught“ which probabilities to assign. A lot of quality you get out of this technique depends on the corpora used in training the classifier.

What you need for this to work:

Basic knowledge of the Python programming language.
An installation of the NLTK package for Python.
A text corpus of words and their associated syllable counts in a specific language.

A simpler solution (not necessarily the most accurate):

Thanks to the Natural Language Toolkit (NLTK) for Python, you do not have to programm your own Naive Bayes Classifier. It comes fresh out of the box, with an exellent documentation and as part of a larger Toolkit, which really helps out with almost any tasks concerning textual data.

The installation is pretty easy and is described in greater detail here. After a successful Installation you just fire up python and type in your usual import statement:

import nltk

There are also couple of other modules you will need. The re-module for using regular expression (always nice when dealing with texts), the random-module to prepare the training corpus and the pickle-module to serialize the resulting classifier, so you do not have to go through the entire process every time you want to classify some words.

import re

import random

import pickle

Now you need a training corpus,

which means you need a bunch of words in a language of your choice and the number of syllables for each of these words. In my case there was no readily available and nicely formated corpus, so I ended up copying all the available examples for the number of syllables in the german language from the website of the „Duden„, Germanys foremost dictionary. In addition to that I added some special cases and lots of randomly selected words from the Duden. Since the Duden does not want to share his data freely I will not post the complete list here, nor will I show you how to get it. May it suffice to say, that it was not that hard. If you want to do students of the german language a solid, please do write a nice letter to the Duden asking them to provide academics with a licensed API to directly access there content. It would make live so much easier.

The resulting list looked something like this:

wordlist = """Freun-de
Män-ner
Mül-ler
Mül-le-rin ...

To generate the corpus to train our classifier we need to create a list which contains each entity we want to classify (i.e.: words) and the known classification for that entity (i.e.: its syllable count) in the form of a tuple.

syllables = [(word.replace('-', '').lower(), word.count('-')+1)
for word in wordlist.split('\n')]

Extracting the feautures:

We need to define the features we want to use for the prediction of the syllable count. This requires a bit of tinkering and at least some basic knowledge about the structure of the language you are dealing with. Since you can evaluate the accuracy with which your model predicts the right syllable count for words (with a known count), you can use and trial-and-error approach here: Define some features, test there accuracy, change the features, keep that which has worked, rinse and repeat. After a couple of tries, this is what I came up with as a list of word features:

Length of the word.
Number of vowels in the word
Number of consonants in a word

This seems not like much, but there is a certain problem with defining to many features which are not independent of each other. As mentioned earlier, this would violate the basic assumptions of the model (too much) and lead to „overfitting“ of the model and possibly statistical artifacts. Therefore you want as few features as possible, to provide the best model for your data. Those were some of the features I toyed with, which did not make it into the final model:

Presence of double consonants.
Presence of double vowels.
Percentage of vowels as part of a single word.

There is a way to choose between those features, which I will talk about later.

First we shall take a look at the specific data format those features need to have. In order to be accepted by the NLTK classifier they have to come in the form of a Python dictionary. Since you might want to redo the classification on different entities (i.e. words that are not contained within the training corpus), the extraction of the features should be done via a function.

def syllables_features(word):
    D = {}
    D['wordlength'] = len(word)
    D['vowels'] = len(re.findall(r'[aeiouäöüy]+', word))
    D['consonants'] = len(re.findall(r'[bcdfghjklmnpqrstvwxz]+', word))
    return D

A corpus for training:

The corpus also needs to come in a certain shape. It has to be a list consisting of tuples, which have to hold the features of a given entity (our single words) and the category (syllable count) one wishes to estimate. So the final tuples need to look like this:

({'feature 1':False ... }, category)

As a first step the list of words needs to be cleaned up a little bit (all lower case, count of syllables):

syllables = [(word.replace('-', '').lower(), word.count('-')+1)
for word in wordliste.split('\n')]

Because the content of the list is almost certainly not randomly distributed, we need to give it a quick shuffle:

random.shuffle(syllables)

Then the features are extracted and placed in a similar list in the above mentioned format, to provide us with the featuresets to train the classifier:

featuresets = [(syllables_features(word), anzahl) for (word, anzahl) in syllables]

Training the classifier:

Before the training can begin, a small subset of the feautureset should be retained, to test the performance and quality of the classifier later on:

train_set, test_set = featuresets[20:], featuresets[:20]

The rest is as easy as it gets:

classifier = nltk.NaiveBayesClassifier.train(train_set)

with this trained classifier, you can now enter the feautures of new words (i.e. not in the training corpus) and estimate there syllable count. Which looks something like this:

classifier.classify(syllables_features('Großmutter'))

For this to work, the word needs to be wrapped in the same function to extract the features as was used before in the construction of the featureset.

Trial and Error

So how can you evaluate how good the trained classifier actually works? This is where the retained featureset (the testset) comes in handy. Since we know the true number of syllables for those words, we can easily check how well the classifier works on them. We can also use this test to evaluate different featuresets, as long as the feautureset used for training and for testing are the same. NLTK provides a function for this as well.

nltk.classify.accuracy(classifier, test_set)

This rating can be different because of the selection of words into the test_set (remember the shuffle()). To get a more reliable figure one could either use crossvalidation, meaning we split the feature- and testsets a couple of times into different non-overlapping intervalls or we use a kind of monte-carlo procedure and run the estimation a couple of times. In both cases the average accuracy can be used as a more predictable measure.

Thinking outside the rectangle

Veröffentlicht am 2. Januar 2014 von Jan Riebling

The following should not be considered a full fledged argument but rather as general musings on a certain approach to data and its asociated problems.

When social scientists hear the word „data“, chances are they picture it as a rectangle. We are used to thinking about data as datasets, dataframes or more generally as tables made up of rows (observations) and columns (variables). This is how it is normally taught to students and how most of us work with data professionally. Consequently, we spend many hours on preparing the right rectangle for a given problem. Transforming, maintaining and analyzing data is done mostly in the logic of the dataframe.

This approach to data has its merrits. First of all, it is easy to comprehend and understand since most people are familiar with this kind of data structure. It is also easy to use when writing syntax as most statistic programmes are build around the notion of a rectangular dataset. Secondly, the dataframe fits our methodology. Almost all of our methods are defined by statistical operations which are performed on observations and variables or both. Finally, rectangular or flat data sets are also somewhat of a common ground in social science. They allow for discussion, criticism and exchange of datasets between researchers.

That being said, there are some limitations inherent in a strictly „rectangular“ approach to data. While we should by no means abbandon the dataframe, it pays to see more clearly, what we can and cannot do with it.

What is your rectangle made of?

Most of the common statistic programmes and languages provide the user with a flat data structure. Even so these structures look alike, they can be very different in terms of their actual implementation. Take for example the R data.frame objects and the DataFrame provided py pandas. They have a very similar functionality and feel but are implemented quite differently (since they are from two different languages, this is to be expected). R’s data.frame builds around the native datastructures of vectors and list while pandas is more of a wrapper around NumPy arrays who offers additional methods and more user-friendliness. This is neither the space, nor the right author to give a full account of all the differences between those two. Yet, something that many people who made the transition seem to struggle with is the more functional style of R versus the more object oriented of pandas. Even so the data structure seems to be the same (after all pandas is explicitly modeled after R’s data.frame), the ways of handling problems are not.

The problem I want to point out is this: Because we are used to the look and feel of it, we tend to ignore the actual differences between specific implementations. Although, this sounds rather trivial I do believe it to be the most problematic aspect of strict adherence to the „rectangle paradigm“. By treating data structures as if they were the same, we are essentially ignoring the possibility that there could be a better tool for the job than the one we are currently using. It also obscures the inner workings behind the data structures, which becomes a big problem as soon as you are trying to implement your own stuff or are switching from one framework to another.

How big can a rectangle get?

While rectangular data sets are functional and easy to use, they are not the most memory efficient structures. Arrays and matrices are faster in most cases, which is why most statistic frameworks convert the data to those formats before doing the actual analysis. Yet the real problem is more one of bad practice. „Big Data“ may be all the rage right now, in the social sciences it seems to be mostly a problem of data being not really too „big“ but rather too „large“ for a specific framework and the machine on which it is running. In most cases there is no real need for better algorithms or parallel computing. Part of the problem is keeping the entire data set in memory while in most cases only a fraction of it is actually used.

In my experience, a average data set in the social sciences has roughly 10 000 to 20 000 observations and around 5 000 variables. In principle this is manageable but can become tricky when it comes to transforming or reshaping the data in fundamental ways. Again, this depends heavily on the actual statistic software used for the task and reinforces what was said about knowing the actual implementation. Yet the problem becomes more pronounced when many data sets are combined as is common in cross-country research.

However, in most cases we only need a fraction of the original data. More specifically, 10 to 20 variables are on average enough. And those are in general not the problematic ones. We seldom need those pesky memory-eating string variables anyway. So the solution would be to keep the data as a whole in a data base and use a specialized language like SQL to construct your dataframe. The resulting data structure is not only smaller, but has the advantage of requiring much less memory intensive transformations. Yet this kind of workflow is strangely absent from most curricula I know. What is even more problematic is the insistence of many big surveys to deliver their data in some well known formats like sps, dta, csv and so on. While this is intended to be helpful, it has the side effect of reinforcing the idea that one rectangle fits all.

New possibilities, old rectangles?

The rectangle paradigm is also challenged by new formats and new possibilities of data acquisition. More and more data is directly available through APIs or as a result of data and text mining techniques. In both cases the resulting data seldom comes in the form of a nicely labeled dataframe. Those new data sources are often created by other disciplines, most notably computer scentists and programmers, consequently they are not specifically tailored to the needs and wants of social scientists. So we are often stuck waiting for someone to bridge the gap and provide us with our familiar, rectangular dataframe. Of course this means passing on good opportunities for interesting analyses.

So it seems to make sense to at least broaden our horizon and find a more comprehensive view on data. As said before, there are good reasons to stick to the good old rectangle, but there should be at least some awareness of other options.

Sociology Hacks

Nifty tricks and hacks for social science computing

Archiv des Autors: Jan Riebling