Thinking outside the rectangle

The following should not be considered a full-fledged argument but rather general musings on a certain approach to data and its associated problems.

When social scientists hear the word “data”, chances are they picture it as a rectangle. We are used to thinking about data as datasets, dataframes or, more generally, as tables made up of rows (observations) and columns (variables). This is how it is normally taught to students and how most of us work with data professionally. Consequently, we spend many hours preparing the right rectangle for a given problem. Transforming, maintaining and analyzing data is done mostly in the logic of the dataframe.

This approach to data has its merits. First of all, it is easy to understand since most people are familiar with this kind of data structure. It is also easy to use when writing syntax, as most statistical programmes are built around the notion of a rectangular dataset. Secondly, the dataframe fits our methodology. Almost all of our methods are defined by statistical operations which are performed on observations, variables or both. Finally, rectangular or flat data sets are also something of a common ground in social science. They allow for discussion, criticism and exchange of datasets between researchers.

That being said, there are some limitations inherent in a strictly “rectangular” approach to data. While we should by no means abandon the dataframe, it pays to see more clearly what we can and cannot do with it.

What is your rectangle made of?

Most of the common statistical programmes and languages provide the user with a flat data structure. Even though these structures look alike, they can be very different in terms of their actual implementation. Take for example R’s data.frame objects and the DataFrame provided by pandas. They have a very similar functionality and feel but are implemented quite differently (since they come from two different languages, this is to be expected). R’s data.frame is built on the native data structures of vectors and lists, while pandas is more of a wrapper around NumPy arrays that offers additional methods and more user-friendliness. This is neither the space, nor am I the right author, to give a full account of all the differences between the two. Yet something that many people who have made the transition seem to struggle with is the more functional style of R versus the more object-oriented style of pandas. Even though the data structure seems to be the same (after all, pandas is explicitly modeled after R’s data.frame), the ways of handling problems are not.
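To make the point about implementation a little more concrete, here is a minimal sketch in Python. The toy data and column names are made up for illustration. It only shows that a plain numeric pandas column can be handed over as a NumPy array, and that operations are written as methods of the data object rather than as functions applied to it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
df = pd.DataFrame({"age": [23, 35, 47], "income": [1800.0, 2500.0, 3100.0]})

# A plain numeric column can be handed over as a NumPy array.
age_values = df["age"].to_numpy()
print(type(age_values))

# Object-oriented style: the operation is a method of the data object.
# The R counterpart would be a function call such as mean(df$income).
mean_income = df["income"].mean()
print(mean_income)
```

The rectangle looks the same in both languages, but what sits underneath it and how you talk to it differ.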

The problem I want to point out is this: because we are used to the look and feel of it, we tend to ignore the actual differences between specific implementations. Although this sounds rather trivial, I do believe it to be the most problematic aspect of strict adherence to the “rectangle paradigm”. By treating data structures as if they were the same, we are essentially ignoring the possibility that there could be a better tool for the job than the one we are currently using. It also obscures the inner workings behind the data structures, which becomes a big problem as soon as you try to implement something yourself or switch from one framework to another.

How big can a rectangle get?

While rectangular data sets are functional and easy to use, they are not the most memory-efficient structures. Arrays and matrices are faster in most cases, which is why most statistical frameworks convert the data to those formats before doing the actual analysis. Yet the real problem is more one of bad practice. “Big Data” may be all the rage right now, but in the social sciences the problem is mostly that the data is not really “big” but rather too “large” for a specific framework and the machine on which it is running. In most cases there is no real need for better algorithms or parallel computing. Part of the problem is keeping the entire data set in memory while in most cases only a fraction of it is actually used.
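As a rough illustration of that last point, the following sketch compares loading a whole file with loading only the variables that are needed. The tiny inline CSV and its column names are hypothetical stand-ins for a large survey file, and the numbers it prints will depend on the pandas version and machine.

```python
import io
import pandas as pd

# Hypothetical survey extract; in practice this would be a large file on disk.
raw = io.StringIO(
    "id,age,income,comment\n"
    "1,23,1800,some long free text answer\n"
    "2,35,2500,another long free text answer\n"
    "3,47,3100,yet another long free text answer\n"
)

# Load everything, including the string column.
full = pd.read_csv(raw)

# Rewind and load only the variables needed for the analysis.
raw.seek(0)
subset = pd.read_csv(raw, usecols=["id", "age", "income"])

# deep=True also counts the memory held by string values.
full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()
print(full_bytes, subset_bytes)
```

On three rows the difference is negligible, but the same habit applied to thousands of variables is what keeps a data set manageable.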

In my experience, an average data set in the social sciences has roughly 10 000 to 20 000 observations and around 5 000 variables. In principle this is manageable, but it can become tricky when it comes to transforming or reshaping the data in fundamental ways. Again, this depends heavily on the actual statistical software used for the task and reinforces what was said about knowing the actual implementation. Yet the problem becomes more pronounced when many data sets are combined, as is common in cross-country research.

However, in most cases we only need a fraction of the original data. More specifically, 10 to 20 variables are enough on average. And those are in general not the problematic ones. We seldom need those pesky memory-eating string variables anyway. So the solution would be to keep the data as a whole in a database and use a specialized language like SQL to construct your dataframe. The resulting data structure is not only smaller, but also has the advantage of requiring far fewer memory-intensive transformations. Yet this kind of workflow is strangely absent from most curricula I know. What is even more problematic is the insistence of many big surveys on delivering their data in well-known formats like sps, dta, csv and so on. While this is intended to be helpful, it has the side effect of reinforcing the idea that one rectangle fits all.
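The database workflow described above could look roughly like the following sketch. It assumes an SQLite database, here created in memory with a hypothetical "survey" table and invented column names. A real project would connect to an existing database file or server instead.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a survey stored on disk or on a server.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE survey (id INTEGER, country TEXT, age INTEGER, "
    "income REAL, comment TEXT)"
)
con.executemany(
    "INSERT INTO survey VALUES (?, ?, ?, ?, ?)",
    [
        (1, "DE", 23, 1800.0, "long free text"),
        (2, "DE", 35, 2500.0, "long free text"),
        (3, "FR", 47, 3100.0, "long free text"),
    ],
)
con.commit()

# Let the database do the selection: only the needed variables and cases.
query = "SELECT id, age, income FROM survey WHERE country = ?"
df = pd.read_sql_query(query, con, params=("DE",))
con.close()
print(df)
```

The dataframe that arrives in memory contains only the three requested variables for one country. The string variable never leaves the database.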

New possibilities, old rectangles?

The rectangle paradigm is also challenged by new formats and new possibilities of data acquisition. More and more data is directly available through APIs or as a result of data and text mining techniques. In both cases the resulting data seldom comes in the form of a nicely labeled dataframe. Those new data sources are often created by other disciplines, most notably by computer scientists and programmers. Consequently, they are not specifically tailored to the needs and wants of social scientists. So we are often stuck waiting for someone to bridge the gap and provide us with our familiar, rectangular dataframe. Of course this means passing up good opportunities for interesting analyses.
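Bridging that gap yourself is often less work than it sounds. The sketch below takes a nested, JSON-style response of the kind many APIs return and flattens it with pandas. The response text and its field names are invented for illustration and do not come from any real API.

```python
import json
import pandas as pd

# Hypothetical API response; the field names are invented for illustration.
response_text = """
[
  {"id": 1, "user": {"name": "a", "followers": 10}, "tags": ["x", "y"]},
  {"id": 2, "user": {"name": "b", "followers": 25}, "tags": []}
]
"""
records = json.loads(response_text)

# Nested dictionaries become columns such as "user.name"; lists stay as they are.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

Note that the "tags" lists do not fit neatly into a single cell, which is exactly the kind of information that gets lost or mangled when everything has to become a rectangle.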

So it seems to make sense to at least broaden our horizons and develop a more comprehensive view of data. As said before, there are good reasons to stick to the good old rectangle, but there should be at least some awareness of other options.