Blog by Nate Archives: My Experience with Big(ish) Data (Feb 27, 2013)

[I am migrating content from my blog to my new virtual home.  This post on big data from 2013 already seems dated.  Lots of disappointment and criticism of “big data” in political science.]

My Experience with Big(ish) Data

Many years ago I started a project examining firm-level data on foreign investment.  This data is from the U.S. Bureau of Economic Analysis (BEA) on the operations of all of the 20,000+ foreign affiliates of U.S. multinationals.  This paper, on the taxation of multinationals, has finally been published at International Studies Quarterly.

I wanted to briefly document my experience with this project since it relates to a number of discussions on “big data” in the social sciences (here is one good post on big data).  I know, 20,000 observations isn’t a lot, but this can be used as time-series data and there are other aspects of this data that are similar to “big data”.  Hold on.

Here are a couple of very quick bullet points on my experience with this paper.

  • I found out about this incredible firm-level data set by reading a few econ papers that used it.  To my dismay, the data is confidential, housed in Washington, DC.  So I had to petition the BEA to see the data.  After a few months, I got the ok from the BEA and then went through the process of getting a security clearance to use this data.  A few more long months.
  • Once I was given access to the data (and became a special sworn employee of the BEA), I had to comply with the rules of using the data.  The data is housed in DC and has to be used on site.  I had to fly to DC every time I wanted to run a regression.  Good thing my college roommate lives in DC.
  • All of the data are housed in different MS Access files, and only old versions of Stata were available on the computers I could work with.  No downloading files from the internet (R, do files, etc.).  Putting together the data set and even running a few simple regressions was a lot more difficult than I expected.
  • This data set was a gold mine, but like most mining operations, extracting anything from it is really, really messy.  How do you “clean” data that isn’t comparable to other data sets?  For example, I had way too many zero observations for one variable in the data.  Were these true zeros or just missing values that were coded as zero?  I went to the BEA paper archives and pulled a sample of paper forms to double-check the coding.
  • I rarely get a paper accepted on the first go.  This means that every time this article was reviewed, I had to plan a trip to DC to run another set of regressions.  I went to APSR, IO, AJPS (R&R that was rejected), JOP and then finally ISQ.
  • Given the barriers to replication, article reviewers and one NSF panel were probably harsher on this project than my others.  I can’t say that I blame them, but I got at least one negative comment from every journal and grant review process on the inaccessibility of the data.
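The zero-versus-missing check described above can be sketched in code.  This is a toy illustration only: the column names and values are made up, since the actual BEA variables are confidential, and the real check ultimately required pulling the paper forms.

```python
import pandas as pd

# Hypothetical affiliate-level data; names and values are illustrative,
# not the actual (confidential) BEA variables.
df = pd.DataFrame({
    "affiliate_id": [1, 2, 3, 4, 5],
    "sales":        [120.0, 0.0, 450.0, 0.0, 80.0],
    "employees":    [300,   0,   900,   25,  60],
})

# A zero in "sales" is suspicious when the affiliate clearly operates
# (here: it reports a positive employee count).  Flag those rows for
# manual verification against the original survey forms.
suspect = df[(df["sales"] == 0) & (df["employees"] > 0)]
print(suspect["affiliate_id"].tolist())
```

In this toy data, only affiliate 4 gets flagged: its zero sales coexist with 25 employees, so the zero may really be a missing value coded as zero.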

The paper I wrote with this data won an award for best political economy paper at APSA in 2008.  It is now forthcoming in International Studies Quarterly in 2013.

What is my experience with “big data”?

  1. The barriers to entry are really high. You probably already knew this.
  2. Data quality is a serious issue.  When using a cross-national dataset, I look at the individual observations to make sure nothing looks odd.  It took much more legwork to verify the quality of this data.
  3. The potential for “data mining” is much, much lower than you might think.  This relates to point 4.
  4. There is no way to let the data “speak” to you.  It is a confusing mess; you really need a plan for how to analyze it.
  5. Control variables and other important variables often aren’t available at the level of analysis that you’re examining.
  6. Because this is “new” data, many of the standard methods of data analysis might not apply.

My only concrete suggestion is that theory is even more important when using “big data”.  You can only really harness the richness of complicated micro data if you have clear micro theories.

Barriers to entry can create rents for a researcher, but they also make it much more difficult to replicate your results.  This means that journal reviewers and grant reviewers can hold this against you, and the ultimate impact of your work might be lower.  This isn’t a suggestion.  It is a warning.

In the end, I really like this paper and I am really grateful to the folks at the BEA for giving me access to this data.  But this was a tough slog.