Storing Data Part 2: DOCA Example

This is part 2 of a series on storing and managing social science data in relational databases. If you haven’t already, read part 1 to get up to speed.

To help illustrate some of the concepts introduced in Part 1, I’m going to use the Dynamics of Collective Action (DOCA) dataset to design an efficient relational database. You can download the entire dataset at the DOCA website. For our purposes, I’m going to refer to a 15 case sample, which you can view here. The DOCA dataset consists of data coded from newspaper coverage of protest events. Each case is a distinct protest event.  Table 1 contains a description of each of the over 100 variables in the dataset. For more detailed description of the dataset, you can browse the DOCA website.

Table 1: DOCA Variables
variable description
coder name of the coder
date date the event was coded
newsnm name of the newspaper the event was coded from
rptmm, rptdd, rptyy date (month, day, and year) the newspaper article was published
evmm, evdd, evyy date (month, day, and year) the event took place
days number of days the event lasted
eventid unique ID number of event
stories number of articles on event
title title of the newspaper article
followup (previous month?)
who, what, where, against, purpose, whysm Brief narrative describing protest event
page, section page and section of newspaper the article appears in
staten number of states the event took place in
state1, state2, state3, state4 list of states the event took place in (only first four)
cityn number of cities the event took place in
city1, city2 first two cities in list of cities event took place in
neighn number of neighborhoods the event took place in
neighbr primary neighborhood the event took place in
particd was the number of participants reported?
particex if yes above, how many participants were reported?
partices if no above, what is the most appropriate estimate of participants (from a list)
groups number of groups (broadly defined identity groups) initiating event
igrp1c1, igrp1c2, igrp2c1, igrp2c2, igrp3c1, igrp3c2 the types of groups involved in the event, with the primary group listed first, and so on.
targd was there an identifiable target of the event?
targnum how many targets did the event have?
targ1, targ2 list of target types
ertarg1, ertarg2 list of target types of ethnic events
smonamed were specific organizations named in the article?
smonum how many organizations were named
smoname1, smoname2, smoname3, smoname4 list of named organizations
claim1, val1, claim2, val2, claim3, val3, claim4, val4 list of claim topics made in the event (claim1) along with whether the claim was pro- or anti- the claim topic (val1)
form1, form2, form3, form4 list of forms the event appears as (e.g. picket, civil disobedience, symbolic display)
act1, act2, act3, act4 list of main behaviors used by the protesters
viold did protesters use violence?
violtype what kind of violence did protesters use?
counterd were counter demonstrators present?
police1 where police mentioned at the event?
police2 did police take action at the event?
police3 did the police use physical force?
police4 did the police use violence?
arrestd where there arrests at the event?
arrestex number of arrests reported (Exact number)
arrestes if exact number of arrests not reported, best estimate (from list)
propdam was there property damage?
dollars dollar amount of property damage
injury was anyone reported as injured?
deaths was anyone reported as killed?
d_eventid
flag_followup
injprot, injprot_est protesters injured, reported and estimated
injby, injby_est bystanders injured, reported and estimated
injpol, injpol_est police injured, reported and estimated
injoth, injoth_est other injured, reported and estimated
dthprot, dthprot_est protesters killed, reported and estimated
dthby, dthby_est bystanders killed, reported and estimated
dthpol, dthpol_est police killed, reported and estimated
dthoth, dthoth_est other killed, reported and estimated
agenda_code Policy Agendas Project topic code

You’ll notice immediately that there are a lot of variables in this dataset – 104 in total. You’ll also notice that many variables are repeats – these are listed in the same cell in the table above. This happens when you want to be able to assign more than one category to a case for a specific variable. For example, the claim variable is repeated four times in the spreadsheet because the DOCA coders wanted to be able to assign more than one claim to a single event. This increases accuracy of the coding, but also makes the data more complex. These sorts of issues are where relational databases really excel at storing data efficiently.

When you have multiple instances of the same variable, like claims, storing the data in a spreadsheet becomes inefficient and cumbersome. You have to have have columns for the maximum number of claims you want to be able to assign to a case. So if you want to be able to sign up to six different claims to a case, you have to have six columns in the spreadsheet, even for cases that only have one claim. This is true even if the vast majority of your cases only have one or two claims, resulting in a large number of empty cells. Plus, you are limited to the number of columns you created – you can’t assign seven claims to an event because there’s not a column to store the data for the seventh claim. In the sample we are using here, only one case has more than 2 claims, and no cases have 4 claims. So we have a bunch of empty cells taking up space in our dataset.

A better way to store this data is to have a separate table for claims. The table will have fields for the row id, the event id it is referring back to, the claim type and the claim valence. When done this way, you don’t have a limit on how many claims you can assign to an event. You just add a row to the claims table for each claim you assign to an event – as many as you would like. The table would look something like the table 2.

Table 2: Claims
id event_id claim valence
1 9009029 1900 1
2 9005036 1505 1
3 8407041 1340 3
4 8109077 1400 2
5 7911085 740 1
6 7911085 1000 1
7 7707051 1713 1
8 7503056 1336 2
9 7105082 1400 3
10 7103032 1106 1
11 6905020 1500 1
12 6804168 711 2
13 6706112 1518 1
14 6706112 1500 1
15 6706112 1334 2
16 6505008 1501 1
17 6505008 1500 1
18 6210014 1501 1
19 6210014 1342 3
20 6112002 1334 3
21 6112002 1501 1
22 6011034 1501 2
23 6011034 1501 1

Notice we’ve assigned each row a unique ID number. This is so it is easier to refer to a specific row in case we want to edit or delete data in the future (the id will become more important when we start using MySQL). Each row also contains the event_id for the event the claim corresponds to. You can see that some events have multiple claims, as they did in the original spreadsheet. However, by storing the claims data in a separate table, we eliminate the blank cells for events that only have one or two claims. Instead, we just add a new row whenever there is another claim for an event. This will allow us to code as many claims as we want for each event.

Applying this technique to all the repeating variables, we would create 9 new tables: states, cities, neighborhoods, groups initiating, targets, smos named, claims, event forms, and acts.

We could also create a table for injuries and deaths, with a field for reported or estimated, a field for person type (protesters, bystanders, police, other), and a field for injury type (injury or death).

This google spreadsheet contains the DOCA data reorganized into separate tables as I described. Each sheet contains a separate table. The data relates back to each other through the event_id variable, common to all tables. Notice how few empty cells there are in each table. This is not necessarily the best way to organize a relational database around the DOCA data, but it is a good illustration of basic concepts we’ve already learned.

Let’s recap what we’ve learned so far:

First, relational databases are good at storing multi-leveled data. In general, when designing relational databases, you want to avoid repeating data across rows in a single table. If you find yourself inputting the exact same data (say, org name, or school, or location information), think about how that data could be stored in a separate table and referred to through a simple id.

Second, relational databases are good for tagging multiple categories or codes for a single case. Instead of creating redundant columns in a single table, move the categorization into its own table – multiple categories become multiple rows instead of multiple columns. Rows are much easier to add than columns.

Next, we’ll get more hands on and learn how to create databases in MySQL.