This is part 2 of a series on storing and managing social science data in relational databases. If you haven’t already, read part 1 to get up to speed.
To help illustrate some of the concepts introduced in Part 1, I’m going to use the Dynamics of Collective Action (DOCA) dataset to design an efficient relational database. You can download the entire dataset at the DOCA website. For our purposes, I’m going to refer to a 15 case sample, which you can view here. The DOCA dataset consists of data coded from newspaper coverage of protest events. Each case is a distinct protest event. Table 1 contains a description of each of the over 100 variables in the dataset. For more detailed description of the dataset, you can browse the DOCA website.
Table 1: DOCA Variables | |
variable | description |
coder | name of the coder |
date | date the event was coded |
newsnm | name of the newspaper the event was coded from |
rptmm, rptdd, rptyy | date (month, day, and year) the newspaper article was published |
evmm, evdd, evyy | date (month, day, and year) the event took place |
days | number of days the event lasted |
eventid | unique ID number of event |
stories | number of articles on event |
title | title of the newspaper article |
followup | (previous month?) |
who, what, where, against, purpose, whysm | Brief narrative describing protest event |
page, section | page and section of newspaper the article appears in |
staten | number of states the event took place in |
state1, state2, state3, state4 | list of states the event took place in (only first four) |
cityn | number of cities the event took place in |
city1, city2 | first two cities in list of cities event took place in |
neighn | number of neighborhoods the event took place in |
neighbr | primary neighborhood the event took place in |
particd | was the number of participants reported? |
particex | if yes above, how many participants were reported? |
partices | if no above, what is the most appropriate estimate of participants (from a list) |
groups | number of groups (broadly defined identity groups) initiating event |
igrp1c1, igrp1c2, igrp2c1, igrp2c2, igrp3c1, igrp3c2 | the types of groups involved in the event, with the primary group listed first, and so on. |
targd | was there an identifiable target of the event? |
targnum | how many targets did the event have? |
targ1, targ2 | list of target types |
ertarg1, ertarg2 | list of target types of ethnic events |
smonamed | were specific organizations named in the article? |
smonum | how many organizations were named |
smoname1, smoname2, smoname3, smoname4 | list of named organizations |
claim1, val1, claim2, val2, claim3, val3, claim4, val4 | list of claim topics made in the event (claim1) along with whether the claim was pro- or anti- the claim topic (val1) |
form1, form2, form3, form4 | list of forms the event appears as (e.g. picket, civil disobedience, symbolic display) |
act1, act2, act3, act4 | list of main behaviors used by the protesters |
viold | did protesters use violence? |
violtype | what kind of violence did protesters use? |
counterd | were counter demonstrators present? |
police1 | where police mentioned at the event? |
police2 | did police take action at the event? |
police3 | did the police use physical force? |
police4 | did the police use violence? |
arrestd | where there arrests at the event? |
arrestex | number of arrests reported (Exact number) |
arrestes | if exact number of arrests not reported, best estimate (from list) |
propdam | was there property damage? |
dollars | dollar amount of property damage |
injury | was anyone reported as injured? |
deaths | was anyone reported as killed? |
d_eventid | |
flag_followup | |
injprot, injprot_est | protesters injured, reported and estimated |
injby, injby_est | bystanders injured, reported and estimated |
injpol, injpol_est | police injured, reported and estimated |
injoth, injoth_est | other injured, reported and estimated |
dthprot, dthprot_est | protesters killed, reported and estimated |
dthby, dthby_est | bystanders killed, reported and estimated |
dthpol, dthpol_est | police killed, reported and estimated |
dthoth, dthoth_est | other killed, reported and estimated |
agenda_code | Policy Agendas Project topic code |
You’ll notice immediately that there are a lot of variables in this dataset – 104 in total. You’ll also notice that many variables are repeats – these are listed in the same cell in the table above. This happens when you want to be able to assign more than one category to a case for a specific variable. For example, the claim variable is repeated four times in the spreadsheet because the DOCA coders wanted to be able to assign more than one claim to a single event. This increases accuracy of the coding, but also makes the data more complex. These sorts of issues are where relational databases really excel at storing data efficiently.
When you have multiple instances of the same variable, like claims, storing the data in a spreadsheet becomes inefficient and cumbersome. You have to have have columns for the maximum number of claims you want to be able to assign to a case. So if you want to be able to sign up to six different claims to a case, you have to have six columns in the spreadsheet, even for cases that only have one claim. This is true even if the vast majority of your cases only have one or two claims, resulting in a large number of empty cells. Plus, you are limited to the number of columns you created – you can’t assign seven claims to an event because there’s not a column to store the data for the seventh claim. In the sample we are using here, only one case has more than 2 claims, and no cases have 4 claims. So we have a bunch of empty cells taking up space in our dataset.
A better way to store this data is to have a separate table for claims. The table will have fields for the row id, the event id it is referring back to, the claim type and the claim valence. When done this way, you don’t have a limit on how many claims you can assign to an event. You just add a row to the claims table for each claim you assign to an event – as many as you would like. The table would look something like the table 2.
Table 2: Claims | |||
id | event_id | claim | valence |
1 | 9009029 | 1900 | 1 |
2 | 9005036 | 1505 | 1 |
3 | 8407041 | 1340 | 3 |
4 | 8109077 | 1400 | 2 |
5 | 7911085 | 740 | 1 |
6 | 7911085 | 1000 | 1 |
7 | 7707051 | 1713 | 1 |
8 | 7503056 | 1336 | 2 |
9 | 7105082 | 1400 | 3 |
10 | 7103032 | 1106 | 1 |
11 | 6905020 | 1500 | 1 |
12 | 6804168 | 711 | 2 |
13 | 6706112 | 1518 | 1 |
14 | 6706112 | 1500 | 1 |
15 | 6706112 | 1334 | 2 |
16 | 6505008 | 1501 | 1 |
17 | 6505008 | 1500 | 1 |
18 | 6210014 | 1501 | 1 |
19 | 6210014 | 1342 | 3 |
20 | 6112002 | 1334 | 3 |
21 | 6112002 | 1501 | 1 |
22 | 6011034 | 1501 | 2 |
23 | 6011034 | 1501 | 1 |
Notice we’ve assigned each row a unique ID number. This is so it is easier to refer to a specific row in case we want to edit or delete data in the future (the id will become more important when we start using MySQL). Each row also contains the event_id for the event the claim corresponds to. You can see that some events have multiple claims, as they did in the original spreadsheet. However, by storing the claims data in a separate table, we eliminate the blank cells for events that only have one or two claims. Instead, we just add a new row whenever there is another claim for an event. This will allow us to code as many claims as we want for each event.
Applying this technique to all the repeating variables, we would create 9 new tables: states, cities, neighborhoods, groups initiating, targets, smos named, claims, event forms, and acts.
We could also create a table for injuries and deaths, with a field for reported or estimated, a field for person type (protesters, bystanders, police, other), and a field for injury type (injury or death).
This google spreadsheet contains the DOCA data reorganized into separate tables as I described. Each sheet contains a separate table. The data relates back to each other through the event_id variable, common to all tables. Notice how few empty cells there are in each table. This is not necessarily the best way to organize a relational database around the DOCA data, but it is a good illustration of basic concepts we’ve already learned.
Let’s recap what we’ve learned so far:
First, relational databases are good at storing multi-leveled data. In general, when designing relational databases, you want to avoid repeating data across rows in a single table. If you find yourself inputting the exact same data (say, org name, or school, or location information), think about how that data could be stored in a separate table and referred to through a simple id.
Second, relational databases are good for tagging multiple categories or codes for a single case. Instead of creating redundant columns in a single table, move the categorization into its own table – multiple categories become multiple rows instead of multiple columns. Rows are much easier to add than columns.
Next, we’ll get more hands on and learn how to create databases in MySQL.