Storing Data Part 2: DOCA Example

This is part 2 of a series on storing and managing social science data in relational databases. If you haven’t already, read part 1 to get up to speed.

To help illustrate some of the concepts introduced in Part 1, I’m going to use the Dynamics of Collective Action (DOCA) dataset to design an efficient relational database. You can download the entire dataset at the DOCA website. For our purposes, I’m going to refer to a 15 case sample, which you can view here. The DOCA dataset consists of data coded from newspaper coverage of protest events. Each case is a distinct protest event. Table 1 contains a description of each of the over 100 variables in the dataset. For more detailed description of the dataset, you can browse the DOCA website.

Table 1: DOCA Variables
variable	description
coder	name of the coder
date	date the event was coded
newsnm	name of the newspaper the event was coded from
rptmm, rptdd, rptyy	date (month, day, and year) the newspaper article was published
evmm, evdd, evyy	date (month, day, and year) the event took place
days	number of days the event lasted
eventid	unique ID number of event
stories	number of articles on event
title	title of the newspaper article
followup	(previous month?)
who, what, where, against, purpose, whysm	Brief narrative describing protest event
page, section	page and section of newspaper the article appears in
staten	number of states the event took place in
state1, state2, state3, state4	list of states the event took place in (only first four)
cityn	number of cities the event took place in
city1, city2	first two cities in list of cities event took place in
neighn	number of neighborhoods the event took place in
neighbr	primary neighborhood the event took place in
particd	was the number of participants reported?
particex	if yes above, how many participants were reported?
partices	if no above, what is the most appropriate estimate of participants (from a list)
groups	number of groups (broadly defined identity groups) initiating event
igrp1c1, igrp1c2, igrp2c1, igrp2c2, igrp3c1, igrp3c2	the types of groups involved in the event, with the primary group listed first, and so on.
targd	was there an identifiable target of the event?
targnum	how many targets did the event have?
targ1, targ2	list of target types
ertarg1, ertarg2	list of target types of ethnic events
smonamed	were specific organizations named in the article?
smonum	how many organizations were named
smoname1, smoname2, smoname3, smoname4	list of named organizations
claim1, val1, claim2, val2, claim3, val3, claim4, val4	list of claim topics made in the event (claim1) along with whether the claim was pro- or anti- the claim topic (val1)
form1, form2, form3, form4	list of forms the event appears as (e.g. picket, civil disobedience, symbolic display)
act1, act2, act3, act4	list of main behaviors used by the protesters
viold	did protesters use violence?
violtype	what kind of violence did protesters use?
counterd	were counter demonstrators present?
police1	where police mentioned at the event?
police2	did police take action at the event?
police3	did the police use physical force?
police4	did the police use violence?
arrestd	where there arrests at the event?
arrestex	number of arrests reported (Exact number)
arrestes	if exact number of arrests not reported, best estimate (from list)
propdam	was there property damage?
dollars	dollar amount of property damage
injury	was anyone reported as injured?
deaths	was anyone reported as killed?
d_eventid
flag_followup
injprot, injprot_est	protesters injured, reported and estimated
injby, injby_est	bystanders injured, reported and estimated
injpol, injpol_est	police injured, reported and estimated
injoth, injoth_est	other injured, reported and estimated
dthprot, dthprot_est	protesters killed, reported and estimated
dthby, dthby_est	bystanders killed, reported and estimated
dthpol, dthpol_est	police killed, reported and estimated
dthoth, dthoth_est	other killed, reported and estimated
agenda_code	Policy Agendas Project topic code

You’ll notice immediately that there are a lot of variables in this dataset – 104 in total. You’ll also notice that many variables are repeats – these are listed in the same cell in the table above. This happens when you want to be able to assign more than one category to a case for a specific variable. For example, the claim variable is repeated four times in the spreadsheet because the DOCA coders wanted to be able to assign more than one claim to a single event. This increases accuracy of the coding, but also makes the data more complex. These sorts of issues are where relational databases really excel at storing data efficiently.

When you have multiple instances of the same variable, like claims, storing the data in a spreadsheet becomes inefficient and cumbersome. You have to have have columns for the maximum number of claims you want to be able to assign to a case. So if you want to be able to sign up to six different claims to a case, you have to have six columns in the spreadsheet, even for cases that only have one claim. This is true even if the vast majority of your cases only have one or two claims, resulting in a large number of empty cells. Plus, you are limited to the number of columns you created – you can’t assign seven claims to an event because there’s not a column to store the data for the seventh claim. In the sample we are using here, only one case has more than 2 claims, and no cases have 4 claims. So we have a bunch of empty cells taking up space in our dataset.

A better way to store this data is to have a separate table for claims. The table will have fields for the row id, the event id it is referring back to, the claim type and the claim valence. When done this way, you don’t have a limit on how many claims you can assign to an event. You just add a row to the claims table for each claim you assign to an event – as many as you would like. The table would look something like the table 2.

Table 2: Claims
id	event_id	claim	valence
1	9009029	1900	1
2	9005036	1505	1
3	8407041	1340	3
4	8109077	1400	2
5	7911085	740	1
6	7911085	1000	1
7	7707051	1713	1
8	7503056	1336	2
9	7105082	1400	3
10	7103032	1106	1
11	6905020	1500	1
12	6804168	711	2
13	6706112	1518	1
14	6706112	1500	1
15	6706112	1334	2
16	6505008	1501	1
17	6505008	1500	1
18	6210014	1501	1
19	6210014	1342	3
20	6112002	1334	3
21	6112002	1501	1
22	6011034	1501	2
23	6011034	1501	1

Notice we’ve assigned each row a unique ID number. This is so it is easier to refer to a specific row in case we want to edit or delete data in the future (the id will become more important when we start using MySQL). Each row also contains the event_id for the event the claim corresponds to. You can see that some events have multiple claims, as they did in the original spreadsheet. However, by storing the claims data in a separate table, we eliminate the blank cells for events that only have one or two claims. Instead, we just add a new row whenever there is another claim for an event. This will allow us to code as many claims as we want for each event.

Applying this technique to all the repeating variables, we would create 9 new tables: states, cities, neighborhoods, groups initiating, targets, smos named, claims, event forms, and acts.

We could also create a table for injuries and deaths, with a field for reported or estimated, a field for person type (protesters, bystanders, police, other), and a field for injury type (injury or death).

This google spreadsheet contains the DOCA data reorganized into separate tables as I described. Each sheet contains a separate table. The data relates back to each other through the event_id variable, common to all tables. Notice how few empty cells there are in each table. This is not necessarily the best way to organize a relational database around the DOCA data, but it is a good illustration of basic concepts we’ve already learned.

Let’s recap what we’ve learned so far:

First, relational databases are good at storing multi-leveled data. In general, when designing relational databases, you want to avoid repeating data across rows in a single table. If you find yourself inputting the exact same data (say, org name, or school, or location information), think about how that data could be stored in a separate table and referred to through a simple id.

Second, relational databases are good for tagging multiple categories or codes for a single case. Instead of creating redundant columns in a single table, move the categorization into its own table – multiple categories become multiple rows instead of multiple columns. Rows are much easier to add than columns.

Next, we’ll get more hands on and learn how to create databases in MySQL.