Storing Data Part 1: Introduction

This is the first of a series of posts about storing social science data in relational databases. I’ve found that grad school has prepared me for many things: using statistical software to do complex analyses, writing up academic papers and submit them for review to journals, collecting and coding data for research projects, presenting and networking at conferences. But I never learned how to properly keep track and store the data I’ve collected throughout my young academic career. Instead, I’ve cobbled together techniques, borrowing heavily from my skills as amateur computer and web programmer, that seem to work for me. In an effort to get more social scientists to think about this, I’ve decided to share some of my techniques, beginning with storing data in databases.

Collecting data is an integral part of social science research. Without data, we can’t test theories or make empirical claims. But, in my experience, we are rarely prepared for the task of storing this data in ways that enable analysis or facilitate collaboration. More often than not (and I am occasionally as guilty of this as anyone else), we store our data in a series of confusingly named files, either spreadsheets or text files, with variable or case names that do little to elucidate their use or purpose. As computers make it easier to collect and analyze data, this problem of data storage is exacerbated. As data become multileveled, and the number of cases reach into the tens of thousands (or more), simple spreadsheets become not only unwieldy, but also potentially unstable as files reach the memory limits of their associated programs. However, there is a solution: databases.

No doubt you’ve heard of Microsoft Access, Microsoft’s version of a database application. There are others – FileMaker Pro is particularly popular among Apple-centric firms, as Access does not run on Macs. Both of these programs cost quite a bit of money, however. MySQL is a free, open source database application, and perhaps the most used, as it powers many popular websites. MySQL has no native graphical user interface and does not have built-in forms support, as Access and FileMaker do, but it easily interfaces with numerous programming languages to build custom forms for inputting, organizing, and viewing data. But we’re jumping ahead: let’s slow things down a bit, put on some light jazz (your choice of station) and talk about databases more generally.

One of the biggest advantages of a database over a spreadsheet is that the database can be relational. A rational database consists of multiple tables, each table storing different types of data. These tables connect to one another through defined relationships. For example, say you are studying elementary school outcomes. You have a sample of classrooms from certain schools in a school district, and have collected data on every student in these classrooms. Initially, you are storing your data in a spreadsheet and it looks something like Table 1:

Table 1: Spreadsheet data for school outcomes study
student id math score verbal score teacher’s education classroom size school’s %white school’s %school lunch program
1 95 65 bachelors 23 45 32
2 84 78 bachelors 23 45 32
3 56 75 bachelors 27 45 32
4 63 69 bachelors 27 45 32
3 92 98 masters 19 85 8
4 89 95 masters 19 85 8
5 86 84 bachelors 18 85 8
6 97 99 masters 15 85 8

Obviously, this is truncated data (both vertically and horizontally), but this is a good illustration of why spreadsheets are bad for this type of data. Data are duplicated across rows – the same school data appear in four rows, the same classroom data appear in two rows. Not only is this a waste of space, but also introduces much more error than there should be. Under this scheme, school data has to be entered for each new student in the data.  In addition to taking more time, there’s potential for miscoding school data with every entry. The same goes for the classroom data. Ideally, we enter this data once, verify that we entered them correctly, then simply refer to a record of this data for each student. This is how relational databases work. Tables 2a-c show how the above data could be reentered into a relational database:

Table 2a: Students
student id math score verbal score classroom id
1 95 65 1
2 84 78 1
3 56 75 2
4 63 69 2
3 92 98 3
4 89 95 3
5 86 84 4
6 97 99 5
Table 2b: Classrooms
classroom id teacher’s education class size school id
1 bachelors 23 1
2 bachelors 27 1
3 masters 19 2
4 bachelors 18 2
5 masters 15 2
Table 2c: Schools
school id % white % school lunch program
1 45 32
2 85 8

While it may seem less efficient (we now have three tables instead of one), the actual data are only entered once. You’ll notice that each classroom and each school now has an ID number. Other tables can use the ID number in the classroom table to refer to a specific classroom record. So student 1 refers to classroom 1, and classroom 1 refers to school 1. These are the relationships connecting the tables together. Relationships in a relational database are defined by common fields. In the case above, classroom id and school id are common fields defining the relationship between the students, classrooms, and schools tables. Now, we only have to enter and store the school-level and classroom-level data once. When we enter a new student, we simply refer to the classroom record for the classroom they are in (and this classroom record refers to the school record the classroom is in). We can visualize this database as in figure 1:

The student table refers to the classroom table using the classroom id field. The classroom table refers to the school table using the school id field. We’ve preserved the data from the original spreadsheet, but are storing them in a much more efficient manner. What’s more, relational databases allow you to pull data from the tables in virtually any manner you wish. Say you wanted to run analyses on just classroom level data. You can ask the database to return a spreadsheet with each row containing a classroom, duplicating the school level data on each row, and aggregating student-level data up to the classroom level (say as average math and verbal scores). This is easy to do; the request generates a spreadsheet ready to be imported into your favorite stats program and preserves the original data perfectly. You can save these data requests so that as you collect new data, you code everything into the database, and then request the updated dataset when you are ready for analysis.

Hopefully this brief introduction has helped explain how relational databases work and has piqued your interest enough to look into it more. I’ll write up another post soon going into more detail about the benefits of relational databases, showing more examples. Later, I plan on posting a brief tutorial on how to use the database application I use, MySQL.