Reproducible research fail


In most of the psychology subdisciplines under the umbrella of “cognitive psychology” (e.g., language, memory, perception), researchers use programs to collect data from participants (cf. social psychology, which often uses surveys instead). These are usually simple programs that display words or pictures and record responses; if you’ve ever taken an introductory psychology course, you were surely made to sit through a few of these. Although there are a few tools that allow psychologists to create experiments without writing their own code, most of us (at least in the departments with which I’ve been affiliated) program our own studies.

The majority of psych grad students start their Ph.D.s with little programming experience, so it’s not surprising that coding errors sometimes affect research. As a first-year grad student who’d previously worked as a programmer, I was determined to do better. Naturally, I made a ton of mistakes, and I want to talk about one of them: a five-year-old mistake I’m dealing with today.

Like many mistakes, I made this one while trying to avoid another. I noticed that it was common for experiments to fail to record details about trials which later turned out to be important. For instance, a researcher could run a visual search experiment without saving the locations and identities of the randomly-selected “distractors”, and later be unable to test whether there was an effect of crowding. It was also fairly common to skip recording response times while looking for an effect on task accuracy, and then be unable to show that the observed effect was not due to a speed-accuracy tradeoff.

I decided that I’d definitely record everything. This wasn’t itself a mistake.

Since I program my experiments in Python using an object-oriented design – all of the data necessary to display a trial was encapsulated in instances of the Trial class – I decided that the best way to save everything was to serialize these objects using Python’s pickle module. This way, if I added additional members to Trial, I didn’t have to remember to explicitly include them in the experiment’s output. I also smugly knew that I didn’t have to worry about rounding errors since everything was stored at machine precision (because that matters).

That’s not quite where I went wrong.

The big mistake was using this approach but failing to follow best practices for reproducible research. It’s now incredibly difficult to unpickle the data from my studies because the half dozen modules necessary to run my code have all been updated since I wrote these programs. I didn’t even record the version numbers of anything. I’ve had to write a bunch of hacks and manually install a few old modules just to get my data back.
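Recording version numbers would have cost me almost nothing. Here is a minimal sketch of what I could have done, assuming Python 3.8 or later for importlib.metadata; the function names and the output file name are made up for illustration, and the package list is whatever your experiment actually imports.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Return a dict describing the interpreter and the listed packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; worth recording too
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": versions,
    }


def save_snapshot(path, packages):
    """Write the snapshot as JSON, ideally next to the data it describes."""
    snapshot = snapshot_environment(packages)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2, sort_keys=True)
    return snapshot
```

Calling something like save_snapshot("environment.json", ["numpy", "pygame"]) at the start of every session would leave a small file telling future-me exactly which versions to reinstall.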

Today it’s a lot easier to do things the right way. If you’re programming in Python, you can use the Anaconda distribution to create environments that keep their own copies of your code’s dependencies. These won’t get updated with the rest of the system, so you should be able to go back and run things later. A language-agnostic approach could utilize Docker images, or go a step further and run each experiment in its own virtual machine (although care should be taken to ensure adequate system performance).

I do feel like I took things too far by pickling my Python objects. Even if I had used Anaconda, I’d have been committing myself to either performing all my analyses in Python, or performing the intermediate step of writing a script to export my output (giving myself another chance to introduce a coding error). Using a generic output file format (e.g., a simple CSV file) affords more flexibility in choosing analysis tools, and also better supports data-sharing.
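For what it’s worth, the convenience I wanted from pickle (new members of Trial showing up in the output automatically) is available with plain CSV too. The sketch below is one way to get it, not the code I actually ran; the fields on this Trial are hypothetical stand-ins.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass
class Trial:
    # Hypothetical fields, for illustration only.
    participant: str
    trial_number: int
    target_present: bool
    n_distractors: int
    response: str
    rt_ms: float


def write_trials_csv(path, trials):
    """Write one row per trial; the columns follow the dataclass definition."""
    columns = [f.name for f in fields(Trial)]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for trial in trials:
            writer.writerow(asdict(trial))
```

Because the column list is derived from the class, adding a field to Trial adds a column without touching the output code. As for my rounding worries, Python 3 writes floats in a form that reads back as the same value, so nothing is lost in the text file.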

I still think it’s important to record “everything”, but there are better ways to do it. An approach I began to use later was to write separate programs for generating trials and displaying them. The first program handles counterbalancing and all the logic supporting randomness; it then creates a CSV for each participant. The second program simply reads these CSVs and dutifully displays trials based only on the information they contain, ensuring that no aspect of a trial (e.g., the color of a distractor item) could be forgotten.
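A stripped-down version of the generating program might look something like this. The design (set sizes, colors, screen coordinates) and every name in it are invented for the example; the point is only that all randomness happens here and every resulting value lands in the CSV.

```python
import csv
import random

# Hypothetical design: every name and value below is illustrative.
COLORS = ["red", "green", "blue"]
SET_SIZES = [4, 8, 12]
FIELDS = ["participant", "trial_number", "set_size", "target_present",
          "target_color", "distractors"]


def generate_trials(participant_id, n_repeats=2):
    """Return a shuffled, fully specified trial list for one participant."""
    rng = random.Random(participant_id)  # seeded per participant
    trials = []
    for _ in range(n_repeats):
        for set_size in SET_SIZES:
            for target_present in (0, 1):
                # Each distractor is stored explicitly as x:y:color.
                distractors = ";".join(
                    f"{rng.randrange(800)}:{rng.randrange(600)}:{rng.choice(COLORS)}"
                    for _ in range(set_size - 1)
                )
                trials.append({
                    "participant": participant_id,
                    "set_size": set_size,
                    "target_present": target_present,
                    "target_color": rng.choice(COLORS),
                    "distractors": distractors,
                })
    rng.shuffle(trials)
    for number, trial in enumerate(trials, start=1):
        trial["trial_number"] = number
    return trials


def write_trial_file(path, trials):
    """Generator side: save the trial list for one participant."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(trials)


def read_trial_file(path):
    """Display side: everything it knows about a trial comes from this file."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

The display program would only ever call read_trial_file, so it has no opportunity to make a random choice that goes unrecorded.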

The display program records responses from participants and combines them with trial info to create simple output files for analysis. To further protect against data loss, it also records, with timestamps, a simple log of every event that occurs during the experiment. The log file includes the experiment start time, keypresses and input events, changes to the display, and anything else that could happen. Between the input CSVs and this log file, it’s possible to recreate exactly what happened during the course of the study – even if the information wasn’t in the “simple” output files. I make sure that the output is written to disk frequently to ensure that nothing is lost in case of a system crash. This approach also makes it easy to restart at a specific point, which is useful for long studies and projects using fMRI (the scanner makes it easy to have false starts).
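The logging side can be very small. This is a rough sketch rather than my actual logger: the JSON-lines format, the event names, and the last_completed_trial helper are all assumptions made for the example.

```python
import json
import os
import time


class EventLog:
    """Append-only, timestamped log with one JSON object per line."""

    def __init__(self, path):
        self._file = open(path, "a", encoding="utf-8")

    def record(self, event, **details):
        entry = {
            "wall_time": time.time(),          # when it happened
            "monotonic": time.perf_counter(),  # for intervals within a session
            "event": event,
            "details": details,
        }
        self._file.write(json.dumps(entry) + "\n")
        # Push the line out of Python's buffer and ask the OS to write it.
        self._file.flush()
        os.fsync(self._file.fileno())
        return entry

    def close(self):
        self._file.close()


def read_log(path):
    """Read entries back, stopping at a partial line left by a crash."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError:
                break
    return entries


def last_completed_trial(path):
    """Highest trial number with a (hypothetical) 'trial_end' event, or 0."""
    done = [e["details"]["trial_number"]
            for e in read_log(path) if e["event"] == "trial_end"]
    return max(done, default=0)
```

After a crash or a scanner false start, the display program could call last_completed_trial on the existing log and resume from the next row of the participant’s input CSV. Syncing to disk on every event has a cost, so in a timing-critical loop it may be better to sync once per trial instead.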

My position at the FAA doesn’t involve programming a lot of studies. We do most of our work on fairly complicated simulator configurations (I have yet to do a study that didn’t include a diagram describing the servers involved), and there are a lot of good programmers around who are here specifically to keep these running. I hope this lesson is useful for anybody else who might be collecting data from people, whether it’s in the context of a psychology study or user testing.