As a graduate student, Chris had spent the last several weeks working on a statistical analysis exploring the links between air pollution and heart disease. The dataset he was analyzing was quite large (the biggest Chris had ever worked with) and had been pulled from a much larger one, which combined air quality data with an immense collection of hospital records, census data, and more. With several million rows and seemingly endless columns, the smaller subset Chris was working with was already enough to make anyone’s head spin. Because it was not feasible to look through the values in each of the rows and columns one by one, Chris had to rely on summary statistics – things like mean, median, and max and min values – to help him identify and manage potential errors in the data.
The full dataset was stored on a research server at Chris' university, where (due to the confidential nature of some of the data) it could only be accessed by authorized personnel. While Chris was not authorized to access this comprehensive dataset directly, he was authorized to work with a subset that had been culled from it and placed into a digital folder shared between Chris and his thesis advisor. Though something of a computer whiz himself, Chris was careful when setting up his workflow. Because he knew that it was all too easy to make mistakes when running analyses, he wrote his scripts so that they would always use a temporary copy of the data rather than the original file. This step minimized the risk of inadvertently making changes to the file his advisor had shared with him. It also helped him avoid creating too many file versions, because he saved only the commands necessary to generate them. Each time Chris found an error in the data that needed correcting or decided to organize the data in a particular way for his analysis, he just added new commands to the script he used to generate the temporary dataset. With a well-documented workflow and a solid set of data protection measures in place, Chris was ready to start seeing some results.
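A workflow like the one Chris set up, in which scripts only ever touch a temporary copy, might be sketched as follows. This is a minimal illustration, not Chris' actual script: the function name `make_working_copy` and the `analysis_` directory prefix are hypothetical, and the story does not say which language or tools he used.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(original: Path) -> Path:
    """Copy the shared file into a fresh temporary directory.

    All later cleaning and analysis steps should read and write the
    returned path, never the original. (Hypothetical helper.)
    """
    workdir = Path(tempfile.mkdtemp(prefix="analysis_"))
    working_copy = workdir / original.name
    # copy2 copies the contents and, where possible, the file metadata.
    shutil.copy2(original, working_copy)
    return working_copy
```

Because the original path is used only once, as the source of the copy, a mistake later in the script can damage the temporary file but not the shared one. The protection holds only as long as no other line in the script refers to the original location.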
Subject to the perils of his unruly graduate student schedule, Chris found himself spending many hours in front of the computer, his eyelids propped open by caffeine and the glare of the monitor before him. To check his code with different combinations of variables and statistical analyses, he was selecting and running just one section of his code at a time. Things were getting exciting – he was working through the bugs in his code and finding some interesting patterns!
A mistake, he realized later, was probably inevitable. With one ill stroke of the keyboard, he managed to delete a section of code without noticing it. Normally this would not have been a problem, as he would have soon noticed the missing chunk of code and been able to retrace his steps to recover it from an earlier version. But the block of code he deleted contained one essential element that had an impact on the code that followed it – the comment character.
The code that the comment prevented from being executed was a reference to the location of the original file. Chris had included it as an annotation in the code he was working with so that he would never forget where the data were located. While this type of documentation is good and generally encouraged, in this case it meant that one errant deletion could lead to a mini-disaster. When Chris clicked ‘Run’, the reference to the original file was read and interpreted as the location for a new blank file of rambling white space. Just like that, in the space of a mouse click, the subset Chris was so depending on had ceased to exist.
If this had been the only copy of the data, he would have almost certainly lost months or years of work. But fortunately the data were still contained within the larger dataset from which the subset was created. This larger dataset was preserved on the server which Chris did not have access to, and therefore could not possibly manage to alter beyond repair!
Chris knew he could not be the first person to commit such an error. But as he sat down to draft an email to his advisor, he could not help but shake his head in disbelief – what kind of researcher couldn’t keep a handle on his own data?! To tell the truth, Chris had been quite proud of his workflow documentation and data protection practices. How could he have not recognized this risk? In hindsight, the solution was annoyingly clear: if he had saved an additional backup of his data in a separate folder, he could have instantly restored the subset and his pride. The entire mess could have been avoided with a minute of his time and a little more foresight. But Chris' advisor told him not to feel embarrassed, that she could access the main dataset and re-subset the lost data. She estimated that by using the query language she had saved when doing the original query, the re-subsetting would take an hour at most.
Of course, the process took a little time because of the massive size of the original file – but it worked. If his advisor had been away at a conference (as she was often prone to be), Chris might have been forced to give up several days of progress as he waited for her return. Luckily, little time was lost and the subset was on the shared server as though it had never left.
Chris' lab already had good practices and norms for data protection. After his experience, however, they made some changes to how they manage shared data. Now each researcher keeps consistent backups of the datasets they’re sharing and working on so that little errors like the one Chris made don’t destroy data that everyone is depending on.
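The kind of backup the lab adopted could be as simple as the sketch below. It is an assumption-laden illustration rather than the lab's actual procedure: the function name `back_up`, the timestamped file name, and the choice to mark the backup read-only are all hypothetical.

```python
import shutil
import stat
from datetime import datetime
from pathlib import Path


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy a dataset into a separate folder under a timestamped name.

    (Hypothetical helper; the naming scheme is an assumption.)
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    # Mark the backup read-only so that an accidental attempt to open it
    # for writing is more likely to fail than to silently empty the file.
    target.chmod(stat.S_IREAD)
    return target
```

Keeping the backup in a different folder from the working data matters here: a script that overwrites the shared file by mistake has no reason to know the backup's path.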
Chris began his analysis with an intentionally structured workflow most researchers would envy. But careful though he was in his work, the smallest oversight found a way to test the structure’s strength. After averting the crisis of human error, Chris realized that even the most vigilantly maintained workflow could benefit from the teachings of trial and error.
Image: CC BY-NC-ND 2.0 by Michael Lokner via flickr