The Patience of the Data Hunter

Eileen had become an expert in datasets involving oil spills around the world. Horrified but equally fascinated by these man-made disasters, she had made a career out of tracking the oil spills and their ecological fallout. While she often felt sickened by witnessing such devastation, Eileen’s gloomy specialty had at least one silver lining: each time a spill occurred, new research rose in the oily wake – research that could make a stronger case for responsible oil transportation and use practices. This one small thing gave Eileen great comfort as she pored over the latest reports of compromised shorelines and suffering marine life.

Her work focused on compiling a large database of oil spill-related ecological data, which was used to evaluate the long-term impacts such disasters had on physical and biological systems. As she dedicated more and more of her time to searching, Eileen was dismayed to find that only data from the most recent studies was easy to track down. In these cases, digging for the information was usually as simple as contacting the PIs and requesting the data she needed. While they were occasionally slow to respond to her prodding, most PIs were willing to work with Eileen and send her data by email. The steady exchanges kept the project moving, and Eileen was grateful to those who made her job easier. But as she set her sights on older studies and their datasets, a troubling pattern started to emerge: the older the research study, the more difficult it was to locate the original PI, and therefore the data.

In many cases, the researcher had moved on to a different job or agency, leaving his or her work in the hands of a new PI, who, although perfectly competent, did not understand the work nearly as intimately. In other cases, the data originator had retired or, in a few sad cases, even passed away. As these instances piled up, Eileen was forced to revise her project timeline again and again. Waiting and wasted time became frustratingly routine for Eileen in her role as a data hunter.

Eileen politely persisted, feeling a little guilty over what was in reality a perfectly reasonable request. She recognized that the process would be time-consuming for the PIs, and much more involved than simply locating a digital file and transferring it over to her; those who agreed to help were likely signing on (whether they knew it or not) to mine the filing cabinets of long-departed co-workers (most of whom they had never met) in search of a key piece of paper or a forgotten file folder, organized in some unknown system towards an unfamiliar purpose.

One researcher who had inherited a dataset responded despairingly to Eileen’s request for metadata: “I’m sure the notebook is here somewhere,” she wrote back, “but I literally don’t know where to look for it.”

Another PI took offense at Eileen’s request to share data. Concerned that his intellectual property would be exploited or his data disrespected, he fired back with accusations and a refusal to cooperate. Conceding that the data was lost to her, Eileen turned her effort to finding other datasets elsewhere.

A third dataset looked particularly promising for use in a global study, but its PI had neglected to include units of measurement in the dataset. Unwilling to give up on a potentially great contribution, Eileen decided to do some detective work and pull up the associated publication, looking for any clues that might lead to a breakthrough. At long last, Eileen found a single table referencing the units for a particular column of data. With the units finally established, she worked backwards to make sense of the data – but at a cost of several hours’ work.

Throughout the project, Eileen has encountered a startling number of the obstacles to data sharing – employee turnover, poor metadata creation, and suspicion towards data sharing culture, to name a few. In all of her correspondence and requests for data, she has yet to complete anything close to a flawless transfer of a dataset and all its hoped-for pieces. What’s wrong with this picture? Instead of heading straight for a data repository and locating the desired datasets and associated metadata, Eileen and others like her are forced to rely on tenuous and sometimes nonexistent personal connections to locate long-gone PIs and plead with them for cooperation. To access and even interpret the data, Eileen often finds herself dependent on the schedules and good will of the PIs (many of whom have moved on and retain little interest in the work). Meanwhile, the additional steps of contact and back-and-forth communication drain energy and time from all involved.

Eileen continues to persist, slowly building up her database of oil spill research and collecting it for further application – but like any researcher’s, a data hunter’s time and patience are limited. Until open access and contribution to data repositories become common practice, Eileen can only hope to get to the datasets before the trails cool and tracking turns impossible.

Image: CC BY-NC 2.0 by golden goat via flickr