Nora was a PhD student when she attended a meeting that would change the course of her career. Until that point, Nora had thought of herself chiefly as an entomologist, with her primary work objective being (as she joked with her colleagues) counting bugs. Born and raised in the Midwest, Nora felt especially drawn to studying field crop pests and their voracious appetite for agricultural delicacies. In her years as a graduate student, she had become all too familiar with how pests could invade fields and decimate healthy harvests. Small as they were, hungry insects commanded attention and respect – something Nora provided with satisfaction from behind her microscope.
When Nora decided to attend a meeting bringing together the region’s crop researchers, she wasn’t sure what to expect. But when a man named Andy stood up and described a major dataset on crop pests that had been underutilized and needed some attention, Nora sensed she was in the right place. With the help of his rotating team of researchers, Andy had collected data from a network of more than 40 insect sampling sites every week, every growing season, for the past several years.
But Andy felt the dataset had more to offer. In the process of collecting data for his work on species diversity, he had also managed to net many other data points which might intrigue someone with a quantitative focus. These byproduct data were there, ready for use, but Andy had no plans for them. Why not serve them up to someone else with a different skill set?
“A lot of people have put time into collecting this data, and we still have a lot more that we can do with it,” he explained to the group. “It’s a public resource, and I welcome anyone who has an interest to come and analyze it.” That was all the encouragement Nora needed; she left the meeting with Andy’s word that he would send her a subset of the data that would be useful for her to work on.
Nora got to work compiling the crop pest data she received from Andy with some others she had worked on previously. Stitched together, they created the biggest collection of data Nora had ever dealt with. At the time, Nora didn’t know exactly what she was getting into – it was her first introduction to a collaboratively generated database, and she felt a sort of nervous excitement at the thought of all those insect counts waiting on her computer.
Nora dove in, looking everywhere for patterns. By the time she resurfaced, she had found a new love: data analysis. Never would she leave behind the world of field pests, but it lacked the mystery and complexity she had only just realized she was missing from her work life. Nora was ready to do some more in-depth data analysis and manipulation – something beyond the scope of her previous studies.
Shortly after Nora finished analyzing the data and drafting a paper on her findings, Andy retired from the project, leaving his technician, Didi, to maintain the still-growing database. With funding drying up and their work coming to an end, Nora and Didi met to finish off the project’s loose ends.
Then, they had an idea.
There was still plenty of data yet to be explored – what if they compiled everything, every last spreadsheet, every species count, into one place? Nora’s earlier analysis had considered just one species; imagine what hidden secrets the other 249 might contain!
Their plan was solid, but it presented some special challenges that were as big as the database itself. Each spreadsheet was divided by state, site, and year, so formatting differences were sure to crop up. And even though the data had been collected by expert taxonomists, it had all been entered by a rotating staff of summer students who, as it turned out, all had very different ideas about everything from naming conventions to proper spellings of species names to implied zeroes.
It took two weeks of vigilant quality control to get all of the dataset’s pieces into a consistent format. On the last day Nora leaned back in her chair, trying to shake the last of the species names from her head. The finished dataset before her was a work of beauty; it was perfectly preened, it was in CSV format, and it contained 3.2 million observations.
But Nora and Didi had only just begun. With the complete dataset in order, they decided to go back into the species counts and look at geographical patterns in distribution. And abundance. And while they were at it, maybe they would look at population genetics! But why stop there? The physical specimens still had a lot of information to offer. What if they used those to tackle endosymbionts, and then did some molecular analysis?!
There was still so much to do!
Even now, Nora and Didi persist with their work. Their dream has stretched and expanded over time, taking on still more opportunities for analyses. It may even be enough, they hope, to earn a grant award under the Long Term Research in Environmental Biology program. By continuing to add to the data and reassess its possible uses, Nora and Didi are making sure the dataset fulfills its enormous potential. And though Andy may have retired from the world of bug counting long ago, his work endures through Nora and Didi’s collaborative efforts. With a measure of foresight and an eye for future data applications, he advocated for continued analyses and open access at a time when others might simply have kept the data for themselves.
To maintain the spirit of sharing that had brought the dataset to her in the first place, Nora contributed the full dataset to a biological station and made it downloadable. Now all 3.2 million data points are available to anyone with internet access. Like Andy, Nora enjoys reminding other researchers to explore the dataset and see if they, too, might find a few bytes they’d like to net.
Image: CC-BY-NC-SA by Drriss & Marrionn via Flickr
Film: Becky Beamer