Liberating and future-proofing your research data
30 Aug 2011
You should store and document all research data in such a way that a complete stranger could open it and figure out how to use it. You may not think anyone will ever come along after you and use your data, results, code, or simulations. That might be true, but consider this: in 10 years, will you have any advantage over a total stranger when it comes to interpreting your undocumented data and processes?
Not as much of a margin as you think, I’d bet.
I recently had to liberate several gigabytes of data from a directory of SigmaPlot files, which I received from a colleague. These files contained experimental results I am using to validate numerical models. SigmaPlot appears to be a powerful tool (the graphs were beyond outstanding), but extracting the data was a time-consuming, tedious process. I had to install and activate an evaluation copy (I was fortunate this existed) and traverse each file workbook by workbook, since the Excel export only works on a per-workbook basis. Of course, I don’t use Excel, so I had further work to do to prepare everything for processing in Python.
All of this would have been moot if the data had been stored as CSV or plain text. I can open and process data stored in CSV on any operating system with a large number of tools, for free. And I am confident that in 10 years’ time, I will be able to do the same.
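As a rough illustration of how little tooling CSV needs, here is a minimal sketch that writes a small table and reads it back using only Python's standard library. The column names and values are hypothetical placeholders, not data from the SigmaPlot files, and the helper names `write_measurements` and `read_measurements` are my own.

```python
import csv
import io


def write_measurements(stream, rows, fieldnames):
    """Write rows (a list of dicts) to stream as CSV with a header line."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


def read_measurements(stream):
    """Read CSV from stream, converting every value to float."""
    reader = csv.DictReader(stream)
    return [{key: float(value) for key, value in row.items()} for row in reader]


# Hypothetical sample data; putting units in the column names keeps the
# file self-describing.
rows = [
    {"time_s": 0.0, "temperature_K": 293.15},
    {"time_s": 1.0, "temperature_K": 294.02},
]

# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO()
write_measurements(buffer, rows, ["time_s", "temperature_K"])

buffer.seek(0)
restored = read_measurements(buffer)
```

For a file on disk, swap the buffer for `open("results.csv", "w", newline="")` when writing and `open("results.csv", newline="")` when reading; the file name here is only an example.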
This experience solidified my resolve to design my research processes so that they minimize friction for anyone who might want to work with whatever files or data I leave behind.
Functionally for me, this means (where possible!):
- Open source beats closed source.
- Ubiquitous beats niche software.
- Automation/scripting beats manual processes.
- Plain text beats binaries.
- READMEs in every project directory.
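The last two points combine nicely: a short script can check the README rule instead of relying on memory. The following is only a sketch, assuming each project lives in its own subdirectory of a common root and that a README is named `README`, `README.txt`, or `README.md` (both are my assumptions, as is the function name).

```python
from pathlib import Path

# Assumed naming convention, compared case-insensitively.
README_NAMES = {"readme", "readme.txt", "readme.md"}


def projects_missing_readme(root):
    """Return the names of immediate subdirectories of root that lack a README."""
    missing = []
    for project in sorted(Path(root).iterdir()):
        if not project.is_dir():
            continue  # Loose files in the root are not projects.
        names = {entry.name.lower() for entry in project.iterdir() if entry.is_file()}
        if not names & README_NAMES:
            missing.append(project.name)
    return missing
```

Calling `projects_missing_readme("research")`, where `research` is a hypothetical root directory, would return a sorted list of the project directories that still need documenting.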
I could go on about this, but I’ll stop here for now. I would love to hear from the researchers in the crowd, so I’ll ask:
What measures do you implement to future-proof your data and processes?