Liberating and future-proofing your research data

You should store and document all research data so that a complete stranger could open it and figure out how to use it. You may not think anyone will ever come behind you and use your data, results, code, or simulations. That might be true, but consider this: in 10 years, do you have any advantage over a total stranger when it comes to interpreting your undocumented data and processes?

Not as much of a margin as you think, I’d bet.

I recently had to liberate several gigabytes of data from a directory of SigmaPlot files I received from a colleague. These files contained experimental results I am using to validate numerical models. SigmaPlot appears to be a powerful tool (the graphs were beyond outstanding), but this was a time-consuming, tedious process. I had to install and activate an evaluation copy (I was fortunate one existed) and traverse workbook by workbook within each file, since the Excel export only works on a per-workbook basis. Of course, I don't use Excel, so I had further work to do to prepare everything for processing in Python.

All of this would have been a moot point if the data had been stored as CSV or plain text. I can open and process data stored in CSV on any operating system with any number of tools, for free. And I am confident that in 10 years' time, I will be able to do the same.
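That's the whole appeal: no special application required. Even Python's standard library can parse CSV on its own. Here's a minimal sketch (the column names and values are made up for illustration):

```python
import csv
import io

# Hypothetical experimental data, stored the future-proof way:
# plain CSV text. In practice this would come from open("results.csv").
raw = """time_s,displacement_mm
0.0,0.00
0.5,1.25
1.0,2.40
"""

# DictReader maps each row to {column_name: value} using the header line.
reader = csv.DictReader(io.StringIO(raw))

# Convert the string fields to floats for numerical work.
rows = [{key: float(value) for key, value in row.items()} for row in reader]

print(rows[1]["displacement_mm"])
```

No license, no activation, no per-workbook export loop. Any stranger (including future me) can run this on any operating system.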

This experience solidified my resolve to design my research processes in such a way as to minimize friction for anyone in the future who might want to work with whatever files or data I leave behind.

Functionally for me, this means (where possible!):

I could go on about this, but I’ll stop here for now. I would love to hear from the researchers in the crowd, so I’ll ask:

What measures do you implement to future-proof your data and processes?

Update: Eugene Wallingford wrote a really thoughtful response to this post called The Future of Your Data that I’d love for you to check out.