Keeping data in context: using metadata-friendly, computer-readable storage formats for research data


Long-term records of environmental change consist of complex, multi-parameter data, often collected from many sites, and this complexity can make the data challenging to store, visualize, and manipulate. Metadata documenting how and why the data were collected are often omitted or stored separately from the data. Dedicated programs have attempted to address these problems, but their storage formats can be inflexible and/or proprietary, limiting future reuse of the data. In essence, environmental data are composed of measurements, each having qualifiers (e.g., location identifier, depth below surface, and measured parameter), a value, and tags (e.g., amount of error, number of replicates, written notes pertaining to the value). When data are stored in a table with one row per measurement, the maximum amount of measurement information is retained; when data are stored in a table with one row per time interval per location (one column per parameter), some information is lost but the data are more amenable to visualization in spreadsheet software. Conversion between these structures is easily accomplished using either interactive (e.g., spreadsheet software) or programmatic (e.g., R and Python) mechanisms. As more advanced statistical treatment of data becomes common in long-term environmental studies, storing data in a format that does not lose information is advantageous because it enhances the replicability of visualizations and statistical analyses. As datasets are increasingly combined with others and reused in future analyses, formats that enable the storage of metadata are particularly important for data collectors to consider.
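
As a rough illustration of the two table structures described above, the following Python sketch converts a one-row-per-measurement table to a one-column-per-parameter table and back. The records, column names (location, depth_cm, param, value, error, note), and function names are hypothetical and chosen only for this example; depth stands in for the interval qualifier, and the sketch assumes that the wide table keeps values only, so the tags do not survive the round trip.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per measurement.
# Qualifiers: location, depth_cm, param. Value: value. Tags: error, note.
# All names and numbers are illustrative, not taken from a real dataset.
long_rows = [
    {"location": "LakeA", "depth_cm": 0, "param": "pH", "value": 7.1, "error": 0.1, "note": ""},
    {"location": "LakeA", "depth_cm": 0, "param": "LOI", "value": 12.5, "error": 0.5, "note": "rerun"},
    {"location": "LakeA", "depth_cm": 1, "param": "pH", "value": 7.0, "error": 0.1, "note": ""},
    {"location": "LakeA", "depth_cm": 1, "param": "LOI", "value": 13.2, "error": 0.5, "note": ""},
]

ID_KEYS = ("location", "depth_cm")


def long_to_wide(rows, id_keys=ID_KEYS, param_key="param", value_key="value"):
    """Return one row per unique id_keys combination, one column per parameter.

    Only the value is carried over; tag columns such as error and note
    are dropped in this sketch.
    """
    wide = defaultdict(dict)
    for row in rows:
        key = tuple(row[k] for k in id_keys)
        wide[key][row[param_key]] = row[value_key]
    return [dict(zip(id_keys, key), **params) for key, params in sorted(wide.items())]


def wide_to_long(rows, id_keys=ID_KEYS, param_key="param", value_key="value"):
    """Return one row per measurement; tags cannot be recovered from the wide table."""
    out = []
    for row in rows:
        ids = {k: row[k] for k in id_keys}
        for name, value in row.items():
            if name not in id_keys:
                out.append({**ids, param_key: name, value_key: value})
    return out


wide_rows = long_to_wide(long_rows)
round_trip = wide_to_long(wide_rows)
```

Here wide_rows holds one row per location and depth with pH and LOI as columns, while round_trip again has one row per measurement but no error or note fields, which is the kind of loss that storing the one-row-per-measurement table avoids.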

Canadian/American Quaternary Association Joint Meeting