The passing down of information from one generation to the next has been an important part of human culture since the dawn of civilization. From scrolls to books to paintings to sheet music, practices have long been established to keep these artifacts of information available to generations living centuries after the death of their creators. It is ironic that we in the information age, an age of seeming mastery over vast amounts of information at the fingertips of billions, appear to be at risk of stunting or even ending this continuum of knowledge preservation.

Moving from the preservation of traditional analogue media such as paper to digital media such as computer files has proven far from trivial. In addition to the threat of physical damage over time, digital media face further challenges arising from the magnitude and variety of the information available. Because digital media make information easier to manipulate and access, the amount of information available grows exponentially with the passage of time. The variety of information also increases over time as new software is created with new requirements for storing data, new types of data are explored, and commercial interests come into play. The end effect is that digital preservation must deal not only with the physical storage of the large number of bytes representing the information we are preserving, but also with ensuring that we can access and interpret every piece of archived information centuries down the road.

This project addresses the research challenges of digital preservation in terms of data diversity and scale, while also focusing on the development of preservation solutions in the form of tools and services. Specifically, we address accessibility with regard to the ever-growing number of file formats that represent essentially the same kinds of information. It has been the case, and will continue to be the case, that digital files are preserved on tape or disk yet are inaccessible some decades later because the software needed to load the data no longer exists. In the case of 3D data we have documented over 140 file formats. Many of these formats are proprietary with undisclosed specifications, meaning that if the owning company were ever to disappear, all user data stored in that format could very quickly become inaccessible. These situations are occurring today and will only grow worse with time. In past work we investigated the problem of identifying an optimal file format for long-term preservation, one that maximizes accessibility while minimizing the information lost as we convert to the desired format from the other formats in an archive. In turn we developed tools to carry out large numbers of conversions in a massively scalable manner, created a registry of software indexable by input/output formats, and laid down the framework for a library of comparison measures to estimate content loss before and after conversions across a number of data types.
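To make the idea of a registry indexed by input/output formats more concrete, here is a minimal sketch, not the project's actual implementation. It assumes the registry can be viewed as a directed graph whose nodes are file formats and whose edges are software packages that convert one format to another, and it uses a breadth-first search to find a conversion chain with the fewest steps. The class name `ConverterRegistry`, its methods, and the tool names `ToolA`, `ToolB`, and `ToolC` are hypothetical, and the format names are used only for illustration.

```python
from collections import defaultdict, deque


class ConverterRegistry:
    """Hypothetical registry of conversion software, indexed by input format."""

    def __init__(self):
        # Maps an input format to a list of (output_format, software) pairs.
        self._by_input = defaultdict(list)

    def register(self, software, input_format, output_format):
        """Record that `software` can convert input_format to output_format."""
        self._by_input[input_format].append((output_format, software))

    def find_path(self, source, target):
        """Return a shortest chain of (from_format, software, to_format) steps,
        an empty list if source equals target, or None if no chain exists."""
        if source == target:
            return []
        # parents[fmt] remembers how fmt was first reached during the search.
        parents = {source: None}
        queue = deque([source])
        while queue:
            fmt = queue.popleft()
            for next_fmt, software in self._by_input.get(fmt, []):
                if next_fmt in parents:
                    continue  # already reached by an equally short or shorter chain
                parents[next_fmt] = (fmt, software)
                if next_fmt == target:
                    # Walk back from the target to rebuild the chain.
                    path = []
                    node = target
                    while parents[node] is not None:
                        prev, used = parents[node]
                        path.append((prev, used, node))
                        node = prev
                    path.reverse()
                    return path
                queue.append(next_fmt)
        return None


# Illustrative data only: the tool names are made up.
registry = ConverterRegistry()
registry.register("ToolA", "obj", "ply")
registry.register("ToolB", "ply", "x3d")
registry.register("ToolC", "obj", "stl")

path = registry.find_path("obj", "x3d")
unreachable = registry.find_path("stl", "x3d")
```

Here `path` holds the two-step chain through `ply`, and `unreachable` is `None` because no registered software reads `stl`. Fewest steps is only one possible criterion; since each conversion may lose information, a fuller treatment would presumably weight the edges by estimated content loss, which this sketch does not attempt.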

The practical motivation of our research stems from the exponentially growing number of electronic records and the growing number of file formats dealt with by archives when conducting business with the US government. As an example, electronic records (i.e., digital files) come to the National Archives and Records Administration (NARA) from the Congress, the courts, the Executive Office of the President, numerous Presidential commissions, and nearly 100 bureaus, departments, and other components of executive branch agencies and their contractors. These digital files arrive in large quantities and in a wide variety of file formats. They must be appraised and stored in a manner that will allow access for centuries to come, a task made difficult by the lack of general services to manipulate these files, render them, and compare their contents. The objectives of this effort are to: enable automated and computationally scalable file format conversions with control over conversion parameters; predict the computational costs of data-intensive and CPU-intensive file format conversions and file comparisons; and support content-based file-to-file comparisons, along with the choice of comparison methods and their parameters, based on specific end-user needs.

National Archives and Records Administration/National Science Foundation – Innovative Systems and Software: Applications to NARA Research Problems (OCI-0525308), 2010-2013

Kenton McHenry (PI)