Preparation for long-term preservation

What should I keep in mind when preparing data for preservation and re-use?

Preparing data for preservation and re-use is not a single stage but an ongoing part of the research process.

  • Archives and repositories require clarity on who owns the data and confirmation that permission for preservation and re-use has been granted
  • Data containing direct identifiers, or several significant indirect identifiers, will not be accepted unless these are anonymised or removed
  • Data requires good explanatory contextual material and information to be accepted into an archive or repository
  • Data may need to be converted or migrated to a format that can be preserved for the long term

Data Ownership  

Is it clear who owns the data? Archives and repositories are unable to accept data where ownership is unclear or permission for preservation and re-use has not been given. Clarify ownership early in the research in case there is a problem. If data cannot be shared because of ownership issues, the research funder must be informed during the funding application stage. Funders will hold principal investigators liable where data cannot be shared because ownership rights were not resolved or permission to deposit data had not been sought.

Sensitivity

Preparing sensitive data may involve removing it, anonymising it, and matching the data to an appropriate licence or level of controlled access in the archive or repository.

Personal data should never be included in a data set unless a respondent has given specific consent, ideally in writing. Archives and repositories will not accept variables that can identify an individual directly, or indirectly, whether in isolation or by linking to another publicly available data set, unless there is a compelling reason why the data cannot be anonymised or removed.

Direct identifiers

  • Address or full postcode
  • Biometric data
  • Car registration
  • Medical device ID
  • Name
  • National Insurance or tax number
  • Personal dates (for example, date of birth)
  • Phone/email/social network profile
  • Photo of face or unique feature

Indirect identifiers

  • Ethnicity
  • Rare diseases or unusual treatments
  • Household and family composition
  • High-profile involvement in an event covered by the media
  • Names of relatives
  • Sensitive data (for example, illicit drug use)
  • Height or weight
  • Sex or gender
  • Place of birth
  • Small population units (<100)
  • Socioeconomic data
  • Verbatim quotes
  • Year of birth
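
As a rough illustration of how identifiers might be screened before deposit, the sketch below drops direct identifier columns and flags combinations of indirect identifiers shared by only a few records. The column names are hypothetical, and using 100 as the default group size is an assumption borrowed from the "small population units" entry above; an archive or repository may apply different rules.

```python
from collections import Counter

# Hypothetical column names, chosen for illustration only.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "date_of_birth"}
INDIRECT_IDENTIFIERS = ("ethnicity", "place_of_birth", "year_of_birth")


def drop_direct_identifiers(records):
    """Return copies of the records without the direct identifier columns."""
    return [
        {key: value for key, value in record.items()
         if key not in DIRECT_IDENTIFIERS}
        for record in records
    ]


def rare_combinations(records, min_group_size=100):
    """Return combinations of indirect identifiers that are shared by
    fewer than min_group_size records, with the number of records each."""
    counts = Counter(
        tuple(record.get(column) for column in INDIRECT_IDENTIFIERS)
        for record in records
    )
    return {combo: n for combo, n in counts.items() if n < min_group_size}
```

Any combination returned by `rare_combinations` points to records that may need further anonymisation (for example, grouping years of birth into bands) or controlled access. This is a screening aid only and does not replace a proper disclosure review.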

Funders will treat cases where a poorly worded or unnecessarily restrictive consent process has prevented preservation and sharing as a research data management failure.

Documentation and Metadata

Research data has many re-use purposes: teaching, replicating existing research findings, repurposing for new research questions, inclusion in larger data sets, and comparing or combining data from different sources. It is therefore difficult to know exactly how the data will be used. However, we can reasonably guess what re-users will want to do: most likely discover, integrate, and aggregate data. Good quality documentation and metadata enhance the value of the data, aid its discoverability, and facilitate wider re-use.

Much of what counts as preparation is generating documentation and metadata during the research itself. The documentation could include elements of the following:

Reasons data was collected

  • Aims and objectives of the project; these are often outlined in funding proposals or end of award reports.

Data collection methods and procedures

  • Definition of the universe of analysis and sample framework, notes on the instruments used to collect and analyse data, plus information on the conditions of data collection.

Data collection tools

  • Copy of the questionnaire(s), prompts, and/or interview schedule(s).

Database schema and data structure

  • Variable labels and descriptions, an outline of relationships within the dataset.

Coding schemes

  • Definition of coding conventions used – including information on missing data, categories, classifications, acronyms and annotations.

Data modifications

  • Anonymisation work undertaken. Specification of any weighting used, identification of derived variables and the syntax used to create them, output files, and subsequent modifications to the original data.

Quality control measures

  • Details on activities undertaken to verify and clean the data, an outline of formatting applied to the data, an explanation of file naming conventions, and if needed, a statement on known problems with the data.

Simple things can help: check spellings, aim for standardised vocabularies, and avoid acronyms. Finally, as a test of readiness for re-use and archiving, ask whether someone familiar with the field could understand the data without having to ask questions of the original data creator.
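
One way to keep variable labels, coding schemes, and missing-value definitions alongside the data is a small machine-readable codebook. The sketch below is illustrative only: the variable names, codes, and structure are hypothetical and do not follow any particular archive's metadata standard.

```python
import json

# Hypothetical codebook: one entry per variable in the data set.
CODEBOOK = {
    "resp_id": {
        "label": "Respondent identifier (pseudonymised)",
        "type": "integer",
    },
    "q1_trust": {
        "label": "Trust in local government",
        "type": "categorical",
        "codes": {"1": "Low", "2": "Medium", "3": "High"},
        "missing": {"-9": "Refused", "-8": "Don't know"},
    },
}


def undocumented_variables(column_names, codebook):
    """Return the column names that have no entry, or no label, in the codebook."""
    return [
        name for name in column_names
        if not codebook.get(name, {}).get("label")
    ]


def codebook_as_json(codebook):
    """Serialise the codebook to JSON text so it can be deposited with the data."""
    return json.dumps(codebook, indent=2, sort_keys=True)
```

Running `undocumented_variables` against the column names of the final data file gives a quick list of variables that a re-user would have to ask about.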

Converting or Migrating Files

At some time during your research you may need to convert or migrate your data files from one format to another, perhaps because the place chosen for long-term preservation cannot handle the current format. Other reasons include a new computer, new software, sharing with someone who has different software, working on a shared platform instead of your own PC, or simply ensuring that your data can be read and used in the future. The safest way to guarantee long-term access to usable data is to convert it to standard formats that most software can interpret and that are suitable for data interchange and transformation.

Some “lossiness” (i.e. reduction in quality) may occur when migrating from one file format to another. It is important for you to understand what is at risk for the type of data you are working with.

Potential risks for loss or corruption on conversion or migration to new media include the following:

  • Textual data: editing such as highlighting, bold text or headers/footers may be lost
  • Data held in statistical packages, spreadsheets or databases: some data or internal metadata such as missing value definitions, decimal numbers, formulas or variable labels may be lost during conversion to another format, or data may be truncated
  • Image files: loss of layers, colour fidelity, resolution, etc.
  • Multimedia: as above, but attention to frame rates, sound quality, codecs and wrappers is needed.

It is worth briefing yourself on the format you are converting from and to before you begin; at least look them up on the web. 

Check the integrity of converted files as thoroughly as possible immediately afterwards, for example by counting rows and columns, testing functionality, and testing export.
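
For tabular data, the row and column count can be scripted. The following is a minimal sketch that assumes both the original and the converted data can be exported as CSV with a single header row; the function names are illustrative, and matching counts do not prove that individual values survived the conversion.

```python
import csv


def table_shape(text_stream):
    """Return (row_count, column_count) for CSV data.
    The first row is treated as the header and is not counted as a data row."""
    reader = csv.reader(text_stream)
    header = next(reader, [])
    row_count = sum(1 for _ in reader)
    return row_count, len(header)


def compare_shapes(original_stream, converted_stream):
    """Return a list of problems found; an empty list means the shapes match."""
    orig_rows, orig_cols = table_shape(original_stream)
    conv_rows, conv_cols = table_shape(converted_stream)
    problems = []
    if orig_rows != conv_rows:
        problems.append(f"row count changed: {orig_rows} -> {conv_rows}")
    if orig_cols != conv_cols:
        problems.append(f"column count changed: {orig_cols} -> {conv_cols}")
    return problems
```

With real files, open each one with `open(path, newline="")`, as the `csv` module documentation recommends, and pass the two file objects to `compare_shapes`. Spot checks of values, such as decimal precision and missing value codes, are still worth doing by hand.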