Preparation for long-term preservation
What should I keep in mind when preparing data for preservation and re-use?
- Preparing data for preservation and re-use is not a stage, but an ongoing part of the research process
- Archives and repositories require clarity on who owns data and that permission for preservation and re-use is granted
- Data containing direct identifiers, or a number of significant indirect identifiers, will not be accepted unless those identifiers are anonymised or removed
- Data requires good explanatory contextual material and information to be accepted into an archive or repository
- Data may need to be converted or migrated to formats that can be preserved for the long term
Data Ownership
Is it clear who owns the data? Archives and repositories are unable to accept data where ownership is unclear or permission for preservation and re-use has not been given. Clarify ownership early in the research in case there is a problem. If data cannot be shared because of ownership issues, the research funder must be informed during the funding application stage. Funders will hold principal investigators liable where data cannot be shared because ownership rights were not resolved or permission to deposit data had not been sought.
Sensitivity
Preparing sensitive data for deposit can involve removing it, anonymising it, and matching what remains to an appropriate licence or level of controlled access in the archive or repository.
Personal data should never be included in a data set unless the respondent has given specific consent, ideally in writing. Archives and repositories will not accept variables that can directly identify an individual, or that can indirectly identify them either in isolation or by linking to another publicly available data set -- unless there is a compelling reason why the data cannot be anonymised or removed.
Direct identifiers | Indirect identifiers |
Address or full postcode | Ethnicity |
Biometric data | Rare diseases or unusual treatments |
Car registration | Household and family composition |
Medical device ID | High-profile involvement in an event covered by the media |
Name | Relative names |
National Insurance or tax number | Sensitive data (for example, illicit drug use) |
Personal dates (for example, date of birth) | Height or weight |
Phone/email/social network profile | Sex or gender |
Photo of face or unique feature | Place of birth |
 | Small population units (<100) |
 | Socioeconomic data |
 | Verbatim quotes |
 | Year of birth |
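As a rough illustration of the anonymisation step, the sketch below drops a few direct identifiers and coarsens a date of birth to a year. The column names (`name`, `email`, `date_of_birth` and so on) and the sample record are hypothetical, and the approach is only one simple possibility. Year of birth still appears in the table above as an identifier, so a step like this reduces disclosure risk rather than removing it.

```python
import csv
import io

# Hypothetical column names; adapt to the variables in your own data set.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def anonymise_rows(rows):
    """Drop direct identifiers and coarsen date of birth to year of birth."""
    cleaned = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        # Assumes ISO dates (YYYY-MM-DD); keep only the four-digit year.
        if "date_of_birth" in kept:
            kept["year_of_birth"] = kept.pop("date_of_birth")[:4]
        cleaned.append(kept)
    return cleaned


# Invented sample record, used only to show the transformation.
sample = io.StringIO(
    "name,email,date_of_birth,response\n"
    "Jane Doe,jane@example.org,1984-03-12,agree\n"
)
rows = anonymise_rows(list(csv.DictReader(sample)))
print(rows)  # [{'response': 'agree', 'year_of_birth': '1984'}]
```

A script like this is no substitute for reviewing the data by eye: free-text answers and combinations of indirect identifiers can still identify a respondent.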
Funders will treat cases where a poorly worded or unnecessarily restrictive consent process has prevented preservation and sharing as a Research Data Management failure.
Documentation and Metadata
Research data can be re-used for many purposes: teaching, replicating existing research findings, repurposing for new research questions, or forming part of larger data sets that compare or combine data from different sources. It is therefore difficult to know exactly how it will be used. However, we can reasonably guess what re-users will want to do: most likely discover, integrate, and aggregate data. Here, good-quality documentation and metadata enhance the value of the data, aid its discoverability, and facilitate wider re-use.
Much of this preparation consists of generating documentation and metadata during the research itself. The documentation could include elements of the following:
Reasons data was collected
- Aims and objectives of the project; these are often outlined in funding proposals or end of award reports.
Data collection methods and procedures
- Definition of the universe of analysis and sample framework, notes on instruments used to collect data and analyse data, plus information on the conditions of data collection.
Data collection tools
- Copy of the questionnaire(s), prompts, and/or interview schedule(s).
Database schema and data structure
- Variable labels and descriptions, an outline of relationships within the dataset.
Coding schemes
- Definition of coding conventions used – including information on missing data, categories, classifications, acronyms and annotations.
Data modifications
- Anonymisation work undertaken. Specification of any weighting used, identification of derived variables and the syntax used to create them, output files, and subsequent modifications to the original data.
Quality control measures
- Details on activities undertaken to verify and clean the data, an outline of formatting applied to the data, an explanation of file naming conventions, and if needed, a statement on known problems with the data.
Simple things can help: check spellings, aim for standardised vocabularies, and avoid acronyms. Finally, as a test of readiness for re-use and archiving, ask yourself: could someone familiar with the field understand the data without having to ask questions of the original data creator?
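Variable labels, coding conventions and missing-value definitions can also be recorded in a machine-readable form kept alongside the data. The following is a minimal sketch only: the data set name, variable names, codes and the `undocumented` helper are all hypothetical, and JSON is just one convenient plain-text format for such a codebook.

```python
import json

# Hypothetical variable-level documentation for a small survey data set.
codebook = {
    "dataset": "example_survey",
    "variables": [
        {
            "name": "q1_satisfaction",
            "label": "Overall satisfaction with the service",
            "type": "integer",
            "codes": {"1": "Very dissatisfied", "5": "Very satisfied"},
            "missing": {"-9": "Refused", "-8": "Not applicable"},
        },
    ],
}


def undocumented(columns, codebook):
    """Return data columns that have no entry in the codebook."""
    documented = {v["name"] for v in codebook["variables"]}
    return [c for c in columns if c not in documented]


# Write the codebook out as readable JSON, then check coverage.
print(json.dumps(codebook, indent=2))
print(undocumented(["q1_satisfaction", "q2_age_band"], codebook))
```

Running a check like `undocumented` before deposit gives a quick list of variables that still need a label and description.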
Converting or Migrating Files
At some point during your research you may need to convert or migrate your data files from one format to another. The place chosen for long-term preservation may be unable to handle the current format, or the need may arise from a new computer, new software, sharing with someone who uses different software, or working on a shared platform instead of your own PC. You may also simply want to ensure that your data can be read and used in the future. The safest way to secure long-term access to usable data is to convert it to standard formats that most software can interpret and that are suitable for data interchange and transformation.
Some “lossiness” (i.e. reduction in quality) may occur when migrating from one file format to another. It is important for you to understand what is at risk for the type of data you are working with.
Potential risks for loss or corruption on conversion or migration to new media include the following:
- Textual data: editing such as highlighting, bold text or headers/footers may be lost
- Data held in statistical packages, spreadsheets or databases: some data or internal metadata such as missing value definitions, decimal numbers, formula or variable labels may be lost during conversion to another format, or data may be truncated
- Image files: loss of layers, colour fidelity, resolution etc.
- Multimedia: as above, but attention to frame rates, sound quality, codecs and wrappers is needed.
Before you begin, it is worth familiarising yourself with both the format you are converting from and the format you are converting to; at the very least, look them up on the web.
Check the integrity of converted files as thoroughly as possible immediately afterwards, e.g. by counting rows and columns, testing functionality, testing export, etc.
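The row-and-column check can be scripted. The sketch below assumes a tab-delimited original and a comma-separated converted copy, both held as in-memory strings for brevity. The function names and sample values are hypothetical. It shows why counting alone is not enough: the two tables have the same shape, yet a decimal value changed during conversion.

```python
import csv
import io


def table_shape(text, delimiter=","):
    """Return (row_count, column_count) for delimited text, header included."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    columns = max((len(r) for r in rows), default=0)
    return len(rows), columns


def same_content(original, converted, orig_delim="\t", conv_delim=","):
    """Compare two delimited texts cell by cell, ignoring the delimiter."""
    a = list(csv.reader(io.StringIO(original), delimiter=orig_delim))
    b = list(csv.reader(io.StringIO(converted), delimiter=conv_delim))
    return a == b


# Invented sample data: the converted copy has lost a trailing zero.
original = "id\tscore\n1\t3.50\n2\t\n"
converted = "id,score\n1,3.5\n2,\n"

print(table_shape(original, "\t"), table_shape(converted))  # (3, 2) (3, 2)
print(same_content(original, converted))  # False
```

For real files, open them with `newline=""` and an explicit encoding and pass the file objects to `csv.reader` instead of wrapping strings in `io.StringIO`.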