Illustration by Jørgen Stamp, digitalbevaring.dk, CC BY 2.5 Denmark
It is appropriate that this first blogpost on digital scholarship should have a focus on digital preservation as it should be a key consideration from the inception of any digital project. Too often digital preservation is perceived as a final sequential step along a linear path throughout the lifecycle of a project. However, digital preservation should instead be considered as an ongoing process which should be embedded within the practices of digital scholarship throughout the course of a project from its inception through to its conclusion and indeed beyond. This process should also not be perceived as an added burden which draws attention away from the central ambitions of a project but rather as a process enabling the longevity of a project and the utilisation of the resources created.
Analogue vs Digital
Digital technologies are the key enablers of digital scholarship, and it is always worth bearing in mind that their technological sophistication can mask their underlying fragility. Their ease of replication and circulation around the globe, and how embedded they have become in our personal and professional lives, can bring added confidence about their robustness and longevity. Paper-based records do not have the same ease of replication and circulation. Replicating the informational content of a paper-based record takes much more time, and circulating it still requires it to be physically transported. Whilst paper records often cannot withstand a catastrophic event such as fire or flood, in the correct environment with appropriate conditions in place humans have demonstrated our ability to steward these objects over centuries. In these conditions, deterioration is gradual, with opportunities for intervention on these singular objects to extend their lifespan. However, with digital objects in a singular environment, such as a hard drive, this gradual deterioration tends not to happen, as digital objects often operate in a binary way. They do not simply deteriorate like physical objects but are often in only one of two possible states, readable or unreadable, so their very complexity renders them extremely fragile and vulnerable to loss in singular environments. Therefore, backing up data to multiple geographically dispersed locations is an essential practice to ensure the accessibility of digital resources.
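Keeping several copies only helps if those copies remain identical to the original. One common way to check this is to compare checksums. The following is a minimal sketch in Python, assuming the copies are reachable as ordinary file paths; the function names `sha256_of` and `compare_copies` are illustrative and not taken from any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(original: Path, copies: list[Path]) -> dict[str, bool]:
    """Map each copy's path to True if its checksum matches the original."""
    expected = sha256_of(original)
    results = {}
    for copy in copies:
        # A missing copy counts as a failed check rather than raising an error.
        results[str(copy)] = copy.is_file() and sha256_of(copy) == expected
    return results
```

Running a check like this on a schedule, and recording the results, gives early warning that a copy has become unreadable or has silently changed while other good copies still exist.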
The importance of metadata
Whilst maintaining accessibility to your resources is essential, being able to understand what a resource is can be a key enabler in utilising it. This is often not a huge issue when a researcher has a small number of files that only they access. However, with the passage of time memory fades and subsequent projects take precedence. Revisiting resources after such an interval can be akin to visiting them for the first time. Questions arise: which was the final version of a document, what data is included in a given dataset, and does a given spreadsheet contain all of the relevant data that comprised the final version? Here, metadata is essential in understanding what you are viewing and what data has been included. Metadata is simply data that describes other data. Some of it is generated automatically, such as the creation date in the properties of a file. Other metadata, such as a description of what data was included in a spreadsheet, has to be created manually. Whilst this can seem like an administrative burden, experience demonstrates that it adds value to your data when it is revisited, saving time and resources. These practices can help individual researchers manage their data, but digital scholarship often encompasses large-scale projects with multiple researchers and stakeholders using the same resources. As project teams expand and the resources generated accumulate, the need for efficient management increases. Here, the lack of agreed practices around documentation and metadata can begin to have a real impact, leading to inefficient use of time and resources. This is why data management and preservation activities should not be perceived as burdensome administrative activities but rather as an essential practice that enables the progression of scholarship. If they are considered at the beginning of a project and embedded within its practices, the efficiency and value of the data will be considerably enhanced.
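The distinction between automatically generated and manually created metadata can be made concrete with a small "sidecar" file stored next to the data. The sketch below is one possible approach in Python, not an established standard: the field names and the `.metadata.json` suffix are assumptions chosen for illustration. It records the last-modified time rather than the creation date, because creation time is not reported consistently across operating systems.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(data_file: Path, description: str, contents: str) -> Path:
    """Write a JSON metadata file next to data_file and return its path."""
    stat = data_file.stat()
    record = {
        # Gathered automatically from the file system.
        "file_name": data_file.name,
        "size_bytes": stat.st_size,
        "last_modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        # Supplied manually by someone who knows the data.
        "description": description,
        "contents": contents,
    }
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

A call such as `write_sidecar(Path("responses.csv"), "Survey responses", "All rows used in the final analysis")` would leave a `responses.csv.metadata.json` file alongside the spreadsheet, so the explanation travels with the data.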
In practical terms, embedding these practices can include creating documentation that clearly states what is expected of those creating data and provides guidance on how to meet those expectations. This can include rules around version control, file naming conventions, and folder hierarchies, as well as the creation of metadata that provides adequate context for the utilisation of the data. In the absence of defined approaches to the management of data, individuals or groups within larger teams may develop their own naming conventions or folder hierarchies, which can be idiosyncratic and dependent on the knowledge of the small number of researchers who work with the data. Where that local knowledge is not documented, the structure may be indecipherable to anyone who does not share it. With researchers leaving and joining a team over the course of a long-term project, the data could be difficult to interpret for someone unfamiliar with the imposed structure if there is no documentation to explain it. Ideally, it should be possible for someone completely unfamiliar with a given resource to understand what it is, how it was created, and what it contains.
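An agreed naming convention is easier to sustain when it can be checked automatically. The sketch below assumes a purely hypothetical convention of the form `project_description_YYYY-MM-DD_vNN.ext`, in lower case; a real project would substitute whatever rules its own documentation sets out.

```python
import re
from typing import Optional

# Hypothetical convention: project_description_YYYY-MM-DD_vNN.ext
PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<description>[a-z0-9-]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"v(?P<version>\d{2})"
    r"\.(?P<extension>[a-z0-9]+)$"
)


def parse_name(file_name: str) -> Optional[dict[str, str]]:
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = PATTERN.match(file_name)
    return match.groupdict() if match else None
```

Under this assumed convention, `parse_name("census_parish-returns_2021-03-04_v02.csv")` returns the project, description, date, version, and extension as separate fields, whereas a name such as `Final version (2).xlsx` returns `None` and can be flagged for renaming.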
File Format Obsolescence
With these practices in place, another issue that can challenge the longevity of a resource is its file format. Having adequate metadata to enhance context and discoverability, along with data being replicated through multiple back-ups, is not enough to enable us to utilise data. Utilisation depends not only on accessibility but also on usability. Replication can ensure access to multiple copies, but the scale of replication is irrelevant if the software required to open a file no longer exists or is no longer accessible. This has been the fate of multiple file formats, and digital scholarship practitioners need to be cognisant of this when making decisions about file formats. There is a range of considerations to take into account when selecting a format where long-term preservation is an ambition. These can include whether the format is open or proprietary, the availability of a published specification, and how widely adopted the format is.
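A first step towards managing format risk is simply knowing which formats a project holds. The sketch below counts files by extension and lists any extension that is not on a project's accepted list. The accepted list is an assumption that each project would define for itself, and a file extension is only a rough proxy for the actual format, so this is a starting point for review rather than a format identification tool.

```python
from collections import Counter
from pathlib import Path


def count_formats(root: Path) -> Counter:
    """Count files under root by lower-cased extension."""
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


def flag_for_review(counts: Counter, accepted: set[str]) -> list[str]:
    """Return extensions that are present but not on the accepted list."""
    return sorted(ext for ext in counts if ext not in accepted)
```

For example, `flag_for_review(count_formats(Path("project_data")), {".csv", ".txt", ".pdf"})` would list every other extension found under the hypothetical `project_data` folder, prompting a decision about whether those files should be migrated to a more sustainable format.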
Data Sharing
Whilst the outputs of projects need to be sustainable in the long term, there is also the question of for whom the data is being preserved. An aligned issue in digital scholarship is data sharing: allowing other researchers access to the evidentiary base of your scholarship. This is often a context-dependent decision at the discretion of the researchers involved, but increasingly funding bodies may also require the retention of the research data that underlies a project and mandate its availability to other researchers. There can be many benefits to this, as research data is the foundation upon which knowledge creation is built. For example, reproducing and replicating research is common practice in science disciplines but may also be relevant to other disciplines. Research data may facilitate secondary usage, the application of different methodologies, or the asking of different research questions of the same data. Of course, this needs to take account of issues such as personal or commercial sensitivity and regulatory compliance. Here again, metadata contributes significantly to the value of the data, as this contextual content allows others who utilise the data to do so with an appropriate understanding of how the data was created and the environment in which the research was conducted. This greater transparency could enable an informed critique of the research. The data may also be used in ways not anticipated by the original researchers, which can lead to innovative reuse in different contexts. Sharing also makes more efficient use of resources by avoiding multiple research teams creating the same data.
Whilst data sharing is a relatively common practice among researchers in all disciplines, the means through which it is carried out are often individualised and involve informal channels of communication with others who are known to the researchers. This sustains a sporadic and targeted data sharing environment enabled through casual networks of researchers and consequently allows some researchers access to research data but not others. The result is an inconsistent and personalised application of data sharing which excludes researchers who do not have contacts within an informal network. Digital scholarship initiatives should, where possible, consider sharing their data widely and identify repositories where this can be done. Much of the data produced in the course of digital projects will be the culmination of many individual researchers' efforts and may also have been created using public funding, so it is incumbent upon recipients of this funding to consider how they can enable the full realisation of their data assets in the long term.
Conclusion
Digital preservation encompasses a wide range of activities throughout the lifecycle of data and requires the embedding of good practice into digital scholarship initiatives throughout all stages of a project to ensure the accurate rendering of authenticated content over time, regardless of the challenges of media failure and technological change. Like digital scholarship itself, digital preservation is a collaborative endeavour which requires input from a range of stakeholders to ensure that relevant data can be preserved. This requires actions to be taken to ensure that researchers can retain and disseminate the wealth of data that they create and that this data can be utilised in the long-term through a sustainable approach to the active management and preservation of that data.