DS364_MedTermReview PDF

Summary

This document provides an overview of digital curation, outlining its need, benefits, challenges, and associated solutions. It also touches upon different roles in digital curation and its relation to emerging digital scholarship.

Full Transcript

The Need for Digital Curation

Example: collecting and measuring data (such as calories, steps, or heart rate) for enhanced health and well-being.

Data, whether personal or of any other kind, has certain characteristics that require it to be actively managed. It is at risk from many factors (challenges):
- Technology obsolescence
- Technology fragility
- Lack of understanding about what constitutes good practice
- Inadequate resources
- Uncertainties about the best organizational infrastructures to achieve effective digital curation

There are a number of solutions to the challenges of digital curation, including:
1. Developing standards and best practices: this helps to ensure that digital information is preserved and accessible for future generations.
2. Investing in research and development: this helps to develop new technologies and techniques for digital curation.
3. Building partnerships: partnerships between different organizations help to share resources and expertise in digital curation.
4. Raising awareness: awareness of the importance of digital curation helps to encourage organizations to invest in it.

What Digital Curation Is

Digital curation is the process of managing and preserving digital information over time. It involves collecting, organizing, describing, preserving, and providing access to digital information. It is a set of techniques that address the issues of data protection and risk management to ensure that the data are available and usable now and in the future. Digital curation is important because it ensures that digital information is accessible and usable for future generations.

What Digital Curation Is Not

The previous aspects of digital curation do not present the whole picture, so we are still left with the question: what is digital curation?
We can state what digital curation is not:
1. It is not digital archiving, one definition of which is “the process of backup and ongoing maintenance as opposed to strategies for long-term digital preservation”.
2. It is not digital preservation, which is defined as “all of the actions required to maintain access to digital materials beyond the limits of media failure or technological change”.
Digital curation is a more inclusive concept than either digital archiving or digital preservation.

Incentives for Digital Curation

In an environment of competing priorities and multiple demands on our time, data curation has immediate and short-term benefits for all who create, use, and manage data, in four important ways:
1. Improving access: Digital curation procedures allow continuing access to data and improve the speed of access to reliable data and the range of data that can be accessed.
2. Improving data quality: Digital curation procedures assist in improving data quality, improving the trustworthiness of data, and ensuring that data are valid as a formal record (such as use as legal evidence).
3. Encouraging data sharing and reuse: Digital curation procedures encourage and assist data sharing and use by applying common standards and by allowing data to be fully exploited through time (thus maximizing investment) by providing information about the context and provenance of the data.
4. Protecting data: Digital curation procedures preserve data and protect them against loss and obsolescence.

Digital Curators

The creators, users, and curators of data all play roles in the digital curation process. The roles range from those of curators of large data sets in scientific, library, and archive contexts, right down to those played by individuals who create and use digital information for personal use and who wish to keep some of it over time.
Creators of data include scholars, researchers, and librarians and archivists who manage digitization programs.

Curators of digital information (people who have a primary role of managing or “looking after” data) have job titles that include archivist, librarian, data librarian, annotator, and data curator. Their roles vary according to the context in which they work.

Introduction

The new data-driven scholarship has various terms associated with it, including cyberscholarship, e-science, e-research, and cyberinfrastructure. Cyberscholarship is the term used to refer to the ways in which networked computing, data, and scholars work together. The term e-science is used to refer more specifically to research in scientific fields. Cyberinfrastructure is used to refer to what is required for these new ways of working.

Cyberscholarship’s Requirements and Challenges

To implement fully the opportunities that cyberscholarship’s new ways of working allow, different kinds of systems and facilities (that is, cyberinfrastructure) are needed. These include:
i. computer networks,
ii. libraries and archives,
iii. online repositories, and much more.
New skill sets are also required.
The requirements and challenges of cyberscholarship can be considered using Arms’s useful categorization of them into content, tools and services, and expertise (Arms 2008; Nelson 2009).

Digital Curation Centre (DCC) Curation Lifecycle Model - Full Lifecycle Actions

Full Lifecycle Actions are:
1. Description and Representation Information
2. Preservation Planning
3. Community Watch and Participation
4. Curate and Preserve
These actions apply to every stage in the lifecycle. The innermost point (the bull’s eye of the diagram) is Data, indicating their centrality to the Model and emphasizing that it is data that are being curated.

Digital Curation Centre (DCC) Curation Lifecycle Model - Sequential Actions

Sequential Actions are: 1. Conceptualise 2. Create or Receive 3.
Appraise and Select 4. Ingest 5. Preservation Action 6. Store 7. Access, Use, and Reuse 8. Transform

Digital Curation Centre (DCC) Curation Lifecycle Model - Occasional Actions

Occasional Actions are:
1. Dispose
2. Reappraise
Occasional Actions may occur when specific conditions are met, but they do not apply to all data. For example, data may need to be reappraised (hence the Reappraise action), or they may be disposed of as an outcome of the appraisal process (hence Dispose).

Open Archival Information System (OAIS) Reference Model

The OAIS Reference Model defines an Open Archival Information System that provides long-term information preservation and access. This system is “an archive, consisting of an organization of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community.” The OAIS Reference Model is a widely adopted key standard for managing digital materials in a digital archiving system. The “Open” in OAIS refers to the development of the Model in open forums, not to the notion that access to an archive developed according to its criteria is unrestricted.

Open Archival Information System (OAIS) Reference Model - Functions

The key functions of an OAIS are (figure 3.2):
1. The Ingest function accepts information provided by Producers.
2. The Archival Storage function ensures that archival content remains secure and is stored appropriately.
3. The Data Management function supports access and updates information.
4. The Administration function manages day-to-day operations and coordinates other functions.
5. The Access function is the interface with the Designated Community.
6. The Preservation Planning function develops preservation strategies, undertakes technology watch, etc.
7. Common Services (not shown in figure 3.2) refers to the services that any IT system needs to function.

Open Archival Information System (OAIS) Reference Model – Actors and Objects

OAIS is based on the concept of actors and objects.

Actors (who can be humans or computer systems) can perform in the roles of Producers, Managers, or Consumers:
- Producers are individuals, organizations, or computer systems that transfer digital information to the OAIS for preservation.
- Managers develop policy, define scope, and perform other management functions.
- Consumers are the individuals, organizations, or systems that are expected to use the information preserved by the OAIS.

An important concept in the OAIS Reference Model is the OAIS Designated Community. The Designated Community is a category of Consumer. It is the primary user group of the OAIS, to whom the OAIS must supply information that is understandable by this group.

Objects in OAIS are of three kinds: the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP).

Open Archival Information System (OAIS) Reference Model – Information Packages

OAIS is based on the concept of the information package. The information package concept recognizes that a digital object consists of more than simply the digital content, in the form of a bit stream, which we want to preserve. An information package also includes information that we need in order to preserve the digital object, such as information about its attributes, or about what actions have been applied to it, and so on. An information package has three parts:
1. The digital object(s) to be preserved
2. The metadata required at that point in the system
3. Packaging information

Information Packages - Submission Information Package (SIP)

The Submission Information Package (SIP) is what arrives at the repository.
It consists of the digital object, plus any descriptive and technical metadata accompanying the digital object and/or any other information the content provider considers relevant. SIPs may also be supplied to an OAIS from another digital repository.

Information Packages - Archival Information Package (AIP)

The Archival Information Package (AIP) is produced by taking the SIP and adding to it, if required, further information about the digital object. The added information is either Preservation Description Information or Representation Information.

Preservation Description Information is needed to manage the preservation of the digital objects submitted to the OAIS. It has four components:
1. Reference Information. A unique and persistent identifier that assists in identifying and locating the Content Data Object.
2. Provenance Information. The history of the archived Content Data Object.
3. Context Information. Information about the relationship of the Content Data Object to other objects; for example, the hierarchical structure of a digital archive.
4. Fixity Information. A demonstration of authenticity, such as a hash value or checksum.

Representation Information is required to make the Content Data Object intelligible to its Designated Community. Representation Information is the technical metadata required to make the bit stream retrievable as a meaningful digital object.

Information Packages - Dissemination Information Package (DIP)

The DIP is produced when a user requests access to an object in the OAIS. The DIP consists of a copy of the Content Data Object plus any metadata and support systems necessary to retrieve and use the Content Data Object. The nature of the metadata and Representation Information supplied is determined by the assumed knowledge of the Designated Community.
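
The way an AIP is built from a SIP by adding Preservation Description Information can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the field names, the `ingest` and `fixity_ok` helpers, and the choice of SHA-256 are all hypothetical and are not prescribed by the OAIS Reference Model.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SIP:
    """Submission Information Package: content plus whatever the producer supplied."""
    content: bytes
    metadata: dict = field(default_factory=dict)


@dataclass
class AIP:
    """Archival Information Package: the SIP plus information added by the archive."""
    content: bytes
    metadata: dict
    pdi: dict             # Preservation Description Information (four components)
    representation: dict  # Representation Information (e.g. file format)


def ingest(sip: SIP, identifier: str, source: str, collection: str,
           file_format: str) -> AIP:
    """Build an AIP from a SIP (hypothetical helper, not an OAIS-defined API)."""
    pdi = {
        "reference": identifier,                   # persistent identifier
        "provenance": [f"received from {source}"], # history of the object
        "context": collection,                     # relationship to other objects
        "fixity": {                                # checksum of the bit stream
            "algorithm": "sha256",
            "value": hashlib.sha256(sip.content).hexdigest(),
        },
    }
    return AIP(
        content=sip.content,
        metadata=dict(sip.metadata),
        pdi=pdi,
        representation={"format": file_format},
    )


def fixity_ok(aip: AIP) -> bool:
    """Recompute the checksum and compare it with the stored Fixity Information."""
    return hashlib.sha256(aip.content).hexdigest() == aip.pdi["fixity"]["value"]
```

For example, `ingest(SIP(b"hello", {"title": "Greeting"}), "example:0001", "Example Lab", "demo-collection", "text/plain")` returns an AIP for which `fixity_ok` is true until the content bytes are altered.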

Defining Data

Data refers to “any information in binary digital form.” It includes digital objects and databases. Digital objects can be simple or complex. “Simple digital objects are discrete digital items; such as textual files, images or sound files, along with their related identifiers and metadata. Complex digital objects are discrete digital objects, made by combining a number of other digital objects, such as Web sites.” Databases are “structured collections of records or data stored in a computer system.” (SOURCE: Digital Curation Centre 2004–2015)

Born-Digital and Digitized Data

Born-digital materials, that is, materials created using a computer and therefore existing in a digital version, may also have an analog equivalent. The important point is the use of a computer to create the material. Digitized materials are the result of a process of digitizing analog materials. This distinction, between data that are born digital and data that are the product of a digitizing process, is not usually important for digital curation practice. Digital curation makes little distinction between born-digital and digitized data in most of the curation lifecycle actions: data, whatever their origin, still need to be appraised and selected, ingested, stored, and used and reused.

Metadata and Databases

Metadata is data too. Metadata, according to the National Science Foundation, “summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.” Metadata, too, need curation. In the DCC Curation Lifecycle Model, metadata is called Description Information, and it may also need to be curated.

Databases are specifically defined in the DCC Curation Lifecycle as “structured collections of records or data stored in a computer system” (Digital Curation Centre 2004–2015).
Data in databases are structured and controlled by database management software. Databases are widely used in contexts where their preservation is mandatory for a range of reasons, one being compliance with government regulations. Their curation poses many problems, not the least of which is their often constantly changing content.

Aims of Digital Curation

Digital curation aims to produce and manage data in ways that ensure they retain three characteristics: longevity, integrity, and accessibility.
1. Longevity refers to the availability of the data for as long as their current and future users require them. The life span of data is short unless action is taken. The length of time that data need to be maintained varies, but the minimum period of time usually exceeds the life expectancy of the access system.
2. Integrity refers to the authenticity of data: that they have not been manipulated, forged, or substituted. Because digital preservation techniques such as migration inevitably alter the data, authenticity has to be demonstrated by paying attention to such characteristics as provenance (where the data came from) and context (the circumstances surrounding the creation, receipt, storage, or use of data and their relationship to other data).
3. Accessibility requires that we can locate and use the data in the future in a way that is acceptable to their Designated Community. For example, an image (e.g., a PDF) may be acceptable for some digital objects (such as documentation), but for other objects (e.g., a database) the ability to manipulate or interrogate that object may be required by its Designated Community in the future.

The focus of the Curate and Preserve action is to achieve these three goals by applying techniques to manage digital objects.
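
Longevity and integrity both depend on being able to show that a bit stream is unchanged after it has been moved to new storage. The sketch below copies a file, compares checksums before and after, and returns a small record of the event that could be kept as metadata. The function names, the record layout, and the use of SHA-256 are assumptions made for illustration, not part of any curation standard.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> dict:
    """Copy a file to new storage and confirm the bit stream did not change.

    Returns a record documenting the copy (hypothetical layout).
    """
    before = sha256_of(source)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"bit stream changed while copying {source} to {target}")
    return {
        "event": "refresh",
        "source": str(source),
        "target": str(target),
        "sha256": after,
    }
```

Keeping the returned record alongside the copy gives later curators evidence that the copy was checked, which supports the provenance side of integrity.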

Scope of Digital Curation

Ensuring Longevity: The main digital curation practices in common use that ensure the long life of digital objects include:
- refreshing data (moving data to a newer version of the same storage medium, or to different storage media, with no changes to the bit stream)
- checking accuracy of the results of refreshing
- generating metadata that document the processes applied to refreshing data
- maintaining multiple copies of the bit stream
- keeping track of changes (especially obsolescence) in hardware, software, file formats, and standards that might have an impact on digital preservation

Ensuring Integrity: The main practices in common use that ensure the authenticity of digital objects include:
- refreshing data
- checking accuracy of the results of refreshing
- generating metadata that document the processes applied to refreshing data
- protecting data by managing them in accordance with good IT practices for data security, backups, and error checking
- maintaining multiple copies of the bit stream
- managing intellectual property and other rights

Maintaining Accessibility: The key practices in current use include:
- maintaining the ability to locate digital materials reliably by assigning persistent identifiers to them
- recording sufficient representation information for digital objects so that the bit stream is still meaningful and understandable in the future
- producing digital objects in open, well-supported standard formats
- limiting the range of preservation formats to be managed (often by normalizing data to standard formats)
- keeping track of changes (especially obsolescence) in hardware, software, file formats, and standards that might have an impact on digital preservation
- maintaining multiple copies of the bit stream

Roles of Digital Curators

Digital curation encompasses a wide range of tasks, which vary with the domain in which it is practiced.
For example, some of the practical and technical curation tasks for scientists and research groups are:
- applying open-source software and open standards to encourage interoperability among different software and hardware platforms
- creating metadata and annotations so that digital objects can be reused
- linking related research materials and making sure the links are persistent
- using persistent identifiers
- being consistent about citation formats
- deciding which digital objects should be curated over the longer term
- keeping data storage devices current
- validating and authenticating migrated data

Digital curation also requires the sharing of responsibilities. It is important to determine who the stakeholders in digital curation are, because each kind of stakeholder is likely to have different knowledge, skills, and understandings about what the digital objects are and how they are used. Several kinds of stakeholders play differing roles in the data curation process: funding bodies, discipline-based groups (e.g., scientific organizations), data creators, data users and reusers, and data curators, each with different skills, understandings, and interests.

Funding Bodies: They support data creation by providing money for research projects or digitization projects, and are concerned with making sure that the data whose creation they fund are available to a wide range of users.

Discipline Groups: Groups organized around a particular discipline or set of disciplines, particularly in the sciences and increasingly in the humanities, have a strong interest in digital curation. They provide software that supports data handling, and they establish and support data repositories.
There is a growing understanding that curation requires significant domain knowledge, and that curators with this domain knowledge are more effective than those who do not possess it. Discipline groups are therefore increasingly recognized as significant stakeholders in digital curation.

Data Creators: Scientists, scholars, and researchers, singly or as members of research teams, are all involved in some of the processes of digital curation. Data creators ensure that the data they bring into being are structured and documented to maintain their longevity and reusability. For digital objects to be usable and reusable, they must be of high quality, well-structured, and adequately documented.

Data Users and Reusers: Anyone who uses and reuses data is also involved in some of the processes of data curation. Data users and reusers ensure that any annotations they produce are captured and documented to the extent that they can be understood by other users of those data.

Data Curators: Their primary role is managing or looking after data, and they carry out a wide range of tasks. Examples include ongoing data management, intensive data description, ensuring data quality, collaborative information infrastructure work, and metadata standards work.

The Need for Description and Representation Information

Administrative metadata. Metadata related to the use, management, and encoding processes of digital objects over a period of time. Includes the subsets of technical metadata, rights management metadata, and preservation metadata.
Descriptive metadata. Metadata that describes a work for purposes of discovery and identification, such as creator, title, and subject.
Technical metadata. A form of administrative metadata dealing with the creation or storage encoding processes or formats of the resource.
Structural metadata. Metadata that indicates how compound objects are structured, provided to support use of the objects.
Preservation metadata. Administrative metadata dealing with the provenance of a resource and its archival management.

The key activities associated with the Description and Representation Information action are:
- appreciating the need for description and representation information
- being aware of where description and representation information is required
- understanding the key standards that exist for description and representation information
- developing policies for applying description and representation information

The concept of representation information is derived from the OAIS Reference Model, which categorizes the information required for preservation as Content Information, Representation Information, Preservation Description Information, and Packaging Information. Representation information in OAIS is divided into three classes: Structure Information, Semantic Information, and Other Representation Information.

Structure Information describes the formats and data structures relevant for processing and rendering digital objects. It is usually information about file formats.
Semantic Information provides additional information about the content of a digital object, particularly information that defines relationships among objects or parts of objects.
Other Representation Information is any other kind of representation information thought necessary to interpret the digital object. It could be information about relevant software, hardware, and storage media; encryption or compression algorithms; or documentation.

Metadata Schemas and Standards

Metadata standards address three main areas: structure, semantics, and syntax. The structure notes the elements that are to be applied. The semantics describe the meanings of each of the elements selected.
The syntax determines how these elements are to be written. Different kinds of metadata standards are available and are often used in combination in curation contexts. One characteristic of many of the metadata standards used in digital curation is that they are XML-based. Using standards based on XML has numerous advantages: XML is an open, well-supported, and widely adopted standard for encoding textual data, designed to be used regardless of the hardware platform.

The most common metadata standards applied to digital curation are:
1. Preservation Metadata: Implementation Strategies (PREMIS)
2. Metadata Encoding and Transmission Standard (METS)
3. Metadata Object Description Schema (MODS)
4. Metadata Authority Description Schema (MADS)
These metadata standards are all related to XML, although in different ways.

Metadata Schemas and Standards - PREMIS

PREMIS is the international standard for preservation metadata. The PREMIS Data Dictionary for Preservation Metadata defines a core set of preservation metadata elements that has wide applicability in the preservation community. It defines preservation metadata as metadata that:
- supports the viability, renderability, understandability, authenticity, and identity of digital objects in a preservation context.
- represents the information most preservation repositories need to know to preserve digital materials over the long term.
- emphasizes “implementable metadata”: rigorously defined, supported by guidelines for creation, management, and use, and oriented toward automated workflows.
- embodies technical neutrality: no assumptions made about preservation technologies, strategies, metadata storage and management, etc.

The PREMIS data model defines relationships between digital preservation “entities”:
Intellectual Entities. Not defined in PREMIS. Users are expected to apply other relevant metadata standards.
Objects.
Divided into three types: representation, file, and bit stream.
Events. Actions on an object in the preservation repository. These document provenance and track the history of the object.
Agents. People, organizations, or software programs associated with preservation events in the life of an object.
Rights. An agreement with a rights-holder that allows a repository to take actions in relation to objects in the repository.

PREMIS is used in conjunction with other applicable metadata standards where appropriate. The PREMIS schema has been endorsed for use with METS.

Metadata Schemas and Standards - METS

METS is a standard for encoding in XML the metadata describing or characterizing digital objects. It provides a means of associating all the metadata about a digital object with the object; that is, it is a “container format” specifying how different kinds of metadata can be packaged together. METS encourages interoperability by providing a standard for exchanging digital materials among institutions. A METS XML document has five major sections:
Descriptive metadata. Contains pointers to external descriptive metadata or contains descriptive metadata, or both.
Administrative metadata. Provides information about how the files were created and stored, intellectual property rights, the original source object, and the provenance of the files comprising the digital object.
File groups. Lists all files comprising all versions of the digital object.
Structural map. Outlines a hierarchical structure of the digital object, linking the parts of that structure to content files and metadata about each element.
Behavior. Used to associate executable behaviors with content in the METS object.

Introduction

Preservation planning is the ongoing process of planning data curation activities. The DCC Curation Lifecycle Model describes it as: “Plan for preservation throughout the curation lifecycle of digital material.
This would include plans for management and administration of all curation lifecycle actions”. The key activities encompassed by Preservation Planning are:
- appreciating the need for planning at all stages of curation.
- developing plans for all stages of curation.
- periodically reviewing and updating curation procedures.

Risk Management as the Context for Preservation Planning

Digital objects must be proactively managed from the point when they are created, to ensure their long-term accessibility, authenticity, and integrity. Proactive management demands meticulous planning. Planning is intrinsic to the OAIS Reference Model on which most digital archives are based. The Preservation Planning function covers the development of preservation strategies, undertaking technology watch, and other planning and policy activities.

Proactive preservation activities are aimed at minimizing risks. The active management of risks over time is essential to preserve digital objects and to ensure that they remain usable in the future. A risk management approach is increasingly common in the preservation of both digital and non-digital materials. Risk management is aimed at reducing the likelihood that compromising events will occur and limiting their impact when they do.

Broad principles of risk management have become standard practice for informing digital preservation. They include protecting data by:
- implementing a regular backup procedure;
- maintaining multiple copies of the bit streams;
- having disaster recovery contingencies in place;
- providing secure and stable media storage conditions;
- copying data to more stable media at defined intervals; and
- ensuring data security by implementing procedures for virus protection and for preventing unauthorized access.

Policy for Curation

Developing policies for all aspects of digital curation is vital for its effectiveness.
Policies provide clear, long-term direction and guidance, and are regularly reviewed and updated. They unambiguously state principles, values, and intentions. Their value lies in the clear articulation of these, so that expectations are explicit and consistent decisions can be made on the basis of the statements they contain. Having policies about curation in place helps an organization to develop a digital curation strategy and to plan coherent digital curation programs.

Other benefits of having policies in place include protecting organizations that are accused of any wrongdoing and indicating clearly to staff what is acceptable practice and what is not. Policies are implemented through procedures, which describe the process of implementing policy and work together with policies to achieve the overall goals of an organization. Good policies usually have elements in common. They state what is and what is not allowed. They indicate how the policy will be monitored and who is responsible for this. They provide links to other relevant policies and to statements about procedures. They also note the date when they are to be reviewed and how frequently.

The OAIS Reference Model suggests that policies are required for:
- Archival storage (e.g., for managing migration)
- Management (e.g., resource utilization and pricing)
- Disaster recovery
- Security (e.g., physical access control)
An essential guide to developing policy is Policy-making for Research Data in Repositories, developed by the DISC-UK Data Share Project (Green, Macdonald, and Rice 2009). Its valuable guidance covers six key areas where policies are helpful.

Costs of Curation

Planning for curation requires estimating the resources that will be required and identifying their sources. Planning can be difficult because until recently there has been very little clarity about the costs of curating data.
For image files, the compression used in the format is a factor. This makes a sizeable difference in storage costs when large numbers of files are being considered. Other factors noted were the costs of labor, hardware maintenance, software maintenance, media replacement at specified intervals, capital equipment replacement, software licenses, and electricity.

The frequency of data migrations also affects cost because migrations involve intervention by people, so that even though migration processes are increasingly automated, their labor costs are unlikely to decrease. A further complicating factor for cost is the changing storage paradigm and the shift to cloud storage provision (which may be mandated by governments).

Introduction

The Full Lifecycle Action Community Watch and Participation is the process of keeping up to date and participating in developments to improve and advance curation activities. The key activities encompassed by Community Watch and Participation are:
- keeping up to date with digital curation activities and developments in related areas
- sharing data and participating in other activities that underpin data reuse
- participating in the development of standards for digital curation
- participating in the development of tools and toolkits for digital curation

Introduction – Keeping Up to Date

The “Community Watch” part of Community Watch and Participation refers to staying fully aware of other activities in the digital curation community on an ongoing basis. Digital curation is a field about which a wealth of high-quality information is available on the Web. It is also a field that changes rapidly and has few common understandings. These factors make it very important to keep up to date. Digital curation is actively developing through research projects, which means that new knowledge and tools are emerging on an ongoing basis.
19 Collaboration: Intrinsic to Digital Curation The “Participation” part of Community Watch and Participation refers to the need for collaboration in the digital curation community. Collaboration is one of the keys to effective curation. All communities involved in curation—data creators, users, and all stakeholders—should participate in discussions about the challenges posed, and contribute helpful responses to these challenges. Active management of data for current and future use relies on effective sharing of data, which in turn relies on agreement about and adoption of standards. 25 Collaboration: Intrinsic to Digital Curation Collaboration ensures the best use of resources through sharing expertise and experience and by developing and building technical resources and solutions that can be shared. The benefits of collaboration to digital curation are: sharing expertise, costs of developing software and systems, the tools and systems of other organizations, and learning opportunities; encouraging influential stakeholders to take digital curation seriously; increased ability to influence data producers and system developers; joint research and development of standards and practices; and enhanced ability to attract resources. 26 Standards: Essential for Digital Curation Effective digital curation requires the development and implementation of standards. A fundamental aspect of digital curation is the sharing and reuse of data. This implies interoperability of systems—the ability of software and hardware to exchange and use information. For reliable and consistent interoperability, standards are essential. They are the basis upon which functional digital curation systems are built. 28 Designing Curation-Ready Data These key activities include planning for: data capture and storage in curation-friendly file formats. recording sufficient information at the time of data capture to assist with ongoing management of those data and with their use. 
scrupulous identification of files. data storage on appropriate media. identification of a safe place for the data and ensuring that an archive will take them. Another important activity in Conceptualise is the design of systems used to create and manage data to best effect for digital curation. 7 Designing Projects with Curation in Mind Standards and metadata—the standards and methodologies that will be adopted for data collection and management, and why these have been selected; Relationship to other data available in public repositories; Secondary use—further intended and/or foreseeable research uses for the completed dataset(s); Methods for data sharing—planned mechanisms for making these data available, e.g. through deposition in existing public databases or on request, including access mechanisms where appropriate; Proprietary data—any restrictions on data sharing due to the need to protect proprietary or patentable data; Time frames—timescales for public release of data; Format of the final dataset. 15 Introduction Create or Receive is the second stage of the DCC Curation Lifecycle, which involves: creating data and their associated description and representation information so that they are curation-ready, receiving data from external sources and making them curation- ready. These processes ensure that ongoing curation is feasible. The key activities encompassed by Create or Receive are: Develop, document, and apply policies about creating and receiving data. 17 Introduction The key activities encompassed by Create or Receive are: Influence data creators to create data that is curation friendly. Create data in standard data formats and file types that can be processed with open-source, well-documented programs. Collect and keep documentation about the data, formats, software, agreements about its use, and provenance. Develop and implement procedures for receiving data. 
18 Policies for Creating and Receiving Data Policies for creating and receiving digital objects are valuable for data curators because they clearly delineate the requirements and responsibilities of creators and curators. Effective curation requires that policies about data formats and quality, the deposit process, and rights and ownership are developed, clearly documented, and then applied. Policies are required to establish: who is eligible to deposit data in the archive; the data quality requirements; which metadata must be submitted; confidentiality and disclosure; 19 Policies for Creating and Receiving Data the access status of the data (whether or not there are embargos on them); rights and ownership; data file formats (which formats will be accepted for deposit, which are preferred, and whether the files will be normalized, that is, converted to another format that the archive can manage better); and any volume and size limitations on the data received by the archive. Other areas where policies about creating and receiving digital objects may be useful depend on the nature of the digital objects, the archive, and the discipline. 20 Policies for Creating and Receiving Data The topics and questions that policies should address are indicated in the following: Eligible depositors: Who is eligible to deposit data in the archive? What kind of data will be received? What procedure is in place for providing a receipt to the depositors of data? Data quality requirements: What criteria must the data meet in terms of quality? Is the coverage complete? Have they been checked for validity? Metadata: What metadata (descriptive, structural, and administrative) should be supplied by the creator? What happens if this required metadata is not supplied? What metadata will the archive supply? Confidentiality and disclosure: What requirements must data creators meet regarding confidential data? Do the data, for example, identify people? Will the archive anonymize data? 
21 Policies for Creating and Receiving Data Embargo status: Will an embargo be placed on the data that makes it unavailable for use? How long will the embargo last? What will trigger the lifting of the embargo? Rights and ownership: What rights relating to the data are retained by the creator? What rights are transferred to the archive? What limitations are placed on the way the archive can use the data? Can the archive change the data at all (e.g., during processing data for preservation)? Does the creator certify that the data do not infringe the copyright of others or certify that permission from the rights-owner has been received? Data file formats: Which formats will be accepted for deposit? Which formats are preferred? Will the formats be normalized? Will compression formats (e.g., zipped files) be accepted? Will the archive retain the original bit stream as well as the normalized files? Is there a limit to the size and number of files the archive will accept? 22 Structuring Data for Use and Reuse To keep data and to keep the ability to process data, they must be: authentic (the data are what they claim to be), accurate (they haven’t been tampered with), renderable (they can be used in the ways for which they were intended or viewed as originally intended), and in a form that best ensures their longevity. One way to achieve these aims is to use file formats that stand a good chance of being understood in the future. These file formats are likely to be in widespread use and very likely to be open. 23 Structuring Data for Use and Reuse Criteria to use to predict the ongoing viability of a file format include the following: Openness: Is there an open, publicly available specification for the format? Are its specifications in the public domain? Is it unencrypted? Portability: Is the format independent of a specific operating system, other software, or hardware? Is it independent of particular institutions, groups, or events? Is it in widespread and current use? 
Does it contain little or no built-in functionality? Quality: Is it robust, simple, thoroughly tested, and loss-free? Some kinds of file formats are preferred over others. These preferences are based on criteria such as whether the file formats are widely adopted, are de facto standards, and are nonproprietary and well documented. 24 Structuring Data for Use and Reuse - Open Formats and Open Source The key characteristics of open file formats are: They are based on freely available standards. They are developed by a community rather than by a single entity. They can be used in multiple software packages, not just one. They do not contain any intellectual property restrictions, such as patented components. Open-source software programs are increasingly being developed and used for digital curation. An example is Xena (XML Electronic Normalizing for Archives; xena.sourceforge.net), which requires the open-source OpenOffice suite of software (www.openoffice.org) to run. Initiatives such as the Open Source Initiative (www.opensource.org), SourceForge (sourceforge.net), and GitHub (https://github.com) illustrate the popularity of the open-source concept, which is a key aspect of digital curation. 27 Structuring Data for Use and Reuse - Documentation Another key requirement for digital objects to be usable over time is access to documentation about them. This documentation provides a detailed description of the digital objects. It indicates how they were digitized or created, provides information that allows the user to understand their meaning, and describes their structure and content. Any actions applied to the digital objects are also documented. The aim is to make sure that digital objects are understandable both now and in the future. Documentation of digital objects is best carried out when they are created and should continue to be an ongoing part of the curation process. 
30 Structuring Data for Management Data need to be managed to ensure that they can be processed, accessed, and reused over time. Processes such as data cleaning, storage, and maintenance should be applied. Selection of viable file formats is essential. Three characteristics of file formats that are particularly helpful for managing data are metadata support, interoperability, and viability. Metadata Support: (i) description and representation information are essential for curation, (ii) some software applications generate description and representation information automatically, which is then added to by data creators or data managers, and (iii) some file formats accommodate metadata. 32 Structuring Data for Management Interoperability: (i) Managing data over time will almost certainly require its migration from one technical environment to another, (ii) File formats that are platform independent and/or are supported by a wide range of software are easier to migrate. Viability: (i) Some file formats are “stronger” than others in the sense that they can still be accessed and used even though parts of the data may have been damaged, (ii) recommended file formats based on the characteristics usually applied are not necessarily robust file formats, (iii) file formats that have in-built mechanisms for error checking can assist data management by indicating when files have become corrupted. Data Quality: (i) The best outcomes in managing data over time and reusing data are achieved when they are of high quality, (ii) There are three critical points for data quality in the digital curation process: when data are collected, when they are prepared for analysis, and when they are verified. 33 Structuring Data for Discoverability A persistent identifier is an identifier that does not change, even if the locations of the data or digital objects change. 
Reliable identification of data is essential for providing long-term access to them and ensuring their reliability and authenticity. A persistent identifier, “a name for a resource which will remain the same regardless of where the resource is located”, provides this reliable identification. The use of persistent identifiers is not limited to web material; they are equally essential for linking and citation of primary research to data sets. 35 Introduction The key activities encompassed by Appraise and Select are: developing, documenting, and applying policies about appraisal and selection, including (a) defining the designated community (the people who will use the data and digital objects in the future), (b) identifying properties of the data and digital objects to preserve, and (c) deciding how long the data or digital objects need to be maintained; developing appraisal criteria; and determining whether to keep data by evaluating them against the appraisal criteria. Dispose is a potential outcome of the appraisal process. Reappraise refers to the same process of appraisal, carried out at a later stage in the lifecycle in response to some trigger. 7 Appraisal and Selection Policies Appraise and Select are described as the processes of developing criteria for determining what data and digital objects should be kept for the long term and then applying those criteria. A concise definition of appraisal is “the process of determining significance of any information object”. Selection is a more general term, usually applied when deciding what materials will be added to a repository, and it is typically used in the library context. Both appraisal and selection rely on criteria for determining what is considered worthy of preservation or of adding to a repository. 
9 Appraisal and Selection Policies An appraisal and selection policy for digital objects or data typically addresses five themes: future users (the designated community); the feasibility of preservation (both economic and technical feasibility); legal and intellectual property rights; whether data are mission-critical (vital to the success of a project or organization); and associated data (metadata or description and representation information). 11 Appraisal and Selection Policies Both the information professionals who curate data and the creators of that data should be involved in developing appraisal and selection criteria. Information professionals who curate data have the responsibility of developing selection policies and guidelines for appraisal. They liaise with creators and depositors so that data sets are in the best shape for preservation. They also have the role of locating sufficient resources (e.g., funding, staff, technical infrastructure) to ensure that effective appraisal is possible. 13 Appraisal and Selection Policies Data creators, for their part, should ensure that the data sets they create have sufficient metadata and documentation, and that their data are in curation-ready (usually open) formats. They should have a clear understanding of which data are vital, which are important, and which are minor. Appraisal and selection policies and specific criteria should, ideally, be developed with input from stakeholders. This is where the concept of the designated community is important. The designated community is “an identified group of potential consumers who should be able to understand a particular set of information”. 14 Reappraisal Reappraise is an Occasional Action in the Curation Lifecycle Model. It is an outcome of decisions made at the Preservation Action stage. The activities associated with Reappraise are stated in the Curation Lifecycle as “returning data which fails validation procedures for further appraisal and reselection”. 
The key tasks of reappraisal are: specifying the conditions that trigger reappraisal and then testing data sets that meet these conditions against the appraisal criteria. 15 Disposal of Data Dispose is an Occasional Action in the Curation Lifecycle Model. It is a possible outcome of decisions made at the Appraise and Select and Reappraise stages. The activities associated with Dispose are stated in the Curation Lifecycle as “Dispose of data, which has not been selected for long-term curation and preservation in accordance with documented policies, guidance or legal requirements.” The options in Dispose are transferring the data or digital objects to another archive or destroying them in a secure manner. 17 Disposal of Data The decision to dispose of data or digital objects is made by assessing them against appraisal criteria developed by archives, repositories, and data centers and used to determine whether they are relevant to their aims and, therefore, worthy of committing resources to their long-term maintenance. If the decision is made not to commit resources to the long-term curation of a data set or digital object, it can either be transferred to another archive, repository, data center, or custodian, or it can be destroyed. The decision to dispose of data or digital objects may also arise from reappraisal. 18 Disposal of Data - Transfer of Data / Destruction of Data If data or digital objects are determined not to be relevant to one archive, repository, or data center, they may be transferred to another that is interested in them. Appropriate and adequate metadata and documentation about the data set also need to be transferred, as they are essential to ensure that the data set can be curated by the receiving organization. Some data or digital objects may need to be disposed of by destroying them completely. There may be legal requirements that they must be destroyed in a secure manner. 
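The decision flow above (keep, transfer, or destroy) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed procedure: the field names `meets_criteria`, `must_destroy`, and `other_archive_interested` are hypothetical, and the ordering of the checks is one plausible reading of the text. Real appraisal criteria and legal requirements come from an archive's documented policy.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    KEEP = "keep for long-term curation"
    TRANSFER = "transfer to another custodian"
    DESTROY = "destroy in a secure manner"


@dataclass
class AppraisalResult:
    # Hypothetical fields; a real policy would define its own criteria.
    meets_criteria: bool            # relevant to the archive's aims
    must_destroy: bool              # a legal requirement mandates destruction
    other_archive_interested: bool  # another custodian is willing to take the data


def decide(result: AppraisalResult) -> Outcome:
    """Map an appraisal (or reappraisal) result to one of the three outcomes."""
    if result.meets_criteria:
        return Outcome.KEEP
    # Assumption: a legal requirement to destroy overrides a possible transfer.
    if result.must_destroy:
        return Outcome.DESTROY
    if result.other_archive_interested:
        return Outcome.TRANSFER
    return Outcome.DESTROY
```

The same function can serve reappraisal: when a trigger condition fires, the data set is evaluated again and the result is passed back through `decide`.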
19 Introduction Ingest is the fourth Sequential Action in the Digital Curation Centre (DCC) Curation Lifecycle Model. Two sets of activities are given for Ingest: “Transfer data to an archive, repository, data center or other custodian,” and “Adhere to documented guidance, policies or legal requirements”. Ingest refers to the processes of preparing data and digital objects for adding to a digital archive and then adding them to the digital archive. Ingesting materials into a managed repository environment is a prerequisite for effective curation. 6 OAIS and Ingest Processes Ingest is one of the seven key functions in the OAIS Reference Model. Relevant to the Ingest action from this Model is the concept of Information Package. An information package consists of the digital object to be preserved, the metadata (description and representation information) required at that point in the system, and the Packaging Information linking the digital object and the metadata. For the Ingest action, two kinds of Information Package, the SIP (Submission) and the AIP (Archival), are important. 8 OAIS and Ingest Processes An SIP comprises the digital object and its accompanying metadata as presented at the start of the Ingest action. An AIP is based on an SIP, to which additional information needed to manage preservation, the PDI (Preservation Description Information), is added. PDI has four main components: Reference Information: a unique and persistent identifier. Provenance Information: the history of the archived object. Context Information: the relationship to other objects, for example the hierarchical structure of a digital archive. Fixity Information: a demonstration of authenticity, such as a hash value. 9 Ingest Processes in More Detail 2. Receiving SIPs: The main steps in receiving SIPs can be grouped into four categories of actions: 1. Validation: validation actions are intended to ensure that the material received is complete and to identify characteristics of the material that are needed for curation. 2. Health check: health check actions are aimed at ensuring the quality of the data or digital objects. 3. Annotation: annotation actions are related to the description and representation information associated with the data or digital objects. 4. Transformation: depending on the archive’s policy on normalizing file formats, it may be necessary to migrate an object to a different file format as part of the ingest process. 12 Ingest Tools The costs associated with acquisition and ingest are high. Many of the procedures required for the ingest action are labor intensive. The automation of more of these procedures is essential for handling larger quantities of data. Automating the ingest process is enabled by using a range of tools that have been developed for this purpose. The COPTR tool registry shows a variety of tools available to assist with ingest actions (www.digipres.org/tools/by-stage/#ingest/). 14 Policies for Ingest As is the case with all actions in the Curation Lifecycle, Ingest is most effectively implemented where policy statements and guidance are well developed. These policies must be documented and kept up to date. Documented policies are useful for ingest procedures, as they are for all other aspects of digital curation, because they clarify responsibilities and lines of communication, promote standardization, allow risks to be managed, and address compliance issues. 16
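The step from SIP to AIP described under OAIS and Ingest Processes can be sketched in Python as follows. This is a minimal illustration, not an OAIS implementation: the class and function names (`SIP`, `AIP`, `ingest`, `fixity_ok`) are hypothetical, and SHA-256 is an assumed choice, since the text only says "a hash value".

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SIP:
    # Submission Information Package: the object plus metadata as received.
    content: bytes
    metadata: dict


@dataclass
class AIP:
    # Archival Information Package: the SIP plus the four PDI components.
    content: bytes
    metadata: dict
    reference: str    # Reference Information: unique, persistent identifier
    provenance: list  # Provenance Information: history of the object
    context: str      # Context Information: relationship to other objects
    fixity: str       # Fixity Information: hash of the content


def ingest(sip: SIP, identifier: str, collection: str) -> AIP:
    """Build an AIP from a SIP by attaching PDI (illustrative only)."""
    digest = hashlib.sha256(sip.content).hexdigest()  # assumed hash algorithm
    return AIP(
        content=sip.content,
        metadata=dict(sip.metadata),
        reference=identifier,
        provenance=["received as SIP"],
        context=collection,
        fixity=digest,
    )


def fixity_ok(aip: AIP) -> bool:
    """Recompute the hash and compare it with the stored fixity value."""
    return hashlib.sha256(aip.content).hexdigest() == aip.fixity
```

A later check such as `fixity_ok(aip)` returns False if the stored content has changed since ingest, which is the kind of check that Fixity Information makes possible.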
