Theme 5: Opportunities and Challenges of Data in Repositories
Summary
This document discusses opportunities and challenges related to using data from repositories. It examines learning outcomes, the introductory concepts, key points for working with collections as data, and examples such as Google Books and Voyant Tools. The document also tackles the crucial aspect of copyright in the digital environment.
Full Transcript
Theme 5
Opportunities and challenges when using data in repositories

Learning outcomes
After completion of this theme, students should be able to:
- discuss some of the opportunities that data in repositories present;
- discuss key points when working with collections in repositories as data;
- give examples of possibilities when working with data in repositories;
- explain what copyright is;
- discuss the copyright issues in a digital repository;
- discuss the different types of licenses in the Creative Commons family;
- explain why the Creative Commons licenses are well-suited to digital repositories;
- give an example of a workflow to deal with copyright when ingesting material;
- discuss the controversy around Google Books with regard to copyright infringement.

Introduction
Various digitisation initiatives around the world have resulted in a large volume of digital data. Consider, for example, some of the large-scale digital repositories such as Google Books, Internet Archive, HathiTrust and Project Gutenberg. There are also growing digital collections at universities, libraries and other institutions, for example the digital collections at the British Library and the Library of Congress. The availability of texts and other items in digital form, together with improvements in technology, has opened up new possibilities for researchers and users. The use of these items gives rise to opportunities as well as challenges.

In terms of opportunities, there is increasing interest in using new methods and techniques to explore digital collections. Increasingly, collections from repositories in cultural heritage institutions are being released openly and in machine-readable form (Ames, 2021). Various computational techniques are used to study digital collections, and tools are being developed to help people explore and search these collections. However, there are also challenges.
Reese and Banerjee (2008) explain that it is rewarding to reach the point where people start to use a digital repository and the work that went into building it seems to pay off. However, there are other challenges to consider once the repository is in use: “… [The challenges] just change as the organization shifts from implementing the repository to moving content into the digital repository” (Reese & Banerjee, 2008: 219). One of these challenges is rights management, and “it is the job of the digital repository administrator to ensure that the license restrictions required by the contributor are honoured, which means that organizations must take a much more active role in managing access to digital materials placed within their care” (Reese & Banerjee, 2008: 219).

In this theme we will address some of the opportunities and challenges when using data in digital repositories.

Key points to consider when working with collections as data
Users are increasingly interested in working with digital collections as data. What are some of the key points to consider when a repository wants to enable its users to work with the data in the collection? See the report by Padilla et al. (2019) for more information.
- Collections as data development requires critical engagement with the ethical implications of cultural heritage organization work.
- Collections as data development is possible at a wide range of organizations.
- Collections as data development benefits collection users and stewards.
- Challenges to collections as data development are more organizational than technical.
- Collections as data development benefits from engaging specific community needs.
- Collections as data development benefits from collaboration across multiple communities of practice.
Examples of using collections in repositories as data
In this theme we will consider three examples:
- Google Books Ngram Viewer
- Voyant Tools
- Data Foundry of the National Library of Scotland (Notebooks)

Google Books Ngram Viewer
This tool uses data from the large Google Books repository. Users can view trends in the usage of words over a period of time. Advanced options include searching by part-of-speech tags and modifiers. A user can also filter by sub-corpus (a subsection of the dataset) and by publication year. One criticism of this tool is that the metadata are not released.
Figure: Example of Google Books Ngram Viewer

Voyant Tools
This is not strictly a tool on a repository, but a platform of various tools. It allows users to import texts and then analyse them using various computational methods, for example word clouds, bubble visualisations, word frequencies and key-word-in-context.
Figure: Example of Voyant Tools

Data Foundry of the National Library of Scotland (Notebooks)
The National Library of Scotland has a large collection of digital texts: by 2022, about 22% of its 31 million items had been digitised (Ames & Havens, 2022). A Digital Scholarship Service was launched in 2019 that focused on making collections available in machine-readable form on the Data Foundry, so that researchers could interact with them as data (Ames & Havens, 2022). The next step was to enable users to explore and analyse the collections with Jupyter Notebooks, which support interactive data science (Ames & Havens, 2022). For example, the Notebook for the ‘Edinburgh Ladies’ Debating Society’ collection on the Data Foundry helps users to analyse the collection using computational methods (https://data.nls.uk/tools/jupyter-notebooks/exploring-edinburgh-ladies-debating-society/).
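To make two of the methods mentioned above more concrete (word frequencies and key-word-in-context), the following is a minimal Python sketch of the kind of cell a user might run in a notebook. It is not the code used by Voyant Tools or by the Data Foundry Notebooks; the function names and the sample sentence are invented for illustration, and the tokenisation is deliberately simple.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent words with their counts."""
    return Counter(tokenize(text)).most_common(top_n)


def keyword_in_context(text, keyword, window=3):
    """Return (left context, keyword, right context) for every occurrence."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Invented sample text (an assumption, not taken from any collection).
sample = ("The society met to debate. The debate was lively, "
          "and the society voted after the debate.")

print(word_frequencies(sample, top_n=2))
for left, word, right in keyword_in_context(sample, "society", window=2):
    print(f"{left:>10} [{word}] {right}")
```

On a real collection the same idea applies, but the text would first be loaded from the repository's machine-readable files, and a researcher would normally also remove very common words such as "the" before counting.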
Figure: Example of Jupyter Notebooks

Challenges when working with data in digital collections
Although there are many possible challenges when working with the data in digital collections, this theme will focus on copyright.

Copyright
What is the difference between 'copyright' and 'intellectual property'? The following is taken from PASA (2011): “Intellectual property is … the product of the intellect, or mind. Patents, trademarks, designs and copyright are the four forms of intellectual property (the first three are sometimes known as industrial property). Copyright … includes the right to protect one's intellectual property from unauthorised usage.”

What is copyright?
"the exclusive right in relation to work embodying intellectual content (the product of the intellect) to do or to authorise others to do certain acts in relation to that work, acts [which] represent the manner in which that work can be exploited for personal gain or for profit" (Owen Dean. 1989. Handbook of South African Copyright Law, Juta & Co Ltd., from PASA, 2011).

Copyright law
Each nation has its own copyright laws, so the specific ways in which copyright laws affect a library’s ability to create digital collections will vary from country to country. In addition, copyright laws change constantly. In most cases the location of the library, and not the place where the work was originally published, determines the jurisdiction. South Africa is a signatory to various international intellectual property agreements or conventions.

Copyright in the digital environment
Nicholson (2010: 10) explains that unlike photocopying a work, digitizing a work may involve more than simple reproduction.
It may involve “conversion to another format, often involving modification, adaption or cropping, even translation.” A digital item can be “searched, browsed, amended, and enhanced by any number of users at the same time… making information electronically accessible to a wide audience… is a form of publishing” (Nicholson, 2010: 10). “The intimate connection between access and copying has considerable significance in the context of copyright protection. One of the essential elements of copyright is the right to control reproduction. In the digital world, access is not possible without copying. By merely browsing an item on a website, transient or temporary copies are always created in the process” (Nicholson, 2010: 10).

Licenses in the digital environment
It can be very difficult for digital repository staff to manage different copyright licenses if each author has their own type of license. One publication license family, Creative Commons, gives authors/creators a variety of options for licensing, while also simplifying the management of licenses for the digital repository staff, as there is only a fixed set of licenses to cater for.

Licenses in the Creative Commons family
“A Creative Commons (CC) license is one of several public copyright licenses that enable the free distribution of an otherwise copyrighted ‘work’.” [1] The six most common licenses (CC BY, CC BY-SA, CC BY-NC, CC BY-NC-SA, CC BY-ND and CC BY-NC-ND) are combinations of the four rights listed in the table below. For the official descriptions, see: https://creativecommons.org/about/cclicenses/
For further reading, please have a look at this Wikipedia entry: https://en.wikipedia.org/wiki/Creative_Commons_license
[1] https://en.wikipedia.org/wiki/Creative_Commons_license

Right | Description
Attribution (BY) | Licensees may copy, distribute, display, perform and make derivative works and remixes based on it only if they give the author or licensor the credits (attribution) in the manner specified by these. Since version 2.0, all Creative Commons licenses require attribution to the creator and include the BY element. The letters BY are not an abbreviation, unlike the other rights.
Share-alike (SA) | Licensees may distribute derivative works only under a license identical to ("not more restrictive than") the license that governs the original work. (See also Copyleft.) Without share-alike, derivative works might be sublicensed with compatible but more restrictive license clauses, e.g. CC BY to CC BY-NC.
Non-commercial (NC) | Licensees may copy, distribute, display, perform the work and make derivative works and remixes based on it only for non-commercial purposes.
No derivative works (ND) | Licensees may copy, distribute, display and perform only verbatim copies of the work, not derivative works and remixes based on it. Since version 4.0, derivative works are allowed but must not be shared.

CC0 (CC Zero) is a special tool that allows creators to give up the copyright in their work and dedicate it to the public domain.

The Creative Commons licenses are the choice of many repositories because:
- they are well-known and widely used;
- they are not restricted to the library community;
- there are tools that embed the licenses directly into the resource.

Copyright workflow
One of the things that Reese and Banerjee (2008) mention is that it is good to work out a plan that specifies how a digital repository will deal with copyrighted material. A very good example is the workflow from DigitalCommons@CalPoly. Note: this example is for an institutional repository which deals primarily with academic articles. This means it might need to be adapted to cater for digitised works.

Tools to help with copyright for open access publication
SHERPA/RoMEO is an “online resource that aggregates and analyses publisher open access policies from around the world and provides summaries of publisher copyright and open access archiving policies on a journal-by-journal basis” (https://v2.sherpa.ac.uk/romeo/).
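As an illustration of why a fixed license family is easier for repository staff to manage, the sketch below builds the six Creative Commons licenses from the BY, SA, NC and ND rights described earlier and uses them in a very simple rights check at ingest. This is a hypothetical sketch in Python, not the DigitalCommons@CalPoly workflow or any repository's real API: the function names, the "licence" metadata field and the two outcomes are assumptions made for the example.

```python
def cc_licenses():
    """Build the six licenses: BY is always present, NC is optional,
    and a license carries at most one of SA or ND."""
    licenses = []
    for non_commercial in (False, True):
        for extra in (None, "SA", "ND"):
            parts = ["BY"]
            if non_commercial:
                parts.append("NC")
            if extra:
                parts.append(extra)
            licenses.append("CC " + "-".join(parts))
    return licenses


CC_LICENSES = cc_licenses()


def permits(licence, use):
    """Check whether a license allows 'commercial' use or
    'share_derivatives' (sharing adapted versions of the work)."""
    elements = licence.replace("CC ", "").split("-")
    if use == "commercial":
        return "NC" not in elements
    if use == "share_derivatives":
        return "ND" not in elements
    raise ValueError(f"unknown use: {use}")


def ingest_decision(item):
    """Hypothetical ingest rule: accept items with a recognised CC license
    (or CC0); send everything else to a person for rights review."""
    licence = item.get("licence")
    if licence == "CC0" or licence in CC_LICENSES:
        return "ingest"
    return "hold for rights review"


print(CC_LICENSES)
print(ingest_decision({"title": "Article A", "licence": "CC BY-NC"}))
print(ingest_decision({"title": "Scanned letter"}))
```

A real workflow would have more outcomes than these two, for example checking a publisher's policy or asking the rights holder for permission, but the principle is the same: because the set of licenses is small and known in advance, the check can be written down once and applied consistently.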
Figure: Example of publisher copyright policies from SHERPA/RoMEO

Acts and more information
The following acts are applicable:
- SA Copyright Act 98 of 1978
- National Archives and Record Service of South Africa Act 43 of 1996
- Promotion of Access to Information Act 2 of 2000
- Electronic Communications and Transactions Act 25 of 2002
Creative Commons
More information on copyright:
- Publishers’ Association of SA Copyright Issues
- DALRO (Dramatic, Artistic and Literary Rights Organization)
- Freedom of Information Legislation
For publishers’ policies:
- SHERPA RoMEO
- Copyright Management for Scholarship (SURF)

Case study: Google Books
As a case study, examine the Google Books project and the controversy surrounding it. You should be able to address the following issues in an essay:
- What is Google Books?
- How does Google Books work?
- Where does Google find books?
- What views are available and what is the difference between the views?
- Why are some books not available in full-text?
- How are the books digitised?
- Who filed lawsuits against Google and for which reasons?
- How did Google respond?
- What were the results of the court case in 2013 by Judge Chin? (Focus on the benefits society receives from Google Books.)
- What were the results of the ruling by the Supreme Court in 2016?
- What is your opinion on the ruling?
Use the following sources to find information to answer these questions:
- Wikipedia
- Google Books - About
- Howard (2017)
- Van Helden (2017)

Required reading
1. Creative Commons. https://creativecommons.org/licenses
2. Howard, J. 2017. What Happened to Google's Effort to Scan Millions of University Library Books? [Online]. Available: https://www.edsurge.com/news/2017-08-10-what-happened-to-google-s-effort-to-scan-millions-of-university-library-books [Accessed 29 September 2021].
3. Van Helden, S. 2017. How Google Book Search Got Lost. [Online]. Available: https://www.wired.com/2017/04/how-google-book-search-got-lost/ [Accessed 29 September 2021].

References
Ames, S. 2021. Transparency, Provenance and Collections as Data: The National Library of Scotland’s Data Foundry. Liber Quarterly, 31: 1-13.
Ames, S. & Havens, L. 2022. Exploring National Library of Scotland datasets with Jupyter Notebooks. International Federation of Library Associations and Institutions, 48(1): 50-56.
Nicholson, D. 2010. Copyright and Related Matters. In: Liebetrau, P. & Mitchell, J. (eds.) Managing Digital Collections: A Collaborative Initiative on the South African Framework. National Research Foundation. [Source on ClickUP - Managing digital collections - SA Framework]
Padilla, T., Allen, L., Frost, H., Potvin, S., Russey Roke, E. & Varner, S. 2019. Always Already Computational: Collections as Data: Final Report. Copyright, Fair Use, Scholarly Communication, etc. [Online]. Available: https://digitalcommons.unl.edu/scholcom/181 [Accessed 7 September 2022].
PASA (Publishers’ Association of South Africa). 2011. [Online]. Available: http://www.publishsa.co.za [Accessed 4 July 2011].
Reese, T. & Banerjee, K. 2008. Building Digital Libraries: a how-to-do-it manual. Neal-Schuman Publishers: New York.

Further reading
1. Baksik, C. 2006. Fair Use or Exploitation? The Google Book Search Controversy. portal: Libraries and the Academy, 6(4): 399-415.
2. Chandler, N. n.d. How Google Books Work. [Online]. Available: http://computer.howstuffworks.com/google-books.htm [Accessed 29 September 2021].
3. Coldewey, D. 2016. Supreme Court affirms Google Books scans of copyrighted works are fair use. [Online]. Available: https://techcrunch.com/2016/04/18/supreme-court-affirms-google-books-scans-of-copyrighted-works-are-fair-use/ [Accessed 29 September 2021].