GFQR 1026: Big Data in "X" Lecture 2 PDF

Summary

This document is a lecture from a course on Big Data in "X", likely about big data applications in the arts, humanities, and religious studies. It includes material about the humanities curriculum, big data in education and a massive digitization project. It also examines book scanning, search engine optimization, and image processing algorithms.

Full Transcript

GFQR 1026: Big Data in “X”
Lecture 2: Big Data Analytics in Art, Humanities & Religion, Language & Education

Lecture 2: Outline
– What is Humanities?
– Big Data Analytics in Humanities (Art, Religion & Language)
– Big Data Analytics in Education

What is Humanities?

Humanities are records of Man’s experience, his values, his sentiments, his ideals and his goals. They are ultimately the expression of man’s feelings and thoughts. (Zulueta, 1994)

Humanities are educational courses that aim to teach individuals about the human condition in a variety of forms, and to look at those forms with a critical and analytical eye.

Branches of the humanities: Languages, History, Philosophy and Religion, The Arts, Literature.

Languages: This branch of the humanities consists of learning how people communicate in countries where different languages are spoken. It brings a sense of culture to individuals, as they are likely to be taught the history and origins of the languages they learn.

The arts: The arts consist of theater, music, art and film. They are all mediums of self-expression, and these courses in particular encourage personal interpretation and analysis. Fine arts courses also fall into this category; however, they focus more on the historical forms of art and their origins.

Literature: Literature refers to novels, short stories, plays and so on. Individuals attempt to decipher the meaning of texts and look into symbolism and themes. Literature courses delve into social aspects that may influence texts.

Philosophy and religion: These courses study human behavior and age-old questions such as the meaning of life and the existence of God. They analyze various cultures and their religious beliefs as well as moral codes.

History: This is arguably the most fact-based course, as individuals delve into past events such as war and politics and how societies and cultures have been affected throughout the years.
Humanities focus more on interpretation and ideals than on concrete facts, unlike math and science.

Data forms: Text? Image? Structured? Unstructured?

Big Data in Arts and Humanities

Massive Digitization Project 1 – Online Database Delpher

Delpher is a website providing full-text, Dutch-language digitized historical newspapers, books, journals, and copy sheets for radio news broadcasts. The material is provided by libraries, museums, and other heritage institutions. It is freely available and includes about 2 million newspapers, 900,000 books and 12 million journal pages.

Massive Digitization Project 2 – Google Books

– Google embarked on its ambitious book digitization project in 2002.
– “… digitizing at least 25 million books from major university libraries…” (as of August 2017)
– Google vastly improved its scanning technology as the project went along.
– Open question: how to balance copyright and fair use and keep everybody (authors, publishers, scholars, librarians) satisfied.

Book Scanning in general

Three main approaches that large organizations rely on:
– Outsourcing
– Scanning in-house using commercial book scanners
– Scanning in-house using robotic scanning solutions

Outsourcing: Books are often shipped to low-cost providers in India or China to be scanned.

Scanning in-house: Uses either overhead scanners, which are time-consuming, or digital camera-based scanning machines, which are substantially faster. The camera-based method is employed by the Internet Archive as well as Google.

Once a page is scanned, the data is either entered manually or captured via optical character recognition (OCR), which is another major cost of book scanning projects.
Google Book Scanning and Strategy

Traditional scanning either uses a glass plate that completely flattens each page or chops off the book’s binding.

OCR (optical character recognition) software is able to identify the letters and numbers printed on the pages being digitized. Once scanned, those characters can be edited and searched with a computer.

To eliminate the need for glass plates and reduce the possibility of damage to the books it wants to preserve, Google patented a new book scanning process. Workers simply place the book on an open book scanner that has neither a glass plate nor any other equipment that would flatten a book. The scanners work at a rate of about 1,000 pages per hour.

Google created some seriously nifty infrared camera technology (US patent 7508978, granted in 2009) that detects the three-dimensional shape and angle of book pages when the book is placed in the scanner. This information is transmitted to the OCR software, which adjusts for the distortions and so reads the text more accurately.

Currently Digitized (as of July 2023): Big Data!
– 17,645,865 total volumes
– 8,484,768 book titles
– 469,921 serial titles
– 6,176,052,750 pages
– 791 terabytes
– 14,337 tons
– 7,048,962 volumes (~40% of total) in the public domain

Massive Digitization Project

HathiTrust Digital Library
– Began in 2008 as a collaboration of the universities of the Committee on Institutional Cooperation (now the Big Ten Academic Alliance) and the University of California system to establish a repository to archive and share their digitized collections
– Contains more than 18 million digitized items
– Of about 8 million unique items, about 95% come from Google’s scanning

[Figure: Massive Digitization Project; labels: Digitized books, University Libraries]

Rulings around Copyright?
[Figure: legal actions in 2011 & 2013 and in 2014 concerning digitized books; caption: “Serves as the collective voice of American authors”]

Rulings around Copyright?

– Mar 2011: A New York federal judge rejected a $125 million legal settlement that Google had worked out with authors and publishers over the copyright issues.
– Nov 2013: The same judge issued a ruling saying that Google’s use of the works was a “fair use” under copyright law.
– Jun 2014: The Second Circuit Court of Appeals ruling on Authors Guild versus HathiTrust (Cornell, U Michigan, U California, U Wisconsin, Indiana) was a major victory for fair use: “The creation of a full-text searchable database is a quintessentially transformative use.”

[Figure labels: Information, Transformation]

Full-text download is limited both by size and by copyright:
– In-copyright or undetermined (70%)
– “Public Domain” (30%), made up of:
  – U.S. Federal Government Documents (worldwide): 4%
  – Public Domain (worldwide): 15%
  – Public Domain (US): 10%
  – Open Access: 0.1%
  – Creative Commons: 0.01%

Does the Authors Guild Represent All Authors?
– The Authors Guild members are mainly trade-book authors.
– The books scanned by the HathiTrust are mainly scholarly books written for free access and sharing.
– The Authors Alliance (founded in 2014) is a new organization representing authors who are primarily concerned with being read.

Massive Digitization Project

HathiTrust Research Center
– Scholars can tap into the Google Books corpus and conduct computational analysis (looking for patterns in large amounts of text, for instance) without breaching copyright.
– Scholars can work inside a secured Data Capsule and measure the things they need to measure to do research.

Data Capsule: Secure virtual environments for non-consumptive text analysis, where researchers can implement their own data analysis and visualization tools.
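The idea of non-consumptive analysis can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not how the HathiTrust Research Center or its Data Capsule is actually implemented: the function names, the volume identifiers and the sample sentences are all hypothetical. The point it shows is that the analysis code sees the full text, but only aggregate counts are returned to the researcher.

```python
import re
from collections import Counter


def term_counts(text, terms):
    """Count whole-word, case-insensitive occurrences of each term in one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {term: counts[term.lower()] for term in terms}


def analyze_inside_capsule(volumes, terms):
    """Return only derived counts per volume id; the text itself is never returned."""
    return {vol_id: term_counts(text, terms) for vol_id, text in volumes.items()}


# Hypothetical stand-ins for in-copyright volumes that stay inside the secure environment.
volumes = {
    "vol-001": "She said he would come. He did not.",
    "vol-002": "They waited, and she wrote to him.",
}

# Only these aggregate numbers would leave the environment.
exported = analyze_inside_capsule(volumes, ["she", "he"])
print(exported)
```

The design choice to notice is that the return value contains numbers keyed by volume id and nothing else, so a researcher can look for patterns across many books without ever receiving the books.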
Data Capsule

A secure, virtual computer that allows what’s known as “non-consumptive” research:
– Scholars can do computational analysis of texts without downloading or reading them.
– The process respects copyright while enabling work based on copyrighted materials.

For example, Ted Underwood (a professor and LAS Centennial Scholar of English and a professor in the School of Information Sciences at the University of Illinois) can take on projects like a collaborative study on the gender balance of fiction between 1800 and 2007.

Ngram Viewer: Function offered by Google Books

Allows a user to search Google Books data for occurrences of specific words over time.

Search “Albert”: what does it imply?

What is Google doing?
– Continuing to digitize and add books
– Improving the quality of its image-processing algorithms
– Improving the effectiveness of search
– Continuously making it easy for people to find books and conduct deep research

Big Data in Education

Learning analytics (LA)
– Process of collecting, evaluating, analyzing, and reporting organizational data for decision making
– Involves the use of big data analysis for understanding and improving the performance of educational institutions in educational delivery

Benefits of Big Data Analytics in Higher Education
– Improving student retention
– Supporting informed decision making
– Increasing cost-effectiveness
– Understanding students’ learning behaviors
– Providing personalized assistance for students
– Timely feedback and intervention

[Figure: Learning Analytics System]

Big Data in Education

Improved student retention
– Harvard University: Demonstrated the potential for natural language processing to contribute to predicting student success in MOOCs (Massive Open Online Courses) and other forms of open online learning
– University of New England: Student attrition dropped from 18% to 12%. Students demonstrated an increase in their sense of belonging to the learner community and in learning motivation.

Understanding students’ learning behaviors
– The Education University of Hong Kong: Potential indicators were found for predicting student performance, such as the contribution of in-depth content in online discussion

Supported informed decision making
– University of Adelaide: Educators were provided with guidelines to design collaborative learning activities
– University of Edinburgh: Through identification of socially engaged students, the instructional team can identify suitable teaching assistants

Increased cost-effectiveness
– University of Sydney: Instant feedback and auto-grading are especially useful for instructors teaching subjects in computer science education
– Harvard University: A machine learning prediction model was effective for predicting students who would complete an online course

Providing personalized assistance to students
– University of Michigan: Customized recommendations were provided, including suggestions on study habits, assignment practice, feedback on progress and encouragement

Timely feedback and intervention
– University of Edinburgh: Instant feedback was shown to be a useful feature for students in courses on computer programming
– Northern Arizona University: Feedback was available to individual students and to university personnel, facilitating a comprehensive support network for all students

Big Data in Education?

Monitoring teachers & students with an AI system in China. Items detected: poor sitting posture, yawning, standing, and the teacher’s volume & pace.

Big Data in Education?

Safety: overcrowding on stairs; video dissemination of dangerous behavior; dangerous behavior playback; student behavior profiling (running, fighting).

Using sensors/cameras to collect data? Facial recognition? Sentiment analysis? Path analysis? Or…
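As a rough illustration of the “timely feedback and intervention” benefit listed in the Big Data in Education slides, the sketch below flags students whose activity indicators fall under simple thresholds. Everything in it is an assumption made for illustration: the indicator names, the threshold values and the sample records are hypothetical and do not come from any of the universities mentioned. It is a plain threshold rule, not the machine learning models those case studies describe.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """One student's activity summary (hypothetical indicators)."""
    student: str
    logins_per_week: float
    forum_posts: int
    assignments_submitted: int
    assignments_due: int


def risk_flags(a, min_logins=2.0, min_posts=1, min_submit_ratio=0.6):
    """Return the list of warning signs raised by one activity record."""
    flags = []
    if a.logins_per_week < min_logins:
        flags.append("low logins")
    if a.forum_posts < min_posts:
        flags.append("no forum activity")
    # Treat a course with nothing due yet as fully up to date.
    ratio = a.assignments_submitted / a.assignments_due if a.assignments_due else 1.0
    if ratio < min_submit_ratio:
        flags.append("missing assignments")
    return flags


def students_needing_support(records, min_flags=2):
    """Map each student with at least `min_flags` warning signs to those signs."""
    result = {}
    for record in records:
        flags = risk_flags(record)
        if len(flags) >= min_flags:
            result[record.student] = flags
    return result


records = [
    Activity("s01", 5.0, 3, 4, 4),
    Activity("s02", 1.0, 0, 3, 4),
    Activity("s03", 3.0, 0, 1, 4),
    Activity("s04", 1.5, 2, 4, 4),
]
flagged = students_needing_support(records)
print(flagged)
```

A rule like this is easy to explain to students and staff, which matters when the output triggers an intervention; a trained prediction model could replace `risk_flags` without changing the rest of the flow.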
Image Processing Algorithm

Free Google AI Image Analysis Tool
– A Machine Learning (ML) / Artificial Intelligence (AI) algorithm tells you what it thinks the image is relevant for.
– It is part of Google’s Cloud Vision products.

The Cloud Vision Application Programming Interface (API) is a cloud service that allows you to add image analysis features to apps and websites.

It has seven ways to classify uploaded images:
– Faces, objects, labels, web entities, text, properties, safe search

Faces: Analyzes the emotion expressed by the image.

Objects: Shows what objects are in the image, like glasses, person, etc.

Properties: The colors used in the image.

Labels: Shows details about the image that Google recognizes, like ears and mouth, but also conceptual aspects like portrait and photography.

Web Entities: Shows descriptive words that are associated with the image via the web. Shows how Google itself is interpreting what the image means from what is published online with that image.

Safe Search: Shows how the image ranks for unsafe content (e.g. Adult, Spoof, Medical, Violence, Racy)
– Uses image captions, alt text, filename or the text surrounding the image to understand the image and uses it for ranking purposes.

You may try it at: https://cloud.google.com/vision/docs/drag-and-drop

Search Engine Optimization (SEO)

Image SEO
– Used to be a huge aspect of content and site optimization, with the potential to drive tons of image search traffic
– The “View Image” button was added to Google Images in 2013. Sites saw an average decrease of 63 percent in image search traffic.
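Returning to the Cloud Vision API slides: the seven categories listed there can be requested programmatically. The sketch below only builds the JSON body for the REST method `images:annotate` (sent with POST to `https://vision.googleapis.com/v1/images:annotate`); it makes no network call, and authentication is left out. The mapping from the slide's category names to API feature types is an assumption based on the names, and `build_annotate_request` and the placeholder image bytes are hypothetical.

```python
import base64
import json

# Assumed mapping from the slide's seven categories to Cloud Vision feature types.
FEATURE_TYPES = {
    "faces": "FACE_DETECTION",
    "objects": "OBJECT_LOCALIZATION",
    "labels": "LABEL_DETECTION",
    "web entities": "WEB_DETECTION",
    "text": "TEXT_DETECTION",
    "properties": "IMAGE_PROPERTIES",
    "safe search": "SAFE_SEARCH_DETECTION",
}


def build_annotate_request(image_bytes, categories=None, max_results=10):
    """Build the request body for one image; defaults to all seven categories."""
    if categories is None:
        categories = list(FEATURE_TYPES)
    features = [
        {"type": FEATURE_TYPES[name], "maxResults": max_results}
        for name in categories
    ]
    # The API expects the image content as a base64-encoded string.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"requests": [{"image": {"content": encoded}, "features": features}]}


# Placeholder bytes stand in for a real image file read with open(path, "rb").
body = build_annotate_request(b"placeholder image bytes", ["faces", "labels"])
print(json.dumps(body, indent=2))
```

Sending this body requires a Google Cloud project with the Vision API enabled and valid credentials; the drag-and-drop demo page mentioned in the slides is the simpler way to try the same analysis by hand.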
Search Engine Optimization (SEO)

Image SEO
– The “View Image” button was removed in early 2018. Instead of viewing the image in full size instantly, users can visit the site, save or share the image, or view their saved images.
– The aim is to make most users visit the site rather than merely grab the image.

Big Data around you!

To be continued…

Lecture 2: Summary
– What is Humanities?
– Big Data Analytics in Humanities (Art, Religion & Language)
– Big Data Analytics in Education
