Big Data Foundations Ch1. PDF
Abu Dhabi University
Dr. Heba Ismail
Summary
This presentation provides an overview of big data foundations, exploring its characteristics, applications, and value chain. The various domains of big data applications including finance, healthcare, and science are discussed, along with the potential opportunities and challenges.
Full Transcript
Chapter 1 - Big Data Foundations
Dr. Heba Ismail
“The future belongs to those who rule the data”

Agenda
Understanding the Basics
  – Why Big Data?
  – Big Data Characteristics
  – Big Data explosion
Big Data Application Domains
  – The Promise of Big Data
  – Applications for Big Data analytics
Big IT Infrastructure for Big Data
  – Big Data Landscape
  – Value of Big Data Analytics
  – Big Data will transform IT infrastructure
Big data opportunities for Developing Countries
  – Financial
  – Healthcare
  – Education
  – Agriculture
  – Others
Science
Internet
Intelligent Transportation Systems
Big Data value chain
  – Storage
  – Processing
  – Analytics
  – Visualization
Big Data Challenges and Open Research Questions
Conclusion
References

Understanding the Basics

New Age of Big Data
The world has gone mobile
  – 5 billion cellphones produce daily data
Social networks have gone online
  – Twitter produces 200M tweets a day
Crowdsourcing is the reality
  – Labeling of 100,000+ data instances is doable within a week

What comes next?
Kilobyte (KB) – 10^3 bytes
Megabyte (MB) – 10^6 bytes
Gigabyte (GB) – 10^9 bytes
Terabyte (TB) – 10^12 bytes
Petabyte (PB) – 10^15 bytes
Exabyte (EB) – 10^18 bytes
Zettabyte (ZB) – 10^21 bytes
Yottabyte (YB) – 10^24 bytes

Moore’s Law to Rescue?
“data explosion is bigger than Moore's law”
Computers get faster and cheaper every year, but the amount of data that needs to be processed grows even faster.
[Figure labels: DATA, CPU]

Are we lost in a sea of Big Data?

Why are they collecting all this data?
Target Marketing
  – To send you catalogs for exactly the merchandise you typically purchase.
  – To suggest medications that precisely match your medical history.
  – To “push” television channels to your set instead of your “pulling” them in.
  – To send advertisements on those channels just for you!
Targeted Information
  – To know what you need before you even know you need it, based on past navigation!
  – To notify you of your expiring driver’s license or credit cards or last refill on a Rx, etc. Do you use TAMM Applications?
  – To give you turn-by-turn directions to a shelter in case of emergency. Have you ever received an emergency alert from AD Police?

Is this too much?!

How Can You Avoid Big Data?
Pay cash for everything!
Never go online!
Don’t use a telephone!
Don’t fill any prescriptions!
Do not use any

LIFE HAS CHANGED – BIG DATA HAS BECOME PART OF OUR LIFE
LET’S EMBRACE IT!
Can you do it?

The Model Has Changed…
The Model of Generating/Consuming Data has Changed
Old Model: a few companies generate data; all others consume data
New Model: all of us generate data, and all of us consume data

Big Data Characteristics

Big Data Characteristics (the 5 V’s)

Big Data Types
Structured data
  – Call detail records
  – Point of sale records
  – Claims data
Semi-structured data
  – Web logs
  – Sensor data
  – Email, Twitter
Unstructured data
  – Video, Audio
  – Images, Text
“A Sea of Sensors”, The Economist, Nov 4, 2010

Where do we see “Big Data”?

Big Data Application Domains
FINANCIAL SERVICES / HEALTHCARE / CONSUMER MARKETING / ONLINE WEB AND GAMING / COMMUNICATIONS / RETAIL

Applications for Big Data Analytics
Smarter Healthcare
Multi-channel sales
Finance
Log Analysis
Homeland Security
Traffic Control
Telecom
Search Quality
Manufacturing
Trading Analytics
Fraud and Risk
Retail: Churn, NBO

Big Data – Science
Remote Sensing / Participatory Sensing
  – Air, Land, Ocean
  – 100s GBs/day
Astronomy
  – Sky surveys
  – 120 GB/week, 6.5 TB/year
Genomics
  – 25K genes, 3B base pairs
  – 8B humans
  – thousands of
Drug Discovery
  – 2M of compounds
  – > 100M interactions

Big Data – Internet
Social Networks
Internet Traffic
  – Typical router: 42 bytes/second, 3.5 Gigabytes/day
Web
  – 8 Billion pages
  – 10kB/Page
  – 8 TB of indexed text
Mobile Apps

Example: Internet Search
Enormous amounts of content on the Internet
[Figure labels: 47 billion, 17 billion, 3.3 billion]
Seek relevant results in less than a second

Example: Climate Analysis
Analyze current and historical weather data
  – Sensor readings from 1000s of locations
  – Satellite/radar images
  – Geographic features

Big Data – Intelligent Transportation Systems
The future lies in integration, mining and analytics of BIG DATA
From the sky or space
From the ground
From the vehicles

Big Data – Healthcare, transforming Healthcare to a Data-Driven Industry

Big Data Lifecycle/Value Chain

Big Data Value Chain

Big Data Opportunities: Healthcare Example

Big Data – Opportunities
Big Data presents unprecedented opportunities to
  – Accelerate scientific discovery and innovation
  – Lead to new fields of inquiry that would not otherwise be possible
  – Improve decision making
  – Understand human and social processes
  – Promote economic growth
  – Improve health and quality of life

Big Data opportunities in Health
Correlational Data: geo-environmental, weather patterns, etc.
Clinical Data: EMRs, diagnostic images
Claims & Cost Data: claims, revenue cycle
Pharma & Life Science Data: clinical trials, genomics
Patient & Consumer Data: purchasing patterns, social media
Device Data

Usage in Health Sector
Data discovery, Patterns, Predictive Analysis
Research data warehouses
Pharmacology, patient management, disease prevention, decision support, future AI systems and more…

Health Data Sources

Big Data and Analytics in Healthcare
Sr. No. / Benefits
1. Improve patient care quality and program analysis
2. Provide solutions that combine clinical, financial, and operational data to address many issues beyond inpatient care
3. Ability to empower knowledge workers with self-service and “research on demand”
4. Drug discovery and development analysis
5. Shift from current care models to Accountable Care and Population Health management
6. Improve supply chain management besides health program optimization

Class Activity
Based on the explanation related to Big Data in Healthcare, identify the following with your teammate:
  – Identify possible data types and sources in the healthcare domain.
  – Characterize healthcare big data in terms of the 5 V’s.
It would be nice to add screenshots, examples, numbers, statistics, or any other supporting content.
Use FigJam, PowerPoint, or any presentation tool.

Big IT Infrastructure for Big Data
How does Big Data transform IT into a scalable, reconfigurable, and highly dynamic IT?

What’s driving Big Data
  - Optimizations and predictive analytics
  - Complex statistical analysis
  - All types of data, and many sources
  - Very large datasets
  - More of a real-time
  - Ad-hoc querying and reporting
  - Data mining techniques
  - Structured data, typical sources
  - Small to mid-size datasets

Value of Big Data Analytics
Big data is more real-time in nature than traditional Data Warehouse applications
Traditional Data Warehouse architectures (e.g. Exadata, Teradata) are not well suited for Big data apps
Shared-nothing, massively parallel processing, scale-out architectures are well suited for Big data apps

Scalable IT infrastructure
IT infrastructure will scale to cope with Big Data; new IT models are emerging, including:
  – Hardware and clustering models
  – Storage infrastructure and models
  – Data distribution and parallel processing models
  – Federated Cloud Infrastructure

Reconfigurable IT infrastructure
IT infrastructure will adapt to Big Data characteristics; new featured IT models are emerging, including:
  – Configurable architecture
  – Deployment models
  – SaaS, PaaS, IaaS, and Public Cloud, Private Cloud, and Hybrid Cloud
  – End-to-end automation
  – Workload intelligence

Dynamic IT Infrastructure
IT infrastructure will leverage Big Data; dynamic IT models are emerging, including:
  – Combined solutions
  – Real-time processing and distribution IT models
  – Real-time data collection and analytics models
  – Complex data analytics
  – Etc…

Big Data Challenges

Top 9 challenges About Big Data
[Figure: grid of nine challenges; labels recovered from the slide]
Data Acquisition: IoT, collection in time
Data Storage: performance, scalability
Data Transport & Dissemination: networking technologies
Data Management: SQL, NoSQL databases
Data Processing: MapReduce/STORM programming model
Workforce With Specialized Skills: Big Data analyst, advanced data engineer
Cost Of All Of The Above: data conversion, data processing cost reduction
Archiving: data storage for archival purposes
Security: private cloud, data protection

Challenge: Collection
Where does the data come from?
Input from humans, instruments/sensors, existing datasets, etc.
Potentially many sources
Transport data from source to repository

Data Acquisition
Data acquisition is the process of sampling signals that measure real-world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer (Wiki)
The question is how to acquire closely spaced data only when you need it, so that the data file does not become too large with unnecessary data points.

Challenge: Organization
How is the data structured?
Data needs to be labeled, sorted, etc.
Relationships may exist between pieces
Exclude inaccurate or unknown data

Challenge: Storage
How do we store large volumes of data?
Need space for 100s of Terabytes of data (modern hard drive holds 1 TB)
Data needs to be efficiently accessed by servers doing computation

Storage
Big Data storage for active repositories should meet the following requirements:
High performance (very low latency)
  – Average reads < 5ms, writes < 10ms
Seamless scalability
  – No table or throughput limits
  – Live repartitioning (no downtime)
High durability (availability)
Predictable performance

Challenge: Computation
How is the data processed to obtain desired information?
Algorithms determine actions to perform
Need computers to run the algorithms
May be constrained by time, space, etc.
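
As a rough back-of-the-envelope sketch of the storage challenge above, the following Python snippet estimates how many drives a dataset would occupy. The 1 TB drive size and the decimal byte units come from the slides; the 300 TB dataset size, the `drives_needed` helper, and the replication factor of 3 are illustrative assumptions, not figures from the deck.

```python
# Decimal (SI) units, as in the "What comes next?" slide: 1 TB = 10^12 bytes.
TB = 10**12


def drives_needed(dataset_bytes: int, drive_bytes: int = 1 * TB, replicas: int = 1) -> int:
    """Hypothetical helper: number of drives needed to hold `replicas` full copies.

    Ignores file-system overhead, parity schemes, and free-space headroom.
    """
    if dataset_bytes < 0 or drive_bytes <= 0 or replicas < 1:
        raise ValueError("sizes must be non-negative and replicas must be >= 1")
    total_bytes = dataset_bytes * replicas
    # Integer ceiling division avoids floating-point rounding on large byte counts.
    return -(-total_bytes // drive_bytes)


if __name__ == "__main__":
    dataset = 300 * TB  # assumed example of "100s of Terabytes"
    print(drives_needed(dataset))              # one copy on 1 TB drives
    print(drives_needed(dataset, replicas=3))  # assumed 3 copies for durability
```

Under these assumptions a single copy already spans 300 drives, and keeping three copies spans 900, which suggests why storage at this scale has to be distributed across many machines rather than attached to one server.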

Processing
“We have terabytes of click-stream data, what can we do with it?”
Very large data repositories
Complex data analysis
Distributed and parallel data processing

Processing
Computation patterns in parallel data processing
  – Embarrassingly parallel
  – Loosely synchronous
  – Stream processing
Suitable programming models
  – MapReduce / Bulk Synchronous Processing
  – STORM
Do we need new programming models?
Not likely; the existing ones are adequate. The challenge is to make them more efficient.

Data management
The question is how we can store, organize, and query our data.
  – High performance
  – High scalability
  – High availability

Challenge: Visualization
How is the data (or results) presented?
Seek clear, concise representation of the data
Emphasize desired information
May require many related visualizations

Security
Data confidentiality
Data integrity
Data accuracy

Workforce with specialized skills
A more specialized and highly skilled workforce is needed to help us deal with Big Data.
New job titles: Data Scientist, Data Consultant, etc.

Big Data Future Directions

Open research questions
Data Acquisition and Storage
  – Do we really need to store everything?
  – Is it possible to aggregate data at the source?
  – Is the use of smart compression algorithms, combined with analytics, a potential solution to lower the high volume of transported data?
  – Can the cloud provide Big Data storage as a service?
Data Processing and Analysis
  – Should we rewrite all the analytics algorithms and methods to support distributed and parallel data processing for Big Data (MapReduce), or should we develop new algorithms, languages, and data structures for Big Data analytics?
  – How to extract and integrate knowledge from massive, complex, or dynamic data?
  – When and where to use approximate results without harming accuracy (data quality)?
Data Transport
  – Is it better to send the data to the analytics platform or send the processing algorithms to where the data is? And what about real-time data (CRITICAL)?
  – How to distribute data for parallel processing under limited bandwidth?
  – How to avoid data transfer bottlenecks?

Open research questions (cont’d)
Data Management
  – How to store, organize, and query Big Data?
  – How to design scalable data storage that supports data mining?
  – Do we need to redesign DBMSs to adapt to the new Big Data challenges?
Security
  – How to guarantee data security in Big Data?
  – Data integrity: how to be sure that data is not compromised?
  – How to assure correctness and accuracy? (Data quality)
Data Quality
  – Do we need to focus on “Smart” Data instead of “Big” Data: quality vs. quantity?
  – Where and when to compromise data quality in Big Data?
  – What level of trust can we have in Big Data? How to measure it?

Conclusion
Big Data is a very promising research and development area.
Many application domains generate Big Data and require storage, analytics, and quality evaluation.
The explosion in data is creating challenges and prompting innovation in computer storage and processing, in terms of software, hardware, and data center architecture.
Big Data is extremely important in terms of social welfare, productivity, and competitiveness.
The success of Big Data requires cloud infrastructure, large-scale databases and distributed file systems, and advanced data analytics.
Many research challenges have not been addressed yet. Research on Big Data is in its early stage.

References
"Visual Networking Index (VNI) - VNI Forecast Highlights", Cisco, 2016. [Online]. Available: http://www.cisco.com/c/en/us/solutions/service-provider/visual-networking-index-vni/vni-forecast.html.
"A sea of sensors", The Economist, 2010. [Online]. Available: http://www.economist.com/node/17388356.
"IBM big data platform - Bringing big data to the Enterprise", Www-01.ibm.com, 2016. [Online]. Available: http://www-01.ibm.com/software/data/bigdata/.
"HubbleSite - The Telescope - Hubble Essentials - Quick Facts", Hubblesite.org, 2016. [Online]. Available: http://hubblesite.org/reference_desk/facts_.and._figures/quick_facts/quick_facts_2.shtml#data_stats.
"Hortonworks : Open and Connected Data Platforms", Hortonworks, 2016. [Online]. Available: http://hortonworks.com/.
"Welcome to Apache™ Hadoop®!", Hadoop.apache.org, 2016. [Online]. Available: https://hadoop.apache.org/.
"Amazon Web Services", Amazon, 2016. [Online]. Available: https://aws.amazon.com/.
"What Is Big Data? - Blog", Datascience.berkeley.edu, 2014. [Online]. Available: https://datascience.berkeley.edu/what-is-big-data/.
"The Evidence Is In: People Want to Collaborate with Their Doctors and Co-Produce Their Clinical… - Tincture", Medium, 2016. [Online]. Available: https://medium.com/tincture/the-evidence-is-in-people-want-to-collaborate-with-their-doctors-and-co-produce-their-clinical-8c02069ab965#.xbfm74tj5.
"Strata + Hadoop World", Conferences.oreilly.com, 2016. [Online]. Available: http://conferences.oreilly.com/strata.
"Do you know big data's top 9 challenges? -- Washington Technology", Washingtontechnology.com, 2013. [Online]. Available: https://washingtontechnology.com/articles/2013/02/28/big-data-challenges.aspx.
T. challenge, "Bluehill User Community: The data acquisition challenge", Bluehillusercommunity.blogspot. [Online]. Available: http://bluehillusercommunity.blogspot.tw/2011/07/data-ccquisition-challenge.html.
2016. [Online]. Available: http://www.ijcaonline.org/volume10/number7/pxc3872013.pdf.
Iza.org, 2016. [Online]. Available: http://www.iza.org/conference_files/eddi10/EDDI10_Presentations/EDDI10_P1_PeterWittenburg_Slides.ppt.
The overview of The information technology Industry Chain in Big Data era: http://link.springer.com/chapter/10.1007%2F978-3-642-55038-6_66