UNIVERSITY OF HAIL COLLEGE OF PUBLIC HEALTH AND HEALTH INFORMATICS DEPT. OF HEALTH INFORMATICS DATA SCIENCE IN HEALTH CARE HIIM 233 Chapter 2 Data at Scale Instructors: Dr. Muteb Alshammri MSc. Ibrahim A. Ibrahim

INTRODUCTION
▪ Various data in hospital facilities are generated daily by different sources. Data are usually stored electronically and spread across different locations. For example, electronic reports describing patients' treatment information are usually stored within the oncology department of a hospital, whereas patients' images are often stored in the radiology department on a different data platform (PACS, Picture Archiving and Communication System).
▪ In addition, different departments within the same hospital might use different infrastructures (e.g. software, data formats) to store acquired clinical data. Very often, those systems and/or data formats are not interoperable with each other. No matter what the source of clinical data is, data fragmentation represents one of the biggest issues when dealing with clinical data in general.
▪ Data fragmentation occurs when a collection of data in memory is broken up into many pieces that are not close together. The problem becomes even more pronounced when performing multicenter studies.

For example
▪ New technologies and scanners that can acquire images of a patient in less than a second have led to what has been called a 'data explosion' in medical imaging data.
▪ In general, technological developments associated with healthcare (new, powerful imaging machines) have, on one side, improved overall healthcare quality.
▪ On the other side, however, they have produced much more data than expected. Meanwhile, developments in data mining techniques have been growing much more slowly, or at least not as fast as the production of data.
▪ In fact, this data volume has been increasing so rapidly that it is beyond the capability of humans to process. This data therefore represents an almost unexplored source of potential information that can be used, for example, to develop clinical prediction models using all the information (e.g. imaging, genetic banks, and electronic reports) available in medical institutions.
▪ Some of the biggest problems associated with this unexplored data are the presence of missing values and the absence of a pre-determined structure.
▪ Missing values happen when no data value is stored for a variable in an observation.
▪ Missing data is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Statistical techniques such as data imputation (explained later in the book) can be used to replace missing values (see the sketch at the end of this section).
▪ Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
▪ A data model is an agreement between several institutions on the format and database structure used for storing data.
▪ Unstructured information is typically text-heavy, but may also contain dates, numbers, and facts, as well as audiovisual content, locations, and sensor data.
▪ If we look at clinical data, we can recognize both the presence of missing values and the absence of a predetermined structure. For these reasons, clinical data is still not ready to be mined (i.e. processed) automatically by machines (e.g. artificial intelligence).
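To make the imputation idea above concrete, here is a minimal sketch in Python using pandas and scikit-learn's SimpleImputer. The table, column names, and values are invented for illustration, and mean imputation is only one of several possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy vitals table with gaps (NaN); column names and values are illustrative.
vitals = pd.DataFrame({
    "systolic_bp": [120.0, np.nan, 135.0, 128.0],
    "heart_rate":  [72.0, 80.0, np.nan, 66.0],
})

# Mean imputation: fill each missing entry with its column's mean.
imputer = SimpleImputer(strategy="mean")
vitals_filled = pd.DataFrame(imputer.fit_transform(vitals),
                             columns=vitals.columns)
print(vitals_filled)
```

Which strategy is appropriate depends on why the values are missing, a point taken up later in the book.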
▪ Therefore, the term big (clinical) data refers not only to a large volume of data, but to a large volume of complex, unstructured, and fragmented data coming from different sources.

'BIG' CLINICAL DATA: THE FOUR 'VS'
▪ Now, why do we use the term 'big', and what makes big data 'big'?
▪ We performed a literature search and tried to summarize the most common definitions of big data.
▪ The community agrees that big data can be summarized by the four 'V' concepts: volume, variety, velocity, and veracity.

▪ Volume: The volume of data increases exponentially every day, since not only humans, but also and especially machines, are producing new information faster and faster (refer to the previous example of the 'data explosion' in medical imaging, but also to the "Internet of Things"). In the community, data on the order of a terabyte and larger is considered 'big volume'. Volume contributes to the problem that traditional storage systems, such as traditional databases, are no longer suitable for holding such a huge amount of data.

Variety: Big data comes from different sources and is stored in different formats:
(a) Different types: in the past, the major sources of clinical data were databases or spreadsheets, usually characterized by structured or, less often, semi-structured data (e.g. databases with some missing values or inconsistencies). Now data can also come in the form of free text (electronic reports) or images (patients' scans).
(b) Different sources: variety also means that data can come from different sources, and these sources do not necessarily belong to the same institution.
▪ Variety affects both data collection and storage. Two major challenges must be faced: (a) storing and retrieving this data in an efficient and cost-effective way, and (b) aligning data types from different sources, so that all the data can be mined at the same time.
▪ There is also additional complexity due to the interaction between variety and volume. In fact, unstructured data is growing much faster than structured data. One estimate says that unstructured data doubles around every 3 months.

Velocity: The production of big data (by machines or humans) is a continuous and massive flow.
(a) Data in motion and real-time big data analytics: big data is produced in 'real time' and most of the time needs to be analyzed in 'real time'. Therefore, an architecture for capturing and mining big data flows must support real-time turnaround.
(b) Lifetime of data utility: a second dimension of data velocity is how long the data will remain valuable. Understanding this additional 'temporal' dimension of velocity allows us to discard data that is no longer meaningful once newer, more detailed information has been produced. The 'data lifetime' can be long, but in some cases also short (days). For example, for a specific analysis we might only need the results from a recent lab test (most recent data), whereas for a more detailed analysis we might want to trace the same measurements from the past (longer lifetime).

Veracity: Big data, due to its complexity, might present inconsistencies, such as missing values. More generally, big data has 'noise', biases, and abnormalities.
▪ The data science community usually recognizes veracity as the biggest challenge, compared to velocity and volume. For example, if we took three measurements of blood pressure, reporting the average may be common practice even though the readings vary, but the average is not a real measured value.
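As a tiny sketch of the blood-pressure example just given (the readings are made-up numbers), the reported average is a derived value, not an actual measurement:

```python
# Three systolic blood-pressure readings in mmHg (made-up numbers).
readings = [118, 126, 131]

# Common practice: report the average as "the" measurement.
average = sum(readings) / len(readings)
print(f"reported average: {average:.1f} mmHg")   # 125.0

# Veracity issue: the reported value was never actually measured.
print(average in readings)                       # False
```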
Besides these four properties, an additional four 'Vs' have been proposed by the community: validity, volatility, viscosity, and virality.
Validity: due to the large volume and the veracity of the data, we need to make sure the data is accurate for the intended use. However, compared to smaller datasets, in the initial stage of the analysis there is no need to worry about the validity of each single data element. In fact, it is more important to see whether any relationships exist between elements within this massive data source than to ensure that all elements are valid.
Volatility: big data volatility refers to how long data must be available and how long it should be stored, since concerns may be raised about the required storage capacity.
Viscosity: viscosity measures the resistance to flow in the volume of data. This resistance can come from different data sources, friction from integration flow rates, and the processing required to turn the data into insight.
Virality: defined as the rate at which the data spreads; for example, it measures how often the data is picked up and re-used by users other than the original owner of the data.

▪ To see the main four 'Vs' in action, let us consider the case of imaging data (e.g. patients' scans) collected within a hospital institution:
1) Due to improvements in hardware (e.g. scanning machines), a large number of images are produced (and stored) within a short span of time (Volume).
2) Developments in hardware, and in the imaging healthcare sector in general, are producing machines able to generate many more images, combining different modalities at the same time. This phenomenon is growing exponentially (Velocity).
3) Different imaging modalities are combined together (Variety).
4) Although there is a unified standard for storing and transmitting medical images (DICOM - Digital Imaging and Communications in Medicine), there is no agreement on the associated metadata, such as the medical annotations of patients' scans. As a result, the metadata associated with imaging data can be of different formats, without a unique agreed data model (Veracity).

DATA LANDSCAPE
A good visualization of data scale is given by the concept of the data landscape, shown in Fig. 2.1.
We can affirm that:
❖Data collections such as clinical data registries or clinical trial data cover only a small portion of the data landscape.
❖In fact, a cancer registry usually contains several pieces of information about a large number of patients (y-axis) or a population, but the variables (or features, x-axis) collected are limited.
❖Clinical trial data usually collect more information than cancer registries, but only for a selected and limited patient population.
❖Clinical routine data covers the whole data landscape, although with gaps (missing dots in Fig. 2.1).
❖These missing dots represent 'missing' values.
❖Real-world clinical data are characterized by a large amount (around 80%) of missing values (see the sketch after this list).
❖When looking at Fig. 2.1, it is possible to identify again some of the 'Vs' associated with big data:
❖A vast volume of data is produced (large extension on the x-axis and y-axis): Velocity + Volume.
❖The data includes several pieces of information from different sources ('features'): Veracity + Variety.
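A rough sketch of how the ~80% missing-value figure could be checked on a routine-care table. The data here is randomly fabricated to mimic the sparse dots of the data landscape, so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Fabricated routine-care table (patients x features); ~80% of the cells
# are blanked out to mimic the sparse landscape of Fig. 2.1.
rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 50))
missing_mask = rng.random(values.shape) < 0.8
routine = pd.DataFrame(np.where(missing_mask, np.nan, values))

# Overall fraction of missing cells across the whole table.
print(f"missing: {routine.isna().mean().mean():.0%}")   # ~80%
```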
CONCLUSION
– Data volume has been increasing so rapidly that it is beyond the capability of humans to process. This data therefore represents an almost unexplored source of potential information.
– The term big (clinical) data refers not only to a large volume of data, but to a large volume of complex, unstructured, and fragmented data coming from different sources.
– Big clinical data are defined by the four 'Vs': volume, variety, velocity, and veracity.
– Several issues limit the sharing and exchange of big clinical data: administrative, ethical, political, and technical barriers.