The Data Explosion

38 Questions

What is the estimated daily volume of data generated by NASA's current Earth observation satellites?

1 terabyte

Approximately how many users are there on Facebook?

900 million

What is the estimated number of tweets sent daily on Twitter?

350 million

What is the estimated number of websites?

650 million

What type of data is recorded by CCTV recordings?

Non-symbolic data

What is the purpose of a Data Warehouse?

To store and analyze customer transactions

What is a consequence of the vast amounts of data being stored?

Most of the data is not examined in detail.

What is the potential of machine learning technology?

To solve the problem of the tidal wave of data.

What is the goal of knowledge discovery?

To extract implicit, previously unknown and potentially useful information from data.

What is the role of data mining in knowledge discovery?

It is a central part of the knowledge discovery process.

What is the outcome of the knowledge discovery process?

New and potentially useful knowledge.

What happens to most of the data that is stored?

It is merely stored and never examined.

What is the current state of the world in terms of data and knowledge?

Data rich but knowledge poor.

What is a potential application of knowledge discovery?

All of the above.

What is the primary goal of using labelled data in data mining?

To predict the value of a designated attribute for unseen instances

What is the term for data mining using unlabelled data?

Unsupervised learning

What is the task called when the designated attribute is categorical?

Classification

What is the term for the examples in a dataset, each comprising the values of a number of variables?

Instances

What is the goal of data mining when using unlabelled data?

To extract the most information from the data available

What is the term for the process of predicting a numerical outcome?

Regression

What is the primary goal of classification in data mining?

To predict the value of a categorical attribute

What is the term for data that has a specially designated attribute?

Labelled data

What is the goal of the analysis in the given dataset?

To predict the degree classification for other students given their grade profiles

What method involves identifying the closest examples to an unclassified instance?

Nearest Neighbour Matching

What is the purpose of a classification tree?

To generate classification rules

What type of structure is used to generate classification rules?

Decision Tree

What is the form of the dataset?

A table containing students' grades on five subjects

What is the purpose of the classification rules?

To predict the degree classification of an unseen instance

What is the result of applying the nearest neighbour matching method?

A predicted degree classification for an unseen instance

What is the relationship between the attributes in the dataset?

The attributes are used to predict the degree classification

What is the primary goal of market basket analysis?

To find relationships between product purchases

What is the purpose of stating association rules with additional information?

To indicate the reliability of the rules

What is the main difference between supervised and unsupervised learning?

The presence of labelled data

What is the purpose of clustering algorithms?

To find groups of similar items

What is an example of a clustering application?

Fault diagnosis

What is the concept of 'IF variable 1 > 85 and switch 6 = open THEN variable 23 < 47.5 and switch 8 = closed (probability = 0.8)' an example of?

Association rule

What is the term for the type of prediction where the value to be predicted is a label?

Classification

What is the term for the process of finding relationships between product purchases?

Market basket analysis

Study Notes

The Data Explosion

  • Modern computer systems are accumulating data at an almost unimaginable rate from a wide variety of sources, including point-of-sale machines, machines that log cheque clearances, bank cash withdrawals, and credit card transactions, and Earth observation satellites.
  • The volume of data is enormous, with examples including:
    • NASA Earth observation satellites generating a terabyte (10^12 bytes) of data every day.
    • The Human Genome project storing thousands of bytes for each of several billion genetic bases.
    • Data warehouses containing over a hundred million customer transactions.
    • Vast amounts of data recorded every day on automatic recording devices, such as credit card transaction files and web logs, as well as non-symbolic data such as CCTV recordings.
    • Over 650 million websites, with some extremely large sites.
    • Over 900 million Facebook users, with an estimated 3 billion postings a day, and 150 million Twitter users, sending 350 million tweets a day.

Knowledge Discovery

  • Knowledge Discovery is the non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
  • Data mining is a central part of the Knowledge Discovery process, but it is only one of its stages.
  • The Knowledge Discovery process involves:
    • Data coming in from many sources.
    • Data integration and storage in a common data store.
    • Pre-processing of data into a standard format.
    • Applying a data mining algorithm to produce rules or patterns.
    • Interpreting the output to gain new and potentially useful knowledge.
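
The stages listed above can be pictured as a chain of small functions. The sketch below is only an illustration: the function names, the two data sources, and the item-counting "algorithm" are all assumptions made for the example, standing in for whatever stores and mining methods a real system would use.

```python
from collections import Counter

def integrate(sources):
    """Merge records from several sources into one common store (a list)."""
    return [record for source in sources for record in source]

def preprocess(store):
    """Put every record into a standard format: stripped and lower-cased."""
    return [record.strip().lower() for record in store if record.strip()]

def mine(prepared):
    """Stand-in for a data mining algorithm: count how often each item occurs."""
    return Counter(prepared)

def interpret(patterns, top=1):
    """Turn the raw patterns into a readable finding."""
    return [f"{item} appears {count} times"
            for item, count in patterns.most_common(top)]

# Hypothetical data from two sources, deliberately inconsistent in format.
till_log = ["Milk", " bread ", "milk"]
web_log = ["BREAD", "cheese", "Milk "]

findings = interpret(mine(preprocess(integrate([till_log, web_log]))))
print(findings)
```

Each stage only consumes the output of the one before it, which is why pre-processing into a standard format has to happen before any mining algorithm is applied.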

Types of Data and Data Mining

  • There are two types of data: labelled and unlabelled data.
  • Labelled data is used for supervised learning, where the aim is to predict the value of a designated attribute for unseen instances.
  • Unlabelled data is used for unsupervised learning, where the aim is to extract the most information possible from the available data.
  • Data mining applications can be divided into four main types:
    • Classification: predicting a categorical value, such as classifying medical patients into high, medium, or low risk of acquiring an illness.
    • Numerical Prediction: predicting a numerical value, such as the expected sale price of a house.
    • Association: finding relationships amongst variables, such as in market basket analysis.
    • Clustering: grouping items that are similar, such as customers according to income, age, and types of policy purchased.
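
One way to see how these distinctions fit together is to ask two questions of a dataset: is there a designated attribute, and if so, is it categorical or numerical? The sketch below encodes that decision. The function name `task_type` and the tiny datasets are hypothetical, and treating "every value is a number" as the test for a numerical attribute is a simplifying assumption.

```python
def task_type(instances, target=None):
    """Name the kind of task implied by a dataset and its designated attribute."""
    if target is None:
        # No designated attribute: the data is unlabelled.
        return "unsupervised learning"
    values = [instance[target] for instance in instances]
    # bool is a subclass of int in Python, so exclude it explicitly.
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in values)
    return "numerical prediction" if numeric else "classification"

# Hypothetical labelled and unlabelled datasets.
patients = [{"age": 61, "risk": "high"}, {"age": 34, "risk": "low"}]
houses = [{"rooms": 3, "price": 210000.0}, {"rooms": 5, "price": 340000.0}]

print(task_type(patients, target="risk"))   # categorical designated attribute
print(task_type(houses, target="price"))    # numerical designated attribute
print(task_type(patients))                  # no designated attribute
```

Association and clustering both fall under the unlabelled case here, since neither relies on a designated attribute.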

Classification

  • Classification is a common application of data mining, involving predicting a categorical value.
  • Examples include:
    • Classifying medical patients into high, medium, or low risk of acquiring an illness.
    • Classifying people into those likely to vote for different political parties.
    • Classifying student projects into distinction, merit, pass, or fail.
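
The quiz mentions Nearest Neighbour Matching as one way to classify an unseen instance: find the closest labelled examples and use their classification. A minimal sketch follows, assuming numeric marks on five subjects, Euclidean distance, and a single nearest neighbour. The marks and class labels are invented for illustration and are not the dataset the quiz refers to.

```python
import math

def nearest_neighbour(training, unseen):
    """Return the class label of the training instance closest to `unseen`."""
    closest = min(training, key=lambda pair: math.dist(pair[0], unseen))
    return closest[1]

# Hypothetical labelled instances: (marks on five subjects, degree class).
training = [
    ((72, 68, 75, 70, 74), "First"),
    ((58, 61, 55, 60, 57), "Second"),
    ((45, 42, 48, 40, 44), "Third"),
]

prediction = nearest_neighbour(training, (60, 59, 57, 62, 55))
print(prediction)
```

In practice several nearest neighbours are often consulted and the majority class taken, which makes the prediction less sensitive to a single unusual example.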

Association Rules

  • Association rules involve finding relationships amongst variables, such as in market basket analysis.
  • An example of an association rule is: IF cheese AND milk THEN bread (probability = 0.7), indicating that 70% of customers who buy cheese and milk also buy bread.
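
The figure attached to a rule like this can be computed directly from transaction data: among the baskets that contain everything on the IF side, count the fraction that also contain the THEN side. The sketch below does that for a handful of invented baskets. The function name `rule_confidence` is an assumption, and the quantity it returns is what data mining texts usually call the rule's confidence.

```python
def rule_confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also
    contain the consequent. Returns None if no transaction matches."""
    antecedent, consequent = set(antecedent), set(consequent)
    matching = [t for t in transactions if antecedent <= set(t)]
    if not matching:
        return None
    hits = sum(1 for t in matching if consequent <= set(t))
    return hits / len(matching)

# Hypothetical shopping baskets.
baskets = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk", "bread", "eggs"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"cheese", "milk", "bread", "jam"},
]

confidence = rule_confidence(baskets, {"cheese", "milk"}, {"bread"})
print(confidence)
```

Four baskets contain both cheese and milk, and three of those also contain bread, so the rule IF cheese AND milk THEN bread holds with probability 3/4 on this data.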

Clustering

  • Clustering algorithms examine data to find groups of items that are similar.
  • Examples include:
    • Grouping customers according to income, age, and types of policy purchased.
    • Grouping electrical faults according to the values of certain key variables.
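
The notes do not name a particular clustering algorithm. As one illustration, the sketch below uses k-means, a common choice: each item is assigned to its nearest centre, then each centre moves to the mean of its items, and the two steps repeat until nothing changes. The customer data, the choice of two clusters, and the starting centres are all assumptions made for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster's centre
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return clusters, centroids

# Hypothetical customers as (age, income in thousands).
customers = [(25, 20), (27, 22), (30, 25), (55, 60), (58, 65), (60, 62)]
clusters, centroids = k_means(customers, centroids=[(25, 20), (60, 62)])
print(clusters)
```

No labels are supplied anywhere: the two groups emerge only from how close the customers are to one another, which is what makes this unsupervised.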

Explore the rapid accumulation of data in modern computer systems from various sources, including point-of-sale machines, bank transactions, and Earth observation satellites.
