Questions and Answers
Which of the following factors has contributed to the enormous growth in data within commercial and scientific databases?
- Decreased use of data collection technologies.
- Limited access to data storage facilities.
- Advances in data generation and collection technologies. (correct)
- Reduced emphasis on data analysis.
Gathered data always immediately demonstrates value for its original purpose.
- False (correct)
Which of the following is an example of how data mining is used from a commercial viewpoint?
- Analyzing sky survey data from telescopes.
- Archiving earth science data from NASA.
- Conducting high-throughput biological data analysis.
- Managing customer relationships to provide customized services. (correct)
NASA EOSDIS archived more than 24 ______ of earth science data in 2018.
In the context of data mining for scientific purposes, what is one way data mining assists scientists?
Data mining is only applicable to commercial databases and cannot be used for scientific databases.
Name two examples of where data is collected from individuals, as listed in the provided content.
According to the content, many sectors in the US have more data stored per company than which of these options?
Match each concept with its purpose in the context of data analysis:
What is the primary aim of applying data mining techniques?
Data mining is only about explaining the past and has no predictive capabilities for the future.
Which fields intersect to form the foundation of data mining?
Name three characteristics that make traditional data analysis techniques unsuitable for modern data sets.
Which process involves transforming or consolidating data into suitable forms for data mining?
Data cleaning is the process of identifying interesting patterns representing knowledge based on specific measures.
Which of the following best describes 'data selection' in the context of data mining steps?
The data mining step where visualization techniques present mined knowledge to the user is known as ______.
Match each step of data mining with its description:
Looking up a phone number in a phone directory is an example of what?
Grouping together similar documents returned by a search engine according to their context is considered a data query.
What are the two main categories of data mining tasks?
Which of the following describes descriptive mining tasks?
Using some variables to predict unknown or future values of other variables is characteristic of ______ methods.
Match each data mining task with its category:
Approximating the value of a numeric target variable using other variables is an example of:
Data mining techniques are used to limit the number of patterns found in data mining tasks.
What is the data mining technique that groups similar records, observations, or cases into classes?
What is the primary difference between clustering and classification?
The technique of seeking to uncover rules for quantifying the relationship between two or more attributes is known as ______.
What is another name for 'affinity analysis'?
Anomaly detection is the technique of grouping similar data points together based on common characteristics.
Detecting fraudulent usage of credit cards is a common application of which data mining technique?
Match the data mining tasks to their descriptions:
Which of the following activities would NOT be considered a data mining task?
Predicting the outcome of tossing a fair pair of dice is considered a data mining task.
According to Bill Gates, a breakthrough in which field would be extremely valuable?
Broadly, what is Machine Learning?
Machine learning is a branch of ______ intelligence.
What is the key difference between traditional programming and machine learning?
Match the type of data with its description in machine learning
In supervised machine learning, what is used to predict future events?
In Supervised learning, a training dataset is the output and the learning algorithm produces an inferred function as the input.
Which type of machine learning involves an algorithm that interacts with its environment, learning from errors and rewards?
In reinforcement learning, the agent learns through interactions with its environment; what is required for the agent to learn which action is best?
In unsupervised learning, the algorithm identifies patterns in data to spot ______ that split the data into categories.
Which of these machine learning methods combines labelled and unlabelled data for training?
Flashcards
What is Data Mining?
Discovering interesting, non-trivial, and useful patterns from large datasets.
What are Data Mining Steps?
Cleaning, integrating, selecting, transforming, mining, evaluating, and presenting data.
What are examples of Data Mining?
Seeking prevalent names in provinces or grouping search results by context.
What are Descriptive Mining Tasks?
What are Predictive Mining Tasks?
What are Association Rules?
What is Clustering?
What is Anomaly Detection?
What is Machine Learning?
What is Labelled Data?
What is Supervised Learning?
What is Reinforcement Learning?
What is Unsupervised Learning?
What is Semi-Supervised Learning?
Study Notes
Why Data Mining?
- Data in commercial and scientific databases has grown enormously because of advances in data generation and collection technologies
- The new mantra is to gather whatever data is possible, whenever and wherever possible, because data has value
- Data volumes have progressed from terabytes to petabytes
Why Data Mining? Commercial Viewpoint
- A lot of data is being collected and warehoused, especially web data
- Yahoo has petabytes of web data
- Facebook has billions of active users
- Amazon handles millions of visits per day
- Data from purchases at department/grocery stores (loyalty cards) and from bank and credit card transactions is available
- Computers have become cheaper and more powerful, and competitive pressure is strong
- To create a competitive edge, businesses provide better, customized services through Customer Relationship Management (CRM) to increase profit
Why Data Mining? Scientific Viewpoint
- Data can be collected and stored at enormous speeds from:
- Remote sensors on a satellite
- NASA EOSDIS archived >24 petabytes of earth science data during 2018
- Telescopes scanning the skies (sky survey data)
- High-throughput biological data
- Scientific simulations generating terabytes of data in a few
- Data mining helps scientists in automated analysis of massive datasets and hypothesis formation
Data mining at UKZN
- Moodle usage, student card swipes (RFID), and LAN logins
- Individuals can be tracked with smart/fitness watches and smartphones
McKinsey Global Institute report on Big Data
- Big data is driving innovation, competition, and productivity
- $600 is the cost to buy a disk drive that can store all of the world's music
- 5 billion mobile phones were in use in 2010
- 30 billion pieces of content are shared on Facebook every month
- 40% projected growth in global data generated per year
- 5% growth in global IT spending
- 235 terabytes of data collected by US Library of Congress in April 2011
- 15 out of 17 sectors in the United States have more data stored per company than the US Library of Congress
- Potential value of big data:
- $300 billion potential annual value to US health care.
- €250 billion potential annual value to Europe's public sector administration
- $600 billion potential annual consumer surplus from using personal location data globally
- 60% potential increase in retailers' operating margins
- 140,000–190,000 more deep analytical talent positions and 1.5 million more data-savvy managers needed to take full advantage of big data in the United States
Solving Problems using Data
- Improving health care and reducing costs
- Predicting the impact of climate change
- Finding alternative/green energy sources
- Reducing hunger and poverty by increasing agriculture production
What is Data Mining?
- Data mining is also known as knowledge discovery from data (KDD)
- Extraction of interesting (non-trivial, previously unknown and potentially useful) patterns, structure, or knowledge from huge amounts of data
- Exploration and analysis of data by automatic or semi-automatic means, to discover meaningful patterns/structure/knowledge
- The structure found may take many forms, including a set of rules, a graph or network, a tree, one or several equations, and more
- Data mining is explaining the past and predicting the future by means of data analysis
Data Mining Explained
- Data mining combines statistics, machine learning and database systems
- Traditional techniques are unsuitable for data that is large-scale, high-dimensional, heterogeneous, complex, and distributed
- Data mining is a key component of the emerging field of data science and data-driven discovery
Data Mining Steps
- Data cleaning (to remove noise and inconsistent data)
- Data integration (where multiple data sources may be combined)
- Data selection (where data relevant to the analysis task are retrieved from the database)
- Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations)
- Data mining (an essential process where intelligent methods are applied to extract data patterns)
- Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interesting measures)
- Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
- These steps form part of any data analytics cycle but are specific to data mining
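The steps listed above can be sketched as a chain of small functions. This is only an illustration: the sales records, the function names, and the "above average spend" rule standing in for the mining step are all assumptions made for the example, not part of the notes.

```python
from statistics import mean

# Hypothetical raw records from two sources (store tills and an online shop).
store_sales = [
    {"customer": "A", "amount": 120.0},
    {"customer": "B", "amount": None},   # missing value (noise)
    {"customer": "A", "amount": 80.0},
]
online_sales = [
    {"customer": "B", "amount": 40.0},
    {"customer": "C", "amount": 300.0},
    {"customer": "C", "amount": -5.0},   # inconsistent value
]

def clean(records):
    """Data cleaning: drop records with missing or impossible amounts."""
    return [r for r in records if r["amount"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, customers):
    """Data selection: keep only records relevant to the analysis task."""
    return [r for r in records if r["customer"] in customers]

def transform(records):
    """Data transformation: aggregate to total spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def mine(totals):
    """Data mining (toy stand-in): find customers who spend above the average."""
    average = mean(totals.values())
    return {c: t for c, t in totals.items() if t > average}

def evaluate(patterns, min_total):
    """Pattern evaluation: keep patterns that pass an interestingness threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}

def present(patterns):
    """Knowledge presentation: render the result for the user."""
    return [f"{c}: total spend {t:.2f}" for c, t in sorted(patterns.items())]

combined = integrate(clean(store_sales), clean(online_sales))
totals = transform(select(combined, {"A", "B", "C"}))
report = present(evaluate(mine(totals), min_total=250.0))
```

In a real project each step would be far more involved; the point of the sketch is only the order in which the data flows through the steps.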
What Data Mining is/is not
- Not data mining:
- Looking up a phone number in a phone directory
- Querying a Web search engine for information about "Amazon"
- These are considered data queries; a data query is simply a request for data or information from a database table or combination of tables
- Data mining:
- Discovering that certain names are more prevalent in certain provinces (e.g., Dlamini, Zuma... in KZN)
- Grouping together similar documents returned by a search engine according to their context
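The difference between a data query and data mining can be illustrated with a tiny, made-up directory. The records, phone numbers, and function names below are hypothetical; the second function is a very simple stand-in for pattern discovery, not a full mining algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical directory entries: (surname, province, phone number).
directory = [
    ("Dlamini", "KZN", "031-000-0001"),
    ("Zuma", "KZN", "031-000-0002"),
    ("Dlamini", "KZN", "031-000-0003"),
    ("Botha", "WC", "021-000-0004"),
    ("Dlamini", "GP", "011-000-0005"),
    ("Botha", "WC", "021-000-0006"),
]

def lookup(surname, province):
    """Data query: return exactly the stored numbers that match the request."""
    return [phone for s, p, phone in directory if s == surname and p == province]

def most_prevalent_surnames(records):
    """Pattern discovery: the most common surname in each province."""
    counts = defaultdict(Counter)
    for surname, province, _ in records:
        counts[province][surname] += 1
    return {province: c.most_common(1)[0][0] for province, c in counts.items()}

numbers = lookup("Zuma", "KZN")
prevalent = most_prevalent_surnames(directory)
```

The lookup only returns what was stored, while the second function summarizes a regularity that no single record states.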
Data Mining Tasks
- Data mining tasks fall into two categories:
- Descriptive mining tasks characterize the general properties of the data
- Predictive mining tasks perform inference on the data in order to make predictions
- Data mining functionalities (techniques) are used for each task to specify the kinds of patterns or structure to be found
Data Mining Methods
- Prediction methods use some variables to predict unknown or future values of other variables, and can also involve estimation
- In estimation, the value of a numeric target variable is approximated using other variables
- Examples of data mining techniques:
- Clustering: grouping records, observations, or cases into classes of similar objects
- Association: finding which attributes "go together"
- Anomaly detection: for example, detecting unusually large amounts for a given account number in comparison to the regular charges incurred by the same account
- Predictive modeling: building a model to predict an outcome for some target variable
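The anomaly detection example above (an unusually large amount compared with the regular charges on the same account) can be sketched as follows. The scoring rule, a distance from the account's median measured in median absolute deviations, is one simple choice assumed for illustration; the account numbers, amounts, and threshold are hypothetical.

```python
from collections import defaultdict
from statistics import median

def unusual_charges(transactions, threshold=5.0):
    """Flag charges that are far above what is typical for the same account."""
    by_account = defaultdict(list)
    for account, amount in transactions:
        by_account[account].append(amount)

    flagged = []
    for account, amounts in by_account.items():
        centre = median(amounts)
        # Median absolute deviation: a spread measure that outliers barely affect.
        spread = median(abs(a - centre) for a in amounts)
        if spread == 0:
            continue  # all charges (nearly) identical; no basis for scoring
        for amount in amounts:
            if (amount - centre) / spread > threshold:
                flagged.append((account, amount))
    return flagged

# Hypothetical transactions: (account number, amount charged).
transactions = [
    ("1001", 20), ("1001", 25), ("1001", 22), ("1001", 19), ("1001", 24), ("1001", 500),
    ("2002", 480), ("2002", 510), ("2002", 495), ("2002", 505), ("2002", 490),
]
flagged = unusual_charges(transactions)
```

A charge of 500 is flagged on account 1001 but similar amounts are normal on account 2002, which is why the comparison is made per account.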
Examples to Consider
- The main data mining techniques:
- Clustering
- Association rules
- Predictive modeling
- Anomaly detection
- Common data mining techniques:
- Predictive modeling is the process by which a model is created to predict an outcome for some target variable
- If the outcome is categorical, it is called classification
- If the outcome is numerical, it is called regression
- Clustering refers to the grouping of records, observations, or cases into classes of similar objects
- A cluster is a collection of records that are similar to one another, and dissimilar to records in other clusters
- Clustering differs from classification in that there is no target variable for clustering
- Association techniques in data mining are used for finding which attributes "go together"
- The technique of association seeks to uncover rules for quantifying the relationship between two or more attributes; in business it is known as affinity analysis or market basket analysis
- Anomaly detection, also referred to as outlier detection or outlier analysis, finds data objects that differ markedly from the rest of the data
- These data objects are called outliers
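Two measures commonly used to quantify an association rule are support and confidence. The notes do not name them, so treat this as an assumed, minimal illustration of market basket analysis on hypothetical shopping baskets.

```python
# Hypothetical shopping baskets, one set of items per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in the itemset."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)

bread_and_butter = support({"bread", "butter"}, baskets)
butter_then_bread = confidence({"butter"}, {"bread"}, baskets)
bread_then_butter = confidence({"bread"}, {"butter"}, baskets)
```

In this toy data every basket with butter also has bread, but only three of the four baskets with bread have butter, so the rule is stronger in one direction than the other.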
Data Mining Examples
- Determine whether or not each of the following activities is a data mining task:
- Dividing the customers of a company according to their gender is a simple database query
- Dividing the customers of a company according to their profitability is an accounting calculation; predicting the profitability of a new customer would be data mining
- Computing the total sales of a company is simple accounting
- Sorting a student database based on student identification numbers is a simple database query
- Predicting the outcomes of tossing a (fair) pair of dice is a probability calculation
- Predicting the future stock price of a company using historical records is predictive modeling
- Monitoring the heart rate of a patient for abnormalities is anomaly detection, and could even be treated as a classification problem
- Monitoring seismic waves for earthquake activities is a classification problem
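To see why the dice example is a probability calculation rather than data mining, note that the answer follows from enumerating the possible outcomes; no collected data is needed. A short sketch, with a function name chosen only for this example:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def two_dice_distribution():
    """Exact probability of each total when two fair six-sided dice are tossed."""
    counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}

distribution = two_dice_distribution()
most_likely_total = max(distribution, key=distribution.get)
```

Predicting a stock price, by contrast, has no such closed model to enumerate, so the pattern has to be learned from historical records.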
Machine Learning
- Machine learning (ML) is a method of data analysis that automates analytical model building
- It is a branch of artificial intelligence (AI) in which systems learn from data, identify patterns and make decisions with minimal human intervention, without being explicitly programmed
- Machine learning focuses on developing computer programs that can access data and use it to learn: traditional programming runs a program on data to produce output, whereas machine learning produces a program from data and output
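The idea of producing a program from data and output can be made concrete with a small sketch. The temperature conversion task and the least-squares line fit are assumptions picked for illustration: in the first function a person writes the rule, in the second the rule is recovered from example inputs and outputs.

```python
def traditional_program(celsius):
    """Traditional programming: a person writes the rule explicitly."""
    return celsius * 9 / 5 + 32

def learn_program(inputs, outputs):
    """Machine learning (toy version): fit a straight line to example
    input/output pairs by least squares and return it as a function."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Hypothetical observations: Celsius readings and the matching Fahrenheit values.
celsius_readings = [0, 10, 20, 30]
fahrenheit_readings = [32, 50, 68, 86]

learned_program = learn_program(celsius_readings, fahrenheit_readings)
prediction = learned_program(100)
```

The learned function was never told the conversion formula; it inferred the slope and intercept from the examples, and it agrees with the hand-written rule on new inputs.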
Machine Learning - The Process
- Process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in the data
- One of the more famous examples is a program that learned to recognize cats by being fed cat pictures
- Examples of where machine learning is currently used:
- Image classification
- Document classification
- Spam detection
- Fraud detection
- Digital advertising with AI to extract data
- Self-driving cars
Supervised Machine Learning Algorithms
- Apply past knowledge to new data using labelled examples to predict future events
- "Labelled data" is a designation for pieces of data that have been tagged with one or more labels identifying certain properties or characteristics, or classifications or contained objects
- Starting from a training dataset, a learning algorithm produces an inferred function that makes predictions about the output values; the algorithm is trained on examples with known correct answers
- Show an algorithm labelled examples until it gets the idea and can recognize the same thing in other pictures
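A minimal sketch of supervised learning, assuming a 1-nearest-neighbour rule (one simple technique among many, not one the notes prescribe). The measurements and labels are hypothetical.

```python
from math import dist

# Hypothetical labelled training data: (height in cm, weight in kg) with a label.
training_data = [
    ((25.0, 4.0), "cat"),
    ((23.0, 3.5), "cat"),
    ((60.0, 30.0), "dog"),
    ((55.0, 25.0), "dog"),
]

def predict(features, training_data):
    """Return the label of the training example closest to the new features."""
    _, label = min(training_data, key=lambda example: dist(example[0], features))
    return label

small_animal = predict((24.0, 4.2), training_data)
large_animal = predict((58.0, 28.0), training_data)
```

The labels in the training data are what make this supervised: the algorithm only generalizes from answers it was shown.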
Reinforcement Learning Algorithms
- A learning method that interacts with its environment by producing actions and discovering errors or rewards
- Allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize performance
- Simple reward feedback, known as the reinforcement signal, is required for the agent to learn which action is best
- Game example: by relating which buttons you press to what happens on screen and to your score, performance improves
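The game example can be sketched with an epsilon-greedy agent, one simple strategy assumed here for illustration. The button names and scores are hypothetical, and the rewards are fixed per button to keep the example small; real environments are usually noisy.

```python
import random

def learn_best_button(rewards, steps=200, epsilon=0.1, seed=0):
    """Learn which action scores best purely from the rewards received.

    `rewards` maps each action to the score the (hypothetical) game returns.
    """
    rng = random.Random(seed)
    actions = list(rewards)
    estimates = {action: 0.0 for action in actions}
    tries = {action: 0 for action in actions}

    for step in range(steps):
        if step < len(actions):
            action = actions[step]                    # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(actions)              # explore occasionally
        else:
            action = max(actions, key=estimates.get)  # exploit the best so far
        reward = rewards[action]                      # the reinforcement signal
        tries[action] += 1
        # Running average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / tries[action]

    return max(actions, key=estimates.get), estimates

game_scores = {"left": 1.0, "right": 5.0, "jump": 2.0}
best_button, learned_estimates = learn_best_button(game_scores)
```

The agent is never told which button is best; it only sees the score that follows each press and adjusts its estimates accordingly.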
Unsupervised Learning Algorithms
- No labelled data is involved and no feedback is delivered, so there is no reinforcement signal either
- The algorithm identifies patterns in data to spot similarities that split the data into categories
- Studies how systems can infer a function to describe a hidden structure from unlabelled data; the system explores the data and draws inferences from datasets to describe hidden structures
- Examples: Airbnb clustering houses available to rent by neighbourhood, and Google News grouping together stories on similar topics each day
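As a rough sketch of how an algorithm can split unlabelled data into groups, the following uses k-means on one-dimensional values. K-means is an assumption here (the notes do not name an algorithm), and the nightly prices and starting centres are hypothetical.

```python
def k_means_1d(values, centres, iterations=10):
    """Group numbers around the nearest centre, then move each centre to the
    mean of its group, and repeat."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centre.
        centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
    return centres, clusters

# Hypothetical nightly rental prices with no labels attached.
prices = [40, 45, 50, 200, 210, 220]
final_centres, groups = k_means_1d(prices, centres=[0, 100])
```

No one told the algorithm which listings are "budget" or "premium"; the two groups emerge from the similarities in the data alone.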
Semi-Supervised Learning Algorithms
- Mixes supervised and unsupervised learning, where the algorithm is trained on both labelled and unlabelled data. Cat Photo example: The model can recognize a cat in any picture.
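One way to combine labelled and unlabelled data is self-training, where a model labels the unlabelled points it is most confident about and then learns from them too. The sketch below assumes this approach and uses hypothetical one-dimensional points; it is not necessarily the method the notes have in mind.

```python
def self_train(labelled, unlabelled, max_distance=2.0):
    """Repeatedly give the unlabelled point that lies closest to a labelled
    point the same label, as long as the gap is small enough."""
    labelled = dict(labelled)      # point -> label
    remaining = list(unlabelled)
    while remaining:
        # Find the (unlabelled, labelled) pair with the smallest gap.
        point, neighbour = min(
            ((u, known) for u in remaining for known in labelled),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        if abs(point - neighbour) > max_distance:
            break                  # nothing left that is close enough to trust
        labelled[point] = labelled[neighbour]
        remaining.remove(point)
    return labelled, remaining

# Two hypothetical labelled examples and four unlabelled ones.
labels, leftover = self_train({1.0: "cat", 10.0: "dog"}, [2.0, 3.5, 8.5, 20.0])
```

The point 3.5 is too far from the original "cat" example to be labelled directly, but it becomes reachable once 2.0 has been labelled, which is how the unlabelled data extends what the few labels can cover.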
Why Use Machine Learning
- Some tasks cannot be defined well except by example
- Hidden among large piles of data are important relationships and correlations that are extracted by machine learning
- Machines can improve upon the original design in their operating environment
- The amount of knowledge available is too large for humans to encode fully, so machines that learn it themselves let AI systems keep up with new knowledge
- Environments change over time; machines that can adapt to a changing environment reduce the need for constant redesign
- Machine learning solves problems that cannot be solved by numerical means alone!