Data Mining Chapter 5

Data Mining is the extraction of hidden information from large datasets, a powerful technology that helps organizations focus on the most important information in their data repositories.

Structured Data: Highly organized and easily searchable, often stored in relational databases. Example: An Excel spreadsheet with clear rows and columns showing names, addresses, and phone numbers.
Semi-Structured Data: Not as organized as structured data, but still has some identifiable patterns or markers that aid in its processing. Example: JSON files, where data is stored in key-value pairs but doesn't fit into strict tables like structured data.
Unstructured Data: Lacks a predefined format or structure, making it difficult to collect, process, and analyze. Example: Body of an email or a video file, where the content doesn't fit into a database as neatly as structured data.
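To make the structured versus semi-structured distinction concrete, here is a minimal sketch that flattens JSON records with differing keys into table-like rows. The sample records and the helper name to_rows are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical semi-structured input: both records are key-value pairs,
# but they do not share the same set of keys.
raw = (
    '[{"name": "Ada", "phone": "555-0100"},'
    ' {"name": "Grace", "email": "grace@example.com"}]'
)
records = json.loads(raw)


def to_rows(records):
    """Flatten a list of dicts into (columns, rows), filling gaps with None."""
    # The union of all keys becomes the fixed column set of the table.
    columns = sorted({key for record in records for key in record})
    rows = [[record.get(column) for column in columns] for record in records]
    return columns, rows


columns, rows = to_rows(records)
print(columns)
for row in rows:
    print(row)
```

The None entries show the cost of forcing semi-structured data into a strict table: every row must carry every column, even when a record never had that field.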
Categorical Data: Qualitative data that groups observations into categories. Types:
- Nominal: No order, e.g., colors: red, blue, green.
- Ordinal: Order matters, e.g., ratings: good, better, best.
Numerical Data: Quantitative data expressed as numbers. Types:
- Interval: These have equal intervals between values but no true zero. Example: Temperature in Celsius, where 0°C doesn't mean the absence of temperature.
- Ratio: These have a true zero, allowing for statements about how many times greater one value is compared to another. Example: Weight, where 0 kg means no weight, and 10 kg is twice as heavy as 5 kg.
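The practical difference between these scales is which operations are meaningful on them. The following is a small sketch using the examples above; the function names are hypothetical, and the Kelvin conversion is added only to show why ratios of Celsius values are misleading.

```python
# Ordinal: order is meaningful, distances between categories are not.
RATING_ORDER = ["good", "better", "best"]


def rating_rank(rating: str) -> int:
    """Return the position of a rating on the ordered scale (0 = lowest)."""
    return RATING_ORDER.index(rating)


# Interval: differences are meaningful, ratios are not, because 0 °C is
# an arbitrary reference point rather than "no temperature".
def celsius_to_kelvin(celsius: float) -> float:
    return celsius + 273.15


# Ratio: a true zero makes "times as much" statements valid.
def weight_ratio(a_kg: float, b_kg: float) -> float:
    return a_kg / b_kg


print(rating_rank("best") > rating_rank("good"))  # True
print(weight_ratio(10, 5))                        # 2.0

# 20 °C divided by 10 °C gives 2, but on the Kelvin scale (which has a
# true zero) the same two temperatures differ by only a few percent.
print(celsius_to_kelvin(20) / celsius_to_kelvin(10))
```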

Patterns are recurring structures or regularities in data; identifying them helps in understanding the data and making predictions.
Types of patterns in data mining:
- Association Rules: Identify relationships among a set of items, e.g., a customer who buys bread is 80% likely to also buy milk.
- Clusters: Grouping similar data points together, e.g., in customer segmentation, similar customers are grouped based on their purchasing behavior or demographics.
- Sequential Patterns: Patterns where the order of events matters, e.g., in a website's usage data, users often visit the homepage first, then a product page, and finally the checkout page.
- Predictions: Estimate future occurrences of an event based on what has happened in the past, e.g., predicting the temperature on a particular day.
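The "80% likely" figure in the association rule example corresponds to what is usually called the rule's confidence: among transactions containing bread, the fraction that also contain milk. Below is a minimal sketch of that calculation, along with support (the fraction of all transactions containing an itemset). The transaction list is invented for illustration and the function names are hypothetical.

```python
def count_containing(transactions, itemset):
    """Count the transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for transaction in transactions if items <= set(transaction))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return count_containing(transactions, both) / n_antecedent


# Made-up market baskets: bread appears in 5, bread and milk together in 4.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "eggs"},
    {"milk"},
]

print(confidence(transactions, {"bread"}, {"milk"}))  # 0.8
print(support(transactions, {"bread", "milk"}))
```

Real association rule miners (for example, Apriori-style algorithms) search over many candidate itemsets; this sketch only evaluates a single, already chosen rule.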

Data Mining uses techniques from other domains such as statistics, database/data warehouse systems, machine learning (ML), pattern recognition, visualization, information retrieval, and high-performance computing.
Data Mining and Machine Learning overlap significantly. Machine Learning focuses on classification and prediction based on known properties learned from training data.
Data Mining focuses on discovering previously unknown properties in the data, rather than pursuing a specific goal set by the domain.

Classification: Assigning categories to data points, e.g., determining whether an email is spam or not spam based on its content.
Clustering: Grouping similar data points together, e.g., grouping customers with similar purchase behaviors without predefined categories.
Regression: Predicting a continuous value, e.g., predicting a house's price based on its size, location, and other features.
Association Rules: Finding relationships between variables, e.g., people who buy bread often buy milk, captured as the rule {Bread} => {Milk}.
Decision Tree: A tree-shaped model used for making decisions or predictions, where each branch represents a choice and each leaf represents an outcome, e.g., a decision tree to decide on playing tennis based on weather conditions like outlook, humidity, and wind.
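A decision tree can be read as a set of nested if/else tests. The sketch below hand-codes a small tree for the play-tennis example. The specific splits (overcast always plays, sunny depends on humidity, rain depends on wind) are an assumption based on the commonly used textbook version of this example. In practice the tree would be learned from data rather than written by hand.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Walk a hand-written decision tree and return the predicted outcome.

    Each `if` is a branch (a choice on one attribute);
    each `return` is a leaf (an outcome).
    """
    if outlook == "overcast":
        return True                  # leaf: always play
    if outlook == "sunny":
        return humidity == "normal"  # branch on humidity
    if outlook == "rain":
        return wind == "weak"        # branch on wind
    raise ValueError(f"unknown outlook: {outlook!r}")


print(play_tennis("sunny", "high", "weak"))       # False
print(play_tennis("overcast", "high", "strong"))  # True
print(play_tennis("rain", "normal", "weak"))      # True
```

Note that not every attribute is consulted on every path: for an overcast day the tree reaches a leaf without looking at humidity or wind.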

Data Mining extracts useful information from large datasets, turning raw data into valuable insights. It integrates methodologies from statistics, machine learning, and database systems to analyze patterns and relationships in data.