
Data Preprocessing and Statistical Analysis Quiz
8 Questions

Created by
@MarvelousTin


Questions and Answers

Which of the following is a primary step in data cleaning during data preprocessing?

  • Normalization
  • Handling missing values (correct)
  • One-hot encoding
  • Feature extraction

Descriptive statistics are used primarily for what purpose?

  • To test hypotheses
  • To draw conclusions from a sample
  • To summarize and describe the main features of a dataset (correct)
  • To determine causality between variables

Which algorithm is an example of unsupervised learning?

  • K-means clustering (correct)
  • Support vector machines
  • Decision trees
  • Linear regression

Which of the following technologies is NOT associated with big data?

  Answer: MySQL

What is the primary purpose of data visualization?

  Answer: To represent data graphically for better understanding

In inferential statistics, what is typically used to estimate population parameters?

  Answer: Samples from the population

Which of the following is a method used to encode categorical variables?

  Answer: Label encoding

Which of these libraries is commonly used for data visualization in Python?

  Answer: Matplotlib

    Study Notes

    Data Preprocessing

    • Definition: Preparing raw data for analysis by cleaning and transforming it.
    • Steps:
      1. Data Cleaning:
        • Remove duplicates.
        • Handle missing values (imputation, removal).
        • Correct inconsistencies (formatting issues).
      2. Data Transformation:
        • Normalization/standardization (scaling features).
        • Encoding categorical variables (one-hot encoding, label encoding).
        • Feature extraction and selection (reducing dimensionality).
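The cleaning and transformation steps above can be sketched in plain Python. The column names and values here are illustrative, not from any real dataset:

```python
from statistics import mean

# Toy records: "age" has a missing value, "city" is categorical (illustrative data).
rows = [
    {"age": 25, "city": "Paris"},
    {"age": None, "city": "London"},
    {"age": 35, "city": "Paris"},
]

# 1. Data cleaning: impute the missing "age" with the column mean.
ages = [r["age"] for r in rows if r["age"] is not None]
fill = mean(ages)
for r in rows:
    if r["age"] is None:
        r["age"] = fill

# 2a. Transformation: min-max normalization scales "age" into [0, 1].
lo, hi = min(r["age"] for r in rows), max(r["age"] for r in rows)
for r in rows:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)

# 2b. One-hot encoding turns the categorical "city" into binary indicator features.
cities = sorted({r["city"] for r in rows})
for r in rows:
    for c in cities:
        r[f"city_{c}"] = int(r["city"] == c)
```

In practice a library such as pandas or scikit-learn would handle these steps; the sketch only shows the logic behind them.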

    Statistical Analysis

    • Definition: Using statistical methods to analyze data and derive insights.
    • Key Concepts:
      • Descriptive Statistics: Summarizing data (mean, median, mode, standard deviation).
      • Inferential Statistics: Drawing conclusions from a sample (hypothesis testing, confidence intervals).
      • Correlation and Regression: Measuring relationships between variables (Pearson correlation, linear regression).
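The descriptive and inferential concepts above can be illustrated with Python's standard `statistics` module. The sample data and the 95% z-value are illustrative:

```python
import math
from statistics import mean, median, mode, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample

# Descriptive statistics: summarize the main features of the sample.
m = mean(data)     # central tendency
md = median(data)  # middle value
mo = mode(data)    # most frequent value
s = stdev(data)    # spread (sample standard deviation)

# Inferential statistics: estimate the population mean from the sample
# with a 95% confidence interval (normal approximation, z = 1.96).
se = s / math.sqrt(len(data))
ci = (m - 1.96 * se, m + 1.96 * se)
```

The confidence interval is the inferential step: it uses only the sample to make a statement about the unseen population.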

    Machine Learning Algorithms

    • Types:
      • Supervised Learning:
        • Algorithms learn from labeled data (training).
        • Examples: Linear regression, decision trees, support vector machines.
      • Unsupervised Learning:
        • Algorithms find patterns in unlabeled data.
        • Examples: K-means clustering, hierarchical clustering, principal component analysis (PCA).
      • Reinforcement Learning:
        • Learning through trial and error to maximize rewards.
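To make the supervised/unsupervised distinction concrete, here is a minimal 1-D K-means sketch: no labels are provided, and the algorithm discovers the two groups on its own. The data and the naive initialization are illustrative; in practice a library such as scikit-learn would be used:

```python
from statistics import mean

# Unsupervised learning: no labels, only raw points (two obvious clusters).
points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
centers = [points[0], points[-1]]  # naive initialization (illustrative)

for _ in range(10):  # alternate assignment and update steps
    clusters = [[], []]
    for p in points:
        # Assignment step: each point joins its nearest center.
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update step: each center moves to the mean of its cluster.
    centers = [mean(c) for c in clusters]

# The centers converge near the two group means (1.5 and 11.0).
```

A supervised algorithm would instead be given a label for every point and learn to predict it; here the structure emerges from the data alone.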

    Data Visualization

    • Purpose: To represent data graphically for better understanding and insights.
    • Tools and Techniques:
      • Charts and Graphs: Bar charts, line graphs, scatter plots, histograms.
      • Dashboards: Interactive visual displays of key metrics (Tableau, Power BI).
      • Libraries: Matplotlib, Seaborn (Python), ggplot2 (R).
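A minimal Matplotlib sketch of two of the chart types listed above, a bar chart and a scatter plot. The values are illustrative, and the "Agg" backend is used so the figure renders to a file without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no window needed
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
counts = [5, 3, 7]
xs, ys = [1, 2, 3, 4], [2, 4, 5, 8]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, counts)  # bar chart: frequency per category
ax1.set_title("Counts per category")
ax2.scatter(xs, ys)          # scatter plot: relationship between x and y
ax2.set_title("x vs y")
fig.savefig("example_charts.png")
```

Seaborn builds on this same Matplotlib figure/axes model with higher-level statistical plots.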

    Big Data Technologies

    • Definition: Tools and frameworks designed to handle large volumes of data.
    • Key Technologies:
      • Hadoop: Distributed storage and processing framework.
      • Spark: Fast data processing engine; supports batch and streaming data.
      • NoSQL Databases: Non-relational databases (MongoDB, Cassandra) for unstructured data.
      • Data Warehousing: Systems for storing and analyzing large datasets (Amazon Redshift, Google BigQuery).

    Description

    Test your knowledge on data preprocessing techniques, statistical analysis, and essential machine learning algorithms. This quiz covers key concepts including data cleaning, transformation, and various statistical methods. Challenge yourself to see how well you understand these foundational topics in data science.
