Data Science Overview

Created by @WellMadeAnemone

Questions and Answers

What is the main purpose of data wrangling in data analysis?

  • To visualize data in a more appealing way
  • To store large datasets in a database
  • To clean and prepare data for analysis (correct)
  • To perform complex calculations on data

Which type of machine learning involves algorithms trained on labeled data?

  • Reinforcement Learning
  • Unsupervised Learning
  • Supervised Learning (correct)
  • Cluster Learning

Which of the following tools is commonly used for data visualization?

  • Tableau (correct)
  • Pandas
  • SQL
  • Apache Hadoop

Which technology is designed specifically to process unstructured data?

Answer: NoSQL Databases

What is a key concept in statistical modeling that helps prevent overfitting?

Answer: Validation

What type of machine learning focuses on learning through trial and error?

Answer: Reinforcement Learning

Which of the following is NOT considered a technique for data visualization?

Answer: Hypothesis testing

Which technology serves as a framework for distributed storage and processing of big data?

Answer: Hadoop

    Study Notes

    Data Science

    Data Analysis

    • Definition: Process of inspecting, cleaning, and transforming data to gain insights or inform decision-making.
    • Techniques:
      • Descriptive statistics (mean, median, mode)
      • Inferential statistics (hypothesis testing, confidence intervals)
      • Data wrangling (cleaning and preparing data)
    • Tools: Python (Pandas, NumPy), R, SQL.
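
A minimal sketch of the wrangling and descriptive-statistics steps above, assuming Pandas is installed; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw data with a missing value and inconsistent labels
raw = pd.DataFrame({
    "age": [34, 29, None, 41, 29],
    "city": ["NYC", "nyc", "Boston", "Boston", "NYC"],
})

# Data wrangling: fill the missing age with the median and normalize labels
clean = raw.assign(
    age=raw["age"].fillna(raw["age"].median()),
    city=raw["city"].str.upper(),
)

# Descriptive statistics on the cleaned column
print(clean["age"].mean())    # mean
print(clean["age"].median())  # median
print(clean["age"].mode())    # mode (may return more than one value)
```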

    Machine Learning

    • Definition: A subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.
    • Types:
      • Supervised Learning: Algorithms are trained on labeled data (e.g., regression, classification).
      • Unsupervised Learning: Algorithms identify patterns in unlabeled data (e.g., clustering).
      • Reinforcement Learning: Learning through trial and error to achieve a goal.
    • Common Algorithms: Decision Trees, Random Forest, Support Vector Machines, Neural Networks.
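
A minimal sketch of supervised learning with one of the algorithms listed above (a decision tree), assuming scikit-learn is installed; the toy labeled dataset is hypothetical.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled data: two features per sample, binary class labels
X = [[0, 0], [1, 1], [0, 1], [1, 0], [2, 2], [3, 3]]
y = [0, 1, 0, 0, 1, 1]

# Supervised learning: train on labeled examples, then evaluate on held-out data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on the test split
```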

    Data Visualization

    • Definition: The graphical representation of information and data to communicate insights clearly.
    • Purpose: To simplify complex data sets, identify trends, and assist in decision-making.
    • Tools: Tableau, Matplotlib (Python), ggplot2 (R), Power BI.
    • Key Techniques: Bar charts, histograms, scatter plots, heatmaps, dashboards.
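
A minimal sketch of two of the techniques above (a bar chart and a scatter plot), assuming Matplotlib is installed; the plotted values are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical data
categories = ["A", "B", "C"]
counts = [12, 7, 19]
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: compare counts across categories
ax1.bar(categories, counts)
ax1.set_title("Bar chart")

# Scatter plot: show the relationship between two numeric variables
ax2.scatter(x, y)
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()
```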

    Big Data Technologies

    • Definition: Tools and frameworks designed to process and analyze large, complex data sets that traditional data processing software can’t handle efficiently.
    • Key Technologies:
      • Hadoop: Framework for distributed storage and processing of big data.
      • Apache Spark: Fast and general-purpose engine for big data processing.
      • NoSQL Databases (e.g., MongoDB, Cassandra): Designed for unstructured data.
    • Applications: Social network analysis, fraud detection, recommendation systems.
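
A minimal sketch of distributed processing with Apache Spark's Python API, assuming PySpark is installed and a local Spark session can start; the transaction records are hypothetical.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (on a real cluster this would connect to the cluster manager)
spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

# Hypothetical transaction records
rows = [("alice", 120.0), ("bob", 75.5), ("alice", 31.0), ("carol", 210.0)]
df = spark.createDataFrame(rows, ["user", "amount"])

# Aggregation is executed in parallel across partitions: total spend per user
df.groupBy("user").sum("amount").show()

spark.stop()
```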

    Statistical Modeling

    • Definition: The process of creating a statistical model to understand relationships among variables and to make predictions.
    • Types:
      • Linear Models: Assumes a linear relationship between input and output variables.
      • Generalized Linear Models: Extends linear models to accommodate non-normal distributions.
      • Time Series Analysis: Analyzes time-ordered data points to identify trends and seasonal patterns.
    • Key Concepts: Model fitting, validation, overfitting vs. underfitting, and residual analysis.
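
A minimal sketch of fitting and validating a linear model, assuming NumPy and scikit-learn are installed; the synthetic data and noise level are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: y is roughly linear in x with added noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0 + rng.normal(0, 1.0, size=100)

# Hold out a validation set so the fit is checked on unseen data (guards against overfitting)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(x_train, y_train)

# Residual analysis: residuals should scatter around zero with no obvious pattern
residuals = y_val - model.predict(x_val)
print("validation R^2:", model.score(x_val, y_val))
print("mean residual:", residuals.mean())
```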

    Description

    Explore the fundamentals of Data Science through this comprehensive quiz covering Data Analysis, Machine Learning, and Data Visualization. Test your understanding of key concepts, techniques, and tools in this rapidly evolving field.
