Data Visualization and Python Libraries


Questions and Answers

What is the primary purpose of data visualization?

  • Creating complex mathematical models.
  • Writing reports and technical documents.
  • Storing data in a database.
  • Presenting information and data in graphical form. (correct)

Which of the following Python libraries is the foundational library for data visualization?

  • Bokeh
  • Matplotlib (correct)
  • Plotly
  • Seaborn

What are 'wrappers' in the context of the Matplotlib library?

  • Libraries that provide access to Matplotlib methods with less code (correct)
  • Functions for data cleaning
  • Tools for building interactive visualizations
  • Special types of charts

What is the Bokeh library based on?

The 'Grammar of Graphics'.

Which library specializes in creating maps and displaying geographic data?

Geoplotlib

Which type of visualization is used to illustrate differences between two or more items?

Comparison visualization

What does a box plot show?

The five-number summary, with extreme values shown separately.

What is a box plot most often used for?

Comparing the distribution of a continuous feature across the values of a categorical feature.

What do the components of a transaction represent?

Goods, items, etc.

What is used to store data in graph form?

Graph nodes

Which of the following describes the 'Garbage In, Garbage Out' (GIGO) principle in the context of data?

The results of processing incorrect data will be incorrect, regardless of the correctness of the processing procedure.

What is considered noise in data when a classification error is present?

Inconsistent observations or classifications

What is the purpose of descriptive statistics in the context of data processing?

Studying and quantifying the characteristics of data and results.

What is a data distribution?

A representation of how frequently different values occur within a dataset.

Which of the following does not represent attribute noise?

Completely correct attribute values

What does a positive correlation between two variables mean?

An increase in one variable is accompanied by an increase in the other, or a decrease in one by a decrease in the other.

Which of the following is an example of a qualitative attribute?

A person's sex

What are the two main categories of statistical measures?

Measures of central tendency and measures of dispersion.

How is the mean of a dataset calculated?

By summing all the values and dividing by the number of values.

What is characteristic of the values of categorical variables?

They are always limited in number and are called labels.

Which of the listed correlation coefficients are non-parametric?

Kendall's tau correlation coefficient and Spearman's rank correlation coefficient.

When is Spearman's rank correlation coefficient used instead of Pearson's?

When the data distribution is skewed or contains outliers.

What is a nominal attribute?

A qualitative attribute without a specific order

What does the median represent in a dataset?

The value that divides the dataset into two equal halves.

What is the main purpose of data pre-processing?

Removing errors and preparing the data

What does Kendall's correlation coefficient measure?

The similarity or dissimilarity between two ordinal variables.

What are the steps of the data-management process in machine learning?

Problem understanding, data understanding, data preparation, modeling, evaluation, deployment.

According to the material, what steps are needed to compute the median?

Sorting the numbers from smallest to largest, then selecting the middle number

What is another name for the Gaussian distribution in machine learning?

The normal distribution.

What do statistical measures of dispersion measure?

How spread out the values are within the range of the data.

Which property characterizes the normal distribution?

It has a bell-shaped curve and is symmetric about the mean.

Which correlation coefficients are suitable for a binary variable?

Pearson's, Spearman's, and Kendall's tau.

Which of the following statements does not describe a problem that outliers can cause when building predictive models?

Increasing model accuracy.

Which graphical display most effectively shows outliers using the interquartile range?

Box plot.

What is the basic assumption behind using the Z-score method for outlier detection?

Normally distributed data.

What is the main goal of feature scaling when preparing data for machine learning?

Bringing all features to the same order of magnitude.

What are the mean and standard deviation of normalized values after Z-score standardization?

Mean 0, standard deviation 1.

Into which interval does min-max normalization transform the original data?

[0, 1].

What does an AUC (Area Under the Curve) value of 0.6 mean for a classification model?

A poor classifier.

How is the False Positive Rate (FPR) calculated from a confusion matrix?

$FP / (TN + FP)$

How are base models generated in a boosting algorithm?

Using the same training dataset with different weight vectors.

What happens to a point's weight across boosting iterations if the point is correctly classified?

The weight decreases.

How is ϵ calculated in the boosting example?

As the sum of the weights of the misclassified points.

What does α represent in the context of weight updates in a boosting algorithm?

The factor used in multiplying the weights of misclassified points.

In the example with ten observations, what is the initial weight of each observation?

The weight is 0.1

How are new weights calculated for misclassified points?

By multiplying the old weight by $e^{\alpha}$.

What is the goal of re-weighting points across boosting iterations?

To increase the influence of points that are harder to classify.

If the model misclassifies points 7, 8, 9, and 10, what is the value of ϵ when each point's initial weight is 0.1?

0.4
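
To make the boosting arithmetic above concrete, here is a minimal Python sketch of one reweighting step, assuming the standard AdaBoost convention α = ½·ln((1 − ε)/ε) (some course variants omit the ½); the ten observations and the misclassified points 7–10 mirror the quiz example.

```python
import math

# Ten observations, each starting with weight 0.1 (as in the example above).
weights = [0.1] * 10
misclassified = {7, 8, 9, 10}  # 1-based indices of the wrongly classified points

# Epsilon: the sum of the weights of the misclassified points -> 0.4 here.
epsilon = sum(w for i, w in enumerate(weights, start=1) if i in misclassified)

# Alpha: based on the ratio of correct to incorrect weight mass.
# The 1/2 factor follows the standard AdaBoost formulation (an assumption here).
alpha = 0.5 * math.log((1 - epsilon) / epsilon)

# Misclassified points are multiplied by e^alpha (weight grows); correctly
# classified points by e^(-alpha) (weight shrinks); then renormalize to sum 1.
new_weights = [
    w * math.exp(alpha if i in misclassified else -alpha)
    for i, w in enumerate(weights, start=1)
]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(f"epsilon = {epsilon:.1f}, alpha = {alpha:.3f}")
print("new weights:", [round(w, 3) for w in new_weights])
```

Running this prints ε = 0.4 and shows the hard-to-classify points 7–10 ending up with larger weights than the correctly classified ones, which is exactly the behavior described in the questions above.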

Flashcards

Data in graph form

Data in graph form represents data items as vertices and their relationships as edges.

Data quality

The quality of the input data affects the quality of the output. Bad inputs lead to bad results.

Noise in data (label noise)

Incorrect data that affects classification or pattern recognition, e.g., a mislabeled data point.

Noise in data (attribute noise)

Incorrect data in an attribute, which may be wrong, missing, or incomplete, e.g., an incorrect number, a missing value, or unreadable information.

Qualitative data

Data used for categorization or classification without numerical value, e.g., color, brand, sex.

Quantitative data

Data that can be measured, expressed as numbers or real values, e.g., height, weight, number of products.

Nominal attributes

Attribute values are categories that cannot be ranked, e.g., sex (male, female), country, color.

Quality data and machine learning

Useful data that improves the performance of machine learning algorithms, reducing uncertainty and classification errors.

Arithmetic mean (average)

A measure of central tendency used to compute the average value of a dataset. It is calculated by summing all the values in the dataset and dividing by the number of values.

Median

A measure of central tendency representing the middle value of a dataset. Sort the values from smallest to largest, then select the value that separates the upper and lower halves of the dataset.

Range

A measure of dispersion covering the data from the lowest to the highest value in the dataset.

Mode

A measure of central tendency representing the most frequently occurring value in a dataset. For example, the mode of the dataset [1, 2, 2, 3, 3, 3, 4] is 3, because the value 3 occurs most often.

Variance

A measure of dispersion quantifying how far the data lie from the mean. It is calculated by summing the squared differences between each value and the mean and dividing by the number of values minus 1.

Standard deviation

A measure of dispersion describing the spread of the data around the mean. It is the square root of the variance.

Descriptive statistics

A method of statistical analysis used to describe the characteristics of a dataset, including measures of central tendency, dispersion, and the shape of the distribution.

Distribution

The representation of values within a dataset, showing how the values are spread across the dataset.

Correlation

The amount or direction of change in one variable relative to another. A positive correlation means the variables move in the same direction; a negative correlation means they move in opposite directions.

Pearson correlation coefficient

The standard measure of correlation between two continuous datasets. It measures the linear relationship between variables.

Spearman correlation coefficient

A measure of correlation between the ranks of two datasets. It works well for nonlinear patterns or data with outliers.

Kendall correlation coefficient

A measure of correlation between the ranks of two datasets. It takes the order of the data into account.

Normal distribution

A continuous probability distribution symmetric about the mean. Most of the data lie within one standard deviation of the mean.

Steps in data analysis

The steps of the data analysis process in machine learning.

Data Preparation

The procedure of preparing and transforming data for machine learning. It covers cleaning, reshaping, and preparing the data.

Modeling

The processes involved in building and evaluating machine learning models.

Learning with weighting

Learning on a dataset in which the samples carry different weights.

Iterative weight updates

Systematic re-weighting of the data in each iteration. Correctly classified samples have their weights decreased, while misclassified samples have their weights increased.

Alpha (α)

Computed as the logarithm of the ratio of correct to incorrect classifications in an iteration.

Weight update for misclassified samples

Multiplying the existing weights by e^α.

Epsilon (ϵ)

The sum of the weights of all misclassified samples.

Boosting

Adjusting sample weights in each iteration so that the algorithm learns from previous mistakes.

Ensemble model

A series of weak models combined to create a strong model.

Boosting iteration

A weak model is used to classify the samples. Correctly classified samples have their weights decreased; misclassified samples have their weights increased.

What is data visualization?

Data visualization is used to present information and data in graphical form using visual elements such as diagrams, charts, plots, and maps.

Why is data visualization important?

Data visualization helps analysts understand patterns, trends, exceptions, distributions, and relationships in the data.

What are some popular Python libraries for data visualization?

Python offers various libraries for data visualization, such as Matplotlib, Seaborn, Bokeh, and Plotly.

What is Matplotlib?

Matplotlib is one of the first Python data visualization libraries; many other libraries are built on top of it or designed to work in tandem with it.

How do Seaborn and pandas relate to Matplotlib?

Seaborn and pandas are libraries that use Matplotlib for data visualization but offer a simpler way to access its functions.

What is Bokeh and how does it use "The Grammar of Graphics"?

Bokeh is a library that uses "The Grammar of Graphics", meaning any plot can be created by combining data with layers of visual elements.

What makes Plotly special?

Plotly is a library offering interactive plots, including chart types not found in most other libraries.

What is comparison visualization used for?

Comparison visualization is used to show the differences between two or more items at a point in time or over a period of time.

Outliers

Outliers are data values that differ significantly from the other values in a dataset. They can arise from data-entry errors, abnormal events, or simply the natural variability of the data.

Impact of outliers on machine learning models

Outliers can negatively affect machine learning models. They can lead to longer training times, reduced accuracy, increased error variance, and reduced normality of the data.

Box plot

A box plot displays the distribution of a dataset using quartiles. The data are grouped into a box between the first and third quartiles. Outliers are shown as individual points outside the box, identified using the interquartile range.

Z-score

The Z-score is a parametric approach to outlier detection. It assumes normally distributed data; outliers lie in the tails of the normal distribution curve, far from the mean.

Transforming data

Before processing, it is often necessary to modify or transform the structure or characteristics of the data. This may be needed to meet the requirements of a particular machine learning approach, to better understand the data, or to increase the efficiency of the machine learning process.

Z-score normalization

Z-score normalization, or zero-mean normalization, produces normalized values with a mean of 0 and a standard deviation of 1.

Min-max normalization

Min-max normalization transforms the original data into the interval [0, 1].

AUC (Area Under the Curve)

AUC (Area Under the Curve) is a performance measure for classification models, ranging from 0 to 1. AUC ≥ 0.7: acceptable/good classifier; 0.7 > AUC ≥ 0.6: poor classifier; 0.6 > AUC ≥ 0.5: no discrimination.

Study Notes

Machine Learning

  •  Machine learning (ML) is a field of computer science that studies algorithms and techniques for automating solutions to complex problems.
  •  ML is a term coined around 1960, combining "machine" (computer, robot) and "learning" (acquiring or discovering patterns).
  •  Key aspects of big data at the University of Lodz library and a logistics company are the sheer volume of data: with an average document size of ~1 MB (often larger), the library's more than 2.8 million volumes account for potentially 30 terabytes of data, and a logistics courier's shipments can amount to about 20 terabytes.
    • These units (kilobytes, megabytes, gigabytes, etc.) are used to quantify the large volume of data.

Bibliography

  •  Includes various book titles on data mining and machine learning (e.g., Discovering Knowledge in Data, Data Mining: Concepts and Techniques, Introduction to Statistical Learning, Machine Learning with PyTorch).
  •  Provides web references (e.g., Data Mining Map from Saed Sayad, Analytics, Data Mining, and Data Science from Kdnuggets, Datasets from Kaggle), resources offering additional information on machine learning.

What Big Data Looks Like

  •  Large collections of data exist in various forms (e.g., library volumes, courier shipments).
  •  A measure of data size is required (from kilobytes to terabytes) to understand the sheer volume of data.

Relation with Artificial Intelligence (AI)

  •  AI is much broader than machine learning (ML).
  •  ML is a subfield of AI.
  •  AI encompasses various approaches to making machines intelligent, while ML focuses on learning from data using algorithms.
  •  Expert systems are one AI approach not based on learning.

Definition of Data Mining

  •  Data mining is the art and science of intelligent data analysis.
  •  Its aim is to discover meaningful insights and knowledge from data, often expressed as models.
  •  Data mining builds models representing discovered knowledge, useful for understanding the world and making predictions.

Data Mining vs. Machine Learning

  •  Data mining is a technique for discovering patterns in data. It is frequently used in business analytics and experimental studies.
  •  Machine learning employs algorithms that improve through experience with data; it develops new algorithms and makes use of data mining.
  •  The two fields influence each other: data mining techniques inform machine learning algorithms, enabling predictive models that capture what is happening behind the data so that future results can be predicted.

What is not Data Mining and Machine Learning (DM/ML)

  •  DM/ML is distinct from OLAP. OLAP (On-Line Analytical Processing) is a data analysis technique concerned with summarizing and analyzing data, whereas DM/ML is more about finding patterns and knowledge within data.

Examples of Non-DM/ML Tasks

  •  Illustrative example questions demonstrating scenarios that fall outside the scope of data mining and machine learning.

Data Analysis Process

  •  The CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology describes a standard process used in data analysis projects.
  •  The steps include Business/Problem Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

Data Understanding

  •  Data understanding involves getting familiar with the dataset, including identifying quality problems.
  •  Crucial questions to address include: Data origin, collection methods, meaning of rows and columns, presence of obscure symbols/abbreviations.

Data Preparation

  •  Data preparation involves activities to create a final dataset from the initial raw data.
  •  Important aspects of preparation typically include joining different datasets, removing irrelevant variables, cleaning the data to resolve anomalies, normalization, formatting, and dealing with missing data points.

Modeling

  •  Model selection involves choosing and applying appropriate modeling techniques, calibrating parameters to optimal values.

Evaluation

  •  Process evaluation assesses whether the models meet the assumptions, verifying important business and objective considerations during evaluation.

Deployment

  •  Creating the model isn't always the end of the project. It needs appropriate organization and presentation so that users can apply it.

Data Analysis Engineering Summary

  •  Connections between data science, data engineering, statistics, machine learning, programming, domain expertise, and visualization are illustrated.

Python for Machine Learning

  •  Python is a preferred language for machine learning due to its readability and intuitive syntax.
  • Libraries provide pre-built functions for mathematical operations, data manipulation, and machine learning tasks.

Python Libraries

  •  NumPy is essential for numerical computing with multidimensional arrays and matrices.
  •  Pandas provides data structures and operations for data manipulation and analysis, especially for time series and data cleaning.
  •  Matplotlib enables high-quality visualizations, creating static, interactive, and animated graphs and charts.
  •  Scikit-learn provides a consistent interface for various supervised and unsupervised learning algorithms.
  •  SciPy extends NumPy's capabilities with additional tools for scientific and technical computing.
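
As a minimal sketch of how these libraries cooperate (an illustration, not part of the original lecture notes; the synthetic data and model choice are arbitrary), NumPy generates the data, pandas holds it in tabular form, scikit-learn fits a model, and Matplotlib plots the result:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# NumPy: generate synthetic numeric data.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0, 10, size=50))
y = 2.5 * x + rng.normal(0, 2, size=50)

# pandas: tabular container for inspection, cleaning, and analysis.
df = pd.DataFrame({"x": x, "y": y})

# scikit-learn: fit a simple supervised model.
model = LinearRegression().fit(df[["x"]], df["y"])

# Matplotlib: visualize the data and the fitted line.
plt.scatter(df["x"], df["y"], label="data")
plt.plot(df["x"], model.predict(df[["x"]]), color="red", label="fit")
plt.legend()
plt.show()
```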

Data Quality Considerations

  •  Machine learning algorithms are sensitive to data quality problems.
  •  Inaccurate or incomplete data can lead to poor results (Garbage In, Garbage Out).
  • Properties of good data include completeness, accuracy, and currency (being up to date).

Noise in Data

  •  Data can contain: label noise (incorrect labels), inconsistent observations, and classification errors (incorrectly classified objects).

Noise in Data: Types of Noise

  •  Attribute noise – incorrect or unknown attribute values that impact model accuracy; these are often difficult to find and can seriously affect the performance of ML algorithms.
  •  Incomplete attributes – missing or incomplete attribute values, causing errors in interpreting values.
  •  Outliers – data points significantly deviating from the majority of the data.

Types of Variables

  •  Qualitative (categorical) variables are non-measurable and represented by names or labels (e.g., colors, sizes).
  •  Quantitative (numerical) variables are measurable and represented by numbers, further divided into discrete (integers, whole numbers) and continuous (real numbers).

Qualitative Data Types

  •  Nominal attributes: no meaningful order or ranking (e.g., colors, names)
  •  Ordinal attributes: meaningful order or ranking (e.g., ratings, satisfaction levels).

Quantitative Data Types

  •  Discrete attributes: finite or countable set of values (e.g., number of items sold)
  •  Continuous attributes: infinite set of values (e.g., height, temperature)

Transaction-based Sets

  •  Each transaction is represented as a vector of items purchased.

Data in Graph Form

  •  Data in graph form represents data as vertices or nodes with relationships between the data shown as edges.

Data Analysis Process (summary)

  •  CRISP-DM methodology represents a standard process typically used in data mining projects.

Data

  •  Data is the plural of datum, and a datum is a single piece of information.
  •  Data comes with a certain implied meaning and should be handled based on the actual data.

Knowledge and Information (Larose)

  • Data refers to factual information, which is stored in a shop's database.
  • Information arises from searching data, or from asking questions of the database (e.g. how many customers have purchased candles on Sundays).
  • Knowledge arises from processing data and draws conclusions (e.g., most customers with specific traits tend to shop for candles) that were not explicitly stated in the data.

Data Pre-processing

  •  Data pre-processing is crucial for the quality of the work since the data need cleaning and transformation to ensure it is suitable for mining and using machine learning algorithms.
  •  Datasets may include outdated, excessive, or redundant fields, records with missing values, outliers, and differing formats, making them unusable for ML algorithms or incompatible with common-sense principles.

Descriptive Statistics

  •  Descriptive statistics quantify data and help to understand, gain insight into, and measure it (e.g., the average or spread of data values); they are used to determine the best performance and avoid pitfalls like underfitting or overfitting when using machine learning algorithms.

Distribution

  •  A distribution visually represents how often particular values occur in a dataset.

Statistical Measures

  •  Measures of central tendency, locating the center of a distribution (e.g., mean, median, mode).
  •  Measures of spread or dispersion (e.g., range, variance, standard deviation) illustrate the spread of data.

Measures of Central Tendency

  •  Mean (Average): sum of all values divided by the total count.
  •  Median: the middle value in a sorted dataset.
  •  Mode: the most frequently occurring value in a dataset.

Measures of Spread or Dispersion

  •  Maximum: the largest value in a dataset.
  •  Minimum: the smallest value in a dataset.
  •  Range: the difference between the maximum and minimum values.
  •  Variance: the average squared difference between each value and the mean.
  •  Standard Deviation: the square root of the variance.
  • Quantiles/Quartiles: values dividing the data into segments with equal frequency.
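
A short sketch of these measures using Python's built-in statistics module (the data values are invented for illustration):

```python
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 7]

print("mean:", statistics.mean(data))                 # sum / count
print("median:", statistics.median(data))             # middle of the sorted values
print("mode:", statistics.mode(data))                 # most frequent value -> 3
print("range:", max(data) - min(data))                # maximum minus minimum
print("variance:", statistics.variance(data))         # sample variance (divides by n - 1)
print("std dev:", statistics.stdev(data))             # square root of the variance
print("quartiles:", statistics.quantiles(data, n=4))  # three cut points -> Q1, Q2, Q3
```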

Skewness

  •  Skewness measures the asymmetry of a data distribution.
  • Its value can be positive, negative, or zero; a zero value indicates a symmetrical distribution.
  • A positive value indicates the right tail is longer than the left (the mode lies to the left); a negative value indicates the left tail is longer than the right (the mode lies to the right).
  • The shape of the distribution's curve (e.g., in a histogram) makes this asymmetry visible.

Kurtosis

  •  Kurtosis measures tailedness of data compared to a normal distribution.
  • High kurtosis indicates a distribution with heavier tails (more outliers).
  • Low kurtosis indicates a distribution with lighter tails (fewer outliers).
  • Three main types include mesokurtic (normal), platykurtic (light tails), and leptokurtic (heavy tails).

Covariance

  •  Covariance measures the relationships between two variables, displaying the degree to which variables are changing in a given direction.
  • It ranges from negative to positive infinity (it is not normalized) and is thus hard to interpret on its own; correlation, which is normalized, offers better insight.

Correlation

  •  Correlation measures the strength and direction of the linear relationship between two variables. Its values range from -1 to +1, where 0 indicates no correlation, -1 a perfect negative, and +1 a perfect positive correlation.

Spearman's Rank Correlation Coefficient

  •  A non-parametric approach to assessing the correlation between ranked variables.
  • Useful when data distributions are skewed or contain outliers, because it avoids assumptions about the data distribution.

Kendall's Rank Correlation Coefficient

  •  Another non-parametric measure of the correlation between ordinal variables, measuring similarity or dissimilarity.
  • It is effective even when both variables are binary; in that case Pearson's, Spearman's, and Kendall's coefficients yield the same value.
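
All three coefficients are available in pandas through the method parameter of corr; a minimal sketch (the columns and values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "height": [160, 165, 170, 175, 180, 195],
    "weight": [55, 60, 66, 70, 90, 95],
})

# Pearson: linear relationship between continuous variables.
print("pearson :", df["height"].corr(df["weight"], method="pearson"))
# Spearman: rank-based, robust to skewed data and outliers.
print("spearman:", df["height"].corr(df["weight"], method="spearman"))
# Kendall: rank-based tau, compares the ordering of pairs.
print("kendall :", df["height"].corr(df["weight"], method="kendall"))
```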

Data Analysis Process (steps summary)

  •  Illustrative example questions to assess the clustering tendency of various datasets.

Machine Learning: Methods using Gaussian Distribution

  •  Various machine learning algorithms, such as linear regression and others, involve the assumptions that data comes from a Gaussian (normal) distribution. This simplification often leads to efficient estimations.
  • Some algorithms, such as anomaly detection and PCA, explicitly use the Gaussian distribution's properties.

Central Limit Theorem

  •  A core concept in hypothesis testing and related inferential statistics, highlighting that the mean of a sample will approach the mean of a population as sample size increases, and providing a basis for statistical inferences about populations based on samples.

Collecting Samples

  •  Data is rarely collected from an entire population due to time and resource constraints. Samples are part of populations and should ideally be representative of populations to allow valid inferences to be made or conclusions to be drawn from samples.

Data Visualization

  •  Data visualization is the initial step toward easily understanding and communicating information. It represents data graphically using various visual elements like charts, plots, and graphs for better understanding.

Data Visualization (methods summary)

  •  Visual elements for representing data (e.g. Charts, graphs, plots, maps).

Data Visualization with Python

  • Matplotlib - a foundational library for data visualization in Python.
  • Seaborn - provides a higher-level interface on top of Matplotlib.
  • Bokeh - enables interactive plots.
  • Plotly - offers interactive plots, including additional charts unavailable with other methods.

Data Visualization - Comparison

  •  Comparison visualization method compares elements or variables across different time periods or points in time.
  •  A boxplot is a commonly used comparison chart to compare feature distributions across different categories.

Visualizing Numeric Features - Boxplots

  •  Visual representation of five-number summary statistics (minimum, first quartile, median, third quartile, maximum) and outliers.

Data Visualization - Relationship

  •  Relationship visualizations illustrate correlations between two or more variables.
  •  A scatterplot provides visual representation of the relationship between two continuous variables.

Data Visualization - Distribution

  •  Distribution visualizations show the statistical distribution of values of a feature.
  • A histogram visually presents the distribution of numeric variables, displaying the frequency or count of observations falling within specific ranges.

Data Visualization - Composition

  • Composition visualizations display the component makeup of data using stacked bar charts and pie charts.

Data Visualization - Heatmap

  • A heatmap is used for visualizing data using various colours to represent the values displayed.

Data Visualization - Pair Plot

  • A pair plot is a comprehensive way to visualize the distribution of both variables and the relationships among them.

Data Quality

  •  Data with issues influences the overall quality of results obtained when using machine learning models (garbage in, garbage out). Data quality includes aspects such as completeness, the correctness of data, and the currency of data used when the model is being trained and evaluated.

Data Pre-processing: Types of Data Issues

  •  Issues in datasets include outdated or redundant data, records with missing values, outliers, formats unsuitable for ML algorithms, and values incompatible with principles and common sense.

Data Pre-processing: Ways to Improve Data Quality

  •  Cleaning up problematic data (removing outdated and/or redundant fields, resolve anomalies).
  • Restructuring and/or normalizing data (e.g., converting various date formats into a single representation/format) to improve data quality, which influences model performance.
  • Data Cleaning (remove/fix errors).

Missing Values

  •  Changes in data collection methods, human error, and combining datasets can cause missing values.
  •  Handling methods include removal, imputation (filling in missing values: random, mean-median based, and/or predictive imputation).
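
A minimal pandas sketch of the removal and imputation strategies just listed (the column names and values are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Zagreb", "Split", None, "Rijeka"],
})

# Removal: drop every row that contains a missing value.
dropped = df.dropna()

# Imputation: fill missing ages with the mean (or median) of the column.
mean_imputed = df.assign(age=df["age"].fillna(df["age"].mean()))
median_imputed = df.assign(age=df["age"].fillna(df["age"].median()))

print(dropped, mean_imputed, median_imputed, sep="\n\n")
```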

Missing Values - Considerations

  •  When deciding how to handle missing values, assess for patterns or reasons for missing data. Example: specific groups of people may not respond to certain survey questions.
  • Method for handling missing values should take into account the data's context and potential impact if values were removed or imputed.

Handling Outliers

  •  An outlier is an observation that is significantly different from most other data points, leading to problems when building predictive models (long training times, poor accuracy, increased error variances, diminished normality).
  •  Handling strategies include removal, imputation, and/or transformation.

Handling Outliers: Methods for Handling Outliers

  •  Strategies include removal, imputation, or transformation. The strategy for handling outliers influences model building as it impacts the accuracy of the model, the training times, and the generalizability of the model.

Transforming Data

  •  Data transformation involves modifications to improve the data's structure and characteristics, making it more suitable for specific ML approaches or for enhancing our understanding of the data and improving the efficiency of the machine learning process.
  •  Appropriate transformations can bring all features to the same order of magnitude, which improves a model's efficiency and generalizability by avoiding features that are too large or too small relative to other features.

Z-score Standardization

  •  A standardization method for numerical data, resulting in normalized values with a mean of zero and a standard deviation of 1.

Min-Max Normalization

  •  A transformation technique that maps numerical data to a specified interval (most frequently [0,1]).

Log Transformation

  •  Transformation used on variables (numerical) with non-symmetrical or wide-range distributions. It helps to reduce skewness in the data and improves model generalizability.
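
A NumPy sketch of the three transformations described above (the feature values are invented; note the deliberately skewed last value):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 500.0])  # wide-ranging, right-skewed feature

# Z-score standardization: resulting values have mean 0 and std deviation 1.
z = (x - x.mean()) / x.std()

# Min-max normalization: map the values into the interval [0, 1].
mm = (x - x.min()) / (x.max() - x.min())

# Log transformation: compress the long right tail to reduce skewness.
logged = np.log(x)

print("z-score:", z.round(2))
print("min-max:", mm.round(2))
print("log    :", logged.round(2))
```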

Feature Encoding Techniques

  •  Methods used to convert categorical data into numerical data, which is important to allow machine learning algorithms to process and perform calculations on data.
  •  Techniques include label encoding and one-hot encoding (see the sketch below).
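
A pandas sketch of both techniques (the color column is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: map each category to an integer code.
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df.join(one_hot))
```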

Types of Machine Learning Algorithms

  •  Classification: assign observations to categories (e.g., spam/not spam). Example use case: predicting whether an e-mail is spam or not. Supervised methods for classification (e.g., k-nearest neighbours, support vector machines, neural networks, decision trees, naive Bayes).
  •  Regression: predict a continuous response variable (e.g., house price, salary). Example use case: forecasting house prices or predicting employee salaries. Supervised methods for regression (e.g., linear regression, decision trees, neural networks).
  •  Unsupervised learning: discover patterns in unlabeled data; it is mostly used to discover patterns, or relationships, and/or to group data observations into homogeneous/similar groups, with no target variable, like in market basket analysis. Unsupervised methods for clustering (e.g. hierarchical methods and partitioning methods like k-means).

Supervised Learning: Classification

  •  Supervised machine learning tasks involve methods/algorithms to predict which category a given example belongs to.
  •  Methods for dealing with these problems include classification algorithms such as k-nearest neighbours, support vector machines, neural networks, decision trees, and naive Bayes.

Supervised Learning: Regression

  •  Supervised machine learning tasks involve predicting a continuous variable based upon several input variables.
  •  Methods for dealing with these problems include:
    • Linear regression
    • Regression trees
    • Neural networks

Supervised Learning - Examples

  •  Examples from various domains illustrate the application of classification and regression models (e.g., speech/text recognition, object identification, credit scoring).

Supervised Learning Algorithm (steps summary)

  •  General steps, including data separation into training and test sets, model construction using training data, and evaluating the model's effectiveness.

Unsupervised Learning

  •  Unsupervised learning tasks analyze unlabeled data to discover patterns and/or categorize data into homogeneous groups.
  •  Methods for performing unsupervised learning include clustering algorithms (hierarchical and/or partitioning methods) and association rules (discovering correlations among data items).

Unsupervised Learning: Pattern Discovery

  • The pattern discovery task allows identifying useful associations within datasets. Example: patterns about what items retailers sell together on the basis of transactional purchase (market basket) data are uncovered, which could help in making recommendations and/or developing effective product placement strategies.

Unsupervised Learning: Clustering

  • Categorizing dataset into homogeneous or similar groups; this method finds subgroups of items that are close to each other without relying on pre-classified data (e.g., customers with similar preferences in specific products). Example use cases: market research, image segmentation, and social network analysis

Association Rules

  •  Association rules explore the relationships between items or sets of items in a dataset – mostly used for market basket analysis. Examples: if a customer buys chips, they tend to buy lemon; if someone books a hotel room, they often also rent a car; if the temperature and wind take specific values, then no rain will occur.

Association Rules Measures

  •  Support measures how frequently an itemset occurs in the dataset, as the proportion of all transactions that contain it (example: in what share of all transactions does the itemset {chips, lemon} occur?).
  •  Confidence measures how strongly items are linked given that certain items have been chosen (example: of the transactions containing {chips}, what percentage also contain {lemon}?).
  •  Lift measures the strength of an association rule relative to random chance (how much stronger the relationship appears than if the items were independent); all three measures are computed in the sketch below.
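
A small sketch computing the three measures for the rule {chips} → {lemon} over invented toy transactions:

```python
# Toy market-basket data; each transaction is a set of purchased items.
transactions = [
    {"chips", "lemon", "bread"},
    {"chips", "lemon"},
    {"chips", "cola"},
    {"bread", "milk"},
    {"chips", "lemon", "cola"},
]
n = len(transactions)

def count(items):
    """Number of transactions containing all the given items."""
    return sum(items <= t for t in transactions)

support = count({"chips", "lemon"}) / n                    # itemset frequency
confidence = count({"chips", "lemon"}) / count({"chips"})  # P(lemon | chips)
lift = confidence / (count({"lemon"}) / n)                 # strength vs. chance

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 (here 1.25) means chips and lemon co-occur more often than independence alone would predict.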

Apriori Property

  •  A rule for frequent itemsets: if an itemset is frequent, then all of its subsets are also frequent (equivalently, an itemset with an infrequent subset cannot itself be frequent).

Apriori Algorithm Summary

  •  The Apriori algorithm efficiently identifies frequent itemsets based upon the property that their subsets are also frequent – allowing for algorithms that can scale to large datasets to be developed or deployed.

Estimating Future Performance

  •  Methods for estimating how well machine learning models will perform, such as holdout, k-fold cross-validation, and leave-one-out cross-validation.

Holdout Method

  •  Splitting the dataset into a training set and a separate evaluation set to assess how the model performs with unseen data; this method helps to evaluate the model and avoids overfitting or underfitting of the data.

Model Selection

  •  A data analysis approach to identifying the optimal model for given parameters.

Validation Set

  •  A technique for evaluating models by splitting the dataset into three parts (training, validation, and test) to evaluate performance on unseen data and fine-tune model parameters.

Resampling

  •  Repeatedly using different samples of the original data to train and validate a model to yield better performance estimates when evaluating the model.

k-Fold Cross-Validation

  •  A resampling approach wherein the dataset is divided into k folds to evaluate a model's generalization ability (performance on unseen data). The method helps avoid overfitting, and k-fold produces several performance measures, one per test fold, which are averaged (see the sketch below).
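
A minimal scikit-learn sketch of k-fold evaluation (the dataset and classifier are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on four folds, test on the held-out fold,
# rotate through all five folds, then average the accuracy scores.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy  :", scores.mean())
```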

Leave-one-out Cross-Validation (LOOCV)

  • A special case of k-fold cross-validation where k equals the number of training samples. It is more computationally expensive but assesses model performance more accurately, because it uses as much of the data as possible in the validation process.

Bootstrap Sampling

  •  A resampling method involving repeated sampling with replacement from the original data to create multiple training datasets, with which separate models are run, so as to estimate the performance. This method is used to assess the accuracy and stability of a machine learning model in general (how well it can predict new/unseen data).

Measuring Performance for Classification (Summary/Review)

  •  Review of terms: True Positive, False Positive, True Negative, False Negative.
  •  Precision, Recall/Coverage, and F1-score (evaluation metrics) are reviewed. In combination, these metrics offer a more accurate assessment of a model's robustness and usability.

Class Imbalance Problem

  •  The problem of having significantly more observations in one class than another; in this case accuracy alone is a poor measure, and metrics such as the kappa statistic, precision, recall, and F1-score are better for evaluating performance.

Kappa Statistic

  •  Statistic adjusting accuracy to account for chance agreement, specifically useful for datasets with imbalanced classes, evaluating model performance.

Properties of Classification Methods

  •  Essential characteristics or properties to consider for classification methods, including accuracy, speed, robustness, scalability, and interpretability, which all significantly influence model performance.

k-nearest Neighbours Algorithm (k-NN)

  •  A non-parametric supervised classification algorithm based on measuring distances between samples in the training data. It predicts the category to which an element belongs, including for new, previously unseen data.

k-nearest Neighbors Algorithm (k-NN) - Idea

  •  Description and illustrated example steps that measure the distance from an observation to other points to determine how similar it is to the rest of the dataset.
  •  Classification is based on the k nearest neighbours, taking the most frequent class or cluster among them.

k-nearest Neighbors Algorithm (k-NN) - Selecting k

  •  Discussion/explanation of how the choice of k (number of nearest neighbours) affects the performance of the model, illustrating possibilities of overfitting or underfitting.

k-nearest Neighbors Algorithm (k-NN) - Idea Continued

  •  Illustration/example demonstrating how k-NN categorizes a new, unseen data point based on the most frequent class among its k nearest neighbours.
  •  The final step identifies the k nearest neighbours and then determines, by frequency of occurrence among them, which class the test data point belongs to.

What happens if the vote is tied?

  •  Dealing with ties in a classification problem (how to break ties for classification).

Strengths and Weaknesses of k-NN Algorithm

  •  Summary of strengths (simplicity, non-parametric) and weaknesses (sensitive to outliers, computational cost).
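
A compact scikit-learn sketch of the k-NN workflow (k = 5 and the iris dataset are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point is classified by a majority vote among its 5 nearest
# training points (an odd k helps avoid ties in two-class problems).
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```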

Decision Trees

  •  A decision tree visualizes the relationships between input variables and outcomes using a tree-like structure.
  •  Discrete decisions are modeled with classification trees, while continuous outcomes are modeled with regression trees.

Decision Trees - Structure

  •  Basic structure of decision tree – root node, branches, decision nodes, and leaf nodes.

Decision Trees - Interpretation

  •  The tree's logic translates into rules for predicting outcomes.

Divide and Conquer

  •  Recursive partitioning: a divide-and-conquer strategy for building decision trees.
  • Goal: recursively divide the data into successively smaller and increasingly homogeneous subsets.

Attributes Splits (types)

  •   Categorizing splits based on attribute types: binary (two values), nominal (no order), ordinal (order), and continuous (continuous ranges).

How to Determine the Best Split

  •  Greedy approach to determine the optimal splits in decision trees. Example: Attributes are often evaluated in terms of minimizing class impurity or maximizing information gain.

Measures of Impurity

  •  Information measures: information gain, Gini index, or classification error metric is used in selecting the best attribute/variable for splitting.

Entropy

  •  A measure of the uncertainty or impurity of a dataset based on the proportion of different classes in each set (node/group) and used to measure the information gain when splitting attributes.

Information Gain

  •  A measure of the improvement in purity/homogeneity achieved when splitting on a given attribute, compared with the parent node.

Induction of a Decision Tree (using information gain)

  •  Detailed example illustrating the use of information gain to determine the optimal splits during decision tree induction.

Gini Index

  •  A measure of impurity in a dataset or partition based on the likelihood of misclassifications or the frequency of different observations in a node or partition of an overall dataset.
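
A sketch of the impurity measures above for a node with class proportions p, plus the information gain of a split (the proportions are invented):

```python
import math

def entropy(p):
    """Uncertainty of a node; 0 for a pure node, 1 for a 50/50 binary node."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Probability that two randomly drawn items have different classes."""
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """Probability of misclassifying when predicting the majority class."""
    return 1 - max(p)

node = [0.7, 0.3]  # class proportions in a node
print(entropy(node), gini(node), classification_error(node))

# Information gain: parent impurity minus the weighted average impurity
# of the children (here each child receives half of the observations).
parent, left, right = [0.5, 0.5], [0.8, 0.2], [0.2, 0.8]
gain = entropy(parent) - (0.5 * entropy(left) + 0.5 * entropy(right))
print("information gain:", round(gain, 3))
```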

Classification Error

  • The probability of misclassifying data elements; often not the best metric for unbalanced datasets, as it is less sensitive to shifts in class probabilities.

Homogeneity Measures

  •  Comparing the impurity reduction among data partitions using different metrics like information gain, Gini index, and classification error.

Errors Made by Classifier

  •  Identifying and classifying the errors a classifier makes, including training and test error, when assessing a model's performance.

Pruning

  •  A technique to prevent overfitting in decision trees by reducing the tree's size, either during the building of the tree ("prepruning") or after it is built ("postpruning").

Tree Pruning - Prepruning

  •  Early stopping strategies that mitigate overfitting in decision trees during tree construction. Example: maximum depth, minimum samples per leaf, minimum samples per split, specifying maximum features, helping to avoid overfitting of data to training datasets.

Tree Pruning - Postpruning

  • Reduction of nodes in a decision tree that has already been constructed ("postpruning"). This can involve specific metrics for deciding which nodes to remove and how to determine optimal pruning points; methods include cost-complexity pruning.

PCA (Principal Component Analysis) Example

  • Illustrations of how PCA works and is used in practical datasets (e.g., car datasets used for evaluating/comparing pricing among cars).

PCA - Example (cont.)

  • In practical datasets such as pricing of cars based upon multiple characteristics and/or variables, the approach is used to group several characteristics such as weight, volume, design, features, and/or horsepower into just one "component" called, for example, 'size', to reduce complexity in analyzing and understanding the data. This reduces dimensionality without affecting the important data content/information.

Prediction Problems

  •  Illustrative examples of classification and regression problems/tasks, demonstrating the application of machine learning based techniques to datasets.

Data Analysis Process - Steps Summary

  •  Data Pre-processing; Model Building; Model Evaluation

Evaluating Performance - Summary

  •  Important ways for evaluating the usability and performance of machine learning models, including examining accuracy, error rates, precision, recall, F1-score, kappa.

Ensemble Learning Summary

  • Overview of ensemble methods that combine the strengths of multiple learners/models. Examples include bagging, random forest, and boosting, which generate an effective combined prediction; a noted strength of some of these approaches is their parallel implementation.

Gradient Descent

  • A family of optimization algorithms for finding the minimum of a function; including its general idea, and the main parameters that control it (including learning rate).

Gradient Descent: Learning Rate

Gradient descent involves tweaking parameters iteratively to reduce losses (costs/error rates). Learning rates control how aggressively (large steps) or passively/slowly (small steps) parameters are changed during each step of this process, to find a minimum of the function being modeled.
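
A minimal sketch of gradient descent on f(x) = x², showing how the learning rate scales each step (the starting point and rate are invented):

```python
def gradient(x):
    return 2 * x  # derivative of f(x) = x**2

x = 10.0             # starting point
learning_rate = 0.1  # larger -> aggressive steps, smaller -> slow convergence

for _ in range(50):
    x -= learning_rate * gradient(x)  # step against the gradient

print("x after descent:", x)  # approaches the minimum at x = 0
```

With a learning rate above 1.0 in this example, each update would overshoot the minimum and diverge, which is exactly the trade-off described above.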

Gradient Boosting Method

  •  A boosting algorithm used to build models sequentially (using existing models or results from previous model outputs or iterations), improving the model's performance by reweighting misclassified observations and aiming at improving the quality of results obtained.

Gradient Boosting Algorithm: General

  • General overview steps illustrating how gradient descent / boosting builds models from previous outputs or errors in those outputs.

Gradient Boosting Algorithm: Binary Classification

  • Example of applying gradient boosting to binary classification problems, including use of cost functions like Log-likelihood and specific calculations used in these models.

Gradient Boosting - Example

  • Practical example illustrating the implementation steps in a binary classification setting, showing how each observation's weight is altered when it is misclassified.

Advantages of GBM

  •  High flexibility, handling of missing data, and the efficiency of gradient methods.

Extreme Gradient Boosting

  •  An efficient implementation of gradient boosting and its advantages (including parallelism, regularization).

Objective Function

  •  A function used in gradient boosting models, combining a cost function/loss function component and a regularization function for managing overfitting and bias, helping to improve the generalized ability and reliability of the final model created.

Neural Networks

  •  A machine learning technique that maps inputs to outputs through networks of neurons with weights and biases, arranged in several layers.

Neural Networks - Structure

  • Explains the layered structure of neural networks: inputs, hidden layers, and outputs.

Neural Networks - Layer Node Considerations

  •  Discussion on the appropriate selection of hidden layers and nodes in neural networks to prevent overfitting or underfitting problems, while maximizing results according to the data under analysis.

Perceptron

  •  The simplest neural network type.

Perceptron Learning

  • Algorithms for learning the specific weights that yield accurate outputs; an iterative approach repeated until convergence is reached.

Perceptron - Example

  • Practical example demonstrating how a perceptron takes data inputs, computes the weighted sum plus bias, and then produces a predicted output value.

Activation Function

  •  Functions applied to a neuron's weighted sum that introduce non-linearity into its output, allowing a network to combine several inputs and model relationships that are not simply linear.
  •  Example functions: sigmoid and hyperbolic tangent.
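
A sketch of a single neuron's forward pass with a sigmoid activation (the inputs, weights, and bias are invented):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))  # squashes any real number into (0, 1)

inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.6, -0.2]
bias = 0.1

# Weighted sum of the inputs plus the bias, passed through the activation.
z = sum(w * x for w, x in zip(weights, inputs)) + bias
print("output:", round(sigmoid(z), 3))
```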

Interpretation of the Weights

  •  Weights define the input variable's influence on the output of the neural network. Smaller weights mean less influence, while higher weights mean greater influence.

XOR Problems -- Perceptron

  • Illustrative example showing how a perceptron network is unable to solve scenarios or examples that don't have a linear solution.

Multilayer Networks

  • Neural networks that contain multiple hidden layers; these are used to perform complex tasks.

Neural Networks - Learning

  • Explanation of the goal of the learning algorithm in neural networks (calculating weights to minimize mean square error).

Neural Networks - Backpropagation Algorithm

  •  The backpropagation algorithm is a method used in training or learning neural networks to adjust weights in the network to improve output of data from the network. There are two processes that occur in each learning (updating) iteration of the network:
    • Forward propagation: A step that goes from input to output
    • Backward propagation: a step that goes from output to input; this helps the network adjust the weights to reduce errors or losses.

Neural Networks - Backpropagation Algorithm: Considerations

  •  If the learning time is too short, the model may be inaccurate (underfitting).
  •  If the learning process continues for too long, the network becomes overly sensitive to changes in the training data (overfitting) and may need regularization or pruning to improve its effectiveness and stability.

Estimating Future Performance: Methods

  •  The problem of estimating or evaluating models to assess their generalizability or usability is described, considering issues like data overfitting. Methods to address this problem include holdout, k-fold, and/or leave-one-out (LOOCV).

Holdout Method

  •  Dataset is divided into training and test sets using part of the dataset to train, and a separate part for testing (evaluating) model on unseen data, to assess the model's ability to perform effectively on unseen data points.

Model Selection

  • Selection of optimal procedures and/or models by tuning hyperparameters and comparing performance across implementations with different settings/parameters.

Validation Set

  • Technique for setting aside a portion of the dataset, apart from the training and test sets, to better gauge model performance. This improves the generalization ability of a model, particularly with small or limited datasets.

Re-sampling

  •  Methods for improving evaluation estimates by repeatedly using different samples of the original data to train and validate a model. Methods include k-fold cross-validation and/or bootstrapping.

k-Fold Cross-Validation

  •  Detailed explanation (re-sampling technique) of the procedure for implementing k-fold cross-validation (a data separation methodology).

Leave-one-out Cross-Validation

  •  A special case of k-fold cross-validation where k is equal to the entire dataset size (n) - used to maximize the amount of data used, at the cost of computational time.

Bootstrap Sampling

  •  Technique for repeatedly sampling the original data, using replacement, to generate multiple training datasets to assess model performance; it generates different training sets each time, generating an improved estimate of the model's generalizability to new, unseen, data.

Regression Techniques

  •  Regression techniques are methods to predict continuous target variables, modeling/estimating the relationship between predictors (x) and target (y) variables using available data.

Regression Analysis - Overview

  •  The structure of regression analysis, showing the relationships between numerical (continuous) predictor and target variables.

Simple Linear Regression

  • Regression method used to estimate the linear relationship between a single predictor variable and a response variable.

Multiple Linear Regression

  •  Method for estimating the relationship between several numerical predictor (explanatory) variables and a single response (target) variable, fitting a model based on all the data.

Understanding the Output

  •  Illustrative example showing how mean absolute error (MAE), mean squared error (MSE), R-squared (R²), and/or root mean squared error (RMSE) can evaluate the quality of regression models using the model output.
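
A sketch evaluating a regression fit with the metrics listed above (synthetic data; the true slope of 3 is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)  # linear signal plus noise

pred = LinearRegression().fit(X, y).predict(X)

print("MAE :", mean_absolute_error(y, pred))        # average absolute error
print("MSE :", mean_squared_error(y, pred))         # average squared error
print("RMSE:", mean_squared_error(y, pred) ** 0.5)  # back on the target's scale
print("R^2 :", r2_score(y, pred))                   # share of variance explained
```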

Quantile-Quantile (Q-Q) plots

  •  Graphical method to determine if a dataset follows a specific probability distribution or to determine if two or more datasets came from the same population; useful when determining normality or consistency among datasets for predicting purposes.
