Data Analytics - Qasem Abu Al-Haija PDF
Dr. Qasem Abu Al-Haija
Summary
This document contains lecture notes on data analytics, focusing on fundamental concepts and techniques applied to cybersecurity. It covers topics such as data warehousing, big data, number representation, and storage measurement units, and emphasizes the importance of data in decision-making and analysis.
Full Transcript
CY 451 Security Analytics
Fundamentals of Data and Data Analytics
Dr. Qasem Abu Al-Haija
10/21/2024 Dr. Qasem Abu Al-Haija - Security Analytics

Data All Around
Lots of data is being collected and warehoused:
- Web data, e-commerce
- Financial transactions, bank/credit transactions
- Online trading and purchasing
- Social networks

Data Analytics Has Become Very Important
- Google processes over 20 exabytes (EB) of data per day (2023).
- Facebook (Meta) handles 4+ petabytes (PB) of data every day.
- eBay stores more than 50 PB of user data and 100 TB of daily logs.
- Cost of 1 TB of disk: ~$20 (2024).
- Time to read a 1 TB disk: about 2.8 hours (at 100 MB/s).

Big Data Examples in Cybersecurity
These datasets help detect and prevent cyber threats in real time.
- Network Traffic Logs: analyzing vast amounts of data to detect anomalies like DDoS attacks or malware.
- SIEM Data (Security Information and Event Management): collecting and analyzing real-time security events from various sources.
- User Activity Logs: monitoring user behavior to identify insider threats.
- IoT Device Data: monitoring the security and activity of billions of connected devices.
- Threat Intelligence Feeds: aggregating global threat information for proactive defense.

Big Data (BD) and the Three V's
Big Data exceeds traditional database capacity, requiring advanced technologies for handling its key characteristics:
- Volume: data size starts at petabytes (PB) and grows to exabytes (EB) and beyond.
- Velocity: data is generated at high speeds, from KB/s to PB/s, especially in real-time apps.
- Variety: data comes in diverse formats, including structured, unstructured, and semi-structured (text, images, videos, etc.).
[Figures: Sources of BD; 3Vs diagram]

Big Data (BD) Management
Huge volumes of data are stored in nonhierarchical data storage systems known as data lakes.
Most common: Hadoop Distributed File System (HDFS) and Amazon Web Services (AWS) S3.
The HDFS platform uses clusters of commodity servers to store big data. The AWS platform is a cloud architecture that is available for storing big data.

Review topics
- Number Representation Systems: Decimal, Binary, Octal, ...
- Data Storage Measurement Units: Bit, Byte, KB, MB, ...
- CPU Speed Measurement Units: Hz, KHz, MHz, GHz, ...

Review: Number Representation Systems
Binary numbers: digits = {0, 1}
$(11010.11)_2 = 1\times2^4 + 1\times2^3 + 0\times2^2 + 1\times2^1 + 0\times2^0 + 1\times2^{-1} + 1\times2^{-2} = (26.75)_{10}$
Octal numbers: digits = {0, 1, 2, 3, 4, 5, 6, 7}
$(127.4)_8 = 1\times8^2 + 2\times8^1 + 7\times8^0 + 4\times8^{-1} = (87.5)_{10}$
Hexadecimal numbers: digits = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F}
$(B65F)_{16} = 11\times16^3 + 6\times16^2 + 5\times16^1 + 15\times16^0 = (46{,}687)_{10}$

Decimal  Binary  Octal  Hexadecimal
0        0       0      0
1        1       1      1
2        10      2      2
3        11      3      3
4        100     4      4
5        101     5      5
6        110     6      6
7        111     7      7
8        1000    10     8
9        1001    11     9
10       1010    12     A
11       1011    13     B
12       1100    14     C
13       1101    15     D
14       1110    16     E
15       1111    17     F
16       10000   20     10

Review: Data Storage Measurement Units
Unit      Symbol  Size
Byte      B       8 bits
Kilobyte  KB      1024 bytes
Megabyte  MB      1024 kilobytes
Gigabyte  GB      1024 megabytes
Terabyte  TB      1024 gigabytes
Going from small to large, every step multiplies the size by 1024 = $2^{10}$.
Remember: bit = b = 0 or 1.

Review: CPU Speed Measurement Units
1 Hz   1 cycle per second (Hz means hertz)
1 KHz  1,000 Hz (cycles per second)
1 MHz  1 million cycles per second, or 1,000 KHz
1 GHz  1 billion cycles per second, or 1,000 MHz

Preliminary Concepts: Data, Information, Knowledge, Wisdom
Data are the pure and simple facts without any particular structure or organization, the basic atoms of information.
Information is structured data, which adds meaning to the data and gives it context and significance.
Knowledge is the ability to use information strategically to achieve one's objectives.
Wisdom is the capacity to choose objectives consistent with one's values and visions.

Types of Data We Have
- Structured data: data that is stored, processed, and manipulated in a traditional relational database management system (RDBMS).
- Unstructured data: data that is commonly generated from human activities and doesn't fit into a structured database format.
- Semi-structured data: data that doesn't fit into a structured database system but is nonetheless structured by tags that are useful for creating a form of order and hierarchy in the data.

Universal Data File Formats
- Comma-separated values (CSV) files: accessible by any computer and by common scripting languages like Python and R.
- Scripts (Python or R programming language): these script files end with the extension .py or .ipynb (Python) or .r (R).
- Web programming files (Data-Driven Documents, D3.js): a JavaScript library for data visualization in web-based documents: .html, .svg, .css.
- Application files (Excel and geospatial analysis applications): Excel files (.xls or .xlsx) and geospatial files (.mxd or .qgs, for the ArcGIS or QGIS apps).

Universal Data File Formats - Example
[Figure: CSV vs. XLS]

What To Do With These Data?
- Aggregation and statistics: data warehousing and OLAP (online analytical processing), ...
- Indexing, searching, and querying: keyword-based search, pattern matching (RDF/XML), ...
- Knowledge discovery: data mining, statistical modeling, prediction, classification, ...
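As a small illustration of two points above (CSV files being readable from common scripting languages, and aggregation as a first thing to do with data), here is a minimal sketch using only Python's standard library. The column names and log rows are hypothetical, made up for this example; real data would normally be read from a .csv file rather than an inline string.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV content (illustrative only, not from the lecture).
RAW_CSV = """timestamp,source_ip,event
2024-10-21T10:00:00,10.0.0.5,login_failed
2024-10-21T10:00:05,10.0.0.5,login_failed
2024-10-21T10:01:00,10.0.0.7,login_ok
2024-10-21T10:02:30,10.0.0.5,login_failed
"""


def count_events(csv_text):
    """Parse CSV text and count how often each event type occurs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["event"] for row in reader)


event_counts = count_events(RAW_CSV)
for event, count in event_counts.most_common():
    print(f"{event}: {count}")
```

The same `csv.DictReader` call works on a file object opened with `open(path, newline="")`, so the aggregation step does not change when the data comes from disk.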
Therefore, data is a critical subject to study, since it helps business leaders make decisions based on facts, statistical numbers, and trends. Due to this growing scope of data, Data Science has emerged as a multidisciplinary field.

What is Data Analytics (DA)?
- An area that manages, manipulates, extracts, and interprets knowledge from a tremendous amount of data.
- Examples of techniques utilized in DA: machine learning, visualization, pattern recognition, probability models, data engineering, signal processing, etc.
- A multidisciplinary field that uses scientific methods to draw insights from data.
- It covers the extraction, preparation, analysis, visualization, and maintenance of information.

Data Analytics in Cybersecurity: Examples & Applications
- Intrusion Detection: analyzes network traffic to detect anomalies/attacks (e.g., DDoS).
- Threat Intelligence: aggregates data to predict and prevent cyber threats.
- Anomaly Detection: monitors behavior for deviations, signaling possible breaches.
- Ransomware Detection: identifies suspicious encryption or data transfers.
- User Behavior Analytics: detects insider threats and compromised credentials.
- Vulnerability Management: prioritizes patching of security weaknesses.
- Phishing Detection: analyzes emails for phishing indicators using machine learning.
- Malware Analysis: assesses malware behavior for faster detection.
- Fraud Detection: identifies fraudulent transactions through behavior analysis.
- Log Analysis: analyzes logs to detect abnormal activities.
- Incident Response: assesses and mitigates security incidents using data analysis.
- Predictive Risk Management: forecasts future attacks based on past data.
- Data Leak Detection: monitors abnormal data transfers to prevent breaches.
More about Data Analytics (DA)
DA is concerned with converting raw data into actionable insights.

Example of DA: Data Wrangling (DW)
DW refers to modifying and summarizing data. Examples of DW processes:
- Data extraction
- Data cleaning
- Data transformation
- Data aggregation
- Data organization
- Data sorting
- Data validation

Types of Data in DA Applications
[Figure: types of data in DA applications]

Categorical Data: Nominal and Ordinal
The main difference between nominal and ordinal data is that ordinal data has an order of categories while nominal data doesn't.
Categorical data needs encoding techniques to be converted to numeric data, such as:
- One-hot encoding (for nominal data)
- Integer encoding (for ordinal data)

Categorical Probability Distributions: Nominal Data
A nominal scale describes a variable with categories that do not have a natural order or ranking.
Examples of nominal variables include: genotype, blood type, zip code, gender, race, eye color, political party.
When you are dealing with nominal data, you summarize it through frequencies or percentages.
To visualize nominal data, you can use a pie chart or a bar chart.

Categorical Probability Distributions: Ordinal Data
An ordinal scale is one where the order matters but not the difference between values.
Examples of ordinal variables include: education level ("high school", "BS", "MS", "PhD") and satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like").
When you are dealing with ordinal data, you can summarize your data using percentiles, median, interquartile range, mean, standard deviation, and range.
To visualize ordinal data, you can use a histogram or a box plot.
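The two encodings named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: the helper names `integer_encode` and `one_hot_encode`, the severity labels, and the protocol values are all assumptions chosen for the example.

```python
# Ordinal categories: the order is meaningful, so each category maps to a rank.
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]  # hypothetical labels


def integer_encode(values, ordered_categories):
    """Map each ordinal value to its 1-based position in the given order."""
    rank = {category: i + 1 for i, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]


def one_hot_encode(values):
    """Map each nominal value to a 0/1 vector with exactly one 1.

    Columns follow the alphabetically sorted list of distinct categories;
    the column order carries no meaning.
    """
    categories = sorted(set(values))
    vectors = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, vectors


severities = ["High", "Low", "Critical", "Medium"]
encoded_severities = integer_encode(severities, SEVERITY_ORDER)

protocols = ["TCP", "UDP", "ICMP", "TCP"]  # nominal: no natural order
columns, encoded_protocols = one_hot_encode(protocols)

print(encoded_severities)   # ranks follow SEVERITY_ORDER
print(columns)              # column names for the one-hot vectors
print(encoded_protocols)
```

Integer encoding keeps the ordering information in a single column, while one-hot encoding avoids implying an order that nominal data does not have, at the cost of one column per category.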
Categorical Probability Distributions
Example of integer encoding for ordinal data:
Low Security  Medium Security  High Security  Critical
1             2                3              4
Example of one-hot encoding for nominal data:
[Figure: one-hot encoding example]

Math, Probability, and Statistical Modeling

Importance of Statistics to Data Analytics
- Data analysts have substantive knowledge in one field or several fields.
- Math, probability, and statistical models are integral to data analytics fields.
- Data analysts use statistics, math, coding, and strong communication skills to help them discover, understand, and communicate data insights within raw datasets related to their expertise.

Overview of Statistics
- Mathematical techniques are used to organize and manipulate data to answer questions, test theories, and make decisions.
- Statistical techniques are used to report on populations and samples.
- Statistical models can be either descriptive statistics or inferential statistics.

Descriptive Statistics
Provide a description of some characteristics of the entire numerical dataset (population), including:
- Dataset distribution (such as normal, binomial, categorical)
- Central tendency (such as mean, median, or mode), together with the extremes (min and max)
- Dispersion (as in standard deviation and variance)
Descriptive statistics can be used in many ways, such as:
- To detect outliers.
- To plan for feature preprocessing requirements.
- To select features: identify what features you may want, or not want, to use in an analysis.
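The measures listed above can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the weekly incident counts are hypothetical values invented for this example and are not taken from the lecture slides. It treats the four weeks as the whole population, so it uses the population variance and standard deviation.

```python
import statistics

# Hypothetical number of incidents reported in each of four weeks
# (made-up values for illustration).
weekly_incidents = [12, 15, 9, 14]

summary = {
    "mean": statistics.mean(weekly_incidents),
    "median": statistics.median(weekly_incidents),
    "min": min(weekly_incidents),
    "max": max(weekly_incidents),
    # Population variance / standard deviation: the data is the full month.
    "variance": statistics.pvariance(weekly_incidents),
    "std_dev": statistics.pstdev(weekly_incidents),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

If the weeks were instead a sample from a longer period, `statistics.variance` and `statistics.stdev` (which divide by n − 1) would be the more appropriate choice.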
Descriptive Statistics: Example
Suppose a cybersecurity team records the number of incidents reported each week over a month:
[Table: weekly incident counts and their summary statistics]
These descriptive statistics provide a quick overview of the cybersecurity incidents over the month, helping to understand the data distribution and variability.

Inferential Statistics
- Involves using information from samples (carefully chosen subsets of the defined populations) to make inferences about populations.
- Use this type of statistics to get information about a real-world measure in which you're interested.
Inferential statistics have two main uses:
- Making estimates about populations. Example: out of 100 pairs of jeans, 8 have flaws. Estimate how many of 25,000 pairs of jeans have flaws: $0.08 \times 25{,}000$, so about 2,000 pairs are flawed.
- Testing hypotheses to draw conclusions about populations. Example: test the claim that the population mean weight is 120 pounds.
Both uses draw conclusions about a large group of individuals based on a smaller group.

Inferential Statistics: Scenario
A cybersecurity firm wants to estimate the average number of security incidents reported by organizations in a region using a sample of 30 organizations.
Steps of inferential analysis:
1. Sample mean calculation: assume the total number of incidents reported by the sample is 270. Then the average number of incidents in the sample is $\bar{x} = 270/30 = 9$.
2. Population estimation: the firm infers that the average number of incidents for all organizations is approximately 9.
3. Confidence interval: assuming a sample standard deviation $s = 3$ and a 95% confidence level (critical value 1.96), the margin of error is $1.96 \times s/\sqrt{n} = 1.96 \times 3/\sqrt{30} \approx 1.07$. The confidence interval is therefore $(9 - 1.07,\ 9 + 1.07) = (7.93,\ 10.07)$, so the true average is likely between 7.93 and 10.07.
4. Hypothesis testing:
   Null hypothesis (H0): μ ≤ 8 (average incidents ≤ 8).
   Alternative hypothesis (Ha): μ > 8 (average incidents > 8).
   A t-test can determine if there is enough evidence to reject H0.

Descriptive Statistics vs Inferential Statistics
[Figure: descriptive statistics vs. inferential statistics]

Probability Distribution
- A probability distribution is a list of all of the possible outcomes of a random variable along with their corresponding probability values.
- Probability distributions indicate the likelihood of an event/outcome (random variable).
- p(x) = the likelihood that the random variable takes a specific value x.
- The probability of any single event never goes below 0.0 or exceeds 1.0.
- The probabilities of all events always sum to exactly 1.0.

The example below is a discrete univariate probability distribution with finite support.
- Discrete: if we consider 1 and 2 as outcomes of rolling a six-sided die, then we can't have an outcome in between them (e.g., we can't have an outcome of 1.5).
- Univariate: we only have one (random) variable.
- Finite: there is a limited number of outcomes.
- Support: the outcomes for which the probability distribution is defined. So the support in our example is 1, 2, 3, 4, 5, and 6.
[Figure: the probability distribution for a fair six-sided die; outcomes 1 to 6 on the x-axis, each with probability 1/6 ≈ 0.167]

Random Variables
A random variable x represents a numerical value associated with each outcome of a probability experiment.
A random variable is discrete if it has a finite or countable number of possible outcomes that can be listed.
A random variable is continuous if it has an uncountable number of possible outcomes, represented by the intervals on a number line.
[Figures: number lines from 0 to 10 illustrating a discrete and a continuous random variable]

Random Variables
Example: Decide if the random variable x is discrete or continuous.
a.)
The distance your car travels on a tank of gas.
The distance your car travels is a continuous random variable because it is a measurement that cannot be counted.
b.) The number of students in a data science class.
The number of students is a discrete random variable because it can be counted.

Probability Distribution Functions
According to the type of random variable, probability distributions are classified into two types:
1. Probability distribution for discrete random variables, described with a probability mass function (PMF).
2. Probability distribution for continuous random variables, described with a probability density function (PDF).
- Discrete: a random variable whose values can be counted by groupings (e.g., the number of cars with black color; there is no such thing as 0.7 of a black car).
- Continuous: a random variable that assigns probabilities to a range of values (e.g., the amount of rain that might fall tomorrow).

Common Types of Probability Distribution Functions
1. Normal probability distribution (continuous).
2. Binomial probability distribution (discrete).
3. Categorical probability distribution (non-numeric).

Normal Probability Distributions

Properties of Normal Distributions
A continuous random variable has an infinite number of possible values that can be represented by an interval on the number line.
[Figure: number line of hours spent studying in a day, from 0 to 24]
The time spent studying can be any number between 0 and 24.
The probability distribution of a continuous random variable is called a continuous probability distribution.
The most important probability distribution in statistics is the normal distribution.
A normal distribution is a continuous probability distribution for a random variable, x. The graph of a normal distribution is called the normal curve.
[Figure: normal curve]
Parameters μ and σ
Normal PDFs have two parameters:
- μ: expected value (mean, "mu"); μ controls location.
- σ: standard deviation ("sigma"); σ controls spread.
[Figure: mean and standard deviation of a normal distribution, X ~ N(μ, σ)]

Properties of Normal Distributions
1. The mean, median, and mode are equal.
2. The normal curve is bell-shaped and symmetric about the mean.
3. The total area under the curve (AUC) is equal to one.
4. The normal curve approaches, but never touches, the x-axis as it extends farther and farther away from the mean.
5. Between μ − σ and μ + σ (in the center of the curve), the graph curves downward. The graph curves upward to the left of μ − σ and to the right of μ + σ. The points at which the curve changes from curving upward to curving downward are called the inflection points.
[Figure: normal curve with total area = 1, inflection points at μ − σ and μ + σ, and the x-axis marked at μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ, μ + 3σ]
If x is a continuous random variable having a normal distribution with mean μ and standard deviation σ, you can graph a normal curve with the equation
$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$, where $e \approx 2.718$ and $\pi \approx 3.14$.

Means and Standard Deviations
A normal distribution can have any mean and any positive standard deviation.
The mean gives the location of the line of symmetry. The standard deviation describes the spread of the data.
[Figure: two normal curves with their inflection points marked; left curve: mean μ = 3.5, standard deviation σ ≈ 1.3; right curve: mean μ = 6, standard deviation σ ≈ 1.9]

Means and Standard Deviations: Example
1. Which curve has the greater mean?
2. Which curve has the greater standard deviation?
[Figure: curves A and B on an x-axis from 1 to 13]
The line of symmetry of curve A occurs at x = 5. The line of symmetry of curve B occurs at x = 9. Curve B has the greater mean.
Curve B is more spread out than curve A, so curve B has the greater standard deviation.

Interpreting Graphs
Example: The heights of fully grown magnolia trees are normally distributed. The curve represents the distribution.
What is the mean height of a fully grown magnolia tree? Estimate the standard deviation.
[Figure: normal curve of height (in feet), x-axis from 6 to 10, with μ = 8 and σ ≈ 0.7]
The inflection points are one standard deviation away from the mean.
The heights of the magnolia trees are normally distributed with a mean height of about 8 feet and a standard deviation of about 0.7 feet.

68-95-99.7 Rule for Normal Distributions
- 68% of the AUC lies within ±1σ of μ.
- 95% of the AUC lies within ±2σ of μ.
- 99.7% of the AUC lies within ±3σ of μ.
Note: AUC means area under the curve.

Example of the 68-95-99.7 Rule
UoP student intelligence scores are normally distributed with μ = 100 and σ = 15; X ~ N(100, 15).
- 68% of scores are within μ ± σ = 100 ± 15 = 85 to 115.
- 95% of scores are within μ ± 2σ = 100 ± (2)(15) = 70 to 130.
- 99.7% of scores are within μ ± 3σ = 100 ± (3)(15) = 55 to 145.

Symmetry in the Tails
Because the normal curve is symmetrical and the total AUC is exactly 1, we can easily determine the AUC in the tails.
Example: male height, X ~ N(70.0˝, 2.8˝)
- Male height is normal with μ = 70.0˝ and σ = 2.8˝.
- 68% of heights are within μ ± σ = 70.0 ± 2.8 = 67.2˝ to 72.8˝.
- 32% are in the tails (below 67.2˝ and above 72.8˝).
- 16% are below 67.2˝ and 16% are above 72.8˝ (by symmetry).

Binomial Probability Distributions
A binomial experiment is a probability experiment with the following characteristics:
- The experiment has n identical trials.
- Two outcomes are possible on each trial: one is termed a success and the other is termed a failure.
- The probability of a success occurring on each trial is p. This probability is the same on each trial. Since the outcome must be either a success or a failure, a failure is the complement of a success, and the probability of a failure is 1 − p.
- The trials are independent of each other.
Given the above conditions, the binomial probability distribution provides the probability of x successes in n trials, where $0 \le x \le n$.
Note that there are only two parameters that determine binomial probabilities: n and p.
Successive trials must be independent of each other. That is, the outcome of any one trial must not affect the probability of success or failure for any other trial.

Example: number of females selected in a random sample of size 3 from a large population of half males and half females. Here x is the number of females selected and f(x) is the probability of x females being selected.
x     0      1      2      3
f(x)  0.125  0.375  0.375  0.125
The above distribution is a binomial probability distribution with success defined as selecting a female. There are n = 3 independent trials, the probability of success is p = 0.5, and x is the number of successes. In this experiment, selecting a male is termed a failure, and the probability of selecting a male is 1 − p = 1 − 0.5 = 0.5.

Formula for Binomial Probability
If n is the number of trials of the binomial experiment and p is the probability of success, then the probability of x successes in n trials of the experiment is given by the probability function f(x), defined as follows:
$f(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{(n-x)}$
where $n! = n(n-1)(n-2)\cdots(2)(1)$ and $0! = 1$.

Using the Binomial Formula
$f(2) = \frac{3!}{2!\,(3-2)!}\,(0.5)^{2}(1-0.5)^{(3-2)} = \frac{3\cdot2\cdot1}{(2\cdot1)(1)}\,(0.25)(0.5) = 3(0.125) = 0.375$
$f(0) = \frac{3!}{0!\,(3-0)!}\,(0.5)^{0}(1-0.5)^{(3-0)} = \frac{3\cdot2\cdot1}{(1)(3\cdot2\cdot1)}\,(1)(0.125) = 1(0.125) = 0.125$

Example
A bag contains 10 chips. 3 of the chips are red, 5 of the chips are white, and 2 of the chips are blue. Four chips are selected, with replacement. Create a probability distribution for the number of red chips selected.
p = the probability of selecting a red chip = 3/10 = 0.3
q = 1 − p = 0.7
n = 4
x = 0, 1, 2, 3, 4
The binomial probability formula is used to find each probability.
x  P(x)
0  0.240
1  0.412
2  0.265
3  0.076
4  0.008
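The red-chip table above can be reproduced directly from the binomial formula. The sketch below is a minimal illustration using `math.comb` from the standard library (Python 3.8+); the function name `binomial_pmf` is an assumption chosen for this example.

```python
import math


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p: C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


# Red-chip example: n = 4 draws with replacement, p = 3/10 per draw.
n, p = 4, 0.3
distribution = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(distribution):
    print(f"P({x}) = {prob:.3f}")
```

Because the five outcomes x = 0, ..., 4 cover every possibility, the probabilities sum to 1 (up to floating-point rounding), which is a quick sanity check on any hand-computed table.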
Graphing Binomial Probabilities
Example: The following probability distribution represents the probability of selecting 0, 1, 2, 3, or 4 red chips when 4 chips are selected. Graph the distribution using a histogram.
x  P(x)
0  0.240
1  0.412
2  0.265
3  0.076
4  0.008
[Figure: histogram "Selecting Red Chips"; x-axis: number of red chips (0 to 4); y-axis: probability (0 to 0.5)]

Mean, Variance and Standard Deviation
Population parameters of a binomial distribution:
Mean: $\mu = np$
Variance: $\sigma^2 = npq$
Standard deviation: $\sigma = \sqrt{npq}$
Example: One out of 5 students at a local college say that they skip breakfast in the morning. Find the mean, variance, and standard deviation if 10 students are randomly selected.
$n = 10$, $p = \frac{1}{5} = 0.2$, $q = 0.8$
$\mu = np = 10(0.2) = 2$
$\sigma^2 = npq = (10)(0.2)(0.8) = 1.6$
$\sigma = \sqrt{npq} = \sqrt{1.6} \approx 1.3$

Quantifying Correlation
- Many statistical and machine learning methods assume that your features are independent.
- To test whether they're independent, though, you need to evaluate their correlation.
- Correlation: the extent to which variables demonstrate interdependency.
- Two common measures are Pearson correlation and Spearman's rank correlation.
- Correlation is quantified by the value of a variable called r, which ranges between −1 and 1.
- The closer the r-value is to 1 or −1, the more correlation there is between two variables.
- If two variables have an r-value that's close to 0, it could indicate that they're independent (uncorrelated) variables.

Pearson's Correlation Coefficient (PCC)
The simplest form of correlation analysis between continuous variables in a dataset.
Pearson correlation assumes that:
[Figure: assumptions of Pearson correlation]

Spearman's Rank Correlation (SRC)
A popular test for determining correlation between ordinal variables.
SRC describes the rank correlation ($r_s$) of numeric variable pairs (± correlation). Its magnitude is commonly interpreted as follows:
"very weak"  "weak"   "moderate"  "strong"  "very strong"
.00-.19      .20-.39  .40-.59     .60-.79   .80-1.0
The Spearman's rank correlation assumes that:
[Figure: assumptions of Spearman's rank correlation]
Unlike PCC, which assumes a linear relationship, SRC assesses monotonic relationships, meaning that the variables tend to move in a consistent direction but not necessarily at a constant rate.

Spearman's Rank Correlation (SRC): Example
A cybersecurity analyst is assessing the effectiveness of an IDS based on response time and accuracy ranks in detecting attacks across 5 different test environments.

Step 1: Calculate the difference in ranks ($d_i$) for each environment, where Rank A is the response time rank and Rank B is the accuracy rank:
Environment  Response Time Rank  Accuracy Rank  d_i = Rank A − Rank B  d_i^2
Env 1        1                   2              −1                     1
Env 2        2                   3              −1                     1
Env 3        3                   4              −1                     1
Env 4        4                   1              3                      9
Env 5        5                   5              0                      0

Step 2: Use the formula for Spearman's rank correlation:
$r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} = 1 - \frac{6(12)}{5(5^2-1)} = 1 - \frac{72}{120} = 0.4$

Step 3: Provide insights and conclusions.
- $r_s = 0.4$ shows a moderate positive correlation between the response time ranks (Rank A) and the accuracy ranks (Rank B).
- This suggests that, while there is some correlation between these two measures of effectiveness, they do not strongly influence each other in all environments.
- This cybersecurity context provides insights into how IDS systems might trade off between response time and detection accuracy across different settings.

Dimensionality Reduction with Linear Algebra
- Data analysts should have a good understanding of linear algebra using matrices.
- Array and matrix objects are the primary data structures in analytical computing.
- They are used to perform mathematical operations on large multidimensional datasets, that is, datasets with many different features to be tracked simultaneously.
- We need dimensionality reduction algorithms to reduce a dataset's dimensionality.
- Such algorithms compress datasets and remove redundant information and noise.
- Such algorithms enhance performance by reducing the number of features of a dataset.
Three common dimensionality reduction algorithms:
- Singular Value Decomposition (SVD)
- Exploratory Factor Analysis (EFA)
- Principal Component Analysis (PCA)

Singular Value Decomposition (SVD)
- A fundamental linear algebra technique used in many applications such as dimensionality reduction, matrix factorization, and data compression.
- Reduces computational complexity by breaking down large matrices into smaller, more manageable ones.
- Simplifies large datasets by identifying key patterns and reducing redundancy.
- Key applications: dimensionality reduction, data analytics, image compression, noise reduction, etc.
- SVD decomposes data into three matrices: U, S, and V. The product of these matrices, $A = U S V^{T}$, gives you back your original matrix.
- SVD is handy when you want to remove redundant information by compressing your dataset.
- Finally, the reduced matrices can be used as input for machine learning algorithms.

SVD Dimensionality Reduction: An Example
- SVD can be used to help develop anomaly detectors (for unusual patterns in network traffic).
- If unusual behavior (e.g., a sharp spike in traffic) is detected in the reduced data, it could indicate an anomaly, such as a DDoS attack, making this an early alert system for network admins.
Example: Consider a 5x3 matrix (A) representing traffic data for 5 IP addresses across 3 features (packet count, data size, access time).
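A decomposition of this kind can be sketched with NumPy's `numpy.linalg.svd`. The matrix values from the slide are not available in this transcript, so the 5x3 matrix below is entirely made up for illustration; its singular values will differ from any figures quoted in the lecture. The choice of keeping k = 2 components is likewise an assumption.

```python
import numpy as np

# Hypothetical 5x3 traffic matrix: rows = IP addresses,
# columns = (packet count, data size, access time). Values are invented.
A = np.array([
    [10.0, 5.0, 1.0],
    [12.0, 6.0, 1.5],
    [9.0, 4.5, 0.8],
    [11.0, 5.5, 1.2],
    [50.0, 2.0, 9.0],  # an unusual row, e.g. a traffic spike
])

# Thin SVD: U is 5x3, s holds the 3 singular values (descending), Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt

# Keep only the top-k singular values for a lower-dimensional view.
k = 2
A_rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
reduced = U[:, :k] * s[:k]                         # 5x2 coordinates per IP

print("singular values:", np.round(s, 2))
print("reconstruction error (rank k):", np.linalg.norm(A - A_rank_k))
print("reduced data shape:", reduced.shape)
```

The rows of `reduced` are the kind of compressed representation that could be passed to a downstream detector; a row that sits far from the others in this reduced space is a candidate anomaly.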
[Figure: decomposition of A into left singular vectors (U), singular values (Σ), and right singular vectors (V^T)]
SVD analysis of the results:
- The singular values in Σ (13.25, 7.12, 1.07) tell us the importance of each component (combination of features) in the data.
- The largest singular value (13.25) corresponds to the most significant component in the data (e.g., data size).
- Keeping only the top singular values (e.g., 13.25 and 7.12) can reduce dimensionality while retaining the most important information, making anomaly detection or traffic pattern analysis more efficient.

Exploratory Factor Analysis (EFA)
- EFA can identify latent variables that explain the correlations among observed variables.
- It helps simplify datasets, uncover underlying structures, and reveal redundancy in the dataset.
- It seeks the underlying latent variables reflected in the observed (manifest) variables.
- Latent variables are meaningful inferred variables that underlie a dataset but are not directly observable.
- Finally, the reduced factors can be used as input for machine learning algorithms.
- Key applications: EFA is a valuable preprocessing tool in machine learning, aiding in dimensionality reduction, feature selection, data cleaning, and more.
- EFA enhances model performance and interpretability.
- Deriving latent variables from the observed variables: several statistical techniques can capture these variables' relationships, such as weighted sum, weighted average, and normalization.

Example of EFA in Security Analytics
You are analyzing a dataset of security incidents with variables such as:
- Incident Severity (high, medium, low)
- Type of Attack (phishing, malware, DDoS)
- Time to Detection (hours)
- Response Time (hours)
- Number of Affected Systems
- User Awareness Level (low, medium, high)
- Security Training Completed (yes/no)
Objective: Identify underlying factors affecting incident response and security posture.
Results from EFA: identified latent factors
- Factor 1: Incident Response Effectiveness (Time to Detection, Response Time)
- Factor 2: User Awareness and Training (User Awareness Level, Security Training)
- Factor 3: Attack Vector Influence (Type of Attack, Number of Affected Systems)

Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that:
- Cleans the dataset by removing redundancy and noise (reducing dimensionality while retaining key information).
- Finds relationships between features and transforms them into principal components (PCs).
- Creates non-redundant, uncorrelated features (PCs) while capturing the dataset's variance.
Reduced components (PCs) can serve as input for ML algorithms, enabling predictions from compressed data.
Key applications: dimensionality reduction, data visualization, noise reduction, image compression, genomics, finance, audio processing, security analytics, ...
PCA enhances model performance and interpretability.
Deriving PCs from the observed variables requires several steps: standardization, covariance matrix computation, eigenvalue/eigenvector computation, sorting of the eigenvalues, selection of the top eigenvectors, and then projection of the data to create the PCs.

Example of PCA in Security Analytics
Example: Analyzing network traffic data to detect potential intrusions.
Sample data (for 5 sessions):
Packet Size  Connection Duration  Source IP    Destination IP  Protocol  Number of Packets  Time Between Packets
1500         60                   192.168.1.1  192.168.1.10    TCP       30                 20
1200         120                  192.168.1.2  192.168.1.11    TCP       25                 10
2000         15                   192.168.1.3  192.168.1.12    UDP       50                 5
800          30                   192.168.1.4  192.168.1.13    TCP       20                 15
1600         90                   192.168.1.5  192.168.1.14    UDP       10                 25
Applying the PCA process: standardization, covariance matrix, eigenvalues and eigenvectors, principal components, and transformation (projection).
Results from PCA might look like this for two principal components:
PC1   PC2
2.5   -0.8
1.2   1.5
-1.8  0.4
0.5   1.2
1.0   -2.3
- PC1 might capture packet size and connection duration variations, indicating typical behavior.
- PC2 may highlight anomalous sessions with unusual packet sizes or connection times.
These PCs can be applied with ML techniques (e.g., clustering) to classify sessions. For example, sessions outside normal clusters might indicate potential threats, such as DDoS attacks or unauthorized access attempts.

Introducing Regression Methods
Regression techniques are mainly used to:
- Determine the strength of correlation between variables in your data.
- Predict future values from historical values.
Common regression techniques:
- Linear regression
- Logistic regression

Linear Regression
- A machine learning method to describe and quantify the relationship between your target variable, y, and the dataset features.
- Can be either simple linear regression or multiple linear regression.

Linear Regression: Limitations
Don't forget dataset size: you should have at least 20 observations per predictive feature if you expect to generate reliable results using linear regression.

Linear Regression: Example
A cybersecurity analyst wants to analyze the relationship between the number of failed login attempts (independent variable) and the number of successful breaches (dependent variable). To do so:
- He collected data on failed login attempts (X) and successful breaches (Y) over 6 months, as in the table.
- He defined a simple linear regression model as $Y = a + bX$ (where a is the y-intercept and b is the slope of the line).
- He used statistical software or libraries (like Python's scikit-learn) to calculate the coefficients a and b.
- As a result, the regression equation becomes approximately $Y = -0.5 + 0.3X$.
- After that, he predicted the number of successful breaches based on failed login attempts.
Month  (X)  (Y)
Jan    5    1
Feb    8    2
Mar    12   3
Apr    15   4
May 18 5 For example, if there are 10 failed login attempts: 𝑌 = −0.5 + 0.3(10) = 2.5 Jun 20 6 This means that with 10 failed login attempts, they can expect approximately 2.5 successful breaches. Additional Insights: The slope b=0.3 indicates that for every additional failed login attempt, the number of successful breaches increases by 0.3, suggesting a positive correlation between failed logins and security breaches. Logistic regression A machine learning method to estimate values for a categorical target variable based on your selected features. Your target variable should be numeric, and describe the target’s class (category). It predicts the class of observations and indicates the probability for each of its estimates. Logistic regression requirements: No need for linear relationship between the features and target variable, Residuals don’t have to be normally distributed, Predictive features are not required to have a normal distribution. 10/21/2024 Dr. Qasem Abu Al-Haija - Security Analytics 91 Logistic Regression - Limitations Logistic regression requires a greater number of observations than linear regression to produce a reliable result. The rule of thumb is that you should have at least 50 observations per predictive feature if you expect to generate reliable results. 10/21/2024 Dr. Qasem Abu Al-Haija - Security Analytics 92 Logistic regression-Example Packet Size Connection Duration Protocol 1500 60 0 Label 0 2000 120 0 1 A cybersecurity analyst aims to use log regression to 800 30 1 0 predict whether a network connection is malicious 2200 15 0 1 (1) or benign (0) using features the ones shown in the 900 90 1 0 following sample dataset: 3000 45 0 1 1. Define the Model: 2. Train the Model: Fit the logistic regression model to the data to learn the coefficients. 3. Model Output: Assume the coefficients learned are: 1 =−4.5, 1=0.002, 2=0.1, 3=1.2 4. 
Perform predictions: for instance, assume a new connection with: Packet Size =1800, Connection Duration = 50, Protocol = 0, then: ― ― Calculate perdition Probability: ― The model predicts a 98.3% chance that this connection is malicious classified as malicious Outliers in Data Sets Outliers in Data Sets Outlier is an observation point that is distant from other observations. It violates the mechanism that generates the normal data. Application of outlier detection. Credit card fraud detection Telecom fraud detection Customer segmentation Detecting measurement errors Types of outliers : Point or global Outliers Contextual Outliers Collective Outliers 10/21/2024 Dr. Qasem Abu Al-Haija - Security Analytics 95 Types of Outliers: 1) Global Outliers (Point Outliers) Observations anomalous with respect to the majority of observations in a feature. In-short: A data point is considered a global outlier if its value is far outside the entirety of the data set in which it is found. Example 1: In a corporate network, most employees' typical daily data transfer ranges from 100 MB to 5 GB. However, if one user's account shows an abnormal spike, transferring 500 GB of data daily, this data point would be classified as a global outlier. Example 2: In a class, all student ages will be approx. similar, but if you see a record of a student aged 500. It’s an outlier. It could be generated due to various reasons. 10/21/2024 Dr. Qasem Abu Al-Haija - Security Analytics 96 Types of Outliers: 2) Contextual Outliers (conditional outliers) Contextual outliers are anomalous data points only within a specific context. Example 1: A user account typically logs in from the corporate office in New York during regular business hours (9 AM to 5 PM). One day, the account logs in from another country (e.g., China) at 2 AM. Example 2: 80o F in Urbana: outlier? (depending on summer or winter?) 10/21/2024 Dr. 
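The contextual-outlier idea in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LoginProfile` class, its fields, and the country codes are hypothetical names introduced here, not part of the lecture. The point it shows is that the same event (a 2 AM login from China) is anomalous for one account and normal for another, depending on context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginProfile:
    """Hypothetical per-account context: usual country and usual login hours."""
    usual_country: str
    start_hour: int  # inclusive, 24-hour clock
    end_hour: int    # exclusive, 24-hour clock


def is_contextual_outlier(profile: LoginProfile, country: str, hour: int) -> bool:
    """Flag a login that falls outside the account's usual country or hours."""
    in_usual_hours = profile.start_hour <= hour < profile.end_hour
    return country != profile.usual_country or not in_usual_hours


# Assumed example profiles: an office worker in the US and a night-shift user in China.
office_user = LoginProfile(usual_country="US", start_hour=9, end_hour=17)
night_shift = LoginProfile(usual_country="CN", start_hour=0, end_hour=8)

print(is_contextual_outlier(office_user, "US", 10))  # False: usual place and time
print(is_contextual_outlier(office_user, "CN", 2))   # True: unusual for this account
print(is_contextual_outlier(night_shift, "CN", 2))   # False: normal in this context
```

A production system would usually learn each account's context from historical logs rather than hard-code it; the fixed profiles above are a simplification.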
Types of Outliers: 3) Collective Outliers
These outliers appear near one another, all having similar values that are anomalous relative to most values in the feature.
Scenario: analyzing network traffic for malicious behavior (intrusion detection).
Imagine a network monitoring system (IDS) that tracks the number of outgoing connections from different devices within a corporate network.
Most devices typically make a small number of outgoing connections per minute (e.g., 1 to 5).
However, a specific group of devices, such as 10 workstations in a row, suddenly begins making 50 connections per minute each, all targeting the same external IP address.

Detecting Outliers
Outlier detection is important because outliers violate the mechanism that generates the normal data.
Outlier detection approaches:
o Univariate approach: looks at features in the dataset and inspects them individually for anomalous values.
o Multivariate approach: considers two or more variables simultaneously and inspects them together for outliers.

Outlier Detection Methods
Univariate methods:
o Tukey outlier labeling
o Tukey boxplot
Multivariate methods:
o Scatter-plot matrix
o Boxplot
o DBSCAN (density-based spatial clustering of applications with noise)
o Principal component analysis

Example: Detecting Outliers with Tukey Outlier Labeling
Tukey outlier labeling checks how far the minimum and maximum values are from the 25th percentile (1st quartile, Q1) and the 75th percentile (3rd quartile, Q3).
The distance between Q1 and Q3 is called the interquartile range (IQR), and it describes the data's spread.
Rule of thumb: always calculate A and B:
A = Q1 − 1.5 × IQR
B = Q3 + 1.5 × IQR
If your minimum value is less than A, or if your maximum value is greater than B, the feature probably contains outliers.

Example: Detecting Outliers with a Tukey Boxplot
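Both the Tukey labeling rule and the boxplot whiskers rest on the same two fences, A = Q1 − 1.5 × IQR and B = Q3 + 1.5 × IQR. The sketch below computes them with Python's standard library for the ten failed-login counts used in the monitoring example of this section. The function names are illustrative, and the quartile convention (linear interpolation, `method="inclusive"`) is an assumption: other conventions give slightly different fences.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (A, B) = (Q1 - k*IQR, Q3 + k*IQR) for a list of numbers."""
    # Assumption: quartiles by linear interpolation over the observed data.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Return the values that fall below A or above B."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Failed login attempts for User 1 .. User 10.
failed_logins = [2, 3, 4, 5, 6, 7, 8, 9, 15, 45]

lower, upper = tukey_fences(failed_logins)
print(lower, upper)                   # -2.5 15.5
print(tukey_outliers(failed_logins))  # [45]
```

With this convention Q1 = 4.25, Q3 = 8.75, and IQR = 4.5, so only User 10 (45 attempts) lies beyond the upper fence. User 9 (15 attempts) falls just inside it.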
Each boxplot has whiskers that extend at most 1.5 × IQR beyond Q1 and Q3. Any values that lie beyond these whiskers are outliers.

Example: Monitoring the number of failed login attempts on user accounts within a company's network over a month.
Consider the following dataset showing the number of failed login attempts for 10 user accounts. Identify outliers using Tukey's method.
User     # of failed login attempts
User 1   2
User 2   3
User 3   4
User 4   5
User 5   6
User 6   7
User 7   8
User 8   9
User 9   15
User 10  45

Introducing Time Series Analysis (TSA)
A time series is a collection of data on attribute values over time.
TSA is performed to predict future instances based on past observational data.
To forecast or predict future values from data in your dataset, use time series techniques.
Time series can be analyzed using:
o Multivariate analysis: the analysis of relationships between multiple variables.
o Univariate analysis: the quantitative analysis of only one variable at a time.

Identifying Patterns in Time Series
Time series exhibit specific patterns:
o Constant time series remain at roughly the same level over time but are subject to some random error.
o Trended time series show a stable linear movement up or down.
o Seasonal series show predictable, cyclical fluctuations that recur seasonally throughout a year.
Solid lines represent the mathematical models used to forecast points in the time series.
The models shown represent very good, precise forecasts because they are a very close fit to the actual data.
The actual data contains some random error, which makes it impossible to forecast perfectly.

More About Time Series Seasonality
As an example of a seasonal time series, consider how many businesses show increased sales during the holiday season.
If you're including seasonality in your model, incorporate it at the quarter, month, or even 6-month period, wherever appropriate.
Time series may show nonstationary processes.
Nonstationary: unpredictable cyclical behavior that is unrelated to seasonality and results from economic or industry-wide conditions instead.
Because they're not predictable, nonstationary processes can't be forecasted directly. You must transform nonstationary data to stationary data before moving forward.

Stationarity of Time Series Models (figure)

Time Series Modeling and Forecasting
Several methods exist to model and forecast time series.
A well-known method is the autoregressive moving average (ARMA) method. ARMA is used to predict future values from current and historical data.
(Figure: an example of an ARMA forecast model.)

Data Utilization to Build Insights for Decision-Makers
Data utilization: how?
We need to utilize this increasing amount of collected data to generate useful insights for decision-makers.
Several techniques have been developed to manage and process the data, such as expert systems, fuzzy logic, genetic algorithms, and machine learning.
These techniques help automate detection, predict threats, and optimize security processes, providing actionable insights for decision-makers.
By applying such advanced methods, organizations can process and analyze large datasets effectively to generate valuable insights, improve security posture, and make data-driven decisions.
Intelligent Computing vs Conventional Computing
Nature of Tasks
o Intelligent computing: Handles complex, dynamic, and ambiguous tasks (e.g., learning, adapting)
o Conventional computing: Deals with well-defined, rule-based, and repetitive tasks
Decision-Making
o Intelligent computing: Uses algorithms like machine learning, AI, fuzzy logic for decisions
o Conventional computing: Based on fixed algorithms, rules, and predefined instructions
Adaptability
o Intelligent computing: Learns and adapts from new data and experiences
o Conventional computing: Static; does not learn or adapt from new data
Problem Solving
o Intelligent computing: Capable of solving problems with incomplete or uncertain data
o Conventional computing: Requires complete, structured data to solve problems
Examples
o Intelligent computing: AI systems, neural networks, expert systems, machine learning models
o Conventional computing: Traditional computing systems (e.g., calculators, databases, spreadsheets)
Flexibility
o Intelligent computing: Highly flexible, can adjust its approach dynamically
o Conventional computing: Rigid, follows predetermined steps
Human-Like Abilities
o Intelligent computing: Mimics cognitive functions like reasoning, perception, and learning
o Conventional computing: Does not mimic human cognitive functions
Processing Approach
o Intelligent computing: Parallel and distributed processing (e.g., deep learning)
o Conventional computing: Sequential processing

Features of AI-Based Work
o Use of symbolic reasoning.
o Focus on problems that do not respond to algorithmic solutions (heuristic).
o Work on problems with inexact, missing, or poorly defined information.
o Provide answers that are sufficient but not exact.
o Work with qualitative knowledge rather than quantitative knowledge.
o Use a large amount of domain-specific knowledge.
AI deals with semantics and syntax in processing language and data. In short, syntax focuses on form, while semantics focuses on meaning, and AI systems need to understand both to process language and data effectively.
Example of Multiple Rules in a Cybersecurity Application
Scenario: an Intrusion Detection System (IDS) monitoring network traffic for potential security breaches.
Rule set:
1. Rule 1: IF there are more than 5 failed login attempts from the same IP within 10 minutes, THEN flag as a possible brute-force attack and trigger an alert.
2. Rule 2: IF a user logs in successfully but attempts to access unauthorized files or folders, THEN flag the event as potential unauthorized access.
3. Rule 3: IF an account logs in from two different geographic locations within 30 minutes, THEN trigger an alert for suspicious activity (e.g., account compromise).
4. Rule 4: IF there is a sudden spike in network traffic (e.g., a 200% increase within 5 minutes), THEN trigger an alert for a possible DDoS attack.
5. Rule 5: IF a system experiences multiple failed file downloads or repeated timeouts, THEN flag the event as a potential malware infection and trigger an alert.
How it works: The system continuously monitors network traffic and applies these rules. Each rule targets a different type of suspicious behavior, such as brute-force attacks, unauthorized access, or Distributed Denial of Service (DDoS) attempts. When any of these conditions is met, an alert is triggered for the security team to investigate. This multi-rule system helps detect a wide variety of potential threats, improving the overall security posture.

SUBSETS OF ARTIFICIAL INTELLIGENCE (figure)

What is Machine Learning (ML)?
ML concept: the part of AI that gives machines the ability to learn automatically from experience without being explicitly programmed.
ML is concerned with the design of algorithms that:
o Allow the system to learn from historical (past) data.
o Can identify/discover patterns in data and make decisions.
o Can learn and improve their performance automatically.

ML Applications (figure)

TYPES OF ML (figure)

SUPERVISED LEARNING IN ML (SML)
SML is used when you have a labeled dataset composed of historical values that are good predictors of future events.
The machine learns from the labeled dataset (a set of training examples) and then predicts the output for new, unlabeled data.
SML has two categories of algorithms: classification and regression.
Use cases of SML include fraud detection and disease prediction.

UNSUPERVISED LEARNING IN ML (UML)
UML algorithms are trained with data that is neither labeled nor classified.
UML then groups observations into categories based on the similarities in input features.
UML has two categories of algorithms: clustering and association (or dimensionality reduction).
Use cases of UML include recommendation engines and customer segmentation.

SML VS UML: ILLUSTRATION EXAMPLE (figures)

REINFORCEMENT LEARNING IN ML (RML)
RML is a behavior-based learning model similar to how humans and animals learn (learning from mistakes).
In RML, an AI agent interacts directly with the environment and is trained through its actions: for each action, the agent receives a reward as feedback.
Using this feedback, the agent improves its performance.
Reward feedback can be positive or negative: for each good action, the agent receives a positive reward, while for a wrong action, it receives a negative reward.
RML has two categories of algorithms: positive RML and negative RML.
Use cases of RML include robotics. Reinforcement learning is an up-and-coming concept in data science.
Video: https://www.youtube.com/watch?v=n2gE7n11h1Y

MORE EXAMPLES
Supervised Learning:
o Application: Email Spam Detection
o How: Trains on labeled data to classify emails as "spam" or "not spam."
o Use Case: Identifying phishing or malicious emails.
Unsupervised Learning:
o Application: Anomaly Detection in Network Traffic
o How: Detects patterns in unlabeled data and flags deviations as potential threats.
o Use Case: Identifying unusual network behavior indicating attacks.
Reinforcement Learning:
o Application: Automated Firewall Configuration
o How: Learns to optimize firewall rules by receiving rewards/penalties based on network interactions.
o Use Case: Dynamically adjusts firewall settings to defend against attacks.
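To make the unsupervised example above concrete, here is a minimal sketch of flagging deviations in unlabeled network data. It is not a learned model: it swaps in a simple robust statistic (the median absolute deviation, MAD) as a baseline. The sample counts, the 3.5 threshold, and the function names are assumptions chosen for illustration, loosely following the connections-per-minute scenario from the collective-outlier slide.

```python
import statistics


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; values are too uniform for this score")
    # 0.6745 makes the score comparable to a standard z-score for normal data.
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return the indices whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical outgoing connections per minute for ten devices (no labels).
connections_per_minute = [2, 3, 1, 4, 5, 3, 2, 50, 4, 3]

print(flag_anomalies(connections_per_minute))  # [7]
```

No labels are used anywhere: the device at index 7 is flagged purely because it deviates from the bulk of the data. In practice a clustering or density-based method such as DBSCAN would usually take the place of this baseline.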