Full Transcript

Unit: 2 (Machine learning clustering algorithms-II)

Anomaly detection in machine learning refers to the process of identifying data points or patterns that deviate significantly from the norm or the expected behavior.

Point Anomalies
Point anomalies, also known as global anomalies, refer to individual data points that deviate significantly from the rest of the data. These anomalies are isolated and stand out from the majority of the data points.
Example: Detecting a fraudulent credit card transaction where the transaction amount is far larger or smaller than the typical spending behavior of the cardholder.

Contextual Anomalies
Contextual anomalies, also called conditional anomalies, are data points that are considered anomalous only within a specific context or condition. These anomalies might be normal in one context but abnormal in another.
Example: An increase in web traffic on a retail website during holiday seasons might not be anomalous, but the same traffic increase on a different website not related to retail during the same time could be considered anomalous.

Collective Anomalies
Collective anomalies, also known as group anomalies or cluster-level anomalies, involve a group or a subset of data instances that collectively exhibit anomalous behavior when considered together. Individually, the data points might appear normal, but their combination is abnormal.
Example: Identifying a cluster of network devices that exhibit unusual communication patterns compared to the overall network traffic. While each device's behavior might appear normal on its own, the collective behavior of the devices as a group is anomalous.

Advantages of Anomaly Detection
Early Detection: Anomaly detection can identify unusual patterns or observations in data at an early stage, allowing for early intervention and prevention of potential issues.
Automation: Anomaly detection can be automated, allowing for continuous data monitoring and reducing the need for manual intervention.
Scalability: Anomaly detection can be applied to large datasets, making it suitable for big data applications.
Adaptability: Anomaly detection can be applied to various data types, including numerical data, time series data, and categorical data.
Real-time Monitoring: Anomaly detection can be used for real-time data monitoring, allowing for immediate action in case of an anomaly.

Limitations of Anomaly Detection
Data Quality: Anomaly detection's performance depends on the data's quality. Poor quality data can result in false positives or false negatives.
Choice of Algorithm: The choice of algorithm can also affect anomaly detection performance. Some algorithms may be better suited for certain types of data or specific use cases.
Threshold for Determining Anomalies: The threshold for determining anomalies is subjective and can affect anomaly detection performance.
Imbalance between Normal and Anomalous Data: As anomalous data is typically rare, it can be challenging to train a model that can accurately identify them. This is known as the class imbalance problem.

Anomaly Detection vs Supervised Learning
Supervised learning excels with labeled data, offering clear classification of both normal and anomalous data points.
Anomaly detection works well with unlabeled data, identifying deviations from the learned normal behavior.
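The points about unlabeled data and subjective thresholds can be illustrated with a small sketch. It uses a z-score rule, which is not taken from the slides but is a common baseline for point anomalies. The helper name zscore_anomalies and the transaction amounts are hypothetical, chosen only for illustration.

```python
import numpy as np


def zscore_anomalies(values, threshold=3.0):
    """Return a boolean mask of values whose |z-score| exceeds the threshold.

    Hypothetical helper for illustration; no labels are needed.
    """
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std()
    return z > threshold


# Invented card transaction amounts: nine typical purchases and one very large one
amounts = np.array([42.0, 38.5, 51.0, 47.2, 40.1, 44.9, 39.8, 46.3, 43.0, 950.0])

# The same data gives different answers depending on the chosen threshold
for t in (2.0, 3.0):
    flagged = amounts[zscore_anomalies(amounts, threshold=t)]
    print(f"threshold={t}: flagged {flagged}")
```

With only ten values, no z-score computed this way can exceed 3, so the stricter threshold flags nothing while the looser one flags the 950.0 transaction. This is one concrete way the threshold choice changes the result.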
Anomaly Detection using Decision Tree

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Generate synthetic data
np.random.seed(42)
normal_data = np.random.randn(1000, 2) * 2   # Normal data points
anomaly_data = np.random.randn(50, 2) * 10   # Anomaly data points

# Create one label per sample (0 for normal, 1 for anomalies)
labels = np.zeros(len(normal_data))
anomaly_labels = np.ones(len(anomaly_data))
labels = np.concatenate((labels, anomaly_labels))

# Combine normal and anomaly data
data = np.vstack((normal_data, anomaly_data))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

# Fit Decision Tree model
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict anomalies
predictions = model.predict(X_test)

# Evaluate model performance
conf_matrix = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(conf_matrix)

# Plot the test points, colored by predicted class
plt.scatter(X_test[:, 0], X_test[:, 1], c=predictions, cmap=plt.cm.Paired)
plt.title("Anomaly Detection using Decision Tree")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

Program for Anomaly Detection using KMeans

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate synthetic data
np.random.seed(42)
normal_data = np.random.randn(1000, 2) * 2   # Normal data points
anomaly_data = np.random.randn(50, 2) * 10   # Anomaly data points

# Combine normal and anomaly data
data = np.vstack((normal_data, anomaly_data))

# Fit K-Means model
num_clusters = 5
model = KMeans(n_clusters=num_clusters)
model.fit(data)

# Assign data points to clusters
cluster_labels = model.labels_

# Calculate cluster sizes
cluster_sizes = np.bincount(cluster_labels)

# Define a threshold for cluster size to identify potential anomalies
anomaly_threshold = np.percentile(cluster_sizes, 5)   # 5th percentile cluster size

# Identify potential anomalies
potential_anomalies = data[cluster_labels[np.where(cluster_sizes
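The last step of the K-Means program is cut off in this transcript. One possible reading of its intent, flagging every point that belongs to an unusually small cluster, can be sketched as follows. This is an assumption about what the step does, not the original line. The helper flag_small_clusters and the toy arrays are hypothetical; in the program above, cluster_labels would come from model.labels_.

```python
import numpy as np


def flag_small_clusters(data, cluster_labels, percentile=5):
    """Return the points in clusters whose size is at or below the given
    percentile of cluster sizes, plus the boolean mask that selects them.

    Hypothetical helper; assumes small clusters are the suspicious ones.
    """
    cluster_sizes = np.bincount(cluster_labels)
    threshold = np.percentile(cluster_sizes, percentile)
    # IDs of clusters that count as "small"
    small_clusters = np.where(cluster_sizes <= threshold)[0]
    # Mark every point assigned to one of those clusters
    mask = np.isin(cluster_labels, small_clusters)
    return data[mask], mask


# Toy input: two dense groups and one isolated point in its own cluster
toy_data = np.array([
    [0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.0], [0.2, -0.3],
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0], [5.1, 5.3],
    [40.0, -35.0],
])
toy_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 2])

potential_anomalies, mask = flag_small_clusters(toy_data, toy_labels)
print(potential_anomalies)
```

The comparison uses <= rather than < so that the smallest cluster is always flagged, even when several clusters tie for the minimum size. Whether a small cluster really contains anomalies still depends on the data and on the number of clusters chosen.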
