Week01-2023.docx
La Trobe University
latrobe.edu.au
CSE5DMI Data Mining
Week 01
La Trobe University CRICOS Provider Code Number 00115M

Teaching Team
Lecturer & Lab Demonstrator: Abraham Albert Bonela ([email protected])
Background
• Lab tutor for the subjects Big Data Management on Cloud, Data Mining, Computational Intelligence for Data Analytics, and Programming Environment/Programming for Engineers and Scientists
• PhD in Applied Artificial Intelligence, LTU
• Master of Data Science, LTU
• Research area: Deep Learning (Computer Vision & Natural Language Processing), Machine Learning
• Webpage: https://scholars.latrobe.edu.au/display/abonela
Comments from our students

Teaching Team
Subject Coordinator: Dr. Lydia Cui ([email protected])
Background
• PhD in computer science, The University of Sydney
• Postdoctoral research fellow, The University of Sydney
• Lecturer, La Trobe University
• Research area: image processing, medical computer vision, machine learning
• Webpage: https://scholars.latrobe.edu.au/display/lcui
Comments:

Objectives
This subject provides students with experience and knowledge in the emerging field of data mining.
Upon successful completion of this subject, you should be able to:
• Perform critical and effective data pre-processing tasks.
• Evaluate major data mining classification methodologies.
• Critique association rules mining approaches.
• Evaluate data mining algorithms based on data clustering techniques.
• Apply advanced data mining techniques for pattern discovery from selected datasets.

Learning Activities
Session | Lectures | Labs | Contact hours
1 | Introduction to Data Mining | - | 2
2 | Data and Similarity | Python Basics (1) | 4
3 | Rule-based Classification Techniques (1) | Python Basics (2) | 4
4 | Rule-based Classification Techniques (2) | Python: Numpy, Pandas | 4
5 | Learning-based Classification Techniques (1) | Data Pre-processing | 4
6 | Learning-based Classification Techniques (2) | Python: Decision Tree | 4
7 | Basics of Association Analysis | Assignment Consultation | 4
8 | Association Analysis | Neural Network Classification | 4
9 | Clustering Techniques (1) | Support Vector Machine | 4
10 | Clustering Techniques (2) | K-means Clustering | 4
11 | Clustering Techniques (3) | Hierarchical Clustering | 4
12 | Review | Assignment Consultation | 4
Total | | | 46

Study Mode & Weekly activities
Lecture (one 2-hour face-to-face lecture per week)
• Tuesday 11:00 - 13:00, WLT3
• Lectures are automatically recorded by Echo 360
• Weekly consultation: Friday 4:00 pm – 5:00 pm (please email me to make an appointment)
Practice (one 2-hour face-to-face lab per week from week 2)
• Labs are neither recorded nor delivered online. If you cannot attend labs regularly, there will be a high risk of failing the assignments. Please reconsider enrolling in this subject.
• Join your allocated session according to Allocate+
  o Wed 11:00-13:00, BG 108
  o Wed 13:00-15:00, BG 108
  o Thu 11:00-13:00, BG 108
  o Fri 13:00-15:00, BG 108

Study Mode & Weekly activities
Post-class quiz (one quiz per week)
• There will be ten (10) tests throughout the semester, with the first test starting in week one of the semester.
• Weightage: 10% (1% per week in designated teaching weeks; 10 × 1% = 10% in total)
• Details of task:
  o Format: online homework, multiple-choice, problem-solving.
  o Each test will only be available through LMS for one week, that is, from Tuesday 1 pm in the topic week to Tuesday 11 am the following week.
  o Quiz 6 will close at 11 am on 19 Sep due to the mid-semester break.
• You will have two attempts at the quiz each week before the quiz closes. The questions in the two attempts can be different. The best grade will be recorded.

Assessments
Week/Session | % | Assessment type | Key Criteria | Feedback method | SILOs assessed
Weekly | 10 | Completion of 10 weekly online tests | Completion of online quizzes, evaluated by staff | Online via LMS | SILO5
Midnight, Sunday, Week 7 | 20 | Individual Assignment | TBA in assignment description | Online written comments by teaching staff | SILO1, SILO2
Midnight, Sunday, Week 12 | 20 | Individual Assignment | TBA in assignment description | Online written comments by teaching staff | SILO2, SILO4, SILO5
Central Exam Period | 50 | 2-hour closed book face-to-face (paper based) examination | - | Final Results | SILO1, SILO2, SILO3, SILO4, SILO5

The hurdle requirements for this subject are:
TO PASS THE SUBJECT, A PASS IN THE EXAMINATION IS MANDATORY
• You must achieve >50% of the total marks on the exam paper to pass the subject.
• The subject cannot offer an online exam to any student. Please reconsider whether this is the subject you want to enrol in.

Get familiar with LTU Policies
Request for an extension of time to submit an assessment task
https://www.latrobe.edu.au/students/admin/forms/request-an-extension
• Applications must be made at least three business days before the due date.
• If the assessment task is due in less than three working days, please consider submitting a special consideration application.
• Please ensure you have provided documentation to demonstrate the date of the circumstance and/or the date(s) of the impact, e.g., Medical Impact Statement, police report, funeral notice, Statutory Declaration or a Learning Access Plan (LAP).
Special consideration
https://www.latrobe.edu.au/students/admin/forms/special-consideration
• If you have experienced serious short term, adverse and unforeseen circumstances that substantially affect your ability to complete an assessment task to the best of your potential, you may be eligible to apply for Special Consideration.
Academic integrity
• The penalty for submitting an assignment under your name that is the work of a third party may be severe, even leading to exclusion from the University without readmission.
• Refer to the Academic Integrity - Schedule of Penalties and Actions within the Student Academic Misconduct Policy.

CSE5DMI Web Page - LMS
• Announcement
• Subject information
• Weekly learning materials
  o Lecture note
  o Lecture recording
  o Lab note
  o Lab recording
  o Post-class quiz
• Assignments
• University policies
  o Special consideration
  o Plagiarism
  o Etc.

Contact us
• We would appreciate it if you could email us using your LTU account and include CSE5DMI in the email subject line.
• For any questions about lecture/lab contents or assessment marking, please contact Albert.
• For assessment extension & special consideration applications, please contact Lydia.

Reference book
• Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar
  Selected chapters available via the following link: https://www-users.cse.umn.edu/~kumar001/dmbook/index.php
• Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
  Available via LTU library: https://ebookcentral-proquest-com.ez.library.latrobe.edu.au/lib/latrobe/detail.action?docID=729031

Large-Scale Data is Everywhere!
• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies
• New mantra
  o Gather whatever data you can whenever and wherever possible.
• Expectations
  o Gathered data will have value either for the purpose collected or for a purpose not envisioned.
Cyber Security, Traffic Patterns, Sensor Networks, E-Commerce, Social Networking: Twitter, Computational Simulations

Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
  o Web data, e-commerce
  o purchases at department/grocery stores
  o Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive Pressure is Strong
  o Provide better, customized services for an edge (e.g., in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
  o remote sensors on a satellite
  o telescopes scanning the skies
  o microarrays generating gene expression data
  o scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
  o in classifying and segmenting data
  o in hypothesis formation

Great Opportunities to Solve Society’s Major Problems
• Improving health care and reducing costs
• Finding alternative/green energy sources
• Predicting the impact of climate change
• Reducing hunger and poverty by increasing agriculture production

What is Data Mining?
Many Definitions
• Non-trivial extraction of implicit, previously unknown and potentially useful information from data
• Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar

What is (not) Data Mining?
What is not Data Mining?
• Look up phone number in phone directory
• Query a Web search engine for information about “Amazon”
What is Data Mining?
• Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
• Group together similar documents returned by search engine according to their context (e.g., Amazon rainforest, Amazon.com)

Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
[Figure: chart with a vertical axis from 0 to 4,000,000 and a horizontal axis from 1995 to 1999]

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to data that is
  o Large-scale
  o High dimensional
  o Heterogeneous
  o Complex
  o Distributed
• A key component of the emerging field of data science and data-driven discovery

Data Mining Tasks
• Prediction Methods
  o Use some variables to predict unknown or future values of other variables.
• Description Methods
  o Find human-interpretable patterns that describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Predictive Modeling: Classification
• Given a collection of records (training set), each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.

Model for predicting credit worthiness
Tid | Employed | Level of Education | # years at present address | Credit Worthy (class)
1 | Yes | Graduate | 5 | Yes
2 | Yes | High School | 2 | No
3 | No | Undergrad | 1 | No
4 | Yes | High School | 10 | Yes
… | … | … | … | …

Examples of Classification Task
• Classifying credit card transactions as legitimate or fraudulent
• Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in the cyberspace
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Regression
• Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Extensively studied in statistics, neural network fields.
• Examples:
  o Predicting sales amounts of new product based on advertising expenditure.
  o Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  o Time series prediction of stock market indices.

Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
Similarity Measures:
• Euclidean distance if attributes are continuous.
• Other problem-specific measures.

Clustering: Application
Market Segmentation:
• Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
• Approach:
  o Collect different attributes of customers based on their geographical and lifestyle related information.
  o Find clusters of similar customers.
  o Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Application
Document Clustering:
• Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
• Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
• Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection:
• Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

Association Analysis: Applications
Marketing and Sales Promotion:
Let the rule discovered be {Bagels, … } --> {Potato Chips}
• Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
• Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
• Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!

Association Analysis: Applications
Supermarket shelf management:
• Goal: To identify items that are bought together by sufficiently many customers.
• Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
• A classic rule:
  o If a customer buys diapers and milk, then he is very likely to buy beer.
  o So, don’t be surprised if you find six-packs stacked next to diapers!

Deviation/Anomaly/Change Detection
• Detect significant deviations from normal behavior
• Applications:
  o Credit Card Fraud Detection
  o Network Intrusion Detection
  o Identify anomalous behavior from sensor networks for monitoring and surveillance.
  o Detecting changes in the global forest cover.
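
As a rough illustration of detecting "significant deviations from normal behavior", the sketch below flags values whose z-score (distance from the mean, measured in standard deviations) exceeds a threshold. The z-score rule, the function name find_anomalies, the threshold of 2.5, and the daily_spend figures are all assumptions made for illustration; none of them come from the lecture, and practical fraud or intrusion detection typically relies on richer models.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.5):
    """Return (index, value) pairs whose absolute z-score exceeds threshold.

    Assumes at least two values. The threshold is an illustrative choice.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        # All values are identical, so nothing deviates from "normal".
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical daily card spend; one day is far outside the usual range.
daily_spend = [42.0, 38.5, 45.1, 40.3, 39.9, 41.7,
               43.2, 37.8, 44.0, 40.6, 950.0, 42.4]

print(find_anomalies(daily_spend))  # prints [(10, 950.0)]
```

With only twelve points, a single extreme value inflates both the mean and the standard deviation, which is why this sketch uses a threshold below the more common value of 3.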

Motivating Challenges
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-Traditional Analysis

Summary
• Introduction to CSE5DMI
  o Weekly activities
  o Teaching team
  o Assessments
  o Important policies
• Introduction to data mining
  o Why data mining
  o What is data mining
  o Data mining tasks

After class
• Lecture 1 quiz will be available from 1 pm today
  o Go to LMS -> Week 1: Introduction to Data Mining -> POST-CLASS QUIZ
• Get prepared for Lab 1 in Week 2
  o Go to LMS -> Week 2: Data -> PRACTICAL LAB

Thank you
La Trobe University CRICOS Provider Code Number 00115M
© Copyright La Trobe University 2018