Applied Data Analytics

Unit 1: Fundamentals of Data Analytics
Introduction to Data Analytics: definition and importance of data analytics; types of data and sources; the data analytics process: Collection, Cleaning, Exploration, Analysis, Visualization.
Data Preprocessing and Cleaning: data cleaning and transformation techniques; handling missing data; dealing with outliers and noise.
Exploratory Data Analysis (EDA): descriptive statistics; data visualization techniques (histograms, scatter plots, box plots, etc.); identifying patterns and trends.

Unit 2: Data Analysis Techniques and Tools
Introduction to Statistical Analysis: probability and probability distributions; hypothesis testing and significance; correlation and regression analysis.
Machine Learning Fundamentals: supervised, unsupervised, and semi-supervised learning; classification and regression algorithms; clustering techniques.
Text and Sentiment Analysis: processing textual data; sentiment analysis using natural language processing; applications of text analysis.
Data Visualization and Interpretation: effective data visualization principles; tools for creating visualizations (Matplotlib, Seaborn, Tableau); interpreting visualizations and communicating insights.

Unit 3: Applied Data Analytics and Projects
Real-world Data Analysis Projects: selecting and formulating data analysis projects; collecting and preparing relevant data; applying appropriate analysis techniques.
Final Presentations and Reflection: presenting project findings and insights; reflecting on the data analysis process and challenges faced; implications and applications of data analytics in various fields.

Big Data
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be handled by traditional data-processing application software.

The 4V properties of Big Data:
Volume - the size of the data sets.
Velocity - the speed with which data is generated.
Variety - the range of sources and formats of the data.
Veracity - the quality of the data.

Volume of Big Data: The volume of data refers to the size of the data sets that need to be analyzed and processed, which are now frequently larger than terabytes and petabytes. The sheer volume of the data requires processing technologies distinct from traditional storage and processing capabilities. In other words, the data sets in Big Data are too large to process with a regular laptop or desktop processor. An example of a high-volume data set would be QR code transactions.

Velocity of Big Data: Velocity refers to the speed with which data is generated. High-velocity data is generated at such a pace that it requires distinct (distributed) processing techniques.

Variety of Big Data: Variety makes Big Data really big. Big Data comes from a great variety of sources and generally falls into one of three types: structured, semi-structured, and unstructured data. The variety in data types frequently requires distinct processing capabilities and specialist algorithms. An example of a high-variety data set would be the CCTV audio and video files generated at various locations in a city.

Veracity of Big Data: Veracity refers to the quality of the data being analyzed. High-veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results.
The collection, transformation, and organization of data to draw conclusions, make predictions about the future, and make informed data-driven decisions is called Data Analysis. The professional who carries out data analysis is called a Data Analyst. An advantage of being a Data Analyst is the ability to work in any field of interest: healthcare, agriculture, IT, finance, business, and more.

Steps of the Data Analysis Process

Step 1: Define the Problem or Research Question
In the first step of the process, the data analyst is given a problem or business task. The analyst has to understand the task and the stakeholders' expectations for the solution. A stakeholder is a person who has invested money and resources in a project. The analyst must be able to ask different questions in order to find the right solution to the problem, and has to find the root cause of the problem in order to fully understand it. Communicate effectively with the stakeholders and other colleagues to completely understand what the underlying problem is. Questions to ask yourself in the Ask phase: What are the problems being mentioned by my stakeholders? What are their expectations for the solutions?

A research question in data analysis is a clear, focused, and specific question that guides the data collection, analysis, and interpretation process. It helps define what you want to understand or solve through your analysis.

Broadly speaking, there are three major approaches to data collection:
1. One can ask people questions related to the problem being investigated.
2. One can make observations related to places, people, and organizations, or their products or outcomes.
3. One can utilize existing records or data already gathered by others for the purpose.
The first two approaches relate to the collection of primary data, while the third relates to the collection of secondary data. Information collected by a person directly is known as primary data, while records or data obtained from offices or institutions are known as secondary data.

How to formulate a research question:
1. Identify the problem or objective
2. Specify the variables
3. Define the scope
4. Formulate the question
5. Ensure relevance

Examples:
Student performance: "What is the effect of online learning tools on student academic performance in high school mathematics?"
Treatment efficacy: "How effective is Drug X in reducing symptoms of Disease Y compared to a placebo?"
Climate change: "What are the effects of increased atmospheric CO2 levels on the growth rates of coastal plant species?"

Step 2: Data Collection
Data collection is the process of gathering and evaluating information or data from multiple sources to find answers to research problems, answer questions, evaluate outcomes, and forecast trends and probabilities. It is an essential phase in all types of research, analysis, and decision-making, including in the social sciences, business, and healthcare. Accurate data collection is necessary to maintain quality assurance and research integrity.

Considerations for data collection:
Data Quality: Ensure the data is accurate, complete, and relevant.
Ethics and Privacy: Adhere to data protection regulations (e.g., GDPR, CCPA) and ensure informed consent.
Data Format: Collect data in a format that is easy to analyze and integrate (e.g., CSV, JSON); a small loading sketch follows below.
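To illustrate the Data Format consideration, here is a minimal pandas sketch of loading collected data stored as CSV or JSON; the file names data.csv and data.json are hypothetical placeholders, not files referenced elsewhere in these notes.

    import pandas as pd

    # Load tabular data that was collected as CSV (hypothetical file name).
    df_csv = pd.read_csv("data.csv")

    # Load the same kind of records collected as JSON (hypothetical file name).
    df_json = pd.read_json("data.json")

    # A quick look at shape and column types shows whether the format
    # is ready for cleaning and analysis.
    print(df_csv.shape)
    print(df_csv.dtypes)
    print(df_json.head())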
Data collection methods:
Surveys: Surveys involve asking questions to a sample of individuals or organizations to collect data. Surveys can be conducted in person, over the phone, or online.
Interviews: Interviews involve a one-on-one conversation between the interviewer and the respondent. Interviews can be structured or unstructured and can be conducted in person or over the phone.
Focus Groups: Focus groups are group discussions moderated by a facilitator. They are used to collect qualitative data on a specific topic.
Observation: Observation involves watching and recording the behavior of people, objects, or events in their natural setting. Observation can be done overtly or covertly, depending on the research question.
Experiments: Experiments involve manipulating one or more variables and observing the effect on another variable. Experiments are commonly used in scientific research.
Case Studies: Case studies involve in-depth analysis of a single individual, organization, or event. They are used to gain detailed information about a specific phenomenon.

Data collection tools and techniques:
1. Surveys and Questionnaires
Tools: SurveyMonkey (creating and distributing surveys with various question types); Google Forms (free tool for designing surveys and collecting responses); Qualtrics (advanced survey tool with extensive customization and analytics features).
Techniques: online surveys (distributed via email or the web), paper surveys (responses collected in person), phone surveys (conducted via phone interviews).
2. Web Scraping
Tools: BeautifulSoup (Python library for parsing HTML and XML documents); Scrapy (Python framework for building web scrapers); Selenium (browser automation tool used for scraping dynamic content).
Techniques: static scraping (extract data from fixed HTML pages), dynamic scraping (extract data from pages that load content dynamically via JavaScript).
3. APIs (Application Programming Interfaces)
Tools: Postman (testing and interacting with APIs); Requests (Python library for making HTTP requests); cURL (command-line tool for transferring data using various protocols).
Techniques: REST APIs (retrieve data from web services, e.g., Twitter API, Google Maps API), SOAP APIs (web service protocol for exchanging structured information). A minimal REST example is sketched after this list.
4. Database Queries
Tools: SQL for querying relational databases (e.g., MySQL, PostgreSQL, SQL Server); NoSQL databases (MongoDB, Cassandra) for querying non-relational data; database management tools (DBeaver, phpMyAdmin).
Techniques: direct queries (extract data using SQL commands), ETL processes (Extract, Transform, Load processes to integrate data from different sources).
5. Web Forms and Data Entry
Tools: Google Forms (create forms to gather data); Typeform (design interactive forms and surveys); Microsoft Forms (simple tool for creating surveys and quizzes).
Techniques: online forms (collect data through web-based forms), embedded forms (integrate forms into websites or apps).
6. Sensors and IoT Devices
Tools: Raspberry Pi (small, affordable computer used for sensor data collection); Arduino (open-source electronics platform for building devices); smart sensors (devices that collect data such as temperature, humidity, etc.).
Techniques: real-time data collection (use sensors to continuously collect data), batch data collection (collect data at discrete intervals).
7. Social Media and Online Platforms
Tools: social media APIs (Twitter API, Facebook Graph API for extracting social media data); social listening tools (Brandwatch, Hootsuite for tracking online mentions and trends).
Techniques: sentiment analysis (analyze user sentiment from social media posts), trend analysis (identify trends and patterns from social media data).
8. Log Files and Event Tracking
Tools: Google Analytics (track and report website traffic); Mixpanel (analyze user interactions and behaviors); log management tools (ELK Stack: Elasticsearch, Logstash, Kibana; Splunk).
Techniques: event logging (capture user actions and system events), server logs (analyze server activity and performance).
9. Data Repositories and Public Datasets
Tools: Kaggle Datasets (access a wide range of datasets for analysis); UCI Machine Learning Repository (collection of databases, domain theories, and datasets); government data portals (Data.gov, EU Open Data Portal).
Techniques: dataset download (download and use pre-collected datasets), data aggregation (combine multiple public datasets for comprehensive analysis).
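As a sketch of the API-based collection technique listed above (item 3), the snippet below pulls JSON records from a REST endpoint with the Requests library and tabulates them with pandas. The URL, query parameters, and field layout are hypothetical placeholders, not a specific real service.

    import requests
    import pandas as pd

    # Hypothetical REST endpoint that returns a JSON list of records.
    url = "https://api.example.com/v1/measurements"
    params = {"city": "Pune", "limit": 100}   # assumed query parameters

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()               # stop early on HTTP errors

    records = response.json()                 # parse the JSON payload
    df = pd.DataFrame(records)                # tabulate for later cleaning and analysis
    print(df.head())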
Examples of data collection:
Traffic monitoring: Cities collect real-time data on traffic patterns and congestion through sensors on roads and cameras at intersections. This information can be used to optimize traffic flow and improve safety.
Health monitoring: Medical devices such as wearable fitness trackers and smartwatches can collect real-time data on a person's heart rate, blood pressure, and other vital signs. This information can be used to monitor health conditions and detect early warning signs of health issues.

Step 3: Data Cleaning
Data cleaning, also referred to as data cleansing or data scrubbing, is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.

Techniques for data cleaning (a pandas sketch of several of these follows below):
1. Handling Missing Data - Imputation: fill missing values using the mean, median, or prediction models. Deletion: remove rows or columns with missing values. Flagging: create indicators for missing data.
2. Handling Outliers - Statistical methods: use z-scores or the IQR to identify and handle outliers. Transformation: apply log transformations to reduce the impact of outliers. Winsorization: cap outliers at specified percentiles.
3. Data Standardization and Normalization - Standardization: adjust data to have a mean of 0 and a standard deviation of 1. Min-max scaling: scale data to a fixed range, typically [0, 1].
4. Removing Duplicates - Deduplication: use algorithms or tools to identify and remove duplicate records. Manual review: verify and merge duplicates manually if necessary.
5. Correcting Data Entry Errors - Validation rules: implement rules to catch and correct errors during data entry. Format conversion: ensure data is in the correct format (e.g., date, numerical).
6. Data Transformation - Aggregation: summarize data at higher levels (e.g., totals, averages). Pivoting: reshape data into different formats or structures for analysis.
7. Handling Irrelevant Data - Feature selection: retain only relevant features or variables. Filtering: exclude irrelevant records or data points.
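The following is a minimal pandas sketch of several of the cleaning techniques above (median imputation, duplicate removal, IQR-based capping of outliers, and min-max scaling). The file name and the columns 'age' and 'income' are hypothetical.

    import pandas as pd

    df = pd.read_csv("survey_data.csv")        # hypothetical raw dataset

    # 1. Handling missing data: median imputation for a numeric column.
    df["age"] = df["age"].fillna(df["age"].median())

    # 4. Removing duplicates: drop exact duplicate rows.
    df = df.drop_duplicates()

    # 2. Handling outliers: cap 'income' at the IQR fences (a simple winsorization).
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df["income"] = df["income"].clip(lower=lower, upper=upper)

    # 3. Min-max scaling of 'income' to the [0, 1] range.
    df["income_scaled"] = (df["income"] - df["income"].min()) / (
        df["income"].max() - df["income"].min()
    )

    print(df.describe())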
Tools commonly used in data cleaning:
1. Microsoft Excel - A widely used spreadsheet application with built-in data cleaning features. Techniques: Remove Duplicates (identifies and removes duplicate rows), Data Validation (ensures data entry meets specified criteria), Text-to-Columns (splits text into columns based on delimiters), Find and Replace (corrects or updates values in the dataset), Conditional Formatting (highlights data based on rules, useful for spotting anomalies). Limitations: limited scalability for large datasets.
2. OpenRefine - An open-source tool for working with messy data. Techniques: data transformation (apply transformations such as normalization or splitting), clustering (group similar data values to detect duplicates), faceting (provides views into different segments of the data for easy exploration), scripting (use GREL, the Google Refine Expression Language, for advanced data cleaning tasks). Limitations: requires some learning to use effectively; not as integrated with other data tools as some alternatives.
3. Python Libraries - Libraries in Python that provide extensive data cleaning functionality. Libraries: Pandas (functions for data manipulation, e.g., dropna(), fillna(), replace()), NumPy (handling numerical data and performing transformations), Scikit-learn (preprocessing tools for scaling and encoding). Limitations: requires programming knowledge; can be complex for beginners.
4. R and R Libraries - R is a statistical computing language with robust data cleaning packages. Libraries: dplyr for data manipulation (e.g., filter(), mutate(), select()), tidyr for tidying data (e.g., gather(), spread()), data.table for efficient manipulation of large data sets. Limitations: requires knowledge of R programming; learning curve for newcomers.
5. SQL - Structured Query Language, used for querying and managing relational databases. Techniques: query filtering (use WHERE clauses to remove or correct data), aggregation functions (identify and handle outliers, e.g., SUM(), AVG(), COUNT()), join operations (merge tables to ensure consistency and completeness). Limitations: limited to relational databases; requires SQL knowledge and an understanding of the database schema.
6. Power BI - A business analytics tool by Microsoft with built-in data cleaning capabilities. Features: Power Query (a data transformation and cleaning tool integrated within Power BI), data shaping (apply transformations such as filtering, grouping, and merging), data modeling (create relationships and manage data schemas). Best for: users who need data cleaning and visualization within the same platform.
7. Talend Data Quality - A data integration and data quality management tool. Features: data profiling (assess data quality and consistency), data cleansing (standardize, correct, and enrich data), data enrichment (enhance data with additional information). Best for: enterprise environments needing comprehensive data integration and quality management.
8. Pandas (Python library) - A powerful Python library for data manipulation and cleaning. Features: DataFrame operations (handle missing data, remove duplicates, and filter data), transformation functions (apply transformations such as scaling and encoding), integration (easily integrates with other Python libraries and tools). Best for: users with programming knowledge who need flexible data manipulation capabilities.

Step 4: Analyzing the Data
Once the data is cleaned, it is time for the actual analysis. This involves applying statistical or mathematical techniques to the data to discover patterns, relationships, or trends. Various data analysis techniques are available to understand, interpret, and derive conclusions based on the requirements.
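A typical first pass at this step is a quick round of descriptive statistics. The sketch below assumes a cleaned dataset like the one produced in the earlier cleaning example; the file and column names ('region') are hypothetical.

    import pandas as pd

    df = pd.read_csv("survey_data_clean.csv")   # hypothetical cleaned dataset

    # Descriptive statistics for numeric columns: count, mean, std, quartiles.
    print(df.describe())

    # Frequency counts for a categorical column (hypothetical 'region' column).
    print(df["region"].value_counts())

    # Pairwise correlations between numeric columns to spot relationships.
    print(df.corr(numeric_only=True))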
There are various tools and software available for this purpose, such as Python, R, Excel, and specialized software like SPSS and SAS.

Data Analysis Techniques
There are numerous techniques used in data analysis, each with its own purpose and application. The most commonly used include exploratory analysis, regression analysis, Monte Carlo simulation, factor analysis, cohort analysis, cluster analysis, time series analysis, and sentiment analysis (regression and cluster analysis are sketched in code below).

Exploratory analysis - Used to understand the main characteristics of a data set. It is often used at the beginning of a data analysis process to summarize the main aspects of the data, check for missing data, and test assumptions. This technique involves visual methods such as scatter plots, histograms, and box plots.
Regression analysis - A statistical method used to understand the relationship between a dependent variable and one or more independent variables. It is commonly used for forecasting, time series modeling, and finding causal relationships between variables.
Factor analysis - A technique used to reduce a large number of variables to fewer factors. The factors are constructed so that they capture the maximum possible information from the original variables. Often used in market research, customer segmentation, and image recognition.
Monte Carlo simulation - A technique that uses probability distributions and random sampling to estimate numerical results. Often used in risk analysis and decision-making where there is significant uncertainty.
Cluster analysis - A technique used to group a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Often used in market segmentation, image segmentation, and recommendation systems.
Cohort analysis - A subset of behavioral analytics that takes data from a given dataset and groups it into related groups for analysis. These related groups, or cohorts, usually share common characteristics within a defined time span. Often used in marketing, user engagement, and customer lifecycle analysis.
Time series analysis - A statistical technique that deals with time series data, or trend analysis. It is used to analyze a sequence of data points to extract meaningful statistics and other characteristics of the data. Often used in sales forecasting, economic forecasting, and weather forecasting.
Sentiment analysis - Also known as opinion mining; it uses natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials. Often used in social media monitoring, brand monitoring, and understanding customer feedback.

Data analysis tools: Python, R, SQL, Power BI, Tableau, Excel.
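As a small illustration of two of the techniques above, here is a scikit-learn sketch of simple linear regression and k-means clustering on synthetic data. The variable names, synthetic relationship, and cluster count are arbitrary choices for the example, not values taken from these notes.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)

    # Synthetic data: advertising spend (x) vs. sales (y) with noise.
    x = rng.uniform(0, 100, size=(200, 1))
    y = 3.5 * x[:, 0] + 20 + rng.normal(0, 10, size=200)

    # Regression analysis: fit y as a linear function of x.
    reg = LinearRegression().fit(x, y)
    print("slope:", reg.coef_[0], "intercept:", reg.intercept_)

    # Cluster analysis: group the points into 3 clusters (arbitrary k).
    features = np.column_stack([x[:, 0], y])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
    print("cluster sizes:", np.bincount(labels))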
Careers in data analysis - key skills and essential tools:
Data Scientist - Key skills: proficiency in programming, strong statistical knowledge, familiarity with machine learning, data wrangling skills, and effective communication. Essential tools: Python, R, SQL, Scikit-learn, TensorFlow, Matplotlib, Seaborn.
Business Intelligence Analyst - Key skills: strong analytical skills, proficiency in SQL, understanding of data warehousing and ETL, ability to create visualizations and reports, and business acumen. Essential tools: SQL, Power BI, Tableau, Excel, Python.
Data Engineer - Key skills: proficiency in SQL and NoSQL, knowledge of distributed systems and data architecture, familiarity with ETL, programming skills, and understanding of machine learning. Essential tools: SQL, NoSQL, Hadoop, Spark, Python, Java, ETL tools.
Business Analyst - Key skills: strong analytical skills, understanding of business processes, proficiency in SQL, effective communication, and project management skills. Essential tools: SQL, Excel, Power BI, Tableau, Python.

Step 5: Data Visualization
Data visualization is the graphical representation of information and data. By using visual elements such as charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. This can include a variety of visual tools, such as:
Charts: bar charts, line charts, pie charts, etc.
Graphs: scatter plots, histograms, etc.
Maps: geographic maps, heat maps, etc.
Dashboards: interactive platforms that combine multiple visualizations.

Types of data for visualization
Accurate visualization of data is critical in market research, where both numerical and categorical data can be visualized; this increases the impact of insights and reduces the risk of analysis paralysis. Data for visualization therefore falls into two categories: numerical data and categorical data.

Types of data visualization analysis
Data visualization is used to analyze visually the behavior of the different variables in a dataset, such as the relationship between data points in a variable or its distribution. Depending on the number of variables you want to study at once, three types of data visualization analysis can be distinguished:
Univariate analysis: used to summarize the behavior of only one variable at a time.
Bivariate analysis: helps to study the relationship between two variables.
Multivariate analysis: allows data practitioners to analyze more than two variables at once.

Key data visualization techniques (a Matplotlib/Seaborn sketch of a few of these follows at the end of this section):
Line plots - Normally created by putting a time variable on the x-axis and the variable you want to analyze on the y-axis.
Bar plots - A bar chart ranks data according to the value of multiple categories. It consists of rectangles whose lengths are proportional to the value of each category.
Histograms - One of the most popular visualizations for analyzing the distribution of data; they show a numerical variable's distribution with bars.
Box and whisker plots - Boxplots provide an intuitive and compelling way to spot the median, the upper quartile, the lower quartile, the interquartile range, the upper and lower adjacent values, and outliers.
Scatter plots - Used to visualize the relationship between two continuous variables. Each point on the plot represents a single data point, and its position on the x- and y-axes represents the values of the two variables.
Bubble plots - Scatter plots can easily be augmented by adding new elements that represent new variables.
Treemaps - Suitable for showing part-to-whole relationships in data. They display hierarchical data as a set of rectangles; each rectangle is a category within a given variable, and the area of the rectangle is proportional to the size of that category.
Heat maps - A heatmap is a common matrix plot that can be used to graphically summarize the relationship between two variables; the degree of correlation between the variables is represented by a color code.
Word clouds - Useful for visualizing common words in a text or data set; they are similar to bar plots but are often more visually appealing.
Maps - A considerable proportion of the data generated every day is inherently spatial. Spatial data (also known as geospatial data or geographic information) is data for which a specific location is associated with each record.
Network diagrams - Graphs are better suited to analyzing data that is organized in networks, from online social networks such as Facebook and Twitter to transportation networks such as metro lines.

Tools for data visualization:
1. Tableau
2. Microsoft Power BI
3. Google Data Studio
4. D3.js
5. Plotly
6. Qlik Sense
7. Excel
8. SAS Visual Analytics
9. Grafana
10. Sisense
11. Highcharts
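To close the section, here is a minimal sketch of three of the techniques above (histogram, box plot, scatter plot) using the Matplotlib and Seaborn tools mentioned in the syllabus. The data is synthetic and the column names ('age', 'income', 'group') are hypothetical.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "age": rng.integers(18, 70, size=300),           # hypothetical columns
        "income": rng.normal(50000, 15000, size=300),
        "group": rng.choice(["A", "B", "C"], size=300),
    })

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Histogram: distribution of a numerical variable.
    sns.histplot(data=df, x="income", ax=axes[0])

    # Box plot: compare a numerical variable across categories.
    sns.boxplot(data=df, x="group", y="income", ax=axes[1])

    # Scatter plot: relationship between two continuous variables.
    sns.scatterplot(data=df, x="age", y="income", hue="group", ax=axes[2])

    plt.tight_layout()
    plt.show()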
