Data Analysis for COMA 51: Understanding Data Around Us PDF

Summary

This document provides information on understanding data, data analysis methods, and data visualization techniques, covering topics such as levels of measurement, exploratory data analysis (EDA), and the creation of pivot charts. The document also briefly covers Excel dashboards. It is suitable for undergraduate students.

Full Transcript

**COMA 51 UNDERSTANDING THE DATA AROUND US**

**Module 1. Data Around Us**

**Sources of Data**

Internal Sources: data are collected from reports and records of the organization itself.

External Sources: data are collected from sources outside the organization.

**Data Types (Levels of Measurement)**

**Levels of measurement**, also called scales of measurement, tell you how precisely [variables](https://www.scribbr.com/methodology/independent-and-dependent-variables/) are recorded. In scientific research, a variable is anything that can take on different values across your data set (e.g., height or test scores).

**There are 4 levels of measurement**:

- **[Nominal](https://www.scribbr.com/statistics/nominal-data/):** the data can only be categorized
- **[Ordinal](https://www.scribbr.com/statistics/ordinal-data/):** the data can be categorized and ranked
- **[Interval](https://www.scribbr.com/statistics/interval-data/):** the data can be categorized, ranked, and evenly spaced
- **[Ratio](https://www.scribbr.com/statistics/ratio-data/):** the data can be categorized, ranked, and evenly spaced, and have a natural zero.

Depending on the level of measurement of the variable, what you can do to analyze your data may be limited. There is a hierarchy in the complexity and precision of the level of measurement, from low (nominal) to high (ratio).

**Nominal, ordinal, interval, and ratio data**

Going from lowest to highest, the 4 levels of measurement are cumulative. This means that they each take on the properties of lower levels and add new properties.

| Level | What you can do | Examples |
|---|---|---|
| **Nominal level** | You can categorize your data by [labelling](https://www.scribbr.com/us-vs-uk/labelled-or-labeled/) them in mutually exclusive groups, but there is no order between the categories. | City of birth; Gender; Ethnicity; Car brands; Marital status |
| **Ordinal level** | You can categorize and rank your data in an order, but you cannot say anything about the intervals between the rankings. Although you can rank the top 5 Olympic medalists, this scale does not tell you how close or far apart they are in number of wins. | Top 5 Olympic medalists; Language ability (e.g., beginner, intermediate, fluent); [Likert-type questions](https://www.scribbr.com/methodology/likert-scale/) (e.g., very dissatisfied to very satisfied) |
| **Interval level** | You can categorize, rank, and [infer](https://www.scribbr.com/commonly-confused-words/infer-vs-imply/#infer) equal intervals between neighboring data points, but there is no true zero point. The difference between any two adjacent temperatures is the same: one degree. But zero degrees is defined differently depending on the scale; it doesn't mean an absolute absence of temperature. The same is true for test scores and personality inventories. A zero on a test is arbitrary; it does not mean that the test-taker has an absolute lack of the trait being measured. | Test scores (e.g., IQ or exams); Personality inventories; Temperature in Fahrenheit or Celsius |
| **Ratio level** | You can categorize, rank, and infer equal intervals between neighboring data points, and there is a **true zero point**. A true zero means there is an absence of the variable of interest. In ratio scales, zero does mean an absolute lack of the variable. For example, in the Kelvin temperature scale, there are no negative degrees of temperature; zero means an absolute lack of thermal energy. | Height; Age; Weight; Temperature in Kelvin |

**Why are data important?**

The list below shares twelve reasons why data are important, what you can do with them, and how they relate to the human services field.

**1. Improve People's Lives**

Data will help you to improve quality of life for people you support: Improving quality is first and foremost among the reasons why organizations should be using data. By allowing you to measure and take action, an [effective data system](https://www.c-q-l.org/tools/portal-data-system/) can enable your organization to improve the quality of people's lives.

**2. Make Informed Decisions**

Data = Knowledge. Good data provide indisputable evidence, while anecdotal evidence, assumptions, or abstract observation might lead to wasted resources due to taking action based on an incorrect conclusion.

**3. Stop Molehills from Turning into Mountains**

Data allow you to monitor the health of important systems in your organization: By utilizing data for [quality monitoring](https://www.c-q-l.org/resources/newsletters/demystifying-factor-10/), organizations are able to respond to challenges before they become full-blown crises. Effective quality monitoring will allow your organization to be proactive rather than reactive and will support the organization to maintain best practices over time.

**4. Get the Results You Want**

Data allow organizations to measure the effectiveness of a given strategy: When strategies are put into place to overcome a challenge, collecting data will allow you to determine how well your solution is performing, and whether or not your approach needs to be tweaked or changed over the long term.

**5. Find Solutions to Problems**

Data allow organizations to more effectively determine the cause of problems. Data allow organizations to visualize relationships between what is happening in different locations, departments, and systems. If the number of medication errors has gone up, is there an issue such as [staff turnover or vacancy rates](https://www.c-q-l.org/resources/articles/the-dsp-crisis-reimbursement-rates-retention-and-research/) that may suggest a cause? Looking at these data points side by side allows us to develop more accurate theories and put into place more effective solutions.

**6. Back Up Your Arguments**

Data are a key component of systems advocacy. Utilizing data will help present a strong argument for systems change. Whether you are advocating for increased funding from public or private sources, or making the case for changes in regulation, illustrating your argument through the use of data will allow you to demonstrate why changes are needed.

**7. Stop the Guessing Game**

Data will help you explain (both good and bad) decisions to your stakeholders.
Whether or not your strategies and decisions have the outcome you anticipated, you can be confident that you developed your approach based not upon guesses, but on good, solid data.

**8. Be Strategic in Your Approaches**

Data increase efficiency. Effective data collection and analysis will allow you to direct scarce resources where they are most needed. If an increase in significant incidents is noted in a particular service area, these data can be dissected further to determine whether the increase is widespread or [isolated to a particular site](https://www.c-q-l.org/resources/articles/portal-analytics-provider-codes/). If the issue is isolated, training, staffing, or other resources can be deployed precisely where they are needed, as opposed to system-wide. Data will also support organizations to determine which areas should take priority over others.

**9. Know What You Are Doing Well**

Data allow you to [replicate areas of strength](https://www.c-q-l.org/resources/newsletters/focusing-on-possibilities-through-appreciative-inquiry/) across your organization. Data analysis will support you to identify high-performing programs, service areas, and people. Once you identify your high-performers, you can study them in order to develop strategies to assist programs, service areas, and people that are low-performing.

**10. Keep Track of It All**

Good data allow organizations to establish baselines, benchmarks, and goals to keep moving forward. Because data allow you to measure, you will be able to establish baselines, find benchmarks, and set performance goals. A baseline is what a certain area looks like before a particular solution is implemented. Benchmarks establish where others are at in a similar demographic, such as [Personal Outcome Measures®](https://www.c-q-l.org/tools/personal-outcome-measures/) national data. Collecting data will allow your organization to set goals for performance and celebrate your successes when they are achieved.

**11. Make the Most of Your Money**

Funding is increasingly [outcome and data-driven](https://www.c-q-l.org/resources/guides/personal-outcome-measures-25-years-of-person-centered-discovery-and-achieving-outcomes/). With the shift from funding that is based on services provided to funding that is based on outcomes achieved, it is increasingly important for organizations to implement evidence-based practice and develop systems to collect and analyze data.

**12. Access the Resources Around You**

Your organization probably already has most of the data and expertise you need to begin analysis. Your HR office probably already tracks data regarding your staff. You are probably already reporting data regarding incidents to your state oversight agency. You probably have at least one person in your organization who has experience with Excel. But, if you don't do any of these things, there is still hope! There are lots of free resources online that can get you started. Do a web search for "how to analyze data" or "how to make a chart in Excel."

**Data Vs Information**

**What are data?**

Data are representations of individual facts or statistics in their **raw form**. Since data contain only figures and numbers, they don't hold any significance until a researcher analyses or contextualizes them. For example, data can signify basic numbers regarding the price of an object, the score on a test, or the current temperature outdoors. Researchers express data in multiple forms, including:

- Words
- Standard numbers
- Percentages
- Characters
- Symbols

**What is information?**

**Information** is an interpretation of data, where researchers identify patterns and draw conclusions based on **raw figures**. Using data, a researcher can draw meaningful conclusions about their desired subject. Information also has different meanings based on the context. For instance, the data tell you the temperature outdoors.
One person may conclude that the temperature is high, while another person declares that it's low. When someone describes the temperature as being high, they're providing information about their interpretation of the weather. You can also use the information to display averages. For example, a meteorologist assesses the number of tropical storms during the summer and determines the season that achieved a historical record. This requires context to understand the typical number of storms per season and how the current season's data compare.

**Module 2. Gathering and Cleaning Data**

**Ways of Obtaining Data**

The following are some methods of collecting data:

1. Interview Method. The researcher gathers data by asking the interviewee a series of questions.
2. Questionnaire Method. The researcher distributes the questionnaires either personally or by mail and collects them by the same process.
3. Registration Method. The researcher gathers data from the offices concerned, e.g., the National Statistics Office (NSO), the Commission on Elections (COMELEC), municipal halls, or barangay offices.
4. Experimental Method. This method of collecting data is used to find out the cause-and-effect relationship of certain phenomena under controlled conditions.
5. Observation Method. The researcher may observe individual subjects or groups of individuals to obtain data and information related to the objectives of the investigation.
6. Texting Method. The researcher may ask or invite individuals to text their opinions on certain issues, or their brand preferences for a particular product, using their cellphones.

**Tools for Data Cleaning**

1. **OpenRefine**. Known previously as Google Refine, OpenRefine is a well-known open-source data tool. Its main benefit over other tools on our list is that, being open source, it is free to use and customize. OpenRefine lets you transform data between different formats and ensure that data is cleanly structured.
2. **Trifacta Wrangler**. A connected desktop application, Trifacta Wrangler lets you transform data, carry out analyses, and produce visualizations. Its standout feature is its use of smart tech. Utilizing machine learning to spot inconsistencies and make recommendations, the tool vastly speeds up the data cleaning process. For instance, its artificial intelligence algorithms can easily identify and remove outliers, as well as automate overall [**data quality**](https://careerfoundry.com/en/blog/data-analytics/what-is-data-quality/) monitoring, a helpful feature for ongoing data housekeeping.
3. **Winpure Clean & Match**. A bit like Trifacta Wrangler, the award-winning Winpure Clean & Match allows you to clean, de-dupe, and cross-match data, all via its intuitive user interface. Because it is installed locally, you don't have to worry about data security unless you're uploading your dataset to the cloud. This is an especially important feature for Winpure, which is specifically designed for cleaning business and customer data (such as CRM data and mailing lists). Winpure Clean & Match also interoperates with a very wide variety of databases and spreadsheets, from CSV files to SQL Server, Salesforce, and Oracle.
4. **TIBCO Clarity**. A cloud-based software-as-a-service (SaaS) tool, TIBCO Clarity is ideal for cleaning raw data and analyzing it all in one location. It's a feature-rich data cleaning tool that ingests data from dozens of different sources, from XLS and JSON files to compressed file formats, as well as a wide range of online repositories and data warehouses. Beyond this, TIBCO offers everything from data mapping functionality to extract, transform, load (ETL), data profiling, sampling and batch functionality, de-duping, and much more. It also boasts some helpful nice-to-have features, such as 'transformation undo.' This is not available with all tools, but it's a great feature if you're not happy with a change you've made.
5. **Melissa Clean Suite**.
Melissa Clean Suite is a highly targeted data cleaning and management tool. It's designed specifically to support the Salesforce and Microsoft Dynamics customer relationship management (CRM) systems, which many businesses use. Because it's focused on these two systems, it caters to their unique features.
6. **IBM Infosphere Quality Stage**. IBM Infosphere Quality Stage is one of a broader selection of data management tools from IBM. It focuses, as the name suggests, on data quality and governance. While it deals with the usual suspects (data matching, de-duping, etc.), it is specifically designed to clean [**big data**](https://careerfoundry.com/en/blog/data-analytics/what-is-big-data/) for business intelligence purposes. For this purpose, it has about 200 in-built data quality rules, saving analysts tons of time managing these tasks manually with scripts.
7. **Datamatch Enterprise by Data Ladder**. Datamatch Enterprise by Data Ladder is a visually-driven data cleaning application. Like many of the other tools on our list, it focuses on customer data. However, unlike others, it is designed specifically to resolve data quality issues within datasets that are already in a poor condition. Instinctive and simple to use, it employs a walkthrough interface to support you through the data process from start to finish.

**Module 3. EXPLORATORY DATA ANALYSIS FOR NUMERICAL ANALYSIS**

**What is EDA?**

**Exploratory data analysis** (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing **data visualization methods**. It is used to help look at data before making any assumptions. It can help identify errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relationships among the variables. It is an analysis approach that identifies general patterns in the data. The most common tools for EDA are R and Python.
R is a programming language for statistical computing and data visualization.

**NUMERICAL SUMMARIZATION**

There are many ways to numerically summarize data. The fundamental idea is to **describe** the center, or most *probable* values of the data, as well as the **spread**, or the *possible* values of the data. The measures commonly used to numerically summarize data are the mean, median, mode, quartiles, and percentiles; measures of spread such as the standard deviation, variance, and range; and proportion and correlation.

**Data Visualization**

**Data visualization** is the **graphical representation** of information and data. By using [visual elements like charts, graphs, and maps](https://www.tableau.com/data-insights/reference-library/visual-analytics), data visualization tools provide an accessible way to **see** and **understand** trends, outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to present data to non-technical audiences without confusion. In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.

**General Types of Visualizations:**

- **Chart:** Information presented in a tabular, graphical form with data displayed along two axes. Can be in the form of a graph, diagram, or map.
- **Table:** A set of figures displayed in rows and columns.
- **Graph:** A diagram of points, lines, segments, curves, or areas that represents certain variables in comparison to each other, usually along two axes at a right angle.
- **Geospatial:** A visualization that shows data in map form using different shapes and colors to show the relationship between pieces of data and specific locations.
- **Infographic:** A combination of visuals and words that represent data. Usually uses charts or diagrams.
- **Dashboards:** A collection of visualizations and data displayed in one place to help with analyzing and presenting data.

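The summaries named under NUMERICAL SUMMARIZATION above can be computed directly in Python, one of the EDA tools mentioned earlier. The following is a minimal sketch using the standard-library `statistics` module. The test scores are made-up values chosen for illustration (they do not come from this module), and `statistics.quantiles` is called with its default `exclusive` method, which is only one of several conventions for computing quartiles.

```python
import statistics

# Hypothetical test scores, invented for illustration only.
scores = [70, 75, 75, 80, 85, 90, 92]

# Measures of center
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Quartiles Q1, Q2, Q3 (default method is "exclusive")
quartiles = statistics.quantiles(scores, n=4)

# Measures of spread
data_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)      # sample standard deviation

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("quartiles:", quartiles)
print("range:", data_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```

The sample (n - 1) versions of the variance and standard deviation are used here; `statistics.pvariance` and `statistics.pstdev` give the population versions instead.
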
Some of the best visualization tools include: Google Charts, Tableau, Grafana, Chartist, FusionCharts, Datawrapper, Infogram, and ChartBlocks. Tableau is a popular choice for creating dashboards. It is a data visualization and business intelligence tool that helps companies make sense of their data.

A heatmap is a graphical representation of data that uses a system of color coding to represent different values. It is used to communicate the relationships that may exist between the variables plotted on the x-axis and y-axis.

**Summarizing raw data using a table, bar graph, and pie chart**

Example. A survey was conducted to determine the favorite fruit of 50 individuals. The raw data are as follows:

Apple, Banana, Orange, Apple, Mango, Banana, Apple, Mango, Orange,
Banana, Apple, Mango, Banana, Orange, Mango, Banana, Apple,
Orange, Banana, Mango, Apple, Orange, Banana, Apple, Mango,
Orange, Banana, Mango, Orange, Apple, Banana, Orange, Mango,
Apple, Mango, Orange, Banana, Apple, Orange, Mango, Apple, Orange,
Banana, Mango, Orange, Apple, Banana, Mango, Orange, Banana

a. Summarizing the Data in a Frequency Table

Fruit    Frequency
-------- -----------
Apple    12
Banana   14
Orange   13
Mango    11

b. Representing the Data with a Bar Graph

- A bar graph displays the frequency of each fruit using bars.
- The x-axis represents the fruit type, and the y-axis represents the frequency.
- Each bar's height corresponds to the fruit's frequency.
- The bar graph would show:
  - Apple: 12 (bar height = 12)
  - Banana: 14 (bar height = 14)
  - Orange: 13 (bar height = 13)
  - Mango: 11 (bar height = 11)

Bar Graph: [CHART]

c. Representing the Data with a Pie Chart

To create a pie chart, calculate the percentage of each fruit's frequency relative to the total (50).

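The percentage step can be checked with a few lines of Python. This is a minimal sketch that starts from the frequencies given in the frequency table in part (a); the variable names are illustrative assumptions, not part of the course material.

```python
# Frequencies copied from the frequency table in part (a).
frequencies = {"Apple": 12, "Banana": 14, "Orange": 13, "Mango": 11}

# Total number of survey responses.
total = sum(frequencies.values())

# Percentage of the pie chart taken by each fruit: frequency / total x 100.
# Multiplying before dividing keeps the results exact for these values.
percentages = {fruit: count * 100 / total for fruit, count in frequencies.items()}

for fruit, pct in percentages.items():
    print(f"{fruit}: {pct:.0f}% of the chart")
```

Because the percentages are shares of one whole, they should always add up to 100, which is a quick way to catch a counting mistake before drawing the chart.
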
Percentage calculations:

- Percentage of Apple = 12/50 x 100 = 24%
- Percentage of Banana = 14/50 x 100 = 28%
- Percentage of Orange = 13/50 x 100 = 26%
- Percentage of Mango = 11/50 x 100 = 22%

Pie Chart Description:

- Apple: 24% of the chart
- Banana: 28% of the chart
- Orange: 26% of the chart
- Mango: 22% of the chart

Pie Chart: [CHART]

**Practice Exercise**

**Construct a frequency distribution table, pie chart, and bar graph of the following data, and give each visual representation an appropriate title.**

1. Categories: Vanilla, Chocolate, Strawberry, Mango, Mint

   Raw Data: Chocolate, Vanilla, Mango, Mint, Strawberry, Vanilla, Chocolate, Mango, Mint, Strawberry, Vanilla, Mint, Mango, Chocolate, Vanilla, Strawberry, Mint, Chocolate, Vanilla, Mango

2. Categories: Car, Bus, Train, Bike, Walk

   Raw Data: Car, Bus, Train, Walk, Bike, Car, Bus, Train, Walk, Car, Bike, Bus, Walk, Train, Car, Bike, Walk, Train, Bus, Car

**Module 4. EXPLORATORY DATA VISUALIZATION**

**Box and Whisker Plot Chart**

A box and whisker plot or diagram (otherwise known as a boxplot) is a graph summarizing a set of data. The shape of the boxplot shows how the data are distributed, and it also shows any outliers. It is a useful way to compare different sets of data, as you can draw more than one boxplot per graph.

**Construction of a Box-and-Whisker Plot Manually**

1. Draw a horizontal scale that extends from the minimum data value to the maximum data value.
2. Above the scale, draw a rectangular box with its left side at Q1 and its right side at Q3.
3. Draw a vertical line segment across the rectangle at the median (Q2).
4. Draw a horizontal line segment, called a whisker, that extends from Q1 to the minimum and another whisker that extends from Q3 to the maximum.

Example. The following table lists the calories per 100 milliliters of 25 popular sodas. Construct a box-and-whisker plot for the data set.

Calories, per 100 milliliters, of Selected Sodas

---- ---- ---- ---- ----
43   37   42   40   53
62   36   32   50   49
26   53   73   48   45
39   45   48   40   56
41   36   58   42   39
---- ---- ---- ---- ----

Solution

We arrange the data in ascending order, then determine Q1, Q2, and Q3, and identify the minimum and maximum values. In ascending order, the data are: 26, 32, 36, 36, 37, 39, 39, 40, 40, 41, 42, 42, 43, 45, 45, 48, 48, 49, 50, 53, 53, 56, 58, 62, 73. The median Q2 is the 13th value, and Q1 and Q3 are the medians of the 12 values below and above it. Thus, Q1 = 39, Q2 = 43, Q3 = 51.5, min. = 26, and max. = 73. The box-and-whisker plot is shown below.

**PRACTICE EXERCISE**

1. The table below shows the heights, in inches, of 15 randomly selected National Basketball Association (NBA) players and 15 randomly selected Division I National Collegiate Athletic Association (NCAA) players.

------ ---- ---- ---- ---- ---- ---- ---- ----
NBA    84   76   79   75   81   81   76   85
       78   79   78   78   84   75   76
NCAA   78   73   73   78   77   76   75   74
       74   81   75   78   78   79   73
------ ---- ---- ---- ---- ---- ---- ---- ----

Using the same scale, draw a box-and-whisker plot for each of the two data sets, placing the second plot below the first. Write a valid conclusion based on the data.

**PIVOT CHART**

A pivot chart is the visual representation of a pivot table in Excel. Pivot charts are effective for presenting data insights and sharing reports with others. Pivot charts and pivot tables are connected with each other. A pivot table is a powerful tool to calculate, summarize, and analyze data that lets you see comparisons, patterns, and trends in your data.

Below you can find a two-dimensional pivot table. Go back to [Pivot Tables](https://www.excel-easy.com/data-analysis/pivot-tables.html#two-dimensional-pivot-table) to learn how to create this pivot table.

![](media/image3.png)

Insert Pivot Chart

To insert a pivot chart, execute the following steps.

1. Click any cell inside the pivot table.
2. On the PivotTable Analyze tab, in the Tools group, click PivotChart.

   The Insert Chart dialog box appears.

3. Click OK.

Below you can find the pivot chart. This pivot chart will amaze and impress your boss.

![Pivot Chart in Excel](media/image5.png)

Note: any changes you make to the pivot chart are immediately reflected in the pivot table and vice versa.

Practice Exercise

Create a pivot table and a pivot chart for the raw data below.

Date        Region         Car Brand  Salesperson  Units Sold  Revenue ($)
----------  -------------  ---------  -----------  ----------  -----------
2015-01-01  North America  Brand A    Alice        50          125,000
2015-01-01  Europe         Brand B    Rob          40          100,000
2025-01-02  Asia           Brand C    Charlie      60          150,000
2025-01-02  North America  Brand D    Alice        45          112,500
2025-01-03  Europe         Brand E    David        30          75,000
2025-01-03  Asia           Brand A    Charlie      55          137,000
2025-01-04  North America  Brand B    Eve          65          162,000
2025-01-04  Europe         Brand C    Bob          50          125,000
2025-01-05  Asia           Brand D    Charlie      70          175,000
2025-01-05  North America  Brand E    Alice        35          87,500

**EXCEL DASHBOARD**

A **dashboard** is a visual representation of key metrics that allows you to quickly view and analyze your data in one place. Dashboards not only provide consolidated data views, but also a self-service business intelligence opportunity, where users are able to filter the data to display just what's important to them. In the past, Excel reporting often required you to generate multiple reports for different people or departments depending on their needs.
