Market Research Information Processing Lesson 8
Loyola University Chicago
2024
Rosa M. Muñoz Gómez
Summary
This document is a lecture or lesson on market research and information processing, focusing on various statistical techniques. It details multivariate analysis, frequencies, descriptives, contingency tables, classification methods (including hierarchical and k-means clustering), and the CHAID technique.
Full Transcript
INDEX
1. Multivariate statistical analysis. Overview
2. Fundamental techniques in synthesis cases
‣ Frequencies and descriptives
‣ Contingency tables
3. Classification methods
‣ Hierarchical clustering
‣ K-means clustering
‣ CHAID method

Multivariate statistical analysis. Overview

Multivariate statistical analysis
Multivariate statistical analysis uses data that consist of sets of measurements on a number of individuals or objects. It is the area of statistics that deals with observations made on many variables.
Rosa M. Muñoz Gómez ©

Multivariate statistical analysis
"Set of statistical methods that simultaneously analyse more than two variables within a sample of observations." M.G. Kendall

Multivariate statistical analysis
Objectives
1. To establish relationships between the variables.
2. To unveil latent structures.
3. To reduce the dimension of a data set without losing information, by discarding redundant inputs.
4. To establish operative laws and classification criteria.

Multivariate statistical analysis
The main objective is to study how the variables are related to one another, and how they work in combination to distinguish between the cases on which the observations are made.

Fundamental techniques in synthesis cases

Synthesis cases
1. Frequencies and descriptives.
2. Contingency tables.

Frequencies and descriptives
‣ Objective: to synthesize the information about the variable.
‣ Input: discrete or continuous variables* in any measurement scale.
‣ Output: frequency distribution, univariate statistics, graphs.
*Discrete variable: a variable that can take only the values of a countable set, i.e. it does not accept any value, only those that belong to the set.
For example, the number of children a woman has cannot be 5.7 or 1.5; it can only be 0, 1, 2, 3, 4...
Continuous variables are those that have an infinite number of possible values between any two values. They can be numerical variables, or date and time variables. For example, the length of a product and the date and time a payment is received are continuous variables (there may be intermediate values).

Statistics
[Table: statistics for the variable Age: mean, median, mode]

Mean, median and mode
‣ Mean: the mean or average of a data set is found by adding all the numbers in the data set and then dividing by the number of values in the set.
‣ Median: the middle value when a data set is ordered from least to greatest.
‣ Mode: the number that occurs most often in a data set.

Histogram
A histogram represents a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the frequency for that interval (how many cases fall in that interval). For example: how many individuals are aged 19 to 20.

Contingency tables
‣ Objective: to analyse whether there is interdependence among the variables.
‣ Input: discrete or continuous variables (defined by intervals) in any measurement scale.
‣ Output:
1. Tables that show the joint distribution of two variables, or the distribution conditioned on certain values or categories of other variables (we define these conditions as layers).
2. Statistical indicators that detect whether a relationship exists and, if it does, quantify it.

Case processing summary
[Table: case processing summary for the contingency table gender vs risk, with Valid and Lost* cases]
Example: contingency table of gender vs risk, where "risk" is the answer to the statement "Entrepreneuring is risky."
*Lost: cases that did not answer one of the two questions.
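As a rough illustration of the two outputs listed for contingency tables, the sketch below builds the joint distribution of two variables from raw answers and computes Pearson's chi-square statistic as one possible indicator of a relationship. The answers are made up (they are not the lesson's survey), the function names are hypothetical, and comparing against the 5% critical value is a conventional choice rather than something read from the SPSS output.

```python
from collections import Counter

# Made-up answers, NOT the lesson's survey: (gender, answer to "Entrepreneuring is risky")
responses = (
    [("man", "agree")] * 30 + [("man", "disagree")] * 20
    + [("woman", "agree")] * 40 + [("woman", "disagree")] * 10
)

# Standard 5% critical values of the chi-square distribution, by degrees of freedom
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def contingency_table(pairs):
    """Count how many cases show each (row category, column category) combination."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return rows, cols, [[counts[(r, c)] for c in cols] for r in rows]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


rows, cols, table = contingency_table(responses)
stat, dof = chi_square(table)
significant = stat > CRITICAL_5PCT[dof]
print(rows, cols, table)
print(f"chi-square = {stat:.2f}, dof = {dof}, significant at 5%: {significant}")
```

A statistic above the critical value plays the same role as a significance below 0.05 in the software output: the two variables are unlikely to be independent.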
Contingency table: gender vs risk (*)
(*) We cross the variable "gender" (man/woman) with the variable "risk", or "aversion to risk".
[Table columns: Totally disagree, Disagree, Indifferent, Agree, Totally agree]

SPSS results
Contingency table: gender vs risk
‣ Count: the total number of cases that have both characteristics at the same time. There are 7 men who totally disagree with the statement "Entrepreneuring is risky".
‣ % within gender: the distribution of the variable "risk" among men. 0.8% of men think that entrepreneurship is not risky.
‣ Of those who answered "totally disagree", 63.6% are men and 36.4% are women.
‣ Men who "totally disagree" are 0.4% of the total cases.
‣ The margins show the total number of men, the total number of women, and the total number of people who answered "totally disagree".

SPSS results
Chi-square tests
From this distribution of cases alone we cannot tell whether the aversion to risk is higher for men or for women, nor whether the result is significant. To test significance we use chi-square tests, which determine whether or not there is a relationship between the two variables. If the significance is lower than 0.05, we can state that these results are unlikely to be due to chance: there is a relationship between being a man or a woman and having a higher or lower aversion to risk.

SPSS results
Contingency table: gender vs risk
To find out which gender has a higher aversion to risk, we compare the % within each gender. If we compare the % of men and of women who "Totally agree" or "Agree" with the statement "Entrepreneuring is risky", we can conclude that aversion to the risk of entrepreneurship is higher among women.

Classification methods

Classification methods
We use different techniques for segmentation and for building typologies.
These two processes are opposite. With segmentation we decompose: we consider the market as uniform and look for criteria to separate parts of that market and build segments. The cluster technique, in contrast, considers every individual as different (each has a different consumption and behaviour pattern), and we look for affinities among individuals to gather them into groups.

Classification methods
Classification methods can be used for building clusters or obtaining segments. Clustering techniques can use one of these two algorithms to build typologies:
1. Hierarchical clustering.
2. K-means clustering.
The CHAID technique (Chi-Square Automatic Interaction Detector) is a pure segmentation technique.

Hierarchical clustering
‣ Objective: to identify groups of cases (or variables) that are relatively homogeneous with regard to certain selected characteristics.
‣ Input: discrete or continuous variables in any measurement scale. However, when grouping several variables, at least one of them must be on an interval or ratio scale. If you group observations, there should be at least three of them.
‣ Output: quantitative information and graphs (dendrogram) showing how the clusters were formed at each stage.

Hierarchical clustering
‣ We usually start the analysis with hierarchical clustering to answer the question: how many groups will we form?
‣ A real application of this method is the way clothing sizes are decided. The average measurements for each size are calculated considering the different body types found in our society.

Dendrogram
A dendrogram is a graphic representation of the rescaled distance (with a maximum of 25). "Distance" is the difference between the values of every variable for one individual and the values for the rest of the individuals (a subtraction).
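The ideas above (distance between individuals, merging stage by stage, rescaling to a maximum of 25, and asking how many groups to form) can be sketched as follows. This is only an illustration under stated assumptions: Euclidean distance and single linkage are choices made here, since the slides speak only of a "difference" and do not name the distance measure or linkage rule the software applies. The five cases and all function names are made up.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

# Made-up cases: identifier -> values of two variables
cases = {"A": (1.0, 1.0), "B": (1.0, 2.0), "C": (5.0, 5.0), "D": (5.0, 6.5), "E": (12.0, 1.0)}


def single_linkage(points):
    """Repeatedly merge the two closest clusters; return one (a, b, distance) per stage."""
    clusters = [(name,) for name in points]
    stages = []
    while len(clusters) > 1:
        # Distance between two clusters = distance between their closest members
        d, a, b = min(
            (min(dist(points[p], points[q]) for p in a for q in b), a, b)
            for a, b in combinations(clusters, 2)
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        stages.append((a, b, d))
    return stages


def rescale(stages, top=25):
    """Rescale merge distances so that the last (largest) one equals `top`."""
    largest = stages[-1][2]
    return [(a, b, top * d / largest) for a, b, d in stages]


def groups_at(cut, points, stages):
    """Groups obtained when only merges at or below the cut distance are kept."""
    group_of = {name: {name} for name in points}
    for a, b, d in stages:
        if d <= cut:
            merged = set(a) | set(b)
            for name in merged:
                group_of[name] = merged
    return {frozenset(g) for g in group_of.values()}


stages = rescale(single_linkage(cases))
for a, b, d in stages:
    print(f"{a} + {b} joined at rescaled distance {d:.1f}")
print(len(groups_at(10, cases, stages)), "groups when cutting at 10")
```

Each printed stage corresponds to one junction in a dendrogram; the later two cases or groups join, the less similar they are.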
Dendrogram
[Figure: dendrogram labelled by case identification number, with Original Cluster 1 and Original Cluster 2 marked]
‣ Cases 657 and 671 are very similar. Cases 209 and 878 are also very similar.
‣ Case 337 is clearly not similar to the others. That is why it is not linked to the others until the maximum distance (25).
‣ Case 431 resembles case 209, but less closely, which is why it joins at a greater distance.
‣ If we cut at 7.5 we can see how many groups are formed: 3.

K-means clustering

K-means clustering
‣ Objective: to form groups of homogeneous cases based on the selected characteristics (variables).
‣ Input: discrete or continuous variables on any scale of measurement (at least one variable should be expressed on an interval scale or higher). We need to specify the number of clusters to be formed.
‣ Output: the value of the centres of each cluster and the number of cases assigned to each cluster.

K-means clustering
The capacity of hierarchical clustering for processing data is somewhat limited; k-means has a higher capacity. Hierarchical clustering is used for exploratory analysis: how many clusters can we find? K-means requires us to set a specific number of clusters or a range.

K-means clustering
[Tables: centres of the initial clusters and centres of the final clusters for "Motivations for entrepreneuring". Variables: Make more money, New challenges, Social recognition, There are more entrepreneur men, Economic reward, Economic risk, Age. Values legible in the extraction: 5 and 80.]
How do we identify these clusters?
"Make more money" is rated from 1 (absolutely unimportant) to 5 (absolutely important). We could say that Cluster 1 gives no importance to making more money as a motivation for entrepreneurship. The other 3 clusters consider "To make more money" an important or very important motivation.
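A minimal sketch of the input and output described above for k-means: the analyst fixes the number of clusters, and the procedure returns the cluster centres and the number of cases per cluster. It implements the plain textbook (Lloyd's) iteration; taking the first k cases as initial centres is an assumption made here for simplicity, and the ratings are made up. The slides do not describe how the software picks its initial centres.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def k_means(points, k, iterations=100):
    """Plain k-means; the initial centres are simply the first k cases (an assumption)."""
    centres = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: each case goes to its nearest centre
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centres[i]))
            groups[nearest].append(p)
        # Update step: each centre moves to the mean of the cases assigned to it
        new_centres = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centres[i]
            for i, g in enumerate(groups)
        ]
        if new_centres == centres:  # nothing moved, so the clusters are final
            break
        centres = new_centres
    return centres, [len(g) for g in groups]


# Made-up ratings on a 1-5 scale: ("make more money", "new challenges")
ratings = [(1, 2), (5, 5), (1, 1), (2, 1), (5, 4), (4, 5)]
final_centres, cases_per_cluster = k_means(ratings, k=2)
print("final centres:", final_centres)
print("cases per cluster:", cases_per_cluster)
```

Reading the final centres variable by variable is what the slides do when they describe each cluster, for example by how much importance it gives to "Make more money".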
K-means clustering
[Same tables: centres of the initial clusters and centres of the final clusters for "Motivations for entrepreneuring"]
How do we identify these clusters?
Regarding "There are more entrepreneur men": Cluster 1 subjects don't agree, Cluster 2 subjects are indifferent, Cluster 3 subjects totally agree and Cluster 4 subjects totally disagree.
The individuals have also rated, on a scale from 0 to 100, the economic reward and the economic risk of entrepreneurship.

K-means clustering
[Same tables: centres of the initial clusters and centres of the final clusters]
How do we identify these clusters?
In the final clusters, the differences have diluted. The clusters now resemble each other.

K-means clustering
[Table: number of cases per cluster, with Valid and Lost cases]
‣ Cluster 1: 315 observations (non-entrepreneurs with absolute certainty).
‣ Cluster 2: 907 observations (undecided).
‣ Cluster 3: one individual (a rare case). Sometimes the algorithm includes a cluster to gather the residual cases.
‣ Cluster 4: 462 cases (entrepreneurs with absolute certainty).

CHAID technique

CHAID technique
Chi-Square Automatic Interaction Detector (CHAID) is an algorithm to study the relationship between a dependent variable and a series of predictor variables.

CHAID technique
‣ Objective: this algorithm aims to explain the behaviour of a dependent variable by identifying the best predictor variable or set of predictor variables.
It also identifies the different types of variables, selecting the best predictor variable with a chi-square significance test. It is a hierarchical, explanatory and decompositional technique.
‣ Input: discrete variables on a nominal or ordinal scale. It is possible to use frequency variables (which help to reduce the number of cases) and weighting variables.

CHAID technique
‣ Output: this method identifies the best predictor for the dependent variable, as well as the profile or profiles with the highest predictive power. It creates contingency tables for each interaction/iteration. It also provides a segmentation chart for each segment with regard to the dependent variable.*
*It is a type of decision tree that uses statistical tests to identify and split data into distinct groups. By analyzing the relationship between a target variable and predictor variables, CHAID finds the most significant splits by examining chi-square tests of independence. This process continues iteratively to build a tree structure, where each branch represents a subset of the data with similar characteristics. CHAID is particularly useful in marketing and research for understanding variable relationships and predicting categorical outcomes.

CHAID technique
[Figure: CHAID tree for FUTURE ENTREPRENEUR. Green: I won't be entrepreneur. Blue: I will be entrepreneur. Node labels: "I will be entrepreneur for … To continue with my family tradition", "Take risks", "I am ready to set up a company", "New challenges" (twice).]
We wanted to explain the variable that records whether or not a university student will be an entrepreneur in the future. The aim was to identify the variables that most influence those who want to be entrepreneurs and those who don't.
We must differentiate between segmentation variables and profiles, as the model offers both.
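To make the idea of "selecting the best predictor with a chi-square test" concrete, here is a hedged sketch of the first split only. It assumes every variable is a yes/no answer, so each contingency table is 2x2 and the p-value can be computed exactly for one degree of freedom. The respondents, variable names and helper functions are made up for illustration, and the steps of the full CHAID procedure that merge categories and adjust the significance for multiple comparisons are left out.

```python
from collections import Counter
from math import erfc, sqrt


def make(family, risks, future, n):
    """Build n identical made-up respondents."""
    return [{"family_tradition": family, "takes_risks": risks, "future_entrepreneur": future}] * n


# Made-up data, NOT the lesson's survey
respondents = (
    make("yes", "yes", "yes", 30) + make("yes", "no", "yes", 15)
    + make("yes", "yes", "no", 3) + make("yes", "no", "no", 2)
    + make("no", "yes", "yes", 10) + make("no", "no", "yes", 5)
    + make("no", "yes", "no", 12) + make("no", "no", "no", 23)
)


def chi_square_test(rows, predictor, target):
    """Chi-square statistic and p-value for a 2x2 table of predictor vs target."""
    counts = Counter((r[predictor], r[target]) for r in rows)
    p_levels = sorted({a for a, _ in counts})
    t_levels = sorted({b for _, b in counts})
    table = [[counts[(a, b)] for b in t_levels] for a in p_levels]
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    # This p-value formula holds for 1 degree of freedom only (2x2 tables)
    return stat, erfc(sqrt(stat / 2))


predictors = ["family_tradition", "takes_risks"]
results = {v: chi_square_test(respondents, v, "future_entrepreneur") for v in predictors}
best = min(predictors, key=lambda v: results[v][1])  # smallest p-value wins
for v, (stat, p) in results.items():
    print(f"{v}: chi-square = {stat:.2f}, p = {p:.4g}")
print("first split on:", best)
```

Repeating the same selection inside each resulting segment, until a segment is too small or no predictor is significant, is what produces the tree.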
CHAID technique
[Same figure: CHAID tree for FUTURE ENTREPRENEUR]
‣ The best predictor of whether a university student will become an entrepreneur is "The family tradition": whether they answer that they would start a company to continue their family tradition.
‣ The second best predictor is their "Risk attitude": whether or not they are willing to take risks.
‣ We don't subdivide this segment, either because its size is very small or because there is no suitable predictor to divide it.

CHAID technique
[Same figure: CHAID tree for FUTURE ENTREPRENEUR]
We read the characteristics of a profile from the bottom up. What is the profile of the university student who will probably be an entrepreneur? "A person interested in new challenges, willing to accept risks and determined to continue their family tradition of entrepreneurship."
Terminal node: a node that can't be subdivided.

CHAID technique
[Table: classification table. Rows: Observed (I won't be entrepreneur, I will be entrepreneur, Global percentage). Columns: Forecasted, Correct percentage. Also listed: Growth method, Dependent variable.]
We know the validity of the model thanks to the "Classification table". The rows indicate the observed values, that is, what every individual actually answered. The columns indicate the forecasted values.

CHAID technique
[Same table: classification table]
There are 252 individuals who answered "I won't be entrepreneur" and whom the method has classified as "non entrepreneur".
Therefore, they have been correctly classified.
There are 916 who answered "I will be entrepreneur" and have been correctly classified as "I will be entrepreneur". 76 of the entrepreneurs and 43 of the non-entrepreneurs have been incorrectly classified.
The success rate is 92.3% for classifying the entrepreneurs and 85.4% for the non-entrepreneurs.

CHAID technique
[Same table: classification table]
The global percentage of success is 90.8%. This means that the model has quite good predictive quality.

End of lesson 8
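As a check on the classification table, the percentages quoted in the lesson can be recomputed from the four counts it reports. The layout below (rows observed, columns forecasted) follows the description in the slides; the variable names are hypothetical, and placing each misclassified case in the opposite category is the only possibility with two categories.

```python
WONT = "I won't be entrepreneur"
WILL = "I will be entrepreneur"

# Counts reported in the lesson: outer key = observed answer, inner key = forecasted answer
table = {
    WONT: {WONT: 252, WILL: 43},
    WILL: {WONT: 76, WILL: 916},
}

# Correct percentage per observed category: diagonal cell divided by the row total
correct_pct = {obs: 100 * row[obs] / sum(row.values()) for obs, row in table.items()}

# Global percentage: all correctly classified cases divided by all cases
total = sum(sum(row.values()) for row in table.values())
global_pct = 100 * sum(row[obs] for obs, row in table.items()) / total

for obs, pct in correct_pct.items():
    print(f"{obs}: {pct:.1f}% correct")
print(f"Global percentage: {global_pct:.1f}%")
```

The results agree with the 85.4%, 92.3% and 90.8% given in the slides.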