Univariate Descriptive Statistics & Textual Analysis PDF
Emlyon Business School
Summary
This document provides an overview of univariate descriptive statistics and textual analysis, including measures of central tendency (mode, mean, median) and of dispersion (variance). It also shows how these concepts are applied in the Sphinx tool, with examples.
Full Transcript
Univariate descriptive statistics and textual analysis
DATA & CONTENT ANALYSIS
Lecture 3
Module 3 – Descriptive Statistics & textual analysis

Descriptive statistics – Measures of central tendency
- The central tendency is the extent to which all the data values group around a typical or central value.
- There are four main measures:
  1. Mode
  2. Mean
  3. Median
  4. Variance/dispersion (strictly a measure of spread around the central value rather than of central tendency itself)

Mode
- By definition, it is the outcome associated with the highest frequency.
- While it is possible to determine the mode of any type of variable, it is only relevant for qualitative variables.

Example

Brand            Unit sales
Levi's           259
Diesel           209
Guess            145
Energie          120
Gap              94
Pepe Jeans       76
Calvin Klein     61
Dolce&Gabbana    48
Armani           43

Here the mode is Levi's, the brand with the highest frequency (259 units).

Mean
- By definition, the mean is a calculated value which is representative of all the values observed in the sample, but which has no real existence of its own.
- For example, the mean size of households in France in 2015 was 2.23 people.
- Calculating a mean makes sense for quantitative variables but never for nominal variables.

Calculating the mean
- By definition, the arithmetic mean is calculated by adding all the observed values (outcomes) and dividing the sum by the number of observations:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Please review basic descriptive statistics on the related slides (Module 3 on Brightspace).

When to use the mean?
- Only for numeric variables
- A numeric variable originates from an open-ended question whose answer is a number
- Examples: How old are you? What is your monthly income? How much did you pay for your car?
- Likert scale questions

Example: open-ended question whose answer is a number

How it is visualized on Sphinx
Variables identified with the "74" symbol are numeric variables, i.e. variables stemming from open-ended questions and whose modalities are numbers.

Why use the mean?
- In most cases, we will not create a frequency table in the same way as for qualitative variables, because the number of outcomes is usually too large.
- This leads to a table with many outcomes having only 1 or 2 respondents.

In Sphinx
- Open Sphinx
- Open survey “Automobiles”
- Go to the “Analysis” module
- Don’t forget to click on “Go back to the analysis standard environment”
- Click on “New Analysis” and select “Age of the car”

Analysis of the variable
- Sphinx automatically creates “classes”
- It also calculates descriptive statistics: mean, median, standard deviation, range

Advantages/disadvantages of the mean
- The mean is a statistic that everybody understands, and the majority of people know how to calculate it.
- To calculate the mean, you take all the values into consideration, including extreme values.
- The disadvantage of the mean is its mode of calculation: this method of calculation has an impact on the value of the mean.
- The problem arises when there is a wide spread of the values around the mean, in other words when the standard deviation is large.

Median
- The median is the value that divides the observations into 2 equal parts.
- It is less influenced by extreme values and, in some cases, it represents the sample better.

Example
Why is the average income systematically larger than the median income?

Year    Average income    Median income
2010    24 580 €          20 970 €
2011    24 660 €          20 840 €
2012    24 400 €          20 830 €
2013    23 970 €          20 780 €
2014    23 970 €          20 830 €
2015    24 170 €          20 930 €
2016    24 260 €          21 120 €
2017    24 350 €          21 190 €
2018    24 650 €          21 250 €

Source: INSEE, Revenu, niveaux de vie, pauvreté en 2018

Distribution of standards of living
Table interpretation
- Median:
  - 50% of French individuals earn more than 21 250 €
  - 50% of French individuals earn less than 21 250 €
- Deciles:
  - 10% of French individuals earn more than 39 130 €
  - 30% of French individuals earn less than 16 680 €

Mean or median? That is the question.
- The general rule: you choose the mean
- Exception: if the data are too far from the mean, it is better to use the median
- Some tips
  - Are the data distributed normally? Choose the mean
  - Are there extreme values? Choose the median
  - If the standard deviation of the distribution is high → use the median
  - If the standard deviation > |Mean| → use the median

To recap
To choose the appropriate measure of central tendency, you have to determine the nature of the variable:
- Nominal variable: only the mode is appropriate
- Quantitative variable: mean or median
When you report measures of central tendency, you must always describe them and comment on them (as in the example of French incomes).

Mean or median?
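As an illustration of the tips given earlier (prefer the median when there are extreme values, or when the standard deviation exceeds |mean|), here is a minimal Python sketch. The income figures are hypothetical, the function name `summarize` is made up for this example, and the use of the population standard deviation is an assumption; the slides do not say which formula Sphinx applies.

```python
from statistics import mean, median, pstdev


def summarize(values):
    """Return the mean, median, standard deviation and a suggested measure."""
    m = mean(values)
    med = median(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    # Rule of thumb from the slides: if the standard deviation
    # is larger than |mean|, use the median; otherwise use the mean.
    suggestion = "median" if sd > abs(m) else "mean"
    return {"mean": m, "median": med, "sd": sd, "suggestion": suggestion}


# Hypothetical monthly incomes in euros; the last one is an extreme value.
incomes = [1500, 1600, 1700, 1800, 50000]
result = summarize(incomes)

print(result["mean"])        # 11320: pulled upward by the extreme value
print(result["median"])      # 1700: unaffected by the extreme value
print(result["suggestion"])  # median
```

With the extreme value removed or replaced by an ordinary one, the standard deviation drops well below the mean and the same function suggests the mean, which matches the general rule.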
(Don’t forget to click on “New Analysis”.)
1. Calculate the mean and the median of variable “22. Time spent on maintenance”
2. Calculate the mean and the median of variable “23. Mileage”
3. Calculate the mean and the median of variable “24. Monthly spending”

Mean or median? (don’t forget to click on “New Analysis”)
Time spent on maintenance (in hours), Mileage, Monthly spending
Mean = 110, Mean = 1132, Mean = 2.1
Median = 100, Median = 1000, Median = 1

Modifying classes
- Sphinx automatically creates “classes”
- You may need to change these classes into something more meaningful for your analysis
- Choose variable “24. Monthly spending”
- To open the dialog window, click on the little purple wheel (parameters) that shows up once you have clicked on “Use classes”

Options of classes
- Of the same value (1 step)
- Of the same amplitude (a very good, common option)
- Around the mean (if the distribution is symmetrical)
- Of the same frequency (if you need to have the same number of items in each class)
- Personalized (specify the upper boundaries only)

Personalized classes
- Indicate the upper bounds of the classes, separated by semicolons with no spaces, for example: 500;1000;1500;2000
- Indicate the boundaries of the classes and click OK
- The new classes are then displayed

Likert Scales
- By nature, Likert scales are ordinal categorical variables, but in social science they are treated as discrete numerical variables.
- This makes it possible to calculate a mean, which represents the mean evaluation on the scale.

Go to question “Performances”
- By default, Sphinx treats Likert scales as ordinal variables. You need to tell Sphinx to treat them as discrete numeric variables in order to analyze them appropriately.
- To do so, tick the option “process scales as numbers”.
- Do not forget to validate/confirm your change!

Numeric variables
- As a general rule, numeric variables are analyzed by presenting their descriptive statistics.
- One can present the data in table form (more often than not) after the data have been put into classes (the variable then becomes ordinal).
- If you want to show a numeric variable in a chart, you should use a scatter plot, which is not available in Sphinx, only in Excel.
- For more information on descriptive statistics, read Chapter 3, sections 3.1 and 3.2 of Levine et al.

Part II: Textual analysis

Open-ended questions
- Open-ended questions in a questionnaire can lead to:
  - Numeric variables: when the outcomes are numbers
  - Textual variables: when the outcomes are words, ideas, sentences
- Sphinx has tools that allow you to analyze both types of variables

What is textual analysis?
- It is an analysis that transforms textual data into categorical nominal variables.
- Such a transformation is necessary to count the presence of certain topics and contents within the answers to a survey question.
- The transformation of textual data into nominal categories allows the estimation of frequencies and percentages.
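To make the idea of turning text into nominal categories concrete, here is a minimal Python sketch that counts the presence of topics in a few answers and derives frequencies and percentages. Everything in it is hypothetical: the answers, the topic names, the keyword lists and the `code_answer` helper are invented for illustration. Simple keyword matching is only a crude stand-in for the analyst's own reading of each answer; it is not how Sphinx performs the codification.

```python
from collections import Counter

# Hypothetical answers to an open-ended question about the ideal car.
answers = [
    "A small electric car",
    "Something cheap and reliable",
    "An electric SUV, not too expensive",
    "A fast sports car",
]

# Hypothetical topics, each signalled by keywords chosen by the analyst.
topics = {
    "Energy": ["electric", "hybrid"],
    "Price": ["cheap", "expensive"],
    "Performance": ["fast", "sports"],
}


def code_answer(answer):
    """Return the topics whose keywords appear in the answer."""
    text = answer.lower()
    return [topic for topic, words in topics.items()
            if any(word in text for word in words)]


# Frequency of each topic, then percentage of answers mentioning it.
counts = Counter(topic for a in answers for topic in code_answer(a))
percentages = {t: 100 * n / len(answers) for t, n in counts.items()}

for topic, n in counts.items():
    print(topic, n, f"{percentages[topic]:.0f}%")
```

Because one answer can mention several topics, the percentages can add up to more than 100%.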
The textual analysis process
1. Identification of concepts (or themes or categories) that recur across the answers provided by the sample
   - It means finding common themes expressed by many participants of a sample
2. Creation of categories
3. Codification of each observation through the identified categories

The textual analysis process on Sphinx
- Use keyword clouds for a general description of the data
- Use the codification tool to create a categorical variable that will synthesize the information contained in the data
- Show the resulting categorical variable

Textual analysis
- This refers to coding data for open text questions
- Still using the survey “Automobile”
- Go to the module “Analysis” on Sphinx
- Click on the button “New analysis”
- Choose “Textual analysis”

Textual analysis menu
Using the drop-down menu on the left, under the word “variables”, you will notice that the only questions that appear in the list are those marked with the “ab” icon, which are text variables.
To investigate textual questions, there are three possibilities (see the left-hand side panel):
- Keyword clouds
- Verbatim
- Codification

Click on “Textual Analysis”: a new screen opens in Sphinx.
Choose question 27 “Ideal car”.

Keyword cloud
- The analysis is performed on all the responses obtained for this question
- A total of 543 words have been analyzed
- The words that appear most often are highlighted (including synonyms, such as car, automobile, vehicle)
- It should be used when an open-ended question requires only one word as an answer

Keyword cloud by context
- Sphinx allows you to make multiple keyword clouds by dividing the sample according to a specified variable: the context
- By indicating the context, you can compare keyword clouds of different sub-groups of participants within the same sample
- For instance, you can compare how women and men answered the same open-ended textual question

Analysis by context (choose “Gender”)

Result of the analysis (analysis by contexts)
- More detailed analysis, which allows a closer look at specific sub-samples
- In this example, we compare the answers of men and women

“Verbatim” function in Sphinx
- Verbatim means “word for word”
- This function does not allow you to analyze the data
- It shows the entire list of responses, one by one
- It is useful if you are looking for specific answers
- Also useful if you want a quote from respondents
- It should be used when an open-ended question requires a sentence (more than one word) as an answer
- Cumbersome to use if there are more than 50 responses
- Together with keyword clouds, verbatim should be used to identify possible categories to code the answers

Initial screen

Coding function
- Important function to classify responses into categories
- This implies that each response must be analyzed and categorized

Steps involved in coding
1. Look at the responses (e.g. 15-20) in the Verbatim option
2. Identify 3-4 categories into which you can classify the responses. Categories should represent common themes or concepts expressed by the survey participants
3. Go through all the responses to categorize them. It means assigning a category to each answer
4. Add categories as you go along (if necessary)

Build a thematic grid (I)
After having identified the appropriate categories, you need to build the “thematic grid”, that is, the set of categories you will be using for categorizing your textual answers.

Build a thematic grid (II)

Build a thematic grid (III)
1. Give a name to your thematic grid
2. Give a label to each category (here, themes). The label should be meaningful since you will be using it to categorize the answers
3. Click “Associate an extract”
4. Click OK

To code each response
- Read the response
- Select the category which corresponds best to the response (you can select more than one category)
- If you think the response is characteristic of the category, cut and paste it as an “Extract”
- Save your decision
- Go to the next response: each answer must be coded before passing to the next one
- If none of the categories applies to the answer, you can adjust the grid by adding another category

After clicking OK
Here, the answer has been coded as part of the category “Model”: the answer describes the aesthetic appearance of the ideal car. The other categories/themes created do not apply here, but the category “design” can also be flagged.

Adding a category
Click on the little pencil to modify the grid.

End of the coding
- At the end of the coding, a new variable is created, which will appear in the database after all the other variables.
- The variable name is the name of the thematic grid (Ideal Car)
- It is a categorical variable that can be analyzed like any other categorical variable

Variable 44: “Ideal car”
- This is a categorical variable with multiple responses.
- There are 189 non-responses, because only 11 responses were coded.
- Here we asked Sphinx to ignore non-responses in order to have a cleaner chart.
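The effect of ignoring non-responses in a multiple-response variable can be sketched in Python as follows. This is only an illustration under stated assumptions: the `tabulate` function and the coded data are hypothetical, shaped to mirror the counts above (11 coded responses, 189 non-responses), and the category assignments are invented rather than taken from the survey.

```python
from collections import Counter


def tabulate(coded, ignore_non_responses=True):
    """Frequency table for a multiple-response categorical variable.

    `coded` holds one list of categories per respondent;
    an empty list stands for a non-response.
    """
    counts = Counter(cat for cats in coded for cat in cats)
    non_responses = sum(1 for cats in coded if not cats)
    # Percentage base: coded respondents only, or the whole sample.
    base = len(coded) - non_responses if ignore_non_responses else len(coded)
    if base == 0:
        return {}, non_responses
    table = {cat: (n, round(100 * n / base, 1))
             for cat, n in counts.most_common()}
    return table, non_responses


# Hypothetical coding results: 11 coded answers and 189 non-responses.
coded = ([["Model", "Design"], ["Price"], ["Model"]] * 3
         + [["Design"], ["Price"]]
         + [[]] * 189)

table, non_responses = tabulate(coded)
for category, (n, pct) in table.items():
    print(category, n, f"{pct}%")
print("Non-responses:", non_responses)
```

With non-responses ignored, percentages are computed on the 11 coded respondents, so the categories are readable in a chart; computed on all 200 respondents, each category would fall to a few percent. Since a respondent can be coded into several categories, the percentages can sum to more than 100%.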