Psychological assessment: A brief retrospective overview
Cheryl Foxcroft, Gert Roodt, and Fatima Abrahams
Summary
This document provides a brief overview of the historical roots of psychological assessment, touching upon ancient practices like astrology and physiognomy, and more modern approaches. It also discusses the factors that have shaped assessment in South Africa. The document includes questions for the reader to consider.
Chapter 2 Psychological assessment: A brief retrospective overview
CHERYL FOXCROFT, GERT ROODT, AND FATIMA ABRAHAMS

CHAPTER OUTCOMES
By the end of this chapter you will be able to:
› understand how assessment has evolved since ancient times
› appreciate the factors that have shaped psychological assessment in South Africa
› develop an argument about why assessment is still valued in modern society.

2.1 Introduction
At the start of our journey into the field of psychological assessment, it is important to gain a perspective of its origins. This is the focus of this chapter. Without some idea of the historical roots of the discipline of psychological assessment, the great progress made by modern assessment measures cannot be fully appreciated. In this chapter you will also be introduced to some of the key concepts that we will be elaborating on in the Foundation Zone of this book. As you journey through the past with us, you should be on the lookout for the following:
› how difficult it was in ancient times to find an objectively verifiable way of measuring human attributes (ask yourself the reasons for this)
› how the most obvious things in the world of ancient philosophers and scientists (such as the human hand, head, and body, as well as animals) were used in an attempt to describe personal attributes
› the stepping stones that some of the ancient ‘measures’ provided for the development of modern psychological assessment
› the factors both within and outside of the discipline of psychology that have shaped the development of modern psychological assessment
› the factors that shaped the development and use of psychological assessment in South Africa.

2.2 A brief overview of the early origins of psychological assessment
The use of assessment measures can be traced back to ancient times. One of the first recordings of the use of an assessment procedure for selection purposes can be found in the Bible in Judges Chapter 7, verses 1 to 8.
Gideon observed how his soldiers drank water from a river so he could select those who remained on the alert. Historians credit the Chinese with having a relatively sophisticated testing programme for civil servants in place more than 4 000 years ago (Kaplan and Saccuzzo, 2009). Oral examinations were administered every third year and the results were used for work evaluations and for promotion purposes. Over the years, many authors, philosophers, and scientists have explored various avenues in their attempts to assess human attributes. Let us look at a few of these.

2.2.1 Astrology
Most people are aware of the horoscopes that appear in daily newspapers and popular magazines. The positions of planets are used to formulate personal horoscopes that describe the personality characteristics of individuals and to predict what might happen in their lives. The origin of horoscopes can be traced back to ancient times, possibly as early as the fifth century BCE (McReynolds, 1986). Davey (1989) concludes that scientists, on the whole, have been scathing in their rejection of astrology as a key to understanding and describing personality characteristics. Do you agree? State your reasons.

2.2.2 Physiognomy
McReynolds (1986) credits Pythagoras for being perhaps the earliest practitioner of physiognomy, in the sixth century BCE. Later on, Aristotle also came out in support of physiognomy, which attempted to judge a person’s character from the external features of the body and especially the face, in relation to the similarity that these features had to animals. Physiognomy was based on the assumption that people who shared physical similarities with animals also shared some psychic properties with these animals. For example, a person who looked like a fox was sly, or somebody who looked like an owl was wise (Davey, 1989). What is your view on this?
CRITICAL THINKING CHALLENGE 2.1
Many application forms for employment positions or for furthering your studies require that a photograph be submitted. It is highly unlikely that selection and admission personnel use these photographs to judge personal attributes of the applicants, as physiognomists would have done. So why do you think that photographs are requested? Try to interview someone in the Human Resources section of a company, or an admissions officer at an educational institution, to see what purpose, if any, photographs serve on application forms.

2.2.3 Humorology
In the fifth century BCE, Hippocrates, the father of medicine, developed the concept that there were four body humours or fluids (blood, yellow bile, black bile, and phlegm) (McReynolds, 1986). Galen, a physician in ancient Rome, took these ideas further by hypothesising four types of temperament (sanguine, choleric, melancholic, and phlegmatic), corresponding to the four humours (Aiken and Groth-Marnat, 2005). The problem with the humoral approach of classifying personality types into one of four categories was that it remained a hypothesis that was never objectively verified. Today the humoral theory mainly has historical significance. However, based on the views of Hippocrates and Galen, Eysenck and Eysenck (1958) embedded the four temperaments within the introversion/extroversion and the emotionally stable/emotionally unstable (neurotic) personality dimensions which they proposed. Of interest is the fact that Eysenck and Eysenck’s (1958) two personality dimensions still form the basis for modern personality measures such as the Myers Briggs Type Indicator and the 16 Personality Factor Questionnaire. You can read more about these measures in Chapter 12.

2.2.4 Phrenology
Franz Gall was the founder of phrenology, the ‘science’ of ‘reading people’s heads’ (McReynolds, 1986).
Phrenologists believed that the brain consisted of a number of organs that corresponded with various personality characteristics (e.g. self-esteem, cautiousness, firmness) and cognitive faculties (e.g. language, memory, calculation). By feeling the topography of a person’s skull, phrenologists argued that it was possible to locate ‘bumps’ over specific brain areas believed to be associated with certain personality attributes (Aiken and Groth-Marnat, 2005). The fundamental assumptions underlying phrenology were later demonstrated to be invalid in research studies; consequently, no one really places any value on phrenology today.

2.2.5 Chirology – Palmistry
Bayne asserted that palm creases (unlike fingerprints) can change and he found that certain changes appeared to be related to changes in personality. He also believed that all hand characteristics should be taken into consideration before any valid assessments could be made. However, to this day, no scientific evidence has been found that, for example, a firm handshake is a sign of honesty, or that long fingers suggest an artistic temperament (Davey, 1989).

2.2.6 Graphology
Graphology can be defined as the systematic study of handwriting. Handwriting provides graphologists with cues that are called ‘crystallised gestures’ that can be analysed in detail. As handwriting is a type of stylistic behaviour, there is some logic to the argument that it could be seen to be an expression of personality characteristics. Graphologists hypothesise that people who keep their handwriting small are likely to be introverted, modest, and humble, and shun publicity. Large handwriting, on the other hand, shows a desire to ‘think big’ which, if supported by intelligence and drive, provides the ingredients for success. Upright writing is said to indicate self-reliance, poise, calm and self-composure, reserve, and a neutral attitude (Davey, 1989).
Davey (1989) concluded that efforts of graphologists to establish validity of such claims have yielded no or very few positive results. Although there are almost no studies in which it has been found that handwriting is a valid predictor of job performance, graphology is widely used in personnel selection to this day (Simner and Goffin, 2003). It is especially used in France, but also in other countries such as Belgium, Germany, Italy, Israel, Great Britain, and the United States (US). This has prompted Murphy and Davidshofer (2005) to ask why graphology remains popular. What reasons do you think they unearthed in attempting to answer this question? Murphy and Davidshofer (2005) concluded that there were three main reasons that fuelled the popularity of handwriting analysis in personnel selection: It has high face validity, meaning that to the ordinary person in the street it seems reasonable that handwriting could provide indicators of personality characteristics, just as mannerisms and facial expressions do. Graphologists tend to make holistic descriptions of candidates such as ‘honest’, ‘sincere’, and ‘shows insight’, which, because of their vagueness, are difficult to prove or disprove. Some of the predictions of graphologists are valid. However, Murphy and Davidshofer (2005) cite research studies which reveal that the validity of the inferences drawn by the graphologists was related more to what they gleaned from the content of an applicant’s biographical essay than the analysis of the handwriting! Despite having found reasons why graphology continues to be used in personnel selection, Murphy and Davidshofer (2005) concluded that, all things considered, there is not sufficient evidence to support the use of graphology in employment testing and selection. Simner and Goffin (2003) concur with this and argue that the criterion-related validity of graphology is lower and more variable than that of more widely known and less expensive measures. 
For example, whereas the criterion-related validity of graphology varies between .09 and .16 (Simner and Goffin, 2003), the criterion-related validity of general mental testing and structured interviews in job selection has been found to be .51, and when used in combination, the validity coefficient increases to .63 (Schmidt and Hunter, 1998). Simner and Goffin (2003) thus caution that the continued use of graphology for personnel selection could prove to be costly and harmful to organisations.

2.2.7 Summary
All the avenues explored by the early philosophers, writers, and scientists did not provide verifiable ways of measuring human attributes. The common thread running through all these attempts (but probably not in the case of graphology), is the lack of proper scientific method and, ultimately, rigorous scientific measurement.

2.3 The development of modern psychological assessment: An international perspective

2.3.1 Early developments
Psychology has only started to prosper and grow as a science since the development of the scientific method. Underlying the scientific method is measurement. Guilford stated as long ago as 1936 that psychologists have adopted the motto of Thorndike that ‘whatever exists at all, exists in some amount’ and that they have also adopted the corollary that ‘whatever exists in some amount, can be measured’. It was perhaps the development of objective measurement that made the greatest contribution to the development of Psychology as a science. During the Italian Renaissance Huarte’s book was translated into English as The Tryal of Wits (1698). This book was a milestone in the history of assessment, because for the first time someone proposed a discipline of assessment, gave it a task to do, and offered some suggestions on how it might proceed.
Huarte pointed out that: people differ from one another with regard to certain talents different vocations require different sets of talents a system should be developed to determine specific patterns of abilities of different persons so that they can be guided into appropriate education programmes and occupations. This system would involve the appointment of a number of examiners (triers) who would carry out certain procedures (tryals) in order to determine a person’s capacity (McReynolds, 1986). A further milestone in the development of modern psychological assessment came from the work of Thomasius, a professor of philosophy in Germany. According to McReynolds (1986), Thomasius made two main contributions to the emerging field of assessment. He was the first person to develop behavioural rating scales, and furthermore, the ratings in his scales were primarily dependent on direct observations of the subject’s behaviour. Another milestone was the coining of the term psychometrics by Wolff. This term was used throughout the eighteenth and nineteenth centuries, but was mainly applied to psychophysical measurements (McReynolds, 1986). In the twentieth century, with the shift towards the measurement of individual differences, the term was applied to a wider variety of measuring instruments, such as cognitive (mental ability) and personality-related measures. After the foundation had been laid by experimental psychologists such as Wundt, the latter half of the nineteenth century saw some promising developments in the field of assessment linked to the work of Francis Galton, James McKeen Cattell, and Alfred Binet. One of experimental psychology’s major contributions to the field of psychological assessment was the notion that assessment should be viewed in the same light as an experiment, as it required the same rigorous control. 
As you will discover, one of the hallmarks of modern psychological assessment is that assessment measures are administered under highly standardised conditions.

2.3.2 The early twentieth century
The twentieth century witnessed genuine progress in psychological assessment. The progress has mainly been attributed to advances in:
› theories of human behaviour that could guide the development of assessment measures
› statistical methods that aided the analysis of data obtained from measures to determine their relationship to job performance and achievement, for example, as well as to uncover the underlying dimensions being tapped by a measure
› the application of psychology in clinical, educational, military, and industrial settings.
Other than these advances, there was another important impetus that fuelled the development of modern psychological assessment measures in the twentieth century. Do you have any idea what this was? During the nineteenth century and at the turn of the twentieth century in particular, a need arose to treat mentally disturbed and disabled people in a more humanitarian way. To achieve this, the mental disorders and deficiencies of patients had to be properly assessed and classified. Uniform procedures needed to be found to differentiate people who were mentally insane or who suffered from emotional disorders, from those who were mentally disabled or suffered from an intellectual deficit. A need therefore arose for the development of psychological assessment measures. According to Aiken and Groth-Marnat (2005), an important breakthrough in the development of modern psychological assessment measures came at the start of the twentieth century. In 1904, the French Minister of Public Instruction appointed a commission to find ways to identify mentally disabled individuals so that they could be provided with appropriate educational opportunities. One member of the French commission was Binet.
Together with Simon, a French physician, Binet developed the first measure that provided a fairly practical and reliable way of measuring intelligence. The 1905 Binet-Simon Scale became the benchmark for future psychological tests. The measure was given under standardised conditions (i.e. everyone was given the same test instructions and format). Furthermore, norms were developed, albeit using a small and unrepresentative sample. More important than adequacy of the normative sample, though, was Binet and Simon’s notion that the availability of comparative scores could aid interpretation of test performance. It is interesting to note that one of the earliest records of the misuse of intelligence testing involved the Binet-Simon Scale (Gregory, 2010). An influential American psychologist, Henry Goddard was concerned about what he believed to be the high rate of mental retardation among immigrants entering the US. Consequently, Goddard’s English translation of the Binet-Simon Scale was administered to immigrants through a translator, just after they arrived in the US. ‘Thus, a test devised in French, then translated to English was, in turn, retranslated back to Yiddish, Hungarian, Italian, or Russian; administered to bewildered laborers who had just endured an Atlantic crossing; and interpreted according to the original French norms’ (Gregory, 2000, p. 17). It is thus not surprising that Goddard found that the average intelligence of immigrants was low! The Binet-Simon Scale relied heavily on the verbal skills of the test-taker and, in its early years, was available in French and English only. Consequently, its appropriateness for use with non-French or non-English test-takers, illiterates, and with speech- and hearing-impaired test-takers was questioned (Gregory, 2010). This sparked the development of a number of non-verbal measures (e.g. Seguin Form Board Test, Knox’s Digit Symbol Substitution Test, the Kohs Block Design Test, and the Porteus Maze Test). 
World War I further fuelled the need for psychological assessment measures. Why? Large numbers of military recruits needed to be assessed, but at that stage only individually administered tests, such as the Binet-Simon scale, were available. So World War I highlighted the need for large-scale group testing. Furthermore, the scope of testing broadened at this time to include tests of achievement, aptitude, interest, and personality. Following World War I, with the emergence of group tests that largely used a multiple-choice format, there was widespread optimism regarding the usefulness of psychological tests (Kaplan and Saccuzzo, 2009). Samelson (1979, p. 154) points out that Cattell remarked that during the war period ‘the army testing put psychology on the map of the US’. Given that over a million people were tested on the Army Alpha and Army Beta tests in the US, Cattell’s observation has merit. Furthermore, the testing of pilots in Italy and France and the testing of truck drivers for the German army, for example, suggest that the world wars did not only put psychology on the map in the US but elsewhere in the world as well.

2.3.3 Measurement challenges
Although the period between the two World Wars was a boom period for the development of psychological measures, critics started pointing out the weaknesses and limitations of existing measures. Although this put test developers on the defensive, and dampened the enthusiasm of assessment practitioners, the knowledge gained from this critical look at testing inspired test developers to reach new heights. To illustrate this point, let us consider two examples, one from the field of intellectual assessment and the other from the field of personality assessment. The criticism of intelligence scales up to this point was that they were too dependent on language and verbal skills, which reduced their appropriateness for many individuals (e.g. for illiterates).
To address this weakness, Wechsler included performance tests that did not require verbal responses when he published the first version of the Wechsler Intelligence Scales in 1937. Furthermore, whereas previous intelligence scales only yielded one score (namely, the intelligence quotient), the Wechsler Intelligence Scales yielded a variety of summative scores from which a more detailed analysis of an individual’s pattern of performance could be made. These innovations revolutionised intelligence assessment. The use of structured personality measures was severely criticised during the 1930s as many findings of personality tests could not be substantiated during scientific studies. However, the development of the Minnesota Multiphasic Personality Inventory (MMPI) by Hathaway and McKinley in 1943 began a new era for structured, objective personality measures. The MMPI placed an emphasis on using empirical data to determine the meaning of test results. According to Kaplan and Saccuzzo (2009), the MMPI and its revision, the MMPI-2, are the most widely used and referenced personality tests to this day. World War II reaffirmed the value of psychological assessment. The 1940s witnessed the emergence of new test development technologies, such as the use of factor analysis to construct tests such as the 16 Personality Factor Questionnaire. During this period there was also much growth in the application of psychology and psychological testing. Psychological testing came to be seen as one of the major functions of psychologists working in applied settings. In 1954 the American Psychological Association (APA) pronounced that psychological testing was exclusively the domain of the clinical psychologist. However, the APA unfortunately also pronounced that psychologists were permitted to conduct psychotherapy only in collaboration with medical practitioners.
As you can imagine, many clinical psychologists became disillusioned by the fact that they could not practise psychotherapy independently, and, although they had an important testing role to fulfil, they began to feel that they were merely technicians who were playing a subservient role to medical practitioners. Consequently, when they looked around for something to blame for their poor position, the most obvious scapegoat was psychological testing (Lewandowski and Saccuzzo, 1976). At the same time, given the intrusive nature of tests and the potential to abuse testing, widespread mistrust and suspicion of tests and testing came to the fore. So, with both psychologists and the public becoming rapidly disillusioned with tests, many psychologists refused to use any tests, and countries such as the US, Sweden, and Denmark banned the use of tests for selection purposes in industry. So it is not surprising that according to Kaplan and Saccuzzo (2009), the status of psychological assessment declined sharply from the late 1950s, and this decline persisted until the 1970s.

2.3.4 The influence of multiculturalism
In the latter part of the twentieth century and during the first two decades of the twenty-first century, multiculturalism has become the norm in many countries. As a result, attempts were made to develop tests that were ‘culture-free’. An example of such a measure is the Culture-free Intelligence Test (Anastasi and Urbina, 1997). However, it soon became clear that it was not possible to develop a test free of any cultural influence. Consequently, test developers focused more on ‘culture-reduced’ or ‘culture-common’ tests in which the aim was to remove as much cultural bias as possible from the test by including only behaviour that was common across cultures. For example, a number of non-verbal intelligence tests were developed (e.g.
Test of Non-verbal Intelligence, Raven’s Progressive Matrices) where the focus was on novel problem-solving tasks and in which language use, which is often a stumbling block in cross-cultural tests, was minimised. Furthermore, given that most of the available measures have been developed in the US or the United Kingdom (UK), they tend to be more appropriate for westernised English-speaking people. In response to the rapid globalisation of the world’s population and the need for measures to be more culturally appropriate and available in the language in which the test-taker is proficient, the focus of psychological testing in the 1980s and 1990s shifted to cross-cultural test adaptation. Under the leadership of Ron Hambleton from the US, the International Test Commission (ITC) released their Guidelines for Adapting Educational and Psychological Tests (Hambleton, 1994, 2001). These guidelines have recently been revised (International Test Commission, 2010a) and have become the benchmark for cross-cultural test translation and adaptation around the world. They have also assisted in advocating against assessment practices where test-takers are tested in languages in which they are not proficient, sometimes using a translator who translates the test ‘on the run’. In addition, many methodologies and statistical techniques (e.g. Structural Equation Modeling) have been developed to establish whether different language versions of a test are equivalent (Hambleton, Merenda, & Spielberger, 2005). Sparked by research stemming from large-scale international comparative tests such as the Trends in International Mathematics and Science Study (TIMSS) and the Progress in International Reading Literacy Study (PIRLS), the second decade of the twenty-first century has seen renewed interest in issues of bias and fairness when testing in linguistically diverse contexts. Consequently, the ITC has decided to develop guidelines for testing language minorities.
You can read more about how and why measures are adapted for use in different countries and cultures in Chapter 6 and about language issues in assessment in Chapters 7 and 9. A new trend that is emerging in the twenty-first century is to approach the development of tests that are used widely internationally (e.g. the Wechsler Intelligence Scales) from a multicultural perspective. For example, when it comes to the Wechsler Intelligence Scales for Children (WISC), the norm through the years has been to first develop and standardise the measure for the US and thereafter to adapt it for use outside the US. However, for the development of the Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV), experts from various countries are providing input on the constructs to be tapped as well as the content of the items to minimise potential cultural bias during the initial redesign phase (Weiss, 2003). In the process, the development of the WISC-IV is setting a new benchmark for the development of internationally applicable tests. A further recent trend in multicultural and multilingual test development is that of simultaneous multilingual test development (Solano-Flores, Turnbull and Nelson-Barber, 2002; Tanzer, 2005). This differs from the process outlined for the development of the WISC IV where people from different cultural and language groups provide input on the construct(s) to be tapped but the items are still developed in English before they are translated into other languages. Instead, in simultaneous multilingual test development, once the test specifications have been developed, items are written by a multilingual and multicultural panel or committee where each member has a background in psychology (general and cross-cultural in particular), measurement and linguistics as well as with respect to the specific construct that the test will measure (e.g. personality, mathematics). Chapter 7 will provide more information on this approach. 
Non-Western countries are also rising to the challenge of not only adapting westernised measures for their contexts (e.g. Grazina Gintiliene and Sigita Girdzijauskiene, 2008, report that, among others, the WISC-III and the Raven’s Coloured Matrices have been adapted for use in Lithuania) but also developing their own indigenous measures, which are more suited to their cultural contexts. For example, Cheung and her colleagues have developed the Chinese Personality Assessment Inventory (CPAI), which was revised in 2000 and is now known as the CPAI-2 (Cheung, Leung, Fan, Song, Zhang, and Zhang, 1996). This measure includes both indigenous (culturally relevant) and universal personality dimensions. Indigenous personality constructs were derived from classical literature, everyday descriptions of people, surveys and previous psychological research. Thereafter items and scales were developed according to the highest acceptable psychometric standards. You can get information on the CPAI and its development by consulting Cheung and Cheung (2003). The way in which the CPAI was developed is widely regarded as the benchmark to attain in the development of culturally relevant personality measures. How this approach is being successfully used in the South African context will be covered in Chapter 7. Other than multiculturalism impacting on the nature of how tests are developed and adapted, due to rapid globalisation which has led to increasing multiculturalism in most societies, the choice of which norm group to use to compare an individual’s performance to has also become an issue. As will be outlined in Chapter 3, norms provide a basis for comparing an individual’s performance on a measure with the performance of a well-defined reference group to aid in the interpretation of test performance. Norm or reference groups can be constituted in terms of various characteristics of people (e.g. age, gender, educational level, job level, language, clinical diagnosis).
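To see why the choice of norm group matters, consider the minimal sketch below. It assumes that scores in each norm group are roughly normally distributed, and the two norm groups, their means, and their standard deviations are hypothetical values invented for illustration; they are not taken from any published measure. The same raw score is expressed as a standard (z) score and an approximate percentile relative to each group.

```python
from statistics import NormalDist


def standing(raw_score, norm_mean, norm_sd):
    """Return (z, percentile) for a raw score relative to one norm group.

    Assumes scores in the norm group are approximately normally distributed.
    """
    z = (raw_score - norm_mean) / norm_sd
    percentile = NormalDist().cdf(z) * 100
    return z, percentile


# Hypothetical norm groups: (mean, standard deviation) of raw scores.
norm_groups = {
    "general national sample": (50.0, 10.0),
    "graduate job applicants": (60.0, 8.0),
}

raw = 60.0  # the same individual's raw score in both comparisons
for name, (mean, sd) in norm_groups.items():
    z, pct = standing(raw, mean, sd)
    print(f"{name}: z = {z:+.2f}, percentile = {pct:.0f}")
```

Under these assumed values, a raw score of 60 lies one standard deviation above the mean of the first group but exactly at the mean of the second, so the interpretation of the same performance depends on which reference group is used.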
The key criterion when choosing an appropriate norm group to compare an individual’s performance to is linked to the purpose of the comparison. For example, if the intention is to compare the performance with others in the same cultural group of a similar age, culturally and age appropriate norms must be used. However, what norms does a multinational organisation use to make comparisons of its workforce across cultural and national boundaries? Bartram (2008a) argues that using locally developed national norms is not appropriate in this instance. Instead, he argues that multinational norms should be developed and used provided that the mix of country samples is reasonable and that the samples have similar demographics.

2.3.5 Standards, training, computerised testing, and test users’ roles
In an attempt to address issues of fairness and bias in test use, the need arose to develop standards for the professional practice of testing and assessment. Under the leadership of Bartram from the UK, the ITC has developed a set of International Guidelines for Test-use (Version 2000). Many countries, including South Africa, have adopted these test-user standards, which should ensure that, wherever testing and assessment is undertaken in the world, similar practice standards should be evident. You can read more about this in Chapter 8. During the 1990s, competency-based training of assessment practitioners (test users) fell under the spotlight. The British Psychological Society (BPS) took the lead internationally in developing competency standards for different levels of test users in occupational testing in the UK. Based on these competencies, competency-based training programmes have been developed and all test users have to be assessed by BPS-appointed assessors in order to ensure that a uniform standard is maintained. (You can obtain more information from the Web site of the Psychological Testing Centre of the BPS at www.psychtesting.org.uk).
Building on the work done in the UK, Sweden, Norway, and the Netherlands, the Standing Committee of Tests and Testing (which has subsequently been renamed the Board of Assessment) of the European Federation of Psychology Associations (EFPA) developed standards of competence related to test use and a qualification model for Europe as a whole. A Test-user Accreditation Committee has been established to facilitate the recognition of qualifications that meet or can be developed to meet such standards (visit www.efpa.be for more information). Given that many countries have assessment competency models, a project is underway to see whether these national models in professional psychology can be integrated into a competence model that can be recognised internationally (Nielsen, 2012). With assessment being widely used in recruitment, selection, and training in work and organisational settings, a standard was ‘developed by ISO (the International Organisation for Standardisation) to ensure that assessment procedures and methods are used properly and ethically’ (Bartram, 2008b, p. 9) and that those performing the assessment are suitably qualified. Published in 2010, the ISO Standard for Assessment in Organisational Settings (PC 230) now provides a benchmark for providers of assessment services to demonstrate that they have the necessary skills and expertise to provide assessments that are ‘fit for purpose’. In addition, the standard can be used by individual organisations to certify internal quality assurance processes, or by those contracting in assessment services to ascertain the minimum requirements that need to be met by the organisation providing the services, or by professional bodies for credentialing purposes. Other than focusing on the competence of assessment practitioners, there has been increased emphasis on test quality in the last two decades. 
Evers (2012) argues that such an emphasis is not surprising given the ‘importance of psychological tests for the work of psychologists, the impact tests may have on their clients, and the emphasis on quality issues in current society’ (p. 137). Many countries (e.g. the UK, Netherlands, Russia, Brazil, and South Africa) have test review systems – a mechanism used to set standards for and evaluate test quality. Chapter 8 will provide more detail on test review systems. Advances in information technology and systems in the latter half of the twentieth century impacted significantly on psychological testing. Computerised adaptive testing became a reality as did the use of the Internet for testing people in one country for a job in another country, for example. Computerised testing and testing via the Internet have revolutionised all aspects of assessment and have produced their own set of ethical and legal issues. These issues require the urgent attention of test developers and users during the early part of the twenty-first century. You can read more about computerised testing and its history in Chapter 14 and the future predictions regarding assessment and test development in Chapter 18. Linked to, but not restricted to, computer-based and Internet-delivered testing, a trend has emerged around who may use psychological measures and what their qualifications should be. Among the more important issues is confusion regarding the roles and responsibilities of people involved in the assessment process and what knowledge, qualifications, and expertise they require, which has been fuelled by the distinction that is drawn between competency-based and psychological assessment (Bartram, 2003). The block below presents a fairly simplistic description of these two types of assessment to illustrate the distinction between them.
Psychological assessment requires expertise in psychology and psychological theories to ensure that measures of cognitive, aptitude, and personality functioning are used in an ethical and fair manner, right from the choice of which tests to use through to interpretation and feedback. Furthermore, the outputs of psychological assessment are in the form of psychological traits/constructs (such as personality and ability). The expertise to perform psychological assessment is clearly embodied in an appropriately registered psychology professional.

Competency-based assessment focuses on the skills, behaviours, knowledge, and attitudes/values required for effective performance in the workplace or in educational settings (e.g. communication, problem-solving, task orientation). The assessment measures used are as directly linked as possible to the required competencies. Direct methods such as simulations and Assessment Centres are used to conduct competency-based assessment. As the outputs of such assessment are directly linked to the language of the workplace or educational settings, the test user does not need expertise in psychological theories to be able to apply the results of competency-based assessments. What is required, however, is that competency-based assessment be performed by people with expertise in this area of assessment (e.g. skilled in job analysis and competency-based interviews).

Bartram (2003) argues that there is reasonable international consensus regarding the distinction between psychological and competency-based assessment. Consequently, some countries (e.g. Finland) have used this to legally distinguish between assessments that only psychologists can do (psychological assessment) and those that non-psychologists can do (competency-based assessment).
The advent of computer-based and Internet-delivered assessment has further meant that the test user in terms of competency-based assessment is not involved in test selection, administration, scoring, and interpretation. Instead, the test user receives the output of the assessment in a language congruent with the competencies required in the workplace. In view of the changing nature of the definition of a test user, it is important that the roles, responsibilities and required training of all those involved in the assessment process are clarified to ensure that the test-taker is assessed fairly and appropriately. To draw this brief international review of the history of psychological assessment to a close, one could ask what the status of psychological testing and assessment is a decade into the twenty-first century. Roe (2008) provided some insights at the closing of the sixth conference of the ITC that could provide some answers to this question. Roe (2008) asserts that the field of psychometrics and psychological testing is doing well as it continues to respond innovatively to issues (e.g. globalisation and increased use of the Internet); to develop new test development methodologies; to advance test theory, data analysis techniques and measurement models (Alonso-Arbiol & van de Vijver, 2010); and to be sensitive to the rights and reactions of test-takers. Among the challenges that Roe (2008) identifies are the fact that test scores are not easily understood by the public, that psychology professionals struggle to communicate with policymakers, and that test security and cheating on tests are growing issues. He also argues that for testing to make a better contribution to society, psychology professionals need to conceptualise the fact that testing and assessment (from test selection to reporting) are part of the services that they offer to clients, and that the added value of tests should not be confused with the added value of test-based services.
2.4 The development of modern psychological assessment: A South African perspective

2.4.1 The early years

How did psychological assessment measures come to be used in South Africa? As South Africa was a British colony, the introduction of psychological testing here probably stems from our colonial heritage (Claassen, 1997). The historical development of modern psychological measures in South Africa followed a similar pattern to that in the US and Europe (see Section 2.2). However, what is different and important to note is the context in which this development took place. Psychological assessment in South Africa developed in an environment characterised by the unequal distribution of resources based on racial categories (black, coloured, Indian, and white). Almost inevitably, the development of psychological assessment reflected the racially segregated society in which it evolved. So it is not surprising that Claassen (1997) asserts that ‘Testing in South Africa cannot be divorced from the country’s political, economic and social history’ (p. 297). Indeed, any account of the history of psychological assessment in South Africa needs to point out the substantial impact that apartheid policies had on test development and use (Nzimande, 1995). Even before the Nationalist Party came into power in 1948, the earliest psychological measures were standardised only for whites and were used by the Education Department to place white pupils in special education. The early measures were either adaptations of overseas measures such as the Stanford-Binet, the South African revision of which became known as the Fick Scale, or they were developed specifically for use here, such as the South African Group Test (Wilcocks, 1931).
Not only were the early measures only standardised for whites, but, driven by political ideologies, measures of intellectual ability were used in research studies to draw distinctions between races in an attempt to show the superiority of one group over another. For example, during the 1930s and 1940s, when the government was grappling with the issue of establishing ‘Bantu education’, Fick (1929) administered individual measures of motor and reasoning abilities, which had only been standardised for white children, to a large sample of black, coloured, Indian, and white school children. He found that the mean score of black children was inferior to that of Indian and coloured children, with whites’ mean scores superior to all groups. He remarked at the time that factors such as inferior schools and teaching methods, along with black children’s unfamiliarity with the nature of the test tasks, could have disadvantaged their performance on the measures. However, when he extended his research in 1939, he attributed the inferior performance of black children in comparison to that of white children to innate differences, or, in his words, ‘difference in original ability’ (p. 53) between blacks and whites. Fick’s conclusions were strongly challenged and disputed. For example, Fick’s work was severely criticised by Biesheuvel in his book African Intelligence (1943), in which an entire chapter was devoted to this issue. Biesheuvel queried the cultural appropriateness of Western-type intelligence tests for blacks and highlighted the influence of different cultural, environmental, and temperamental factors and the effects of malnutrition on intelligence. This led him to conclude that ‘under present circumstances, and by means of the usual techniques, the difference between the intellectual capacity of Africans and Europeans cannot be scientifically determined’ (Biesheuvel, 1943, p. 91). 
In the early development and use of psychological measures in South Africa, some important trends can be identified which were set to continue into the next and subsequent eras of psychological assessment in South Africa. Any idea what these were? The trends were: the focus on standardising measures for whites only; the misuse of measures by administering measures standardised for one group to another group without investigating whether or not the measures might be biased and inappropriate for the other group; and the misuse of test results to reach conclusions about differences between groups without considering the impact of inter alia cultural, socio-economic, environmental, and educational factors on test performance.

2.4.2 The early use of assessment measures in industry

The use of psychological measures in industry gained momentum after World War II and after 1948 when the Nationalist Government came into power. As was the case internationally, psychological measures were developed in response to a societal need. According to Claassen (1997), after World War II, there was an urgent need to identify the occupational suitability (especially for work on the mines) of large numbers of blacks who had received very little formal education. Among the better measures constructed was the General Adaptability Battery (GAB) (Biesheuvel, 1949, 1952), which included a practice session during which test-takers were familiarised with the concepts required to solve the test problems and were asked to complete some practice examples. The GAB was predominantly used for a preliterate black population, speaking a number of dialects and languages. Because of job reservation under the apartheid regime and better formal education opportunities, as education was segregated along racial lines, whites competed for different categories of work to blacks.
The Otis Mental Ability Test, which was developed in the US and only had American norms, was often used when assessing whites in industry (Claassen, 1997). Among the important trends in this era that would continue into subsequent eras were: the development of measures in response to a need that existed within a certain political dispensation; the notion that people who are unfamiliar with the concepts in a measure should be familiarised with them before they are assessed; and the use of overseas measures and their norms without investigating whether they should be adapted/revised for use in South Africa.

2.4.3 The development of psychological assessment from the 1960s onwards

According to Claassen (1997), a large number of psychological measures were developed in the period between 1960 and 1984. The National Institute for Personnel Research (NIPR) concentrated on developing measures for industry while the Institute for Psychological and Edumetric Research (IPER) developed measures for education and clinical practice. Both of these institutions were later incorporated into the Human Sciences Research Council (HSRC). In the racially segregated South African society of the apartheid era, it was almost inevitable that psychological measures would be developed along cultural/racial lines as there ‘was little specific need for common tests because the various groups did not compete with each other’ (Owen, 1991, p. 112). Consequently, prior to the early 1980s, Western models were used to develop similar but separate measures for the various racial and language groups (Owen, 1991). Furthermore, although a reasonable number of measures were developed for whites, considerably fewer measures were developed for blacks, coloureds, and Indians.
During the 1980s and early 1990s, once the sociopolitical situation began to change and discriminatory laws were repealed, starting with the relaxation of ‘petty apartheid’, applicants from different racial groups began competing for the same jobs and the use of separate measures in such instances came under close scrutiny. A number of questions were raised, such as: How can you compare scores if different measures are used? How do you appoint people if different measures are used? In an attempt to address this problem, two approaches were followed. In the first instance, measures were developed for more than one racial group, and/or norms were constructed for more than one racial group so that test performance could be interpreted in relation to an appropriate norm group. Examples of such measures are the General Scholastic Aptitude Test (GSAT), the Ability, Processing of Information, and Learning Battery (APIL-B), and the Paper and Pencil Games (PPG), which was the first measure to be available in all eleven official languages in South Africa. In the second instance, psychological measures developed and standardised on only white South Africans, as well as those imported from overseas, were used to assess other groups as well. In the absence of appropriate norms, the potentially bad habit arose of interpreting such test results ‘with caution’. Why was this a bad habit? It eased assessment practitioners’ consciences and lulled them into a sense that they were doing the best they could with the few tools at their disposal. You can read more about this issue in Chapter 8. The major problem with this approach was that, initially, little research was done to determine the suitability of these measures for a multicultural South African environment. Research studies that investigated the performance of different groups on these measures were needed to determine whether or not the measures were biased.
Despite the widespread use of psychological measures in South Africa, the first thorough study of bias took place only in 1986. This was when Owen (1986) investigated test and item bias using the Senior Aptitude Test, Mechanical Insight Test, and the Scholastic Proficiency Test on black, white, coloured, and Indian subjects. He found major differences between the test scores of blacks and whites and concluded that understanding and reducing the differential performance between black and white South Africans would be a major challenge. Research by Abrahams (1996), Owen (1989a, 1989b), Retief (1992), Taylor and Boeyens (1991), and Taylor and Radford (1986) showed that bias existed in other South African ability and personality measures as well. Other than empirical investigations into test bias, Taylor (1987) also published a report on the responsibilities of assessment practitioners and publishers with regard to bias and fairness of measures. Furthermore, Owen (1992a, 1992b) pointed out that comparable test performance could only be achieved between different groups in South Africa if environmentally disadvantaged test-takers were provided with sufficient training in taking a particular measure before they actually took it. Given the widespread use (and misuse) of potentially culturally biased measures, coupled with a growing perception that measures were a means by which the Nationalist Government could exclude black South Africans from occupational and educational opportunities, what do you think happened? A negative perception regarding the usefulness of psychological measures developed and large sections of the South African population began to reject the use of psychological measures altogether (Claassen, 1997; Foxcroft, 1997b). Issues related to the usefulness of measures will be explored further in the last section of this chapter. Remember that in the US testing came to be seen as one of the most important functions of psychologists, and only psychologists. 
During the 1970s important legislation was tabled in South Africa that restricted the use of psychological assessment measures to psychologists only. The Health Professions Act (No. 56 of 1974) defines the use of psychological measures as constituting a psychological act, which can legally only be performed by psychologists, or certain other groups of professions under certain circumstances. The section of the Act dealing with assessment will be explored further in Chapter 8. Among the important trends in this era were: the impact of the apartheid political dispensation on the development and fair use of measures; the need to empirically investigate test bias; and growing scepticism regarding the value of psychological measures, especially for black South Africans.

2.4.4 Psychological assessment in the democratic South Africa

2.4.4.1 Assessment in education

Since 1994 and the election of South Africa’s first democratic government, the application, control, and development of assessment measures have become contested terrain. With a growing resistance to assessment measures and the ruling African National Congress’s (ANC) express purpose to focus on issues of equity to redress past imbalances, the use of tests in industry and education in particular has been placed under the spotlight. School readiness testing, as well as the routine administration of group tests in schools, was banned in many provinces as such testing was seen as being exclusionary and perpetuating the discriminatory policies of the past. Furthermore, the usefulness of test results and assessment practices in educational settings has been strongly queried in the Education White Paper 6, Special Needs Education: Building an Inclusive Education and Training System (Department of Education, 2001), for example. Psychometric test results should contribute to the identification of learning problems and educational programme planning as well as informing the instruction of learners. Why is this not happening?
Could it be that the measures used are not cross-culturally applicable or have not been adapted for our diverse population and thus do not provide valid and reliable results? Maybe it is because the measures used are not sufficiently aligned with the learning outcomes of Curriculum 21? Or could it be that psychological assessment reports are filled with jargon and recommendations which are not always translated into practical suggestions on what the educator can do in class to support and develop the learner? Furthermore, within an inclusionary educational system, the role of the educational psychologist along with that of other professional support staff is rapidly changing. Multi-disciplinary professional district-based support teams need to be created and have been established in some provinces (e.g. the Western Cape) (Department of Education, 2005). Within these teams, the primary focus of the psychologists and other professionals is to provide indirect support (‘consultancy’) to learners through supporting educators and the school management (e.g. to identify learner needs and the teaching and learning strategies that can respond to these needs, to conduct research to map out needs of learners and educators, and to establish the efficacy of a programme). However, Pillay (2011) notes that psychologists often find it challenging to work in collaborative teams in educational contexts, which could be linked to the fact that they need in-service training to equip them in this regard. A further way that psychologists can support educators in identifying barriers to learning in their learners is to train them in the use of educational and psychological screening measures (see Chapter 15 for a discussion on screening versus diagnostic assessment). The secondary focus of the professional support teams is to provide direct learning support to learners (e.g.
diagnostic assessment to describe and understand a learner’s specific learning difficulties and to develop an intervention plan). As is the case with the changing role of psychologists in assessment due to the advent of competency-based, computer-based and Internet-delivered testing, the policies on inclusive education and the strategies for its implementation in South Africa are changing the role that educational psychologists play in terms of assessment and intervention in the school system. This can be viewed as a positive challenge as educational psychologists have to develop new competencies (e.g. to train and mentor educators in screening assessment; to apply measurement and evaluation principles in a variety of school contexts). In addition, psychologists will have to ensure that their specialist assessment role is not eroded. However, instead of the specialist assessment being too diagnostically focused, it also needs to be developmentally focused so that assessment results can inter alia be linked to development opportunities that the educator can provide (see Chapter 15). 2.4.4.2 The Employment Equity Act To date, the strongest stance against the improper use of assessment measures has come from industry. Historically, individuals were not legally protected against any form of discrimination. However, with the adoption of the new Constitution and the Labour Relations Act in 1996, worker unions and individuals now have the support of legislation that specifically forbids any discriminatory practices in the workplace and includes protection for applicants as they have all the rights of current employees in this regard. To ensure that discrimination is addressed within the testing arena, the Employment Equity Act (No. 
55 of 1998, section 8) refers to psychological tests and assessment specifically and states: Psychological testing and other similar forms or assessments of an employee are prohibited unless the test or assessment being used: (a) has been scientifically shown to be valid and reliable; (b) can be applied fairly to all employees; (c) is not biased against any employee or group. The Employment Equity Act has major implications for assessment practitioners in South Africa because many of the measures currently in use, whether imported from the US and Europe or developed locally, have not been investigated for bias and have not been cross-culturally validated here (as was discussed in Section 2.3.3). The impact of this Act on the conceptualisation and professional practice of assessment in South Africa in general is far-reaching as assessment practitioners and test publishers are increasingly being called upon to demonstrate, or prove in court, that a particular assessment measure does not discriminate against certain groups of people. It is thus not surprising that there has been a notable increase in the number of test bias studies since the promulgation of the Act in 1998 (e.g. Abrahams and Mauer, 1999a, 1999b; Lopes, Roodt and Mauer, 2001; Meiring, Van de Vijver, Rothmann and Barrick, 2005; Meiring, Van de Vijver, Rothmann and Sackett, 2006; Schaap, 2003, 2011; Schaap and Basson, 2003; Taylor, 2000; Van Zyl and Visser, 1998; Van Zyl and Taylor, 2012; Visser & Viviers, 2010). A further consequence of the Employment Equity Act is that there is an emerging thought that it would be useful if test publishers and distributors could certify a measure as being ‘Employment Equity Act Compliant’ as this will aid assessment practitioners when selecting measures (Lopes, Roodt and Mauer, 2001). While this sounds like a very practical suggestion, such certification could be misleading.
For example, certifying a measure as compliant does not protect the results from being used in an unfair way when making selection decisions. Furthermore, given the variety of cultural and language groups in this country, bias investigations would have to be conducted for all the subgroups on whom the measure is to be used before it can be given the stamp of approval. Alternatively, it would have to be clearly indicated for what subgroups the measure has been found to be unbiased and that it is only for these groups that the measure complies with the Act. The advent of the Employment Equity Act has also forced assessment practitioners to take stock of the available measures in terms of their quality, cross-cultural applicability, the appropriateness of their norms, and the availability of different language versions. To this end, the Human Sciences Research Council (HSRC) conducted a survey of test use patterns and needs of practitioners in South Africa (Foxcroft, Paterson, le Roux, and Herbst, 2004). Among other things, it was found that most of the tests being used frequently are in need of adapting for our multicultural context or require updating, and appropriate norms and various language versions should be provided. The report on the survey, which can be viewed at www.hsrc.ac.za, provides an agenda for how to tackle the improvement of the quality of measures and assessment practices in South Africa as well as providing a list of the most frequently used measures which should be earmarked for adaptation or revision first. You can read more about this agenda in Chapter 18. Having an agenda is important, but having organisations and experts to develop and adapt tests is equally important. As was discussed in Section 2.4.3, the Human Sciences Research Council (HSRC), into which the NIPR and IPER were incorporated, developed a large number of tests during the 1960s through to the 1980s.
For at least three decades the HSRC almost exclusively developed and distributed tests in South Africa. However, at the start of the 1990s the HSRC was restructured, became unsure about the role that it should play in psychological and educational test development, and many staff with test development expertise left the organisation. Consequently, since the South African adaptation of the Wechsler Adult Intelligence Scales-III (WAIS-III) in the mid-1990s, the HSRC has not developed or adapted any other tests and its former tests that are still in circulation are now distributed by Mindmuzik Media. According to Foxcroft and Davies (2008), with the HSRC relinquishing its role as the major test developer and distributor in South Africa, other role players gradually emerged to fill the void in South African test development. Some international test development and distribution agencies such as SHL (www.shl.co.za) and Psytech (www.psytech.co.za) have agencies in South Africa. Furthermore, local test agencies such as Jopie van Rooyen & Partners SA (www.jvrafrica.co.za) and Mindmuzik Media (www.mindmuzik.com) have established themselves in the marketplace. Each of these agencies has a research and development section and much emphasis is being placed on adapting international measures for the South African context. However, there is still some way to go before all the tests listed on the websites of the agencies that operate in South Africa have been adapted for use here and have South African norms. The other encouraging trend is that there is a greater involvement of universities (e.g. Unisa and the Nelson Mandela Metropolitan, North-West, Rhodes, Witwatersrand, Pretoria, Johannesburg, Stellenbosch, and KwaZulu-Natal universities) in researching and adapting tests, developing local norms and undertaking local psychometric studies, and even developing indigenous tests (e.g. Taylor and de Bruin, 2003).
Furthermore, some organisations such as the South African Breweries and the South African Police Services (SAPS) that undertake large scale testing have undertaken numerous studies to provide psychometric information on the measures that they use, investigate bias, and adapt measures on the basis of their findings (e.g. Meiring, 2007). South African assessment practitioners and test developers have thus not remained passive in the wake of legislation impacting on test use. Although the future use of psychological assessment, particularly in industry and education, still hangs in the balance at this stage, there are encouraging signs that progress is being made to research and adapt more measures appropriate for our context and to use them in fair ways to the benefit of individuals and organisations. The way in which psychologists and test developers continue to respond to this challenge will largely shape the future destiny of psychological testing here. Consult Chapter 18 for future perspectives on this issue.

2.4.4.3 Professional practice guidelines

According to the Health Professions Act (No. 56 of 1974), the Professional Board for Psychology of the Health Professions Council of South Africa (HPCSA) is mandated to protect the public and to guide the profession of psychology. In recent years, the Professional Board has become increasingly concerned about the misuse of assessment measures in this country, while recognising the important role of psychological assessment in the professional practice of psychology as well as for research purposes. Whereas the Professional Board for Psychology had previously given the Test Commission of the Republic of South Africa (TCRSA) the authority to classify psychological tests and oversee the training and examination of certain categories of assessment practitioners, these powers were revoked by the Board in 1996.
The reason for this was that the TCRSA did not have any statutory power and, being a section 21 company that operated largely in Gauteng, its membership base was not representative of psychologists throughout the country. Instead, the Professional Board for Psychology formed the Psychometrics Committee, which, as a formal committee of the Board, has provided the Board with a more direct way of controlling and regulating psychological test use in South Africa. The role of the Psychometrics Committee of the Professional Board and some of the initiatives that it has launched will be discussed more fully in Chapter 8. The further introduction of regulations and the policing of assessment practitioners will not stamp out test abuse. Rather, individual assessment practitioners need to make the fair and ethical use of tests a norm for themselves. Consequently, the Psychometrics Committee has actively participated in the development of internationally acceptable standards for test use in conjunction with the International Test Commission’s (ITC) test-use project (see Section 2.2 and Chapter 8) and has developed certain competency-based training guidelines (see www.hpcsa.co.za). Furthermore, the various categories of psychology professionals who use tests have to write a national examination under the auspices of the Professional Board before they are allowed to practise professionally. Such a national examination helps to ensure that professionals enter the field with at least the same minimum discernible competencies. In Chapter 8, the different categories of assessment practitioners in South Africa will be discussed, as will their scope of practice and the nature of their training. To complicate matters, the introduction of computer-based and Internet-delivered testing has led to a reconceptualisation of the psychology professional’s role, especially in terms of test administration. 
There are those who argue that trained test administrators (non-psychologists) can oversee structured, computer-based testing and that the test classification system in South Africa should make allowance for this. Indeed, in the landmark court case which hinged on the definition of what test use means, the honourable Judge Bertelsmann concluded that the Professional Board needs to publish a list of tests that are restricted for use by psychologists as it ‘is unlikely that the primarily mechanical function of the recording of test results should be reserved for psychologists’ (Association of Test Publishers of South Africa; and Another, Savalle and Holdsworth South Africa v. The Chairperson of the Professional Board of Psychology, 2010, p. 20). Whether it is correct to conclude that test administration is purely mechanistic will be critically examined in Chapter 9. There is also a growing recognition among South African psychologists that many of the measures used in industry and the educational sector do not fall under the label of ‘psychological tests’. Consequently, as the general public does not differentiate between whether or not a test is a psychological test, there is a need to set general standards for testing and test use in this country. This, together with the employment equity and labour relations legislation in industry, has led to repeated calls to define uniform test-user standards across all types of tests and assessment settings and for a central body to enforce them (Foxcroft, Paterson, le Roux and Herbst, 2004). To date, such calls have fallen on deaf ears.
As you look back over this section, you should become aware that psychological assessment in South Africa has been and is currently being shaped by:
› legislation and the political dispensation of the day
› the need for appropriate measures to be developed that can be used in a fair and unbiased way for people from all cultural groups in South Africa
› the role that a new range of test development and distribution agencies and universities are playing to research, adapt and develop measures that are appropriate for our multicultural context
› the need for assessment practitioners to take personal responsibility for ethical test use
› training and professional practice guidelines provided by statutory (e.g. the Professional Board for Psychology) and other bodies (e.g. PsySSA, SIOPSA).

CRITICAL THINKING CHALLENGE 2.2
Spend some time finding the common threads or themes that run through the historical perspective of assessment from ancient to modern times, internationally and in the South African context. You might also want to read up on the development of psychological testing in Africa (see for example Foxcroft, C.D., 2011).

2.5 Can assessment measures and the process of assessment still fulfil a useful function in modern society?

One thing that you have probably realised from reading through the historical account of the origins of psychological testing and assessment is that its popularity has waxed and waned over the years. However, despite the attacks on testing and the criticism levelled at it, psychological testing has survived, and new measures and test development technologies continue to be developed each year. Why do you think that this is so? In the South African context do you think that, with all the negative criticisms against it, psychological testing and assessment can still play a valuable role?
Think about this for a while before you read some of the answers to these questions that Foxcroft (1997b); Foxcroft, Paterson, Le Roux, and Herbst (2004), and others, have come up with for the South African context. Ten academics involved in teaching psychological assessment at universities were asked whether there is a need for psychological tests in present day South Africa and they all answered ‘Yes’ (Plug, 1996). One academic went so far as to suggest that ‘the need for tests in our multicultural country is greater than elsewhere because valid assessment is a necessary condition for equity and the efficient management of personal development’ (Plug, 1996, p. 3). In a more recent survey by Foxcroft, Paterson, Le Roux, and Herbst (2004), the assessment practitioners surveyed suggested that the use of tests was central to the work of psychologists and that psychological testing was generally being perceived in a more positive light at present. Among the reasons offered for these perceptions was the fact that tests are objective in nature and more useful than alternative methods such as interviews. Further reasons offered were that tests provide structure in sessions with clients and are useful in providing baseline information, which can be used to evaluate the impact of training, rehabilitation or psychotherapeutic interventions. Nonetheless, despite psychological testing and assessment being perceived more positively, the practitioners pointed out that testing and assessment ‘only added value if tests are culturally appropriate and psychometrically sound, and are used in a fair and an ethical manner by well-trained assessment practitioners’ (Foxcroft, Paterson, Le Roux and Herbst, 2004, p. 135). Psychological testing probably continues to survive and to be of value because of the fact that assessment has become an integral part of modern society. 
Foxcroft (1997b) points out that in his realistic reaction to the anti-test lobby, Nell (1994) argued:

psychological assessment is so deeply rooted in the global education and personnel selection systems, and in the administration of civil and criminal justice, that South African parents, teachers, employers, work seekers, and lawyers will continue to demand detailed psychological assessments (p. 105).

Furthermore, despite its obvious flaws and weaknesses, psychological assessment continues to aid decision-making, provided that it is used in a fair and ethical manner by responsible practitioners (Foxcroft, Paterson, Le Roux and Herbst, 2004). Plug (1996) has responded in an interesting way to the criticisms levelled at testing. He contends ‘the question is not whether testing is perfect (which obviously it is not), but rather how it compares to alternative techniques of assessment for selection, placement or guidance, and whether, when used in combination with other processes, it leads to a more reliable, valid, fair and cost-effective result’ (Plug, 1996, p. 5). Is there such evidence? In the field of higher education, for example, Huysamen (1996b) and Koch, Foxcroft, and Watson (2001) have shown that biographical information, matriculation results, and psychological test results predict performance at university and show promise in assisting in the development of fair and unbiased admission procedures at higher education institutions. Furthermore, a number of studies have shown how the results from cognitive and psychosocial tests correlate with academic performance, predict success and assist in the planning of academic support programmes (e.g. Petersen et al., 2009; Sommer & Dumont, 2011). In industry, 90 per cent of the human resource practitioners surveyed in a study conducted by England and Zietsman (1995) indicated that they used tests combined with interviews for job selection purposes and about 50 per cent used tests for employee development.
This finding was confirmed by Foxcroft, Paterson, Le Roux and Herbst (2004) when, as part of the HSRC’s test-use survey, it was found that 85 per cent of the industrial psychologists surveyed used psychological tests in work settings. In clinical practice, Shuttleworth-Jordan (1996) found that even if tests that had not been standardised for black children were used, a syndrome-based neuropsychological analysis model made it possible to make appropriate clinical judgements which ‘reflect a convincing level of conceptual validity’ (p. 99). Shuttleworth-Jordan (1996) argued that by focusing on common patterns of neuropsychological dysfunction rather than using a normative-based approach which relies solely on test scores, some of the problems related to the lack of appropriate normative information could be circumvented. More recently, Shuttleworth-Edwards (2012) reported on descriptive comparisons of performance on the South African adapted WAIS-III and the Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV), which has not yet been adapted but for which she has developed normative guidelines. Based on a case study with a 20-year-old brain-injured Xhosa-speaking individual, she concluded: ‘Taking both clinical and statistical data into account, a neuropsychological analysis was elucidated … on the basis of which contextually coherent interpretation of a WAIS-IV brain-injury test protocol … is achieved’ (p. 399). Similarly, Odendaal, Brink and Theron (2011) reported on six case studies of black adolescents where they used quantitative data obtained using the Rorschach Comprehensive System (RCS) (Exner, 2003) together with qualitative follow-up interview information, observations and a culturally sensitive approach to RCS interpretation (CSRCS). 
Although cautioning that more research was needed, they concluded: ‘in the hands of the trained and experienced clinician … using the CSRCS potentially offers invaluable opportunities to explore young people’s resilience-promoting processes’ (p. 537). Huysamen (2002), however, cautions that when practitioners need to rely more on professional judgement than objective, norm-based scores, they need to be aware that the conclusions and opinions ‘based on so-called intuitions have been shown to be less accurate than those based on the formulaic treatment of data’ (p. 31). The examples cited above suggest that there is South African research evidence to support the value of psychological test information when it is used along with other pertinent information and clinical/professional judgement to make decisions. Furthermore, Foxcroft (1997b) asserts that in the process of grappling with whether assessment can serve a useful purpose in South Africa:

attention has shifted away from a unitary testing approach to multi-method assessment. There was a tendency in the past to erroneously equate testing and assessment. In the process, clinicians forgot that test results were only one source of relevant data that could be obtained. However, there now appears to be a growing awareness of the fact that test results gain in meaning and relevance when they are integrated with information obtained from other sources and when they are reflected against the total past and present context of the testee (Claassen, 1995; Foxcroft, 1997b, p. 231).

To conclude this section, it could probably be argued that the anti-test lobby has impacted positively on the practice of psychological assessment in South Africa. On the one hand, psychology practitioners have taken a critical look at why and in what instances they use assessment measures as well as how to use tests in the fairest possible way in our multicultural society.
On the other hand, it has forced researchers to provide empirical information regarding the usefulness of assessment measures.

CRITICAL THINKING CHALLENGE 2.3
You have been asked to appear on a TV talk show in which you need to convince the South African public that psychological assessment has an important role to play in the democratic, post-apartheid South Africa. Write down what you plan to say on the talk show.

CHECK YOUR PROGRESS 2.1
2.1 Define the following terms:
› Phrenology
› Graphology
› Psychometrics
› Standardised.
2.2 Describe how the following have impacted on psychological test use and development in South Africa:
› Apartheid
› The Employment Equity Act.

ETHICAL DILEMMA CASE STUDY 2.1
Anton, a psychologist, is asked to assess 20 applicants for a management position. All the measures that he plans to use are only available in English. However, only 5 of the applicants have English as their mother-tongue; the home language of the others is either isiXhosa or isiZulu. Anton knows that according to the Employment Equity Act, a measure should not be biased against any group. He is thus worried that applicants whose home language is not English will be disadvantaged. As a result, he asks his secretary, who is fluent in English, isiXhosa and isiZulu, to act as a translator.
Questions:
(a) Is it appropriate for Anton to use a translator? Motivate your answer.
(b) Will the use of a translator be sufficient to ensure that the measures are not biased against test-takers who do not have English as a home language? Motivate your answer.
(c) How else could Anton have approached this dilemma? [Hint: consult the Ethical Rules of Conduct for Professionals Registered under the Health Professions Act, 1974 (Department of Health, 2006), which you can source from www.hpcsa.co.za]

2.6 Conclusion

This chapter introduced you to the history and development of the field of psychological assessment from ancient to modern times, internationally and in the South African context.
Attention was given to the factors that shaped the development of psychological assessment in South Africa from the early years, in the apartheid era and in the democratic era. The chapter concluded with a discussion on whether psychological assessment fulfils a useful function in modern society. So far we are about one quarter of the way in our journey through the Foundation Zone on our map. Have you picked up some of the fundamental assessment concepts already? You should be developing an understanding that:
› psychological assessment is a complex process
› psychological measures/tests represent a scientific approach to enquiring into human behaviour; consequently they need to be applied in a standardised way and have to conform to rigorous scientific criteria (i.e. it must be empirically proved that they are reliable and valid and, especially in multicultural contexts, that they are not biased).
Let’s move on to the next chapter where you can explore some of the foundational assessment concepts further.

Chapter 3 Basic measurement and scaling concepts
GERT ROODT

CHAPTER OUTCOMES
The next three chapters are the next stops in the Foundation Zone and they introduce you to the basic concepts needed for understanding the properties of psychological measures and how test results are interpreted. The topics covered in these three chapters include basic measurement and scaling concepts, reliability, and validity. The first of these chapters introduces the following themes: levels of measurement, measurement errors, measurement scales, basic statistical concepts, and test norms.
Once you have studied this chapter you will be able to:
› describe the three distinguishing properties of measurement levels
› describe and give an example of the four different measurement levels
› describe and give an example for each of the different measurement scales
› explain for each of the different scaling methods what type of data (measurement level) it will generate
› define and describe the three basic categories of statistical measures of location, variability and association
› name and describe the different types of test norms.

3.1 Introduction

In order to understand and work with basic psychometric principles, you need to have a good understanding of some fundamental principles of statistics. If you have not completed a course in statistics, we suggest that you invest in an introductory text to statistics. Such a book should explain the basic measurement and statistical concepts and procedures which will be highlighted in this chapter in greater depth.

3.2 Levels of measurement

Guilford (1936) referred to the great German philosopher Immanuel Kant who once asserted that the sine qua non of a science is measurement and the mathematical treatment of its data. Kant therefore concluded that psychology could never rise to the dignity of a natural science because it is not possible to apply quantitative methods to its data. Guilford concluded that if Kant were to browse through one of the contemporary journals of psychology, he would at least be forced to conclude that psychologists as a group are expending a large amount of energy in maintaining the pretence that their work is science. Following this, Guilford then posed the question: What would be the motive of modern day scientists for spending this arduous amount of energy to express their findings in terms of statistical probability and significance coefficients? He then concluded it is all in our ‘struggle for objectivity. Objectivity is after all the touchstone of science’ (p.
1) (own emphasis in italics). This objectivity can be achieved through effective measurement. Runyon & Haber define measurement as follows:

Measurement is the assignment of numbers to objects or events according to sets of predetermined (or arbitrary) rules, or to frame it more precisely in psychometric terms, the transformation of psychological attributes into numbers (1980, p. 21).

In psychometrics there are numerous systems by which we can assign numbers. These systems may generate data that have different properties. Let us have a closer look at these properties now.

3.2.1 Properties of measurement scales

There are three properties that enable us to distinguish between different scales of measurement, namely magnitude, equal intervals, and absolute zero (De Beer, Botha, Bekwa, & Wolfaardt, 1999).

3.2.1.1 Magnitude

Magnitude is the property of ‘moreness’. A scale has the property of magnitude if we can say that one attribute is more than, less than, or equal to another attribute. Let us take a tape measure, for example. We can say that one person is taller or shorter than another, or something is longer or shorter than something else, once we have measured their/its height/length. A measure of height/length therefore possesses the property of magnitude.

3.2.1.2 Equal intervals

A scale assumes the property of equal intervals if the difference between all points on that scale is uniform. If we take the example of length, this would mean that the difference between 6 and 8 cm on a ruler is the same as the difference between 10 and 12 cm. In both instances the difference is exactly 2 cm. In Example A of Table 3.1, an example of an equal-interval response-rating scale is provided. This response-rating scale represents equal intervals and it would typically generate continuous data. In all instances the interval between scores is exactly one. In the case of the scale in Example B, numbers are assigned to different categories.
In this case, the distances between categories are not equal and this scale would typically generate categorical data. Although the scale numbers in Example A possess equal intervals, one should however note that there is evidence that psychological measurements rarely have the property of precise equal intervals. For example, although the numerical difference between IQ scores of 50 and 55 is the same as the difference between 105 and 110, the qualitative difference of 5 points at the lower level does not mean the same in terms of intelligence as the difference of 5 points at the higher level.

Table 3.1 Comparison of equal interval and categorical rating scales

EXAMPLE A
Totally disagree   1   2   3   4   5   Totally agree

EXAMPLE B
Totally disagree   Disagree somewhat   Unsure   Agree somewhat   Totally agree
1                  2                   3        4                5

3.2.1.3 Absolute zero

Absolute zero (0) is obtained when there is absolutely nothing present of the attribute being measured. If we take the example of length again, 0 cm means that there is no distance. Length therefore possesses the property of absolute zero. By the same token, if you were measuring wind velocity and got a zero reading, you would say that there is no wind blowing at all. With many human attributes it is extremely difficult, if not impossible, to define an absolute zero point. For example, if we measure verbal ability on a scale of 0 to 10, we can hardly say that a zero score means that the person has no verbal ability at all. There might be a level of ability that the particular scale does not measure; there can be no such thing as zero ability.

3.2.2 Categories of measurement levels

Some aspects of human behaviour can be measured more precisely than others. In part, this is determined by the nature of the attribute (characteristic) being measured. In part, this is also determined by the level of measurement of the scale used.
Measurement level is an important, but often overlooked, aspect in the construction of assessment measures because it remains an important consideration when data sets are analysed. Measurement serves different functions, such as sorting, ordering, rating, and comparisons. As a consequence numbers are used in different ways, namely to name (i.e. group according to a label – nominal numbers); to represent a position in a series (i.e. to order – ordinal numbers); and to represent quantity (i.e. rating or comparing – cardinal numbers). Each of these measurement approaches results in different types of measurement data. Two broad groups of measurement data are found, namely categorical data and continuous data.

Categorical data are in the form of discrete or distinct categories (e.g. males and females). Only a specific set of statistical procedures can be used for data analysis on categorical (discrete) data sets. Categorical measurement data, in turn, can be divided into nominal and ordinal measurement levels:
› Nominal: With a nominal scale, numbers are assigned to an attribute to describe or name it, such as telephone numbers or postal box numbers. For example, a nominal measurement scale is used if the 11 official languages of South Africa are coded from 1 to 11 in a data set or if the 15 positions in a rugby team are numbered from 1 to 15. Languages or rugby players can then be sorted according to these numbers.
› Ordinal: An ordinal scale is a bit more refined than a nominal scale in that numbers are assigned to objects that reflect some sequential ordering or amounts of an attribute. For example, people are ranked according to the order in which they finished a race or in terms of the results of their class test.

Continuous data are data that have been measured on a continuum which can be broken down into smaller units (e.g. height or weight).
Furthermore, continuous data can be divided into two subdivisions, namely, interval and ratio measurement levels:
› Interval: In an interval scale, equal numerical differences can be interpreted as corresponding to equal differences in the characteristic being measured. An example would be employees’ rating of their company’s market image on a 10-point scale (where 1 is poor and 10 is excellent). IQ test scores are usually considered to be an interval scale. Why? A difference between two individuals’ IQ scores can be numerically determined (e.g. the difference between IQ scores of 100 and 150 is 50, or between 100 and 50 is also 50). However, as was pointed out in Section 3.2.1.2, the meaning of a difference between two IQ scores would be very different if the difference was between 150 and 100 rather than between 50 and 100. Example A can be classified as an interval scale.
› Ratio: This is the most refined level of measurement. Not only can equal differences be interpreted as reflecting equal differences in the characteristic being measured, but there is a true (absolute) zero which indicates complete absence of what is being measured. The inclusion of a true zero allows for the meaningful interpretation of numerical ratios. For example, length is measured on a ratio scale. If a table is 6 m away from you while a chair is 3 m away from you, it would be correct to state that the table is twice as far away from you as the chair. Can you think of an example related to a psychological characteristic? No. The scores on psychological measures, depending on the response scales used, either represent ordinal or, at the most, interval data rather than ratio measurement, since they do not have a true zero point.

Continuous data, if normally distributed, are ideally suited for a wide range of parametric statistical analysis procedures. Selecting a particular measuring instrument that will generate data suitable to your needs is no safeguard against measurement errors.
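The relationship between the four measurement levels and the three properties of Section 3.2.1 can be summarised in a small lookup table. The Python sketch below is only an illustration: the class name `MeasurementLevel`, the dictionary `LEVELS` and the two helper functions are hypothetical names chosen for this example, and the true/false entries follow the descriptions given above (nominal numbers only name, ordinal numbers add magnitude, interval scales add equal intervals, and ratio scales add an absolute zero).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurementLevel:
    """Illustrative record of which scale properties a level possesses."""
    name: str
    data_type: str          # "categorical" or "continuous"
    magnitude: bool         # can we say more than / less than?
    equal_intervals: bool   # are differences between points uniform?
    absolute_zero: bool     # does zero mean total absence of the attribute?


# Hypothetical summary table based on the descriptions in Section 3.2.
LEVELS = {
    "nominal": MeasurementLevel("nominal", "categorical", False, False, False),
    "ordinal": MeasurementLevel("ordinal", "categorical", True, False, False),
    "interval": MeasurementLevel("interval", "continuous", True, True, False),
    "ratio": MeasurementLevel("ratio", "continuous", True, True, True),
}


def supports_difference_statements(level_name: str) -> bool:
    """Equal differences in scores are interpretable only with equal intervals."""
    return LEVELS[level_name].equal_intervals


def supports_ratio_statements(level_name: str) -> bool:
    """Claims such as 'twice as far' require an absolute zero."""
    return LEVELS[level_name].absolute_zero


# The length example from the text: a table 6 m away and a chair 3 m away.
table_distance_m = 6.0
chair_distance_m = 3.0
if supports_ratio_statements("ratio"):
    times_as_far = table_distance_m / chair_distance_m
    print(f"The table is {times_as_far:.0f} times as far away as the chair.")

# The same kind of statement is not supported for interval-level IQ scores.
print("Ratio statements about IQ scores supported:",
      supports_ratio_statements("interval"))
```

A lookup of this kind is simply a reminder to check the measurement level of a score before interpreting differences or ratios; it does not replace the judgement about whether a particular psychological scale really achieves interval-level measurement.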
Psychometricians and researchers should therefore always be aware of possible measurement errors.

3.3 Measurement errors

In most measurement settings two broad types of errors may occur. They are random sampling errors and systematic errors. The scores generated by measuring instruments can potentially be negatively affected by either sampling or measurement errors (or both these errors). Stated differently, this means that causes of measurement errors can be attributed to sampling, the measure itself or a combination of these two factors. These errors will be briefly introduced here and discussed in more detail in the next chapters.

3.3.1 Random sampling error

Random sampling error can be defined as: ‘a statistical fluctuation that occurs because of chance variations in the elements selected for a sample’ (Zikmund, Babin, Carr, & Griffin, 2010, p. 188). The effects of these chance factors, referred to as random sampling error, are normally countered by increasing the sample size. In practical terms this means that larger sample sizes (e.g. > 500) yield more accurate data than smaller sample sizes (e.g. < 100) for the standardisation or validation of measuring instruments, since they can estimate the population parameters more accurately (less measurement error). This has important implications for sampling norm groups that are representative of particular populations. Sampling errors may therefore produce erroneous and skewed data.

3.3.2 Systematic error

Systematic errors or non-sampling errors are defined as: ‘error resulting from some imperfect aspect of the research design that causes respondent error or from a mistake in the execution of the research’ (Zikmund et al., 2010, p. 189). Systematic error or measurement bias is present where the results show a persistent tendency to deviate in a particular direction from the population parameter. This type of error can be ascribed to possible measurement error or bias.
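The contrast between the two error types can be made concrete with a small simulation. The Python sketch below rests on stated assumptions that are not taken from the text: scores are drawn from a normal population with a mean of 100 and a standard deviation of 15, the sample sizes of 50 and 1000 merely stand in for a small and a large sample, and the systematic error is modelled as a constant 5-point bias added to every score. The function name `simulate_sample_means` is hypothetical.

```python
import random
import statistics

POPULATION_MEAN = 100.0   # assumed population parameter (illustrative)
POPULATION_SD = 15.0      # assumed population standard deviation


def simulate_sample_means(sample_size, bias=0.0, n_samples=200, seed=0):
    """Draw repeated samples and return the mean of each sample.

    `bias` is a constant added to every score to mimic a systematic
    error in the measuring instrument.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(POPULATION_MEAN, POPULATION_SD) + bias
                  for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


unbiased_small = simulate_sample_means(sample_size=50)
unbiased_large = simulate_sample_means(sample_size=1000)
biased_large = simulate_sample_means(sample_size=1000, bias=5.0)

# Random sampling error: how much sample means fluctuate by chance.
spread_small = statistics.stdev(unbiased_small)
spread_large = statistics.stdev(unbiased_large)

# Systematic error: a persistent deviation from the population parameter.
deviation_unbiased = statistics.fmean(unbiased_large) - POPULATION_MEAN
deviation_biased = statistics.fmean(biased_large) - POPULATION_MEAN

print(f"Spread of sample means, n=50:   {spread_small:.2f}")
print(f"Spread of sample means, n=1000: {spread_large:.2f}")
print(f"Average deviation, unbiased measure: {deviation_unbiased:.2f}")
print(f"Average deviation, biased measure:   {deviation_biased:.2f}")
```

Under these assumptions the sample means fluctuate far less around the population mean when the sample is large, which is the sense in which a larger sample counters random sampling error. The 5-point deviation of the biased measure, by contrast, does not shrink as the sample grows, because it stems from the measure rather than from chance.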
Measurement errors may therefore also negatively affect the scores obtained from norm groups. The way in which measurement scales are constructed can help psychometrists and researchers to counter the effects of measurement errors, specifically those of non-response errors and response bias. Non-response error is the real difference between the actual respondents in a survey and those who have decided not to participate (if they could be objectively surveyed). Response bias is a tendency for respondents to respond to questions in a set manner. Both these errors can be minimised if the principles of sound scale construction are adhered to.

3.4 Measurement scales

In psychological measurement (psychometrics), we are mostly dealing with the measurement of abstract (intangible) constructs. Although these constructs are not visible or tangible, it does not mean that they do not exist. We can still measure their existence in the different ways that they manifest themselves. For instance we can measure intelligence by assessing people’s ability to solve complex problems; or by assessing their mental reasoning ability; or sometimes even by the way people drive.

3.4.1 Different scaling options

In deciding on possible measurement formats of these abstract constructs we have a choice of using different measuring scales. The choices we make in this regard can potentially lead to generating data on different measurement levels. (For a more comprehensive discussion on scaling methods refer to Schnetler, 1989). Let’s consider some of the most frequently used scales.

3.4.1.1 Category scales

Category scales are, as the name suggests, scales where the response categories are categorised or defined. According to both Torgerson (1958) and Schepers (1992) a scale loses its equal interval properties if more than two points on the response scale are anchored.
Usually several response categories exist in category scales and they are ordered in some ascending dimension; in the case below – the frequency of use. This type of scale would generate categorical data; more specifically ordinal data.

How frequently do you make use of public transport?
(1) Never   (2) Rarely   (3) Sometimes   (4) Often   (5) All the time
Figure 3.1 Category scale

3.4.1.2 Likert-type scales

The Likert scale or summated-rating method was developed by Rensis Likert (1932). Likert-type scales are frequently used in applied psychological research. Respondents have to indicate to what extent they agree or disagree with a carefully phrased statement or question. These types of items in the examples below will generate categorical data. Below are two examples of such response scales:

My supervisor can be described as an approachable person.
(1) Strongly disagree   (2) Disagree   (3) Uncertain   (4) Agree   (5) Strongly agree
or
(-2) Strongly disagree   (-1) Disagree   (0) Neutral   (+1) Agree   (+2) Strongly agree
Figure 3.2 Likert-type scale

The supervisor can be rated on different dimensions that then provide a profile of the supervisor according to the different, rated dimensions. The combined scores on several of these items would provide a composite (summated) score of the supervisor in terms of how favourable or unfavourable he was rated overall. In the case of the second response scale example above, a zero is used to indicate a neutral or uncertain response. Any idea on what problem will occur if scores on such response scales in the latter example are added? Respondents may therefore opt for the zero if they find the choice difficult, resulting in an abnormally peaked (leptokurtic) response distribution.

3.4.1.3 Semantic differential scales

Semantic differential scales provide a series of semantic differentials or opposites.
The respondent is then requested to rate a particular person, attribute or subject in terms of these bipolar descriptions of the person or object. Onl