The 5th Annual EALTA Conference





EALTA's Guidelines for Good Practice: A Test of Implementation

by Charles Alderson & Jayanti Banerjee

The EALTA Guidelines for Good Practice in Language Testing and Assessment (GGP) guide test users in key testing principles, including respect for students and examinees, responsibility, fairness, reliability and validity. Since their adoption in 2006, they have been translated into 32 languages, making them the most widely available code of practice in the field and a potentially important instrument in the promotion of assessment literacy in Europe.

We explored this potential in a recent validation study of the English Language Proficiency test for Air Traffic Controllers (ELPAC). Aviation tests are clearly extremely high stakes, not just for the test-takers but for every potential airline passenger, and their quality can have a tremendous impact. They should, therefore, be held to the highest standards. Furthermore, aviation test providers should offer evidence for the validity and reliability of their tests in a form that is accessible to the variety of stakeholders in the aviation context.

This paper describes how we used the EALTA GGP to frame the executive summary of our final validation report, and reflects on how useful this exercise was, both for presenting validity evidence and for informing and educating test users about key language testing principles. We will also suggest how projects such as the ELPAC validation study might help us to pilot procedures for monitoring compliance with the EALTA Guidelines.




Exploring the Theoretical Basis for Developing Measurement Instruments on the CEFR

by Gary Buck, Spiros Papageorgiou & Forest Platzek

Assessment specialists have recently been developing and improving techniques for linking existing tests to the Common European Framework of Reference (CEFR). However, we also need to develop new tests based on the CEFR, and we need a theoretical framework as well as practical techniques to do so. As Alderson et al. (2006) note, this presents a circular problem: to validate the theoretical framework, measurement instruments are needed, but to validate these measurement instruments a theory is necessary. In this study, we attempt to break this circularity by creating a theory of test development using the CEFR and then evaluating it against real-world data.

We first develop a set of criteria to determine the CEFR level of listening and reading items that have been pre-calibrated on a Rasch scale. We then choose a subset of items that meet these criteria and set cut scores for each of the CEFR levels based on earlier work on the ILR scale (Buck, 2006; Buck & Platzek, 2007). We then use the empirical data to create a matrix of how persons of known CEFR ability level perform on items at every CEFR level. From this we present a theoretical model of how a test with a range of item difficulties can place test takers at any level of the CEFR.

Results will be discussed, the method will be evaluated, and implications will be drawn, with particular reference to improving our understanding of how to develop measures of the CEFR levels.
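The person-by-item matrix described in the abstract can be illustrated with a minimal Rasch sketch. All parameter values below are invented for illustration; they are not the calibrations from the study.

```python
import math

# Hypothetical Rasch parameters (logits): mean item difficulty and mean
# person ability assigned to each CEFR level. Illustrative values only.
item_difficulty = {"A2": -2.0, "B1": -0.5, "B2": 1.0, "C1": 2.5}
person_ability = {"A2": -1.5, "B1": 0.0, "B2": 1.5, "C1": 3.0}

def p_correct(theta, b):
    """Rasch model: probability that a person of ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Matrix of expected success rates: rows = person level, cols = item level.
matrix = {
    p_level: {i_level: round(p_correct(theta, b), 2)
              for i_level, b in item_difficulty.items()}
    for p_level, theta in person_ability.items()
}

for p_level, row in matrix.items():
    print(p_level, row)
```

Given such a matrix, cut scores can be placed where the expected success rate on items at a level crosses a chosen mastery threshold.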




Towards the Calibration of Translation Errors: The Why and the How

by June Eyckmans, Philippe Anckaert & Winibert Segers

Translation tests have always been a popular format for assessing language proficiency in schools and colleges throughout the world, although some language testers have raised serious objections to this practice (see Klein-Braley 1987, as opposed to Oller 1979). Ironically, while translation tasks are so common in tests of language proficiency, they are not available as tests of the ability to translate (Lado 1961). Until recently, no method had been developed to relate performance indicators to the underlying translation competence in a psychometrically controlled way (Anckaert, Eyckmans and Segers 2006).

In this paper presentation, a norm-referenced method for evaluating translation ability will be put forward that is based on the calibration of dichotomous items. The Calibration of Dichotomous Items (CDI) method is a stable and evaluator-independent form of evaluation that bridges the gap between language testing theory and the specific epistemological characteristics of translation studies. The different steps that lead to the construction of a standardized test of translation will be illustrated by means of empirically assembled data. These include pre-testing, establishing item difficulty indices, and procedures for verifying the reliability of the translation test.

The pros and cons of calibrating translation errors will be discussed and possible pitfalls will be anticipated. Finally, translation test calibration will be compared to today’s common practice of evaluating students’ translations through the use of analytical grids.




Effects of Computer Interface Design on High School Graduates

by Jesús García Laborda, Ana Gimeno Sanz, Emilia Enríquez Carrasco & Teresa Magal Royo

The University entrance examination is probably the most significant compulsory high-stakes test in Spain. Recently, there have been moves towards the use of computers in its implementation. The Ministry of Education and the Regional Government have begun two programs, PAULEX UNIVERSITAS (supported by the Spanish Ministry of Education) and SELECTOR (Regional Department of Education), which are aimed at verifying whether this significant change could take place (García Laborda, 2006). This paper addresses the design and implementation of the prototype computer tool designed to deliver the test according to current research (García Laborda, 2007a; García Laborda, 2007b), as well as the deficiencies detected in its implementation given the relationship between design, context and construct. Twenty-eight students responded to a questionnaire about their difficulties in responding to a mock exam. Results indicate that context and design have potential effects on their scores. Students also felt that the format of the test may change their results, and some acknowledged that they felt computer-literate classmates had a clear advantage regardless of English proficiency.




Procedures for Training Item Writers

by John H.A.L. de Jong

In the context of developing a test of international English for university admission, item writers were trained to write items to meet item specifications, including a target range of CEFR levels. In this paper we report on the item-writer training procedures. These procedures include the provision of templates that item writers use to submit the items they have written. Within the template, one of the variables is the level expressed in terms of the Common European Framework of Reference for Languages (CEFR). Intensive training was provided to item writers in three locations: Washington, London, and Sydney. Training included group and individual familiarisation exercises. Items submitted by item writers were evaluated by peer item writers from locations other than where they had been written; e.g., items written in Washington were reviewed in London and Sydney, items written in London were reviewed by the teams in Washington and Sydney, and so on. After this initial review, items were reviewed centrally.

Once approved and assigned to a particular CEFR level, items were considered ready for field testing. To further evaluate the assigned levels, several procedures were set up. First, pre-calibrated items from DIALANG were included in the field test as anchor items. Second, several productive items were scored by human raters on appropriate CEFR scales.

In addition to item responses, subjects provided information on recently obtained scores on TOEFL and/or IELTS. Discrepancies between these reported scores and scores obtained on the field test will be discussed.




Use of Young Learner Can-Dos as Part of Formative Assessment

by Szilvia Papp & Neil Jones

Assessment in the classroom should focus on language learning as a continuous process of acquiring skills: not simply can-do outcomes describing students’ ability in some “real world” (the summative aspect, which focuses on capacity for language use), but above all the learning skills, enabling skills and strategies which are the basis of effective learning in the world of the classroom (the formative aspect, which focuses on the process of learning). Cambridge ESOL have been developing can-do statements for young learners which address both of these types of assessment within a single conceptual framework. One of the aims of the project is to assist teachers in monitoring learners’ progress within the classroom. This is in line with Assessment for Learning principles, which emphasize the provision of effective feedback to learners, adjusting teaching to take account of the results of assessment, and acknowledging the influence of assessment and feedback on students’ motivation and self-esteem. In this presentation we report on a validation project involving teachers of learners aged 10-14 in four countries. Teachers were asked to indicate which types of can-dos are most useful in the ongoing monitoring of learning:

  • scaled can-dos referring to learners’ emerging competences,
  • can-dos relating to language content based on the course syllabus, or
  • can-dos relating to learning skills, strategies of language use, and descriptors referring to intercultural and language awareness.

Teachers were also asked to report on how they could and would use the can-dos within their work in the classroom.




A Test Validation Study Based on Textual Analysis

by Christine Niakaris

This paper presents a test validation study of a B2 level writing task, a letter to the editor (LE) of a local community newspaper, included as an option in the writing section of a high stakes international language examination.

According to Hyland (2002), current perspectives on validity indicate that the construct is the most important element in writing assessment. The presentation reports on the process of defining the construct by analyzing the genre of authentic letters to the editor of local community newspapers, and explores how these analyses relate to the official scoring rubrics and samples of candidates’ letters, benchmarked by the test developers.

The presentation begins with a brief discussion on the misconceptions of the construct of an LE within the local testing community (teachers and students) and why a genre analysis approach is useful in helping to define this specific construct. Using a framework based on a systemic functional interpretation of genre (Eggins 2004), I report on the analysis of ten authentic letters to the editor and discuss the findings in relation to the scoring rubrics and benchmarked LEs.

The presentation concludes with some insights into the construct and the challenges they present in attempting to clarify the misconceptions of the LE genre for the local testing community.




Auditing the Validity Argument

by Nick Saville, Piet van Avermaet & Henk Kuijper

This paper reports on the ALTE Procedures for Auditing and the outcomes of audits conducted in 2007. It summarises the background to the approach, presents data collected so far, and reflects on the lessons to be learnt in implementing this approach. The need to develop a validity argument in order to justify the use of language assessments to stakeholders, using principles of good practice and a quality management system, forms part of this discussion.

The formal scrutiny of 17 minimum standards is the culmination of a process of establishing audited quality profiles for international language examinations. The aim of the process is to enable examination providers to make a formal and ratified claim that a particular test or suite of tests has a quality profile appropriate to the context and intended uses. In providing the evidence to support the claim, a validity argument is provided using the ALTE Principles of Good Practice and QMS Checklists; these “tools” help structure the arguments and provide supporting explanations and justifications. The background to this work was reported at the 2nd EALTA meeting (Voss 2005).

The minimum standards are now being used to establish quality profiles for ALTE exams; the auditing system was put into practice in 2007 and by the end of the year all members had been audited or were in the process of being audited. Data from these recent audits and the ways in which the auditing process can help to build and justify validity arguments will be discussed.




The Role of Communication in Creating Positive Washback

by Dianne Wall & Tania Horäk

While it is widely accepted that testing agencies should consider the washback of their tests on classroom practices, little has been written about the steps they can take to ensure that teachers understand the nature of their tests and know how to prepare their learners in ways that complement rather than restrict the curriculum. The washback literature contains numerous examples of how the good intentions of test designers can result in less than satisfactory practice if teachers do not receive the guidance they need to cope with the demands the tests place upon them.

The aim of this paper is to discuss a number of ways that testing agencies can communicate their intentions and help teachers and learners alike to prepare for new kinds of assessment. We present examples of good practice which have been used in national and international testing contexts, which we have come across through our participation in test development and evaluation projects and during our analysis of a recent survey of EALTA members involved in high-stakes testing. The notion of assessment literacy should apply not only to test construction, administration and marking processes, but also to the dissemination of the principles embodied by the tests and the provision of teacher and learner support. This paper is directed especially to those seeking to introduce innovative tests into traditional systems, but it should also be of interest to practitioners who have seen their good intentions result in unintended and less than beneficial washback.




Professional Language Proficiency: Setting Thresholds in Engineering

by David L.E. Watt & Andreea Cervatiuc

Countries that rely on immigration as a means of addressing population growth and workforce expansion are increasingly aiming their immigration policies towards the recruitment of skilled professional immigrants. However, existing regulations for professional licensure often act as an impediment to professional immigrant integration. While instruments are in place for the assessment of professional knowledge and professional skills, the assessment of professional language proficiency (PLP) is only beginning to find its way into the licensure processes of many professions. The definition of PLP and the process of accountably establishing proficiency thresholds for professional practice are an integral part of this emerging field of language assessment.

The presentation reports on a study undertaken in conjunction with a professional association for licensure in Engineering. The purpose was to establish an acceptable PLP threshold for P. Eng. licensure. A purposive sample of 69 engineers, representing Canadian Engineering Graduates (CEG) and International Engineering Graduates (IEG), were job-shadowed and interviewed to identify relevant professional communication tasks. The general language proficiency of the IEGs was assessed based on the Canadian Language Benchmarks. Illustrative samples of oral and written tasks at different language proficiency levels were presented to engineering assessors to determine acceptable PLP thresholds. A statistical generalizability analysis was performed to evaluate the dependability of the measurement technique used in the study.
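The dependability analysis mentioned above can be sketched as a one-facet generalizability study, with persons crossed with raters. All scores below are invented, and the study's actual design and statistics may differ; this only illustrates how a dependability coefficient is derived from variance components.

```python
scores = [  # rows = persons, columns = raters (invented ratings, scale 1-5)
    [2, 3, 2],
    [4, 4, 5],
    [1, 2, 1],
    [5, 5, 4],
    [3, 3, 3],
]
n_p, n_r = len(scores), len(scores[0])

grand = sum(sum(row) for row in scores) / (n_p * n_r)
person_means = [sum(row) / n_r for row in scores]
rater_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

# Mean squares from a two-way ANOVA without replication
ms_person = n_r * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
ms_resid = sum(
    (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
    for p in range(n_p) for r in range(n_r)
) / ((n_p - 1) * (n_r - 1))

# Person variance component and the relative G coefficient for n_r raters:
# the proportion of observed-score variance attributable to true person
# differences rather than rater inconsistency.
var_person = (ms_person - ms_resid) / n_r
g_relative = var_person / (var_person + ms_resid / n_r)
print(f"G (relative, {n_r} raters) = {g_relative:.2f}")
```

A coefficient near 1 indicates that averaging over this many raters yields dependable person rankings; the same components can be reused to project dependability for other numbers of raters.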

The presentation concludes with the identification of the threshold for PLP in Engineering, along with a discussion of the strengths and limitations of the approach in relation to test design.




To Assess Speaking Analytically or Holistically?

by Taner Yapar

This presentation focuses on the two traditional methods for testing speaking: analytic and holistic scoring. First, what each method involves is explained and some sample rubrics are presented. Then, the contexts in which they are used and the merits of both methods are discussed.

The presentation then introduces findings of research conducted at TOBB University of Economics and Technology to decide whether to test proficiency in English with an analytic or a holistic rubric. For the research, 74 students were selected as subjects and their speaking exams were graded both analytically and holistically. Their scores were then correlated with their institutional TOEFL (ITP) scores. Neither method proved superior to the other in predicting TOEFL ITP scores. The findings thus underline that speaking is not catered for by TOEFL ITP scores, and therefore the TOEFL ITP should not be used on its own for proficiency testing. Finally, it was found that the decision about which rubric to employ rests with the administrators. That is to say, if there is a time constraint and giving feedback to the students is not possible, the holistic rubric could be used. On the other hand, if there is no time constraint and if the speaking exams are to be used to give detailed feedback, especially for formative purposes, the analytic rubric could be used.
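The correlational comparison described above can be sketched as follows. The scores are invented for illustration (they are not the study's data); Pearson's r is computed from scratch with the standard formula.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented scores for five students: analytic and holistic speaking
# grades, plus TOEFL ITP totals as the external criterion.
analytic = [72, 65, 80, 58, 90]
holistic = [70, 68, 78, 60, 88]
toefl_itp = [500, 480, 530, 460, 560]

r_analytic = pearson_r(analytic, toefl_itp)
r_holistic = pearson_r(holistic, toefl_itp)
print(f"analytic vs ITP:  r = {r_analytic:.2f}")
print(f"holistic vs ITP:  r = {r_holistic:.2f}")
```

If the two coefficients are of similar magnitude, as in the study, neither rubric predicts the external criterion better than the other.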




A Taxonomy of Argumentative Thesis Statements: The Testing Perspective

by Gyula Tankó & Gergely J. Tamási

Although it is universally acknowledged that argumentative texts are the most difficult to write and that the measurement of argumentative writing skills is essential for educational, professional, and gatekeeping purposes (e.g., CPE, IELTS, GRE, or TOEFL), only a small part of the mainstream body of literature available on argumentative essays focuses on the assessment of argumentative writing. Moreover, the majority of these studies (e.g., Connor & Takala, 1987; Knudson, 1992; Connor, 1993; Yeh, 1998) focused exclusively on investigating the variation in holistic argumentative/persuasive essay scores in terms of a set of argument substructures derived from the Toulmin model of argument (1958). The aim of this paper is to shift the focus to a key structural element of the argumentative essay: the thesis statement. Based on the results of written and spoken argumentation studies, we propose a comprehensive and transparent taxonomy of argumentative thesis statements. Furthermore, we present the results of an empirical study whose main goal was to test the taxonomy on the argumentative essay subcorpus of the Hungarian Corpus of Learner English, a corpus comprising academic argumentative texts produced by Hungarian students of English. The results have practical implications for both the development and the testing of argumentative essay writing skills. They show, for example, student preferences for thesis statement types, major and minor problems in the formulation of position statements, and the relationship between the prompt and the types of argumentative theses students choose to develop into essays.




IELTS and the Academic Reading Construct

by Anthony Green & Cyril Weir

Reading for academic purposes typically involves extensive reading of books and journals and synthesising diverse sources of information to build an understanding of an academic subject. Tests of academic reading ability, in contrast, seem predominantly to involve questions requiring understanding of propositions contained within a single sentence, without a requirement either to process extensive stretches of text or to integrate information in order to arrive at the correct answers (Weir et al. 2006).

This paper will report the results of a study of the reading types employed by test takers responding to the widely used IELTS test of academic reading ability. Over 250 test takers responded to tasks on the IELTS academic reading test and completed a previously validated retrospective protocol relating to the range of cognitive processes they employed when performing the test tasks and the extent of information from the text(s) that they drew on in arriving at their answers.

The results of this exercise were compared with responses to an earlier survey of over 1000 students reporting on the reading types they employed in their academic studies. This comparison, employing factor analysis, provides evidence for the degree of congruence between the construct measured by the test and academic reading in the target domain.




Codes of Ethics and Good Practice and Consumers’ Rights in Language Testing

by Pavlos Pavlou

In her discussion of democratic perspectives on assessment, Shohamy (2001) mentions that one necessary principle in assessment is the need to protect the rights of test-takers. Shohamy cites this need, as well as the need to conduct and administer testing in collaboration and cooperation with those tested. This paper relies upon these principles and elaborates upon the rights of the test-taker as a consumer. Such basic rights and the responsibilities of testers with regard to possible violations are discussed. Moreover, test-takers’ rights are discussed with reference to various codes of ethics and good practice, such as the Code of Ethics of the International Language Testing Association (ILTA) and the EALTA Guidelines for Good Practice.

The paper presents cases of rights violations that have been revealed through analysis of the results of a survey among test-takers, their parents and test promoters in Cyprus. The elicited information relates to the following consumer rights: the right to take a test of choice, the right to be informed about the content and possible uses of the test results, and the right to protest in cases where they are not pleased with the test.

The likely reasons for possible violations, with special emphasis on financial gain as a motivation for test producers, are discussed, along with recommendations for avoiding and rectifying intentional or unintentional violations. Additional suggestions for language testers and major testing organizations to take the initiative in preventing violations of test-takers’ rights are presented.




Scaled Scores—Theoretical and Practical Challenges

by Sarah Briggs & Fernando Fleurquin

Many test developers and users recognize the complexities of understanding and comparing scores on different language exams, but equally complex is the issue of the comparability of scores on alternate forms of the same exam. Accountability and fairness demand the development of alternate test forms that not only assess the same language construct and domain but are also equal in difficulty. Test developers are responsible for equating the different forms of the test, as well as for informing stakeholders how this process is implemented and interpreted.


However, the processes for test equating can seem conceptually complex to test users who are not assessment specialists and are not familiar with the statistical procedures for equating. Still, they need to know how to determine examinees’ readiness to take the test and how to interpret scores. If the information provided is too technical (e.g., IRT, scaled scores, SEM), stakeholders may not be able to understand it, and that may be interpreted as a lack of transparency. If the information provided is simplified, it may be misleading or may lead to overgeneralizations.


This presentation attempts to enhance the assessment literacy of educators who are charged with developing multiple forms of tests, whether constructed-response or multiple-choice in format. After providing a brief overview of procedures used in equating and scaling tests, the presenters discuss the challenges of meeting current standards of educational testing that require “clear explanations of the meaning and interpretation of derived scaled scores, as well as their limitations”.
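One of the simplest equating procedures the presenters' overview could cover is linear (mean-sigma) equating, sketched below with invented scores from equivalent groups; operational programs typically use more elaborate designs.

```python
import statistics

def linear_equate(score_x, x_scores, y_scores):
    """Map a raw score on form X onto the scale of form Y so that
    standardized positions (z-scores) match."""
    mx, sx = statistics.mean(x_scores), statistics.pstdev(x_scores)
    my, sy = statistics.mean(y_scores), statistics.pstdev(y_scores)
    z = (score_x - mx) / sx
    return my + z * sy

# Invented raw scores from equivalent groups taking form X and form Y.
form_x = [40, 45, 50, 55, 60]
form_y = [38, 44, 50, 56, 62]  # form Y's scores are more spread out

# A raw score of 55 on form X corresponds to this score on form Y's scale:
equated = linear_equate(55, form_x, form_y)
print(round(equated, 1))  # → 56.0
```

In practice the equated raw scores are then converted to a reporting scale, which is where the "scaled scores" of the title come in.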




A Case Study: Conceptual Changes in Assessment Practices

by Sehnaz Sahinkarakas & Kagan Buyukkarci

Research in the area of Language Testing and Assessment shows that formative assessment, which leads to active involvement of students in assessment practices, is a valuable approach in promoting learning. Research also shows that implementing formative assessment properly necessitates changes in teachers’ perceptions of their roles in classroom practices. Without this change, teachers might adopt different interpretations of formative assessment policies, which may then have negative effects on students.

This paper presents a case study conducted with a group of young EFL learners and their teacher. The case study reports how the teacher perceived assessment practices and how these perceptions changed over the course of a formative assessment implementation. In order to reflect on the new constructs formed in the process of this change, the teacher’s perceptions before and after the implementation were analyzed and represented using repertory grid data. An analysis of classroom discourse was also used to provide evidence for the newly developed constructs.




Evaluation and Control of Rater Reliability: Holistic vs. Analytic Scoring

by Claudia Harsch & Guido Martin

In the context of the evaluation of the Educational Standards in Germany, writing skills are assessed using a uni-level approach. In this paper, we examine the employment of holistic criteria vs. analytic ratings of descriptors. We present data from the pilot study and two preceding feasibility studies.

The initial, quite common, approach was to use four holistic criteria (task fulfillment, organisation, grammar, vocabulary), each defined by several descriptors based on the CEFR. While the holistic ratings of the first feasibility study displayed seemingly high reliabilities, the second study showed clearly that more than half of the descriptors were not used uniformly by raters. The high holistic reliabilities proved to be an artefact: ratings that differed significantly on the descriptors of a criterion resulted in the same holistic ratings of that criterion. The same effect appears in data simulations employing random distributions of descriptor ratings.
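The artefact can be reproduced in a small simulation under invented assumptions: two raters score four descriptors per criterion on a 0-3 scale, their descriptor ratings are drawn independently at random, yet the holistic rating (the rounded mean of the descriptors) often coincides, so holistic agreement overstates descriptor-level agreement.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
N_ESSAYS, N_DESC = 1000, 4

def rate():
    """One rater's random descriptor ratings for a criterion (0-3 each)."""
    return [random.randint(0, 3) for _ in range(N_DESC)]

def holistic(descriptor_ratings):
    """Collapse descriptor ratings into one holistic rating."""
    return round(sum(descriptor_ratings) / len(descriptor_ratings))

desc_agree = holo_agree = 0
for _ in range(N_ESSAYS):
    r1, r2 = rate(), rate()
    desc_agree += sum(a == b for a, b in zip(r1, r2)) / N_DESC
    holo_agree += holistic(r1) == holistic(r2)

print(f"descriptor-level agreement: {desc_agree / N_ESSAYS:.2f}")
print(f"holistic-level agreement:   {holo_agree / N_ESSAYS:.2f}")
```

Even with raters agreeing on individual descriptors only at chance level (about 0.25 here), averaging pushes both raters toward the middle of the scale, so their holistic ratings coincide far more often.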

In order to evaluate and control rater behaviour, it is thus necessary to collect data on the descriptor level in addition to the holistic ratings.

These insights from the feasibility studies led to the decision to train and rate at both the descriptor level and the holistic level. During rater training it became evident that several descriptors had to be revised in order to achieve reliable ratings, and that such revisions may interfere with the ratings of other descriptors. Consequently, each descriptor revision has to be empirically evaluated to ensure that no unintended side effects occur. After the training, high reliabilities were achieved.




“Can Do”: What Can They Do After All?

by Maria Davou

Bygate (1987) lists formulaic sequences as one of the tools that facilitate the production of spoken language and contribute to oral fluency. According to the Common European Framework, a learner at C2 level has a good familiarity with 'idiomatic expressions and colloquialisms'; at B1 uses 'routines and patterns'; at A2, 'memorized phrases, groups of a few words and formulae'; and at A1 'can manage very short, mainly pre-packaged utterances'. All of these expressions (among many others) are terms used to describe aspects of formulaicity (Wray, 2002). Formulaic language as a term, or formulaic language features, appears in various assessment scales for speaking. The public version of the IELTS Speaking band descriptors refers to the use of idiomatic language for Bands 8 and 9 but also mentions 'memorized utterances' for Band 2. The questions of what formulaic language is, what exactly it consists of, and how it develops in a second language cause a lot of confusion. Wray (2002) notes that "there have been relatively few published data-based studies which specifically examine formulaic sequences in L2". The aim of my study is to explore how learners develop their oral production in terms of formulaic language. For the purposes of this presentation, I focus on a pilot study with six learner interviews at A2, B2 and C2 levels. Analyzing this oral corpus, I explore how learners at different levels use formulaic language to achieve particular communicative objectives and how their performance compares with the "can-do" statements of the CEFR.




Writing in Context: A European Study of Comparative Academic Literacy Practices

by Carole Sedgwick

The Bologna Declaration (1999) is an agreement to harmonise degree qualifications in Europe in order to promote mobility for work and education between the signatory countries. To what extent is it possible to create common European standards between universities that are in different cultural and linguistic contexts? How can the assessment of writing in one context give useful information about a student’s ability to cope with the writing demands of another?

Waters (1996:35) in his review of needs analyses conducted for the TOEFL 2000 project stated:

‘it would seem best, as a general strategy for further research, to first of all properly establish the nature of a small part of the field, and only then to broaden the enquiry. Unless this basic strategy is followed, there is a danger that there will be insufficient depth of understanding and communication for the wider type of survey to produce reliable results.’

This paper reports a comparative qualitative study of literacy practices on postgraduate English language major programmes at two universities, each in a different cultural and linguistic context, Hungary and Italy. Interview protocol data from tutors and students, written assignments, feedback on assignments, and contextual documentation were collected. The analysis of the data so far has demonstrated that an academic literacies approach can provide valuable insights into writing demands and the underpinning beliefs and values about writing that relate to context. It is believed that this small-scale study will generate some ‘useful’ responses to the questions posed above.




Is Your Test Question Related to my Answer?  Exploring the Relationship between Test-takers’ Interpretations of Spoken Text and the MC-test Format

by Joanna Anckar

As part of a larger study investigating the nature of the multiple-choice test format for assessing FL listening, a pilot study explores the relationship between the originally proposed options for multiple-choice questions (administered in a test of listening comprehension of French as a FL within the Finnish matriculation examination in 2002) and short answers to the same questions, or short summaries of the text passages, in an open-ended version. The purpose is to investigate the plausibility of the original options: do they reflect probable correct or false interpretations of the spoken text contents?

A group of test-takers (n = 100) submit written answers. The questions are identical to the original multiple-choice questions, but with the options left out. In addition, 63 test-takers are asked to write brief summaries of the contents of the same text passages.

The responses are analysed mainly qualitatively but also quantitatively, in order to obtain information on what kind of interpretations the test-takers make of the spoken text, in the situation where no response options are provided.




Assessing the Literacy of Children: A CEFR-linked Longitudinal Study 

by Angela Hasselgreen

This presentation will outline the progress of a recently started study in Bergen, Norway, in which the reading and writing of children in two school classes (aged 10 to 13) are being studied at regular intervals over two years. The project will be extended to cover classes in Spain, Slovenia and Lithuania from autumn 2008. The background to the project is the need for teachers to be able to recognise and describe what their pupils can do in their English reading and writing. The CEFR has started to gain recognition among Norwegian teachers, but it has shortcomings when applied to younger learners. The aim of the project is to provide CEFR-linked benchmarks of reading texts and pupils’ writing, based on evidence of how learners actually progress in the zone between A1 and B1, as well as material for teachers to use in placing pupils. The methods used and the data collected so far will be presented, together with conclusions drawn from the early analysis of the data.


To Top


LangPerform – Computer Simulations for Training, Documentation and Evaluation of Language Performance

by Kim Haataja

Today, considerable efforts are being made across Europe, and partly beyond, to implement the Common European Framework of Reference (CEFR) and the instruments of the European Language Portfolio (ELP) in foreign language education and in the associated testing and assessment processes at school. Together with recommendations for diversifying (foreign) language instruction, e.g. through the use of information and communication technologies (ICT) and the cross-curricular mediation of (foreign) languages in the name of content and language integrated learning (CLIL), the appropriate use of the CEFR and the ELP represents one of the most central challenges and development areas of (foreign) language education.

The aim of this work-in-progress presentation is to show and discuss how these issues could be tackled with the help of a computer-mediated training, testing and assessment tool. First, a general overview will be given of the background and development process of the innovation in question. Second, some impressions from test sessions already conducted in German as a foreign language in three different European countries will be presented. Finally, the means and prospects of further developing and transferring this testing and evaluation technique, e.g. into the context of CLIL education or working-life language learning (including the respective evaluation measures), will be discussed.


To Top


City & Guilds CEFR Linking Project: Work in Progress

by Rachel Roberts

This Work in Progress focuses on the City & Guilds alignment of the International ESOL Communicator examination. It will begin with a very brief summary of the background to the project and an overview of the work done to date.

This work has broadly followed the recommendations of the Draft Manual (2003), although the procedures adopted differ slightly from the linear approach implied there.

The main focus of this presentation will be how the alignment process has been tailored to fit more closely the needs of an international awarding body committed to maintaining quality standards. It will emphasise the iterative nature of the project and the steps taken to embed CEFR methodology into the assessment processes. The conclusion will pose questions about possible limitations of the Manual and make some recommendations for change.


To Top

Text Comprehension Difficulty from the Test-takers’ Perspective: Predicting the Unpredictable

by Trisevgeni Liontou

This presentation focuses on test-takers’ attitudes to the texts and reading tasks of the reading comprehension module of the Greek State Certificate of English Language Proficiency exam (KPG). As such, it constitutes part of ongoing research on the effect that reader and text variables (i.e. text organisation, genre, content, and readers’ background knowledge and sex) have on text comprehensibility. Data from a survey conducted on a national scale, in the form of questionnaires administered to candidates sitting the KPG exams, will be presented (4,000 questionnaires administered during 2006-2007), along with a discussion of text and task difficulty from the test-takers’ perspective, i.e. their level of familiarity with specific topics, their difficulty in coping with certain lexical elements, their feelings of anxiety when reading specific texts, etc. Moreover, light will be shed on whether candidates’ beliefs had an impact on their actual performance, and empirical evidence will be provided on the level of lexical density of specific test texts. Finally, an alternative way of estimating text difficulty will be proposed, based on systemic functional grammar and on the quantitative results of the survey.

The initials KPG correspond to the acronym KΠΓ, which in Greek stands for Kratiko Pistopiitiko Glossomathias, translated into English as State Certificate of Language Proficiency. The exams are administered by the Greek Ministry of National Education and Religious Affairs, and since their introduction in November 2003 more than 100,000 candidates have taken the English exams.


To Top


Teaching and Assessing Argumentation and Negotiation Skills

by Ritva Aho

Argumentation and negotiation are skills needed at all stages of our lives and in all domains of activity. Thus they are basically context-independent and potentially have good transfer value. Designing, implementing and assessing a unit on argumentation poses a familiar problem of alignment, which can be eased by having a good model to draw on. In the case of argumentation, the Toulmin model of practical reasoning (1958/1984) has gained increasing recognition; it now serves as a tool for conceptualising validity argumentation in general (Kane 2006) and in the case of TOEFL (Enright 2007). In the case of negotiation, Goldman & Rojot (2003) provide a good rationale. These sources were drawn on in the project to be reported.

The context is tertiary-level military (officer) education. A unit of 15 hours has been planned, following a content-and-language-integrated (CLIL) approach and consisting of theories of negotiation, demonstrated examples of argumentation, and exercises in argumentation. The Toulmin model is demonstrated and practised to support the students’ ability to acquire data, make claims, and justify their claims with data. Fundamental principles of negotiation are presented and discussed, on the premise that a good argument can provide good justification for a claim – justification that will stand up to criticism and earn a favourable verdict.

Assessment covers self-assessment, peer-assessment and teacher assessment. Criteria for assessment are derived from the models presented. CEFR scales for linguistic accuracy, understanding context and understanding conflict are also used.

The presentation will describe the work done to design the unit and present some preliminary findings. 


To Top


Assessment Literacy of EFL Teachers in Greece – Current Trends and Future Prospects

by Dina Tsagari

Assessment is a widespread - if not intrinsic - feature of most language teaching programs worldwide. This has resulted in a proliferation of various standardized tests as well as in the introduction of teacher-conducted assessments used as a basis for reporting learners' progress and achievement against national standards (Brindley, 1997). Teachers thus find themselves in the position of having to evaluate and prepare students for standardised tests as well as create their own classroom-based instruments for monitoring, recording, and assessing learners' language progress and achievement.

However, relatively little is known about how teachers are dealing with these demands and, more importantly, how such assessment practices impact on their daily teaching. Studies investigating classroom-based assessment practices within the ESL/EFL school contexts (Breen et al. 1997; Davison and Leung, 2001; 2002; Davison, 2002) as well as the tertiary level (Cumming, 2001; Arkoudis and O’Loughlin, 2004; Cheng, et al., 2004) have stressed the need for further research as the picture is not yet complete. With so much time and money devoted to language assessment, it is worth critically understanding how teachers carry out their assessment decisions and what assessment purposes and procedures they use in their daily practice. 

This presentation will discuss recent research into the complex and multifaceted roles that classroom assessment plays in different teaching and learning settings. It will then provide an overview of the Greek language education system and present research results showing the nature of the language assessment landscape and the training opportunities of private and state school EFL teachers in Greece.

It will conclude by discussing the need for teachers to become more "assessment literate", presenting ways in which they can do so, and considering how testing experts and teacher trainers can help in this direction.


To Top


Measurement Equivalence of Written C-Tests and Multiple-Choice C-Tests

by Hella Klemmert


C-Tests are integrative written tests of general language proficiency based on the concept of reduced redundancy. Multiple-choice (MC) C-Tests offer economic advantages, but their validity is unclear. Therefore, n = 1,061 customers of the Psychological Service of the German Federal Employment Agency completed an established German C-Test and a computerized MC version in randomized order. Confirmatory factor analyses and item response theory analyses revealed separable but highly correlated constructs (r = .95). The MC version was not biased with respect to sex, age or German language proficiency, and was recommended for regular use in the local employment agencies. In light of these positive results, an English MC C-Test is now under development.


To Top


The Challenges of Testing Candidates with Disabilities: Guidelines for Stakeholders

by Fernando Fleurquin

Test developers, educators, and examinees are concerned with many aspects of fairness in testing. One of the concerns under constant discussion is how to provide a fair assessment of the linguistic proficiency of individuals with disabilities. In order to provide equal treatment to individuals with disabilities, test developers can modify the format of the test or the administration conditions for the test. The test must be modified in such a way that the effect of the individual examinee’s disability is minimized or eliminated, while at the same time the construct is not altered. However, the validity of a test designed and normed for a “normal” population can be challenged when it is modified for administration to examinees with disabilities.

This poster helps different stakeholders (students and examinees, educators, test developers, as well as people who make decisions about accommodations or who interpret test results) to understand some of the variables to keep in mind when administering a test to examinees with disabilities, or when interpreting its results. The poster will provide background information and standards on appropriate accommodations, the modifications proposed for individuals with visual, hearing, motor, learning, or other kinds of disabilities used by a high-stakes ESL/EFL testing program, as well as guidelines on how to report, interpret, and use test results. It will also show statistics from several years, sample tests and examinee responses, as well as conclusions on ethical, legal, medical, and socio-political issues related to the administration of tests to individuals with disabilities. 


To Top


Similarities and Differences in EFL Proficiency – a Multifaceted Analytic Approach

by Lisbeth Åberg-Bengtsson, Gudrun Erickson, Jan-Eric Gustafsson, Sölve Ohlander & Dorte Velling-Pedersen


This poster presentation aims to describe a project, funded by the xxx Research Council, that focuses on the outcomes of a wide range of analyses of data from a European study of English as a foreign language (EFL) at the end of compulsory school in eight European countries (Bonnet et al., 2004). Based on a large body of data, including performance data as well as background and self-assessment data, the project has analysed variability in language proficiency, attitudes and language habits. The main emphasis is on the xxx data, but some comparisons have been made with results from other participating countries. Here, contributions from English instruction based on national curricula, as well as from exposure to English in society, are considered. Different methods, ranging from descriptive statistical approaches to multivariate techniques and performance-based linguistic analyses, have been applied in order to examine the data from both a quantitative and a qualitative angle. Results will be presented focusing on the dimensionality of language proficiency; on students’ backgrounds, habits and self-perceptions in relation to results; and on the language performance of students with different linguistic backgrounds and at different proficiency levels. Throughout the project, researchers with different backgrounds – psychometrics, didactics and linguistics – have collaborated on treating the data. The overall design of the project, with its multifaceted disciplinary and analytic approach, will be considered in relation to the theme of the conference, namely the concept of assessment literacy.


To Top


Assessment Literacy: Outreach in Teacher Education

by Nigel Downey, Christine Niakaris & Alexandra Tsakogiannis


In countries where high stakes international language testing is widespread, there is a need for test administrators and test developers to support language teachers and learners in gaining a greater understanding of the construct of such examinations in terms of level, content, and format, as well as their assessment criteria, in order to generate guidelines for effective classroom preparation.

The Hellenic American Union, as test administrator, and the Hellenic American University’s Testing Division respond to this need by conducting seminars for language schools throughout Greece as part of their outreach examination services.

The poster presentation will describe the rationale and content of these seminars, which take into consideration the needs of both learners and teachers for effective examination preparation. Samples of slide presentations will be displayed, accompanied by the rationale for the selection of topics covered, for example, Testing Reading Comprehension or Understanding Scoring Rubrics.  This will lead to an examination of how aspects of assessment literacy can be successfully conveyed to stakeholders, who may otherwise have little knowledge of the field, such as to what extent teachers need to know about the various aspects of scoring systems and how such knowledge can be presented in a meaningful way.

The presentation will conclude with insights into the methodological principles underlying presentations for teachers and how they are related to language teaching beyond the examination preparation classroom.


To Top


Assessment Literacy of Foreign Language Teachers in Europe – Current Trends and Future Perspectives

by Karin Vogt, Elizabeth Guerin, Sehnaz Sahinkarakas, Dina Tsagari, Pavlos Pavlou & Qatip Arifi


One prominent feature of quality assurance in language testing and assessment is teacher training, both initial and in-service. Foreign language teachers are faced with evaluating standardised tests and creating their own classroom-based assessment instruments. To do this, they need specific training and know-how to bring their professional education as language teachers up to date, and to play a significant role in making parents, students, policy makers and colleagues aware of the issues involved in making language assessments and in the subsequent use of assessment results.

In order to advance professional teacher development, as well as to establish teachers’ current qualifications and identify training needs in classroom-based language testing and assessment, the online questionnaire entitled European Survey of Language Testing and Assessment Needs was adapted and tailored so as to identify the overall assessment training needs of foreign language teachers in some specific countries. Questionnaires were administered to foreign language teachers in Germany, Italy, Turkey, Greece, Cyprus and FYROM to gauge the assessment literacy of foreign language teachers in Europe. The findings give insights into current trends in parts of Europe and lead to a discussion of future prospects related to the role of LTA training in teacher development.


To Top


L1 and L2: Differences in Lexical Richness of Produced Genres

by Krista Kerge & Hille Pajupuu

One marker of language skill is lexical richness. Within the project “Assessing and modelling of speaking naturalness”, we compared the vocabulary of 8 local Russians who had successfully passed high-level (B2+/C1) Estonian language exams with that of 8 native non-philologists.

We measured lexical richness with the Uber index: U = (log N)² / (log N − log V), where N is the total number of running word forms (tokens) and V the number of different words (types). The higher the Uber index, the richer the vocabulary.
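As a minimal illustrative sketch (not part of the original study; the function name and tokenisation are assumptions), the index can be computed directly from a tokenised text:

```python
import math

def uber_index(tokens):
    """Uber index U = (log N)^2 / (log N - log V), where
    N = total number of running word forms (tokens) and
    V = number of different words (types)."""
    n = len(tokens)          # N: tokens
    v = len(set(tokens))     # V: types
    if n <= 1 or v >= n:     # undefined when log N - log V = 0, or N too small
        raise ValueError("Uber index is undefined for this sample")
    return math.log(n) ** 2 / (math.log(n) - math.log(v))
```

Note that the index is undefined for a sample in which every token is distinct (V = N), since the denominator is then zero.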

The text material was gathered from high level Estonian exams (essay, oral dialogue and monologue).

Lexical richness differed considerably between the two groups for all text types: native speakers had richer vocabulary in oral dialogue (U = 20.4 versus 16.4), monologue (24.0 versus 19.2), and especially essays (34.0 versus 21.2).

We also compared the words used with an elementary Estonian vocabulary (the 3,000 most frequent words in public texts, used to compile elementary-level (A2) tests). In L1, the share of this frequent vocabulary was 64% in the oral form and 58% in the written form; in L2 it was the other way round – 53% oral and 64.5% written.

The results suggest the following. In written L1 use, lexical richness is considerably higher than in oral use, and the share of less frequent words is also higher. The relatively high share of elementary vocabulary even in high-level L1 texts needs to be borne in mind when compiling and assessing L2 exam tests. The distinctly poorer lexical richness in written L2 use needs to be interpreted together with strategies for avoiding non-frequent vocabulary.


To Top 


An Analytic Investigation of Paired Compositions from a High-Stakes Writing Test

by Sarah Van Bonn, Aaron Ohlrogge, Fernando Fleurquin & Barbara Dobson

Examinees who participated in an intermediate standardized test of English as a foreign language with a writing component had the opportunity to take a second writing test on a new topic a month later. A group of researchers analyzed the pairs of compositions to evaluate how written performance may vary according to topic.

From a test population of over 11,000 examinees, the vast majority did not show differences in written performance, and their final exam result did not change. This poster presents case studies showing pairs of compositions written by the same examinees. Two types of written performance were analyzed: pairs of compositions that received significantly different scores, and pairs of compositions that received the same score. From each group, we examined discourse and linguistic features derived from the scoring rubric, such as content and support, organization, length, lexical range, number and type of errors, and overall communicative effect. 

The poster is divided into three main sections, addressing the following questions: (1) Are composition pairs which received the same score similar in terms of the linguistic and discourse features specified in the rubrics? (2) For compositions that have different scores, are these differences related to the topic? Are they related to the level of writing proficiency? (3) What are the implications of the findings for test developers? What are the implications for teachers and examinees who prepare for standardized tests at this level?


To Top

Computer-delivered Test of Finnish for Ingrian “return immigrants”: framework and results

by Kaija Kärkkäinen, Anu Lipsonen & Jani Lankinen


”Return immigration” of Ingrians living in the area of the former Soviet Union to Finland started in 1990. In ten years, about 25,000 people moved to Finland. The conditions for entry were revised in 2003. A person with a Finnish heritage was required to: 1) attend courses that prepared for the move, 2) show proof of having accommodation, 3) show proof of having passed a language test administered by Finnish authorities.

The framework for the examination for Ingrians defines A2 as the required level. The exam consists of tests of speaking, listening, reading and writing. A mixed standard-setting strategy is applied: responding to all subtests and obtaining a pass in speaking are absolute conditions, and one further pass is an additional condition for obtaining a certificate.
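The mixed standard-setting rule described above can be sketched as a simple decision function (a hypothetical illustration; the function and subtest names are not taken from the actual system):

```python
def certificate_awarded(responded_to_all, passed):
    """Sketch of the mixed standard-setting rule: the candidate must
    have responded to all subtests, passed speaking, and passed at
    least one of the remaining subtests."""
    if not responded_to_all:      # absolute condition 1
        return False
    if not passed.get("speaking", False):  # absolute condition 2
        return False
    # additional condition: at least one further pass
    other_passes = sum(passed.get(s, False)
                       for s in ("listening", "reading", "writing"))
    return other_passes >= 1
```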

In the first years, teams of testers travelled to various sites in Estonia and Russia to administer the tests. Development work was soon started to test the feasibility of delivering the exam on PCs to a very heterogeneous group of test-takers.

The presentation will describe the system: 1) the database, which makes it easy to compile exams from testlets, store responses, analyse results, issue certificates and produce various reports, 2) the computer implementation of the tests, 3) the results obtained so far with about 1,600 test-takers, broken down into several subgroups.

The presentation is likely to be of interest to those working in the area of language testing for citizenship and those applying, or considering applying, computerised delivery of language tests.


To Top


Measuring Fairness in Terms of Test-taker Characteristics

by Anna Mouti


Over the last decade, increased attention has been given to issues of fairness in testing and test use. The aspects of fairness included in various Codes of Ethics or Fairness concern treating people with impartiality regardless of such characteristics as gender, race, ethnicity, sexual orientation, language background, creed, political affiliation or religion. These variables should be taken into consideration when designing language test formats, so that no group bias is introduced into the test. A language test is not considered fair if it advantages or disadvantages groups of test-takers. Bachman (1990:166) notes that a major concern in the design and development of language tests is to minimize the effects of factors that are not part of the language ability being measured. One of these factors is test-taker characteristics; it should therefore be taken into account when it comes to fairness.

In the present study we narrow our interest to testing related to learning – that is, achievement testing – and the test-taker characteristic we investigate is learning style. Does this variable correlate with language test performance, favouring groups of test-takers who represent different dimensions of learning style? If so, flexibility options should be identified and test forms matched to individual students, so that tests are fair and accurate with respect to learning style profile.

Relevant research could promote assessment literacy by raising test-takers’ awareness of their test-taking behaviour and by providing testers with information for designing fair language tests that take account of test-takers’ individual differences.


To Top


Vocabulary Testing: Some Methodological Considerations

by Norman Verhelst


In modern applications of language testing, the use of item response theory (IRT) seems to be indispensable. It turns out, however, that useful testing can be accomplished with models that are far weaker than IRT models or even Classical Test Theory. Such an approach will be presented with vocabulary testing as a typical example. The following aspects will be discussed:

  • item banking
  • test administration
  • scoring rules
  • standard errors
  • standard setting
  • validity problems
  • international collaboration


To Top