Curriculum Testing

Now we will explore the most important types of decisions (tests) that must be made in most language programs: proficiency, placement, diagnostic, and achievement.

Making Decisions with Tests
The four different types of tests (proficiency, placement, diagnostic, and achievement) are probably emphasized because they fit neatly with four of the fundamental types of decisions that must be made in language programs.

Teachers sometimes find themselves in the position of having to determine how much of a given language their students have learned and retained.
General proficiency is used to describe what students should have attained by the time they finish the program. That exit level is a decision that must be made by the administrators, teachers, and contract negotiators involved.
For example, TOEFL is an overall English language proficiency test that is widely used to judge students for admissions decisions. The proficiency levels of students when they enter the program must also be measured.

Beyond contractual issues, entry and exit level proficiencies are crucial for understanding the overall boundaries of a program. What level of overall proficiency do the students have when they come to us? And what level will they have when they leave us? Answering these two fundamental questions will help planners in making many different types of curriculum decisions.

Checking at the beginning of the curriculum development process to see if program objectives are set at the appropriate level for the students is far more productive than waiting until after the program is firmly in place, at which point costly materials, equipment, and staff decisions have already been made.
However, such decisions must be made carefully because proficiency tests are not designed to measure specific types of language teaching and learning, and most definitely not the specific types of language teaching and learning that are taking place in a particular language center.

In short, proficiency decisions involve tests that are general in nature (and not specific to any particular program) because proficiency decisions require general estimates of students' proficiency levels. Such decisions may be necessary in determining exit and entrance standards for a curriculum, in adjusting the level of goals and objectives to the true abilities of the students, or in making comparisons across programs. Despite the fact that proficiency decisions are general in nature, they are nevertheless very important in most language programs.

Also relatively general in purpose, placement decisions are necessary because of the desirability of grouping students of similar ability levels together in the same classes within a program. Some teachers feel that they can do better teaching when they can focus in each class on the problems and learning points appropriate to students at a particular level.
Placement tests are designed to facilitate the grouping of students according to their general level of ability. The purpose of a placement test is to show which students in a program have more of, or less of, a particular ability, knowledge, or skill.
The placement of students into levels may be based on something entirely different from what is taught in the levels of the program.

In short, placement decisions should be based on instruments that are either designed with a specific program in mind or, at least, seriously examined for their appropriateness to a specific program. The tests upon which placement decisions are based should either be specifically designed for a given program (and/or track within a program) or, at least, carefully examined and selected to reflect the goals and ability levels in the program. Thus a placement test will tend to apply only to a specific program and will be narrower in purpose than a proficiency test.

Students' achievement is the amount that has been learned. To make any decisions related to student achievement and how to improve it, planners must have some idea of the amount of language that each person is learning in a given period of time (with very specific reference to a particular program).
To help with such decisions, tests can be designed that are directly linked to the program goals and objectives. These achievement tests will typically be administered at the end of a course or program to determine how effectively students have mastered the desired objectives.

The information gained in this type of testing can also be put to good use in reexamining the needs analysis, in selecting or creating materials and teaching strategies, and in evaluating program effectiveness. Thus the development of systematic achievement tests is crucial to the evolution of a systematic curriculum.

In short, achievement decisions are central to any language curriculum. We are in the business of fostering achievement in the form of language learning. In fact, this book promotes the idea that the purpose of curriculum is to maximize the possibilities for students to achieve a high degree of language learning. The tests used to monitor such achievement must be very specific to the goals and objectives of a given program and must be flexible in the sense that they can readily be made to change in response to what is learned from them about the other elements of the curriculum. In other words, well-considered achievement decisions are based on tests from which a great deal can be learned about the program. These tests should, in turn, be flexible and responsive in the sense that their results can be used to effect changes and to continually assess those changes against the program realities.

The last category of decisions is concerned with diagnosing problems that students may have during the learning process. This type of decision is clearly related to achievement decisions, but here the concern is with obtaining detailed information about individual students' areas of strength and weakness.
The purpose is to help students and their teachers to focus their efforts where they are most needed and where they will be most effective. In this context, "areas of strength and weakness" will refer to examining the degree to which the specific instructional objectives of the program are part of what students know about the language or can do with it. While achievement decisions are usually centered on the degree to which these objectives have been met at the end of a program or course, diagnostic decisions are normally made along the way as the students are learning the language. As a result, diagnostic tests are typically administered at the beginning or in the middle of a course.

In short, diagnostic decisions are focused on the strengths and weaknesses of each individual vis-à-vis the instructional objectives for purposes of correcting deficiencies "before it is too late." Hence, diagnostic decisions are aimed at fostering achievement by promoting strengths and eliminating weaknesses.

The definition for a criterion-referenced test (CRT) is:
A test which measures a student's performance according to a particular standard or criterion which has been agreed upon. The student must reach this level of performance to pass the test, and a student's score is therefore interpreted with reference to the criterion score, rather than to the scores of other students.

This is markedly different from the definition for a norm-referenced test (NRT) given in the same source:
A test which is designed to measure how the performance of a particular student or group of students compares with the performance of another student or group of students whose scores are given as the norm. A student's score is therefore interpreted with reference to the scores of other students or groups of students, rather than to an agreed criterion score.

The essential difference between these definitions is that the performance of each student on a CRT is compared to a particular standard called a criterion level (for example, if the acceptable percent of correct answers were set at 70 percent for passing, a student who answered 86 percent of the questions correctly would pass), whereas on an NRT a student's performance is compared to the performances of other students in whatever group has been designated as the norm (for example, regardless of the actual number of items correctly answered, if a student scored in the 84th percentile, he or she performed better than 84 out of 100 students in the group as a whole).
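The two interpretations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a scoring procedure from the source: the function names, the 50-item test length, and the norm-group scores are invented for the example, and only the 70 percent criterion and the 86 percent score come from the paragraph above. Percentile rank is taken here as the percent of norm-group scores strictly below the student's score, which is one of several conventions in use.

```python
def percent_correct(num_correct, num_items):
    """CRT view: percent of the test material answered correctly."""
    return 100.0 * num_correct / num_items


def passes_criterion(percent, criterion=70.0):
    """CRT decision: compare the score to a fixed criterion level."""
    return percent >= criterion


def percentile_rank(score, norm_scores):
    """NRT view: percent of norm-group scores strictly below this score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# CRT: 43 of 50 items correct is 86 percent, above the 70 percent criterion.
crt_percent = percent_correct(43, 50)
crt_pass = passes_criterion(crt_percent)

# NRT: the same raw score is read against a (hypothetical) norm group.
norm_group = [12, 18, 21, 25, 27, 30, 33, 36, 40, 47]
nrt_rank = percentile_rank(43, norm_group)

print(crt_percent, crt_pass, nrt_rank)
```

Note that the CRT functions never look at other students' scores, while the NRT function never looks at the number of items on the test.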

In administering a CRT, the principal interest is in how much of the material on the test is known by the students. Hence the focus is on the percent of material known, that is, the percent of the questions that the student answered correctly in relation to the material taught in the course and in relationship to a previously established criterion level for passing.

In administering an NRT, the concerns are entirely different. Here, the focus is on how each student's performance relates to the scores of all the other students, not on the actual number (or percent) of questions that the student answered correctly.

In short, CRTs are designed to examine the amount of material known by each individual student (usually in percent terms) while NRTs examine the relationship of a given student's performance to the scores of all other students (usually in percentile or other standardized score terms).

The two types of tests also differ in:
(1) The kinds of things that they are used to measure
(2) The purpose of the test
(3) The distributions of scores that will result
(4) The design of the test
(5) The students' knowledge of the test questions beforehand

Used to Measure
In general, NRTs are more suitable for measuring general abilities or proficiencies. Examples would include reading ability in Spanish or overall English language proficiency. CRTs, on the other hand, are better suited to giving precise information about individual performance on well-defined learning points.

Purpose of Testing
The purpose of an NRT must be to generate scores that spread the students out along a continuum of general abilities or proficiencies in such a way that differences among the individuals are reflected in the scores.

In contrast, the scores on CRTs are viewed in absolute terms, that is, a student's performance is interpreted in terms of the amount, or percent, of material known by that student. Since the purpose of a CRT is to assess the amount of knowledge or material known by each individual student, the focus is on individuals rather than on distributions of scores. Nevertheless, as I will explain next, the distributions of scores for the two families of tests can be quite different in interesting ways.

Distribution of Scores
In other words, for an NRT to be effective some students should score very low, others very high, and the rest everywhere in between. Indeed, the way items for an NRT are generated, analyzed, selected, and refined will typically lead to a test that produces scores that fall into a normal distribution, or "bell curve." For a CRT, then, it is perfectly logical and acceptable to have a very homogeneous distribution of scores whether the test is given at the beginning or end of a period of instruction.
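The contrast in spread can be made concrete with a small Python sketch. Both score lists are invented for illustration (they are not data from the source), and the standard deviation and range are used here simply as convenient measures of how spread out a set of scores is.

```python
from statistics import mean, pstdev

# Hypothetical percent scores for ten students on two different tests.
nrt_scores = [22, 35, 41, 48, 50, 53, 57, 62, 70, 84]      # spread along a continuum
crt_end_scores = [88, 90, 91, 92, 92, 93, 94, 95, 95, 96]  # CRT at end of instruction


def describe(scores):
    """Return (mean, standard deviation, range) for a list of scores."""
    return mean(scores), pstdev(scores), max(scores) - min(scores)


nrt_mean, nrt_sd, nrt_range = describe(nrt_scores)
crt_mean, crt_sd, crt_range = describe(crt_end_scores)

print(f"NRT: mean={nrt_mean:.1f} sd={nrt_sd:.1f} range={nrt_range}")
print(f"CRT: mean={crt_mean:.1f} sd={crt_sd:.1f} range={crt_range}")
```

A narrow spread would be a weakness in an NRT, but on a CRT given at the end of instruction it may simply mean that most students learned the material.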

Test Design
An NRT is likely to be relatively long and to be made up of a wide variety of different item types. An NRT usually consists of a few subtests on rather general language skills, for example, reading and listening comprehension, grammar, writing, and the like. These subtests will tend to be relatively long (30 to 50 items) and cover a wide variety of different test items.

In comparison, CRTs are much more likely to be made up of numerous, but shorter, subtests. Each of the subtests will usually represent a different instructional objective for the given course, with one subtest for each objective. For example, if a course has 12 instructional objectives, the CRT associated with that course might have 12 subtests.
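A one-subtest-per-objective design lends itself to per-objective scoring, which might be sketched as follows. The objective labels, the item results, and the function names are hypothetical, and the 70 percent criterion is borrowed from the earlier passing example only for illustration.

```python
# Hypothetical answer record for one student: objective label -> item results
# (True = correct). A real CRT would have one such subtest per objective.
responses = {
    "objective_1": [True, True, True, False, True],
    "objective_2": [True, False, False, False, True],
    "objective_3": [True, True, True, True, True],
}


def subtest_percents(responses):
    """Percent correct on each objective's subtest."""
    return {
        objective: 100.0 * sum(items) / len(items)
        for objective, items in responses.items()
    }


def needs_work(percents, criterion=70.0):
    """Objectives whose subtest score falls below the criterion level."""
    return [obj for obj, pct in percents.items() if pct < criterion]


percents = subtest_percents(responses)
weak = needs_work(percents)

print(percents)
print(weak)
```

Reporting a score per objective, rather than one total, is what lets the same test support both achievement and diagnostic decisions.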

Students' Knowledge of Test Questions
Because of the general nature of what NRTs are testing and the usual wide variety of items, students rarely know in any detail what types of items to expect. The students might know what item formats they will encounter, for example, multiple-choice grammar items, but seldom will they be able to predict actual language points.

However, on a CRT, students should probably know exactly what language points will be tested, as well as what item types to expect. If the instructional objectives for a course are clearly stated and if those objectives are the focus of instruction, then the students should know what to expect in the test.


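Test Qualities by Type of Decision (Test)

Detail of information
Proficiency: Very general
Placement: [...]
Achievement: [...]
Diagnostic: Very specific

[...]
Proficiency: General skills prerequisite to program entry
Placement: [...] points drawn from entire [...]
Achievement: [...] objectives of course or [...]
Diagnostic: [...] of course or [...]

Purpose of decision
Proficiency: [...] individual overall with other [...]
Placement: Find each [...]
Achievement: [...] amount of learning with regard to [...]
Diagnostic: [...] students and teachers of [...] that still need work

Type of comparison
Proficiency: Comparison with other institutions
Placement: Comparisons within programs
Achievement: Comparison to course or program objectives
Diagnostic: Comparison to course or program objectives

When administered
Proficiency: Before entry or at the end of program
Placement: Beginning of program
Achievement: End of courses
Diagnostic: Beginning or middle of courses

[...]
Proficiency: Spread of scores
Placement: Spread of scores
Achievement: Degree to which objectives have been learned
Diagnostic: Degree to which objectives have been learned

Type of test
[...]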

Many language tests are, or should be, situation specific. This is to say, a test can be very effective in one situation with one particular group of students and be virtually useless in another situation or with another group of students.

Other practical considerations include the initial and ongoing costs of the test and the quality of all of the materials provided. Is the test easy to administer? What about scoring? Is that reasonably easy given the type of test questions involved? Is the interpretation of scores clearly explained with guidelines for presenting the scores to the teachers and students?

Clearly, then, a number of factors must be considered even when adopting an already published test for a program. Ideally, the program would have a resident expert, someone who can help everyone else to make the right decisions. If no such expert is available, it may be advisable to read up on the topic yourself.


A. General background information
1. Title
2. Author
3. Publisher and date of publication
4. Published reviews available

B. Theoretical orientation
1. Test family (norm-referenced or criterion-referenced)
2. Purpose of decision (proficiency, placement, achievement, or diagnosis) 
3. Language methodology orientation (approach and syllabus)

C. Practical orientation
1. Target population (age, level, nationality, language/dialect, educational background, and so forth)
2. Skills tested (for instance, reading, writing, listening, speaking, structure, vocabulary, pronunciation)
3. Number of subtests and separate scores
4. Type of items reflect appropriate techniques and exercises (receptive: true-false, multiple-choice, matching; productive: fill-in, short-response, essay, extended discourse task)

D. Test characteristics
1. Norms
a. Standardization sample
b. Type of standardized scores
2. Descriptive statistics (central tendency, dispersion, and item characteristics)
3. Reliability
a. Types of reliability procedures used
b. Degree of reliability for each procedure

4. Validity
a. Types of validity procedures used
b. Do you buy the above validity argument(s)?

5. Practicality
a. Cost of test booklets, cassette tapes, manual, answer sheets, scoring templates, scoring services, any other necessary test components
b. Quality of items listed immediately above (paper, printing, audio clarity, durability, and so forth)
c. Ease of administration (time required, proctor/examinee ratio, proctor qualifications, equipment necessary, availability and quality of directions for administration, and so forth)
d. Ease of scoring (method of scoring, amount of training necessary, time per test, score conversion information, and so forth)
e. Ease of interpretation (quality of guidelines for the interpretation of scores in terms of norms or other criteria)

Proficiency, placement, achievement, and diagnostic tests can be developed and fitted to the specific goals of the program and to the specific population studying in it.
That might mean first developing achievement and diagnostic tests (which are based entirely on the needs of the students and the objectives of the specific program), while temporarily adopting previously published proficiency and placement tests.
Later, a program-specific placement test could be developed so that the reasons for separating students into levels in the program are related to the things that the students can learn while in those levels. It is rarely necessary or even useful to develop program-specific proficiency tests because of their interprogrammatic nature.
Naturally, all of these decisions are up to the teachers, administrators, and curriculum developers in the program in question.

Adapting a test to a specific situation will probably involve some variant of the following strategy:
  1. Administer the test to the students in the program.
  2. Select those items that appear to be doing a good job of spreading out the students for an NRT, or a good job of measuring the learning of the objectives with that population for a CRT.
  3. Create a shorter, more efficient, revised version of the test that fits the ability levels of the specific population of students.
  4. Create new items that function like those that were working well in order to have a test of sufficient length.
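Step 2 of the strategy above can be sketched in Python. The source does not say which statistics to use, so the choices here are assumptions: item facility (proportion correct), item discrimination (facility in the top-scoring third minus facility in the bottom-scoring third) for an NRT, and a pre/post difference index for a CRT. The pilot data, the function names, and the 0.4 cut-off are all hypothetical.

```python
def item_facility(item_column):
    """Proportion of students who answered the item correctly (1 = correct)."""
    return sum(item_column) / len(item_column)


def item_discrimination(matrix, item, fraction=1 / 3):
    """Facility in the top-scoring group minus facility in the bottom group."""
    ranked = sorted(matrix, key=sum, reverse=True)  # rank students by total score
    n = max(1, round(len(ranked) * fraction))
    upper = [row[item] for row in ranked[:n]]
    lower = [row[item] for row in ranked[-n:]]
    return item_facility(upper) - item_facility(lower)


def difference_index(pre_column, post_column):
    """CRT analogue: gain in item facility from before to after instruction."""
    return item_facility(post_column) - item_facility(pre_column)


def select_nrt_items(matrix, min_id=0.4):
    """Keep the items that separate high scorers from low scorers."""
    n_items = len(matrix[0])
    return [i for i in range(n_items) if item_discrimination(matrix, i) >= min_id]


# Hypothetical pilot data: rows are students, columns are items (1 = correct).
pilot = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

kept = select_nrt_items(pilot)
print(kept)
```

In this made-up data set the first item is answered correctly by everyone, so it does nothing to spread the students out and is dropped. On a CRT the same item might be kept if its difference index showed that students could not answer it before instruction.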

A checklist for successful testing:

A. Purposes of test
1. Clearly defined (theoretical and practical orientations)
2. Understood and agreed upon by staff

B. Test itself

C. Physical needs arranged
1. Adequate and quiet space
2. Enough time in that space for some flexibility
3. Clear scheduling

D. Pre-administration arrangements
1. Students properly notified
2. Students signed up for test
3. Students given precise information (where and when test will be, as well as what they should do to prepare and what they should bring with them, especially identification if required)

E. Administration
  1. Adequate materials in hand (test booklets, answer sheets, cassette tapes, pencils, scoring templates, and so forth) plus extras
  2. All necessary equipment in hand and tested (cassette players, microphones, public address system, videotape players, blackboard, chalk, and so forth) with backups where appropriate
  3. Proctors trained in their duties
  4. All necessary information distributed to proctors (test directions, answers to obvious questions, schedule of who is to be where and when, and so forth)

F. Scoring
  1. Adequate space for all scoring to take place
  2. Clear scheduling of scoring and notification of results
  3. Sufficient qualified staff for all scoring activities
  4. Staff trained in all scoring procedures

G. Interpretation
1. Clearly defined uses for results
2. Provision for helping teachers interpret scores and explain them to students
3. A well-defined place for the results in the overall curriculum

H. Record keeping
1. All necessary resources for keeping track of scores
2. Ready access to the records for administrators and staff
3. Provision for eventual systematic termination of records

I. Ongoing research
1. Results used to full advantage for research
2. Results incorporated into overall program evaluation plan

