VALIDATION OF FOURTH AND EIGHTH GRADE STUDENTS' RESPONSES TO READING, MATH,
AND SCIENCE NAEP BACKGROUND ITEMS:
USE OF COGNITIVE INTERVIEWING TECHNIQUES WITH TEACHERS AND STUDENTS
Mette Huberman and Roger Levine
American Institutes for Research
August 2, 1999
John C. Flanagan Research Center, American Institutes for Research (AIR)
Mailing Address: PO Box 1113, Palo Alto, CA 94302
Telephone Number: (650) 493–3550
Fax Number: (650) 858–0958
E-mail Addresses: mhuberman@ca.air.org and rlevine@ca.air.org
The authors wish to thank the Education Statistics Services Institute (ESSI) for its support of this research.
Abstract
To understand how fourth and eighth grade students and their teachers respond to survey items asking about instructional factors, cognitive interviewing techniques that included validation components were employed. The survey items used in the study were taken from the 1996 and 1998 National Assessment of Educational Progress (NAEP) student and teacher background questionnaires in fourth grade reading, math, and science, and eighth grade math and science. The types of problems identified through protocols that included these validation data are discussed, along with anecdotal evidence demonstrating the effectiveness of the validation data.
Introduction
The National Assessment of Educational Progress (NAEP) is involved in a major effort to measure the educational achievement of our nation’s children. As part of this effort, when students take achievement tests, they answer questions about a variety of background factors that are known or believed to be related to performance on these tests. Some of the questions are related to home background factors (e.g., parent education, TV watching, and homework). Other questions ask about instructional background factors, such as the use of computers in math or the prevalence of group activities in reading. Teachers also answer instructional background questions, including how they teach math or reading.
There has been concern about the quality of the student background measures. Data from fourth graders on specific factors (e.g., parent education) were seen as not being reliable due to high omission rates. For similar reasons, some data from eighth grade students were also of questionable reliability, in particular for minority students (Levine et al., 1998). Accordingly, a study was undertaken in 1996–97 to improve the quality of the fourth and eighth grade home background items through the use of cognitive survey methodologies. A cognitive interviewing protocol was developed, which included a validation component. Parents of the students were interviewed, providing validation information. This information enabled identification of item problems that would not otherwise have been detected.
In 1997–98, the study was extended to investigate instructional background items completed by students and their teachers. Again, a cognitive interviewing protocol was developed, which included a validation component. However, this time the students' teachers provided the validation information.
This paper summarizes the findings from the latter part of the study. Instructional background items from the 1996 and 1998 NAEP background questionnaires in fourth grade reading, math, and science, and eighth grade math and science were selected for investigation due to their problematic nature. A total of 66 students and 12 teachers participated in this component.
Background
Cognitive interviewing of questionnaire respondents (also known as verbal reporting [Willis et al., 1991]) is a form of interview used to uncover the mental processes involved when a respondent reads and responds to survey questions (Willis et al., 1999). Thus, cognitive interviews, as a form of survey pretesting, are effective in determining how respondents comprehend survey items and what strategies they use to devise answers. Such interviews are primarily conducted to identify sources of respondent confusion and misunderstanding (Krosnick, 1999; Fowler and Cannell, 1996; Schaeffer and Maynard, 1996). More specifically, they may lead to the verification of an expected question problem or the discovery of one that was unanticipated (Willis et al., 1999; DeMaio and Rothgeb, 1996). Cognitive interviewing can facilitate not only the finding of a problem but also the fixing of a problem (Willis et al., 1999).
Cognitive interviewing has already been successfully used for improving surveys in many areas, including residential and occupational history, dental health, cancer risk factor knowledge, social support and physical limitations faced by the elderly, use of medical assistive devices, radon exposure (Jobe and Mingay, 1991), and tobacco use (DeMaio and Rothgeb, 1996).
A major objective of the cognitive interview is to identify problems a respondent may have when answering survey questions. These problems can be categorized according to a model specifying the types of cognitive activities which are typically undertaken in the course of survey response (Tourangeau, 1984; Jobe and Mingay, 1989; Willis et al., 1991):
Understanding why an item is failing makes it easier to suggest improvements to the item.
In most cognitive interviews, two basic techniques are utilized: "think-alouds" and verbal probing (Willis et al., 1999). In the think-aloud technique, the interviewer asks the subject to report what he or she is thinking as he or she is answering a question. Often a simple directive such as "Can you tell me what you’re thinking?" is used. The interviewer records the process the respondent engages in as he or she arrives at an answer. In addition, specific probes related to each item (e.g., paraphrasing of the question or requests for definitions of words and phrases) are developed and administered after the participant has produced a response to the survey question.
The use of validating information can be of tremendous benefit to the cognitive interviewer. Although think-alouds and normal probing enable the identification of many item problems, there are occasions when an individual’s think-aloud and responses to probes will fail to reveal the existence of an incorrect response. But, if the interviewer knows the response is problematic (as a result of validating information), further probing can be employed until reason(s) for the otherwise undetectable item problem are determined. Protocols employing validation data have been successfully employed, enabling the modification of self-administered survey items.
These validation data also enabled assessments of the impacts of item modifications. Accordingly, it was possible to demonstrate that the use of validating information (provided by a parent or guardian) in cognitive investigations of survey items completed by children led to the development of revised items with lower error rates than their unmodified counterparts (Levine et al., 1998). Validation data also lend themselves to simple, tabular presentations of the effectiveness of an item.
Validation information can be quite difficult to obtain. Not infrequently, the item being validated is self-report data where "truth" is known only to the respondent. Effective procedures for validating self-report data employing focused retrieval, varied retrieval, and attempts to evoke multiple representations of the construct of interest through the reconstruction of a calendar/diary for a time period of interest have been used to validate self-reports of hours worked (Edwards, Levine, and Cohany, 1989). In a similar fashion, in order to improve the quality of eyewitness reports of a crime, cognitive science techniques have been employed in a procedure called (coincidentally) "cognitive interviewing." The principles underlying this type of cognitive interviewing are (Fisher and Quigley, 1992):
These procedures have also been shown to be effective in enhancing dietary recall (Fisher and Quigley, 1992).
Many of the items of interest in this study dealt with instructional practices and classroom behaviors. Teachers were used as a source of validation data for student responses to these items. Since many of the items were behavioral frequency items, a calendar exercise incorporating the above techniques was used to provide validation data. These behavioral frequency data were used to validate both the teacher’s initial responses to the survey items dealing with behavioral frequencies and the student’s responses to analogous items.
Methodology
Survey and Protocol Development
The items were chosen for inclusion in the study based on analyses of the level of discrepancies between the responses of teachers and students to items on the 1996 and 1998 NAEP student and teacher background questionnaires. In many cases, directly comparable items were asked of students (e.g., How often do you take mathematics tests?) and teachers (e.g., How often do the students in your class take mathematics tests?). Within a class, the students' and their teacher's responses were compared to determine rates of discrepancy. The items with the highest discrepancy rates were selected.
Selected items were used to create fourth grade student surveys in math, science, and reading, and eighth grade student surveys in math and science. Because fourth grade teachers usually teach all subjects, only one teacher survey covering all three subject areas was created. Two separate teacher surveys in math and science were developed at the eighth grade level, because of the departmental nature of most middle schools. Each survey requested information on the frequency of instructional practices such as taking tests, using calculators and computers, classroom presentations, field trips, and homework related to each subject area. In addition, the teacher survey asked questions about the teacher’s professional development activities in the past year.
Protocols related to each survey were developed. The protocols provided a variety of optional probes (e.g., word definition and paraphrasing) to be used when deemed appropriate. The protocols were reviewed both internally and with a project consultant: Dr. Robert Belli, a cognitive survey researcher at the University of Michigan’s Institute for Survey Research. In addition, interviewers were trained to create unique probes during the interviews, tailored to the individual respondents' interpretations of the items.
Participant Recruitment and Selection
In order to get a heterogeneous group of participants, three school districts were contacted in the San Francisco Bay Area. Informational materials and consent forms were prepared and sent to teachers in six different schools in these districts. As an incentive for participation, teachers were offered $100 and students $50 for their time. After teachers agreed to participate in the study, they were asked to distribute informational materials and consent forms to their students. Students and their parents were encouraged to either call the American Institutes for Research directly to set up an appointment for an interview or to give the signed consent form to their teacher. Potential participants were screened to allow selection of a diverse sample with respect to the household’s annual income level and the race/ethnicity of the student. Each teacher was able to help us recruit between 3 and 8 of his or her students.
Spanish versions of the informational materials were also prepared and distributed to students. Staff fluent in Spanish worked with Spanish-speaking parents. However, children needed to be able to speak and read English in order to participate in the study, since this is a requirement to participate in the NAEP assessment.
A total of 12 teachers (6 fourth grade teachers and 6 eighth grade teachers) and 66 of their students (35 fourth graders and 31 eighth graders), representing 5 elementary and middle schools, were interviewed for the study. One-third (33%) of the student sample was from low-income households (below an annual income of $30,000) and about two-fifths (41%) of the sample were minority students (mostly of Hispanic origin).
Interview Procedures
Teacher Interviews
The teacher interviews lasted about two hours and consisted of two phases. In the first phase, the survey items were administered. The teacher was asked to read the questions aloud and was reminded to think aloud to provide insights into the cognitive processes that he or she employed in responding to the items. In addition, specific probes and paraphrasing requests were used to further inform about the response process. Typical probes included: "What do you think they mean by [technical term]?" and "What do you think this question is asking?"
The second phase of the interview was designed to validate some of the teachers’ responses in Phase 1. For context reinstatement purposes, interviews were conducted in the teacher's classroom. The teacher was asked to reconstruct the past week’s activities in class through a calendar exercise. To this end, weekly matrices were designed to help the teacher recall the frequency of certain instructional practices occurring each day. The teacher recalled each day of the week by first thinking of important or atypical events that occurred during the past week (e.g., staff meetings, sick students, and special events). These were written onto the matrix to serve as cognitive anchors and facilitate recall of each day’s events. Then the teacher was asked about the day’s lesson in the subject area of interest (e.g., what the teacher taught that day, whether he or she used any special materials such as manipulatives, and whether the teacher utilized technology). To evoke multiple representations, the teacher was also asked to think about a specific student and what this student did during the lesson in question. All of these extensive retrieval activities were intended to facilitate recall. After the interviewer felt these efforts resulted in retrieval of a clear representation of the day's lesson, the teacher was then asked to estimate which of several instructional practices occurred on that day in the subject area of interest. This process was repeated for each subject and each day. Responses were documented on the matrix as well. Then, adjustments were made to account for the typicality of the week, and these frequencies were then compared to the frequencies given in the first phase of the interview. Discrepancies were probed, and the most accurate answer (as determined by the teacher) was noted.
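The calendar exercise can be pictured as a small tally over a weekly matrix. The sketch below is only an illustration: the practice names, the example week, and the cutoffs in `weekly_category` are assumptions and are not taken from the study protocol, which also adjusted for how typical the week was.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical weekly matrix: one entry per school day, listing the
# instructional practices the teacher recalled for that day's lesson.
week_matrix: Dict[str, Iterable[str]] = {
    "Monday": ["calculators", "group work"],
    "Tuesday": ["calculators"],
    "Wednesday": ["calculators", "test"],
    "Thursday": ["calculators"],
    "Friday": [],
}


def days_per_practice(matrix: Dict[str, Iterable[str]]) -> Counter:
    """Count the number of days on which each practice was recalled."""
    counts: Counter = Counter()
    for practices in matrix.values():
        counts.update(set(practices))  # a practice counts once per day
    return counts


def weekly_category(days: int) -> str:
    """Map a one-week day count to a frequency label (assumed cutoffs)."""
    if days >= 4:
        return "Almost every day"
    if days >= 1:
        return "Once or twice a week"
    return "Not observed this week"


counts = days_per_practice(week_matrix)
recalled = {practice: weekly_category(n) for practice, n in counts.items()}
```

A label derived this way could then be set beside the teacher's Phase 1 answer, with any difference flagged for further probing.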
Since the student surveys contained some items that were not included in the teacher survey, the teachers were asked additional questions that could be used to validate the students’ answers. For example, the teachers were asked about the use of calculators in math, homework, science field trips, and the use of assessments in math, reading, and science.
A summary of each interview focusing on item problems was prepared subsequent to each teacher interview.
Student Interviews
Before the student interviews were conducted, teacher responses and any other relevant information from the teacher interviews were recorded on the student protocols to facilitate the identification of discrepancies and the triggering of probes.
The three fourth grade surveys consisted of between 12 and 19 items each. Administering all three surveys would have been too much for these students to answer in one session. Therefore, each fourth grader was administered two of the three surveys: math and science, math and reading, or science and reading. Because of the departmentalized nature of middle schools, as mentioned earlier, eighth graders answered either a math or a science survey, with 23 and 24 items, respectively.
Similar to the first phase of the teacher interviews, the students were asked to read the questions aloud. This facilitated detection of potential language and comprehension problems. For example, when a student could not read or pronounce a word, it was an indication of a comprehension problem. In these cases, the interviewer would make sure to probe the student’s understanding of the particular word.
The students were continually encouraged to think aloud. As with the teacher interviews, probing and paraphrasing requests were utilized to inform about the student item response process. When a student’s response varied from that of their teacher, the interviewer tried to determine the reason for the discrepancy by administering additional probes about the item (e.g., asking for further elucidation about how frequency estimates were produced or verifying comprehension of the item). For instance, a fourth grade student indicated that his science class had not been on a science field trip, even though the interviewer knew from his teacher's response that the class had been on a field trip to the NASA Ames Research Center. With this information in mind, the interviewer asked if the student had been on ANY field trips this year and the student responded that he had been on a field trip to NASA! However, the student did not consider this a science field trip. In response to a probe about what would be considered a science field trip, the student indicated it would be like a visit to the lab in which he was being interviewed or to the Stanford Hospital labs (where his mom works). Without the teacher response to validate the student's answer, this item problem (the definition of "a science field trip") would not have been detected.
After each student interview, a summary was prepared focusing on item problems and the reasons for any discrepancies between the students’ and the teachers’ answers.
Analysis
In order to summarize results, students' responses were compared to the teachers’ adjusted responses. The discrepancy rate for each item was calculated as the number of mismatches between the students' and their teachers' responses, divided by the number of student-teacher pairs responding to the item. These discrepancy rates were calculated across all student-teacher item pairs.
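As a minimal sketch of this calculation, the function below treats the discrepancy rate as the share of student-teacher pairs whose answers differ. The function name and the example responses are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Iterable, Tuple


def discrepancy_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of (student, teacher) response pairs that do not match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one student-teacher pair")
    mismatches = sum(1 for student, teacher in pairs if student != teacher)
    return mismatches / len(pairs)


# Hypothetical responses to one yes/no item: (student answer, teacher answer).
example_pairs = [
    ("Yes", "No"),
    ("Yes", "No"),
    ("No", "No"),
    ("Yes", "Yes"),
]
example_rate = discrepancy_rate(example_pairs)  # 2 mismatches out of 4 pairs
```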
When discrepancies occurred, the student and teacher summaries were analyzed to identify the reasons for the discrepancies. For most items, the teacher's response was considered to be the correct response. However, situations would occasionally arise which indicated that the teacher had misinterpreted the question.
Results and Discussion
From the think-alouds and the validation data provided by teachers, it was possible to compare and validate the students’ answers against the teachers’ responses, identify inconsistencies, and determine the reasons that these inconsistencies occurred. Three general areas of item problems emerged: language and comprehension, use of behavioral frequency scales, and use of list formats. These issues and recommendations for item revisions are discussed below.
1. Language and Comprehension
The first stage of the item response process is the comprehension and interpretation of the item. If failure occurs at this stage (that is, if the respondent does not understand what the item is asking), there is a strong possibility of an inaccurate response. Several language and comprehension problems were found with both fourth and eighth grade students, as well as with teachers.
Many fourth graders had trouble understanding or interpreting the following words and phrases as intended by the item writers:
Replacements such as "not sure" instead of "undecided," "for example" instead of "e.g.," and "books with chapters" instead of "novels" seem to work better with students at this grade level. However, fourth grade students generally cannot understand technical words and long phrases, so these should be avoided in fourth grade questionnaires.
Eighth grade students also had trouble understanding terms such as: "integrated or sequential math," "applied mathematics (technical preparation)," "geometric solids," and "hands-on activities or investigations." Thus, as with fourth graders, these types of words and phrases should be minimized in eighth grade questionnaires.
If a technical term or construct is followed by examples, children often fail to generalize the construct and respond only to the specific examples mentioned in the question. An example of this is the following fourth grade science item (shown in bold).

This item had a 70% discrepancy rate. As shown below, 16 out of 23 students answered the question inaccurately.
| Teacher Responses | Student Responses: Yes | Student Responses: No |
|---|---|---|
| Yes | 1 | 0 |
| No | 16 | 6 |
Fourth grade students overreported hands-on activities or projects with chemicals. Many students did not understand the meaning of the word "chemicals" and focused on the examples provided. They included any situation that involved mixing or dissolving sugar or salt in water (e.g., baking a cake, making lemonade, and doing an experiment with popcorn).
Another item comprehension problem, which also contributed to the high discrepancy rate, was students' literal interpretation of "ever." Fourth graders tended to include activities they had done with chemicals in previous grades. Specifying the implicit time period of interest (e.g., "in fourth grade") can alleviate this item problem.
The following fourth grade mathematics item also revealed interpretation problems.

The item had an 89% discrepancy rate between students' and teachers' responses. Seventeen (17) out of 19 fourth graders provided an inaccurate answer, as shown below.
| Teacher Responses | Student: Never or hardly ever | Student: Once or twice a month | Student: Once or twice a week | Student: Almost every day |
|---|---|---|---|---|
| Never or hardly ever | 1 | 1 | | |
| Once or twice a month | | | | |
| Once or twice a week | 6 | 1 | 3 | |
| Almost every day | 3 | 1 | 3 | |
The student interviews indicated that discrepancies were not due to within-class, between-student variation in the students' behavior but to their interpretation of the item. The intent of the item was for students to report the frequency with which they share their math work with the entire class informally, which was how the teachers interpreted the item. However, some fourth graders interpreted "talk to the class" as either cheating (i.e., showing your work to other students) or as formal presentations at the board and therefore underreported the behavior. In fact, 13 of the 17 students who responded inaccurately to the question reported that this behavior occurred less frequently than their teachers did. This item might be improved by making the intent more explicit: "Talk to the whole class about your mathematics work from your seat."
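Whether a student over- or underreported relative to the teacher can be read off the ordering of the response options. The following sketch assumes the four-point scale quoted in this paper; the function name and the example pairs are hypothetical.

```python
from collections import Counter

# Response options in increasing order of frequency (four-point scale).
SCALE = [
    "Never or hardly ever",
    "Once or twice a month",
    "Once or twice a week",
    "Almost every day",
]


def report_direction(student: str, teacher: str) -> str:
    """Classify a student's answer relative to the teacher's answer."""
    s, t = SCALE.index(student), SCALE.index(teacher)
    if s == t:
        return "match"
    return "over" if s > t else "under"


# Hypothetical (student, teacher) pairs for a single item.
pairs = [
    ("Never or hardly ever", "Once or twice a week"),
    ("Once or twice a month", "Almost every day"),
    ("Once or twice a week", "Once or twice a week"),
    ("Almost every day", "Once or twice a week"),
]
direction_counts = Counter(report_direction(s, t) for s, t in pairs)
```

Tallying directions in this way separates items that students systematically underreport from those they overreport, which points to different kinds of item problems.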
Even students with good reading skills have difficulties with long, linguistically complex items. An example is an eighth grade math item, where students were asked how much they agreed (on a five-point scale) with the following statement: "Describing mathematical concepts and ideas is as important as doing mathematical operations such as addition and multiplication in solving problems." Half of the eighth graders (10 out of 20 students) checked the middle point, "undecided," and one student skipped the item altogether. Seven of these students explicitly indicated that they chose "undecided" because they did not understand the item. Therefore, items should be kept as short and simple as possible.
2. Use of Behavioral Frequency Scales
After several interviews with fourth graders, it became clear that scale problems were the source of many difficulties. For some items, students were unable to synthesize their retrieved representation into responses compatible with some of the categories. For example, fourth graders could not reliably discriminate between the two time frames "Once or twice a week" and "Once or twice a month." If an event was very frequent, fourth grade students generally could correctly label the frequency as "Almost every day." If an event was rare or unusual, these students generally would accurately categorize the frequency of the event as "Never or hardly ever." However, the two middle categories were continual stumbling blocks for fourth grade students. Thus, after a number of interviews showing identical problems, the four-point scale was modified to a three-point scale for further testing: "Almost every day," "Sometimes," and "Never." This revision seemed easier for fourth graders to use and produced more valid responses.
An example of a problem with a behavioral frequency scale is shown in the following fourth grade reading item.

This item had a 79% discrepancy rate. As shown below, 7 students reported that this behavior occurred more frequently, and 4 students that it occurred less frequently than the teachers indicated.
| Teacher Responses | Student: Never or hardly ever | Student: Once or twice a month | Student: Once or twice a week | Student: Almost every day |
|---|---|---|---|---|
| Never or hardly ever | 2 | 2 | 1 | 1 |
| Once or twice a month | 3 | 1 | | |
| Once or twice a week | 1 | 3 | | |
| Almost every day | | | | |
However, when the scale was changed to a three-point scale, the discrepancy rate decreased to 29%. Only two students reported that this behavior occurred less frequently than the teacher did, as shown in the table below.
| Teacher Responses | Student: Never | Student: Sometimes | Student: Almost every day |
|---|---|---|---|
| Never or hardly ever | | | |
| 1–2 times/month | 1 | | |
| 1–2 times/week | 1 | 3 | |
| Almost every day | 1 | 1 | |
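Comparing a student's three-point answer with a teacher's four-point answer requires some alignment between the two scales. The sketch below shows one plausible alignment, in which both middle teacher categories count as "Sometimes". This mapping is an assumption made for illustration; the paper does not state how the categories were matched, and the example pairs are hypothetical.

```python
# Assumed alignment of the teachers' four-point answers with the students'
# three-point scale; the study does not spell this mapping out.
TEACHER_TO_THREE_POINT = {
    "Never or hardly ever": "Never",
    "Once or twice a month": "Sometimes",
    "Once or twice a week": "Sometimes",
    "Almost every day": "Almost every day",
}


def agrees(student_three_point: str, teacher_four_point: str) -> bool:
    """True if the student's answer matches the collapsed teacher answer."""
    return TEACHER_TO_THREE_POINT[teacher_four_point] == student_three_point


# Hypothetical (student, teacher) pairs for a single item.
collapsed_pairs = [
    ("Sometimes", "Once or twice a month"),
    ("Sometimes", "Once or twice a week"),
    ("Never", "Once or twice a week"),
    ("Almost every day", "Almost every day"),
]
mismatch_count = sum(1 for s, t in collapsed_pairs if not agrees(s, t))
collapsed_rate = mismatch_count / len(collapsed_pairs)  # 1 of 4 pairs differ
```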
Discrepancies related to behavioral frequency items were found with eighth grade student items, as well. However, these problems were less of an issue for eighth graders than for fourth graders. Eighth graders generally possessed many more of the cognitive skills and strategies required for accurate behavioral frequency estimation.
3. Use of List Formats
Items that were presented in a list format (e.g., How often do you do each of the following?) produced problems because of lost context. That is, the respondents often forgot the stem and responded to the items as stand-alone items. This problem was not restricted to fourth graders. For instance, 5 out of 20 eighth graders lost context when they were asked the following math item: "When you do mathematics in school, how often do you do each of the following? Use a computer." This item was number 10 in a list (as shown below).

When the five students responded to item 10, they had lost the math context of the question. As a result, they overreported their computer usage by answering about their use of a computer anywhere and for any purpose. This finding was not due to probing effects (i.e., the interviewer asking probes after each subquestion and interfering with the participant's response process), since the students were instructed to answer all questions in a list before think-aloud and probing took place.
Maintaining the context through redundancy in the list can eliminate item problems like this. Accordingly, it was recommended to change item 10 to "Use a computer for mathematics in school."
Conclusion
A study with fourth and eighth grade students and their teachers was carried out to investigate the quality of NAEP instructional background items in math, science, and reading. To this end, cognitive interviewing techniques, including a validation component, were employed to facilitate the interpretation of the item response process utilized by students and teachers.
From the interviews with the students and the validation data provided by the teachers, it was possible to identify discrepancies between the students' and the teachers’ responses, and the reasons that these discrepancies occurred. Several item problems were detected through this process. For example, numerous problems were found in fourth and eighth graders’ ability to understand long and technical words and phrases such as "geometric shapes," "science demonstration," "integrated or sequential math," and "hands-on activities or investigations." Words and phrases like these should be avoided in fourth and eighth grade questionnaires to the greatest extent possible. It was also discovered that students, in particular fourth graders, have a very hard time accurately reporting on the frequency of behaviors they or their teacher engage in during class. Therefore, it is important to simplify these behavioral frequency scales as much as possible. Finally, items presented in a list format present problems. Respondents often lose the context or the stem of these types of questions, and answer them as stand-alone items. Maintaining the context in the list questions by repeating part of the stem can avoid this type of item problem.
This study has shown the value of carrying out systematic cognitive investigations of the questionnaire response process in fourth and eighth grade children. By using think-alouds with children and adding a teacher validation component, detection of problems associated with the NAEP instructional background items under investigation was greatly enhanced.
References
Anderson, R. and Pichert, J. (1978). Recall of previously unrecallable information following a shift in perspective. Journal of Verbal Learning and Verbal Behavior, 17, 1–12.
DeMaio, T. and Rothgeb, J. (1996). Cognitive Interviewing Techniques: In the Lab and in the Field, in N. Schwarz and S. Sudman (eds.), Answering Questions. San Francisco: Jossey-Bass Publishers, pp. 177–195.
Edwards, W., Levine, R., and Cohany, S. (1989). Procedures for validating reports of hours worked and for classifying discrepancies between questionnaire reports and validation totals. Proceedings of the American Statistical Association.
Fisher, R. and Chandler, C. (1984). Dissociations between temporally-cued and theme-cued recall. Bulletin of the Psychonomic Society, 22, 395–397.
Fisher, R. and Quigley, K. (1992). Applying Cognitive Theory in Public Health Investigation: Enhancing Food Recall with the Cognitive Interview, in J. Tanur (ed.), Questions About Questions: Inquiries into the Cognitive Bases of Surveys. New York: Russell Sage Foundation, pp. 154–169.
Fowler, Jr., F. and Cannell, C. (1996). Using Behavioral Coding to Identify Cognitive Problems with Survey Questions, in N. Schwarz and S. Sudman (eds.), Answering Questions. San Francisco: Jossey-Bass Publishers, pp. 15–36.
Jobe, J. and Mingay, D. (1991). Cognition and Survey Measurement: History and Overview. Applied Cognitive Psychology, 5, 175–192.
Kahneman, D. (1973). Attention and Effort. Englewood Cliffs, NJ: Prentice-Hall.
Krosnick, J. (1999). Survey Research. Annual Review of Psychology, 50, 537–567.
Levine, R., Huberman, M., Allen, J., and DuBois, P. (1998). The Measurement of Home Background Indicators: Cognitive Laboratory Investigations of the Responses of Fourth and Eighth Graders to Questionnaire Items and Parental Assessment of the Invasiveness of These Items. (Draft Final Report). Washington, DC: Education Statistics Services Institute.
Roediger, H. and Payne, D. (1982). Hypermnesia: The role of repeated testing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8, 66–72.
Schaeffer, N. and Maynard, D. (1996). From Paradigm to Prototype: Interactive Aspects of Cognitive Processing in Standardized Survey Interview, in N. Schwarz and S. Sudman (eds.), Answering Questions. San Francisco: Jossey-Bass Publishers, pp. 65–88.
Tourangeau, R. (1984). Cognitive Sciences and Survey Methods, in T. Jabine, M. Straf, J. Tanur, and R. Tourangeau (eds.), Cognitive Aspects of Survey Methodology: Building a Bridge Between Disciplines. Washington, DC: National Academy Press, pp. 73–100.
Tulving, E. and Thomson, D. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80, 352–373.
Willis, G., Royston, P., and Bercini, D. (1991). The Use of Verbal Report Methods in the Development and Testing of Survey Questionnaires. Applied Cognitive Psychology, 5, 251–267.
Willis, G., Stinson, L., and Welniak, E. (1999). Is the Bandwagon Headed to the Methodological Promised Land? Evaluating the Validity of Cognitive Interviewing Techniques, in M. Sirken, D. Herrmann, S. Schechter, N. Schwarz, J. Tanur, and R. Tourangeau (eds.), Cognition and Survey Research. New York: John Wiley and Sons, Inc., pp. 133–153.