Federal Committee on Statistical Methodology
Office of Management and Budget
Statistical Policy Working Paper 20 - Seminar on Quality of Federal Data - Part 1 of 3
Statistical Policy Working Paper 20
Seminar on Quality of Federal Data
Part 1 of 3

Federal Committee on Statistical Methodology
Statistical Policy Office
Office of Information and Regulatory Affairs
Office of Management and Budget
March 1991

MEMBERS OF THE FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY
(February 1991)

Maria E. Gonzalez, Chair, Office of Management and Budget
Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Charles E. Caudill, National Agricultural Statistics Service
Cynthia Z.F. Clark, National Agricultural Statistics Service
Zahava D. Doering, Smithsonian Institution
Robert M. Groves, Bureau of the Census
Roger A. Herriot, National Center for Education Statistics
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, Bureau of the Census
Daniel Melnick, National Science Foundation
Robert P. Parker, Bureau of Economic Analysis
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Wesley L. Schaible, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census

PREFACE

In 1975, the Office of Management and Budget (OMB) organized the Federal Committee on Statistical Methodology. Composed of individuals selected by OMB for their expertise and interest in statistical methods, the committee has during the past 15 years determined areas that merit investigation and discussion, and overseen the work of subcommittees organized to study particular issues. Since 1978, 19 Statistical Policy Working Papers have been published under the auspices of the Committee.

On May 23-24, 1990, the Council of Professional Associations on Federal Statistics (COPAFS) hosted a "Seminar on the Quality of Federal Data." Developed to capitalize on work undertaken during the past dozen years by the Federal Committee on Statistical Methodology and its subcommittees, the seminar focused on a variety of topics that have been explored thus far in the Statistical Policy Working Paper series. The subjects covered at the seminar included:

  Survey Quality Profiles
  Paradigm Shifts Using Administrative Records
  Survey Coverage Evaluation
  Telephone Data Collection
  Data Editing
  Computer Assisted Statistical Surveys
  Quality in Business Surveys
  Cognitive Laboratories
  Employer Reporting Unit Match Study
  Approaches to Developing Questionnaires
  Statistical Disclosure-Avoidance
  Federal Longitudinal Surveys

Each of these topics was presented in a two-hour session that featured formal papers and discussion, followed by informal dialogue among all speakers and attendees.

Statistical Policy Working Paper 20, published in three parts, presents the proceedings of the "Seminar on the Quality of Federal Data." In addition to providing the papers and formal discussions from each of the twelve sessions, this working paper includes Robert M. Groves' keynote address, "Towards Quality in a Working Paper Series on Quality," and comments by Stephen E. Fienberg, Margaret E. Martin, and Hermann Habermann at the closing session, "Towards an Agenda for the Future."

We are indebted to all of our colleagues who assisted in organizing the seminar, and to the many individuals who not only presented papers and discussions but also prepared these materials for publication. Special thanks are due to Terry Ireland and his staff for their work in assembling this working paper.

Table of Contents

Wednesday, May 23, 1990

Part 1

KEYNOTE ADDRESS

TOWARDS QUALITY IN A WORKING PAPER SERIES ON QUALITY
  Robert M. Groves, The University of Michigan and U.S. Bureau of the Census

Session 1 - SURVEY QUALITY PROFILES

THE SIPP QUALITY PROFILE
  Thomas B. Jabine, Statistical Consultant
INITIAL REPORT ON THE QUALITY OF AGRICULTURAL SURVEY PROGRAM
  George A. Hanuschak, National Agricultural Statistics Service
DISCUSSION
  Barbara A. Bailar, American Statistical Association
DISCUSSION
  Nancy A. Mathiowetz, U.S. Bureau of the Census

Session 2 - PARADIGM SHIFTS USING ADMINISTRATIVE RECORDS

PARADIGM SHIFTS: ADMINISTRATIVE RECORDS AND CENSUS-TAKING
  Fritz Scheuren, Internal Revenue Service
AN ADMINISTRATIVE RECORD PARADIGM: A CANADIAN EXPERIENCE
  John Leyes, Statistics Canada
DISCUSSION
  Gerald Gates, U.S. Bureau of the Census
DISCUSSION
  Edward J. Spar, Market Statistics

Session 3 - SURVEY COVERAGE EVALUATION

CONTROL, MEASUREMENT, AND IMPROVEMENT OF SURVEY COVERAGE
  Gary M. Shapiro, U.S. Bureau of the Census; Raymond R. Bosecker, National Agricultural Statistics Service
QUALITY OF SURVEY FRAMES
  Judith T. Lessler, Research Triangle Institute
DISCUSSION
  Fritz Scheuren, Internal Revenue Service
DISCUSSION
  Joseph Waksberg, Westat, Inc.

Session 4 - TELEPHONE DATA COLLECTION

QUALITY IMPROVEMENT IN TELEPHONE SURVEYS
  Leyla Mohadjer, David Morganstein, Westat, Inc.
COMPUTER ASSISTED SURVEY TECHNOLOGIES IN GOVERNMENT: AN OVERVIEW
  Marc Tosiano, National Agricultural Statistics Service
DISCUSSION
  William L. Nicholls II, U.S. Bureau of the Census
DISCUSSION
  James T. Massey, National Center for Health Statistics

Part 2

Session 5 - DATA EDITING

OVERVIEW OF DATA EDITING IN FEDERAL STATISTICAL AGENCIES
  David A. Pierce, Federal Reserve Board
EDITING SOFTWARE (An excerpt from Chapter IV of Working Paper 18)
  Mark Pierzchala, National Agricultural Statistics Service
RESEARCH ON EDITING
  Yahia Ahmed, Internal Revenue Service
DISCUSSION
  Charles E. Caudill, National Agricultural Statistics Service
DISCUSSION
  Richard Bolstein, George Mason University

Session 6 - COMPUTER ASSISTED STATISTICAL SURVEYS

OVERVIEW OF COMPUTER ASSISTED SURVEY INFORMATION COLLECTION
  Richard L. Clayton, U.S. Bureau of Labor Statistics
A COMPARISON BETWEEN CATI AND CAPI
  Martin Baum, National Center for Health Statistics
COMPUTER ASSISTED SELF INTERVIEWING
  Ralph Gillmann, Energy Information Administration
COMPUTER ASSISTED SELF INTERVIEWING: RIGS AND PEDRO, TWO EXAMPLES
  Ann M. Ducca, Energy Information Administration
DATA COLLECTION
  Cathy Mazur, National Agricultural Statistics Service
DISCUSSION
  Robert N. Tinari, U.S. Bureau of the Census
DISCUSSION
  David Morganstein, Westat, Inc.

Thursday, May 24, 1990

Session 7 - QUALITY IN BUSINESS SURVEYS

IMPROVING ESTABLISHMENT SURVEYS AT THE BUREAU OF LABOR STATISTICS
  Brian MacDonald, Alan R. Tupek, U.S. Bureau of Labor Statistics
A REVIEW OF NONSAMPLING ERRORS IN FEDERAL ESTABLISHMENT SURVEYS WITH SOME AGRIBUSINESS EXAMPLES
  Ron Fecso, National Agricultural Statistics Service
DISCUSSION
DISCUSSION
  Charles D. Cowan, Opinion Research Corporation

Session 8 - COGNITIVE LABORATORIES

THE BUREAU OF LABOR STATISTICS COLLECTION PROCEDURES RESEARCH LABORATORY: ACCOMPLISHMENTS AND FUTURE DIRECTIONS
  Cathryn S. Dippo, Douglas Herrmann, U.S. Bureau of Labor Statistics
THE ROLE OF A COGNITIVE LABORATORY IN A STATISTICAL AGENCY
  Monroe G. Sirken, National Center for Health Statistics
DISCUSSION
  Elizabeth Martin, U.S. Bureau of the Census
DISCUSSION
  Murray Aborn, National Science Foundation (retired)

Part 3

Session 9 - EMPLOYER REPORTING UNIT MATCH STUDY

INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS: THE ERUMS EXPERIENCE
  Thomas B. Petska, Internal Revenue Service; Lois Alexander, Social Security Administration
SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS
  John Pinkos, Kenneth LeVasseur, Marlene Einstein, U.S. Bureau of Labor Statistics; Joel Packman, Social Security Administration
RESULTS, FINDINGS, AND RECOMMENDATIONS OF THE ERUMS PROJECT
  Vern Renshaw, Bureau of Economic Analysis; Tom Jabine, Statistical Consultant
DISCUSSION
  W. Joel Richardson, Charles A. Waite, U.S. Bureau of the Census
DISCUSSION
  Thomas J. Plewes, U.S. Bureau of Labor Statistics

Session 10 - APPROACHES TO DEVELOPING QUESTIONNAIRES

TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING QUESTIONNAIRES
  Theresa J. DeMaio, U.S. Bureau of the Census
TECHNIQUES FOR EVALUATING THE QUESTIONNAIRE DRAFT
  Deborah H. Bercini, National Center for Health Statistics
DESIGNING QUESTIONNAIRES FOR CATI IN A MIXED MODE ENVIRONMENT
  Gemma Furno, U.S. Bureau of the Census
DISCUSSION
  Carol C. House, National Agricultural Statistics Service

Session 11 - STATISTICAL DISCLOSURE-AVOIDANCE

DISCLOSURE AVOIDANCE PRACTICES AT THE CENSUS BUREAU
  Brian Greenberg, U.S. Bureau of the Census
THE MICRODATA RELEASE PROGRAM OF THE NATIONAL CENTER FOR HEALTH STATISTICS
  Robert H. Mugge, National Center for Health Statistics (retired)
DISCUSSION
  George T. Duncan, Carnegie Mellon University

Session 12 - FEDERAL LONGITUDINAL SURVEYS

FEDERAL LONGITUDINAL SURVEYS
  Daniel Kasprzyk, U.S. Bureau of the Census; Curtis Jacobs, U.S. Bureau of Labor Statistics
THE ADVANTAGES AND DISADVANTAGES OF LONGITUDINAL SURVEYS
  Robert W. Pearson, Social Science Research Council
LONGITUDINAL ANALYSIS OF FEDERAL SURVEY DATA
  Patricia Ruggles, Joint Economic Committee
DISCUSSION
  Michael Brick, Westat, Inc.
DISCUSSION
  Marilyn E. Manser, U.S. Bureau of Labor Statistics

TOWARDS AN AGENDA FOR THE FUTURE

  Stephen E. Fienberg, Carnegie Mellon University
  Margaret E. Martin
  Hermann Habermann, Office of Management and Budget

Part 1

Keynote Address

TOWARDS QUALITY IN A WORKING PAPER SERIES ON QUALITY

Robert M. Groves
The University of Michigan and U.S. Bureau of the Census

1. Introduction
Although this meeting has the title of the "Seminar on the Quality of Federal Data," its structure follows quite closely the topics covered in the multi-paper series of Statistical Policy Working Papers sponsored by the Office of Statistical Policy and Standards. There are, as of this date, 19 Statistical Policy Working Papers, written since the first in 1978. That is about 1.6 per year over the 12 years of the series (see Figure 1). They range over a wide terrain, from issues of the topical focus of surveys to a set of methodological and statistical issues affecting survey quality.

I am unaware of the processes that led to my being asked to give the keynote address at this meeting. I must admit that I speak to you today as someone who has a very biased opinion about the OMB Statistical Policy Working Papers - I love almost all of them; I like the idea that they exist, and only recently, because of my change of job sectors, have I appreciated their worth from another perspective. I have used them in graduate courses for students in survey methods (they are fine introductions to important design topics). I have used them in my research work (they are unique sources of documentation about what goes on in the Federal Statistical System). I recommend them to others calling for consulting assistance.

Although I speak as a friend, 45 minutes of praise from me wouldn't act to improve this series and runs the risk of "head inflation" for those who developed the papers. Instead, I want to be a constructive critic and will divide my remarks into several categories:

  a. alternative goals of the OMB series
  b. the need for a structure to their topics

I note that what follows are my personal views as a close observer from afar of the system and a rookie member of the system.

[Graphic not reproduced]

2. Alternative Perspectives on Goals of the Working Paper Series

2.1. OMB Series as Review of the State of Practice

Some of the papers in the series address a topic that spans many surveys of different populations (see Figure 2). The papers on coverage error and telephone data collection are examples of this. These kinds of papers are compact summaries of the state of the art on a current issue facing all surveys. They often describe activities in both household surveys and economic surveys. Many times they end with case studies of different surveys across the Federal system and how they handle the particular issue at hand.

Figure 2
Alternative Perspectives on Goals of the Working Paper Series

  1. OMB series as a review of the state of practice
  2. OMB series as agency cross-fertilization
  3. OMB series as a prod to new developments

These kinds of papers are valuable to the extent that they have great depth and wide breadth. By that I mean that they cover all the sources of data quality and cover them in sufficient depth that real learning is likely on the part of most readers.

Let me first speak of breadth of topics. I find it simplest to array the topics of the papers along the components of total survey error (see Figure 3). It is unfair for me to present this chart without some clarifying remarks about the missing cells. First, missingness does not imply absence of any treatment of the topics. Indeed, on sampling error, for example, many of the reports comment on the impact of design options on sampling variance. Second, this structure is only one which could be applied to classify the reports. Considering the label of this seminar, "Quality of Federal Data," however, I find it attractive to use it here.

Despite the weakness of any one classification scheme, let me point out what I believe are weaknesses with the current status of the series. There is a distinct bias toward the household survey domain to the detriment of the economic domain.
There is one paper with the overarching title of "Quality in Establishment Surveys", but the fact that it alone exists underscores the problem. This is a reflection of the smaller literature in the methodology and evaluation of quality of economic surveys, but it is a status that I hope will change in the future. Why? We have in the past too quickly assumed the following premises about economic survey measurement:

  a. establishment surveys are too diverse to yield themselves to common methodologies or standards.
  b. establishment surveys do not face questionnaire design issues like those of household surveys because the information gathered is factual in nature.
  c. establishment surveys have nonresponse properties that do not resemble those of household surveys.

Each of these can be refuted with some observation of the various establishment surveys now ongoing. It is true that establishment populations have large variation in size; that their organizational structures are diverse; that their recordkeeping practices are not standardized; that the ideal respondent for different issues may vary across establishments. All of this is true, but should not lead to the extreme that there are no common problems either across different establishment surveys or between household and economic surveys. As the Boskin report has observed, economic survey data need improvement and the working paper series could be one vehicle for focusing attention on specific needs in this area.

The next most important omission, in my opinion, concerns the issue of nonresponse. I must admit here that the work of the National Academy of Sciences Panel on Missing and Incomplete Data offers a comprehensive review of current theory and practice. Even so, the issue is vital to the unique inferential power of probability samples and therefore cannot receive too much attention.
Even the most basic issues remain unresolved: relationships between response rates and nonresponse error; relationships between likelihood of coverage and likelihood of participation; cost/error evaluations of alternative methods of improving response rates. Mean square errors of survey estimators stem from thousands of individual decisions to cooperate with the survey request. It behooves us to devote more energy to this, and the working paper series should do so.

Third, the interviewer has largely been ignored. It has been ignored despite the fact that many Federal surveys use interviewers to assist in the data collection, despite the fact that evaluative procedures desperately need review and reconceptualization, despite the fact that it is an area where both statistics and social science perspectives work. The attention to the interviewer is even more important given the likely future in which the traditional labor force of underemployed/overskilled part-time homemakers will decline and computer technologies are likely to transform the job.

Fourth, although large portions of data collection in the Federal Statistical System are by mail and self-administered questionnaire, there is no focused treatment of the methodology in the series.

Fifth, a few comments specifically on error profiles. When I first read the CPS error profile 12 years ago, I had two reactions. I was attracted to the literary form -- a compilation of quality measures for the survey, combined with documentation of design features. I then felt and still believe that the structure of an error profile is a valuable way to document leading components of error in survey statistics (we should be grateful to Brooks and Bailar as the mothers (or midwives) of the invention). My second reaction came after digesting the full report. How little we as a community seemed to know about the error properties of the CPS, the largest ongoing and one of the most important ongoing Federal household surveys.
Of the 80 pages of the report, for example, only about 25 are devoted to the data collection operations, a source of most of the errors in the process! That combination of reactions led me to the belief that I still have -- the error profile, in the hands of intelligent program directors, can act as an agenda setting document for quality improvement programs.

Finally, there are no serious treatments of costs of data collection - a topic I'll revisit in a few minutes.

Let me now turn to issues of depth. At their worst the reports are catalogues -- they make great reading for someone interested in buying an idea from those presented, but they don't make thrilling reading for the uninitiated. At the same time, they often assume knowledge of various data series that is not possessed by many outside experienced statistical system staff. As a corollary, some fail to cite relevant research literature outside that produced within the statistical system.

Part of these features may be a matter of choice of audience. I have assumed that the desired audience consists of both Federal Statistical System staff and researchers in related fields from academia and commercial domains. The government, academic, and commercial research sectors have much to gain from learning about each other's methods. The paper series could be enhanced by seeking input from the two other sectors. At the very least, this might entail a forced literature review within each paper; at a higher intensity this might involve the subcommittee membership of those outside the Federal system. Even the input from outsiders may not be sufficient.
Figure 3
Topics of Statistical Policy Working Papers

  Multiple Error Sources
      3 - CPS Error Profile
      4 - Nonsampling Error Terms
      13 - Federal Longitudinal Surveys
      15 - Quality in Establishment Surveys
  Coverage Error
      17 - Coverage Error
  Nonresponse Error
      (none)
  Sampling Error
      (none)
  Measurement Error: Interviewer
      (none)
  Measurement Error: Questionnaire
      10 - Developing Questionnaires
  Measurement Error: Respondents
      (none)
  Measurement Error: Mode of Data Collection
      6 - Uses of Administrative Records
      12 - Telephone Data Collection
      19 - Computer Assisted Surveys
  Processing
      2 - Statistical Disclosure
      5 - Statistical Matching
      11 - Industry Coding Systems
      18 - Data Editing
  Estimation
      7 - Time Series Revision

Topics Not Classifiable Easily in Error/Quality Terms

  Topical focus
      1 - Statistics for Allocation of Funds
      16 - Reporting in Employer Data Systems
  Administration
      8 - Statistical Interagency Agreements
      9 - Contracting for Surveys
  Other
      14 - Uses of Microcomputers

Missing Topics of Statistical Policy Working Papers

  Coverage Error
      Problems using households as sampling frame elements
  Nonresponse Error
      Combining social science and statistical models of participation
  Sampling Error
      Statistical software for estimation; generalized variance models; alternative estimators for public use files
  Measurement Error: Interviewer
      Training; variance models; reinterview programs; monitoring of telephone interviewers
  Measurement Error: Questionnaire
      Developmental methods in cognitive laboratories; pretesting regimens; imbedding experiments in surveys
  Measurement Error: Mode of Data Collection
      Mail and self-administered surveys; mixed mode surveys
  Processing
      Statistical quality control; automated coding
  Estimation
      Model-based estimation

2.2. OMB Series as Cross-Fertilization Among Federal Statistical Agencies

In my fifteen years of working with Federal statistical agencies from my academic base, I was consistently reminded of the relative isolation of individual agencies from each other.
As most people in this room know, it is not uncommon for very similar lines of research and development to be pursued without much coordination across agencies. The arguments for this are that different problems faced by the agencies demand different solutions. The arguments against are that functionally equivalent solutions are often created by two different agencies at twice the cost.

The working paper series has had, I believe, a beneficial, unanticipated effect in reducing interagency duplication. First, the subcommittees consist of members from several different agencies. Second, the tasks of the subcommittees often involve collecting information from many statistical agencies. The members thereby learn of work going on in agencies they normally don't visit. Third, recommendations of the papers often seek to apply standards across agencies, and the committees are forced to face the difficulty of system-wide standards.

This is laudable and necessary. Is it sufficient? Clearly not. That is, working subcommittees of the Federal Committee on Statistical Methodology are temporary, normally have an agenda limited to the report, and do not generally follow up on logical conclusions of the report. Our dispersed statistical system, with all the benefits that specialization offers, misses opportunities to implement recommendations of these working papers.

2.3. OMB Series as a Prod to New Developments

Several of the papers treat topics where only one or two agencies are making major contributions and most others fall behind. For example, the Time Series Revision paper, the industry coding paper, and the paper on computer assisted surveys all fall into this category. If I can temporarily put on the hat of an OMB staff member, this perspective seems to be the most central to the goals of the group. If reports like these can serve to improve the quality of work ongoing in several agencies, investments by one agency might quickly reap benefits in many agencies.
Some of the reports are poised for such effects, but the statistical system seems to miss more opportunities than necessary. Interagency agreements can be forged to promote such technology transfer. That is, consultation or subcontracting can be obtained within existing regulations. However, this requires the target agency to acknowledge the need for such upgrading. Could OMB facilitate this process? I am too naive to know, but the existence of a pool of funds at the OMB staff level to assure the spread of innovation across agencies through detail of staff and other mechanisms would be productive.

Are there areas of innovation that can profit from coordination? Certainly. The use of CATI/CAPI is one that comes to mind quickly. It is now an area in which separate expenditures are being made by several agencies, where no standards have been well-defined, where different solutions, with essentially the same cost/benefit structure, may evolve across different agencies.

The prod to new developments, however, demands that the papers end with a series of recommendations. The authors should stimulate the readers, dare I say, challenge the readers, toward improving current practice. After the detailed investigation needed for these reports, they are uniquely qualified to offer such recommendations. Only a minority of the reports end with such recommendations. This should be part of the charge to each committee.

3. The Need for a Structure of the Working Paper Series

As I age, I must admit that I find more appeal in structures that guide our research and development in survey design and implementation, as opposed to reacting to each new idea without an explicit framework. In the academic world major theories provide that structure; they help to identify what are the important questions; they guide the development of new ideas. The application of the word "theory" to social and economic data production is rare.
We do work that is guided by statistical theories, social science theories, organizational theories, and computer science theories. We are, however, basically on the applied side of research and development. We have a data collection and estimation vehicle (e.g., a survey) which is used for many substantive purposes. We are interested in knowledge that improves the vehicle and less interested in anything else.

As I understand the Federal Committee on Statistical Methodology, the topics for papers are essentially the fruit of discussions of the committee members. This is fine for assuring interest in the paper series among subcommittee members, but fails to assure coverage of important topics.

I have suggested a total survey error structure above. The reports should have both measurement and reduction of error in mind. The widely perceived worth of sampling error as a criterion of evaluation of data owes its existence largely to well accepted estimators of the error. We currently lack comparably well accepted measures for nonsampling errors, but the report series could be used as a vehicle to stimulate such measures.

Finally, another way to structure the report series is around major problems facing the Federal statistical system in the near and far term (see Figure 4). These, in my view, should form the core attention of the working paper series.

Figure 4
Likely Problems Facing Federal Data in the Near/Far Term

  1. Identification of cost components associated with error-related design features
  2. Integration of question changes motivated by cognitive research into ongoing surveys
  3. Public cooperation with data collection requests and coverage of subpopulations on sampling frames
  4. Development of mixed strategy designs, tailored to diverse subpopulations
  5. Development of nonsampling error indicators; implementation of statistical quality control procedures
  6. Training of statisticians and social scientists in survey research; recruitment/retention of trained staff

The first I mention may be the most controversial. The statistical literature on survey design is schizophrenic on costs. On one hand, there exist models which demonstrate that only through knowing cost components can design optimization be achieved. On the other hand, there is little serious treatment of survey costs by statisticians or those from other disciplines.

The second issue has both a restrictive and more global meaning. First, the work ongoing in so-called cognitive laboratories is seeking to identify principles that influence measurement error in question-answer sequences. The Federal statistical system at the current time has no good mechanism for the orderly introduction of change in questionnaires. For the vast majority of ongoing surveys, questionnaires remain static despite evidence of improved alternative measures. The value of unbroken time series and the assumptions of canceling biases in over-time comparisons are used to justify inactivity.

Americans have very interesting reactions when they visit Cuba or see scenes of the country. They marvel at the maintenance of U.S. manufactured cars in their original state from the 1950's. They are at once proud of the ongoing use of older vehicles and humored by the lack of progress. A U.S. auto manufacturer would quickly go out of business if he were continuing to market 1950's designs. Indeed, the watchword in that industry is continued investment in change, designing systems to permit ongoing change, making change part of the design. Survey researchers are driving 1950's vehicles in the 1990's. What we dearly lack is the will to mount programs of ongoing improvement in data series.

The third likely issue of import is the role of voluntary participation in surveys over the coming years. Some countries in Western Europe have experienced political shocks to response rates (e.g., Sweden, West Germany).
Public debate about surveys in these countries has led to lower cooperation with survey requests. In some cases documented effects on survey statistics exist. That is, the nonresponse error becomes visible to even the most naive reader of statistics. At that point, there was little the researchers were prepared to do, whether in the reaction of field interviewers or in the construction of adjustment schemes. We must acknowledge that public cooperation is a fragile base on which the scaffolding of inference rests. To improve participation or to adjust inference in the presence of lower participation, understanding of the decision to participate must be obtained. This is an issue that faces the entire statistical system, indeed, the entire industry of information collection.

The fourth issue is not unrelated to the problems of participation. As the diversity of the U.S. population increases, the need for survey designs that tailor procedures to different subpopulations grows. Large portions of the population remain covered by traditional frames, cooperative and competent to provide information using cheap data collection methods. Others fail to be covered on traditional frames, have difficulty providing information, and fear harmful consequences from their participation. The coming years are likely to find greater appeal in mixed design strategies -- multiple frames, multiple data collection modes, tailored questionnaires to subpopulations. The models exist in the survey design literature, but they need careful attention.

The final problem listed above concerns a crisis looming ahead for the social measurement industry in this country. Like all endeavors that require quantitative literacy, social and economic statistics are currently facing a shortage of qualified personnel. If this were not bad enough, we also suffer from a worse problem -- the absence of ongoing training programs.
It's not merely that students aren't entering the field; it's not clear how they can within traditional academic programs. Let's examine the problem. Sampling statistics was well developed by the early 1950's; it is not a "hot" area of development, attracting the best and brightest of students. Instead, a variety of analytic statistical developments are more emergent. Young Ph.D.'s labelling themselves as sampling statisticians are unlikely to have an easy route to tenure in an academic department. Within the social sciences the difficulties might be greater, with great pressure on students to develop areas of expertise which are central to the dominant paradigms in the discipline. Survey methodology is not one of them in any discipline.

There are two results of this: 1) a gross inadequacy of training of new staff coming into the statistical system in topics relevant to survey quality (this is not a comment on their training as statisticians, psychologists, or economists), and 2) a reduction in the number of academic researchers devoted to the craft of social measurement. There is a clear conclusion here: the statistical system has to get serious about training the staff it needs for the future. This means support of specialized graduate programs, focused continuing education, onsite training and other similar mechanisms.

The two types of structure - quality/cost components of data series and problems facing the system - suggest two paper series, one devoted to technical issues, another to administrative and professional issues.

4. Other Comments, Not Elsewhere Classified

I must admit confusion about the term, "working paper series." In an academic setting this term is used to describe papers in the process of being refined or papers not worthy of being refined. People are sometimes "working" on them. The better ones change over time; they evolve to a better state. This doesn't seem to fit well with the OMB Working Paper Series.
Almost all remain in their original state. I don't want to change the name of the series; I'd rather see the series periodically updated. Several of the papers were valuable only for a short period of time (e.g., microcomputers; telephone data collection). Having a well-defined structure to the series might define a set of ongoing updates of papers devoted to individual topics.

There is another connotation of "working" when attached to paper series. That is, they are "working" toward quality improvements in the statistical system. I like this connotation. But it implies two burdens not uniformly accepted: a) a set of recommendations at the end of reports, b) follow-through by OMB or individual agencies to implement change. By this definition, I think, the paper series has not achieved full success.

Another problem with the series is the costs and benefits assigned to authors of the reports. Unlike my colleagues in academia, statistical system staff rarely experience career-enhancing effects of writing such papers. There is the value of education about other agencies, of "networking" with other members of the statistical system, and of learning more about important issues facing the system. On the other hand, I've learned that this is work essentially performed at nights and weekends by people already very busy. Now, night and weekend work is commonly very productive and I have no problem with such a plan. What I do regret (and think bad for the health of the system) is that such work is given so little value by many of the home agencies. OMB might consider remedying this with some more formal recognition of the writers of these reports. At the very least, the authors of the report might be given a more prominent position on the covers of the papers.

It strikes me that this seminar is an ideal forum for generating discussion on the future of this series. I recommend several questions:
Have the basic issues changed since the report?
- because of the paper?
- in spite of the paper?
Is it time to redo the paper, to update it?
Are there subtopics now of sufficient importance that they deserve separate treatment?

5. Personal note

This working paper series consistently contains the name of one person, from the first to the last - Maria Gonzalez. The Federal Statistical System often focuses its attention on data series structures and organizations, not people, but the success of any endeavor that spans decades depends on key people. In this paper series the key person is unambiguously Maria. As those of you who know her well can attest, she has been a rock of rationality, courtesy, integrity, and absolute honesty in her work on the Federal Committee on Statistical Methodology. She alone can succeed in pressing overworked federal statisticians to take on projects for the benefit of the whole system. Her nearly unique ability to suggest ideas in a manner that allows the hearers to believe they are their own ideas is a marvel. Her perseverance toward important goals of quality improvement and coordination has made the working paper series and this conference possible.

Session 1
SURVEY QUALITY PROFILES

THE SIPP QUALITY PROFILE
Thomas B. Jabine
Statistical Consultant

A. Introduction

The Survey of Income and Program Participation (SIPP) is a longitudinal national household survey which has been conducted by the U.S. Bureau of the Census since 1983, following several years of developmental research. The goal of the survey, which uses a rotating panel design, is to provide policy makers with comprehensive and accurate data about the levels and determinants of the income of U.S. persons and households and about their participation in a broad range of income transfer and welfare programs.

The SIPP quality profile summarizes current knowledge about the sources and magnitude of errors in estimates based on SIPP. An initial version of a SIPP quality profile was issued in 1987 (U.S.
Bureau of the Census, 1987) and an updated and expanded version was prepared in 1989 (U.S. Bureau of the Census, 1990). This paper describes the purposes of developing a quality profile for a survey or other statistical program and the process of preparing and updating a quality profile, using the SIPP Quality Profile as an illustration. The contents of the updated version will be discussed briefly. Those who wish to evaluate the quality of SIPP data on specific topics or to develop an overall judgement about the quality of SIPP data are referred to the latest version of the SIPP Quality Profile and the other sources of information that it identifies.

Section B outlines the development of the quality profile concept and identifies some publications of the last four decades that could be regarded as forerunners of the current model. Section C explains the origin of the SIPP Quality Profile. Section D provides an overview of the updated version: its intended audiences, purposes, sources of information and structure. The contents are discussed briefly in section E. In the concluding section, I discuss the role of a quality profile in the broad context of survey quality control and improvement.

B. Some Forerunners of the Quality Profile

The theoretical foundation for a quality profile rests on various models that have been developed for the measurement and analysis of errors in surveys, especially the Census Bureau model, which integrates components of sampling and nonsampling error and the interactions between them (Hansen, Hurwitz and Bershad, 1959). Dalenius (1974) formalized the concept of total survey design, using the Census Bureau model to guide the allocation of resources to minimize total error in a survey.

Based on this foundation, there have been several broad qualitative and quantitative reviews of the quality of data from censuses and surveys, featuring direct and indirect data about the various components of error.
Zarkovich (1966) published what was perhaps the first systematic treatment of nonsampling errors in surveys, with emphasis on procedures for their measurement and control, and including numerous examples of specific information about nonsampling errors from surveys and censuses in many countries. Bailar and Lanphier (1978), in a pilot test of methodology for the evaluation of survey practices, reviewed the quality-related design features of 36 U.S. surveys. Their review was not based on direct measures of errors, but the frequency with which they found indirect evidence of low quality was high enough to be disturbing and to suggest a need for greater attention to the quality of survey designs and practices. A United Nations (1982) manual on Nonsampling Errors in Household Surveys, prepared for use in developing countries, systematically explores the different sources and types of nonsampling error and provides illustrative data from numerous household surveys throughout the world. Statistical Policy Working Paper 15 (Office of Management and Budget, 1988) performs a similar function for Federally sponsored establishment surveys in this country. Compilations of information about the quality of surveys have two main audiences: survey designers/managers and users of survey data. To ensure that the latter have access to such information, standards have been developed for the dissemination, in survey publications, of information about errors. An early example of such standards was Census Bureau Technical Paper 32 (1974). Today, several Federal statistical agencies apply similar standards in their publication programs. There have been some publications devoted entirely to the quality of data on a specific topic in a census or survey. An early example was a detailed appraisal of the income data from the 1950 Census of Population (Conference on Research in Income and Wealth, 1958). 
The most immediate forerunner of the SIPP Quality Profile was Statistical Policy Working Paper 3 (Brooks and Bailar, 1978), which provided an error profile for estimates of unemployment from the Current Population Survey (CPS). Jabine (1987) provided a detailed analysis of the quality of data on chronic conditions reported in the National Health Interview Survey.

There are two fairly evident differences between the CPS error profile and the SIPP quality profile. The most obvious is the switch from "error" to "quality" as the defining adjective for the profile's content. While this may seem to be only a semantic change, it reflects a feeling, undoubtedly shared by the authors of the CPS error profile, that the goals of such a publication are constructive. The use of the term quality seems more in keeping with today's emphasis on quality control and improvement in all kinds of endeavors, including surveys. The other basic difference is that the SIPP quality profile covers the quality of estimates for all of the topics included in SIPP, whereas the CPS error profile covered only one of the many topics included in that survey.

Other U.S. statistical agencies are undertaking similar although not identical efforts. The Energy Information Administration, for example, periodically publishes reports in a series called An Assessment of the Quality of Selected EIA Data Series. These reports rely largely on the technique of comparing data from EIA surveys with more or less comparable data from other sources and analyzing the differences that are observed. Janet Norwood, in a paper presented at the Census Bureau's Third Annual Research Conference, stated that the Bureau of Labor Statistics was planning to develop a comprehensive error profile for each of its surveys (Norwood, 1987, pp. 217-218).

C. Origin of the SIPP Quality Profile

The SIPP is a major longitudinal survey.
The start of the survey was preceded by several years of research and development, an effort known as the Income Survey Development Program. The evolution of SIPP's complex survey design did not end when the survey became operational late in 1983. Methodological research and evaluation studies have continued at a substantial pace and the results of these studies, along with accumulated performance statistics, feedback from users and adjustments made necessary by reductions in funding, have led to significant changes in the survey design and procedures. Thus, SIPP is still in the early stages of its evolution, in contrast to the Current Population Survey which, although not immune to evaluation and improvement, has reached a more mature and stable phase.

In 1984 the Social Science Research Council and the Survey Research Methods Section of the American Statistical Association, with the encouragement and support of the Census Bureau, established a Working Group on the Technical Aspects of SIPP to provide advice to the Census Bureau on research priorities and the translation of research findings into changes in the survey design and procedures. (The Social Science Research Council later relinquished its sponsorship role.) An early recommendation of the Working Group was that the Census Bureau prepare a compendium of research results and other information about the quality of SIPP data. Members of the Working Group believed that a systematic account of information about the different kinds of errors that affect estimates from SIPP would be invaluable as a guide in setting research priorities and applying the principles of total survey design to SIPP. Given the substantial amount of ongoing research, they recommended that such a quality profile be updated periodically, perhaps every two years.
The Census Bureau accepted the Working Group recommendation and produced the Quality Profile for the Survey of Income and Program Participation (King, Petroni and Singh, 1987), early drafts of which were reviewed by several members of the Working Group. New information continued to flow in at a rapid rate and toward the end of 1988, Census decided that it was time to start work on an update.

The updated version, published in mid-1990, was prepared by the author of this paper with substantial assistance from Karen King and Rita Petroni of the Census Bureau's Statistical Methods Division. Although the general structure of the two versions is similar, the update contains much new material and some of the earlier sections were significantly revised. It also includes an index. The new version benefitted from reviews by several members of the SIPP Working Group and Census staff. Special thanks are due to Daniel Kasprzyk and Rajendra Singh for their support of the project.

D. Overview of Version 2

The SIPP Quality Profile is intended to serve two main audiences: "users of SIPP data and those who are responsible for or have an interest in the SIPP design and methodology." The interests of these two groups are different. Users want to know how the errors associated with specific categories or classes of data are likely to affect their analyses. SIPP designers and managers need to know the magnitude of errors associated with specific design features, in order to control the quality of the survey estimates and to guide the allocation of resources available for their improvement. Besides these two primary audiences, it was expected that the publication would be of interest to persons concerned with the design of longitudinal surveys other than SIPP and to two special groups: the ASA/SRM Working Group and a Panel to Evaluate the Survey of Income and Program Participation, convened by the Committee on National Statistics at the request of the Census Bureau.
Information about the components of error that affect SIPP data comes from four sources:

o Performance statistics, such as unit and item nonresponse rates and reports based on quality control procedures used in data collection and processing operations.

o Methodological experiments. Both in the developmental period and since the start of survey operations, there have been numerous methodological experiments involving design features such as length of questionnaire, respondent rules, use of respondent incentives, increased use of telephone interviewing and methods of adjustment for nonresponse.

o Micro-evaluation studies. The outstanding example is the SIPP Record Check Study, in which individual survey responses to questions about program participation and benefits were compared with administrative data for each of several programs.

o Macro-evaluation studies. There have been numerous comparisons of SIPP data with data on the same topics from other surveys, especially the Current Population Survey, and from program records.

Assembling the relevant documentation was a challenge. SIPP has probably generated more methodological documentation than any other survey that has been in existence for a similar length of time. The list of 161 references provided in the updated version of the Quality Profile, which includes only those items that were actually cited in the report, is nearly double the size of the list included in the first version. The most commonly used sources were: the SIPP Working Paper series; the annual proceedings of the Survey Research Methods, Social Statistics and Business and Economic Statistics sections of the American Statistical Association; the proceedings of the Census Bureau's Annual Research Conferences; and internal Census Bureau memoranda. The report informs readers how to obtain copies of any of the internal memoranda in which they are interested.
Finding a suitable framework in which to present all of this information about different components of error also presented a challenge. The traditional approach is to organize the material according to the main phases of the survey: sample selection, data collection, data processing and estimation. The core of the Quality Profile (Chapters 3 through 8) is, in fact, organized in that manner, with one chapter devoted to sample selection, three to data collection (covering data collection procedures, nonresponse error and measurement error) and one each to data processing and estimation.

Two important topics did not fit neatly within this framework. Chapter 9, Sampling Errors, covers the procedures used to estimate sampling errors and the relationship between sampling errors and sample size. Chapter 10, one of the longer chapters, is called "Evaluation of Estimates" and covers both comparisons of SIPP estimates with data from other sources and indicators of errors of undercoverage. The remaining chapters, 1, 2 and 11, provide an introduction, an overview of the survey and a summary, respectively.

The structure of the SIPP Quality Profile is similar to that of its chief forerunner, the CPS Error Profile. The main differences are the division of the material on data collection (called "Observational Design and Implementation" in the CPS Error Profile) into three chapters, and the addition of the chapters on sampling errors and evaluation of estimates.

Our goal was to provide, insofar as available, quantitative information about overall error and its components. Hence, the report includes 6 figures and 43 tables, a substantial increase over the number included in the first version. Space limitations preclude inclusion of tables in this paper, but for those who may be interested, the numbers of some key tables and figures from the publication are given in the following section.

E. Summary of Findings

Major sources of error

The SIPP Quality Profile does not contain any broad conclusions about how successful SIPP has been so far in fulfilling its goals. Our goal was to provide enough information about the quality of the survey data so that individuals and groups like the Committee on National Statistics Panel to Evaluate SIPP could reach their own conclusions. The summary chapter does, however, identify what stood out as the three main sources of error in SIPP estimates: nonresponse, differential undercoverage and measurement error.

As in any longitudinal survey, unit nonresponse increases in succeeding rounds (called "waves" in SIPP) of the survey. Table 5.1 (not included with this paper, see the report) shows the data available as of 1989 on unit nonresponse by wave for each panel of the survey (households and individuals in each panel are interviewed 8 or 9 times, at 4-month intervals). The rates are relatively low -- 4.9 to 7.6 percent -- for the first wave, but increase to over 20 percent at the final wave of each panel. This relatively high attrition is due in part to the difficulty of tracking households and individuals that move, as is required by the SIPP design. The characteristics associated with unit nonresponse have been analyzed in detail, and these analyses have guided the development of estimation procedures designed to minimize the biases that result from differences between the characteristics of respondents and nonrespondents.

Item nonresponse has been low for core items on labor force activity, income recipiency and asset ownership. It has been somewhat higher for income amounts, especially self-employment earnings and interest. In the topical modules (questions not asked in every wave), especially high nonresponse has occurred for questions on asset amounts.

Indicators of differential undercoverage in SIPP for population subgroups defined by age, race and sex are shown in Table 10.13 of the report.
The table shows the reciprocals of the weights that are applied in order to make the simple unbiased estimate for each subgroup agree with an independent estimate that uses the Population Census count as a benchmark. The group most affected is young adult black males. The ratios for black females in the same age group are also quite low. At least for the males, the coverage ratios shown understate the amount of undercoverage, because the ratios do not include any adjustment for census undercoverage, which is known to be above average for this population subgroup. Similar patterns of undercoverage have been observed in the Current Population Survey and other national household surveys. The second-stage ratio adjustments used for both cross-sectional and longitudinal estimates to compensate for undercoverage are believed to reduce both the sampling error and bias of the estimates. The effects of these adjustments on sampling errors can be estimated, but little is known about their effects on biases associated with undercoverage.

Measurement error takes many forms, but perhaps its most significant manifestation in SIPP has been the seam problem, i.e., a pronounced tendency for survey respondents to report month-to-month changes for months in adjacent waves at substantially higher rates than for adjacent months within a single wave. Figure 6.1 in the report provides a graphic illustration of the seam effect on reports of changes in earnings. Pronounced effects have been noted for most income recipiency and amount variables. Because of the rotation group design used in SIPP, cross-sectional estimates of transitions are not likely to be seriously distorted by this pattern of reporting, but it can affect estimates of the covariance structure and may have adverse effects on multivariate analyses dealing with transitions or length of spells.

Table 6.6 in the report shows some early results from the SIPP Record Check Study.
The sample sizes are small, and the table shows results for only two of the four states included in the study. For the State of Wisconsin, significant levels of underreporting were found for participation in two programs and benefit amounts in one other program. The full results from the Record Check Study will provide the best direct information so far available on levels of measurement error in SIPP and will be a valuable resource for studying the sources and correlates of response bias and response error variance.

Current research

An active program of SIPP methodological and evaluation research is continuing. The main areas of research include:

o The design of the questionnaires and the structure of the interviews. Laboratory research is being conducted to study the cognitive aspects of SIPP interviews and how they relate to seam effects and other kinds of reporting errors. Field experiments have been conducted to test the feasibility of providing feedback of prior wave information and encouraging greater use of records in interviews.

o Interview mode. An experiment with increased use of telephone interviewing is being evaluated to determine whether to adopt the procedures that were tested. For the longer term the Census Bureau is arranging for the development of a prototype questionnaire for use in computer-assisted personal interviewing (CAPI), in order to evaluate the potential effectiveness of this collection mode in SIPP.

o Estimation procedures. The broad goal for this area of investigation is to develop estimation procedures for SIPP that make effective use of auxiliary data available from both the Current Population Survey and administrative records. An initial study of the feasibility of reducing variances by using IRS data as controls in the second-stage ratio estimation procedure showed considerable promise.
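The relationship between coverage ratios and second-stage ratio adjustment discussed in this section can be illustrated with a minimal sketch. It is not the Census Bureau's estimation procedure: the two cells, the sample weights and the control totals are invented for illustration, and the function names are hypothetical. The sketch only shows that the adjustment factor for a cell is the reciprocal of its coverage ratio, and that after adjustment the weighted count in each cell agrees with the independent control.

```python
def coverage_ratios(survey_totals, control_totals):
    """Coverage ratio per cell: weighted survey estimate / independent control total."""
    return {cell: survey_totals[cell] / control_totals[cell] for cell in control_totals}


def ratio_adjust(records, control_totals):
    """Scale weights so the weighted count in each cell equals its control total.

    records: list of (cell, weight) pairs, one per sample person.
    Returns the adjusted records, the unadjusted weighted totals per cell,
    and the adjustment factor applied to each cell.
    """
    survey_totals = {}
    for cell, weight in records:
        survey_totals[cell] = survey_totals.get(cell, 0.0) + weight
    factors = {cell: control_totals[cell] / survey_totals[cell] for cell in survey_totals}
    adjusted = [(cell, weight * factors[cell]) for cell, weight in records]
    return adjusted, survey_totals, factors


# Hypothetical data: cell "A" is undercovered, cell "B" is fully covered.
records = [("A", 100.0), ("A", 100.0), ("B", 150.0), ("B", 150.0), ("B", 100.0)]
controls = {"A": 250.0, "B": 400.0}

adjusted, survey_totals, factors = ratio_adjust(records, controls)
ratios = coverage_ratios(survey_totals, controls)

for cell in sorted(controls):
    print(cell, "coverage ratio:", ratios[cell], "adjustment factor:", factors[cell])
```

In this toy example the survey reaches only part of cell "A", so its weights are inflated to match the control. As the text notes, such an adjustment fixes the totals but says nothing about whether the people missed resemble the people covered.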
Research in these and other aspects of the survey is proceeding at a pace that suggests the desirability of preparing updates of the SIPP Quality Profile on a regular basis. Areas of research that have been relatively untouched so far include the effects of interviewer variance and the conditioning effects of repeated interviews on response error. For the latter, the overlapping panel design used in SIPP offers the possibility of comparing cross-sectional estimates for households and persons that have been in the sample for varying lengths of time. There is also a need to update some of the earlier evaluation studies in order to monitor the effects of design changes since the beginning of the survey. Much of the research reported in versions 1 and 2 of the SIPP Quality Profile, including the Record Check Study, which is the only source of direct information on the size of individual reporting errors, is based on data from the 1984 panel.

F. Conclusions

Judging from some comments by users of the initial version and reviewers of the preliminary draft of the updated version of the SIPP Quality Profile, the systematic compilation and publication of information about the nature and sources of error in a major continuing survey like SIPP, with periodic updates, is a worthwhile undertaking. A more definitive evaluation of utility will be possible now that the updated version has been published and is being widely distributed.

The author believes that the preparation of quality profiles could be valuable in connection with efforts to track and improve the quality of data from other major continuing national surveys, such as the Current Population Survey, the National Health Interview Survey, the National Crime Survey, the Annual Survey of Manufactures and the Monthly Retail Trade Survey. The technique is applicable to both household and establishment surveys.
Maintaining and improving the quality of survey data is a never-ending job for survey designers and managers, and there is room for a multiplicity of approaches. Some Federal agencies are making a strong commitment to the application, to survey operations, of Deming's philosophy and techniques for total quality management. That approach implies not just measurement of errors and identification of their sources, but modification of the survey process as needed to eliminate or reduce the effects of significant sources of error. The other paper presented at this session (Hanuschak, 1990) provides an example of this model of survey quality management, with active participation and commitment to quality improvement by key managers in the organization. The same commitment to the quality of data can be seen in the work of the sponsors and participants in this Conference and they deserve our thanks for it.

REFERENCES

Bailar, B. and Lanphier, M. (1978), Development of Survey Methods to Assess Survey Practices, Washington DC: American Statistical Association.

Brooks, C. and Bailar, B. (1978), An Error Profile: Employment as Measured by the Current Population Survey, Statistical Policy Working Paper 3, Office of Federal Statistical Policy and Standards, U.S. Department of Commerce.

Conference on Research in Income and Wealth (1958), An Appraisal of the 1950 Census Income Data, Studies in Income and Wealth, Vol. 23, National Bureau of Economic Research, Princeton: Princeton University Press.

Dalenius, T. (1974), Ends and Means of Total Survey Design, Stockholm: University of Stockholm.

Energy Information Administration (1983), An Assessment of the Quality of Principal Data Series of the Energy Information Administration (first in a series of "state of the data" reports), Publication DOE/EIA-0292(82).

Hansen, M., Hurwitz, W. and Bershad, M. (1959), "Measurement Errors in Censuses and Surveys", Bulletin of the International Statistical Institute, 38:359-374.

Jabine, T. (1987), Reporting Chronic Conditions in the National Health Interview Survey: A Review of Findings From Evaluation Studies and Methodological Tests, Data From the National Health Survey, Series 2, No. 105, National Center for Health Statistics.

Jabine, T., assisted by King, K. and Petroni, R. (1990), Survey of Income and Program Participation: SIPP Quality Profile, Bureau of the Census, U.S. Department of Commerce.

King, K., Petroni, R. and Singh, R. (1987), Quality Profile for the Survey of Income and Program Participation, SIPP Working Paper No. 8708, Bureau of the Census, U.S. Department of Commerce.

Norwood, J. (1987), "What is Quality?" in Proceedings, Third Annual Research Conference, Bureau of the Census, U.S. Department of Commerce: 215-222.

Subcommittee on Measurement of Quality in Establishment Surveys (1988), Quality in Establishment Surveys, Statistical Policy Working Paper 15, Statistical Policy Office, U.S. Office of Management and Budget.

United Nations (1982), Non-sampling Errors in Household Surveys: Sources, Assessment and Control, UN Publication DP/UN/UBT-81-041/2, National Household Survey Capability Programme.

U.S. Census Bureau (1974), Standards for Discussion and Presentation of Errors in Data, Technical Paper 32, U.S. Department of Commerce.

Zarkovich, S. (1966), Quality of Statistical Data, Rome: Food and Agriculture Organization of the United Nations.

INITIAL REPORT ON THE QUALITY OF AGRICULTURAL SURVEY PROGRAM
George A. Hanuschak
National Agricultural Statistics Service

I. Background and Introduction

In December 1988, the National Agricultural Statistics Service (NASS) formed a Survey Quality Team (SQT) for its Agricultural Survey Program (ASP). The ASP is a series of integrated multiple sampling frame (area and list) based surveys throughout the agricultural calendar year. Some major items on the surveys are planted and harvested crop acreages, hog, cattle and sheep inventories, crop yields and production and on-farm grain storage.
There was a major survey redesign from individual MF surveys to an integrated multiple frame survey program which was implemented over several years (1984 - 1986).

The mission of the Survey Quality Team is to identify and develop statistical process control (SPC) methods for the management of the integrated Agricultural Survey Program. The SPC methods are based upon the fundamentals of total quality management (TQM) techniques developed by W. Edwards Deming, Joseph Juran, Philip Crosby and other well-known contributors to the TQM and SPC literature. However, since much of the literature refers to "manufacturing" situations, it was adapted to fit the government agricultural survey situation. Several papers by Ron Fecso developed the basic model of survey quality used by the SQT. The first major milestone of the SQT was to be the development of a baseline "state of the survey" quality report.

The mission of the SQT is quite broad, challenging and critically important to the Agency's long term goal of routinely and continually improving survey quality. The team and the Agency also face this challenge in the light of severe budget pressure, in general, on Federal Statistics programs. However, the team feels that TQM and SPC methods are quite powerful tools, when properly applied, that can aid in measuring and improving survey quality over time.

One of the first lessons of total process control is to define the major steps in the total process. In the case of the ASP, one needs to first define or identify the major steps or stages of the ASP surveys. The survey quality team has identified the following steps (Exhibit I) as the major 22 processes of the survey.

Unfortunately, each one of these survey stages or processes is probably susceptible to some type of error or bias. The SQT developed the following profile (Exhibit II) of 24 potential sources of error or bias in the ASP.
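As a rough illustration of the kind of statistical process control tool referred to above, the sketch below computes three-sigma limits for a standard Shewhart p-chart and flags periods that fall outside them. The report does not say which charts the SQT used; the choice of a p-chart, the monitored quantity (questionnaires failing a quality check), the counts and the function names are all assumptions made for illustration.

```python
import math


def p_chart_limits(defect_counts, sample_size):
    """Center line and three-sigma limits for a p-chart with equal sample sizes."""
    p_bar = sum(defect_counts) / (len(defect_counts) * sample_size)
    sigma = math.sqrt(p_bar * (1.0 - p_bar) / sample_size)
    lower = max(0.0, p_bar - 3.0 * sigma)  # a proportion cannot fall below 0
    upper = min(1.0, p_bar + 3.0 * sigma)  # or rise above 1
    return p_bar, lower, upper


def out_of_control(defect_counts, sample_size):
    """Indexes of periods whose defect proportion lies outside the limits."""
    _, lower, upper = p_chart_limits(defect_counts, sample_size)
    return [
        i
        for i, count in enumerate(defect_counts)
        if not lower <= count / sample_size <= upper
    ]


# Hypothetical data: questionnaires failing a check, out of 200 reviewed per period.
counts = [4, 6, 5, 3, 7, 5, 4, 20, 6, 5]
n = 200

p_bar, lower, upper = p_chart_limits(counts, n)
print("center:", p_bar, "limits:", lower, upper)
print("periods to investigate:", out_of_control(counts, n))
```

A flagged period is a prompt to look for an assignable cause in one of the survey stages, not evidence of a cause by itself.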
Like any good statistical organization, the Agency has tried to minimize the probability of various nonsampling errors occurring in the survey process. Controls include training, survey manuals and instructions, Agency Policy and Standards Memorandum, quality control checks on enumeration, reinterview studies, etc. Controlling and measuring nonsampling errors for a complex survey process will remain extremely challenging even with the best efforts at statistical process control. However, in the remainder of this report, the SQT defines and demonstrates how to use statistical process control and total quality management techniques to reduce total survey error over time.

Exhibit I - Major Survey Stages

Survey Clearance
Area Sampling Frame (Construction, Maintenance and Sampling)
List Sampling Frame (Construction, Maintenance and Sampling)
Survey Specifications
Design of Questionnaires (Design, Print and Distribution)
Preparation of Manuals (Interviewers, Supervisory and Editing)
Prepare Survey Software (Data Entry, Survey Coordinator, Edit, Analysis, Summary, Data Base, Mail and Maintenance System, Etc.)
National/Regional Training Schools
Survey Management - Headquarters and State Statistical Offices (Coordination of Procedures)
Presurvey Coding/Handling/Processing by State Statistical Offices
State Training Schools
Data Collection
Data Collection Quality Control
Manual Data Review and Coding
Data Entry and Validation
Data Edit and Review
Imputation, Analysis and Summarization
State Statistical Office Review of Survey Results (including submission of estimates)
Headquarters Review and Release Preparation
Post Survey Updating (Data Base and List Sampling Frame)
Post Survey Evaluations
Survey Research

Exhibit II - Some Potential Sources of Total Survey Error in the Agricultural Survey Program

Undetected List Sampling Frame Duplication
List Sampling Frame (Old or Incorrect Control Data)
List - Undetected Reporting Duplication or other reporting/enumeration errors or bias
List Sources of Questionable Quality used for List Sampling Frame Build/Maintenance
Area Sampling Frame (Outdated Land Use Stratification)
List Sampling Frame (Any large operations not covered by the frame)
Area Sampling Frame (Outdated Sample Segment - Aerial Photography)
Different Farm Operation Description Questions on Different Questionnaire versions
Incorrect overlap/nonoverlap Determination
Incorrect Exception Report Handling (One Type of Survey Weighting Factor)
Incorrect Coding (List Adjustment Survey Weighting Factors, Completion/Imputation Codes, etc.)
Undetected Data Entry errors (pass all the way through the editing system)
Shift in Mix of Data Collection Modes (Telephone, Computer Assisted Telephone, Mail and Personal)
Shift in Mix of Respondents (Operator vs. Spouse vs. Other)
Incorrect Survey Master Records
Questionnaire Design (or Print) Errors
Unmeasured Major Changes in Survey or Estimation Procedures (Headquarters or State Statistical Offices)
Error in Known Zero Determination (Is Respondent Validly out of Business?)
Overediting/Underediting of Survey Data
Potential Bias in Manual or Machine "Imputation" Procedures
Lack of Formal Outlier Handling Procedures (Non Robust or Non Smooth Time Series Estimation)
Survey Processing Software
Shifts in Characteristics or Skill Level of Work Force (Enumerators, Statisticians, Programmers, Support Staff): experience in their current job, survey procedures knowledge, farm knowledge, statistics knowledge, technology skills, etc.
Farmer or Respondent's level of understanding or grasping of survey reporting concepts and item definitions (Cognitive aspects)

II. The Components of Survey Quality

When faced with the problem of measuring and improving the quality of the ASP, one should consider the components of survey quality. Listing the components defines exactly what is meant by the term "survey quality" and highlights specific sub-areas that need to be explored. Figure 1 shows the components of survey quality. It was developed by the Nonsampling Errors Research Section in the Survey Research Branch of NASS and adopted by the SQT. There are four major components related to survey quality: accuracy, resources, timeliness, and relevance.

Accuracy is the component that first comes to mind when thinking about survey quality. NASS wants the survey indications to be as accurate as possible. Not only should the sampling errors be small, but the nonsampling errors should also be minimized. In large-scale surveys the relative sampling errors can be smaller than the relative size of the nonsampling errors. Factors such as undetected list sampling frame duplication, nonresponse, questionnaire wording, mode of interview, change in respondent, etc., can lead to substantial nonsampling errors.

The second component of survey quality is resources. Even if a survey organization can control the sampling and nonsampling errors, its ability to do so will be affected by the amount of dollars that are available to spend on the survey.
The amount of dollars has a direct impact on sample sizes, list frame quality, pretesting, reinterview projects, editing programs, summary programs, analysis, etc. Also important is the amount and quality of staff hours that can be devoted to a survey. Staff hours are affected by salaries, training, hiring practices, long-term career development, and organizational climate, components that are also greatly affected by the amount of dollars available. Most people quickly realize that the crucial problem is to take the fixed set of available resources and use those resources in a way that maximizes the survey quality.

The third component is timeliness. Of course, time could be considered another element of resources -- like dollars and staff. However, timeliness needs to be considered a component by itself because timeliness is crucial in the survey process. The impact and usefulness of survey indications are greatly affected by whether the survey data were collected one month or one year earlier. NASS has always stressed the need to collect data quickly and to release estimates as close to the survey reference date as possible. Thus, the survey calendar -- which is used to time all the steps of the survey -- is important to the survey quality.

The final component is relevance. Relevance is dependent on the needs of the users of NASS statistics, and those needs change from day to day. It is useless for NASS to collect a high-quality piece of information on farming if that piece of information has no relevance for the users of NASS statistics -- that piece of information simply becomes a product without a buyer. NASS must constantly assess the needs of people using its statistics to make sure that the collected information is relevant. The second aspect of relevance is internal to NASS. An example of internal relevance is whether the Agency wants direct expansion (level) or ratio (percent change) or both types of estimators out of the ASP.

III. Accuracy of Survey Soybean Acreage Estimates

NASS has an expert panel of Agency statisticians called the Agricultural Statistics Board (ASB), which reviews all survey indications (often multiple indications for any one item) and administrative or check data (such as the amount of soybeans crushed in processing plants), and adopts or sets the official estimates to be published.

Two concepts need to be defined: use and fitness. The ASB's use of the ASP indications was chosen as the primary "use" of the ASP. "Fitness" for use is evaluated by setting a standard for use and measuring adherence to the standard. Ideally we would have standards for all the components of mean squared error (MSE) for the various commodity indications and administrative data used by the ASB. This would provide the ability to create statistically well defined composites of the data for use as the Board estimate or forecast. At this time we have measures of the variance for most indications, but have only enough information about MSEs to recognize the importance of developing more extensive MSE measures. This section will provide information for Agency management to assess which areas are most in need of further study or research and/or corrective action.

The ASB's specific need is to have indications which serve as a solid basis for the official numbers. The following chart on soybean planted acreage displays the degree to which the ASB has found the ASP indications to be "fit for use."

In reviewing the soybean planted acreage chart on ASB use you will observe the following:

1. The Agricultural Statistics Board finds the area sampling frame based June acreage estimate quite "fit for use."

2. The ASB does not find the integrated multiple frame based June acreage estimate "fit for use." It has an observed substantial upward bias which also changed substantially in magnitude between 1987 and 1988 and stayed at the larger magnitude in 1989 and 1990.
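
As a rough illustration of the fitness-for-use idea, the sketch below checks whether a Board estimate falls within two standard errors of a survey indication, which is the criterion summarized later in the discussion of this report. The function name, the default multiplier argument and all numbers are hypothetical; they are not NASS data and are not presented as the Agency's actual procedure.

```python
def is_fit_for_use(board_estimate, survey_indication, standard_error, k=2.0):
    """Return True when the Board estimate lies within k standard errors
    of the survey indication (k = 2 mirrors the rule described in the text)."""
    if standard_error <= 0:
        raise ValueError("standard_error must be positive")
    return abs(board_estimate - survey_indication) <= k * standard_error


# Hypothetical planted-acreage figures in millions of acres (illustrative only).
print(is_fit_for_use(59.0, 59.4, 0.5))  # difference 0.4 is inside 2 * 0.5
print(is_fit_for_use(59.0, 61.5, 0.9))  # difference 2.5 is outside 2 * 0.9
```

A check of this kind only compares an indication with the Board estimate; it says nothing about how close either one is to the true value.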
Using Pareto analysis and an expert panel using TQM principles applied to surveys, the SQT identified the major suspected causes of the upward bias in the multiple frame based soybean acreage estimate. These suspected causes are:

1. Different Data Collection Methodologies

The area frame based acreage estimate is based upon a sample of about 16,000 sample segments throughout the U.S. Data collection is done completely by personal interviews, using an aerial photograph to locate each crop field, and the data are recorded on a questionnaire by the interviewer with the farmer's direct participation. Crop acreage data are collected and edited field by field. Farmers are probed to report waste acreage for each field. There are also five specific questions related to defining land operated now, to which all the rest of the questions relate.

On the integrated multiple frame survey, the majority of data collection is done by telephone (both conventional and computer assisted). The crop acreage data are collected for the entire farm (not field by field). Therefore farmers are probed for waste acreage only once, at best, when reporting crop acreage. There is no photographic aid for the farmer to refer to. There are only one or two questions on defining land operated now.

2. Undetected List Sampling Frame Duplication

There are sophisticated record linkage tools to identify and remove duplication on the list sampling frame. However, due to constraints on clerical resources and on funding to call farmers to resolve differences, and to the use of multiple list sources, some duplication remains. A special study was designed in 1989 to measure remaining duplication and the effect on the estimates. The study showed that approximately 10 percent of the acreage difference was due to obvious list frame duplication.

3. No Formal Documented Outlier Handling Procedures

While there are several good analysis tools to identify outliers, there is no formal procedure for handling them.
The area frame based acreage estimator is quite robust since the average expansion factor is about 200 and the segment size is 640 acres, putting an upper bound on "influential observations". For the list sample, expansion factors are considerably larger and farm size does not have much of an upper bound. Thus it is much easier to get highly influential observations in the list sample. Development of a formal robust estimator for the list sample is highly recommended.

4. Different Imputation Methodologies

There are also different imputation methodologies. All imputation for the area frame is done manually, from interviewers' observations or by statisticians. In the case of crop acreage, if a farmer refuses, the interviewer can still observe most of the crop fields and the crop. On the list sample, the imputation is a computerized algorithm that uses other reported survey data and list frame control data to impute for nonreported data cells.

5. Undetected Reporting Errors

Since the questionnaire design is different, the undetected reporting error structure may also be different. For example, the screening questions on land operated on the area side are more detailed than those on the list questionnaire and may do a more accurate job of screening out landlords who are not active farmers at survey time. New farm programs may also have led to the formation of more complex farming operations, which may involve a different reporting error structure as well.

6. Different Ratio Type Information and Sample Designs

On the area frame sample there is an 80 percent overlap from one year to the next. On the list frame sample (independent from year to year) there is negligible overlap. Thus the area frame sample also provides a paired sample ratio estimator.

It is important to note that there have also been two rather independent sources of data available to the ASB which also support following the area frame level.
The first is a Landsat satellite based regression estimator (1980-1987), which for major soybean states had variances no more than half those of the direct expansion estimator and was also unbiased when compared to the ASB and direct expansion. The second source is the calculation of a soybean balance sheet, which the ASB uses as an evaluation tool. A balance sheet takes the carryover from one crop year to the next, adds crop production to it, and then subtracts crop utilization, including exports, to get a current balance. These balance sheets also support the area frame based crop acreage level. Thus the Agency has attempted to verify the correct crop acreage level using several methods and independent data sources.

Even though there is an observed upward bias in the integrated multiple frame estimator for soybean acreage, there are reasons for keeping it and reducing the bias. These reasons are:

1. Later crop season yield and production estimates are tied to the integrated multiple frame (IMF) approach.

2. State and sub-state level estimates from the IMF have much better precision than the corresponding area frame estimates.

3. Solving the bias problem associated with soybean acreage may well improve the entire IMF, which is a survey conducted 6 times a year with an average of 20-40 items (multivariate in nature). The Survey Quality Team has performed similar analysis for on-farm grain storage, and cattle and hog inventories. Some of the bias issues are item specific but others are associated with the total survey process or components of the survey process.

4. The IMF approach is substantially more cost efficient and involves less respondent burden than the area frame approach.

Most important is that the Agency is taking actions on all of these suspected causes in 1989 and 1990. As previously mentioned, there is now an improved list frame duplication adjustment procedure in place starting in June 1989.
There is a reinterview research study being conducted in June 1990 to provide initial measures of previously undetected reporting errors. This study will involve reinterviewing a subsample of the list sample of farmers, recording the crop data field by field, asking the more detailed land operated questions, and comparing the results. There are also research efforts underway to examine the imputation methodologies, to look at an across year design for list frame based estimators, and to evaluate several robust estimators. In addition, the SQT has provided several quality measures to be monitored on the resource, relevance, timeliness and accuracy dimensions, which should become operational in 1990-91.

The Agency is also developing alternative "proxies" to the true item values in addition to relying on the ASB process. An operational reinterview/reconciliation survey is being conducted in six major grain producing states in December 1990. There has also been an extensive operational soybean yield validation survey (198? - current) where farmers are asked to harvest specific fields and take just that grain to a grain elevator to be weighed and measured. These "proxies" to true values are important in a survey evaluation program but are also complex and expensive to develop and implement. As previously mentioned, earth resource satellite data have also been used by the Agency to develop more precise and accurate crop acreage estimates.

IV. Summary

It is the claim of the SQT that more consistent and timely process improvements can take place by using the principles of statistical process control and Total Quality Management. More formal survey quality measurement and monitoring mechanisms will provide the Agency's management with more and critically important information to manage the quality of the ASP.
Also, most of these techniques will readily transfer to other survey programs in the Agency such as Prices Paid and Received by Farmers, the Farm Costs and Returns Survey, Objective Yield Surveys, Farm Labor Surveys, and even to new programs such as Water Quality and Food Safety Surveys, the National Animal Health Monitoring System and the Monthly Yield Survey Program.

There are several tools available for such a survey quality management system. First there are numerous charting techniques such as bar and pie charts for resource information, Board standardized indication graphs with standard errors, Gantt charts to display project management and survey schedule information, upper limit and lower limit control charts, multivariate control charts, Ishikawa fishbone diagrams and Pareto charts and analysis. Many of these were used in an earlier effort by the Nonsampling Errors Research Section when a statistical process control study was conducted on the Soybean Objective Yield Program.

Pareto analysis is one of the most powerful tools in quality monitoring systems. Pareto analysis ranks the potential errors in a system from most serious to least serious. The reasoning is that in many systems, and not just surveys, there are a "vital few" and a "trivial many" potential errors in the system. Thus, the most important first step in evaluating the quality of a system is to identify where it is most likely to break down or fail. Once the potential errors have been ranked, it is recommended to identify the resources allocated to each potential error, to see whether management is allocating resources in a fashion that will truly minimize total survey error. Many Pareto analyses have demonstrated that the resource allocation was not in proper alignment with the true error structure.
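
The ranking step of a Pareto analysis can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function names `pareto_ranking` and `vital_few`, the 80 percent cutoff, and the error contributions are all hypothetical and are not measurements from the ASP.

```python
def pareto_ranking(error_sizes):
    """Rank error sources from largest to smallest contribution and
    attach the cumulative share of the total to each row."""
    total = sum(error_sizes.values())
    if total <= 0:
        raise ValueError("total error must be positive")
    ranked = sorted(error_sizes.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0.0
    for source, size in ranked:
        running += size
        rows.append((source, size, running / total))
    return rows


def vital_few(error_sizes, share=0.8):
    """Return the top-ranked sources needed to reach the cumulative share."""
    selected = []
    for source, _, cumulative in pareto_ranking(error_sizes):
        selected.append(source)
        if cumulative >= share:
            break
    return selected


# Hypothetical error contributions in arbitrary units (not NASS measurements).
example = {
    "list frame duplication": 10,
    "data collection mode": 45,
    "outliers": 25,
    "imputation": 12,
    "data entry": 8,
}
for source, size, cumulative in pareto_ranking(example):
    print(f"{source:25s} {size:5.1f} {cumulative:6.1%}")
print(vital_few(example))
```

Setting the same ranking beside the resources spent on each source would show whether effort is concentrated on the "vital few" or spread across the "trivial many."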
Thus, more information on the true total survey error structure and appropriate resource allocations is being provided to survey managers and administrators to form a basis for future improvements in total survey quality.

Considerable progress has been made by the Agency in addressing quality issues in its integrated multiple frame Agricultural Survey Program. Many of the discoveries will translate to improved quality on several other major Agency survey programs as well.

References

Beller, N., "Error Profile for Multiple Frame Surveys," Statistical Reporting Service, Research Report, 1979, Washington, DC.
Bosecker, R., "Integrated Agricultural Surveys," National Agricultural Statistics Service, Research Report No. SSB-89-05, Washington, DC, June 1989.
Fecso, R., "Survey Quality," Presented at the 2nd Quality Assurance in Government Symposium, Washington, DC, May 1989.
Fecso, R., Pafford, B., Tremblay, T., Johnson, R., "Quality Profile for Soybean Objective Yield Survey," National Agricultural Statistics Service, Unpublished Case Study, Washington, DC, 1988.

DISCUSSION

Barbara A. Bailar
American Statistical Association

I. What is a Quality Profile?

The first quality profile was called an error profile and it concerned the CPS employment statistics. To be more positive, error profiles have now become quality profiles. The purpose is to prepare a systematic and comprehensive account of survey operations, listing the operations, the potential sources of error, and how the error influences the uses of the survey statistics.

Quality profiles are still rare events. When asked why there are not more, survey producers give three main reasons:

o The staff resources that would go into producing a quality profile are too great and are in competition with other, more urgent needs.
o Producing a report that tells about the errors in surveys would lead to less credibility in the statistics produced.
o Admitting that there are errors is admitting that we haven't done our jobs well.

In fact there are many benefits to producing quality profiles. Some of these are as follows:

o to minimize total error, not just sampling error, within given cost constraints;
o to force a thorough documentation of the survey process;
o to guide a user on the effects of possible errors and their impact on specific uses;
o to develop a sound quality control program;
o to use in training programs for new staff in either operations or research; and
o to use as the foundation for a sound research and analysis program.

The development of a quality profile parallels the survey process and would contain the following elements:

1. Objectives and specifications of the survey
2. Sampling design and implementation
3. Observational design and implementation
4. Data processing
5. Estimation
6. Analysis and publication

Given this as my basic understanding, let me comment on the quality profile for SIPP and the quality assessment of the Agricultural Survey Program (ASP).

The two reports have some differences and some similarities. The SIPP profile summarizes what is known about sources and magnitudes of errors of estimates and addresses accuracy. The ASP report is written from the point of view of total quality management and uses many of the ideas of Deming, Juran, and Crosby. This report considers resources, timeliness, and relevance as major components of quality, along with accuracy. The aims of the two groups seem to be quite different.

The two reports each identify the same groups as their targets -- the users of the survey data outside the agency and producers of the survey inside the agency. Another similarity is that both look at major phases of the survey operation, something essential for a quality profile.
A difference between the two reports was that the SIPP report actually identified four main sources of information on nonsampling errors: performance data, methodological experiments, micro-evaluation studies, and macro-evaluation studies. The ASP report was more concerned with process and how quality would be assessed. In fact, the report stresses the need not to identify too many sources of error because tracking everything down might take too long. Actually, I think the total quality management movement urges groups to use brainstorming techniques to identify all possible problems and then Pareto analysis to decide where to concentrate one's efforts.

Another similarity is that both reports left out major steps in the survey process. The SIPP report briefly listed the objectives of the survey, but said nothing about the objectives being conflicting. Producing a survey to give both cross-sectional and longitudinal data has been a new experience for the Census Bureau. The two objectives do conflict, at least from the resource point of view. There were some references to different needs in imputation, but the resource needs have probably had more impact on the survey. The ASP report did not even list objectives of the survey as a potential source of error.

Neither report really addressed the effects of staff training or compared the kinds of training, length of training, etc. It is fairly well known that performance data does not correlate well with interviewer performance on accuracy. Training could make a difference, but almost nothing is known at the present time.

Let me move now to some separate comments on the two reports, starting with the ASP report. There was a large group of people who worked on this survey quality team. Many of them have done excellent work in survey methodology, so I think we can expect great things from this group. The mission of the group is to contribute to NASS's long term goal of routinely and continually improving survey quality.
The focus on quality at NASS has taken on the language of the quality and productivity movement. For example, they use a simple definition of quality, "fitness for use." This led them on a search to decide what that meant and what the objective criteria would be. Finally, they decided that they would measure it by comparison with the Agricultural Statistics Board (ASB) estimate. If the ASB value is within plus or minus two standard errors of the survey indication, then the survey indication is fit for use. And, in fact, they have five ratings: ideal, acceptable, workable, minimal, and out-of-control.

I find it hard to see why the Agricultural Statistics Board estimate would be used as the standard. In some cases, there are long time series and other indicators that the ASB uses to make its estimate. However, for some surveys they have much less information. Perhaps NASS is pushing the ASB to use the survey indicators or explain why they haven't. Though the example given in the paper about the integrated multiple frame based June acreage estimate was interesting, there will not always be that kind of other data available to compare with. There is nothing about a Board estimate that measures accuracy. In some ways, it is as if the SIPP people looked at one of their macro indicators and said that if SIPP didn't come within two standard deviations of that estimate, then SIPP was not fit for use. At least, with a macro indicator, one might be able to untangle why estimates differ; that may not be possible to do with the ASB.

Following Deming's principles, I think the careful documentation of every survey for which millions of dollars are spent and on which important decisions are based is important to the profound understanding of which Deming speaks. A quality profile tells you what you know and what you don't know but should.

It was interesting to see that NASS also addressed resources, timeliness, and relevance as major components of quality.
However, it was not clear how criteria would be set or measurements taken. The Gantt chart on the QAS was helpful in identifying time periods and overlaps of one round of survey with the next, but it did not help individuals who have many surveys to work on identify overlapping periods of high intensity. Consider the sentence "Too frequent use of overtime to correct a process that is out of control usually has a devastating effect on overall performance." What does out of control mean? How does it affect overall performance? How do you know these things unless you keep careful records on hours worked on a survey and on overtime, and have some measure of a downturn in performance?

NASS has several good ideas about looking at relevance, timeliness, and resources as well as accuracy. It is an ambitious undertaking. I have one word of caution in their drive to use total quality management techniques to help them. They focus on several tools available for a survey quality management system, including charting methods. I agree that these are useful tools. But what has been most helpful in the manufacturing and service industries where TQM is used is bringing in a team that has hands-on knowledge of all the facets of the process; here, that means all the facets of the survey. The team would include data collectors from states, edit specifications people, estimation people, and those who set objectives. The tools would be something the team would be taught to use to help them. They would all need to learn basic concepts of variability. Only when all these people participate do you get the profound knowledge that you need to improve a system, not merely tamper with it. As you recall, tampering with a system does not take care of the major changes needed to remove high variability due to special causes.

Let me now move to the SIPP report. This is a good report that gets periodic updating. There are areas not covered in the report, probably because they did not seem as urgent as the areas covered.
However, I do believe that we will need to see a section on objectives, meeting multiple objectives, defining concepts, translating concepts into questions, and so forth. At the other end of the survey, something needs to be said about analysis and publication.

Though the Census Bureau does not use the language of total quality management, I know that they have thought along those lines. Using some of the performance measure standards flies in the face of everything Deming preaches. I'm talking about standards for response rates:

Outstanding................ 97.5 - 100.0
Commendable................ 95.5 - 97.4
Fully successful........... 91.5 - 95.4
Marginal................... 88.0 - 91.4
Unsatisfactory............. 87.9 and less

Instead of setting arbitrary standards for response rates and production, the Bureau needs to get a deeper understanding of what is possible in each type of area in which it does surveys. For example, response rates can be charted with upper and lower control limits for PSUs in New York City. Probably the response rates there very seldom, if ever, meet the commendable level. However, they may be within normal variability for that area. Only with positive efforts at changing the system can the response rates be raised. This is partly what Dr. Deming thunders about -- blaming the worker who may be doing the best he or she can when it is the system at fault. Again, this labelling of people's work does not make the interviewer proud, and it is really tampering with the system.

The report gave lots of interesting information on household, person, and item response rates. Some of the nonresponse rates on asset data are such that it seems questionable that the survey is the right vehicle for collecting the data. There is also emphasis on the seam problem, but this is nothing new. As I recall, it also showed up in the crime survey. It seems that certain biases are endemic to longitudinal surveys.
So far the Bureau has been content to catalog the measured effect. We really need some creative thinking and some money to get some experiments going to look at recall errors, the placement of events in time, and the time in sample problems. Though dependent interviewing may yield more consistent results, they may be no more accurate. Before action is taken to fix a problem, there needs to be a deeper understanding of why the problem exists. There was very little information available on the extent of editing, what it does, why changes are made, and what we call editing and what we call imputation. Beller made some very pertinent comments in his 1979 error profile for NASS surveys. "The amount of editing on some questions resulted in changing the level of cattle and calves by an amount two or three times greater than the error caused by sampling. This amount of editing is cause for alarm in that it clearly shows a breakdown in the survey process." In both the NASS surveys and SIPP, we need to get a better picture -- a profound understanding -- of what editing is doing to the data. One last point on SIPP. The only direct estimates of sampling error were for the third quarter of 1983 using 1984 panel data collected in wave one. The survey at that time was based on the 1970 census. It certainly seems time to recompute variances. Besides having incorrect variances, it seems like gilding the lily when the analysts are making actual and implied comparisons that they multiply by 1.6 times the standard error. The interpretations and the comparisons could be quite far off. 44 All in all, I enjoyed reading these papers. I think the documentation of SIPP is more complete but I think NASS is farther along in trying to improve quality. They do not want to document only; their real goal is improvement. I believe that is ultimately the SIPP goal too, but no strategy has yet been set forward on how to move in that direction. 45 Discussion Nancy A. Mathiowetz U. S. 
Bureau of the Census

The data collected by Federal statistical agencies are used both to shape federal policy and to change the distribution of federal expenditures; given the magnitude of the impact of these data, the need for high quality goes without question. In developing the Quality Profiles, the agencies responsible for this work are to be commended for continuing to move the discussion of error beyond that of sampling error and into the realm of the measurement of nonsampling error. Although most agencies have for years provided discussion of sampling error with release of their data and research findings, we are just beginning to develop a standard of reporting which includes a discussion of all of the components of total survey error.

Sources of Nonsampling Error

The sources of nonsampling error are many and include:

- the design of the study (e.g., longitudinal vs. cross sectional; length of recall period);
- the questionnaire, both the contents and the structure;
- the interviewer;
- the respondent; and
- the post-survey processing, including coding and keying of data.

Rather than reiterate issues raised in the Quality Profiles, I would like to suggest some other topics of investigation within these sources of nonsampling error. My goal in doing so is not to criticize the work presented here, but to provide some ideas on where these Quality Profiles could be expanded.

Design

With respect to design, we still know little about the effects of longitudinal designs on the level of error and the error variance structure of reports over time. There has been research to indicate that respondents suffer from "conditioning" effects, that is, the changing of behavior or the reporting of behavior in later interviews resulting from earlier interviews.
Some conditioning may improve reporting in that the respondent knows prior to the interview what the nature of the questions is; conditioning may also result in a reduction in reporting since respondents are now knowledgeable about the sequencing within an interview. In one study, the best predictor of error in reports of functional status in the fourth round of interviewing was the length of time it took to conduct the previous interviews. The finding suggests that conditioning effects may be reduced by something as subtle as reducing the length of an earlier interview. We need further research to understand how conditioning impacts the analysis of change over time and the structure of errors over time.

Longitudinal designs may also be affected by changes in the respondent, the interviewer, or even the interpretation and meaning of critical concepts in the questions, if the panel has a long life. With the proliferation of more longitudinal data collection efforts within the Federal Government, more research is necessary into which questions are sensitive and which are resistant to conditioning effects, as well as which items are most affected by between-interview changes.

Questionnaire

In a lecture to the Society of Government Economists, Janet Norwood stated that

    ...the quality of a statistical indicator is sometimes elusive and often difficult to define. Effective measurement requires an underlying conceptual framework and careful identification of the phenomenon to be estimated....

In the past 25 years, we have made great strides in understanding how sensitive response distributions are to minor changes in question wording. The merging of literatures from cognitive psychology, social linguistics, and social psychology with survey methodology has presented us with new means for attempting to reduce the levels of error associated with the questionnaire.
What is now needed in the Federal statistical system is a means for evaluating the various forms by which the "same" information is collected and analyzed among various agencies. For example, in recent years, the proportion of individuals lacking health insurance has been a critical issue. The most widely cited data on insurance coverage come from the Current Population Survey, which asks whether each person in a household was covered at any time during the preceding year. Persons covered by any source at any time during the year are counted as insured. In 1987, the estimate for uninsured from the March CPS was 17.6 percent. Notice that this question asks whether the person has been covered "at any time" during the previous year. In contrast, questions from the 1980 National Medical Care Utilization and Expenditure Survey (NMCUES) and the 1987 National Medical Expenditure Survey (NMES), both designed as one-year panel surveys, indicate that point-in-time estimates of the uninsured (at the time the person was interviewed) are approximately 14 to 16 percent at any one cross-section, but that estimates for all-year uninsured are approximately 9 percent. There is some conjecture that the response to the CPS may reflect a respondent's status at the time of the interview rather than in reference to any time in the previous year, due to the similarity in the estimates from CPS and the cross-sectional estimates from NMCUES and NMES. From a policy perspective the difference is critical -- whether to provide health insurance for the chronically uninsured, approximately 21 million people, or whether to provide insurance for all individuals ever uninsured, which appears to be approximately 35 million people in a given year. Those attempting to address this issue would benefit from a consistent definition of uninsured as well as a set of questions which asks about a consistent time period.
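
To see why the reference period matters, the three notions of "uninsured" discussed above can be applied to one and the same set of coverage histories. The following is a minimal sketch in Python. The five monthly histories and the function name `uninsured_rates` are hypothetical, invented purely for illustration; they are not drawn from CPS, NMCUES, or NMES.

```python
from typing import Dict, List

# Hypothetical 12-month coverage histories (True = insured in that month).
# These five records are invented for illustration; they are NOT survey data.
HISTORIES: List[List[bool]] = [
    [True] * 12,                # insured all year
    [False] * 12,               # uninsured all year
    [True] * 6 + [False] * 6,   # lost coverage at mid-year
    [False] * 3 + [True] * 9,   # gained coverage in the fourth month
    [True] * 12,                # insured all year
]


def uninsured_rates(histories: List[List[bool]], interview_month: int) -> Dict[str, float]:
    """Return the share uninsured under three different definitions.

    interview_month is a 0-based index into each monthly history.
    """
    n = len(histories)
    # Point in time: not covered in the interview month.
    point_in_time = sum(not h[interview_month] for h in histories)
    # All year: never covered (the complement of "covered at any time").
    all_year = sum(not any(h) for h in histories)
    # Ever uninsured: lacking coverage in at least one month.
    ever = sum(not all(h) for h in histories)
    return {
        "point_in_time": point_in_time / n,
        "all_year": all_year / n,
        "ever": ever / n,
    }


rates = uninsured_rates(HISTORIES, interview_month=11)
print(rates)
```

In this toy example the same five people yield uninsured rates of 20, 40, and 60 percent for the all-year, point-in-time, and ever-uninsured definitions respectively, which is the kind of gap that a consistent definition and reference period would make explicit.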

Interviewer

Response rates, hours per completed interview, and item nonresponse rates, traditionally used as measures of interviewer quality, only begin to capture the errors that are potentially associated with the interviewer's task. While each of these measures provides us with information that we believe is related to quality, we need to employ more measures that could be used with respect to understanding error for individual questions. How well do interviewers understand the concepts underlying the questions they are asking? Do they have sufficient training and understanding to ask non-directive probes when necessary to obtain an adequate answer? The increased movement toward telephone interviewing provides us with a means to routinely randomize interviews across interviewers to obtain measures of interviewer variance. We spend millions of dollars in the training of interviewers and yet know little about the most effective means for training interviewers or determining their ability to conduct the interview as trained. The review of one or more interviews by a supervisor provides some information, but if we believe that training interviewers to read questions exactly as written is worth the cost, we should be routinely evaluating the association between the delivery of questions and the error associated with the responses.

Editing and Coding

As noted in the SIPP Quality Profile, much of the between-wave difference in industry and occupation appears to be a spurious result of either data collection or data processing. A similar problem can be found in the coding of medical conditions and surgical procedures based on household reported data. Not only coding, but also editing procedures, can contribute to the overall level of error in estimates.
For example, Duncan and Mathiowetz (1985), using microlevel validation data, found that trimming estimates of change in income between two years (that is, disbelieving levels of change beyond a certain level as reported by household respondents), a procedure often done in editing data from longitudinal surveys of income, resulted in biased estimates of change and bias in the coefficients predicting income levels and change. Retrospective reports of income were more likely to be correct for those individuals with a large proportional change than for those with little or no change. The finding suggests that editing procedures should be conservative and based on empirically derived principles. Whereas we have learned to be sensitive to question wording with respect to understanding potential sources of bias, and in doing so demand documentation concerning question wording and study design, few, if any, studies provide information on effects of editing and coding processes. If consumers of the data are to understand all aspects of total survey error, coding and editing decisions need to be researched and documented.

Adjusting for Nonresponse

For the most part, nonresponse adjustments are made using demographic and segment information, and little if any information concerning the nature of the nonresponse is factored into the adjustment. There is a growing body of literature which suggests that using information from call records, specifically separating refusals from those you were unable to locate, in a nonresponse adjustment may prove beneficial, since difficult-to-locate (but eventually interviewed) sample individuals look similar to sample individuals who cannot be located.

These comments are intended to extend the excellent work presented in the Quality Profiles. The profiles provide details on the measurement of nonsampling error and the results of several experiments to reduce these levels of error. In addition,
I hope that, as others consider producing quality profiles, these profiles are expanded to cover some of these other issues.

Reference

Duncan, G.J. and Mathiowetz, N.A. A Validation Study of Economic Survey Data, Ann Arbor, MI: The Institute for Social Research, 1985.

Session 2

PARADIGM SHIFTS USING ADMINISTRATIVE RECORDS

PARADIGM SHIFTS: ADMINISTRATIVE RECORDS AND CENSUS-TAKING

Fritz Scheuren
Internal Revenue Service

There is a lot in the news lately about problems with the 1990 decennial census in the United States. Many opinions have already been offered about what went wrong and what should be done. Indeed, a paradigm shift may be needed in census-taking. This brief note talks about the possible role administrative records might play in a new paradigm.

To get things started, the word "paradigm" might deserve some elaboration: a paradigm is a way of thinking and then doing; a pattern of belief and behavior; a way of seeing reality and using that sense to accomplish something. Paradigms are common -- the way we get to work would be a humble example. Conventional census-taking, under this definition, could be characterized as a major scientific and technical paradigm.

As long as our paradigms work well for us, we tend not to change them. Occasionally, however, paradigms break down and have to be replaced; e.g., the bridge goes out and we need to find another route to work. As Kuhn pointed out in his seminal book on the structure of scientific revolutions, paradigms break down in science as well (Kuhn, 1970). Perhaps the most famous example of this is the revolution in the thinking of astronomers that occurred when the Ptolemaic earth-centered view of the universe was replaced by the Copernican view of an earth that revolved, with the other planets, around the sun.

If we look at the problems the U.S.
Census Bureau has encountered with the 1990 decennial census, it can easily be argued that one of the major barriers to overcoming these obstacles is the conventional census-taking paradigm. Kish, in a recent paper for Survey Methodology (1990), considers at length some possible alternatives. My objective here will be to focus on two of those areas -- rolling censuses and administrative registers -- and to explore a new paradigm for the U.S. decennial census.

Conventional Census-Taking

Conventional censuses, like those in Canada and the U.S., continue to do many things very well (e.g., Hammond, 1990). Indeed, at present, we have no adequate substitute for them; nonetheless, the need for at least some change seems compelling.

Rising costs are a big factor. There have been many improvements in census-taking in this century; still, in both Canada and the U.S., total costs and even costs per person have risen significantly:

o The 1990 decennial census in the U.S. is budgeted at about $10 (U.S.) per person. Even adjusting for inflation, this is a four-fold increase over what the per capita expenses were in 1960. Item content differences between the two censuses are small and essentially not a factor in explaining the difference. Both the 1960 and 1990 Censuses, for example, asked only 7 population questions of everyone (U.S. Bureau of the Census, 1989). The Census long-form sample in 1960 contained 35 questions and was to be completed by 25% of the population. For 1990, the Census long-form sample was given to 16% of U.S. hou