| Federal
Committee on Statistical
Methodology Office of Management and Budget |
Statistical Policy Working Paper 18 - Data Editing in Federal Statistical Agencies
MEMBERS OF THE FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY

Maria E. Gonzalez (Chair), Office of Management and Budget
Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Charles E. Caudill, National Agricultural Statistics Service
John E. Cremeans, Office of Business Analysis
Zahava D. Doering, Smithsonian Institution
Joseph K. Garrett, Bureau of the Census
Robert M. Groves, Bureau of the Census
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, Bureau of the Census
Daniel Melnick, National Science Foundation
Robert P. Parker, Bureau of Economic Analysis
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census

Preface

The Federal Committee on Statistical Methodology was organized by OMB in 1975 to investigate issues in Federal statistics. Members of the committee, selected by OMB on the basis of their individual expertise and interest in statistical methods, serve in their personal capacity rather than as agency representatives. The committee conducts its work through subcommittees that are organized to study particular issues and that are open to any Federal employee who wishes to participate in the studies. Working papers are prepared by the subcommittee members and reflect only their individual and collective ideas.

The Subcommittee on Data Editing in Federal Statistical Agencies was formed in 1988 to document, profile, and discuss the topic of data editing in Federal surveys. In preparing this report, the subcommittee walked in uncharted territory. Unlike many other survey process topics, such as design and estimators, where there is substantial literature, textbooks, and documentation, the formal literature pertaining to data editing is quite limited.
It is hoped that this report will further the awareness within agencies of each other's data editing practices, as well as of the state of the art of data editing, and thus lead to improvements in data quality throughout Federal statistical agencies. A key ingredient in this effort is a profile of current data editing practices constructed from an editing questionnaire designed by the subcommittee and covering 117 Federal surveys. The report also describes current and recent research developments that may aid agencies in evaluating their current data editing practices, as well as in planning for future data editing systems.

The subcommittee report is presented in a format and style that aims to increase the awareness of Federal survey managers and subject matter specialists (statisticians, economists, computer programmers, statistical assistants, clerks, etc.) of survey data editing practices. When possible, observations are made in this report that may aid in the evaluation of current editing practices and in the planning of future editing systems. In fact, this goal provided the subcommittee with the incentive to also investigate the methodology for software, technology, and research developments beyond the profile of current editing practices.

This Subcommittee on Data Editing in Federal Statistical Agencies was chaired by George Hanuschak of the National Agricultural Statistics Service, U.S. Department of Agriculture.
MEMBERS OF THE SUBCOMMITTEE ON DATA EDITING IN FEDERAL STATISTICAL AGENCIES

George Hanuschak (Chair), National Agricultural Statistics Service
Yahia Ahmed, Internal Revenue Service
Laura Bauer, Federal Reserve Board
Charles Day, Internal Revenue Service
Maria Gonzalez, Office of Management and Budget
Brian Greenberg, Bureau of the Census
Anne Hafner, National Center for Education Statistics
Gerry Hendershot, National Center for Health Statistics
Rita Hohenbrink, National Agricultural Statistics Service
Renee Miller, Energy Information Administration
Tom Petkunas, Bureau of the Census
David Pierce, Federal Reserve Board
Mark Pierzchala, National Agricultural Statistics Service
Marybeth Tschetter, Bureau of Labor Statistics
Paula Weir, Energy Information Administration

ACKNOWLEDGMENTS

This report represents an intensive voluntary effort on the part of dedicated subcommittee members and outside reviewers over an eighteen-month period. It is truly a collective effort on the part of the subcommittee members, who worked very well as a team. While maintaining their full-time Federal job responsibilities, the fifteen subcommittee members worked diligently on this challenging mission.

The subcommittee expresses its appreciation to Cathy Mazur of the National Agricultural Statistics Service, U.S. Department of Agriculture, for her timely summarization of the current editing practices survey, and to Dale Bodzer and Howard Magnus of the Energy Information Administration, Cathy Cotton of Statistics Canada, and John Monaco of the U.S. Bureau of the Census for demonstrating editing software packages to the subcommittee. Jelke Bethlehem of the Netherlands Central Bureau of Statistics and John Kovar of Statistics Canada provided substantial aid to the subcommittee by answering numerous questions about editing systems and providing software systems for subcommittee review.
The subcommittee extends its thanks to David Pierce, Federal Reserve Board, Terry Ireland, National Security Agency, and Fritz Scheuren, Internal Revenue Service, of the Federal Committee on Statistical Methodology for their reviews of the report. The subcommittee extends its appreciation to Maria Gonzalez, Chair of the Federal Committee on Statistical Methodology, for her guidance, encouragement, and advice throughout the eighteen months. Last but not least, the subcommittee extends its appreciation to Sherri Finks, Amy Curkendall, and Jennifer Kotch of the National Agricultural Statistics Service, who diligently and patiently did the fine word processing using a desktop publishing package under rather severe time constraints.

TABLE OF CONTENTS

Chapter I. EXECUTIVE SUMMARY
  A. Introduction
  B. Key Findings
  C. Recommendations
  D. Implementation of Recommendations
  E. Structure of Report

Chapter II. BACKGROUND
  A. Scope, Audience and Objectives
  B. Subcommittee Approach to Accomplishing Mission
  C. Subcommittee Work Groups
  D. Practices and Issues in Editing

Chapter III. CURRENT EDITING PRACTICES IN FEDERAL STATISTICAL AGENCIES
  A. Profile on Editing Practices
  B. Case Studies

Chapter IV. EDITING SOFTWARE
  A. Introduction
  B. Software Improving Quality and Productivity
  C. Descriptions of Editing Software

Chapter V. RESEARCH ON EDITING
  A. Introduction
  B. Areas of Edit Research
  C. Editing Research in Other Countries
  D. Case Studies
  E. Summary
  F. Bibliography

APPENDIX A: Results of Editing Practices Profile From Questionnaire Responses
APPENDIX B: Case Studies
APPENDIX C: Checklist of Functions of Editing Software Systems
APPENDIX D: Annotated Bibliography of Articles on Editing
APPENDIX E: Glossary of Terms

CHAPTER I
EXECUTIVE SUMMARY

A.
INTRODUCTION

The Subcommittee on Data Editing in Federal Statistical Agencies was established by the Federal Committee on Statistical Methodology in November 1988 to document, profile, and discuss data editing practices in Federal surveys. The subcommittee had the following mission statement:

The objective is to determine how data editing is currently being done in Federal statistical agencies, recognize areas that may need attention and, if appropriate, to recommend any potential improvements for the editing process.

To accomplish its mission, the subcommittee first addressed the definition of data editing: what is it? No universal definition of survey data editing exists. The following working definition of editing was developed and adopted by the subcommittee:

Procedure(s) designed and used for detecting erroneous and/or questionable survey data (survey response data or identification type data) with the goal of correcting (manually and/or via electronic means) as much of the erroneous data (not necessarily all of the questioned data) as possible, usually prior to data imputation and summary procedures.

Data editing can be seen as a data quality improvement tool by which erroneous or highly suspect data are found and, if necessary, corrected. The subcommittee members realize that the boundaries of editing (where it begins and ends) are not absolute. The subcommittee was instructed by the Federal Committee on Statistical Methodology to concentrate on the front end of the editing process and not to duplicate the extensive work on imputation done by the Panel on Incomplete Data, Incomplete Data in Sample Surveys, Volumes I, II and III, Academic Press, 1983. Therefore, the rest of the document is based on the subcommittee's working definition of editing.
To gather the necessary information on Federal survey editing practices, the subcommittee used a combination of information-gathering techniques: a profile on editing practices using a subcommittee-prepared questionnaire (6 pages and 41 questions) for 117 Federal surveys in 14 agencies, an extensive literature search and review, case studies of 8 Federal surveys, editing system software evaluation for several recently developed generalized editing systems, and a search and review of current research efforts on the editing process, including a few case studies. These information-gathering techniques contributed to the development of an extensive editing information base for this report.

In summary, data editing is considered to be an important component of the work of Federal statistical agencies. Key findings from the survey on editing practices conducted by the subcommittee follow, along with recommendations. In some cases, detailed discussions of recommendations are handled in the text of the document. A glossary of terms used in this report is in Appendix E.

B. KEY FINDINGS

Key findings from the profile on editing practices follow.

- About 60 percent of Federal survey managers reported that they refer all data that fail edit checks (not only critical errors or severe outliers) to subject matter specialists or editors (economists, statisticians, clerks, etc.) for review and resolution. The role of the subject matter specialist is often valued as somewhat indispensable, as their expert knowledge and judgment are key ingredients in the survey editing process. Two key questions are:
  1. What is the cost/benefit relationship of this extensive manual review?
  2. How consistent are the actions of different subject matter specialists?
- Editing costs are reported to have a median value of about 35 percent of the total survey cost; however, the mode is 10 percent and the distribution is quite skewed to the right.
Administrative records systems, such as those used by the Federal Reserve Board and the Internal Revenue Service, are in the right-hand tail of the cost distribution. The reason is that, compared to censuses or sample surveys, data collection costs are a much smaller portion of total costs. This finding points to the importance of improvements in the cost efficiency of the editing process as a target for all agencies in the next decade.
- Federal survey managers report that in over 80 percent of their surveys there is good internal documentation of the editing system. Federal survey managers appear to take the editing process very seriously and recognize its importance in the overall survey performance. Under tight resource constraints, the level of documentation on the editing process is impressive.
- There is a strong desire by many of those involved in the editing process to combine or replace "batch-oriented" systems with "on-line" or quick-turnaround systems.
- Another desire expressed by respondents is for continued research, development, and implementation of more efficient, well-targeted, consistent, and accurate methods to detect potentially erroneous survey data.
- Integration of survey tasks (e.g., computer-assisted data collection, data entry, data editing, imputation, and summary) is important for improving data quality and productivity in survey processing.
- Several major developments in generalized editing software are taking place in domestic and international statistics agencies. Three major ones covered in some detail in this report (Chapter IV) are: Statistics Canada's Generalized Edit and Imputation System (GEIS); the Netherlands Central Bureau of Statistics' Blaise system (named after Blaise Pascal, the well-known mathematician of the 1600's); and the U.S. Census Bureau's Structured Program for Economic Edit and Referral (SPEER).
- Some agencies are currently conducting research on the editing process, and several case studies are presented in Chapter V and Appendix B.
- There is substantial potential for several related technology and data systems developments to contribute to more efficient and consistent editing systems in the next decade. These include data base systems, expert systems, electronic data collection such as computer-assisted telephone interviewing (CATI), computer-assisted personal interviewing (CAPI), and touchtone surveys, major generalized edit systems, and artificial intelligence systems.
- If the cost of data processing continues to drop at its current rapid pace, the analysis of multivariate statistical relationships among survey variables can be more widely used for editing (and imputation) if appropriate.
- The major challenge in software development lies in the reconciliation of two goals: the increased use of computers for certain tasks and the more intelligent use of human expertise.

C. RECOMMENDATIONS

Based upon the findings in the subcommittee's editing information base, we present the following recommendations. Federal survey managers and administrators should:

- Evaluate and examine the cost efficiency, timeliness, productivity, repeatability, statistical defensibility, and accuracy of their current editing practices versus alternative systems. The checklist of editing software features provided in Appendix C and the remainder of this report can aid such an effort. Such an effort can also be part of a Total Quality Management system for surveys and agencies.
- Review and examine the implications, for their editing situation, of important developments in data processing such as powerful microcomputers and scientific workstations, local area networks (LAN's), and data base software that provide electronic communication links from microcomputers and LAN's to mainframe computers.
- Follow the research and applications developments in the use of CATI, CAPI, touchtone, and other electronic means of data capture with potential for improving the editing and/or data processing flow.
- Continue to share information on research and development and software systems efforts in the editing process with other Federal and international statistical agencies.
- Stay attuned to research and developments in the use of expert systems and/or artificial intelligence software for survey data editing.
- Evaluate both the role and the effectiveness of editing in reducing nonsampling errors for their surveys.
- Evaluate the effect of extensive manual review on the resulting estimates.
- Explore development of a catalog of situations in which various techniques work well or not; e.g., research has indicated that exponential smoothing does not work well when data are erratic.
- Recognize the value of editing research and place a high priority on devoting resources to their own research, to monitoring developments in data editing at other agencies, and to implementing improvements when they are found to be desirable.
- Explore integration of functions in a survey; e.g., data entry, data editing, and computer-assisted data collection.
- Give attention to the future roles of the subject matter specialist and the methodologist and to the tools and consistency with which they perform their jobs.

D. IMPLEMENTATION OF RECOMMENDATIONS

An interagency working group should be formed to continue the mission of the subcommittee and work on the implementation of the subcommittee's recommendations.

E. STRUCTURE OF REPORT

This report consists of the executive summary, followed by Chapter II, which is introductory, Chapter III on the editing profile and the case studies, Chapter IV on the role of software in editing, and Chapter V on the role and status of research in editing. Supporting appendices include:

A.
Results of Editing Practices Profile From Questionnaire Responses
B. Case Studies
C. Checklist of Functions of Editing Software Systems
D. Annotated Bibliography of Articles on Editing
E. Glossary of Terms

CHAPTER II
BACKGROUND

A. SCOPE, AUDIENCE, AND OBJECTIVES

The Subcommittee on Data Editing in Federal Statistical Agencies was established by the Federal Committee on Statistical Methodology in November 1988 to document, profile, and discuss data editing practices for Federal surveys. The subcommittee had the following mission statement:

The objective is to determine how data editing is currently being done in Federal statistical agencies, recognize areas that may need attention and, if appropriate, to recommend any potential improvements for the editing process. The project will obtain information on current data editing practices. The information on editing will include the role of subject matter specialists; hardware, software, and data base environment; new technologies of data collection (and editing) such as CATI and CAPI; current research efforts in the agencies; and some recent developments in generalized editing systems from the U.S. Census Bureau, Statistics Canada, and the Netherlands Central Bureau of Statistics.

B. SUBCOMMITTEE APPROACH TO ACCOMPLISHING MISSION

A number of paths were followed by the subcommittee in accomplishing its goals as set forth in the preceding mission statement, including developing a questionnaire on survey editing practices, assembling several detailed case studies, investigating alternative editing systems and software, exploring research needs and practices, and compiling an annotated bibliography of literature on editing.

The editing profile questionnaire (6 pages and 41 questions) was developed and administered to 117 Federal surveys covering 14 different agencies. The 117 surveys were selected by subcommittee members and thus were not a scientific sample of all Federal surveys.
The subcommittee members felt that the 117 surveys represented a broad coverage of agencies and types of surveys or censuses that might have different editing circumstances or situations. The two major purposes of the editing questionnaire were to provide an adequate profile of current editing practices and to aid in developing a typology of surveys to be used for selecting case studies. The typology is a classification of surveys according to a number of criteria such as frequency of the survey, number of respondents, degree of automation and judgmental review of the edits, whether respondents are contacted regarding questionable items, whether historic data are used in the editing of current data, and so forth. This information is of general interest, and was useful to the subcommittee in selecting a representative group of surveys to serve as case studies of editing practices. The questionnaire and a tabular summary of the results are presented for the reader in Appendix A.

Chapter III of this report contains the analysis of the questionnaire and a description of the case studies. For each different editing environment, a case study was conducted. The case studies provide more detailed information for the selected cases than the editing questionnaire alone. The case studies are published in two forms (long and short) in Appendix B to give descriptions of the varied editing practices and situations.

Another important area of the subcommittee's work was the investigation and evaluation of some recently developed generalized editing systems and software packages. Several major editing systems were studied, and a profile of features was developed and is presented in Chapter IV. The editing systems reviewed were the U.S. Census Bureau's Structured Program for Economic Edit and Referral (SPEER), the Netherlands Central Bureau of Statistics' Blaise system, and Statistics Canada's Generalized Edit and Imputation System (GEIS).
Also, several recent application-specific editing systems at the U.S. Department of Energy and the Bureau of Labor Statistics were reviewed. These systems were developed under different conditions and applications, so direct comparisons are not feasible. However, the subcommittee believes that a description of these systems' features and capabilities is of substantial value to Federal statistical agencies. Appendix C gives the reader a detailed checklist of editing software system features. This checklist will be a valuable tool to editing system developers.

The remaining major activity of the subcommittee was a review of historic and ongoing research. This review consisted of a literature search that enabled the subcommittee to develop an annotated bibliography, presented in Appendix D. This appendix provides a valuable source of information on editing literature. In addition, case studies of ongoing or recent editing research were conducted. Information about editing research and research needs on the editing process was also gleaned from the editing profile. A more detailed description of editing research is provided in Chapter V. A short glossary of editing terms is given in Appendix E.

C. SUBCOMMITTEE WORK GROUPS

To effectively accomplish its mission, the subcommittee was divided into four major groups.

I. Editing Profile Group: Charles Day (Leader), Yahia Ahmed, George Hanuschak, Rita Hohenbrink, Renee Miller
II. Case Studies Group: Anne Hafner (Leader), Yahia Ahmed
III. Editing Software Group: Mark Pierzchala (Leader), Charles Day, Gerry Hendershot, Rita Hohenbrink, Tom Petkunas, Marybeth Tschetter
IV. Editing Research Group: Brian Greenberg (Leader), Yahia Ahmed, Laura Bauer, Renee Miller, David Pierce, Paula Weir

D. PRACTICES AND ISSUES IN EDITING

Description of the Process

Pre-survey editing tasks include the writing and evaluation of editing programs, evaluation of the edits themselves, and writing instructions for the inspection of questionnaires by interviewers,
field supervisors, clerks, and subject matter specialists. These activities influence how well editing is done, as well as how many resources will be expended on editing once data are collected.

During the survey itself, editing may occur in many ways and at many stages, from data collection to publication, and even after publication in some cases. In paper and pencil interviewing, the interviewer is the first to inspect the questionnaire for errors. Optimally, this should be done immediately after the interview so that the respondent can easily be contacted to clarify responses. If questionnaires are channeled through a supervisor, then a second inspection can be done. Not only can recontacts be made shortly after the interview, but the supervisor can provide feedback to the interviewers on how they are doing.

Once the questionnaires reach an office, they may be edited manually by clerks, subject matter specialists, or both. In some organizations, this manual edit may include a significant amount of coding. It can also include a visual check that answers are given in correct units and that routing instructions have been followed correctly, as well as consideration of notes written by either the respondent or the enumerator.

In most cases a computer edit is then performed. Error signals (flags) and messages are presented to a reviewer either on printouts or on a screen in an interactive session. If the program output is printed, then the review tends to be cyclical, as the computer must then re-edit, in batch, all of the changes. If the output is on a screen (microcomputers or terminals hooked to a larger computer), then questionnaires are usually processed one at a time until they pass review.

All of the above editing activities relate to reviewing data at the record (or questionnaire) level. This is often referred to as micro-editing. Editing of data at an aggregate level will then take place even if it is not explicitly recognized as such.
This macro-editing may be by cells in an economic table, or by some other aggregation such as a stratum. The cells in a table may be edited against themselves (one can visualize some sort of super-questionnaire) or against similarly defined cells from previous surveys. This macro-editing may be done by hand or through specially designed software. Depending on the degree of automation, it may or may not be possible to trace inconsistencies at the aggregate level to the offending questionnaires. If the macro-editing program can trace inconsistencies back to the micro-level, then macro-editing can in theory be used to direct the micro-editing.

If Computer Assisted Data Collection is used, then much of the editing process is formally introduced and enforced at the time of data collection. Not only are most major errors corrected at the time of the interview, but the subject matter specialists may have greater confidence in the data after collection and be more likely to let the data pass without change. Thus, Computer Assisted Data Collection has enormous potential for reducing the costs of data editing after data collection. By introducing edits into data collection, it will also improve the data themselves. Currently, Computer Assisted Data Collection is becoming more common in the survey world. However, for the foreseeable future, many surveys will still be collected by mail or by paper and pencil interviewing. In any case, the need for editing after data collection will never be totally eliminated.

Issues in Editing

Costs and Benefits

The importance of data editing in reducing nonsampling errors has been questioned. Granquist (1984) questions whether the editing process can essentially improve data quality after data are collected. He states that there are three purposes of editing: to give more detailed information about the quality of the survey, to provide basic data for the improvement of the survey,
and to tidy up the data so that further processing can be done. Further, Granquist considers the sources and types of survey errors, and questions the ability of most generalized editing systems to address all kinds of errors, including systematic response errors.

If data are considered to have a timely quality, that is, if the value of data deteriorates as time goes along, then editing can reduce the value of the data. Pullum, Harpham, and Ozsever (1986) describe a situation where the editing of survey data had no discernible effect on the estimates other than to delay their release by about one year.

One common question that many organizations have is when to declare that editing is finished. "Over-editing" of data has long been a concern. In order to make sure that all errors are flagged, many unimportant error signals (flags) are often generated. These extra signals not only take time to examine but also distract the reviewer from important problems. These extra signals are generated because of the way that error limits are set.

One way that researchers are trying to reduce the number of error signals, while at the same time ensuring that the important cases are flagged, is through the development of statistical editing techniques. For example, time series techniques can be used in repetitive surveys on a record-by-record basis. Alternatively, cross-record statistical editing can be done on either a univariate or multivariate basis. This may include the graphical inspection of data.

Data editing often requires considerable resources in federally conducted surveys, both in terms of staff time and dollars. These expenditures are themselves reason enough to re-evaluate the editing process. In addition, there are often external economic incentives in the form of reduced budgets for statistical agencies.
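As a rough illustration of the cross-record univariate statistical editing mentioned above, the following Python sketch flags values that lie far from the median of their peers instead of relying on fixed error limits. It is not drawn from any system reviewed by the subcommittee; the function name, the sample values, and the cutoff of 3.5 are assumptions chosen only for illustration.

```python
from statistics import median


def robust_outlier_flags(values, cutoff=3.5):
    """Return the positions of values that look suspicious.

    A value is flagged when its distance from the median, scaled by the
    median absolute deviation (MAD), exceeds `cutoff`. The constant 0.6745
    makes the score comparable to a z-score under normality. Both the
    method and the cutoff are illustrative assumptions.
    """
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)
    if mad == 0:
        # Degenerate case: more than half the values are identical,
        # so anything different from the median is flagged.
        return [i for i, v in enumerate(values) if v != center]
    return [
        i for i, d in enumerate(deviations)
        if 0.6745 * d / mad > cutoff
    ]


# Hypothetical reported values for one item within one stratum.
reported = [100, 102, 98, 101, 99, 1000, 97]
flagged = robust_outlier_flags(reported)
```

Because the median and MAD are little affected by a few extreme values, a sketch like this tends to signal only the records that stand out sharply, which is the stated aim of reducing unimportant error signals. A record-by-record time series edit would follow the same pattern, comparing each record with its own history rather than with other records.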
The combination of rapidly decreasing computing costs, rapidly increasing computing capabilities, and steady or increasing staff costs is changing the economics of the process vis-a-vis the proper mix of human and computer processing.

Another cost that receives little consideration is the increase in respondent burden. In some surveys, edits are so tightly set that few if any records pass the edits. As a result, respondents are called back, some many times, in order to clear up suspicious situations.

There is also an opportunity cost to editing. Any time spent in editing is time that is not being used for other activities, some of which may have greater potential for reducing nonsampling errors.

Statistical and Quality Concerns

Statistical considerations will impact the development of new editing systems and may even lead to their development. Defensibility of the process is a concern because data are changed after data collection and before summary. The ability of an agency to defend itself from criticism is enhanced by implementing methodologically sound procedures, by capturing data electronically as they are reported, and by auditing all changes made during the edit. The effect of editing can then in principle be known, and feedback for the improvement of the survey can be given.

Conceptually, the edit process should be repeatable (or reproducible). This means that the same data run through a system twice should lead to the same results. Editing should not change survey results in an unpredictable manner.

Integration of Survey Tasks

Integration of survey tasks is important for improving both data quality and productivity in survey processing. Consider the functions of Computer Assisted Data Collection, data entry, and data editing. By integrating these functions, data quality can be improved by injecting the editing function into collection, and also by reducing transcription errors by eliminating the need for in-office data entry.
Given the proper software, pre-survey activities may be done more productively by reducing the need for multiple specification of the data. For example, if a particular variable can take only the values of 1, 2, and 3, then the program for each of the three functions should have specified this limitation. Time is saved, and the potential for inconsistencies is reduced, if all three programs derive from one specification.

Usually, routing instructions and edits are common between a data collection instrument and an editing program. If both functions derive from the same program, then double programming can be eliminated. It is also easier to make the differences between the collection and editing instruments more explicit and purposeful.

Constraints

Constraints (other than economic constraints already considered) on the organization or on the survey itself often adversely affect the quality of editing. Some large Federal surveys (e.g., in the National Agricultural Statistics Service or in the Bureau of Labor Statistics) are conducted under extremely tight deadlines. Data are collected, processed, analyzed, and published under a rigid schedule. Detecting all of the major problems in such a large data set in the time allowed becomes enormously difficult.

Computer hardware and software have their own constraints. For example, access to a mainframe may be limited and editors may have to review error signals on paper printouts because of costs. Software may not be easy to use, and it may be extremely difficult to coordinate disparate programs.

Data editors may not have sufficient knowledge of the subject matter or survey procedures, nor sufficient training. High turnover of editors may be a problem in some surveys. The challenge then is in providing an inadequately prepared staff with enough effectively presented information to allow the job to be done correctly.
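The single-specification idea described under Integration of Survey Tasks can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the design of Blaise, GEIS, SPEER, or any other system named in this report: the variable name "tenure", the class, and both functions are hypothetical. The point is only that the collection-time check and the post-collection edit both read the same declaration of allowed values (here 1, 2, and 3), so the limitation is specified once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSpec:
    """One declaration of a survey variable, shared by every function."""
    name: str
    allowed: frozenset


# Hypothetical specification: the variable may take only 1, 2, or 3.
SPEC = {
    "tenure": VariableSpec("tenure", frozenset({1, 2, 3})),
}


def accept_at_collection(name, value, spec=SPEC):
    """Collection or data entry: refuse an invalid value immediately."""
    return value in spec[name].allowed


def edit_record(record, spec=SPEC):
    """Post-collection edit: list (variable, value) pairs that fail.

    A missing variable is reported with the value None.
    """
    return [
        (name, record.get(name))
        for name, var in spec.items()
        if record.get(name) not in var.allowed
    ]


# Usage with hypothetical data.
ok_at_entry = accept_at_collection("tenure", 2)
flags = edit_record({"tenure": 4})
```

If the allowed values change, only SPEC is edited, and the collection and editing functions stay consistent with each other by construction.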
There may be resistance to change or a questioning of its need in the implementation of new editing systems and methodologies. People may wonder how their job will be changed or if it will be eliminated. Some problems may be easy to identify (e.g., the amount of resources consumed is too large) but others may require special studies (e.g., how much is spent on each task and how much do we get from it?). In considering either the development of a new editing system or the purchase of one, it is often difficult to know which editing system features are necessary, and their relative performance. Evaluation of editing software is difficult and time consuming. Another consideration is who should be on the evaluation team.

CHAPTER III

CURRENT EDITING PRACTICES

A. PROFILE ON EDITING PRACTICES

To obtain an adequate profile of current editing practices, the subcommittee developed an editing profile questionnaire which was administered to 117 Federal surveys covering 14 different agencies. The 117 surveys were selected by subcommittee members and thus were not a scientific sample of all Federal surveys; however, the subcommittee members felt that the 117 surveys represented a broad coverage of agencies and types of surveys or censuses that might have different editing situations. This section describes how the questionnaire was designed and administered, and summarizes the findings. While this section focuses on the highlights of the profile, tallies of responses to all of the questions appear in Appendix A.

Designing and Administering the Questionnaire

The subcommittee designed a six-page questionnaire containing general descriptive questions about a particular survey as well as specific questions on editing practices. (See Appendix A for a copy of the questionnaire.) Each subcommittee member pretested the editing questionnaire by answering the questions for a survey with which each was familiar.
Although a scientific sample was not drawn, the goal was to select a group of surveys that would be representative of the surveys conducted by Federal statistical agencies. Each subcommittee member sought to obtain information for ten to twenty surveys that represented their agency's surveys. In addition, they obtained information from several agencies not represented on the subcommittee. Some subcommittee members reviewed the completed questionnaires for consistency by contacting the agency respondents prior to submitting them. A small number of consistency edit checks were performed for the questionnaires; however, the editing was limited.

Characteristics of Surveys in Editing Practices Study

Illustrating the wide range of surveys in the study, large and small surveys were represented. The smallest survey in the study had 22 units, while the largest had about 1 million units. As shown in Figure 1: About three-fourths of the surveys in the study are sample surveys. Various frequencies of collection are represented (annual, quarterly, monthly, and weekly). About three-quarters of the surveys are filed by establishments, institutions, farms and other entities, and the remaining quarter by households or individuals. Traditional means of data collection such as mail, personal, and telephone interviews were the most common. Only a small proportion used computer assisted telephone interview (CATI), and no survey respondents reported using computer-assisted personal interview (CAPI) as their primary method of data collection, although a few did report using CAPI as a secondary method. About sixteen percent of the surveys in the study use administrative records. The remainder of this section discusses editing practices.
As part of the analysis, data on editing practices were cross-classified by the characteristics just discussed (sample versus census, frequency of collection, type of unit surveyed and mode of collection) to determine whether editing practices varied by these characteristics. If these characteristics do in fact affect editing practices, and the surveys in the study are not representative of all surveys on these characteristics, then the aggregated results of this study would not be applicable to all surveys. Results are presented for all of the surveys in this study, but situations in which editing practices differed greatly are highlighted.

Editing Practices

The questionnaire covered the following areas with respect to editing practices: cost of editing, when and how editing occurs, type of edits used, degree of satisfaction with current system, and future applications.

Cost of Editing

The survey respondents were instructed to include all aspects of editing in their cost figures, such as edits made at data entry, clerical work, computer time, design, testing, monitoring, analyst review, call-backs, and summary table review. However, all of the following information on editing costs should be read with the caveat that only about two-fifths of the respondents reported that information on the cost of data editing was available. The subcommittee does not claim that these data are totally free of nonsampling errors. Therefore, all conclusions are subject to this constraint. Editing costs representing at least 20 percent of the total cost of the survey were reported for four-fifths of these surveys. A similar pattern was observed for the surveys for which cost information was not available. Of the 73 surveys where no cost data were available, cost estimates were provided for about two-thirds (49 surveys). About three-quarters of these surveys had editing costs representing at least 20 percent of the total cost of the survey. The median editing cost as a percentage of the total survey cost was 35 percent.
While an attempt was made in the instructions to the survey to standardize the activities to be included in the cost of editing (see question 20 in Appendix A), record-keeping practices vary. As a result, estimates may not represent the same activities from survey to survey. However, the data still proved useful in determining the survey characteristics that most affect the cost of editing. Editing costs as a percentage of the total survey cost varied greatly by the type of survey. Demographic surveys (surveys of individuals and households) had a far lower median than economic surveys (surveys of farms, establishments or firms, institutions, and others). The median for demographic surveys was 20 percent compared with 40 percent for economic surveys. Among the economic surveys, the surveys that used administrative records had the highest median percentage of cost, 60 percent. This high percentage does not necessarily indicate a high absolute editing cost; it could indicate a low total survey cost because no new survey data are collected. As discussed in the next section, these surveys have more extensive involvement of subject matter analysts than demographic surveys have.

Overall, surveys in which all error correction was done by clerks or analysts were more likely to have editing costs that represent over 40 percent of the total survey cost than were surveys in which only unusual cases were referred to analysts. Almost one-half of the former group had editing costs in the category "40 percent or greater" compared with one-third of the latter group. Reversing the perspective, only 6 percent of the former group (all error correction by clerks or analysts) had editing costs in the category "under 10 percent" compared with about one-third of the latter group (unusual cases by analysts).

When and How Editing and Follow-up Occur

For about two-thirds of the 117 surveys studied, the majority of data editing takes place after data entry.
Subject matter analysts play a large role in almost all of the surveys. In about three-quarters of the surveys, subject matter analysts review all unusual or large cases after automated or clerical editing. Only seven surveys have little or no intervention from subject matter specialists. Of these, only four are completely automated (i.e., edit checking and error correction are done without referral to analysts). Surveys of farms, establishments and institutions tend to have heavier involvement from subject matter analysts than surveys of individuals and households (i.e., higher proportions of the study respondents report that all data editing is done by subject matter analysts for these surveys than for the others). This could explain the relatively higher editing costs as a percentage of the total survey cost reported for the surveys of farms, establishments and institutions. The degree of automation varies considerably among the surveys in the study. About three-fifths of the survey managers note that automated edit checking is done, but error correction is performed by clerks or analysts (Figure 2). In about 62 percent of the cases, there is no analysis of the effect of editing practices on the estimates produced.

Types of Edits

Almost all the surveys in the study use validation editing, which detects inconsistent data within a record. A large proportion (83 percent) also use macro-editing, where aggregated data are examined to detect inconsistencies. In addition to these two types of edits, 57 percent of the survey managers report using other edits. In response to an open-ended request to describe "other" edits, "range edits" were mentioned most frequently, followed by procedures that used historical data. "Ratio edits" were another common response. These three groups may not be distinct. Because responses were not detailed, it was difficult to determine exactly what these edits involved.
Other types of edits and analyses mentioned include: comparisons with other surveys, comparing the current value to a value estimated by regression analysis, using interquartile measures, and listing the ten highest and ten lowest values before and after expansion factors were applied.

Satisfaction With Current Edit System

The study respondents were split in the level of satisfaction with their current system. About 47 percent were satisfied, while about half thought that at least minor changes were required. A small proportion said it was not possible to determine what changes were required at this point (Figure 4).

Future Applications

Among those expressing a need for change, an on-line system topped the list of desired improvements. Other changes that were mentioned frequently (as a result of an open-ended question) included:

- The use of prior years' data to test the current year,
- More statistical edits, and
- More sophisticated and more extensive macro and validation editing.

An audit trail, more automation in general, and a user-friendly system were also mentioned several times. In addition, the following enhancements were mentioned: automated error correction, incorporation of imputation into the editing package, evaluation of the effect of data editing, reduction of the number of edit flags to follow up, incorporation of information on auxiliary variables, multivariate editing, use of an expert system approach for criteria which require judgment, and editing using micro-computers. In summary, our questionnaire revealed wide diversity in current editing practices and in user satisfaction with them. To present more of an in-depth picture, we now describe the development of the case studies.

B. CASE STUDIES

Federal government surveys, censuses, and administrative records systems create a broad range of data editing situations.
In addition to the statistical profile on editing practices, it was felt that a further description of several of the surveys in case study form would reveal in greater detail the complexity of the different editing practice situations in operation. A typology of editing situations was developed by the subcommittee to be used for selecting case studies (Figure 5). The typology was developed through extensive subcommittee discussion and from analysis of responses to the editing practices questionnaire. The grouping variables included in Figure 5 are:

1. Census or sample survey approach
2. Longitudinal or cross sectional approach
3. Frequency of census or sample survey
4. Size of census or sample survey
5. Continuous and/or categorical data
6. Administrative records used (Yes or No)
7. Mode(s) of data collection used (mail, telephone, CATI, CAPI, touchtone, personal, etc.)
8. Use of historical data in the edit process (Yes or No)

There were also other grouping variables that were considered and then discarded, for example, the level of clerical knowledge of subject matter when editing. The major reason for elimination was the subjectivity involved in measuring those variables. In order to represent the range of different editing situations, the subcommittee picked eight case studies that covered the different values of the eight grouping variables. Four were chosen to develop brief case studies which represent different survey situations and are presented in short abstract form in Appendix B. These are:

- BLS: CPI: Commodities and Services
- IRS: US Corporation Income Tax Returns
- NCES: National Education Longitudinal Study of 1988
- Federal Reserve Board: Edited Deposits Data System

The first paragraph of each abstract describes the environment in which the survey takes place, including type of survey and size. The second paragraph includes a brief description of editing practices used.
Four additional surveys are described in greater detail (in Appendix B) to give the reader a flavor of the range of editing practices and situations. Surveys covered are:

- NCHS: National Health Interview Survey;
- Census: Enterprise Summary Report;
- NASS: Quarterly Agricultural Survey; and
- EIA: Monthly and Weekly Refinery Reports.

The first section of the in-depth case studies describes the environment in which the survey takes place. The second section describes editing practices used, including data processing environment, audit trail, micro, macro and statistical editing, prioritizing of edits, imputation process, standards, costs, role of subject matter specialists, measures of variation, and current and future research. The wide variation in editing situations makes it impossible to recommend any one editing system or methodology for all Federal statistical agencies, surveys, administrative records systems, or censuses.

CHAPTER IV

EDITING SOFTWARE

A. INTRODUCTION

For most surveys, large parts of the editing process are carried out through the use of computer systems. The task of the Software Subgroup has been to investigate software that in some way incorporates new methodologies, has new ways of presenting data, operates in recently developed hardware environments, or integrates editing with other functions. In order to fulfill this charge, the Subgroup has evaluated or been given demonstrations of new editing software. In addition, the Subgroup has developed an editing software evaluation checklist that appears in Appendix C. This checklist contains possible functions and attributes of editing software, which would be useful for an organization to use when evaluating editing software. Extremely technical jargon can be associated with new editing systems, and new approaches to editing may not be familiar to the reader.
The purpose of section B is to explain these approaches and their associated terminology as well as to discuss briefly the role of editing in assuring data quality. A distinction must be made between generalized systems and software meant for one or a few surveys. The former is meant to be used for a variety of surveys. Usually there is an institutional commitment to spend staff time and money over several years to develop the system. It is hoped that the investment will be more than recaptured after the system is developed through the reduction in resources spent on editing itself and in the elimination of duplication of effort in preparing editing programs. Some software programs have been developed that address specific problems in a particular survey. While the ideas inherent in this software may be of general interest, it may not be possible to apply the software directly to other surveys. Section C describes three generalized systems in some detail, and then briefly describes other systems and software. These three systems have been used or evaluated by Subgroup members in their own surveys. New and exciting statistical methodology is also improving the editing process. This includes developments in detecting outliers, aggregate level data editing, imputation strategy, and statistical quality control of the process itself. These activities are covered more fully in Chapter V. The implementation of these activities, however, requires that the techniques be encoded into a computer program or system.

B. SOFTWARE IMPROVING QUALITY AND PRODUCTIVITY

Reasons for the Development of New Editing Software

Traditional editing systems do not fully utilize the talents or expertise of subject matter specialists. Much of their time may be spent in dealing with unimportant or spurious error signals and in coping with system shortcomings. As a result, the specialist has less time to deal with important problems.
In addition, editing systems may be able to give feedback on the survey itself. For example, a pattern of edit failures may suggest misunderstandings by the respondent or interviewer. If this is recognized, then the expertise of the specialist may then be used to improve the survey itself. Labor costs are a large part of the editing costs and are either steady or increasing, whereas the cost of computing is decreasing. In order to justify the heavy reliance on people in editing, their productivity will have to be improved through the use of more powerful tools. However, even if productivity is improved, different people may do different things in similar situations. If so, this makes the process less repeatable (reproducible) and more subject to criticism. When work is done on paper, it is hard to track, and it is impossible to estimate the effect of editing actions on estimates. Finally, some tasks are beyond the capability of human editors. For example, it may be impossible for a person to maintain the multivariate frequency structure of the data when making changes. These reasons and several others are commonly given as explanations for the increased use of computer software to improve the editing process. It is in the reconciliation of these two goals (the increased use of computers for some tasks and the more intelligent use of human expertise) that the major challenge in software development lies. There will always be a role for people, but it will be modified. One positive feature of new editing software is that it can often improve the quality of the editing process and productivity at the same time.

Ways that Productivity Can Be Improved

One way to improve productivity is to break the constraints imposed by computer systems themselves. The use of mainframe systems for editing data is widespread. In some cases, however, an editor may not use the system directly.
For example, error signals may be presented on paper printouts, and changes entered by data typists. Processing costs may dictate that editing jobs are run at low priority, overnight, or even less frequently. The effect of the changes made by the editor may not be immediately known; thus, paper forms may be filed, taken from files, and refiled several times. The proliferation of microcomputers promises to eliminate many of these bottlenecks, while at the same time it creates some challenges in the process. The editor will have direct access to the computer, and will be able to prioritize its use. Once the microcomputer is acquired, user fees are eliminated; thus resource-intensive programs such as interactive editing can be employed, provided the microcomputers are fast enough. Moving from a centralized environment (i.e., the mainframe) to a decentralized environment (i.e., microcomputers) will present challenges of control and consistency. In processing a large survey on two or more microcomputers, communications will be necessary. This will best be done by connecting them into a Local Area Network (LAN). New systems may reduce or eliminate some editing tasks. For example, where data are edited in batch and error signals are presented on printouts, a manual edit of the questionnaires before the machine edit may be a practical necessity. Editing data and error messages on a printout can be a hard, unsatisfactory chore because of the volume of paper and the static and sometimes incomplete presentation of data. The purpose of the manual edit in this situation is to reduce the number of machine-generated error signals. In an interactive environment, information can be efficiently presented and immediately processed. The penalty associated with machine-generated signals is greatly reduced. As a result, the preliminary manual edit may be eliminated. In addition, questionnaires are handled only once, further reducing filing and data entry tasks.
Productivity may be increased by reducing the need for editing after data are collected. Instruments for Computer Assisted Telephone Interviewing (CATI), Computer Assisted Personal Interviewing (CAPI), and on-site data entry and editing programs are gaining wider use. Routing instructions are automatically followed, and other edit failures are verified at the time of the interview. There may still be many error signals from suspicious edits; however, the analyst has more confidence in the data and is more likely to let them pass. There are two major ways that productivity can be improved in the programming of the editing instruments. The first is to provide a system that will handle all, or an important class, of the agency's editing needs. In this way the applications programmer need not worry about systems details. For example, in an interactive system, the programmer does not have to worry about how and where to flag edit failures, as this is already provided. The programmer only codes the edit specification itself. In addition, the end-user has to learn only one system when editing different surveys. The second is the elimination of multiple specification and programming of variables and edits. For example, if data are collected by CATI and edited with another system, then essentially the same edits will be programmed twice, possibly by two sets of people. If the system integrates several functions, e.g., data entry, data editing, and computer assisted data collection, then one program may be able to handle all of these tasks. This integration would also reduce time spent on data conversion from one system to another.

Systems that Take Editing and Imputation Actions

Some edit and imputation systems take actions usually reserved for people. They choose fields to be changed and then change them. The human element is not removed; rather, this expertise is incorporated into the system.
One way to incorporate expertise is to use the edits themselves to define a feasible region. This is the approach outlined in a famous article by Fellegi and Holt (1976). Edits that are explicitly written are used to generate implied edits. For example, if 100 < x/y < 200 and 3 < y/z < 4 are explicit edits, then an implied edit obtained algebraically is 300 < x/z < 800. Once all implied edits are generated, the set of complete edits is defined as the union of the explicit and implied edits. This complete set of edits is then used to determine a set of fields to be changed for every possible edit failure. This is called error localization. An essential aspect of this method is that changes are made to as few fields as possible, or alternatively, to the least reliable set of fields, which is determined by weights given to each field.

The analyst is given an opportunity to evaluate the explicit edits. This is done through the inspection of the implied edits and extremal records (the most extreme records that can pass through the edits without causing an edit failure). In inspecting the implied edits, it may be determined if the data are being constrained in an unintended way. In inspecting extremal records, the analyst is presented with combinations of the most extreme values possible that can pass the edits. The human editor has several ways to inject expertise into this kind of a system: (1) the specification of the edits; (2) the inspection of implied edits and extremal records and then the respecification of edits; (3) the weighting of variables according to their relative reliability. There are some constraints in systems that allow the computer to take editing actions. Fellegi and Holt systems cannot handle certain kinds of edits, notably nonlinear and conditional edits. Also, algorithms that can handle categorical data cannot handle continuous data and vice versa. Within these constraints (and others), most edits can be handled.
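The implied-edit arithmetic and the error localization step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the algorithm of any particular system: all variables are assumed positive, the edits are restricted to ratio edits, the names `implied_ratio_edit`, `localize`, `COMPLETE`, and `WEIGHTS` are hypothetical, and the search over field subsets is brute force. It relies on the property that, once the edit set is complete, any set of fields that touches every failed edit is a candidate set of fields to change.

```python
from itertools import combinations

def implied_ratio_edit(xy_bounds, yz_bounds):
    """Combine lo1 < x/y < hi1 and lo2 < y/z < hi2 (positive values assumed)
    into the implied edit lo1*lo2 < x/z < hi1*hi2."""
    (lo1, hi1), (lo2, hi2) = xy_bounds, yz_bounds
    return (lo1 * lo2, hi1 * hi2)

# Explicit edits from the text, keyed by (numerator, denominator).
EXPLICIT = {("x", "y"): (100, 200), ("y", "z"): (3, 4)}

# Complete set = explicit edits plus the implied edit on x/z.
COMPLETE = dict(EXPLICIT)
COMPLETE[("x", "z")] = implied_ratio_edit(EXPLICIT[("x", "y")], EXPLICIT[("y", "z")])

# Hypothetical reliability weights: a higher weight makes a field costlier to change.
WEIGHTS = {"x": 1.0, "y": 1.0, "z": 1.0}

def localize(record, complete_edits, weights):
    """Return the least-weight set of fields that touches every failed edit."""
    failed = [pair for pair, (lo, hi) in complete_edits.items()
              if not lo < record[pair[0]] / record[pair[1]] < hi]
    best = None
    fields = sorted(weights)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            # A candidate must include at least one field from each failed edit.
            if all(set(pair) & set(subset) for pair in failed):
                cost = sum(weights[f] for f in subset)
                if best is None or cost < best[0]:
                    best = (cost, subset)
    return set(best[1])
```

For the record x = 5000, y = 10, z = 3, the edits on x/y and x/z both fail while y/z passes, so changing x alone is the cheapest choice under equal weights. A record that passes every edit yields the empty set.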
For surveys with continuous data, a considerable amount of human attention may still be necessary, either before the system is applied to data or after. Another way that computers can take editing actions is by modeling human behavior. This is the "expert system" approach. For example, if maize yields typically average 100 bushels per acre, and the value 1,000 is entered, then the most likely correction is to assume that an extra zero was typed. The computer can be programmed to substitute 100 for 1,000 directly and then to re-edit the data.

Ways that Data Quality can be Improved or Maintained

It is not clear that editing done after data collection can always improve the quality of data by reducing non-sampling errors. An organization may not have the time or budget to recontact many of the respondents or may refrain from recontacts in order to reduce respondent burden. Additionally, there may be cognitive errors or systematic errors that an edit system cannot detect. Often, all that can be done is to maintain the quality of the data as they are collected. To use the maize yield example again, if the edit program detects 1,000 bushels per acre, and sets the value to 100 bushels per acre, then the edit program has only prevented the data from getting worse. Suppose the true value was really 103 bushels per acre. The edit and imputation program could not get the value closer to the truth in this case. Detecting outliers is usually not the only problem. The proper action to take after detection is the more difficult problem. One of the main reasons that Computer Assisted Data Collection is employed is that data are corrected at the time of collection. There are a few ways that an editing system may be able to improve data quality. A system that captures raw data, keeps track of changes, and provides well conceived reports may provide feedback on the performance of the survey. This information can be used to improve the survey in the future.
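The maize yield rule and the idea of capturing raw data while tracking changes can be combined in one small sketch. Everything here is hypothetical and for illustration only: the class `AuditedRecord`, the function `fix_extra_zero`, the field name `yield_bu_per_acre`, and the threshold `factor` (a value more than five times the typical yield is treated as a likely keying error) are assumptions, not features of any system described in this report.

```python
class AuditedRecord:
    """Holds the values as reported plus every change made during editing."""

    def __init__(self, raw):
        self.raw = dict(raw)      # reported values, never modified
        self.edited = dict(raw)   # working copy that edits may change
        self.trail = []           # (field, old value, new value, reason)

    def change(self, name, new_value, reason):
        """Apply a change and log it so its effect can be measured later."""
        self.trail.append((name, self.edited[name], new_value, reason))
        self.edited[name] = new_value


def fix_extra_zero(record, name, typical, factor=5.0):
    """Divide by 10 when the value is implausibly large but one tenth of it
    is plausible, on the assumption that an extra zero was keyed."""
    value = record.edited[name]
    if value > factor * typical and value / 10 <= factor * typical:
        record.change(name, value / 10, "assumed extra zero")
        return True
    return False


rec = AuditedRecord({"yield_bu_per_acre": 1000})
fix_extra_zero(rec, "yield_bu_per_acre", typical=100)
```

Because the raw value is retained next to the edited one, the difference between estimates computed from `raw` and from `edited` gives a direct measure of the effect of editing, and the logged reasons show which rules fire most often.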
To take another agricultural example, farmers often harvest corn for silage (the whole plant is harvested, chopped into small pieces, and blown into a silo). Production of silage is requested in tons. Farmers often do not know their silage production in tons. Instead, the farmer will give the size (diameter and height) of all silos containing silage. In the office, silo sizes are converted into tons of production. If this conversion takes place before data are entered, then there is no indication from the machine edit of the extent of this reporting problem.

Another way that editing software can improve the quality of the data is to reduce the opportunity cost of editing. The time spent on editing leaves less time for other tasks, such as persuading people to participate, checking overlap of respondents between multiple frames, and research on cognitive errors.

Ways that Quality of the Editing Process can be Defended or Confirmed

There is a difference between data quality and the quality of the editing process itself. To refer once again to the maize yield example, a good quality process will have detected the transcription error. A poor quality process might have let it pass. Although neither process will have improved data quality, the good quality process would have prevented their deterioration from the transcription error. Editing and imputation have the potential to distort data as well as to maintain their quality. This distortion may affect the levels of estimates and the univariate and multivariate distributions. A high quality process will attempt to minimize distortions. For example, in Fellegi and Holt systems, changes to the data will be made to the fewest fields possible and in a way such that distributions are maintained. A survey organization should be able to show that the editing process is not abusing the data. For editing after data collection, this may be done by capturing raw (unedited) data and keeping track of changes and the reasons for change.
This is called an audit trail. Given this record keeping, it will be possible to estimate the impact of editing and imputation on expansions and on distributions. It will also be possible to determine the editor effect on the estimates. In traditional batch mode editing on paper printouts, it is not unusual for two or more specialists to edit the same record. For example, one may edit the questionnaire before data entry while another may edit the record after the machine edit. In this case, it is impossible to assign responsibility for an editing action. In an on-line mode, one person handles a record until it is done. Thus all changes can be traced to a person. For editing at the time of data collection (e.g., in CATI), it may be necessary to conduct an experiment to see if either the mode of collection or the edits employed will lead to changes in the data. A high quality editing process will have other features as well. For example, the process should be repeatable, in time and in space. This means that the same data passed through the same process in two different locations, or twice in one location, will look (nearly) the same. The process will have recognizable criteria for determining when editing is done. It will detect real errors without generating too many spurious error signals. The system should be easy to program in and have an easy user interface. It should promote the integration of survey functions such as micro- and macro-editing. Changes made by people should be on-line (interactive) and traceable. Database connections will allow for quick and easy access to historical and sampling frame data. An editing system should be able to take actions of minor impact without human intervention. It should be able to accommodate new advances in statistical editing methodology. Finally, quality can be promoted by providing statistically defensible methods and software modules to the user.

C.
DESCRIPTIONS OF EDITING SOFTWARE

Three Generalized Editing Software Systems

Blaise

The Blaise system has been developed by the Netherlands Central Bureau of Statistics and is its standard data processing system. It is intended for use on microcomputer Local Area Networks (LANs) but can work on stand-alone machines as well. The required operating system for the microcomputers is MS DOS. The preferred LAN protocol is Novell, though Blaise will work with others as well. Turbo Pascal is required to compile applications programs; however, it is not needed by the end user. Development of applications in Blaise can be done in Dutch, English, Spanish, and French. Blaise can handle categorical, continuous, and character data. It has been used for economic, agricultural, social, and demographic surveys. It handles edits of all types. In Blaise, the human editor is not replaced as the primary reviewer of data. Rather, the individual is given a more powerful, interactive tool with which to work. Blaise is used to perform CATI, CAPI, and data entry as well as editing. Herein lies the strength of the system. Since it can perform these related functions, it can also integrate them. This integration is done through the creation of a "Blaise Questionnaire". This questionnaire is not a survey instrument itself; rather, it is a "specifications generator". In it, data are defined, routes are described, and edits are written. From this one specification, the Blaise system can generate two related modules. The first, for data collection, can be used for both CATI and CAPI applications. The second is used for data entry and data editing. Since Blaise integrates these related survey tasks, multiple specification of the data and edits is avoided. Blaise does not perform data analysis because there are already many packages that can perform this job. Blaise does generate dataset specifications for SPSS and Stata statistical packages and for the Paradox database system.
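The benefit of deriving collection and editing from one specification can be illustrated with a short sketch. This is not Blaise syntax and does not reproduce how Blaise works internally. It is a hypothetical Python illustration in which the names `SPEC`, `accept_answer`, `violations`, and `batch_status`, and the variables `tenure` and `region`, are all assumptions: the allowed codes for each variable are written once, and both the collection-time check and the post-entry edit read them from the same place.

```python
# Hypothetical single specification: each variable lists its allowed codes once.
SPEC = {
    "tenure": {1, 2, 3},
    "region": {1, 2, 3, 4},
}

def accept_answer(name, value, spec=SPEC):
    """Collection-time check: may the keyed answer be accepted as entered?"""
    return value in spec[name]

def violations(record, spec=SPEC):
    """Names of variables whose value is missing or outside the allowed codes."""
    return [name for name, allowed in spec.items()
            if record.get(name) not in allowed]

def batch_status(records, spec=SPEC):
    """Post-entry edit: map each record's position to its failing variables."""
    return {i: violations(r, spec) for i, r in enumerate(records)}
```

If the allowed codes for `tenure` change, only `SPEC` is edited, so the interviewing check and the batch edit cannot drift apart.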
Users can also specify their own specialized setups. Blaise can read in data from other sources as long as they are in (virtually any) ASCII format. A related tabulation module, which is part of the Blaise system, is called ABACUS. It can generate tables from Blaise data sets. These can be used for, among other things, some survey management functions. Weighted data can be tabulated in ABACUS. Interactive editing in Blaise can be approached in several different ways. For example, data can be entered either by the analyst or by high speed data entry operators. In the first case, data are edited as they are entered. In the second case, the editor has several different ways of approaching the task. A batch edit can be performed. In the batch edit, records are marked as clean, suspicious, or dirty. The editor can retrieve the records based on their status. Also, the editor can access any record by its identification number or call up records based on certain criteria such as stratum designation, or the value or range of values of designated variables.

Generalized Edit and Imputation System (GEIS)

The GEIS system, developed by Statistics Canada, is based on the work of Fellegi and Holt. A predecessor, the Numerical Edit and Imputation System (based on the ideas of Sande), has been used as a prototype for GEIS. GEIS has been developed as part of the Business Survey Redesign Project. GEIS is intended to be applied to continuous data in economic surveys. Editing and imputation are considered to be part of the same process. In GEIS, data review and change are performed primarily in batch. GEIS performs edit analysis, error localization, and imputation. The system can be used on mainframes as well as on microcomputers. The database system ORACLE is required for all stages of processing (GEIS is not part of the ORACLE system). GEIS handles linear edits and variables that take positive values. Within these constraints, most situations can be handled.
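As a rough illustration of what a linear edit looks like (this is not GEIS code), each edit can be written as an inequality over the fields of a record, and a record passes when every inequality holds. The sketch below is in Python; the field names, bounds, and function names are hypothetical. It also shows how a ratio edit on a positive denominator can be restated as two linear edits.

```python
# Minimal sketch of linear edits; field names and bounds are hypothetical.

def passes(record, edit):
    """A linear edit is (coeffs, bound): sum(coeffs[f] * record[f]) <= bound."""
    coeffs, bound = edit
    return sum(c * record[f] for f, c in coeffs.items()) <= bound

def ratio_as_linear(num, den, lo, hi):
    """Restate lo <= num/den <= hi (assuming den > 0) as two linear edits."""
    return [
        ({num: 1.0, den: -hi}, 0.0),   # num - hi*den <= 0
        ({num: -1.0, den: lo}, 0.0),   # lo*den - num <= 0
    ]

def failed_edits(record, edits):
    """Return the indices of the edits that the record fails."""
    return [i for i, e in enumerate(edits) if not passes(record, e)]

# Hypothetical edit set: payroll per employee between 5 and 80, employees >= 0.
edits = ratio_as_linear("payroll", "employees", 5.0, 80.0)
edits.append(({"employees": -1.0}, 0.0))

good = {"payroll": 400.0, "employees": 10.0}
bad = {"payroll": 4000.0, "employees": 10.0}
print(failed_edits(good, edits), failed_edits(bad, edits))
```

Representing every edit in one common form is what makes automated analysis of an edit set (for example, checks for consistency and redundancy) tractable; the sketch only evaluates edits and does not attempt such analysis.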
Non-linear edits can often be transformed to linear edits or can be restated keeping in mind the intent of the original edit. It is intended that the system be used by a subject matter specialist working with a methodologist. Edits are specified interactively through specially designed screens. After specification, feedback is provided in the form of implied edits, extremal records, and edit analysis such as checks for consistency and redundancy. Data are edited in batch. Fields are automatically selected for change under the principle that the smallest (weighted) set of fields is changed. Next, imputation is performed in a manner that the edits are satisfied. The primary method of imputation is hot-deck imputation where good records are used to donate values for incomplete records. Other model-based methods can also be specified. Since GEIS is embedded in the ORACLE system, the edit and imputation process can be easily monitored. Many different kinds of reports can be generated. For example, the frequency of imputation by field, and the number of times each donor record has been used in imputation are two reports that can be generated. Through these reports, it is possible to measure the impact of the process on the estimates. Defensibility of the edit and imputation process is a priority in GEIS. This is done not only through the tracking of records as they proceed through the system, but also by providing the user with statistically defensible methods. Data are held in an ORACLE database. Before they are edited in GEIS, they are treated in a preliminary edit. For example, all coding and respondent follow-up would be done in this preliminary edit. Unresolved records from the preliminary stage are sent to GEIS. Structured Program for Economic Editing and Referrals (SPEER) SPEER is intended primarily for continuous data under ratio edits for economic surveys conducted by the various divisions of the U. S. Bureau of the Census. 
SPEER applies the Fellegi and Holt methodology to ratio edits. Within that realm, SPEER performs edit generation and analysis, error localization, and imputation. Additivity edits can also be handled in SPEER. Other edits are handled either in satellite routines within SPEER or in a program outside of SPEER. Data are edited and imputed in batch mode first. On-line (interactive) review of referral records is an essential part of SPEER. Records can be designated as referrals based on criteria such as size of firm or on specific editing actions. SPEER runs on mainframes as well as on microcomputer LANs. All of the SPEER modules are programmed in FORTRAN. A FORTRAN compiler is required to program new applications. The use of FORTRAN as the base language has the advantage of flexibility. The limits of SPEER regarding imputation routines, screen design, etc., are the same as those of FORTRAN (there are very few limits). When using the system, the services of a programmer are required to incorporate survey specific expert information. In SPEER, both the machine and the human editor play major roles. Subject matter expertise is incorporated into SPEER through the programming of flexible modules. A hierarchy of imputation procedures for each variable is set; that is, imputation is on a field-by-field basis. The procedures are tried one at a time until a value within the feasible region is found. If desired, human editing actions can also be modeled in SPEER through the use of IF-THEN statements. Since SPEER can handle most problems, the analyst is spared the task of reviewing minor problems and can concentrate on unusual or large cases. When necessary, however, the analyst can review records interactively. In the interactive review, the screen display includes reported data, corrected data, a status indicator, and the lower and upper limits of the feasible region for each variable.
This allows the editor to see the effect of the editing actions vis-a-vis the SPEER limits. Also incorporated into SPEER is an audit trail, which keeps track of changes and reasons for them. The analyst requests a specific record and reviews the processing done by the automated system. The human expert can override the decision rules residing in the automated system and replace them based on alternative information about the case under review. The analyst typically has access to one or more of the following: the original response form, auxiliary information about the establishment under review, or the respondent by a telephone call. Based on this additional information and personal experience, an analyst may alter the decision rules built into the automated system. If there is reason to believe that the most appropriate imputation value lies outside the acceptable region, the analyst can select an imputed value outside the range. This system has also been used as a data entry vehicle for late arrival forms. The late forms are entered into the data file by subject matter specialists using SPEER and they are edited as they are being entered.

Brief Description of Other Systems or Programs

An Example of an Expert System Application. An expert system application has been developed by the Quality Assurance Division of the Energy Information Administration. The program has been written for the Monthly Power Plant Survey (EIA-759). It was written to assist in the process of disposing of items that fail computer edit routines and to compensate for insufficient expertise and training of editors manually performing the process of disposing of edit failures. It was thought that the expert system could guide and assist the data editors through the more difficult dispositions of items that have failed edits, thus allowing the data to be edited according to the standard required. Though the system is ready for its first use, it has yet to be implemented operationally.
PEDRO, a System for the On-Site Entering and Editing of Data. The Petroleum Electronic Data Reporting Option (PEDRO), developed at the Energy Information Administration (EIA), is an on-line system for data entry in which the respondents are involved in the data editing (Swann 1988). The respondents can use a personal computer for data entry or import data from the mainframe or another microcomputer system using a predefined format. The PEDRO system software then provides them with an image of a printed form which they proceed to "fill-out". The PEDRO data entry programs include a wide variety of edit checks to detect data errors at the time of entry. Users can enter and exit the PEDRO data entry function as often as they want while working to resolve any errors in the data. After data are entered, checked by PEDRO, and reviewed by the respondent, the data are transmitted to EIA. Examples of the edits include a check to determine whether a total equals the sum of its parts and whether current month beginning stocks are equal to previous month ending stocks. Range edits that use historical data are included among the other system edits. Sometimes error messages will be generated for values that are actually correct. In that situation, the respondent is asked to provide an explanation for the anomalous value in the comments screen. This information is also transmitted to EIA making it unnecessary for an analyst to contact the respondent to explain the anomaly. Currently PEDRO is used by approximately 61 respondents to the "Monthly Refinery Report" and 10 respondents to the "Monthly Imports Report." Other offices in EIA are currently in the testing phase of using PEDRO for their surveys. DIA, a System for the Automatic Editing of Qualitative Data DIA is the name of a system developed by the National Statistical Institute of Spain (Garcia-Rubio and Villan, 1990). It applies the Fellegi and Holt methodology to qualitative data. 
Only the minimum number of fields necessary are changed in order to satisfy the edits. The only specification necessary for imputation is that of the conflict (edit) rules. Each record is edited once and distributions are maintained. Random errors are distinguished from systematic errors; however, a rules analyzer ensures that both types of errors are treated consistently. Detailed information is provided by DIA on the whole editing and imputation process.

Micro-Macro Statistical Analysis System

The Micro-Macro Statistical Analysis System is a graphics-on-screen, interactive, macro-editing system developed by the Bureau of Labor Statistics for use on the Current Employment Survey (CES). It is meant to replace the current batch system that generates thousands of computer printout pages. First, a table of industry identification codes for industries with suspicious estimates is presented. The analyst chooses one industry to work with. At this point, the analyst will try to find suspicious sample data which might have caused the problem. This can be done in either of two modes of operation: query or graphics. In the query mode, tables of estimates for specific cells are displayed. The analyst can ask logical questions about a set of sample members in order to select suspicious members. For a particular record, the analyst can reduce its weight so that it represents only itself, can reject it, or can change entries. The effect of these micro changes can be seen at the cell (macro) level. In the graphics mode, current versus previous data points are displayed in a scatter plot for each variable. Outliers are easily seen and can be marked for further inspection in the query facility. Records that are changed in the query mode are marked when displayed in the graphics mode. A full audit trail is generated as changes are made in order to facilitate supervisory oversight of the process.
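The link between a micro change and its macro effect in a system of this kind can be sketched in a few lines of Python. This is an illustrative sketch only, not the Bureau of Labor Statistics implementation; the record layout, the weights, and the function names are assumptions made for the example.

```python
# Hypothetical sketch: effect of a micro edit on a weighted cell total,
# with a simple audit trail. Record layout and names are assumptions.

def cell_total(records, cell):
    """Weighted total for one cell (e.g., an industry code)."""
    return sum(r["weight"] * r["value"] for r in records
               if r["cell"] == cell and not r.get("rejected", False))

def apply_change(records, audit, rec_id, field, new_value, editor, reason):
    """Change one field of one record and log who changed what, and why."""
    rec = next(r for r in records if r["id"] == rec_id)
    audit.append({"id": rec_id, "field": field, "old": rec.get(field),
                  "new": new_value, "editor": editor, "reason": reason})
    rec[field] = new_value

records = [
    {"id": 1, "cell": "A", "weight": 10.0, "value": 50.0},
    {"id": 2, "cell": "A", "weight": 10.0, "value": 900.0},  # suspicious
    {"id": 3, "cell": "B", "weight": 5.0, "value": 20.0},
]
audit = []

before = cell_total(records, "A")
# Reduce the weight so that the suspicious record represents only itself.
apply_change(records, audit, 2, "weight", 1.0, "analyst1", "outlier")
after = cell_total(records, "A")
print(before, after, audit)
```

Because every change passes through one logging function, the audit list can later be used to attribute changes to an editor or to measure how much editing moved a cell estimate.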
Other Software

Paul Cotton (1988) reviewed four systems in a paper entitled "A Comparison of Software for Editing Survey and Census Data". The paper is in two parts. A set of criteria for evaluating editing software is discussed, followed by a review of the four systems. In addition to the GEIS system, the paper describes three systems used primarily in the Third World. They are the Integrated System for Survey Analysis (ISSA), PCEDIT, and the Integrated Microcomputer Processing System (IMPS). ISSA was developed by the Institute for Resource Development, Westinghouse, to process demographic and health surveys in Third World countries on IBM personal computers. It can perform data entry, data editing, and tabulation. ISSA is described in Cushing (1988). PCEDIT is available from the Department of Technical Co-operation for Development of the United Nations. It is meant to be used to process population (demographic) data. IMPS, developed by the International Statistical Programs Center of the U. S. Bureau of the Census, consists of six major components, one each for data definition, data entry, editing, tabulation, census management, and census planning. The name of the editing package is CONCOR. IMPS was developed to process census data in developing countries. The U.S. Census Bureau is using CONCOR to edit and impute data for the 1990 Decennial Census for the U. S. Pacific Islands (Guam, American Samoa, Northern Marianas, and Palau). CONCOR is also being used to test edit specifications for population and housing characteristics for the basic 1990 United States Census long-form questionnaire. IMPS is described in Toro and Chamberlain (1988).

CHAPTER V

RESEARCH ON EDITING

A. INTRODUCTION

All survey or census data must go through some level of editing. In the absence of correction activities, errors could introduce serious distortions into the data and derived statistics.
Surveys, survey staff, and processing capabilities all change over time, and procedures for editing change as well. Redesign or improvement for edit systems can be minor to correct for slight problems, or there can be large research efforts to introduce major changes in methodology. These investigations can be carried out by specialists for a specific survey, programmers focusing on computer enhancements, or methodologists working on edit research. Three related goals of the Research Subgroup of the Subcommittee have been to identify areas in which improvements to edit systems will prove most useful, describe recent and current research activities to enhance edit capabilities, and make recommendations for future research. The Edit System Questionnaire discussed in preceding chapters included questions about edit improvements. One question asked was "For future applications, what would you like your edit system to do that it doesn't do now?" Another source of information was discussions with those responsible for edit tasks within a number of Federal agencies. Two areas emerged as priorities: (1) on-line, human interaction with a computer edit system and (2) better ways to detect potentially erroneous survey responses. Section B of this chapter provides examples of research in the two areas mentioned above. Section C briefly describes editing research in other countries. Section D presents case studies of editing research in United States Federal Statistical Agencies. A summary is provided in Section E. In Appendix D an annotated bibliography describes research efforts over the past years and we discuss this bibliography in section F. The bibliography is particularly important because it is difficult to locate and identify research on edit development. Sometimes the research is part of a quality assurance project. Often, research findings are not written up as such, but they are implemented and evolve into practical and useful software. 
The chapter is limited to research on editing as opposed to imputation.

B. AREAS OF EDIT RESEARCH

One area of current research interest is that of on-line edit capabilities in which survey takers interact with editing software to edit responses at the time of data collection. This occurs in a CATI (Computer Assisted Telephone Interviewing) or CAPI (Computer Assisted Personal Interviewing) setting. The Blaise system discussed earlier is an example of edit software used in support of a CAPI and CATI program. Computer Assisted Self Interviews (CASI) is an innovative extension of these ideas in which respondents are provided with software that allows them to edit their own responses before transmission to the collecting agency. One software system and supporting hardware for this purpose in use in Federal agencies is the PEDRO system, which is described in Chapter IV. The topic of computer assisted data collection activities has been investigated in detail by the Computer Assisted Survey Information Collection (CASIC) Working Group. Another use of on-line, interactive edit programs is in the review of edit referral documents. Most survey editing, especially of economic data, is a combination of automated batch computer runs and a follow-up review of selected cases by subject matter staff. The reason for targeting a record may be changes to a large case, large or unusual changes, or the need for an analyst to supply an imputation. An on-line referral system should allow an analyst to make changes in a record, enter the change to the data file, and have the edit system validate the change or indicate that further adjustments may be necessary. After an analyst completes the review of a record using an on-line system, the record should require no further action.
This is in contrast to procedures currently in place in which an analyst will make "paper and pencil" changes to a referral document, changes will then be entered through some data entry process, the revised record will be run through an automated batch system, and the record may be targeted for further review. With an on-line, interactive referral system for analyst/clerical review of individual cases, the review process should be more efficient, less error prone, and less tedious. Research into this area has a major system design orientation with the primary focus on software development rather than on new editing methodologies. Several of the systems described in Chapter IV, EDITING SOFTWARE, incorporate interactive review. Blaise is a system in which interactive review is the primary method of data editing and which integrates editing with computer assisted data collection. SPEER is a system where interactive review is tied in with Fellegi and Holt editing principles. In PEDRO, the respondent fills in an electronic form that is edited at the same time. The Micro-Macro Statistical Analysis System incorporates interactive tabular and graphical review in order to perform macro-editing. The systems ISSA, PCEDIT, and CONCOR also have interactive capability. A second area of active research is in the detection of potentially erroneous responses. The method for error detection most commonly used in Federal agencies is to employ explicit edit rules. For example, edit rules may require that: (a) the ratio of two fields lie between prescribed bounds, (b) the current response be within some range of a predicted value based on a time series or other models, or (c) various linear inequalities and/or equalities hold. Edit rules and parameters are highly survey-specific. A related editing research area is the design of edit rules and the development of methods for obtaining sensitive parameters.
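The three kinds of explicit edit rules listed above, (a) through (c), can be sketched as simple predicates. The Python below is a minimal illustration; the field names, bounds, predicted value, and tolerance are hypothetical and would in practice be survey-specific.

```python
# Illustrative edit rules of the three kinds (a), (b), (c).
# Field names, bounds, and tolerances are hypothetical.

def ratio_edit(rec, num, den, lo, hi):
    """(a) The ratio of two fields must lie within [lo, hi]."""
    return rec[den] != 0 and lo <= rec[num] / rec[den] <= hi

def range_edit(rec, field, predicted, tol):
    """(b) The response must be within a fraction tol of a predicted value."""
    return abs(rec[field] - predicted) <= tol * abs(predicted)

def balance_edit(rec, total, parts, eps=1e-9):
    """(c) A linear equality: the total must equal the sum of its parts."""
    return abs(rec[total] - sum(rec[p] for p in parts)) <= eps

def run_edits(rec, edits):
    """Return the names of the edits that the record fails."""
    return [name for name, check in edits if not check(rec)]

edits = [
    ("sales_per_employee",
     lambda r: ratio_edit(r, "sales", "employees", 10, 500)),
    ("sales_vs_forecast",
     lambda r: range_edit(r, "sales", 1000.0, 0.25)),
    ("parts_sum_to_total",
     lambda r: balance_edit(r, "sales", ["retail", "wholesale"])),
]

rec = {"sales": 1200.0, "employees": 2, "retail": 700.0, "wholesale": 400.0}
flags = run_edits(rec, edits)
print(flags)
```

Keeping each rule as a named predicate makes it straightforward to report which rules a record failed, which is the information a referral system or an analyst would need.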
For some automated edit systems the primary activity is to screen records which fail some combination of edit rules, after which data correction or verification is completed by subject specialists. This is especially true for questionnaires having, in part, a regulatory purpose or having only a small number of cases. For such edit systems, research will focus on selecting the appropriate edit rules, deriving sensitive bounds, and setting up flagging procedures. A related area of interest focuses on optimal methods to target cases for review, as one does not want to burden the review process with an excessive number of referral cases nor does one wish to let many errors escape detection. Several research studies are described in Section D in which the editing objective is to detect potentially erroneous responses. The first case study on methods to develop edit rules and tolerances was conducted at the Federal Reserve Board to derive set rules and parameters for editing bank deposit data. One objective of this study was to determine procedures to group reporting units into clusters and form edit parameters by cluster. A related study at the Federal Reserve Board (FRB) to investigate the use of more model based range limits is described as well. Three case studies follow on the use of time series data on a firm's performance to predict current reporting and then edit actual reported values against those predicted. The first two studies describe research at Energy Information Administration (EIA) and are followed by a description of work at the National Agricultural Statistics Service (NASS). These studies illustrate the type of research being conducted at various Federal agencies and should prove useful as a source of ideas, directions, and considerations in edit system design.
In contrast to the rule-driven method for the detection of potentially erroneous response combinations within a record, one alternative procedure is to analyze the distribution of questionnaire responses. Records which do not conform to the observed distribution are then targeted as outliers and are selected for review and examined further for potential errors (Little and Smith, 1984 for example). Although there has been research interest in this topic, no application of these multivariate methods was found. In addition, an investigation of the joint use of outlier detection procedures and rule-driven edits to detect potentially erroneous responses may prove valuable. C. EDITING RESEARCH IN OTHER COUNTRIES Much editing research has been conducted in national statistical offices around the world. It is these organizations, which conduct huge and complicated surveys, that have the most to be gained from developing new systems and techniques. They also have the resources upon which to draw for this development. The following are citations of people and organizations about which the members of this Subcommittee have knowledge. Leopold Granquist of Statistics Sweden has presented papers on both the purposes of editing (Granquist 1984), and on macro-editing (Granquist, 1987). Granquist has also developed a typology of survey errors with which to judge the effectiveness of editing systems. Members of the Australian Bureau of Statistics have given editing papers at two recent Annual Research Conferences of the U. S. Bureau of the Census. The first by Linacre and Trewin (1989) addresses the optimal allocation of resources to various survey functions (including editing) in order to reduce non-sampling errors. The second by Hughes, McDermid, and Linacre (1990) concerns the use of graphical techniques to find outliers at both the micro and macro level. 
The National Statistical Institute of Spain has developed a Fellegi and Holt system for edit and imputation of categorical data. In a recent paper, Garcia-Rubio and Villan (1990) discuss the applicability of the Fellegi and Holt methodology to randomly generated and systematic errors. They have made modifications in the methodology in order to better handle errors of the latter type. The Netherlands Central Bureau of Statistics (CBS) is the world leader in the use of microcomputer Local Area Networks for the processing of survey data. Keller and Bethlehem (1990) describe the systems, organizational issues and their resolution related to this new technology. Currently, the CBS has 2,000 microcomputers installed in 60 LANs. All the day-to-day processing of survey data is now carried out on these LANs using standardized software tools. The CBS has also carried out a "Data Editing Research Project" to determine the need for an interactive computer assisted procedure (Bethlehem, 1987). In Statistics Canada, Hidiroglou and Berthelot (1986) have developed a method of statistical editing for periodic business surveys. An international group called the Data Editing Joint Group has been meeting for a few years under the auspices of the Statistical Computing Project Phase 2 of the United Nations Development Program's Economic Commission for Europe. Countries represented include Sweden, Netherlands, Soviet Union, Yugoslavia, Hungary, Spain, France, Canada, and the United States. (The National Agricultural Statistics Service is the U. S. representative.) This group concentrates more on the systems aspects of editing and will be making recommendations about systems specifications both for their own use and for systems development in the third world. Phase 2 will be finished in the autumn of 1990. The group intends to continue its work under the auspices of the European Statisticians Association with a focus on cooperating for their own benefit.

D.
CASE STUDIES

Respondents to the questionnaire on editing practices expressed interest in deriving sensitive tolerance edits and using more sophisticated and extensive validation editing. They also mentioned that they would like to employ historic data to test the current data. An important aspect in the development of edits is determination of bounds or tolerance limits to use in identifying potentially erroneous data. Several recent research studies have focused on various ways of setting the bounds and on the limitations of the approaches.

Determining Optimal Clusters for Model Specification

If a large number of separate clusters or groupings are used to determine tolerances for edit rules, the procedure for providing ranges can become unwieldy. On the other hand, if too few groups are used, erroneous items may not be flagged as ranges may become insensitive. Research to reduce the number of cells used to set tolerance limits has been carried out at the Federal Reserve Board (Pierce and Bauer, 1989). To edit data that banks and other depository financial institutions submit to the Federal Reserve System, tolerance bands are constructed for groupings of institutions felt to be homogeneous by size, location, and type of institution. However, an objective measure of this homogeneity was not available. Since the edits were designed to flag or identify observations falling into the tails of the distribution of week-to-week changes, the measure proposed to assess the degree of homogeneity of different institutions was the variances of the changes for those institutions. As a result of performing an analysis of variance, with these variances as the cell observations, it was determined that an unnecessarily fine subdivision of groups was being used since the sample variances were not significantly different between many of the groups. Based on the results of multiple comparisons, new ways of combining the groups were suggested.
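The general idea of the analysis described above can be sketched as follows: compute, for each institution, the variance of its week-to-week changes, then compare these variances across groups with a one-way analysis of variance. The Python below is a simplified illustration under stated assumptions, not the procedure of Pierce and Bauer; the function names and the example values are made up, and the multiple-comparison step is omitted.

```python
# Sketch: one-way ANOVA F statistic with per-institution variances of
# week-to-week changes as the observations. Names and data are made up.
from statistics import mean, variance

def change_variance(series):
    """Sample variance of the week-to-week changes in one institution's series."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return variance(changes)

def anova_f(groups):
    """groups: dict of label -> observations. Returns (F, df_between, df_within)."""
    all_obs = [x for obs in groups.values() for x in obs]
    grand = mean(all_obs)
    k, n = len(groups), len(all_obs)
    ss_between = sum(len(obs) * (mean(obs) - grand) ** 2
                     for obs in groups.values())
    ss_within = sum((x - mean(obs)) ** 2
                    for obs in groups.values() for x in obs)
    f = (ss_between / (k - 1)) / (ss_within / (n - k))
    return f, k - 1, n - k

# Each observation is one institution's variance of weekly changes (invented).
groups = {"small_urban": [1.0, 2.0, 3.0], "small_rural": [2.0, 3.0, 4.0]}
f_stat, df_b, df_w = anova_f(groups)
print(f_stat, df_b, df_w)
```

An F statistic that is small relative to the tabulated critical value for the given degrees of freedom would suggest that the groups do not differ detectably in variability and might share one set of tolerances.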
The Federal Reserve Board is also exploring an approach to editing referred to as "model-based edit design". The basic ideas behind this are that information in addition to the previous value of the item being edited is relevant in tolerance-band construction for edit checks (for example, last month's or year's values of the item, values of related variables, calendar or tax-date information), and such information is best incorporated into the editing procedure through a model which can then be used in determining the edit tolerances and executing the edits. Moreover, tolerance widths can be determined from the model's standard error estimate and given a probabilistic interpretation. The Federal Reserve Board will experiment with such models and model-determined tolerances for pilot items and for selected sources of additional information and then move toward a more systematic development of this modeling approach. One topic to be investigated concerns the prospect for having common models for classes of similar financial institutions. This would avoid the necessity for building an unwieldy number of models, while still having each model provide a sufficiently accurate description of the relevant behavior of the variable being edited for each institution. Intermediate possibilities include having a common model specification within a class but with different parameter values permitted for individual banks, or fixing those values but allowing different standard deviations (tolerance widths).

Use of Time Series Methods to Set Bounds

Another approach when historical data are available is to use past data values for a particular respondent to predict the current value and to then use the predicted value to construct tolerance limits for the new data. The Energy Information Administration (EIA) uses this approach for its weekly surveys on petroleum supply and its monthly surveys on petroleum marketing.
Exponential smoothing is the particular technique used to obtain the predicted value (Burns, 1983). This technique has worked well during periods when the data are relatively stable; for example, on the weekly series on petroleum supply. However, it has not performed well when the data are erratic, such as when there are sharp price changes or seasonality. To address the problem of this price change, the Petroleum Marketing Division of EIA has looked into the possibility of introducing a market shift into the edits that would account for real time market changes. The market shift is calculated from partial data from the current period. Before this was actually fully implemented, it was employed using an externally calculated market shift based on industry information. Later, these ideas were implemented by calculating the shift using as much current month data as available at the time of editing. This allowed not only full automation, but also targeted market shifts for varying populations and products as the data are received on a daily basis. To address the problem of seasonality in the monthly series on petroleum supply, the Petroleum Supply Division implemented tests on month-to-month differences rather than using exponential smoothing. Research has also been conducted on the Kalman filter implementation of exponential smoothing (Kirkendall, 1988). The EIA used this procedure to obtain preliminary estimates of crude oil purchases and production. The procedure provided a method to both estimate and edit state data. In some states the difference between the data on purchase volumes and production has remained relatively constant since 1983. In other states abrupt changes in the relationship or the presence of outliers were observed. Actually both transfer function models and ARIMA models were tried. However, these procedures were not satisfactory in states in which large outliers or abrupt level shifts appeared.
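A minimal sketch of the time series approach in this section is given below in Python: simple exponential smoothing predicts the current value from a respondent's history, and a report is flagged when it falls outside a band around the prediction. The smoothing constant, the relative tolerance, and the way the market shift enters are illustrative assumptions, not EIA's parameters or formulas.

```python
# Sketch: simple exponential smoothing to predict a respondent's current
# value and flag reports outside a tolerance band. alpha, rel_tol, and the
# multiplicative market shift are illustrative assumptions.

def smooth_forecast(history, alpha=0.3):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def flag(history, current, alpha=0.3, rel_tol=0.2, shift=0.0):
    """True if current lies outside the band around the shifted forecast.

    shift is an optional market-wide adjustment (e.g., 0.10 for a 10% rise)
    applied to the forecast before the comparison.
    """
    predicted = smooth_forecast(history, alpha) * (1 + shift)
    return abs(current - predicted) > rel_tol * abs(predicted)

history = [100.0, 100.0, 100.0, 100.0]
print(flag(history, 115.0))              # within the band
print(flag(history, 130.0))              # outside the band
print(flag(history, 130.0, shift=0.10))  # within the band after the shift
```

The last call illustrates why a market shift matters: a value that looks anomalous against a respondent's own history may be unremarkable once a market-wide movement is taken into account.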
Use of Robust Estimators to Set Bounds

The National Agricultural Statistics Service (NASS) of the Department of Agriculture has performed research on using Tukey's biweight to develop bounds for their statistical edit of data on livestock slaughter at the plant level (Mazur, 1990). In searching for a statistical estimator to determine edit boundaries, two desired properties are that tolerances quickly stabilize to new levels if true changes occur and that they return to old levels in the presence of outliers. Therefore, robust methods were considered because they are more resistant to outliers than the standard statistical methods and work well on many distributions, as compared with the standard methods which work best when the distribution is normal. Initially, four measures of central tendency were considered: the mean, the median, the trimean (the sum of the upper and lower quartiles plus twice the median, the entire quantity divided by four), and the 20 percent trimmed mean (the lowest n*0.20 values, where n is the sample size, and the highest n*0.20 values are dropped, and the mean of the remaining values is computed). Four measures of spread were also considered: the standard deviation, the inter-quartile range, the median absolute difference, and the 20 percent trimmed standard deviation. The mean and standard deviation were greatly affected by outliers. The other estimators seemed inadequate because they excluded good values. There was also the concern that they may underestimate the measure of spread. Because of these limitations, further research was conducted using the biweight. The biweight differs from other estimators in that the weights are dependent on the data set. Therefore, it tends to include good values and exclude unreasonable ones. If the data are normal, the biweight is like the mean, but if the data are not normal, it is more like the median. The edit limits will be calculated for each plant, using the plant's previous 13 weeks of data.
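The estimators of central tendency discussed above can be compared on a small example. The Python sketch below implements the trimean and the 20 percent trimmed mean as defined in the text, together with a common iterative form of Tukey's biweight location estimate. The tuning constant c = 6, the use of a fixed median absolute deviation as the scale, and the 13 weekly values are assumptions made for illustration; they are not taken from the NASS study.

```python
# Sketch of three robust location estimators. The tuning constant c and the
# example data are illustrative assumptions.
from statistics import mean, median, quantiles

def trimean(xs):
    """(lower quartile + 2 * median + upper quartile) / 4."""
    q1, q2, q3 = quantiles(xs, n=4)
    return (q1 + 2 * q2 + q3) / 4

def trimmed_mean(xs, prop=0.20):
    """Drop the lowest and highest int(n * prop) values, then average."""
    k = int(len(xs) * prop)
    s = sorted(xs)
    return mean(s[k:len(s) - k])

def biweight_location(xs, c=6.0, iters=20):
    """Iteratively reweighted mean with Tukey's biweight weights."""
    t = median(xs)
    mad = median([abs(x - t) for x in xs])  # scale, held fixed here
    if mad == 0:
        return t
    for _ in range(iters):
        weights = []
        for x in xs:
            u = (x - t) / (c * mad)
            weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
        total = sum(weights)
        if total == 0:
            break
        t = sum(w * x for w, x in zip(weights, xs)) / total
    return t

# Thirteen hypothetical weekly values with one gross outlier.
weeks = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99, 100, 102, 500]
print(mean(weeks), median(weeks), trimean(weeks),
      trimmed_mean(weeks), biweight_location(weeks))
```

In this example the outlier pulls the ordinary mean well above the bulk of the data, while the biweight gives it zero weight; unlike the trimmed mean, the biweight decides which values to down-weight from the data rather than discarding a fixed share.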
Research was also conducted on identifying inliers, that is, values that do not change much over time and are suspicious for that reason. A key feature of the process is the use of stratification to provide edit limits for slaughter plants with insufficient data, to edit small plants, or to impute for missing data. Also, a journal provides an audit trail. The analyst resolves error signals interactively on a microcomputer.

Future research is being considered to extend the biweight approach to other data series which collect data from the same reporting units over time and to develop a system to plot historic data. Other possibilities include research to determine whether seasonality could be incorporated into the biweight (mainly for large plants) and whether a capability to identify plant holidays could be added.

E. SUMMARY

The survey on editing practices indicated that there was little analysis of the effect of editing on the estimates that were produced. Considering that the cost of editing is significant for most surveys, this is clearly an area in which more work is required. A related issue is determining when to edit and when not to edit. Clearly, not all the errors are going to be found, and we should not attempt to find them all at the risk of over-editing. An interesting task is designing guidelines for determining what is an acceptable level of editing.

Another neglected research area in this country concerns the editing of data at the time they are keyed from mail responses. Data entry systems typically have some keying error detection capabilities of a univariate nature, typically range checks and checks to detect when an unacceptable character has been keyed. The primary focus of checks at this stage is to detect data entry errors. This area is usually discussed in the setting of quality control; however, it is an area that can benefit from further research from the perspective of data editing.
A number of surveys have reduced this sort of error through the use of double keying. In the Netherlands Central Bureau of Statistics, subject-matter specialists enter data and edit them interactively as they are entered.

The advancement of computer and peripheral technology is playing a dual role in affecting survey editing. On the one hand, some developments have helped to eliminate or reduce the need for some edits. Computer Assisted Data Collection systems (e.g., CATI, CAPI) not only reduce data entry errors but reduce other errors as well. The use of machine-readable forms and bar codes will eliminate keying errors. On the other hand, the increased speed, memory, and storage of computers, and networking have allowed statisticians to consider computationally intensive techniques for editing that previously would have been impossible, particularly considering survey deadlines, and to utilize other databases.

The questionnaire respondents expressed interest in the use of expert systems to improve survey-specific sensitivity in editing. The term "expert system" is not really well defined, and different analysts attach different meanings to it. With respect to data editing, it refers to the treatment of survey-specific information in a structured way. In that regard, the computer is simulating, to some degree, the role of the subject-matter specialist. Two systems already in use that have expert system components are SPEER and PEDRO. (These systems are described in Chapter IV.) In these systems, decisions that may have been made by subject-matter specialists are now made by using rules that have been programmed into an automated system.

In this chapter, examples of research on methods to detect erroneous values were discussed. With improved technology, the techniques have become more sophisticated and undoubtedly will continue to become more so. The question then becomes how effective the techniques are in actually detecting errors.
Two related areas for further research are monitoring the effectiveness of the edits and determining guidelines for when to use each technique. To address these issues, it is necessary to track the proportion of flagged items that are actually errors (often referred to as the "hit" rate). This, of course, only gives one side of the picture; it does not address the issue of errors that are not detected by a specific procedure. Despite this limitation, tracking the "hit" rate is useful, and ways of automatically alerting the analyst that it has gone out of control would be helpful.

As more techniques become computationally feasible, the analyst is confronted with more choices in designing an edit system. It would be useful to know when the techniques work well. For example, research has already indicated that exponential smoothing does not work well when the data are erratic. If findings could be made available about other techniques, time could be saved in developing new edits.

In conclusion, there are several recommendations for research in data editing that are contained in the preceding paragraphs. However, the most important recommendation we can make is that agencies recognize the value of editing research and place high priority on devoting resources to their own research, to monitoring developments in data editing at other agencies and elsewhere, and to implementing improvements.

F. BIBLIOGRAPHY

It is quite difficult to provide a complete assessment of current research activities in the area of editing because so much of the research, progress, and innovation is described only in survey-specific documentation. The difficulty is even more fundamental. Innovations in editing methods made by survey staff are often viewed as enhancements to processing for that particular survey, and little thought is given to the broader applicability of methods developed.
Accordingly, survey staff do not typically prepare a discussion of new methods for publication or for other forms of wide dissemination. A description of editing methods and system design might be found in survey processing specifications, instructions to programming staff, or in survey processing code. Innovations that are computer intensive often are regarded not as method changes, but rather as computer enhancements. In other cases, edit activities may be included in the general area of "quality assurance" with little thought of the subject of editing per se. For these reasons, any bibliography on editing will undoubtedly miss important areas of research and innovations.

Fortunately, a number of researchers did see editing as distinct from other processing tasks and have taken the time to describe their experiences. Some of the papers in the bibliography can be viewed as case studies for a particular editing strategy employed on a particular survey. To some extent, authors of such papers wanted to record their activities, subject them to public scrutiny, and offer up their techniques to others who may be working under similar conditions and who may find their suggestions useful. It is often in such articles that methods which may be applicable to more than one survey are first introduced and described.

There are features of the editing process that cut across surveys, and this realization has encouraged the development of general methodologies and multiuser systems. Much recent research in the area of editing has focused on the development of multipurpose edit systems, and a number of papers in this bibliography discuss such systems. Some of these systems have imputation components while others do not. The preceding chapter on Editing Software described three multipurpose software packages: GEIS, BLAISE, and SPEER.
In the respective specialized bibliographies ([A], [B], and [C]), we include papers which describe underlying methods, the software, proposed uses, and possible advantages of the respective systems. The bibliographic citations provide the theoretical and research background for these systems and constitute a link between the software chapter and this research chapter.

APPENDIX A

RESULTS OF EDITING PRACTICES PROFILE FROM QUESTIONNAIRE RESPONSES

                                                          Frequency  Percent

1. What type of survey are you engaged in?
   a. Sample                                                     90      77%
   b. Census                                                     27      23%

2. What is the purpose of the survey?
   a. Statistical                                                98      84%
   b. Regulatory                                                  0       0%
   c. Both                                                       19      16%

3. How would you classify your survey?
   a. Single-time survey                                          6       5%
   b. Repeated survey (cross-sectional)                          50      43%
   c. Panel survey (longitudinal)                                39      34%
   d. Rotating panel survey                                      11      10%
   e. Split panel survey                                          6       5%
   f. Other                                                       3       3%
   Not answered                                                   2

4. What is the frequency of your survey?
   a. Weekly                                                      4       4%
   b. Monthly                                                    23      20%
   c. Quarterly                                                  12      10%
   d. Annual                                                     39      33%
   e. Other                                                      39      33%

5. What is your sampling unit?
   a. Individual                                                 21      18%
   b. Household                                                  11       9%
   c. Farm                                                        6       5%
   d. Economic establishment or firm                             58      50%
   e. Institution                                                 8       7%
   f. Other                                                      13      11%

6. How many units are included in your survey?
   Range: 22 through 1,000,000

7. Is response to your survey mandatory?
   a. Yes                                                        43      37%
   b. No                                                         74      63%

8. Averaging across all items, what level of item nonresponse does your
   survey experience?
   a. None                                                       18      16%
   b. Less than 5%                                               43      38%
   c. 5% or greater, but less than 10%                           20      18%
   d. 10% or greater, but less than 20%                          19      17%
   e. 20% or greater                                             12      11%
   Not answered                                                   5

9. What is your primary data collection method?
   a. Computer-assisted telephone interview (CATI)                4       3%
   b. Computer-assisted personal interview (CAPI)                 0       0%
   c. Telephone interview                                         9       8%
   d. Personal interview                                         25      21%
   e. Mailed questionnaire                                       49      42%
   f. Administrative records                                     18      16%
   g. Other (please specify)                                     12      10%

10. What secondary data collection method(s) do you use?
    (Circle all that apply)
   a. Computer-assisted telephone interview (CATI)               19
   b. Computer-assisted personal interview (CAPI)                 2
   c. Telephone interview                                        62
   d. Personal interview                                         24
   e. Mailed questionnaire                                       24
   f. Administrative records                                     16
   g. Other (please specify)                                      5

11. What type of computer do you use for data processing?
    (Circle all that apply)
   a. Mainframe                                                 107
   b. Minicomputer                                               20
   c. Microcomputer                                              40
   d. None                                                        0

12. What is your data processing environment?
   a. Batch mode                                                 49      42%
   b. On-line                                                     9       8%
   c. Both                                                       58      50%
   Not answered                                                   1

13. If your survey is computerized, what sort of file structure do you use?
    (Circle all that apply)
   a. Sequential                                                 71
   b. Database using ORACLE software                              7
   c. Database using ADABAS software                              5
   d. Database using DBASE software                              10
   e. Database using other software (please specify)             35

14. Are you limited in your ability to disseminate data by confidentiality
    (privacy) restrictions?
   a. Yes                                                       104      89%
   b. No                                                         13      11%

15. Do you release microdata (respondent-level data)?
   a. Yes, and imputed data items are identified (flagged)       36      31%
   b. Yes, and imputed data items are not identified             19      16%
   c. No                                                         62      53%

16. When you release aggregated data, do you provide information as to the
    percentage of a particular data item which has been imputed?
   a. Yes                                                        24      21%
   b. No                                                         89      79%
   Not answered                                                   4

17. Are there minimum standards for reliability for the data you
    disseminate; e.g., do you require that an estimate have less than an
    established maximum variance or be based on more than an established
    number of observations before the estimate can be released?
   a. Yes                                                        79      71%
   b. No                                                         33      29%
   Not answered                                                   4

18. What documentation exists for your survey?
   a. All aspects of the survey are well documented              88      76%
   b. The data editing system is well documented,
      but some other aspects are not                             10       9%
   c. Some aspects are documented, but not the data
      editing system                                                      1%
   d. Some documentation exists, but it is neither
      complete nor current throughout the system                 16      14%
   e. No documentation exists for this survey                     0       0%
   Not answered                                                   2

19. Is information available on the cost of data editing in your survey?
   a. Yes                                                        44      38%
   b. No                                                         73      62%

20. Please estimate the percentage of the total survey cost spent on data
    editing. (Please include all of the aspects of editing, such as any
    edits made at the time of data entry, clerical work, computer time,
    design, testing, monitoring, analyst review, call-backs, and review of
    summary tables.)
   Range   5% through 90%
   Mean    41.4%
   Median  35%
   Mode
(April 1990)