Federal Committee on Statistical Methodology
Office of Management and Budget

 

  Statistical Policy Working Paper 20 - Seminar on Quality of Federal Data - Part 2 of 3



 



 



                           Statistical Policy

                            Working Paper 20







                   Seminar on Quality of Federal Data



 



                              Part 2 of 3



  



               Federal Committee on Statistical Methodology



 



                          Statistical Policy Office



               Office of Information and Regulatory Affairs



                   Office of Management and Budget



 



                              March 1991



 



                  



 



                   MEMBERS OF THE FEDERAL COMMITTEE ON



                          STATISTICAL METHODOLOGY



 



                              (February 1991)



 



                         Maria E. Gonzalez, Chair



                      Office of Management and Budget



 



 



    Yvonne M. Bishop                  Daniel Kasprzyk



    Energy Information                Bureau of the Census



      Administration



                                      Daniel Melnick



    Warren L. Buckler                 National Science Foundation



    Social Security Administration



                                      Robert P. Parker



    Charles E. Caudill                Bureau of Economic Analysis



    National Agricultural



      Statistics Service              David A. Pierce



                                      Federal Reserve Board



    Cynthia Z.F. Clark



    National Agricultural             Thomas J. Plewes



      Statistics Service              Bureau of Labor Statistics



 



    Zahava D. Doering                 Wesley L. Schaible



    Smithsonian Institution           Bureau of Labor Statistics



 



    Robert M. Groves                  Fritz J. Scheuren



    Bureau of the Census              Internal Revenue Service



 



    Roger A. Herriot                  Monroe G. Sirken



    National Center for               National Center for



      Education Statistics              Health Statistics



 



    C. Terry Ireland                  Robert D. Tortora



    National Computer Security        Bureau of the Census



      Center



 



    Charles D. Jones



    Bureau of the Census



 



                                PREFACE



 



 



  In 1975, the Office of Management and Budget (OMB) organized the



  Federal Committee on Statistical Methodology. Comprised of



  individuals selected by OMB for their expertise and interest in



  statistical methods, the committee has, during the past 15 years,



  determined areas that merit investigation and discussion, and



  overseen the work of subcommittees organized to study particular



  issues.  Since 1978, 19 Statistical Policy Working Papers have been



  published under the auspices of the Committee.



 



  On May 23-24, 1990, the Council of Professional Associations on



  Federal Statistics (COPAFS) hosted a "Seminar on the Quality of



  Federal Data." Developed to capitalize on work undertaken during



  the past dozen years by the Federal Committee on Statistical



  Methodology and its subcommittees, the seminar focused on a variety



  of topics that have been explored thus far in the Statistical



  Policy Working Paper series.  The subjects covered at the seminar



  included:



 



       Survey Quality Profiles



       Paradigm Shifts Using Administrative Records



       Survey Coverage Evaluation



       Telephone Data Collection



       Data Editing



       Computer Assisted Statistical Surveys



       Quality in Business Surveys



       Cognitive Laboratories



       Employer Reporting Unit Match Study



       Approaches to Developing Questionnaires



       Statistical Disclosure-Avoidance



       Federal Longitudinal Surveys



 



  Each of these topics was presented in a two-hour session that



  featured formal papers and discussion, followed by informal



  dialogue among all speakers and attendees.



 



  Statistical Policy Working Paper 20, published in three parts,



  presents the proceedings of the "Seminar on the Quality of Federal



  Data." In addition to providing the papers and formal discussions



  from each of the twelve sessions, this working paper includes



  Robert M. Groves' keynote address, "Towards Quality in a Working



  Paper Series on Quality," and comments by Stephen E. Fienberg,



  Margaret E. Martin, and Hermann Habermann at the closing session,



  "Towards an Agenda for the Future."



 



  We are indebted to all of our colleagues who assisted in organizing



  the seminar, and to the many individuals who not only presented



  papers and discussions but also prepared these materials for



  publication.  A special thanks is due to Terry Ireland and his



  staff for their work in assembling this working paper.



 



                      Table of Contents



 



                    Wednesday, May 23, 1990



 



 



                             Part 1



 



 



                        KEYNOTE ADDRESS



 



 



TOWARDS QUALITY IN A WORKING PAPER SERIES ON QUALITY. . . . . .  3



    Robert M. Groves, The University of Michigan and U. S.



    Bureau of the Census



 



 



 



         Session 1 - SURVEY QUALITY PROFILES



 



 



 



THE SIPP QUALITY PROFILE. . . . . . . . . . . . . . . . . . .   19



    Thomas B. Jabine, Statistical Consultant



 



INITIAL REPORT ON THE QUALITY OF AGRICULTURAL SURVEY PROGRAM.   29



    George A. Hanuschak, National Agricultural Statistics



    Service



 



DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . 40



    Barbara A. Bailar, American Statistical Association



 



DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . .   46



    Nancy A. Mathiowetz, U. S. Bureau of the Census



 



 



 



Session 2 - PARADIGM SHIFTS USING ADMINISTRATIVE



                           RECORDS



 



 



 



PARADIGM SHIFTS: ADMINISTRATIVE RECORDS AND CENSUS-TAKING. . . 53



    Fritz Scheuren, Internal Revenue Service



 



AN ADMINISTRATIVE RECORD PARADIGM: A CANADIAN EXPERIENCE . . . 66



    John Leyes, Statistics Canada



                                               



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . .  77



       Gerald Gates, U.S. Bureau of the  Census



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . .  83



       Edward J. Spar, Market Statistics



 



 



 



        Session 3 - SURVEY COVERAGE EVALUATION



 



 



 



 



 CONTROL, MEASUREMENT, AND IMPROVEMENT OF SURVEY COVERAGE . . . 87



       Gary M. Shapiro, U. S. Bureau of the Census; Raymond R.



       Bosecker, National Agricultural Statistics Service



 



 QUALITY OF SURVEY FRAMES. . . . . . . . . . . . . . . . .   100



       Judith T. Lessler, Research Triangle Institute



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . 108



       Fritz Scheuren, Internal Revenue Service



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . .   114



       Joseph Waksberg, Westat, Inc.



 



 



 



          Session 4 - TELEPHONE DATA COLLECTION



 



 



 



 QUALITY IMPROVEMENT IN TELEPHONE SURVEYS. . . . . . . . . . 123



       Leyla Mohadjer, David Morganstein, Westat, Inc.



 



 COMPUTER ASSISTED SURVEY TECHNOLOGIES IN GOVERNMENT:



       AN OVERVIEW. . . .  .     .  . . . . . . . . . . . .  137



       Marc Tosiano, National Agricultural Statistics Service



 



 DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . .  155



       William L. Nicholls II, U.  S. Bureau of the Census



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . 161



       James T. Massey, National Center for Health Statistics



 



 



 



 



 



 



 



 






 



            



 



                                  Part 2



 



 



 



                          Session 5 - DATA EDITING



 



 



 



    OVERVIEW OF DATA EDITING IN FEDERAL STATISTICAL AGENCIES .167



          David A. Pierce, Federal Reserve Board



 



    EDITING SOFTWARE (An excerpt from Chapter IV of Working



          Paper 18). . . . . . . . . . . . . . . . . . . . . .173



          Mark Pierzchala, National Agricultural Statistics



          Service



 



    RESEARCH ON EDITING. . . . . . . . . . . . . . . . . . .  180



          Yahia Ahmed, Internal Revenue Service



 



    DISCUSSION. . . . . . . . . . . . . . . . . . . . . ..    184



          Charles E. Caudill, National Agricultural Statistics



          Service



    DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . 186



          Richard Bolstein, George Mason University



 



 



 



          Session 6 - COMPUTER ASSISTED STATISTICAL



                                SURVEYS



 



 



 



  OVERVIEW OF COMPUTER ASSISTED SURVEY INFORMATION COLLECTION. 191



          Richard L. Clayton, U. S. Bureau of Labor Statistics



 



    A COMPARISON BETWEEN CATI AND CAPI. . . . . . . . . . . . .197



          Martin Baum, National Center for Health Statistics



 



    COMPUTER ASSISTED SELF INTERVIEWING. . . . . .. . . . . .  202



          Ralph Gillmann, Energy Information Administration



 



    COMPUTER ASSISTED SELF INTERVIEWING: RIGS AND PEDRO,



          TWO EXAMPLES . . . . . . . . . . . . . . . . . . . . 205



          Ann M. Ducca, Energy Information Administration



 



    DATA  COLLECTION. . . . . . . . . . . . . . . . . . . .  . 209



          Cathy Mazur, National Agricultural Statistics Service



 






 



   DISCUSSION. . . . . . . . . . . . . . . . . . . . .  . . . 212



         Robert N. Tinari, U. S. Bureau of the Census



 



   DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . 216



         David Morganstein, Westat, Inc.



 



 



                         Thursday, May 24, 1990



 



 



            Session 7 - QUALITY IN BUSINESS SURVEYS



 



 



 



  IMPROVING ESTABLISHMENT SURVEYS AT THE BUREAU OF LABOR



        STATISTICS .. . . . . . . . . . . . . . . . . . . . . .221



        Brian MacDonald, Alan R. Tupek, U. S. Bureau of Labor



        Statistics



 



  A REVIEW OF NONSAMPLING ERRORS IN FEDERAL ESTABLISHMENT



  SURVEYS WITH SOME AGRIBUSINESS EXAMPLES. . . . . . . . . . . 232



        Ron Fecso, National Agricultural Statistics Service



 



  DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . .  243



        David A. Binder, Statistics Canada



 



  DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . .247



        Charles D. Cowan, Opinion Research Corporation



 



 



              Session 8 - COGNITIVE LABORATORIES



 



 



 



 THE BUREAU OF LABOR STATISTICS' COLLECTION PROCEDURES



 RESEARCH LABORATORY: ACCOMPLISHMENTS AND FUTURE DIRECTIONS . .253



       Cathryn S. Dippo, Douglas Herrmann, U. S. Bureau of Labor



       Statistics



 



 THE ROLE OF A COGNITIVE LABORATORY IN A STATISTICAL AGENCY. . 268



       Monroe G. Sirken, National Center for Health Statistics



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . 278



       Elizabeth Martin, U. S. Bureau of the Census



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . .  .281



                                             



       Murray Aborn, National Science Foundation (retired)



 






 



                            Part 3



 



         Session 9 - EMPLOYER REPORTING UNIT MATCH



                                     STUDY



 



   INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS:



         THE ERUMS EXPERIENCE. . . . . . . . . . . . . . . .   291



         Thomas B. Petska, Internal Revenue Service; Lois



         Alexander, Social Security Administration



 



   SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS. . . 301



         John Pinkos, Kenneth LeVasseur, Marlene Einstein,



         U. S. Bureau of Labor Statistics; Joel Packman, Social



         Security Administration



 



   RESULTS, FINDINGS, AND RECOMMENDATIONS OF THE ERUMS PROJECT. 309



         Vern Renshaw, Bureau of Economic Analysis; Tom Jabine,



         Statistical Consultant



 



   DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . .  318



         W. Joel Richardson, Charles A. Waite, U. S. Bureau of the



         Census



 



   DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . .324



         Thomas J. Plewes, U. S. Bureau of Labor Statistics



 



            Session 10 - APPROACHES TO DEVELOPING



                              QUESTIONNAIRES



 



   TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING



         QUESTIONNAIRES. . . . . . . . . . . . . . . . . . . .  331



         Theresa J. DeMaio, U. S. Bureau of the Census



 



   TECHNIQUES FOR EVALUATING THE QUESTIONNAIRE  DRAFT. . . . .  340



         Deborah H. Bercini, National Center for Health Statistics



 



   DESIGNING QUESTIONNAIRES FOR CATI IN A MIXED MODE



         ENVIRONMENT. . . . . . . . . . . . . . . . . . . . . . 349



         Gemma Furno, U. S. Bureau of the Census



 



   DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . 360



         Carol C. House, National Agricultural Statistics Service






 



    Session 11 - STATISTICAL DISCLOSURE-AVOIDANCE



 



                           



 



  DISCLOSURE AVOIDANCE PRACTICES AT THE CENSUS BUREAU. . . . . .367



        Brian Greenberg, U. S. Bureau of the Census



 



  THE MICRODATA RELEASE PROGRAM OF THE NATIONAL CENTER



  FOR HEALTH STATISTICS .. . . . . . . . . . . . . . . . . . ...377



        Robert H. Mugge, National Center for Health Statistics



        (retired)



 



  DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . 385



        George Duncan, Carnegie Mellon University



 



 



        Session 12 - FEDERAL LONGITUDINAL SURVEYS



 



 



 



  FEDERAL LONGITUDINAL SURVEYS . . . . . .  . . . . . . . . . . 393



        Daniel Kasprzyk, U. S. Bureau of the Census; Curtis



        Jacobs, U. S. Bureau of Labor Statistics



 



  THE ADVANTAGES AND DISADVANTAGES OF LONGITUDINAL SURVEYS. . ..407



        Robert W. Pearson, Social Science Research Council



 



  LONGITUDINAL ANALYSIS OF FEDERAL SURVEY DATA. . . . . . . . . 425



        Patricia Ruggles, Joint Economic Committee



 



  DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . 438



        Michael Brick, Westat, Inc.



 



  DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . .   447



        Marilyn E. Manser, U. S. Bureau of Labor Statistics



 



 



                 TOWARDS AN AGENDA FOR THE FUTURE



 



 



 



  Stephen E. Fienberg, Carnegie Mellon University . . . . . . . 455



 



  Margaret E. Martin. . . . . . . . . . . . . . . . . . . . . . 462



 



  Hermann Habermann, Office of Management and Budget. . . . . . 465



 






 



                      



 



                              Part 2



                             Session 5



                           DATA EDITING



 



 



 



 



 



 



 



 






 



    OVERVIEW OF DATA EDITING IN FEDERAL STATISTICAL AGENCIES



 



                         David A. Pierce



                      Federal Reserve Board



 



 



Abstract



 



     This paper is the first of three in the session on Data



Editing presenting highlights of the report "Data Editing in



Federal Statistical Agencies", Statistical Policy Working Paper 18,



OMB, prepared by the Subcommittee on Data Editing in Federal



Statistical Agencies, FCSM.  Included in this paper are a listing of



the Subcommittee members, a discussion of its mission statement



from the FCSM, definition and concepts of data editing, the major



areas investigated and the methods used to do so, the development



of case studies, and the Subcommittee's recommendations for data



editing in Federal statistical agencies.  The paper highlights the



findings from a survey of current data editing practices which was



conducted by the Subcommittee.



 



 



1. Introduction



 



     The Subcommittee on Data Editing in Federal Statistical



Agencies was established by the Federal Committee on Statistical



Methodology (FCSM) in November 1988 to document, profile, and



discuss the topic of data editing in Federal censuses and surveys.



The Subcommittee consisted of the following individuals:



 



     George Hanuschak, National Agricultural Statistics Service,



          Chair



     Yahia Ahmed, Internal Revenue Service



     Laura Bauer, Federal Reserve Board



     Charles Day, Internal Revenue Service



     Maria Gonzalez, Office of Management and Budget



     Brian Greenberg, Bureau of the Census



     Anne Hafner, National Center for Education Statistics



     Gerry Hendershot, National Center for Health Statistics



     Rita Hohenbrink, National Agricultural Statistics Service



     Renee Miller, Energy Information Administration



     Tom Petkunas, Bureau of the Census



     David Pierce, Federal Reserve Board



 



 



 



 






 



 



       Mark Pierzchala, National Agricultural Statistics Service



       Marybeth Tschetter, Bureau of Labor Statistics



       Paula Weir, Energy Information Administration



 



       A key aim of this effort was to further the awareness within



 agencies of each other's data editing practices, as well as of the



 state of the art of data editing, and thus to promote improvements



 in data quality throughout Federal statistical agencies.  To



 further these goals, the Subcommittee was given a "charge", or



 mission statement, of



 



       determining how data editing is currently being done in



       Federal agencies, recognizing areas that may need



       attention, and, if appropriate, recommending any



       potential improvements for the editing process.



 



 Among the many items investigated by the Subcommittee were the role



 of subject matter specialists; hardware, software, and the data



 base environment; new technologies of data collection and editing,



 such as CATI and CAPI; current research efforts in the various



 agencies; and some recently developed editing systems, such as those at



 the Census Bureau and Statistics Canada.



 



       In fulfilling its mission the Subcommittee followed a number



 of paths, including developing a questionnaire on survey editing



 practices, assembling several case studies of editing practices,



 investigating alternative editing systems and software, exploring



 research needs and practices, and compiling an annotated



 bibliography of literature on editing.  The result of the



 Subcommittee's work is its report (1990), organized into 5 main



 chapters with several supporting appendices as follows:



 



         Chapters                      Appendices



 



   I. Executive Summary             A.  Questionnaire Responses



  II. Background                    B.  Case Studies



 III. Current Editing Practices     C.  Software Functions Checklist



  IV. Editing Software              D.  Annotated Bibliography



   V. Research on Editing           E.  Glossary of Terms



 



 After discussing some general topics pertaining to editing and to



 the Subcommittee's work, this paper summarizes some of the main



 results of a questionnaire on Current Editing Practices, designed,



 administered and compiled by the Subcommittee.   The two papers



 immediately following address, respectively, the subjects of



 software developments and recent research findings in editing.



 



 



 



 



 



 



 






 



 2. Data Editing--Definition and Concepts



 



      The Subcommittee first addressed the definition of data



 editing.   While no universal definition of survey data editing



 exists, the following working definition was developed:



 



      Procedures designed and used for detecting erroneous



      and/or questionable survey data, with the goal of



      correcting (manually or electronically) as much of the



      erroneous data as possible (not necessarily all of the



      questioned data), usually prior to data imputation and



      summary procedures.



 



 Thus data editing can be seen as a data quality improvement tool by



 which erroneous or highly suspect data are found and (if necessary)



 corrected.  We have focused primarily on editing rather than



 imputation in our work, though in practice the boundary between



 these is not absolute.
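
 The working definition separates detection from correction: an edit
 first flags erroneous or questionable values, and only some of the
 flagged values are later changed.  The detection step might be sketched
 as follows.  This is an illustration only, not a procedure taken from
 the report; the field names (`total_acres`, `planted_acres`), the two
 rules, and the 50,000-acre threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EditResult:
    """Edits failed by one record; the record itself is left unchanged."""
    record_id: str
    failed_edits: list = field(default_factory=list)


def detect(record):
    """Flag erroneous or questionable values without correcting anything."""
    result = EditResult(record_id=record["id"])
    # Hypothetical consistency rule: a component cannot exceed its total,
    # so a failure here indicates erroneous data.
    if record["planted_acres"] > record["total_acres"]:
        result.failed_edits.append("planted_exceeds_total")
    # Hypothetical plausibility rule: the value is questionable, not
    # necessarily wrong, and may be left as reported after review.
    if record["total_acres"] > 50_000:
        result.failed_edits.append("unusually_large_total")
    return result


records = [
    {"id": "A1", "total_acres": 400, "planted_acres": 350},
    {"id": "A2", "total_acres": 120, "planted_acres": 210},
    {"id": "A3", "total_acres": 75_000, "planted_acres": 60_000},
]

# Only records that fail at least one edit are passed on for review.
flagged = [r for r in (detect(rec) for rec in records) if r.failed_edits]
for r in flagged:
    print(r.record_id, r.failed_edits)
```

 Keeping detection separate from correction leaves room for an analyst
 to decide that a questioned value is in fact correct.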



 



 3. Current Editing Practices



 



      To obtain a profile of current editing practices in the



 various Federal statistical agencies, the Subcommittee developed an



 editing questionnaire, which was completed for 117 Federal censuses



 and surveys representing 14 different Federal agencies.  These 117



 surveys were selected by Subcommittee members, and thus they were



 not a scientific sample of all Federal surveys; however, the



 Subcommittee felt that the 117 surveys represented a broad coverage



 of agencies and types of surveys or censuses that would present



 different editing situations.



 



      The Subcommittee members primarily involved with the



 questionnaire and editing profile were Charles Day, Yahia Ahmed,



 George Hanuschak, Rita Hohenbrink and Renee Miller.



 



      The questionnaire that was designed was a six-page document



 containing general questions about the particular survey as well as



 specific questions on editing.  The report contains a complete



 listing of the questions asked, along with a tally of the results



 obtained for the 117 surveys, and should serve as a useful



 reference for the current (1990) state of data editing practice.



 A few of the major results follow.



 



      Regarding general characteristics of the surveys, about three-



 fourths of the surveys are actually sample surveys, and the



 remaining one-fourth censuses.  A wide range of frequencies of



 collection are represented, from daily to quinquennial.  About one-



 fourth are completed by individuals, and three-fourths               by



 establishments.  While traditional means of data collection such as



 mail, personal and telephone interviews were most common, a small



 






 



 proportion of the surveys used CATI, and some were based on administrative



 records.



 



      Turning to editing, while the idea that there's no such thing



 as a free lunch seems to be as true of data editing as it is of



 anything else, there was wide variation in the actual cost of



 editing as a percent of total survey cost.  The median editing cost



 for the surveys was more than one-third of the total cost of the



 survey.   One of the interesting findings was that surveys of



 individuals had lower relative editing costs than surveys of



 establishments.



 



      The questionnaire also elicited information on when in the



 survey process the editing occurs.  For about two-thirds of the 117



 surveys, most of the data editing takes place after data entry.



 Editing at the time of data entry is on the increase but not yet



 common.



 



      Subject matter analysts play a large and important role in



 data editing.  In about three-fourths of the surveys, subject



 matter analysts review all unusual or large cases. Only seven of



 the surveys had little or no intervention by subject-matter



 specialists.  In this regard, we found that surveys of



 establishments had heavier involvement from subject-matter



 specialists than surveys of individuals; and this could also be



 related to the finding, mentioned above, of lower editing costs in



 individual than in establishment surveys.



 



      The degree of automation in data editing varies considerably



 among the surveys in our study.  In about three-fifths of the



 surveys, automated edit checking is done, but error correction is



 performed by clerks or analysts.  In about one-third of the cases,



 only unusual situations are referred to analysts.  Only 3% of the



 surveys were totally automated, though all but 1% had at least some



 automation.



 



      There are different types of edits that are applied to



 surveys. Almost all the surveys in our study use validation



 editing, which detects inconsistent data within a record.  About



 five-sixths also use macro editing, where aggregated data are



 examined.   The majority of surveys use other types of edits as



 well, such as range edits, edits using historical data, and ratio



 edits, some of which may overlap.  Additional information is also



 utilized in editing many of the surveys, such as comparisons with



 other surveys, comparison to a value estimated by regression



 analysis, or the use of interquartile measures.
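
 The edit types named above can be illustrated with a short sketch.
 None of the bounds or function names below come from the report: the
 ratio limits of 0.5 and 2.0, the multiplier k = 1.5 on the
 interquartile range, and the 20 percent macro tolerance are assumptions
 chosen only to make the example concrete.

```python
import statistics


def range_edit(value, low, high):
    """Range edit: pass when the value lies inside fixed bounds."""
    return low <= value <= high


def ratio_edit(current, previous, low=0.5, high=2.0):
    """Historical ratio edit: pass when current/previous stays in bounds."""
    if previous == 0:
        return False  # no usable prior value, so the item is flagged
    return low <= current / previous <= high


def interquartile_flags(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


def macro_edit(current_total, previous_total, tolerance=0.20):
    """Macro edit: pass when the aggregate moved by at most `tolerance`."""
    if previous_total == 0:
        return False
    return abs(current_total - previous_total) / previous_total <= tolerance


# Hypothetical reported values for one item across seven respondents.
reported = [10, 12, 11, 13, 12, 11, 95]

print(range_edit(12, 0, 50))          # micro-level range check
print(ratio_edit(95, 12))             # comparison with a prior-period value
print(interquartile_flags(reported))  # values far from the middle half
print(macro_edit(sum(reported), 80))  # aggregate compared with prior total
```

 As the text notes, these edits may overlap: the same extreme report can
 fail the ratio edit, be flagged by the interquartile measure, and move
 the aggregate enough to fail the macro edit.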



 



      Satisfaction with the current editing system varied widely.



 About half the respondents were satisfied with their current



 editing systems, and another one-fourth felt only minor changes



 were needed.  The remaining one-fourth thought major changes were



 needed, with 5% of those being in favor of a complete overhaul.



 






 



Among those desiring improvements, those most frequently mentioned



were:



 



     an on-line system for data editing,



     the use of prior periods' data to test the current period,



     more statistical edits,



     more sophisticated validation and macro editing,



     an audit trail,



     more automation, particularly automated error correction,



     user-friendlier systems,



     incorporation of imputation into the error package,



     evaluation of effects of data editing,



     reduction of the number of edit flags to follow up,



     incorporation of information on auxiliary variables,



     greater use of Expert Systems, and



     multivariate editing.



 



An audit trail, or a complete record of the original and corrected



data, the edits failed and any other relevant information, is very



helpful in monitoring and improving the editing process.  The



importance of an evaluation of the effects of editing on the data,



and our current lack of knowledge of such effects, have also been



noted by Bailar (1990).
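
An audit trail of the kind described above can be kept by logging each
correction before it is applied.  The sketch below is one possible
layout, not a format prescribed by the report; the class `AuditEntry`,
the function `apply_correction`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    """One correction: what changed, why, who changed it, and when."""
    record_id: str
    field_name: str
    original: object
    corrected: object
    failed_edit: str
    editor: str
    timestamp: str


def apply_correction(record, field_name, new_value, failed_edit, editor, trail):
    """Log the original and corrected values, then update the record."""
    trail.append(
        AuditEntry(
            record_id=record["id"],
            field_name=field_name,
            original=record[field_name],
            corrected=new_value,
            failed_edit=failed_edit,
            editor=editor,
            timestamp=datetime.now(timezone.utc).isoformat(),
        )
    )
    record[field_name] = new_value


trail = []
record = {"id": "A2", "total_acres": 120, "planted_acres": 210}
apply_correction(record, "planted_acres", 120,
                 failed_edit="planted_exceeds_total",
                 editor="analyst_01", trail=trail)
print(record)
print(trail[0])
```

Because every entry keeps both the original and the corrected value,
the same trail can later support an evaluation of how much editing
changed the data.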



 



4. Case Studies



 



     In addition to the breadth of valuable information obtained



from the questionnaire, the Subcommittee also felt that an



examination of a relatively few surveys in greater depth would shed



light on the complexity of the different editing situations in



operation.  Therefore several case studies are described, some in



two-paragraph summary format and others in greater detail.  These



comprise Appendix B of the report.  Anne Hafner and Yahia Ahmed had



primary responsibility for preparation of the Case Studies.



 



5. Recommendations



 



     The report lists a number of recommendations for future data



editing practice, some general and some specific. Many of them



fall into the following general categories.



 



     The quality of an agency's existing editing practices and



     technology should be examined in the light of possible



     improvements or alternatives, with respect to such



     criteria as cost efficiency, timeliness, statistical



     defensibility, and accuracy.



 



     Important recent developments in data processing, such as



     new microcomputers, workstations, local area networks,



     data base software, and mainframe linkages, should be



 






 



    examined for their possible incorporation into the survey



     editing process.



 



     Agencies should stay in communication with each other and



     with other professionals regarding their research in



     editing, particularly the development and implementation



     of new editing procedures and related methodologies such



     as data base technologies and expert systems.



 



References



 



Bailar, Barbara (1990), "Discussion of 'Survey Quality Profiles'",



Seminar on the Quality of Federal Data, May 23, 1990, COPAFS.  This



Proceedings.



 



Groves, Robert (1990), "Towards Quality in a Working Paper Series



on Quality", Keynote Address, Seminar on the Quality of Federal



Data, May 23, 1990, COPAFS.  This Proceedings.



 



Hanuschak, George, Yahia Ahmed, Laura Bauer, Charles Day, Maria



Gonzalez, Brian Greenberg, Anne Hafner, Gerry Hendershot, Rita



Hohenbrink, Renee Miller, Tom Petkunas, David Pierce, Mark



Pierzchala, Marybeth Tschetter and Paula Weir (1990), Data Editing



in Federal Statistical Agencies, Statistical Policy Working Paper



18, Statistical Policy Office, Office of Management and Budget,



Washington, DC.



 



 



 



 



 



 



 



 






 



                          EDITING SOFTWARE



          (An excerpt from Chapter IV of Working Paper 18)



 



                           Mark Pierzchala



              National Agricultural Statistics Service



 



 



 A. Introduction



 



      For most surveys, large parts of the editing process are



 carried out through the use of computer systems.  The task of the



 Software Subgroup has been to investigate software that in some way



 incorporates new methodologies, has new ways of presenting data,



 operates in recently developed hardware environments, or integrates



 editing with other functions.  In order to fulfill this charge, the



 Subgroup has evaluated or been given demonstrations of new editing



 software.  In addition, the Subgroup has developed an editing             

 

 software evaluation checklist that appears in Appendix C of



 Statistical Policy Working Paper 18. This checklist contains



 possible functions and attributes of editing software, which an



 organization may find useful when evaluating editing



 software.



 



      Extremely technical jargon can be associated with new editing



 systems; and new approaches to editing may not be familiar to the



 reader.  The purpose of section B is to explain these approaches



 and their associated terminology as well as to discuss briefly the



 role of editing in assuring data quality.



 



      A distinction must be made between generalized systems and



 software meant for one or a few surveys.  The former is meant to be



 used for a variety of surveys.  Usually there is an institutional



 commitment to spend staff time and money over several years to



 develop the system.  It is hoped that the investment will be more



 than recaptured after the system is developed through the reduction



 in resources spent on editing itself and in the elimination of



 duplication of effort in preparing editing programs. Some software



 programs have been developed that address specific problems in a



 particular survey.  While the ideas inherent in this software may



 be of general interest, it may not be possible to apply the



 software directly to other surveys.  Section C of Chapter IV of



 Working Paper 18 describes three generalized systems in some



 detail, and then briefly describes other systems and software.



 These three systems have been used or evaluated by Subgroup members



 in their own surveys.



 



      New and exciting statistical methodology is also improving the



 editing process.  This includes developments in detecting outliers,



 aggregate level data editing, imputation strategy, and statistical



 quality control of the process itself.  The implementation of these



 activities, however, requires that the techniques be encoded into



 a computer program or system.



 






 



B. Software Improving Quality and Productivity



 



 Reasons for the Development of New Editing Software



 



      Traditional editing systems do not fully utilize the talents



 or expertise of subject matter specialists.  Much of their time may



 be spent in dealing with unimportant or spurious error signals and



 in coping with system shortcomings.  As a result, the specialist



 has less time to deal with important problems.  In addition,



 editing systems may be able to give feedback on the survey itself.



 For example, a pattern of edit failures may suggest



 misunderstandings by the respondent or interviewer.  If this is



 recognized, the expertise of the specialist may then be used



 to improve the survey itself.



 



      Labor costs are a large part of the editing costs and are



 either steady or increasing, whereas the cost of computing is



 decreasing.  In order to justify the heavy reliance on people in



 editing, their productivity will have to be improved through the



 use of more powerful tools.  However, even if productivity is



 improved, different people may do different things in similar



 situations.  If so, this makes the process less repeatable



 (reproducible) and more subject to criticism.  When work is done on



 paper, it is hard to track, and it is impossible to estimate the



 effect of editing actions on estimates.  Finally, some tasks are



 beyond the capability of human editors.  For example, it may be



 impossible for a person to maintain the multivariate frequency



 structure of the data when making changes.



 



      These reasons and several others are commonly given as



 explanations for the increased use of computer software to improve



 the editing process.  It is in the reconciliation of these two



 goals (the increased use of computers for some tasks and the more



 intelligent use of human expertise) that the major challenge in



 software development lies.  There will always be a role for people,



 but it will be modified.  One positive feature of new editing



 software is that it can often improve the quality of the editing



 process and productivity at the same time.



 



 



 Ways That Productivity Can Be Improved



 



      One way to improve productivity is to break the constraints



 imposed by computer systems themselves. The use of mainframe



 systems for editing data is widespread.  In some cases, however,



 an editor may not use the system directly.  For example, error



 signals may be presented on paper printouts, and changes entered by



 data typists.  Processing costs may dictate that editing jobs are



 run at low priority, overnight, or even less frequently.  The



 effect of the changes made by the editor may not be immediately



 



 






 



  known: thus, paper forms may be filed, taken from files, and



   re-filed several times.



 



       The proliferation of microcomputers promises to eliminate many



   of these bottlenecks, while at the same time it creates some



   challenges in the process.  The editor will have direct access to



   the computer, and will be able to prioritize its use.  Once the



   microcomputer is acquired, user fees are eliminated; thus



   resource-intensive programs such as interactive editing can be



   employed, provided the microcomputers are fast enough.  Moving from



   a centralized environment (i.e., the mainframe) to a decentralized



   environment (i.e., microcomputers) will present challenges of



   control and consistency.  In processing a large survey on two or



   more microcomputers, communications will be necessary. This will



   best be done by connecting them into a Local Area Network (LAN).



 



        New systems may reduce or eliminate some editing tasks.  For



   example, where data are edited in batch and error signals are



   presented on printouts, a manual edit of the questionnaires before



   the machine edit may be a practical necessity.  Editing data and



   error messages on a printout can be a hard, unsatisfactory chore



   because of the volume of paper and the static and sometimes



   incomplete presentation of data.  The purpose of the manual edit in



   this situation is to reduce the number of machine-generated error



   signals.  In an interactive environment, information can be



   efficiently presented and immediately processed.  The penalty



   associated with machine-generated signals is greatly reduced. As



   a result, the preliminary manual edit may be eliminated.  In



   addition, questionnaires are handled only once, further reducing



   filing and data entry tasks.



 



        Productivity may be increased by reducing the need for editing



   after data are collected.  Instruments for Computer Assisted



   Telephone Interviewing (CATI), Computer Assisted Personal



   Interviewing (CAPI), and on-site data entry and editing programs



   are gaining wider use.  Routing instructions are automatically



   followed, and other edit failures are verified at the time of the



   interview.  There may still be many error signals from suspicious



   edits; however, the analyst has more confidence in the data and is



   more likely to let them pass.



 



        There are two major ways that productivity can be improved in



   the programming of the editing instruments.  First is to provide a



   system that will handle all, or an important class, of the agency's



   editing needs.  In this way the applications programmer need not



   worry about systems details.  For example, in an interactive



   system, the programmer does not have to worry about how and where



   to flag edit failures as it is already provided.  The programmer



   only codes the edit specification itself.  In addition, the



   end-user has to learn only one system when editing different



   surveys.  Second is the elimination of multiple specification and



   programming of variables and edits.  For example, if data are



 






 



  collected by CATI, and edited with another system, then essentially



   the same edits will be programmed twice, possibly by two sets of



   people.  If the system integrates several functions, e.g., data



   entry, data editing, and computer assisted data collection, then



   one program may be able to handle all of these tasks.  This



   integration would also reduce time spent on data conversion from



   one system to another.



 



 



   Systems That Take Editing and Imputation Actions



 



        Some edit and imputation systems take actions usually reserved



   for people.  They choose fields to be changed and then change them.



   The human element is not removed; rather, this expertise is



   incorporated into the system.  One way to incorporate expertise is



   to use the edits themselves to define a feasible region.  This is



   the approach outlined in a famous article by Fellegi and Holt



   (1976).  Edits that are explicitly written are used to generate



   implied edits. For example, if 100 < x / y  < 200, and 3 <



   y / z < 4, are explicit edits, then an implied edit obtained



   algebraically is 300 < x / z < 800.  Once all implied edits are



   generated, the set of complete edits is defined as the union of the



   explicit and implied edits.  This complete set of edits is then



   used to determine a set of fields to be changed for every possible



   edit failure.  This is called error localization. An essential



   aspect of this method is that changes are made to as few fields as



   possible, or alternatively to the least reliable set of fields



   which are determined by weights given to each field.
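
The chaining of ratio edits and the choice of fields to change can be sketched in Python.  This is a simplified illustration under stated assumptions, not the Fellegi and Holt algorithm itself: it handles only ratio edits on positive fields, derives the single implied edit from the example above, and searches all field subsets by brute force.  The function names, the sample record, and the equal weights are hypothetical.  A higher weight is assumed here to mark a field as more reliable, and so more costly to change.

```python
from itertools import combinations

# Each ratio edit is (numerator, denominator, lower, upper) and is passed
# when lower < numerator / denominator < upper.  Fields are assumed positive.
EXPLICIT_EDITS = [("x", "y", 100.0, 200.0), ("y", "z", 3.0, 4.0)]


def implied_ratio_edit(first, second):
    """Chain a/b and b/c into an implied edit on a/c by multiplying bounds."""
    num1, den1, low1, high1 = first
    num2, den2, low2, high2 = second
    if den1 != num2:
        raise ValueError("edits do not share a linking field")
    return (num1, den2, low1 * low2, high1 * high2)


def failed_edits(record, edits):
    """Return the edits that the record does not pass."""
    return [e for e in edits if not e[2] < record[e[0]] / record[e[1]] < e[3]]


def localize_errors(record, edits, weights):
    """Find the lowest-weight set of fields that enters every failed edit."""
    failed = failed_edits(record, edits)
    best_cost, best_fields = None, frozenset()
    if not failed:
        return best_fields
    names = sorted(record)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            covers_all = all(e[0] in subset or e[1] in subset for e in failed)
            cost = sum(weights[name] for name in subset)
            if covers_all and (best_cost is None or cost < best_cost):
                best_cost, best_fields = cost, frozenset(subset)
    return best_fields


# Complete set of edits: the explicit edits plus the implied edit on x / z.
complete_edits = EXPLICIT_EDITS + [implied_ratio_edit(*EXPLICIT_EDITS)]

# Hypothetical record: x / y and x / z both fail, y / z passes.
record = {"x": 1500.0, "y": 3.5, "z": 1.0}
equal_weights = {"x": 1.0, "y": 1.0, "z": 1.0}
fields_to_change = localize_errors(record, complete_edits, equal_weights)
```

With only the explicit edits, the failure of x / y would leave x and y equally suspect.  The implied edit on x / z also fails, so changing x alone is enough to address both failures, which is why the complete set of edits is generated before error localization.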



 



        The analyst is given an opportunity to evaluate the explicit



   edits.  This is done through the inspection of the implied edits



   and extremal records (the most extreme records that can pass



   through the edits without causing an edit failure).  In inspecting



   the implied edits, it may be determined if the data are being



   constrained in an unintended way. In inspecting extremal records,



   the analyst is presented with combinations of the most extreme



   values possible that can pass the edits.  The human editor has



   several ways to inject expertise into this kind of a system:  (1)



   the specification of the edits;  (2) the inspection of implied



   edits and extremal records and then the re-specification of edits;



   (3) the weighting of variables according to their relative



   reliability.



 



        There are some constraints in systems that allow the computer



   to take editing actions.  Fellegi and Holt systems cannot handle



   certain kinds of edits, notably nonlinear and conditional edits.



   Also, algorithms that can handle categorical data cannot handle



   continuous data and vice versa.  Within these constraints (and



   others), most edits can be handled.  For surveys with continuous



   data, a considerable amount of human attention may still be



   necessary, either before the system is applied to data or after.



 



 






 



      Another way that computers can take editing actions is by



 modeling human behavior.  This is the "expert system" approach.



 For example, if typically maize yields average 100 bushels per



 acre, and the value 1,000 is entered, then the most likely



 correction is to assume that an extra zero was typed.  The computer



 can be programmed to substitute 100 for 1,000 directly and then to



 re-edit the data.
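
A rule of this kind might be encoded as in the following Python sketch.  It is an illustration only: the function name, the action labels, and the tolerance of one half of the typical value are assumptions, not features of any system described here.

```python
def correct_extra_zero(value, typical, tolerance=0.5):
    """Return (value_to_use, action) for one reported yield.

    A value is treated as plausible when it lies within
    tolerance * typical of the typical value (an assumed rule).
    """
    def plausible(candidate):
        return abs(candidate - typical) <= tolerance * typical

    if plausible(value):
        return value, "accepted"
    # Model the editor's guess that one extra zero was typed.
    if plausible(value / 10):
        return value / 10, "extra zero removed"
    # No modeled correction applies, so leave the value for a person.
    return value, "referred to analyst"


corrected, action = correct_extra_zero(1000, typical=100)
```

The corrected value would then be passed through the edits again, as the text describes.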



 



 



 Ways That Data Quality Can Be Improved or Maintained



 



      It is not clear that editing done after data collection can



 always improve the quality of data by reducing non-sampling errors.



 An organization may not have the time or budget to recontact many



 of the respondents or may refrain from recontacts in order to



 reduce respondent burden.  Additionally, there may be cognitive



 errors or systematic errors that an edit system cannot detect.



 Often, all that can be done is to maintain the quality of the data



 as they are collected.  To use the maize yield example again, if



 the edit program detects 1,000 bushels per acre, and sets the value



 to 100 bushels per acre, then the edit program has only prevented



 the data from getting worse.  Suppose the true value was really 103



 bushels per acre.  The edit and imputation program could not get



 the value closer to the truth in this case.  Detecting outliers is



 usually not the only problem.  The proper action to take after



 detection is the more difficult problem.  One of the main reasons



 that Computer Assisted Data Collection is employed is that data are



 corrected at the time of collection.



 



      There are a few ways that an editing system may be able to



 improve data quality. A system that captures raw data, keeps track



 of changes, and provides well conceived reports, may provide



 feedback on the performance of the survey.  This information can be



 used to improve the survey in the future.  To take another



 agricultural example, farmers often harvest corn for silage (the



 whole plant is harvested, chopped into small pieces, and blown into



 a silo). Production of silage is requested in tons.  Farmers often



 do not know their silage production in tons.  Instead, the farmer



 will give the size (diameter and height) of all silos containing



 silage.  In the office, silo sizes are converted into tons of



 production.  If this conversion takes place before data are



 entered, then there is no indication from the machine edit of the



 extent of this reporting problem.
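
One way to keep this information is to key what the farmer actually reported and let the program do the conversion.  The Python sketch below assumes cylindrical silos and a conversion factor supplied by the caller.  The record layout, the function names, and the factor used in the usage line are hypothetical placeholders, not real conversion figures.

```python
import math


def silo_tons(diameter_ft, height_ft, tons_per_cubic_ft):
    """Tons of silage in one cylindrical silo, given an assumed density."""
    radius = diameter_ft / 2
    return math.pi * radius ** 2 * height_ft * tons_per_cubic_ft


def capture_silage(record_id, tons=None, silos=None, tons_per_cubic_ft=None):
    """Store what was reported, how it was reported, and the derived tons."""
    if tons is not None:
        return {"id": record_id, "reported_as": "tons",
                "raw": tons, "tons": tons}
    derived = sum(silo_tons(d, h, tons_per_cubic_ft) for d, h in silos)
    return {"id": record_id, "reported_as": "silo dimensions",
            "raw": list(silos), "tons": derived}


# Placeholder factor of 0.02 tons per cubic foot, for illustration only.
entry = capture_silage("farm-001", silos=[(20.0, 50.0)], tons_per_cubic_ft=0.02)
```

Counting the records whose "reported_as" value is "silo dimensions" would then show the extent of the reporting problem.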



 



      Another way that editing software can improve the quality of



 the data is to reduce the opportunity cost of editing.  The time



 spent on editing leaves less time for other tasks, such as



 persuading people to participate, checking overlap of respondents



 between multiple frames, and research on cognitive errors.



 



 



 



 






 



Ways That Quality of the Editing Process Can Be Defended or



 Confirmed



 



      There is a difference between data quality and the quality of



 the editing process itself.  To refer once again to the maize yield



 example, a good quality process will have detected the



 transcription error.  A poor quality process might have let it



 pass.  Although neither process will have improved data quality,



 the good quality process would have prevented their deterioration



 from the transcription error.  Editing and imputation have the



 potential to distort data as well as to maintain their quality.  



 This distortion may affect the levels of estimates and the



 univariate and multivariate distributions.  A high quality process



 will attempt to minimize distortions.  For example, in Fellegi and



 Holt systems, changes to the data will be made to the fewest fields



 possible and in a way such that distributions are maintained.



 



      A survey organization should be able to show that the editing



 process is not abusing the data.  For editing after data



 collection, this may be done by capturing raw (unedited) data and



 keeping track of changes and the reasons for change.  This is



 called an audit trail.  Given this record keeping, it will be



 possible to estimate the impact of editing and imputation on



 expansions and on distributions.  It will also be possible to



 determine the editor effect on the estimates.  In traditional batch



 mode editing on paper printouts, it is not unusual for two or more



 specialists to edit the same record.  For example, one may edit the



 questionnaire before data entry while another may edit the record



 after the machine edit.  In this case, it is impossible to assign



 responsibility for an editing action.  In an on-line mode one



 person handles a record until it is done.  Thus all changes can be



 traced to a person.  For editing at the time of data collection



 (e.g., in CATI), it may be necessary to conduct an experiment to



 see if either the mode of collection, or the edits employed, will



 lead to changes in the data.
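
A minimal audit trail can be sketched in Python as follows.  The class and field names are hypothetical, and the totals are unweighted for simplicity, whereas a survey would apply its expansion factors.  The raw data are never overwritten, so the effect of editing, and of each editor, can be computed afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    record_id: str
    field_name: str
    old_value: float
    new_value: float
    editor: str
    reason: str


@dataclass
class AuditedFile:
    raw: dict                      # record_id -> {field_name: raw value}
    trail: list = field(default_factory=list)

    def value(self, record_id, field_name):
        """Raw value with all logged changes applied in order."""
        current = self.raw[record_id][field_name]
        for change in self.trail:
            if (change.record_id, change.field_name) == (record_id, field_name):
                current = change.new_value
        return current

    def edit(self, record_id, field_name, new_value, editor, reason):
        """Log a change instead of overwriting the raw value."""
        old = self.value(record_id, field_name)
        self.trail.append(
            Change(record_id, field_name, old, new_value, editor, reason))

    def total(self, field_name, edited=True):
        """Unweighted total of one field, after or before editing."""
        if edited:
            return sum(self.value(rid, field_name) for rid in self.raw)
        return sum(rec[field_name] for rec in self.raw.values())

    def impact_by_editor(self, field_name):
        """Net change to the total attributable to each editor."""
        impact = {}
        for c in self.trail:
            if c.field_name == field_name:
                impact[c.editor] = impact.get(c.editor, 0) + c.new_value - c.old_value
        return impact


data = AuditedFile(raw={"r1": {"yield": 1000.0}, "r2": {"yield": 95.0}})
data.edit("r1", "yield", 100.0, editor="ed01", reason="extra zero")
```

Comparing `data.total("yield", edited=False)` with `data.total("yield")` gives the impact of editing on the total, and `data.impact_by_editor("yield")` attributes that impact to individual editors.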



 



      A high quality editing process will have other features as



 well.  For example, the process should be repeatable, in time and



 in space.  This means that the same data passed through the same



 process in two different locations, or twice in one location, will



 look (nearly) the same.  The process will have recognizable



 criteria for determining when editing is done.  It will detect



 real errors without generating too many spurious error signals.



 The system should be easy to program in and have an easy user



 interface.  It should promote the integration of survey functions



 such as micro- and macro-editing.  Changes made by people should



 be on-line (interactive) and traceable.  Database connections will



 allow for quick and easy access to historical and sampling frame



 data.   An editing system should be able to take actions of minor



 impact without human intervention.  It should be able to



 accommodate new advances in statistical editing methodology.



 






 



Finally, quality can be promoted by providing statistically



defensible methods and software modules to the user.



 



Acknowledgements



 



    Other members of the Editing Software Working Group for



Working Paper 18 were Tom Petkunas, Bureau of the Census, Gerry



Hendershot, National Center for Health Statistics, Charles Day,



Internal Revenue Service, Marybeth Tschetter, Bureau of Labor



Statistics, and Rita Hohenbrink, National Agricultural Statistics



Service.



 



 



 



 



 



 



 



 






 



                         RESEARCH ON EDITING



 



                             Yahia Ahmed



                      Internal Revenue Service



 



 Introduction



 



      This paper is one of three papers presented in a session



 organized to present topics from the Statistical Policy Working



 Paper 18, "Data Editing in Federal Statistical Agencies."  The



 Subcommittee on Data Editing in Federal Statistical Agencies was



 established by the Federal Committee on Statistical Methodology to



 document, profile and discuss data editing practices in Federal



 surveys.  To effectively accomplish its mission, the subcommittee



 was divided into four major groups: Editing Profile, Case Studies,



 Editing Software, and Editing Research.



 



      The purpose of this paper is to present briefly the goals,



 findings and recommendations of the Editing Research Group.  A more



 detailed description of editing research is provided in Chapter V



 of the Working Paper.



 



      The goals of the Editing Research Group were to identify areas



 in which improvements to edit systems would prove most useful, to



 describe recent and current research activities designed to enhance



 edit capabilities, to make recommendations for future research, and



 to develop an annotated bibliography on editing.



 



 Areas Which Need Improvement



 



      The Editing Research Group used two sources of information to



 identify areas which need improvement.  The first source was the



 editing profile questionnaire which was administered to managers of



 117 Federal surveys covering 14 different agencies.  This



 questionnaire included questions about edit improvements.  One



 question asked was "For future applications, what would you like



 your edit system to do that it doesn't do now?" The second source



 was discussions with those responsible for edit tasks within a



 number of Federal agencies.  The following areas emerged as



 priorities:



 



 o    More on-line edit capabilities



 o    Better ways to detect potentially erroneous responses



 o    More sophisticated and extensive macro-editing



 o    Evaluation of the effect of data editing.



 



 






 



 Areas of Edit Research



 



      Much editing research has been conducted in national



 statistical offices around the world.  It is these organizations,



 which conduct huge and complicated surveys, that have the most to



 be gained from developing new systems and techniques.  They also



 have the resources upon which to draw for this development.



 



      One area of current research interest is that of "on-line



 edit capabilities".  BLAISE, SPEER, and PEDRO, discussed in the



 preceding paper, are examples of such research activities.



 



      A second area of active research is in the detection of



 potentially erroneous responses.  The method most commonly used is



 to employ explicit edit rules.  For example, edit rules may require



 that:



 



   1) the ratio of two fields lie between prescribed bounds,



 



   2) various linear inequalities and/or equalities hold, or



 



   3) the current response be within some range of a predicted



      value based on a time series or other models.
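
The three kinds of rules listed above might be written as simple checks, as in this Python sketch.  The function names, the choice of a balance equality as the linear example, and all bounds and tolerances are illustrative assumptions, since edit rules and parameters are survey specific.

```python
def ratio_edit(numerator, denominator, lower, upper):
    """Rule 1: the ratio of two fields lies between prescribed bounds."""
    if denominator == 0:
        return False  # ratio undefined, so the record fails the edit
    return lower <= numerator / denominator <= upper


def balance_edit(components, total, tolerance=0.0):
    """Rule 2 (a linear equality): components add to the reported total."""
    return abs(sum(components) - total) <= tolerance


def prediction_edit(current, predicted, relative_range):
    """Rule 3: the response is within a range of a predicted value."""
    return abs(current - predicted) <= relative_range * abs(predicted)


# Each function returns True when the record passes the edit.
checks = [
    ratio_edit(150.0, 1.0, lower=100.0, upper=200.0),
    balance_edit([40.0, 60.0], total=100.0),
    prediction_edit(103.0, predicted=100.0, relative_range=0.10),
]
```

A record that fails any check would be flagged for review rather than changed automatically.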



 



      Edit rules and parameters are highly survey specific.  A



 related area of editing research is the design of edit rules and



 the development of methods for obtaining sensitive parameters.



 



      In order to make sure that all errors are flagged, often many



 unimportant error flags are generated.  These extra flags not only



 take time to examine but also distract the reviewer from important



 problems.  These extra flags are generated because of the way that



 the error limits are set.  A related area of research focuses on



 developing statistical editing techniques to reduce the number of



 error flags while, at the same time, ensuring that not many errors



 escape detection.  Several research studies in which different



 statistical techniques (such as clustering, exponential smoothing



 and Tukey's biweight) are used to detect potentially erroneous responses or



 to set error bounds are described in the working paper.
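
As one illustration of setting error bounds statistically, the Python sketch below uses simple exponential smoothing of a respondent's past reports to center the limits.  It is not taken from the studies in the working paper.  The smoothing constant, the relative width of the limits, and the function names are assumptions chosen for the example.

```python
def smoothed_level(history, alpha=0.3):
    """Simple exponential smoothing of past reports, oldest first."""
    level = history[0]
    for observation in history[1:]:
        level = alpha * observation + (1 - alpha) * level
    return level


def error_bounds(history, alpha=0.3, relative_width=0.25):
    """Lower and upper limits centered on the smoothed level."""
    level = smoothed_level(history, alpha)
    return level * (1 - relative_width), level * (1 + relative_width)


def is_flagged(current, history, alpha=0.3, relative_width=0.25):
    """True when the current report falls outside the error bounds."""
    lower, upper = error_bounds(history, alpha, relative_width)
    return not lower <= current <= upper


past_reports = [100.0, 100.0, 100.0]
flags = [is_flagged(110.0, past_reports), is_flagged(1000.0, past_reports)]
```

Widening `relative_width` reduces the number of flags at the cost of letting more errors pass, which is the trade-off described in the paragraph above.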



 



      In contrast to the rule-driven method for the detection of



 potentially erroneous response combinations within a record, one



 alternative procedure is to analyze the distribution of



 questionnaire responses.  Records which do not conform to the



 observed distribution are then targeted as outliers and are



 selected for review.  Although there has been research interest in



 this method, no application of these multivariate methods was



 found.



 



 



 



 



 






 



 Recommendations



 



      The most important recommendation is that agencies recognize



 the value of editing research and place a high priority on



 devoting resources to their own research, to monitoring



 developments in data editing at other agencies and elsewhere and to



 implementing improvements.



 



      Often innovations in editing methods made by survey staff are



 viewed as enhancements to processing for that particular survey and



 little thought is given to the broader applicability of methods



 developed.  Accordingly, survey staff do not prepare discussions of



 new methods for publication.  We encourage survey staff to take the



 time to describe their work and publish it in order to share



 their experiences with others who may be working under similar



 conditions.  It is often in such articles that methods which may be



 applicable to more than one survey are first introduced and



 described.



 



      The survey on editing practices indicated that there was



 little analysis of the effect of editing on the estimates that were



 produced.  Considering that the cost of editing is significant for



 most surveys, this is clearly an area in which more work is



 required.  A related issue is the need to attempt to determine when



 to edit and not to edit.



 



      Clearly, not all the errors are going to be found, and we



 should not attempt to find them all.  Therefore, there is a need to



 design guidelines for determining what is an acceptable level of



 editing.



 



      Another neglected research area in this country concerns the



 editing of data at the time they are keyed from mail responses.



 This area is usually discussed in the setting of quality control;



 however, it is an area that can benefit from further research from



 the perspective of data editing.



 



 



 Annotated Bibliography



 



      It is quite difficult to provide a complete assessment of



 current research activities in the area of editing because so much



 of the research, progress, and innovations are described only in



 specific documentation.  However, the group was able to identify 86



 references which describe research efforts over the past years.



 Appendix D of the working paper contains the annotated



 bibliography.  The annotations are brief and are only intended to



 give a very general idea of the paper's content.  The appendix



 provides a valuable source of information on the editing



 literature.  In addition, it includes papers which describe the



 underlying methods, the software, proposed uses, and possible



 



 






 



  advantages of three generalized editing software systems -- GEIS,



   BLAISE and SPEER.



 



   Acknowledgements



 



        Other members of the Editing Research Group for Working Paper



   18 were Laura Bauer, Federal Reserve Board, Brian Greenberg, Bureau



   of the Census, Renee Miller, Energy Information Administration,



   David Pierce, Federal Reserve Board, Paula Weir, Energy Information



   Administration.



 



 



 



 



 



 



 



 






 



                               DISCUSSION



 



                           Charles E. Caudill



               National Agricultural Statistics Service



 



 



       As Administrator of a Federal-State Cooperative Statistical



  Agency, I am quite impressed with the information contained in OMB



  Statistical Policy Working Paper No. 18 on Data Editing in Federal



  Statistical Agencies.  The working paper thoroughly documents many



  existing editing practices and generalized editing software



  developments, and provides a detailed software evaluation protocol.



  In addition, it covers current research activities on editing,



  provides an annotated bibliography and has a good executive summary



  including recommendations.



 



       I believe that this report, if read and seriously considered



  by federal survey managers and administrators, can have a



  substantial effect on improving productivity.  Thus, "precious"



  resources could be freed up to more formally address nonsampling



  errors, quality control, and total survey error models,



  measurements and structures. In my opinion, if there was ever a



  report that survey administrators should take seriously, this is



  it.



 



      There are several more detailed comments and observations that



  I have about working paper number 18.  The data on the costs of



  editing were intriguing.  My observation is that there may be an



  upward bias in the data, and some non-editing cost may have been



  included.  However, even if this is the case, there obviously is



  still plenty of room for productivity gains in the editing process.



  With the proliferation of personal computer networks and data base



  software, there is substantial potential to improve the



  productivity of editing systems by being on-line and providing the



  editor with immediate screen feedback and re-editing of their



  proposed changes.



 



      Recent computer processing technology advances also make the



  use of audit trails more available for more users.  Inexpensive



  audit trails provide the capability to analyze and conduct research



  on the effects of editing on the estimators and also on the overall



  performance of the survey as well.



 



     The detailed checklist of edit software system features in



  Appendix C of working paper 18 will be beneficial to both the



  development of new systems and maintenance and evaluation of



  existing systems.  The annotated bibliography of articles and



  papers on editing presented in Appendix D will be valuable for



  researchers and system developers as a substantial source of



  literature and information.



 



 



 






 



        Working paper 18 certainly demonstrated that current data



  editing practices are labor intensive.  Many remain mainframe and



  batch oriented, with multiple passes of the data.  Also, I think



  that there may be a tendency to stay with existing systems too



  long.



 



       My final comments are on total quality management of surveys.



  As an Administrator, one of my major concerns is with the quality



  of the final products and reports that the Agency delivers to the



  public.  Thus, if the editing process can be made more efficient,



  without degrading accuracy, then that adds to the potential of



  using the saved resources on other important areas of the survey



  process.  Total quality management techniques applied to surveys



  are useful tools in efficiently identifying the most important



  potential sources of survey error.



 



                              DISCUSSION



 



                           Richard Bolstein



                       George Mason University



 



      The serious impact that erroneous survey data can have on



  results, the fact that the number of errors tends to increase with



  the size and complexity of the survey, and the relatively large



  proportion of survey costs currently required to edit and correct



  data, make the need for new and improved methods of data editing



  imperative.  To this end, the authors have done a laudable job in



  researching methods currently used, presenting several case



  studies, testing and discussing the advantages and disadvantages of



  some current and developing editing software, and providing a



  synopsis of current research.



 



      A working definition of editing was clearly necessary in this



  study, since, among other things, in order to estimate costs



  of editing, a fairly rigorous definition of the scope of editing was



  required.  The working definition used by the authors, namely,



  "procedure(s) designed and used for detecting erroneous and/or



  questionable survey data with the goal of correcting as much of the



  erroneous data as possible, usually prior to data imputation and



  summary procedures" is quite suitable for this purpose.  We should



  keep in mind, however, that while it feels comfortable to clean up



  erroneous data prior to imputation for missing data, in practice



  the two are often intertwined.



 



     The paper states that the cost of editing was available for



  40% of the 117 surveys in the sample, and cost estimates were



  possible for an additional 40%.  It was reported that between 75%



  and 80% of these surveys had editing costs of at least 20% of total



  costs.  It is not too meaningful to compare the relative costs of



  editing across all types of surveys, however, since one would



  naturally expect that these costs would be higher in less expensive



  surveys (such as mail or administrative records) than in expensive



  surveys (such as personal interview, surveys of institutions), as



  found by the authors.  Thus, it would be more informative if the



  relative cost figures cited above were reported by survey type.



  Another factor that can account for a large percentage of editing



  costs is the presence of a relatively large number of questions



  requiring open-ended responses and subsequent coding of the



  responses.  But although the distribution of the relative cost of



  editing may vary considerably, there is no doubt that editing is



  costly and methods to reduce this cost and improve data quality are



  much needed.



 



     Finally, no discussion of the costs of editing is complete



  without determining what percentage is due to bad data that should



  not have occurred but for inadequate interviewer training, poor



  supervision and quality control of interviewers, and simple common



 






 



 



  sense errors.  For these are errors which should not have occurred



  and should be deducted from the cost of editing in the estimates of



  the surveys above, since they are likely to have varied



  considerably.



 



       Although elimination of such unnecessary errors was not part



  of the project of the three authors, it seems appropriate in a



  discussion of improving data editing procedures to mention ways in



  which the need for editing can be reduced.  As an



  example of a common sense error that should be eliminated, in a



  certain survey, the sponsor of which I will not name, fishermen are



  interviewed and their catch is weighed and measured.  The



  interviewer is supposed to record weight in kilograms, but the



  scale used shows weight in both pounds and kilograms.  As expected,



  frequent errors occur. The obvious solution is to use a scale that



  only shows kilograms, but when I suggested this to the survey firm,



  the response was "no one makes such a scale".  When I then



  suggested taping over the side of the scale showing pounds, the



  reply was "but the fishermen want to know what their fish weigh in



  English".  Finally, I suggested taping over the kilogram side of



  the scale, having the interviewer record the weight in pounds, and



  having the data entry program convert it to kilograms.  The response



  to this suggestion I am sure you have all heard before: "well,



  that's the way we're used to doing it".  There are numerous other



  examples of course (for example, in some surveys interviewers are



  required to record the hour in military time).
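
  The last suggestion in the scale example, recording the weight in
  pounds and letting the data entry program convert it, can be
  sketched as follows.  This is only an illustration and not the
  software of any actual survey; the function name and the rounding
  to three decimals are assumptions made here.  The conversion factor
  is the exact international definition of the pound.

```python
# Hypothetical data entry helper: the interviewer keys the weight in
# pounds, and the program stores kilograms.
KG_PER_POUND = 0.45359237  # exact by international definition


def pounds_to_kilograms(pounds, ndigits=3):
    """Convert a keyed weight in pounds to kilograms, rounded for storage."""
    if pounds < 0:
        raise ValueError("weight cannot be negative")
    return round(pounds * KG_PER_POUND, ndigits)
```

  Calling pounds_to_kilograms on each keyed value at entry time means
  the interviewer never has to read the kilogram side of the scale.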



 



       The most promising methods to reduce editing costs and improve



  data quality (after elimination of the unnecessary errors) are



  found in interactive data entry software and in general editing



  software systems.  These methods seem appropriate for large,



  complex surveys, or surveys which are repeated.  For small one-time



  surveys the cost of purchasing, learning, and programming the



  software will most likely outweigh the savings, as is true even



  with CATI.  But this is generally not the case with surveys



  gathering Federal data.  The three generalized editing software



  systems studied in detail by Mark Pierzchala seem very promising,



  especially BLAISE because of its generality and ability to handle



  both categorical and continuous data.  GEIS and SPEER are specific



  to economic type surveys.



 



       To what extent can graphics or other theoretical tools be used



  in editing systems?  The STAR WARS software described uses graphics



  to compare edited values with the originals, but not to detect



  outliers.  The parallel coordinate system for graphic displays of



  high-dimensional data [see Miller and Wegman (1989), Wegman (1990)]



  may be used to detect outliers.  Yahia Ahmed noted that analysis of



  the multivariate distribution of questionnaire responses to flag



  records that don't conform to the distribution as outliers has been



  infrequently used, no doubt due to its complexity.  I believe that



  graphical methods for detecting outliers will meet with more



  acceptance than the multivariate analysis approach has, but it would



 






 



 not be cheap (time-wise) and probably would be best used as a final



  check rather than at the front-end of the editing task.
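
  To make the multivariate approach concrete, the sketch below flags
  records whose Mahalanobis distance from the sample mean is large.
  It is an illustration under stated assumptions, not the method of
  any system discussed here: it handles only two variables, the
  function name is hypothetical, and the cutoff of 3.0 is an arbitrary
  illustrative choice.

```python
from statistics import mean


def mahalanobis_flags(records, threshold=3.0):
    """Flag two-variable records far from the sample mean.

    `records` is a list of (x, y) pairs.  `threshold` is an assumed,
    illustrative cutoff on the Mahalanobis distance.
    """
    n = len(records)
    if n < 3:
        raise ValueError("need at least three records")
    xs = [r[0] for r in records]
    ys = [r[1] for r in records]
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in records) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    flags = []
    for x, y in records:
        dx, dy = x - mx, y - my
        # Squared distance d' S^-1 d, using the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        flags.append(d2 ** 0.5 > threshold)
    return flags
```

  A record can be flagged even when each of its values is individually
  plausible, because the distance reflects the relationship between
  the two variables.  This is the property that single-variable range
  checks lack.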



 



      Finally, I have two recommendations.  In view of the



  increasing abundance of software we will see in the future, we



  should construct a standard set of test data sets for evaluating



  present and future software editing systems.  Secondly, a one or



  two-day demonstration seminar of some of these systems would be



  well received.



 



 



  References



 



  Miller, J.J. and Wegman, E.J. (1989), "Construction of line



  densities for parallel coordinate plots", Technical Report No. 53,



  Center for Computational Statistics, George Mason University.



 



  Wegman, E.J. (1990), "Hyperdimensional data analysis using parallel



  coordinates", Journal of the American Statistical Association, to



  appear.



 



 



                                Session 6



                   COMPUTER ASSISTED STATISTICAL SURVEYS



 



 



 



 



 



 



 



 






 



   OVERVIEW OF COMPUTER ASSISTED SURVEY INFORMATION COLLECTION



 



                         Richard L. Clayton



                  U. S. Bureau of Labor Statistics



 



 



      This section provides a summary of Working Paper 19 on



 Computer Assisted Survey Information Collection (CASIC).  For



 additional information, we encourage you to see this document.



 



      The power of rapid calculating has been applied to virtually



 every phase of the survey process, including sample design and



 selection, and estimation.  The most important implication of these



 applications is that survey practitioners are allowed to consider



 a growing range of techniques which were not affordable prior to



 the availability of inexpensive and fast calculating capability.



 



      The field of computer assisted collection applications may be



 the area of greatest and most rapid change in survey methods.  This



 field includes the rapidly expanding variety of applications based



 on the availability of powerful and inexpensive computers.  Most



 familiar of the new techniques are CATI and CAPI.  However, a



 variety of other collection methods are being developed across the



 Federal government's statistical agencies, including Touchtone Data



 Entry, Prepared Data Entry and, more recently, Voice Recognition



 Entry.



 



      High quality published data begins with collecting high



 quality data from our respondents.  Much of survey processing



 addresses, and compensates for, weaknesses in the quality of the



 collected data and the data we do not collect.  Methods should be



 developed which capture data quickly and accurately and which allow



 respondents to answer our questions accurately and quickly.  With



 this in mind, we provided the results of research and development



 activities using new technological features throughout the Federal



 government in seeking new data collection methods, and in modifying



 the old, to improve the quality of data collection.



 



      For the purposes of this report, we defined computer assisted



 survey information collection methods as those using computers as



 a major feature in the collection of data from respondents, and in



 transmitting of data to other sites for post-collection processing.



 



      Goal:  The overall goal of Working Paper 19 was to provide



 information on new data collection methods and to challenge Federal



 survey managers to reconsider their operations in light of recent



 changes in the survey methods available or made attainable through



 changing technology, and to reassess their methods of accomplishing



 the common goal of providing the public with critical information



 that is accurate, timely and relevant.  We hope that by sharing



 information and experiences, others may gain and advance the



 overall effectiveness of governmental activities.



 






 



     Objectives:  The primary objective is to describe emerging



methods of interactive electronic data collection, the potential



benefits, and current examples of their use in Federal surveys.  In



describing current uses and tests, a secondary objective is to pose



questions about the implications of use of computer assisted



methods and try to suggest some answers.  These questions involve



such factors as quality, costs, and respondent reaction to



computerized surveys.



 



     Scope: The survey operations included in this report include



all of the activities and tasks from the transmittal of the



questionnaire, conduct of the interview, data entry, editing and



followup for nonresponse or edit reconciliation.



 



     The last major survey operation to benefit from automation is



data collection.  Computers were first applied to collection using



mainframes to control certain aspects of telephone collection, and



Computer Assisted Telephone Interviewing (CATI) was born.  The



first applications of CATI stimulated new research worldwide



evaluating the impact of CATI on the survey error profile and



costs.  CATI is now used to assist interviewers in all collection



activities, including scheduling calls, controlling detailed



interview branching, editing and reconciliation, providing much



greater control over the collection process and reducing many



sources of error.  At the same time, a tremendous amount of



information is captured by the computer, providing additional



insight into the data collection process.



 



     The ongoing advances in computer technology, and particularly



the advent of microcomputers, continue to offer additional



opportunities for improving the quality of published data.  The



first portable computers were quickly pressed into service to



duplicate the advantages of CATI in a personal visit environment.



Thus, Computer Assisted Personal Interviewing (CAPI) was launched



from the work in CATI.



 



     While CATI and CAPI represent advances for surveys requiring



interviewers, microcomputers are now finding important roles in



self-administered questionnaires, where interviewers are not



needed.



 



     Prepared Data Entry (PDE), developed by the Energy Information



Administration, allows respondents who have a compatible



microcomputer or terminal to access and complete the questionnaire



directly on their screen.



 



     Touchtone Data Entry (TDE), developed at the Bureau of Labor



Statistics, allows respondents to call a toll-free telephone



number.   Questions posed by a computer are answered using the



keypad of their touchtone telephone.  The machine repeats the



answers for verification with the respondent and stores them in a



database.  TDE systems are now commonplace for bank transfers, and



 






 



telephone call routing, as examples.  We have just applied



 existing technology to the data collection process.
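
 The touchtone exchange described above (prompt, keyed answer,
 read-back, storage) can be sketched as a small simulation.  This is
 an illustration only and not the Bureau of Labor Statistics system:
 the function name, the convention of pressing 1 to confirm and 2 to
 re-enter, and the limit of three attempts are all assumptions made
 here, and a dictionary stands in for the database.

```python
def run_tde_session(questions, keypresses, max_attempts=3):
    """Simulate a touchtone interview.

    `questions` is a list of (item_name, prompt) pairs; `keypresses` is
    an iterable of strings standing in for what the respondent keys.
    Returns (stored_answers, transcript).
    """
    keys = iter(keypresses)
    stored = {}      # stand-in for the database
    transcript = []  # what the machine "speaks"
    for item, prompt in questions:
        for _ in range(max_attempts):
            transcript.append(prompt)
            answer = next(keys)
            if not answer.isdigit():
                transcript.append("Please use digits only.")
                continue
            # Read the entry back digit by digit and ask for confirmation.
            transcript.append(
                f"You entered {' '.join(answer)}. "
                "Press 1 to confirm, 2 to re-enter."
            )
            if next(keys) == "1":
                stored[item] = int(answer)
                break
        # An item never confirmed stays missing, for later follow-up.
    return stored, transcript
```

 Only confirmed entries are stored in this sketch, so a mis-keyed
 value that the respondent rejects at read-back never reaches the
 data file.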



 



      As an extension of this approach, techniques have been



 developed more recently allowing respondents to answer the



 questions by speaking directly into the telephone.  The incoming



 sounds are matched to known patterns recognizing the digits and the



 words "yes" and "no".  Voice Recognition Entry (VRE), as this is



 known, is not in the distant future.  The Bureau of Labor Statistics



 is currently conducting live tests where this method is being



 warmly received by respondents as natural and convenient.



 



      Both TDE and VRE offer inexpensive data collection where the



 respondents initiate the calls and enter and verify the data.



 Refinements to procedures will now focus on minimizing nonresponse



 prompting activities.



 



      Respondent Burden:  For many respondents, the use of automated



 methods can actually reduce the collection burden placed on them.



 For example, use of Prepared Data Entry, where respondents interact



 with computer screens, provides a single set of step-by-step



 procedures with on-line editing to prevent inconsistent or



 incorrect reporting, thus reducing the need for expensive and



 troublesome recontacts.  Also, these methods have, in some cases,



 substantially reduced the time taken to provide complex data for



 large establishments.  Similar methods may be applied to other



 surveys covering large establishments where the one-time costs of



 data conversion to a standard format would be cost-effective,



 especially in repeated surveys.



 



       Quality:   Automated collection allows for improved control



 yielding reduced error from several sources including errors caused



 by the respondent, the interviewer, and post collection processes



 such as key entry error.  The instant status capabilities of CATI,



 for example, provide stronger intervention features for nonresponse



 prompting, reducing nonresponse error.



 



       In deciding which collection method to use, quality can become



 a relative concept that is affected by a tradeoff between cost and



 benefit.  The choice of a data collection method is usually based



 on a combination of performance and cost factors determining



 affordable quality.  For traditional collection methods, these



 factors and the decision-making process are fairly well known.



 Now, these new methods discussed in Working Paper 19 expand the



 array of potential collection tools and challenge the survey



 designer to reevaluate old cost/performance assumptions.



 



       Costs: The data collection process is composed of a few major



 activities, including transmitting and receiving the questionnaire,



 data entry, editing and nonresponse prompting.  The labor and



 nonlabor costs will vary depending on the method used.  For



 example, under mail collection virtually each action is conducted



 






 



manually and postage is the dominant nonlabor cost.  By contrast,



 CATI operations can minimize postage costs and reduce many of the



 expensive mail handling operations.  However, CATI adds new costs



 in the form of telephone line charges and computers (including



 systems design and ongoing maintenance).  Self-response methods,



 such as TDE, VRE and PDE collection, reduce postage, the manual



 mail operations and the labor involved in CATI interview



 activities, but may still require edit reconciliation and



 nonresponse followup.



 



      Thus, the factors of production, and the composition of each



 of those inputs, vary greatly among the existing and newer techniques.



 Many factors can change in a short period.  Only a few years ago,



 automation costs were driven by the scarcity of mainframe hardware



 capacity.  Now, the costs of automation are driven by the labor



 involved in developing specialized systems.  Portable and desktop



 microcomputers were not widely available at the beginning of this



 decade.  Now, microcomputers are



 widely available, very inexpensive and extremely powerful.



 



      Old assumptions about costs need to be reevaluated.  Labor and



 postage costs have risen steadily in recent years, while capital



 costs, such as microcomputers and telephone services, have been



 declining.



 



      The decision on which collection mode to use, or which



 combination, will depend on the particular survey application and



 the existing cost structure.  However, it is important to view such



 investments over the long-term as the relative costs of each of the



 inputs do not remain constant over time.  Survey managers should



 periodically review old assumptions in light of new technology and



 project operating costs over the reasonably foreseeable future



 before deciding not to investigate new methods.
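
 The kind of projection recommended above can be sketched in a few
 lines.  The sketch assumes each cost input changes at a constant
 annual rate; the function names are hypothetical, and any figures
 supplied to them would be the survey manager's own estimates, not
 results from Working Paper 19.

```python
def project_costs(base_costs, annual_growth, years):
    """Return the total projected cost per year for one collection mode.

    base_costs:    dict of input name -> first-year cost
    annual_growth: dict of input name -> yearly rate of change
                   (for example 0.05 for rising, -0.10 for falling)
    """
    totals = []
    for year in range(years):
        total = sum(cost * (1 + annual_growth[name]) ** year
                    for name, cost in base_costs.items())
        totals.append(round(total, 2))
    return totals


def first_year_cheaper(mode_a, mode_b):
    """Index of the first year in which mode_a costs less than mode_b."""
    for year, (a, b) in enumerate(zip(mode_a, mode_b)):
        if a < b:
            return year
    return None
```

 With made-up inputs, a mode that is more expensive in the first year
 but relies on inputs with falling prices can become the cheaper
 choice within a few years.  A first-year comparison alone would not
 show this.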



 



      Users: Automated data collection includes three major groups



 of people: the respondents, the interviewers and the designers and



 developers of the system and procedures for collection.  This



 report covers the essential factors involved in successfully



 including the requirements of each group.



 



      Respondents: The respondent must be considered the primary



 user of any survey vehicle, whether automated or not, and all



 aspects of the response environment must be developed with the



 respondent in mind.  The cooperation of the respondent is the



 single most critical factor in survey operations. Respondents must



 be treated with the greatest care.  We must consider our



 respondents as customers; after all, if our survey vehicle doesn't



 "sell", if the questionnaire is not successful in getting an



 accurate response, we will have no input for the rest of our



 production process.



 



 






 



       Even one-time surveys must strive to leave the respondent with



 the feeling of contribution and importance, and most of all, a



 willingness to participate in other surveys in the future if called



 on. Thus, our primary job is to develop techniques which allow the



 respondent to answer the survey completely and accurately and



 with a minimum level of burden.



 



       The use of these collection methods, while bringing



 improvements in the quality of collected data, has entailed other



 challenges.  These automated collection methods are made possible



 through the close interaction of subject matter experts,



 statisticians, and computer scientists.  To effectively use these



 methods, each of these groups learned the basic tenets of the



 others.  This close relationship will only continue to grow, with



 advances in each field aiding advances in the others.



 



       Interviewers:  The second most important user is the



 interviewer.  The systems provided to assist in the interview



 process must be easy to use, must work infallibly and must actually



 provide improvements in his or her work environment.  Interviewers



 must feel that they are the most valuable feature in the interview,



 that the machine is merely a tool to expedite and simplify their



 work.  This is not always an easy task.



 



       Survey Practitioners: We are the third major group of users.



 The decisions made early in the development process will carry over



 into the ongoing use and maintenance of the system.



 



       Systems designers face difficult choices, such as building



 customized systems from scratch versus linking standardized "off



 the shelf" routines or commercial packages.  The inevitable



 limitations would have to be traded off against reduced maintenance



 and lower start up costs.



 



       Automated collection methods can also improve data quality.



 All of the methods discussed could be designed to include on-line



 editing to prevent impossible and inconsistent entries.  Some of



 these methods, such as TDE and VRE, improve data quality by



 verifying recorded data with the respondent.



 



       These are potential improvements.  The final impact on quality



 lies in the up-front planning and execution.  This places



 responsibility for clearly defining and controlling the collection



 environment directly with the survey designer.



 



       Future:  The future application of these techniques is limited



 only by the creativity and initiative of program managers and



 planners.  The "case studies" serve to illustrate the options



 available, and will surely raise many more questions for further



 investigation.



 



 



 






 



     We hope that the discussion of technological advances



generates discussion and stimulates creative, new applications to



the whole range of governmental information collection activities.



 



     In addition to the methods described here there are other



advances in technology which hold potential for vastly changing



data collection.  Integrated Services Digital Network (ISDN) is a



powerful network system which will provide simultaneous



transmission of sound, video and data.  The result could be a



change in the way some surveys are conducted, offering all of the



benefits of personal interviewing with the lower costs of telephone



interviewing.



 



     You have heard several different collection methods



described and discussed which are currently available.  And you can



see that the pace of change will accelerate and match changes in



technology.  So what does the future hold?



 



     You have to ask yourself how your survey operations will be



conducted in 5 or perhaps 10 years.  In doing so, ask yourself how



things were done 5 or 10 years ago.  What sorts of things have



happened and what were their implications?



 



 



 



 



 



 



 



 






 



                 A COMPARISON BETWEEN  CATI AND CAPI



 



                             Martin Baum



               National Center for Health Statistics



 



  Introduction



 



      I will describe for you some of the critical factors one must



  consider when deciding whether to conduct a survey by either CATI



  or CAPI.  I also will try to indicate the similarities and



  differences between these two methods of survey data collection



  automation.



 



  Definition



 



      Let me first define each of the methods.  Computer Assisted



  Telephone Interviewing (CATI) is a computer assisted survey process



  which uses the telephone for voice communications between the



  interviewer and the respondent.  Computer Assisted Personal



  Interviewing (CAPI) is a personal interview usually conducted at



  the home or business of the respondent using a portable computer.



 



  Rationale



 



       The rationale for the development and for your use of these



  methods is based primarily on reasons of improved data quality and



  improved timeliness of data release.  Cost is a factor, but in our



  experience, it has been a break-even situation; the cost of



  automating has equaled the savings.  This result has been due



  primarily to the high cost of software development.



 



  Factors



 



      The following are critical factors that must be considered in



  addition to those of improved data quality and timeliness, and cost



  when deciding whether to use CATI or CAPI for your survey data



  collection.  I will discuss each of these factors in some detail.



 



  Hardware CATI



 



       Initially CATI was developed as a mainframe application but



  as computer technology changed, CATI moved to the mini computer and



  then to a networked micro computer application.  The investment in



  hardware has steadily decreased without any loss of capability.



  Telephone technology, which impacts telephone availability, is



  important to the CATI application - no phone, no respondent.



 






 



 Hardware CAPI



 



      The most important computer hardware criteria for a CAPI



 application are generally quite different from those that would be



 critical to most other applications.  The major reason is the role



 that environmental conditions play in the selection of CAPI



 hardware.  The fact that CAPI is a personal interview situation,



 usually taking place in or at the home of the respondent, dictates



 a number of possible circumstances under which the interview will



 be conducted.



 



      For example, screen visibility becomes a paramount criterion



 because of the environmental conditions.  Interviews will take



 place under all types of lighting conditions; outside in bright



 sunlight, twilight, and normal light, and inside under lamp light,



 fluorescent light, and bare bulb.



 



      Weight is especially critical because of the variety of



 environmental conditions.    Interviewers may be conducting the



 survey in an urban setting where the computer will be carried up



 and down the stairs of apartment houses; or in a suburban setting



 where the computer is carried many blocks; or in a rural setting



 where the computer is carried long distances from car to house.  In



 any of these conditions, the computer is moved in and out of a car



 many times.  This situation is further compounded by the fact that



 the interviewer must also carry considerable paper, e.g. back-up



 paper questionnaires in case the computer fails, letters of



 explanation, introduction, and thank you.  Carrying all of this



 weight in and out of cars and up and down steps all day is no easy



 job, particularly if the computer and back-up battery weigh 10



 plus lbs. and the paper weighs an additional 5 lbs. or more.



 



     For a household type survey, the interviewers are generally



 reluctant to ask for the respondent's permission to use power for



 the computer because of fear of possibly losing the interview.



 Also, surveys frequently are conducted outside of the house where



 no power is available.  Many of our surveys can last as long as



 2-4 hours.  Consequently, battery life is critical.



 



     Environmental conditions often impact the ergonomics of the



 hardware.  Consider a survey interview conducted where the computer



 must be placed on the interviewer's lap.  This situation would be



 quite difficult if either the computer were top heavy when open or



 the interviewer were small and the computer's depth long.



 Balancing would be a problem.  Also consider the door step



 interview with a 10 lb. clam shell design computer.



 



 Software



 



     Now let's discuss the most costly factor in the CATI/CAPI



 decision - software.  There are four components to the CATI/CAPI



 






 



software:  Questionnaire, Case Management, Output Reporting, and



 Authoring System.



 



      The questionnaire component refers to the software that places



 each question in the survey on the computer screen in the proper



 sequence with the appropriate information (i.e. prompts) and allows



 the entry of an answer or answers to the question with edits on



 those answers, such as range, specific values, and consistency with



 another question's answer.  This software should also contain on



 screen help and, if necessary, rostering.
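
 The three kinds of edit named above (range, specific values, and
 consistency with another answer) can be illustrated with a short
 sketch.  It is not taken from any CATI or CAPI package: the function
 name, the dictionary layout, and the example questions are all
 hypothetical.

```python
def check_answer(question, value, answers_so_far):
    """Return a list of edit-failure messages for one keyed answer."""
    problems = []
    name = question["name"]
    # Range edit: the value must lie between two bounds.
    if "range" in question:
        low, high = question["range"]
        if not low <= value <= high:
            problems.append(f"{name}: {value} outside range {low}-{high}")
    # Specific-values edit: the value must be one of the listed codes.
    if "allowed" in question and value not in question["allowed"]:
        problems.append(f"{name}: {value} is not an allowed code")
    # Consistency edit: compare with an answer given earlier.
    for other, rule, message in question.get("consistency", []):
        if other in answers_so_far and not rule(value, answers_so_far[other]):
            problems.append(f"{name}: {message}")
    return problems


# Hypothetical example questions.
AGE = {"name": "age", "range": (0, 120)}
MARITAL_STATUS = {"name": "marital_status", "allowed": {1, 2, 3, 4}}
YEARS_WORKED = {
    "name": "years_worked",
    "range": (0, 80),
    "consistency": [("age", lambda years, age: years <= age,
                     "years worked cannot exceed age")],
}
```

 An interviewing program built this way would show the returned
 messages on the screen and ask for the answer again before moving to
 the next question.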



 



      The case management component is the software that allows the



 interviewer to keep track of the status of the survey interview;



 that is, is the interview complete?; if the interview is not



 complete, what has been completed and what is the next question to



 be asked?; is the interview a partial interview or is the interview



 to be completed later?; what sections of the survey are mandatory?;



 and in some instances, interviewer assignments.  In the case of



 CATI, case management software also would provide the sample



 selection and dialing of the phone number.
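
 The questions a case management component must answer can be shown
 with a small record structure.  This is a sketch under assumptions
 made here, not the design of any existing system: the class name,
 the field names, and the four status labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    case_id: str
    sections: list   # section names in questionnaire order
    mandatory: set   # sections that must be completed
    completed: set = field(default_factory=set)
    callback_scheduled: bool = False

    def next_section(self):
        """Name of the next section to be asked, or None if finished."""
        for section in self.sections:
            if section not in self.completed:
                return section
        return None

    def status(self):
        if all(s in self.completed for s in self.sections):
            return "complete"
        if self.callback_scheduled:
            return "to be completed later"
        if self.mandatory <= self.completed:
            return "partial"  # mandatory sections done, others missing
        return "incomplete"
```

 A CATI version would extend such a record with the telephone number
 and call history so that the same component could also select and
 dial the next case.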



 



      The output reporting component is often either overlooked or



 given minimal consideration.  This is a big mistake.  Collection of



 the data is not very useful if the data cannot be easily accessed



 for analysis.  Output reports can be categorized as either survey



 questionnaire statistics or management statistics.  The level of



 detail and complexity can vary significantly.  Survey questionnaire



 reporting can be as little as the ability to place the data into



 a specific analysis software file format, e.g. SAS, or can include



 actual analyses.



 



      Management statistics can be extremely useful for the conduct



 of the survey data collection.  For example, data can be



 automatically collected on the time to complete a section of the



 questionnaire by interviewer.  This information could provide



 insights for training and/or question rewrite.
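
 As an illustration of such a management statistic, the sketch below
 averages the time spent on each questionnaire section by each
 interviewer.  The function name and the layout of the timing log,
 a list of (interviewer, section, seconds) entries, are assumptions
 made for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_section_times(timing_log):
    """Average seconds per (interviewer, section) pair.

    `timing_log` is an iterable of (interviewer_id, section, seconds)
    entries, as the interviewing software might record them.
    """
    buckets = defaultdict(list)
    for interviewer, section, seconds in timing_log:
        buckets[(interviewer, section)].append(seconds)
    return {key: mean(values) for key, values in buckets.items()}
```

 A section that is slow for every interviewer points toward a
 question rewrite, while a section that is slow for one interviewer
 only points toward additional training.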



 



      The authoring system allows a non-computer programmer e.g. a



 survey questionnaire designer, to create the questionnaire while



 simultaneously and automatically generating the questionnaire



 software component.  It has been our experience that this is the



 most difficult component to develop.  Although there are a number



 of such systems that are available, none of these systems has met



 all of our requirements for the type of complex survey we conduct



 e.g. NHIS.  The authoring system should be extremely user-friendly



 and be able to handle a large number of question types.



 



 



 



 



 



 






 



Data Transmission



 



     In the case of CATI, the data is automatically transmitted to



a central point for either uploading to a larger computer or further



processing e.g. analysis.



 



     In the case of CAPI, the data collection is dispersed



generally over a wide geographic area.  The two primary methods for



data transmission have been mailed floppy disk or



telecommunications.  For data that is needed in one day or later,



floppy disk has been adequate.  Telecommunications, however, adds



a new dimension - two-way communications.  Not only can data be



transmitted to a central point, but instructions for the



interviewers, for example, could be transmitted from the central



point to the field.  The major problem with the telecommunications



method has been the inconsistent quality of the communication lines.



Cost can also be a barrier.



 



Interviewer Training



 



     The level and amount of training needed depend, to a large



extent, on the level of user-friendliness of the software.  Our



experience has shown that the type of training is different for



either a CATI or CAPI conducted survey than for the pencil and



paper conducted survey.  In the paper and pencil conducted survey,



training is focused almost entirely on the content of the



questionnaire, management of the questionnaire, and the proper



question sequencing.  It would not be unusual to have an



accompanying instruction manual 3-4 inches thick that would have to



be learned by each interviewer.  In the CATI or CAPI



conducted survey, by contrast, training includes both questionnaire



content and the care and use of the computer.  The major focus is the



computer, not the content, because the computer software can handle



most of the problems the interviewer needs to worry about in the



pencil and paper conducted survey, such as probes, question



sequencing, and completeness.



 



    There is one major difference between CATI and CAPI that



impacts on the training: the level of interviewer anxiety.  CATI is



conducted at a central location where supervision and help are



readily available.  CAPI, on the other hand, is conducted in the



field where no supervision or help is readily available.



Therefore, CAPI training must try to provide the interviewers with



sufficient confidence in the software and hardware to cope with



this lack of help.   One method that has proven effective is to



emphasize hands-on practice.  Interviewers are encouraged to take



home their computer and practice interviews with anyone they can



get prior to going into the field.  In addition, interviewers are



given their computers prior to the training so they can have some



familiarity with them.  CAPI interviewers must be able to cope with



 






 



problem occurrences.  Consequently, training must concentrate on



such situations.



 



 



Future Technology



 



     Impending technological advances can have a profound impact on



these automation methods, particularly CAPI.  Changes in hardware



such as an "etch-a-sketch" microcomputer and an inexpensive,



long-life, light-weight battery would open new possibilities for the



CAPI conducted survey.  Use of a light-weight computer, under 5



lbs., with no keyboard and with light pen hand-written entry, would allow



door step surveys as well as reduce training efforts.  The "etch-a-



sketch" computer has been introduced by one vendor and several



others are about to announce their own.  The long-life light-weight



inexpensive battery, although not currently announced or available,



when available will produce much faster and larger light-weight



computers.  Thus allowing larger and more complex surveys to be



automated.



 



     The development of generalized authoring system software



would open up the use of CATI and CAPI to the quick-turn-around



type survey.  Survey questionnaires could be designed and



implemented quickly and easily.  Staff productivity would also



increase significantly because computer programming efforts to



automate each survey questionnaire would be reduced to a minimum.



The survey designer, in effect, would be programming the survey



while designing the questionnaire.
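
     To illustrate the idea, the following is a minimal sketch (in Python, chosen here only for illustration) of a questionnaire written as data and run by a small general-purpose interpreter.  The question wording, identifiers, and skip rules are hypothetical and do not describe any authoring system mentioned in this session.

```python
# Hypothetical questionnaire written as data rather than as a program.
# Each question has text, an answer type, and a "next" rule that maps
# an answer to the id of the next question ("*" matches any answer).
QUESTIONNAIRE = {
    "start": "Q1",
    "questions": {
        "Q1": {"text": "Did you work last week?", "type": "yesno",
               "next": {"yes": "Q2", "no": "Q3"}},
        "Q2": {"text": "How many hours did you work?", "type": "int",
               "range": (1, 168), "next": {"*": "END"}},
        "Q3": {"text": "Did you look for work?", "type": "yesno",
               "next": {"*": "END"}},
    },
}


def parse_answer(question, raw):
    """Convert raw input to a typed answer, or raise ValueError."""
    raw = raw.strip().lower()
    if question["type"] == "yesno":
        if raw not in ("yes", "no"):
            raise ValueError("answer must be yes or no")
        return raw
    if question["type"] == "int":
        value = int(raw)  # raises ValueError for non-numeric input
        low, high = question.get("range", (None, None))
        if low is not None and not (low <= value <= high):
            raise ValueError(f"answer must be between {low} and {high}")
        return value
    raise ValueError("unknown question type")


def run_interview(spec, answers):
    """Walk the questionnaire, taking raw answers from an iterable.

    Returns a dict of question id -> typed answer for the questions
    actually asked.  An invalid answer causes the same question to be
    asked again with the next item from the iterable.
    """
    answers = iter(answers)
    responses = {}
    current = spec["start"]
    while current != "END":
        question = spec["questions"][current]
        try:
            value = parse_answer(question, next(answers))
        except ValueError:
            continue  # stay on the same question and ask again
        responses[current] = value
        rules = question["next"]
        current = rules.get(str(value), rules.get("*", "END"))
    return responses
```

     In a design of this kind, adding or reordering questions means editing the QUESTIONNAIRE table only; the interpreter itself does not change from survey to survey.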



 



 



 



 



 



 



 



 



 



                COMPUTER ASSISTED SELF INTERVIEWING



 



                           Ralph Gillmann



                 Energy Information Administration



 



     The phrase "computer assisted self interviewing" (CASI)



covers all survey methods in which respondents access computers.



These methods include "computerized self administered



questionnaires" (CSAQ) and "prepared data entry" (PDE) where the



respondent fills out a computerized version of the survey



instrument.  Also included are methods where the respondent uses a



telephone to access a computer: "touch tone data entry" (TDE) and



"voice recognition data entry" (VRE).



 



     Let's step back for a moment and look at different ways that



computers can be used in interviews:



 



[Graphic not reproduced: diagram linking interviewer, respondent, agency computer, and respondent's computer.]



 



 



 



 



 



 



     The top line represents direct interaction between an



interviewer and a respondent.  The left line represents the



interviewer accessing a computer such as in CATI and CAPI which



were previously discussed.  CASI methods are illustrated by the



lower right triangle.  The diagonal represents respondents



accessing an agency computer as in TDE and VRE.  The right line



represents respondents accessing their own computers as in PDE.



With the personal computer (PC) becoming ubiquitous, at least in



establishments, respondents usually have access to a computer.



 



     The bottom represents computer to computer interaction for



data transmission.  The missing diagonal would represent the



activities of hackers and spies.



 



 



 



 



 



 



 



 



 Next, let's compare manual and computer assisted methods:



 



[Graphic not reproduced: comparison of manual and computer assisted methods.]



 



      Some methods are part manual and part computer assisted.  For



 instance, CATI and CAPI combine a personal interview with an



 electronic survey instrument.   One survey which uses all of the



 computer assisted methods is the Petroleum Electronic Data



 Reporting Option (PEDRO) in use at the Energy Information



 Administration.  In general, the manual methods are slower and more



 prone to processing errors.  Labor and postage costs are also



 rising faster than the operational expenses of computer assisted



 methods.



 



      For transmission of the data to the collecting agency, paper



 copies can be sent via facsimile machines (fax).  This method is



 faster than the mail but doesn't eliminate the need to key in the



 data.  If the data are in electronic form, a diskette with the data



 can be mailed in.  This is useful if security and authenticity are



 a particular concern.  Transmission time may be saved by sending



 the data over the telephone network or using "electronic mail" over



 a computer network.  (Note that it's becoming harder to tell



 telephone and computer networks apart.)



 



      The use of an electronic mail service is feasible now and



 likely to be more important in the future.  This method allows a



 third party to handle the support for telephone lines, security,



 and temporary storage.  Respondents only need to have a terminal



 which operates over ordinary telephone lines if the survey



 instrument resides with the electronic mail service in the form of



 an electronic questionnaire.  Security can be provided by passwords



 and data encryption.  The survey agency can retrieve the data at



 its convenience.



 



      Finally, CASI offers several quality improvements:



 



      Increased timeliness of the data (especially important in



      monthly and weekly surveys)



 



      Fewer follow-up calls to respondents (because many, if



      not all, data edits can be done immediately)



 



 



      Reduced respondent burden (fewer persons are needed to



      fill out an electronic form)



 



      Lower costs (at least in cases where labor and postage



      make up a large part of the costs)



 



 



 



 



 



 



 



 



 



                COMPUTER ASSISTED SELF INTERVIEWING:



                    RIGS AND PEDRO, TWO EXAMPLES



 



                            Ann M. Ducca



                 Energy Information Administration



 



 



      I am going to talk about two systems that the Energy



 Information Administration has for reporting data using personal



 computers (PC's).  One system is a mail submission of a PC



 diskette, and the other uses telecommunications between the



 respondent's PC and our mainframe computer.



 



      The first example is the Reserves Information Gathering



 System, known as RIGS.  It is a system for reporting data on



 domestic oil and natural gas reserves on PC diskettes.  The data



 are collected by the EIA in its annual survey of oil and natural



 gas well operators.  Reporting to this survey is mandatory.



 



      Briefly, this survey is a stratified sample survey with the



 stratification being done according to the amount of production of



 oil and natural gas.  Respondents in the first stratum, representing



 the largest amounts of production and having the most data to



 report, are eligible to report using RIGS.  They will also continue



 to have the option of reporting on paper forms.  The EIA cannot



 require an electronic form of submission.  RIGS first became



 operational for the reporting of 1988 data.   We anticipate that



 25-30 percent of the 1989 reserves information will be reported



 using the RIGS system.



 



      The EIA sends PC diskettes containing the RIGS processing



 software by mail to respondents.  A user's guide is also provided.



 The respondents install RIGS onto their PC's and use it to enter



 data.



 



      The basic hardware requirement is an IBM compatible PC with at



 least 360K of random access memory, and two floppy disk drives or



 one floppy and one hard disk drive.  A printer should also be



 attached to the system so that a hard copy can be printed.  Version



 2.0 or higher of MS DOS is also required.  The IBM PC compatible



 computer was chosen because of its wide availability.



 



      The software for RIGS was originally written in dBASE III, a



 PC database management system.  dBASE III programs can only be



 executed using the dBASE III software; that is, stand-alone



 programs cannot be created.  Since the EIA did not want to purchase



 and provide the dBASE III software for every respondent, Clipper,



 a linkage compiler, was used to compile dBASE III into object code



 to make it a portable system.  The licensing agreement with Clipper



 permits run-time programs created by it to be operated outside the



 agency.  Thus, the respondents are provided with an executable load



 module, not programs.  Licensing agreements must be carefully



 



 



reviewed before planning to use software products outside an



 agency.



 



      An advantage of a load module is that respondents cannot



directly or inadvertently change the programs.  Also, there is no



cost to the respondents since the RIGS software was developed by



the government.



 



      Using the RIGS software, the respondents enter data directly



on their PC.  The data entry screens for RIGS are formatted like



the data collection form.  There may be some benefits to exploring



other formats which take advantage of options available to



automated collection, such as question sequencing.



 



      There is also the option of sending an ASCII file to the RIGS



system so that data already available in an automated form at the



respondent site can be submitted without re-keying.  The RIGS



User's Guide gives the instructions and record layout requirements



for downloading ASCII files.
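
     As an illustration of loading such a file, the sketch below parses fixed-width ASCII records in Python.  The field names, column positions, and record length are assumptions made for this example only; the actual record layout is the one given in the RIGS User's Guide and is not reproduced here.

```python
# Hypothetical fixed-width layout: (field name, start column, end column),
# with columns counted from 0 and the end column exclusive.  The real
# layout is defined in the RIGS User's Guide, not in this sketch.
LAYOUT = [
    ("respondent_id", 0, 6),
    ("field_code", 6, 10),
    ("oil_reserves", 10, 20),
    ("gas_reserves", 20, 30),
]
NUMERIC_FIELDS = {"oil_reserves", "gas_reserves"}
RECORD_LENGTH = 30


def parse_record(line):
    """Split one fixed-width ASCII record into a dict of fields."""
    line = line.rstrip("\r\n")
    if len(line) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(line)}")
    record = {}
    for name, start, end in LAYOUT:
        text = line[start:end].strip()
        # int() raises ValueError for blank or non-numeric text
        record[name] = int(text) if name in NUMERIC_FIELDS else text
    return record


def load_file(lines):
    """Parse every record; collect (line number, message) for bad ones."""
    good, problems = [], []
    for number, line in enumerate(lines, start=1):
        try:
            good.append(parse_record(line))
        except ValueError as error:
            problems.append((number, str(error)))
    return good, problems
```

     Records that do not fit the layout are reported by line number rather than silently dropped, so the respondent can correct the source file and load it again.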



 



      Respondents are required to submit to us by mail a diskette



containing a copy of the cover page and the data. They must also



return a paper copy of the cover page with the signature of the



certifying official.



 



      Because the survey is an annual one, it was decided that



telecommunications with the EIA mainframe computer was not needed,



and that the mail submission would be sufficient.  Since the data



in the RIGS system are proprietary, it was also decided that



respondents would not be provided with their previous year's data



because of the risk of sending confidential data to the wrong



respondent.



 



      Preliminary edits such as range checks are performed as the



data are entered into the RIGS system.  If the system detects an



incorrect entry, the bell sounds and a message appears across the



top of the data entry screen.  The message will prompt the user for



a response.  Help screens are available to assist the user, and



help is also available by telephone on a toll-free number.  For



data that have been downloaded into RIGS, an edit report is



produced afterwards.  A respondent may then use the RIGS edit



function to correct the errors.
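
     A range check of this kind might look like the following Python sketch.  The field names and limits are hypothetical; the actual RIGS edit rules are not given in this paper.  The first function corresponds to checking an entry as it is keyed, the second to producing an edit report for data loaded from a file.

```python
# Hypothetical edit rules: field name -> (lowest, highest) allowed value.
RANGE_RULES = {
    "oil_reserves": (0, 10_000_000),
    "gas_reserves": (0, 50_000_000),
}


def check_entry(field, value, rules=RANGE_RULES):
    """Return an error message for one entry, or None if it passes.

    Fields with no rule are not checked.
    """
    if field not in rules:
        return None
    low, high = rules[field]
    if not isinstance(value, (int, float)):
        return f"{field}: a number is required"
    if value < low or value > high:
        return f"{field}: {value} is outside the range {low} to {high}"
    return None


def edit_report(records, rules=RANGE_RULES):
    """Check loaded records in a batch and list every failure.

    Returns a list of (record number, message) pairs, with records
    numbered from 1.
    """
    report = []
    for number, record in enumerate(records, start=1):
        for field, value in record.items():
            message = check_entry(field, value, rules)
            if message is not None:
                report.append((number, message))
    return report
```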



 



      Final edits, such as comparisons with the previous year's reports,



are made after the data are returned to the EIA.  These edits are



performed on our mainframe system.  When questionable data are



identified, a quality control analyst contacts the respondent by



telephone and changes are made by the EIA.
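
     A comparison edit of this kind could be sketched as follows (Python).  The 50 percent tolerance and the field names are assumptions made for illustration, not EIA criteria.

```python
def compare_with_prior(current, prior, tolerance=0.5):
    """Flag fields whose value changed by more than `tolerance`.

    `current` and `prior` map field names to numbers for one respondent.
    `tolerance` is a relative change (0.5 means 50 percent) and is an
    arbitrary illustrative threshold.
    Returns a list of (field, prior value, current value) to review.
    """
    flagged = []
    for field, new_value in current.items():
        if field not in prior:
            continue  # nothing to compare against
        old_value = prior[field]
        if old_value == 0:
            # A relative change is undefined; flag any move away from zero.
            changed = new_value != 0
        else:
            changed = abs(new_value - old_value) / abs(old_value) > tolerance
        if changed:
            flagged.append((field, old_value, new_value))
    return flagged
```

     Flagged items are candidates for review, not errors in themselves; a large change may be genuine and explained by the respondent.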



 



      Respondents also have the option to make notes in a footnote.



These notes may be helpful in explaining data that appear to be



questionable.



 



 



      The second example is the Petroleum Electronic Data Reporting



  Option (PEDRO).  It gathers monthly data for petroleum supplies



  from petroleum companies.  The respondents eligible to use PEDRO



  participate in 7 monthly surveys.  They include refineries, storage



  facilities, pipelines, importers, and extraction facilities.



  Reporting to these surveys is also mandatory.  But again, the EIA



  cannot require an electronic form of submission.



 



      The participation in PEDRO varies among the 7 surveys.  The



  market share represented by reports to PEDRO ranges from 25 to 90



  percent of the total volume for a survey.



 



      The main difference between the PEDRO and RIGS systems is that



  PEDRO uses telecommunications to transmit data directly to the EIA



  mainframe computer.  PEDRO users need an IBM compatible PC with a



  hard disk and a floppy drive, and a modem.  As with the RIGS



  system, respondents are provided with an executable load module at



  no cost.  PEDRO also requires the Arbiter communications software



  which is licensed only for use with the EIA.  Arbiter was selected



  because it satisfied our security needs.  The EIA supplies the



  respondents with Arbiter.



 



      The basic methods of entering data to PEDRO are the same as



  those with RIGS -- keying on the PC or sending an ASCII file to the



  PEDRO system.  However, data submission in PEDRO is done by



  telecommunications directly to our mainframe, rather than by



  mailing diskettes.  Since these are proprietary data, PEDRO



  submissions are encrypted.  The transmissions are time-stamped to



  replicate a postmark.  The respondents must use passwords to



  transmit data, and the password, rather than a written signature,



  serves as the certification of the validity of the data.
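
     The ideas of a time stamp and a password standing in for a signature can be sketched as below (Python, standard library only).  This is not a description of how Arbiter or PEDRO works, which this paper does not specify: the keyed hash (HMAC), the PBKDF2 key derivation, and the function names are all assumptions chosen for the example, and encryption of the data is omitted.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def derive_key(password, salt):
    """Derive a signing key from a password (PBKDF2, standard library)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def seal_submission(data, password, salt, now=None):
    """Attach a time stamp and a password-based authentication tag.

    `data` is any JSON-serializable report.  Encryption of the body is
    deliberately left out of this sketch.
    """
    now = now or datetime.now(timezone.utc)
    body = json.dumps({"data": data, "sent_at": now.isoformat()}, sort_keys=True)
    tag = hmac.new(derive_key(password, salt), body.encode("utf-8"),
                   hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_submission(envelope, password, salt):
    """Return the decoded body if the tag matches, otherwise None."""
    expected = hmac.new(derive_key(password, salt),
                        envelope["body"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["tag"]):
        return None
    return json.loads(envelope["body"])
```

     Because the time stamp is inside the body covered by the tag, altering either the data or the time of transmission after the fact would cause verification to fail.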



 



       All edits in the PEDRO system appear on the respondent's PC.



  Since there is a direct link to our mainframe, all data needed for



  editing comparisons, for example prior month's data, are available



  on-line.  Preliminary edits are performed before respondents



  transmit any data.  Final edits are performed after the link to the



  EIA mainframe is established, and the results are transmitted back to the user.



 



       The EIA is very pleased with the RIGS and PEDRO reporting



  systems.   We believe that we are getting data faster and more



  accurately from these systems, and are encouraged by the increase



  in interest in using them.



 



 



 



 



  



 



 



 



 



 



                          DATA COLLECTION



 



                             Cathy Mazur



             National Agricultural Statistics Service



 



 



      In this session, I will first mention several factors to



 consider when deciding on a mode of data collection.  Then I will



 spend a few minutes comparing the modes of data collection that



 have been discussed.



 



      The primary factors in choosing a method of data collection



 for a given survey are (as previously mentioned) the available



 time frame, the desired quality, and the cost of resources.  It is



 unusual to have all three of these in abundance.  Therefore,



 tradeoffs must be considered.



 



      Several other factors to consider which relate to survey



 design and operation are whether the survey is mandatory or



 voluntary, whether a one-time or ongoing survey is to be



 implemented, whether households or businesses are sampled, whether



 the data will be collected in a centralized or decentralized



 manner, whether networking of computers will be done, the sample



 size, and the complexity of the questionnaire.



 



      The remaining factors to consider in automated data collection



 refer to the characteristics of the technology.  First is the speed



 of the hardware and data transmission over the phone lines.  Next



 is the size of the computer's memory, and the system's weight (as



 in CAPI).  Portability is a concern to data collection when



 different hardware and/or software is to be used (as in Prepared



 Data Entry (PDE)).  The type of display is important in some modes



 (as in CAPI).  The mode of data entry can be through the keyboard,



 a pushbutton phone, or one's voice.  Data verification



 depends on the desire for quality, the complexity of the data, and



 other factors.  The database generation is also an important step



 (as was discussed by Martin Baum).  It refers to integrating the



 data with other survey processes (label generating, data



 summaries).  Hardware is selected based on cost, the amount of time



 available, the data quality desired, and the background of the



 staff that will operate the machines.  Lastly, training is



 important in any survey; the amount needed depends on the



 technology chosen.



 



      The priorities that are given to these factors and the



 relationships between them help to decide which technology to use.



 All of the automated modes combine data collection with data entry, and most add editing



 at the time of data collection.  This reduces the time component



 and increases the quality component.  Also, mixed modes of data



 collection are possible in a survey.



 



 



 



 



      First (as a means of comparison), a mail or manual survey



 would require a fairly long time to send out personal enumerators



 or to send and receive questionnaires through the mail.  The amount



 of editing is very limited, as data entry and editing are done after



 all the data are collected and the interview is completed.  The cost



 is fairly high if personal interviews are done, and nonresponse may



 also be high if questionnaires are mailed out.



 



      CATI is used because it collects data quickly and accurately.



 The cost component (which is fairly high) comes from the hardware,



 software, training, and support factors (such as phone charges).



 One cost component which is eliminated is the travel expense.  One



 suggestion is that CATI improves the cost benefit.  The respondent,



 however, must have a phone.  Other benefits are that it is useful



 in complex survey environments, can provide information on call



 scheduling successes/failures, and can be used for non-response



 follow up.



 



      CAPI also has fairly high costs, but it provides accurate data



 with a tendency for higher response rates (which may be a problem



 in CATI), and saves on the separate key-entry time.  The largest



 cost component is due to travel (with some in hardware and software



 support costs).  The weight, battery life, and screen visibility



 are important issues to CAPI.



 



      As to computer-assisted self interviewing, three data collection modes



 are discussed -- Prepared Data Entry (PDE), Touchtone Data Entry



 (TDE) and Voice Recognition Entry (VRE).  PDE provides faster and



 more accurate data, for an average cost.  Costs are incurred in



 software development and support areas.  This mode requires the



 availability of a PC (usually by establishments), and two issues



 are data security and data integration (as different PC's are



 used).



 



      TDE allows respondents to call and answer questions posed by



 a computer using the keypad of their touchtone telephone.  VRE also



 allows respondents to call and answer questions posed by a



 computer, but the respondent answers by speaking directly into the



 telephone, and a computer system translates the incoming sounds



 into text.  TDE and VRE offer low cost alternatives in a short data



 collection time, but editing is more limited.  In both, surveys



 tend to be shorter and simpler, non-response prompts are used, and



 respondent acceptance is a concern.  TDE requires access to a



 touchtone phone and service, whereas VRE can use any phone.  The



 Bureau of Labor Statistics collects data monthly for the Current



 Employment Statistics Program using mail, CATI, TDE, and VRE.  The



 VRE system recognizes continuous speech of the digits 0-9, "yes,"



 and "no" from any speaker of American English.
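
     The two entry modes can be sketched as follows (Python).  The vocabulary follows the description above (the digits 0-9, yes, and no); the English word forms for the digits, the use of the "#" key to end a touchtone answer, and the function names are assumptions made for this illustration.

```python
# Vocabulary described for the VRE system: the digits 0-9, yes, and no.
# The spoken word forms below are an assumption of this sketch.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def words_to_answer(words):
    """Turn recognized words into an answer.

    Returns True/False for a single "yes"/"no", an int for a run of
    digit words, or None if the words cannot be interpreted (the caller
    would then re-prompt the respondent).
    """
    words = [w.lower() for w in words]
    if words == ["yes"]:
        return True
    if words == ["no"]:
        return False
    if words and all(w in DIGIT_WORDS for w in words):
        return int("".join(DIGIT_WORDS[w] for w in words))
    return None


def keypad_to_answers(keys, terminator="#"):
    """Split a touchtone key sequence into numeric answers.

    Each answer is a run of digits ended by the terminator key.  Digits
    keyed after the last terminator are treated as incomplete and are
    ignored, as are keys that are neither digits nor the terminator.
    """
    answers = []
    current = ""
    for key in keys:
        if key == terminator:
            if current:
                answers.append(int(current))
            current = ""
        elif key.isdigit():
            current += key
    return answers
```

     The small, fixed vocabulary is what keeps both modes simple, and it is also why surveys collected this way tend to be short and numeric.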



 



      These are not simple issues, and there are no clear-cut



 answers.  The definitions and importance of the factors must be



 



 



 



 



agreed upon.  This comparison only represents the current state of



technology; much will change with future development.



 



    Lastly, I hope this session has made you more aware of the



possibilities, the issues, and what to consider when choosing a



data collection method.



 



 



 



 



 



 



 



 



 



                           DISCUSSION



 



                          Robert N. Tinari



                     U. S. Bureau of the Census



 



 



      I want to begin my remarks today by noting that this paper is



 a very thorough treatment of the issues surrounding automated



 survey collection methodologies.



 



      I am impressed with the organization of the paper and the



 thoroughness of discussion of the many considerations that go into



 selecting, designing, and implementing these types of data



 collection systems.  The subcommittee is to be commended for the



 excellent job they have done in bringing together in one document



 a tremendous amount of information that I think will be extremely



 useful to those considering alternative data collection



 methodologies.



 



      Based on my experience as a program manager responsible for



 the initial development and implementation of CATI on the National



 Crime Survey, there are several issues raised in the paper that I



 believe need more emphasis.



 



      The first issue I want to discuss has to do with organization



 and its effect on CATI/CAPI development and implementation.



 



      In its conclusion, the committee notes that increased reliance



 on software development has important implications for hiring and



 training skilled survey designers.  It also states that previously



 distinct boundaries between occupational groups will continuously



 blur and disappear and survey design will likely be increasingly



 accomplished through teams of skilled workers from different



 occupations.



 



      Based upon my experience, I believe that this is an accurate



 assessment.  Obtaining the maximum benefit from these data



 collection methodologies requires that a fully integrated system be



 developed and this, in turn, requires the concerted effort and



 collaboration of programmers, survey design experts, statisticians,



 field staff, program managers, and survey sponsors.



 



      However, the level of cooperation and communication necessary



 to successfully design and implement CATI/CAPI may be very



 difficult to achieve in a large, hierarchical organization.  Staffs



 tend to be highly specialized and not experienced in projects



 requiring a multi-disciplined approach.



 



      From my own experience working on one of the first CATI



 applications at the Census Bureau, we had a very difficult time



 organizing the right team with the right experience necessary to get the



 project underway and in keeping the lines of communication



 



 



 open among the various divisions involved to implement it



  successfully.



 



      We learned a lot from that process and have come a long way.



  A recent example is a cooperative effort between the Economic Area



  and the Demographic Area in successfully developing and



  implementing a CATI system for the Survey of Manufacturing



  Technology.  The Industry Division was responsible for conducting



  the survey and wanted to use CATI for nonresponse followup of



  manufacturing plants.  The division lacked the experience to



  develop the questionnaire on CATI.  Demographic Surveys Division



  offered to help with the authoring, Industry assisted with testing,



  and Field Division worked on interviewer training and data



  collection.  The survey was carried out on time, within budget, and



  with high quality.  This is a good example of what can be



  accomplished by individuals working together from the various



  divisions and sharing their expertise to get the job done.



 



       Poor organization and control can have a very serious impact



  on the cost and time of development and the quality of the final



  product.  I believe that what is needed to successfully design and



  implement automated data collection methodologies is:



 



  o    commitment and full support from upper-level management.



 



  o    a full-time, dedicated staff - no part-time work along



       with other projects.



 



  o    open lines of communication with clear assignment of



       responsibility/accountability.



 



  o    a designated project coordinator/facilitator.



 



  o    breaking down of traditional barriers between survey



       statisticians, mathematicians, survey designers,



       programmers, and field staff in order to work



       effectively.



 



  o    ongoing commitment and organizational change to adapt to



       the needs of the new data collection methodology.  This is



       especially important if you are using mixed modes such as



       personal visit (paper) and centralized telephone (CATI).



 



  o    reduced layers of bureaucracy.



 



  o    empowerment of the team to get the job done.



 



       We must think of new ways of organizing ourselves to be more



  flexible and effective in designing and implementing new



  technologies.  In addition, there must be more sharing of



 



 



 information among the various statistical agencies on approaches



  and experiences in the area of organization.



 



      The second issue has to do with interviewer acceptance of new



  technologies like CATI and CAPI.  The paper points out the



  importance of involving the user in the design process.  I do not



  think this point can be over-emphasized.



 



      In the rush to develop survey instruments on tight time



  schedules or in deciding which portable machines to use for CAPI



  applications, we, the developers and/or program managers, take it



  upon ourselves to decide what is best for the interviewers and may



  not actively involve them in the decision or development process.



  This can be a big mistake.



 



      If the interviewers are not comfortable with the interface, if



  it is slow, clumsy or awkward to use, "not natural" feeling, not



  helpful, etc., the survey is in serious trouble.  If the



  interviewers have no say in the design and for any reason should



  decide that the system is not helping them to get the job done



  better, then you face an uphill struggle to gain their acceptance,



  and in some instances, the system may never be fully accepted.



  Interviewers may work to defeat the system, morale may suffer,



  respondent cooperation may suffer, turnover rates will increase,



  quality will suffer, and costs will escalate.



 



      In addition, if you are contemplating switching from a



 personal visit environment to CATI, you must consider the effect on



 the interviewer staff out in the field.  Field interviewers will be



 concerned about losing their jobs and quality may suffer during the



 transition to CATI.  How the Field interviewers will be treated and



 the possible impact on data quality during the transition period should



 most definitely be taken into account.  For example, in planning



 the transition of cases from personal visit to CATI for the



 National Crime Survey, we chose cases for conversion to CATI in areas



 with attrition among interviewing staff and in hard-to-enumerate areas.



 With this approach, CATI was viewed as a positive tool by Field staff.  This



 plan helped to gain acceptance of CATI.



 



      The third and final area I want to discuss has to do with the



 need for adequate testing and evaluation of these new



 methodologies.



 



      Before implementing any survey operation, it is good practice



 to allow enough time for adequate testing and evaluation of the



 instrument and the data collection and processing system.  This is



 especially crucial for automated data collection systems.  Complex



 questionnaires (those with complex branching or edits) need to be



 thoroughly tested and evaluated before they are introduced on a



 production basis.



 



 



 



 



      While the automated data collection systems provide us with



 the ability to field much more complex questionnaires than we could



 using conventional paper forms, they also pose additional



 challenges related to testing.  Aside from the obvious problems



 that may surface during interviewing, if the instrument is not



 adequately tested, there may be logic errors hidden in the



 instrument that go undetected or aren't found until after the data



 collection phase is complete.



 



      In addition, when changes are introduced to the questionnaire



 (even minor ones), thorough testing should be conducted again to



 ensure that other questions or skip patterns have not been



 affected.



 



      In the paper, the committee discusses the possible application



 of expert systems in questionnaire development.  I would suggest



 that perhaps some application could be found for these systems in



 testing and evaluation as well.  There is definitely a need for



 more systematic and thorough methods for checking out the



 questionnaire.  In addition, attention must be paid to testing the



 case management, call scheduling, training, data transmission, and



 processing systems before the survey is fielded.
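
     One modest step toward more systematic checking, short of an expert system, is to examine the skip patterns mechanically.  The Python sketch below looks for questions that can never be reached, skips that point to undefined questions, and loops.  The question identifiers and the table of skips are hypothetical.

```python
# Hypothetical skip patterns: question id -> list of possible next ids.
# "END" marks the end of the interview.
SKIPS = {
    "Q1": ["Q2", "Q3"],
    "Q2": ["Q4"],
    "Q3": ["END"],
    "Q4": ["END"],
    "Q5": ["END"],   # never reached from Q1: a likely authoring error
}


def check_skip_patterns(skips, start):
    """Return (unreachable questions, undefined targets, can_loop).

    unreachable: defined questions that no path from `start` reaches.
    undefined:   targets that are neither defined nor "END".
    can_loop:    True if some path can return to a question already
                 on that path.
    """
    undefined = sorted({target
                        for targets in skips.values()
                        for target in targets
                        if target != "END" and target not in skips})

    reached = set()   # every question visited so far
    on_path = set()   # questions on the path currently being explored
    can_loop = False

    def visit(question):
        nonlocal can_loop
        if question == "END" or question not in skips:
            return
        if question in on_path:
            can_loop = True
            return
        if question in reached:
            return
        reached.add(question)
        on_path.add(question)
        for target in skips[question]:
            visit(target)
        on_path.discard(question)

    visit(start)
    unreachable = sorted(set(skips) - reached)
    return unreachable, undefined, can_loop
```

     A check like this can be rerun after every change to the questionnaire, however minor, which addresses the concern that a small edit may disturb other skip patterns.  It does not replace testing with interviewers.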



 



      This is not something that only needs to be done before a



 survey is fielded.  It should be an ongoing effort to evaluate how



 well the system is functioning.  It should allow for feedback for



 continuous improvement/refinement through such means as monitoring,



 observation, and debriefing of interviewers/respondents.



 



      I want to thank the organizers for giving me the opportunity



 to share my views on this important topic.  I think the committee



 has made an important contribution by bringing together in one



 document many of the issues facing project managers in deciding



 whether or not to adopt these technologies.  I hope that the



 document will be treated as a dynamic one that will be expanded as



 we gain more experience with the various aspects of these data



 collection methodologies.



 



 



 



 



 



 



 



 



 



                              DISCUSSION



 



                           David Morganstein



                              Westat, Inc.



 



 



      I thank Terry Ireland for organizing this intriguing session



 and I would like to express my appreciation to the speakers for the



 work they have done in their examination of new methods for



 assisting in the process of conducting government surveys.  It is



 a pleasure to be given this o