Federal Committee on Statistical Methodology, Office of Management and Budget
Statistical Policy Working Paper 20 - Seminar on Quality of Federal Data - Part 3 of 3
Federal Committee on Statistical Methodology
Statistical Policy Office
Office of Information and Regulatory Affairs
Office of Management and Budget

March 1991

MEMBERS OF THE FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY
(February 1991)

Maria E. Gonzalez, Chair, Office of Management and Budget

Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Charles E. Caudill, National Agricultural Statistics Service
Cynthia Z.F. Clark, National Agricultural Statistics Service
Zahava D. Doering, Smithsonian Institution
Robert M. Groves, Bureau of the Census
Roger A. Herriot, National Center for Education Statistics
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, Bureau of the Census
Daniel Melnick, National Science Foundation
Robert P. Parker, Bureau of Economic Analysis
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Wesley L. Schaible, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census

PREFACE

In 1975, the Office of Management and Budget (OMB) organized the Federal Committee on Statistical Methodology. Comprised of individuals selected by OMB for their expertise and interest in statistical methods, the committee has during the past 15 years determined areas that merit investigation and discussion, and overseen the work of subcommittees organized to study particular issues. Since 1978, 19 Statistical Policy Working Papers have been published under the auspices of the Committee.

On May 23-24, 1990, the Council of Professional Associations on Federal Statistics (COPAFS) hosted a "Seminar on the Quality of Federal Data."
Developed to capitalize on work undertaken during the past dozen years by the Federal Committee on Statistical Methodology and its subcommittees, the seminar focused on a variety of topics that have been explored thus far in the Statistical Policy Working Paper series. The subjects covered at the seminar included:

  Survey Quality Profiles
  Paradigm Shifts Using Administrative Records
  Survey Coverage Evaluation
  Telephone Data Collection
  Data Editing
  Computer Assisted Statistical Surveys
  Quality in Business Surveys
  Cognitive Laboratories
  Employer Reporting Unit Match Study
  Approaches to Developing Questionnaires
  Statistical Disclosure-Avoidance
  Federal Longitudinal Surveys

Each of these topics was presented in a two-hour session that featured formal papers and discussion, followed by informal dialogue among all speakers and attendees.

Statistical Policy Working Paper 20, published in three parts, presents the proceedings of the "Seminar on the Quality of Federal Data." In addition to providing the papers and formal discussions from each of the twelve sessions, this working paper includes Robert M. Groves' keynote address, "Towards Quality in a Working Paper Series on Quality," and comments by Stephen E. Fienberg, Margaret E. Martin, and Hermann Habermann at the closing session, "Towards an Agenda for the Future."

We are indebted to all of our colleagues who assisted in organizing the seminar, and to the many individuals who not only presented papers and discussions but also prepared these materials for publication. A special thanks is due to Terry Ireland and his staff for their work in assembling this working paper.

Table of Contents

Wednesday, May 23, 1990

Part 1

KEYNOTE ADDRESS
  TOWARDS QUALITY IN A WORKING PAPER SERIES ON QUALITY
    Robert M. Groves, The University of Michigan and U. S. Bureau of the Census

Session 1 - SURVEY QUALITY PROFILES
  THE SIPP QUALITY PROFILE
    Thomas B. Jabine, Statistical Consultant
  INITIAL REPORT ON THE QUALITY OF AGRICULTURAL SURVEY PROGRAM
    George A. Hanuschak, National Agricultural Statistics Service
  DISCUSSION
    Barbara A. Bailar, American Statistical Association
  DISCUSSION
    Nancy A. Mathiowetz, U. S. Bureau of the Census

Session 2 - PARADIGM SHIFTS USING ADMINISTRATIVE RECORDS
  PARADIGM SHIFTS: ADMINISTRATIVE RECORDS AND CENSUS-TAKING
    Fritz Scheuren, Internal Revenue Service
  AN ADMINISTRATIVE RECORD PARADIGM: A CANADIAN EXPERIENCE
    John Leyes, Statistics Canada
  DISCUSSION
    Gerald Gates, U. S. Bureau of the Census
  DISCUSSION
    Edward J. Spar, Market Statistics

Session 3 - SURVEY COVERAGE EVALUATION
  CONTROL, MEASUREMENT, AND IMPROVEMENT OF SURVEY COVERAGE
    Gary M. Shapiro, U. S. Bureau of the Census; Raymond R. Bosecker, National Agricultural Statistics Service
  QUALITY OF SURVEY FRAMES
    Judith T. Lessler, Research Triangle Institute
  DISCUSSION
    Fritz Scheuren, Internal Revenue Service
  DISCUSSION
    Joseph Waksberg, Westat, Inc.

Session 4 - TELEPHONE DATA COLLECTION
  QUALITY IMPROVEMENT IN TELEPHONE SURVEYS
    Leyla Mohadjer, David Morganstein, Westat, Inc.
  COMPUTER ASSISTED SURVEY TECHNOLOGIES IN GOVERNMENT: AN OVERVIEW
    Marc Tosiano, National Agricultural Statistics Service
  DISCUSSION
    William L. Nicholls II, U. S. Bureau of the Census
  DISCUSSION
    James T. Massey, National Center for Health Statistics

Part 2

Session 5 - DATA EDITING
  OVERVIEW OF DATA EDITING IN FEDERAL STATISTICAL AGENCIES
    David A. Pierce, Federal Reserve Board
  EDITING SOFTWARE (An excerpt from Chapter IV of Working Paper 18)
    Mark Pierzchala, National Agricultural Statistics Service
  RESEARCH ON EDITING
    Yahia Ahmed, Internal Revenue Service
  DISCUSSION
    Charles E. Caudill, National Agricultural Statistics Service
  DISCUSSION
    Richard Bolstein, George Mason University

Session 6 - COMPUTER ASSISTED STATISTICAL SURVEYS
  OVERVIEW OF COMPUTER ASSISTED SURVEY INFORMATION COLLECTION
    Richard L. Clayton, U. S. Bureau of Labor Statistics
  A COMPARISON BETWEEN CATI AND CAPI
    Martin Baum, National Center for Health Statistics
  COMPUTER ASSISTED SELF INTERVIEWING
    Ralph Gillmann, Energy Information Administration
  COMPUTER ASSISTED SELF INTERVIEWING: RIGS AND PEDRO, TWO EXAMPLES
    Ann M. Ducca, Energy Information Administration
  DATA COLLECTION
    Cathy Mazur, National Agricultural Statistics Service
  DISCUSSION
    Robert N. Tinari, U. S. Bureau of the Census
  DISCUSSION
    David Morganstein, Westat, Inc.

Thursday, May 24, 1990

Session 7 - QUALITY IN BUSINESS SURVEYS
  IMPROVING ESTABLISHMENT SURVEYS AT THE BUREAU OF LABOR STATISTICS
    Brian MacDonald, Alan R. Tupek, U. S. Bureau of Labor Statistics
  A REVIEW OF NONSAMPLING ERRORS IN FEDERAL ESTABLISHMENT SURVEYS WITH SOME AGRIBUSINESS EXAMPLES
    Ron Fecso, National Agricultural Statistics Service
  DISCUSSION
    David A. Binder, Statistics Canada
  DISCUSSION
    Charles D. Cowan, Opinion Research Corporation

Session 8 - COGNITIVE LABORATORIES
  THE BUREAU OF LABOR STATISTICS' COLLECTION PROCEDURES RESEARCH LABORATORY: ACCOMPLISHMENTS AND FUTURE DIRECTIONS
    Cathryn S. Dippo, Douglas Herrmann, U. S. Bureau of Labor Statistics
  THE ROLE OF A COGNITIVE LABORATORY IN A STATISTICAL AGENCY
    Monroe G. Sirken, National Center for Health Statistics
  DISCUSSION
    Elizabeth Martin, U. S. Bureau of the Census
  DISCUSSION
    Murray Aborn, National Science Foundation (retired)

Part 3

Session 9 - EMPLOYER REPORTING UNIT MATCH STUDY
  INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS: THE ERUMS EXPERIENCE
    Thomas B. Petska, Internal Revenue Service; Lois Alexander, Social Security Administration
  SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS
    John Pinkos, Kenneth LeVasseur, Marlene Einstein, U. S. Bureau of Labor Statistics; Joel Packman, Social Security Administration
  RESULTS, FINDINGS AND RECOMMENDATIONS OF THE ERUMS PROJECT
    Vern Renshaw, Bureau of Economic Analysis; Tom Jabine, Statistical Consultant
  DISCUSSION
    W. Joel Richardson, Charles A. Waite, U. S. Bureau of the Census
  DISCUSSION
    Thomas J. Plewes, U. S. Bureau of Labor Statistics

Session 10 - APPROACHES TO DEVELOPING QUESTIONNAIRES
  TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING QUESTIONNAIRES
    Theresa J. DeMaio, U. S. Bureau of the Census
  TECHNIQUES FOR EVALUATING THE QUESTIONNAIRE DRAFT
    Deborah H. Bercini, National Center for Health Statistics
  DESIGNING QUESTIONNAIRES FOR CATI IN A MIXED MODE ENVIRONMENT
    Gemma Furno, U. S. Bureau of the Census
  DISCUSSION
    Carol C. House, National Agricultural Statistics Service

Session 11 - STATISTICAL DISCLOSURE-AVOIDANCE
  DISCLOSURE AVOIDANCE PRACTICES AT THE CENSUS BUREAU
    Brian Greenberg, U. S. Bureau of the Census
  THE MICRODATA RELEASE PROGRAM OF THE NATIONAL CENTER FOR HEALTH STATISTICS
    Robert H. Mugge, National Center for Health Statistics (retired)
  DISCUSSION
    George T. Duncan, Carnegie Mellon University

Session 12 - FEDERAL LONGITUDINAL SURVEYS
  FEDERAL LONGITUDINAL SURVEYS
    Daniel Kasprzyk, U. S. Bureau of the Census; Curtis Jacobs, U. S. Bureau of Labor Statistics
  THE ADVANTAGES AND DISADVANTAGES OF LONGITUDINAL SURVEYS
    Robert W. Pearson, Social Science Research Council
  LONGITUDINAL ANALYSIS OF FEDERAL SURVEY DATA
    Patricia Ruggles, Joint Economic Committee
  DISCUSSION
    Michael Brick, Westat, Inc.
  DISCUSSION
    Marilyn E. Manser, U. S. Bureau of Labor Statistics

TOWARDS AN AGENDA FOR THE FUTURE
    Stephen E. Fienberg, Carnegie Mellon University
    Margaret E. Martin
    Hermann Habermann, Office of Management and Budget

Part 3

Session 9
EMPLOYER REPORTING UNIT MATCH STUDY

INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS: THE ERUMS EXPERIENCE

Thomas B. Petska, Internal Revenue Service
Lois Alexander, Social Security Administration

The Employer Reporting Unit Match Study (ERUMS) was a pilot record linkage study carried out under the auspices of the Federal Committee on Statistical Methodology of the Office of Management and Budget. The study linked records of employers and their reporting units from three agencies: the Bureau of Labor Statistics (BLS), the Social Security Administration (SSA) and the Internal Revenue Service (IRS). The primary linkages involved samples of the agencies' records for employers in the State of Texas covering their activities in 1982.

For the ERUMS Workgroup to gain access to the data sets needed for the study, arrangements had to be developed that would comply with the confidentiality provisions and statutes of the Federal and State agencies that controlled these data sets. This paper gives an overview of these arrangements and agreements. In the first section, background information on the statistical content and confidentiality provisions of each of the data sets is provided. In the second section, the actual arrangements for the release of confidential microdata are described. The last section provides a summary of what we have learned about such data sharing arrangements.

Background Information

The goal of ERUMS was to demonstrate the feasibility of matching employer and reporting unit data from different agency record systems as a means of obtaining more precise information about the coverage and content of the data in those systems. One purpose was to examine and evaluate differences in wage and employment data at the state and county level as reported to those agencies.
Despite the many difficulties encountered in establishing the data access agreements, ERUMS demonstrated that such data sharing projects can be successful under current laws.

1. Data Sets

The ERUMS study was a three-way data linkage study in which individual microdata records from BLS, SSA, and IRS were matched by Employer Identification Number (EIN).

a. BLS provided a 1982 Unemployment Insurance (UI) Address File, which, for each state, consists of data for individual employers and their reporting units, which are often equivalent to "establishments". The data for this file are submitted to BLS by the State employment security agencies that operate the Federal-State UI Program. BLS uses the data submitted by the states as a basis for statistical reports on employment and wages and uses the UI Address File as a national sampling frame for its establishment surveys.

b. SSA provided an edited file of Form W-3 annual reports for 1982 and the Single Unit and Multi-Unit Code Files. The Form W-3 file provided data on individual employers and, in some cases, for each of their reporting units, which are frequently equivalent to establishments. The Single Unit Code File contains a record for most entities that have filed an application for an Employer Identification Number. The Multi-Unit Code File contains a record for each reporting unit of multi-unit employers who are participating in the Establishment Reporting Plan, a voluntary program under which employers report wage information on Form W-3 separately for each of their reporting units.

c. IRS data used for ERUMS were from a Census-edited file based on Forms 941 and 943 for Tax Years 1981-83. These forms are used by employers to report each quarter (annually for Form 943) to IRS on income taxes withheld from wages and other payments to employees and on taxes under the Federal Insurance Contributions Act (FICA) under the Social Security system.
Extracts of data from these forms are provided annually by IRS to the Census Bureau for use in the latter's County Business Patterns Program and other statistical programs. The Census Bureau edits the files, particularly the industry codes, and imputes certain missing data. This file was made available to the IRS Statistics of Income (SOI) Division for use in its business employment and payroll studies and was used for ERUMS. In addition, copies of Form 940, Federal Unemployment Tax Return, were obtained for a substantial proportion of the ERUMS sample cases.

2. Data Sharing Issues

For the ERUMS Workgroup to gain access to the data sets needed for the study, it was necessary to develop working arrangements that complied with the provisions of confidentiality statutes, regulations, and policies of the Federal and State agencies that controlled these data sets.

Although interagency exchange of identifiable microdata was the key to ERUMS, such data sharing is restricted by Federal confidentiality laws which generally permit agencies to disclose statistical information only in summary or other unidentifiable form. Since ERUMS was designed to link and compare information about individual employers collected separately by the different agencies, the Workgroup had to develop and implement lawful methods of transferring data on identifiable business units among the participants. A related task was to minimize the disclosure of identifiers in making those transfers and linkages.

The Workgroup was particularly interested in the different ways an employer may report establishment or multi-unit enterprise data to various State and Federal agencies. To examine these differences, the Workgroup needed to compare employers' reports to the BLS State UI programs, the SSA FICA reporting, and the IRS employment tax returns.
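The kind of comparison described above, in which each agency's records are lined up by EIN and only summaries are reported, can be illustrated with a minimal sketch. It is not the Workgroup's actual procedure: the function names, the record layout (an EIN mapped to a single annual wage figure), the 5 percent tolerance, and all of the sample values are hypothetical and were chosen only for illustration.

```python
from collections import Counter


def match_patterns(bls, ssa, irs):
    """Count EINs by the combination of agency files in which they appear.

    Each argument maps an EIN string to that agency's reported annual
    wages (a hypothetical layout). Only aggregate counts are returned,
    so the output identifies no individual employer.
    """
    sources = {"BLS": bls, "SSA": ssa, "IRS": irs}
    all_eins = set(bls) | set(ssa) | set(irs)
    patterns = Counter()
    for ein in all_eins:
        present = tuple(name for name, recs in sources.items() if ein in recs)
        patterns[present] += 1
    return dict(patterns)


def wage_discrepancies(bls, ssa, irs, tolerance=0.05):
    """Return (three-way matches, matches whose wages differ by > tolerance)."""
    matched = set(bls) & set(ssa) & set(irs)
    discrepant = 0
    for ein in matched:
        wages = [bls[ein], ssa[ein], irs[ein]]
        low, high = min(wages), max(wages)
        # Relative spread between the lowest and highest reported wages.
        if high > 0 and (high - low) / high > tolerance:
            discrepant += 1
    return len(matched), discrepant


# Made-up records; these are not real EINs or wage amounts.
bls = {"741234567": 100000, "752345678": 50000, "763456789": 80000}
ssa = {"741234567": 100500, "752345678": 62000}
irs = {"741234567": 100000, "752345678": 50000, "774567890": 30000}

patterns = match_patterns(bls, ssa, irs)
matched, discrepant = wage_discrepancies(bls, ssa, irs)
print(patterns)
print(matched, discrepant)
```

In this sketch the identifiers are needed while the records are being linked, but the results are counts only. That mirrors the distinction the paper draws between processing that requires identifiable data and findings that are entirely statistical.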
Members of the Workgroup included employees of these agencies, plus employees of the Bureau of Economic Analysis, Office of Management and Budget, the Bureau of the Census, and the Committee on National Statistics of the National Academy of Sciences. The Workgroup planned to analyze the information that corresponded to each EIN as it was reported to each agency. The analysis and findings would be entirely statistical in nature with no reference to the individual (identifiable) cases. Nevertheless, the planning, processing, and analysis phases each required access to identifiable data.

3. Confidentiality of Federal and State Tax Records

In the ERUMS study, the Employer Identification Number (EIN) was the identifier that was common to all the reporting systems. It was used to define the sample drawn by BLS and was used as the basis for retrieving, linking and comparing records containing information from the SSA and IRS files. By law, the EIN is a tax identification number, and even when standing alone is protected by Internal Revenue Code confidentiality restrictions.

ERUMS required access to data from W-3 records which by law are Federal tax records that are processed and maintained at SSA in conjunction with the computation of Social Security retirement benefits. Since these are tax records, it was necessary to satisfy IRS that the selection by SSA of sample cases, SSA's disclosure of W-3 data to BLS, and the use of employer data by other members of the Workgroup met the requirements of the Internal Revenue Code dealing with disclosure of tax information. (See No. 4 below.)

BLS selected Texas as the State whose records it would sample, and it obtained written permission from the Texas State Employment Security Agency to use their UI records in the project. The Texas Unemployment Compensation Act requires Texas employers to maintain records and file reports to the Texas Employment Commission with detailed information about the business operations and the number and compensation of employees. Texas law prohibits disclosure except for administering the Act, and it makes improper disclosure punishable by fines or imprisonment.

4. Other Confidentiality Considerations

Since the Workgroup was composed of employees from several agencies and organizations, confidentiality laws did not apply to them uniformly. In varying degrees, certain laws, regulations, and policies affected each agency's access to identifiable records from particular sources and provided differential access to various individuals in the Workgroup. A recurring theme was the necessity at each phase of the process to identify the persons who needed to use identifiable data and to ensure that no others had access at that time.

Besides affidavits and other written procedures to protect the confidentiality of records, certain technical safeguards were adopted to minimize disclosure risk. The first of these methods was to avoid identifying sample cases by EIN to persons who performed processing in the participating agencies but were not directly associated with the Workgroup. This method was adopted to conform to the Internal Revenue Code requirements for tax information under the agreement BLS had with the State of Texas. At BLS this led to a decision not to process the data on the mainframe computer system at the Department of Labor that is operated by a private contractor. Instead, BLS used a minicomputer which was accessible only to BLS employees who were members of the Workgroup.

State agencies periodically submit to BLS UI address files that compile identification data for all reporting units at the most-detailed level that is available from employers' reports.
BLS compiles these reports under a pledge of confidentiality that allows the data to be used only by authorized persons for statistical purposes.

Once BLS selected the Texas sample, it had to create a finder list so that SSA could extract corresponding records from its W-3 and related files for employers in the sample. The technical staff who performed these operations at SSA have routine access in their usual jobs to the employer records maintained at SSA. However, they did not need to know which of the employers' records comprised the sample selected by BLS from the Texas UI file. To avoid identifying those cases that were actually in the sample, BLS furnished SSA with a listing of 7 of the 9 digits of sample EINs. SSA staff then extracted records from the W-3 and related files for all records in which these 7 digits appeared without knowing which employers were actually in the BLS sample. This procedure effectively masked the identities of sample cases derived from State UI files, and thus significantly limited the number of SSA employees who were required to sign BLS non-disclosure affidavits.

Agreements for Interagency Data Sharing

Access by the Workgroup to the data sets needed for the study was accomplished through three interagency agreements plus an additional access arrangement.

The Workgroup had originally planned a tripartite arrangement through interagency agreements of SSA and BLS with IRS. However, IRS counsel raised objections that such a multi-party agreement would be unduly cumbersome, and approval would probably not be forthcoming. As an alternative, IRS proposed to contract exclusively with BLS for the performance by BLS of services that required access to tax data. SSA staff would be designated as special agents of BLS to process the data. Bilateral BLS/IRS and BLS/SSA agreements would also have to be drafted under this arrangement. The drafting of these arrangements proved to be a delicate task.
By law, the purposes of IRS participation in the project and its service contract with BLS had to be related to IRS administration of the tax laws. Section 6103(n) of the Internal Revenue Code (IRC) allows IRS to disclose tax return information to persons outside of the agency as long as it is for purposes of tax administration [1]. Specifically, this purpose is to conduct statistical studies based on return information, which Section 6108(a) of the IRC authorizes IRS to perform [2]. A case was made that the ERUMS study was one such purpose.

1. BLS and Texas Agreement

BLS has cooperative agreements with 50 State Employment Security Agencies to use employment statistics collected by the states for its labor economics research. The 1982 data used in the ERUMS study were furnished to BLS in its ES-202 program by the Texas State Employment Commission under a cooperative agreement. It was necessary for BLS to obtain authorization from the State Commission to use the microdata for the ERUMS study and to provide access for the Workgroup members. Under this cooperative agreement, the access and use of the data were subject to the confidentiality requirements of the Texas Employment Compensation statute as well as those set out in the BLS Commissioner's Order No. 2-80.

Each UI program is operated under state law that must conform to certain minimum federal standards, with reports that enable BLS to monitor state compliance. Under the Texas program, each employing unit is required to file (and update periodically) a status report with the Texas Employment Commission, describing the type of ownership, location, and nature of business. On a quarterly basis, employers are required to file detailed reports on wages and contributions. Multi-unit employers are asked to file a voluntary statistical supplement that provides detailed employment, wage, and contribution reports for each establishment.
The ES-202 reports are compiled by BLS and form the basis for the UI Address File that BLS maintains. This is a micro-level employer file that contains first quarter information for each reporting unit, and the 1982 file provided the Texas sampling frame for the ERUMS sample.

The confidentiality of statistical data collected under the cooperative agreement is protected by interrelated state and federal procedures. At the state level, these UI reports are collected under the Texas Unemployment Compensation Act, which limits the availability of its UI reports to public employees in the performance of public duties, except as the Employment Commission may find necessary in its administration of Texas law. At the federal level, BLS receives and maintains these confidential reports under the authority of the BLS Commissioner's Order that pledges confidentiality and prohibits disclosure except to authorized persons for statistical purposes. This Order precludes any use of identifiable information for non-statistical purposes, such as investigation or enforcement.

Under this cooperative agreement with the State of Texas, it was necessary for BLS to obtain permission from the Texas Commissioner to select employer sample cases and to make information about them available to BLS and SSA employees in the ERUMS Workgroup and later to others in the Microdata Access Group. In addition, BLS procedures establish the confidentiality of the identities and all information pertaining to employers in the sample. Members of the Workgroup who were not BLS employees were appointed as BLS agents pursuant to another interagency agreement with BLS. Like BLS employees, other Workgroup members were required to sign a Non-Disclosure Affidavit before they would be given access to the microdata.

2. IRS and BLS Agreement

The initial draft of the statement of purpose by IRS representatives was acceptable to IRS counsel since its justification for sharing of confidential tax information was defined as for purposes of tax administration, which is permissible under section 6103(n) of the Internal Revenue Code [1]. However, the case that was made for IRS tax administration purposes was not acceptable to other Workgroup participants because they felt that this did not clearly describe the purposes of the ERUMS project in general or SSA's role in particular. In the subsequent draft, care was taken to define contractual purposes in language that covered the statistical purposes of the several participating agencies and that provided for the exchange of records to create a common pool of data for a variety of analytical purposes, including those related to tax administration.

In this agreement, IRS contracted with BLS for the performance of those parts of the ERUMS project that required access to tax data, including the wage report information that was to be provided by SSA. Under this agreement, SSA staff could be designated as special agents of BLS to carry out their part of the linkage and analysis operations. By law, the purposes of IRS participation in the project and its service contract with BLS had to be related to IRS administration of the tax laws.

The terms of a contract between IRS and BLS, which needed to be acceptable to SSA, enabled BLS to receive tapes containing tax information from IRS and SSA and to combine them with records in the UI Address File maintained by BLS. It imposed strict safeguard procedures and required BLS to provide IRS with a list of all persons permitted to see confidential tax return data. This list included SSA employees who were required to sign affidavits as agents of BLS.

3. BLS and SSA Agreement

The third agreement was a Conditions of Use agreement between BLS and SSA which enabled SSA to release data from its employer files to BLS and authorized BLS to link data from these files to data in the UI Address File and data to be furnished by IRS. Like the IRS/BLS agreement, it limited access at each stage of the project to those persons who needed to use identifiable data, kept the number of such persons to a minimum, and required adequate physical security procedures.

This agreement, which needed to be acceptable to IRS, enabled BLS to use SSA files for the ERUMS project. Under this agreement, SSA would furnish BLS with SSA's Single Unit Code File, Multi-Unit Code File, and Employer Report (W-3) Record. The agreement authorized BLS to link data from these statistical files with data in the BLS Unemployment Insurance Address File and with data to be furnished by IRS, and prohibited any other linkage.

4. Microdata Access Group

In the planning and matching stages of the project, the persons who needed to have access to microdata were those members of the Workgroup who were performing the record matching and verification. At Workgroup meetings, members generally reviewed data in the form of frequencies and other summaries to track the progress of the matching operations and to plan future steps. Occasionally, discrepancies appeared or questions arose concerning classification of a particular employer or possible mismatch of data. Those matters were usually referred to particular members to resolve, with access to microdata as needed on an ad hoc basis.

When the matching steps were completed and time came to plan the analysis, new arrangements were needed to enable a different group of persons to examine identifiable microdata. The Microdata Access Group (MAG) was formed for this purpose. At this point, IRS agreed that its contractor, BLS, would be permitted to make Workgroup members its agents as needed for the analysis stage.
This enabled the Workgroup members who were employees of BEA and the Committee on National Statistics to become sworn agents who, like the employees of BLS and SSA, would be permitted to examine and analyze microdata. Thus, of the three agencies sharing microdata (BLS, SSA, and IRS), IRS was the only one that did not have access to the matched microdata file. This group met periodically to plan and perform the analysis, prepare findings, and report its findings back to the full Workgroup.

Once the terms of all contracts were agreed upon, the contracts and the conditions of use agreement were signed by officials of the participating agencies, and the way was cleared for the data transfers.

Summary and Conclusions

To say that the process of discussion and negotiation leading to the signing of the ERUMS access agreements was painstaking, sensitive, and costly in terms of staff time and delay in the study's completion is an understatement. The disclosure aspects of the study severely tested the will and resolve of the affected agencies.

In retrospect, the signing of interagency agreements between IRS and BLS and between SSA and BLS documented a process of negotiation by which the study plan was adapted to the requirements of the various confidentiality laws that impinged on it. In addition, it summarized a process in which a combination of technical and procedural safeguards was fitted to meet the requirements of the Federal and State agencies that were involved in the data sharing.

While the participants in the ERUMS study all feel a certain degree of accomplishment due to their collective persistence, none are quite so upbeat about the long duration of the study. Clearly, the long incubation period for the interagency data sharing agreements was a major contributor. However, it is important to recognize that the prolonged negotiation for interagency agreements did not result from lack of cooperation among the participants.
On the contrary, it reflected the complex mosaic of legal restrictions on use and interagency dissemination of records. Once it became evident that a single multi-party agreement would be unworkable for the overall project, the plan was broken down into component steps of disclosure, record linkage, and analysis. Each failure to reach an agreement required a step back to re-examine the study imperatives and to adapt the procedures to the practical and legal necessities at each stage. In addition to adding to the overall time and resources consumed by the project, these delays contributed to further delays, including:

1. Personnel turnover among the project participants, a result of the project's extended schedule, slowed progress on the technical issues.

2. The acquisition of IRS Form 940 data was adversely affected, since these forms have a 5-year retention period and were scheduled for destruction by the time the sample EINs were determined.

On the positive side, however, ERUMS demonstrated that such data sharing projects can be successful under current laws if there is creativity, flexibility, and most of all, persistence.

Notes and References

[1] Section 6103(n) of the Internal Revenue Code (IRC) allows for the provision of confidential tax return information for purposes of tax administration. Specifically, it reads: "Certain Other Persons. -- Pursuant to regulations prescribed by the Secretary, returns and return information may be disclosed to any person, including any person described in Section 7513(a), to the extent necessary in connection with the processing, storage, transmission, and reproduction of such returns and return information, and the programming, maintenance, repair, testing, and procurement of equipment, for purposes of tax administration."
[2] Section 6108 of the IRC has three parts which call for the publication of statistical compilations of tax return information at regular intervals, but, unlike Section 6103(n), such information cannot identify a particular taxpayer. This Section is the primary "mandate" for IRS' Statistics of Income (SOI) program.

a) Publication or other Disclosure of Statistics of Income. -- The Secretary shall prepare and publish not less than annually statistics reasonably available with respect to the operations of the internal revenue laws, including classifications of taxpayers and of income, the amounts claimed or allowed as deductions, exemptions, and credits, and any other facts deemed pertinent and valuable.

b) Special Statistical Studies. -- The Secretary may, upon written request by any party or parties, make special statistical studies and compilations involving return information (as defined in section 6103(b)(2)) and furnish to such party or parties transcripts of any such special statistical study or compilation. A reasonable fee may be prescribed for the cost of the work or services performed for such party or parties.

c) Anonymous Form. -- No publication or other disclosure of statistics or other information required or authorized by subsection (a) or special statistical study authorized by subsection (b) shall in any manner permit the statistics, study, or any information so published, furnished, or otherwise disclosed to be associated with, or otherwise identify, directly or indirectly, a particular taxpayer.

Section 6108(a) has been interpreted as a tax administration purpose for the Statistics of Income (SOI) Program (unlike 6108(b) and 6108(c)); hence, if a 6108(a) study requires the use of "outsiders", then a 6103(n) contract can be initiated, as was done for the ERUMS study.

SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS

John Pinkos, Kenneth LeVasseur, Marlene Einstein, U. S.
Bureau of Labor Statistics; Joel Packman, Social Security Administration

Introduction

The first paper in this session described the experience with developing interagency agreements, the third describes the findings resulting from the study, and this one describes the sample selection and matching procedures used. In addition to describing the sample selection and matching procedures, the following will explain what the ERUMS Workgroup considered when developing the project design. This paper also describes the sampling frames, data, and manual matching conducted by the ERUMS Workgroup. The ERUMS project was a pilot study, designed to develop and test procedures for linking and comparing employer and reporting unit data from different administrative record systems. The study from its inception was exploratory in nature, and the ERUMS Workgroup members hoped to observe and document the similarities and differences discovered between the records in the systems being studied and, thus, between the systems themselves. The scope of the project included employer reporting unit data from the Bureau of Labor Statistics and Social Security Administration employer data files, which have similar coverage. Internal Revenue Service data, which were edited by the Bureau of the Census, were used to assist in the analysis of the sample. The ERUMS committee members included staff from the Office of Management and Budget (OMB), Bureau of Labor Statistics (BLS), Social Security Administration (SSA), Bureau of Economic Analysis (BEA), Internal Revenue Service (IRS), Census, and the Committee on National Statistics (CNS). Developing the sample design, selecting the sample, and performing the machine and manual match were carried out by SSA and BLS staff who were cleared to work with the confidential data. To conduct the final analysis of the data, this group was later expanded to include staff from BEA and the CNS.
There are two reasons for providing an account of the ERUMS sample selection and matching procedures. The obvious reason is that the results, like those of any research study, are dependent on the procedures used, and anyone interested in the results is entitled to a full description of how the study was carried out. The other reason, equally or perhaps more important, is that ERUMS was a venture into uncharted territory, and we believe that future projects of this kind will benefit from the availability of a detailed road map of the procedures that were developed to match and compare employer and reporting unit records from BLS, SSA, and IRS for statistical purposes.

Sample Design Considerations

A major design consideration affecting the size and scope of the project was the limited staff time and resources each of the participating agencies was able to contribute. The committee realized from the beginning that the meat of the project would be in the manual review of the reporting units from each of the administrative record systems. To keep the workload manageable, the Workgroup decided to limit the study to one State rather than several. It was also decided that this State should be large and be one which could share its data with federal statistical agencies for research purposes. The State selected was Texas. Probability sampling was used at all stages of selection and provided two benefits. It ensured that sample results could be used to produce unbiased estimates for the study population, and it made possible estimation of sampling errors. Additionally, the Workgroup felt it would be useful for both analytical and methodological purposes to produce weighted estimates. Consideration was given to designing a baseline sample where a sample from one agency (e.g., BLS) would be drawn and then a search for the selected sample members would be conducted on the other agency's files (e.g., SSA).
This approach would provide matched units on both files as well as those on the BLS file but not the SSA file. This method, however, would not identify those units on the SSA file but not on the BLS file. The baseline sample approach was abandoned and it was decided that samples would be selected in two stages. The stage one sample was an equal probability sample of the population which was then stratified by match status. The stage two sample was a systematic subsample from these strata. This method of sampling provided a means for oversampling selected types of records which were of more interest to the project, and it also resulted in a manageable sample size. As a final design consideration, the committee wanted to ensure that records from both SSA and BLS had an equal chance of selection. Additionally, the committee wanted to develop an approach that would minimize the number of computer searches required to select the sample and relevant data elements from these large administrative record files. The sample design used was one that selected separate samples from the BLS and SSA files using the same set of random pairs of numbers. The purpose of this design was to measure overlap between the two frames and, more importantly, to measure the amount of nonoverlap between the two frames. The nonoverlap included those sample members on one frame but not the other. This design also minimized the computer costs and allowed the committee to select the sample in one pass through each agency's data file. Once the sample was selected, the relevant data elements for each sample member were downloaded to a microcomputer.

Sampling Frames

Both the SSA and BLS data files are compilations of administrative tax records. The SSA data file includes data from employer W-2 and W-3 wage reports, whereas the BLS file includes data from employers' State Unemployment Insurance tax reports.
The identifying data element common to both the SSA and BLS files and assigned from a single source is the Employer's Identification Number, or EIN. The EIN is a unique 9-digit number assigned to companies by IRS and is used to track federal tax payments. When companies pay State Unemployment Insurance taxes, the State assigns an Unemployment Insurance (UI) tax number to track payment. Since companies are given a federal tax credit for State UI taxes, they provide their EIN to the State UI tax department. On an annual basis IRS provides each State UI tax department with a file of all the EINs registered in the State. The UI tax department then reconciles the amount of State UI taxes paid by each employer against the IRS file of EINs and tax credits claimed by each employer. By definition, all companies on the SSA files should have an EIN reported, because this is what is required for an employer to be included on the file. On the BLS State file a few units did not have an EIN reported, since only a State UI tax number is required for an employer to be included on that file. The first quarter 1982 Texas file had EINs reported for 98.7 percent of all reporting units. The sampling frame for BLS was all the EINs reported in the Texas first quarter 1982 U.I. Name and Address File. The sampling frame for SSA was all the EINs reported in the Single Unit or Multi Unit Code file with wage reports for calendar year 1982. The SSA files are continuous files linked over time, whereas the BLS file in 1982 was a snapshot of one calendar quarter. Effective with first quarter 1989 data, the BLS began linking data quarterly and now has a continuous data file. The sampling rate was determined by the Workgroup's decision that 400 EINs would be a manageable sample size and that about one-half of the sample should have EINs classified as multis, or companies with multiple locations.
EINs classified as multis were of particular interest because there is more variation in reporting practices. To derive the sampling rate, the committee looked at the first quarter 1982 Texas file, which had 267,487 EINs classified as single units and 3,125 EINs classified as multi units. A sampling rate of 6 in 100 was selected since it provided approximately 188 EINs that were multi units. As previously mentioned, it was decided to select a two-stage sample. The first stage was an equal probability sample of the population. This first-stage sample was selected from all EINs that had 1 of 6 random pairs of numbers in positions 7 and 8 of the EIN. The sampling rate of 6 in 100, when applied to both the BLS and SSA frames, provided a combined stage one sample of 19,964 EINs. The stage one sample was then machine matched and each EIN was assigned a status classification. The initial status classifications are shown below:

Table A. MATCH STATUS IN:

Group   BLS        SSA
1       Single     Single
2       Single     Inactive
3       Inactive   Single
4       Multi      Single
5       Single     Multi
6       Multi      Inactive
7       Inactive   Multi
8       Multi      Multi

EINs that were inactive in both systems obviously had no chance of entering the ERUMS sample. Another view of the status classifications is shown in Attachment A, which is a 3x3 grid having the classifications single, multi, and No Wage Report (NWR) on each scale for both the BLS and SSA files. Records with no wage reports on the SSA file were considered inactive. The bottom right cell on the grid is not applicable since these would be records that did not exist on either file. Based upon the interest of the Workgroup, three of the basic classifications or cells were subdivided and are shown as the shaded sectors on the 3x3 grid (see Attachment A). County and SIC became matching criteria for those EINs that were single on both files.
The number of reporting units became a criterion for those EINs that were multis on the BLS file but were single on the SSA file and those EINs that were multis on both. These eleven match status classifications became the strata used for the second stage sample. The second stage sample selection had equal probability within each stratum. The sampling rates used varied by stratum, from selecting all to selecting 1 in 173.78. Given the exploratory nature of ERUMS, the intent of the Workgroup was to pull a larger sample of EINs classified as multis and nonmatched records. These cases were expected to present more difficulties. Therefore, the Workgroup wanted to have enough of these cases to learn what the situations were and to test methods of dealing with them. The final sample contained 401 EINs, including 201 classified as having multi units on either the BLS or the SSA files. The remaining 200 EINs were those not classified as multis on either the BLS or SSA files. Once the subsample was selected, the Workgroup began the review and analysis phase, which included labor-intensive manual matching. The Workgroup reviewed reported employment and SIC and geographic codes for each of the 401 EINs. To assist in this process, the Workgroup made arrangements to have access to IRS data for tax years 1981 through 1983. Data for 385 of the 401 EINs were made available. During the review process the Workgroup attempted to uncover the reasons why records did not match or why records were on one file but not the other. In this process of looking very closely at the actual records from each agency, the Workgroup learned much about the two systems and found reasons to reclassify some of the records, which affected the final match status. For example, in the area of multi units, the BLS system defines multis as companies with multiple locations within the same State, whereas the SSA system defines multis as companies that have multiple locations in the United States.
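The two-stage selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Workgroup's actual program: the six digit pairs are hypothetical (the paper does not list the pairs that were drawn), the function names and status labels are invented for the example, an EIN absent from a frame is simply treated as inactive there, and the weight is assumed to be the inverse of the product of the two stage rates.

```python
# Hypothetical digit pairs: the paper does not say which six pairs were drawn.
SELECTED_PAIRS = frozenset({"07", "23", "41", "58", "76", "92"})

# Table A: (BLS status, SSA status) -> initial match status group.
TABLE_A = {
    ("single", "single"): 1,
    ("single", "inactive"): 2,
    ("inactive", "single"): 3,
    ("multi", "single"): 4,
    ("single", "multi"): 5,
    ("multi", "inactive"): 6,
    ("inactive", "multi"): 7,
    ("multi", "multi"): 8,
}


def in_stage_one(ein, pairs=SELECTED_PAIRS):
    """True if positions 7 and 8 of a 9-digit EIN form one of the selected pairs."""
    digits = ein.replace("-", "")
    return len(digits) == 9 and digits[6:8] in pairs


def match_group(bls_status, ssa_status):
    """Return the Table A group, or None for an EIN inactive in both systems."""
    return TABLE_A.get((bls_status, ssa_status))


def stage_one_sample(bls_frame, ssa_frame):
    """Apply the same digit pairs to both frames and classify each selected EIN.

    bls_frame and ssa_frame map EIN -> "single" or "multi". An EIN missing
    from a frame is treated as "inactive" there (a simplifying assumption).
    """
    selected = sorted(e for e in set(bls_frame) | set(ssa_frame) if in_stage_one(e))
    return {
        e: match_group(bls_frame.get(e, "inactive"), ssa_frame.get(e, "inactive"))
        for e in selected
    }


def sampling_weight(stage_two_interval, stage_one_rate=6 / 100):
    """Assumed weight: inverse of the overall selection probability in a stratum.

    stage_two_interval is k for a "1 in k" second-stage rate (1 means take all).
    """
    return stage_two_interval / stage_one_rate
```

Because the same pairs are applied to both frames, an EIN present on either file is selected or rejected identically in each, so overlap and nonoverlap fall out of a single pass per file. At the stage-one rate of 6 in 100, the 3,125 multi unit EINs on the Texas file give an expected 187.5 stage-one multis, in line with the "approximately 188" quoted in the text.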
During the review of the multi-unit records, employment levels were considered and attempts were made to reconcile differences in reporting units by aggregating employment of the individual multi units to the EIN level. As a result of this review, the Workgroup decided not to use employment as a match criterion. It was also decided that for purposes of this study, a multi unit EIN would be an EIN that had multiple locations within the State of Texas. This reduced the number of SSA multi unit EINs in the final sample from 120 to 10. The remaining 110 records were reclassified as single EINs. As noted, the Workgroup also compared SIC and geographic codes from both files. SIC codes were first examined to see why there were nonmatches at the four-digit SIC level. In some cases, the nonmatched EINs were assigned SIC codes in related industries; in other cases, the industry code reflected a larger aggregation of the reporting unit. Another, and perhaps more important, factor that accounted for differences at the 4-digit level was that both BLS and SSA have policies for SIC coding exceptions. The BLS in 1982 had 11 exceptions to 4-digit SIC coding, which meant a 3-digit SIC code was assigned in certain industries in lieu of the 4-digit SIC code. This represented 43 4-digit industries. These are industries which either have a significant amount of overlapping in their industrial activities or are industries from which it historically had been difficult to collect sufficient information to assign a 4-digit SIC. The BLS currently has reduced the number of 4-digit coding exceptions to 6, which represents 17 4-digit SIC industries. The SSA SIC coding exceptions exist in some agricultural industries and Public Administration, which are coded to the 1-digit level. This affected 64 4-digit industries. Approximately 63 other 4-digit industries were coded at the 3-digit level for one reason or another, typically insufficient information.
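One way to compare industry codes when the two systems code some industries at different levels of detail is to compare only the leading digits that both codes actually carry. The sketch below is illustrative only and is not drawn from the ERUMS procedures: it assumes SIC codes are held as digit strings, that a code assigned under a coding exception is stored with fewer than four digits, and the function names are hypothetical.

```python
def agree_at(sic_a, sic_b, digits):
    """Compare two SIC codes on their first `digits` digits.

    Returns True or False, or None when either code is missing or is coded
    at a coarser level than `digits`, so that no comparison is possible.
    """
    if not sic_a or not sic_b:
        return None
    if len(sic_a) < digits or len(sic_b) < digits:
        return None
    return sic_a[:digits] == sic_b[:digits]


def finest_common_level(sic_a, sic_b):
    """Number of leading digits at which both codes can be compared (0 if either is missing)."""
    if not sic_a or not sic_b:
        return 0
    return min(len(sic_a), len(sic_b), 4)
```

Returning None rather than False for a code held at a coarser level keeps "cannot be compared" separate from "disagrees", which matters when one agency's exceptions stop at three digits or one digit and the other's do not.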
In addition to reviewing SIC codes, the Workgroup also looked at geographic codes and tried to explain why some records did not match between files. Maps and coding manuals were consulted and the review showed there was some inherent misreporting of county codes by employers. Texas has more than 37 cities with the same name as a county, but these cities are not located in those counties. Houston, for example, is in Harris County, not Houston County, and Austin is in Travis County, not Austin County. Counties named Houston and Austin are located elsewhere in the State. In some cases the reason for nonmatching records was that the reporting unit was coded in an adjacent county. Texas has a very large number (254) of counties. For those employers who keep their records by city or are not familiar with the county names, it is easy to see the potential for some misreporting. The Workgroup also looked very closely at the cases having inactive EINs on either the BLS or SSA files. Inactive EINs for the BLS were defined as those that appeared on the SSA file but did not appear on the BLS file. Inactive EINs for SSA were defined as those on the SSA file with no wage reports for 1982. When reviewing the BLS inactive EINs, the Workgroup used SSA SIC and employment data to determine if the employer was exempt from Unemployment Insurance coverage. They also looked at IRS data to determine if the employer became active after the first quarter of 1982 and at the first quarter 1983 Texas file to see if the employer reported in 1983. When reviewing the SSA inactive EINs, the Workgroup was able to use a more nearly complete SSA wage report file that included wage reports that were either delinquent when the sample was selected or were in the process of reconciliation with IRS. As a result of these additional data, 44 of the 99 EINs originally classified as inactive on the SSA file were determined to be active.
The Workgroup also used the BLS 1982 and 1983 first quarter Texas files to conduct name searches to see if the same employer reported under a different EIN. The Texas files were also used to see whether zero employment was reported, which might have indicated no wages were paid. Additionally, IRS data were then used to see what level of employment was reported to IRS. The last step in the review and analysis phase was to determine the final match status of the 401 EINs. As a result of the review, it was decided to collapse the 11 categories shown in Attachment A down to the basic 8 cells shown in Table A. As part of the final analysis, committee members worked on completing the documentation for the project and discovered that an additional 2,608 EINs that were on the SSA file but not the BLS file were inadvertently omitted from the first stage sample and, consequently, from the second stage. Adding cases to the stage 1 and 2 samples at that point in time would have further delayed completion of the study, so the Workgroup decided the best way to deal with this problem was to reweight the sample cases in the two affected strata and rerun the results tables.

Attachment A. MATCH STATUS CLASSIFICATIONS [graphic not reproduced]

KEY: NWR = No Wage Report; SIC = Standard Industrial Code; RU = Reporting Units

RESULTS, FINDINGS, AND RECOMMENDATIONS OF THE ERUMS PROJECT

Vern Renshaw, Bureau of Economic Analysis; Tom Jabine, Statistical Consultant

The other papers in this session have examined the administrative arrangements and the sample selection and matching procedures for the Employer Reporting Unit Match Study (ERUMS). This paper reviews the study's results, findings, and recommendations. The main purpose of the ERUMS project was to provide information on the technical and administrative feasibility of interagency record linkages. However, the ERUMS Workgroup hoped that the study would also shed some light on at least three areas of substantive concern.
1) We hoped that geographic and industry information for reporting units contained in the Bureau of Labor Statistics (BLS) Unemployment Insurance (UI) Address File could help evaluate the potential statistical usefulness of a) reporting unit data supplied by multi unit employers participating in the Social Security Administration (SSA) Establishment Reporting Plan (ERP) for forms W-2 and W-3; and b) State data supplied to the Internal Revenue Service (IRS) on Form 940. SSA has been concerned about the quality of its reporting unit data because resources for maintaining the ERP had been inadequate for some time, and the State data supplied on IRS Form 940 had never been used for statistical purposes.

2) We hoped that information from IRS and SSA files could help evaluate the completeness of employer coverage in the UI Address File. The UI Address File leaves out or estimates employer information that is not received by its statistical deadline, whereas information for late reports was generally available in the IRS and SSA files used for ERUMS.

3) We hoped that the analysis of matched records could help evaluate the consistency of industry and geographic coding in the BLS, IRS, and SSA systems.

The extent to which the ERUMS project could actually shed light on these areas was limited by several factors. First, ERUMS was a pilot study based on a small sample drawn from a single State (Texas) for a single year (1982). The results, therefore, could not be expected to reflect precisely the status of the data systems for the entire country or for subsequent years. (BLS has taken steps to improve the UI Address File since 1982.) Second, both the information content and processing procedures differed somewhat among the data systems. The W-2/W-3 data were for calendar 1982, for example, while the UI Address File that was used contained data only for the first quarter of 1982. Finally, a number of unanticipated problems were encountered in carrying out the study.
The most limiting of these problems resulted from the slow implementation of ERUMS. For example, by the time the final sample of employers was selected, many IRS Form 940s for 1982 had been destroyed. Therefore, it was not possible to evaluate the State data contained on the Form 940s. Another unanticipated difficulty arose because the initial SSA files used in the matching process omitted some wage reports and were generally inadequate to determine if employers were actually reporting multiple units in Texas. These initial files were later supplemented with more complete information, but the supplementation occurred after the final sample had been drawn; consequently the size of the sample was smaller than intended for some categories of employers, especially for multi unit employers. Finally, it proved to be more difficult than had been anticipated to account for differences in employer coverage among the data files. In part, this was because estimated data were not identified in the UI Address File (a deficiency being corrected) and because there was no documentation of such phenomena as dates when employment started for employers (or ended, or was changed by reorganization, etc.) or dates when forms filed by employers were received by the processing agencies. The clearest conclusion to emerge from the ERUMS project related to the poor quality of SSA's ERP data for multi unit employers. It was evident that SSA would need to take steps to improve quality control if the SSA system were ever to be useful for developing data by geographic and industry classification. The other findings of the ERUMS project were not so stark as those relating to the poor quality of SSA's establishment data, but the study could well reinforce the concerns of those who worry about the inconsistencies in industry coding that occur when employers are coded independently by different agencies.
In the following sections of the paper the results, limitations, findings, and recommendations of the ERUMS project are discussed in somewhat greater detail. Tables A-1 to A-8, which are referred to in the next two sections, appear in Chapter III of the ERUMS final report (Statistical Policy Working Paper 16). In order to meet space limitations, we have included only Table A-4 with this paper.

Results

As explained in detail by Pinkos et al. in the second paper of this session, the ERUMS sample was a two-phase sample of employers, as defined by unique Employer Identification Numbers (EINs). Most of the results presented in this paper are estimates based on the Phase II sample of 401 EINs, weighted to account for the disproportionate sampling used in the second phase of the sample selection. Of the Texas EINs that were active in 1982 in the BLS or SSA systems, 67.1 percent were active in both systems, 27.6 percent were active only in the SSA system, and 5.3 percent were active only in the BLS system (Table A-1). Only about 1.0 percent of all active EINs were classified as multi unit in one or both systems, and most of these were classified as multi unit only in the BLS system (Table A-4). For the matched single unit EINs, i.e., those that were active in both systems, an estimated 81.6 percent had the same State and county codes in both systems. The remaining cases were about equally distributed in three categories: same State, different county; same State with no county code in the SSA file; and different State (Table A-5). An estimated 70.2 percent of the matched single unit cases had the same two-digit industry codes. About half of the remaining cases were not classified by industry in the SSA system (Table A-5). When matched against the IRS/Census-edited Form 941/943 file, about three-fourths of the matched single units from both the BLS and SSA files had two-digit industry codes that agreed with those in the IRS/Census file.
However, when the SSA unclassified cases were excluded from this comparison, the proportion of SSA cases that agreed with the IRS/Census two-digit code was somewhat greater than the corresponding proportion for the BLS matched single unit cases (Table A-8). Only a few EINs (nine sample cases) were classified as multi unit in both the BLS and SSA systems. Matching individual reporting units for these cases proved to be difficult. Overall, the nine sample employers had 105 Texas reporting units in the BLS system and 60 in the SSA system for 1982. Of the active SSA EINs not found in BLS's first quarter 1982 UI Address File, it was estimated that 69.2 percent had reported no first quarter employment to IRS on Form 941 and therefore would not normally be expected to appear in the BLS system (Table A-6). For another 10 percent of these employers, the analysis suggested that they may not have met requirements for UI coverage in Texas, either because they had no operations in Texas, because of nonprofit status, or because their payrolls were too small. For the remaining 20 percent, the reasons for their absence are not always clear, but it may have resulted in part from lags in incorporating new employers in the UI State agency and BLS files. Most of the employers who were included in the 1982 UI Address File but did not file 1982 W-2/W-3 wage reports (22 sample cases) appeared to have ceased hiring employees, gone out of business, or gone through other changes that altered their reporting to IRS and SSA. Half of the employers in this group reported no employment in the 1982 UI Address File. Many of the remainder had filed their final Form 941 with IRS (at least for the period 1981-1983) for a quarter in 1981. An analysis of the sample EINs that appeared in SSA's Multi Unit Code File provided some indication of the extent to which multi unit employers were participating in SSA's Establishment Reporting Plan (ERP) in 1982 (Table A-7).
An estimated 35.9 percent of these EINs had been incorrectly added to the Multi Unit Code File as the result of a processing error that has since been corrected. Most of the remaining employers had initially agreed to participate in the ERP, but more than half of this group did not provide separate data for each reporting unit in their W-3 wage reports for 1982.

Limitations

Several factors limit the broad applicability of the ERUMS findings. The results reflect the reporting requirements and operating procedures associated with the agency record systems in 1982. There have been significant changes since then. In particular, BLS has taken several steps to improve the timeliness and the completeness and accuracy of data in its UI Address File. The study was based on data for a single State, Texas, and on a small sample of employers and reporting units. The UI system gives the States some latitude in their record-keeping practices, so indications of the coverage of employers in the record systems of the Texas State Employment Agency in 1982 should not be assumed to apply fully to the UI systems of other States at that time. The small sample size means that estimates based on the Phase II sample are subject to relatively large sampling errors. Because of limited resources and the complexity of the Phase II sample design, we were able to compute sampling errors only for a few key estimates (see Table A-4). The analysis of the results was complicated by differences in concepts and coverage in the record systems used in the study. These differences occurred in the basic filing requirements for the UI and SSA/IRS systems, the time reference of the basic BLS and SSA files used for matching, the definition of reporting units in the BLS and the SSA/ERP systems, and the structures of the BLS and SSA industry classification systems. In addition, certain file deficiencies and operational problems made the analyses more difficult.
About 1.3 percent of the records in the 1982 UI Address File for Texas did not have EINs and therefore were not included in the Phase I sample of EINs from that file. In the SSA files, a significant proportion of employers lacked county and industry codes. The most serious problem was that a high proportion of multi unit employers were not reporting separately in 1982 for each reporting unit, so that we were unable to do a thorough comparison of reporting units for multi unit employers active in both the BLS and SSA systems. Although these differences and file deficiencies made the analyses more difficult, the fact that we succeeded in identifying and documenting them is an indication that the ERUMS project succeeded in its main goal, which was to demonstrate the feasibility of doing matching studies as a means of evaluating the suitability of administrative record systems for statistical uses. The data on amounts of employment and payroll available from SSA, BLS, and IRS files were used in reviewing the unmatched sample cases and trying to understand why they were not present in both SSA and BLS files. However, the employment and payroll data were not added to the data file for the 401 sample EINs that was used to develop the estimates presented in this report. Therefore, all of the results shown are estimates of numbers of employers or reporting units classified by attributes such as match status, and geographic and industry codes in the different systems included in the study. We did not attempt to estimate what proportions of aggregate employment or payroll were accounted for by employers who were unmatched or had different geographic or industry codes.

Findings

The detailed analyses of the ERUMS data did not suggest that large numbers of employers who report wages in one of the payroll tax systems were failing to report in the other system when they should have been.
They do, however, suggest that late reports and different procedures for processing the reports in the two systems created potential problems for using both systems' data files for statistical purposes.

Perhaps the clearest finding was that it is not possible to maintain a usable establishment reporting unit plan for multi unit employers in the absence of systematic procedures for monitoring employer reporting and updating files for changes in the number, location and industry of each employer's reporting units. SSA's Establishment Reporting Plan clearly lacked the necessary resources to do this in 1982 and there is no reason to think that the situation has improved since then.

There was a moderately high but by no means perfect correspondence between county and two-digit industry codes for single unit employers included in both the BLS and SSA systems. A substantial proportion of the differences arose from the absence of county or industry codes in the SSA system. Comparisons of industry codes at the three and four-digit level were not attempted because of the differences in the industry classification systems used by the two agencies.

With some qualifications, we were successful in matching the records of employers, as defined by their EINs, in different systems. However, we were not successful in matching BLS and SSA records for reporting units, the main reason being the incompleteness of SSA's data for reporting units provided under the voluntary ERP. Other reasons were the lack of a common identifier, analogous to the EIN at the employer level, for reporting units and the slight differences in the reporting unit definitions used by BLS and SSA.

We learned what we believe are some important lessons for others who may wish to match business records from different agency sources, whether for research or operational purposes.
First, the plans and the necessary interagency agreements should be developed well ahead of the earliest date at which the files to be linked are expected to be available. In particular, the development of interagency agreements for the exchange of identifiable records is a painstaking process and considerable time may be needed for their completion and approval. Second, successful matching requires in-depth knowledge of all of the record systems involved and of the specific files that exist within those systems. An interagency team approach, with full exchange of information, is essential because there is unlikely to be a single individual who has all of the necessary information, even for the files of a single agency. Finally, whenever possible, it is essential to pretest matching procedures before embarking on large-scale operational applications.

Recommendations

ERUMS was designed primarily as a demonstration project and was therefore limited in its coverage and scope. Nevertheless, the Workgroup believes that the study results, along with other information acquired in the course of the study, justified the inclusion in its report of five formal recommendations addressed specifically to the BLS and SSA record systems for employers and reporting units. These recommendations were:

1. SSA should undertake a full review of the current status and uses of the Establishment Reporting Plan and decide either to continue it with adequate resources for maintenance and improvement of quality or to discontinue it entirely. (Note: such a review was begun by SSA prior to the completion of the ERUMS project. As a result of that review, SSA is taking steps to prepare for the termination of the ERP.)

2.
BLS should review the State Employment Security Agencies' procedures for identifying employer births (including those resulting from mergers and changes of organization) and seek ways of reducing the apparent lag between filing of applications for EINs and inclusion of new employers on State Agency and BLS lists used as frames for statistical surveys and reports.

3. Data in the UI Address File on employment and wages paid should be labelled to distinguish imputed data from data reported by employers.

4. The EIN should be identified as a key item in the UI Address File and efforts should be made to achieve 100 percent reporting initially and current reporting of changes in EINs.

5. BLS and SSA (if it continues the Establishment Reporting Plan) should strive to obtain data from employers for their establishments as defined in the 1987 Standard Industrial Classification (SIC) Manual. Both agencies should code industry for all establishments, without exception, at the 4-digit SIC level of detail. Whether or not the Establishment Reporting Plan is continued, SSA should code all employers identified on Forms SS-4 at the 4-digit level of detail. (See the parenthetical note following recommendation 1 concerning the current status of the ERP.)

In a broader context, the ERUMS Workgroup concluded that current efforts to collect economic data at the establishment level are dispersed among Federal and State agencies, are poorly coordinated, and place unnecessary burden on employers. The Workgroup believes that further, more intensive and extensive interagency matching studies have an important role to play in resolving these problems and in determining the possible effects on statistical programs of prospective major changes in administrative reporting systems for employers. We therefore recommend that:

6.
Further matching studies should be directed at acquiring information that will support the eventual development of a mandatory reporting system to meet the needs of all Federal and State statistical programs for establishment lists, including SIC codes. An interim goal should be that all agencies requiring or requesting employers to provide data at the establishment or reporting unit level adopt common definitions of units and data items to be submitted for these units.

Three agencies -- the BLS, the Census Bureau and the National Agricultural Statistics Service -- play a dominant role in the direct collection of establishment-level economic data. Recent initiatives of these agencies, under the general guidance of OMB's Statistical Policy Office, have been directed at greater coordination of their respective list-building and maintenance activities. Further integration of business lists will require fuller understanding of the similarities and differences of the three systems, based on matching of individual establishments and reporting units in the different systems.

[Graphic not reproduced.]

1/Numbers in parentheses are standard errors of the percents.
* Indicates a standard error of less than 0.05 percent.

DISCUSSION

W. Joel Richardson
Charles A. Waite
U.S. Bureau of the Census

Introductory Comments on ERUMS

First of all, I would like to thank the many people who have been involved with ERUMS. Their commitment and resourcefulness have helped to make the ERUMS project a success. As Vernon has detailed, several recommendations were presented that undoubtedly will improve the business files of the Bureau of Labor Statistics (BLS) and the Social Security Administration (SSA). But more importantly, the ERUMS study provided valuable experience in the technical aspects of matching interagency data sets. I am hopeful that this experience will help to further the efforts of data-exchange initiatives among federal statistical agencies in the coming years.
When the preliminary planning for ERUMS began in 1983, the Census Bureau expected to be one of the participating agencies. Our business employer files were to be matched along with those of the BLS, SSA, and the Internal Revenue Service (IRS). However, there were significant problems concerning the release of our confidential data. Though we realized the importance of ERUMS, we could not resolve these data-access problems soon enough to allow us to be an active participant. As an alternative, the Census Bureau obtained observer status, which enabled us to closely follow the progress of ERUMS.

Before critiquing the three papers, I'd like to expound on the value of the ERUMS study to the federal statistical community. Warren stated that a major goal of ERUMS was to test the feasibility of matching employer records from the business lists of different government agencies. This goal was accomplished in ERUMS, and the results showed that the matching of the two distinct data files is possible. Additionally, the ERUMS evaluation revealed problems associated with matching the interagency data files. I expect that these findings will be valuable in future matching studies.

A matching study should be the first step in any data-sharing proposal -- before a data-sharing proposal is accepted by the participating agencies, it is essential to confirm the comparability of the data sets and to resolve any conceptual and definitional differences. In my view, the ERUMS project showed that the BLS and SSA data sets are comparable, and that an effective matching operation is possible.

Although there are obvious discrepancies between the data sets -- only 67.1 percent of the EIN records were active in both systems -- significant benefits could be realized through data sharing. First, greater consistency in the industrial classification codes, geographic location indicators, and related data values could be achieved by sharing the data for matched records.
Second, unmatched records could be researched in an effort to ensure the completeness of each of the employer universes. Though numerous issues would need to be explored and settled, such a data-sharing plan could result in greater comparability among the data series.

Currently, the administration has a legislative proposal in Congress that would permit limited data sharing between the Census Bureau and the Bureau of Economic Analysis (BEA). The primary purpose of the proposal is to provide BEA with confidential access to the Census Bureau's establishment information. This information will augment and improve the data on foreign direct investment that BEA collects and publishes. There are other versions of the legislative proposal in Congress to share Census and BEA data -- not only with each other, but, in at least one version, with the General Accounting Office (GAO) and the Committee on Foreign Investment in the U.S. (CFIUS). We are concerned that response rates may decline if our microdata are made available to such policy-making organizations as GAO and CFIUS. For this reason, the Census Bureau does not support this legislative proposal.

The BEA collects foreign-investment data at the enterprise level. The Census Bureau conducted a feasibility study that showed BEA enterprise-level data could be linked successfully with Census Bureau establishment data. By integrating our establishment-level data with BEA enterprise data, BEA will be able to present foreign direct investment statistics at a much finer industry and geographic level. This is one of many possible data-sharing plans that could provide significant cost and qualitative benefits to Federal statistical programs. I would like to believe that the administration's legislative initiative, together with successful match studies such as ERUMS, will provide the impetus for increased data sharing among Federal statistical agencies in the future.
Interagency Agreements for Microdata Access: the ERUMS Experience

Tom Petska's presentation focused on the interagency agreements required to comply with the confidentiality provisions that govern the three sets of data. Clearly, the matching of individual records in the ERUMS project could not take place until these confidentiality issues were resolved.

Tom has presented thoroughly the problems associated with sharing the individual records from different agencies. It is apparent that these legal agreements represented a major barrier in the ERUMS project. To their credit, the ERUMS workgroup was able to overcome the confidentiality problems and to formulate a workable plan -- IRS contracted with BLS to perform the match, and SSA staff were designated as special agents of BLS to process the data. The IRS is permitted to disclose tax information to outside contractors as long as it is for purposes of tax administration, and the ERUMS study was considered to be a statistical study related to the administration of IRS tax laws. Unfortunately, considerable time was spent in determining this solution and in drafting the required legal agreements. This added considerably to the length of the ERUMS study.

Future matching studies may face similar obstacles in gaining access to confidential data. As an example, the Census Bureau obtains the EIN and related data values for many small employer businesses from the IRS. Any future studies undoubtedly will rely on the EIN to match the records, because the EIN is the one key identifier common to U.S. data systems. But as Tom has pointed out, the EIN itself is protected by Internal Revenue Code confidentiality provisions. For this reason, the EIN and related data that the Census Bureau obtains from the IRS cannot be released to other statistical agencies such as the BLS. Only those business records whose EIN and related data have been confirmed through direct respondent contact would be eligible for release.
This would affect the completeness of any matching studies between the BLS and Census Bureau data sets, because a portion of our business universe has not been directly canvassed.

The BLS was permitted access to IRS records in the ERUMS project because of tax-administration purposes. Although additional studies possibly could be conducted using similar arrangements, this would require the support of the IRS and other agencies that furnish the administrative data. Otherwise, future studies may require changes to relevant statutes and regulations before microdata access is authorized. Such changes are difficult to obtain.

I do have one minor point on the paper concerning the confidentiality provisions of the BLS data. The ERUMS study used matched BLS records from only one state -- the state of Texas. Although Tom outlined the disclosure provisions associated with the data records from Texas, it was unclear whether these provisions were typical of the other 49 states. We understand that BLS affords each state certain latitude as to the collection of the unemployment data. If the states also have different confidentiality provisions -- specifically, provisions that strictly prohibit the release of data to Federal agencies other than BLS -- the ERUMS project may not have been possible using records from these states.

One of the goals of ERUMS was to gain experience in the procedure of obtaining access to the confidential data of the various data sets. To this end, the ERUMS study was a success. The study revealed the problems associated with obtaining access to the microdata for matching purposes, and also determined a workable solution that overcame these problems. However, I expect that disclosure problems will continue to be a major obstacle in future matching initiatives.

Sample Selection and Matching Procedures for ERUMS

John Pinkos's presentation focused on the sample selection and matching procedures in ERUMS.
As John has pointed out, a major constraint affecting the sample size was the limited staff time and resources. Because considerable analysis was inevitable for the sampled records, the ERUMS members agreed to select a relatively small sample. As it turned out, 401 cases were selected. By limiting the sample to one state, and oversampling from certain categories of records that were of particular interest, ERUMS was able to create a manageable set of sample records that were sufficient to meet the study's objectives. I expect that future matching studies will benefit from the details of the procedures used in ERUMS.

Three sources of data were used in the study -- BLS data, SSA data, and IRS data. Cases were selected first from the BLS data files and then independently from the SSA data file. Using this technique -- specifically, by selecting independently based on certain digits of the EIN -- the ERUMS sample included records that were present in only one of the two data systems, as well as records that were present in both systems. Records present in only one of the data systems were a critical part of the study, as these represented potential differences in employer coverage between the two data files.

The ERUMS study, however, did not sample from the IRS data set. The IRS data were used only to help analyze the BLS/SSA cases selected in the sample. The IRS file was not included in the sample selection because of the difficulties in gaining access for such a purpose. Although this decision was unavoidable, it may have compromised the results of ERUMS somewhat. The IRS data file represents a complete universe of business employers in 1982 -- all employers who filed payroll tax returns, with no exclusions as to the size of the business or the nonprofit status of an organization, were included on the IRS file. Without this complete file of businesses, ERUMS was left to compare records from the BLS and SSA data sets.
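The independent selection by EIN digits described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that selection uses the last two digits of the EIN; the digit positions and values actually used in ERUMS are not stated here, and the EINs, function names, and chosen endings below are all hypothetical. Applying the same digit rule to each file separately means an employer on both files is selected from both, while an employer on only one file can still enter the sample.

```python
from collections import Counter

def select_by_ein_digits(eins, endings, n_digits=2):
    """Select EINs whose last n_digits are in the chosen set of endings.

    The use of trailing digits is an assumption for illustration only.
    """
    return {ein for ein in eins if ein[-n_digits:] in endings}

def match_status(bls_sample, ssa_sample):
    """Classify each sampled EIN by the file(s) in which it was found."""
    status = {}
    for ein in bls_sample | ssa_sample:
        if ein in bls_sample and ein in ssa_sample:
            status[ein] = "both"
        elif ein in bls_sample:
            status[ein] = "BLS only"
        else:
            status[ein] = "SSA only"
    return status

# Hypothetical EINs standing in for the two agency files.
bls_file = ["741234507", "750000107", "760000299", "741111150"]
ssa_file = ["741234507", "751111107", "760000299", "742222233"]
endings = {"07", "50"}  # hypothetical selection digits

# Each file is sampled independently with the same digit rule.
bls_sample = select_by_ein_digits(bls_file, endings)
ssa_sample = select_by_ein_digits(ssa_file, endings)

status = match_status(bls_sample, ssa_sample)
counts = Counter(status.values())
print(counts)
```

In this toy data one EIN is found in both files, two only in the BLS file, and one only in the SSA file; the one-file-only cases are the ones that would be reviewed as potential coverage differences.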
Although differences were identified and quantified, the study could not make valid estimates of the completeness of the two data sets as compared to the universe of businesses on the IRS file.

A similar point exists for the matching of multiunit records from the BLS and SSA data sets. The ERUMS study showed that about 1 percent of all active EINs were classified as multi unit in one or both systems. Most of these were classified as multi unit only in the BLS system. One of the findings of the study was that the SSA multiunit file is deficient, and steps should be taken either to improve its quality or to discontinue it entirely. Because of the obvious deficiency in SSA's multiunit file, no legitimate conclusions could be reached on the accuracy of the BLS multiunit file.

One last point on John's paper: he discussed briefly the comparison of industry classification and geographic location from the BLS and SSA files. I would have liked to see some general table that presented these results. Even if the results were presented at broad industry and geographic levels, it would have provided some general information on the comparability of these critical data elements.

Results, Findings and Recommendations of the ERUMS Project

The agencies involved in the ERUMS project have gained valuable experience in the technical aspects of linking data files and in the administrative requirements for gaining access to the data. For this reason alone, the ERUMS project should be considered a success.

In addition to the experience gained, the ERUMS project presented several recommendations that will help to improve the business files of the BLS and SSA. I understand that the BLS has already taken several measures to improve the timeliness, completeness, and accuracy of the data in its Unemployment Insurance Address File. Vernon's presentation detailed the recommendations that were identified in the ERUMS study.
In one of the recommendations, he stated that BLS should review the procedures for identifying births in an effort to improve the timeliness of including new employers in the BLS lists. I suggest that the BLS review procedures for identifying deaths as well. Up-to-date operational status is a critical element of business employer records.

The final recommendation in Vernon's presentation covered the need for additional matching studies to acquire information that will support the eventual development of a reporting system to meet the needs of all Federal and State statistical programs. Because of certain legislative barriers -- for example, Title 26 strictly prohibits the release of IRS data to other statistical agencies -- and significant operational problems, such a far-reaching goal may not be plausible in the foreseeable future.

The Census Bureau supports a more achievable goal of data sharing among Federal statistical agencies, and would welcome the opportunity to conduct additional matching studies in an effort to further data-sharing initiatives. Before proposing the Census/BEA data-sharing initiative, we conducted a matching study that confirmed the feasibility and value of linking our establishment-level data with BEA's enterprise-level data. This preliminary study was a necessary step in the Census/BEA data-sharing initiative.

Additional matching studies may promote other data-sharing initiatives in the Federal Government. The ERUMS project, which effectively matched interagency data files, may help provide the impetus for increased data sharing in the coming years. With the necessary legislative changes, pertinent data from each of the employer files could be shared among statistical agencies. Such a data-sharing plan would provide major advantages, including greater comparability among economic data series, less respondent burden on the business community, and a reduction in overall Government costs.
Summary

Comparisons between data sources are beneficial because they highlight conceptual differences and identify the limitations and strengths of the data sets. The ERUMS project successfully met both of these objectives. In addition, ERUMS provided valuable experience in the technical aspects of matching interagency data sets. Our current mission should be to use this experience to further the efforts of data sharing in the Federal Government.

Data sharing offers major advantages to Federal statistical agencies. By supplementing business data sets with applicable information from the data sets of other agencies, the Federal statistical system will attain greater comparability in related economic data series. The ERUMS project showed that interagency data sharing is a viable option. I would like to congratulate the many people who have been involved with ERUMS for a job well done.

DISCUSSION

Thomas J. Plewes
U.S. Bureau of Labor Statistics

I appreciate the opportunity to appear at this public unveiling of the Employer Reporting Unit Match Study (ERUMS) report. This is an event that has been long-awaited by all of those who have been involved in this multi-agency, multi-year, and multi-faceted project. I expect that no participant has awaited this day more anxiously than Warren Buckler, who, along with the folks here at the speaker's table and many in today's audience, has spent a great deal of time over the past few years in conceiving, giving birth to, and nurturing this little study. Indeed, to carry the metaphor further, it is hard to figure out where we stand now on the continuum from project conception to death. Is this session a commencement ceremony, or is it a eulogy? As my commentary will soon indicate, I hope that we are gathered for a commencement ceremony, for the statistical community has learned important lessons about sharing and about the basic quality of two major business lists in this project at some significant cost.
It would be a shame if the lessons learned were not put to use in implementing critically needed program improvements.

I would like to accomplish two objectives in the short time I have allotted as a discussant. First, I want to step back to examine the environmental framework in which this study took place and contemplate the arena into which the report now has been thrust. My second goal is to draw specific conclusions from the exercise and suggest specific steps that should be taken as a result of the work that has been done.

What is the environment in which we must consider this study? It is a complex environment, characterized by:

1. Little sharing of business directory information between Federal government agencies, but a growing pressure to develop procedures for sharing so as to reduce the burden on respondents. These pressures are building to the extent that I believe sharing will surely be mandated. That mandate may come in the form of legislative action, a fiat from the Office of Management and Budget using its authority under the Paperwork Reduction Act, or, of most profound consequence, through a centralization of the statistical agencies.

2. A reliance on lists characterized by their primary usage as administrative data sources, which focus on supporting the administration of the law or function. We have built our elaborate business directory programs and constructed our business survey frames on databases that have been developed with only a distant secondary concern for the statistical uses of the data.

3. Difficulty in separating statistical from enforcement purposes. If we, as statistical agencies, make the data better and create an environment for comparing lists, we enhance their use for enforcement and administrative purposes also. This aspect will be particularly troublesome when we involve, as we eventually must, the Internal Revenue Service in sharing schemes.
The participation of the IRS in the ERUMS process gave us an indication of the lengths to which IRS will go to protect the tax data, and of the difficulties this injected in the ERUMS process.

4. A growing concern over confidentiality of establishment records.

5. A lack of consistency of definitions and coding that extends throughout the statistical system, but has a most profound impact on sharing of administratively-derived lists. Administrative differences in the programs lead to inconsistent definitions of even the most simple of terms, such as "employment", "address", "wages" and the like.

6. An expanding recognition that errors and omissions in the business lists are a significant source of error in the survey process. The Federal Committee on Statistical Methodology's Working Paper 15, "Quality in Establishment Surveys", documented this, and the Tupek-MacDonald paper this morning discussed the effect that the Bureau of Labor Statistics' Business Establishment List improvement project will have on BLS survey quality.

These environmental elements pose formidable challenges to statistical agencies that want to improve the efficiency of their operations and reduce burden on their reporters. For example, in terms of frames for surveys of nonagricultural businesses, there are at present two major government lists -- the Census Bureau's Standard Statistical Establishment List (SSEL) and the BLS Business Establishment List (BEL) -- and one major private sector list -- the Dun & Bradstreet file -- with a myriad of lesser known and more specialized lists for more limited purposes. We can look at the SSEL as a representation of the SSA/IRS administrative data files with considerable value added by the Census Bureau. Likewise, the BEL may be seen as a representation of the State unemployment insurance files with considerable BLS value added.
If these Federal government files do not match, and we suspect they do not through analysis of the macrodata, the problem can be with the basic administrative data files, with the value added, or both. Over the years, Fritz Scheuren's various administrative database comparison projects have documented the systemic differences in the files very well. They must be borne in mind.

Fixing the files once we have identified the root difficulties is quite another matter. The statistical agencies do not own them, and they are exceedingly expensive to change (in terms of budget and response burden). Indeed, quite often only a revision in law or nationwide program practice will do the trick. Fixing the "value added" portion is somewhat more possible, but it too is expensive in terms of budget and people. Often there are good reasons for not fixing the way we add our value, such as the need to assure the continuity of historical data series.

Definitions are another challenge. If we want to share lists, we must think in terms of three types of problem. In some cases, repair is relatively simple. We heard today, for example, that our definitions of multi-unit employers are already in close proximity. The EIN and SIC systems are also bedrock. Our challenge in those instances where there is close concordance between the files is to maintain the definitional base in a standardized, current and relevant manner.

In other areas, we must change the way we do business but, if we are willing, our task will be reasonably easy. One match problem that ERUMS identified was that the project was comparing annual SSA reporters with 1st Quarter UI reporters. This is one of the problems that we can fix with time and resources, because the data are there.

In a few important other cases, however, we are quite limited in our ability to bridge definitional gaps.
For example, when coverages are based on Federal laws, State laws, and judicial precedent regulating the administrative database, we would be forced to justify a change in the insurance or tax program on statistical grounds.

Certainly, confidentiality concerns have a presence in the equation. We glimpse in the Petska-Alexander paper the importance that necessary confidentiality protection schemes had in this project, and the price those schemes exacted in terms of time and precision. That's one of the reasons I like the Petska-Alexander paper so much. It outlines the practical implications of maintaining a pledge of confidentiality when cooperating on a project of importance to the statistical agencies. Everything, as they so well point out, had to be invented. There are no textbook examples of interagency agreements on confidentiality. The solutions which the project team developed were carefully crafted to stay within the very restrictive IRS law and were implemented with an eye toward the reality of the environment.

Thus, there are really two stories in the Petska-Alexander paper. One story is about the difficulties that the team encountered in sharing confidential data. The other, written between the lines, is about the sense of cooperation and dedication that allowed the cumbersome solutions to move forward.

The Petska-Alexander paper starkly reminds us that the role of confidentiality policy is important but little understood. We may be hopeful that the current situation will be short-lived. The National Academy of Sciences' Committee on National Statistics has taken on these issues with the formation of an expert panel. Until we are able to benefit from that report, however, we are left with the fact that understanding of confidentiality of business records has not progressed very far as either science or practice.
Only recently has a literature on the subject of confidentiality begun to emerge, but most of it addresses the more emotional topic of confidentiality of information about individuals. The literature pays little attention to issues surrounding confidentiality of business records. Without such a foundation, the statistical agencies have mostly assumed that the issues of confidentiality of business records are the same as those for individuals. This assumption has played an important role in justifying past limits on sharing between the Federal agencies.

The second paper, by Einstein, Levasseur, Packman, and Pinkos, also attempts to stand back with benefit of hindsight and make some sense out of what was a convoluted process. Since 3 of the 4 authors work with me, these comments may not be as critical as others may have rendered, for all along the way I "bought in" to the approaches taken and the effort expended. Nonetheless, I view the documentation that this paper offers in a somewhat different light than the authors, and draw slightly different conclusions.

The matching process, as described, makes a good deal of statistical sense. The team selected a two-stage sample selection process, stratified into 9 groups. The second phase, a subset of about 400 cases of the first selected on a probability basis, provides for detailed analysis. Some of the specific steps in the process were to meet the confidentiality restrictions, but not all.

The process that the team established should serve as a first step toward developing an on-going statistical process control system, if and when sharing does take place. Many of these same activities should be continued in a recurring program to meet the objectives of total quality management. Thus, the work of the team has long-term, permanent implications. The authors seemed to recognize this when they stated that "we believe future projects of this kind will benefit from the availability of this detailed road map".
Probably so, but I speculate that future researchers will look at the road map and decide against making the journey. That is why I would take pains to separate the enduring aspects that should be the foundation of a quality management system from those that were necessary to meet more bureaucratic objectives.

The contribution of the Renshaw-Jabine paper is to yield some hope, in that it reminds us how close we are to an ability to share, while providing some sober reflection about some major tasks still lying ahead if we are to share. Their bottom line is that the systems are reasonably close in coverage -- eventually most employers emerged in the systems. There were troublesome differences in multi-unit identification, in county coding, and in industrial classification at the 2-digit level, but I would label these of moderate concern. Indeed, under the BEL initiative, BLS has taken steps to correct many of the inadequacies in its data, investing with the States in improving SIC coding, interpretation of SICs, and, more recently, in fixing the multi-establishment identification problems. Unfortunately, with lack of resources, the Social Security Administration has not been able to make the same investment, so many of the difficulties in the SSA file may have multiplied.

In summary, we ought not let this expensive experience lie on the shelf. We have learned a great deal about two files -- lessons that should be extended to files maintained by the Bureau of the Census. And we need to get on with fixing some of the obvious flaws in the administrative data. Most importantly, we have learned that maintaining confidentiality is possible, that matching is feasible, and that the will is present at the staff level in the agencies to make it all come together. Now it is time for leadership. As Senator Bennett Johnston said in an argument before Congress, "There's a time to stop talking the talk and start walking the walk." We have the map. Let's start walking.
Session 10
APPROACHES TO DEVELOPING QUESTIONNAIRES

TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING QUESTIONNAIRES

Theresa J. DeMaio
U.S. Bureau of the Census

As the collection of information through surveys becomes more prevalent in our society, increasing numbers of people find themselves in a position to develop questionnaires. Writing a questionnaire seems like such a simple task -- many people think that anyone, even without training or experience, can do it. But developing a good questionnaire -- one that can obtain good-quality information that meets the objectives of the survey -- is not as easy as it looks. Many different kinds of abilities, including subject matter expertise, writing capabilities, and knowledge of social psychological principles, are necessary to develop a simple, cohesive questionnaire in which the questions are clearly worded.

Developing a good questionnaire is not a solitary task -- simply a matter of sitting down at your desk for a few minutes or even a few hours. There are a number of procedures that can be used to involve potential respondents in content or question development, and to test and evaluate questionnaire drafts before they are finalized.

The purpose of Statistical Policy Working Paper #10, Approaches to Developing Questionnaires, is to provide practical information about these methods. The report contains descriptions of 11 different techniques, which can be used at various stages of questionnaire development. The report is structured in three parts: tools to develop questions, procedures for testing the questionnaire draft, and techniques used to evaluate the questionnaire draft. This structure was somewhat artificially imposed for ease of presentation in the report. In fact, there is no one ideal way to go about the process of developing a questionnaire.
Depending on a number of factors, such as whether you're working from scratch or from an existing questionnaire and how much time and funding are available for survey development, these techniques can be used in many different combinations. In terms of improving the content of a survey questionnaire before it goes out into the field, the important thing is that testing and developmental work be conducted, not necessarily that it be done according to the structure presented in the report.

Having made this disclaimer, I am nevertheless going to discuss the techniques that are presented in the first two sections of the report -- that is, tools for developing questions and techniques for testing the questionnaire draft. I'm going to generally describe the methods contained in the report, and mention some additional techniques as well.

Developing Questionnaires

Part I of the report describes three tools for developing questions. The report presents these methods as useful in developing new questionnaires. I'd like to expand on this a little and suggest that these techniques can be used in the early stages of questionnaire development of any survey. Most surveys are conducted more than once; subsequent rounds of data collection begin with an existing questionnaire draft that is subject to revision. These later rounds each have early stages of questionnaire development, complete with an existing questionnaire draft. In these cases too, the methods described in Part I of the report may be appropriate.

Unstructured Individual Interviews

Unstructured individual interviews are one-on-one conversations between a researcher and a member of the population for the survey or proposed survey. I use the term "conversations" because the discussion is unstructured; rather than having a set of specific questions, the researcher uses a topic outline and collects information on various aspects of the topics in whatever order, and using whatever terminology, the respondent suggests.
Respondents may also bring up additional issues related to the general topic, which might be incorporated into the topic outline for later interviews. The goal is an unstructured setting in which the researcher finds out how the respondent perceives the topic of interest, what terminology the respondent uses to talk about the topic, and whether the respondent is knowledgeable and able to provide information on the topic. By working from a blank slate, the researcher is not constrained by the content and terminology of an existing questionnaire, and the true frame of mind of the respondent is more likely to surface.

Qualitative Group Interviews

Many of you may be familiar with qualitative group interviews under a different name, such as focus group interviews, group depth interviews, or focused discussion groups. Essentially these are unstructured interviews with a group of respondents rather than a single respondent, led by a group moderator. About 8 to 12 people participate in a group, and the moderator uses a topic outline to guide the discussion.

Qualitative group interviews are used for many research purposes other than questionnaire development. When used to assist in questionnaire construction, the goal is the same as the goal of unstructured individual interviews -- to elicit the terminology used by respondents in thinking about the topic in question, to determine aspects of the topic that respondents consider important, and to get a reading on how respondents react to aspects of the topic that survey planners consider important.

The difference between qualitative group interviews and unstructured individual interviews is, obviously, the group setting: the diversity of opinions held by group members may stimulate interaction among them that elicits more information than could be obtained through interviews with each member separately. In order for these groups to be successful, however, the ability of the moderator is an important consideration.
The idea is to stimulate discussion among all the participants and to avoid domination of the discussion by some people who may be more vocal than others.

Participant Observation

Participant observation is a technique that is used as an independent method of data collection, as well as a tool for questionnaire development. It has been extensively used around the world. The basic elements of the technique are suitable for questionnaire design purposes, especially in developing questionnaires for use by members of other cultures or subcultures living within our own country. For example, the homeless population is a subculture that is currently the object of much interest, and for which the use of participant observation techniques is relevant. Indeed, these techniques have been successfully used in research on homelessness being conducted at the Census Bureau.

There are several distinguishing characteristics of participant observation research. First, the researcher must speak the respondents' language. This is not limited to English as opposed to a foreign language, but also refers to dialects, slang, or professional jargon. Second, the researcher associates with the members of the community he or she studies and engages in their activities. Ideally the researcher lives among the respondents; at a minimum, he or she develops contacts in the community over a long period of time.

The participant observer may also use the ethnographic interview technique during the course of his or her research. This involves using unstructured interviews (the methodology I previously described) with "key informants." These are members of the community who are willing to talk at length with the researcher or introduce the researcher to other community members.

From this brief description, it should be obvious that participant observation is not a methodology that a person can "pick up" by reading an introductory textbook.
The expertise required in the use of this technique dictates the involvement of trained ethnographers. While that may limit its use somewhat among U.S. statistical agencies, there are several ways it can be incorporated in a project. First, participant observation can be conducted as part of a project by trained anthropologists hired to serve on the project staff. In the homeless project I referred to a moment ago, we hired an anthropologist to work with a survey methodologist, and this combination has worked out very well. A second way to make use of this technique is to consult with ethnographers who have prior experience among the culture of interest, and take advantage of this previous experience rather than conducting original fieldwork. This could be done either by hiring the person on staff or doing it on a consultant basis.

Think Aloud Interviews

Another technique suitable for the early stages of questionnaire development has gained in popularity since the Working Paper was completed in 1983. This is the think aloud interview. Also referred to as protocol analysis, this method is an extremely valuable source of information about how respondents understand the survey questions put to them, and how they go about answering the questions.

The purpose of the technique is to get respondents to talk out loud and verbalize their thoughts as they respond to questionnaire items. The data of interest here are respondents' reactions to the items, their thoughts as they formulate answers to the items, and what decisions they make in answering the questions.

Use of the technique requires a questionnaire draft. Since the results of these interviews are crucial to the questionnaire development process, the person doing the interviewing is generally a researcher or questionnaire designer.
For interviewer-administered surveys, the questioner first explains to the respondent that rather than just answering the questions, he or she should actually think out loud -- that is, say what he/she is thinking as he/she answers each question. Respondents differ in their ability to verbalize their thoughts, and some may require a bit of probing to uncover how they arrive at the answer to a question. At times it may take skillful questioning to probe completely what is on a respondent's mind. The interviews are generally tape-recorded (with the respondent's permission), since it is difficult to take notes and concentrate on probing the respondent's answers at the same time.

This technique can also be adapted for self-administered interviews. In this case, the questioner is basically an observer. The respondent is instructed to complete the questionnaire, reading the questions and instructions out loud as well as verbalizing the responses. I've done quite a few of these interviews, and they really are quite helpful in detecting layout problems (not noticing skip instructions, etc.) in addition to uncovering problems with the questions.

This technique is used with relatively small numbers of respondents. Ten or fewer think aloud interviews provide large amounts of information and can uncover systematic misinterpretations or other problems. Use of the technique is an iterative process -- once the questionnaire designer conducts five to ten think aloud interviews, problem areas will generally surface. Then, after revisions to the questionnaire are made, additional interviews can be conducted to detect problems with the revisions. Or alternatively, some other method can be used for the next round of questionnaire development.

Testing Questionnaires

Whatever methods are used to develop a questionnaire draft, it must be subjected to testing before it can be used in the field.
There are a number of ways that this can be done, involving various levels of time and effort. Part II of the report mentions three techniques: informal testing, pilot studies, and split sample testing. I'll describe each of these briefly and also add another selection to the menu.

Testing Multiple Questionnaires

In the questionnaire testing phase, the content of the questionnaire may be pretty much set except for fine tuning, or substantive questions may remain about how best to ask about a topic. When the latter is the case, options that involve the testing of alternative questionnaires should be considered.

Experimental Group Session

The experimental group session is a small-scale method of testing alternative questionnaire versions, applicable only to the development of self-administered questionnaires. It may be conducted with respondents who are selected for their demographic characteristics and are not representative of any larger population.

In an experimental group session, respondents come to a central location (usually a large room containing tables or desks) for the purpose of completing a questionnaire. A group session is held, and 20-30 respondents participate at one time. The session is experimental, since more than one questionnaire version is randomly administered. A moderator conducts the session, and questionnaires are randomly distributed to the participants. After the questionnaire has been completed, a debriefing form may be administered to collect additional information about how the respondent interpreted specific questions. Multiple sessions are conducted until the total number of respondents is large enough (about 500 or so) to facilitate statistical comparisons of the responses to the alternative questionnaires.

This methodology does not duplicate the response situation of a self-administered questionnaire where the respondent receives a form in the mail and returns it that way as well.
For one thing, the respondent in the group session does not have access to other household members or to personal records, which may be necessary to answer some items. For another thing, once a respondent has joined the group, he or she generally completes the questionnaire and turns it in, while at home the questionnaire might remain unanswered.

Despite these limitations, however, the methodology has definite advantages in the early stages of questionnaire development. It takes a relatively short time to arrange and conduct the sessions. If the statistical analysis is conducted quickly, it can provide rapid feedback about large differences in response to the alternative questionnaires, for use in later revisions of the survey instrument.

Split Sample Testing

In some situations, the questionnaire designer needs a large sample of respondents and a more formal test of different question wordings, question concepts, or methods of categorizing responses. This is particularly important in developing a major new survey instrument (such as SIPP when it was introduced several years ago), or in revising an existing questionnaire such as the decennial census form. When the nature of the survey requires large-scale testing of different versions of a questionnaire, the vehicle of choice is the split sample test.

Split sample testing, also referred to as split ballot or split panel testing, involves the use of multiple questionnaire variants, each administered to a portion of the sample. The entire questionnaire need not be different, but the alternative questionnaires should contain different versions of the items that are the focus of the test. In fact, the questionnaires should not contain too many differences, since all the variations in a questionnaire can affect response. One way to deal with this issue is to limit the number of questions that are tested.
Another might be to use automated data collection methods such as CATI or CAPI, which provide the means to randomize several experimental series of questions with respect to each other.

Once the content of the questionnaires is established, the alternative questionnaires are randomly distributed among the sample population, to decrease bias due to factors other than those being tested. The procedures for data collection are basically the same as for a survey containing a single questionnaire, with one exception: control procedures must be established to ensure that each sample case is assigned to the proper treatment group.

In a split sample test, the responses to the question variants that are the focus of the test are of keen interest. Thus, statistical analysis of the data is an important aspect of the evaluation. In addition, observation of interviewers and interviewer debriefing can be used for a personal interview survey, and information gained from these methods can help inform some of the statistical results of the analysis.

Testing Single Questionnaires

The final two techniques I would like to mention involve testing of questionnaires that are ready for "fine tuning." That is, major uncertainties about the content of a questionnaire do not exist, although changes in the instrument may be recommended based on the results of a field test.

Informal Testing

Informal testing, as its name implies, is a relatively casual method of evaluating a questionnaire. It is relatively small in scope, involving between 50 and 300 interviews. The cases for interview are selected purposively, rather than through any kind of systematic sampling methods. This may be accomplished by selecting participants from a broad range of subpopulation groups, in the case of a test for a national survey, or limiting the participants to narrow population segments, if, for example, a survey of food stamps or social security recipiency is being tested.
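Returning briefly to the split sample test described above, the two mechanical pieces of that design (random assignment of sample cases to questionnaire versions, and a statistical comparison of the responses) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the version labels "A" and "B", the fixed seed, and the choice of a Pearson chi-square statistic are all hypothetical and are not taken from the report.

```python
import random
from collections import Counter


def assign_versions(case_ids, versions=("A", "B"), seed=1991):
    """Randomly assign each sample case to one questionnaire version.

    Hypothetical helper: cases are shuffled and the versions are then dealt
    out in rotation, so group sizes differ by at most one case.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced and audited
    shuffled = list(case_ids)
    rng.shuffle(shuffled)
    return {case: versions[i % len(versions)] for i, case in enumerate(shuffled)}


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of counts, one row per questionnaire version
    and one column per response category. Every row and column total is
    assumed to be greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

A simple control check in this sketch is to tabulate the assignment, for example with `Counter(assign_versions(range(1000)).values())`, and confirm that each version received its intended share of cases before fieldwork begins. The statistic returned by `chi_square` would then be compared against a chi-square table for the stated degrees of freedom.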
The informal nature of the test also carries over into the evaluation system. While some basic quantitative information is calculated from the questionnaire responses, such as item nonresponse rates and the number of "don't know" responses, most of the evaluative information is based on observational feedback. There are several ways of obtaining this feedback. Observers, including the questionnaire designers, can accompany interviewers in the field, or in the case of telephone interviews, the interviews can be tape-recorded. Specially-designed evaluation forms can be completed by interviewers and/or observers. Also, interviewers and observers can be debriefed after the interviewing is completed. Most of the information collected through these methods is subjective, based on the impressions of the staff present at the interviews.

The informal testing procedure also allows unstructured discussion with the respondent at the end of the interview. In response to probing by the interviewer or observer, the respondent can provide information about his or her problems with the questionnaire, the meaning of specific items on the questionnaire, or other items of information. Observers who are involved in the survey as questionnaire designers, subject matter specialists, or sponsors may use their background knowledge to guide their probing and obtain useful information for evaluating the questionnaire.

Pilot Studies

In contrast to an informal test, pilot studies are much more formal and are conducted on a larger scale. Pilot studies are generally conducted further along in the questionnaire development cycle, and the goal is to duplicate the final survey design from beginning to end. This includes data collection from a larger sample, scientifically selected to represent the survey universe, and execution of data processing and perhaps tabulation procedures as well.
Needless to say, this is a lot more time-consuming than an informal test, and is not attempted until the questionnaire is in a final or nearly final state. Where the informal test seeks to uncover problems with terminology and question interpretation, at this point the questionnaire design issues of concern relate to how well the survey instrument performs in conjunction with the other aspects of the survey -- for example, errors in key codes or problems with response range categories for numerical data.

Evaluation of the results of a pilot study is much more quantitative than the analysis of an informal test. In a pilot study, the data capture, editing, and imputation programs are run and, to the maximum extent possible, the data analysis plan is executed. This tests all the software developed for the survey and checks to see that the various stages of data processing are properly coordinated. Frequently, time constraints limit the amount of analysis conducted on pilot study data; however, the more effort is expended at this stage, the less likely you will be to find surprises when the survey is actually fielded.

In addition to this formal evaluation of a pilot study, some less rigorous evaluative tools are also used. Observation of interviewer training sessions is generally conducted and modifications to the training are suggested, as necessary. Also, observers may accompany interviewers for a personal visit survey, and both interviewers and observers are debriefed.

Another use of a pilot study might be to phase in a new questionnaire in a continuing survey. Rather than adopting the new questionnaire wholesale, overlapping samples can be designed, in which a portion of the respondents receive the new questionnaire and the rest receive the old one. The purpose here is not to test a questionnaire, but to collect information about the alternative questions and measurement strategies.
The goal is to calibrate the old and the new questionnaires, to provide quantitative information about differences in response, which might affect the time series for the survey.

Discussion

The descriptions I've presented demonstrate the wide range of options available in the questionnaire development process. The intent of the report was not to suggest that each of these should be used in the development of a single questionnaire. Rather, we wanted to familiarize questionnaire designers with the techniques, to encourage their use, and to promote the value of testing questionnaires in general.

As I said at the beginning of this paper, there is no one ideal procedure to follow in preparing to field a survey instrument. It is generally best to start by talking to respondents, with or without a draft questionnaire, to find out if the vocabulary and the intent of questions are understood. Think aloud interviews, unstructured individual interviews, or other techniques that involve in-depth one-on-one discussions with respondents are extremely helpful here.

These techniques are not limited to the earliest stages of questionnaire development, however. A draft questionnaire can be revised based on think aloud interviews, used in a field test, and the revised version also used in additional think aloud interviews. It is an iterative process that can continue as long as you find problems that need fixing.

Similarly, there is no magic formula for field testing an instrument. An informal test followed by a pilot study might be warranted based on the characteristics of the survey. Perhaps a series of informal tests might be considered for some complicated surveys. Or informal tests might be followed by a split sample test. Depending on the circumstances of a particular survey and the time and budget allowed for survey development, many possibilities are available.
The important point is that testing facilitates problem detection, and fixing problems in a questionnaire will improve the quality of the data that are obtained.

TECHNIQUES FOR EVALUATING THE QUESTIONNAIRE DRAFT

Deborah H. Bercini
National Center for Health Statistics

This paper reviews the section of the report, Approaches to Developing Questionnaires, called "Techniques for Evaluating the Questionnaire Draft." How do these techniques differ from "Tools for Developing Questionnaires" and "Procedures for Testing the Questionnaire Draft" described in earlier sections of the report? In many cases, they do not. When there is a lot of time before a survey goes into the field, a particular method might be described as a tool for developing the questionnaire. When time has run out or the survey is already in the field, the same method would be referred to as a technique for evaluating the questionnaire draft. It doesn't really matter. In fact, the beauty of some of the techniques covered in this paper is that they can be adapted for use anywhere in the questionnaire design process.

What are these techniques? At first glance, the chapter headings shown in Figure 1 represent an apparently unrelated assortment of methods. However, there is a common thread. What links these techniques is that each uses an external source of information to evaluate the performance of the questionnaire. In this case, "external" refers to data that originate outside of the answers to the questionnaire items themselves. The first three techniques rely on the insights of the survey participants, that is, respondents and interviewers. Next, observer evaluations are provided by an outsider who is not part of the question-response process. The last technique, record checks, steps even further away from the interview or data collection situation by comparing questionnaire responses to an independent criterion, usually administrative records.
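The record check idea just described, comparing each questionnaire response with an independent record for the same case, can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the function name `record_check`, the use of a simple case identifier to link the two sources, and exact equality as the agreement criterion. Real record check studies also have to deal with matching error and with errors in the records themselves, which this sketch ignores.

```python
def record_check(survey, records):
    """Compare survey answers with administrative records, case by case.

    Hypothetical helper: both arguments map a case identifier to a reported
    value. Only cases present in both sources can be checked.
    """
    matched = sorted(set(survey) & set(records))
    disagreements = [case for case in matched if survey[case] != records[case]]
    agreements = len(matched) - len(disagreements)
    return {
        "checked": len(matched),                        # cases found in both sources
        "unmatched": len(set(survey) - set(records)),   # survey cases with no record
        "agreement_rate": agreements / len(matched) if matched else None,
        "disagreements": disagreements,                 # cases to review individually
    }
```

The list of disagreeing cases is kept alongside the summary rate because, in a questionnaire evaluation, the individual discrepancies are usually what point to the question wording or concept that needs revision.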
Approaches to Developing Questionnaires predated the emergence of laboratory or cognitive evaluation methods. The cognitive approach has provided a theoretical framework for understanding and reducing many kinds of response errors. Although this framework was not in place in the early 1980's, some of the techniques that follow are similar, if not identical, to those used in today's labs.

Figure 1

Technique                                     Source
Frame-of-Reference Probing                    Respondents
Response Analysis Surveys                     Respondents
Interviewer Debriefings and                   Interviewers
  Interviewer Questionnaires
Observation and Monitoring                    Survey designers, etc.
Record Check Studies                          Administrative records

Frame-of-Reference Probing

Frame-of-reference probing is such a technique. This method evaluates the questionnaire by probing for how respondents understand key concepts, terms, definitions and instructions. The probes may be in the form of structured questions developed before the interview, or ad hoc, spontaneous questioning by an interviewer.

Although this technique could be applied at any stage of questionnaire development, the report deals with it primarily in the context of field testing and the survey itself. The probes can be inserted after selected survey questions or they can be grouped at the end of the interview. When probing is unstructured, it is usually done by the survey researchers or questionnaire designers because of their greater insight into question objectives. Standardized, structured probes can be administered by a field interviewer as part of the data collection process.

The use of frame-of-reference probing requires some planning. The first decision concerns when in the development process to probe. While probing will yield useful information at any time, clearly it will have the most impact when it is done with early drafts. If there are problems with fundamental concepts and key terms, it makes sense to detect them soon enough to work on a solution.
If it turns out that respondents have difficulty with an entire topic or questionnaire approach, then modifications might be needed in the data objectives, not just question wording.

Structured probes, of course, need to be developed in advance. These may range from general, all-purpose questions such as "What does so and so mean to you?" to more specific individualized probes. Even when probing is going to be of the unstructured or ad hoc variety, a plan for which questions and which terms to probe is advisable. In a field setting, respondents' time is limited and so is the number of probes that can be asked. Based on previous testing, researchers are likely to have some notion of potential trouble spots to cover. This information can be used to develop a protocol which specifies the criteria for probing.

No single evaluation technique is comprehensive or perfect. Each has its strengths and limitations. The particular strength of probing is that it can identify problems in the questionnaire that are often missed by methods that rely on respondents giving overt indications of difficulty. Consider the question, "During the past year, have you had pain in the abdomen?" This question was tested in the Questionnaire Design Research Laboratory at the National Center for Health Statistics. Most laboratory respondents answered it readily with a "yes" or a "no." It was not until interviewers probed for how respondents interpreted the term "abdomen" that it was discovered that very few respondents knew exactly where their abdomens were [1].

Probing also has the potential for identifying the underlying causes of response problems, not just the fact that a problem exists. Returning to the example, the problem was variable interpretation of a key term. The underlying cause was lack of knowledge. When the cause of a question problem is understood, the solution is likely to suggest itself.
In this case, the solution was a respondent flash card that showed an outline of the torso with the abdominal area shaded in.

Skilled probing is cost-effective. It can unearth quantities of information on how the questionnaire is working in a relatively short time. However, exclusive reliance on standardized probes tends not to produce very useful insights. Specialized probes require more time to develop but yield more valuable results. Also, if no ad hoc probes are used, unanticipated problems will be missed. The results of unstructured probing are, of course, subjective and anecdotal and require some skill to interpret.

Today, probing is one of the primary tools of the cognitive or laboratory approach to questionnaire evaluation. But its use is not limited to comprehension issues. Question flaws related to vague concepts or unfamiliar terms are certainly common, but they are by no means the only sources of response error. Probing techniques can be used effectively to detect question problems that affect many components of the response process. These include recall, estimation, judgement, decision making and motivational factors [2].

Probing and other intensive interviewing methods have now evolved into a major and separate phase of the questionnaire design process that usually precedes the field testing phase. This approach gives questionnaire designers freedom to explore the response process in depth. Constraints on probing are significant when it is done during a field interview that was designed for another purpose.

Response Analysis Surveys

The response analysis survey (RAS), like frame-of-reference probing, evaluates the questionnaire from the respondent's perspective. The report describes it as a technique used to evaluate mail surveys, especially mail surveys of establishments. In effect, the response analysis survey is a survey about a survey in which personal interviews are conducted with a sample of mail survey respondents.
Interviewers administer a structured questionnaire which asks questions about how respondents would go about answering the mail survey questions. The typical RAS would collect data on how establishment records are maintained, what kinds of information they contain, how difficult it is to retrieve this information and so on. It also attempts to find out whether respondents can understand what is being asked of them, their willingness to provide the information and other aspects of respondent burden.

If the RAS is being conducted to evaluate an on-going survey or to prepare for the next cycle of a periodic survey, researchers use available data on response errors as a guide when developing questions for the RAS. Data collection and analysis proceed as in a regular survey. The results are interpreted and then used to redesign the questionnaire for the next mail survey.

The strengths and limitations of the RAS parallel those of frame-of-reference probing. Like probing, the RAS asks the respondent to analyze the response task, and in doing so both techniques are capable of detecting covert response problems and their underlying causes. And the RAS, like probing, loses some of its potential when only structured probes are used. On the other hand, a formal response analysis survey will produce valuable, objective data that will reliably indicate where questionnaire revisions are needed.

Variations or adaptations of the RAS concept can be used to evaluate any self-administered questionnaire, not just mail questionnaires for establishment surveys. Laboratories at several government agencies test self-administered questionnaires using a combination of observation, think-aloud, and structured and unstructured probing methods. The General Accounting Office, for example, uses an interesting combination of observation and laboratory-style probing.
They watch respondents complete the questionnaire, noting all kinds of non-verbal behavior such as sighs, grunts, head shaking and other signs of impatience, skipped questions, and so on. Afterwards, the interviewer goes back to each of the questions that provoked the reaction and asks the respondent to elaborate.

Learning from Interviewers: Interviewer Debriefings and Questionnaires

No assessment of interview survey questionnaires is complete without the interviewer's input. The report chapter, "Learning from Interviewers," presents two techniques for gathering information from interviewers - interviewer debriefings and structured post-interview evaluations. The latter are questionnaires completed by interviewers about various features of the survey questionnaire. Either method can be employed with pretest questionnaires or during an on-going survey.

The interviewer debriefing session is a forum in which interviewers can relate their experiences in administering a questionnaire or data collection procedure. The scope and formality of debriefings can vary with the scope and formality of the testing operation that they accompany. In large field tests involving many interviewers, a single comprehensive debriefing is usually held when interviewing is completed. With more informal testing, it is possible to conduct multiple debriefings throughout the pretest period. Questionnaires can be revised on the spot, tested, and then revised again.

Interviewer questionnaires can take several forms also. They can be directed to a specific issue or problem, such as nonresponse. Or they can consist of questions designed to get at suspected difficulties with particular survey items. Another format is a questionnaire made up of standardized ratings that interviewers apply to each survey item.

Any survey planner who ignores what interviewers have to say about the questionnaire is taking a great risk.
Interviewers are in the best position to comment on how the questionnaire and other survey procedures affect respondent cooperation. For the most part, interviewer performance is judged on response rates and completion rates, so interviewers will naturally be sensitive to factors that affect performance in these areas. Interviewers also have excellent insights into the logistics of questionnaire administration and are quick to spot things that impede the efficient flow of the interview. If survey planners, for whatever reason, do not heed interviewers' major objections, they will pay a price. A questionnaire that interviewers find unnecessarily difficult to administer will lead to poor interviewer performance, and therefore, lower data quality.

There are also limitations to what can be learned from interviewers. Interviewers are often more adept at making bad questions work than they are at finding flaws, unless the question is so bad that the interviewer can't figure out how to ask it and the respondent can't or won't answer it. The interviewer's job is to get a response and they are good at it. They are less likely to notice subtle problems with question wording, interpretation, and so on, as long as the respondent gives a codable response.

Reliance on post field test interviewer debriefings to detect question problems is a poor evaluation strategy. Only those problems that visibly disrupt the interview will be mentioned. And as in any group situation, the most vocal will dominate. It can be difficult to achieve consensus as to what the problems are because interviewers can have varied experiences depending upon the sample cases they have interviewed. Using interviewer evaluation questionnaires to supplement the debriefing can compensate for some of these drawbacks.

Getting interviewer input need not be confined to the field situation. It is possible to ask interviewers to evaluate draft questionnaires before they are field tested.
This can be done in a laboratory setting with "real" respondents or with researcher respondents. Although subject to some of the same sorts of limitations mentioned above, it may be possible for interviewers to identify some flaws in this way. It could be especially useful to ask interviewers to try out questionnaires that have been adapted to new data collection modes, such as CAPI.

Observation and Monitoring

Observation of face-to-face interviews, or monitoring of telephone interviews, evaluates the questionnaire from the perspective of a third party. Observers are usually people involved in the survey planning process, from sponsors and subject matter experts to questionnaire designers and data analysts. At NCHS, observation usually takes place during field pretests, but the same methods could be used to evaluate on-going surveys.

Most often, an observation program provides a qualitative, subjective assessment of questionnaire performance and related communications. An infrequently used but more objective approach is known as behavior or interaction coding, in which standardized codes are used to evaluate question performance.

In all cases, observers need some preparation for their task. Attending the interviewer training helps, as does some coaching on specific situations or problems to watch out for. Observer forms can be useful, provided observers are not so busy recording minor detail that they miss more significant interactions. At the end of the testing period, observers may submit a written report summarizing their experiences, participate in a debriefing session, or both.

Sometimes observers can become active participants if they are also there as frame-of-reference probers. At the other extreme are the behavior coders, who usually work from taped interviews.

The Survey Research Center at the University of Michigan has recently completed a study on pretesting techniques, among them, behavior coding.
The goal of the study was to test techniques that would enhance the usefulness of the traditional field pretest. The Michigan study used the coding scheme shown in Figure 2 to identify interviewer and respondent behaviors that are symptomatic of problem questions. Trained coders listen to taped interviews and apply the appropriate codes to each question. The numbers and types of codes for each question are then tallied. The benefit of this technique, according to the study, is that it can provide objective indicators of flawed questions [3].

Figure 2

Interviewer Behavior
  Reads question with slight changes
  Reads question with major changes or does not complete it

Respondent Behavior
  Interrupts question with answer
  Requests clarification or repeat of question
  Gives qualified but adequate answer
  Gives inadequate answer
  Gives "Don't Know" answer
  Refuses to answer

What are the strengths of the observation technique? It can readily be incorporated into existing pretest plans. Third party observers can interpret the interviewer-respondent dynamic in a way that the participants cannot. And for many survey sponsors and planners, actually seeing how the questionnaire performs in the field is the most convincing evidence that changes need to be made.

The limitation of the observation technique is evident from the name - it can only detect observable questionnaire problems. And when a problem is observed, the underlying cause may not be obvious. Individual observers will have limited experience with the questionnaire unless they have observed a great many interviews. Therefore, agreement on what the problems are may be difficult to achieve.

Record Check Studies

Record check studies are used not so much to evaluate question wording as to evaluate the validity of the data that is produced by the questionnaire as a whole. The accuracy of responses on a particular topic is checked against an independent criterion, usually administrative records.
For example, data from a health survey that asks questions on doctor visits could be compared to respondents' records maintained by their health care providers. In these studies, it is assumed that the administrative record represents the "truth." Respondent reports that do not correspond to record data are counted as errors, and a high error rate would indicate that something is wrong with the questionnaire or the approach to data collection.

Record check studies pose numerous logistical challenges. One needs to obtain the cooperation of a records source. Preservation of confidentiality is often a problem. The structure and quality of the record system need to be studied. Is it adequate? Matching criteria must be developed. Does the record system support the level of matching that is desired? Matching questionnaire data to record data is invariably more difficult than anticipated, and many discrepancies have to be resolved. Finally, the results require thoughtful interpretation. What are the implications for the questionnaire? What is it about the questionnaire or other aspects of the survey design that is contributing to response errors?

Record check studies can provide objective evidence that a questionnaire is collecting the information it is designed to collect. However, they can only be used to evaluate questionnaires on topics for which independent records are available. Clearly, there are many types of human behavior that interest researchers for which no records exist.

Compared to the other evaluation methods described above, record checks are relatively time consuming and costly. But the costs have to be weighed against the benefits. For large, expensive surveys where data precision is critical, evaluation by record check would make sense.

Several variations of the record check study are possible. Seeding the pretest sample with cases known to possess a target characteristic is a scaled down version of a record check study.
It is not too difficult to implement if the characteristic is a simple one and if it is not highly sensitive (having arthritis versus having AIDS, for example). Methodological studies can use other validation sources besides administrative records. Some possibilities are respondent diaries, data collected from other family members on the same topic, biochemical markers, medical exams and so forth.

Conclusions

When there are so many ways to find out if a questionnaire is performing as intended, there is no good reason not to do it. Several of the techniques in the report are conducted in conjunction with field testing, so that time and cost factors are marginal. Probing techniques can be applied in so many different ways and at different levels of intensity that the technique can be adapted to almost any evaluation objective or questionnaire type. Laboratory facilities are advantageous, but not essential.

It should be evident that no single technique will tell you all you need to know about the adequacy of your questionnaire. An evaluation program that includes several different sources of information on question performance will be the most successful.

References

[1] Bercini, Deborah H. "Pretesting Questionnaires in the Laboratory: An Alternative Approach." (accepted for publication) Toxicology and Public Health.
[2] Royston, Patricia N. "Using Intensive Interviews to Evaluate Questions." Proceedings of the Fifth Conference on Health Survey Methods Research, Keystone, Colorado, 1989.
[3] Cannell, Charles et al. "New Techniques for Pretesting Survey Questions." NCHSR #HS 05616, Survey Research Center, University of Michigan, 1989.

DESIGNING QUESTIONNAIRES FOR CATI IN A MIXED MODE ENVIRONMENT

Gemma Furno
U. S. Bureau of the Census

1. Introduction

The use of computer assisted data collection by Federal statistical agencies has increased dramatically since Approaches to Developing Questionnaires (Statistical Policy Working Paper No.
10) was written in the early 1980's. Utilization of computer assisted telephone interviewing, or CATI, is now commonplace in many agencies. CATI is an interactive system whereby the questions appear on a terminal screen and the interviewer keys the answers directly into the computer. Branching paths are programmed into the system and the next appropriate question is automatically presented. Range and consistency edits can be programmed to allow for on-line editing of data. These telephone interviews are conducted from one or more centralized locations. At the Census Bureau, CATI is often used for demographic surveys in conjunction with field personal visit and decentralized telephone interviews which use a paper and pencil questionnaire. Typically, some portion of households interviewed previously are assigned to CATI in this mixed mode environment. How many are assigned depends on several factors, such as the sample design itself and optimum workloads for both the field and the centralized telephone facility. Personal interviews are reserved for first time contacts where a visit to establish rapport has been found beneficial, and to follow up cases, such as unable to contact and refusals, that could not be completed on CATI. Telephone interviewing from the interviewer's home is often used for returning cases not assigned to CATI. This description fits the current usage of CATI in the national sample of the American Housing Survey, known as AHS. The American Housing Survey is conducted every two years. CATI was first introduced in 1987 and its use was expanded in 1989, when approximately 25 percent of the sample was initially assigned to CATI. The AHS questionnaire is lengthy and complex, containing over 125 main items, in addition to the household roster. The average interview time is approximately 30 minutes. 
This paper describes our experiences collecting data on the American Housing Survey using both CATI and a paper and pencil questionnaire from the perspective of CATI questionnaire design. A summary of several data quality issues will be presented, followed by a discussion of issues encountered in designing an AHS CATI questionnaire that was comparable to the paper version and which minimized any problems when the two data sets were merged.

2. Issues of Data Quality

Computer assisted data collection holds the promise of improving data quality in several areas. Those cited in the literature [2,5,6,8,9,12] that are most directly related to questionnaire design include the ability to:

1. control branching paths, thus helping to ensure that the correct questions are asked;
2. tailor question wording to the specific situation, thus relieving the interviewer from the burden of choosing alternative wordings, and helping to ensure that the questions are always worded correctly; and
3. evaluate the answers given for appropriateness and take corrective action through the use of on-line range and consistency edits, scripted probing, and dependent or reconciliation interviewing (using answers obtained in previous interviews to improve present answers).

Intuitively one would think that these capabilities should improve data quality. But much research still needs to be done to prove that they actually do, and to quantify the improvement [6,9]. It clearly has been shown that controlling branching paths does ensure that the appropriate questions are presented on the screen. But this alone does not guarantee that the interviewer actually reads the question as worded, receives an acceptable answer or enters it correctly [1,5,6,9]. In reality, entries of "don't know" or "refused" are allowed for most items, range and consistency edits cannot catch all respondent or interviewer errors, and scripted probing and dependent or reconciliation interviewing have practical limitations.
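The three capabilities listed above (controlled branching, tailored wording, and on-line range edits) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names TENURE, VALUE and RENT, their wording, their answer codes and their ranges are hypothetical and are not taken from the AHS instrument or from the Census CATI system.

```python
# Illustrative only: item names, wording, codes, and ranges are hypothetical.
SPECIAL_ENTRIES = {"DK", "REF"}  # "don't know" / "refused", allowed for most items

QUESTIONS = {
    "TENURE": {"text": "Is {address} owned or rented? (1=owned, 2=rented)", "range": (1, 2)},
    "VALUE": {"text": "What is the current value of {address}?", "range": (1, 999997)},
    "RENT": {"text": "What is the monthly rent for {address}?", "range": (0, 9997)},
}


def screen_text(name, fills):
    """Tailor the wording by filling in information already on file."""
    return QUESTIONS[name]["text"].format(**fills)


def range_edit(name, entry):
    """Accept a keyed entry if it is a special code or an in-range number."""
    if entry in SPECIAL_ENTRIES:
        return entry
    value = int(entry)  # raises ValueError for non-numeric keying
    low, high = QUESTIONS[name]["range"]
    if not low <= value <= high:
        raise ValueError(f"{name}: {value} is outside {low}-{high}")
    return value


def next_question(current, answers):
    """Branching path: owners go to VALUE, renters to RENT."""
    if current == "TENURE":
        # A "DK" or "REF" entry matches neither branch, so no follow-up is asked.
        return {1: "VALUE", 2: "RENT"}.get(answers["TENURE"])
    return None  # end of this short sketch
```

Note that the special entries pass the edit untouched, which mirrors the limitation described above: an edit can reject an impossible value, but it cannot force a substantive answer.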
Data from the AHS preedit reject operation illustrate these points. In AHS, the field preedit operation is designed to identify and correct certain clerical, keying and consistency errors in order to improve control of the sample and the quality of survey data before it goes through the regular range, consistency and blanking edits. Approximately half of these reject reasons involve consistency checks within the household roster. The CATI data was put through the preedit program to help evaluate and ensure its quality.

The results of the preedit operation for 1989 show that 11.6 percent of the 8,794 completed CATI cases rejected with an average of 1.20 rejects per case, while 42.6 percent of the 49,279 cases completed on the paper and pencil form rejected with an average of 1.76 rejects per case.

In addition to vastly improving the control of the sample (no rejects for duplicate records, mistakes in the control number, sample status etc.), results for CATI also show that most reject reasons related to the questionnaire were reduced or eliminated. For example, missing data on several key items such as type of living quarters, roster line number, relationship code, reference person, and heating equipment were eliminated due to the automatic branching feature. Where consistency checks were programmed into the CATI questionnaire to identify roster errors such as inconsistencies between birthdate and age, two spouses recorded for the same person, unmarried person has a spouse, etc., the corresponding preedit reject was greatly reduced. However, where roster consistency checks were not in place, some errors remained. Time constraints and the size of the CATI questionnaire have prevented programming all the appropriate checks. The practicality of adding more of these checks will be investigated.

For other reject reasons not showing improvement, we found a number of explanations.
For example, preliminary review indicates that for one reject the CATI interviewers accepted "don't know" entries at a much higher rate than the field interviewers. The item asked for the number of units in a multiunit building. This may well be related to telephone interviewing in general rather than CATI per se. However, now that the problem has been discovered, better interviewer training and/or a scripted probe for a "don't know" answer could be added to CATI for this item. In another instance, a high rate for a reject reason disclosed a flaw in the CATI questionnaire, which will be corrected. Adding more roster checks and a careful review of the other reject reasons should lead to improvements that will further lower the number of CATI cases rejecting in the preedit operation, although some interviewer and respondent errors will inevitably remain. Another indicator of quality in the AHS CATI data involves a reconciliation study conducted for selected items. In the 1987 interview, these items were tenure, type of basement, number of bedrooms and bathrooms, heating fuel, heating equipment, rent and home value. If the 1987 CATI response failed certain tolerance limits, compared to the 1985 response, then the answers were probed at the end of the interview to discover the reason for the discrepancy. Of the 6,432 cases completed on CATI in 1987, 54.8 percent failed the comparison on at least one of the items, triggering the reconciliation questions [11,13]. For all the items reconciled, an average of 49% of the respondents reported some plausible explanation for the discrepancies between the two survey periods. For example, a half bath was converted to a full bath, a different type of heating equipment has been installed, local real estate conditions affected rent or house values, etc. 
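The comparison that triggers reconciliation questions can be pictured with a short Python sketch. The paper does not state the tolerance limits that were used, so the rules below (exact match for categorical items and counts, a 25 percent relative tolerance for dollar amounts) are assumptions made purely for illustration, as are the field names.

```python
# Hypothetical tolerance rules; the actual AHS limits are not given in the paper.
RULES = {
    "tenure": {"kind": "exact"},
    "bedrooms": {"kind": "exact"},
    "rent": {"kind": "relative", "limit": 0.25},  # assumed 25 percent tolerance
}


def items_to_reconcile(prior, current, rules=RULES):
    """Return the items whose current answer fails the tolerance check
    against the prior-interview answer, in rule order."""
    flagged = []
    for item, rule in rules.items():
        old, new = prior.get(item), current.get(item)
        if old is None or new is None:
            continue  # nothing to compare for this item
        if rule["kind"] == "exact":
            failed = old != new
        else:
            failed = abs(new - old) > rule["limit"] * abs(old)
        if failed:
            flagged.append(item)
    return flagged


# Example: one more bedroom is flagged; a rent change within tolerance is not.
prior_1985 = {"tenure": "owner", "bedrooms": 3, "rent": 400}
current_1987 = {"tenure": "owner", "bedrooms": 4, "rent": 450}
flagged_items = items_to_reconcile(prior_1985, current_1987)
```

In a design like this, the flagged items would be held until the end of the interview and then probed, so that the reconciliation questions do not disturb the main question sequence.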
However, that left an average of 51 percent of the respondents reporting that the response was incorrect in either 1985 or 1987, and thus that the change between survey years was spurious. (One caveat is that some of these cases may actually represent real status changes that were incorrectly classified due to response error in the reconciliation question itself.) An interesting result is that respondents were almost as likely to point the finger at the answer they had just given in the 1987 interview as at the 1985 one. Forty-nine percent said the 1987 answer was wrong compared to fifty-one percent for the 1985 answer. But the reconciliation questions did not attempt to ascertain why the 1987 response was wrong - did the interviewer read the question incorrectly, did the respondent or interviewer misunderstand, or did the interviewer enter it incorrectly? This result offers another reminder that a CATI questionnaire may offer the potential to improve data quality in some areas, but it is not a panacea. Future AHS reconciliation studies will try to better ascertain the cause of these errors.

3. Issues Encountered When Designing the AHS CATI Questionnaire

When designing a CATI questionnaire to be used in a mixed mode environment with a paper and pencil questionnaire, the paper form and its associated processing system have usually been in use for a number of years. In such applications, the CATI questionnaire generally is expected to conform to the paper form in order to yield comparable data and expeditious processing of data from both modes.

In this situation, the CATI questionnaire has to serve "two masters". First, it should satisfy the basic objectives of CATI questionnaire design. For example, House and Nicholls stress that a CATI questionnaire must conform to the generally accepted standards of questionnaire design while functioning as a complex computer program [7,8,10].
The program must ensure that the questions work correctly under all circumstances and that minimum demands are made on hardware resources while maintaining rapid response times.

But secondly, in a mixed mode environment, a CATI questionnaire must meet these requirements while providing comparability with an existing paper and pencil version and minimizing any problems encountered when the data is processed. Usually the CATI data is reformatted, then merged at some point with the data collected on the paper form and processed through the existing system. A single processing system saves time and money and ensures that any complex edit and data imputation/allocation procedures are consistently applied, regardless of collection method.

A. Numbering of CATI Questions and System Commands/Instructions

A basic issue of CATI questionnaire design is the numbering scheme used for the questions and system commands/instructions. This can have important implications when a complex questionnaire is used in a mixed mode environment. Two possibilities are to utilize the actual question numbers, or if different, any processing code numbers.

The AHS questionnaire contains several sections of duplicate or parallel items because there are different sections of the questionnaire for renters and owners. Within both of these two major subsections there are several further subsets of parallel items based on type of housing unit. All questions on the paper form have a unique item number, but a duplicate or parallel item shares the same processing code as only one version of the item could be asked in an interview.

The question arose as to how to handle these duplicate or parallel sets of questions in CATI - should the basic question be programmed only once and the system programmed to alter the question wording and its universe as appropriate, or should the design of the CATI questionnaire follow the paper form as closely as possible?
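The relationship between unique item numbers and shared processing codes can be pictured as a simple lookup. The Python sketch below is a hedged illustration only: the item numbers and processing codes are invented, and the function is not drawn from the Census Bureau's reformat programs.

```python
# Hypothetical item numbers (unique on the form) and processing codes
# (shared by parallel owner/renter versions of the same item).
ITEM_TO_PROCESSING_CODE = {
    "45a": "P210",  # owners' version of an item
    "82a": "P210",  # renters' parallel version of the same item
    "46": "P215",
}


def rekey_by_processing_code(answers_by_item):
    """Re-key one case from item numbers to processing codes.

    Only one version of a parallel item can be asked in an interview,
    so two answers landing on the same code signal a routing error.
    """
    record = {}
    for item, value in answers_by_item.items():
        code = ITEM_TO_PROCESSING_CODE[item]
        if code in record:
            raise ValueError(f"parallel items both answered for code {code}")
        record[code] = value
    return record


renter_case = rekey_by_processing_code({"82a": 3, "46": 1})
```

Keeping the instrument keyed by item number and applying a lookup of this kind afterwards is one way to keep the questionnaire aligned with the paper form while still producing output in the layout the processing system expects.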
We chose the latter course for AHS, that is, to follow the paper form as closely as possible, and thus utilize the item numbers rather than the processing codes. There were two reasons for this. First, the universes for the duplicate or parallel items are extremely complicated in AHS. Entering the basic question only once and programming the system to alter the wording and universes as needed would not have saved the CATI author any time as the system instructions and documentation would have just become more complicated, prone to error and difficult to test out. Secondly, with both CATI and paper questionnaires in use simultaneously, and separate training materials to be written, our goal was to move easily from one questionnaire and set of materials to the other without confusion. This compatibility between questionnaires proved especially helpful when it came to writing the specifications and programs to reformat the CATI output to be merged with the paper and pencil data.

B. Question Wording, Fills and Answer Categories

We encountered little difficulty in transferring the actual question wording to CATI. The paper and pencil form had already been adapted for telephone use in the field several years before. However, a few problems had to be dealt with.

When collecting data under both modes, it becomes difficult to change the wording close to the start of the survey if you want to keep the questions as comparable as possible. For example, the CATI interviewers found the wording of one question particularly awkward but it could not be easily changed because the paper form was already printed. The sponsor did not feel comfortable changing the wording on only the CATI questionnaire. A revision had to wait until the next time the survey was conducted when it was made to both.

Other situations involved question "fills." That is, using information previously obtained to tailor the exact wording to the situation.
This is one of the jobs that CATI does best, but to display the question correctly, the system obviously must be programmed to distinguish among the wording choices. This sometimes required that answer categories be expanded. For example, in Figure 1 below, the paper and pencil version of item 120g (on means of transportation) groups "cars, truck and van" into a single answer category. The interviewer substitutes the specific response in the following question, 120h. In CATI, three separate categories are required if subsequent questions are to use the answer given.

Figure 1
Q120g - 120h On The Paper & Pencil Questionnaire
(graphic not reproduced)

Figure 1 also illustrates the situation where a subquestion on the paper form is embedded in the middle of the main question with the answer categories numbered as if it is one continuous question. Figure 2 below shows what this series looks like in CATI. In CATI, the subquestion appears on a separate screen but the answer categories are numbered 1 and 2 instead of 2 and 3. CATI interviewers are used to seeing the categories in numerical order, starting with one. It would have been confusing to present the question on a separate screen with the categories numbered any other way.

Figure 2
Q120g - 120h In CATI (Answered For Mary Smith)

>Q120g< How did MARY SMITH usually get to work last week?
        (MARK ITEM THAT ACCOUNTED FOR GREATEST DISTANCE TO LOCATION OF JOB
        AT WHICH PERSON WORKED MOST HOURS LAST WEEK.)

        <1> Car
        <2> Truck
        <3> Van
        <4> Bus or streetcar
        <5> Subway or elevated
        <6> Railroad
        <7> Taxicab
        <8> Motorcycle
        <9> Bicycle
        <10> Other vehicle
        <11> Walked only
        <12> Works at home
        ===> 3

        (System is programmed to display Q120g1 if 1-3 is answered here.)

>Q120g1< Did MARY SMITH drive alone or go with others?

        <1> Alone
        <2> Go with others
        ===> 2

        (System is programmed to display Q120h if 2 is answered here.)

>Q120h< How many people including MARY SMITH usually ride in the van?

        <01-97> 1-97
        <98> 98 or more
        ===> 7

C. The Reformatting Stage

Our processing goal for the CATI questionnaire was to produce output that could easily be reformatted to look exactly like the data keyed from the paper and pencil form. This allowed for merging the two data sets at the earliest possible opportunity and running the combined file through the current processing system. We encountered several situations that affected the reformatting operation.

1. Reformatting the CATI Data

When the CATI question and answer categories closely corresponded to those on the paper form, reformatting was straightforward. However, if they did not, then more complicated reformat specifications had to be followed. Some simple examples of more complicated situations include questions where, as seen in Figures 1 and 2, expanded CATI answer categories had to be collapsed back or separate CATI questions had to be combined to match the compact style and embedded questions of the paper form.

Although some of this reformatting was performed in the CATI questionnaire itself, most was completed in batch mode after data collection. Performing most of these tasks later reduced the number of variables in the CATI questionnaire, an important consideration since the questionnaire was quite large. While the Census CATI system does allow for very large and complex questionnaires, there are system limits which the AHS questionnaire reached.

Table and grid formats also frequently required reformatting. The CATI author had to ensure that these questions were programmed in such a way that the data could be successfully reformatted later.

2. Adding Special Data Items Not Needed for CATI Questionnaire

Not all items on the paper and pencil questionnaire are needed or are relevant to the CATI questionnaire. Some are standard housekeeping input items, such as sample designations or geography codes that can't be changed during the interview.
Others are output items that summarize respondent or housing unit characteristics from previously answered items. The question arises as to whether such items should be carried on the CATI input or output files or merely added in batch mode when the CATI data are reformatted. In deciding which course to follow, we gave consideration to balancing the size and efficiency of the CATI questionnaire with the difficulty of adding these items later. The system size constraint and the ease of taking care of these items dictated that 357 we add them during reformatting while taking special care not to overlook relevant items. As can be seen from these examples, complex questionnaires and processing needs often require special effort to ensure comparability of the CATI data to that collected on the paper form. 4. Summary While great improvement in data quality was seen as a result of automatic branching and on-line consistency checks, there is still room for improvement in the AHS CATI questionnaire. It must also be remembered that the questionnaire can not shoulder all the burden for improving data quality. Respondent and interviewer actions over the telephone, whether in a centralized or decentralized environment, must also be considered. Our experiences on AHS showed a large, complex paper and pencil questionnaire successfully transferred to CATI. Constant effort by a number of people, including the specifications writer, the CATI author, and other programmers was needed to accomplish the task. In AHS, it was the CATI questionnaire that was expected to conform to the paper and pencil form and the current processing system. This does not always allow CATI to be used to its fullest potential but it is a common situation that must be faced when CATI is added to an already existing paper and pencil survey. As new surveys are developed with a mixed mode of data collection planned from the beginning, different experiences will result. References 1. Catlin, G. and Ingram, S. 
(1988), "The Effect of CATI on Data Quality: A Comparison of CATI and Paper Methods", Proceedings of the Fourth Annual Research Conference, U.S. Bureau of the Census, pp. 291-299.
2. Dillman, D. and Tarnai, J. (1988), "Administrative Issues in Mixed Mode Surveys", in Telephone Survey Methodology, Robert M. Groves et al. (editors), John Wiley & Sons, New York, NY.
3. Ferrari, P. (1984), "Preliminary Results from the Evaluation of the CATI Test for the 1982 National Survey of Natural Scientists and Engineers", unpublished research report, U.S. Bureau of the Census.
4. Ferrari, P. (1986), "An Evaluation of Computer-Assisted Telephone Interviewing Used During the 1982 Census of Agriculture", unpublished report, Agriculture Division, U.S. Bureau of the Census.
5. Groves, R. and Mathiowetz, N. (1984), "Computer-Assisted Interviewing: Effects on Interviewers and Respondents", Public Opinion Quarterly, 48(1B), pp. 356-369.
6. Groves, R. and Nicholls, W. (1986), "The Status of Computer-Assisted Telephone Interviewing: Part II - Data Quality Issues", Journal of Official Statistics, 2(2), pp. 117-134.
7. House, C. (1985), "Questionnaire Design With Computer-Assisted Telephone Interviewing", Journal of Official Statistics, 1(2), pp. 209-219.
8. House, C. and Nicholls, W. (1988), "Questionnaire Design for CATI: Objectives and Methods", in Telephone Survey Methodology, Robert M. Groves et al. (editors), John Wiley & Sons, New York, NY.
9. Nicholls, W. and Groves, R. (1986), "The Status of Computer-Assisted Telephone Interviewing: Part I - Introduction and Impact on Cost and Timeliness of Survey Data", Journal of Official Statistics, 2(2), pp. 93-115.
10. Nicholls, W. and House, C. (1987), "Designing Questionnaires for Computer-Assisted Interviewing: A Focus on Program Correctness", Proceedings of the Third Annual Research Conference, U.S. Bureau of the Census, pp. 95-111.
11. Nicholls, W. (1989), "The Impact of High Technology on Data Collection", CATI Research Report No.
Gen-1, Computer-Assisted Interviewing Central Planning Committee, U.S. Bureau of the Census.
12. Van Bastelaer, A., Kerssemakers, F. and Sikkel, D. (1988), "Data Collection With Hand-Held Computers: Contributions to Questionnaire Design", Journal of Official Statistics, 4(2), pp. 141-154.
13. Schwanz, D., Montfort, E. and Cannon, J. (1988), "Analysis of Operational Issues: 1987 AHS-CATI", CATI Research Report No. AHS-1, Computer-Assisted Interviewing Central Planning Committee, CATI Research and Analysis Sub-Committee, U.S. Bureau of the Census.

DISCUSSION

Carol C. House
National Agricultural Statistics Service

Dillman and Tarnai (1988) define a mixed mode survey as one that uses two or more methods to collect data for a single data set which will be analyzed as a unit. Familiar examples include face-to-face first wave interviews followed by telephone or mail on subsequent waves, and telephone follow-up to a mailed questionnaire. Most large Federal statistical agencies routinely use mixed mode surveys to collect data. The Furno paper focuses on the 1987 American Housing Survey, which uses a combination of face-to-face, decentralized telephone, and CATI interviews in its mixed mode design. The National Agricultural Statistics Service (NASS) sometimes incorporates mail, centralized (non-CATI) telephone, CATI, face-to-face, and decentralized telephone interviews in a single survey.

Why are survey organizations choosing mixed mode designs over simpler single mode surveys? Their objectives appear to be to reduce survey costs, improve timeliness, and take advantage of the relative strengths of different modes of collection. At the same time they want to preserve data comparability and integrate the mixed design into the data collection, data handling and data manipulation processes that are already in place in the organization. This oversimplifies the decision making process, but it fits most large Federal agencies.
I will use this set of objectives as a basis to evaluate the design of the CATI questionnaire discussed in the Furno paper. This paper describes adding CATI to an existing mixed mode survey featuring face-to-face and decentralized telephone interviewing. The CATI questionnaire was designed to work effectively in that specific environment. The author discusses issues related to question wording, editing and data processing because a CATI questionnaire design impacts all of these areas.

Cost Reductions and Improvements in Timeliness

The Census Bureau probably achieved most of the gains in these areas when they originally mixed face-to-face interviewing with decentralized telephoning. They may see additional cost savings by adding CATI to this mix, but any such gains are likely to be minimal. However, a specific discussion of cost was beyond the scope of this paper.

The literature usually asserts that timeliness can be improved by CATI because the data is immediately entered into a computer without a separate data entry step. Secondly, and more importantly, data is edited during the interview and is "clean" by the time the data collection is over. But what actually happens to CATI data from the American Housing Survey? Furno reports that, "Our processing goal for the CATI questionnaire was to produce output that could be easily reformatted to look exactly like data keyed from the paper and pencil form... and [to] run the combined file through the current processing system." Thus the "clean" data from CATI is dumped in with "dirty" data from other data collection modes and the whole file is run through the standard batch editing programs. This results in no improvements in timeliness.

Her approach is not uncommon. The goal of easy assimilation frequently takes precedence over improvements in timeliness as well as several other objectives of the mixed mode design.
This decision may be necessary during early experimental uses of CATI, but as CATI (and soon CAPI) become ongoing parts of a survey organization, we need to find ways to integrate CATI data into data processing programs with less duplication of effort.

Tapping the Strengths of Different Data Collection Modes

The American Housing Survey's instrument incorporates a number of CATI features to improve data quality. These include controlling the branching paths, tailoring question wording to the individual respondent, and online editing. Furno measures the gains in some of these areas through two different comparisons: by counting the number of rejects in a subsequent "pre-edit" program, and by conducting a reconciliation study.

The pre-edit program is designed to make simple checks for clerical and keypunch errors prior to the data entering more sophisticated and complex editing programs. Furno measures 43% rejects on the paper versions and 12% rejects on CATI. This demonstrates substantial improvements using CATI. However, one wonders why there should be any of these very simple errors in the CATI data. The author indicates that not all of the checks were added to the CATI instrument, so this is an area for possible improvement in the questionnaire.

The reconciliation study was conducted at the end of the interview for selected items, comparing responses with corresponding answers obtained during the 1985 survey. These items (such as number of bedrooms in the house) were expected to be fairly constant over the two years. This study uncovered reporting errors on both the 1985 and 1987 surveys, with approximately the same number of errors occurring each year. This fact indicates that the CATI questionnaire (new in 1987) did not significantly improve the data quality in some areas. More detailed studies of this type will possibly uncover the causes of these errors and lead to improvements in both the CATI and paper questionnaires.
The reconciliation study was kept completely independent of the main part of the questionnaire. The original responses on the survey were not changed based on reconciliation, although CATI technology would have made it easy to do so. This brings up a broader issue to consider for panel surveys: is it appropriate to use previously collected data to edit or influence current responses? Is this practice any less appropriate on mixed mode surveys where certain modes (CATI) would use this earlier information and other modes would not? It is unclear whether using previously collected information would improve overall data quality or merely heighten inconsistency and variability in the error structure of a mixed mode survey. NASS is struggling with these issues and we would appreciate reactions and experiences from other groups.

Keeping Data Comparable Across Modes

This was one of the primary objectives of the designers of the American Housing Survey's CATI instrument, and the Furno paper concentrates on these issues. Discussions include the ways to program questions on CATI that appear in tabular form on a paper form; handling fills; using consistent answer codes; and handling last minute questionnaire changes.

Integrating A Mixed Mode Design Into Existing Survey Processes

When new technologies or new modes of data collection are added to an on-going survey, it is important to cause as little disruption to the routine as possible. This was the situation with the CATI test on the American Housing Survey, and the Furno paper describes the efforts to which the designers went to make CATI fit into the existing design. The CATI version of the questionnaire was always made to conform to the paper version, and the CATI processing and editing to conform to, and go through, the existing batch programs. The disadvantage of this approach is that some (much?) of the advantage of CATI was lost in the mixed mode design.
CATI is here to stay in telephone surveys and CAPI is just arriving. These technologies will be used routinely in mixed mode designs. How do we handle the integration of CATI and paper once the testing phase is over? Although it may be reasonable to make a CATI questionnaire conform to a paper version when early testing is going on, it is not reasonable to retain that unbalanced relationship later, after 75% to 80% of the contacts are made with the CATI version. This situation can and does happen, because the test version is implemented operationally with minimal revisions.

It is time to re-evaluate CATI/CAPI technology in mixed mode designs. The modes must fit together into a single survey operation and produce compatible data. However, we need to look for better ways of integrating existing technologies with the new so that total quality is optimized.

Reference

Dillman, Don A. and Tarnai, John, "Administrative Issues in Mixed Mode Surveys," in Robert M. Groves (ed.), Telephone Survey Methodology, New York, Wiley, 1988.

Session 11
STATISTICAL DISCLOSURE AVOIDANCE

DISCLOSURE AVOIDANCE PRACTICES AT THE CENSUS BUREAU

Brian Greenberg
U.S. Bureau of the Census

I. Introduction

The Census Bureau, as well as other statistical agencies, collects information about the Nation's population and institutions and releases this information to the public. The information is typically collected under pledges of confidentiality, and agencies are required to release data in such a manner as not to violate guarantees of non-disclosure either through design or neglect. At the same time, data collection agencies have the responsibility to make statistical information available for a wide range of uses that include policy decision making, program analysis, economic modeling, and many others. A data collection agency has the obligation to release as much information to the public as possible while adhering to pledges of confidentiality given to respondents.
Broadly speaking, the objective is to release as much information as possible consistent with the requirement that the risk of disclosure is acceptably low. There is no known way to quantify the amount of information released or to quantify the level of risk of disclosure. Finding methods to relate the levels of information and levels of risk is an area of very active research at the Census Bureau and at other statistical agencies in this country and abroad.

In a recent paper and talk at the Census Bureau 1990 Annual Research Conference (Greenberg 1990), I discussed disclosure avoidance research activities at the Census Bureau. The report focused on the work to develop data release strategies through the use of tools of operations research, mathematics, and statistics. We discussed research efforts here at the Bureau, studies conducted under Joint Statistical Agreements, and other cooperative efforts with researchers on this topic. In that paper we describe the mathematical programming methods used to design controlled rounding and suppression routines, the statistical techniques for data perturbation, and the more probabilistic analysis that attempts to evaluate risk. That paper contains an extensive bibliography and should be regarded as a companion to this one for the understanding of the underlying mathematics and methods. Although there will be some inevitable overlap between this report and the Annual Research Conference paper, the focus here will be on practical considerations in the design of a product for data release and a description of current programs, planned products, and options which are available.

The overall theme of this Seminar is Quality of Federal Statistics. In addition to the notion of accuracy, other aspects of quality are timeliness and completeness. From the perspective of disclosure avoidance activities, we address the issues of accuracy and completeness.
We cannot release full and accurate detail on a public use file because that would exceed any reasonable level of disclosure risk. By taking measures to have acceptably low levels of risk we compromise completeness and/or accuracy. In designing a data release strategy we must evaluate the trade-off between completeness and accuracy and between completeness for one data attribute at the expense of completeness for another.

To reduce levels of disclosure risk, one either suppresses information and collapses categories or introduces noise. Both these actions can be thought of as data masking. Under the first option we reduce completeness, while under the second we reduce accuracy. Earlier, we introduced the idea of "amount of information" versus "level of risk" and indicated the need to optimize the amount of information while maintaining an acceptably low level of risk. We can think in terms of accuracy and completeness as components of level of information and evaluate the trade-off with acceptable levels of risk (which is much harder to characterize). This theme will run throughout the paper and, whether made explicit or not, it pervades the design of any data release strategy.

In Section II we discuss tabular data, including tables of amounts which are based on our economic surveys and censuses and tables of frequency counts which appear in the Summary Tape Files (STF's) from the Decennial Censuses. In Section III, we discuss public use microdata. Public use microdata files are released as standard products from virtually all demographic programs and they are extensively used by researchers in many areas. In fact, the public use microdata files for the Survey of Income and Program Participation form the major data product from that survey. Section IV consists of a brief summary.

II. Tabular Data

A. Frequency Counts of Demographic Characteristics

Cross-classified tables of frequency counts of demographic and housing characteristics constitute one of the major formats for release of data from the Decennial Censuses. For example, one such cross-classification can look like Table 1 below.

[Table 1. Block Group 1 - Age by Sex (graphic not reproduced)]

The major disclosure risk in the release of such tables occurs when a small value appears in a marginal position. If an investigator examines a table and knows the identity of the person or persons having the marginal characteristics indicated, the investigator could infer other characteristics of the respondent through the cross-classification. In so doing, the investigator would learn information provided to the Census Bureau in confidence. The way to reduce this disclosure risk is to suppress cells with low marginal values or introduce uncertainty into cell counts.

Suppression was used for frequency counts from the 1980 Census and for earlier Censuses. If a marginal value was below a specified cut-off, all cells summing to that marginal were suppressed. That is, if the cut-off were 10, Table 2 would have become Table 3.

[Tables 2 through 4: graphic not reproduced]

In order to prevent deriving the third row in Table 3 by subtracting the non-suppressed rows from the totals row, at least one more row must be suppressed. Suppressed values in Row 3 are primary suppressions. Row 1 was chosen for the complementary suppressions, denoted by "C", to protect the primary as shown in Table 4.

There were two problems with this method for disclosure avoidance. On one hand, due to the need for complementary suppressions, there were sometimes large values suppressed as complementary cells to protect small primary cells. This was considered a major drawback for data users. The other problem with this procedure was that it was often difficult to guarantee geographic complementary suppressions.
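The following minimal sketch illustrates the suppression rule described above and why a single suppressed row can be recovered from published totals. The counts, row labels, and function names are hypothetical and are not taken from the paper's tables; only the cut-off of 10 follows the example in the text.

```python
# Hypothetical age-by-sex counts for one block group (not the paper's Table 2).
CUTOFF = 10  # illustrative cut-off, matching the example in the text

table = {
    "Under 18": [40, 45],
    "18 to 64": [120, 130],
    "65 and over": [3, 4],
}


def suppress_rows(table, cutoff):
    """Blank out (None) every cell of a row whose marginal total is below cutoff."""
    return {
        row: ([None] * len(cells) if sum(cells) < cutoff else list(cells))
        for row, cells in table.items()
    }


def recover_single_suppressed_row(published, column_totals):
    """If exactly one row is suppressed, rebuild it by subtraction from the totals."""
    hidden = [row for row, cells in published.items() if cells[0] is None]
    if len(hidden) != 1:
        return None  # two or more suppressed rows cannot be separated this way
    shown = [cells for cells in published.values() if cells[0] is not None]
    return [total - sum(col) for total, col in zip(column_totals, zip(*shown))]


column_totals = [sum(col) for col in zip(*table.values())]
published = suppress_rows(table, CUTOFF)            # primary suppression only
recovered = recover_single_suppressed_row(published, column_totals)

# Adding one complementary row suppression blocks the subtraction.
protected = dict(published)
protected["Under 18"] = [None, None]
still_recoverable = recover_single_suppressed_row(protected, column_totals)
```

With primary suppression alone, `recovered` reproduces the hidden row exactly; once a second row is suppressed, the same subtraction no longer yields it. The same arithmetic applies across geography when a state total is published alongside its counties.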
For example, if one or more data cells is suppressed for exactly one county in a state, then the suppressed value can be derived exactly by subtracting the values of all other counties from the state total. To prevent this, geographic complementary suppressions are required. It was clear that procedures to ensure complete complementary geographic suppressions would also take their toll in suppression of even more information. This realization led to a recognized need to develop disclosure avoidance procedures for tables of cross-classified frequency counts along the lines of data distortion in order to introduce uncertainty into the data.

The method to be used as a disclosure avoidance measure on 1990 Census frequency count tables introduces uncertainty into the tables by changing some values. The basic idea is as follows, and I quote, virtually verbatim, from (Greenberg 1990).

For a subset of records, field values on a record will be replaced with field values on a different record having the same control characteristics so that the newly created records will be different on potentially all characteristics except the controls. This method has been called the Confidentiality Edit because of the use of a hot-deck similar in spirit to the hot-deck used in edit and imputation procedures. Given a target record on which some changes are to be made, based on specified control characteristics the system matches the target record to another record and "hot-decks" the remaining non-control variables.

To be a little more specific, I paraphrase from (Griffin, Navarro and Flores-Baez 1989).

The Confidentiality Edit selects a small sample of census household records from the internal census data files and interchanges their data with other households which have identical characteristics on a set of selected key variables but are in different geographic locations.
The matching and interchanging operations are controlled on the key variables of number of persons in household; population characteristics of race, Hispanic origin and age; and on housing characteristics of units in building, rent/value and tenure.

Because of the controls described above, census counts for total persons, and totals by race, Hispanic origin, and age 18 and above, will not be affected. These counts provide information required for voting rights as outlined in Public Law 94-171. In addition, housing counts by tenure will not be affected by the Confidentiality Edit. The interchange of information on records will be accomplished on the detail file of records. The revised records will be used to generate all tables so that there will be no inconsistencies between tables, and the revised records will also be used to produce other Census products.

Three advantages of the Confidentiality Edit are: (1) this procedure needs to be implemented only once on internal files to obtain protection for all Summary Tape File data products, (2) all data cells can be shown on Summary Tape Files so there is no interference with data aggregation by users, and (3) more data values will be available than in 1980. These procedures have been evaluated for their impact on data products, and details of the analysis are contained in (Griffin and Thompson 1987) and (Navarro, Flores-Baez, and Thompson 1988).

For tables of frequency counts for the 1980 Census and earlier, there was a reduction in completeness through the use of suppressions to achieve an acceptable level of disclosure risk. Due to the need for complementary suppressions, the overall effect of a suppression pattern caused more loss in completeness than desirable. For the 1990 Census, by interchanging values on the detail record file, there will be a loss of accuracy through the interchange of information between records.
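As a rough illustration of the interchange described above, the sketch below swaps the non-control data of households that agree on a set of key variables but lie in different locations. It is not the Census Bureau's implementation: the field names, the reduced set of key variables, the sampling rate, and the partner-selection rule are all assumptions made for the example.

```python
import random
from collections import Counter

# Illustrative subset of the control (key) variables named in the text.
KEYS = ("persons", "race", "tenure")

# Hypothetical household records; "block" stands in for geographic location.
households = [
    {"block": "A", "persons": 2, "race": "W", "tenure": "own", "income": 30000},
    {"block": "B", "persons": 2, "race": "W", "tenure": "own", "income": 55000},
    {"block": "A", "persons": 4, "race": "B", "tenure": "rent", "income": 20000},
    {"block": "C", "persons": 4, "race": "B", "tenure": "rent", "income": 41000},
    {"block": "C", "persons": 1, "race": "W", "tenure": "rent", "income": 18000},
]


def control_counts(records):
    """Count households by location and key variables (the controlled totals)."""
    return Counter((r["block"],) + tuple(r[k] for k in KEYS) for r in records)


def confidentiality_swap(records, rate, seed=0):
    """Interchange non-key data between matched households in different blocks."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # leave the input records untouched
    swapped = set()
    for i, target in enumerate(out):
        if i in swapped or rng.random() >= rate:
            continue  # record already used in a swap, or not sampled
        partners = [
            j for j, other in enumerate(out)
            if j != i and j not in swapped
            and other["block"] != target["block"]
            and all(other[k] == target[k] for k in KEYS)
        ]
        if not partners:
            continue  # no household elsewhere with identical key variables
        j = rng.choice(partners)
        for field in list(target):
            if field not in KEYS and field != "block":
                target[field], out[j][field] = out[j][field], target[field]
        swapped.update((i, j))
    return out


after = confidentiality_swap(households, rate=1.0)
```

Because only non-key fields move between matched records, `control_counts(after)` equals `control_counts(households)`, which mirrors the statement that totals on the controlled characteristics are not affected. A rate of 1.0 is used here only to make the small example deterministic; the text describes a small sample.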
The studies in (Griffin and Thompson 1987) and (Navarro, Flores-Baez, and Thompson 1988) show that the loss of data utility due to this reduction of accuracy is not significant.

B. Aggregate Economic Data

The primary method for releasing data from Census Bureau establishment surveys or censuses is in the form of cross-classified tables of amounts. For example, in a given state the total value of shipments may be cross-classified by SIC and by county. A cell is regarded as sensitive (i.e., having an unacceptably high disclosure risk) if the (N,K)-rule is violated, that is, if N or fewer respondents account for at least K% of the total cell value. Such cells are regarded as primary suppressions and they are not released. If only primary cells are suppressed, their values often can be derived exactly, or closely estimated, through linear analysis using marginal totals. To prevent this, complementary suppressions are introduced, and one seeks a set of complementary suppressions which protects the sensitive cells yet suppresses as little additional information as possible.

We illustrate these ideas with a few (artificially) simple examples. Consider Table 5, in which cell (2,2) is considered sensitive because it failed the (N,K)-rule. We place a "P" in position (2,2) to indicate a primary suppression, and introduce a set of complementary suppressions, for example, as in Table 6.

[Tables 5 and 6: graphic not reproduced]

Given a suppression pattern in a table, the values of all suppressed cells (primary or complementary) can be estimated. To indicate how this is done, we return to Table 6, which we rewrite as Table 6'. From Table 6', we have the system of equations:

[Graphic not reproduced]

Note that Table 8 and Table 9 both display patterns of complementary suppressions. In Table 8 three complementary suppressions were introduced, while in Table 9 four complementary suppressions have been introduced. The sum of complementary suppressed values in Table 8 is 295 and the sum of complementary suppressed values in Table 9 is 135.
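The (N,K)-rule defined earlier in this section can be expressed in a few lines. The sketch below is illustrative only: the function name, the values of N and K, and the respondent contributions are hypothetical, since the text does not state the parameters actually used.

```python
def is_sensitive(contributions, n, k):
    """Apply the (N,K)-rule to one cell.

    contributions: the non-negative amounts reported by each respondent in the cell.
    Returns True when the n largest contributions together account for at
    least k percent of the cell total, i.e. the cell is a primary suppression.
    """
    total = sum(contributions)
    if total <= 0:
        return False  # an empty or zero cell has nothing to dominate
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100 * top_n >= k * total


# Hypothetical cells: value of shipments reported by four establishments each.
dominated_cell = [900, 50, 30, 20]    # two largest hold 95% of the total
balanced_cell = [300, 250, 250, 200]  # two largest hold 55% of the total

dominated_flag = is_sensitive(dominated_cell, n=2, k=90)
balanced_flag = is_sensitive(balanced_cell, n=2, k=90)
```

Under the assumed parameters N = 2 and K = 90, the first cell would be withheld as a primary suppression and the second would be publishable, subject to any complementary suppressions needed elsewhere in the table.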
For the 1982 and 1977 Economic Censuses, the criterion for selecting a set of complementary suppressions was to suppress as few complementary cells as possible to protect the primary suppressions. For the 1987 Economic Censuses, we have implemented the criterion of suppressing the least total value. Thus, given Table 8 and Table 9 above, the preferred complementary suppression pattern under our current criterion will be as in Table 9, since less total value would be suppressed. In 1982 and before, the preferred pattern would have been as in Table 8, since fewer cells are suppressed.

The basic disclosure avoidance method for the release of cross-classified aggregate economic data at the Census Bureau is cell suppression, that is, a reduction in completeness. This method seems to work well, especially as users can estimate the value of suppressed cells within acceptable limits. Whether we employ the criterion of minimizing the number of complementary suppressions or the total value suppressed constitutes a selection of methods within an overall strategy. We are currently investigating how procedures for finding complementary suppressions can be improved for the 1992 Economic Censuses.

III. Microdata

Microdata records are data records at the respondent level, and the risk in the release of a microdata file is that someone may be able to discover the identity of a respondent. The risk can arise from the presence of highly visible and unique characteristics, or it may stem from the threat of matching public use microdata files to other files, either privately or publicly held. For the latter threat of linking two files, some of the issues are: what data are available on both files, how comparably reported are the data, how up-to-date are they, and how easily accessed are the records? In particular, one must ask what it would cost an investigator to carry out such a project. All these factors contribute to a picture of overall risk.
The basic strategy for the release of general purpose public use microdata files at the Census Bureau is to reduce completeness by restricting the level of detail on the file. Instead of releasing exact date of birth, we can release month, or quarter, or year. Percentages can be grouped into deciles or quantiles. Income can be recoded into intervals of size, for example, $4,000 for income up to $100,000, and all income in excess of $100,000 can be topcoded to read as "$100,000 or more". Virtually all quantitative variables on public use microdata files are topcoded to obscure high visibility respondents and to reduce the likelihood of successful computer matching by removing outliers.

In considering reduction in completeness or reduction in accuracy as disclosure avoidance practices, the Census Bureau tends to strongly favor reduction of completeness for the release of microdata files. By so doing, we are better able to maintain a broad range of utility for the files. For some special purpose microdata files, noise has been added to variables (in addition to topcoding and using categories) in order to further frustrate successful computer matching; see (Greenberg 1990) for further discussion. Such files can be created when we know the intended uses in advance, so that we can design a noise introduction strategy to suit specified needs.

One of the most important fields on a microdata record is the geographic identifier. Geography is the single identifier which cuts across all public use microdata files and is a field in which there is little error. Under current Census Bureau procedures, no area having fewer than 100,000 persons in the sample frame can be identified on a microdata record. This minimum can be raised for surveys which have a presumed greater disclosure risk.
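The recodes described above (coarsened dates, income intervals, and a topcode) can be sketched as follows. The function names and output labels are hypothetical; the $4,000 interval and $100,000 topcode are the illustrative values from the text, and incomes are assumed to be non-negative whole dollars.

```python
import datetime

INTERVAL = 4_000    # illustrative interval width from the text
TOPCODE = 100_000   # illustrative topcode from the text


def recode_income(income):
    """Replace an exact income by its $4,000 interval, or by the topcode label."""
    if income >= TOPCODE:
        return "$100,000 or more"
    low = (income // INTERVAL) * INTERVAL
    return f"${low:,} to ${low + INTERVAL - 1:,}"


def coarsen_birth_date(birth_date, level="quarter"):
    """Release only the month, quarter, or year of an exact date of birth."""
    if level == "year":
        return str(birth_date.year)
    if level == "quarter":
        return f"{birth_date.year}-Q{(birth_date.month - 1) // 3 + 1}"
    if level == "month":
        return f"{birth_date.year}-{birth_date.month:02d}"
    raise ValueError(f"unknown level: {level}")


income_label = recode_income(37_500)
top_label = recode_income(250_000)
birth_label = coarsen_birth_date(datetime.date(1950, 5, 17), level="quarter")
```

Both functions reduce completeness without adding noise: every released value is true, only less detailed, which is the approach the text says is favored for general purpose files.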
For the Survey of Income and Program Participation (SIPP), the minimum was raised: its geographic cut-off was set at 250,000 by the Microdata Review Panel because of the fine level of detail on SIPP and the longitudinal nature of the survey.

Prior to 1981, each operating division had responsibility for the confidentiality of any public use microdata sample released by the division. At that time, no geographic area could be shown having fewer than 250,000 residents in the sampling frame. The Microdata Review Panel was established in 1981 to review all proposed new microdata files for release. No new microdata file can be released by the Census Bureau without Panel approval. At that time, the geographic minimum was reduced to 100,000. The Panel is composed of representatives from Data Users Services Division, Program and Policy Development Office, Demographic Surveys Division, and representatives from the Associate Directors for Economic Fields, Demographic Fields, and Statistical Standards and Methodology. This Panel make-up reflects broad Census Bureau concern.

As part of the review process, survey staff seeking release approval must fill out a disclosure checklist which asks about identifiable geography, matching potential, topcodes, etc. The Panel typically meets with survey staff to discuss problems and seek a resolution. The Panel may request additional topcodes, deletion or recoding of some variables, and other actions to reduce disclosure risk. At times the Panel will request cross-tabulation frequency counts to see whether there are outlying combinations of values. The Panel may recommend changes; however, it is more typical for the Panel to point out problems and leave it to the survey staff to find solutions based on their understanding of intended uses of the file. Survey sponsors attend Panel meetings to discuss options and assist in the determination of risk and resolution options.

There are often a number of options available to reduce risk on a file.
For example, typically one of several variables can be recoded to reduce the possibility of matching to external files. At times, and depending on perceived user needs, geographic specificity can be reduced. That is, one can provide more potentially identifying demographic characteristics on a national file (i.e., no subnational geography) than on a file that identifies a relatively small geographic locale. By and large, one must think in terms of trade-offs between the various data items and their relative completeness.

In a public use microdata file, it is not possible to provide a very complete and accurate file due to an unacceptably high level of risk. Survey sponsors and data users must contribute to the decision making process in identifying areas in which some completeness and/or accuracy can be sacrificed while attempting to maintain as much data quality as possible.

Below we list some options which are currently available to enhance data utility with no increase of risk. If the topcode on some item, say income, is $100,000, replace all values over the topcode by the mean (or median, etc.) of the topcoded values. Thus, if the mean of the topcoded values were $130,000, replace any value in excess of $100,000 by $130,000. This is in contrast to the current practice of replacing topcoded values by the cut-off (in this case $100,000). In fact, one can actually provide the exact distribution of all topcoded values.

Another option we have is for local topcodes. For the Metropolitan Sample of major cities from the American Housing Survey, each city has a different topcode for "home value" based on (roughly) a three percent upper tail cut-off for that city. Would state-level topcodes for such items as income, housing costs, etc. be desirable for other files? Would such a strategy provide more useful data?

The Census Bureau is currently planning for the Public Use Microdata Samples (PUMS) for the 1990 Decennial Census.
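The option of replacing topcoded values by their mean, rather than by the cut-off, can be sketched as follows. The function names and income values are hypothetical; the values are chosen so that the topcoded incomes average $130,000, as in the example above.

```python
from statistics import mean

CUTOFF = 100_000  # illustrative topcode from the text


def topcode_to_cutoff(values, cutoff):
    """Current practice as described: values over the topcode become the cut-off."""
    return [min(v, cutoff) for v in values]


def topcode_to_mean(values, cutoff):
    """Alternative: values over the topcode become the mean of those values."""
    over = [v for v in values if v > cutoff]
    if not over:
        return list(values)  # nothing exceeds the topcode
    replacement = mean(over)
    return [replacement if v > cutoff else v for v in values]


incomes = [42_000, 87_000, 110_000, 150_000]  # hypothetical reported incomes

by_cutoff = topcode_to_cutoff(incomes, CUTOFF)
by_mean = topcode_to_mean(incomes, CUTOFF)
```

In this example the mean-based version keeps the file total equal to the total of the original incomes, while the cut-off version understates it; in both versions the individual high values are no longer visible.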
Current plans for the 1990 PUMS call for a "standard" 1% file and 5% file, as were produced for the 1980 PUMS. In addition, we are considering another file having only national geography but containing far more detail than the other files. For example, we are considering adding tract characteristics to each record. That is, we would append to each record information about the tract of residence, such as unemployment rate, percentage of minorities, median home value, etc. Such local detail would not be acceptable on a file with more specific geography, for fear one may be able to identify tract of residence based on tract characteristics. In addition, there is no reason, a priori, that income topcodes, and other topcodes as well, cannot be raised to allow more detail for the respective variables on a national file. This also represents a trade-off between various kinds of reduction of completeness -- geographic detail versus demographic detail -- in which we provide less of the former to obtain more of the latter.

It is important that users of public release microdata files contribute to the discussion of methods for the design of such files. Often options and choices are available, and to the extent that user priorities are known, efforts can be made to accommodate them.

IV. Summary

In this report, we describe methods used by the Census Bureau to reduce disclosure risk in the release of data products. We discuss tabular data and microdata, for which the issues are somewhat different. In a related paper (Greenberg 1990) we provide a detailed discussion of Census Bureau research efforts in the area of disclosure avoidance.

In the design of a data release strategy many options are typically available. The trade-off between loss of completeness and loss of accuracy is a theme that runs through much of the discussion. Plans are being made for the Public Use Microdata Samples from the 1990 Census.
It is important that data users contribute to the planning process by joining the discussion of options and choices and by indicating both needs and preferences.

References

Greenberg, B. (1990), "Disclosure Avoidance Research at the Census Bureau," Proceedings of the 1990 Annual Research Conference, Bureau of the Census, Washington, D.C. (to appear).

Griffin, R.A., Flores-Baez, L. and Navarro, A. (1989), "Disclosure Avoidance for the 1990 Census," Proceedings of the Section on Survey Research Methods, American Statistical Association, Washington, D.C. (to appear).

Griffin, R.A. and Thompson, J. (1987), "Confidentiality Techniques for the 1990 Census," presented at the Fall meeting of the American Statistical Association and Population Statistics Census Advisory Committees.

Navarro, A., Flores-Baez, L. and Thompson, J. (1988), "Results of Data Switching Simulation," presented at the Spring meeting of the American Statistical Association and Population Statistics Census Advisory Committees.

THE MICRODATA RELEASE PROGRAM OF THE NATIONAL CENTER FOR HEALTH STATISTICS

Robert H. Mugge, PhD
National Center for Health Statistics (retired)

My presentation will be in three parts: first, I shall describe the microdata release program of the National Center for Health Statistics (NCHS, or "the Center"); second, I'll explain the rules and procedures followed by NCHS in attempting to insure the confidentiality of the subjects of our data; finally, I shall discuss some concerns I have for confidentiality protection for these NCHS data and some suggestions for meeting the problems that I see.

Let me make clear that I am not speaking as a staff member of NCHS, but rather as one who retired from that staff nearly eight months ago after working in the confidentiality program of NCHS for quite a few years. So I am now speaking only for myself and not on behalf of the Center.
I am told that there have been no important changes in the Center's data security program since I left, and Mr. Israel, Deputy Director of the Center, has kindly reviewed this paper for current accuracy. But all opinions and commentary are strictly my own and not necessarily those of the Center.

The NCHS Microdata Release Program. The primary function of the National Center for Health Statistics is to develop and make available statistical information on the health of the U.S. population, on the vital statistics of the U.S., and related matters. This is clearly stated in the law authorizing the work of the Center (3). The Director of the Center decided many years ago that, carrying this mandate to its proper conclusion, the Center would make available its statistics in as full detail as possible for the use of scholars who wish to analyze these data. The covering policy statement is this:

"Within prevailing ethical, legal, technical, technological, and economic restrictions, it is the policy of the National Center for Health Statistics to augment its programs of collection, analysis, and publication of statistical information with procedures for making available, at cost, transcripts of data for individual elementary units -- persons or establishments -- in a form that will not in any way compromise the confidentiality guaranteed the respondent (6)."

Implementing this policy, NCHS has now for a long time, and with only rare exceptions, made available quite detailed data sets, known as Public Use Data Tapes, on all of its finished surveys and data reporting programs, together with full printed documentation (2).
These systems include the National Health Interview Survey; the National Health and Nutrition Examination Survey; the National Hospital Discharge Survey; the National Ambulatory Medical Care Survey; the National Nursing Home Surveys; the National Survey of Family Growth; several follow-up surveys; annual vital statistics on births, deaths, fetal deaths, marriages, and divorces; and various others (4). Many years ago these files took the form of boxes of punched cards; now for a long time they have been on one or more reels of magnetic tape; recently they have been made available on tape cassettes; and now the Center is moving into a program of producing the files on CD-ROMs. However, the material form of the data file is not relevant to the principles involved in data release.

Confidentiality Protection. As noted in the policy statement, the Center is very concerned that the confidentiality of data subjects in its surveys and reports be maintained. From its study of the problem NCHS has devised a set of rules for protecting the confidentiality of subjects -- persons and establishments -- whose information is included; these rules are stated in the NCHS Staff Manual on Confidentiality (5). The Center has a Confidentiality Committee, made up of high level staff, which reviews needs for policy changes and makes recommendations regarding them to the Director.

The rules followed in the Center for protecting confidentiality are of two kinds. One set of rules relates to data published in tabular form and provides limitations on the contents of published statistical tables; the other set of rules covers what may be included in the public use microdata files (5). But tables published by the Center are generally limited to what may be tabulated from the public use files, and, when this is the case, the former set of rules may be ignored if the rules on the microdata files are first met.
The rules regarding tabular production of data are designed to insure that no single cross-tabulation, or combination of cross-tabulations, may permit disclosure of a confidential characteristic of any identifiable individual or establishment. This possibility may be substantially dismissed in the cases of the Center's large scale surveys of persons, involving samples representing usually far less than one one-thousandth of the relevant population, provided that data are not presented separately for any small areas, in which a unique individual might stand out. But special care must be taken in the reporting of establishments, since these often involve large proportionate samples, and reporting data for larger areas -- perhaps even census regions -- may serve to disclose data on particular institutions. But in any event, if the necessary care has been taken in restricting the contents of the microdata set, and only this microdata set is used in building tables, it follows that there should be no disclosures resulting from the publication of these tables.

The rules for protecting confidentiality in public use microdata sets, as set forth in the Manual, are as follows (5, p. 19):

1) All direct personal or establishment identifiers, such as name, social security number, or address are purged from the file.

2) The file must not contain any other detailed information about the subject that could facilitate identification and that is not essential for research purposes (such as the exact date of the person's birth). It is often found necessary to give certain numerical information -- such as income, nursing home size, or costs and charges of institutions -- only in broad class intervals in order to avoid disclosure.

3) Geographic places that have fewer than 100,000 people are not to be identified in the file. (In practice much larger places often cannot be identified, such as when a State is known to have primary sampling units totalling less than 100,000.)
4) Characteristics of an area are not to appear in the file if they would identify an area of less than 100,000 people.

5) Information on the drawing of the sample which might assist in identifying a data subject must not be released outside the Center. Thus the identities of primary sampling units are not to be made available outside the Center. (I must say in all candor that NCHS seems to have lost control of that one, as it turned out that for several reasons the PSU identifications have unavoidably been made public.)

6) Before any new or revised microdata files are published, they, together with their full documentation, must be approved by the Director or Deputy Director. (When I was there this responsibility was delegated to me, and I reviewed all plans for public use microdata files, to make sure they complied with the governing rules and that no other data had crept in which might compromise confidentiality. In the Census Bureau there is a high-level Microdata Review Panel that reviews plans for each public use microdata set release [1]; NCHS did not feel that such an expenditure of staff time was necessary in its particular situation.)

7) Finally, NCHS required that before anyone outside the Center may be provided a public use microdata file, that person would be required to sign a statement called a "Data Use Agreement." This statement points out the legal requirement that no data obtained by NCHS under its mandate may be used for any purpose other than the purpose for which it was obtained, i.e., for statistical purposes (3, Section 308[d]); it notes that all appropriate precautions have been taken to keep the data safe from disclosure, but that there may still be a way that the data could inadvertently be used for identification.
In signing the agreement, the user states that he/she understands this and gives assurance that the data set he/she receives will not be misused, the security of the data will be protected, and no attempt will be made to reveal identities of data subjects, and, further, that if any subject is accidentally identified the user will work with the Center to make sure that this identification is not used, and procedures will be taken to assure that the identification cannot be repeated (2). NCHS requires signing of the Data Use Agreement, even though it may have no force in law, because of its information value and its assumed effect in raising the sensitivity of data users on the importance of protecting the files.

NCHS does not doctor the data in other ways to avoid disclosure. It does not substitute any false data, nor does it do anything like data swapping. It has been determined that any such procedure would lessen the value of the files for purposes of research, and it is felt that this is most undesirable to do, as the nation has an important stake in getting the best research possible using these health-related data. But primarily it has been considered unnecessary to doctor the data. The files contain many errors already -- normal errors in data collection and processing -- although the Center tries hard to minimize them. So the user cannot have absolute confidence in the information, especially as it relates to individual subjects. The Center is reluctant to add additional errors.

Is the system working? It seems to be. In all the years it's been operating the Center has never heard of a case in which a disclosure has been made through one of the survey files. That is comforting, but the comfort is mitigated by the knowledge that the Center wouldn't necessarily ever hear about it if a survey file were compromised. There is also another piece of evidence as to the effect of the confidentiality program.
I don't think the Center has ever received a complaint from the public that it isn't protecting confidentiality adequately in the data files program. But there have been many complaints from researchers that the Center isn't releasing enough information to them. Since we felt that, if anything, the Center erred in being too liberal in releasing data, the researchers' complaints encouraged me to feel that the Center's balance may have been about right.

Commentary

Is this confidentiality-protection program, then, good enough? There cannot be perfect confidentiality protection if any microdata are to be released. So each organization must seek to find the proper balance between the public's needs for data and the appropriate measures for protecting confidentiality (7, pp. 1-2). I have described the compromise position reached by NCHS.

In the large surveys it conducts NCHS obtains a great deal of information, which may be considered independent or dependent variables for data analysis, on each individual subject (4). This mass of data may constitute a fingerprint about an individual; there may be no one else with this particular set of characteristics. So if one had another source of such information about individuals (or establishments) then it would be easy to match them up and disclose all the new information in the survey on the identified individual. Fortunately, no such files exist about individuals on any mass basis. There may be similar files in other sample surveys, but the chance of overlap is so small that the likelihood can be dismissed. This is not true about establishments; there are lists of nursing homes, hospitals, and clinics that could be used to identify them if the file contains the right kinds of characteristics.
So great care must be taken in determining what can be published about characteristics of establishments included in data files; we've made some last minute discoveries on certain file-release plans which we hope enabled us to make these files disclosure-proof.

For those concerned about inadvertent disclosures there is this consolation: the vast majority of information in our data files is quite innocuous. Much of it is obvious, at least within certain limits, or already well known by individuals' associates, such as the person's sex, age, and weight, and various obvious health conditions. If you found a friend's file in one of the surveys you are not very likely to learn through it something you didn't already know, or, if you did learn something new it would probably be quite harmless. So chances are that data subjects are not likely to sustain any harm or embarrassment from having their data disclosed. This, however, would certainly not excuse an agency from doing its utmost to keep its promise of confidentiality to the data subject.

Historically, however, there have always been some sensitive items in the data files -- information that could cause harm and embarrassment. There have been the early and late effects of venereal disease; there have been some diseases which at times have carried stigmas, such as leprosy and cancer; and there have been surveys on social behavior, such as sexual practices, which if the information got out could cause considerable harm. And now it seems that society needs to obtain new information which may carry threats of individual harm beyond that brought by any data in the past. I have two particular examples: one is AIDS or its precursor, the HIV virus; the other is information on the sexual practices of unmarried teen-age girls, as obtained in the latest cycle of the Family Growth Survey.
(The latter information is obtained with the approval of parents, but the parents are not given the information obtained from the girls.) There is so much concern about protection of the AIDS or HIV information that, so far, surveys to obtain it are only being done using procedures which guarantee the anonymity of subjects, even from the data collecting agency. Now, with all of NCHS's efforts to protect the confidentiality of the data, there is one scenario that haunts me. That is this: If someone knows well a survey subject and knows the person was in a particular survey and at some time has access to that survey file, then he/she could easily use his/her knowledge of the person to locate the person's file in the data set. Then all the information about that person obtained in the survey will be laid out before them. I don't know of any reasonable way to avoid this possibility, beyond what the Center is already doing, especially through the Data Use Agreement. I wouldn't like the idea of warning data subjects not to tell their friends and relatives of their being in the survey sample, when the Center is trying to get good publicity for the survey and urging people to cooperate. Of course, though, if it is found that a data subject is advertising being in the survey, that person is out. (Last year a college professor wrote in to tell the Center staff how interested she was in having been included in the Family Growth Survey. She had told her class all about it. The Center wrote back to say that her interest was appreciated, but her record was being removed from the survey sample.) There is, however, one policy response I think the Center should make to the scary scenario I alluded to. I think the Center should lean over backwards to assure that there is as little sensitive information as possible in the public use data files. 
I think that NCHS should, for example, make sure that there is no AIDS or HIV-positive information in any survey files published where there is any possibility of individual disclosures. By the same token, the Center should not release in microdata files the Family Growth Survey data on sexual practices of unmarried women; I personally feel that that entire survey is too sensitive to justify any microdata release from it. That will make the researchers angry with the Center, and the Center should arrange to do the special tabulations and analyses that the outside researchers want, within limits of practicality.

Eternal vigilance is the price of good data confidentiality protection. But a prime needed ingredient in a successful confidentiality protection program is clout! If the head of any agency producing statistical files is not keeping a close eye on protective procedures and lending her/his authority to maintaining a strict program of protection, that statistical program -- along with all the people depending on it -- could be in big trouble. I think the protection program in NCHS has been successful, and that success owes much to the continuing concern and support it has received from the Center's Directors.

References

(1) Gates, Gerald W., "Census Bureau Microdata: Providing Useful Research Data while Protecting the Anonymity of Respondents." U.S. Bureau of the Census, Program and Policy Development Office. Paper prepared for presentation at the annual meetings of the American Statistical Association in New Orleans, August 1988.

(2) National Center for Health Statistics, Catalogue of NCHS Public Use Data Tapes. DHHS Publication No. (PHS) 88-1213. U.S. Department of Health and Human Services, Public Health Service, Centers for Disease Control. Hyattsville, MD, July 1988.

(3) ____, Current Legislative Authorities Enacted as of December 1989: Sections 304, 306, 307, and 308 of the Public Health Service Act. U.S.
Department of Health and Human Services, Public Health Service, Centers for Disease Control. Hyattsville, MD, April 1990.

(4) ____, Data Systems of the National Center for Health Statistics: Programs and Collection Procedures, Series 1, No. 16. U.S. Department of Health and Human Services, Public Health Service, Office of Health Research, Statistics, and Technology. Hyattsville, MD, December 1981. More recent versions may be available in unpublished form from NCHS.

(5) ____, NCHS Staff Manual on Confidentiality. DHHS Publication No. (PHS) 84-1244. U.S. Department of Health and Human Services, Public Health Service. Hyattsville, MD, September 1984.

(6) ____, Policy Statement on Release of Data for Individual Elementary Units and Special Tabulations. DHEW Publication No. (PHS) 78-1212. U.S. Department of Health, Education, and Welfare, Public Health Service. Hyattsville, MD, May 1978.

(7) Subcommittee on Disclosure-Avoidance Techniques, Federal Committee on Statistical Methodology, Statistical Policy Working Paper 2: Report on Statistical Disclosure and Disclosure-Avoidance Techniques. U.S. Department of Commerce, Office of Federal Statistical Policy and Standards. Washington, D.C., May 1978.

DISCUSSION

George T. Duncan
Carnegie Mellon University

Serving the public, federal statistical agencies must balance the respondent's need for privacy and the researcher's need for information. Ultimately, how the balance is struck should be the result of the political process, which in the U.S. is complex, indeed. What the agencies can contribute to this is the development of procedures, both administrative and statistical, that for a given level of privacy protection maximize access to data and that for a given level of access maximize privacy protection. Brian Greenberg, of the U.S.
Bureau of the Census, and Robert Mugge, retired from the National Center for Health Statistics, have given us a clear perspective on the data dissemination policies and practices of two major federal statistical agencies. As chair of the Panel on Confidentiality and Data Access that is co-sponsored by the Committee on National Statistics and the Social Science Research Council, I find their work of special importance -- both to the panel and to all federal data users and providers.

Brian Greenberg, both here and in a recent paper (Greenberg, 1990), describes some disclosure limitation practices at the Census Bureau. He properly emphasizes the need in data dissemination for a tradeoff between "amount of information" and "level of disclosure risk". He identifies the open research problem of quantifying each to be meaningful for disclosure-limited data dissemination. While mathematical arguments are useful in this quantification, essentially the task is a decision-theoretic one that incorporates the motivations of the stakeholders. Since these stakeholders are various -- including individual respondents, other government agencies, academic researchers, market researchers, commercial planners, the media, and lobbying groups -- the measures developed should be multivariate in nature.

He notes in Greenberg (1990) that:

The first general purpose public use microdata file released by the Census Bureau was the 1 in 1,000 sample from the 1960 Census of Population and Housing. This file was released in 1963. A few years later a public use microdata file from the Current Population Survey was released. At present, public use microdata files are released as standard products from virtually all demographic surveys, and they are extensively used by researchers in many areas. In fact, the public use microdata files for the Survey of Income and Program Participation form the major data product from the survey.
The Public Use Microdata Samples from the Decennial Censuses are becoming increasingly important to users, especially researchers in the social sciences, and these files are gradually replacing the Summary Tape Files for many research applications.

Against this history, his paper focuses on statistical procedures that the Census Bureau uses to limit disclosure risk, rather than procedures -- whether statistical or administrative -- for expanding research access. While the disclosure limitation aspect is important, I would have liked to have seen more attention paid to the ways the Census Bureau has actively tried to make information available.

Much more than Robert Mugge, Brian Greenberg emphasizes masking procedures, both for disclosure limitation in tabular data and for microdata. He draws a nice conceptual distinction between the effect of disclosure limitation on data utility through loss of accuracy and loss of completeness. For tabular data disclosure limitation, he shows how cell suppression and the technique of a Confidentiality Edit can be employed. For microdata disclosure limitation, he stresses grouping of quantitative variables, including topcoding. In masking some special purpose files, noise is also added to lower the likelihood of a successful computer match with publicly available files having identifiers.

Greenberg describes the work of the Microdata Review Panel, which is broadly representative of key components of the Census Bureau. Commendably, the Census Bureau, through work of Gerald Gates and the Microdata Review Panel, has been open to suggestions of ways of expanding researcher access to data.

In thinking about Greenberg's paper, certain questions nagged me. How useful will researchers find the data after it has been massaged by the various disclosure limitation procedures? A research effort is needed on appropriate ways of analyzing masked data.
What do researchers need to be concerned about in analyzing data that has been "Confidentiality Edited"? Some researchers will ignore the fact that the data has been altered, and hence produce misleading conclusions from their analysis. Researchers need to be carefully informed about the limitations of standard analyses of the edited data. This requires a study of how researchers in practice respond to various caveats attached to the released data.

Do the thresholds on geography (100,000 persons or 250,000 for SIPP) have any basis in theory or do they just "feel right" to the Microdata Review Panel? In another paper to be presented at the American Statistical Association meetings in August, Brian Greenberg and Laura Voshell relate the size of geographical units to the percentage of unit records. This is a good start but the direct tie to disclosure risk is not yet made.

Are any special disclosure-limiting procedures used for longitudinal data?

How can data users best be brought into the decision making process? Specifically, how can agencies insure that data users help identify what in data accuracy can best be spent to buy disclosure limitation?

How do respondents view the level of disclosure limitation provided by these procedures?

Robert Mugge has described what is by all accounts a successful program in microdata release in an important area for public policy. As with Brian Greenberg's paper, I would like to highlight certain aspects of why I think the program is successful and then focus on some specific concerns that I have for the future.
I believe the program's success stems from the basic policy statement of the National Center for Health Statistics:

Within prevailing ethical, legal, technical, technological, and economic restrictions, it is the policy of the National Center for Health Statistics to augment its programs of collection, analysis, and publication of statistical information with procedures for making available, at cost, transcripts of data for individual elementary units -- persons or establishments -- in a form that will not in any way compromise the confidentiality guaranteed the respondent.

The three reasons I see are these:

First, NCHS has taken seriously its mandate to make microdata available to researchers. Its focus is not predominately on data collection but equitably on data dissemination. This balance ensures good stewardship, not the hoarding of quality data but rather its investment in the work of researchers who can advance the public good. In thinking about how disclosure limitation relates to researchers, I should point out that researchers come not just from academia, as faculty members at Carnegie Mellon, but also from the media, as reporters for the New York Times, and lobbying groups, as analysts for the American Association of Retired Persons, say.

Second, NCHS explicitly recognizes that there are constraints on microdata dissemination that have ethical, legal, technical, and economic dimensions. This cues where to look for potential problems.

Third, NCHS does not guarantee that identifiability is impossible but instead links to the implicit contract that it has established with the respondent. This is both realistic and responsible.

In implementing this policy, NCHS has made Public Use Data Tapes of key surveys available -- in accord with the directions of a Confidentiality Committee and under rules well stated in the NCHS Staff Manual on Confidentiality.
So NCHS does in fact deliver the data, but makes sure that there is well-identified administrative oversight and that the policies exist in written form for reference by agency staff, researchers, and the interested public. Further, through the "Data Use Agreement" NCHS encourages the receiver of the data to assume some responsibility for proper use of the data. This agreement, while not legally binding, educates the researcher on the restriction to statistical use of the data, provides that no attempt will be made to identify data subjects, and provides for appropriate action in the case of accidental identification.

In pointing to the future, I would like to fix on a few concerns:

Why should NCHS not emulate the Census Microdata Review Panel? And indeed go further by including representatives of both the respondent and the user communities? Through internal representation from various areas within NCHS it might further consistency of application of the policies. Through external representation it might foster responsible interaction with both respondents and researchers. It might help ensure that respondents got meaningful information about agency practices and intentions so that in authorizing how their responses are to be used, they would be properly informed. It might help ensure that researchers' needs were addressed and that data quality was not unduly sacrificed in the name of confidentiality protection.

Why not have the Data Use Agreement be more binding on the researcher? Possible mechanisms for this might be legal requirements (such as legal sanctions or use of binding contracts), economic incentives (such as use of bonds or returnable license fees), or administrative practices (such as restrictions on further access).

Why not use data masking in certain cases where the data might not be releasable otherwise? This requires that enough information be provided to the researcher that an appropriate analysis can be carried out.
It also requires that suitable techniques be developed for the analysis of masked data.

Might not advances in computer technology increase the prospect of linkage of a record with an identifier with a released record, even when only a sample is released? After all, in a recent JASA article, Bethlehem, Keller, and Pannekoek note that in a certain region of the Netherlands having 23,485 households composed of a father, mother, and two children, with just a six-item key of ages (in years) and gender, 16,008 of the households were unique. Presumably, a plausible model could be constructed in which -- with a bit more detail -- a data intruder who matched a record in the sample uniquely would also place high probability that it is a unique match in the population.

Should not special administrative procedures be developed for establishments -- like nursing homes and hospitals -- that cannot be reasonably assured that they would not be identifiable? For example, large hospitals might be asked for authorization to include their data in a public use file.

Are many variables really innocuous? Marital status, age, and weight under certain circumstances are sensitive, let alone the details of sexual practices required in AIDS-related surveys.

Can an agency depend on naturally-occurring errors to provide confidentiality protection? Introducing noise with a distribution known to the user may be effective. Research is ongoing in this area.

To sum up, I think with the attention of professionals such as Robert Mugge and Brian Greenberg that the growing tension between privacy concerns and demand for data access can better be mediated. To take a cue from Fritz Scheuren, we need DANTOTSU, Japanese for "choosing the best of the best". The federal statistical agencies can then be better stewards of the data our citizens provide to further the public interest.

References

Bethlehem, Jelke G., Keller, Wouter J., and Pannekoek, Jeroen (1990) Disclosure Control of Microdata.
Journal of the American Statistical Association 85, 38-45.

Greenberg, Brian (1990) Disclosure Avoidance Research at the Census Bureau. 1990 Annual Research Conference, Bureau of the Census, Arlington, VA, March 18-21.

Session 12 FEDERAL LONGITUDINAL SURVEYS

FEDERAL LONGITUDINAL SURVEYS

Daniel Kasprzyk
U.S. Bureau of the Census

Curtis Jacobs
U.S. Bureau of Labor Statistics

I. Introduction

During the 1960's and 1970's, panel surveys -- surveys in which similar measurements are made on the same sample at different points in time -- became a popular tool for social science and policy research. Boruch and Pearson (1985) indicate 64 national surveys of this kind were carried out during that period of time. The apparent popularity of such survey designs prompted the Office of Management and Budget's Federal Committee on Statistical Methodology (FCSM) to form a subcommittee on "federal longitudinal surveys" during the Spring of 1983 under the chairmanship of Barbara Bailar and Daniel Kasprzyk. Maria Gonzalez, chair of the FCSM, provided organizational and staff support to the subcommittee. The subcommittee's goals were very general -- to identify the strengths and limitations of longitudinal surveys, and to propose some guidelines for using them more effectively.

The Subcommittee on Federal Longitudinal Surveys was composed of the following members: Barbara Bailar (co-chair, Bureau of the Census), Daniel Kasprzyk (co-chair, Bureau of the Census), Barry Bye (Social Security Administration), Dennis Carroll (National Center for Education Statistics), Robert Casady (Bureau of Labor Statistics), Steven B. Cohen (Agency for Health Care Policy and Research), Lawrence Ernst (Bureau of the Census), Maria Gonzalez (Office of Management and Budget), Catherine Hines (Bureau of the Census), Curtis Jacobs (Bureau of Labor Statistics), Inderjit Kundra (Energy Information Administration), and Bruce Taylor (Bureau of Justice Statistics).
This paper follows the general outline of the working paper developed by the OMB subcommittee. We discuss the advantages of longitudinal surveys, managing longitudinal surveys, some activities related to longitudinal survey operations, estimation, some persistent issues in longitudinal surveys, and data user issues.

II. Definitions

Terminology in this area of social science research has not been standardized. Kish (1987) describes longitudinal studies as a generic term referring to a wide variety of studies done over time. Duncan and Kalton (1987) prefer to use the word "longitudinal" in the context of data, thus permitting longitudinal data to be collected in either a panel or a cross-sectional (retrospective) survey. The subcommittee chose to combine two components, design and data, into the definition of longitudinal survey adopted for the report. The distinguishing features of a longitudinal survey are: 1) repeated data collection for a sample of observational units over time; 2) the linkage of data records for different time periods to create a longitudinal record for each observational unit; and 3) a principal analysis based on the data collected over time. The subcommittee's definition is more restrictive than that adopted by Duncan and Kalton or Kish, since it counts as longitudinal only those surveys in which the sample unit is followed, microdata are assembled, and longitudinal analysis is included as part of the estimation plan.

III. Advantages of Longitudinal Surveys

A longitudinal survey is usually needed to measure and study micro-level dynamics -- changes in attitudes, changes in prices, changes in economic well-being, for example -- or to improve the measurement of certain important concepts (Pearson, 1989).
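
As a rough illustration of why linked records matter for studying micro-level dynamics, the following Python sketch links two waves by a unit identifier and contrasts net change with gross change. The data, the identifiers, and the function names (`link_waves`, `gross_flows`, `net_change`) are hypothetical and are not drawn from any survey discussed here.

```python
from collections import Counter

# Hypothetical two-wave data: unit identifier -> employment status.
wave1 = {"p01": "employed", "p02": "employed", "p03": "unemployed",
         "p04": "unemployed", "p05": "employed"}
wave2 = {"p01": "employed", "p02": "unemployed", "p03": "employed",
         "p04": "unemployed", "p06": "employed"}


def link_waves(first, second):
    """Build one longitudinal record per unit observed in both waves."""
    common_ids = sorted(first.keys() & second.keys())
    return {uid: (first[uid], second[uid]) for uid in common_ids}


def gross_flows(linked):
    """Count transitions as (status at wave 1, status at wave 2) pairs."""
    return Counter(linked.values())


def net_change(linked, status):
    """Change in the number of linked units holding the given status."""
    before = sum(1 for a, _ in linked.values() if a == status)
    after = sum(1 for _, b in linked.values() if b == status)
    return after - before


linked = link_waves(wave1, wave2)
flows = gross_flows(linked)
changers = sum(n for (a, b), n in flows.items() if a != b)

print("linked units:", len(linked))
print("net change in employed:", net_change(linked, "employed"))
print("units that changed status:", changers)
```

In this invented example the number of employed units among the linked records is the same at both waves, yet two of the four linked units changed status. Two independent cross-sections would show only the unchanged total. Units seen in only one wave (`p05`, `p06`) drop out of the linked file, which foreshadows the attrition and universe-definition issues discussed later.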
Some advantages of obtaining repeated measurements on the same sample unit over time are:
1) multiple interviews of the same sample unit reduce sampling variability on estimates of change;
2) a matched longitudinal data set provides a better measure of components of individual change; that is, measures of gross change for the unit at two points in time;
3) a longitudinal survey is capable of obtaining a wider range of variables from each sampled element than is possible from a repeated survey of cross-sections;
4) longitudinal surveys with relatively short reference periods may reduce telescoping errors that occur when respondents misplace the timing of events;
5) longitudinal surveys with relatively short reference periods can be used to produce aggregated data for a longer time period -- a year, for example.

While longitudinal surveys are advantageous, they do not solve all data collection problems. In fact, they create some additional problems which will be discussed later.

IV. Managing Longitudinal Surveys

Managing large, complex longitudinal surveys has much in common with managing large, complex cross-sectional surveys. Successful project management techniques and the issues surrounding the successful execution of a project should not depend on the design of the project. There are, however, nuances in the case of longitudinal surveys that are important to recognize. They are: 1) inordinately high expectations for the project; 2) budget planning; 3) content, procedural, and methodological innovations; and 4) changes in the data collection organization.

Expectations associated with longitudinal data collections typically run high. The set of analysts interested in the data set as a vehicle for answering their own research questions is often broad and diffuse. The sum of these expectations, together with the project staff's own, almost by definition must exceed what is achievable in the short run.
Grasso and Kohen (1978) make this point concerning the National Longitudinal Surveys (NLS); similarly, Duncan and Morgan (1984) admit that, judged by the expectations for the Panel Study of Income Dynamics (PSID), the investment in the study could not have been profitable.

Long range planning and budget planning play an important role in the development of a longitudinal survey. A long range planning document laying out the budget, analysis plan, instrument development plans, staffing plans, survey procedures, and anticipated products is one way to assist senior agency officials in understanding the need for the project; it also provides a baseline document for the survey.

Another aspect of the management of longitudinal survey operations is the persistent tension between maintaining the status quo and making corrections and alterations to the instrument and processing system. A serious analysis of the trade-offs, from the cost as well as the analytic point of view, ought to be made before making a change.

Long-term longitudinal surveys, such as the NLS and the surveys sponsored by the National Center for Education Statistics, can be spread over a decade or more. These surveys, when contracted to private sector survey research organizations, usually have periodic recompetition for the contract. However, a change in data collection organization can be very traumatic to the longitudinal survey project if not properly planned for. A very detailed level of documentation of methods is required to ease the transition, if it should be necessary.

V. Longitudinal Survey Operations

The differences between field and processing operations in one-time cross-sectional surveys and longitudinal surveys are created by the time dimension.
For example, time enters in the selection of new units into the sample, in identifying and matching the same sample unit from round to round of the survey, in following sample units from one interview to the next, and in the way longitudinal products are released. We discuss these below.

A. Maintaining the Composition of the Sample

The composition of the sample may be expected to change across waves for many reasons. Respondents may refuse to participate, may not be home, may die, may be institutionalized, or may go abroad. To reduce the effects of these problems, some continuing panel surveys routinely introduce new sample units at certain points in time within a panel. These designs are called rotating panel designs. Other longitudinal surveys, such as the Panel Study of Income Dynamics (Duncan, Juster, and Morgan, 1986), argue that the representativeness of the sample of the entire population of families and individuals can be maintained over time through rules that allow families and individuals to enter the sample with known selection probabilities.

B. Following Individuals Over Time

The issue of whom to follow in a longitudinal survey, and how intensively to follow individuals over time, is directly related to the analytic uses of the data, the amount of time between interview rounds, and the budget of the survey. Analytic uses should drive the operational decisions of whom to follow and how far. If the basic sampling unit and unit of analysis is the individual, then the following rules consist of following all individuals originally selected into sample. These are generally called cohort studies. Another design, labelled by Cox and Cohen (1985) as a longitudinal household design, consists of the individual as the basic sampling unit.
The dwelling unit is sampled and all individuals living in the dwelling unit are selected into sample at the first round of interviewing and are interviewed in subsequent rounds whether or not they reside at the original sample address. In order to develop household and family estimates for the dwelling units, data are obtained from all individuals living at the address of the person originally identified as a sample individual.

Tracing is directed toward obtaining the current address of the survey respondent. Some people move great distances and are difficult to trace; others may not want to be traced. The survey organization must establish a set of information sources that are capable of providing current address information for individuals who move. Several surveys obtain information from the respondent at the end of each interview on the name and phone number of a person who will always know the sample person's whereabouts. Other information sources can be developed by the interviewer through the sample person's friends, relatives, and other contacts established through the respondent, such as neighbors, employers, and directories (Burgess, 1989).

The mode of interview may affect the types of tracking techniques used. Personal visit surveys will use mail or telephone tracking as well as placing heavy reliance on the interviewer for creative solutions to finding respondents. Telephone surveys are likely to rely on the telephone for tracking, but rarely send staff into the field (Cantor, 1989).

An operational concern in tracing respondents is the additional costs incurred by field staff. White and Huang (1982) have estimated that during a one year time period (4 interviews) of the Income Survey Development Program (ISDP) Panel, the number of interviewing hours increased by 7% and the number of miles charged by the interviewer increased by 22%, due to the cost of following movers and interviewing additional households.
However, NCES found that per-unit tracing costs for NCES' High School and Beyond Survey were approximately 20% less than the cost of base year sampling, indicating the potential economies of longitudinal surveys (Office of Management and Budget, 1986).

C. Linking Analysis Units Between Waves

What is the point in conducting a longitudinal survey if data from successive interview rounds cannot be successfully brought together for analysis? Obviously, linking or matching variables must be created to permit the merging of data over time. Complications arise when consideration must be given to multiple units of analysis. Surveys which are intended to follow individuals, regardless of their association with the sampled household location or address, simply assign an independent and unique person identification number to each individual.

The Survey of Income and Program Participation (SIPP) is illustrative of surveys requiring linkage variables to allow analysis at various levels -- household, family, person, and event. The SIPP has a complicated variable, described by Jean and McArthur (1984, 1987), which ensures that the identification number remains constant regardless of changes in address and household composition. The National Medical Care Expenditure Survey (NMCES) took an altogether different approach by using the identification variable for internally matching the rounds of interviews and providing the public with matched rounds of data. As a consequence, round-to-round matching was unnecessary for the public.

D. Operational Changes Over Time

Changes in the administration and operation of longitudinal surveys seem to be inevitable; these changes, however, are likely to make comparisons difficult to assess and interpret. One needs to recognize that aspects of the survey design and data collection that change during the course of the survey may influence results.
This is not to say that one should never make any changes; rather, one needs to be aware of the consequences of actions taken and attempt to measure the effects of such changes.

E. Operational Variations in Longitudinal Data Products

As with any complex study, many variations are possible in the processing and development of longitudinal data products. Three illustrations give a sense of the options available.

The main Panel Study of Income Dynamics (PSID) data files contain information gathered since the beginning of the study in 1968 and are updated on an annual basis. Thus, each wave of data is released together with the data previously made available.

Because the length of the survey is predetermined, the National Medical Expenditure Survey (NMES) prefers to wait until the conclusion of the panel to develop its longitudinal products. This survey program uses the multiple interviews as a vehicle to revise data or fill in data not completely reported in earlier interviews.

The SIPP, on the other hand, first releases individual wave public data files to allow researchers the opportunity to analyze each wave as a separate cross-sectional data set and to provide the ability to develop their own multiple interview data set. A longitudinal data file for the entire panel (32 months) is released as a separate product only after all the individual wave products are released.

VI. Estimation

Three issues stand out in considering estimation from the longitudinal point of view: 1) defining the longitudinal universe; 2) defining longitudinal unit concepts; and 3) the treatment of missing data.

A. Defining the Longitudinal Universe

The target population of a longitudinal survey must deal with the consequences of birth, death, and mobility during the life of the study.
Unlike cross-sectional studies, which fix the population at a specific point in time and make inferences only about the time the sample was drawn, some longitudinal studies may be concerned with drawing inferences about a nonstationary target population whose composition is changing over time.

Judkins et al. (1984) describe three methods for defining a longitudinal universe. One method selects a specific time during the course of the study as the point that defines the universe. If the universe is defined at the time of sample selection, it is called a "cohort" study. A second method of defining a longitudinal universe looks at more than one point in time. Several time points are selected, each one defining a universe. The entire set of units defined by these different cross-sectional universes is included in the longitudinal universe. A third method of defining a longitudinal universe includes only units common to all selected time periods; that is, in this approach one includes only those elements which were members of all cross-sectional universes. This universe contains only those units which did not enter or exit the survey universe and as a consequence is a static universe.

B. Defining Longitudinal Unit Concepts

Some longitudinal surveys, the prime examples being the NMCES and the SIPP, have undertaken the task of conceptualizing annual units of analysis using subannual data. Longitudinal analyses of a sample of households, families, or establishments must deal with the problems brought on by changes in the composition of these units. When a household or family splits up as a result of a divorce or separation, which of the two units is the same as the original unit? Dicker and Casady (1982) and McMillen and Herriot (1985) discuss this topic for the NMCES and SIPP respectively. Statistical estimation of longitudinal concepts is discussed by Ernst (1989) and Folsom, LaVange, and Williams (1989). Note that the acceptance of such concepts is not universal.
Duncan and Hill (1985) argue that defining these concepts is unnecessary since all relevant analysis can be done at the individual level.

C. The Treatment of Missing Wave Data

In cross-sectional surveys, nonresponse is categorized in two ways: unit (total) nonresponse and item nonresponse. In longitudinal surveys, however, a third type of nonresponse exists -- wave nonresponse. Wave nonresponse occurs when a sample unit does not respond in one or more waves of a longitudinal survey. In this situation, considerably more data are missing than in the item nonresponse situation; however, considerably more data are available for use in nonresponse compensation strategies than in the unit nonresponse situation.

Solutions to the issue are not clear cut. Weighting and imputation, the methods used to compensate for nonresponse, have their own advantages and drawbacks (Kalton, 1986; Lepkowski, 1989). Cox and Cohen (1985), Kalton and Miller (1986), and Mulvihill and Lawes (1980) have conducted empirical investigations into the relative quality of imputation and weighting as nonresponse compensation procedures in panel surveys. They found little difference when cross-sectional estimates were of interest. They show, however, that some forms of imputation are clearly inferior when longitudinal analysis is of interest.

Singh, Huggins, and Kasprzyk (1990) advocate imputation for a restricted set of missing data patterns. This point of view is consistent with that expressed by Lepkowski (1989), where consideration is given to combined strategies of imputation and weighting.

VII. Persistent Issues in Longitudinal Surveys

Longitudinal surveys theoretically offer the opportunity to measure change at the individual level as well as the opportunity to improve the overall measurement of data that are difficult to collect. In practice, two kinds of nonsampling error issues arise that play a significant role in longitudinal surveys: nonresponse and conditioning.
A third issue, the role of nonsampling error in the measurement of gross flows, remains a complex and persistent problem for longitudinal surveys.

A. Attrition

A major concern as a longitudinal survey ages is the loss of representativeness of the sample due to nonresponse. Typically, the largest nonresponse occurs in the first several interviews, with the wave-to-wave change in sample loss decreasing during the panel. Frequently it is not clear what the nature and character of sample loss is. See Kalton, Kasprzyk, and McMillen (1989) for illustrations of sample loss in selected longitudinal surveys.

The picture of nonresponse that we typically see is varied and likely dependent on factors such as the frequency of interviews, difficulty in following or tracing respondents, sample composition, the length of the longitudinal survey, quality of the field staff, content of the questionnaire, and the efforts made to retain the sample. In general, nonresponse rates in longitudinal surveys increase over time as one would expect, but the rate of increase declines or stabilizes over time. However, we would not minimize the importance of the observation that cumulative overall nonresponse rates can be substantial over the length of a panel.

Analytic difficulties can occur if the nature of the nonresponse problem is not well understood. Too often, little is done to describe the problem. Descriptive studies such as those done by McArthur (1988) and McArthur and Short (1985) for the SIPP provide some insight into differences between respondents who participate in all waves and those who miss one or more interviews. Other studies aim to assess whether the current wave sample differs systematically from the original sample. See Rhoton (1986) for example.
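
The way modest wave-to-wave losses accumulate over a panel can be sketched in a few lines of Python. The loss rates below are invented for illustration and are not taken from any survey cited here. The calculation also assumes pure attrition (a unit that leaves never returns), which is simpler than the wave nonresponse patterns real panels show. The function name `cumulative_response` is hypothetical.

```python
def cumulative_response(wave_loss_rates):
    """Return the cumulative response rate after each wave.

    wave_loss_rates[i] is the share of units still in the sample before
    wave i + 1 that is lost at that wave (a conditional loss rate).
    Assumes pure attrition: lost units never return.
    """
    retained = 1.0
    path = []
    for loss in wave_loss_rates:
        if not 0.0 <= loss <= 1.0:
            raise ValueError("loss rates must lie in [0, 1]")
        retained *= 1.0 - loss
        path.append(retained)
    return path


# Hypothetical pattern: largest loss at the first wave, then declining.
losses = [0.10, 0.05, 0.03, 0.02]
path = cumulative_response(losses)

for wave, rate in enumerate(path, start=1):
    print(f"wave {wave}: cumulative nonresponse {1.0 - rate:.1%}")
```

Under these assumed rates, no single wave after the first loses more than 5 percent of the remaining sample, yet cumulative nonresponse after four waves is roughly 19 percent. This is the pattern described above: a declining rate of increase alongside a substantial cumulative total.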
Another approach to understanding whether responses in later waves have a potential bias is to compare, for subsequent respondents and nonrespondents, the distributions of responses to questions asked earlier in the panel. This approach was taken by Petroni and King (1988) to study the effect of SIPP's cross-sectional nonresponse adjustment variables in accounting for attrition in later waves. Finally, another approach was taken by the PSID. Becketti et al. (1983) identified a particular analysis (e.g., regression analysis) of data obtained from earlier waves and included variables indicating subsequent response status to provide evidence that nonresponse bias was not present in the PSID.

B. Time-In-Sample Bias

Time-in-sample bias refers to the concept that individuals' responses to the survey instrument may change due to the length of time an individual has been in the survey (that is, the number of times interviewed). Evidence of this bias has been found in estimates of unemployment, where higher rates of unemployment are observed among individuals in sample for the first or second time (Bailar, 1989). Other surveys have observed a time-in-sample phenomenon. See Neter and Waksberg (1964) and Woltman and Bushery (1975).

In essence, this effect may occur because the early interviews in a longitudinal survey change either the respondents' behaviors or the way they answer the questions. Similarly, the interviewers' behavior and approach to the respondents may change. In practice, this bias is difficult to estimate because of a variety of changes taking place between rounds of a longitudinal survey, especially attrition. Unfortunately, even documentation of the existence of this bias is very difficult, requiring either a rotating panel design in which fresh replicate samples are added to the panel or an independent replicate sample implemented specifically to address this issue. Bailar (1989) and Kalton, Kasprzyk, and McMillen (1989) review several studies.

C.
Measurement Error

One of the presumed benefits of longitudinal surveys is their theoretical ability to measure change at the individual level. The difficulty, however, is that change measures are very sensitive to individual measurement errors. Kalton, Kasprzyk, and McMillen (1989) identify aspects of panel surveys that may lead to measurement error: 1) simple response variability; 2) wave-to-wave changes in respondents; 3) changes in data collection mode; 4) wave-to-wave changes in interviewers; 5) wave-to-wave changes in questionnaires; 6) changes in the interpretation of questions over time; 7) wave-to-wave changes in coders; 8) imputation; 9) time-in-sample bias; 10) matching interviews across time.

Any one of the above, or all in some combination, may make the measurement of gross change problematic. A reporting error in the data at one point in time, corrected at another point in time, can lead to spurious measurements of change. Analytical difficulties in this type of analysis can be mitigated somewhat by sensitivity of the data collection organization to the problems; for example, detailed field edits, proper documentation of imputations, and identification of nonrespondents can help analysts in understanding their results.

VIII. Data User Issues

As discussed above, many statistical and measurement issues occur in the development of large-scale national surveys. In particular, many of these issues are exacerbated in longitudinal surveys in which repeated observations are taken on the same unit at several points in time. We believe the myriad of issues and their consequences places responsibility on the sponsors of such activities to provide substantially more documentation and guidance on the nature and extent of errors, both sampling and nonsampling errors.
It is impossible to control and determine the effects of all the various sources of error; nonetheless, most of us, ourselves included, can make greater efforts at evaluating the quality of survey data and documenting indications of nonsampling error that are likely to make a difference in longitudinal data analysis.

Developing a "quality profile" summarizing in a convenient form what is known about the sources and magnitudes of errors in estimates should be done periodically for large multipurpose longitudinal surveys. See, for example, the SIPP Quality Profile (Jabine, King, and Petroni, 1990). Similarly, knowledge of the existence of both methodological and substantive research should be made available to the user community. The NLS has done this by publishing a bibliography of known research (Center for Human Resource Research, 1989).

Sponsors of longitudinal surveys should make available data quality evaluations in whatever form is deemed appropriate. Duncan and Hill (1989), in a formal refereed article, assessed the representativeness of the PSID sample and compared their survey measures with program aggregates and aggregates from the Current Population Survey. The SIPP now includes evaluations in the technical documentation of the file, if such evaluations are available when the file is released; otherwise, the evaluations are issued as "User Notes" after the release of the file.

Finally, several years ago the Social Science Research Council (SSRC) sponsored a research conference whose aim was to foster substantive analyses while providing information on the comparative strengths and weaknesses of several longitudinal data sets (U.S. Bureau of the Census, 1990). Conferences of this type, where program sponsors encourage comparative analysis of their data set, help the policy and research community to fully appreciate the strengths and weaknesses of each data base.
One hopes that improved understanding of the data will result in better analysis. Sponsors of large complex surveys ought to be encouraged to foster more of these kinds of exchanges.

Endnote

The Subcommittee enjoyed the benefits of the discussion of individuals who played active roles in several large Federal longitudinal surveys. Through these discussions, OMB Statistical Policy Working Paper 13 emerged. The paper we developed for the seminar on the "Quality of Federal Data" summarized Statistical Policy Working Paper 13 at some length. Because of page constraints, the paper above is a considerably condensed version of the paper prepared for the seminar. The long version of the manuscript is available from Daniel Kasprzyk, Statistical Methods Division, U.S. Bureau of the Census, Washington, D.C. 20233.

References

Bailar, B.A. (1989), "Information Needs, Surveys, and Measurement Errors," in Panel Surveys (D. Kasprzyk, G.J. Duncan, G. Kalton, M.P. Singh, eds.), John Wiley and Sons: New York, 1-24.
Becketti, S., W. Gould, L. Lillard, and F. Welch (1983), Attrition from the PSID, Santa Monica, California: Unicon Research Corp.
Boruch, R.F. and R.W. Pearson (1985), The Comparative Evaluation of Longitudinal Surveys, Social Science Research Council, New York.
Burgess, R.D. (1989), "Major Issues and Implications of Tracing," in Panel Surveys (D. Kasprzyk, G.J. Duncan, G. Kalton, M.P. Singh, eds.), John Wiley and Sons: New York, 52-74.
Cantor, D. (1989), "Substantive Implications of Longitudinal Design Features: The National Crime Survey as a Case Study," in Panel Surveys (D. Kasprzyk, G.J. Duncan, G. Kalton, M.P. Singh, eds.), John Wiley and Sons: New York, 25-51.
Center for Human Resource Research (1989), NLS Annotated Bibliography: 1968-1989, Center for Human Resource Research, The Ohio State University, Columbus, Ohio.
Cox, B.G. and S.B. Cohen (1985), Methodological Issues for Health Care Surveys, Marcel Dekker, New York.
Dicker, M. and R.
Casady (1982), "A Reciprocal Rule Model for Defining Longitudinal Families for the Analysis of Panel Survey Data," Proceedings of the Social Statistics Section, American Statistical Association, 532-537.
Duncan, G.J. and D.H. Hill (1989), "Assessing the Quality of Household Panel Data: The Case of the Panel Study of Income Dynamics," Journal of Business and Economic Statistics, 7, 441-452.
Duncan, G.J. and M. Hill (1985), "Conceptions of Longitudinal Household: Fertile or Futile," Journal of Economic and Social Measurement, 13, 361-375.
Duncan, G.J. and G. Kalton (1987), "Issues of Design and Analysis of Surveys Across Time," International Statistical Review, 55, 97-117.
Duncan, G.J., F.T. Juster, and J.N. Morgan (1986), "The Role of Panel Studies in a World of Scarce Research Resources," in Survey Research Designs: Toward a Better Understanding of Their Costs and Benefits (R.F. Boruch and R.W. Pearson, eds.), Lecture Notes in Statistics No. 38, Springer Verlag, New York, 94-129.
Duncan, G.J. and J.N. Morgan (1984), "Behavioral Research with the Panel Study of Income Dynamics in Retrospect and Prospect," Vierteljahrshefte zur Wirtschaftsforschung, Dunker and Humblot, Berlin, 415-427.
Ernst, L.R. (1989), "Weighting Issues for Longitudinal Household and Family Estimates," in Panel Surveys (D. Kasprzyk, G.J. Duncan, G. Kalton, M.P. Singh, eds.), John Wiley and Sons: New York, 139-159.
Folsom, R., L. LaVange, and R.L. Williams (1989), "A Probability Sampling Perspective on Panel Data Analysis," in Panel Surveys (D. Kasprzyk, G.J. Duncan, G. Kalton, M.P. Singh, eds.), John Wiley and Sons: New York, 108-138.
Grasso, J. and A. Kohen (1978), "The National Longitudinal Surveys Data Processing Systems," in The Survey of Income and Program Participation: Proceedings of the Workshop on Data Processing (D. Kasprzyk, ed.), Office of the Assistant Secretary for Planning and Evaluation, Department of Health and Human Services, II-33-II-53.
Jabine, T., K. King, and R.
Petroni (1990), Survey of Income and Program Participation: Quality Profile, U.S. Bureau of the Census, Washington, DC.
Jean, A. and E. McArthur (1987), "Tracking Persons Over Time," SIPP Working Paper No. 8701, U.S. Bureau of the Census.
Jean, A.C. and E.K. McArthur (1984), "Some Data Collection Issues for Panel Surveys with Application to the Survey of Income and Program Participation," Proceedings of the Section on Survey Research Methods, American Statistical Association, 745-750.
Judkins, D.R., D.L. Hubble, J.A. Dorsch, D.B. McMillen, and L.R. Ernst (1984), "Weighting of Persons for SIPP Longitudinal Tabulations," Proceedings of the Section on Survey Research Methods, American Statistical Association, 676-681.
Kalton, G. (1986), "Handling Wave Nonresponse in Panel Surveys," Journal of Official Statistics, 2, 303-314.
Kalton, G., D. Kasprzyk, and D.B. McMillen (1989), "Nonsampling Errors in Panel Surveys," in Panel Surveys (D. Kasprzyk, G.J. Duncan, G. Kalton, M.P. Singh, eds.), John Wiley and Sons: New York, 249-270.
Kalton, G. and M. Miller (1986), "Effects of Adjustments for Wave Nonresponse on Panel Survey Estimates," Proceedings of the Section on Survey Research Methods, American Statistical Association, 194-199.
Kish, L. (1987), Statistical Design for Research, John Wiley and Sons: New York.
Lepkowski, J. (1989), "Treatment of Wave Nonresponse in Panel Surveys," in Panel Surveys (D. Kasprzyk, G.J. Duncan, G. Kalton, M.P. Singh, eds.), John Wiley and Sons: New York, 348-374.
McArthur, E. (1988), "Measurement of Attrition through the Completed SIPP 1984 Panel: Preliminary Results," Internal Bureau of the Census memorandum to D. Kasprzyk, March 4, 1988.
McArthur, E. and K. Short (1985), "Characteristics of Sample Attrition in the SIPP," Proceedings of the Section on Survey Research Methods, American Statistical Association, 366-369.
McMillen, D.B. and R.
Herriot (1985), "Toward a Longitudinal Definition of Households," Journal of Economic and Social Measurement, 13, 504-509.
Mulvihill, J. and M. Lawes (1980), "Imputation Procedures for LFS Longitudinal Files," Statistics Canada Internal Memorandum.
Neter, J. and J. Waksberg (1964), "A Study of Response Errors in Expenditure Data from Household Surveys," Journal of the American Statistical Association, 59, 18-55.
Office of Management and Budget (1986), Federal Longitudinal Surveys (Statistical Policy Working Paper No. 13), National Technical Information Service, PB86-139730.
Pearson, R. (1989), "The Advantages and Disadvantages of Longitudinal Surveys," Research in the Sociology of Education and Socialization, Vol. 8, 177-199.
Petroni, R.J. and K.E. King (1988), "Evaluation of the Survey of Income and Program Participation's Cross-Sectional Noninterview Adjustment Methods," Proceedings of the Section on Survey Research Methods, American Statistical Association, 342-347.
Rhoton, P. (1986), "Attrition and the National Longitudinal Surveys of Labor Force Behavior: Avoidance, Control, and Correction," IASSIST Quarterly, 10(2).
Singh, R., V. Huggins, and D. Kasprzyk (1990), "Handling Wave Nonresponse in Panel Surveys," paper presented at the Conference on "Survey Design, Methodology, and Analysis," University of Essex, Colchester, England, July 4-7, 1990.
U.S. Bureau of the Census (1990), Individuals and Families in Transition: Understanding Change Through Longitudinal Data. Papers presented at the Social Science Research Council Conference in Annapolis, Maryland, March 16-18, 1988. U.S. Bureau of the Census.
White, G.D. and H. Huang (1982), "Mover Follow-Up Costs for the Income Survey Development Program," Proceedings of the Section on Survey Research Methods, American Statistical Association, 376-381.
Woltman, H. and J.
Bushery (1975), "A Panel Bias Study in the National Crime Survey," Proceedings of the Social Statistics Section, American Statistical Association, 159-167. 406 THE ADVANTAGES AND DISADVANTAGES OF LONGITUDINAL SURVEYS Robert W. Pearson Social Science Research Council Introduction Longitudinal surveys have existed for some time in the social sciences. A quick scan of research would find them employed at least as early as 1928, when Stuart Rice studied the changing presidential preferences of Dartmouth college students (Rice 1928). Perhaps more readily recalled are Theodore Newcomb's classic studies of the effects of a liberal environment at Bennington College on young women from conservative families (Newcomb 1943). Panel designs were further extended when Paul Lazarsfeld an colleagues studied the 1940 U.S. presidential campaign through a stratified random sample of about 2,400 Erie County, Ohio, citizens (Lazarsfeld, Berelson, and Gaudet 1944). Longitudinal studies became especially prominent in the 1960s and 1970s in the United States as the federal government turned its attention and resources to a domestic public agenda in which research and evaluation played an increasing part. As the technology of data collection, storage, and analysis developed, so too did the call for and subsequent investment in longitudinal surveys. In the United States, for example, some 13 national longitudinal surveys were conducted in the 1950s while 64 surveys of this kind were carried out in the following two decades (Taeuber and Rockwell 1982). Panel studies quite simply permitted the study of change that other study designs (principally, cross sectional surveys) could not. These surveys were asked to evaluate the effects of social programs and to unravel the processes by which individuals change. The surveys facilitated the development of several fields of inquiry, including -- but not limited to -- labor economics, developmental psychology, voting behavior, and evaluation research. 
Conversely, theoretical and conceptual developments within these fields called for the use of longitudinal surveys.
The love affair with longitudinal data appears to have been short-lived, however. This earlier affection has been replaced with an increasing appreciation of the limits of longitudinal surveys. For example, the editors of a volume on longitudinal analysis of labor market data began the volume provocatively by saying:
Longitudinal data are widely and uncritically regarded as a panacea. Given the substantial cost of collecting such data, it is surprising that so little attention has been devoted to justifying the expense. The conventional wisdom in social science equates "longitudinal" with "good," and discussion of the issue rarely rises above this level (Heckman and Singer 1985, p. xi).
Similar questioning can be found in other fields of research. For example, Hirschi and Gottfredson assert in their review of research on the relationship between age and crime that "Funding agencies seem convinced by researchers that the longitudinal study is necessary for the proper study of crime" (Hirschi and Gottfredson 1983, p. 582). They argue instead that the causes of crime are similar across age cohorts and that cross-sectional designs are likely to produce more knowledge per dollar of research than are longitudinal designs, which Hirschi and Gottfredson believe to be relatively more costly to conduct (Greenberg 1985; Hirschi and Gottfredson 1983, 1985; Murray and Erickson 1987).
The recent concern with longitudinal or panel surveys stems in part from the substantial investment in such data made during the past 20 years. There is also a suspicion that several important panel studies have reached or have gone beyond their maximum usefulness.
Members of the policy and research communities now discuss these limitations as well as their comparative advantages in the reflective mood that was catalyzed by reductions in the data collection and social science budgets of the early part of the Reagan administration (Pearson 1985).
The purpose of this chapter is to review several of the strengths and weaknesses of longitudinal surveys that have emerged from these discussions. The chapter will make special note of the manner in which these research designs have been oversold on the one hand and underused on the other. The chapter will discuss the several advantages and disadvantages of these instruments of social observation and draw attention to several claims about these data collection strategies that appear to be not well established, even if widely believed.
The principal point of the chapter is a relatively simple one: which survey design is most appropriate for a particular purpose is a complicated function of a large number of factors. These include, but are not limited to, the use to which the data are put; the cognitive capacities and interests of respondents; legal and ethical restraints on the study of human subjects; the nature and quality of theories or assumptions about social processes and behavior; and the inferential abilities of different research designs.
Unfortunately, these simple points are too often ignored. The research literature and the decisions concerning the choice of research designs appear to have become increasingly interested in choosing one rather than another design. Too little attention is paid to their fruitful combination, both within the survey research tradition -- the focus of this chapter -- and between this tradition and more qualitative research approaches.
The Advantages and Disadvantages of Longitudinal Surveys
Discussions of the advantages and disadvantages of particular research designs are difficult to conduct in the abstract. This is so for several reasons.
First, the discussion needs to be framed in a comparative perspective. Is the question one of the relative advantages and disadvantages of one longitudinal panel vs. another? Is it the relative merits of longitudinal vs. other designs? The former question faces secondary analysts of existing survey data. The latter question -- the principal focus of this chapter -- confronts those who sponsor and design research. These are often two distinct, though overlapping, levels of concern.
Users of such data may find several surveys that are ostensibly relevant to a given topic, but have few tools for judging the equivalence of their measures. It is difficult to confirm, validate, and replicate research results across surveys. Paralleling the users' concerns, those who fund or design surveys must consider which studies to initiate, maintain, or terminate, and for what reasons. What combination of ongoing data collection programs will meet the present and future needs of research and public policy? Clearly, legitimate replication may be hard to distinguish from unnecessary redundancy. Although reliance on a single data source invites biased or inconclusive results, investments in similar or equivalent data series are likely to yield diminishing returns. Put briefly, each additional instrument may not lead to an equally valuable increment in knowledge.
Second, discussions of the advantages and disadvantages of longitudinal surveys (and other research designs as well) are difficult because their evaluation depends on a variety of conditions. These conditions include:
o The questions one wishes to answer.
o The skills and analytic competences of the investigator or the "user friendliness" of the data.
o The sample size, target population, substantive content, and design of the survey.
o The timeliness of the survey.
o The quality of the information.
o The documentation and dissemination of the data.
The evaluation of longitudinal surveys as well as other survey research designs also depends on subtler factors. For example, there are substantial costs associated with gaining a working knowledge of the structure (and anomalies) of a large data set, costs that are not entirely transferable to another survey. These impediments to use are frequently confronted by analysts because many data collection programs do not devote resources to the creation of adequate documentation, database management structures, or the creation and distribution of users' access utility programs or constructed variables (David 1980, 1985).
Many analysts use several different longitudinal data sets in their research. But when they do, they are often aided by students, research assistants, and computational facilities that minimize the costs of doing so. That is to say, the use (and usefulness) of these and other relatively large data sets or instruments cannot be considered apart from a wider set of instrumentalities, which include students, assistants, training programs, computational and analytical technologies, instructional materials, and the availability of research funds for secondary analysis.
Standards or guidelines for the conduct of longitudinal surveys exist (Bailar and Lanphier 1978; Boruch and Pearson 1988). These guidelines cannot, however, be used a priori to compare one longitudinal survey to another because such evaluation relies heavily on the uses to which the results or findings of the studies are to be put. If the findings of a study were known before the data are collected, there would be little need to conduct the study. (For similar conclusions concerning the intractability of judging the relative value of different data, see David and Peskin 1984.)
Comparisons of longitudinal with other research designs are difficult because many of the advantages and disadvantages of panel designs are shared by other designs.
Nonresponse, confidentiality, and data access are problems or concerns that face each design (Boruch and Cecil 1979). Moreover, some disadvantages or difficulties posed by longitudinal surveys are also part of their strength. For example, how one defines and measures such ever-changing phenomena as the "family" is a problem that accompanies the increased ability to conceptualize and measure these dynamic phenomena (Koo 1985; Citro and Watts 1985; Citro, Hernandez, and Moorman 1986).
Equally important, some problems or disadvantages can be avoided or minimized if anticipated and if appropriate quality control mechanisms are built into the technology. For example, sample attrition of panel members can be reduced if sufficient attention and resources are devoted to collecting information from sample respondents about friends or relatives who are likely to know where a respondent may move between waves of an interview. The effects of attrition can be monitored and, through imputation or weighting algorithms, compensated for during the analysis of the data.
Comparisons of different survey research designs are often inappropriate because they tend to criticize one research design while more or less explicitly extolling the virtues of an alternative, as if their discussion were part of a debate in which it was important that one type of research design "win" while others "lose." We should instead begin by agreeing that different designs can in principle be combined to take advantage of their relative merits and to overcome their relative disadvantages. One ought to ask what combination is most effective or efficient for answering one's questions rather than which one research design is best.
Although comparisons of the relative advantages of longitudinal surveys should be made cautiously, research and experience suggest that longitudinal surveys have several generic advantages and disadvantages that are relatively well established.
The advantages of longitudinal designs include, for example:
o The development of reliable measures of individual change. (Retrospectively collected data are subject to telescoping, memory decay, etc.) Similarly, these designs permit the measurement of subjective phenomena as current states rather than as recalled states. (Consider the difficulty of asking a respondent to rate his or her health or happiness four years ago.)
o The development of concepts that are characteristically dynamic rather than static. (The burgeoning multidisciplinary research on life-course perspectives owes part of its vitality to the creation and distribution of panel studies. See, for example, Baltes 1979, 1983.)
o Better descriptions of the dynamics of change. (The typical episode of family poverty or welfare receipt has been shown through panel data to be considerably briefer than was assumed in studies using repeated cross-sectional surveys. See, for example, Duncan et al. 1984; Corcoran et al. 1985; and Duncan, Hill, and Hoffman 1988.) Similarly, longitudinal designs permit the estimation of individual levels and rates of transition between states or conditions for which cross-sectional data may only provide gross or aggregate measures of group change.
o The ability to conduct analyses that control for unmeasured attributes of individuals, thus improving the ability to distinguish between the influence of enduring individual differences (e.g., race and gender) and the influence of having previously experienced the condition that is under investigation (e.g., previous unemployment leading to current unemployment).
The disadvantages of panel designs include:
o Nonresponse bias (especially through panel attrition) may be high and analytically troublesome. (Respondents for whom subsequent interviews cannot be completed may differ in analytically important ways from those who remain in the survey.)
o Response and learning effects (i.e., "panel effects") may prejudice responses.
(People who are interviewed about their voting behavior tend to vote more frequently thereafter.)
o Errors in the measurement of variables (and the correlation of these errors) and changes in the accuracy, reliability, and validity of such measures may spuriously create the appearance of change.
o Panel data, unless regularly refreshed or augmented, may provide useful or accurate estimates of the population from which the original sample was drawn, but not of the current population, which may be of interest.
o Panels always involve a moving target. Panel surveys of families, for example, must cope with movement into and out of families, the formation of new ones, and the dissolution of old.
Let us consider in more detail the first of the listed advantages and discuss several features often included in such lists that are not well established. (For several discussions of the strengths and weaknesses of longitudinal surveys see, for example, Ashenfelter and Solon 1982; Boruch and Pearson 1988; Duncan, Juster, and Morgan 1984; Duncan and Kalton 1985; Fienberg and Tanur 1987a; and Subcommittee on Federal Longitudinal Surveys 1986.)
The limits of retrospection. The repeated observations of longitudinal surveys permit an investigation of change in phenomena that can be measured in the present. They rely less than, say, a single cross-sectional survey design on the memory of respondents' prior conditions. This principal limitation of the ability of cross-sectional research designs to assess individual change is one of the major relative advantages of longitudinally designed studies.
Increasing evidence and recent theoretical developments in cognitive psychology and survey methodology question, in more sophisticated ways than in the past, the trustworthiness of retrospective -- or memory-based -- responses to survey questions.
Some research has found that certain kinds of memory-based data are flawed not only by temporal confusion and forgetting, but are systematically influenced by the respondent's current emotional state and beliefs about life and self. Memory is basically reconstructive (cf. Bartlett 1932). And this reconstruction often involves the "top down" processing of the past that includes the development or use of scripts and narratives about the self or society, as well as the organization of details about the past. These scripts, schemata, self narratives, or stereotypes define more or less coherent sets of beliefs around which more detailed images are actively (although not necessarily consciously) organized or distorted.
If by virtue of sharing a common culture, the respondent's schemata or theories of self or society are the same as those of the questioner (e.g., that adult mental distress follows from childhood problems), then research that relies on retrospective questioning techniques typically found in cross-sectional surveys may be systematically biased in the direction the questioner expects. The resulting "theory validation" of retrospective studies may simply be the result of widely -- even if only implicitly -- shared cultural stories, narratives, stereotypes, or folklore whose accuracy is unknown and unprovable (Dawes and Pearson 1987).
Several studies substantiate this conclusion. In two separate but similar experiments, for example, Conway and Ross (1984) examined randomly selected participants in a program designed to improve study skills and a control group of nonparticipants who indicated a desire to participate in the program but who were placed on a waiting list. Participants and control group members were questioned both before the beginning of the study skills program and at its conclusion.
At both times, they were asked to assess their own study skills (e.g., how much of their study time was well spent, how satisfactory were their note taking skills, etc.) and the amount of time they studied. At the second interview they were also asked to recall what they reported during the first session concerning skills and study time.
At the initial interview, participants and control group members did not significantly differ on any measure of skill, study time, or on additional information about grades on a psychology examination taken prior to the study skills program. Nor did these two groups differ in their recall of hours spent studying; there was a slight tendency for subjects in both conditions to recall studying less than they initially reported. Recall of skills produced marked differences, however. Program participants recalled their study skills as being significantly worse than they initially reported. On the average, waiting list subjects recalled their study skills as being approximately the same as those they reported initially (p. 743).
Participants in the study skills program appeared to exaggerate their improvement in a direction consistent with their theories of what ought to be -- taking a course should improve skills -- but they did so by retrospectively derogating their initial status. They did not exaggerate their current skills, but reconstructed their memory of the past to combine: (1) a theory that they should have improved because of the instruction and (2) a relatively accurate assessment of their current level of skills. In both studies, the study skills program did not have a significant effect on academic performance, as measured by subsequent psychology examinations or average grades for the semester.
The recall of past events and conditions was in error, and these errors were in a direction that was consistent with what the students thought that the past should have been as a result of their current conditions and prior participation in a study skills program.
Survey research has become increasingly aware of the distortions and misrepresentations of the past that are engendered by retrospective questions (cf. Turner and Martin 1984, p. 296; Sudman and Bradburn 1982, pp. 43-51; Schuman and Kalton 1986, pp. 644-647). In a recent validation study of employment-related information, for example, Mathiowetz (1986) and Duncan and Mathiowetz (1985) found that when a firm's employees were asked (in July 1983) whether they had been unemployed at any time during 1981 and 1982, 15 percent were in error concerning 1981 and seven percent were in error concerning 1982. Validation studies of retrospective reports have also observed substantial error in the recall of hospitalizations (Cannell, Fisher, and Baker 1965) and of victimizations (Turner 1972).
Obviously, longitudinal research designs themselves may rely upon recall, as well as other cognitive processes. Their dependence on retrospective accounts is in part a function of the length of time between waves of a panel study and the need to measure experiences prior to the first interview. Panel designs have the advantage of providing opportunities to employ bounded recall techniques (asking respondents to recall events since the last interview) and to use information provided in a previous interview to reinstate prior context and to provide cues to facilitate their recall. The relative advantage of longitudinal surveys in this regard is often grounds for choosing this design over cross-sectional surveys.
Costs: A red herring. The list of strengths and shortcomings of longitudinal research provided above is not exhaustive. But it excludes their cost as a relative disadvantage.
One can find numerous references to the expensiveness of longitudinal designs (cf. Murray and Erickson 1987, p. 109), a belief that appears to be widespread. Unfortunately, this belief is not well established. And the limited attempts to empirically assess the relative costs of longitudinal and cross-sectional surveys have shown under certain assumptions that longitudinal surveys may be less expensive than repeated cross-sectional surveys (Duncan, Juster, and Morgan 1984).
No one would dispute that surveys such as the PSID, HS&B, and NLS72 are relatively expensive instruments to create and maintain. But these costs are largely a function of the number or special character of sample members required by the study, not necessarily their longitudinal design. Surely, longitudinal surveys require the added expenses of tracing and tracking respondents as they move between waves of interviews, costs that are unique to a research design that follows subjects through time. Locating and securing the cooperation of sample respondents during an initial interview, however, reduces the costs associated with drawing new sampling frames or screening households for the desired universe of sample members. These features also permit the use of relatively less expensive modes of administering subsequent waves of the survey (e.g., phone, mail-back questionnaires) than may be required in cross-sectional national samples.
Evaluating the relative costs of longitudinal surveys depends a great deal on what one chooses to compare them to. In this regard, we are faced with the difficulties posed by the proverbial comparison of apples and oranges. It is only suggestive -- but nonetheless in opposition to the belief about their expense -- that the average field cost per completed interview of the 1987 General Social Survey of NORC was $400.
The average cost of each completed interview of the 10th wave of the Youth Cohort of the National Longitudinal Survey of Labor Market Experience was $333 (Carter 1987). Similarly, the total cost of the first year of interviews of the National Post Secondary Student Aid Study of 1987 was $7.2 million; its first year follow-up is currently estimated to cost $3.0 million (Carroll 1987).
The comparative costs of different survey designs compound the difficulty of simultaneously weighing their advantages and disadvantages. A relatively ambiguous attitude toward panel surveys in assessing the effects of job training programs, as suggested for example by Heckman and Robb (1985), could be turned on its head by altering assumptions about the comparative costs of these different designs. Indeed, if one assumes relatively equal expenses, or cost advantages to panel designs, it would be prudent to select panel rather than cross-sectional designs (holding a great many other factors constant) because panel designs permit the use of a wider variety of statistical and theoretical assumptions. The application of a wider range of assumptions provides a useful means of testing how sensitive conclusions are to different assumptions. That is to say, one can analyze a panel study as if it were a repeated cross-sectional design, but not vice versa.
The largest costs of such studies lie in the creation and maintenance of the organization that is required to collect the data and in the burden which such surveys impose on their respondents. This commitment of resources is largely fixed and shared with other large survey-based designs, whether panel, experimental, or cross-sectional. Ongoing instruments of data collection are more likely to present opportunities for linkage or augmentation with side studies, experiments, and topical modules than are "one-shot" data collection programs, which single cross-sectional surveys can often be, an advantage to which we return below.
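The remark that a panel can be analyzed as if it were a repeated cross-section, but not vice versa, can be illustrated with a minimal sketch. The two-wave records below are invented for illustration and come from no survey discussed here; the function names are likewise hypothetical. Treating each wave as a separate cross-section recovers only the net change in the poverty rate, while the person-level linkage of the panel also recovers the gross flows into and out of poverty.

```python
from collections import Counter

# Hypothetical two-wave panel: (person_id, status_wave1, status_wave2).
# All values are invented for illustration.
panel = [
    (1, "poor", "not_poor"),
    (2, "poor", "poor"),
    (3, "not_poor", "poor"),
    (4, "not_poor", "not_poor"),
    (5, "not_poor", "not_poor"),
    (6, "poor", "not_poor"),
    (7, "not_poor", "poor"),
    (8, "not_poor", "not_poor"),
]

def cross_sectional_rates(rows):
    """Ignore the linkage between waves: poverty rate in each wave separately."""
    n = len(rows)
    wave1 = sum(1 for _, s1, _ in rows if s1 == "poor") / n
    wave2 = sum(1 for _, _, s2 in rows if s2 == "poor") / n
    return wave1, wave2

def transition_counts(rows):
    """Use the person-level linkage that only a panel provides."""
    return Counter((s1, s2) for _, s1, s2 in rows)

rate1, rate2 = cross_sectional_rates(panel)
flows = transition_counts(panel)

# Net change: all that two independent cross-sections could show.
net_change = rate2 - rate1

# Gross change: share of persons who changed status in either direction.
movers = flows[("poor", "not_poor")] + flows[("not_poor", "poor")]
gross_change = movers / len(panel)

print(f"wave 1 rate={rate1}, wave 2 rate={rate2}, net change={net_change}")
print(f"gross change={gross_change}, flows={dict(flows)}")
```

In this invented example the poverty rate is identical in both waves, so a repeated cross-sectional reading would report no change, although half of the sample changed status between interviews.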
Causal inference: An oversold advantage. Longitudinal survey designs do not, however, as is often incorrectly claimed, permit unequivocal inferences about causation. Surely, the temporal dimension of longitudinal surveys provides strong priors for assuming that a leads to or causes b if a is observed to occur before b. But there are several dangers in making such strong causal inferences from panel designs. One's anticipation of future events can influence current behavior, for example. And selection biases (i.e., people found in a program often differ in unmeasurable or unmeasured ways from nonparticipants) invariably trouble the estimation of program effects.
Heckman and Robb (1985), for example, examined three survey designs and associated econometric techniques to determine whether one was "better" than the others in assessing the consequences of a public policy intervention, e.g., a youth employment training program. They compared (1) single retrospective cross-sectional, (2) repeated cross-sectional, and (3) panel designs and their corresponding analytical techniques. Heckman and Robb showed that each design and corresponding analytical technique requires untestable assumptions in evaluating the earnings effects of participation in training programs. Their research argued that many of the assumptions of cross-sectional analytical techniques were no more or less justifiable than those upon which the panel designs were based, although some assumptions could be -- but too infrequently are -- the object of independent study.
Although panel studies permit one to trace spells and transitions and to order conditions in sequences that suggest causation, longitudinal surveys cannot do so without the aid of assumptions. This point is forcefully illustrated by Lord's paradox (Lord 1967, 1968, 1973) and its discussion by Holland and Rubin (1986).
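The kind of situation Lord described can be sketched with invented numbers. The sketch below is only an illustration under stated assumptions: the two groups, their scores, and the function names are hypothetical. One analyst compares average gains between the groups; another compares final scores after adjusting for initial scores with a pooled within-group slope, as in an analysis of covariance. Both summaries describe the same two-wave data correctly, yet they differ, and neither bears a causal reading without further assumptions.

```python
# Hypothetical scores for two groups measured before and after some period.
# The numbers are invented so that neither group's average changes.
pre = {"A": [40, 45, 50, 55, 60], "B": [50, 55, 60, 65, 70]}
post = {"A": [45.0, 47.5, 50.0, 52.5, 55.0], "B": [55.0, 57.5, 60.0, 62.5, 65.0]}

def mean(values):
    return sum(values) / len(values)

def mean_gain(group):
    """Analyst 1: average change (post minus pre) within a group."""
    return mean([after - before for before, after in zip(pre[group], post[group])])

def pooled_within_slope():
    """Common slope of post on pre, pooled over groups (analysis of covariance)."""
    sxy = 0.0
    sxx = 0.0
    for g in pre:
        mx, my = mean(pre[g]), mean(post[g])
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre[g], post[g]))
        sxx += sum((x - mx) ** 2 for x in pre[g])
    return sxy / sxx

# Analyst 1: difference between groups in mean gain.
gain_difference = mean_gain("B") - mean_gain("A")

# Analyst 2: difference in final scores, adjusted for initial scores.
slope = pooled_within_slope()
adjusted_difference = (mean(post["B"]) - mean(post["A"])) - slope * (
    mean(pre["B"]) - mean(pre["A"])
)

print(f"difference in mean gains: {gain_difference}")
print(f"covariance-adjusted difference: {adjusted_difference}")
```

With these invented scores the first analyst finds no difference between the groups and the second finds a difference of five points; the data alone cannot say which comparison answers the causal question.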
Attempts to understand how different models generate the same data, or how similar models can generate or represent different data, have produced a greater sensitivity to the problems of making causal inferences from the research designs of panel studies (and other observational designs such as matched comparison groups and cross-sectional surveys).
Fraker and Maynard (1985, 1987), for example, analyzed data from several sources to compare the estimated earnings effects from participating in an employment and training program. They compared estimates of training effects derived from (1) control groups of the National Supported Work Demonstration program that were selected in accordance with experimental research designs and (2) comparison groups constructed from the Current Population Survey. Matched comparison designs involve the creation of samples of respondents (typically drawn from such surveys as the Current Population Survey) who are similar in important respects to the participants in a program that one seeks to evaluate. These research designs are common in program evaluations in part because information about program participants is regularly collected at their enrollment or discharge from the program. Experiments in which a number of eligible individuals are randomly precluded from participating in a program (and later compared to those who are allowed to enroll), on the other hand, are at times difficult to conduct or proscribed by ethical and legal considerations.
Fraker and Maynard's comparisons of experimental versus nonexperimental estimates of training program effects on annual earnings showed that comparison group procedures and analytical models produced estimates of large negative effects on the earnings of youth both during the program's employment period and after. The experimental design revealed estimates of program effects on the earnings of youth that were modestly positive during the program, and negligible thereafter.
Comparisons of the effects of training on AFDC recipients revealed similar positive effects between the experimental and matched comparison designs. The differences in results between unemployed youth and AFDC recipients suggest that the greater earnings and employment variability of youth may result in more biased selection into the employment program, which in turn makes the task of defining a comparison group and an analytical model more difficult. Corroborative evidence to this work can be found in LaLonde (1986).
The implication of these marked differences in results is that the longitudinal and cross-sectional designs (or other nonexperimental designs) alone do not permit one to unravel the many causes and consequences of social and economic change or of program interventions. Perhaps more disturbing to those who must rely on such data, these research designs may produce the wrong answer when the behavior of the population under study is undergoing considerable change (as are the employment activities of youth).
Experimental designs are superior to panel designs in making causal inferences. Rarely is the use of or investment in data made "on purely statistical grounds" alone, however. In addition to costs, choices are constrained by legal, ethical, and administrative considerations (Riecken et al. 1974). Considerable experience (much from studies of states and municipalities) has produced a greater appreciation of many of the difficulties of importing laboratory-oriented experimental designs into the field. It is often difficult, for example, to sustain the separation of treatment and control groups in the field. Moreover, some of the problems associated with the design and implementation of panel studies, such as attrition, apply equally to experimental designs (Betsey, Hollister, and Papageorgiou 1985).
Experiments are useful for assessing the relative differences among program variations on a common set of outcome variables.
But experimental designs have their own scientific and administrative shortcomings. For example, treatments are often limited to a narrow set of variables and to specialized samples, and so their results may be of limited generalizability. Moreover, they are often difficult to administer and require substantial managerial skills to conduct. These limitations, among others, have retarded the use of experimental designs in the social sciences and in the evaluation of government programs. On the other hand, the design, implementation, and analysis of field experiments are possible, and some evidence exists of a renewed interest in them (cf. Maynard 1987; Bloom, Borus, and Orr 1987; and Cottingham and Rodriguez 1987).
Coupling experimental and longitudinal designs: A not fully realized potential. Longitudinal surveys are often an appropriate technology for describing the timing, duration, and sequence of individual change. And they are often better in this regard than alternative nonexperimental observational research designs because of the problems these alternatives confront when relying on retrospective measures of past conditions. Under certain conditions, longitudinal surveys appear to be no more expensive to conduct than repeated cross-sectional surveys. But their ability to draw causal inferences has in general been overdrawn. Although temporal order provides prima facie evidence for causation, it is insufficient.
Increasingly, the research community is considering the fusion of longitudinal and experimental survey designs in which randomly assigned treatments or interventions are given to some members of an ongoing longitudinal survey. Coupling experiments and longitudinal surveys capitalizes on the strongest merits of each design.
That is, one obtains both the information produced by national probability samples -- often conducted over a considerable length of time -- and the information produced by smaller comparative experiments in which causal inferences are more appropriately deduced. Insofar as the experiments can be adjoined systematically, their generalizability will be enhanced. Joining experiments to ongoing longitudinal surveys also permits one to use the experiments to calibrate estimates of program effects that are derived entirely from the longitudinal survey. That is, the biases engendered by using estimates that are based on longitudinal data can be assessed, and periodically corrected, through controlled experiments. Thus, longitudinal studies are likely to be more policy-relevant and less ambiguous with respect to biases in estimating program effects. Experiments are likely to benefit from their greater generalizability, lower costs, and more manageable administration.

There is no doubt about the need for social experiments in understanding change (Berk et al. 1985). The National Academy of Sciences' Committee on Youth Employment Programs, for example, examined major studies to understand whether one could draw firm conclusions about program effects from earlier research. The committee concluded, among other things, that longitudinal surveys are no substitute for randomized experiments when the object is to estimate the effectiveness of youth employment programs. Moreover, it urged the use of randomized experiments for this purpose (Betsey, Hollister, and Papagiorgiou 1985).

Coupling randomized designs to longitudinal surveys can also be traced to a technical advisory committee for employment program evaluation appointed by the U.S. Department of Labor. The DOL sought to learn whether analyses of manpower programs based on conventional longitudinal surveys lead to adequate estimates of program effects.
Adequacy was assessed, for example, by comparing estimates of effects based on longitudinal surveys against estimates based on randomized trials. The conclusion of this exercise was that the two estimates are not always in accord. Indeed, they differ remarkably depending on what population is the subject of inquiry (Fraker and Maynard 1985, 1987).

Obviously, changes in standard practices that are suggested here would introduce costs and difficulties, at least until their implementation permitted organizations to identify and remedy the problems that naturally arise with any new technology. And surely, there are a number of programs that could not be evaluated through such a coupling of designs because of the nature of the intervention or the limited number or location of possible respondents even in relatively large national longitudinal surveys. Randomly varying policy responses to violent domestic disputes could not be comfortably grafted onto High School and Beyond (HS&B), for example. Unfortunately, the development and application of this general strategy have yet to be adequately tested.

Summary and Conclusion

Longitudinal surveys are an important technology for the measurement of individual change and development. Considerable resources have been devoted during the last two decades to their creation and maintenance. These instruments of social observation have contributed a great deal to the development of several fields of inquiry, and promise to continue to do so.

Recent years have seen a growing restlessness with these research designs, however. Their limitations -- especially those related to causal inference -- are increasingly recognized, although their relative strengths have continued to argue for their use as important new instruments of data collection. The support and criticism they receive are the healthy consequence of the continual scrutiny that a principal tool of social analysis should undergo.
Their relative strengths, however, have not yet been systematically and regularly coupled with the strengths of another research design -- experiments. The promise of combining these methods and of moving beyond the discussion of the strengths and weaknesses of a particular research design still lies before us.

References

Ashenfelter, O. and G. Solon. 1982. Longitudinal labor market data: Sources, uses, and limitations. A paper presented at a conference sponsored by the National Council on Employment Policy, An Assessment of Labor Force Measurements for Policy Formulation, Washington, D.C. (June).
Bailar, B. A. and C. M. Lanphier. 1978. Development of Survey Methods to Assist Survey Practices. Washington, D.C.: American Statistical Association.
Baltes, P. B. 1979. Life-span developmental psychology: Some converging observations on history and theory. In Life-Span Development and Behavior. Vol. 2. eds. P. B. Baltes and O. G. Brim, Jr. New York: Academic Press.
Bartlett, F. C. 1932. Remembering: A Study in Experimental and Social Psychology. Cambridge: Cambridge University Press.
Berk, R. A. et al. 1985. Social policy experimentation. Evaluation Review 9:387-429.
Betsey, C., R. Hollister, and M. Papagiorgiou. 1985. Report of the Committee on Youth Employment Programs. Washington, D.C.: National Research Council.
Bloom, H. S., M. E. Borus, and L. L. Orr. 1987. Using random assignment to evaluate an ongoing program: The National JTPA Evaluation. A paper presented at the annual meeting of the American Statistical Association, San Francisco (August 17-20).
Boruch, R. F. 1975. Coupling randomized experiments and approximations to experiments in social program evaluation. Sociological Methods and Research 4:31-53.
Boruch, R. F. and H. W. Riecken. eds. 1975. Experimental Testing of Public Policy. Boulder, Colorado: Westview Press.
Boruch, R. F. and J. S. Cecil. 1979. Assuring the Confidentiality of Social Research Data.
Philadelphia, PA: University of Pennsylvania Press.
Boruch, R. F. and R. W. Pearson. 1988. Assessing the Quality of Longitudinal Surveys. Evaluation Review. In press.
Cannell, C. F., G. Fisher, and T. Bakker. 1965. Reporting of hospitalization in the Health Interview Survey. Vital and Health Statistics, Series 2, No. 6. Washington, D.C.: U.S. Government Printing Office.
Carroll, D. 1987. Personal communication, December 8, 1987.
Carter, W. 1987. Personal communication, December 7, 1987.
Citro, C. F. and H. W. Watts. 1985. Patterns of Household Composition and Family Status Change. Paper presented to the American Economic Association, New York, New York.
Citro, C. F., D. J. Hernandez, and J. E. Moorman. 1986. Longitudinal household concepts in SIPP. Paper presented to the American Statistical Association, Chicago, Illinois, May 30.
Conway, M. and M. Ross. 1984. Getting what you want by revising what you had. Journal of Personality and Social Psychology 47:738-748.
Corcoran, M. E., G. J. Duncan, G. Gurin, and P. Gurin. 1985. Myth and reality: The causes and persistence of poverty. Journal of Policy Analysis and Management.
Cottingham, P. and A. Rodriguez. 1987. The experimental testing of the Minority Female Single Parents Program. A paper presented at the annual meeting of the American Statistical Association, San Francisco (August 17-20).
David, E. L. and H. M. Peskin. 1984. Theory of an optimal database. Review of Public Data Use 12:45-53.
David, M. 1980. Access to data: The frustration and utopia of the researcher. Review of Public Data Use 8:327-337.
Dawes, R. M. and R. W. Pearson. 1987. The effect of the present on retrospective data: Measuring the then, now. New York: Social Science Research Council, mimeo.
Duncan, G. J., F. T. Juster, and J. N. Morgan. 1984. The role of panel studies in a world of scarce research resources. In The Collection and Analysis of Economic and Consumer Behavior Data: In Memory of Robert Ferber, eds. S.
Sudman and M. A. Spaeth. Champaign, Illinois: Bureau of Economic and Business Research.
Duncan, G. J., R. Coe, M. E. Corcoran, M. S. Hill, S. D. Hoffman, and J. N. Morgan. 1984. Years of Poverty, Years of Plenty. Ann Arbor: Survey Research Center, University of Michigan.
Duncan, G. J. and G. Kalton. 1985. Issues of design and analysis of surveys across time. A paper presented at the centenary session of the International Statistical Institute, Amsterdam.
Duncan, G. J. and N. A. Mathiowetz. 1985. A validation study of economic survey data. Ann Arbor, MI: Institute for Social Research, mimeo.
Duncan, G. J., M. S. Hill, and S. D. Hoffman. 1988. Welfare dependence within and across generations. Science 239:467-471.
Fienberg, S. E. and J. Tanur. 1986. From the inside out and the outside in: Combining experimental and sampling structures. Technical Report No. 373, Carnegie Mellon University (December).
____. 1987a. The design and analysis of longitudinal surveys: Controversies and issues of costs and continuity. In Designing Research With Scarce Resources, eds. R. F. Boruch and R. W. Pearson. New York: Springer-Verlag.
____. 1987b. Experimental and sampling structures: Parallels diverging and meeting. International Statistical Review 55:75-96.
Fraker, T. and R. Maynard. 1985. The use of comparison group designs in evaluation of employment related programs. Princeton, N.J.: Mathematica Policy Research, mimeo.
____. 1987. The adequacy of comparison group designs for evaluations of employment-related programs. The Journal of Human Resources 22:194-227.
Greenberg, D. F. 1985. Age, crime and social explanation. American Journal of Sociology 91, 1-21.
Heckman, J. J. and R. Robb, Jr. 1985. Alternative methods for evaluating the impact of interventions. In Longitudinal Analysis of Labor Market Data, eds. J. J. Heckman and B. Singer, 156-246. New York: Cambridge University Press.
Heckman, J. J. and B. Singer. eds. 1985. Longitudinal Analysis of Labor Market Data.
New York: Cambridge University Press.
Hirschi, T. and M. Gottfredson. 1983. Age and the explanation of crime. American Journal of Sociology 91, 359-374.
____. 1985. Age and crime, logic and scholarship: Comment on Greenberg. American Journal of Sociology 91, 22-27.
Holland, P. W. and D. B. Rubin. 1986. Research designs and causal inferences: On Lord's Paradox. In Survey Research Designs: Towards a Better Understanding of Their Costs and Benefits, eds. R. W. Pearson and R. F. Boruch, 7-37. New York: Springer-Verlag.
Koo, H. 1985. Short-term change in household and family structure. Paper presented to the American Statistical Association, Las Vegas, Nevada.
LaLonde, R. 1986. Evaluating the econometric evaluations of training programs with experimental data. American Economic Review 76 (4):604-20.
Lazarsfeld, P. F., B. Berelson, and H. Gaudet. (1944) 1960. The People's Choice: How the Voter Makes Up His Mind in a Presidential Campaign. 2nd ed. New York: Columbia University Press.
Lord, F. M. 1967. A paradox in the interpretation of group comparisons. Psychological Bulletin 68:304-305.
____. 1968. Statistical adjustments when comparing preexisting groups. Psychological Bulletin 72:336-337.
____. 1973. Lord's paradox. In Encyclopedia of Educational Evaluation. Anderson, S. B. et al. San Francisco: Jossey-Bass.
Mathiowetz, N. A. 1986. Episodic recall and estimation: Applicability of cognitive theories to survey data. Paper presented at a Seminar on the Effects of Theory-Based Schemas on Retrospective Data, June 26-28, New York: Social Science Research Council.
Murray, G. F. and P. G. Erickson. 1987. Cross-sectional versus longitudinal research: An empirical comparison of projected and subsequent criminality. Social Science Research 16, 107-118.
Newcomb, T. M. (1943) 1957. Personality and Social Change: Attitude Formation in a Student Community. New York: Dryden.
Pearson, R. W. 1985. The changing fortunes of the U.S. statistical system, 1980-1985.
Review of Public Data Use 12:245-269.
Rice, S. A. 1928. Quantitative Methods in Politics. New York: Knopf.
Riecken, H. W. et al. 1974. Social Experimentation. New York: Academic.
Schuman, H. and G. Kalton. 1986. Survey methods. In The Handbook of Social Psychology, 3rd ed. eds. G. Lindzey and E. Aronson, 635-697. Reading, MA: Addison-Wesley.
Subcommittee on Federal Longitudinal Surveys, Federal Committee on Statistical Methodology. 1986. Federal Longitudinal Surveys. Washington, D.C.: Office of Management and Budget.
Sudman, S. and N. M. Bradburn. 1982. Asking Questions. San Francisco: Jossey-Bass.
Taeuber, R. and R. C. Rockwell. 1982. National social data series: A compendium of brief descriptions. Review of Public Data Use 10:23-111.
Turner, A. G. 1972. The San Jose methods test of known crime victims. Washington, D.C.: National Criminal Justice Information and Statistics Service, Law Enforcement Assistance Administration, U.S. Department of Justice.
Turner, C. F. and E. Martin. eds. 1984. Surveying Subjective Phenomena, Volume 1. New York: Russell Sage Foundation.

LONGITUDINAL ANALYSIS OF FEDERAL SURVEY DATA

Patricia Ruggles
Joint Economic Committee

I. Introduction

Longitudinal panel data provide a unique opportunity to examine patterns and sources of economic and demographic change at the individual and family level. These data are relevant to a host of policy issues, from the assessment of welfare program participation to an understanding of patterns of health care usage or of the determinants of retirement. Many policy issues require some understanding of the factors that lead up to a particular event, or of the consequences that stem from it. Without repeated observations of the individuals concerned, however, such factors and consequences can only be inferred.
Thus, our increasing store of longitudinal panel data holds the potential for major breakthroughs in our understanding of the basic determinants of economic and demographic change as they affect individuals and families over time.

Unfortunately, however, many of our longitudinal data sets have been somewhat under-used by researchers so far, especially compared to similar cross-sectional surveys. To some extent this under-usage may simply stem from the fact that many of these data sets are still fairly new -- researchers need a chance to become familiar with the opportunities offered by these new sources of information. A more fundamental problem, however, is that to an analyst whose primary research experience is with cross-sectional microdata, a longitudinal panel of microdata on families and individuals can be rather intimidating.

The purpose of this paper is to provide some guidance to users and potential users of longitudinal data sets who are trying to sort out appropriate approaches to the problems of analyzing longitudinal panel data. This paper does not attempt to offer any new insights into the methodologies available to estimate the determinants of change (or stability) in a given variable or set of variables over time, nor are the theoretical issues underlying these methodologies addressed in any detail. Instead, the paper is designed to be a much more basic "how to" guide, focusing on the most fundamental choices that must be made by the analyst in undertaking a project involving the use of longitudinal data to examine the economic circumstances of families and individuals.

The major focus of this paper is on specific methods of making comparisons across time, with emphasis on matching the outcome measures and statistical techniques chosen to the basic research question being asked.
For many policy issues fairly simple outcome measures may be perfectly appropriate, but it is important to understand the measurement implications of alternative choices in order to avoid misinterpreting one's results.

II. Making Comparisons Across Time

The major purpose of a longitudinal research file is of course to facilitate the analysis of change over time. There are three major types of time-related analysis that are commonly carried out with such files, and there are some specific methodological issues that pertain to each.

Comparing Two Points in Time

The simplest type of time-related analysis -- the comparison of data from two discrete points in time -- does not actually require a complete longitudinal data file at all. The major advantage of this type of analysis is that it is relatively simple to implement and can often yield a great deal of useful information, particularly for questions that focus on rates of turnover in a specific variable. This method is very commonly used with many different longitudinal data sets -- several examples of such analyses can be found for PSID data in the Institute for Social Research's volume of PSID research results entitled Years of Poverty, Years of Plenty, for example. Other examples include Alan Fox's study using RHS data which examined income changes at retirement, and the SIPP-based study produced by Jack McNeil and his colleagues at the Census Bureau that considered how many of those poor in 1984 were still poor in 1985.

The major drawback of this method of making comparisons across time is that the outcome variables are sometimes quite sensitive to the specific time periods chosen for analysis, and there is no way for the analyst to determine this if only two points in time are examined. Further, such comparisons are valid measures of change among those who already have a given characteristic, but cannot be used to determine the distribution of durations of a particular state among all those who enter it.
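A two-point comparison of this kind amounts to a cross-tabulation of status at the first observation against status at the second. The sketch below is a minimal illustration in Python, not the procedure used in any of the studies cited; the function name `two_point_turnover`, the person identifiers, and the poverty indicators are all hypothetical.

```python
def two_point_turnover(status_t1, status_t2):
    """Tabulate turnover between two observation dates.

    Each argument maps a person id to True if that person is in the
    state of interest (for example, poor) at that date.
    """
    # Only people observed at both dates can be compared.
    both = status_t1.keys() & status_t2.keys()
    in_t1 = [p for p in both if status_t1[p]]
    out_t1 = [p for p in both if not status_t1[p]]
    if not in_t1 or not out_t1:
        raise ValueError("need people both in and out of the state at the first date")

    stayed = sum(1 for p in in_t1 if status_t2[p])
    entered = sum(1 for p in out_t1 if status_t2[p])
    return {
        "persistence": stayed / len(in_t1),   # in the state at both dates
        "exit": 1 - stayed / len(in_t1),      # left the state by the second date
        "entry": entered / len(out_t1),       # entered the state by the second date
    }


# Hypothetical data: ten people, four of them poor in the first year.
poor_1984 = {pid: pid <= 4 for pid in range(1, 11)}
poor_1985 = {pid: pid in (1, 2, 3, 5) for pid in range(1, 11)}

rates = two_point_turnover(poor_1984, poor_1985)
```

With these made-up inputs, three of the four people poor in the first year are still poor in the second. A tabulation like this says nothing about how long anyone has been, or will remain, in the state.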
For example, using this method we can tell what the total remarriage rate for all divorced women is over a given period of time, but we cannot determine the average amount of time that women spend between marriages. We do not know when those who were already divorced at the time of the first observation got divorced, and we have no distribution of remarriage probabilities by duration of divorce to use in forecasting future remarriage rates for those who have not yet remarried. Indeed, we cannot even determine if the remarriage rate is sensitive to the amount of time that has elapsed since the divorce. In other words, to the extent that the determinants of changes in state are themselves time-related, they may be difficult to observe if one must rely on simple "before and after" comparisons.

Examining Transition Events

A second approach to making comparisons across time, therefore, is to examine transitions between two states directly. By focusing on the transition itself one can more closely examine its association with other factors that may not be observable in a simple before and after comparison. This is helpful both in considering the effects of the transition on other variables and in estimating a causative model of the determinants of the transition itself.

To illustrate this point, let us reconsider the analysis of divorce discussed briefly above. If the analyst is interested not only in the determinants of the divorce transition, but also in its impacts, a simple comparison of two points in time may be doubly misleading. For example, family income may dip temporarily at the time of divorce as the family changes from one household to two. Eventually, however, as the two households make post-divorce adjustments in employment and living arrangements, income is likely to recover at least somewhat.
Estimates of the impact of the divorce on income and poverty status for the various family members may be quite sensitive both to the unit definition used to compute income (as discussed in the last section) and to the specific timing of the two income observations relative to the divorce itself. In a case like this, examination of income or poverty status over a longer period leading up to and then following the transition will give a better picture of its actual impacts.

For this type of examination it is necessary to have a longitudinally linked file with the transition flagged, but if such a file is available a descriptive analysis of this type is quite straightforward to perform. Similarly, the transition flags themselves can be used as explanatory variables in a larger model of change over time as it affects some other variable. The recent paper by Suzanne Bianchi and Edie McArthur on the impacts of marital disruptions on children's economic status illustrates a transition analysis of this type.

Considering the determinants of a given transition is also facilitated by the availability of a linked longitudinal file. For example, probit-type regression models can be used to examine the probability that a given transition will take place, subject to the various other characteristics of the cases in question. In analyzing divorce, for example, one might want to consider the impacts of the spouses' employment statuses in the period before the divorce on the probability that they will become divorced. In other cases, a broader set of dependent variables may be necessary -- those leaving a given state may have more than one alternative option. The work by Alan Gustman and Thomas Steinmeier on retirement probabilities as observed in the RHS offers a good example of a fairly complex application of this type of transition analysis.
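Once transitions are flagged on a linked file, a natural first descriptive step is to tabulate how often the transition occurs among those at risk, separately by subgroup. The Python sketch below does only that tabulation; it is not the probit-type model mentioned above, and the record layout (`married_at_start`, `divorced_during_panel`, `spouse_employed`) is an assumption invented for illustration.

```python
from collections import defaultdict


def transition_rates(records, group_key, at_risk_key, event_key):
    """Share of at-risk cases in each subgroup that make the transition."""
    at_risk = defaultdict(int)
    events = defaultdict(int)
    for rec in records:
        # Cases not in the origin state cannot make the transition.
        if not rec[at_risk_key]:
            continue
        group = rec[group_key]
        at_risk[group] += 1
        if rec[event_key]:
            events[group] += 1
    return {group: events[group] / n for group, n in at_risk.items()}


# Hypothetical linked records: one row per couple, with a transition flag.
couples = (
    [{"spouse_employed": True, "married_at_start": True,
      "divorced_during_panel": i == 0} for i in range(5)]
    + [{"spouse_employed": False, "married_at_start": True,
        "divorced_during_panel": i < 2} for i in range(4)]
    # Not married at the start of the panel, so never at risk of divorce.
    + [{"spouse_employed": False, "married_at_start": False,
        "divorced_during_panel": False}]
)

rates = transition_rates(
    couples, "spouse_employed", "married_at_start", "divorced_during_panel"
)
```

The subgroup rates are conditional probabilities of the transition given the characteristic used for grouping. A regression model would be needed to hold several characteristics constant at once.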
With a linked longitudinal file, the conditional probability of a given event such as divorce or retirement can be calculated fairly easily for specific population subgroups, and/or conditioned on specific events, using readily available software packages such as SAS. Again, however, such an approach can be misleading if the determinants of the transition in question are themselves time-related -- if, for example, the previous duration of the marriage or even the length of the unemployment spell are important determinants of the probability of divorce.

These duration-related issues, then, are potentially problematic with either a straightforward comparison of data from two points in time or with a more sophisticated analysis of specific transitions. Although it is sometimes possible to shoehorn duration-related information into one's transition analysis -- one could create separate dummy variables for short and long unemployment spells in the above example, for instance -- this is a rather ad hoc approach that is likely to leave many unanswered questions. In addition, in many cases one is interested not only in the transition event itself, or even in its impact on other events, but also in the expected duration of the new state that it creates. One wishes to know, for example, how long someone who enters poverty may be expected to remain poor, or how long someone who loses a job may be expected to remain unemployed. Questions of this type require some type of duration analysis.

Analyzing Data on Duration

There are many possible approaches to questions of duration, and alternative approaches can produce quite different and even seemingly contradictory statistics. The confusion generally results from differences in the population to which the duration estimate applies.
The two major possibilities are cohort-based estimates, which typically apply to all those observed in a given state at a point in time, and spell estimates, which apply to all those observed to enter the state within a given span of time.

To illustrate these possibilities, consider the case of welfare program participation. A point-in-time or cohort-based estimate of welfare durations will ask a question like "How long have those who are currently receiving welfare been on the program?" This question has been phrased retrospectively, but it can also be put in a prospective form: "How long are those currently on the program likely to remain on in the future?" In either case, the base population being considered is all those on the program at a given point in time. Such estimates are therefore relatively easy to line up with cross-sectional estimates of the total population on welfare, which are of necessity also point-in-time estimates. Estimates of this type are very useful for a number of purposes -- for example, estimating the future costs of the current welfare caseload (although obviously to get total costs one would also have to account for new welfare entrants).

One useful way to think about estimates of this type is as an examination of the experiences of a particular cohort -- a group that all happened to be in a given state at a given point in time. The NLS, for example, is designed with just such applications in mind. It is possible to use these data to examine the subsequent experiences of several distinct demographic cohorts selected at specific points in time -- teenagers, men nearing retirement, women in their middle years. It is even possible, with the new youth cohort, to link up families across generations, and to relate young women's experiences to those of their mothers, as Peter Gottschalk has done recently for welfare recipients, for example.
A similar type of application using PSID data is Frank Levy's path-breaking 1977 paper on the "underclass," which traced the subsequent experiences of a cohort of those in poverty in 1967.

Cohort-type analyses are very useful for many policy questions, but it is important to be aware of their limitations in applying them to policy analyses. Specifically, because they apply only to those in the state at a given time, such analyses are sometimes difficult to generalize to the population as a whole, or even to the experience of all those who may pass through the state over a period of time. What a point-in-time estimate cannot do, in other words, is answer questions like "How long will a typical person entering welfare stay on the program?" Such a question refers not to the population on the program at a point in time, but rather to the population entering the program. Although that may seem like a subtle distinction, in fact these two populations are likely to be very different if there is any significant variation at all in spell durations within the population as a whole. Those who are on welfare at a point in time are likely to have much longer spell durations, on average, than the typical entrant, because those with longer spell durations are more likely to be in the welfare population at any particular point in time.

To see this point, consider a very simple example. Suppose the population of interest consists of 13 people, one of whom is in the state under consideration for one year, and twelve of whom are in that state for one month each. Further suppose that these twelve one-month spells are distributed so that one occurs in every month of the year. At any given point in time, therefore, the total population in the state being considered will consist of two people, one who is in a one-month spell, and one who is in a twelve-month spell.
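The thirteen-person example just set up can be checked by direct enumeration. The Python sketch below is only an illustration of that arithmetic; representing each spell as a (start month, length in months) pair and the function names are assumptions made here, not part of the original example.

```python
from fractions import Fraction

# One twelve-month spell, plus twelve one-month spells, one starting each month.
LONG_SPELL = (1, 12)
SHORT_SPELLS = [(month, 1) for month in range(1, 13)]
spells = [LONG_SPELL] + SHORT_SPELLS


def active_in(spell, month):
    """True if the spell covers the given month (months numbered 1 to 12)."""
    start, length = spell
    return start <= month < start + length


def point_in_time_share_long(spells, month):
    """Among people in the state in a given month, the share in a long spell."""
    active = [s for s in spells if active_in(s, month)]
    long_spells = [s for s in active if s[1] > 1]
    return Fraction(len(long_spells), len(active))


def entrant_share_long(spells):
    """Among everyone who enters the state during the year, the share in a long spell."""
    long_spells = [s for s in spells if s[1] > 1]
    return Fraction(len(long_spells), len(spells))


point_in_time = {m: point_in_time_share_long(spells, m) for m in range(1, 13)}
entrants = entrant_share_long(spells)
# point_in_time[m] is 1/2 for every month m, while entrants is 1/13.
```

The same thirteen spells give one half or one thirteenth depending only on which population serves as the base, which is the distinction between point-in-time and entrant-based estimates drawn in the text.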
A point-in-time analysis conducted any time after the first month will therefore conclude that 50 percent of the observable population reports a spell of more than one month. An analysis based on all entrants observed during the year, however, will find that only one-thirteenth of the population reports a spell of more than one month. Clearly, if the reasons for these differences in estimates are not well understood, they could lead to very different conclusions about the prevalence of long spells.

Many of the most useful and interesting questions that can be addressed using a longitudinal database are questions that relate to duration. In any type of duration analysis, however, it is necessary to be sensitive to the issue of censoring. Inevitably, there will be some spells that start before the beginning of the observation period or that end after the panel has come to an end. Further, there will be some cases that join the panel with a spell already in progress or leave the panel before one has ended. These spells cannot simply be ignored, since of course longer spells are more likely than short ones to be censored and ignoring this problem will therefore produce biased estimates.

An alternative approach that unfortunately is fairly often used by analysts who have not completely thought through the problem of spell censoring is to mix together all one's observations over a given span of time, whether they apply to completed spells or to those that are only partially observed. This produces results that are confusing and even potentially misleading, since it is easy to misclassify spells that are only partially observed as short spells, producing misleading estimates of average spell durations. The measure of the "persistently poor" produced by Duncan et al. using PSID data is an example of this approach, and illustrates some of its problems.
In this study, the base population was defined as all those in the population during the ten-year observation period -- not just those in poverty in a particular year, as in Levy's study. Duncan et al. then defined the "persistently poor" as those poor for at least eight out of the ten years. They went on to calculate the proportion of the total population that was "persistently poor" simply by dividing the number of people observed in poverty for at least eight years by the total population observed.

The problem with this approach is that some people who are poor for less than eight years during the observation period are nevertheless in the midst of spells of poverty that will total eight or more years -- but unfortunately some of those years happen to fall outside the observation period. Thus the true number of individuals in the sample who were actually poor for at least eight out of ten years (at least some of which fell in the sample period) cannot be estimated using these data. Estimates of the proportion of those observed who experience long poverty spells will be understated, because some spells that appear short are in fact longer, but they simply haven't been completely measured. At the same time, however, because these estimates mix together people who were poor in different years, they also cannot be used to predict, say, what proportion of those poor in a given year will still be poor eight years later.

A preferable approach to the problem of estimating spell durations when some observations are censored is to use some sort of survival analysis technique. Under this methodology, a survival function for a given type of spell is estimated based on the cumulative distribution of observed spell durations.
In other words, in order to compute the probability that a spell of welfare participation, for example, will end in its sixth month, conditional on its having lasted for the first five months, one must include all cases known to have lasted at least five full months, whether or not their eventual disposition is known.

[Graphic omitted.]

By including all spells -- even those whose endings will eventually be unobserved -- for as long as information on their status is available, systematic biases related to spell duration will be minimized. At the same time, censored spells are essentially treated as if they had the same distribution of durations as spells with otherwise similar characteristics whose endings are observed. Under this methodology, censored spells do not pull down the estimated median spell duration, for example, as they do when the problem of censoring is not recognized.

It is worth noting, however, that this approach assumes that censored spells are not systematically different from uncensored spells (except in ways fully captured in the X vector of explanatory variables), and that spells that occur at the beginning of the observation period are not systematically different from those starting nearer the end. To the extent that external events -- for example, legislative changes or changes in the state of the economy -- affect spell durations over time, analysis techniques that pool spell observations across the period as a whole may be misleading.

This approach does allow the contribution of a variety of factors -- either fixed (e.g., sex and race) or time-varying (e.g., employment status) -- to the conditional probability of exit (or of survival) to be estimated -- these factors are simply included in the X vector of explanatory variables described above.
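The month-by-month conditional exit probability described above can be sketched as a simple life table. The Python below is a minimal, covariate-free illustration (there is no X vector), not a reconstruction of the omitted graphic or of any particular package's routine. It assumes each spell is recorded as (months observed in the state, whether the spell was seen to end in its last observed month), and the data are invented.

```python
from fractions import Fraction


def life_table(spells, max_month):
    """Month-by-month exit probabilities and survival, allowing for censoring.

    A spell counts as at risk in every month in which it is observed in the
    state, whether or not its ending is ever seen.
    """
    rows = []
    survival = Fraction(1)
    for month in range(1, max_month + 1):
        at_risk = sum(1 for months, _ in spells if months >= month)
        if at_risk == 0:
            break
        exits = sum(1 for months, ended in spells if months == month and ended)
        hazard = Fraction(exits, at_risk)  # P(exit in this month | lasted this long)
        survival *= 1 - hazard             # P(spell lasts beyond this month)
        rows.append((month, at_risk, exits, hazard, survival))
    return rows


def naive_share_ended_by(spells, month):
    """Mistaken shortcut: treat every spell, censored or not, as completed."""
    return Fraction(sum(1 for months, _ in spells if months <= month), len(spells))


# Hypothetical spells; ended=False marks a spell still in progress when last observed.
spells = [(1, True), (2, True), (2, False), (3, True),
          (4, False), (6, True), (6, False), (6, False)]

table = life_table(spells, 6)
survival_by_month = {row[0]: row[4] for row in table}
life_table_share_ended_by_3 = 1 - survival_by_month[3]
naive_share_ended_by_3 = naive_share_ended_by(spells, 3)
```

With these invented spells the life table estimates that two fifths of spells end within three months, while the shortcut that counts censored spells as completed puts the figure at one half. This is the sense in which partially observed spells are misclassified as short ones.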
This approach is very popular as a general method of analyzing spell durations and their determinants, and models of this type can be implemented in SAS as well as in other easily-obtained statistical packages (although typically the analyst is required to assume some specific underlying form for the distribution of exit probabilities). Only data sets that provide a reasonably continuous record for a reasonably large sample of individuals entering the state being examined can be used with this approach, however, which limits its usefulness with smaller or less focused data sets or those in which data have been collected in an intermittent pattern.

III. Conclusions

In summary, the many new sources of longitudinal data on incomes and family structures that have become available in the last decade offer exciting research opportunities to the policy analyst, but they bring with them their own unique measurement problems. Because these data sources are both more complex and less familiar than are cross-sectional databases covering such topics, analyzing them can present some challenges. For analysts willing to address these challenges, however, there are useful solutions, and these data can be used to provide important new insights into the processes underlying economic and demographic change. Indeed, as discussed briefly in the various examples of measurement problems and their solutions given throughout the paper, important applications of longitudinal analysis to policy issues have already been carried out in many areas.
A few examples include Bane and Ellwood's analysis of poverty spells and of AFDC participation using the PSID; the work by Bruce Vavrichek and Ralph Smith of the Congressional Budget Office on spells of unemployment insurance recipiency as observed in the SIPP; several Social Security Administration-sponsored studies on retirement behavior as observed in the RHS; and Peter Gottschalk's work on intergenerational transmission of dependency as observed in the NLS. Projects are now underway to address a whole host of additional issues, including patterns of health insurance coverage, multiple program participation for low-income beneficiaries, and earnings and employment patterns for the working poor. The work that has been done so far and the work that is now underway represent major advances in our understanding of these issues, but there is much further analysis that could be done with our existing longitudinal survey data. To some extent, this expansion will simply take time -- analysts need to become more familiar both with the surveys themselves and with appropriate techniques for analyzing and interpreting these data. Already, however, there is beginning to be a large literature on the applications of duration analysis, in particular, to economic and demographic data, and this literature can only be expected to grow over the next several years as additional data become available and additional issues are explored.

What can statistical agencies, and data producers in particular, do to help the analyst undertaking this type of study? In my view, these agencies could support longitudinal analysis efforts in two major ways. First, data producers do not always produce files that are highly amenable to longitudinal analysis, even when such analysis is the primary mission of a particular data-collection effort.
Understandably, when a new survey such as the SIPP comes out, a great deal of effort is devoted to the early cross-sectional files, since analysts are anxious to see how these new data line up with data from familiar cross-sectional surveys. In addition, the early waves of any survey will be ready for analysis long before the survey itself has been completed and edited longitudinally, and data producers are understandably anxious to get these first products to the users as fast as possible. Once a survey has been in regular production for some period of time, however, it would make sense to lessen the emphasis on cross-sectional files and to increase efforts to produce reasonable longitudinal data in a reasonably timely fashion. We already have excellent cross-sectional data on family incomes and labor force status, and unless the survey in question is clearly adding to our store of available cross-sectional data on a particular topic, cross-sectional applications should receive less attention. In particular, the level of effort devoted to activities such as cross-sectional imputation that have no application in the longitudinal context should be reduced. Instead, greater research efforts should be devoted to continuing problems like longitudinal editing and the development of reasonable longitudinal imputation procedures.

The second way in which statistical agencies could support longitudinal analysis would be to undertake more of it themselves. Data producers typically publish at least some cross-sectional information from the files they produce, and in some cases -- the CPS publications in the Census P-60 series, for example, come to mind -- these tables themselves provide important information on which policy-makers come to rely. It ought to be possible for the Bureau of the Census and other data producers to publish similar information, but of a longitudinal nature, using the longitudinal databases that they now produce.
The assumptions underlying survival analyses might be difficult to explain in such a context, but basic information on the experience of a given cohort, for example, is fairly easy to explain and to interpret. For instance, one could look at how many of those becoming unemployed in a given period were still unemployed one, two, or more months later; how many of those on welfare or in poverty at a given point in time were still in that state x months (or years) later; and so forth. Similarly, one could examine the transitions between states more directly, along with the characteristics of those experiencing the transitions. One could ask, for example, what proportion of those leaving unemployment in a given year find jobs, and what proportion leave the labor force? Does it differ for men and women, blacks and whites, old and young workers? For that matter, one could ask who becomes unemployed, and how does the incidence differ by demographic characteristics? Or, for example, what about those who enter welfare programs in a given year -- what is the incidence of entry for those in different categories? What happens to those who leave welfare in that year? Do they get married? Do they get jobs? How many of those gaining jobs are still employed six months later, or a year later? Similar questions could be asked about the incidence and impacts of many other transitions, from divorce to retirement to the birth of a child. The longitudinal analysis issues outlined above represent only a small proportion of those that could be undertaken -- but the point here is that there is a great deal of fairly straightforward longitudinal analysis that would be very helpful to policy-makers, and that is not now being done in any systematic way.
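As a rough illustration of how simple such tabulations can be, the sketch below computes, for each demographic group, the share of observed exits from unemployment that lead to each destination. The records, group labels, and function name are hypothetical, and the counts are unweighted; a published table would presumably apply the survey's longitudinal weights.

```python
from collections import Counter, defaultdict


def exit_destinations(records):
    """Share of exits going to each destination, within each group.

    records: iterable of (group, destination) pairs, one per person
    observed leaving the state of interest during the reference year.
    """
    counts = defaultdict(Counter)
    for group, destination in records:
        counts[group][destination] += 1
    shares = {}
    for group, by_destination in counts.items():
        total = sum(by_destination.values())
        shares[group] = {d: n / total for d, n in by_destination.items()}
    return shares


# Hypothetical, unweighted exits from unemployment in one year.
records = [
    ("men", "job"), ("men", "job"), ("men", "job"),
    ("men", "left labor force"),
    ("women", "job"), ("women", "job"),
    ("women", "left labor force"), ("women", "left labor force"),
]

shares = exit_destinations(records)
for group, by_destination in shares.items():
    print(group, by_destination)
```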
Some very useful reports have been issued, of course -- for example, the Census Bureau's P-70 series includes some longitudinal analysis from the SIPP, although so far such applications have been relatively limited in both quantity and scope. Again, many of these surveys, especially the SIPP and the NMCES, are still fairly new, so perhaps it is not surprising that their producers have not yet developed a complete, systematic schedule of reports examining basic longitudinal issues. Nevertheless, devoting more attention to their own longitudinal analyses would probably be the most important step data producers could take to support this type of research, and could also increase substantially the useful information that we are able to obtain from these surveys.

References

Allison, P.D. "Discrete-Time Methods for the Analysis of Event Histories," in S. Leinhardt (ed.), Sociological Methodology 1982, San Francisco: Jossey-Bass, 1982.

Bane, Mary Jo and David T. Ellwood. "The Dynamics of Dependence: the Routes to Self-Sufficiency." Report prepared for the U.S. Department of Health and Human Services. Cambridge, Mass.: Harvard University, 1983.

Bane, Mary Jo and David T. Ellwood. "Slipping Into and Out of Poverty: The Dynamics of Spells." Journal of Human Resources, Winter 1986, 21(1), pp. 1-23.

Bianchi, Suzanne, and Edith McArthur. "Family Disruption and Economic Hardship: The Short-Run Picture for Children." Paper presented at the annual meeting of the Population Association of America, May 1989.

Blank, Rebecca. "How Important is Welfare Dependence?" Working Paper No. 2026. Cambridge, Mass.: National Bureau of Economic Research, Sept. 1986.

Citro, Constance F., Donald J. Hernandez, and Roger A. Herriot. "Longitudinal Household Concepts in SIPP: Preliminary Results." SIPP Working Paper Series No. 8611. Washington, D.C.: U.S. Bureau of the Census, 1986.

Cox, B. and S. Cohen. Methodological Issues for Health Care Surveys. New York: Marcel Dekker, 1985.

Duncan, Greg J. (ed.). Years of Poverty, Years of Plenty. Ann Arbor, Mich.: Institute for Social Research, 1984.

Duncan, Greg J., Richard D. Coe, and Martha S. Hill. "The Dynamics of Poverty," in G. Duncan, ed., (op. cit.) 1984, pp. 33-70.

Ernst, L., D. Hubble, and D. Judkins. "Longitudinal Family and Household Estimation in SIPP." Proceedings of the Survey Research Methods Section. Washington, D.C.: American Statistical Association, 1984.

Fox, Alan. "Work Status and Income Change, 1968-72: Retirement History Study Preview." Social Security Bulletin, 1976.

Gottschalk, Peter. "The Intergenerational Transmission of Welfare Participation: Facts and Possible Causes." Paper presented at the annual meeting of the Association for Public Policy Analysis and Management, November 1989.

Gustman, Alan L. and Thomas L. Steinmeier. "A Structural Retirement Model." Econometrica, May 1986, pp. 555-584.

Levy, Frank. "How Big is the American Underclass?" Working Paper 0090-1. Washington, D.C.: The Urban Institute, 1977.

McMillen, David B. and Roger A. Herriot. "Toward a Longitudinal Definition of Households." SIPP Working Paper Series No. 8402. Washington, D.C.: U.S. Bureau of the Census, 1984.

McNeil, John, Enrique Lamas and Cynthia Harpine. "Moving Into and Out of Poverty: Data from the First SIPP Panel File." Proceedings of the Social Statistics Section. Washington, D.C.: American Statistical Association, 1988.

Office of Management and Budget, Statistical Policy Office. Federal Longitudinal Surveys. Statistical Policy Working Paper No. 13. Washington, D.C.: OMB, May 1986.

Ruggles, Patricia. Drawing the Line: Alternative Poverty Measures and Their Implications for Public Policy. Washington, D.C.: Urban Institute Press, 1990.

Ruggles, Patricia. "Welfare Dependency and Its Causes: Determinants of the Duration of Welfare Spells." Paper presented at the annual meeting of the American Economic Association, Dec. 1988.

Ruggles, Patricia and Roberton Williams. "Longitudinal Measures of Poverty: Accounting for Income and Assets Over Time." Review of Income and Wealth, Sept. 1989, 35(3), pp. 225-244.

Ruggles, Patricia and Roberton Williams. "Transitions In and Out of Poverty." Paper presented at the annual meeting of the American Economic Association, Dec. 1986.

Short, Pamela Farley, Joel C. Cantor, and Alan Monheit. "Dynamics of Medicaid Enrollment." Inquiry, Winter 1984, 25(4), pp. 504-516.

Tuma, Nancy B. and Michael T. Hannan. Social Dynamics: Models and Methods. New York: Academic Press, 1984.

Vavrichek, Bruce and Ralph E. Smith. Family Incomes of Unemployment Insurance Recipients and the Implications for Extending Benefits. Washington, D.C.: Congressional Budget Office, 1990.

Williams, Roberton. "Poverty Rates and Program Participation in the SIPP and the CPS." Paper presented at the annual meeting of the American Statistical Association, August 1986.

Williams, Roberton and Patricia Ruggles. "Determinants of Changes in Income Status and Welfare Program Participation." Paper presented at the annual meeting of the American Statistical Association, August 1987.

DISCUSSION
Michael Brick
Westat, Inc.

Pearson

Pearson's paper is an excellent guide to federal agencies on the merits of choosing between various alternatives in designing a survey to meet specific policy-relevant objectives. He argues quite persuasively that the important design question is not whether cross-sectional or longitudinal is better, but which combination of designs is most effective to answer the policy questions. Another important issue that Pearson raises is the underutilization of experimentation in longitudinal surveys. I strongly agree with him that experiments are needed if causal modelling is a goal. Along these same lines, longitudinal surveys offer a rich environment for experimenting with a wide variety of other issues such as memory and recall.
Some items could be collected in the baseline of a longitudinal study, and the respondent could then be asked to recall this information in a later followup. Some examples that might be interesting are income from previous years, grades while in school, even opinions and attitudes. These types of experiments might help support some of the cognitive research theories or open the door to new and more realistic theories. Pearson's listing of the advantages and disadvantages of longitudinal files is very useful and can and should be used to help improve design decisions. I have a few quibbles about the list that may offer a slightly different perspective. The first issue is the placement of nonresponse as a disadvantage. Although it is true that attrition is typically a bigger concern in longitudinal files, the availability of additional covariates to reduce nonresponse bias may partially offset this disadvantage. However, until we devise and implement these methods effectively in large-scale longitudinal files, the nonresponse problem will remain a disadvantage. In many ways comparing cross-sectional and longitudinal nonresponse problems is fraught with many of the same difficulties associated with cost comparisons. If you are accomplishing something that cannot be done reliably in any other way, then you do have an "apples and oranges" comparison. When the cost of a survey or the problem associated with nonresponse is discussed, the alternatives that satisfy the same objectives must be clearly specified. Pearson is correct that general statements or conventional wisdom can lead to poor design decisions. More complete models of the errors and costs for longitudinal alternatives to cross-sectional surveys are needed to help unravel these questions. My main complaint with the list of advantages and disadvantages is that the discussion of response errors is too limited.
If response errors create spurious estimates of change, then the major advantages (the first three of the four advantages he lists) of longitudinal files are reduced or eliminated. I'll return to this point after commenting on the paper by Kasprzyk and Jacobs.

Kasprzyk and Jacobs

The paper by Kasprzyk and Jacobs is a welcome insight into the many and varied issues that are peculiar to longitudinal survey design, operations and analysis. Their even-handed treatment of the differences that are encountered in large-scale longitudinal surveys obviously reflects many hours of wrestling with the real problems in this setting. In their discussion of the advantages and disadvantages of longitudinal surveys, they mention that the net change can be estimated more precisely because of the positive correlation that can often be expected in the variables over time. While this is true, the practice in many federal longitudinal surveys has not taken advantage of this correlation properly. In some cases only cross-sectional estimates of variances are ever computed. In other cases, correlations are estimated only for a very few statistics and then a generalized correlation is proposed for all other variables. Since the more precise estimation of net change is probably the greatest advantage that a longitudinal survey has, this practice needs to be re-examined. If generalized correlations are to be used, then it is important to put greater efforts into their production and distribution. For example, in a recent survey Westat conducted for the National Science Foundation, estimated correlations over a two-year period ranged from -0.10 to +0.65. Sampling errors for estimates of net change are not difficult to measure and should be included as a routine product in a federal longitudinal survey. On a different issue, Kasprzyk and Jacobs note that in some longitudinal surveys efforts are made to avoid presenting data containing obvious errors.
While this seems like a reasonable objective, it can actually result in poorer quality data. For example, if an error is made in the baseline period and all later data are verified against it for consistency, then new problems could be created. "Correcting" the errors in an edit program could be simply a way to suppress the problem so that users do not "see" it. It is still a real problem. The presence of earlier data may encourage "over-editing" longitudinal survey data, creating false impressions of data quality, and increasing errors in computed statistics. In the discussion of longitudinal weighting and imputation, Kasprzyk and Jacobs review a number of important statistical issues. As they note in their discussion there is not a universal agreement on these issues. Re-iterating a previous comment, I think that imputation should play a much larger role than weighting in estimation from longitudinal files. The information obtained in different data collection waves should be used for more efficient estimation than is possible from simple weighting adjustments. Of course, the imputation of longitudinal files is also much more complex, and methods for handling imputation in large longitudinal surveys are not very advanced. This is a challenge for producers and analysts of longitudinal files. [Graphic omitted.]

The Importance of Measurement Error in Longitudinal Surveys

There are four concerns about measurement errors that I think are very important to designers and analysts of longitudinal surveys.
These concerns are:

- Measurement errors are the most crucial problem facing longitudinal surveys
- Measurement errors result in biased estimates of gross change and the ability to measure gross change is a prime goal in many longitudinal surveys
- Measurement errors are a much greater problem in longitudinal surveys than in cross-sectional surveys
- Changes in survey processes are required if the potential of longitudinal surveys is to be realized

The concern over measurement errors in longitudinal surveys is not new. Errors in estimates of gross change have long been recognized as having biases which reduce their usefulness. Efforts have been made to address these problems from both the design and the analytic perspective. A simple hypothetical example may help to understand the problem. Figure 1 shows values of a characteristic (e.g., participation in a program, unemployment, health coverage) for a sample of units. The two extreme columns show the true values at times 1 and 2. Measurement error results in the values shown in the adjacent columns being actually observed. The observed values then give rise to the observed change or transition values shown in the center column. First, notice that measurement error has not greatly distorted the cross-sectional estimates for either time 1 or time 2 (the values in error are shown in bold). Therefore the estimate of level and the estimate of the net change between times 1 and 2, which could be measured with either a longitudinal or cross-sectional survey, are not greatly affected by the measurement error. On the other hand, the impact of measurement error on the gross change is dramatic. Ten units are observed to have changed, while the true number that changed is only 4. One of the most important and distinguishing features of many longitudinal surveys is the ability to produce estimates of gross change, but measurement error can seriously distort these estimates.
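The same effect can be shown numerically. The sketch below is not a reconstruction of Figure 1; it uses made-up proportions and assumes that classification errors are independent between the two waves and occur at the same rates in each, with q the probability of observing "No" for a true "Yes" and f the probability of observing "Yes" for a true "No". The function name and all numbers are hypothetical.

```python
def observed_transitions(true_joint, q, f):
    """Expected observed two-wave table under independent misclassification.

    true_joint: dict mapping (state_t1, state_t2) -> true proportion,
                with 1 = has the characteristic and 0 = does not.
    q: Pr(observed 0 | true 1);  f: Pr(observed 1 | true 0).
    """
    # misclass[true][observed] = probability of that observed value
    misclass = {1: {1: 1.0 - q, 0: q}, 0: {1: f, 0: 1.0 - f}}
    observed = {(i, j): 0.0 for i in (0, 1) for j in (0, 1)}
    for (k, l), share in true_joint.items():
        for i in (0, 1):
            for j in (0, 1):
                observed[(i, j)] += share * misclass[k][i] * misclass[l][j]
    return observed


# Hypothetical truth: 10 percent have the characteristic in each wave,
# and 4 percent of units truly change state between waves.
true_joint = {(1, 1): 0.08, (1, 0): 0.02, (0, 1): 0.02, (0, 0): 0.88}
obs = observed_transitions(true_joint, q=0.10, f=0.02)

true_level = true_joint[(1, 1)] + true_joint[(1, 0)]
obs_level = obs[(1, 1)] + obs[(1, 0)]
true_gross = true_joint[(1, 0)] + true_joint[(0, 1)]
obs_gross = obs[(1, 0)] + obs[(0, 1)]

print(f"level at time 1: true {true_level:.3f}, observed {obs_level:.3f}")
print(f"gross change:    true {true_gross:.3f}, observed {obs_gross:.3f}")
```

Under these assumed error rates the level moves only from 0.100 to 0.108 and the net change stays at zero, but the expected observed gross change (about 0.084) is roughly double the true 0.04.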
Measurement error can have a profound impact on estimates of transitions, spells, durations, and flows. [Graphic omitted.] It is instructive to examine the reasons why measurement error causes so many more problems in longitudinal than in cross-sectional surveys. The truth-by-survey table for a cross-sectional survey is a useful way of working with measurement error for qualitative variables (see Tables 1.a and 1.b). The net bias is the difference of two margins from the table, (a+b) - (a+c), or simply b-c. The goal is to have zero or at least a small net bias. The conditions for zero net bias (i.e., when b=c) are given in the Appendix of U.S. Bureau of the Census (1985). Using their notation, let Pr(observed value = No | true value = Yes) = q and Pr(observed value = Yes | true value = No) = f. Then the net bias equals zero if Pq = (1-P)f, where P is the true proportion of the population with the characteristic. If the two error rates are approximately equal and P>0 and q>0, then the net bias will get smaller as P approaches .50. If P=.02, then the ratio of q:f must be 49:1 for the net bias to equal zero. This merely points out the inter-relationship between the net bias and the size of the estimate. For estimates of rare characteristics, measurement error is likely to be more problematic. Of course, the distribution of the two error rates is also of great importance. If we extend the example to a second observation time, we encounter the same problem but the impact is larger. First, note that the net bias for the two observation periods is equal if the probabilities of error are the same between times 1 and 2 and the proportion with the characteristic does not change. [Graphic omitted.] The point of this simple exercise is to show that in a longitudinal setting the net bias for gross change involves the sum of four differences, while for estimates of level there is only one difference.
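The cross-sectional net bias condition is easy to check numerically. In the sketch below, the cells b and c are written in terms of P, q, and f as defined in the text (b = (1-P)f, c = Pq); the function names and the specific error rates are hypothetical and serve only to illustrate the arithmetic.

```python
def net_bias(P, q, f):
    """Net bias b - c of an estimated proportion.

    b = (1 - P) * f : truly "No" but observed "Yes"
    c = P * q       : truly "Yes" but observed "No"
    """
    return (1.0 - P) * f - P * q


def zero_bias_ratio(P):
    """Ratio q:f at which the net bias is zero, from P*q = (1-P)*f."""
    return (1.0 - P) / P


# Rare characteristic: q must be about 49 times f for the errors to cancel.
ratio_rare = zero_bias_ratio(0.02)
bias_balanced = net_bias(0.02, q=0.49, f=0.01)

# Equal error rates (assumed 5 percent each): bias shrinks as P nears .50.
bias_by_P = {P: net_bias(P, q=0.05, f=0.05) for P in (0.02, 0.10, 0.30, 0.50)}

print(f"q:f ratio for zero bias at P=.02: {ratio_rare:.1f}")
print(f"net bias with q=.49, f=.01 at P=.02: {bias_balanced:.6f}")
for P, bias in bias_by_P.items():
    print(f"P={P:.2f}  net bias={bias:.4f}")
```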
The problem is naturally greater in trying to measure gross change, which is often one of the main objectives of a longitudinal survey. As I noted earlier, the problems of response errors in longitudinal surveys have been addressed from both a design and analysis perspective. The works of Bye and Schechter (1986), Chua and Fuller (1987), and Poterba and Summers (1984) are some excellent examples of the analytic approach. Marquis and Moore (1989) offer additional insight using data from records, and highlight the need for designing the surveys and instruments better. I suspect that if Dr. Deming were to become involved in this issue he might say that longitudinal surveys offer new challenges and we must change the way we do business. In longitudinal surveys we can no longer accept the errors and expect others to buy our products. We must concentrate on the survey process, identify the major sources of variability, and take steps to eliminate them from the system. If we fail to take these types of actions, then it is likely that it will be harder and harder to support longitudinal surveys in the future.

[Graphics omitted.]

References

Bye, B.V. and Schechter, E.S. (1986), "A Latent Markov Model Approach to the Estimation of Response Errors in Multiwave Panel Data," Journal of the American Statistical Association, 81, 375-380.

Chua, T.C. and Fuller, W.A. (1987), "A Model for Multinomial Response Error Applied to Labor Flows," Journal of the American Statistical Association, 82, 46-51.

Marquis, K.H. and Moore, J.C. (1989), "Some Response Errors in the SIPP -- With Thoughts About Their Effects and Remedies," Proceedings of the Section on Survey Research Methods of the American Statistical Association.

Poterba, J.M. and Summers, L.H. (1984), "Adjusting the Gross Changes Data: Implications for Labor Market Dynamics," Proceedings of the Conference on Gross Flows in Labor Force Statistics.

U.S. Bureau of the Census (1985), "Evaluating Censuses of Population and Housing," Statistical Training Document, ISP-TR-5.

DISCUSSION
Marilyn E. Manser
U.S. Bureau of Labor Statistics

The papers by Patricia Ruggles and Robert Pearson, on which I was invited to comment, both provide helpful insights into the usefulness of longitudinal data. I find myself in agreement with them, for the most part. What I primarily will do is reinforce points which I think are particularly important and discuss other points on which my perspective may be a little different. Let me begin with a fundamental question suggested by this session's papers: what is the definition of a longitudinal survey vs. a cross-section survey? Although I do not have a clear answer to this, I want to raise it for thought. Pearson defines a longitudinal survey as one "in which repeated observations are made of the same individual subjects." In his paper with Robert Boruch (1988), the Current Population Survey (CPS) is included in the description of longitudinal surveys. In contrast, the OMB Statistical Policy Working Paper 13, "Federal Longitudinal Surveys," excluded rotating panel surveys such as the CPS, the Consumer Expenditure Survey, and the National Crime Survey because there was no explicit plan for longitudinal analysis incorporated. Ruggles never explicitly defines what she means by a longitudinal survey, but is clearly using CPS as an example of a cross-sectional survey. Alternative definitions have of course been considered elsewhere. One possible design-based definition could include a requirement that in order to be called longitudinal a survey must follow movers -- on this basis, CPS would not be called longitudinal even if a specific plan were developed to make use of its longitudinal aspects.
(CPS permits, for example, a variety of longitudinal studies of labor market situations, although at present there are other problems with the quality of longitudinal estimates based on CPS besides the fact that movers are not followed.) It is important to note also that a purely design-based criterion would be less than fully satisfactory -- problems could prevent the design from being implemented. For instance, budget cuts could prevent any follow-up after the first round. Less drastically, it is important to ask, if following movers is viewed as important, what proportion of movers are actually found? It would be useful to have this information produced regularly on longitudinal surveys. To my knowledge, few major ongoing program efforts depend on truly one-shot surveys, which seem to be what Pearson is calling a cross-sectional survey. Most statistical surveys are used to produce at least aggregate estimates of change, even if that was not the primary purpose in mind when they were designed. For example, for many analyses of the economic situation one is really interested in whether the unemployment rate is high or low compared to other periods. Uses of data to construct aggregate measures of change and arguments for improving aggregate cross-sectional estimates can both justify a statistical design including a rotating panel, even when no explicit longitudinal analyses are planned. But in addition there may be cost implications to one-shot surveys, making them less cost effective than what I will call "mixed surveys", rotating panel surveys which fail stringent definitions of a longitudinal survey but are not truly one-shot. In any case, for a relatively small additional effort to improve longitudinal aspects of mixed surveys such as CPS, it may be possible to improve analytic possibilities enormously. Both the Pearson and the Ruggles papers focus on household surveys.
But establishment surveys such as the Census's Annual Survey of Manufacturers and BLS's 790 survey typically go back to the same sample units repeatedly. Such surveys offer tremendous opportunities for increasing understanding of economic phenomena if the problems in making use of their panel aspects can be overcome.

I. Advantages and Disadvantages of Longitudinal Surveys

A major focus of Pearson's paper is on weighing the advantages and disadvantages of longitudinal surveys. This is a valid and useful discussion, but note that the disadvantages all center on measurement problems. No one has successfully argued, to my knowledge, that non-experimental cross-section surveys are more useful for the analytic purposes for which non-experimental longitudinal surveys are designed. Further, given that it is too burdensome to obtain the needed information with retrospective questions, which in any case would entail severe recall problems, there is really no alternative to longitudinal surveys for many types of analyses. Longitudinal surveys are extremely important and should be given greater use than they have received in the past, but much more research is needed on measurement problems. One section of Pearson's paper argues for coupling longitudinal and experimental designs. Surveys conducted to collect and analyze data on social experiments have typically been longitudinal. The point made here is a recommendation to conduct policy-related experiments with individual respondents to a general-purpose longitudinal survey. Clearly this could be a cost-effective way to collect data if the individuals on whom the experiment was being conducted were no longer to be counted as part of the original survey. But otherwise I would be extremely troubled by doing this. The whole purpose of policy-related experiments is often to influence outcomes, and even if that is not the intent outcomes are still likely to be influenced.
If this occurs then responses of the sample members to a wide range of survey questions are no longer representative of the population as a whole. In contrast, conducting survey methodological experiments which can be assumed not to have a measurable impact on outcomes can be useful or necessary in some instances. Pearson also makes the related point that joining additional questions to an ongoing survey can be valuable. I am in wholehearted agreement with this. One thing that the Department of Labor's (DOL's) National Longitudinal Survey (NLS) program has done that has, in my view, been very beneficial to a variety of government agencies as well as to the outside research community is to accept funding from other agencies to collect information of interest to them which also enhances usefulness of the data for analyzing labor market behavior. For example, the National Institute for Child Health and Human Development has added blocks of questions on child care use to the NLS Youth survey. Because of the importance to us and to others of joining data collection needs from other agencies to our survey, we have developed a general policy to preserve the integrity of the basic data.

II. Longitudinal Analysis

As implied by her title, Patricia Ruggles' paper focuses on analysis of longitudinal data, primarily econometric analysis using the micro data. It has been in the micro area that the major use of longitudinal survey data has occurred. For instance, as documented by Frank Stafford (1986), much of what we know about labor economics has come from longitudinal surveys, primarily NLS and the Panel Study of Income Dynamics. Research using these data continues, on topics such as the impact of private sector training on future earnings, low wage jobs and their impacts, and the labor supply behavior of women during pregnancy and shortly after birth of the child.
Similarly, as more experience with the Survey of Income and Program Participation (SIPP) accumulates, we are likely to see many useful micro studies that will impact the way the research community thinks about issues such as spells of dependence on various programs. But where longitudinal household surveys have not made a large contribution yet, it seems to me, is in short-term analysis of current data: for instance, a series of reports on a topic such as how transition rates out of poverty have changed since last year. However, I understand that the Census Bureau has recently released two P-70 reports using SIPP to analyze transitions. I believe that development of a series of current analytic reports in addition to long-term econometric research studies from NLS is an extremely important goal for this program, but the program has never in the past included this dimension. In general, I think that carrying out both long-term econometric studies and shorter-term, more current, tabular analyses is important and that they are complementary. However, as all three papers in this session note, further research on weighting problems is greatly needed. Much of the Ruggles paper considers the complexity of use of longitudinal data sets, for which they are often criticized. She recognizes, and this is important, that these complexities necessarily come along with the richness of the data sets that are responsible for the "exciting research opportunities" that they provide. While it is true that existing longitudinal data sets are hard to use, this is the case because of the vast amount of information they contain, particularly after several rounds of interviews have taken place. (For instance, she notes that SIPP provides a choice of accounting periods for income measures and this choice can make a difference to the analysis. This is in contrast to CPS where income is available for only one accounting period. With SIPP, richness of choice creates complexity.
Unless there is widespread agreement about what measure to use, summary income measures provided on its files by a statistical agency would presumably not suit some of its users.) Another related point in Ruggles' paper is the recommendation that agencies append a myriad of transition flags to person records. While this is feasible, again, in general, unless there is a large set of users with a particular need or an ongoing agency use of the data for a particular purpose it will probably be the case that a particular user will not find all aspects of a publicly available file ideally suited to his or her particular use. Limited resources can often be used, however, to respond to needs affecting a number of users. For instance, many users of NLS data use the event history information on jobs -- very rich data that were initially very difficult to use. As a result, a Workhistory data tape was developed which contains weekly arrays of labor force status, usual hours worked per week, and dual job information. This Workhistory data tape makes analyses using this information considerably easier. As she notes in her introduction, Ruggles' focus is "almost exclusively on the application of longitudinal analysis to questions concerning patterns of family income, expenditures, and/or demographic change." Because of this focus, she devotes considerable attention to efforts to construct a longitudinal family definition. This is a major problem for analyses of topics such as transitions on and off means-tested government programs. This problem, too, exists because of the richness of this type of longitudinal data. It is well-known that examining income levels by family type using CPS is plagued by the fact that the family structure information relates to a different period than the income measure. That a longitudinal data source entails problems for analysis does not necessarily mean that analysis of a similar topic would be preferable using other, easier-to-use data.
Note also that this problem of family status definition is not a central problem in analysis of many types of longitudinal issues -- it is the focus here that makes it one. For example, studies of labor supply behavior, work experience, earnings growth, and so on focus on the individual. NLS follows individuals primarily because of its focus on labor force related information for people in groups of particular interest to DOL. Similarly, the Department of Education focuses on the individual in its longitudinal studies which focus primarily on educational experiences and outcomes for youth. Ruggles' point that just reweighting the way it is typically done does not necessarily solve a problem due to nonrandom attrition is an important one. This suggests one reason, among others, why a microeconometric analysis may be preferable to looking only at tables: as she points out, it is possible to include people who are in the sample only for some of the periods in a micro study. But in general, use of tabular and econometric analyses can be complementary. In her section on analyzing data on duration, Ruggles provides a useful discussion of some of the pitfalls in this type of study. Restricting an analysis to a particular age/sex cohort is not problematic if the interest is in a particular group. It is using a variable that represents a choice to select a sample for analysis that causes all the types of problems she discusses in this section. In her conclusion, she makes two recommendations for statistical agencies. One is to lessen the emphasis on cross-sectional files. Her point that it is important not to use data with cross-sectional imputations for a longitudinal analysis is an important one. But surveys may have multiple purposes so that avoiding cross-sectional imputations entirely, especially given needs for issuing timely data, would not be a possibility in many cases. The second recommendation is for more longitudinal analysis by statistical agencies.
I agree that this is important, even though longitudinal studies and tables may be difficult to explain in many cases, as she notes. In conclusion, let me note that as part of the major ongoing joint BLS/Census CPS Redesign effort, attention to longitudinal issues is planned. Tables of gross monthly flows between labor force states are presently produced regularly but are not officially published because the estimates are not of sufficient quality. Efforts are planned to improve longitudinal aspects of the survey and to research adjustment techniques to improve the gross flows tables. In addition, if funding permits, plans are to conduct a separate CPS-like longitudinal survey which would follow movers and keep people in the sample longer. This survey, in addition to supporting improved analysis of short-run changes in labor force behavior, would permit research on a multiplicity of survey-related topics.

References

Boruch, R. F. and R. W. Pearson, "Assessing the Quality of Longitudinal Surveys," Evaluation Review, Vol. 12, February 1988, pp. 3-58.
Stafford, F., "Forestalling the Demise of Empirical Economics: The Role of Microdata in Labor Economics Research," in O. C. Ashenfelter and R. Layard, eds., Handbook of Labor Economics, Vol. 1. Amsterdam: North-Holland, 1986, pp. 387-423.

TOWARDS AN AGENDA FOR THE FUTURE

Stephen E. Fienberg
Carnegie Mellon University

My remarks this afternoon will focus on a few key themes that emerged in various sessions over the past two days. I will attempt to use these themes to point towards elements of an agenda for the future of the federal statistical system, not just the future of the Federal Committee on Statistical Methodology that oversees the OMB Statistical Policy Working Paper Series around which the seminar has been centered.

On Quality

George Hanuschak in the session on survey quality profiles recalled the words of one of the present-day quality gurus, to the effect that we should build quality into the system, not just inspect for the lack of it after the fact. A variant on this is the theme that we need to build quality and evaluation into our data collection processes. The traditional notion of coming back several months later to check on the answers provided by a survey respondent seems at odds with the notion of ongoing change and improvement. For example, consider two components of the 1990 Census, the group quarters censuses of college and university campuses and the special homeless component, the S-night program. In neither case can one expect to return a month or so after the enumeration to check on information recorded. Thus a careful census quality program would have some built-in evaluation mechanism for these components. At Carnegie Mellon University we have a new Statistical Center for Quality Improvement which we operate jointly with the statisticians at the University of Pittsburgh, and my colleagues associated with this center are fond of referring to the three generations of statistical approaches to quality. The first of these is the basic univariate control chart generation of technology associated with the names of Shewhart, Deming, and others and based on ideas that were found in the literature in the 1920s and 1930s. The second generation was linked to the introduction of careful experimentation specifically designed for the industrial setting, e.g., response surface methodology and EVOP, and introduced in the 1950s and 1960s. The recent interest in Taguchi methods is rooted in large part in basic fractional factorial design ideas. We are just beginning to see the emergence of the third generation of quality techniques which focus on statistical methods for the analysis of complex multivariate data using high speed computation and computer graphics.
Based on what I know about quality efforts in the federal statistical agencies, and what I heard described at this seminar, I would describe the current state-of-the-art as being focussed on the first generation of quality ideas, univariate in approach, lacking careful and systematic experimentation, and devoid of techniques rooted in the modern world of computing. Yet there are ample opportunities for moving quickly into the second generation by utilizing ideas on the embedding of experiments in surveys (e.g., see Fienberg and Tanur, 1988, 1989). The simplest of the embedded designs (the split ballot experiment) is often recommended for use (as it was in the session here on questionnaire design) but rarely analyzed properly. Indeed, as we look to the widespread exploration of ideas and concepts coming out of the cognitive laboratories, the federal agencies must take seriously the second generation ideas of embedded experiments.

On the Need for Integration

Some of the recent advances in methods for data collection and analysis appear as add-ons, off to the side of the main enterprise. In the spirit of the Total Quality Management movement we have heard in several sessions about the need for integration of the components of survey design, and in a larger sense for the integration of thinking across agencies. I am reminded of an academic story. As a dean at Carnegie Mellon, I sit on the university promotion and tenure committee and get the opportunity to review cases from diverse disciplines. A few years ago, we were reviewing the case of a physicist for tenure and his file contained a number of letters describing his experimental work as brilliant, innovative, or outstanding. As we looked over his curriculum vitae, we noted that he had no individually-authored papers but only appeared as one of a cast of thousands on each paper.
Finally, one committee member asked the presenter of this case what was so distinctive about the candidate's work in high energy physics that merited the laudatory comments. The response was: "He focuses the beam." Now many of you have roles in federal data collection that are akin to that of the physicist's beam-focussing. These are important and often crucial roles, but their value needs to be understood in the broader integrative setting, both by you in your work and by those who are looking towards quality improvement more broadly.

On the Statistical Policy Working Paper Series

While many of the sessions at this seminar were based directly on papers from the OMB Statistical Policy Working Paper Series, others have been on collateral advances in methods and data quality assessment. Bob Groves began the seminar by noting three perspectives on the goals that the series should have. These were to serve as: (a) reports on the "state-of-the-art" of federal practice, (b) vehicles for agency cross-fertilization, (c) prods to new developments. Many of the nineteen papers issued to date have succeeded admirably in categories (a) and (b), and they have changed how work is done across agencies. Others have had only limited impact. But I think that we could agree with Groves that few of the papers were prods to major new methodological developments. Perhaps the Federal Committee on Statistical Methodology that oversees the OMB Statistical Policy Working Paper Series needs to be more daring in its choice of topics in the future. New topics need not be rooted in ongoing work in specific agencies nor do they need to be ones on which the committee agrees. For how else can we achieve a major shift or revolution in methods and quality? At the same time I should note the need for attention to and support for the committee's activities on the part of senior administrators in the statistical agencies.
If staff do this work only in their spare time we can expect to see few major methodological advances.

Shifts of Paradigm for Federal Statistics

Fritz Scheuren has been talking both at this seminar and in recent years about the need for a paradigm shift in how we do federal statistics. I believe that he is correct in this claim although I do not think that many people understand what he and the philosophers of science mean by paradigm shifts. I encourage those of you who have not read Thomas Kuhn on scientific revolutions to do so, as his ideas often get mangled in the translation. Kuhn talks about the day-to-day orderly change and incremental knowledge approach to science which gets radically altered and reorganized by the introduction of a new set of ideas and a new paradigm such as that associated with the work of a Newton or an Einstein. Now when a paradigm shift occurs, the past tools and perspectives are not all discarded. Rather they are looked at in a different way and accorded a different place in the hierarchy of importance. What we also see is the introduction of dramatically different measurement methods, with markedly changed error profiles. Up through the present day the federal statistical system has been based in large part on tools developed many years ago, more often than not in the 1930s and 1940s. This is especially true in survey design and census taking. With the technological revolution of the 1970s and 1980s, one might have expected to see a paradigm shift in statistics in the agencies, but the computer and its effects have been forced into the old paradigm instead of being the trigger to a reorganization of our thinking. The last decade has been a difficult one for statistical agencies, but perhaps the problems that the agencies have encountered during this period should spur us to rethink what we do and how.
We should be asking if tools like CAPI, CATI, distributed computing networks, major new analytical statistical methods, and the cognitive-statistical laboratory may be the vehicles to major changes.

Impediments to Major Change

Perhaps the biggest impediment to change is the bureaucracy in which most of you work. A piece of this is the attitude: "We've always done it that way." This is related to the theme I would label as "The Agency is the Data." The purpose of collecting statistical data is not an end unto itself, but rather a means to a social or policy goal. The aim of the federal statistical agency then should be to serve these broader goals well, rather than to collect data insulated from outside input and protected from outside scrutiny. We need to move towards making our data relevant; to measure what is of importance, albeit poorly, instead of measuring what current methods are designed to be good at, even if it is of marginal interest. I'd like to tell a parable about the National Goodness Survey which was mandated late one night in conference by Congress as an amendment to a foreign aid bill. The federal methodology coordinating committee was asked to propose a design for this new survey at one of its meetings, and each of its members was asked to come back to the next meeting with a proposal for the design:

(a) The representative from the Bureau of the Census returned with a household survey design that resembled the Current Population Survey, and she noted that surely goodness resided in household locations, just as unemployment does.

(b) The representative from the Energy Information Administration noted that goodness was likely to flow from reservoirs in the ground and thus proposed a design modelled on their survey of natural gas reserves.

(c) The Bureau of Labor Statistics representative suggested that we couldn't ignore the component of goodness that was due to business establishments, and proposed a separate survey based on their new establishment list. But she also offered the auspices of the BLS cognitive laboratory for testing ideas on goodness consumption.

(d) The Bureau of Justice Statistics representative noted that his agency didn't actually conduct its own surveys and referred the committee to the representative of the Census Bureau for how this should be done.

(e) The representative from the National Center for Health Statistics suggested that goodness was a manifestation of physical well-being and urged that the new survey be a supplement to the National Health Interview Survey.

(f) Finally, the National Center for Education Statistics proposed that we ask the state superintendents for public schools to report on the fostering of goodness in the educational process, and that we develop a new standardized test that could be administered annually to measure the acquisition of goodness skills.

(I leave it as an exercise for the reader to describe how the representatives from BEA, DoD, IRS, and NASS responded.) Now part of the problem with my parable lies with the approach taken by each of the agency statisticians who, instead of asking what the concept of goodness is all about and how could one measure it, looked to analogues close at hand and let the standard methods he or she was familiar with frame all of the answers to the crucial unasked questions. Perhaps a survey is the wrong tool for the task of measuring goodness. The other problem arises from the fact that no agency has a monopoly on statistical methods or the ability to design new surveys, not the Bureau of the Census, not BLS, not even the small band of statisticians in OMB who must approve the design.
New projects the federal statistical system is likely to face in the next decade will require innovative thinking and true interagency collaboration. The example given by Judy Lessler of measuring the quality of "Flowing Waters" is illustrative of the point I am trying to make. The Research Triangle Institute (RTI) statisticians put this problem of measuring the quality of the nation's flowing waters back into the traditional survey domain of a frame with a population of units (river reaches) to be sampled. The approach was ingenious and some might even call it innovative. But in so describing the problem of measuring the quality of the nation's waters Lessler missed the opportunity to note a point that I am sure the RTI statisticians discussed, namely that many radically different frameworks are possible for looking at this issue, and only some of these fit neatly into a traditional sampling approach. Thus one of the messages I bring you today is that we all must learn to question the appropriateness of traditional statistical frameworks and institutional dogma. This is especially true as we move into some of the more fascinating new domains of federal statistics, e.g., related to the environment, as well as in considering different ways of collecting data, for censuses, and especially for longitudinal surveys. What we do know about longitudinal surveys is that they should not be a simple pasting together of waves of cross-sectional surveys. What we do not know is how to design such surveys except by faulty analogy to traditional cross-sectional methods. Traditional concepts of frames and survey coverage suddenly become elusive, shifting over time. For longitudinal surveys we need to rethink what data we collect, when, and how. And we need to have a more flexible set of analytical tools that allow the data to be viewed from multiple perspectives. Technology may well help here with problems associated with sample attrition and the followup of movers.

Some Advice

I'd like to end with a bit of advice and encouragement about what you can do to improve the quality and appropriateness of the statistical work in your own settings. Your challenge is to keep yourself from being isolated, to prevent yourself from accepting as infallible the data collection and analysis methods you currently use in your job, and to look beyond the walls of your organization.

(a) Ask "why" more often than you have in the past.

(b) Dare to have new ideas or suggest the exploration of someone else's new ideas. Innovative ideas have a long gestation period and only a small fraction of them actually work in practice.

(c) Insist on careful evaluation and documentation of what you are doing.

(d) Don't be afraid to say that you don't know or you don't understand. Such statements are often not a sign of ignorance but rather indicators of wisdom.

(e) Hang in there. Your jobs are difficult and most of you are doing them well. The nation depends on your efforts.

References

Fienberg, S. E. and Tanur, J. M. (1988). From the inside out and the outside in: combining experimental and sampling structures. Canadian Journal of Statistics, 16, 135-151.
Fienberg, S. E. and Tanur, J. M. (1989). Combining cognitive and statistical approaches to survey design. Science, 243, 1017-1022.
Kuhn, T. (1970). The Structure of Scientific Revolutions. (Second Edition, Enlarged) University of Chicago Press, Chicago.

TOWARDS AN AGENDA FOR THE FUTURE

Margaret E. Martin

I have been given a "Where do we go from here?" assignment to help in focussing the experience of the Federal Committee on Statistical Methodology (FCSM) on future directions. So who is "we" and where is "here"? I have chosen to consider "we" as something broader than the FCSM itself -- perhaps the coordinating role of the Statistical Policy Office, perhaps that amorphous entity, the federal statistical system in general. Where is "here"?
It seems to me "here" is an amazingly distant and productive way from the starting point when the FCSM was founded--19 "state of the art" reports ago. The productivity of the FCSM's part-time, interagency subcommittees has been outstanding. Much credit belongs to Maria Gonzalez. Some notion of expectations is essential in order to assess past progress and future progress. What can such committees accomplish? We do not usually look to such groups to produce major breakthroughs in statistical theory, nor to engage in detailed technological applications or experiments. Rather, it seems to me, an interagency committee might be expected to perform one or more of the following functions: 1) exchange knowledge, techniques or experience among committee members to enhance the quality of the member agencies' own operations; 2) provide "state of the art" reports to encourage best practice among a broader group; 3) recommend areas for improvement and needed directions for research; and 4) obtain consensus on such issues as -- defining problems and the priorities among them, developing or changing classifications or other concepts, and setting statistical standards. I am uncertain how much the various subcommittees have served to fulfill the first objective -- that of exchanging knowledge among the subcommittee members -- especially upon hearing informal comments that much subcommittee work is report drafting and criticizing undertaken by individuals on evenings and weekends, rather than exchanges at committee meetings. In such circumstances, the interplay among participants that sometimes leads to unexpected and happy outcomes is not encouraged. Perhaps this is a point that needs more consideration and possible development in the future. Suggestions that arose in the opening session yesterday for more followup on subcommittee reports might lead to more continuing and profitable interactions among subcommittee members.
The FCSM has fulfilled admirably the second objective I listed, that of providing state of the art reports to encourage best practice among a broader group of agencies both within and outside the Federal Government. Robert Groves reported yesterday, for example, that he has used some of the reports in training future survey statisticians. The record of the FCSM in meeting this objective is outstanding. Many of the reports meet the third objective of recommending areas for improvement and needed directions for further research -- although the record here is more spotty. The fourth objective, that of obtaining consensus on broad definitional, conceptual and classification issues, has not been well met. Indeed, the FCSM apparently does not deem such issues to be within its purview. Very well, but such issues are of immediate concern to OMB's Statistical Policy Office and at this time pressures are increasing to re-examine basic concepts and classifications in both economic and social areas and to establish more extensive statistical standards for the federal data collecting agencies. For example, many of our economic statistics depend for classification purposes on the concept of the establishment; many of our social statistics are collected about and from families -- yet both of these concepts are becoming more difficult to apply and possibly less relevant. Changes in either would have major impacts on the uses of the resulting data; they might also have major methodological repercussions. Although the FCSM has developed an admirable program of sponsoring, reviewing and publishing reports on specific topics, it has not been so forthcoming about its own operations. I am especially curious about how it selects the areas for subcommittee operation. The most important problems? By what criteria? The problems most likely of immediate solution? Or some interface between these criteria? 
Here I would only note that the Committee has not yet tackled the most difficult problem of all those facing federal statistical agencies, that of setting statistical priorities. It is possible that chances of a successful subcommittee outcome are too remote to warrant effort on this issue. It is now fifteen years since a panel of the Committee on National Statistics issued a preliminary report* and recommended additional research on costs, and, especially, the benefits of statistical activities. To my knowledge, there has been little if any follow-through by federal agencies. The time may be ripe for another look at this issue. Fritz Scheuren spoke yesterday about possible paradigm shifts in the taking of the Census. I have my own pet paradigm shift to recommend. Back in the 1940's and 1950's when I was being educated in statistical methodology by Morris Hansen and others in the federal statistical agencies as part of the process of their obtaining Bureau of the Budget approval for forms, I learned the paradigm that the sponsor of the form (the subject matter specialist, the scientist, the policymaker) specified the subjects to be covered and the accuracy desired, the statistician provided the statistical design and methodology and estimated the costs of alternatives. The description is over-simplified but not unfair, I think. Yet how far from actual practice. The economic or social theorists seldom specify an operationally feasible concept. It is the applied survey economists, demographers and other specialists, together with the statisticians, interacting, who develop concepts, designs and methodologies in a succession of approximations. As a case in point, take the definition of employment. In classical economics, employment is not defined, nor even mentioned.
An undifferentiated mass known as "labor" was identified as one of the three factors of production, and when current information on the demand for labor was wanted in the late nineteenth century, it was determined to collect data on employment from employers. The result was that employment was defined to be something obtainable from employer records, a concept approximating filled jobs. It excluded the self-employed, but one person could be counted on several payrolls by holding a number of part-time or part-period jobs. The source and method of data collection thus determined the effective definition. Later, when a new survey attempted to measure the unemployed, it proved necessary to go to persons in households for the information and to identify the employed to differentiate them from those not at work and seeking work. A quite different count of employment resulted, again reflecting the basic survey methodology. Some of the differences between the two series in level and change can be explained by known differences in the concepts; some remain intractable. I think it is high time to shift the paradigm more towards one centered on survey methodology broadly defined. This would argue for either expanding the scope of the FCSM or establishing additional coordinating committees under OMB auspices to work on developing consensus on critical conceptual and classification issues.

Reference

* Setting Statistical Priorities, Report of the Panel on Methodology for Statistical Priorities, Richard Savage, Chair, Committee on National Statistics, National Academy of Sciences, Washington, DC, 1976.

TOWARDS AN AGENDA FOR THE FUTURE

Hermann Habermann
Office of Management and Budget

The keynote address at this seminar given by Bob Groves presented the goals for the working papers prepared by the Federal Committee on Statistical Methodology. The goals are documentation of Federal statistical practices, cross-fertilization among agencies, and prodding of new developments.
The address suggested the need to establish a reward structure for the Federal staff who work on the FCSM projects. The Federal statistical system needs to periodically examine itself to determine if we are meeting our goals. Some of the areas of work that the statistical system must now investigate are listed below.

o Cognitive laboratories
o The National Academy of Sciences' Committee on National Statistics is studying:
     Trade data
     Disclosure-avoidance techniques
o The Bureau of Economic Analysis is moving away from the present system of National Accounts towards the United Nations National Accounts System.
o Census 2000 is 10 years away so we need a fresh look at the methods used to collect census data.
o Private data bases are now burgeoning. The Federal government no longer has a monopoly on data bases. There are many changes in public attitudes.

Some of the questions that we need to consider in the context of the Seminar on Quality of Federal Data follow.

1. What is the purpose of the decennial census?
2. What is the relationship between the "10 year ceremony" and intercensal data collection?
3. Where are we going on disclosure-avoidance techniques?
4. Where are we going with Federal-State statistical programs? How can we evaluate the multiple models used in these programs?
5. What is the best strategy to take care of the increasing difficulties that agencies have with recruitment and training of technical personnel?

Reports Available in the Statistical Policy Working Paper Series

1. Report on Statistics for Allocation of Funds (Available through NTIS Document Sales, PB86-211521/AS)
2. Report on Statistical Disclosure and Disclosure-Avoidance Techniques (NTIS Document Sales, PB86-211539/AS)
3. An Error Profile: Employment as Measured by the Current Population Survey (NTIS Document Sales, PB86-214269/AS)
4. Glossary of Nonsampling Error Terms: An Illustration of a Semantic Problem in Statistics (NTIS Document Sales, PB86-211547/AS)
5. Report on Exact and Statistical Matching Techniques (NTIS Document Sales, PB86-215829/AS)
6. Report on Statistical Uses of Administrative Records (NTIS Document Sales, PB86-214285/AS)
7. An Interagency Review of Time-Series Revision Policies (NTIS Document Sales, PB86-232451/AS)
8. Statistical Interagency Agreements (NTIS Document Sales, PB86-230570/AS)
9. Contracting for Surveys (NTIS Document Sales, PB83-233148)
10. Approaches to Developing Questionnaires (NTIS Document Sales, PB84-105055/AS)
11. A Review of Industry Coding Systems (NTIS Document Sales, PB84-135276)
12. The Role of Telephone Data Collection in Federal Statistics (NTIS Document Sales, PB85-105971)
13. Federal Longitudinal Surveys (NTIS Document Sales, PB86-139730)
14. Workshop on Statistical Uses of Microcomputers in Federal Agencies (NTIS Document Sales, PB87-166393)
15. Quality in Establishment Surveys (NTIS Document Sales, PB88-232921)
16. A Comparative Study of Reporting Units in Selected Employer Data Systems (NTIS Document Sales, PB-90-205238)
17. Survey Coverage (NTIS Document Sales, PB90-205246)
18. Data Editing in Federal Statistical Agencies (NTIS Document Sales, PB90-205253)
19. Computer Assisted Survey Information Collection (NTIS Document Sales, PB90-205261)
20. Seminar on the Quality of Federal Data (NTIS Document Sales, PB91-142414)

Copies of these working papers may be ordered from NTIS Document Sales, 5285 Port Royal Road, Springfield, VA 22161 (703) 487-4650

"1" A copy of the complete paper which details the sample selection and matching procedures used in ERUMS is available from John Pinkos, Bureau of Labor Statistics, GAO Building Room 2913, 441 G Street, NW, Washington, DC 20212, Telephone (202) 523-1636.

"2" Acknowledgement: This chapter was partially supported by a grant from the National Science Foundation's program of Measurement Methods and Data Improvement, Grant # SES-8511609. The chapter benefitted from the helpful comments and suggestions of Robert F.
Boruch, Calvin C. Jones, and Nancy A. Mathiowetz. A more extended version of this chapter appears in Krishnan Namboodiri and Ronald G. Corwin, editors, Research in the Sociology of Education and Socialization. Volume 8. Greenwich, Connecticut: JAI Press, 1989, pages 177-199, and is reprinted here in part with the permission of the publisher.

"3" This survey of approximately 2,000 respondents was a face-to-face stratified probability sample of the adult noninstitutionalized population of the United States, which included a special supplemental sample of minorities for that year.

"4" This survey of approximately 10,000 respondents is a cohort of youth (age 14-21 during the first year of the survey in 1979) which included oversamples of females and minority youth and a special military sample.

"5" A longer version of this paper, which discusses several other topics in longitudinal analysis such as designing a longitudinal file, dealing with attrition, imputation and weighting issues, and the choice of an accounting period, is available from the Census Bureau as SIPP Working Paper No. 9007.

"6" See Duncan (ed.) (1984) and McNeil et al. (1988).

"7" Applications illustrating the use of this technique to analyze income change can be found in Ruggles and Williams (1986) and Williams and Ruggles (1987).

"8" See Bianchi and McArthur (1989).

"9" See for example Gustman and Steinmeier (1986).

"10" Additionally, if rates of divorce are changing rapidly over time, the use of pooled data on transitions from a long-term sample such as the PSID may give misleading estimates of transition probabilities. See for example Tuma and Hannan (1984) for more discussion of this point.

"12" Mary Jo Bane and David Ellwood's classic paper on poverty spells makes this point very well, and provides a good example of spell analysis as applied to the PSID. (See Bane and Ellwood (1986).) For a similar example using SIPP data, see Ruggles and Williams (1989). Other useful applications include the work by Pamela Farley Short and her colleagues on spells of Medicaid participation and Rebecca Blank's imaginative use of longitudinal data from the Seattle and Denver Income Maintenance Experiments to examine spells of welfare program participation. See Short et al. (1988) and Blank (1986).

"13" See Duncan et al. (1984).

"14" This discussion is aimed at the analyst trying to decide whether this approach is appropriate for the particular application he or she has in mind. Anyone attempting to implement such an analysis should of course review some of the more technical literature on this topic. Tuma and Hannan (1984) provide a good basic overview of these methods. In addition, the treatment in Allison (1982) may be helpful to analysts who are completely unfamiliar with event history analysis techniques.

"15" The author is Assistant Commissioner, Office of Economic Research, Bureau of Labor Statistics. The views expressed herein are those of the author and do not necessarily reflect those of the Bureau of Labor Statistics.