Federal Committee on Statistical Methodology, Office of Management and Budget
Statistical Policy Working Paper 20 - Seminar on Quality of Federal Data - Part 3 of 3
Statistical Policy Working Paper 20
Seminar on Quality of Federal Data
Part 3 of 3

Federal Committee on Statistical Methodology
Statistical Policy Office
Office of Information and Regulatory Affairs
Office of Management and Budget
March 1991

MEMBERS OF THE FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY
(February 1991)

Maria E. Gonzalez, Chair, Office of Management and Budget
Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Charles E. Caudill, National Agricultural Statistics Service
Cynthia Z.F. Clark, National Agricultural Statistics Service
Zahava D. Doering, Smithsonian Institution
Robert M. Groves, Bureau of the Census
Roger A. Herriot, National Center for Education Statistics
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, Bureau of the Census
Daniel Melnick, National Science Foundation
Robert P. Parker, Bureau of Economic Analysis
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Wesley L. Schaible, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census

PREFACE

In 1975, the Office of Management and Budget (OMB) organized the Federal Committee on Statistical Methodology. Composed of individuals selected by OMB for their expertise and interest in statistical methods, the committee has during the past 15 years determined areas that merit investigation and discussion, and overseen the work of subcommittees organized to study particular issues. Since 1978, 19 Statistical Policy Working Papers have been published under the auspices of the Committee.

On May 23-24, 1990, the Council of Professional Associations on Federal Statistics (COPAFS) hosted a "Seminar on the Quality of Federal Data."
Developed to capitalize on work undertaken during the past dozen years by the Federal Committee on Statistical Methodology and its subcommittees, the seminar focused on a variety of topics that have been explored thus far in the Statistical Policy Working Paper series. The subjects covered at the seminar included:

  Survey Quality Profiles
  Paradigm Shifts Using Administrative Records
  Survey Coverage Evaluation
  Telephone Data Collection
  Data Editing
  Computer Assisted Statistical Surveys
  Quality in Business Surveys
  Cognitive Laboratories
  Employer Reporting Unit Match Study
  Approaches to Developing Questionnaires
  Statistical Disclosure-Avoidance
  Federal Longitudinal Surveys

Each of these topics was presented in a two-hour session that featured formal papers and discussion, followed by informal dialogue among all speakers and attendees.

Statistical Policy Working Paper 20, published in three parts, presents the proceedings of the "Seminar on the Quality of Federal Data." In addition to providing the papers and formal discussions from each of the twelve sessions, this working paper includes Robert M. Groves' keynote address, "Towards Quality in a Working Paper Series on Quality," and comments by Stephen E. Fienberg, Margaret E. Martin, and Hermann Habermann at the closing session, "Towards an Agenda for the Future."

We are indebted to all of our colleagues who assisted in organizing the seminar, and to the many individuals who not only presented papers and discussions but also prepared these materials for publication. Special thanks are due to Terry Ireland and his staff for their work in assembling this working paper.

Table of Contents

Wednesday, May 23, 1990

Part 1

KEYNOTE ADDRESS
TOWARDS QUALITY IN A WORKING PAPER SERIES ON QUALITY
  Robert M. Groves, The University of Michigan and U.S. Bureau of the Census

Session 1 - SURVEY QUALITY PROFILES
THE SIPP QUALITY PROFILE
  Thomas B. Jabine, Statistical Consultant
INITIAL REPORT ON THE QUALITY OF AGRICULTURAL SURVEY PROGRAM
  George A. Hanuschak, National Agricultural Statistics Service
DISCUSSION
  Barbara A. Bailar, American Statistical Association
DISCUSSION
  Nancy A. Mathiowetz, U.S. Bureau of the Census

Session 2 - PARADIGM SHIFTS USING ADMINISTRATIVE RECORDS
PARADIGM SHIFTS: ADMINISTRATIVE RECORDS AND CENSUS-TAKING
  Fritz Scheuren, Internal Revenue Service
AN ADMINISTRATIVE RECORD PARADIGM: A CANADIAN EXPERIENCE
  John Leyes, Statistics Canada
DISCUSSION
  Gerald Gates, U.S. Bureau of the Census
DISCUSSION
  Edward J. Spar, Market Statistics

Session 3 - SURVEY COVERAGE EVALUATION
CONTROL, MEASUREMENT, AND IMPROVEMENT OF SURVEY COVERAGE
  Gary M. Shapiro, U.S. Bureau of the Census; Raymond R. Bosecker, National Agricultural Statistics Service
QUALITY OF SURVEY FRAMES
  Judith T. Lessler, Research Triangle Institute
DISCUSSION
  Fritz Scheuren, Internal Revenue Service
DISCUSSION
  Joseph Waksberg, Westat, Inc.

Session 4 - TELEPHONE DATA COLLECTION
QUALITY IMPROVEMENT IN TELEPHONE SURVEYS
  Leyla Mohadjer, David Morganstein, Westat, Inc.
COMPUTER ASSISTED SURVEY TECHNOLOGIES IN GOVERNMENT: AN OVERVIEW
  Marc Tosiano, National Agricultural Statistics Service
DISCUSSION
  William L. Nicholls II, U.S. Bureau of the Census
DISCUSSION
  James T. Massey, National Center for Health Statistics

Part 2

Session 5 - DATA EDITING
OVERVIEW OF DATA EDITING IN FEDERAL STATISTICAL AGENCIES
  David A. Pierce, Federal Reserve Board
EDITING SOFTWARE (An excerpt from Chapter IV of Working Paper 18)
  Mark Pierzchala, National Agricultural Statistics Service
RESEARCH ON EDITING
  Yahia Ahmed, Internal Revenue Service
DISCUSSION
  Charles E. Caudill, National Agricultural Statistics Service
DISCUSSION
  Richard Bolstein, George Mason University

Session 6 - COMPUTER ASSISTED STATISTICAL SURVEYS
OVERVIEW OF COMPUTER ASSISTED SURVEY INFORMATION COLLECTION
  Richard L. Clayton, U.S. Bureau of Labor Statistics
A COMPARISON BETWEEN CATI AND CAPI
  Martin Baum, National Center for Health Statistics
COMPUTER ASSISTED SELF INTERVIEWING
  Ralph Gillmann, Energy Information Administration
COMPUTER ASSISTED SELF INTERVIEWING: RIGS AND PEDRO, TWO EXAMPLES
  Ann M. Ducca, Energy Information Administration
DATA COLLECTION
  Cathy Mazur, National Agricultural Statistics Service
DISCUSSION
  Robert N. Tinari, U.S. Bureau of the Census
DISCUSSION
  David Morganstein, Westat, Inc.

Thursday, May 24, 1990

Session 7 - QUALITY IN BUSINESS SURVEYS
IMPROVING ESTABLISHMENT SURVEYS AT THE BUREAU OF LABOR STATISTICS
  Brian MacDonald, Alan R. Tupek, U.S. Bureau of Labor Statistics
A REVIEW OF NONSAMPLING ERRORS IN FEDERAL ESTABLISHMENT SURVEYS WITH SOME AGRIBUSINESS EXAMPLES
  Ron Fecso, National Agricultural Statistics Service
DISCUSSION
  David A. Binder, Statistics Canada
DISCUSSION
  Charles D. Cowan, Opinion Research Corporation

Session 8 - COGNITIVE LABORATORIES
THE BUREAU OF LABOR STATISTICS' COLLECTION PROCEDURES RESEARCH LABORATORY: ACCOMPLISHMENTS AND FUTURE DIRECTIONS
  Cathryn S. Dippo, Douglas Herrmann, U.S. Bureau of Labor Statistics
THE ROLE OF A COGNITIVE LABORATORY IN A STATISTICAL AGENCY
  Monroe G. Sirken, National Center for Health Statistics
DISCUSSION
  Elizabeth Martin, U.S. Bureau of the Census
DISCUSSION
  Murray Aborn, National Science Foundation (retired)

Part 3

Session 9 - EMPLOYER REPORTING UNIT MATCH STUDY
INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS: THE ERUMS EXPERIENCE
  Thomas B. Petska, Internal Revenue Service; Lois Alexander, Social Security Administration
SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS
  John Pinkos, Kenneth LeVasseur, Marlene Einstein, U.S. Bureau of Labor Statistics; Joel Packman, Social Security Administration
RESULTS, FINDINGS AND RECOMMENDATIONS OF THE ERUMS PROJECT
  Vern Renshaw, Bureau of Economic Analysis; Tom Jabine, Statistical Consultant
DISCUSSION
  W. Joel Richardson, Charles A. Waite, U.S. Bureau of the Census
DISCUSSION
  Thomas J. Plewes, U.S. Bureau of Labor Statistics

Session 10 - APPROACHES TO DEVELOPING QUESTIONNAIRES
TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING QUESTIONNAIRES
  Theresa J. DeMaio, U.S. Bureau of the Census
TECHNIQUES FOR EVALUATING THE QUESTIONNAIRE DRAFT
  Deborah H. Bercini, National Center for Health Statistics
DESIGNING QUESTIONNAIRES FOR CATI IN A MIXED MODE ENVIRONMENT
  Gemma Furno, U.S. Bureau of the Census
DISCUSSION
  Carol C. House, National Agricultural Statistics Service

Session 11 - STATISTICAL DISCLOSURE-AVOIDANCE
DISCLOSURE AVOIDANCE PRACTICES AT THE CENSUS BUREAU
  Brian Greenberg, U.S. Bureau of the Census
THE MICRODATA RELEASE PROGRAM OF THE NATIONAL CENTER FOR HEALTH STATISTICS
  Robert H. Mugge, National Center for Health Statistics (retired)
DISCUSSION
  George T. Duncan, Carnegie Mellon University

Session 12 - FEDERAL LONGITUDINAL SURVEYS
FEDERAL LONGITUDINAL SURVEYS
  Daniel Kasprzyk, U.S. Bureau of the Census; Curtis Jacobs, U.S. Bureau of Labor Statistics
THE ADVANTAGES AND DISADVANTAGES OF LONGITUDINAL SURVEYS
  Robert W. Pearson, Social Science Research Council
LONGITUDINAL ANALYSIS OF FEDERAL SURVEY DATA
  Patricia Ruggles, Joint Economic Committee
DISCUSSION
  Michael Brick, Westat, Inc.
DISCUSSION
  Marilyn E. Manser, U.S. Bureau of Labor Statistics

TOWARDS AN AGENDA FOR THE FUTURE
  Stephen E. Fienberg, Carnegie Mellon University
  Margaret E. Martin
  Hermann Habermann, Office of Management and Budget

Part 3

Session 9
EMPLOYER REPORTING UNIT MATCH STUDY

INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS: THE ERUMS EXPERIENCE

Thomas B. Petska, Internal Revenue Service
Lois Alexander, Social Security Administration

The Employer Reporting Unit Match Study (ERUMS) was a pilot record linkage study carried out under the auspices of the Federal Committee on Statistical Methodology of the Office of Management and Budget. The study linked records of employers and their reporting units from three agencies: the Bureau of Labor Statistics (BLS), the Social Security Administration (SSA) and the Internal Revenue Service (IRS). The primary linkages involved samples of the agencies' records for employers in the State of Texas covering their activities in 1982.

For the ERUMS Workgroup to gain access to the data sets needed for the study, arrangements had to be developed that would comply with the confidentiality provisions and statutes of the Federal and State agencies that controlled these data sets. This paper gives an overview of these arrangements and agreements. In the first section, background information on the statistical content and confidentiality provisions of each of the data sets is provided. In the second section, the actual arrangements for the release of confidential microdata are described. The last section provides a summary of what we have learned about such data sharing arrangements.

Background Information

The goal of ERUMS was to demonstrate the feasibility of matching employer and reporting unit data from different agency record systems as a means of obtaining more precise information about the coverage and content of the data in those systems. A purpose was to examine and evaluate differences in wage and employment data at the state and county level as reported to those agencies.
Despite the many difficulties encountered in establishing the data access agreements, ERUMS demonstrated that such data sharing projects can be successful under current laws.

1. Data Sets

The ERUMS study was a three-way data linkage study in which individual microdata records from BLS, SSA, and IRS were matched by Employer Identification Number (EIN).

a. BLS provided a 1982 Unemployment Insurance (UI) Address File, which, for each state, consists of data for individual employers and their reporting units, which are often equivalent to "establishments". The data for this file are submitted to BLS by the State employment security agencies that operate the Federal-State UI Program. BLS uses the data submitted by the states as a basis for statistical reports on employment and wages and uses the UI Address File as a national sampling frame for its establishment surveys.

b. SSA provided an edited file of Form W-3 annual reports for 1982 and the Single Unit and Multi-Unit Code Files. The Form W-3 file provided data on individual employers and, in some cases, for each of their reporting units, which are frequently equivalent to establishments. The Single Unit Code File contains a record for most entities that have filed an application for an Employer Identification Number. The Multi-Unit Code File contains a record for each reporting unit of multi-unit employers who are participating in the Establishment Reporting Plan, a voluntary program under which employers report wage information on Form W-3 separately for each of their reporting units.

c. IRS data used for ERUMS were from a Census-edited file based on Forms 941 and 943 for Tax Years 1981-83. These forms are used by employers to report each quarter (annually for Form 943) to IRS on income taxes withheld from wages and other payments to employees and on taxes under the Federal Insurance Contributions Act (FICA) under the Social Security system.
Extracts of data from these forms are provided annually by IRS to the Census Bureau for use in the latter's County Business Patterns Program and other statistical programs. The Census Bureau edits the files, particularly the industry codes, and imputes certain missing data. This file was made available to the IRS Statistics of Income (SOI) Division for use in its business employment and payroll studies and was used for ERUMS. In addition, copies of Form 940, Federal Unemployment Tax Return, were obtained for a substantial proportion of the ERUMS sample cases.

2. Data Sharing Issues

For the ERUMS Workgroup to gain access to the data sets needed for the study, it was necessary to develop working arrangements that complied with the provisions of confidentiality statutes, regulations, and policies of the Federal and State agencies that controlled these data sets.

Although interagency exchange of identifiable microdata was the key to ERUMS, such data sharing is restricted by Federal confidentiality laws which generally permit agencies to disclose statistical information only in summary or other unidentifiable form. Since ERUMS was designed to link and compare information about individual employers collected separately by the different agencies, the Workgroup had to develop and implement lawful methods of transferring data on identifiable business units among the participants. A related task was to minimize the disclosure of identifiers in making those transfers and linkages.

The Workgroup was particularly interested in the different ways an employer may report establishment or multi-unit enterprise data to various State and Federal agencies. To examine these differences, the Workgroup needed to compare employers' reports to the BLS State UI programs, the SSA FICA reporting, and the IRS employment tax returns.
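
The comparison described above amounts to grouping each agency's records under the shared EIN and then looking at which sources report on a given employer. The following Python fragment is only an illustrative sketch of that idea, not the procedure used in ERUMS: the record layout, the field names "ein" and "wages", and the EINs and figures shown are all hypothetical.

```python
from collections import Counter, defaultdict

SOURCES = ("BLS", "SSA", "IRS")


def link_by_ein(files):
    """Group records from each source file under their shared EIN.

    `files` maps a source name to a list of dicts with an "ein" key
    (a hypothetical layout chosen for this illustration).
    """
    linked = defaultdict(lambda: {source: [] for source in SOURCES})
    for source in SOURCES:
        for record in files.get(source, []):
            linked[record["ein"]][source].append(record)
    return dict(linked)


def match_pattern(entry):
    """Return the sources that hold at least one record for this EIN."""
    return tuple(source for source in SOURCES if entry[source])


def summarize(linked):
    """Count EINs by match pattern; the summary carries no identifiers."""
    return Counter(match_pattern(entry) for entry in linked.values())


# Fictitious EINs and wage figures, for illustration only.
files = {
    "BLS": [{"ein": "000000001", "wages": 120}, {"ein": "000000002", "wages": 75}],
    "SSA": [{"ein": "000000001", "wages": 118}],
    "IRS": [{"ein": "000000001", "wages": 120}, {"ein": "000000003", "wages": 40}],
}
linked = link_by_ein(files)
summary = summarize(linked)
```

Each source keeps a list of records per EIN because a multi-unit employer may file several reporting-unit records under one EIN. The `summary` counter holds only frequencies by match pattern, the kind of unidentifiable summary that can be reviewed without access to microdata.
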
Members of the Workgroup included employees of these agencies, plus employees of the Bureau of Economic Analysis, Office of Management and Budget, the Bureau of the Census, and the Committee on National Statistics of the National Academy of Sciences. The Workgroup planned to analyze the information that corresponded to each EIN as it was reported to each agency. The analysis and findings would be entirely statistical in nature with no reference to the individual (identifiable) cases. Nevertheless, the planning, processing, and analysis phases each required access to identifiable data.

3. Confidentiality of Federal and State Tax Records

In the ERUMS study, the Employer Identification Number (EIN) was the identifier that was common to all the reporting systems. It was used to define the sample drawn by BLS and was used as the basis for retrieving, linking and comparing records containing information from the SSA and IRS files. By law, the EIN is a tax identification number, and even when standing alone is protected by Internal Revenue Code confidentiality restrictions.

ERUMS required access to data from W-3 records which by law are Federal tax records that are processed and maintained at SSA in conjunction with the computation of Social Security retirement benefits. Since these are tax records, it was necessary to satisfy IRS that the selection by SSA of sample cases, SSA's disclosure of W-3 data to BLS, and the use of employer data by other members of the Workgroup met the requirements of the Internal Revenue Code dealing with disclosure of tax information. (See No. 4 below.)

BLS selected Texas as the State whose records it would sample, and it obtained written permission from the Texas State Employment Security Agency to use their UI records in the project.
The Texas Unemployment Compensation Act requires Texas employers to maintain records and file reports to the Texas Employment Commission with detailed information about the business operations and the number and compensation of employees. Texas law prohibits disclosure except for administering the Act, and it makes improper disclosure punishable by fines or imprisonment.

4. Other Confidentiality Considerations

Since the Workgroup was composed of employees from several agencies and organizations, confidentiality laws did not apply to them uniformly. In varying degrees, certain laws, regulations, and policies affected each agency's access to identifiable records from particular sources and provided differential access to various individuals in the Workgroup. A recurring theme was the necessity at each phase of the process to identify the persons who needed to use identifiable data and to ensure that no others had access at that time.

Besides affidavits and other written procedures to protect the confidentiality of records, certain technical safeguards were adopted to minimize disclosure risk. The first of these methods was to avoid identifying sample cases by EIN to persons who performed processing in the participating agencies but were not directly associated with the Workgroup. This method was adopted to conform to the Internal Revenue Code requirements for tax information under the agreement BLS had with the State of Texas. At BLS this led to a decision not to process the data on the mainframe computer system at the Department of Labor that is operated by a private contractor. Instead, BLS used a minicomputer which was accessible only to BLS employees who were members of the Workgroup.

State agencies periodically submit to BLS UI address files that compile identification data for all reporting units at the most-detailed level that is available from employers' reports.
BLS compiles these reports under a pledge of confidentiality that allows the data to be used only by authorized persons for statistical purposes.

Once BLS selected the Texas sample, it had to create a finder list so that SSA could extract corresponding records from its W-3 and related files for employers in the sample. The technical staff who performed these operations at SSA have routine access in their usual jobs to the employer records maintained at SSA. However, they did not need to know which of the employers' records comprised the sample selected by BLS from the Texas UI file. To avoid identifying those cases that were actually in the sample, BLS furnished SSA with a listing of 7 of the 9 digits of sample EINs. SSA staff then extracted records from the W-3 and related files for all records in which these 7 digits appeared without knowing which employers were actually in the BLS sample. This procedure effectively masked the identities of sample cases derived from State UI files, and thus significantly limited the number of SSA employees who were required to sign BLS non-disclosure affidavits.

Agreements for Interagency Data Sharing

Access by the Workgroup to the data sets needed for the study was accomplished through three interagency agreements plus an additional access arrangement.

The Workgroup had originally planned a tripartite arrangement through interagency agreements of SSA and BLS with IRS. However, IRS counsel raised objections that such a multi-party agreement would be unduly cumbersome, and approval would probably not be forthcoming. As an alternative, IRS proposed to contract exclusively with BLS for the performance by BLS of services that required access to tax data. SSA staff would be designated as special agents of BLS to process the data. Bilateral BLS/IRS and BLS/SSA agreements would also have to be drafted under this arrangement. The drafting of these arrangements proved to be a delicate task.
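
The partial-EIN finder list described under "Other Confidentiality Considerations" can be sketched in a few lines of Python. This is a hedged illustration only. The paper does not say which 7 of the 9 digits were furnished, so the first seven are an assumption here, and all function names, the record layout, and the EINs are hypothetical. The final filtering step is also an assumption about how the extracted records would be narrowed back to the sample.

```python
def normalize(ein):
    """Strip the hyphen from an EIN and check that nine digits remain."""
    digits = ein.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"not a 9-digit EIN: {ein!r}")
    return digits


def partial_key(ein, positions=range(7)):
    """Return the digits at `positions`; using the first seven is an assumption."""
    digits = normalize(ein)
    return "".join(digits[i] for i in positions)


def build_finder_list(sample_eins):
    """Sampling agency: release only the partial keys of the sample EINs."""
    return sorted({partial_key(ein) for ein in sample_eins})


def extract_candidates(records, finder_list):
    """Extracting agency: pull every record whose partial key is listed."""
    keys = set(finder_list)
    return [r for r in records if partial_key(r["ein"]) in keys]


def keep_sample(candidates, sample_eins):
    """Sampling agency: keep only extracted records that are in the sample."""
    sample = {normalize(ein) for ein in sample_eins}
    return [r for r in candidates if normalize(r["ein"]) in sample]


# Fictitious EINs, for illustration only.
sample_eins = ["00-0000123"]
records = [{"ein": "000000123"}, {"ein": "000000199"}, {"ein": "000000223"}]

finder = build_finder_list(sample_eins)
candidates = extract_candidates(records, finder)
matched = keep_sample(candidates, sample_eins)
```

With two digits withheld, each listed key is shared by up to 100 possible EINs, so the extracting staff see a superset of the sample and cannot tell which employers were actually selected.
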
By law, the purposes of IRS participation in the project and its service contract with BLS had to be related to IRS administration of the tax laws. Section 6103(n) of the Internal Revenue Code (IRC) allows IRS to disclose tax return information to persons outside of the agency as long as it is for purposes of tax administration [1]. Specifically, this purpose is to conduct statistical studies based on return information, which Section 6108(a) of the IRC authorizes IRS to perform [2]. A case was made that the ERUMS study was one such purpose.

1. BLS and Texas Agreement

BLS has cooperative agreements with 50 State Employment Security Agencies to use employment statistics collected by the states for its labor economics research. The 1982 data used in the ERUMS study were furnished to BLS in its ES-202 program by the Texas State Employment Commission under a cooperative agreement. It was necessary for BLS to obtain authorization from the State Commission to use the microdata for the ERUMS study and to provide access for the Workgroup members. Under this cooperative agreement, the access and use of the data were subject to the confidentiality requirements of the Texas Employment Compensation statute as well as those set out in the BLS Commissioner's Order No. 2-80.

Each UI program is operated under state law that must conform to certain minimum federal standards, with reports that enable BLS to monitor state compliance. Under the Texas program, each employing unit is required to file (and update periodically) a status report with the Texas Employment Commission, describing the type of ownership, location, and nature of business. On a quarterly basis, employers are required to file detailed reports on wages and contributions. Multi-unit employers are asked to file a voluntary statistical supplement that provides detailed employment, wage, and contribution reports for each establishment.
The ES-202 reports are compiled by BLS and form the basis for the UI Address File that BLS maintains. This is a micro-level employer file that contains first quarter information for each reporting unit, and the 1982 file provided the Texas sampling frame for the ERUMS sample.

The confidentiality of statistical data collected under the cooperative agreement is protected by interrelated state and federal procedures. At the state level, these UI reports are collected under the Texas Unemployment Compensation Act which limits the availability of its UI reports to public employees in the performance of public duties, except as the Employment Commission may find necessary in its administration of Texas law. At the federal level, BLS receives and maintains these confidential reports under the authority of the BLS Commissioner's Order that pledges confidentiality and prohibits disclosure except to authorized persons for statistical purposes. This Order precludes any use of identifiable information for non-statistical purposes, such as investigation or enforcement.

Under this cooperative agreement with the State of Texas, it was necessary for BLS to obtain permission from the Texas Commissioner to select employer sample cases and to make information about them available to BLS and SSA employees in the ERUMS Workgroup and later to others in the Microdata Access Group. In addition, BLS procedures establish the confidentiality of the identities and all information pertaining to employers in the sample. Members of the Workgroup who were not BLS employees were appointed as BLS agents pursuant to another interagency agreement with BLS. Like BLS employees, other Workgroup members were required to sign a Non-Disclosure Affidavit before they would be given access to the microdata.

2. IRS and BLS Agreement

The initial draft of the statement of purpose by IRS representatives was acceptable to IRS counsel since its justification for sharing of confidential tax information was defined as for purposes of tax administration, which is permissible under section 6103(n) of the Internal Revenue Code [1]. However, the case that was made for IRS tax administration purposes was not acceptable to other Workgroup participants because they felt that this did not clearly describe the purposes of the ERUMS project in general or SSA's role in particular. In the subsequent draft, care was taken to define contractual purposes in language that covered the statistical purposes of the several participating agencies and that provided for the exchange of records to create a common pool of data for a variety of analytical purposes, including those related to tax administration.

In this agreement, IRS contracted with BLS for the performance of those parts of the ERUMS project that required access to tax data, including the wage report information that was to be provided by SSA. Under this agreement, SSA staff could be designated as special agents of BLS to carry out their part of the linkage and analysis operations. By law, the purposes of IRS participation in the project and its service contract with BLS had to be related to IRS administration of the tax laws.

The terms of a contract between IRS and BLS, which needed to be acceptable to SSA, enabled BLS to receive tapes containing tax information from IRS and SSA and to combine them with records in the UI Address File maintained by BLS. It imposed strict safeguard procedures and required BLS to provide IRS with a list of all persons permitted to see confidential tax return data. This list included SSA employees who were required to sign affidavits as agents of BLS.

3. BLS and SSA Agreement

The third agreement was a Conditions of Use agreement between BLS and SSA which enabled SSA to release data from its employer files to BLS and authorized BLS to link data from these files to data in the UI Address File and data to be furnished by IRS. Like the IRS/BLS agreement, it limited access at each stage of the project to those persons who needed to use identifiable data, kept the number of such persons to a minimum, and required adequate physical security procedures.

This agreement, which needed to be acceptable to IRS, enabled BLS to use SSA files for the ERUMS project. Under this agreement, SSA would furnish BLS with SSA's Single Unit Code File, Multi-Unit Code File, and Employer Report (W-3) Record. The agreement authorized BLS to link data from these statistical files with data in the BLS Unemployment Insurance Address File and with data to be furnished by IRS, and prohibited any other linkage.

4. Microdata Access Group

In the planning and matching stages of the project, the persons who needed to have access to microdata were those members of the Workgroup who were performing the record matching and verification. At Workgroup meetings, members generally reviewed data in the form of frequencies and other summaries to track the progress of the matching operations and to plan future steps. Occasionally, discrepancies appeared or questions arose concerning classification of a particular employer or possible mismatch of data. Those matters were usually referred to particular members to resolve, with access to microdata as needed on an ad hoc basis.

When the matching steps were completed and time came to plan the analysis, new arrangements were needed to enable a different group of persons to examine identifiable microdata. The Microdata Access Group (MAG) was formed for this purpose. At this point, IRS agreed that its contractor, BLS, would be permitted to make Workgroup members its agents as needed for the analysis stage.
This enabled the Workgroup members who were employees of BEA and the Committee on National Statistics to become sworn agents who, like the employees of BLS and SSA, would be permitted to examine and analyze microdata. Thus, of the three agencies sharing microdata (BLS, SSA, and IRS), IRS was the only one that did not have access to the matched microdata file. This group met periodically to plan and perform the analysis, prepare findings, and report its findings back to the full Workgroup.

Once the terms of all contracts were agreed upon, the contracts and the conditions of use agreement were signed by officials of the participating agencies, and the way was cleared for the data transfers.

Summary and Conclusions

To say that the process of discussion and negotiation leading to the signing of the ERUMS access agreements was painstaking, sensitive, and costly in terms of staff time and delay in the study's completion is an understatement. The disclosure aspects of the study severely tested the will and resolve of the affected agencies.

In retrospect, the signing of interagency agreements between IRS and BLS and between SSA and BLS documented a process of negotiation by which the study plan was adapted to the requirements of the various confidentiality laws that impinged on it. In addition, it summarized a process in which a combination of technical and procedural safeguards was fitted to meet the requirements of the Federal and State agencies that were involved in the data sharing.

While the participants in the ERUMS study all feel a certain degree of accomplishment due to their collective persistence, none are quite so upbeat about the long duration of the study. Clearly, the long incubation period for the interagency data sharing agreements was a major contributor. However, it is important to recognize that the prolonged negotiation for interagency agreements did not result from lack of cooperation among the participants.
On the contrary, it reflected the complex mosaic of legal restrictions on use and interagency dissemination of records. Once it became evident that a single multi-party agreement would be unworkable for the overall project, the plan was broken down into component steps of disclosure, record linkage, and analysis. Each failure to reach an agreement required a step back to re-examine the study imperatives and to adapt the procedures to the practical and legal necessities at each stage. In addition to adding to the overall time and resources consumed by the project, these delays caused further problems, including:

1. Personnel turnover among the project participants, a consequence of the project's extended schedule, slowed progress on the technical issues.

2. The acquisition of IRS Form 940 data was adversely affected, since these forms have a 5-year retention period and were scheduled for destruction by the time the sample EINs were determined.

On the positive side, however, ERUMS demonstrated that such data sharing projects can be successful under current laws if there is creativity, flexibility, and, most of all, persistence.

Notes and References

[1] Section 6103(n) of the Internal Revenue Code (IRC) allows for the provision of confidential tax return information for purposes of tax administration. Specifically, it reads: "Certain Other Persons. -- Pursuant to regulations prescribed by the Secretary, returns and return information may be disclosed to any person, including any person described in Section 7513(a), to the extent necessary in connection with the processing, storage, transmission, and reproduction of such returns and return information, and the programming, maintenance, repair, testing, and procurement of equipment, for purposes of tax administration."
[2] Section 6108 of the IRC has three parts which call for the publication of statistical compilations of tax return information at regular intervals, but, unlike Section 6103(n), such information cannot identify a particular taxpayer. This Section is the primary "mandate" for IRS' Statistics of Income (SOI) program.

a) Publication or other Disclosure of Statistics of Income. -- The Secretary shall prepare and publish not less than annually statistics reasonably available with respect to the operations of the internal revenue laws, including classifications of taxpayers and of income, the amounts claimed or allowed as deductions, exemptions, and credits, and any other facts deemed pertinent and valuable.

b) Special Statistical Studies. -- The Secretary may, upon written request by any party or parties, make special statistical studies and compilations involving return information (as defined in section 6103(b)(2)) and furnish to such party or parties transcripts of any such special statistical study or compilation. A reasonable fee may be prescribed for the cost of the work or services performed for such party or parties.

c) Anonymous Form. -- No publication or other disclosure of statistics or other information required or authorized by subsection (a) or special statistical study authorized by subsection (b) shall in any manner permit the statistics, study, or any information so published, furnished, or otherwise disclosed to be associated with, or otherwise identify, directly or indirectly, a particular taxpayer.

Section 6108(a) has been interpreted as a tax administration purpose for the Statistics of Income (SOI) Program (unlike 6108(b) and 6108(c)); hence, if a 6108(a) study requires the use of "outsiders", then a 6103(n) contract can be initiated, as was done for the ERUMS study.

SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS

John Pinkos, Kenneth LeVasseur, Marlene Einstein, U.S.
Bureau of Labor Statistics
Joel Packman, Social Security Administration

Introduction

The first paper in this session described the experience with developing interagency agreements, the third described the findings resulting from the study, and this one describes the sample selection and matching procedures used. In addition to describing the sample selection and matching procedures, the following will explain what the ERUMS Workgroup considered when developing the project design. This paper also describes the sampling frames, data, and manual matching conducted by the ERUMS Workgroup.

The ERUMS project was a pilot study, designed to develop and test procedures for linking and comparing employer and reporting unit data from different administrative record systems. The study from its inception was exploratory in nature, and the ERUMS Workgroup members hoped to observe and document the similarities and differences discovered between the records in the systems being studied and, thus, between the systems themselves. The scope of the project included employer reporting unit data from the Bureau of Labor Statistics and Social Security Administration employer data files, which have similar coverage. Internal Revenue Service data, which were edited by the Bureau of the Census, were used to assist in the analysis of the sample.

The ERUMS committee members included staff from the Office of Management and Budget (OMB), Bureau of Labor Statistics (BLS), Social Security Administration (SSA), Bureau of Economic Analysis (BEA), Internal Revenue Service (IRS), the Bureau of the Census, and the Committee on National Statistics (CNS). Developing the sample design, selecting the sample, and performing the machine and manual match were conducted by SSA and BLS staff who were cleared to work with the confidential data. To conduct the final analysis of the data, this group was later expanded to include staff from BEA and the CNS.
There are two reasons for providing an account of the ERUMS sample selection and matching procedures. The obvious reason is that the results, like those of any research study, are dependent on the procedures used, and anyone interested in the results is entitled to a full description of how the study was carried out. The other reason, equally or perhaps more important, is that ERUMS was a venture into uncharted territory, and we believe that future projects of this kind will benefit from the availability of a detailed road map of the procedures that were developed to match and compare employer and reporting unit records from BLS, SSA, and IRS for statistical purposes.

Sample Design Considerations

A major design consideration affecting the size and scope of the project was the limited staff time and resources each of the participating agencies was able to contribute. The committee realized from the beginning that the meat of the project would be in the manual review of the reporting units from each of the administrative record systems. To keep the workload manageable, the Workgroup decided to limit the study to one State rather than several. It was also decided that this State should be large and be one which could share its data with federal statistical agencies for research purposes. The State selected was Texas.

Probability sampling was used at all stages of selection and provided two benefits. It ensured that sample results could be used to produce unbiased estimates for the study population, and it made possible estimation of sampling errors. Additionally, the Workgroup felt it would be useful for both analytical and methodological purposes to produce weighted estimates. Consideration was given to designing a baseline sample where a sample from one agency (e.g., BLS) would be drawn and then a search for the selected sample members would be conducted on the other agency's files (e.g., SSA).
This approach would provide matched units on both files as well as those on the BLS file but not the SSA file. This method, however, would not identify those units on the SSA file but not on the BLS file. The baseline sample approach was abandoned and it was decided that samples would be selected in two stages. The stage one sample was an equal probability sample of the population which was then stratified by match status. The stage two sample was a systematic subsampling from these strata. This method of sampling provided a means for oversampling selected types of records which were of more interest to the project and it also resulted in a manageable sample size.

As a final design consideration, the committee wanted to ensure that records from both SSA and BLS had an equal chance of selection. Additionally, the Committee wanted to develop an approach that would minimize the number of computer searches required to select the sample and relevant data elements from these large administrative record files. The sample design used was one that selected separate samples from the BLS and SSA files using the same set of random pairs of numbers. The purpose of this design was to measure overlap between the two frames and, more importantly, to measure the amount of nonoverlap between the two frames. The nonoverlap included those sample members on one frame but not the other. This design also minimized the computer costs and allowed the committee to select the sample in one pass through each agency's data file. Once the sample was selected, the relevant data elements for each sample member were downloaded to a microcomputer.

Sampling Frames

Both the SSA and BLS data files are compilations of administrative tax records. The SSA data file includes data from employer W-2 and W-3 wage reports, whereas the BLS file includes data from employers' State Unemployment Insurance tax reports.
The identifying data element common to both the SSA and BLS files and assigned from a single source is the Employer's Identification Number, or EIN. The EIN is a unique 9-digit number assigned to companies by IRS and is used to track federal tax payments. When companies pay State Unemployment Insurance taxes, the State assigns an Unemployment Insurance (UI) tax number to track payment. Since companies are given a federal tax credit for State UI taxes, they provide their EIN to the State UI tax department. On an annual basis IRS provides each State UI tax department with a file of all the EINs registered in the State. The UI tax department then reconciles the amount of State UI taxes paid by each employer against the IRS file of EINs and tax credits claimed by each employer.

By definition, all companies on the SSA files should have an EIN reported, because this is what is required for an employer to be included on the file. On the BLS State file a few units did not have an EIN reported, since only a State UI tax number is required for an employer to be included on that file. The first quarter 1982 Texas file had EINs reported for 98.7 percent of all reporting units.

The sampling frame for BLS was all the EINs reported in the Texas first quarter 1982 U.I. Name and Address File. The sampling frame for SSA was all the EINs reported in the Single Unit or Multi Unit Code file with wage reports for calendar year 1982. The SSA files are continuous files linked over time, whereas the BLS file in 1982 was a snapshot of one calendar quarter. Effective with first quarter 1989 data, the BLS began linking data quarterly and now has a continuous data file.

The sampling rate was determined by the Workgroup's decision that 400 EINs would be a manageable sample size and that about one-half of the sample should have EINs classified as multis, or companies with multiple locations.
EINs classified as multis were of particular interest because there is more variation in reporting practices. To derive the sampling rate, the committee looked at the first quarter 1982 Texas file, which had 267,487 EINs classified as single units and 3,125 EINs classified as multi units. A sampling rate of 6 in 100 was selected since it provided approximately 188 EINs that were multi units.

As previously mentioned, it was decided to select a two-stage sample. The first was an equal probability sample of the population. This first-stage sample was selected from all EINs that had 1 of 6 random pairs of numbers in positions 7 and 8 of the EIN. The sampling rate of 6 in 100, when applied to both the BLS and SSA frames, provided a combined stage one sample of 19,964 EINs. The stage one sample was then machine matched and each EIN was assigned a status classification. The initial status classifications are shown below:

Table A

          MATCH STATUS IN:
Group     BLS         SSA
1         Single      Single
2         Single      Inactive
3         Inactive    Single
4         Multi       Single
5         Single      Multi
6         Multi       Inactive
7         Inactive    Multi
8         Multi       Multi

EINs that were inactive in both systems obviously had no chance of entering the ERUMS sample. Another view of the status classifications is shown in attachment A, which is a 3x3 grid having classifications single, multi, and No Wage Report (NWR) on each scale for both the BLS and SSA files. Records with no wage reports on the SSA file were considered inactive. The bottom right cell on the grid is not applicable since these would be records that did not exist on either file.

Based upon the interest of the Workgroup, three of the basic classifications or cells were subdivided and are shown as the shaded sectors on the 3x3 grid (see attachment A). County and SIC became matching criteria for those EINs that were single on both files.
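As a rough illustration of the first-stage selection and the Table A classification described above, a minimal Python sketch follows. It is not the Workgroup's program. The six digit pairs are hypothetical (the paper does not list the pairs actually drawn), positions 7 and 8 are assumed to be counted from 1, the function names are illustrative, and for simplicity an EIN absent from a file is treated as inactive in that system.

```python
from typing import Dict, Optional

# Hypothetical digit pairs: the six pairs actually drawn are not given in the paper.
SELECTED_PAIRS = frozenset({"07", "23", "41", "58", "76", "92"})

# Table A: (BLS status, SSA status) -> initial match status group.
TABLE_A = {
    ("Single", "Single"): 1,
    ("Single", "Inactive"): 2,
    ("Inactive", "Single"): 3,
    ("Multi", "Single"): 4,
    ("Single", "Multi"): 5,
    ("Multi", "Inactive"): 6,
    ("Inactive", "Multi"): 7,
    ("Multi", "Multi"): 8,
}


def in_stage_one(ein: str, pairs=SELECTED_PAIRS) -> bool:
    """True if digits 7-8 of a 9-digit EIN (counted from 1, an assumption) form a selected pair."""
    digits = ein.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"not a 9-digit EIN: {ein!r}")
    return digits[6:8] in pairs


def match_group(bls_status: str, ssa_status: str) -> Optional[int]:
    """Table A group, or None when the EIN is inactive in both systems."""
    return TABLE_A.get((bls_status, ssa_status))


def stage_one_sample(bls: Dict[str, str], ssa: Dict[str, str]) -> Dict[str, int]:
    """Apply the same digit-pair rule to both frames; each dict maps EIN -> 'Single' or 'Multi'."""
    sample = {}
    for ein in sorted(set(bls) | set(ssa)):
        if not in_stage_one(ein):
            continue
        # Simplification: an EIN missing from a frame is treated as inactive there.
        group = match_group(bls.get(ein, "Inactive"), ssa.get(ein, "Inactive"))
        if group is not None:
            sample[ein] = group
    return sample


# Expected multi unit EINs at 6 in 100, using the BLS count cited in the text.
expected_multis = 3125 * 6 / 100  # 187.5, i.e. approximately 188
```

Because the rule depends only on the EIN itself, the same EINs are selected from either file in a single pass, which is what lets the overlap and nonoverlap between the two frames be measured.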
The number of reporting units became a criterion for those EINs that were multis on the BLS file but were single on the SSA file and those EINs that were multis on both. These eleven match status classifications became the strata used for the second stage sample.

The second stage sample selection had equal probability within each stratum. The sampling rates used varied by stratum, from selecting all to selecting 1 in 173.78. Given the exploratory nature of ERUMS, the intent of the Workgroup was to pull a larger sample of EINs classified as multis and nonmatched records. These cases were expected to present more difficulties. Therefore, the Workgroup wanted to have enough of these cases to learn what the situations were and to test methods of dealing with them. The final sample contained 401 EINs, including 201 classified as having multi units on either the BLS or the SSA files. The remaining 200 EINs were those not classified as multis on either the BLS or SSA files.

Once the subsample was selected, the Workgroup began the review and analysis phase, which included labor-intensive manual matching. The working group reviewed reported employment and SIC and geographic codes for each of the 401 EINs. To assist in this process, the Workgroup made arrangements to have access to IRS data for tax years 1981 through 1983. Data for 385 of the 401 EINs were made available. During the review process the Workgroup attempted to uncover the reasons why records did not match or why records were on one file but not the other. In this process of looking very closely at the actual records from each agency, the Workgroup learned much about the two systems and found reasons to reclassify some of the records, which affected the final match status. For example, in the area of multi units, the BLS system defines multis as companies with multiple locations within the same State, whereas the SSA system defines multis as companies that have multiple locations in the United States.
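The second-stage selection within strata, and the weighting that disproportionate sampling implies, might be sketched in Python as follows. This is an illustration under stated assumptions, not the Workgroup's procedure: the function names and the fractional-interval form of systematic selection are assumptions, and the weight shown reflects only the second stage (a full weight would also account for the 6-in-100 first-stage rate).

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def systematic_sample(units: Sequence[str], interval: float,
                      rng: random.Random) -> List[str]:
    """Systematic subsample: random start in [0, interval), then every interval-th unit."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    start = rng.random() * interval
    picks = []
    k = 0
    while start + k * interval < len(units):
        picks.append(units[int(start + k * interval)])
        k += 1
    return picks


def stage_two_sample(strata: Dict[int, Sequence[str]],
                     intervals: Dict[int, float],
                     rng: random.Random) -> List[Tuple[str, int, float]]:
    """Subsample each stratum at its own rate; returns (EIN, stratum, weight) tuples."""
    sample = []
    for group in sorted(strata):
        step = intervals[group]
        for ein in systematic_sample(strata[group], step, rng):
            # Second-stage weight is the inverse of the selection rate (1 in step).
            sample.append((ein, group, step))
    return sample


def weighted_share(sample: List[Tuple[str, int, float]],
                   predicate: Callable[[str, int], bool]) -> float:
    """Weighted proportion of sampled EINs satisfying the predicate."""
    total = sum(w for _, _, w in sample)
    hits = sum(w for ein, group, w in sample if predicate(ein, group))
    return hits / total
```

With an interval of 1 every EIN in the stratum is taken, and with an interval such as 173.78 roughly one EIN in 174 is taken. Weighting by the interval restores each stratum to its stage one size when proportions are estimated.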
During the review of the multi-unit records, employment levels were considered and attempts were made to reconcile differences in reporting units by aggregating employment of the individual multi units to the EIN level. As a result of this review, the Workgroup decided not to use employment as a match criterion. It was also decided that for purposes of this study, a multi unit EIN would be an EIN that had multiple locations within the State of Texas. This reduced the number of SSA multi unit EINs in the final sample from 120 to 10. The remaining 110 records were reclassified as single EINs.

As noted, the Workgroup also compared SIC and geographic codes from both files. SIC codes were first examined to see why there were non-matches at the four-digit SIC level. In some cases, the nonmatched EINs were assigned SIC codes in related industries; in other cases, the industry code reflected a larger aggregation of the reporting unit. Another, and perhaps more important, factor that accounted for differences at the 4-digit level was that both BLS and SSA have policies for SIC coding exceptions. The BLS in 1982 had 11 exceptions to 4-digit SIC coding, which meant a 3-digit SIC code was assigned in certain industries in lieu of the 4-digit SIC code. This represented 43 4-digit industries. These are industries which either have a significant amount of overlapping in their industrial activities or are industries from which it had historically been difficult to collect sufficient information to assign a 4-digit SIC. The BLS currently has reduced the number of 4-digit coding exceptions to 6, which represents 17 4-digit SIC industries. The SSA SIC coding exceptions exist in some agricultural industries and Public Administration, which are coded to the 1-digit level. This affected 64 4-digit industries. Approximately 63 other 4-digit industries were coded at the 3-digit level for one reason or another, typically insufficient information.
In addition to reviewing SIC codes, the Workgroup also looked at geographic codes and tried to explain why some records did not match between files. Maps and coding manuals were consulted and the review showed there was some inherent misreporting of county codes by employers. Texas has more than 37 cities with the same name as a county, but these cities are not located in those counties. Houston, for example, is in Harris County, not Houston County, and Austin is in Travis County, not Austin County. Counties named Houston and Austin are located elsewhere in the State. In some cases the reason for nonmatching records was that the reporting unit was coded in an adjacent county. Texas has a very large number (254) of counties. For those employers who keep their records by city or are not familiar with the county names, it is easy to see the potential for some misreporting.

The Workgroup also looked very closely at the cases having inactive EINs on either the BLS or SSA files. Inactive EINs for the BLS were defined as those that appeared on the SSA file but did not appear on the BLS file. Inactive EINs for SSA were defined as those on the SSA file with no wage reports for 1982. When reviewing the BLS inactive EINs, the Workgroup used SSA SIC and employment data to determine if the employer was exempt from Unemployment Insurance coverage. They also looked at IRS data to determine if the employer became active after the first quarter of 1982 and at the first quarter 1983 Texas file to see if the employer reported in 1983.

When reviewing the SSA inactive EINs, the Workgroup was able to use a more nearly complete SSA wage report file that included wage reports that were either delinquent when the sample was selected or were in the process of reconciliation with IRS. As a result of these additional data, 44 of the 99 EINs originally classified as inactive on the SSA file were determined to be active.
The Workgroup also used the BLS 1982 and 1983 first quarter Texas files to conduct name searches to see if the same employer reported under a different EIN. The Texas files were also used to see whether zero employment was reported, which might have indicated no wages were paid. Additionally, IRS data were then used to see what level of employment was reported to IRS.

The last step in the review and analysis phase was to determine the final match status of the 401 EINs. As a result of the review, it was decided to collapse the 11 categories shown in Attachment A down to the basic 8 cells shown in Table A.

As part of the final analysis, committee members worked on completing the documentation for the project and discovered that an additional 2,608 EINs that were on the SSA file but not the BLS file were inadvertently omitted from the first stage sample and, consequently, from the second stage. Adding cases to the stage 1 and 2 samples at that point would have further delayed completion of the study, so the Workgroup decided the best way to deal with this problem was to reweight the sample cases in the two affected strata and rerun the results tables.

MATCH STATUS CLASSIFICATIONS

[Graphic not reproduced.]

KEY: NWR = No Wage Report; SIC = Standard Industrial Classification; RU = Reporting Units

RESULTS, FINDINGS, AND RECOMMENDATIONS OF THE ERUMS PROJECT

Vern Renshaw, Bureau of Economic Analysis
Tom Jabine, Statistical Consultant

The other papers in this session have examined the administrative arrangements and the sample selection and matching procedures for the Employer Reporting Unit Match Study (ERUMS). This paper reviews the study's results, findings, and recommendations.

The main purpose of the ERUMS project was to provide information on the technical and administrative feasibility of interagency record linkages. However, the ERUMS Workgroup hoped that the study would also shed some light on at least three areas of substantive concern.
1) We hoped that geographic and industry information for reporting units contained in the Bureau of Labor Statistics (BLS) Unemployment Insurance (UI) Address File could help evaluate the potential statistical usefulness of a) reporting unit data supplied by multi unit employers participating in the Social Security Administration (SSA) Establishment Reporting Plan (ERP) for forms W-2 and W-3; and b) State data supplied to the Internal Revenue Service (IRS) on Form 940. SSA has been concerned about the quality of its reporting unit data because resources for maintaining the ERP had been inadequate for some time, and the State data supplied on IRS Form 940 had never been used for statistical purposes.

2) We hoped that information from IRS and SSA files could help evaluate the completeness of employer coverage in the UI Address File. The UI Address File leaves out or estimates employer information that is not received by its statistical deadline, whereas information for late reports was generally available in the IRS and SSA files used for ERUMS.

3) We hoped that the analysis of matched records could help evaluate the consistency of industry and geographic coding in the BLS, IRS, and SSA systems.

The extent to which the ERUMS project could actually shed light on these areas was limited by several factors. First, ERUMS was a pilot study based on a small sample drawn from a single State (Texas) for a single year (1982). The results, therefore, could not be expected to reflect precisely the status of the data systems for the entire country or for subsequent years. (BLS has taken steps to improve the UI Address File since 1982.) Second, both the information content and processing procedures differed somewhat among the data systems. The W-2/W-3 data were for calendar 1982, for example, while the UI Address File that was used contained data only for the first quarter of 1982. Finally, a number of unanticipated problems were encountered in carrying out the study.
The most limiting of these problems resulted from the slow implementation of ERUMS. For example, by the time the final sample of employers was selected, many IRS Form 940s for 1982 had been destroyed. Therefore, it was not possible to evaluate the State data contained on the Form 940s. Another unanticipated difficulty arose because the initial SSA files used in the matching process omitted some wage reports and were generally inadequate to determine if employers were actually reporting multiple units in Texas. These initial files were later supplemented with more complete information, but the supplementation occurred after the final sample had been drawn; consequently the size of the sample was smaller than intended for some categories of employers, especially for multi unit employers. Finally, it proved to be more difficult than had been anticipated to account for differences in employer coverage among the data files. In part, this was because estimated data were not identified in the UI Address File (a deficiency being corrected) and because there was no documentation of such phenomena as dates when employment started for employers (or ended, or was changed by reorganization, etc.) or dates when forms filed by employers were received by the processing agencies.

The clearest conclusion to emerge from the ERUMS project related to the poor quality of SSA's ERP data for multi unit employers. It was evident that SSA would need to take steps to improve quality control if the SSA system were ever to be useful for developing data by geographic and industry classification. The other findings of the ERUMS project were not so stark as those relating to the poor quality of SSA's establishment data, but the study could well reinforce the concerns of those who worry about the inconsistencies in industry coding that occur when employers are coded independently by different agencies.
In the following sections of the paper the results, limitations, findings, and recommendations of the ERUMS project are discussed in somewhat greater detail. Tables A-1 to A-8, which are referred to in the next two sections, appear in Chapter III of the ERUMS final report (Statistical Policy Working Paper 16). In order to meet space limitations, we have included only Table A-4 with this paper.

Results

As explained in detail by Pinkos et al. in the second paper of this session, the ERUMS sample was a two-phase sample of employers, as defined by unique Employer Identification Numbers (EINs). Most of the results presented in this paper are estimates based on the Phase II sample of 401 EINs, weighted to account for the disproportionate sampling used in the second phase of the sample selection.

Of the Texas EINs that were active in 1982 in the BLS or SSA systems, 67.1 percent were active in both systems, 27.6 percent were active only in the SSA system, and 5.3 percent were active only in the BLS system (Table A-1). Only about 1.0 percent of all active EINs were classified as multi unit in one or both systems, and most of these were classified as multi unit only in the BLS system (Table A-4).

For the matched single unit EINs, i.e., those that were active in both systems, an estimated 81.6 percent had the same State and county codes in both systems. The remaining cases were about equally distributed in three categories: same State, different county; same State with no county code in the SSA file; and different State (Table A-5). An estimated 70.2 percent of the matched single unit cases had the same two-digit industry codes. About half of the remaining cases were not classified by industry in the SSA system (Table A-5). When matched against the IRS/Census-edited Form 941/943 file, about three-fourths of the matched single units from both the BLS and SSA files had two-digit industry codes that agreed with those in the IRS/Census file.
However, when the SSA unclassified cases were excluded from this comparison, the proportion of SSA cases that agreed with the IRS/Census two-digit code was somewhat greater than the corresponding proportion for the BLS matched single unit cases (Table A-8).

Only a few EINs (nine sample cases) were classified as multi unit in both the BLS and SSA systems. Matching individual reporting units for these cases proved to be difficult. Overall, the nine sample employers had 105 Texas reporting units in the BLS system and 60 in the SSA system for 1982.

Of the active SSA EINs not found in BLS's first quarter 1982 UI Address File, it was estimated that 69.2 percent had reported no first quarter employment to IRS on Form 941 and therefore would not normally be expected to appear in the BLS system (Table A-6). For another 10 percent of these employers, the analysis suggested that they may not have met requirements for UI coverage in Texas, either because they had no operations in Texas, because of nonprofit status, or because their payrolls were too small. For the remaining 20 percent, the reasons for their absence are not always clear, but the absence may have resulted in part from lags in incorporating new employers in the UI State agency and BLS files.

Most of the employers who were included in the 1982 UI Address File but did not file 1982 W-2/W-3 wage reports (22 sample cases) appeared to have ceased hiring employees, gone out of business, or gone through other changes that altered their reporting to IRS and SSA. Half of the employers in this group reported no employment in the 1982 UI Address File. Many of the remainder had filed their final Form 941 with IRS (at least for the period 1981-1983) for a quarter in 1981.

An analysis of the sample EINs that appeared in SSA's Multi Unit Code File provided some indication of the extent to which multi unit employers were participating in SSA's Establishment Reporting Plan (ERP) in 1982 (Table A-7).
An estimated 35.9 percent of these EINs had been incorrectly added to the Multi Unit Code File as the result of a processing error that has since been corrected. Most of the remaining employers had initially agreed to participate in the ERP, but more than half of this group did not provide separate data for each reporting unit in their W-3 wage reports for 1982.

Limitations

Several factors limit the broad applicability of the ERUMS findings. The results reflect the reporting requirements and operating procedures associated with the agency record systems in 1982. There have been significant changes since then. In particular, BLS has taken several steps to improve the timeliness, completeness, and accuracy of data in its UI Address File.

The study was based on data for a single State, Texas, and on a small sample of employers and reporting units. The UI system gives the States some latitude in their record-keeping practices, so indications of the coverage of employers in the record systems of the Texas State Employment Agency in 1982 should not be assumed to apply fully to the UI systems of other States at that time. The small sample size means that estimates based on the Phase II sample are subject to relatively large sampling errors. Because of limited resources and the complexity of the Phase II sample design, we were able to compute sampling errors only for a few key estimates (see Table A-4).

The analysis of the results was complicated by differences in concepts and coverage in the record systems used in the study. These differences occurred in the basic filing requirements for the UI and SSA/IRS systems, the time reference of the basic BLS and SSA files used for matching, the definition of reporting units in the BLS and the SSA/ERP systems, and the structures of the BLS and SSA industry classification systems. In addition, certain file deficiencies and operational problems made the analyses more difficult.
About 1.3 percent of the records in the 1982 UI Address File for Texas did not have EINs and therefore were not included in the Phase I sample of EINs from that file. In the SSA files, a significant proportion of employers lacked county and industry codes. The most serious problem was that a high proportion of multi unit employers were not reporting separately in 1982 for each reporting unit, so that we were unable to do a thorough comparison of reporting units for multi unit employers active in both the BLS and SSA systems.

Although these differences and file deficiencies made the analyses more difficult, the fact that we succeeded in identifying and documenting them is an indication that the ERUMS project succeeded in its main goal, which was to demonstrate the feasibility of doing matching studies as a means of evaluating the suitability of administrative record systems for statistical uses.

The data on amounts of employment and payroll available from SSA, BLS, and IRS files were used in reviewing the unmatched sample cases and trying to understand why they were not present in both SSA and BLS files. However, the employment and payroll data were not added to the data file for the 401 sample EINs that was used to develop the estimates presented in this report. Therefore, all of the results shown are estimates of numbers of employers or reporting units classified by attributes such as match status and geographic and industry codes in the different systems included in the study. We did not attempt to estimate what proportions of aggregate employment or payroll were accounted for by employers who were unmatched or had different geographic or industry codes.

Findings

The detailed analyses of the ERUMS data did not suggest that large numbers of employers who report wages in one of the payroll tax systems were failing to report in the other system when they should have been.
They do, however, suggest that late reports and different procedures for processing the reports in the two systems created potential problems for using both of the systems' data files for statistical purposes.

Perhaps the clearest finding was that it is not possible to maintain a usable establishment reporting unit plan for multi unit employers in the absence of systematic procedures for monitoring employer reporting and updating files for changes in the number, location and industry of each employer's reporting units. SSA's Establishment Reporting Plan clearly lacked the necessary resources to do this in 1982 and there is no reason to think that the situation has improved since then.

There was a moderately high but by no means perfect correspondence between county and two-digit industry codes for single unit employers included in both the BLS and SSA systems. A substantial proportion of the differences arose from the absence of county or industry codes in the SSA system. Comparisons of industry codes at the three and four-digit level were not attempted because of the differences in the industry classification systems used by the two agencies.

With some qualifications, we were successful in matching the records of employers, as defined by their EINs, in different systems. However, we were not successful in matching BLS and SSA records for reporting units, the main reason being the incompleteness of SSA's data for reporting units provided under the voluntary ERP. Other reasons were the lack of a common identifier, analogous to the EIN at the employer level, for reporting units and the slight differences in the reporting unit definitions used by BLS and SSA.

We learned what we believe are some important lessons for others who may wish to match business records from different agency sources, whether for research or operational purposes.
First, the plans and the necessary interagency agreements should be developed well ahead of the earliest date at which the files to be linked are expected to be available. In particular, the development of interagency agreements for the exchange of identifiable records is a painstaking process and considerable time may be needed for their completion and approval. Second, successful matching requires in-depth knowledge of all of the record systems involved and of the specific files that exist within those systems. An interagency team approach, with full exchange of information, is essential because there is unlikely to be a single individual who has all of the necessary information, even for the files of a single agency. Finally, whenever possible, it is essential to pretest matching procedures before embarking on large-scale operational applications.

Recommendations

ERUMS was designed primarily as a demonstration project and was therefore limited in its coverage and scope. Nevertheless, the Workgroup believes that the study results, along with other information acquired in the course of the study, justified the inclusion in its report of five formal recommendations addressed specifically to the BLS and SSA record systems for employers and reporting units. These recommendations were:

1. SSA should undertake a full review of the current status and uses of the Establishment Reporting Plan and decide either to continue it with adequate resources for maintenance and improvement of quality or to discontinue it entirely. (Note: such a review was begun by SSA prior to the completion of the ERUMS project. As a result of that review, SSA is taking steps to prepare for the termination of the ERP.)

2. BLS should review the State Employment Security Agencies' procedures for identifying employer births (including those resulting from mergers and changes of organization) and seek ways of reducing the apparent lag between filing of applications for EINs and inclusion of new employers on State Agency and BLS lists used as frames for statistical surveys and reports.

3. Data in the UI Address File on employment and wages paid should be labelled to distinguish imputed data from data reported by employers.

4. The EIN should be identified as a key item in the UI Address File and efforts should be made to achieve 100 percent reporting initially and current reporting of changes in EINs.

5. BLS and SSA (if it continues the Establishment Reporting Plan) should strive to obtain data from employers for their establishments as defined in the 1987 Standard Industrial Classification (SIC) Manual. Both agencies should code industry for all establishments, without exception, at the 4-digit SIC level of detail. Whether or not the Establishment Reporting Plan is continued, SSA should code all employers identified on Forms SS-4 at the 4-digit level of detail. (See the parenthetical note following recommendation 1 concerning the current status of the ERP.)

In a broader context, the ERUMS Workgroup concluded that current efforts to collect economic data at the establishment level are dispersed among Federal and State agencies, are poorly coordinated, and place unnecessary burden on employers. The Workgroup believes that further, more intensive and extensive interagency matching studies have an important role to play in resolving these problems and in determining the possible effects on statistical programs of prospective major changes in administrative reporting systems for employers. We therefore recommend that:

6. Further matching studies should be directed at acquiring information that will support the eventual development of a mandatory reporting system to meet the needs of all Federal and State statistical programs for establishment lists, including SIC codes. An interim goal should be that all agencies requiring or requesting employers to provide data at the establishment or reporting unit level adopt common definitions of units and data items to be submitted for these units.

Three agencies -- the BLS, the Census Bureau and the National Agricultural Statistics Service -- play a dominant role in the direct collection of establishment-level economic data. Recent initiatives of these agencies, under the general guidance of OMB's Statistical Policy Office, have been directed at greater coordination of their respective list-building and maintenance activities. Further integration of business lists will require fuller understanding of the similarities and differences of the three systems, based on matching of individual establishments and reporting units in the different systems.

[Graphic not reproduced.]

1/ Numbers in parentheses are standard errors of the percents.
* Indicates a standard error of less than 0.05 percent.

DISCUSSION

W. Joel Richardson
Charles A. Waite
U. S. Bureau of the Census

Introductory Comments on ERUMS

First of all, I would like to thank the many people who have been involved with ERUMS. Their commitment and resourcefulness have helped to make the ERUMS project a success. As Vernon has detailed, several recommendations were presented that undoubtedly will improve the business files of the Bureau of Labor Statistics (BLS) and the Social Security Administration (SSA). But more importantly, the ERUMS study provided valuable experience in the technical aspects of matching interagency data sets. I am hopeful that this experience will help to further the efforts of data-exchange initiatives among federal statistical agencies in the coming years.
When the preliminary planning for ERUMS began in 1983, the Census Bureau expected to be one of the participating agencies. Our business employer files were to be matched along with those of the BLS, SSA, and the Internal Revenue Service (IRS). However, there were significant problems concerning the release of our confidential data. Though we realized the importance of ERUMS, we could not resolve these data-access problems soon enough to allow us to be an active participant. As an alternative, the Census Bureau obtained observer status, which enabled us to closely follow the progress of ERUMS.

Before critiquing the three papers, I'd like to expound on the value of the ERUMS study to the federal statistical community. Warren stated that a major goal of ERUMS was to test the feasibility of matching employer records from the business lists of different government agencies. This goal was accomplished in ERUMS, and the results showed that the matching of the two distinct data files is possible. Additionally, the ERUMS evaluation revealed problems associated with matching the interagency data files. I expect that these findings will be valuable in future matching studies.

A matching study should be the first step in any data-sharing proposal -- before a data-sharing proposal is accepted by the participating agencies, it is essential to confirm the comparability of the data sets and to resolve any conceptual and definitional differences. In my view, the ERUMS project showed that the BLS and SSA data sets are comparable, and that an effective matching operation is possible.

Although there are obvious discrepancies between the data sets -- only 67.1 percent of the EIN records were active in both systems -- significant benefits could be realized through data sharing. First, greater consistency in the industrial classification codes, geographic location indicators, and related data values could be achieved by sharing the data for matched records.
Second, unmatched records could be researched in an effort to ensure the completeness of each of the employer universes. Though numerous issues would need to be explored and settled, such a data-sharing plan could result in greater comparability among the data series.

Currently, the administration has a legislative proposal in Congress that would permit limited data sharing between the Census Bureau and the Bureau of Economic Analysis (BEA). The primary purpose of the proposal is to provide BEA with confidential access to the Census Bureau's establishment information. This information will augment and improve the data on foreign direct investment that BEA collects and publishes. There are other versions of the legislative proposal in Congress to share Census and BEA data -- not only with each other, but, in at least one version, with the General Accounting Office (GAO) and the Committee on Foreign Investment in the U.S. (CFIUS). We are concerned that response rates may decline if our microdata are made available to such policy-making organizations as GAO and CFIUS. For this reason, the Census Bureau does not support those versions of the legislative proposal.

The BEA collects foreign-investment data at the enterprise level. The Census Bureau conducted a feasibility study that showed BEA enterprise-level data could be linked successfully with Census Bureau establishment data. By integrating our establishment-level data with BEA enterprise data, BEA will be able to present foreign direct investment statistics at a much finer industry and geographic level. This is one of many possible data-sharing plans that could provide significant cost and qualitative benefits to Federal statistical programs. I would like to believe that the administration's legislative initiative, together with successful match studies such as ERUMS, will provide the impetus for increased data sharing among Federal statistical agencies in the future.
Interagency Agreements for Microdata Access: the ERUMS Experience

Tom Petska's presentation focused on the interagency agreements required to comply with the confidentiality provisions that govern the three sets of data. Clearly, the matching of individual records in the ERUMS project could not take place until these confidentiality issues were resolved.

Tom has presented thoroughly the problems associated with sharing the individual records from different agencies. It is apparent that these legal agreements represented a major barrier in the ERUMS project. To their credit, the ERUMS workgroup was able to overcome the confidentiality problems and to formulate a workable plan -- IRS contracted with BLS to perform the match, and SSA staff were designated as special agents of BLS to process the data. The IRS is permitted to disclose tax information to outside contractors as long as it is for purposes of tax administration, and the ERUMS study was considered to be a statistical study related to the administration of IRS tax laws. Unfortunately, considerable time was spent in determining this solution and in drafting the required legal agreements. This added considerably to the length of the ERUMS study.

Future matching studies may face similar obstacles in gaining access to confidential data. As an example, the Census Bureau obtains the EIN and related data values for many small employer businesses from the IRS. Any future studies undoubtedly will rely on the EIN to match the records, because the EIN is the one key identifier common to U.S. data systems. But as Tom has pointed out, the EIN itself is protected by Internal Revenue Code confidentiality provisions. For this reason, the EIN and related data that the Census Bureau obtains from the IRS cannot be released to other statistical agencies such as the BLS. Only those business records whose EIN and related data have been confirmed through direct respondent contact would be eligible for release.
This would affect the completeness of any matching studies between the BLS and Census Bureau data sets, because a portion of our business universe has not been directly canvassed.

The BLS was permitted access to IRS records in the ERUMS project because of tax-administration purposes. Although additional studies possibly could be conducted using similar arrangements, doing so would require the support of the IRS and other agencies that furnish the administrative data. Otherwise, future studies may require changes to relevant statutes and regulations before microdata access is authorized. Such changes are difficult to obtain.

I do have one minor point on the paper concerning the confidentiality provisions of the BLS data. The ERUMS study used matched BLS records from only one state -- the state of Texas. Although Tom outlined the disclosure provisions associated with the data records from Texas, it was unclear whether these provisions were typical of the other 49 states. We understand that BLS affords each state certain latitude as to the collection of the unemployment data. If the states also have different confidentiality provisions -- specifically, provisions that strictly prohibit the release of data to Federal agencies other than BLS -- the ERUMS project may not have been possible using records from these states.

One of the goals of ERUMS was to gain experience in the procedure of obtaining access to the confidential data of the various data sets. To this end, the ERUMS study was a success. The study revealed the problems associated with obtaining access to the microdata for matching purposes, and also determined a workable solution that overcame these problems. However, I expect that disclosure problems will continue to be a major obstacle in future matching initiatives.

Sample Selection and Matching Procedures for ERUMS

John Pinkos's presentation focused on the sample selection and matching procedures in ERUMS.
As John has pointed out, a major constraint affecting the sample size was the limited staff time and resources. Because considerable analysis was inevitable for the sampled records, the ERUMS members agreed to select a relatively small sample. As it turned out, 401 cases were selected. By limiting the sample to one state, and oversampling from certain categories of records that were of particular interest, ERUMS was able to create a manageable set of sample records that were sufficient to meet the study's objectives. I expect that future matching studies will benefit from the details of the procedures used in ERUMS.

Three sources of data were used in the study -- BLS data, SSA data, and IRS data. Cases were selected first from the BLS data files and then independently from the SSA data file. Using this technique -- specifically, by selecting independently based on certain digits of the EIN -- the ERUMS sample included records that were present in only one of the two data systems, as well as records that were present in both systems. Records present in only one of the data systems were a critical part of the study, as these represented potential differences in employer coverage between the two data files.

The ERUMS study, however, did not sample from the IRS data set. The IRS data were used only to help analyze the BLS/SSA cases selected in the sample. The IRS file was not included in the sample selection because of the difficulties in gaining access for such a purpose. Although this decision was unavoidable, it may have compromised the results of ERUMS somewhat. The IRS data file represents a complete universe of business employers in 1982 -- all employers who filed payroll tax returns, with no exclusions as to the size of the business or the nonprofit status of an organization, were included on the IRS file. Without this complete file of businesses, ERUMS was left to compare records from the BLS and SSA data sets.
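
The independent selection by EIN digits described above can be illustrated with a minimal sketch. It is not the ERUMS procedure itself: the selection rule (keeping EINs that end in a given pair of digits), the function names, and the sample EINs are all hypothetical, chosen only to show why applying the same digit rule to each file separately yields cases found in both systems as well as cases found in only one.

```python
def select_by_ein_digits(eins, ending):
    """Keep EINs whose trailing digits equal `ending`.

    The trailing-digit rule is an assumed stand-in for the
    "certain digits of the EIN" mentioned in the text.
    """
    return {ein for ein in eins if ein.endswith(ending)}


def classify_match_status(bls_eins, ssa_eins, ending):
    """Apply the same digit rule to each file independently,
    then classify every selected EIN by where it was found."""
    bls_sample = select_by_ein_digits(bls_eins, ending)
    ssa_sample = select_by_ein_digits(ssa_eins, ending)
    return {
        "both": sorted(bls_sample & ssa_sample),
        "bls_only": sorted(bls_sample - ssa_sample),
        "ssa_only": sorted(ssa_sample - bls_sample),
    }


# Made-up EINs for illustration only.
bls_file = ["741000017", "741000025", "751234517", "760000099"]
ssa_file = ["741000017", "752222217", "760000099", "751111133"]

status = classify_match_status(bls_file, ssa_file, ending="17")
for category, eins in status.items():
    print(category, eins)
```

Because the rule depends only on the EIN and not on which file a record came from, an employer present in both files is either selected from both or from neither, so the "only one system" categories reflect differences in coverage rather than accidents of sampling.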
Although differences were identified and quantified, the study could not make valid estimates of the completeness of the two data sets as compared to the universe of businesses on the IRS file.

A similar point exists for the matching of multiunit records from the BLS and SSA data sets. The ERUMS study showed that about 1 percent of all active EINs were classified as multi unit in one or both systems. Most of these were classified as multi unit only in the BLS system. One of the findings of the study was that the SSA multiunit file is deficient, and steps should be taken to either improve the quality or to discontinue it entirely. Because of the obvious deficiency in SSA's multiunit file, no legitimate conclusions could be reached on the accuracy of the BLS multiunit file.

One last point on John's paper: he discussed briefly the comparison of industry classification and geographic location from the BLS and SSA files. I would have liked to see some general table that presented these results. Even if the results were presented at broad industry and geographic levels, it would have provided some general information on the comparability of these critical data elements.

Results, Findings and Recommendations of the ERUMS Project

The agencies involved in the ERUMS project have gained valuable experience in the technical aspects of linking data files and in the administrative requirements for gaining access to the data. For this reason alone, the ERUMS project should be considered a success. In addition to the experience gained, the ERUMS project presented several recommendations that will help to improve the business files of the BLS and SSA. I understand that the BLS has already taken several measures to improve the timeliness, completeness, and accuracy of the data in its Unemployment Insurance Address File.

Vernon's presentation detailed the recommendations that were identified in the ERUMS study.
In one of the recommendations, he stated that BLS should review the procedures for identifying births in an effort to improve the timeliness of including new employers in the BLS lists. I suggest that the BLS review procedures for identifying deaths as well. Up-to-date operational status is a critical element of business employer records.

The final recommendation in Vernon's presentation covered the need for additional matching studies to acquire information that will support the eventual development of a reporting system to meet the needs of all Federal and State statistical programs. Because of certain legislative barriers -- for example, Title 26 strictly prohibits the release of IRS data to other statistical agencies -- and significant operational problems, such a far-reaching goal may not be plausible in the foreseeable future.

The Census Bureau supports a more achievable goal of data sharing among Federal statistical agencies, and would welcome the opportunity to conduct additional matching studies in an effort to further data-sharing initiatives. Before proposing the Census/BEA data-sharing initiative, we conducted a matching study that confirmed the feasibility and value of linking our establishment-level data with BEA's enterprise-level data. This preliminary study was a necessary step in the Census/BEA data-sharing initiative. Additional matching studies may promote other data-sharing initiatives in the Federal Government.

The ERUMS project, which effectively matched interagency data files, may help provide the impetus for increased data sharing in the coming years. With the necessary legislative changes, pertinent data from each of the employer files could be shared among statistical agencies. Such a data-sharing plan would provide major advantages, including greater comparability among economic data series, less respondent burden on the business community, and a reduction in overall Government costs.
Summary

Comparisons between data sources are beneficial because they highlight conceptual differences and identify the limitations and strengths of the data sets. The ERUMS project successfully met both of these objectives. In addition, ERUMS provided valuable experience in the technical aspects of matching interagency data sets. Our current mission should be to use this experience to further the efforts of data sharing in the Federal Government.

Data sharing offers major advantages to Federal statistical agencies. By supplementing business data sets with applicable information from the data sets of other agencies, the Federal statistical system will attain greater comparability in related economic data series. The ERUMS project showed that interagency data sharing is a viable option. I would like to congratulate the many people who have been involved with ERUMS for a job well done.

DISCUSSION

Thomas J. Plewes
U.S. Bureau of Labor Statistics

I appreciate the opportunity to appear at this public unveiling of the Employer Reporting Unit Match Study (ERUMS) report. This is an event that has been long-awaited by all of those who have been involved in this multi-agency, multi-year, and multi-faceted project. I expect that no participant has awaited this day more anxiously than Warren Buckler, who, along with the folks here at the speaker's table and many in today's audience, has spent a great deal of time over the past few years in conceiving, giving birth to, and nurturing this little study. Indeed, to carry the metaphor further, it is hard to figure out where we stand now on the continuum from project conception to death. Is this session a commencement ceremony, or is it a eulogy?

As my commentary will soon indicate, I hope that we are gathered for a commencement ceremony, for the statistical community has learned important lessons about sharing and about the basic quality of two major business lists in this project at some significant cost.
It would be a shame if the lessons learned were not put to use in implementing critically needed program improvements.

I would like to accomplish two objectives in the short time I have been allotted as a discussant. First, I want to step back to examine the environmental framework in which this study took place and contemplate the arena into which the report now has been thrust. My second goal is to draw specific conclusions from the exercise and suggest specific steps that should be taken as a result of the work that has been done.

What is the environment in which we must consider this study? It is a complex environment, characterized by:

1. Little sharing of business directory information between Federal government agencies, but a growing pressure to develop procedures for sharing so as to reduce the burden on respondents. These pressures are building to the extent that I believe sharing will surely be mandated. That mandate may come in the form of legislative action, a fiat from the Office of Management and Budget using its authority under the Paperwork Reduction Act, or, of most profound consequence, through a centralization of the statistical agencies.

2. A reliance on lists characterized by their primary usage as administrative data sources, which focus on supporting the administration of the law or function. We have built our elaborate business directory programs and constructed our business survey frames on databases that have been developed with only a distant secondary concern for the statistical uses of the data.

3. Difficulty in separating statistical from enforcement purposes. If we, as statistical agencies, make the data better and create an environment for comparing lists, we enhance their use for enforcement and administrative purposes also. This aspect will be particularly troublesome when we involve, as we eventually must, the Internal Revenue Service in sharing schemes. The participation of the IRS in the ERUMS process gave us an indication of the lengths to which IRS will go to protect the tax data, and of the difficulties this injected in the ERUMS process.

4. A growing concern over confidentiality of establishment records.

5. A lack of consistency of definitions and coding that extends throughout the statistical system, but has a most profound impact on sharing of administratively-derived lists. Administrative differences in the programs lead to inconsistent definitions of even the most simple of terms, such as "employment", "address", "wages" and the like.

6. An expanding recognition that errors and omissions in the business lists are a significant source of error in the survey process. The Federal Committee on Statistical Methodology's Working Paper 15, "Quality in Establishment Surveys," documented this, and the Tupek-MacDonald paper this morning discussed the effect that the Bureau of Labor Statistics' Business Establishment List improvement project will have on BLS survey quality.

These environmental elements pose formidable challenges to statistical agencies that want to improve the efficiency of their operations and reduce burden on their reporters. For example, in terms of frames for surveys of nonagricultural businesses, there are at present two major government lists -- the Census Bureau's Standard Statistical Establishment List (SSEL) and the BLS Business Establishment List (BEL) -- and one major private sector list -- the Dun & Bradstreet file -- with a myriad of lesser known and more specialized lists for more limited purposes. We can look at the SSEL as a representation of the SSA/IRS administrative data files with considerable value added by the Census Bureau. Likewise, the BEL may be seen as a representation of the State unemployment insurance files with considerable BLS value added.
If these Federal government files do not match, and we suspect they do not through analysis of the macrodata, the problem can be with the basic administrative data files, with the value added, or both.

Over the years, Fritz Scheuren's various administrative database comparison projects have documented the systemic differences in the files very well. They must be borne in mind. Fixing the files once we have identified the root difficulties is quite another matter. The statistical agencies do not own them, and they are exceedingly expensive to change (in terms of budget and response burden). Indeed, quite often only a revision in law or nationwide program practice will do the trick. Fixing the "value added" portion is somewhat more possible, but it too is expensive in terms of budget and people. Often there are good reasons for not fixing the way we add our value, such as the need to assure the continuity of historical data series.

Definitions are another challenge. If we want to share lists, we must think in terms of three types of problem. In some cases, repair is relatively simple. We heard today, for example, that our definitions of multi-unit employers are already in close proximity. The EIN and SIC systems are also bedrock. Our challenge in those instances where there is close concordance between the files is to maintain the definitional base in a standardized, current and relevant manner.

In other areas, we must change the way we do business but, if we are willing, our task will be reasonably easy. One match problem that ERUMS identified was that the project was comparing annual SSA reporters with 1st Quarter UI reporters. This is one of the problems that we can fix with time and resources, because the data are there.

In a few important other cases, however, we are quite limited in our ability to bridge definitional gaps.
For example, when coverages are based on Federal laws, State laws, and judicial precedent regulating the administrative database, we would be forced to justify a change in the insurance or tax program on statistical grounds.

Certainly, confidentiality concerns have a presence in the equation. We glimpse in the Petska-Alexander paper the importance that necessary confidentiality protection schemes had in this project, and the price those schemes exacted in terms of time and precision. That's one of the reasons I like the Petska-Alexander paper so much. It outlines the practical implications of maintaining a pledge of confidentiality when cooperating on a project of importance to the statistical agencies. Everything, as they so well point out, had to be invented. There are no textbook examples of interagency agreements on confidentiality. The solutions which the project team developed were carefully crafted to stay within the very restrictive IRS law and were implemented with an eye toward the reality of the environment. Thus, there are really two stories in the Petska-Alexander paper. One story is about the difficulties that the team encountered in sharing confidential data. The other, written between the lines, is about the sense of cooperation and dedication that allowed the cumbersome solutions to move forward.

The Petska-Alexander paper starkly reminds us that the role of confidentiality policy is important but little understood. We may be hopeful that the current situation will be short-lived. The National Academy of Sciences' Committee on National Statistics has taken on these issues with the formation of an expert panel. Until we are able to benefit from that report, however, we are left with the fact that understanding of confidentiality of business records has not progressed very far as either science or practice.
Only recently has a literature on the subject of confidentiality begun to emerge, but most of it addresses the more emotional topic of confidentiality of information about individuals. The literature pays little attention to issues surrounding confidentiality of business records. Without such a foundation, the statistical agencies have mostly assumed that the issues of confidentiality of business records are the same as those for individuals. This assumption has played an important role in justifying past limits on sharing between the Federal agencies.

The second paper, by Einstein, Levasseur, Packman, and Pinkos, also attempts to stand back with benefit of hindsight and make some sense out of what was a convoluted process. Since 3 of the 4 authors work with me, these comments may not be as critical as others may have rendered, for all along the way I "bought in" to the approaches taken and the effort expended. Nonetheless, I view the documentation that this paper offers in a somewhat different light than the authors, and draw slightly different conclusions.

The matching process, as described, makes a good deal of statistical sense. The team selected a two-stage sample selection process, stratified into 9 groups. The second phase, a subset of about 400 cases of the first selected on a probability basis, provides for detailed analysis. Some of the specific steps in the process were to meet the confidentiality restrictions, but not all. The process that the team established should serve as a first step toward developing an on-going statistical process control system, if and when sharing does take place. Many of these same activities should be continued in a recurring program to meet the objectives of total quality management. Thus, the work of the team has long-term, permanent implications.

The authors seemed to recognize this when they stated that "we believe future projects of this kind will benefit from the availability of this detailed road map".
Probably so, but I speculate that future researchers will look at the road map and decide against making the journey. That is why I would take pains to separate the enduring aspects that should be the foundation of a quality management system from those that were necessary to meet more bureaucratic objectives.

The contribution of the Renshaw-Jabine paper is to yield some hope, in that it reminds us how close we are to an ability to share, while providing some sober reflection about some major tasks still lying ahead if we are to share. Their bottom line is that the systems are reasonably close in coverage -- eventually most employers emerged in the systems. There were troublesome differences in multi-unit identification, in county coding, and in industrial classification at the 2-digit level, but I would label these of moderate concern. Indeed, under the BEL initiative, BLS has taken steps to correct many of the inadequacies in its data, investing with the States in improving SIC coding, interpretation of SICs, and, more recently, in fixing the multi-establishment identification problems. Unfortunately, with lack of resources, the Social Security Administration has not been able to make the same investment, so many of the difficulties in the SSA file may have multiplied.

In summary, we ought not let this expensive experience lie on the shelf. We have learned a great deal about two files -- lessons that should be extended to files maintained by the Bureau of the Census. And we need to get on with fixing some of the obvious flaws in the administrative data. Most importantly, we have learned that maintaining confidentiality is possible, that matching is feasible, and that the will is present at the staff level in the agencies to make it all come together. Now it is time for leadership. As Senator Bennett Johnston said in an argument before Congress, "There's a time to stop talking the talk and start walking the walk." We have the map. Let's start walking.
Session 10

APPROACHES TO DEVELOPING QUESTIONNAIRES

TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING QUESTIONNAIRES

Theresa J. DeMaio
U.S. Bureau of the Census

As the collection of information through surveys becomes more prevalent in our society, increasing numbers of people find themselves in a position to develop questionnaires. Writing a questionnaire seems like such a simple task -- many people think that anyone can do it, even without training or experience. But developing a good questionnaire -- one that can obtain good-quality information that meets the objectives of the survey -- is not as easy as it looks. Many different kinds of abilities, including subject matter expertise, writing capabilities, and knowledge of social psychological principles, are necessary to develop a simple, cohesive questionnaire in which the questions are clearly worded.

Developing a good questionnaire is not a solitary task -- it is not simply a matter of sitting down at your desk for a few minutes or even a few hours. There are a number of procedures that can be used to involve potential respondents in content or question development, and to test and evaluate questionnaire drafts before they are finalized. The purpose of Statistical Policy Working Paper #10, Approaches to Developing Questionnaires, is to provide practical information about these methods. The report contains descriptions of 11 different techniques, which can be used at various stages of questionnaire development.

The report is structured in three parts: tools to develop questions, procedures for testing the questionnaire draft, and techniques used to evaluate the questionnaire draft. This structure was somewhat artificially imposed for ease of presentation in the report. In fact, there is no one ideal way to go about the process of developing a questionnaire.
Depending on a number of factors, such as whether you're working from scratch or from an existing questionnaire, and how much time and funding are available for survey development, these techniques can be used in many different combinations. In terms of improving the content of a survey questionnaire before it goes out into the field, the important thing is that testing and developmental work be conducted, not necessarily that it be done according to the structure presented in the report.

Having made this disclaimer, I am nevertheless going to discuss the techniques that are presented in the first two sections of the report -- that is, tools for developing questions and techniques for testing the questionnaire draft. I'm going to generally describe the methods contained in the report, and mention some additional techniques as well.

Developing Questionnaires

Part I of the report describes three tools for developing questions. The report presents these methods as useful in developing new questionnaires. I'd like to expand on this a little and suggest that these techniques can be used in the early stages of questionnaire development of any survey. Most surveys are conducted more than once; subsequent rounds of data collection begin with an existing questionnaire draft that is subject to revision. These later rounds each have early stages of questionnaire development, complete with an existing questionnaire draft. In these cases too, the methods described in Part I of the report may be appropriate.

Unstructured Individual Interviews

Unstructured individual interviews are one-on-one conversations between a researcher and a member of the population for the survey or proposed survey. I use the term "conversations" because the discussion is unstructured; rather than having a set of specific questions, the researcher uses a topic outline and collects information on various aspects of these topics in whatever order, and using whatever terminology, the respondent suggests.
Respondents may also bring up additional issues related to the general topic, which might be incorporated into the topic outline for later interviews. The goal is an unstructured setting in which the researcher finds out how the respondent perceives the topic of interest, what terminology the respondent uses to talk about the topic, and whether the respondent is knowledgeable and able to provide information on the topic. By working from a blank slate, the researcher is not constrained by the content and terminology of an existing questionnaire, and the true frame of mind of the respondent is more likely to surface.

Qualitative Group Interviews

Many of you may be familiar with qualitative group interviews under a different name, such as focus group interviews, group depth interviews, or focused discussion groups. Essentially these are unstructured interviews with a group of respondents rather than a single respondent, led by a group moderator. About 8 to 12 people participate in a group, and the moderator uses a topic outline to guide the discussion.

Qualitative group interviews are used for many research purposes other than questionnaire development. When used to assist in questionnaire construction, the goal is the same as the goal of unstructured individual interviews -- to elicit the terminology used by respondents in thinking about the topic in question, to determine aspects of the topic that respondents consider important, and to get a reading on how respondents react to aspects of the topic that survey planners consider important.

The difference between qualitative group interviews and unstructured individual interviews is, obviously, the group setting: the diversity of opinions held by group members may stimulate interaction among them that elicits more information than could be obtained through interviews with each member separately. In order for these groups to be successful, however, the ability of the moderator is an important consideration.
The idea is to stimulate discussion among all the participants and to avoid domination of the discussion by some people who may be more vocal than others.

Participant Observation

Participant observation is a technique that is used as an independent method of data collection, as well as a tool for questionnaire development. It has been extensively used around the world. The basic elements of the technique are suitable for questionnaire design purposes, especially in developing questionnaires for use by members of other cultures or subcultures living within our own country. For example, the homeless population is a subculture that is currently the object of much interest, and for which the use of participant observation techniques is relevant. Indeed, these techniques have been successfully used in research on homelessness being conducted at the Census Bureau.

There are several distinguishing characteristics of participant observation research. First, the researcher must speak the respondents' language. This is not limited to English as opposed to a foreign language, but also refers to dialects, slang, or professional jargon. Second, the researcher associates with the members of the community he or she studies and engages in their activities. Ideally the researcher lives among the respondents; at a minimum, he or she develops contacts in the community over a long period of time.

The participant observer may also use the ethnographic interview technique during the course of his or her research. This involves using unstructured interviews (the methodology I previously described) with "key informants." These are members of the community who are willing to talk at length with the researcher or introduce the researcher to other community members.

From this brief description, it should be obvious that participant observation is not a methodology that a person can "pick up" by reading an introductory textbook.
The expertise required in the use of this technique dictates the involvement of trained ethnographers. While that may limit its use somewhat among U.S. statistical agencies, there are several ways it can be incorporated in a project. First, participant observation can be conducted as part of a project by trained anthropologists hired to serve on the project staff. In the homeless project I referred to a moment ago, we hired an anthropologist to work with a survey methodologist, and this combination has worked out very well. A second way to make use of this technique is to consult with ethnographers who have prior experience among the culture of interest, and take advantage of this previous experience rather than conducting original fieldwork. This could be done either by hiring the person on staff or doing it on a consultant basis.

Think Aloud Interviews

Another technique suitable for the early stages of questionnaire development has gained in popularity since the Working Paper was completed in 1983. This is the think aloud interview. Also referred to as protocol analysis, this method is an extremely valuable source of information about how respondents understand the survey questions put to them, and how they go about answering the questions.

The purpose of the technique is to get respondents to talk out loud and verbalize their thoughts as they respond to questionnaire items. The data of interest here are respondents' reactions to the items, their thoughts as they formulate answers to the items, and what decisions they make in answering the questions. Use of the technique requires a questionnaire draft. Since the results of these interviews are crucial to the questionnaire development process, the person doing the interviewing is generally a researcher or questionnaire designer.
For interviewer-administered surveys, the questioner first explains to the respondent that rather than just answering the questions, he or she should actually think out loud -- that is, say what he/she is thinking as he/she answers each question. Respondents differ in their ability to verbalize their thoughts, and some may require a bit of probing to uncover how they arrive at the answer to a question. At times it may take skillful questioning to probe completely what is on a respondent's mind. The interviews are generally tape-recorded (with the respondent's permission), since it is difficult to take notes and concentrate on probing the respondent's answers at the same time.

This technique can also be adapted for self-administered interviews. In this case, the questioner is basically an observer. The respondent is instructed to complete the questionnaire, reading the questions and instructions out loud as well as verbalizing the responses. I've done quite a few of these interviews, and they really are quite helpful in detecting layout problems (not noticing skip instructions, etc.) in addition to uncovering problems with the questions.

This technique is used with relatively small numbers of respondents. Ten or fewer think aloud interviews provide large amounts of information and can uncover systematic misinterpretations or other problems. Use of the technique is an iterative process -- once the questionnaire designer conducts five to ten think aloud interviews, problem areas will generally surface. Then, after revisions to the questionnaire are made, additional interviews can be conducted to detect problems with the revisions. Or alternatively, some other method can be used for the next round of questionnaire development.

Testing Questionnaires

Whatever