Federal Committee on Statistical Methodology
Office of Management and Budget

 

  Statistical Policy Working Paper 20 - Seminar on Quality of Federal Data - Part 3 of 3



 

 

                          Statistical Policy

                           Working Paper 20





                  Seminar on Quality of Federal Data





                              Part 3 of 3





             Federal Committee on Statistical Methodology



 



                      Statistical Policy Office



           Office of Information and Regulatory Affairs



                    Office of Management and Budget



 



                           March 1991



 



                MEMBERS OF THE FEDERAL COMMITTEE ON 



                       STATISTICAL METHODOLOGY



                          (February 1991)



 



                    Maria E. Gonzalez, Chair



                 Office of Management and Budget



 



Yvonne M. Bishop                  Daniel Kasprzyk



Energy Information                Bureau of the Census



  Administration



                                  Daniel Melnick



Warren L. Buckler                 National Science Foundation



Social Security Administration



                                  Robert P. Parker



Charles E. Caudill                Bureau of Economic Analysis



National Agricultural



  Statistics Service              David A. Pierce



                                  Federal Reserve Board



Cynthia Z.F. Clark



National Agricultural             Thomas J. Plewes



  Statistics Service              Bureau of Labor Statistics



 



Zahava D. Doering                 Wesley L. Schaible



Smithsonian Institution           Bureau of Labor Statistics



 



Robert M. Groves                  Fritz J. Scheuren



Bureau of the Census              Internal Revenue Service



 



Roger A. Herriot                  Monroe G. Sirken



National Center for               National Center for



  Education Statistics              Health Statistics



 



C. Terry Ireland                  Robert D. Tortora



National Computer Security        Bureau of the Census



  Center



 



Charles D. Jones



Bureau of the Census



 



                            PREFACE



 



In 1975, the Office of Management and Budget (OMB) organized the



Federal Committee on Statistical Methodology.  Comprised of



individuals selected by OMB for their expertise and interest in



statistical methods, the committee has during the past 15 years



determined areas that merit investigation and discussion, and



overseen the work of subcommittees organized to study particular



issues.  Since 1978, 19 Statistical Policy Working Papers have been



published under the auspices of the Committee.



 



On May 23-24, 1990, the Council of Professional Associations on



Federal Statistics (COPAFS) hosted a "Seminar on the Quality of



Federal Data."  Developed to capitalize on work undertaken during



the past dozen years by the Federal Committee on Statistical



Methodology and its subcommittees, the seminar focused on a variety



of topics that have been explored thus far in the Statistical



Policy Working Paper series.  The subjects covered at the seminar



included:



 



     Survey Quality Profiles



     Paradigm Shifts Using Administrative Records



     Survey Coverage Evaluation



     Telephone Data Collection



     Data Editing



     Computer Assisted Statistical Surveys



     Quality in Business Surveys



     Cognitive Laboratories



     Employer Reporting Unit Match Study



     Approaches to Developing Questionnaires



     Statistical Disclosure-Avoidance



     Federal Longitudinal Surveys



 



Each of these topics was presented in a two-hour session that



featured formal papers and discussion, followed by informal



dialogue among all speakers and attendees.



 



Statistical Policy Working Paper 20, published in three parts,



presents the proceedings of the "Seminar on the Quality of Federal



Data."  In addition to providing the papers and formal discussions



from each of the twelve sessions, this working paper includes



Robert M. Groves' keynote address, "Towards Quality in a Working



Paper Series on Quality," and comments by Stephen E. Fienberg,



Margaret E. Martin, and Hermann Habermann at the closing session,



"Towards an Agenda for the Future."



 



We are indebted to all of our colleagues who assisted in organizing



the seminar, and to the many individuals who not only presented



papers and discussions but also prepared these materials for



publication.  A special thanks is due to Terry Ireland and his



staff for their work in assembling this working paper.



 



 



                    Table of Contents



 



                    Wednesday, May 23, 1990



 



 



                            Part 1



 



 



                        KEYNOTE ADDRESS



 



 



TOWARDS QUALITY IN A WORKING PAPER SERIES ON QUALITY . . . . . . . . 3



   Robert M. Groves, The University of Michigan and U. S.



   Bureau of the Census



 



 



 



        Session 1 - SURVEY QUALITY PROFILES



 



 



 



THE SIPP QUALITY PROFILE . . . . . . . . . . . . . . . . . . . . . . 19



   Thomas B. Jabine, Statistical Consultant



 



INITIAL REPORT ON THE QUALITY OF AGRICULTURAL SURVEY PROGRAM. . . .  29



   George A. Hanuschak, National Agricultural Statistics



   Service



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40



   Barbara A. Bailar, American Statistical Association



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46



   Nancy A. Mathiowetz, U. S. Bureau of the Census



 



 



 



      Session 2 - PARADIGM SHIFTS USING ADMINISTRATIVE



 



                              RECORDS



 



 



 



PARADIGM SHIFTS:  ADMINISTRATIVE RECORDS AND CENSUS-TAKING . . . . . 53



   Fritz Scheuren, Internal Revenue Service



 



AN ADMINISTRATIVE RECORD PARADIGM:  A CANADIAN EXPERIENCE. . . . . . 66



   John Leyes, Statistics Canada



 



DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . 77



    Gerald Gates, U.S. Bureau of the Census



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83



     Edward J. Spar, Market Statistics                             



 



 



 



                  Session 3 - SURVEY COVERAGE EVALUATION



 



 



 



 



CONTROL, MEASUREMENT, AND IMPROVEMENT OF SURVEY COVERAGE . . . . . 87



     Gary M. Shapiro, U. S. Bureau of the Census; Raymond R.



     Bosecker, National Agricultural Statistics Service



 



QUALITY OF SURVEY FRAMES . . . . . . . . . . . . . . . . . . . . . 100



     Judith T. Lessler, Research Triangle Institute



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108



     Fritz Scheuren, Internal Revenue Service



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114



     Joseph Waksberg, Westat, Inc.



 



 



 



                  Session 4 - TELEPHONE DATA COLLECTION



 



 



 



 



QUALITY IMPROVEMENT IN TELEPHONE SURVEYS . . . . . . . . . . . . . 123



     Leyla Mohadjer, David Morganstein, Westat, Inc.



 



COMPUTER  ASSISTED SURVEY TECHNOLOGIES IN GOVERNMENT:



     AN OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . 137



     Marc Tosiano, National Agricultural Statistics Service



 



DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . .155



     William L. Nicholls II, U. S. Bureau of the Census



 



DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . .161



     James T. Massey, National Center for Health Statistics



 



 



 



 



 



 



 



 






 



                              Part 2



 



 



 



                       Session 5 - DATA EDITING



 



 



 



OVERVIEW OF DATA EDITING IN FEDERAL STATISTICAL AGENCIES . . . . . .167



      David A. Pierce, Federal Reserve Board



 



EDITING SOFTWARE (An excerpt from Chapter IV of Working



    Paper 18) . . . . . . . . . . . . . . . . . . . . . . . . . . . 173



      Mark Pierzchala, National Agricultural Statistics



      Service



 



RESEARCH ON EDITING . . . . . . . . . . . . . . . . . . . . . . . . 180



      Yahia Ahmed, Internal Revenue Service 



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184



      Charles E. Caudill, National Agricultural Statistics



      Service



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . .186



      Richard Bolstein, George Mason University



 



 



 



               Session 6 - COMPUTER ASSISTED STATISTICAL



 



                                 SURVEYS



 



 



 



OVERVIEW OF COMPUTER ASSISTED SURVEY INFORMATION COLLECTION . . . . .191



      Richard L. Clayton, U. S. Bureau of Labor Statistics



 



A COMPARISON BETWEEN CATI AND CAPI . . . . . . . . . . . . . . . . . 197



      Martin Baum, National Center for Health Statistics



 



COMPUTER ASSISTED SELF INTERVIEWING . . . . . . . . . . . . . . . . .202



      Ralph Gillmann, Energy Information Administration 



 



COMPUTER ASSISTED SELF INTERVIEWING:  RIGS AND PEDRO,



      TWO EXAMPLES. . . . . . . . . . . . . . . . . . . . . . . . . .205



      Ann M. Ducca, Energy Information Administration



 



DATA COLLECTION . . . . . . . . . . . . . . . . . . . . . . . . . . .209



      Cathy Mazur, National Agricultural Statistics Service



 






 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . .212



     Robert N. Tinari, U. S. Bureau of the Census



 



 



 DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . .216



      David Morganstein, Westat, Inc.



 



 



 



                          Thursday, May 24, 1990



 



 



                  Session 7 - QUALITY IN BUSINESS SURVEYS



 



 



 



 IMPROVING ESTABLISHMENT SURVEYS AT THE BUREAU OF LABOR



       STATISTICS . . . . . . . . . . . . . . . . . . . . . . . . . .221



       Brian MacDonald, Alan R. Tupek, U. S. Bureau of Labor



       Statistics



 



 A REVIEW OF NONSAMPLING ERRORS IN FEDERAL ESTABLISHMENT



 SURVEYS WITH SOME AGRIBUSINESS EXAMPLES . . . . . . . . . . . . . . 232



       Ron Fecso, National Agricultural Statistics Service



 



 DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . .243



     David A. Binder, Statistics Canada



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247



       Charles D. Cowan, Opinion Research Corporation



 



 



                    Session 8 - COGNITIVE LABORATORIES



 



 



 



 THE BUREAU OF LABOR STATISTICS' COLLECTION PROCEDURES



 RESEARCH LABORATORY:  ACCOMPLISHMENTS AND FUTURE DIRECTIONS . . . . 253



       Cathryn S. Dippo, Douglas Herrmann, U. S. Bureau of Labor



       Statistics



 



 THE ROLE OF A COGNITIVE LABORATORY IN A STATISTICAL AGENCY . . . . .268



       Monroe G. Sirken, National Center for Health Statistics



 



 DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278



       Elizabeth Martin, U. S. Bureau of the Census



 



 DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . .281



       Murray Aborn, National Science Foundation (retired)



 






 



                                  Part 3



 



              Session 9 - EMPLOYER REPORTING UNIT MATCH



                                   STUDY



 



 



INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS:



     THE ERUMS EXPERIENCE . . . . . . . . . . . . . . . . . . . . . .291



     Thomas B. Petska, Internal Revenue Service; Lois



     Alexander, Social Security Administration



 



SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS . . . . . . . 301



     John Pinkos, Kenneth LeVasseur, Marlene Einstein,



     U. S. Bureau of Labor Statistics; Joel Packman, Social



     Security Administration



 



RESULTS, FINDINGS AND RECOMMENDATIONS OF THE ERUMS PROJECT . . . . . 309



     Vern Renshaw, Bureau of Economic Analysis; Tom Jabine,



     Statistical Consultant



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318



     W. Joel Richardson, Charles A. Waite, U. S. Bureau of the



     Census



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324



     Thomas J. Plewes, U. S. Bureau of Labor Statistics



 



 



                   Session 10 - APPROACHES TO DEVELOPING



                               QUESTIONNAIRES



 



TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING



     QUESTIONNAIRES . . . . . . . . . . . . . . . . . . . . . . . . .331



     Theresa J. DeMaio, U. S. Bureau of the Census



 



TECHNIQUES FOR EVALUATING THE QUESTIONNAIRE DRAFT . . . . . . . . . .340



     Deborah H. Bercini, National Center for Health Statistics



 



DESIGNING QUESTIONNAIRES FOR CATI IN A MIXED MODE



     ENVIRONMENT. . . . . . . . . . . . . . . . . . . . . . . . . . .349



     Gemma Furno, U. S. Bureau of the Census



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360



     Carol C. House, National Agricultural Statistics Service



 






 



             Session 11 - STATISTICAL DISCLOSURE-AVOIDANCE



 



 



 



DISCLOSURE AVOIDANCE PRACTICES AT THE CENSUS BUREAU . . . . . . . . .367



      Brian Greenberg, U. S. Bureau of the Census



 



THE MICRODATA RELEASE PROGRAM OF THE NATIONAL CENTER



FOR HEALTH STATISTICS . . . . . . . . . . . . . . . . . . . . . . . .377



      Robert H. Mugge, National Center for Health Statistics



      (retired)



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385



      George T. Duncan, Carnegie Mellon University



 



 



                   Session 12 - FEDERAL LONGITUDINAL SURVEYS



 



 



 



FEDERAL LONGITUDINAL SURVEYS . . . . . . . . . . . . . . . . . . . . 393



      Daniel Kasprzyk, U. S. Bureau of the Census; Curtis



      Jacobs, U. S. Bureau of Labor Statistics



 



THE ADVANTAGES AND DISADVANTAGES OF LONGITUDINAL SURVEYS . . . . . . 407



      Robert W. Pearson, Social Science Research Council



 



LONGITUDINAL ANALYSIS OF FEDERAL SURVEY DATA . . . . . . . . . . . . 425



      Patricia Ruggles, Joint Economic Committee



 



DISCUSSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438



      Michael Brick, Westat, Inc.



 



DISCUSSION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .447



      Marilyn E. Manser, U. S. Bureau of Labor Statistics



 



 



                         TOWARDS AN AGENDA FOR THE FUTURE



 



 



 



Stephen E. Fienberg, Carnegie Mellon University . . . . . . . . . . .455



 



Margaret E. Martin . . . . . . . . . . . . . . . . . . . . . . . . . 462



 



Hermann Habermann, Office of Management and Budget . . . . . . . . . 465



 






 



                               Part 3



 



                               Session 9



 



                           EMPLOYER REPORTING



                            UNIT MATCH STUDY



 



 



 



 



 



 



 



 






 



 






 



 



         INTERAGENCY AGREEMENTS FOR MICRODATA ACCESS:



                       THE ERUMS EXPERIENCE



 



                        Thomas B. Petska



                    Internal Revenue Service



 



                          Lois Alexander



                 Social Security Administration



 



 



    The Employer Reporting Unit Match Study (ERUMS) was a pilot



record linkage study carried out under the auspices of the Federal



Committee on Statistical Methodology of the Office of Management



and Budget.  The study linked records of employers and their



reporting units from three agencies: the Bureau of Labor Statistics



(BLS), the Social Security Administration (SSA) and the Internal



Revenue Service (IRS).  The primary linkages involved samples of



the agencies' records for employers in the State of Texas covering



their activities in 1982.



 



    For the ERUMS Workgroup to gain access to the data sets needed



for the study, arrangements had to be developed that would comply



with the confidentiality provisions and statutes of the Federal and



State agencies that controlled these data sets.  This paper gives



an overview of these arrangements and agreements.  In the first



section, background information on the statistical content and



confidentiality provisions of each of the data sets is provided.



In the second section, the actual arrangements for the release of



confidential microdata are described.  The last section provides a



summary of what we have learned about such data sharing



arrangements.



 



 



Background Information



 



     The goal of ERUMS was to demonstrate the feasibility of



matching employer and reporting unit data from different agency



record systems as a means of obtaining more precise information



about the coverage and content of the data in those systems.  A



purpose was to examine and evaluate differences in wage and



employment data at the state and county level as reported to those



agencies.  Despite the many difficulties encountered in



establishing the data access agreements, ERUMS demonstrated that



such data sharing projects can be successful under current laws.



 



 



1.  Data Sets



 



     The ERUMS study was a three-way data linkage study in which



individual microdata records from BLS, SSA, and IRS were matched by



Employer Identification Number (EIN).



 



 






 



 a.  BLS provided a 1982 Unemployment Insurance (UI) Address



      File, which, for each state, consists of data for



      individual employers and their reporting units, which are



      often equivalent to "establishments".  The data for this



      file are submitted to BLS by the State employment



      security agencies that operate the Federal-State UI



      Program.  BLS uses the data submitted by the states as a



      basis for statistical reports on employment and wages and



      uses the UI Address File as a national sampling frame



      for its establishment surveys.



 



  b.  SSA provided an edited file of Form W-3 annual reports



      for 1982 and the Single Unit and Multi-Unit Code Files.



      The Form W-3 file provided data on individual employers



      and, in some cases, for each of their reporting units,



      which are frequently equivalent to establishments.  The



      Single Unit Code File contains a record for most entities



      that have filed an application for an Employer



      Identification Number.  The Multi-Unit Code File contains



      a record for each reporting unit of multi-unit employers



      who are participating in the Establishment Reporting



      Plan, a voluntary program under which employers report



      wage information on Form W-3 separately for each of their



      reporting units.



 



  c.  IRS data used for ERUMS were from a Census-edited file



      based on Forms 941 and 943 for Tax Years 1981-83.  These



      forms are used by employers to report each quarter



      (annually for Form 943) to IRS on income taxes withheld



      from wages and other payments to employees and on taxes



      under the Federal Insurance Contributions Act (FICA)



      under the Social Security system.  Extracts of data from



      these forms are provided annually by IRS to the Census



      Bureau for use in the latter's County Business Patterns



      Program and other statistical programs.  The Census



      Bureau edits the files, particularly the industry codes,



      and imputes certain missing data.  This file was made



      available to the IRS Statistics of Income (SOI) Division



      for use in its business employment and payroll studies



      and was used for ERUMS.  In addition, copies of Form 940,



      Federal Unemployment Tax Return, were obtained for a



      substantial proportion of the ERUMS sample cases.
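
     As an illustration only, the three-way match by EIN described in
this section can be sketched in a few lines of Python.  The record
layout (a dictionary with an "ein" key), the function names, and the
EIN values below are assumptions made for this sketch; they are not
the file formats or procedures the Workgroup actually used.

```python
from collections import defaultdict

SOURCES = ("bls", "ssa", "irs")


def link_by_ein(bls_records, ssa_records, irs_records):
    """Group records from the three agency files under their shared EIN.

    Each record is assumed to be a dict with an "ein" key (hypothetical layout).
    """
    linked = defaultdict(lambda: {source: [] for source in SOURCES})
    for source, records in zip(SOURCES, (bls_records, ssa_records, irs_records)):
        for record in records:
            linked[record["ein"]][source].append(record)
    return dict(linked)


def eins_found_in_all(linked):
    """Return the EINs that have at least one record in every source."""
    return sorted(
        ein for ein, group in linked.items()
        if all(group[source] for source in SOURCES)
    )


# Made-up EINs and values, for illustration only.
bls = [{"ein": "000000001", "unit": "A"}, {"ein": "000000001", "unit": "B"}]
ssa = [{"ein": "000000001", "wages": 1000}, {"ein": "000000002", "wages": 500}]
irs = [{"ein": "000000001", "wages": 990}]

linked = link_by_ein(bls, ssa, irs)
full_matches = eins_found_in_all(linked)
```

     Keeping a list of records per source, rather than a single
record, reflects the point made above that one employer may have
several reporting units in one file and only a single record in
another.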



 



 



 



2.  Data Sharing Issues



 



      For the ERUMS Workgroup to gain access to the data sets needed



for the study, it was necessary to develop working arrangements



that complied with the provisions of confidentiality statutes,



regulations, and policies of the Federal and State agencies that



controlled these data sets.



 



 






 



    Although interagency exchange of identifiable microdata was



the key to ERUMS, such data sharing is restricted by Federal



confidentiality laws which generally permit agencies to disclose



statistical information only in summary or other unidentifiable



form.  Since ERUMS was designed to link and compare information



about individual employers collected separately by the different



agencies, the Workgroup had to develop and implement lawful methods



of transferring data on identifiable business units among the



participants.  A related task was to minimize the disclosure of



identifiers in making those transfers and linkages.



 



     The Workgroup was particularly interested in the different



ways an employer may report establishment or multi-unit enterprise



data to various State and Federal agencies.  To examine these



differences, the Workgroup needed to compare employers' reports to



the BLS State UI programs, the SSA FICA reporting, and the IRS



employment tax returns.  Members of the Workgroup included



employees of these agencies, plus employees of the Bureau of



Economic Analysis, Office of Management and Budget, the Bureau of



the Census, and the Committee on National Statistics of the



National Academy of Sciences.



 



     The Workgroup planned to analyze the information that



corresponded to each EIN as it was reported to each agency.  The



analysis and findings would be entirely statistical in nature with



no reference to the individual (identifiable) cases.



Nevertheless, the planning, processing, and analysis phases each



required access to identifiable data.
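
     The distinction drawn here, identifiable records as input and
purely statistical findings as output, can be sketched as follows.
This is a minimal illustration under stated assumptions: the tuple
layout, the choice of wage totals as the compared item, and the
function name are hypothetical and are not taken from the ERUMS
processing itself.

```python
from statistics import mean


def summarize_wage_differences(linked_pairs):
    """Summarize differences between two agencies' wage totals.

    linked_pairs is an iterable of (ein, wages_source_a, wages_source_b)
    tuples (assumed layout).  The EIN is needed to form the pairs but is
    deliberately left out of the returned summary.
    """
    differences = [wages_a - wages_b for _ein, wages_a, wages_b in linked_pairs]
    if not differences:
        raise ValueError("no linked records to summarize")
    return {
        "n_linked": len(differences),
        "n_agree": sum(1 for d in differences if d == 0),
        "mean_difference": mean(differences),
    }


# Made-up values, for illustration only.
pairs = [("000000001", 100, 90), ("000000002", 50, 50)]
summary = summarize_wage_differences(pairs)
```

     The summary carries counts and averages only, so it can be
circulated without reference to any individual case, while the
pairing step that precedes it still requires access to the EIN.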



 



3. Confidentiality of Federal and State Tax Records



 



     In the ERUMS study, the Employer Identification Number (EIN)



was the identifier that was common to all the reporting systems.



It was used to define the sample drawn by BLS and was used as the



basis for retrieving, linking and comparing records containing



information from the SSA and IRS files.  By law, the EIN is a tax



identification number, and even when standing alone is protected by



Internal Revenue Code confidentiality restrictions.



 



     ERUMS required access to data from W-3 records which by law



are Federal tax records that are processed and maintained at SSA in



conjunction with the computation of Social Security retirement



benefits.  Since these are tax records, it was necessary to satisfy



IRS that the selection by SSA of sample cases, SSA's disclosure of



W-3 data to BLS, and the use of employer data by other members of



the Workgroup met the requirements of the Internal Revenue Code



dealing with disclosure of tax information. (See No. 4 below.)



 



     BLS selected Texas as the State whose records it would sample,



and it obtained written permission from the Texas State Employment



Security Agency to use their UI records in the project.  The Texas



 






 



Unemployment Compensation Act requires Texas employers to maintain



records and file reports to the Texas Employment Commission with



detailed information about the business operations and the number



and compensation of employees.  Texas law prohibits disclosure



except for administering the Act, and it makes improper disclosure



punishable by fines or imprisonment.



 



 



4. Other Confidentiality Considerations



 



      Since the Workgroup was composed of employees from several



agencies and organizations, confidentiality laws did not apply to



them uniformly.  In varying degrees, certain laws, regulations, and



policies affected each agency's access to identifiable records from



particular sources and provided differential access to various



individuals in the Workgroup.  A recurring theme was the necessity



at each phase of the process to identify the persons who needed to



use identifiable data and to ensure that no others had access at



that time.



 



      Besides affidavits and other written procedures to protect the



confidentiality of records, certain technical safeguards were



adopted to minimize disclosure risk.  The first of these methods



was to avoid identifying sample cases by EIN to persons who



performed processing in the participating agencies but were not



directly associated with the Workgroup.  This method was adopted to



conform to the Internal Revenue Code requirements for tax



information under the agreement BLS had with the State of Texas.



At BLS this led to a decision not to process the data on the



mainframe computer system at the Department of Labor that is



operated by a private contractor.  Instead, BLS used a mini-



computer which was accessible only to BLS employees who were



members of the Workgroup.



 



      State agencies periodically submit to BLS UI address files



that compile identification data for all reporting units at the



most-detailed level that is available from employers' reports.  BLS



compiles these reports under a pledge of confidentiality that



allows the data to be used only by authorized persons for



statistical purposes.



 



      Once BLS selected the Texas sample, it had to create a finder



list so that SSA could extract corresponding records from its W-3



and related files for employers in the sample.  The technical staff



who performed these operations at SSA have routine access in their



usual jobs to the employer records maintained at SSA.  However,



they did not need to know which of the employers' records comprised



the sample selected by BLS from the Texas UI file.  To avoid



identifying those cases that were actually in the sample, BLS



furnished SSA with a listing of 7 of the 9 digits of sample EINs.



SSA staff then extracted records from the W-3 and related files for



all records in which these 7 digits appeared without knowing which



 






 



 



employers were actually in the BLS sample.  This procedure



effectively masked the identities of sample cases derived from



State UI files, and thus significantly limited the number of SSA



employees who were required to sign BLS non-disclosure affidavits.
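
     The masking procedure just described can be sketched in Python
as below.  This is an illustrative sketch only.  The text does not
say which 7 of the 9 digits were supplied, so the use of the first 7
digits is an assumption, as are the function names, the record layout,
and the EIN values.

```python
def mask_ein(ein, keep=7):
    """Return a partial EIN: the first `keep` digits (assumed positions)."""
    digits = ein.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"not a 9-digit EIN: {ein!r}")
    return digits[:keep]


def build_finder_list(sample_eins):
    """Sampling agency: produce the masked finder list sent to the other agency."""
    return sorted({mask_ein(ein) for ein in sample_eins})


def extract_candidates(agency_records, finder_list):
    """Receiving agency: pull every record whose masked EIN is on the list.

    The extract contains the sample cases plus any other employers that
    share the same 7 digits, so the true sample is not revealed.
    """
    wanted = set(finder_list)
    return [r for r in agency_records if mask_ein(r["ein"]) in wanted]


def keep_sample_cases(candidates, sample_eins):
    """Sampling agency: reduce the returned extract to the actual sample."""
    sample = {ein.replace("-", "") for ein in sample_eins}
    return [r for r in candidates if r["ein"].replace("-", "") in sample]


# Made-up EINs, for illustration only.
sample_eins = ["12-3456789"]
agency_file = [{"ein": "123456789"}, {"ein": "123456700"}, {"ein": "987654321"}]

finder_list = build_finder_list(sample_eins)
candidates = extract_candidates(agency_file, finder_list)
sample_records = keep_sample_cases(candidates, sample_eins)
```

     With two digits withheld, each finder entry can match up to 100
distinct EINs, which is what keeps the processing staff from knowing
which of the extracted employers are in the sample.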



 



 



Agreements for Interagency Data Sharing



 



    Access by the Workgroup to the data sets needed for the study



was accomplished through three interagency agreements plus an



additional access arrangement.



 



     The Workgroup had originally planned a tripartite arrangement



through interagency agreements of SSA and BLS with IRS.  However,



IRS counsel raised objections that such a multi-party agreement



would be unduly cumbersome, and approval would probably not be



forthcoming.  As an alternative, IRS proposed to contract



exclusively with BLS for the performance by BLS of services that



required access to tax data.  SSA staff would be designated as



special agents of BLS to process the data.  Bilateral BLS/IRS and



BLS/SSA agreements would also have to be drafted under this



arrangement.



 



     The drafting of these arrangements proved to be a delicate



task.  By law, the purposes of IRS participation in the project and



its service contract with BLS had to be related to IRS



administration of the tax laws.  Section 6103(n) of the Internal



Revenue Code (IRC) allows IRS to disclose tax return information to



persons outside of the agency as long as it is for purposes of tax



administration [1].  Specifically, this purpose is to conduct



statistical studies based on return information, which Section



6108(a) of the IRC authorizes IRS to perform [2].  A case was made



that the ERUMS study was one such purpose.



 



 



1. BLS and Texas Agreement



 



     BLS has cooperative agreements with 50 State Employment



Security Agencies to use employment statistics collected by the



states for its labor economics research.  The 1982 data used in the



ERUMS study was furnished to BLS in its ES-202 program by the Texas



State Employment Commission under a cooperative agreement.  It was



necessary for BLS to obtain authorization from the State Commission



to use the microdata for the ERUMS study and to provide access for



the Workgroup members.  Under this cooperative agreement, the



access and use of the data were subject to the confidentiality



requirements of the Texas Employment Compensation statute as well



as those set out in the BLS Commissioner's Order No. 2-80.



 



     Each UI program is operated under state law that must conform



to certain minimum federal standards, with reports that enable BLS



to monitor state compliance.  Under the Texas program, each



 






 



employing unit is required to file (and update periodically) a



status report with the Texas Employment Commission, describing the



type of ownership, location, and nature of business.  On a



quarterly basis, employers are required to file detailed reports on



wages and contributions.  Multi-unit employers are asked to file a



voluntary statistical supplement that provides detailed employment,



wage, and contribution reports for each establishment.  The ES-202



reports are compiled by BLS and form the basis for the UI Address



file that BLS maintains.  This is a micro-level employer file that



contains first quarter information for each reporting unit, and the



1982 file provided the Texas sampling frame for the ERUMS sample.



 



    The confidentiality of statistical data collected under the



cooperative agreement is protected by interrelated state and



federal procedures.  At the state level, these UI reports are



collected under the Texas Unemployment Compensation Act which



limits the availability of its UI reports to public employees in



the performance of public duties, except as the Employment



Commission may find necessary in its administration of Texas law.



At the federal level, BLS receives and maintains these confidential



reports under the authority of the BLS Commissioner's Order that



pledges confidentiality and prohibits disclosure except to



authorized persons for statistical purposes.  This Order precludes



any use of identifiable information for non-statistical purposes,



such as investigation or enforcement.



 



    Under this cooperative agreement with the State of Texas, it



was necessary for BLS to obtain permission from the Texas



Commissioner to select employer sample cases and to make



information about them available to BLS and SSA employees in the



ERUMS Workgroup and later to others in the Microdata Access Group.



In addition, BLS procedures establish the confidentiality of the



identities and all information pertaining to employers in the



sample.  Members of the Workgroup who were not BLS employees were



appointed as BLS agents pursuant to another interagency agreement



with BLS.  Like BLS employees, other Workgroup members were



required to sign a Non-Disclosure Affidavit before they would be



given access to the microdata.



 



 



2. IRS And BLS Agreement



 



    The initial draft of the statement of purpose by IRS



representatives was acceptable to IRS counsel since its



justification for sharing of confidential tax information was



defined as for purposes of tax administration, which is permissible



under section 6103(n) of the Internal Revenue Code [1].  However,



the case that was made for IRS tax administration purposes was not



acceptable to other Workgroup participants because they felt that



this did not clearly describe the purposes of the ERUMS project in



general or SSA's role in particular.  In the subsequent draft, care



was taken to define contractual purposes in language that covered



 






 



the statistical purposes of the several participating agencies and



that provided for the exchange of records to create a common pool



of data for a variety of analytical purposes, including those



related to tax administration.



 



    In this agreement, IRS contracted with BLS for the performance



of those parts of the ERUMS project that required access to tax



data, including the wage report information that was to be provided



by SSA.  Under this agreement, SSA staff could be designated as



special agents of BLS to carry out their part of the linkage and



analysis operations.  By law, the purposes of IRS participation in



the project and its service contract with BLS had to be related to



IRS administration of the tax laws.



 



    The terms of a contract between IRS and BLS, which needed to be



acceptable to SSA, enabled BLS to receive tapes containing tax



information from IRS and SSA and to combine them with records in



the UI Address File maintained by BLS.  It imposed strict safeguard



procedures and required BLS to provide IRS with a list of all



persons permitted to see confidential tax return data.  This list



included SSA employees who were required to sign affidavits as



agents of BLS.



 



 



3. BLS and SSA Agreement



 



    The third agreement was a Conditions of Use agreement between



BLS and SSA which enabled SSA to release data from its employer



files to BLS and authorized BLS to link data from these files to



data in the UI Address File and data to be furnished by IRS.  Like



the IRS/BLS agreement, it limited access at each stage of the



project to those persons who needed to use identifiable data, kept



the number of such persons to a minimum, and required adequate



physical security procedures.  This agreement, which needed to be



acceptable to IRS, enabled BLS to use SSA files for the ERUMS



project.  Under this agreement, SSA would furnish BLS with SSA's



Single Unit Code File, Multi Unit Code File, and Employer Report



(W-3) Record.  The agreement authorized BLS to link data from these



statistical files with data in the BLS Unemployment Insurance



Address File and with data to be furnished by IRS, and prohibited



any other linkage.



 



 



4. Microdata Access Group



 



    In the planning and matching stages of the project, the



persons who needed to have access to microdata were those members



of the Workgroup who were performing the record matching and



verification.  At Workgroup meetings, members generally reviewed



data in the form of frequencies and other summaries to track the



progress of the matching operations and to plan future steps.



Occasionally, discrepancies appeared or questions arose concerning



 






 



classification of a particular employer or possible mis-match of



data.  Those matters were usually referred to particular members to



resolve, with access to microdata as needed on an ad hoc basis.



 



     When the matching steps were completed and time came to plan



the analysis, new arrangements were needed to enable a different



group of persons to examine identifiable microdata.  The Microdata



Access Group (MAG) was formed for this purpose.  At this point, IRS



agreed that its contractor, BLS, would be permitted to make



Workgroup members its agents as needed for the analysis stage.



This enabled the Workgroup members who were employees of BEA and



the Committee on National Statistics to become sworn agents who,



like the employees of BLS and SSA, would be permitted to examine



and analyze microdata.  Thus, of the three agencies sharing



microdata (BLS, SSA, and IRS), IRS was the only one that did not



have access to the matched microdata file.  This group met



periodically to plan and perform the analysis, prepare findings,



and to report its findings back to the full Workgroup.



 



      Once the terms of all contracts were agreed upon, the



contracts and the conditions of use agreement were signed by



officials of the participating agencies, and the way was cleared



for the data transfers.



 



 



Summary and Conclusions



 



      To say that the process of discussion and negotiation leading



to the signing of the ERUMS access agreements was painstaking,



sensitive, and costly in terms of staff time and delay in the



study's completion is an understatement.  The disclosure aspects of



the study severely tested the will and resolve of the affected



agencies.



 



   In retrospect, the signing of interagency agreements between IRS



and BLS and between SSA and BLS documented a process of negotiation



by which the study plan was adapted to the requirements of the



various confidentiality laws that impinged on it.  In addition, it



summarized a process in which a combination of technical and



procedural safeguards was fitted to meet the requirements of the



Federal and State agencies that were involved in the data sharing.



 



      While the participants in the ERUMS study all feel a certain



degree of accomplishment due to their collective persistence, none



are quite so upbeat about the long duration of the study.  Clearly,



the long incubation period for the interagency data sharing



agreements was a major contributor.  However, it is important to



recognize that the prolonged negotiation for interagency agreements



did not result from lack of cooperation among the participants.  On



the contrary, it reflected the complex mosaic of legal restrictions



on use and interagency dissemination of records.



 



 






 



     Once it became evident that a single multi-party agreement



would be unworkable for the overall project, the plan was broken



down into component steps of disclosure, record linkage, and



analysis.  Each failure to reach an agreement required a step back



to re-examine the study imperatives and to adapt the procedures to



the practical and legal necessities at each stage.



 



     In addition to adding to the overall time and resources



consumed by the project, these delays led to further problems,



including:



 



 1.  Personnel turnover among the project participants due to



     the extended length of the project's schedule



     necessitated slower progress on the technical issues.



 



 2.  The acquisition of IRS Form 940 data was adversely



     impacted since these have a 5-year retention period and were



     scheduled for destruction by the time the sample EINs



     were determined.



 



     On the positive side, however, ERUMS demonstrated that such



data sharing projects can be successful under current laws if there



is creativity, flexibility, and most of all, persistence.



 



Notes and References



 



[1] Section 6103(n) of the Internal Revenue Code (IRC) allows for



the provision of confidential tax return information for purposes



of tax administration.  Specifically, it reads:



 



     "Certain Other Persons.  -- Pursuant to regulations



     prescribed by the Secretary, returns and return



     information may be disclosed to any person, including any



     person described in Section 7513 (a), to the extent



     necessary in connection with the processing, storage,



     transmission, and reproduction of such returns and return



     information, and the programming, maintenance, repair,



     testing, and procurement of equipment, for purposes of



     tax administration."



 



[2] Section 6108 of the IRC has three parts which call for the



publication of statistical compilation of tax return information at



regular intervals, but, unlike Section 6103(n), such information



cannot identify a particular taxpayer.  This Section is the primary



"mandate" for IRS' Statistics of Income (SOI) program.



 



 a) Publication or other Disclosure of Statistics of Income.



     -- The Secretary shall prepare and publish not less than



     annually statistics reasonably available with respect to



     the operations of the internal revenue laws, including



     classifications of taxpayers and of income, the amounts



 






 



 



     claimed or allowed as deductions, exemptions, and



     credits, and any other facts deemed pertinent and



     valuable.



 



  b) Special statistical Studies.  -- The Secretary may, upon



     written request by any party or parties, make special



     statistical studies and compilations involving return



     information (as defined in section 6103 (b)(2)) and



     furnish to such party or parties transcripts of any such



     special statistical study or compilation.  A reasonable



     fee may be prescribed for the cost of the work or



     services performed for such party or parties.



 



  c) Anonymous Form.  -- No publication or other disclosure of



     statistics or other information required or authorized by



     subsection (a) or special statistical study authorized by



     subsection (b) shall in any manner permit the statistics,



     study, or any information so published, furnished, or



     otherwise disclosed to be associated with, or otherwise



     identify, directly or indirectly, a particular taxpayer.



 



     Section 6108(a) has been interpreted as a tax administration



purpose for the Statistics of Income (SOI) Program (unlike 6108(b)



and 6108(c)); hence, if a 6108(a) study requires the use of



"outsiders", then a 6103(n) contract can be initiated as was done



for the ERUMS study.



 



 



 



 



 



 



 



 






 



     SAMPLE SELECTION AND MATCHING PROCEDURES USED IN ERUMS



 



                              John Pinkos



                           Kenneth LeVasseur



                            Marlene Einstein



                   U. S. Bureau of Labor Statistics



 



                              Joel Packman



                  Social Security Administration



 



 Introduction



 



     The first paper in this session described the experience with



 developing interagency agreements, and the third described the findings



 resulting from the study, while this one describes the sample



 selection and matching procedures used.



 



     In addition to describing the sample selection and matching



 procedures, the following will explain what the ERUMS Workgroup



 considered when developing the project design.  This paper also



 describes the sampling frames, data, and manual matching conducted



 by the ERUMS Workgroup.



 



     The ERUMS project was a pilot study, designed to develop and



 test procedures for linking and comparing employer and reporting



 unit data from different administrative record systems.  The study



 from its inception was exploratory in nature, and the ERUMS



 Workgroup members hoped to observe and document the similarities



 and differences discovered between the records in the systems being



 studied and, thus, between the systems themselves.



 



     The scope of the project included employer reporting unit data



 from the Bureau of Labor Statistics and Social Security Administration



 employer data files which have similar coverage.  Internal Revenue



 Service data, which were edited by the Bureau of the Census, were



 used to assist in the analysis of the sample.  The ERUMS committee



 members included staff from Office of Management and Budget (OMB),



 Bureau of Labor Statistics (BLS), Social Security Administration



 (SSA), Bureau of Economic Analysis (BEA), Internal Revenue Service



 (IRS), Census and the Committee on National Statistics (CNS).



 Developing the sample design, selecting the sample, and performing



 the machine and manual match were conducted by SSA and BLS staff



 who were cleared to work with the confidential data.  To conduct



 the final analysis of the data this group was later expanded to  



 include staff from BEA and the CNS.



 



     






 



     There are two reasons for providing an account of the ERUMS



sample selection and matching procedures.  The obvious reason is



that the results, like those of any research study, are dependent



on the procedures used, and anyone interested in the results is



entitled to a full description of how the study was carried out.



The other reason, equally or perhaps more important, is that ERUMS



was a venture into uncharted territory, and we believe that future



projects of this kind will benefit from the availability of a



detailed road map of the procedures that were developed to match



and compare employer and reporting unit records from BLS, SSA, and



IRS for statistical purposes.



 



Sample Design Considerations



 



     A major design consideration affecting the size and scope of



the project was the limited staff time and resources each of the



participating agencies was able to contribute.  The committee



realized from the beginning that the meat of the project would be in



the manual review of the reporting units from each of the



administrative record systems.  To keep the workload manageable,



the Workgroup decided to limit the study to one State rather than



several.  It was also decided that this State should be large and



be one which could share its data with federal statistical agencies



for research purposes.  The State selected was Texas.



 



     Probability sampling was used at all stages of selection and



provided two benefits.  It ensured that sample results could be



used to produce unbiased estimates for the study population, and it



made possible estimation of sampling errors.  Additionally, the



Workgroup felt it would be useful for both analytical and



methodological purposes to produce weighted estimates.



Consideration was given to designing a baseline sample where a



sample from one agency (e.g., BLS) would be drawn and then a search



for the selected sample members would be conducted on the other



agency's files (e.g., SSA).  This approach would provide matched



units on both files as well as those on the BLS file but not the



SSA file.  This method, however, would not identify those units on



the SSA file but not on the BLS file.



 



     The baseline sample approach was abandoned and it was decided



that samples would be selected in two stages.  The stage one sample



was an equal probability sample of the population which was then



stratified by match status.  The stage two sample was a systematic



subsampling from these strata.  This method of sampling provided a



means for oversampling selected types of records which were of



more interest to the project and it also resulted in a manageable



sample size.  As a final design consideration, the committee wanted



to ensure that records from both SSA and BLS had an equal chance of



selection.  Additionally, the Committee wanted to develop an



approach that would minimize the number of computer searches



 






 



required to select the sample and relevant data elements from these



large administrative record files.



 



    The sample design used was one that selected separate samples



from the BLS and SSA files using the same set of random pairs of



numbers.  The purpose of this design was to measure overlap between



the two frames and, more importantly, to measure the amount of



nonoverlap between the two frames.  The nonoverlap included those



sample members on one frame but not the other.  This design also



minimized the computer costs and allowed the committee to select



the sample in one pass through each agency's data file.  Once the



sample was selected, the relevant data elements for each sample



member were downloaded to a micro computer.



 



Sampling Frames



 



    Both the SSA and BLS data files are compilations of



administrative tax records.  The SSA data file includes data from



employer W-2 and W-3 wage reports, whereas the BLS file includes



data from employers' State Unemployment Insurance tax reports.  The



identifying data element common to both the SSA and BLS files and



assigned from a single source is the Employer's Identification



Number, or EIN.  The EIN is a unique 9-digit number assigned to



companies by IRS and is used to track federal tax payments.



 



    When companies pay State Unemployment Insurance Taxes the



State assigns an Unemployment Insurance (UI) Tax number to track



payment.  Since companies are given a federal tax credit for State



UI taxes, they provide their EIN to the State UI tax department.



On an annual basis IRS provides each State UI tax department with



a file of all the EINs registered in the State.  The UI tax



department then reconciles the amount of State UI taxes paid by



each employer against the IRS file of EINs and tax credits claimed



by each employer.  By definition, all companies on the SSA files



should have an EIN reported, because this is what is required for



an employer to be included on the file.  On the BLS State file a few



units did not have an EIN reported since only a State UI tax number



is required for an employer to be included on that file.  The first



quarter 1982 Texas file had EINs reported for 98.7 percent of all



reporting units.



 



    The sampling frame for BLS was all the EINs reported in the



Texas first quarter 1982 U.I. Name and Address File.  The sampling



frame for SSA was all the EINs reported in the Single Unit or Multi



Unit Code file with wage reports for calendar year 1982.  The SSA



files are continuous files linked over time, whereas the BLS file



in 1982 was a snapshot of one calendar quarter.  Effective with



first quarter 1989 data, the BLS began linking data quarterly and



now has a continuous data file.



 



 






 



     The sampling rate was determined by the Workgroup's decision



that 400 EINs would be a manageable sample size and that about



one-half of the sample should have EINs classified as multis, or



companies with multiple locations.  EINs classified as multis were



of particular interest because there is more variation in reporting



practices.



 



     To derive the sampling rate, the committee looked at the first



quarter 1982 Texas file, which had 267,487 EINs classified as



single units and 3,125 EINs classified as multi units.  A sampling



rate of 6 in 100 was selected since it provided approximately 188



EINs that were multi units.
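The arithmetic behind the 6-in-100 rate can be checked with a short sketch.  The frame counts are the ones quoted above; the function and constant names are illustrative only and do not come from the study.

```python
# Frame counts quoted in the text for the first quarter 1982 Texas file.
SINGLE_UNIT_EINS = 267_487
MULTI_UNIT_EINS = 3_125

# Stage one rate: 6 of the 100 possible two-digit pairs.
RATE_NUMERATOR = 6
RATE_DENOMINATOR = 100


def expected_count(frame_size):
    """Expected number of EINs selected from a frame at 6 in 100."""
    return frame_size * RATE_NUMERATOR / RATE_DENOMINATOR


expected_multis = expected_count(MULTI_UNIT_EINS)      # 187.5, about 188
expected_singles = expected_count(SINGLE_UNIT_EINS)    # about 16,049
```

At this rate the multi unit EINs alone fall just short of the 200 or so needed for one-half of a 400-EIN sample, which is consistent with the later decision to retain some second stage strata in full.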



 



     As previously mentioned, it was decided to select a two-stage



sample.  The first was an equal probability sample of the



population.  This first-stage sample was selected from all EINs



that had 1 of 6 random pairs of numbers in positions 7 and 8 of the



EIN.  The sampling rate of 6 in 100, when applied to both the BLS



and SSA frames, provided a combined stage one sample of 19,964 EINs.



The stage one sample was then machine matched and each EIN was



assigned a status classification.  The initial status



classifications are shown below:



 



 



     Table A



 



                           MATCH STATUS IN:



 



     Group                        BLS                        SSA



 



        1                        Single                     Single



        2                        Single                     Inactive



        3                        Inactive                   Single



        4                        Multi                      Single



        5                        Single                     Multi



        6                        Multi                      Inactive



        7                        Inactive                   Multi



        8                        Multi                      Multi



 



 



EINs that were inactive in both systems obviously had no chance of



entering the ERUMS sample.
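The first-stage selection and the Table A classification can be sketched as follows.  This is a minimal illustration, not the Workgroup's program.  The six digit pairs are placeholders, since the pairs actually drawn are not given here.  It also assumes that an EIN is held as a 9-digit string and that an EIN absent from a frame's lookup is inactive on that frame.

```python
# Hypothetical digit pairs: the six pairs actually drawn are not given in the paper.
SELECTED_PAIRS = {"07", "23", "41", "58", "64", "92"}


def in_stage_one(ein, pairs=SELECTED_PAIRS):
    """True if positions 7 and 8 (1-based) of a 9-digit EIN form a selected pair."""
    digits = ein.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"not a 9-digit EIN: {ein!r}")
    return digits[6:8] in pairs


# Table A: (BLS status, SSA status) -> group number.
TABLE_A = {
    ("single", "single"): 1,
    ("single", "inactive"): 2,
    ("inactive", "single"): 3,
    ("multi", "single"): 4,
    ("single", "multi"): 5,
    ("multi", "inactive"): 6,
    ("inactive", "multi"): 7,
    ("multi", "multi"): 8,
}


def match_groups(bls, ssa, pairs=SELECTED_PAIRS):
    """Assign a Table A group to each sampled EIN.

    bls and ssa map EIN -> "single" or "multi" for the active records on
    each frame; an EIN missing from a mapping is treated as inactive there
    (an assumption made for this sketch).
    """
    sampled = {e for e in set(bls) | set(ssa) if in_stage_one(e, pairs)}
    return {
        e: TABLE_A[(bls.get(e, "inactive"), ssa.get(e, "inactive"))]
        for e in sampled
    }
```

Because the same pairs are applied to both frames, an EIN selected from one frame is also selected from the other whenever it appears there, so a single pass through each file is enough to measure both overlap and nonoverlap.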



 



     Another view of the status classifications is shown in



attachment A, which is a 3x3 grid having the classifications single,



multi, and No Wage Report (NWR) on each scale for both the BLS and



SSA files.  Records with no wage reports on the SSA file were



considered inactive.  The bottom right cell on the grid is not



applicable since these would be records that did not exist on



either file.



 



 



 






 



     Based upon the interest of the Workgroup, three of the basic



classifications or cells were subdivided and are shown as the



shaded sectors on the 3x3 grid (see attachment A).  County and SIC



became matching criteria for those EINs that were single on both



files.  The number of reporting units became a criterion for those



EINs that were multis on the BLS file but were single on the SSA file



and those EINs that were multis on both.



 



     These eleven match status classifications became the strata



used for the second stage sample.  The second stage sample



selection had equal probability within each stratum.  The sampling



rates used varied by stratum, from selecting all to selecting 1 in



173.78.  Given the exploratory nature of ERUMS, the intent of the



Workgroup was to pull a larger sample of EINs classified as multis



and nonmatched records.  These cases were expected to present more



difficulties.  Therefore, the Workgroup wanted to have enough of



these cases to learn what the situations were and to test methods



of dealing with them.  The final sample contained 401 EINs,



including 201 classified as having multi units on either the BLS or



the SSA files.  The remaining 200 EINs were those not classified as



multis on either the BLS or SSA files.
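The second-stage subsampling described above can be sketched as a systematic selection within each stratum.  This is an illustration only.  The paper does not give the ordering of records within strata or the rate for each stratum, so the list order is taken as given and the intervals are supplied by the caller.  The weight shown is simply the inverse of the overall selection probability.

```python
import math
import random

STAGE_ONE_WEIGHT = 100 / 6  # inverse of the 6-in-100 stage one rate


def systematic_sample(units, interval, rng=None):
    """Equal probability systematic sample of 1 in `interval` from a list.

    `interval` may be fractional (the text quotes rates as fine as 1 in
    173.78); an interval of 1 selects every unit.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    rng = rng or random.Random()
    position = rng.random() * interval     # random start in [0, interval)
    picks = []
    while position < len(units):
        picks.append(units[math.floor(position)])
        position += interval
    return picks


def subsample_strata(strata, intervals, rng=None):
    """Subsample each stratum and attach an overall weight to each pick.

    strata maps stratum label -> list of stage one EINs; intervals maps
    stratum label -> sampling interval for that stratum.
    """
    result = {}
    for label, units in strata.items():
        k = intervals[label]
        weight = STAGE_ONE_WEIGHT * k      # inverse of overall selection probability
        result[label] = [(ein, weight) for ein in systematic_sample(units, k, rng)]
    return result
```

Varying the interval by stratum is what allows the multi unit and nonmatched strata to be oversampled while weighted estimates for the study population remain available.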



 



     Once the sub sample was selected, the Workgroup began the



review and analysis phase, which included labor-intensive manual



matching.



 



     The working group reviewed reported employment and SIC and



geographic codes for each of the 401 EINs.  To assist in this



process, the Workgroup made arrangements to have access to IRS data



for tax years 1981 through 1983.  Data for 385 of the 401 EINs were



made available.



 



     During the review process the Workgroup attempted to uncover



the reasons why records did not match or why records were on one



file but not the other.  In this process of looking very closely at



the actual records from each agency, the Workgroup learned much



about the two systems and found reasons to reclassify some of the



records which affected the final match status.  For example, in the



area of multiunits, the BLS system defines multis as companies with



multiple locations within the same State whereas the SSA system



defines multis as companies that have multiple locations in the



United States.  During the review of the multi-unit records,



employment levels were considered and attempts were made to



reconcile differences in reporting units by aggregating employment



of the individual multi units to the EIN level.  As a result of



this review, the Workgroup decided not to use employment as a match



criterion.  It was also decided that for purposes of this study, a



multi unit EIN would be an EIN that had multiple locations within



the State of Texas.  This reduced the number of SSA multi unit EINs



in the final sample from 120 to 10.  The remaining 110 records were



reclassified as single EINs.



 



 






 



     As noted, the Workgroup also compared SIC and geographic codes



from both files.  SIC codes were first examined to see why there



were non-matches at the four-digit SIC level.  In some cases, the



non matched EINs were assigned SIC codes in related industries; in



other cases, the industry code reflected a larger aggregation of



the reporting unit.  Another, and perhaps more important, factor that



accounted for differences at the 4-digit level was that both BLS and



SSA have policies for SIC coding exceptions.
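One way to picture the SIC comparison is to count the leading digits on which two codes agree, treating a shorter code as one assigned under a coding exception.  The sketch below is illustrative only.  The labels and the two-digit threshold for "related industries" are assumptions made for the example and are not the Workgroup's rules.

```python
def sic_match_level(code_a, code_b):
    """Number of leading SIC digits on which two codes agree (0 to 4).

    Codes are strings of 1 to 4 digits; a shorter code stands for a unit
    coded only to that level under a coding exception.
    """
    level = 0
    for a, b in zip(code_a, code_b):
        if a != b:
            break
        level += 1
    return level


def classify_sic_pair(code_a, code_b):
    """Label a pair of codes by how the comparison turns out."""
    shared = min(len(code_a), len(code_b))
    level = sic_match_level(code_a, code_b)
    if level == 4:
        return "match at 4 digits"
    if level == shared and shared < 4:
        return f"consistent at the {shared}-digit level (coding exception)"
    if level >= 2:  # threshold chosen for illustration only
        return f"related industries (agree on {level} digits)"
    return "non-match"
```

Under this view a 3-digit code on one file and a 4-digit code on the other are not a true disagreement when the first three digits coincide, even though a strict 4-digit comparison would count them as a non-match.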



 



     The BLS in 1982 had 11 exceptions to 4-digit SIC coding which



meant a 3-digit SIC code was assigned in certain industries in lieu



of the 4-digit SIC code.  This represented 43 4-digit industries.



These are industries which either have a significant amount of



overlapping in their industrial activities or are industries that



historically had been difficult to collect sufficient information



from to assign a 4-digit SIC.  The BLS currently has reduced the



number of 4-digit coding exceptions to 6, which represents 17



4-digit SIC industries.



 



     The SSA SIC coding exceptions exist in some agricultural



industries and Public Administration, which are coded to the



1-digit level.  This affected 64 4-digit industries.  Approximately 63



other 4-digit industries were coded at the 3-digit level for one



reason or another, typically insufficient information.  In addition



to reviewing SIC codes, the Workgroup also looked at geographic



codes and tried to explain why some records did not match between



files.  Maps and coding manuals were consulted and the review



showed there was some inherent misreporting of county codes by



employers.  Texas has more than 37 cities with the same name as a



county but these cities are not located in those counties.



Houston, for example, is in Harris County, not Houston County, and



Austin is in Travis County, not Austin County.  Counties named



Houston and Austin are located elsewhere in the State.  In some



cases the reason for non matching records was that the reporting



unit was coded in an adjacent county.  Texas has a very large



number (254) of counties.  For those employers who keep their



records by city or are not familiar with the county names, it is



easy to see the potential for some misreporting.



 



     The Workgroup also looked very closely at the cases having



inactive EINs on either the BLS or SSA files.  Inactive EINs for



the BLS were defined as those that appeared on the SSA file but did



not appear on the BLS file.  Inactive EINs for SSA were defined as



those on the SSA file with no wage reports for 1982.



 



     When reviewing the BLS inactive EINs, the Workgroup used SSA



SIC and employment data to determine if the employer was exempt



from Unemployment Insurance coverage.  They also looked at IRS data



to determine if the employer became active after the first quarter



of 1982 and at the first quarter 1983 Texas file to see if the



employer reported in 1983.



 



 






 



     When reviewing the SSA inactive EINs, the Workgroup was able



to use a more nearly complete SSA wage report file that included



wage reports that were either delinquent when the sample was



selected or were in the process of reconciliation with IRS.  As a



result of these additional data, 44 of the 99 EINs originally



classified as inactive on the SSA file were determined to be



active.  The Workgroup also used the BLS 1982 and 1983 first



quarter Texas files to conduct name searches to see if the same



employer reported under a different EIN.  The Texas files were



also used to see whether zero employment was reported, which might



have indicated no wages were paid.  Additionally, IRS data were



then used to see what level of employment was reported to IRS.



 



     The last step in the review and analysis phase was to



determine the final match status of the 401 EINs.  As a result of



the review, it was decided to collapse the 11 categories shown in



Attachment A down to the basic 8 cells shown in Table A.



 



     As part of the final analysis, committee members worked on



completing the documentation for the project and discovered that an



additional 2,608 EINs that were on the SSA file but not the BLS



file were inadvertently omitted from the first stage sample and,



consequently, from the second stage.  Adding cases to the stage 1



and 2 samples at that point in time would have further delayed



completion of the study, so the Workgroup decided the best way to



deal with this problem was to reweight the sample cases in the two



affected strata and rerun the results tables.
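The adjustment method is not spelled out here, so the following is only a sketch of one common approach, a ratio adjustment within each affected stratum.  It assumes the omitted cases resemble the sampled cases in the same stratum.  The figures in the example are hypothetical, since the split of the 2,608 EINs between the two strata is not given.

```python
def reweight(weights, stage_one_count, omitted_count):
    """Scale a stratum's weights so they also represent omitted cases.

    Assumes the omitted cases resemble those that were sampled, so each
    weight is multiplied by (counted + omitted) / counted.
    """
    if stage_one_count <= 0:
        raise ValueError("stage_one_count must be positive")
    factor = (stage_one_count + omitted_count) / stage_one_count
    return [w * factor for w in weights]


# Hypothetical figures for one affected stratum (not taken from the paper).
old_weights = [50.0, 50.0, 50.0]
new_weights = reweight(old_weights, stage_one_count=1000, omitted_count=500)
```

An adjustment of this kind leaves the sampled cases unchanged and alters only the weights, which fits the decision not to add cases to either stage.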



 



 



 



 



 



 



 



 






 



 



 



 



                           MATCH STATUS CLASSIFICATIONS



[Attachment A: 3x3 grid of match status classifications; graphic not reproduced]



 



 



  KEY: NWR = No Wage Report



       SIC = Standard Industrial Classification



        RU = Reporting Units



 






 



 



  RESULTS, FINDINGS, AND RECOMMENDATIONS OF THE ERUMS PROJECT



 



                           Vern Renshaw



                   Bureau of Economic Analysis



 



                            Tom Jabine



                       Statistical Consultant



 



     The other papers in this session have examined the



administrative arrangements and the sample selection and matching



procedures for the Employer Reporting Unit Match Study (ERUMS).



This paper reviews the study's results, findings, and



recommendations.



 



     The main purpose of the ERUMS project was to provide



information on the technical and administrative feasibility of



interagency record linkages.  However, the ERUMS Workgroup hoped



that the study would also shed some light on at least three areas



of substantive concern.



 



 1)  We hoped that geographic and industry information for



     reporting units contained in the Bureau of Labor



     Statistics (BLS) Unemployment Insurance (UI) Address File



     could help evaluate the potential statistical usefulness



     of a) reporting unit data supplied by multi unit



     employers participating in the Social Security



     Administration (SSA) Establishment Reporting Plan (ERP)



     for forms W-2 and W-3; and b) State data supplied to the



     Internal Revenue Service (IRS) on Form 940.  SSA has been



     concerned about the quality of its reporting unit data



     because resources for maintaining the ERP had been



     inadequate for some time and the State data supplied on



     IRS Form 940 had never been used for statistical



     purposes.



 



 2)  We hoped that information from IRS and SSA files could



     help evaluate the completeness of employer coverage in



     the UI Address File.  The UI Address File leaves out or



     estimates employer information that is not received by



     its statistical deadline, whereas information for late



     reports was generally available in the IRS and SSA files



     used for ERUMS.



 



 3)  We hoped that the analysis of matched records could help



     evaluate the consistency of industry and geographic



     coding in the BLS, IRS, and SSA systems.



 



     The extent to which the ERUMS project could actually shed



light on these areas was limited by several factors.  First, ERUMS



was a pilot study based on a small sample drawn from a single State



(Texas) for a single year (1982).  The results, therefore, could



 






 



not be expected to reflect precisely the status of the data systems



for the entire country or for subsequent years.  (BLS has taken



steps to improve the UI Address File since 1982.)  Second, both the



information content and processing procedures differed somewhat



among the data systems.  The W-2/W-3 data were for calendar 1982,



for example, while the UI Address File that was used contained data



only for the first quarter of 1982.



 



    Finally, a number of unanticipated problems were encountered



in carrying out the study.  The most limiting of these problems



resulted from the slow implementation of ERUMS.  For example, by



the time the final sample of employers was selected, many IRS Form



940s for 1982 had been destroyed.  Therefore, it was not possible



to evaluate the State data contained on the Form 940s.



 



    Another unanticipated difficulty arose because the initial SSA



files used in the matching process omitted some wage reports and



were generally inadequate to determine if employers were actually



reporting multiple units in Texas.  These initial files were later



supplemented with more complete information, but the



supplementation occurred after the final sample had been drawn;



consequently the size of the sample was smaller than intended for



some categories of employers, especially for multi unit employers.



 



    In addition, it proved to be more difficult than had been



anticipated to account for differences in employer coverage among



the data files.  In part, this was because estimated data were not



identified in the UI Address File (a deficiency being corrected)



and because there was no documentation of such phenomena as dates



when employment started for employers (or ended, or was changed by



reorganization, etc.) or dates when forms filed by employers were



received by the processing agencies.



 



    The clearest conclusion to emerge from the ERUMS project



related to the poor quality of SSA's ERP data for multi unit



employers.  It was evident that SSA would need to take steps to



improve quality control if the SSA system were ever to be useful



for developing data by geographic and industry classification.  The



other findings of the ERUMS project were not so stark as those



relating to the poor quality of SSA's establishment data, but the



study could well reinforce the concerns of those who worry about



the inconsistencies in industry coding that occur when employers



are coded independently by different agencies.



 



    In the following sections of the paper the results,



limitations, findings, and recommendations of the ERUMS project are



discussed in somewhat greater detail.  Tables A-1 to A-8, which are



referred to in the next two sections, appear in Chapter III of the



ERUMS final report (Statistical Policy Working Paper 16).  In order



to meet space limitations, we have included only Table A-4 with



this paper.



 






 



 



 Results 



 



     As explained in detail by Pinkos et al. in the second paper of



 this session, the ERUMS sample was a two-phase sample of employers,



 as defined by unique Employer Identification Numbers (EINs).  Most



 of the results presented in this paper are estimates based on the



 Phase II sample of 401 EINs, weighted to account for the



 disproportionate sampling used in the second phase of the sample



 selection.



 



     Of the Texas EINs that were active in 1982 in the BLS or SSA



 systems, 67.1 percent were active in both systems, 27.6 percent



 were active only in the SSA system and 5.3 percent were active only



 in the BLS system (Table A-1).  Only about 1.0 percent of all



 active EINs were classified as multi unit in one or both systems,



 and most of these were classified as multi unit only in the BLS



 system (Table A-4).
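
     The kind of weighted tabulation behind these match-status
 percentages can be sketched as follows.  This is a minimal
 illustration, not the procedure actually used for ERUMS:  the
 function names, the record layout, and the sample weights are all
 hypothetical, and each weight is assumed to be the inverse of a
 record's overall selection probability.

```python
from collections import defaultdict


def match_status(in_bls: bool, in_ssa: bool) -> str:
    """Classify an EIN by the system(s) in which it was active."""
    if in_bls and in_ssa:
        return "both"
    if in_ssa:
        return "ssa_only"
    if in_bls:
        return "bls_only"
    raise ValueError("EIN must be active in at least one system")


def weighted_percents(records):
    """records: iterable of (in_bls, in_ssa, weight) tuples.

    Returns the weighted percent of EINs in each match-status category.
    """
    totals = defaultdict(float)
    for in_bls, in_ssa, weight in records:
        totals[match_status(in_bls, in_ssa)] += weight
    grand_total = sum(totals.values())
    return {status: 100.0 * t / grand_total for status, t in totals.items()}


# Hypothetical sample records and weights (not ERUMS data).
sample = [
    (True, True, 10.0),
    (True, True, 30.0),
    (False, True, 40.0),
    (True, False, 20.0),
]
percents = weighted_percents(sample)
```

     Because the second-phase sampling was disproportionate, an
 unweighted count of sample cases in each category would not
 estimate the population percentages; the weights restore each
 category to its share of the population.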



 



     For the matched single unit EINs, i.e., those that were active



 in both systems, an estimated 81.6 percent had the same State and



 county codes in both systems.  The remaining cases were about



 equally distributed in three categories:  same State, different



 county; same State with no county code in the SSA file; and



 different State (Table A-5).  An estimated 70.2 percent of the



 matched single unit cases had the same two-digit industry codes.



 About half of the remaining cases were not classified by industry



 in the SSA system (Table A-5).  When matched against the



 IRS/Census-edited Form 941/943 file, about three-fourths of the



 matched single units from both the BLS and SSA files had two-digit



 industry codes that agreed with those in the IRS/Census file.



 However, when the SSA unclassified cases were excluded from this



 comparison, the proportion of SSA cases that agreed with the



 IRS/Census two-digit code was somewhat greater than the



 corresponding proportion for the BLS matched single unit cases



 (Table A-8).
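
     The effect of excluding unclassified cases from a code
 comparison can be illustrated with a short sketch.  It assumes,
 hypothetically, that industry codes are stored as digit strings
 whose first two digits give the two-digit group, and that an
 unclassified case is represented by a missing code; the function
 names and the example codes are illustrative only and are not
 taken from the ERUMS files.

```python
def two_digit(code):
    """Return the first two digits of an industry code, or None if unclassified."""
    if code is None or not str(code).strip():
        return None
    return str(code).strip()[:2]


def agreement_rate(pairs, exclude_unclassified=False):
    """pairs: iterable of (code_a, code_b, weight) tuples.

    Returns the weighted percent of cases whose two-digit codes agree.
    Unclassified cases count as disagreements unless they are excluded.
    """
    agree = 0.0
    total = 0.0
    for code_a, code_b, weight in pairs:
        a, b = two_digit(code_a), two_digit(code_b)
        if exclude_unclassified and (a is None or b is None):
            continue
        total += weight
        if a is not None and a == b:
            agree += weight
    return 100.0 * agree / total if total else float("nan")


# Hypothetical code pairs with equal weights (not ERUMS data).
pairs = [
    ("5411", "5412", 1.0),  # same two-digit group
    ("2011", "2099", 1.0),  # same two-digit group
    ("7311", None, 1.0),    # unclassified in the second system
    ("1521", "1711", 1.0),  # different two-digit groups
]
rate_all = agreement_rate(pairs)
rate_classified = agreement_rate(pairs, exclude_unclassified=True)
```

     In this made-up example the agreement rate rises once the
 unclassified case is dropped from the base, which is the same
 direction of change described above for the SSA comparison.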



 



      Only a few EINs (nine sample cases) were classified as multi



 unit in both the BLS and SSA systems.  Matching individual



 reporting units for these cases proved to be difficult.  Overall,



 the nine sample employers had 105 Texas reporting units in the BLS



 system and 60 in the SSA system for 1982.



 



      Of the active SSA EINs not found in BLS's first quarter 1982



 UI Address File, it was estimated that 69.2 percent had reported no



 first quarter employment to IRS on Form 941 and therefore would not



 normally be expected to appear in the BLS system (Table A-6).  For



 another 10 percent of these employers, the analysis suggested that



 they may not have met requirements for UI coverage in Texas either



 because they had no operations in Texas, because of nonprofit



 status or because their payrolls were too small.  For the remaining



 20 percent, the reasons for their absence are not always clear, but



 






 



it may have resulted in part from lags in incorporating new



 employers in the UI State agency and BLS files.



 



     Most of the employers who were included in the 1982 UI Address



 File but did not file 1982 W-2/W-3 wage reports (22 sample cases)



 appeared to have ceased hiring employees, gone out of business, or



 gone through other changes that altered their reporting to IRS and



 SSA.  Half of the employers in this group reported no employment in



 the 1982 UI Address File.  Many of the remainder had filed their



 final Form 941 with IRS (at least for the period 1981-1983) for a



 quarter in 1981.



 



      An analysis of the sample EINs that appeared in SSA's Multi



 Unit Code File provided some indication of the extent to which



 multi unit employers were participating in SSA's Establishment



 Reporting Plan (ERP) in 1982 (Table A-7).  An estimated 35.9



 percent of these EINs had been incorrectly added to the Multi Unit



 Code File as the result of a processing error that has since been



 corrected.  Most of the remaining employers had initially agreed to



 participate in the ERP, but more than half of this group did not



 provide separate data for each reporting unit in their W-3 wage



 reports for 1982.



 



 Limitations



 



      Several factors limit the broad applicability of the ERUMS



 findings.  The results reflect the reporting requirements and



 operating procedures associated with the agency record systems in



 1982.  There have been significant changes since then.  In



 particular, BLS has taken several steps to improve the timeliness



 and the completeness and accuracy of data in its UI Address File.



 



      The study was based on data for a single State, Texas, and on



 a small sample of employers and reporting units.  The UI system



 gives the States some latitude in their record-keeping practices,



 so indications of the coverage of employers in the record systems



 of the Texas State Employment Agency in 1982 should not be assumed



 to apply fully to the UI systems of other States at that time.  The



 small sample size means that estimates based on the Phase II sample



 are subject to relatively large sampling errors.  Because of



 limited resources and the complexity of the Phase II sample design,



 we were able to compute sampling errors only for a few key



 estimates (see Table A-4).



 



      The analysis of the results was complicated by differences in



 concepts and coverage in the record systems used in the study.



 These differences occurred in the basic filing requirements for the



 UI and SSA/IRS systems, the time reference of the basic BLS and SSA



 files used for matching, the definition of reporting units in the



 BLS and the SSA/ERP systems, and the structures of the BLS and SSA



 industry classification systems.  In addition, certain file



 






 



deficiencies and operational problems made the analyses more



difficult.  About 1.3 percent of the records in the 1982 UI Address



File for Texas did not have EINs and therefore were not included in



the Phase I sample of EINs from that file.  In the SSA files, a



significant proportion of employers lacked county and industry



codes.  The most serious problem was that a high proportion of



multi unit employers were not reporting separately in 1982 for each



reporting unit, so that we were unable to do a thorough comparison



of reporting units for multi unit employers active in both the BLS



and SSA systems.



 



 



     Although these differences and file deficiencies made the



analyses more difficult, the fact that we succeeded in identifying



and documenting them is an indication that the ERUMS project



succeeded in its main goal, which was to demonstrate the



feasibility of doing matching studies as a means of evaluating the



suitability of administrative record systems for statistical uses.



 



 



     The data on amounts of employment and payroll available from



SSA, BLS and IRS files were used in reviewing the unmatched sample



cases and trying to understand why they were not present in both



SSA and BLS files.  However, the employment and payroll data were



not added to the data file for the 401 sample EINs that were used



to develop the estimates presented in this report.  Therefore, all



of the results shown are estimates of numbers of employers or



reporting units classified by attributes such as match status, and



geographic and industry codes in the different systems included in



the study.  We did not attempt to estimate what proportions of



aggregate employment or payroll were accounted for by employers who



were unmatched or had different geographic or industry codes.



 



 



Findings



 



     The detailed analyses of the ERUMS data did not suggest that



large numbers of employers who report wages in one of the payroll



tax systems were failing to report in the other system when they



should have been.  They do, however, suggest that late reports and



different procedures for processing the reports in the two systems



created potential problems for using both of the systems' data



files for statistical purposes.



 



     Perhaps the clearest finding was that it is not possible to



maintain a usable establishment reporting unit plan for multi unit



employers in the absence of systematic procedures for monitoring



employer reporting and updating files for changes in the number,



location and industry of each employer's reporting units.  SSA's



Establishment Reporting Plan clearly lacked the necessary resources



to do this in 1982 and there is no reason to think that the



situation has improved since then.



 






 



      There was a moderately high but by no means perfect



correspondence between county and two-digit industry codes for



single unit employers included in both the BLS and SSA systems.  A



substantial proportion of the differences arose from the absence of



county or industry codes in the SSA system.  Comparisons of



industry codes at the three and four-digit level were not attempted



because of the differences in the industry classification systems



used by the two agencies.



 



      With some qualifications, we were successful in matching the



records of employers, as defined by their EINs, in different



systems.  However, we were not successful in matching BLS and SSA



records for reporting units, the main reason being the



incompleteness of SSA's data for reporting units provided under the



voluntary ERP.  Other reasons were the lack of a common identifier,



analogous to the EIN at the employer level, for reporting units and



the slight differences in the reporting unit definitions used by



BLS and SSA.



 



      We learned what we believe are some important lessons for



others who may wish to match business records from different agency



sources, whether for research or operational purposes.  First, the



plans and the necessary interagency agreements should be developed



well ahead of the earliest date at which the files to be linked are



expected to be available.  In particular, the development of



interagency agreements for the exchange of identifiable records is



a painstaking process and considerable time may be needed for their



completion and approval.



 



      Second, successful matching requires in-depth knowledge of all



of the record systems involved and of the specific files that exist



within those systems.  An interagency team approach, with full



exchange of information, is essential because there is unlikely to



be a single individual who has all of the necessary information,



even for the files of a single agency.



 



      Finally, whenever possible, it is essential to pretest



matching procedures before embarking on large-scale operational



applications.



 



 



Recommendations



 



      ERUMS was designed primarily as a demonstration project and



was therefore limited in its coverage and scope.  Nevertheless, the



Workgroup believes that the study results, along with other



information acquired in the course of the study, justified the



inclusion in its report of five formal recommendations addressed



specifically to the BLS and SSA record systems for employers and



reporting units.  These recommendations were:



 



 



 






 



 1. SSA should undertake a full review of the current status



    and uses of the Establishment Reporting Plan and decide



    either to continue it with adequate resources for



    maintenance and improvement of quality or to discontinue



    it entirely.



 



    (Note: such a review was begun by SSA prior to the



    completion of the ERUMS project.  As a result of that



    review, SSA is taking steps to prepare for the



    termination of the ERP.)



 



 2. BLS should review the State Employment Security Agencies'



    procedures for identifying employer births (including



    those resulting from mergers and changes of organization)



    and seek ways of reducing the apparent lag between filing



    of applications for EINs and inclusion of new employers



    on State Agency and BLS lists used as frames for



    statistical surveys and reports.



 



 3. Data in the UI Address File on employment and wages paid



    should be labelled to distinguish imputed data from data



    reported by employers.



 



 4. The EIN should be identified as a key item in the UI



    Address File and efforts should be made to achieve 100



    percent reporting initially and current reporting of



    changes in EINs.



 



 5. BLS and SSA (if it continues the Establishment Reporting



    Plan) should strive to obtain data from employers for



    their establishments as defined in the 1987 Standard



    Industrial Classification (SIC) Manual.  Both agencies



    should code industry for all establishments, without



    exception, at the 4-digit SIC level of detail.  Whether



    or not the Establishment Reporting Plan is continued, SSA



    should code all employers identified on Forms SS-4 at the



    4-digit level of detail.



 



    (See parenthetical note following recommendation 1



    concerning the current status of the ERP.)



 



    In a broader context, the ERUMS Workgroup concluded that



current efforts to collect economic data at the establishment level



are dispersed among Federal and State agencies, are poorly



coordinated, and place unnecessary burden on employers.  The



Workgroup believes that further, more intensive and extensive



interagency matching studies have an important role to play in



resolving these problems and in determining the possible effects on



statistical programs of prospective major changes in administrative



reporting systems for employers.  We therefore recommend that:



 






 



 6. Further matching studies should be directed at acquiring



    information that will support the eventual development of



    a mandatory reporting system to meet the needs of all



    Federal and State statistical programs for establishment



    lists, including SIC codes.  An interim goal should be



    that all agencies requiring or requesting employers to



    provide data at the establishment or reporting unit level



    adopt common definitions of units and data items to be



    submitted for these units.



 



    Three agencies -- the BLS, the Census Bureau and the National



Agricultural Statistics Service -- play a dominant role in the



direct collection of establishment-level economic data.  Recent



initiatives of these agencies, under the general guidance of OMB's



Statistical Policy Office, have been directed at greater



coordination of their respective list-building and maintenance



activities.  Further integration of business lists will require



fuller understanding of the similarities and differences of the



three systems, based on matching of individual establishments and



reporting units in the different systems.



 






 



   



     



 



[Table A-4 appeared here as a graphic, which is not reproduced.]



 



 



 



1/Numbers in parentheses are standard errors of the percents.



* Indicates a standard error of less than 0.05 percent.



 






 



 



                    DISCUSSION



 



                  W. Joel Richardson



                   Charles A. Waite



             U. S. Bureau of the Census



 



 



Introductory Comments on ERUMS



 



     First of all, I would like to thank the many people who have



been involved with ERUMS.  Their commitment and resourcefulness



have helped to make the ERUMS project a success.  As Vernon has



detailed, several recommendations were presented that undoubtedly



will improve the business files of the Bureau of Labor Statistics



(BLS) and the Social Security Administration (SSA).  But more



importantly, the ERUMS study provided valuable experience in the



technical aspects of matching interagency data sets.  I am hopeful



that this experience will help to further the efforts of data-



exchange initiatives among federal statistical agencies in the



coming years.



 



     When the preliminary planning for ERUMS began in 1983, the



Census Bureau expected to be one of the participating agencies.



Our business employer files were to be matched along with those of



the BLS, SSA, and the Internal Revenue Service (IRS).  However,



there were significant problems concerning the release of our



confidential data.  Though we realized the importance of ERUMS, we



could not resolve these data-access problems soon enough to allow



us to be an active participant.  As an alternative, the Census



Bureau obtained observer status, which enabled us to closely follow



the progress of ERUMS.



 



     Before critiquing the three papers, I'd like to expound on the



value of the ERUMS study to the federal statistical community.



Warren stated that a major goal of ERUMS was to test the



feasibility of matching employer records from the business lists of



different government agencies.  This goal was accomplished in



ERUMS, and the results showed that the matching of the two distinct



data files is possible.   Additionally, the ERUMS evaluation



revealed problems associated with matching the interagency data



files.  I expect that these findings will be valuable in future



matching studies.



 



     A matching study should be the first step in any data-sharing



proposal -- before a data-sharing proposal is accepted by the



participating agencies, it is essential to confirm the



comparability of the data sets and to resolve any conceptual and



definitional differences.  In my view, the ERUMS project showed



that the BLS and SSA data sets are comparable, and that an



effective matching operation is possible.



 



 






 



    Although there are obvious discrepancies between the data sets



  -- only 67.1 percent of the EIN records were active in both systems



  -- significant benefits could be realized through data sharing.



First, greater consistency in the industrial classification codes,



geographic location indicators, and related data values could be



achieved by sharing the data for matched records.  Second,



unmatched records could be researched in an effort to ensure the



completeness of each of the employer universes.  Though numerous



issues would need to be explored and settled, such a data-sharing



plan could result in greater comparability among the data series.



 



    Currently, the administration has a legislative proposal in



Congress that would permit limited data sharing between the Census



Bureau and the Bureau of Economic Analysis (BEA).  The primary



purpose of the proposal is to provide BEA with confidential access



to the Census Bureau's establishment information.  This information



will augment and improve the data on foreign direct investment that



BEA collects and publishes.



 



    There are other versions of the legislative proposal in



Congress to share Census and BEA data -- not only with each other,



but, in at least one version, with the General Accounting Office



(GAO) and the Committee on Foreign Investment in the U.S. (CFIUS).



We are concerned that response rates may decline if our microdata



are made available to such policy-making organizations as GAO and



CFIUS.  For this reason, the Census Bureau does not support this



legislative proposal.



 



    The BEA collects foreign-investment data at the enterprise



level.  The Census Bureau conducted a feasibility study that showed



BEA enterprise-level data could be linked successfully with Census



Bureau establishment data.  By integrating our establishment-level



data with BEA enterprise data, BEA will be able to present foreign



direct investment statistics at a much finer industry and



geographic level.  This is one of many possible data-sharing plans



that could provide significant cost and qualitative benefits to



Federal statistical programs.



 



     I would like to believe that the administration's legislative



initiative, together with successful match studies such as ERUMS,



will provide the impetus for increased data sharing among Federal



statistical agencies in the future.



 



 



Interagency Agreements for Microdata Access: the ERUMS Experience



 



     Tom Petska's presentation focused on the interagency



agreements required to comply with the confidentiality provisions



that govern the three sets of data.  Clearly, the matching of



individual records in the ERUMS project could not take place until



these confidentiality issues were resolved.



 



 






 



    Tom has presented thoroughly the problems associated with



sharing the individual records from different agencies.  It is



apparent that these legal agreements represented a major barrier in



the ERUMS project.  To their credit, the ERUMS workgroup was able



to overcome the confidentiality problems and to formulate a



workable plan -- IRS contracted with BLS to perform the match, and



SSA staff were designated as special agents of BLS to process the



data.  The IRS is permitted to disclose tax information to outside



contractors as long as it is for purposes of tax administration,



and the ERUMS study was considered to be a statistical study



related to the administration of IRS tax laws.  Unfortunately,



considerable time was spent in determining this solution and in



drafting the required legal agreements.  This added considerably to



the length of the ERUMS study.  Future matching studies may face



similar obstacles in gaining access to confidential data.



 



    As an example, the Census Bureau obtains the EIN and related



data values for many small employer businesses from the IRS.  Any



future studies undoubtedly will rely on the EIN to match the



records, because the EIN is the one key identifier common to U.S.



data systems.  But as Tom has pointed out, the EIN itself is



protected by Internal Revenue Code confidentiality provisions.  For



this reason, the EIN and related data that the Census Bureau



obtains from the IRS cannot be released to other statistical



agencies such as the BLS.  Only those business records whose EIN



and related data have been confirmed through direct respondent



contact would be eligible for release.  This would impact on the



completeness of any matching studies between the BLS and Census



Bureau data sets, because a portion of our business universe has



not been directly canvassed.



 



    The BLS was permitted access to IRS records in the ERUMS



project because of tax-administration purposes.  Although



additional studies possibly could be conducted using similar



arrangements, it would require the support of the IRS and other



agencies that furnish the administrative data.  Otherwise, future



studies may require changes to relevant statutes and regulations



before microdata access is authorized.  Such changes are difficult



to obtain.



 



    I do have one minor point on the paper concerning the



confidentiality provisions of the BLS data.  The ERUMS study used



matched BLS records from only one state -- the state of Texas.



Although Tom outlined the disclosure provisions associated with the



data records from Texas, it was unclear whether these provisions



were typical of the other 49 states.  We understand that BLS



affords each state certain latitude as to the collection of



the unemployment data.  If the states also have different



confidentiality provisions -- specifically, provisions that



strictly prohibit the release of data to Federal agencies other



than BLS -- the ERUMS project may not have been possible using



records from these states.



 






 



     One of the goals of ERUMS was to gain experience in the



procedure of obtaining access to the confidential data of the



various data sets.  To this end, the ERUMS study was a success.



The study revealed the problems associated with obtaining the



access to the microdata for matching purposes, and also determined



a workable solution that overcame these problems.  However, I



expect that disclosure problems will continue to be a major



obstacle in future matching initiatives.



 



 



Sample Selection and Matching Procedures for ERUMS



 



     John Pinkos's presentation focused on the sample selection and



matching procedures in ERUMS.  As John has pointed out, a major



constraint affecting the sample size was the limited staff time and



resources.  Because considerable analysis was inevitable for the



sampled records, the ERUMS members agreed to select a relatively



small sample.  As it turned out, 401 cases were selected.



 



     By limiting the sample to one state, and oversampling from



certain categories of records that were of particular interest,



ERUMS was able to create a manageable set of sample records that



were sufficient to meet the study's objectives.  I expect that



future matching studies will benefit from the details of the



procedures used in ERUMS.



 



     Three sources of data were used in the study -- BLS data, SSA



data, and IRS data.  Cases were selected first from the BLS data



files and then independently from the SSA data file.  Using this



technique -- specifically, by selecting independently based on



certain digits of the EIN -- the ERUMS sample included records that



were present in only one of the two data systems, as well as



records that were present in both systems.  Records present in only



one of the data systems were a critical part of the study, as these



represented potential differences in employer coverage between the



two data files.
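
     The idea of selecting independently from each file on the
basis of EIN digits can be sketched as follows.  The details here
are assumptions made for illustration:  the choice of the last two
digits, the particular digit endings, and the nine-digit EINs
themselves are all hypothetical and are not taken from the ERUMS
sample design.

```python
def select_by_ein_digits(eins, endings, n_digits=2):
    """Select EINs whose last n_digits fall in the chosen set of endings."""
    return {ein for ein in eins if ein[-n_digits:] in endings}


# Hypothetical files of nine-digit EINs (not real employers).
bls_file = {"741234507", "751111113", "760000007", "742222299"}
ssa_file = {"741234507", "750000013", "761111155"}
endings = {"07", "13"}  # assumed selection rule, applied to both files

# Each file is sampled independently, using the same digit rule.
bls_sample = select_by_ein_digits(bls_file, endings)
ssa_sample = select_by_ein_digits(ssa_file, endings)

# Because selection depends only on the EIN, an EIN selected from one
# file is selected from the other whenever it is present there.
in_both = bls_sample & ssa_sample
bls_only = bls_sample - ssa_sample
ssa_only = ssa_sample - bls_sample
```

     Under a rule of this kind, a selected EIN that turns up in only
one sample must be absent from the other file altogether, so the
one-system cases can be read directly as candidates for coverage
differences.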



 



     The ERUMS study, however, did not sample from the IRS data



set.  The IRS data were used only to help analyze the BLS/SSA cases



selected in the sample.  The IRS file was not included in the



sample selection because of the difficulties in gaining access for



such a purpose.  Although this decision was unavoidable, it may



have compromised the results of ERUMS somewhat.



 



     The IRS data file represents a complete universe of business



employers in 1982 -- all employers who filed payroll tax returns,



with no exclusions as to the size of the business or the



nonprofit status of an organization, were included on the IRS file.



Without this complete file of businesses, ERUMS was left to compare



records from the BLS and SSA data sets.  Although differences were



identified and quantified, the study could not make valid estimates



 



 






 



on the completeness of the two data sets as compared to the



universe of businesses on the IRS file.



 



      A similar point exists for the matching of multiunit records



from the BLS and SSA data sets.  The ERUMS study showed that about



1 percent of all active EINs were classified as multi unit in one



or both systems.  Most of these were classified as multi unit only



in the BLS system.  One of the findings of the study was that the



SSA multiunit file is deficient, and steps should be taken to



either improve the quality or to discontinue it entirely.  Because



of the obvious deficiency in SSA's multiunit file, no legitimate



conclusions could be reached on the accuracy of the BLS multiunit



file.



 



      One last point on John's paper:  he discussed briefly the



comparison of industry classification and geographic location from



the BLS and SSA files.  I would have liked to see some general



table that presented these results.  Even if the results were



presented at broad industry and geographic levels, it would have



provided some general information on the comparability of these



critical data elements.
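
A general table of the kind suggested here could be as simple as an agreement count by broad industry group. The sketch below is an assumption about what such a tabulation might look like, not something taken from the ERUMS report: it assumes matched records carry a 4-digit SIC code from each file, and the function name and sample codes are made up for illustration.

```python
from collections import Counter


def two_digit_agreement(pairs):
    """Tally agreement of 2-digit SIC groups for matched records.

    `pairs` holds (bls_sic, ssa_sic) 4-digit code strings for records
    found in both files. Returns {bls_2digit_group: (agree, total)}.
    """
    agree, total = Counter(), Counter()
    for bls_sic, ssa_sic in pairs:
        group = bls_sic[:2]          # broad industry group from the BLS code
        total[group] += 1
        if ssa_sic[:2] == group:     # same broad group in the SSA file
            agree[group] += 1
    return {g: (agree[g], total[g]) for g in sorted(total)}


# Illustrative code pairs only; not ERUMS data.
matched = [("5411", "5411"), ("5411", "5499"), ("5812", "7011"), ("1521", "1531")]
table = two_digit_agreement(matched)
```

The same tally could be keyed on State or county codes to summarize geographic comparability.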



 



Results, Findings and Recommendations of the ERUMS Project



 



      The agencies involved in the ERUMS project have gained



valuable experience in the technical aspects of linking data files



and in the administrative requirements for gaining access to the



data.  For this reason alone, the ERUMS project should be



considered a success.  In addition to the experience gained, the



ERUMS project presented several recommendations that will help to



improve the business files of the BLS and SSA.  I understand that



the BLS has already taken several measures to improve the



timeliness, completeness, and accuracy of the data in its



Unemployment Insurance Address File.



 



      Vernon's presentation detailed the recommendations that were



identified in the ERUMS study.  In one of the recommendations, he



stated that BLS should review the procedures for identifying births



in an effort to improve the timeliness of including new employers



in the BLS lists.  I suggest that the BLS review procedures for



identifying deaths as well.  Up-to-date operational status is a



critical element of business employer records.



 



      The final recommendation in Vernon's presentation covered the



need for additional matching studies to acquire information that



will support the eventual development of a reporting system to meet



the needs of all Federal and State statistical programs.  Because



of certain legislative barriers -- for example, Title 26 strictly



prohibits the release of IRS data to other statistical agencies --



and significant operational problems, such a far-reaching goal may



not be plausible in the foreseeable future.



 



    The Census Bureau supports a more achievable goal of data-



sharing among Federal statistical agencies, and would welcome the



opportunity to conduct additional matching studies in an effort to



further data-sharing initiatives.  Before proposing the Census/BEA



data-sharing initiative, we conducted a matching study that



confirmed the feasibility and value of linking our establishment-



level data with BEA's enterprise-level data.  This preliminary



study was a necessary step in the Census/BEA data-sharing



initiative.  Additional matching studies may promote other data-



sharing initiatives in the Federal Government.



 



    The ERUMS project, which effectively matched interagency data



files, may help provide the impetus for increased data sharing in



the coming years.  With the necessary legislative changes,



pertinent data from each of the employer files could be shared



among statistical agencies.  Such a data-sharing plan would provide



major advantages, including greater comparability among economic



data series, less respondent burden on the business community, and



a reduction in overall Government costs.



 



Summary



 



    Comparisons between data sources are beneficial because they



highlight conceptual differences and identify the limitations and



strengths of the data sets.  The ERUMS project successfully met



both of these objectives.  In addition, ERUMS provided valuable



experience in the technical aspects of matching interagency data



sets.



 



    Our current mission should be to use this experience to



further the efforts of data sharing in the Federal Government.



Data sharing offers major advantages to Federal statistical



agencies.  By supplementing business data sets with applicable



information from the data sets of other agencies, the Federal



statistical system will attain greater comparability in related



economic data series.  The ERUMS project showed that interagency



data sharing is a viable option.  I would like to congratulate the



many people who have been involved with ERUMS for a job well done.



 



 



                        DISCUSSION



 



                       Thomas J. Plewes



                U.S. Bureau of Labor Statistics



 



 



     I appreciate the opportunity to appear at this public



unveiling of the Employer Reporting Unit Match Study (ERUMS)



report.  This is an event that has been long-awaited by all of



those who have been involved in this multi-agency, multi-year, and



multi-faceted project.  I expect that no participant has awaited



this day more anxiously than Warren Buckler, who, along with the



folks here at the speaker's table and many in today's audience, has



spent a great deal of time over the past few years in conceiving,



giving birth to, and nurturing this little study.  Indeed, to carry



the metaphor further, it is hard to figure out where we stand now on



the continuum from project conception to death.  Is this session a



commencement ceremony, or is it a eulogy?  As my commentary will



soon indicate, I hope that we are gathered for a commencement



ceremony, for the statistical community has learned important



lessons about sharing and about the basic quality of two major



business lists in this project at some significant cost.  It would



be a shame if the lessons learned were not put to use in



implementing critically needed program improvements.



 



     I would like to accomplish two objectives in the short time I



have allotted as a discussant.  First, I want to step back to



examine the environmental framework in which this study took place



and contemplate the arena into which the report now has been



thrust.  My second goal is to draw specific conclusions from the



exercise and suggest specific steps that should be taken as a



result of the work that has been done.



 



     What is the environment in which we must consider this study?



It is a complex environment, characterized by:



 



  1. Little sharing of business directory information between



     Federal government agencies, but a growing pressure to



     develop procedures for sharing so as to reduce the burden



     on respondents.  These pressures are building to the



     extent that I believe sharing will surely be mandated.



     That mandate may come in the form of legislative action,



     a fiat from the Office of Management and Budget using its



     authority under the Paperwork Reduction Act, or, of most



     profound consequence, through a centralization of the



     statistical agencies.



 



 2.  A reliance on lists characterized by their primary usage



     as administrative data sources which focus on the support of



     the administration of the law or function.  We have built



     our elaborate business directory programs and constructed



     our business survey frames on databases that have been



     developed with only a distant secondary concern for the



     statistical uses of the data.



 



 3.  Difficulty in separating statistical from enforcement



     purposes.  If we, as statistical agencies, make the data



     better and create an environment for comparing lists, we



     enhance their use for enforcement and administrative



     purposes also.  This aspect will be particularly



     troublesome when we involve, as we eventually must, the



     Internal Revenue Service in sharing schemes.  The



     participation of the IRS in the ERUMS process gave us an



     indication of the lengths to which IRS will go to protect



     the tax data, and of the difficulties this injected in



     the ERUMS process.



 



  4. A growing concern over confidentiality of establishment



     records.



 



  5. A lack of consistency of definitions and coding that



     extends throughout the statistical system, but has a most



     profound impact on sharing of administratively-derived



     lists.  Administrative differences in the programs lead



     to inconsistent definitions of even the most simple of



     terms, such as "employment", "address", "wages" and the



     like.



 



  6. An expanding recognition that errors and omissions in the



     business lists are a significant source of error in the



     survey process.  The Federal Committee on Statistical



     Methodology's Working Paper 15, "Quality in Establishment



     Surveys" documented this, and the Tupek-MacDonald paper



     this morning discussed the effect that the Bureau of



     Labor Statistics' Business Establishment List improvement



     project will have on BLS survey quality.



 



     These environmental elements pose formidable challenges to



statistical agencies that want to improve the efficiency of their



operations and reduce burden on their reporters.  For example, in



terms of frames for surveys of nonagricultural businesses, there



are at present two major government lists -- the Census Bureau's



Standard Statistical Establishment List (SSEL) and the BLS Business



Establishment List (BEL) -- and one major private sector list --



the Dun & Bradstreet file -- with a myriad of lesser known and more



specialized lists for more limited purposes.  We can look at the



SSEL as a representation of the SSA/IRS administrative data



files with considerable value added by the Census Bureau.



Likewise, the BEL may be seen as a representation of the State



unemployment insurance files with considerable BLS value added.



If these Federal government files do not match, and we suspect they



do not through analysis of the macrodata, the problem can be with



the basic administrative data files, with the value added, or both.



Over the years, Fritz Scheuren's various administrative database



comparison projects have documented the systemic differences in the



files very well.  They must be borne in mind.  Fixing the files



once we have identified the root difficulties is quite another



matter.  The statistical agencies do not own them, and they are



exceedingly expensive to change (in terms of budget and response



burden).  Indeed, quite often only a revision in law or nationwide



program practice will do the trick.



 



     Fixing the "value added" portion is somewhat more possible,



but it too is expensive in terms of budget and people.  Often there



are good reasons for not fixing the way we add our value, such as



the need to assure the continuity of historical data series.



 



     Definitions are another challenge.  If we want to share lists,



we must think in terms of three types of problem.  In some cases,



repair is relatively simple.  We heard today, for example, that our



definitions of multi-unit employers are already in close proximity.



The EIN and SIC systems are also bedrock.  Our challenge in those



instances where there is close concordance between the files is to



maintain the definitional base in a standardized, current and



relevant manner.



 



     In other areas, we must change the way we do business but, if



we are willing, our task will be reasonably easy.  One match



problem that ERUMS identified was that the project was comparing



annual SSA reporters with 1st Quarter UI reporters.  This is one of



the problems that we can fix with time and resources, because the



data are there.



 



     In a few important other cases, however, we are quite limited



in our ability to bridge definitional gaps.  For example, when



coverages are based on Federal laws, State laws, and judicial



precedent regulating the administrative database, we would be



forced to justify a change in the insurance or tax program on



statistical grounds.



 



     Certainly, confidentiality concerns have a presence in the



equation.  We glimpse in the Petska-Alexander paper the importance



that necessary confidentiality protection schemes had in this



project, and the price those schemes exacted in terms of time and



precision.  That's one of the reasons I like the Petska-Alexander



paper so much.  It outlines the practical implications of



maintaining a pledge of confidentiality when cooperating on a



project of importance to the statistical agencies.  Everything, as



they so well point out, had to be invented.  There are no textbook



examples of interagency agreements on confidentiality.  The



solutions which the project team developed were carefully crafted



to stay within the very restrictive IRS law and were implemented



with an eye toward the reality of the environment.  Thus, there are



really two stories in the Petska-Alexander paper.  One story is



about the difficulties that the team encountered in sharing



confidential data.  The other, written between the lines, is about



the sense of cooperation and dedication that allowed the cumbersome



solutions to move forward.



 



    The Petska-Alexander paper starkly reminds us that the role of



confidentiality policy is important but little understood.  We may



be hopeful that the current situation will be short-lived.  The



National Academy of Sciences' Committee on National Statistics has



taken on these issues with the formation of an expert panel.  Until



we are able to benefit from that report, however, we are left with



the fact that understanding of confidentiality of business records



has not progressed very far as either science or practice.  Only



recently has a literature on the subject of confidentiality begun



to emerge, but most of it addresses the more emotional topic of



confidentiality of information about individuals.  The literature



pays little attention to issues surrounding confidentiality of



business records.  Without such a foundation, the statistical



agencies have mostly assumed that the issues of confidentiality of



business records are the same as those for individuals.  This



assumption has played an important role in justifying past limits



on sharing between the Federal agencies.



 



    The second paper, by Einstein, Levasseur, Packman, and Pinkos,



also attempts to stand back with benefit of hindsight and make some



sense out of what was a convoluted process.  Since 3 of the 4



authors work with me, these comments may not be as critical as



others may have rendered, for all along the way I "bought-in" to



the approaches taken and the effort expended.  Nonetheless, I view



the documentation that this paper offers in a somewhat different



light than the authors, and draw slightly different conclusions.



 



    The matching process, as described, makes a good deal of



statistical sense.  The team selected a two-stage sample selection



process, stratified into 9 groups.  The second phase, a subset of



about 400 cases of the first selected on a probability basis,



provides for detailed analysis.  Some of the specific steps in the



process were to meet the confidentiality restrictions, but not all.
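
For readers who want to see the shape of such a design, the following is a minimal sketch of a stratified first phase followed by a probability subsample. It is not the ERUMS selection procedure: the equal allocation per stratum, the simple random subsample, the frame, and all names and sizes other than the 9 strata and the roughly 400 second-phase cases are assumptions made for illustration.

```python
import random


def two_phase_sample(frame, stratum_of, n_first, n_second, seed=0):
    """Draw a stratified first-phase sample, then a simple random subsample.

    frame: list of record IDs; stratum_of: dict mapping ID -> stratum label;
    n_first: first-phase sample size per stratum (assumed equal allocation);
    n_second: total size of the second-phase subsample.
    """
    rng = random.Random(seed)
    strata = {}
    for rec in frame:
        strata.setdefault(stratum_of[rec], []).append(rec)

    first = []
    for label in sorted(strata):
        members = strata[label]
        first.extend(rng.sample(members, min(n_first, len(members))))

    # Second phase: equal-probability subsample of the first-phase cases.
    second = rng.sample(first, min(n_second, len(first)))
    return first, second


# Hypothetical frame of 900 records spread evenly over 9 strata.
frame = list(range(900))
stratum_of = {rec: rec % 9 for rec in frame}
first, second = two_phase_sample(frame, stratum_of, n_first=60, n_second=400)
```

Because the second phase is drawn on a probability basis from the first, findings from the detailed cases can be weighted back to the first-phase sample.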



 



     The process that the team established should serve as a first



step toward developing an on-going statistical process control



system, if and when sharing does take place.  Many of these same



activities should be continued in a recurring program to meet the



objectives of total quality management.  Thus, the work of the team



has long-term, permanent implications.  The authors seemed to



recognize this when they stated that "we believe future projects of



this kind will benefit from the availability of this detailed road



map".  Probably so, but I speculate that future researchers will



look at the road map and decide against making the journey.  That



is why I would take pains to separate the enduring aspects that



should be the foundation of a quality management system from those



that were necessary to meet more bureaucratic objectives.



 



 



     The contribution of the Renshaw-Jabine paper is to yield some



hope, in that it reminds us how close we are to an ability to



share, while providing some sober reflection about some major tasks



still lying ahead if we are to share.  Their bottom line is that



the systems are reasonably close in coverage -- eventually most



employers emerged in the systems.  There were troublesome



differences in multi-unit identification, in county coding, and in



industrial classification at the 2-digit level, but I would label



these of moderate concern.  Indeed, under the BEL initiative, BLS



has taken steps to correct many of the inadequacies in its data,



investing with the States in improving SIC coding, interpretation



of SICs, and, more recently, in fixing the multi-establishment



identification problems.  Unfortunately, with lack of resources,



the Social Security Administration has not been able to make the



same investment, so many of the difficulties in the SSA file may



have multiplied.



 



     In summary, we ought not let this expensive experience lie on



the shelf.  We have learned a great deal about two files -- lessons



that should be extended to files maintained by the Bureau of the



Census.  And we need to get on with fixing some of the obvious



flaws in the administrative data.  Most importantly, we have



learned that maintaining confidentiality is possible, that matching



is feasible, and that the will is present at the staff level in the



agencies to make it all come together.  Now it is time for



leadership.  As Senator Bennett Johnston said in an argument before



Congress, "There's a time to stop talking the talk and start



walking the walk."  We have the map.  Let's start walking.



 



 



 



                         Session 10



 



                   APPROACHES TO DEVELOPING



                         QUESTIONNAIRES



 



 



 



 



 



 



 



 



TOOLS FOR USE IN DEVELOPING QUESTIONS AND TESTING QUESTIONNAIRES



 



                         Theresa J. DeMaio



                     U.S. Bureau of the Census



 



    As the collection of information through surveys becomes more



prevalent in our society, increasing numbers of people find



themselves in a position to develop questionnaires.  Writing a



questionnaire seems like such a simple task -- many people think



that anyone without training or experience can do it.  But



developing a good questionnaire -- one that can obtain good quality



information that meets the objectives of the survey -- is not as



easy as it looks.  Many different kinds of abilities, including



subject matter expertise, writing capabilities, and knowledge of



social psychological principles are necessary to develop a simple,



cohesive questionnaire in which the questions are clearly worded.



Developing a good questionnaire is not a solitary task -- simply a



matter of sitting down at your desk for a few minutes or even a few



hours.  There are a number of procedures that can be used to



involve potential respondents in content or question development,



and to test and evaluate questionnaire drafts before they are



finalized.



 



     The purpose of Statistical Policy Working Paper #10,



Approaches to Developing Questionnaires, is to provide practical



information about these methods.  The report contains descriptions



of 11 different techniques, which can be used at various stages of



questionnaire development.  The report is structured in three



parts:  tools to develop questions, procedures for testing the



questionnaire draft, and techniques used to evaluate the



questionnaire draft.  This structure was somewhat artificially



imposed for ease of presentation in the report.  In fact, there is



no one ideal way to go about the process of developing a



questionnaire.  Depending on a number of factors, such as whether



you're working from scratch or from an existing questionnaire, how



much time and funds are available for survey development, these



techniques can be used in many different combinations.  In terms of



improving the content of a survey questionnaire before it goes out



into the field, the important thing is that testing and



developmental work be conducted, not necessarily that it be done



according to the structure presented in the report.



 



     Having made this disclaimer, I am nevertheless going to



discuss the techniques that are presented in the first two sections



of the report -- that is, tools for developing questions and



techniques for testing the questionnaire draft.  I'm going to



generally describe the methods contained in the report, and mention



some additional techniques as well.



 



 



Developing Questionnaires



 



    Part I of the report describes three tools for developing



questions.  The report presents these methods as useful in



developing new questionnaires.  I'd like to expand on this a little



and suggest that these techniques can be used in the early stages



of questionnaire development of any survey.  Most surveys are



conducted more than once; subsequent rounds of data collection



begin with an existing questionnaire draft that is subject to



revision.  These later rounds each have early stages of



questionnaire development, complete with an existing questionnaire



draft.  In these cases too, the methods described in Part I of the



report may be appropriate.



 



Unstructured individual interviews



 



     Unstructured individual interviews are one-on-one



conversations between a researcher and a member of the population



for the survey or proposed survey.  I use the term "conversations"



because the discussion is unstructured; rather than having a set of



specific questions, the researcher uses a topic outline that



collects information on various aspects of these topics in whatever



order, and using whatever terminology the respondent suggests.



Respondents may also bring up additional issues related to the



general topic, which might be incorporated into the topic outline



for later interviews.  The goal is an unstructured setting in which



the researcher finds out how the respondent perceives the topic of



interest, what terminology the respondent uses to talk about the



topic, and whether the respondent is knowledgeable and able to provide



information on the topic.  By working from a blank slate, the



researcher is not constrained by the content and terminology of an



existing questionnaire, and the true frame of mind of the



respondent is more likely to surface.



 



Qualitative Group Interviews



 



    Many of you may be familiar with qualitative group interviews



under a different name, such as focus group interviews, group depth



interviews, or focussed discussion groups.  Essentially these are



unstructured interviews with a group of respondents rather than a



single respondent, led by a group moderator.  About 8 to 12 people



participate in a group, and the moderator uses a topic outline to



guide the discussion.  Qualitative group interviews are used for



many research purposes other than questionnaire development.  When 



used to assist in questionnaire construction, the goal is the same                                                    



as the goal of unstructured individual interviews -- to elicit the



terminology used by respondents in thinking about the topic in



question, to determine aspects of the topic that respondents



consider important, and to get a reading on how respondents react



to aspects of the topic that survey planners consider important.



The difference between qualitative group interviews and



unstructured individual interviews is, obviously, the group setting --



the diversity of opinions held by group members may stimulate



interaction among them that elicits more information than could be



obtained through interviews with each member separately.  In order



for these groups to be successful, however, the ability of the



moderator is an important consideration.  The idea is to stimulate



discussion among all the participants and to avoid domination of



the discussion by some people who may be more vocal than others.



 



 



Participant Observation



 



    Participant observation is a technique that is used as an



independent method of data collection, as well as a tool for



questionnaire development.  It has been extensively used around the



world.  The basic elements of the technique are suitable for



questionnaire design purposes, especially in developing



questionnaires for use by members of other cultures or subcultures



living within our own country.  For example, the homeless



population is a subculture that is currently the object of much



interest, and for which the use of participant observation



techniques is relevant.  Indeed, these techniques have been



successfully used in research on homelessness being conducted at



the Census Bureau.



 



    There are several distinguishing characteristics of



participant observation research.  First, the researcher must speak



the respondents' language.  This is not limited to English as



opposed to a foreign language, but also refers to dialects, slang,



or professional jargon.  Second, the researcher associates with the



members of the community he or she studies and engages in their



activities.  Ideally the researcher lives among the respondents; at



a minimum, he or she develops contacts in the community over a long



period of time.  The participant observer may also use the



ethnographic interview technique during the course of his or her



research.  This involves using unstructured interviews (the



methodology I previously described) with "key informants."  These



are members of the community who are willing to talk at length with



the researcher or introduce the researcher to other community



members.



 



     From this brief description, it should be obvious that



participant observation is not a methodology that a person can



"pick up" by reading an introductory textbook.  The expertise



required in the use of this technique dictates the involvement of



trained ethnographers.  While that may limit its use somewhat among



U.S. statistical agencies, there are several ways it can be



incorporated in a project.  First, participant observation can be



conducted as part of a project by trained anthropologists hired to



serve on the project staff.  In the homeless project I referred to



a moment ago, we hired an anthropologist to work with a survey



methodologist, and this combination has worked out very well.  A



second way to make use of this technique is to consult with



ethnographers who have prior experience among the culture of



interest, and take advantage of this previous experience rather



than conducting original fieldwork.  This could be done either by



hiring the person on staff or doing it on a consultant basis.



 



 



Think Aloud Interviews



 



    Another technique suitable for the early stages of



questionnaire development has gained in popularity since the



Working Paper was completed in 1983.  This is the think aloud



interview.  Also referred to as protocol analysis, this method is



an extremely valuable source of information about how respondents



understand the survey questions put to them, and how they go about



answering the questions.  The purpose of the technique is to get



respondents to talk out loud and verbalize their thoughts as they



respond to questionnaire items.  The data of interest here are



respondents' reactions to the items, their thoughts as they



formulate answers to the items, and what decisions they make in



answering the questions.



 



    Use of the technique requires a questionnaire draft.  Since



the results of these interviews are crucial to the questionnaire



development process, the person doing the interviewing is generally



a researcher or questionnaire designer.  For interviewer-



administered surveys, the questioner first explains to the



respondent that rather than just answering the questions, he or she



should actually think out loud -- that is, say what he/she is



thinking as he/she answers each question.  Respondents differ in



their ability to verbalize their thoughts, and some may require a



bit of probing to uncover how they arrive at the answer to a



question.  At times it may take skillful questioning to probe



completely what is on a respondent's mind.  The interviews are



generally tape-recorded (with the respondent's permission), since



it is difficult to take notes and concentrate on probing the



respondent's answers at the same time.



 



    This technique can also be adapted for self-administered



interviews.  In this case, the questioner is basically an observer.



The respondent is instructed to complete the questionnaire, reading



the questions and instructions out loud as well as verbalizing the



responses.  I've done quite a few of these interviews, and they



really are quite helpful in detecting layout problems (not noticing



skip instructions, etc.) in addition to uncovering problems with



the questions.



 



    This technique is used with relatively small numbers of



respondents.  Ten or fewer think aloud interviews provide large



amounts of information and can uncover systematic



misinterpretations or other problems.  Use of the technique is an



iterative process -- once the questionnaire designer conducts five



to ten think aloud interviews, problem areas will generally



surface.  Then, after revisions to the questionnaire are made,



additional interviews can be conducted to detect problems with the



revisions.  Or alternatively, some other method can be used for the



next round of questionnaire development.



 



 



Testing Questionnaires



 



     Whatever