Federal Committee on Statistical Methodology
Office of Management and Budget

 

  Statistical Policy Working Paper 16 - A Comparative Study of Reporting Units in Selected Employer Data Systems


 

 


 

 





                MEMBERS OF THE FEDERAL COMMITTEE ON



                      STATISTICAL METHODOLOGY




(April 1990)

Maria E. Gonzalez (Chair), Office of Management and Budget

Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Charles E. Caudill, National Agricultural Statistics Service
John E. Cremeans, Office of Business Analysis
Zahava D. Doering, Smithsonian Institution
Joseph K. Garrett, Bureau of the Census
Robert M. Groves, Bureau of the Census
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, Bureau of the Census
Daniel Melnick, National Science Foundation
Robert P. Parker, Bureau of Economic Analysis
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Wesley L. Schaible, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census


                              PREFACE

The Federal Committee on Statistical Methodology was organized by OMB in 1975 to investigate methodological issues in Federal statistics. Members of the committee, selected by OMB on the basis of their individual expertise and interest in statistical methods, serve in their personal capacity rather than as agency representatives. The committee conducts its work through subcommittees and work groups that are organized to study particular issues and that are open to any Federal employee who wishes to participate in the studies. Working papers are prepared by the subcommittee/work group members and reflect only their individual and collective ideas.

The Employer Reporting Unit Match Study (ERUMS) Work Group of the Administrative Records Subcommittee was formed to conduct a study that compared employer and reporting unit data from the record systems of the Bureau of Labor Statistics (BLS) and the Social Security Administration (SSA), supplemented with employer-level information from the Internal Revenue Service (IRS).
To carry out the match study, interagency agreements were developed between BLS and SSA and between BLS and IRS. These agreements were the bases for sharing the microdata. The purpose of the match was to obtain more precise information on the differences and similarities in the coverage and content of the data in these systems.

Although the study was limited in scope, the results point to future work that is needed to understand the various establishment microrecord systems. Further studies are also needed in the context of possible future sharing of microrecords.

The Employer Reporting Unit Match Study Work Group was chaired by Warren L. Buckler of the Social Security Administration, Department of Health and Human Services.


                 Members of the ERUMS Workgroup
                Administrative Records Subcommittee

(November 1989)

Warren Buckler*, Chair, Social Security Administration

Lois Alexander, Social Security Administration
Marlene Einstein, Bureau of Labor Statistics
Jerry Gates (observer), Bureau of the Census
Maria Gonzalez* (ex officio), Office of Management and Budget
Tom Grzesiak, Bureau of Labor Statistics
Tom Jabine, Committee on National Statistics
Ken LeVasseur, Bureau of Labor Statistics
Bruce Levine, Bureau of Economic Analysis
Tom Petska, Internal Revenue Service
John Pinkos, Bureau of Labor Statistics
Vern Renshaw, Bureau of Economic Analysis
Alan Zempel, Internal Revenue Service

* Member, Federal Committee on Statistical Methodology


                          ACKNOWLEDGEMENTS

This report represents the culmination of the collective efforts of many individuals who have been involved with the ERUMS project throughout the course of its development and implementation. A designated individual Workgroup member had the primary responsibility for each section of the report. In several cases, significant contributions were made by others, as shown below:

Section        Responsible author and other contributors

Exec. Sum.     Tom Jabine (CNSTAT)
Ch I           Tom Jabine (CNSTAT)
Ch II,A,1      Marlene Einstein (BLS), Ken LeVasseur (BLS), Karen Mainzer (BLS)
Ch II,A,2      Warren Buckler (SSA), Cheryl Williams (SSA)
Ch II,A,3      Alan Zempel (IRS), Charles Day (IRS)
Ch II,B & C    Tom Jabine (CNSTAT)
Ch II,D,1      Lois Alexander (SSA)
Ch II,D,2      Warren Buckler (SSA)
Ch III,A       Vern Renshaw (BEA)
Ch III,B       Tom Jabine (CNSTAT)
Ch IV,A        Vern Renshaw (BEA)
Ch IV,B        Tom Jabine (CNSTAT)

The data processing and tabulation preparation operations were performed at BLS by Marlene Einstein, assisted by Suzie Yen, and by Joel Packman at SSA. Tom Jabine, CNSTAT, developed the outline for the format of the report and served as contents editor. All of the current members of the Workgroup reviewed successive drafts, offered comments and suggestions, and approved this final report. In addition, a number of improvements to the preliminary draft that was submitted to the Federal Committee on Statistical Methodology (FCSM) resulted from comments and suggestions made by the principal reviewers for that committee, Tom Plewes and Bob Parker, and by Fritz Scheuren and Dan Kasprzyk.

The Workgroup would like to express its deep and sincere appreciation to all of the dedicated individuals who have been a part of this project. In addition to the current Workgroup members and other contributors to various sections of the report who have been previously cited, several former members of the Workgroup, as well as other staff of the participating agencies, are to be recognized for their contributions. This group includes: Brian MacDonald, Linda Hardy, Michael Searson, John Pinkos, E.J. Filardi and Alan Tupek of the Bureau of Labor Statistics; Jackie Veach, Linda Dill, Cres Smith, Barry Bye, and Shirley Piazza of the Social Security Administration; Fritz Scheuren of the Internal Revenue Service and Alfred Nucci of the Census Bureau.
The Workgroup would also like to express its appreciation to Maria Gonzalez for her patience, sound advice, and the guidance she provided throughout the project and to Tom Plewes for his unwavering support and constant encouragement for the work we were doing.


                         TABLE OF CONTENTS

EXECUTIVE SUMMARY

CHAPTER I. INTRODUCTION

   A. Background
   B. Prior activities of the FCSM
   C. Goals of the ERUMS project
   D. Organization of this report

CHAPTER II. STUDY DESIGN AND EXECUTION

   A. Descriptions of systems and files
      1. BLS
      2. SSA
      3. IRS
   B. Sample design
      1. Design considerations
      2. The sample design adopted
   C. Sample selection and matching procedures
   D. Administrative arrangements
      1. Confidentiality protection and interagency agreements
      2. Working arrangements and schedule of operations

CHAPTER III. RESULTS

   A. Substantive Results
      1. Introduction
      2. Distribution by final match status
      3. Characteristics of matched cases
      4. Characteristics of nonmatched cases
      5. SSA's Establishment Reporting Plan
      6. Results of matching BLS and SSA industry codes to IRS industry codes
   B. Limitations of the Design and Execution
      1. Limitations of the generality of the study findings
      2. Interagency differences in concepts and coverage
      3. File deficiencies and operational problems

CHAPTER IV. FINDINGS AND RECOMMENDATIONS

   A. Findings
      1. Relative coverage
      2. Multi unit employers: acquisition and updating of reporting unit information
      3. Content differences for matched units
      4. The role of IRS records in the matching process
      5. Feasibility of interagency matching of employer and establishment records
   B. Recommendations
      1. Introduction
      2. Recommendations to SSA and BLS
      3. Future matching studies

REFERENCES

APPENDIX A. TABLES
APPENDIX B. INTERAGENCY AGREEMENTS


                          LIST OF EXHIBITS

Text

   IIA-1   Application for Employer Identification Number (Form SS-4)
   IIA-2   Employer's Annual Federal Unemployment (FUTA) Tax Return (Form 940)
   IIA-3   Employer's Quarterly Federal Tax Return (Form 941)
   IIB-1   Summary of the ERUMS Sample Design
   IIC-1   ERUMS Project Overview
   IIC-2   SSA Phase I Operations
   IID-1   ERUMS Project Timetable

Appendix

   B-1     Agreement Between Statistics of Income Division, Internal Revenue Service and Bureau of Labor Statistics, Department of Labor
   B-2     Agreement Between SSA and BLS For Exchange of Statistical Information in Employer Reporting Unit Match Study (ERUMS) Pilot Project

(Note: Attachments to the above agreements (B-1, B-2) are not included with this report, but are available upon request.)


                           LIST OF TABLES

Text

   IIC-1    Phase I Sample Counts by Stratum
   IIC-2    Phase II Sampling Intervals and Sample Sizes
   IIIA-1   Distribution of EINs by final match status
   IIIA-2   Distribution of active BLS EINs by final match status
   IIIA-3   Distribution of active SSA EINs by final match status
   IIIA-4   Distribution of EINs by single/multi and match status
   IIIA-5   Distribution of matched SSA and BLS single units by geographic and SIC match status
   IIIA-6   Distribution of EINs not in 1982 UI File by 1982 IRS/SSA status
   IIIA-7   Status of SSA employers included in the Multi Unit Code File (MUCF)
   IIIA-8   Distribution of matched BLS and SSA single units by result of match of their SIC codes to IRS's at the two-digit level

Appendix

   A-1      Distribution of EINs by single/multi and match status (original classification)
   A-2(a)   Match results for single BLS/single SSA cases, based on final classification (unweighted)
   A-2W(a)  Table A-2(a) weighted to 1st stage sample
   A-2(b)   Horizontal % distribution of Table A-2(a)
   A-2W(b)  Horizontal % distribution of Table A-2W(a)
   A-2(c)   Vertical % distribution of Table A-2(a)
   A-2W(c)  Vertical % distribution of Table A-2W(a)
   A-3(a)   Match results for single BLS/no SSA wage report cases, based on final classification (unweighted)
   A-3W(a)  Table A-3(a) weighted to 1st stage sample
   A-3(b)   Horizontal % distribution of Table A-3(a)
   A-3W(b)  Horizontal % distribution of Table A-3W(a)
   A-3(c)   Vertical % distribution of Table A-3(a)
   A-3W(c)  Vertical % distribution of Table A-3W(a)


                         EXECUTIVE SUMMARY

Introduction (Chapter I)

The Employer Reporting Unit Match Study (ERUMS) was a pilot record linkage study carried out under the auspices of the Federal Committee on Statistical Methodology (FCSM), Office of Management and Budget. The study linked records of employers and their reporting units from three agencies: the Bureau of Labor Statistics (BLS), the Social Security Administration (SSA) and the Internal Revenue Service (IRS). The primary linkages involved samples of the agencies' records for employers in the State of Texas, covering their activities in 1982.

The ERUMS project was planned and carried out by an interagency workgroup under the general guidance of the Federal Committee on Statistical Methodology. Planning began in 1983 and the project operations were completed in 1989. The motivation for ERUMS came from earlier work of the FCSM Subcommittee on Statistical Uses of Administrative Records, which had determined that effective and efficient statistical uses of administrative records were being hampered by the existence of noncompatible systems for reporting employer information at the establishment level.

The goal of ERUMS was to demonstrate the feasibility of matching employer and reporting unit data from different agency record systems as a means of obtaining more precise information about differences in the coverage and content of the data in those systems. The study focussed on the BLS and SSA record systems, with employer-level data from IRS being used primarily to reconcile and explain BLS-SSA differences. It was expected that ERUMS, as a demonstration study, would provide valuable experience with the technical aspects of data linkage and the administrative requirements for gaining access to the data and carrying out the matching operations.


The record systems that were linked (Chapter II, Section A)

The primary source of data for ERUMS from BLS was the first quarter 1982 Unemployment Insurance (UI) Address File. For each State, the UI Address File contains data for individual employers and their reporting units, which are often but not always equivalent to establishments. The data for this file are submitted annually (more recently quarterly) to BLS by the State employment security agencies that operate the Federal-State UI Program. The BLS uses the data submitted by the States as a basis for periodic statistical reports on employment and wages and uses the UI Address File as a national sampling frame for its establishment surveys.

The principal SSA files used for ERUMS were files developed for statistical uses within SSA. They included an edited file of Form W-3 annual wage reports for 1982 and the Single Unit and Multi Unit Code Files. The Form W-3 file provided wage data for individual employers and, in some cases, for each of their reporting units, which are frequently but not always equivalent to establishments.
The Single Unit Code File, which is updated annually, contains a record for every entity that has filed an application for an Employer Identification Number (EIN), excluding non-employing entities and household employers. The Multi Unit Code File contains a record for each reporting unit of multi unit employers who are participating in the Establishment Reporting Plan, a voluntary program under which employers report their annual wage information on Form W-3 separately for each of their reporting units.

The main source of IRS data used for ERUMS was a Census-edited file based on Forms 941 and 943 for Tax Years 1981-83. These forms are used by employers to report each quarter (annually for Form 943) to IRS on income taxes withheld from wages and other payments to employees and on taxes under the Federal Insurance Contributions Act (Social Security taxes). Extracts of data from these forms are provided annually by IRS to the Census Bureau for use in the latter's County Business Patterns Program and other statistical purposes. The Census Bureau edits the files to use the best available industry code for each employer and impute certain missing data. A copy of the edited file has been made available to the IRS Statistics of Income Division for use in its statistical programs. Data from this Census-edited file were obtained for most of the employers in the Phase II ERUMS sample (see below). In addition, copies of Form 940, Federal Unemployment Tax Return, for 1982 or 1983 were obtained for a substantial proportion of the Phase II sample cases.


The study design (Chapter II, Sections B and C)

Because of the ERUMS Workgroup's limited resources, the study was restricted to a single State, Texas, and a small sample of employers and their reporting units from that State. The sampling unit was the employer, identified by a unique EIN. A probability sample of all EINs active in the State of Texas in 1982 was selected from the BLS and SSA files described above.
Employers were considered to be active in the BLS system if they had one or more records in the 1982 UI Address File and in the SSA system if they had filed a W-2/W-3 wage report for 1982.

The sample was selected in two phases. The sampling fraction for Phase I was 6 in 100, and the selection was based on the 7th and 8th digits of the EIN. The BLS sample, which was selected first, contained 16,336 distinct EINs. The BLS sample was compared to the SSA files and an additional sample of 3,628 EINs was selected (using the same pairs of digits); these EINs had at least one Texas reporting unit, had wage reports for 1982 and did not appear in the 1982 UI Address File. The Phase I sample EINs were stratified by match status (match, SSA only, BLS only) and single/multi unit status. A Phase II sample of 401 EINs was selected from the Phase I sample, using disproportionate stratified sampling, with equal probability systematic selection within each stratum. Nonmatch and multi unit EINs were oversampled in Phase II because of their greater interest for the purposes of ERUMS.

The Phase II sample provided the basis for the detailed analyses presented in this report. For matched cases, BLS and SSA geographic and industry codes were compared. The industry codes from both sources were compared with those in the IRS/Census-edited Form 941 file. The status of unmatched EINs was clarified by reviewing additional data sources in the agency for which the EIN did not show up in the initial match. Several of the EINs not located initially in the SSA edited 1982 W-3 file were found among groups of delinquent reporters or cases for which the W-2/W-3 wage report and IRS Form 941 data were being reconciled. In addition, several of the Phase II sample employers originally classified as SSA multi unit were reclassified as single unit because it could not be established that they reported 1982 wages for two or more reporting units in Texas.
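
The Phase I selection rule and the stratification by match status described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Workgroup's actual procedure: the summary does not say which six of the 100 possible digit pairs were used, so the set `PHASE1_PAIRS` is hypothetical, as are the function names and the sampling interval passed to `systematic_sample`.

```python
import random

# Hypothetical digit pairs. The report states only that 6 of the 100 possible
# values of EIN digits 7-8 were selected, not which ones.
PHASE1_PAIRS = {"06", "22", "38", "56", "71", "94"}


def digits_7_8(ein):
    """Return the 7th and 8th digits of a 9-digit EIN (any hyphen is ignored)."""
    digits = ein.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"not a 9-digit EIN: {ein!r}")
    return digits[6:8]


def in_phase1(ein, pairs=PHASE1_PAIRS):
    """Phase I rule: keep the EIN if its 7th-8th digit pair is a selected pair."""
    return digits_7_8(ein) in pairs


def match_status(in_bls, in_ssa):
    """Classify an active EIN into one of the three Phase I match strata."""
    if in_bls and in_ssa:
        return "match"
    if in_ssa:
        return "SSA only"
    if in_bls:
        return "BLS only"
    raise ValueError("EIN is not active in either system")


def systematic_sample(units, interval, rng):
    """Equal probability systematic selection within one stratum:
    a random start in [0, interval), then every interval-th unit."""
    start = rng.randrange(interval)
    return units[start::interval]
```

Selecting on fixed digit positions gives the same 6 in 100 subset in every file, which is what allows the BLS and SSA samples to be compared EIN by EIN. Using a larger interval in some strata than in others is one way to obtain the disproportionate Phase II allocation.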
As a result of these reviews and changes, the final distribution of the sample EINs by match status and single/multi unit classification differed substantially from the preliminary distribution of the Phase II sample.


Administrative arrangements (Chapter II, Section D)

For the ERUMS Workgroup to gain access to the data sets needed for the study, it was necessary to develop working arrangements that complied with the provisions of confidentiality statutes, regulations and policies of the Federal and State agencies that controlled these data sets. After protracted negotiations, this was accomplished primarily through the development of two bilateral agreements (shown in Appendix B).

In one of these agreements, the IRS contracted with BLS for the performance of those parts of the ERUMS project that required access to tax data, including the wage report information that was to be provided by SSA. Under this agreement, SSA staff could be designated as special agents of BLS to carry out their part of the linkage and analysis operations. By law, the purposes of IRS participation in the project and its service contract with BLS had to be related to IRS administration of the tax laws.

The second agreement was a conditions of use agreement between SSA and BLS which allowed SSA to release relevant data from its employer files to BLS and authorized BLS to link data from these files with data from the UI Address File and certain data to be furnished by IRS, and prohibited any other linkage. Both agreements incorporated several safeguards, with emphasis on limiting access at each stage of the project to those persons who needed to use identifiable data, keeping the number of such persons to a minimum and having them sign non-disclosure affidavits.

To meet the statutory confidentiality requirements of the State of Texas, BLS obtained the permission of the Texas State Employment Commission to use the 1982 Texas UI Address File microdata for the ERUMS study.


Results (Chapter III,A)

All results based on the ERUMS sample are estimates weighted to account for the disproportionate sampling used in the selection of the Phase II sample, unless otherwise noted. The main quantitative results are shown in Tables IIIA-1 through IIIA-8 at the end of Section III,A.

Of the Texas EINs that were active in 1982 in the BLS or SSA systems, 67.1 percent were active in both systems, 27.6 percent were active only in the SSA system and 5.3 percent were active only in the BLS system (Table IIIA-1). Only about 1.0 percent of all active EINs were classified as multi unit in one or both systems, and most of these were classified as multi unit only in the BLS system (Table IIIA-4).

For the matched single unit EINs, i.e., those that were active in both systems, an estimated 81.6 percent had the same State and county codes in both systems. The remaining cases were about equally distributed in three categories: same State, different county; same State with no county code in the SSA file; and different State (Table IIIA-5). An estimated 70.2 percent of the matched single unit cases had the same two-digit industry codes. About half of the remaining cases were not classified by industry in the SSA system (Table IIIA-5). When matched against the IRS/Census-edited Form 941/943 file, about three-fourths of the matched single units from both the BLS and SSA files had two-digit industry codes that agreed with those in the IRS/Census file. However, when the SSA unclassified cases were excluded from this comparison, the proportion of SSA cases that agreed with the IRS/Census two-digit code was somewhat greater than the corresponding proportion for the BLS matched single unit cases (Table IIIA-8).
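
The weighting noted at the start of this section can be illustrated with a short Python sketch. It assumes that each Phase II case is weighted by the sampling interval of the stratum from which it was drawn; the intervals and the four sample cases below are made up for illustration (the actual intervals are those of Table IIC-2, which is not reproduced here), and the names `Case`, `INTERVALS` and `weighted_shares` are hypothetical. Because the Phase I sampling fraction was the same for all EINs, it cancels out of estimated shares and is omitted.

```python
from collections import namedtuple

# One Phase II sample case: the stratum it was drawn from and its final match
# status after review (the two can differ because of reclassification).
Case = namedtuple("Case", ["stratum", "final_status"])

# Hypothetical Phase II sampling intervals (1 case kept out of every k).
INTERVALS = {"match": 100, "SSA only": 40, "BLS only": 10}


def weighted_shares(cases, intervals):
    """Estimate the share of EINs in each final status, weighting each case
    by the sampling interval of the stratum it was selected from."""
    totals = {}
    for case in cases:
        weight = intervals[case.stratum]
        totals[case.final_status] = totals.get(case.final_status, 0) + weight
    grand_total = sum(totals.values())
    return {status: total / grand_total for status, total in totals.items()}


# Made-up sample: four cases, one of which was drawn as "SSA only" but was
# found on review to be a match.
sample = [
    Case("match", "match"),
    Case("match", "match"),
    Case("SSA only", "match"),
    Case("BLS only", "BLS only"),
]
shares = weighted_shares(sample, INTERVALS)
```

The weight stays tied to the stratum of selection even when a case is reclassified, since the selection probability was fixed at the time the sample was drawn.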

Only a few EINs (nine sample cases) were classified as multi unit in both the BLS and SSA systems. Matching individual reporting units for these cases proved to be difficult. Overall, the nine sample employers had 105 Texas reporting units in the BLS system and 60 in the SSA system for 1982.

Of the active SSA EINs not found in BLS's first quarter 1982 UI Address File, it was estimated that 69.2 percent had reported no first quarter employment to IRS on Form 941 and therefore would not normally be expected to appear in the BLS system (Table IIIA-6). For another 10 percent of these employers, the analysis suggested that they may not have met requirements for UI coverage in Texas either because they had no operations in Texas, because of nonprofit status or because their payrolls were too small. For the remaining 20 percent, the reasons for their absence are not always clear, but the absence may have resulted in part from lags in incorporating new employers in the UI State agency and BLS files.

Most of the employers who were included in the 1982 UI Address File but did not file 1982 W-2/W-3 wage reports (22 sample cases) appeared to have ceased hiring employees, gone out of business, or gone through other changes that altered their reporting to IRS and SSA. Half of the employers in this group reported no employment in the 1982 UI Address File. Many of the remainder had filed their final Form 941 with IRS (at least for the period 1981-1983) for a quarter in 1981.

An analysis of the sample EINs that appeared in SSA's Multi Unit Code File provided some indication of the extent to which multi unit employers were participating in SSA's Establishment Reporting Plan (ERP) in 1982 (Table IIIA-7). An estimated 35.9 percent of these EINs had been incorrectly added to the Multi Unit Code File as the result of a processing error that has since been corrected.
Most of the remaining employers had initially agreed to participate in the ERP, but more than half of this group did not provide separate data for each reporting unit in their W-3 wage reports for 1982.


Limitations of the study (Chapter III,B)

Several factors limit the broad applicability of the ERUMS findings. The results reflect the reporting requirements and operating procedures associated with the agency record systems in 1982. There have been significant changes since then. In particular, BLS has taken several steps to improve the timeliness, completeness and accuracy of data in its UI Address File.

The study was based on data for a single State, Texas, and on a small sample of employers and reporting units. The UI system gives the States some latitude in their record-keeping practices, so indications of the coverage of employers in the record systems of the Texas State Employment Agency in 1982 should not be assumed to apply fully to the UI systems of other States at that time. The small sample size means that estimates based on the Phase II sample are subject to relatively large sampling errors. Because of limited resources and the complexity of the Phase II sample design, we were able to compute sampling errors only for a few key estimates (see Table IIIA-4).

The analysis of the results was complicated by differences in concepts and coverage in the record systems used in the study. These differences occurred in the basic filing requirements for the UI and SSA/IRS systems, the time reference of the basic BLS and SSA files used for matching, the definition of reporting units in the BLS and the SSA/ERP systems, and the structures of the BLS and SSA industry classification systems. In addition,
About 1.3 percent of the records in the 1982 UI Address File for Texas did not have EINs and therefore were not included in the Phase I sample of EINs from that file. In the SSA files, a significant proportion of employers lacked county and industry codes. The most serious problem was that a high proportion of multi unit employers were not reporting separately in 1982 for each reporting unit, so that we were unable to do a thorough comparison of reporting units for multi unit employers active in both the BLS and SSA systems.

Although these differences and file deficiencies made the analyses more difficult, the fact that we succeeded in identifying and documenting them is an indication that the ERUMS project succeeded in its main goal, which was to demonstrate the feasibility of doing matching studies as a means of evaluating the suitability of administrative record systems for statistical uses.

The data on amounts of employment and payroll available from SSA, BLS and IRS files were used in reviewing the unmatched sample cases and trying to understand why they were not present in both SSA and BLS files. However, the employment and payroll data were not added to the data file for the 401 sample EINs that were used to develop the estimates presented in this report. Therefore, all of the results shown are estimates of numbers of employers or reporting units, classified by attributes such as match status, and geographic and industry codes in the different systems included in the study. We did not attempt to estimate what proportions of aggregate employment or payroll were accounted for by employers who were unmatched or had different geographic or industry codes.


Findings (Chapter IV,A)

The detailed analyses of the ERUMS data did not suggest that large numbers of employers who report wages in one of the payroll tax systems were failing to report in the other system when they should have been.
They do, however, suggest that late reports and different procedures for processing the reports in the two systems created potential problems for using both of the systems' data files for statistical purposes.

Perhaps the clearest finding was that it is not possible to maintain a usable establishment reporting unit plan for multi unit employers in the absence of systematic procedures for monitoring employer reporting and updating files for changes in the number, location and industry of each employer's reporting units. SSA's Establishment Reporting Plan clearly lacked the necessary resources to do this in 1982 and there is no reason to think that the situation has improved since then.

There was a moderately high but by no means perfect correspondence between county and two-digit industry codes for single unit employers included in both the BLS and SSA systems. A substantial proportion of the differences arose from the absence of county or industry codes in the SSA system. Comparisons of industry codes at the three and four-digit level were not attempted because of the differences in the industry classification systems used by the two agencies.

With some qualifications, we were successful in matching the records of employers, as defined by their EINs, in different systems. However, we were not successful in matching BLS and SSA records for reporting units, the main reason being the incompleteness of SSA's data for reporting units provided under the voluntary ERP. Other reasons were the lack of a common identifier, analogous to the EIN at the employer level, for reporting units and the slight differences in the reporting unit definitions used by BLS and SSA.

We learned what we believe are some important lessons for others who may wish to match business records from different agency sources, whether for research or operational purposes.
First, the plans and the necessary interagency agreements should be developed well ahead of the earliest date at which the files to be linked are expected to be available. In particular, the development of interagency agreements for the exchange of identifiable records is a painstaking process and considerable time may be needed for their completion and approval.

Second, successful matching requires in-depth knowledge of all of the record systems involved and of the specific files that exist within those systems. An interagency team approach, with full exchange of information, is essential because there is unlikely to be a single individual who has all of the necessary information, even for the files of a single agency.

Finally, whenever possible, it is essential to pretest matching procedures before embarking on large-scale operational applications.


Recommendations (Chapter IV,B)

ERUMS was designed primarily as a demonstration project and was therefore limited in its coverage and scope. Nevertheless, the Workgroup believes that the study results, along with other information acquired in the course of the study, justified the inclusion in its report of five formal recommendations addressed specifically to the BLS and SSA record systems for employers and reporting units. These recommendations were:

1 - SSA should undertake a full review of the current status and uses of the Establishment Reporting Plan and decide either to continue it with adequate resources for maintenance and improvement of quality or to discontinue it entirely.

2 - BLS should review the State Employment Security Agencies' procedures for identifying employer births (including those resulting from mergers and changes of organization) and seek ways of reducing the apparent lag between filing of applications for EINs and inclusion of new employers on State Agency and BLS lists used as frames for statistical surveys and reports.
3 - Data in the UI Address File on employment and wages paid should be labelled to distinguish imputed data from data reported by employers.

4 - The EIN should be identified as a key item in the UI Address File and efforts should be made to achieve 100 percent reporting initially and current reporting of changes in EINs.

5 - BLS and SSA (if it continues the Establishment Reporting Plan) should strive to obtain data from employers for their establishments as defined in the 1987 Standard Industrial Classification (SIC) Manual. Both agencies should code industry for all establishments, without exception, at the 4-digit SIC level of detail. Whether or not the Establishment Reporting Plan is continued, SSA should code all employers identified on Forms SS-4 at the 4-digit level of detail.

In a broader context, the ERUMS Workgroup concluded that current efforts to collect economic data at the establishment level are dispersed among Federal and State agencies, are poorly coordinated, and place unnecessary burden on employers. The Workgroup believes that further, more intensive and extensive interagency matching studies have an important role to play in resolving these problems and in determining the possible effects on statistical programs of prospective major changes in administrative reporting systems for employers. We therefore recommend that:

6 - Further matching studies should be directed at acquiring information that will support the eventual development of a mandatory reporting system to meet the needs of all Federal and State statistical programs for establishment lists, including SIC codes. An interim goal should be that all agencies requiring or requesting employers to provide data at the establishment or reporting unit level adopt common definitions of units and data items to be submitted for these units.

Three agencies -- the BLS, the Census Bureau and the National Agricultural Statistics Service -- play a dominant role in the direct collection of establishment-level economic data. Recent initiatives of these agencies, under the general guidance of OMB's Statistical Policy Office, have been directed at greater coordination of their respective list-building and maintenance activities. Further integration of business lists will require fuller understanding of the similarities and differences of the three systems, based on matching of individual establishments and reporting units in the different systems.


CHAPTER I - INTRODUCTION

This working paper is a report on the Employer Reporting Unit Match Study (ERUMS), a pilot record linkage study carried out by Federal agencies under the auspices of the Federal Committee on Statistical Methodology, Office of Management and Budget (OMB). The report describes the design, procedures and findings of the study and presents recommendations based on the findings.

The study linked records of employers and their reporting units from three agencies: the Bureau of Labor Statistics (BLS), the Social Security Administration (SSA) and the Internal Revenue Service (IRS). The primary linkages involved samples of the agencies' records for employers in the State of Texas, covering their activities in 1982.

The study was designed, and most of the work undertaken, by the ERUMS Workgroup, whose members represented the three agencies whose records were linked, plus the OMB, the Bureau of Economic Analysis and the Committee on National Statistics, which has had a continuing interest in encouraging more effective statistical uses of administrative records. Bureau of the Census representatives attended many of the Workgroup meetings as observers. The ERUMS Workgroup reported periodically to and received guidance from the Federal Committee on Statistical Methodology (FCSM).
The chair of the FCSM attended most of the Workgroup meetings.


A. Background

Establishment-based economic and business statistics in the United States are derived in large part from reporting systems developed to administer the Federal Income Tax and Social Security systems and the Federal-State Unemployment Insurance system. BLS statistical series on employment and total wages are a by-product of administrative reporting systems established at the State level to support the Unemployment Insurance (UI) system. SSA uses information derived from records of employer taxes on earnings to classify persons included in its Continuous Work History Sample by industry and place of work. IRS uses samples of income tax and information returns for corporations, partnerships and sole proprietors to produce annual data for these units in its Statistics of Income program. The Census Bureau uses data from business tax returns for small units in lieu of direct data collection from these units in the quinquennial economic censuses and as a source of current employment and payroll data for its County Business Patterns Program.

In addition to their direct uses for statistical purposes, these administrative reporting systems provide lists of business units (sometimes called frames) that are used by statistical agencies, primarily the BLS and the Bureau of the Census, to determine which units to cover in periodic censuses and current surveys of economic establishments.

The extensive use of data from these administrative reporting systems for statistical purposes is cost-effective and reduces the reporting burden on business. However, use of administrative records also has its problems. A primary difficulty is that reports by businesses for administrative purposes are generally needed only at aggregate levels.
Reports of earnings to IRS and SSA for the Social Security system are for employers, i.e., all activities covered by a unit with a single Employer Identification Number (EIN). Employer reports of earnings to a State employment security agency for the Unemployment Insurance system frequently cover all activities by the employer (EIN unit) in that State. Likewise, reports submitted by employers to IRS on Form 940 under the Federal Unemployment Tax Act provide aggregate data, by State, on covered wages.

Data at this level of aggregation have limited value for statistical analyses. Many corporations and employers have activities in several different locations and in several different categories of industry. Detailed statistical analysis of economic activity calls for information on inputs and outputs at the establishment level, i.e., separate data for each kind of economic activity at each physical location. The establishment, as formally defined by OMB, is the basic reporting unit for the Census Bureau's economic censuses and surveys.

To meet the need for establishment-type data, both BLS and SSA have developed voluntary statistical reporting systems to supplement their administrative reporting systems. BLS has a statistical reporting program, mandatory in 20 States and voluntary in the rest, under which employers submit quarterly reports to State employment security agencies with quarterly wage and monthly employment information by reporting unit. This information is used with data on single-establishment firms to update BLS' Universe File, which is its frame for establishment surveys.

SSA has its voluntary Establishment Reporting Plan, under which participating employers filing their annual reports of earnings covered by Social Security provide separate information for each reporting unit. In 1982, the reference year for this study, the SSA reporting unit definition was similar to but not exactly the same as the one used by BLS.
Both differed significantly from the OMB establishment definition used by the Census Bureau in its statistical programs. There are also some differences in how each of the agencies has adapted OMB's Standard Industrial Classification for use in its own statistical programs (OMB, 1984; Jabine, 1984). To meet its own requirements, the Census Bureau conducts an annual survey, the Company Organization Survey, to collect current information about the location and activities of the establishments associated with multi-unit employers. This information is used to update Census' Standard Statistical Establishment List (SSEL), which serves as the frame for all of its economic censuses and surveys.

There have been several studies comparing aggregate data on employment and earnings published by BLS, IRS, SSA and the Census Bureau (e.g., Bureau of the Budget, 1961; Bureau of Economic Analysis, 1972; Office of Federal Statistical Policy and Standards, 1980). As might be expected because of the differences in coverage and definition of the various administrative and statistical reporting systems, significant differences in data by industry and location have been observed in these studies. There have been few micro-level interagency comparisons of establishment-type data, especially in recent years. Those that have been undertaken (e.g., Bureau of the Census, 1965) have shown many differences in establishment reporting in the systems that were compared.

In summary, the effective and efficient use of administrative records for statistical purposes has been impeded by the existence of incompatible systems for reporting of employer information at the establishment level. Serious problems exist because of differences in coverage, reporting unit definitions, and industry classification systems. These differences lead to lack of comparability in the economic statistics produced by different agencies in our decentralized statistical system.


B. Prior Activities of the FCSM

The FCSM has been concerned with statistical uses of administrative records since 1977; several subcommittees and working groups have examined different aspects of this topic. The Subcommittee on Statistical Uses of Administrative Records (Office of Federal Statistical Policy and Standards, 1980) made a broad review of the quality of administrative data and their suitability for statistical applications. The Subcommittee recommended further efforts to: promote the use of standard identifiers, concepts and definitions in administrative reporting programs; identify and resolve problems of access to data in these systems for statistical applications; and establish government-wide coordination and support of relevant collection programs and research activities. A continuing Administrative Records Subcommittee was formed to pursue these goals.

Under the Administrative Records Subcommittee, an Establishment Reporting Work Group was formed early in 1981 to make a more detailed study of three major record systems: the Unemployment Insurance record systems maintained by the States under rules and procedures established by the Department of Labor; the annual W-2 and W-3 wage reports submitted by employers to SSA and used by both SSA and IRS for administrative purposes; and the Census Bureau's Standard Statistical Establishment List (SSEL), which serves as the frame for that agency's economic censuses and surveys. The Work Group succeeded in documenting the structural differences among these three systems but was unable, for various reasons, to undertake a planned record matching study to shed additional light on the factors contributing to statistical inconsistencies among the three systems. However, the final recommendation of the Work Group to do further work in this area was heeded and the ERUMS Workgroup was formed early in 1983 (Cartwright, Levine and Buckler, 1983).


C. Goals of the ERUMS Project

Members of the ERUMS Workgroup felt that little more could be done to develop detailed recommendations for improved establishment reporting without first obtaining more precise information, at the micro-level, about inconsistencies among the major administrative reporting systems. Therefore, the Workgroup determined that its main goal would be to conduct a pilot study based on matching of data from employer wage reporting and establishment reporting systems of BLS, IRS and SSA. The study would focus on differences between the BLS and SSA systems, with employer-level data from IRS being used primarily to reconcile and explain BLS-SSA differences. For full coverage of the major establishment-based statistical programs, it would have been desirable to include the Census Bureau's SSEL in the matched data set, but the predecessor workgroup had not been able to arrange to do this, and it was decided not to pursue this effort as part of the ERUMS project.

It was expected that ERUMS, as a pilot study, would provide valuable experience with both the technical aspects of matching data from the three systems and the administrative requirements for gaining access to the data and carrying out the matching operations. In short, ERUMS was planned as a learning experience, and that is exactly how it turned out. Members of the Workgroup, in addition to getting hands-on experience in interagency matching of employer and establishment records, gained new insights into the strengths and weaknesses of their own agencies' record systems.


D. Organization of this Report

Chapter II of our report describes the study design and execution. Section A provides a detailed description, for each of the three agencies, of the systems and files used in the ERUMS project. Because resources were limited, matching could only be done for a sample of units in one State. Section B describes the sample design.
The study design involved a relatively complex sequence of sample selection and matching operations; these are described in detail in Section C. Section D describes the administrative arrangements that were developed to gain access to identifiable records needed for ERUMS, to comply with the agencies' requirements for maintaining confidentiality of the records, and to carry out the various phases of the study.

Chapter III presents the statistical results of ERUMS and an evaluation of the design that was used and its execution. Findings and recommendations are presented in Chapter IV. Section A presents the Workgroup's interpretation of statistical and other results from the study, and Section B presents recommendations based on these findings. A list of references follows the text of the report. Detailed tables are included in Appendix A.


CHAPTER II - STUDY DESIGN AND EXECUTION

This chapter provides a detailed account of the design of ERUMS and how the study was carried out. The chapter has four sections. Section A describes the sources of the data for employers and reporting units that were matched. The data came from three agencies: the Bureau of Labor Statistics (BLS), the Social Security Administration (SSA) and the Internal Revenue Service (IRS). A subsection for each of these agencies provides a broad description of the programs requiring the administrative record systems used in the study, followed by a description of the specific data files that were used for the ERUMS project. The subsection on SSA records also discusses the relationship between the SSA and IRS records used in the administration of the Old-Age, Survivors and Disability Insurance programs.

Because of the limited resources available for ERUMS, the matching had to be done for a sample of employers, identified by their Employer Identification Numbers (EINs). Section B describes the design of the sample.
Section C provides a detailed account of the sample selection and matching procedures. Section D explains the administrative arrangements for the ERUMS project. Subsection 1 describes the formal interagency agreements that were developed to permit the necessary exchanges of identifiable records between agencies, subject to their confidentiality requirements. Subsection 2 describes the working arrangements for the project: meetings of the ERUMS Workgroup and the development and maintenance of a project timetable.

For a good understanding of the results presented in Chapter III, it is recommended that all readers look at Sections A and B of this chapter. Those not interested in the detailed procedures and working arrangements may then wish to proceed directly to Chapter III.


A. Description of systems and files

1. Bureau of Labor Statistics (The Unemployment Insurance System and Address File)

The Unemployment Insurance (UI) program was created by the Social Security Act of 1935 to provide temporary income assistance to workers who become involuntarily unemployed. The UI system is a social insurance program that covers employees of commercial and industrial employers, most State and local government employees, and employees of specified nonprofit organizations. Employees of the Federal Government are covered by the Unemployment Compensation for Federal Employees (UCFE) program. The UI and UCFE programs currently cover 97 percent of all wage and salary workers in the U.S.

The UI system covers, with certain exceptions, those employers with at least one employee on 1 day in each of 20 different weeks in a calendar year, or who paid $1,500 or more in wages in any quarter of the current or previous calendar year. Those workers not covered by UI fall into a number of different categories.
Agricultural workers are covered only if the employer has employed at least 10 workers in 20 weeks of the past or present calendar year, or has paid cash remuneration of $20,000 or more in any calendar quarter in the past or present year. Domestic workers employed in private homes, college clubs, or fraternities are covered only if their employer pays more than $1,000 in cash in any quarter for such services. Patients, student nurses, and interns employed by a hospital are excluded from coverage. Also excluded are self-employed persons; insurance agents working on commission; and students and spouses of students working for the school, college, or university where the student is enrolled. An officer of a corporation is considered an employee of the corporation and, therefore, is eligible for unemployment benefits unless the officer is unemployed due to the sale of the corporation and the officer was directly involved in the sale. The same holds true for members of partnerships and proprietors: they are covered unless they are unemployed due to the sale of their business and they were directly involved in the sale. A small number of State and local government employees are not covered, including elected officials, legislators, members of the judiciary, persons in policymaking and advisory positions, temporary emergency employees, and members of the State National Guard and Air National Guard. The extent of coverage discussed in this paragraph pertains to Texas in 1982; most States have similar, although not identical, provisions for coverage.

The UI program is authorized by both Federal and State laws. The U.S. Department of Labor (DOL) oversees the State UI programs and carries out the Federal obligation of financing the administration of the programs. While DOL ensures that each State's program complies with the minimum standards set by Federal law, each State is entitled to develop a program suited to its own conditions.
Each of the 50 States, as well as the District of Columbia, Puerto Rico, and the Virgin Islands, has enacted laws to determine its own tax structure, eligibility requirements, benefit levels, and coverage provisions. The administration of the UI program is the responsibility of the State Employment Security Agency (SESA) in each State.

The UI system is financed primarily through taxes assessed by both Federal and State governments on employers for wages paid to their employees. The provisions for the financing were established by the Federal Unemployment Tax Act (FUTA), Chapter 23 of the Internal Revenue Code. Currently, the gross FUTA tax is 6.2 percent of the first $7,000 per year paid to each employee ($434 maximum). (In 1982, the Federal taxable wage base was $6,000; it was increased to $7,000 in 1983.) States levy employer UI taxes at rates determined by State law. If the State tax rate is at least 6.2 percent, employers receive a 5.4 percentage point credit against the FUTA tax, resulting in a net Federal tax of 0.8 percent.

The Unemployment Insurance Address File is one of the statistical files produced under the Bureau of Labor Statistics (BLS) Federal/State ES-202 Program by the SESAs. The ES-202 Report (Quarterly Report on Employment, Wages, and Contributions) measures the extent of coverage under the various State Unemployment Insurance Programs. Its original use was to determine whether a State's program was in compliance with Federal law. The ES-202 Report represents the largest and most complete universe of monthly employment and quarterly wage information by industry, county, and State regularly available in this country. BLS funds and administers the ES-202 Program and provides conceptual, technical, and procedural guidance for all program activities.

The Unemployment Insurance Address File is a micro-level employer file prepared annually by each SESA.
It contains first quarter information for each reporting unit subject to Unemployment Insurance reporting requirements in the State. A reporting unit is the most detailed economic unit for which data are submitted by the employer to the SESA. An establishment is an economic unit, generally at a single location, which is engaged primarily in one activity. In the case of a single-establishment employer, the reporting unit and the establishment are identical. For many of the multi-unit employers, two or more establishments may comprise a single reporting unit. This can occur when the establishments are engaged in similar activities (i.e., are in the same industry) and are located in the same county, or when the employment in the secondary industries and/or counties is not significant (i.e., less than 50).

For any given quarter, typically about 10 percent of the reporting units show zero employment for all 3 months. Some of these zero employment figures are estimated (as discussed later in this report), although the great majority come from actual employer reports. (Some employers maintain an account even if no business is conducted during the quarter.) Data from some new businesses which came into existence during the first quarter may not be included in the UI Address File. This can occur if there is a substantial time lag between when the business started and when the employer submitted the completed status determination form (required from all newly established businesses) to the SESA (Grzesiak and Lent, 1988; Montana Department of Labor and Industry, 1987).

The 1982 Texas UI File examined by ERUMS contained a total of 270,612 unique accounts and a total of 303,582 individual records (reporting units). Of the 303,582 records, 4,020 had a blank or zero-filled Federal Employer Identification Number (EIN) and were ignored for the purposes of this study.
The accounts examined included 267,487 single-unit accounts (equal to 267,487 records) and 3,125 multi-unit accounts comprising 32,075 records.

The standardized UI Address File includes the following information for each reporting unit: name and address, State UI Account number, EIN, Standard Industrial Classification (SIC) code, Federal Information Processing Standards (FIPS) county code (township code for the New England States), ownership code, monthly employment levels for the payroll period including the 12th day of the month, and total quarterly wages.

Employer identifying information that enters the UI tax system, and eventually the UI Address File, is originally obtained from the initial status determination form. This form is used to collect information concerning the business name, location, ownership, anticipated number of employees, and primary product or activity. On the basis of this information, the employer is assigned an account number and the various codes by the SESA.

Each reporting unit in the UI File is assigned a four-digit industry code from the SIC Manual on the basis of its primary activity. The primary activity is determined by the primary good produced or distributed or the primary service provided. SIC code 9999 is assigned as a temporary holding code when there is insufficient information on the State's initial status determination form for assigning a specific industry code. Those reporting units assigned SIC code 9999 are requested to complete and return an SIC Refiling Form, with more detailed information, on a flow basis but no later than the next Annual Refiling Survey. There are a few exceptions to the 4-digit SIC coding requirement. Currently, States have the option to code employers in seven different 3-digit industry groups (representing 25 industries) to only the 3-digit level.
These exceptions were created because adequate employer records may not be available to code to the 4-digit level of detail or because reporting units in these industry groups frequently switch back and forth between 4-digit industries. These exceptions are as follows: SIC 074 (Veterinary services), SIC 078 (Landscape and horticulture services), SIC 152 (Residential building construction), SIC 154 (Nonresidential building construction), SIC 581 (Eating and drinking places), SIC 651 (Real estate operators and lessors), and SIC 721 (Laundry, cleaning, and garment services). SIC 421 (Trucking, local and long distance) and SIC 513 (Apparel, piece goods, and notions), comprising a total of eight industries, were also coding exceptions in 1982.

In addition to an SIC code, the reporting unit is also assigned an ownership code according to legal proprietorship denoting Federal, State, Local, or International government, or the private sector. A FIPS county code is assigned based upon the reporting unit's location or place of business. Besides the valid FIPS codes, there are additional codes which may be used: 996, 997, 998, and 999. County code 996 indicates a reporting unit located outside the U.S., Virgin Islands, and Puerto Rico but which reports to a SESA. County code 997 is assigned to reporting units with locations in more than one county but not Statewide. Reporting units located in a State other than the State to which they report are assigned county code 998. Finally, those reporting units with Statewide locations or unidentified locations are assigned county code 999.

To maintain accuracy of data on an ongoing basis, reporting units are asked to complete an SIC Refiling Form every 3 years to verify or update much of the identifying information (e.g., SIC, county, ownership) first collected on the initial status determination form or updated in the last Annual Refiling Survey.
One-third of the universe of employers is surveyed in each of the 3 years of the Annual Refiling Survey.

Employers subject to State Unemployment Insurance laws are required to complete quarterly contribution reports and submit them to the appropriate SESA. The information from the quarterly contribution report submitted by the employer for the first quarter is used in the preparation of the UI Address File. The contribution report provides current information on the name, address, and UI account number of an employer; monthly employment levels; total wages paid; taxable wages; and contributions (taxes). Multi-establishment employers are also asked (required in 20 States, but not in Texas) to complete a statistical supplement questionnaire for each quarter furnishing similar information for each of their reporting units. The SESA uses the data supplied on the contribution reports and statistical supplements to create the UI Address File.

The SESAs are responsible for editing and estimating data items missing from employer accounts. These data are missing because the employer either failed to complete all of the entries on the contribution report or statistical supplement or failed to submit a contribution report or statistical supplement altogether. Data missing from incomplete contribution reports and data for accounts delinquent 12 weeks after the end of the quarter are estimated. Estimates are generated for all delinquent accounts (including multi-establishments), unless the account is delinquent for two or more consecutive quarters. These delinquent accounts are contacted to determine if they are still active. Only if they are confirmed to be active are estimates prepared. Estimates are replaced on the State file when the actual data have been received and edited, but once estimated data items have been transmitted to BLS, they are not replaced with actual data.

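The estimation rule described above can be summarized as a small decision function. The following is only an illustrative sketch in Python: the record layout, the status labels, and the function name are assumptions made for this example and are not taken from any SESA or BLS processing specification.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical status labels; the report does not define a coding scheme.
COMPLETE = "complete"      # report received with all entries filled in
INCOMPLETE = "incomplete"  # report received, but some entries are missing
DELINQUENT = "delinquent"  # no report 12 weeks after the end of the quarter


@dataclass
class AccountStatus:
    report_status: str
    consecutive_delinquent_quarters: int = 0
    confirmed_active: Optional[bool] = None  # None means not yet contacted


def should_estimate(acct: AccountStatus) -> bool:
    """Return True if estimates would be prepared for this account."""
    if acct.report_status == COMPLETE:
        return False  # nothing is missing
    if acct.report_status == INCOMPLETE:
        return True   # missing entries are estimated
    if acct.report_status == DELINQUENT:
        if acct.consecutive_delinquent_quarters < 2:
            return True  # delinquent for a single quarter: estimate
        # Delinquent two or more consecutive quarters: the account is
        # contacted, and estimates are prepared only if it is confirmed active.
        return acct.confirmed_active is True
    raise ValueError(f"unknown report status: {acct.report_status!r}")
```

Under these assumptions, `should_estimate(AccountStatus(DELINQUENT, 2))` returns `False` until the account has been contacted and confirmed active.
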
Thus, the SESAs are responsible for editing and extracting data from their UI Tax file, collecting supplemental data, and maintaining the accuracy of the SIC and other codes for the UI Address File and ES-202 Report. After BLS reviews and edits the UI Address File transmitted by the State, that edited file is used to update the BLS Universe File. The Universe File is then used as a national sampling frame for BLS establishment surveys, including the Industry and Area Wage Surveys, Occupational Safety and Health Statistics, and Producer Price Index programs.

BLS is currently in a transitional period with respect to the UI Address File. For data through 1988, the SESAs were required to provide the UI Address File to BLS for only the first quarter of the year. Beginning with data for the first quarter of 1989, however, all States will be required to submit the file on a quarterly basis (6 months and 5 days following the end of the reference quarter). In addition, the UI Address File format will be expanded to contain supplementary information, including predecessor and successor UI Account and Reporting Unit numbers, expanded ZIP codes, address type indicators (e.g., physical location or corporate headquarters), multi-unit indicators, and telephone numbers.

Coinciding with the above improvement is the initiation of the new BLS Business Establishment List (BEL) Improvement Project (MacDonald, 1989). The fundamental goal of the BEL project is the collection of establishment level data, including physical location addresses for both single and multi-unit employers. These more detailed data will also be included in the UI Address File.


2. Social Security Administration

The Social Security Act of 1935 established a requirement that the Social Security Administration (SSA) perform the recordkeeping necessary to reflect accurately the earnings of workers in employment covered by the Act.
As amended in 1939, the Act required detailed information on the continuity of employment by calendar quarter and covered wage amounts. The accumulation of quarters of coverage and quarterly wage amounts are used as the basis for determining eligibility for and amounts of program benefits. The law originally required all workers in industry and commerce, except railroad workers, to be covered. This coverage has been broadened over the years and self-employment has been added. Now the only large segments of uncovered jobs are Federal civilian employees who have chosen to remain covered under the U.S. Civil Service Retirement system, and employees of State and local governments who are not covered by a Federal-State agreement. The program currently covers over 95 percent of wage and salary jobs and the self-employed.

The Old-Age, Survivors and Disability Insurance (OASDI) programs administered by SSA provide monthly benefits to retired and disabled workers and their dependents and to survivors of insured workers. Benefit payments are financed principally through taxes collected from employers, employees and the self-employed. Taxes are paid based on earnings up to an indexed statutory taxable maximum which began at $3,000 in 1937 and is $51,300 in 1990. The method chosen for collection of the taxes is through employer reporting which was required quarterly in the beginning of the program and annually beginning in 1978. In 1978, employer reporting of Social Security covered wages was combined with the existing W-2 (Wage and Tax Statement) income tax reporting that employers are required to complete for the Internal Revenue Service (IRS). Details of the reporting process are discussed below.

In 1937, SSA began a process to enumerate workers and employers to facilitate its record-keeping process. Workers received a Social Security Number (SSN) and employers received a nine-digit identification number to be used in the reporting process.
The worker identification information and subsequent wage reports became part of SSA's Summary Earnings Record. The employer information collected at the time of issuance of the identification number was made part of the Employer Registration File. In 1958, the IRS was given responsibility for issuing Employer Identification Numbers (EINs) and constructed a file called the Business Master File (BMF) that is currently used by SSA to identify employers. The employer information collected from the beginning of this enumeration process included geographic location and industrial activity. These particular items of information were not a direct part of SSA earnings processing, but were collected to help study the new emerging Social Security program. The additional information on employers evolved into a set of files used by SSA's Office of Research and Statistics (ORS) for special studies. These are the Single Unit and Multi Unit Code Files that are discussed below along with the employer wage-reporting system that provided the source of employer information used in the ERUMS project.

Prior to January 1978, employers filed their tax and wage reports with the IRS on a quarterly basis, using Forms 941 (regular) and 942 (household work), and annually using Form 943 (agricultural work). Attached to these forms were Schedules A showing the detailed amounts of wages for each employee by SSN. These Schedules A were used by SSA to post wages each quarter to the workers' earnings records. Public Law 94-202 (Combined Old Age Survivors and Disability Insurance Income Tax Reporting Amendments of 1975), enacted January 2, 1976, provided for annual, rather than quarterly, wage reporting.
These amendments were effective for tax years beginning

- 1978 for United States domestic employers (other than State and local governments),
- 1979 for employers in Guam, American Samoa, Virgin Islands, and Puerto Rico (other than State and local governments), and
- 1981 for State and local government employers.

Under the Combined Annual Wage Reporting process, employers continue to file Forms 941 and 942 quarterly and Form 943 annually with IRS, but no Schedule A is required. Instead, Forms W-2 are filed by the employer as the annual wage report for the employees. These reports, in the form of Copy A of the Form W-2, along with a copy of the employer transmittal and Form W-3, are filed with SSA annually on or before the last day of February in the year following the wage reporting year. Employers filing via magnetic media submit W-2 and W-3 data on electronic records plus transmittal Form 6559. In processing the Forms W-2/W-3, SSA performs the following functions: data entry, balancing the sum of the money fields on Forms W-2 to totals on the Form W-3, microfilming, posting the Forms W-2 data to the master earnings records of individuals, and transmitting the Social Security and income tax data to the IRS. In addition, SSA creates a W-3 tape file for purposes of reconciling differences between wage information reported to IRS and SSA and locating annual wage reports on the microfilm.

To ensure that SSA has received and accurately recorded all FICA wages (wages as defined by the Federal Insurance Contributions Act), SSA's W-3 file is compared with IRS's 941 records annually, in a process known as reconciliation. This is an electronic comparison of SSA-processed employer FICA wage totals with the amount of FICA wages on which employers have paid taxes to the IRS.
From this comparison, cases are identified in which IRS has a record of receiving taxes, but either SSA has no record of having processed an annual wage report (W-2/W-3s) or SSA's processed wage totals for the employer are less than IRS's. Some other reasons for cases to be in reconciliation are: 1) the employer sent IRS wage information using one EIN and the Forms W-2/W-3 that were sent to SSA were processed using different EINs; 2) the employer transposed or used an incorrect digit in the EIN; and 3) IRS and/or SSA miskeyed the EIN. SSA corresponds with the employers of these reconciliation cases in an attempt to resolve the discrepancies.

As a byproduct of the employer reporting system, SSA maintains files that are used in ORS statistical programs. The Single Unit Code File and Multi Unit Code File contain coded information on the employer's geographic location and industrial activity. These coding files are updated each year with data from a special version of the Form W-3 file, which has been edited to exclude certain records which are not required in ORS statistical operations (e.g., non-FICA, household employers, delinquent reports). The primary purpose of the Single and Multi Unit Code Files is to provide geographic and industry data for records of workers in statistical files, e.g., the Continuous Work History Sample (CWHS), which is the source of data for a variety of statistical studies and analyses, for revenue estimates, and for tables in publications of SSA program data and research reports.

The Single Unit Code File (SUCF) contains one record for each entity that has filed a Form SS-4, Application for an Employer Identification Number (EIN), with the exception of nonemploying entities (e.g., trust funds, fiduciaries and estates) and household employers.
EINs are assigned by the IRS and the forms are forwarded to SSA where they are coded for geography, industry, class (i.e., individual, corporation, partnership, etc.), employer size and reason for application. The geographic classification of the entity is based on the physical location of the business as provided by the employer on the Form SS-4; otherwise, the mailing address is used. When a location is not available, the entity is given a State code based on the Internal Revenue District (IRD) in which the number was issued (indicated by the first two digits of the EIN), and a statewide county code. The SSA has its own industry classification system based on the Standard Industrial Classification (SIC). In 1982, full four-digit SIC codes were used for most industries. There were exceptions for major groups 01 (agricultural production--crops) and 02 (agricultural production--livestock) and division J, public administration. For each of these three categories, SSA used only a single code. In addition, for 63 four-digit industries in other categories, "foldback codes" for groups of four-digit industries were used when there was insufficient information to assign a specific four-digit code.

The SUCF is an historical file that includes both active units (employers filing annual wage reports in the current tax year) and inactive units (employers no longer reporting annual wages, e.g., out of business). The file for the year ending December 1987 contained 21,325,091 EINs. It is updated annually with data from the coded Forms SS-4.

The Multi Unit Code File contains one record for each reporting unit of multi unit employers who are participating in a voluntary program, the Establishment Reporting Plan (ERP), conducted by the SSA. Excluded from the file are seasonal agricultural employers and Federal, State and local government employers.
Employers are identified for participation in the ERP when the Form SS-4 indicates that the employer has more than one place of business and 100 or more employees or an annual wage report is received for 100 or more employees. Eligible employers are requested to participate in the ERP by providing SSA with a Form SSA-5019 (List of Establishments or Reporting Units) on which the employer lists his establishments and assigns a four-digit unit number to each one. In addition, the employer must group his employees under these same unit numbers on his annual wage report. Forms SSA-5019 are coded for industry, geographic location, auxiliary units, non-profit coverage and employer size. Each unit is geographically classified based on either the physical location of a reporting unit or the countywide, Statewide or nationwide location of a payroll grouping. The industry classification used for the ERP coding of multi unit employers is also based on the Standard Industrial Classification. The Multi Unit Code File is an historical file which contained 33,957 EINs and 116,613 reporting units for the year ending December 1987. This file is updated on an annual basis with information from the coded Form SSA-5019.

For the ERUMS project, SSA provided records from the Single Unit and Multi Unit Code Files and the 1982 Form W-3 file. A detailed description of how these files were used in the project is included in Section C of this chapter.

3. Internal Revenue Service

Requirements to file the Form 940 for 1982

The Federal Unemployment Tax Act (FUTA) established a Federal-State unemployment compensation system financed by separate Federal and State payroll taxes on employers. Administrative funds are derived from the Federal payroll tax and benefits are paid mainly from State payroll taxes.

The Form 940 is the Employer's Annual Federal Unemployment Tax (FUTA) Return. A copy of the 1982 Form is shown as Exhibit IIA-2.
This is the form on which the employer reports the State, or States, where contributions are required to be made and the wage information necessary to compute the FUTA tax and the credit reduction for payments made to a State or States. In general, the form must be filed by every employer who either paid wages of $1,500 or more in any calendar quarter, or who had one or more employees for some part of a day in 20 different weeks.

Agricultural employers must file if they paid cash wages of $20,000 or more to farm workers during any calendar quarter, or employed 10 or more farm workers during some part of the day for at least one day during any 20 different weeks.

Households which paid wages of $1,000 or more in any calendar quarter for household work in a private home were also required to file. For this purpose, household work in local college clubs and in the local chapters of college fraternities or sororities is included.

For purposes of counting its employees, a partnership does not count its partners.

Employers are authorized to claim a credit for contributions to a certified State unemployment fund by the due date for filing the Form 940. For this purpose, State was defined to include Puerto Rico and the Virgin Islands. "Contributions" are payments that State law requires an employer to make to an unemployment fund. The credit can be claimed for these "contributions" only to the extent that they are not deducted or deductible from the employees' pay.

The forms are filed with the IRS at a service center determined by the location of the employer's principal business office or agency. Penalties are assessed for late filing or late deposit unless reasonable cause for the delay can be shown. There are also penalties for failure to file, failure to pay the tax or filing fraudulent returns.

For FUTA purposes "wages" and "employment" do not include every payment and every kind of service an employee may perform.
In general, payments excluded from wages and payments for services excepted from employment are not subject to tax. Examples include benefit payments for sickness or injury under a worker's compensation law, insurance plan and certain employer plans; certain family employment; certain fishing activities; noncash payments for farm work or work in a private home; and meals and lodging.

For 1982, only the first $6,000 in wages paid to an employee was used for the FUTA calculation. The Federal FUTA tax rate on this part of wages was 3.4 percent. Amounts in excess of the wage base were exempt from the FUTA calculation, but not necessarily from the State unemployment tax calculation. If a State's unemployment compensation program met the requirements of Federal law, employers in the State received a 2.7 percent credit against the 3.4 percent Federal FUTA tax for 1982. (For information on the current wage base and tax rates, see Section II,A,1.)

The Employer's Quarterly Federal Tax Return (Form 941) File

In order to facilitate the collection of Social Security and federal income taxes, employers are required to withhold some portion of each employee's wages, and to deposit that portion in a timely fashion to the credit of the Treasury. At the end of each calendar quarter, nonagricultural employers (excepting those who have only household employees) are required to file an Employer's Quarterly Federal Tax Return, Form 941 (Form 941E for employers who report only withheld income tax, such as certain State and local governments) with the IRS. The information on this form includes a record of their federal tax liability throughout the quarter, along with a summary of their employees' wages, tips, and other compensation which was subject to withholding, the amount of taxes withheld, a summary of wages subject to Federal Insurance Contributions Act (Social Security) taxes, and the Social Security tax paid.
Once a year, each employer is required to report the number of persons he employed during the week of March 12. A copy of the Form 941 for the first quarter of 1982 is shown as Exhibit IIA-3.

The Tax Years 1981-83 Form 941 File

Each year the IRS prepares an extract of its Forms 941 and 943 data for the Census Bureau. This extract contains Employer Identification Number, payroll, employment, industry, and legal form of organization information. The Census Bureau edits the payroll, employment, and industry data and makes any needed imputations. For Tax Years 1981-83, the IRS and Census agreed that Census would return the edited extracts to the Statistics of Income Division (SOI) of IRS. These edited extract files were the ones used for ERUMS. Definitions of the items in the files were as follows:

Employment- For purposes of income tax withholding, a common-law employee is defined as follows:

Under common-law rules, every individual who performs services that are subject to the will and control of an employer, as to both what must be done and how it must be done, is an employee. It does not matter that the employer allows the employee considerable discretion and freedom of action, so long as the employer has the legal right to control both the method and the result of the services.

Two of the usual characteristics of an employer-employee relationship are that the employer has the right to discharge the employee and the employer supplies the employee with tools and a place to work.

If an individual's relationship with an employer fits this description, then the employer is required to withhold federal income tax and Social Security tax from the employee's pay, and to report such withholding on Form 941.
Employees who fall into the following categories are defined as statutory employees:

1) A driver who distributes meat, vegetable, fruit, or bakery products or beverages (other than milk) or picks up and delivers laundry or dry cleaning, if the driver is the employer's agent or is paid on commission.

2) A full-time life insurance sales agent whose principal business activity is selling life insurance or annuity contracts, or both, primarily for one life insurance company.

3) An individual who works at home on materials or goods which an employer supplies and which must be returned to the employer or a person the employer names, if the employer also furnishes specifications for the work to be done.

4) A full-time traveling or city salesperson who works on the employer's behalf and turns in orders to the employer from wholesalers, retailers, contractors, or operators of hotels, restaurants, or other similar establishments. The goods sold must be merchandise for resale or supplies for use in the buyer's business operation. The work performed for the employer must be the salesperson's principal business activity if:
a) The service contract states or implies that almost all of the services are to be performed personally by the contractor;
b) The investment in the facilities (other than in facilities for transportation) used to perform the services is not substantially the individual's; and
c) The services are performed on a continuing basis.

Employers are required to withhold Social Security tax, but not federal income tax, from the wages of statutory employees. Individuals who are either common-law or statutory employees are to be reported as employees.

There is anecdotal evidence from exact match studies and from IRS audits that some firms, particularly in the oil and gas extraction industry, were not complying with these reporting rules in Tax Year 1982.
These firms attempted (illegally) to treat all of their employees as independent contractors for tax purposes; therefore, no taxes were withheld, and no Forms 941 were filed by these firms. No estimate of the number of such nonfilers is available, but the problem is believed to be of little significance in other industries.

Payroll- The payroll field on the extract comes from line 2 of Form 941. The instructions for this line read as follows:

Enter the total of: all wages paid, tips reported, taxable fringe benefits provided, and other compensation paid to your employees, even if you do not have to withhold income or Social Security taxes on it. Do not include pensions, annuities, third-party sick pay, supplemental unemployment compensation benefits, or gambling winnings, even if you withheld income tax on them.

Legal Form of Organization- The IRS maintains, as part of its computerized Master File system, a record for each business which files a Form 941. This same record also contains information on the other tax returns which the business files, if the returns are posted to the Business Master File (BMF). (Note that sole proprietors report their income on Schedule C attached to their Form 1040, which posts to the Individual Master File. Thus, while a sole proprietorship with employees is represented in the BMF as a Form 941 filer, it was not possible to positively identify it from the BMF as a sole proprietorship in Tax Year 1982.) A portion of this record contains entity information, for example, the name of the business, its address, its industry, and a set of codes indicating the type(s) of forms it is required to file. These filing requirement codes are a part of the Form 941 extract, and allow the identification of the legal form of organization of a business. A nonzero filing requirement code indicates that a business must file a form in the indicated series.
Filing requirement codes exist on the extract for Form 1120 (Corporation), Form 1065 (Partnership), and Form 990 (Nonprofit organization). As explained earlier, sole proprietorships are not directly identifiable from these codes, but few other types of entities may operate a business.

Industry- Each extract record sent to the Census Bureau contains an industry code assigned during IRS revenue processing. The IRS industry codes, while based on the Standard Industrial Classification (SIC), are considerably less detailed than those used by BLS and SSA. Four-digit codes are used; however, most of them represent groupings of several SIC four-digit industries. The particular groupings used differ by type of organization: corporation, partnership and sole proprietorship. In 1982, roughly 200 categories were coded separately for each of the three types of organization. As a part of its data editing process, Census assigns industry codes from the following sources in order of preference: 1) the most recent economic census, 2) the Census Bureau's Current Business Surveys, Annual Survey of Manufacturers, Company Organization Survey and County Business Patterns Program, 3) the Social Security Administration birth code based on the EIN application, Form SS-4, or 4) the original IRS industry code. Sources 1) and 2) are used only for single-establishment EINs. If only the original IRS code is available, Census uses a conversion program to convert it to a standard SIC format. In some such cases, SIC codes can only be assigned at the 2- or 3-digit level of detail. The codes used for ERUMS were the codes assigned by the Census Bureau. These codes were provided to IRS under the authority of the 1953 Opinion by Attorney General James P. McGranery, 41 Op. A.G.120. Under this Opinion, the Census Bureau can check industry classifications assigned by another agency against its own and either certify or correct the other agency's classifications.

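The order of preference that Census applies when assigning industry codes can be read as a simple fallback chain. The following is a minimal sketch of that reading only: the record layout, the field names, and the convert_irs_code stand-in for the conversion program are all hypothetical and are not taken from Census or IRS systems.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class EinRecord:
    """Hypothetical record layout; field names are illustrative only."""
    single_establishment: bool
    economic_census_sic: Optional[str] = None  # source 1
    census_program_sic: Optional[str] = None   # source 2
    ssa_birth_sic: Optional[str] = None        # source 3
    irs_industry_code: Optional[str] = None    # source 4


def assign_industry_code(
    rec: EinRecord,
    convert_irs_code: Callable[[str], str],
) -> Tuple[Optional[str], str]:
    """Return (code, source) using the stated order of preference.

    convert_irs_code is a caller-supplied stand-in for the conversion
    program mentioned in the text; its real logic is not described there.
    """
    # Sources 1 and 2 are consulted only for single-establishment EINs.
    if rec.single_establishment:
        if rec.economic_census_sic:
            return rec.economic_census_sic, "economic census"
        if rec.census_program_sic:
            return rec.census_program_sic, "census surveys and programs"
    if rec.ssa_birth_sic:
        return rec.ssa_birth_sic, "SSA birth code"
    if rec.irs_industry_code:
        return convert_irs_code(rec.irs_industry_code), "converted IRS code"
    return None, "unassigned"
```

Returning the source alongside the code keeps a record of how each classification was obtained, which may be useful when comparing codes across agencies.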
Improvements Subsequent to 1982

The greatest improvement in the Form 941 information is coming from changes in the data collection method. Census Bureau representatives report that the number of changes made during edit and imputation has fallen dramatically as the IRS has implemented scanning of paper documents and filing on magnetic media as an alternative to keying data from paper documents. Also, problems with firms attempting to treat employees as independent contractors (which caused employee data to be underestimated) have been greatly reduced through effective enforcement efforts.

B. Sample design

1. Design considerations

The criteria that governed the choice of a sample design for ERUMS were:

- The study should be limited to one State.
- Within the selected State, probability sampling procedures should be used.
- The sample size should take into account the resources available to the ERUMS Work Group for computer and manual matching and other processing activities.
- All units in the selected State that were active during the study reference period in either the BLS or SSA reporting systems should have a chance of selection.
- Cases of greater interest, for example, those found in only one of the two systems (unmatched cases) and those involving more than one reporting unit (multi units) should be oversampled.

ERUMS was a pilot study, designed to develop and test procedures for linking and comparing employer and reporting unit data from different administrative record systems. The agencies participating in the study could provide only limited staff time and other resources. These considerations dictated the Workgroup's decision to limit the study to one State and to a fairly small sample in that State.

Within the selected State, Texas, the use of probability sampling at all stages of selection provided two benefits.
It ensured that sample results could be used to produce unbiased estimates for the study population and it made possible estimation of sampling errors from the sample. Although we recognized that sampling errors would be relatively large for most estimates, we felt it would be useful, for both analytical and methodological purposes, to produce weighted estimates.

One possible approach to the study design would have been to select a baseline sample from a single agency system, say the BLS UI system, and search for the sample units from that system in the SSA and IRS systems. However, that approach would have failed to provide any information about units that were in the SSA and IRS systems, but not in the BLS system. It proved to be feasible to use a design that sampled both the BLS and SSA systems, so that units existing in either one of these systems but not in the other would be represented. The Workgroup decided that it was not feasible to sample the IRS system independently, given the complexity of the system and the administrative difficulties in gaining access to it for such a purpose. Therefore, the final sample does not represent any units that may have been included in the IRS system but not in the BLS and SSA systems. Units in the final combined BLS/SSA sample were matched against the IRS files described in Section A of this chapter, so that we do have IRS data for the BLS and SSA sample cases that were found in the IRS files.

The requirement that all in-scope units in the BLS and SSA systems should have a chance of selection was not completely fulfilled. Because the Employer Identification Number (EIN) was to be the primary basis for matching records in all three systems, the group of reporting units covered by a single EIN was chosen as the sampling unit for both the BLS and SSA systems. However, in the 1982 Texas UI Name and Address File, 4,020 reporting unit records (1.3 percent) out of a total of 303,582 did not have EINs.
These units were not included in the initial sample selection from the BLS UI file.

Oversampling of unmatched and multi unit cases was dictated by the exploratory nature of ERUMS. If proportional sampling had been used, about 70 percent of the sample cases (as it turned out) would have been matched single units, for which the processing was expected to be straightforward. The unmatched and multi unit cases were expected to present more difficulties and the Work Group wanted to have enough of these cases to learn what the situations were and to test methods of dealing with them.

2. The sample design adopted

The sample design and the matching procedures were closely interrelated. A summary of the sample design is presented here; details of the sample selection and matching procedures are given in Section C below.

The main steps in sample selection and matching were:

(1) Select samples of EINs from the BLS and SSA frames.
(2) Match each EIN in each agency's sample against the other agency's frame to determine whether it was included in that frame, i.e., whether it was a matched sample unit.
(3) From the combined samples after steps (1) and (2), select a subsample of EINs, with subsampling rates that varied, depending on initial match status and classification as a single unit or multi unit.
(4) Match the subsample units against selected IRS files and, for those located in the IRS files, add relevant IRS data to the data base for the subsample.

A key feature of the sample design was the use of a digital sampling procedure, based on EINs, in step (1). The EIN is a unique nine-digit number assigned to each employer. Sampling based on the final (9th) digit is not recommended because the nature of the issuance process has resulted in an excess of EINs ending in 0 and 5 (Harte, 1986). For this reason, we selected, from both the BLS and SSA frames, all EINs that had one of six randomly selected pairs of digits in the 7th and 8th positions.
Using the same sets of digits for both the BLS and SSA samples made it possible to complete step (2) by matching the two samples against each other, rather than by matching each sample against the other agency's complete frame.

The Workgroup decided that the final sample size should be about 400 matched and unmatched EINs and that about one-half of these should be EINs classified as multi unit in one or both systems. EIN counts obtained for the Texas UI File prior to the initial sample selection were:

Single unit    267,487
Multi unit       3,125
Total EINs     270,612

A sampling rate of 6 in 100 would produce an expected sample of about 188 multi unit EINs from the BLS frame; this was the rationale for using 6 out of 100 possible pairs of ending digits.

The initial sample selected by this method from the BLS and SSA frames contained a total of 19,964 EINs, of which 16,336 were selected initially from the BLS Texas UI file for 1982 and the remaining 3,628 were EINs from SSA's Single or Multi Unit Code Files that had all of the following characteristics:

- Wages reported for 1982.
- One or more reporting units in Texas shown on SSA's Single Unit or Multi Unit Code File.
- Not included in the BLS Texas UI file for 1982. (However, the employer could have been in the UI file without an EIN.)

All cases in the initial sample were then classified by match status and whether they were identified as single or multi unit EINs in the BLS and SSA files. On the basis of these classifications, 9 major strata were formed. Two of the strata that involved BLS multi unit EINs were subdivided, putting employers with 20 or more reporting units in a separate stratum in each case. Using varying sampling fractions, subsamples were selected from each of the 11 strata to produce a final sample of 200 EINs involving only single units and 201 EINs initially classified as multi unit by BLS, SSA or both.

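The arithmetic behind the choice of a 6 in 100 sampling rate can be checked directly from the Texas UI File counts given above. The short sketch below is illustrative only; the function and variable names are ours, and it computes expected sizes under the stated rate rather than reproducing the sample that was actually drawn.

```python
from fractions import Fraction

# 6 of the 100 possible digit pairs (00-99) in EIN positions 7 and 8.
SAMPLING_RATE = Fraction(6, 100)

# EIN counts for the Texas UI File, as reported in the text.
texas_ui_ein_counts = {"single_unit": 267_487, "multi_unit": 3_125}


def expected_sample_size(frame_count: int, rate: Fraction = SAMPLING_RATE) -> float:
    """Expected number of EINs selected when each has the same chance `rate`."""
    return float(frame_count * rate)


expected = {name: expected_sample_size(n) for name, n in texas_ui_ein_counts.items()}
expected["total"] = expected_sample_size(sum(texas_ui_ein_counts.values()))

for name, value in expected.items():
    print(f"{name}: {value:,.2f}")
```

The expected multi unit count of 187.5 corresponds to the "about 188" cited above. The expected total of roughly 16,237 EINs is close to, but not identical with, the 16,336 actually selected from the BLS frame, as one would expect when selection depends on the distribution of digits in issued EINs.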
The initial match and the BLS and SSA single/multi unit classifications were used to form the strata from which the subsamples were selected. These classifications were later modified for analytical purposes, as will be explained in Section C. However, the weights applied to the sample cases to produce estimates depend on which of the strata they were selected from. Weighting by the reciprocal of the subsampling fractions produces estimates at the level of the first-stage sample. These estimates can be used to calculate percent distributions, because EINs in the first-stage sample were selected with equal probability. To produce estimates of totals for the universe, the first-stage estimates have to be further weighted by the reciprocal of 0.06, the sampling fraction used to select the first-stage sample.

After the selection of the second-stage sample, it was discovered that an additional 2,608 EINs should have been included in the first-stage sample from the SSA frame, but were inadvertently omitted. This problem was dealt with by reweighting the second-stage sample cases for the strata that were affected. Further details are given in Section C of this chapter.

Sampling errors were calculated for a few key estimates and are shown in Tables IIIA-4 and A1. For the latter table, in which the estimates were based on the full first-stage sample, the actual sample of 22,572 EINs was treated as a fixed size simple random sample, selected without replacement, and the sampling errors were estimated under that assumption. The estimates in Table IIIA-4 were based on the second-stage sample. The calculation of sampling errors for these estimates treated the first-stage sample of 22,572 cases as the universe and the second-stage sample as though it had been a stratified random sample selected without replacement from that universe.
These assumptions result in a slight understatement of the sampling errors, since they do not take into account the contribution of the first stage of sampling to the overall sampling errors.

Exhibit IIB-1 summarizes the main features of the ERUMS sample design. A more detailed description of the sample selection and matching procedures is given in Section C. Section D describes the administrative and working arrangements for carrying out the study. Readers who are mainly interested in the results may wish to proceed directly to Chapter III.

Exhibit IIB-1

Summary of the ERUMS sample design

FRAMES

  BLS: EINs in Texas UI file for first quarter 1982
  SSA: EINs in Single Unit and Multi Unit Code Files that:
       (1) Had wage reports for 1982 and
       (2) Had at least one Texas reporting unit and
       (3) Did not appear in the BLS frame.

FIRST-STAGE SAMPLE

  Selection method:   Equal probability, based on 7th and 8th digits of EIN
  Sampling fraction:  6 in 100
  Sample size:        BLS frame    16,336
                      SSA frame     3,628*
                      Total        19,964

SECOND-STAGE SAMPLE

  Selection method:   Stratified systematic, equal probability within stratum
  Sampling fractions: Varied by stratum from take all to 1 in 173.78
  Sample size:        Multi unit in BLS or SSA    201
                      All other                   200
                      Total                       401

* Plus 2,608 cases that were inadvertently omitted. See discussion in Sections B and C of this chapter.

C. Sample selection and matching procedures

There are two reasons for providing a detailed account of the ERUMS sample selection and matching procedures. The obvious reason is that the results, like those of any research study, are dependent on the procedures used and anyone interested in the results is entitled to a full description of how the study was carried out.
The other reason, equally or perhaps more important, is that ERUMS was a venture into uncharted territory and we believe that future projects of this kind will benefit from the availability of a detailed road map of the procedures that were developed to match and compare employer and reporting unit records from BLS, SSA and IRS for statistical purposes.

Exhibit IIC-1 gives an overview of the ERUMS sample selection and matching operations that will be discussed in this section. The subsection numbers used in this section correspond to the operation numbers on the chart (1.0 to 10.0). Most of the 10 operations are relatively simple and therefore easy to describe; however, operation 3.0, covering Phase I sample selection operations at SSA, was complex and required a separate chart (Exhibit IIC-2) for clarification.

An important consideration in developing the procedures was the large size of the administrative record files from which the samples were selected and relevant data for the sample cases extracted. This dictated a strategy of minimizing the number of runs of these large files and extracting only the sample units and data needed for the study, so that working files would be of manageable size and could be processed on a microcomputer accessible only to BLS personnel cleared to work on the ERUMS project. In operation 3.0, for example, single runs of SSA's Single and Multi Unit Code Files were made to extract all of the data needed for the Phase I sample selection at one time.

Certain of the procedures used were necessary to comply with policies of the participating agencies concerning access to identifiable records from their systems. In particular, it can be seen in Exhibit IIC-1 that in operation 2.0, BLS transmitted only the stems (digits 1-6,9) of the sample EINs rather than the full 9-digit EINs to SSA. This was done because it was not considered appropriate to identify specific UI filers in an administrative record system operational environment.
Later, when only SSA personnel cleared to participate in ERUMS had access to the working files for the study, full 9-digit EINs were included.

Once the study specifications had been agreed on and the interagency agreements approved, the project operations depicted in Exhibit IIC-1 occupied a period of about three years. The initial sample selection operations at BLS and SSA (steps 1.0 to 3.0) were completed during a relatively short period in mid-1986. The elimination of nonsample EINs and the electronic merge of SSA and BLS data for the Phase I sample (steps 4.0 and 5.0) were completed at BLS in January 1987. The selection of the Phase II sample (step 6.0) was completed at BLS in October 1987. For the most part, the acquisition of additional BLS, SSA and IRS data for the Phase II sample cases (steps 7.0 to 9.0) was completed by April 1988. Final review and analysis continued until the end of 1989.


1. Selection of BLS Phase I sample

The first step, once the overall design for the study had been agreed on by the Workgroup, was to select the Phase I sample from the BLS UI Address File for the State of Texas for the first quarter of 1982. This file, which had been transmitted from the State to BLS in October 1982, contained records for all covered Texas employers who had filed their ES-202 statistical reports for the first quarter of 1982, plus records for some employers who had not filed but for whom employment had been imputed based on reports for prior quarters. The file included a few employers who had filed reports but reported zero employment for the first quarter of 1982.

The sample selection, as reported in the previous section, was based on the EIN as the sampling unit. Therefore, the 1.3 percent of records with no EINs reported were excluded from the sample selection.

All records having any one of six randomly selected pairs of 7th and 8th digits in their EINs were included in the sample.
(To minimize disclosure risks, the specific pairs are not identified in this report.) If an EIN had only one reporting unit (RU) associated with it, it was classified as a BLS single unit EIN; if it had more than one associated RU, it was classified as a BLS multi unit EIN.

The Phase I BLS sample contained 16,336 EINs. The expected take was 0.06 x 270,612 = 16,237. For this sample of 16,336 EINs, data items needed for ERUMS were extracted from the source file.


2. Listing of EIN stems for BLS Phase I sample

The "EIN stem" is defined as digits 1 to 6 and 9 of the full 9-digit EIN. BLS created and transmitted to SSA a file containing only the EIN stems of the 16,336 sample EINs. Some stems appeared more than once in this file. A listing of unique stems subsequently created by SSA contained 11,655 records.

As explained earlier, the reason for the use of EIN stems at this stage was to avoid identifying employers reporting to the UI system to SSA operating staff who were not cleared to participate in the study.


3. SSA Phase I sample selection operations

Exhibit IIC-2 shows the details of operation 3.0, the steps carried out at SSA to extract SSA data for EINs in the BLS Phase I sample and for other EINs meeting the criteria for sample selection but not included in the BLS sample. In the exhibit, operations are represented by rectangles; input and output files are represented by parallelograms.

More specifically, the goal of operation 3.0 was to produce two files and transmit them to BLS for further processing and Phase II sample selection. One output file consisted of full 9-digit EINs and data for stem matches, i.e., single and multi unit records from SSA's Single and Multi Unit Code Files which:

  - Had the same stem (EIN digits 1-6,9) as at least one of the BLS Phase I sample EINs; and

  - Were associated with employers who had filed W-3 Wage Reports for 1982 (active SSA employers).
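
The EIN stem and the digit-pair selection rule described in subsections 1 to 3 can be illustrated with a brief sketch. It is not the program used in the study: the function names are hypothetical, and the six digit pairs shown are placeholders, since the report does not disclose the pairs actually selected.

```python
# Illustrative sketch only. The six pairs below are HYPOTHETICAL placeholders;
# the report withholds the pairs that were actually used.
SAMPLE_PAIRS = frozenset({"07", "23", "41", "58", "66", "92"})


def normalize(ein: str) -> str:
    """Strip the hyphen from an EIN and check that nine digits remain."""
    digits = ein.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"not a 9-digit EIN: {ein!r}")
    return digits


def ein_stem(ein: str) -> str:
    """Return the EIN stem: digits 1 to 6 and 9 of the 9-digit EIN."""
    d = normalize(ein)
    return d[:6] + d[8]


def sample_pair(ein: str) -> str:
    """Return the 7th and 8th digits, on which sample selection was based."""
    return normalize(ein)[6:8]


def in_sample(ein: str, pairs=SAMPLE_PAIRS) -> bool:
    """True if the EIN carries one of the designated digit pairs."""
    return sample_pair(ein) in pairs


def is_stem_match(ssa_ein: str, bls_sample_eins) -> bool:
    """True if the SSA EIN shares its stem with at least one BLS sample EIN."""
    stems = {ein_stem(e) for e in bls_sample_eins}
    return ein_stem(ssa_ein) in stems
```

With six of the 100 possible digit pairs designated, the rule yields the 6 in 100 first-stage sampling fraction. Because the stem omits digits 7 and 8, an SSA EIN such as the made-up value 12-3456789 is a stem match for a BLS sample EIN of 12-3456239 even though the full EINs differ, which is why a stem match is only a potential match.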
This stem match file contained three types of records:

  - Records for EINs corresponding to full 9-digit EINs in the BLS Phase I sample, i.e., matched cases.

  - Records for EINs not corresponding to full 9-digit EINs in the BLS sample, but eligible for the study by reason of having one of the six designated pairs in digits 7 and 8, and having a Texas code. These records are referred to as sample nonmatches.

  - All other records, i.e., nonsample nonmatches. These were of no further interest for the study.

The second output file contained 9-digit EINs and data for sample nonmatches, i.e., records from the Single Unit and Multi Unit Code Files that did not match any of the BLS stems and:

  - Had one of the six designated sample pairs of digits in positions 7 and 8 of the EIN;

  - Had a Texas code; and

  - Were associated with employers who filed W-2/W-3 Wage Reports for 1982.

All of the records in this file were designated as sample nonmatches. Note that sample nonmatches could occur in either of the two output files. However, as explained under step 4.0, the sample nonmatches in the stem match file were not included in the Phase I sample.

To understand the SSA operations described in this subsection, it is necessary to make a distinction between employers and reporting units. Each record in SSA's Single Unit Code File has a unique EIN, representing a single employer. All employers who completed Form SS-4 and were issued EINs should be included once in this file, regardless of the number of reporting units they have.

The records in the Multi Unit Code File represent reporting units, so that the same EIN can be associated with more than one record in that file. Employers with one or more records in the Multi Unit Code File have been identified at some stage as having more than one reporting unit, but they do not all currently participate in SSA's voluntary Establishment Reporting Plan program and report their wages separately by reporting unit.
Therefore, it is possible to have EINs with only one record in the Multi Unit Code File. All EINs appearing in the Multi Unit Code File should also appear in the Single Unit Code File, although there may be a few exceptions.

The steps in operation 3.0 were as follows:

Step 3.1 - The list of unduplicated BLS stems and the list of the six randomly selected sample pairs of digits were compared with each of the EINs in the Single Unit Code File to produce two extract files. The stem match extract file contained records for all EINs having one of the BLS sample stems. The sample nonmatch extract file contained records for all EINs with nonmatching stems that had a Texas state code and one of the sample pairs of digits in positions 7 and 8. The number of records in each of these extract files is shown in Exhibit IIC-2.

Step 3.2 - Essentially the same procedure was followed for the Multi Unit Code File. The stem match extract file contained all reporting unit records for every EIN having one of the BLS sample stems. The sample nonmatch extract file contained records that had Texas state codes and were associated with EINs that had nonmatching stems and one of the sample pairs of digits in positions 7 and 8. Thus, for sample nonmatch EINs for employers with reporting units in more than one State, only their Texas reporting units were included in the extract file. The number of records in each of these files is shown in Exhibit IIC-2.

Step 3.3 - The stem match extract files from the Single and Multi Unit Code Files were compared on the basis of EIN. Records in the single unit extract file having EINs that also appeared in the multi unit extract file were eliminated.

Step 3.4 - The records remaining from step 3.3 were compared with an edited W-3 Wage Report File for 1982, on the basis of EIN. Records for EINs having no 1982 wage reports in this file were eliminated.
The output of this step was a file of 182,536 records that were potential matches to the BLS sample EINs.

Step 3.5 - The sample nonmatch extract files from the Single and Multi Unit Code Files were compared on the basis of EIN. Records in the single unit extract file having EINs that also appeared on one or more records in the multi unit extract file were eliminated. The number of records eliminated at this point was quite small, probably because many of the EINs appearing in the multi unit extract file had records in the Single Unit Code File with non-Texas state codes; hence these EINs had not been included in the sample nonmatch file that was extracted from the Single Unit Code File in step 3.1.

Step 3.6 - The records remaining from step 3.5 were compared with the edited W-3 Wage Report File for 1982, on the basis of EIN. Records associated with EINs having no 1982 wage reports in this file were eliminated. The output file of sample nonmatches contained a total of 3,658 records.

Following completion of these steps, the final output files of stem (potential) matches and sample nonmatches were transmitted to BLS. In addition to full 9-digit EINs, these files included SSA geographic codes (State and county) and the first two digits of the SIC codes.


4. Elimination of non-sample EINs from SSA output files

All EINs in SSA's sample nonmatch output file were included in the final Phase I sample. However, as explained in subsection 3, some of the EINs in the stem match file did not meet the criteria for inclusion in the Phase I sample. BLS matched the full 9-digit EINs from its initial sample against the 9-digit EINs in the stem match file and retained in the Phase I sample only those EINs that matched. At that time, no one recognized that the stem match file could also include sample nonmatch cases. As a result, nonmatch cases that had stems appearing in BLS's initial sample were not included in the ERUMS Phase I and Phase II samples.
When this oversight came to light, it was found that an additional 2,608 SSA nonmatch cases, of which 2,576 were single unit and 32 were multi unit, should have been included in the Phase I sample. As explained in subsection C.11 below, the weights for the affected strata were revised to compensate for their being undersampled.


5. Merge of BLS and SSA data for Phase I sample EINs

The output file from operation 4.0 was merged with the data file for the BLS Phase I sample from operation 1.0. Data elements for each EIN appearing in both files were combined on a single record for that EIN. The EINs in the merged file, whether or not appearing in both the BLS and SSA samples, constituted the final Phase I sample.


6. Phase II sample selection

The Phase I sample EINs were divided into 11 strata, as shown in Table IIC-1.


Table IIC-1 - PHASE I SAMPLE COUNTS BY STRATUM

Stratum   BLS      SSA      Other classifiers                 No. of EINs
          status   status

   1      single   single   Match on county and 2-digit SIC         8,689
   2      single   single   Different county or 2-digit SIC         4,392
   3      single   NWR                                              2,698
   4      NWR      single                                           3,559
   5      multi    single   <20 RUs in BLS                             88
   6      multi    single   20+ RUs in BLS                              6
   7      single   multi                                              356
   8      multi    NWR                                                 41
   9      NWR      multi                                               69
  10      multi    multi    <20 RUs in BLS                             60
  11      multi    multi    20+ RUs in BLS                              6

TOTAL                                                              19,964

The definitions used in classifying EINs by strata were as follows (NWR stands for "no wage report"):

BLS status

  Single   One reporting unit with EIN in Texas UI file for 1982
  Multi    2+ reporting units with EINs in Texas UI file for 1982
  NWR      No reporting unit with EIN in Texas UI file for 1982

SSA status

  Single   W-3 Wage Report for 1982, not included in SSA Multi Unit Code File
  Multi    W-3 Wage Report for 1982, included in SSA Multi Unit Code File
  NWR      No W-3 Wage Report for 1982.
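
The stratum definitions above amount to a small decision rule, which the following sketch restates in code. It is an illustration under stated assumptions rather than the procedure actually programmed for ERUMS: the function and parameter names are hypothetical, and the inputs (number of Texas reporting units in the UI file, presence of a 1982 W-3 wage report, presence in the Multi Unit Code File, and agreement on county and 2-digit SIC) are assumed to have been derived already for each EIN.

```python
def bls_status(n_texas_rus: int) -> str:
    """BLS status from the number of reporting units in the Texas UI file."""
    if n_texas_rus == 0:
        return "NWR"
    return "single" if n_texas_rus == 1 else "multi"


def ssa_status(has_w3_1982: bool, in_multi_unit_file: bool) -> str:
    """SSA status from the 1982 W-3 report and Multi Unit Code File flags."""
    if not has_w3_1982:
        return "NWR"
    return "multi" if in_multi_unit_file else "single"


def assign_stratum(n_texas_rus: int, has_w3_1982: bool,
                   in_multi_unit_file: bool,
                   same_county_and_sic: bool = False) -> int:
    """Return the Table IIC-1 stratum number (1 to 11) for one EIN.

    Hypothetical helper: all parameter names are assumptions of this sketch.
    """
    bls = bls_status(n_texas_rus)
    ssa = ssa_status(has_w3_1982, in_multi_unit_file)
    if bls == "NWR" and ssa == "NWR":
        # Such an EIN belongs to neither frame, so it has no stratum.
        raise ValueError("EIN is in neither the BLS nor the SSA frame")
    if bls == "single":
        if ssa == "single":
            return 1 if same_county_and_sic else 2
        return 3 if ssa == "NWR" else 7
    if bls == "NWR":
        return 4 if ssa == "single" else 9
    # Remaining cases are BLS multi unit EINs, split at 20 reporting units.
    if ssa == "NWR":
        return 8
    large = n_texas_rus >= 20
    if ssa == "single":
        return 6 if large else 5
    return 11 if large else 10
```

For example, an EIN with three reporting units in the Texas UI file, a 1982 W-3 wage report and no record in the Multi Unit Code File would fall in stratum 5 under this rule.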
The sample counts shown in Table IIC-1 were reviewed by the ERUMS work group, which decided to allocate the Phase II sample as follows: take all EINs in strata 6, 8 and 11; select 50 EINs from each of strata 1 to 4; select 34 EINs from stratum 5 (giving a total of 40 from strata 5 and 6 combined); select 40 EINs from stratum 7; and select 34 EINs from stratum 10 (giving a total of 40 from strata 10 and 11 combined). The specified number of EINs was then selected from each stratum systematically, using a random starting point and the sampling interval needed to achieve the desired sample size. The sampling intervals used and the sample sizes by stratum are shown in Table IIC-2.


Table IIC-2 - PHASE II SAMPLING INTERVALS AND SAMPLE SIZES

Stratum   Sampling interval   EINs selected

   1           173.78               50
   2            87.84               50
   3            53.96               50
   4            71.18               50
   5             2.59               34
   6             1.00                6
   7             8.90               40
   8             1.00               41
   9             1.73               40
  10             1.76               34
  11             1.00                6

TOTAL                              401


7. Listing of EINs for Phase II sample

For the relatively small Phase II sample, it was now possible to assemble information from several sources for use in the final analysis, which had several goals: to assign each sample EIN to a definitive final match status; to compare the characteristics, such