Federal Committee on Statistical Methodology
Office of Management and Budget

 

  Statistical Policy Working Paper 5 - Report on Exact and Statistical Matching Techniques



 

 

 

 

 



Statistical Policy Working Papers are a series of technical

documents prepared under the auspices of the Office of Federal

Statistical Policy and Standards.  These documents are the

product of working groups or task forces, as noted in the

Preface to each report.

 

These Statistical Policy Working Papers are published for the

purpose of encouraging further discussion of the technical

issues and to stimulate policy actions which flow from the



technical findings and recommendations.  Readers of Statistical

Policy Working Papers are encouraged to communicate directly

with the Office of Federal Statistical Policy and Standards with

additional views, suggestions, or technical concerns.

 

 



Joseph W. Duncan, Director

Office of Federal Statistical Policy and Standards



 

 



For sale by the Superintendent of Documents, U.S. Government

Printing Office Washington, D.C. 20402



 

 

 

 

 



Statistical Policy

Working Paper 5

 

Report on

Exact and Statistical

Matching Techniques



 

Prepared by

Subcommittee on Matching Techniques

Federal Committee on Statistical Methodology

 

 

 

 

 

 

 

 

 



DEPARTMENT OF COMMERCE

UNITED STATES OF AMERICA



 

 



U.S. DEPARTMENT OF COMMERCE

Philip M. Klutznick, Secretary

Courtenay M. Slater, Chief Economist

 



Office of Federal Statistical Policy and Standards

Joseph W. Duncan, Director



 

Issued: June 1980

 

 

 

 

 



Office of Federal Statistical

Policy and Standards

 



Joseph W. Duncan, Director

 

Katherine K. Wallman, Deputy Director, Social Statistics

Gaylord E. Worden, Deputy Director, Economic Statistics

Maria E. Gonzalez, Chairperson, Federal Committee on Statistical



Methodology

 

 

Preface



     This working paper was prepared by the Subcommittee on Matching

Techniques, Federal Committee on Statistical Methodology.  The

Subcommittee was chaired by Daniel B. Radner, Office of Research and

Statistics, Social Security Administration, Department of Health and

Human Services.  Members of the Subcommittee include Rich Allen,

Economics, Statistics, and Cooperatives Service (USDA); Thomas B.

Jabine, Energy Information Administration (DOE); and Hans J. Muller,

Bureau of the Census (DOC).

 



     The Subcommittee report describes and contrasts exact and

statistical matching techniques.  Applications of both exact and

statistical matches are discussed.  The report is intended to be

useful to statisticians in various Federal agencies in determining

when it is appropriate to use exact matching techniques or when it

may be appropriate to use statistical matching techniques.  The

recommendations of the report also include suggestions for further

research.



 




 

 

 



Members of the Subcommittee on

Matching Techniques

 



Daniel B. Radner, Chairperson

Office of Research and Statistics, Social Security Administration

Department of Health and Human Services

 

Rich Allen

Economics, Statistics, and Cooperatives Service

Department of Agriculture

 

Maria E. Gonzalez (ex officio)*

Chairperson, Federal Committee on Statistical Methodology

Office of Federal Statistical Policy and Standards

Department of Commerce

 

Thomas B. Jabine*

Energy Information Administration

Department of Energy

 

Hans J. Muller

Bureau of the Census

Department of Commerce

 

 

*Member, Federal Committee on Statistical Methodology

 






 

 

 

                          Acknowledgements



     The body of this report represents the collective effort of the

Subcommittee on Matching Techniques.  Although all members of the

Subcommittee reviewed and commented on all parts of the report,

specific members were responsible for writing different sections. 

The authors of the respective chapters and appendices appear below:

 

Chapter   Author(s)

I         Daniel Radner, Thomas Jabine, Rich Allen
II        Hans Muller, Rich Allen
III       Daniel Radner
IV        Daniel Radner, Thomas Jabine

Appendix

I         Rich Allen
II        Daniel Radner
III       Hans Muller, Rich Allen

 

     Maria E. Gonzalez and Thomas B. Jabine provided indispensable

guidance and encouragement throughout the Subcommittee's work.  Tore

Dalenius, an ex officio member of the Subcommittee when the work

began, provided important insights in the early stages of the work



and helpful comments on drafts of the report. Others who contributed

to the work as members of the Subcommittee in its earlier stages

include: Richard Barr, Richard Coulter, David Hirschberg, Matthew

Huxley, Benjamin Klugh, Stanley Kulpinski, Robert Penn, and Scott

Turner. Members of the Federal Committee on Statistical Methodology

and the Office of Federal Statistical Policy and Standards reviewed

and commented on drafts of the report.  Also, we are grateful to

Benjamin Tepping, Ivan Fellegi, Horst Alter, and Michael Colledge for

their helpful comments on drafts of the report, and to all those who

supplied examples of matching.

 

 

 






 

 



                 Members of the Federal Committee on

                       Statistical Methodology

                           (February 1979)

 

 

Maria Elena Gonzalez (Chair)
Office of Federal Statistical Policy and Standards (Commerce)

Barbara A. Bailar
Bureau of the Census (Commerce)

Norman D. Beller
Economics, Statistics, and Cooperatives Service (Agriculture)

Barbara A. Boyes
Bureau of Labor Statistics (Labor)

Edwin J. Coleman
Bureau of Economic Analysis (Commerce)

John E. Cremeans
Bureau of Economic Analysis (Commerce)

Marie D. Eldridge
National Center for Education Statistics (Education)

Daniel H. Garnick
Bureau of Economic Analysis (Commerce)

Thomas B. Jabine
Energy Information Administration (Energy)

Charles D. Jones
Bureau of the Census (Commerce)

William E. Kibler
Economics, Statistics, and Cooperatives Service (Agriculture)

Frank de Leeuw
Bureau of Economic Analysis (Commerce)

Alfred D. McKeon
Bureau of Labor Statistics (Labor)

Lincoln E. Moses
Energy Information Administration (Energy)

Monroe G. Sirken
National Center for Health Statistics (HHS)

Wray Smith
Office of the Assistant Secretary for Planning and Evaluation (HHS)

Thomas G. Staples
Social Security Administration (HHS)

 

 






 

 

 

 

                          Table of Contents

Preface
Acknowledgements

                 CHAPTER I-INTRODUCTION AND OVERVIEW

A. Scope of Study
     1. Definitions and Uses of Matching
     2. Matching Applications and Examples
     3. Confidentiality Issues
     4. The Role of Computers
B. Auspices
C. Dissemination of Report
D. Organization of Report

                      CHAPTER II-EXACT MATCHING

A. Nature and History
B. Types of Matching Error
C. Procedures
     1. Preliminary Steps
     2. Selection of Match Characteristics and Definition of
        "Agreement" and "Disagreement" for Each Characteristic
     3. Blocking and Searching
     4. Weighting of Characteristics of Comparison Pairs
     5. Determination of Thresholds
     6. Validation of Decisions
D. Practical Problems
     1. Source Data
     2. Matching Procedures
     3. Matching Mode
     4. Follow-up
E. Reliability
F. Elimination of Duplication in One File

                  CHAPTER III-STATISTICAL MATCHING

A. Introduction
B. A Suggested Framework for the Analysis of Statistical Matching
   Methods
     1. Universe
     2. Two Data Sets
     3. Hypothetical Exact Match
     4. Estimate of Hypothetical Exact Match
     5. Statistical Match Result
C. Applications of Statistical Matching
     1. Matching Steps
     2. Two Basic Types of Methods
     3. History and Development of Matching Methods
          a. Bureau of Economic Analysis, U.S. Department of
             Commerce, CPS-TM Match
          b. Bureau of Economic Analysis, U.S. Department of
             Commerce, SFCC Match
          c. Brookings Institution MERGE-66
          d. Christopher Sims' Comments
          e. Statistics Canada SCF-FEX Match
          f. Yale University (and National Bureau of Economic
             Research)
          g. Office of Tax Analysis, U.S. Department of the
             Treasury
          h. Brookings Institution MERGE-70
          i. Office of Research and Statistics, Social Security
             Administration
          j. Statistics Canada COC and MCF Matches
          k. Mathematica Policy Research
          l. Other Statistical Matches
D. Criticisms of Statistical Matching
E. Types of Errors in Statistically Matched Data
F. Summary and Conclusions

               CHAPTER IV-FINDINGS AND RECOMMENDATIONS

A. Findings
     1. Definitions of Exact and Statistical Matching
     2. Usefulness of Matching
     3. Applications of Exact and Statistical Matching
     4. Comparison of Errors
     5. Comparison of Relative Risk of Disclosure and Potential for
        Harm to Individuals
     6. Legal Obstacles to Exact Matching
B. Recommendations
     1. General
          a. When Should Matching be Used
          b. Choice between Exact and Statistical Matching
          c. Documentation of Matches
          d. Public Release of Matched Data
          e. Confidentiality Restrictions on Matching
     2. Research
          a. Exact Matching
          b. Statistical Matching

                             APPENDICES

Appendix I. Economics, Statistics, and Cooperatives Service Example
of Exact Matching
     A. Exact Matching Considerations
     B. Selected Match Rules
     C. Practical Problems
     D. Technical Papers

Appendix II. Office of Research and Statistics Example of Statistical
Matching
     A. Introduction and Input Files
     B. Matching Method
     C. Correspondence of Values of Matching Variables
     D. Tables

Appendix III. Selected Examples of Exact Matching
     A. Record Check Studies of Population Coverage
     B. Matching of Probation Department and Census Records
     C. Computer Linkage of Health and Vital Records: Death
        Clearance
     D. Use of Census Matching for Study of Psychiatric Admission
        Rates
     E. June 1975 Retired Uniformed Services Study
     F. Federal Annuitants-Unemployment Compensation Benefits
        Study
     G. Office of Education Income Validation Study
     H. Department of Defense Study of Military Compensation
     I. Department of the Treasury-Social Security Administration
        Match Study
     J. G.I. Bill Training Study
     K. 1973 Current Population Survey-Internal Revenue Service-
        Social Security Administration Exact Match Study
     L. Statistics Canada Health Division Matching Applications
     M. Statistics Canada Agriculture Division Matching
        Applications

Bibliography



 



 






 



 



 



 



 



                              CHAPTER I



                      Introduction and Overview



 



                         A. Scope of Study 



 



     This report discusses matching of data files for research and



statistical purposes.  Two basic types of matching, exact matching



and statistical matching, are discussed and applications of those two



types by various organizations, mostly government agencies, are



described.  Matching for other purposes, e.g., administrative



purposes, is not considered here.  In the matching considered here,



identification of units, if needed at all, ordinarily is only



necessary to make the match.  After matching, that identification can



be removed. Most of the discussion in this report is in terms of



matching records for natural persons.  However, similar



considerations apply to matching of records for legal persons, for



example, corporations, partnerships, fiduciaries. Many aspects of



matching for research and statistical purposes have been reviewed by



the Subcommittee.  Among the aspects discussed in this report are:



     .    Matching procedures and their development 



     .    Some advantages and disadvantages of alternative procedures



     .    Confidentiality considerations 



     .    Accuracy of matching results



 



1. Definitions and Uses of Matching



 



 Although the terms "match," "exact match," and "statistical match"



have been used frequently in the literature, the Subcommittee knows



of no generally agreed upon definitions of these terms.  For purposes



of this report, the Subcommittee has defined a match as a linkage of



records from two or more files containing units from the same



population.  It has defined an exact match as a match in which the



linkage of data for the same unit (e.g., person) from the different



files is sought; linkages for units that are not the same occur only



as a result of error.  Exact matching normally requires the use of



identifiers, for example, name, address, social security number.  The



 use of the term "exact" match is not meant to suggest that such



matches are made without error; problems encountered in carrying out



exact matching are discussed in Chapter II.  Other terms for exact



matching such as "actual" and "object" matching have also been used.



The Subcommittee has defined a statistical match as a match in which



the linkage of data for the same unit from the different files either



is not sought or is sought but finding such linkages is not essential



to the procedure.  In a statistical match, the linkage of data for



similar units rather than for the same unit is acceptable and



expected.  Statistical matching ordinarily has been used where the



files being matched were samples with few or no units in common;



thus, linkage for the same unit was not possible for most units. 



Statistical matches are made on the basis of similar characteristics,



rather than unique identifying information, as in the usual exact



match.  Other terms have been used for statistical matching, such as



"synthetic," "stochastic," "attribute," and "data" matching..1 



     The definition of a match used here excludes such record linkage



techniques as the "hot deck" allocation of values to nonrespondents



in surveys because those techniques are considered to involve only



one file.  Techniques such as matched or paired sampling in



experiments are also excluded from the definition..2 
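
     The definitions of exact and statistical matching given above
can be contrasted in a short sketch.  The Python fragment below is
an illustration only and is not a procedure taken from this report:
the record layouts, the field names (ssn, age, income, wages,
assets), and the simple absolute-difference distance are all
hypothetical.  The exact match links records that carry the same
identifier; the statistical match links each record to the most
similar record in the other file.

```python
# Illustrative sketch only: file layouts, field names, and the distance
# function are hypothetical and are not taken from the report.

def exact_match(file_a, file_b, key="ssn"):
    """Link records for the SAME unit using a shared identifier."""
    index_b = {rec[key]: rec for rec in file_b}
    linked = []
    for rec_a in file_a:
        rec_b = index_b.get(rec_a[key])
        if rec_b is not None:          # unit is present in both files
            linked.append({**rec_a, **rec_b})
    return linked


def statistical_match(file_a, file_b, match_vars=("age", "income")):
    """Link each record in file_a to the most SIMILAR record in file_b."""
    def distance(rec_a, rec_b):
        # Unscaled sum of absolute differences; a real application
        # would have to choose and justify its own distance measure.
        return sum(abs(rec_a[v] - rec_b[v]) for v in match_vars)

    linked = []
    for rec_a in file_a:
        nearest = min(file_b, key=lambda rec_b: distance(rec_a, rec_b))
        # Carry over only the variables that file_a lacks.
        added = {k: v for k, v in nearest.items() if k not in rec_a}
        linked.append({**rec_a, **added})
    return linked


# Hypothetical files: a survey and an administrative file sharing "ssn".
survey = [{"ssn": "001", "age": 34, "income": 18000},
          {"ssn": "002", "age": 61, "income": 9500}]
admin = [{"ssn": "002", "wages": 9400},
         {"ssn": "003", "wages": 22000}]
exact_result = exact_match(survey, admin)

# Hypothetical samples with no units in common and no identifiers.
sample_a = [{"age": 34, "income": 18000},
            {"age": 61, "income": 9500}]
sample_b = [{"age": 36, "income": 17500, "assets": 4000},
            {"age": 59, "income": 10000, "assets": 25000}]
statistical_result = statistical_match(sample_a, sample_b)
```

     In this sketch only the one unit found in both files is linked
by the exact match, while the statistical match links every record in
the first sample to a similar, but different, unit in the second.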



     Although the definitions used here do not provide a precise
dividing line between exact and statistical matching, in practice it
is ordinarily clear which matches are exact and which are
statistical.  From the point of view of accuracy of the matched data,
exact matching has ordinarily been preferred to statistical matching.
In many cases, for technical or legal reasons, or both, an exact
match between two files cannot be carried out.  For example, both
files might be samples which have few units in common.  Legal
restrictions on exact matching, which have existed for some time,
have been increasing in recent years (e.g., the Privacy Act of 1974
and the Tax Reform Act of 1976).  These limitations on the use of
exact matching have led to further interest in alternative methods of
matching.  In practice, the choice between exact and statistical
matching sometimes is a choice between statistically matching easily
obtainable files which cannot be exactly matched and exactly matching
files which are not as easily obtained (especially with identifiers).
In some cases files which can be exactly matched are obtainable but
contain data which are less appropriate for performing the desired
statistical analyses.

__________________________

     .1 The Subcommittee has chosen to use the terms exact match and
statistical match because those terms are the most frequently used,
not necessarily because those terms are considered to be the best.

     .2 See Althauser and Rubin (1969) for an example of a matched
sampling technique.

     The impetus for the formation of this Subcommittee came from



restrictions on the use of exact matching arising from



confidentiality considerations.  The original question to be examined



was to what extent and under what conditions statistical matching is
an acceptable alternative to exact matching.  Thus, the Subcommittee



did not examine alternatives to exact matching other than statistical



matching. Although a comprehensive comparison between exact and



statistical matching was originally intended, the Subcommittee



determined that such a comparison was not possible at this time



because so little is known about the error structure of statistical



matching procedures.  For this reason, the Subcommittee decided to



summarize in this report what is known about exact and statistical



matching, to give examples of applications of both types of matching,



to make some limited comparisons of exact and statistical matching,



and to suggest directions for future research.



 



2. Matching Applications and Examples



     Matching of data files for research or statistical purposes



ordinarily is a step in the preparation of the data needed to perform



statistical analyses.  In assessing the data needed for a given



analysis, there often are cases in which one existing data set does



not contain all of the variables needed (or contains variables of



less than sufficient accuracy).  Several different approaches can be



used to deal with this problem.  One possibility is direct data



collection of all the needed variables, for example, in a sample



survey.  Another possibility is the assignment or imputation of



values using statistical techniques such as regression analysis



(perhaps using information from another data file).  A third



possibility is matching two or more existing data sets to add the



desired variables, using either exact or statistical matching.  Thus,



matching is merely one of a larger group of techniques which can be



used to add variables needed to perform statistical analyses.



However, there may be cases in which matching, specifically exact



matching, is the only feasible method of preparing the needed data. 



For example, cumulative health histories of sufficient accuracy might



require the exact matching of hospital records.  There are also cases



in which a comparison of the presence of units in two files, rather



than the addition of variables, is needed.  In this type of



application, there are few, if any, alternatives to exact matching.



Where the goal is the construction of a multipurpose file, rather



than performing a specific analysis, exact and statistical matching



can be particularly appropriate because large numbers of variables



can be added relatively easily using matching. The Subcommittee



collected many examples of matching of data files.  As noted above,



the applications can be divided into two broad categories: (1) adding



to a base file more variables or additional reports on the same



variables; and (2) comparing the presence of units in two files. 



Within type (1) several different kinds of applications can be



identified.  One application is the addition of more variables to



enrich analyses or to make possible analyses which otherwise could



not be done.  Both exact and statistical matching have been used in



this application.  A cross-section example of one such exact match is



the addition of Social Security Administration (SSA) age, race, and



sex data to Federal individual income tax return records in order to



make it possible to analyze income and tax data by those



characteristics.  In another cross-section example, a statistical



match was carried out between observations from a household survey



and a sample of Federal individual income tax returns in order to add



more detailed and more accurate income information to the household



survey data (Budd, Radner and Hinrichs, 1973).  A longitudinal



example of exact matching is the linkage of hospital admission and



separation records into cumulative health histories (Smith and



Newcombe, 1975). Another kind of application within type (1) is the



evaluation of data, in which initial variables are compared with



added variables, or with additional reports on the same variables,



from other existing sources or from special evaluation surveys. 



Evaluation of the accuracy of data was carried out using the 1973



Current Population Survey-Internal Revenue Service-SSA Exact Match



Study.  In that project, the income data from the different data



sources were compared



 






 



 



 



 



 



and response and reporting errors were analyzed (e.g., Alvey and



Cobleigh, 1975).  Definitional differences were examined in Sweden



using exact matching.  Two different definitions of unemployment, one
from a household survey and one from the labor market board, were
compared by



matching survey responses and labor market board records.



     In type (2) (comparing the presence of units in two files), two



different kinds of applications can be identified: evaluation of



coverage and construction of more comprehensive lists.  The Bureau of



the Census has conducted numerous coverage evaluation studies in



connection with the Decennial Censuses.  For example, in connection



with the 1960 Population Census, samples from 1950 Census records,



registered births, and other sources were matched with 1960 Census



records, and coverage was assessed (Perkins and Jones, 1965).  In



such matches, the emphasis is upon the presence of units in the



files, rather than upon the relationships between data in the two



files.  In an example of list construction, the Economics,



Statistics, and Cooperatives Service (ESCS) of the U.S. Department of



Agriculture uses exact matching in the construction of a master list



sampling frame of farms in each state.  This master list was



constructed from several different lists, and exact matching was used



to detect duplication between (and within) the different lists



(Coulter, 1977).  Statistical matching is not appropriate for type



(2) applications.
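
     Both kinds of type (2) application reduce to questions about
which units appear in which file.  A minimal Python sketch follows.
It assumes, for simplicity, that every unit carries a clean
identifier; that assumption, the function names, and the sample data
are hypothetical.  In practice identifiers contain errors, and the
problems this raises for exact matching are discussed in Chapter II.

```python
# Illustrative sketch only: the identifiers and sample data are hypothetical.
from collections import Counter


def coverage_rate(check_sample, census_ids):
    """Share of an independent check sample that is found in the census file."""
    found = [uid for uid in check_sample if uid in census_ids]
    return len(found) / len(check_sample)


def build_master_list(*source_lists):
    """Merge several lists into one frame and report duplicated units."""
    counts = Counter(uid for source in source_lists for uid in source)
    master = sorted(counts)                                # each unit once
    duplicates = sorted(uid for uid, n in counts.items() if n > 1)
    return master, duplicates


# Coverage evaluation: a sample drawn from an independent source is
# checked against the file whose coverage is being assessed.
census_ids = {"A1", "A2", "A3", "A5"}
birth_record_sample = ["A1", "A2", "A4", "A5"]
rate = coverage_rate(birth_record_sample, census_ids)

# List construction: duplication between and within lists is detected.
master, duplicates = build_master_list(
    ["F1", "F2", "F3"],
    ["F2", "F4"],
    ["F4", "F4", "F5"],
)
```

     Here three of the four sampled units are found in the census
file, and units F2 and F4 are flagged as duplicated before the master
list is formed.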



     In most of the applications mentioned above, one possible effect



of matching was a reduction of response "burden".  That is, to



collect the same information without matching would have required a



considerable amount of direct data collection.  Also, in some of



those applications, cost reduction was a beneficial effect; i.e.,



matching was less expensive than direct collection of the same



combination of data would have been.  The Office of Federal



Statistical Policy and Standards (1978a) suggested the use of



statistical matching to reduce response burden and cost by means of



what are called "nested surveys." In such surveys, different samples



from the same population are asked separate sets of questions, with



a core of questions in common.  The data from these different samples



can then be statistically matched to obtain relationships among the



items not in the common core of questions.
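
     The nested-survey idea can be sketched as follows.  This is a
hedged illustration, not the procedure proposed by the Office of
Federal Statistical Policy and Standards: the question names (region,
age_group, health_visits, savings) are hypothetical, and records are
paired only when their answers to the core questions agree exactly,
which is the simplest possible matching rule.

```python
# Illustrative sketch only: question names and the pairing rule are hypothetical.
from collections import defaultdict

CORE = ("region", "age_group")   # questions asked of both samples


def match_on_core(sample_a, sample_b, core=CORE):
    """Pair each sample_a record with a sample_b record giving the same core answers."""
    by_core = defaultdict(list)
    for rec_b in sample_b:
        by_core[tuple(rec_b[q] for q in core)].append(rec_b)

    use_count = defaultdict(int)          # rotate through donors within a cell
    matched = []
    for rec_a in sample_a:
        cell = tuple(rec_a[q] for q in core)
        donors = by_core.get(cell)
        if not donors:
            continue                      # no similar unit in sample_b
        donor = donors[use_count[cell] % len(donors)]
        use_count[cell] += 1
        matched.append({**rec_a, **donor})
    return matched


# Two samples from the same population; each was asked the core
# questions plus one item that the other sample was not asked.
sample_a = [
    {"region": "N", "age_group": "<40", "health_visits": 2},
    {"region": "N", "age_group": "<40", "health_visits": 5},
    {"region": "S", "age_group": "40+", "health_visits": 1},
]
sample_b = [
    {"region": "N", "age_group": "<40", "savings": 1000},
    {"region": "N", "age_group": "<40", "savings": 3000},
    {"region": "S", "age_group": "<40", "savings": 500},
]
matched = match_on_core(sample_a, sample_b)
```

     Each matched record carries both non-core items even though no
respondent was asked both.  Because the pairing uses nothing beyond
the core answers, any relationship observed between the two items in
the matched file reflects only what the core questions carry.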



 



3. Confidentiality Issues



 



     As noted earlier, legal restrictions on exact matching have led



to increased interest in alternatives to exact matching.  The



relevant confidentiality issues are discussed in this section. Exact



matching of records for individual reporting units for statistical



research purposes raises two important questions in the area of



confidentiality:



 



     To what extent should such matching activities be conditional on



     the "informed consent" of the individuals whose records are



     being matched?



 



     To illustrate this issue, consider the case of a statistical



     survey in which participation is voluntary and information is to



     be collected on topics such as income, assets, use of medical



     services, voting behavior, etc.  To measure the validity of the



     survey responses, they will be individually matched to and



     compared with relevant information in administrative record



     systems of tax collection agencies, banks, hospitals, and



     others.



 



     Such record checks (including reverse record checks, where the



     sample of persons to be interviewed is drawn from the relevant



     administrative system) have been a valuable tool for the



     improvement of survey methods.  Full respondent knowledge of the



     nature of the study and the procedures to be followed might



     condition their responses and to some extent defeat the purpose



     of the study.  Nevertheless, both ethical and legal



     considerations require that individuals providing data be
     adequately informed of the uses that will be made of the data they



     provide.



 



     Do the benefits to be gained by exact matching outweigh the



     risks inherent in assembling large amounts of information about



     individuals in a single location?



 



     When large amounts of information about an identifiable



     individual are available in a single file, the potential for use



     of the information to the detriment of that individual is



     greater than if the information were segmented and the parts



     maintained in different locations.  Some exact matching
     activities conducted for statistical purposes have brought together



     large amounts of information for identified individuals, from



     both survey and administrative record sources.



 



     Although the creation of such files clearly increases the



     potential for harm to individuals, it is also relevant to ask



     whether any individuals have, in fact, been harmed as the result



     of disclosures from matched data files created for statistical



     purposes.  Inquiries made by another group (Office of Federal



     Statistical Policy and Standards, 1978b) have not identified any



     such cases.



 






 



 



 



 



 



     These and related concerns have led to the creation of an



environment in which significant restrictions have been placed on the



exact matching of records belonging to more than one Federal agency



and on the matching of Federal agency records with those of other



organizations.



     The Privacy Act of 1974 placed certain limitations on the



disclosure of individually identifiable records in the hands of



Federal agencies.  In brief, these limitations have the following



effects on exact matching for statistical purposes:



.    Identifiable records can be disclosed (transferred) within an



     agency on a need-to-know basis.  For purposes of the Privacy
     Act, each Department (e.g., HHS) is an agency, so that
     intradepartmental matches can be carried out if not otherwise



     prohibited by law.



.    Identifiable records can be disclosed to the Census Bureau for



     use in its census and survey activities.  Subsequent to the



     Privacy Act, revised Census legislation placed reimbursable work



     conducted by the Census Bureau for other agencies in the



     category of Census activities to which this provision applies.



.    Identifiable records can be disclosed to any agency or



     organization under a routine use established for that system of



     records.  The routine use is established by the agency con-



     trolling the source record system, and the use for which the



     disclosure is to be made must be deemed "compatible with the



     purposes for which it was collected".  There may be problems in



     exercising the routine use provision where the planned match



     requires the exchange of identifiable records in both directions



     (Jabine, 1976, p. 229).



 



     In addition to the general restrictions imposed by the Privacy



Act, there are several agency statutes which further limit the



ability to conduct interagency matching studies.  Some statistical



agencies, in particular the Census Bureau and the National Center for



Health Statistics, have statutes which prohibit the transfer of



identifiable records to any other agency or organization.  The Tax



Reform Act of 1976 limits the release of tax return information,



broadly defined, for identifiable individuals and legal persons to



certain agencies, uses and types of information specified in the law. 



One example of the effects of these new restrictions is that most re-



searchers conducting follow-up studies no longer have access to IRS



records to determine which members of their study populations are



still alive and where they are located. Consideration of the issues



and problems described in this section has led many persons to



advocate greater use of alternatives to exact matching to achieve



desired ends, or at least to examine the feasibility of alternative



methods.  Statistical matching has been used in some situations where



exact matching was not feasible; the question has been raised in some



quarters as to whether it should be used even where exact matching is



feasible.  For example, Duncan (1976) recommended that consideration



be given to the use of statistical matching and to research on the



merging of grouped data to estimate the relationships among



variables without matching individual records.



 



4. The Role of Computers



 



     Modern computers and development of advanced software for



matching have made many matching applications feasible which could



not be done manually.  Exact matching has been performed manually and



by computer.  Exact matching by computer, once the source materials



are in machine readable format, is much faster and less expensive



than performing the same matching manually, but the biggest advan-



tages arise from consistency of decisionmaking and use of more



complex matching rules.  For example, in a manual match of name and



address files, ordinarily last names are reviewed, then first names



of individuals with the same last names, then addresses, etc.  A



computer match procedure can compare all elements in one pass,



assigning agreement and disagreement weights to each element.  Some



matching examples in this report involve comparison of 15 or more



variables which would not have been feasible by manual procedures. 



There do remain some situations in which manual matching is more



practical or possibly more successful than computer matching.  In



Chapter II, D, under Practical Problems, there is some discussion of



a few of these situations.  Statistical matching has only been per-



formed by computer; it would not be practical to carry out



statistical matching manually.



 



                             B. Auspices



 



     This report represents the collective effort of the Subcommittee



on Matching Techniques of the Federal Committee on Statistical



Methodology, which operated under the auspices of the Office of



Federal Statistical Policy and Standards, Department of Commerce



(previously the Statistical Policy Division,



 






 



 



 



 



 



Office of Management and Budget).  The group was formed in early 1976



as one of two working groups of a Subcommittee on Confidentiality



Issues chaired by Thomas B. Jabine.  The working groups were



subsequently given separate subcommittee status.  The other group,



the Subcommittee on Disclosure-Avoidance Techniques, issued its report



in May 1978 (Office of Federal Statistical Policy and Standards,



1978b). The opinions expressed here reflect the collective judgment



of the Subcommittee and do not necessarily reflect those of the



Federal Committee on Statistical Methodology or the Office of Federal



Statistical Policy and Standards.



 



C. Dissemination of Report



 



     This report is intended for circulation to agencies and Federal



offices which may utilize matching techniques.  However, a broader



audience may be interested in the report.  The report attempts to



present the major considerations and concerns for the use of matching



procedures.  Examples of present and past applications are included



to aid the reader in visualizing the types of files which can be



linked and the types of variables needed for matching.



 



D. Organization of Report



 



     Chapter II contains a discussion of exact matching.  That



discussion includes a brief overview of the nature and history of



exact matching, a description of the steps in exact matching



procedures, and descriptions of practical problems and reliability. 



A detailed example of exact matching is presented in Appendix I and



summaries of selected examples are shown in Appendix III. A



discussion of statistical matching is presented in Chapter III. 



Because statistical matching is not a very well-known technique, in



Chapter III substantial space is devoted to the nature of statistical



matching, and summaries of many statistical matches are included.



Discussions of criticisms of statistical matching and types of errors



in statistically matched data are also presented, although those



discussions are necessarily sketchy since little is known about the



reliability of statistical matching.  Appendix II contains a detailed



example of statistical matching. Chapter IV contains the findings and



recommendations of the Subcommittee.  The findings are concerned with



definitions, usefulness, and applications of matching, as well as



errors in matching and confidentiality considerations.  The general



recommendations involve the use of matching, documentation of



matches, public release of matched data, and confidentiality



restrictions on matching.  Also, further research on both exact and



statistical matching is recommended. A bibliography of exact and



statistical matching references is included at the end of this



report.



 






 



 



 



                             CHAPTER II



 



                           Exact Matching



 



                       A. Nature and History.3



 



As defined earlier, an exact match is a match in which the linkage of



data for the same unit is sought.  Exact matching ordinarily is



carried out using a set of characteristics ("identifiers") contained



in both records. The unit may be a person, family, housing unit,



address, farm, business firm, and so forth, or it may be an event



such as a birth.  The following observations refer mostly to person



matching but they could be applied or adapted to other units as well.



Usually, the records come from two different sources (files).  Three



or more files may be involved, but even in that case the matching is



often carried out between two files at a time; however, procedures



have been developed for matching multiple files simultaneously to end



up with a single unduplicated file (see Appendix I of this report).



In some cases, all units (and no others) are assumed to be



represented in both files; in others, one file may represent a subset



of the other one; or the two files may overlap but may each include



a number of units not covered by the other. In the following,



matching is described in terms of linking records from a "base file"



to those in a "reference file".  Matching in both directions may be



indicated in some circumstances; the procedures for two-way matching



are a simple extension of those for one-way matching. (When one file



is a subset of the other, exact matching is feasible only from the



subset to the complete file.) "Exact matching" is not necessarily



"exact" in the sense that there must be exact agreement on all char-



acteristics that are compared.  The source files usually include some



incomplete records and some inaccurate data.  Allowances must be made



for this at various stages of the matching process. Exact matching



techniques therefore are not just procedures for bringing together



two records that are clearly and uniquely identified and



unequivocally known to refer to the same unit.  Exact matching can be



practically error free under favorable conditions (for instance, when



matching two files on the basis of social security numbers that were



transcribed from reliable records rather than reported from memory);



but under less favorable conditions some uncertainty about the



results of the matching must be expected, that is, the matches



obtained will probably include some erroneous ones, and some true



matches will be missed. The matching procedures should be designed to



control matching error in such a way that the error in the



conclusions to be drawn from the study will be kept at a tolerable



level.  Thus the procedures must be adapted to the conditions



prevailing in each project, with respect to the objectives of the



study and the quality of the source files (and, as always, the human,



technical, and financial resources and, in some cases, time



constraints).  In general, with more incomplete and inaccurate source



files, more complex matching procedures are called for and a higher



proportion of matching errors may be unavoidable. Exact matching, in



its simplest form, has been known for many years.  For example, for



quite some time there has been interest in matching a list of current



taxpayers against the previous payee list or a list of units which



should be paying taxes.  However, in the context of this report this



type of example normally is not for statistical purposes and is ex-



cluded from consideration. Some of the earliest applications of exact



matching techniques for statistical purposes have been for follow-up



studies of Census data.  Appendix III, Reference A describes the



procedures used to match 1960 Population Census Records against 1950



Population Census Records, Registered Birth Records, 1950 Population



Evaluation Survey results, and Alien Registration Records.  This



match involved a clerical reverse record match procedure on



addresses.  Codes were given to the various name, address and supplemental



_________________________



.3 Marks et al., 1974; Steinberg and Pritzker, 1967.



information items to characterize the amount of agreement.



Each comparison case was then considered as matched or nonmatched.



The simplest clerical matching techniques utilize comparisons of



names only.  The development of computer capabilities gave rise to



exact matches on identifiers rather than names.  In the United States



social security number (SSN) has been extensively used for exact



matches of separate files.  Several of the examples in Appendix III



used only SSN for matching. A number of individuals have conducted



research in theory and procedures for exact matching of files.  The



paper by Fellegi and Sunter (1969) expressed a record linkage theory



involving probabilities for the matched and unmatched sets of units



from two files.  The Economics, Statistics, and Cooperatives Service,



USDA, exact match example in Appendix I bases much of the linkage



techniques on Fellegi-Sunter.  Similar techniques were also used for



the Statistics Canada applications included in Appendix III,



references L and M.



 



     B. Types of Matching Error.4



 



     In practice it is almost inevitable in most matching projects



that some matching errors occur, even with the most sophisticated



procedure and the most careful execution.  These errors fall into two



major classes:



 



     a.   Erroneous match ("false match", "positive error", "Type II



          error"): Linking of records that correspond to different



          units.



 



     b.   Erroneous non-match ("false non-match", "negative error",



          "Type I error"): Failure to link records that do



          correspond to the same unit.



 



     "Gross matching error" is the sum of both types of error.  "Net



matching error" is their difference.  However, this concept



is useful only in certain applications, mainly in coverage



evaluation, where the objective is the estimation of the true size of



a population.  When the goal of the study is the estimation of other



population parameters, the "net error due to matching" may be a more



complex function of the two types of error, depending on how each



type affects the estimates.
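The two summary measures just defined can be illustrated with a small Python sketch.  The counts are hypothetical, and the sign convention for the net error (erroneous matches minus erroneous non-matches) is an assumption made here for illustration; the report says only that the net error is the difference of the two types.

```python
def matching_error_summary(false_matches: int, false_nonmatches: int) -> dict:
    """Combine the two error counts.

    gross = sum of both types of error
    net   = their difference (assumed here: false matches minus
            false non-matches; the sign convention is not fixed by the text)
    """
    return {
        "gross": false_matches + false_nonmatches,
        "net": false_matches - false_nonmatches,
    }


# Hypothetical counts, for illustration only.
summary = matching_error_summary(false_matches=40, false_nonmatches=55)
print(summary)
```

With these counts the two errors largely offset each other in the net figure while the gross figure stays large, which is the situation in which a small net error alone would be a poor guide to the quality of the match.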



     Erroneous matches may be of two kinds:



 



a.   The reference file includes a true match for a certain base



     record but the latter is mistakenly linked not to its true match



     but to a different reference record.



 



     b.   The reference file does not include any true match for a



          certain base record but the latter is mistakenly linked to



          some reference record.



 



     The term "mismatch" is used by some for any erroneous match, by



others in a more restricted sense for the (a) kind only.  While the



(b) kind of erroneous match is always unacceptable, the (a) kind may



be considered as acceptable matches in some studies but not in



others, depending on the objectives of the study. For example, in



one-way matching, a base file unit for which there is a true but



undetected match in the reference file may be classified as "matched"



on the basis of an erroneous linkage with the reference file record



of a different unit (a "mismatch" in the strict sense of the (a)



kind).  In a coverage study in which the only objective is to



determine whether each base file unit is present in the reference



file or not, that mismatch would be acceptable.  The same mismatch



would be unacceptable, however, when the objective is the comparison



of certain characteristics reported for the same unit in the two



files or the addition of data from the reference file to the matching



record in the base file. The relative importance of each type of



error varies depending on the objectives of different projects. 



Content evaluation and other studies based on comparisons of



characteristics of matched pairs require a low Type II error, that



is, high confidence in "matched" pairs being true matches; Type I



error (failure to find some true matches) will not affect the



findings derived from the matched pairs unless the characteristics



under study are distributed differently in the matched and the



erroneously not matched records. In coverage evaluation, on the other



hand, both types of error affect the results, in opposite directions,



and the desired procedure is one that leads to a balance between both



types of error, resulting in a tolerably small net error. (However,



if Type I and II errors were both very large the procedure would be



suspect, even if it resulted in a very small net error.) The



foregoing considerations must be kept in mind when choosing the match



procedures for a particular project.  The ways in which the



procedures can be adjusted to serve the purpose of each study are



treated in Section C of this chapter.
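The error categories discussed in this section can be made concrete with a short Python sketch.  It assumes that the true match status of each base record is known (as it might be in a validation sample); the function names and outcome labels are hypothetical and are not taken from the report.

```python
from typing import Optional


def classify_link(linked_ref: Optional[str], true_ref: Optional[str]) -> str:
    """Classify the outcome of one-way matching for a single base record.

    linked_ref: reference record the procedure linked to (None = not linked)
    true_ref:   reference record that truly matches (None = no true match
                exists in the reference file)
    """
    if linked_ref is None:
        if true_ref is None:
            return "correct non-match"
        return "erroneous non-match"      # Type I error
    if true_ref is None:
        return "erroneous match (b)"      # no true match exists in the file
    if linked_ref == true_ref:
        return "correct match"
    return "erroneous match (a)"          # true match exists, wrong record linked


def acceptable_for_coverage(outcome: str) -> bool:
    """In a pure coverage study only presence in the reference file matters,
    so an (a)-kind mismatch still gives the right 'present' answer."""
    return outcome in {"correct match", "erroneous match (a)", "correct non-match"}
```

The second function reflects the point made above: an (a)-kind mismatch is tolerable when only presence in the reference file is at issue, but it would not be tolerable in a study that compares or merges the characteristics of the linked records.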



 



 



.4 Marks et al., 1974; Seltzer and Adlakha, 1969.



 






 



 



 



 



 



                           C. Procedures.5



 



     In general, exact matching requires the following steps:



 



     1.   Preliminary steps: Improvement of the quality of source



          files; elimination of out-of-scope records; standardization



          of files.



     2.   Selection of match characteristics (components), and



          definition of "agreement" and "disagreement" (tolerance



          limits) for each characteristic.



     3.   Blocking (comparison reduction) and searching



          (identification of comparison pairs).



     4.   Weighting of characteristics of comparison pairs.



     5.   Determination of thresholds for designating "matches" and



          "non-matches" (or three groups: match, non-match,



          undetermined).



     6.   Validation of decisions; follow-up on undetermined cases



          (reconciliation).



     In practice, these may not always be recognizable as distinct



steps, but explicitly or implicitly, they are usually carried out in



some form. The procedure must be designed for each project, on the



basis of previous experience with the same or similar source files,



or of a special pilot study, or of early data from the study itself



(in which case tentative match rules must be set up initially based



on whatever information is available at the outset). The decisions



needed at each step may be taken on an intuitive, empirical, or



mathematical basis.  "Intuitive" decisions are based on the



researcher's experience with or knowledge about the same kind of



files and his best judgment of the quality and discriminating power



of the data.  "Empirical" decisions are derived more formally from



actual matching results from similar studies or, preferably, directly



from the study itself, either through a pilot study or a sample of



the main study.  "Mathematical" decisions are derived from



mathematical models of the matching procedure in the given set of



files, using prior knowledge or assumptions about the probability of



occurrence of various observed data configurations in true matches



and true nonmatches. The more complex procedures are not necessarily



always the best ones; the choice must be made in terms of the source



data, the objective of the study, the precision required in the



output, the resources available, cost and time limitations, etc. The



nature of the project is also a factor: in a continuous or multi-round



project the initial period can be used for testing and improving the



match rules; for a one-time project of short duration a pilot study is



essential, or else, if the main study is small, it might be carried



out like a pilot study, with very thorough follow-up so that the



effect of different matching rules can be investigated.  The entire



procedure for a particular study should be oriented towards the goal



of minimizing (or reducing to a tolerable magnitude) the error in the



conclusions of the study.



 



1. Preliminary Steps 



 



     In many cases the researchers have no control over the quality



of the source files.  However, where one or both files are collected



especially for the matching project, the results of the matching can



be greatly improved by intervening in the forms design, training of



interviewers, and so forth, to make sure that characteristics that



will facilitate the matching are included, and that the interviewers



understand the importance of complete and accurate information for



those characteristics. Elimination of out-of-scope records may be



necessary in some cases, if the source files do not cover exactly the



same area or time period or population group.  Examples: uncertain



area boundaries; inclusion or exclusion of institutional population



or Armed Forces; and so forth.  Out-of-scope records in one file



cannot possibly be matched in the other file and should be eliminated



at the earliest possible stage, to keep them from being counted as



nonmatches. Standardization of the files is not as critical in



clerical matching as in matching by computer.  To be matchable by



computer, one or both files may have to be reformatted.
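As an illustration of the preliminary steps, the following Python sketch standardizes a name field and drops out-of-scope records before any comparison is attempted.  The record layout (keys "name", "area", "year") and the scope rule are assumptions made for this example only.

```python
import re


def standardize_name(raw: str) -> str:
    """Uppercase, replace non-letters with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z ]", " ", raw).upper()
    return " ".join(cleaned.split())


def in_scope(record: dict, areas: set, first_year: int, last_year: int) -> bool:
    """A record is in scope if it lies in a covered area and time period."""
    return record["area"] in areas and first_year <= record["year"] <= last_year


def prepare(records, areas, first_year, last_year):
    """Drop out-of-scope records, then standardize the name field."""
    prepared = []
    for rec in records:
        if not in_scope(rec, areas, first_year, last_year):
            continue  # removed early so it is not later counted as a non-match
        prepared.append({**rec, "name": standardize_name(rec["name"])})
    return prepared
```

Both files would be passed through the same routine so that differences in format alone cannot produce disagreements between them.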



 



2.   Selection of Match Characteristics (Components), and Definition



     of "Agreement" and "Disagreement" (Tolerance Limits) for Each



     Characteristic.6 



 



     In many match projects so little information is available for



matching that all of it must be used in the matching process.  In



others there may be some redundant information, and the "best"



characteristics can be chosen as a basis for the matching decisions.



The selection should be based on the quality of the available data,



the discriminating power of the various characteristics, and the



purpose of the study.  Ideally, the most accurately reported and the



most



 



 



______________________________



 



.5 Marks et al., 1974; Appendix I of this report.



.6 Madigan and Wells, 1976; Housni et al., 1978; Nathan, 1978; U.S.



Dept. of Commerce, 1977.



 






 



 



 



 



 



discriminating characteristics would be preferred, but there may be



a conflict between these two requirements. (Social security numbers



actually assigned are close to being a unique identifier = 100% dis-



crimination; however, social security numbers obtained in household



surveys contain a sizeable proportion of errors.) The less



discriminating power a characteristic has, the less information it



provides, and the more characteristics must be compared before a



decision (match or nonmatch) can be made. Because reporting in the



source files is not always accurate, insistence on exact agreement



between two records would lead to erroneous nonmatches.  The match



rules should allow some tolerance, such as age differences of plus or



minus one or two years, common spelling differences in names, etc. 



On the other hand, if the tolerances are too wide, erroneous matches



will result. The selection of the match characteristics and the



setting of tolerance limits for each characteristic should be done so



as to minimize the type of error that should be kept low in order to



best serve the purpose of each project.  Various more or less elab-



orate procedures for doing this have been described in the



literature; they may be based on the researcher's past experience and



judgment, or on thorough analysis of a pilot study or a sample of



data from the project itself; such an analysis would require a more



thorough investigation of potentially matched records than is



generally possible for an entire project, in order to establish the



characteristics of true matches (and nonmatches) with a high degree



of confidence. Operational efficiency should be considered also; if



there is a choice between several characteristics or tolerance limits



that are about equally efficient in terms of keeping the critical



type of matching error low, the selection should be made in terms of



operating considerations, such as cost, difficulty, and risk of error



in the implementation.
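A minimal Python sketch of tolerance limits follows.  The one-year default for age and the small table of spelling variants are hypothetical choices; in practice they would be set from experience or from a pilot study, as described above.

```python
# Hypothetical table of spelling variants treated as the same name.
NAME_VARIANTS = {"JON": "JOHN", "SMYTH": "SMITH", "CATHERINE": "KATHERINE"}


def ages_agree(age_a: int, age_b: int, tolerance: int = 1) -> bool:
    """Ages 'agree' if they differ by no more than the tolerance (in years)."""
    return abs(age_a - age_b) <= tolerance


def canonical_name(name: str) -> str:
    """Map a name to its canonical spelling, if a variant is listed."""
    key = name.strip().upper()
    return NAME_VARIANTS.get(key, key)


def names_agree(name_a: str, name_b: str) -> bool:
    """Names 'agree' if they are identical after mapping common variants."""
    return canonical_name(name_a) == canonical_name(name_b)
```

Widening the age tolerance or lengthening the variant table reduces erroneous non-matches at the price of more erroneous matches, which is the trade-off described in the text.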



 



3. Blocking and Searching.7



 



     Searching in the reference file for a record or records that



might match the input record can be viewed as reducing the possible



comparison pairs (each input record paired with all reference



records, one at a time) to a number of comparison classes, each class



having some common characteristics and including a more manageable



number of comparison pairs that will then be compared on their other



characteristics.  In matching by computer, this is important to keep



the cost down; it is achieved by "blocking" the files through the use



of Soundex or similar code systems for names, or of geographic codes



(street segments, enumeration districts), and so forth, with the



effect that each input record will be compared in detail with



relatively few reference records.  However, the saving must be



weighed against the risk of increasing the number of erroneous



nonmatches: a reference record that agrees with an input record on



all characteristics except the one used for blocking may in fact be



the true match for the unit record, but because it is not included in



the right block it will not be compared with the right unit record



and both records may be classified as not matched (or they may wind



up being paired with the wrong partners). This can be avoided to some



extent by multiple matching: the records not matched according to one



set of criteria are processed again using a different set. 



Obviously, that would increase the cost. In manual matching, blocking



may not be a separate step but is implicit in the search operation. 



For example, in matching by name, the clerk will use only that part



of the reference file that includes the names starting with the same



letters as the input record, and so forth. In general, the larger the



blocking unit, the higher the cost of matching within blocks and the



greater the risk of erroneous matches; the smaller the blocking unit,



the lower the cost of matching within blocks but the greater the risk



of erroneous nonmatches.  Ideally, blocking should be done on the



basis of characteristics which will virtually never disagree in the



case of true matches; they should also disagree nearly always in the



case of nonmatches.  The combination of two characteristics may be



most effective, e.g., father's name and mother's maiden name (double



Soundex code). The characteristic used for blocking should preferably



be independent of the other matching characteristics (e.g., blocking



by geographic characteristic, matching by name, etc.); if it is not



independent (e.g., blocking by Soundex, matching by full surname),



this fact must be taken into account in defining the matching rules.
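Blocking by a name code can be sketched in Python as follows.  The code below implements the common American Soundex rules; it is offered as an illustration and is not claimed to be the exact coding scheme used in any of the projects cited in this report.  The field name "surname" is an assumption.

```python
from collections import defaultdict

# Letter-to-digit table of American Soundex; vowels, H, W and Y have no code.
_CODES = {}
for _digit, _letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                         "4": "L", "5": "MN", "6": "R"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code (first letter plus three digits)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def block_by_soundex(records, key="surname"):
    """Group records into blocks that share a Soundex code."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[soundex(rec[key])].append(rec)
    return blocks


def comparison_pairs(base, reference, key="surname"):
    """Yield only those (base, reference) pairs that fall in the same block."""
    ref_blocks = block_by_soundex(reference, key)
    for b in base:
        for r in ref_blocks.get(soundex(b[key]), []):
            yield b, r
```

A base record whose surname is coded differently from that of its true match (for example, because of an error in the first letter) is never compared with it.  This is the risk of erroneous non-matches noted above, and it is the reason a second pass with a different blocking characteristic may be worthwhile.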



 



4.   Weighting of Characteristics of Comparison Pairs.8



 



     After blocking, the characteristics of the input record are



compared with those of the reference



 



________________________



 



.7 U.S. Dept. of Agriculture, 1977; U.S. Dept. of Commerce, 1977.



.8 Perkins and Jones, 1966; Smith and Newcombe, 1975; Fellegi and



Sunter, 1969; Tepping, 1968; USDA technical papers cited in Appendix



I of this report.



 






 



 



 



records in the corresponding comparison class, and the "best match"



is selected from those records.  Whenever more than one



characteristic is compared, the fact that the various characteristics



contribute different amounts of information must be taken into



account.  For example, for deciding whether the two records of a



comparison pair refer to the same person, agreement on sex



contributes less information than agreement on names; among names,



agreement on a common name contributes less than agreement on an



unusual name. These differences can be taken into account through a



system of weighting.  Weights can also reflect the amounts of



information derived from different degrees of agreement on one



characteristic, such as exact agreement on year of birth or a difference



of plus or minus 1 year, 2 years, and so forth.  As a general



rule, more weight is given to items with high discriminating power



and low error rates. The weights can be derived from a set of



explicit and detailed rules, or they can be based on the judgment of



the person doing the matching as to the relative importance of the



observed kind and degree of agreement in each comparison pair. 



Explicit rules, in turn, can be formulated intuitively or they can be



derived from a mathematical model of the matching process; in either



case, some knowledge about the behavior of the matching



characteristics is needed, either from previous studies with similar



data, or from a pilot study, or it may be derived in the course of



the processing from the data under study. It should be noted that,



for some characteristics, agreement and disagreement do not carry



equal weight (in opposite directions).  For instance, agreement on



sex is not very conclusive evidence of a match, but disagreement on



sex is rather strong evidence against a match.  Disagreement as well



as agreement can be included in the weighting system; negative



weights are assigned as evidence against a match.  For each



comparison pair, the weights assigned to the various match



characteristics are combined into an overall score in order to select



the "best match" among the pairs in each comparison class (block). 



In classes with only one comparison pair there is no choice, but the



match data may need to be weighted in any case for the following



step.
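One way to derive weights from a mathematical model is the log-ratio form associated with Fellegi and Sunter (1969).  The Python sketch below assumes that, for each characteristic, m is the probability of agreement among true matches and u is the probability of agreement among true non-matches.  The m and u values and the field names are hypothetical, and the simple equality comparison stands in for whatever agreement definition a project adopts.

```python
import math

# Hypothetical m- and u-probabilities for three characteristics.
FIELDS = {
    "sex":        {"m": 0.98, "u": 0.50},
    "surname":    {"m": 0.95, "u": 0.01},
    "birth_year": {"m": 0.90, "u": 0.02},
}


def agreement_weight(m: float, u: float) -> float:
    """Positive weight added when the characteristic agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Negative weight added when the characteristic disagrees."""
    return math.log2((1 - m) / (1 - u))


def pair_score(base: dict, ref: dict) -> float:
    """Sum the agreement or disagreement weights over all characteristics."""
    score = 0.0
    for field, p in FIELDS.items():
        if base[field] == ref[field]:
            score += agreement_weight(p["m"], p["u"])
        else:
            score += disagreement_weight(p["m"], p["u"])
    return score


def best_match(base: dict, candidates: list) -> dict:
    """Return the highest-scoring candidate in a non-empty comparison class."""
    return max(candidates, key=lambda ref: pair_score(base, ref))
```

With the assumed values, agreement on sex adds about 1.0 to the score while disagreement on sex subtracts about 4.6, which illustrates the asymmetry between agreement and disagreement described above.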



 



5. Determination of Thresholds.9



 The "best match" among the pairs in a comparison class (or the only



pair in a class) is not necessarily an acceptable match.  It is



accepted as a match only if its level of agreement is higher than a



designated "threshold" level. As with other matching decisions, the



threshold can be defined intuitively on the basis of previous experi-



ence and knowledge of the data sets involved, or it can be derived



formally from a mathematical model.  The important criterion is that



this step, in conjunction with the other parts of the matching



procedure, should lead to the goal stated before, that is, to



minimize (or keep tolerably low) in each study the error of



estimation of the population parameters that are of interest in that



study. Ultimately, all comparison pairs should be designated as



"matched" or "unmatched", making sure that no reference record is



matched to more than one record.  If some follow-up is feasible, the



final decision may be improved by initially defining two thresholds:



an upper one above which a pair is considered as matched, and a lower



one below which a pair is considered as not matched.  The pairs



falling between the two thresholds can then be followed up either by



a thorough re-evaluation of the available information by an



experienced researcher, or by repeating the matching process but



including additional variables available in the records, or by addi-



tional field work to reconcile conflicting information in the records



or to obtain additional information.  In any case the follow-up work



should lead to a final decision of "matched" or "unmatched".
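The two-threshold rule, together with the requirement that no reference record be matched to more than one record, can be sketched in Python as below.  The greedy assignment (highest score first) is an assumption chosen for simplicity and is not a procedure prescribed by this report; the threshold values would come from experience or from a model.

```python
def classify(score: float, lower: float, upper: float) -> str:
    """Apply the two thresholds to one comparison pair."""
    if score > upper:
        return "matched"
    if score < lower:
        return "unmatched"
    return "undetermined"   # to be resolved by follow-up


def assign_one_to_one(scored_pairs, lower: float, upper: float) -> dict:
    """Greedy one-to-one assignment.

    scored_pairs: iterable of (base_id, ref_id, score).
    Returns {base_id: (ref_id, status)} for pairs at or above the lower
    threshold; base records absent from the result are unmatched.
    """
    used_base, used_ref = set(), set()
    decisions = {}
    ranked = sorted(scored_pairs, key=lambda p: p[2], reverse=True)
    for base_id, ref_id, score in ranked:
        status = classify(score, lower, upper)
        if status == "unmatched":
            break  # pairs are in descending order, so the rest are unmatched too
        if base_id in used_base or ref_id in used_ref:
            continue  # each record may take part in at most one link
        used_base.add(base_id)
        used_ref.add(ref_id)
        decisions[base_id] = (ref_id, status)
    return decisions
```

Pairs returned with the status "undetermined" are the ones that would be sent for re-evaluation, re-matching on additional variables, or field work before a final decision is recorded.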



 



6.   Validation of Decisions.10



 



     If the source files were perfect, with complete and error-free



identifying information, matching problems would be controllable.  As



it is, the results will usually be affected by the previously



described uncertainties implicit in matching with imperfect data.  As



a general rule, a matching project should include a validation of the



matching decisions and an evaluation of the remaining matching error. 



This could take the form of an intensive study, including field



follow-up if at all possible, of a sample of "matched" and



"unmatched" records, endeavoring to ascertain their true status.  If



pilot studies were undertaken at earlier stages (for decisions on



matching characteristics, tolerances, weights, thresholds), their



results may be useful for this purpose also and may reduce, if not



eliminate, the need for more field work. The findings from the sample



or pilot study (as to the proportion of each original match status



group that were found to be true matches or nonmatches) can then be



used to estimate the matching error remaining in the entire file.
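
One simple way to carry the sample findings over to the entire file is sketched below.  It assumes a straightforward expansion estimate (the error proportion found in each status group's sample is applied to that group's size in the file); the function name, the input layout, and all figures are hypothetical.

```python
def estimate_file_error(group_sizes, sample_results):
    """Estimate the number and rate of wrong decisions in the whole file.

    group_sizes:    {status: number of records with that status in the file}
    sample_results: {status: (records checked, records found to be wrong)}
    The names and the simple expansion estimator are assumptions.
    """
    total = sum(group_sizes.values())
    estimated_errors = 0.0
    for status, size in group_sizes.items():
        checked, wrong = sample_results[status]
        estimated_errors += size * (wrong / checked)
    return estimated_errors, estimated_errors / total


# Made-up figures: 9,000 "matched" and 1,000 "unmatched" records;
# validation samples of 200 and 100 records found 4 and 15 errors.
errors, rate = estimate_file_error(
    {"matched": 9000, "unmatched": 1000},
    {"matched": (200, 4), "unmatched": (100, 15)},
)
```

With these illustrative figures the estimate is about 330 erroneous decisions, or roughly 3.3 percent of the file.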



 



__________________________



 



.9 References: see C. 4.



 



.10 Scheuren and Oh, 1976; Seltzer and Adlakha, 1969.



 



 



 



 



If the evaluation indicates that certain match status groups



have a very low error rate and certain others have a high one,



and if an extensive follow-up is



feasible (by mail, phone, personal interview, or record search), a



full follow-up may be undertaken only for the group with the high



error rate, in order to obtain more information that may either



confirm or change the match status and give the validated status a



higher probability of being correct.  At least a sample of the other



status groups should be followed up the same way, to avoid the



possibility of bias arising from special treatment for one group. 



More sophisticated methods of estimating the matching error have been



devised.  When the matching procedure is based on a mathematical



model the estimation of the error probabilities is an integral part



of the procedure.  With some models the admissible error rates for



each match status group may be specified to begin with and the match



rules chosen to give results with the specified error rates. Given



the probability that some "matched" records really refer to different



units and that some "unmatched" records really have a match in the



other file, the conclusions drawn from the results of the matching



are also subject to error because of these matching errors. (They may



also be affected by other error sources, such as different concepts



used in the source files for a variable that is to be compared



between the two files, or coverage differences between the files.)



Attempts can be made to adjust the results, on the basis of prior



knowledge or assumptions about the true distribution of some



characteristics.  Such adjustments have been designed specifically



for some studies.



 



                                 D.     Practical Problems



 



1. Source Data



 



     In practice, most if not all match projects are affected in some



degree by imperfections in the source files: outright errors in the



data; spelling variations; absence of some data from one file or the



other; differences in concept between apparently comparable data;



variability in data reported by different respondents, at different



times, or for different purposes; inclusion of units that should not



be included and omission of units that should be included.  Recent



legislation has restricted the use of the best identifiers (names,



social security numbers) in some cases. Generally, if a match is



based on a sufficiently discriminating combination of several



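characteristics, the probability that the matched records refer to



the same unit is very high.  There is less certainty about failure



to match: it could be due to an error in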



either file or to a true change in some match characteristic if the



source files refer to different dates.  One wrong digit in an



identification number, or in a house number if the first search must



be based on the address, can cause an erroneous classification as



"nonmatch"; so can a misunderstood or misspelled name (unless it is



one of the common spelling variations that are taken into account in



the name coding schemes), or a change of address or (for women) a



name change due to marriage or divorce.  In some studies, the problem



of changing data can be reduced to a reporting problem by asking for



previous addresses and previous names (maiden name, former married



name) when the data for the later file are collected.



 



2. Matching Procedures



 



     Problems can arise if the purpose of the study is not kept in



mind at all stages when the matching procedure is designed.  A



procedure that is best for one study may distort the conclusions from



another study that has different objectives.  The execution of the



procedure is beset with other kinds of problems.  Except when the



matching decision can be based on a simple and practically unique



characteristic, such as a well-reported identification number, the



matching rules are bound to be complicated.



 



3. Matching mode (manual or computer)



 



     A computer program for matching requires very detailed rules for



tolerances, weights, etc., which is normally an advantage in that the



matching decisions will be uniform, not subject to different



interpretation by different clerks.  It may be a disadvantage if



there is supplementary information in the records that does not lend



itself to coding or could not be included in the computer program for



other reasons, but could be used by an experienced person to decide



for or against a match when the basic information is ambiguous.  For



instance, sometimes the question whether two records refer to the



same person may not have a clear answer if only the information in



the two records is compared; but if the records are part of household



or family groups the information about household composition



(relationships, birth order, etc.) and about the other household



members may provide the answer.  These intrahousehold relationships



can take so many different forms that they could not possibly all be



included in a computer program.  Similarly, an experienced reviewer



will



 



                                 12



 



 



 



 



 



often detect some misspellings that would escape matching by even the



most sophisticated name coding routines.



     The advantage of the greater speed of a computer for matching



may be lost if the records are not computerized to begin with and



require a large amount of manual preparation (coding, keying, etc.)



to make them machine readable.  Certain items (especially addresses)



may also need reformatting in one or more files before they can be



compared by computer; that would require additional programming and



computer time. In some applications manual matching may be less



costly.  For example, determining whether 2,000 individuals are



included in a nationwide, well-indexed file of many millions of



records will be cheaper by manual look-up than by processing the



entire file by computer (unless the matching can be done while the



large file is passed through the computer anyway for some other



purpose). In some cases it may be possible to take advantage of the



best features of both computer and manual modes by doing the work in



two stages:



 



     1.   Computer match of the entire file, using criteria that will



          identify matches and nonmatches with near certainty,



          leaving a portion of the input file unclassified (if the



          identifying information is reasonably good, this should be



          a small proportion).



 



     2.   Manual review of the unclassified portion, making use of



          any available information not included in the computer



          program, possibly using additional files that are not



          machine readable.



 



4. Follow-up



 



     Like the matching procedure, the follow-up procedure must also



be designed to fit the purpose of the study.  In addition, it must



fit the matching rules.  For instance, it may be tempting to accept



the matches as probably correct but to follow up on the nonmatches



because they may be erroneous due to defects in the source data and



because the follow-up could yield better information.  That is a



correct procedure only if the matching rules are such that there is



known to be a very high probability that the matches are indeed



correct while many of the nonmatches may be erroneous.  If, on the



other hand, the matching rules are such that the probability of error



is about the same for matches and nonmatches, then both groups must



be followed up if there is any follow-up at all.



     It may be difficult to phrase the follow-up questions so that



the maximum of new information is obtained.  In most cases (except



"possible matches") the interviewer should not be given the



information already available and asked to verify it; that would be



a temptation to just confirm it without checking, if checking is



difficult (this is not a problem when the follow-up is done by mail). 



Nor should the follow-up usually be limited to asking again the same



questions that were asked before; the answers would tend to be the



same unless a different respondent happens to answer. Another follow-



up problem, when current data are involved, is the need to get back



to the respondent as soon as possible in order to minimize recall



problems and the possibility that the study unit may move or cease to



exist.  That requires good planning and coordination so the data can



flow from collection to matching to follow-up without delay.



 



                          E. Reliability.11



 



     Reliability of the results of an exact match project may be



defined as the proportion of erroneous decisions, that is, false



matches and erroneous nonmatches; or as the proportions of true



matches detected and spurious matches included.  In the special case



of matching to eliminate duplication, reliability is expressed in



terms of duplication left in the final file. The proportion of errors



may be estimated in various ways.  In some cases some independent in-



formation may make it possible to know or estimate in advance what



proportion of the base records should be in the reference file (in a



few cases this may be 100 percent, and a match rate of less than that



would indicate either an inefficient matching procedure or an



incomplete reference file, assuming that the records contain



sufficient information for matching).  Usually, if the files include



some corroborating information, it will be possible to be practically



certain about many matches; in some projects one may also be certain



about many nonmatches.  A sample of the remaining cases (and, for



confirmation, a small sample of "certain" cases) can then be put



through an additional round of searching with more thorough



procedures, or more information can be obtained through field follow-



up (by phone, mail, or interview).  The information obtained in that



way for the sample cases can then be used to estimate error rates. 



Another possibility would be to obtain such estimates in advance



through a pilot study.
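
The second definition of reliability given above (true matches detected and spurious matches included) can be computed from a checked sample along the following lines.  This is a sketch under stated assumptions: the function name and the dictionaries decided and truth are hypothetical, with truth standing for the status established by the more thorough search or field follow-up.

```python
def reliability(decided, truth):
    """Return (share of true matches detected,
               share of accepted matches that are spurious).

    decided, truth: dicts keyed by pair id with boolean values (True = match).
    Hypothetical inputs: `truth` would come from the intensive check of a sample.
    """
    true_matches = [p for p in truth if truth[p]]
    accepted = [p for p in decided if decided[p]]
    detected = sum(1 for p in true_matches if decided[p])
    spurious = sum(1 for p in accepted if not truth[p])
    detection_rate = detected / len(true_matches) if true_matches else float("nan")
    spurious_rate = spurious / len(accepted) if accepted else float("nan")
    return detection_rate, spurious_rate


# Made-up sample of five pairs.
decided = {1: True, 2: True, 3: False, 4: False, 5: True}
truth = {1: True, 2: False, 3: True, 4: False, 5: True}
detection_rate, spurious_rate = reliability(decided, truth)
```

In this made-up sample two of the three true matches were detected, and one of the three accepted matches is spurious.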



 



.11 See References to B. and C.4; Neter et al., 1965.



 



                                 13



 



 



 



     As mentioned before (Section C.6), the estimation of error



probabilities may be built into a matching procedure based on a



mathematical model.  Reliability could be improved by putting all



(instead of a sample) of the records that are not either clearly



matched or clearly not matched through additional rounds of matching,



or, if feasible, through a follow-up to get more information.  But



that would usually be very costly and would probably still leave a



residue of cases for which it cannot be determined satisfactorily



whether the base file records have no match in the reference file, or



whether there is a matching record in that file which cannot be found



because of defects of the available information.  If the data are of



poor quality, the most complex routines and the most sophisticated



computers will be of little use.  Improvements in the reliability of



matching applications can undoubtedly be made with greater certainty



by concentrating on the quality of the input data, instead of



devising complex and costly procedures to manipulate data of



questionable information value.



 



F. Elimination of Duplication in One File



 



     Although it is not included in the definition of an



exact match used in this report, elimination of duplication within



a file is a special application of a procedure similar to exact



matching.  Instead of matching one file against another for possible



matches, the matching procedure must be set up to match each



individual record with all other records in the file or all other



records within blocks.  If the file exceeds a few thousand records it



will ordinarily be necessary to use blocking in order to control



costs of computer matching or in order to control time and cost



requirements of manual matching. Regardless of whether manual or



computer procedures are used it is usually best to block on two



different factors and run the matching procedure twice.  If manual



matching is used to identify duplicate records for the same person,



two different sort orders should be used.  The first would be a com-



pletely alphabetic listing of the entire file and the second an



alphabetic listing within zip code or city.  The first listing will



identify all of the complete duplicates (same name and address) and



identify possible duplicates for which the name is exactly the same



but address information has changed or may be in error.  The second



listing will enable matching of records with correct address



information but name misspellings.  A final step in the duplication



removal might be to check common misspellings from the second listing



back against the first listing.  This procedure might enable the



identification of possible duplicates which have common misspellings



of the same name and addresses which are close together



geographically.
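
For a computer run, the idea of blocking on two different factors and combining the results can be sketched as follows.  The record layout, the sample records, and the function name candidate_pairs are assumptions made for illustration; the pairs produced are only candidates that would still need a detailed comparison.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Return the set of index pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))  # indexes are in ascending order
    return pairs


# Made-up records: (name, zip code, street address).
records = [
    ("SMITH JOHN", "20233", "12 OAK ST"),
    ("SMITH JOHN", "20740", "9 ELM AVE"),   # same name, different address
    ("SMYTH JOHN", "20233", "12 OAK ST"),   # same address, name misspelled
    ("JONES MARY", "20233", "40 PINE RD"),
]

# Pass 1 blocks on the exact name; pass 2 blocks on zip code.
by_name = candidate_pairs(records, lambda r: r[0])
by_zip = candidate_pairs(records, lambda r: r[1])
possible_duplicates = by_name | by_zip
```

The first pass brings together the two SMITH JOHN records despite the different addresses; the second brings together SMITH and SMYTH at the same address, which the name block alone would miss.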



 



                                 14



 



 



                             CHAPTER III



 



                        Statistical Matching



 



A. Introduction 



 



     As noted earlier, the Subcommittee has defined a statistical



match as a match in which the linkage of data for the same unit from



the different files either is not sought or is sought but finding



such linkages is not essential to the procedure.  In a statistical



match, the linkage of data for similar units rather than for the same



unit is acceptable and expected. Statistical matching is a relatively



new technique which has developed in connection with increased access



to computers and the increased availability of computer microdata



files.  In a statistical match each observation in one microdata set



(the "base" set) is assigned one or more observations from another



microdata set (the "nonbase" set); the assignment is based upon



similar characteristics.  Usually the observations are persons or



groups of persons, and the sets are samples which contain very few



(or no) persons in common.  Thus, except in rare cases, the



observations which are matched from the two sets do not contain data



for the same person.  This is in contrast to an exact match in which



data are matched for the same person from two different sets.  A



statistical match can be viewed as an approximation of an exact



match. (See Okner (1974) and Radner and Muller (1978) for papers



which contain overviews of exact and statistical matching work.) Some



statistical matching methods can be similar to exact matching



methods.  For example, the Census Bureau's Unimatch computer program



(Bureau of the Census, 1974) has been used for both exact and



statistical matching.12 Statistical matching methods can also be



similar to techniques used to match data for other purposes, such as



the "hot deck" allocation of data to non-respondents in household



surveys (e.g., Spiers and Knott, 1970) or matched or paired sampling



(e.g., Althauser and Rubin, 1969).  Statistical matching as defined



in this report differs from those other techniques because in a



statistical match two different microdata sets are matched and (in



almost all cases) the purpose is the addition of variables not



present for any observations in the base set.  In some cases those



added variables can have the same definition as base set variables



but contain less error. The study of statistical matching is still in



its early stages.  Many important theoretical and practical questions



about statistical matching have not been answered.  These unanswered



questions include:



     1.  How accurate are statistical matches?



 



     2.   For what purposes and under what conditions are the results



          of statistical matches sufficiently accurate?



     3.   What factors are important in determining the accuracy of



          the results of statistical matches?



     4.   What are optimal methods of statistical matching and how



          are those methods affected by the circumstances of the



          match?



     5.   Given a set of alternative statistical matching methods and



          a set of conditions, what is the relative accuracy of the



          different methods?



     6.   What are the best ways of handling practical problems such



          as those resulting from differences between samples and



          between the variables in the files?



     7.   How sensitive are the results of statistical matches to the



          assumptions made in carrying out the matches?



 



     Of course, these questions cannot be answered here.  We will



merely try to summarize what has been done and what is known, and



suggest directions for future work. In this chapter, a description of



a simple framework within which statistical matching can be analyzed



is followed by brief discussions of the steps carried out in making



a match and two basic types of statistical matching methods.  Then



the history and development of statistical matching are sum-



 



 



.12 See Springs and Beebout (1976) for an example of a statistical



match carried out using Unimatch.



 



 



                                 15



 



 



marized, followed by brief discussions of general criticisms of



statistical matching and errors in statistically matched results. 



Finally, a summary and conclusions are presented.13



 



B. A Suggested Framework for the Analysis of Statistical Matching Methods



     In this section a brief summary of the theoretical steps



involved in a typical statistical match will be followed by a



somewhat more detailed discussion of those steps.  An example



involving household survey and income tax data will be used to



clarify the concepts as the discussion proceeds. In summarizing the



matching steps, we begin with a universe, "U," for which we want to



make estimates of variables and their relationships to each other. 



We have two microdata sets, "A" and "B," samples which provide



observations on the universe; each set contains some variables which



are not included in the other set.  We then define a hypothetical



exact match result which we want the statistical match to



approximate.  However, we do not know the hypothetical exact match



result; therefore we estimate it, either explicitly or implicitly,



using whatever information is available.  The appropriate matched



pairs of units are then chosen in a way which minimizes deviations



from the estimate of the exact match result.



 



1. Universe



 



     We begin the detailed discussion of the framework by considering



the universe U for which we want to estimate various relationships. 



U consists of a set of N units; for each unit there are values for R



variables.  By definition all information in U is error-free, and it



is assumed that all information relevant to the estimates we want to



make is contained in the R variables.  U can be represented by an N



x R matrix in which each of the N rows contains the values of the R



variables for one unit.



 



2. Two Data Sets 



 



     We will assume that we have two microdata sets of observations



on variables for units in U; these sets, A and B, are the sets we



want to match statistically.  A and B will be assumed to be samples



from U. A contains n.A units, while B contains n.B units, where both



n.A and n.B are less than N; n.B does not necessarily equal n.A.  It



will also be assumed that very few units from U appear in both A and



B; A and B could be independent samples for which n.A/N and n.B/N are



small.  For example, set A might be the persons interviewed in a



household sample survey for a given year, and set B might be a sample



of income tax returns for that same year. It will be assumed that A



contains observations on k variables, while B contains observations



on m variables.  By assumption, both k and m are less than R, and all



of the variables are contained in U. Some variables from U may be



contained in both A and B, while at least some will be contained in



only one set. The i.th unit in A, which will be denoted A.i, contains



k observed variables, as shown below:



 



                      A.i = (a.il a.i2...a.ik)



 



Similarly, the i.th unit in B contains m observed variables:



                      B.i = (b.i1 b.i2 ... b.im)



 



     It will be assumed that at least some of the variables in A and



B can contain errors, while in U they do not.  Because of different



error components, a variable from U which appears in both A and B can



have different values in the two sets for the same underlying unit in



U. For example, even if wage income were defined identically in the



household survey and the tax return, the survey response might differ



from the amount shown on the tax return.



 



3. Hypothetical Exact Match



 



     At this point we have defined the universe and the two data sets



which will be matched statistically.  We will now define "C," a



hypothetical data set which represents the result of an exact match



(carried out without error) between A and B, if the underlying units



represented in A were also represented in B. The set C is



hypothetical because that exact match cannot be carried out.  The



exact match is impossible because very few of the units represented



in A are also represented in B. By assumption C contains all k



variables from A and all m variables from B, including their error



terms.  Because a statistical match is viewed as an approximation of



an exact match, C is the data set which we try to approximate when we



perform a statistical match..14 It is important to note that C is not



necessarily unique.  The form of C depends upon which data set, A or



B, is taken as the base..15 We are assuming that A is the base set.



 



____________________________ 



     .13 Earlier versions of much of the material in this chapter



appeared in Radner (1974, 1977, 1979).



     .14 There may be cases in which a statistical match is not an



approximation of an exact match.  For example, in some cases it might



be useful to bias the match (relative to the exact match result) in



order to adjust for underreporting of data and thereby avoid a



postmatch adjustment step.



     .15 One set can be used as the base set for part of the sample



and the other set can be used as the base set for the rest of the



sample.  For



 



                                 16



 



 



 



 



For the i.th unit in A, the information in C will be denoted C.i,



and can be expressed as follows:



 



     C.i = (a.i1 a.i2 ... a.ik b*.i1 b*.i2 ... b*.im)



          = (A.i B.i*)



 



Using the previously mentioned example, C.i contains the survey



response given by A.i and the data from the tax return filed by A.i. 



As noted above, that tax return does not appear in B, except in rare



cases.



 



4.   Estimate of Hypothetical Exact Match



 



     When we actually want to make a match, we do not know C (i.e.,



we do not know B.i*).  We therefore make (either explicitly or



implicitly, depending upon the matching method) an estimate of C,



called "L", using whatever information is available.  This estimate



is used in carrying out the match.  Not all of the variables in B.i*



need to be estimated.  The estimated variables in B.i* (along with



any constructed variables) will be used as "matching" variables; that



is, they will be used to carry out the match.  Estimated values can



be obtained by assumption.  For example, for a given A unit, it might



be assumed that the value for a given B variable should be equal to



the value for a given A variable (say, a.i1 = b*.i1).  We could say



that wage income in B should be identical to wage income in A. This



would be valid if wage income were defined identically and had an



identical error pattern in A and B, which ordinarily is not true. 



When such an equality does hold, we have a special case in which, for



those variables, the estimation of C is trivial.  Estimated values



can also be obtained by other means, for example, by regression



techniques or by using information from an exact match between sets



similar to A and B or from an exact match of subsamples of A and B.



The estimates often vary in reliability for the different B



variables.  In some cases the estimates of B.i* are constructed in



such a way that the distributions of the estimated variables



approximate the distributions of the original B variables.



     For the i.th unit in A, the information in L will be denoted



L.i, and can be expressed as follows:



 



     L.i = (a.i1 a.i2 ... a.ik b*.i1 b*.i2 ... b*.im) = (A.i B*.i)



 



Although we have shown all m variables as estimated, as noted above, it



is not necessary to estimate all of them.  Using the continuing



example, for each unit in A, L contains that unit's survey response



data and estimates of some or all of the variables in the tax return



filed by that A unit.



 



5. Statistical Match Result



 



     We now introduce "M," the result of statistically matching sets



A and B in some unspecified way.  For the i.th unit in A, the



information in M will be denoted M.i, and can be expressed as follows:



     M.i = (a.i1 a.i2 ... a.ik bø.i1 bø.i2 ... bø.im) = (A.i Bø.i) 



In our example for each unit in A, M contains that unit's survey



response data and the tax return data from the B unit assigned to



that A unit in the statistical match.



     It should be noted that in some cases, where sample weights



differ, A units are assigned more than one B unit and sample weights



are split so that the total weight of the A unit (and of the B units)



remains unchanged.
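
A small sketch may help to show what splitting sample weights means.  It assumes that both sets have already been sorted on a matching variable, that the weights are positive integers, and that the two sets have equal total weight; the function name split_weights and the figures are hypothetical, and this is only one of many possible ways to split weights.

```python
def split_weights(a_units, b_units):
    """Pair A and B units in list order, splitting integer sample weights.

    a_units, b_units: lists of (unit_id, weight) pairs, assumed to be
    sorted on a matching variable and to have equal total weight.
    Returns a list of (a_id, b_id, weight) records.
    """
    if sum(w for _, w in a_units) != sum(w for _, w in b_units):
        raise ValueError("total weights of A and B must be equal")
    result = []
    i = j = 0
    a_left = a_units[0][1] if a_units else 0
    b_left = b_units[0][1] if b_units else 0
    while i < len(a_units) and j < len(b_units):
        w = min(a_left, b_left)
        result.append((a_units[i][0], b_units[j][0], w))
        a_left -= w
        b_left -= w
        if a_left == 0:          # this A unit's weight is used up
            i += 1
            if i < len(a_units):
                a_left = a_units[i][1]
        if b_left == 0:          # this B unit's weight is used up
            j += 1
            if j < len(b_units):
                b_left = b_units[j][1]
    return result


# Made-up weights: both sets represent 150 units of the universe.
matched = split_weights([("a1", 100), ("a2", 50)], [("b1", 60), ("b2", 90)])
```

Here unit a1 (weight 100) is assigned both b1 (with weight 60) and b2 (with weight 40), so that each unit's total weight in the matched file equals its original weight.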



     It is not necessary for every B unit to be used in the match



solution, and some B units can be used more than once in the



solution..16 It follows from the definition of a statistical match



that the m variables from each B unit are assigned as an entity.



     In making a statistical match we choose among alternative



solutions; each alternative solution is characterized by the



particular set of B units assigned and the particular A unit(s) to



which each is assigned.  We choose the solution in which M approxi-



mates L as closely as possible, in terms of the variables and



relationships of greatest importance in the results of the match. 



This approximation can be viewed in terms of a "distance function."



We can define in general terms a distance function, "D," which



measures the distance (D.M) of M from L. The distance function D is



chosen according to the purpose of the match.  Thus,



                           D.M = D(M, L/P)



where P denotes the purpose of the match..17  The statistical match



solution which minimizes D.M is the optimal match result..18



 



C. Applications of Statistical Matching



 



     The vast majority of statistical matching work has been in the



field of economics.  The first statistical match in economics was



performed at the Bureau of Economic Analysis of the U.S. Department



of Commerce in 1968 in connection with estimating the size



 



______________________________



 



example, a tax return sample might be used as the base set for the



high-income portion of a match (where it is the denser sample), while



a household sample survey might be used as the base set for the rest



of the sample (where it is the denser sample).  In constrained



matches (see p. 18), both sets are used as base sets for the entire



sample.



     .16 In some matching procedures every B unit is required to be



used in the match solution, and used with its original sample weight. 



For example, see Radner (1974) and Turner and Gilliam (1975).



     .17  In this formulation, it is assumed that the distributions



of the B variables in L approximate the distributions of those



variables in C. If that is not true, then, in some cases, the



formulation D.M = D(M,L,B/P) can be used since it might be desirable



to approximate distributions from B.



     .18 This is not meant to suggest that statistical matches should



necessarily be carried out using distance functions; random selection



within cells is one possible alternative.



 



                                 17



 



 



 



 



distribution of family personal income.  Another early match was



performed at the Brookings Institution in connection with analysis of



the tax system.  More recent work has been done at Statistics Canada,



Yale University (and the National Bureau of Economic Research), the



Office of Tax Analysis of the U.S. Treasury Department, Brookings,



the Office of Research and Statistics of the Social Security Adminis-



tration, and Mathematica Policy Research. These matches were



undertaken in order to construct more comprehensive and/or more



accurate data bases from existing ones.  Statistically matched files



have been used to make estimates of the distributions of income,



taxes, wealth, and the costs and effects of changes in government



programs.  Proposed uses include making estimates from "nested



surveys" (Office of Federal Statistical Policy and Standards, 1978a)



and the construction of microdata sets consistent with the sectors of



the National Income and Product Accounts (United Nations Statistical



Office, 1978). Most of the matches discussed here have been between



household survey samples and tax return samples.  Others were between



two household surveys, and between two files constructed from several



types of data using exact matches.



 



1. Matching Steps



 



Several steps in actually making a statistical match should be



mentioned here.  First, if the populations represented by the two



files differ, a "universe adjustment" might be needed.  Second, a



"units adjustment" might be needed if the units of observation in the



two files differ (e.g., persons and tax units).  Third, "matching



variables", the variables in the two files which are used to choose



the B set records to be matched with the A set records, need to be



chosen.  Ordinarily, matching variables are defined similarly in the



two files and are highly correlated with important "nonmatching"



variables.  In some cases, matching variables are constructed as



functions of one or more variables in the set.  Fourth, whatever



"linking information" exists needs to be identified.  Linking



information consists of information (or assumptions) about joint



distributions of the matching variables in the two files in C. Fifth,



that linking information is used in the construction of L (either



explicitly or implicitly).  The construction of L includes the ad-



justment of values of matching variables (in one or both sets) to



take account of differences in definitions and response and reporting



error patterns,19 as well as the construction of matching variables.



Estimated values might be obtained by assumption.  For example, as



noted earlier, for a given A unit it might be assumed that the value



for a given B variable should be equal to the value for a given A



variable.  We will call this assumption the "equality assumption."



Estimated values can also be obtained by other means, for example, by



regression techniques or by using cross-tabulations from an exact



match between subsets of A and B or between sets similar to A and B.



It is important to note that estimates of B set variables in L can



vary in their reliability. Finally, in the "merging" step, the



records from the nonbase set are chosen.  Although many different



methods have been used in this final step, several basic similarities



can be identified.  In most matches, both files have been separated



into comparable subsets of units, or "cells." Within each cell, rules



have been specified for the choice of one or more records from the



nonbase file to be assigned to each record from the base file.  The



selection of the record often was based upon a distance function by



which a distance was computed between a given base set record and



each potential match in the nonbase set.  The distance was computed



from differences between values of the matching variables in the two



records.  The potential match with the smallest distance ordinarily



was chosen as the match.
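The merging step just described can be illustrated with a short Python sketch.  It is a sketch only: it assumes records held as dictionaries, a single categorical cell variable, and an unweighted sum of absolute differences as the distance function.  The function and field names are hypothetical and are not taken from any of the matches described in this report.

```python
from collections import defaultdict


def nearest_match(base, nonbase, cell_key, match_vars):
    """For each base record, choose the nonbase record in the same cell
    with the smallest distance (sum of absolute differences)."""
    by_cell = defaultdict(list)
    for rec in nonbase:
        by_cell[rec[cell_key]].append(rec)

    matches = []
    for a in base:
        candidates = by_cell.get(a[cell_key], [])
        if not candidates:
            matches.append((a, None))  # no potential match in this cell
            continue
        best = min(
            candidates,
            key=lambda b: sum(abs(a[v] - b[v]) for v in match_vars),
        )
        matches.append((a, best))
    return matches


# Hypothetical records: one base (A set) unit and three nonbase (B set) units.
base = [{"id": "a1", "cell": "married", "wages": 10000, "interest": 200}]
nonbase = [
    {"id": "b1", "cell": "married", "wages": 9000, "interest": 100},
    {"id": "b2", "cell": "married", "wages": 10100, "interest": 250},
    {"id": "b3", "cell": "single", "wages": 10000, "interest": 200},
]
result = nearest_match(base, nonbase, "cell", ["wages", "interest"])
```

Because nothing prevents the same nonbase record from being chosen for several base records, this sketch corresponds to choosing with replacement.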



 



2. Two Basic Types of Methods



 



     Many different matching methods have been used.  These methods



will be separated into two principal types, "constrained" and



"unconstrained," according to the extent to which the distributions



of the nonbase set variables are used in the matching procedure.  In



a constrained match, every nonbase set record appears in the matched



result and has a sample weight identical to its sample weight before



matching.20  Thus, the distributions and joint distributions of



nonbase set variables (as well as base set variables) are not changed



by the match.  In an unconstrained match, there is no such



restriction on the nonbase set variables.21  A constrained match



can be viewed as choosing nonbase set records without replacement,



while an



 



 



____________________



 



     19 Such adjustments have been called "alignment" by Ruggles and



Ruggles (1974).



     20 It should be noted that a nonbase set record can be matched



with more than one base set record if the original sample weight of



the nonbase set record is split among the base set records.  It



should also be noted that in practice the definition of a constrained



match can be relaxed to include matches in which sample weights (in



either file) are not identical before and after matching but can



change only slightly (e.g., due to round-off error).



     21 Unconstrained matches could be separated into different



types, for example, according to whether, and how, the distributions



of the nonbase set variables are used in the construction of L.



 






 



 



 



unconstrained match can be viewed as choosing with replacement.  A
constrained match does not always allow the best match for each base
set record; thus, in a constrained



match, on the average, the matches are not as close as can be



obtained in an unconstrained match.  However, in a constrained match,



no reweighting error is added to the nonbase set information as



ordinarily happens in an unconstrained match.  A matched record will



contain two sample weights, one from each file.  In an unconstrained



match, ordinarily the sample weight from the base set portion of the



matched record is used in the results.  Thus, the nonbase set



information is reweighted.  In a constrained match, the sample



weights from the two files in a matched record will be the same.



 



3.   History and Development of Matching Methods



 



     Statistical matching in economics began as a solution to a



specific problem faced by the Bureau of Economic Analysis (BEA) of



the U.S. Department of Commerce,22 namely improving the accuracy of and



adding more detail to household sample survey income data (from the



Current Population Survey).  The solution was a statistical match



between the household sample survey and a sample of income tax



returns.  Such a statistical match was also the solution to a problem



the Brookings Institution was interested in: putting a sample of tax



returns on a family unit basis and adding nontaxable income types and



nonfilers to the tax return data.  However, BEA and Brookings chose



quite different matching methods. The BEA and Brookings (MERGE-66)



matches are the most important members of what might be called the



first generation of statistical matches in economics.  A second match



carried out by BEA (the SFCC match described later) also belongs to



the first generation.  The other matches described here belong to the



second generation.  Those other matches took into account the results



of and experience with the BEA and Brookings MERGE-66 matches.



     a.  Bureau of Economic Analysis, U.S. Department of Commerce,



CPS-TM Match 23



     The BEA CPS-TM match was between the March 1965 Income



Supplement of the Current Population Survey (CPS), conducted by the



Bureau of the Census, and the 1964 Tax Model (TM), an Internal
Revenue Service sample of Federal individual income tax returns.  The
purpose of the match was the improvement of the accuracy of CPS
income amounts and the addition of



tax return income detail to the CPS observations; the CPS was the



base set.  There were some differences between the universes: some CPS



persons did not file tax returns and some TM returns were filed by



persons outside the CPS universe (e.g., persons abroad and some



military personnel).  The units in the two sets were different: persons



in the CPS and tax filing units in the TM.  This was a constrained



match; cells and ranking of records according to size of income



amounts were used. The basic universe adjustment used was the esti-



mation and elimination from the CPS of those who filed no tax return



("nonfilers").  After the definitions of the units in the two sets



had been made roughly comparable by transforming CPS person units



into tax filing units using small amounts of information from the



1963 Pilot Link Study (an exact match), the nonfilers were chosen as



a residual.  Units considered to have the lowest probability of



filing were chosen to be nonfilers. There was very little empirical



(exact match) linking information available.  Matching variables were



chosen on the basis of the (subjective) reliability of the



assumptions regarding their joint distributions.  After examination



of the relevant overall (marginal) distributions (and taking into



account the exact match information that did exist), it was assumed



that the differential response error and differences in definition



between matching variables in the two sets were important factors. 



The ranking described below was used to take account of these



factors. Cells were constructed for each matching variable.  These



cells were constructed in sequence, with the cells for the second



variable defined within the cells for the first variable, and so



forth.  The variables used were (in order) marital status, wage and



salary income, self-employment income, and property income.  This



formulation incorporated the linking information which suggested that



the correlation between the CPS and TM amounts in an exact match



carried out without error would be highest for wage and salary



income, next highest for self-employment income, and lowest for



property income, among the numerical matching variables.  The



specific assumption about the joint distributions of matching vari-



ables which was used was that units with approximately the same rank



___________________________

     22 The Office of Business Economics (OBE) became the Bureau of
Economic Analysis in 1972.

     23 Budd and Radner, 1969, 1975; Budd, 1971; Budd, Radner, and
Hinrichs, 1973; Radner, 1974.

in the (conditional) distributions of the specific variables in the
two sets would be matched.  That is, for numeric variables, the
definitions of cells were based upon rank rather



than upon the absolute size of values.  Although this assumption was



consistent with the overall distributions in the two sets, it



obviously was crude.  The assumptions used also implied that, in each



cell, there would be the same weighted number of units in each set. 



In the final step in the match, observations in both sets were



duplicated and their sample weights were split so that no sampling



was needed and the overall distributions of all variables in both



sets were preserved.  One of the benefits of this technique was that



it eliminated possible error arising from widely differing sample



weights in the TM.  A crude sensitivity analysis was carried out by



comparing the constrained method results with the results of several



versions of an unconstrained method (Radner, 1974). The BEA match



gave a central role to differences between the matching variables in



the two sets.  Although this emphasis had its origin in the fact that



the match had correction of income amounts as its purpose,



differences between matching variables can be important factors in



many matches, regardless of their purpose.  BEA also emphasized the



accuracy of the overall distributions of variables in the matched



file.  These two factors led BEA to use a constrained method.
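The weight-splitting idea can be sketched for a single cell in Python.  The sketch assumes that the weighted numbers of units in the two sets are already equal within the cell and that one numeric variable supplies the ranking.  The record layout and names are hypothetical, and the code is an illustration rather than the BEA program.

```python
def constrained_rank_match(a_recs, b_recs, rank_var, tol=1e-9):
    """Pair records by rank (high to low), splitting sample weights so
    that each record's full weight is carried into the matched result."""
    a_sorted = sorted(a_recs, key=lambda r: r[rank_var], reverse=True)
    b_sorted = sorted(b_recs, key=lambda r: r[rank_var], reverse=True)
    a_left = [r["weight"] for r in a_sorted]  # weight not yet assigned
    b_left = [r["weight"] for r in b_sorted]
    i = j = 0
    out = []
    while i < len(a_sorted) and j < len(b_sorted):
        w = min(a_left[i], b_left[j])  # weight shared by this pair
        out.append((a_sorted[i]["id"], b_sorted[j]["id"], w))
        a_left[i] -= w
        b_left[j] -= w
        if a_left[i] <= tol:  # A record used up: move down the ranking
            i += 1
        if b_left[j] <= tol:  # B record used up: move down the ranking
            j += 1
    return out


# Hypothetical cell with equal weighted totals (300) in both sets.
a_cell = [{"id": "a1", "income": 50.0, "weight": 100.0},
          {"id": "a2", "income": 30.0, "weight": 200.0}]
b_cell = [{"id": "b1", "income": 60.0, "weight": 150.0},
          {"id": "b2", "income": 20.0, "weight": 150.0}]
pairs = constrained_rank_match(a_cell, b_cell, "income")
```

In this example record b1 is split between a1 and a2, and the weights assigned to each record sum to its original sample weight, which is the defining property of a constrained match.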



     b.  Bureau of Economic Analysis, U.S. Department of Commerce, SFCC
Match 24

     A second early statistical match was also carried out in the
BEA income size distribution work.  This match was less detailed and
less important than the CPS-TM match described above, but it does



deserve mention as one of the earliest statistical matches.  This



match, performed in 1969, was between the statistically matched 1964



CPS-TM file (corrected for income tax return audit) and the Survey of



Financial Characteristics of Consumers (SFCC).  The SFCC contained



income data for calendar 1962 and asset and liability data for the



end of 1962 for roughly 2,500 households.  The purpose of this match



was the addition of data by which amounts of several income types not



covered in the CPS-TM file could be assigned.  Most of those income



types were noncash types and most of the data added were asset data.
This match was performed on a family unit (family or unrelated
individual) basis, and was an unconstrained match.  The unconstrained



approach was chosen primarily because the two files contained data
for different years.  The basic method was the separation of both
files into cells and then, within cells, ranking the records in each
file according to size of interest income.

     24 Budd, Radner, and Hinrichs, 1973.

The specific SFCC record to be matched to a given CPS-TM



record was the SFCC record with a corresponding ranking. Size of



total money income, type of family unit, age, race, and major source



of earnings were used as cell classifiers.  These variables were



chosen primarily because of their relationship with the asset types



to be added to the CPS-TM file (interest income was used for the same



reason).  SFCC records were reweighted so that, within each cell, the



weighted numbers of records were equal in the two files.  The records



in both files were then ranked, within cells, according to size of



interest income (from high to low); matching was carried out based



upon that ranking.  The matching did not involve the splitting of



records as had been done in the CPS-TM match.  Instead, for each
CPS-TM record, the SFCC record which fell at a "selection point" in the



series of cumulated sample weights was chosen.  For a given CPS-TM



record, the selection point was defined to be one third of the



record's sample weight plus the cumulated sample weight of the CPS-TM



record above it in the ranking.  The highest ranking SFCC record



whose cumulated sample weight was greater than or equal to that value



was chosen as the match.  For example, if the selection point was



6,000, then the highest ranking SFCC record with a cumulated weight



of at least 6,000 would be the match.
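The selection-point rule can be sketched in Python for a single cell.  The sketch assumes that the SFCC records have already been reweighted so that the weighted totals agree, and that each record is a tuple of identifier, interest income, and sample weight.  These layouts and names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def selection_point_match(cps_tm, sfcc, fraction=1 / 3):
    """Match each CPS-TM record to the highest ranking SFCC record whose
    cumulated weight reaches the CPS-TM record's selection point."""
    cps_sorted = sorted(cps_tm, key=lambda r: r[1], reverse=True)
    sfcc_sorted = sorted(sfcc, key=lambda r: r[1], reverse=True)
    sfcc_cum = list(accumulate(r[2] for r in sfcc_sorted))

    matches = {}
    cum_above = 0.0  # cumulated weight of CPS-TM records ranked above
    for rec_id, _interest, weight in cps_sorted:
        point = cum_above + fraction * weight
        k = bisect_left(sfcc_cum, point)  # first cumulated weight >= point
        k = min(k, len(sfcc_sorted) - 1)  # guard against round-off at the end
        matches[rec_id] = sfcc_sorted[k][0]
        cum_above += weight
    return matches


# Hypothetical cell: (id, interest income, sample weight).
cps_tm_cell = [("c1", 500, 3000), ("c2", 300, 6000), ("c3", 100, 3000)]
sfcc_cell = [("s1", 600, 4000), ("s2", 200, 4000), ("s3", 50, 4000)]
matched = selection_point_match(cps_tm_cell, sfcc_cell)
```

For record c2 the selection point is 3,000 plus one third of 6,000, or 5,000, so the match is s2, the first SFCC record whose cumulated weight (8,000) is at least 5,000.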



     c.   Brookings Institution MERGE-66 25

          MERGE-66 was between the



          Survey of Economic Opportunity (SEO) for income year 1966



          and the 1966 Internal Revenue Service Tax File of



          individual federal income tax returns.  This match was one



          step in the construction of a corrected and more detailed



          microdata base for policy analysis, particularly tax policy



          analysis.  The SEO was used as the base set; cells, ranges,



          and a distance function were used.  This was an



          unconstrained match.  Universe adjustments were made to



          both files: it was assumed that high-income (or loss) units



          were in the Tax File but not in the SEO, and some filers of



          tax returns were not in the SEO universe. The first step



          was the formation of cells in both sets based upon marital



          status, age, number of dependent exemptions, and income



          types received, including the major source of income; 74



          cells were used.  An acceptable range of major source



          income was defined for each SEO unit; this range was the



 



                           25 Okner, 1972.



 






 



 



 



 



 



     SEO amount plus or minus two percent, with upper and lower
     absolute amount bounds.  Then, for each SEO unit, each Tax File
     return which was both in the appropriate cell and within the
     acceptable major source range had a "consistency score"
     computed.  This score, which was a simple distance function,26
     was based upon the correspondence of the



     existence of home ownership, property income, self-employment



     income, and capital gains in the two sets (some of that



     information was estimated in each file).  The group was then



     narrowed down by including only the 25 percent of the group with



     the highest consistency scores.  In addition, a minimum absolute



     consistency score was required.  If this top 25 percent group



     was "large enough," then a Tax File return was selected



     randomly, with the probability of selection for each return



     proportional to its weight.  If the eligible subset was "too



     small," then the major source income band was widened and the



     whole process was repeated.  The basic procedure was essentially



     to treat the SEO units one at a time and to define a small



     subset of the Tax File from which one return would be drawn
     randomly.  Thus, the one best match for each SEO unit was not



     identified; the final selection was random. The equality



     assumption was used for all variables, both reported and



     constructed.  The basic approach used in the construction of L



     (the estimated hypothetical exact match) was what might be



     called a "modal" one; the most common value of the variable was



     used in L. MERGE-66 can be compared to the Census Bureau's hot



     deck allocation procedure.  The hot deck procedure, which can be



     thought of as the "state of the art" of record matching in
     economics (other than exact matching) prior to the advent of
     statistical matching, resembled an unconstrained match with no
     differences between matching variables.  MERGE-66 was similar



     to the hot deck method in that respect.  In contrast, the BEA



     match was a marked departure from the hot deck precedent.
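The selection procedure described for MERGE-66 can be sketched in Python for a single SEO unit.  This is an illustration under stated assumptions, not the Brookings program.  The minimum score, the minimum group size, the widening factor, and the field names are hypothetical, and the upper and lower absolute amount bounds on the income band are omitted.

```python
import math
import random

# Existence flags compared by the consistency score (hypothetical names).
FLAGS = ("owns_home", "property_income", "self_employment", "capital_gains")


def consistency_score(seo, ret):
    """Number of flags on which the two records agree (higher is better)."""
    return sum(seo[f] == ret[f] for f in FLAGS)


def merge66_pick(seo, tax_returns, rng, band=0.02, min_score=2,
                 min_group=2, widen=2.0, max_rounds=5):
    """Draw one Tax File return at random, with probability proportional
    to weight, from the best-scoring quarter of the acceptable returns."""
    for _ in range(max_rounds):
        half_width = abs(seo["major_income"]) * band
        lo = seo["major_income"] - half_width
        hi = seo["major_income"] + half_width
        group = [r for r in tax_returns
                 if r["cell"] == seo["cell"] and lo <= r["major_income"] <= hi]
        ranked = sorted(group, key=lambda r: consistency_score(seo, r),
                        reverse=True)
        top = ranked[:math.ceil(0.25 * len(ranked))]  # top 25 percent
        eligible = [r for r in top if consistency_score(seo, r) >= min_score]
        if len(eligible) >= min_group:  # group is "large enough"
            weights = [r["weight"] for r in eligible]
            return rng.choices(eligible, weights=weights, k=1)[0]
        band *= widen  # group "too small": widen the band and repeat
    return None


def make_record(rec_id, income, n_true, weight):
    """Hypothetical helper: the first n_true flags are True, the rest False."""
    rec = {"id": rec_id, "cell": "m", "major_income": income, "weight": weight}
    rec.update({f: k < n_true for k, f in enumerate(FLAGS)})
    return rec


seo_unit = make_record("s1", 10000.0, 4, 1.0)
tax_file = [make_record("t1", 10050.0, 4, 500.0),
            make_record("t2", 9900.0, 4, 1500.0)]
tax_file += [make_record(f"u{k}", 10000.0 + 10 * k, 1, 1000.0) for k in range(6)]
picked = merge66_pick(seo_unit, tax_file, random.Random(0))
```

In this example eight returns fall within the income band, the top quarter consists of t1 and t2, and one of those two is drawn at random with t2 three times as likely as t1.  As in the text, the single closest return is not identified.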



     d.   Christopher Sims' Comments 27

     A word should be said about



Christopher Sims' two early "Comments" on MERGE-66 and other matching



procedures.  Sims formulated the statistical matching problem as the



estimation of the joint distributions of variables which appear in



only one of the sets being matched (non-common variables), using



variables which appear in both sets (common variables) to make those
estimates.  Sims defined X variables, which appear in both sets, Y
variables, which appear in only one set, and Z variables, which
appear only in the other set.

     26 In this distance function, the higher the value the better the
match.  This is the opposite of distance functions described earlier
in which lower values were better.  Both types are referred to as
distance functions in this report.

     27 Sims, 1972, 1974.

The X variables in the two sets are then matched, and



estimates of the joint distributions of Y and Z are obtained.  Sims



interprets the MERGE-66 and other procedures to assume that Y and Z



are independent conditional upon X. This formulation suggests



conclusions regarding the accuracy of statistically matched sets.
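The conditional independence assumption can be written out as follows.  The density notation f is introduced here for illustration only and does not appear in the original discussion.

```latex
f(y, z \mid x) = f(y \mid x)\, f(z \mid x)
\quad\Longrightarrow\quad
f(x, y, z) = f(x)\, f(y \mid x)\, f(z \mid x)
```

Under this assumption, any association between Y and Z in the matched file arises only through their separate relationships with X.  An association between Y and Z that remains after conditioning on X cannot be recovered by the match.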



Sims' formulation of the statistical matching problem has been quite



influential.  However, it should be noted that that formulation



applies to a special case of the generalized statistical matching



problem.  Two limitations on the applicability of his formulation



should be mentioned.  First, Sims gave little attention to the joint



distributions of the matching variables in the two sets.  In his



formulation, in effect he assumed that the equality assumption was



valid (although he did mention the adjustment of matching data). 



However, the separation of variables into X (variables which appear



in both files), Y (variables which appear only in one file), and Z



(variables which appear only in the other file) is frequently not



applicable.  In many cases the variables used to match on (X's) are



not strictly comparable; that is, they differ in definition or error



component (e.g., response error), or both.  In general, there can be



a range of degree of comparability between pairs of variables in the



two files.  Pairs of variables are chosen as matching variables when,



as a necessary condition, information about the joint distributions



of those variables (in an exact match carried out without error) is



known or can reasonably be inferred.  When the matching variables are



chosen, the variables are separated into matching and nonmatching



variables, but the matching variables often differ in the reliability



of the information available about their joint distributions.  These



differences can be reflected in the matching method. The second



limitation is that the purpose of the match is not always only the



estimation of the joint distribution of non-matching variables in the



two files.  In many matches the matching variables from the nonbase



set have been used in the results of the match.  Where tax return



files have been used, the matching variables from the tax return data



have usually been used in the results of the match.  This has been



done primarily because it was desirable to use the entire set of tax



return variables as an entity.  However, it should be noted that



where the matching variables in the two files differ in definition or



in the amount of error they contain, it can be useful to use



 






 



 



 



 



 



     the matching variables from the nonbase set in the results even



if the use of the nonbase set data as an entity is not crucial.  For



example, some nonbase set matching variables might contain less



response error.



 



e.   Statistics Canada SCF-FEX Match 28

     The Statistics Canada match
     was carried out between two Canadian microdata sets, the Survey
     of Consumer Finances (SCF) and the Family Expenditure Survey



     (FEX), which contain data for 197 . 70.  The purpose was the



     addition of expenditure data to the SCF.  This match had the



     advantage that both microdata sets were obtained using the same



     sampling frame, the Canadian Labour Force Survey.  Thus, both



     the universes and the definitions of units were identical.  In



     addition, many of the variables in the two sets purposely were



     defined identically.  The approach was influenced primarily by



     MERGE-66.  This was an unconstrained match, using the SCF as the



     base set.  Cells and a distance function were used, as was the



     equality assumption.



 The first step in this match was to use multiple linear regression



analysis to determine, given the purpose of the match, which



variables should be used as matching variables, and how much weight



should be given to each of those variables.  This step represented an



attempt to make the choice of matching variables and their relative



importance more objective.  This attempt was in contrast to both the



BEA and MERGE-66 matches in which those choices were almost entirely



subjective.  In the regressions, the independent variables (income



and demographic characteristics) were variables which appeared in



both sets.  The dependent variables chosen appeared only in one set



and were important to the results of the match; the SCF dependent



variables were asset and debt information, and the FEX dependent



variables were expenditure information.  Both sets were separated



into four subsets based upon home ownership and type of consumer unit



prior to the running of the regressions. Once the matching variables



had been chosen, they were separated into "mandatory" and "desirable"



variables.  The mandatory variables (which were categorical



variables) were used to partition the sets into cells.  Following the



precedent of the MERGE-66 consistency scores, "union scores" were



computed for desirable variables; this was a distance function. 



Different maximum point totals were assigned to different linking



variables on the basis of the regression results; the greater the



variable's explanatory power, the greater its maximum point total. 



For



 



     28 Alter, 1974.



 example, "no discrepancy in amounts of major source income" was



worth 40 points, while "no discrepancy in total income" was worth 30



points.  The Statistics Canada technique differed from the MERGE-66



technique by assigning different point values to discrepancies of



different sizes; the MERGE-66 version was "all or nothing" in



concept. A ranking procedure was used in the merging step.  Records



in both sets were ordered according to size of income within the



mandatory cells.  Then the first FEX record with at least a 95



percent union score was matched with the relevant SCF record.  Some



SCF records were not matched in the first run and the subsequent runs



which were necessary because of the effect of file sequence.  Further



runs were made with the minimum acceptable consistency score lowered. 



Finally, several variables were changed from mandatory to desirable



so that all SCF records could be matched.  The FEX records were used



with replacement.  The ranking procedure produced biases, which are



commented on in Alter (1974). Statistics Canada also presented data



regarding the quality of the matching.  For example, the corre-



spondence of codes of variables which were used as desirable matches



was checked. In summary, the Statistics Canada match contained three



responses to the earlier matches: (1) an attempt to make the choice



of matching variables and their relative weights more objective; (2)



a refinement in the use of distance functions by relating the



distance (or union score) to the size of the deviation (discrepancy);



and (3) an emphasis on attempts to assess the quality of the



matching.

     f.   Yale University (and National Bureau of Economic Research) 29

               The Yale group was interested in devising



               a generalized statistical matching procedure which can



               be applied efficiently to very large microdata sets



               (i.e., those containing several million observations). 



               In this respect, the Yale work differed from that



               carried out at BEA, Brookings, and Statistics Canada. 



               In those matches the procedures were tailored to the



               particular sets being matched, sets which were not



               very large.  The Yale approach can be viewed as having



               its origin in the comments by Sims.  An important part



               of the Yale work is an attempt to make the selection



               of cells more objective.  The procedure contains two



               important parts, the "sort-merge strategy" and the



               estimation of "I(X)" regions. The sort-merge strategy



               is a technique for implementing the use of cells which



               is particularly appropriate for microdata sets with
               large numbers of observations.  In each file, for each
               of a set of matching (or "common" or "X") variables,
               each observation is assigned a set of sort tags.

     29 Ruggles and Ruggles, 1974; Ruggles, Ruggles, and Wolff, 1977;
Wolff, 1977.

     These sort tags represent cells in



     the variable; more detailed (narrower) cells are nested within



     the broader cells.  If there are n levels of detail for the



     cells, and m matching variables, then each observation will have



     nm (n times m) sort tags (cell codes) assigned to it.  The purpose of having



     different levels of detail is to ensure a match for every A file



     observation.  An A file record is matched with a B file record



     with identical sort tags for all matching variables at the most



     detailed cell level possible.  The procedure allows B set



     records to be used more than once, or not at all; thus, the



     procedure is of the unconstrained type.  Because both files only



     need to be sorted once on the basis of these nested sort tags



     (with the least detailed set as the primary sort), the costs of



     matching large data sets are held down. In most cases, the



     estimates of the I(X) regions define the cells which correspond



     to the sort tags.  The estimation of the regions follows the



     lines suggested in Sims (1972).  The I(X) regions are ranges of



     the matching (X) variables for which the distributions of the



     non-matching variables differ significantly from one region to the next.  Matching



     takes place within corresponding I(X) regions in the two sets. 



     In this technique the X (matching) variables are used only as



     intermediaries in the estimation of the joint distributions of



     the non-matching variables in the two sets.  It is in this view



     of the matching problem that the Yale procedure follows from



     Sims.  The estimation of the I(X) regions is an attempt to find



     an objective way to construct cells for matching, a goal which



     was similar to Statistics Canada's. Chi-square tests and the



     size of correlation coefficients between two distributions are



     used to estimate the I(X) regions.  To make these estimates,



     observations in adjacent ranges of any common variables are



     treated as though they belonged to different samples.  A chi-



     square test is then applied to test whether the distributions of



     the non-common variables in the two ranges of the common



     variable are significantly different.  If they are not



     significantly different, the two ranges can be combined.  If



     they are significantly different, each of the ranges is split



     into two parts and those parts are tested in a similar manner. 



     Because of the sensitivity of the chi-square tests to the number



     of observations involved, those tests are modified by examining



     the size of the correlation coefficient between the



     distributions which are being tested.  If the correlation



     coefficient is low, then the distributions of the non-common
     variables are dissimilar.  Thus, when the chi-square test shows a
     significant difference and the correlation coefficient is low, the



ranges are not combined.  By varying the significance levels for



these tests, the different levels of detail and hence different



numbers of cells are



defined.  It is in this way that more detailed sets of



cells are nested within less detailed cells.
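The use of nested sort tags can be sketched in Python.  The sketch assumes that the cut points for each level of detail are already given (in the Yale procedure they would come from the estimated I(X) regions).  It also uses an in-memory index in place of the single sort and merge of the two files, and it takes the first B file record found in a cell.  The names and cut points are hypothetical.

```python
from collections import defaultdict


def sort_tags(record, levels):
    """Return one tuple of cell codes per level of detail, broadest first.
    levels[k][var] is the sorted list of cut points for var at level k."""
    return [
        tuple(sum(record[var] >= cut for cut in cuts)
              for var, cuts in sorted(level.items()))
        for level in levels
    ]


def sort_tag_match(a_recs, b_recs, levels):
    """Match each A record to a B record with identical tags on all
    matching variables at the most detailed level possible."""
    index = [defaultdict(list) for _ in levels]
    for b in b_recs:
        for lvl, tag in enumerate(sort_tags(b, levels)):
            index[lvl][tag].append(b)

    matches = {}
    for a in a_recs:
        tags = sort_tags(a, levels)
        matches[a["id"]] = None
        for lvl in reversed(range(len(levels))):  # most detailed level first
            candidates = index[lvl].get(tags[lvl])
            if candidates:
                matches[a["id"]] = (candidates[0]["id"], lvl)
                break
    return matches


# Two levels of detail; the narrower cut points include the broader ones,
# so the detailed cells are nested within the broad cells.
levels = [
    {"income": [10000], "age": [40]},
    {"income": [5000, 10000, 20000], "age": [30, 40, 65]},
]
a_file = [{"id": "a1", "income": 12000, "age": 35},
          {"id": "a2", "income": 11000, "age": 25}]
b_file = [{"id": "b1", "income": 25000, "age": 33},
          {"id": "b2", "income": 15000, "age": 38},
          {"id": "b3", "income": 3000, "age": 70}]
tag_matches = sort_tag_match(a_file, b_file, levels)
```

Record a1 finds a B file record in its most detailed cell, while a2 does not and falls back to the broad cell.  B file records can be used more than once or not at all, so the sketch is of the unconstrained type.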



Wolff (1977) describes an application of the Yale
method, the construction of the "MESP" database, which is the



result of three statistical matches and two sets of imputations. 



That file, which contains asset and liability and demographic



information for a sample of roughly 60,000 households, was con-



structed to serve several purposes; Wolff used it to estimate



household wealth distributions.  No single database contained the



data necessary to make those estimates. The first statistical match



in the construction of this file was between the 1969 IRS Tax Model



and an augmented version of the 1970 IRS Tax Model of individual



returns.  Although the 1969 Tax Model was the file of most interest,



the 1970 file contained race and age data (matched in from SSA



records in an exact match) and more detailed data on itemized



deductions which were not in the 1969 file.  The 1969 file was the



base file in this match; data were transferred from the 1970 file to



the 1969 file.  Broad cells based upon return type, sex, age



exemptions, and number of children were used; the Yale method was



applied within those cells.  Size of adjusted gross income (AGI) and



the major components of AGI as percentages of AGI, and total



deductions were used as matching variables.  Differences between AGI



in the files arising from the fact that the data were for different



years were handled by using percentile ranks. The second match, which



was the basic match, was between the result of the first match and



the 1970 Decennial Census 15 percent Public Use Sample (PUS).  The



PUS file was the base file, and detailed information on income from



assets along with other information was transferred to the PUS file. 



Broad cells based upon return type, sex, race, and age were used. 



The matching variables used within those cells were total income,



wage and salary income, self-employment income, number of children,



and home ownership status.  Total income and business and



professional income were matched according to percentile rank in



order to adjust for lack of comparability. The third match was



between the 1970 15 percent PUS and the 1970 5 percent PUS; the 15



percent



 






 



 



 



 



 



     file was the base file.  The 5 percent file contained data on



stocks of some consumer durables which were not in the 15 percent



file; those data were added to the 15 percent file.  Marital status,



age, sex, race, and home ownership status were used as broad cell



variables.  Matching variables within those cells were total