| Federal
Committee on Statistical
Methodology Office of Management and Budget |
Statistical Policy Working Paper 2 - Report on Statistical Disclosure and Disclosure-Avoidance Techniques
Statistical Working Papers are a series of technical documents prepared under the auspices of the Office of Federal Statistical Policy and Standards. These documents are the product of working groups or task forces, as noted in the Preface to each report.

These Statistical Working Papers are published to encourage further discussion of the technical issues and to stimulate policy actions which flow from the technical findings. Readers of Statistical Working Papers are encouraged to communicate directly with the Office of Federal Statistical Policy and Standards with additional views, suggestions, or technical concerns.

Joseph W. Duncan
Director
Office of Federal Statistical Policy and Standards

Statistical Policy Working Paper 2
Report on Statistical Disclosure and Disclosure-Avoidance Techniques

prepared by
Subcommittee on Disclosure-Avoidance Techniques
Federal Committee on Statistical Methodology

U.S. DEPARTMENT OF COMMERCE
Juanita M. Kreps, Secretary
Courtenay M. Slater, Chief Economist

Office of Federal Statistical Policy and Standards
Joseph W. Duncan, Director

Issued: May 1978

Office of Federal Statistical Policy and Standards
Joseph W. Duncan, Director
George E. Hall, Deputy Director, Social Statistics
Gaylord E. Worden, Deputy Director, Economic Statistics
Maria E. Gonzalez, Chairperson, Federal Committee on Statistical Methodology

Preface

This working paper was prepared by the members of the Subcommittee on Disclosure-Avoidance Techniques, Federal Committee on Statistical Methodology. The Subcommittee was chaired by John A. Michael, National Center for Education Statistics, Department of Health, Education, and Welfare. The members of the Subcommittee are the authors of this report and their names are listed below.

This report is intended to help managerial and technical staff of Federal agencies which publish or otherwise release data on methodologies to achieve appropriate disclosure-avoidance practices.
Data released both in tabulations and in the form of microdata are discussed in this report. The Office of Federal Statistical Policy and Standards hopes to organize, with the help of Subcommittee members, seminars with Federal employees to disseminate the findings of the report. In addition, the report may serve as a basis for discussions between Federal data producers and data users.

Members of the Subcommittee on Disclosure-Avoidance Techniques

John A. Michael, Chairperson
  National Center for Education Statistics (HEW)
Richard A. Bell
  Social Security Administration (HEW)
Robert H. Mugge
  National Center for Health Statistics (HEW)
Mervyn R. Stuckey
  Statistical Reporting Service (USDA)
Thomas B. Jabine
  Social Security Administration (HEW)
William J. Smith, Jr.
  Internal Revenue Service (Treasury)
Paul T. Zeisset
  Bureau of the Census (Commerce)

Ex Officio
Maria Elena Gonzalez, Chairperson*
  Federal Committee on Statistical Methodology
  Office of Federal Statistical Policy and Standards (Commerce)
Tore E. Dalenius
  Brown University and University of Stockholm

--------
*Member, Federal Committee on Statistical Methodology

Acknowledgements

The body of this report represents the collective effort of the Subcommittee on Disclosure-Avoidance Techniques. The Subcommittee began by developing the outline for this report, after which writing assignments were apportioned among members. Manuscript was usually subjected to several rounds of review before its acceptance. The major contributors to the respective chapters appear below:

Chapter     Major Contributor(s)
I           Michael
II          Jabine and Dalenius
III         Bell, Mugge, and Dalenius
IV          Zeisset
V           Michael and Zeisset
VI          Jabine

Appendix    Major Contributor(s)
A           The respective agencies
B           Stuckey
C           Lawrence H. Cox, Bureau of the Census
D           Bell

Throughout the development of the report, Thomas Jabine enlightened Subcommittee members on the complexities of the subject and Maria Gonzalez provided encouragement and goal directedness. Members of the Federal Committee on Statistical Methodology and the Office of Federal Statistical Policy and Standards, Department of Commerce (formerly the Statistical Policy Division of OMB) reviewed and commented upon our work. Manuscript was prepared with the good-natured assistance of the management and secretaries of the various statistical agencies. Deserving special commendation is Joyce Peoples of the Social Security Administration, who effectively managed the arduous task of preparing and assembling several drafts of this manuscript.

Members of the Federal Committee on Statistical Methodology

Barbara A. Bailar, Bureau of the Census (Commerce)
Norman D. Beller, Statistical Reporting Service (USDA)
Barbara A. Boyes, Bureau of Labor Statistics (Labor)
Edwin J. Coleman, Bureau of Economic Analysis (Commerce)
John E. Cremeans, Bureau of Economic Analysis (Commerce)
Marie D. Eldridge, National Center for Education Statistics (HEW)
Fred J. Frishman, Internal Revenue Service (Treasury)
Maria E. Gonzalez, Chairperson, Office of Federal Statistical Policy and Standards (Commerce)
Thomas B. Jabine, Social Security Administration (HEW)
Charles D. Jones, Bureau of the Census (Commerce)
Alfred D. McKeon, Bureau of Labor Statistics (Labor)
Harold Nisselson, Bureau of the Census (Commerce)
Monroe G. Sirken, National Center for Health Statistics (HEW)
Wray Smith, Office of the Assistant Secretary for Planning and Evaluation (HEW)

Editorial Note

The opinions expressed in this report reflect the collective judgment of the Subcommittee and do not necessarily reflect the opinion of the Federal Committee or the Office of Federal Statistical Policy and Standards.

Table of Contents

Preface
Acknowledgements

CHAPTER I-INTRODUCTION
A. Scope of Study and Organization of Report
   1. The Nature of Statistical Disclosure
   2. Pinpointing Disclosure Potentials and Disclosure-Avoidance Techniques
   3. Balancing Confidentiality Requirements Against Societal Needs for Information
   4. Other Considerations
   5. Findings and Recommendations
B. Auspices
C. Dissemination of Report

CHAPTER II-DEFINING STATISTICAL DISCLOSURE
A. References in Statutes, Regulations, and Policy Statements
   1. The Privacy Act of 1974
   2. The Freedom of Information Act
   3. Agency Statutes and Regulations
      a. Bureau of the Census, Title 13
      b. Internal Revenue Service
      c. Social Security Administration
      d. Law Enforcement Assistance Administration
      e. National Center for Health Statistics
   4. Advisory Committee Reports
      a. The President's Commission on Federal Statistics
      b. The HEW Secretary's Advisory Committee on Automated Personal Data Systems
      c. The American Statistical Association Ad Hoc Committee on Privacy and Confidentiality
      d. The Privacy Protection Study Commission
B. Evaluation of Statutory Requirements
C. Prior Definitions of Statistical Disclosure
D. A Proposed New Definition of Statistical Disclosure
   1. The Insufficiency of Prevailing Definitions
   2. A Framework for Defining "Statistical Disclosure"
      a. The frame
      b. Data associated with the objects in the frame
      c. The statistics released from the survey
         (1) Macrostatistics
         (2) Microstatistics
      d. Extra objective data
      e. Summary
   3. Statistical Disclosure Defined

CHAPTER III-DISCLOSURE IN THE RELEASE OF TABULATIONS (SUMMARY DATA) FOR PUBLIC USE
A. The Problem of Disclosure in Tabulations: Typology, Identification and Examples
   1. Exact Disclosure
      a. Count data
      b. Magnitude data
   2. Approximate Disclosure
      a. Count data
      b. Magnitude data
   3. Probability-Based Disclosures (Approximate or Exact)
   4. Indirect Disclosure
   5. External or Internal Disclosure
      a. Count data (direct or indirect disclosure)
      b. Magnitude data (direct or indirect disclosure)
B. Evaluating the Disclosure Problem
   1. The Level of Risk of Disclosure
      a. The relative size of the sample
      b. The detail provided in the tabulation
      c. The quality of the data
      d. Availability of external information
   2. The Acceptability of the Disclosure Risk
      a. Sensitivity of data
      b. Possible adverse consequences of disclosure
   3. The Assurances Given to the Respondents
C. Disclosure-Avoidance Techniques
   1. Data Suppression
      a. Cell suppression
      b. Table suppression
   2. "Rolling Up" Data
   3. Disturbing the Data
   4. Limiting Distribution
   5. Evaluation of Alternative Techniques

CHAPTER IV-DISCLOSURE IN MICRODATA
A. Nature of the Problem
   1. Definition of Microdata
   2. Federal Agency Examples of Microdata Release
      a. Bureau of the Census
      b. Social Security Administration
      c. National Center for Health Statistics
      d. National Center for Education Statistics
      e. Internal Revenue Service
B. Evaluation of the Problem
   1. Factors Bearing on the Likelihood of Disclosure
      a. Sample size or fraction of the universe
      b. Uniqueness
         (1) Geographic information
         (2) Characteristics of the respondent
      c. Recognizability
         (1) Population registers
         (2) "Noise" in the data
         (3) Time lag
      d. Hypothesized relationships among the various factors in two types of attempts to penetrate disclosure safeguards
         (1) Searching for a specific individual
         (2) "Fishing expedition"
   2. Acceptability of the Disclosure Risk
      a. Potential harm to the respondent
      b. Potential harm to the agency
      c. Resources available to the misuser
C. Disclosure Prevention Techniques for Public-Use Microdata Files
   1. General Tradeoffs
   2. Elimination of Categories Identifying Small Salient Groups
   3. Allowing No Unique Cases
   4. Introduction of "Noise" into the Data
   5. Removal of Well-Known Individuals from the File
   6. Release of Customized Files
D. Disclosure Prevention Through Restrictions on Use
   1. Alternatives Where Public-Use Microdata Are Not Satisfactory
      a. Special tabulations by the originating agency
      b. Microdata available for restricted use
   2. Contractual/Administrative Requirements on the Restricted User
   3. Agency Experience with Use-Restricting Agreements
      a. Bureau of the Census
      b. Other agencies
   4. Relationship of Computer Security to Use Restriction

CHAPTER V-THE QUESTION OF BALANCE: PROTECTION OF INDIVIDUALS VS. PUBLIC NEEDS FOR INFORMATION
A. Introduction
B. Comments in the Literature
C. Reactions to Agency Policies and Procedures for Disclosure Avoidance
   1. Impact on Individual Data Subjects
   2. Organizations as Data Subjects
   3. Reactions of Data Users
      a. Data-loss problem
      b. Crosscutting standard geographic areas
      c. Changes in disclosure-avoidance techniques
      d. Changes in methodology
      e. Data users' options
      f. Recommendation by the Census Advisory Committee of the American Statistical Association
   4. Reactions of Others

CHAPTER VI-FINDINGS AND RECOMMENDATIONS
A. The Concept of Statistical Disclosure
B. Deciding What to Release
   Findings
   Recommendations
C. Disclosure-Avoidance Techniques
   Findings
   Recommendations
D. Effects of Disclosure on Data Subjects and Users
   Findings
   Recommendations
E. Needs for Research and Development
   Findings
   Recommendations

APPENDICES
Appendix A. Statistical Disclosure-Avoidance Practices of Selected Federal Agencies
Appendix B. Protecting Data in Computer Systems
Appendix C. Selected Methodological Issue in Statistical Disclosure Avoidance
Appendix D. Bibliography

CHAPTER I

Introduction

A. Scope of Study and Organization of Report

This report is about techniques for avoiding disclosure of confidential information about individuals (natural and legal persons) in connection with the release of statistical tabulations and microdata files (computerized records pertaining to individual statistical units). The report culminates more than a year's study of potentials for statistical disclosure, i.e., disclosure of confidential information about identifiable (but not identified) units in tabulations and microdata files. Many Federal agencies which release tabulations or microdata files for statistical purposes have statutes, regulations, or policy requirements that releases be made in such a way that no information traceable to a specific individual [1] will be disclosed. The major questions addressed during the year and reported here are as follows:

- What is the nature of statistical disclosure?
- How pervasive a problem is it?
- How can agency requirements be translated into specific disclosure-avoidance techniques?
- How can agency requirements be met without unduly restricting data releases?
- How do agency disclosure-avoidance practices affect data subjects and data users?

1. The Nature of Statistical Disclosure

The problem of statistical disclosure is certainly not a new one. It has long been recognized that any available tabulation of the characteristics of a population is likely to narrow the range of uncertainty about the characteristics of specific individuals known to be members of that population.
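The narrowing of uncertainty described in the preceding sentence can be illustrated with a small sketch. It is not part of the original report: the records, the category names, and the helper functions `tabulate` and `possible_values` are all hypothetical, chosen only to show how a published cross-tabulation limits what a group member's characteristic could be.

```python
from collections import Counter

# Hypothetical microdata, invented for illustration: (occupation, income bracket).
records = [
    ("teacher", "low"), ("teacher", "middle"), ("teacher", "middle"),
    ("physician", "high"), ("physician", "high"),
    ("farmer", "low"), ("farmer", "middle"), ("farmer", "high"),
]


def tabulate(rows):
    """Cross-tabulate the records into counts keyed by (group, value)."""
    return Counter(rows)


def possible_values(table, group):
    """Values a known member of `group` could have, given the published counts."""
    return sorted(v for (g, v), n in table.items() if g == group and n > 0)


table = tabulate(records)
for group in ("teacher", "physician", "farmer"):
    # A single remaining value means group membership alone reveals the
    # characteristic exactly; several values mean uncertainty is only reduced.
    print(group, possible_values(table, group))
```

In this made-up table every physician falls in one income bracket, so anyone known to be a physician in the population is fully described by the tabulation, while for teachers the range is narrowed from three brackets to two and for farmers it is not narrowed at all.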
Recognition of the problem has been heightened by the widespread use of computers and microdata files as well as the increased demand for more detail in statistical releases. The sheer number of characteristics available about a given statistical unit in microdata form, which sometimes produces unique configurations, may make identification possible, even though identifiers (such as names, social security numbers, or employer numbers) have been removed.

Nevertheless, we discovered that comparatively little is known about disclosure. To begin with, there is no widely accepted definition or typology of "disclosure." Probing the definitional issue, we reviewed prevailing statutes, regulations, and policy directives at the Federal level to see what light they might shed on the nature of disclosure. Published literature on the topic was also consulted. Tore Dalenius, consultant to the Statistical Policy Division, OMB, developed a formal definition while working with the Subcommittee. We adopted this definition, as it was judged to provide the best basis for a comprehensive discussion of the disclosure issue. The definition is presented in Chapter II along with the above-mentioned reviews. Citations to the literature appear in Appendix D.

2. Pinpointing Disclosure Potentials and Disclosure-Avoidance Techniques

The definitional effort was augmented by an examination of different types of disclosure and a review of the various factors affecting the potential for unintentional disclosure. Since the nature of the disclosure problem varies significantly for tabulations and microdata tapes, the discussion proceeds separately for the two modes of data dissemination in Chapters III and IV, respectively. The latter portion of each of these chapters identifies and describes disclosure-avoidance techniques appropriate for the respective mode of release.
To augment this general description, we assembled a description of the disclosure-avoidance practices of several Federal statistical agencies. These appear in Appendix A.

3. Balancing Confidentiality Requirements Against Societal Needs for Information

We have used the term "disclosure avoidance" to describe efforts to reduce the risk of disclosure. The release of any data usually entails at least some element of risk. A decision to eliminate all risk of disclosure would curtail statistical releases drastically, if not completely. Thus, for any proposed release of tabulations or microdata, the acceptability of the level of risk of disclosure must be evaluated. The use of the term "disclosure avoidance" should not be allowed to obscure the vital significance of such evaluations, or to lead to policies which attempt to eliminate disclosure risk completely.

In summary, protection of the confidentiality of information about individuals must be balanced against the legitimate needs of society for information. This "Question of Balance" is discussed in Chapter V.

--------
[1] Except where otherwise specified, the word "individual" as used in this report is meant to cover all types of reporting units: natural persons, corporations, partnerships, fiduciaries, etc.

4. Other Considerations

For the most part, our study was confined to matters internal to Federal agencies. However, at one point in Chapter V this limitation is relaxed to examine the impact of agency disclosure practices upon data subjects and data users.

This report does not deal with the issue of releasing data with identifiers, whether such release is intentional or unintentional. Our treatment of disclosure differs from that commonly associated with the Privacy Act of 1974, for example, which treats disclosure as transferring information coupled with identifiers.
The conception of disclosure advanced here excludes from consideration many identifier-linked confidentiality issues, such as whether statistical data should be immune from mandatory release for administrative, legislative and judicial purposes.

By the same token, the report deals only tangentially with the issue of computer security, ignoring the much ed potential for penetration and misuse. A substantial literature on that problem already exists, which this report highlights in Appendix B. The more relevant computer aspect is the possibility of mechanizing the search for disclosure risks and the implementation of disclosure-avoidance techniques. Appendix C reports on the development of an automated system to avoid disclosure in tabulations published by the Bureau of the Census from its economic censuses.

5. Findings and Recommendations

Our findings and recommendations appear in Chapter VI. In framing recommendations, we have been mindful of the diversity of statistical activity within the Federal establishment, as well as the complexity of the matter, and refrained from advocating overly generalized solutions. Yet, because we were also mindful of the pressing nature of the disclosure problem, the report includes a number of suggestions for the development and review of agency disclosure-avoidance practices.

B. Auspices

The report represents the collective efforts of the Subcommittee on Disclosure-Avoidance Techniques of the Federal Committee on Statistical Methodology, which operated under the auspices of the Office of Federal Statistical Policy and Standards, Department of Commerce (previously the Statistical Policy Division, Office of Management and Budget). The group was originally formed in early 1976 as one of two working groups of a Subcommittee on Confidentiality Issues chaired by Thomas B. Jabine. The working groups were subsequently given separate subcommittee status.
The other group, the Subcommittee on Matching Techniques, examined methodological issues associated with the merger of microdata from different data sets.

The opinions expressed here reflect the collective judgment of the Subcommittee and do not necessarily reflect those of the Federal Committee on Statistical Methodology or the Office of Federal Statistical Policy and Standards.

C. Dissemination of Report

This report is intended for circulation among managerial and technical staff of statistical agencies and those Federal offices which release information for statistical and research purposes. The report is intended to apprise such staff more fully of the disclosure problem and encourage appropriate disclosure-avoidance practices at the individual agency level. In addition, we hope this report will furnish the basis for an informed discussion of the disclosure problem within the Federal establishment generally as well as between the Federal Government and its data suppliers and users. It may also be of more general use to persons interested in issues related to the avoidance of statistical disclosure.

CHAPTER II

Defining Statistical Disclosure

A. References in Statutes, Regulations, and Policy Statements

The first requirement of Federal agency policies for avoiding disclosure in the release of tabulations and microdata is that these policies conform with relevant statutes and regulations. In addition, there have been several recommendations on this subject by advisory groups, which, while not binding, often carry considerable weight. This section of the chapter presents and reviews relevant sections of statutes, regulations and reports of advisory groups.

1. The Privacy Act of 1974

The Privacy Act (P.L. 93-579, 1974) does not address the issue of disclosure in tabulations; however, it does have one provision relating to disclosure of microdata.
Section 552a(b)(5) provides for disclosure without consent of the individual to whom the record pertains "to a recipient who has provided the agency with advance adequate written assurance that the record will be used solely as a statistical research or reporting record, and the record is to be transferred in a form that is not individually identifiable."

The OMB Guidelines for Privacy Act Implementation (U.S. Office of Management and Budget, 1975) explain the statutory language as follows: "The use of the phrase 'in a form that is not individually identifiable' means not only that the information disclosed or transferred must be stripped of individual identifiers but also that the identity of the individual cannot be reasonably deduced by anyone from tabulations or other presentations of the information (i.e., the identity of the individual cannot be determined or deduced by combining various statistical records or by reference to public records or other available sources of information.)" The Guidelines go on to say "Fundamentally, agencies disclosing records under this provision are required to assure that information disclosed for use as a statistical research or reporting record cannot reasonably be used in any way to make determinations about individuals."

Unfortunately, the applicability of this provision of the Privacy Act to the release of microdata from Privacy Act record systems is far from clear. It can be argued that records meeting the requirements of 552a(b)(5) are in general required to be released in response to Freedom of Information (FOI) Act (P.L. 93-502, 1974) requests, since they do not come under any of the FOI exemptions.
Surely, since all reasonable possibility of identification by recipients is presumed to have been eliminated, such records would not come under 552(b)(6) of the Freedom of Information Act, which exempts from mandatory FOI disclosure "personnel and medical files and similar files the disclosure of which would constitute a clearly unwarranted invasion of personal privacy." The Privacy Act itself provides in Section 552a(b)(2) for disclosure without consent where such disclosure would be "required under Section 552 of this title" (Section 552 is the Freedom of Information Act), and it would seem that most disclosures of information meeting the requirement of 552a(b)(5) of not being individually identifiable would fall under 552a(b)(2) and not 552a(b)(5).

If the above analysis is found to be confusing, this is indicative of the dilemma facing the Federal agency official trying to determine whether and under what conditions the Privacy Act permits him to release a specified microdata file.

2. The Freedom of Information Act

In thinking about disclosure-avoidance policies, it is important to keep in mind that FOI requires Federal agencies to make any records or documents in their possession available to individuals on request, unless such materials come under one of the 9 exemptions in the act. Thus, FOI requests for existing statistical tabulations and microdata files can be denied only if one or more of these exemptions applies. Furthermore, denials in such cases are not required by FOI: the materials may be released unless prohibited by another statute or regulation. Three of the 9 exemptions are pertinent, and are discussed below.

Exemption (3).-This exemption formerly referred to matters "specifically exempted from disclosure by statute." However, the Government in the Sunshine Act (P.L. 94-409, 1976) has changed this exemption (effective March 14, 1977) to read "specifically exempted from disclosure by statute (other than Section 552(b) [1] of this title), provided that such statute (A) requires that the matters be withheld from the public in such a manner as to leave no discretion on the issue, or (B) establishes particular criteria for withholding or refers to particular types of matters to be withheld." The effect of the change was to substantially narrow the applicability of this exemption. Agencies, including for example the Social Security Administration, whose confidentiality statutes do not meet the new requirements of exemption (3) now have to rely on one of the other FOI exemptions when they wish to protect statistical tabulations or microdata files from mandatory release under FOI.

Exemption (4).-This exemption refers to "trade secrets and commercial or financial information obtained from a person and privileged or confidential." The extent of applicability of this exemption to statistical tabulations and microdata is not well defined at this time, and will only become clearer as court decisions rule on its applicability to FOI requests for such data.

Exemption (6).-This exemption refers to "personnel and medical files and similar files the disclosure of which would constitute a clearly unwarranted invasion of personal privacy." As in the case of exemption (4), the extent of applicability of this exemption to tabulations and microdata is not yet clear. Recent court decisions have tended to limit its applicability.

3. Agency Statutes and Regulations

Following is a review of selected provisions of agency statutes and regulations relevant to the release of statistical tabulations and microdata. It is not intended that this be a full review of agency confidentiality statutes and regulations. We cite here only those provisions which appear to be directly relevant to the question of defining statistical disclosure.

a. Bureau of the Census, Title 13.-The relevant portion prohibits the Census Bureau from making "any publication whereby the data furnished by a particular establishment or individual under this title can be identified."

b. Internal Revenue Service.-The section of the Internal Revenue Code dealing with "Statistical Publications and Studies," as amended by the Tax Reform Act (P.L. 94-455, 1976), provides that "No publication or other disclosure of statistics or other information required or authorized by subsection (a) or special statistical study authorized by subsection (b) shall in any manner permit the statistics, study or any information so published, furnished, or otherwise disclosed to be associated with, or otherwise identify, directly or indirectly, a particular taxpayer." [2] [3]

c. Social Security Administration.-Regulation Number 1, promulgated under Section 1106 of the Social Security Act, deals with "Disclosure of Official Records and Information." Until recently, Section 401.3(k) of Regulation 1 provided that "Statistical data or other similar information not relating to any particular person which may be compiled from records regularly maintained by the Department may be disclosed when efficient administration permits." [4]

d. Law Enforcement Assistance Administration.-The Crime Control Act of 1973, in Section 524(a), provides that "Except as provided by Federal law other than this title, no officer of the Federal Government, nor any recipient of assistance under the provisions of this title shall use or reveal any research or statistical information furnished under this title by any person and identifiable to any specific private person for any purpose other than the purpose for which it was obtained in accordance with this title."
The regulations implementing this Act (Law Enforcement Assistance Administration, 1976) defined "information identifiable to a private person" as "information which either- (1) Is labelled by name or other personal identifiers, or (2) Can, by virtue of sample size or other factors, be reasonably interpreted as referring to a particular private person."

e. National Center for Health Statistics.-Public Law 93-353, Section 308(d) provides that "No information obtained in the course of activities undertaken or supported under Section 304, 305, 306, or 307 may be used for any purpose other than the purpose for which it was supplied unless authorized under regulations of the Secretary; and (1) in the case of information obtained in the course of health statistical activities under Section 304 or 306, such information may not be published or released in other form if the particular establishment or person supplying the information or described in it is identifiable unless such establishment or person has consented . . ."

_________________________________
.1 The section which sets forth the FOI exemptions.
.2 This section became effective January 1, 1977.
.3 Subsection (a) authorizes annual or more frequent publication of "Statistics . . . with respect to the operations of the internal revenue laws." Subsection (b) authorizes the performance of "special statistical studies and compilations involving return information" for others on a reimbursable basis.
.4 Passage of the Government in the Sunshine Act referred to earlier brought about the need for substantial revision of Regulation 1. Pending final adoption of the revised Regulation 1, the Social Security Administration is operating under an interim version which does not deal explicitly with this question.
The common element in these and other agency statutes and regulations is the prohibition of the release of information that can be associated with or identified to a particular statistical unit. In some cases the prohibition is limited to information about private individuals; in others, it extends to information for legal persons, such as businesses.

4. Advisory Committee Reports

a. The President's Commission on Federal Statistics (1971).-Recommendations on privacy and confidentiality appear in Chapter 7 of the Commission's Report. Recommendation 7-4 says, in part, "use of the term 'confidential' should always mean that: a. Disclosure of data in a manner that would allow public identification of the respondent or would in any way be harmful to him is prohibited."

b. The HEW Secretary's Advisory Committee on Automated Personal Data Systems.-Chapter 6 of the Committee's Report (U.S. Department of Health, Education, and Welfare, 1973) deals with "Special Problems of Statistical-Reporting and Research Systems." In this chapter, the Committee recommends new Federal legislation protecting against compulsory disclosure. One of the features recommended for the legislation was: "The protection should be limited to data identifiable with, or traceable to, specific individuals. When data are released in statistical form, reasonable precautions to protect against 'statistical disclosure' should be considered to fulfill the obligation not to disclose data that can be traced to specific individuals." A footnote to this paragraph provides a definition of statistical disclosure from an article by Fellegi (1972): "This is a risk that arises when a population is so narrowly defined that tabulations are apt to produce cells small enough to permit the identification of individual data subjects, or when a person using a statistical file has access to information which, if added to data in the statistical file, makes it possible to identify individual data subjects."

c.
The American Statistical Association Ad Hoc Committee on Privacy and Confidentiality (1977).-The Committee's report includes several recommendations on "Release of statistical summaries and microdata without identifiers." The first of these recommendations is: "1. General public releases of statistical summaries and microdata files based on either administrative or statistical data sources should be permitted without restrictions or conditions provided that: (a) All identifying particulars, such as name, address and Social Security number, have been removed, and (b) It is virtually certain that no recipients can identify specific individuals in the files." For microdata files which do not meet condition (b) of this recommendation, the Committee recommends release for research and statistical purposes only under certain conditions, one of which is that the recipient agrees "Not to release any tabulations or other information that would make it possible for others to identify specific individuals."

d. The Privacy Protection Study Commission (PPSC).-The Commission's final report was issued in July 1977 (PPSC, 1977). Chapter 15, entitled "The Relationship Between Citizen and Government: The Citizen As Participant in Research and Statistical Studies," includes several recommendations and policy guidelines relating to the collection, use and disclosure of information about individuals (natural persons) in "individually identifiable form" for research and statistical purposes. The report defines "individually identifiable form" as "any material that could reasonably be uniquely associated with the identity of the individual to whom it pertains" (PPSC, 1977:572).
Thus, it is clear that the Commission was fully aware of the problem of statistical disclosure, and, in fact, in a section of Chapter 15 on "Procedures to Protect Confidentiality" (PPSC, 1977:583-7), there are brief references to the work of this Subcommittee and to several of the disclosure-avoidance techniques discussed in this report. Recommendation (6) in Chapter 15 (PPSC, 1977:587) is "That the National Academy of Sciences, in conjunction with the relevant Federal agencies and scientific and professional organizations, be asked to develop and promote the use of statistical and procedural techniques to protect the anonymity of an individual who is the subject of any information or record collected or maintained for a research or statistical purpose." The text immediately preceding this recommendation makes it clear that techniques to avoid statistical disclosure (at least in its "exact" sense) are intended to be included in the recommended program of activities by the Academy and other organizations.

B. Evaluation of Statutory Requirements

Statutory prohibitions on disclosure are expressed in absolute terms. Thus, the Privacy Act refers to disclosure of a record "in a form that is not individually identifiable." The Census Title 13 prohibits "any publication whereby the data furnished by a particular establishment or individual under this title can be identified." If these statutory restrictions were interpreted literally, the flow of statistical data from the Federal Government would be stopped or drastically reduced. In a broad sense, any release of statistical tabulations reveals some information, at least in an approximate or probabilistic sense, about every individual known to be included in those tabulations.
When a microdata file containing numerous items of information about each individual is released, it is virtually certain that many of the records will display combinations of characteristics not possessed by more than one individual in the population, and therefore will be potentially identifiable through matching with data that might be available from other sources.

In practice, what is clearly expected on the part of agencies releasing statistical data is an effort to keep the probability of disclosure, however defined, at a very low level. Three of the advisory groups cited above confirm this view of the question. Thus, the HEW Committee called for "reasonable precautions to protect against statistical disclosure"; the ASA Committee recommended unrestricted release when "it is virtually certain that no recipients can identify specific individuals in the files"; and the Privacy Protection Study Commission used the word "reasonably" in defining "individually identifiable form." We may also note that the LEAA regulation uses the word "reasonably" in this context whereas the statute did not include any such qualifying term.

This interpretation of statutes, regulations and recommended policies which prohibit disclosure leads to an important conclusion, i.e., that they do not in themselves provide a clear basis for deciding in any particular case whether data should or should not be released. The decision on release calls for more specific rules and guidelines. If such rules and guidelines do not exist, then each case will be a judgment call by the responsible official. A major objective of this Subcommittee has been to determine what rules, guidelines and other criteria are being used by Federal agencies to avoid statistical disclosure; to review and evaluate these materials; and to make its findings widely available for the benefit of statisticians and others who must make decisions on what data to release, and on what terms.

C.
Prior Definitions of Statistical Disclosure

We have seen that, without exception, laws and regulations do not provide a sufficiently precise definition of disclosure for operational use in determining what tabulations and microdata files are releasable. We have also reviewed the literature on the subject of statistical disclosure found in journals, reports and other publications. There we have found several attempts at a more precise definition. These are all helpful, but none of them seems to be broad enough to cover all the kinds of statistical disclosure problems met with in practice.

Fellegi (1972) defines "inadvertent direct disclosure (i.d.d.)" as "disclosure of information on an individual who can be identified through his characteristics." He goes on to say that such disclosure "occurs when a user can identify a respondent by recognizing him through his characteristics and learning something about him." In other words, this kind of disclosure only occurs when two things happen:

1. The user recognizes an individual member of a population included in a tabulation or microdata file.
2. The user learns something about that individual that he did not know from another source.

Many more casual definitions of disclosure include only the first element. Fellegi does not say whether the information learned must be the exact value of some characteristic, or whether the disclosure can be in the form of a range or a probability statement about the value in question. Hansen (1971) distinguishes between "exact" and "approximate" disclosure, the latter term being used for the case where a value for a particular individual is disclosed to be within some specified range.

Fortunately, there is now available, in a report by Dalenius (1977), a mathematical treatment of the concept of statistical disclosure which we believe provides an adequate framework for discussion of all aspects of statistical disclosure.
Dalenius has kindly agreed to the inclusion of this material in our report.

D. A Proposed New Definition of Statistical Disclosure

The reader is asked to keep in mind that the concept of disclosure presented here is a very broad one. It would not be desirable to require that there be a zero risk of disclosure, as defined below, in any release of tabulations or microdata files. Such a requirement would end a large proportion of all releases now being made. This would be too great a price to pay for complete elimination of any risk of disclosure.

The material which follows in sections D1, D2 and D3 is presented verbatim from Dalenius' report, except for a few changes in terminology to conform with the language and structure of this report.

1. The Insufficiency of Prevailing Definitions

Statistical disclosure is used in the literature in a way which parallels its use in nonstatistical contexts. Thus, in Webster's Third New International Dictionary, "disclosure" is defined as:

(1) the act or an instance of opening up to view, knowledge or comprehension.
(2) something that is disclosed.

This definition is, indeed, general; it is by and large consistent with definitions of disclosure in the context of releases of statistical results. As an example, Title 13, U.S. Code, Section 9-a-2, gives an implicit definition of disclosure; it states that there shall not be: ". . . any publication whereby the data furnished by a particular establishment or individual under this title can be identified."

The definition just quoted is less general than the definition taken from Webster's dictionary, by making identification of the object(s) concerned an element of the definition. While this is indeed a crucial difference, it does not make the resulting definition sufficiently specific to serve as a basis for regulations and/or procedures aiming at disclosure control; it does not easily and unambiguously lend itself to implementation.
In sections D2 and D3 an effort will be made to deal with the conceptual problem thus presented.

2. A Framework for Defining "Statistical Disclosure"

"Statistical disclosure" is used here in accord with the use of this term in the context of releasing statistics from a survey.3 In line with this notion of disclosure, the following four components are used to provide the conceptual framework called for:

a. A frame comprising certain objects
b. Data associated with these objects
c. Statistics released from a survey
d. Extra-objective data

(a) The frame

Consider a set of identifiable objects, to be referred to as the total population and denoted by T. In a typical case, T may be "all Swedish citizens." The survey concerns a subset of this total population, viz. that subset which is accessible by means of a certain frame; for convenience, this subset will be denoted by F. In a specific case, F may be "Swedish citizens living in Sweden." The complementary subset, i.e., the subset made up of objects in T which are not in F, is denoted by F̄. Thus, T is the "union" of F and F̄.

In the case of a sample survey, it may prove useful to make an additional distinction, viz. between objects selected for the sample, F_s, and those not selected, F̄_s.

(b) Data associated with the objects in the frame

With each object in F, we associate data, which serve three different functions:

i. Identifying function: We will denote the data serving this function by the identifier I. In a specific case, I may appear as a (registration) number, or as name and street address.

-------------
.3 The Dalenius text uses the word "survey" in its broad sense to include a census or other data collection covering the total population. For purposes of this report, the definition may also be applied to the release of statistics based on administrative or program records.

ii.
Classifying function: For purposes of presenting the "details" of the statistics to be released, the objects in F will be associated with certain subsets defined by reference to some classifier C. In a specific case, C may appear as a "code" identifying a subset of F, for example a subset defined with reference to the sex and age of the objects in F.

iii. Information function: The survey is carried out in order to provide information in terms of certain "survey characteristics" X, Y, . . ., Z. For the object O_j (j = 1, . . ., N), the values of these characteristics are denoted by X_j, Y_j, . . ., Z_j. Typically but not exclusively, these values may be in the nature of counts or magnitudes.

It may be worth noting that some data may serve more than one of these 3 functions in one and the same survey.

(c) The statistics released from the survey

The objective of a survey is expressed in terms of some population and some data C and X, Y, . . ., Z. In order to achieve this objective, the statistics S are released. We will focus on two different kinds of statistics:

i. statistics for sets of objects, "macrostatistics"; typically, the format of a report is used as a means of releasing the statistics.
ii. statistics for individual objects, "microstatistics"; typically, the format of a micro-data tape is used as the means of releasing the statistics.

We will elaborate upon the above distinction in sections (1) and (2) below.

(1) Macrostatistics

In the case of macrostatistics, the statistics (units, magnitudes, etc., as the case may be) concern aggregates of the individual values of the survey characteristics belonging to the respective sets. The following tables are two cases in point:

These tables, while featuring the characteristics of real life statistics, are admittedly "small."

(2) Microstatistics

In this kind of statistics, the individual values observed with respect to the characteristics X, Y, . . ., Z (possibly in conjunction with the associated classifiers) are released.
The identifiers, however, are not released. The following excerpt from U.S. Bureau of the Census (1976) is illustrative:

iii. The statistics released from the survey: S
iv. The extra-objective data: E

3. Statistical Disclosure Defined

We will now suggest a definition of disclosure within the conceptual framework presented in section 2. Thus, consider an object O_k in the total population T. This object may be a member of F, or it may be a member of F̄ (the complement of F in T). We introduce a characteristic D which may be one of the survey characteristics X, Y, . . ., Z; or it may be some other characteristic. For the object O_k, this characteristic assumes the value D_k. It is helpful to consider two special cases:

i. D_k = 1 if O_k has a certain property; otherwise D_k = 0.
ii. D_k is measured on a ratio scale: it is expressed as a magnitude.

If the release of the statistics S makes it possible to determine the value D_k more accurately than is possible without access to S, a disclosure has taken place; more exactly, a D-disclosure has taken place. In a specific case, this D-disclosure may be an X-disclosure, or a Y-disclosure, etc.

The definition just given applies to both releases of macrostatistics and releases of microstatistics. Examples of disclosure for the former case may be found in Chapter III and for the latter case in Chapter IV.

CHAPTER III

Disclosure in the Release of Tabulations (Summary Data) for Public Use

A. The Problem of Disclosure in Tabulations: Typology, Identification and Examples

The problem of disclosure in tabulations will now be discussed. A typology will be listed; ways to identify the various types of disclosure, together with appropriate examples, will be provided. The definitions of different kinds of disclosure used in this section are very broad. Not all of these kinds of disclosure need necessarily be avoided in all tabulations.
The issues involved in determining what kinds of disclosure are acceptable in a particular situation are discussed in section B2 of this chapter.

Our study of the literature on this subject did not reveal any generally accepted definitions of various types of disclosure. The proposed classifications which follow represent an effort to develop a comprehensive and logical description of different types of disclosure. Suggestions for improvement will be welcomed.

Disclosure will be studied both for tabulations involving count (frequency) data and for those containing quantity (magnitude) data. Tables 1 and 2 show examples of count data and quantity data, respectively.

Table 1.-Number of beneficiaries by county and age

                         Age class
County    Under 65    65-69    70-74    75 & over    Total
A             3         15       11         8          37
B             7         60       34        20         121
C             -          4        -         -           4

Table 2.-Average benefit amount by county and age

                         Age class
County    Under 65    65-69     70-74    75 & over
D          $63.30     $94.30    $85.20    $79.60
E           62.40      89.90     81.80     72.40
F           59.80      92.40     80.40     77.60

1. Exact Disclosure

a. Count data.-For tabulations involving counts of persons, establishments, etc., exact disclosure is said to occur when a respondent known to be a member of a set (marginal total) can be determined to be a member of a proper subset (cell). For the disclosure to be exact, this proper subset or detail cell must be defined as narrowly as possible. The detail cell must consist of respondents all having one of the basic, elementary values available from the records of the characteristic defining the cell: single year of age, nearest dollar amount of benefit, a single race category, etc. Table 3 shows that all beneficiaries in County B are black, an example of exact disclosure.
Table 3.-Number of beneficiaries by county and race

                    Race
County    White    Black    Other    Total
A           15       20        5       40
B            0       30        0       30

On the other hand, the inference from Table 4 that no beneficiary in County B is white is not called exact disclosure because the subset of black or other beneficiaries is not as narrowly defined as possible from the records on which the tabulation is based.

Table 4.-Number of beneficiaries by county and race

                    Race
County    White    Black    Other    Total
A           15       20        5       40
B            0       28        2       30

Similarly, the fact that the ages of all beneficiaries in County C of Table 1 can be restricted to the interval 65-69 does not constitute exact disclosure as defined here because the age interval defining the detail cell does not represent a single year of age.

In summary, exact disclosure from count data can be identified as follows: A marginal total (in the dimension n-1) of an n-dimensional cross tabulation equals one of its detail cells; this detail cell is as narrowly defined as possible.

b. Magnitude data.-Exact disclosure from magnitude data can occur as a result of the publication of the value of a quantity corresponding to a cell with only one member. For example, the total sales for the single establishment in Industry B is disclosed by Table 5.

Table 5.-Total sales, by industry

Industry    No. of establishments    Total sales
A                    18              $450,000,000
B                     1              $125,000,000

A second type of exact disclosure from magnitude data occurs when auxiliary information concerning the possible numerical values of the characteristic under consideration can be used to determine the exact quantity for every member of a given cell. For example, consider the situation presented below:

Table 6.-Average monthly benefits, by State

State    No. of beneficiaries    Average monthly benefit
A                 4                      $158
B                36                      $190

If the maximum possible monthly payment to any beneficiary under the program studied in Table 6 is $190, then the user will know that each person in State B receives precisely $190.
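The identification rules illustrated by Tables 3 through 6 lend themselves to a mechanical check. The following Python sketch is offered only as an illustration: the function names, the dictionary layout, the `narrowly_defined` flag, and the assumption that the minimum possible benefit is $0 are not part of the report.

```python
def exact_count_disclosure(detail_cells, narrowly_defined=True):
    """Return True if one detail cell equals the marginal total.

    detail_cells maps a category label to its published count. The
    narrowly_defined flag (an assumption of this sketch) records whether
    the categories are the most elementary values held on the records.
    """
    total = sum(detail_cells.values())
    one_cell_holds_all = total > 0 and total in detail_cells.values()
    return narrowly_defined and one_cell_holds_all


def exact_magnitude_disclosure(n, average, lower, upper):
    """Return True if every member's value is determined exactly.

    Covers a single-member cell, and the case where the cell total
    T = N*A equals L*N or U*N for known possible bounds L and U.
    """
    total = n * average
    return n == 1 or total == lower * n or total == upper * n


# Table 3, County B: all 30 beneficiaries fall in one race category.
print(exact_count_disclosure({"White": 0, "Black": 30, "Other": 0}))  # True
# Table 4, County B: two race categories are occupied.
print(exact_count_disclosure({"White": 0, "Black": 28, "Other": 2}))  # False
# Table 6, assuming a maximum benefit of $190 and a minimum of $0.
print(exact_magnitude_disclosure(36, 190, 0, 190))  # State B: True
print(exact_magnitude_disclosure(4, 158, 0, 190))   # State A: False
```

Integer dollar amounts are used so that the equality tests are exact; with fractional averages a tolerance would be needed.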
However, the exact value of the payment to any beneficiary in State A is not disclosed.

In summary, exact disclosure of the first type from quantity data is identified by the publication of the numerical value of a characteristic corresponding to a cell with one member. Exact disclosure of the second type from magnitude data is identified by the following equalities: A = L, equivalently T = LN; or A = U, equivalently T = UN, where A is the average and T is the total value among all N members in a cell, and U and L are the maximum and minimum possible values, respectively, for any member in the cell.

2. Approximate Disclosure

a. Count data.-When all members of a total belong to one detail cell, the disclosure is approximate if the detail is not as narrowly defined as possible; otherwise, the disclosure is exact. When all members of a total can be restricted to a proper subset of detail cells, there is approximate disclosure because it is disclosed that no member of the marginal total belongs to any of the empty cells. Table 1 allows the user to restrict the age of each beneficiary in County C to the interval [65, 69]. Table 4 does not exactly specify the race of any person, but it shows that the race of each beneficiary in County B is either black or other, not white. Both of the above examples illustrate approximate disclosure from count data.

Approximate disclosure from count data can be defined and identified as follows: A marginal total (in the dimension n-1) of an n-dimensional cross tabulation equals one of its detail cells, or the sum of a proper subset of detail cells (equivalently, the value of one or more detail cells is zero); but the disclosure is not exact.

b. Magnitude data.-In a broad sense the publication of a figure for a quantity always permits the user to estimate, however crudely, the value of the characteristic corresponding to a given member of the cell. For example, the monthly benefit for each of the four beneficiaries in State A of Table 6 cannot exceed $632.
Further, the total sales of each establishment in Industry B of Table 7 can be placed inside the interval [0, 125,000,000].

Table 7.-Total sales, by industry

Industry    No. of establishments    Total sales
A                    18              $450,000,000
B                     5              $125,000,000

Often, the information provided in cases such as the above will not be sufficiently accurate or sensitive to require corrective measures. However, if the number of members in the cell is sufficiently small, the interval of possible values for the quantity associated with a particular individual will be narrow enough to be considered a disclosure problem (Cox, 1976). With the assumption that all values of the quantity are nonnegative, the interval of possible values of a characteristic for a particular cell member is [0, T] if the total, T, is published; equivalently, the interval is [0, NA] if the average, A, and cell size N are published.

Sometimes auxiliary information obtained from sources external to the summary data under consideration can enable the user to estimate the value of an unpublished quantity more accurately. For example, if an employment distribution shows that all establishments in Industry B of Table 7 have approximately the same number of employees, the user can estimate a value of $25,000,000 for the sales of each establishment. In the same vein, if it is known from another data source that the largest establishment of the five employs 80 percent of all workers in Industry B, a reasonable estimate for total sales for that establishment would be $100,000,000.

In some situations, auxiliary information admitting more accurate approximation to values of aggregate data can be obtained from external sources other than statistical tabulations. In particular, legal requirements used in conjunction with summary data may determine narrow upper and lower limits for the value of a quantity for an individual respondent.
For example, in Table 6 if the maximum benefit is $192, then it can be shown that each individual person in State B must receive at least $120 (the State B total of 36 x $190 = $6,840 less the most the other 35 beneficiaries could receive, 35 x $192 = $6,720), a restriction of each beneficiary's payment inside a range of values unknown prior to publication of the data. In general, if maximum and minimum values of the characteristic in question are known, such disclosure will occur under the following conditions:

[Conditions presented as a graphic in the original.]

where A is the average and T is the total value among all N members in a cell, where N > 1; U and L are the maximum and minimum possible values, respectively, for any member in the cell; and P, where 0 < P < 1, specifies the relative size of the interval chosen to define disclosure of the value of the characteristic under consideration. For example, if disclosure is defined as knowing that the value for an individual lies within a quarter of the range (U-L), then P = .25.

Finally, in some instances better approximations for the quantity data of an individual respondent can be computed by a user with precise information about a subset of members of the cell. This type of disclosure is discussed later in this chapter (see A5: "Internal Disclosure") and in Appendix C.

3. Probability-Based Disclosures (Approximate or Exact)

Sometimes, although a fact is not disclosed with certainty, the published data can be used to make a statement which, within the framework of an implied probability model, has a high probability of being correct. For example, in Table 8 it is very likely that a given beneficiary in County B has a monthly income in excess of $2,000.

Table 8.-Monthly income of beneficiaries

              Number of persons with income
County    Under $1000    $1000-$2000    Over $2000
A              70             60            65
B              10             20           230
C              30             50            40

Similarly, from Table 4, in the absence of other information, we might assign a probability of 0.93 that a person known to be a beneficiary in County B is black.
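The two likelihood statements above reduce to simple proportions of a detail cell within its marginal total. The Python sketch below reproduces them; the function names and the threshold values 0.10 and 0.85 are hypothetical choices made for illustration and do not come from the report.

```python
def cell_proportion(detail, total):
    """Share of the marginal total that falls in one detail cell."""
    return detail / total


def outside_permitted_range(detail, total, p_low, p_high):
    """True if the cell's share is below p_low or above p_high.

    p_low and p_high stand for thresholds a releasing agency might
    choose; the values used below are assumptions of this sketch.
    """
    share = cell_proportion(detail, total)
    return share < p_low or share > p_high


# Table 8, County B: 230 of 10 + 20 + 230 = 260 persons are over $2,000.
share_income = cell_proportion(230, 10 + 20 + 230)
# Table 4, County B: 28 of 30 beneficiaries are black.
share_black = cell_proportion(28, 30)

print(round(share_income, 2))  # 0.88
print(round(share_black, 2))   # 0.93
print(outside_permitted_range(230, 260, 0.10, 0.85))  # True
# Table 8, County A: 65 of 70 + 60 + 65 = 195 persons are over $2,000.
print(outside_permitted_range(65, 195, 0.10, 0.85))   # False
```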
Identification of probabilistic disclosure can be described as follows:

D < S P1 or D > S P2

where D is the number of members in the detail cell, S is the number of members in the total cell, P1 is the smallest permissible proportion of members in a detail cell among all members belonging to the marginal total, and P2 is the largest permissible proportion of members in a detail cell among all members belonging to the marginal total. As was the case for approximate disclosure for aggregates, the appropriate values of P1 and P2 in a particular case must be determined by the agency releasing the tabulations. In many cases, the agency may not consider it necessary to avoid probabilistic disclosure at all; in such cases, we would set P1 = 0 and P2 = 1.

4. Indirect Disclosure

Up to this point, the examples concerning exact, approximate, and probabilistic disclosure have involved information provided directly by published figures. This type of disclosure is said to be direct. However, information can often be derived by algebraic manipulation and/or logical operations performed upon data obtained from different tables based on the same data. If the publication of a derived figure would result in one of the types of disclosure discussed above, then indirect (exact, approximate, or probabilistic, whichever is appropriate) disclosure is said to occur.

Table 9.-Number of persons with hospital and medical coverage, by age and sex

                Hospital & medical coverage
Age            Male     Female    Total
Under 65       1,714    1,820     3,534
65-74          1,517    1,630     3,147
75 and over    1,402    1,510     2,912
Total          4,633    4,960     9,593

Table 10.-Number of persons with medical coverage, by age and sex

                     Medical coverage
Age            Male     Female    Total
Under 65       1,719    1,829     3,548
65-74          1,519    1,630     3,149
75 and over    1,402    1,510     2,912
Total          4,640    4,969     9,609

Neither Table 9 nor Table 10 discloses individual information directly.
However, by application of algebraic and logical operations to both tables, it follows that all men 75 and over with medical coverage have hospital coverage; all women with medical coverage but without hospital coverage are under 65, etc.

As a further illustration of indirect disclosure, suppose Industry A consists of two disjoint subindustries A1 and A2, and that the following information is available from various tables.

Industry    No. of companies    Total sales
A                  5            $200,000,000
A1                 4             150,000,000

By subtraction, the total sales of $50,000,000 is computed for the one company belonging to Industry A2.

To identify indirect disclosures, a determination must be made to see if a logically defined but unpublished cell, which would itself constitute a disclosure, can be derived from published cells. Because data from all sources available to the user must be considered, this work can get quite involved. Discussions of this complex problem are given by Cox (1976) and Fellegi (1972).

5. External or Internal Disclosure

Almost all of the above discussion has centered upon external disclosure, i.e., disclosure to someone who is not a member of the tabulated cell. Attention will now be focused upon internal disclosure, that is, the situation in which members of a group use their own as well as published data to obtain confidential information about others in the group. When some members of a group collaborate for this purpose, we will refer to this subset as a "coalition."

Table 11 furnishes an example of internal disclosure for count data. The black worker in County C can determine from the table that every other employee in his industry and county is white.

Table 11.-Race of workers in industry A, by county

County    Total    White    Black
A          144      132       12
B          238      138      100
C           94       93        1

If there were precisely two black workers in County C instead of one and if they knew each other, they could deduce that all other employees in their industry and county are white.
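The reasoning applied to Table 11 can be expressed as a comparison of two counts: the members of the total who are not in the detail cell, and the coalition members who are not in that cell. When the two are equal, everyone outside the coalition must belong to the detail cell. The Python sketch below is a minimal illustration; the function and parameter names are hypothetical.

```python
def others_confined_to_cell(total, cell, coalition_size, coalition_in_cell=0):
    """True if every member outside the coalition must lie in the detail cell.

    This holds when all members of the total who are not in the detail
    cell are themselves members of the coalition.
    """
    not_in_cell = total - cell
    coalition_not_in_cell = coalition_size - coalition_in_cell
    return not_in_cell == coalition_not_in_cell


# Table 11, County C: 94 workers, 93 white; the one black worker acts alone.
print(others_confined_to_cell(94, 93, coalition_size=1))    # True
# Variant with two black workers (92 white) who pool what they know.
print(others_confined_to_cell(94, 92, coalition_size=2))    # True
# Same variant, but one black worker acting alone cannot conclude this.
print(others_confined_to_cell(94, 92, coalition_size=1))    # False
# County A: 144 workers, 132 white; one black worker acting alone.
print(others_confined_to_cell(144, 132, coalition_size=1))  # False
```

The third call shows why the two workers must know each other: acting alone, neither can account for the second worker outside the white cell.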
If the maximum possible benefit for each of the beneficiaries of Table 12 were $140, it would be impossible for a user not belonging to County B to determine the payment to either person in that county. However, either beneficiary could readily compute the payment to the other person by use of the published cell. Further, if one person in County A of Table 12 received a benefit of $40, he would know that each of the other persons must receive between $120 and $140.

Table 12.-Number of beneficiaries and average payment amount

    County    Number    Average payment amount
    A         3         $100
    B         2           70

Another example of internal disclosure from quantity data is given by Table 7, which was also discussed in conjunction with approximate disclosure. As previously mentioned, by subtracting the value of its own sales from the published value of $125,000,000, an establishment can estimate the value of sales for its competitors with greater accuracy, perhaps, than they would like.

Finally, internal probabilistic disclosure can be discussed by modifying data for County C of Table 11 as follows:

    Total    White    Black
    94       92       2

If either black employee knows that Mr. X is in his industry and county, the probability is only 1/93 that Mr. X is black.

For the sake of completeness and summarization, the following list is provided for the identification of the different types of internal disclosure. Definitions are analogous to the corresponding ones for external disclosure.

a. Count data (direct or indirect disclosure).-The potential for internal disclosure is affected by two new factors not relevant to external disclosure. The first is the maximum size of coalition against which protection is believed to be necessary; the second is the distribution of the coalition members among the data cells to be protected.
Since there is usually no way of knowing what the distribution of any particular coalition might be, the conservative approach in all cases is to protect against the distribution that would result in the greatest degree of disclosure. In the discussion below, S is the published number of members in the total cell, D is the published number of members in a detail cell, C is the maximum coalition size for which protection from disclosure is considered necessary, and X is the number of coalition members also belonging to the detail cell. Note that the number, X, of members of a coalition of size C which belong to a detail cell of size D must satisfy the following: 0 ≤ X ≤ minimum (C, D).

(1) Exact disclosure: The difference between the values of a marginal total and one of its detail cells is equal to the number of members of a coalition not belonging to the detail cell (equivalently, S-D = C-X), and the detail cell is as narrowly defined as possible. In a plan to guard against such disclosure by coalitions of size C, the extreme case X = 0 must be considered; that is, S-D ≤ C should be avoided in publications.

(2) Approximate disclosure: There exists at least one non-empty detail cell entirely contained in a coalition, but the disclosure is not exact. For this detail cell we have X = D. In a plan to guard against such disclosure by coalitions of size C, D ≤ C should be avoided in publications.

(3) Probabilistic disclosure.-(i) D-X < (S-C) P1, where D, X, S, and C are as defined previously and P1 is as defined for external probabilistic disclosure. In a plan to guard against such disclosure by coalitions of size C, the extreme case X = C must be considered; that is, D-C < (S-C) P1 should be avoided in publications. (ii) D-X > (S-C) P2, where D, X, S, and C are as defined previously and P2 is as defined for external probabilistic disclosure.
In a plan to guard against such disclosure by coalitions of size C, the extreme case X = 0 must be considered; that is, D > (S-C) P2 should be avoided in publications.

b. Magnitude data (direct or indirect disclosure).

(1) Exact disclosure: After a coalition of size C adjusts a published figure by means of its own data, the revised value involves either type of exact disclosure for magnitude data described for the external case. Equivalently, a quantity is published for a cell of size C + R, containing a coalition of size C, where one of the following conditions holds:

(i) R = 1
(ii) The revised value of the published figure, obtained by adjusting for the contribution of the coalition, is a maximum or a minimum possible value determined from external, auxiliary information as described on page 12.

(2) Approximate disclosure: With an adjustment of a published quantity figure by use of information about itself, a coalition of members of a cell can estimate, more accurately than an outside user, a quantity value corresponding to a member of the cell outside the coalition. For example, two beneficiaries, each receiving a monthly benefit of $250 in State A of Table 6, would know that each of the other two beneficiaries must receive less than $132. Given that the (unpublished) values for sales in Industry B of Table 7 are as shown below:

    Establishment    Sales
    1                  $1,000,000
    2                   1,000,000
    3                   1,000,000
    4                  22,000,000
    5                 100,000,000

it follows that establishments 4 and 5 can obtain objective and somewhat accurate information about each other (especially if each is aware of the relative sizes of the other four members of the cell). In particular, establishment 5 can deduce that establishment 4 has at most $25,000,000 in sales.
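The deduction made by establishment 5 can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name is hypothetical, the published total of $125,000,000 is the Table 7 figure quoted earlier, and all sales are assumed to be nonnegative.

```python
def outsider_upper_bound(published_total, coalition_values):
    """Largest value any one non-coalition member could have, assuming
    nonnegative values: the published total minus the coalition's own share."""
    return published_total - sum(coalition_values)


INDUSTRY_B_TOTAL = 125_000_000  # published value for Industry B of Table 7

# Unpublished establishment values from the example in the text.
sales = {1: 1_000_000, 2: 1_000_000, 3: 1_000_000, 4: 22_000_000, 5: 100_000_000}

# Establishment 5 acting alone: nobody else can exceed $25,000,000.
bound_seen_by_5 = outsider_upper_bound(INDUSTRY_B_TOTAL, [sales[5]])

# Establishment 4 acting alone learns far less.
bound_seen_by_4 = outsider_upper_bound(INDUSTRY_B_TOTAL, [sales[4]])

# A coalition of establishments 4 and 5 bounds each remaining member at $3,000,000.
bound_seen_by_4_and_5 = outsider_upper_bound(INDUSTRY_B_TOTAL, [sales[4], sales[5]])
```

The larger the coalition's share of the published total, the tighter the bound on everyone else, which is why the dominant establishment learns the most.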
In general, if all quantities are nonnegative, the interval of possible values for a particular cell member outside a coalition is [0, T - Q], or equivalently [0, NA - Q], where T is the published total, A is the published average, N is the cell size, and Q is the value of the quantity for the coalition. Finally, if upper and lower limits for the possible value of a quantity corresponding to an individual respondent are known, then internal approximate disclosure can be identified as follows for aggregate data:

    (The formula appears only as a graphic in the original and is not reproduced here.)

where A is the published average and T is the published total value for all N members in the cell, U and L are the maximum and minimum possible values, respectively, for any member in the cell, P, 0 < P < 1, specifies the relative size of the interval which defines disclosure of the value of the characteristics under discussion, C is the number of members in the coalition, and Q is the unpublished value of the quantity corresponding to members of the coalition.

(3) Dominance rules and their relation to internal approximate disclosure of magnitudes: Cell suppression is commonly used as a technique to avoid exact and approximate disclosures in tabulations of magnitude data. Typically, "dominance rules" are established to determine which cells should be suppressed. These rules are of the following general type: If n or fewer units account for p percent or more of the cell total, the cell must be suppressed. For example, we might say that if 1 or 2 firms account for 80 percent or more of total sales in a particular cell, that cell should not be published. One consequence of such a rule would, of course, be to require that all published magnitude cells be based on data for 3 or more firms. The effect of dominance rules is to limit the precision with which magnitudes for individual units can be estimated from the published data by persons who have exact or approximate knowledge of values for one or more members of the cell.
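A dominance rule of this general type can be sketched as a short check. The function name and the sample cells are hypothetical, and the sketch assumes nonnegative contributions and a positive cell total, in which case testing the n largest contributors also covers the "n or fewer" wording of the rule.

```python
def violates_dominance_rule(values, n, p):
    """True if the n largest contributors account for p percent or more
    of the cell total (assumes nonnegative values and a positive total)."""
    total = sum(values)
    top_n = sum(sorted(values, reverse=True)[:n])
    return 100 * top_n >= p * total


# Illustrative cells (made-up sales figures), tested against the
# "1 or 2 firms account for 80 percent or more" rule from the text.
balanced_cell = [40, 30, 20, 10]    # top two hold 70 percent
dominated_cell = [60, 25, 10, 5]    # top two hold 85 percent
two_firm_cell = [50, 50]            # top two hold 100 percent

suppress_balanced = violates_dominance_rule(balanced_cell, n=2, p=80)
suppress_dominated = violates_dominance_rule(dominated_cell, n=2, p=80)
suppress_two_firm = violates_dominance_rule(two_firm_cell, n=2, p=80)
```

The two-firm cell always fails the check, which matches the observation that such a rule forces every published cell to rest on 3 or more firms.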
In particular, these rules limit the extent of internal approximate disclosure of magnitude data, as defined earlier in this chapter. Further discussion of dominance rules and their relation to approximate disclosure appears in Appendix C.

If a dominance rule is used to determine when a cell magnitude should not be published, knowledge of the exact rule can make it possible for a member of the cell to obtain more accurate information about his competitors than would otherwise be the case. This may readily be understood from an example. Suppose a published cell shows sales for 1976 of $1,000,000 for 6 companies in a particular industry. Company A knows that its own sales in 1976 were $750,000. If Company A does not know the dominance rule, it can deduce only that none of the other 5 companies had sales of more than $250,000. If the dominance rule is published, however, additional information may be available to Company A. Consider two possibilities:

1. The rule is that no cell is published if 1 or 2 companies account for more than 90 percent of the total. In this case, Company A will know that none of its competitors had sales of more than $150,000 (since its own $750,000 plus the largest competitor's sales cannot exceed $900,000).

2. The rule is that no cell is published if 1 or 2 companies account for more than 80 percent of the total. In this case, Company A will know not only that none of its competitors had sales of more than $50,000, but also that each of the 5 other companies had sales of exactly $50,000 (since 5 companies must account for sales of $250,000, and none of them can have sales of more than $50,000).

B. Evaluating the Disclosure Problem

The definition of statistical disclosure adopted for this report is, as mentioned earlier, very broad. While it may not be feasible to try to avoid completely the possibility of disclosure, it is imperative to exercise disclosure control.
Doing so calls for an evaluation as to (1) the level of risk of disclosure inherent in a proposed publication; (2) the acceptability of that risk; and (3) the assurances given to persons (data subjects or others) who provided the information. In what follows, we will address these three points.

1. The Level of Risk of Disclosure

We will now identify four factors which determine the risk of disclosure. In a real-life situation, it will be necessary to try to evaluate their combined effect.

a. The relative size of the sample.-As a first approximation, the risk of disclosure is smaller for tabulations based on a sample survey than for tabulations based on a complete survey; and by the same token, the smaller the sampling fraction, the smaller is the risk of disclosure. This evaluation is reasonable when we are dealing with surveys based on designs characterized by the use of an equal probability of selection method. Many large-scale surveys are of this type. If the overall sampling fraction (usually denoted by n/N) is "small," say less than .05, it is less likely that a disclosure will take place. If, however, the design does not involve equal probability of selection, the situation is different; in fact, for some sampling designs, the risk of disclosure may be very great for some large reporting units. As an illustration, consider the total of a characteristic with a highly skewed distribution. An example of this kind is a survey to estimate total production. In such cases, an efficient sampling design would call for selecting relatively few small units. Disclosure potential would, therefore, be much higher for the large units than for the small units.

The protection against risk of disclosure afforded by a small sampling fraction is considerably less where particular reporting units are, for whatever reason, known to be members of the sample.
For example, if a sample is selected based on ending digits of social security numbers, the risk of disclosure is clearly greater if the digital sampling patterns actually used to select the sample are known. Similarly, in a two-stage sample, if the identities of the primary units in the sample are known, then the sampling fractions within these primary units, rather than the overall sampling fraction, determine the degree of protection against the risk of disclosure. More generally, in multi-stage samples, protection is a function of the sampling fractions within units known to be in the sample.

b. The detail provided in the tabulation.-A publication which provides only "overall" estimates is less likely to generate large risks of disclosure than a publication which provides detailed breakdowns of these estimates. It is useful to make a distinction between two kinds of breakdowns, viz., (1) by geography, and (2) by other classifiers. If the data are presented for very small areas, the risk of disclosure is typically larger than for large areas. It is this experience which underlies the rules used by the Census Bureau to provide less detailed tabulations for areas such as census tracts and city blocks than it does for large areas such as SMSAs. If data are published for small "cells" identified in terms of other classifiers such as age, sex and race (perhaps in combination with geography), the risk of disclosure may be large: the smaller the cell, the larger the risk.

c. The quality of the data.-If the data on which estimates are based are impaired by non-sampling errors, the risk of disclosure is smaller than in the case of more accurate data. This is in fact why "noise" is sometimes intentionally introduced into estimates.

d. Availability of external information.-The existence of external information (for example, information available through directories or other institutional records) may make the risk of disclosure significantly higher than it would be if that information were not available.

In a real-life situation, the survey statistician should, when planning the survey, take these and other factors into account; to some extent, the risk of disclosure can be controlled by the proper choice of survey design. This type of control must, however, be supplemented by disclosure analysis of the proposed publication.

2. The Acceptability of the Disclosure Risk

The crucial point of the disclosure analysis just referred to is to determine if a certain risk of disclosure is too high or too low. It is too high if it may cause non-negligible harm to an individual being subject to disclosure, or to the statistical agency by impairing its ability to collect data in the future. It is too low if it unnecessarily reduces the amount of useful information that can be provided. Three factors which may be considered in an effort to determine whether a certain disclosure risk is acceptable or not are listed below.

a. Sensitivity of data.-Some types of data are clearly more sensitive than others; it suffices to mention data dealing with financial matters, health, and sexual behavior. On the other hand, some data may, at worst, disclose something that is entirely obvious or completely innocuous, or available in public records. For many data, the degree of sensitivity may be a decreasing function of their age.

b. Possible adverse consequences of disclosure.-This topic is closely related to the sensitivity of data. The more sensitive the data are, the more adverse the consequences of disclosure are likely to be. Clearly the kind of consequences caused by disclosure should be taken into account in the disclosure analysis.
If the disclosure of some particular datum may reasonably be expected to create a social, economic or legal problem, the risk of disclosure must be kept very small. Thus, disclosing that someone has been treated for venereal disease, drinking problems, etc., may generate such a problem.

3. The Assurances Given to the Respondents

Consideration must be given to what assurances have been given to the data subjects or other persons providing information about uses of the data. Under no circumstances should such assurances be violated. If the information is definitely non-sensitive and no promise of confidentiality was given the data subject, then the concern about possible disclosures would be considerably reduced.

C. Disclosure-Avoidance Techniques

A major goal of statistical agencies is to produce and publish as many useful and usable statistics as possible for the benefit of their clients. The need to avoid the unintentional disclosure of sensitive information concerning individual persons or organizations forms a constraint on this endeavor. The statistical agency, therefore, must find or develop techniques that will effectively avoid disclosure while at the same time permitting maximum useful statistical information to be conveyed. The agency would also seek to accomplish this by a method that is both simple and economical.

Techniques for preventing disclosure through statistical tabulations fall into three general classes: data suppression, rolling up data, and disturbing the data.

1. Data Suppression

a. Cell suppression.-A data item which, it is determined, could lead to disclosure may simply be suppressed, i.e., the figure is omitted and replaced by an asterisk or other symbol which indicates that the figure is being omitted to maintain confidentiality for the subjects of the table.
However, care must be taken to assure that the disclosing figure may not then be deduced by subtraction, which requires that another figure in the same row and another in the same column also be suppressed, assuming it is desired that no changes be made in the row and column totals. In addition, at least one more figure would need to be suppressed (the one at the intersection of the other row and column of the second and third suppressions) to assure that the other suppressions also cannot be deduced by subtraction. Thus, if the row and column marginal totals are to be left unchanged, it is necessary in a two-way distribution to suppress at least four figures to avoid a disclosure.

It is also possible that data in other tables published from the same body of data may enable one to deduce the suppressed figures. Therefore, it is necessary to review all relevant tables to ensure that they do not contain disclosures and also that through a process of subtraction or other algebraic operations they do not enable disclosures to be made, and all necessary suppressions must be made to avoid the possibility of disclosure. Cox (1976) discusses a linear programming technique for exposing cells which require suppression to avoid disclosure.

So as to provide maximum consistency, the suppression of certain data items may be made contingent on the acceptability of a "diagnostic" item. For example, in economic censuses, if sales in a particular kind of business must be suppressed, then employment, payroll and certain other figures are automatically suppressed with it. This enhances consistency, avoids incidental disclosures, and reduces costs.

b. Table suppression.-Many (though not all) disclosure problems can be avoided inexpensively through the elimination of all tabulations involving fewer than some minimum number of cases.
Thus, in the 1971 Census of Population in the United Kingdom, no tabulations were presented for enumeration districts having fewer than 25 persons or fewer than 8 households; for such enumeration districts only the total numbers of persons and households were given (Newman, 1975:6). In the 1970 Census, the U.S. Bureau of the Census suppressed distributions by a particular characteristic for any universe in which there were fewer than 5 cases (Barabba and Kaplan, 1975:9). In guidelines for the Social Security Administration (1977) it is suggested that separate tabulations for counties having fewer than 50 beneficiaries be avoided.

For a general discussion of the use of suppression, see Sweden, National Central Bureau of Statistics (1974:32-34). For a discussion of the use of suppression in the U.S. Bureau of the Census, see Barabba and Kaplan (1975:7-10).

2. "Rolling Up" Data

Problems of confidentiality can frequently be solved by changing the structure of tables in such a way that the disclosure possibility is eliminated. Thus, rows or columns can be combined into larger class intervals or new groupings of characteristics. This may be a simpler solution than the suppression of individual items, but it tends to reduce the descriptive and analytical value of the table. It may also be expensive in that it might require that a few tables be customized in a large set of tables, the remainder of which are produced mechanically in identical formats. General discussions of the rolling-up process are to be found in Sweden, National Central Bureau of Statistics (1974:31-32) and in Social Security Administration (1977:6-7).

An indirect but common example of rolling-up exists in data bases where the Standard Industrial Classification system is used. That hierarchical system has 2-, 3- and 4-digit levels providing successively greater detail. When data are suppressed at the 4-digit level, the 3-digit level summary provides the benefits of intermediate rolling-up.
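Rolling up within a hierarchical classification of this kind can be sketched as a prefix aggregation. The codes and counts below are made up for illustration and are not actual published data; the function name is likewise hypothetical.

```python
from collections import defaultdict


def roll_up(cells, digits):
    """Sum cell values over the leading `digits` characters of each code."""
    rolled = defaultdict(int)
    for code, value in cells.items():
        rolled[code[:digits]] += value
    return dict(rolled)


# Made-up counts of establishments for five illustrative 4-digit codes.
four_digit = {"2011": 12, "2013": 2, "2015": 9, "2021": 7, "2026": 15}

# The small cell "2013" would be suppressed at the 4-digit level,
# but its contribution is still carried in the 3-digit summary.
three_digit = roll_up(four_digit, 3)
two_digit = roll_up(four_digit, 2)
```

The user loses the 4-digit detail for the suppressed cell but keeps a usable figure at the next level up.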
Hansen (1971:51) points out that using broad enough class intervals may even avoid approximate disclosure (in the terminology of this report, unacceptable approximate disclosure), for example, when the upper limit of each interval is at least double the lower limit.

3. Disturbing the Data

This process involves changing the figures of a tabulation in some systematic fashion, with the result that the figures are insufficiently exact to disclose information about individual cases, but are not distorted enough to impair the informative value of the table.

Ordinary rounding is the simplest example. Figures in a table may, for example, be rounded to the nearest multiple of 5. Where the figures involved are very large, this will have little or no effect on the informative value of the tables. If all cells in a table are rounded by the same rules, totals will not always agree with the sums of the detailed cells. If this is considered undesirable, the most detailed cells can be rounded and then added to obtain totals at various levels. Ordinary rounding was used for most tables involving large areas in the 1971 United Kingdom Census (Newman, 1975:9-10). Values of 0, 1, or 2 were replaced by asterisks; percentages were computed from the rounded tables.

There is a growing body of techniques for avoiding disclosure involving the introduction of random error into the figures to be published. For example, in tables relating to small areas prepared from the 1971 United Kingdom Census, to each figure was added, at random, -1, 0, or +1, in the ratio of 1, 2, 1. Enumeration districts were paired, each having opposite correction factors in comparable figures, so that the totalled figures from a set of districts would be accurate, except if there was an odd number of districts in the set (Newman, 1975:3-8).

One possible approach is to introduce "noise" into the file of microdata, thus avoiding the possibility of disclosure in any tabulations produced from the file.
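The two tabulation-level techniques just described, ordinary rounding to a multiple of 5 and the paired -1, 0, +1 adjustment, can be sketched as follows. This is an illustration under stated assumptions only: the function names and sample counts are hypothetical, and the sketch does not address how a real census would treat a count that an adjustment pushes below zero.

```python
import random


def round_to_base(value, base=5):
    """Round a nonnegative integer to the nearest multiple of `base`
    (for an even base, exact ties would round up)."""
    return base * ((value + base // 2) // base)


# Ordinary rounding: the rounded total need not equal the sum of rounded cells.
detail = [6, 7, 11, 12]
rounded_detail = [round_to_base(v) for v in detail]
rounded_total = round_to_base(sum(detail))   # rounding the true total, 36
sum_of_rounded = sum(rounded_detail)         # adding the rounded cells instead


def paired_perturbation(district_a, district_b, rng):
    """Add -1, 0, or +1 (ratio 1:2:1) to each figure of one district and the
    opposite correction to the comparable figure of its paired district."""
    out_a, out_b = [], []
    for a, b in zip(district_a, district_b):
        e = rng.choice([-1, 0, 0, 1])   # 0 is twice as likely as -1 or +1
        out_a.append(a + e)
        out_b.append(b - e)
    return out_a, out_b


rng = random.Random(1971)   # fixed seed only so that the sketch is repeatable
district_a = [14, 9, 23]
district_b = [11, 17, 8]
perturbed_a, perturbed_b = paired_perturbation(district_a, district_b, rng)
pair_totals = [a + b for a, b in zip(perturbed_a, perturbed_b)]
```

Here the rounded total is 35 while the rounded cells add to 30, and the pair totals are unchanged by the perturbation. Both functions act on published cell values; the microdata approach mentioned just before this sketch would instead apply the disturbance to individual records before tabulation.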
Disturbing the microdata themselves may simplify matters for the data producer, but it creates problems for the user (Dalenius, 1974).

"Random rounding," a method which has received considerable attention in recent years, combines elements of both rounding and introducing random disturbances. Each figure is rounded to a multiple of some integer, usually 5, but not necessarily to the nearer one. Whether a figure is rounded up or down is determined at random, with the chance of rounding up or down depending upon the amount of change necessary (Murphy, date unknown: 68-70; Social Security Administration, 1977:7-9):

    Final digit    Probability of rounding up
    0 or 5         0
    1 or 6         1/5
    2 or 7         2/5
    3 or 8         3/5
    4 or 9         4/5

Nargundkar and Saveland (1972) describe and give theoretical support to the use of this method in the tabulations published from the 1971 Canadian censuses of population and housing. Fellegi (1975) presents a technique for controlling the random rounding to assure that the totals will be correct at some predetermined higher geographical area level.

The Swedish Statistical Bureau proposes another random rounding technique which may be used if it is simply desired to remove ones from a table. The one is rounded randomly down to zero with a probability of 2/3 and up to 3 with a probability of 1/3 (Sweden, National Central Bureau of Statistics, 1974:34-35).

The models discussed above for disturbing data are all additive. Multiplicative models are also feasible. Hansen (1971:55-56) suggests one which involves disturbing the figure by a factor within the range of .5 to 1.5, the factor being chosen at random.

4. Limiting Distribution

Situations may arise in which it is not necessary to take special steps to avoid disclosure from statistical tabulations.
Under certain conditions a table may be made available to a particular organization, even though the table could not be published for reasons of maintaining confidentiality. An actual example is in the tables on local area social security data provided by the Office of Research and Statistics, Social Security Administration, to the Bureau of Economic Analysis. As a result, the expense of revising the table is avoided, and the actual distribution is available for full research use. This can be done when the receiving organization guarantees (and has the legal authority) to provide fully adequate protection to the confidentiality of the data while it has custody of them.

For one agency to make potentially identifiable data available to another, conditions such as these may be required:

a. The activity must be in accordance with the laws governing the programs of the respective agencies.
b. There must be a legitimate research purpose to be served by the process.
c. The receiving agency must be strictly and legally accountable to the providing agency for its security program.
d. The receiving agency must demonstrate that it has adequate security provisions.
e. The likelihood that any information potentially harmful to an individual would be derived from the would, even so, be ex-
f. The receiving agency would not and could not be required to turn the data over to any third party, even under subpoena or a Freedom of Information Act request.
g. The providing agency would have opportunity to review any publication of information from the data to insure that no potential disclosures are published.
h. At the close of the project, and no later than some specified date, the receiving agency would either return or destroy all of the tables involved.
i. Significant sanctions or penalties for improper disclosure would apply.

5. Evaluation of Alternative Techniques

If it is determined that there is a possibility that the publication of a table, or a datum within a table, might result in harm to some individual or organization, but, nevertheless, the table has sufficient value that, at least in some form, it should be published, then a decision must be made as to which technique will be used to avoid the disclosure. A number of examples have been cited; various other techniques are also possible. Four principal questions must be weighed in the making of this decision:

a. The degree of protection provided.-All of the described methods reduce considerably the likelihood of a disclosure; some give virtually absolute protection against the possibility of disclosure but are more drastic in terms of loss of information.

b. Effects on users of the data.-All of the techniques listed have some effect in reducing the value of the data to the user. There is some loss of information inherent whenever data are suppressed, combined, or disturbed. The Swedish method of removing ones from tables by changing them to 0's or 3's perhaps does the least harm to the data conveyed. At the other extreme, the method of "random rounding" to multiples of 5 has considerable effect, since it can cause any figure to be changed by as much as 4. In general, both of these data disturbing techniques may also yield inconsistent figures for the same data items in independently derived totals. Suppression could make some analyses impossible, particularly where the user wants to combine a number of smaller units to obtain totals and other statistics not provided in the tables. The multiplicative method cited by Hansen could cause any figure to be halved or increased by 50 percent. The Swedish suggestion for substituting a range for a sensitive value can also have severe effects if the range is relatively large.
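The random rounding scheme tabulated earlier, and the size of the changes it can produce, can be sketched as follows. The function names are hypothetical, and the sketch assumes nonnegative integer counts and a base of 5; the expected value is computed exactly with fractions rather than estimated by simulation.

```python
import random
from fractions import Fraction


def random_round(value, rng, base=5):
    """Round to an adjacent multiple of `base`, rounding up with
    probability r/base, where r is the remainder of value over base."""
    r = value % base
    if r == 0:
        return value                 # already a multiple: never changed
    if rng.random() < r / base:
        return value - r + base      # round up
    return value - r                 # round down


def expected_rounded_value(value, base=5):
    """Exact expectation of the randomly rounded figure."""
    r = value % base
    p_up = Fraction(r, base)
    return p_up * (value - r + base) + (1 - p_up) * (value - r)


rng = random.Random(0)   # fixed seed only so that the sketch is repeatable
figures = list(range(0, 50))
rounded = [random_round(v, rng) for v in figures]
largest_change = max(abs(r - v) for r, v in zip(rounded, figures))
unbiased = all(expected_rounded_value(v) == v for v in figures)
```

Under these probabilities each rounded figure has the original figure as its expected value, while any single figure can move by up to 4 (a final digit of 1 or 6 rounded up, or of 4 or 9 rounded down), which is one of the changes to individual figures discussed in this paragraph.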
Even the smallest of these changes may affect the value of the published data for descriptive or analytic purposes (Dalenius, 1974:220).

With the increasing use of computers in data analysis, particularly where a large number of areas are to be compared, the uniformity of the data input is another factor affecting users. In this context, rolling-up, so that dimensions of the data matrix vary from unit to unit, creates considerable difficulty. Suppression is also problematic in that suppression at any level can prevent the development of a desired total. In this context the data disturbing techniques may be most satisfactory, in that data are always present and they can be added together without biasing effects on the totals derived. Other statistics such as ratios, e.g., persons per household, can be affected; however, with suitable precautions, these effects can be minimized.

c. The "identifying" nature of the subject items.-Some subject characteristics are more likely than others to lead to the ability to associate data with a particular individual. A tabulation of race and sex by income probably has more disclosure potential than a similarly detailed table of major field of study in college by income, assuming that race and sex are more readily observable than major field of study. Area of residence is considered highly identifying in nature, and frequently geographic or size of area characteristics are considered separately from any "subject" characteristics of a respondent in disclosure rules. On the other hand, opinions recorded in a survey are normally of minimal utility in identifying a respondent.

The Census Bureau, for instance, has in the past used area of residence and race as the critical variables in determining the publishability of small area population census tabulations. If certain minimum population criteria were met in each area, then other characteristics of that population would be provided.
On the other hand, the Census Bureau was willing to make available journey-to-work data from the 1970 census in the form of an origin-destination matrix classified by mode (auto, bus, etc.) without any disclosure control, on the assumption that journey-to-work characteristics are highly changeable (the question was asked relative to "last week") for an individual and therefore non-identifying.

d. Cost.-Any procedure used to avoid disclosure in statistical tables will involve some cost to the statistical agency. There will be cost in the use of some operating funds, in the use of personnel time that would otherwise be available for other activities, in the computer programming, debugging, and processing, and in time required for the total process and the resulting delay in publication.

* * *

Agencies cited have studied the problem and have tended to settle on one particular technique to be used for all publications of a particular census, or as standard operating procedure. Once this is done and staff understand it, the procedure becomes routinized and automatic. Computer programs are written to automatically "purify" the tables in the system on a mass-production basis, and costs are minimized. All of the techniques described are capable of computerization, and some software packages are available (Cox, 1976:14-15).

But such mass procedures may also result in wholesale losses of valuable information. Study of the effects of such procedures may reveal that in many instances the system's application resulted in particular losses of information that are both unfortunate and unnecessary. As described in Appendix C, the Census Bureau has developed programs which attempt to minimize the number of suppressions in magnitude data.
Each statistical agency must make its own study and its own decision to answer this question: How can we do our job of making available the needed data, in our area, while at the same time we make sure that no confidential information about any person or any establishment is accidentally released through the tables we publish? Selected agency policies and practices to avoid unintentional disclosures are noted in Appendix A.

CHAPTER IV

Disclosure in Microdata

A. Nature of the Problem

1. Definition of Microdata

We use the term microdata to refer to files in which each record provides data about an individual person, household, establishment or other unit. An agency's own files of basic records from a survey or other data collection are thus microdata, and normally they are summarized or aggregated to produce statistics for the reports and publications discussed in Chapter III.

Release of microdata to a data user outside the originating agency can serve legitimate and important public purposes in that the data may be useful for many more tabulations or other analyses than the originating agency is prepared to provide. Certain statistical applications (e.g., simulation models) require input in microdata form.

Obviously, release of records about individuals raises the issue of disclosure. Some files are by law not confidential, for example, those from the Census of Governments, from which detailed data for specific governmental units are released. On the other hand, most data bases are covered by statutes (discussed in Chapter II) which prohibit the release of data from which information may be gained about identifiable individuals. Agencies which release microdata for outside use have construed applicable law and regulations to permit the release of individual information insofar as it is not specific enough to allow identification of the individual. Invariably names and addresses, social security numbers and other positive identifiers are removed.
Further, certain other information, such as location, is generally withheld or provided only in broad categories.

Microdata is a particularly popular form of release since it gives the user considerable flexibility in his or her analyses. The capacity of data users to perform such analyses has been increasing, and continues to increase, rapidly with the availability of computer resources. At the same time the statistical agency is frequently impelled to release microdata as a labor-saving device: it reduces somewhat the need for extensive published tabulations, and it cuts down on requests for special tabulations which are sometimes seen as diverting agency resources. Thus the dissemination of data in microdata form is steadily increasing.

2. Federal Agency Examples of Microdata Release

a. Bureau of the Census.-Probably the best known of all Federal microdata bases are the public-use samples of basic records from the 1960 and 1970 censuses of population and housing. From the first release in 1963, these samples have provided nearly the full richness of detail about households derivable from the decennial censuses: age, education, income, occupation, etc., of each family member along with characteristics of the family's housing. The sample originally released in 1963 had little geographic information and the sampling fraction was only 0.1 percent of all U.S. households. As a result of the public acceptance and demonstrated utility of that microdata product, public-use samples from the 1970 census were created with a larger sampling fraction (one percent) and more specific geographic information (areas as small as 250,000 population were identified). A total of six mutually exclusive one-percent samples were made available (taken together, six percent of the national population). These files are available for purchase by anyone and use is not restricted.
Fairly comparable in content and structure to the census public-use samples are the Annual Demographic Files (ADF) generated each year from the March supplement to the Current Population Survey (CPS). A special provision must be added to the aforementioned disclosure rule since the CPS is an area sample and maps are available which define what areas are included in the first-stage sample. The minimum population criterion becomes 250,000 population within sampled primary sampling units in the area to be identified. For example, since central city, other metropolitan and nonmetropolitan components of the population have been identified on the ADF through 1976, a State with even several million total population was not identifiable if there were fewer than 250,000 people in sampled nonmetropolitan counties. (Beginning with the 1977 ADF, all States will be identified, but with central city and metropolitan residence codes suppressed where necessary; see page 38.) There are no restrictions on use of Annual Demographic Files. Files from a number of other household surveys are also released in a similar manner.

b. Social Security Administration.-The Social Security Administration (SSA) makes available from its Continuous Work History Sample system the Longitudinal Employee-Employer Data (LEED) File, containing records for one percent of all employees covered by the Social Security System. For every individual in the file there is age, race, and sex information and a record for each employer in each year since 1957. The employer records indicate the industry, State, county, taxable wages and estimated total wages for the year. Scrambled social security numbers for employees are provided only to users who will be updating the sample with data for subsequent years.
Purchasers must enter into a written agreement with SSA specifying the purposes for which the file may be used, prohibiting further dissemination without SSA authorization, and specifically precluding any attempt to identify specific individuals or establishments or to match individual records with information in other files on specific individuals. Annual and quarterly files from the system are also available under the same conditions.

SSA also releases microdata files for general public use, i.e., without any restrictions, from several different sources, including the Longitudinal Retirement History Survey, various surveys of disabled persons, the Survey of the Low-Income Aged and Disabled, and certain match studies using data from the Current Population Survey, IRS and SSA. These files are all based on relatively small samples (less than one percent of the population) and carry only limited geographic information. Unusual values of variables or combinations of variables are suppressed prior to release of the files.

c. National Center for Health Statistics.-The National Center for Health Statistics (NCHS) releases public-use microdata tapes from many of its surveys and statistical programs. These include tapes from the Health Interview Survey, the Health and Nutrition Examination Surveys, the National Ambulatory Medical Care Survey, the Hospital Discharge Survey, health manpower and health facility inventories, the inventory of family planning service sites, vital statistics for the Nation (natality, mortality, marriage, and divorce), and the national natality and mortality followback surveys. These public-use tapes are reported in a catalog published annually (NCHS, 1976).

One NCHS microdata file quite unlike the examples from other agencies is the file on natality, a 50-percent sample of records from the NCHS birth registration system (100-percent for some States in 1972 and 1973). No other Federal microdata file released exhausts a universe or comes that close.
Records on the natality file include the age, race and education of the father and mother, the State and county of residence of the mother, the birth date, legitimacy (if recorded) and several characteristics of the mother's previous childbearing history. Purchasers of NCHS microdata sign a simple statement that the file will be used solely for statistical research or reporting purposes.

d. National Center for Education Statistics.-The National Center for Education Statistics has available microdata tapes with information gathered from 22,532 graduates of the high school class of 1972, a probability sample made up of approximately 0.7 percent of the national high school class for that year. Information was collected beginning in the spring of 1972, with follow-up surveys in October 1974, for the National Longitudinal Study of the High School Class of 1972. School record information, such as grade point average, class rank, and area of study, is included along with test results and student-provided information on family background, attitudes, and plans for the future. Periodic follow-ups provide information on activity status and changes in attitudes and plans for the future. Geographic information specifies regions and type of community (e.g., rural, suburb, etc.). These files are available for purchase by anyone, and use is not restricted.

e. Internal Revenue Service.-The Internal Revenue Service releases two samples of unidentified individual income tax returns, with 150 data items from each return, for tabulation purposes and to allow simulation of the revenue impact of tax law changes. The Tax Revenue Model for National Estimates, with no geographic information, is available for purchase and unrestricted use. Less than 0.2 percent of all returns are included in that file, although the sampling fraction varies among the classes of taxpayers.
The Tax Model for State Estimates, including about 0.3 percent of all returns identified to the State level, is available to State tax agencies for tax administration purposes and, once certainty strata are deleted, it is also made available to the public.

B. Evaluation of the Problem

While microdata are made available so that tabulations or other summarizations can be made, it is the possible scrutiny of individual records that causes concern for the violation of confidentiality. While we are confining our consideration to microdata files with no positive identifiers (e.g., name, address, or social security number), a combination of data elements, such as geographic location, age, race, and occupation, if sufficiently detailed, could identify an individual if known by the investigator in advance. Other information on the microdata record so identified would then be disclosed about the individual, e.g., income, marital history, educational attainment, etc. This section deals with the likelihood of such disclosure and with the bases for determining, in particular cases, whether or not the risks of disclosure are acceptable.

1. Factors Bearing on the Likelihood of Disclosure

a. Sample size or fraction of the universe.-If an investigator were searching for a particular individual in a microdata file, his probability of success would be no greater than the chances that a randomly selected individual's record is present in the file, assuming of course that the investigator had no external way of knowing whether or not the individual was selected into the sample. For instance, in a one-percent sample the chances are 99-to-1 against a particular individual having a record in the file. In stratified samples the likelihood of selection into the sample may vary from stratum to stratum. Further, in multi-stage samples it may be possible for an outsider to determine that some counties but not others were subject to sampling beyond the first stage.
It would then be the sampling fraction within the county that would be relevant, rather than the average or overall sampling rate.

b. Uniqueness.-The term uniqueness is used here to characterize the situation where an individual can be distinguished from all other members in a population in terms of information available on microdata records. The existence of uniqueness is determined by the size of the population and the degree to which it is segmented by geographic information, and the number and detail of characteristics provided for each unit in the data base.

(1) Geographic information: The smaller the population, the more easily an individual can be unique; the larger the population the more likely that his or her set of characteristics is duplicated elsewhere. (Also, the larger the population the more costly would be any linkage attempt.) Size of the population, or of the smallest segment that can be readily identified, can be varied most directly by varying the amount of geographic information supplied on a microdata file. Geographic information can be in terms of specific areas (e.g., the State of Maryland) or in terms of type of area (e.g., size of place or rural) or both. Multiple geographic identifiers in combination may identify a small area, e.g., the rural part of an SMSA, or a small part of an SMSA crossing a State line. Extraneous sources may also provide information about the location of the respondent: knowledge that only certain areas were surveyed or subject to final stage sampling; sequence of records in the file where they have not been scrambled; the existence of more than one version of a file with different sets of geography identified; and neighborhood, county or PSU summary characteristics if present and matchable to an external source.

(2) Characteristics of the respondent:
In general it can be said that the greater the number and detail of characteristics reported about an individual the more likely it is that the individual's representation in the file would be different from that of any other individual in the population. Just 10 characteristics with four categories each create over a million possibilities (4^10 = 1,048,576), and when one considers that some data items may have 100 or more potential categories (e.g., age, occupation, industry, income, place of birth) the number of possibilities becomes astronomical in a file with a large number of characteristics. Many characteristics are, however, likely to be correlated with one another, thus reducing the degree to which an additional item creates additional unique records.

For a given subject the number of categories does not entirely account for its potential in an identification process. Some identify especially small populations, e.g., country of birth of the foreign born. It might then seem reasonable to designate a minimum category population, e.g., to collapse country of birth categories with fewer than 50 cases in the file. This technique, however, appears inadequate. While there may be many Russian-born persons sampled, only one may be black, or only one may live in a particular identified area. More importantly, uniqueness in the sample is not the critical factor, for there may be a hundred such individuals in the population with no possibility of discriminating among them. Uniqueness in the population is the real question, and this cannot be determined without a census or administrative file exhausting the population or an identifiable subset thereof (e.g., a file of all doctors). Precluding uniqueness in the sample would be a very conservative approach to avoiding disclosure.

Some public-use microdata files provide characteristics for all or at least multiple members of a household.
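
As an illustration only, the arithmetic above and the distinction between uniqueness in the sample and uniqueness in the population can be sketched in Python. The records, category labels, and function names below are hypothetical and are not drawn from any agency file or procedure.

```python
from collections import Counter


def combination_count(categories_per_item):
    """Number of distinct records possible when the items are cross-classified."""
    total = 1
    for n in categories_per_item:
        total *= n
    return total


def unique_keys(records):
    """Return the characteristic combinations held by exactly one record."""
    counts = Counter(records)
    return {key for key, n in counts.items() if n == 1}


# Ten characteristics with four categories each, as in the text: 4**10.
ten_items = combination_count([4] * 10)

# Hypothetical toy population of (country of birth, race, area) tuples.
population = [
    ("Russia", "white", "A"), ("Russia", "white", "A"),
    ("Russia", "black", "A"), ("Russia", "black", "A"),
    ("Russia", "black", "B"),
    ("US", "white", "A"), ("US", "white", "A"),
]

# Hypothetical sample drawn from that population.
sample = [population[0], population[2], population[4], population[5]]

sample_unique = unique_keys(sample)          # unique among sampled records
population_unique = unique_keys(population)  # unique among all persons

# Only records unique in BOTH are candidates for identification.
at_risk = sample_unique & population_unique
```

In this toy case every sampled record is unique in the sample, yet only one combination is also unique in the population, which is why suppressing every sample-unique record would be the more conservative rule.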
The association of the characteristics of household members greatly increases the potential for unique combinations (e.g., a 66-year-old judge married to a 23-year-old actress).

c. Recognizability.-The term recognizability is used here to refer to the likelihood that an investigator could accurately associate unique records in the sample with particular individuals in the population and thereby gain additional information about them. A record in the sample may be unique, but if it cannot be linked with a specific person then disclosure cannot occur. Three factors affecting recognizability are discussed: the existence of a population register, "noise" in the microdata file, and time lag or the degree to which the microdata information has become out-of-date for an individual.

(1) Population registers: A population register is defined here to be a list of persons or households with specific identification, names or addresses, which also systematically contains information which coincides with data on public-use microdata records. Except for Census Bureau, Social Security Administration and Internal Revenue Service records, none of which are available to the public, we know of no registers which systematically cover most of the U.S. population. But neither nationwide coverage nor coverage of all segments of the population is required to make a population register usable for matching purposes. Reasonable coverage of a defined subpopulation, along with a number of reliable matching characteristics, may suffice. A register of some groups like Black architects, American Indians, high public officials, or birth records is not at all improbable. The existence of rather extensive registers of business establishments in the hands of governmental units, trade associations and firms like Dun and Bradstreet has virtually ruled out the possibility of releasing microdata files about businesses for statistical purposes.
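
The role of a population register in recognizability can be sketched as follows. This is a hypothetical illustration, not a description of any actual register or matching procedure: the register entries, field names, and the function `register_matches` are all invented for the example. A microdata record is treated as recognizable only when exactly one register entry agrees with it on every matching characteristic.

```python
def register_matches(record, register, match_keys):
    """Return the register entries that agree with the record on every key."""
    return [
        entry for entry in register
        if all(entry.get(key) == record.get(key) for key in match_keys)
    ]


# Hypothetical register: identified units with a few characteristics.
register = [
    {"name": "Person A", "birth_year": 1912, "state_of_birth": "MD",
     "occupation": "judge"},
    {"name": "Person B", "birth_year": 1912, "state_of_birth": "MD",
     "occupation": "architect"},
    {"name": "Person C", "birth_year": 1940, "state_of_birth": "OH",
     "occupation": "judge"},
]

# Hypothetical public-use record: no name, but overlapping characteristics
# plus an item (income) that the register does not hold.
record = {"birth_year": 1912, "state_of_birth": "MD",
          "occupation": "judge", "income": 41000}

# Matching on two characteristics leaves two candidates on the register.
coarse = register_matches(record, register, ["birth_year", "state_of_birth"])

# Adding a third characteristic narrows the candidates to one.
fine = register_matches(
    record, register, ["birth_year", "state_of_birth", "occupation"]
)

# Assumption: the register covers the whole subpopulation; otherwise a
# single match does not establish that the right person was found.
recognizable = len(fine) == 1
```

With two matching characteristics the record cannot be told apart from two register entries; with three, a single entry remains and the income item on the record would be associated with a named person.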
The point is, of course, to be able to discriminate among the units on the register for the one which matches a public-use microdata record, and this requires inclusion on the register of stable and reliable matching characteristics. Among the characteristics most likely to reside in a population register file, date of birth and State or country of birth would seem to be the most reliable, regardless of time or circumstances of data collection. Veteran status, period of military service, and years of school completed wo