Statistical Policy Working Paper 22 - Report on Statistical Disclosure Limitation Methodology
Statistical Policy Working Paper 22

Report on Statistical Disclosure Limitation Methodology

Prepared by
Subcommittee on Disclosure Limitation Methodology
Federal Committee on Statistical Methodology

Statistical Policy Office
Office of Information and Regulatory Affairs
Office of Management and Budget

May 1994

MEMBERS OF THE FEDERAL COMMITTEE ON STATISTICAL METHODOLOGY
(May 1994)

Maria E. Gonzalez, Chair, Office of Management and Budget
Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Cynthia Z.F. Clark, National Agricultural Statistics Service
Steven Cohen, Administration for Health Policy and Research
Zahava D. Doering, Smithsonian Institution
Roger A. Herriot, National Center for Education Statistics
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, National Center for Education Statistics
Nancy Kirkendall, Energy Information Administration
Daniel Melnick, Substance Abuse and Mental Health Services Administration
Robert P. Parker, Bureau of Economic Analysis
Charles P. Pautler, Jr., Bureau of the Census
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Wesley L. Schaible, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census
Alan R. Tupek, National Science Foundation

PREFACE

The Federal Committee on Statistical Methodology was organized by OMB in 1975 to investigate issues of data quality affecting Federal statistics. Members of the committee, selected by OMB on the basis of their individual expertise and interest in statistical methods, serve in a personal capacity rather than as agency representatives. The committee conducts its work through subcommittees that are organized to study particular issues. The subcommittees are open by invitation to Federal employees who wish to participate.
Working papers are prepared by the subcommittee members and reflect only their individual and collective ideas.

The Subcommittee on Disclosure Limitation Methodology was formed in 1992 to update the work presented in Statistical Policy Working Paper 2, Report on Statistical Disclosure and Disclosure Avoidance Techniques, published in 1978. The Report on Statistical Disclosure Limitation Methodology, Statistical Policy Working Paper 22, discusses both tables and microdata and describes current practices of the principal Federal statistical agencies. The report includes a tutorial, guidelines, and recommendations for good practice; recommendations for further research; and an annotated bibliography. The Subcommittee plans to organize seminars and workshops in order to facilitate further communication concerning disclosure limitation.

The Subcommittee on Disclosure Limitation Methodology was chaired by Nancy Kirkendall of the Energy Information Administration, Department of Energy.

Members of the Subcommittee on Disclosure Limitation Methodology

Nancy J. Kirkendall, Chairperson, Energy Information Administration, Department of Energy
William L. Arends, National Agricultural Statistics Service, Department of Agriculture
Lawrence H. Cox, Environmental Protection Agency
Virginia de Wolf, Bureau of Labor Statistics, Department of Labor
Arnold Gilbert, Bureau of Economic Analysis, Department of Commerce
Thomas B. Jabine, Committee on National Statistics, National Research Council, National Academy of Sciences
Mel Kollander, Environmental Protection Agency
Donald G. Marks, Department of Defense
Barry Nussbaum, Environmental Protection Agency
Laura V. Zayatz, Bureau of the Census, Department of Commerce

Acknowledgements

In early 1992, an ad hoc interagency committee on Disclosure Risk Analysis was organized by Hermann Habermann, Office of Management and Budget. A subcommittee was formed to look at methodological issues and to analyze results of an informal survey of agency practices.
That subcommittee became a Subcommittee of the Federal Committee on Statistical Methodology (FCSM) in early 1993. The Subcommittee would like to thank Hermann Habermann for getting us started, and Maria Gonzalez and the FCSM for adopting us and providing an audience for our paper.

Special thanks to Subcommittee member Laura Zayatz for her participation during the last two years. She helped to organize the review of papers, contributed extensively to the annotated bibliography and wrote the chapters on microdata and research issues in this working paper. In addition, she provided considerable input to the discussion of disclosure limitation methods in tables. Her knowledge of both theoretical and practical issues in disclosure limitation was invaluable.

Special thanks to Subcommittee member William Arends for his participation during the last two years. He helped in the review of papers, analyzed the results of the informal survey of agency practices, contacted agencies to get more detailed information and wrote the chapter on agency practices. He and Mary Ann Higgs pulled together information from all authors and prepared three drafts and the final version of this working paper, making them all look polished and professional. He also arranged for printing the draft reports.

Tom Jabine, Ginny deWolf and Larry Cox are relative newcomers to the subcommittee. Ginny joined in the fall of 1992, Tom and Larry in early 1993. Tom and Larry both participated in the work of the 1978 Subcommittee that prepared Statistical Policy Working Paper 2, providing the current Subcommittee with valuable continuity. Tom, Ginny and Larry all contributed extensively to the introductory and recommended practices chapters, and Tom provided thorough and thoughtful review and comment on all chapters. Larry provided particularly helpful insights on the research chapter.
Special thanks to Tore Dalenius, another participant in the preparation of Statistical Policy Working Paper 2, for his careful review of this paper. Thanks also to FCSM members Daniel Kasprzyk and Cynthia Clark for their thorough reviews of multiple drafts.

The Subcommittee would like to acknowledge three people who contributed to the annotated bibliography: Dorothy Wellington, who retired from the Environmental Protection Agency; Russell Hudson, Social Security Administration; and Bob Burton, National Center for Education Statistics.

Finally, the Subcommittee owes a debt of gratitude to Mary Ann A. Higgs of the National Agricultural Statistics Service for her efforts in preparing the report.

Nancy Kirkendall chaired the subcommittee and wrote the primer and tables chapters.

TABLE OF CONTENTS

I. Introduction
   A. Subject and Purposes of This Report
   B. Some Definitions
      1. Confidentiality and Disclosure
      2. Tables and Microdata
      3. Restricted Data and Restricted Access
   C. Report of the Panel on Confidentiality and Data Access
   D. Organization of the Report
   E. Underlying Themes of the Report

II. Statistical Disclosure Limitation: A Primer
   A. Background
   B. Definitions
      1. Tables of Magnitude Data Versus Tables of Frequency Data
      2. Table Dimensionality
      3. What is Disclosure?
   C. Tables of Counts or Frequencies
      1. Sampling as a Statistical Disclosure Limitation Method
      2. Special Rules
      3. The Threshold Rule
         a. Suppression
         b. Random Rounding
         c. Controlled Rounding
         d. Confidentiality Edit
   D. Tables of Magnitude
   E. Microdata
      1. Sampling, Removing Identifiers and Limiting Geographic Detail
      2. High Visibility Variables
         a. Top-coding, Bottom-Coding, Recoding into Intervals
         b. Adding Random Noise
         c. Swapping or Rank Swapping
         d. Blank and Impute for Randomly Selected Records
         e. Blurring
   F. Summary

III. Current Federal Statistical Agency Practices
   A. Agency Summaries
      1. Department of Agriculture
         a. Economic Research Service (ERS)
         b. National Agricultural Statistics Service (NASS)
      2. Department of Commerce
         a. Bureau of Economic Analysis (BEA)
         b. Bureau of the Census (BOC)
      3. Department of Education: National Center for Education Statistics (NCES)
      4. Department of Energy: Energy Information Administration (EIA)
      5. Department of Health and Human Services
         a. National Center for Health Statistics (NCHS)
         b. Social Security Administration (SSA)
      6. Department of Justice: Bureau of Justice Statistics (BJS)
      7. Department of Labor: Bureau of Labor Statistics (BLS)
      8. Department of the Treasury: Internal Revenue Service, Statistics of Income Division (IRS, SOI)
      9. Environmental Protection Agency (EPA)
   B. Summary
      1. Magnitude and Frequency Data
      2. Microdata

IV. Methods for Tabular Data
   A. Tables of Frequency Data
      1. Controlled Rounding
      2. The Confidentiality Edit
   B. Tables of Magnitude Data
      1. Definition of Sensitive Cells
         a. The p-Percent Rule
         b. The pq Rule
         c. The (n,k) Rule
         d. The Relationship Between (n,k) and p-Percent or pq Rules
      2. Complementary Suppression
         a. Audits of Proposed Complementary Suppression
         b. Automatic Selection of Cells for Complementary Suppression
      3. Information in Parameter Values
   C. Technical Notes: Relationships Between Common Linear Sensitivity Measures

V. Methods for Public-Use Microdata Files
   A. Disclosure Risk of Microdata
      1. Disclosure Risk and Intruders
      2. Factors Contributing to Risk
      3. Factors that Naturally Decrease Risk
   B. Mathematical Methods of Addressing the Problem
      1. Proposed Measures of Risk
      2. Methods of Reducing Risk by Reducing the Amount of Information Released
      3. Methods of Reducing Risk by Disturbing Microdata
      4. Methods of Analyzing Disturbed Microdata to Determine Usefulness
   C. Necessary Procedures for Releasing Microdata Files
      1. Removal of Identifiers
      2. Limiting Geographic Detail
      3. Top-coding of Continuous High Visibility Variables
      4. Precautions for Certain Types of Microdata
         a. Establishment Microdata
         b. Longitudinal Microdata
         c. Microdata Containing Administrative Data
         d. Consideration of Potentially Matchable Files and Population Uniques
   D. Stringent Methods of Limiting Disclosure Risk
      1. Do Not Release the Microdata
      2. Recode Data to Eliminate Uniques
      3. Disturb Data to Prevent Matching to External Files
   E. Conclusion

VI. Recommended Practices
   A. Introduction
   B. Recommendations
      1. General Recommendations for Tables and Microdata
      2. Tables of Frequency Count Data
      3. Tables of Magnitude Data
      4. Microdata Files

VII. Research Agenda
   A. Microdata
      1. Defining Disclosure
      2. Effects of Disclosure Limitation on Data Quality and Usefulness
         a. Disturbing Data
         b. More Information about Recoded Values
      3. Reidentification Issues
      4. Economic Microdata
      5. Longitudinal Microdata
      6. Contextual Variable Data
      7. Implementation Issues for Microdata
   B. Tabular Data
      1. Effects of Disclosure Limitation on Data Quality and Usefulness
         a. Frequency Count Data
         b. Magnitude Data
      2. Near-Optimal Cell Suppression in Two-Dimensional Tables
      3. Evaluating CONFID
      4. Faster Software
      5. Reducing Over-suppression
   C. Data Products Other Than Microdata and Tabular Data
      1. Database Systems
      2. Disclosure Risk in Analytic Reports

Appendices
   A. Technical Notes: Extending Primary Suppression Rules to Other Common Situations
      1. Background
      2. Extension of Disclosure Limitation Practices
         a. Sample Survey Data
         b. Tables Containing Imputed Data
         c. Tables that Report Negative Values
         d. Tables Where Differences Between Positive Values are Reported
         e. Tables Reporting Net Changes (that is, Difference Between Values Reported at Different Times)
         f. Tables Reporting Weighted Averages
         g. Output from Statistical Models
      3. Simplifying Procedures
         a. Key Item Suppression
         b. Preliminary and Final Data
         c. Time Series Data
   B. Government References
   C. Annotated Bibliography

CHAPTER I

Introduction

A. Subject and Purposes of This Report

Federal agencies and their contractors who release statistical tables or microdata files are often required by law or established policies to protect the confidentiality of individual information. This confidentiality requirement applies to releases of data to the general public; it can also apply to releases to other agencies or even to other units within the same agency. The required protection is achieved by the application of statistical disclosure limitation procedures whose purpose is to ensure that the risk of disclosing confidential information about identifiable persons, businesses or other units will be very small.

In early 1992 the Statistical Policy Office of the Office of Management and Budget convened an ad hoc interagency committee to review and evaluate statistical disclosure limitation methods used by federal statistical agencies and to develop recommendations for their improvement. Subsequently, the ad hoc committee became the Subcommittee on Disclosure Limitation Methodology, operating under the auspices of the Federal Committee on Statistical Methodology. This is the final report of the Subcommittee.

The Subcommittee's goals in preparing this report were to:

o update a predecessor subcommittee's report on the same topic (Federal Committee on Statistical Methodology, 1978);

o describe and evaluate existing disclosure limitation methods for tables and microdata files;

o provide recommendations and guidelines for the selection and use of effective disclosure limitation techniques;

o encourage the development, sharing and use of software for the applications of disclosure limitation methods; and

o encourage research to develop improved statistical disclosure limitation methods, especially for public-use microdata files.
The Subcommittee believes that every agency or unit within an agency that releases statistical data should have the ability to select and apply suitable disclosure limitation procedures to all the data it releases. Each agency should have one or more employees with a clear understanding of the methods and the theory that underlies them.

To this end our report is directed primarily at employees of federal agencies and their contractors who are engaged in the collection and dissemination of statistical data, especially those who are directly responsible for the selection and use of disclosure limitation procedures. We believe that the report will also be of interest to employees with similar responsibilities in other organizations that release statistical data, and to data users, who may find that it helps them to understand and use disclosure-limited data products.

B. Some Definitions

In order to clarify the scope of this report, we define and discuss here some key terms that will be used throughout the report.

B.1. Confidentiality and Disclosure

A definition of confidentiality was given by the President's Commission on Federal Statistics (1971:222):

    [Confidential should mean that the dissemination] of data in a manner that would allow public identification of the respondent or would in any way be harmful to him is prohibited and that the data are immune from legal process.

The second element of this definition, immunity from mandatory disclosure through legal process, is a legal question and is outside the scope of this report. Our concern is with methods designed to comply with the first element of the definition, in other words, to minimize the risk of disclosure (public identification) of the identity of individual units and information about them.

The release of statistical data inevitably reveals some information about individual data subjects.
Disclosure occurs when information that is meant to be treated as confidential is revealed. Sometimes disclosure can occur based on the released data alone; sometimes disclosure results from combination of the released data with publicly available information; and sometimes disclosure is possible only through combination of the released data with detailed external data sources that may or may not be available to the general public. At a minimum, each statistical agency must assure that the risk of disclosure from the released data alone is very low.

Several different definitions of disclosure and of different types of disclosure have been proposed (see Duncan and Lambert, 1987 for a review of definitions of disclosure associated with the release of microdata). Duncan et al. (1993: 23-24) provide a definition that distinguishes three types of disclosure:

    Disclosure relates to inappropriate attribution of information to a data subject, whether an individual or an organization. Disclosure occurs when a data subject is identified from a released file (identity disclosure), sensitive information about a data subject is revealed through the released file (attribute disclosure), or the released data make it possible to determine the value of some characteristic of an individual more accurately than otherwise would have been possible (inferential disclosure).

In the above definition, the word "data" could have been substituted for "file", because each type of disclosure can occur in connection with the release of tables or microdata. The definitions and implications of these three kinds of disclosure are examined in more detail in the next chapter.

B.2. Tables and Microdata

The choice of statistical disclosure limitation methods depends on the nature of the data products whose confidentiality must be protected. Most statistical data are released in the form of tables or microdata files.
Tables can be further divided into two categories: tables of frequency (count) data and tables of magnitude data. For either category, data can be presented in the form of numbers, proportions or percents.

A microdata file consists of individual records, each containing values of variables for a single person, business establishment or other unit. Some microdata files include explicit identifiers, like name, address or Social Security number. Removing any such identifiers is an obvious first step in preparing for the release of a file for which the confidentiality of individual information must be protected.

B.3. Restricted Data and Restricted Access

The confidentiality of individual information can be protected by restricting the amount of information in released tables and microdata files (restricted data) or by imposing conditions on access to the data products (restricted access), or by some combination of these. The disclosure limitation methods described in this report provide confidentiality protection by restricting the data.

Public-use data products are released by statistical agencies to anyone without restrictions on use or other conditions, except for payment of fees to purchase publications or data files in electronic form. Agencies require that the disclosure risks for public-use data products be very low. The application of disclosure limitation methods to meet this requirement sometimes calls for substantial restriction of data content, to the point where the data may no longer be of much value for some purposes. In such circumstances, it may be appropriate to use procedures that allow some users to have access to more detailed data, subject to restrictions on who may have access, at what locations and for what purposes.
Such restricted access arrangements normally require written agreements between agency and users, and the latter are subject to penalties for improper disclosure of individual information and other violations of the agreed conditions of use.

The fact that this report deals only with disclosure limitation procedures that restrict data content should not be interpreted to mean that restricted access procedures are of less importance. Readers interested in the latter can find detailed information in the report of the Panel on Confidentiality and Data Access (see below) and in Jabine (1993b).

C. Report of the Panel on Confidentiality and Data Access

In October 1993, while the Subcommittee was developing this report, the Panel on Confidentiality and Data Access, which was jointly sponsored by the Committee on National Statistics (CNSTAT) of the National Research Council and the Social Science Research Council, released its final report (Duncan et al., 1993). The scope of the CNSTAT report is much broader than this one: disclosure limitation methodology was only one of many topics covered and it was treated in much less detail than it is here. The CNSTAT panel's recommendations on statistical disclosure limitation methods (6.1 to 6.4) are less detailed than the guidelines and recommendations presented in this report. However, we believe that the recommendations in the two reports are entirely consistent with and complement each other. Indeed, the development and publication of this report is directly responsive to the CNSTAT Panel's Recommendation 6.1, which says, in part, that "The Office of Management and Budget's Statistical Policy Office should continue to coordinate research work on statistical disclosure analysis and should disseminate the results of this work broadly among statistical agencies."

D. Organization of the Report

Chapter II, "Statistical Disclosure Limitation Methods: A Primer", provides a simple description and examples of disclosure limitation techniques that are commonly used to limit the risk of disclosure in releasing tables and microdata. Readers already familiar with the basics of disclosure limitation methods may want to skip over this chapter.

Chapter III describes disclosure limitation methods used by twelve major federal statistical agencies and programs. Among the factors that explain variations in agencies' practices are differences in types of data and respondents, different legal requirements and policies for confidentiality protection, different technical personnel and different historical approaches to confidentiality issues.

Chapter IV provides a systematic and detailed description and evaluation of statistical disclosure limitation methods for tables of frequency and magnitude data. Chapter V fulfills the same function for microdata. These chapters will be of greatest interest to readers who have direct responsibility for the application of disclosure limitation methods or are doing research to evaluate and improve existing methods or develop new ones. Readers with more general interests may want to skip these chapters and proceed to Chapters VI and VII.

Due in part to the stimulus provided by our predecessor subcommittee's report (which we will identify in this report as Working Paper 2), improved methods of disclosure limitation have been developed and used by some agencies over the past 15 years. Based on its review of these methods, the Subcommittee has developed guidelines for good practice for all agencies. With separate sections for tables and microdata, Chapter VI presents guidelines for recommended practices.

Chapter VII presents an agenda for research on disclosure limitation methods.
Because statistical disclosure limitation procedures for tabular data are more fully developed than those for microdata, the research agenda focuses more on the latter. The Subcommittee believes that a high priority should be assigned to research on how the quality and usefulness of data are affected by the application of disclosure limitation procedures.

Two appendices are also included. Appendix A contains technical notes on practices the statistical agencies have found useful in extending primary suppression rules to other common situations. Appendix B is an annotated bibliography of articles about statistical disclosure limitation published since the publication of Working Paper 2.

E. Underlying Themes of the Report

Five principal themes underlie the guidelines in Chapter VI and the research agenda in Chapter VII:

o There are legitimate differences between the disclosure limitation requirements of different agencies. Nevertheless, agencies should move as far as possible toward the use of a small number of standardized disclosure limitation methods whose effectiveness has been demonstrated.

o Statistical disclosure limitation methods have been developed and implemented by individual agencies over the past 25 years. The time has come to make the best technology available to the entire federal statistical system. The Subcommittee believes that methods which have been shown to provide adequate protection against disclosure should be documented clearly in simple formats. The documentation and the corresponding software should then be shared among federal agencies.

o Disclosure-limited products should be auditable to determine whether or not they meet the intended objectives of the procedure that was applied. For example, for some kinds of tabular data, linear programming software can be used to perform disclosure audits.
o Several agencies have formed review panels to ensure that appropriate disclosure limitation policies and practices are in place and being properly used. Each agency should centralize its oversight and review of the application of disclosure limitation methods.

o New research should focus on disclosure limitation methods for microdata and on how the methods used affect the usefulness and ease of use of data products.

CHAPTER II

Statistical Disclosure Limitation: A Primer

This chapter provides a basic introduction to the disclosure limitation techniques which are used to protect statistical tables and microdata. It uses simple examples to illustrate the techniques. Readers who are already familiar with the methodology of statistical disclosure limitation may prefer to skip directly to Chapter III, which describes agency practices, Chapter IV, which provides a more mathematical discussion of disclosure limitation techniques used to protect tables, or Chapter V, which provides a more detailed discussion of disclosure limitation techniques applied to microdata.

A. Background

One of the functions of a federal statistical agency is to collect individually identifiable data, process them and provide statistical summaries to the public. Some of the data collected are considered proprietary by respondents.

Agencies are authorized or required to protect individually identifiable data by a variety of statutes, regulations or policies. Cecil (1993) summarizes the laws that apply to all agencies and describes the statutes that apply specifically to the Census Bureau, the National Center for Education Statistics, and the National Center for Health Statistics. Regardless of the basis used to protect confidentiality, federal statistical agencies must balance two objectives: to provide useful statistical information to data users, and to assure that the responses of individuals are protected.
Not all data collected and published by the government are subject to disclosure limitation techniques. Some data on businesses collected for regulatory purposes are considered public. Some data are not considered sensitive and are not collected under a pledge of confidentiality. The statistical disclosure limitation techniques described in this paper are applied whenever confidentiality is required and data or estimates are to be publicly available.

Methods of protecting data by restricting access are alternatives to statistical disclosure limitation. They are not discussed in this paper. See Jabine (1993) for a discussion of restricted access methods. All disclosure limitation methods result in some loss of information, and sometimes the publicly available data may not be adequate for certain statistical studies. However, the intention is to provide as much data as possible, without revealing individually identifiable data.

The historical method of providing data to the public is via statistical tables. With the advent of the computer age in the early 1960's agencies also started releasing microdata files. In a microdata file each record contains a set of variables that pertain to a single respondent and are related to that respondent's reported values. However, there are no identifiers on the file and the data may be disguised in some way to make sure that individual data items cannot be uniquely associated with a particular respondent.

A new method of releasing data has been introduced by the National Center for Education Statistics (NCES) in the 1990's. Data are provided on diskette or CD-ROM in a secure data base system with access programs which allow users to create special tabulations. The NCES disclosure limitation and data accuracy standards are automatically applied to the requested tables before they are displayed to the user.
This chapter provides a simple description of the disclosure limitation techniques which are commonly used to limit the possibility of disclosing identifying information about respondents in tables and microdata. The techniques are illustrated with examples. The tables or microdata produced using these methods are usually made available to the public with no further restrictions. Section B presents some of the basic definitions used in the sections and chapters that follow: included are a discussion of the distinction between tables of frequency data and tables of magnitude data, a definition of table dimensionality, and a summary of different types of disclosure. Section C discusses the disclosure limitation methods applied to tables of counts or frequencies. Section D addresses tables of magnitude data, Section E discusses microdata, and Section F summarizes the chapter.

B. Definitions

Each entry in a statistical table represents the aggregate value of a quantity over all units of analysis belonging to a unique statistical cell. For example, a table that presents counts of individuals by 5-year age category and by total annual income in increments of $10,000 is composed of statistical cells such as the cell (35-39 years of age, $40,000 to $49,999 annual income). A table that displays the value of construction work done during a particular period in the state of Maryland by county and by 4-digit Standard Industrial Code (SIC) group is composed of cells such as the cell (SIC 1521, Prince George's County).

B.1. Tables of Magnitude Data Versus Tables of Frequency Data

The selection of a statistical disclosure limitation technique for data presented in tables (tabular data) depends on whether the data represent frequencies or magnitudes. Tables of frequency count data present the number of units of analysis in a cell.
Equivalently, the data may be presented as a percent by dividing the count by the total number presented in the table (or the total in a row or column) and multiplying by 100. Tables of magnitude data present the aggregate of a "quantity of interest" over all units of analysis in the cell. Equivalently, the data may be presented as an average by dividing the aggregate by the number of units in the cell.

To distinguish formally between frequency count data and magnitude data, the "quantity of interest" must measure something other than membership in the cell. Thus, tables of the number of establishments within the manufacturing sector by SIC group and by county-within-state are frequency count tables, whereas tables presenting total value of shipments for the same cells are tables of magnitude data.

For practical purposes, entirely rigorous definitions are not necessary. The statistical disclosure limitation techniques used for magnitude data can be used for frequency data. However, for tables of frequency data other options are also available.

B.2. Table Dimensionality

If the values presented in the cells of a statistical table are aggregates over two variables, the table is a two-dimensional table. Both examples of detail cells presented above, (35-39 years of age, $40,000-$49,999 annual income) and (SIC 1521, Prince George's County), are from two-dimensional tables. Typically, categories of one variable are given in columns and categories of the other variable are given in rows.

If the values presented in the cells of a statistical table are aggregates over three variables, the table is a three-dimensional table. If the data in the first example above were also presented by county in the state of Maryland, the result might be a detail cell such as (35-39 years of age, $40,000-$49,999 annual income, Montgomery County).
For the second example, if the data were also presented by year, the result might be a detail cell such as (SIC 1521, Prince George's County, 1990). The first two dimensions are said to be presented in rows and columns, the third variable in "layers".

B.3. What is Disclosure?

The definition of disclosure given in Chapter I, and discussed further below, is very broad. Because this report documents the methodology used to limit disclosure, the focus is on practical situations. Hence, the concern is only with the disclosure of confidential information through the public release of data products.

As stated in Lambert (1993), "disclosure is a difficult topic. People even disagree about what constitutes a disclosure." In Chapter I, the three types of disclosure presented in Duncan et al. (1993) were briefly introduced. These are identity disclosure, attribute disclosure and inferential disclosure.

Identity disclosure occurs if a third party can identify a subject or respondent from the released data. Revealing that an individual is a respondent or subject of a data collection may or may not violate confidentiality requirements. For tabulations, revealing identity is generally not disclosure, unless the identification leads to divulging confidential information (attribute disclosure) about those who are identified. For microdata, identification is generally regarded as disclosure, because microdata records are usually so detailed that the likelihood of identification without revealing additional information is minuscule. Hence disclosure limitation methods applied to microdata files limit or modify information that might be used to identify specific respondents or data subjects.

Attribute disclosure occurs when confidential information about a data subject is revealed and can be attributed to the subject. Attribute disclosure may occur when confidential information is revealed exactly or when it can be closely estimated.
Thus, attribute disclosure comprises identification of the subject and divulging confidential information pertaining to the subject.

Attribute disclosure is the form of disclosure of primary concern to statistical agencies for tabular data. Disclosure limitation methods applied to tables assure that respondent data are published only as part of an aggregate with a sufficient number of other respondents to prevent attribute disclosure.

The third type of disclosure, inferential disclosure, occurs when information can be inferred with high confidence from statistical properties of the released data. For example, the data may show a high correlation between income and purchase price of home. As purchase price of home is typically public information, a third party might use this information to infer the income of a data subject. In general, statistical agencies are not concerned with inferential disclosure, for two reasons. First, a major purpose of statistical data is to enable users to infer and understand relationships between variables. If statistical agencies equated disclosure with inference, no data could be released. Second, inferences are designed to predict aggregate behavior, not individual attributes, and thus are often poor predictors of individual data values.

C. Tables of Counts or Frequencies

The data collected from most surveys about people are published in tables that show counts (number of people by category) or frequencies (fraction or percent of people by category). A portion of a table published from a sample survey of households that collects information on energy consumption is shown in Table 1 as an example.

C.1. Sampling as a Statistical Disclosure Limitation Method

One method of protecting the confidentiality of data is to conduct a sample survey rather than a census.
Disclosure limitation techniques are not applied in Table 1, even though respondents are given a pledge of confidentiality, because it is a large scale sample survey. Estimates are made by multiplying an individual respondent's data by a sampling weight before they are aggregated. If sampling weights are not published, this weighting helps to make an individual respondent's data less identifiable from published totals. Because the weighted numbers represent all households in the United States, the counts in this table are given in units of millions of households. They were derived from a sample survey of fewer than 7,000 households. This illustrates the protection provided to individual respondents by sampling and estimation.

Additionally, many agencies require that estimates achieve a specified accuracy before they can be published. In Table 1, cells with a "Q" are withheld because the relative standard error is greater than 50 percent. For a sample survey, accuracy requirements such as this one result in more cells being withheld from publication than would a disclosure limitation rule. In Table 1 the values in the cells labeled Q can be derived by subtracting the other cells in the row from the marginal total. The purpose of the Q is not necessarily to withhold the value of the cell from the public, but rather to indicate that any number so derived does not meet the accuracy requirements of the agency.

When tables of counts or frequencies are based directly on data from all units in the population (for example, the 100-percent items in the decennial Census), disclosure limitation procedures must be applied. In the discussion below we identify two classes of disclosure limitation rules for tables of counts or frequencies. The first class consists of special rules designed for specific tables. Such rules differ from agency to agency and from table to table.
The special rules are generally designed to provide protection to data considered particularly sensitive by the agency. The second class is more general: a cell is defined to be sensitive if the number of respondents is less than some specified threshold (the threshold rule). Examples of both classes of disclosure limitation techniques are given in Sections II.C.2 and II.C.3.

C.2. Special Rules

Special rules impose restrictions on the level of detail that can be provided in a table. For example, Social Security Administration (SSA) rules prohibit tabulations in which a detail cell is equal to a marginal total or which would allow users to determine an individual's age within a five-year interval, earnings within a $1000 interval or benefits within a $50 interval.

Tables 2 and 3 illustrate these rules. They also illustrate the method of restructuring tables and combining categories to limit disclosure in tables.

Table 2 is a two-dimensional table showing the number of beneficiaries by county and size of benefit. This table would not be publishable because the data shown for counties B and D violate Social Security's disclosure rules. For county D, there is only one non-empty detail cell, and a beneficiary in this county is known to be receiving benefits between $40 and $59 per month. This violates two rules. First, the detail cell is equal to the cell total; and second, this reveals that all beneficiaries in the county receive between $40 and $59 per month in benefits. This interval is less than the required $50 interval. For county B there are 2 non-empty cells, but the range of possible benefits is from $40 to $79 per month, an interval of less than the required $50.

To protect confidentiality, Table 2 could be restructured and rows or columns combined (sometimes referred to as "rolling-up categories").
Combining the row for county B with the row for county D would still reveal that the range of benefits is $40 to $79. Combining A with B and C with D does offer the required protection, as illustrated in Table 3.

C.3. The Threshold Rule

With the threshold rule, a cell in a table of frequencies is defined to be sensitive if the number of respondents is less than some specified number. Some agencies require at least 5 respondents in a cell, others require 3. An agency may restructure tables and combine categories (as illustrated above), or use cell suppression, random rounding, controlled rounding or the confidentiality edit. Cell suppression, random rounding, controlled rounding and the confidentiality edit are described and illustrated below.

Table 4 is a fictitious example of a table with disclosures. The fictitious data set consists of information concerning delinquent children. We define a cell with fewer than 5 respondents to be sensitive. Sensitive cells are shown with an asterisk.

C.3.a. Suppression

One of the most commonly used ways of protecting sensitive cells is via suppression. It is obvious that in a row or column with a suppressed sensitive cell, at least one additional cell must be suppressed, or the value in the sensitive cell could be calculated exactly by subtraction from the marginal total. For this reason, certain other cells must also be suppressed. These are referred to as complementary suppressions. While it is possible to select cells for complementary suppression manually, it is difficult to guarantee that the result provides adequate protection.

Table 5 shows an example of a system of suppressed cells for Table 4 which has at least two suppressed cells in each row and column. This table appears to offer protection to the sensitive cells. But does it?
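A pattern with two or more suppressions in every row and column can still leak a cell, because sums and differences of the published rows and columns may cancel every unknown but one. The sketch below is a minimal illustration under stated assumptions: the 4 x 4 table and its totals are hypothetical values chosen for the example, the function name `determined_cells` is invented here, and the check uses only the row and column equations solved by exact Gaussian elimination. It is a weaker test than the linear programming audit, which also uses non-negativity to bound each suppressed cell.

```python
from fractions import Fraction

def determined_cells(table, row_totals, col_totals):
    """Return {(row, col): value} for suppressed cells (None) that the
    published cells and marginal totals determine exactly."""
    unknowns = [(r, c) for r, row in enumerate(table)
                for c, v in enumerate(row) if v is None]
    index = {cell: i for i, cell in enumerate(unknowns)}
    n = len(unknowns)

    def equation(cells, total):
        # Coefficients of the unknowns, then the right-hand side.
        eq = [Fraction(0)] * (n + 1)
        eq[n] = Fraction(total)
        for r, c in cells:
            if table[r][c] is None:
                eq[index[(r, c)]] = Fraction(1)
            else:
                eq[n] -= table[r][c]
        return eq

    n_rows, n_cols = len(table), len(col_totals)
    rows = [equation([(r, c) for c in range(n_cols)], row_totals[r])
            for r in range(n_rows)]
    rows += [equation([(r, c) for r in range(n_rows)], col_totals[c])
             for c in range(n_cols)]

    # Gauss-Jordan elimination to reduced row echelon form.
    pivots, pivot_row = {}, 0
    for col in range(n):
        src = next((i for i in range(pivot_row, len(rows))
                    if rows[i][col] != 0), None)
        if src is None:
            continue
        rows[pivot_row], rows[src] = rows[src], rows[pivot_row]
        p = rows[pivot_row][col]
        rows[pivot_row] = [x / p for x in rows[pivot_row]]
        for i in range(len(rows)):
            if i != pivot_row and rows[i][col] != 0:
                f = rows[i][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[pivot_row])]
        pivots[col] = pivot_row
        pivot_row += 1

    # An unknown is pinned down when its pivot row involves no other unknown.
    return {unknowns[col]: rows[r][n] for col, r in pivots.items()
            if all(rows[r][k] == 0 for k in range(n) if k != col)}

# Hypothetical published table; None marks a suppressed cell.
table = [
    [15, None, None, None],
    [20, None, None, 15],
    [None, 10, 10, None],
    [None, 14, 7, None],
]
row_totals = [20, 55, 25, 35]
col_totals = [50, 35, 30, 20]
leaked = determined_cells(table, row_totals, col_totals)
print(leaked)
```

In this hypothetical pattern, adding the first two row equations and subtracting the second and third column equations cancels every unknown except the last cell of the first row, which must therefore equal 1.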
This example shows that selection of cells for complementary suppression is more complicated than it would appear at first. Mathematical methods of linear programming are used to automatically select cells for complementary suppression and also to audit a proposed suppression pattern (e.g., Table 5) to see if it provides the required protection. Chapter IV provides more detail on the mathematical issues of selecting complementary cells and auditing suppression patterns.

Table 6 shows our table with a system of suppressed cells that does provide adequate protection for the sensitive cells. However, Table 6 illustrates one of the problems with suppression. Out of a total of 16 interior cells, only 7 cells are published, while 9 are suppressed.

C.3.b. Random Rounding

In order to reduce the amount of data loss which occurs with suppression, the U.S. Census Bureau has investigated alternative methods to protect sensitive cells in tables of frequencies. Perturbation methods such as random rounding and controlled rounding are examples of such alternatives. In random rounding, cell values are rounded, but instead of using standard rounding conventions a random decision is made as to whether they will be rounded up or down.

Because rounding is done separately for each cell in a table, the rows and columns do not necessarily add to the published row and column totals. In Table 7 the total for the first row is 20, but the sum of the values in the interior cells in the first row is 15. A table prepared using random rounding could lead the public to lose confidence in the numbers: at a minimum it looks as if the agency cannot add. The New Zealand Department of Statistics has used random rounding in its publications and this is one of the criticisms it has heard (George and Penny, 1987).
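The mechanics of random rounding can be sketched in a few lines. This is an illustration only, under assumptions that the text above does not specify: the rounding base is 5, and a value is rounded up with probability equal to its remainder divided by the base, a common choice that leaves the expected value unchanged. The function names and the example row are hypothetical.

```python
import random

def random_round(value, base=5, rng=random):
    """Round a non-negative count to a multiple of `base`, up or down at random.

    Assumption: round up with probability remainder / base, so the
    rounded value equals the original value on average.
    """
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple of the base
    lower = value - remainder
    return lower + base if rng.random() < remainder / base else lower

def random_round_row(cells, base=5, seed=None):
    """Round each interior cell and the row total independently."""
    rng = random.Random(seed)
    rounded_cells = [random_round(v, base, rng) for v in cells]
    rounded_total = random_round(sum(cells), base, rng)
    return rounded_cells, rounded_total

# Hypothetical row of counts whose true total is 20.
cells, total = random_round_row([15, 1, 3, 1], seed=1)
print(cells, total, "interior sum:", sum(cells))
```

Because each cell is rounded on its own, the printed interior sum may differ from the printed total, which is the additivity problem described above.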
C.3.c. Controlled Rounding

To solve the additivity problem, a procedure called controlled rounding was developed. It is a form of random rounding, but it is constrained to have the sum of the published entries in each row and column equal the appropriate published marginal totals. Linear programming methods are used to identify a controlled rounding for a table. There was considerable research into controlled rounding in the late 1970's and early 1980's, and controlled rounding was proposed for use with data from the 1990 Census (Greenberg, 1986). However, to date it has not been used by any federal statistical agency. Table 8 illustrates controlled rounding.

One disadvantage of controlled rounding is that it requires the use of specialized computer programs. At present these programs are not widely available. Another disadvantage is that controlled rounding solutions may not always exist for complex tables. These issues are discussed further in Chapters IV and VI.

C.3.d. Confidentiality Edit

The confidentiality edit is a new procedure developed by the U.S. Census Bureau to provide protection in data tables prepared from the 1990 Census (Griffin, Navarro, and Flores-Baez, 1989). There are two different approaches: one was used for the regular decennial Census data (the 100 percent data file); the other was used for the long-form of the Census which was filed by a sample of the population (the sample data file). Both approaches apply statistical disclosure limitation techniques to the microdata files before they are used to prepare tables. The adjusted files themselves are not released; they are used only to prepare tables.
First, for the 100 percent microdata file, the confidentiality edit involves "data swapping" or "switching" (Dalenius and Reiss, 1982; Navarro, Flores-Baez, and Thompson, 1988). The confidentiality edit proceeds as follows. First, take a sample of records from the microdata file. Second, find a match for these records in some other geographic region, matching on a specified set of important attributes. Third, swap all attributes on the matched records. For small blocks, the Census Bureau increases the sampling fraction to provide additional protection. After the microdata file has been treated in this way it can be used directly to prepare tables and no further disclosure analysis is needed.

Second, the sample data file already consists of data from only a sample of the population, and as noted previously, sampling provides confidentiality protection. Studies showed that this protection was sufficient except in small geographic regions. To provide additional protection in small geographic regions, one household was randomly selected and a sample of its data fields were blanked. These fields were replaced by imputed values. After the microdata file has been treated in this way it is used directly to prepare tables and no further disclosure analysis is needed.

To illustrate the confidentiality edit as applied to the 100 percent microdata file, we use fictitious records for the 20 individuals in county Alpha who contributed to Tables 4 through 8. Table 9 shows 5 variables for these individuals. Recall that the previous tables showed counts of individuals by county and education level of head of household. The purpose of the confidentiality edit is to provide disclosure protection to tables of frequency data. However, to achieve this, adjustments are made to the microdata file before the tables are created. The following steps are taken to apply the confidentiality edit.

1. Take a sample of records from the microdata file (say a 10% sample). Assume that records number 4 and 17 were selected as part of our 10% sample.

2. Since we need tables by county and education level, we find a match in some other county on the other variables: race, sex and income. (As a result of matching on race, sex and income, county totals for these variables will be unchanged by the swapping.) A match for record 4 (Pete) is found in county Beta. The match is with Alfonso, whose head of household has a very high education. Record 17 (Mike) is matched with George in county Delta, whose head of household has a medium education. In addition, some records in the randomly selected 10% sample from other counties match records in county Alpha. One record from county Delta (June, with high education) matches with Virginia, record number 12. One record from county Gamma (Heather, with low education) matches with Nancy, record number 20.

3. After all matches are made, swap attributes on matched records. The adjusted microdata file after these attributes are swapped appears in Table 10.

4. Use the swapped data file directly to produce tables; see Table 11.

The confidentiality edit has a great advantage in that multidimensional tables can be prepared easily and the disclosure protection applied will always be consistent. A disadvantage is that it does not look as if disclosure protection has been applied.

D. Tables of Magnitude Data

Tables showing magnitude data have a unique set of disclosure problems. Magnitude data are generally nonnegative quantities reported in surveys or censuses of business establishments, farms or institutions. The distribution of these reported values is likely to be skewed, with a few entities having very large values.
Disclosure limitation in this case concentrates on making sure that the published data cannot be used to estimate too closely the values reported by the largest, most highly visible respondents. By protecting the largest values, we, in effect, protect all values.

For magnitude data it is less likely that sampling alone will provide disclosure protection, because most sample designs for economic surveys include a stratum of the larger volume entities which are selected with certainty. Thus, the units which are most visible because of their size do not receive any protection from sampling. For tables of magnitude data, rules called primary suppression rules or linear sensitivity measures have been developed to determine whether a given table cell could reveal individual respondent information. Such a cell is called a sensitive cell, and cannot be published.

The primary suppression rules most commonly used by government agencies to identify sensitive cells are the (n,k) rule, the p-percent rule and the pq rule. All are based on the desire to make it difficult for one respondent to estimate the value reported by another respondent too closely. The largest reported value is the most likely to be estimated accurately. Primary suppression rules can be applied to frequency data. However, since all respondents contribute the same value to a frequency count, the rules default to a threshold rule and the cell is sensitive if it has too few respondents. Primary suppression rules are discussed in more detail in Section VI.B.1.

Once sensitive cells have been identified, there are only two options: restructure the table and collapse cells until no sensitive cells remain, or apply cell suppression. With cell suppression, once the sensitive cells have been identified they are withheld from publication. These are called primary suppressions.
Other cells, called complementary suppressions, are selected and suppressed so that the sensitive cells cannot be derived by addition or subtraction from published marginal totals. Problems associated with cell suppression for tables of count data were illustrated in Section II.C.3.a. The same problems exist for tables of magnitude data.

An administrative way to avoid cell suppression is used by a number of agencies. They obtain written permission to publish a sensitive cell from the respondents that contribute to the cell. The written permission is called a "waiver" of the promise to protect sensitive cells. In this case, respondents are willing to accept the possibility that their data might be estimated closely from the published cell total.

E. Microdata

Information collected about establishments is primarily magnitude data. These data are likely to be highly skewed, and there are likely to be high visibility respondents that could easily be identified via other publicly available information. As a result there are virtually no public use microdata files released for establishment data. Exceptions are a microdata file consisting of survey data from the Commercial Building Energy Consumption Survey, which is provided by the Energy Information Administration, and two files from the 1987 Census of Agriculture provided by the Census Bureau. Disclosure protection is provided using the techniques described below.

It has long been recognized that it is difficult to protect a microdata set from disclosure because of the possibility of matching to outside data sources (Bethlehem, Keller and Pannekoek, 1990). Additionally, there are no accepted measures of disclosure risk for a microdata file, so there is no "standard" which can be applied to assure that protection is adequate. (This is a topic for which research is needed, as discussed in Chapter VII.) The methods for protection of microdata files described below are used by all agencies which provide public use data files.
To reduce the potential for disclosure, virtually all public use microdata files:

1. Include data from only a sample of the population,
2. Do not include obvious identifiers,
3. Limit geographic detail, and
4. Limit the number of variables on the file.

Additional methods used to disguise high visibility variables include:

1. Top or bottom-coding,
2. Recoding into intervals or rounding,
3. Adding or multiplying by random numbers (noise),
4. Swapping or rank swapping (also called switching),
5. Selecting records at random, blanking out selected variables and imputing for them (also called blank and impute),
6. Aggregating across small groups of respondents and replacing one individual's reported value with the average (also called blurring).

These will be illustrated with the fictitious example we used in the previous section.

E.1. Sampling, Removing Identifiers and Limiting Geographic Detail

First: include only the data from a sample of the population. For this example we used a 10 percent sample of the population of delinquent children. Part of the population (county Alpha) was shown in Table 9.

Second: remove obvious identifiers. In this case the identifier is the first name of the child.

Third: consider the geographic detail. We decide that we cannot show individual county data for a county with fewer than 30 delinquent children in the population. The data from Table 4 show that we therefore cannot provide geographic detail for counties Alpha or Gamma. As a result, counties Alpha and Gamma are combined and shown as AlpGam in Table 12. These manipulations result in the fictitious microdata file shown in Table 12.

In this example we discussed only 5 variables for each child. One might imagine that these 5 were selected from a more complete data set including names of parents, names and numbers of siblings, age of child, ages of siblings, address, school and so on.
As more variables are included in a microdata file for each child, unique combinations of variables make it more likely that a specific child could be identified by a knowledgeable person. Limiting the number of variables to 5 makes such identification less likely.

E.2. High Visibility Variables

It may be that information available to others in the population could be used with the income data shown in Table 12 to uniquely identify the family of a delinquent child. For example, the employer of the head of household generally knows his or her exact salary. Such variables are called high visibility variables and require additional protection.

E.2.a. Top-coding, Bottom-coding, Recoding into Intervals

Large income values are top-coded by showing only that the income is greater than 100 thousand dollars per year. Small income values are bottom-coded by showing only that the income is less than 40 thousand dollars per year. Finally, income values are recoded by presenting income in 10 thousand dollar intervals. These manipulations yield the fictitious public use data file in Table 13. Top-coding, bottom-coding and recoding into intervals are among the most commonly used methods to protect high visibility variables in microdata files.

E.2.b. Adding Random Noise

An alternative method of disguising high visibility variables, such as income, is to add or multiply by random numbers. For example, in the above example, assume that we will add a normally distributed random variable with mean 0 and standard deviation 5 to income. Along with the sampling, removal of identifiers and limiting geographic detail, this might result in a microdata file such as Table 14. To produce this table, 14 random numbers were selected from the specified normal distribution, and were added to the income data in Table 12.
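The two treatments of income described in E.2.a and E.2.b can be sketched as follows. This is a minimal illustration, not the procedure used to build Tables 13 and 14: incomes are assumed to be whole numbers in thousands of dollars, an income of exactly 100 is treated as top-coded (the text only says "greater than 100"), and the function names, interval labels and sample values are hypothetical.

```python
import random

def recode_income(income, bottom=40, top=100, width=10):
    """Top-code, bottom-code, or recode an income (in thousands) into an interval.

    Assumption: incomes are integers; a value equal to `top` is top-coded.
    """
    if income < bottom:
        return f"<{bottom}"
    if income >= top:
        return f"{top}+"
    lower = (income // width) * width
    return f"{lower}-{lower + width - 1}"

def add_noise(incomes, sd=5, seed=None):
    """Add independent normal noise with mean 0 and standard deviation `sd`."""
    rng = random.Random(seed)
    return [x + rng.gauss(0, sd) for x in incomes]

# Hypothetical incomes in thousands of dollars.
incomes = [35, 47, 62, 99, 100, 140]
print([recode_income(x) for x in incomes])
print([round(x) for x in add_noise(incomes, sd=5, seed=0)])
```

Recoding loses detail in a predictable way, while added noise keeps a numeric value whose error has a known distribution; which is preferable depends on how the released file will be analyzed.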
E.2.c. Swapping or Rank Swapping

Swapping involves selecting a sample of the records, finding a match in the data base on a set of predetermined variables and swapping all other variables. Swapping (or switching) was illustrated as part of the confidentiality edit for tables of frequency data. In that example, records were identified from different counties which matched on race, sex and income, and the variables first name of child and household education were swapped. For purposes of providing additional protection to the income variable in a microdata file, we might choose instead to find a match in another county on household education and race and to swap the income variables.

Rank swapping provides a way of using continuous variables to define pairs of records for swapping. Instead of insisting that variables match (agree exactly), they are defined to be close based on their proximity to each other on a list sorted by the continuous variable. Records which are close in rank on the sorted variable are designated as pairs for swapping. Frequently in rank swapping, the variable used in the sort is the one that will be swapped.

E.2.d. Blank and Impute for Randomly Selected Records

The blank and impute method involves selecting a few records from the microdata file, blanking out selected variables and replacing them by imputed values. This technique is illustrated using data shown in Table 12. First, one record is selected at random from each publishable county: AlpGam, Beta and Delta. In the selected record the income value is replaced by an imputed value. If the randomly selected records are 2 in county AlpGam, 6 in county Beta and 13 in county Delta, the income value recorded in those records might be replaced by 63, 52 and 49 respectively.
These numbers are also fictitious, but you can imagine that imputed values were calculated as the average over all households in the county with the same race and education. Blank and impute was used as part of the confidentiality edit for tables of frequency data from the Census sample data files (containing information from the long form of the decennial Census).

E.2.e. Blurring

Blurring replaces a reported value by an average. There are many possible ways to implement blurring. Groups of records for averaging may be formed by matching on other variables or by sorting the variable of interest. The number of records in a group (whose data will be averaged) may be fixed or random. The average associated with a particular group may be assigned to all members of a group, or to the "middle" member (as in a moving average). It may be performed on more than one variable with different groupings for each variable.

In our example, we illustrate this technique by blurring the income data. In the complete microdata file we might match on important variables such as county, race and two education groups: (very high, high) and (medium, low). Then blurring could involve averaging households in each group, say two at a time. In county Alpha (see Table 9) this would mean that the household income for the group consisting of John and Sue would be replaced by the average of their incomes (139), the household income for the group consisting of Jim and Pete would be replaced by their average (82), and so on. After blurring, the data file would be subject to sampling, removal of identifiers, and limitation of geographic detail.

F. Summary

This chapter has described the standard methods of disclosure limitation used by federal statistical agencies to protect both tables and microdata. It has relied heavily on simple examples to illustrate the concepts.
The mathematical underpinnings of disclosure limitation in tables and microdata are reported in more detail in Chapters IV and V, respectively. Agency practices in disclosure limitation are described in Chapter III.

CHAPTER III

Current Federal Statistical Agency Practices

This chapter provides an overview of Federal agency policies, practices, and procedures for statistical disclosure limitation. Statistical disclosure limitation methods are applied by the agencies to limit the risk of disclosure of individual information when statistics are disseminated in tabular or microdata formats. Some of the statistical agencies conduct or support research on statistical disclosure limitation methods. Information on recent and current research is included in Chapter VII.

This review of agency practices is based on two sources. The first source is Jabine (1993b), a paper based in part on information provided by the statistical agencies in response to a request in 1990 by the Panel on Confidentiality and Data Access, Committee on National Statistics. Additional information for the Jabine paper was taken from an appendix to Working Paper 2.

The second source for this summary of agency practices was a late 1991 request by Hermann Habermann, Office of Management and Budget, to Heads of Statistical Agencies. Each agency was asked to provide, for use by a proposed ad hoc Committee on Disclosure Risk Analysis, a description of its current disclosure practices, standards, and research plans for tabular and microdata. Responses were received from 12 statistical agencies. Prior to publication, the agencies were asked to review this chapter and update any of their practices. Thus, the material in this chapter is current as of the publication date.
The first section of this chapter summarizes the disclosure limitation practices for each of the 12 largest Federal statistical agencies as shown in Statistical Programs of the United States Government: Fiscal Year 1993 (Office of Management and Budget). The agency summaries are followed by an overview of the current status of statistical disclosure limitation policies, practices, and procedures based on the available information. Specific methodologies and the state of software being used are discussed to the extent they were included in the individual agencies' responses.

A. Agency Summaries

A.1. Department of Agriculture

A.1.a. Economic Research Service (ERS)

ERS disclosure limitation practices are documented in the statement of "ERS Policy on Dissemination of Statistical Information," dated September 28, 1989. This statement provides that:

Estimates will not be published from sample surveys unless: (1) sufficient nonzero reports are received for the items in a given class or data cell to provide statistically valid results which are clearly free of disclosure of information about individual respondents. In all cases at least three observations must be available, although more restrictive rules may be applied to sensitive data, (2) the unexpanded data for any one respondent must represent less than 60 percent of the total that is being published, except when written permission is obtained from that respondent ...

The second condition is an application of the (n,k) concentration rule. In this instance (n,k) = (1, 0.6). Both conditions are applied to magnitude data, while the first condition also applies to counts.

Within ERS, access to unpublished, confidential data is controlled by the appropriate branch chief. Authorized users must sign confidentiality certification forms. Restrictions require that data be summarized so individual reports are not revealed.
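As a rough illustration, the two ERS conditions described above (a minimum of three nonzero reports, and an (n,k) = (1, 0.6) concentration rule) can be expressed as a short check. The sketch below is written in Python. The function name, the argument names, and the assumption that all reported values are nonnegative are illustrative only and do not come from the ERS policy statement; the waiver by written permission is not modeled.

```python
def ers_cell_publishable(unexpanded_values, n=1, k=0.6, min_reports=3):
    """Hypothetical check of a magnitude cell against the two ERS conditions.

    unexpanded_values: unweighted values reported by each respondent in the
    cell (assumed nonnegative for this sketch).
    """
    nonzero = [v for v in unexpanded_values if v != 0]

    # Condition (1): at least three nonzero observations.
    if len(nonzero) < min_reports:
        return False

    # Condition (2): the n largest respondents together must represent
    # less than the fraction k of the cell total ((n,k) = (1, 0.6) here).
    total = sum(nonzero)
    top_n = sum(sorted(nonzero, reverse=True)[:n])
    return top_n < k * total


print(ers_cell_publishable([50, 30, 20]))  # largest respondent is 50% of total
print(ers_cell_publishable([70, 20, 10]))  # largest respondent is 70% of total
print(ers_cell_publishable([55, 45]))      # only two nonzero reports
```

A cell that fails either condition would be suppressed or combined with other cells rather than published as is.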
ERS does not release public-use microdata. ERS will share data for statistical purposes with government agencies, universities, and other entities under cooperative agreements as described below for the National Agricultural Statistics Service (NASS). Requests of entities under cooperative agreements with ERS for tabulations of data that were originally collected by NASS are subject to NASS review.

A.1.b. National Agricultural Statistics Service (NASS)

Policy and Standards Memorandum (PSM) 12-89, dated July 12, 1989, outlines NASS policy for suppressing estimates and summary data to preserve confidentiality. PSM 7-90 (March 28, 1990) documents NASS policy on the release of unpublished summary data and estimates. In general, summary data and estimates may not be published if a nonzero value is based on information from fewer than three respondents or if the data for one respondent represent more than 60 percent of the published value. Thus NASS and ERS follow the same basic (n,k) concentration rule.

Suppressed data may be aggregated to a higher level, but steps are defined to ensure that the suppressed data cannot be reconstructed from the published materials. This is particularly important when the same data are published at various time intervals such as monthly, quarterly, and yearly. These rules often mean that geographic subdivisions must be combined to avoid revealing information about individual operations. Data for many counties cannot be published for some crop and livestock items, and State level data must be suppressed in other situations.

NASS uses a procedure for obtaining waivers from respondents which permits publication of values that otherwise would be suppressed. Written approval must be obtained and updated periodically. If waivers cannot be obtained, data are not published or cells are combined to limit disclosure.
NASS generally publishes magnitude data only, but the same requirement of three respondents is applied when tables of counts are generated by special request or for reimbursable surveys done for other agencies.

NASS does not release public-use microdata. PSM 4-90 (Confidentiality of Information), PSM 5-89 (Privacy Act of 1974), and PSM 6-90 (Access to Lists and Individual Reports) cover NASS policies for microdata protection. Almost all NASS surveys depend upon voluntary reporting by farmers and business firms. This cooperation is secured by a statutory pledge that individual reports will be kept confidential and used only for statistical purposes.

While it is NASS policy not to release microdata files, NASS and ERS have developed an arrangement for sharing individual farm data from the annual Farm Costs and Returns Survey which protects confidentiality while permitting some limited access by outside researchers. The data reside in an ERS data base under security measures approved by NASS. All ERS employees with access to the data base operate under the same confidentiality regulations as NASS employees. Researchers wishing access to this data base must have their requests approved by NASS and come to the ERS offices to access the data under confidentiality and security regulations.

USDA's Office of the General Counsel (OGC) has recently (February 1993) reviewed the laws and regulations pertaining to the disclosure of confidential NASS data. In summary, OGC's interpretation of the statutes allows data sharing with other agencies, universities, and private entities as long as it enhances the mission of USDA and is through a contract, cooperative agreement, cost-reimbursement agreement, or memorandum of understanding. Such entities or individuals receiving the data are also bound by the statutes restricting unlawful use and disclosure of the data.
NASS's current policy is that data sharing for statistical purposes will occur on a case-by-case basis as needed to address an approved specified USDA or public need.

To the extent future uses of data are known at the time of data collection, they can be explained to the respondent and permission requested to permit the data to be shared among various users. This permission is requested in writing with a release form signed by each respondent.

NASS will also work with researchers and others to provide as much data for analysis as possible. Some data requests do not require individual reports, and NASS can often publish additional summary data which are a benefit to the agricultural sector.

A.2. Department of Commerce

A.2.a. Bureau of Economic Analysis (BEA)

BEA standards for disclosure limitation for tabular data are determined by its individual divisions. The International Investment Division is one of the few divisions in BEA, and the major one, that collects data directly from U.S. business enterprises. It collects data on USDIA (U.S. Direct Investment Abroad), FDIUS (Foreign Direct Investment in the United States), and international services trade by means of statistical surveys. The surveys are mandatory and the data in them are held strictly confidential under the International Investment and Trade in Services Survey Act (P.L. 94-472, as amended).

A standards statement, "International Investment Division Primary Suppression Rules," covers the Division's statistical disclosure limitation procedures for aggregate data from its surveys. This statement provides that:

The general rule for primary suppression involves looking at the data for the top reporter, the second reporter, and all other reporters in a given cell. If the data for all but the top two reporters add up to no more than some given percent of the top reporter's data, the cell is a primary suppression.
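The general rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and record layout are hypothetical, and the percentage used in the example (10) is made up, since the Division does not publish the actual value. The Division's statement, as summarized in this section, also calls for aggregating multiple records from one reporter and for using absolute values when an item can be negative; both steps are included here.

```python
from collections import defaultdict


def is_primary_suppression(records, p_percent):
    """Hypothetical sketch of the general primary suppression rule.

    records: iterable of (reporter_id, value) pairs for one cell.
    p_percent: the "given percent" of the top reporter's data; the real
    value is not published, so any number passed here is an assumption.
    """
    # Aggregate to the reporter level before testing the cell.
    totals = defaultdict(float)
    for reporter_id, value in records:
        totals[reporter_id] += value

    # Use absolute values so items that can be negative are handled.
    magnitudes = sorted((abs(v) for v in totals.values()), reverse=True)
    if not magnitudes:
        return False  # an empty cell has nothing to protect

    top = magnitudes[0]
    remainder = sum(magnitudes[2:])  # everyone except the top two reporters
    return remainder <= (p_percent / 100.0) * top


cell = [("A", 100.0), ("B", 50.0), ("C", 5.0), ("D", 3.0)]
print(is_primary_suppression(cell, p_percent=10))  # remainder 8 vs. 10% of 100
```

With only one or two reporters the remainder is zero, so the cell is always flagged, which matches the automatic suppression of such cells.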
The general rule is an application of the p-percent rule with no coalitions (c=1). This rule protects the top reporter from the second reporter, protects the second reporter from the top reporter, and automatically suppresses any cell with only one or two reporters. The value of that percent and certain other details of the procedures are not published "because information on the exact form of the suppression rules can allow users to deduce suppressed information for cells in published tables."

When applying the general rule, absolute values are used if the data item can be negative (for example, net income). If a reporter has more than one data record in the same cell, these records are aggregated and suppression is done at the reporter level. In primary suppression, only reported data are counted in obtaining totals for the top two reporters; data estimated for any reason are not treated as confidential.

The statement includes several "special rules" covering rounded estimates, country and industry aggregates, key item suppression (looking at a set of related items as a group and suppressing all items if the key item is suppressed), and the treatment of time series data.

Complementary suppression is done partly by computer and partly by human intervention. All tables are checked by computer to see if the complementary suppression is adequate. Limited applications of linear programming techniques have been used to refine the secondary suppression methods and help redesign tables to lessen the potential of disclosure.

The International Investment Division publishes some tables of counts. These are counts pertaining to establishments and are not considered sensitive.

Under the International Investment and Trade in Services Survey Act, limited sharing of data with other Federal agencies, and with consultants and contractors of BEA, is permitted, but only for statistical purposes and only to perform specific functions under the Act.
Beyond this limited sharing, BEA does not make its microdata on international investment and services available to outsiders. Confidentiality practices and procedures with respect to the data are clearly specified and strictly upheld.

According to Jabine (1993b), "BEA's Regional Measurement Division publishes estimates of local area personal income by major source. Quarterly data on wages and salaries paid by county are obtained from BLS's Federal/state ES-202 Program and BEA is obliged to follow statistical disclosure limitation rules that satisfy BLS requirements."

Statistical disclosure limitation procedures used are a combination of suppression and combining data (such as for two or more counties or industries). Primary cell suppressions are identified by combining a systematic roll up of three types of payments to earnings and a dominant-cell suppression test of wages as a specified percentage of earnings. Two additional types of complementary cell suppressions are necessary to prevent the derivation (indirect disclosure) of primary disclosure cells. The first type is the suppression of additional industry cells to prevent indirect disclosure of the primary disclosure cells through subtraction from higher level industry totals. The second type is the suppression of additional geographic units for the same industry to prevent indirect disclosure through subtraction from higher level geographic totals. These suppressions are determined using computer programs to impose a set of rules and priorities on a multi-dimensional matrix consisting of industry and county cells for each state and region.

A.2.b. Bureau of the Census (BOC)

According to Jabine (1993b): "The Census Bureau's past and current practices in the application of statistical disclosure limitation techniques and its research and development work in this area cover a long period and are well documented.
As a pioneer in the release of public-use microdata sets, Census had to develop suitable statistical disclosure limitation techniques for this mode of data release. It would probably be fair to say that the Census Bureau's practices have provided a model for other statistical agencies as the latter have become more aware of the need to protect the confidentiality of individually identifiable information when releasing tabulations and microdata sets."

The Census Bureau's current and recent statistical disclosure limitation practices and research are summarized in two papers by Greenberg (1990a, 1990b). Disclosure limitation procedures for frequency count tables from the 1990 Census of Population are described by Griffin, Navarro and Flores-Baez (1989). Earlier perspectives on the Census Bureau's statistical disclosure limitation practices are provided by Cox et al. (1985) and Barabba and Kaplan (1975). Many other references will be found in these five papers.

For tabular data from the 1992 Census of Agriculture, the Census Bureau will use the p-percent rule and will not publish the value of p. For other economic censuses, the Census Bureau uses the (n,k) rule and will not publish the values of n or k. Sensitive cells are suppressed and complementary suppressions are identified by using network flow methodology for two-dimensional tables (see Chapter IV). For the three-dimensional tables from the 1992 Economic Censuses, the Bureau will be using an iterative approach based on a series of two-dimensional networks, primarily because the alternatives (linear programming methods) are too slow for the large amount of data involved.

For all demographic tabular data, other than data from the decennial census, disclosure analysis is not needed because of 1) very small sampling fractions; 2) weighted counts; and 3) very large categories (geographic and other).
For economic magnitude data, most surveys do not need disclosure analysis for the above reasons. For the economic censuses, data suppression is used. However, even if some magnitude data are suppressed, all counts are published, even for cells of 1 and 2 units.

Microdata files are standard products with unrestricted use from all Census Bureau demographic surveys. In February 1981, the Census Bureau established a formal Microdata Review Panel and was the first agency to do so. (For more details on methods used by the Panel, see Greenberg (1985).) Approval of the Panel is required for each release of a microdata file (even files released every year must be approved). In February 1994, the Census Bureau added two outside advisory members to the Panel, a privacy representative and a data user representative. One criterion used by the Panel is that geographic codes included in microdata sets should not identify areas with less than 100,000 persons in the sampling frame, except for SIPP data (Survey of Income and Program Participation), for which 250,000 is used. This cutoff was adopted in 1981; previously a figure of 250,000 had been used for all data. Where businesses are concerned, the presence of dominant establishments on the files virtually precludes the release of any useful microdata.

The Census Bureau has legislative authority to conduct surveys for other agencies under either Title 13 or Title 15 U.S.C. Title 13 is the statute that describes the statistical mission of the Census Bureau. This statute also contains the strict confidentiality provisions that pertain to the collection of data from the decennial census of housing and population as well as the quinquennial censuses of agriculture, etc. A sponsoring agency with a reimbursable agreement under Title 13 can use samples and sampling frames developed for the various Title 13 surveys and censuses. This would save the sponsor the extra expense that might be incurred if it had to develop its own sampling frame.
However, the data released to an agency that sponsors a reimbursable survey under Title 13 are subject to the confidentiality provisions of any Census Bureau public-use microdata file; for example, the Census Bureau will not release identifiable microdata nor small area data. The situation under Title 15 is quite different. In conducting surveys under Title 15, the Census Bureau may release identifiable information, as well as small area data, to sponsors. However, samples must be drawn from sources other than the surveys and censuses covered by Title 13. If the sponsoring agency furnishes the frame, then the data are collected under Title 15 and the sponsoring agency's confidentiality rules apply.

A.3. Department of Education: National Center for Education Statistics (NCES)

As stated in NCES standard IV-01-91, Standard for Maintaining Confidentiality: "In reporting on surveys and preparing public-use data tapes, the goal is to have an acceptably low probability of identifying individual respondents." The standard recognizes that it is not possible to reduce this probability to zero.

The specific requirement for reports is that publication cells be based on at least three unweighted observations and that subsequent tabulations (such as cross tabulations) must not provide additional information which would disclose individual identities. For percentages, there must be three observations in the numerator. However, in fact the issue is largely moot at NCES since all published tables for which disclosure problems might exist are typically based on sample data. For this situation the rule of three or more is superseded by the rule of thirty or more; that is, the minimum cell size is driven by statistical (variance) considerations.
For public-use microdata tapes, consideration is given to any proposed variables that are unusual (such as very high salaries) and to data sources that may be available in the public or private sectors for matching purposes. Further details are documented in NCES's Policies and Procedures for Public Release Data.

Public-use microdata tapes must undergo a disclosure analysis. A Disclosure Review Board was established in 1989 following passage of the 1988 Hawkins-Stafford Amendment, which emphasized the need for NCES to follow disclosure limitation practices for tabulations and microdata files. The Board reviews all disclosure analyses and makes recommendations to the Commissioner of NCES concerning public release of microdata. The Board is required to "...take into consideration information such as resources needed in order to disclose individually identifiable information, age of the data, accessibility of external files, detail and specificity of the data, and reliability and completeness of any external files."

The NCES has pioneered in the release of a new data product: a data base system combined with a spreadsheet program. The user may request tables to be constructed from many variables. The data base system accesses the respondent-level data (which are stored without identifiers in a protected format and result from sample surveys) to construct these custom tables. The only access to the respondent-level data is through the spreadsheet program. The user does not have a password or other special device to unlock the hidden respondent-level data. The software presents only weighted totals in tables and automatically tests to assure that no fewer than 30 respondents contribute to a cell (an NCES standard for data availability).

The first release of the protected data base product was for the NCES National Survey of Postsecondary Faculty, which was made available to users on diskette.
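The cell test performed by the protected data base product, as described above, can be sketched as follows. This Python fragment is only a guess at the general idea (weighted totals, with a minimum of 30 contributing respondents); it is not the NCES software, and the function name, record layout, and the use of None to mark a withheld cell are all assumptions made for illustration.

```python
from collections import defaultdict


def weighted_table(records, min_respondents=30):
    """Hypothetical sketch: build weighted cell totals and withhold small cells.

    records: iterable of (cell_key, weight, value) tuples, one per respondent.
    Returns a dict mapping each cell key to its weighted total, or to None
    when fewer than min_respondents respondents contribute to the cell.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for cell_key, weight, value in records:
        counts[cell_key] += 1                  # unweighted respondent count
        totals[cell_key] += weight * value     # only weighted totals are shown

    return {
        key: (totals[key] if counts[key] >= min_respondents else None)
        for key in counts
    }


sample = [("A", 2.0, 1.0)] * 30 + [("B", 2.0, 1.0)] * 29
print(weighted_table(sample))  # cell "B" has only 29 respondents
```

Because the user sees only the returned table, never the respondent-level records, the minimum-count test is applied before anything is displayed.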
In 1994 a number of NCES sample surveys are being made available in a CD-ROM data base system. This is an updated version of the original diskette system mentioned above. The CD-ROM implementation is more secure, faster and easier to use.

The NCES Microdata Review Board evaluated the data protection capabilities of these products and determined that they provided the required protection. The Board believed that the danger of identification of a respondent's data via multiple queries of the data base was minimal because only weighted data are presented in the tables, and no fewer than 30 respondents contribute to a published cell total.

A.4. Department of Energy: Energy Information Administration (EIA)

EIA standard 88-05-06, "Nondisclosure of Company Identifiable Data in Aggregate Cells," appears in the Energy Information Administration Standards Manual (April 1989). Nonzero value data cells must be based on three or more respondents. The primary suppression rule is the pq rule, alone or in conjunction with some other subadditive rule. Values of pq (an input sensitivity parameter representing the maximum permissible gain in information when one company uses the published cell total and its own value to create better estimates of its competitors' values) selected for specific surveys are not published and are considered confidential. Complementary suppression is also applied to other cells to assure that the sensitive value cannot be reconstructed from published data. The Standards Manual includes a separate section with guidelines for implementation of the pq rule. Guidelines are included for situations where all values are negative; some data are imputed; published values are net values (the difference between positive numbers); and the published values are weighted averages (such as volume weighted prices).
These guidelines have been augmented by other agencies' practices and appear as a Technical Note to this chapter.

An alternative approach, pursued by managers of a number of EIA surveys from which data were published without disclosure limitation protection for many years, was to use a Federal Register Notice to announce EIA's intention to continue to publish these tables without disclosure limitation protection. The Notice pointed out that the result might be that a knowledgeable user could estimate an individual respondent's data.

For most EIA surveys that use the pq rule, complementary suppressions are selected manually. One survey system that publishes complex tables makes use of software designed particularly for that survey to select complementary suppressions. It assures that there are at least two suppressed cells in each dimension, and that the cells selected are those of lesser importance to data users.

EIA does not have a standard to address tables of frequency data. However, it appears that there are only two routine publications of frequency data in EIA tables, the Household Characteristics publication of the Residential Energy Consumption Survey (RECS) and the Building Characteristics publication of the Commercial Building Energy Consumption Survey (CBECS). In both publications cells are suppressed for accuracy reasons, not for disclosure reasons. For the first publication, cell values are suppressed if there are fewer than 10 respondents or the Relative Standard Errors (RSEs) are 50 percent or greater. For the second publication, cell values are suppressed if there are fewer than 20 respondents or the RSEs are 50 percent or greater. No complementary suppression is used.

EIA does not have a standard for statistical disclosure limitation techniques for microdata files. The only microdata files released by EIA are for RECS and CBECS.
In these files, various standard statistical disclosure limitation procedures are used to protect the confidentiality of data from individual households and buildings. These procedures include: eliminating identifiers, limiting geographic detail, omitting or collapsing data items, top-coding, bottom-coding, interval-coding, rounding, substituting weighted average numbers (blurring), and introducing noise.

A.5. Department of Health and Human Services

A.5.a. National Center for Health Statistics (NCHS)

NCHS statistical disclosure limitation techniques are presented in the NCHS Staff Manual on Confidentiality (September 1984), Section 10, "Avoiding Inadvertent Disclosures in Published Data," and Section 11, "Avoiding Inadvertent Disclosures Through Release of Microdata Tapes." No magnitude data figures should be based on fewer than three cases, and a (1, 0.6) (n,k) rule is used. Jabine (1993b) points out that "the guidelines allow analysts to take into account the sensitivity and the external availability of the data to be published, as well as the effects of nonresponse and response errors and small sampling fractions in making it more difficult to identify individuals." In almost all survey reports, no low level geographic data are shown, substantially reducing the chance of inadvertent disclosure.

The NCHS staff manual states that for tables of frequency data a) "in no table should all cases of any line or column be found in a single cell"; and b) "in no case should the total figure for a line or column of a cross-tabulation be less than 3". The acceptable ways to solve the problem (for either tables of frequency data or tables of magnitude data) are to combine rows or columns, or to use cell suppression (plus complementary suppression).
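The two frequency-table conditions quoted from the NCHS staff manual lend themselves to a mechanical check. The Python sketch below scans every line (row) and column of a two-way table of counts and reports which ones violate condition a) or b). The function name, the table layout (a list of rows), and the wording of the messages are assumptions made for illustration and are not taken from the manual.

```python
def nchs_frequency_table_problems(table, min_total=3):
    """Hypothetical check of a two-way count table against conditions a) and b).

    table: list of rows, each a list of nonnegative integer counts.
    Returns a list of (kind, index, reason) tuples, rows first, then columns.
    """
    rows = [list(r) for r in table]
    cols = [list(c) for c in zip(*rows)]

    problems = []
    for kind, lines in (("row", rows), ("column", cols)):
        for idx, line in enumerate(lines):
            total = sum(line)
            # Condition b): line or column total must not be less than 3.
            if total < min_total:
                problems.append((kind, idx, "total below minimum"))
            # Condition a): all cases must not fall in a single cell.
            if total > 0 and max(line) == total:
                problems.append((kind, idx, "all cases in one cell"))
    return problems


counts = [
    [5, 0, 0],
    [2, 3, 4],
    [1, 1, 0],
]
for problem in nchs_frequency_table_problems(counts):
    print(problem)
```

Any row or column reported here would then be handled by one of the remedies named in the text: combining rows or columns, or cell suppression with complementary suppression.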
The above rules apply only to census surveys. For NCHS's other data, which come from sample surveys, the general policy is that "the usual rules precluding publication of sample estimates that do not have a reasonably small relative standard error should prevent any disclosures from occurring in tabulations from sample data."

It is NCHS policy to make microdata files available to the scientific community so that additional analyses can be made for the country's benefit. The manual contains rules that apply to all microdata tapes released which contain any information about individuals or establishments, except where the data supplier was told prior to providing the information that the data would be made public. Detailed information that could identify individuals (for example, date of birth) should not be included. Geographic places and characteristics of areas with less than 100,000 people are not to be identified. Information on the drawing of the sample which could identify data subjects should not be included. All new microdata sets must be reviewed for confidentiality issues and approved for release by the Director, Deputy Director, or Assistant to the Director, NCHS.

A.5.b. Social Security Administration (SSA)

SSA basic rules are from a 1977 document, "Guidelines for Preventing Disclosure in Tabulations of Program Data," published in Working Paper 2. A threshold rule is used in many cases. In general, the rule is 5 or more respondents for a marginal cell. For more sensitive data, 3 or more respondents for all cells may be required. IRS rules are applied for publications based on IRS data.

The SSA guidelines established in 1977 are:

a) No tabulation should be released showing distributions by age, earnings or benefits in which the individuals (or beneficiary units, where applicable) in any group can be identified to
(1) an age interval of 5 years or less.
(2) an earnings interval of less than $1000.
(3) a benefit interval of less than $50.

b) For distribution by variables other than age, earnings and benefits, no tabulation should be released in which a group total is equal to one of its detail cells. Some exceptions to this rule may be made on a case-by-case basis when the detail cell in question includes individuals in more than one broad category.

c) The basic rule does not prohibit empty cells as long as there are 2 or more non-empty cells corresponding to a marginal total, nor does it prohibit detail cells with only one person. However, additional restrictions (see below) should be applied whenever the detailed classifications are based on sensitive information. The same restrictions should be applied to non-sensitive data if it can be readily done and does not place serious limitations on the uses of the tabulations. Additional restrictions may include one or more of the following:
(1) No empty cells. An empty cell tells the user that an individual included in the marginal total is not in the class represented by the empty cell.
(2) No cells with one person. An individual included in a one-person cell will know that no one else included in the marginal is a member of that cell.

SSA mentions that ways of avoiding disclosure include a) suppression and grouping of data and b) introduction of error (for example, random rounding). In 1978 the agency tested a program for random rounding of individual tabulation cells in its semi-annual tabulations of Supplemental Security Income State and County data. Although SSA considered random rounding and/or controlled rounding, it decided not to use either. SSA did not think that rounding provided sufficient protection, and feared that the data would be less useful than with suppression or combining data. Thus, its typical method of dealing with cells that represent disclosure is through suppression and grouping of data.
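For readers unfamiliar with random rounding of individual cells, the following Python sketch shows one common unbiased variant: each count is rounded down or up to a multiple of a base, and the probability of rounding up is proportional to the remainder. The details of the program SSA tested in 1978 are not given here, so the base of 5, the function name, and the choice of this particular variant are assumptions for illustration only.

```python
import random


def random_round(count, base=5, rng=random):
    """Hypothetical sketch of unbiased random rounding of one cell count.

    A count that is already a multiple of base is left unchanged. Otherwise
    it is rounded up with probability remainder/base and down otherwise, so
    the expected value of the rounded count equals the original count.
    """
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


rng = random.Random(1994)  # fixed seed so the illustration is repeatable
cells = [1, 2, 7, 10, 13]
print([random_round(c, base=5, rng=rng) for c in cells])
```

Because each cell is rounded independently, the rounded detail cells need not add to the rounded marginal totals.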
One example of SSA's practices is from "Earnings and Employment Data for Wage and Salary Workers Covered Under Social Security by State and County, 1985", in which SSA states that it does not show table cells with fewer than 3 sample cases at the State level and fewer than 10 sample cases at the county level, to protect the privacy of the worker. These are IRS rules and are applied because the data come from IRS.

Standards for microdata protection are documented in an article by Alexander and Jabine (1978). SSA's basic policy is to make microdata without identifiers as widely available as possible, subject only to necessary legal and operational constraints. SSA has adopted a two-tier system for the release of microdata files with identifiers removed. Designated as public-use files are those microdata files for which, in SSA's judgment, virtually no chance exists that users will be able to identify specific individuals and obtain additional information about them from the records on the file. No restrictions are made on the uses of such files. Typically the public-use files are based on national samples with small ... sampling fractions, and the files contain no geographic codes or at most regional and/or size of place identifiers. Those microdata files considered as carrying a disclosure risk greater than is acceptable for a public-use file are released only under restricted use conditions set forth in user agreements, including the purposes for which the data may be used.

A.6. Department of Justice: Bureau of Justice Statistics (BJS)

Cells with fewer than 10 observations are not displayed in published tables. Display of geographic data is limited by Census Bureau Title 13 restrictions for those data collected for BJS by the Census Bureau. Published tables may further limit identifiability by presenting quantifiable classification variables (such as age and years of education) in aggregated ranges.
Cell and marginal entries may also be restricted to rates, percentages, and weighted counts.

Standards for microdata protection are incorporated in BJS enabling legislation. In addition to BJS statutes, the release of all data collected by the Census Bureau for BJS is further restricted by Title 13 microdata restrictions. Individual identifiers are routinely stripped from all other microdata files before they are released for public use.

A.7. Department of Labor: Bureau of Labor Statistics (BLS)

Commissioner's Order 3-93, "The Confidential Nature of BLS Records," dated August 18, 1993, contains BLS's policy on the confidential data it collects. One of the requirements is that:

9e. Publications shall be prepared in such a way that they will not reveal the identity of any specific respondent and, to the knowledge of the preparers, will not allow the data of any specific respondent to be imputed from the published information.

A subsequent provision allows for exceptions under conditions of informed consent and requires prior authorization of the Commissioner before such an informed consent provision is used (for two programs this authority is delegated to specific Associate Commissioners).

The statistical methods used to limit disclosure vary by program. For tables, the most commonly used procedure has two steps: the threshold rule, followed by the (n,k) concentration rule. For example, the BLS collective bargaining program, a census of all collective bargaining agreements covering 1,000 workers or more, requires that (1) each cell must have three or more units and (2) no unit can account for more than 50 percent of the total employment for that cell. The ES-202 program, a census of monthly employment and quarterly wage information from Unemployment Insurance filings, uses a threshold rule that requires three or more establishments and a concentration rule of (1,0.80).
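The two-step procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BLS's actual implementation: the function names, the list-of-contributions input, and the treatment of an empty or zero-total cell as not publishable are all hypothetical choices made for the example. A cell passes if it has at least `min_units` contributors and its `n` largest contributors together account for no more than a share `k` of the cell total.

```python
def passes_threshold(contributions, min_units=3):
    """Threshold rule: the cell must contain at least `min_units` units."""
    return len(contributions) >= min_units


def passes_concentration(contributions, n=1, k=0.5):
    """(n,k) concentration rule: the n largest units together may not
    account for more than a share k of the cell total."""
    total = sum(contributions)
    if total <= 0:
        # Assumption for this sketch: a cell with no positive total
        # is treated as not publishable.
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return top_n <= k * total


def is_publishable(contributions, min_units=3, n=1, k=0.5):
    """Apply the threshold rule first, then the concentration rule."""
    return (passes_threshold(contributions, min_units)
            and passes_concentration(contributions, n, k))


# Hypothetical employment counts for three cells.
balanced = [400, 350, 300]   # largest unit holds about 38 percent
dominated = [900, 50, 50]    # largest unit holds 90 percent
too_small = [500, 500]       # only two units

print(is_publishable(balanced))           # three units, no dominance
print(is_publishable(dominated))          # fails the (1, 0.5) rule
print(is_publishable(too_small))          # fails the threshold rule
print(is_publishable([700, 200, 100], k=0.8))  # passes under (1, 0.80)
```

With the parameters quoted in the text, the collective bargaining example corresponds to `min_units=3, n=1, k=0.5` and the ES-202 example to `min_units=3, n=1, k=0.8`.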
In general, the values of k range from 0.5 to 0.8. In a few cases, a two-step rule is used: an (n,k) rule for a single establishment is followed by an (n,k) rule for two establishments.

Several wage and compensation statistics programs use a more complex approach that combines disclosure limitation methods and a certain level of reliability before the estimate can be published. For instance, one such approach uses a threshold rule requiring that each estimate be comprised of at least three establishments (unweighted) and at least six employees (weighted). It then uses a (1,0.60) concentration rule where the single unit counted (n = 1) can be either a single establishment or a multi-establishment organization. Lastly, the reliability of the estimate is determined and if the estimate meets a certain criterion, then it can be published.

BLS releases very few public-use microdata files. Most of these microdata files contain data collected by the Bureau of the Census under an interagency agreement and Census' Title 13. For these surveys (Current Population Survey, Consumer Expenditure Survey, and four of the five surveys in the family of National Longitudinal Surveys) the Bureau of the Census determines the statistical disclosure limitation procedures that are used. Disclosure limitation methods used for the public-use microdata files containing data from the National Longitudinal Survey of Youth, collected under contract by Ohio State University, are similar to those used by the Bureau of the Census.

A.8. Department of the Treasury: Internal Revenue Service, Statistics of Income Division (IRS, SOI)

Chapter VI of the SOI Division Operating Manual (January 1985) specifies that "no cell in a tabulation at or above the state level will have a frequency of less than three or an amount based on a frequency of less than three." Data cells for areas below the state level, for example counties, require at least ten observations. Data cells considered sensitive are suppressed or combined with other cells.
Combined or deleted data are included in the corresponding column totals. SOI also documents its disclosure procedures in its publications, "Individual Income Tax Returns, 1989" and "Corporation Income Tax Returns, 1989." One example given (Individual Income Tax Returns, 1989) states that if a weighted frequency (the weighting frequency is obtained by dividing the population count of returns in a sample stratum by the number of sample returns for that stratum) is less than 3, the estimate and its corresponding amount are combined or deleted in order to avoid disclosure.

SOI makes available to the public a microdata file of a sample of individual taxpayers' returns (the Tax Model). The data must be issued in a form that protects the confidentiality of individual taxpayers. Several procedural changes were made in 1984 including: removing some data fields and codes, altering some codes, reducing the size of subgroups used for the blurring process, and subsampling high-income returns.

Jabine points out that "the SOI Division has sponsored research on statistical disclosure limitation techniques, notably the work by Nancy Spruill (1982, 1983) in the early 1980's, which was directed at the evaluation of masking procedures for business microdata. On the basis of her findings, the SOI released some microdata files for unincorporated businesses." Except for this and a few other instances, "the statistical agencies have not issued public-use microdata sets of establishment or company data, presumably because they judge that application of the statistical disclosure limitation procedures necessary to meet legal and ethical requirements would produce files of relatively little value to researchers. Therefore, access to such files continues to be almost entirely on a restricted basis."

A.9. Environmental Protection Agency (EPA)

EPA program offices are responsible for their own data collections.
The types and subjects of data collections are required by statutes and regulations and the need to conduct studies. Data confidentiality policies and procedures are required by specific Acts or are determined on a case-by-case basis. Individual program offices are responsible for data confidentiality and disclosure as described in the following examples.

The Office of Prevention, Pesticides and Toxic Substances (OPPT) collects confidential business information (CBI) for which there are disclosure avoidance requirements. These requirements come under the Toxic Substances Control Act (TSCA). Procedures are described in the CBI security manual.

An OPPT Branch that conducts surveys does not have a formal policy with respect to disclosure avoidance for non-CBI data. The primary issue regarding confidentiality for most of their data collection projects is protection of respondent name and other personal identification characteristics. Data collection contractors develop a coding scheme to ensure confidentiality of these data elements and all raw data remain in the possession of the contractor. Summary statistics are reported in final reports. If individual responses are listed in an appendix to a final report, identities are protected by using the contractor's coding scheme.

In the Pesticides Program, certain submitted or collected data are covered by the provisions of the Federal Insecticide, Fungicide and Rodenticide Act (FIFRA). The Act addresses the protection of CBI and even includes a provision for exemption from Freedom of Information Act disclosure for information that is accorded protection.

Two large scale surveys of EPA employees have taken place in the past five years under the aegis of intra-program task groups. In each survey, all employees of EPA in the Washington, D.C. area were surveyed. In each instance, a contractor was responsible for data collection, analysis and final report.
Data disclosure avoidance procedures were in place to ensure that individuals and specific small groups of individuals could not be identified and that their responses could not be disclosed. All returned questionnaires remained in the possession of the contractor. The data file was produced by the contractor and permanently remained in the contractor's possession. Each record was assigned a serial number and the employee name file was permanently separated from the survey data file. The final reports contained summary statistics and cross-tabulations. A minimum cell size standard was adopted to avoid the possibility of disclosure. Individual responses were not shown in the Appendix of the reports. A public-use data tape was produced for one of the surveys; it included a wide array of tabulations and cross-tabulations. Again, a minimum cell-size standard was used.

B. Summary

Most of the 12 agencies covered in this chapter have standards, guidelines, or formal review mechanisms that are designed to ensure that adequate disclosure analyses are performed and appropriate statistical disclosure limitation techniques are applied prior to release of tabulations and microdata. Standards and guidelines exhibit a wide range of specificity: some contain only one or two simple rules while others are much more detailed. Some agencies publish the parameter values they use, while others feel withholding the values provides additional protection to the data. Obviously, there is great diversity in policies, procedures, and practices among Federal agencies.

B.1. Magnitude and Frequency Data

Most standards or guidelines provide for minimum cell sizes and some type of concentration rule. Some agencies (for example, ERS, NASS, NCHS, and BLS) publish the values of the parameters they use in (n,k) concentration rules, whereas others do not. Minimum cell sizes of 3 are almost invariably used, because each member of a cell of size 2 could derive a specific value for the other member.
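The reasoning behind the minimum cell size of 3 can be made concrete with a short worked example. The Python sketch below is purely illustrative: the function name and the figures are hypothetical and do not come from any agency's data. It shows that a respondent who knows its own value can subtract it from a published two-member cell total to recover the other member's value exactly, whereas in a three-member cell the same subtraction yields only the combined value of the other two members.

```python
def remainder_after_own_value(cell_total, own_value):
    """What a respondent learns by subtracting its own contribution
    from a published cell total."""
    return cell_total - own_value


# Hypothetical cell with two respondents, A and B.
a, b = 400, 600
published_total = a + b

# Respondent A recovers B's value exactly.
print(remainder_after_own_value(published_total, a))  # 600

# Hypothetical cell with three respondents, A, B and C.
a, b, c = 400, 600, 250
published_total = a + b + c

# Respondent A learns only the sum of B and C, not either value.
print(remainder_after_own_value(published_total, a))  # 850
```

A third member does not guarantee protection when one respondent dominates the cell, which is why threshold rules are usually paired with a concentration rule.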
Most of the agencies that published their parameter values for concentration rules used a single set, with n = 1. Values of k ranged from 0.5 to 0.8. BLS uses the lower value of k in one of its programs and the upper value in another.

The most elaborate rules included in standards or guidelines were EIA's pq rule and BEA's and Census Bureau's related p-percent rules. They both have the property of subadditivity, and they give the disclosure analyst flexibility to specify how much gain in information about its competitors by an individual company is acceptable. Also, they provide a somewhat more satisfying rationale for what is being done than does the arbitrary selection of parameters for an (n,k) concentration rule.

One possible method for dealing with data cells that are dominated by one or two large respondents is to ask those respondents for permission to publish the cells, even though the cell would be suppressed or masked under the agency's normal statistical disclosure limitation procedures. Agencies including NASS, EIA, the Census Bureau, and some of the state agencies that cooperate with BLS in its Federal-state statistical programs, use this type of procedure for some surveys.

B.2. Microdata

Only about half of the agencies included in this review have established statistical disclosure limitation procedures for microdata. Some agencies pointed out that the procedures for surveys they sponsored were set by the Census Bureau's Microdata Review Board, because the surveys had been conducted for them under the Census Bureau's authority (Title 13). Major releasers of public-use microdata--Census, NCHS and more recently NCES--have all established formal procedures for review and approval of new microdata sets. As Jabine (1993b) wrote, "In general these procedures do not rely on parameter-driven rules like those used for tabulations.
Instead, they require judgments by reviewers that take into account factors such as: the availability of external files with comparable data, the resources that might be needed by an 'attacker' to identify individual units, the sensitivity of individual data items, the expected number of unique records in the file, the proportion of the study population included in the sample, the expected amount of error in the data, and the age of the data." Geography is an important factor. Census and NCHS specify that no geographic codes for areas with a sampling frame of less than 100,000 persons can be included in public-use data sets. If a file contains large numbers of variables, a higher cutoff may be used. The inclusion of local area characteristics, such as the mean income, population density and percent minority population of a census tract, is also limited by this requirement because if enough variables of this type are included, the local area can be uniquely identified. An interesting example of this latter problem was provided by EIA's Residential Energy Consumption Surveys, where the local weather information included in the microdata sets had to be masked to prevent disclosure of the geographic location of households included in the survey. Top-coding is commonly used to prevent disclosure of individuals or other units with extreme values in a distribution. Dollar cutoffs are established for items like income and