Federal Committee on Statistical Methodology
Office of Management and Budget

 

  Statistical Policy Working Paper 22 - Report on Statistical Disclosure Limitation Methodology


 
 
                  Statistical Policy
                   Working Paper 22

           Report on Statistical Disclosure
                 Limitation Methodology

                       Prepared by

      Subcommittee on Disclosure Limitation Methodology

        Federal Committee on Statistical Methodology

 

                  Statistical Policy Office

         Office of Information and Regulatory Affairs

               Office of Management and Budget

                            May 1994

 

              MEMBERS OF THE FEDERAL COMMITTEE ON

                   STATISTICAL METHODOLOGY

                            (May 1994)

 

                   Maria E. Gonzalez, Chair

                Office of Management and Budget

 

 Yvonne M. Bishop                  Daniel Melnick

 Energy Information                Substance Abuse and Mental

 Administration                    Health Services Administration

 

 Warren L. Buckler                 Robert P. Parker

 Social Security Administration    Bureau of Economic Analysis

 

 Cynthia Z.F. Clark                Charles P. Pautler, Jr.

 National Agricultural             Bureau of the Census

 Statistics Service

                                   David A. Pierce

 Steven Cohen                      Federal Reserve Board

 Administration for Health

 Policy and Research               Thomas J. Plewes

                                   Bureau of Labor Statistics

 Zahava D. Doering

 Smithsonian Institution           Wesley L. Schaible

                                   Bureau of Labor Statistics

 Roger A. Herriot

 National Center for               Fritz J. Scheuren

 Education Statistics              Internal Revenue Service

 

 C. Terry Ireland                  Monroe G. Sirken

 National Computer Security        National Center for

 Center                            Health Statistics

 

 Charles D. Jones                  Robert D. Tortora

 Bureau of the Census              Bureau of the Census

 

 Daniel Kasprzyk                   Alan R. Tupek

 National Center for               National Science Foundation

 Education Statistics

 

 Nancy Kirkendall

 Energy Information

 Administration

 

                          PREFACE

 

 

The Federal Committee on Statistical Methodology was organized by

OMB in 1975 to investigate issues of data quality affecting

Federal statistics.  Members of the committee, selected by OMB on

the basis of their individual expertise and interest in

statistical methods, serve in a personal capacity rather than as

agency representatives.  The committee conducts its work through

subcommittees that are organized to study particular issues.  The

subcommittees are open by invitation to Federal employees who

wish to participate.  Working papers are prepared by the

subcommittee members and reflect only their individual and

collective ideas.

 

The Subcommittee on Disclosure Limitation Methodology was formed

in 1992 to update the work presented in Statistical Policy

Working Paper 2, Report on Statistical Disclosure and Disclosure

Avoidance Techniques published in 1978.  The Report on

Statistical Disclosure Limitation Methodology, Statistical Policy

Working Paper 22, discusses both tables and microdata and

describes current practices of the principal Federal statistical

agencies.  The report includes a tutorial, guidelines, and

recommendations for good practice; recommendations for further

research; and an annotated bibliography.  The Subcommittee plans

to organize seminars and workshops in order to facilitate further

communication concerning disclosure limitation.

 

The Subcommittee on Disclosure Limitation Methodology was chaired

by Nancy Kirkendall of the Energy Information Administration,

Department of Energy.

 


 

         

 

                     Members of the

     Subcommittee on Disclosure Limitation Methodology

 

 

Nancy J. Kirkendall, Chairperson

Energy Information Administration

Department of Energy

 

William L. Arends

National Agricultural Statistics Service

Department of Agriculture

 

Lawrence H. Cox

Environmental Protection Agency

 

Virginia de Wolf

Bureau of Labor Statistics

Department of Labor

 

Arnold Gilbert

Bureau of Economic Analysis

Department of Commerce

 

Thomas B. Jabine

Committee on National Statistics

National Research Council

National Academy of Sciences

 

Mel Kollander

Environmental Protection Agency

 

Donald G. Marks

Department of Defense

 

Barry Nussbaum

Environmental Protection Agency

 

Laura V. Zayatz

Bureau of the Census

Department of Commerce

 


 

                        Acknowledgements

 

In early 1992, an ad hoc interagency committee on Disclosure Risk

Analysis was organized by Hermann Habermann, Office of Management and

Budget.  A subcommittee was formed to look at methodological issues

and to analyze results of an informal survey of agency practices.

That subcommittee became a Subcommittee of the Federal Committee on

Statistical Methodology (FCSM) in early 1993.  The Subcommittee would

like to thank Hermann Habermann for getting us started, and Maria Gonzalez

and the FCSM for adopting us and providing an audience for our paper.

 

Special thanks to Subcommittee member Laura Zayatz for her

participation during the last two years.  She helped to organize the

review of papers, contributed extensively to the annotated

bibliography and wrote the chapters on microdata and research issues

in this working paper.  In addition, she provided considerable

input to the discussion of disclosure limitation methods in tables.

Her knowledge of both theoretical and practical issues in disclosure

limitation was invaluable.

 

Special thanks to Subcommittee member William Arends for his participation

during the last two years.  He helped in the review of papers,

analyzed the results of the informal survey of agency practices,

contacted agencies to get more detailed information and wrote the

chapter on agency practices.  He and Mary Ann Higgs pulled together

information from all authors and prepared three drafts and the final

version of this working paper, making them all look polished and

professional.  He also arranged for printing the draft reports.

 

Tom Jabine, Ginny de Wolf and Larry Cox are relative newcomers to the

subcommittee.  Ginny joined in the fall of 1992, Tom and Larry in

early 1993.  Tom and Larry both participated in the work of the 1978

Subcommittee that prepared Statistical Policy Working Paper 2,

providing the current Subcommittee with valuable continuity.  Tom,

Ginny and Larry all contributed extensively to the introductory and

recommended practices chapters, and Tom provided thorough and

thoughtful review and comment on all chapters.  Larry provided

particularly helpful insights on the research chapter.

 

Special thanks to Tore Dalenius, another participant in the

preparation of Statistical Policy Working Paper 2, for his careful

review of this paper.  Thanks also to FCSM members Daniel Kasprzyk and

Cynthia Clark for their thorough reviews of multiple drafts.

 

The Subcommittee would like to acknowledge three people who

contributed to the annotated bibliography: Dorothy Wellington, who

retired from the Environmental Protection Agency; Russell Hudson,

Social Security Administration; and Bob Burton, National Center for

Education Statistics.

 

Finally, the Subcommittee owes a debt of gratitude to Mary Ann

A. Higgs of the National Agricultural Statistics Service for her

efforts in preparing the report.

 

Nancy Kirkendall chaired the subcommittee and wrote the primer and

tables chapters.

 


 

 

                      TABLE OF CONTENTS

 

                                                                Page

 

I. Introduction ....................................................1

 

A.  Subject and Purposes of This Report.............................1

B.  Some Definitions................................................2

   1. Confidentiality and Disclosure...............................2

   2. Tables and Microdata.........................................3

   3. Restricted Data and Restricted Access........................3

C.  Report of the Panel on Confidentiality and Data Access..........4

D.  Organization of the Report......................................4

E.  Underlying Themes of the Report.................................5

 

II. Statistical Disclosure Limitation: A Primer.....................6

 

  A.  Background...................................................6

  B.  Definitions..................................................7

      1. Tables of Magnitude Data Versus Tables of Frequency Data..7

      2. Table Dimensionality......................................8

      3. What is Disclosure?.......................................8

  C.  Tables of Counts or Frequencies.............................10

      1.  Sampling as a Statistical Disclosure Limitation Method..10

      2.  Special Rules...........................................10

      3.  The Threshold Rule......................................12

          a. Suppression..........................................12

          b. Random Rounding......................................14

          c. Controlled Rounding..................................15

          d. Confidentiality Edit.................................15

  D. Tables of Magnitude..........................................19

  E. Microdata....................................................20

     1.  Sampling, Removing Identifiers and Limiting Geographic Detail..21

     2.  High Visibility Variables................................21

         a. Top-coding, Bottom-Coding, Recoding into Intervals....21

         b. Adding Random Noise...................................23

         c. Swapping or Rank Swapping.............................23

         d. Blank and Impute for Randomly Selected Records........24

         e. Blurring..............................................24

  F. Summary......................................................24

 

 


 

 

III.  Current Federal Statistical Agency Practices.................25

 

     A. Agency Summaries..........................................25

        1.  Department of Agriculture.............................25

            a. Economic Research Service (ERS)....................25

            b. National Agricultural Statistics Service (NASS)....26

        2.  Department of Commerce................................27

            a. Bureau of Economic Analysis (BEA)..................27

            b. Bureau of the Census (BOC).........................29

        3.  Department of Education:

            National Center for Education Statistics (NCES).......31

        4.  Department of Energy:

            Energy Information Administration (EIA)...............32

        5.  Department of Health and Human Services...............33

            a. National Center for Health Statistics (NCHS).......33

            b. Social Security Administration (SSA)...............34

        6.  Department of Justice: Bureau of Justice Statistics (BJS)...35

        7.  Department of Labor: Bureau of Labor Statistics (BLS).35

        8.  Department of the Treasury: Internal Revenue Service,

            Statistics of Income Division (IRS, SOI)..............36

        9.  Environmental Protection Agency (EPA).................37

     B. Summary...................................................38

        1. Magnitude and Frequency Data...........................38

        2. Microdata..............................................39

 

IV. Methods for Tabular Data.......................................42

 

   A. Tables of Frequency Data....................................42

      1. Controlled Rounding......................................43

      2. The Confidentiality Edit.................................44

   B. Tables of Magnitude Data....................................44

      1.  Definition of Sensitive Cells...........................45

          a. The p-Percent Rule...................................46

          b. The pq Rule..........................................47

          c. The (n,k) Rule.......................................48

          d. The Relationship Between (n,k) and p-Percent or pq Rules..49

      2.  Complementary Suppression...............................50

          a. Audits of Proposed Complementary Suppression.........51

          b. Automatic Selection of Cells

             for Complementary Suppression........................52

      3.  Information in Parameter Values.........................54

   C. Technical Notes:

      Relationships Between Common Linear Sensitivity Measures....54

 


 

 

V. Methods for Public-Use Microdata Files..........................61

 

  A.  Disclosure Risk of Microdata................................62

      1.  Disclosure Risk and Intruders...........................62

      2.  Factors Contributing to Risk............................62

      3.  Factors that Naturally Decrease Risk....................63

  B.  Mathematical Methods of Addressing the Problem..............64

      1.  Proposed Measures of Risk...............................65

      2.  Methods of Reducing Risk by           

          Reducing the Amount of Information Released.............66

      3.  Methods of Reducing Risk by Disturbing Microdata........66

      4.  Methods of Analyzing Disturbed Microdata

          to Determine Usefulness.................................68

  C.  Necessary Procedures for Releasing Microdata Files..........68

      1.  Removal of Identifiers..................................68

      2.  Limiting Geographic Detail..............................69

      3.  Top-coding of Continuous High Visibility Variables......69

      4.  Precautions for Certain Types of Microdata..............70

          a.  Establishment Microdata.............................70

          b.  Longitudinal Microdata..............................70

          c.  Microdata Containing Administrative Data............70

          d.  Consideration of Potentially

              Matchable Files and Population Uniques..............71

   D. Stringent Methods of Limiting Disclosure Risk...............71

      1. Do Not Release the Microdata.............................71

      2. Recode Data to Eliminate Uniques.........................71

      3. Disturb Data to Prevent Matching to External Files.......71

   E. Conclusion

 

VI.  Recommended Practices..........................................73

 

    A. Introduction  .............................................73

    B. Recommendations............................................74

       1. General Recommendations for Tables and Microdata........74

       2. Tables of Frequency Count Data..........................76

       3. Tables of Magnitude Data................................76

       4. Microdata files.........................................78

 

 

 

 

 


 

 

VII.  Research Agenda...............................................79

 

     A. Microdata..................................................79

        1. Defining Disclosure.....................................79

        2. Effects of Disclosure Limitation on Data Quality and 

           Usefulness..............................................80

           a. Disturbing Data......................................80

           b. More Information about Recoded Values................80

        3.  Reidentification Issues................................80

        4.  Economic Microdata.....................................81

        5.  Longitudinal Microdata.................................81

        6.  Contextual Variable Data...............................81

        7.  Implementation Issues for Microdata....................81

     B. Tabular Data...............................................82

        1.  Effects of Disclosure Limitation on Data Quality and 

            Usefulness.............................................82

            a. Frequency Count Data................................82

            b. Magnitude Data......................................82

        2.  Near-Optimal Cell Suppression in Two-Dimensional Tables.83

        3.  Evaluating CONFID......................................83

        4.  Faster Software........................................83

        5.  Reducing Over-suppression..............................84

     C. Data Products Other Than Microdata and Tabular Data........84

        1. Database Systems........................................85

        2. Disclosure Risk in Analytic Reports.....................87

 

 

 

 

 


 

 

Appendices

 

A. Technical Notes: Extending Primary Suppression Rules

   to Other Common Situations......................................89

 

   1.  Background..................................................89

   2.  Extension of Disclosure Limitation Practices................89

       a.  Sample Survey Data......................................89

       b.  Tables Containing Imputed Data..........................90

       c.  Tables that Report Negative Values......................90

       d.  Tables Where Differences Between Positive Values are Reported 90

       e.  Tables Reporting Net Changes (that is, Difference

           Between Values Reported at Different Times).............91

       f.  Tables Reporting Weighted Averages......................91

       g.  Output from Statistical Models..........................91

   3.  Simplifying Procedures......................................91

       a.  Key Item Suppression....................................91 

       b.  Preliminary and Final Data..............................91

       c.  Time Series Data........................................92

                    

B. Government References...........................................93

 

C. Annotated Bibliography..........................................94

 

 

 

 

 

 

 


                                 CHAPTER I

 

                               Introduction

 

A. Subject and Purposes of This Report

 

Federal agencies and their contractors who release statistical tables

or microdata files are often required by law or established policies

to protect the confidentiality of individual information.  This

confidentiality requirement applies to releases of data to the general

public; it can also apply to releases to other agencies or even to

other units within the same agency.  The required protection is

achieved by the application of statistical disclosure limitation

procedures whose purpose is to ensure that the risk of disclosing

confidential information about identifiable persons, businesses or

other units will be very small.

 

In early 1992 the Statistical Policy Office of the Office of

Management and Budget convened an ad hoc interagency committee to

review and evaluate statistical disclosure limitation methods used by

federal statistical agencies and to develop recommendations for their

improvement.  Subsequently, the ad hoc committee became the

Subcommittee on Disclosure Limitation Methodology, operating under the

auspices of the Federal Committee on Statistical Methodology.  This is

the final report of the Subcommittee.

 

The Subcommittee's goals in preparing this report were to:

 

 o update a predecessor subcommittee's report on the same topic (Federal 

   Committee on Statistical Methodology, 1978);

 

 o describe and evaluate existing disclosure limitation methods for 

   tables and microdata files;

 

 o provide recommendations and guidelines for the selection and use 

   of effective disclosure limitation techniques;

 

 o encourage the development, sharing and use of software for the 

   applications of disclosure limitation methods; and

 

 o encourage research to develop improved statistical disclosure 

   limitation methods, especially for public-use microdata files.

 

The Subcommittee believes that every agency or unit within an agency

that releases statistical data should have the ability to select and

apply suitable disclosure limitation procedures to all the data it

releases.  Each agency should have one or more employees with a clear

understanding of the methods and the theory that underlies them.

 

 

Introduction                  -1-                Chapter I

Disclosure Limitation Methodology                May 1994

 

To this end our report is directed primarily at employees of federal

agencies and their contractors who are engaged in the collection and

dissemination of statistical data, especially those who are directly

responsible for the selection and use of disclosure limitation

procedures.  We believe that the report will also be of interest to

employees with similar responsibilities in other organizations that

release statistical data, and to data users, who may find that it

helps them to understand and use disclosure-limited data products.

 

B. Some Definitions

 

In order to clarify the scope of this report, we define and discuss

here some key terms that will be used throughout the report.

 

B.1. Confidentiality and Disclosure

 

A definition of confidentiality was given by the President's

Commission on Federal Statistics (1971:222):

 

 [Confidential should mean that the dissemination] of data in a 

 manner that would allow public identification of the respondent 

 or would in any way be harmful to him is prohibited and that 

 the data are immune from legal process.

 

The second element of this definition, immunity from mandatory

disclosure through legal process, is a legal question and is outside

the scope of this report.  Our concern is with methods designed to

comply with the first element of the definition, in other words, to

minimize the risk of disclosure (public identification) of the

identity of individual units and information about them.

 

The release of statistical data inevitably reveals some information

about individual data subjects.  Disclosure occurs when information

that is meant to be treated as confidential is revealed.  Sometimes

disclosure can occur based on the released data alone; sometimes

disclosure results from combination of the released data with publicly

available information; and sometimes disclosure is possible only

through combination of the released data with detailed external data

sources that may or may not be available to the general public.  At a

minimum, each statistical agency must assure that the risk of

disclosure from the released data alone is very low.

 

Several different definitions of disclosure and of different types of

disclosure have been proposed (see Duncan and Lambert, 1987 for a

review of definitions of disclosure associated with the release of

microdata).  Duncan et al. (1993: 23-24) provide a definition that

distinguishes three types of disclosure:

 

    Disclosure relates to inappropriate attribution of information to 

    a data subject, whether an individual or an organization.  

    Disclosure occurs when a data subject is identified from a 

    released file (identity disclosure), sensitive information about 

    a data subject is revealed through the released file (attribute 

    disclosure), or the released data make it possible to

 

 

 


 

   determine the value of some characteristic of an individual more 

   accurately than otherwise would have been possible (inferential 

   disclosure).

 

In the above definition, the word "data" could have been substituted

for "file", because each type of disclosure can occur in connection

with the release of tables or microdata.  The definitions and

implications of these three kinds of disclosure are examined in more

detail in the next chapter.

 

B.2. Tables and Microdata

 

The choice of statistical disclosure limitation methods depends on the

nature of the data products whose confidentiality must be protected.

Most statistical data are released in the form of tables or microdata

files.  Tables can be further divided into two categories: tables of

frequency (count) data and tables of magnitude data.  For either

category, data can be presented in the form of numbers, proportions or

percents.

 

A microdata file consists of individual records, each containing

values of variables for a single person, business establishment or

other unit.  Some microdata files include explicit identifiers, like

name, address or Social Security number.  Removing any such

identifiers is an obvious first step in preparing for the release of a

file for which the confidentiality of individual information must be

protected.
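
The distinction between microdata, tables of frequency data and tables
of magnitude data can be illustrated with a minimal Python sketch.  It
is not taken from the report: the records, the field names and the
choice of "name" and "ssn" as the explicit identifiers are all
hypothetical, chosen only to show identifiers being removed and the two
kinds of table being derived from the same file.

```python
from collections import Counter, defaultdict

# Hypothetical microdata file: one record per person (all values invented).
microdata = [
    {"name": "A. Smith", "ssn": "000-00-0001", "county": "Alpha", "income": 42000},
    {"name": "B. Jones", "ssn": "000-00-0002", "county": "Alpha", "income": 58000},
    {"name": "C. Brown", "ssn": "000-00-0003", "county": "Beta", "income": 31000},
]

EXPLICIT_IDENTIFIERS = {"name", "ssn"}  # assumed identifier fields


def remove_identifiers(records, identifiers=EXPLICIT_IDENTIFIERS):
    """Return copies of the records without explicit identifier fields."""
    return [{k: v for k, v in r.items() if k not in identifiers} for r in records]


def frequency_table(records, by):
    """Count records in each category of `by` (a table of frequency data)."""
    return dict(Counter(r[by] for r in records))


def magnitude_table(records, by, value):
    """Sum `value` within each category of `by` (a table of magnitude data)."""
    totals = defaultdict(int)
    for r in records:
        totals[r[by]] += r[value]
    return dict(totals)


released = remove_identifiers(microdata)
counts = frequency_table(released, "county")
incomes = magnitude_table(released, "county", "income")
```

In this toy file the "Beta" cell of both tables rests on a single
record, so removing identifiers is, as the text says, only a first
step.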

 

B.3. Restricted Data and Restricted Access

 

The confidentiality of individual information can be protected by

restricting the amount of information in released tables and microdata

files (restricted data) or by imposing conditions on access to the

data products (restricted access), or by some combination of these.

The disclosure limitation methods described in this report provide

confidentiality protection by restricting the data.

 

Public-use data products are released by statistical agencies to

anyone without restrictions on use or other conditions, except for

payment of fees to purchase publications or data files in electronic

form.  Agencies require that the disclosure risks for public-use data

products be very low.  The application of disclosure limitation

methods to meet this requirement sometimes calls for substantial

restriction of data content, to the point where the data may no longer

be of much value for some purposes.  In such circumstances, it may be

appropriate to use procedures that allow some users to have access to

more detailed data, subject to restrictions on who may have access, at

what locations and for what purposes.  Such restricted access

arrangements normally require written agreements between agency and

users, and the latter are subject to penalties for improper disclosure

of individual information and other violations of the agreed

conditions of use.

 

The fact that this report deals only with disclosure limitation

procedures that restrict data content should not be interpreted to

mean that restricted access procedures are of less importance.

 

 


 

Readers interested in the latter can find detailed information in the

report of the Panel on Confidentiality and Data Access (see below) and

in Jabine (1993b).

 

C. Report of the Panel on Confidentiality and Data Access

 

In October 1993, while the Subcommittee was developing this report,

the Panel on Confidentiality and Data Access, which was jointly

sponsored by the Committee on National Statistics (CNSTAT) of the

National Research Council and the Social Science Research Council,

released its final report (Duncan et al., 1993).  The scope of the

CNSTAT report is much broader than this one: disclosure limitation

methodology was only one of many topics covered and it was treated in

much less detail than it is here.  The CNSTAT panel's recommendations

on statistical disclosure limitation methods (6.1 to 6.4) are less

detailed than the guidelines and recommendations presented in this

report.  However, we believe that the recommendations in the two

reports are entirely consistent with and complement each other.

Indeed, the development and publication of this report is directly

responsive to the CNSTAT Panel's Recommendation 6.1, which says, in

part, that "The Office of Management and Budget's Statistical Policy

Office should continue to coordinate research work on statistical

disclosure analysis and should disseminate the results of this work

broadly among statistical agencies."

 

D. Organization of the Report

 

Chapter II, "Statistical Disclosure Limitation: A Primer",

provides a simple description and examples of disclosure limitation

techniques that are commonly used to limit the risk of disclosure in

releasing tables and microdata.  Readers already familiar with the

basics of disclosure limitation methods may want to skip over this

chapter.

 

Chapter III describes disclosure limitation methods used by twelve

major federal statistical agencies and programs.  Among the factors

that explain variations in agencies' practices are differences in

types of data and respondents, different legal requirements and

policies for confidentiality protection, different technical personnel

and different historical approaches to confidentiality issues.

 

Chapter IV provides a systematic and detailed description and

evaluation of statistical disclosure limitation methods for tables of

frequency and magnitude data.  Chapter V fulfills the same function

for microdata.  These chapters will be of greatest interest to readers

who have direct responsibility for the application of disclosure

limitation methods or are doing research to evaluate and improve

existing methods or develop new ones.  Readers with more general

interests may want to skip these chapters and proceed to Chapters VI

and VII.

 

Due in part to the stimulus provided by our predecessor subcommittee's

report (which we will identify in this report as Working Paper 2),

improved methods of disclosure limitation have been developed and used

by some agencies over the past 15 years.  Based on its review of these

methods, the Subcommittee has developed guidelines for good practice

for all agencies.  With separate sections for tables and microdata,

Chapter VI presents guidelines for recommended practices.

 


 

Chapter VII presents an agenda for research on disclosure limitation

methods.  Because statistical disclosure limitation procedures for

tabular data are more fully developed than those for microdata, the

research agenda focuses more on the latter.  The Subcommittee believed

that a high priority should be assigned to research on how the quality

and usefulness of data are affected by the application of disclosure

limitation procedures.

 

Three appendices are also included.  Appendix A contains technical

notes on practices the statistical agencies have found useful in

extending primary suppression rules to other common situations.

Appendix B lists government references.  Appendix C is an

annotated bibliography of articles about statistical disclosure

limitation published since the publication of Working Paper 2.

 

E. Underlying Themes of the Report

 

Five principal themes underlie the guidelines in Chapter VI and the

research agenda in Chapter VII:

 

  o   There are legitimate differences between the disclosure limitation 

      requirements of different agencies.  Nevertheless, agencies 

      should move as far as possible toward the use of a small number 

      of standardized disclosure limitation methods whose effectiveness 

      has been demonstrated.

 

  o   Statistical disclosure limitation methods have been developed 

      and implemented by individual agencies over the past 25 years.  

      The time has come to make the best technology available to the 

      entire federal statistical system.  The Subcommittee believes 

      that methods which have been shown to provide adequate protection 

      against disclosure should be documented clearly in simple formats.  

      The documentation and the corresponding software should then be 

      shared among federal agencies.

 

  o   Disclosure-limited products should be auditable to determine

      whether or not they meet the intended objectives of the procedure 

      that was applied.  For example, for some kinds of tabular data, 

      linear programming software can be used to perform disclosure audits.

 

  o   Several agencies have formed review panels to ensure that

      appropriate disclosure limitation policies and practices are in 

      place and being properly used.  Each agency should centralize 

      its oversight and review of the application of disclosure 

      limitation methods.

 

  o   New research should focus on disclosure limitation methods for

      microdata and on how the methods used affect the usefulness and 

      ease of use of data products.

 

 


 

 

                             CHAPTER II

 

 

            Statistical Disclosure Limitation:  A Primer

 

This chapter provides a basic introduction to the disclosure

limitation techniques which are used to protect statistical tables and

microdata.  It uses simple examples to illustrate the techniques.

Readers who are already familiar with the methodology of statistical

disclosure limitation may prefer to skip directly to Chapter III,

which describes agency practices; Chapter IV, which provides a more

mathematical discussion of disclosure limitation techniques used to

protect tables; or Chapter V, which provides a more detailed discussion

of disclosure limitation techniques applied to microdata.

 

A. Background

 

One of the functions of a federal statistical agency is to collect

individually identifiable data, process them and provide statistical

summaries to the public.  Some of the data collected are considered

proprietary by respondents.  Agencies are authorized or required to

protect individually identifiable data by a variety of statutes,

regulations or policies.  Cecil (1993) summarizes the laws that apply

to all agencies and describes the statutes that apply specifically to

the Census Bureau, the National Center for Education Statistics, and

the National Center for Health Statistics.  Regardless of the basis

used to protect confidentiality, federal statistical agencies must

balance two objectives: to provide useful statistical information to

data users, and to assure that the responses of individuals are

protected.

 

Not all data collected and published by the government are subject to

disclosure limitation techniques.  Some data on businesses collected

for regulatory purposes are considered public.  Some data are not

considered sensitive and are not collected under a pledge of

confidentiality.  The statistical disclosure limitation techniques

described in this paper are applied whenever confidentiality is

required and data or estimates are to be publicly available.  Methods

of protecting data by restricting access are alternatives to

statistical disclosure limitation.  They are not discussed in this

paper.  See Jabine (1993) for a discussion of restricted access

methods.  All disclosure limitation methods result in some loss of

information, and sometimes the publicly available data may not be

adequate for certain statistical studies.  However, the intention is

to provide as much data as possible, without revealing individually

identifiable data.

 

The historical method of providing data to the public is via

statistical tables.  With the advent of the computer age in the early

1960's, agencies also started releasing microdata files.  In a microdata

file each record contains a set of variables that pertain to a single

respondent and are related to that respondent's reported values.

However, there are no identifiers on the file and the data may be

disguised in some way to make sure that individual data items cannot

be uniquely associated with a particular respondent.  A new method

of releasing data has been introduced by the National Center for

Education Statistics (NCES) in the 1990's.  Data are provided on

diskette or CD-ROM in a secure data base system with access programs

which allow

users to create special tabulations.  The NCES disclosure limitation

and data accuracy standards are automatically applied to the requested

tables before they are displayed to the user.

 

This chapter provides a simple description of the disclosure

limitation techniques which are commonly used to limit the possibility

of disclosing identifying information about respondents in tables and

microdata.  The techniques are illustrated with examples.  The tables

or microdata produced using these methods are usually made available

to the public with no further restrictions.  Section B presents some

of the basic definitions used in the sections and chapters that

follow: included are a discussion of the distinction between tables of

frequency data and tables of magnitude data, a definition of table

dimensionality, and a summary of different types of disclosure.

Section C discusses the disclosure limitation methods applied to

tables of counts or frequencies.  Section D addresses tables of

magnitude data, section E discusses microdata, and Section F

summarizes the chapter.

 

B. Definitions

 

Each entry in a statistical table represents the aggregate value of a

quantity over all units of analysis belonging to a unique statistical

cell.  For example, a table that presents counts of individuals by

5-year age category and the total annual income in increments of

$10,000 is comprised of statistical cells such as the cell {35-39

years of age, $40,000 to $49,999 annual income}.  A table that

displays value of construction work done during a particular period in

the state of Maryland by county and by 4-digit Standard Industrial

Code (SIC) groups is comprised of cells such as the cell (SIC 1521,

Prince George's County).

 

B.1. Tables of Magnitude Data Versus Tables of Frequency Data

 

The selection of a statistical disclosure limitation technique for

data presented in tables (tabular data) depends on whether the data

represent frequencies or magnitudes.  Tables of frequency count data

present the number of units of analysis in a cell.  Equivalently the

data may be presented as a percent by dividing the count by the total

number presented in the table (or the total in a row or column) and

multiplying by 100.  Tables of magnitude data present the aggregate of

a "quantity of interest" over all units of analysis in the cell.

Equivalently the data may be presented as an average by dividing the

aggregate by the number of units in the cell.

 

To distinguish formally between frequency count data and magnitude

data, the "quantity of interest" must measure something other than

membership in the cell.  Thus, tables of the number of establishments

within the manufacturing sector by SIC group and by

county-within-state are frequency count tables, whereas tables

presenting total value of shipments for the same cells are tables of

magnitude data.  For practical purposes, entirely rigorous definitions

are not necessary.  The statistical disclosure limitation techniques

used for magnitude data can be used for frequency data.  However, for

tables of frequency data other options are also available.
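
As a rough illustration of the distinction, the sketch below builds a
frequency count table and a magnitude table from the same records.
The establishment records, field layout and values are hypothetical
and are not taken from any table in this report.

```python
from collections import defaultdict

# Hypothetical establishment records: (SIC group, county, value of shipments).
records = [
    ("2011", "Alpha", 120.0),
    ("2011", "Alpha", 80.0),
    ("2011", "Beta", 300.0),
    ("2511", "Alpha", 50.0),
]

def tabulate(records):
    """Return (counts, totals) keyed by the cell (SIC group, county)."""
    counts = defaultdict(int)    # frequency data: number of units in the cell
    totals = defaultdict(float)  # magnitude data: aggregate of the quantity of interest
    for sic, county, shipments in records:
        counts[(sic, county)] += 1
        totals[(sic, county)] += shipments
    return dict(counts), dict(totals)

counts, totals = tabulate(records)
n_units = sum(counts.values())

# Equivalent presentations: percents for frequency data, averages for magnitude data.
percents = {cell: 100.0 * n / n_units for cell, n in counts.items()}
averages = {cell: totals[cell] / counts[cell] for cell in counts}
```

Here the count for a cell measures only membership in the cell, while
the total measures a quantity reported by each member.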

 

 

 

 

 

 


 

B.2. Table Dimensionality

 

If the values presented in the cells of a statistical table are

aggregates over two variables, the table is a two-dimensional table.

Both examples of detail cells presented above, (35-39 years of age,

$40,000-$49,999 annual income) and (SIC 1521, Prince George's County)

are from two-dimensional tables.  Typically, categories of one

variable are given in columns and categories of the other variable are

given in rows.

 

If the values presented in the cells of a statistical table are

aggregates over three variables, the table is a three-dimensional

table.  If the data in the first example above were also presented by

county in the state of Maryland, the result might be a detail cell

such as (35-39 years of age, $40,000-$49,999 annual income, Montgomery

County).  For the second example if the data were also presented by

year, the result might be a detail cell such as (SIC 1521, Prince

George's County, 1990).  The first two dimensions are said to be

presented in rows and columns, the third variable in "layers".

 

B.3. What is Disclosure?

 

The definition of disclosure given in Chapter I, and discussed further

below, is very broad.  Because this report documents the methodology

used to limit disclosure, the focus is on practical situations.

Hence, the concern is only with the disclosure of confidential

information through the public release of data products.

 

As stated in Lambert (1993), "disclosure is a difficult topic.  People

even disagree about what constitutes a disclosure."  In Chapter I, the

three types of disclosure presented in Duncan et al. (1993) were

briefly introduced.  These are identity disclosure, attribute

disclosure and inferential disclosure.

 

Identity disclosure occurs if a third party can identify a subject or

respondent from the released data.  Revealing that an individual is a

respondent or subject of a data collection may or may not violate

confidentiality requirements.  For tabulations, revealing identity is

generally not disclosure, unless the identification leads to divulging

confidential information (attribute disclosure) about those who are

identified.

 

For microdata, identification is generally regarded as disclosure,

because microdata records are usually so detailed that the likelihood

of identification without revealing additional information is

minuscule.  Hence disclosure limitation methods applied to microdata

files limit or modify information that might be used to identify

specific respondents or data subjects.

 

Attribute disclosure occurs when confidential information about a data

subject is revealed and can be attributed to the subject.  Attribute

disclosure may occur when confidential information is revealed exactly

or when it can be closely estimated.  Thus, attribute disclosure

comprises identification of the subject and divulging confidential

information pertaining to the subject.

 

 


 

Attribute disclosure is the form of disclosure of primary concern to

statistical agencies releasing tabular data.  Disclosure limitation methods

applied to tables assure that respondent data are published only as

part of an aggregate with a sufficient number of other respondents to

prevent attribute disclosure.

 

The third type of disclosure, inferential disclosure, occurs when

information can be inferred with high confidence from statistical

properties of the released data.  For example, the data may show a

high correlation between income and purchase price of home.  As

purchase price of home is typically public information, a third party

might use this information to infer the income of a data subject.  In

general, statistical agencies are not concerned with inferential

disclosure, for two reasons.  First, a major purpose of statistical

data is to enable users to infer and understand relationships between

variables.  If statistical agencies equated disclosure with inference,

no data could be released.  Second, inferences are designed to predict

aggregate behavior, not individual attributes, and thus are often poor

predictors of individual data values.

 

 

 

[Table 1: graphic not reproduced in this text version.]

 

 


 

C. Tables of Counts or Frequencies

 

The data collected from most surveys about people are published in

tables that show counts (number of people by category) or frequencies

(fraction or percent of people by category).  A portion of a table

published from a sample survey of households that collects information

on energy consumption is shown in Table 1 on the previous page as an

example.

 

C.1. Sampling as a Statistical Disclosure Limitation Method

 

One method of protecting the confidentiality of data is to conduct a

sample survey rather than a census.  Disclosure limitation techniques

are not applied in Table 1 even though respondents are given a pledge

of confidentiality because it is a large scale sample survey.

Estimates are made by multiplying an individual respondent's data by a

sampling weight before they are aggregated.  If sampling weights are

not published, this weighting helps to make an individual respondent's

data less identifiable from published totals.  Because the weighted

numbers represent all households in the United States, the counts in

this table are given in units of millions of households.  They were

derived from a sample survey of less than 7000 households.  This

illustrates the protection provided to individual respondents by

sampling and estimation.
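
The weighting step described above can be sketched as follows.  The
respondent records, weights and category names are hypothetical; the
sketch only shows how weighted estimates, rather than raw respondent
counts, end up in the published cells.

```python
# Hypothetical sample: each respondent carries a sampling weight (the number
# of population households it represents) and a reported category.
sample = [
    {"weight": 15000.0, "fuel": "gas"},
    {"weight": 14200.0, "fuel": "gas"},
    {"weight": 16100.0, "fuel": "electric"},
]

def weighted_counts(sample, key):
    """Estimate population counts by summing sampling weights within each category."""
    estimates = {}
    for record in sample:
        category = record[key]
        estimates[category] = estimates.get(category, 0.0) + record["weight"]
    return estimates

estimates = weighted_counts(sample, "fuel")

# Publish in millions of households, as a national table would.
published = {category: round(value / 1e6, 3) for category, value in estimates.items()}
```

Because the weights are not published, a reader of the table cannot
tell how many respondents stand behind a published estimate.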

 

Additionally, many agencies require that estimates must achieve a

specified accuracy before they can be published.  In Table 1 cells

with a "Q" are withheld because the relative standard error is greater

than 50 percent.  For a sample survey accuracy requirements such as

this one result in more cells being withheld from publication than

would a disclosure limitation rule.  In Table 1 the values in the

cells labeled Q can be derived by subtracting the other cells in the

row from the marginal total.  The purpose of the Q is not necessarily

to withhold the value of the cell from the public, but rather to

indicate that any number so derived does not meet the accuracy

requirements of the agency.
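
A minimal sketch of such an accuracy rule follows, assuming the
50 percent relative standard error limit mentioned above.  The function
names and the example numbers are illustrative only.

```python
def publishable(estimate, standard_error, max_rse_percent=50.0):
    """True if the relative standard error (RSE) does not exceed the limit."""
    if estimate == 0:
        return False  # RSE is undefined; treated as not publishable (an assumption)
    rse = 100.0 * standard_error / abs(estimate)
    return rse <= max_rse_percent

def display(estimate, standard_error):
    """Return the text shown in the table cell: the estimate, or 'Q' if withheld."""
    return f"{estimate:.1f}" if publishable(estimate, standard_error) else "Q"

def derive_withheld(row_total, published_cells):
    """A withheld cell can still be derived by subtraction from the marginal total."""
    return row_total - sum(published_cells)
```

The last function makes the point of the paragraph above: the Q flags
inaccuracy and does not hide the value.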

 

When tables of counts or frequencies are based directly on data from

all units in the population (for example the 100-percent items in the

decennial Census) then disclosure limitation procedures must be

applied.  In the discussion below we identify two classes of

disclosure limitation rules for tables of counts or frequencies.  The

first class consists of special rules designed for specific tables.

Such rules differ from agency to agency and from table to table.  The

special rules are generally designed to provide protection to data

considered particularly sensitive by the agency.  The second class is

more general: a cell is defined to be sensitive if the number of

respondents is less than some specified threshold (the threshold

rule).  Examples of both classes of disclosure limitation techniques

are given in Sections II.C.2 and II.C.3.

 

C.2. Special Rules

 

Special rules impose restrictions on the level of detail that can be

provided in a table.  For example, Social Security Administration

(SSA) rules prohibit tabulations in which a detail cell is equal to a

marginal total or which would allow users to determine an individual's

age within a five year interval, earnings within a $1000 interval or

benefits within a $50 interval.

 

 

 


 

Tables 2 and 3 illustrate these rules.  They also illustrate the

method of restructuring tables and combining categories to limit

disclosure in tables.

 

[Table 2: number of beneficiaries by county and size of benefit; graphic not reproduced in this text version.]

 

 

Table 2 is a two-dimensional table showing the number of beneficiaries

by county and size of benefit.  This table would not be publishable

because the data shown for counties B and D violate Social Security's

disclosure rules.  For county D, there is only one non-empty detail

cell, and a beneficiary in this county is known to be receiving

benefits between $40 and $59 per month.  This violates two rules.

First the detail cell is equal to the cell total; and second, this

reveals that all beneficiaries in the county receive between $40 and

$59 per month in benefits.  This interval is less than the required

$50 interval.  For county B there are 2 non-empty cells, but the

range of possible benefits is from $40 to $79 per month, an interval

of less than the required $50.

 

To protect confidentiality, Table 2 could be restructured and rows or

columns combined (sometimes referred to as "rolling-up categories").

Combining the row for county B with the row for county D would still

reveal that the range of benefits is $40 to $79.  Combining A with B

and C with D does offer the required protection, as illustrated in

Table 3.
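
The two checks applied in this example, and the effect of combining
rows, can be sketched as follows.  The benefit classes and counts are
hypothetical (Table 2 itself is not reproduced here); they are chosen
only so that counties B and D behave as described above.  The sketch
checks rows only and is not a statement of the actual SSA rules.

```python
# Benefit classes as (low, high) dollars per month; the open-ended top class uses None.
CLASSES = [(0, 19), (20, 39), (40, 59), (60, 79), (80, 99), (100, None)]

# Hypothetical counts of beneficiaries by county and benefit class.
table = {
    "A": [2, 4, 18, 20, 7, 1],
    "B": [0, 0, 7, 9, 0, 0],
    "C": [3, 6, 11, 14, 8, 2],
    "D": [0, 0, 5, 0, 0, 0],
}

def violations(row, min_width=50):
    """Return the list of rules that a row of counts violates."""
    found = []
    total = sum(row)
    if any(count == total and count > 0 for count in row):
        found.append("detail cell equals marginal total")
    nonempty = [CLASSES[i] for i, count in enumerate(row) if count > 0]
    if nonempty:
        low, high = nonempty[0][0], nonempty[-1][1]
        if high is not None and high - low < min_width:
            found.append("benefit known within less than $%d" % min_width)
    return found

def combine(table, groups):
    """Roll up rows: groups maps a new label to the list of rows added together."""
    return {label: [sum(column) for column in zip(*(table[r] for r in members))]
            for label, members in groups.items()}

rolled_up = combine(table, {"A+B": ["A", "B"], "C+D": ["C", "D"]})
```

With these counts, combining B with D still fails the interval check,
while combining A with B and C with D passes both checks.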

 

 

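[Table 3: Table 2 with counties A and B combined and counties C and D combined; graphic not reproduced in this text version.]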

 

 


 

C.3. The Threshold Rule

 

With the threshold rule, a cell in a table of frequencies is defined

to be sensitive if the number of respondents is less than some

specified number.  Some agencies require at least 5 respondents in a

cell, others require 3. An agency may structure tables and combine

categories (as illustrated above), or use cell suppression, random

rounding, controlled rounding or the confidentiality edit.  Cell

suppression, random rounding, controlled rounding and the

confidentiality edit are described and illustrated below.

 

Table 4 is a fictitious example of a table with disclosures.  The

fictitious data set consists of information concerning delinquent

children.  We define a cell with fewer than 5 respondents to be

sensitive.  Sensitive cells are shown with an asterisk.
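
A minimal sketch of the threshold rule, assuming a threshold of 5.
The counts are illustrative (Table 4 is not reproduced in this text
version), and treating empty cells as not sensitive is an assumption
of the sketch, not a rule stated in this report.

```python
# Illustrative counts: rows are counties, columns are education levels.
counts = [
    [15, 1, 3, 1],
    [20, 10, 10, 15],
    [3, 10, 10, 2],
    [12, 14, 7, 2],
]

def sensitive_cells(table, threshold=5):
    """Return (row, column) positions whose count is positive but below the threshold."""
    return [(i, j)
            for i, row in enumerate(table)
            for j, count in enumerate(row)
            if 0 < count < threshold]

flagged = sensitive_cells(counts)
```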

 

C.3.a. Suppression

 

One of the most commonly used ways of protecting sensitive cells is

via suppression.  It is obvious that in a row or column with a

suppressed sensitive cell, at least one additional cell must be

suppressed, or the value in the sensitive cell could be calculated

exactly by subtraction from the marginal total.  For this reason,

certain other cells must also be suppressed.  These are referred to as

complementary suppressions.  While it is possible to select cells for

complementary suppression manually, it is difficult to guarantee that

the result provides adequate protection.
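
The subtraction argument can be made concrete with a short sketch.
Suppressed cells are represented by None; the row values are
illustrative.

```python
def recover_by_subtraction(row, row_total):
    """If exactly one cell in a row is suppressed (None), recover it from the total."""
    hidden = [j for j, value in enumerate(row) if value is None]
    if len(hidden) != 1:
        return None  # no suppressed cell, or several: simple subtraction does not apply
    return row_total - sum(value for value in row if value is not None)
```

With a single suppressed cell, recover_by_subtraction([12, 14, 7, None], 35)
returns 2.  With two or more suppressed cells in the row the function
returns None, which is why complementary suppressions are needed.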

 

 

[Table 4: graphic not reproduced in this text version.]

 


 

Table 5 shows an example of a system of suppressed cells for Table 4

which has at least two suppressed cells in each row and column.  This

table appears to offer protection to the sensitive cells.  But does

it?

 

 

[Graphic not reproduced in this text version.]

 

This example shows that selection of cells for complementary

suppression is more complicated than it would appear at first.

Mathematical methods of linear programming are used to automatically

select cells for complementary suppression and also to audit a

proposed suppression pattern (e.g., Table 5) to see if it provides the

required protection.  Chapter IV provides more detail on the

mathematical issues of selecting complementary cells and auditing

suppression patterns.
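
To show what an audit computes, the sketch below finds the smallest
and largest value each suppressed cell can take, given the published
cells and marginal totals.  It uses brute-force enumeration of
non-negative integer tables in place of linear programming, so it is
only practical for very small tables.  The suppression pattern and
counts are illustrative assumptions, not a reproduction of Table 5.

```python
from itertools import product

def compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers that sum to `total`."""
    if parts == 0:
        if total == 0:
            yield ()
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def audit(table, row_totals, col_totals):
    """Return {(row, col): (min, max)} for each suppressed cell (None), taken over
    every non-negative integer table consistent with the published cells and totals."""
    per_row = []
    for row, total in zip(table, row_totals):
        hidden = [j for j, value in enumerate(row) if value is None]
        residual = total - sum(value for value in row if value is not None)
        fills = []
        for combo in compositions(residual, len(hidden)):
            filled = list(row)
            for j, value in zip(hidden, combo):
                filled[j] = value
            fills.append(filled)
        per_row.append(fills)
    bounds = {}
    for candidate in product(*per_row):
        # Keep only candidates that also reproduce every column total.
        if [sum(column) for column in zip(*candidate)] != list(col_totals):
            continue
        for i, row in enumerate(table):
            for j, value in enumerate(row):
                if value is None:
                    x = candidate[i][j]
                    low, high = bounds.get((i, j), (x, x))
                    bounds[(i, j)] = (min(low, x), max(high, x))
    return bounds

# Illustrative pattern with at least two suppressed cells in every row and column.
suppressed = [
    [15, None, None, None],
    [20, None, None, 15],
    [None, 10, 10, None],
    [None, 14, 7, None],
]
bounds = audit(suppressed, [20, 55, 25, 35], [50, 35, 30, 20])
```

In this illustration the cell in the first row and last column has
equal lower and upper bounds, so its value is disclosed exactly even
though every row and column contains at least two suppressions.
Adding the first two rows and subtracting the second and third
columns isolates that cell.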

 

Table 6 shows our table with a system of suppressed cells that does

provide adequate protection for the sensitive cells.  However, Table 6

illustrates one of the problems with suppression.  Out of a total of

16 interior cells, only 7 cells are published, while 9 are suppressed.

 

 

 


 

C.3.b. Random Rounding

 

[Graphic not reproduced in this text version.]

 

In order to reduce the amount of data loss which occurs with

suppression, the U.S. Census Bureau has investigated alternative

methods to protect sensitive cells in tables of frequencies.

Perturbation methods such as random rounding and controlled rounding

are examples of such alternatives.  In random rounding cell values are

rounded, but instead of using standard rounding conventions a random

decision is made as to whether they will be rounded up or down.

 

 

[Graphic not reproduced in this text version.]

 

 

Because rounding is done separately for each cell in a table, the rows

and columns do not necessarily add to the published row and column

totals.  In Table 7 the total for the first row is 20, but the sum of

the values in the interior cells in the first row is 15.  A table

prepared using random rounding could lead the public to lose

confidence in the numbers: at a minimum it looks as if the agency

cannot add.  The New Zealand Department of Statistics has used random

rounding in its publications and this is one of the criticisms it has

heard (George and Penny, 1987).
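
Random rounding and the additivity problem can be sketched as
follows.  The rounding base of 5 and the choice of rounding up with
probability proportional to the remainder (which keeps the expected
value equal to the original count) are assumptions of this sketch;
the text above says only that the direction is chosen at random.

```python
import random

def random_round(value, base=5, rng=random):
    """Round a non-negative count to an adjacent multiple of `base`, up or down at random."""
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple of the base: left unchanged
    lower = value - remainder
    # Assumption: round up with probability remainder/base (unbiased rounding).
    return lower + base if rng.random() < remainder / base else lower

def round_table(table, base=5, rng=random):
    """Round every interior cell and every row total independently."""
    cells = [[random_round(value, base, rng) for value in row] for row in table]
    row_totals = [random_round(sum(row), base, rng) for row in table]
    return cells, row_totals

def additive(cells, row_totals):
    """True if every rounded row adds to its published (rounded) row total."""
    return all(sum(row) == total for row, total in zip(cells, row_totals))
```

A rounded row of 15, 0, 0, 0 published with a total of 20, as in the
example above, fails the additive check.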

 

 


 

 

[Graphic not reproduced in this text version.]

 

C.3.c. Controlled Rounding

 

To solve the additivity problem, a procedure called controlled

rounding was developed.  It is a form of random rounding, but it is

constrained to have the sum of the published entries in each row and

column equal the appropriate published marginal totals.  Linear

programming methods are used to identify a controlled rounding for a

table.  There was considerable research into controlled rounding in

the late 1970's and early 1980's and controlled rounding was proposed

for use with data from the 1990 Census (Greenberg, 1986).  However,

to date it has not been used by any federal statistical agency.  Table

8 illustrates controlled rounding.

 

One disadvantage of controlled rounding is that it requires the use of

specialized computer programs.  At present these programs are not

widely available.  Another disadvantage is that controlled rounding

solutions may not always exist for complex tables.  These issues are

discussed further in Chapters IV and VI.
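
The constraint that defines a controlled rounding can be illustrated
with a small search.  This sketch tries every combination of rounding
each cell up or down and returns the first one whose row, column and
grand totals are themselves adjacent roundings of the true totals.
It is a brute-force stand-in for the linear programming methods
mentioned above, usable only for very small tables, and the counts
are illustrative.

```python
from itertools import product

def controlled_round(table, base=5):
    """Search for a rounding of every cell to an adjacent multiple of `base`
    whose row, column and grand totals are adjacent roundings of the true totals.
    Returns the rounded table, or None if the search finds no such rounding."""
    def choices(value):
        low = value - value % base
        return (value,) if value % base == 0 else (low, low + base)

    def adjacent(rounded, true):
        return rounded % base == 0 and abs(rounded - true) < base

    cells = [value for row in table for value in row]
    width = len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(column) for column in zip(*table)]
    for combo in product(*(choices(value) for value in cells)):
        rounded = [list(combo[i:i + width]) for i in range(0, len(combo), width)]
        if (all(adjacent(sum(r), t) for r, t in zip(rounded, row_totals))
                and all(adjacent(sum(c), t) for c, t in zip(zip(*rounded), col_totals))
                and adjacent(sum(combo), sum(cells))):
            return rounded
    return None

# Illustrative counts; every marginal total happens to be a multiple of 5.
counts = [
    [15, 1, 3, 1],
    [20, 10, 10, 15],
    [3, 10, 10, 2],
    [12, 14, 7, 2],
]
rounded = controlled_round(counts)
```

Unlike independent random rounding, the rows and columns of the
result add to the published totals by construction.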

 

C.3.d. Confidentiality Edit

 

The confidentiality edit is a new procedure developed by the

U.S. Census Bureau to provide protection in data tables prepared from

the 1990 Census (Griffin, Navarro, and Flores-Baez, 1989).  There are

two different approaches: one was used for the regular decennial

Census data (the 100 percent data file); the other was used for the

long-form of the Census which was filed by a sample of the population

(the sample data file).  Both techniques apply statistical disclosure

limitation techniques to the microdata files before they are used to

prepare tables.  The adjusted files themselves are not released; they

are used only to prepare tables.

 

 

 


 

 

[Graphic not reproduced in this text version.]

 

First, for the 100 percent microdata file, the confidentiality edit

involves "data swapping" or "switching" (Dalenius and Reiss, 1982;

Navarro, Flores-Baez, and Thompson, 1988).  The confidentiality edit

proceeds as follows.  First, take a sample of records from the

microdata file.  Second, find a match for these records in some other

geographic region, matching on a specified set of important

attributes.  Third, swap all attributes on the matched records.  For

small blocks, the Census Bureau increases the sampling fraction to

provide additional protection.  After the microdata file has been

treated in this way it can be used directly to prepare tables and no

further disclosure analysis is needed.

 

Second, the sample data file already consists of data from only a

sample of the population, and as noted previously, sampling provides

confidentiality protection.  Studies showed that this protection was

sufficient except in small geographic regions.  To provide additional

protection in small geographic regions, one household was randomly

selected and a sample of its data fields was blanked.  These fields

were replaced by imputed values.  After the microdata file has been

treated in this way it is used directly to prepare tables and no

further disclosure analysis is needed.

 

To illustrate the confidentiality edit as applied to the 100 percent

microdata file we use fictitious records for the 20 individuals in

county Alpha who contributed to Tables 4 through 8. Table 9 shows 5

variables for these individuals.  Recall that the previous tables

showed counts of individuals by county and education level of head of

household.  The purpose of the confidentiality edit is to provide

disclosure protection to tables of frequency data.  However, to

achieve this, adjustments are made to the microdata file before the

tables are created.  The following steps are taken to apply the

confidentiality edit.

 

 


 

 

 

[Graphic not reproduced in this text version.]

 

    1.  Take a sample of records from the microdata file (say a 10% 

        sample).  Assume that records number 4 and 17 were selected 

        as part of our 10% sample.

 

    2.  Since we need tables by county and education level, we find 

        a match in some other county on the other variables race, sex 

        and income. (As a result of matching on race, sex and income, 

        county totals for these variables will be unchanged by the 

        swapping.) A match for record 4 (Pete) is found in County Beta.  

        The match is with Alfonso whose head of household has a very 

        high education.  Record 17 (Mike) is matched with George in 

        county Delta, whose head of household has a medium education.

 

        In addition, part of the randomly selected 10% sample from

        other counties matches records in county Alpha.  One record from

        county Delta (June with high education) matches with Virginia,

        record number 12.  One record from county Gamma (Heather with

        low education) matches with Nancy, record 20.

 

 

 

 

 

 

 


 

    3.  After all matches are made, swap attributes on matched records.

        The adjusted microdata file after these attributes are swapped 

        appears in Table 10.

 

 

[Table 10: graphic not reproduced in this text version.]

 

 

    4.  Use the swapped data file directly to produce tables; see Table 11.

 

The confidentiality edit has a great advantage in that

multidimensional tables can be prepared easily and the disclosure

protection applied will always be consistent.  A disadvantage is that

it does not look as if disclosure protection has been applied.
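
The four steps above can be sketched as follows.  This is a
simplified illustration and not the Census Bureau's procedure: the
record layout, the field names, the sampling fraction and the choice
to swap only the education field between matched records are all
assumptions made for the example.

```python
import random

def confidentiality_edit(records, match_keys, swap_keys, fraction=0.10, rng=random):
    """Swap `swap_keys` between sampled records and partners from another county
    that agree on every field in `match_keys`.  Returns new records; the input
    list is not modified."""
    out = [dict(record) for record in records]
    n_sampled = max(1, round(fraction * len(out)))
    used = set()
    for i in rng.sample(range(len(out)), n_sampled):
        if i in used:
            continue
        partners = [j for j, other in enumerate(out)
                    if j not in used and j != i
                    and other["county"] != out[i]["county"]
                    and all(other[k] == out[i][k] for k in match_keys)]
        if not partners:
            continue  # no match found: the record is left unchanged in this sketch
        j = rng.choice(partners)
        for k in swap_keys:
            out[i][k], out[j][k] = out[j][k], out[i][k]
        used.update((i, j))
    return out

# Hypothetical records (not those of Table 9).
records = [
    {"id": 1, "county": "Alpha", "race": "W", "sex": "M", "income": "low", "education": "low"},
    {"id": 2, "county": "Alpha", "race": "B", "sex": "F", "income": "high", "education": "medium"},
    {"id": 3, "county": "Beta", "race": "W", "sex": "M", "income": "low", "education": "very high"},
    {"id": 4, "county": "Beta", "race": "B", "sex": "F", "income": "high", "education": "high"},
    {"id": 5, "county": "Gamma", "race": "W", "sex": "M", "income": "low", "education": "medium"},
]
edited = confidentiality_edit(records, ["race", "sex", "income"], ["education"],
                              fraction=0.4, rng=random.Random(1))
```

Because partners agree on race, sex and income, county totals for
those variables are the same before and after the swap; only the
county-by-education counts can change.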

 

 


 

 

[Table 11: graphic not reproduced in this text version.]

 

D. Tables of Magnitude Data

 

Tables showing magnitude data have a unique set of disclosure

problems.  Magnitude data are generally nonnegative quantities

reported in surveys or censuses of business establishments, farms or

institutions.  The distribution of these reported values is likely to

be skewed, with a few entities having very large values.  Disclosure

limitation in this case concentrates on making sure that the published

data cannot be used to estimate the values reported by the largest,

most highly visible respondents too closely.  By protecting the

largest values, we, in effect, protect all values.

 

For magnitude data it is less likely that sampling alone will provide

disclosure protection because most sample designs for economic surveys

include a stratum of the larger volume entities which are selected

with certainty.  Thus, the units which are most visible because of

their size, do not receive any protection from sampling.  For tables

of magnitude data, rules called primary suppression rules or linear

sensitivity measures, have been developed to determine whether a given

table cell could reveal individual respondent information.  Such a

cell is called a sensitive cell, and cannot be published.

 

The primary suppression rules most commonly used to identify sensitive

cells by government agencies are the (n,k) rule, the p-percent rule

and the pq rule.  All are based on the desire to make it difficult for

one respondent to estimate the value reported by another respondent

too closely.  The largest reported value is the most likely to be

estimated accurately.  Primary suppression rules can be applied to

frequency data.  However, since all respondents contribute the same

value to a frequency count, the rules default to a threshold rule and

the cell is sensitive if it has too few respondents.  Primary

suppression rules are discussed in more detail in Section VI.B.1.
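
As a rough sketch, two of these rules can be written as follows,
using the forms in which they are commonly stated: under the (n,k)
rule a cell is sensitive if its n largest contributions exceed k
percent of the cell total, and under the p-percent rule a cell is
sensitive if the total minus the two largest contributions is less
than p percent of the largest contribution.  The parameter values
are hypothetical; agencies choose their own and often do not publish
them.

```python
def nk_sensitive(contributions, n=2, k=85.0):
    """(n,k) rule: sensitive if the n largest contributions exceed k percent of the total."""
    total = sum(contributions)
    if total <= 0:
        return False
    top = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top / total > k

def p_percent_sensitive(contributions, p=10.0):
    """p-percent rule: sensitive if the total minus the two largest contributions
    is less than p percent of the largest contribution."""
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False
    largest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0.0
    remainder = sum(ordered) - largest - second
    return remainder < (p / 100.0) * largest
```

With equal contributions, as in a frequency count, both functions
reduce to a threshold: with these parameter values a cell of two
equal respondents is sensitive and a cell of three is not.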

 

 

 


 

Once sensitive cells have been identified, there are only two options:

restructure the table and collapse cells until no sensitive cells

remain, or suppress cells.  With cell suppression, once the

sensitive cells have been identified they are withheld from

publication.  These are called primary suppressions.  Other cells,

called complementary suppressions are selected and suppressed so that

the sensitive cells cannot be derived by addition or subtraction from

published marginal totals.  Problems associated with cell suppression

for tables of count data were illustrated in Section II.C.3.a. The

same problems exist for tables of magnitude data.

 

An administrative way to avoid cell suppression is used by a number of

agencies.  They obtain written permission to publish a sensitive cell

from the respondents that contribute to the cell.  The written

permission is called a "waiver" of the promise to protect sensitive

cells.  In this case, respondents are willing to accept the

possibility that their data might be estimated closely from the

published cell total.

 

E. Microdata

 

Information collected about establishments is primarily magnitude

data.  These data are likely to be highly skewed, and there are likely

to be high visibility respondents that could easily be identified via

other publicly available information.  As a result there are virtually

no public use microdata files released for establishment data.

Exceptions are a microdata file consisting of survey data from the

Commercial Building Energy Consumption Survey, which is provided by

the Energy Information Administration and two files from the 1987

Census of Agriculture provided by the Census Bureau.  Disclosure

protection is provided using the techniques described below.

 

It has long been recognized that it is difficult to protect a

microdata set from disclosure because of the possibility of matching

to outside data sources (Bethlehem, Keller and Pannekoek, 1990).

Additionally, there are no accepted measures of disclosure risk for a

microdata file, so there is no 'standard' which can be applied to

assure that protection is adequate. (This is a topic for which

research is needed, as discussed in Chapter VII.)  The methods for

protection of microdata files described below are used by all agencies

which provide public use data files.  To reduce the potential for

disclosure, virtually all public use microdata files:

 

   1. Include data from only a sample of the population,

   2. Do not include obvious identifiers,

   3. Limit geographic detail, and

   4. Limit the number of variables on the file.

 

Additional methods used to disguise high visibility variables include:

 

   1. Top or bottom-coding,

   2. Recoding into intervals or rounding,

   3. Adding or multiplying by random numbers (noise),

   4. Swapping or rank swapping (also called switching),

   5. Selecting records at random, blanking out selected variables 

      and imputing for them (also called blank and impute),

   6. Aggregating across small groups of respondents and replacing

      one individual's reported value with the average (also called

      blurring).

 

These will be illustrated with the fictitious example we used in the

previous section.

 

E.1. Sampling, Removing Identifiers and Limiting Geographic Detail

 

First: include only the data from a sample of the population.  For

this example we used a 10 percent sample of the population of

delinquent children.  Part of the population (County A) was shown in

Table 9. Second: remove obvious identifiers.  In this case the

identifier is the first name of the child.  Third: consider the

geographic detail.  We decide that we cannot show individual county

data for a county with fewer than 30 delinquent children in the

population.  Therefore, the data from Table 4 show that we cannot

provide geographic detail for counties Alpha or Gamma.  As a result

counties Alpha and Gamma are combined and shown as AlpGam in Table 12.

These manipulations result in the fictitious microdata file shown in

Table 12.

 

In this example we discussed only 5 variables for each child.  One

might imagine that these 5 were selected from a more complete data set

including names of parents, names and numbers of siblings, age of

child, ages of siblings, address, school and so on.  As more variables

are included in a microdata file for each child, unique combinations

of variables make it more likely that a specific child could be

identified by a knowledgeable person.  Limiting the number of

variables to 5 makes such identification less likely.

 

E.2. High Visibility Variables

 

It may be that information available to others in the

population could be used with the income data shown in Table 12 to

uniquely identify the family of a delinquent child.  For example, the

employer of the head of household generally knows his or her exact

salary.  Such variables are called high visibility variables and

require additional protection.

 

E.2.a. Top-coding, Bottom-coding, Recoding into Intervals

 

Large income values are top-coded by showing only that the income is

greater than 100 thousand dollars per year.  Small income values are

bottom-coded by showing only that the income is less than 40 thousand

dollars per year.  Finally, income values are recoded by presenting

income in 10 thousand dollar intervals.  The result of these

manipulations yields the fictitious public use data file in Table 13.

Top-coding, bottom-coding and recoding into intervals are among the

most commonly used methods to protect high visibility variables in

microdata files.
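
The three recodes described above can be sketched in a few lines.
The cutoffs (40 and 100 thousand dollars) and the interval width (10
thousand dollars) come from the text; the function name, the label
format, and the treatment of interval endpoints are assumptions of
this sketch and may differ from the layout used in Table 13.

```python
def recode_income(income, bottom=40, top=100, width=10):
    """Return a top-coded, bottom-coded or interval label for an income.

    Income is in thousands of dollars.  The label wording and the
    handling of values that fall exactly on a boundary are
    illustrative assumptions.
    """
    if income > top:
        return f"over {top}"       # top-coding
    if income < bottom:
        return f"under {bottom}"   # bottom-coding
    # Lower bound of the interval that contains the income.  An income
    # exactly equal to `top` is placed in the last interval.
    lower = bottom + int((income - bottom) // width) * width
    lower = min(lower, top - width)
    return f"{lower}-{lower + width}"


# Hypothetical incomes, in thousands of dollars.
labels = [recode_income(x) for x in (125, 32, 61, 100)]
```

Coarser intervals or lower top-code cutoffs give more protection at
the cost of analytic detail; the choice is a policy decision rather
than something the code can settle.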

 

 

 

E.2.b. Adding Random Noise

 

An alternative method of disguising high visibility variables, such as

income, is to add or multiply by random numbers.  In the example

above, assume that we will add a normally distributed random

variable with mean 0 and standard deviation 5 to income.  Along with

the sampling, removal of identifiers and limiting geographic detail,

this might result in a microdata file such as Table 14.  To produce

this table, 14 random numbers were selected from the specified normal

distribution, and were added to the income data in Table 12.
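
A minimal sketch of this step follows.  The mean of 0 and standard
deviation of 5 come from the text; the 14 income values are
hypothetical stand-ins (they are not the values in Table 12), and the
function name and the fixed seed are assumptions made so that the
example can be repeated.

```python
import random


def add_noise(incomes, sd=5.0, seed=None):
    """Add independent normal noise with mean 0 to each income."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sd) for x in incomes]


# Hypothetical incomes (thousands of dollars) for 14 sample records.
incomes = [61, 48, 100, 52, 39, 74, 83, 45, 58, 66, 91, 30, 77, 55]
noisy = add_noise(incomes, sd=5.0, seed=1994)
```

In practice the seed and the exact noise parameters would not be
released, since knowing them would allow the noise to be removed.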

 

 


 

 

E.2.c. Swapping or Rank Swapping

 

Swapping involves selecting a sample of the records, finding a match

in the data base on a set of predetermined variables and swapping all

other variables.  Swapping (or switching) was illustrated as part of

the confidentiality edit for tables of frequency data.  In that

example records were identified from different counties which matched

on race, sex and income and the variables first name of child and

household education were swapped.  For purposes of providing

additional protection to the income variable in a microdata file, we

might choose instead to find a match in another county on household

education and race and to swap the income variables.

 

Rank swapping provides a way of using continuous variables to define

pairs of records for swapping.  Instead of insisting that variables

match (agree exactly), they are defined to be close

based on their proximity to each other on a list sorted by the

continuous variable.  Records which are close in rank on the sorted

variable are designated as pairs for swapping.  Frequently in rank

swapping, the variable used in the sort is the one that will be

swapped.
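
One simple way to picture rank swapping is sketched below.  It is a
deliberately simplified version in which each record exchanges its
value with its immediate neighbor in rank order; the function name
and the data are hypothetical, and an operational procedure would
typically choose partners at random from within a window of nearby
ranks rather than always taking the adjacent record.

```python
def rank_swap_adjacent(values):
    """Swap each value with its neighbor in rank order.

    Records are ranked by value.  Ranks 1 and 2 exchange values,
    ranks 3 and 4 exchange values, and so on.  With an odd number of
    records, the highest-ranked record keeps its own value.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    swapped = list(values)
    for k in range(0, len(order) - 1, 2):
        i, j = order[k], order[k + 1]
        swapped[i], swapped[j] = values[j], values[i]
    return swapped


# Hypothetical incomes for five records, in file order.
original = [50, 20, 40, 30, 90]
swapped = rank_swap_adjacent(original)
```

Because values are only exchanged between records, the set of values
on the file, and therefore its univariate distribution, is unchanged.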

 

E.2.d. Blank and Impute for Randomly Selected Records

 

The blank and impute method involves selecting a few records from the

microdata file, blanking out selected variables and replacing them by

imputed values.  This technique is illustrated using data shown in

Table 12.  First, one record is selected at random from each

publishable county, AlpGam, Beta and Delta.  In the selected record

the income value is replaced by an imputed value.  If the randomly

selected records are 2 in county AlpGam, 6 in county Beta and 13 in

county Delta, the income value recorded in those records might be

replaced by 63, 52 and 49 respectively.  These numbers are also

fictitious, but you can imagine that imputed values were calculated as

the average over all households in the county with the same race and

education.  Blank and impute was used as part of the confidentiality

edit for tables of frequency data from the Census sample data files

(containing information from the long form of the decennial Census).
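
The imputation step described above can be sketched as follows.  The
records, field names and function name are hypothetical.  The sketch
also assumes that the blanked record is left out of the average and
that the result is rounded to a whole number; the text does not
specify either detail.

```python
from statistics import mean


def blank_and_impute(records, index,
                     keys=("county", "race", "education")):
    """Replace the income of one record with a group average.

    The imputed value is the mean income of the other records that
    agree with the selected record on every variable in `keys`.
    A new list is returned; the input records are not modified.
    """
    target = records[index]
    donors = [r["income"] for i, r in enumerate(records)
              if i != index and all(r[k] == target[k] for k in keys)]
    if not donors:
        raise ValueError("no matching records to impute from")
    result = [dict(r) for r in records]
    result[index]["income"] = round(mean(donors))
    return result


# Hypothetical records; income is in thousands of dollars.
records = [
    {"county": "Beta", "race": "R1", "education": "High", "income": 70},
    {"county": "Beta", "race": "R1", "education": "High", "income": 60},
    {"county": "Beta", "race": "R1", "education": "High", "income": 40},
    {"county": "Beta", "race": "R2", "education": "Low", "income": 90},
]
imputed = blank_and_impute(records, index=0)
```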

 

E.2.e. Blurring

 

Blurring replaces a reported value by an average.  There are many

possible ways to implement blurring.  Groups of records for averaging

may be formed by matching on other variables or by sorting the

variable of interest.  The number of records in a group (whose data

will be averaged) may be fixed or random.  The average associated with

a particular group may be assigned to all members of a group, or to

the "middle" member (as in a moving average).  Blurring may be performed on

more than one variable with different groupings for each variable.

 

In our example, we illustrate this technique by blurring the income

data.  In the complete microdata file we might match on important

variables such as county, race and two education groups (very high,

high) and (medium, low).  Then blurring could involve averaging

households in each group, say two at a time.  In county Alpha (see

Table 9) this would mean that the household income for the group

consisting of John and Sue would be replaced by the average of their

incomes (139), the household income for the group consisting of Jim

and Pete would be replaced by their average (82), and so on.  After

blurring, the data file would be subject to sampling, removal of

identifiers, and limitation of geographic detail.
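
The averaging step can be sketched as below for one matched group of
households.  The incomes are hypothetical (they are not the Table 9
values), the function name is ours, and the sketch assumes the
variant in which the group average is assigned to every member of
the group.

```python
def blur_groups(values, group_size=2):
    """Replace each value with the mean of its group.

    Groups are formed from consecutive values, `group_size` at a
    time.  A final short group is averaged over the values it has.
    """
    blurred = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        average = sum(group) / len(group)
        blurred.extend([average] * len(group))
    return blurred


# Hypothetical household incomes within one matched group.
incomes = [100, 120, 60, 80, 45]
blurred = blur_groups(incomes, group_size=2)
```

Averaging within groups leaves the group totals unchanged, so totals
and means computed over whole groups are preserved while individual
values are masked.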

 

F. Summary

 

This chapter has described the standard methods of disclosure

limitation used by federal statistical agencies to protect both tables

and microdata.  It has relied heavily on simple examples to illustrate

the concepts.  The mathematical underpinnings of disclosure limitation

in tables and microdata are reported in more detail in Chapters IV and

V, respectively.  Agency practices in disclosure limitation are

described in Chapter III.

 

 

 

 

 

                               CHAPTER III

 

                  Current Federal Statistical Agency Practices

 

This chapter provides an overview of Federal agency policies,

practices, and procedures for statistical disclosure limitation.

Statistical disclosure limitation methods are applied by the agencies

to limit the risk of disclosure of individual information when

statistics are disseminated in tabular or microdata formats.  Some of

the statistical agencies conduct or support research on statistical

disclosure limitation methods.  Information on recent and current

research is included in Chapter VII.

 

This review of agency practices is based on two sources.  The first

source is Jabine (1993b), a paper based in part on information

provided by the statistical agencies in response to a request in 1990

by the Panel on Confidentiality and Data Access, Committee on National

Statistics.  Additional information for the Jabine paper was taken

from an appendix to Working Paper 2.

 

The second source for this summary of agency practices was a late 1991

request by Hermann Habermann, Office of Management and Budget, to

Heads of Statistical Agencies.  Each agency was asked to provide, for

use by a proposed ad hoc Committee on Disclosure Risk Analysis, a

description of its current disclosure practices, standards, and

research plans for tabular and microdata.  Responses were received

from 12 statistical agencies.  Prior to publication, the agencies were

asked to review this chapter and update any of their practices.  Thus,

the material in this chapter is current as of the publication date.

 

The first section of this chapter summarizes the disclosure limitation

practices for each of the 12 largest Federal statistical agencies as

shown in Statistical Programs of the United States Government: Fiscal

Year 1993 (Office of Management and Budget).  The agency summaries are

followed by an overview of the current status of statistical

disclosure limitation policies, practices, and procedures based on the

available information.  Specific methodologies and the state of

software being used are discussed to the extent they were included in

the individual agencies' responses.

 

A. Agency Summaries

 

A.1. Department of Agriculture

 

A.1.a. Economic Research Service (ERS)

 

ERS disclosure limitation practices are documented in the statement of

"ERS Policy on Dissemination of Statistical Information," dated

September 28, 1989.  This statement provides that:

 

 

Agency Practices             -25-                                   Chapter III

Disclosure Limitation Methodology                                   May 1994

 

      Estimates will not be published from sample surveys unless: (1) 

      sufficient nonzero reports are received for the items in a given 

      class or data cell to provide statistically valid results which 

      are clearly free of disclosure of information about individual 

      respondents.  In all cases at least three observations must be 

      available, although more restrictive rules may be applied to 

      sensitive data, (2) the unexpanded data for any one respondent 

      must represent less than 60 percent of the total that is being 

      published, except when written permission is obtained from that 

      respondent ...

 

The second condition is an application of the (n,k) concentration

rule.  In this instance (n,k) = (1, 0.6).  Both conditions are applied to

magnitude data while the first condition also applies to counts.
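
The two conditions can be sketched as a single check on the
respondent-level contributions to a cell.  The thresholds (three
respondents, 60 percent) come from the ERS statement quoted above;
the function name, its arguments, and the assumption that
contributions are nonnegative are ours.

```python
def is_sensitive_nk(contributions, n=1, k=0.6, min_respondents=3):
    """Return True if a cell fails the threshold or the (n,k) rule.

    A cell is flagged if it has fewer than `min_respondents` nonzero
    contributions, or if its `n` largest contributions make up a
    share `k` or more of the cell total.  Contributions are assumed
    to be nonnegative.
    """
    nonzero = [c for c in contributions if c > 0]
    if len(nonzero) < min_respondents:
        return True
    total = sum(nonzero)
    top_n = sum(sorted(nonzero, reverse=True)[:n])
    return top_n >= k * total


# Hypothetical cells.
dominated = is_sensitive_nk([70, 20, 10])   # one respondent has 70%
balanced = is_sensitive_nk([50, 30, 20])    # largest share is 50%
too_few = is_sensitive_nk([500, 400])       # only two respondents
```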

 

Within ERS, access to unpublished, confidential data is controlled by

the appropriate branch chief.  Authorized users must sign

confidentiality certification forms.  Restrictions require that data

be summarized so individual reports are not revealed.

 

ERS does not release public-use microdata.  ERS will share data for

statistical purposes with government agencies, universities, and

other entities under cooperative agreements as described below for

the National Agricultural Statistics Service (NASS).  Requests of

entities under cooperative agreements with ERS for tabulations of data

that were originally collected by NASS are subject to NASS review.

 

A.1.b. National Agricultural Statistics Service (NASS)

 

Policy and Standards Memorandum (PSM) 12-89, dated July 12, 1989,

outlines NASS policy for suppressing estimates and summary data to

preserve confidentiality.  PSM 7-90 (March 28, 1990) documents NASS

policy on the release of unpublished summary data and estimates.  In

general, summary data and estimates may not be published if a nonzero

value is based on information from fewer than three respondents or if

the data for one respondent represents more than 60 percent of the

published value.  Thus NASS and ERS follow the same basic (n,k)

concentration rule.

 

Suppressed data may be aggregated to a higher level, but steps are

defined to ensure that the suppressed data cannot be reconstructed

from the published materials.  This is particularly important when the

same data are published at various time intervals such as monthly,

quarterly, and yearly.  These rules often mean that geographic

subdivisions must be combined to avoid revealing information about

individual operations.  Data for many counties cannot be published for

some crop and livestock items and State level data must be suppressed

in other situations.

 

NASS uses a procedure for obtaining waivers from respondents which

permits publication of values that otherwise would be suppressed.

Written approval must be obtained and updated periodically.  If

waivers cannot be obtained, data are not published or cells are

combined to limit disclosure.

 

 

 

 

 

 

NASS generally publishes magnitude data only, but the same requirement

of three respondents is applied when tables of counts are generated by

special request or for reimbursable surveys done for other agencies.

 

NASS does not release public-use microdata.  PSM 4-90 (Confidentiality

of Information), PSM 5-89 (Privacy Act of 1974), and PSM 6-90 (Access

to Lists and Individual Reports) cover NASS policies for microdata

protection.  Almost all NASS surveys depend upon voluntary reporting

by farmers and business firms.  This cooperation is secured by a

statutory pledge that individual reports will be kept confidential and

used only for statistical purposes.

 

While it is NASS policy to not release microdata files, NASS and ERS

have developed an arrangement for sharing individual farm data from

the annual Farm Costs and Returns Survey which protects

confidentiality while permitting some limited access by outside

researchers.  The data reside in an ERS data base under security

measures approved by NASS.  All ERS employees with access to the data

base operate under the same confidentiality regulations as NASS

employees.  Researchers wishing access to this data base must have

their requests approved by NASS and come to the ERS offices to access

the data under confidentiality and security regulations.

 

USDA's Office of the General Counsel (OGC) has recently (February

1993) reviewed the laws and regulations pertaining to the disclosure

of confidential NASS data.  In summary, OGC's interpretation of the

statutes allows data sharing to other agencies, universities, and

private entities as long as it enhances the mission of USDA and is

through a contract, cooperative agreement, cost-reimbursement

agreement, or memorandum of understanding.  Such entities or

individuals receiving the data are also bound by the statutes

restricting unlawful use and disclosure of the data.  NASS's current

policy is that data sharing for statistical purposes will occur on a

case-by-case basis as needed to address an approved specified USDA or

public need.

 

To the extent future uses of data are known at the time of data

collection, they can be explained to the respondent and permission

requested to permit the data to be shared among various users.  This

permission is requested in writing with a release form signed by each

respondent.

 

NASS will also work with researchers and others to provide as much

data for analysis as possible.  Some data requests do not require

individual reports and NASS can often publish additional summary data

which are a benefit to the agricultural sector.

 

A.2. Department of Commerce

 

A.2.a. Bureau of Economic Analysis (BEA)

 

BEA standards for disclosure limitation for tabular data are

determined by its individual divisions.  The International Investment

Division is one of the few divisions in BEA, and the major one, that

collect data directly from U.S. business enterprises.  It collects

data on USDIA (U.S.  Direct Investment Abroad), FDIUS (Foreign Direct

Investment in the United States), and international services trade by

means of statistical surveys.  The surveys are mandatory and the

data in them are held strictly confidential under the International

Investment and Trade in Services Survey Act (P.L. 94-472, as amended).

 

A standards statement, "International Investment Division Primary

Suppression Rules," covers the Division's statistical disclosure

limitation procedures for aggregate data from its surveys.  This

statement provides that:

 

      The general rule for primary suppression involves looking at 

      the data for the top reporter, the second reporter, and all 

      other reporters in a given cell.  If the data for all but the 

      top two reporters add up to no more than some given percent of 

      the top reporter's data, the cell is a primary suppression.

 

This is an application of the p-percent rule with no coalitions (c=1).

This rule protects the top reporter from the second reporter, protects

the second reporter from the top reporter, and automatically

suppresses any cell with only one or two reporters.  The value of that

percent and certain other details of the procedures are not published

"because information on the exact form of the suppression rules can

allow users to deduce suppressed information for cells in published

tables."

 

When applying the general rule, absolute values are used if the data

item can be negative (for example, net income).  If a reporter has

more than one data record in the same cell, these records are

aggregated and suppression is done at the reporter level.  In primary

suppression, only reported data are counted in obtaining totals for

the top two reporters; data estimated for any reason are not treated

as confidential.
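
The general rule, together with the reporter-level aggregation and
the use of absolute values, can be sketched as follows.  Because the
actual percent is not published, the parameter p here is purely
hypothetical, as are the function name and the data.  The sketch
aggregates each reporter's records before taking absolute values,
which is an assumption, and it does not model the separate treatment
of estimated data.

```python
from collections import defaultdict


def is_primary_suppression(records, p):
    """Apply the p-percent rule (no coalitions) to one cell.

    `records` is an iterable of (reporter_id, value) pairs and `p` is
    the percent threshold.  The cell is flagged when the data for all
    but the top two reporters sum to no more than p percent of the
    top reporter's data.
    """
    totals = defaultdict(float)
    for reporter_id, value in records:
        totals[reporter_id] += value          # reporter-level totals
    sizes = sorted((abs(t) for t in totals.values()), reverse=True)
    if not sizes:
        return False                          # empty cell
    top = sizes[0]
    rest = sum(sizes[2:])                     # all but the top two
    return rest <= (p / 100.0) * top


# Hypothetical cell: reporter A has two records in the cell.
cell = [("A", 600), ("A", 400), ("B", 500), ("C", 30), ("D", 20)]
flagged = is_primary_suppression(cell, p=10)      # 50 <= 100
not_flagged = is_primary_suppression(cell, p=4)   # 50 > 40
```

The quantity compared with the threshold is the amount by which the
second reporter would overestimate the top reporter after subtracting
its own value from the cell total.  A cell with only one or two
reporters has nothing left over, so it is always flagged, as the text
notes.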

 

The statement includes several "special rules" covering rounded

estimates, country and industry aggregates, key item suppression

(looking at a set of related items as a group and suppressing all

items if the key item is suppressed), and the treatment of time series

data.

 

Complementary suppression is done partly by computer and partly by

human intervention.  All tables are checked by computer to see if the

complementary suppression is adequate.  Limited applications of linear

programming techniques have been used to refine the secondary

suppression methods and help redesign tables to lessen the potential

of disclosure.

 

The International Investment Division publishes some tables of counts.

These are counts pertaining to establishments and are not considered

sensitive.

 

Under the International Investment and Trade in Services Survey Act,

limited sharing of data with other Federal agencies, and with

consultants and contractors of BEA, is permitted, but only for

statistical purposes and only to perform specific functions under the

Act.  Beyond this limited sharing, BEA does not make its microdata on

international investment and services available to outsiders.

Confidentiality practices and procedures with respect to the data are

clearly specified and strictly upheld.

 

 

 

 

According to Jabine (1993b), "BEA's Regional Measurement Division

publishes estimates of local area personal income by major source.

Quarterly data on wages and salaries paid by county are obtained from

BLS's Federal/state ES-202 Program and BEA is obliged to follow

statistical disclosure limitation rules that satisfy BLS

requirements."  Statistical disclosure limitation procedures used are

a combination of suppression and combining data (such as, for two or

more counties or industries).

 

Primary cell suppressions are identified by combining a systematic

roll up of three types of payments to earnings and a dominant-cell

suppression test of wages as a specified percentage of earnings.  Two

additional types of complementary cell suppressions are necessary to

prevent the derivation (indirect disclosure) of primary disclosure

cells.  The first type is the suppression of additional industry cells

to prevent indirect disclosure of the primary disclosure cells through

subtraction from higher level industry totals.  The second type is the

suppression of additional geographic units for the same industry to

prevent indirect disclosure through subtraction from

higher level geographic totals.  These suppressions are determined

using computer programs to impose a set of rules and priorities on a

multi-dimensional matrix consisting of industry and county cells for

each state and region.

 

A.2.b. Bureau of the Census (BOC)

 

According to Jabine (1993b):

 

     "The Census Bureau's past and current practices in the application 

     of statistical disclosure limitation techniques and its research 

     and development work in this area cover a long period and are well 

     documented.  As a pioneer in the release of public-use microdata 

     sets, Census had to develop suitable statistical disclosure 

     limitation techniques for this mode of data release.  It would 

     probably be fair to say that the Census Bureau's practices have 

     provided a model for other statistical agencies as the latter have 

     become more aware of the need to protect the confidentiality of 

     individually identifiable information when releasing tabulations 

     and microdata sets."

 

The Census Bureau's current and recent statistical disclosure

limitation practices and research are summarized in two papers by

Greenberg (1990a, 1990b).  Disclosure limitation procedures for

frequency count tables from the 1990 Census of Population are

described by Griffin, Navarro and Flores-Baez (1989).  Earlier

perspectives on the Census Bureau's statistical disclosure limitation

practices are provided by Cox et al. (1985) and Barabba and Kaplan

(1975).  Many other references will be found in these five papers.

 

For tabular data from the 1992 Census of Agriculture, the Census

Bureau will use the p-percent rule and will not publish the value of

p.  For other economic censuses, the Census Bureau uses the (n,k) rule

and will not publish the values of n or k. Sensitive cells are

suppressed and complementary suppressions are identified by using

network flow methodology for two-dimensional tables (see Chapter IV).

For the three-dimensional tables from the 1992 Economic Censuses, the

Bureau will be using an iterative approach based on a series of

two-dimensional

networks, primarily because the alternatives (linear programming

methods) are too slow for the large amount of data involved.

 

For all demographic tabular data, other than data from the decennial

census, disclosure analysis is not needed because of 1) very small

sampling fractions; 2) weighted counts; and 3) very large categories

(geographic and other).  For economic magnitude data, most surveys do

not need disclosure analysis for the above reasons.  For the economic

censuses, data suppression is used.  However, even if some magnitude

data are suppressed, all counts are published, even for cells of 1 and

2 units.

 

Microdata files are standard products with unrestricted use from all

Census Bureau demographic surveys.  In February 1981, the Census

Bureau established a formal Microdata Review Panel, being the first

agency to do so. (For more details on methods used by the panel, see

Greenberg (1985)).  Approval of the Panel is required for each release

of a microdata file (even files released every year must be approved).

In February 1994, the Census Bureau added two outside advisory members

to the Panel, a privacy representative and a data user representative.

One criterion used by the Panel is that geographic codes included in

microdata sets should not identify areas with less than 100,000

persons in the sampling frame, except for SIPP data (Survey of Income

and Program Participation) for which 250,000 is used.  This cutoff was

adopted in 1981; previously a figure of 250,000 had been used for all

data.  Where businesses are concerned, the presence of dominant

establishments on the files virtually precludes the release of any

useful microdata.

 

The Census Bureau has legislative authority to conduct surveys for

other agencies under either Title 13 or Title 15 U.S.C.  Title 13 is the

statute that describes the statistical mission of the Census Bureau.

This statute also contains the strict confidentiality provisions that

pertain to the collection of data from the decennial census of housing

and population as well as the quinquennial censuses of agriculture,

etc.  A sponsoring agency with a reimbursable agreement under Title 13

can use samples and sampling frames developed for the various Title 13

surveys and censuses.  This would save the sponsor the extra expense

that might be incurred if it had to develop its own sampling frame.

However, the data released to an agency that sponsors a reimbursable

survey under Title 13 are subject to the confidentiality provisions of

any Census Bureau public-use microdata file; for example, the Census

Bureau will not release identifiable microdata nor small area data.

The situation under Title 15 is quite different.  In conducting

surveys under Title 15, the Census Bureau may release identifiable

information, as well as small area data, to sponsors.  However,

samples must be drawn from sources other than the surveys and censuses

covered by Title 13.  If the sponsoring agency furnishes the frame,

then the data are collected under Title 15 and the sponsoring agency's

confidentiality rules apply.

 

 

 

 

A.3. Department of Education: National Center for Education Statistics (NCES)

 

As stated in NCES standard IV-01-91, Standard for Maintaining

Confidentiality: "In reporting on surveys and preparing public-use

data tapes, the goal is to have an acceptably low probability of

identifying individual respondents." The standard recognizes that it

is not possible to reduce this probability to zero.

 

The specific requirement for reports is that publication cells be

based on at least three unweighted observations and subsequent

tabulations (such as cross tabulations) must not provide additional

information which would disclose individual identities.  For

percentages, there must be three observations in the numerator.

However, in fact the issue is largely moot at NCES since all published

tables for which disclosure problems might exist are typically based

on sample data.  For this situation the rule of three or more is

superseded by the rule of thirty or more; that is, the minimum cell

size is driven by statistical (variance) considerations.

 

For public-use microdata tapes, consideration is given to any proposed

variables that are unusual (such as very high salaries) and data

sources that may be available in the public or private sectors for

matching purposes.  Further details are documented in NCES's Policies

and Procedures for Public Release Data.

 

Public-use microdata tapes must undergo a disclosure analysis.  A

Disclosure Review Board was established in 1989 following passage of

the 1988 Hawkins-Stafford Amendment which emphasized the need for NCES

to follow disclosure limitation practices for tabulations and

microdata files.  The Board reviews all disclosure analyses and makes

recommendations to the Commissioner of NCES concerning public release

of microdata.  The Board is required to "...take into consideration

information such as resources needed in order to disclose individually

identifiable information, age of the data, accessibility of external

files, detail and specificity of the data, and reliability and

completeness of any external files."

 

NCES has pioneered the release of a new data product: a data

base system combined with a spreadsheet program.  The user may request

tables to be constructed from many variables.  The data base system

accesses the respondent level data (which are stored without

identifiers in a protected format and result from sample surveys) to

construct these custom tables.  The only access to the respondent

level data is through the spreadsheet program.  The user does not have

a password or other special device to unlock the hidden

respondent-level data.  The software presents only weighted totals in

tables and automatically tests to assure that no fewer than 30

respondents contribute to a cell (an NCES standard for data

availability).
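
The cell-level test can be pictured with the short sketch below.  It
is not the NCES software, whose internals are not described here; the
function name and data layout are hypothetical.  It shows only the
stated behavior: a weighted total is returned when at least 30
respondents contribute to the cell and is withheld otherwise.

```python
def weighted_cell_total(respondents, min_respondents=30):
    """Return a weighted cell total, or None if the cell is too small.

    `respondents` is a list of (value, weight) pairs for the
    respondents contributing to one table cell.
    """
    if len(respondents) < min_respondents:
        return None                     # cell is not shown
    return sum(value * weight for value, weight in respondents)


# Hypothetical cells with 29 and 30 contributing respondents.
small_cell = weighted_cell_total([(1, 2.0)] * 29)
large_cell = weighted_cell_total([(1, 2.0)] * 30)
```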

 

The first release of the protected data base product was for the NCES

National Survey of Postsecondary Faculty, which was made available to

users on diskette.  In 1994 a number of NCES sample surveys are being

made available in a CD-ROM data base system.  This is an updated

version of the original diskette system mentioned above.  The CD-ROM

implementation is more secure, faster and easier to use.

 

 

 

 

 

The NCES Microdata Review Board evaluated the data protection

capabilities of these products and determined that they provided the

required protection.  They believed that the danger of identification

of a respondent's data via multiple queries of the data base was

minimal because only weighted data are presented in the tables, and no

fewer than 30 respondents contribute to a published cell total.

 

A.4. Department of Energy: Energy Information Administration (EIA)

 

EIA standard 88-05-06 "Nondisclosure of Company Identifiable Data in

Aggregate Cells" appears in the Energy Information Administration

Standards Manual (April 1989).  Nonzero value data cells must be based

on three or more respondents.  The primary suppression rule is the pq rule

alone or in conjunction with some other subadditive rule.  Values of

pq (an input sensitivity parameter representing the maximum

permissible gain in information when one company uses the published

cell total and its own value to create better estimates of its

competitors' values) selected for specific surveys are not published

and are considered confidential.  Complementary suppression is also

applied to other cells to assure that the sensitive value cannot be

reconstructed from published data.  The Standards Manual includes a

separate section with guidelines for implementation of the pq rule.

Guidelines are included for situations where all values are negative;

some data are imputed; published values are net values (the difference

between positive numbers); and the published values are weighted

averages (such as volume weighted prices).  These guidelines have been

augmented by other agencies' practices and appear as a Technical Note

to this chapter.
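
A primary suppression test of this kind can be sketched as below.  The
source does not state the formula, so the sketch assumes the form in
which the pq rule is commonly written in the disclosure limitation
literature: with contributions ordered x1 >= x2 >= ..., a cell is
sensitive when p * x1 exceeds q times the sum of all contributions
other than the two largest.  The parameter values are hypothetical
(EIA's actual values are confidential), and contributions are assumed
to be nonnegative.

```python
def pq_sensitive(contributions, p, q, min_respondents=3):
    """Return True if the cell should be suppressed (illustrative only).

    Assumed form of the pq rule: the second-largest respondent subtracts
    its own value from the cell total; if the result pins down the
    largest value too closely, the cell is sensitive.
    """
    values = sorted((v for v in contributions if v != 0), reverse=True)
    if not values:
        return False  # a zero cell has no respondent values to protect
    if len(values) < min_respondents:
        return True   # nonzero cells need three or more respondents
    remainder = sum(values[2:])  # everything except the two largest
    # Equivalent to x1 - (q / p) * remainder > 0, written without division.
    return p * values[0] > q * remainder


# Hypothetical parameters and cells.
P, Q = 10, 50
dominated = pq_sensitive([100, 50, 5], P, Q)
balanced = pq_sensitive([100, 90, 80], P, Q)
too_few = pq_sensitive([100, 50], P, Q)
```

In this example the first cell is flagged because one company dominates
it, the second is not, and the third fails the three-respondent
threshold before the pq test is reached.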

 

An alternative approach pursued by managers of a number of EIA surveys

from which data were published without disclosure limitation

protection for many years was to use a Federal Register Notice to

announce EIA's intention to continue to publish these tables without

disclosure limitation protection.  The Notice pointed out that the

result might be that a knowledgeable user could estimate an individual

respondent's data.

 

For most EIA surveys that use the pq rule, complementary suppressions

are selected manually.  One survey system that publishes complex

tables makes use of software designed particularly for that survey to

select complementary suppressions.  It assures that there are at least

two suppressed cells in each dimension, and that the cells selected

are those of lesser importance to data users.
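
The "at least two suppressed cells in each dimension" condition can be
checked with a short routine such as the one below.  This is only a
sketch for a two-dimensional table; the function name and the Boolean
mask layout are assumptions, and the check says nothing about how the
survey-specific EIA software chooses cells of lesser importance.
Passing this count check does not by itself guarantee that a suppressed
value cannot be recovered from the published data.

```python
def lines_with_single_suppression(mask):
    """List rows and columns that contain exactly one suppressed cell.

    `mask` is a list of rows; each entry is True if the cell is suppressed.
    A lone suppression in a row or column can be recovered by subtracting
    the published cells from the published total.
    """
    problems = []
    for i, row in enumerate(mask):
        if sum(row) == 1:
            problems.append(("row", i))
    for j, column in enumerate(zip(*mask)):
        if sum(column) == 1:
            problems.append(("column", j))
    return problems


# Hypothetical patterns: one lone primary suppression, then the same table
# after three complementary suppressions have been added.
primary_only = [[True, False, False],
                [False, False, False],
                [False, False, False]]
with_complements = [[True, True, False],
                    [True, True, False],
                    [False, False, False]]

before = lines_with_single_suppression(primary_only)
after = lines_with_single_suppression(with_complements)
```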

 

EIA does not have a standard to address tables of frequency data.

However, it appears that there are only two routine publications of

frequency data in EIA tables, the Household Characteristics

publication of the Residential Energy Consumption Survey (RECS) and

the Building Characteristics publication of the Commercial Building

Energy Consumption Survey (CBECS).  In both publications cells are

suppressed for accuracy reasons, not for disclosure reasons.  For the

first publication, cell values are suppressed if there are fewer than

10 respondents or the Relative Standard Errors (RSE's) are 50 percent

or greater.  For the second publication, cell values are suppressed if

there are fewer than 20 respondents or the RSE's are 50 percent or

greater.  No complementary suppression is used.
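
The two accuracy-based suppression rules can be written as one test
with a survey-specific respondent minimum.  The sketch below is an
illustration only; the function names are hypothetical, and the RSE is
assumed to be expressed as a percentage of the estimate.

```python
RECS_MIN_RESPONDENTS = 10   # Household Characteristics (RECS)
CBECS_MIN_RESPONDENTS = 20  # Building Characteristics (CBECS)
MAX_RSE_PERCENT = 50.0      # cells at or above this RSE are suppressed


def relative_standard_error(estimate, standard_error):
    """RSE as a percentage of the estimate (assumed definition)."""
    return 100.0 * standard_error / abs(estimate)


def suppress_for_accuracy(n_respondents, rse_percent, min_respondents):
    """True if the cell is withheld for accuracy reasons."""
    return n_respondents < min_respondents or rse_percent >= MAX_RSE_PERCENT


# Hypothetical cell: 12 respondents, estimate 200 with standard error 40.
rse = relative_standard_error(200.0, 40.0)
recs_suppressed = suppress_for_accuracy(12, rse, RECS_MIN_RESPONDENTS)
cbecs_suppressed = suppress_for_accuracy(12, rse, CBECS_MIN_RESPONDENTS)
```

The same hypothetical cell would be published under the RECS rule but
suppressed under the CBECS rule, because 12 respondents fall short of
the CBECS minimum of 20.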

 

 

 


 

EIA does not have a standard for statistical disclosure limitation

techniques for microdata files.  The only microdata files released by

EIA are for RECS and CBECS.  In these files, various standard

statistical disclosure limitation procedures are used to protect the

confidentiality of data from individual households and buildings.

These procedures include: eliminating identifiers, limiting geographic

detail, omitting or collapsing data items, top-coding, bottom-coding,

interval-coding, rounding, substituting weighted average numbers

(blurring), and introducing noise.
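
Three of the listed procedures are simple enough to sketch in a few
lines.  The cutoffs and interval width below are hypothetical; the
source does not give the values used for RECS or CBECS.

```python
def top_code(value, cutoff):
    """Replace any value above the cutoff with the cutoff itself."""
    return min(value, cutoff)


def bottom_code(value, cutoff):
    """Replace any value below the cutoff with the cutoff itself."""
    return max(value, cutoff)


def interval_code(value, width):
    """Report only the interval [lower, lower + width) containing the value."""
    lower = (value // width) * width
    return (lower, lower + width)


# Hypothetical household record and coding parameters.
income = 187_500
year_built = 1897
floor_area = 2_340

coded = {
    "income": top_code(income, 100_000),           # top-coded at 100,000
    "year_built": bottom_code(year_built, 1920),   # bottom-coded at 1920
    "floor_area": interval_code(floor_area, 500),  # 500-unit intervals
}
```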

 

A.5. Department of Health and Human Services

 

A.5.a. National Center for Health Statistics (NCHS)

 

NCHS statistical disclosure limitation techniques are presented in the

NCHS Staff Manual on Confidentiality (September 1984), Section 10

"Avoiding Inadvertent Disclosures in Published Data" and Section 11

"Avoiding Inadvertent Disclosures Through Release of Microdata Tapes."

No magnitude data figures should be based on fewer than three cases

and a (1, 0.6) (n,k) rule is used.  Jabine (1993b) points out that

"the guidelines allow analysts to take into account the sensitivity

and the external availability of the data to be published, as well as

the effects of nonresponse and response errors and small sampling

fractions in making it more difficult to identify individuals." In

almost all survey reports, no low level geographic data are shown,

substantially reducing the chance of inadvertent disclosure.

 

The NCHS staff manual states that for tables of frequency data a) "in

no table should all cases of any line or column be found in a single

cell"; and b) "in no case should the total figure for a line or column

of a cross-tabulation be less than 3".  The acceptable ways to solve

the problem (for either tables of frequency data or tables of

magnitude data) are to combine rows or columns, or to use cell

suppression (plus complementary suppression).
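
The two frequency-table rules quoted from the NCHS staff manual can be
checked mechanically.  The sketch below assumes a table stored as a
list of rows of counts; the function names and the example table are
hypothetical, and treating an all-zero line as falling below the
minimum total is an interpretation, not something the source states.

```python
MIN_LINE_TOTAL = 3  # rule b): no line or column total below 3


def line_violations(line):
    """Return the rules violated by one row or column of counts."""
    total = sum(line)
    problems = []
    if total > 0 and max(line) == total:
        problems.append("all cases in one cell")  # rule a)
    if total < MIN_LINE_TOTAL:
        problems.append("total below 3")          # rule b)
    return problems


def check_table(table):
    """Map ('row', i) or ('column', j) to the rules that line violates."""
    report = {}
    for i, row in enumerate(table):
        problems = line_violations(row)
        if problems:
            report[("row", i)] = problems
    for j, column in enumerate(zip(*table)):
        problems = line_violations(list(column))
        if problems:
            report[("column", j)] = problems
    return report


# Hypothetical 3 x 3 table of counts.
table = [[5, 0, 0],
         [2, 4, 1],
         [1, 1, 0]]
report = check_table(table)
```

Lines flagged by such a check would then be handled by combining rows
or columns, or by cell suppression with complementary suppression.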

 

The above rules apply only to census surveys.  For NCHS's other data,

which come from sample surveys, the general policy is that "the usual

rules precluding publication of sample estimates that do not have a

reasonably small relative standard error should prevent any

disclosures from occurring in tabulations from sample data."

 

It is NCHS policy to make microdata files available to the scientific

community so that additional analyses can be made for the country's

benefit.  The manual contains rules that apply to all microdata tapes

released which contain any information about individuals or

establishments, except where the data supplier was told prior to

providing the information that the data would be made public.

Detailed information that could identify individuals (for example,

date of birth) should not be included.  Geographic places and

characteristics of areas with less than 100,000 people are not to be

identified.  Information on the drawing of the sample which could

identify data subjects should not be included.  All new microdata sets

must be reviewed for confidentiality issues and approved for release

by the Director, Deputy Director, or Assistant to the Director, NCHS.

 

 

 


 

A.5.b. Social Security Administration (SSA)

 

SSA basic rules are from a 1977 document "Guidelines for Preventing

Disclosure in Tabulations of Program Data," published in Working Paper

2. A threshold rule is used in many cases.  In general, the rule is 5

or more respondents for a marginal cell.  For more sensitive data, 3

or more respondents for all cells may be required.  IRS rules are

applied for publications based on IRS data.  The SSA guidelines

established in 1977 are:

 

     a)   No tabulation should be released showing distributions by 

          age, earnings or benefits in which the individuals (or 

          beneficiary units, where applicable) in any group can be 

          identified to

 

          (1) an age interval of 5 years or less.

          (2) an earnings interval of less than $1000.

          (3) a benefit interval of less than $50.

 

     b)   For distribution by variables other than age, earnings and 

          benefits, no tabulation should be released in which a group 

          total is equal to one of its detail cells.  Some exceptions 

          to this rule may be made on a case-by-case basis when the 

          detail cell in question includes individuals in more than one 

          broad category.

 

     c)   The basic rule does not prohibit empty cells as long as 

          there are 2 or more non-empty cells corresponding to a 

          marginal total, nor does it prohibit detail cells with 

          only one person.  However, additional restrictions (see 

          below) should be applied whenever the detailed classifications 

          are based on sensitive information.  The same restrictions 

          should be applied to non-sensitive data if it can be readily 

          done and does not place serious limitations on the uses of the 

          tabulations.  Additional restrictions may include one or more 

          of the following:

 

          (1) No empty cells.  An empty cell tells the user that an 

              individual included in the marginal total is not in the 

              class represented by the empty cell.

 

          (2) No cells with one person. An individual included in a 

              one-person cell will know that no one else included in 

              the marginal is a member of that cell.

 

SSA's ways of avoiding disclosure include a) suppression and

grouping of data and b) introduction of error (for example, random

rounding).  In 1978 the agency tested a program for random rounding of

individual tabulation cells in their semi-annual tabulations of

Supplemental Security Income State and County data.  Although SSA

considered random rounding and/or controlled rounding, it decided not

to use either.  SSA did not think that rounding provided sufficient protection,

and feared that the data were less useful than with suppression or

combining data.  Thus, their typical method of dealing with cells that

represent disclosure is through suppression and grouping of data.
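
For readers unfamiliar with random rounding, the sketch below shows one
common unbiased form: each count is rounded to a neighboring multiple
of a base, with the probability of rounding up proportional to the
remainder.  The source does not describe the program SSA tested in
1978, so the base of 5 and the function name are assumptions.

```python
import random


def random_round(count, base=5, rng=random):
    """Randomly round a nonnegative count to an adjacent multiple of `base`.

    The count is rounded up with probability remainder / base and down
    otherwise, so the expected value of the result equals the count.
    Counts that are already multiples of the base are left unchanged.
    """
    remainder = count % base
    if remainder == 0:
        return count
    lower = count - remainder
    return lower + base if rng.random() < remainder / base else lower


# Hypothetical cell counts, rounded with a seeded generator so that the
# run is repeatable.
rng = random.Random(1978)
counts = [0, 1, 4, 7, 10, 23]
rounded = [random_round(c, 5, rng) for c in counts]
```

Every published count is then a multiple of 5 that lies less than 5
away from the true count.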

 

 

 


 

One example of their practices is from "Earnings and Employment Data

for Wage and Salary Workers Covered Under Social Security by State and

County, 1985", in which SSA states that they do not show table cells

with fewer than 3 sample cases at the State level and fewer than 10

sample cases at the county level to protect the privacy of the worker.

These are IRS rules and are applied because the data come from IRS.

 

Standards for microdata protection are documented in an article by

Alexander and Jabine (1978).  SSA's basic policy is to make microdata

without identifiers as widely available as possible, subject only to

necessary legal and operational constraints.  SSA has adopted a

two-tier system for the release of microdata files with identifiers

removed.  Designated as public-use files are those microdata files for

which, in SSA's judgment, virtually no chance exists that users will

be able to identify specific individuals and obtain additional

information about them from the records on the file.  No restrictions

are made on the uses of such files.  Typically the public-use files

are based on national samples, with small ... sampling fractions, and

the files contain no geographic codes or at most regional and/or size

of place identifiers.  Those microdata files considered as carrying a

disclosure risk greater than is acceptable for a public-use file are

released only under restricted use conditions set forth in user

agreements, including the purposes for which the data may be used.

 

A.6. Department of Justice: Bureau of Justice Statistics (BJS)

 

Cells with fewer than 10 observations are not displayed in published

tables.  Display of geographic data is limited by Census Bureau Title

13 restrictions for those data collected for BJS by the Census Bureau.

Published tables may further limit identifiability by presenting

quantifiable classification variables (such as age and years of

education) in aggregated ranges.  Cell and marginal entries may also

be restricted to rates, percentages, and weighted counts.

 

Standards for microdata protection are incorporated in BJS enabling

legislation.  In addition to BJS statutes, the release of all data

collected by the Census Bureau for BJS is further restricted by Title

13 microdata restrictions.  Individual identifiers are routinely

stripped from all other microdata files before they are released for

public use.

 

A.7. Department of Labor: Bureau of Labor Statistics (BLS)

 

Commissioner's Order 3-93, "The Confidential Nature of BLS Records,"

dated August 18, 1993, contains BLS's policy on the confidential data

it collects.  One of the requirements is that:

 

    9e. Publications shall be prepared in such a way that they will not 

        reveal the identity of any specific respondent and, to the 

        knowledge of the preparers will not allow the data of any 

        specific respondent to be imputed from the published information.

 

A subsequent provision allows for exceptions under conditions of

informed consent and requires prior authorization of the Commissioner

before such an informed consent provision is used (for two programs

this authority is delegated to specific Associate Commissioners).

 

 

 


 

 

The statistical methods used to limit disclosure vary by program.  For

tables, the most commonly used procedure has two steps: the threshold

rule, followed by the (n,k) concentration rule.  For example, the BLS

collective bargaining program, a census of all collective bargaining

agreements covering 1,000 workers or more, requires that (1) each cell

must have three or more units and (2) no unit can account for more

than 50 percent of the total employment for that cell.  The ES-202

program, a census of monthly employment and quarterly wage information

from Unemployment Insurance filings, uses a threshold rule that

requires three or more establishments and a concentration rule of

(1,0.80). In general, the values of k range from 0.5 to 0.8. In a few

cases, a two-step rule is used: an (n,k) rule for a single establishment

is followed by an (n,k) rule for two establishments.
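
The two-step procedure, a threshold rule followed by an (n,k)
concentration rule, can be sketched as follows.  The function names and
the example employment figures are hypothetical; the parameter values
are those quoted above for the collective bargaining and ES-202
programs.

```python
def nk_sensitive(contributions, n, k):
    """True if the n largest units account for more than k of the total."""
    values = sorted(contributions, reverse=True)
    total = sum(values)
    if total <= 0:
        return False  # nothing to concentrate in an empty or zero cell
    return sum(values[:n]) > k * total


def publishable(contributions, min_units=3, n=1, k=0.5):
    """Apply the threshold rule first, then the (n,k) concentration rule."""
    if len(contributions) < min_units:
        return False
    return not nk_sensitive(contributions, n, k)


# Hypothetical cell: employment reported by three units.
cell = [600, 250, 150]

# Collective bargaining program: three or more units, (1, 0.50).
ok_collective_bargaining = publishable(cell, min_units=3, n=1, k=0.5)

# ES-202 program: three or more establishments, (1, 0.80).
ok_es202 = publishable(cell, min_units=3, n=1, k=0.8)
```

The same cell fails the stricter (1,0.50) rule, because one unit holds
60 percent of the total, but passes the (1,0.80) rule.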

 

Several wage and compensation statistics programs use a more complex

approach that combines disclosure limitation methods and a certain

level of reliability before the estimate can be published.  For

instance, one such approach uses a threshold rule requiring that each

estimate be comprised of at least three establishments (unweighted)

and at least six employees (weighted).  It then uses a (1,0.60)

concentration rule where n can be either a single establishment or a

multi-establishment organization.  Lastly, the reliability of the

estimate is determined and if the estimate meets a certain criterion,

then it can be published.

 

BLS releases very few public-use microdata files.  Most of these

microdata files contain data collected by the Bureau of the Census

under an interagency agreement and Census' Title 13.  For these

surveys (Current Population Survey, Consumer Expenditure Survey, and

four of the five surveys in the family of National Longitudinal

Surveys) the Bureau of the Census determines the statistical

disclosure limitation procedures that are used.  Disclosure limitation

methods used for the public-use microdata files containing data from

the National Longitudinal Survey of Youth, collected under contract by

Ohio State University, are similar to those used by the Bureau of the

Census.

 

A.8. Department of the Treasury: Internal Revenue Service, Statistics

of Income Division (IRS, SOI)

 

Chapter VI of the SOI Division Operating Manual (January 1985)

specifies that "no cell in a tabulation at or above the state level

will have a frequency of less than three or an amount based on a

frequency of less than three."  Data cells for areas below the state

level, for example counties, require at least ten observations.  Data

cells considered sensitive are suppressed or combined with other

cells.  Combined or deleted data are included in the corresponding

column totals.  SOI also documents its disclosure procedures in its

publications, "Individual Income Tax Returns, 1989" and "Corporation

Income Tax Returns, 1989."

 

One example given (Individual Income Tax Returns, 1989) states that if

a weighted frequency (the weighting frequency is obtained by dividing

the population count of returns in a sample stratum by the number of

sample returns for that stratum) is less than 3, the estimate and its

corresponding amount are combined or deleted in order to avoid

disclosure.
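
The weighted-frequency test in this example can be illustrated with a
small calculation.  The stratum sizes below are hypothetical, and the
function names are chosen for the sketch; only the definition of the
weight and the threshold of 3 come from the text.

```python
MIN_WEIGHTED_FREQUENCY = 3


def stratum_weight(population_count, sample_count):
    """Population count of returns in a stratum divided by sample returns."""
    return population_count / sample_count


def weighted_frequency(sample_returns_in_cell, weight):
    """Estimated number of returns represented by the cell."""
    return sample_returns_in_cell * weight


def must_combine_or_delete(frequency, minimum=MIN_WEIGHTED_FREQUENCY):
    """True if the estimate and its amount must be combined or deleted."""
    return frequency < minimum


# Hypothetical stratum: 1,000 returns in the population, 500 in the sample.
weight = stratum_weight(1000, 500)

one_return = weighted_frequency(1, weight)   # one sample return in the cell
two_returns = weighted_frequency(2, weight)  # two sample returns in the cell

flag_one = must_combine_or_delete(one_return)
flag_two = must_combine_or_delete(two_returns)
```

With a weight of 2, a cell based on a single sample return has a
weighted frequency of 2 and is combined or deleted, while a cell based
on two sample returns has a weighted frequency of 4 and is not.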

 

 

 

 


 

 

SOI makes available to the public a microdata file of a sample of

individual taxpayers' returns (the Tax Model).  The data must be

issued in a form that protects the confidentiality of individual

taxpayers.  Several procedural changes were made in 1984 including:

removing some data fields and codes, altering some codes, reducing the

size of subgroups used for the blurring process, and subsampling

high-income returns.

 

Jabine points out that "the SOI Division has sponsored research on

statistical disclosure limitation techniques, notably the work by

Nancy Spruill (1982, 1983) in the early 1980's, which was directed at

the evaluation of masking procedures for business microdata.  On the

basis of her findings, the SOI released some microdata files for

unincorporated businesses." Except for this and a few other instances,

"the statistical agencies have not issued public-use microdata sets of

establishment or company data, presumably because they judge that

application of the statistical disclosure limitation procedures

necessary to meet legal and ethical requirements would produce files

of relatively little value to researchers.  Therefore, access to such

files continues to be almost entirely on a restricted basis."

 

A.9. Environmental Protection Agency (EPA)

 

EPA program offices are responsible for their own data collections.

The types and subjects of data collections are required by statutes

and regulations and the need to conduct studies.  Data confidentiality

policies and procedures are required by specific Acts or are

determined on a case-by-case basis.  Individual program offices are

responsible for data confidentiality and disclosure as described in

the following examples.

 

The Office of Prevention, Pesticides and Toxic Substances (OPPT)

collects confidential business information (CBI) for which there are

disclosure avoidance requirements.  These requirements come under the

Toxic Substances Control Act (TSCA).  Procedures are described in the

CBI security manual.

 

An OPPT Branch that conducts surveys does not have a formal policy in

respect to disclosure avoidance for non-CBI data.  The primary issue

regarding confidentiality for most of their data collection projects

is protection of respondent name and other personal identification

characteristics.  Data collection contractors develop a coding scheme

to ensure confidentiality of these data elements and all raw data

remain in the possession of the contractor.  Summary statistics are

reported in final reports.  If individual responses are listed in an

appendix to a final report identities are protected by using the

contractor's coding scheme.

 

In the Pesticides Program, certain submitted or collected data are

covered by the provisions of the Federal Insecticide, Fungicide and

Rodenticide Act (FIFRA).  The Act addresses the protection of CBI and

even includes a provision for exemption from Freedom of Information

Act disclosure for information that is accorded protection.

 

Two large scale surveys of EPA employees have taken place in the past

five years under the aegis of intra-program task groups.  In each

survey, all employees of EPA in the Washington, D.C. area were

surveyed.  In each instance, a contractor was responsible for data

collection, analysis and final report.  Data disclosure avoidance

procedures were in place to ensure that individuals and specific small

groups of individuals could not be identified and that their responses

could not be disclosed.

 

All returned questionnaires remained in the possession of the

contractor.  The data file was produced by the contractor and

permanently remained in the contractor's possession.  Each record was

assigned a serial number and the employee name file was permanently

separated from the survey data file.

 

The final reports contained summary statistics and cross-tabulations.

A minimum cell size standard was adopted to avoid the possibility of

disclosure.  Individual responses were not shown in the Appendix of

the reports.  A public-use data tape was produced for one of the

surveys; it included a wide array of tabulations and cross-tabulations.

Again, a minimum cell-size standard was used.

 

B. Summary

 

Most of the 12 agencies covered in this chapter have standards,

guidelines, or formal review mechanisms that are designed to ensure

that adequate disclosure analyses are performed and appropriate

statistical disclosure limitation techniques are applied prior to

release of tabulations and microdata.  Standards and guidelines

exhibit a wide range of specificity: some contain only one or two

simple rules while others are much more detailed.  Some agencies

publish the parameter values they use, while others feel withholding

the values provides additional protection to the data.  Obviously,

there is great diversity in policies, procedures, and practices among

Federal agencies.

 

B.1. Magnitude and Frequency Data

 

Most standards or guidelines provide for minimum cell sizes and some

type of concentration rule.  Some agencies (for example, ERS, NASS,

NCHS, and BLS) publish the values of the parameters they use in (n,k)

concentration rules, whereas others do not.  Minimum cell sizes of 3

are almost invariably used, because each member of a cell of size 2

could derive a specific value for the other member.
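
The arithmetic behind the minimum cell size of 3 is shown in the short
sketch below, using hypothetical figures.  With two contributors, either
one can subtract its own value from the published total and recover the
other's value exactly; with three, the same subtraction yields only the
sum of the other two.

```python
def remainder_after_own_value(cell_total, own_value):
    """What a respondent learns by subtracting its own value from the total."""
    return cell_total - own_value


# Hypothetical cell of size 2: companies A and B.
a, b = 100, 50
learned_by_a = remainder_after_own_value(a + b, a)  # exactly B's value

# Hypothetical cell of size 3: companies A, B and C.
c = 30
learned_by_a_3 = remainder_after_own_value(a + b + c, a)  # only B + C
```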

 

Most of the agencies that published their parameter values for

concentration rules used a single set, with n = 1. Values of k ranged

from 0.5 to 0.8. BLS uses the lower value of k in one of its programs

and the upper value in another.  The most elaborate rules included in

standards or guidelines were EIA's pq rule and BEA's and Census

Bureau's related p-percent rules.  They both have the property of

subadditivity, and they give the disclosure analyst flexibility to

specify how much gain in information about its competitors by an

individual company is acceptable.  Also, they provide a somewhat more

satisfying rationale for what is being done than does the arbitrary

selection of parameters for a (n,k) concentration rule.

 

One possible method for dealing with data cells that are dominated by

one or two large respondents is to ask those respondents for

permission to publish the cells, even though the cell

would be suppressed or masked under the agency's normal statistical

disclosure limitation procedures.  Agencies including NASS, EIA, the

Census Bureau, and some of the state agencies that cooperate with BLS

in its Federal-state statistical programs, use this type of procedure

for some surveys.

 

B.2. Microdata

 

Only about half of the agencies included in this review have

established statistical disclosure limitation procedures for

microdata.  Some agencies pointed out that the procedures for surveys

they sponsored were set by the Census Bureau's Microdata Review Board,

because the surveys had been conducted for them under the Census

Bureau's authority (Title 13).  Major releasers of public-use

microdata--Census, NCHS and more recently NCES--have all established

formal procedures for review and approval of new microdata sets.  As

Jabine (1993b) wrote, "In general these procedures do not rely on

parameter-driven rules like those used for tabulations.  Instead, they

require judgments by reviewers that take into account factors such as:

the availability of external files with comparable data, the resources

that might be needed by an 'attacker' to identify individual units,

the sensitivity of individual data items, the expected number of

unique records in the file, the proportion of the study population

included in the sample, the expected amount of error in the data, and

the age of the data."

 

Geography is an important factor.  Census and NCHS specify that no

geographic codes for areas with a sampling frame of less than 100,000

persons can be included in public-use data sets.  If a file contains

large numbers of variables, a higher cutoff may be used.  The

inclusion of local area characteristics, such as the mean income,

population density and percent minority population of a census tract,

is also limited by this requirement because if enough variables of

this type are included, the local area can be uniquely identified.  An

interesting example of this latter problem was provided by EIA's

Residential Energy Consumption Surveys, where the local weather

information included in the microdata sets had to be masked to prevent

disclosure of the geographic location of households included in the

survey.

 

Top-coding is commonly used to prevent disclosure of individuals or

other units with extreme values in a distribution.  Dollar cutoffs are

established for items like income and