Federal Committee on Statistical Methodology
Office of Management and Budget

 

 

Statistical Policy Working Paper 2 - Report on Statistical Disclosure and Disclosure-Avoidance Techniques


 


 

 

 

Statistical Working Papers are a series of technical documents

prepared under the auspices of the Office of Federal Statistical

Policy and Standards.  These documents are the product of working

groups or task forces, as noted in the Preface to each report.

 

These Statistical Working Papers are published for the purpose of

encouraging further discussion of the technical issues and to

stimulate policy actions which flow from the technical findings. 

Readers of Statistical Working Papers are encouraged to communicate

directly with the Office of Federal Statistical Policy and

Standards with additional views, suggestions, or technical concerns.

 

 

Office of Federal Statistical Policy and Standards

Joseph W. Duncan, Director

 

Statistical Policy

Working Paper 2

 

Report on

Statistical Disclosure and Disclosure-

Avoidance Techniques

 

prepared by

Subcommittee on Disclosure-Avoidance Techniques

Federal Committee on Statistical Methodology

 

 

U.S. DEPARTMENT OF COMMERCE

Juanita M. Kreps, Secretary

Courtenay M. Slater, Chief Economist

 

Office of Federal Statistical Policy and Standards

Joseph W. Duncan, Director

 

Issued: May 1978

 

 

 

          Office of Federal Statistical Policy and Standards

 

                       Joseph W. Duncan, Director

 

           George E. Hall, Deputy Director, Social Statistics

        Gaylord E. Worden, Deputy Director, Economic Statistics 

    Maria E. Gonzalez, Chairperson, Federal Committee on Statistical

     Methodology

 

 

                                 Preface

 

This working paper was prepared by the members of the Subcommittee

on Disclosure-Avoidance Techniques, Federal Committee on

Statistical Methodology.  The Subcommittee was chaired by John A.

Michael, National Center for Education Statistics, Department of

Health, Education, and Welfare.  The members of the Subcommittee

are the authors of this report and their names are listed below. 

This report is intended to help managerial and technical staff of

Federal agencies which publish or otherwise release data decide on

methodologies to achieve appropriate disclosure-avoidance

practices.  Data released both in tabulations and in the form of

microdata are discussed in this report.   The Office of Federal

Statistical Policy and Standards hopes to organize, with the help

of Subcommittee members, seminars with Federal employees to disseminate

the findings of the report.  In addition, the report may serve as a

basis for discussions between Federal data producers and data users.

 

 


 

 

                     Members of the Subcommittee on 

                     Disclosure-Avoidance Techniques

 

John A. Michael, Chairperson

National Center for Education Statistics (HEW)

 

Richard A. Bell

Social Security Administration (HEW)

 

Robert H. Mugge

National Center for Health Statistics (HEW)

 

Mervyn R. Stuckey

Statistical Reporting Service (USDA)

 

 


 

Thomas B. Jabine

Social Security Administration (HEW)

 

William J. Smith, Jr.

Internal Revenue Service (Treasury)

 

Paul T. Zeisset

Bureau of the Census (Commerce)

 

 

 

                               Ex Officio

Maria Elena Gonzalez, Chairperson*

Federal Committee on Statistical Methodology

  Office of Federal Statistical Policy and Standards

  (Commerce)

Tore E. Dalenius

Brown University and University of Stockholm

--------

*Member, Federal Committee on Statistical Methodology

 

 

 


 

                           Acknowledgements

 

     The body of this report represents the collective effort of

the Subcommittee on Disclosure-Avoidance Techniques.

     The Subcommittee began by developing the outline for this

report, after which writing assignments were apportioned among

members.  Manuscript was usually subjected to several rounds of

review before its acceptance.  The major contributors to the

respective chapters appear below:

 

     Chapter               Major Contributor(s)

     I               Michael

     II              Jabine and Dalenius

     III             Bell, Mugge, and Dalenius

     IV              Zeisset

     V               Michael and Zeisset

     VI              Jabine

 

     Appendix

     A               The respective agencies

     B               Stuckey

     C               Lawrence H. Cox, Bureau of the Census

     D               Bell

 

Throughout the development of the report, Thomas Jabine enlightened

Subcommittee members on the complexities of the subject and Maria

Gonzalez provided encouragement and goal directedness.  Members of

the Federal Committee on Statistical Methodology and the Office of

Federal Statistical Policy and Standards, Department of Commerce

(formerly the Statistical Policy Division of OMB) reviewed and

commented upon our work.  Manuscript was prepared with the good-

natured assistance of the management and secretaries of the various

statistical agencies.  Deserving special commendation is Joyce

Peoples of the Social Security Administration who effectively

managed the arduous task of preparing and assembling several drafts

of this manuscript.

 

 

 


 

 

 

                  Members of the Federal Committee on

                         Statistical Methodology

 

Barbara A. Bailar

Bureau of the Census (Commerce)

 

Norman D. Beller

Statistical Reporting Service (USDA)

 

Barbara A. Boyes

Bureau of Labor Statistics (Labor)

 

Edwin J. Coleman

Bureau of Economic Analysis (Commerce)

 

John E. Cremeans

Bureau of Economic Analysis (Commerce)

 

Marie D. Eldridge

National Center for Education Statistics (HEW)

 

Fred J. Frishman

Internal Revenue Service (Treasury)

 

Maria E. Gonzalez, Chairperson

Office of Federal Statistical Policy and Standards (Commerce)

 

Thomas B. Jabine

Social Security Administration (HEW)

 

Charles D. Jones

Bureau of the Census (Commerce)

 

Alfred D. McKeon

Bureau of Labor Statistics (Labor)

 

Harold Nisselson

Bureau of the Census (Commerce)

 

Monroe G. Sirken

National Center for Health Statistics (HEW)

 

Wray Smith

Office of the Assistant Secretary for Planning and Evaluation (HEW)

 

 

                             Editorial Note

 

The opinions expressed in this report reflect the collective

judgment of the Subcommittee and do not necessarily reflect the

opinion of the Federal Committee or the Office of Federal

Statistical Policy and Standards.

 

 


 

 

 

                           Table of Contents

 

                                                                     Page

Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .iii

 

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . .v

 

                         CHAPTER I-INTRODUCTION

 

A. Scope of Study and Organization of Report . . . . . . . . . . . . . .1

     1.    The Nature of Statistical Disclosure. . . . . . . . . . . . .1

     2.    Pinpointing Disclosure Potentials and Disclosure-

           Avoidance Techniques. . . . . . . . . . . . . . . . . . . . .1

     3.    Balancing Confidentiality Requirements Against Societal

           Needs for Information . . . . . . . . . . . . . . . . . . . .1

     4.    Other Considerations. . . . . . . . . . . . . . . . . . . . .2

     5.    Findings and Recommendations. . . . . . . . . . . . . . . . .2

B. Auspices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2

C. Dissemination of Report . . . . . . . . . . . . . . . . . . . . . . .2

 

 

               CHAPTER II-DEFINING STATISTICAL DISCLOSURE

 

A. References in Statutes, Regulations, and Policy Statements. . . . . .3

     1. The Privacy Act of 1974. . . . . . . . . . . . . . . . . . . . .3

     2. The Freedom of Information Act . . . . . . . . . . . . . . . . .3

     3. Agency Statutes and Regulations. . . . . . . . . . . . . . . . .4

           a.   Bureau of the Census, Title 13 . . . . . . . . . . . . .4

           b.   Internal Revenue Service . . . . . . . . . . . . . . . .4

           c.   Social Security Administration . . . . . . . . . . . . .4

           d.   Law Enforcement Assistance Administration. . . . . . . .4

           e.   National Center for Health Statistics. . . . . . . . . .4

     4. Advisory Committee Reports . . . . . . . . . . . . . . . . . . .5

           a.   The President's Commission on Federal Statistics . . . .5

           b.   The HEW Secretary's Advisory Committee on Automated

                Personal Data Systems. . . . . . . . . . . . . . . . . .5

           c.   The American Statistical Association Ad Hoc

                Committee on Privacy and Confidentiality . . . . . . . .5

 

           d.   The Privacy Protection Study Commission. . . . . . . . .5

B. Evaluation of Statutory Requirements. . . . . . . . . . . . . . . . .6

C. Prior Definitions of Statistical Disclosure . . . . . . . . . . . . .6

D. A Proposed New Definition of Statistical Disclosure . . . . . . . . .7

     1. The Insufficiency of Prevailing Definitions. . . . . . . . . . .7

     2. A Framework for Defining "Statistical Disclosure". . . . . . . .7

           a. The frame. . . . . . . . . . . . . . . . . . . . . . . . .7

           b. Data associated with the objects in the frame. . . . . . .7

           c. The statistics released from the survey. . . . . . . . . .8

 

                (1) Macrostatistics. . . . . . . . . . . . . . . . . . .8

                (2) Microstatistics. . . . . . . . . . . . . . . . . . .8

           d. Extra objective data . . . . . . . . . . . . . . . . . . .9

           e. Summary. . . . . . . . . . . . . . . . . . . . . . . . . .9

     3. Statistical Disclosure Defined . . . . . . . . . . . . . . . . 10

 

          CHAPTER III-DISCLOSURE IN THE RELEASE OF TABULATIONS

                      (SUMMARY DATA) FOR PUBLIC USE

 

A.   The Problem of Disclosure in Tabulations: Typology,

     Identification and Examples . . . . . . . . . . . . . . . . . . . 11

     1. Exact Disclosure . . . . . . . . . . . . . . . . . . . . . . . 11

           a. Count data . . . . . . . . . . . . . . . . . . . . . . . 11

           b. Magnitude data . . . . . . . . . . . . . . . . . . . . . 12

     2. Approximate Disclosure . . . . . . . . . . . . . . . . . . . . 12

           a. Count data . . . . . . . . . . . . . . . . . . . . . . . 12

           b. Magnitude data . . . . . . . . . . . . . . . . . . . . . 12

     3. Probability-Based Disclosures (Approximate or Exact) . . . . . 13

     4. Indirect Disclosure. . . . . . . . . . . . . . . . . . . . . . 13

     5. External or Internal Disclosure. . . . . . . . . . . . . . . . 14

           a. Count data (direct or indirect disclosure) . . . . . . . 15

           b. Magnitude data (direct or indirect disclosure) . . . . . 15

B. Evaluating the Disclosure Problem . . . . . . . . . . . . . . . . . 16

     1. The Level of Risk of Disclosure. . . . . . . . . . . . . . . . 17

           a. The relative size of the sample. . . . . . . . . . . . . 17

           b. The detail provided in the tabulation. . . . . . . . . . 17

           c. The quality of the data. . . . . . . . . . . . . . . . . 17

           d. Availability of external information . . . . . . . . . . 17

     2. The Acceptability of the Disclosure Risk . . . . . . . . . . . 17

           a. Sensitivity of data. . . . . . . . . . . . . . . . . . . 17

           b. Possible adverse consequences of disclosure. . . . . . . 18

     3. The Assurances Given to the Respondents. . . . . . . . . . . . 18

C.  Disclosure-Avoidance Techniques. . . . . . . . . . . . . . . . . . 18

     1. Data Suppression . . . . . . . . . . . . . . . . . . . . . . . 18

           a. Cell suppression . . . . . . . . . . . . . . . . . . . . 18

           b. Table suppression. . . . . . . . . . . . . . . . . . . . 18

     2. "Rolling Up" Data. . . . . . . . . . . . . . . . . . . . . . . 19

     3. Disturbing the Data. . . . . . . . . . . . . . . . . . . . . . 19

     4. Limiting Distribution. . . . . . . . . . . . . . . . . . . . . 20

     5. Evaluation of Alternative Techniques . . . . . . . . . . . . . 20 

 

CHAPTER IV-DISCLOSURE IN MICRODATA

 

A. Nature of the Problem . . . . . . . . . . . . . . . . . . . . . . . 23

     1. Definition of Microdata. . . . . . . . . . . . . . . . . . . . 23

     2. Federal Agency Examples of Microdata Release . . . . . . . . . 23

           a. Bureau of the Census . . . . . . . . . . . . . . . . . . 23

           b. Social Security Administration . . . . . . . . . . . . . 24

           c. National Center for Health Statistics. . . . . . . . . . 24

           d. National Center for Education Statistics . . . . . . . . 24

           e. Internal Revenue Service . . . . . . . . . . . . . . . . 24

 


 

B. Evaluation of the Problem . . . . . . . . . . . . . . . . . . . . . 25

     1. Factors Bearing on the Likelihood of Disclosure. . . . . . . . 25

           a. Sample size or fraction of the universe. . . . . . . . . 25

           b. Uniqueness . . . . . . . . . . . . . . . . . . . . . . . 25

                (1) Geographic information . . . . . . . . . . . . . . 25

                (2) Characteristics of the respondent. . . . . . . . . 25

           c. Recognizability. . . . . . . . . . . . . . . . . . . . . 26

                (1) Population registers . . . . . . . . . . . . . . . 26

                (2) "Noise" in the data. . . . . . . . . . . . . . . . 26

                (3) Time lag . . . . . . . . . . . . . . . . . . . . . 27

           d.   Hypothesized relationships among the various factors

                in two types of attempts to penetrate disclosure

                safeguards . . . . . . . . . . . . . . . . . . . . . . 27

                (1) Searching for a specific individual. . . . . . . . 27

                (2) "Fishing expedition" . . . . . . . . . . . . . . . 27

     2. Acceptability of the Disclosure Risk . . . . . . . . . . . . . 28

                a. Potential harm to the respondent. . . . . . . . . . 28

                b. Potential harm to the agency. . . . . . . . . . . . 28

                c. Resources available to the misuser. . . . . . . . . 28

C. Disclosure Prevention Techniques for Public-Use Microdata        

   Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

     1. General Tradeoffs. . . . . . . . . . . . . . . . . . . . . . . 28

     2. Elimination of Categories Identifying Small Salient Groups . . 29

     3. Allowing No Unique Cases . . . . . . . . . . . . . . . . . . . 29

     4. Introduction of "Noise" into the Data. . . . . . . . . . . . . 29

     5. Removal of Well-Known Individuals from the File. . . . . . . . 30

     6. Release of Customized Files. . . . . . . . . . . . . . . . . . 30

D. Disclosure Prevention Through Restrictions on Use . . . . . . . . . 30

     1. Alternatives Where Public-Use Microdata Are Not 

        Satisfactory . . . . . . . . . . . . . . . . . . . . . . . . . 30

           a. Special tabulations by the originating agency. . . . . . 30

           b. Microdata available for restricted use . . . . . . . . . 30

     2. Contractual/Administrative Requirements on the Restricted        

        User . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

     3. Agency Experience with Use-Restricting Agreements. . . . . . . 31

           a. Bureau of the Census . . . . . . . . . . . . . . . . . . 31

           b. Other agencies . . . . . . . . . . . . . . . . . . . . . 31

     4. Relationship of Computer Security to Use Restriction . . . . . 31

 

 

 

            CHAPTER V-THE QUESTION OF BALANCE: PROTECTION OF

              INDIVIDUALS VS. PUBLIC NEEDS FOR INFORMATION

 

A. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

B. Comments in the Literature. . . . . . . . . . . . . . . . . . . . . 33

C. Reactions to Agency Policies and Procedures for Disclosure       

   Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

     1. Impact on Individual Data Subjects . . . . . . . . . . . . . . 34

     2. Organizations as Data Subjects . . . . . . . . . . . . . . . . 36

     3. Reactions of Data Users. . . . . . . . . . . . . . . . . . . . 36

           a. Data-loss problem. . . . . . . . . . . . . . . . . . . . 36

           b. Crosscutting standard geographic areas . . . . . . . . . 37

           c. Changes in disclosure-avoidance techniques . . . . . . . 37

           d. Changes in methodology . . . . . . . . . . . . . . . . . 38

           e. Data-users options . . . . . . . . . . . . . . . . . . . 38

 

           f. Recommendation by the Census Advisory Committee of the

                American Statistical Association . . . . . . . . . . . 38

     4. Reactions of Others. . . . . . . . . . . . . . . . . . . . . . 39

 

 

                CHAPTER VI-FINDINGS AND RECOMMENDATIONS

 

A. The Concept of Statistical Disclosure . . . . . . . . . . . . . . . 41

 

B. Deciding What to Release. . . . . . . . . . . . . . . . . . . . . . 41

     Findings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

     Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . 41

C. Disclosure-Avoidance Techniques . . . . . . . . . . . . . . . . . . 43

     Findings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

     Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . 43

D. Effects of Disclosure on Data Subjects and Users. . . . . . . . . . 43

     Findings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

     Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . 44

E. Needs for Research and Development. . . . . . . . . . . . . . . . . 44

     Findings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

     Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . 44

 

 

                               APPENDICES

 

Appendix A. Statistical Disclosure-Avoidance Practices of Selected       

            Federal Agencies . . . . . . . . . . . . . . . . . . . . . 45

Appendix B. Protecting Data in Computer Systems. . . . . . . . . . . . 61

Appendix C. Selected Methodological Issue in Statistical Disclosure 

            Avoidance. . . . . . . . . . . . . . . . . . . . . . . . . 65

 

Appendix D. Bibliography . . . . . . . . . . . . . . . . . . . . . . . 67

 

 


 

 

                                                              CHAPTER I

                              Introduction

 

 

A. Scope of Study and Organization of Report

 

     This report is about techniques for avoiding disclosure of

confidential information about individuals (natural and legal

persons) in connection with the release of statistical tabulations

and microdata files (computerized records pertaining to individual

statistical units).  The report culminates more than a year's study

of potentials for statistical disclosure, i.e., disclosure of

confidential information about identifiable (but not identified)

units in tabulations and microdata files.  Many Federal agencies

which release tabulations or microdata files for statistical

purposes have statutes, regulations, or policy requirements that

releases be made in such a way that no information traceable to a

specific individual[1] will be disclosed.

     The major questions addressed during the year and reported

here are as follows:

     -     What is the nature of statistical disclosure?

     -     How pervasive a problem is it?

     -     How can agency requirements be translated into specific

           disclosure-avoidance techniques?

     -     How can agency requirements be met without unduly

           restricting data releases?

     -     How do agency disclosure-avoidance practices affect data

           subjects and data users?

 

1. The Nature of Statistical Disclosure

 

     The problem of statistical disclosure is certainly

not a new one.  It has long been recognized that any available

tabulation of the characteristics of a population is likely to

narrow the range of uncertainty about the characteristics of

specific individuals known to be members of that population. 

Recognition of the problem has been heightened by the widespread

use of computers and microdata files as well as the increased

demand for more detail in statistical releases.  The sheer number

of characteristics available about a given statistical unit in

microdata form, which sometimes produces unique configurations,

may make identification possible, even though identifiers (such as

names, social security numbers, or employer numbers) have been

removed.  Nevertheless, we discovered that comparatively little is

known about disclosure.  To begin with, there is no widely accepted

definition or typology of "disclosure." Probing the definitional

issue, we reviewed prevailing statutes, regulations, and policy

directives at the Federal level to see what light they might shed

on the nature of disclosure.  Published literature on the topic was

also consulted.  Tore Dalenius, consultant to the Statistical

Policy Division, OMB, developed a formal definition while working

with the Subcommittee.  We adopted this definition, as it was

judged to provide the best basis for a comprehensive discussion of

the disclosure issue.  The definition is presented in Chapter II

along with the above-mentioned reviews.  Citations to the

literature appear in Appendix D.
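
The observation that a large number of characteristics can produce unique configurations, even after identifiers are removed, can be illustrated with a small sketch. The example below is not taken from the report: the records, field names, and helper functions are hypothetical, and it assumes that "unique" simply means a combination of values that occurs exactly once in the file.

```python
from collections import Counter

# Hypothetical microdata records with direct identifiers already removed.
# Field names and values are illustrative assumptions, not data from the report.
records = [
    {"state": "OH", "occupation": "teacher", "age_group": "30-39", "household_size": 4},
    {"state": "OH", "occupation": "teacher", "age_group": "30-39", "household_size": 3},
    {"state": "OH", "occupation": "surgeon", "age_group": "60-69", "household_size": 9},
    {"state": "TX", "occupation": "teacher", "age_group": "30-39", "household_size": 2},
    {"state": "TX", "occupation": "farmer", "age_group": "40-49", "household_size": 2},
]


def unique_configurations(rows, fields):
    """Return the rows whose combination of values on `fields` occurs exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return [row for row, key in zip(rows, keys) if counts[key] == 1]


def uniqueness_by_detail(rows, fields):
    """Count unique rows as characteristics are added one at a time."""
    return {
        n: len(unique_configurations(rows, fields[:n]))
        for n in range(1, len(fields) + 1)
    }


FIELDS = ["state", "occupation", "age_group", "household_size"]
result = uniqueness_by_detail(records, FIELDS)

for n_fields, n_unique in result.items():
    print(f"{n_fields} characteristic(s): {n_unique} unique record(s)")
```

In this toy file no record is unique on state alone, but every record becomes unique once all four characteristics are released together. A count of this kind says nothing about whether a unique record is actually recognizable to an outsider; that depends on factors such as external information and noise in the data, which the report takes up in Chapter IV.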

 

2.   Pinpointing Disclosure Potentials and Disclosure-Avoidance

     Techniques

 

     The definitional effort was augmented by an examination of

different types of disclosure and a review of the various factors

affecting the potential for unintentional disclosure.  Since the

nature of the disclosure problem varies significantly for

tabulations and microdata tapes, the discussion proceeds separately

for the two modes of data dissemination in Chapters III and IV,

respectively.  The latter portion of each of these chapters

identifies and describes disclosure-avoidance techniques

appropriate for the respective mode of release.  To augment this

general description, we assembled descriptions of the disclosure-

avoidance practices of several Federal statistical agencies.  These

appear in Appendix A.

 

3.   Balancing Confidentiality Requirements Against Societal Needs

     for Information

 

     We have used the term "disclosure avoidance" to describe

efforts to reduce the risk of disclosure.  The release of any data

usually entails at least some element of risk.  A decision to

eliminate all risk of disclosure would curtail statistical releases

drastically, if not completely.  Thus, for any proposed release of

tabulations or microdata, the acceptability of the level of risk of

disclosure must be evaluated.  The use of the term "disclosure

avoidance" should not be allowed to obscure the vital significance

of such evaluations, or to lead to policies which attempt to

eliminate disclosure risk completely.

     In summary, protection of the confidentiality of information

about individuals must be balanced against the legitimate needs of

society for information.  This "Question of Balance" is discussed

in Chapter V.

 

_____________________

[1] Except where otherwise specified, the word "individual" as used in this

report is meant to cover all types of reporting units: natural

persons, corporations, partnerships, fiduciaries, etc.

 

4.  Other Considerations

 

     For the most part, our study was confined to matters internal

to Federal agencies.  However, at one point in Chapter V this

limitation is relaxed to examine the impact of agency disclosure

practices upon data subjects and data users.  This report does not

deal with the issue of releasing data with identifiers, whether

such release is intentional or unintentional.  Our treatment of

disclosure differs from that commonly associated with the Privacy

Act of 1974, for example, which treats disclosure as transferring

information coupled with identifiers.  The conception of disclosure

advanced here excludes from consideration many identifier-linked

confidentiality issues, such as whether statistical data should be

immune from mandatory release for administrative, legislative and

judicial purposes.  By the same token, the report deals only

tangentially with the issue of computer security, ignoring the

much-publicized potential for penetration and misuse.  A substantial literature

on that problem already exists, which this report highlights in

Appendix B. The more relevant computer aspect is the possibility of

mechanizing the search for disclosure risks and the implementation

of disclosure-avoidance techniques.  Appendix C reports on the

development of an automated system to avoid disclosure in

tabulations published by the Bureau of the Census from its economic

censuses.

 

5.   Findings and Recommendations

 

     Our findings and recommendations appear in Chapter VI.  In

framing recommendations, we have been mindful of the diversity of

statistical activity within the Federal establishment, as well as

the complexity of the matter, and refrained from advocating overly

generalized solutions.  Yet, because we were also mindful of the

pressing nature of the disclosure problem, the report includes a

number of suggestions for the development and review of agency

disclosure-avoidance practices.

 

                               B. Auspices

 

The report represents the collective efforts of the Subcommittee on

Disclosure-Avoidance Techniques of the Federal Committee on

Statistical Methodology, which operated under the auspices of the

Office of Federal Statistical Policy and Standards, Department of

Commerce (previously the Statistical Policy Division, Office of

Management and Budget).  The group was originally formed in early

1976 as one of two working groups of a Subcommittee on

Confidentiality Issues chaired by Thomas B. Jabine.  The working

groups were subsequently given separate subcommittee status.  The

other group, the Subcommittee on Matching Techniques, examined

methodological issues associated with the merger of microdata from

different data sets.  The opinions expressed here reflect the

collective judgment of the Subcommittee and do not necessarily

reflect those of the Federal Committee on Statistical Methodology

or the Office of Federal Statistical Policy and Standards.

 

                       C. Dissemination of Report

 

This report is intended for circulation among managerial and

technical staff of statistical agencies and those Federal offices

which release information for statistical and research purposes. 

The report is intended to apprise such staff more fully of the

disclosure problem and encourage appropriate disclosure-avoidance

practices at the individual agency level.  In addition, we hope

this report will furnish the basis for an informed discussion of

the disclosure problem within the Federal establishment generally

as well as between the Federal Government and its data suppliers

and users.  It may also be of more general use to persons

interested in issues related to the avoidance of statistical

disclosure.

 

 


 

 

                                                              CHAPTER II

 

                     Defining Statistical Disclosure

 

 

 

A.   References in Statutes, Regulations, and

                Policy Statements

 

The first requirement of Federal agency policies for avoiding

disclosure in the release of tabulations and microdata is that

these policies conform with relevant statutes and regulations.  In

addition, there have been several recommendations on this subject

by advisory groups, which, while not binding, often carry

considerable weight.  This section of the chapter presents and

reviews relevant sections of statutes, regulations and reports of

advisory groups.

 

1.   The Privacy Act of 1974

 

     The Privacy Act (P.L. 93-579, 1974) does not address the issue of

disclosure in tabulations; however, it does have one provision relating 

to disclosure of microdata.  Section 552a(b)(5) provides for disclosure 

without consent of the individual to whom the record pertains "to a 

recipient who has provided the agency with advance adequate written 

assurance that the record will be used solely as a statistical 

research or reporting record, and the record is to be transferred in a 

form that is not individually identifiable."

 

     The OMB Guidelines for Privacy Act Implementation (U.S. Office

of Management and Budget, 1975) explain the statutory language as

follows: "The use of the phrase 'in a form that is not individually

identifiable' means not only that the information disclosed or

transferred must be stripped of individual identifiers but also

that the identity of the individual cannot be reasonably deduced by

anyone from tabulations or other presentations of the information

(i.e., the identity of the individual cannot be determined or

deduced by combining various statistical records or by reference to

public records or other available sources of information.)" The

Guidelines go on to say "Fundamentally, agencies disclosing records

under this provision are required to assure that information

disclosed for use as a statistical research or reporting record

cannot reasonably be used in any way to make determinations about

individuals."

 

     Unfortunately, the applicability of this provision of the

Privacy Act to the release of microdata from Privacy Act record

systems is far from clear.  It can be argued that records meeting

the requirements of 552a(b)(5) are in general required to be

released in response to Freedom of Information (FOI) Act

(P.L. 93-502, 1974) requests, since they do not come under any of the FOI

exemptions.  Surely, since all reasonable possibility of

identification by recipients is presumed to have been eliminated,

such records would not come under 552(b)(6) of the Freedom of

Information Act, which exempts from mandatory FOI disclosure

"personnel and medical files and similar files the disclosure of

which would constitute a clearly unwarranted invasion of personal

privacy."

 

     The Privacy Act itself provides in Section 552a(b)(2) for

disclosure without consent where such disclosure would be "required

under Section 552 of this title" (section 552 is the Freedom of

Information Act), and it would seem that most disclosures of

information meeting the requirements of 552a(b)(5) of not being

individually identifiable would fall under 552a(b)(2) and not

552a(b)(5).

 

     If the above analysis is found to be confusing, this is

indicative of the dilemma facing the Federal agency official trying

to determine whether and under what conditions the Privacy Act

permits him to release a specified microdata file.

 

2.      The Freedom of Information Act

     In thinking about disclosure-avoidance policies, it is

important to keep in mind that FOI requires Federal agencies to

make any records or documents in their possession available to

individuals on request, unless such materials come under one of the

9 exemptions in the act.  Thus, FOI requests for existing statistical

tabulations and microdata files can be denied only if one or more

of these exemptions applies.  Furthermore, denials in such cases

are not required by FOI: the materials may be released unless

prohibited by another statute or regulation.  Three of the 9

exemptions are pertinent, and are discussed below.

     Exemption (3).-This exemption formerly referred

to matters "specifically exempted from disclosure by statute."

However, the Government in the Sunshine Act (P.L. 94-409, 1976) has

changed this exemption (effective March 14, 1977) to read

"specifically exempted from disclosure by statute (other than

Section 552(b).1 of this title), provided that such statute (A)

requires that the matters be withheld from the public in such a

manner as to leave no discretion on the issue, or (B) establishes

particular criteria for withholding or refers to particular types

of matters to be withheld." The effect of the change was to

substantially narrow the applicability of this exemption. 

Agencies, including for example the Social Security Administration,

whose confidentiality statutes do not meet the new requirements of

exemption (3) now have to rely on one of the other FOI exemptions

when they wish to protect statistical tabulations or microdata

files from mandatory release under FOI.

     Exemption (4).-This exemption refers to "trade secrets and

commercial or financial information obtained from a person and

privileged or confidential." The extent of applicability of this

exemption to statistical tabulations and microdata is not well

defined at this time, and will only become clearer as court decisions

rule on its applicability to FOI requests for such data.

     Exemption (6).-This exemption refers to "personnel and medical

files and similar files the disclosure of which would constitute a

clearly unwarranted invasion of personal privacy." As in the case

of exemption (4), the extent of applicability of this exemption to

tabulations and microdata is not yet clear.  Recent court decisions

have tended to limit its applicability. 

 

3.  Agency Statutes and Regulations 

 

Following is a review of selected provisions of agency statutes and

regulations relevant to the release of statistical tabulations and

microdata.  It is not intended that this be a full review of agency

confidentiality statutes and regulations.  We cite here only those

provisions which appear to be directly relevant to the question of

defining statistical disclosure.  

     a. Bureau of the Census, Title 13.-The relevant portion

prohibits the Census Bureau from making "any publication whereby

the data furnished by a particular establishment or individual

under this title can be identified."

     b. Internal Revenue Service.-The section of the Internal

Revenue Code dealing with "Statistical Publications and Studies," as

amended by the Tax Reform Act (P.L. 94-455, 1976), provides that "No

publication or other disclosure of statistics or other information

required or authorized by subsection (a) or special statistical

study authorized by subsection (b) shall in any manner permit the

statistics, study or any information so published, furnished, or

otherwise disclosed to be associated with, or otherwise identify,

directly or indirectly, a particular taxpayer.".2 .3

     c. Social Security Administration.-Regulation Number 1,

promulgated under Section 1106 of the Social Security Act, deals

with "Disclosure of Official Records and Information." Until

recently, Section 401.3(k) of Regulation 1 provided that

"Statistical data or other similar information not relating to any

particular person which may be compiled from records regularly

maintained by the Department may be disclosed when efficient

administration permits."

     d. Law Enforcement Assistance Administration.-The Crime Control

Act of 1973, in Section 524(a), provides that "Except as provided by

Federal law other than this title, no officer of the Federal

Government, nor any recipient of assistance under the provisions

of this title shall use or reveal any research or statistical

information furnished under this title by any person and

identifiable to any specific private person for any purpose other

than the purpose for which it was obtained in accordance with this

title."  The regulations implementing this Act (Law Enforcement

Assistance Administration, 1976) defined "information identifiable

to a private person" as "information which either-

     (1)   Is labelled by name or other personal identifiers,

           or

     (2)   Can, by virtue of sample size or other factors, be

     reasonably interpreted as referring to a particular private

     person."

     e.    National Center for Health Statistics.-Public Law 93-353,

Section 308(d) provides that "No information obtained in the course

of activities undertaken or supported under Section 304, 305, 306,

or 307 may be used for any purpose other than the purpose for which

it was supplied unless authorized

 

_________________________________

 

    .1 The section which sets forth the FOI exemptions.

    .2 This section became effective January 1, 1977.

    .3 Subsection (a) authorizes annual or more frequent publication

     of "statistics . . . with respect to the operations of the

     internal revenue laws." Subsection (b) authorizes the

     performance of "special statistical studies and compilations

     involving return information" for others on a reimbursable

     basis.

    .4 Passage of the Government in the Sunshine Act, referred to

     earlier, brought about the need for substantial revision of

     Regulation 1.  Pending final adoption of the revised Regulation

     1, the Social Security Administration is operating under an

     interim version which does not deal explicitly with this

     question.

 

under regulations of the Secretary; and (1) in the case of

information obtained in the course of health statistical activities

under Section 304 or 306, such information may not be published or

released in other form if the particular establishment or person

supplying the information or described in it is identifiable unless

such establishment or person has consented . . ."

     The common element in these and other agency statutes and

regulations is the prohibition of the release of information that

can be associated with or identified to a particular statistical

unit.  In some cases the prohibition is limited to information about

private individuals; in others, it extends to information for legal

persons, such as businesses.

 

4. Advisory Committee Reports

 

     a. The President's Commission on Federal Statistics (1971).-

Recommendations on privacy and confidentiality appear in Chapter 7

of the Commission's Report.  Recommendation 7-4 says, in part, "use

of the term 'confidential' should always mean that: a. Disclosure of

data in a manner that would allow public identification of the

respondent or would in any way be harmful to him is prohibited."

     b. The HEW Secretary's Advisory Committee on Automated Personal

Data Systems.-Chapter 6 of the Committee's Report (U.S. Department

of Health, Education, and Welfare, 1973) deals with "Special

Problems of Statistical-Reporting and Research Systems." In this

chapter, the Committee recommends new Federal legislation

protecting against compulsory disclosure.  One of the features

recommended for the legislation was: "The protection should be

limited to data identifiable with, or traceable to, specific

individuals.  When data are released in statistical form, reasonable

precautions to protect against 'statistical disclosure' should be

considered to fulfill the obligation not to disclose data that can

be traced to specific individuals."

     A footnote to this paragraph provides a definition of

statistical disclosure from an article by Fellegi (1972).  "This is

a risk that arises when a population is so narrowly defined that

tabulations are apt to produce cells small enough to permit the

identification of individual data subjects, or when a person using

a statistical file has access to information which, if added to

data in the statistical file, makes it possible to identify

individual data subjects." 

     c. The American Statistical Association Ad Hoc Committee on

Privacy and Confidentiality (1977).-The Committee's report includes

several recommendations on "Release of statistical summaries and

microdata without identifiers." The first of these recommendations

is:

     "1.  General public releases of statistical summaries and

microdata files based on either administrative or statistical data

sources should be permitted without restrictions or conditions

provided that:

 

     (a)   All identifying particulars, such as name, address and

Social Security number, have been removed, and 

     (b)   It is virtually certain that no recipients can identify

specific individuals in the files."

     For microdata files which do not meet condition (b) of this

recommendation, the Committee recommends release for research and

statistical purposes only under certain conditions, one of which is

that the recipient agrees "Not to release any tabulations or other

information that would make it possible for others to identify

specific individuals."

 

     d. The Privacy Protection Study Commission (PPSC).-The

Commission's final report was issued in July 1977 (PPSC, 1977).

Chapter 15, entitled "The Relationship Between Citizen and

Government: The Citizen As Participant in Research and Statistical

Studies," includes several recommendations and policy guidelines

relating to the collection, use and disclosure of information about

individuals (natural persons) in "individually identifiable form"

for research and statistical purposes.

     The report defines "individually identifiable form" as "any

material that could reasonably be uniquely associated with the

identity of the individual to whom it pertains" (PPSC, 1977:572). 

Thus, it is clear that the Commission was fully aware of the

problem of statistical disclosure, and, in fact, in a section of

Chapter 15 on "Procedures to Protect Confidentiality" (PPSC,

1977:583-7), there are brief references to the work of this

Subcommittee and to several of the disclosure-avoidance techniques

discussed in this report.

     Recommendation (6) in Chapter 15 (PPSC, 1977: 587) is "That

the National Academy of Sciences, in conjunction with the relevant

Federal agencies and scientific and professional organizations, be

asked to develop and promote the use of statistical and procedural

techniques to protect the anonymity of an individual who is the

subject of any information or record collected or maintained for a

research or statistical purpose."

     The text immediately preceding this recommendation makes it

clear that techniques to avoid statistical

disclosure (at least in its "exact" sense) are intended to be

included in the recommended program of activities by the Academy

and other organizations.

 

B. Evaluation of Statutory Requirements

 

     Statutory prohibitions on disclosure are expressed in absolute

terms.  Thus, the Privacy Act refers to disclosure of a record "in

a form that is not individually identifiable." The Census Title 13

prohibits "any publication whereby the data furnished by a

particular establishment or individual under this title can be

identified."

     If these statutory restrictions were interpreted literally,

the flow of statistical data from the Federal Government would be

stopped or drastically reduced.  In a broad sense, any release of

statistical tabulations reveals some information, at least in an

approximate or probabilistic sense, about every individual known to

be included in those tabulations.  When a microdata file containing

numerous items of information about each individual is released, it

is virtually certain that many of the records will display

combinations of characteristics not possessed by more than one

individual in the population, and therefore will be potentially

identifiable through matching with data that might be available

from other sources.
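
     As a rough illustration of this point, the following Python
sketch lists the records in a small file whose combination of
characteristics is shared by no other record.  The records, the
field names and the function name are invented for the example.
Uniqueness within a file is only an indication of risk: a record is
potentially identifiable in the sense described above only if its
combination of characteristics is also unique in the population.

```python
from collections import Counter


def unique_records(records, keys):
    """Return the records whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]


# Invented microdata: no names or identifiers, only characteristics.
records = [
    {"county": "A", "age": 67, "occupation": "farmer"},
    {"county": "A", "age": 67, "occupation": "farmer"},
    {"county": "A", "age": 71, "occupation": "teacher"},
    {"county": "B", "age": 67, "occupation": "farmer"},
]

unique = unique_records(records, ["county", "age", "occupation"])
print(len(unique), "of", len(records), "records are unique in the file")
```

Adding more items of information to each record can only increase
the number of unique combinations, which is why detailed microdata
files raise the concern described above.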

     In practice, what is clearly expected on the part of agencies

releasing statistical data is an effort to keep the probability of

disclosure, however defined, at a very low level.  Three of the

advisory groups cited above confirm this view of the question. 

Thus, the HEW Committee called for "reasonable precautions to

protect against statistical disclosure"; the ASA Committee

recommended unrestricted release when "it is virtually certain that

no recipients can identify specific individuals in the files"; and

the Privacy Protection Study Commission used the word "reasonably"

in defining "individually identifiable form."  We may also note

that the LEAA regulation uses the word "reasonably" in this context

whereas the statute did not include any such qualifying term.

     This interpretation of statutes, regulations and recommended

policies which prohibit disclosure leads to an important

conclusion, i.e., that they do not in themselves provide a clear

basis for deciding in any particular case whether data should or

should not be released.  The decision on release calls for more

specific rules and guidelines.  If such rules and guidelines do not

exist, then each case will be a judgment call by the responsible

official.

     A major objective of this Subcommittee has been to determine

what rules, guidelines and other criteria are being used by Federal

agencies to avoid statistical disclosure; to review and evaluate

these materials; and to make its findings widely available for the

benefit of statisticians and others who must make decisions on what

data to release, and on what terms.

 

C. Prior Definitions of Statistical Disclosure 

 

     We have seen that, without exception, laws and regulations do

not provide a sufficiently precise definition of disclosure for

operational use in determining what tabulations and microdata files

are releasable.  We have also reviewed the literature on the

subject of statistical disclosure found in journals, reports and

other publications.  There we have found several attempts at a more

precise definition.  These are all helpful, but none of them seems

to be broad enough to cover all the kinds of statistical disclosure

problems met with in practice.

 

     Fellegi (1972) defines "inadvertent direct disclosure

(i.d.d.)" as "disclosure of information on an individual who can be

identified through his characteristics." He goes on to say that

such disclosure "occurs when a user can identify a respondent by

recognizing him through his characteristics and learning something

about him." In other words, this kind of disclosure only occurs

when two things happen:

 

     1.    The user recognizes an individual member of a population

included in a tabulation or microdata file.

 

     2.    The user learns something about that individual that he

did not know from another source.

 

Many more casual definitions of disclosure include only the first

element.

 

Fellegi does not say whether the information learned must be the

exact value of some characteristic, or whether the disclosure can

be in the form of a range, or a probability statement about the

value in question.  Hansen (1971) distinguishes between

"exact" and "approximate" disclosure, the latter term being used

for the case where a value for a particular individual is disclosed

to be within some specified range.

 

Fortunately, there is now available, in a report by Dalenius (1977),

a mathematical treatment of the concept of statistical disclosure

which we believe provides an adequate framework for discussion of

all aspects of statistical disclosure.  Dalenius has kindly agreed

to the inclusion of this material in our report.

 

D.   A Proposed New Definition of Statistical

     Disclosure

 

The reader is asked to keep in mind that the concept of disclosure

presented here is a very broad one.  It would not be desirable to

require that there be a zero risk of disclosure, as defined below,

in any release of tabulations or microdata files.  Such a re-

quirement would end a large proportion of all releases now being

made.  This would be too great a price to pay for complete

elimination of any risk of disclosure.

 

     The material which follows in sections D1, D2 and D3 is

presented verbatim from Dalenius' report, except for a few changes

in terminology to conform with the language and structure of this

report.

 

1.   The Insufficiency of Prevailing Definitions

 

     Statistical disclosure is used in the literature in a

way which parallels its use in nonstatistical contexts.  Thus, in

Webster's Third New International Dictionary, "disclosure" is

defined as:

           (1)  the act or an instance of opening up to view,

                knowledge or comprehension.

 

           (2)  something that is disclosed.

This definition is, indeed, general; it is by and large consistent

with definitions of disclosure in the context of releases of

statistical results.  As an example, Title 13, U.S. Code, Section

9-a-2, gives an implicit definition of disclosure; it states that

there shall not be:

           ". . . any publication whereby the data furnished

           by a particular establishment or individual under this

           title can be identified."

 

     The definition just quoted is less general than the definition

taken from Webster's dictionary, by making identification of the

object(s) concerned an element of the definition.  While this is

indeed a crucial difference, it does not make the resulting

definition sufficiently specific to serve as a basis for

regulations and/or procedures aiming at disclosure control; it does

not easily and unambiguously lend itself to implementation.

 

In sections D2 and D3 an effort will be made to deal with the

conceptual problem thus presented.

 

2.   A Framework for Defining "Statistical Disclosure"

 

     "Statistical disclosure" is used here in accord with the use

     of this term in the context of releasing statistics from a

     survey..3  In line with this notion of disclosure, the following

     four components are used to provide the conceptual framework

     called for:

 

           a.   A frame comprising certain objects

           b.   Data associated with these objects

           c.   Statistics released from a survey

           d.   Extra-objective data

 

           (a)  The frame

                Consider a set of identifiable objects, to be

                referred to as the total population and denoted by

                T. In a typical case, T may be "all Swedish

                citizens." The survey concerns a subset of this

                total population, viz. that subset which is

                accessible by means of a certain frame; for

                convenience, this subset will be denoted by F. In a

                specific case, F may be "Swedish citizens living in

                Sweden." The complementary subset, i.e., the subset

                made up of objects in T which are not in F, is

                denoted by F̄. Thus, T is the "union" of F and F̄.

 

 


 

                

                In the case of a sample survey, it may prove useful

                to make an additional distinction, viz. between

                objects selected for the sample, Fs, and those in F

                not selected, F̄s.
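
                The relationships among the total population, the
                frame and the sample can be written out with sets.
                The Python sketch below is illustrative only: the
                object labels are invented, and the names F_bar, F_s
                and F_s_bar are stand-ins for the complement of F in
                T, the objects selected for the sample, and the
                objects in F not selected.

```python
# Total population T and the part of it reachable through the frame, F.
T = {"o1", "o2", "o3", "o4", "o5", "o6"}
F = {"o1", "o2", "o3", "o4"}

# Complement of F within T: objects the frame cannot reach.
F_bar = T - F

# In a sample survey, F splits into selected and non-selected objects.
F_s = {"o1", "o3"}
F_s_bar = F - F_s

print(sorted(F_bar), sorted(F_s_bar))
```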

 

           (b)  Data associated with the objects in the

                frame

                With each object in F, we associate data, which serves

                three different functions:

 

           i.  Identifying function:

                We will denote the data serving this function by the

                identifier I. In a specific case, I may appear as a

                (registration) number, or as name and street

                address.

 

-------------

.3 The Dalenius text uses the word "survey" in its broad sense to

include a census or other data collection covering the total

population.  For purposes of this report, the definition may also

be applied to the release of statistics based on administrative or

program records.

 

ii. Classifying function:

     For purposes of presenting the "details" of the statistics to

     be released, the objects in F will be associated with certain

     subsets defined by reference to some classifier C.  In a

     specific case, C may appear as a "code" identifying a subset of

     F, for example a subset defined with reference to the sex and

     age of the objects in F.

 

iii. Information function:

 

     The survey is carried out in order to provide information in

     terms of certain "survey characteristics" X, Y, . . ., Z.

     For the object Oj (j = 1, . . ., N), the values of these

     characteristics are denoted by Xj, . . ., Zj.  Typically but not

     exclusively, these values may be in the nature of counts or

     magnitudes.

 

     It may be worth noting that some data may serve more than one

     of these 3 functions in one and the same survey.

 

(c)  The statistics released from the survey

 

     The objective of a survey is expressed in terms of some

     population and some data C and X, Y, . . ., Z.  In order to

     achieve this objective, the statistics S are released.  We will

     focus on two different kinds of statistics:

 

     i.    statistics for sets of objects: "macrostatistics";

           typically, the format of a report is used as the means

           of releasing the statistics.

 

     ii.   statistics for individual objects: "microstatistics";

           typically, the format of a micro-data tape is used as

           the means of releasing the statistics.

 

 

 

     We will elaborate upon the above distinction in sections (1)

     and (2) below.

 

     (1)   Macrostatistics

 

           In the case of macrostatistics, the statistics (counts,

           magnitudes, etc., as the case may be) concern

           aggregates of the individual values of the survey

           characteristics belonging to the respective sets.  The

           following tables are two cases in point:

 

                                    

 

           These tables-while featuring the characteristics of real

           life statistics-are admittedly "small."

 

(2)  Microstatistics

 

     In this kind of statistics, the individual values observed

     with respect to the characteristics X, Y, . . ., Z (possibly

     in conjunction with the associated classifiers) are released. 

     The identifiers, however, are not released.  The following

     excerpt from U.S. Bureau of the Census (1976) is illustrative

 

 

           iii. The statistics released from the survey: S

           iv.  The extra-objective data: E

 

3.   Statistical Disclosure Defined

 

     We will now suggest a definition of disclosure 

within the conceptual framework presented in section 2.

     Thus, consider an object Ok in the total population T. This

object may be a member of F, or it may be a member of F̄ (the

complement of F).  We introduce a characteristic D which may be one

of the survey characteristics X, Y, . . ., Z; or it may be some

other characteristic.  For the object Ok, this characteristic

assumes the value Dk.  It is helpful to consider two special cases:

 

           i.   Dk = 1 if Ok has a certain property; otherwise

                Dk = 0.

           ii.  Dk is measured on a ratio scale: it is

                expressed as a magnitude.

     If the release of the statistics S makes it possible to

determine the value Dk more accurately than is possible without

access to S, a disclosure has taken place; more exactly, a

D-disclosure has taken place.  In a specific case, this

D-disclosure may be an X-disclosure, or a Y-disclosure, etc.

     The definition just given applies to both releases of

macrostatistics and release of microstatistics.  Examples of

disclosure for the former case may be found in Chapter III and for

the latter case in Chapter IV.
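
     One way to make "more accurately" concrete for a magnitude is
to compare the range of values an outsider can place on an
individual before and after a release.  The Python sketch below is
an illustration under stated assumptions, not part of the Dalenius
text: it assumes every member's value is known in advance to lie
between a lower and an upper bound, that the released statistic S
is the total for a cell of n members, and that a narrower interval
counts as a more accurate determination.  The function names and
the figures are invented.

```python
def feasible_interval(lower, upper, total=None, n=None):
    """Range of values one cell member can take.

    Without a released total the range is [lower, upper].  With a
    released total over n members, the other n - 1 members together
    hold between (n - 1) * lower and (n - 1) * upper, which tightens
    the range for the member of interest.
    """
    if total is None:
        return (lower, upper)
    lo = max(lower, total - (n - 1) * upper)
    hi = min(upper, total - (n - 1) * lower)
    return (lo, hi)


def is_d_disclosure(before, after):
    """True if the release narrows the interval of possible values."""
    return (after[1] - after[0]) < (before[1] - before[0])


before = feasible_interval(0, 100)                    # no statistics released
after = feasible_interval(0, 100, total=290, n=3)     # cell total released
print(before, after, is_d_disclosure(before, after))
```

Under this reading almost any release narrows some interval a
little, which is consistent with the remark in section D that a
zero risk of disclosure, so defined, is not a workable requirement.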

 

 


                                                              CHAPTER III

 

                Disclosure in the Release of Tabulations

                     (Summary Data) for Public Use         

 

A.  The Problem of Disclosure in Tabulations: Typology,

Identification and Examples

 

     The problem of disclosure in tabulations will now

be discussed.  A typology will be listed; ways to identify the

various types of disclosure, together with appropriate examples,

will be provided.

     The definitions of different kinds of disclosure used in this

section are very broad.  Not all of these kinds of disclosure need

necessarily be avoided in all tabulations.  The issues involved in

determining what kinds of disclosure are acceptable in a particular

situation are discussed in section B2 of this chapter.

     Our study of the literature on this subject did not reveal any

generally accepted definitions of various types of disclosure.  The

proposed classifications which follow represent an effort to

develop a comprehensive and logical description of different types

of disclosure.  Suggestions for improvement will be welcomed.

     Disclosure will be studied both for tabulations involving

count (frequency) data and for those containing quantity

(magnitude) data.  Tables 1 and 2 show examples of count data and

quantity data, respectively.

 

           Table 1.-Number of beneficiaries by county and age

 

                                Age class

County     Under 65        65-69     70-74      75 & over       Total

 

A-----          3            15        11            8             37

B-----          7            60        34           20            121

C-----          -             4         -            -              4

 

 

            Table 2.-Average benefit amount by county and age

 

                                Age class

County     Under 65        65-69           70-74           75 & over

 

D          $63.30          $94.30          $85.20          $79.60

E           62.40           89.90           81.80           72.40

F           59.80           92.40           80.40           77.60

 

 

 

1.   Exact Disclosure

 

     a. Count data.-For tabulations involving counts of persons,

establishments, etc., exact disclosure is said to occur when a

respondent known to be a member of a set (marginal total) can be

determined to be a member of a proper subset (cell).  For the

disclosure to be exact, this proper subset or detail cell must be

defined as narrowly as possible.  The detail cell must consist of

respondents all having one of the basic, elementary values

available from the records of the characteristic defining the cell:

single year of age, nearest dollar amount of benefit, a single race

category, etc.  Table 3 shows that all beneficiaries in County B

are black-an example of exact disclosure.

 

           Table 3.-Number of beneficiaries by county and race

 

                                  Race

County          White      Black           Other           Total

 

A----           15         20              5               40

B----            0         30              0               30

 

     On the other hand, the inference from Table 4 that no

beneficiary in County B is white is not called exact disclosure

because the subset of black or other beneficiaries is not as

narrowly defined as possible from the records on which the

tabulation is based.

 

           Table 4.-Number of beneficiaries by county and race

 

                                  Race

 

County          White      Black           Other           Total

A----            15         20              5               40

 

B----             0         28              2               30

 

     Similarly, the fact that the ages of all beneficiaries in

County C of Table 1 can be restricted to the interval 65-69 does

not constitute exact disclosure as defined here because the age

interval defining the detail cell does not represent a single year

of age.

     In summary, exact disclosure from count data can be identified

as follows: A marginal total (in the

dimension n-1) of an n-dimensional cross tabulation equals one of

its detail cells; this detail cell is as narrowly defined as

possible.
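
     The identification rule just stated lends itself to a
mechanical check.  The Python sketch below is an illustration only;
the function name is invented, and the data are the counts of
Table 3.  It reports the detail cells of a row that equal the row's
marginal total.  Whether such a cell is "as narrowly defined as
possible" cannot be read from the table itself and remains a
judgment about the underlying records.

```python
def cells_equal_to_total(row):
    """Return the labels of detail cells whose count equals the row total."""
    total = sum(row.values())
    if total == 0:
        return []
    return [label for label, count in row.items() if count == total]


# Counts from Table 3 (beneficiaries by county and race).
table3 = {
    "A": {"White": 15, "Black": 20, "Other": 5},
    "B": {"White": 0, "Black": 30, "Other": 0},
}

flagged = {county: cells_equal_to_total(row) for county, row in table3.items()}
print(flagged)
```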

     b. Magnitude data.-Exact disclosure from magnitude data can

occur as a result of the publication of the value of a quantity

corresponding to a cell with only one member.  For example, the

total sales for the single establishment in Industry B is disclosed

by Table 5.

 

                    Table 5.-Total sales, by industry

 

Industry        No. of establishments           Total sales

 

A----                      18                   $450,000,000

B----                       1                   $125,000,000

 

A second type of exact disclosure from magnitude data occurs when

auxiliary information concerning the possible numerical values of

the characteristic under consideration can be used to determine the

exact quantity for every member of a given cell.  For example,

consider the situation presented below:

 

               Table 6.-Average monthly benefits, by State

 

                                                Average Monthly

State           No. of beneficiaries                 benefit

 

A----                   4                            $158

B----                  36                            $190

 

     If the maximum possible monthly payment to any beneficiary

under the program studied in Table 6 is $190, then the user will

know that each person in State B receives precisely $190.  However,

the exact value of the payment to any beneficiary in State A is not

disclosed.

     In summary, exact disclosure of the first type from quantity

data is identified by the publication of the numerical value of a

characteristic corresponding to a cell with one member.  Exact

disclosure of the second type from magnitude data is identified by

the following equalities:

 

     A = L, equivalently T = LN 

or

     A = U, equivalently T = UN, 

where

 

A is the average and T is the total value among all N members in a

cell, and U and L are the maximum and minimum possible values,

respectively, for any member in the cell.
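
     Both types of exact disclosure from magnitude data can be
tested directly from the published figures.  The Python sketch
below is illustrative; the function name and its return strings are
invented, L is assumed to be zero unless stated otherwise, and U is
treated as unknown unless supplied.  The examples use the figures
of Tables 5 and 6, with the $190 maximum taken from the discussion
of Table 6.  Whole-dollar amounts are used so that the equality
tests are exact.

```python
def exact_magnitude_disclosure(n, average, lower=0, upper=None):
    """Return the type of exact disclosure for a published cell, or None.

    n       -- number of members in the cell (N)
    average -- published average (A); for a published total T use T / N
    lower   -- minimum possible value for any member (L)
    upper   -- maximum possible value for any member (U), if known
    """
    if n == 1:
        return "first type: single-member cell"
    if average == lower or (upper is not None and average == upper):
        return "second type: A equals L or U"
    return None


industry_b = exact_magnitude_disclosure(1, 125_000_000)   # Table 5
state_a = exact_magnitude_disclosure(4, 158, upper=190)   # Table 6
state_b = exact_magnitude_disclosure(36, 190, upper=190)  # Table 6
print(industry_b, state_a, state_b, sep="\n")
```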

 

2.   Approximate Disclosure

 

     a.    Count data.-When all members of a total belong to one

detail cell, the disclosure is approximate if the detail cell is not

as narrowly defined as possible; otherwise, the disclosure is exact.

     When all members of a total can be restricted to a proper

subset of detail cells, there is approximate disclosure because it

is disclosed that no member of the marginal total belongs to any of

the empty cells.

     Table 1 allows the user to restrict the age of each

beneficiary in County C to the interval [65, 69].  Table 4 does not

exactly specify the race of any person, but it shows that the race

of each beneficiary in County B is either black or other, not

white.

     Both of the above examples illustrate approximate disclosure

from count data.

     Approximate disclosure from count data can be defined and

identified as follows: A marginal total (in the dimension n-1) of

an n-dimensional cross tabulation equals one of its detail cells,

or the sum of a proper subset of detail cells (equivalently, the

value of one or more detail cells is zero); but the disclosure is

not exact.
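     The identification rule for count data can be sketched as
follows.  The function name, the "narrowest" flag, and the example
counts are hypothetical; whether a detail cell is "as narrowly
defined as possible" is a judgment the agency must supply.

```python
def classify_count_row(detail_counts, narrowest=False):
    """Classify one marginal total by the pattern of its detail cells.

    detail_counts maps each detail category to its published count; the
    marginal total is taken to be their sum.  `narrowest` states whether
    the detail categories are as narrowly defined as possible.
    """
    total = sum(detail_counts.values())
    occupied = [k for k, v in detail_counts.items() if v > 0]
    if total == 0 or len(occupied) == len(detail_counts):
        return "none"  # no detail cell is empty, nothing is ruled out
    if len(occupied) == 1 and narrowest:
        return "exact"
    return "approximate"


# Hypothetical rows, patterned on the race and age examples in the text.
race_row = {"white": 0, "black": 14, "other": 1}
age_row = {"under 65": 0, "65-69": 12, "70 and over": 0}
```

Both hypothetical rows are classified as approximate disclosure
unless the occupied cell is declared to be narrowest.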

     b.    Magnitude data.-In a broad sense the publication of a figure

for quantity always permits the user to estimate, however crudely,

the value of the characteristic corresponding to a given member of the

cell.  For example, the monthly benefit for each of the four

beneficiaries in State A of Table 6 must be less than $632. 

Further, the total sales of each establishment in Industry B of

Table 7 can be placed inside the interval [0, 125,000,000].

 

                    Table 7.-Total sales, by industry

 

Industry                   No. of establishments                Total

 

A----                              18                     450,000,000

 

B                                   5                     125,000,000

 

 

     Often, the information provided in cases such as the above will

not be sufficiently accurate or sensitive to require corrective

measures.  However, if the number of members in the cell is

sufficiently small the interval of possible values for the quantity

associated with a particular individual will be narrow enough to be

considered a disclosure problem (Cox, 1976).

     With the assumption that all values for quantity are nonnegative,

the interval of possible values of a characteristic for a particular

cell member is [0, T] if the total, T, is published; equivalently

the interval is [0, NA] if the average, A, and cell size, N, are

published.

     Sometimes auxiliary information obtained from sources external

to the summary data under consideration

can enable the user to estimate the value of an unpublished

quantity more accurately.  For example, if an employment

distribution shows that all establishments in Industry B of Table 7

have approximately the same number of employees, the user can

estimate a value $25,000,000 for the sales of each establishment. 

In the same vein, if it is known from another data source that the

largest establishment of the five employs 80 percent of all workers

in Industry B, a reasonable estimate for total sales for that

establishment would be $100,000,000.

     In some situations, auxiliary information admitting more

accurate approximation to values of aggregate data can be obtained

from external sources other than statistical tabulations.  In

particular, legal requirements used in conjunction with summary

data may determine narrow upper and lower limits for the value of a

quantity for an individual respondent.

     For example, in Table 6 if the maximum benefit is $192, then

it can be shown that each individual person in State B must receive

at least $120-a restriction of each beneficiary's payment inside a

range of values unknown prior to publication of the data.
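     The $120 figure follows from simple arithmetic: if the other 35
beneficiaries each received the maximum, the remainder of the total
is what is left for the last one.  A minimal sketch of this bound is
given below; the function name and the minimum benefit of $0 are
assumptions.

```python
def member_interval(total, n, lower, upper):
    """Range of values one member can take, given the cell total.

    The other n - 1 members contribute at least (n - 1) * lower and at
    most (n - 1) * upper, which bounds what is left for this member.
    """
    low = max(lower, total - (n - 1) * upper)
    high = min(upper, total - (n - 1) * lower)
    return low, high


# Table 6, State B: 36 beneficiaries averaging $190, maximum $192.
# The minimum of $0 is an assumption of this sketch.
state_b = member_interval(36 * 190, 36, lower=0, upper=192)
print(state_b)  # (120, 192)
```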

     In general, if maximum and minimum values of the

characteristic in question are known, such disclosure will occur

under the following conditions:

 

 

[Conditions shown as a graphic in the original; not reproduced in this copy.]

 

where A is the average and T is the total value among all N members

in a cell, where N > 1; U and L are the maximum and minimum possible

values, respectively, for any member in the cell; and P, where 0 <

P < 1, specifies the relative size of the interval chosen to define

disclosure of the value of the characteristic under consideration. 

For example, if disclosure is defined as knowing that the value for

an individual lies within a quarter of the range (U-L) then P =

.25.
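     The report gives these conditions in a graphic that is not
reproduced in this text.  The sketch below is therefore an assumption
and not the report's formula: it flags a cell when the feasible
interval for a single member, computed from T, N, U and L, is
narrower than P(U-L).  The function name and the strict inequality
are choices made for illustration.

```python
def approx_disclosure(total, n, lower, upper, p):
    """Flag a cell whose members can be placed in an interval of
    width less than p * (upper - lower).  Assumed criterion."""
    low = max(lower, total - (n - 1) * upper)
    high = min(upper, total - (n - 1) * lower)
    return (high - low) < p * (upper - lower)


# Table 6, State B, with a maximum of $192 and an assumed minimum of $0.
# The feasible interval is [120, 192], a width of 72 out of a range of 192.
flag_quarter = approx_disclosure(36 * 190, 36, 0, 192, p=0.25)
flag_half = approx_disclosure(36 * 190, 36, 0, 192, p=0.5)
```

With P = .25 the cell is not flagged under this assumed criterion;
with P = .5 it is.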

     Finally, in some instances better approximations for the

quantity data of an individual respondent can be computed by a user

with precise information about a subset of members of the cell. 

This type of disclosure is discussed later in this chapter (see

A.5: "Internal Disclosure") and in Appendix C.

 

3.   Probability-Based Disclosures (Approximate or

     Exact)

 

     Sometimes although a fact is not disclosed with certainty, the

published data can be used to make a statement which, within the

framework of an implied probability model, has a high probability

of being correct.  For example, in Table 8 it is very likely that a

given beneficiary in County B has a monthly income in excess of

$2,000.

 

                Table 8.-Monthly income of beneficiaries

 

                      Number of persons with income

County          Under $1000          $1000-$2000     Over $2000

 

A----                70                    60                   65

 

B----                10                    20                  230

 

C----                30                    50                   40

 

Similarly, from Table 4, in the absence of other information, we

might assign a probability of 0.93 that a person known to be a

beneficiary in County B is black.

     Identification of probabilistic disclosure can be described as

follows:

 

                    D < S P1   or   D > S P2

 

where

     D is the number of members in the detail cell, 

     S is the number of members in the total cell,

     P1 is the smallest permissible proportion of members in a

     detail cell among all members belonging to the marginal total, and

     P2 is the largest permissible proportion of members in a detail

     cell among all members belonging to the marginal total.

     As was the case for approximate disclosure for aggregates, the

appropriate values of P1 and P2 in a particular case must be

determined by the agency releasing the tabulations.  In many cases,

the agency may not consider it necessary to avoid probabilistic

disclosure at all; in such cases, we would set P1=0 and P2=1.
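     A minimal sketch of this test follows.  The thresholds P1 = .05
and P2 = .85 are hypothetical values chosen only to illustrate the
rule; the report leaves their choice to the releasing agency.

```python
def probabilistic_disclosure(d, s, p1, p2):
    """True if detail count d is too small or too large a share of total s."""
    return d < s * p1 or d > s * p2


# Table 8, "Over $2000" column, with hypothetical thresholds.
county_b = probabilistic_disclosure(d=230, s=10 + 20 + 230, p1=0.05, p2=0.85)
county_a = probabilistic_disclosure(d=65, s=70 + 60 + 65, p1=0.05, p2=0.85)
# Setting P1 = 0 and P2 = 1 switches the test off, as noted in the text.
disabled = probabilistic_disclosure(d=230, s=260, p1=0.0, p2=1.0)
```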

 

4.   Indirect Disclosure 

     Up to this point, the examples concerning exact, approximate,

and probabilistic disclosure have involved information provided

directly by published figures.  This type of disclosure is said to

be direct.

     However, information can often be derived by algebraic

manipulation and/or logical operations performed upon data obtained

from different tables based on the same data.  If the publication

of a

derived figure would result in one of the types of disclosure

discussed above, then indirect (exact, approximate, or

probabilistic-whichever is appropriate) disclosure is said to

occur.

 

              Table 9.-Number of persons with hospital and

                    medical coverage, by age and sex

                                           Hospital & Medical coverage

 

Age                        Male            Female               Total

 

Under 65---                1,714           1,820                3,534

65-74----                  1,517           1,630                3,147

75 and over---             1,402           1,510                2,912

     Total----             4,633           4,960                9,593

 

 

           Table 10.-Number of persons with medical coverage, 

                             by age and sex

 

                           Medical Coverage

 

Age             Male            Female                     Total

 

Under 65---     1,719           1,829                      3,548

65-74           1,519           1,630                      3,149

75 and over     1,402           1,510                      2,912

     Total      4,640           4,969                      9,609

 

 

     Neither Table 9 nor Table 10 discloses individual information

directly.  However, by application of algebraic and logical

operations to both tables, it follows that all men 75 and over with

medical coverage have hospital coverage; all women with medical

coverage but without hospital coverage are under 65, etc.
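     The derivation amounts to subtracting Table 9 from Table 10 cell
by cell.  The sketch below uses the published detail cells; the
variable names are ours.

```python
hospital_and_medical = {  # Table 9
    "Under 65": {"Male": 1714, "Female": 1820},
    "65-74": {"Male": 1517, "Female": 1630},
    "75 and over": {"Male": 1402, "Female": 1510},
}
medical = {  # Table 10
    "Under 65": {"Male": 1719, "Female": 1829},
    "65-74": {"Male": 1519, "Female": 1630},
    "75 and over": {"Male": 1402, "Female": 1510},
}

# Derived, unpublished table: medical coverage without hospital coverage.
medical_only = {
    age: {sex: medical[age][sex] - hospital_and_medical[age][sex]
          for sex in medical[age]}
    for age in medical
}
```

The derived table contains zero cells (for example, males 75 and
over), so it discloses what neither published table discloses alone.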

     As a further illustration of indirect disclosure, suppose

Industry A consists of two disjoint subindustries A1 and A2, and

that the following information is available from various tables.

 

Industry                   No. of companies                Total sales

 

A----                              5                       $200,000,000

A1----                             4                       150,000,000

 

 

     By subtraction, the total sales of $50,000,000 is computed for

the one company belonging to Industry A2.

     To identify indirect disclosures, a determination must be made

to see if a logically defined but unpublished cell, which would

itself constitute a disclosure, can be derived from published

cells.  Because data from all sources available to the user must be

considered, this work can get quite involved.  Discussions of this

complex problem are given by Cox (1976) and Fellegi (1972).

 

 

5.   External or Internal Disclosure

 

     Almost all of the above discussion has centered upon external

disclosure, i.e., disclosure to someone who is not a member of the

tabulated cell.  Attention will now be focused upon internal

disclosure-that is, the situation in which members of a group use

their own as well as published data to obtain confidential

information about others in the group.  When some members of a

group collaborate for this purpose, we will refer to this subset as

a "coalition."

     Table 11 furnishes an example of internal disclosure for count

data.  The black worker in County C can determine from the table

that every other employee in his industry and county is white.

 

     Table 11.-Race of workers in industry A, by county

 

County               Total           White           Black

 

A----                144              132              12

B----                238              138             100

C----                 94               93               1

 

     If there were precisely two black workers in County C instead

of one and if they knew each other, they could deduce that all

other employees in their industry and county are white.

     If the maximum possible benefit for each of the beneficiaries

of Table 12 were $140, it would be impossible for a user not

belonging to County B to determine the payment to either person in

that county.  However, either beneficiary could readily compute the

payment to the other person by use of the published cell.

     Further, if one person in County A of Table 12 received a

benefit of $40, he would know that each of the other persons must

receive between $120 and $140.

 

           Table 12.-Number of beneficiaries and average

                           payment amount

 

County                     Number               Average Payment Amount 

A----                        3                             $100

B----                        2                               70
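     The deductions available to a cell member in Table 12 can be
reproduced with the sketch below.  The function name, the minimum
benefit of $0, and the $55 payment used for County B are
assumptions.

```python
def others_interval(average, n, known, lower, upper):
    """Range for any one member outside a coalition.

    `known` lists the values held by the coalition's own members; at
    least one member of the cell must lie outside the coalition.
    """
    rest_total = average * n - sum(known)
    rest_n = n - len(known)
    low = max(lower, rest_total - (rest_n - 1) * upper)
    high = min(upper, rest_total - (rest_n - 1) * lower)
    return low, high


# County A: one beneficiary receiving $40, maximum benefit $140.
county_a = others_interval(100, 3, known=[40], lower=0, upper=140)
# County B: a beneficiary with a hypothetical payment of $55.
county_b = others_interval(70, 2, known=[55], lower=0, upper=140)
```

County A yields the interval from $120 to $140 given in the text.
In County B the interval collapses to a single value, which is exact
internal disclosure.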

 

     Another example of internal disclosure from quantity data is

given by Table 7 which was also discussed in conjunction with

approximate disclosure.  As previously mentioned, by subtracting

the value of its own sales from the published value $125,000,000

an establishment can estimate the value of sales for its

competitors with greater accuracy, perhaps, than they would like.

    Finally, internal probabilistic disclosure can be discussed by

modifying data for County C of Table 11 as follows:

 

Total                White                 Black 

94                    92                     2

 

     If either black employee knows that Mr. X is in his industry

and county, the probability is only 1/93 that Mr. X is black.
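     The figure of 1/93 comes from removing the coalition member from
both the detail cell and the total.  A minimal sketch, with a
function name of our choosing:

```python
from fractions import Fraction


def internal_probability(d, s, x, c):
    """Chance that a member of the total outside the coalition belongs
    to the detail cell, given d and s and a coalition of size c with
    x of its members in the detail cell."""
    return Fraction(d - x, s - c)


# Modified County C: 94 workers, 2 black; one black worker reasons alone.
insider = internal_probability(d=2, s=94, x=1, c=1)
# An outside user (empty coalition) for comparison.
outsider = internal_probability(d=2, s=94, x=0, c=0)
print(insider)  # 1/93
```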

     For the sake of completeness and summarization, the following list

is provided for the identification of the different types of internal

disclosure.  Definitions are analogous to the corresponding ones

for external disclosure.

     a.    Count data (direct or indirect disclosure).-The potential

for internal disclosure is affected by two new factors not relevant

to external disclosure.  The first is the maximum size of coalition

against which protection is believed to be necessary; the second is

the distribution of the coalition members among the data cells to

be protected.  Since there is usually no way of knowing what the

distribution of any particular coalition might be, the conservative

approach in all cases is to protect against the distribution that

would result in the greatest degree of disclosure.

     In the discussion below,

S is the published number of members in the total cell,

D is the published number of members in a detail cell,

C is the maximum coalition size for which protection from

disclosure is considered necessary, and

X is the number of coalition members also belonging to the detail

cell.

Note that the number, X, of members of a coalition of size C which

belong to a detail cell of size D must satisfy the following:

           0 <= X <= minimum (C, D).

 

     (1) Exact disclosure: The difference between the values of a

marginal total and one of its detail cells is equal to the number

of members of a coalition not belonging to the detail cell

(equivalently, S-D = C-X), and the detail cell is as narrowly defined

as possible.  In a plan to guard against such disclosure by

coalitions of size C, the extreme case X = 0 must be considered;

that is, S-D <= C should be avoided in publications.

     (2) Approximate disclosure: There exists at least one non-empty

detail cell entirely contained in a coalition, but the disclosure

is not exact.  For this detail cell we have X = D.  In a plan to

guard against such disclosure by coalitions of size C, D <= C

should be avoided in publications.

     (3) Probabilistic disclosure.-(i) D-X < (S-C) P1, where D, X, S,

and C are as defined previously and P1 is as defined for external

probabilistic disclosure.  In a plan to guard against such

disclosure by coalitions of size C, the extreme case X = C must be

considered; that is, D-C < (S-C) P1 should be avoided in

publications.

     (ii) D-X > (S-C) P2, 

where D, X, S, and C are as defined previously and P2 is as defined

for external probabilistic disclosure.  In a plan to guard against

such disclosure by coalitions of size C, the extreme case X = 0

must be considered; that is, D > (S-C) P2 should be avoided in

publications.
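     The worst-case conditions for count data can be collected into
one check, sketched below.  The function name and the result labels
are hypothetical, and treating the boundary cases S-D = C and D = C
as disclosures is an assumption of this sketch.

```python
def internal_count_risks(s, d, c, p1=0.0, p2=1.0):
    """List the internal disclosure risks for a detail cell of size d
    within a total of size s, against coalitions of size up to c."""
    risks = []
    if s - d <= c:
        risks.append("exact")            # worst case X = 0
    if d <= c:
        risks.append("approximate")      # worst case X = D
    if d - c < (s - c) * p1:
        risks.append("probabilistic-low")   # worst case X = C
    if d > (s - c) * p2:
        risks.append("probabilistic-high")  # worst case X = 0
    return risks


# Table 11, County C, protecting against a coalition of one member.
white_cell = internal_count_risks(s=94, d=93, c=1)
black_cell = internal_count_risks(s=94, d=1, c=1)
# Table 11, County B, white cell, coalition of two.
county_b = internal_count_risks(s=238, d=138, c=2)
```

Whether an "exact" flag is truly exact still depends on the detail
cell being as narrowly defined as possible, which this sketch does
not test.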

     b. Magnitude data (direct or indirect disclosure).

     (1) Exact disclosure: After a coalition of size C adjusts a

published figure by means of its own data, the revised value

involves either type of exact disclosure for magnitude data

described for the external use.  Equivalently, a quantity is

published for a cell of size C + R, containing a coalition of size

C, where one of the following conditions holds:

     (i) R = 1

     (ii) The revised value of the published figure, obtained by

adjusting for the contribution of the coalition, is a maximum or a

minimum possible value determined from external, auxiliary

information, as described earlier for external disclosure.

     (2) Approximate disclosure: With an adjustment of a published

quantity figure by use of information about itself, a coalition of

members of a cell can estimate, more accurately than an outside

user, a quantity value corresponding to a member of the cell

outside the coalition.

     For example, two beneficiaries, each receiving a monthly

benefit of $250 in State A of Table 6, would know that each of the

other two beneficiaries must receive less than $132.

     Given that the (unpublished) values for sales in Industry B of

Table 7 are as shown below:

 

           Establishment                                            Sales

 

1-------------------------------------------                    1,000,000

2-------------------------------------------                    1,000,000

3-------------------------------------------                    1,000,000

4-------------------------------------------                   22,000,000

5-------------------------------------------                  100,000,000

 

it follows that establishments 4 and 5 can obtain objective and somewhat

accurate information about each other (especially if each is aware

of the relative sizes of the other four members of the cell).  In

particular, establishment 5 can deduce that establishment 4 has at

most $25,000,000 in sales.

     In general, if all quantities are nonnegative, the interval of

possible values for a particular cell member outside a coalition is

[0, T - Q], or equivalently [0, NA - Q], where T is the published

total, A is the published average, N is the cell size, and Q is

the value of the quantity for the coalition.

     Finally, if upper and lower limits for the possible value of a

quantity corresponding to an individual respondent are known, then

internal approximate disclosure can be identified as follows for

aggregate data:

 

[Conditions shown as a graphic in the original; not reproduced in this copy.]

 

where

 

A is the published average and T is the published total value for

all N members in the cell,

 

U and L are the maximum and minimum possible values, respectively,

for any member in the cell,

 

P, where 0 < P < 1, specifies the relative size of the interval which

defines disclosure of the value of the characteristic under

discussion,

 

C is the number of members in the coalition, and

 

Q is the unpublished value of the quantity corresponding to members

of the coalition.

 

(3)  Dominance rules and their relation to internal approximate

disclosure of magnitudes: Cell suppression is commonly used as a

technique to avoid exact and approximate disclosures in tabulations

of magnitude data.  Typically, "dominance rules" are established to

determine which cells should be suppressed.  These rules are of the

following general type:

     If n or fewer units account for p percent or more of the cell

     total, the cell must be suppressed.

 

For example, we might say that if 1 or 2 firms account for 80

percent or more of total sales in a particular cell, that cell

should not be published.  One consequence of such a rule would, of

course, be to require that all published magnitude cells be based

on data for 3 or more firms.
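     A dominance rule of this general type can be sketched as below.
The function name is ours; integer arithmetic is used so that the
comparison at exactly p percent is not affected by rounding.

```python
def violates_dominance(values, n, p):
    """True if the n largest values account for p percent or more of
    the cell total, so that the cell must be suppressed."""
    total = sum(values)
    if total <= 0:
        return False
    top = sum(sorted(values, reverse=True)[:n])
    return top * 100 >= p * total


# Unpublished sales assumed for Industry B of Table 7 in the text.
industry_b = [1_000_000, 1_000_000, 1_000_000, 22_000_000, 100_000_000]
suppress_b = violates_dominance(industry_b, n=2, p=80)
# Any cell with only two contributors fails an n = 2 rule.
two_firms = violates_dominance([3, 4], n=2, p=80)
```

Here the two largest establishments hold 122,000,000 of the
125,000,000 total, so the cell would be suppressed under the
80 percent rule.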

     The effect of dominance rules is to limit the precision with

which magnitudes for individual units can be estimated from the

published data by persons who have exact or approximate knowledge

of values for one or more members of the cell.  In particular,

these rules limit the extent of internal approximate disclosure of

magnitude data, as defined earlier in this chapter.

     Further discussion of dominance rules and their relation to

approximate disclosure appears in Appendix C.  

     If a dominance rule is used to determine when a cell magnitude

should not be published, knowledge of the exact rule can make it

possible for a member of the cell to obtain more accurate

information about his competitors than would otherwise be the case. 

This may readily be understood from an example.

     Suppose a published cell shows sales for 1976 of $1,000,000

for 6 companies in a particular industry.  Company A knows that its

own sales in 1976 were $750,000.  If Company A does not know the

dominance rule, it can deduce only that none of the other 5

companies had sales of more than $250,000.  If the dominance rule

is published however, additional information may be available to

Company A. Consider two possibilities:

           1.   The rule is that no cell is published if 1 or 2

                companies account for more than 90 percent of the

                total.  In this case, Company A will know that none

                of its competitors had sales of more than $150,000.

           2.   The rule is that no cell is published if 1 or 2

                companies account for more than 80 percent of the

                total.  In this case, Company A will know not only

                that none of its competitors had sales of more than

                $50,000, but also that each of the 5 other companies

                had sales of exactly $50,000 (since 5 companies must

                account for sales of $250,000, and none of them can

                have sales of more than $50,000).
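     The reasoning available to Company A can be sketched as follows.
Because the cell was published, Company A and any one competitor
together cannot exceed p percent of the total.  Function names are
hypothetical, and the rule is assumed to apply to pairs of companies
(n = 2).

```python
def competitor_bound(total, own, p):
    """Largest value any single competitor can have, given that the
    cell passed a 'more than p percent for 1 or 2 companies' rule."""
    without_rule = total - own
    with_rule = total * p // 100 - own
    return min(without_rule, with_rule)


def forced_value(total, own, n_others, p):
    """Return the common value if every competitor must sit exactly at
    the bound, otherwise None."""
    bound = competitor_bound(total, own, p)
    if bound * n_others == total - own:
        return bound
    return None


bound_90 = competitor_bound(1_000_000, 750_000, p=90)
bound_80 = competitor_bound(1_000_000, 750_000, p=80)
exact_80 = forced_value(1_000_000, 750_000, n_others=5, p=80)
```

Thresholds of 90 and 80 percent give bounds of $150,000 and $50,000
respectively; only the 80 percent threshold forces each of the five
competitors to exactly $50,000.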

 

           B. Evaluating the Disclosure Problem

 

     The definition of statistical disclosure adopted for this

report is, as mentioned earlier, very broad.  While it may not be

feasible to try to avoid completely the possibility of disclosure,

it is imperative to exercise disclosure control.  Doing so calls

for an evaluation as to (1) the level of risk of disclosure

inherent in a proposed publication; (2) the acceptability of that

risk; and (3) the assurances given to persons (data subjects or

others) who provided the information.  In what follows, we will

address these three points.

 

1.   The Level of Risk of Disclosure

 

     We will now identify four factors which determine the risk of

disclosure.  In a real-life situation, it will be necessary to try

to evaluate their combined effect.

     a. The relative size of the sample.-As a first approximation,

the risk of disclosure is smaller for tabulations based on a sample

survey than for tabulations based on a complete survey; and by the

same token, the smaller the sampling fraction, the smaller is the

risk of disclosure.

     

     This evaluation is reasonable when we are dealing with surveys

based on designs characterized by the use of an equal probability

of selection method.  Many large-scale surveys are of this

type.  If the overall sampling fraction (usually denoted by n/N) is

"small," say less than .05, it is less likely that a disclosure

will take place.

 

     If, however, the design does not involve equal probability of

selection, the situation is different; in fact, for some types of

sampling design, the risk of disclosure may be very great for some

large reporting units.  As an illustration, consider the total of a

characteristic with a highly skewed distribution.  An example in

kind is a survey to estimate total production.  In such cases, an

efficient sampling design would call for selecting relatively few

small units.  Disclosure potential would, therefore, be much higher

for the large units than for the small units.

     

     The protection against risk of disclosure afforded by a small

sampling fraction is considerably less where particular reporting

units are, for whatever reason, known to be members of the sample. 

For example, if a sample is selected based on ending digits of

social security numbers, the risk of disclosure is clearly greater

if the digital sampling patterns actually used to select the sample

are known.

     

     Similarly in a two-stage sample, if the identities of the

primary units in the sample are known, then the sampling fractions

within these primary units, rather than the overall sampling

fraction, determine the degree of protection against the risk of

disclosure.  More generally, in multi-stage samples, protection is

a function of the sampling fractions within units known to be in

the sample.

 

     b. The detail provided in the tabulation.-A publication which

provides only "overall" estimates is less likely to generate large

risks of disclosure than a publication which provides detailed

breakdowns of these estimates.

     It is useful to make a distinction between two kinds of

breakdowns, viz., (1) by geography, and (2) by other classifiers.

     If the data are presented for very small areas, the risk of

disclosure is typically larger than for large areas.  It is this

experience which underlies the rules used by the Census Bureau to

provide less detailed tabulations for areas such as census tracts

and city blocks than it does for large areas such as SMSA's.

     If data are published for small "cells" identified in terms of

other classifiers such as age, sex and race (perhaps in combination

with geography), the risk of disclosure may be large: the smaller

the cell, the larger the risk.

     c. The quality of the data.-If the data on which estimates are

based are impaired by non-sampling errors, the risk of disclosure

is smaller than in the case of more accurate data.  This is in

fact why "noise" is sometimes intentionally introduced into

estimates.

     d. Availability of external information.-The existence of

external information-for example, information available through

directories or other institutional records-may make the risk of

disclosure significantly higher than it would be if that

information were not available.

     In a real-life situation, the survey statistician should, when

planning the survey, take these and other factors into account; to

some extent, the risk of disclosure can be controlled by the proper

choice of survey design.  This type of control must, however, be

supplemented by disclosure analysis of the proposed publication.

 

2. The Acceptability of the Disclosure Risk 

 

     The crucial point of the disclosure analysis just

referred to is to determine if a certain risk of disclosure is too

high or too low.  It is too high if it may cause non-negligible

harm to an individual being subject to disclosure, or to the

statistical agency by impairing its ability to collect data in the

future.  It is too low if it unnecessarily reduces the amount of

useful information that can be provided.

     Three factors which may be considered in an effort to

determine whether a certain disclosure risk is acceptable or not

are listed below.

     a.  Sensitivity of data.-Some types of data are clearly more

sensitive than others; it suffices to mention data dealing with

financial matters, health, sexual behavior, and [...].  On the other

hand, some data may, at worst, disclose

something that is entirely obvious or completely innocuous, or

available in public records.

     For many data, the degree of sensitivity may be a decreasing

function of their age.

     b. Possible adverse consequences of disclosure.-This topic is

closely related to the sensitivity of data.  The more sensitive the

data are, the more adverse the consequences of disclosure are

likely to be.

     Clearly the kind of consequences caused by disclosure should

be taken into account in the disclosure analysis.  If the

disclosure of some particular datum may reasonably be expected to

create a social, economic or legal problem, the risk of disclosure

must be kept very small.  Thus, disclosing that someone has been

treated for venereal disease, drinking problems, etc., may generate

such a problem.

 

3.   The Assurances Given to the Respondents 

 

     Consideration must be given to what assurances have been given

to the data subjects or other persons providing information about

uses of the data.  Under no circumstances should such assurances be

violated.

     If the information is definitely non-sensitive and no promise

of confidentiality was given the data subject, then the concern

about possible disclosures would be considerably reduced.

 

     C. Disclosure-Avoidance Techniques

 

     A major goal of statistical agencies is to produce and publish

as many useful and usable statistics as possible for the benefit of

their clients.  The need to avoid the unintentional disclosure of

sensitive information concerning individual persons or organizations

forms a constraint on this endeavor.  The statistical agency,

therefore, must find or develop techniques that will effectively

avoid disclosure while at the same time permitting maximum useful

statistical information to be conveyed.  The agency would also seek

to accomplish this by a method that is both simple and economical.

     Techniques for preventing disclosure through statistical

tabulations fall into three general classes: data suppression,

rolling up data, and disturbing the data.

 

1. Data Suppression

 

     a. Cell suppression.-A data item which, it is determined, could

lead to disclosure may simply be suppressed, i.e., the figure is

omitted and replaced by an asterisk or other symbol which indicates

that the figure is being omitted to maintain confidentiality

for the subjects of the table.  However, care must be taken to assure

that the disclosing figure may not then be deduced by subtraction,

which requires that another figure in the same row and another in

the same column also be suppressed, assuming it is desired that no

changes be made in the row and column totals.  In addition, at

least one more figure would need to be suppressed-the one at the

intersection of the other row and column of the second and third

suppressions-to assure that the other suppressions also cannot be

deduced by subtraction.  Thus, if the row and column marginal

totals are to be left unchanged, it is necessary in a two-way

distribution to suppress at least four figures to avoid a

disclosure.
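     The need for at least four suppressions can be demonstrated by
attempting to recover suppressed cells by subtraction.  The sketch
below is illustrative only: the table values are hypothetical, the
function name is ours, and it models simple subtraction, not the
linear programming analysis that would also bound cells that cannot
be recovered exactly.

```python
def recover_suppressed(cells, row_totals, col_totals):
    """Fill in suppressed cells (None) that follow by subtraction.

    Repeatedly looks for a row or column with exactly one suppressed
    cell and solves for it from the published marginal total.
    """
    grid = [row[:] for row in cells]
    changed = True
    while changed:
        changed = False
        for i, row in enumerate(grid):
            missing = [j for j, v in enumerate(row) if v is None]
            if len(missing) == 1:
                known = sum(v for v in row if v is not None)
                grid[i][missing[0]] = row_totals[i] - known
                changed = True
        for j in range(len(col_totals)):
            col = [grid[i][j] for i in range(len(grid))]
            missing = [i for i, v in enumerate(col) if v is None]
            if len(missing) == 1:
                known = sum(v for v in col if v is not None)
                grid[missing[0]][j] = col_totals[j] - known
                changed = True
    return grid


# Hypothetical 3 x 3 table; the cell with value 1 is the sensitive one.
full = [[1, 20, 30], [15, 25, 35], [40, 45, 50]]
rows = [51, 75, 135]
cols = [56, 90, 115]

three = [[None, None, 30], [None, 25, 35], [40, 45, 50]]
four = [[None, None, 30], [None, None, 35], [40, 45, 50]]
after_three = recover_suppressed(three, rows, cols)
after_four = recover_suppressed(four, rows, cols)
```

With three suppressions every cell is recovered; with the fourth
cell at the intersection also suppressed, subtraction recovers
nothing.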

 

     It is also possible that data in other tables published from the

same body of data may enable one to deduce the suppressed figures. 

Therefore, it is necessary to review all relevant tables to ensure

that they do not contain disclosures and also that through a

process of subtraction or other algebraic operations they do not

enable disclosures to be made, and all necessary suppressions must

be made to avoid the possibility of disclosure.  Cox (1976)

discusses a linear programming technique for exposing cells which

require suppression to avoid disclosure.

     So as to provide maximum consistency the suppression of

certain data items may be made contingent on the acceptability of a

"diagnostic" item.  For example, in economic censuses if sales in a

particular kind of business must be suppressed, then employment,

payroll and certain other figures are automatically suppressed with

it.  This enhances consistency, avoids incidental disclosures, and

reduces costs.

     b. Table suppression.-Many (though not all) disclosure

problems can be avoided inexpensively through the elimination of

all tabulations involving fewer than some minimum number of cases. 

Thus, in the 1971 Census of Population in the United Kingdom, no

tabulations were presented for enumeration districts having fewer

than 25 persons or fewer than 8 households; for such enumeration

districts only the total numbers of persons and households were

given (Newman, 1975:6).  In the 1970 Census, the U.S. Bureau of the

Census suppressed distributions by a particular characteristic for

any universe in which there were fewer than 5 cases (Barabba and

Kaplan, 1975:9).  In guidelines for the Social Security

Administration (1977) it is suggested that separate tabulations for

counties having fewer than 50 beneficiaries be avoided.

     For a general discussion of the use of suppression, see

Sweden, National Central Bureau of Statistics (1974:32-34).  For a

discussion of the use of suppression in the U.S. Bureau of the

Census, see Barabba and Kaplan (1975:7-10).

 

2. "Rolling Up" Data

 

     Problems of confidentiality can frequently be solved by

changing the structure of tables in such a way that the disclosure

possibility is eliminated.  Thus, rows or columns can be combined

into larger class intervals or new groupings of characteristics. 

This may be a simpler solution than the suppression of individual

items, but it tends to reduce the descriptive and analytical value

of the table.

     It may also be expensive in that it might require that a few

tables be customized in a large set of tables, the remainder of

which are produced mechanically in identical formats.  General

discussions of the rolling-up process are to be found in Sweden,

National Central Bureau of Statistics (1974:31-32) and in Social

Security Administration (1977:6-7).

     An indirect but common example of rolling-up exists in data

bases where the Standard Industrial Classification system is used. 

That hierarchical system has 2-, 3- and 4-digit levels providing

successively greater detail.  When data are suppressed at the

4-digit level, the 3-digit level summary provides the benefits of

intermediate rolling-up.
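Rolling up along a hierarchical code can be sketched as follows. This is only an illustration: the codes and values are made up, and the function name is an assumption. It relies on the property described above, that each 3-digit class is the prefix of its 4-digit subclasses.

```python
from collections import defaultdict


def roll_up(values_by_code, digits):
    """Sum values over all codes sharing the same leading `digits` characters."""
    totals = defaultdict(int)
    for code, value in values_by_code.items():
        totals[code[:digits]] += value
    return dict(totals)


# Made-up 4-digit codes and values, for illustration only.
four_digit = {"2011": 120, "2013": 45, "2021": 300}
three_digit = roll_up(four_digit, digits=3)
```

A figure suppressed at the 4-digit level is still reflected in the 3-digit total, which is what makes the higher-level summary usable as an intermediate roll-up.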

     Hansen (1971:51) points out that using broad enough class

intervals may even avoid approximate disclosure (in the terminology

of this report, unacceptable approximate disclosure), for example,

when the upper limit of each interval is at least double the lower

limit.
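As a rough sketch of the interval rule attributed to Hansen, the following Python builds class limits in which each upper limit is exactly double the lower limit, and checks an arbitrary set of limits against the rule. The function names and the starting values are assumptions made for illustration; the rule cannot be applied to an interval whose lower limit is zero, so the sketch rejects that case.

```python
def doubling_intervals(first_lower, ceiling):
    """Return (lower, upper) class limits with upper = 2 * lower.

    Intervals are generated until one reaches or passes `ceiling`.
    """
    if first_lower <= 0:
        raise ValueError("the doubling rule needs a positive lower limit")
    intervals = []
    lower = first_lower
    while lower < ceiling:
        intervals.append((lower, 2 * lower))
        lower *= 2
    return intervals


def meets_doubling_rule(intervals):
    """True if every upper limit is at least double its lower limit."""
    return all(upper >= 2 * lower for lower, upper in intervals)


limits = doubling_intervals(1000, 10000)
```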

 

3. Disturbing the Data

 

     This process involves changing the figures of a

tabulation in some systematic fashion, with the result that the

figures are insufficiently exact to disclose information about

individual cases, but are not distorted enough to impair the

informative value of the table.

     Ordinary rounding is the simplest example.  Figures in a

table may, for example, be rounded to the nearest multiple of 5.

Where the figures involved are very large, this will have little or

no effect on the informative value of the tables.  If all cells in

a table are rounded by the same rules, totals will not always agree

with the sums of the detailed cells.  If this is considered

undesirable, the most detailed cells can be rounded and then added

to obtain totals at various levels.  Ordinary rounding was used for

most tables involving large areas in the 1971 United Kingdom Census

(Newman, 1975:9-10).  Values of 0, 1, or 2 were replaced by

asterisks; percentages were computed from the rounded tables.
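The disagreement between rounded totals and the sums of rounded cells can be seen in a small sketch. The cell values are made up, and the helper name is an assumption. The helper rounds to the nearest multiple of an odd base such as 5, so ties do not arise.

```python
def round_to_multiple(value, base=5):
    """Round a non-negative integer to the nearest multiple of `base`.

    Assumes an odd `base` (such as 5), so exact ties cannot occur.
    """
    return base * ((value + base // 2) // base)


cells = [12, 12, 12]
true_total = sum(cells)

# Every figure rounded by the same rule: the total no longer equals the sum.
rounded_cells = [round_to_multiple(c) for c in cells]
independently_rounded_total = round_to_multiple(true_total)

# Alternative: round only the detailed cells and add them up.
additive_total = sum(rounded_cells)
```

Here the independently rounded total is 35 while the rounded cells add to 30. Building totals from the rounded cells restores agreement within the table, at the cost of a total that lies further from the true figure of 36.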

     There is a growing body of techniques for avoiding disclosure

involving the introduction of random error into the figures to be

published.  For example, in tables relating to small areas prepared

from the 1971 United Kingdom Census, to each figure was added, at

random, -1, 0, or +1, in the ratio 1:2:1.  Enumeration

districts were paired, each having opposite correction factors in

comparable figures, so that the totalled figures from a set of

districts would be accurate, except if there was an odd number of

districts in the set (Newman, 1975:3-8).
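The pairing idea can be sketched as below. This is an illustration under stated assumptions and not the procedure actually programmed for the 1971 Census: the function name and the figures are made up, and the sketch does nothing special about zero cells, which a production system would have to consider.

```python
import random


def disturb_pair(district_a, district_b, rng):
    """Add -1, 0, or +1 (ratio 1:2:1) to each figure of district A and the
    opposite adjustment to the comparable figure of district B."""
    out_a, out_b = [], []
    for a, b in zip(district_a, district_b):
        adjustment = rng.choice((-1, 0, 0, 1))  # 1:2:1 odds
        out_a.append(a + adjustment)
        out_b.append(b - adjustment)
    return out_a, out_b


rng = random.Random(1971)  # fixed seed so the example is repeatable
a, b = [14, 3, 27], [9, 6, 31]
new_a, new_b = disturb_pair(a, b, rng)
```

Because the two adjustments cancel, the sum of each comparable figure over the pair is unchanged, which is why totals over an even number of districts remain accurate.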

     One possible approach is to introduce "noise" into the file of

microdata, thus avoiding the possibility of disclosure in any

tabulations produced from the file.  This method may simplify

matters for the data producer, but it creates problems for the user

(Dalenius, 1974).

     "Random rounding," a method which has received considerable

attention in recent years, combines elements of both rounding and

introducing random disturbances.  Each figure is rounded to a

multiple of some integer, usually 5, but not necessarily to the

nearer one.  Whether a figure is rounded up or down is determined

at random, with the chance of rounding up or down depending upon

the amount of change necessary: (Murphy, date unknown: 68-70;

Social Security Administration, 1977:7-9).

 

Final Digit                                    Probability of Rounding Up

0 or 5------------------------------------------             0

1 or 6------------------------------------------           1/5

2 or 7------------------------------------------           2/5

3 or 8------------------------------------------           3/5

4 or 9------------------------------------------           4/5
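The table above can be expressed as a short routine. This is a minimal sketch: the function name and the use of Python's standard random number generator are assumptions, and none of the cited agencies' programs is being reproduced.

```python
import random


def random_round(value, base=5, rng=random):
    """Randomly round a non-negative integer to an adjacent multiple of `base`.

    The probability of rounding up is remainder / base, which matches the
    table above for base 5 (final digit 1 or 6 -> 1/5, ..., 4 or 9 -> 4/5).
    """
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple: never changed
    lower = value - remainder
    return lower + base if rng.random() < remainder / base else lower
```

With these probabilities the expected value of the rounded figure equals the original figure. For a remainder r the expectation is lower + base * (r / base) = lower + r, so the rounding introduces no systematic bias when many rounded figures are added together.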

 

     Nargundkar and Saveland (1972) describe and give theoretical

support to the use of this method in the tabulations published from

the 1971 Canadian censuses of population and housing.  Fellegi

(1975) presents a technique for controlling the random rounding to

assure that the totals will be correct at some predetermined higher

geographical area level.

     The Swedish Statistical Bureau proposes another random

rounding technique which may be used if it is simply desired to

remove ones from a table.  The one is rounded randomly down to zero

with a probability of 2/3 and up to 3 with a probability of 1/3

(Sweden, National Central Bureau of Statistics, 1974:34-35).
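The Swedish rule for removing ones can be sketched in the same way. The function name is an assumption made for illustration. The exact arithmetic at the end shows why these two probabilities leave the expected value of the cell at one.

```python
from fractions import Fraction
import random


def remove_one(value, rng=random):
    """Replace a cell value of 1 by 0 (probability 2/3) or 3 (probability 1/3).

    Any other value is returned unchanged.
    """
    if value != 1:
        return value
    return 3 if rng.random() < 1 / 3 else 0


# Exact expected value of a disturbed "1": 0 * 2/3 + 3 * 1/3.
expected = 0 * Fraction(2, 3) + 3 * Fraction(1, 3)
```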

 

 

 

The models discussed above for disturbing data are all

additive.  Multiplicative models are also feasible.  Hansen

(1971:55-56) suggests one which involves disturbing the figure by a

factor within the range of .5 to 1.5, the factor being chosen at

random.
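A multiplicative disturbance of this kind might look like the sketch below. The text specifies only the range of the factor, so drawing it from a uniform distribution is an assumption of this example, as is the function name.

```python
import random


def multiplicative_disturb(value, rng=random, low=0.5, high=1.5):
    """Multiply `value` by a factor drawn at random from [low, high].

    A uniform draw is an assumption of this sketch.
    """
    return value * rng.uniform(low, high)
```

Unlike the additive methods, the size of the change grows with the figure itself: a value of 100 can come out anywhere between 50 and 150.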

 

4.   Limiting Distribution

 

     Situations may arise in which it is not necessary

to take special steps to avoid disclosure from statistical

tabulations.  Under certain conditions a table may be made

available to a particular organization, even though the table could

not be published for reasons of maintaining confidentiality.  An

actual example is in the tables on local area social security data

provided by the Office of Research and Statistics, Social Security

Administration, to the Bureau of Economic Analysis.  As a result,

the expense of revising the table is avoided, and the actual

distribution is available for full research use.  This can be done

when the receiving organization guarantees (and has the legal

authority) to provide fully adequate protection to the

confidentiality of the data while it has custody of them.

     For one agency to make potentially identifiable data

available to another, conditions such as these may be required:

           a.   The activity must be in accordance with the laws

                governing the programs of the respective agencies.

           b.   There must be a legitimate research purpose to be

                served by the process.

           c.   The receiving agency must be strictly and legally

                accountable to the providing agency for its security

                program.

           d.   The receiving agency must demonstrate that it has

                adequate security provisions.

           e.   The likelihood that any information potentially

                harmful to an individual would be derived from the 

                would, even so, be ex-

           f.   The receiving agency would not and could not be

                required to turn the data over to any third party,

                even under subpoena or a Freedom of Information Act request.

           g.   The providing agency would have opportunity to

                review any publication of information from the data

                to insure that no potential disclosures are

                published.

           h.   At the close of the project, and no later than some

                specified date, the receiving agency would either

                return or destroy all of the tables involved.

           i.   Significant sanctions or penalties for improper

                disclosure would apply.

 

 

5.   Evaluation of Alternative Techniques

 

     If it is determined that there is a possibility that

the publication of a table, or a datum within a table, might result

in harm to some individual or organization, but, nevertheless, the

table has sufficient value that, at least in some form, it should

be published, then a decision must be made as to which technique

will be used to avoid the disclosure.  A number of examples have

been cited; various other techniques are also possible.  Four

principal questions must be weighed in the making of this decision:

 

a.   The degree of protection provided.-All of the described

methods reduce considerably the likelihood of a disclosure; some

give virtually absolute protection against the possibility of

disclosure but are more drastic in terms of loss of information.

 

b.   Effects on users of the data.-All of the techniques listed

have some effect in reducing the value of the data to the user. 

There is some loss of information inherent whenever data are

suppressed, combined, or disturbed.  The Swedish method of removing

ones from tables by changing them to 0's or 3's perhaps does the

least harm to the data conveyed.  At the other extreme, the method

of "random rounding" to multiples of 5 has considerable effect,

since it can cause any figure to be changed by as much as 4. In

general, both of these data-disturbing techniques may also yield

inconsistent figures for the same data items in independently

derived totals.  Suppression could make some analyses impossible,

particularly where the user wants to combine a number of smaller

units to obtain totals and other statistics not provided in the

tables.  The multiplicative method cited by Hansen could cause any

figure to be halved or increased by 50 percent.  The Swedish

suggestion for substituting a range for a sensitive value can also

have severe effects if the range is relatively large. Even the

smallest of these changes may affect the value of the published

data for descriptive or analytic purposes (Dalenius, 1974:220).

     With the increasing use of computers in data analysis,

particularly where a large number of areas are to be compared, the

uniformity of the data input is another factor affecting users.  In

this context, rolling-up, so that dimensions of the data matrix vary

from unit to unit, creates considerable difficulty.  Suppression is

also problematic in that suppression at any level can prevent the

development of a desired total.  In this context the

data-disturbing techniques may be most satisfactory, in that data

are always present

and they can be added together without biasing effects on the

totals derived.  Other statistics such as ratios, e.g., persons per

household, can be affected; however, with suitable precautions,

these effects can be minimized.

     c.  The "identifying" nature of the subject items.-Some

subject characteristics are more likely than others to lead to the

ability to associate data with a particular individual.  A

tabulation of race and sex by income probably has more disclosure

potential than a similarly detailed table of major field of study

in college by income, assuming that race and sex are more readily

observable than major field of study.  Area of residence

is considered highly identifying in nature, and frequently

geographic or size-of-area characteristics are considered

separately from any "subject" characteristics of a respondent in

disclosure rules.  On the other hand, opinions recorded in a survey

are normally of minimal utility in identifying a respondent.

     The Census Bureau, for instance, has in the past used area of

residence and race as the critical variables in determining the

publishability of small area population census tabulations.  If

certain minimum population criteria were met in each area, then other

characteristics of that population would be provided.  On the other

hand, the Census Bureau was willing to make available

journey-to-work data from the 1970 census in the form of an

origin-destination matrix classified by mode (auto, bus, etc.)

without any disclosure control, on the assumption that

journey-to-work characteristics are highly changeable (the question

was asked relative to "last week") for an individual and therefore

non-identifying.

     d.    Cost.-Any procedure used to avoid disclosure in

     statistical tables will involve some cost to the statistical

     agency.  There will be cost in the use of some operating

     funds, in the use of personnel time that would otherwise be

     available for other activities, in the computer programming,

     debugging, and processing, and in time required for the total

     process and the resulting delay in publication.

                           *    *    *

     Agencies cited have studied the problem and have tended to

settle on one particular technique to be used for all publications

of a particular census, or as standard operating procedure.  Once

this is done and staff understand it, the procedure becomes

routinized and automatic.  Computer programs are written to

automatically "purify" the tables in the system on a

mass-production basis, and costs are minimized.  All of the techniques

described are capable of computerization, and some software

packages are available (Cox, 1976:14-15).  But such mass procedures

may also result in wholesale losses of valuable information.  Study

of the effects of such procedures may reveal that in many instances

the system's application resulted in particular losses of

information that are both unfortunate and unnecessary.  As described

in Appendix C, the Census Bureau has developed programs which

attempt to minimize the number of suppressions in magnitude data.

Each statistical agency must make its own study and its own

decision to answer this question: How can we do our job of making

available the needed data, in our area, while at the same time we

make sure that no confidential information about any person or any

establishment is accidentally released through the tables we

publish?

     Selected agency policies and practices to avoid unintentional

disclosures are noted in Appendix A.

 

 

 

 

 

                                                              CHAPTER IV

 

                         Disclosure in Microdata

 

 

 

           A. Nature of the Problem

 

1.   Definition of Microdata

 

     We use the term microdata to refer to files in

which each record provides data about an individual person,

household, establishment or other unit.  An agency's own files of

basic records from a survey or other data collection are thus

microdata, and normally they are summarized or aggregated to

produce statistics for the reports and publications discussed in

Chapter III.

     Release of microdata to a data user outside the originating

agency can serve legitimate and important public purposes in that

the data may be useful for many more tabulations or other analyses

than the originating agency is prepared to provide.  Certain

statistical applications (e.g., simulation models) require input in

microdata form.

     Obviously, release of records about individuals raises the

issue of disclosure.  Some files are by law not confidential, for

example, those from the Census of Governments, from which detailed

data for specific governmental units are released.  On the other

hand, most data bases are covered by statutes (discussed in Chapter

II) which prohibit the release of data from which information may

be gained about identifiable individuals.

     

     Agencies which release microdata for outside use have

construed applicable law and regulations to permit the release of

individual information insofar as it is not specific enough to

allow identification of the individual.  Invariably names and

addresses, social security numbers and other positive identifiers

are removed.  Further, certain other information, such as location,

is generally withheld or provided only in broad categories.

 

     Microdata are a particularly popular form of release since they

give the user considerable flexibility in his or her analyses. 

The capacity of data users to perform such analyses has been and is

continuing to increase rapidly with the availability of computer

resources.  At the same time the statistical agency is frequently

impelled to release microdata as a labor-saving device: it reduces

somewhat the need for extensive published tabulations, and it cuts

down on requests for special tabulations which are sometimes seen

as diverting agency resources.  Thus the dissemination of data in

microdata form is steadily increasing.

 

2.   Federal Agency Examples of Microdata Release

 

     a. Bureau of the Census.-Probably the best known of all Federal

microdata bases are the public use samples of basic records from

the 1960 and 1970 censuses of population and housing.  From the

first release in 1963, these samples have provided nearly the full

richness of detail about households derivable from the decennial

censuses: age, education, income, occupation, etc., of each family

member along with characteristics of the family's housing.  The

sample originally released in 1963 had little geographic

information and the sampling fraction was only 0.1 percent of all

U.S. households.  As a result of the public acceptance and

demonstrated utility of that microdata product, public-use samples

from the 1970 census were created with a larger sampling fraction

(one-percent) and more specific geographic information (areas as

small as 250,000 population were identified).  A total of six

mutually exclusive one-percent samples were made available (taken

together, six percent of the national population).  These files are

available for purchase by anyone and use is not restricted.

 

     Fairly comparable in content and structure to the census

public-use samples are the Annual Demographic Files (ADF) generated

each year from the March supplement to the Current Population

Survey (CPS).  A special provision must be added to the

aforementioned disclosure rule since the CPS is an area sample and

maps are available which define what areas are included in the

first-stage sample.  The minimum population criterion becomes

250,000 population within sampled primary sampling units in the

area to be identified.  For example, since central city, other

metropolitan and nonmetropolitan components of the population have

been identified

on the ADF through 1976, a State with even several million total

population was not identifiable if there were fewer than 250,000

people in sampled nonmetropolitan counties. (Beginning with the

1977 ADF, all States will be identified, but with central city and

metropolitan residence codes suppressed where necessary; see page

38).  There are no restrictions on use of Annual Demographic Files. 

Files from a number of other household surveys are also released in

a similar manner.
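The modified criterion for an area sample can be sketched as a simple test on the population of the sampled primary sampling units (PSUs) only. The function name, the PSU labels, and the population figures are hypothetical; only the 250,000 minimum comes from the text.

```python
def area_may_be_identified(psu_populations, sampled_psus, minimum=250_000):
    """Apply the minimum-population test to the sampled PSUs of one area.

    `psu_populations` maps a PSU label to its population; `sampled_psus`
    is the set of PSUs in the first-stage sample.
    """
    sampled_population = sum(
        population
        for psu, population in psu_populations.items()
        if psu in sampled_psus
    )
    return sampled_population >= minimum


# Hypothetical nonmetropolitan component of a State, made-up figures.
nonmetro = {"PSU 1": 90_000, "PSU 2": 120_000, "PSU 3": 400_000}
```

The component as a whole has 610,000 people, but if only the first two PSUs were sampled the test fails, because anyone consulting the sample maps could narrow a record down to 210,000 people.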

 

     b. Social Security Administration.-The Social Security

Administration (SSA) makes available from its Continuous Work

History Sample system the Longitudinal Employee-Employer Data

(LEED) File, containing records for one percent of all employees

covered by the Social Security System.  For every individual in the

file there is age, race, and sex information and a record for each

employer in each year since 1957.  The employer records indicate

the industry, State, county, taxable wages and estimated total

wages for the year.  Scrambled social security numbers for

employees are provided only to users who will be updating the

sample with data for subsequent years.  Purchasers must enter into

a written agreement with SSA specifying the purposes for which the

file may be used, prohibiting further dissemination without SSA

authorization, and specifically precluding any attempt to identify

specific individuals or establishments or to match individual

records with information in other files on specific individuals. 

Annual and quarterly files from the system are also available under

the same conditions.

     SSA also releases microdata files for general public use,

i.e., without any restrictions, from several different sources,

including the Longitudinal Retirement History Survey, various

surveys of disabled persons, the Survey of the Low-Income Aged and

Disabled, and certain match studies using data from the Current

Population Survey, IRS and SSA.  These files are all based on

relatively small samples (less than one-percent of the population)

and carry only limited geographic information.  Unusual values of

variables or combinations of variables are suppressed prior to

release of the files.

     c. National Center for Health Statistics.-The National Center

for Health Statistics (NCHS) releases public-use microdata tapes

from many of its surveys and statistical programs.  These include

tapes from the Health Interview Survey, the Health and Nutrition

Examination Surveys, the National Ambulatory Medical Care Survey,

the Hospital Discharge Survey, health manpower and health facility

inventories,

the inventory of family planning service sites, vital statistics

for the Nation (natality, mortality, marriage, and divorce), and

the national natality and mortality followback surveys.  These

public-use tapes are reported in a catalog published annually

(NCHS, 1976).

     One NCHS microdata file quite unlike the examples from other

agencies is the file on natality, a 50-percent sample of records

from the NCHS birth registration system (100-percent for some

States in 1972 and 1973).  No other Federal microdata file released

exhausts a universe or comes that close.  Records on the natality

file include the age, race and education of the father and mother,

the State and county of residence of the mother, the birth date,

legitimacy (if recorded) and several characteristics of the

mother's previous childbearing history.  Purchasers of NCHS

microdata sign a simple statement that the file will be used solely

for statistical research or reporting purposes.

     d. National Center for Education Statistics.-The National

Center for Education Statistics has available microdata tapes with

information gathered from 22,532 graduates of the high school class

of 1972, a probability sample made up of approximately 0.7 percent

of the national high school class for that year.  Information was

collected beginning in the spring of 1972, with follow-up surveys

in October 1974, for the National Longitudinal Study of the High

School Class of 1972.  School record information, such as grade

point average, class rank, and area of study, is included along

with test results and student-provided information on family

background, attitudes, and plans for the future.  Periodic follow-ups

provide information on activity status and changes in attitudes and

plans for the future.  Geographic information specifies regions and

type of community (e.g. rural, suburb, etc.). These files are

available for purchase by anyone, and use is not restricted.

     e. Internal Revenue Service.-The Internal Revenue Service

releases two samples of unidentified individual income tax returns,

with 150 data items from each return, for tabulation purposes and

to allow simulation of the revenue impact of tax law changes.  The

Tax Revenue Model for National Estimates, with no geographic

information, is available for purchase and unrestricted use.  Less

than 0.2 percent of all returns are included in that file, although

the sampling fraction varies among the classes of taxpayers.  The

Tax Model for State Estimates, including about 0.3 percent of all

returns

identified to the State level, is available to State tax agencies

for tax administration purposes and, once certainty strata are

deleted, it is also made available to the public.

 

           B. Evaluation of the Problem

 

     While microdata are made available so that tabulations or

other summarizations can be made, it is the possible scrutiny of

individual records that causes concern for the violation of

confidentiality.  While we are confining our consideration to microdata

files with no positive identifiers (e.g., name, address, or social

security number), a combination of data elements, such as geographic

location, age, race, and occupation, if sufficiently detailed,

could identify an individual if known by the investigator in

advance.  Other information on the microdata record so identified

would then be disclosed about the individual, e.g., income, marital

history, educational attainment, etc.

     This section deals with the likelihood of such disclosure and

with the bases for determining, in particular cases, whether or not

the risks of disclosure are acceptable.

 

1.   Factors Bearing on the Likelihood of Disclosure

 

a.   Sample size or fraction of the universe.-If an investigator

were searching for a particular individual in a microdata file, his

probability of success would be no greater than the chances that

a randomly selected individual's record is present in the file,

assuming of course that the investigator had no external way of

knowing whether or not the individual was selected into the sample. 

For instance, in a one-percent sample the chances are 99-to-1

against a particular individual having a record in the file.

 

     In stratified samples the likelihood of selection into the

sample may vary from stratum to stratum.  Further, in multi-stage

samples it may be possible for an outsider to determine that some

counties but not others were subject to sampling beyond the first

stage.  It would then be the sampling fraction within the county

that would be relevant, rather than the average or overall sampling

rate.
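The arithmetic behind these statements can be sketched as follows. The function name is an assumption, and the stratum-level fractions are invented purely to show how the odds change when the relevant sampling rate is the local one and not the overall rate.

```python
from fractions import Fraction


def odds_against_inclusion(sampling_fraction):
    """Odds against a given unit being in the file, for a known sampling fraction."""
    f = Fraction(sampling_fraction)
    if not 0 < f <= 1:
        raise ValueError("sampling fraction must be in (0, 1]")
    return (1 - f) / f


overall = odds_against_inclusion(Fraction(1, 100))

# Hypothetical stratum-level fractions; not taken from any actual survey.
stratum_fractions = {"stratum 1": Fraction(1, 1000), "stratum 2": Fraction(1, 10)}
by_stratum = {
    name: odds_against_inclusion(f) for name, f in stratum_fractions.items()
}
```

An investigator who knows that a person falls in the heavily sampled stratum faces odds of only 9-to-1 against finding a record, although the file as a whole might be described as a one-percent sample.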

     b. Uniqueness.-The term uniqueness is used here to

characterize the situation where an individual can be distinguished

from all other members in a population in terms of information

available on microdata records.  The existence of uniqueness is

determined by the size of the population and the

degree to which it is segmented by geographic information, and the

number and detail of characteristics provided for each unit in the

data base.

     (1) Geographic information: The smaller the population, the

more easily an individual can be unique; the larger the population

the more likely that his or her set of characteristics is

duplicated elsewhere. (Also, the larger the population the more

costly would be any linkage attempt.)

     Size of the population, or of the smallest segment that can be

readily identified, can be varied most directly by varying the

amount of geographic information supplied on a microdata file.

     Geographic information can be in terms of specific areas

(e.g., the State of Maryland) or in terms of type of area (e.g.,

size of place or rural) or both.  Multiple geographic identifiers

in combination may identify a small area, e.g., the rural part of

an SMSA, or a small part of an SMSA crossing a State line.

     Extraneous sources may also provide information about the

location of the respondent: knowledge that only certain areas were

surveyed or subject to final stage sampling; sequence of records in

the file where they have not been scrambled; the existence of more

than one version of a file with different sets of geography

identified; and neighborhood, county or PSU summary characteristics

if present and matchable to an external source.

 

     (2) Characteristics of the respondent: In general it can be said

that the greater the number and detail of characteristics reported

about an individual the more likely it is that the individual's

representation in the file would be different from that of any

other individual in the population.  Just 10 characteristics with

four categories each create over a million possibilities (4^10), and

when one considers that some data items may have 100 or more

potential categories (e.g., age, occupation, industry, income,

place of birth) the number of possibilities becomes astronomical in

a file with a large number of characteristics.  Many

characteristics are, however, likely to be correlated with one

another, thus reducing the degree to which an additional item

creates additional unique records.  For a given subject the number

of categories does not entirely account for its potential in an

identification process.  Some items identify especially small

populations, e.g., country of birth of the foreign born.
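The count of possible combinations is simply the product of the numbers of categories, as this small sketch shows. The function name is an assumption. The result is an upper bound on the number of distinct records, since, as noted above, correlated characteristics leave many combinations empty.

```python
import math


def possible_combinations(category_counts):
    """Number of distinct records that the listed characteristics allow."""
    return math.prod(category_counts)


ten_items_four_categories = possible_combinations([4] * 10)

# Adding two 100-category items multiplies the count by 10,000.
with_two_detailed_items = possible_combinations([4] * 10 + [100, 100])
```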

     It might then seem reasonable to designate a minimum category

population, e.g., to collapse country of birth categories with less

than 50 cases in the file.  This technique, however, appears

inadequate.  While

there may be many Russian-born persons sampled, only one may be

black, or only one may live in a particular identified area.  More

importantly, uniqueness in the sample is not the critical factor,

for there may be a hundred such individuals in the population with

no possibility of discriminating among them.  Uniqueness in the

population is the real question, and this cannot be determined

without a census or administrative file exhausting the population

or an identifiable subset thereof (e.g., a file of all doctors). 

Precluding uniqueness in the sample would be a very conservative

approach to avoiding disclosure.
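The distinction between uniqueness in the sample and uniqueness in the population can be illustrated with a toy example. All records, field names, and the function name are hypothetical. In practice the population file needed for the second check is exactly what is usually unavailable.

```python
from collections import Counter


def unique_keys(records, key_fields):
    """Return the key combinations that occur exactly once in `records`."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return {key for key, n in counts.items() if n == 1}


# Hypothetical population and a sample drawn from it.
population = [
    {"id": 1, "birthplace": "Russia", "area": "X"},
    {"id": 2, "birthplace": "Russia", "area": "X"},
    {"id": 3, "birthplace": "Russia", "area": "Y"},
    {"id": 4, "birthplace": "U.S.", "area": "X"},
    {"id": 5, "birthplace": "U.S.", "area": "X"},
]
sample = [population[0], population[3], population[4]]

keys = ("birthplace", "area")
sample_unique = unique_keys(sample, keys)
population_unique = unique_keys(population, keys)
```

The Russian-born resident of area X is unique in the sample but not in the population, so the sample record cannot be tied to one person on these characteristics alone.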

     Some public-use microdata files provide characteristics for

all or at least multiple members of a household.  The association

of the characteristics of household members greatly increases the

potential for unique combinations (e.g., a 66-year-old judge

married to a 23-year-old actress).

     c. Recognizability.-The term recognizability is used here to

refer to the likelihood that an investigator could accurately

associate unique records in the sample with particular individuals

in the population and thereby gain additional information about

them.  A record in the sample may be unique, but if it cannot be

linked with a specific person then disclosure cannot occur.

     Three factors affecting recognizability are discussed: the

existence of a population register, "noise" in the microdata file,

and time lag or the degree to which the microdata information has

become out-of-date for an individual.

     (1) Population registers: A population register is defined

here to be a list of persons or households with specific

identification, names or addresses, which also systematically

contains information which coincides with data on public-use

microdata records.  Except for Census Bureau, Social Security

Administration and Internal Revenue Service records, none of which

are available to the public, we know of no registers which

systematically cover most of the U.S. population.  But neither

nationwide coverage nor coverage of all segments of the population

is required to make a population register useable for matching

purposes.

     Reasonable coverage of a defined subpopulation, along with a

number of reliable matching characteristics, may suffice.  A

register of some groups like Black architects, American Indians,

high public officials, or birth records is not at all improbable. 

The existence of rather extensive registers of business

establishments in the hands of governmental units, trade

associations and firms like Dun and Bradstreet

has virtually ruled out the possibility of releasing microdata

files about businesses for statistical purposes.

     The point is, of course, to be able to discriminate among the

units on the register for the one which matches a public-use

microdata record, and this requires inclusion on the register of

stable and reliable matching characteristics.  Among the charac-

teristics most likely to reside in a population register file, date

of birth and State or country of birth would seem to be the most

reliable, regardless of time or circumstances of data collection. 

Veteran status, period of military service, and years of school

completed wo