Federal Committee on Statistical Methodology
Office of Management and Budget

 

 

Statistical Policy Working Paper 17 - Survey Coverage



 

 



                MEMBERS OF THE FEDERAL COMMITTEE ON

                      STATISTICAL METHODOLOGY


(April 1990)

Maria E. Gonzalez (Chair), Office of Management and Budget
Yvonne M. Bishop, Energy Information Administration
Warren L. Buckler, Social Security Administration
Charles E. Caudill, National Agricultural Statistics Service
John E. Cremeans, Office of Business Analysis
Zahava D. Doering, Smithsonian Institution
Joseph K. Garrett, Bureau of the Census
Robert M. Groves, Bureau of the Census
C. Terry Ireland, National Computer Security Center
Charles D. Jones, Bureau of the Census
Daniel Kasprzyk, Bureau of the Census
Daniel Melnick, National Science Foundation
Robert P. Parker, Bureau of Economic Analysis
David A. Pierce, Federal Reserve Board
Thomas J. Plewes, Bureau of Labor Statistics
Wesley L. Schaible, Bureau of Labor Statistics
Fritz J. Scheuren, Internal Revenue Service
Monroe G. Sirken, National Center for Health Statistics
Robert D. Tortora, Bureau of the Census

PREFACE

The Federal Committee on Statistical Methodology was organized by the Office of Management and Budget (OMB) in 1975 to investigate methodological issues in Federal statistics. Members of the committee, selected by OMB on the basis of their individual expertise and interest in statistical methods, serve in their personal capacity rather than as agency representatives. The committee conducts its work through subcommittees that are organized to study particular issues and that are open to any Federal employee who wishes to participate in the studies. Working papers are prepared by the subcommittee members and reflect only their individual and collective ideas.

The Subcommittee on Survey Coverage studied the survey errors that can seriously bias sample survey data because of undercoverage of certain subpopulations or because of overcoverage of other subpopulations.
The purpose of this report is to heighten the awareness of survey planners and data users regarding the existence and effects of coverage error, and to provide survey researchers with information to evaluate the trade-offs between coverage error and survey costs. The report profiles selected methods for controlling and measuring the effects of coverage errors using examples from Federal sampling frames and surveys. The report includes seven case studies based on Federal surveys that illustrate selected aspects of coverage errors.

The Subcommittee on Survey Coverage was cochaired by Cathryn S. Dippo of the Bureau of Labor Statistics, Department of Labor, and Gary M. Shapiro of the Bureau of the Census, Department of Commerce.

MEMBERS OF THE SUBCOMMITTEE ON SURVEY COVERAGE

Cathryn S. Dippo (Co-chair), Bureau of Labor Statistics (Labor)
Gary M. Shapiro (Co-chair), Bureau of the Census (Commerce)
Raymond R. Bosecker, National Agricultural Statistics Service (Agriculture)
Vicki Huggins, Bureau of the Census (Commerce)
Roy Kass, Energy Information Administration (Energy)
Gary L. Kusch, Bureau of the Census (Commerce)
Melanie Martindale, Defense Manpower Data Center (Defense)
D.E.B. Potter, Agency for Health Care Policy and Research (Health and Human Services)

ACKNOWLEDGMENTS

This report is the result of the collective work and many meetings of the Subcommittee on Survey Coverage. All of the subcommittee members made significant contributions to the text of the report, taking responsibility for various sections of the report during the long period of preparation.

All of the members of the Federal Committee on Statistical Methodology reviewed several drafts and made many important suggestions. The subcommittee wishes to recognize in particular the valuable contributions made by the following committee members: Yvonne Bishop, Joseph Garrett, Charles Jones, Daniel Kasprzyk, Fritz Scheuren, Monroe Sirken, and Robert Tortora.
The subcommittee also benefitted significantly from an outside review of the final draft by Steven Heeringa and Benjamin Tepping.

The subcommittee also thanks the following persons: John Paletta and Richard Pratt for preparing the Current Population Survey and Producer Price Index case studies, respectively; Robert Casady and Charles Cowan for contributing to the section on sample design strategies; and Rosalie Epstein of the Bureau of Labor Statistics for editing the report.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
EXECUTIVE SUMMARY

CHAPTER 1. Coverage errors occurring before sample selection
  1.1. Conceptual or relevance error
  1.2. Frame construction and maintenance
    1.2.1. Classification of frame errors
           Missing elements; clusters of elements appearing on list; blanks or foreign elements; duplicate elements; incorrect auxiliary information
    1.2.2. Frame maintenance
           New frame elements; inactive frame elements; misclassified elements; out-of-scope elements; split-out or combined frame elements
    1.2.3. Match-merging of independent source lists
  1.3. Sample design strategies to minimize coverage error
    1.3.1. Defining target population to equal frame population
    1.3.2. Random-digit dialing sampling
    1.3.3. Multiple frame sampling
    1.3.4. Sampling rare populations
    1.3.5. Estimation procedures
  1.4. Evaluation methods
    1.4.1. Macro-level analysis
    1.4.2. Micro-level analysis

CHAPTER 2. Coverage errors occurring after initial sample selection
  2.1. Incorrect association of frame with reporting unit(s)
    2.1.1. Location errors
    2.1.2. Classification errors
    2.1.3. Temporal errors
  2.2. Listing errors
    2.2.1. Area segment listing errors
           Studies measuring error; an alternative to area listing
    2.2.2. Household listing errors
           Motivational causes; lack of correspondence between survey designer's and respondent's residency concepts; effect of household listing errors; methods for reducing household listing errors
    2.2.3. Nonhousehold listing errors
  2.3. Other nonsampling errors
    2.3.1. Recording errors
    2.3.2. Responses from nonsampled units
    2.3.3. Coverage errors resulting from nonresponse

CONCLUSION

APPENDIX A. CASE STUDIES
  Introduction
  A.1. Annual Survey of Manufactures (ASM)
  A.2. National Long-term Care Survey (NLTCS)
  A.3. National Master Facility Inventory (NMFI)
  A.4. Producer Price Index (PPI)
  A.5. Quarterly Agricultural Surveys (QAS)
  A.6. Monthly Report of Industrial Natural Gas Deliveries
  A.7. Current Population Survey (CPS)

APPENDIX B. GLOSSARY OF ACRONYMS
APPENDIX C. GLOSSARY OF TERMS
REFERENCES

LIST OF TABLES

1. Selected sampling frames used for Federal surveys
2. Scope of frame versus population of interest for selected surveys
3. Reinterview classification of units originally classified as noninterview: October 1966
4. Reinterview classification of units originally classified as noninterview: April to September 1966
5. Reinterview classification of units originally classified as noninterview: 1987
6. Type B rates for the Survey of Income and Program Participation and the Current Population Survey, 1985-87 (percent)
7. Selected surveys in which the frame sampling unit and the final sampling unit are the same
8. Selected surveys in which the frame sampling unit and the final sampling unit differ
9. Examples of surveys requiring field listing
10. Comparison of A.C. Nielsen 1982 field canvass of housing units with 1980 census housing unit counts by block group or enumeration district (National Nielsen Television Index Survey segments only)
11. Number of listing errors found in Labor Force Survey study (Statistics Canada)
12. Reasons units were added and deleted during reinterview, as determined by reconciliation--area segments only: October 1966
13. Estimates of percent net CPS within-household undercoverage relative to the 1980 census for males aged 25 and over by their household status (standard errors in parentheses)
14. 1986 average coverage ratios by age, sex, and race for CPS
15. 1986 average coverage ratios for Hispanics by age and sex for CPS

LIST OF FIGURES

1. Typical physical flow of natural gas from gas well to industrial customer (custody relationship)
2. Possible financial flows (ownership) from gas well to industrial customer (equity relationship)
3. Industrial gas estimates from Form EIA-857 submissions: Total United States

EXECUTIVE SUMMARY

Coverage errors can cause serious biases in estimates based upon sample survey data. Undercoverage may be substantial in many surveys, especially of selected subpopulations. For example, the estimated undercoverage of Hispanic males aged 14 and over is 23 percent in the Current Population Survey (see appendix A.7). In economic surveys, new businesses may be missed at a higher rate than older ones. If the characteristics of the missed portion of the population are very different from those of the covered portion, serious biases in the survey estimates for the total population will result.

The purpose of this report is to heighten the awareness of survey program planners and data users concerning the existence and effects of coverage error and to provide survey researchers with information and guidance on how to assess and improve coverage in sample surveys. The report outlines the possible sources and effects of coverage error by documenting current knowledge of coverage errors in Federal surveys. It also profiles selected methods for controlling, measuring, determining the effects of, and reducing coverage errors using examples from Federal surveys and sampling frames.

This report utilizes a broad definition of coverage error. Some authors have included only errors associated with the sampling frame. Here, however, coverage error is defined to include all possible sources of error which are not classified as observational or content errors (U.S. Department of Commerce 1978b).
For example, errors or mistakes leading to noncoverage of target population units (undercoverage), errors or mistakes leading to the inclusion of units which are not members of the target population (overcoverage), and failure to elicit a response for a sampled population unit (nonresponse) are included.

The report narrative is structured to follow the sequential procedures typically used in a survey. Other approaches, including one based upon a typology of sampling units (housing units, persons, and establishments), were considered but discarded because of the complexity of many surveys. (An excellent discussion of the coverage errors in housing unit surveys can be found in United Nations (1982).) The survey process has been divided, for the purpose of this report, into two components. Chapter 1 discusses coverage errors which might occur before the first stage of sampling. Issues associated with the creation and maintenance of sampling frames and the choice of sampling frame and strategy are included. Chapter 2 discusses coverage errors which might occur after the first-stage sampling units are selected. Coverage errors associated with field listing, screening, subsequent sampling operations, interviewing, and processing are presented, along with overcoverage due to volunteer respondents. Nonresponse as an important source of coverage error and bias, particularly in housing unit surveys and mail surveys of establishments, is also discussed.

Each chapter includes a detailed discussion of the circumstances leading to coverage errors. A discussion is also provided regarding the seriousness of the errors, their effects on survey estimates, and methods for controlling, measuring, and improving survey coverage. Numerous studies, which have been conducted to measure either overall frame coverage or the effects on coverage of selected data collection procedures, are cited throughout the report.
A single large source of coverage error identified in this report is within-household listing of persons. In general, coverage error is a more significant problem in housing unit surveys than in establishment surveys.

Throughout the report, examples are used to illustrate, not to encompass, the diversity of knowledge and experience derived from surveys conducted by Federal agencies. Although the examples in the text are necessarily brief, a more detailed examination of selected coverage issues is provided in appendix A, which presents exemplary material from the following surveys: Annual Survey of Manufactures, National Long-term Care Survey, National Master Facility Inventory, Producer Price Index, Quarterly Agricultural Surveys, Monthly Report of Industrial Natural Gas Deliveries, and Current Population Survey. Readers are encouraged to compare their current knowledge and practices concerning coverage with those of other Federal agencies as represented by the examples in the report. To assist the reader, glossaries of acronyms and terms are included at the end of this report as appendixes B and C.

CHAPTER 1. COVERAGE ERRORS OCCURRING BEFORE SAMPLE SELECTION

This chapter's goal is to provide a comprehensive set of evaluative tools which will enable users to identify and minimize potential coverage problems associated with a survey research program or specific research project, to assess the strengths and weaknesses of alternative research methodologies as these relate to potential survey coverage error, and to identify overly ambitious research projects and recast them into an achievable framework.

Four major subjects are discussed: conceptual or relevance error, frame construction and maintenance, sample design strategies to minimize coverage error, and coverage evaluation methods.
The chapter delineates the thinking, planning, and assessing processes which should precede and inform a complete survey design with its associated sampling plan.

The first section of the chapter contains a discussion on the importance of thinking carefully about prospective research and the necessity of using clear and concise language in the statement defining the research project or program. Attention to correct and clear thinking about, and specification of, research goals, concepts, and targeted population(s), an often neglected or abbreviated phase of research planning, helps to avoid or minimize many coverage problems at the outset.

Types of frame errors, standards for selecting or building a high-quality frame, and the many complex issues associated with correct and thorough frame maintenance, including match-merging independent source lists for updating and correcting frames, are discussed in the section on frame construction and maintenance. Not only are the major and minor problems arising from the failure to maintain frames illustrated with many examples, but evaluative criteria by which potential users of already existing frames may assess the appropriateness and adaptability of these frames for their own surveys are provided. The goal of this section is to provide the tools by which to identify appropriate existing frames, to assess those frames, and to determine when either supplemental or additional frames may be needed, or when a totally new frame must be built.

The third section presents sample design strategies that can minimize coverage errors associated with specific frame weaknesses. Moreover, design strategies for sampling rare populations for which existing frames are incomplete or inefficient are discussed. The section closes with a discussion of estimation procedures which compensate for known coverage error in the sampling frame(s).

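As an illustration of what an estimation procedure that compensates for known coverage error can look like, the following is a minimal sketch of ratio adjustment to independent control totals. It is one common form of such compensation and is not taken from the report: the function names, the adjustment cells "A" and "B", the weights, and the control totals are all hypothetical. A coverage ratio below 1.0 in a cell indicates net undercoverage relative to the control.

```python
from collections import defaultdict


def weighted_totals(records):
    """Sum the sampling weights of responding units within each adjustment cell."""
    totals = defaultdict(float)
    for cell, weight in records:
        totals[cell] += weight
    return dict(totals)


def coverage_ratios(records, controls):
    """Survey-based population estimate divided by the independent control total."""
    estimated = weighted_totals(records)
    return {cell: estimated.get(cell, 0.0) / control
            for cell, control in controls.items()}


def ratio_adjust(records, controls):
    """Scale each weight so that weighted cell totals match the control totals.

    Assumes every cell that appears in `records` also appears in `controls`.
    """
    estimated = weighted_totals(records)
    return [(cell, weight * controls[cell] / estimated[cell])
            for cell, weight in records]


# Hypothetical data: (cell, weight) pairs and made-up control totals.
records = [("A", 400.0), ("A", 400.0), ("B", 900.0)]
controls = {"A": 1000.0, "B": 1000.0}

ratios = coverage_ratios(records, controls)   # values below 1.0: net undercoverage
adjusted = ratio_adjust(records, controls)    # weights rescaled to the controls
```

An adjustment of this kind forces the weighted totals to agree with the controls, but it removes bias only to the extent that missed units resemble covered units within the same cell, and it inherits any coverage error in the control totals themselves.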
Both macro- and micro-level analysis, as methods for measuring frame coverage, are discussed in the last section of the chapter. The degree of coverage error is measured routinely in many Federal establishment surveys. For example, reconciliations are made at the Bureau of the Census between economic census totals and corresponding totals in the Current Industrial Reports annual survey for census years to measure and improve coverage. Similarly, the National Agricultural Statistics Service (NASS) conducts a continuous survey program for the agriculture sector and compares inventory and production estimates with those obtained in the Census of Agriculture. Administrative data are also used to measure coverage of establishment surveys. For example, the Bureau of Labor Statistics makes annual comparisons between the employment reported to State unemployment insurance systems and establishment employment estimates from the monthly Current Employment Survey.

For the housing unit surveys conducted by the Bureau of the Census, a demographic approach is used to estimate the degree of coverage error. This approach is similar to what is termed demographic analysis, where the coverage of the decennial census rather than of survey data is analyzed using other sources. However, using census data as a benchmark for survey coverage must be done cautiously, since the coverage error detected is in addition to the coverage error that already exists in the census results.

1.1. Conceptual or relevance error

Coverage errors can be caused by incorrect specifications of the concepts to be measured or the population(s) to be targeted by the survey. Incorrect specifications often result from conceptual errors.
Some of these are hasty or incomplete thinking concerning the goals of the research, faulty reasoning or incorrect assumptions about the measurable characteristics of targeted groups, and failure to recognize existing information as relevant (or irrelevant) to the specifications being written. Incorrect specifications can sometimes be spotted by their vague, nonspecific, or ambiguous language. These faulty specifications, in turn, can lead to the construction of incomplete, inadequate, or otherwise flawed frames.

Hansen, Hurwitz, and Pritzker (1967) present the general concept of the mean-square error of a survey estimate which includes a term for the "relevance of the survey specifications as related to the requirements." This is the squared difference between a statistic which constitutes the ideal goals of the statistical survey and a statistic based upon the specifications actually set for the survey if carried out precisely according to defined goals. Using vague or ambiguous language in terms of the ideal goals can lead to greater relevance error because this language can increase any difference between ideal goals and actual specifications. Thus, relevance error is a type of coverage error, since failure to specify correctly the concepts to be measured can lead to the construction of a flawed frame.

To ensure a more useful and complete frame, a clear, precise statement of the research question(s) and population(s) of interest needs to be written down, with careful attention to the exact language used. This is particularly important when the ideal goals are proposed by nontechnical sponsors or appear in enabling legislation. It may even be useful to write down what is not being researched and who is not being targeted, especially if exclusions may be of interest to some client or to researchers generally.
Taking the time to think through the possible meanings of key terms and variables and, if needed, to determine whether and how the population(s) and concepts have been defined and researched by others can reduce duplication of effort, reveal previous conceptual errors, and highlight potential frame construction problems. Even in a recurring survey, a review of the concepts and definitions can be very useful. For example, this effort may reveal changes in how the target population(s) and concepts are being defined by other researchers.

Sometimes, it may even be necessary to devise new terminology or revise definitions rather than to perpetuate the use of terms which now seem too general or otherwise objectionable. For instance, the "black" population of the United States has not always been called "black," and it may soon be preferable to use "African American." A review of shifts over time in race and ethnic concepts used in Federal research reveals various intersecting but nonidentical definitions for the black population, such as "nonwhite" and "colored." Such examples show how the use of vague terms makes it difficult to know who or what has been studied and how, over time, changes in terminology have generally been made to increase detail or specificity (and, thus, measurement accuracy), even at some cost in data continuity.

The language used to denote concepts, key variables, and the like should aim for concreteness and clarity. General equivalents of concepts (such as "education" to stand for "years of school completed") or of populations (such as "children" to stand for "persons aged 17 and under") should be avoided. Although very difficult, making the attempt to write specifications using "standard scientific terminology" rather than the "language of everyday life" wherever possible should help one to avoid vagueness in defining populations and concepts.
Vague definitions of populations and concepts tend to create coverage errors because they lead to inappropriate unit inclusions on, or exclusions from, a frame and even to naming a population which cannot be adequately represented on a frame.

A good rule to follow in examining the initial formulation of a problem is to ask a series of questions:

- To what population(s) of units does this problem refer?

Distinguish among populations from which information is sought, those which will be frame units, and those which may be reporting units, if different from the frame units. For example, suppose one wished to do research on "the scholastic achievement (as measured by grades) of children of recent immigrants." In this case, "children of recent immigrants," more suitably specified perhaps as "persons aged roughly 5 to 17 enrolled in Grades 1 through 12 of the U.S. public schools and living in a household in which at least one related head has been resident in the United States 5 or fewer years," would be the population about which information is sought. However, it seems likely that one might need to construct two or more frames in order to reach this population. One of the frames might have U.S. public schools as units, while another might consist of residential addresses to be screened. In this example, reporting units might well consist of two groups, school record keepers and parents or guardians.

- Is (are) this (these) population(s) observable or potentially measurable? How?

Continuing from the example above, one can see that the suggested specification of "children of recent immigrants" takes account of some of the presumably unobservable "children of recent immigrants," such as those who may be homeless and those who may not be currently enrolled in school. Among recent immigrants, those who entered the country illegally may not be observable, as well as those who died following entry, leaving school-age dependents.
Sources for obtaining U.S. public school and residential addresses might be lists from various agencies. Thinking through all possible categories of the populations of interest should reveal those subsets which cannot be measured or reached; those whose measurement (observation) might be achieved; and those which seem reachable with some existing or proposed methodology. Thus, the "children" may be reached by means of a housing unit survey, school survey, and/or institutional survey (hospitals, orphanages).

- Are there one or more subsets of this (these) population(s) which cannot be measured/observed in some way? What are these? Would they ever be measurable?

Continuing the example of "children of recent immigrants," some possibly unobservable components of the populations have already been mentioned. The potentially measurable components might be those which cannot be reached now but which might be reached using a methodology that may be prohibitively expensive, such as scanning all death certificates or other sources of information to identify deceased recent immigrants. Thus, it may be useful to distinguish the inherently unobservable from the practically unobservable components of populations of interest.

- Does time enter into the answer to one or more of the questions above, in the sense that the measurable population(s) may change or may have changed?

Continuing the example of "children of recent immigrants," one may find that a change in a legal boundary or definition can turn "internal migrants" to "recent immigrants" or vice versa. This would happen, for example, if Puerto Rico became a U.S. State, thus solving the problem of how technically to classify migrants to the mainland, who would become "internal migrants." Such a change might force a redefinition of the size and location of the populations of interest.

- Have previous efforts been made to build a frame of this (these) population(s)?
What problems were encountered in frame construction? Was one of these faulty conceptualization? Which of these problems has been solved?

This series of questions focuses on the need to locate previous research, to attempt to contact those who designed and conducted the research, or to obtain procedural histories about it and to evaluate carefully the definitions and language used by others. An assessment of previous research often reveals use of frames built for other purposes by still earlier researchers, especially when the frames are very expensive to assemble. Information needed for adequate frames may now be available (such as improved school lists) due either to improvements in information processing or to changes in laws regarding availability of administrative data.

Answering this list of questions has several important goals. The first is to decrease the slippage between the conceptual population(s) of interest and the actual units to be included on frames. The second goal is to facilitate the correct use of language, so that what can and cannot be researched is clearly understood. The third goal is to facilitate the specification of comprehensive and correct rules for frame maintenance. The fourth important goal is to help ensure that the population(s) and concept(s) of interest will be defined and measurable, so that one can answer the research question(s) of interest with the greatest accuracy and completeness possible, with minimum coverage error.

In beginning to answer the questions given above (and there may be other useful ones to ask), background preparation might include a brief review of some of the literature on conceptualization. While extended treatments of this topic abound in the philosophy of science literature, statisticians such as Deming (1961) and applied researchers like Blalock (1968) have discussed the importance of conceptualization in more accessible and measurement-oriented works.
Deming and others have discussed at length the "true value" of any variable or concept we attempt to measure, noting that there is no inherent "true value" and that the entity that we call the "true value" is a unique outcome of the concepts, assumptions, definitions, and procedures we use to arrive at it. With somewhat more focus on language, Blalock emphasizes the distinction between "concepts by intuition" and "concepts by postulation." These two kinds of concepts, one kind more or less abstract and the other more or less concrete, are linked for research and measurement purposes by assumptions (for example, the assumption that "education" is adequately reflected in "years of school completed"). It is upon these assumptions that researchers sometimes founder, not least because the language used is "everyday language" (for example, "children," "education," "worker").

The language of everyday life frequently is not suited to scientific inquiry. But when research is formulated solely or substantially in everyday language, the population(s) and concepts of interest will be named by this language as well. It is up to the researcher to guard against such usage and to clarify, specify, and define key terms, concepts, and populations, so that adequate frames can be constructed and coverage errors minimized. To build on an earlier example, a research organization might begin preparations for a study which will answer the question of "how children of recent immigrants are faring in school." Such a description of the research question may suffice for press coverage or to quickly summarize the general thrust of the effort for family and friends, but it reveals very little else. What is a measurable child? What does "recent" mean? Who is an immigrant? What is the process of "faring in school?" Is the intention to study a process, an outcome, or a set of outcomes? What is a "school?"
Not only does the vague language used tell very little, it actually militates against thinking in clear ways. For example, once the word "school" is used, we probably unwittingly think of our own unique individual "school" experiences, thus creating a tendency to omit by assumption other possibilities for defining "school."

This is not to say that one cannot use ordinary words; indeed, there is no method by which anyone can transcend all the limitations, contradictions, rules, and assumptions which are language itself. It means that there is a distinction to be made between the use of any word for purposes such as casual conversation and for research purposes. The difference usually lies in the modifiers and extended definitions containing the specificity and detail required for classification and measurement. In regard to frame construction specifically, concepts and population(s) of interest should be defined in such a way as to be observable and measurable in regard to the research question(s). (See appendix A.2 for an example of a carefully defined target population.)

Sometimes it is possible to work with potential survey sponsors in order to gain assistance in formulating the original research question(s). Such discussions often reveal incomplete thinking and allow the researcher to eliminate undoable or excessively costly projects or to modify those involving major frame construction problems. Even after research agendas are set, as with enabling legislation, a meeting with sponsors will reveal intent and can save time and trouble later on. For example, recently proposed legislation called for the establishment of a Consumer Price Index (CPI) for the "elderly," where "elderly" was defined as "all persons 62 and over." In the initial proposal, the definition was "all persons 62 and over and retired." In both definitions, the targeted units were persons, whereas the usual units for interview are "consumer units."
Not only do the different definitions imply potentially different sampling frames (and thus different cost levels) but also different procedures for constructing a CPI. Given these problems and others, had this legislation been enacted, it would have been necessary to determine the intent of Congress and establish a working definition for constructing an "elderly" index, so that the resulting research would have provided the information desired.

Finally, in thinking through any research agenda, attention should be paid to exclusions. Some frame (and research) exclusions are recognized and noted in many areas of research. One example might be something like: "This research focuses on immigrants who entered the United States within the last 5 years as identifiable in census data. It does not cover persons missed by the census. Some illegal entrants are included, but not identifiable, in the census data."

However, many exclusions are not noted, partly because researchers do not specify their work precisely, but often because the exclusions (or their existence in the real world) do not occur, or are unknown, to them. In addition, there is some lag between the emergence of new social phenomena and their explicit recognition.

In order to identify exclusions, it is often necessary to examine hidden assumptions and biases about the world. Reexamining topics from the perspectives of several disciplines and actually going out into the field might well be part of this process. As the result of these kinds of efforts and in various other complex ways, new concepts and populations for research do emerge. An example of this is the "hidden" economy.

Despite evidence that some forms of economic activity were not being included in the national accounts or were not the focus of serious research, "official" statistics and economic researchers in general failed until fairly recently to acknowledge such things as bartering, illegal activities of various kinds, and economic phenomena that were associated with other kinds of economic systems. Once the "hidden" economy or some subset or version of it was actually named, then vaguely described, it became easier for people to begin to think in new ways about the workings of the economic world. Once this kind of "breakthrough" occurred, it became easier to "see" exclusions. Today much more work has been done to identify, define, and attempt to research facets of an economic world which was largely ignored by the statistical establishment in the past. However, for the "hidden" economy, as for any newly emergent topic, this process is by no means complete because inherent (and predictable) problems surrounding the interface between conceptualization and measurement have not yet been resolved.

One example of research intended to address the lack of terminological specificity in work on the "hidden" economy is McDonald's (1984) examination of the charge that Bureau of Labor Statistics employment, price, and productivity indexes are significantly affected by unreported economic activity. McDonald asserts: "Establishing the existence of a subterranean economy ... does not necessarily prove that government statistics are invalid. To determine whether a particular government statistic is affected also requires careful consideration of the way data are gathered ... and the relation between economic activities that may be covered by the survey and those that are not.... many of the critics of government statistics have simply not taken this necessary step" (p. 4).
After discussing the narrowest through the broadest ways in which the underground economy had been defined in the literature to date, McDonald examined the extent to which evidence on the underground economy under any of these definitions implied mismeasurement of concepts measured by BLS data series and found that the critics had not proved their case. The importance of this work lies in its attempt to delimit explicitly several crucial interfaces between conceptualization and measurement pertaining to a researchable subject whose accumulated literature exhibited a notable lack of conceptual rigor. McDonald not only provided a solid point of departure for further conceptual and quantitative work on the "hidden" economy, but also pointed out what he had not covered in his assessment.

Of course, it hardly needs mentioning that it takes some creative thinking and observational acuity to "see" and to figure out how to name various forms of formerly unsuspected or illegal economic activity, let alone measure the monetary influence of these activities. Despite this, published economic research should not fail to mention any exclusions of already recognized "hidden" economic activity, where appropriate to the topic, and to state something about the potential effect of such exclusions on the findings at hand, regardless of whether this effect is minimal, large, or unknown. Since "hidden" economic activity has been the subject of a great deal of attention in recent years, even a statement that one or another form of it is irrelevant to the research being reported will reflect a prudent and thoughtful research approach and may prevent certain predictable criticisms.

More generally, a discussion of exclusions should be included in all published research as a matter of course. It should no longer be acceptable to omit mention of subpopulations which cannot be included on a frame.
Excluding them from mention might well ensure that no future attention will be accorded them and could give the false impression that existing frames are adequate or that new frames may not be needed. Put simply, mentioning exclusions points the way to future research and places the reported research in the correctly limited context. As a start, it is essential that statistical studies begin with a more extensive interaction between subject matter experts and research methodologists. The gains can be large and may well enable researchers to avoid many of the other coverage problems discussed in this report.


1.2. Frame construction and maintenance

Once a decision is made concerning the target population, either the sample design must be based upon an available sampling frame(s) or the frame(s) must be constructed specifically for the study. Dalenius (1985) notes the following three important properties of a frame:

- Makes it possible to compute estimates concerning a population which is sufficiently "close" to the target population,

- Serves to yield a sample of elements which can be unambiguously identified, and

- Makes it possible to determine how the units in the frame are associated with the elements in the (sampled) population.

The first stage of sampling is usually dependent upon a frame consisting of a physical listing of units. This may be a list of names of individuals, establishments, institutions, counties, cities, streets, etc., or a list of numbers attached to city blocks, land area segments, houses, pages, or any number of other unique, definable entities. However, as Kish (1965, p. 53) notes, a "Frame is a more general concept: it includes physical lists and also procedures that can account for all the sampling units without the physical effort of actually listing them."
Deming (1960) cites one exception to the making of a list of sampling units, i.e., when a watch is used to sample time intervals during which customers leaving a store are interviewed.

The units listed in the initial frame may not correspond to the units about or from which information is sought. Often, additional frames are needed for successive stages of sampling in order to progress from available sampling units to the units to be contacted or measured. For example, areas may be selected from a listing or area of all blocks in an area frame. Housing units inside sampled areas may then be listed and sampled in order to achieve a listing of persons to be sampled that are members of the target population from which information is sought.

A more complex example is the procedure for selecting items to be priced in the Consumer Price Index. The sample of priced items is selected from items sold by a sample of outlets which, in turn, was selected from a list of outlets created from information provided by interviews with consumer units in addresses sampled from the decennial census, new construction permits, and area listings. In this case, interviews are conducted in a sample of housing units to create a sample frame of establishments, not a population frame, from which a sample is selected. Within the sampled outlets, probability methods are used to select increasingly more detailed classes of goods until a particular item is selected. A complete list of all the items available for sale is never constructed.

A variety of sampling frames utilized by agencies of the Federal Government is presented in table 1. Associated information related to construction and survey use of the frames is included.

In practice, with the exception of area frames consisting of land segments, the target population a sampling frame purports to represent is constantly changing.
For a one-time survey, when it is desirable to obtain data for a specific point in time, this fluctuation is not usually critical, assuming the frame represents the near truth relative to the time of interest. It becomes more critical for ongoing surveys. While for such surveys, panel maintenance rules are inevitably applied so that the frames remain representative of the changing target populations, these rules are often difficult to apply comprehensively because of funding limitations and/or methodological complexities (see appendixes A.1 and A.4). The result is that, over time, any panel may no longer be representative of the target population that is to be measured. Thus, resampling from a current frame usually occurs regularly for such surveys. A current frame is usually available because a procedure for updating the frame is formulated during the panel survey's design process and so is in place at the time the first sample is selected. This updating assures that the frame remains representative of the target population over time. For example, a universe file has been established for the Producer Price Index survey. The primary purpose of this file is to provide up-to-date establishment information including name, location, industry classification, employment, and other pertinent items. These data are obtained via telephone interviews during a frame refinement process or by personal visits to collect data only from sampled units (see appendix A.4, section III).

Not all frame maintenance procedures address problems of coverage. Some, such as the removal of inactive units, are geared toward sample efficiency. Still, neglecting such procedures can affect coverage. For example, deaths on a frame may be sampled but are not likely to respond. If they are treated as active units and data are imputed for them, bias is introduced. Therefore, it is proper to consider frame maintenance methods in more detail (see section 1.2.2).
Before doing so, however, it is useful to note some additional distinguishing features of sampling frames.

Not all sampling frames are maintained over time, even those for ongoing surveys. In fact, the frames created for many sampling operations are discarded once samples are selected and approved. The sample that is representative of the frame at the time it is selected does not remain representative of the population of interest over time and neither does the frame from which it was drawn, if not maintained. When and if a new sample is selected, it is first necessary to construct a new frame that represents the current target population. An example is the use of the Census of Manufactures as a frame for the selection of the Annual Survey of Manufactures sample. The Census of Manufactures represents the manufacturing establishment population at a point in time and thus is not subject to change until the next Census of Manufactures, in 5 years. It serves as the primary, but not the exclusive, frame source for the Annual Survey of Manufactures, and is itself a derivative of the Standard Statistical Establishment List (SSEL). Once the sample for the Annual Survey of Manufactures is selected, it undergoes coverage updating each year, but no updating to the census frame can be done until the next Census of Manufactures. When the next census is completed, it will serve as the new frame for the next sample selection. The new census, while conceptually an update of the old census, is in fact developed from the latest version of the SSEL, which itself made use of the prior census results (see table 1).

For other sampling operations, the frames are evolutionary; that is to say, they are not fixed nor are they instantaneous creations. Instead, they evolve from periodic updates to a previous version of the frame.
Each sample is taken to represent the target population at the reference time; however, the frame is maintained and updated to reflect the continuity of changes in the population it covers. In this context, frame maintenance is part of an iterative procedure, with results of a given survey contributing to changes in the frame from which subsequent cycles of samples are drawn. Two examples of this type of frame are the unemployment insurance (UI) file maintained by the Bureau of Labor Statistics (BLS) and the SSEL file maintained by the Bureau of the Census. Both files maintain a current profile (ownership, mailing address, Standard Industrial Classification (SIC) code, physical location, etc.) of economic entities in the United States. The UI file is updated quarterly using employer reports to the UI system. The UI file is supplemented with quarterly data for multiple reporting units, and the SIC and county codes are verified on a rotational basis, one-third of the establishment population each year. The SSEL is updated on a continuous basis using a variety of sources, including administrative records from the Internal Revenue Service (IRS) and Social Security Administration (SSA), the economic censuses conducted every 5 years, the Company Organization Survey conducted annually (except during a census year for most multiunit companies), and the many current economic surveys conducted by the Bureau of the Census. The UI file serves as a sampling frame for most BLS establishment surveys (see appendix A.4), while the SSEL is the underlying frame for most Bureau of the Census' economic censuses and surveys (see appendix A.1).

Other examples of evolutionary frames are two frames maintained by the Energy Information Administration (EIA). The Oil and Gas Well Operator List is used as the frame for the Annual Survey of Crude Oil, Natural Gas, and Natural Gas Liquids.
A list of firms selling petroleum products is used as the frame for two surveys: The Annual Fuel Oil and Kerosene Sales Report and the Monthly Petroleum Products Price Report. Information to update these frames comes, in part, from responses given by operators and firms on their survey submissions. Information from several other sources, including the triennial Petroleum Product Sales Identification Survey, is also used in adding, deleting, or modifying entries on the appropriate frame.

One other point is worth noting. Files whose primary purpose may be to serve as a sampling frame may serve other functions as well. For example, UI data collected by the States are used primarily to administer the Federal-State UI system. Additionally, the file provides a base from which to estimate the wage and salary component of national personal income and the gross national product. In addition to being a sampling frame, the SSEL serves many other purposes. It must fulfill the needs of many different survey programs with many different requirements. Because of this diversity, the amount of information included is limited, so the SSEL is not always used as the direct establishment frame source for sampling operations at the Bureau of the Census. For example, the Current Industrial Reports surveys are selected from a frame created from the Census of Manufactures. These surveys are commodity surveys, and, for the most part, the population of interest is all producers of the particular commodities covered by the survey. Primary producers of these commodities can be identified on the SSEL, i.e., those establishments classified in industries which include those commodities, but the SSEL does not contain information on quantity or value of those commodities. More importantly, secondary producers cannot be identified, e.g., a steel plant which also happens to produce leather shoes would not be identified as in the scope of a survey to estimate shoe production.
For some surveys, the contribution of secondary producers could be significant. The Census of Manufactures, on the other hand, contains product data which allow all known producers to be identified, and for this reason sampling frames are created directly from it. The underlying basis for this census, of course, is the SSEL.

The remainder of section 1.2 contains a discussion of coverage errors associated with the creation and maintenance of physical lists as sampling frames like those included in table 1. Section 1.2.1 gives a classification of frame errors as put forward by Kish (1965) and modified by Lessler (1980). The problems of maintaining or updating a sampling frame to reflect changes in the covered population over time are addressed in section 1.2.2. The concerns and procedures discussed are also relevant to the creation of a physical list of population elements which is to serve as a sampling frame. Section 1.2.3 treats the special case of frame updating or creation by means of matching and merging multiple lists to create a single more current or complete frame.


1.2.1. Classification of frame errors

Kish (1965) states a "frame is perfect if every element appears on the list separately, once, only once, and nothing else appears on the list," and classifies possible frame errors into four types: Missing elements, clusters of elements appearing on the list, blanks or foreign elements, and duplicate elements.

In a detailed presentation of errors associated with frames, Lessler (1980) classifies six types of error: The four types that Kish discusses, plus incorrect auxiliary information, and information insufficient to locate target elements. Incorrect auxiliary information can affect the coverage of the frame if the information is used to define subpopulations or subframes.
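
Kish's four error types can be made concrete with a small sketch. The function name, the data layout (a mapping from each listing to the elements it leads to), and the identifiers below are illustrative assumptions, not part of Kish's or Lessler's presentation; real frames rarely allow the target population to be enumerated this way, which is why missing elements are so hard to detect in practice.

```python
from collections import Counter


def classify_frame_errors(frame, target):
    """Flag Kish's four frame error types for a toy frame.

    frame  : dict mapping a listing id to the list of element ids it leads to
             (hypothetical layout chosen for illustration)
    target : iterable of element ids in the target population
    """
    target = set(target)
    times_listed = Counter()   # how many listings lead to each target element
    clusters, blanks = [], []

    for listing, elements in frame.items():
        in_scope = {e for e in elements if e in target}
        if not in_scope:
            blanks.append(listing)       # blank or foreign listing
        elif len(in_scope) > 1:
            clusters.append(listing)     # one listing, several elements
        times_listed.update(in_scope)

    return {
        "missing": sorted(target - set(times_listed)),
        "clusters": sorted(clusters),
        "blanks_or_foreign": sorted(blanks),
        "duplicates": sorted(e for e, n in times_listed.items() if n > 1),
    }


# Toy example: element "d" is missing, listing "L2" is a cluster,
# "L3" is blank, "L4" is foreign, and element "a" is listed twice.
frame = {"L1": ["a"], "L2": ["b", "c"], "L3": [], "L4": ["x"], "L5": ["a"]}
target = {"a", "b", "c", "d"}
result = classify_frame_errors(frame, target)
print(result)
```

A frame for which all four lists come back empty meets Kish's definition of a perfect frame in this toy setting.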
Information insufficient to locate target elements does not reflect a coverage error in the frame, but may result in a coverage error as discussed under rules of association in section 2.1.1.

Missing elements. The omission of units in the target population causes greatest concern. Because units are missing, no examination of any sample from the frame will reveal the nature of the missing component of the population. Research conclusions may be erroneously extended beyond an incomplete frame on the frequently tenuous assumption that missing units are like or very similar to those on the frame. This assumption is to be distinguished from the assumption often made for sample estimation purposes that survey nonrespondents are like respondents. When this assumption about the frame used is not clearly revealed in research reports, the research community receives misinformation, as mentioned in section 1.1.

Missing units are most commonly the result of the following situations: Absence from sources used for frame construction, failure to report to an administrative system, births (new to relevant population), and zero units by definition. All of these circumstances might contribute to a conclusion that missing units are not like others which are included in the frame. Because it may be extremely expensive to attempt to obtain complete coverage, an organization may or should show the missing component to be a trivial proportion of the total or institute some form of estimation procedure to account for the missing portion of the population. This is especially true when the missing units are suspected of being unlike the included units.

Examples of list frames considered very nearly complete for survey purposes include the UI file of business establishments, the SSEL, Oil and Gas Well Operators, the Department of Defense Master Gain/Loss File, and the National Master Facility Inventory (NMFI) for health services.
(Refer to table 1 for some selected characteristics of these and other frames.) Some organizations such as the Department of Agriculture and the Bureau of the Census maintain area frames that are considered complete, since all areas of land in the United States are contained within the frames. Therefore, all activities occurring within the United States are theoretically reachable through these frames. However, completeness does not imply the frames or the surveys which use these frames are free of coverage errors.

Clusters of elements appearing on list. A frame is ideally composed of individual sampling units with known characteristics which identify or link to reporting units. The initial sampling units may be known to consist of clusters of subunits which can be incorporated into a sampling design. An example would be a listing of single-family dwellings that contains some duplexes. Another example is a list of farm operator names of which the vast majority represent a one-name/one-farm relationship but some represent a one-name/multiple-farm relationship. Jessen (1978) describes four different relationships between what he refers to as frame units and observation units. These various relationships introduce complexity into the survey process. There is a definite possibility for coverage error if field representatives have not been thoroughly trained in the proper procedures for handling clusters of reporting units associated with a single sampling unit.

Blanks or foreign elements. If a frame is created or an existing list modified for a particular one-time survey, elements on the list which are blank or are not members of the population of interest should be removed. However, most Federal surveys are repetitive or ongoing, and many frames are used for more than one survey. Thus, quite often it is appropriate to retain elements on the frame which previously were members of the population of interest for at least one survey.
For a discussion of frame creation and maintenance procedures designed to deal with inactive or out-of-scope frame elements, see section 1.2.2.

Duplicate elements. Duplication of units on the frame may result in overcoverage, i.e., some members of the population are represented more than once. Population totals may then be overstated and means could be biased. Moreover, multiple representation of units on a sampling frame leads to sampling inefficiencies. There are, however, survey procedures which may be employed to identify and compensate for frame duplication (e.g., see Gurney and Gonzalez 1972).

Data collection may be complicated in the face of suspected frame duplication by the necessity of obtaining additional information in order to allow matching with the frame to find other frame elements representing the same population unit. For example, a farm name may be present on the list in addition to the farm operator's name. In the case of partnerships, any enterprise may have multiple representation through the names of individual partners. The necessity of obtaining these names and cross-checking against the frame lengthens the interview and complicates the survey process (see section 2.1.1, Location errors). The Producer Price Index Establishment Universe Maintenance System was developed for the Producer Price Index survey as a means of minimizing duplication as well as other sampling frame problems. It captures all changes made during frame refinement and collection feedback (see appendix A.4, sections III-V).

Undetected duplication resulting from nonsampling errors made during data collection or frame-check activities may result in a biased survey estimate. The extent of the bias depends upon the amount of duplication for which no adjustment is made and the size of the units involved. For example, a business enterprise may exist in the form of a vertically integrated company having a pyramid structure.
Individual units may then maintain their own books on number of workers or value of production and contribute to the next higher unit in the structure. The parent unit may have the relevant data pertaining to the entire organization. The effect of not detecting the relationship among these sampling units depends upon which units happen to be included in the sample and how the structures of their operations compare to those of the remainder of the population.

Incorrect auxiliary information. Great care must be exercised when units are intentionally excluded from a frame because they are not thought to be members of the population of interest. Errors in frame variables, like size, type, class, or location of unit, could cause valid units to be excluded. For example, the SSEL file contains a relatively large number of records for which no industry classification has been assigned. These unclassified units become missing units on the various frames which are derived from the SSEL, since frame eligibility is first determined on the basis of industry classification. A major effort is made prior to each census year to code the unclassified units that have accumulated on the SSEL since the previous census, including, as a last resort, mailing an inquiry to an establishment to obtain a description of its activity. Little is done between census years because of the cost, but the business surveys which use the SSEL attempt, on a sample basis, a yearly classification, since experience has shown that most unclassified units are ultimately coded to their domain.

For a discussion of frame creation and maintenance procedures relevant to misclassified elements, see under section 1.2.2, Misclassified elements.


1.2.2. Frame maintenance

In this section, frame maintenance procedures are discussed with reference to the kinds of coverage error described in the previous section.
These procedures can be classified as follows:

- Adding new frame elements or births,
- Eliminating or identifying inactive frame elements or deaths,
- Correcting misclassified frame elements,
- Identifying existing frame elements no longer in scope, or in scope for the first time, and
- Determining whether or not elements have combined with other elements or have split from existing elements (e.g., change in ownership, mergers, and divestitures in an economic setting).

Each of these updating procedures is discussed in turn below. The discussions address the effects on the frame of failure to update. Distinctions are made between updating procedures intended to determine the current status of existing frame elements, and those intended to identify elements not previously known to exist. In addition, procedures that update the frame as a whole are distinguished from those that may update only a subset of the frame.

New frame elements. When the research population is dynamic, it is important that the frame which represents it be updated to reflect births. Samples drawn from frames which are not updated for births can result in serious biases, especially if simple weighted estimates are to be used (see discussion of missing elements in section 1.2.1).

One effective method for detecting new units is to canvass periodically the existing frame elements. As an example, the larger (50 or more employees) multiunit companies on the SSEL are canvassed on a yearly basis (with the exception of the census year) via the Company Organization Survey. A proportion of the smaller companies is also canvassed in years other than either the census year or the year following. Companies are queried as to whether or not they have started new operations. However, companies do not always specify whether a newly listed establishment is a new entity (birth) or represents the purchase of an existing plant.
If a plant is treated as a birth and sampled when, in fact, it had a chance of selection under another name or code, bias can result. (See appendix A.1 for additional details.)

A second method of identifying new units results from coverage maintenance operations performed for samples selected from the frame. This method, like the first, uses canvassing, but only of the sampled portion of the frame units. As part of the questionnaire administration process in nearly all surveys, inquiries are made about the status of the sampled units and whether any changes in their status have occurred since the last data collection period. Although the inquiries are targeted to sampled units believed not to be births, sometimes incidental information about other units (including births) can be obtained. This is obviously more a random than systematic approach for identifying new units. Inquiries made in the Annual Survey of Manufactures of single-unit companies provide an example of the use of this approach. Each sampled single-unit company is asked whether any additional plants operate at its location or whether the company owns any additional plants or is owned by someone else. The purpose of these inquiries is to determine whether or not the company is a multiunit company. If the single unit does identify other locations, these may well be establishments which are new or which were not previously known to exist.

Establishments are also added to the SSEL through new employer identification numbers (EIN's) received from the Internal Revenue Service. New numbers do not necessarily imply new establishments, however, as existing plants often request new numbers. The SSEL does not distinguish between the two. Duplication on the file of a plant under both a new and an old EIN will soon resolve itself, as the old EIN will eventually show no payroll data and will be dropped. Survey designers need to be able to identify the true births, however, and this requires additional work.
In the Annual Survey of Manufactures, for example, classification cards are mailed to all manufacturing-coded establishments given a new EIN in an attempt to determine whether the establishments are births or existing plants. Only a sample of true births is added to the survey panel (see appendix A.1, section II).

Administrative records are also used to add establishments to the UI file. New business establishments are required to file with the State employment security agencies. However, there is a time lag between filing and being added to the UI file. Units added to the UI file are not necessarily births. Mergers, changes in ownership, branch offices, etc., may sometimes be assigned new UI account numbers. In an effort to address this problem, State agencies are trying to identify units which are legal predecessors and successors within the UI system. In addition, units which do not meet the legal UI requirements, but are still essentially the same economic units, may be identified as predecessor/successor by the States. In the meantime, the Producer Price Index survey annually uses an automated process whereby the new incoming UI file is compared to the universe file. If an establishment fails to match a unit on the universe file, it is added to the universe file with a special code (see appendix A.4, section III).

The Bureau of Labor Statistics (Grzesiak and Tupek 1987) has conducted several studies of business births in conjunction with its Current Employment Statistics program. The usefulness of the UI file as a sampling frame for new businesses is constrained by the delay between the time a business first hires employees and the time it enters the UI file. A study of all 12,983 UI accounts (the sampling frame for this program) assigned by Florida for 3 months in 1984 found almost 80 percent were new accounts without predecessors.
The study focused on determining the length of time between a business's first liability for UI coverage and its entrance on the UI file, which depended on how the State discovered the employer and whether the employer had a predecessor. The median lag-time for all new accounts was found to be 120 days. A study was also conducted in New York to develop a methodology for identifying new businesses using the UI system and to construct new procedures for estimating the employment of new businesses for incorporation into the Current Employment Survey. The median lag-time in New York was also found to be 120 days, with 93 percent of the establishments having fewer than 10 employees.   Record checks with outside sources can also be used to identify birth elements. Generally, these checks do not allow one to distinguish between new establishments and previously missed establishments, but in either event they provide information for updating the file, and reveal elements that had no chance of inclusion in previously selected survey samples. One example of an outside source frequently used in record checks of establishment frames is the trade association list.   Several methods have been used for identifying birth elements to the National Master Facility Inventory. Each method has relied on State agencies' lists of facilities. (See appendix A.3, section II for details.)   The traditional method for including births in housing unit surveys is to update field-generated listings of sampling or reporting units within sampled geographic areas. Initial listings are usually made just prior to the first interviewing period and are subsequently updated through a recanvass to correct errors and to add newly constructed units.   The Bureau of the Census uses this approach in some geographic areas for the housing unit surveys it conducts. However, in most areas, births are included by sampling building permits. (See appendix A.7, section II for details.) 
Sampling building permits results in a significantly lower sampling variance, since large housing projects can result in very large clusters of sampling units being added during the alternative field-listing update process.   However, the building permit files do not identify illegal new construction, conversions, and new mobile home placements; nor do they identify new special places, such as dormitories, fraternity houses, boarding houses, and public housing. To illustrate, it was estimated for the 1985 Annual Housing Survey that approximately 25 percent of all new mobile homes were missed (Schwanz 1988a). In the Survey of Income and Program Participation, the undercoverage of new mobile homes in the building permit file was estimated to result in a 1 percent underestimate of the number of households in poverty (Singh 1989). In 1976, the undercoverage of births in the building permit frame was estimated to be about 2.3 percent (U.S. Department of Commerce 1978a). Since the Bureau of the Census normally uses this building permit frame for sample augmentation over a 10-year period, e.g., 1975-85, undercoverage may increase substantially over this time span. (For a more complete description of the procedures used by the Bureau of the Census to identify and sample dwelling units created after the last decennial census, see appendix A.7.)   Another methodology for capturing new housing construction and reducing undercoverage in sampled geographic areas is the half-open interval procedure. Instead of listing all units within the sampled area, a string of k units is listed in a predetermined order. The string begins with a designated unit from the original frame and is bounded by the k-th unit that was reported in the original frame (Montie and MacKenzie 1978). A modification of this procedure was used in the 1977 National Health Care Expenditure Survey and the 1980 National Medical Cost Utilization and Expenditure Survey. 
Cursory analysis indicates the approach may be limited in its ability to capture new construction (Adams 1989).   Inactive frame elements. Efficient sampling dictates that inactive units or deaths on a sampling frame be identified. In the initial construction of a frame, deaths in the various sources used to construct the frame should be identified and, if needed, removed. Existing frames should be updated periodically to remove or flag units that are no longer active. Failure to identify deaths on a sampling frame does not necessarily result in overcoverage, but, as was noted earlier, biased sample estimates can result if an inactive element is sampled and imputed for when no response is obtained.   It may be desirable to retain inactive units on a frame for a certain period of time because of estimation considerations or because it is desirable to have a history of elements available. When doing so, the inactive units should be identified either through flagging or partitioning to a distinct death subfile. In the Producer Price Index survey frame, a death is defined as either a sampled unit which has been identified as out of business or out of scope, or any existing frame unit that remains unmatched when the next UI file is compared to the universe file. All deaths are removed from the universe file and added to the death file (see appendix A.4, sections III and V).   Two methods, both of which were mentioned in connection with identifying births, are particularly effective in identifying whether units are still active or in existence. The periodic canvass of existing frame elements is one. The frequency of these updates varies. For the COS, a canvass of the multiunit portion of the SSEL is conducted annually (except for the census year) for the large companies (50 or more employees). The smaller companies are periodically canvassed as well, but not on a yearly basis and not all in the same year. 
The companies are asked to identify any listed plants that are now out of operation. The National Agricultural Statistics Service's list frame is canvassed within 5 years of the preceding canvass. The UI file is updated quarterly using a census of UI reports. Units not paying taxes for some period of time (4-8 quarters) are removed from the frame. On the other hand, fixed frame elements are not canvassed at all, since frames such as the Census of Manufactures are not updated. It should be noted, however, that the companies in the Census are canvassed to update the SSEL.   Inactive elements are also identified through maintenance operations performed on samples drawn from the frame. In nearly all survey panels, sampled units which are no longer active are routinely identified through inquiry. Maintenance operations performed on samples are more likely to reveal changes in status of known elements (from active to inactive) than to reveal births (new elements), although this method can be used for both purposes. The information obtained using this method can then be used to update the source file (frame). The SSEL is continually being updated from information obtained by the many Current Business Reports surveys, the Current Industrial Reports surveys, the Annual Survey of Manufactures, and many other economic surveys. The UI and National Agricultural Statistics Service list files are updated in this fashion as well. It is important to note that this method yields information for updating only the sampled units and for adding new elements revealed through survey operations. It does not permit updating the entire frame.   Misclassified elements. A problem with many frames is not that elements are missing, but that they are misclassified or are not classified at all with respect to one or more variables. 
This assumes importance if the variable or variables that are misclassified determine either the elements eligible for sampling or the subpopulations for which estimates are produced. Housing occupancy status (vacant or occupied), geographic codes, SIC code, age, race, and gender are examples of such variables.   For economic surveys, where the population distribution of the variable to be estimated is usually extremely skewed, some measure of size is often used in the sample design as either a stratification variable or for sampling with probability proportional to size. Incorrect values for the variable(s) used to derive an establishment's measure of size can result in its being placed in the wrong stratum and in other sampling inefficiencies. To illustrate, in the Annual Survey of Manufactures, which has a probability-proportional-to-size design, an arbitrary certainty stratum is defined to consist of all frame establishments with total employment greater than 250. Since estimates are not published for this stratum, the erroneous inclusion or exclusion of an establishment in this stratum because of an incorrect employment size value does not bias a survey estimate, but the resulting sampling error may be different than expected.   Coverage error at the estimation stage due to misclassification will often be confined to sublevels. For example, in economic surveys like the Annual Survey of Manufactures (ASM) with independent sampling across SIC's, an inaccurate SIC code could result in undercoverage if the code identifies the unit as nonmanufacturing. However, if the SIC code is incorrect, but within manufacturing, only industry level estimates will be affected (see appendix A.1, section II).   Incorrect classifications may result from errors in data input during frame creation, but more often they occur because changes in frame units are never detected by the surveying organization. 
To update industry and location changes on the UI file, BLS conducts a refiling survey in which SIC and county codes are verified for each unit of the universe on a 3-year rotational cycle.   In the 1977 Economic Censuses, misclassification studies were conducted for both the employer and nonemployer segments of the administrative records frame. For the employer segment, a subsample of 5,505 out-of-scope employer cases on the 1977 Economic Censuses Master Sample were mailed the Economic Census General Schedule to complete. An estimated 3.1 percent of out-of-scope employer establishments with 0.4 percent of the employees and 0.3 percent of the annual payroll were found to be misclassified as out of scope (Hanczaryk and Sullivan 1980). An estimated 12 percent of the nonemployer establishments were misclassified as out of scope, resulting in a 20-percent underestimate of nonemployer receipts.   A majority of the employer misclassifications resulted from errors in the SIC on the administrative file. In some industries, an establishment may consist of distinct but related activities, e.g., construction and real estate. However, it is classified into an industry on the basis of which activity yields the highest percentage of total receipts. For example, an establishment with 45 percent of receipts attributable to construction and 55 percent to real estate sales would be classified as in the real estate industry and excluded from the Economic Censuses. If the percentages had changed between the time of coding and the census, the establishment would be misclassified, since construction was within the scope of the census.   The evaluation study found that many of the establishments misclassified on the basis of out-of-scope SIC codes had in-scope Principal Industrial Activity codes on the administrative file. Therefore, a significant drop in the misclassification rate could have been achieved through the use of another variable. 
On the other hand, missing or incorrect tax return Principal Industrial Activity codes were responsible for a majority of the nonemployer misclassifications. Again, in this situation, the use of additional tax return information could have substantially reduced the misclassification rate.   While many of the procedures discussed previously can, to a degree, aid in the identification of misclassified elements, none of them is really intended to address this problem comprehensively. Their effectiveness depends on the specific variable of interest. For example, although the SIC code is part of the information collected in the Company Organization Survey (COS), it is not likely that the companies would routinely verify that the right code is assigned. Nor can the Bureau of the Census determine the validity of the code, since the COS does not collect detailed data.   Another procedure to handle misclassification is used in the Producer Price Index survey. Since some establishments can easily modify their capital equipment to produce a different product, depending on demand, the 4-digit industry classification can change. Thus, industries for which a high proportion of this type of misclassification occurs are treated as related SIC's and sampled at the same time. The sampled establishments are then assigned the proper SIC (see appendix A.4, sections III-V).   An auxiliary variable that is used as a measure of size can sometimes cause coverage error if it is incorrect. For example, in the Producer Price Index, the employment value on the UI file is used as the measure of size. Some establishments are reported with zero employees. In order to ensure that all units have a positive probability of selection within the probability-proportional-to-size sample design, two employees are added to the employment value of each unit. 
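The effect of this adjustment on selection probabilities can be illustrated with a minimal sketch. The function name, the employment values, and the sample size below are hypothetical and are not taken from the Producer Price Index procedures. The sketch also simply caps probabilities at 1 rather than setting aside certainty units, which an actual design would handle separately.

```python
def pps_selection_probabilities(sizes, n, offset=0):
    """Approximate selection probabilities for a probability-proportional-to-size
    design: n * (size + offset) / sum(size + offset), capped at 1.

    `offset` is the constant added to every unit's measure of size
    (2 employees in the adjustment described in the text).
    """
    adjusted = [s + offset for s in sizes]
    total = sum(adjusted)
    return [min(1.0, n * a / total) for a in adjusted]


# Hypothetical frame: employment reported on the file for four establishments.
employment = [0, 8, 40, 150]

# Without the adjustment, the unit reported with zero employees can never be selected.
raw = pps_selection_probabilities(employment, n=1)

# Adding two employees to every unit gives each unit a positive chance of selection.
adjusted = pps_selection_probabilities(employment, n=1, offset=2)

print(raw[0], adjusted[0])
```

The unit with zero reported employment has probability 0 in the unadjusted design and a small positive probability once the offset is applied, which is what removes it as a source of undercoverage.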
For EIA's survey of active oil and gas well operators, companies on the frame having no known production are sampled, so that they are represented along with those operators having known current production.   Unclassified elements are unique in that it is known at the outset that they exist. Knowing this information and perhaps the reason(s) for lack of classification allows the surveying organization to design a strategy for obtaining codes. This strategy may not necessarily be useful for resolving the lack of classification on all frames. For example, most of the units on the SSEL without SIC codes do not have them because the SSA is unable to assign them a code when their applications for new EIN's are received. Prior to a census year, the Bureau of the Census makes a concerted effort to code these records. This includes identifying key words in an establishment's name which might identify its activity and, when this fails, culminates in the mailing of a classification card to the establishment which asks for a description of its activity. A certain number of records remain uncoded. Although new EIN's are obtained by the Bureau on a continuing basis, little is done for them between census years because of the cost. However, attempts are made to classify a sample of them for the business surveys because most unclassified units fall within their population of interest.   Out-of-scope elements. Closely related to the problem of misclassification is the problem of out-of-scope elements, i.e., elements that if properly classified would not be part of the population of interest. They differ from the type considered above in that if they were properly classified, they would be dropped from the frame. Out-of-scope cases generally arise because historically they were coded in error. It may be, however, that their status has changed so that they are no longer part of the population of interest (see appendix A.2, section II). 
As with death elements, the presence of out-of-scope elements on a sampling frame does not bias sample results should they be sampled (assuming the survey processing identifies them as out of scope), but it does compromise the efficiency of the sample.   Split-out or combined frame elements. The composition of elements constituting a frame will often change over time. This is especially true for establishment frames, where, for example, individual plants are bought and sold by companies, two or more companies merge, or companies divest. But it is also true for frames of housing units and households. In these instances, frame maintenance properly includes activities which update or modify the frame to account for compositional changes.   Compositional changes do not necessarily affect the number of units on establishment frames and, thus, the overall coverage of the frames. Indeed, it is likely that no changes in plant activity occur at all. If both the sampling unit and the reporting unit are the establishment, it is really not vital that the corporate owner of the plant be known as far as data collection is concerned. From a coverage point of view, however, ownership may be important because the sample status of a sold establishment often depends upon the status of the buying company. Also, in some economic surveys, establishment records are combined into company records for sampling purposes. Thus, there are a variety of other reasons, some coverage related, which mandate maintenance of proper identification of plant ownership.   For the SSEL, the annual canvass of the multiunit establishments by the COS is the prime basis for maintaining the identification. 
The COS provides a list for each company of all the known establishments of the company and requests that the company verify and update the list by indicating any new establishments it has opened, the name and seller of any additional plants it has acquired, the name and purchaser of any plants it has sold, and any plants it has closed down. COS processing identifies the new owners (successors) of sold plants and the old owners (predecessors) of bought plants and corrects the records for those companies as well. Similar activities are conducted for virtually every census/sample survey enumeration at whatever level the reporting is done.   The UI file is another example of a broad-based file which is supplemented with quarterly data for multiple reporting units. The frame units that have undergone changes in composition are identified as new owners with predecessor relationships and old owners with successor relationships. Thus, for establishment-based surveys, the company is queried about changes in the operation of each establishment, including whether or not it has been sold or leased to any other company.   In the Producer Price Index survey, four economic characteristics are used to define a unit. Pertinent establishment data are obtained via telephone interviews during the frame refinement process. Any change in the composition of an establishment is captured on the universe file. If an establishment is split, the new portion is treated as a birth and is added to the universe file with a special code. If a unit is sampled and during the first interview a portion is identified as either split or sold, it is treated as a field-created sampling unit and data are collected (see appendix A.4, sections III and IV).   Agricultural frames generally do not have this kind of multiunit situation. Each operation (farm or ranch) is defined by one common land operating arrangement. 
This lowest common denominator precludes the necessity of keeping track of elements within farm units. However, bookkeeping arrangements covering farms in a hierarchical management system may necessitate periodic monitoring to be sure each unit is accounted for but not duplicated. A more complete description of counting rules for agricultural surveys is provided in the Quarterly Agricultural Surveys case study (appendix A.5).   Likewise, for the Census of Population and Housing frame, the identification of basic addresses does not change at the transfer of ownership from one person to another. Other identification problems may exist, such as how many separate living quarters really exist at a particular location.     1.2.3. Match-merging of independent source lists   In many of the examples of updating procedures discussed above, it was noted that outside source lists or files were used to update a primary frame. Among the problems arising from the use of such lists not addressed above are those associated with matching and merging such lists to update primary frames. These problems deserve special mention because they affect frame completeness.   The two general classes of error that can occur when combining lists are erroneously adding an element already in the frame and erroneously removing a qualified element from the frame. The two types of error are not equally problematic, since a more stringent set of rules can govern the deactivation of a frame element than governs the incorporation of a new element.   The updating process entails some formal matching between the primary frame and the source list. Various identifiers may be utilized in the match operation (Fellegi and Sunter 1969, Scheuren and Oh 1985). These identifiers may have different degrees of precision ranging from very precise (e.g., EIN) to less precise (e.g., name, address), and it may be that successive matches are attempted on each level of identifiers. 
At the end of this process, records on the primary frame and on the update list can be allocated into three mutually exclusive parts:   a. Records which are classified as matching (i.e., appear on both frame and source list), b. Records on the primary frame that do not match to the source list, and c. Records on the source list that do not match to the frame.   Some records which match may represent false matches. Depending on the quality of the source list, i.e., whether it is truly free of duplicates and out-of-scope units, false matches can lead to failure to add a unit or, more rarely, failure to identify a potential death.   Depending on the conceptual and operational fit between the source list and the in-scope population, the failure to match some frame records to the source list may or may not be a problem. The ascribed completeness, timeliness, and accuracy of the source list are all important in deciding whether unmatched entities have died and should be eliminated from the frame (or flagged) or to leave them on the frame and let sampled deaths be revealed during data collection. If any sampled entity is a death, that fact will become apparent when it cannot be found, although in a mail-out/mail-back survey, any such unit may be presumed to be a nonrespondent for which data are imputed.   Source list records not matched to the frame represent potential additions to it. Although these records can just be added to the frame, it would be more prudent to try to determine whether they are duplicates of existing units or are out of scope.   Many variants of the process described above occur in survey situations. At one extreme, there is no attempt to determine if firms with newly issued EIN's are already on the SSEL under an older EIN. 
At the other extreme is the procedure followed for the National Master Facility Inventory of inpatient health facilities, in which names and addresses of facilities from source lists that do not match the primary frame are automatically added to the frame.   Problems related to the timeliness of the source lists can arise. Since the population of interest is not static and events are cumulative, combining untimely information from source lists with that on an existing frame can lead to numerous errors. Ideally, one wants the frame resulting from the match-merge to include all elements in the population of interest and to exclude elements which are not. For example, if a survey is to be conducted of firms currently in business, one does not want to rely on a historical file of all businesses that does not denote firms no longer in business. To do so would be to risk including these firms in the sample. Less error prone, but still problematic, is the use of source lists containing information for units that existed at any single point during the year. Units that exist throughout the year will be included, along with those born during the year. But, those that died will also be included. Should such a list be used to update a frame for a sample survey in the succeeding year, units no longer in existence may be included. In such instances, the reasons units no longer exist may be extremely useful pieces of information, if they can be obtained. For example, has a firm simply gone out of business, did its name change, or was it purchased by another company? Otherwise, when deaths are sampled and revealed as deaths during data collection, frame records for these deaths can be flagged as inactive or deleted from the frame, as appropriate. Also, the use of less than timely source lists can result in the addition of unknown out-of-scope units that will remain on the frame to plague subsequent surveys.
1.3. Sample design strategies to minimize coverage error   In the previous section, the discussion focused on coverage error associated with sampling frames. Solutions to problems arising from the limitations of available frame sources are a major challenge to the survey design statistician. Colledge (1989) identifies and discusses 26 specific coverage and classification problems faced by Statistics Canada in its Business Survey Redesign Project, as well as possible alternative solutions.   This section presents some sample design and estimation options available to survey designers in dealing with recognized deficiencies in a frame. The options discussed are defining the target population to equal the frame population, random-digit dialing sampling, multiple frame sampling, sampling rare populations, and estimation procedures.     1.3.1. Defining target population to equal frame population   While it is important not to imply coverage of a wider population than the one covered by the frame(s), it is more important to make concerted efforts to reach every member of the original target population, even if this means using additional frames or more expensive procedures. Only intolerable expense or practical impossibilities should be grounds for narrowing the defined target population, as discussed in section 1.1.   Hansen, Hurwitz, and Jabine (1963) provide an example of how a coverage problem for a survey about truck ownership and operation was handled. When it became clear that State motor vehicle registration records did not include all trucks being operated and that coverage of truck registration varied by State, the scope of the study was redefined as registered trucks instead of all trucks.     1.3.2. Random-digit dialing sampling   One household sampling method used to avoid omission of households with telephones is random-digit dialing (RDD) (Waksberg 1978). 
The use of telephone directories as sampling frames often results in unacceptable levels of undercoverage because they omit unlisted numbers for some nontypical portions of the population. With RDD, a sample of telephone households is located through the use of randomly generated telephone numbers. In this way, only those households without telephones are omitted. For many surveys, this could be considered a trivial exclusion. In others, differences between telephone and nontelephone households may have a profound effect on the characteristics being measured. For example, measures of poverty and income from entitlement programs would most likely be biased because households in poverty or receiving such income are less likely than other households to have telephones. The collective experiences of numerous researchers and survey statisticians who have used RDD are presented in Groves, et al. (1988).   An extensive discussion of the health characteristics of persons in telephone and nontelephone households is presented by Thornberry and Massey (1988). Data from the National Health Interview Survey indicate that those in the nontelephone U.S. population are more likely to suffer disability days, chronic conditions, and hospitalizations than those in the telephone population. At the same time, those without telephones have fewer visits to physicians and dentists and are much less likely to have private health insurance. These findings are consistent with expectations, given there are disproportionately more low income families in the nontelephone population. The authors note that for most characteristics, the differences between the values for the telephone households and the total population are small because 93 percent of all households can be reached by telephone via RDD. However, estimates for certain population subgroups could be severely biased when based on an RDD survey. 
The authors note that an RDD survey seeking information on preschool aged children would exclude about 12 percent of them, and also almost one-third of such children living in poverty.   An example of favorable results from using RDD is reported by Williams and Chakrabarty (1983) for the State of Michigan portion of the 1980 National Fishing, Hunting, and Wildlife-Associated Recreation Survey. Parallel surveys were conducted utilizing an RDD sample and a subsample from previous Current Population Survey samples which did not depend upon presence of a telephone. The report points out "the socioeconomic characteristics and the sportsmen variables between the two studies do not reflect any substantially important differences." However, there were differences in results for "nonconsumptive users," i.e., wildlife-related activities outside of hunting or fishing. These activities were highly related to the geographic location of the user, so findings may result from the geographically restricted nature of the expired CPS samples compared to the unrestricted nature of the RDD sample.   A study by McGowan (1982) on telephone ownership in the National Crime Survey sample contains evidence that the exclusion of nontelephone households has a significant effect on the measurement of crime victimization in the United States. In this instance, the use of RDD without a supplemental frame to provide a sample of nontelephone households would be unacceptable.     1.3.3. Multiple frame sampling   Coverage may be improved through the use of multiple frames. Sometimes, no single frame fully covers the target population and merging independent source lists would be impractical. In this case, separate probability samples from different frames can be used to expand coverage beyond any available single frame. (Additional frames may also be used to increase sampling efficiency if coverage is already sufficient.) 
The use of multiple frames entails two assumptions (Hartley 1962):   - Every unit in the population of interest belongs to at least one of the frames, and - It is possible to record for each sampled unit whether or not it belongs to the other frame(s).   The first assumption requires linkage between the sampling frame units and the target population. Application of rules of association to accomplish this linkage is needed when sampling from any frame (Hansen, Hurwitz, and Jabine 1963). When multiple frames are used, sampling units are often different between frames. This is of no consequence as long as the different sampling units lead to a common reporting unit during the survey. Complete coverage of the reporting units should be equivalent to complete coverage of the population of interest. The rules of association, from sampling units selected to reporting units tabulated, must ensure the representation of each population element once and only once in the final estimates. Field representatives must be equipped with clearly defined rules that can be communicated to respondents to achieve this unique representation.   Difficulty with the first assumption, given a need to use multiple frames, arises because concurrent application of different rules of association may be required of field representatives depending upon which frame supplied the sampled unit. Potential errors in associating sampling units with reporting units are discussed in section 2.1.   The second assumption requires that frame membership be known for each population unit. Nonoverlapping frames are a special case, wherein each population unit is assumed to be on one and only one frame. The statistical theory for this special case is essentially the same as for stratified sample designs.   In the case of nonoverlapping frames, each frame represents a different, unique segment of the total population. 
The principal consideration here is that the same reporting unit not be included in more than one of the sampling frames. In this way, estimates for the nonoverlapping frames are additive to the greater population of interest. Examples of the use of nonoverlapping frames within government are the Current Employment Statistics survey as well as other establishment surveys conducted by the Bureau of Labor Statistics. The primary frame for employees comes from UI reports filed with State employment security agencies. The UI frame covers about 98 percent of wage and salary employment in the United States. Supplemental, nonoverlapping coverage comes from the Interstate Commerce Commission for interstate railroad employees. Another example is found in the National Cancer Institute's epidemiology studies, where the under-age-65 group is selected from a frame of driver's license records and the over-65 group is selected from a frame of Medicare records.

Most housing unit surveys conducted by the Bureau of the Census, e.g., American Housing Survey, Current Population Survey, Consumer Expenditure Survey, National Crime Survey, and the Survey of Income and Program Participation, use a combination of frames (U.S. Department of Commerce 1978a). In areas where building permits are required and maintained by a local government and the Census of Population and Housing addresses contain street names and numbers, the census lists are used as the basic sampling frame. A sample of building permits is also selected to cover housing units built after the census. Conceptually, these two frames are nonoverlapping even though they refer to the same land areas. In other areas of the country where permits are not available for sampling or the census address lists are considered inadequate, land areas are sampled, and an address list is created by field representatives. For a discussion of coverage errors in this listing process, see section 2.2.1.

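The additivity of estimates across nonoverlapping frames can be illustrated with a minimal sketch. It assumes, purely for illustration, that a simple random sample is drawn independently within each frame, so that each frame's total is estimated by expanding its sample sum by N/n. The function name and the input layout are hypothetical and are not taken from any of the surveys named above.

```python
from typing import Iterable, Sequence, Tuple


def nonoverlapping_total(frames: Iterable[Tuple[int, Sequence[float]]]) -> float:
    """Estimate a population total from nonoverlapping frames.

    Each element of `frames` is (N, sample_values), where N is the number of
    units on that frame and sample_values are the observed values for the
    units sampled from it. Simple random sampling within each frame is an
    assumption of this sketch.
    """
    total = 0.0
    for frame_size, sample_values in frames:
        n = len(sample_values)
        if n == 0:
            raise ValueError("each frame needs at least one sampled unit")
        # Expansion estimate for this frame: (N / n) * sample sum.
        total += frame_size / n * sum(sample_values)
    # Frames do not overlap, so frame-level estimates simply add.
    return total
```

Because no reporting unit appears on more than one frame, no duplicate removal or weight adjustment is needed before the frame-level estimates are summed, which is the same logic as a stratified design.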
Use of overlapping frames in the application of multiple-frame survey methodology mandates that extraordinary attention be paid to potential errors in the survey process. A population (reporting) unit may fall within any or all frames utilized. Sampling units are selected from each frame and linked by rules of association with corresponding reporting units. Each reporting unit must ultimately be represented exactly once across all frames utilized. This may be accomplished either directly through a matching process to remove duplicates or indirectly by weighting adjustments. (The latter tends to be far less costly.) Duplication because of multiple representation or omission through failure to account for the unit in at least one of the frames can result in serious coverage errors.

Sampling from overlapping frames is most commonly done when an area frame and an overlapping list frame are available. The area frame is generally designed to provide complete coverage by including as sampling units all land parcels which encompass the population of interest. The list frame is nearly always incomplete, a common attribute of lists, but its use provides certain sampling efficiencies which enable the multiple frame survey to provide the same precision at a much lower cost than would an area frame survey alone.

Examples of the area/list dual-frame survey approach may be found in the Department of Agriculture (in nearly all inventory and economic probability surveys conducted by the National Agricultural Statistics Service) and in the Bureau of the Census (in the Monthly Retail Trade Survey and the Services Annual Survey). The Department of Agriculture's application of this approach for the Quarterly Agricultural Surveys of crops, hogs, and grain stocks inventories is a typical illustration of the linkage requirements with multiple frame sampling (see appendix A.5).

An important special case occurs when an existing complete frame is used in conjunction with a list of telephone numbers. This general case has been discussed extensively in the literature. See, for example, Lepkowski and Groves (1986) and Biemer (1983). Important special cases are considered by Lepkowski (1988).

1.3.4. Sampling rare populations

There are two known procedures to compensate for undercoverage that are especially useful for surveys of rare or elusive populations: network sampling and capture-recapture methodology. Both are briefly described below.

Network sampling used in conjunction with multiplicity estimation (Sirken 1970 and Sirken and Levy 1974) relies on a known set of relationships (or links) between members of the population. Network sampling, unlike more traditional sampling, uses links which extend beyond the usual sampling or reporting unit by building rules for more extensive sampling. One example of such extended rules is the sibling rule, where sampled members are asked not only about themselves, but also about all brothers and sisters not living in the same household.

In network sampling, a sample is drawn from an established frame using a probability sampling procedure. The sample is then contacted and interviewed to determine which sampled members have the characteristic of interest. Sampled members are then asked about the set of related individuals having the characteristic being studied. In this way, several members of the population are covered in one interview. Since this procedure is potentially prone to increased response error or item nonresponse, the names and addresses for the related individuals who are said to have the characteristic of interest are often obtained, so that the individuals can be contacted directly. This technique is best known for its potential to improve efficiency when the characteristic of interest is rare.
However, it also has the potential for improving coverage when people are reluctant or unable to provide information about themselves and when the sampling frame is incomplete (Sirken 1983).

Recent decedents are an example of a population unable to provide information about itself. The traditional methodology would be to collect information at the household that had been the decedent's place of residence. A network sampling approach would be to collect information at the household of a surviving spouse, siblings, or children residing in the county of the decedent, either instead of or in addition to the decedent's last place of residence. Sirken (1983) reports on the results of experiments conducted in North Carolina to compare coverage between network sampling and traditional sampling (Sirken and Royston 1976). The traditional method missed 29 percent of the deaths; reports from decedents' relatives' households alone missed 22 percent of the deaths; and reports from both decedents' former residences and decedents' relatives' households missed 15 percent of the deaths. Emigrants are another group of people for whom network sampling can improve coverage because they cannot report for themselves.

Network sampling can be useful to improve coverage on incomplete sampling frames (Sirken 1983). Persons with no fixed address would usually be missed by traditional sampling but could be identified by relatives or friends. Also, if institutions or Armed Forces barracks are not included in the sampling frames, network sampling can be used to find persons living in these otherwise uncovered places.

Use of network sampling requires that the number of population units eligible to report each sampled individual be known. This number is used in the estimation process to adjust the probability sampling weight for each sampled unit.

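The weight adjustment just described can be sketched in a few lines. The sketch below assumes the commonly used form of the multiplicity estimator, in which each reported individual contributes the reporting household's sampling weight divided by the number of households eligible to report that individual. The class and field names are hypothetical, and the sketch is not presented as the exact formulation in Sirken (1970).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    """One individual with the characteristic, as reported by a household."""
    person_id: str
    multiplicity: int  # number of households eligible to report this person


@dataclass
class SampledHousehold:
    weight: float          # inverse of the household's selection probability
    reports: List[Report]  # individuals this household reported


def multiplicity_estimate(sample: List[SampledHousehold]) -> float:
    """Estimate the number of individuals with the characteristic."""
    total = 0.0
    for household in sample:
        for report in household.reports:
            if report.multiplicity < 1:
                raise ValueError("multiplicity must be at least 1")
            # Dividing by the multiplicity offsets the fact that a person
            # linked to several households has several chances of being
            # reported.
            total += household.weight / report.multiplicity
    return total
```

A person reportable by two households contributes half a weight from each household that is actually sampled, so that person is counted once in expectation. This is why the multiplicity must be collected for every reported individual.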
Also known as dual system (or multiple system) estimation, capture-recapture methodology assumes that one or several frames have less than perfect coverage of the population, and that the amount of undercoverage is unknown. Capture-recapture methodology is essentially a counting technique and is used to determine the number of individuals in a population, or the number of individuals with a specific characteristic in a known population.

The population to be studied is defined independently of any frames, but at least two overlapping frames are needed to make an estimate of the population size. Membership on any frame is modeled as a stochastic event and, for two frames, membership is also assumed to be an independent event between frames. The two frames are matched, or a sample from one frame is matched to the entirety of the other frame. An estimate is then made using the number of persons estimated to be in the first frame ($N_1$), the number of persons in the second frame ($N_2$), and the number found in the match to be in both frames ($M$). The estimator (Marks, Seltzer, and Krotki 1974, p. 15) of the population size ($T$) is

$$\hat{T} = \frac{N_1 N_2}{M}.$$

A number of assumptions are required to satisfy the model which generates this estimator; some of the assumptions can be relaxed if more lists are available for sampling. Lists can include administrative records, but the model requires the assumption that membership in the records system is a random event, an assumption that usually does not hold. (References include: Casady, Nathan, and Sirken (1985); Czaja, Snowden, and Casady (1986); and Cowan, Breakey, and Fischer (1988).)

For a general treatment of strategies for sampling rare populations, see Kalton and Anderson (1986). See also appendix A.2, section III.

1.3.5. Estimation procedures

Estimation procedures which compensate for known coverage error in frames may be used to decrease the bias of survey estimates.
Improving frame coverage is always better than using these estimation procedures. One such procedure is ratio estimation or benchmarking; another approach is multiple frame estimation.

The Bureau of Labor Statistics employs a benchmarking procedure to revise monthly employment estimates from the Current Employment Statistics survey (U.S. Bureau of Labor Statistics 1989). Sample estimates are compared each year with later summarizations of mandatory UI reports filed by employers. The UI data, which serve as a benchmark, are an aggregation from the same source as the microdata used to construct the frame from which the sample was selected, except that the benchmark data are one year newer. Hence, the benchmark file takes into account new firms or changes in industrial classification to ensure more accurate coverage. The completeness of the UI administrative data affords the opportunity to analyze and adjust for frame deficiencies (Thomas 1986).

Most of the current surveys conducted by the Bureau of the Census use ratio estimation to projected population totals by age, sex, and race. For further discussion of the procedure as applied to the Current Population Survey, see appendix A.7, section VI.

The use of multiple overlapping frames requires the use of an estimator which may be written as follows for the two-frame case, where frame sizes are known but overlap domain size is unknown (Hartley 1962):

$$\hat{Y} = \frac{N_A}{n_A}\,(y_a + p\,y'_{ab}) + \frac{N_B}{n_B}\,(y_b + q\,y''_{ab})$$

where subscripts A and B denote the two sampling frames, N and n are the numbers of population and sampled units in each frame, and Y is the total of some variable to be estimated. Subscripted y's are estimated totals from the two frames ($y_a$ based on units uniquely in frame A, $y_b$ for units only in frame B, and $y'_{ab}$, the estimated total for units in both frames as measured by the frame A sample, while $y''_{ab}$ applies to units common to both frames from the frame B sample), and p and q are weights which sum to one.
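The two-frame estimator can be written out as a short function to make the role of each term concrete. This is a minimal sketch: the function and argument names are hypothetical, the y terms are treated as unexpanded sample sums (an interpretation suggested by the N/n expansion factors), and the weight p is supplied by the caller rather than optimized.

```python
def dual_frame_total(
    N_A: float, n_A: float,   # frame A: population size and sample size
    N_B: float, n_B: float,   # frame B: population size and sample size
    y_a: float,               # sample sum, units found only in frame A
    y_ab_from_A: float,       # sample sum, overlap units, frame A sample
    y_b: float,               # sample sum, units found only in frame B
    y_ab_from_B: float,       # sample sum, overlap units, frame B sample
    p: float,                 # weight on frame A's overlap estimate
) -> float:
    """Two-frame estimate of a total, with overlap weights p and q = 1 - p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n_A <= 0 or n_B <= 0:
        raise ValueError("sample sizes must be positive")
    q = 1.0 - p  # the two weights sum to one
    part_A = (N_A / n_A) * (y_a + p * y_ab_from_A)
    part_B = (N_B / n_B) * (y_b + q * y_ab_from_B)
    return part_A + part_B
```

With made-up inputs N_A = 1000, n_A = 100, N_B = 400, n_B = 50, y_a = 30, y_ab_from_A = 20, y_b = 0, and y_ab_from_B = 16, setting p = 0.5 gives 10(30 + 10) + 8(0 + 8) = 464, while p = 0 gives 10(30) + 8(16) = 428. The overlap domain is counted once in either case because p and q always sum to one.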
In this way, the estimates unique to each frame are added to a weighted combination of those units common to both frames. The parameter p is selected so that variance is minimized subject to a cost function reflecting differences in sampling from each frame.

A common application of this estimator utilizes a complete area frame A and an incomplete but more efficient list frame B to generate a screening multiple frame estimate of the form:

$$\hat{Y} = \frac{N_A}{n_A}\,y_a + \frac{N_B}{n_B}\,y''_{ab}$$

where the estimate for the units unique to the area frame (nonoverlapping domain) is added to the list frame estimate for the units common to both frames. In this case, the parameter p, from the general formula above, is zero, and q equals one. Other terms disappear because no units exist on the list that are not contained within the area frame.

It is easy to see in the simple form of the multiple frame estimator the importance of properly determining whether or not a unit is represented by one or both frames. Unrecognized overlap between the frames produces duplication in the estimate, while improper designation of a unit as overlapping results in omission.

1.4. Evaluation methods

One method of measuring the degree of frame coverage error is comparative analysis. Comparative analysis can occur at two levels. The first is a macro-level evaluation, which compares known population values with totals derived from summing characteristics for each sampling frame unit. The second type of analysis is performed at the micro or individual sampling unit level. This most often involves matching of data available from different sources for individual units.

1.4.1. Macro-level analysis

How do totals associated with sampling units compare with other measures of the target population? Suppose we have an area frame. The sum of the areas in individual sampling units or segments should match closely with the measured area of the total frame, e.g., county, State, or other target area.
The National Agricultural Statistics Service electronically digitizes clusters of area sampling units and verifies that the accumulated total is within 0.5 percent of the published land area for each State (Cotter and Nealon 1987).

Tortora (1987) notes that with two frames, one a complete area frame, a process quality control evaluation of a list frame is possible through the use of survey data. For example, list coverage of the number of farms or land in farms can be estimated by the sizes of the overlap and nonoverlap domains from the area frame. Likewise, the number of out-of-scope list units can be estimated from the samples in each frame. Monitored over time, the measures of list frame performance will provide knowledge and control of list coverage.

Similarly, the number of names in a list frame can sometimes be compared with census counts for the population of interest. Generally, the information available on every sampling unit is very limited and only gross comparisons with known population totals can be made. More often, totals estimated from sample surveys can be compared to similar quantities from other sources in order to provide measurements for the frame. Two of the most common sources are census and administrative files.

Reconciliations are made between economic census totals and corresponding totals from the Current Industrial Reports annual survey for census years. Similarly, the Department of Agriculture conducts a continuous survey program for the agricultural sector and routinely compares inventory and production estimates with those obtained in the agricultural censuses conducted at 5-year intervals by the Bureau of the Census.

The Bureau of the Census utilizes still another macro-level approach for frame completeness evaluation called demographic analysis. With this method, demographic data from various sources are used to develop expected values for the population as a whole and by race, age, and sex to compare with the census counts.
This procedure relies on aggregate statistics of birth, death, immigration, emigration, past censuses, Medicare enrollment, and other sources to provide estimates of net census coverage errors for broad categories at State and national levels (Fay, Passel, and Robinson 1988). The estimate of the net undercount of the legally resident population in the 1980 decennial census is 1.0 percent using this procedure (p. 26).

The mandatory UI reports filed by employers with their State employment security agencies are the primary source of information for the BLS Universe File. This file is used both for sampling frame maintenance and during the estimation process. For example, comprehensive totals from the Universe File at the SIC or SIC/size class level can be used to evaluate the sampling frame inadequacies caused by lack of timeliness for the Current Employment Statistics Survey. Births of new firms (economic units which have begun operations since the time of frame construction) and inaccuracies at detailed levels resulting from changes in SIC codes contribute to differences between the survey frame and the target population. The degree of undercoverage during the time lag until discovery of new units depends upon the number and size of operations entering the target population. During the estimation process, an updated Universe File is used to ratio adjust the estimates; the reference period for the updated Universe File is one year later than for the Universe File used as a sampling frame. Evaluations of survey data versus target population totals show that only minor revisions apply to Current Employment Statistics Survey results (Thomas 1986).

Several studies have been made of business births and job generation which indicate the importance of measuring employment in new businesses. Roughly 800,000 businesses are formed each year, creating 2,500,000 new jobs.
While jobs in new businesses constitute a small fraction of total nonagricultural payroll employment (annual average employment of 104,300,000 in 1988), they are a substantial portion of net new jobs (2,800,000 from 1986 to 1987 and 3,200,000 from 1987 to 1988). An analysis of Dun and Bradstreet credit rating information (Birch 1979) showed that small businesses (20 or fewer employees) accounted for two-thirds of net new jobs between 1969 and 1976. Other studies using Small Business Administration files at the national level or files at the State level have shown that more than half of the net employment growth came from small businesses or business births (Armington and Odle 1981; Teitz, Glasmeier, and Svensson 1981; Connor, Heeringa, and Jackson 1985). These studies show the importance of including new businesses in establishment surveys of employment.

1.4.2. Micro-level analysis

Micro-level analysis of sampling frame units implies direct matching or linkage of the same units found in more than one source. Given a common reference unit, be it person, housing unit, or business, the information available from an administrative file, a census, or survey source should verify and enhance the data associated with the unit.

The U.S. Department of Commerce's "Report on Statistical Uses of Administrative Records" (1980) includes four case studies of projects utilizing comparative analysis between surveys, census data, and administrative records. All four of these studies utilize matching between files at the individual record level to assess coverage problems and illustrate the kinds of sampling unit evaluations possible across frame sources.

Such record-matching studies are performed for statistical purposes only. In general, strict laws govern the relea