INCREASING THE EFFECTIVENESS OF RATIO EDITS
BY USING JOINT TRANSFORMED VARIABLES

Ramesh A. Dandekar
Statistics and Methods Group, EI-70
U. S. Department of Energy
Washington DC 20585-0670
email:ramesh.dandekar@eia.doe.gov

ABSTRACT

It is common practice in survey data editing to use simple ratio tests to identify potential outliers. Edits based on simple ratio tests work especially well when all respondents are of roughly the same size and exhibit similar tendencies from one survey period to the next. Their effectiveness diminishes when the variable spans a wide range and the extent of fluctuation from one survey period to the next depends on the magnitude of the variable. Typically, smaller respondents experience larger fluctuations than their larger counterparts. This paper presents a simple method to increase the effectiveness of ratio edits for data from populations with wide distributions. The proposed method is equally effective irrespective of the type of distribution.

INTRODUCTION

The Energy Information Administration (EIA) of the United States Department of Energy (DOE) collects national information on energy production, shipment, transfer and consumption for all fossil and non-fossil fuels. The information is made available to the general public at various levels of aggregation, such as states, regions, industry groups, fuel types, and functional uses.

The overall quality of the aggregate information presented in typical statistical publications is dominated by the quality of the information collected from the relatively larger respondents. The quality of the information from individual smaller respondents, though collectively important, does not have as much impact on the overall aggregate information provided in the statistical publication.

In addition to respondent size, the quality of the information on the magnitude of change from one survey period to the next is important. Typically, smaller respondents have larger fluctuations, as measured by ratio tests, than their larger counterparts. Data edits therefore require a balancing act that puts appropriate weight on (1) the magnitude of the respondent value and (2) the magnitude of fluctuation from one survey period to the next.

To improve the overall quality of the published information, the prudent strategy is to concentrate relatively more resources on checking the quality of the information collected from the larger respondents with relatively larger fluctuations from one survey period to the next. By using joint transformed variables, it is possible to achieve a balanced ratio edit search on respondent-level data.

 

OVERVIEW OF BASIC METHODOLOGY


  1. Let Qp and Qc be the quantities reported by the survey respondent during the previous and current survey-reporting periods, respectively.
  2. To illustrate the edit methodology using a one tail test, let us define R, the ratio, as follows: R = Qp / Qc, if Qp > Qc ; R = Qc / Qp otherwise.
  3. Let us define D = Log { Q * ( R - 1 ) } where Q = (Qp + Qc ) / 2.


  4. Compute Score: S = R * D.
  5. Arrange respondents in decreasing order of S.


Using this scheme, the respondents with larger scores S tend to have relatively larger sizes and relatively larger fluctuations from the previous survey period. The respondents with higher scores are, therefore, given higher priority for edit review ahead of the smaller respondents with larger fluctuations.
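The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the paper does not state the base of Log, so base 10 is assumed here, quantities are assumed to be strictly positive, and the three respondents are hypothetical values chosen only to show how the two orderings can differ.

```python
import math

def ratio(qp, qc):
    """One-tail ratio R >= 1: the larger quantity divided by the smaller."""
    return qp / qc if qp > qc else qc / qp

def transformed_size(qp, qc):
    """D = log10(Q * (R - 1)) with Q the mean of the two periods.

    Returns None when Q * (R - 1) is not positive (for example R == 1),
    because the logarithm is undefined there.
    """
    q = (qp + qc) / 2.0
    arg = q * (ratio(qp, qc) - 1.0)
    return math.log10(arg) if arg > 0 else None

def basic_score(qp, qc):
    """S = R * D, or None when D is undefined."""
    d = transformed_size(qp, qc)
    return None if d is None else ratio(qp, qc) * d

# Hypothetical respondents (id, previous, current); not data from the paper.
respondents = [
    ("small", 10.0, 30.0),
    ("medium", 1000.0, 1800.0),
    ("large", 100000.0, 130000.0),
]

scored = [(rid, ratio(qp, qc), basic_score(qp, qc)) for rid, qp, qc in respondents]
by_ratio = [rid for rid, r, s in sorted(scored, key=lambda t: t[1], reverse=True)]
by_score = [rid for rid, r, s in sorted(scored, key=lambda t: t[2], reverse=True)]

print("ranked by R:", by_ratio)
print("ranked by S:", by_score)
```

For these made-up values the simple ratio puts the smallest respondent first, while the composite score puts the largest first. Respondents whose score is undefined would need separate handling before sorting.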

The hypothetical example below, consisting of seven respondents, demonstrates the relative ranking of respondent ratios R and scores S relative to size ( Qp + Qc ) / 2 for a population in which ratios tend to decrease with an increase in size Q.


In the example above, the simple ratio edits by themselves would have ranked smaller respondents with wider fluctuations ahead of relatively larger respondents with smaller fluctuations. However, when the composite score of the ratio is used, the ranking order is changed. Some larger respondents with relatively smaller fluctuations are ranked ahead of smaller respondents with large fluctuations.

 

ENHANCING THE BASIC METHODOLOGY

In the example above, we had limited success in ranking some of the larger respondents ahead of relatively smaller respondents. In many survey operations, the procedure outlined above will be somewhat successful but may not go far enough to capture potential outliers with significant impact on the quality of the aggregate statistics. To accommodate such a survey operation, we propose using some exponential weights on R and D prior to computing the composite score S.

This is done by attaching exponents r and d to R and D as follows:

Exponent Based Edit Score = S = R^r * D^d

Let us reconsider the example above using exponents r = 1 and d = 1.4.

It is clear from the example above that by assigning a relatively higher exponent to D, we have been able to rank more of the larger respondents with smaller fluctuations ahead of smaller respondents with large fluctuations.
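The effect of raising d can be sketched as follows. The function name, the base 10 logarithm, and the three respondents are assumptions made for illustration; they are not the seven respondents of the paper's example. Points where D would not be positive are returned as None, since a fractional power of a non-positive D is not a real number.

```python
import math

def exponent_score(qp, qc, r=1.0, d=1.0):
    """S = R**r * D**d with D = log10(Q * (R - 1)); None if D is not positive."""
    ratio = max(qp, qc) / min(qp, qc)
    q = (qp + qc) / 2.0
    arg = q * (ratio - 1.0)
    if arg <= 1.0:  # D <= 0: a fractional power of D would not be real
        return None
    return ratio ** r * math.log10(arg) ** d

def rank(respondents, r, d):
    """Respondent ids in decreasing order of the exponent based score."""
    scored = [(rid, exponent_score(qp, qc, r, d)) for rid, qp, qc in respondents]
    return [rid for rid, s in sorted(scored, key=lambda t: t[1], reverse=True)]

# Hypothetical respondents (id, previous, current); not data from the paper.
respondents = [
    ("A", 10.0, 60.0),            # small, ratio 6
    ("B", 100000.0, 200000.0),    # large, ratio 2
    ("C", 1000.0, 2500.0),        # medium, ratio 2.5
]

print("r=1, d=1.0:", rank(respondents, 1.0, 1.0))
print("r=1, d=1.4:", rank(respondents, 1.0, 1.4))
```

With these values, the large respondent B moves ahead of the small respondent A once d is raised from 1 to 1.4.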

 

IS IT ABSOLUTE OR RELATIVE VALUE FOR EXPONENT?

It can be easily demonstrated that by using different combinations of values for the exponents r and d, various ranking schemes can be generated to fit any given survey data editing operation. However, it is important to note that it is the relative magnitude of r and d, not their absolute values, that produces the desired results. Many different combinations of exponent values produce exactly the same ranking scheme. For example, r = 2 and d = 2.8 give exactly the same ranking scheme as the example above, where r = 1 and d = 1.4 are used.
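One way to see why only the relative magnitude matters is the short derivation below. It assumes R >= 1, D > 0 and r > 0, so that the score is positive; the scaling constant k is introduced here for illustration only.

```latex
\[
S_k \;=\; R^{kr} D^{kd} \;=\; \left(R^{r} D^{d}\right)^{k} \;=\; S^{k}, \qquad k > 0 .
\]
% Because x -> x^k is strictly increasing for x > 0, S_k orders respondents
% exactly as S does. With k = 2, (r, d) = (1, 1.4) becomes (2, 2.8).
% Choosing k = 1/r gives S^{1/r} = R * D^{d/r}, so only Power = d/r
% affects the ranking.
```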

 

CONTOURS OF CONSTANT EDIT SCORE

The graph below shows contour lines at a constant edit score of 12.0 for five different values of the exponent Power (= d/r), over a range of ratios R and quantities Q. The graph is derived using the generic exponent based edit score equation of the form

Score = 12 = R * [ Log ( Q * { R - 1 } ) ]^Power

A binary search algorithm is used to determine the combinations of R and Q that produce a score of 12.0 at five different values of the exponent Power (= d/r). The graph above shows that by adjusting the value of Power (= d/r) upward or downward, one can easily change the curvature of the contour line. The graph also shows that by adjusting Power (= d/r) one can shift the contour line up or down toward any desired location. This characteristic of the exponent based score could be used to develop a continuous hyperbolic ratio edit function that passes through desired points covering the entire range of the population of interest.
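A binary search of this kind might look like the sketch below, which fixes Q and Power and bisects on R. Several details are assumptions: the logarithm is taken as base 10, the search is done over R at fixed Q (the paper does not say which variable was searched), and the five Power values are hypothetical since the ones used for the graph are not listed.

```python
import math

def edit_score(q, ratio, power):
    """Score = R * log10(Q * (R - 1)) ** power.

    When Q * (R - 1) <= 1 the transformed variable D is not positive; the
    score is treated as 0 there (an assumption) to keep the result real.
    """
    arg = q * (ratio - 1.0)
    if arg <= 1.0:
        return 0.0
    return ratio * math.log10(arg) ** power

def contour_ratio(q, power, target=12.0, tol=1e-12):
    """Bisection for the ratio R at which the score equals target for a given Q."""
    if power <= 0 or q * (target - 1.0) < 10.0:
        raise ValueError("bracket [1 + 1/q, target] does not contain the contour")
    lo = 1.0 + 1.0 / q   # D = 0 here, so the score is 0
    hi = target          # D >= 1 here, so the score is at least target
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if edit_score(q, mid, power) < target:
            lo = mid     # score too low: the contour lies at a larger ratio
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical Power values; the paper does not list the five it plotted.
for power in (0.5, 1.0, 1.5, 2.0, 3.0):
    row = [round(contour_ratio(q, power), 4) for q in (10.0, 1e3, 1e6)]
    print(f"Power={power}: R at Q=10, 1e3, 1e6 -> {row}")
```

Bisection works here because, for Power > 0, the score rises monotonically in R once D is positive, so each Q has exactly one ratio on the contour.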

It is common practice in the survey data editing process to use discrete ratio step functions to edit data from a wide-range population. By applying nonlinear least squares analysis to the joint transformed edit variable identified in this paper, one could approximate the midpoints of the discrete step functions by a continuous function of the form:

S = R^r * D^d

Optimum values of r and d derived from the nonlinear least squares analysis could be used to determine continuous ratio edit cut-off values over the entire range of the variable.
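As a rough illustration of fitting a contour to step-function midpoints, the sketch below swaps in a simpler technique than the one named in the text: it fixes r = 1, takes logarithms so that ln R = ln c - p * ln D, and fits the Power p and cut-off c by ordinary (linear) least squares. A true nonlinear least squares fit would minimize the error in R directly. The midpoints are hypothetical and base 10 is assumed for Log.

```python
import math

def fit_power_and_cutoff(points):
    """Least-squares fit of ln R = ln c - p * ln D over (Q, R) points.

    Returns (p, c) for the contour R * D**p = c, i.e. r = 1 and d = p.
    Requires Q * (R - 1) > 1 at every point so that D > 0.
    """
    xs, ys = [], []
    for q, ratio in points:
        d = math.log10(q * (ratio - 1.0))
        xs.append(math.log(d))
        ys.append(math.log(ratio))
    n = len(points)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return -slope, math.exp(intercept)

# Hypothetical midpoints (Q, R) of a discrete ratio step function.
midpoints = [(50.0, 5.0), (500.0, 3.5), (5000.0, 2.5), (50000.0, 2.0), (500000.0, 1.6)]
power, cutoff = fit_power_and_cutoff(midpoints)
print(f"fitted Power = {power:.3f}, cut-off score = {cutoff:.3f}")
```

The fitted pair can then be used as a continuous edit rule: a respondent is flagged when R * D^Power exceeds the fitted cut-off.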

 

REAL LIFE EXAMPLE

The example below demonstrates the relative ranking of respondents for edit, shown on the vertical scale of the graph, using three different ratio based edit schemes. The first scheme uses simple ratios, while the next two schemes use slightly different approaches for computing the score.



By changing the exponent of the second term in the score, a significant change in the ranking scheme is achieved.

 

MODEL CALIBRATION BY EXTREME POINT ANALYSIS

The approximate value of the exponent Power (= d/r) in the edit score equation could be determined by analyzing the desired ratio and quantity combination at two different points located at opposite ends of the distribution of the variable being edited. The two points selected for the calibration must be points that are desired to receive the same edit score.

Therefore,

Edit Score = R1^r * { Log [ Q1 * ( R1 - 1 ) ] }^d

= R2^r * { Log [ Q2 * ( R2 - 1 ) ] }^d

By rearranging the equation above,

d/r = Log ( R2 / R1 ) / Log { Log [ Q1 * ( R1 - 1 ) ] / Log [ Q2 * ( R2 - 1 ) ] }

The table below provides numerical examples for calculating Power ( = d/r ) using desired values for equivalent ratios at two different extreme values of quantity by using the equation above.

It is important to note that if the estimated d/r is too large for computational purposes, lowering the value of r while keeping the ratio d/r fixed eliminates the difficulty of computing too large an edit score. For example, if d/r = 60 is desired, setting d = 6 and r = 0.1 keeps the edit score within a computable range.
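The calibration formula in this section can be sketched as below. The function names and the two calibration points are hypothetical, and base 10 is assumed for the inner Log; the base of the outer logarithms cancels in the quotient. Both points need Q * (R - 1) > 1 so that the inner logarithms are positive.

```python
import math

def calibrate_power(q1, r1, q2, r2):
    """Power = d/r that gives points (Q1, R1) and (Q2, R2) the same edit score."""
    d1 = math.log10(q1 * (r1 - 1.0))
    d2 = math.log10(q2 * (r2 - 1.0))
    return math.log(r2 / r1) / math.log(d1 / d2)

def exponents_for(power, r=0.1):
    """Split Power into (r, d) with d = Power * r, keeping d/r fixed."""
    return r, power * r

# Hypothetical calibration points: ratio 5 should be as suspicious at
# Q = 100 as ratio 1.5 is at Q = 1,000,000.
q1, r1, q2, r2 = 100.0, 5.0, 1_000_000.0, 1.5
power = calibrate_power(q1, r1, q2, r2)

score1 = r1 * math.log10(q1 * (r1 - 1.0)) ** power
score2 = r2 * math.log10(q2 * (r2 - 1.0)) ** power
print(f"Power = {power:.4f}, scores = {score1:.4f} and {score2:.4f}")
print("exponents for Power = 60:", exponents_for(60.0))
```

Because rankings depend only on d/r, shrinking r and d together changes the magnitude of the score but not the order of respondents.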

 

SIMULATIONS USING HYPOTHETICAL DATA

Two sets of hypothetical data, each consisting of 1000 data points, were generated to evaluate the effect of distributional characteristics on the exponent based edit scores. The first data set followed a uniform distribution with ratios ranging from 1 to 6. The second data set followed a triangular distribution with ratios ranging from 1 to 11. For both populations the quantity range was set from zero to 1,000,000.

Both populations were evaluated using four different edit schemes. The first scheme consisted of conventional ratio edits. The remaining three schemes used different combinations of exponential weights on the edit score. After ranking the edit scores, the 1000 data points were separated into five percentile ranges, namely 0-25, 25-50, 50-75, 75-90 and 90-100.

The simulation results are in the appendix. The first four figures in the appendix are based on the uniform distribution. The last four figures are based on the triangular distribution. Based on these simulations, it could be concluded that the exponent based edit scores work equally well, irrespective of the distributional characteristics of the variable.
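A simulation along these lines could be set up as in the sketch below. Much of it is assumed rather than taken from the paper: quantities are drawn uniformly, the triangular distribution uses its midpoint as the mode, Log is base 10, points with non-positive D are given a score of 0, and the exponent pairs are those read from the appendix figure captions.

```python
import math
import random

# Percentile bands used in the text.
BANDS = [(0, 25), (25, 50), (50, 75), (75, 90), (90, 100)]

def score(q, ratio, r, d):
    """R**r * D**d with D = log10(Q * (R - 1)); 0 when D would not be positive."""
    arg = q * (ratio - 1.0)
    if arg <= 1.0:
        return 0.0  # assumption: such points get the lowest review priority
    return ratio ** r * math.log10(arg) ** d

def simulate(draw_ratio, n=1000, seed=0):
    """Return n (Q, R) pairs with Q uniform on [0, 1,000,000] (an assumption)."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, 1_000_000.0), draw_ratio(rng)) for _ in range(n)]

def percentile_bands(points, key):
    """Sort points by key and cut the ordering at the band boundaries."""
    ordered = sorted(points, key=key)
    n = len(ordered)
    return {(lo, hi): ordered[n * lo // 100 : n * hi // 100] for lo, hi in BANDS}

SCHEMES = {
    "conventional ratio": lambda p: p[1],
    "R * D": lambda p: score(p[0], p[1], 1, 1),
    "R * D^8": lambda p: score(p[0], p[1], 1, 8),
    "R^0 * D^1": lambda p: score(p[0], p[1], 0, 1),
}

uniform_points = simulate(lambda rng: rng.uniform(1.0, 6.0))
# random.triangular uses the midpoint (6.0 here) as the mode by default.
triangular_points = simulate(lambda rng: rng.triangular(1.0, 11.0))

for label, points in (("uniform", uniform_points), ("triangular", triangular_points)):
    for name, key in SCHEMES.items():
        top = percentile_bands(points, key)[(90, 100)]
        mean_q = sum(q for q, _ in top) / len(top)
        print(f"{label:10s} {name:18s} mean Q in 90-100 band = {mean_q:,.0f}")
```

Comparing the mean quantity in the top band across schemes gives a quick numerical counterpart to the scatter plots in the appendix.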

 

POTENTIAL USES

The procedure outlined in this paper could be used to perform either univariate or multivariate edits. The box-and-whisker method for detecting potential outliers could be applied to the exponent based scores to separate data into different strata as part of the data edit operation.
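A box-and-whisker screen on the scores might be sketched as follows. The fence multiplier of 1.5 is the conventional choice rather than one given in the paper, only the upper fence is applied because high scores are the ones queued for review, and the scores are hypothetical.

```python
import statistics

def upper_fence_outliers(scores, k=1.5):
    """Return the scores above Q3 + k * IQR (the upper whisker limit)."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    fence = q3 + k * (q3 - q1)
    return [s for s in scores if s > fence]

# Hypothetical edit scores, not taken from the paper.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = upper_fence_outliers(scores)
print("flagged for review:", flagged)
```

The same quartiles could serve as stratum boundaries if the scores are to be split into more than two groups.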

In the case of multivariate edits, the emphasis is typically on avoiding repeated calls to the same respondents to ask for clarification on multiple data fields. Principal component analysis and/or factor analysis of multivariate exponent based edit scores, in combination with the box-and-whisker outlier detection technique, could be used for this purpose.

 


 

APPENDIX

Ratio versus Quantity Using Uniform Distribution

Figure 1: Conventional Ratio Edits



Ratio versus Quantity Using Uniform Distribution

Figure 2: Edit Score = R * D

 

Ratio versus Quantity Using Uniform Distribution

Figure 3: Edit Score = R * D^8



Ratio versus Quantity Using Uniform Distribution

Figure 4: Edit Score = R^0 * D^1



Ratio versus Quantity Using Triangular Distribution

Figure 5: Conventional Ratio Edits

 

Ratio versus Quantity Using Triangular Distribution

Figure 6: Edit Score = R * D



Ratio versus Quantity Using Triangular Distribution

Figure 7: Edit Score = R * D^8



Ratio versus Quantity Using Triangular Distribution

Figure 8: Edit Score = R^0 * D^1