SOME PROBLEMS AND SOLUTIONS WITH A DELETE-A-GROUP JACKKNIFE
Phillip S. Kott
National Agricultural Statistics Service

 
Abstract
 
The National Agricultural Statistics Service has been using the Delete-a-Group (DAG) jackknife in an increasing number of its surveys. This has led to some problems. For one, the variance estimator for national statistics requires that data from all the states be processed at the same time, a computational nightmare. The solution to this problem was the development of a "hybrid" variance estimator combining linearization and jackknife principles. A second problem is the setting of the number of replicates. NASS uses 15. This can render some standard statistical tests based on the F distribution questionable. Fortunately, an alternative approach based on Bonferroni's inequality often can be taken. Finally, the DAG requires that the number of first-phase samples in each stratum be large, say, greater than five. This is not always the case in practice. The resultant upward bias in the variance estimator may be acceptable in some situations. For others, the Extended DAG jackknife has been developed that uses information from all replicates in calculating each replicate estimate.

1. Introduction

The Delete-a-Group (DAG) jackknife is a relatively new name for a widely used procedure in survey sampling. For example, it is identical to "Jackknife1" computed by WESVAR (see Westat, 1997, pp. 141-147). When done correctly, the DAG jackknife produces nearly unbiased estimates of mean squared error for a remarkably broad range of estimation strategies including many involving calibration, composite estimation, and multi-phase sampling (see Kott, 1998).

In brief, the DAG jackknife procedure divides the (first-phase) sample into R random groups and then estimates variances (or mean squared errors) by:

1. deleting one group at a time from the sample,
2. computing R "replicate" estimates in an appropriate manner, and
3. taking the sum of the squared differences between the R replicate estimates and the original estimate, multiplied by (R-1)/R.
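
The three steps above can be sketched in Python as follows. This is only a minimal illustration: the function name `dag_variance` and the toy numbers are hypothetical, and the replicate estimates are assumed to have been computed already with appropriate replicate weights.

```python
def dag_variance(estimate, replicate_estimates):
    """Delete-a-Group jackknife variance from R replicate estimates.

    estimate: the full-sample (original) estimate.
    replicate_estimates: the R estimates, one per deleted group.
    """
    R = len(replicate_estimates)
    if R < 2:
        raise ValueError("need at least two replicate estimates")
    # Sum of squared differences between each replicate and the original estimate.
    sum_sq = sum((rep - estimate) ** 2 for rep in replicate_estimates)
    # Scale by (R - 1) / R, as in step 3.
    return (R - 1) / R * sum_sq


# Toy example with R = 3 groups (NASS uses R = 15).
example = dag_variance(10.0, [9.0, 11.0, 10.0])
```

With R = 15 the scaling factor becomes 14/15.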

The National Agricultural Statistics Service (NASS) has been using the DAG jackknife in an increasing number of its surveys. This has led to some problems. For example, NASS produces both national and State-level estimates. A jackknife variance estimator for national statistics requires that data from all States be processed at the same time, which is difficult under present circumstances. The solution of this problem for simple expansions and ratios, which are the primary statistics of interest to NASS, is to use a "hybrid" variance estimator combining linearization and jackknife principles. It is described in Section 2.

A second problem involves the setting of the number of random groups, R. NASS routinely uses 15 groups. This sets the degrees of freedom available for univariate estimation and testing at 14. Some users of NASS data are interested in conducting multivariate tests of statistical models, but standard tests based on the F distribution can break down in this context (see Korn and Graubard, 1990). Section 3 discusses an alternative that applies the Bonferroni inequality to a set of univariate t tests.

Finally, the near unbiasedness of the DAG jackknife requires that the number of first-phase sample units in each stratum be large. Kott (1998) puts this number at 5. This requirement is not always met in NASS surveys, especially the agency's area-based surveys. The resultant upward bias in the variance estimator may be acceptable in some situations. For others, the Extended DAG jackknife has been developed. It is described in Section 4.

2. Cross-State Aggregates

Estimating the variance of an expansion estimator, $t_i$, for a total, $T_i$, in State $i$ with a DAG jackknife is a simple matter. One computes

$$\mathrm{var}(t_i) = \frac{14}{15} \sum_{r=1}^{15} (t_{i(r)} - t_i)^2,$$

where $t_{i(r)}$ is the replicate estimator for $T_i$ computed with the $r$'th set of replicate weights.

Variance estimation is just as simple for an estimated state ratio, $b_i = t_{1i}/t_{2i}$, where $t_{1i}$ and $t_{2i}$ are, respectively, expansion estimators of state totals $T_{1i}$ and $T_{2i}$. The DAG jackknife variance estimator for $b_i$ is

$$\mathrm{var}(b_i) = \frac{14}{15} \sum_{r=1}^{15} \left( \frac{t_{1i(r)}}{t_{2i(r)}} - \frac{t_{1i}}{t_{2i}} \right)^2,$$

where $t_{1i(r)}$ and $t_{2i(r)}$ are, respectively, replicate estimates for $T_{1i}$ and $T_{2i}$ computed with the $r$'th set of replicate weights.
 
2.1 Cross-State Estimates

Suppose we are interested in the cross-state estimator, $t_S = \sum_{i \in S} t_i$, for a total, $T_S = \sum_{i \in S} T_i$, where $S$ is a collection of states, such as the entire U.S. ($S$ will denote both the collection of states and the number of states in that collection). One way to estimate the variance of $t_S$ would be with the hybrid estimator:

$$\mathrm{var}_H(t_S) = \sum_{i \in S} \mathrm{var}(t_i),$$

where $\mathrm{var}(t_i)$ is again $(14/15) \sum_{r=1}^{15} (t_{i(r)} - t_i)^2$. The name "hybrid" derives from $\mathrm{var}_H(t_S)$ being a hybrid of the $S$ state DAG jackknives and linearization principles.

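For the ratio estimator, $b_S = \sum_{i \in S} t_{1i} / \sum_{i \in S} t_{2i}$, the hybrid variance estimator is

$$\mathrm{var}_H(b_S) = \Big( \sum_{i \in S} t_{2i} \Big)^{-2} \Big\{ \sum_{i \in S} \mathrm{var}(t_{1i}) + b_S^2 \sum_{i \in S} \mathrm{var}(t_{2i}) - 2 b_S \sum_{i \in S} \mathrm{cov}(t_{1i}, t_{2i}) \Big\},$$

where $\mathrm{cov}(t_{1i}, t_{2i}) = (14/15) \sum_{r=1}^{15} (t_{1i(r)} - t_{1i})(t_{2i(r)} - t_{2i})$.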
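
As a rough illustration of the two hybrid estimators, the sketch below sums state-level DAG jackknife variances and covariances and then applies the linearized ratio formula. The function names and the data layout (a list of `(estimate, replicate_estimates)` pairs, one per state, in the same state order for numerator and denominator) are assumptions made for this example, not part of any NASS system.

```python
def dag_cov(est_x, reps_x, est_y, reps_y):
    """DAG jackknife covariance of two estimates built on the same R replicates."""
    R = len(reps_x)
    if R < 2 or R != len(reps_y):
        raise ValueError("both estimates need the same R >= 2 replicates")
    cross = sum((rx - est_x) * (ry - est_y) for rx, ry in zip(reps_x, reps_y))
    return (R - 1) / R * cross


def hybrid_total_variance(states):
    """Hybrid variance of a cross-state total: the sum of state-level DAG variances.

    states: list of (estimate, replicate_estimates) pairs, one per state.
    """
    # The covariance of an estimate with itself is its DAG variance.
    return sum(dag_cov(t, reps, t, reps) for t, reps in states)


def hybrid_ratio_variance(numer_states, denom_states):
    """Hybrid (linearized) variance of the ratio of two cross-state totals.

    Both lists must hold the same states in the same order.
    """
    t1 = sum(t for t, _ in numer_states)
    t2 = sum(t for t, _ in denom_states)
    b = t1 / t2
    var1 = hybrid_total_variance(numer_states)
    var2 = hybrid_total_variance(denom_states)
    cov12 = sum(
        dag_cov(n_est, n_reps, d_est, d_reps)
        for (n_est, n_reps), (d_est, d_reps) in zip(numer_states, denom_states)
    )
    return (var1 + b ** 2 * var2 - 2 * b * cov12) / t2 ** 2


# Toy data: two states, R = 3 replicates each (NASS uses R = 15).
numer = [(10.0, [9.0, 11.0, 10.0]), (20.0, [19.0, 20.0, 21.0])]
denom = [(5.0, [4.0, 6.0, 5.0]), (8.0, [8.0, 7.0, 9.0])]
total_var = hybrid_total_variance(numer)
ratio_var = hybrid_ratio_variance(numer, denom)
```

Only state-level summaries enter these functions, so each state can be processed separately before aggregation.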
 
2.2 Discussion
 
For NASS summaries, the direct DAG jackknife variance estimators ($\mathrm{var}(t_i)$ and $\mathrm{var}(b_i)$ above) make sense to use at the state level, while the hybrid estimators make sense for aggregates that combine data across states (like US-level totals and ratios).
 
There is no hybrid analogue to $\mathrm{var}(t_i)$. In principle, however, the hybrid analogue to $\mathrm{var}(b_i)$ is

$$\mathrm{var}_H(b_i) = t_{2i}^{-2} \{ \mathrm{var}(t_{1i}) + b_i^2 \, \mathrm{var}(t_{2i}) - 2 b_i \, \mathrm{cov}(t_{1i}, t_{2i}) \}.$$

There are no theoretical reasons to prefer $\mathrm{var}_H(b_i)$ over $\mathrm{var}(b_i)$. The two variance estimators are asymptotically indistinguishable. In practice, since we need to calculate $\mathrm{var}(t_{1i})$, $\mathrm{var}(t_{2i})$, and $\mathrm{cov}(t_{1i}, t_{2i})$ for aggregation anyway, it is convenient to use $\mathrm{var}_H(b_i)$ as the state-level variance estimator and avoid calculating $\mathrm{var}(b_i)$ altogether.
 
In principle, the direct DAG variance estimator for $t_S$ is

$$\mathrm{var}(t_S) = \frac{14}{15} \sum_{r=1}^{15} \Big( \sum_{i \in S} t_{i(r)} - \sum_{i \in S} t_i \Big)^2.$$

Although both $\mathrm{var}(t_S)$ and $\mathrm{var}_H(t_S)$ have asymptotically ignorable biases, the hybrid version has less variance; that is to say, the variance of $\mathrm{var}_H(t_S)$ as an estimator for the true variance of $t_S$ is less than that of $\mathrm{var}(t_S)$. To see why, suppose each $t_{i(r)} - t_i$ were roughly normal. Then $\mathrm{var}(t_S)$ would have a relative variance of roughly 2/14, while $\mathrm{var}_H(t_S)$ would have a relative variance between $2/[14S]$ and 2/14. In other words, $\mathrm{var}(t_S)$ has roughly 14 degrees of freedom under ideal conditions (more precisely, $(t_S - T_S)/\sqrt{\mathrm{var}(t_S)}$ has roughly a Student's t distribution with 14 degrees of freedom), while $\mathrm{var}_H(t_S)$ has between 14 and $14S$ effective degrees of freedom. This is another reason why the hybrid is preferable for NASS summaries.
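
The range of effective degrees of freedom quoted above can be illustrated with a Satterthwaite-type approximation. The sketch below assumes, beyond what the text states, that the state estimates are independent and that each state-level variance estimator behaves like a scaled chi-square variable with 14 degrees of freedom. The function name is hypothetical, and in practice the true state variances would have to be replaced by estimates.

```python
def hybrid_effective_df(state_variances, df_per_state=14):
    """Approximate effective degrees of freedom of the hybrid variance estimator.

    state_variances: the (true or estimated) variance of each state-level estimate.
    Under the stated assumptions, the relative variance of the hybrid estimator is
    (2 / df_per_state) * sum(V_i^2) / (sum(V_i))^2, and the effective degrees of
    freedom are 2 divided by that relative variance.
    """
    total = sum(state_variances)
    sum_sq = sum(v ** 2 for v in state_variances)
    if sum_sq == 0:
        raise ValueError("at least one state variance must be positive")
    return df_per_state * total ** 2 / sum_sq


# Four states contributing equally: the upper bound 14 * S = 56.
equal_case = hybrid_effective_df([1.0, 1.0, 1.0, 1.0])
# One state dominating the total: close to the lower bound of 14.
dominated_case = hybrid_effective_df([100.0, 1.0, 1.0, 1.0])
```

The upper bound 14S is reached only when every state contributes the same variance; a single dominant state pulls the figure back toward 14.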

The direct DAG variance estimator for $b_S$ is

$$\mathrm{var}(b_S) = \frac{14}{15} \sum_{r=1}^{15} \left( \frac{\sum_{i \in S} t_{1i(r)}}{\sum_{i \in S} t_{2i(r)}} - \frac{\sum_{i \in S} t_{1i}}{\sum_{i \in S} t_{2i}} \right)^2,$$

but the hybrid $\mathrm{var}_H(b_S)$ has less variance and is preferred for NASS summaries. Using the hybrid requires that $\mathrm{cov}(t_{1i}, t_{2i})$ be calculated in each state for every pair of survey items NASS desires to put in an item-to-item ratio.
 
Hybrid principles can also be used when aggregating list and area-based nonoverlap (NOL) estimators within a state. Presently, NOL variances and covariances are computed at NASS using linearization methods.

Some users, such as economists at the Economic Research Service, may be interested in analyzing multi-state NASS data as a single data set. Under those circumstances, it will often be more convenient to use direct DAG jackknife variance estimators rather than hybrid variance estimators. Indeed, it was partly for these users that NASS began using the DAG.

3. A Bonferroni-adjusted t-test

Suppose we are evaluating a linear model that may or may not have regional effects. In particular, we want to determine whether the addition of a dummy variable to represent each of the four U.S. Census regions is warranted. One common practice is to omit one of the regions arbitrarily and use an F test to determine whether the coefficients of the other three dummies are simultaneously zero under a model including an intercept. Unfortunately, an F test can be unreliable for this purpose when using a DAG jackknife based on only 15 replicates. An alternative test procedure suggested by Korn and Graubard (1990) is outlined in the next sub-section.

3.1 The Batt

We restrict ourselves here to the simultaneous testing of $K$ linear regression coefficients, but there are other potential applications of the test about to be described. In particular, suppose we want to test whether a set of $K$ regression coefficients are simultaneously equal to zero. The first thing to do is calculate the z-value of the $k$'th estimated coefficient, and call it $z_k$ (the z-value for an estimate is the estimate itself divided by its estimated standard error). Let $z_{\max} = \max_k |z_k|$. One can reject the null hypothesis that all the $K$ parameters are jointly zero at significance level $\alpha$ when the probability that a Student's t variable with 14 degrees of freedom is larger than $z_{\max}$ is no greater than $\alpha/(2K)$.

Testing a joint hypothesis in the manner described above is called "a Bonferroni-adjusted t-test" or Batt. Observe that when $K = 1$, the Batt collapses into the standard two-sided t-test. When $K > 1$, the Batt can be conservative. That means it will fail to reject the null hypothesis more often than it should.

Textbooks often advise against the analogous use of Bonferroni confidence intervals when $K$ is large, citing the inefficiency and conservativeness of the Bonferroni technique. For testing purposes, however, it is reassuring to observe that when $\alpha = .05$ and $K = 100$, the null hypothesis will be rejected when $z_{\max}$ exceeds 4.5, which is not an unreasonably large number.
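
A minimal Python sketch of the Batt follows. To stay within the standard library, the upper-tail probability of the Student's t distribution is obtained by numerically integrating its density with Simpson's rule; a statistical library routine could be used instead. The function names and defaults are hypothetical.

```python
import math


def t_upper_tail(x, df, steps=2000):
    """P(T > x) for a Student's t variable with df degrees of freedom, x >= 0.

    Computed as 0.5 minus the Simpson's-rule integral of the density on [0, x].
    steps must be even.
    """
    if x < 0:
        raise ValueError("x must be non-negative")
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return const * (1 + u * u / df) ** (-(df + 1) / 2)

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        # Simpson weights: 4 at odd interior points, 2 at even interior points.
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3


def batt_rejects(z_values, alpha=0.05, df=14):
    """Bonferroni-adjusted t-test of the hypothesis that all K coefficients are zero.

    z_values: the K z-values (estimate divided by its estimated standard error).
    Rejects when the upper-tail probability at the largest |z| is at most alpha / (2K).
    """
    K = len(z_values)
    z_max = max(abs(z) for z in z_values)
    return t_upper_tail(z_max, df) <= alpha / (2 * K)
```

For example, `batt_rejects([0.5, -3.0, 1.0])` applies the test to three coefficients with the default 14 degrees of freedom.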

3.2 Dummy-like Variables

Unlike an F test of a joint hypothesis, a Batt is sensitive to how the regression model (which can be linear or non-linear) is parameterized. In our motivating dummy example, any one of the four regional dummies could be omitted. Those are four possible parameterizations. The choice of which dummy to omit needs to be made randomly.

A better Bonferroni procedure is available for testing the simultaneous existence of a set of dummy variables. Before proceeding to it, we first introduce the concept of a set of "dummy-like" variables. We want this definition to include, for example, a set of slope coefficients that potentially differ by region.

Let $A$ denote a variable of interest, and $x_A$ be the $n$-vector of sample values for $A$. A set $G$ of variables is said to be dummy-like if

1) $x_A$ is not equal to 0 for any variable ($A$) in $G$, and
2) $x_A' x_B = 0$ for any pair of variables ($A$ and $B$) in $G$.

When $\sum_{A \in G} x_A$ is a vector of 1's, $G$ contains conventional dummy variables.
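
The definition can be checked mechanically. The sketch below reads condition 1 as "no sample vector is the zero vector", which is an interpretation rather than something the text spells out; the function names and the tolerance are hypothetical.

```python
def is_dummy_like(vectors, tol=1e-12):
    """Check the two dummy-like conditions for a set of sample vectors.

    vectors: list of equal-length lists, one n-vector of sample values per variable.
    """
    for a, x_a in enumerate(vectors):
        # Condition 1: no variable may be identically zero in the sample.
        if all(abs(v) <= tol for v in x_a):
            return False
        # Condition 2: every pair of distinct variables must be orthogonal.
        for x_b in vectors[a + 1:]:
            if abs(sum(p * q for p, q in zip(x_a, x_b))) > tol:
                return False
    return True


def is_conventional_dummy_set(vectors, tol=1e-12):
    """True when the set is dummy-like and its vectors sum to a vector of 1's."""
    if not is_dummy_like(vectors, tol):
        return False
    return all(abs(sum(column) - 1.0) <= tol for column in zip(*vectors))


# Three regional dummies observed on four sample units.
region_dummies = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
# Region-specific slopes: a covariate multiplied by each regional dummy.
region_slopes = [[2.0, 0, 0, 3.0], [0, 1.5, 0, 0], [0, 0, 4.0, 0]]
```

Here `region_dummies` is both dummy-like and conventional, while `region_slopes` is dummy-like only, matching the example of slope coefficients that differ by region.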

To test whether the coefficients for a set of dummy-like variables are all zero, we first parameterize the regression so that all the estimated coefficients for the dummy-like variables are non-negative. For a set of dummies, parameterization involves choosing which dummy to omit from the regression (assuming the model has an intercept) and replacing $x_A$ by $-x_A$ when necessary. In general, one dummy-like variable is omitted from a parameterization, while $\sum_{A \in G} x_A$ (or the equivalent), which is not a dummy-like variable, effectively takes its place.

Armed with a parameterization having non-negative estimated coefficients for the dummy-like variables, we can calculate the z-value for each, and let $z_{\max}$ be the largest of these non-negative values. We reject the null hypothesis that the set of dummy-like variables as a whole has no impact on the data at the $\alpha$ significance level when the probability that a Student's t variable with 14 degrees of freedom is larger than $z_{\max}$ is no greater than $\alpha/(d[d-1])$, where $d$ is the number of dummy-like variables in the set. Note that $d(d-1)/2$ has effectively replaced $K = d-1$ in the Batt with a random parameterization. This is because there are $d$ possible parameterizations, but forcing all coefficients to be non-negative is an exact mirror of forcing them all to be non-positive. Hence $K$ ($= d-1$) needs to be multiplied by $d/2$ to account for our choosing the "worst" parameterization.

This test easily extends to $Q$ sets of dummy-like variables. We again need to parameterize so that every dummy-like variable in one of the $Q$ sets has a non-negative estimated coefficient. We calculate $z_{\max}$ over all the $Q$ sets, and reject the null hypothesis that the $Q$ sets of dummy-like variables as a whole have no impact on the data at the $\alpha$ significance level when the probability that a Student's t variable with 14 degrees of freedom is larger than $z_{\max}$ is no greater than $\alpha/(d(Q)[d(Q) - Q])$, where $d(Q)$ is the number of dummy-like variables across all $Q$ sets.
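
The tail-probability cutoff for one or more sets of dummy-like variables is simple arithmetic, sketched below. The function name and its input (the number of dummy-like variables in each set) are hypothetical; the cutoff is the one against which the upper-tail probability at the largest z-value is compared.

```python
def dummy_like_tail_cutoff(set_sizes, alpha=0.05):
    """Tail-probability cutoff for testing Q sets of dummy-like variables.

    set_sizes: number of dummy-like variables in each of the Q sets.
    Returns alpha / (d(Q) * [d(Q) - Q]), where d(Q) is the total count.
    """
    Q = len(set_sizes)
    if Q == 0 or any(size < 2 for size in set_sizes):
        raise ValueError("need at least one set, each with two or more variables")
    d_Q = sum(set_sizes)
    return alpha / (d_Q * (d_Q - Q))


# One set of four regional dummies: alpha / (4 * 3).
single_set_cutoff = dummy_like_tail_cutoff([4])
# Two sets with four and three dummy-like variables: alpha / (7 * 5).
two_set_cutoff = dummy_like_tail_cutoff([4, 3])
```

With a single set of d variables the cutoff reduces to the significance level divided by d(d-1).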

It is also a simple matter to combine $K_0$ non-dummy-like variables with $Q$ sets of dummy-like variables. Once more, we parameterize so that every dummy-like variable in one of the $Q$ sets has a non-negative estimated coefficient. We calculate $z_{\max}$ over all the $Q$ sets and the other $K_0$ variables, and reject the null hypothesis that the $Q$ sets of dummy-like variables and the $K_0$ additional variables as a whole have no impact on the data at the $\alpha$ significance level when the probability that a Student's t variable with 14 degrees of freedom is larger than $z_{\max}$ is no greater than $\alpha/(2K_0 + d(Q)[d(Q) - Q])$. With no dummy-like sets, this reduces to the Batt cutoff $\alpha/(2K_0)$.
 
4. The Extended Delete-A-Group Jackknife

In this section, we extend the concept of a DAG jackknife variance estimator. For simplicity, we consider only the variance of an estimator without explicit calibration. The sample itself can have multiple stages. For NASS, it is multi-stage area samples that often have the small stratum sample sizes of concern here.

Let

$w_{hjk}$ be the weight of element $k$ in PSU (segment) $j$ of stratum $h$,
$n_h$ be the number of sampled PSU's in stratum $h$,
$H$ be the number of strata,
$R$ be the number of variance groups (the members of each first-stage stratum are distributed into the $R$ replicate groups in as nearly equal a manner as possible), and
$S_{hr}$ be the set of PSU's in stratum $h$ and group $r$.

In NASS applications, $R$ is 15. Kott (1998) argues that the DAG variance estimator is reasonable when all $n_h$ are greater than or equal to 5. What if they aren't?

Kott (1999) proposes an effective method for calculating replicate weights in this situation. Let $G$ be an integer less than or equal to $R$. When $n_h < G$, we can define the replicate-$r$ weight of $hjk$ for the Extended Delete-A-Group jackknife as

$$w_{hjk(r)}^{(G)} = \begin{cases} w_{hjk} & \text{when } S_{hr} \text{ is empty,} \\ w_{hjk}(1 - [n_h - 1]Z) & \text{when } j \text{ is in } S_{hr}, \text{ and} \\ w_{hjk}(1 + Z) & \text{otherwise,} \end{cases}$$

where $Z = \sqrt{R/[(R-1)n_h(n_h-1)]}$. When $n_h$ is greater than or equal to $G$, we define the $w_{hjk(r)}^{(G)}$ for the Extended DAG to be the same as for the DAG.

When $n_h = R$ in the above equation, one (and only one) $j$ will be in $S_{hr}$, $Z = 1/(R-1) = 1/(n_h-1)$, and the usual DAG replicate-weight formula obtains. Observe that when $n_h < R$, $w_{hjk(r)}^{(G)}$ in the above equation is not zero when $j$ is in $S_{hr}$. This is unusual for a jackknife. A sketch of a proof for the near unbiasedness of the Extended DAG jackknife can be found in Kott (1999).
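
The replicate-weight rule can be sketched for a single stratum as follows. The sketch takes Z to be the square root of R/[(R-1)n_h(n_h-1)], the value that gives Z = 1/(R-1) when n_h = R, and it covers only strata whose PSUs each fall in a different group. The function names, the 0-based group labels, and the toy totals are hypothetical.

```python
import math


def extended_dag_factors(psu_groups, R=15):
    """Replicate-weight multipliers for one stratum with n_h <= R sampled PSUs.

    psu_groups: group label (0 .. R-1) of each sampled PSU; labels must be distinct.
    Returns factors[r][j], the multiplier applied to the weights of PSU j in replicate r.
    """
    n_h = len(psu_groups)
    if n_h < 2 or n_h > R or len(set(psu_groups)) != n_h:
        raise ValueError("need 2 <= n_h <= R PSUs, each in a different group")
    # Assumed reading: Z is the square root of R / [(R-1) n_h (n_h-1)].
    z = math.sqrt(R / ((R - 1) * n_h * (n_h - 1)))
    factors = []
    for r in range(R):
        if r not in psu_groups:
            # S_hr is empty: the stratum's weights are left unchanged.
            factors.append([1.0] * n_h)
        else:
            factors.append(
                [1.0 - (n_h - 1) * z if g == r else 1.0 + z for g in psu_groups]
            )
    return factors


def extended_dag_variance(weighted_psu_totals, psu_groups, R=15):
    """Jackknife variance of a stratum total using the multipliers above."""
    t = sum(weighted_psu_totals)
    sum_sq = 0.0
    for row in extended_dag_factors(psu_groups, R):
        t_r = sum(f * y for f, y in zip(row, weighted_psu_totals))
        sum_sq += (t_r - t) ** 2
    return (R - 1) / R * sum_sq


# Stratum with three PSUs whose weighted totals are 10, 14 and 21.
toy_variance = extended_dag_variance([10.0, 14.0, 21.0], [2, 7, 11])
```

In this toy stratum the result is 93 (up to rounding), which equals n_h/(n_h - 1) times the sum of squared deviations of the three weighted PSU totals from their mean.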

What value to use for $G$ is an open question. Following Kott (1998), we can choose $G = 5$, but clearly a higher value would produce a less-biased variance estimator. In many practical situations, it is convenient to set $G$ equal to $R$.

The Extended DAG replicate weights given above are not explicitly calibrated. In practice, if the original weights are calibrated, so must the replicate weights (see Kott, 1998). The equation for $w_{hjk(r)}^{(G)}$ tells us only where to start for calibrated estimators.
 
References

Korn, Edward L. and Graubard, Barry I. (1990). Simultaneous Testing of Regression Coefficients With Complex Survey Data. American Statistician, 44, 270-276.

Kott, Phillip S. (1998). Using the Delete-A-Group Jackknife Variance Estimator in NASS Surveys, RD Research Report No. RD-98-01, USDA, NASS: Washington, DC.

Kott, Phillip S. (1999). The Extended Delete-A-Group Jackknife. Bulletin of the International Statistical Institute. 52nd Session. Contributed Papers. Book 2, 167-168.

Westat, Inc. (1997). A User's Guide to WesVarPC®, Version 2.1, Westat: Rockville, MD.