Estimation of Complex Estimators in Subpopulations

Computes estimates, standard errors and confidence intervals for Complex Estimators in subpopulations. A Complex Estimator can be any analytic function of (Horvitz-Thompson or Calibration) estimators of Totals and Means.

svystatL(design, expr, by = NULL,
         vartype = c("se", "cv", "cvpct", "var"),
         conf.int = FALSE, conf.lev = 0.95, deff = FALSE,
         na.rm = FALSE)

# S3 method for svystatL
coef(object, ...)
# S3 method for svystatL
SE(object, ...)
# S3 method for svystatL
VAR(object, ...)
# S3 method for svystatL
cv(object, ...)
# S3 method for svystatL
deff(object, ...)
# S3 method for svystatL
confint(object, ...)

Arguments

design	Object of class `analytic` (or inheriting from it) containing survey data and sampling design metadata.
expr	R expression defining the Complex Estimator (see ‘Details’).
by	Formula specifying the variables that define the "estimation domains". If `NULL` (the default option) estimates refer to the whole population.
vartype	`character` vector specifying the desired variability estimators. It is possible to choose one or more of: standard error (`'se'`, the default), coefficient of variation (`'cv'`), percent coefficient of variation (`'cvpct'`), or variance (`'var'`).
conf.int	Compute confidence intervals for the estimates? The default is `FALSE`.
conf.lev	Probability specifying the desired confidence level: the default value is `0.95`.
deff	Should the design effect be computed? The default is `FALSE` (see ‘Details’).
na.rm	Should missing values (if any) be removed from the variables of interest? The default is `FALSE` (see ‘Details’).
object	An object of class `svystatL`.
...	Additional arguments to `coef`, ..., `confint` methods (if any).

Details

This function computes weighted estimates for Complex Estimators using suitable weights depending on the class of design: calibrated weights for class cal.analytic and direct weights otherwise. Standard errors are calculated using the Taylor linearization technique.

Function svystatL can handle any user-defined estimator that can be expressed as an analytic function of Horvitz-Thompson or Calibration estimators of Totals or Means, by automatically linearizing them (see [Zardetto, 15] for details).

The mandatory argument expr, which identifies the Complex Estimator, must be an object of class expression. It can be specified just a single Complex Estimator at a time, i.e. length(expr) must be equal to 1. Any analytic function of estimators of Totals and Means is allowed.
Inside expr the estimator of the Total of a variable is simply represented by the name of the variable itself. To represent the estimator of the Mean of a variable y, the expression y/ones has to be used (ones being the convenience name of an artificial variable - created on-the-fly - whose value is 1 for each elementary sampling unit, so that its Total estimator actually estimates the size of the population). Variables referenced inside expr must obviously belong to design and must be numeric.
At a minimal level, svystatL can be used to estimate Totals, Means, Ratios, etc., thus reproducing the same results achieved by using the corresponding dedicated functions svystatTM, svystatR, etc. For instance, calling svystatL(design, expression(y/x)) is equivalent to invoking svystatR(design, ~y, ~x), while using svystatL(design, expression(y/ones)) or
svystatTM(design, ~y, estimator = "Mean") achieves an identical result.

The mathematical expression of a Complex Estimator, as specified by argument expr, can involve 'parameters', that is symbols representing given, non-random, scalar, numeric values. For each parameter appearing in expr, the value corresponding to its symbol will be searched following R standard scoping rules, see e.g. the first example in Section ‘Examples’ for a practical illustration.

The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by svystatL refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2: if you need to compute estimates for domains B1 and B2 separately, you have to call svystatL twice. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced.

The conf.int argument allows to request the confidence intervals for the estimates. By default conf.int=FALSE, that is the confidence intervals are not provided.

Whenever confidence intervals are requested (i.e. conf.int=TRUE), the desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must represent a probability (0<=conf.lev<=1) and its default is chosen to be 0.95.

The optional argument deff allows to request the design effect [Kish 1995] for the estimates. By default deff=FALSE, that is the design effect is not provided. The design effect of an estimator is defined as the ratio between the variance of the estimator under the actual sampling design and the variance that would be obtained for an 'equivalent' estimator under a hypothetical simple random sampling without replacement of the same size. To obtain an estimate of the design effect comparing to simple random sampling “with replacement”, one must use deff="replace".
For nonlinear estimators, the design effect is estimated on the linearized version of the estimator (that is for the estimator of the total of the linearized variable, aka "Woodruff transform").
When dealing with domain estimation, the design effects referring to a given subpopulation are currently computed by taking the ratios between the actual variance estimates and those that would have been obtained if a simple random sampling were carried out within that subpopulation. This is the same as the srssubpop option for Stata's function estat.

Missing values (NA) in interest variables should be avoided. If na.rm=FALSE (the default) they generate NAs in estimates (or even an error, if design is calibrated). If na.rm=TRUE, observations containing NAs are dropped, and estimates get computed on non missing values only. This implicitly assumes that missing values hit interest variables completely at random: should this not be the case, computed estimates would be biased.

Value

An object inheriting from the data.frame class, whose detailed structure depends on input parameters' values.

Warning

When the linearized variable corresponding to a Complex Estimator is ill defined (because the estimator gradient is singular at the Taylor series expansion point), SE estimates returned by svystatL are NaN.

References

Sarndal, C.E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer Verlag.

Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi:10.1515/jos-2015-0013 .

Kish, L. (1995). “Methods for design effects”. Journal of Official Statistics, Vol. 11, pp. 55-77.

European Commission, Eurostat, (2013). “Handbook on precision requirements and variance estimation for ESS households surveys: 2013 edition”, Publications Office. doi: 10.2785/13579

Examples

#####################################################
# A first example: the Ratio Estimator of a Total.  #
#####################################################

# Creation of a design object:
data(data.examples)
des<-e.svydesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
     weights=~weight)

# Recall that ratio estimators of Totals rely on auxiliary
# information. Thus, suppose you want to estimate the total
# of income and suppose you know from an external source that
# the population size is, say, 1E6:
pop <- 1E6

# To obtain the ratio estimator of total income, you can do as follows:
  ## A) Directly plug the numeric value of pop into expr
svystatL(des, expression(1E6 * (income/ones)), vartype = "cvpct")
#>                          Complex       CV%
#> 1e+06 * (income/ones) 1256166016 0.6808451

  ## B) Treat pop as a parameter and let R find its actual value (1E6) inside
  ##    the calling environment of svystatL (the .GlobalEnv)
svystatL(des, expression(pop * (income/ones)), vartype = "cvpct")
#> Error in get(x): object 'pop' not found

# NOTE: Method B) can be very useful for simulation purposes, as it avoids
#       having to directly type in numbers when invoking svystatL (something
#       that only a human in an interactive R session could do).

# By comparing the latter result with the ordinary estimator of the mean...
svystatTM(des,~income,vartype="cvpct")
#>             Total      CV%
#> income 1160824648 1.926082

# ...one can appreciate the variance reduction stemming from the correlation
# between numerator and denominator:
Corr(des, expression(income), expression(ones))
#>   Corr(income, ones)
#> 1          0.9358866


############################################################
# A complex example: estimation of the Population Standard #
# Deviation of a variable.                                 #
############################################################

# Creation of another design object:
data(sbs)
sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,
        fpc=~fpc)

# Suppose you want to estimate the standard deviation of the
# population distribution of value added (va.imp2):
sbsdes<-des.addvars(sbsdes,va.imp2.sq=va.imp2^2)
svystatL(sbsdes,expression( sqrt( (ones/(ones-1))*
                                  ( (va.imp2.sq/ones)-(va.imp2/ones)^2 )
                                )
                          ), conf.int=TRUE)
#>                                                                   Complex
#> sqrt((ones/(ones - 1)) * ((va.imp2.sq/ones) - (va.imp2/ones)^2)) 9408.726
#>                                                                        SE
#> sqrt((ones/(ones - 1)) * ((va.imp2.sq/ones) - (va.imp2/ones)^2)) 156.7972
#>                                                                  CI.l(95%)
#> sqrt((ones/(ones - 1)) * ((va.imp2.sq/ones) - (va.imp2/ones)^2))  9101.409
#>                                                                  CI.u(95%)
#> sqrt((ones/(ones - 1)) * ((va.imp2.sq/ones) - (va.imp2/ones)^2))  9716.042

# The estimate above and the associated confidence interval (which
# involves the estimate of the sampling variance of the complex
# estimator) turn out to be very sound: indeed the TRUE value of the
# parameter is:
sd(sbs.frame$va.imp2)
#> [1] 9211.955


###############################################
# Estimation of Geometric and Harmonic Means. #
###############################################

## 1. Harmonic Mean
 # Recall that the the harmonic mean of a positive variable,
 # say z, can be computed as 1/mean(1/z). Thus, for instance,
 # to get a survey estimate of the harmonic mean of emp.num,
 # you can do as follows:
sbsdes<-des.addvars(sbsdes,emp.num.m1=1/emp.num)
h<-svystatL(sbsdes,expression( ones/emp.num.m1 ),
            conf.int=TRUE)
h
#>                  Complex         SE CI.l(95%) CI.u(95%)
#> ones/emp.num.m1 13.29062 0.04474827  13.20291  13.37832

 # You can easily verify that the obtained estimate is close
 # to the true value (as computed from the sampling frame) and
 # covered by the 95% confidence interval:
1/mean(1/sbs.frame$emp.num)
#> [1] 13.30253

## 2. Geometric Mean
 # Recall that the the geometric mean of a non negative variable,
 # say z, can be computed as exp(mean(log(z))). Thus, for instance,
 # to get a survey estimate of the geometric mean of emp.num,
 # you can do as follows:
sbsdes<-des.addvars(sbsdes,log.emp.num=log(emp.num))
g<-svystatL(sbsdes,expression( exp(log.emp.num/ones) ),
            conf.int=TRUE)
g
#>                        Complex         SE CI.l(95%) CI.u(95%)
#> exp(log.emp.num/ones) 20.51558 0.06077714  20.39646   20.6347

 # You can easily verify that the obtained estimate is close
 # to the true value (as computed from the sampling frame) and
 # covered by the 95% confidence interval:
exp(mean(log(sbs.frame$emp.num)))
#> [1] 20.5362

## 3. Comparison with the arithmetic mean
 # If you compute the arithmetic mean estimate:
a<-svystatTM(sbsdes,~emp.num,estimator="Mean")
a
#>             Mean         SE
#> emp.num 56.88635 0.08207154

 #...you easily verify the expected hierachy,
 # i.e. harmonic <= geometric <= arithmetic:
H<-coef(h)
G<-coef(g)
A<-coef(a)
stopifnot(H <= G && G <= A)


#################################################################
# Further complex examples: estimation of Population Regression #
# Coefficients (for a model with a single predictor).           #
#################################################################

# Suppose you want to estimate of the slope of the population
# regression y vs. emp.num. You can do as follows:

  ## 1. No intercept model: y ~ emp.num - 1
   # Get survey estimate:
   sbsdes<-des.addvars(sbsdes,y4emp.num=y*emp.num,
                              emp.num.sq=emp.num^2)
   svystatL(sbsdes,expression(y4emp.num/emp.num.sq),
            conf.int=TRUE)
#>                       Complex          SE CI.l(95%) CI.u(95%)
#> y4emp.num/emp.num.sq 17.54959 0.009214501  17.53153  17.56765

   # Compare with the actual slope from the population fit:
   pop.fit<-lm(y~emp.num-1,data=sbs.frame)
   coef(pop.fit)
#>  emp.num 
#> 17.53945 

   # ...a very good agreement.

  ## 2. The model with intercept: y ~ emp.num
   # Get survey estimate:
   svystatL(sbsdes,expression(  (ones*y4emp.num  - y*emp.num)/
                                (ones*emp.num.sq - emp.num^2)
                             ),
            conf.int=TRUE)
#>                                                                  Complex
#> (ones * y4emp.num - y * emp.num)/(ones * emp.num.sq - emp.num^2) 16.9244
#>                                                                         SE
#> (ones * y4emp.num - y * emp.num)/(ones * emp.num.sq - emp.num^2) 0.0107875
#>                                                                  CI.l(95%)
#> (ones * y4emp.num - y * emp.num)/(ones * emp.num.sq - emp.num^2)  16.90325
#>                                                                  CI.u(95%)
#> (ones * y4emp.num - y * emp.num)/(ones * emp.num.sq - emp.num^2)  16.94554

   # Compare with the actual slope from the population fit:
   pop.fit<-lm(y~emp.num,data=sbs.frame)
   coef(pop.fit)
#> (Intercept)     emp.num 
#>   569.83949    16.94948 

   # ...again a very good agreement.

# Notice that both results above could be obtained also
# by using ReGenesees specialized function svystatB:

  ## 1.
  svystatB(sbsdes,y~emp.num-1,conf.int=TRUE)
#>         RegCoef.y          SE CI.l(95%) CI.u(95%)
#> emp.num  17.54959 0.009214501  17.53153  17.56765

  ## 2.
  svystatB(sbsdes,y~emp.num,conf.int=TRUE)
#>             RegCoef.y         SE CI.l(95%) CI.u(95%)
#> (Intercept)  603.5290 15.2625745 573.61489 633.44309
#> emp.num       16.9244  0.0107875  16.90325  16.94554

# Notice also - incidentally - that the estimate of the intercept
# turns out to be less accurate than the one we obtained for the slope,
# with about a 6% overestimation.