Linearized Variable(s) of Complex Estimators by Domains

Computes the linearized variable(s) of a Complex Estimator in subpopulations (domains). The Complex Estimator can be any analytic function of Horvitz-Thompson or Calibration estimators.

get.linvar(design, expr, by = NULL, stack = FALSE, na.rm = FALSE)

Arguments

design	Object of class `analytic` (or inheriting from it) containing survey data and sampling design metadata. Can be either uncalibrated or calibrated.
expr	R expression defining the Complex Estimator (see `svystatL` for the basic syntax; see also section ‘Details’).
by	Formula specifying the variables that define the "estimation domains". If `NULL` (the default option) the estimator refers to the whole population.
stack	If `FALSE` (the default), the function will return a `matrix` of domain-specific linearized variables, with one column per domain. If `TRUE`, the domain-specific linearized variables will be stacked without information loss and returned as a single column. The stacked output format is only possible for uncalibrated designs (see ‘Details’).
na.rm	Allow missing values in the variables entering the estimator expression? The default is `FALSE`.

Details

This function has been designed mainly for programmers or advanced users willing to build upon ReGenesees: typical users are not expected to feel much need of it.

The ReGenesees package adopts the Taylor-series linearization technique to estimate the sampling variance of nonlinear (smooth) estimators. Using this technique, it can handle any Complex Estimator that can be expressed as an analytic function of Horvitz-Thompson or Calibration estimators. This includes even user-defined estimators, through function svystatL [Zardetto, 15].

In the Taylor-series linearization approach, estimating the sampling variance of a Complex Estimator amounts to estimating the sampling variance of the Horvitz-Thompson estimator of the total of the linearized variable of the Complex Estimator under the sampling design at hand. Function get.linvar computes that linearized variable. If one is interested in estimating the sampling variance of a Complex Estimator by domains (i.e. for subpopulations), then - owing to the Taylor linearization technique - a different linearized variable has to be calculated for each domain. Therefore, in the most general case, the output of function get.linvar will be a matrix of linearized variables, with one column per domain.

If the input object design is calibrated, then the linearized variable produced by function get.linvar correctly takes into account the theoretical properties of Calibration estimators (see, e.g., [Deville, 1999]). In particular, get.linvar computes the needed g-weighted residuals (see, e.g., equations (8) and (9) of [Zardetto, 15]) by leveraging function get.residuals.

The mandatory argument expr, which identifies the Complex Estimator, must be an object of class expression. You can specify just a single Complex Estimator at a time, i.e. length(expr) must be equal to 1. Any analytic function of estimators of Totals and Means derived from the input design object design is allowed.

The basic syntax follows the same rules described for function svystatL. Inside expr, the estimator of the Total of a variable is simply represented by the name of the variable itself. The reserved name ones can be used to reference an artificial variable (which will be created on-the-fly, if not already present) whose value is 1 for each sampling unit, so that its Total estimator actually estimates the size of the population in terms of elementary units. Therefore, e.g., expression y/ones represents the estimator of the Mean of variable y. Variables referenced inside expr must be numeric and belong to design.

The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by svystatL refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2: if you need to compute estimates for domains B1 and B2 separately, you have to call svystatL twice. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced.

As already noted, in the most general case, the output of function get.linvar will be a matrix of linearized variables, with one column per domain. However, the linearized variable of any domain estimator that is a function of Horvitz-Thompson estimators is identically equal to zero outside the domain. Thanks to this property, and only for functions of HT estimators, the columns of the output matrix can be stacked into one single column without information loss. Users may request this convenient output format by specifying stack = TRUE. Unfortunately, the same result cannot be obtained for functions of Calibration estimators, because the g-weighted residuals are generally non-zero outside the active estimation domain. Therefore, argument stack does nothing if design is calibrated.

Value

A matrix of linearized variables.

References

Deville, J. C. (1999) “Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques.”. Survey Methodology 25: 193-203.

Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi:10.1515/jos-2015-0013 .

Examples

# Load sbs data:
data(sbs)

# Create a design object to be calibrated:
sbsdes<-e.svydesign(data= sbs, ids= ~id, strata= ~strata, weights= ~weight, fpc= ~fpc)

##################################################################
# Just some checks on the consistency of the numerical results   #
# obtained by ReGenesees with well known theoretical properties. #
##################################################################

## In the Taylor-series linearization approach, estimating the sampling variance of a
## Complex Estimator amounts to estimating the sampling variance of the Horvitz-Thompson
## estimator of the total of the linearized variable of the Complex Estimator under the
## sampling design at hand.

#############
## Check 1 ##
#############
# Global estimate (i.e., no domains) of a nonlinear function of HT estimators:
# - Average value added per employee

# Compute the linearized variable of the estimator:
z <- get.linvar(sbsdes, expression(va.imp2/emp.num))

# Have a look: 
head(z)
#>               z
#> 1  0.0033787523
#> 2 -0.0002174877
#> 3 -0.0010440456
#> 4 -0.0012760633
#> 5 -0.0016815759
#> 6 -0.0028392399

# Now add that variable to the input design object:
sbsdes <- des.addvars(sbsdes, z = z)

# Now compute the estimated standard error of the HT total of the linearized variable:
svystatTM(sbsdes, ~z)
#>           Total       SE
#> z -1.111315e-14 1.018897

# And compare it to the estimated standard error of the complex estimator:
svystatL(sbsdes, expression(va.imp2/emp.num))
#>                  Complex       SE
#> va.imp2/emp.num 57.14199 1.018897

# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbsdes, num = ~va.imp2, den= ~emp.num)
#>                    Ratio       SE
#> va.imp2/emp.num 57.14199 1.018897

# Test the equality of the results:
all.equal(svystatTM(sbsdes, ~z)$SE,
          svystatL(sbsdes, expression(va.imp2/emp.num))$SE)
#> [1] TRUE
# OK

#############
## Check 2 ##
#############
# Domain estimates of a nonlinear function of HT estimators:
# - Average value added per employee by NACE macro-sectors

# Compute the linearized variable(s) of the domain estimator(s):
z.dom <- get.linvar(sbsdes, expression(va.imp2/emp.num), by = ~nace.macro)

# Have a look: 
head(z.dom)
#>      z_Agriculture z_Industry z_Commerce z_Services
#> [1,]    0.19211694          0          0          0
#> [2,]   -0.01512942          0          0          0
#> [3,]   -0.06249362          0          0          0
#> [4,]   -0.07565296          0          0          0
#> [5,]   -0.09969501          0          0          0
#> [6,]   -0.16841719          0          0          0
tail(z.dom)
#>         z_Agriculture z_Industry z_Commerce    z_Services
#> [6904,]             0          0          0 -0.0045717747
#> [6905,]             0          0          0 -0.0012016249
#> [6906,]             0          0          0 -0.0011432199
#> [6907,]             0          0          0 -0.0013394175
#> [6908,]             0          0          0 -0.0005712673
#> [6909,]             0          0          0 -0.0006619357

# Note the staggered matrix-format of the result. Since you are dealing with HT estimators
# (namely sbsdes is not calibrated), you can opt for the more convenient (but equivalent)
# stacked format:
z.dom <- get.linvar(sbsdes, expression(va.imp2/emp.num), by = ~nace.macro, stack = TRUE)

# Have a look: 
head(z.dom)
#>      z_nace.macro
#> [1,]   0.19211694
#> [2,]  -0.01512942
#> [3,]  -0.06249362
#> [4,]  -0.07565296
#> [5,]  -0.09969501
#> [6,]  -0.16841719
tail(z.dom)
#>          z_nace.macro
#> [6904,] -0.0045717747
#> [6905,] -0.0012016249
#> [6906,] -0.0011432199
#> [6907,] -0.0013394175
#> [6908,] -0.0005712673
#> [6909,] -0.0006619357

# Now add that variable to the input design object:
sbsdes <- des.addvars(sbsdes, z.dom = z.dom)

# Now compute the estimated standard error of the HT total of the linearized variable by
# domains:
svystatTM(sbsdes, ~z.dom, by= ~nace.macro)
#>              nace.macro   Total.z.dom SE.Total.z.dom
#> Agriculture Agriculture  3.761748e-15      3.2226705
#> Industry       Industry -4.964047e-15      0.6580034
#> Commerce       Commerce -3.002329e-14     12.9021106
#> Services       Services  3.586974e-15      0.5767339

# And compare it to the estimated standard error of the complex estimator:
svystatL(sbsdes, expression(va.imp2/emp.num), by= ~nace.macro)
#>              nace.macro va.imp2/emp.num SE.va.imp2/emp.num
#> Agriculture Agriculture        58.59297          3.2226705
#> Industry       Industry        47.93647          0.6580034
#> Commerce       Commerce       226.99062         12.9021106
#> Services       Services        36.89123          0.5767339

# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbsdes, num = ~va.imp2, den= ~emp.num, by= ~nace.macro)
#>              nace.macro va.imp2/emp.num SE.va.imp2/emp.num
#> Agriculture Agriculture        58.59297          3.2226705
#> Industry       Industry        47.93647          0.6580034
#> Commerce       Commerce       226.99062         12.9021106
#> Services       Services        36.89123          0.5767339

# Test the equality of the results:
all.equal(svystatTM(sbsdes, ~z.dom, by= ~nace.macro)$SE,
          svystatL(sbsdes, expression(va.imp2/emp.num), by= ~nace.macro)$SE)
#> [1] TRUE
# OK

###################################################################
# Now run the same checks for functions of Calibration Estimators #
###################################################################

# Build a population totals template and fill it with actual known totals computed from
# the sampling frame (sbs.frame):
pop <- pop.template(sbsdes, calmodel= ~ent:emp.cl + y:nace.macro-1, partition= ~region)
pop <- fill.template(sbs.frame, pop)
#> 
#> # Coherence check between 'universe' and 'template': OK
#> 

# Calibrate the weights:
sbscal <- e.calibrate(sbsdes, pop)
g.range(sbscal)
#>     g.min     g.max 
#> 0.3820877 1.1697223 

#############
## Check 3 ##
#############
# Global estimate (i.e., no domains) of a nonlinear function of CAL estimators:
# - Average value added per employee

# Compute the linearized variable of the estimator (the name gez should remind you of
# the g-weighted residuals of the HT linearized variable w.r.t. the calibration model):
gez <- get.linvar(sbscal, expression(va.imp2/emp.num))

# Have a look: 
head(gez)
#>             gez
#> 1  0.0002200544
#> 2 -0.0025673217
#> 3 -0.0027184816
#> 4 -0.0027183470
#> 5 -0.0033873263
#> 6 -0.0033391179

# Now add that variable to the initial *uncalibrated* design object:
sbsdes <- des.addvars(sbsdes, gez = gez)

# Now compute the estimated standard error of the *HT total* of the linearized variable:
svystatTM(sbsdes, ~gez)
#>             Total        SE
#> gez -4.937674e-15 0.8554308

# And compare it to the estimated standard error of the complex *calibration* estimator:
svystatL(sbscal, expression(va.imp2/emp.num))
#>                  Complex        SE
#> va.imp2/emp.num 56.54261 0.8554308

# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbscal, num = ~va.imp2, den= ~emp.num)
#>                    Ratio        SE
#> va.imp2/emp.num 56.54261 0.8554308

# Test the equality of the results:
all.equal(svystatTM(sbsdes, ~gez)$SE,
          svystatL(sbscal, expression(va.imp2/emp.num))$SE)
#> [1] TRUE
# OK

#############
## Check 4 ##
#############
# Domain estimates of a nonlinear function of CAL estimators:
# - Average value added per employee by NACE macro-sectors

# Compute the linearized variable(s) of the domain estimator(s):
gez.dom <- get.linvar(sbscal, expression(va.imp2/emp.num), by = ~nace.macro)

# Have a look: 
head(gez.dom)
#>      gez_Agriculture gez_Industry gez_Commerce  gez_Services
#> [1,]      0.08832571 -0.001569001  0.003286128 -0.0009192289
#> [2,]     -0.07952122 -0.001724095  0.003953348 -0.0005727482
#> [3,]     -0.09158169 -0.001863556  0.004554392 -0.0002598170
#> [4,]     -0.09263152 -0.001912281  0.004764609 -0.0001501959
#> [5,]     -0.13106081 -0.001857972  0.004530306 -0.0002723712
#> [6,]     -0.21674829  0.000218825  0.003100713  0.0003404993
tail(gez.dom)
#>         gez_Agriculture  gez_Industry gez_Commerce  gez_Services
#> [6904,]   -3.350826e-04  0.0003841800  0.017881716 -0.0068263119
#> [6905,]   -4.642672e-05  0.0007062987  0.019760903 -0.0025790194
#> [6906,]   -5.569347e-04 -0.0001049788  0.004258778 -0.0016469592
#> [6907,]   -5.778319e-04 -0.0001282706  0.004124138 -0.0019071877
#> [6908,]   -1.389237e-03 -0.0001650359  0.001535488 -0.0006649312
#> [6909,]   -1.411036e-03 -0.0001894359  0.001394730 -0.0008227651

# As you see the matrix no longer has the staggered structure of the HT case. In fact, the
# calibration residuals are in general not 0 outside the active domain. Therefore, you
# cannot opt for the more convenient stacked format!

# Let's thus add all the column of the linearized variable(s) matrix to the initial
# *uncalibrated* design object:
sbsdes$variables <- cbind(sbsdes$variables, gez.dom)

# Now compute the estimated standard error of the *HT totals* of the *domain-specific*
# linearized variables:
svystatTM(sbsdes, ~gez_Agriculture + gez_Industry + gez_Commerce + gez_Services)
#>                         Total         SE
#> gez_Agriculture -1.053756e-13  2.6731348
#> gez_Industry     4.727406e-15  0.7109626
#> gez_Commerce     1.956698e-14 14.4259517
#> gez_Services     3.743587e-15  0.6770744

# And compare it to the estimated standard errors of the complex *calibration* estimator
# by domains:
svystatL(sbscal, expression(va.imp2/emp.num), by= ~nace.macro)
#>              nace.macro va.imp2/emp.num SE.va.imp2/emp.num
#> Agriculture Agriculture        57.14679          2.6731348
#> Industry       Industry        48.17033          0.7109626
#> Commerce       Commerce       227.37322         14.4259517
#> Services       Services        36.80444          0.6770744

# ...which - in this case - can equivalently be obtained exploiting its ratio nature:
svystatR(sbscal, num = ~va.imp2, den= ~emp.num, by= ~nace.macro)
#>              nace.macro va.imp2/emp.num SE.va.imp2/emp.num
#> Agriculture Agriculture        57.14679          2.6731348
#> Industry       Industry        48.17033          0.7109626
#> Commerce       Commerce       227.37322         14.4259517
#> Services       Services        36.80444          0.6770744

# Test the equality of the results:
all.equal(
  svystatTM(sbsdes, ~gez_Agriculture + gez_Industry + gez_Commerce + gez_Services)$SE,
  svystatL(sbscal, expression(va.imp2/emp.num), by= ~nace.macro)$SE)
#> [1] TRUE
# OK

Linearized Variable(s) of Complex Estimators by Domains

Arguments

Details

Value

References

See also

Examples

Contents

Author