get.linvar.Rd
Computes the linearized variable(s) of a Complex Estimator in subpopulations (domains). The Complex Estimator can be any analytic function of Horvitz-Thompson or Calibration estimators.
get.linvar(design, expr, by = NULL, stack = FALSE, na.rm = FALSE)
design | Object of class |
---|---|
expr | R expression defining the Complex Estimator (see |
by | Formula specifying the variables that define the "estimation domains". If |
stack | If |
na.rm | Allow missing values in the variables entering the estimator expression? The default is
|
This function has been designed mainly for programmers or advanced users willing to build upon ReGenesees: typical users are not expected to feel much need of it.
The ReGenesees package adopts the Taylor-series linearization technique to estimate the sampling variance of nonlinear (smooth) estimators. Using this technique, it can handle any Complex Estimator that can be expressed as an analytic function of Horvitz-Thompson or Calibration estimators. This includes even user-defined estimators, through function svystatL
[Zardetto, 15].
In the Taylor-series linearization approach, estimating the sampling variance of a Complex Estimator amounts to estimating the sampling variance of the Horvitz-Thompson estimator of the total of the linearized variable of the Complex Estimator under the sampling design at hand. Function get.linvar
computes that linearized variable. If one is interested in estimating the sampling variance of a Complex Estimator by domains (i.e. for subpopulations), then - owing to the Taylor linearization technique - a different linearized variable has to be calculated for each domain. Therefore, in the most general case, the output of function get.linvar
will be a matrix
of linearized variables, with one column per domain.
If the input object design
is calibrated, then the linearized variable produced by function get.linvar
correctly takes into account the theoretical properties of Calibration estimators (see, e.g., [Deville, 1999]). In particular, get.linvar
computes the needed g-weighted residuals (see, e.g., equations (8) and (9) of [Zardetto, 15]) by leveraging function get.residuals
.
The mandatory argument expr
, which identifies the Complex Estimator, must be an object of class expression
. You can specify just a single Complex Estimator at a time, i.e. length(expr)
must be equal to 1
. Any analytic function of estimators of Totals and Means derived from the input design object design
is allowed.
The basic syntax follows the same rules described for function svystatL
. Inside expr
, the estimator of the Total of a variable is simply represented by the name of the variable itself. The reserved name ones
can be used to reference an artificial variable (which will be created on-the-fly, if not already present) whose value is 1
for each sampling unit, so that its Total estimator actually estimates the size of the population in terms of elementary units. Therefore, e.g., expression y/ones
represents the estimator of the Mean of variable y
. Variables referenced inside expr
must be numeric
and belong to design
.
The optional argument by
specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL
(the default option), the estimates produced by svystatL
refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2
selects as estimation domains the subpopulations determined by crossing the modalities of variables B1
and B2
. Notice that a formula like by=~B1+B2
will be automatically translated into the factor-crossing formula by=~B1:B2
: if you need to compute estimates for domains B1
and B2
separately, you have to call svystatL
twice. The design
variables referenced by by
(if any) should be of type factor
, otherwise they will be coerced.
As already noted, in the most general case, the output of function get.linvar
will be a matrix
of linearized variables, with one column per domain. However, the linearized variable of any domain estimator that is a function of Horvitz-Thompson estimators is identically equal to zero outside the domain. Thanks to this property, and only for functions of HT estimators, the columns of the output matrix can be stacked into one single column without information loss. Users may request this convenient output format by specifying stack = TRUE
. Unfortunately, the same result cannot be obtained for functions of Calibration estimators, because the g-weighted residuals are generally non-zero outside the active estimation domain. Therefore, argument stack
does nothing if design
is calibrated.
A matrix of linearized variables.
Deville, J. C. (1999) “Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques.”. Survey Methodology 25: 193-203.
Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi: https://doi.org/10.1515/jos-2015-0013.
svystatL
to calculate estimates and sampling errors of Complex Estimators, get.residuals
to compute residuals of interest variables w.r.t. the calibration model adopted to build a calibrated object.
# Load sbs data: data(sbs) # Create a design object to be calibrated: sbsdes<-e.svydesign(data= sbs, ids= ~id, strata= ~strata, weights= ~weight, fpc= ~fpc) ################################################################## # Just some checks on the consistency of the numerical results # # obtained by ReGenesees with well known theoretical properties. # ################################################################## ## In the Taylor-series linearization approach, estimating the sampling variance of a ## Complex Estimator amounts to estimating the sampling variance of the Horvitz-Thompson ## estimator of the total of the linearized variable of the Complex Estimator under the ## sampling design at hand. ############# ## Check 1 ## ############# # Global estimate (i.e., no domains) of a nonlinear function of HT estimators: # - Average value added per employee # Compute the linearized variable of the estimator: z <- get.linvar(sbsdes, expression(va.imp2/emp.num)) # Have a look: head(z)#> z #> 1 0.0033787523 #> 2 -0.0002174877 #> 3 -0.0010440456 #> 4 -0.0012760633 #> 5 -0.0016815759 #> 6 -0.0028392399# Now add that variable to the input design object: sbsdes <- des.addvars(sbsdes, z = z) # Now compute the estimated standard error of the HT total of the linearized variable: svystatTM(sbsdes, ~z)#> Total SE #> z -1.111315e-14 1.018897# And compare it to the estimated standard error of the complex estimator: svystatL(sbsdes, expression(va.imp2/emp.num))#> Complex SE #> va.imp2/emp.num 57.14199 1.018897# ...which - in this case - can equivalently be obtained exploiting its ratio nature: svystatR(sbsdes, num = ~va.imp2, den= ~emp.num)#> Ratio SE #> va.imp2/emp.num 57.14199 1.018897# Test the equality of the results: all.equal(svystatTM(sbsdes, ~z)$SE, svystatL(sbsdes, expression(va.imp2/emp.num))$SE)#> [1] TRUE# OK ############# ## Check 2 ## ############# # Domain estimates of a nonlinear function of HT estimators: # - Average value added per employee by NACE macro-sectors # Compute the linearized variable(s) of the domain estimator(s): z.dom <- get.linvar(sbsdes, expression(va.imp2/emp.num), by = ~nace.macro) # Have a look: head(z.dom)#> z_Agriculture z_Industry z_Commerce z_Services #> [1,] 0.19211694 0 0 0 #> [2,] -0.01512942 0 0 0 #> [3,] -0.06249362 0 0 0 #> [4,] -0.07565296 0 0 0 #> [5,] -0.09969501 0 0 0 #> [6,] -0.16841719 0 0 0tail(z.dom)#> z_Agriculture z_Industry z_Commerce z_Services #> [6904,] 0 0 0 -0.0045717747 #> [6905,] 0 0 0 -0.0012016249 #> [6906,] 0 0 0 -0.0011432199 #> [6907,] 0 0 0 -0.0013394175 #> [6908,] 0 0 0 -0.0005712673 #> [6909,] 0 0 0 -0.0006619357# Note the staggered matrix-format of the result. Since you are dealing with HT estimators # (namely sbsdes is not calibrated), you can opt for the more convenient (but equivalent) # stacked format: z.dom <- get.linvar(sbsdes, expression(va.imp2/emp.num), by = ~nace.macro, stack = TRUE) # Have a look: head(z.dom)#> z_nace.macro #> [1,] 0.19211694 #> [2,] -0.01512942 #> [3,] -0.06249362 #> [4,] -0.07565296 #> [5,] -0.09969501 #> [6,] -0.16841719tail(z.dom)#> z_nace.macro #> [6904,] -0.0045717747 #> [6905,] -0.0012016249 #> [6906,] -0.0011432199 #> [6907,] -0.0013394175 #> [6908,] -0.0005712673 #> [6909,] -0.0006619357# Now add that variable to the input design object: sbsdes <- des.addvars(sbsdes, z.dom = z.dom) # Now compute the estimated standard error of the HT total of the linearized variable by # domains: svystatTM(sbsdes, ~z.dom, by= ~nace.macro)#> nace.macro Total.z.dom SE.Total.z.dom #> Agriculture Agriculture 3.761748e-15 3.2226705 #> Industry Industry -4.964047e-15 0.6580034 #> Commerce Commerce -3.002329e-14 12.9021106 #> Services Services 3.586974e-15 0.5767339# And compare it to the estimated standard error of the complex estimator: svystatL(sbsdes, expression(va.imp2/emp.num), by= ~nace.macro)#> nace.macro va.imp2/emp.num SE.va.imp2/emp.num #> Agriculture Agriculture 58.59297 3.2226705 #> Industry Industry 47.93647 0.6580034 #> Commerce Commerce 226.99062 12.9021106 #> Services Services 36.89123 0.5767339# ...which - in this case - can equivalently be obtained exploiting its ratio nature: svystatR(sbsdes, num = ~va.imp2, den= ~emp.num, by= ~nace.macro)#> nace.macro va.imp2/emp.num SE.va.imp2/emp.num #> Agriculture Agriculture 58.59297 3.2226705 #> Industry Industry 47.93647 0.6580034 #> Commerce Commerce 226.99062 12.9021106 #> Services Services 36.89123 0.5767339# Test the equality of the results: all.equal(svystatTM(sbsdes, ~z.dom, by= ~nace.macro)$SE, svystatL(sbsdes, expression(va.imp2/emp.num), by= ~nace.macro)$SE)#> [1] TRUE# OK ################################################################### # Now run the same checks for functions of Calibration Estimators # ################################################################### # Build a population totals template and fill it with actual known totals computed from # the sampling frame (sbs.frame): pop <- pop.template(sbsdes, calmodel= ~ent:emp.cl + y:nace.macro-1, partition= ~region) pop <- fill.template(sbs.frame, pop)#> #> # Coherence check between 'universe' and 'template': OK #>#> g.min g.max #> 0.3820877 1.1697223############# ## Check 3 ## ############# # Global estimate (i.e., no domains) of a nonlinear function of CAL estimators: # - Average value added per employee # Compute the linearized variable of the estimator (the name gez should remind you of # the g-weighted residuals of the HT linearized variable w.r.t. the calibration model): gez <- get.linvar(sbscal, expression(va.imp2/emp.num)) # Have a look: head(gez)#> gez #> 1 0.0002200544 #> 2 -0.0025673217 #> 3 -0.0027184816 #> 4 -0.0027183470 #> 5 -0.0033873263 #> 6 -0.0033391179# Now add that variable to the initial *uncalibrated* design object: sbsdes <- des.addvars(sbsdes, gez = gez) # Now compute the estimated standard error of the *HT total* of the linearized variable: svystatTM(sbsdes, ~gez)#> Total SE #> gez -4.937674e-15 0.8554308# And compare it to the estimated standard error of the complex *calibration* estimator: svystatL(sbscal, expression(va.imp2/emp.num))#> Complex SE #> va.imp2/emp.num 56.54261 0.8554308# ...which - in this case - can equivalently be obtained exploiting its ratio nature: svystatR(sbscal, num = ~va.imp2, den= ~emp.num)#> Ratio SE #> va.imp2/emp.num 56.54261 0.8554308# Test the equality of the results: all.equal(svystatTM(sbsdes, ~gez)$SE, svystatL(sbscal, expression(va.imp2/emp.num))$SE)#> [1] TRUE# OK ############# ## Check 4 ## ############# # Domain estimates of a nonlinear function of CAL estimators: # - Average value added per employee by NACE macro-sectors # Compute the linearized variable(s) of the domain estimator(s): gez.dom <- get.linvar(sbscal, expression(va.imp2/emp.num), by = ~nace.macro) # Have a look: head(gez.dom)#> gez_Agriculture gez_Industry gez_Commerce gez_Services #> [1,] 0.08832571 -0.001569001 0.003286128 -0.0009192289 #> [2,] -0.07952122 -0.001724095 0.003953348 -0.0005727482 #> [3,] -0.09158169 -0.001863556 0.004554392 -0.0002598170 #> [4,] -0.09263152 -0.001912281 0.004764609 -0.0001501959 #> [5,] -0.13106081 -0.001857972 0.004530306 -0.0002723712 #> [6,] -0.21674829 0.000218825 0.003100713 0.0003404993tail(gez.dom)#> gez_Agriculture gez_Industry gez_Commerce gez_Services #> [6904,] -3.350826e-04 0.0003841800 0.017881716 -0.0068263119 #> [6905,] -4.642672e-05 0.0007062987 0.019760903 -0.0025790194 #> [6906,] -5.569347e-04 -0.0001049788 0.004258778 -0.0016469592 #> [6907,] -5.778319e-04 -0.0001282706 0.004124138 -0.0019071877 #> [6908,] -1.389237e-03 -0.0001650359 0.001535488 -0.0006649312 #> [6909,] -1.411036e-03 -0.0001894359 0.001394730 -0.0008227651# As you see the matrix no longer has the staggered structure of the HT case. In fact, the # calibration residuals are in general not 0 outside the active domain. Therefore, you # cannot opt for the more convenient stacked format! # Let's thus add all the column of the linearized variable(s) matrix to the initial # *uncalibrated* design object: sbsdes$variables <- cbind(sbsdes$variables, gez.dom) # Now compute the estimated standard error of the *HT totals* of the *domain-specific* # linearized variables: svystatTM(sbsdes, ~gez_Agriculture + gez_Industry + gez_Commerce + gez_Services)#> Total SE #> gez_Agriculture -1.053756e-13 2.6731348 #> gez_Industry 4.727406e-15 0.7109626 #> gez_Commerce 1.956698e-14 14.4259517 #> gez_Services 3.743587e-15 0.6770744# And compare it to the estimated standard errors of the complex *calibration* estimator # by domains: svystatL(sbscal, expression(va.imp2/emp.num), by= ~nace.macro)#> nace.macro va.imp2/emp.num SE.va.imp2/emp.num #> Agriculture Agriculture 57.14679 2.6731348 #> Industry Industry 48.17033 0.7109626 #> Commerce Commerce 227.37322 14.4259517 #> Services Services 36.80444 0.6770744# ...which - in this case - can equivalently be obtained exploiting its ratio nature: svystatR(sbscal, num = ~va.imp2, den= ~emp.num, by= ~nace.macro)#> nace.macro va.imp2/emp.num SE.va.imp2/emp.num #> Agriculture Agriculture 57.14679 2.6731348 #> Industry Industry 48.17033 0.7109626 #> Commerce Commerce 227.37322 14.4259517 #> Services Services 36.80444 0.6770744# Test the equality of the results: all.equal( svystatTM(sbsdes, ~gez_Agriculture + gez_Industry + gez_Commerce + gez_Services)$SE, svystatL(sbscal, expression(va.imp2/emp.num), by= ~nace.macro)$SE)#> [1] TRUE# OK