Make ReGenesees Digest Externally Calibrated Weights

Enables ReGenesees to provide correct variance estimates of (functions of) calibration estimators, even if the survey weights have not been calibrated by ReGenesees.

ext.calibrated(data, ids, strata = NULL, weights,
               fpc = NULL, self.rep.str = NULL, check.data = TRUE,
               weights.cal, calmodel, partition = FALSE, sigma2 = NULL)

Arguments

data	The same as in function `e.svydesign`.
ids	The same as in function `e.svydesign`.
strata	The same as in function `e.svydesign`.
weights	The same as in function `e.svydesign`.
fpc	The same as in function `e.svydesign`.
self.rep.str	The same as in function `e.svydesign`.
check.data	The same as in function `e.svydesign`.
weights.cal	Formula identifying the externally calibrated weights.
calmodel	The same as in function `e.calibrate`.
partition	The same as in function `e.calibrate`.
sigma2	The same as in function `e.calibrate`.

Details

Owing to ReGenesees's ability to provide proper variance estimates for (complex functions of) calibration estimators, some users may be tempted to exploit ReGenesees in the estimation phase even if they did not use ReGenesees for calibration.

This result cannot be achieved naively, by simply passing to ReGenesees function e.svydesign the survey data and supplying the externally calibrated weights through its weights argument.

Indeed, variance estimation methods of ReGenesees's summary statistics functions (svystatTM, svystatR, svystatS, svystatSR, svystatB, svystatQ, svystatL and svystat) are dispatched according to the class of the input design object:

If the design object is un-calibrated (i.e. its class is ‘analytic’), variance formulas are appropriate to Horvitz-Thompson estimators (and functions of them).
If the design object is calibrated (i.e. its class is ‘cal.analytic’), variance formulas are appropriate to Calibration estimators (and functions of them).

Therefore, the naive approach of passing the externally calibrated weights weights.cal to e.svydesign as if they were initial or design weights cannot succeed, since it would result in HT-like variance estimates, leading generally to variance overestimation (with bigger upward bias for variables that are better explained by the calibration model).

Function ext.calibrated has been designed exactly to avoid the aforementioned pitfalls and to allow ReGenesees provide correct variance estimates of (functions of) calibration estimators, even if the survey weights have been calibrated externally by other software.

Argument weights.cal identifies the externally calibrated weights of the units included in the sample. The data variable referenced by weights.cal must be numeric. Currently, only positive externally calibrated weights can be handled (see the dedicated section below).

Other arguments to ext.calibrated derive either from function e.svydesign or from function e.calibrate. The former serve the purpose of passing the survey data and the corresponding sampling design metadata, the latter are meant to tell ext.calibrated how the externally calibrated weights have been obtained.

What if externally calibrated weights happen to be negative?

From a methodological perspective, negative calibration weights are legitimate. However, owing to software implementation details whose modification would not be trivial, function ext.calibrated is not yet able to cope with this case. Note that the problem is actually due to the external origin of the negative calibration weights. In fact, ReGenesees calibration and estimation facilities are entirely able to cope with possibly negative calibration weights, provided they were computed internally.

Value

An object of class cal.analytic, storing the original survey data plus all the sampling design and calibration metadata needed for proper variance estimation.

Note

Exactly as ReGenesees's base functions e.svydesign and e.calibrate would do, ext.calibrated too will wrap inside its return value a local copy of data. As usual, this copy will be stored inside the variables slot of the output list. As usual, again, the calibrated weights will be accessible by using the weights function.

Examples


# Load data sbs data
data(sbs)

#########################################################################
# Simulate an external calibration procedure and compute some benchmark #
# estimates and errors to test function ext.calibrated                  #
#########################################################################
# Define a survey design
sbsdes <- e.svydesign(data= sbs, ids= ~id, strata= ~strata, weights= ~weight,
                      fpc= ~fpc)

# Build a template for population totals
pop <- pop.template(data= sbsdes, calmodel= ~y:nace.macro + emp.cl + emp.num - 1,
                    partition= ~dom3)

# Have a look at the template structure
pop.desc(pop)
#> # Data frame of known totals for a *partitioned* calibration task
#> - Number of rows:         4
#> - Number of columns:      11
#> - Number of known totals: 40
#> 
#> # The data frame structure is as follows
#> ## Column 1 identifies 4 *calibration domains* (one for each row)
#> 
#>   dom3
#> 1    A
#> 2    B
#> 3    C
#> 4    D
#> 
#> ## Columns 2-11 identify known totals organized into *3 BLOCKS* 
#> 
#>  - BLOCK 1 ---------------------------------------------------------------- 
#>  - Benchmark:           Counts of emp.cl 
#>  - Known totals:        20 
#>  - Auxiliary variables: 5 
#>  - Columns:             2-6 
#> 
#>     2       3        4         5         6         
#> aux "[6,9]" "(9,19]" "(19,49]" "(49,99]" "(99,Inf]"
#> 
#> 
#>  - BLOCK 2 ---------------------------------------------------------------- 
#>  - Benchmark:           Total of emp.num 
#>  - Known totals:        4 
#>  - Auxiliary variables: 1 
#>  - Column:              7 
#> 
#>     7        
#> aux "emp.num"
#> 
#> 
#>  - BLOCK 3 ---------------------------------------------------------------- 
#>  - Benchmark:           Totals of y by nace.macro 
#>  - Known totals:        16 
#>  - Auxiliary variables: 4 
#>  - Columns:             8-11 
#> 
#>     8               9            10           11          
#> aux "y:Agriculture" "y:Industry" "y:Commerce" "y:Services"
#> 
#> 

# Fill the template
pop <- fill.template(universe= sbs.frame, template= pop)
#> 
#> # Coherence check between 'universe' and 'template': OK
#> 

# Calibrate
sbscal <- e.calibrate(design= sbsdes, df.population= pop, calfun= "logit",
                      bounds= c(0.8, 1.3), sigma2= ~ emp.num)

# Compute benchmark estimates and errors (average value added per employee by
# region) to be later compared with those obtained by using ext.calibrated 
benchmark <- svystatR(design= sbscal, num= ~va.imp2, den= ~emp.num, by= ~region)
benchmark
#>        region va.imp2/emp.num SE.va.imp2/emp.num
#> North   North        56.51349          0.9344696
#> Center Center        44.99534          1.5700976
#> South   South        65.85808          2.8296036

# Extract the 'externally' calibrated weights...
w <- weights(sbscal)

#...and add these 'externally' calibrated weights to the original survey data
sbs.ext <- data.frame(sbs, w.ext = w)

# NOTE: Now sbs.ext is just a data frame, without any knowledge of the
#       calibration metadata formerly stored inside sbscal (i.e. the object
#       calibrated by ReGenesees)


##############################################################
# Let ReGenesees digest the 'externally' calibrated weights, #
# then re-compute benchmark estimates and errors for testing #
##############################################################
# Simply pass survey data along with sampling design and calibration model
# metadata
sbscal.ext <- ext.calibrated(data= sbs.ext, ids= ~id, strata= ~strata,
                             weights= ~weight, fpc = ~fpc,
                             weights.cal= ~w.ext,
                             calmodel= ~y:nace.macro + emp.cl + emp.num - 1,
                             partition= ~dom3, sigma2= ~emp.num)

# Have a look at the output
sbscal.ext
#> Calibrated, Stratified Independent Unit Sampling Design
#> - [664] strata
#> - [6909] units
#> 
#> Call:
#> ext.calibrated(data = sbs.ext, ids = ~id, strata = ~strata, weights = ~weight, 
#>     fpc = ~fpc, weights.cal = ~w.ext, calmodel = ~y:nace.macro + 
#>         emp.cl + emp.num - 1, partition = ~dom3, sigma2 = ~emp.num)

# Now re-compute benchmark estimates and errors by means of new object
# ext.sbscal
test <- svystatR(design= sbscal.ext, num= ~va.imp2, den= ~emp.num, by= ~region)
test
#>        region va.imp2/emp.num SE.va.imp2/emp.num
#> North   North        56.51349          0.9344696
#> Center Center        44.99534          1.5700976
#> South   South        65.85808          2.8296036

################################################################
# Compare benchmark estimates and errors to those derived from #
# ext.calibrated return object                                 #
################################################################
benchmark
#>        region va.imp2/emp.num SE.va.imp2/emp.num
#> North   North        56.51349          0.9344696
#> Center Center        44.99534          1.5700976
#> South   South        65.85808          2.8296036
test
#>        region va.imp2/emp.num SE.va.imp2/emp.num
#> North   North        56.51349          0.9344696
#> Center Center        44.99534          1.5700976
#> South   South        65.85808          2.8296036

# ...and they are identical, as it must be.

# NOTE: All utility tools yield exactly the same results, e.g.
identical(weights(sbscal), weights(sbscal.ext))
#> [1] TRUE
identical(g.range(sbscal), g.range(sbscal.ext))
#> [1] TRUE


##########################################################################
# Show that the naive idea of directly passing the externally calibrated #
# weights to e.svydesign does NOT work properly for variance estimation  #
##########################################################################
naive <- e.svydesign(data= sbs.ext, ids= ~id, strata= ~strata,
                     weights= ~w.ext, fpc = ~fpc)

# Estimated sampling errors derived by this naive design object...
svystatR(design= naive, num= ~va.imp2, den= ~emp.num, by= ~region)
#>        region va.imp2/emp.num SE.va.imp2/emp.num
#> North   North        56.51349           1.174274
#> Center Center        44.99534           1.639185
#> South   South        65.85808           2.870172

#...do NOT match benchmark values, overestimating them:
benchmark
#>        region va.imp2/emp.num SE.va.imp2/emp.num
#> North   North        56.51349          0.9344696
#> Center Center        44.99534          1.5700976
#> South   South        65.85808          2.8296036