Fill the Known Totals Template for a Calibration Task

Given a template prepared to store the totals of the auxiliary variables for a specific calibration task, computes the actual values of such totals from a sampling frame.

fill.template(universe, template, mem.frac = 10)

Arguments

universe	Data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables (the sampling frame).
template	The template for the calibration task, an object of class `pop.totals`.
mem.frac	A non-negative `numeric` value (the default is `10`). It triggers a memory-efficient algorithm when universe is very large (see ‘Details’ and ‘Performance’).

Details

Recall that a template object returned by function pop.template has a structure that complies with the standard required by e.calibrate, but is empty, in the sense that all the known totals it must be able to store are missing (NA). Whenever these totals are available to the user as such, that is, as already computed aggregated values (e.g. because they come from an external source, like a Population Census), the ReGenesees package cannot fill the template automatically. Stated more explicitly: users themselves bear the responsibility of putting the right values in the right slots of the prepared template data frame. To this end, function pop.desc can be very helpful.

A convenient alternative arises when a “sampling frame” (that is, a data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables) is available. In such cases, the fill.template function is able to: (i) automatically compute the totals of the auxiliary variables from the universe data frame, and (ii) safely arrange and format these values according to the template structure.

Notice that fill.template performs a complete coherence check between universe and template. If this check fails, the program stops and prints an error message whose text should help the user diagnose the cause of the problem. If empty levels are present in any factor variable belonging to universe, they are dropped.
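The effect of dropping empty factor levels can be illustrated with base R alone. The sketch below uses a made-up data frame `toy` (hypothetical, not part of ReGenesees) and base `droplevels()`; it only assumes that fill.template behaves comparably, which the text above suggests but does not spell out.

```r
# Toy universe (hypothetical data): 'area' declares level "C",
# but no unit actually belongs to it
toy <- data.frame(area = factor(c("A", "B", "A"), levels = c("A", "B", "C")))

# With the empty level kept, the model matrix carries an all-zero column
totals.raw <- colSums(model.matrix(~ area - 1, data = toy))
totals.raw

# Dropping unused levels removes that column before aggregation
toy.clean <- droplevels(toy)
totals.clean <- colSums(model.matrix(~ area - 1, data = toy.clean))
totals.clean
```

A template built on a sample in which level "C" never occurs would have no slot for it, which is one plausible reason for removing such levels before the totals are arranged.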

Argument mem.frac (whose value must be numeric and non-negative) triggers a memory-efficient algorithm when universe is very large. The only sound reason to ever change the value of this argument from its default (mem.frac=10) is that an invocation of fill.template caused a memory failure (i.e. a message beginning with cannot allocate vector of size ...) on your machine. In such a case, increasing the value of mem.frac (e.g. mem.frac=20) will provide a better chance of succeeding (for more details, see the ‘Performance’ section below).

Performance

Real-world calibration tasks (e.g. in the field of Official Statistics) can simultaneously involve hundreds of auxiliary variables and refer to target populations of several million units. In such circumstances, the naive aggregation of the calibration model.matrix of universe may turn out to be too memory-demanding (at least in ordinary PC environments) and cause a memory-failure error.
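As a rough illustration of the naive approach, the totals are simply the column sums of the model matrix built on the whole universe. The sketch below uses base R and an invented frame `toy.frame`; the figure of 8 bytes per cell assumes a dense double-precision matrix, and the code is not claimed to be what the package does internally.

```r
# Hypothetical sampling frame: 6 units, one numeric auxiliary variable
# and one factor
toy.frame <- data.frame(
  emp.num = c(10, 20, 30, 40, 50, 60),
  area    = factor(c("n", "s", "n", "s", "n", "s"))
)

# Naive aggregation: build the full model matrix, then sum its columns
mm <- model.matrix(~ emp.num:area - 1, data = toy.frame)
naive.totals <- colSums(mm)
naive.totals

# Approximate memory of a dense double matrix: rows x columns x 8 bytes
est.bytes <- nrow(mm) * ncol(mm) * 8
est.bytes
```

The same arithmetic applied to, say, 5 million rows and 100 columns gives about 4e9 bytes for the matrix alone, which shows how quickly the naive approach can exhaust memory.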

The alternative implemented in fill.template is to: (i) split universe into chunks, (ii) compute partial sums of the auxiliary variables chunk by chunk, (iii) update template by progressively adding such partial sums. This alternative is triggered by parameter mem.frac, which also implicitly controls the number of chunks. The function estimates the memory that would be used to store the full model.matrix of universe and compares it to 4 GB: if the resulting ratio is larger than 1/mem.frac, the memory-efficient algorithm starts; the number of chunks into which universe is then split is determined in such a way that the memory needed to store the model.matrix of each chunk does not exceed a fraction 1/mem.frac of 4 GB.
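The three steps and the threshold rule described above can be sketched in base R. The helper names `n.chunks` and `chunked.totals` are hypothetical, and both the rounding rule and the reading of 4 GB as 4 * 1024^3 bytes are assumptions; this is an illustration of the idea, not the package's actual code.

```r
# Hypothetical helper: number of chunks such that each chunk's model matrix
# needs at most a fraction 1/mem.frac of 4 GB (rounding is an assumption)
n.chunks <- function(est.bytes, mem.frac = 10, ref.bytes = 4 * 1024^3) {
  ratio <- est.bytes / ref.bytes
  if (ratio <= 1 / mem.frac) return(1L)  # small enough: no chunking
  as.integer(ceiling(ratio * mem.frac))
}

# Hypothetical helper: chunk-by-chunk column sums of the model matrix
chunked.totals <- function(formula, universe, chunks) {
  n <- nrow(universe)
  # (i) assign each row to one of 'chunks' contiguous groups
  groups <- split(seq_len(n), ceiling(seq_len(n) * chunks / n))
  totals <- NULL
  for (idx in groups) {
    # (ii) partial sums for this chunk; factor levels are kept as they are,
    #      so every chunk yields the same set of columns
    part <- colSums(model.matrix(formula, data = universe[idx, , drop = FALSE]))
    # (iii) accumulate the partial sums
    totals <- if (is.null(totals)) part else totals + part
  }
  totals
}

# Made-up data to compare the chunked result with the naive one
set.seed(1)
toy <- data.frame(
  x = runif(1000),
  g = factor(sample(c("a", "b", "c"), 1000, replace = TRUE))
)
f <- ~ x:g + g - 1
full <- colSums(model.matrix(f, data = toy))
part <- chunked.totals(f, toy, chunks = 7)
all.equal(full, part)

# Chunk counts for a matrix of 5 million rows by 100 columns (about 4e9 bytes)
n.chunks(5e6 * 100 * 8)                 # default mem.frac = 10
n.chunks(5e6 * 100 * 8, mem.frac = 20)  # larger mem.frac, smaller chunks
```

Only one chunk's model matrix is held in memory at a time, so peak usage scales with the chunk size rather than with the full universe; raising mem.frac trades more iterations for smaller chunks.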

Whenever fill.template switches to the memory-efficient "chunking" algorithm, a warning message signals it and also reports the number of chunks being processed.

Value

An object of class pop.totals storing the actual values of the population totals for the specified calibration task, ready to be safely passed to e.calibrate.

References

Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi:10.1515/jos-2015-0013.

Examples

# Load sbs data:
data(sbs)

# Build a design object:
sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,fpc=~fpc)


###########################
# A simple example first. #
###########################

# Suppose you want to calibrate on the enterprise counts inside areas
  # 1) Build the population totals template:
pop<-pop.template(sbsdes, calmodel=~area-1)

 # Note: given the dimension of the obtained template...
dim(pop)
#> [1]  1 24

 # ...the number of known totals to be stored is 24 (one for each area).

 # 2) Use the fill.template function to (i) automatically compute
 #    such 24 totals from the universe (sbs.frame) and (ii) safely fill
 #    the template:
pop<-fill.template(universe=sbs.frame,template=pop)
#> 
#> # Coherence check between 'universe' and 'template': OK
#> 
pop
#>   area11 area12 area13 area14 area15 area16 area17 area21 area22 area23 area24
#> 1   4434   1246   1505    391   1449    216    612    169    182     91     79
#>   area31 area32 area33 area34 area41 area42 area43 area51 area52 area53 area61
#> 1   1468   1232    363   2326    225    128    125    325    408     47    147
#>   area62 area63
#> 1     77     73

 # 3) Lastly calibrate, e.g. with the unbounded linear distance and
 #    heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf))


########################################
# A more involved (two-sided) example. #
########################################

# Now suppose you have to perform a calibration process which
# exploits as auxiliary information the total number of employees (emp.num)
# and enterprises (ent) inside the domains obtained by:
#  i) crossing nace2 and region;
# ii) crossing emp.cl, region and nace.macro;

# Due to the fact that nace2 is nested into nace.macro,
# the calibration model can be efficiently factorized as follows:
## 1) Add to the design object and universe the new compressed
 #    factor variable involving nested factors, namely:
sbsdes<-des.addvars(sbsdes,nace2.in.nace.macro=nace2 %into% nace.macro)
sbs.frame$nace2.in.nace.macro<-sbs.frame$nace2 %into% sbs.frame$nace.macro

  # 2) Build the template exploiting the new variable:
pop<-pop.template(sbsdes,
     calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
     partition=~nace.macro:region)

 # Note: given the dimension of the obtained template...
dim(pop)
#> [1] 12 68

 # ...the number of known totals to be stored is 792 (12 rows x 66 columns
 #    of totals; the remaining 2 columns identify the partition).

 # 3) Use the fill.template function to (i) automatically compute
 #    such 792 totals from the universe (sbs.frame) and (ii) safely fill
 #    the template:
pop<-fill.template(universe=sbs.frame,template=pop)
#> 
#> # Coherence check between 'universe' and 'template': OK
#> 

 # Note: out of the 792 known totals in pop, only non-zero entries are actually
 # relevant.

 # 4) Lastly calibrate, e.g. with the unbounded linear distance and
 #    heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf))
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.

# Note: a global calibration task would have led to identical calibrated
# weights, but in a more memory-hungry and time-consuming way, as you can
# verify:
  # 1) Build template:
pop.g<-pop.template(sbsdes,
       calmodel=~(emp.num+ent):(nace2:region + emp.cl:nace.macro:region)-1)
dim(pop.g)
#> [1]   1 462

  # 2) Fill template:
pop.g <- fill.template(sbs.frame,pop.g)
#> 
#> # Coherence check between 'universe' and 'template': OK
#> 

  # 3) Calibrate globally:
if (FALSE) {
sbscal.g<-e.calibrate(sbsdes,pop.g,sigma2=~emp.num,bounds=c(-1E6,1E6))

  # 4) Compare calibrated weights (factorized vs. global solution):
range(weights(sbscal)/weights(sbscal.g))

  # ... they are equal.
}


###########################################################
# Just a single example of the memory-efficient algorithm #
# triggered by argument 'mem.frac'.                       #
###########################################################
if (FALSE) {
 # First artificially increase the size of the sampling frame (e.g.
 # up to 5 million rows):
sbs.frame.HUGE<-sbs.frame[sample(seq_len(nrow(sbs.frame)),5000000,replace=TRUE),]
dim(sbs.frame.HUGE)

 # Build the template:
pop<-pop.template(sbsdes,
     calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
     partition=~nace.macro:region)
dim(pop)

 # Fill the template by using the HUGE universe:
pop<-fill.template(universe=sbs.frame.HUGE,template=pop)
}