fill.template.Rd
Given a template prepared to store the totals of the auxiliary variables for a specific calibration task, computes the actual values of such totals from a sampling frame.
fill.template(universe, template, mem.frac = 10)
Argument | Description |
---|---|
universe | Data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables (the sampling frame). |
template | The template for the calibration task, an object of class |
mem.frac | A |
Recall that a template object returned by function pop.template has a structure that complies with the standard required by e.calibrate, but is empty, in the sense that all the known totals it must be able to store are missing (NA). Whenever these totals are available to the user as such, that is, in the form of already computed aggregated values (e.g. because they come from an external source, like a Population Census), the ReGenesees package cannot automatically fill the template. Stated more explicitly: it is the user's responsibility to put the right values in the right slots of the prepared template data frame. To this end, function pop.desc can be very helpful.
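As a rough illustration of what "putting the right values in the right slots" involves, the sketch below fills a one-row data frame by column name. It uses base R only: `template_like` is a hypothetical plain data frame standing in for a real template, and the totals are illustrative numbers. Whether the same assignment syntax can be applied directly to an object returned by pop.template is an assumption to verify.

```r
# Hypothetical stand-in for an empty template: one row, one NA slot per total.
# (A real template would come from pop.template(); this is a plain data frame.)
template_like <- data.frame(area11 = NA_real_, area12 = NA_real_, area13 = NA_real_)

# Externally sourced totals (illustrative numbers), deliberately in a different order.
census_totals <- c(area13 = 1505, area11 = 4434, area12 = 1246)

# Guard against typos or missing slots before writing anything.
stopifnot(setequal(names(census_totals), names(template_like)))

# Fill by column name rather than by position, so the source order is irrelevant.
template_like[1, names(census_totals)] <- census_totals

template_like
```

Matching on names rather than positions is the point of the sketch: with hundreds of slots, positional filling is where mistakes typically creep in.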
A more convenient alternative arises when a “sampling frame” (that is, a data frame containing the complete list of the units belonging to the target population, along with the corresponding values of the auxiliary variables) is available. In such cases, the fill.template function is able to: (i) automatically compute the totals of the auxiliary variables from the universe data frame, and (ii) safely arrange and format these values according to the template structure.
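Conceptually, the totals of the auxiliary variables are the column sums of the calibration model matrix built on the whole frame. The following base-R sketch shows that idea on a toy universe. The data and the model are hypothetical, and the sketch does not reproduce the arranging and formatting step that fill.template performs.

```r
# Toy universe (hypothetical data): 6 units, one factor and one numeric variable.
universe <- data.frame(
  area    = factor(c("a", "a", "b", "b", "b", "c")),
  emp.num = c(10, 20, 5, 15, 30, 40)
)

# Calibration model: unit counts and employee totals by area, no intercept.
calmodel <- ~ area + emp.num:area - 1

# One row per unit, one column per auxiliary variable.
X <- model.matrix(calmodel, data = universe)

# The population totals are the column sums of that matrix.
totals <- colSums(X)
totals
```

The first three entries count the units in each area, and the last three sum emp.num within each area.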
Notice that fill.template performs a complete coherence check between universe and template. If this check fails, the program stops and prints an error message, whose content should help the user diagnose the cause of the problem. Any empty levels present in factor variables belonging to universe are dropped.
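The kind of preparation involved can be sketched in base R. The check below is only an illustrative example (it is not the coherence check actually implemented in fill.template), and the data are hypothetical. It verifies that every variable named in the model exists in the frame, then drops empty factor levels with `droplevels()`.

```r
# Toy universe (hypothetical): 'region' declares a level "south" that no unit has.
universe <- data.frame(
  region = factor(c("north", "north", "east"),
                  levels = c("east", "north", "south")),
  ent    = c(1, 1, 1)
)
calmodel <- ~ region:ent - 1

# Illustrative check: every variable used by the model must exist in the universe.
missing_vars <- setdiff(all.vars(calmodel), names(universe))
if (length(missing_vars) > 0) {
  stop("Variables not found in universe: ", paste(missing_vars, collapse = ", "))
}

# An empty level would otherwise produce an all-zero column in the model matrix.
universe <- droplevels(universe)
levels(universe$region)
```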
Argument mem.frac (whose value must be numeric and non-negative) triggers a memory-efficient algorithm when universe is really huge. The only sound reason to ever change the value of this argument from its default (mem.frac=10) is that an invocation of fill.template caused a memory failure (i.e. an error message beginning with cannot allocate vector of size ...) on your machine. In such a case, increasing the value of mem.frac (e.g. mem.frac=20) will provide a better chance of succeeding (for more details, see the ‘Performance’ section below).
Real-world calibration tasks (e.g. in the field of Official Statistics) can simultaneously involve hundreds of auxiliary variables and refer to target populations of several million units. In such circumstances, the naive aggregation of the calibration model.matrix of universe may turn out to be too memory-demanding (at least in ordinary PC environments) and cause a memory-failure error.
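A back-of-the-envelope calculation shows why. The sketch below assumes a dense numeric matrix stored at 8 bytes per cell; the population size and the number of columns are hypothetical figures chosen to match the orders of magnitude mentioned above.

```r
# Approximate size of a dense numeric model matrix:
# rows x columns x 8 bytes (one double per cell).
n_units   <- 5e6   # hypothetical population size
n_columns <- 500   # hypothetical number of model-matrix columns

bytes <- n_units * n_columns * 8
gib   <- bytes / 1024^3
round(gib, 1)
#> [1] 18.6
```

A single object of this size, before any temporary copies made during aggregation, already exceeds the RAM of many desktop machines.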
The alternative implemented in fill.template is to: (i) split universe into chunks, (ii) compute partial sums of auxiliary variables chunk-by-chunk, (iii) update template by progressively adding such partial sums. This alternative is triggered by parameter mem.frac, which also implicitly controls the number of chunks. The function estimates the memory that would be used to store the full model.matrix of universe and compares it to 4 GB: if the resulting ratio is bigger than 1/mem.frac, the memory-efficient algorithm starts; the number of chunks into which universe will then be split is determined in such a way that the memory needed to store the model.matrix of each chunk does not exceed a fraction 1/mem.frac of 4 GB.
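The rule and the chunk-by-chunk aggregation described above can be sketched in base R as follows. This is not the package's internal code. The function names `n_chunks_needed` and `chunked_totals` are hypothetical, and the sketch assumes 8 bytes per matrix cell, 4 GB taken as 4 * 1024^3 bytes, nearly equal-sized chunks, and the smallest chunk count that satisfies the stated bound.

```r
# Number of chunks implied by the rule above (illustrative assumptions only).
n_chunks_needed <- function(n_rows, n_cols, mem.frac = 10) {
  ratio <- (n_rows * n_cols * 8) / (4 * 1024^3)  # estimated size relative to 4 GB
  if (ratio <= 1 / mem.frac) return(1L)          # small enough: no chunking
  as.integer(ceiling(ratio * mem.frac))          # each chunk <= 4 GB / mem.frac
}

# Totals computed chunk by chunk: only one chunk's model matrix exists at a time.
chunked_totals <- function(universe, calmodel, n_chunks) {
  # Assign consecutive rows to chunks whose sizes differ by at most one.
  chunk_id <- sort(rep_len(seq_len(n_chunks), nrow(universe)))
  totals <- NULL
  for (k in seq_len(n_chunks)) {
    X_k     <- model.matrix(calmodel, data = universe[chunk_id == k, , drop = FALSE])
    partial <- colSums(X_k)
    totals  <- if (is.null(totals)) partial else totals + partial
  }
  totals
}

# Toy check (hypothetical data): chunked and naive aggregation agree.
set.seed(1)
universe <- data.frame(
  area    = factor(sample(c("a", "b", "c"), 1000, replace = TRUE)),
  emp.num = rpois(1000, 20)
)
calmodel <- ~ area + emp.num:area - 1

full    <- colSums(model.matrix(calmodel, data = universe))
chunked <- chunked_totals(universe, calmodel, n_chunks = 4)
all.equal(full, chunked)

n_chunks_needed(5e6, 500)                 # 47 chunks under the assumptions above
n_chunks_needed(5e6, 500, mem.frac = 20)  # 94 chunks: a larger mem.frac, smaller chunks
```

Because subsetting a data frame keeps the declared factor levels, every chunk yields a model matrix with the same columns, which is what makes the partial sums directly addable. Under these assumptions, mem.frac = 0 gives 1/mem.frac = Inf, so the chunking branch is never entered.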
Whenever fill.template switches to the memory-efficient "chunking" algorithm, a warning message signals the switch and reports the number of chunks being processed.
An object of class pop.totals storing the actual values of the population totals for the specified calibration task, ready to be safely passed to e.calibrate.
Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi: https://doi.org/10.1515/jos-2015-0013.
e.calibrate for calibrating weights, pop.template for the definition of the class pop.totals and to build a "template" data frame for known population totals, pop.desc to provide a natural language description of the template structure, and %into% for the compression operator for nested factors.
# Load sbs data:
data(sbs)

# Build a design object:
sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,fpc=~fpc)

###########################
# A simple example first. #
###########################

# Suppose you want to calibrate on the enterprise counts inside areas

# 1) Build the population totals template:
pop<-pop.template(sbsdes, calmodel=~area-1)

# Note: given the dimension of the obtained template...
dim(pop)
#> [1]  1 24
# ...the number of known totals to be stored is 24 (one for each area).

# 2) Use the fill.template function to (i) automatically compute
# such 24 totals from the universe (sbs.frame) and (ii) safely fill
# the template:
pop<-fill.template(universe=sbs.frame,template=pop)
#>
#> # Coherence check between 'universe' and 'template': OK
#>
pop
#>   area11 area12 area13 area14 area15 area16 area17 area21 area22 area23 area24
#> 1   4434   1246   1505    391   1449    216    612    169    182     91     79
#>   area31 area32 area33 area34 area41 area42 area43 area51 area52 area53 area61
#> 1   1468   1232    363   2326    225    128    125    325    408     47    147
#>   area62 area63
#> 1     77     73

# 3) Lastly calibrate, e.g. with the unbounded linear distance and
# heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf))

########################################
# A more involved (two-sided) example. #
########################################

# Now suppose you have to perform a calibration process which
# exploits as auxiliary information the total number of employees (emp.num)
# and enterprises (ent) inside the domains obtained by:
# i) crossing nace2 and region;
# ii) crossing emp.cl, region and nace.macro;

# Due to the fact that nace2 is nested into nace.macro,
# the calibration model can be efficiently factorized as follows:

# 1) Add to the design object and universe the new compressed
# factor variable involving nested factors, namely:
sbsdes<-des.addvars(sbsdes,nace2.in.nace.macro=nace2 %into% nace.macro)
sbs.frame$nace2.in.nace.macro<-sbs.frame$nace2 %into% sbs.frame$nace.macro

# 2) Build the template exploiting the new variable:
pop<-pop.template(sbsdes,
     calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
     partition=~nace.macro:region)

# Note: given the dimension of the obtained template...
dim(pop)
#> [1] 12 68
# ...the number of known totals to be stored is 792.

# 3) Use the fill.template function to (i) automatically compute
# such 792 totals from the universe (sbs.frame) and (ii) safely fill
# the template:
pop<-fill.template(universe=sbs.frame,template=pop)
#>
#> # Coherence check between 'universe' and 'template': OK
#>
# Note: out of the 792 known totals in pop, only non-zero entries are actually
# relevant

# 4) Lastly calibrate, e.g. with the unbounded linear distance and
# heteroskedastic effects proportional to emp.num:
sbscal<-e.calibrate(sbsdes,pop,sigma2=~emp.num,bounds=c(-Inf,Inf))
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.
#> Warning: Calibration system is singular: switching to Moore-Penrose generalized inverse.

# Note: a global calibration task would have led to identical calibrated
# weights, but in a more memory-hungry and time-consuming way, as you can
# verify:

# 1) Build template:
pop.g<-pop.template(sbsdes,
       calmodel=~(emp.num+ent):(nace2:region + emp.cl:nace.macro:region)-1)
dim(pop.g)
#> [1]   1 462

# 2) Fill template:
pop.g <- fill.template(sbs.frame,pop.g)
#>
#> # Coherence check between 'universe' and 'template': OK
#>

# 3) Calibrate globally:
if (FALSE) {
sbscal.g<-e.calibrate(sbsdes,pop.g,sigma2=~emp.num,bounds=c(-1E6,1E6))

# 4) Compare calibrated weights (factorized vs. global solution):
range(weights(sbscal)/weights(sbscal.g))
# ... they are equal.
}

###########################################################
# Just a single example of the memory-efficient algorithm #
# triggered by argument 'mem.frac'.                       #
###########################################################
if (FALSE) {
# First artificially increase the size of the sampling frame (e.g.
# up to 5 million rows):
sbs.frame.HUGE<-sbs.frame[sample(1:nrow(sbs.frame),5000000,rep=TRUE),]
dim(sbs.frame.HUGE)

# Build the template:
pop<-pop.template(sbsdes,
     calmodel=~(emp.num+ent):(nace2.in.nace.macro + emp.cl)-1,
     partition=~nace.macro:region)
dim(pop)

# Fill the template by using the HUGE universe:
pop<-fill.template(universe=sbs.frame.HUGE,template=pop)
}