into.Rd
The special binary operator %into% transforms nested factors in such a way as to reduce the dimension and/or the sparsity of the model matrix of a calibration problem.
inner %into% outer
"%into%"(inner, outer)
Argument | Description |
---|---|
inner | Factor with levels nested into those of outer. |
outer | Factor whose levels are an aggregation of those in inner. |
Arguments inner and outer must both be factors and must have the same length. Moreover, inner has to be strictly nested into outer. Nesting is defined by treating elements in inner and outer as if they were positionally tied (i.e. as if they belonged to columns of a given data frame). The definition is as follows: inner and outer are strictly nested if, and only if, 1) every set of equal elements in inner corresponds to a set of equal elements in outer, and 2) inner has more non-empty levels than outer.
If inner and outer do not fulfill the conditions above, evaluating inner %into% outer gives an error.
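The two nesting conditions above can be checked by hand before calling the operator. The following is only a minimal base R sketch: the helper name is_strictly_nested is hypothetical (it is not part of the package), it assumes there are no missing values, and it makes no claim about how %into% performs its own checks.

```r
# Hypothetical helper (not part of the package): checks the two conditions
# of strict nesting for positionally tied factors without NAs.
is_strictly_nested <- function(inner, outer) {
  stopifnot(is.factor(inner), is.factor(outer),
            length(inner) == length(outer))
  inner <- droplevels(inner)
  outer <- droplevels(outer)
  # Condition 1: every observed level of inner is tied to one level of outer
  n.outer <- tapply(as.character(outer), inner,
                    function(x) length(unique(x)))
  cond1 <- all(n.outer == 1)
  # Condition 2: inner has more non-empty levels than outer
  cond2 <- nlevels(inner) > nlevels(outer)
  cond1 && cond2
}

# Same toy data used in the Examples: provinces nested into regions
reg  <- factor(rep(LETTERS[1:3], c(6, 3, 1)))
prov <- factor(rep(letters[1:6], c(3, 2, 1, 2, 1, 1)))

is_strictly_nested(prov, reg)   # TRUE
is_strictly_nested(reg, prov)   # FALSE: condition 1 fails
is_strictly_nested(reg, reg)    # FALSE: condition 2 fails
```

The three calls mirror the cases that the Examples list as valid or as generating errors.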
Suppose inner is actually nested into outer and define inner.in.outer <- inner %into% outer. The output factor inner.in.outer is built by recoding the levels of inner in such a way that each of them is mapped into the integer which represents its order inside the corresponding level of outer (see ‘Examples’). As a consequence, the levels of inner.in.outer will be 1:n.max, where n.max is the maximum number of levels of inner tied to a single level of outer. Since this number is generally considerably smaller than the number of levels of inner, inner.in.outer can be seen as a compressed representation of inner. Compression comes at a price: inner.in.outer can now be used to identify a level of inner only inside a given level of outer (see ‘Examples’).
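The recoding described above can be imitated in a few lines of base R. This is a sketch under stated assumptions, not the package's implementation: the function name into_sketch is hypothetical, the inputs are assumed to be strictly nested factors without missing values, and "order inside the corresponding level of outer" is taken to mean the order of the levels of inner.

```r
# Hypothetical re-implementation of the recoding idea (not the package code).
# Assumes inner is strictly nested into outer and that neither contains NAs.
into_sketch <- function(inner, outer) {
  stopifnot(is.factor(inner), is.factor(outer),
            length(inner) == length(outer))
  codes <- integer(length(inner))
  # Within each level of outer, number the observed levels of inner 1, 2, ...
  for (idx in split(seq_along(inner), droplevels(outer))) {
    codes[idx] <- as.integer(factor(inner[idx]))
  }
  factor(codes, levels = seq_len(max(codes)))
}

reg  <- factor(rep(LETTERS[1:3], c(6, 3, 1)))
prov <- factor(rep(letters[1:6], c(3, 2, 1, 2, 1, 1)))

prov.in.reg <- into_sketch(prov, reg)
prov.in.reg
#>  [1] 1 1 1 2 2 3 1 1 2 1
#> Levels: 1 2 3

# The compressed factor identifies a province only together with its region:
recovered <- interaction(prov.in.reg, reg, drop = TRUE)
identical(as.integer(recovered), as.integer(prov))
#> [1] TRUE
```

Here n.max is 3 because region A contains three provinces, so six province levels are compressed into three codes.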
The usefulness of the %into% operator emerges in the calibration context. As documented in e.calibrate, factorizing a calibration problem (i.e. exploiting the partition argument of e.calibrate) yields a significant reduction in computational complexity, especially for big surveys. Now, it is sometimes the case that a calibration model is actually factorizable, even if this property is not self-apparent, due to factor nesting. In such cases, however, naively trying to factorize the outer variable(s) typically leads to very big and sparse model matrices (as well as population totals data frames), with the net result of wiping out the expected efficiency gain. A better alternative is to exploit the %into% operator in order to compress the inner variable in such a way that the outer variable can actually be factorized without giving rise to huge and sparse matrices. Section ‘Examples’ gives a practical illustration of this line of reasoning.
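The sparsity argument can be made concrete with base R alone. The sketch below is only an illustration of the idea, not of how e.calibrate builds its matrices internally: it splits a toy data frame by the outer factor and compares the per-partition model matrices obtained from the original inner factor and from its compressed version. The compressed values are typed in by hand and equal the result of dd$prov %into% dd$reg shown in the Examples.

```r
# Toy data from the Examples: provinces (inner) nested into regions (outer)
dd <- data.frame(
  reg  = factor(rep(LETTERS[1:3], c(6, 3, 1))),
  prov = factor(rep(letters[1:6], c(3, 2, 1, 2, 1, 1)))
)
# Compressed factor, entered by hand (same values as dd$prov %into% dd$reg)
dd$prov.in.reg <- factor(c(1, 1, 1, 2, 2, 3, 1, 1, 2, 1))

by.reg <- split(dd, dd$reg)

# Naive factorization: every partition carries one column per province,
# most of which are identically zero inside a given region
naive <- lapply(by.reg, function(d) model.matrix(~ prov - 1, data = d))
# Compressed factorization: at most n.max = 3 columns per partition
compressed <- lapply(by.reg, function(d) model.matrix(~ prov.in.reg - 1, data = d))

sapply(naive, ncol)
#> A B C
#> 6 6 6
sapply(compressed, ncol)
#> A B C
#> 3 3 3
```

With many regions and provinces the gap between the two column counts grows accordingly, which is the efficiency gain the paragraph above refers to.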
A factor with levels 1:n.max, where n.max is the maximum number of levels of inner tied to a single level of outer.
Further examples can be found in the fill.template help page.
    #################################################
    ## General properties of the %into% operator.  #
    #################################################

    # First build a small data frame with 2 nested factors representing
    # regions and provinces:
    dd <- data.frame(
      reg  = factor( rep(LETTERS[1:3], c(6, 3, 1)) ),
      prov = factor( rep(letters[1:6], c(3, 2, 1, 2, 1, 1)) )
    )
    dd
    #>    reg prov
    #> 1    A    a
    #> 2    A    a
    #> 3    A    a
    #> 4    A    b
    #> 5    A    b
    #> 6    A    c
    #> 7    B    d
    #> 8    B    d
    #> 9    B    e
    #> 10   C    f

    # Since prov is strictly nested into reg we can compute:
    prov.in.reg <- dd$prov %into% dd$reg
    prov.in.reg
    #>  [1] 1 1 1 2 2 3 1 1 2 1
    #> Levels: 1 2 3

    # Note that prov.in.reg has 3 levels because, as can be seen from dd,
    # the maximum number of provinces inside regions is 3. Thus prov.in.reg
    # is actually a compressed version of dd$prov (which had 6 levels)
    # but it can now be used to identify a province only inside
    # a given region. This means that the two factors below are identical (up
    # to levels' labels):
    dd$prov
    #>  [1] a a a b b c d d e f
    #> Levels: a b c d e f
    #>  [1] 1.A 1.A 1.A 2.A 2.A 3.A 1.B 1.B 2.B 1.C
    #> Levels: 1.A 2.A 3.A 1.B 2.B 1.C

    # Note that all the statements below generate errors:
    if (FALSE) {
    dd$reg %into% dd$prov
    dd$reg %into% dd$reg
    dd$prov %into% dd$prov
    }

    ######################################################################
    ## A more useful (and complex) example from the calibration context. #
    ######################################################################

    # First define a design object:
    data(data.examples)
    exdes <- e.svydesign(data=example, ids=~towcod+famcod, strata=~SUPERSTRATUM,
                         weights=~weight)

    # Now suppose you have to perform a calibration process which
    # exploits the following known population totals:
    # 1) Joint distribution of sex and age10c (age in 10 classes)
    #    at the region level;
    # 2) Joint distribution of sex and age5c (age in 5 classes)
    #    at the province level;
    #
    # The auxiliary variables corresponding to the population totals above
    # can be symbolically represented by a calibration model like the following:
    # ~(procod:age5c + regcod:age10c - 1):sex
    #
    # At first sight it seems that only the sex variable can be factorized
    # in the model above. However, if one observes that regions are an aggregation
    # of provinces, one realizes that the regcod variable can be factorized too.
    # Similarly, since categories of age5c are an aggregation of categories of
    # age10c, age5c can be factorized too. In both cases, using the %into%
    # operator will save computation time and memory usage.

    # Let us see it in practice:
    #
    ## 1) Global calibration (i.e. calmodel=~(procod:age5c + regcod:age10c - 1):sex,
    #     no partition variable, known totals stored in pop07):
    t <- system.time(
      cal07 <- e.calibrate(design=exdes, df.population=pop07,
                           calmodel=~(procod:age5c + regcod:age10c - 1):sex,
                           calfun="logit", bounds=c(0.2,1.8))
    )

    ## 2) Partitioned calibration on the self-evident variable sex only
    #     (i.e. calmodel=~procod:age5c + regcod:age10c - 1, partition=~sex,
    #     known totals stored in pop07p):
    tp <- system.time(
      cal07p <- e.calibrate(design=exdes, df.population=pop07p,
                            calmodel=~procod:age5c + regcod:age10c - 1, partition=~sex,
                            calfun="logit", bounds=c(0.2,1.8))
    )

    ## 3) Full partitioned calibration on variables sex, regcod and age5c
    #     by exploiting %into%.
    #     First add to the design object the new compressed factor variables
    #     involving nested factors, namely provinces inside regions...
    exdes <- des.addvars(exdes, procod.in.regcod=procod %into% regcod)

    #     ...and age10c inside age5c:
    exdes <- des.addvars(exdes, age10c.in.age5c=age10c %into% age5c)

    #     Now calibrate exploiting the new variables
    #     (i.e. calmodel=~procod.in.regcod + age10c.in.age5c - 1,
    #     partition=~sex:regcod:age5c, known totals stored in pop07pp):
    tpp <- system.time(
      cal07pp <- e.calibrate(design=exdes, df.population=pop07pp,
                             calmodel=~procod.in.regcod + age10c.in.age5c - 1,
                             partition=~sex:regcod:age5c,
                             calfun="logit", bounds=c(0.2,1.8))
    )

    # Now compare execution times:
    t
    #>    user  system elapsed
    #>    1.34    0.00    1.34
    tp
    #>    user  system elapsed
    #>    0.39    0.00    0.39
    tpp
    #>    user  system elapsed
    #>    0.33    0.00    0.33

    # thus, tpp < tp < t, as expected.

    # Notice also that we obtained identical calibrated weights:
    all.equal(weights(cal07),weights(cal07p))
    #> [1] TRUE
    #> [1] TRUE

    # as it must be.