Compress Nested Factors

The special binary operator %into% transforms nested factors in such a way as to reduce the dimension and/or the sparsity of the model matrix of a calibration problem.

inner %into% outer
"%into%"(inner, outer)

Arguments

inner	Factor with levels nested into `outer` (see ‘Details’).
outer	Factor whose levels are an aggregation of those in `inner` (see ‘Details’).

Details

Arguments inner and outer must both be factors and must have the same length. Moreover, inner must be strictly nested into outer. Nesting is defined by treating the elements of inner and outer as positionally tied (i.e. as if they were columns of the same data frame). The definition is as follows:

inner is strictly nested into outer if, and only if, 1) every set of equal elements in inner corresponds to a set of equal elements in outer, and 2) inner has more non-empty levels than outer.

If inner and outer do not fulfill the conditions above, evaluating inner %into% outer gives an error.
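
The two conditions above can be sketched in base R as follows. The helper is_strictly_nested is hypothetical: it is not part of the package and is not claimed to reproduce the internal checks of %into%; it only mirrors the definition as stated here, and it assumes there are no missing values.

```r
# Hypothetical helper (not part of the package): checks the two
# conditions of the strict-nesting definition given above.
is_strictly_nested <- function(inner, outer) {
  stopifnot(is.factor(inner), is.factor(outer),
            length(inner) == length(outer))
  # Only non-empty levels matter for the definition
  inner <- droplevels(inner)
  outer <- droplevels(outer)
  # Condition 1: each level of inner co-occurs with exactly one level of outer
  tab <- table(inner, outer)
  one.parent <- all(rowSums(tab > 0) == 1)
  # Condition 2: inner has more non-empty levels than outer
  more.levels <- nlevels(inner) > nlevels(outer)
  one.parent && more.levels
}

# Same toy factors used in 'Examples'
reg  <- factor(rep(LETTERS[1:3], c(6, 3, 1)))
prov <- factor(rep(letters[1:6], c(3, 2, 1, 2, 1, 1)))

is_strictly_nested(prov, reg)   # TRUE:  provinces are nested into regions
is_strictly_nested(reg, prov)   # FALSE: condition 1 fails
is_strictly_nested(reg, reg)    # FALSE: condition 2 fails
```

Under this reading, a factor is never strictly nested into itself, because condition 2 requires strictly more levels in inner than in outer.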

Suppose inner is actually nested into outer and define inner.in.outer <- inner %into% outer. The output factor inner.in.outer is built by recoding the levels of inner so that each one is mapped to the integer representing its position inside the corresponding level of outer (see ‘Examples’). As a consequence, the levels of inner.in.outer are 1:n.max, where n.max is the maximum number of levels of inner tied to a single level of outer. Since this number is generally much smaller than the number of levels of inner, inner.in.outer can be seen as a compressed representation of inner. Compression comes at a price: inner.in.outer identifies a level of inner only inside a given level of outer (see ‘Examples’).
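
The recoding just described can be illustrated with a minimal base R sketch. The function compress_nested is hypothetical and is not the package's implementation of %into%: it assumes non-empty inputs without missing values, assumes that inner is already known to be nested into outer, and takes "order inside the corresponding level of outer" to mean the order of the levels of inner.

```r
# Hypothetical re-implementation of the recoding, for illustration only.
compress_nested <- function(inner, outer) {
  stopifnot(is.factor(inner), is.factor(outer),
            length(inner) == length(outer))
  inner <- droplevels(inner)
  outer <- droplevels(outer)
  codes <- integer(length(inner))
  for (g in levels(outer)) {
    idx <- which(outer == g)
    # Levels of inner that occur inside this level of outer, in level order
    present <- levels(droplevels(inner[idx]))
    # Position of each element's level among those levels
    codes[idx] <- match(as.character(inner[idx]), present)
  }
  # n.max is the largest number of inner levels tied to one outer level
  factor(codes, levels = seq_len(max(codes)))
}

reg  <- factor(rep(LETTERS[1:3], c(6, 3, 1)))
prov <- factor(rep(letters[1:6], c(3, 2, 1, 2, 1, 1)))

compressed <- compress_nested(prov, reg)
compressed   # codes 1 1 1 2 2 3 1 1 2 1, as in the 'Examples' output
```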

The usefulness of the %into% operator emerges in the calibration context. As documented in e.calibrate, factorizing a calibration problem (i.e. exploiting the partition argument of e.calibrate) significantly reduces computational complexity, especially for big surveys. Sometimes a calibration model is actually factorizable even though, due to factor nesting, this property is not self-evident. In such cases, naively trying to factorize the outer variable(s) typically leads to very big and sparse model matrices (as well as population totals data frames), with the net result of washing out the expected efficiency gain. A better alternative is to use the %into% operator to compress the inner variable, so that the outer variable can actually be factorized without giving rise to huge and sparse matrices. Section ‘Examples’ gives a practical illustration of this line of reasoning.
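
The effect on the model matrix can be seen on a toy scale with base R alone. The sketch below is only an illustration of the idea and makes no claim about how e.calibrate builds its matrices internally. The compressed codes are typed in by hand, using the values reported in ‘Examples’, so that the code runs without the package; the names ddA, X.naive and X.compressed are introduced here for illustration.

```r
dd <- data.frame(
  reg  = factor(rep(LETTERS[1:3], c(6, 3, 1))),
  prov = factor(rep(letters[1:6], c(3, 2, 1, 2, 1, 1)))
)
# Compressed codes written by hand (values taken from the 'Examples' output)
dd$prov.in.reg <- factor(c(1, 1, 1, 2, 2, 3, 1, 1, 2, 1))

# Rows of region A only, as in a partition on reg.
# Subsetting a data frame keeps all 6 levels of prov.
ddA <- dd[dd$reg == "A", ]

X.naive      <- model.matrix(~ prov - 1, data = ddA)
X.compressed <- model.matrix(~ prov.in.reg - 1, data = ddA)

dim(X.naive)                      # 6 rows, 6 columns
dim(X.compressed)                 # 6 rows, 3 columns
sum(colSums(X.naive) == 0)        # 3 columns are entirely zero
sum(colSums(X.compressed) == 0)   # 0
```

Inside a single region, the uncompressed factor still carries one column for every province in the data, most of them empty, while the compressed factor needs at most n.max columns.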

Value

A factor with levels 1:n.max, where n.max is the maximum number of levels of inner tied to a single level of outer.

Examples

#################################################
## General properties of the %into% operator.   #
#################################################
 # First build a small data frame with 2 nested factors representing
 # regions and provinces:
dd <- data.frame(
                 reg  = factor( rep(LETTERS[1:3], c(6, 3, 1)) ),
                 prov = factor( rep(letters[1:6], c(3, 2, 1, 2, 1, 1)) )
                )
dd
#>    reg prov
#> 1    A    a
#> 2    A    a
#> 3    A    a
#> 4    A    b
#> 5    A    b
#> 6    A    c
#> 7    B    d
#> 8    B    d
#> 9    B    e
#> 10   C    f

 # Since prov is strictly nested into reg we can compute:
prov.in.reg <- dd$prov %into% dd$reg
prov.in.reg
#>  [1] 1 1 1 2 2 3 1 1 2 1
#> Levels: 1 2 3

 # Note that prov.in.reg has 3 levels because, as can be seen from dd,
 # the maximum number of provinces inside a region is 3. Thus prov.in.reg
 # is actually a compressed version of dd$prov (which has 6 levels)
 # but, obviously, it can now be used to identify a province only inside
 # a given region. This means that the two factors below are identical (up
 # to level labels):
dd$prov
#>  [1] a a a b b c d d e f
#> Levels: a b c d e f
interaction(prov.in.reg,dd$reg,drop=TRUE)
#>  [1] 1.A 1.A 1.A 2.A 2.A 3.A 1.B 1.B 2.B 1.C
#> Levels: 1.A 2.A 3.A 1.B 2.B 1.C

 # Note that all the statements below generate errors:
if (FALSE) {
dd$reg  %into% dd$prov
dd$reg  %into% dd$reg
dd$prov %into% dd$prov
}

######################################################################
## A more useful (and complex) example from the calibration context. #
######################################################################
 # First define a design object:
data(data.examples)
exdes <- e.svydesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
weights=~weight)

 # Now suppose you have to perform a calibration process which
 # exploits the following known population totals:
 # 1) Joint distribution of sex and age10c (age in 10 classes)
 #    at the region level;
 # 2) Joint distribution of sex and age5c (age in 5 classes)
 #    at the province level;
 #
 # The auxiliary variables corresponding to the population totals above
 # can be symbolically represented by a calibration model like the following:
 # ~(procod:age5c + regcod:age10c - 1):sex
 #
 # At first sight it seems that only the sex variable can be factorized
 # in the model above. However, if one observes that regions are an aggregation
 # of provinces, one realizes that the regcod variable can be factorized as well.
 # Similarly, since categories of age5c are an aggregation of categories of
 # age10c, age5c can be factorized too. In both cases, using the %into%
 # operator will save computation time and memory.
 # Let us see it in practice:
 #
 ## 1) Global calibration (i.e. calmodel=~(procod:age5c + regcod:age10c - 1):sex,
  # no partition variable, known totals stored in pop07):
t<-system.time(
               cal07<-e.calibrate(design=exdes,df.population=pop07,
                      calmodel=~(procod:age5c + regcod:age10c - 1):sex,
                      calfun="logit",bounds=c(0.2,1.8))
               )

 ## 2) Partitioned calibration on the self-evident variable sex only
  # (i.e. calmodel=~procod:age5c + regcod:age10c - 1, partition=~sex,
  # known totals stored in pop07p):
tp<-system.time(
                cal07p<-e.calibrate(design=exdes,df.population=pop07p,
                        calmodel=~procod:age5c + regcod:age10c - 1,partition=~sex,
                        calfun="logit",bounds=c(0.2,1.8))
                )

 ## 3) Full partitioned calibration on variables sex, regcod and age5c
  # by exploiting %into%.
  # First add to the design object the new compressed factor variables
  # involving nested factors, namely provinces inside regions...
exdes<-des.addvars(exdes,procod.in.regcod=procod %into% regcod)

  # ...and age10c inside age5c:
exdes<-des.addvars(exdes,age10c.in.age5c=age10c %into% age5c)

  # Now calibrate exploiting the new variables
  # (i.e. calmodel=~procod.in.regcod + age10c.in.age5c - 1,
  # partition=~sex:regcod:age5c, known totals stored in pop07pp):
tpp<-system.time(
                 cal07pp<-e.calibrate(design=exdes,df.population=pop07pp,
                          calmodel=~procod.in.regcod + age10c.in.age5c - 1,
                          partition=~sex:regcod:age5c,
                          calfun="logit",bounds=c(0.2,1.8))
                )

 # Now compare execution times:
t
#>    user  system elapsed 
#>    1.34    0.00    1.34 
tp
#>    user  system elapsed 
#>    0.39    0.00    0.39 
tpp
#>    user  system elapsed 
#>    0.33    0.00    0.33 

 # thus, tpp < tp < t, as expected.
 # Notice also that we obtained identical calibrated weights:
all.equal(weights(cal07),weights(cal07p))
#> [1] TRUE
all.equal(weights(cal07),weights(cal07pp))
#> [1] TRUE

 # as it must be.
