contrasts.RG.Rd
These functions control the way ReGenesees translates a symbolic calibration model (as specified by the calmodel
formula in e.calibrate
, pop.template
, fill.template
, aux.estimates
, ...) to its numeric encoding (i.e. the model-matrix used by the internal algorithms to perform actual computations).
contrasts.RG() contrasts.off() contrasts.reset() contr.off(n, base = 1, contrasts = TRUE, sparse = FALSE)
n | Formally as in function |
---|---|
base | Formally as in function |
contrasts | Fictitious, but formally as in function |
sparse | Formally as in function |
All the calibration facilities in package ReGenesees transform symbolic calibration models (as specified by the user via calmodel
) into numeric model-matrices. Factor variables occurring in calmodel
play a special role in such transformations, as the encoding of a factor can (and, by default, do) depend on the structure of the formula in which it occurs. The ReGenesees functions documented below control the way factor levels are translated into auxiliary variables and mapped to columns of population totals data frames. The underlying technical tools are contrasts
handling functions (see Section 'Technical Remarks and Warnings' for further details).
Under the calibration perspective, ordered and unordered factors appearing in calmodel
must be treated the same way. This obvious constraint defines the ReGenesees default for contrasts handling. Such a default is silently set when loading the package. Moreover, you can set it also by calling contrasts.RG()
. As can be understood by reading Section 'Technical Remarks and Warnings' below, the default setup can be seen as “efficient-but-slightly-risky”.
A call to contrasts.off()
simply disables all contrasts and imposes a complete dummy coding of factors. Under this setup, all levels of factors occurring in calmodel
generate a distinct model-matrix column, even if some of these columns can be linearly dependent. To be very concise, the contrasts.off()
setup can be seen as “safe-but-less-efficient” as compared to the default one (read Section 'Technical Remarks and Warnings' for more details).
Function contr.off
is not meant to be called directly by users: it serves only the purpose of enabling the contrasts.off()
setup.
A call to contrasts.reset()
restores R factory-fresh defaults for contrasts (which do distinguish ordered and unordered factors). Users may want to use this function after having completed a ReGenesees session, e.g. before switching to other R functions relying on contrasts (such as lm
, glm
, ...).
“[...] the corner cases of model.matrix and friends is some of the more impenetrable code in the R sources.”
Peter Dalgaard
Contrasts handling functions tell R how to encode the model-matrix associated to a given model-formula on specific data (see, e.g., contr.treatment
, contrasts
, model.matrix
, formula
, and references therein). More specifically, contrasts control the way factor-terms and interaction-terms occurring in formulae get actually represented in the model matrix. For instance, R (by default) avoids the complete dummy coding of a factor whenever it is able to understand, on the basis of the structure of the model-formula, that some of the factor levels would generate linearly dependent (i.e. redundant) columns in the model-matrix (see Section ‘Examples’).
The usage of contrasts to build smaller, full-rank calibration model-matrices would be a good opportunity for ReGenesees, provided it comes without any information loss. Indeed, smaller model-matrices mean less population totals to be provided by users, and higher efficiency in computations.
Unfortunately, few controversial cases have been signalled in which R ability to "simplify" a model-matrix on the basis of the structure of the related model-formula seems to lead to strange, unexpected results (see, e.g., this R-help thread). No matter whether such R behaviour is or not an actual bug with respect to its impact on R linear model fitting or ANOVA facilities, it surely represents a concern for ReGenesees with respect to calibration (see Section ‘Examples’). The risk is the following: there could be rare cases in which exploiting R contrasts handling functions inside ReGenesees ends up with a wrong (i.e. incomplete) population totals template, and (eventually) with wrong calibration results.
Though one could adopt several ad-hoc countermeasures to sterilize the risk described above while still taking advantage of contrasts (see Section ‘Examples’), the choice of completely disabling contrasts via contrasts.off()
would result in a 100% safety guarantee. If computational efficiency is not a serious concern for you, switching off contrasts may determine the best ReGenesees setup for your analyses.
Chambers, J. M. and Hastie, T. J. (1992) Statistical models. Chapter 2 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.
"Why does the order of terms in a formula translate into different models/model matrices?", R-help thread.
e.calibrate
, pop.template
, fill.template
, and aux.estimates
for the meaning and the usage of calmodel
in ReGenesees. formula
, model.matrix
, contrasts
, and contr.treatment
to understand the role of contrasts in R.
###################### # Easy things first: # ###################### # 1) When ReGenesees is loaded, its standard way of handling contrasts # (i.e. no ordered-unordered factor distinction) is silently set: options("contrasts")#> $contrasts #> unordered ordered #> "contr.treatment" "contr.treatment" #># 2) To switch off contrasts (i.e. apply always dummy coding to factors), # simply type: contrasts.off()#> #> # Contrasts treatment has been switched off: #> #> $contrasts #> unordered ordered #> "contr.off" "contr.off" #># 3) To restore R factory-fresh defaults for contrasts, simply type: contrasts.reset()#> #> # Factory-fresh defaults for contrasts treatment have been restored: #> #> $contrasts #> unordered ordered #> "contr.treatment" "contr.poly" #># 4) To switch on again standard ReGenesees contrasts, simply type: contrasts.RG()#> #> # Standard ReGenesees contrasts treatment has been set: #> #> $contrasts #> unordered ordered #> "contr.treatment" "contr.treatment" #>############################################################# # A simple calibration example to understand the effects of # # switching off contrasts. # ############################################################# # Load sbs data: data(sbs) # Create a design object: sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,fpc=~fpc) # Suppose you want to calibrate on the marginals of 'region' (a factor # with 3 levels: "North", "Center", and "South") and 'dom3' (a factor # with 4 levels: "A", "B", "C", and "D"). # Let's see how things go under the 'contrast on' (default) and 'contrasts off' # setups: ######################################## # 1) ReGenesees default: contrasts ON. # ######################################## # As you see contrasts are ON: options("contrasts")#> $contrasts #> unordered ordered #> "contr.treatment" "contr.treatment" #># Build and fill the population totals template: temp1<-pop.template(data=sbsdes,calmodel=~region+dom3-1) pop1<-fill.template(universe=sbs.frame,template=temp1)#> #> # Coherence check between 'universe' and 'template': OK #># Now inspect the obtained known totals data.frame: pop1#> regionNorth regionCenter regionSouth dom3B dom3C dom3D #> 1 10374 3541 3403 3459 5144 7020# As you see: (i) it has only 6 columns, and (ii) the "A" level of # factor 'dom3' is missing. This is because contrasts are ON, so that # R is able to understand that only 6 out of the 3 + 4 marginal counts # are actually independent. Indeed, the "A" counts... sum(sbs.frame$dom3=="A")#> [1] 1695#> [1] 1695# Now calibrate: cal1<-e.calibrate(sbsdes,pop1) ################################################## # 2) Switch OFF contrasts: dummy coding for all! # ################################################## # To switch off contrasts simply call: contrasts.off()#> #> # Contrasts treatment has been switched off: #> #> $contrasts #> unordered ordered #> "contr.off" "contr.off" #># Build and fill the population totals template: temp2<-pop.template(data=sbsdes,calmodel=~region+dom3-1) pop2<-fill.template(universe=sbs.frame,template=temp2)#> #> # Coherence check between 'universe' and 'template': OK #># Now inspect the obtained known totals data.frame: pop2#> regionNorth regionCenter regionSouth dom3A dom3B dom3C dom3D #> 1 10374 3541 3403 1695 3459 5144 7020# As you see: (1) it has now 7 columns, and (2) the "A" level of factor # 'dom3' has been resurrected. This is because contrasts are OFF, # so that each level of factors in calmodel are coded to dummies. # Now calibrate. Since only 6 out of 7 dummy auxiliary variables are # actually independent, the model.matrix computed by e.calibrate will not be # full-rank. As a consequence, e.calibrate would use the Moore-Penrose # generalized inverse (in practice, this could depend on the machine R # is running on): cal2<-e.calibrate(sbsdes,pop2) # Compare the calibration weights generated under setups 1) and 2): all.equal(weights(cal2),weights(cal1))#> [1] TRUE# Lastly set back contrasts to ReGenesees default: contrasts.RG()#> #> # Standard ReGenesees contrasts treatment has been set: #> #> $contrasts #> unordered ordered #> "contr.treatment" "contr.treatment" #>############################################# # Weird results, risks and countermeasures. # # ("When the going gets tough...") # ############################################# # Suppose you want to calibrate on: (A) the joint distribution of 'region' (a # factor with 3 levels: "North", "Center", and "South") and 'nace.macro' (a # factor with 4 levels: "Agriculture", "Industry", "Commerce", and "Services") # and, at the same time, on (B) the total number of employees ('emp.num', a # numeric variable) by 'nace.macro'. # # You rightly expect that 3*4 + 4 = 16 population totals are needed for such a # calibration task. Indeed, knowing the enterprise counts for the 3*4 cells of # the joint distribution (A) doesn't tell anything on the number of employees # working in the 4 nace macrosectors (B), and vice-versa. # # Moreover, you might expect that calibration models: # (i) calmodel = ~region:nace.macro + emp.num:nace.macro - 1 # (ii) calmodel = ~emp.num:nace.macro + region:nace.macro - 1 # # should produce the same results. # Unfortunately, WHEN CONTRASTS ARE ON, this is not the case: only model (i) # leads to the expected, right results. Let's see. ########################################### # A strange result when contrasts are ON: # # the order of terms in calmodel matters! # ########################################### # As you see contrasts are ON: options("contrasts")#> $contrasts #> unordered ordered #> "contr.treatment" "contr.treatment" #># Start with (i) calmodel = ~region:nace.macro + emp.num:nace.macro - 1 # Build and fill the population totals template: temp1<-pop.template(data=sbsdes,~region:nace.macro+emp.num:nace.macro-1) pop1<-fill.template(universe=sbs.frame,template=temp1)#> #> # Coherence check between 'universe' and 'template': OK #># Now inspect the obtained known totals data.frame: pop1#> regionNorth:nace.macroAgriculture regionCenter:nace.macroAgriculture #> 1 283 61 #> regionSouth:nace.macroAgriculture regionNorth:nace.macroIndustry #> 1 114 5080 #> regionCenter:nace.macroIndustry regionSouth:nace.macroIndustry #> 1 2290 1925 #> regionNorth:nace.macroCommerce regionCenter:nace.macroCommerce #> 1 1902 554 #> regionSouth:nace.macroCommerce regionNorth:nace.macroServices #> 1 604 3109 #> regionCenter:nace.macroServices regionSouth:nace.macroServices #> 1 636 760 #> nace.macroAgriculture:emp.num nace.macroIndustry:emp.num #> 1 17168 579580 #> nace.macroCommerce:emp.num nace.macroServices:emp.num #> 1 69525 318121#> [1] 1 16# Now calibrate: cal1<-e.calibrate(sbsdes,pop1) # Now compare with (ii) calmodel = ~emp.num:nace.macro + region:nace.macro - 1 # Build and fill the population totals template: temp2<-pop.template(data=sbsdes,~emp.num:nace.macro+region:nace.macro-1) pop2<-fill.template(universe=sbs.frame,template=temp2)#> #> # Coherence check between 'universe' and 'template': OK #>#> [1] 1 12# Apparently 4 totals are missing; let's inspect the known totals data.frame # to understand which ones: pop2#> emp.num:nace.macroAgriculture emp.num:nace.macroIndustry #> 1 17168 579580 #> emp.num:nace.macroCommerce emp.num:nace.macroServices #> 1 69525 318121 #> nace.macroAgriculture:regionCenter nace.macroIndustry:regionCenter #> 1 61 2290 #> nace.macroCommerce:regionCenter nace.macroServices:regionCenter #> 1 554 636 #> nace.macroAgriculture:regionSouth nace.macroIndustry:regionSouth #> 1 114 1925 #> nace.macroCommerce:regionSouth nace.macroServices:regionSouth #> 1 604 760# Thus we are missing the 4 'nace.macro' totals for 'region' level "North". # Everything goes as if R contrasts functions mistakenly treated the term # emp.num:nace.macro as a factor-factor interaction (i.e. a 2 way joint # distribution), which would have justified to eliminate the 4 missing totals # as redundant. # Notice that calibrating on pop2 would generate wrong results... cal2<-e.calibrate(sbsdes,pop2) # ...indeed the 4 estimates of 'nace.macro' for 'region' level "North" are not # actually calibrated (look at the magnitude of SE estimates): svystatTM(cal2,~region,~nace.macro)#> nace.macro Total.regionNorth Total.regionCenter Total.regionSouth #> Agriculture Agriculture 283.3861 61 114 #> Industry Industry 5078.5898 2290 1925 #> Commerce Commerce 1902.6949 554 604 #> Services Services 3109.4091 636 760 #> SE.Total.regionNorth SE.Total.regionCenter SE.Total.regionSouth #> Agriculture 0.2837772 1.048106e-14 2.898936e-14 #> Industry 0.8421128 5.905246e-13 7.752573e-14 #> Commerce 1.3567159 1.164382e-14 1.732362e-14 #> Services 0.3829344 4.196158e-14 6.501151e-15################################################################ # A possible countermeasure (still working with contrasts ON). # ################################################################ # Empirical evidence tells that the weird case above is extremely rare # and that it manifests whenever a numeric (say X) and a factor (say F) both # interact with the same factor (say D), i.e. calmodel=~(X+F):D-1. # # The risky order-dependent nature of such models can be sterilized (while # still taking advantage of contrasts-driven simplifications for large, # complex calibrations) by using a numeric variable with values 1 for # all sample units. # # For instance, one could use variable 'ent' in the sbs data.frame, to # handle the (A) part of the calibration constraints. Indeed you may easily # verify that both the calmodel formulae below: # (i) calmodel = ~ent:region:nace.macro + emp.num:nace.macro - 1 # (ii) calmodel = ~emp.num:nace.macro + ent:region:nace.macro - 1 # # produce exactly the same, right results. ################################################################## # THE ULTIMATE, 100% SAFE, COUNTERMEASURE: switch contrasts OFF! # ################################################################## # No contrasts means no model-matrix simplifications at all, hence # also no unwanted, wrong simplifications. Let's see: # To switch off contrasts simply call: contrasts.off()#> #> # Contrasts treatment has been switched off: #> #> $contrasts #> unordered ordered #> "contr.off" "contr.off" #># Compare again, with contrasts OFF, the calibration models: # (i) calmodel = ~region:nace.macro + emp.num:nace.macro - 1 # (ii) calmodel = ~emp.num:nace.macro + region:nace.macro - 1 # Build and fill the population totals templates: temp1<-pop.template(data=sbsdes,~region:nace.macro+emp.num:nace.macro-1) pop1<-fill.template(universe=sbs.frame,template=temp1)#> #> # Coherence check between 'universe' and 'template': OK #>pop1#> regionNorth:nace.macroAgriculture regionCenter:nace.macroAgriculture #> 1 283 61 #> regionSouth:nace.macroAgriculture regionNorth:nace.macroIndustry #> 1 114 5080 #> regionCenter:nace.macroIndustry regionSouth:nace.macroIndustry #> 1 2290 1925 #> regionNorth:nace.macroCommerce regionCenter:nace.macroCommerce #> 1 1902 554 #> regionSouth:nace.macroCommerce regionNorth:nace.macroServices #> 1 604 3109 #> regionCenter:nace.macroServices regionSouth:nace.macroServices #> 1 636 760 #> nace.macroAgriculture:emp.num nace.macroIndustry:emp.num #> 1 17168 579580 #> nace.macroCommerce:emp.num nace.macroServices:emp.num #> 1 69525 318121temp2<-pop.template(data=sbsdes,~emp.num:nace.macro+region:nace.macro-1) pop2<-fill.template(universe=sbs.frame,template=temp2)#> #> # Coherence check between 'universe' and 'template': OK #>pop2#> emp.num:nace.macroAgriculture emp.num:nace.macroIndustry #> 1 17168 579580 #> emp.num:nace.macroCommerce emp.num:nace.macroServices #> 1 69525 318121 #> nace.macroAgriculture:regionNorth nace.macroIndustry:regionNorth #> 1 283 5080 #> nace.macroCommerce:regionNorth nace.macroServices:regionNorth #> 1 1902 3109 #> nace.macroAgriculture:regionCenter nace.macroIndustry:regionCenter #> 1 61 2290 #> nace.macroCommerce:regionCenter nace.macroServices:regionCenter #> 1 554 636 #> nace.macroAgriculture:regionSouth nace.macroIndustry:regionSouth #> 1 114 1925 #> nace.macroCommerce:regionSouth nace.macroServices:regionSouth #> 1 604 760#> [1] 1 16dim(pop2)#> [1] 1 16# Verify they lead to right calibrated objects... cal1<-e.calibrate(sbsdes,pop1) cal2<-e.calibrate(sbsdes,pop2) # ...with the same calibrated weights: all.equal(weights(cal2),weights(cal1))#> [1] TRUE# Lastly set back contrasts to ReGenesees default: contrasts.RG()#> #> # Standard ReGenesees contrasts treatment has been set: #> #> $contrasts #> unordered ordered #> "contr.treatment" "contr.treatment" #>