kottcalibrate.Rd
Adds to a kott.design
object the calibrated weights columns (one for each replicate weight, plus one for the initial weights).
kottcalibrate(deskott, df.population, calmodel = if (inherits(df.population, "pop.totals")) attr(df.population, "calmodel"), partition = if (inherits(df.population, "pop.totals")) attr(df.population, "partition") else FALSE, calfun = c("linear", "raking", "logit"), bounds = c(-Inf, Inf), aggregate.stage = NULL, maxit = 50, epsilon = 1e-07, force.rep = FALSE)
deskott | Object of class |
---|---|
df.population | Data frame containing the known population totals for the auxiliary variables. |
calmodel | Formula defining the linear structure of the calibration model. |
partition | Formula specifying the variables that define the "calibration domains" for the model (see 'Details'); |
calfun |
|
bounds | Allowed range for the ratios between calibrated and initial weights; the default is |
aggregate.stage | An integer: if specified, causes the calibrated weights to be constant within sampling units at this stage. |
maxit | Maximum number of iterations for the Newton-Raphson algorithm; the default is |
epsilon | Tolerance for the relative differences between the population totals and the corresponding estimates based on the claibrated weights; the default is |
force.rep | If |
This function creates an object of class kott.cal.design
. A kott.cal.design
object is made up by the union of the (calibrated) replicated survey data and the metadata describing the sampling design. kott.cal.design
objects make it possible to estimate the variance of calibration estimators [Deville, Sarndal 92] using the extended "Delete-A-Group Jackknife" method [Kott 2008].
The mandatory argument calmodel
symbolically defines the calibration model you want to use, that is - in the language of the generalised regression estimator - the assisting linear regression model underlying the calibration problem [Wilkinson, Rogers 73]. More specifically, the calmodel
formula identifies the auxiliary variables and the constraints for the calibration problem. For example, calmodel=~(X+Z):C+(A+B):D
defines the calibration problem in which constraints are imposed: (i) on the auxiliary (quantitative) variables X
and Z
within the subpopulations identified by the (qualitative) classification variable C
and, at the same time, (ii) on the absolute frequency of the (qualitative) variables A
and B
within the subpopulations identified by the (qualitative) classification variable D
.
The deskott
variables referenced by calmodel
must be numeric
or factor
and must not contain any missing value (NA
).
Problems for which one or more qualitative variables can be "factorised" in the formula that specifies the calibration model, are particularly interesting. These variables split the population into non-overlapping subpopulations known as "calibration domains" for the model. An example is provided by the statement calmodel=~(A+B+X+Z):D
in which the variable that identifies the calibration domains is D
; similarly, the formula calmodel=~(A+B+X+Z):D1:D2
identifies as calibration domains the subpopulations determined by crossing the modalities of D1
and D2
. The interest in models of this kind lies in the fact that the global calibration problem they describe can, actually, be broken down into local subproblems, one per calibration domain, which can be solved separately [Vanderhoeft 01]. Thus, for example, the global problem defined by calmodel=~(A+B+X+Z):D
is equivalent to the sequence of problems defined by the "reduced model" calmodel=~A+B+X+Z
in each of the domains identified by the modalities of D
. The opportunity to separately solve the subproblems related to different calibration domains achieves a significant reduction in computation complexity: the gain increases with increasing survey data size and (most importantly) with increasing auxiliary variables number.
The optional argument partition
makes it possible to choose, in cases in which the calibration problem can be factorised, whether to solve the problem globally or iteratively (that is, separately for each calibration domain). The global solution (which is the default option) can be selected invoking the kottcalibrate
function with partition=FALSE
. To request the iterative solution - a strongly recommended option when dealing with a lot of auxiliary variables and big data sizes - it is necessary to specify via partition
the variables defining the calibration domains for the model. If a formula is passed through the partition
argument (for example: partition=~D1:D2
), the program checks that calmodel
actually describes a "reduced model" (for example: calmodel=~X+Z+A+B
), that is it does not reference any of the partition variables; if this is not the case, the program stops and prints an error message.
The deskott
variables referenced by partition
(if any) must be factor
and must not contain any missing value (NA
).
The mandatory argument df.population
is used to specify the known totals of the auxiliary variables referenced by calmodel
within the subpopulations (if any) identified by partition
. These known totals must be stored in a data frame whose structure (i) depends on the values of calmodel
and partition
and (ii) must conform to a standard. In order to facilitate understanding of and compliance with this standard, the EVER package provides the user with two functions: pop.template
and population.check
. The pop.template
function is able to guide the user in constructing the known totals data frame for a specific calibration problem, while the population.check
function allows to check whether a known totals data frame conforms to the standard required by kottcalibrate
. In any case, if the df.population
data frame does not comply with the standard, the kottcalibrate
function stops and prints an error message: the meaning of the message should help the user diagnose the cause of the problem.
The calfun
argument identifies the distance function to be used in the calibration process. Three built-in functions are provided: "linear"
, "raking"
, and "logit"
. The default is "linear"
, which corresponds to the euclidean metric.
The bounds
argument allows to add "range constraints" to the calibration problem. To be precise, the interval defined by bounds
will contain the values of the ratios between final (calibrated) and initial (direct) weights. The default value is c(-Inf,Inf)
, i.e. no range constraints are imposed. These constraints are optional unless the "logit"
function is selected: in the latter case the range defined by bounds
has to be finite.
The value passed by the aggregate.stage
argument must be an integer between 1 and the number of sampling stages of deskott
. If specified, causes the calibrated weights to be constant within sampling units selected at the aggregate.stage
stage (actually this is only ensured if the initial weights had already this property, as is sometimes the case in multistage cluster sampling). If not specified, the calibrated weights may differ even for sampling units with identical initial weights. The same holds if some final units belonging to the same cluster selected at the stage aggregate.stage
fall in distinct calibration domains (i.e. if the domains defined by partition
"cut across" the aggregate.stage
-stage clusters).
The maxit
argument sets the maximum number of iteration for the Newton-Raphson algorithm that is used to solve the calibration problem. The default value is 50
.
The epsilon
argument determines the convergence criterion for the optimisation algorithm: it fixes the maximum allowed value for the relative differences between the population totals and the corresponding estimates based on the claibrated weights. The default value is 10^-7
.
If the number of replicates for deskott
(the input object of class kott.design
) is nrg
, the function kottcalibrate
is in charge of solving nrg+1
distinct calibration problems. In fact, the calibrated weights calculated by kottcalibrate
must ensure that the known population totals are exactly reproduced not only by the original sample, but also by all its nrg
replicates. Should this requirement fail, the DAGJK method would end up with a biased variance estimator [Kott 2008]. It is, however, possible (more likely when range constraints are imposed) that, for some of the nrg+1
distinct calibration problems and for the given values of epsilon
and maxit
, the solving algorithm does not converge. In this case kottcalibrate
by default stops and prints an error message. On the contrary if force.rep = TRUE
, provided that the failure to converge pertains only to the replicate weights, the function is forced to return the best approximation achieved for the corresponding calibrated weights. When this occurs, DAGJK standard errors estimates built on the object returned by kottcalibrate
will be biased.
If the number of replicates for deskott
is nrg
, the function kottcalibrate
is in charge of solving nrg+1
distinct calibration problems. When, dealing with a factorisable calibration problem, the user selects the iterative solution, each one of the above mentioned problems is split into as many sub-problems as the number of subpopulations defined by partition
. A calibration process with such a complex structure needs some ad hoc tool for error diagnostics. For this purpose, every call to kottcalibrate
creates, by side effect, a dedicated data structure named kottcal.status
into the .GlobalEnv
. kottcal.status
is a list with two components: the first, "call"
, identifies the call to kottcalibrate
that generated the list, the second, return.code
, is a matrix each element of which identifies the return code of a specific calibration sub-problem. The meaning of the return codes is as follows:
-1
not yet tackled sub-problem;
0
solved sub-problem (convergence achieved);
1
unsolved sub-problem (no convergence): output forced.
Recall that the latter return code may only occur if force.rep = TRUE
.
In case of error, users can exploit kottcal.status
to identify the sub-problem from which the error stemmed, hence taking a step forward to eliminate it.
An object of class kott.cal.design
. The data frame it contains includes (in addition to the data already stored in deskott
) the calibrated weights columns (one for each replicate weight, plus one for the initial weights, nrg+1
in all). The names of these columns are obtained by pasting the name of the initial weights column with the string ".cal"
and the indices NULL, 1, 2, …, nrg
.
The kott.cal.design
class is a specialisation of the kott.design
class; this means that an object created by kottcalibrate
inherits from the data.frame
class and you can use on it every method defined on that class.
Deville, J.C., Sarndal, C.E. (1992) "Calibration Estimators in Survey Sampling", Journal of the American Statistical Association, Vol. 87, No. 418, pp.376-382.
Kott, Phillip S. (2008) "Building a Better Delete-a-Group Jackknife for a Calibration Estimator", NASS Research Report, NASS: Washington, DC.
Wilkinson, G.N., Rogers, C.E. (1973) "Symbolic Description of Factorial Models for Analysis of Variance", Journal of the Royal Statistical Society, series C (Applied Statistics), Vol. 22, pp. 181-191.
Vanderhoeft, C. (2001) "Generalized Calibration at Statistic Belgium", Statistics Belgium Working Paper n. 3, http://www.statbel.fgov.be/studies/paper03_en.asp.
Lumley, T. (2006) "survey: analysis of complex survey samples", http://cran.at.r-project.org/web/packages/survey/index.html.
Scannapieco, M., Zardetto, D., Barcaroli, G. (2007) "La Calibrazione dei Dati con R: una Sperimentazione sull'Indagine Forze di Lavoro ed un Confronto con GENESEES/SAS", Contributi Istat n. 4., https://www.istat.it/it/files//2018/07/2007_4.pdf.
desc
for a concise description of kott.design
objects, kottby
, kott.ratio
, kott.regcoef
, kott.quantile
and kottby.user
for calculating estimates and standard errors, pop.template
for constructing known totals data frames in compliance with the standard required by kottcalibrate
, population.check
to check that the known totals data frame satisfies that standard, bounds.hint
to obtain an hint for range restricted calibration.
# Calibration of a kott.design object according to different calibration # models (the known totals data frames pop01, \ldots, pop05p and the bounds # vector are contained in the data.examples file). # For the examples relating to calibration models that can be factorised # both a global and an iterative solution are given. data(data.examples) # Creation of the object to be calibrated: kdes<-kottdesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM, weights=~weight,nrg=15) # 1) Calibration on the total number of units in the population # (totals in pop01): kdescal01<-kottcalibrate(deskott=kdes,df.population=pop01,calmodel=~1, calfun="logit",bounds=bounds,aggregate.stage=2) # Checking the result (the function 'ones' is contained # in the data.examples file): kottby.user(kdescal01,user.estimator=ones)#> $estimate #> [1] 918300 #> #> $SE #> [1] 0.004524931 #># 2) Calibration on the marginal distributions of sex and marstat # (totals in pop02): kdescal02<-kottcalibrate(deskott=kdes,df.population=pop02, calmodel=~sex+marstat-1,calfun="logit",bounds=bounds, aggregate.stage=2) # Checking the result: kottby(kdescal02,~sex+marstat)#> $sex #> total SE #> sex.f 468366 0.003553678 #> sex.m 449934 0.007568289 #> #> $marstat #> total SE #> marstat.married 532376 0.009008087 #> marstat.unmarried 311996 0.002435166 #> marstat.widowed 73928 0.001678797 #># 3) Calibration (global solution) on the joint distribution of sex # and marstat (totals in pop03): kdescal03<-kottcalibrate(deskott=kdes,df.population=pop03, calmodel=~marstat:sex-1,calfun="logit",bounds=bounds) # Checking the result: kottby(kdescal03,~sex,~marstat) # or: kottby(kdescal03,~marstat,~sex)#> $married #> total SE #> sex.f 272172 0.001314645 #> sex.m 260204 0.007915787 #> #> $unmarried #> total SE #> sex.f 159280 0.0076358 #> sex.m 152716 0.0053101 #> #> $widowed #> total SE #> sex.f 36914 0.004295206 #> sex.m 37014 0.001490132 #># which, obviously, is not respected by kdescal02 (notice the size of SE): kottby(kdescal02,~sex,~marstat)#> $married #> total SE #> sex.f 272101.5 4430.701 #> sex.m 260274.5 4430.700 #> #> $unmarried #> total SE #> sex.f 159235.8 5535.492 #> sex.m 152760.2 5535.491 #> #> $widowed #> total SE #> sex.f 37028.63 2534.089 #> sex.m 36899.37 2534.089 #># 3.1) Again a calibration on the joint distribution of sex and marstat # but, this time, with the iterative solution (partition=~sex, # totals in pop03p): kdescal03p<-kottcalibrate(deskott=kdes,df.population=pop03p, calmodel=~marstat-1,partition=~sex,calfun="logit", bounds=bounds) # Checking the result: kottby(kdescal03p,~sex,~marstat)#> $married #> total SE #> sex.f 272172 0.001314645 #> sex.m 260204 0.008027662 #> #> $unmarried #> total SE #> sex.f 159280 0.009375013 #> sex.m 152716 0.005521055 #> #> $widowed #> total SE #> sex.f 36914 0.004295206 #> sex.m 37014 0.002508294 #># 4) Calibration (global solution) on the totals for the quantitative # variables x1, x2 and x3 in the subpopulations defined by the # regcod variable (totals in pop04): kdescal04<-kottcalibrate(deskott=kdes,df.population=pop04, calmodel=~(x1+x2+x3-1):regcod,calfun="logit", bounds=bounds,aggregate.stage=2) # Checking the result: kottby(kdescal04,~x1+x2+x3,~regcod)#> $x1 #> 6 7 10 #> total 18403 22484 13726 #> SE 8.719017e-08 0.003661244 6.957057e-05 #> #> $x2 #> 6 7 10 #> total 5870 7557 4884 #> SE 0.0005060036 1.495426e-05 6.312759e-08 #> #> $x3 #> 6 7 10 #> total 6525 8092 5659 #> SE 0.0003954996 0.0003854333 6.295e-05 #># 4.1) Same problem with the iterative solution (partition=~regcod, # totals in pop04p): kdescal04p<-kottcalibrate(deskott=kdes,df.population=pop04p, calmodel=~x1+x2+x3-1,partition=~regcod,calfun="logit", bounds=bounds,aggregate.stage=2) # Checking the result: kottby(kdescal04p,~x1+x2+x3,~regcod)#> $x1 #> 6 7 10 #> total 18403 22484 13726 #> SE 8.719017e-08 0.003473024 7.016744e-05 #> #> $x2 #> 6 7 10 #> total 5870 7557 4884 #> SE 0.0005537555 1.44469e-05 4.93151e-08 #> #> $x3 #> 6 7 10 #> total 6525 8092 5659 #> SE 0.0003681689 0.000367463 6.294953e-05 #># 5) Calibration (global solution) on the total for the quantitative # variable x1 and on the marginal distribution of the qualitative # variable age5c, in the subpopulations defined by crossing sex # and marstat (totals in pop05): kdescal05<-kottcalibrate(deskott=kdes,df.population=pop05, calmodel=~(age5c+x1-1):sex:marstat,calfun="logit", bounds=bounds,force.rep=TRUE) # Calibration process diagnostics: kottcal.status#> $call #> kottcalibrate(deskott = kdes, df.population = pop05, calmodel = ~(age5c + #> x1 - 1):sex:marstat, calfun = "logit", bounds = bounds, force.rep = TRUE) #> #> $return.code #> global #> original 0 #> replicate.1 0 #> replicate.2 0 #> replicate.3 0 #> replicate.4 0 #> replicate.5 0 #> replicate.6 0 #> replicate.7 0 #> replicate.8 0 #> replicate.9 0 #> replicate.10 0 #> replicate.11 0 #> replicate.12 0 #> replicate.13 0 #> replicate.14 0 #> replicate.15 0 #>#> $age5c #> $age5c$f.married #> total SE #> age5c.1 11023 3.586120e-07 #> age5c.2 92213 1.631380e-06 #> age5c.3 114859 2.213937e-06 #> age5c.4 47614 8.861208e-07 #> age5c.5 6463 1.887778e-07 #> #> $age5c$m.married #> total SE #> age5c.1 7770 3.405498e-08 #> age5c.2 83894 6.112836e-07 #> age5c.3 122823 8.619858e-07 #> age5c.4 36003 3.571255e-07 #> age5c.5 9714 6.836014e-08 #> #> $age5c$f.unmarried #> total SE #> age5c.1 56857 5.824452e-09 #> age5c.2 45467 1.101849e-08 #> age5c.3 46391 1.202950e-08 #> age5c.4 9573 4.547046e-08 #> age5c.5 992 3.122428e-05 #> #> $age5c$m.unmarried #> total SE #> age5c.1 49705 2.569336e-08 #> age5c.2 50453 4.327043e-08 #> age5c.3 39974 3.159618e-08 #> age5c.4 12122 5.363742e-08 #> age5c.5 462 3.761207e-05 #> #> $age5c$f.widowed #> total SE #> age5c.1 2099 2.931135e-06 #> age5c.2 8122 1.846965e-06 #> age5c.3 16437 1.481764e-05 #> age5c.4 8574 2.317668e-06 #> age5c.5 1682 4.092174e-05 #> #> $age5c$m.widowed #> total SE #> age5c.1 561 1.338078e-05 #> age5c.2 12013 1.009192e-05 #> age5c.3 14556 1.320804e-05 #> age5c.4 8024 1.833614e-05 #> age5c.5 1860 1.031180e-07 #> #> #> $x1 #> f.married m.married f.unmarried m.unmarried f.widowed #> total 14750 18171 8446 8210 2737 #> SE 5.273801e-06 1.923902e-06 1.820326e-08 1.346842e-07 1.89757e-05 #> m.widowed #> total 2298 #> SE 2.971892e-05 #># 5.1) Same problem with the iterative solution (partition=~sex:marstat, # totals in pop05p): kdescal05p<-kottcalibrate(deskott=kdes,df.population=pop05p, calmodel=~age5c+x1-1,partition=~sex:marstat, calfun="logit",bounds=bounds,force.rep=TRUE) # Calibration process diagnostics: kottcal.status#> $call #> kottcalibrate(deskott = kdes, df.population = pop05p, calmodel = ~age5c + #> x1 - 1, partition = ~sex:marstat, calfun = "logit", bounds = bounds, #> force.rep = TRUE) #> #> $return.code #> f.married f.unmarried f.widowed m.married m.unmarried m.widowed #> original 0 0 0 0 0 0 #> replicate.1 0 0 0 0 0 0 #> replicate.2 0 0 0 0 0 0 #> replicate.3 0 0 0 0 0 0 #> replicate.4 0 0 0 0 0 0 #> replicate.5 0 0 0 0 0 0 #> replicate.6 0 0 0 0 0 0 #> replicate.7 0 0 0 0 0 0 #> replicate.8 0 0 0 0 0 0 #> replicate.9 0 0 0 0 0 0 #> replicate.10 0 0 0 0 0 0 #> replicate.11 0 0 0 0 0 0 #> replicate.12 0 0 0 0 0 0 #> replicate.13 0 0 0 0 0 0 #> replicate.14 0 0 0 0 0 0 #> replicate.15 0 0 0 0 0 0 #>#> $age5c #> $age5c$f.married #> total SE #> age5c.1 11023 6.251745e-05 #> age5c.2 92213 5.402141e-04 #> age5c.3 114859 3.276265e-04 #> age5c.4 47614 2.709589e-04 #> age5c.5 6463 1.052340e-05 #> #> $age5c$m.married #> total SE #> age5c.1 7770 0.0007016523 #> age5c.2 83894 0.0001238189 #> age5c.3 122823 0.0001688175 #> age5c.4 36003 0.0001219403 #> age5c.5 9714 0.0004550047 #> #> $age5c$f.unmarried #> total SE #> age5c.1 56857 3.901676e-04 #> age5c.2 45467 2.462598e-04 #> age5c.3 46391 1.929844e-04 #> age5c.4 9573 6.562936e-04 #> age5c.5 992 3.142646e-05 #> #> $age5c$m.unmarried #> total SE #> age5c.1 49705 5.342956e-04 #> age5c.2 50453 8.592134e-04 #> age5c.3 39974 5.908477e-04 #> age5c.4 12122 7.520150e-04 #> age5c.5 462 3.761208e-05 #> #> $age5c$f.widowed #> total SE #> age5c.1 2099 2.931134e-06 #> age5c.2 8122 1.925998e-06 #> age5c.3 16437 1.528427e-05 #> age5c.4 8574 2.388789e-06 #> age5c.5 1682 4.092174e-05 #> #> $age5c$m.widowed #> total SE #> age5c.1 561 1.338078e-05 #> age5c.2 12013 1.009192e-05 #> age5c.3 14556 1.320804e-05 #> age5c.4 8024 1.833614e-05 #> age5c.5 1860 1.031180e-07 #> #> #> $x1 #> f.married m.married f.unmarried m.unmarried f.widowed #> total 14750 18171 8446 8210.001 2737 #> SE 0.0009335519 0.0004246752 0.0005365315 0.002536905 1.959062e-05 #> m.widowed #> total 2298 #> SE 2.971892e-05 #># Notice that 3.1 e 5.1) do not impose the aggregate.stage=2 # condition. This condition cannot, in fact, be fulfilled because # in both cases the domains defined by partition "cut across" # the kdes second stage clusters (households). To compare the results, # the same choice was also made for 3) e 5).