kottby.user.Rd
Calculates estimates, standard errors and confidence intervals for user-defined estimators (even non-analytic) in subpopulations.
kottby.user(deskott, by = NULL, user.estimator, na.replace = NULL,
            vartype = c("se", "cv", "cvpct", "var"), conf.int = FALSE,
            conf.lev = 0.95, df = attr(deskott, "nrg") - 1, ...)
global(deskott)
deskott | Object of class |
---|---|
by | Formula specifying the variables that define the "estimation domains". If |
user.estimator | R function to compute the value of the desired estimator on the original survey sample (see also 'Details' and 'Defining a user estimator function'). |
na.replace | Value to be used to replace any |
vartype |
|
conf.int | Boolean ( |
conf.lev | Probability specifying the desired confidence level: the default value is |
df | Degrees of freedom for the t distribution used to build confidence intervals (see 'Details'). |
… | Additional parameters (if any) to be passed to the |
The kottby.user function is designed to fully exploit the versatility of the DAGJK [Kott 99-01] replication method. It provides a user-friendly tool for calculating estimates, standard errors and confidence intervals for estimators defined by the users themselves. Weighted estimates for the "user-defined estimator" are computed using weights that depend on the class of deskott: calibrated weights for class kott.cal.design and direct weights otherwise.
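The replication idea can be sketched in base R. The following is a deliberately simplified delete-a-group jackknife that ignores stratification and multistage sampling; it is not the code used by kottby.user. The data frame toy, the estimator w_total and the rescaling factor nrg/(nrg - 1) are assumptions made for illustration only.

```r
# Simplified delete-a-group jackknife (no strata; illustrative only).
set.seed(1)
n   <- 60
nrg <- 15                                    # number of random groups
toy <- data.frame(y = rgamma(n, shape = 2, scale = 10),
                  w = rep(100, n))           # hypothetical direct weights
toy$rg <- sample(rep(seq_len(nrg), length.out = n))  # random group labels

# Hypothetical estimator: weighted total of y; 'w' is the weights column name
w_total <- function(d, w) sum(d[, w] * d[, "y"])

# Estimate on the full sample
theta_full <- w_total(toy, "w")

# Replicate r: zero the weights of group r, inflate the others by nrg/(nrg - 1)
theta_rep <- vapply(seq_len(nrg), function(r) {
  d <- toy
  d$w <- ifelse(d$rg == r, 0, d$w * nrg / (nrg - 1))
  w_total(d, "w")
}, numeric(1))

# Jackknife variance: (nrg - 1)/nrg times the sum of squared deviations
var_dagjk <- (nrg - 1) / nrg * sum((theta_rep - theta_full)^2)
se_dagjk  <- sqrt(var_dagjk)
```

Because the estimator is only ever evaluated on reweighted copies of the data, the same loop works for any function of the data and the weights, which is what makes non-analytic estimators tractable.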
The optional argument by specifies the variables that define the "estimation domains", that is, the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by kottby.user refer to the whole population. Estimation domains must be defined by a formula: for example, the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. The deskott variables referenced by by (if any) must be factors and must not contain any missing value (NA).
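What "crossing the modalities" means can be illustrated without the package. The sketch below uses base R only; the data frame toy and the estimator w_total are hypothetical, and interaction()/split() merely mimic the domain construction, not the internals of kottby.user.

```r
# Toy data: two factors defining domains, one study variable, one weight column
toy <- data.frame(
  B1 = factor(c("a", "a", "b", "b", "a", "b")),
  B2 = factor(c("x", "y", "x", "y", "x", "y")),
  y  = c(10, 20, 30, 40, 50, 60),
  w  = c(2, 2, 1, 1, 3, 3)
)

# Domain variables must be factors without missing values
stopifnot(is.factor(toy$B1), is.factor(toy$B2),
          !anyNA(toy$B1), !anyNA(toy$B2))

# Hypothetical estimator: weighted total of y
w_total <- function(d, w) sum(d[, w] * d[, "y"])

# Cross B1 and B2, then evaluate the estimator within each domain
domains   <- interaction(toy$B1, toy$B2, drop = TRUE)
by_domain <- sapply(split(toy, domains), w_total, w = "w")
```

Since the domains partition the sample, the domain totals in by_domain add up to the total computed on the whole data frame.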
The mandatory argument user.estimator specifies the calculation method for the "user-defined estimator". More precisely, the value bound to the formal argument user.estimator must be a function (an R object of class function, possibly anonymous) able to compute the value of the required estimator on the sample data frame contained in deskott. The return value of the user.estimator function need not be a single numerical value (it can be a vector, a matrix, an array, …). In any case, it will be tacitly coerced to array by kottby.user. More detailed indications on how the user.estimator function must be constructed can be found in the 'Defining a user estimator function' section below.
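As an illustration of a non-scalar return value, the sketch below defines a hypothetical estimator means2 that returns a named vector, and applies as.array to it. The column names y1, x1 and w are assumptions, and the explicit as.array call only mimics the coercion described above.

```r
# Hypothetical estimator returning the weighted means of two variables
means2 <- function(d, w) {
  c(y1 = sum(d[, w] * d[, "y1"]) / sum(d[, w]),
    x1 = sum(d[, w] * d[, "x1"]) / sum(d[, w]))
}

toy <- data.frame(y1 = c(1, 2, 3), x1 = c(10, 20, 30), w = c(1, 1, 2))

# Coerce the result to an array, as described for kottby.user
est <- as.array(means2(toy, "w"))
dim(est)   # one-dimensional array of length 2
```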
The optional argument na.replace makes it possible to specify a value to be used to replace any missing values generated by user.estimator in the kottby.user function output. By default na.replace=NULL and the missing values are returned as NAs.
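The effect of na.replace can be pictured as a plain substitution on the output. The snippet below is a sketch on a hypothetical vector of domain estimates, not the actual code path of kottby.user.

```r
# Hypothetical domain estimates, one of which is missing
est <- c(d1 = 12.5, d2 = NA, d3 = 7.1)

na.replace <- 0   # value chosen by the user; NULL would leave NAs untouched
if (!is.null(na.replace)) {
  est[is.na(est)] <- na.replace
}
```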
The conf.int argument allows the user to request confidence intervals for the estimates. By default conf.int=FALSE, that is, confidence intervals are not provided.
Whenever confidence intervals are requested (i.e. conf.int=TRUE), the desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must represent a probability (0<=conf.lev<=1) and its default is 0.95.
Given an input kott.design object with nrg random groups, by default kottby.user builds the confidence intervals using a t distribution with nrg-1 degrees of freedom, since the argument df has a default value of nrg-1. Notice, however, that this default value should be used only when the user-defined function user.estimator estimates a univariate parameter of interest. As an example, if user.estimator were designed to estimate regression coefficients for a multiple linear regression with p predictors and no intercept, the right choice would be df = nrg-p.
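A t-based interval of this kind can be reproduced by hand. The sketch below assumes the usual symmetric form (estimate plus or minus the t quantile times the standard error), which is an assumption about the interval's shape rather than a statement about the package internals. The estimate and standard error are those printed for the 'ratio' estimator in the 'Examples' section, obtained with nrg=15.

```r
est      <- 6.638191    # point estimate from the 'ratio' example
se       <- 0.6660701   # its standard error
nrg      <- 15          # random groups of the kott.design object
conf.lev <- 0.95
df       <- nrg - 1     # default degrees of freedom (univariate parameter)

# Two-sided critical value of the t distribution
t_crit <- qt(1 - (1 - conf.lev) / 2, df = df)

ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
```

For a regression with p predictors and no intercept, the same computation would use df <- nrg - p.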
The special argument … (dot-dot-dot) allows the user to specify additional parameters to be passed to the user-defined user.estimator function.
In order to be correctly invoked by kottby.user, the function that codifies the "user-defined estimator" must comply with specific syntactical restrictions. On the other hand, there is no constraint (at least in principle) on the semantics of the function, that is, on "what it calculates".
The fundamental constraint is that the function's formal argument list meets some minimal requirements. Suppose, for simplicity, that the function bound to the user.estimator formal argument is named user.estfun; then its structure must necessarily be of the following type:
user.estfun=function(data, weights, etc){body}    [1]
The structure [1] has to be interpreted as follows: the body of user.estfun must contain all the instructions needed to compute the required estimator on the sample data contained in the data data frame, using the weights contained in its weights column. The "etc" symbol in [1] represents any other formal arguments of user.estfun, whose actual values can be specified, when invoking kottby.user, through its special argument … (dot-dot-dot).
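Structure [1] and the forwarding of extra arguments can be sketched as follows. The estimator wmean and the caller apply_estimator are hypothetical; apply_estimator is only a stand-in showing how a caller may pass its dot-dot-dot argument on to the estimator, and it is not how kottby.user is actually implemented.

```r
# Hypothetical estimator following structure [1]:
# 'data' is the sample data frame, 'weights' the name of its weights column,
# 'var' and 'na.rm' play the role of the "etc" arguments.
wmean <- function(data, weights, var, na.rm = FALSE) {
  y <- data[, var]
  w <- data[, weights]
  if (na.rm) {
    keep <- !is.na(y)
    y <- y[keep]
    w <- w[keep]
  }
  sum(w * y) / sum(w)
}

# Hypothetical stand-in for the caller: extra arguments travel through '...'
apply_estimator <- function(data, weights, user.estimator, ...) {
  user.estimator(data, weights, ...)
}

toy <- data.frame(income = c(100, NA, 300), w = c(1, 1, 2))
res <- apply_estimator(toy, "w", wmean, var = "income", na.rm = TRUE)
```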
Sometimes users may need to employ "global" quantities in the body of the user.estfun function, that is, quantities that, even when dealing with sub-population estimates, should not be re-calculated within the sub-populations themselves (re-calculation being the standard kottby.user behaviour). This need is met by the global function: wherever the need arises, the user only has to reference the user.estfun input data frame through the expression global(data) rather than the standard data.
The global function only accepts kott.design class objects and can only be used within functions invoked by user.estfun. An example that clearly illustrates the utility of global is provided by the calculation of poverty estimates (see the poverty function documented in the 'Examples' section below).
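The difference between a "global" and a "local" quantity can be reproduced on a toy data set without the package. In the sketch below all names (toy, wmean, pct_below, global_th) are hypothetical, and the global threshold is simply computed once on the whole data frame, which imitates the purpose of global but not its mechanism.

```r
toy <- data.frame(income = c(10, 20, 30, 100, 340),
                  w      = c(1, 1, 1, 1, 1))

# Weighted mean and weighted percentage below a threshold
wmean     <- function(d, w, y) sum(d[, w] * d[, y]) / sum(d[, w])
pct_below <- function(d, w, y, th) 100 * sum(d[d[, y] < th, w]) / sum(d[, w])

# Poverty threshold computed once on the whole sample ("global" quantity)
global_th  <- 0.6 * wmean(toy, "w", "income")
toy$status <- factor(ifelse(toy$income < global_th, "poor", "not-poor"))

groups <- split(toy, toy$status)

# Threshold re-computed inside each sub-population ("local" behaviour)
pct_local <- sapply(groups, function(d)
  pct_below(d, "w", "income", 0.6 * wmean(d, "w", "income")))

# Threshold kept fixed at its whole-sample value ("global" behaviour)
pct_global <- sapply(groups, function(d)
  pct_below(d, "w", "income", global_th))
```

With the fixed threshold every unit of the "poor" group falls below it and none of the "not-poor" group does, whereas the locally re-computed thresholds produce a positive share of units below the threshold in the "not-poor" group.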
The return value depends on the value of the input parameters. In the most general case, the function returns an object of class list (typically a list made up of data frames).
The freedom granted to the user in developing the user.estimator function has important consequences that are worth highlighting. The key point is that, since only the user knows the semantics of user.estimator, the user must vouch for its correct functioning. In particular:
(i) The kottby.user function must be able to invoke the user.estimator function on the deskott sample data frame and, if necessary, on its subsets defined by the by variables. Consequently, when developing the function, the user must make sure that the instructions in its body refer to variables that are actually contained in that data frame. The kottby.user caller function could perform this check only at the expense of limiting the user's freedom in constructing user.estimator;
(ii) In the same way, due to the user's freedom in developing user.estimator, the kottby.user function cannot prevent the generation of missing values in its output. The usefulness of the na.replace parameter must, therefore, be considered as purely "cosmetic".
Kott, Phillip S. (1999) "The Extended Delete-A-Group Jackknife". Bulletin of the International Statistical Institute. 52nd Session. Contributed Papers. Book 2, pp. 167-168.
Kott, Phillip S. (2001) "The Delete-A-Group Jackknife". Journal of Official Statistics, Vol.17, No.4, pp. 521-526.
kottby for estimating totals and means, kott.ratio for estimating ratios between totals, kott.quantile for estimating quantiles and kott.regcoef for estimating regression coefficients.
# Some examples of user-defined estimators and illustration
# of their use via kottby.user. Remember that R functions
# expressing user-defined estimators must comply with the
# condition indicated in [1]. The 3 functions that appear
# in the following examples ('ones', 'ratio' and 'poverty')
# are contained in the data.examples file.
# The 'poverty' function (also) illustrates the correct use
# of the 'global' function.

data(data.examples)

# Creation of a kott.design object:
kdes<-kottdesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
                 weights=~weight,nrg=15)

# 1) Estimator of the number of final units in the population.
#    Use the name 'ones' to refer to the R function that
#    expresses the estimator and define it as follows:

# ones <- function (d, w)
# ######################################
# # Number of final units estimator.   #
# ######################################
# {
#     sum(d[, w])
# }

# Now using kottby.user is easy, for instance:
kottby.user(kdes,user.estimator=ones)
#> $estimate
#> [1] 924101.3
#>
#> $SE
#> [1] 11810.13
#>

# 2) Estimator of ratios between totals (or means) for 2
#    quantitative variables. Use the name 'ratio' to refer
#    to the R function that expresses the estimator and
#    define it as follows (notice the use of the etc
#    arguments in [1]):

# ratio <- function (d, w, num, den)
# ###########################################
# # Ratio estimator for totals (or means)   #
# # of quantitative variables.              #
# ###########################################
# {
#     sum(d[, w] * d[, num])/sum(d[, w] * d[, den])
# }

# Calculating ratio estimates and standard errors
# is easy (notice the use of the ... argument
# of kottby.user):
kottby.user(kdes,user.estimator=ratio,num="y1",den="x1")
#> $estimate
#> [1] 6.638191
#>
#> $SE
#> [1] 0.6660701
#>

# 3) A non-analytic estimator: population percentage
#    with income below the poverty threshold (defined,
#    for the sake of simplicity, as 0.6 times the
#    average income for the whole population).
#    Call 'poverty' the estimator and define it as follows:

# poverty <- function (d, w, y, threshold)
# ####################################################################
# # Population percentage with income below the poverty threshold.   #
# # Suppose poverty threshold is defined as 0.6 times the average    #
# # income for the whole population.                                 #
# ####################################################################
# {
#     if (missing(threshold)) {
#         # if I do want to take into account the variance of the poverty
#         # threshold letting it be re-calculated replicate by replicate.
#         d.global = global(d)
#         th.value = 0.6 * sum(d.global[, w] * d.global[, y])/sum(d.global[, w])
#     }
#     else {
#         # if I do not want to take into account the variance of the poverty
#         # threshold, I will supply its point estimate to the 'threshold' argument.
#         th.value = threshold
#     }
#     est = 100 * sum(d[d[, y] < th.value, w])/sum(d[, w])
#     est
# }

# 3.1) First use: neglect the variance of the poverty threshold
#      and supply to 'threshold' (by means of the ... argument
#      of kottby.user) its point estimate obtained using kottby:
pov.line<-0.6*kottby(kdes,~income,estimator="mean")$mean
kottby.user(kdes,user.estimator=poverty,y="income",threshold=pov.line)
#> $estimate
#> [1] 11.83658
#>
#> $SE
#> [1] 0.8047257
#>

# 3.2) Second use: do take into account the variance of the poverty
#      threshold letting it be re-calculated replicate by replicate
#      (thus not supplying any actual value to 'threshold'):
kottby.user(kdes,user.estimator=poverty,y="income")
#> Error in eval(deskott.name, envir = last.env): object 'kdes' not found

# Notice that the standard error estimate for the 'poverty' estimator
# obtained in 3.2) cannot be calculated analytically by Taylor
# linearization.
# Notice the use of the 'global' function in the body of 'poverty':
# since the poverty status of each final unit depends on a global
# value (that is, the average income for the whole population)
# 'global' is used to prevent, whenever a sub-population poverty
# estimate is needed, this global value being calculated locally
# i.e. within the sub-population itself.
# In fact:
pov.line<-0.6*kottby(kdes,~income,estimator="mean")$mean
kdes2<-kott.addvars(kdes,pov.status=as.factor(ifelse(income<pov.line,
                    "poor","not-poor")))
kottby.user(kdes2,by=~pov.status,user.estimator=poverty,y="income")
#> Error in eval(deskott.name, envir = last.env): object 'kdes2' not found

# If the 'global' function were not used in 'poverty'
# the poverty threshold would be calculated relative to
# each individual sub-population:
poverty2 <- function (d, w, y, threshold)
###############################################
# Without relying on the 'global' function    #
###############################################
{
    if (missing(threshold)) {
        th.value = 0.6 * sum(d[, w] * d[, y])/sum(d[, w])
    }
    else {
        th.value = threshold
    }
    est = 100 * sum(d[d[, y] < th.value, w])/sum(d[, w])
    est
}

kottby.user(kdes2,by=~pov.status,user.estimator=poverty2,y="income")
#> $`not-poor`
#> $`not-poor`$estimate
#> [1] 3.749291
#>
#> $`not-poor`$SE
#> [1] 0.4795252
#>
#>
#> $poor
#> $poor$estimate
#> [1] 10.89201
#>
#> $poor$SE
#> [1] 1.42301
#>
#>

# This means that without 'global' a non-null fraction of poor units
# would paradoxically be estimated for the "not-poor" sub-population
# (and, conversely, a non-null fraction of not-poor units among the "poor").