Estimation for user-defined estimators

Calculates estimates, standard errors and confidence intervals for user-defined estimators (even non-analytic) in subpopulations.

kottby.user(deskott, by = NULL, user.estimator, na.replace = NULL,
            vartype = c("se", "cv", "cvpct", "var"),
            conf.int = FALSE, conf.lev = 0.95,
            df = attr(deskott, "nrg") - 1, ...)

global(deskott)

Arguments

deskott	Object of class `kott.design` containing the replicated survey data.
by	Formula specifying the variables that define the "estimation domains". If `NULL` (the default option) estimates refer to the whole population.
user.estimator	R function to compute the value of the desired estimator on the original survey sample (see also 'Details' and 'Defining a user estimator function').
na.replace	Value to be used to replace any `NA`s in the output estimates (see 'Details').
vartype	`character` vector specifying the desired variability estimators. It is possible to choose one or more of: standard error (the default), coefficient of variation, percent coefficient of variation, or variance.
conf.int	Boolean (`logical`) value to request confidence intervals for the estimates: the default is `FALSE`.
conf.lev	Probability specifying the desired confidence level: the default value is `0.95`.
df	Degrees of freedom for the t distribution used to build confidence intervals (see 'Details').
…	Additional parameters (if any) to be passed to the `user.estimator` function.
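The variability measures selectable through vartype are linked by simple identities. The sketch below is only illustrative: it takes the estimate and standard error reported for the ratio estimator in the 'Examples' section and derives the other measures by hand, assuming the usual definitions (whether kottby.user takes the absolute value of the estimate in the CV is an assumption).

```r
# Estimate and standard error as printed in the 'Examples' section
# for the 'ratio' estimator.
est <- 6.638191
se  <- 0.6660701

# Usual definitions of the remaining variability measures
# (assumed here, not taken from the package source).
variance <- se^2              # "var"
cv       <- se / abs(est)     # "cv"
cvpct    <- 100 * cv          # "cvpct"

round(c(se = se, cv = cv, cvpct = cvpct, var = variance), 4)
```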

Details

The kottby.user function is designed to fully exploit the versatility of the DAGJK [Kott 99-01] replication method. It provides a user-friendly tool for calculating estimates, standard errors and confidence intervals for estimators defined by the users themselves. Weighted estimates for the "user-defined estimator" are computed using weights that depend on the class of deskott: calibrated weights for class kott.cal.design and direct weights otherwise.
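The replication idea can be illustrated with a minimal base-R sketch of a delete-a-group jackknife on a single-stratum toy sample. This is not the package's implementation: the data, the ratio.est function and the plain nrg/(nrg-1) reweighting are assumptions made for illustration, and the actual DAGJK replicate weights for stratified multistage designs are more elaborate.

```r
# Toy single-stratum sample (hypothetical data).
set.seed(1)
n   <- 60
toy <- data.frame(y      = rgamma(n, shape = 2, scale = 10),
                  x      = rgamma(n, shape = 2, scale = 5),
                  weight = rep(100, n))

# User-style estimator: ratio of two weighted totals.
ratio.est <- function(d, w, num, den) {
    sum(d[, w] * d[, num]) / sum(d[, w] * d[, den])
}

nrg   <- 15                                         # number of random groups
group <- sample(rep(seq_len(nrg), length.out = n))  # random group labels

# Estimate on the full sample.
theta <- ratio.est(toy, "weight", "y", "x")

# Replicate estimates: drop one group at a time and inflate the
# weights of the remaining units by nrg / (nrg - 1).
theta.r <- sapply(seq_len(nrg), function(r) {
    rep.data <- toy
    rep.data$weight <- ifelse(group == r, 0, toy$weight * nrg / (nrg - 1))
    ratio.est(rep.data, "weight", "y", "x")
})

# Usual delete-a-group jackknife variance and standard error.
v.jk  <- (nrg - 1) / nrg * sum((theta.r - theta)^2)
se.jk <- sqrt(v.jk)
c(estimate = theta, SE = se.jk)
```

Because the estimator is only ever evaluated, never differentiated, the same loop works for any function of the data and weights, which is what makes the approach suitable for non-analytic estimators.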

The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by kottby.user refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. The deskott variables referenced by by (if any) must be factors and must not contain any missing value (NA).
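To make the notion of crossed domains concrete, the following base-R sketch builds the subpopulations that a formula such as ~B1:B2 would identify and computes a weighted total in each. The toy data frame and the use of interaction and split are illustrative assumptions, not a description of how kottby.user works internally.

```r
# Hypothetical sample with two factor variables and no missing values.
toy <- data.frame(
    B1     = factor(c("a", "a", "b", "b", "a", "b")),
    B2     = factor(c("u", "v", "u", "v", "u", "u")),
    y      = c(10, 20, 30, 40, 50, 60),
    weight = c(2, 2, 1, 1, 3, 3)
)

# The 'by' variables must be factors without NAs.
stopifnot(is.factor(toy$B1), is.factor(toy$B2),
          !anyNA(toy$B1), !anyNA(toy$B2))

# Domains obtained by crossing the modalities of B1 and B2.
domain <- interaction(toy$B1, toy$B2, drop = TRUE)

# Weighted total of y within each domain.
tot <- sapply(split(toy, domain), function(d) sum(d$weight * d$y))
tot
```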

The mandatory argument user.estimator is used to specify the calculation method for the "user-defined estimator". In more precise terms: the value bound to the formal argument user.estimator must be a function (an R object of class function, even anonymous) able to compute the value of the required estimator on the sample data frame contained in deskott. It is not necessary for the user.estimator function's return value to be a single numerical value (it can be a vector, a matrix, an array, …). In any case, it will be tacitly coerced to array by kottby.user. More detailed indications on how the user.estimator function must be constructed can be found in the 'Defining a user estimator function' section below.

The optional argument na.replace makes it possible to specify a value to be used to replace any missing values generated by user.estimator in the kottby.user function output. By default na.replace=NULL and the missing values are returned as NAs.
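The effect of na.replace can be pictured with a small base-R sketch. The vector of estimates below is hypothetical (an empty domain producing a 0/0 division), and the replacement step only mimics what the argument is described as doing.

```r
# Hypothetical estimates for three domains; the second one is NaN
# because the estimator divided 0 by 0.
est <- c(north = 12.4, south = 0 / 0, east = 8.1)

# Value the user would pass as na.replace (assumed here to be 0).
na.replace <- 0

# Replace missing values only in the reported output.
est.clean <- est
est.clean[is.na(est.clean)] <- na.replace
est.clean
```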

The conf.int argument makes it possible to request confidence intervals for the estimates. By default conf.int=FALSE, that is the confidence intervals are not provided.

Whenever confidence intervals are requested (i.e. conf.int=TRUE), the desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must represent a probability (0<=conf.lev<=1) and its default is chosen to be 0.95.

Given an input kott.design object with nrg random groups, by default kottby.user builds the confidence intervals making use of a t distribution with nrg-1 degrees of freedom. Indeed the argument df has a default value of nrg-1. Notice, however, that this default value should be used only when the user-defined function user.estimator estimates a univariate parameter of interest. As an example, if user.estimator were designed to estimate regression coefficients for a multiple linear regression with p predictors and no intercept, the right choice would be df = nrg-p.
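As a rough check of what the defaults imply, the sketch below builds a t-based interval by hand from the estimate and standard error reported for the ratio estimator in the 'Examples' section (nrg=15). The symmetric form estimate ± t × SE is assumed here; it is not taken from the package source.

```r
est      <- 6.638191    # estimate printed in the 'Examples' section
se       <- 0.6660701   # corresponding standard error
nrg      <- 15          # random groups used to build the design
conf.lev <- 0.95

# Default degrees of freedom for a univariate parameter.
df <- nrg - 1

# Two-sided t quantile and symmetric confidence interval.
t.quant <- qt(1 - (1 - conf.lev) / 2, df)
ci <- c(lower = est - t.quant * se, upper = est + t.quant * se)
ci
```

With fewer degrees of freedom (for instance df = nrg-p in the regression case mentioned above) the t quantile grows and the interval widens accordingly.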

The special argument … (dot-dot-dot) makes it possible to specify additional parameters to be passed to the user-defined user.estimator function.

Defining a user estimator function

In order to be correctly invoked by kottby.user, the function that codifies the "user-defined estimator" must comply with specific syntactical restrictions. On the other hand, there is no constraint (at least in principle) on the semantics of the function, that is on "what it calculates".
The fundamental constraint is that the function's formal arguments list meets some minimal requirements. Suppose, for simplicity, that the function bound to the user.estimator formal argument is named user.estfun; then its structure must necessarily be of the following type:

user.estfun=function(data, weights, etc){body} [1]

The structure [1] has to be interpreted as follows: the body of user.estfun must contain all the instructions needed to compute the required estimator on the sample data contained in the data data frame, using the weights contained in its weights column. The "etc" symbol in [1] represents any other formal arguments of user.estfun whose actual values can be specified, when invoking kottby.user, using its special argument … (dot-dot-dot).
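A function complying with [1] can be tried out by hand on an ordinary data frame before handing it to kottby.user. In the sketch below, wmean and the toy data are hypothetical; passing the weights as a column name mirrors the d[, w] indexing used in the 'Examples' section, but how kottby.user actually supplies that argument is an assumption.

```r
# Hypothetical estimator following structure [1]: the first argument is
# the data frame, the second identifies the weights column, and 'y' plays
# the role of an "etc" argument supplied through '...'.
wmean <- function(d, w, y) {
    sum(d[, w] * d[, y]) / sum(d[, w])
}

# Toy data frame standing in for the sample data.
toy <- data.frame(income = c(10, 20, 30, 40),
                  weight = c(1, 2, 3, 4))

# Manual call, useful as a quick sanity check of the function body.
wmean(toy, "weight", y = "income")
```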

Sometimes users may need to employ "global" quantities in the body of the user.estfun function, that is, quantities that, even when dealing with sub-population estimates, should not be re-calculated for the sub-populations themselves (the latter being the standard kottby.user behaviour). This need is met by the global function: the user has only to reference, wherever the need arises, the user.estfun input data frame by means of the global(data) expression rather than the standard one data.
The global function only accepts kott.design class objects and can only be used within functions invoked by user.estfun. An example that clearly illustrates the utility of global is provided by the calculation of poverty estimates (see the poverty function documented in the 'Examples' section below).
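The difference between a "global" and a locally re-calculated quantity can be reproduced on a toy data frame with base R alone. The sketch below does not call global; it only mimics the logic, and the data and the pct.below helper are hypothetical.

```r
# Hypothetical incomes with unit weights.
toy <- data.frame(income = c(5, 8, 13, 20, 30, 45),
                  weight = rep(1, 6))

# Threshold computed once on the whole sample (the "global" quantity).
th.global <- 0.6 * sum(toy$weight * toy$income) / sum(toy$weight)
status <- factor(ifelse(toy$income < th.global, "poor", "not-poor"))

# Percentage of weighted units with income below a given threshold.
pct.below <- function(d, th) {
    100 * sum(d$weight[d$income < th]) / sum(d$weight)
}

# Domain estimates using the global threshold ...
by.global <- sapply(split(toy, status), pct.below, th = th.global)

# ... versus a threshold re-calculated inside each domain.
by.local <- sapply(split(toy, status), function(d) {
    pct.below(d, 0.6 * sum(d$weight * d$income) / sum(d$weight))
})

rbind(by.global, by.local)
```

With the global threshold the "poor" domain is entirely below the line and the "not-poor" domain entirely above it; with the local thresholds this consistency is lost.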

Value

The return value depends on the value of the input parameters. In the most general case, the function returns an object of class list (typically a list made up of data frames).

Note

The freedom granted to the user in developing the user.estimator function has important consequences that are worth highlighting. The key point is that, since only the user knows the semantics of user.estimator, the user must vouch for its correct functioning. In particular:
(i) The kottby.user function must be able to invoke the user.estimator function on the deskott sample data frame and, if necessary, on its subsets defined by the by variables. Consequently, when developing the function, the user must make sure that the instructions in its body refer to variables that are actually contained in that data frame. This check could not be performed by the kottby.user caller function except at the expense of limiting the user's freedom in constructing user.estimator;
(ii) In the same way, because of the user's freedom in developing user.estimator, the kottby.user function cannot prevent the generation of missing values in its output. The usefulness of the na.replace parameter must, therefore, be considered as purely "cosmetic".

References

Kott, Phillip S. (1999) "The Extended Delete-A-Group Jackknife". Bulletin of the International Statistical Institute. 52nd Session. Contributed Papers. Book 2, pp. 167-168.

Kott, Phillip S. (2001) "The Delete-A-Group Jackknife". Journal of Official Statistics, Vol.17, No.4, pp. 521-526.

Examples

# Some examples of user-defined estimators and illustration
# of their use via kottby.user. Remember that R functions
# expressing user-defined estimators must comply with the
# condition indicated in [1]. The 3 functions that appear
# in the following examples ('ones', 'ratio' and 'poverty')
# are contained in the data.examples file.
# The 'poverty' function (also) illustrates the correct use
# of the 'global' function.

data(data.examples)

# Creation of a kott.design object:
kdes<-kottdesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
      weights=~weight,nrg=15)


# 1) Estimator of the number of final units in the population.
#    Use the name 'ones' to refer to the R function that
#    expresses the estimator and define it as follows: 

#    ones <- function (d, w)
#    ######################################
#    #  Number of final units estimator.  #
#    ######################################
#    {
#        sum(d[, w])
#    }

#    Now using kottby.user is easy, for instance:
kottby.user(kdes,user.estimator=ones)
#> $estimate
#> [1] 924101.3
#> 
#> $SE
#> [1] 11810.13
#> 


# 2) Estimator of ratios between totals (or means) for 2
#    quantitative variables. Use the name 'ratio' to refer
#    to the R function that expresses the estimator and
#    define it as follows (notice the use of the etc
#    arguments in [1]): 

#    ratio <- function (d, w, num, den)
#    ###########################################
#    #  Ratio estimator for totals (or means)  #
#    #  of quantitative variables.             #
#    ###########################################
#    {
#        sum(d[, w] * d[, num])/sum(d[, w] * d[, den])
#    }

#    Calculating ratio estimates and standard errors
#    is easy (notice the use of the ... argument
#    of kottby.user):
kottby.user(kdes,user.estimator=ratio,num="y1",den="x1")
#> $estimate
#> [1] 6.638191
#> 
#> $SE
#> [1] 0.6660701
#> 


# 3) A non-analytic estimator: population percentage 
#    with income below the poverty threshold (defined,
#    for the sake of simplicity, as 0.6 times the
#    average income for the whole population).
#    Call 'poverty' the estimator and define it as follows:

#    poverty <- function (d, w, y, threshold)
#    ####################################################################
#    #  Population percentage with income below the poverty threshold.  #  
#    #  Suppose poverty threshold is defined as 0.6 times the average   #
#    #  income for the whole population.                                #
#    ####################################################################
#    {
#        if (missing(threshold)) {
#        # if I do want to take into account the variance of the poverty
#        # threshold letting it be re-calculated replicate by replicate.
#            d.global = global(d)
#            th.value = 0.6 * sum(d.global[, w] * d.global[, y])/sum(d.global[, w])
#        }
#        else {
#        # if I do not want to take into account the variance of the poverty
#        # threshold, I will supply its point estimate to the 'threshold' argument.
#            th.value = threshold 
#        }
#        est = 100 * sum(d[d[, y] < th.value, w])/sum(d[, w])
#        est
#    }


#    3.1) First use: neglect the variance of the poverty threshold
#         and supply to 'threshold' (by means of the ... argument
#         of kottby.user) its point estimate obtained using kottby:
pov.line<-0.6*kottby(kdes,~income,estimator="mean")$mean
kottby.user(kdes,user.estimator=poverty,y="income",threshold=pov.line)
#> $estimate
#> [1] 11.83658
#> 
#> $SE
#> [1] 0.8047257
#> 

#    3.2) Second use: do take into account the variance of the poverty
#         threshold letting it be re-calculated replicate by replicate
#         (thus not supplying any actual value to 'threshold'):
kottby.user(kdes,user.estimator=poverty,y="income")
#> Error in eval(deskott.name, envir = last.env): object 'kdes' not found

#    Notice that the standard error estimate for the 'poverty' estimator 
#    obtained in 3.2) cannot be calculated analytically by Taylor 
#    linearization. 

#    Notice the use of the 'global' function in the body of 'poverty': 
#    since the poverty status of each final unit depends on a global
#    value (that is, the average income for the whole population) 
#    'global' is used to prevent, whenever a sub-population poverty
#    estimate is needed, this global value being calculated locally
#    i.e. within the sub-population itself.
#    In fact: 
pov.line<-0.6*kottby(kdes,~income,estimator="mean")$mean
kdes2<-kott.addvars(kdes,pov.status=as.factor(ifelse(income<pov.line,
                                              "poor","not-poor")))
kottby.user(kdes2,by=~pov.status,user.estimator=poverty,y="income")
#> Error in eval(deskott.name, envir = last.env): object 'kdes2' not found

#    If the 'global' function were not used in 'poverty' 
#    the poverty threshold would be calculated relative to  
#    each individual sub-population:

poverty2 <- function (d, w, y, threshold)
###############################################
#  Without relying on the 'global' function   #
###############################################
{
    if (missing(threshold)) {
        th.value = 0.6 * sum(d[, w] * d[, y])/sum(d[, w])
    }
    else {
        th.value = threshold
    }
    est = 100 * sum(d[d[, y] < th.value, w])/sum(d[, w])
    est
}

kottby.user(kdes2,by=~pov.status,user.estimator=poverty2,y="income")
#> $`not-poor`
#> $`not-poor`$estimate
#> [1] 3.749291
#> 
#> $`not-poor`$SE
#> [1] 0.4795252
#> 
#> 
#> $poor
#> $poor$estimate
#> [1] 10.89201
#> 
#> $poor$SE
#> [1] 1.42301
#> 
#> 

#    This means that without 'global' a non-null fraction of poor units
#    would paradoxically be estimated for the "not-poor" sub-population
#    (and, conversely, a non-null fraction of not-poor units for the
#    "poor" sub-population).