Estimation of Quantiles in Subpopulations

Calculates estimates, standard errors and confidence intervals for quantiles of numeric variables in subpopulations.

svystatQ(design, y, probs = c(0.25, 0.5, 0.75), by = NULL,
         vartype = c("se", "cv", "cvpct", "var"),
         conf.lev = 0.95, na.rm = FALSE,
         ties=c("discrete", "rounded"))

# S3 method for svystatQ
coef(object, ...)
# S3 method for svystatQ
SE(object, ...)
# S3 method for svystatQ
VAR(object, ...)
# S3 method for svystatQ
cv(object, ...)
# S3 method for svystatQ
confint(object, ...)

Arguments

design	Object of class `analytic` (or inheriting from it) containing survey data and sampling design metadata.
y	Formula defining the interest variable.
probs	Vector of probability values for which quantile estimates are calculated. The default value selects estimates of quartiles.
by	Formula specifying the variables that define the "estimation domains". If `NULL` (the default option) estimates refer to the whole population.
vartype	`character` vector specifying the desired variability estimators. It is possible to choose one or more of: standard error (`'se'`, the default), coefficient of variation (`'cv'`), percent coefficient of variation (`'cvpct'`), or variance (`'var'`).
conf.lev	Probability specifying the desired confidence level: the default value is `0.95`.
na.rm	Should missing values (if any) be removed from the variable of interest? The default is `FALSE` (see ‘Details’).
ties	How should duplicated observed values be treated? Select `'discrete'` for a genuinely discrete interest variable and `'rounded'` for a continuous one.
object	An object of class `svystatQ`.
...	Additional arguments to `coef`, ..., `confint` methods (if any).

Details

This function calculates weighted estimates for the quantiles of a quantitative variable, using weights that depend on the class of design: calibrated weights for class cal.analytic and direct weights otherwise.

Standard errors are calculated using the so-called "Woodruff method" [Woodruff 52][Sarndal, Swensson, Wretman 92]: (i) first a confidence interval (at a given confidence level 1-a) is constructed for the relative frequency of units with values below the estimated quantile, (ii) then the inverse of the estimated cumulative relative frequency distribution (ECDF) is used to map this interval to a confidence interval for the quantile, (iii) lastly the desired standard error is estimated by dividing the length of the obtained confidence interval by the value 2*qnorm(1-a/2). Notice that the procedure above builds, in general, asymmetric confidence intervals around the estimated quantiles.
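The three steps above can be sketched in base R. This is only an illustration under stated assumptions: the helper name woodruff_sketch is hypothetical, and the variance of the estimated relative frequency is approximated by p*(1-p)/n_eff with Kish's effective sample size, which ignores strata and clusters and is not the design-based variance that svystatQ relies on.

```r
# Hypothetical helper (not part of the package): Woodruff-type standard
# error for a weighted quantile.
# ASSUMPTION: the variance of the estimated proportion is approximated by
# p * (1 - p) / n_eff, which ignores stratification and clustering.
woodruff_sketch <- function(y, w, p = 0.5, conf.lev = 0.95) {
  ord <- order(y)
  y <- y[ord]
  w <- w[ord]
  cw <- cumsum(w)
  ecdf_w <- cw / cw[length(cw)]          # weighted ECDF at the sorted values
  # Inverse ECDF: smallest observed value whose ECDF reaches q
  inv_ecdf <- function(q) y[which(ecdf_w >= q)[1]]

  a <- 1 - conf.lev
  z <- qnorm(1 - a / 2)
  n_eff <- sum(w)^2 / sum(w^2)           # Kish effective sample size
  se_p <- sqrt(p * (1 - p) / n_eff)

  # (i) confidence interval for the relative frequency, kept inside the
  #     range where the inverse ECDF is defined
  p_lo <- max(p - z * se_p, ecdf_w[1])
  p_hi <- min(p + z * se_p, 1)
  # (ii) map the interval to the quantile scale through the inverse ECDF
  ci <- c(inv_ecdf(p_lo), inv_ecdf(p_hi))
  # (iii) standard error = interval length / (2 * qnorm(1 - a/2))
  c(Q = inv_ecdf(p), SE = diff(ci) / (2 * z), CI.l = ci[1], CI.u = ci[2])
}

# Simulated skewed variable with unequal weights
set.seed(1)
y <- rlnorm(500, meanlog = 7, sdlog = 0.4)
w <- runif(500, 1, 5)
res <- woodruff_sketch(y, w, p = 0.5)
res
```

Because the interval is obtained by inverting the ECDF, its two halves around Q generally differ in length, which is the asymmetry mentioned above.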

The mandatory argument y identifies the variable of interest, that is the variable for which estimates of quantiles have to be calculated. The design variable referenced by y must be numeric.

The optional argument probs specifies the probability values (0.001<=probs[i]<=0.999) corresponding to the quantiles one wants to estimate; the default option selects quartiles.

The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by svystatQ refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2: if you need to compute estimates for domains B1 and B2 separately, you have to call svystatQ twice. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced.
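To make the notion of crossed domains concrete, the following base R sketch builds the domains that a formula such as by=~B1:B2 refers to. The data frame and the variables B1 and B2 are made up for illustration, and interaction() is used here only to mimic the crossing; it is not claimed to be what svystatQ does internally.

```r
# Hypothetical data with two classification variables B1 and B2
d <- data.frame(
  B1 = factor(c("a", "a", "b", "b", "b")),
  B2 = factor(c("x", "y", "x", "x", "y"))
)

# Crossing the modalities of B1 and B2: one domain per observed combination
domains <- interaction(d$B1, d$B2, drop = TRUE)

# Number of sample units falling in each crossed domain
table(domains)
```

With two levels in each variable, at most four domains are produced; estimates for B1 alone or B2 alone would require separate calls with by=~B1 and by=~B2.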

Confidence intervals for the estimated quantiles are always returned, since the Woodruff method constructs them before the standard errors are derived: svystatQ takes no conf.int argument, and the outputs shown in 'Examples' always include the interval bounds. The desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must represent a probability (0<=conf.lev<=1) and its default is 0.95.

Missing values (NA) in interest variables should be avoided. If na.rm=FALSE (the default) they generate NAs in estimates (or even an error, if design is calibrated). If na.rm=TRUE, observations containing NAs are dropped, and estimates are computed on non-missing values only. This implicitly assumes that the values of the interest variable are missing completely at random: should this not be the case, the computed estimates would be biased.

Argument ties addresses the problem of how to treat duplicated observed values (if any) when computing the ECDF. Option 'discrete' (the default) is appropriate when the variable of interest is genuinely discrete, while 'rounded' is a better choice for a continuous variable, i.e. when duplicates stem from rounding. In the first case the ECDF will show a vertical step corresponding to a duplicated value, in the second a smoother shape will be achieved by linear interpolation.
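The difference between the two options can be illustrated with a small base R sketch. The helper inv_ecdf_sketch is hypothetical, and the interpolation rule it applies (linear interpolation between the ECDF values at the distinct observed values) is an assumption chosen for illustration, not necessarily the exact rule svystatQ implements.

```r
# Hypothetical helper (not part of the package): inverse weighted ECDF
# under the two treatments of ties.
# ASSUMPTION: 'rounded' interpolates linearly between the ECDF values
# attained at the distinct observed values.
inv_ecdf_sketch <- function(y, w, probs, ties = c("discrete", "rounded")) {
  ties <- match.arg(ties)
  vals <- sort(unique(y))
  # Total weight carried by each distinct value (duplicates are pooled)
  tot <- vapply(vals, function(v) sum(w[y == v]), numeric(1))
  cdf <- cumsum(tot) / sum(tot)
  # 'discrete': step function, returns the smallest value with ECDF >= p
  # 'rounded' : linear interpolation between distinct values
  method <- if (ties == "discrete") "constant" else "linear"
  approx(cdf, vals, xout = probs, method = method, f = 1, rule = 2)$y
}

y <- c(1, 2, 2, 3, 3)   # duplicated values
w <- rep(1, 5)          # equal weights, for simplicity

q_discrete <- inv_ecdf_sketch(y, w, probs = 0.5, ties = "discrete")  # 2
q_rounded  <- inv_ecdf_sketch(y, w, probs = 0.5, ties = "rounded")   # 1.75
```

In this toy sample the ECDF equals 0.2, 0.6 and 1 at the values 1, 2 and 3: the step version returns the observed value 2 for the median, whereas the interpolated version returns a value between 1 and 2.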

Value

An object inheriting from the data.frame class, whose detailed structure depends on input parameters' values.

References

Woodruff, R.S. (1952) “Confidence Intervals for Medians and Other Position Measures”, Journal of the American Statistical Association, Vol. 47, No. 260, pp. 635-646.

Sarndal, C.E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer Verlag.

Examples

# Creation of a design object:
data(data.examples)
des<-e.svydesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
     weights=~weight)

# Estimate of the deciles of the income variable for
# the whole population:
svystatQ(des,~income,probs=seq(0.1,0.9,0.1),ties="rounded")
#>           income.Q[p]        SE CI.l(95%) CI.u(95%)
#> p = 0.100    713.0867 13.383999  687.5816  740.0459
#> p = 0.200    888.0190  8.790575  868.0873  902.5458
#> p = 0.300   1021.8314 10.066114 1002.7788 1042.2373
#> p = 0.400   1134.8728  9.562784 1117.9118 1155.3972
#> p = 0.500   1243.6615 11.869458 1220.3692 1266.8966
#> p = 0.600   1359.6427 11.348459 1335.7082 1380.1933
#> p = 0.700   1468.1979 11.664875 1445.5464 1491.2719
#> p = 0.800   1606.7957 14.910089 1581.4703 1639.9168
#> p = 0.900   1826.9735 14.051038 1803.4479 1858.5269


# Another design object:
data(sbs)
des<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,
     fpc=~fpc)

# Estimation of the median value added 
# for economic activity macro-sectors:
svystatQ(des,~va.imp2,probs=0.5,by=~nace.macro,
         ties="rounded",vartype="cvpct")
#>              nace.macro va.imp2.Q[0.5] CI.l(95%).va.imp2.Q[0.5]
#> Agriculture Agriculture       369.1569                 320.3380
#> Industry       Industry       446.9925                 400.5063
#> Commerce       Commerce      1461.9967                1319.6071
#> Services       Services       391.3681                 365.6434
#>             CI.u(95%).va.imp2.Q[0.5] CV%.va.imp2.Q[0.5]
#> Agriculture                 451.1369           9.038889
#> Industry                    493.4787           5.306100
#> Commerce                   1725.5497           7.083373
#> Services                    443.8254           5.096160

# Estimation of the Interquartile Range (IQR) of the number
# of employees for economic activity macro-sectors:
apply(svystatQ(des,~emp.num,probs=c(0.25,0.75),by=~nace.macro)[,2:3],1,diff)
#> Agriculture    Industry    Commerce    Services 
#>          15          44           9          24
