Calculates estimates, standard errors and confidence intervals for Quantiles of numeric variables in subpopulations.

svystatQ(design, y, probs = c(0.25, 0.5, 0.75), by = NULL,
         vartype = c("se", "cv", "cvpct", "var"),
         conf.lev = 0.95, na.rm = FALSE,
         ties=c("discrete", "rounded"))

# S3 method for svystatQ
coef(object, ...)
# S3 method for svystatQ
SE(object, ...)
# S3 method for svystatQ
VAR(object, ...)
# S3 method for svystatQ
cv(object, ...)
# S3 method for svystatQ
confint(object, ...)

Arguments

design

Object of class analytic (or inheriting from it) containing survey data and sampling design metadata.

y

Formula defining the interest variable.

probs

Vector of probability values used to calculate the quantile estimates. The default value selects estimates of quartiles.

by

Formula specifying the variables that define the "estimation domains". If NULL (the default option) estimates refer to the whole population.

vartype

Character vector specifying the desired variability estimators. It is possible to choose one or more of: standard error ('se', the default), coefficient of variation ('cv'), percent coefficient of variation ('cvpct'), or variance ('var').

conf.lev

Probability specifying the desired confidence level: the default value is 0.95.

na.rm

Should missing values (if any) be removed from the variable of interest? The default is FALSE (see ‘Details’).

ties

How should duplicated observed values be treated? Select 'discrete' for a genuinely discrete interest variable and 'rounded' for a continuous one.

object

An object of class svystatQ.

...

Additional arguments to coef, ..., confint methods (if any).

Details

This function calculates weighted estimates for the Quantiles of a quantitative variable using suitable weights depending on the class of design: calibrated weights for class cal.analytic and direct weights otherwise.

Standard errors are calculated using the so-called "Woodruff method" [Woodruff 52][Sarndal, Swensson, Wretman 92]: (i) first a confidence interval (at a given confidence level 1-a) is constructed for the relative frequency of units with values below the estimated quantile, (ii) then the inverse of the estimated cumulative relative frequency distribution (ECDF) is used to map this interval to a confidence interval for the quantile, (iii) lastly the desired standard error is estimated by dividing the length of the obtained confidence interval by the value 2*qnorm(1-a/2). Notice that the procedure above builds, in general, asymmetric confidence intervals around the estimated quantiles.
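The three steps above can be sketched in base R. This is only an illustration, not the code used by svystatQ: it assumes simple random sampling with equal weights (so the variance of the estimated relative frequency is p*(1-p)/n), and the function name woodruff_se and the data are made up for the example.

```r
# Illustrative sketch of the Woodruff steps.
# ASSUMPTION: simple random sampling with equal weights, so that
# Var(p_hat) = p * (1 - p) / n. A real design needs a design-based variance.
woodruff_se <- function(x, p, conf.lev = 0.95) {
  n <- length(x)
  a <- 1 - conf.lev
  z <- qnorm(1 - a / 2)
  # (i) confidence interval for the relative frequency below the quantile
  se_p <- sqrt(p * (1 - p) / n)
  p_lo <- max(p - z * se_p, 0)
  p_hi <- min(p + z * se_p, 1)
  # (ii) map the interval through the inverse ECDF
  #      (type = 1 is the inverse of the empirical distribution function)
  q <- quantile(x, probs = c(p_lo, p, p_hi), type = 1, names = FALSE)
  # (iii) standard error = interval length / (2 * z)
  c(estimate = q[2], CI.l = q[1], CI.u = q[3],
    SE = (q[3] - q[1]) / (2 * z))
}

set.seed(1)
income <- rlnorm(500, meanlog = 7, sdlog = 0.4)  # made-up skewed variable
res <- woodruff_se(income, p = 0.5)
res
```

Comparing res["estimate"] - res["CI.l"] with res["CI.u"] - res["estimate"] shows that the two halves of the interval generally differ in length, which is the asymmetry mentioned above.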

The mandatory argument y identifies the variable of interest, that is the variable for which estimates of quantiles have to be calculated. The design variable referenced by y must be numeric.

The optional argument probs specifies the probability values (0.001<=probs[i]<=0.999) corresponding to the quantiles one wants to estimate; the default option selects quartiles.

The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by svystatQ refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2: if you need to compute estimates for domains B1 and B2 separately, you have to call svystatQ twice. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced.
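What "crossing the modalities" means can be illustrated with base R alone. The sketch below does not call svystatQ and does not reproduce its internals: the data frame d and its columns B1 and B2 are made up, and interaction() is used only to show which domains a crossing of two factors produces.

```r
# Made-up data; B1 and B2 are hypothetical domain variables
d <- data.frame(B1 = c("a", "a", "b", "b", "b"),
                B2 = c("x", "y", "x", "x", "y"))

# Both formulas reference the same set of variables
vars_plus  <- all.vars(~ B1 + B2)
vars_cross <- all.vars(~ B1:B2)
identical(vars_plus, vars_cross)

# Crossing the modalities: one domain per observed (B1, B2) combination
domains <- interaction(d$B1, d$B2, drop = TRUE)
table(domains)
```

Estimates for B1 alone and for B2 alone correspond to two different partitions of the population, which is why they require two separate calls.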

Confidence intervals are a by-product of the Woodruff method and are included in the output (see ‘Examples’); svystatQ takes no conf.int argument.

The desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must represent a probability (0<=conf.lev<=1) and its default is 0.95.

Missing values (NA) in interest variables should be avoided. If na.rm=FALSE (the default) they generate NAs in estimates (or even an error, if design is calibrated). If na.rm=TRUE, observations containing NAs are dropped, and estimates are computed on non-missing values only. This implicitly assumes that missing values affect interest variables completely at random: should this not be the case, the computed estimates would be biased.
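The bias risk can be seen in a small unweighted example in base R. It is only an analogy, not svystatQ output: the variable x is made up, and dropping every fifth value merely stands in for a missingness pattern unrelated to the size of the value.

```r
x <- 1:100                                # made-up complete variable

x_mcar <- x
x_mcar[seq(5, 100, by = 5)] <- NA         # missingness unrelated to the value

x_mnar <- x
x_mnar[x > 80] <- NA                      # missingness hits large values only

median(x)                                 # complete-data median
median(x_mnar)                            # NA: missing values propagate
median(x_mcar, na.rm = TRUE)              # close to the complete-data median
median(x_mnar, na.rm = TRUE)              # clearly shifted downwards
```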

Argument ties addresses the problem of how to treat duplicated observed values (if any) when computing the ECDF. Option 'discrete' (the default) is appropriate when the variable of interest is genuinely discrete, while 'rounded' is a better choice for a continuous variable, i.e. when duplicates stem from rounding. In the first case the ECDF will show a vertical step corresponding to a duplicated value, in the second a smoother shape will be achieved by linear interpolation.
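The difference between a stepwise and an interpolated ECDF can be sketched in base R. This is an assumption-laden illustration: the data, the weights and the helper names q_step and q_interp are made up, and the interpolation rule used here (linear interpolation between the ECDF values at the distinct observed points) is one simple convention that need not coincide with the one svystatQ applies for ties='rounded'.

```r
# Made-up sample with duplicated values and hypothetical (equal) weights
y <- c(10, 20, 20, 20, 30, 40)
w <- rep(1, length(y))

# Weighted ECDF evaluated at the distinct observed values
vals <- sort(unique(y))
Fhat <- cumsum(as.numeric(tapply(w, y, sum))) / sum(w)

# Step-function inverse: smallest observed value whose ECDF reaches p
q_step <- function(p) vals[which(Fhat >= p)[1]]

# Smoothed inverse: linear interpolation between the ECDF steps
q_interp <- function(p) {
  approx(x = Fhat, y = vals, xout = p, rule = 2, ties = "ordered")$y
}

q_step(0.5)    # falls on the duplicated value, where the ECDF jumps
q_interp(0.5)  # lies between two observed values
```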

Value

An object inheriting from the data.frame class, whose detailed structure depends on input parameters' values.

References

Woodruff, R.S. (1952) “Confidence Intervals for Medians and Other Position Measures”, Journal of the American Statistical Association, Vol. 47, No. 260, pp. 635-646.

Sarndal, C.E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer Verlag.

See also

Estimators of Totals and Means svystatTM, Ratios between Totals svystatR, Shares svystatS, Ratios between Shares svystatSR, Multiple Regression Coefficients svystatB, Complex Analytic Functions of Totals and/or Means svystatL, and all of the above svystat.

Examples

# Creation of a design object:
data(data.examples)
des<-e.svydesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
     weights=~weight)

# Estimate of the deciles of the income variable for
# the whole population:
svystatQ(des,~income,probs=seq(0.1,0.9,0.1),ties="rounded")
#>           income.Q[p]        SE CI.l(95%) CI.u(95%)
#> p = 0.100    713.0867 13.383999  687.5816  740.0459
#> p = 0.200    888.0190  8.790575  868.0873  902.5458
#> p = 0.300   1021.8314 10.066114 1002.7788 1042.2373
#> p = 0.400   1134.8728  9.562784 1117.9118 1155.3972
#> p = 0.500   1243.6615 11.869458 1220.3692 1266.8966
#> p = 0.600   1359.6427 11.348459 1335.7082 1380.1933
#> p = 0.700   1468.1979 11.664875 1445.5464 1491.2719
#> p = 0.800   1606.7957 14.910089 1581.4703 1639.9168
#> p = 0.900   1826.9735 14.051038 1803.4479 1858.5269

# Another design object:
data(sbs)
des<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,
     fpc=~fpc)

# Estimation of the median value added
# for economic activity macro-sectors:
svystatQ(des,~va.imp2,probs=0.5,by=~nace.macro,
         ties="rounded",vartype="cvpct")
#>              nace.macro va.imp2.Q[0.5] CI.l(95%).va.imp2.Q[0.5]
#> Agriculture Agriculture       369.1569                 320.3380
#> Industry       Industry       446.9925                 400.5063
#> Commerce       Commerce      1461.9967                1319.6071
#> Services       Services       391.3681                 365.6434
#>             CI.u(95%).va.imp2.Q[0.5] CV%.va.imp2.Q[0.5]
#> Agriculture                 451.1369           9.038889
#> Industry                    493.4787           5.306100
#> Commerce                   1725.5497           7.083373
#> Services                    443.8254           5.096160

# Estimation of the Interquartile Range (IQR) of the number
# of employees for economic activity macro-sectors:
apply(svystatQ(des,~emp.num,probs=c(0.25,0.75),by=~nace.macro)[,2:3],1,diff)
#> Agriculture    Industry    Commerce    Services
#>          15          44           9          24