svystatQ.Rd
Calculates estimates, standard errors and confidence intervals for Quantiles of numeric variables in subpopulations.
svystatQ(design, y, probs = c(0.25, 0.5, 0.75), by = NULL,
         vartype = c("se", "cv", "cvpct", "var"),
         conf.lev = 0.95, na.rm = FALSE,
         ties = c("discrete", "rounded"))

# S3 method for svystatQ
coef(object, ...)
# S3 method for svystatQ
SE(object, ...)
# S3 method for svystatQ
VAR(object, ...)
# S3 method for svystatQ
cv(object, ...)
# S3 method for svystatQ
confint(object, ...)
Argument | Description |
---|---|
design | Object of class |
y | Formula defining the interest variable. |
probs | Vector of probability values to be used to calculate the quantile estimates. The default value selects estimates of quartiles. |
by | Formula specifying the variables that define the "estimation domains". If NULL (the default), estimates refer to the whole population. |
vartype | Variability measures to report alongside the estimates: one or more of "se" (the default), "cv", "cvpct", "var". |
conf.lev | Probability specifying the desired confidence level: the default value is 0.95. |
na.rm | Should missing values (if any) be removed from the variable of interest? The default is FALSE. |
ties | How should duplicated observed values be treated? Select 'discrete' for a genuinely discrete variable of interest, 'rounded' for a continuous one. |
object | An object of class svystatQ. |
... | Additional arguments to |
This function calculates weighted estimates for the quantiles of a quantitative variable using suitable weights depending on the class of design: calibrated weights for class cal.analytic and direct weights otherwise.
Standard errors are calculated using the so-called "Woodruff method" [Woodruff 52][Sarndal, Swensson, Wretman 92]: (i) first a confidence interval (at a given confidence level 1-a) is constructed for the relative frequency of units with values below the estimated quantile, (ii) then the inverse of the estimated cumulative relative frequency distribution (ECDF) is used to map this interval to a confidence interval for the quantile, (iii) lastly the desired standard error is estimated by dividing the length of the obtained confidence interval by the value 2*qnorm(1-a/2). Notice that the procedure above builds, in general, asymmetric confidence intervals around the estimated quantiles.
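The three steps of the Woodruff method can be sketched in base R as follows. This is only an illustration, not the code used by svystatQ: the helper names wquantile and woodruff are hypothetical, ties are handled in the 'discrete' way, and the standard error of the relative frequency is taken from the binomial formula, which assumes simple random sampling (a real complex design would need a design-based variance estimate at this step).

```r
# Inverse of the weighted ECDF: smallest observed value whose cumulative
# relative frequency reaches p (hypothetical helper, ties = "discrete").
wquantile <- function(y, w, p) {
  o <- order(y)
  y <- y[o]
  cdf <- cumsum(w[o]) / sum(w)
  sapply(p, function(pp) y[which(cdf >= pp - 1e-12)[1]])
}

# Woodruff-style standard error for a single quantile (hypothetical helper).
woodruff <- function(y, w, p, conf.lev = 0.95) {
  a <- 1 - conf.lev
  z <- qnorm(1 - a / 2)
  q <- wquantile(y, w, p)
  # ASSUMPTION: simple random sampling, so the SE of the relative
  # frequency is the plain binomial one.
  se.p <- sqrt(p * (1 - p) / length(y))
  # (i) confidence interval for the relative frequency, kept inside [0, 1]
  p.lo <- max(p - z * se.p, 0)
  p.hi <- min(p + z * se.p, 1)
  # (ii) map the interval through the inverse ECDF
  ci <- wquantile(y, w, c(p.lo, p.hi))
  # (iii) SE = interval length divided by 2 * qnorm(1 - a/2)
  c(Q = q, SE = (ci[2] - ci[1]) / (2 * z), CI.l = ci[1], CI.u = ci[2])
}

set.seed(1)
y <- rexp(500)       # toy skewed variable
w <- rep(1, 500)     # toy equal weights
res <- woodruff(y, w, p = 0.5)
print(res)
```

Because the interval is obtained by inverting the ECDF, its two halves around Q generally have different lengths, which is the asymmetry mentioned above.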
The mandatory argument y identifies the variable of interest, that is the variable for which estimates of quantiles have to be calculated. The design variable referenced by y must be numeric.
The optional argument probs specifies the probability values (0.001<=probs[i]<=0.999) corresponding to the quantiles one wants to estimate; the default option selects quartiles.
The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by svystatQ refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2: if you need to compute estimates for domains B1 and B2 separately, you have to call svystatQ twice. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced.
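To make the idea of "crossing the modalities" concrete, the following base R sketch builds the crossed domains with interaction() and computes a weighted median in each of them. The toy data frame d, its columns and the helper wmedian are all hypothetical, and no standard errors are computed: it only illustrates which subpopulations a formula like by=~B1:B2 identifies, not how svystatQ works internally.

```r
# Hypothetical toy data: two classification factors, a numeric variable
# of interest and a weight.
d <- data.frame(
  B1 = factor(c("a", "a", "b", "b", "a", "b")),
  B2 = factor(c("x", "y", "x", "y", "x", "y")),
  y  = c(10, 20, 30, 40, 50, 60),
  w  = c(1, 2, 1, 2, 1, 2)
)

# Weighted median via the inverse of the weighted ECDF (hypothetical helper).
wmedian <- function(y, w) {
  o <- order(y)
  cdf <- cumsum(w[o]) / sum(w)
  y[o][which(cdf >= 0.5)[1]]
}

# One domain for every observed combination of B1 and B2.
dom <- interaction(d$B1, d$B2, sep = ":", drop = TRUE)
est <- sapply(split(d, dom), function(g) wmedian(g$y, g$w))
print(est)
```

Estimates for B1 alone or B2 alone would come from splitting on a single factor instead, which mirrors the need to call svystatQ twice for separate domains.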
The conf.int argument lets you request confidence intervals for the estimates. By default conf.int=FALSE, that is the confidence intervals are not provided.
Whenever confidence intervals are requested (i.e. conf.int=TRUE), the desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must represent a probability (0<=conf.lev<=1) and its default is chosen to be 0.95.
Missing values (NA) in interest variables should be avoided. If na.rm=FALSE (the default) they generate NAs in estimates (or even an error, if design is calibrated). If na.rm=TRUE, observations containing NAs are dropped, and estimates are computed on non-missing values only. This implicitly assumes that values of the interest variables are missing completely at random: should this not be the case, computed estimates would be biased.
Argument ties addresses the problem of how to treat duplicated observed values (if any) when computing the ECDF. Option 'discrete' (the default) is appropriate when the variable of interest is genuinely discrete, while 'rounded' is a better choice for a continuous variable, i.e. when duplicates stem from rounding. In the first case the ECDF will show a vertical step corresponding to a duplicated value, in the second a smoother shape will be achieved by linear interpolation.
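The difference between the two options can be illustrated with a small base R sketch. It is a simplified picture under stated assumptions: the toy vectors are hypothetical, and the interpolation shown (joining the ECDF points with straight lines via approx()) is just one plausible way to smooth the steps, not necessarily the exact rule svystatQ applies for ties='rounded'.

```r
# Hypothetical toy data with a heavily duplicated value (2).
y <- c(1, 2, 2, 2, 3, 4)
w <- rep(1, 6)

# Collapse duplicates: total weight and cumulative relative frequency
# for each distinct observed value.
tot  <- tapply(w, y, sum)
vals <- as.numeric(names(tot))
cdf  <- as.numeric(cumsum(tot) / sum(tot))

# 'discrete'-like: the ECDF is a step function, so the median is the
# first value whose cumulative frequency reaches 0.5.
q.discrete <- vals[which(cdf >= 0.5)[1]]

# 'rounded'-like (ASSUMED interpolation rule): join the ECDF points by
# straight lines and read the quantile off the interpolated curve.
q.rounded <- approx(x = cdf, y = vals, xout = 0.5)$y

print(c(discrete = q.discrete, rounded = q.rounded))
```

With the step function the median sits exactly on the duplicated value, while the interpolated curve places it between the two neighbouring distinct values.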
An object inheriting from the data.frame class, whose detailed structure depends on input parameters' values.
Woodruff, R.S. (1952) “Confidence Intervals for Medians and Other Position Measures”, Journal of the American Statistical Association, Vol. 47, No. 260, pp. 635-646.
Sarndal, C.E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer Verlag.
Estimators of Totals and Means svystatTM, Ratios between Totals svystatR, Shares svystatS, Ratios between Shares svystatSR, Multiple Regression Coefficients svystatB, Complex Analytic Functions of Totals and/or Means svystatL, and all of the above svystat.
# Creation of a design object:
data(data.examples)
des<-e.svydesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
                 weights=~weight)

# Estimate of the deciles of the income variable for
# the whole population:
svystatQ(des,~income,probs=seq(0.1,0.9,0.1),ties="rounded")
#>           income.Q[p]        SE CI.l(95%) CI.u(95%)
#> p = 0.100    713.0867 13.383999  687.5816  740.0459
#> p = 0.200    888.0190  8.790575  868.0873  902.5458
#> p = 0.300   1021.8314 10.066114 1002.7788 1042.2373
#> p = 0.400   1134.8728  9.562784 1117.9118 1155.3972
#> p = 0.500   1243.6615 11.869458 1220.3692 1266.8966
#> p = 0.600   1359.6427 11.348459 1335.7082 1380.1933
#> p = 0.700   1468.1979 11.664875 1445.5464 1491.2719
#> p = 0.800   1606.7957 14.910089 1581.4703 1639.9168
#> p = 0.900   1826.9735 14.051038 1803.4479 1858.5269

# Another design object:
data(sbs)
des<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,
                 fpc=~fpc)

# Estimation of the median value added
# for economic activity macro-sectors:
svystatQ(des,~va.imp2,probs=0.5,by=~nace.macro,
         ties="rounded",vartype="cvpct")
#>              nace.macro va.imp2.Q[0.5] CI.l(95%).va.imp2.Q[0.5]
#> Agriculture Agriculture       369.1569                 320.3380
#> Industry       Industry       446.9925                 400.5063
#> Commerce       Commerce      1461.9967                1319.6071
#> Services       Services       391.3681                 365.6434
#>             CI.u(95%).va.imp2.Q[0.5] CV%.va.imp2.Q[0.5]
#> Agriculture                 451.1369           9.038889
#> Industry                    493.4787           5.306100
#> Commerce                   1725.5497           7.083373
#> Services                    443.8254           5.096160

# Estimation of the Interquartile Range (IQR) of the number
# of employees for economic activity macro-sectors:
apply(svystatQ(des,~emp.num,probs=c(0.25,0.75),by=~nace.macro)[,2:3],1,diff)
#> Agriculture    Industry    Commerce    Services
#>          15          44           9          24