Estimation of Totals and Means in Subpopulations

Computes estimates, standard errors and confidence intervals for Totals and Means in subpopulations.

svystatTM(design, y, by = NULL, estimator = c("Total", "Mean"),
          vartype = c("se", "cv", "cvpct", "var"),
          conf.int = FALSE, conf.lev = 0.95, deff = FALSE,
          na.rm = FALSE)

# S3 method for svystatTM
coef(object, ...)
# S3 method for svystatTM
SE(object, ...)
# S3 method for svystatTM
VAR(object, ...)
# S3 method for svystatTM
cv(object, ...)
# S3 method for svystatTM
deff(object, ...)
# S3 method for svystatTM
confint(object, ...)

Arguments

design	Object of class `analytic` (or inheriting from it) containing survey data and sampling design metadata.
y	Formula defining the variables of interest. Only `numeric` or `factor` variables are allowed.
by	Formula specifying the variables that define the "estimation domains". If `NULL` (the default option) estimates refer to the whole population.
estimator	`character` specifying the desired estimator: it may be `'Total'` (the default) or `'Mean'`.
vartype	`character` vector specifying the desired variability estimators. It is possible to choose one or more of: standard error (`'se'`, the default), coefficient of variation (`'cv'`), percent coefficient of variation (`'cvpct'`), or variance (`'var'`).
conf.int	Compute confidence intervals for the estimates? The default is `FALSE`.
conf.lev	Probability specifying the desired confidence level: the default value is `0.95`.
deff	Should the design effect be computed? The default is `FALSE` (see ‘Details’).
na.rm	Should missing values (if any) be removed from the variables of interest? The default is `FALSE` (see ‘Details’).
object	An object of class `svystatTM`.
...	Additional arguments to `coef`, ..., `confint` methods (if any).

Details

This function computes weighted estimates for Totals and Means using suitable weights depending on the class of design: calibrated weights for class cal.analytic and direct weights otherwise. Standard errors for nonlinear estimators (e.g. calibration estimators) are calculated using the Taylor linearization technique.

The mandatory argument y identifies the variables of interest, that is the variables for which estimates are to be calculated. The corresponding formula should be of the type y=~var1+...+varn. The design variables referenced by y should be numeric or factor (variables of other types - e.g. character - will generate and error). It is admissible to specify for y "mixed" formulae that simultaneously contain quantitative (numeric) variables and qualitative (factor) variables.

To override the restriction to formulae of the type y=~var1+...+varn, the AsIs operator I() can be used (see ‘Examples’). Though the latter opportunity could appear quite useful in some occasion, actually it should be almost always possible to find a work-around by using other functions of the ReGenesees package.

The optional argument by specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL (the default option), the estimates produced by svystatTM refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2 selects as estimation domains the subpopulations determined by crossing the modalities of variables B1 and B2. Notice that a formula like by=~B1+B2 will be automatically translated into the factor-crossing formula by=~B1:B2: if you need to compute estimates for domains B1 and B2 separately, you have to call svystatTM twice. The design variables referenced by by (if any) should be of type factor, otherwise they will be coerced.

The optional argument estimator makes it possible to select the desired estimator. If
estimator="Total" (the default option), svystatTM calculates, for a given variable of interest vark, the estimate of the total (when vark is numeric) or the estimate of the absolute frequency distribution (when vark is factor). Similarly, if estimator="Mean", the function calculates the estimate of the mean (when vark is numeric) or the the estimate of the relative frequency distribution (when vark is factor).

The conf.int argument allows to request the confidence intervals for the estimates. By default conf.int=FALSE, that is the confidence intervals are not provided.

Whenever confidence intervals are requested (i.e. conf.int=TRUE), the desired confidence level can be specified by means of the conf.lev argument. The conf.lev value must represent a probability (0<=conf.lev<=1) and its default is chosen to be 0.95.

The optional argument deff allows to request the design effect [Kish 1995] for the estimates. By default deff=FALSE, that is the design effect is not provided. The design effect of an estimator is defined as the ratio between the variance of the estimator under the actual sampling design and the variance that would be obtained for an 'equivalent' estimator under a hypothetical simple random sampling without replacement of the same size. To obtain an estimate of the design effect comparing to simple random sampling “with replacement”, one must use deff="replace".
Understanding what 'equivalent' estimator actually means is straightforward when dealing with Horvitz-Thompson estimators of Totals and Means. This is not the case when, on the contrary, the estimator to which the deff refers is a nonlinear estimator (e.g. for Calibration estimators of Totals and Means). In such cases, the standard approach is to use as 'equivalent' estimator the linearized version of the original estimator (that is: the estimator of the total of the linearized variable, aka "Woodruff transform").
When dealing with domain estimation, the design effects referring to a given subpopulation are currently computed by taking the ratios between the actual variance estimates and those that would have been obtained if a simple random sampling were carried out within that subpopulation. This is the same as the srssubpop option for Stata's function estat.

Missing values (NA) in interest variables should be avoided. If na.rm=FALSE (the default) they generate NAs in estimates (or even an error, if design is calibrated). If na.rm=TRUE and estimator="Mean", observations containing NAs are dropped, and estimates get computed on non-missing values only. This implicitly assumes that missing values hit interest variables completely at random (MCAR): should this not be the case, computed estimates would be biased. Since, even under the MCAR assumption, estimates of totals and counts solely based on complete cases would be biased, function svystatTM adopts a model-based ratio estimator when na.rm=TRUE and estimator="Total". This is obtained by (i) first estimating the mean or proportion based on complete cases only, and (ii) then multiplying the result by the estimated population size based on all observations (i.e. both missing and non-missing). This model-based ratio estimator is asymptotically design-unbiased under the MCAR assumption. Notice that the na.rm=TRUE option is only allowed if y references a single interest variable.

Value

An object inheriting from the data.frame class, whose detailed structure depends on input parameters' values.

References

Sarndal, C.E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer Verlag.

Kish, L. (1995). “Methods for design effects”. Journal of Official Statistics, Vol. 11, pp. 55-77.

European Commission, Eurostat, (2013). “Handbook on precision requirements and variance estimation for ESS households surveys: 2013 edition”, Publications Office. doi: 10.2785/13579

Examples

# Load survey data:
data(data.examples)

# Creation of a design object:
des<-e.svydesign(data=example,ids=~towcod+famcod,strata=~SUPERSTRATUM,
     weights=~weight)


# Estimation of the total of 3 quantitative variables for the whole
# population:
svystatTM(des,~y1+y2+y3)
#>       Total        SE
#> y1 381111.8 11979.658
#> y2 356633.5 11426.506
#> y3  24478.3  2971.032


# Estimation of the total of the same 3 variables by region, with SE
# and CV%: 
svystatTM(des,~y1+y2+y3,~regcod,vartype=c("se","cvpct"))
#>    regcod Total.y1 Total.y2 Total.y3 SE.Total.y1 SE.Total.y2 SE.Total.y3
#> 6       6 127263.2 120685.6   6577.6    5795.431    5844.851    1518.427
#> 7       7 158181.1 145929.1  12252.0    8740.133    7849.983    2253.128
#> 10     10  95667.5  90018.8   5648.7    5790.963    5897.502    1202.009
#>    CV%.Total.y1 CV%.Total.y2 CV%.Total.y3
#> 6      4.553894     4.843039     23.08482
#> 7      5.525397     5.379313     18.38988
#> 10     6.053218     6.551412     21.27939


# Estimation of the mean of the same 3 variables by marstat and sex:
svystatTM(des,~y1+y2+y3,~marstat:sex,estimator="Mean")
#>               marstat sex   Mean.y1   Mean.y2     Mean.y3 SE.Mean.y1 SE.Mean.y2
#> married.f     married   f 0.3907492 0.3669468 0.023802352 0.01856838 0.01852581
#> unmarried.f unmarried   f 0.4235675 0.4042538 0.019313678 0.02280424 0.02173967
#> widowed.f     widowed   f 0.4626949 0.4407110 0.021983877 0.04586107 0.04617194
#> married.m     married   m 0.4067734 0.3826893 0.024084168 0.01664767 0.01616298
#> unmarried.m unmarried   m 0.4394581 0.3906832 0.048774895 0.02309729 0.02211473
#> widowed.m     widowed   m 0.4018789 0.3952614 0.006617401 0.04809382 0.04797347
#>              SE.Mean.y3
#> married.f   0.005178534
#> unmarried.f 0.007090960
#> widowed.f   0.012790746
#> married.m   0.005624599
#> unmarried.m 0.009429334
#> widowed.m   0.006606120


# Estimation of the absolute frequency distribution of the qualitative
# variable age5c for the whole population, with the design effect:
svystatTM(des,~age5c,deff=TRUE)
#>           Total       SE     DEff
#> age5c1 128928.4 6855.457 1.379735
#> age5c2 294575.1 9646.795 1.510391
#> age5c3 356463.9 9354.653 1.301673
#> age5c4 122897.6 8022.848 1.967450
#> age5c5  21236.3 2728.266 1.168435

# MARGINAL relative frequency distributions
  # Estimation of the relative frequency distribution of the qualitative
  # variable age5c for the whole population:
  svystatTM(des,~age5c,estimator="Mean")
#>              Mean          SE
#> age5c1 0.13951760 0.006775833
#> age5c2 0.31876927 0.008310965
#> age5c3 0.38574115 0.009450915
#> age5c4 0.13299148 0.007791633
#> age5c5 0.02298049 0.002911569

# CONDITIONAL relative frequency distributions
  # Estimation of the relative frequency distribution of the qualitative
  # variable marstat by sex:
  svystatTM(des,~marstat,~sex,estimator="Mean")
#>   sex Mean.marstatmarried Mean.marstatunmarried Mean.marstatwidowed
#> f   f           0.5819877             0.3385543          0.07945802
#> m   m           0.5794871             0.3381051          0.08240785
#>   SE.Mean.marstatmarried SE.Mean.marstatunmarried SE.Mean.marstatwidowed
#> f             0.01381074               0.01378699            0.008264284
#> m             0.01521242               0.01435306            0.007792035

# JOINT relative frequency distributions
  # Estimation of the relative frequency of the joint distribution of sex
  # and marstat:
    # *First Solution* (using the AsIs operator I()):
    svystatTM(des,~I(sex:marstat),estimator="Mean")
#>                                 Mean          SE
#> I(sex:marstat)f:married   0.29603854 0.008526480
#> I(sex:marstat)f:unmarried 0.17221175 0.007805405
#> I(sex:marstat)f:widowed   0.04041776 0.004296936
#> I(sex:marstat)m:married   0.28472052 0.008815911
#> I(sex:marstat)m:unmarried 0.16612183 0.007935676
#> I(sex:marstat)m:widowed   0.04048961 0.003912457
    # *Second Solution* (adding a new variable to des):
    des2 <- des.addvars(des, sex.marstat=sex:marstat)
    svystatTM(des2,~sex.marstat,estimator="Mean")
#>                              Mean          SE
#> sex.marstatf:married   0.29603854 0.008526480
#> sex.marstatf:unmarried 0.17221175 0.007805405
#> sex.marstatf:widowed   0.04041776 0.004296936
#> sex.marstatm:married   0.28472052 0.008815911
#> sex.marstatm:unmarried 0.16612183 0.007935676
#> sex.marstatm:widowed   0.04048961 0.003912457
    # *Third Solution* (exploiting estimators of Shares, see also ?svystatS):
    # Add new variable 'ones' to estimate counts of final units (individuals)
    # and estimate the share of people for classes of sex and marstat
    des2 <- des.addvars(des, ones=1)
    svystatS(des2,~ones,classes=~sex:marstat)
#>                       ones.Share          SE
#> sexf:marstatmarried   0.29603854 0.008526480
#> sexm:marstatmarried   0.28472052 0.008815911
#> sexf:marstatunmarried 0.17221175 0.007805405
#> sexm:marstatunmarried 0.16612183 0.007935676
#> sexf:marstatwidowed   0.04041776 0.004296936
#> sexm:marstatwidowed   0.04048961 0.003912457


# Estimation of the mean income inside provinces, with confidence intervals
# at a confidence level of 0.9:
svystatTM(des,~income,~procod,estimator="Mean",conf.int=TRUE,conf.lev=0.9)
#>    procod Mean.income SE.Mean.income CI.l(90%).Mean.income
#> 8       8    1247.856       47.46596              1169.781
#> 9       9    1277.439       29.81988              1228.390
#> 10     10    1278.077       16.57645              1250.811
#> 11     11    1305.575       41.65284              1237.062
#> 30     30    1191.915       30.50543              1141.738
#> 31     31    1214.014       28.18796              1167.649
#> 32     32    1253.583       32.02876              1200.900
#> 54     54    1244.461       14.02571              1221.391
#> 55     55    1246.414       24.79436              1205.631
#> 93     93    1295.200       21.92183              1259.141
#>    CI.u(90%).Mean.income
#> 8               1325.931
#> 9               1326.489
#> 10              1305.343
#> 11              1374.087
#> 30              1242.092
#> 31              1260.379
#> 32              1306.265
#> 54              1267.532
#> 55              1287.197
#> 93              1331.258


# Quantitative and qualitative variables together: estimation of the
# total of income and of the absolute frequency distribution of sex,
# by marstat:
svystatTM(des,~income+sex,~marstat)
#>             marstat Total.income Total.sexf Total.sexm SE.Total.income
#> married     married    693643975   273569.6   263110.6        18071387
#> unmarried unmarried    368739638   159141.1   153513.4        15118198
#> widowed     widowed     98441036    37350.1    37416.5         8058280
#>           SE.Total.sexf SE.Total.sexm
#> married        9317.489      9690.125
#> unmarried      8208.844      7509.333
#> widowed        4005.659      3631.664


# Estimating totals in domains for "incomplete" partitions: more on
# the AsIs operator I()

  # Estimation of the total income (plus cvpct) ONLY in region 7:
  svystatTM(des,~I(income*(regcod=="7")),vartype="cvpct")
#>                                 Total      CV%
#> I(income * (regcod == "7")) 525145767 3.421359
  # Alternative solution (adding a new variable to des):
  des2 <- des.addvars(des, inc_reg7=I(income*(regcod=="7")))
  svystatTM(des2,~inc_reg7,vartype="cvpct")
#>              Total      CV%
#> inc_reg7 525145767 3.421359

  # Estimation of the total income (plus cvpct) ONLY in regions 6 and 10:
  svystatTM(des,~I(income*as.numeric(regcod %in% c("6","10"))),vartype="cvpct")
#>                                                      Total      CV%
#> I(income * as.numeric(regcod %in% c("6", "10"))) 635678881 2.093386
  # Alternative solution (adding a new variable to des):
  des2 <- des.addvars(des, inc_reg6.10=I(income*(regcod %in% c("6","10"))))
  svystatTM(des2,~inc_reg6.10,vartype="cvpct")
#>                 Total      CV%
#> inc_reg6.10 635678881 2.093386

  # Compare with the corresponding estimates for the "complete" partition,
  # i.e. for regions:
  svystatTM(des,~income,~regcod,vartype="cvpct")
#>    regcod Total.income CV%.Total.income
#> 6       6    361813305         2.776094
#> 7       7    525145767         3.421359
#> 10     10    273865576         3.187308


# Under default settings lonely PSUs produce errors in standard
# errors estimation (notice we didn't pass the fpcs):
data(fpcdat)
des.lpsu<-e.svydesign(data=fpcdat,ids=~psu+ssu,strata=~stratum,
          weights=~w)
if (FALSE) {
svystatTM(des.lpsu,~x+y+z,vartype=c("se","cvpct"))
}

# This can be circumvented in different ways, namely:
old.op <- options("RG.lonely.psu"="adjust")
svystatTM(des.lpsu,~x+y+z,vartype=c("se","cvpct"))
#>        Total        SE      CV%
#> x  1005.1667  199.0398 19.80167
#> y   976.2417  277.1037 28.38474
#> z 19180.1349 4079.4900 21.26935
options(old.op)

# or otherwise:
old.op <- options("RG.lonely.psu"="average")
svystatTM(des.lpsu,~x+y+z,vartype=c("se","cvpct"))
#>        Total        SE      CV%
#> x  1005.1667  220.1591 21.90275
#> y   976.2417  328.8805 33.68843
#> z 19180.1349 4741.9620 24.72330
options(old.op)

# but see also ?collapse.strata for a better alternative.


####################################################
# Household-level estimation in household surveys. #
####################################################

# Large scale household surveys typically adopt a 2-stage sampling
# design with municipalities as PSUs and households as SSUs, in order
# to eventually collect information on each individual belonging to
# sampled SSUs. In such a framework (up to possible total nonresponse
# effects), each individual inside a sampled household shares the
# same direct weight, which, in turn, equals the household weight.
# This implies that it is very easy to build estimates referred to
# SSU-level (households) information, despite estimators actually
# involve only individual values. Some examples are given below.

  # Load survey data:
  data(data.examples)

  # Define the survey design (variable famcod identifies households)
  exdes<-e.svydesign(data=example,ids=~towcod+famcod,strata=~stratum,
         weights=~weight)

  # Collapse strata to eliminate lonely PSUs
  exdes<-collapse.strata(design=exdes,block.vars=~sr:procod)
#> 
#> # All lonely strata (45) successfully collapsed!
#> 
#> Warning: No similarity score specified: achieved strata aggregation depends on the ordering of sample data

  # Now add new convenience variables to the design object:
    ## 'ones':       to estimate individuals counts
    ## 'housize':    to classify individuals by household size
    ## 'houdensity': to estimate households counts
  exdes<-des.addvars(exdes,
                     ones=1,
                     housize=factor(ave(famcod,famcod,FUN = length)),
                     houdensity=ave(famcod,famcod,FUN = function(x) 1/length(x))
                    )

  # Estimate the total number of households:
  nhou<-svystatTM(exdes,~houdensity,vartype="cvpct")
  nhou
#>             Total     CV%
#> houdensity 733831 2.02535

  # Estimate the total number of individuals:
  nind<-svystatTM(exdes,~ones,vartype="cvpct")
  nind
#>         Total      CV%
#> ones 924101.3 2.215888

  # Thus the average number of individuals per household is:
  coef(nind)/coef(nhou)
#>     ones 
#> 1.259284 

  # ...which can be obtained also as a ratio (along with
  # its estimated sampling variability):
  svystatR(exdes,~ones,~houdensity,vartype="cvpct")
#>                    Ratio       CV%
#> ones/houdensity 1.259284 0.8744035

  # Estimate the number and proportion of individuals living in households
  # of given sizes:
  nind.by.housize<-svystatTM(exdes,~housize,vartype="cvpct")
  nind.by.housize
#>             Total       CV%
#> housize1 566584.1  2.326594
#> housize2 291708.4  4.709615
#> housize3  59286.0 12.776275
#> housize4   6522.8 43.582008

  pind.by.housize<-svystatTM(exdes,~housize,estimator="Mean",var="cvpct")
  pind.by.housize
#>                 Mean       CV%
#> housize1 0.613119038  2.008487
#> housize2 0.315667124  3.660218
#> housize3 0.064155304 12.057171
#> housize4 0.007058534 43.343418

  # Estimate the number of households by household size:
  nhou.by.housize<-svystatTM(exdes,~houdensity,~housize,vartype="cvpct")
  nhou.by.housize
#>   housize Total.houdensity CV%.Total.houdensity
#> 1       1         566584.1             2.326594
#> 2       2         145854.2             4.709615
#> 3       3          19762.0            12.776275
#> 4       4           1630.7            43.582008

  # Notice that estimates of individuals and household counts are consistent,
  # indeed:
  coef(nind.by.housize)/coef(nhou.by.housize)
#> housize1 housize2 housize3 housize4 
#>        1        2        3        4

Estimation of Totals and Means in Subpopulations

Arguments

Details

Value

References

See also

Examples

Contents

Author