Binds survey data and sampling design metadata.

e.svydesign(data, ids, strata = NULL, weights,
            fpc = NULL, self.rep.str = NULL, check.data = TRUE)

# S3 method for analytic
summary(object, ...)

Arguments

data

Data frame of survey data.

ids

Formula identifying clusters selected at subsequent sampling stages (PSUs, SSUs, ...).

strata

Formula identifying the stratification variable; NULL (the default) implies no stratification.

weights

Formula identifying the initial weights for the sampling units.

fpc

Formula identifying finite population corrections at subsequent sampling stages (see ‘Details’).

self.rep.str

Triggers an approximate variance estimation method for multistage designs (see ‘Details’). The default is NULL; otherwise it must be a formula identifying the self-representing (SR) strata, if any.

check.data

Should the correct nesting of data clusters be checked? The default is TRUE.

object

An object of class analytic, as returned by e.svydesign.

...

Arguments for future extensions.

Details

This function binds the survey data to the metadata describing the adopted sampling design, in an effective and persistent way. Both kinds of information are stored in a complex object of class analytic, which extends the survey.design2 class from the survey package. The sampling design metadata are then used to enable and guide the processing and analyses provided by other functions in the ReGenesees package (such as e.calibrate, svystatTM, ...).

The data, ids and weights arguments are mandatory, while the strata, fpc, self.rep.str and check.data arguments are optional. The data variables referenced by ids, weights and, if specified, by strata, fpc and self.rep.str must not contain any missing values (NA). Empty levels found in any factor variable of data are dropped.

The ids argument specifies the cluster identifiers. It is possible to specify a multistage sampling design by simply using a formula which involves the identifiers of clusters selected at subsequent sampling stages. For example, ids=~id.PSU + id.SSU declares a two-stage sampling in which the first stage units are identified by the id.PSU variable and second stage ones by the id.SSU variable.

The strata argument identifies the stratification variable. The data variable referenced by strata (if specified) must be a factor. By default the sample is assumed to be non-stratified.

The weights argument identifies the initial (or direct) weights for the units included in the sample. The data variable referenced by weights must be numeric. Direct weights must be strictly positive.
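
The requirements stated so far (no missing values in the design variables, a factor for strata, strictly positive numeric weights) can be checked before building the design. The following is a minimal base R sketch, not the checks e.svydesign itself performs: the toy data frame and the helper design_vars_ok are hypothetical and serve only as an illustration.

```r
# Toy survey data (hypothetical values, for illustration only)
toy <- data.frame(
  id.PSU  = c(1, 1, 2, 2, 3, 3),
  stratum = factor(c("A", "A", "A", "A", "B", "B")),
  w       = c(10, 10, 12.5, 12.5, 8, 8)
)

# Hypothetical helper: TRUE if the variables referenced by the design
# formulas satisfy the documented requirements
design_vars_ok <- function(data, ids, weights, strata = NULL) {
  used <- c(all.vars(ids), all.vars(weights),
            if (!is.null(strata)) all.vars(strata))
  # No missing values in any referenced variable
  no_na <- !any(is.na(data[used]))
  # Weights must be numeric and strictly positive
  w <- data[[all.vars(weights)]]
  weights_ok <- is.numeric(w) && all(w > 0)
  # The stratification variable, if any, must be a factor
  strata_ok <- is.null(strata) || is.factor(data[[all.vars(strata)]])
  no_na && weights_ok && strata_ok
}

ok <- design_vars_ok(toy, ids = ~id.PSU, weights = ~w, strata = ~stratum)

# A zero weight violates the "strictly positive" requirement
toy_bad <- toy
toy_bad$w[1] <- 0
bad <- design_vars_ok(toy_bad, ids = ~id.PSU, weights = ~w, strata = ~stratum)
```

Here ok is TRUE and bad is FALSE; the same formulas could then be passed on to e.svydesign.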

The fpc formula specifies the finite population corrections at subsequent sampling stages. By default fpc=NULL, which implies with-replacement sampling.
If the survey has only one stage, then the fpcs can be given either as the total population size in each stratum or as the fraction of the total population that has been sampled. In either case the relevant population size must be expressed in terms of sampling units (be they elementary units or clusters). For example, sampling 100 units from a population stratum of size 500 can be specified as 500 or as 100/500=0.2. Accordingly, passing a column of zeros to fpc again means with-replacement sampling.
For multistage sampling the population size (or the sampling fraction) for each sampling stage should also be specified in fpc. For instance, when ids=~id.PSU + id.SSU the fpc formula should look like fpc=~fpc.PSU + fpc.SSU, with variable fpc.PSU giving the population sizes (or sampling fractions) in each stratum for the first stage units, while variable fpc.SSU gives population sizes (or sampling fractions) for the second stage units in each sampled PSU. Notice that if you choose to pass to fpc population totals (rather than sampling rates) at a given stage, then you must do the same for all stages (and vice versa).
If fpc is specified for fewer stages than ids, sampling is assumed to be complete for the remaining stages. The function checks that the fpc values at each sampling stage do not vary within strata.
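
The two equivalent ways of expressing an fpc, and the requirement that first-stage fpc values be constant within strata, can be sketched in base R as follows. The variable names and numbers are hypothetical, and the factor 1 - n/N is the textbook correction for simple random sampling without replacement, shown here only to make the arithmetic concrete.

```r
n_sampled <- 100   # units sampled in the stratum
N_pop     <- 500   # units in the population stratum

fpc_as_size     <- N_pop               # population size form
fpc_as_fraction <- n_sampled / N_pop   # sampling fraction form: 100/500

# Textbook multiplier (1 - n/N) applied to the with-replacement variance
# under simple random sampling without replacement
fpc_factor <- 1 - fpc_as_fraction

# Hypothetical two-stage layout: fpc.PSU is constant within each stratum,
# fpc.SSU is constant within each sampled PSU (population sizes at both stages)
toy <- data.frame(
  stratum = factor(c("A", "A", "A", "A", "B", "B")),
  id.PSU  = c(1, 1, 2, 2, 3, 3),
  id.SSU  = 1:6,
  fpc.PSU = c(20, 20, 20, 20, 15, 15),
  fpc.SSU = c(50, 50, 40, 40, 30, 30)
)

# TRUE when the first-stage fpc takes a single value in every stratum
psu_fpc_constant <- all(tapply(toy$fpc.PSU, toy$stratum,
                               function(x) length(unique(x)) == 1))
```

With such data the design formulas would read ids=~id.PSU + id.SSU and fpc=~fpc.PSU + fpc.SSU.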

When dealing with a two-stage (multistage) stratified sampling design that includes self-representing (SR) strata (i.e. strata containing only one PSU selected with probability 1), the only (leading) contribution to the variance of SR strata arises from the second stage units (“variance PSUs”).
When options("RG.ultimate.cluster") is FALSE (which is the default for ReGenesees), variance estimation for SR strata is correctly handled provided the survey fpcs have been properly specified. In particular, if fpc=~fpc.PSU + fpc.SSU and the fpcs are specified in terms of sampling fractions, then fpc.PSU must always be equal to one inside SR strata. When, on the contrary, the “Ultimate Cluster Approximation” holds (i.e. options("RG.ultimate.cluster") has been set to TRUE), the SR strata give no contribution at all to the sampling variance.

A compromise solution (adopted by earlier survey software) is to retain, for both SR and not-SR strata, only the leading contribution to the sampling variance. This means that only the SSUs are relevant for SR strata, whereas only the PSUs matter in not-SR strata. This compromise solution can be achieved by using the self.rep.str argument. If this argument is specified (as a formula referencing the data variable that identifies the SR strata), a warning is generated to remind the user that an approximate variance estimation method will be adopted for that design. Notice that, when using self.rep.str, the user must ensure that the referenced variable is logical (with value TRUE for SR strata and FALSE otherwise), numeric (with value 1 for SR strata and 0 otherwise) or a factor (with level "1" for SR strata and "0" otherwise).
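
The three admissible encodings of the SR indicator can be built in base R as sketched below. The data frame and its f.PSU column are hypothetical, and flagging as SR the strata whose first-stage sampling fraction equals one is an assumption made for this illustration.

```r
# Hypothetical first-stage sampling fractions, one row per stratum
strata_info <- data.frame(
  stratum = c("A", "B", "C"),
  f.PSU   = c(0.10, 1.00, 0.25)
)

# Logical encoding: TRUE for SR strata, FALSE otherwise
sr_logical <- strata_info$f.PSU == 1

# Numeric encoding: 1 for SR strata, 0 otherwise
sr_numeric <- as.numeric(sr_logical)

# Factor encoding: levels "0" and "1"
sr_factor <- factor(sr_numeric, levels = c(0, 1))
```

Any one of these variables, once merged into the survey data, could be referenced through a formula such as self.rep.str=~sr.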

The optional argument check.data controls whether the correct nesting of data clusters (PSUs, SSUs, ...) is checked. If check.data=TRUE the function checks that every unit selected at stage k+1 is associated with one and only one unit selected at stage k. For a stratified design the function also checks the correct nesting of clusters within strata.
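
The nesting condition can be illustrated with a small base R sketch. This is not the check implemented by e.svydesign: the helper is_nested and the toy identifiers are hypothetical and only mirror the rule that each stage k+1 unit belongs to exactly one stage k unit.

```r
toy <- data.frame(
  id.PSU = c(1, 1, 2, 2, 3),
  id.SSU = c(11, 12, 21, 22, 12)   # SSU 12 appears under both PSU 1 and PSU 3
)

# Hypothetical helper: TRUE when each lower-stage id maps to exactly one
# upper-stage id
is_nested <- function(upper, lower) {
  pairs <- unique(data.frame(upper, lower))
  !any(duplicated(pairs$lower))
}

nested_before <- is_nested(toy$id.PSU, toy$id.SSU)

# Relabel the offending SSU so that it is unique to PSU 3
toy$id.SSU[5] <- 31
nested_after <- is_nested(toy$id.PSU, toy$id.SSU)
```

Here nested_before is FALSE and nested_after is TRUE; the same helper applied to strata and PSU identifiers would mirror the check of clusters within strata.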

PPS Sampling Designs

Probability proportional to size sampling with replacement does not pose any problem: one must simply specify fpc=NULL and pass the right weights. This holds also for multistage designs, where PSUs are selected with replacement with PPS inside strata. Moreover, when the PSUs are sampled with replacement, the only contribution to the variance arises from the estimated PSU totals, and one can simply ignore any available information about subsequent sampling stages.
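
As an illustration of what "the right weights" could look like, the sketch below computes Hansen-Hurwitz weights, 1/(m * p), for PPS sampling with replacement. Equating the weights above with Hansen-Hurwitz weights is an assumption made here, and the size measures, the number of draws and the drawn units are all hypothetical.

```r
# Hypothetical size measures for a population of 3 PSUs
size <- c(50, 30, 20)

# Single-draw selection probabilities, proportional to size
p <- size / sum(size)

m     <- 2         # number of with-replacement draws
drawn <- c(1, 2)   # indices of the PSUs drawn (fixed here, not random)

# Hansen-Hurwitz weights for the drawn PSUs
w_pps <- 1 / (m * p[drawn])
```

Such weights would be stored in the survey data and referenced through the weights formula, leaving fpc=NULL.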

For unequal probability sampling without replacement, on the contrary, correct variance estimates require knowledge of the second-order inclusion probabilities under the sampling design at hand. Unfortunately, these probabilities cannot generally be computed, so one has to resort to some viable approximation. The simplest one is to treat the PSUs as if they had been sampled with replacement, even though this is not actually the case. It is worth stressing that this approach results in conservative estimates. Moreover, the variance overestimation is expected to be negligible as long as the actual sampling fractions of PSUs are close to zero. Notice that this "with replacement" approximation can be achieved either by not specifying fpc, or by passing a column of zeros to the PSUs term of fpc.
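
The size of the overestimation can be gauged with a deliberately simplified calculation. The sketch assumes equal-probability simple random sampling of PSUs, which is not the unequal probability setting discussed above: in that case the with-replacement variance exceeds the without-replacement one by the factor 1/(1 - f), where f is the sampling fraction.

```r
# Hypothetical PSU sampling fractions
f <- c(0.001, 0.01, 0.1, 0.5)

# Ratio of with-replacement to without-replacement variance under
# simple random sampling (a simplifying assumption)
overestimation <- 1 / (1 - f)

# Relative overestimation, in percent
pct_over <- 100 * (overestimation - 1)
```

The ratio stays close to 1 for small f and grows quickly as f approaches 1, which is consistent with the statement that the approximation is harmless when sampling fractions are close to zero.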

Value

An object of class analytic. The print method for that class gives a concise description of the sampling design. The summary method provides further details. Objects of class analytic persistently store input survey data inside their variables component. Weights can be accessed by using the weights function.

Note

The analytic class is a specialization of the survey.design2 class from the survey package [Lumley 06]; this means that an object created by e.svydesign inherits from the survey.design2 class and you can use on it every method defined on the latter class.

References

Särndal, C.-E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer-Verlag.

Lumley, T. (2006) “survey: analysis of complex survey samples”, https://CRAN.R-project.org/package=survey.

Zardetto, D. (2015) “ReGenesees: an Advanced R System for Calibration, Estimation and Sampling Error Assessment in Complex Sample Surveys”. Journal of Official Statistics, 31(2), 177-203. doi: https://doi.org/10.1515/jos-2015-0013.

See also

svystatTM, svystatR, svystatS, svystatSR, svystatB, svystatQ, svystatL for calculating estimates and standard errors, e.calibrate for calibrating weights, ReGenesees.options for setting/changing variance estimation options, collapse.strata for the suggested way of handling lonely PSUs, weights to extract weights.

Examples

##############################################################
# The following examples illustrate how to create objects    #
# (of class 'analytic') defining different sampling designs. #
# Note: sometimes the same survey data will be used to       #
# define more than one design: this serves only the purpose  #
# of illustrating e.svydesign syntax.                        #
##############################################################

data(data.examples)

# Two-stage stratified cluster sampling design (notice that
# the design contains lonely PSUs):
des <- e.svydesign(data = example, ids = ~towcod + famcod, strata = ~stratum,
                   weights = ~weight)
des
#> Stratified 2 - Stage Cluster Sampling Design (with replacement)
#> - [80] strata
#> - [1307, 2372] clusters
#>
#> Call:
#> e.svydesign(data = example, ids = ~towcod + famcod, strata = ~stratum,
#>     weights = ~weight)

# Use the summary() function if you need some additional details, e.g.:
summary(des)
#> Stratified 2 - Stage Cluster Sampling Design (with replacement)
#> - [80] strata
#> - [1307, 2372] clusters
#>
#> Call:
#> e.svydesign(data = example, ids = ~towcod + famcod, strata = ~stratum,
#>     weights = ~weight)
#>
#> Probabilities:
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#> 0.001074 0.002760 0.003112 0.003718 0.004570 0.010471
#>
#> Sample stratum sizes:
#>      801 802 803 901 902 903 904 905 906 907 908 1001 1002 1003 1004 1005 1006
#> Obs   24  20  69  48  22  28  23  31  21   6  17  487   19   25   22   27   25
#> PSUs  21  17   3  35   1   1   1   1   1   1   2  402   16   23    1    1    1
#>      1007 1008 1009 1101 1102 1103 1104 3001 3002 3003 3004 3005 3006 3007 3008
#> Obs    28   40   34   39   24   29   15   36   17   19   15   12   30   23   28
#> PSUs    1    1    2   31    1    1    1   32    1    1    1    1    1    1    1
#>      3009 3010 3011 3012 3101 3102 3103 3104 3105 3106 3107 3108 3201 3202 3203
#> Obs    12   26   15   26   51   31   15   18   30   19   32   34  152    6    2
#> PSUs    1    1    2    2   43   28   12   12    1    1    1    2  128    6    2
#>      3204 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414
#> Obs    12  165   66   48   60   34   28   37   31   24   24   35   20   24   24
#> PSUs    2  131   47   41   43   26   24    1    1    1    1    1    1    1    2
#>      5415 5416 5501 5502 5503 5504 9301 9302 9303 9304 9305 9306 9307 9308 9309
#> Obs    38   26   75   39   32   28   57   14   17   26   34   33   22   35   36
#> PSUs    1    1   58    1    1    2   39   12   12    1    1    1    1    1    1
#>      9310 9311 9312
#> Obs    24   33   27
#> PSUs    1    2    1
#>
#> Data variables:
#>  [1] "towcod"       "famcod"       "key"          "weight"       "stratum"
#>  [6] "SUPERSTRATUM" "sr"           "regcod"       "procod"       "x1"
#> [11] "x2"           "x3"           "y1"           "y2"           "y3"
#> [16] "age5c"        "age10c"       "sex"          "marstat"      "z"
#> [21] "income"

# Use the 'variables' slot to extract survey data, e.g.:
head(des$variables)
#>   towcod famcod key weight stratum SUPERSTRATUM sr regcod procod x1 x2 x3 y1 y2
#> 1    147   3103   1  485.8     803           26  0      7      8  0  0  0  0  0
#> 2    147   3103   2  485.8     803           26  0      7      8  0  0  0  1  1
#> 3    147   3109   3  485.8     803           26  0      7      8  0  0  0  1  1
#> 4    147   3111   4  485.8     803           26  0      7      8  0  0  0  0  0
#> 5    147   3120   5  485.8     803           26  0      7      8  0  0  1  1  1
#> 6    147   3121   6  485.8     803           26  0      7      8  0  0  0  0  0
#>   y3 age5c age10c sex   marstat         z income
#> 1  0     3      5   f unmarried 148.32432   1158
#> 2  0     2      4   f   married  88.57746   1268
#> 3  0     3      6   f   married 115.07377    108
#> 4  0     4      7   f   married  86.37647   1700
#> 5  0     2      4   f   married 110.52172    537
#> 6  0     3      5   f   married 134.40092   2143

# Use the weights() function to extract weights, e.g.:
summary(weights(des))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    95.5   218.8   321.3   308.0   362.3   931.2

# Again the same design, but using collapsed strata (SUPERSTRATUM variable)
# to remove lonely PSUs:
des <- e.svydesign(data = example, ids = ~towcod + famcod, strata = ~SUPERSTRATUM,
                   weights = ~weight)
des
#> Stratified 2 - Stage Cluster Sampling Design (with replacement)
#> - [55] strata
#> - [1307, 2372] clusters
#>
#> Call:
#> e.svydesign(data = example, ids = ~towcod + famcod, strata = ~SUPERSTRATUM,
#>     weights = ~weight)

# Two stage cluster sampling (no stratification):
des <- e.svydesign(data = example, ids = ~towcod + famcod, weights = ~weight)
des
#> 2 - Stage Cluster Sampling Design (with replacement)
#> - [1307, 2372] clusters
#>
#> Call:
#> e.svydesign(data = example, ids = ~towcod + famcod, weights = ~weight)

# Stratified unit sampling design:
des <- e.svydesign(data = example, ids = ~key, strata = ~SUPERSTRATUM,
                   weights = ~weight)
des
#> Stratified Independent Unit Sampling Design (with replacement)
#> - [55] strata
#> - [3000] units
#>
#> Call:
#> e.svydesign(data = example, ids = ~key, strata = ~SUPERSTRATUM,
#>     weights = ~weight)

data(sbs)

# One-stage stratified unit sampling without replacement
# (notice the presence of the fpc argument):
des <- e.svydesign(data = sbs, ids = ~id, strata = ~strata, weights = ~weight,
                   fpc = ~fpc)
des
#> Stratified Independent Unit Sampling Design
#> - [664] strata
#> - [6909] units
#>
#> Call:
#> e.svydesign(data = sbs, ids = ~id, strata = ~strata, weights = ~weight,
#>     fpc = ~fpc)

# Same design as above but ignoring the finite population corrections:
des <- e.svydesign(data = sbs, ids = ~id, strata = ~strata, weights = ~weight)
des
#> Stratified Independent Unit Sampling Design (with replacement)
#> - [664] strata
#> - [6909] units
#>
#> Call:
#> e.svydesign(data = sbs, ids = ~id, strata = ~strata, weights = ~weight)

data(fpcdat)

# Two-stage stratified cluster sampling without replacement
# (notice that the fpcs are specified for both stages):
des <- e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum, weights = ~w,
                   fpc = ~fpc1 + fpc2)
des
#> Stratified 2 - Stage Cluster Sampling Design
#> - [5] strata
#> - [10, 19] clusters
#>
#> Call:
#> e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum,
#>     weights = ~w, fpc = ~fpc1 + fpc2)

# Same design as above but assuming complete sampling for the
# second stage units (notice fpcs have been passed only for the
# first stage):
des <- e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum, weights = ~w,
                   fpc = ~fpc1)
des
#> Stratified 2 - Stage Cluster Sampling Design
#> - [5] strata
#> - [10, 19] clusters
#>
#> Call:
#> e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum,
#>     weights = ~w, fpc = ~fpc1)

# Again a two-stage stratified cluster sampling without replacement but
# specified in such a way as to retain, in the estimation phase, only
# the leading contribution to the sampling variance (i.e. the one arising
# from SSUs in SR strata and PSUs in not-SR strata). Notice that the
# self.rep.str argument is used:
des <- e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum, weights = ~w,
                   fpc = ~fpc1 + fpc2, self.rep.str = ~sr)
#> Warning: Sampling variance estimation for this design will take into account only leading contributions, i.e. PSUs in not-SR strata and SSUs in SR strata (see ?e.svydesign and ?ReGenesees.options for details)
des
#> Stratified 2 - Stage Cluster Sampling Design
#> - [5] strata
#> - [12, 19] clusters
#>
#> Call:
#> e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum,
#>     weights = ~w, fpc = ~fpc1 + fpc2, self.rep.str = ~sr)