Modifies a stratified design containing lonely PSUs by collapsing its design strata into superstrata.

collapse.strata(design, block.vars = NULL, sim.score = NULL)

Arguments

design

Object of class analytic (or inheriting from it) containing survey data and sampling design metadata.

block.vars

Formula specifying blocking variables: only strata belonging to the same block will be aggregated (see ‘Details’). If NULL (the default option) no constraints will be imposed.

sim.score

Formula specifying a similarity score for strata: lonely strata will be paired with the most similar stratum in each block (see ‘Details’). If NULL (the default option) random pairs will be formed.

Details

Lonely PSUs (i.e. PSUs which are alone inside a not self-representing stratum) are a concern from the viewpoint of variance estimation. As a general solution, the ReGenesees package can handle the lonely PSUs problem by setting proper variance estimation options (see ReGenesees.options). The collapse.strata function implements a widely used alternative: the so called collapsed strata technique. The basic idea is to build artificial “superstrata” by aggregating strata containing lonely PSUs to other strata, and then to use such superstrata for variance estimation (see e.g. [Wolter 85] and [Rust, Kalton 87]).

The optional argument block.vars identifies “blocking variables” that can be used to constrain the way lonely strata are collapsed to form superstrata. More specifically: first, blocking variables are used to partition sample data in “blocks” via factor crossing, then, only lonely strata belonging to the same block are aggregated. If block.vars=NULL (the default option), no constraint will act on collapsing. The design variables referenced by block.vars (if any) should be of type factor. Errors will be raised if (i) blocks cut across strata, or (ii) block.vars generate any non-aggregable strata (i.e. lonely strata which are a singleton inside a block).

The optional argument sim.score can be used to specify a similarity score for strata aggregation. This means that each lonely stratum will be collapsed with the stratum that has the most similar value of variable sim.score inside the block. Thus the similarity of two strata is actually measured by the (absolute value of the) difference among the corresponding sim.score values. Only one design variable can be referenced by the sim.score formula: (i) it must be of type numeric, (ii) it must be constant inside each stratum, and (iii) it should be positive (otherwise its abs() will be silently used). Note that if no similarity score is specified (i.e. sim.score=NULL), the achieved strata aggregation will depend on the ordering of input sample data in design.

The collapsing algorithm will, whenever possible, build superstrata by pairing a lonely stratum to another not-yet-aggregated stratum. Therefore, in general, superstrata will contain only two design strata. Rare exceptions can arise, e.g. due to constraints, with at most three design strata inside a superstratum. The choice to collapse strata in pairs has been taken because it is known to be appropriate for large-scale surveys with many strata (at least for national level estimates, see e.g. [Rust, Kalton 87]).

The collapse.strata function handles correctly finite population corrections. If design has been built by passing strata sampling fractions via the fpc argument, the function re-computes sampling fractions inside superstrata by exploiting the achieved mapping of strata to superstrata and the fpc slot of design.

Strata Collapse Process Diagnostics

As already observed in the ‘Details’ Section, there are three non trivial reasons why function collapse.strata can run into errors: (1) the blocks cut across strata, (2) some blocks contain a stratum needing to be aggregated while this stratum happens to be the only one inside the block, (3) the similarity score for strata aggregation varies inside strata. In order to help the user to identify such data anomalies, hence taking a step forward to eliminate them, every call to collapse.strata generates, by side effect, a diagnostics data structure named clps.strata.status into the .GlobalEnv (see ‘Examples’).

The clps.strata.status list has three components: the first reports the error message, the second stores a vector identifying the data subsets that have been hit by the anomaly, the third reports the call to collapse.strata that generated the list. For instance, when error condition (1) holds, the second element of clps.strata.status identifies the strata that are cut by blocks; if, instead, error condition (2) holds, the second element of the list identifies the blocks containing non-aggregable strata.

It must be stressed that every call to collapse.strata generates the clps.strata.status list, even when the strata collapsing process ends successfully. In such cases, the first element of the list reports the number of lonely strata that have undergone aggregation, whereas the second is a useful data frame (named clps.table) mapping collapsed strata to superstrata. To be more specific: each row of clps.table identifies a stratum that has been mapped to a superstratum, while the columns of clps.table give: (i) the block to which the stratum belongs, (ii) the stratum name, (iii) a flag indicating if the stratum was lonely or not, (iv) the name of the superstratum to which it has been mapped.

Methodological Warning

A warning must be emphasized: strata similarity score sim.score should be based on prior knowledge and/or on expectations on true values of stratum means for the variable(s) to be estimated, not on current sample data. Indeed, building sim.score by estimating stratum means with the current sample can lead to severe underestimation of sampling variance, i.e. to too tight confidence intervals.

Value

An object of the same class as design, without strata containing lonely PSUs.

References

Wolter, K.M. (2007) “Introduction to Variance Estimation”, Second Edition, Springer-Verlag, New York.

Rust, K., Kalton, G. (1987) “Strategies for Collapsing Strata for Variance Estimation”, Journal of Official Statistics, Vol. 3, No. 1, pp. 69-81.

See also

ReGenesees.options for a different way to handle the lonely PSUs problem (namely by setting variance estimation options).

Examples

############################################## # Explore alternative collapsing strategies. # ############################################## # Build a survey design with lonely PSU strata: data(data.examples) exdes <- e.svydesign(data= example, ids= ~ towcod+famcod, strata= ~ stratum, weights= ~ weight) exdes
#> Stratified 2 - Stage Cluster Sampling Design (with replacement) #> - [80] strata #> - [1307, 2372] clusters #> #> Call: #> e.svydesign(data = example, ids = ~towcod + famcod, strata = ~stratum, #> weights = ~weight)
# Explore 3 possible collapsing strategies: # 1) Aggregate lonely strata by forming random pairs exdes.clps1 <- collapse.strata(exdes)
#> #> # All lonely strata (45) successfully collapsed! #>
#> Warning: No similarity score specified: achieved strata aggregation depends on the ordering of sample data
exdes.clps1
#> Stratified 2 - Stage Cluster Sampling Design (with replacement) #> - [40] strata (collapsed) #> - [1307, 2372] clusters #> #> Call: #> e.svydesign(data = example, ids = ~towcod + famcod, strata = ~stratum, #> weights = ~weight)
# 2) Aggregate lonely strata in pairs under constraints: # i. aggregated strata must be both not self-representing # ii. aggregated strata must belong to the same province (which # is appropriate if e.g. provinces are planned estimation domains) exdes.clps2 <- collapse.strata(exdes,~sr:procod)
#> #> # All lonely strata (45) successfully collapsed! #>
#> Warning: No similarity score specified: achieved strata aggregation depends on the ordering of sample data
exdes.clps2
#> Stratified 2 - Stage Cluster Sampling Design (with replacement) #> - [51] strata (collapsed) #> - [1307, 2372] clusters #> #> Call: #> e.svydesign(data = example, ids = ~towcod + famcod, strata = ~stratum, #> weights = ~weight)
# 3) A WRONG strategy: compute strata similarity score by using # sample estimates of the interest variable (y1) inside strata: old.op <- options("RG.lonely.psu"="remove") stat.score <- svystatTM(design= exdes, ~y1, by= ~ stratum) options(old.op) exdes2<-des.addvars(exdes, sim.score=stat.score[match(stratum,stat.score$stratum),2]) exdes.clps3 <- collapse.strata(exdes2,~sr:procod,~sim.score)
#> #> # All lonely strata (45) successfully collapsed! #>
exdes.clps3
#> Stratified 2 - Stage Cluster Sampling Design (with replacement) #> - [51] strata (collapsed) #> - [1307, 2372] clusters #> #> Call: #> e.svydesign(data = example, ids = ~towcod + famcod, strata = ~stratum, #> weights = ~weight)
# Compute total estimates of y1 at the province level # for all 3 designs with collapsed strata: stat.clps1 <- svystatTM(design= exdes.clps1, y= ~ y1, by= ~ procod, estimator= "Total", vartype= "cvpct") stat.clps2 <- svystatTM(design= exdes.clps2, y= ~ y1, by= ~ procod, estimator= "Total", vartype= "cvpct") stat.clps3 <- svystatTM(design= exdes.clps3, y= ~ y1, by= ~ procod, estimator= "Total", vartype= "cvpct") # Compute the same estimates by using two alternatives # to handle lonely PSUs: # "adjust" option old.op <- options("RG.lonely.psu"="adjust") stat.adj <- svystatTM(design= exdes, y= ~ y1, by= ~ procod, estimator= "Total", vartype= "cvpct") options(old.op) # "average" option old.op <- options("RG.lonely.psu"="average") stat.ave <- svystatTM(design= exdes, y= ~ y1, by= ~ procod, estimator= "Total", vartype= "cvpct") options(old.op) # Lastly, compare achieved estimates for CV percentages: stat.clps1
#> procod Total.y1 CV%.Total.y1 #> 8 8 16594.0 35.488473 #> 9 9 27600.8 29.383709 #> 10 10 90531.0 8.804117 #> 11 11 23455.3 32.666911 #> 30 30 52886.6 12.993996 #> 31 31 13898.1 19.616292 #> 32 32 27779.6 9.259905 #> 54 54 70357.6 14.240086 #> 55 55 25309.9 30.087218 #> 93 93 32698.9 22.939472
stat.clps2
#> procod Total.y1 CV%.Total.y1 #> 8 8 16594.0 28.017982 #> 9 9 27600.8 9.039777 #> 10 10 90531.0 6.457955 #> 11 11 23455.3 13.473811 #> 30 30 52886.6 8.883012 #> 31 31 13898.1 9.545896 #> 32 32 27779.6 9.199254 #> 54 54 70357.6 7.511728 #> 55 55 25309.9 19.593096 #> 93 93 32698.9 8.825899
stat.clps3
#> procod Total.y1 CV%.Total.y1 #> 8 8 16594.0 28.017982 #> 9 9 27600.8 6.592204 #> 10 10 90531.0 6.459646 #> 11 11 23455.3 13.473811 #> 30 30 52886.6 9.430964 #> 31 31 13898.1 11.817371 #> 32 32 27779.6 9.199254 #> 54 54 70357.6 5.892712 #> 55 55 25309.9 19.593096 #> 93 93 32698.9 4.338132
stat.adj
#> procod Total.y1 CV%.Total.y1 #> 8 8 16594.0 28.017982 #> 9 9 27600.8 32.029603 #> 10 10 90531.0 9.335660 #> 11 11 23455.3 33.708200 #> 30 30 52886.6 25.198026 #> 31 31 13898.1 22.486645 #> 32 32 27779.6 9.199254 #> 54 54 70357.6 14.864035 #> 55 55 25309.9 29.597972 #> 93 93 32698.9 24.543208
stat.ave
#> procod Total.y1 CV%.Total.y1 #> 8 8 16594.0 28.017982 #> 9 9 27600.8 9.874840 #> 10 10 90531.0 6.773911 #> 11 11 23455.3 19.905977 #> 30 30 52886.6 15.601568 #> 31 31 13898.1 10.129595 #> 32 32 27779.6 9.199254 #> 54 54 70357.6 5.751701 #> 55 55 25309.9 14.056925 #> 93 93 32698.9 5.907063
# Thus the qualitative features are as expected: the "adjust" option # tends to give conservative sampling variance estimates, the WRONG collapsing # strategy 3) tends to underestimate sampling variance, while other methods # give results in-between those extrema. ########################################################### # A simple way for defining the strata similarity scores. # ########################################################### # Suppose that strata have been clustered in groups of similar # strata. You can, then, use the integer codes of the factor # variable identifying the clusters as a similarity score. # You can do as follows: # Load some data: data(fpcdat) # Build a design object: fpcdes<-e.svydesign(data=fpcdat,ids=~psu+ssu,strata=~stratum,weights=~w) fpcdes
#> Stratified 2 - Stage Cluster Sampling Design (with replacement) #> - [5] strata #> - [10, 19] clusters #> #> Call: #> e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum, #> weights = ~w)
# As we deliberately omitted to specify fpcs, this design # has 2 lonely strata out of 5: find.lon.strata(fpcdes)
#> [1] "S.3" "S.5"
# Now, suppose that factor variable pl.domain identifies clusters of # similar strata... table(fpcdat$stratum,fpcdat$pl.domain)
#> #> pd.1 pd.2 pd.3 #> S.1 7 0 0 #> S.2 0 8 0 #> S.3 0 4 0 #> S.4 0 0 6 #> S.5 0 0 3
# ...hence, the similarity score can be obtained simply... fpcdes<-des.addvars(fpcdes,score=unclass(pl.domain)) # ...and readily be used to drive the strata collapsing: fpcdes.clps<-collapse.strata(fpcdes,sim.score=~score)
#> #> # All lonely strata (2) successfully collapsed! #>
fpcdes.clps
#> Stratified 2 - Stage Cluster Sampling Design (with replacement) #> - [3] strata (collapsed) #> - [10, 19] clusters #> #> Call: #> e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum, #> weights = ~w)
clps.strata.status
#> $message #> [1] "All lonely strata (2) successfully collapsed!" #> #> $clps.table #> block stratum lonely stratum.collapsed #> 1 G S.2 0 G.clps.1 #> 2 G S.3 1 G.clps.1 #> 3 G S.4 0 G.clps.2 #> 4 G S.5 1 G.clps.2 #>
# As we expected from the groups defined by pl.domain, lonely stratum S.2 # has been paired to S.3, and lonely stratum S.5 to S.4. # Should we have omitted to specify a similarity score, we would have # obtained different superstrata: fpcdes.clps2<-collapse.strata(fpcdes)
#> #> # All lonely strata (2) successfully collapsed! #>
#> Warning: No similarity score specified: achieved strata aggregation depends on the ordering of sample data
fpcdes.clps2
#> Stratified 2 - Stage Cluster Sampling Design (with replacement) #> - [3] strata (collapsed) #> - [10, 19] clusters #> #> Call: #> e.svydesign(data = fpcdat, ids = ~psu + ssu, strata = ~stratum, #> weights = ~w)
clps.strata.status
#> $message #> [1] "All lonely strata (2) successfully collapsed!" #> #> $clps.table #> block stratum lonely stratum.collapsed #> 1 G S.1 0 G.clps.1 #> 2 G S.3 1 G.clps.1 #> 3 G S.2 0 G.clps.2 #> 4 G S.5 1 G.clps.2 #>
################################################################# # Few examples to inspect the clps.strata.status list generated # # for diagnostics purposes. # ################################################################# # 1) Ill defined blocks: cutting across strata: if (FALSE) { clps.err1 <- collapse.strata(exdes,~sex) } clps.strata.status
#> $message #> [1] "All lonely strata (2) successfully collapsed!" #> #> $clps.table #> block stratum lonely stratum.collapsed #> 1 G S.1 0 G.clps.1 #> 2 G S.3 1 G.clps.1 #> 3 G S.2 0 G.clps.2 #> 4 G S.5 1 G.clps.2 #>
# 2) Ill defined blocks: generating non-aggregable strata if (FALSE) { clps.err2 <- collapse.strata(exdes,~regcod:stratum) } clps.strata.status
#> $message #> [1] "All lonely strata (2) successfully collapsed!" #> #> $clps.table #> block stratum lonely stratum.collapsed #> 1 G S.1 0 G.clps.1 #> 2 G S.3 1 G.clps.1 #> 3 G S.2 0 G.clps.2 #> 4 G S.5 1 G.clps.2 #>
# 3) Successful collapsing: explore strata to superstrata mapping exdes.ok <- collapse.strata(exdes,~sr:regcod:procod)
#> #> # All lonely strata (45) successfully collapsed! #>
#> Warning: No similarity score specified: achieved strata aggregation depends on the ordering of sample data
clps.strata.status
#> $message #> [1] "All lonely strata (45) successfully collapsed!" #> #> $clps.table #> block stratum lonely stratum.collapsed #> 1 0.10.54 5407 1 0.10.54.clps.1 #> 2 0.10.54 5414 0 0.10.54.clps.1 #> 3 0.10.54 5415 1 0.10.54.clps.2 #> 4 0.10.54 5410 1 0.10.54.clps.2 #> 5 0.10.54 5409 1 0.10.54.clps.3 #> 6 0.10.54 5408 1 0.10.54.clps.3 #> 7 0.10.54 5413 1 0.10.54.clps.4 #> 8 0.10.54 5416 1 0.10.54.clps.4 #> 9 0.10.54 5411 1 0.10.54.clps.5 #> 10 0.10.54 5412 1 0.10.54.clps.5 #> 11 0.10.55 5503 1 0.10.55.clps.1 #> 12 0.10.55 5504 0 0.10.55.clps.1 #> 13 0.10.55 5502 1 0.10.55.clps.1 #> 14 0.6.30 3009 1 0.6.30.clps.1 #> 15 0.6.30 3002 1 0.6.30.clps.1 #> 16 0.6.30 3005 1 0.6.30.clps.1 #> 17 0.6.30 3011 0 0.6.30.clps.2 #> 18 0.6.30 3003 1 0.6.30.clps.2 #> 19 0.6.30 3007 1 0.6.30.clps.3 #> 20 0.6.30 3010 1 0.6.30.clps.3 #> 21 0.6.30 3006 1 0.6.30.clps.4 #> 22 0.6.30 3012 0 0.6.30.clps.4 #> 23 0.6.30 3008 1 0.6.30.clps.5 #> 24 0.6.30 3004 1 0.6.30.clps.5 #> 25 0.6.31 3108 0 0.6.31.clps.1 #> 26 0.6.31 3105 1 0.6.31.clps.1 #> 27 0.6.31 3107 1 0.6.31.clps.2 #> 28 0.6.31 3106 1 0.6.31.clps.2 #> 29 0.6.93 9305 1 0.6.93.clps.1 #> 30 0.6.93 9308 1 0.6.93.clps.1 #> 31 0.6.93 9312 1 0.6.93.clps.1 #> 32 0.6.93 9311 0 0.6.93.clps.2 #> 33 0.6.93 9307 1 0.6.93.clps.2 #> 34 0.6.93 9306 1 0.6.93.clps.3 #> 35 0.6.93 9309 1 0.6.93.clps.3 #> 36 0.6.93 9304 1 0.6.93.clps.4 #> 37 0.6.93 9310 1 0.6.93.clps.4 #> 38 0.7.10 1005 1 0.7.10.clps.1 #> 39 0.7.10 1007 1 0.7.10.clps.1 #> 40 0.7.10 1006 1 0.7.10.clps.2 #> 41 0.7.10 1008 1 0.7.10.clps.2 #> 42 0.7.10 1004 1 0.7.10.clps.3 #> 43 0.7.10 1009 0 0.7.10.clps.3 #> 44 0.7.11 1104 1 0.7.11.clps.1 #> 45 0.7.11 1103 1 0.7.11.clps.1 #> 46 0.7.11 1102 1 0.7.11.clps.1 #> 47 0.7.9 904 1 0.7.9.clps.1 #> 48 0.7.9 902 1 0.7.9.clps.1 #> 49 0.7.9 905 1 0.7.9.clps.1 #> 50 0.7.9 906 1 0.7.9.clps.2 #> 51 0.7.9 903 1 0.7.9.clps.2 #> 52 0.7.9 908 0 0.7.9.clps.3 #> 53 0.7.9 907 1 0.7.9.clps.3 #>