


Estimating means and standard errors from LTRMP survey data
By Brian Gray
Last revised 10 June 2005
Introduction
The LTRMP employs both probabilistic and deterministic survey designs (see LTRMP survey sampling designs). Sample information from the former may be used to make inferences about the populations from which those samples were drawn. For example, the prevalence of submersed aquatic vegetation in the population of potentially sampleable units may be estimated from sample prevalence data. This "upscaling," or generalizability, does not generally hold for data derived from judgement samples, which supply the majority of the sampling locations in the deterministic LTRMP designs. For this reason, the balance of this memo is concerned with the use of sample information from the LTRMP's probabilistic designs.
The LTRMP estimates annual means and associated standard errors by relying on the sampling design rather than on presumed distributional assumptions associated with the observed data. These so-called "design-based" methods accommodate complexities often associated with survey designs, including stratification and nonproportional sampling. Extensions of these methods, termed "model-assisted design-based," may be used to estimate temporal trends. A useful comparison of design- and model-based methods for the analysis of survey data is provided by Lohr (1999).
The LTRMP does not currently adjust statistics for detection probabilities or capture efficiencies. Consequently, statistics from the Program must be treated as index statistics rather than as abundances or even as relative abundances. This issue has been addressed in a host of recent papers (e.g., MacKenzie et al. 2005). Due to details discussed under design, this issue should be presumed most critical for statistics presented by the fish component. Unfortunately, information required to adjust LTRMP status and trend statistics for nondetection is historically available only from the cluster data from the vegetation component.
Sample inclusion probabilities
A sample inclusion probability is the probability that an individual population unit (for the LTRMP, a grid point) is selected for sampling. For the LTRMP, these inclusion probabilities have historically varied across strata (within a given pool) but have been constant within strata. Example: an LTRMP investigator selects 20 grid points from each of strata i and j for sampling. If the population sizes of these strata are 1000 and 2000 units (grid points), then the sample inclusion probabilities are 20/1000 = 2% and 20/2000 = 1%, respectively. For inclusion probabilities to be constant across strata (i.e., "proportional to size"), the number of grid points selected would need to be directly proportional to the strata sizes (e.g., select 10 and 20 units for sampling, respectively).
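The arithmetic in the example above can be written out as follows. The symbols n_h (sample size), N_h (population size) and π_h (inclusion probability) for stratum h are notation introduced here for illustration; they are not taken from LTRMP documentation.

```latex
\pi_h = \frac{n_h}{N_h}, \qquad
\pi_i = \frac{20}{1000} = 0.02, \qquad
\pi_j = \frac{20}{2000} = 0.01
% proportional allocation: equal inclusion probabilities across strata
\frac{10}{1000} = \frac{20}{2000} = 0.01
```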
Sampling weights
The LTRMP uses sampling weights to address the effects of nonproportional sampling. Sampling weights for the LTRMP are generally defined as the inverses of the sample inclusion probabilities, and they may also be viewed as the number of potentially sampleable units represented by a given sampled unit. Using the example in the "Sample inclusion probabilities" section, each sampled unit in strata i and j is viewed as representing 50 and 100 (i.e., 1000/20 and 2000/20) potentially sampleable units, respectively. Consequently, the sampling weights, summed over strata, will correspond to the population size for the analysis (for most analyses, this population size corresponds to that listed as "Total" in LTRMP population sizes).
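The same arithmetic can be checked in SAS. The sketch below uses the two hypothetical strata from the "Sample inclusion probabilities" example; the data set and variable names (toy_weights, capN, incl_prob, wt_total) are made up for illustration and are not part of any LTRMP data set.

```sas
* Hypothetical toy data: two strata with 20 sampled grid points each;
data toy_weights;
  input stratum $ capN n;      * capN = population size and n = sample size;
  incl_prob = n / capN;        * 0.02 for stratum i and 0.01 for stratum j;
  sweight   = 1 / incl_prob;   * 50 for stratum i and 100 for stratum j;
  wt_total  = n * sweight;     * weights summed within a stratum equal capN;
  datalines;
i 1000 20
j 2000 20
;
run;

* The grand total of wt_total should equal 1000 + 2000 = 3000;
proc print data=toy_weights noobs;
  sum wt_total;
run;
```

The population size is named capN rather than N because SAS variable names are not case sensitive, so N and n would refer to the same variable.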
In some cases, locations selected for sampling were not sampled. This might have occurred, for example, when the intended sampling location was inaccessible. At present, the LTRMP either treats these missing observations as missing at random (vegetation component) or substitutes pre-defined alternate locations (all other components). If unsampled locations were either i) not missing at random or ii) not interchangeable with the alternate locations for the given metric, then we may expect our reported statistics to reflect bias of unknown magnitude. At present, the LTRMP ignores the issue of missing data, and sample inclusion probabilities are estimated using the observed rather than intended number of sample units. (The importance of missing observations for the LTRMP remains essentially unexamined but is expected to be addressed beginning fiscal year 2007.)
Calculating inclusion probabilities and sampling weights
Population sizes (the numbers of potentially sampleable units or grid points) are provided in LTRMP population sizes by strata, LTRMP component and field station.
Steps for calculating inclusion probabilities and sampling weights:
- Obtain population sizes by strata from LTRMP population sizes
- Obtain sample sizes for the metric of interest by strata and by sampling event
- Divide sample sizes by population sizes; the quotients are your effective or observed inclusion probabilities
- The reciprocals of the inclusion probabilities are the sampling weights
- Ensure the sampling weights sum to the desired totals
- Address the following exceptions:
a. The fish component repeats a spatially stratified sampling in each of three seasons per year (two seasons per year beginning in 2005). To ensure sampling weights sum to the total number of sampleable units per stratum (Ns), sampling weights are divided by 3 (when estimating over three periods) or by 2 (when estimating over two periods).
b. Population sizes in the impounded and main channel strata of the water component decreased 16-fold in 1995 (as the length of each grid cell increased four-fold). This issue is addressed by using the pre-1995 population sizes to generate sampling weights (finite population corrections, if employed, should instead use the actual population sizes).
c. For vegetation, population sizes decreased in 1999 because the sampled portions of the spatial strata were restricted to ≤ 3 m in 1998 and ≤ 2.5 m thereafter. Also, population sizes for some strata in Pool 4 changed in 2000. Both issues are addressed under "Estimating means and standard errors," below.
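Exception (a) can be checked with a line of algebra. Here P denotes the number of sampling periods per year, N_h the population size of spatial stratum h and n_{hp} the number of units sampled in stratum h during period p; this notation is introduced for illustration only.

```latex
w_{hp} = \frac{N_h}{P \, n_{hp}}, \qquad
\sum_{p=1}^{P} \sum_{k=1}^{n_{hp}} w_{hp}
  = \sum_{p=1}^{P} \frac{N_h}{P}
  = N_h
```

With P = 3 and, as a hypothetical case, N_h = 1000 and n_{hp} = 20 in every period, each weight is 1000/60 ≈ 16.7 and the 60 weights sum to 1000.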
Example code that generates inclusion probabilities and sampling weights
***************************
The following code generates strata-specific population sizes for each period and years 1993 through 2004. The code may be adopted for use by the nonfish components by dropping the references to 'period.'
* enter population sizes (copied from LTRMP population sizes);
data fish_capN_h;
input Component $ fstation Stratum2 $ Stratum_Code _total_;
* variable is named stratum2 rather than stratum to avoid conflict with SAS conventions;
format stratum2 $5. period best8.;
if stratum2 = "TOTAL" then delete;
drop component;
do year=1993 to 2004 by 1; do period=1 to 3 by 1; output; end; end;
datalines;
FISH 1 MCB-O 1502 1486
FISH 1 MCB-S 1503 766
FISH 1 SC 1504 2887
FISH 1 BWC-O 1510 5073
FISH 1 BWC-S 1511 3860
FISH 1 TOTAL . 14072
FISH 2 MCB-O 1502 1620
FISH 2 MCB-S 1503 756
FISH 2 SC 1504 4148
FISH 2 BWC-O 1510 1978
FISH 2 BWC-S 1511 3434
FISH 2 IMP-O 1520 13204
FISH 2 IMP-S 1521 494
FISH 2 TOTAL . 25634
FISH 3 MCB-O 1502 3527
FISH 3 MCB-S 1503 910
FISH 3 SC 1504 2758
FISH 3 BWC-O 1510 5877
FISH 3 BWC-S 1511 3734
FISH 3 IMP-O 1520 10002
FISH 3 IMP-S 1521 438
FISH 3 TOTAL . 27246
FISH 4 MCB-O 1502 10032
FISH 4 MCB-S 1503 3199
FISH 4 SC 1504 5671
FISH 4 BWC-O 1510 358
FISH 4 BWC-S 1511 764
FISH 4 IMP-O 1520 588
FISH 4 IMP-S 1521 172
FISH 4 TOTAL . 20784
FISH 5 MCB-O 1502 10001
FISH 5 MCB-S 1503 2592
FISH 5 SC 1504 1872
FISH 5 TOTAL . 14465
FISH 6 MCB-O 1502 4829
FISH 6 MCB-S 1503 4935
FISH 6 SC 1504 653
FISH 6 BWC-O 1510 6946
FISH 6 BWC-S 1511 3616
FISH 6 TOTAL . 20979
run;
* sort population size file;
proc sort data=fish_capN_h out=fish.fish_capN_h;
by fstation year stratum2 period;
run;
* sort input data file;
proc sort data=fish.bluegills out=bluegillssort;
by fstation year stratum2 period;
run;
*** generate effective sampling probabilities by fs, year, stratum and period;
ods listing close; * send nothing to screen;
proc surveymeans data=bluegillssort total=fish.fish_capN_h;
stratum stratum2 period / list;
var var1; * enter 1 variable here. variable choice is largely immaterial;
by fstation year;
ods output Statistics=Stats;
ods output Stratainfo=Stratainfo;
run;
ods listing;
* merge data and sampling weight files;
data bluegillsortweights;
merge bluegillssort stratainfo;
by fstation year stratum2 period;
sweight = 1/rate; * only appropriate if period is ignored;
spweight = 1/(3*rate); * divide by 3 because fish repeats its SRS in each of 3 seasons per year;
run;
* ensure sum of weights equal totals;
proc means sum data=bluegillsortweights noprint;
var sweight spweight;
output out=osum sum=sumswt sumspwt;
by fstation year;
run;
proc print; run;
***************************************
Finite population correction factors
Given a finite number of potential sampling locations, the variance of a statistic will decrease as increasing proportions of those locations are sampled. For example, when the entire population is "sampled," the design-based sampling variance is zero (because the population was censused). While corrections for the sampling fraction of a population may be made using finite population correction factors, such corrections are often ignored when sample inclusion probabilities are less than 10%. As LTRMP inclusion probabilities are uniformly <10%, the LTRMP does not adjust standard errors for finite population sampling.
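For reference, the textbook form of the correction for a sample mean under simple random sampling without replacement is shown below, where n is the sample size, N the population size and s² the sample variance. This is the standard result (see Lohr 1999), not a formula quoted from LTRMP documentation.

```latex
\widehat{\mathrm{Var}}(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{s^2}{n}
```

At a sampling fraction of n/N = 0.10, the factor in parentheses is 0.90 and the standard error is multiplied by √0.90 ≈ 0.95. Omitting the correction therefore overstates the standard error by roughly 5% at most when sampling fractions are below 10%.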
Estimating design-based means and associated standard errors
The LTRMP estimates design-based means by weighting sample outcomes by sampling weights, and uses a Taylor series approximation to estimate corresponding sampling errors. Both means and standard errors are estimated using the SAS® surveymeans procedure (proc surveymeans; SAS 2003). Technical details are provided by component:
Estimating means and standard errors from fish component data
The fish component's design is stratified in space and time (season). As "status" is typically estimated at the pool-year scale, means and standard errors will generally be estimated separately by field station and sampling year. In these cases, the following code is used:
proc surveymeans data=datasetname mean stderr;
strata spatialstratum tempstratum / list;
var var1 var2 etc; * enter variables of interest here;
weight sweight; * denotes sampling weight;
by fieldstation year;
run;
Means by spatial or temporal strata are obtained by moving the appropriate stratum terms from the 'strata' line to the 'by' line. Confidence intervals associated with such means and standard errors should be footnoted with a small-sample size caveat.
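As a sketch of the change described above, the code below estimates means by spatial stratum within field station and year, leaving season as the only design stratum. The toy data set (toy_fish) and its values are hypothetical and exist only to make the example self-contained; the variable names follow the placeholder names used in the code above.

```sas
* Hypothetical toy data: 2 spatial strata x 2 seasons x 2 sampled units;
data toy_fish;
  input fieldstation year spatialstratum $ tempstratum sweight var1;
  datalines;
1 2004 BWC 1 50 3
1 2004 BWC 1 50 5
1 2004 BWC 2 50 2
1 2004 BWC 2 50 4
1 2004 SC 1 100 0
1 2004 SC 1 100 1
1 2004 SC 2 100 2
1 2004 SC 2 100 1
;
run;

* BY-group processing requires data sorted by the BY variables;
proc sort data=toy_fish out=toy_fish_sorted;
  by fieldstation year spatialstratum;
run;

* spatialstratum has moved from the strata line to the by line;
proc surveymeans data=toy_fish_sorted mean stderr;
  strata tempstratum / list;
  var var1;
  weight sweight;
  by fieldstation year spatialstratum;
run;
```

Each by group here contains only four observations, which illustrates why the small-sample caveat applies to stratum-level confidence intervals.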
Wing dams and sampling locations within wing dams are not selected differently from locations within other strata sampled by the fish component (Gutreuter et al. 1995). Consequently, wing dam sampling information is excluded from annual means reported by the fish component (Gutreuter et al. 1995).
Estimating means and standard errors from macroinvertebrate component data
The macroinvertebrate component defined strata in space only. Sample sizes have typically been moderate in backwater and impounded strata (n ≈ 50), intermediate in side channel strata (n ≈ 20) and small in main channel border strata (n ≈ 10). Means and standard errors for macroinvertebrate count and presence/absence outcomes will generally be estimated separately for each field station and sampling year. In these cases, the following code is used:
proc surveymeans data=datasetname mean stderr;
stratum spatialstratum / list;
var var1 var2 etc; * enter variables of interest here;
weight sweight; * denotes sampling weight;
by fieldstation year;
run;
Means by strata are obtained by moving the appropriate stratum terms from the 'strata' line to the 'by' line. Where such means and standard errors are derived from within-strata sample sizes < 50, confidence intervals associated with those means and standard errors should be footnoted with a small-sample size caveat.
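For a design stratified in space only, the weighted mean can be written out explicitly. The notation below is introduced here for illustration: y_{hk} is the outcome for sampled unit k in stratum h, n_h and N_h are the stratum sample and population sizes, N is the sum of the N_h, and ȳ_h and s_h² are the stratum sample mean and variance. The sketch assumes simple random sampling within strata, weights w_h = N_h/n_h and no finite population correction.

```latex
\hat{\bar{y}}
  = \frac{\sum_h \sum_{k=1}^{n_h} w_h \, y_{hk}}{\sum_h \sum_{k=1}^{n_h} w_h}
  = \sum_h \frac{N_h}{N} \, \bar{y}_h ,
\qquad
\widehat{\mathrm{Var}}\!\left(\hat{\bar{y}}\right)
  = \sum_h \left(\frac{N_h}{N}\right)^{2} \frac{s_h^2}{n_h}
```

Because the weights sum to the fixed total N under these assumptions, the weighted mean coincides with the usual stratified estimator, and strata with small n_h and large N_h dominate the variance.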
Estimating means and standard errors from vegetation component data
Percent frequency of occurrence (prevalence) and richness indices
The vegetation component defines strata in space only and, with the exception of the percent cover variable, employs a cluster-based design. Sample sizes vary considerably by strata but are often large (n>50). Means and standard errors for presence/absence outcomes will generally be estimated separately for each taxa group, field station and sampling year. In these cases, the following code is used:
proc surveymeans data=datasetname mean stderr;
stratum spatialstratum / list;
var var1 var2 etc; * enter variables of interest here;
cluster site; * primary sampling unit = site;
weight sweight; * denotes sampling weight;
by fieldstation year;
run;
Presence/absence data may need to be entered within a class statement: See comment on coding binary outcomes under 'Generalities,' below.
The LTRMP vegetation component intends to begin estimating prevalence rather than prevalence index in 2006, and richness rather than richness index in 2006 or 2007. These shifts will adjust for nondetection by exploiting extra information resulting from the vegetation component's cluster-based design.
The vegetation component's sampling frame for all pools was revised in 1999, and strata in Pool 4 were split in 2000. Consequently, means and SEs from 1998 (all pools) and for upper and lower Pool 4 in 1999 require adjustment for uncertainty associated with sample sizes that were unknown prior to the restratification; this issue will be addressed in 2006.
If sampling date is confounded with strata (as occurs in Pool 8, where strata and sampling sequence both follow an approximate north-south sequence), strata means are confounded with date and should be reported only with appropriate clarification, adjustment for date or both.
Estimating means and standard errors from water quality component data
The water quality component employs a spatially stratified design that is replicated within each of four seasons. Sample sizes have typically been moderate in backwater, impounded and side channel strata and small in main channel border strata. As interest is typically in within-season estimates, means and standard errors will generally be estimated separately for each field station, season and sampling year. In these cases, the following code is used:
proc surveymeans data=datasetname mean stderr;
stratum spatialstratum / list;
var var1 var2 etc; * enter variables of interest here;
weight sweight; * denotes sampling weight;
by fieldstation season year;
run;
Issues associated with estimating and using means and standard errors
· The above sets of code require identical sampling weights for every variable entered on the 'var' line. Where this assumption is not met (such as for many water quality constituents), separate analyses must be run.
· Means are not presumed to be independent in time.
· If sampling intensities exceed some minimum (typically set at 10%), then finite population corrections should be made by using the 'total' option on the 'proc' line. For example: <proc surveymeans data=datasetname mean stderr total=;>. High sampling intensities are most likely to occur in strata with small population sizes. Examples include water quality's main channel border and impounded strata (beginning spring 1995).
· Binary outcomes that are coded numerically should be listed in a class statement unless "presence" = 1 and "absence" = 0. Binary outcomes that are coded as character (e.g., "present" and "absent") are automatically treated as categorical variables.
· Estimating means and standard errors from a subpopulation not defined by the design requires methods that acknowledge that the number of samples in the subpopulation was a random variable. This issue is most commonly faced when estimating means and standard errors for Upper and Lower Pool 4. For vegetation, however, all strata but the backwater stratum have been split into upper and lower strata. Under the presumption that the proportion of backwater samples derived from upper and lower portions of Pool 4 is small, this correction is not performed for estimates from the vegetation component.
· Estimating species richness is generally beyond the scope of this paper. However, methods for estimating species richness are reviewed by MacKenzie et al. (2005). More traditional methods are required for fisheries and macroinvertebrate data while more recent approaches (e.g., Dorazio and Royle 2005) may be used with data from the vegetation component's clustered design.
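Two of the points in the list above, finite population corrections and estimates for subpopulations not defined by the design, can be sketched in SAS as follows. The data sets (toy_totals, toy_pool4), the variable reach and all values are hypothetical. The use of the domain statement is an assumption about one way to handle such subpopulations in proc surveymeans; it is not documented here as the LTRMP's method.

```sas
* Hypothetical population sizes with one row per stratum;
* The total= option expects the stratum variable plus _total_;
data toy_totals;
  input spatialstratum $ _total_;
  datalines;
BWC 40
SC 200
;
run;

* Hypothetical sample where reach (upper or lower) is not a design stratum;
data toy_pool4;
  input spatialstratum $ reach $ sweight present;
  datalines;
BWC upper 10 1
BWC upper 10 0
BWC lower 10 1
BWC lower 10 1
SC upper 50 0
SC lower 50 1
SC lower 50 0
SC upper 50 0
;
run;

proc surveymeans data=toy_pool4 total=toy_totals mean stderr;
  strata spatialstratum / list;
  var present;        * coded 0 and 1 so no class statement is needed;
  weight sweight;
  domain reach;       * subpopulation estimates for upper and lower reaches;
run;
```

In this toy example the BWC stratum has a 10% sampling fraction (4 of 40 units), so the total= option reduces its contribution to the standard error, while the domain statement treats the number of samples falling in each reach as random.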
Modeling using means
Modeling using means, including design-based means, is challenging for a number of reasons. The first is that the sampling variance, σ²/n, associated with a particular mean is a function of the sample size. Consequently, variation in sample sizes may be presumed to imply that sampling errors vary among means. For stratified random samples, within-strata sample sizes should also have been roughly constant. Second, the sampling variances of means for categorical and count data are themselves functions of the means. Third, the true means should themselves be presumed to vary (i.e., not merely because of sampling error); a complication is that this true mean or parameter variance is, for nonnormal data, typically presumed to vary on a scale other than that on which the data were sampled. Further discussion of modeling using means is provided by Snijders and Bosker (1999).
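The second point can be made concrete with two textbook cases, given here under simple random sampling and, for counts, under a Poisson assumption that is made only for illustration (LTRMP counts need not be Poisson). Here p is the true prevalence, μ the true mean count and n the sample size.

```latex
\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n},
\qquad
\mathrm{Var}(\bar{y}) = \frac{\mu}{n}
```

For example, a prevalence of 0.5 gives a sampling variance of 0.25/n, while a prevalence of 0.1 gives 0.09/n, so means estimated from equal sample sizes still carry unequal sampling errors.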
Confidence intervals
Unlike the case for means and standard errors, most survey-based confidence intervals rely on distributional assumptions and those often, in turn, rely on large-sample size assumptions. For a given data set, the definition of "large" depends on the skewness of the data and on the number of strata (Lohr 1999, Thompson 2002). Unless otherwise indicated, all LTRMP data should be presumed to be nontrivially skewed (exceptions include some water quality metrics, including those explicitly reported as log-transformed). Within-strata sample sizes have often been small (i.e., n < 30 to 50, and possibly < 10), as have the number of strata. The number of strata has ranged from approximately 15 in the fish program (10 beginning in 2005) to approximately 5 for metrics associated with the other components. For these reasons, t-based confidence intervals for pool-wide means from all LTRMP components are viewed as approximate, and t-based confidence intervals by strata are reported only for continuous data or when strata-specific sample sizes exceed 50. When data sets do not meet these criteria, strata-specific confidence intervals may be estimated using resampling or replication methods (Lohr 1999).
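For reference, a t-based interval of the kind discussed above takes the following form. The degrees-of-freedom rule shown is the default used by proc surveymeans for stratified designs (number of clusters, or of observations when there is no cluster statement, minus the number of strata); the notation is introduced here for illustration.

```latex
\bar{y} \;\pm\; t_{1-\alpha/2,\,\mathit{df}} \times \mathrm{SE}(\bar{y}),
\qquad
\mathit{df} = (\text{number of primary sampling units}) - (\text{number of strata})
```

As a hypothetical case, 5 strata with 10 sampled units each give df = 45 and a 95% multiplier of about 2.01. The interval remains approximate because it assumes the estimated mean is close to normally distributed.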
References
Dorazio, RM and JA Royle. 2003. Mixture models for estimating the size of a closed population when capture rates vary among individuals. Biometrics 59: 351‑364.
Gutreuter, S, R Burkhardt and K Lubinski. 1995. Long Term Resource Monitoring Program Procedures: Fish Monitoring. National Biological Service, Environmental Management Technical Center, Onalaska, Wisconsin, July 1995. LTRMP 95‑P002‑1. 42 pp. + Appendixes A‑J.
Lohr, SL. 1999. Sampling: Design and Analysis. Duxbury Press, Pacific Grove, CA.
MacKenzie, DI, JD Nichols, N Sutton and LL Bailey. 2005. Improving inferences in population studies of rare species that are detected imperfectly. Ecology 86: 1101-1113.
SAS Institute Inc. 2003. SAS OnlineDoc® 9.1. SAS Institute Inc., Cary, NC.
Snijders, TAB and RJ Bosker. 1999. Multilevel analysis. Sage: London. Pp. 266.
Thompson, SK. 2002. Sampling, 2nd ed. Wiley, New York.
This document benefitted from comments by .....
Contact: Questions or comments may be directed to Brian Gray, LTRMP statistician, Upper Midwest Environmental Sciences Center, La Crosse, Wisconsin, at brgray@usgs.gov.
Page Last Modified: April 17, 2018