Contributions to the multivariate Analysis of Marine Environmental Monitoring
Collaborator: Aluja Banet, Tomàs; Universitat Politècnica de Catalunya. Departament d'Estadística i Investigació Operativa
Document type: Doctoral thesis
Publisher: Universitat Politècnica de Catalunya
Rights access: Open Access
The thesis takes the view that statistics starts with data, and accordingly begins by introducing the data sets studied: marine benthic species counts and chemical measurements made at a set of sites in the Norwegian Ekofisk oil field, with replicates and annual repetition. An introductory chapter details the sampling procedure and shows with reliability calculations that the (transformed) chemical variables have excellent reliability, whereas the biological variables have poor reliability, except for a small subset of abundant species. Transformed chemical variables are shown to be approximately normal. Bootstrap methods are used to assess whether the biological variables follow a Poisson distribution, and lead to the conclusion that the Poisson distribution must be rejected, except for rare species. A separate chapter presents further work on the distribution of the species variables: truncated and zero-inflated Poisson distributions as well as Poisson mixtures are used to account for sparseness and overdispersion. Species are thought to respond to environmental variables, and regressions of the abundance of a few selected species on chemical variables are reported. For rare species, logistic regression and Poisson regression are the tools considered, though there are problems of overdispersion. For abundant species, random coefficient models are needed to cope with intraclass correlation. The environmental variables, mainly heavy metals, are highly correlated, leading to multicollinearity problems. The next chapters use a multivariate approach, in which all species data are treated simultaneously. The theory of correspondence analysis is reviewed, and some theoretical results on this method are reported (bounds for singular values, centring matrices).
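As an illustration of the Poisson check mentioned above, the following is a minimal sketch of a parametric bootstrap test based on the variance-to-mean ratio. It is not necessarily the procedure used in the thesis; the function names, the choice of test statistic, and the simulated counts are all assumptions made for the example.

```python
import numpy as np


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 when counts are Poisson."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()


def poisson_bootstrap_pvalue(counts, n_boot=2000, seed=0):
    """Parametric bootstrap p-value for overdispersion relative to a Poisson.

    Samples of the same size are simulated from a Poisson distribution with
    the observed mean, and the observed dispersion index is compared with
    the simulated ones (one-sided, large values count against the Poisson).
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    observed = dispersion_index(counts)
    sims = rng.poisson(counts.mean(), size=(n_boot, counts.size))
    means = sims.mean(axis=1)
    valid = means > 0  # skip all-zero samples, whose index is undefined
    sim_index = sims[valid].var(axis=1, ddof=1) / means[valid]
    p_value = (1 + np.sum(sim_index >= observed)) / (1 + valid.sum())
    return observed, p_value


# Hypothetical abundance counts (not data from the survey).
rng = np.random.default_rng(1)
overdispersed = rng.negative_binomial(n=1, p=0.1, size=60)
poisson_like = rng.poisson(3.0, size=60)

d_over, p_over = poisson_bootstrap_pvalue(overdispersed)
d_pois, p_pois = poisson_bootstrap_pvalue(poisson_like)
print(f"overdispersed: index={d_over:.2f}, p={p_over:.4f}")
print(f"poisson-like:  index={d_pois:.2f}, p={p_pois:.4f}")
```

A small p-value indicates more variation than a Poisson model with the same mean would produce, which is the overdispersion that zero-inflated models and Poisson mixtures are meant to absorb.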
An applied chapter discusses the correspondence analysis of the species data in detail, detects outliers, addresses stability issues, and considers different ways of stacking data matrices to obtain an integrated analysis of several years of data and to decompose variation into within-site and between-site components. More than 40% of the total inertia is due to variation within stations. Principal component analysis is used to analyse the set of chemical variables. Attempts are made to integrate the analysis of the biological and chemical variables. A detailed theoretical development shows how continuous variables can be mapped in an optimal manner as supplementary vectors into a correspondence analysis biplot. Geometrical properties are worked out in detail, measures for the quality of the display are given, and artificial data and data from the monitoring survey are used to illustrate the theory developed. The theory of the display of supplementary variables in biplots is also worked out in detail for principal component analysis, with attention to the different types of scaling and to the optimality of displayed correlations. A theoretical chapter follows that gives an in-depth treatment of canonical correspondence analysis (linearly constrained correspondence analysis, CCA for short), detailing many mathematical properties and aspects of this multivariate method, such as geometrical properties, biplots, use of generalized inverses, and relationships with other methods. Some applications of CCA to the survey data are dealt with in a separate chapter, with their interpretation and an indication of the quality of the display of the different matrices involved in the analysis. Weighted principal component analysis of weighted averages is proposed as an alternative to CCA.
This leads to a better display of the weighted averages of the species and, in the cases studied so far, also leads to biplots with a higher amount of explained variance for the environmental data. The thesis closes with a bibliography and some suggestions for further research, such as the generalization of canonical correlation analysis to work with singular covariance matrices, the use of partial least squares methods to account for the excess of predictors, and data fusion problems to estimate missing biological data.
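For readers unfamiliar with correspondence analysis, the following is a minimal sketch of the standard computation via the singular value decomposition of the standardized residuals of a sites-by-species count table. It is a textbook formulation rather than code from the thesis; the function name and the small count table are hypothetical.

```python
import numpy as np


def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way table of counts.

    Assumes every row and every column has a positive total.
    Returns row principal coordinates, column standard coordinates,
    and the principal inertias (squared singular values).
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()            # correspondence matrix
    r = P.sum(axis=1)          # row masses
    c = P.sum(axis=0)          # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_principal = (U * sv) / np.sqrt(r)[:, None]
    col_standard = Vt.T / np.sqrt(c)[:, None]
    return row_principal, col_standard, sv**2


# Hypothetical table: 5 sites (rows) by 4 species (columns).
counts = np.array([
    [10, 2, 0, 1],
    [8, 3, 1, 0],
    [1, 0, 7, 9],
    [0, 1, 6, 12],
    [3, 4, 2, 2],
])

rows, cols, inertia = correspondence_analysis(counts)
share = inertia / inertia.sum()
print("principal inertias:", np.round(inertia, 4))
print("share of total inertia:", np.round(share, 4))
```

The total inertia equals the Pearson chi-square statistic of the table divided by the grand total, and the singular values never exceed 1. Plotting the first two columns of `rows` together with the first two columns of `cols` gives a row-principal biplot of sites and species.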