# Research Projects

The Missouri node is developing spatio-temporal methodology for analyzing the rich survey data from the American Community Survey (ACS), an ongoing survey that releases data annually and provides communities with the timely information they need to plan the distribution of resources and services. According to the U.S. Census Bureau, data from the survey provide input into how more than $400 billion in federal and state funds are distributed annually. With the ACS, the Census Bureau has shifted from the decennial census long form to an ongoing survey. Making this transition presents many methodological challenges, both for the Census Bureau and for data users. The Missouri node will develop a hierarchical, multiscale, spatio-temporal statistical framework for carrying out small area estimation while preserving the geographical and temporal constraints that arise from the aggregate structure of the ACS. In addition, the research will provide a variety of methods that are of independent interest and can be used in many other surveys administered by the Census Bureau and other federal statistical agencies. Several of the proposed methods are expected to carry over to disease mapping and thus provide important tools for public health.

**Multivariate Spatio-Temporal Survey Fusion with Application to the American Community Survey and Local Area Unemployment Statistics:**

There are often multiple surveys available that estimate and report related demographic variables of interest that are referenced over space and/or time. Not all surveys produce the same information, and thus, combining these surveys typically leads to higher quality estimates. In addition, various surveys often produce estimates with incomplete spatio-temporal coverage. By combining surveys, we can account for different margins of error and leverage dependencies to produce estimates of every variable considered at every spatial location and every time point. Our strategy is to use a hierarchical modeling approach, where the first stage of the model incorporates the margin of error associated with each survey. Then, in a lower stage of the hierarchical model, the multivariate spatio-temporal mixed effects model is used to incorporate multivariate spatio-temporal dependencies of the processes of interest. We adopt a fully Bayesian approach for combining surveys; i.e., given all of the available surveys, the conditional distributions of the latent processes of interest are used for statistical inference. To demonstrate our proposed methodology, we jointly analyze period estimates from the Census Bureau's ACS and estimates obtained from the Bureau of Labor Statistics Local Area Unemployment Statistics program.
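
As a rough illustration of the first-stage idea (each survey enters with its own margin of error), the sketch below converts margins of error to sampling variances and combines two estimates of the same quantity by precision weighting. It is a deliberately simplified stand-in, not the multivariate spatio-temporal mixed effects model described above: the function names and the numbers are hypothetical, and the 1.645 divisor assumes margins of error reported at the 90% confidence level, as is the ACS convention.

```python
def moe_to_variance(moe, z=1.645):
    """Convert a margin of error to a sampling variance.

    Assumes the margin of error is z standard errors wide
    (z = 1.645 corresponds to a 90% confidence level).
    """
    return (moe / z) ** 2


def fuse_estimates(estimates, moes, z=1.645):
    """Precision-weighted combination of several survey estimates of the
    same quantity (hypothetical helper; no prior information is used)."""
    variances = [moe_to_variance(m, z) for m in moes]
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    fused = sum(p * e for p, e in zip(precisions, estimates)) / total
    return fused, 1.0 / total  # combined estimate and its variance


# Hypothetical example: two surveys reporting the same rate (in percent).
fused, fused_var = fuse_estimates([6.0, 5.0], [1.0, 0.5])
```

The survey with the smaller margin of error dominates the combined value. The full model goes further by borrowing strength across variables, regions, and time points.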

Articles:

Bradley, J.R., Holan, S.H., and Wikle, C.K. (2016) Multivariate Spatio-Temporal Survey Fusion with Application to the American Community Survey and Local Area Unemployment Statistics, *STAT*, 5: 224 - 233.

**Leveraging social-media data to improve survey estimates of small areas:**

The Fay-Herriot (FH) model is widely used in small area estimation and uses auxiliary information to reduce estimation variance at undersampled locations. We extend the type of covariate information used in the FH model to include functional covariates, such as social-media search loads or remote-sensing images (e.g., in crop-yield surveys). The inclusion of these functional covariates is facilitated through a two-stage dimension-reduction approach that includes a Karhunen-Loève expansion followed by stochastic search variable selection. Additionally, the importance of modeling spatial autocorrelation has recently been recognized in the FH model; our model utilizes the conditional autoregressive class of spatial models in addition to functional covariates. We demonstrate the effectiveness of our approach through simulation and through the analysis of American Community Survey data. We use Google Trends search curves as functional covariates to analyze changes in rates of household Spanish speaking in the eastern half of the United States.
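
For readers unfamiliar with the basic FH model, the following minimal sketch shows how a direct estimate is shrunk toward a regression fit on an auxiliary covariate. It assumes a single scalar covariate, known sampling variances `d`, and a known model variance `a`; it omits the functional covariates and the conditional autoregressive spatial effects of the model described above. The function name is hypothetical.

```python
def fay_herriot_blup(y, x, d, a):
    """Basic Fay-Herriot predictor with an intercept and one covariate.

    y: direct survey estimates, one per small area
    x: auxiliary covariate values
    d: known sampling variances of the direct estimates
    a: model (random-effect) variance, assumed known in this sketch
    """
    # Weighted least squares with weights 1 / (a + d_i).
    w = [1.0 / (a + di) for di in d]
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (sw * sxy - sx * sy) / det

    # Shrink each direct estimate toward its regression fit.
    estimates = []
    for yi, xi, di in zip(y, x, d):
        gamma = a / (a + di)  # weight on the direct estimate
        estimates.append(gamma * yi + (1.0 - gamma) * (b0 + b1 * xi))
    return estimates, (b0, b1)
```

Areas with large sampling variance receive a small `gamma` and therefore lean more heavily on the covariate-based fit, which is how auxiliary information reduces variance at undersampled locations.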

Articles:

Porter, A.T., Holan, S.H., Wikle, C.K., and Cressie, N. (2014) Spatial Fay–Herriot Models for Small Area Estimation with Functional Covariates. *Spatial Statistics*, 10, 27 – 42.

**Small area estimation via multivariate spatial models:**

The Fay-Herriot model is a standard model for direct survey estimators in which the true quantity of interest (the superpopulation mean) is latent, and its estimation is improved through the use of auxiliary covariates. In the context of small area estimation, these estimates can be further improved by borrowing strength across spatial regions or by considering multiple outcomes simultaneously. We provide two formulations of Fay-Herriot models for small area estimation that include both multivariate outcomes and latent spatial dependence: one in which the outcome-by-space dependence structure is separable, and one that accounts for the cross-dependence through a generalized multivariate conditional autoregressive (GMCAR) structure. The GMCAR model is shown in a state-level example to produce smaller mean squared prediction errors, relative to equivalent census variables, than the separable model and the state-of-the-art multivariate model with unstructured dependence between outcomes and no spatial dependence. In addition, both the GMCAR and the separable models give smaller mean squared prediction error than the state-of-the-art model when conducting small area estimation on county-level data from the American Community Survey.

Articles:

Porter, A.T., Holan, S.H., and Wikle, C.K. (2015) Small Area Estimation via Multivariate Fay-Herriot Models with Latent Spatial Dependence. *Australian & New Zealand Journal of Statistics*. 57, 15 – 29.

**Small area estimation of skewed, count, and binary data:**

The Fay-Herriot model is generalized for skewed, count, and binary data that are spatial and big. Big spatial datasets can be found in national surveys that partition the country into many small areas. The responses are not typically normally distributed, and it is only after aggregation that the averages have a normal-like behavior. In this research project, we consider a hierarchical statistical model consisting of a conditional exponential family model for the data and an underlying (hidden) geostatistical process for some transformation of the (conditional) mean of the data model. Within this hierarchical model, dimension reduction is achieved by modeling the geostatistical process as a linear combination of a fixed number of spatial basis functions, which results in substantial computational speed-ups. We use an empirical hierarchical model (EHM), where the parameters are estimated by maximum likelihood using an EM algorithm. In this non-normal setting, the E-step of the EM algorithm is obtained from a Laplace approximation.

Articles:

Sengupta, A., & Cressie, N. (2013) Hierarchical Statistical Modeling of Big Spatial Datasets Using the Exponential Family of Distributions. *Spatial Statistics*, 4, 14 – 44.

**Bayesian Spatial Change of Support for Count-Valued Survey Data:**

We introduce Bayesian spatial change of support (COS) methodology for count-valued survey data with known survey variances. Our methodology is motivated by the ACS, an ongoing survey administered by the Census Bureau that provides timely information on several key demographic variables. The ACS produces 1-year, 3-year, and 5-year "period-estimates," and corresponding margins of error, for published demographic and socio-economic variables recorded over predefined geographies within the U.S. Despite the availability of these predefined geographies, data users are often interested in specifying customized, user-defined spatial supports. This problem is known as spatial COS, which is typically performed under the assumption that the data are Gaussian. However, count-valued survey data are naturally non-Gaussian and, hence, we consider modeling these data using a Poisson distribution. Additionally, survey data are often accompanied by estimates of error, which we incorporate into our analysis. We interpret Poisson count-valued data in small areas as an aggregation of events from a spatial point process. This approach provides the flexibility necessary to allow ACS users to consider a variety of spatial supports in "real-time." We show the effectiveness of our approach through a simulated example and an analysis using public-use ACS data.
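
The point-process interpretation is what makes arbitrary supports tractable. As a sketch, assume the events follow a Poisson point process with latent intensity $\lambda(s)$; the symbols below are illustrative notation, not taken from the article, and the survey-error component is omitted.

```latex
% Z(A): count on areal unit A;  \lambda(s): latent intensity at location s
\begin{align*}
Z(A) \mid \lambda &\sim \mathrm{Poisson}\!\left( \int_{A} \lambda(s)\, ds \right), \\
Z(A_1 \cup A_2) \mid \lambda &\sim \mathrm{Poisson}\!\left( \int_{A_1} \lambda(s)\, ds + \int_{A_2} \lambda(s)\, ds \right)
\quad \text{for disjoint } A_1, A_2 .
\end{align*}
```

Under this assumption, the mean count on any user-defined region is the integral of the same intensity over that region, so once the intensity has been inferred from the published geographies, predictions on a new support only require re-aggregation.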

Articles:

Bradley, J.R., Wikle, C.K., and Holan, S.H. (2016) Bayesian Spatial Change of Support for Count-Valued Survey Data with Application to the American Community Survey. *Journal of the American Statistical Association*, 111: 472 - 487.

**Bayesian Semiparametric Hierarchical Empirical Likelihood Spatial Models:**

We introduce a general hierarchical Bayesian framework that incorporates a flexible nonparametric data model specification through the use of empirical likelihood methodology, which we term semiparametric hierarchical empirical likelihood (SHEL) models. Although general dependence structures can be readily accommodated, we focus on spatial modeling, a relatively underdeveloped area in the empirical likelihood literature. Importantly, the models we develop naturally accommodate spatial association on irregular lattices and irregularly spaced point-referenced data. We illustrate our proposed framework by means of a simulation study and through three real data examples. First, we develop a spatial Fay-Herriot model in the SHEL framework and apply it to the problem of small area estimation in the American Community Survey. Next, we illustrate the SHEL model in the context of areal data (on an irregular lattice) through the North Carolina sudden infant death syndrome (SIDS) dataset. Finally, we analyze a point-referenced dataset from the North American Breeding Bird survey that considers dove counts for the state of Missouri. In all cases, we demonstrate superior performance of our model, in terms of mean squared prediction error, over standard parametric analyses.

Articles:

Porter, A.T., Holan, S.H., and Wikle, C.K. (2015) Bayesian Semiparametric Hierarchical Empirical Likelihood Spatial Models. *Journal of Statistical Planning and Inference*, 165, 78 – 90.

**The Cepstral Model for Multivariate Time Series: The Vector Exponential Model:**

Vector autoregressive (VAR) models have become a staple in the analysis of multivariate time series and are formulated in the time domain as difference equations, with an implied covariance structure. In many contexts, it is desirable to work with a stable, or at least stationary, representation. To fit such models, one must impose restrictions on the coefficient matrices to ensure that certain determinants are nonzero, which, except in special cases, may prove burdensome. To circumvent these difficulties, we propose a flexible frequency domain model expressed in terms of the spectral density matrix. Specifically, this paper treats the modeling of covariance stationary vector-valued (i.e., multivariate) time series via an extension of the exponential model for the spectrum of a scalar time series. We discuss the modeling advantages of the vector exponential model and its computational facets, such as how to obtain Wold coefficients from given cepstral coefficients. Finally, we demonstrate the utility of our approach through simulation as well as two illustrative data examples focusing on multi-step ahead forecasting and estimation of squared coherence.
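
To give a feel for the cepstral-to-Wold computation, the sketch below handles the scalar exponential model only. It assumes the convention log f(λ) = c_0 + 2 Σ_k c_k cos(kλ), so that the Wold filter is ψ(z) = exp(Σ_k c_k z^k) with ψ_0 = 1; differentiating both sides gives the recursion used in the code. The vector case involves matrix exponentials of non-commuting matrices and is not covered by this sketch; the function name is hypothetical.

```python
def wold_from_cepstrum(c, n_terms):
    """Scalar Wold coefficients psi_0, ..., psi_{n_terms-1} from cepstral
    coefficients, where c[k-1] holds c_k for k = 1, ..., q.

    Uses j * psi_j = sum_{k=1}^{min(j, q)} k * c_k * psi_{j-k}, psi_0 = 1.
    """
    psi = [1.0]
    for j in range(1, n_terms):
        total = 0.0
        for k in range(1, min(j, len(c)) + 1):
            total += k * c[k - 1] * psi[j - k]
        psi.append(total / j)
    return psi


# With a single coefficient c_1, psi(z) = exp(c_1 z), so psi_j = c_1**j / j!.
psi = wold_from_cepstrum([0.5], 4)
```

Because any real cepstral coefficients yield a valid spectrum under this parameterization, no constraints need to be imposed during fitting, which is the contrast with the VAR restrictions mentioned above.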

Articles:

Holan, S.H., McElroy, T.S., and Wu, G. (2016 – To Appear) The Cepstral Model for Multivariate Time Series: The Vector Exponential Model. *Statistica Sinica.*

**Mixed Effects Modeling for Areal Data that Exhibit Multivariate-Spatio-Temporal Dependencies:**

There are many data sources available that report related variables of interest that are also referenced over geographic regions and time; however, there are relatively few general statistical methods that one can readily use that incorporate these multivariate-spatio-temporal dependencies. As such, we introduce the multivariate-spatio-temporal mixed effects model (MSTM) to analyze areal data with multivariate-spatio-temporal dependencies. The proposed MSTM extends the notion of Moran's I basis functions to the multivariate-spatio-temporal setting. This extension leads to several methodological contributions including extremely effective dimension reduction, a dynamic linear model for multivariate-spatio-temporal areal processes, and the reduction of a high-dimensional parameter space using a novel parameter model. Several examples are used to demonstrate that the MSTM provides an extremely viable solution to many important problems found in different and distinct corners of the spatio-temporal statistics literature including: modeling nonseparable and nonstationary covariances, combining data from multiple repeated surveys, and analyzing massive multivariate-spatio-temporal datasets.

Articles:

Bradley, J.R., Holan, S.H., and Wikle, C.K. (2015) Multivariate Spatio-Temporal Models for High-Dimensional Areal Data with Application to Longitudinal Employer-Household Dynamics. *The Annals of Applied Statistics*, 9, 1761 – 1791.

**Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography:**

Many data stewards collect confidential data that include fine geography. When sharing these data with others, data stewards strive to disseminate data that are informative for a wide range of spatial and non-spatial analyses while simultaneously protecting the confidentiality of data subjects' identities and attributes. Typically, data stewards meet this challenge by coarsening the resolution of the released geography and, as needed, perturbing the confidential attributes. When done with high intensity, these redaction strategies can result in released data with poor analytic quality. We propose an alternative dissemination approach based on fully synthetic data. We generate data using marked point process models that can maintain both the statistical properties and the spatial dependence structure of the confidential data. We illustrate the approach using data consisting of mortality records from Durham, North Carolina.

Articles:

Quick, H., Holan, S.H., Wikle, C.K., and Reiter, J.P. (2015) Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography. *Spatial Statistics*, 14, 439 – 451.

**Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error:**

The modifiable areal unit problem and the ecological fallacy are known problems that occur when modeling multiscale spatial processes. We investigate how these forms of spatial aggregation error can guide a regionalization over a spatial domain of interest. By “regionalization” we mean a specification of geographies that define the spatial support for areal data. This topic has been studied vigorously by geographers, but has been given less attention by spatial statisticians. Thus, we propose a criterion for spatial aggregation error (CAGE), which we minimize to obtain an optimal regionalization. To define CAGE, we draw a connection between spatial aggregation error and a new multiscale representation of the truncated Karhunen-Loève (K-L) expansion. This relationship between CAGE and the multiscale truncated K-L expansion leads to illuminating theoretical developments, including connections between spatial aggregation error and squared prediction error, and a novel extension of Obled-Creutin eigenfunctions. The effectiveness of our approach is demonstrated through a simulation study and an analysis of two datasets, one using the American Community Survey and one related to environmental ocean winds.

Articles:

Bradley, J.R., Wikle, C.K., and Holan, S.H. (2016 – To Appear) Regionalization of Multiscale Spatial Processes using a Criterion for Spatial Aggregation Error. *Journal of the Royal Statistical Society: Series B.*

**Spatio-Temporal Change of Support with Application to American Community Survey Multi-Year Period Estimates:**

We present hierarchical Bayesian methodology to perform spatio-temporal change of support (COS) for survey data with Gaussian sampling errors. This methodology is motivated by the American Community Survey (ACS), which is an ongoing survey administered by the U.S. Census Bureau that provides timely information on several key demographic variables. The ACS has published 1-year, 3-year, and 5-year period-estimates, and margins of error, for demographic and socio-economic variables recorded over predefined geographies. The spatio-temporal COS methodology considered here provides data users with a way to estimate ACS variables on customized geographies and time periods, while accounting for sampling errors. Additionally, 3-year ACS period estimates are to be discontinued, and this methodology can provide predictions of ACS variables for 3-year periods given the available period estimates. The methodology is based on a spatio-temporal mixed effects model with a low-dimensional spatio-temporal basis function representation, which provides multi-resolution estimates through basis function aggregation in space and time. This methodology includes a novel parameterization that uses a target dynamical process and recently proposed parsimonious Moran's I propagator structures. Our approach is demonstrated through two applications using public-use ACS estimates, and is shown to produce good predictions on a holdout set of 3-year period estimates.

Articles:

Bradley, J.R., Wikle, C.K., and Holan, S.H. (2015) Spatio-Temporal Change of Support with Application to American Community Survey Multi-Year Period Estimates. *STAT*, 4, 255 – 270.

**Zeros and Ones: A Case for Suppressing Zeros in Sensitive Count Data with an Application to Stroke Mortality:**

In the current era of global internet connectivity, privacy concerns are of the utmost importance. When official statistical agencies collect spatially referenced, confidential data that they intend to release as public-use files, the suppression of small counts is a common measure that agencies take to protect the confidentiality of the data-subjects from ill-intentioned users. The goal of this paper is to demonstrate that an interval suppression criterion that does not suppress zeros can fail to protect regions with a single occurrence. We illustrate the difference in disclosure risk between an interval suppression criterion and a one-sided suppression criterion by considering a US county-level dataset composed of the number of deaths due to stroke in White men. Here, we illustrate that an interval suppression criterion leads to a twofold increase in the disclosure risk when compared with a one-sided suppression criterion for regions with a single incidence among a population of less than 600. We conclude with an extension of these findings beyond stroke mortality and by offering general guidelines for data suppression.
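
The difference between the two criteria can be made concrete with a small sketch. The threshold of 10 and the function names are hypothetical choices for illustration; the paragraph above does not state the threshold used in the article.

```python
def publish(count, threshold=10, rule="interval"):
    """Return the published cell value, or None if the cell is suppressed.

    rule="interval":  suppress 1 <= count < threshold, publish zeros
    rule="one_sided": suppress every count < threshold, including zero
    """
    if rule == "interval":
        suppressed = 1 <= count < threshold
    elif rule == "one_sided":
        suppressed = count < threshold
    else:
        raise ValueError("rule must be 'interval' or 'one_sided'")
    return None if suppressed else count


def consistent_counts(threshold=10, rule="interval"):
    """Counts an outside user cannot rule out for a suppressed cell."""
    if rule not in ("interval", "one_sided"):
        raise ValueError("rule must be 'interval' or 'one_sided'")
    lower = 1 if rule == "interval" else 0
    return list(range(lower, threshold))
```

Under the interval rule, a suppressed cell tells the user that at least one event occurred, whereas under the one-sided rule a suppressed cell is still consistent with zero events. This is the information leak that matters for regions with a single occurrence.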

Articles:

Quick, H., Holan, S.H., and Wikle, C.K. (2015) Zeros and Ones: A Case for Suppressing Zeros in Sensitive Count Data with an Application to Stroke Mortality. *STAT*, 4, 227 – 234.

**The SAR Model for Very Large Datasets: A Reduced Rank Approach:**

The SAR model is widely used in spatial econometrics to model Gaussian processes on a discrete spatial lattice, but for large datasets, fitting it becomes computationally prohibitive, and hence, its usefulness can be limited. A computationally-efficient spatial model is the spatial random effects (SRE) model, and in this article, we calibrate it to the SAR model of interest using a generalisation of the Moran operator that allows for heteroskedasticity and an asymmetric SAR spatial dependence matrix. In general, spatial data have a measurement-error component, which we model, and we use restricted maximum likelihood to estimate the SRE model covariance parameters; the required computational time is only of the order of the size of the dataset. Our implementation is demonstrated using mean usual weekly income data from the 2011 Australian Census.
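
The linear-time claim can be understood through the standard low-rank argument, sketched here under assumed notation (none of these symbols come from the article): the $n$ observations have covariance built from an $n \times r$ basis matrix $S$ with $r \ll n$, an $r \times r$ random-effects covariance $K$, and a diagonal matrix $D$ collecting fine-scale and measurement-error variances.

```latex
\begin{align*}
\Sigma &= S K S^{\top} + D, \\
\Sigma^{-1} &= D^{-1} - D^{-1} S \left( K^{-1} + S^{\top} D^{-1} S \right)^{-1} S^{\top} D^{-1} .
\end{align*}
```

The second line is the Sherman-Morrison-Woodbury identity. Only an $r \times r$ matrix is inverted, so the cost of evaluating the likelihood grows as $O(n r^{2})$, that is, linearly in $n$ for fixed $r$.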

Articles:

Burden, S., Cressie, N., and Steel, D.G. (2015) The SAR Model for Very Large Datasets: A Reduced Rank Approach. *Econometrics*, 3, 317 – 338.

**Distribution Theory for High-dimensional Dependent Count Data:**

We introduce a Bayesian approach for multivariate spatio-temporal prediction for high-dimensional count-valued data. Our primary interest is when there are possibly millions of data points referenced over different variables, geographic regions, and times. This problem requires extensive methodological advancements, as jointly modeling correlated data of this size leads to the so-called "big n problem." The computational complexity is further exacerbated by acknowledging that count-valued data are non-Gaussian. Thus, we develop a computationally efficient distribution theory for this setting. To incorporate dependence between variables, regions, and times, a multivariate spatio-temporal mixed effects model is used. The results in this manuscript are extremely general, and can be used for data that exhibit fewer sources of dependency than what we consider. The implications of our modeling framework may have a large impact on the general problem of jointly modeling correlated count-valued data. We demonstrate our methodology by analyzing data from the Longitudinal Employer-Household Dynamics (LEHD) program.

Articles:

Bradley, J.R., Holan, S.H., and Wikle, C.K. (2016) Computationally Efficient Multivariate Spatio-Temporal Models for High-Dimensional Count-Valued Data. (To Appear - *Bayesian Analysis*)

**Multivariate Spatial Covariance Models: A Conditional Approach:**

Multivariate geostatistics is based on modeling all covariances between all possible combinations of two or more variables at any sets of locations in a continuously indexed domain. Multivariate spatial covariance models need to be built with care, since any covariance matrix that is derived from such a model must be nonnegative-definite. In this article, a conditional approach is developed for spatial-model construction where the validity conditions are easy to check. Starting with bivariate spatial covariance models, the approach's connection to multivariate models defined by networks of spatial variables is demonstrated. In some circumstances, such as modeling respiratory illness conditional on air pollution, the direction of conditional dependence is clear. When it is not, the two directional models can be compared. More generally, the graph structure of the network reduces the number of possible models to compare. Model selection then amounts to finding possible causative links in the network. The conditional approach has been demonstrated on two separate bivariate spatial datasets, one from the ACS and one that relates surface temperature with surface pressure. In both cases, the role of the two variables is seen to be asymmetric.
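
A sketch of the bivariate construction, in notation assumed here for illustration: $Y_1$ has covariance function $C_{11}$, and $Y_2$ is specified conditionally on $Y_1$ through an interaction function $b(s, v)$ and a conditional covariance $C_{2|1}$. On a finite set of locations, with $B$ the discretized interaction matrix, the implied joint covariance matrix is shown in the last line.

```latex
\begin{align*}
E\!\left[ Y_2(s) \mid Y_1(\cdot) \right] &= \int_{D} b(s, v)\, Y_1(v)\, dv, \\
\mathrm{cov}\!\left( Y_2(s), Y_2(u) \mid Y_1(\cdot) \right) &= C_{2|1}(s, u), \\
\Sigma &=
\begin{pmatrix}
\Sigma_{11} & \Sigma_{11} B^{\top} \\
B \Sigma_{11} & B \Sigma_{11} B^{\top} + \Sigma_{2|1}
\end{pmatrix}
=
\begin{pmatrix} I & 0 \\ B & I \end{pmatrix}
\begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{2|1} \end{pmatrix}
\begin{pmatrix} I & B^{\top} \\ 0 & I \end{pmatrix} .
\end{align*}
```

The factorization shows why validity is easy to check: $\Sigma$ is nonnegative-definite whenever $\Sigma_{11}$ and $\Sigma_{2|1}$ are, regardless of the choice of $B$. Swapping the roles of the two variables gives the other directional model.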

Articles:

Cressie, N. and Zammit-Mangion, A. (2016). Multivariate Spatial Covariance Models: A Conditional Approach. *Biometrika*, 103, 915-935.

**Generating Partially Synthetic Geocoded Public Use Data with Decreased Disclosure Risk Using Differential Smoothing:**

When collecting geocoded confidential data with the intent to disseminate, agencies often resort to altering the geographies prior to making data publicly available. An alternative to releasing aggregated and/or perturbed data is to release synthetic data, where sensitive values are replaced with draws from models designed to capture distributional features in the collected data. The issues associated with spatially outlying observations in the data, however, have received relatively little attention. Our goal here is to shed light on this problem and propose a solution, referred to as "differential smoothing." We illustrate our approach using sale prices of homes in San Francisco.

Articles:

Quick, H., Holan, S.H., and Wikle, C.K. (2016) Generating Partially Synthetic Geocoded Public Use Data with Decreased Disclosure Risk Using Differential Smoothing. (Under Revision - *Journal of the Royal Statistical Society - A*)