Statistics and Computing
Featured scientific publications
* Data is for reference only
Preserving confidentiality of high-dimensional tabulated data: Statistical and computational issues
Statistics and Computing - Volume 13 - Pages 363-370 - 2003
Dissemination of information derived from large contingency tables formed from confidential data is a major responsibility of statistical agencies. In this paper we present solutions to several computational and algorithmic problems that arise in the dissemination of cross-tabulations (marginal sub-tables) from a single underlying table. These include data structures that exploit sparsity to support efficient computation of marginals and algorithms such as iterative proportional fitting, as well as a generalized form of the shuttle algorithm that computes sharp bounds on (small, confidentiality threatening) cells in the full table from arbitrary sets of released marginals. We give examples illustrating the techniques.
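As a rough illustration of two ideas named in this abstract, the sketch below runs iterative proportional fitting on a small two-way table and computes the classical Fréchet bounds on a cell from its row and column totals. It is not the paper's sparse data structure or its generalized shuttle algorithm; the function names `ipf` and `frechet_bounds`, the dense list-of-lists layout, and the example totals are all assumptions made for illustration.

```python
def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting for a two-way table (hypothetical helper).

    Alternately rescales rows and columns of `seed` until the row sums
    match `row_targets` and the column sums match `col_targets`.
    """
    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [v * factor for v in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns now match exactly; stop once the rows match as well.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


def frechet_bounds(row_total, col_total, grand_total):
    """Bounds on one cell of a two-way table given only its two margins."""
    lower = max(0, row_total + col_total - grand_total)
    upper = min(row_total, col_total)
    return lower, upper


# Example margins (made up for illustration).
fitted = ipf([[1.0, 1.0], [1.0, 1.0]],
             row_targets=[30.0, 70.0], col_targets=[40.0, 60.0])
bounds = frechet_bounds(30, 40, 100)
```

With a uniform seed, the fitted table is the independence table implied by the margins. For two one-way margins of a two-way table the Fréchet bounds are already sharp; the generalized shuttle algorithm described in the abstract addresses arbitrary sets of released marginals.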
Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels
Statistics and Computing - 2016
Exact Bayesian inference for the Bingham distribution
Statistics and Computing - Volume 26 - Pages 349-360 - 2014
This paper is concerned with making Bayesian inference from data that are assumed to be drawn from a Bingham distribution. A barrier to the Bayesian approach is the parameter-dependent normalising constant of the Bingham distribution, which, even when it can be evaluated or accurately approximated, would have to be calculated at each iteration of an MCMC scheme, thereby greatly increasing the computational burden. We propose a method which enables exact (in Monte Carlo sense) Bayesian inference for the unknown parameters of the Bingham distribution by completely avoiding the need to evaluate this constant. We apply the method to simulated and real data, and illustrate that it is simpler to implement, faster, and performs better than an alternative algorithm that has recently been proposed in the literature.
Matching pursuit by undecimated discrete wavelet transform for non-stationary time series of arbitrary length
Statistics and Computing - Volume 8 - Pages 205-219 - 1998
We describe how to formulate a matching pursuit algorithm which successively approximates a periodic non-stationary time series with orthogonal projections onto elements of a suitable dictionary. We discuss how to construct such dictionaries derived from the maximal overlap (undecimated) discrete wavelet transform (MODWT). Unlike the standard discrete wavelet transform (DWT), the MODWT is equivariant under circular shifts and may be computed for an arbitrary length time series, not necessarily a multiple of a power of 2. We point out that when using the MODWT and continuing past the level where the filters are wrapped, the norms of the dictionary elements may, depending on N, deviate from the required value of unity and require renormalization. We analyse a time series of subtidal sea levels from Crescent City, California. The matching pursuit shows in an iterative fashion how localized dictionary elements (scale and position) account for residual variation, and in particular emphasizes differences in construction for varying parts of the series.
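The two MODWT properties mentioned in this abstract (circular-shift equivariance and applicability to any series length) can be checked on a toy case. The sketch below computes only the first-level Haar MODWT with periodic boundary handling; it does not build the paper's dictionary or run matching pursuit. The helper names `haar_modwt_level1` and `circular_shift` and the sample series are assumptions made for illustration.

```python
def haar_modwt_level1(x):
    """First-level Haar MODWT with circular (periodic) boundary handling.

    Uses the rescaled Haar filters (1/2, -1/2) and (1/2, 1/2), so the
    output has the same length as the input for any length N.
    """
    n = len(x)
    # x[t - 1] with t == 0 is x[-1], which wraps around to the last sample.
    wavelet = [(x[t] - x[t - 1]) / 2.0 for t in range(n)]
    scaling = [(x[t] + x[t - 1]) / 2.0 for t in range(n)]
    return wavelet, scaling


def circular_shift(x, k):
    """Return y with y[t] = x[(t - k) mod N]."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else x[:]


# A series whose length (7) is not a power of two.
series = [2.0, 5.0, 3.0, 8.0, 1.0, 4.0, 6.0]
w, v = haar_modwt_level1(series)

# Transforming a shifted series gives shifted coefficients.
w_shifted, v_shifted = haar_modwt_level1(circular_shift(series, 3))
equivariant = (w_shifted == circular_shift(w, 3)
               and v_shifted == circular_shift(v, 3))

# The level-1 coefficients preserve the energy of the series.
energy_in = sum(a * a for a in series)
energy_out = sum(a * a for a in w) + sum(a * a for a in v)
```

The decimated DWT would need an even length at this level and would not commute with odd shifts, which is the contrast the abstract draws.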
Correction to: Bayesian high-dimensional covariate selection in non-linear mixed-effects models using the SAEM algorithm
Statistics and Computing - 2024
Some computational aspects of Gaussian CARMA modelling
Statistics and Computing - 2013
Representation of continuous-time ARMA (Auto-Regressive-Moving-Average), CARMA, time-series models is reviewed. Computational aspects of simulating and calculating the likelihood-function of CARMA models are summarized. Some numerical properties are illustrated by simulations. Methods for enforcing the stationarity restriction on the parameter space are discussed. Such methods make restricted numerical estimation that enforces stationarity possible. The impact of scaling of time axis on the magnitude of the parameters is demonstrated. Proper scaling of the time axis can give parameter values of similar magnitude, which is useful for numerical work. The practicality of the computational approach is illustrated with some real and simulated data.
Generalized linear models for massive data via doubly-sketching
Statistics and Computing - 2023
Generalized linear models are a popular analytics tool with interpretable results and broad applicability, but require iterative estimation procedures that impose data transfer and computational costs that can be problematic under some infrastructure constraints. We propose a doubly-sketched approximation of the iteratively re-weighted least squares algorithm to estimate generalized linear model parameters using a sequence of surrogate datasets. The procedure sketches once to reduce data transfer costs, and sketches again to reduce data computation costs, yielding wall-clock time savings. Regression coefficients and standard errors are produced, with comparison against literature methods. Asymptotic properties of the proposed procedure are shown, with empirical results from simulated and real-world datasets. The efficacy of the proposed method is investigated across a variety of commodity computational infrastructure configurations accessible to practitioners. A highlight of the present work is the estimation of a Poisson-log generalized linear model across almost 1.7 billion observations on a personal computer in 25 min.
R-VGAL: a sequential variational Bayes algorithm for generalised linear mixed models
Statistics and Computing - 2024
Models with random effects, such as generalised linear mixed models (GLMMs), are often used for analysing clustered data. Parameter inference with these models is difficult because of the presence of cluster-specific random effects, which must be integrated out when evaluating the likelihood function. Here, we propose a sequential variational Bayes algorithm, called Recursive Variational Gaussian Approximation for Latent variable models (R-VGAL), for estimating parameters in GLMMs. The R-VGAL algorithm operates on the data sequentially, requires only a single pass through the data, and can provide parameter updates as new data are collected without the need of re-processing the previous data. At each update, the R-VGAL algorithm requires the gradient and Hessian of a “partial” log-likelihood function evaluated at the new observation, which are generally not available in closed form for GLMMs. To circumvent this issue, we propose using an importance-sampling-based approach for estimating the gradient and Hessian via Fisher’s and Louis’ identities. We find that R-VGAL can be unstable when traversing the first few data points, but that this issue can be mitigated by introducing a damping factor in the initial steps of the algorithm. Through illustrations on both simulated and real datasets, we show that R-VGAL provides good approximations to posterior distributions, that it can be made robust through damping, and that it is computationally efficient.
Split Hamiltonian Monte Carlo
Statistics and Computing - Volume 24 - Pages 339-349 - 2013
We show how the Hamiltonian Monte Carlo algorithm can sometimes be speeded up by “splitting” the Hamiltonian in a way that allows much of the movement around the state space to be done at low computational cost. One context where this is possible is when the log density of the distribution of interest (the potential energy function) can be written as the log of a Gaussian density, which is a quadratic function, plus a slowly-varying function. Hamiltonian dynamics for quadratic energy functions can be analytically solved. With the splitting technique, only the slowly-varying part of the energy needs to be handled numerically, and this can be done with a larger stepsize (and hence fewer steps) than would be necessary with a direct simulation of the dynamics. Another context where splitting helps is when the most important terms of the potential energy function and its gradient can be evaluated quickly, with only a slowly-varying part requiring costly computations. With splitting, the quick portion can be handled with a small stepsize, while the costly portion uses a larger stepsize. We show that both of these splitting approaches can reduce the computational cost of sampling from the posterior distribution for a logistic regression model, using either a Gaussian approximation centered on the posterior mode, or a Hamiltonian split into a term that depends on only a small number of critical cases, and another term that involves the larger number of cases whose influence on the posterior distribution is small.
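The first splitting described in this abstract can be sketched in one dimension. The code below assumes a unit-variance Gaussian part, unit mass, and a made-up slowly-varying term U1(q) = 0.1 q^4; these choices, and the names `split_trajectory` and `hamiltonian`, are illustrative assumptions and not the paper's logistic regression setup. The quadratic part is advanced by its exact solution (a rotation in phase space), and only the gradient of U1 is applied numerically in half steps.

```python
import math


def split_trajectory(q, p, grad_u1, step, n_steps):
    """Simulate H(q, p) = q^2/2 + p^2/2 + U1(q) with a symmetric splitting."""
    c, s = math.cos(step), math.sin(step)
    for _ in range(n_steps):
        # Half step on the momentum using only the slowly-varying part.
        p -= 0.5 * step * grad_u1(q)
        # Exact flow of the quadratic part: dq/dt = p, dp/dt = -q.
        q, p = q * c + p * s, -q * s + p * c
        # Second half step on the momentum.
        p -= 0.5 * step * grad_u1(q)
    return q, p


def hamiltonian(q, p, u1):
    """Total energy for the assumed unit-variance, unit-mass setup."""
    return 0.5 * q * q + 0.5 * p * p + u1(q)


def u1(q):
    # Hypothetical slowly-varying term, chosen only for this example.
    return 0.1 * q ** 4


def grad_u1(q):
    return 0.4 * q ** 3


q0, p0 = 1.0, 0.5

# With U1 = 0 the trajectory is exact, so energy is conserved up to rounding.
qg, pg = split_trajectory(q0, p0, lambda q: 0.0, step=0.3, n_steps=50)

# With U1 present the map is still reversible: flip the momentum, run again,
# flip back, and the starting point is recovered up to rounding.
q1, p1 = split_trajectory(q0, p0, grad_u1, step=0.3, n_steps=20)
qb, pb = split_trajectory(q1, -p1, grad_u1, step=0.3, n_steps=20)
energy_error = abs(hamiltonian(q1, p1, u1) - hamiltonian(q0, p0, u1))
```

In a full sampler the end point would be accepted or rejected with a Metropolis step based on the change in total energy, as in standard Hamiltonian Monte Carlo; that step is omitted here.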
Multiscale interpretation of taut string estimation and its connection to Unbalanced Haar wavelets
Statistics and Computing - Volume 21 - Pages 671-681 - 2010
We compare two state-of-the-art non-linear techniques for nonparametric function estimation via piecewise constant approximation: the taut string and the Unbalanced Haar methods. While it is well-known that the latter is multiscale, it is not obvious that the former can also be interpreted as multiscale. We provide a unified multiscale representation for both methods, which offers an insight into the relationship between them as well as suggesting lessons both methods can learn from each other.
Total: 1,338