Statistics and Computing

Notable scientific publications

* Data shown is for reference only

Discussion on the paper by Friedman and Fisher
Statistics and Computing - Volume 9 - Pages 146-147 - 1999
David W. Scott
EnMSP: an elastic-net multi-step screening procedure for high-dimensional regression
Statistics and Computing - Volume 34 - Pages 1-16 - 2024
Yushan Xue, Jie Ren, Bin Yang
To improve the estimation efficiency of high-dimensional regression problems, penalized regularization is routinely used. However, accurately estimating the model remains challenging, particularly in the presence of correlated effects, wherein irrelevant covariates exhibit strong correlation with relevant ones. This situation, referred to as correlated data, poses additional complexities for model estimation. In this paper, we propose the elastic-net multi-step screening procedure (EnMSP), an iterative algorithm designed to recover sparse linear models in the context of correlated data. EnMSP uses a small repeated penalty strategy to identify truly relevant covariates in a few iterations. Specifically, in each iteration, EnMSP enhances the adaptive lasso method by adding a weighted $l_2$ penalty, which improves the selection of relevant covariates. The method is shown to select the true model and achieve the $l_2$-norm error bound under certain conditions. The effectiveness of EnMSP is demonstrated through numerical comparisons and applications in financial data.
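As a rough illustration of the multi-step screening idea, the sketch below iterates an elastic-net fit with adaptive reweighting, emulating per-coefficient penalties by column rescaling. The weighting scheme, threshold, and stopping rule are illustrative assumptions, not the authors' exact procedure.

```python
# A minimal sketch of a multi-step elastic-net screening loop in the spirit
# of EnMSP; weights, thresholds and stopping are assumptions, not the paper's.
import numpy as np
from sklearn.linear_model import ElasticNet

def enmsp_sketch(X, y, n_steps=5, alpha=0.1, l1_ratio=0.9, eps=1e-6):
    n, p = X.shape
    weights = np.ones(p)            # adaptive weights, updated each step
    active = np.arange(p)           # currently retained covariates
    for _ in range(n_steps):
        # column rescaling emulates per-coefficient penalties
        Xw = X[:, active] / weights[active]
        fit = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(Xw, y)
        beta = fit.coef_ / weights[active]   # map back to original scale
        keep = np.abs(beta) > eps            # screen out near-zero effects
        if keep.all():
            break
        active = active[keep]
        if active.size == 0:
            break
        weights[active] = 1.0 / (np.abs(beta[keep]) + eps)  # adaptive reweighting
    return active

# usage: selected = enmsp_sketch(X, y)
```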
Distributed adaptive nearest neighbor classifier: algorithm and theory
Statistics and Computing - Volume 33 - Pages 1-23 - 2023
Ruiqi Liu, Ganggang Xu, Zuofeng Shang
When data is of an extraordinarily large size or physically stored in different locations, the distributed nearest neighbor (NN) classifier is an attractive tool for classification. We propose a novel distributed adaptive NN classifier for which the number of nearest neighbors is a tuning parameter stochastically chosen by a data-driven criterion. An early stopping rule is proposed when searching for the optimal tuning parameter, which not only speeds up the computation but also improves the finite sample performance of the proposed algorithm. The convergence rate of the excess risk of the distributed adaptive NN classifier is investigated under various sub-sample size compositions. In particular, we show that when the sub-sample sizes are sufficiently large, the proposed classifier achieves the nearly optimal convergence rate. The effectiveness of the proposed approach is demonstrated through simulation studies as well as an empirical application to a real-world dataset.
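A minimal sketch of the distributed scheme described above, assuming a simple cross-validation criterion for choosing k, a patience-based early stopping rule, and majority-vote aggregation; the paper's actual criterion and aggregation may differ.

```python
# Sketch: per-machine adaptive kNN with early-stopped search over k,
# then a majority vote across machines. Assumes integer class labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def local_adaptive_knn(X, y, k_grid=range(1, 51, 2), patience=3):
    best_k, best_score, stall = 1, -np.inf, 0
    for k in k_grid:                       # search k; stop early once the
        score = cross_val_score(           # criterion stops improving
            KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        if score > best_score:
            best_k, best_score, stall = k, score, 0
        else:
            stall += 1
            if stall >= patience:          # early stopping rule
                break
    return KNeighborsClassifier(n_neighbors=best_k).fit(X, y)

def distributed_predict(subsamples, X_new):
    # majority vote over the per-machine adaptive NN classifiers
    votes = np.stack([local_adaptive_knn(X, y).predict(X_new)
                      for X, y in subsamples])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```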
Resample-smoothing of Voronoi intensity estimators
Statistics and Computing - Volume 29, Issue 5 - Pages 995-1010 - 2019
Mehdi Moradi, Ottmar Cronie, Ege Rubak, Raphaël Lachièze-Rey, Jorge Mateu, Adrian Baddeley
Efficient Bayesian estimation of the multivariate Double Chain Markov Model
Statistics and Computing - Volume 23 - Pages 467-480 - 2012
Matthew Fitzpatrick, Dobrin Marchev
The Double Chain Markov Model (DCMM) is used to model an observable process $Y = \{Y_t\}_{t=1}^{T}$ as a Markov chain with transition matrix $P_{x_t}$, dependent on the value of an unobservable (hidden) Markov chain $\{X_t\}_{t=1}^{T}$. We present and justify an efficient algorithm for sampling from the posterior distribution associated with the DCMM, when the observable process Y consists of independent vectors of (possibly) different lengths. Convergence of the Gibbs sampler, used to simulate the posterior density, is improved by adding a random permutation step. Simulation studies are included to illustrate the method. The problem that motivated our model is presented at the end. It is an application to real data, consisting of the credit rating dynamics of a portfolio of financial companies where the (unobserved) hidden process is the state of the broader economy.
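The sketch below shows one Gibbs sweep for a toy DCMM with K hidden and M observable states: forward-filtering backward-sampling (FFBS) for the hidden chain, Dirichlet full conditionals for the transition matrices, and the random permutation step used to improve mixing. Priors and initial distributions are toy choices, not the paper's exact specification.

```python
# One Gibbs sweep for a toy DCMM; flat Dirichlet priors are an assumption.
import numpy as np
rng = np.random.default_rng(0)

def ffbs(y, P, Q):
    """Sample the hidden path given per-state transition matrices P[k] and Q."""
    T, K = len(y), Q.shape[0]
    alpha = np.zeros((T, K))
    alpha[0] = 1.0 / K                       # uniform initial filter
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ Q) * P[:, y[t-1], y[t]]
        alpha[t] /= alpha[t].sum()
    x = np.zeros(T, dtype=int)
    x[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):           # backward sampling
        w = alpha[t] * Q[:, x[t+1]]
        x[t] = rng.choice(K, p=w / w.sum())
    return x

def gibbs_sweep(y, x, K, M, prior=1.0):
    # Dirichlet full conditionals from observed/hidden transition counts
    P = np.empty((K, M, M))
    for k in range(K):
        cP = np.full((M, M), prior)
        for t in range(1, len(y)):
            if x[t] == k:
                cP[y[t-1], y[t]] += 1
        P[k] = np.array([rng.dirichlet(row) for row in cP])
    cQ = np.full((K, K), prior)
    for t in range(1, len(x)):
        cQ[x[t-1], x[t]] += 1
    Q = np.array([rng.dirichlet(row) for row in cQ])
    x = ffbs(y, P, Q)
    perm = rng.permutation(K)                # random permutation step
    inv = np.argsort(perm)
    return inv[x], P[perm], Q[np.ix_(perm, perm)]
```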
Distributed statistical optimization for non-randomly stored big data with application to penalized learning
Statistics and Computing - Volume 33 - Pages 1-13 - 2023
Kangning Wang, Shaomin Li
Distributed optimization for big data has recently attracted enormous attention. However, the existing algorithms all rest on one critical randomness condition, i.e., that the big data are randomly distributed across different machines. This condition seldom holds in practice, and violating it can seriously degrade the estimation accuracy. To fix this problem, we propose an optimization framework based on a pilot dataset and a surrogate loss function, which realizes communication-efficient distributed optimization for non-randomly distributed big data. Furthermore, we also apply it to penalized high-dimensional sparse learning problems by combining it with penalty functions. Theoretical properties and numerical results all confirm the good performance of the proposed methods.
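A minimal sketch of a surrogate-loss update in this spirit, for least squares: a pilot sample held centrally anchors the loss, and each machine communicates only a gradient. The CSL-style quadratic surrogate below is an illustrative assumption, not the authors' exact construction.

```python
# Sketch: one communication-efficient surrogate-loss step anchored on a
# central pilot sample; least-squares loss is an illustrative choice.
import numpy as np

def grad_ls(X, y, beta):
    return X.T @ (X @ beta - y) / len(y)       # least-squares gradient

def surrogate_step(X_pilot, y_pilot, worker_grads, beta0, lam=0.0):
    # surrogate gradient = pilot gradient shifted by the gap between the
    # averaged worker gradient and the pilot gradient, all at beta0
    g_global = np.mean(worker_grads, axis=0)
    shift = g_global - grad_ls(X_pilot, y_pilot, beta0)
    # minimise pilot loss + shift' beta (+ optional ridge penalty) in closed form
    n, p = X_pilot.shape
    A = X_pilot.T @ X_pilot / n + lam * np.eye(p)
    b = X_pilot.T @ y_pilot / n - shift
    return np.linalg.solve(A, b)

# usage: each machine m sends grad_ls(X_m, y_m, beta0); the centre calls
# surrogate_step(X_pilot, y_pilot, grads, beta0) and iterates.
```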
Generalised likelihood profiles for models with intractable likelihoods
Statistics and Computing - Volume 34 - Pages 1-14 - 2023
David J. Warne, Oliver J. Maclaren, Elliot J. Carr, Matthew J. Simpson, Christopher Drovandi
Likelihood profiling is an efficient and powerful frequentist approach for parameter estimation, uncertainty quantification and practical identifiability analysis. Unfortunately, these methods cannot be easily applied to stochastic models without a tractable likelihood function. Such models are typical in many fields of science, rendering these classical approaches impractical in those settings. To address this limitation, we develop a new approach that generalises the methods of likelihood profiling to situations where the likelihood cannot be evaluated but stochastic simulations of the assumed data-generating process are possible. Our approach is based upon recasting developments from generalised Bayesian inference into a frequentist setting. We derive a method for constructing generalised likelihood profiles and calibrating these profiles to achieve the desired frequentist coverage level. We demonstrate the performance of our method on realistic examples from the literature and highlight the capability of our approach for practical identifiability analysis of models with intractable likelihoods.
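As a toy illustration of simulation-based profiling, the sketch below profiles a Gaussian synthetic likelihood over a scalar parameter of a simple simulator. The simulator, summary statistic, grid, and the default chi-square cutoff (which the paper replaces with a calibrated one) are all assumptions.

```python
# Sketch: a generalised likelihood profile built from simulations only.
import numpy as np
rng = np.random.default_rng(0)

def simulator(theta, n=50):
    return rng.normal(theta, 1.0, size=n)      # stand-in stochastic model

def synthetic_loglik(theta, s_obs, n_sim=200):
    # estimate the summary statistic's mean/variance by simulation
    sims = np.array([simulator(theta).mean() for _ in range(n_sim)])
    mu, var = sims.mean(), sims.var() + 1e-12
    return -0.5 * (np.log(2 * np.pi * var) + (s_obs - mu) ** 2 / var)

y_obs = rng.normal(0.3, 1.0, size=50)
s_obs = y_obs.mean()                           # observed summary statistic
grid = np.linspace(-1.0, 1.5, 26)
profile = np.array([synthetic_loglik(t, s_obs) for t in grid])
# a cutoff below the maximum defines the profile interval; the paper
# calibrates this cutoff to attain the desired frequentist coverage
cutoff = profile.max() - 1.92                  # chi-square(1)/2 default
interval = grid[profile >= cutoff]
print(interval.min(), interval.max())
```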
Calibrating the Gaussian multi-target tracking model
Statistics and Computing - Volume 25 - Pages 595-608 - 2014
Sinan Yıldırım, Lan Jiang, Sumeetpal S. Singh, Thomas A. Dean
We present novel batch and online (sequential) versions of the expectation–maximisation (EM) algorithm for inferring the static parameters of a multiple target tracking (MTT) model. Online EM is of particular interest as it is a more practical method for long data sets, since in batch EM, or a full Bayesian approach, a complete pass over the data is required between successive parameter updates. Online EM is also suited to MTT applications that demand real-time processing of the data. Performance is assessed in numerical examples using simulated data for various scenarios. For batch estimation our method significantly outperforms an existing gradient based maximum likelihood technique, which we show to be significantly biased.
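To illustrate the batch/online contrast, the sketch below runs online EM on a two-component Gaussian mixture, updating running sufficient statistics after each datum; this is a generic stand-in for online EM, not the MTT model of the paper, and the step-size schedule is a standard assumption.

```python
# Sketch: online EM for a two-component Gaussian mixture (known variance).
import numpy as np
rng = np.random.default_rng(0)

def online_em_gmm(stream, mu=(-1.0, 1.0), w=(0.5, 0.5), sigma=1.0):
    mu, w = np.array(mu), np.array(w)
    s0 = w.copy()                        # running sufficient statistics
    s1 = w * mu
    for t, y in enumerate(stream, start=1):
        # E-step: responsibilities under the current parameters
        r = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2)
        r /= r.sum()
        # stochastic-approximation update of the sufficient statistics
        gamma = 1.0 / t ** 0.7           # step size, a standard choice
        s0 = (1 - gamma) * s0 + gamma * r
        s1 = (1 - gamma) * s1 + gamma * r * y
        # M-step: parameters recovered from the running statistics
        w, mu = s0, s1 / s0
    return w, mu

# usage: w, mu = online_em_gmm(rng.normal(rng.choice([-2, 2], 5000), 1.0))
```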
Control variates for stochastic gradient MCMC
Statistics and Computing - 2018
Jack Baker, Paul Fearnhead, Emily B. Fox, Christopher Nemeth
It is well known that Markov chain Monte Carlo (MCMC) methods scale poorly with dataset size. A popular class of methods for solving this issue is stochastic gradient MCMC (SGMCMC). These methods use a noisy estimate of the gradient of the log-posterior, which reduces the per-iteration computational cost of the algorithm. Despite this, there are a number of results suggesting that stochastic gradient Langevin dynamics (SGLD), probably the most popular of these methods, still has computational cost proportional to the dataset size. We suggest an alternative log-posterior gradient estimate for stochastic gradient MCMC which uses control variates to reduce the variance. We analyse SGLD using this gradient estimate, and show that, under log-concavity assumptions on the target distribution, the computational cost required for a given level of accuracy is independent of the dataset size. Next, we show that a different control-variate technique, known as zero variance control variates, can be applied to SGMCMC algorithms for free. This postprocessing step improves the inference of the algorithm by reducing the variance of the MCMC output. Zero variance control variates rely on the gradient of the log-posterior; we explore how the variance reduction is affected by replacing this with the noisy gradient estimate calculated by SGMCMC.
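A minimal sketch of SGLD with a control-variate gradient estimate on a toy Gaussian model: the full gradient is computed once at a mode estimate, and each iteration only subsamples the difference of per-datum gradients. The per-datum gradient, step size, and mode-finding step are illustrative assumptions.

```python
# Sketch: SGLD with control variates anchored at a posterior mode estimate.
import numpy as np
rng = np.random.default_rng(0)

def grad_log_post_i(theta, x_i):
    return -(theta - x_i)                # per-datum gradient (toy model)

def sgld_cv(data, theta_hat, full_grad_hat, step=1e-3, n_iter=10**4, batch=32):
    N = len(data)
    theta = theta_hat.copy()
    for _ in range(n_iter):
        idx = rng.choice(N, size=batch, replace=False)
        # control-variate estimate: full gradient at the mode plus the
        # subsampled difference of gradients at theta and at the mode
        diff = sum(grad_log_post_i(theta, data[i]) -
                   grad_log_post_i(theta_hat, data[i]) for i in idx)
        grad_est = full_grad_hat + (N / batch) * diff
        theta = theta + 0.5 * step * grad_est \
                + np.sqrt(step) * rng.standard_normal(theta.shape)
    return theta

# usage (toy, flat prior): data = rng.normal(1.0, 1.0, 1000)
#   theta_hat = np.array([data.mean()])
#   full_grad_hat = sum(grad_log_post_i(theta_hat, x) for x in data)
#   sample = sgld_cv(data, theta_hat, full_grad_hat)
```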
The minimum regularized covariance determinant estimator
Statistics and Computing - Volume 30 - Pages 113-128 - 2019
Kris Boudt, Peter J. Rousseeuw, Steven Vanduffel, Tim Verdonck
The minimum covariance determinant (MCD) approach estimates the location and scatter matrix using the subset of a given size with the lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the minimum regularized covariance determinant (MRCD) approach, which differs from the MCD in that the scatter matrix is a convex combination of a target matrix and the sample covariance matrix of the subset. A data-driven procedure sets the weight of the target matrix, so that the regularization is only used when needed. The MRCD estimator is defined in any dimension, is well-conditioned by construction and preserves the good robustness properties of the MCD. We prove that so-called concentration steps can be performed to reduce the MRCD objective function, and we exploit this fact to construct a fast algorithm. We verify the accuracy and robustness of the MRCD estimator in a simulation study and illustrate its practical use for outlier detection and regression analysis on real-life high-dimensional data sets in chemistry and criminology.
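A minimal sketch of the concentration steps with a regularised scatter matrix: each step keeps the h points with the smallest Mahalanobis distance under the current regularised scatter. The identity target and the fixed weight rho are simplifying assumptions; the paper sets the weight data-adaptively.

```python
# Sketch: MRCD-style concentration (C-) steps with a fixed regularisation weight.
import numpy as np

def mrcd_csteps(X, h, rho=0.25, n_steps=25, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    subset = rng.choice(n, size=h, replace=False)   # random initial h-subset
    target = np.eye(p)                  # simple target matrix (assumption)
    for _ in range(n_steps):
        mu = X[subset].mean(axis=0)
        S = np.cov(X[subset], rowvar=False)
        K = rho * target + (1 - rho) * S   # regularised scatter: always PD
        # Mahalanobis distances of all points under the current estimate
        d = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(K), X - mu)
        new = np.argsort(d)[:h]            # concentration step
        if set(new) == set(subset):        # fixed point reached
            break
        subset = new
    return mu, K, subset

# usage: mu, scatter, inliers = mrcd_csteps(X, h=int(0.75 * len(X)))
```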
Total: 1,338