Journal of the Royal Statistical Society. Series B: Statistical Methodology

Notable Scientific Publications

Estimating the Number of Clusters in a Data Set Via the Gap Statistic
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 63, Issue 2 - Pages 411-423 - 2001
Robert Tibshirani, Guenther Walther, Trevor Hastie
Summary We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
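As a rough illustration of the idea, the sketch below computes gap values using scikit-learn's KMeans and a uniform reference distribution over the data's bounding box; the function names are ours, and the paper's standard-error rule for choosing the final k is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, labels):
    """Pooled within-cluster sum of squared distances to the cluster means."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

def gap_statistic(X, k_max=10, n_refs=20, random_state=0):
    """Gap values for k = 1..k_max against a uniform-box reference distribution."""
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        log_wk = np.log(within_dispersion(X, labels))
        ref_log_wk = []
        for _ in range(n_refs):
            X_ref = rng.uniform(lo, hi, size=X.shape)          # reference data, no cluster structure
            ref_labels = KMeans(n_clusters=k, n_init=10,
                                random_state=random_state).fit_predict(X_ref)
            ref_log_wk.append(np.log(within_dispersion(X_ref, ref_labels)))
        gaps.append(np.mean(ref_log_wk) - log_wk)               # Gap(k)
    return np.array(gaps)
```

The paper then selects the smallest k for which Gap(k) is at least Gap(k+1) minus the simulation standard error of the reference dispersions.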
Assessing the Finite Dimensionality of Functional Data
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 68, Issue 4 - Pages 689-705 - 2006
Peter Hall, Céline Vial
Summary If a problem in functional data analysis is low dimensional then the methodology for its solution can often be reduced to relatively conventional techniques in multivariate analysis. Hence, there is intrinsic interest in assessing the finite dimensionality of functional data. We show that this problem has several unique features. From some viewpoints the problem is trivial, in the sense that continuously distributed functional data which are exactly finite dimensional are immediately recognizable as such, if the sample size is sufficiently large. However, in practice, functional data are almost always observed with noise, for example, resulting from rounding or experimental error. Then the problem is almost insolubly difficult. In such cases a part of the average noise variance is confounded with the true signal and is not identifiable. However, it is possible to define the unconfounded part of the noise variance. This represents the best possible lower bound to all potential values of average noise variance and is estimable in low noise settings. Moreover, bootstrap methods can be used to describe the reliability of estimates of unconfounded noise variance, under the assumption that the signal is finite dimensional. Motivated by these ideas, we suggest techniques for assessing the finiteness of dimensionality. In particular, we show how to construct a critical point V̂_q such that, if the distribution of our functional data has fewer than q−1 degrees of freedom, then we should be willing to assume that the average variance of the added noise is at least V̂_q. If this level seems too high then we must conclude that the dimension is at least q−1. We show that simpler, more conventional techniques, based on hypothesis testing, are generally not effective.
Econometric Analysis of Realized Volatility and its Use in Estimating Stochastic Volatility Models
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 64, Issue 2 - Pages 253-280 - 2002
Ole E. Barndorff-Nielsen, Neil Shephard
Summary The availability of intraday data on the prices of speculative assets means that we can use quadratic variation-like measures of activity in financial markets, called realized volatility, to study the stochastic properties of returns. Here, under the assumption of a rather general stochastic volatility model, we derive the moments and the asymptotic distribution of the realized volatility error—the difference between realized volatility and the discretized integrated volatility (which we call actual volatility). These properties can be used to allow us to estimate the parameters of stochastic volatility models without recourse to the use of simulation-intensive methods.
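For intuition, realized volatility is built from high-frequency returns; a minimal sketch (assuming a vector of intraday prices for a single trading period, function name ours):

```python
import numpy as np

def realized_volatility(intraday_prices):
    """Realized variance for one period: sum of squared intraday log-returns."""
    log_returns = np.diff(np.log(np.asarray(intraday_prices, dtype=float)))
    return np.sum(log_returns ** 2)
```

The paper studies this quantity as an estimator of the integrated volatility of an underlying stochastic volatility model.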
Maximum Likelihood from Incomplete Data Via the EM Algorithm
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 39, Issue 1 - Pages 1-22 - 1977
A. P. Dempster, Nan M. Laird, Donald B. Rubin
Summary A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
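As an illustration of the EM iteration for one of the sketched examples (a finite mixture), here is a minimal, self-contained fit of a two-component univariate normal mixture. It is not the paper's notation or code, just the E-step/M-step pattern, with the monotone log-likelihood used as a stopping rule.

```python
import numpy as np

def em_two_normals(x, n_iter=500, tol=1e-9):
    """EM for a two-component univariate normal mixture (illustrative sketch only)."""
    x = np.asarray(x, dtype=float)
    pi, mu1, mu2 = 0.5, x.min(), x.max()        # crude starting values
    var1 = var2 = x.var()
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: component densities and posterior responsibilities.
        d1 = pi * np.exp(-(x - mu1) ** 2 / (2 * var1)) / np.sqrt(2 * np.pi * var1)
        d2 = (1 - pi) * np.exp(-(x - mu2) ** 2 / (2 * var2)) / np.sqrt(2 * np.pi * var2)
        ll = np.log(d1 + d2).sum()              # observed-data log-likelihood (non-decreasing under EM)
        gamma = d1 / (d1 + d2)
        # M-step: responsibility-weighted maximum likelihood updates.
        pi = gamma.mean()
        mu1 = (gamma * x).sum() / gamma.sum()
        mu2 = ((1 - gamma) * x).sum() / (1 - gamma).sum()
        var1 = (gamma * (x - mu1) ** 2).sum() / gamma.sum()
        var2 = ((1 - gamma) * (x - mu2) ** 2).sum() / (1 - gamma).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return dict(pi=pi, mu=(mu1, mu2), var=(var1, var2), loglik=ll)
```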
Acceleration of the EM Algorithm by using Quasi-Newton Methods
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 59, Issue 3 - Pages 569-587 - 1997
Mortaza Jamshidian, Robert I. Jennrich
Summary The EM algorithm is a popular method for maximum likelihood estimation. Its simplicity in many applications and desirable convergence properties make it very attractive. Its sometimes slow convergence, however, has prompted researchers to propose methods to accelerate it. We review these methods, classifying them into three groups: pure, hybrid and EM-type accelerators. We propose a new pure and a new hybrid accelerator both based on quasi-Newton methods and numerically compare these and two other quasi-Newton accelerators. For this we use examples in each of three areas: Poisson mixtures, the estimation of covariance from incomplete data and multivariate normal mixtures. In these comparisons, the new hybrid accelerator was fastest on most of the examples and often dramatically so. In some cases it accelerated the EM algorithm by factors of over 100. The new pure accelerator is very simple to implement and competed well with the other accelerators. It accelerated the EM algorithm in some cases by factors of over 50. To obtain standard errors, we propose to approximate the inverse of the observed information matrix by using auxiliary output from the new hybrid accelerator. A numerical evaluation of these approximations indicates that they may be useful at least for exploratory purposes.
Causal Inference by using Invariant Prediction: Identification and Confidence Intervals
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 78, Issue 5 - Pages 947-1012 - 2016
Jonas Peters, Peter Bühlmann, Nicolai Meinshausen
Summary What is the difference between a prediction that is made with a causal model and that with a non-causal model? Suppose that we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (e.g. various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments.
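A very crude sketch of the invariance idea, with hypothetical function names and a deliberately simplified invariance test (a one-way ANOVA on pooled least-squares residuals across environments, rather than the paper's tests on regression coefficients and residual variances):

```python
from itertools import combinations
import numpy as np
from scipy import stats

def invariant_sets(X, y, env, alpha=0.05):
    """Keep predictor subsets whose pooled least-squares residuals look invariant
    (equal means) across environments; a simplified surrogate for the paper's test."""
    n, p = X.shape
    accepted = []
    for size in range(p + 1):
        for S in combinations(range(p), size):
            Z = np.column_stack([np.ones(n)] + [X[:, j] for j in S])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            resid = y - Z @ beta
            groups = [resid[env == e] for e in np.unique(env)]
            if stats.f_oneway(*groups).pvalue > alpha:
                accepted.append(S)
    return accepted
```

In the paper, the set of causal predictors is recovered (with the stated confidence guarantee) by intersecting all accepted subsets.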
A Direct Approach to False Discovery Rates
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 64, Issue 3 - Pages 479-498 - 2002
John D. Storey
Summary Multiple-hypothesis testing involves guarding against much more complicated errors than single-hypothesis testing. Whereas we typically control the type I error rate for a single-hypothesis test, a compound error rate is controlled for multiple-hypothesis tests. For example, controlling the false discovery rate FDR traditionally involves intricate sequential p-value rejection methods based on the observed data. Whereas a sequential p-value method fixes the error rate and estimates its corresponding rejection region, we propose the opposite approach—we fix the rejection region and then estimate its corresponding error rate. This new approach offers increased applicability, accuracy and power. We apply the methodology to both the positive false discovery rate pFDR and FDR, and provide evidence for its benefits. It is shown that pFDR is probably the quantity of interest over FDR. Also discussed is the calculation of the q-value, the pFDR analogue of the p-value, which eliminates the need to set the error rate beforehand as is traditionally done. Some simple numerical examples are presented that show that this new approach can yield an increase of over eight times in power compared with the Benjamini–Hochberg FDR method.
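A minimal sketch of the "fix the rejection region, estimate its error rate" idea, computing q-values from a vector of p-values. The tuning parameter λ used to estimate the null proportion is simply fixed at 0.5 here, whereas the paper discusses its choice; the function name is ours.

```python
import numpy as np

def storey_qvalues(pvalues, lam=0.5):
    """q-values via an estimated null proportion and per-threshold FDR estimates."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    # Estimate the proportion of true nulls from the (roughly flat) right tail of the p-values.
    pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))
    order = np.argsort(p)
    q = np.empty(m)
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the q-values.
    for rank, i in zip(range(m, 0, -1), order[::-1]):
        prev = min(prev, pi0 * m * p[i] / rank)   # estimated FDR when rejecting all p <= p[i]
        q[i] = prev
    return q
```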
Probabilistic Principal Component Analysis
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 61, Issue 3 - Pages 611-622 - 1999
Michael E. Tipping, Chris Bishop
Summary Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based on a probability model. We demonstrate how the principal axes of a set of observed data vectors may be determined through maximum likelihood estimation of parameters in a latent variable model that is closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss, with illustrative examples, the advantages conveyed by this probabilistic approach to PCA.
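The maximum likelihood solution has a closed form in terms of the leading eigenvectors of the sample covariance matrix; a short sketch of that fit (function name ours), as an alternative to the EM algorithm discussed in the paper:

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form maximum likelihood PPCA fit for a q-dimensional latent space (q < d)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False, bias=True)     # sample covariance
    evals, evecs = np.linalg.eigh(S)                # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]      # reorder to descending
    sigma2 = evals[q:].mean()                       # ML noise variance: mean of discarded eigenvalues
    W = evecs[:, :q] @ np.diag(np.sqrt(np.maximum(evals[:q] - sigma2, 0.0)))
    return mu, W, sigma2
```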
Regression Shrinkage and Selection Via the Lasso
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 58, Issue 1 - Pages 267-288 - 1996
Robert Tibshirani
Summary We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
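For illustration, the constrained problem is equivalent to a penalized one, which a simple cyclic coordinate-descent sketch can solve (assuming centred y and standardized columns of X; function names ours; this is not the algorithm proposed in the paper, which predates coordinate-descent lasso solvers):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the building block of the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=1000):
    """Lasso in penalized form, (1/2n)||y - Xb||^2 + lam*||b||_1, via coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norms = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]          # partial residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_j, n * lam) / col_norms[j]
    return beta
```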
Regularization and Variable Selection Via the Elastic Net
Journal of the Royal Statistical Society. Series B: Statistical Methodology - Volume 67, Issue 2 - Pages 301-320 - 2005
Hui Zou, Trevor Hastie
Summary We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a similar sparsity of representation. In addition, the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. The elastic net is particularly useful when the number of predictors (p) is much bigger than the number of observations (n). By contrast, the lasso is not a very satisfactory variable selection method in the p≫n case. An algorithm called LARS-EN is proposed for computing elastic net regularization paths efficiently, much like algorithm LARS does for the lasso.
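The same coordinate-descent sketch used above for the lasso extends to the naive elastic net penalty, lam*(alpha*||b||_1 + ((1-alpha)/2)*||b||_2^2); the paper's elastic net additionally rescales this naive solution, and LARS-EN is the path algorithm it actually proposes, so this is only an illustrative solver with hypothetical names.

```python
import numpy as np

def enet_cd(X, y, lam, alpha=0.5, n_iter=1000):
    """Naive elastic net via coordinate descent:
    minimize (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_norms = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]          # partial residual excluding feature j
            z = X[:, j] @ r_j
            # L1 part soft-thresholds; L2 part shrinks the denominator (ridge-like).
            beta[j] = np.sign(z) * max(abs(z) - n * lam * alpha, 0.0) / (col_norms[j] + n * lam * (1 - alpha))
    return beta
```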