Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring

American Association for the Advancement of Science (AAAS) - Vol. 286, No. 5439 - pp. 531-537 - 1999
Todd R. Golub1,2, Donna K. Slonim2, Pablo Tamayo2, Christine Huard2, Michelle Gaasenbeek2, Jill P. Mesirov2, Hilary A. Coller2, Mignon L. Loh1, James R. Downing3, M A Caligiuri4, C D Bloomfield4, Eric S. Lander5,2
1Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA 02115 USA
2Whitehead Institute/Massachusetts Institute of Technology, Center for Genome Research, Cambridge, MA 02139 USA
3St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
4Comprehensive Cancer Center and Cancer and Leukemia Group B, Ohio State University, Columbus, OH 43210, USA.
5Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02142, USA

Abstract

Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.

References

Triche T. J., et al., Prog. Clin. Biol. Res. 271, 475 (1988).

Stephenson C. F., Bridge J. A., Sandberg A. A., Hum. Pathol. 23, 1270 (1992);

Delattre O., et al., N. Engl. J. Med. 331, 294 (1994);

Turc-Carel C., et al., Cancer Genet. Cytogenet. 19, 361 (1986);

Douglass E. C., et al., Cytogenet. Cell Genet. 45, 148 (1987);

Dalla-Favera R., et al., Proc. Natl. Acad. Sci. U.S.A. 79, 7824 (1982);

R. Taub et al., ibid., p. 7837; G. Balaban-Malenbaum and F. Gilbert, Science 198, 739 (1977).

Farber S., Diamond L. K., Mercer R. D., Sylvester R. F., Wolff J. A., N. Engl. J. Med. 238, 787 (1948).

C. E. Forkner, Leukemia and Allied Disorders (Macmillan, New York, 1938);

Frei E., et al., Blood 18, 431 (1961);

Medical Research Council, Br. Med. J. 1, 7 (1963).

Quaglino D., Hayhoe F. G. J., J. Pathol. 78, 521 (1959);

Bennett J. M., Dutcher T. F., Blood 33, 341 (1969);

Graham R. C., Lundholm U., Karnovsky M. J., J. Histochem. Cytochem. 13, 150 (1965).

Tsukimoto I., Wong R. Y., Lampkin B. C., N. Engl. J. Med. 294, 245 (1976);

Schlossman S. F., et al., Proc. Natl. Acad. Sci. U.S.A. 73, 1288 (1976);

Roper M., et al., Blood 61, 830 (1983);

Sallan S. E., et al., Blood 55, 395 (1980);

J. M. Pesando et al., ibid. 54, 1240 (1979).

Golub T. R., et al., Proc. Natl. Acad. Sci. U.S.A. 92, 4917 (1995);

McLean T. W., et al., Blood 88, 4252 (1996);

Shurtleff S. A., et al., Leukemia 9, 1985 (1995);

Romana S. P., et al., Blood 86, 4263 (1995);

Rowley J. D., Ann. Genet. 16, 109 (1973).

Recent reviews of ALL and AML therapy can be found in

Pui C. H., Evans W. E., N. Engl. J. Med. 339, 605 (1998);

Bishop J. F., Med. J. Aust. 170, 39 (1999);

Stone R. M., Mayer R. J., Hematol. Oncol. Clin. N. Am. 7, 47 (1993).

10.1038/ng1296-457

10.1038/nbt1296-1675

10.1126/science.283.5398.83

10.1038/nbt1297-1359

10.1091/mbc.9.12.3273

10.1073/pnas.93.20.10614

Yang G. P., Ross D. T., Kuang W. W., Brown P. O., Weigel R. J., Nucleic Acids Res. 27, 1517 (1999).

10.1126/science.274.5287.536

10.1038/ng1296-457

Kononen J., et al., Nature Med. 4, 844 (1998);

Khan J., et al., Cancer Res. 58, 5009 (1998);

K. A. Cole et al., Nature Genet. 21 (suppl. 1), 38 (1999).

10.1038/ng1296-457

Yang G. P., Ross D. T., Kuang W. W., Brown P. O., Weigel R. J., Nucleic Acids Res. 27, 1517 (1999);

Khan J., et al., Cancer Res. 58, 5009 (1998);

Khan J., et al., Electrophoresis 20, 223 (1999).

We compared six normal human kidney biopsies and six kidney tumors (renal cell carcinomas, RCCs) using the methods described for the leukemias. Neighborhood analysis showed a high density of genes correlated with the distinction. Class predictors were constructed using 50 genes, and the predictions proved to be 100% accurate in cross-validation. The informative genes more highly expressed in normal kidney as compared to RCCs included 13 metabolic enzymes, two ion channels, and three isoforms of the heavy-metal chelator metallothionein, all of which function in normal kidney physiology. Those more highly expressed in RCC than normal kidney included interleukin-1, an inflammatory cytokine responsible for the febrile response experienced by patients with RCC, and CCND1, a D-type cyclin amplified in some cases of RCC.

The initial 38 samples were all derived from bone marrow aspirates performed at the time of diagnosis, before chemotherapy. After informed consent was obtained, mononuclear cells were collected by Ficoll sedimentation and total RNA was extracted with either Trizol (Gibco/BRL) or RNAqueous reagents (Ambion). The 27 ALL samples were derived from childhood ALL patients treated on Dana-Farber Cancer Institute (DFCI) protocols between 1980 and 1999. Samples were randomly selected from the leukemia cell bank based on availability. The 11 adult AML samples were similarly obtained from the Cancer and Leukemia Group B (CALGB) leukemia cell bank. Samples were selected without regard to immunophenotype, cytogenetics, or other molecular features. The independent samples used to confirm the results contained a broader range of samples, including peripheral blood samples and childhood AML cases (23).

A total of 3 to 10 μg of total RNA from each sample was used to prepare biotinylated target essentially as previously described, with minor modifications [10.1073/pnas.96.6.2907; 10.1038/nbt1297-1359]. A complete description of the biochemical and mathematical procedures used in this paper is available through our Web site at www.genome.wi.mit.edu/MPR.

Samples were excluded if they yielded less than 15 μg of biotinylated RNA, if the hybridization was weak (see our Web site for quantitative criteria), or if there were visible defects in the array (such as scratches). A total of 80 leukemia samples were analyzed during the course of the experiments reported here. Of these, eight were excluded on the basis of these a priori quality control criteria.

Each gene is represented by an expression vector $v(g) = (e_1, e_2, \ldots, e_n)$, where $e_i$ denotes the expression level of gene $g$ in the $i$th sample in the initial set $S$ of samples. A class distinction is represented by an idealized expression pattern $c = (c_1, c_2, \ldots, c_n)$, where $c_i = +1$ or $0$ according to whether the $i$th sample belongs to class 1 or class 2. One can measure "correlation" between a gene and a class distinction in a variety of ways, for example with the Pearson correlation coefficient or the Euclidean distance. We used a measure of correlation, $P(g,c)$, that emphasizes the "signal-to-noise" ratio in using the gene as a predictor. Let $[\mu_1(g), \sigma_1(g)]$ and $[\mu_2(g), \sigma_2(g)]$ denote the means and SDs of the log of the expression levels of gene $g$ for the samples in class 1 and class 2, respectively. Let $P(g,c) = [\mu_1(g) - \mu_2(g)]/[\sigma_1(g) + \sigma_2(g)]$, which reflects the difference between the classes relative to the SD within the classes. Large values of $|P(g,c)|$ indicate a strong correlation between the gene expression and the class distinction, while the sign of $P(g,c)$ being positive or negative corresponds to $g$ being more highly expressed in class 1 or class 2. Unlike a standard Pearson correlation coefficient, $P(g,c)$ is not confined to the range $[-1, +1]$. Neighborhoods $N_1(c,r)$ and $N_2(c,r)$ of radius $r$ around class 1 and class 2 were defined to be the sets of genes such that $P(g,c) \geq r$ and $P(g,c) \leq -r$, respectively. An unusually large number of genes within the neighborhoods indicates that many genes have expression patterns closely correlated with the class vector.
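The signal-to-noise score and the neighborhood counts can be illustrated with a minimal Python sketch. The function names `signal_to_noise` and `neighborhood_sizes` are hypothetical, the use of the sample SD (n - 1 denominator) is an assumption because the note does not say which SD was used, and the neighborhoods are assumed to contain genes with a score of at least r (class 1) or at most -r (class 2).

```python
from statistics import mean, stdev


def signal_to_noise(log_expr, labels):
    """Return P(g, c) = (mu1 - mu2) / (sigma1 + sigma2) for one gene.

    log_expr: log expression levels of the gene, one value per sample.
    labels:   1 for class 1 and 0 for class 2 (the idealized pattern c).
    """
    class1 = [e for e, c in zip(log_expr, labels) if c == 1]
    class2 = [e for e, c in zip(log_expr, labels) if c == 0]
    # Assumption: sample SD; each class needs at least two samples.
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))


def neighborhood_sizes(scores, r):
    """Count genes in the radius-r neighborhoods of class 1 and class 2."""
    n1 = sum(1 for p in scores if p >= r)   # genes high in class 1
    n2 = sum(1 for p in scores if p <= -r)  # genes high in class 2
    return n1, n2
```

A negative score means the gene is more highly expressed in class 2, so the two neighborhoods are counted separately.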

A permutation test was used to calculate whether the density of genes in a neighborhood was statistically significantly higher than expected. We compared the number of genes in the neighborhood to the number of genes in similar neighborhoods around idealized expression patterns corresponding to random class distinctions, obtained by permuting the coordinates of c. We performed 400 permutations and determined the 5% and 1% significance levels for the number of genes contained within neighborhoods of various levels of correlation with c. See also the legend to Fig. 2.
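The permutation test can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the fixed random seed, and the percentile rule used to read off the significance threshold are all hypothetical, and only the class 1 neighborhood (score at least r) is counted.

```python
import math
import random
from statistics import mean, stdev


def snr(values, labels):
    """Signal-to-noise score; labels use 1 for class 1 and 0 for class 2."""
    c1 = [v for v, c in zip(values, labels) if c == 1]
    c2 = [v for v, c in zip(values, labels) if c == 0]
    denom = stdev(c1) + stdev(c2)
    return (mean(c1) - mean(c2)) / denom if denom > 0 else 0.0


def neighborhood_count(expr, labels, r):
    """Number of genes (rows of expr) whose score is at least r."""
    return sum(1 for gene in expr if snr(gene, labels) >= r)


def permutation_threshold(expr, labels, r, n_perm=400, level=0.05, seed=0):
    """Neighborhood size reached by at most about `level` of random class labelings."""
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    counts = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # permute the coordinates of c
        counts.append(neighborhood_count(expr, shuffled, r))
    counts.sort()
    index = min(n_perm - 1, math.ceil((1.0 - level) * n_perm) - 1)
    return counts[index]
```

An observed neighborhood count above the returned threshold would be read as significant at the chosen level; calling the function with `level=0.01` gives the stricter 1% line.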

The set of informative genes consists of the $n/2$ genes closest to a class vector high in class 1 [that is, $P(g,c)$ as large as possible] and the $n/2$ genes closest to class 2 [that is, $-P(g,c)$ as large as possible]. The number $n$ of informative genes is the only free parameter in defining the class predictor.
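A minimal Python sketch of this selection step, assuming the scores have already been computed and stored in a dictionary (the function name `informative_genes` and the requirement that n be even are assumptions made for the illustration):

```python
def informative_genes(scores, n):
    """Pick the n/2 highest-scoring and n/2 lowest-scoring genes.

    scores: mapping from gene name to its score P(g, c).
    Returns (genes high in class 1, genes high in class 2).
    """
    if n < 2 or n % 2 != 0 or n > len(scores):
        raise ValueError("n must be even and between 2 and the number of genes")
    ranked = sorted(scores, key=scores.get, reverse=True)
    half = n // 2
    class1_genes = ranked[:half]          # largest P(g, c)
    class2_genes = ranked[-half:][::-1]   # largest -P(g, c), strongest first
    return class1_genes, class2_genes
```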

The class predictor is uniquely defined by the initial set $S$ of samples and the set of informative genes. Parameters $(a_g, b_g)$ are defined for each informative gene. The value $a_g = P(g,c)$ reflects the correlation between the expression levels of $g$ and the class distinction. The value $b_g = [\mu_1(g) + \mu_2(g)]/2$ is the average of the mean log expression values in the two classes. Consider a new sample $X$ to be predicted. Let $x_g$ denote the normalized log (expression level) of gene $g$ in the sample (where the expression level is normalized by subtracting the mean and dividing by the SD of the expression levels in the initial set $S$). The vote of gene $g$ is $v_g = a_g (x_g - b_g)$, with a positive value indicating a vote for class 1 and a negative value indicating a vote for class 2. The total vote $V_1$ for class 1 is obtained by summing the absolute values of the positive votes over the informative genes, while the total vote $V_2$ for class 2 is obtained by summing the absolute values of the negative votes.
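The weighted vote can be written out as a short Python sketch. The function name `weighted_vote` and the dictionary layout are hypothetical, and the input values are assumed to be already normalized log expression levels.

```python
def weighted_vote(x, params):
    """Return the vote totals (V1, V2) for a new sample.

    x:      mapping gene -> normalized log expression in the new sample.
    params: mapping gene -> (a_g, b_g) for each informative gene.
    """
    v1 = 0.0
    v2 = 0.0
    for gene, (a, b) in params.items():
        vote = a * (x[gene] - b)
        if vote > 0:
            v1 += vote    # positive vote: class 1
        else:
            v2 += -vote   # negative vote: class 2 (absolute value)
    return v1, v2
```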

The prediction strength PS is defined as $PS = (V_{\mathrm{win}} - V_{\mathrm{lose}})/(V_{\mathrm{win}} + V_{\mathrm{lose}})$, where $V_{\mathrm{win}}$ and $V_{\mathrm{lose}}$ are the vote totals for the winning and losing classes. The measure PS reflects the relative margin of victory of the vote.
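A minimal Python sketch of the prediction strength, combined with a rejection rule. The default threshold of 0.3 is taken from the T-ALL versus B-ALL note in this list; the function name and the convention of returning `None` for an uncertain call are assumptions.

```python
def prediction_strength(v1, v2, threshold=0.3):
    """Return (PS, call), where call is 1, 2, or None if PS is below threshold."""
    v_win, v_lose = max(v1, v2), min(v1, v2)
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total > 0 else 0.0
    if ps < threshold or v1 == v2:
        return ps, None          # margin too small: no prediction is made
    return ps, (1 if v1 > v2 else 2)
```

For example, vote totals of 2.5 and 1.0 give PS = 1.5/3.5, which is above 0.3, so the sample is called class 1.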

The appropriate PS threshold depends on the number $n$ of genes in the predictor, because the PS is a sum of $n$ variables corresponding to the individual genes and thus its fluctuation for random input data scales inversely with $\sqrt{n}$. See our Web site concerning the specific choice of PS threshold.

In cross-validation, the entire prediction process is repeated from scratch with 37 of the 38 samples. This includes identifying the 50 informative genes to be used in the predictor and defining parameters for weighted voting.
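The leave-one-out procedure can be sketched in Python as below. This is a simplified illustration rather than the authors' code: all function names are hypothetical, the input is assumed to be already normalized log expression, no PS threshold is applied, and class labels use 1 for class 1 and 0 for class 2.

```python
from statistics import mean, stdev


def gene_params(values, labels):
    """Return (a_g, b_g): the signal-to-noise score and the class-mean midpoint."""
    c1 = [v for v, c in zip(values, labels) if c == 1]
    c2 = [v for v, c in zip(values, labels) if c == 0]
    denom = stdev(c1) + stdev(c2)
    a = (mean(c1) - mean(c2)) / denom if denom > 0 else 0.0
    return a, (mean(c1) + mean(c2)) / 2


def train(expr, labels, n_genes):
    """Select informative genes and their voting parameters from training data."""
    stats = [(g,) + gene_params(row, labels) for g, row in enumerate(expr)]
    ranked = sorted(stats, key=lambda s: s[1], reverse=True)
    half = n_genes // 2
    return ranked[:half] + ranked[-half:]


def predict(model, sample):
    """Weighted vote over the informative genes; returns 1 or 0."""
    v1 = v2 = 0.0
    for g, a, b in model:
        vote = a * (sample[g] - b)
        if vote > 0:
            v1 += vote
        else:
            v2 -= vote
    return 1 if v1 > v2 else 0


def leave_one_out(expr, labels, n_genes):
    """Count misclassified samples when each sample is withheld in turn."""
    n = len(labels)
    errors = 0
    for held in range(n):
        keep = [i for i in range(n) if i != held]
        train_expr = [[row[i] for i in keep] for row in expr]
        train_labels = [labels[i] for i in keep]
        # Gene selection is redone inside the loop, without the withheld sample.
        model = train(train_expr, train_labels, n_genes)
        sample = [row[held] for row in expr]
        if predict(model, sample) != labels[held]:
            errors += 1
    return errors
```

Repeating gene selection inside the loop is the point of the design: choosing the informative genes once on all samples would let the withheld sample influence its own prediction.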

The independent set of leukemia samples comprised 24 bone marrow and 10 peripheral blood specimens, all obtained at the time of leukemia diagnosis. The ALL samples were obtained from the DFCI childhood ALL bank (n = 17) or St. Jude Children's Research Hospital (SJCRH) (n = 3). Whereas the AML samples in the initial data set were all derived from adult patients, the AML samples in the independent data set were derived from both adults and children. The samples were obtained from either the CALGB (adult AML, n = 4), SJCRH (childhood AML, n = 5), or the Children's Cancer Group (childhood AML, n = 5) leukemia banks. The samples were processed as described (13), with the exception of the samples from SJCRH, which used a very different protocol. The SJCRH samples were subjected to hypotonic lysis (rather than Ficoll sedimentation), and RNA was prepared by an aqueous extraction (Qiagen).

Although the number of genes used had no significant effect on the outcome in this case (median PS for cross-validation ranged from 0.81 to 0.68 over a range of predictors using 10 to 200 genes, all with 0% error), it may matter in other instances. One approach is to vary the number of genes used, select the number that maximizes the accuracy rate in cross-validation, and then use the resulting model on the independent data set. In any case, we recommend using at least 10 genes, for two reasons. Class predictors using a small number of genes may depend too heavily on any one gene and can produce spuriously high prediction strengths (because a large "margin of victory" can occur by chance due to statistical fluctuation resulting from a small number of genes). In general, we also considered the 99% confidence line in neighborhood analysis to be the upper bound for gene selection.

Dinndorf P. A., et al., Med. Pediatr. Oncol. 20, 192 (1992);

Master P. S., Richards S. J., Kendall J., Roberts B. E., Scott C. S., Blut 59, 221 (1989);

Buccheri V., et al., Blood 82, 853 (1993).

Konopleva M., et al., Blood 93, 1668 (1999).

Crawford A. W., Beckerle M. C., J. Biol. Chem. 266, 5847 (1991).

Ross W., Rowe T., Glisson B., Yalowich J., Liu L., Cancer Res. 44, 5857 (1984).

Treatment failure was defined as failure to achieve a complete remission after a standard induction regimen including 3 days of anthracycline and 7 days of cytarabine. Treatment successes were defined as patients in continuous complete remission for a minimum of 3 years. FAB subclass M3 patients were excluded, but samples were otherwise not selected with regard to FAB criteria.

Borrow J., et al., Nature Genet. 12, 159 (1996);

T. Nakamura et al., ibid., p. 154; S. Y. Huang et al., Br. J. Haematol. 96, 682 (1997).

Kroon E., et al., EMBO J. 17, 3714 (1998).

10.1073/pnas.96.6.2907

The SOM was constructed using our GENECLUSTER software (32) with a variation filter excluding genes with less than fivefold variation across the collection of samples.
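A variation filter of this kind might look like the following Python sketch. It is not the GENECLUSTER implementation: the function name is hypothetical, and fold variation is assumed to mean the ratio of the largest to the smallest (positive) expression level of a gene across samples.

```python
def variation_filter(expr, min_fold=5.0):
    """Keep genes whose max/min expression ratio is at least min_fold.

    expr: mapping gene -> list of raw (positive) expression levels.
    """
    kept = {}
    for gene, levels in expr.items():
        lo, hi = min(levels), max(levels)
        # Assumption: nonpositive levels were floored to a positive value upstream.
        if lo > 0 and hi / lo >= min_fold:
            kept[gene] = levels
    return kept
```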

For testing putative clusters derived from the SOM or chosen at random, we constructed class predictors with various numbers of genes (ranging from 10 to 100) and selected the one with the highest cross-validation accuracy rate (in this case, 20 genes).

A related approach would be to represent each cluster only as the subset of points lying near the centroid of the cluster.

Various statistical methods can be used to compare the predictors derived from the SOM-derived clusters with predictors derived from random classes. We compared the median prediction strength. Specifically, 100 predictors corresponding to random classes of comparable size were constructed, and the median PS for each predictor was determined. The performance of the actual predictor was then compared to the distribution of these 100 median PSs to obtain empirical significance levels. The observed median PS in the initial data set was 0.86, which exceeded the median PS for all 100 random predictors; the empirical significance level was thus <1%. The observed median PS for the independent data set was 0.61, which exceeded the median PS for all but 4 of the 100 random permutations; the empirical significance level was thus 4%.
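The empirical significance level described here reduces to a simple count, sketched below in Python. The function name is hypothetical, and counting ties as "not exceeded" is an assumption.

```python
from statistics import median


def empirical_significance(observed_ps, random_ps_lists):
    """Fraction of random-class predictors whose median PS the observed median does not exceed.

    observed_ps:     PS values for the actual predictor, one per sample.
    random_ps_lists: one list of PS values per random-class predictor.
    """
    observed_median = median(observed_ps)
    random_medians = [median(ps) for ps in random_ps_lists]
    not_exceeded = sum(1 for m in random_medians if m >= observed_median)
    return not_exceeded / len(random_medians)
```

With 100 random predictors, a result of 0.04 corresponds to the 4% level reported for the independent data set, and a result of 0.0 is reported as <1%.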

Various approaches can be used to test classes $C_1, C_2, \ldots, C_n$ arising from a multinode SOM. One can construct predictors to distinguish each pair of classes ($C_i$ versus $C_j$) or to distinguish each class from the complement of the class ($C_i$ versus not $C_i$). Here, we used the pair-wise approach ($C_i$ versus $C_j$). For cross-validation, one can restrict attention to samples known to lie in the union of $C_i$ and $C_j$. For an independent data set, one must examine all samples (because it is unknown which samples lie in the union of $C_i$ and $C_j$). It may be possible to improve the statistical power of this test by using techniques for multiclass prediction.

Thirty-three ALL samples were tested by cross-validation using a 50-gene predictor. Thirty-two of 33 samples were correctly assigned as T-ALL or B-ALL; the remaining sample received a PS < 0.3, and no prediction was therefore made. Details are provided on our Web site.

T. R. Golub, unpublished results.

Turc-Carel C., et al., Cancer Genet. Cytogenet. 19, 361 (1986);

Douglass E. C., et al., Cytogenet. Cell Genet. 45, 148 (1987).

We are grateful to S. Sallan, J. Ritz, K. Loughlin, S. Shurtleff, P. Kourlas, F. Smith, the Cancer and Leukemia Group B, and Children's Cancer Group for providing valuable patient samples. We thank R. Klausner, D. G. Gilliland, D. Nathan, G. Daley, J. Staunton, M. Angelo, A. Leblanc, P. Lee, Z. Kikinis, G. Acton, and members of the Lander and Golub laboratories for helpful discussions. This work was supported in part by the Leukemia Society of America (T.R.G.); the National Institutes of Health and the Leukemia Clinical Research Foundation (C.D.B.); and Affymetrix, Millennium Pharmaceuticals, and Bristol-Myers Squibb (E.S.L.).