The Sequence of the Human Genome
Abstract
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies—a whole-genome assembly and a regional chromosome assembly—were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ∼12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. 
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
References and Notes
U.S. Department of Energy, Office of Health and Environmental Research, Sequencing the Human Genome: Summary Report of the Santa Fe Workshop, Santa Fe, NM, 3 to 4 March 1986 (Los Alamos National Laboratory, Los Alamos, NM, 1986).
R. Cook-Deegan, The Gene Wars: Science, Politics, and the Human Genome (Norton, New York, 1996).
Seeburg P. H., et al., Trans. Assoc. Am. Physicians 90, 109 (1977).
Adams M. D., et al., Nature 377, 3 (1995).
Mahy B. W. J., Esposito J. J., Venter J. C., Am. Soc. Microbiol. News 57, 577 (1991).
International Human Genome Sequencing Consortium, Nature 409, 860 (2001).
Institutional review board: P. Calabresi (chairman), H. P. Freeman, C. McCarthy, A. L. Caplan, G. D. Rogell, J. Karp, M. K. Evans, B. Margus, C. L. Carter, R. A. Millman, S. Broder.
Eligibility criteria for participation in the study were as follows: prospective donors had to be 21 years of age or older, not pregnant, and capable of giving informed consent. Donors were asked to self-define their ethnic backgrounds. Standard blood bank screens (screening for HIV, hepatitis viruses, and so forth) were performed on all samples at the clinical laboratory prior to DNA extraction in the Celera laboratory. All samples that tested positive for transmissible viruses were ineligible and were discarded. Karyotype analysis was performed on peripheral blood lymphocytes from all samples selected for sequencing; all were normal. A two-stage consent process for prospective donors was employed. The first stage of the consent process provided information about the genome project, procedures, and risks and benefits of participating. The second stage of the consent process involved answering follow-up questions and signing consent forms, and was conducted about 48 hours after the first.
DNA was isolated from blood (173) or sperm. For sperm, a washed pellet (100 μl) was lysed in a suspension (1 ml) containing 0.1 M NaCl, 10 mM tris-Cl–20 mM EDTA (pH 8), 1% SDS, 1 mg proteinase K, and 10 mM dithiothreitol for 1 hour at 37°C. The lysate was extracted with aqueous phenol and with phenol/chloroform. The DNA was ethanol precipitated and dissolved in 1 ml TE buffer. To make genomic libraries, DNA was randomly sheared, end-polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments, and size-selected by electrophoresis on 1% low-melting-point agarose. After ligation to Bst XI adapters (Invitrogen, catalog no. N408-18), DNA was purified by three rounds of gel electrophoresis to remove excess adapters, and the fragments, now with 3′-CACA overhangs, were inserted into Bst XI-linearized plasmid vector with 3′-TGTG overhangs. Libraries with three different average sizes of inserts were constructed: 2, 10, and 50 kbp. The 2-kbp fragments were cloned in a high-copy pUC18 derivative. The 10- and 50-kbp fragments were cloned in a medium-copy pBR322 derivative. The 2- and 10-kbp libraries yielded uniform-sized large colonies on plating. However, the 50-kbp libraries produced many small colonies, and inserts were unstable. To remedy this, the 50-kbp libraries were digested with Bgl II, which does not cleave the vector but generally cleaved several times within the 50-kbp insert. A 1264-bp Bam HI kanamycin resistance cassette (purified from pUCK4; Amersham Pharmacia, catalog no. 27-4958-01) was added, and ligation was carried out at 37°C in the continual presence of Bgl II. As Bgl II–Bgl II ligations occurred, they were continually cleaved, whereas Bam HI–Bgl II ligations were not cleaved. A high yield of internally deleted circular library molecules was obtained in which the residual insert ends were separated by the kanamycin cassette DNA.
The internally deleted libraries, when plated on agar containing ampicillin (50 μg/ml), carbenicillin (50 μg/ml), and kanamycin (15 μg/ml), produced relatively uniform large colonies. The resulting clones could be prepared for sequencing using the same procedures as clones from the 10-kbp libraries.
Transformed cells were plated on agar diffusion plates prepared with a fresh top layer containing no antibiotic poured on top of a previously set bottom layer containing excess antibiotic to achieve the correct final concentration. This method of plating permitted the cells to develop antibiotic resistance before being exposed to antibiotic without the potential clone bias that can be introduced through liquid outgrowth protocols. After colonies had grown QBot (Genetix UK) automated colony-picking robots were used to pick colonies meeting stringent size and shape criteria and to inoculate 384-well microtiter plates containing liquid growth medium. Liquid cultures were incubated overnight with shaking and were scored for growth before passing to template preparation. Template DNA was extracted from liquid bacterial culture using a procedure based upon the alkaline lysis miniprep method (173) adapted for high throughput processing in 384-well microtiter plates. Bacterial cells were lysed; cell debris was removed by centrifugation; and plasmid DNA was recovered by isopropanol precipitation and resuspended in 10 mM tris-HCl buffer. Reagent dispensing operations were accomplished using Titertek MAP 8 liquid dispensing systems. Plate-to-plate liquid transfers were performed using Tomtec Quadra 384 Model 320 pipetting robots. All plates were tracked throughout processing by unique plate barcodes. Mated sequencing reads from opposite ends of each clone insert were obtained by preparing two 384-well cycle sequencing reaction plates from each plate of plasmid template DNA using ABI-PRISM BigDye Terminator chemistry (Applied Biosystems) and standard M13 forward and reverse primers. Sequencing reactions were prepared using the Tomtec Quadra 384-320 pipetting robot. Parent-child plate relationships and by extension forward-reverse sequence mate pairs were established by automated plate barcode reading by the onboard barcode reader and were recorded by direct LIMS communication. 
Sequencing reaction products were purified by alcohol precipitation and were dried, sealed, and stored at 4°C in the dark until needed for sequencing, at which time the reaction products were resuspended in deionized formamide and sealed immediately to prevent degradation. All sequence data were generated using a single sequencing platform, the ABI PRISM 3700 DNA Analyzer. Sample sheets were created at load time using a Java-based application that facilitates barcode scanning of the sequencing plate barcode, retrieves sample information from the central LIMS, and reserves unique trace identifiers. The application permitted a single sample sheet file in the linking directory and deleted previously created sample sheet files immediately upon scanning of a sample plate barcode, thus enhancing sample sheet-to-plate associations.
Celera's computing environment is based on Compaq Computer Corporation's Alpha system technology running the Tru64 Unix operating system. Celera uses these Alphas as Data Servers and as nodes in a Virtual Compute Farm (VCF), all of which are connected to a fully switched network operating at Fast Ethernet speed (for the VCF) and gigabit Ethernet speed (for data servers). Load balancing and scheduling software manages the submission and execution of jobs based on central processing unit (CPU) speed, memory requirements, and priority. The Virtual Compute Farm is composed of 440 Alpha CPUs, which include model EV6 running at a clock speed of 400 MHz and EV67 running at 667 MHz. Available memory on these systems ranges from 2 GB to 8 GB. The VCF is used to manage trace file processing and annotation. Genome assembly was performed on a GS 160 running 16 EV67s (667 MHz) and 64 GB of memory, and 10 ES40s running 4 EV6s (500 MHz) and 32 GB of memory. A total of 100 terabytes of physical disk storage was included in a Storage Area Network that was available to systems across the environment. To ensure high availability, file and database servers were configured as 4-node Alpha TruClusters, so that services would fail over in the event of hardware or software failure. Data availability was further enhanced by using hardware- and software-based disk striping (RAID-0), disk mirroring (RAID-1), and disk striping with parity (RAID-5).
Trace processing generates quality values for base calls by means of Paracel's TraceTuner, trims sequence reads according to quality values, trims vector and adapter sequence from high-quality reads, and screens sequences for contaminants. Similar in design and algorithm to the phred program (174), TraceTuner reports quality values that reflect the log-odds score of each base being correct. Read quality was evaluated in 50-bp windows, each read being trimmed to include only those consecutive 50-bp segments with a minimum mean accuracy of 97%. End windows (both ends of the trace) of 1, 5, 10, 25, and 50 bases were trimmed to a minimum mean accuracy of 98%. Every read was further checked for vector and contaminant matches of 50 bp or more, and if found, the read was removed from consideration. Finally, any match to the 5′ vector splice junction in the initial part of a read was removed.
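The window-based trimming rule in the note above can be illustrated with a minimal sketch. It assumes phred-style quality values, where a value Q corresponds to an error probability of 10^(−Q/10), and it uses non-overlapping windows; both choices, and the function names, are assumptions made for illustration and are not taken from TraceTuner itself. The separate end-window rule (98% accuracy) is not modeled.

```python
from typing import List, Tuple


def accuracy(q: int) -> float:
    """Per-base accuracy under the assumed phred scale: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def trim_read(quals: List[int], window: int = 50,
              min_acc: float = 0.97) -> Tuple[int, int]:
    """Return (start, end) of the longest run of consecutive full windows
    whose mean accuracy is at least min_acc, or (0, 0) if none qualifies."""
    best = (0, 0)
    run_start = None
    n_windows = len(quals) // window
    # One extra iteration acts as a sentinel that closes an open run.
    for w in range(n_windows + 1):
        ok = False
        if w < n_windows:
            chunk = quals[w * window:(w + 1) * window]
            ok = sum(accuracy(q) for q in chunk) / window >= min_acc
        if ok and run_start is None:
            run_start = w * window
        elif not ok and run_start is not None:
            end = w * window
            if end - run_start > best[1] - best[0]:
                best = (run_start, end)
            run_start = None
    return best
```

For a read with 50 poor bases, 100 good bases, and 50 poor bases, the sketch keeps the middle 100 bases.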
National Center for Biotechnology Information (NCBI); available at www.ncbi.nlm.nih.gov/.
NCBI; available at www.ncbi.nlm.nih.gov/HTGS/.
All bactigs over 3 kbp were examined for coverage by Celera mate pairs. An interval of a bactig was deemed an assembly error where there were no mate pairs spanning the interval and at least two reads that should have their mate on the other side of the interval but did not. In other words, there was no mate pair evidence supporting a join in the breakpoint interval and at least two mate pairs contradicting the join. By this criterion, we detected and broke apart bactigs at 13,037 locations, or equivalently, we found 2.13% of the bactigs to be misassembled.
We considered a BAC entry to be chimeric if, by the Lander-Waterman statistic (175), the odds were 0.99 or more that the assembly we produced was inconsistent with the sequence coming from a single source. By this criterion, 714 or 2.2% of BAC entries were deemed chimeric.
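The mate-pair criterion for a misassembled interval can be sketched as follows. The `Read` record, its fields, and the way an expected mate position is projected from a library's mean insert length are hypothetical simplifications introduced here; the note does not describe the data structures actually used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    pos: int                 # read start on the bactig
    forward: bool            # True if the read points toward higher coordinates
    insert: int              # expected clone insert length (assumed library mean)
    mate_pos: Optional[int]  # mate position on this bactig, None if not placed here


def expected_mate(r: Read) -> int:
    """Where the mate should land if the bactig were correctly assembled."""
    return r.pos + r.insert if r.forward else r.pos - r.insert


def is_misassembled(reads: List[Read], a: int, b: int,
                    bactig_len: int, min_contra: int = 2) -> bool:
    """Interval [a, b] is flagged when no mate pair spans it and at least
    min_contra reads expect a mate across it that is not there."""
    spanning = 0
    contradicting = 0
    for r in reads:
        left, right = r.pos < a, r.pos > b
        if not (left or right):
            continue
        if r.mate_pos is not None:
            if (left and r.mate_pos > b) or (right and r.mate_pos < a):
                spanning += 1
        else:
            m = expected_mate(r)
            inside = 0 <= m < bactig_len
            if inside and ((left and m > b) or (right and m < a)):
                contradicting += 1
    return spanning == 0 and contradicting >= min_contra
```

Two unmated reads pointing across the interval flag it; a single spanning mate pair clears it.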
E. W. Myers, J. L. Weber, in Computational Methods in Genome Research, S. Suhai, Ed. (Plenum, New York, 1996), pp. 73–89.
P. Deloukas et al., Science 282, 744 (1998).
J. Zhang et al., data not shown.
Shredded bactigs were located on long CSA scaffolds (>500 kbp), and the distribution of these fragments on the scaffolds was analyzed. If the spread of these fragments was greater than four times the reported BAC length, the BAC was considered to be chimeric. In addition, if >20% of the bactigs of a given BAC were found on different scaffolds that were not adjacent in map position, then the BAC was also considered chimeric. The total number of chimeric BACs divided by the number of BACs used for CSA gave the minimal estimate of the chimerism rate.
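The two chimerism tests in this note can be written down as a small sketch. The input layout (one `(scaffold_id, coordinate)` pair per placed unit), the use of the most frequent scaffold as the "home" scaffold, and the `adjacent` set supplied by the caller are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Set, Tuple


def is_chimeric(placements: List[Tuple[str, int]], bac_len: int,
                adjacent: Set[str], spread_factor: float = 4.0,
                off_fraction: float = 0.20) -> bool:
    """placements: (scaffold_id, coordinate) for each placed unit of one BAC.
    adjacent: scaffolds adjacent in map position to the home scaffold
    (hypothetical input; the note does not say how adjacency was derived)."""
    if not placements:
        return False
    # Assumed definition of the home scaffold: the one with the most placements.
    home, _ = Counter(s for s, _ in placements).most_common(1)[0]
    coords = [p for s, p in placements if s == home]
    spread = max(coords) - min(coords)
    off = sum(1 for s, _ in placements if s != home and s not in adjacent)
    too_wide = spread > spread_factor * bac_len
    too_scattered = off / len(placements) > off_fraction
    return too_wide or too_scattered
```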
The International RH Mapping Consortium, available at www.ncbi.nlm.nih.gov/genemap99/.
See Masker.html.
See .
M. Yandell in preparation.
Scaffolds containing greater than 10 kbp of sequence were analyzed for features of biological importance through a series of computational steps, and the results were stored in a relational database. For scaffolds greater than one megabase, the sequence was cut into single megabase pieces before computational analysis. All sequence was masked for complex repeats using Repeatmasker (52) before gene finding or homology-based analysis. The computational pipeline required ∼7 hours of CPU time per megabase, including repeat masking, or a total compute time of about 20,000 CPU hours. Protein searches were performed against the nonredundant protein database available at the NCBI. Nucleotide searches were performed against human, mouse, and rat Celera Gene Indices (assemblies of cDNA and EST sequences), mouse genomic DNA reads generated at Celera (3×), the Ensembl gene database available at the European Bioinformatics Institute (EBI), human and rodent (mouse and rat) EST data sets parsed from the dbEST database (NCBI), and a curated subset of the RefSeq experimental mRNA database (NCBI). Initial searches were performed on repeat-masked sequence with BLAST 2.0 (54) optimized for the Compaq Alpha compute-server and an effective database size of $3 \times 10^{9}$ for BLASTN searches and $1 \times 10^{9}$ for BLASTX searches. Additional processing of each query-subject pair was performed to improve the alignments. All protein BLAST results having an expectation score of $<1 \times 10^{-4}$, human nucleotide BLAST results having an expectation score of $<1 \times 10^{-8}$ with >94% identity, and rodent nucleotide BLAST results having an expectation score of $<1 \times 10^{-8}$ with >80% identity were then examined on the basis of their high-scoring pair (HSP) coordinates on the scaffold to remove redundant hits, retaining hits that supported possible alternative splicing. For BLASTX searches, analysis was performed separately for selected model organisms (yeast, mouse, human, C. elegans, and D. melanogaster) so as not to exclude HSPs from these organisms that support the same gene structure. Sequences producing BLAST hits judged to be informative, nonredundant, and sufficiently similar to the scaffold sequence were then realigned to the genomic sequence with Sim4 for ESTs and with Lap for proteins. Because both of these algorithms take splicing into account, the resulting alignments usually give a better representation of intron-exon boundaries than standard BLAST analyses and thus facilitate further annotation (both machine and human). In addition to the homology-based analysis described above, three ab initio gene prediction programs were used (63).
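The expectation-score and identity cutoffs in this note amount to a simple per-hit filter, sketched below. The `Hit` record and the category labels are hypothetical, and the rodent expectation cutoff is read as 10^−8, which is an interpretation of the garbled exponent in the source rather than a confirmed value. Redundancy removal by HSP coordinates is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Hit:
    kind: str        # "protein", "human_nt", or "rodent_nt" (hypothetical labels)
    evalue: float    # BLAST expectation score
    identity: float  # percent identity; ignored for protein hits


# (maximum expectation score, minimum percent identity or None)
THRESHOLDS: Dict[str, Tuple[float, Optional[float]]] = {
    "protein": (1e-4, None),
    "human_nt": (1e-8, 94.0),
    "rodent_nt": (1e-8, 80.0),  # exponent assumed to be -8
}


def passes(hit: Hit) -> bool:
    """True if the hit is significant enough to go on to redundancy filtering."""
    max_e, min_id = THRESHOLDS[hit.kind]
    if hit.evalue >= max_e:
        return False
    return min_id is None or hit.identity > min_id
```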
Miklos G. L., John B., Am. J. Hum. Genet. 31, 264 (1979).
P. E. Warburton, H. F. Willard, in Human Genome Evolution, M. S. Jackson, T. Strachan, G. Dover, Eds. (BIOS Scientific, Oxford, 1996), pp. 121–145.
Holmquist G. P., Am. J. Hum. Genet. 51, 17 (1992).
Lek first compares all proteins in the proteome to one another. Next the resulting BLAST reports are parsed and a graph is created wherein each protein constitutes a node; any hit between two proteins with an expectation beneath a user-specified threshold constitutes an edge. Lek then uses this graph to compute a similarity between each protein pair ij in the context of the graph as a whole by simply dividing the number of BLAST hits shared in common between the two proteins by the total number of proteins hit by i and j. This simple metric has several interesting properties. First because the similarity metric takes into account both the similarity and the differences between the two sequences at the level of BLAST hits the metric respects the multidomain nature of protein space. Two multidomain proteins for instance each containing domains A and B will have a greater pairwise similarity to each other than either one will have to a protein containing only A or B domains so long as A-B–containing multidomain proteins are less frequent in the proteome than are single-domain proteins containing A or B domains. A second interesting property of this similarity metric is that it can be used to produce a similarity matrix for the proteome as a whole without having to first produce a multiple alignment for each protein family an error-prone and very time-consuming process. Finally the metric does not require that either sequence have significant homology to the other in order to have a defined similarity to each other only that they share at least one significant BLAST hit in common. This is an especially interesting property of the metric because it allows the rapid recovery of protein families from the proteome for which no multiple alignment is possible thus providing a computational basis for the extension of protein homology searches beyond those of current HMM- and profile-based search methods. 
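The similarity metric described here (shared hits divided by total proteins hit by either member of the pair) can be sketched as a set ratio. The dictionary layout and the inclusion of each protein's hit to itself are assumptions for illustration; the note does not specify how Lek stores the graph or treats self-hits.

```python
from typing import Dict, Set


def lek_similarity(hits: Dict[str, Set[str]], i: str, j: str) -> float:
    """Shared BLAST hits of i and j divided by all proteins hit by i or j."""
    a, b = hits[i], hits[j]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy proteome: AB1 and AB2 carry domains A and B; A1 and B1 carry one each.
# Every protein is assumed to hit itself.
toy_hits = {
    "AB1": {"AB1", "AB2", "A1", "B1"},
    "AB2": {"AB1", "AB2", "A1", "B1"},
    "A1": {"A1", "AB1", "AB2"},
    "B1": {"B1", "AB1", "AB2"},
}
```

In this toy example the two A-B proteins score 1.0 against each other and 0.75 against the single-domain protein A1, which matches the multidomain property described in the note. A1 and B1 score 0.5 even though neither hits the other, because they share hits to the A-B proteins.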
Once the whole-proteome similarity matrix has been calculated, Lek first partitions the proteome into single-linkage clusters (27) on the basis of one or more shared BLAST hits between two sequences. Next, these single-linkage clusters are further partitioned into subclusters, each member of which shares a user-specified pairwise similarity with the other members of the cluster, as described above. For the purposes of this publication, we have focused on the analysis of single-linkage clusters and what we have termed “complete clusters,” i.e., those subclusters for which every member has a similarity metric of 1 to every other member of the subcluster. We believe that the single-linkage and complete clusters are of special interest, in part because they allow us to estimate and to compare sizes of core protein sets in a rigorous manner. The rationale for this is as follows: if one imagines for a moment a perfect clustering algorithm capable of perfectly partitioning one or more perfectly annotated protein sets into protein families, it is reasonable to assume that the number of clusters will always be greater than or equal to the number of single-linkage clusters, because single-linkage clustering is a maximally agglomerative clustering method. Thus, if there exists a single protein in the predicted protein set containing domains A and B, then it will be clustered by single linkage together with all single-domain proteins containing domains A or B. Likewise, for a predicted protein set containing a single multidomain protein, the number of real clusters must always be less than or equal to the number of complete clusters, because it is impossible to place a unique multidomain protein into a complete cluster. Thus, the single-linkage and complete clusters plus singletons should comprise a lower and upper bound of sizes of core protein sets, respectively, allowing us to compare the relative size and complexity of different organisms' predicted protein sets.
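Single-linkage partitioning on shared BLAST hits can be sketched with a union-find structure. This is an illustrative stand-in, not Lek's code: it assumes every protein hits itself and that hits are recorded as sets of protein names, so that joining each protein to its targets yields the same components as joining every pair with a shared hit. Complete clusters are not computed here.

```python
from typing import Dict, List, Set


def single_linkage(hits: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group proteins into connected components of the BLAST-hit graph."""
    parent = {p: p for p in hits}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, targets in hits.items():
        for t in targets:
            if t in parent:  # ignore targets outside the proteome
                parent[find(p)] = find(t)

    clusters: Dict[str, Set[str]] = {}
    for p in hits:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())
```

With one A-B multidomain protein present, all A-only and B-only proteins fall into the same single-linkage cluster, which is the maximally agglomerative behavior the note relies on for its lower bound.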
The probability that a contiguous set of proteins is the result of a segmental duplication can be estimated approximately as follows. Given that proteins A and B occur on one chromosome and that A′ and B′ (paralogs of A and B) also exist in the genome, the probability that B′ occurs immediately after A′ is $1/N$, where $N$ is the number of proteins in the set (for this analysis, $N = 26{,}588$). Allowing for B′ to occur as any of the next $J - 1$ proteins (leaving a gap between A′ and B′) increases the probability to $(J - 1)/N$; allowing B′A′ or A′B′ gives a probability of $2(J - 1)/N$. Considering three genes ABC, the probability of observing A′B′C′ elsewhere in the genome, given that the paralogs exist, is $1/N^{2}$. Three proteins can occur across a spread of five positions in six ways; more generally, we compute the number of ways that $K$ proteins can be spread across $J$ positions by counting all possible arrangements of $K - 2$ proteins in the $J - 2$ positions between the first and last protein. Allowing for a spread to vary from $K$ positions (no gaps) to $J$ gives $L = \sum_{X=K-2}^{J-2} \binom{X}{K-2}$ arrangements. Thus, the probability of chance occurrence is $L/N^{K-1}$. Allowing for both sets of genes (e.g., ABC and A′B′C′) to be spread across $J$ positions increases this to $L^{2}/N^{K-1}$. The duplicated segment might be rearranged by the operations of reversal or translocation; allowing for $M$ such rearrangements gives us a probability $P = L^{2}M/N^{K-1}$. For example, the probability of observing a duplicated set of three genes in two different locations, where the three genes occur across a spread of five positions in both locations, is $36/N^{2}$; the expected number of such matched sets in the predicted protein set is approximately $(N)\,36/N^{2} = 36/N$, a value $\ll 1$. Therefore, any such duplications of three genes are unlikely to result from random rearrangements of the genome. If any of the genes occur in more than two copies, the probability that the apparent duplication has occurred by chance increases.
The algorithm for selecting candidate duplications only generates matched protein sets with $P \ll 1$.
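The counting argument in this note can be checked numerically. The sketch below computes the number of arrangements L as the sum of binomial coefficients C(X, K−2) for X from K−2 to J−2, and the chance probability P = L²M/N^(K−1); the function names are illustrative only.

```python
from math import comb


def arrangements(k: int, j: int) -> int:
    """Ways to place K proteins across a spread of K..J positions:
    sum over X = K-2 .. J-2 of C(X, K-2)."""
    return sum(comb(x, k - 2) for x in range(k - 2, j - 1))


def chance_probability(k: int, j: int, n: int, m: int = 1) -> float:
    """P = L^2 * M / N^(K-1), with both copies allowed a spread of up to J."""
    L = arrangements(k, j)
    return L * L * m / n ** (k - 1)


N = 26588                          # proteins in the set, from the note
L = arrangements(3, 5)             # three genes across up to five positions
expected_sets = N * chance_probability(3, 5, N)   # equals 36 / N
```

For K = 3 and J = 5 the sum gives the six arrangements quoted in the note, so L² = 36, and for K = 2 it reduces to J − 1, consistent with the two-gene case.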
Reviewed in W.-H. Li, Molecular Evolution (Sinauer, Sunderland, MA, 1997).
From the observed coverage of the sequences at each site for each individual, we calculated the probability that a SNP would be detected at the site if it were present. For each level of coverage, there is a binomial sampling of the two homologs for each individual, and a heterozygous site could only be ascertained if both homologs are present or if two alleles from different individuals are present. With coverage $x$ from a given individual, both homologs are present in the assembly with probability $1 - (1/2)^{x-1}$. Even if both homologs are present, the probability that a SNP is detected is <1 because a fraction of sites failed the quality criteria. Integrating over coverage levels, the binomial sampling, and the quality distribution, we derived an expected number of sites in the genome that were ascertained for polymorphism for each individual. The nucleotide diversity was then the observed number of variable sites divided by the expected number of sites ascertained.
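The ascertainment correction can be illustrated for a single individual. The sketch below uses the stated probability 1 − (1/2)^(x−1) that both homologs are sampled at coverage x; representing the quality filter as one scalar `pass_rate` and ignoring alleles contributed by other individuals are simplifying assumptions, not part of the described method.

```python
from typing import Dict


def p_both_homologs(x: int) -> float:
    """Probability that x reads from one individual sample both homologs."""
    return 0.0 if x < 1 else 1.0 - 0.5 ** (x - 1)


def expected_ascertained(sites_by_coverage: Dict[int, int],
                         pass_rate: float) -> float:
    """Expected number of sites at which a heterozygous SNP could be seen.
    sites_by_coverage maps coverage x to the number of sites with that coverage;
    pass_rate is an assumed scalar stand-in for the quality distribution."""
    return sum(n * p_both_homologs(x) * pass_rate
               for x, n in sites_by_coverage.items())


def nucleotide_diversity(observed_variable: int,
                         sites_by_coverage: Dict[int, int],
                         pass_rate: float) -> float:
    """Observed variable sites divided by expected ascertained sites."""
    return observed_variable / expected_ascertained(sites_by_coverage, pass_rate)
```

At coverage 1 no heterozygous site can be ascertained, at coverage 2 half of them can, and the probability approaches 1 as coverage grows.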
R. R. Hudson, in Oxford Surveys in Evolutionary Biology, D. J. Futuyma, J. D. Antonovics, Eds. (Oxford Univ. Press, Oxford, 1990), vol. 7, pp. 1–44.
Brief description of the methods used to build the Panther classification. First, the June 2000 release of the GenBank NR protein database (excluding sequences annotated as fragments or mutants) was partitioned into clusters using BLASTP. For the clustering, a seed sequence was randomly chosen, and the cluster was defined as all sequences matching the seed to statistical significance (E-value $< 10^{-5}$) and “globally” alignable (the length of the match region must be >70% and <130% of the length of the seed). If the cluster had more than five members and at least one from a multicellular eukaryote, the cluster was extended. For the extension step, a hidden Markov model (HMM) was trained for the cluster using the SAM software package, version 2. The HMM was then scored against GenBank NR (excluding mutants but including fragments for this step), and all sequences scoring better than a specific (NLL-NULL) score were added to the cluster. The HMM was then retrained (with fixed model length), and all sequences in the cluster were aligned to the HMM to produce a multiple sequence alignment. This alignment was assessed by a number of quality measures. If the alignment failed the quality check, the initial cluster was rebuilt around the seed using a more restrictive E-value, followed by extension, alignment, and reassessment. This process was repeated until the alignment quality was good. The multiple alignment and “general” (i.e., describing the entire cluster, or “family”) HMM (176) were then used as input into the BETE program (177). BETE calculates a phylogenetic tree for the sequences in the alignment. Functional information about the sequences in each cluster was parsed from SwissProt (178) and GenBank records. “Tree-attribute viewer” software was used by biologist curators to correlate the phylogenetic tree with protein function. Subfamilies were manually defined on the basis of shared function across subtrees and were named accordingly.
HMMs were then built for each subfamily using information from both the subfamily and family (K. Sjölander, in preparation). Families were also manually named according to the functions contained within them. Finally, all of the families and subfamilies were classified into categories and subcategories based on their molecular functions. The categorization was done by manual review of the family and subfamily names, by examining SwissProt and GenBank records, and by review of the literature as well as resources on the World Wide Web. The current version (2.0) of the Panther molecular function schema has four levels: category, subcategory, family, and subfamily. Protein sequences for whole eukaryotic genomes (for the predicted human proteins and annotated proteins for fly, worm, yeast, and Arabidopsis) were scored against the Panther library of family and subfamily HMMs. If the score was significant (the NLL-NULL score cutoff depends on the protein family), the protein was assigned to the family or subfamily function with the most significant score.
E. R. Kandel, J. H. Schwartz, T. Jessell, Principles of Neural Science (McGraw-Hill, New York, ed. 4, 2000).
H. J. Muller, in Heritage from Mendel, R. A. Brink, Ed. (Univ. of Wisconsin Press, Madison, WI, 1967), p. 419.
Feinberg A. P., Curr. Top. Microbiol. Immunol. 249, 87 (2000).
J. Sambrook, E. F. Fritsch, T. Maniatis, Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, ed. 2, 1989).
Sjölander K., Proc. Int. Soc. Mol. Biol. 6, 165 (1998).
GO available at www.geneontology.org/.
We thank E. Eichler and J. L. Goldstein for many helpful discussions and critical reading of the manuscript, and A. Caplan for advice and encouragement. We also thank T. Hein, D. Lucas, G. Edwards, and the Celera IT staff for outstanding computational support. The cost of this project was underwritten by the Celera Genomics Group of the Applera Corporation. We thank the Board of Directors of Applera Corporation: J. F. Abely Jr. (retired), R. H. Ayers, J.-L. Bélingard, R. H. Hayes, A. J. Levine, T. E. Martin, C. W. Slayman, O. R. Smith, G. C. St. Laurent Jr., and J. R. Tobin for their vision, enthusiasm, and unwavering support, and T. L. White for leadership and advice. Data availability: The genome sequence and additional supporting information are available to academic scientists at the Web site (www.celera.com). Instructions for obtaining a DVD of the genome sequence can be obtained through the Web site. For commercial scientists wishing to verify the results presented here, the genome data are available upon signing a Material Transfer Agreement, which can also be found on the Web site.