Early advocates of the human genome project envisioned a gradual process of gene discovery as the human genome sequence was progressively revealed between 1995 and 2005. It has not worked out this way at all. Strategies that fished for genes by random sequencing of cDNA copies of mRNA transcripts proved unexpectedly efficient. The result is a set of large public databases of cDNA sequence information; even larger private databases have been compiled by Human Genome Sciences, Inc. and by Incyte Pharmaceuticals, Inc. For the most part, these databases consist of several-hundred-base cDNA fragments called expressed sequence tags (ESTs). The challenge has been to find effective ways of using these sequence fragments to reveal the function of newly discovered genes1. The term 'functional genomics' has been coined, and widely accepted, for this process, even though there are as yet few successful examples of its implementation.

The function of most genes must inevitably be studied at the protein level. Progressing from the efficient, almost monotonous world of DNA sequence manipulation to the idiosyncratic behaviour of individual proteins is generally a time-consuming and frustrating effort; understanding interactions between proteins raises the complexity to a higher level altogether. Major breakthroughs are needed to remove this bottleneck, and on page 46, Gitte Neubauer and colleagues offer a bright ray of hope. They use three powerful techniques synergistically to leap from DNA sequence information to clues about protein function: in vivo expression of green fluorescent protein fusions, cDNA sequence analysis and mass spectrometry.

The target of this pilot study in functional genomics is the human spliceosome, a particle that removes intronic sequences from mRNA precursors to yield mature mRNA. The spliceosome is a fair representative of the challenges to be faced in this arena. In addition to several known RNA components, it contains more than 40 different proteins, many present in sub-stoichiometric amounts, reflecting the functional diversity of the spliceosome population. Many of the proteins also show complex patterns of post-translational modification. In brief, the spliceosome is a protein chemist's nightmare.

The strategy used to identify protein components of the human spliceosome began with in vitro assembly of the particles from crude cell extracts onto a biotinylated pre-mRNA substrate, allowing facile purification of a mixture of particles in different states of assembly and function. Two-dimensional protein gel electrophoresis of the resulting mixture revealed sixty-nine discrete spots. These were cut out and in-gel digested with trypsin; the resulting peptides were analysed by mass spectrometry, revealing mass and partial sequence data for each. Forty-nine were identified as twenty-five separate protein products of previously known genes. All but four of the remaining spots could be matched with DNA sequences in the public EST database, suggesting that the database is already sufficiently complete to serve as a reference point for many functional genomics projects, especially those focusing on cellular components present in more than trace amounts.
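At its core, the database-matching step is arithmetic on peptide masses. The following sketch illustrates the principle of peptide mass fingerprinting in Python: the residue masses are standard monoisotopic values, but the digestion rule, mass tolerance and scoring are deliberately simplified, and the observed masses and candidate sequences are hypothetical rather than taken from the study.

```python
# Sketch of peptide mass fingerprinting: digest candidate protein sequences
# with trypsin in silico, then count how many observed peptide masses each
# candidate explains. The best-scoring candidate identifies the gel spot.

# Standard monoisotopic residue masses (Da); a peptide adds one water.
RESIDUE = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056

def tryptic_peptides(sequence):
    """Cleave after K or R, except when followed by P (the usual trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in 'KR' and (i + 1 == len(sequence) or sequence[i + 1] != 'P'):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE[aa] for aa in peptide) + WATER

def score(observed, sequence, tol=0.5):
    """Number of observed masses explained by the in-silico digest, within tol Da."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)

# Hypothetical masses from one spot, ranked against two made-up candidates:
observed = [521.22, 1150.61, 2036.08]
candidates = {'candidate_A': 'MKWVTFISLLFLFSSAYSR',
              'candidate_B': 'MDEKRNSTGHQPLVAK'}
for name, seq in candidates.items():
    print(name, score(observed, seq))   # candidate_A: 1 match, candidate_B: 2
```

In practice the same comparison is run against every entry in a sequence database, which is why even fragmentary EST data can suffice to identify a protein.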

A total of nineteen new human proteins could be matched with EST counterparts. The formidable task of assigning a function to these proteins was approached in two ways. First, sequence-matching efforts were expanded by linking overlapping ESTs, or by cloning and characterizing full-length cDNA counterparts of ESTs, which led to additional matches to known proteins. Second, expression in HeLa cells of fusions of these cDNAs to the gene encoding green fluorescent protein revealed, by direct fluorescence microscopy, a punctate distribution of several of the newly discovered proteins in the cell nucleus, reasonable evidence that the proteins are indeed spliceosome components.
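Linking overlapping ESTs is, at its simplest, a suffix-prefix assembly problem. The sketch below is a deliberately naive greedy assembler run on made-up fragments; real EST assembly must additionally cope with sequencing errors, reverse-complement reads and chimaeric clones.

```python
# Sketch of linking overlapping ESTs into one longer contig by repeatedly
# merging the pair of fragments with the largest suffix-prefix overlap.

def overlap(a, b, min_len=8):
    """Length of the longest suffix of a that is also a prefix of b (>= min_len)."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def assemble(fragments, min_len=8):
    """Greedily merge the best-overlapping pair until no overlaps remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break                      # nothing left to link
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags

# Three hypothetical ESTs sampled with overlaps from one transcript:
ests = ['ATGGCGTTAGACCCT', 'GACCCTTTGGAACGT', 'AACGTAGCTAGCTGA']
print(assemble(ests, min_len=5))       # one contig spanning all three
```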

Most biologists are already aware of the power of green fluorescent protein fusions and EST database searching. An appreciation of the power of mass spectrometry is lagging in comparison, but will inevitably grow with the complexity of molecular biological problems being attacked. A quiet revolution in the mass spectrometry of macromolecules during the past half decade, spurred largely by innovative application development around the gentle MALDI (matrix-assisted laser desorption/ionization) and electrospray ionization methods, has been shielded from the view of most biologists. Progress has been reported largely in chemical methods journals, often shrouded in a formidable cloak of impenetrable acronyms.

Continued progress in genomics will sire applications of which we have only an inkling today. Gitte Neubauer et al. extend our understanding of the spliceosome; a logical next step is to study how its protein components interact with each other and with other cellular species2. Dissecting other complexes of potential physiological relevance will naturally follow, an effort enormous in both magnitude and complexity. The amount of data that must be generated to exploit a decoded genome is nearly incomprehensible, and those methods with the highest throughput and informativeness will prove dominant.

It is fortunate that mass spectrometry, which inherently provides among the most definitive and interpretable data sets conceivable, is also answering the call for ultra-high-throughput tools3. It can detect protein modifications ranging from hyper-phosphorylated states, which cause complex patterns of mass shifts, to single deamidations, which produce one-Dalton mass shifts in large proteins. Many biotechnology, genomics and pharmaceutical companies engaged in functional genomics and related programs are currently gearing up for ultra-high-throughput methods compatible with mass spectrometric detection. These range from mapping peptides and proteins to cDNA libraries4, as demonstrated by the Mann group, to generating tens of thousands of parallel-processed genotypes per day, to screening the exact molecular masses of millions of combinatorial library elements and their interactions with corresponding targets. The classical view of the mass spectrometer as a room full of finicky high-vacuum equipment incompatible with the milieu of large biological molecules is hopelessly obsolete. It does not seem reckless to extrapolate that, despite limitations in ultimate sensitivity and amenable analyte size5, mass spectrometry will replace the gel-based separations required for many nucleic acid5 and protein6 applications.
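The arithmetic behind such modification detection is simple, which is part of the appeal. The sketch below uses the standard monoisotopic mass additions for phosphorylation (+79.966 Da per site) and deamidation (+0.984 Da); the 25-kDa base protein mass is a made-up placeholder.

```python
# Sketch of the mass arithmetic behind detecting post-translational
# modifications from intact-protein mass measurements.

PHOSPHO = 79.96633       # +HPO3 per phosphorylation site (Da)
DEAMIDATION = 0.98402    # N->D or Q->E: roughly a one-Dalton shift (Da)

base_mass = 25_000.0     # hypothetical intact protein mass (Da)

# A hyper-phosphorylated protein shows a ladder of peaks, one per occupied site:
for n_sites in range(4):
    print(f'{n_sites} phosphates: {base_mass + n_sites * PHOSPHO:.3f} Da')

# A single deamidation shifts the whole protein by less than one Dalton;
# resolving it on a 25-kDa species demands very high mass accuracy:
print(f'deamidated:   {base_mass + DEAMIDATION:.3f} Da')
```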

With the ongoing success of the human genome project, we are no longer daunted by numbers like billions of base pairs or tens of thousands of genes. Three numbers, however, remain obscure; we cannot yet even estimate them. One is the number of functionally significant protein complexes: even restricted to pair-wise interactions, we are potentially dealing with tens of billions of possibilities. Another is the number of organisms for which genomic approaches will be necessary or desirable. The third is the tally of DNA sequence variations in a population that must be characterized to account properly for interesting differences in behaviour between individuals. Obtaining even a rough idea of these numbers will require many more strides like those promised by the recent mass spectrometric work.
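A back-of-the-envelope calculation shows how quickly the first of these numbers escapes intuition; the protein counts in the sketch below are illustrative assumptions only.

```python
# How many pair-wise protein interactions are possible? (Python 3.8+)
from math import comb

# From tens of thousands of genes to (hypothetically) hundreds of thousands
# of protein forms once splice variants and modifications are counted:
for n in (30_000, 100_000, 200_000):
    print(f'{n:>7} protein forms -> {comb(n, 2):,} possible pairs')
```

Two hundred thousand protein forms already yield some twenty billion unordered pairs, the 'tens of billions' cited above.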