The emergence of top-down proteomics in clinical research

Proteomic technology has advanced steadily since the development of 'soft-ionization' techniques for mass-spectrometry-based molecular identification more than two decades ago. Now, the large-scale analysis of proteins (proteomics) is a mainstay of biological research and clinical translation, with researchers seeking molecular diagnostics, as well as protein-based markers for personalized medicine. Proteomic strategies using the protease trypsin (known as bottom-up proteomics) were the first to be developed and optimized and form the dominant approach at present. However, researchers are now beginning to understand the limitations of bottom-up techniques, namely the inability to characterize and quantify intact protein molecules from a complex mixture of digested peptides. To overcome these limitations, several laboratories are taking a whole-protein-based approach, in which intact protein molecules are the analytical targets for characterization and quantification. We discuss these top-down techniques and how they have been applied to clinical research and are likely to be applied in the near future. Given the recent improvements in mass-spectrometry-based proteomics and stronger cooperation between researchers, clinicians and statisticians, both peptide-based (bottom-up) strategies and whole-protein-based (top-down) strategies are set to complement each other and help researchers and clinicians better understand and detect complex disease phenotypes.

predicting disease prognosis and identifying druggable targets for new therapeutics. Diagnostic or companion diagnostic biomarkers are greatly sought after. The holy grail of biomarker discovery, however, is proteomic biomarkers that predict that a given phenotype will develop. Great progress has been made toward these goals over the past 20 years, and proteomics has been a powerful tool for providing information about a broad range of diseases and clinical phenotypes. However, compared with the discoveries that rapidly followed completion of the Human Genome Project, the translation of proteomic information into medical advances has been slower than expected. A plethora of biological information has been obtained, yet the data have minimal clinical relevance. This type of discovery-based protein analysis has, therefore, been associated with a high cost and a low return on investment. Despite the modest use of proteomics within clinical applications, many in the field are optimistic that proteomics, which is still evolving, will play an important part in 21st century medicine [1,2].
Proteomic research has mostly been dominated by bottom-up techniques. Such techniques involve in vitro enzymatic digestion of the sample and mass spectrometry (MS)-based analysis of the resultant peptide mixture. Inferences are then drawn about the protein composition of the sample. Over the last 20 years, such bottom-up methods have been developed into extremely sensitive and selective methods capable of identifying >5,000 proteins within a single sample. These methods follow in the footsteps of many 'small-molecule' liquid chromatography (LC)-MS assays that have been approved by the US Food and Drug Administration (for example, those for vitamin D3, glycosphingolipids and thyroglobulin) and are poised to augment this capability in the clinical research laboratory [3].
Bottom-up technology has produced a myriad of proteomic data for many living systems [4-6], enabled innovative ways for understanding disease [7] and provided new leads for clinical diagnostics [8]; however, the complete proteomic tool kit for 21st century research will consist of orthogonal methods that allow analysis at multiple levels: the peptide, whole-protein and intact protein complex levels [9]. Although bottom-up proteomic technology is well developed, the technology for analyzing whole proteins (known as top-down proteomics) and intact protein complexes (known as next-generation top-down proteomics or protein complex proteomics) is less so (Figure 1, center). Notwithstanding the nascent technology, biological research will benefit greatly from a combined proteomic approach that can take advantage of the individual strengths of all three approaches to complement the deficiencies inherent in each. We propose that such a combination approach will result in an increased return on investment for MS-based proteomics in the next decade or two and therefore a greater impact on human health (Figure 1).

State-of-the-art bottom-up proteomics in clinical research
Most clinical proteomic research focuses on identifying the molecular signatures of specific diseases or disease phenotypes from relevant biological samples from patients. When found, these molecular signatures, or biomarkers, provide novel ways to detect, understand and, perhaps, treat disease. Much of the search for biomarkers has been conducted on human serum or plasma. Although plasma is readily obtainable, it is daunting in its proteomic complexity, owing to a vast dynamic range of component concentrations within a single sample that spans more than ten orders of magnitude [10]. Not surprisingly, thorough analysis of the protein composition of plasma is a challenge. Nevertheless, techniques for carrying out targeted measurements in human serum have been developed.
One such technique is an antibody-based enrichment strategy termed SISCAPA (stable isotope standards and capture by antipeptide antibodies). Whiteaker et al. [11] used SISCAPA to achieve a >1,000-fold enrichment of target peptides within plasma and to detect analytes in the nanogram per milliliter range using an ion-trap mass spectrometer. Another technique that has now been widely implemented is multiple reaction monitoring (MRM), which measures targeted peptides within complex mixtures and can be used for absolute quantification of these peptides [12]. For example, by optimizing sample preparation and measurement conditions, Keshishian et al. [13] used MRM and achieved limits of quantification (LOQs) in the low nanogram per milliliter range without the need for antibody-based enrichment. Although the antibody-based methods used in clinical laboratories can achieve much lower LOQs, in the picogram to femtogram per milliliter range, as is the case for cardiac troponin and prostate-specific antigen [14,15], optimized MRM assays coupled with SISCAPA could represent the future of biomarker validation assays [16].
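The absolute quantification mentioned above is commonly based on stable isotope dilution: a known amount of an isotope-labeled ("heavy") version of the target peptide is spiked into the sample, and the endogenous ("light") concentration is estimated from the ratio of the two peak areas. The following is a minimal sketch of that arithmetic only. It assumes a single-point calibration and an identical instrument response for the light and heavy peptides; the function name and the peak areas are hypothetical and are not taken from the cited studies.

```python
def isotope_dilution_concentration(light_area, heavy_area, heavy_spike_ng_per_ml):
    """Estimate the endogenous (light) peptide concentration in ng/mL.

    Assumes (for illustration) that the light and heavy peptides respond
    identically, so the peak-area ratio equals the concentration ratio.
    """
    if heavy_area <= 0:
        raise ValueError("heavy-standard peak area must be positive")
    return (light_area / heavy_area) * heavy_spike_ng_per_ml


# Hypothetical peak areas and spike level, chosen only to show the calculation.
conc = isotope_dilution_concentration(
    light_area=2.4e5, heavy_area=1.2e6, heavy_spike_ng_per_ml=10.0
)
print(f"Estimated endogenous concentration: {conc:.2f} ng/mL")
```

In practice, assays of this kind are usually calibrated over several concentration points rather than a single ratio; the sketch shows only the core relationship.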
Figure 1. To be successful, clinical proteomic projects must link observed phenotypes to modern molecular medicine through the analysis of complex proteoforms. Clinical phenotypes are affected by both familial inheritance (genotype) and environmental effects (that is, there can be differing molecular causes for the same underlying disease). In bottom-up proteomic analyses, the proteins in samples are digested into peptides, and inferences are then made about the native proteome. Owing to its ease of implementation, bottom-up proteomics is the most widely implemented technique in proteomic research. In top-down proteomic analyses, the protein molecules are analyzed in their intact state, providing a higher degree of mechanistic connection with disease. Proteomic analyses of native protein complexes provide the strongest connection between molecular mechanism and disease; however, considerable technical advances are needed before this next generation of top-down proteomic approaches can be widely used. In this clinical proteomic workflow, information gathered from protein analysis may be used to catalyze the development of new techniques to manage human health. Adapted partly from [9].
Examples of MRM successes in clinical research include the following: the quantification of proteins in the cerebrospinal fluid to aid understanding of the later stages of multiple sclerosis [17]; the development of quantitative validation techniques for plasma biomarkers, with LOQs reaching picograms per milliliter [13]; and the demonstration of robust targeted assays for cancer-associated protein quantification in both plasma and urine samples from patients [18]. In the first example, Jia et al. [17] used MRM to quantify 26 proteins from the cerebrospinal fluid of patients with secondary progressive multiple sclerosis. They included patients with a noninflammatory neurological disorder and healthy humans as controls. The many significant differences in the abundance of certain proteins between patient groups may hold true upon further sampling and could yield important insight and provide a new method for multiple sclerosis research [17]. In the second example, Keshishian et al. [13] performed important empirical testing of serum-processing options and provided a method for achieving an LOQ appropriate for current serum biomarkers (low nanogram per milliliter), even while multiplexing the assay to monitor multiple analytes. In the third example, Huttenhain et al. [18] extended this empirical testing to develop MRM assays for over 1,000 cancer-associated proteins in both serum and urine. They extended their results to monitor, using MS, the levels of four biomarkers that are currently used to assess ovarian cancer risk (apolipoprotein A1, transferrin, β2-microglobulin and transthyretin; using Quest Diagnostics' OVA1 enzyme-linked immunosorbent assay (ELISA) panel). In a panel of 83 serum samples, they found significant differences in the abundance of these proteins between patients with ovarian cancer and those with benign ovarian tumors, and these differences were consistent with prior results obtained from immunoassays. This study exemplifies the strength of MRM for multiplexed quantification of peptide biomarkers in complex clinical samples.
MRM offers unrivaled utility for sensitive and accurate detection of target peptides in clinical samples (information that is subsequently used to infer the presence and level of proteins in the sample). However, the proteome harbors more complexity than typical MRM assays can interrogate. This analytical mismatch confounds the diagnostic accuracy of the MRM-based assays in ways that are not possible to overcome by using bottom-up MS-based proteomic technology alone.
One issue with MRM is that it is a targeted assay and relies on a priori knowledge of the protein to be measured. At present, most of that knowledge is obtained from bottom-up, discovery-type proteomic studies, in which enzymatic digestion precedes the peptide-based analysis of proteins in complex mixtures. Herein lies the key limitation of bottom-up strategies. With enzymatic digestion, the information describing individual intact proteins is lost, preventing complete characterization of all of the protein forms expressed at one time for any given protein-coding gene. As a result, clinical conclusions are based on potentially inaccurate protein expression levels, because these levels are derived from quantifying peptides that may not be representative of all of the diverse forms of protein molecules present. (For example, a given peptide sequence may be common to many forms of a protein molecule, yet some of those forms are post-translationally modified on amino acids within that same stretch of sequence.) The net effect of a bottom-up strategy is that MRM peptides report only generally on protein expression of a gene, because modified peptides that represent individual protein molecules are unlikely to be discovered upon enzymatic digestion in an untargeted fashion.
Measuring the expression of protein-coding genes at the protein level is important; however, in a living system, it is the individual protein molecules that are likely to correlate more tightly with (aberrant) molecular functions. Because these individual protein molecules (which, for example, contain coding polymorphisms, mutations, splicing variations and post-translational modifications) are likely to perform different functions from other modified versions of the same parent protein [19], it becomes imperative to measure protein expression with a precision that will distinguish between even closely related intact protein forms. Top-down proteomics offers this precision.
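To illustrate how closely related intact protein forms can be told apart at the whole-protein level, the sketch below enumerates the intact masses expected for every combination of two modifications on one protein. It is an illustration under stated assumptions rather than a description of any cited workflow: the 15,000 Da base mass is hypothetical, the choice of one methylation and one phosphorylation site is an assumed example, and the mass shifts are the standard monoisotopic values for those two modifications.

```python
from itertools import product

# Standard monoisotopic mass shifts, in daltons.
METHYLATION = 14.01565       # addition of CH2
PHOSPHORYLATION = 79.96633   # addition of HPO3


def proteoform_masses(base_mass, shifts):
    """Return {label: mass} for every on/off combination of the modifications."""
    names = list(shifts)
    masses = {}
    for flags in product((False, True), repeat=len(names)):
        present = [name for name, on in zip(names, flags) if on]
        label = "+".join(present) if present else "unmodified"
        masses[label] = base_mass + sum(shifts[name] for name in present)
    return masses


# Hypothetical 15 kDa protein with one methylation (Me) and one
# phosphorylation (P) site: four proteoforms, four distinct intact masses.
masses = proteoform_masses(15000.0, {"Me": METHYLATION, "P": PHOSPHORYLATION})
for label, mass in masses.items():
    print(f"{label:>10}: {mass:.3f} Da")
```

Each combination yields a different intact mass, which is the property that lets a sufficiently high-resolution measurement of the whole protein separate the forms before any fragmentation is considered.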

Top-down proteomic approaches
Top-down MS-based proteomic technology provides the highest molecular precision for analyzing primary structures by examining proteins in their intact state, without the use of enzymatic digestion. In doing so, top-down proteomic techniques can fully characterize the composition of individual protein molecules (these intact protein molecules were recently coined 'proteoforms' [20]). Traditionally, the top-down strategy consisted of two-dimensional protein separation involving isoelectric focusing and PAGE followed by visualization of the protein spots within the gel, a technique known as two-dimensional gel electrophoresis. Both two-dimensional gel electrophoresis [21] and difference gel electrophoresis [22] facilitate a 'bird's-eye' view of the proteins in a sample in one or more biological states. Salient proteome features are then further investigated by identifying the proteins of interest using bottom-up MS. These techniques provide a large visual representation of the proteome and have been applied in disease research, such as cancer research [23,24]; however, several technical challenges have impeded the universal adoption of this top-down approach. First, there are limitations on proteome resolution, leading to the co-migration of multiple proteins to the same location on the gel. Second, there are issues with gel-to-gel reproducibility. Third, this approach is labor intensive. Last, the enzymatic digestion required for MS identification prevents full molecular characterization [25,26].
An alternative method for top-down proteomics, and the front-runner for becoming the technique of choice for top-down proteomics, is LC electrospray ionization tandem MS (LC-ESI-MS/MS). This soft-ionization technique can be applied to intact proteins of up to approximately 50 kDa using hybrid instruments offering Fourier-transform-based high-resolution measurements [27]. The high-resolution LC-ESI-MS/MS approach to top-down proteomics has recently proven to be capable of truly high-throughput protein identification [28] and is now appreciated as a viable option for proteome discovery [29].
We hypothesize, as do many researchers in the top-down proteomics field, that the information obtained from precise, comprehensive whole-protein analysis will be connected more directly to complex disease phenotypes than information gained from bottom-up analyses. As a result, studying proteomes at the whole-protein level will provide a more efficient translation of proteomic data into phenotypic understanding and early detection of disease. At present, top-down proteomic techniques are less sensitive than bottom-up strategies, which presents concerns for biomarker studies. Nonetheless, there is a need for a combined approach to translational proteomics that uses both top-down and bottom-up strategies. Figure 2 depicts the positioning of whole-protein (top-down) analysis and peptide-based (bottom-up) protein analysis in the space of complex human disease. With complete protein characterization afforded by top-down analyses, sensitive MRM assays with LOQs in the nanogram per milliliter range can be developed to target the exact proteoforms that are most closely connected to the disease phenotype of interest. When proteoforms are larger than the current limit for top-down proteomics, which is approximately 50 kDa, an intermediate technique called middle-down proteomics can be used. With this technique, targeted enzymatic digestion occurs minimally throughout the protein to produce large peptides with an average size of about 6 kDa [30]. These large stretches of polypeptide can facilitate partial characterization of large proteins (>50 kDa) and allow better proteoform specificity in MRM assay development.

Recent advances in top-down proteomic implementation
At present, proteomic approaches in clinical research can be grouped into two categories: protein-profiling approaches, and protein identification and characterization using the 'grind and find' strategy. In addition to the two-dimensional gel electrophoresis and difference gel electrophoresis methods described above, another historical profiling approach was surface-enhanced laser desorption/ionization time-of-flight MS (SELDI-TOF MS). In SELDI-TOF MS, a solid-phase enrichment step is used to bind proteins in complex mixtures, most often serum or plasma, reducing the sample complexity by compressing the dynamic range of the sample to be analyzed. Then, laser desorption is used to ionize the proteins from the surface directly into a time-of-flight mass analyzer for MS profiling. With its ability to decrease the daunting complexity of plasma [10] to make it more amenable to protein profiling, SELDI-TOF analysis was once a highly touted technique for plasma proteomic studies, particularly for biomarker discovery assays. One of the main early arguments in favor of such an approach was offered by Petricoin and Liotta [31]. They argued that although SELDI-TOF was purely an MS1 profiling technique, which does not provide enough mass or chemical selectivity to ensure that a differentially expressed mass is a unique entity, comparison of the collective profile of disease and non-disease samples could uncover genuine biomarker signatures, and it would be the signatures rather than the identification of any one biomarker that would have an impact on medicine.
MS imaging (MSI) is a protein-profiling technique that is similar in certain respects to SELDI-TOF and is rapidly gaining popularity because of its innovative pairing with topological information at both the tissue and cellular levels. Sweedler and Caprioli are pioneers of MSI using matrix-assisted laser desorption/ionization (MALDI) MS, and they have applied this approach to answer many biological questions. For all applications, researchers are finding much value in being able to pinpoint protein MS profiles to certain locations within a tissue slice or organism, depending on the type of sample at which the experiment is aimed. One striking use of MSI has been to identify biomarker profiles of renal cell carcinoma in kidney tissue [32] (Table 1). Progress in this burgeoning area of clinical research will involve identifying and precisely characterizing the proteoforms detected by MSI-based profiling approaches.
In the protein characterization mode of analysis, top-down proteomics has been applied in several high-profile translational research projects (Table 1). In contrast to the proteome profiling of modern MS-based imaging techniques, top-down proteomics offers protein identification, molecular characterization (often complete) and relative quantification of related protein species. For example, Chamot-Rooke and colleagues [33] are taking advantage of top-down proteomics to identify factors associated with the invasiveness of the bacterium Neisseria meningitidis. They used precision MS to quantify the expression of proteoforms in type IV pili, implicating these structures in the detachment of bacteria from meningitis-associated tissue [33]. In a similar manner, Ge and colleagues have been performing top-down analyses on intact cardiac troponin I proteoforms to gain insight into myocardial dysfunction. In a recent study, the Ge group observed an increase in phosphorylation in the failing human myocardium by examining the proteoforms of intact cardiac troponin I [34]. Interestingly, they also unambiguously localized the phosphorylation events within the protein and uncovered information that is important for gaining a mechanistic understanding of myocardial failure. In another example of proteoform-resolved top-down analysis, Hendrickson and Yates and colleagues [35] identified, characterized and quantified multiple proteoforms of apolipoprotein CIII within human blood, including those with O-linked glycosylation. Their research is important not only because it extends the concept of proteoform quantification but also because apolipoprotein CIII is associated with coronary artery disease.
Other groups are using MS coupled with hydrogen-deuterium (H-D)-exchange chemistry to study the dynamics of intact proteins. In a potent application of H-D-exchange mass spectrometry, Agar and colleagues [36] studied the protein dynamics of superoxide dismutase 1 variants associated with familial amyotrophic lateral sclerosis. In the variants analyzed, they found a common structural and dynamic change within the electrostatic loop of the protein [36]. Their data provide important molecular mechanistic insight into this inherited form of motor neuron disease and further exemplify the utility of proteoform-resolved data from intact proteins for informing clinical research.

The future of top-down strategies in clinical proteomics
Support for using top-down proteomics in clinical research is growing with each publication that features its use. The examples described above were hard won by early adopters of the technique and illustrate the application of whole-protein analysis to a diverse range of disease-related questions that can be answered with proteoform-resolved information (Table 1). However, even
Figure 2. Top-down proteomics provides information closely connected to complex disease phenotypes. Many protein molecules can be encoded by a single gene locus, owing to modifications such as methylation (Me) and phosphorylation (P). These different forms, which can be present simultaneously in the proteome, are called proteoforms [20]. In this example, the expression of one protein-coding gene leads to four distinct proteoforms, owing to different combinations of Me and P modifications (top left). Top-down proteomic analysis preserves the proteoforms and yields 'proteoform-resolved' data; mock mass-spectrometry (MS) data are presented for this example (top right). Bottom-up analysis depends on the enzymatic digestion of proteins: the four distinct proteoforms form a mixture of five MS-compatible peptides (bottom left); mock MS data are presented (bottom right). The bottom-up analysis clearly shows an increase in the abundance of methylated and phosphorylated peptides, but it cannot link this information to the expression levels of the intact proteoforms, leading to an ambiguous result. The top-down analysis, by contrast, indicates that the doubly modified proteoform is upregulated compared with the other three forms. In a complementary approach, the full protein characterization afforded by top-down proteomics can be used to develop multiple reaction monitoring (MRM) assays that reliably report on distinct intact protein molecules.
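The ambiguity described for bottom-up data can be reproduced with a small toy model. The sketch below is an illustration under stated assumptions, not a reconstruction of the figure: it assumes that the methylation site and the phosphorylation site fall on two different peptides after digestion, that a third peptide is shared unmodified by all forms, and that peptide signals are simple sums of proteoform abundances. The peptide names and abundance values are hypothetical.

```python
def digest(proteoforms):
    """Sum proteoform abundances into the peptide-level signals they would give.

    Keys of `proteoforms` are (has_methylation, has_phosphorylation) tuples.
    Toy assumption: the Me site lies on peptide 1, the P site on peptide 2,
    and peptide 3 is unmodified in every proteoform.
    """
    peptides = {"pep1": 0, "pep1_Me": 0, "pep2": 0, "pep2_P": 0, "pep3": 0}
    for (has_me, has_p), abundance in proteoforms.items():
        peptides["pep1_Me" if has_me else "pep1"] += abundance
        peptides["pep2_P" if has_p else "pep2"] += abundance
        peptides["pep3"] += abundance
    return peptides


# Two hypothetical samples with different proteoform distributions.
sample_a = {(False, False): 40, (True, False): 10, (False, True): 10, (True, True): 40}
sample_b = {(False, False): 30, (True, False): 20, (False, True): 20, (True, True): 30}

print("Proteoform profiles differ:", sample_a != sample_b)
print("Peptide profiles identical:", digest(sample_a) == digest(sample_b))
```

Under these assumptions, the doubly modified proteoform differs between the two samples, yet every peptide-level total is the same, so no peptide-only measurement could distinguish them. Intact-protein measurements keep the four abundances separate.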
In the future, most clinical translational proteomic strategies are likely to take a combination approach, taking advantage of the sensitivity and high-throughput capacity of MRM and the high molecular precision of top-down proteomics. One of the main reasons why top-down proteomics is somewhat esoteric at present is that it took longer to develop into a high-throughput assay. It was not until 2011 that top-down proteomics was shown to be applicable to large-scale experiments [28]. Before then, its use was limited to a focused approach for characterizing targeted proteins within samples. Much of the top-down proteomic research described above fits into this category. However, now that top-down proteomics can be performed on Orbitrap MS instruments without the need for a superconducting magnet, as recently demonstrated by Ahlf et al. [37] and Tian et al. [38], it is expected that more laboratories will begin to apply high-throughput top-down techniques regularly without needing collaborators. In fact, a new Consortium for Top Down Proteomics has formed, with the mission 'to promote innovative research, collaboration and education accelerating the comprehensive analysis of intact proteins' [39]. As top-down proteomics becomes more widespread, we can expect to see certain clinical research topics illuminated. One aspect of disease biology that is ripe for top-down analysis is the immune system. The immune system is connected to many human diseases in various ways and consists of a range of cell types, with close to 300 distinct populations in the blood alone [40]. To date, information within the immune system that is associated with disease mechanisms, progression and biomarkers has gone untouched by top-down proteomic approaches.
We believe that a search for disease-associated biomarkers using gene- and cell-specific proteomics will substantially benefit from the application of whole-protein analysis to the proteomes of the immune cell populations associated with individual diseases. This idea combines the high analytical precision of top-down proteomics with a layer of precision from individual cell-type resolution.
The analysis of disease-associated immune cell populations (for example, sorted by flow cytometry) using top-down proteomics will have an integral role in shaping the future of clinical proteomic research. In the ideal situation, certain disease studies will begin with top-down proteomic analyses to characterize the intact proteins in each immune cell type in the peripheral blood. Peripheral blood cells can be isolated from patients by the same routine procedure used for obtaining whole blood, serum, and plasma and thus serve as prime candidates for clinical studies of samples directly obtained from patients. The top-down characterization of proteins in immune cell populations will provide proteoform-resolved data that report on the expression profile of proteins within these cell types. The profiles will be readily comparable with 'healthy' human cell proteomes by applying the technique to samples isolated from patients without the disease under study. Then, taking a hybrid approach to clinical proteomic research, the discovery phase of top-down proteomics, with its proteoform-resolved data, can be used to guide the development of proteoform-specific peptides for follow-up, large-scale MRM validation trials.
We believe that the single-cell analysis capabilities of flow cytometry will couple well with proteoform-resolved top-down data. In general, flow cytometry is a common and well-developed procedure for analyzing the cell-by-cell expression of particular proteins using antibodies targeting these proteins. However, without proteoform-resolved information to guide the development and selection of antibodies for monitoring, the information from a flow cytometry experiment could be confusing, with the same protein inference problem that limits the specificity of MRM (Figure 2). In other words, neither technique can accurately describe distinct proteoforms when used alone.
With the pairing of top-down proteomics and flow cytometry, individual proteoforms can be targeted by antibodies that bind only to those distinct forms of the protein. In this manner, the flow cytometry information will also be proteoform-resolved. Adding this layer of precision to both the MRM and the flow-cytometry follow-up assays will provide a considerable advance toward understanding and diagnosing complex phenotypes, especially when the data are paired with cell-by-cell information from disease-associated immune cells.

Table 1
Laboratory (a) | Disease or condition | Application description | Reference
Chamot-Rooke | Bacterial meningitis | Relative quantification of intact Neisseria meningitidis type IV pilus proteoforms | [33]
Agar | Neurodegeneration | H-D exchange-enabled analysis of fALS SOD1 variant protein dynamics | [36]
Ge | Myocardial dysfunction | Relative quantification of intact cardiac troponin I proteoforms | [34]
Caprioli | Renal carcinoma | Tissue profiling of intact proteins in cancerous versus healthy kidneys | [32]
Hendrickson and Yates | Coronary artery disease | Relative quantification of intact apolipoprotein CIII proteoforms | [35]
Nelson | Diabetes | Relative quantification of proteoforms in plasma from healthy individuals and diabetics | [41]
(a) Examples of laboratories applying top-down proteomic strategies to clinically related research are presented here, with references to their recent work. In this diverse research, top-down proteomics is being used to understand the dynamics of intact proteins, to measure the relative abundances of intact proteoforms and to provide mass spectrometry profiles of intact proteins directly from human tissue. In all of these cases, the information obtained by studying whole proteins has led to significant insight into human disease. fALS, familial amyotrophic lateral sclerosis; H-D, hydrogen-deuterium; SOD1, superoxide dismutase 1.

Ultimately, pairing proteoform-resolved information from top-down proteomics with sensitive and standardized MRM assays and similarly sensitive and standardized targeted flow cytometry assays will provide two promising options for the development of validated clinical diagnostic assays for early disease-phenotype detection. We hope that in the near future more clinical proteomic pursuits will begin with top-down proteomics discovery that will drive the research with proteoform-resolved precision. One clear benefit of the spread of top-down technology to many laboratories would be a collective increase in the precision of data collection and reporting compared with the prototypic information that bottom-up proteomics is currently providing (Figure 2). Another advantage would be global 'beta testing' of the technique. Inevitably, the more people who use top-down proteomics, the more demand there will be for improved instrumentation and data acquisition (plus the critical software). This type of increased demand will guide the industrial development of top-down platform tools that will benefit the research community directly, by allowing more robust and capable analysis. Thus, a positive feedback loop will commence that will mirror the robust growth cycle experienced by bottom-up technologies over the past 20 years. Having seen the improvements over that time, it is exciting to imagine where top-down technology will be in the near future.
Finally, the overall goal for using top-down proteomics in clinical research is not to take the place of the well-developed, optimized assays that are used in diagnostic laboratories around the world (for example, targeted RNA measurements, DNA sequencing and ELISAs). Rather, the goal is to inform the development and implementation of more-sensitive, more-selective diagnostic tests. By correlating the exact proteoforms with a given disease phenotype, diagnostic laboratories will be able to design assays to perform routine analyses in a proteoform-specific manner.