This description of an emerging sector was enabled by two data sources: a database of firms and R&D expenditures initiated at Stanford University and continued at Duke University through 2004 (when the company database was discontinued for reasons described below); and the ability to identify patents making claims that specify terms distinctive to DNA or RNA. The data also construct a window through which to view technology transfer, patenting, and academic-industrial-government interactions, using genomics as an important subset of biotechnology.
Identifying genomics firms
The definition of a genomics firm is somewhat amorphous. Just as the term biotechnology refers to the practice of a subset of pharmaceutical firms employing molecular biological methods, genomics is an approach, not an industrial sector. One unifying feature among the many companies that became known as genomics firms and were included in our database was that all or a substantial fraction of their business plans hinged on use of large datasets on several or many genes, emerging DNA sequencing technologies, novel methods of DNA detection, or interpretation of information based on DNA sequence or structure. This definition is not restricted to human DNA; it also covers microbes, plants, and other organisms. However, it became increasingly difficult to determine exactly what portion of a firm’s business was related to genomics as the technologies ramified into many disparate lines of life sciences and industrial application. R&D allocations by firms on our list range from complete dedication to genomics to only a small, but meaningful, fraction of R&D funds attributed to genomics. With this in mind, our dataset of genomics firms is a best-effort estimate of the genomics sector as it emerged, and should not be viewed as an exact valuation of how much genomics R&D was taking place in the commercial sector.
We used the following criteria to include firms in our analysis: analysis of DNA structure as a core business; ‘genomics’ listed on website, annual report, or in news stories as part of the business plan; and firm listed as ‘genomics’ by stock analysts or trade press (subject to correction if determined not to meet one of the above criteria). We accepted the definitions of those reporting the figures (including the trade press characterization of private firms). When reporting on firms and funding programs, we visited websites or read publicly available data sources. We excluded firms solely or primarily focused on protein, rather than DNA structure, or those that identified themselves as primarily ‘proteomics’ or some other ‘-omics’ field other than genomics. These distinctions were not entirely consistent, details about the technologies used were not always explicit, and the amount of information publicly available varied widely. Many firm descriptions made it difficult to make judgments. The rule of thumb was to exclude firms unless they (or others writing about them) explicitly referred to genomics, or unless the nature of their business seemed similar to that of firms already on the list. In cases of doubt, firms were contacted for clarification, and excluded or included according to the taxonomy noted below.
General firm information
The database of genomics firms began from two sources: a December 1993 survey of early genomics firms done by one of the authors (RCD; contract report available at the National Reference Center for Bioethics Literature, Georgetown University), and the BioWorld Report 2000 Genomics Review. Our list was then expanded using several principal sources: three web-based biotechnology services (BioSpace.com, Recombinant Capital, and GenomeWeb.com), scientific journals, and biotechnology trade and technical publications. A few firms were also identified by membership in the Biotechnology Industry Organization or brought to our attention by scientists, stock analysts, or other firms on our list. The database of genomics firms was maintained through 2004, the year after the Human Genome Project formally ended with publication of the human reference sequence in April 2003.
To assemble contact information on firms, we visited the website of each firm (except the few lacking websites), and made phone calls to clarify points of uncertainty. Our monitoring was greatly expedited by use of the following sources: news about genomics firms in BioSpace.com’s daily ‘Breaking News’ service; twice-daily GenomeWeb Daily News Bulletins; Genomics Today, a news service of the Pharmaceutical Research and Manufacturers of America; and scientific and trade journals.
We made efforts to gather the following information for each firm in the database: current and former names; contact information such as address, phone, fax, website, and executive officers; year founded; firm taxonomy (as described below); and total number of issued US patents and DNA-based patents (see description of search methods below).
Each firm was designated as being either public, private, acquired, subsidiary, nonprofit, dissolved, or lost to follow-up. Firms that had undergone merger were classified under the acquired category. A firm was designated as dissolved only when direct evidence of dissolution of the business was uncovered (for example, press report, direct contact with former management or staff). All other firms that we could not locate (by web search, or former phone or email contact) but for which we did not have direct evidence of dissolution were considered lost to follow-up.
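The status rules above can be written as a small decision function. The following is only an illustrative sketch, not the procedure actually coded in the database: the names `Status` and `resolve_status` and the boolean inputs are hypothetical, and the order in which the checks are applied is an assumption.

```python
from enum import Enum


class Status(Enum):
    """The seven status designations listed in the text."""
    PUBLIC = "public"
    PRIVATE = "private"
    ACQUIRED = "acquired"
    SUBSIDIARY = "subsidiary"
    NONPROFIT = "nonprofit"
    DISSOLVED = "dissolved"
    LOST_TO_FOLLOW_UP = "lost to follow-up"


def resolve_status(last_known: Status,
                   located: bool,
                   dissolution_evidence: bool,
                   merged: bool = False) -> Status:
    """Apply the follow-up rules to one firm (hypothetical helper).

    merged               -- the firm has undergone a merger
    dissolution_evidence -- direct evidence of dissolution exists
                            (press report, contact with former staff)
    located              -- the firm could be found by web search,
                            phone, or email
    """
    if merged:
        # Merged firms are classified under the acquired category.
        return Status.ACQUIRED
    if dissolution_evidence:
        # 'Dissolved' requires direct evidence, never mere absence.
        return Status.DISSOLVED
    if not located:
        # No trace and no evidence of dissolution.
        return Status.LOST_TO_FOLLOW_UP
    return last_known
```

The point of the sketch is the asymmetry between the last two outcomes: a firm that cannot be found is never counted as dissolved unless direct evidence exists.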
The database of genomics firms was discontinued in 2004. This was partly a choice to end the study with completion of the Human Genome Project, partly because the data-collection effort was substantial and our research project ended, and partly because the term ‘genomics’ became difficult to justify as a coherent, distinctive category as genomic technologies became ubiquitous in the life sciences and in industrial applications. The problem of definitional wobble is apparent even in government funding programs devoted to genomics, although reasonable estimates were possible for nonprofit and government funding streams through 2008.
One of the limitations of our survey is the relative dearth of trade press or other sources for collecting information about firms outside North America and Western Europe. We acknowledge that our coverage is not uniform and that we may have missed a significant number of international companies. Firms in India, China, other parts of Asia, Latin America, and Eastern Europe are very likely under-represented. This bias applies to publicly traded firms, but is true a fortiori for privately held firms, which can be very difficult to identify and monitor.
A genomics taxonomy emerged from reviewing descriptions of R&D carried out by the firms that were described by themselves, on websites or in annual reports, or by others in the trade press and news websites as ‘genomics firms.’ The categories emerged from a bootstrapping process of classifying companies, comparing results of classification among the research team, adding terms to accommodate new categories up to a point of ‘saturation’ when few reclassifications were needed; inter-rater reliability was established. Categories in the taxonomy are not mutually exclusive; each firm can be classified under multiple headings.
For publicly traded genomics firms, we gathered the following additional annual financial data: total operating expenses, R&D expenses, number of employees, plant and equipment values, total revenues, net income, and market capitalization. Market capitalization was either gathered directly from financial data sources or was calculated by taking the product of the adjusted closing value of the stock on the day of fiscal year end and the reported number of outstanding shares in the annual financial reports.
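When market capitalization was not available directly, the calculation described above is a single product. A minimal sketch follows; the function name `market_cap` and the input validation are assumptions, and the figures in the usage note are invented for illustration only.

```python
def market_cap(adjusted_close: float, shares_outstanding: float) -> float:
    """Market capitalization at fiscal year end.

    adjusted_close     -- adjusted closing share price on the last
                          day of the firm's fiscal year
    shares_outstanding -- outstanding shares reported in the annual
                          financial report
    """
    if adjusted_close < 0 or shares_outstanding < 0:
        raise ValueError("price and share count must be non-negative")
    return adjusted_close * shares_outstanding
```

For a hypothetical firm closing its fiscal year at 12.50 per share with 40,000,000 shares outstanding, `market_cap(12.50, 40_000_000)` returns `500000000.0`.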
Financial data for publicly traded US and international firms were collected primarily through the use of four databases specializing in firm financials: Mergent Online - U.S Company Data, Compustat North America, Thomson Research - Worldscope, and OneSource - Business Browser. The source of these databases’ information is US Securities and Exchange Commission filings, press releases, and analyst reports. In some cases, when companies were not listed in one of these databases, we gathered data directly from firm annual reports. Despite accessing multiple data sources, there remain several firms for which we were unable to locate all financial data points. (Data tables are included in supplemental materials; see Additional file 1.) Our aggregate data are thus only a rough proxy for collective activity in private commercial genomics, not comprehensive and fine-grained analyses of particular firms or technologies.
To obtain the count of total issued US patents, we conducted searches using the US Patent and Trademark Office website search engine. Searches were done looking for the name and former names of each firm as the assignee on patents. Efforts were made to incorporate the patents of acquired and subsidiary firms into the total count of patents for parent firms. We also searched for common misspellings and typos for firm names, when appropriate. Total issued US patent counts were current through 7 February 2006, covering two years beyond the period for which we report company financial data. This two-year period approximates the time of traditional total pendency for patents at the US Patent and Trademark Office.
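The assignee search just described amounts to counting patents whose assignee matches any known name of a firm, up to the cutoff date. The sketch below shows that idea on an offline list of records. It is not the US Patent and Trademark Office search engine or its query syntax: the record layout, the helper names, the list of corporate suffixes, and the firm names in the usage note are all hypothetical.

```python
import re
from datetime import date

# Cutoff stated in the text for total issued US patent counts.
CUTOFF = date(2006, 2, 7)

# Corporate suffixes ignored when comparing names (an assumption).
SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "co", "company"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and common corporate suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)


def count_patents(records, aliases, cutoff=CUTOFF):
    """Count patents assigned to any alias and issued by the cutoff.

    records -- iterable of (assignee_name, issue_date) pairs
    aliases -- current name, former names, names of acquired or
               subsidiary firms, and known misspellings
    """
    wanted = {normalize(a) for a in aliases}
    return sum(
        1
        for assignee, issued in records
        if issued <= cutoff and normalize(assignee) in wanted
    )
```

Usage with invented records, where the misspelling "Genomcis" is supplied explicitly as an alias rather than guessed by the code:

```python
records = [
    ("Acme Genomics, Inc.", date(2001, 5, 1)),
    ("ACME GENOMICS INC", date(2005, 1, 1)),
    ("Acme Genomcis Inc.", date(2004, 3, 3)),   # misspelled assignee
    ("Acme Genomics, Inc.", date(2006, 3, 1)),  # after the cutoff
    ("Other Corp", date(2000, 1, 1)),
]
count_patents(records, ["Acme Genomics, Inc.", "Acme Genomcis"])
```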
The many distinctive terms for DNA and RNA allow DNA patents to be identified with a relatively high degree of specificity and sensitivity, providing an analytical tool to study genomic innovation. To obtain the count of DNA-based patents, we conducted searches for issued US patents in the DNA Patent Database (DPD). Established in 1994, the DPD contains patents (and, since 1999, patent applications) with one or more claims explicitly referring to DNA or RNA or terms of art specific to DNA (for example, ‘plasmid’ or ‘nucleotide’), mapping patents to the field of genomics. This patent collection goes well beyond just gene patents (usually referring to DNA molecules encoding proteins) to include methods, instruments, and software. The search algorithm is available online. The individual terms used in the DPD were tested individually for specificity and sensitivity in 1997 and the algorithm modified and re-tested in 2003. Our searches were performed using the 2003 algorithm and utilized techniques similar to those described above for total issued US patents. DPD patent counts cited here are up to date through 11 January 2006, also covering two years beyond the period for which we report company financial data. Comparing DNA patents to total patents yields a ratio of genomics to other patenting activity, a rough indicator of ‘genomics intensity.’
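The ‘genomics intensity’ indicator can be read as a simple proportion of DNA-based patents among all issued US patents for a firm. The sketch below takes that reading; the function name and the handling of firms with no patents are assumptions, and the two counts would come from the separate searches described above, which have slightly different cutoff dates.

```python
from typing import Optional


def genomics_intensity(dna_patents: int, total_patents: int) -> Optional[float]:
    """Share of a firm's issued US patents that are DNA-based.

    Returns None when the firm holds no issued patents, since the
    ratio is undefined in that case (a choice made for this sketch).
    """
    if dna_patents < 0 or total_patents < 0:
        raise ValueError("patent counts must be non-negative")
    if total_patents == 0:
        return None
    return dna_patents / total_patents
```

For example, a hypothetical firm with 30 DNA-based patents out of 120 issued US patents gets `genomics_intensity(30, 120)`, which is `0.25`.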