Our genomes today: time to be clear

DNA is an identifier. We are not defined by our genome, but our DNA is ours and we can be identified through it. Despite the comments made at the time, it was neither wicked nor tacky when Craig Venter, shortly after the first human genome sequence was published in 2001, publicly revealed that he was one donor of the samples used in Celera's genome sequencing project [1]. Venter later explained that by identifying himself as a donor he had intended to demystify the human genome and to reduce public fears about the potential misuse of genetic information [2].


The old days
Regarding the past puts the issues of identifiability and disclosure of personal and public genomes into perspective. The protection of individuals against the possible negative consequences of disclosure of their genetic information has been a major concern throughout the history of human sequencing. The Human Genome Project (HGP) has placed the protection of individuals at the core of its Ethical, Legal and Social Implications program since its inception as part of the HGP in 1989. In 1983, the US President's Commission for the Study of Ethical Problems in Medicine and Biomedical and Behavioral Research, as the designated advisory body to the Congress on these matters, reported on 'Screening and Counseling for Genetic Conditions'. The requirement of confidentiality already ranked first, before 'autonomy', among the Commission's five recommendations. The Commission recommended that genetics-related information be kept confidential and coded, although, notably, even these nascent recommendations made this conditional: 'whenever that is compatible with the purpose of the data bank' [3].
At the time there existed no doubt that, with the appropriate measures (in particular through coding techniques), anonymity could be preserved. Yet evidence contradicting this viewpoint, in the form of forensic identification of individuals using DNA, already existed in the 1980s. As genome sequencing becomes increasingly widespread and large amounts of data and biospecimens accumulate in many different types of biobanks, addressing the identifiability of individuals is an increasingly pressing issue.

Sharing data, protecting privacy
Biobanks and repositories were established to facilitate the storage and redistribution of samples and data. These efforts seek to meet core scientific requirements of sample and data sharing for the purposes of comparison, re-analysis and avoidance of redundancies in research efforts. Fulfilling the ethical and legal requirement of protecting study participants and, in particular, shielding their identity leads to a fundamental tension between data sharing and privacy.
Researchers, database managers and biobank directors have tried their best to meet both goals by using methods to obfuscate and de-identify biological material and data. These measures are not always successful. In 2007 the National Institutes of Health mandated that genome-wide association studies deposit data in a central database. Aggregate data were thought to be 'safe' and were thus publicly shared by the database of Genotypes and Phenotypes (dbGaP). That policy was immediately modified by dbGaP once it was demonstrated that individual genotypes were identifiable in pooled data [4].
This revision may have been exceptional: regulators are understandably reluctant to admit to incorrect assumptions about data safety. The introduction of restrictions on access to materials that have already been disseminated and have found widespread use does not add to the credibility of regulatory bodies. The recent re-identification of widely disseminated 'de-identified' samples by surname inference led to the removal of some publicly shared information, but not to the removal of associated data and samples from repositories [5]. Notably, the removed data were considered to be compliant with anonymization requirements mandated by the US Health Insurance Portability and Accountability Act, and re-identification occurred despite this. Today it is clear that individuals are identifiable through their unique biological profiles, and evidence of this (for example, from gene expression or microbiome datasets) continues to accumulate. Rapid advancements in the genomic sciences have tremendously increased our understanding of human biology. Global collaboration among researchers and sharing of human specimens and data to corroborate findings are key conditions of good scientific practice. Traditional research ethics regarding human subjects is based on different assumptions about scientific practice, assumptions of 'anonymity' being just one example, thereby posing increasing dilemmas to the scientific community.

Personal disclosure
At the other end of the privacy spectrum, opposite to presumed anonymity, is personal disclosure. At the time, many questioned the wisdom of Craig Venter's decision to reveal his inclusion in the compound human genome sequence published by his group [1]. Such a disclosure may have been in conflict with the study protocol, and Venter's action may have warranted more transparency with his colleagues. In addition, with some effort one could eventually extract trait and disease predictions about him and the other DNA donors, some of which could be considered stigmatizing.
However, personal disclosure can have great value. Disclosure has been at the foundation of most patient organizations and research-focused disease interest groups: giving up anonymity and sharing experiences has been for many patients and their relatives the only route to improved diagnostics, treatment and care. The nonprofit advocacy organization Genetic Alliance (and its many member organizations) is one excellent example. A case involving mental health information, which is currently considered to be potentially highly stigmatizing, illustrates the great value in disclosure. In 1908, mental health care in the United States was changed forever when Clifford Beers disclosed his personal history of mental illness and the miserable state of care in his autobiography [6], sparking a movement leading to comprehensive mental health reform.

Identifiability today
The prevention of the identification of individuals through meticulous and costly de-identification procedures has kept investigators, statisticians, data managers, ethics committees and oversight bodies busy, as data and sample collections have grown to form large databases and biorepositories.
Yet DNA is an identifier and, as such, all biological material and sequence data can ultimately reveal the identity of their source. While anonymity and confidentiality are promised to study participants, researchers are sharing data by default, as a necessary condition of good scientific practice and as required by funding agencies.
The Personal Genome Project (PGP) has been the first research project to make public sharing of data a reality while avoiding unsustainable promises of anonymity to participants: their comprehensive genotype and phenotype data are made accessible in the public domain [7][8][9]. Moreover, the value of the availability of robustly annotated variant sets is increasingly being recognized [10]. Going forward, researchers and participants should consider models similar to the PGP that allow them to build open-access resources; the outcomes and benefits of such clear and open collaborations may well exceed current expectations.

Abbreviations
dbGaP, database of Genotypes and Phenotypes; HGP, Human Genome Project; PGP, Personal Genome Project.

Competing interests
JEL is Ethics Consultant and MPB is Director of Research of the PGP; both serve on a voluntary basis and do not receive a salary or financial compensation from the PGP. The authors declare that they have no other competing interests.