The Personal Genome Project enrollment process
In society, a wide diversity of preferences exists with respect to levels of privacy, and many individuals choose to participate in the Harvard PGP despite the lack of assurance of privacy and anonymity. Enrollment and participation are very deliberate processes. Prospective participants must first verify their eligibility and, although enrollment is greatly facilitated by an online interface, it nevertheless requires several steps on the part of the participant to demonstrate understanding and consent. Each of these steps accounts for a fraction of potential participants who do not ultimately enroll (Figure 1), and in many cases these are likely individuals who realized that they did not wish to volunteer.
The most notable step in our online enrollment process is our requirement for potential participants to pass an enrollment examination. To ensure the decision to participate is well-informed, we provide a study guide and require individuals to correctly answer all questions on this examination. The examination design is modular (with each module to be repeated until all questions are answered correctly), and both our study guide and consent documents are publicly shared so that other studies may use or adapt them [7]. Our recent data show that the enrollment examination remains the most significant barrier in our online enrollment process: 59% of users who did not complete enrollment in the 2012 to 2013 time period stopped at the enrollment examination stage. About half of the people (49.8%) who created accounts on our site between June 2012 and December 2013 completed the enrollment process (Figure 1). This represents an update on our prior analysis of accounts created until May 2012, which was largely similar, with 41.1% of accounts completing the enrollment process [8]. Among those who passed the examination stage, 90% electronically signed the online consent form and fully enrolled in the project. As of 31 December 2013, 3,181 participants are fully enrolled.
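The modular design described above (each module repeated until every question in it is answered correctly) can be sketched as follows. This is only an illustration of the control flow, not the project's released website software; the names `take_exam`, `answer_fn` and `max_attempts`, and the example questions, are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a module maps question IDs to correct answers.
Module = Dict[str, str]


def take_exam(modules: List[Module],
              answer_fn: Callable[[int, str, int], str],
              max_attempts: int = 10) -> List[int]:
    """Return how many attempts each module needed before it was passed.

    answer_fn(module_index, question_id, attempt) supplies the respondent's
    answer. A module is passed only when all of its questions are correct.
    """
    attempts_used = []
    for i, module in enumerate(modules):
        for attempt in range(1, max_attempts + 1):
            if all(answer_fn(i, q, attempt) == correct
                   for q, correct in module.items()):
                attempts_used.append(attempt)
                break  # module passed; move on to the next one
        else:
            raise RuntimeError(f"module {i} not passed after {max_attempts} attempts")
    return attempts_used


# Example: a respondent who misses question "q2" on the first attempt only.
modules = [{"q1": "a", "q2": "b"}, {"q3": "c"}]


def respondent(i: int, q: str, attempt: int) -> str:
    if q == "q2" and attempt == 1:
        return "wrong"
    return modules[i][q]


attempts = take_exam(modules, respondent)
print(attempts)  # [2, 1]
```

Repeating a single module, rather than the whole examination, keeps the cost of one wrong answer small while still requiring every question to be answered correctly.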
The enrollment examination and the very detailed consent form emphasize the research-only character of the PGP, where participants are not expected to directly benefit. The resulting cohort is therefore enriched for highly motivated individuals interested in contributing to the project, and many of our participant-initiated communications are from participants interested in donating samples as well as genetic and health data they have gathered from external sources (see below).
After enrollment, participants continue to use our website to add data to their public profiles, and to review and publish the data we return to them. Although developing and maintaining the participant-facing infrastructure has been a significant cost, the benefits are apparent. Self-service makes it more practical for participants to exercise their will. Sensitive interactions, such as soliciting feedback during the withdrawal process, are carefully designed and can be consistently executed. The process of encoding the study protocol in the form of software sometimes reveals ambiguities that can be explored and clarified, resulting in better agreement between researchers’ behavior and participants’ expectations. Common interactions like enrollment and sample collection can be largely automated, so the incremental cost of each additional participant is extremely low. With the intention of making our participatory approach more accessible to other research projects, we have released the website software under the GNU General Public License.
Participant communication
Participation in the PGP is an ongoing relationship after enrollment. Accounts and data are managed through our online interface, and participants can use a ‘Contact Us’ button on the website to email us. In the 19 months analyzed here (June 2012 to December 2013; Figure 2), 579 emails were received, which averages about one email per day. Communications were diverse and included general interest and questions (for example, regarding eligibility requirements), interest in donation of data, reports of site bugs and account issues (for example, name changes), and questions about the timeline of sampling and return of data.
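As a quick check of the rate quoted above, the following sketch counts the days in the analysis window and divides the email total by that count. The exact endpoints (1 June 2012 and 31 December 2013, inclusive) are an assumption, since the text gives only the months.

```python
from datetime import date

# Assumed endpoints of the analysis window (inclusive).
start = date(2012, 6, 1)
end = date(2013, 12, 31)

days = (end - start).days + 1   # +1 to count both endpoints
emails = 579                    # total reported in the text

rate = emails / days
print(days, round(rate, 2))     # 579 1.0
```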
As in any study, participants can decide to withdraw at any moment and in the PGP such a decision is not influenced by a patient-physician relationship or opportunities for clinical interventions. Since online enrollment began in 2010, less than 1% of users who have fully enrolled have later withdrawn (26 participants); of the nine participants who shared reasons for withdrawal, five expressed concerns about privacy that developed after enrollment, and four expressed frustration with the timeline and requirements involved for participation. From June 2012 to December 2013, 17 out of 579 emails sent by participants were related to the issue of withdrawal (3.3%, see Figure 2, ‘Withdrawal’). Of the 185 participants who have publicly shared whole genome or exome data, none have withdrawn from the project.
Participant experiences with the return of genome data
Most projects that create biological data and cell lines do not return data to participants. Samples are typically stripped of identifying data to protect the privacy of participants; although there is increasing recognition that this may not be sufficient to prevent unwanted re-identification, it nevertheless theoretically renders researchers unable to return data to their study participants. Other rationales for not returning data include concerns regarding misuse of data as a clinical tool, and potentially burdensome participant requests for assistance with data interpretation. Modern genotyping and sequencing technologies should cause us to question the coherence of this traditional approach, especially when projects generate public sequence data. Individuals now have ready access to deep genetic data about themselves through direct-to-consumer services, with datasets of one million single nucleotide polymorphisms available for $100 to $200. The difference between ‘public data’ and ‘access to one’s personal data’ is essentially reduced to the effort a participant must make to identify which public dataset is their own.
Access to and return of data is one of the core components of the PGP [9], and the PGP has so far returned whole genome data to 163 individuals. (Our total of 185 includes an additional 22 individuals who have shared genome or exome data obtained elsewhere.) We emphasize to participants that our data are research-grade (that is, not for clinical use) and that many types of error are possible, including errors in data, failure to discover or report significant genetic issues, and ambiguous or false positive findings. We also provide access to genome interpretations as produced by the Genomes-Environments-Trait (GET)-Evidence system, which provides a mechanism for continued improvement in genome interpretation and annotation through participant engagement and community review of the scientific literature [8]. Only a small fraction (11%) of participants who received whole genome data have contacted us regarding that data. Of these, only a minority (19%, or 0.8% of total communications) sought additional knowledge regarding interpretation; most (81%, or 3.3% of total communications) inquired about file formats and access to additional data files, with the aim of pursuing additional analysis themselves.
We continue to use our GET-Evidence system to record interpretations of a variety of variants found in participant genomes. These interpretations are publicly shared on the GET-Evidence website [10]. Our overall experience generally continues to be one of ‘false positives’: variants reported to cause phenotypes that the participant does not appear to have. We believe these are generally due to a lack of statistical significance in the original literature rather than to sequencing errors (notably, sequencing errors are randomly distributed and unlikely to match a previously reported variant).
One false-positive variant that usefully illustrates the uncertainties in genome interpretation is SCN5A-G615E. This variant was found in a participant who is identified in our public dataset as hu034DB1. Several publications implicate it as a cause of long-QT syndrome. Recommendations released by the American College of Medical Genetics (ACMG) [11] advise that clinical studies report known pathogenic variants (defined as ‘previously reported and a recognized cause of the disorder’) and expected pathogenic variants (defined as ‘previously unreported and is of the type which is expected to cause the disorder’) in SCN5A. How do we determine which variants meet these criteria? A non-skeptical reading of the literature would define variant SCN5A-G615E as a known pathogenic variant. However, we observed that none of these publications demonstrated variant-specific statistically significant enrichment for this variant in cases versus controls. We also confirmed that our participant reported no family history consistent with this disease, and that she pursued clinical evaluation after learning of this variant and was not diagnosed with the disease. Although disease may later manifest in this participant, we have yet to discover a case of unexpected disease in which the causal variant’s pathogenic hypothesis lacked statistical significance. Our experience, in the context of incidental findings, is that the ACMG recommendations provide little guidance when there is no accompanying variant-specific consensus regarding which variants within those genes warrant clinical response.
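To make concrete what ‘variant-specific statistically significant enrichment in cases versus controls’ might look like, the sketch below applies a one-sided Fisher's exact test to a 2x2 table of carrier counts. The choice of test is ours for illustration and is not stated in the text as the method used; the counts are entirely hypothetical and are not taken from any of the cited publications.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test p-value for the 2x2 table

                  carriers   non-carriers
        cases        a            b
        controls     c            d

    Returns P(at least `a` carriers among cases), with margins held fixed.
    """
    cases = a + b
    carriers = a + c
    total = a + b + c + d
    upper = min(cases, carriers)
    # Hypergeometric tail: sum over tables at least as extreme as observed.
    tail = sum(comb(cases, k) * comb(total - cases, carriers - k)
               for k in range(a, upper + 1))
    return tail / comb(total, carriers)


# Hypothetical counts: one carrier among 500 cases, none among 500 controls.
p_single = fisher_exact_greater(1, 499, 0, 500)
print(p_single)  # 0.5
```

Under these hypothetical counts, a single carrier observed in a case series yields p = 0.5, which is no evidence of enrichment. A report of this kind can establish that a variant has been seen in a patient without establishing that it is more common in patients than in controls.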
We also have at least one ‘true positive’ to report: one participant discovered an unanticipated disease after genome sequencing revealed a rare genetic variant. JAK2-V617F, found in a blood sample donated by huA90CE6, is an acquired mutation associated with myeloproliferative disorders. Although this gene is not included in the ACMG recommendations, our evaluation of the literature concluded that a significant fraction of carriers later develop myeloproliferative disorders. Although this participant was not suspected of having any genetic disease, he had a past medical incident involving a blood clot and, upon self-pursued clinical evaluation subsequent to detection of this variant, was discovered to have abnormally high platelets (essential thrombocytosis) and now treats this with low-dose aspirin. The participant, as a journalist, reported this experience in an article series for Bloomberg News [12].
Participant-contributed data
Our study allows participants to autonomously contribute diverse data to be shared on their public profiles, and many of the emails we receive from participants are inquiries about such contributions (14.3% of emails in the period from June 2012 to December 2013, see Figure 2). To facilitate donation of health records, we have supported import of data from Google Health (now discontinued) and Microsoft HealthVault in Continuity of Care Record format. We parse health conditions from these records for re-display on our site. We would like to share the raw data files themselves, but these files contain sensitive personal data (for example, full names of participants, their health care providers, and email addresses); even participants open about their account identity may not wish to have all such information publicly shared. In the interest of facilitating future public datasets, we encourage developers of health record management systems to allow individuals to remove their personal identifiers and contact information when exporting records. As of December 2013, 1,235 participants (39% of 3,191 enrolled participants) have contributed health record data through these resources.
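An export option of the kind we encourage could, in its simplest form, drop identifier fields from a record before it is written out. The sketch below assumes a record already parsed into nested dictionaries and lists; the field names are hypothetical and do not reflect the actual Continuity of Care Record schema.

```python
# Hypothetical identifier field names; NOT the real CCR schema.
IDENTIFIER_KEYS = frozenset(
    {"full_name", "email", "provider_name", "phone", "address"}
)


def strip_identifiers(record, keys=IDENTIFIER_KEYS):
    """Return a copy of a nested dict/list record without identifier fields."""
    if isinstance(record, dict):
        return {k: strip_identifiers(v, keys)
                for k, v in record.items() if k not in keys}
    if isinstance(record, list):
        return [strip_identifiers(item, keys) for item in record]
    return record  # scalar values are kept unchanged


# Hypothetical example record.
record = {
    "full_name": "Jane Doe",
    "email": "jane@example.org",
    "conditions": [
        {"name": "myopia", "provider_name": "Dr. Example"},
        {"name": "dental caries"},
    ],
}

cleaned = strip_identifiers(record)
print(cleaned)
# {'conditions': [{'name': 'myopia'}, {'name': 'dental caries'}]}
```

Field-level removal of this kind does not catch identifiers embedded in free-text notes, so it would be a starting point rather than a guarantee that an exported file is safe to share publicly.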
Parsing these records gives us valuable insight into the health and trait data represented in the participant cohort. We recognized, however, that these data can be non-uniform; for example, there are many traits participants may not think to report because they are common or mostly benign. To address this, we created a series of 12 surveys spanning 239 phenotypes (Additional file 1) based on the traits and conditions listed in our health record data. To allow for hypothesis generation and the discovery of unknown associations with genetic variants, the range is intentionally broad, from extremely common traits (for example, myopia, dental caries) to moderately rare conditions (for example, porphyria, Marfan syndrome). As of December 2013, 680 participants (21%) have completed all 12 surveys to add trait and disease data to their public profiles. Among the 185 participants who have released whole genome or exome data, 133 (72%) have completed all 12 surveys.
Participant willingness to contribute data extends beyond health data. Many inquiries we receive are from participants interested in donating genetic data acquired elsewhere (8.2% of participant-initiated communications, see Figure 2). As of 31 December 2013, 462 participants have shared genetic data acquired from other sources through their public profiles. These data consist primarily of single nucleotide polymorphism genotyping data, but also include 22 whole genome and exome datasets.
Building a participatory research community
Forgoing the assurances of privacy and allowing participants to publicly share identifiable data has shown practical benefits. One important difference we have discovered is that participants are no longer isolated: participants and researchers have been able to meet each other at our yearly GET conference. Participants have also formed participant-managed online groups, including groups on LinkedIn and Facebook and an online forum [13]. The formation of a participant community allows participants to share knowledge, participation experiences, news items of interest, and mutual assistance with the understanding of research data.
Public data inspires important discussions. In January 2013, Gymrek et al. used publicly available data from HapMap project samples to demonstrate re-identification methods [4], and later that year another group used our project’s data for similar research [14]. Notably, because these data are public, this research is considered exempt according to exemption 4 of the ‘Common Rule’ of Health and Human Services regulations (45 CFR part 46 subpart A) [15]. No PGP participants withdrew from the project because of these incidents, which suggests that they correctly understood the public nature of their data with the PGP. However, these events highlight a concern for participants in mainstream studies whose data or specimens have been shared publicly and for whom privacy was assured: there is currently no requirement for ethical oversight of re-identification efforts conducted by researchers in the US if they work with publicly available material [16].
Many PGP participants choose to be public about their identity, and some of these have written about the project to share their personal experiences with genome data, as well as broader lessons about genome research and technology. This includes the reporting by John Lauerman mentioned earlier [12], an editorial by Steven Pinker [17], and a book by Misha Angrist [18]. With these writers we can see one of the great potential benefits of participatory research: bridging the divide between researchers and their community to more broadly share scientific understanding.