Machine learning identifies a compact gene set for monitoring the circadian clock in human blood

Background
The circadian clock and the daily rhythms it produces are crucial for human health, but are often disrupted by the modern environment. At the same time, circadian rhythms may influence the efficacy and toxicity of therapeutics and the metabolic response to food intake. Developing treatments for circadian dysfunction, as well as optimizing the daily timing of treatments for other health conditions, will require a simple and accurate method to monitor the molecular state of the circadian clock.

Methods
Here we used a recently developed method called ZeitZeiger to predict circadian time (CT, time of day according to the circadian clock) from genome-wide gene expression in human blood.

Results
In cross-validation on 498 samples from 60 individuals across three publicly available datasets, ZeitZeiger predicted CT in single samples with a median absolute error of 2.1 h. The predictor trained on all 498 samples used 15 genes, only two of which are part of the core circadian clock. By then applying ZeitZeiger to 475 additional samples from the same three datasets, we quantified how the circadian clock in the blood was affected by various perturbations to the sleep–wake and light–dark cycles. Finally, we extended ZeitZeiger (1) to handle intra-individual variation by making predictions based on multiple samples taken a known time apart, and (2) to handle inter-individual variation by personalizing predictions based on samples from the respective individual. Each of these strategies improved prediction of CT by ~20%.

Conclusions
Our results are an important step towards precision circadian medicine. In addition, our generalizable extensions to ZeitZeiger may be applicable to the growing number of biological datasets that contain multiple observations per individual.

Electronic supplementary material
The online version of this article (doi:10.1186/s13073-017-0406-4) contains supplementary material, which is available to authorized users.


Figure S1
Strength of circadian rhythmicity, quantified as a signal-to-noise ratio (SNR), for gene expression in human blood. As in Fig. 2, data are from control samples from all three datasets. (A) Cumulative distribution function of SNR for all genes. (B) SNR for core clock genes.
(C) SNR for genes in the ZeitZeiger predictor.

Figure S2
Ten-fold cross-validation to predict CT using only the core clock genes (otherwise identical to Fig. 1).

Figure S3
Boxplots of log-likelihood of predicted circadian time for each condition in each dataset (related to Fig. 3). For each of the three datasets, a predictor was trained on control samples from the other two datasets, then tested on all samples from the dataset of interest. The left-most condition in each dataset is the control.

Figure S4
Phase difference (i.e., difference in circadian time of peak expression) between each perturbation condition and the respective control condition for core clock genes. A negative phase difference corresponds to a delay relative to the control 24-h light-dark cycle. Points are only shown if the clock gene showed a signal-to-noise ratio of at least 0.4 in both the control condition and the perturbation condition.

Applying ZeitZeiger to all samples ("in phase" and "out of phase") from GSE48113. Instead of predicting circadian time, ZeitZeiger was trained to predict time relative to average DLMO in each condition. Average DLMO was ~1 h later (relative to the original light-dark cycle) in the "out of phase" samples than in the "in phase" samples, so using time relative to DLMO instead of CT merely shifts the times in the "out of phase" samples and reduces the apparent delay between "in phase" and "out of phase" samples by ~1 h.

Boxplots of improvement in absolute error between the universal predictor and the ensemble predictor with universal guidance, as a function of number of personal training samples (because personal predictions were based on leave-one-out cross-validation, this is equal to the number of samples for the respective individual minus one).
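
Both the phase differences in Figure S4 and the absolute errors reported above are differences between times on a 24-h circle, so they need to wrap around midnight. The following is a minimal Python sketch of that arithmetic, not the authors' implementation. The function names `circ_diff` and `circ_abs_error`, the example values, and the sign conventions (control minus perturbation for phase difference, universal minus ensemble for improvement) are assumptions chosen so that a delayed peak gives a negative phase difference, as the Figure S4 caption states.

```python
import numpy as np

PERIOD = 24.0  # assumed length of one circadian cycle, in hours


def circ_diff(a, b, period=PERIOD):
    """Signed difference a - b, wrapped into [-period/2, period/2)."""
    d = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) % period
    return np.where(d >= period / 2, d - period, d)


def circ_abs_error(predicted, observed, period=PERIOD):
    """Absolute error between predicted and observed times on the circle."""
    return np.abs(circ_diff(predicted, observed, period))


# Hypothetical peak times (h): the perturbed peak occurs 2 h later than control.
# Assumed convention: control minus perturbation, so a delay is negative.
phase_difference = circ_diff(8.0, 10.0)

# Hypothetical predictions for one sample whose observed time is 1.0 h.
universal_error = circ_abs_error(3.0, 1.0)
ensemble_error = circ_abs_error(1.5, 1.0)
# Assumed convention: positive improvement means the ensemble error is smaller.
improvement = universal_error - ensemble_error
```

Wrapping matters near midnight: a prediction of 23.5 h for a sample observed at 0.5 h is off by 1 h, not 23 h.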

Figure S10
Personalized predictions with universal guidance applied to groups of samples. Each group consisted of two samples taken ~12 hours apart from the same individual.
(A) Boxplots of absolute error for universal (standard 10-fold cross-validation, identical to Fig. 3), personal (leave-group-out cross-validation for each individual), and ensemble (circular mean of universal and personal) predictors. (B) Improvement in absolute error between universal predictor and ensemble predictor as a function of the number of personal training samples for that group (equal to the number of samples for that individual minus two).
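
The ensemble prediction in panel (A) is described as the circular mean of the universal and personal predictions. A minimal Python sketch of a circular mean on a 24-h clock is shown below. The function name `circ_mean` and the example values are illustrative assumptions and are not part of ZeitZeiger.

```python
import numpy as np


def circ_mean(times, period=24.0):
    """Circular mean of times (in hours) on a clock with the given period."""
    # Map each time to an angle on the unit circle.
    angles = 2 * np.pi * np.asarray(times, dtype=float) / period
    # Average the unit vectors, then convert the mean direction back to hours.
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    return (mean_angle * period / (2 * np.pi)) % period


# Hypothetical universal and personal predictions (h) for one sample.
universal_prediction = 23.0
personal_prediction = 3.0
ensemble_prediction = circ_mean([universal_prediction, personal_prediction])
```

In this example the arithmetic mean would be 13 h, whereas the circular mean is about 1 h, which lies between the two predictions on the clock face. If two predictions are exactly 12 h apart, the averaged vector has zero length and the circular mean is undefined, so such a case would need separate handling.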

Figure S11
Genes and SPCs of the personal predictors trained with universal guidance.
(A) Heatmap of genes present in personal predictors trained with universal guidance (using 15 genes present in predictor shown in Fig. 2). Rows correspond to genes and columns correspond to individuals. Black indicates the gene was present in the predictor for that individual. Rows and columns were sorted by hierarchical clustering.
(B) Histogram of difference between peak times of SPC 1 and SPC 2. (C) Circadian times of peak expression for SPC 1 and SPC 2 in the personal predictors. Each point corresponds to one individual. For ease of visualization, some peak times for SPC 2 were shifted by 24 hours.