Strength of functional signature correlates with effect size in autism

Background Disagreements over genetic signatures associated with disease have been particularly prominent in the field of psychiatric genetics, creating a sharp divide between disease burdens attributed to common and rare variation, with study designs independently targeting each. Meta-analysis within each of these study designs is routine, whether using raw data or summary statistics, but combining results across study designs is atypical. However, tests of functional convergence are used across all study designs, where candidate gene sets are assessed for overlaps with previously known properties. This suggests one possible avenue for combining not study data, but the functional conclusions that they reach. Method In this work, we test for functional convergence in autism spectrum disorder (ASD) across different study types, and specifically whether the degree to which a gene is implicated in autism is correlated with the degree to which it drives functional convergence. Because different study designs are distinguishable by their differences in effect size, this also provides a unified means of incorporating the impact of study design into the analysis of convergence. Results We detected remarkably significant positive trends in aggregate (p < 2.2e-16) with 14 individually significant properties (false discovery rate <0.01), many in areas researchers have targeted based on different reasoning, such as the fragile X mental retardation protein (FMRP) interactor enrichment (false discovery rate 0.003). We are also able to detect novel technical effects and we see that network enrichment from protein–protein interaction data is heavily confounded with study design, arising readily in control data. Conclusions We see a convergent functional signal for a subset of known and novel functions in ASD from all sources of genetic variation. Meta-analytic approaches explicitly accounting for different study designs can be adapted to other diseases to discover novel functional associations and increase statistical power. Electronic supplementary material The online version of this article (doi:10.1186/s13073-017-0455-8) contains supplementary material, which is available to authorized users.


Background
Over the past decade, enormous progress has been made in characterizing sources of DNA variation contributing to disease. Most of this progress has been enabled by study designs which are carefully tailored to exploit technologies targeting particular classes of variation. Researchers have used chromosomal analysis arrays [1][2][3][4], genotyping arrays [5][6][7][8], whole-exome sequencing (WES) [9][10][11][12][13], and whole genome sequencing (WGS) [14,15] to identify risk loci and alleles. The results from these studies cannot be naively compared; common variants are limited to regions of the genome with known variation (a SNP is known) but only reach significance with large numbers, while rare or ultra-rare variants are conditioned on not being in this list of common variants. Trio and quad studies are used mainly in WES and WGS study designs, while large case and control cohorts are required for signals in genome-wide association studies (GWAS). Thus, for each study design, we are asking distinct questions that relate to the population prevalence, disease mechanism, burden, and risk.
Within each study, however, it is commonplace to look to overlapping functional properties of candidate disease genes to find the biologically meaningful signal among the positive results. Candidate genes are prioritized based on enrichment analyses in pathways related to the phenotype (e.g., neuronal activity regulation) or some other disease feature shared by the genes (e.g., expression in the brain). If these methods return no significant results, more complex methods are performed to extract common features from the disease gene set [16], such as co-regulatory module detection from co-expression networks [17] or binding from protein-protein interaction (PPI) networks [18]. Regardless of the study design, the analysis with respect to functional convergence follows a similar (and largely separable) design: genes selected as hits are tested for the presence of some joint signature with the null provided by genes which are not hits. By the same logic that suggests testing hits for functional convergence relative to the background, we hypothesize that sets of genes which are "strong" hits will show more functional convergence than those which are "weak" hits.
We suggest that the degree of functional convergence may by hypothesized to vary (monotonically) with the degree to which genes are causal for the disease. Genes only weakly causal, whether due to high false positive rates in the study design or low effect sizes, are not strongly implicated as sharing a joint role by their cooccurrence as disease-related. For instance, disease candidates from GWAS have low relative risks (and therefore low effect sizes) as they are inherited common variation in the population. On the other hand, de novo mutations are a form of genetic variation which evolutionary forces have had little time to act upon [19] (e.g., unless embryonically lethal) and are of high risk (and high effect sizes). Studies also suffer from type I errors (false positives), and this too should be reflected in an aggregate disease signal of the candidate genes, as quantified by their common functional properties. A set of genes with de novo mutations will show a strong aggregate disease signal, while we might expect a weaker signal from the gene candidates from GWAS [20]. Measuring their "functional convergence", as determined by a gene set enrichment test or network analysis, we can thus exploit our knowledge of gene candidates' effect sizes and false positive rates. For a true disease property, we expect the correlation between gene set effect size and functional convergence to be strong, and for a weak or artifactual property, we expect no significant correlation.
We propose to test this hypothesis by running a metaanalytic study on autism spectrum disorder (ASD [MIM 209850]) candidates across numerous genetic studies and over a wide range of gene properties and functions. ASD is a neurodevelopmental disease commonly characterized by behavioral traits such as poor social and communication skills [21]. In more severe cases, ASD is comorbid with mild to severe intellectual disability, facial and cranial dysmorphology, and gastrointestinal disorders.
Perhaps because of grouping these multiple and sometimes distinct phenotypes into one disorder, and the complexity of behavior as a trait, understanding the genetic architecture of this cognitive disease has been non-trivial [22]. The genetic component of ASD is estimated to be 50-60% [23]; however, in a substantial number of cases the underlying genetic factors of the disease are still unknown. Due to these levels of heterogeneity, multiple studies and study designs have been used to determine the underlying genetics which we make use of here. Taking these different studies, we construct several disease gene candidate collections, each containing genes of similar levels of risk, as determined by their odds ratios and relative risks. On every gene collection, we run a number of analyses, calculating the functional convergence using standard enrichment methods, and more complex network analysis enrichments. By exploiting trends in targeted genetic variation and their known effect sizes, we demonstrate it is possible to discriminate biologically convergent signals from likely technical artifacts at a very fine resolution. The disease properties with strong trend signals are largely consistent with the known literature on ASD (e.g., FMRP interactor enrichment), but we also see a few otherwise interesting properties as unlikely to be disease-specific. Particularly protein-protein interaction networks and some co-expression networks, which extract artifactual signals from the study design, show signals in control data using that study design. Our focus here is on autism due to our interest in the disorder, its well-powered data, and also its phenotypic and genetic heterogeneity.

Study design
An overview of our study design and method is shown in Fig. 1. Briefly, we start by characterizing the ASD gene sets collected for this analysis. Each study's results were collapsed individually into a set of genes, with an estimated average effect size for that candidate set (Fig. 1a). We calculate a functional effect (e.g., statistical overlaps with known functions; Fig. 1b) for disease-specific and more general gene functional properties. We then calculate the correlation of these functional convergences with the estimated effect size of that variant class (Fig. 1c). More specifically, we test to see if the set of genes with high effect sizes have strong relative functional convergences as measured by a functional enrichment of some disease property across them, and if those with low effect sizes have weaker functional signals. We apply this test to numerous functional properties on candidate gene sets from a variety of study designs. Functions with positive correlations (positive trends) we believe will show signatures that are likely associated with autism and can be used for further functional characterization of the disease. Throughout our work we refer to the "effect size" as the disease burden or risk of a gene candidate (or the average of such values within a gene set), and the "functional convergence" as the significance of a functional test for a disease gene set after controlling for the set size.

Study data Disease gene candidate sets
We first collected candidate disease gene sets from available autism studies. We selected the largest study of WES of families from the Simons Simplex Collection (SSC) [24]. We defined different sets of genes from over 2000 gene candidates, splitting into recurrent (at least two probands having the mutation) and non-recurrent mutations, according to mutation type (loss-of-function, missense, and silent mutations). We selected copy number variant (CNV) data also from the individuals in the SSC [25] and parsed it into similar sets. We then used the CNVs as parsed by Gilman et al. [26], which prioritized genes with their NETBAG algorithm. For GWAS gene sets, we generated two lists from the Psychiatric Genomics Consortium (PGC) study on autism and related psychiatric disorders [22]: one from the reported gene list and a second list of all adjacent genes as listed in the GWAS NHGRI-EBI catalog [6]. Our negative control sets included using the genes with mutations in the unaffected siblings of the probands from the SSC studies. This yielded 11 candidate gene lists, categorized by average effect size.
As an additional test, we perform a control experiment to test the ability of our method to discern psychiatric GWAS from other traits and diseases. To assess this, we will swap out the autism GWAS candidate lists for other candidate lists from other GWAS and rerun our analyses. We took all the GWAS data in the GWAS catalog [6], totaling over 1396 traits across 2066 studies. For each trait, we created gene lists with the reported genes. We conditioned on traits with at least 27 genes, which left us with approximately 148 traits. Each of those gene sets was then substituted for the autism GWAS, allowing us to test which GWAS data shows most convergence with the other autism data.  Fig. 1 Functional convergence trend calculation. a Starting with disease gene set collections, we rank each by the average effect size of the genes within that set. b We then run "functional tests" on these genes sets and calculate a functional convergence for each. c Then, using the ranking of the disease gene sets, we measure the functional convergence signature-the correlation of the trend line of the functional convergences versus the rank Gene functional annotation data Co-expression networks The majority of recent studies used co-expression networks from BrainSpan to illustrate network convergence among disease genes of ASD (e.g., [27][28][29]). In a similar fashion, we generated a brain-specific network from the BrainSpan RNA-seq data (578 samples). In addition to this, we generated an aggregate co-expression network from 28 brain tissue and cell-specific microarray experiments (3362 samples). For more general networks, we used our aggregate RNA-seq and microarray coexpression networks as previously described by Ballouz and colleagues [30]. In brief, these are the aggregates of 50 networks (1970 samples) and 43 networks (5134) samples, respectively, across various tissues, cell types, and conditions. As a comparison to the aggregate networks we recommend, we constructed and tested individual networks from single experiments that are more commonly used. This includes tissue-specific coexpression networks from the GTEx data [31] (29 tissues) and age-specific co-expression networks (five age groups). As additional tests, we took a further 227 RNAseq expression datasets with at least 20 samples within each experiment from GEMMA [32], and have generated a further 454 individual human co-expression networks, using all annotated transcripts (30,000, GENCODE [33]), and then only protein-coding genes (18,000).

Protein-protein interaction networks
We used the human physical PPIs from BIOGRID (version 3.2.121) [34] and created a binary PPI network, where each protein is a node and each PPI is an edge. Because of the sparseness of the network, we extended the network by modeling indirect connections [35], taking the inverse minimum path length between two proteins as the weighted edge, with a maximum distance of six jumps roughly as described in Gillis and Pavlidis [36]. We repeated this for alternative PPI datasets including: I2D [37] (v2.9), HPRD [38] (release 9), HIPPIE [39] (v1.8), IntAct [40], the CCSB interactome database [41] (HI-III v2.2), STRING [42] (v10), and PIPs [43]. A non-interacting PPI network was created from data from the negatome [44] (v2).

Gene sets and collections
We considered common functional gene sets and neurological-specific sets, as used in numerous studies, as gene sets to test for ASD candidate enrichment. These included the post synaptic density (HPSD) gene set [45], synapse sets [46], the synaptosome [47], chromatin remodeling set [48], fragile X mental retardation protein (FMRP) set [49], and gene essentiality [50]. For more standard sets, we also took the Gene Ontology (GO) terms (April 2015) [51] and KEGG pathways [52]. For each GO term, we only used evidence codes that were not inferred electronically and propagated annotations through the ontology (parent node terms inherited the genes of their leaf node terms). To minimize redundancy from GO, we restricted our enrichment analyses to GO term groups with sizes between 20 and 1000 genes. These GO terms and KEGG groups were used in the enrichment analyses with the full multiple hypothesis test correction penalty. As an extension to the original study, we collected alternative gene property sets for more functional enrichment tests. For this we used all the collections from MSigDB [7] (gene sets H, C1-C7). We calculated the multifunctionality of a gene based on the number of times a gene is seen as being annotated to a function (using GO) [53].

Disease gene score sets
We used disease gene scoring methods that rank genes according to how likely they are to have damaging effects if they are mutated. This included the Residual Variation Intolerance Score (RVIS) [54], haploinsufficiency (HI) scores [55], mutational rates, and constrained gene scores and probabilities (pLI) from ExAC [56].

Expression data
To obtain brain-specific expression and differential expression information, we used three common and large sample size brain-specific transcriptomic sets. These included the Human Brain Transcriptome (GSE25219) [57], BrainSpan [58], and the Human Prefrontal Cortex transcriptome (GSE30272) [59]. We divided the samples into fetal (post-conception week) and post-birth stages, and performed a straightforward differential expression fold change analysis (averaging across these stages) [60].

Calculating average disease effect sizes
For the 11 candidate disease and control gene sets ( Table 1, Fig. 2a), we ranked the set according to the overall or average "effect size" of the genes within it. For the de novo mutation candidates, we took the ratio of observed counts of mutations to silent mutations within the study for that class of mutations, and then the ratio of those odds between siblings to probands (as calculated in Sanders et al. [10]). To calculate this effect size for the GWAS results, we took the average odds ratios from the individual studies of each SNP, which ranged between 1.01 and 1.1. For the control sets (siblings and the silent mutations), we took the effect size to be null. We then ranked the sets based on these overall effect sizes. After these calculations, we ended up with three general classes: null effects (as controls), weak effects (missense and common variants), and strong effects (rarer loss-of-function and copy number variants).

Calculating functional convergences
Our functional tests, described below, return p values which are dependent on the size of the gene set being considered. The statistical tests differ depending on the mode of analysis (e.g., enrichment or network), but by "functional convergence" we simply mean significance (p value) after correcting for the set size, typically by downsampling. For the downsampling, we took a subset of genes, recalculated the p value 1000 times, and then took geometric means of the adjusted p values. Throughout, where we write "functional convergence" it is possible to read "p value after correcting for set size".

Network connectivity
We measure the clustering of sets of genes within networks through the use of a network modularity calculation. We compare the degree of connections a gene has to all the genes in the network (global node degree), and to those of interest within the sub-network they form (local node degree). The null expectation is that genes will be connected equally well to genes within the subnetwork as to those outside. Genes with large positive residuals have more weighted internal connections than external connections, implying a well inter-connected module. We test the significance of this distribution of residuals to a null set (random similarly sized set of genes, Mann-Whitney-Wilcoxon test, wilcox.test in R) to determine our test statistic.

Gene set enrichment testing
As a way to determine the level of enrichment of the candidate gene sets within other functional sets, we used a hypergeometric test with multiple test correction (phyper in R). The downsampled p value was used as the functional convergence measure.

Disease gene property testing
For the disease gene scoring properties, we tested the significance of the scores of the candidate genes using the Mann-Whitney-Wilcoxon test (wilcox.test in R). The functional convergence was the p value of this test.

Measuring functional convergence trends
For each gene property tested, we then measured the "trend" by calculating the correlation of the ranked functional effect sizes of our gene sets, whereby the gene sets are ordered according to their effect size ranks. A positive correlation is one where the function tested is correlated with our ordering. We computed this using Spearman's rank coefficient to capture the degree of variation, but the significant subsets identified are generally robust to choice of measurement metric such as the Pearson's coefficient. We limited our functional convergence tests to the subset of functions where at least one gene set of the 11 showed a significant functional convergence signal (p < 0.05). In essence this filtering removes gene sets where there are, for example, no overlaps with any disease sets and should not affect our analysis. For our main analysis, we tested 4210 functional properties, the majority being GO and KEGG groups. The additional functional properties we included such as MSigDB and all co-expression networks increased this number to a total of 18,116 tests.

Determining significance of the functional convergence trends
To calculate a null, we permute the labels of the gene sets and calculate the functional convergence trends. Note that in the ranked case, this is simply the null distribution of a Spearman correlation, with similarly associated significances. We first filter for functional tests where any one of the disease and control gene sets have a functional convergence of 0.05, but report both pre-and post-filtering results. Because our hypothesis (and test) are concerned with the ordering of functional effect sizes, filtering so that the data have at least one significant value changes the null distribution only slightly (e.g., probability of ties). We calculate the number of significant correlations based on the false discovery rate (FDR) at 0.01 and 0.05. Known confounds of disease gene sets are gene length [61] and gene multifunctionality, and to test this we generated matched gene set controls by sampling genes with similar gene lengths, GO multifunctionality and disease multifunctionality measures. Using the ranked CDS (coding DNA sequence) region of the genes, we generated sets of genes of similar ranked length distributions to the 11 real gene sets in the analysis. Downsampled, we then ran the analyses on these gene sets that are specifically not involved in the phenotype. This was repeated for multifunctionality as calculated using GO and then disease (using Phenocarta [62]).

Results
Little overlap of the autism candidate genes across gene sets of different effect sizes We find genes with loss-of-function de novo mutations to be little implicated in GWA studies, with only four candidate genes overlapping those two sets ( Fig. 2b; hypergeometric test p = 0.76). Interestingly, the more recurrent genes in the loss-of-function de novo set, the more unlikely they are to be found in other gene sets. For gene sets with the lower average effect sizes (e.g., the genes with missense mutations), their overlap with other gene sets is greater, in particular with the control sets ( Fig. 2b; hypergeometric tests p~4.4e-3 to 2.4e-6).
The de novo variants are conditioned on being rare (low frequency) and novel by not appearing in the parents. The SNPs used in GWAS are generally conditioned on being common by having minor allele frequencies greater than 0.05 [63]. Even if this filtering is done on the variant level, and not on the gene level, it still creates selection trends within our observations of variants and thus genes. This is possibly a version of Berkson's effect [64], where selecting for an outcome generates negative correlations between potential causes for it. An additional . c Common biases that affect studies are gene length and number of functional annotations. The average standardized rank (± standard error) of genes with respect to these properties shows that the "rare" disease sets contain longer genes but are not much more multifunctional than random cause is largely technical; since we've conditioned on frequency, genes with higher mutability are depleted in our rare lists and enriched in our common lists. Thus, the lack of overlap is at least potentially not largely reflective of underlying genetics or biology, but likely due to the selection bias in obtaining them. There is also poor overlap within the rarer variation itself, for instance of genes within CNVs and those with loss-of-function SNVs (three genes, p~0.37); there is generally a discrepancy between study designs focused on (different) sources of rare variation, and not just rare versus common. It should be noted that whether biological or technical, the lack of overlap does nothing to discredit either common or rare variation as a contributor to the disease, but it does highlight the need for a framework to combine and analyze the results of these studies that is aware of these biases and can distinguish biology from technical effects.

Functional convergence trends as shown through enrichment and connectivity tests
While enrichment analysis is comparatively straightforward, we demonstrate an example in Fig. 3a using the genes with de novo loss-of-function mutations from Iossifov et al. [11] (341 genes) and their overlap with essential genes (see "Methods"). In Fig. 3a, we represent this enrichment test as a Venn diagram of the overlap of the candidate disease gene set with the essential gene set and calculate the significance of the overlap with a hypergeometric test (n = 82, p~9.8e-9). We continue this analysis on the other candidate disease gene sets from recent ASD studies, varying across study designs and technologies (WES, GWAS, and arrays). Splitting each gene set by mutational class, recurrence, and gender, we perform the same hypergeometric tests. To make comparable assessments between studies and gene sets,

Significance of excess connectivity
Functional convergence p~7.4e-9 Effect size: 4.1

De novo loss-of-function SNVs (recurrent)
Functional convergence: p~0.08 Effect size: 1.1 . a Gene set enrichment is calculated with a hypergeometric test. A large number (34%) of the genes in the de novo loss-of-function set overlap with essential genes (hypergeometric test p~9.8e-10). b This is repeated across all disease gene sets (a subset shown here). Sample size is controlled through downsampling. Gene sets with the higher effect sizes also have the higher functional convergences. c We can now demonstrate how to calculate the functional convergence trend for the "essential genes" test. The disease gene sets are ordered by an estimate of the average effect size of genes within the set (from low to high on the x-axis) and the functional convergence of that disease gene set is plotted (y-axis). A trend between the effect size of the candidate genes and their essentiality can be clearly observed. The network connectivity functional test (d) consists of calculating the ratio of disease genes' total connectivity (node degrees calculated from the whole network; sum of their connections) to their internal connectivity (node degrees of their subnetwork; sum of their connections to one another). The line (in grey) reflects the expected values if there is no preferential connectivity within the set. We see that a large number (72%) of the genes lie above the identity line. The Wilcoxon p value of the mean residuals is shown in the inset (p~7.83e-41). e Once again, controlling for sample size through downsampling, the functional convergence of each gene set is calculated (subset shown). f A weak trend between the effect size of the candidate genes and their degree of co-expression is visible. Empirical nulls are calculated by permuting disease gene sets and FDRs through a Benjamini-Hochberg correction against the resultant functional convergence trends we calculate the functional convergence by downsampling-selecting a subset of genes within that set and averaging the results over 1000 permutations (schematic in Fig. 3b). Taking a representative set of studies (Table 1), we use the degree of disease effect to rank these sets, noting that recurrence leads to a higher effect size even for variation and study designs of the same class by reducing the number of false positives. Placing the controls sets on the far left, and the highest disease rank set (recurrent de novo loss-of-function genes) on the far right, and plotting their functional effect values, we observe an upward trend (Fig. 3c Spearman's r s = 0.95, Fisher's transformation p < 8.24e-06). The slope (i.e., the correlation) of this trend line represents the "functional convergence trend", with higher correlations indicating higher functional effects. A less common (likely due to complexity) yet important functional test is network connectivity. Genes that are co-regulated or form parts of a functional unit, protein complex, or pathway are preferentially co-expressed, and this information is captured in co-expression networks. We next demonstrate how network-style effect sizes can be similarly calculated through a modularity analysis. In Fig. 3d, we plot the global node degrees (xaxis) against their connectivity to the remainder of the genes in the set (y-axis). In the null (grey line), the genes would be connected to other autism genes in proportion to the incidence of those genes within the genome. Deviations from this null across all genes generate excess modularity within this set (studentized residuals shown in Fig. 3d inset) and determine the statistical results reported for the set overall (Wilcoxon test). A large number of genes are highly interconnected in this set, as shown by the number of points above the line (Wilcoxon test on the studentized residuals, p~7.83e-41). It is important to note that this network analysis is calculated against the empirical null for each gene individually (x-axis) and so is unaffected by any gene-specific bias (such as length). Only higher-order topological properties across gene-gene relationships for a given gene can produce a signal. Even assortativity, the tendency for genes of high node degree to preferentially interact, is quite low within these data (r = 0.064). As in the previous steps, we repeat the network connectivity tests across all gene sets (Fig. 3e), also downsampling to calculate the functional convergence. Once again, gene sets with higher proportion of burden genes correlate with functional convergence tests ( Fig. 3e; Spearman's r s = 0.69, Fisher's transformation p < 0.02).

A subset of functional properties are correlated with disease effect sizes
We extended our analysis to other disease gene property tests, and calculated their effect size correlations, plotting the distribution of correlations ( Fig. 4a; 4210 functional tests performed, 4164 with calculable correlations). We then calculated the null distribution for the variation across effect sizes by permuting the estimated effect size for each real set and rerunning our analysis. Only limiting our functional tests to those where we had at least one gene set returning a significant enrichment signal, we observe a strong signal (61 tests; Fig. 4a; 14 functions FDR <0.01; Table 2). Reducing the stringency of the underlying enrichment (383 tests; Fig. 4b), we observe a weaker signal (ten functions FDR <0.01). Removing the underlying enrichment constraint, we observe that most functional tests are ordered consistent with the null, with a few highly correlated functions ( Fig. 4c; enrichment at positive end, three functions FDR <0.01). The results are broadly reassuring that some weak artifact is not driving the tendency of the functional convergence and effect size to be correlated because that correlation occurs almost exclusively where the underlying tests themselves are detecting significance. In other words, the ordering of significances is only non-random where the underlying values are also non-random. We focus on the 14 functional properties identified in the first filtered assessment ( Table 2).
Each property can be defined by its vector of effect sizes across gene sets and so we can cluster the properties by their Euclidean distance in this space. Taking the 61 properties and highlighting the properties that are significant (FDR 0.01), they split into approximately seven clusters and a singleton (Fig. 4e). The interesting clusters are 1 and 7 as they have the highest correlations (as depicted by the dark purple scale), and a stronger significant signal from the de novo set (white/yellow in heatmap). Cluster 1, specifically, has the most consistent trends and contains the expression analyses (overexpression and fold change), the gene essentiality scores, and some of the neural gene sets. Cluster 3 has the coexpression networks clustered, and the mutational probabilities, but is slightly weaker as the control sets also show some enrichment. Cluster 5 contains most of the GO groups. Cluster 6 has some tests which are functionally enriched in the CNV and missense gene sets but are not significantly enriched for any of the genes in the de novo recurrent gene set and thus do not show a substantially positive functional convergence trend. The clustering speaks to the similarity of some of the tests (i.e., GO group clustering), but also to a likely neuronal signature across the disease gene sets.

Significant functional properties are consistent with the autism literature
One of the properties with the highest correlation was network connectivity in the BrainSpan co-expression network; however, all disease gene sets had a significant functional convergence with BrainSpan, indicating that, in addition to the real signal, there is a background signal affecting even control data. In particular, the signal from the silent recurrent mutations in the probands (functional convergence p = 7.5e-7) shows that control data subject to only one study design may select genes in a highly nonrandom pattern. Most top scoring disease properties are consistent with the literature on autism candidates such as average RVIS and haploinsufficiency scores [65], along with gene length and enrichment for FMRP interactors. RVIS scores are highly enriched in the loss-of-function recurrent set and the CNVs, but not significant in any of the other sets (Fig. 5a); as with any meta-analysis, significance in any one set is not necessary for aggregate significance. Genes with high haploinsufficiency scores-those that cannot maintain normal function with a single copy-are overrepresented in the loss-of-function recurrent genes, and there is also a significant effect in the GWAS results. Many interaction networks and traditional functional categories appear to be poor candidates to determine convergence in disease genes, as they cluster control gene sets and sets of low effects as well as those of disease genes. For instance, the extended PPI network has a high effect in the sibling control sets (e.g., silent functional convergence p~1.3e-5; Fig. 5b). GO terms and KEGG pathways typically do not survive correcting for multiple testing, although there is a general deviation from the null and the extremal GO functions are concordant with the known literature (e.g., GO: 0016568 chromatin modification, hypergeometric test p~1e-3 for the de novo recurrent set; or GO: 0048667 cell morphogenesis involved in neuron differentiation, hypergeometric test p~0.04 for the CNV set). So although functional convergence trends are concentrated in more clearly disease-related properties such Gene sets with a functional enrichment at p<0.5 All gene sets, Spearman's rho  b If we filter for the functions with some weak signal in the underlying functional tests (p < 0.5), 383 correlations are considered with ten functions as significant (dark blue, Student's paired t-test p < 2.2e- 16). Note that we are not filtering with respect to our own functional effect size test, which assesses variation in the underlying functional tests, merely that the underlying tests do return some values. c When we have no constraints (all tests, 4164 shown), three pass an FDR of 0.01. d We enumerate these in a barplot. e A heatmap of all the ranked scores of the test gene sets (columns) for the subset of 61 significant effect properties (rows). The properties clustered into six groups when we cut the dendrogram at a height of~12. Their functional convergence correlations (ranked) show that most high correlations cluster (in clusters 1 and 3). White/yellow is a high rank, red is low. The property type is color coded as described in the figure key. A high correlation is shaded purple, low/negative correlations are grey. Clusters are labeled and colored, with function FDR <0.01 outlined in black as RVIS, traditional functional categories from, e.g., GO remain of modest use.

Robustness and relative contributions of study designs and variants
In order to determine whether the functional convergence trend rose preferentially from a subset of studies, we conducted a series of robustness analyses (Additional file 1: Figure S1). Ideally, the significant functional convergence trend we see is due only to effect size estimates across studies which are themselves robust. Nor do we want the trends to be strongly affected by ordering of the gene sets with similar effect sizes. Even though the average effect sizes for the GWAS sets were the same, the number of false positives within these sets varies, and this was incorporated into the ranking scheme. It is also arguable that the silent mutations in the probands may have some regulatory effect, or are false negatives.
As a more stringent test, we removed whole classes of variants from the analyses (e.g., all the controls or all the common/weaker gene sets) and calculated the trends once again (Additional file 1: Figure S1d-f ). This is a negative control experiment in the sense that if the functional convergence trend arises meta-analytically, it should be largely robust to changing things we are not certain about (e.g., as above, whether effect sizes are 1.1 or 1.09) and not robust to changing things we are certain about (common variants play some role in autism). Removing either controls, gene sets with the highest effects, or the common variants from the trend analyses removes all the number of significant correlations, although some deviation from the null remained (Additional file 1: Figure S1). When rare variants are excluded, the distribution of correlations is most similar to the null, but still significantly different (Student's paired t-test p~0.03), while the total significance of the test is closest to the full version when common variation is excluded (Student's paired t-test p~8.2e-7). Since our common data are likely the weakest due to the tremendous focus of autism data collection toward rare variation in the SSC, this makes sense, but common variation still contributes substantial joint signal. These tests confirm that the approximate order of gene sets by effect sizes correctly drives the results and that we are robust to minor variation in the exact effect sizes listed, but do rely on the joint use of the extremely divergent study results (rare and common) within the metaanalysis to attain significant results.  To control for the impact of gene length and multifunctionality (the number of functions a gene is listed as possessing), we repeated a control version of our analyses. In this case, the real disease gene sets were swapped out with gene sets matched with respect to multifunctionality or length. We then reran the evaluation of functional convergence trends to determine if any previously identified properties arise as correlated with these control sets (ordered by their match to a specific disease set, e.g., gene length distribution). Repeating the analysis in this control case, we find the derived correlations are for the most part extremely similar to the null (reference). We can additionally use these control versions as a slightly more stringent null distribution for expected correlations when we evaluate the real disease sets. In the analysis where we do not condition on the underlying tests having reached some level of significance (as in Fig. 4a), we see even more correlations passing significance (Additional file 1: Figure S2), indicating that the multifunctionality or gene length do little to explain the general trends we see.
Promiscuous or absent enrichment have both historically been problematic within disease gene data; both diminish the specificity of functional results. When too many functions are returned from an analysis, we need to cherry pick and with too few, we have no "leads" and are left in the dark. We suggest that the strong aggregate effect we see and small number of significant functions is likely near to a useful and biologically plausible type of specificity for downstream analysis, as suggested by the fact that ad hoc filtering (i.e., top ten lists) usually are at about this level when not constrained by significance. Our set of functional tests and results are shown in Additional file 2: Table S1 and the full data set is available online.
One potential failure mode of this analysis comes from the GWAs we have used. Because the number of autism GWAS available and well-powered for analysis was relatively small, we used a combined psychiatric genomics dataset, which included bipolar disorder and schizophrenia. We now wish to test how specific our results were to our disease and not a signal of GWAS in general. In order to assess this, we reran our analysis replacing the psychiatric GWAS candidate genes with a candidate list derived from a different GWA study, iterating through each of 148 GWAS traits (see "Methods"). If the convergence we identify is specific to autism, we should expect to identify autism, or related disorders, as producing the most significant correlations when integrated with the other (non-GWAS) autism data in analysis. Using the number of correlations calculated as significant to rank the 148 traits, the top ten traits include the autism and schizophrenia GWAS, and a few larger studies such as "body mass index" (Additional file 1: Figure S3). This is a fairly striking confirmation of our original hypothesis: Rare and common variation not only show functional overlaps when conditioned on effect size, but ones that are specific to autism, as well as more general overlaps likely reflective of all disease. The more general overlaps can be seen by the relatively high ranking of the larger GWAS, which are not related to psychiatric disorders but do score well based on very broad disease properties, such as the gene mutability scores.
Expanding the functional gene tests shows no further significant properties We wished to see if we could find other significant associations if we expanded our repertoire of functions within each type of test. Our first set of network analyses focused on general aggregate co-expression networks and brain sample-only aggregates. In most analyses, researchers use individual datasets to build their networks and we wished to compare our results to these. Thus, we expanded our tests to a total of 540 networks. We repeated the same analysis, using an additional six PPI networks, 76 condition-specific networks (tissues, sex, and age), and a further 454 RNAseq co-expression networks (227 across 18,000 protein coding genes and 227 across 30,000 transcripts). Once again, we see functional convergence across almost all the gene sets with little ordered trend by effect size. The network convergence exists in even the control data and is therefore likely due to study selection biases alone; none pass an FDR of 0.01 (Additional file 1: Figure S4a).
Initially, we focused on expression data for the brain, but were curious about how tissue-specific these patterns were, or whether the genes were generally highly expressed. To this end, we repeated the expression analyses using tissue-specific expression datasets from GTEx data [31]. We were also curious to determine if we could see sex-specific differences, and used additional data from the GEUVADIS project [66]. Repeating the functional convergence trends on all these expression datasets shows little to no significant expression in the individual gene sets, and no significant functional correlations (Additional file 1: Figure S4b). The functional test with the greatest correlation was also from brain-specific expression datasets (r s = 0.78).
One last set of gene properties typically used by researchers in their analyses are the curated gene sets from MSigDB. We repeated our analyses on all eight collections and calculated the functional convergence trends, using the hypergeometric test as in the case of calculating enrichment in GO. The gene sets range from curated data sets from the known literature to computationally derived gene sets from cancer microarrays. Perhaps unsurprisingly, we see no enrichment in these gene sets (Additional file 1: Figure S5) as most are inflammatory or oncogenic collections, or versions of GO terms and KEGG pathways which we had already found to have no enrichment.

Discussion
Advances in sequencing technologies, classifications, and diagnoses of autism and improved genetic analyses have vastly expanded the potential candidate list for autismlinked genes and our overall understanding of the disorder. Different study designs target different genetic mechanisms and thus potentially quite distinct molecular or biochemical signals at the gene-level. The most prominent result from our analysis is that the functional signal is actually shared, so long as the different effect sizes of the genes are taken into account. Our meta-analytic approach is robust, stringent, and partially independent of most conventional analyses as it exploits trends varying strictly across studies to find plausible biological signals. We expect that continued research parsing and defining different classes of autism will be able to build substantially from this base, toward integrated models of all genomic variation to determine aggregate autism risk. One such possibility follows naturally from our own work, by using the functional properties our analysis suggests are convergent across study designs to prioritize genes; i.e., we could score genes as autism-associated based on how many convergent functions they appear in. While this is comparatively straightforward, we suspect that the major thrust of future research will be toward similar models, but occurring at the level of candidate variants within those genes. As such detailed models are constructed, our analyses highlight a number of points that will be particularly important. First, differences in the prevalence of different classes of variants will be a major property to control for. Many of our analytic choices are centered on not allowing common variant signals with particular technical properties to "swamp" rare, but important, signals. A second major contribution of our work is to highlight which properties are likely useful ones to validate future models. As this work establishes and previous analyses have also suggested [30], PPI data are particularly confounded with selection biases that can create apparent signals. The functional properties we highlight as most significant within our analysis, and which other detailed research has supported (e.g., FMRP [67,68]), are already targets of intense research interest and also likely to be valuable for validation of any disease burden models, where they should continue to show a signal.
Our specific experimental design for assessing a relationship between functional convergence and effect size may well be open to elaboration and emendation. Potential weaknesses in our design are our use of non-parametric tests and downsampling to control for set size. These are, we think, natural choices for robustness but more finely tuned alternatives are likely to exist and could easily be a target of research since our results suggests the observation of key functional convergence trends is highly robust and salient within the data. As the number of disease gene sets expands, and further refinement of risk assessment is achieved, the resolution of functional convergence trends should grow. Indeed, incorporating effect size as a metaanalytic constraint offers a diverse range of novel applications. That integration may be across study designs and classes of variation, as we have done, or may involve phenotype or other properties. So, for example, one could determine functional convergence trends that grow or shrink depending on how patients are classified, or even broken down in a sex-specific manner for interpreting protective effects. More broadly, as data and the means for obtaining it grow, techniques to statistically assess its structured dependencies will grow more useful and important. Our robustness analysis speaks to this in that while we are robust to modest losses of data, it is clear that more data will only improve the signals of the individual classes. More finely tuned effect size estimates and better separations of the gene sets and variant classifications will also help refine the distinction between biological and artifactual signals, ideally allowing us conduct yet more focused study designs in a productive feedback loop.

Conclusions
In this work we have found that the stronger the effect size of autism candidate genes, the more likely they are to exhibit a joint functional signal. The functional properties identified exhibit some specificity to autism and neuropsychiatric disease (e.g., FMRP interactors), but also some more general links to disease (e.g., RVIS). While there remains substantial heterogeneity between study designs and the genetic architectures of disease which they may uncover, we have shown that there is some commonality across study designs. The commonality across study designs is not a literal overlap in risk genes, or even functional effect, but that functions weakly identified in GWA studies are likely to be more strongly identified in rare variation studies. As evidence for autism and other disorders continues to develop and continues to be heterogeneous with respect to ascertainment biases and study designs, we suspect approaches related to the one we describe will be of increasing importance.

Additional files
Additional file 1: Figure S1. Trend line robustness analysis. Figure S2. Functional convergences null for matched length and multifunctionality controls. Figure S3. Functional convergence correlation/trend distributions for GWAS. Figure S4. Functional convergence correlation/trend distributions for all network connectivity tests and all gene expression tests. Figure S5.  Table S1. Functional convergence correlation/trend distributions. Table S2. Additional expression functional convergences and correlations/trends. Table S3. Additional gene properties functional convergences and correlations/trends.