Assessment of functional convergence across study designs in autism

Background
Disagreements over genetic signatures associated with disease have been particularly prominent in the field of psychiatric genetics, creating a sharp divide between disease burdens attributed to common and rare variation, with study designs independently targeting each. Meta-analysis within each of these study designs is routine, whether using raw data or summary statistics, but no method exists for combining the functional conclusions across study designs.
Methods
In this work, we develop a general solution which integrates the disparate genetic contributions, constrained by their observed effect sizes, to determine functional convergence in the underlying architecture of complex diseases, which we illustrate on autism spectrum disorder (ASD) data. Our approach looks not only for similarities in the functional conclusions drawn from each study type individually but also for those which are consistent with the known effect sizes across these studies. We name this the "functional effect size trend"; it can be understood as a generalization of a classic meta-analytic method, the funnel plot test.
Results
We detected highly significant trends in aggregate (p~1.92e-31) with 20 individually significant properties (FDR<0.01), many in areas researchers have targeted based on different reasoning, such as the fragile X mental retardation protein (FMRP) interactor enrichment (FDR~0.006). We are also able to detect novel technical effects, and we see that network enrichment from protein-protein interaction data is heavily confounded with study design, arising readily in control data.
Conclusions
As a novel means of meta-analysis across study designs, this method reveals a convergent functional signal in ASD from all sources of genetic variation. Meta-analytic approaches explicitly accounting for different study designs can be adapted to other diseases to discover novel functional associations and increase statistical power.


Background
Autism spectrum disorder (ASD [MIM 209850]) is a neurodevelopmental disease commonly characterized by behavioral traits such as poor social and communication skills [1]. In more severe cases, ASD is comorbid with mild to severe intellectual disability, facial and cranial dysmorphology, and gastrointestinal disorders. Perhaps because of the grouping of these multiple and sometimes distinct phenotypes into one disorder, and the complexity of behavior as a trait, understanding the genetic architecture of this cognitive disease is non-trivial [2]. The genetic component of ASD is estimated to be 50-60% [3]; however, there remain a substantial number of cases in which the underlying genetic factors of the disease are unknown. Researchers have used chromosomal analysis arrays [4][5][6][7], genotyping arrays [8][9][10][11], whole-exome sequencing (WES) [12][13][14][15][16][17], and whole-genome sequencing (WGS) [18,19] to identify risk loci and alleles. What has been demonstrated through these analyses is that all humans carry a substantial mutational load, some of which is most likely deleterious [20]. There are varying levels of genetic risk, affected in part by epigenetic and environmental factors, that all contribute to the variation in penetrance of disease gene candidates [21]. Typically, overlapping functional properties of candidate disease genes are exploited to sort out the positive and negative results. Candidate genes are prioritized based on enrichment analyses in pathways related to the phenotype (e.g., neuronal activity regulation) or some other disease feature shared by the genes (e.g., expression in the brain). If these methods return no significant results, more complex methods are used to extract common features from the disease gene set [22], such as co-regulatory module detection from co-expression networks [23] or binding from protein-protein interaction (PPI) networks [24]. Without further rigorous molecular validation, it is difficult to establish whether these signals are disease specific, or artifacts of the study design or the ascertainment method of the patients and populations.
Ideally, the combination of these study designs would explain both the contribution of rare and common variation to autism, i.e., the genetic architecture of the disease. In particular, one would like to combine the results across these to understand the functional mechanisms of the disease and the genetic risks involved. Yet doing so is challenging, mainly because the different study designs are carefully calibrated to enrich for meaningful signals and thus reflect different effects which cannot be naively compared: common variants are limited to regions of the genome with known variation (a SNP is known) but only reach significance with large numbers, while rare or ultra-rare variants are conditioned on not being in this list of common variants. Trio and quad studies are used mainly in WES and WGS study designs, while large case and control cohorts are required for signals in genome-wide association studies (GWAS). For each study design, we are asking distinct questions that relate to population prevalence, disease mechanism, burden, and risk. These different study designs create a structured dependency among the resulting candidate gene sets: not only in the relative contribution of the genes and variants they implicate, but also in their false discovery rates, technical errors, and ascertainment biases. Rather than treating discrepancies in study designs as a basis for argument within the field (e.g., see [25,26]), they should be appreciated as a necessary consequence of study design, which must be exploited to uncover a greater portion of the genetic architecture. Like blind men describing an elephant, the different results still reflect a unified whole, if only they could be combined. We seek to unify the individual functional signals meta-analytically, in a way which will identify individual biases and take advantage of the same biological features to which study designs are tailored.
What we can exploit and also agree upon are disease burdens or effect sizes for different candidate disease genes, as measured by their variant class or mutation type, and disease risk. For instance, disease candidates from GWAS have low relative risks (and therefore low effect sizes) as they are inherited common variation in the population. On the other hand, de novo mutations are a form of genetic variation which evolutionary forces have not had time to act upon [27], and are of high risk (and high effect sizes), i.e., a mutation generally implies disease. Studies also suffer from type I errors (false positives), and this too should be reflected in an aggregate disease signal of the candidate genes, as quantified by their common functional properties. A set of genes with de novo mutations will show a strong aggregate disease signal, while we might expect a weak signal from the gene candidates from GWAS [28]. Measuring their 'functional' effect size, as determined by a gene set enrichment test or network analysis, we can thus exploit our knowledge of gene candidates' effect sizes and false positive rates. For a real disease property, we expect the correlation between gene set effect size and functional effect size to be strong; for a weak or artifactual property, we expect no significant correlation. We call the presence of this tendency the 'functional effect size trend', and it can be rigorously assessed by repurposing fundamental meta-analytic techniques.
A simple and common method to detect bias in meta-analytic studies, the funnel graph [29], is based on a similar comparison of effect sizes to a measure of precision. We illustrate this in Fig 1. The underlying idea is to visualize the relationship between effect size estimates and sample size in the individual studies of a meta-analysis. This is done by plotting the odds ratio against the standard error or another related measure of variance (e.g., precision, standard error ratios). An inverted symmetric funnel would indicate that the data has no study selection biases. Small studies with small or no effect, as they are typically underpowered, should sit close to zero on both axes. Larger studies that are effective (high effect sizes) have low variance. However, a plot with asymmetry indicates likely systematic or publication biases, and this is measured by either a rank correlation test [30] or a regression analysis [31]. The elegance of the test is in its simplicity: a correlation will indicate that there is asymmetry and bias, while symmetry suggests the published results are free of such bias. There are two points of generalization from the traditional funnel test to meta-analysis across study designs [16]. First, we are not looking for a fixed effect size (no correlation) but for a known trend in effect sizes (positive correlation). Second, we consider many functional tests and then determine the subset which are significantly concordant with the effect sizes (after multiple hypothesis test correction). Our analysis is a generalized complement to the funnel test: rather than looking for correlations with bad properties when we know there should be none, we look for correlations with good properties when we know there should be one. Heterogeneity in the funnel test is similar to there being hidden classes within the data (i.e., study designs) which we can detect [32]. In both cases, though, expectations about effect size trends are being exploited; it is just that the expectations are different due to the differing types of variation alternate study designs target as causal for autism.
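In spirit, the trend test asks whether a property's functional effect sizes rise with the known genetic effect sizes of the gene sets, which can be checked with a rank correlation. A minimal sketch of that check (all numbers and function names are illustrative, not from the study):

```python
# Hypothetical sketch of the "functional effect size trend" test: gene sets are
# ordered by their known genetic effect sizes, and we ask whether the measured
# functional effect sizes (-log10 enrichment p-values) rise with that ordering.
from math import sqrt

def rank(values):
    """Average 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sqrt(sum((a - mx) ** 2 for a in rx))
    vy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (vx * vy)

# Effect-size ranks: controls (0) -> common/weak variants (1) -> rare LoF (2, 3)
effect_rank = [0, 0, 1, 1, 2, 3]
# Functional effect sizes (-log10 p) of one property across the same gene sets
functional = [0.2, 0.5, 1.1, 0.9, 2.4, 3.8]
rho = spearman(effect_rank, functional)
```

A strong positive rho indicates the property tracks the known effect sizes; a property enriched equally in controls and high-effect sets would show no such trend.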
In order to integrate the varying data and extract the convergent disease functional signal, we have run a meta-analytic study on ASD candidates across numerous genetic studies and tested a wide range of gene properties. We first construct several disease gene candidate collections, with genes of varying levels of risk, as determined by their odds ratios and relative risks. On every gene collection, we run a number of analyses, calculating the functional effect size using standard enrichment methods as well as more complex network analysis enrichments. By exploiting trends in targeted genetic variation and their known effect sizes, we demonstrate that it is possible to discriminate biologically convergent signals from technical artifacts at a very fine resolution. The disease properties with strong trend signals are largely consistent with the known literature on ASD (e.g., FMRP interactor enrichment), but we also see that a few otherwise interesting properties are unlikely to be disease specific, particularly protein-protein interaction networks and some co-expression networks, which extract artifactual signals from the study design. Our focus here is on autism due to our interest in the disorder, and also its phenotypic and genetic heterogeneity, but the method is generally applicable to determine functional replication across studies in a formal way that is absent in the field.

Study overview
An overview of our study design and method is shown in Fig 2. Variant classes from each study were collapsed individually into a set of genes, with an estimated average effect size for that candidate set (Fig 2, part 2). To combine the implicated variants and obtain a functional signal from the disparate studies, we calculate whether there is a functional effect which correlates with the estimated effect size of that variant class. More specifically, the sets of genes with high effect sizes should have strong relative functional effect sizes, as measured by a functional enrichment of some disease property across them, and those with low effect sizes should have weaker functional signals (e.g., statistical overlaps with known functions). We apply this test to numerous functional properties on candidate gene sets from a variety of study designs (Fig 2, part 3). We then use the resulting functional properties that are consistent with this trend to construct a unifying gene score for autism specific genes (Fig 2, part 4). Throughout our work, we refer to the "effect size" as the disease burden or risk of a gene candidate (or the average of such values within a gene set), and the "functional effect size" as the significance of a functional test for a disease gene set after controlling for the set size.

Disease gene candidate sets
We first collected candidate disease gene sets from available autism studies. We selected the largest study of whole-exome sequencing (WES) of families from the Simons Simplex Collection (SSC) [33]. We defined different sets of genes from over 2000 gene candidates, splitting into recurrent (>2 probands having the mutation) and non-recurrent mutations, according to mutation type (loss-of-function, missense and silent mutations). We selected copy number variant (CNV) data also from the individuals in the SSC [34], and parsed it into similar sets. We then used the CNVs as parsed by Gilman et al. [35], which prioritized genes with their NETBAG algorithm. For GWAS gene sets, we searched for the term "Autism" in the GWAS catalog from NCBI (date February 2015, Additional file 1: Table S1). A total of 9 GWAS were available, and we used their SNP-gene mapping strategies to define our candidate sets: first selecting genes with associated SNPs, then expanding that set to include the nearest gene for intergenic regions, and finally a gene set with all adjacent genes. We also filtered on significance of the SNPs (i.e., using the p=5e-8 threshold), and then without significance filters (selecting all reported SNPs).
Each GWAS was of relatively small size (up to ~7000 individuals), varied in strategy (family trio or case-control), and had few candidate genes, all of which were of similar effect sizes. As there were overlaps between the cohorts in the individual studies, the total number of individuals was estimated to be roughly 10,000, but over ~40K if summed naively. For the final ASD candidate list, we included OMIM [36] disease candidates (search on "Autism", date February 2015), all genes associated with "autism" reported in Phenocarta [37] (date January 2014), genes from the SFARI database [38] (date February 2015), and genes associated with "autism" from the HuGE navigator resource [39] (date February 2015). We also filtered the SFARI database on "high-confidence" evidence as described in Shohat et al. [40] and included a curated ASD list from Betancur et al. [41]. All data sets are made available in the supplement and online. For additional analyses, we collected whole-exome sequencing data sets across other similar diseases (Additional file 1: Table S5).

Co-expression networks
The majority of recent studies used co-expression networks from BrainSpan to illustrate network convergence among disease genes of ASD (e.g., see [42][43][44]). In a similar fashion, we generated brain specific networks from the BrainSpan RNA-seq data (578 samples) and from an aggregate of 28 brain tissue and cell specific microarray experiments (3,362 samples). For more general networks, we used our aggregate RNA-seq and microarray co-expression networks as previously described in Ballouz et al. [45]. In brief, these are aggregates of 50 networks (1,970 samples) and 43 networks (5,134 samples), respectively, across various tissues, cell types and conditions.

Functional gene sets
We considered common functional gene sets and neurological specific sets, as used in numerous studies, as gene sets to test for ASD candidate enrichment. These included the post synaptic density (HPSD) gene set [46], synapse sets [47], the synaptosome [48], a chromatin remodelling set [49], the fragile X mental retardation protein (FMRP) set [50], and gene essentiality [51]. For more standard sets, we also took the Gene Ontology [52] (GO) terms (April 2015) and KEGG pathways [53]. For each GO term, we used only annotations whose evidence codes were not inferred electronically, and propagated annotations through the ontology (parent terms inherited the genes of their child terms). To minimize redundancy from GO, we restricted our enrichment analyses to GO term groups with sizes between 20 and 1000 genes. While these GO terms and KEGG groups are used in the enrichment analyses (with the full multiple hypothesis test correction penalty), only the subset that intersected with any of the disease and control gene sets have effect size trends calculated, leaving us with 69 GO groups and 23 KEGG pathways. This is primarily a visualization issue; the FDR values used throughout remain the same.

Disease gene score sets
We used disease gene scoring methods that rank genes according to how likely mutations within them are to be damaging. These included the Residual Variation Intolerance Score (RVIS) [54], haploinsufficiency (HI) scores [55], and mutational rates [42].

Expression data
To obtain expression and differential expression information, we used three common and large sample size brain-specific transcriptomic sets. These included the Human Brain Transcriptome (GSE25159) [56], BrainSpan [57] and the Human Prefrontal Cortex transcriptome (GSE30272) [58]. We divided the samples into fetal (post-conception weeks, PCW) and post-birth stages, and performed a straightforward differential expression (DE) fold change analysis (averaging across these stages) [59].

Protein-protein interaction data
We used all the physical protein-protein interactions from BIOGRID (version 3.2.121) [60] and created a binary protein-protein interaction network, where each protein was a node and each protein-protein interaction was an edge. Because of the sparseness of the network, we extended the network by modelling indirect connections [61], taking the inverse minimum path length between two proteins as the weighted edge, with a maximum distance of six jumps, roughly as described in Gillis et al. [62].
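A minimal sketch of this indirect-connectivity extension, using breadth-first search over toy edges (not the BIOGRID data; `extend_network` is our own illustrative helper, not the authors' code):

```python
# Extend a binary PPI network: weight each reachable protein pair by the
# inverse of its shortest path length, capped at 6 jumps. Toy edges only.
from collections import deque

def extend_network(edges, max_jumps=6):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    weights = {}
    for src in adj:
        # BFS from src to get shortest path lengths, stopping at max_jumps
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if dist[node] == max_jumps:
                continue
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        for dst, d in dist.items():
            if d > 0:  # skip the self-distance
                weights[frozenset((src, dst))] = 1.0 / d
    return weights

w = extend_network([("A", "B"), ("B", "C"), ("C", "D")])
# direct neighbours keep weight 1; two-step pairs get 1/2; three-step pairs 1/3
```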

Candidate gene set effect sizes and their ranks
From the total list of candidate gene sets we collated, we chose 14 representative sets for our effect size analyses (Table 1). For each candidate disease gene set, we ranked the set according to the overall or average "effect size" of the genes within it. To calculate this effect size for the GWAS results, we took the average odds ratio from the individual studies for each SNP, which ranged between 1.1-1.2. For the de novo mutation candidates, we took the ratio of observed counts of mutations to silent mutations within the study for that class of mutations, and then the ratio of those odds between probands and siblings (as calculated in Sanders et al. [13]). For the control sets (siblings and the silent mutations), we took the effect size to be null. We then ranked the sets based on these overall effect sizes. After these calculations, we end up with three general classes: null effects (as controls), weak effects (missense and common variants) and strong effects (rarer loss-of-function and copy number variants).

Calculating functional effect sizes
Our functional tests, described below, return p-values which are dependent on the size of the gene set being considered. The statistical tests differ depending on the mode of analysis (e.g., enrichment or network), but by 'functional effect size' we simply mean significance (p-value) after correcting for the set size, typically by downsampling. For the downsampling, we took a subset of genes, recalculated the p-value and then took geometric means of the adjusted p-values. Throughout, where we write 'functional effect size' it is possible to read 'p-value after correcting for set size'.
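A minimal sketch of this size correction, assuming any enrichment test can be passed in as a callable (`enrich_p` is a hypothetical stand-in, not a function from the study):

```python
# Downsample a candidate gene set to a fixed size many times, recompute the
# enrichment p-value each time, and combine replicates by geometric mean.
import math
import random

def downsampled_p(gene_set, enrich_p, subset_size, n_reps=1000, seed=0):
    rng = random.Random(seed)
    genes = sorted(gene_set)
    log_sum = 0.0
    for _ in range(n_reps):
        subset = rng.sample(genes, subset_size)
        log_sum += math.log(enrich_p(subset))
    # geometric mean of the replicate p-values
    return math.exp(log_sum / n_reps)

# With a constant stand-in test, the geometric mean is just that constant
p = downsampled_p({"g1", "g2", "g3", "g4"}, lambda s: 0.01, 2, n_reps=10)
```

Fixing the subset size puts gene sets of very different sizes on a comparable footing, at the cost of the extra permutations.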

Network connectivity
We measure the clustering of sets of genes within networks through the use of a network modularity calculation. We compare the number of connections a gene has to all the genes in the network (global node degree) with the number to the genes of interest within the sub-network they form (local node degree). The null expectation is that genes will be connected equally well to genes within the sub-network as to those outside. Genes with large positive residuals have more weighted internal connections than external connections, implying a well inter-connected module. We test the significance of this distribution of residuals against a null set (a random similarly sized set of genes; Mann-Whitney-Wilcoxon test, wilcox.test in R) to determine our test statistic.
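A toy sketch of the residual computation at the heart of this test (the graph and helper name are ours; the study's version operates on weighted co-expression networks and feeds the residuals into a Wilcoxon test):

```python
# For each candidate gene, compare its connectivity inside the candidate
# sub-network (local degree) with what its global degree predicts by chance,
# given the sub-network's size. Positive residuals = better connected inside.
def modularity_residuals(adj, candidates, n_genes):
    """adj: gene -> set of neighbours; candidates: the disease gene set."""
    residuals = {}
    # chance that any one neighbour falls inside the candidate set
    frac = (len(candidates) - 1) / (n_genes - 1)
    for g in candidates:
        nbrs = adj.get(g, set())
        local = len(nbrs & candidates) - (g in nbrs)  # exclude any self-loop
        expected = len(nbrs) * frac
        residuals[g] = local - expected
    return residuals

# Tiny example: A, B, C form a triangle inside a 10-gene network
adj = {"A": {"B", "C", "X"}, "B": {"A", "C"}, "C": {"A", "B"}, "X": {"A"}}
res = modularity_residuals(adj, {"A", "B", "C"}, n_genes=10)
```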

Gene set enrichment testing
As a way to determine the level of enrichment of the candidate gene sets within other functional sets, we used a hypergeometric test with multiple test correction (phyper in R).The downsampled p-value was used as the functional effect size measure.
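The upper-tail hypergeometric probability computed by R's phyper can be sketched in a few lines (a didactic re-implementation, not the code used in the study):

```python
# P(X >= k): probability of drawing at least k annotated genes when sampling
# n genes from a genome of N genes, K of which carry the annotation.
from math import comb

def hypergeom_upper(k, K, n, N):
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1)) / total

p_all = hypergeom_upper(0, 5, 4, 20)   # "at least zero hits" is certain
p_both = hypergeom_upper(2, 2, 2, 4)   # both draws annotated: 1/6
```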

Disease gene property testing
For the disease gene scoring properties, we tested the significance of the scores of the candidate genes using the Mann-Whitney-Wilcoxon test (wilcox.test in R).The functional effect size was the p-value of this test.

Functional effect size trends
For each gene property tested, we then measured the "trend" by calculating the slope and intercept of the functional effect sizes of our gene sets, whereby the gene sets are ordered according to their effect size ranks. A positive slope is one where the function tested is correlated with our ordering, while an intercept close to zero indicates no effect or one with enrichment in the controls. We computed this using both ranked and unranked data, as well as correlations. We generally report results in terms of slopes to capture the degree of variation, but the significant subsets identified are generally robust to the choice of measurement metric, as the three metrics are well correlated themselves (r=0.7 with the raw slopes and r=0.93 with the ranked slopes).
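A minimal sketch of the trend statistic as an ordinary least-squares fit of functional effect sizes against effect-size ranks (illustrative values; not necessarily the authors' exact regression):

```python
# Least-squares slope and intercept of functional effect sizes (-log10 p)
# against the effect-size ranks of the gene sets (controls first).
def trend(ranks, functional_effects):
    n = len(ranks)
    mx = sum(ranks) / n
    my = sum(functional_effects) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(ranks, functional_effects))
    sxx = sum((x - mx) ** 2 for x in ranks)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# A disease-like property: weak in controls, strong in high-effect sets
slope, intercept = trend([1, 2, 3, 4, 5, 6], [0.2, 0.5, 0.9, 1.1, 2.4, 3.8])
```

A property enriched equally across controls and disease sets would instead yield a slope near zero with a high intercept.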

Aggregate gene score
We then selected the top ten disease properties that had high and predictive positive slopes (Additional file 2: Table S2). We then took all known protein coding genes from GENCODE [63] (version 19, approximately 19,000 genes) that had at least one annotation across all our properties. For each disease gene property, we gave each gene a ranking within that property. For gene scores (i.e., mutational rates or the RVIS), this was simply the rank of their score. For the gene sets (i.e., GO terms, synaptic sets), we ranked the genes and gave a tied score to the NAs. For the networks, we used the node degree of the genes as their rank, but as their "real" score is dependent on a gene set, this was only a proxy measure.
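A simplified sketch of combining per-property ranks into one score (toy data; unannotated genes are tied at the bottom here rather than at an explicit NA rank, and ties within scores are not handled):

```python
# Rank genes within each retained property (higher rank = more disease-like),
# give missing genes a tied bottom rank of 0, and average across properties.
def aggregate_score(property_values):
    """property_values: list of dicts mapping gene -> score (genes may be missing)."""
    genes = sorted({g for prop in property_values for g in prop})
    totals = {g: 0.0 for g in genes}
    for prop in property_values:
        ordered = sorted(prop, key=lambda g: prop[g])  # ascending score
        ranks = {g: i + 1 for i, g in enumerate(ordered)}
        for g in genes:
            totals[g] += ranks.get(g, 0)  # missing gene -> tied rank 0
    n = len(property_values)
    return {g: totals[g] / n for g in genes}

scores = aggregate_score([{"A": 0.9, "B": 0.1},
                          {"A": 3.0, "B": 5.0, "C": 1.0}])
```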

Little overlap of autism candidates across gene sets of different effect sizes
Genes with loss-of-function de novo mutations are not unusually likely to be found in the GWA studies, with 4 candidate genes overlapping those two sets (Fig 3, hypergeometric test p=0.62). Interestingly, the more recurrent the genes in the loss-of-function de novo set, the less likely they are to be found in other gene sets (Additional file 1: Fig S1). For gene sets with lower average effect sizes (e.g., the genes with missense mutations), the overlap with other gene sets is greater, in particular with the control sets (Fig 3, hypergeometric tests p~4.4e-3 to 2.4e-6). The de novo variants are conditioned on being rare (low frequency) and novel by not appearing in the parents. The SNPs used in GWAS are generally conditioned on being common by having minor allele frequencies greater than 0.05 [64]. Even if this filtering is done on the variant level, and not on the gene level, it still creates selection trends within our observations of variants and thus genes. This is possibly a version of Berkson's effect [65], where selecting for an outcome generates negative correlations between potential causes for it. An additional cause is largely technical; since we have conditioned on frequency, genes with higher mutability are depleted in our rare lists, and enriched in our common lists. Thus the lack of overlap is potentially not largely reflective of underlying genetics or biology, but likely due to the selection bias in obtaining the gene sets. There is also poor overlap within the rarer variation itself, for instance between genes within CNVs and those with loss-of-function SNVs (3 genes, p=0.37); there is generally a discrepancy between study designs focused on (different) sources of rare variation, and not just rare versus common. It should be noted that whether biological or technical, the lack of overlap does nothing to discredit either common or rare variation as a contributor to the disease, but highlights the need for a framework to combine and analyze the results of these studies that is aware of these biases and can distinguish biology from technical effects.

Modularity of de novo autism candidates exhibits a functional effect size trend
While enrichment analysis is comparatively straightforward, we demonstrate an example in Fig 4A using the genes with de novo loss-of-function mutations from Iossifov et al. [33] (341 genes) and their overlap with essential genes (see Methods). In Fig 4A, we represent this enrichment test as a Venn diagram of the overlap of the candidate disease gene set with the essential gene set, and calculate the significance of the overlap with a hypergeometric test (n=82, p~7.11e-11). We continue this analysis on the other candidate disease gene sets from recent ASD studies, varying across study designs and technologies (WES, GWAS and arrays). Splitting each gene set by mutational class, recurrence and gender, we perform the same hypergeometric tests. To make comparable assessments between studies and gene sets, we calculate the functional effect size by downsampling: selecting a subset of genes within that set and averaging the results over 1,000 permutations (schematic in Fig 4B). Taking a representative set of studies (Table 1), we use the degree of disease effect to rank these sets, noting that recurrence leads to a higher effect size even for variation and study designs of the same class by reducing the number of false positives. Placing the control sets on the far left and the highest disease rank set (recurrent de novo loss-of-function genes) on the far right, and plotting their functional effect values, we observe an upward trend (Fig 4C, black line, slope=0.9, Spearman's ρ=0.9, Fisher's transformation p<1e-16). The slope of this trend line represents the "effect size trend", with higher slopes indicating higher functional effects.
A less common (likely due to complexity) yet important functional test is network connectivity. Genes that are co-regulated or form parts of a functional unit, protein complex or pathway are preferentially co-expressed, and this information is captured in co-expression networks. We next demonstrate how network-style effect sizes can be similarly calculated through a modularity analysis. In Fig 4D, we plot the global node degrees (x-axis) against their connectivity to the remainder of genes in the set (y-axis).
In the null (grey line), the genes would be connected to other autism genes in proportion to the incidence of those genes within the genome. Deviations from this null across all genes generate excess modularity within this set (studentized residuals shown in inset of Fig 4D) and determine the statistical results reported for the set overall (Wilcoxon test). A large number of genes are highly interconnected in this set, as shown by the number of points above the line (Wilcoxon test on the studentized residuals, p~1.13e-42). It is important to note that this network analysis is calculated against the empirical null for each gene individually (x-axis) and so is unaffected by any gene-specific bias (such as length). Only higher-order topological properties across gene-gene relationships for a given gene can produce a signal. Even assortativity, the tendency for genes of high node degree to preferentially interact, is quite low within this data (r=0.064, see Additional file 1: Table S3).

Autism gene sets exhibit a highly significant effect size trend across all study designs
Extending our analysis to other disease gene property tests (Additional file 2: Table S2), we calculated their effect size slopes (Fig 5A). We observe that most slopes are positive but close to zero (i.e., weak or low), with a few highly sloped lines (tail in distribution, Student's t-test, p<1e-9). The highest sloped property was network connectivity in the BrainSpan co-expression network; however, all disease gene sets had a significant functional effect size with BrainSpan, indicating that in addition to the real signal, there is a background signal affecting even control data. In particular, the signal from the silent recurrent mutations in the probands (functional effect size p=7.5e-7) shows that control data subject to only one study design may select genes in a highly non-random pattern. Most top scoring disease properties are consistent with the literature on autism candidates, such as average RVIS and haploinsufficiency scores, along with CDS length and enrichment for FMRP interactors. RVIS scores are highly enriched in the loss-of-function recurrent set and the CNVs, but not significant in any of the other sets (Fig 5B); as with any meta-analysis, significance in any one set is not necessary for aggregate significance. Genes with high haploinsufficiency scores (those that cannot maintain normal function with a single copy) are overrepresented in the loss-of-function recurrent genes, and there is also a significant effect in the GWAS results. Many interaction networks and traditional functional categories appear to be poor candidates to determine convergence in disease genes, as they cluster control gene sets and sets of low effects as well as those of disease genes. For instance, the extended PPI network has a high effect in the sibling control sets (e.g., silent functional effect size p~1e-10, Fig 5C). GO terms and KEGG pathways typically do not survive correcting for multiple testing, although there is a general deviation from the null (as described in the following section) and the extremal GO functions are concordant with the known literature (e.g., GO:0051276 chromosome organization, hypergeometric test p~3.96e-4 for the de novo recurrent set). So although effect size trends are concentrated in more clearly disease-related properties such as RVIS, traditional functional categories from, e.g., GO remain of modest use.

Functional properties cluster strongly in relation to autism
We calculated the null distributions for the variation across effect sizes in two ways: by randomly sampling genes to derive new meaningless gene sets, or by permuting the estimated effect size for each real set and rerunning our analysis. The two nulls almost perfectly overlap (Fig 6A), indicating that significance must arise specifically from concordance across studies that is directional. Indeed, the slopes are much more likely to be positive than in either randomly sampled sets or sets where the effect sizes are permuted. Each property can be defined by its vector of effect sizes across gene sets (e.g., Fig 5), and so we can cluster the properties by their Euclidean distance in this space. The properties split into approximately 6 major clusters (Fig 6B). Clusters 1 and 2 have random signals across the sets and consist of mostly the GO, KEGG and neural gene enrichment tested properties. These are functionally enriched in some disease gene sets (e.g., the CNV and missense gene sets) but are not significantly enriched for any of the genes in the de novo recurrent gene set, and thus do not show a substantially positive functional effect size trend. Cluster 3 similarly contains GO groups which have some enrichment signals from the de novo and CNV sets. These GO groups are protein demethylation and related activities (e.g., GO:0008214, GO:0006482). Cluster 4 contains an array of properties with no consistent trend across the data sets. The more interesting clusters are 5 and 6, which have the highest slopes (as depicted by the dark purple scale), and where there is a stronger significant signal from the de novo set (white/yellow in heatmap). Cluster 5, specifically, has the most consistent trends and contains the expression analyses (overexpression and fold change) and the gene essentiality scores. Cluster 6 contains the co-expression networks and the mutational probabilities.
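The clustering step can be sketched as grouping property vectors by Euclidean distance; here a simple single-linkage merge under a distance cutoff stands in for the (unspecified) hierarchical method, with toy two-dimensional vectors:

```python
# Group properties (each a vector of effect sizes across gene sets) by
# Euclidean distance: single-linkage merging under a fixed cutoff.
from math import dist  # Euclidean distance between two points (Python 3.8+)

def cluster(vectors, cutoff):
    """vectors: name -> effect-size vector. Returns a list of clusters (sets)."""
    clusters = [{name} for name in vectors]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: merge if any cross-pair is within the cutoff
                if any(dist(vectors[a], vectors[b]) <= cutoff
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

groups = cluster({"p1": (0, 0), "p2": (0, 1), "p3": (5, 5)}, cutoff=1.5)
```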

Robustness and relative contributions
In order to determine whether the functional effect size trend arose preferentially from a subset of studies, we conducted a series of robustness analyses (Additional file 1: Fig S2). Ideally, the significant functional effect size trend we see is due only to effect size estimates across studies which are themselves robust. Nor do we want the trends to be strongly affected by the ordering of gene sets with similar effect sizes. Even though the average effect sizes for the GWAS sets were the same, the number of false positives within these sets varies, and this was incorporated into the ranking scheme. It is also arguable that the silent mutations in the probands have some regulatory effect, or are false negatives. Our previous analyses ranked our gene sets with these assumptions in mind and incorporated an ordering for the control sets and the GWAS sets. With this ranking of the gene sets, 20 of the functional effect size trends are significant at an FDR of 0.01 (Additional file 1: Fig S2B,C). If we instead ignore the ordering provided by false negative rates and regulatory impact and repeat our analysis (treating the same effect size sets as having tied ranks), there is a modest loss of significance in the distribution of slopes (7 functional tests at an FDR<0.01, Additional file 1: Fig S2A,C), while the overall deviation from the null remains very significant (Wilcoxon test p~4.76e-33). Generally, using ranking as we have is slightly conservative in impact, as expected, and using the raw effect sizes raises the number of significant slopes (FDR<0.01) to 23. As a more stringent test, we removed whole classes of variants from the analyses (e.g., all the controls or all the common/weaker gene sets) and calculated the trends once again (Additional file 1: Fig S2D-F). This is a negative control experiment in the sense that if the functional effect size trend arises meta-analytically, it should be largely robust to changing things we are not certain about (e.g., as above, whether effect sizes are 1.1 or 1.09) and not robust to changing things we are certain about (common variants play some role in autism). Removing either the controls, the gene sets with the highest effects, or the common variants from the trend analyses eliminates almost all of the significant slopes (FDR 0.01, Additional file 1: Fig S2C), although some deviation from the null remains. When rare variants are excluded, the distribution of slopes is most similar to the null (Wilcoxon p~5.95e-6), while the total significance of the test is closest to the full version when common variation is excluded (Wilcoxon p~1.54e-26 compared to p~1.91e-31). Since our common data is likely the weakest, given the tremendous focus of autism data collection toward rare variation in the SSC, this makes sense, but common variation still contributes substantial joint signal. These tests confirm that the approximate order of gene sets by effect size correctly drives the results and that we are robust to minor variation in the exact effect sizes listed, but do rely on the joint use of the extremely divergent study results (rare and common) within the meta-analysis to attain significant results.
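One of the permutation nulls described above (shuffling which functional effect size goes with which gene-set rank, then re-measuring the slope) can be sketched as follows. The ranks and functional effect sizes are invented for illustration, not the study's values.

```python
import random

def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def trend_pvalue(effect_rank, fes, n_perm=2000, seed=0):
    """One-sided permutation p-value for a positive functional effect
    size trend: how often does shuffling the pairing of effect sizes
    and ranks yield a slope at least as steep as the observed one?"""
    rng = random.Random(seed)
    observed = slope(effect_rank, fes)
    hits = 0
    for _ in range(n_perm):
        shuffled = fes[:]
        rng.shuffle(shuffled)
        if slope(effect_rank, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical: eight gene sets ranked from controls (1) to de novo
# recurrent (8), with a rising functional effect size.
ranks = [1, 2, 3, 4, 5, 6, 7, 8]
fes = [0.1, 0.0, 0.2, 0.3, 0.5, 0.4, 0.7, 0.9]
obs, p = trend_pvalue(ranks, fes)
```

Permuting the pairing rather than resampling genes is what makes this null sensitive specifically to the ordering of gene sets, which is the quantity the robustness analyses above interrogate.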
Promiscuous and absent enrichment have both historically been problematic within disease gene data; both diminish the specificity of functional results. When too many functions are returned from an analysis, we need to cherry-pick, and with too few, we have no "leads" and are left in the dark. We suggest that the strong aggregate effect we see, combined with the small number of significant functions, is likely near the most useful and biologically plausible degree of specificity for downstream analysis, as suggested by the fact that ad hoc filtering (i.e., top-ten lists) usually operates at about this level when not constrained by significance. Our set of functional tests and results is shown in Additional file 2: Table S2 and the full data set is available online.
The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license. bioRxiv preprint doi: http://dx.doi.org/10.1101/043422, first posted online Mar. 12, 2016.

Re-prioritizing ASD candidates emphasizes comorbidity among neurodevelopmental disorders
Our final analyses asked whether the most significant disease properties score ASD candidates in a notably distinct way. We should expect an important reordering particularly in the case of non-recurrent de novo derived candidate genes, where the aggregate effect size is a combination of very high effect true positives and unknown effect (likely low) false positives. Using the previous analyses, we construct a meta-gene score by first selecting a set of top properties from clusters 5 and 6 (Fig 6A) that held the upward trend (both high slope and high correlation) and removing redundant properties (Additional file 2: Table S2). In order to keep the analysis straightforward and avoid overfitting, we did not weight among these functions and considered only binary functional characterizations (excluding a few properties). We then ranked the genes by their aggregate rank across the number of properties they possessed, for the set of functions which we had identified as having very high functional effect size trends.
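A minimal sketch of an unweighted meta-gene score of this kind: each gene is scored by how many of the selected binary properties it possesses, and genes are ranked by that count. The property labels here are hypothetical placeholders; only the gene symbols HUWE1 and NSD1 come from the text.

```python
def meta_gene_score(gene_properties, top_properties):
    """Rank genes by how many of the selected top (binary) properties
    each possesses; higher counts rank first."""
    scores = {gene: sum(p in props for p in top_properties)
              for gene, props in gene_properties.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical binary property labels.
top = {"essential", "high_LoF_intolerance", "neuronal_coexpr"}
genes = {
    "HUWE1":  {"essential", "high_LoF_intolerance", "neuronal_coexpr"},
    "NSD1":   {"essential", "neuronal_coexpr"},
    "GENE_X": {"neuronal_coexpr"},
    "GENE_Y": set(),
}
ranking = meta_gene_score(genes, top)
```

Keeping the score an unweighted count of binary properties, as the text describes, avoids fitting weights to the same data used to select the properties.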
To assess this score, we took 1,002 ASD candidate genes (Additional file 2: Table S4) from publicly available databases (OMIM [36], Phenocarta [37], HuGE [39], SFARI [38] and the Betancur et al. [41] list) and ranked them according to this meta-gene score. We see a low but significant correlation (Pearson's r=0.19, Fisher's transformation p=2.787e-09) between the aggregate score and the incidence of these genes in the databases. Most of the top-ranked candidates are comorbid with other neurological disorders, indicating that their disease candidacy is strongly associated with a cognitive disorder. The top candidate gene here is the ubiquitin protein ligase HUWE1 [MIM 300697], which regulates neural differentiation and proliferation, and is a known disease gene for mental retardation syndromic X-linked Turner type (MRXST [MIM 300706]) [66], along with being implicated in schizophrenia [67] and autism [68]. NSD1 (Nuclear Receptor Binding SET Domain Protein 1 [MIM 606681]) is implicated in Sotos syndrome [MIM 117550] and Weaver syndrome [MIM 277590], both comorbid with ASD [69], and the histone acetyltransferase p300, EP300 [MIM 602700] [70], is involved in Rubinstein-Taybi syndrome.

[76], or whether the variation is structural (i.e., CNVs, indels) or protein truncating point mutations, or both [77]. However, in most cases, common functional signals are assessed using virtually identical methods (i.e., enrichment of gene groups or network connections). Combining the candidate gene signals and looking for convergence at the functional level would be a desirable assessment, yet simply doing so by placing all candidate genes in the same 'bin', as the simplest analysis might suggest, would be clearly problematic. First, controls vary and artifactual trends may differ, so that case-control comparisons will not hold across study designs. Second, the more distributed the genic effect the study design targets, the higher the sample count necessary to obtain significance, and the more
genes (weakly) implicated there will be, meaning rare variation more tightly linked to autism (but with less explanatory power in aggregate) will be dominated by diffuse trends which link any single gene only weakly (the gene is more likely involved in numerous other processes). To overcome these problems, we have proposed conducting meta-analysis at the 'output' stage for any given study, after functional convergence is determined, and then incorporating effect sizes across study designs as an additional constraint.
Specifically, we have shown that the estimated effect size of a disease gene candidate set, as determined primarily by its mutational class, is a good guide to determining which disease properties are biological and which may be driven by artifact. We show that most ASD disease genes display a neurodevelopmental signature, with top candidate genes involved in neuronal migration and synapse formation. Epigenetic and chromatin remodelers rank lower, and may not be a specific feature of the disease, or may play lesser roles. One particular point of interest is whether this scoring system is specific to ASD, or has largely picked out a signature of neurodevelopmental disease. While data availability is generally higher in autism, we checked for enrichment and functional effect size trends in other available GWAS and WES studies (see supplement for more details); the functional effect size trends remain consistent, although the ASD studies' functional effect size trends are generally greater than those of other cognitive disorders (Additional file 1: Fig S3). Our meta-analysis solely exploits variation across studies and is therefore partially independent of conventional analyses. Thus, when we consider PPI network clustering an artifactual property, it is not because we don't see significant clustering of gene sets in PPI data; in fact, we do. Rather, there is no functional effect trend consistent with effect size, a fact made clearest by the control sets also clustering. Another interesting example arises in one of the few properties we identified as functionally relevant that might seem artifactual: gene size. In fact, this has been hotly debated recently [40,78], and may reflect real biology on top of technical effects [79]. In our analysis, CDS length exhibits a likely bias (e.g., CNVs showing enrichment) but its significant variation in effect size is wholly independent of this (in fact, CNVs are the strongest negative outlier in the trend) and the majority of the significance arises from sets less plausibly length-biased (particularly recurrent de novo mutations; while WES is obviously length-biased, recurrence is stringent enough to likely
render this irrelevant). As a final point of interest, we note that for most of the co-expression network analyses, the control data shows significant enrichment, and therefore co-expression must be interpreted with more design-aware analyses, as in this work. More broadly, our analysis can be seen as providing validation between study designs for their functional results, and doing so in ways that would be hard for them to accomplish independently.
Individual studies have tended to rely on ever more stringent criteria (e.g., replication, family-wise error rates) to minimize bias, but this is obviously limited where the bias reflects the study design which is, itself, replicated. Meta-analysis is the classic approach to integrating potentially weak evidence into a stronger conclusion, but this is only possible once effect size expectations are factored into the analysis.
While improvements in both GWAS and rare variant data will undoubtedly improve their individual signals, cross-study analysis is a particularly stringent means of assessing functional outputs for consistency.
groups: we tested whether there were functional signals common to both analyses which follow the known effect sizes. Such signals existed and made sense. Thus, the two classes of data support one another rather than being in conflict. Of course, it is possible that our approach misses real signals which are common or rare variant specific, although our robustness analysis suggests this is not a major factor. We note that our assessment of robustness is, in some sense, more stringent even than replication cohorts or approaches such as cross-validation [80][81][82], since we have both positive and negative controls when holding out data. Because our test is so simple, our observation of statistically significant deviations from the null is immediately interpretable; this is critical given the extreme scale of the analysis which, in essence, permutes every functional test through every previously collected data set. This methodological simplicity implies that the approach will generalize well.
Indeed, incorporating effect size as a meta-analytic constraint offers a diverse range of novel applications. As more data from different studies are collected for autism and other complex diseases, we see this approach as a basis for determining properties and functional signals from disease candidates in an integrative way. That integration may be across study designs and classes of variation, as we have done, or may involve phenotype or other properties. So, for example, one could determine functional effect size trends that grow or shrink depending on how patients were classified, or even broken down in a sex-specific manner for interpreting protective effects. More broadly, as data and the means for obtaining it grow, techniques to statistically assess its structured dependencies will grow more useful and important. Our robustness analysis speaks to this: while we are robust to modest losses of data, it is clear that more data will only improve the signals of the individual classes. More finely-tuned effect size estimates and better separations of the gene sets and variant classifications will also help refine the distinction between biological and artifactual signals, ideally allowing us to conduct yet more focused study designs in a productive feedback loop.

Fig 1 Hidden classes in meta-analysis assessments using funnel plots.
(A) Funnel plot tests are used to assess homogeneity of the studies in a meta-analysis. Typically, the effect of each study (x-axis) is plotted against a form of error or variance (e.g., inverse standard error, y-axis). For a homogeneous set of studies, the studies should fit within this "funnel" (top panel). However, where there is obvious heterogeneity (skew in bottom panel, dashed ellipse), there are likely hidden biases. (B) Hidden classes within the data, e.g., study types, could be the cause of such skew. In the top panel, the heterogeneous study forces the skew. In the bottom panel, this is not the case, and there could be some other hidden class. (C) Given the lack of heterogeneity, we can exploit this "failure" (i.e., class association) through an effect size test for functional associations (top panel), or lack thereof (middle panel). For homogeneous data sets, we can observe any correlations with potential confounds, shown here for sample size (bottom panel).
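The classic funnel plot test that the functional effect size trend generalizes can be illustrated with an Egger-style regression: standardized effects are regressed on precision, and a non-zero intercept flags funnel asymmetry. The numbers below are synthetic, chosen so the symmetric and biased cases are exact.

```python
def egger_intercept(effects, std_errors):
    """Egger-style asymmetry check: regress effect/SE (z-score) on
    1/SE (precision); the intercept is ~0 for a symmetric funnel."""
    zs = [e / s for e, s in zip(effects, std_errors)]
    prec = [1.0 / s for s in std_errors]
    n = len(zs)
    mx = sum(prec) / n
    my = sum(zs) / n
    b = sum((x - mx) * (y - my) for x, y in zip(prec, zs)) / \
        sum((x - mx) ** 2 for x in prec)
    a = my - b * mx  # intercept
    return a, b

ses = [0.1, 0.2, 0.25, 0.5]
symmetric = [0.5 for _ in ses]            # same true effect everywhere
biased = [0.5 + 0.3 * s for s in ses]     # small studies inflated
a_sym, b_sym = egger_intercept(symmetric, ses)
a_bias, b_bias = egger_intercept(biased, ses)
```

In the symmetric case the intercept is zero and the slope recovers the common effect; inflating small (high-SE) studies shifts the intercept away from zero, the signature of asymmetry the caption describes.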
There is little overlap in the disease gene sets, as shown in the heatmap and as enumerated in the Venn diagram beside it. There are some significant overlaps. Control gene sets overlap significantly with the missense genes (333 genes, hypergeometric test p=2.54e-6), the common SNP gene sets (11 genes, p=4.5e-3), and the loss-of-function (LoF) SNV gene sets (71 genes, p=3.2e-3) but not the CNV gene sets (14 genes, p=0.03). Missense and common SNPs overlap significantly (4 genes, p=2.4e-4). However, loss-of-function SNVs do not overlap significantly with either the common (4 genes, p=0.62), missense (68 genes, p=0.75), or CNV (3 genes, p=0.37) sets.

(C) We can now demonstrate how to calculate the functional effect size trend for the "essential genes" test. The disease gene sets are ordered by an estimate of the average effect size of genes within the set (from low to high on the x-axis) and the functional effect size of each disease gene set is plotted (y-axis). A trend between the effect size of the candidate genes and their essentiality can be clearly observed (FDR<0.01). The network connectivity functional test (D) consists of calculating the ratio of the disease genes' total connectivity (node degrees calculated from the whole network; the sum of their connections) to their internal connectivity (node degrees of their subnetwork; the sum of their connections to one another). The line (in grey) reflects the expected values if there is no preferential connectivity within the set. We see that a large number (72%) of the genes lie above the identity line. The Wilcoxon P-value of the mean residuals is shown in the inset (p~1.13e-42). (E) Once again, controlling for sample size through downsampling, the functional effect size of each gene set is calculated (subset shown). (F) A weak trend
(FDR>0.01) between the effect size of the candidate genes and their degree of co-expression is visible. Empirical nulls are calculated by permuting disease gene sets, and FDRs by a Benjamini-Hochberg correction over the resulting 158 functional effect size trends.

The properties clustered into 6 groups when we cut the dendrogram at a height of ~1.6. Their effect size slopes (unranked) show that the most highly sloped effect size lines cluster together (in clusters 5 and 6). White/yellow is a high rank, red is low. A high slope is shaded purple, low/negative slopes are grey. The property type is color coded as described in the figure key.
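The Benjamini-Hochberg correction used for the 158 trends can be sketched as below; this is a standard implementation, not the authors' code.

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order.
    Walks the sorted p-values from largest to smallest, enforcing
    monotonicity of the adjusted values."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    prev = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvals[i] * n / rank)
        adjusted[i] = q
        prev = q
    return adjusted

qvals = benjamini_hochberg([0.01, 0.02, 0.03, 0.5])
```

The three small p-values share an adjusted value of 0.04 (each survives an FDR threshold of 0.05 jointly), while the large one is left essentially uncorrected downward.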
WES results. Effect sizes vary depending on mutation class and recurrence. The odds ratio is calculated as in Sanders et al. [13] (the ratio of observed counts of mutations to silent mutations, and then the ratio of those odds between probands and siblings).
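That odds ratio can be sketched with hypothetical counts (the numbers below are placeholders, not values from the study):

```python
def burden_odds_ratio(proband_counts, sibling_counts, mutation_class):
    """Odds ratio in the style described for the WES burden analysis:
    within each cohort, take the ratio of a mutation class's count to
    the silent-mutation count, then compare probands to siblings."""
    p_odds = proband_counts[mutation_class] / proband_counts["silent"]
    s_odds = sibling_counts[mutation_class] / sibling_counts["silent"]
    return p_odds / s_odds

# Hypothetical counts: probands carry twice the LoF burden, per silent mutation.
or_lof = burden_odds_ratio(
    {"lof": 30, "silent": 100},
    {"lof": 15, "silent": 100},
    "lof",
)
```

Normalizing by silent mutations within each cohort controls for differences in overall mutation detection between probands and siblings before the two are compared.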

Tables and captions
Briefly, we start by characterizing the ASD gene sets collected for this analysis (Fig 2, part 1), covering each study's results (rare and common variants). As in the previous steps, we repeat the network connectivity tests across all gene sets (Fig 4E), also downsampling to calculate the functional effect size. Once again, gene sets with a higher proportion of burden genes correlate with the functional effect size tests (Fig 4E, black line, slope=0.59, Spearman's ρ=0.59, Fisher's transformation p<0.03).
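A simplified variant of the network connectivity test can be sketched as follows: each set gene's internal degree is compared with the internal degree expected if its neighbours were drawn uniformly from the network (the grey identity line in Fig 4D). The toy network and gene labels are invented.

```python
def connectivity_residuals(network, gene_set):
    """residual = internal degree - expected internal degree under no
    preferential connectivity (neighbours chosen uniformly at random)."""
    expected_frac = (len(gene_set) - 1) / (len(network) - 1)
    residuals = {}
    for gene in gene_set:
        neighbours = network[gene]
        internal = len(neighbours & gene_set)
        residuals[gene] = internal - expected_frac * len(neighbours)
    return residuals

# Toy undirected network as an adjacency dict (hypothetical genes);
# A, B, C form a densely connected trio.
network = {
    "A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B", "E"},
    "D": {"A"}, "E": {"C"}, "F": set(),
}
res = connectivity_residuals(network, {"A", "B", "C"})
```

All three residuals are positive, the analogue of genes lying above the identity line; the paper then summarizes such residuals with a Wilcoxon test.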
The long tail in raw slopes (Fig 5A) is visible as a much broader shift toward significance in the ranked values (Fig 6B), indicating that the significant extremal values arise largely through effects distributed across multiple disease sets. See the "Robustness and relative contributions" section for more detail on the statistical assessment.


Figures

Fig 1 Hidden classes in meta-analysis assessments using funnel plots.

Fig 2
Fig 2 Schematic of functional effect size trend calculation.

(1) Starting with disease gene set collections, (2) each is ranked by the average effect size of the genes within that set. (3) We then run functional tests on these gene sets and calculate a functional effect size for each. Then, using the ranking of the disease gene sets, we measure the functional effect size trend: the slope of the trend line of the functional effect sizes versus the rank. With these positive slopes (strong functional signatures), we determine which functions are likely disease-specific. (4) This allows us to generate a score for each gene, which we can use to identify and validate candidate genes.
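Steps (2) and (3) can be sketched end to end. The gene sets and the "essential genes" property are invented for illustration, and the functional effect size is reduced to a simple overlap fraction (the paper uses enrichment statistics).

```python
def functional_effect_size(gene_set, property_genes):
    """Minimal stand-in for a functional effect size: the fraction of
    the gene set carrying the (binary) property."""
    return sum(g in property_genes for g in gene_set) / len(gene_set)

def effect_size_trend(gene_sets_by_rank, property_genes):
    """Slope of functional effect size versus effect-size rank (step 3)."""
    xs = list(range(1, len(gene_sets_by_rank) + 1))
    ys = [functional_effect_size(s, property_genes)
          for s in gene_sets_by_rank]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

# Steps (1)-(2): hypothetical gene sets ordered from low to high
# average effect size (controls ... de novo recurrent).
sets_low_to_high = [
    {"g1", "g2", "g3", "g4"},     # controls
    {"g2", "g5", "g6", "g7"},     # common variants
    {"g5", "g8", "g9", "g10"},    # missense
    {"g8", "g9", "g11", "g12"},   # de novo recurrent
]
essential = {"g8", "g9", "g10", "g11", "g12"}  # hypothetical property
trend = effect_size_trend(sets_low_to_high, essential)
```

A positive slope here corresponds to a property whose functional signal rises with the effect size of the gene sets carrying it, the signature the schematic's step (3) selects for.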

Fig 3
Fig 3 Overlap of genes across the different gene sets.

Fig 4
Fig 4 Functional properties of disease gene sets are tested using gene set enrichment (top panels) and co-expression network connectivity (bottom panels).
(A) Gene set enrichment is calculated with a hypergeometric test. A large number (34%) of the genes in the de novo loss-of-function set overlap with essential genes (hypergeometric test p~7.11e-11). (B) This is repeated across all disease gene sets (a subset shown here). Sample size is controlled through downsampling. Gene sets with the higher effect sizes also have the higher functional effect sizes.

Fig 5
Fig 5 Functional convergence of the disease gene sets. (A) Distribution of the effect size slopes of functional properties commonly tested by disease candidate methods. (B) The RVIS enrichment test has a very high slope across the same gene sets (black lines). (C) The extended PPIN has mostly artifactual signal, as demonstrated by the flat slope: highly elevated from the null but in no effect-correlated way, suggestive of consistent biases. The grey lines are the ranked functional effect size trend slopes (showing the trends are robust to non-parametric assessment; see Fig 6A).

Fig 6
Fig 6 Clustering disease property tests by their functional effect size trends.

(A) The slopes of the ranked effect size trends are significantly different from null distributions for both effect size permutations and random gene set sampling (Wilcoxon test p~1.92e-31 and p~1.56e-32, respectively). The two red lines indicate false discovery rates (FDRs) of 0.01, where 20 and 21 functions are significant. (B) A heatmap of the ranked scores of the test gene sets (columns) for the 158 properties tested (rows).

Table 1
Disease gene sets used in the study. GWAS: multiple-hit models, low effect size per gene. Odds ratios range between 1.1 and 1.2.