### Generation of exome data, accession of reference data and data processing

Genomic DNA of European individuals was enriched for the target region of all human CCDS exons with the SureSelect Human All Exon Kit (Agilent, Santa Clara, USA) according to the manufacturer's protocol and sequenced on a HiSeq 2000 (Illumina, San Diego, USA), yielding more than 5 gigabases of raw sequence data per exome. The Charité University Medicine ethics board approved this study, which conforms to the Helsinki Declaration, and we obtained informed consent from all participants.

Publicly available NGS raw data and variant calls of 1,063 individuals from different populations were downloaded from the FTP server of the 1000 Genomes Project [19]. Exome variants of these individuals served as reference variant sets in our work. Exome variants of the 5000 exomes project and of de Ligt *et al*. were used for testing the accuracy predictions [20, 21].

Exomes of test samples were enriched with Human All Exon SureSelect baits from Agilent and sequenced on Illumina Genome Analyzer IIx and Illumina HiSeq 2000 instruments as 100 bp single-end or paired-end reads according to the manufacturers' protocols. Short sequence reads were mapped with Novoalign (Novocraft, version 2.08) or BWA [22] to the reference sequence GRCh37. Variants were detected with default settings with SAMtools [23] or GATK [10] on BAM-formatted alignments [22]. Variant calls in variant call format (VCF) [24] were restricted to single nucleotide changes and to the consensus exome target region of the 1000 Genomes Project. Additionally, sites that were classified as technical artifacts by the 1000 Genomes Project were ignored.
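The filtering step described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function names, the in-memory list of `(chrom, start, end)` target intervals, and the set of artifact sites are all hypothetical, and the only thing taken from the VCF format itself is the column order (CHROM, POS, ID, REF, ALT, ...).

```python
def is_snv(ref, alt):
    """True if REF and every ALT allele are single nucleotides."""
    bases = {"A", "C", "G", "T"}
    return ref in bases and all(a in bases for a in alt.split(","))


def in_target(chrom, pos, targets):
    """True if the 1-based position lies in any (chrom, start, end) interval (end inclusive)."""
    return any(c == chrom and start <= pos <= end for c, start, end in targets)


def filter_calls(vcf_lines, targets, artifacts=frozenset()):
    """Keep VCF data lines that are SNVs inside the target region and not flagged as artifacts.

    targets   -- hypothetical list of (chrom, start, end) tuples for the exome target
    artifacts -- hypothetical set of (chrom, pos) sites to ignore
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if not is_snv(ref, alt):
            continue  # drop indels and other non-SNV records
        if (chrom, pos) in artifacts or not in_target(chrom, pos, targets):
            continue
        kept.append(line)
    return kept
```

A linear scan over the intervals is enough for illustration; a real exome target with many intervals would call for an interval index.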

### Distance function

The distance $d_{ij}$ between any two samples $x_i$ and $x_j$ is calculated over all positions $k$ in the target region (exome) at which the called genotype differs from the reference sequence in at least one sample, using a weighted indicator function $I(x_i(k), x_j(k)) * W_{ij}(k)$, with:

$I_{ij}(k) = I(x_i(k), x_j(k)) = \begin{cases} 1, & \text{if } x_i(k) = x_j(k) \\ 0, & \text{if } x_i(k) \ne x_j(k) \end{cases}$

and ${W}_{ij}\left(k\right)=\frac{2}{f\left({x}_{i}\left(k\right)\right)+f\left({x}_{j}\left(k\right)\right)}$.

This means that for identical genotypes the indicator *I* is weighted by the reciprocal of the genotype frequency $f(x_i(k))$, which is based on the reference set with an appropriate background population. For example, a genotype for individual $i$ at a given position $k$, $x_i(k = \text{chr6:79595096}) = \text{C/C}$, would have a genotype frequency of $f(x_i(k)) = 0.999$ if 1 out of 1,000 individuals in the reference set differs from this genotype.

For genotypes that were present only in the test sample but not observed at all in the reference set, we set their frequency to $1/(n+1)$, where $n$ is the total number of individuals in the reference set.
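As an illustration of the frequency rule, the following sketch computes $f$ for one position from a list of reference genotypes. The function name and the representation of genotypes as strings such as `"C/C"` are assumptions made for this example only.

```python
from collections import Counter


def genotype_frequency(genotype, reference_genotypes):
    """Frequency of `genotype` among the reference genotypes at one position.

    Genotypes never observed in the reference set get 1 / (n + 1),
    where n is the number of reference individuals.
    """
    n = len(reference_genotypes)
    count = Counter(reference_genotypes)[genotype]
    if count == 0:
        return 1.0 / (n + 1)
    return count / n


# Example from the text: 1 of 1,000 reference individuals differs from C/C.
reference = ["C/C"] * 999 + ["C/T"]
f_common = genotype_frequency("C/C", reference)
f_unseen = genotype_frequency("T/T", reference)
```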

Based on this, the distance $d_{ij}$ is defined as:

${d}_{ij}=1-\frac{1}{{C}_{ij}}{\displaystyle \sum _{k}}{I}_{ij}\left(k\right)*{W}_{ij}\left(k\right)$

where ${C}_{ij}=\sum _{k}{W}_{ij}\left(k\right)$ is used as a normalizing constant.

Therefore, a disagreement at a position of low variability in the reference set contributes more to the total distance than one at a highly variable position.

In the resulting distance matrix, $D$, pairs of individuals who are 'closely related' can be distinguished from those who are distinctly apart by their lower distance values. Thus, a distance of $d_{ij}=0$ means total agreement of all genotypes, and a distance of $d_{ij}=1$ means total disagreement of all genotypes.
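The definition of $d_{ij}$ can be written out as a short sketch. The data layout is an assumption for illustration: each sample is a dictionary from position to genotype, already restricted to the positions $k$ that enter the comparison, and `freq` is a hypothetical lookup from `(position, genotype)` to the reference frequency $f$.

```python
def weighted_distance(geno_i, geno_j, freq):
    """Frequency-weighted genotype distance d_ij between two samples.

    geno_i, geno_j -- dict: position -> genotype (assumed pre-filtered positions)
    freq           -- dict: (position, genotype) -> genotype frequency f
    """
    agree = 0.0  # sum over k of I_ij(k) * W_ij(k)
    norm = 0.0   # normalizing constant C_ij = sum over k of W_ij(k)
    for k in geno_i.keys() & geno_j.keys():
        w = 2.0 / (freq[(k, geno_i[k])] + freq[(k, geno_j[k])])
        norm += w
        if geno_i[k] == geno_j[k]:
            agree += w
    if norm == 0.0:
        raise ValueError("no shared positions to compare")
    return 1.0 - agree / norm


freq = {("p1", "C/C"): 0.5, ("p2", "A/A"): 0.9, ("p2", "A/G"): 0.1}
a = {"p1": "C/C", "p2": "A/A"}
b = {"p1": "C/C", "p2": "A/G"}
d_ab = weighted_distance(a, b, freq)
```

In this toy case both positions carry a weight of 2, one agreeing and one disagreeing, so the distance is one half.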

### Visualization of distance matrices by non-metric multidimensional scaling

The output of the above-described pairwise comparison of variant sets is a high-dimensional distance matrix with given distances or dissimilarities between pairs of individuals that satisfy all conditions of a metric. To represent the dissimilarities as distances between points in a low-dimensional space, we used non-metric multidimensional scaling (MDS), a visualization technique related to principal component analysis (PCA) and metric MDS. However, in contrast to PCA and metric MDS, non-metric MDS does not make any assumptions about the distribution of the underlying high-dimensional data. With a pre-specified number of dimensions for the embedding $\phi$ and an appropriate initial configuration, the *isoMDS* function of the *MASS* R package was used to minimize the goodness-of-fit measure of Kruskal and Shepard, called stress *S* (see [25]). To promote readability and easy interpretation of the data, we chose a standard two-dimensional embedding with:

$S(x_1, \dots, x_n, \phi) = \sqrt{\frac{\sum_{i=1, i \ne j}^{n} \left(d_{ij} - \|\phi(x_i) - \phi(x_j)\|\right)^2}{\sum_{i=1, i \ne j}^{n} \|\phi(x_i) - \phi(x_j)\|^2}}$

where $\|\cdot\|$ denotes the Euclidean norm.
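To make the stress expression concrete, the sketch below evaluates it for a given distance matrix and a given embedding. It only computes the quantity as written above, using the raw distances $d_{ij}$; it does not perform the optimization carried out by *isoMDS*, and the function name is an assumption.

```python
import math


def stress(dist, coords):
    """Stress S of an embedding, evaluated over all ordered pairs i != j.

    dist   -- square matrix (list of lists) of distances d_ij
    coords -- list of embedded points phi(x_i), e.g. (x, y) tuples
    """
    n = len(coords)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            embedded = math.dist(coords[i], coords[j])  # Euclidean norm
            num += (dist[i][j] - embedded) ** 2
            den += embedded ** 2
    return math.sqrt(num / den)


# Two points embedded 5 units apart: a perfect fit and a slightly worse one.
points = [(0.0, 0.0), (3.0, 4.0)]
s_perfect = stress([[0.0, 5.0], [5.0, 0.0]], points)
s_off = stress([[0.0, 4.0], [4.0, 0.0]], points)
```

A stress of zero means the embedded distances reproduce the input distances exactly.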

### Down sampling of raw data and simulation of genotyping accuracy

For coverage-adjusted comparisons, we randomly removed sequence reads from the original alignments. Variants were recalled on these down-sampled exomes as described above. We define genotyping accuracy as the percentage of the entire exome that was correctly genotyped, that is, the number of true-positive genotype calls (alternate and reference genotypes) divided by the size of the exomic target region. For our simulations, we assumed that the reference set had a genotyping accuracy of 100% and introduced genotyping errors at random positions. As most of the exomic positions had low variability in the reference set, the contribution of genotyping errors to the distance function could be approximated by adding $2X$ to the normalizing constant $C_{ij}$, where $X \sim B(N, p)$ is a binomially distributed random variable, the probability $p$ equals the specified genotyping error, and the number of trials $N = 2.8 \times 10^{7}$ bp is the total size of the exome.
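The error simulation can be sketched as follows, assuming the agreement sum $\sum_k I_{ij}(k) W_{ij}(k)$ and $C_{ij}$ of a sample pair are already known. The function names are hypothetical, and drawing $X$ through a normal approximation to the binomial distribution is a simplification chosen here to keep the example dependency-free; it is not stated in the text.

```python
import math
import random

EXOME_SIZE = 28_000_000  # N = 2.8 * 10^7 bp, as given in the text


def draw_error_count(p, n=EXOME_SIZE, rng=random):
    """Approximate draw of X ~ B(n, p) via the normal approximation (an assumption)."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    return max(0, round(rng.gauss(mean, sd)))


def distance_with_errors(agree_sum, c_ij, n_errors):
    """Distance after adding 2 * n_errors to the normalizing constant C_ij."""
    return 1.0 - agree_sum / (c_ij + 2.0 * n_errors)


rng = random.Random(0)
x = draw_error_count(0.001, rng=rng)  # genotyping error p = 0.1%
d_clean = distance_with_errors(90.0, 100.0, 0)
d_noisy = distance_with_errors(90.0, 100.0, 5)
```

Because the errors enter only the denominator, any positive error count increases the distance.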

### Computation of the standardized dissimilarity score and reference curve

Distances between all individuals of the reference set were measured, and the median and interquartile range (IQR) of each column of the distance matrix were computed and averaged across columns; these averaged values were used to standardize the median of a test sample. The median of the distances from a test sample to all individuals of the reference set was computed and normalized by subtracting the pre-calculated median of the reference set and dividing by the IQR of the reference set. The reference curve and both the 5% and 95% quantiles of the standardized dissimilarity score (SDS) were computed for the reference set and for simulated data sets of decreasing error groups.
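The standardization can be sketched as below. Two choices are assumptions of this example rather than details given in the text: the diagonal (self-distance) is excluded from each column, and quartiles are computed with the inclusive method of Python's `statistics.quantiles`. The function names are hypothetical.

```python
from statistics import mean, median, quantiles


def reference_stats(dist):
    """Average column median and average column IQR of a reference distance matrix."""
    n = len(dist)
    medians, iqrs = [], []
    for j in range(n):
        column = [dist[i][j] for i in range(n) if i != j]  # skip self-distance
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        medians.append(median(column))
        iqrs.append(q3 - q1)
    return mean(medians), mean(iqrs)


def standardized_dissimilarity_score(test_distances, ref_median, ref_iqr):
    """SDS: median distance of a test sample to the reference set, standardized."""
    return (median(test_distances) - ref_median) / ref_iqr


reference = [
    [0.0, 0.2, 0.4],
    [0.2, 0.0, 0.6],
    [0.4, 0.6, 0.0],
]
ref_median, ref_iqr = reference_stats(reference)
sds = standardized_dissimilarity_score([0.5, 0.6, 0.7], 0.4, 0.1)
```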