Selecting and extracting informative variant sites
We have previously shown that using as few as 5000 carefully chosen polymorphic loci is sufficient for relatedness estimation and that this subset of informative loci yields more accurate estimates than using all available variants [3]. A similar site-selection strategy is also used in Conpair to estimate contamination [6]. In Somalier, we utilize the observation that the optimal sites for detecting relatedness are high-quality, unlinked sites with a population allele frequency of around 0.5. A balanced allele frequency maximizes the probability that any 2 unrelated samples will differ. We distribute a set of informative sites to be queried by Somalier, though users may also create their own sites files tailored to their application. The sites are high-frequency single-nucleotide variants selected from gnomAD [7] exomes that exclude segmental duplication and low-complexity regions [8]. We also distribute a set of sites limited to exons that are frequently (> 10 reads in at least 40% of samples) expressed in GTeX for use in cohorts with RNA-seq data. To minimize genotyping error, variants with nearby insertions or deletions are excluded. In addition, we have excluded sites that are cytosines in the reference so that the tool can be used on bisulfite seq data, for example, to check the correspondence between bisulfite sequencing and RNA-seq data. The Somalier repository includes the code to create a set of sites for different organisms given a population VCF and a set of optional exclude regions. We distribute a default set of matched sites for both the GRCh37 and GRCh38 builds of the human reference genome. This allows a user to extract sites from a sample aligned to GRCh37 using our GRCh37 sites file and compare that sketch to a sketch created from a sample aligned to GRCh38 by extracting the sites in our GRCh38 file. This is convenient as labs move from GRCh37 to GRCh38 and future genome builds. The sites files include informative variants on the X and Y chromosomes so that Somalier can also estimate a sample’s sex from the genotypes. However, only autosomal sites are used to estimate relatedness. With the default sites files, Somalier inspects 17,766 total sites (these are distributed with the Somalier software), all of which are chosen to be in coding sequence so that they are applicable to genome, exome, and RNA-seq datasets.
In order to quickly extract data from polymorphic sites into a genome sketch, Somalier uses the BAM or CRAM index to query each file at each of the informative sites. Alignments with a mapping quality of at least 1 that are not duplicates, supplementary, or failing quality control (according to the SAM flag) are used. Each passing alignment is evaluated at the requested position and the base in the alignment at that position is checked against the given reference and alternate for the query variant. This check considers the CIGAR operation [9] at that base which indicates insertions, deletions, and other events within the read. This is faster than a traditional sequence alignment “pileup” as it looks at each read only once and interrogates only the exact position in question. If a VCF (or BCF or GVCF) is provided instead of an alignment file, Somalier will extract the depth information for each sample for requested sites that are present in the VCF. The sketches extracted from a VCF are indistinguishable from those extracted from alignment files. In order to support single-sample VCFs, which do not contain calls where the individual is homozygous for the reference allele, the user may indicate that missing variants should be assumed to be homozygous for the reference allele. This also facilitates comparing multiple tumor-normal VCFs where many sites will not be shared (however, in those cases, it is preferable to extract the sketch from the alignment files rather than from the VCF).
Somalier tallies reference and alternate counts for each site. Once all sites are collected, it writes a binary file containing the sample name and the allele counts collected at each of the inspected sites. For the set of sites distributed from the Somalier repository, a sketch file requires ~ 200 KB of space on disk or in memory. This sketch format and the speed of parsing and comparing sketch files are key strengths of Somalier. For example, since Somalier can complete a full analysis of 2504 sketches from the 1000 Genomes high-coverage whole-genome samples (Michael Zody, personal communication) in under 20 s, users can keep a pool of sample sketches to test against and check incoming samples against all previously sketched samples.
Comparing sketches
Thousands of sample sketches can be read into memory per second and compared. In order to calculate relatedness, Somalier converts the reference and alternate allele counts stored for each sample at each site into a genotype. The genotype is determined to be unknown if the depth is less than the user-specified value (default of 7), homozygous reference if the allele balance (i.e., alt-count/[ref-count + alt-count]) is less than 0.02, heterozygous if the allele balance is between 0.2 and 0.8, homozygous alternate if the allele balance is above 0.98, and unknown otherwise (Fig. 1a). A flag can amend these rules such that missing sites (with depth of 0) are treated as homozygous reference, rather than unknown. While simple, this heuristic genotyping works well in practice and is extremely fast, because Somalier looks only at single-nucleotide variants in non-repetitive regions of the genome. As the sample is processed, Somalier also collects information on depth, mean allele-balance, number of reference, heterozygous, and homozygous alternate calls for each sample, along with similar stats for the X and Y chromosomes. These data are used to calculate per-sample quality control metrics. In order to measure relatedness, the data collected for each sample is converted into a data structure consisting of hom_ref, het, and hom_alt bit vectors (Fig. 1b). The bit vectors consist of 64 bit integers, enabling Somalier to store 64 variants per integer. There are 17,384 autosomal sites in the default sites file used by Somalier, consuming only 6519 bytes per sample (17,384/64 bits * 3 bit-vectors/sample * 8 bits/byte). With this data layout, Somalier can represent all 2504 samples from the 1000 Genomes Project in under 17 megabytes of memory. This simple data structure also facilitates rapid pairwise comparisons (Fig. 1c); for example, we can compute IBS0 (that is, “identity by-state 0” or sites where zero alleles are shared between two samples A and B) with the following logic which evaluates 64 sites in parallel:
$$ \left(\mathtt{A}.\mathtt{\hom}\_\mathtt{ref}\ \mathtt{and}\ \mathtt{B}.\mathtt{\hom}\_\mathtt{alt}\right)\ \mathtt{or}\ \left(\mathtt{B}.\mathtt{\hom}\_\mathtt{ref}\ \mathtt{and}\ \mathtt{A}.\mathtt{\hom}\_\mathtt{alt}\right) $$
We repeat this for each of the 272 (17,384 autosomal sites/64 sites per entry) entries in the array to assess all of the genome-wide sites for each pair of samples. In fact, we do not need the sites, just the count of sites that are IBS0. Therefore, we use the popcount (i.e., the count of bits that are set to TRUE) hardware instruction to count the total number of bits where the expression is non-zero in order to get the total count of IBS0 sites between the 2 samples. In addition to IBS0, we calculate counts of IBS2 where both samples have the same genotype, shared heterozygotes (both are heterozygotes), shared homozygous alternates, and heterozygous sites for each sample. All of the operations are extremely fast as it does not require code branching via, for example, conditional logic; instead, the calculations are all conducted with bitwise operations.
Once those metrics are calculated, the relatedness between sample i and sample j is calculated as:
$$ \left(\mathtt{shared}\hbox{-} \mathtt{hets}\left(\mathtt{i},\mathtt{j}\right)\hbox{-} \mathtt{2}\ast \mathtt{ibs}\mathtt{0}\left(\mathtt{i},\mathtt{j}\right)\right)/\mathtt{\min}\ \left(\mathtt{hets}\left(\mathtt{i}\right),\mathtt{hets}\left(\mathtt{j}\right)\right) $$
where hets is the count of heterozygote calls per sample out of the assayed sites. This metric is derived by Manichaikul et al. [4]. In addition, the homozygous concordance rate is reported as:
$$ \left(\mathtt{shared}\hbox{-} \mathtt{homozygous}\hbox{-} \mathtt{alts}\left(\mathtt{i},\mathtt{j}\right)\hbox{-} \mathtt{2}\ast \mathtt{ibs}\mathtt{0}\left(\mathtt{i},\mathtt{j}\right)\right)/\mathtt{\min}\ \left(\mathtt{homozygous}\hbox{-} \mathtt{alts}\left(\mathtt{i}\right),\mathtt{homozygous}\hbox{-} \mathtt{alts}\left(\mathtt{j}\right)\right) $$
This measure is similar to the one described in HYSYS [10] except that the HYSYS measure is simply:
$$ \Big(\mathtt{shared}\hbox{-} \mathtt{homozygous}\hbox{-} \mathtt{alts}\left(\mathtt{i},\mathtt{j}\right)\hbox{-} /\mathtt{\min}\ \left(\mathtt{homozygous}\hbox{-} \mathtt{alts}\left(\mathtt{i}\right),\mathtt{homozygous}\hbox{-} \mathtt{alts}\left(\mathtt{j}\right)\right) $$
Our formulation has the benefit that it matches the scale and interpretation of the relatedness estimate; unrelated individuals will have a concordance of around 0, whereas in HYSYS they will have a value around 0.5. This is a useful relatedness metric when severe loss of heterozygosity removes many heterozygous calls from the tumor sample making the traditional relatedness calculation inaccurate.
If a pedigree file is given, Wright’s method of path coefficients [11] is used to calculate the expected relatedness. These values can then be compared to the relatedness observed from the genotypes. For somatic samples, the user can also specify a “groups” file where sample identifiers appearing on the same line are expected to be identical; for example, three biopsies from each of two individuals would appear as three comma-separated sample identifiers on two separate lines.
Finally, the output is reported both as text and as an interactive HTML page. When using the webpage, the user can toggle which relatedness metrics (IBS0, IBS2, relatedness, homozygous concordance, shared heterozygotes, shared homozygous alternates) to plot for the X and Y coordinates and, if expected groups were given (e.g., tumor-normal pairs) on the command-line, points are colored according to their expected relatedness. This setup means that points of similar colors should cluster together. In addition, Somalier plots the per-sample output in a separate plot with selectable axes; this functionality allows one to evaluate predicted vs. reported sex and depth across samples.
Somalier requires htslib (https://htslib.org). It is written in the Nim programming language (https://nim-lang.org) which compiles to C and also utilizes our hts-nim [12] library. It is distributed as a static binary, and the source code is available at https://github.com/brentp/somalier under an academic license.