Whole-genome reference panel of 1,781 Northeast Asians improves imputation accuracy of rare and low-frequency variants

Genotype imputation using a reference panel is a cost-effective strategy to fill in millions of missing genotypes for various genetic analyses. Here, we present the Northeast Asian Reference Database (NARD), which includes whole-genome sequencing data of 1,781 individuals from Korea, Mongolia, Japan, China, and Hong Kong. NARD captures the genetic diversity of Korean (n=850) and Mongolian (n=386) ancestries that were not present in the 1000 Genomes Project Phase 3 (1KGP3). We combined and re-phased the genotypes from NARD and 1KGP3 to construct a union set of haplotypes. This approach established a robust imputation reference panel for Northeast Asian populations, which yields the greatest imputation accuracy for rare and low-frequency variants compared with the existing panels. We also illustrate that NARD can potentially improve disease variant discovery by reducing the number of pathogenic candidates. Overall, this study provides a reference panel suited to genetic studies in Northeast Asia.

UK10K 5 and IMPUTE2 21 (see URLs); we reciprocally imputed the two panels using Minimac3, so that missing genotypes in NARD or 1KGP3 were statistically inferred. Consistent with previous studies 4,6,8-10, combining the two panels yielded more accurate imputation results than NARD or 1KGP3 alone. Furthermore, we confirmed a large improvement in imputation accuracy, particularly for very-rare (MAF < 0.2%; R² = 0.80), rare (0.2% ≤ MAF < 0.5%; R² = 0.83), and low-frequency (0.5% ≤ MAF < 5%; R² = 0.87) variants, when the haplotypes in the combined panel were re-phased by SHAPEIT3 22. In addition to measuring accuracy, we assessed the number of accurately imputed SNPs for each panel. For this analysis, we used the estimated R² values in the info file produced by Minimac3, as these are the standard for quality control in GWAS 23,24. We found that the re-phased panel produced the greatest number of high-confidence SNPs (R² ≥ 0.9) compared with the other panels, especially 1KGP3 (n = 7.5 million versus 6.7 million), in concordance with the imputation accuracy (Fig. 2b).
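The accuracy summary described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes that accuracy is the squared Pearson correlation between true genotypes (0/1/2) and imputed dosages, computed per variant and then averaged within the MAF bins given in the text. The function names and the per-variant averaging are assumptions made for illustration.

```python
import math

# MAF bins as defined in the text, expressed as fractions rather than percent.
MAF_BINS = {
    "very-rare": (0.0, 0.002),
    "rare": (0.002, 0.005),
    "low-frequency": (0.005, 0.05),
}

def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return math.nan  # correlation is undefined for a monomorphic site
    return cov * cov / (var_t * var_d)

def mean_r2_by_maf_bin(mafs, r2_values):
    """Average per-variant R^2 within each MAF bin, skipping undefined values."""
    means = {}
    for name, (low, high) in MAF_BINS.items():
        in_bin = [r2 for maf, r2 in zip(mafs, r2_values)
                  if low <= maf < high and not math.isnan(r2)]
        means[name] = sum(in_bin) / len(in_bin) if in_bin else math.nan
    return means

def count_high_confidence(estimated_r2, threshold=0.9):
    """Count variants whose estimated R^2 meets the quality threshold."""
    return sum(1 for r2 in estimated_r2 if r2 >= threshold)
```

The last helper mirrors the second analysis in the paragraph, where variants are counted by their estimated R² rather than by comparison with true genotypes.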
To investigate why the re-phasing approach improved imputation performance, we performed an identity-by-descent (IBD) analysis. Phasing or genotype errors are known to cause gaps within real IBD tracts, so shorter segments tend to be detected in phased genotype data 25,26. We therefore expected that re-phasing would correct haplotypes in NARD and thereby extend the length of IBD segments shared among individuals. To test this, we measured the large IBD segments (≥ 2 cM) shared between pairs of individuals using the original (phased without 1KGP3) and the re-phased haplotypes of NARD. We confirmed a significant increase in both the length and the number of shared IBD segments in the re-phased haplotypes, which implies that long-range phasing (LRP) 27,28 refined the NARD haplotypes during the re-phasing process (Supplementary Fig. 3).
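The reasoning behind the IBD comparison can be illustrated with a toy example. The sketch below assumes a hypothetical segment record (two sample identifiers and start and end positions in cM); it is not the output format of any particular IBD detection tool. It shows how a single phasing error that splits one tract can push a fragment below the 2 cM threshold and shrink the total shared length.

```python
from typing import Iterable, NamedTuple, Tuple

class IbdSegment(NamedTuple):
    # Hypothetical record layout, chosen for illustration only.
    sample_a: str
    sample_b: str
    start_cm: float
    end_cm: float

def summarize_large_ibd(segments: Iterable[IbdSegment],
                        min_cm: float = 2.0) -> Tuple[int, float]:
    """Return (count, total length in cM) of segments at least min_cm long."""
    lengths = [s.end_cm - s.start_cm for s in segments]
    large = [length for length in lengths if length >= min_cm]
    return len(large), sum(large)

# One true 5 cM tract, broken at 11.5 cM by a phasing error.
original = [
    IbdSegment("ind1", "ind2", 10.0, 11.5),  # 1.5 cM fragment, below threshold
    IbdSegment("ind1", "ind2", 11.5, 15.0),  # 3.5 cM fragment, retained
]
# After haplotype correction the tract is detected as one segment.
rephased = [IbdSegment("ind1", "ind2", 10.0, 15.0)]

original_summary = summarize_large_ibd(original)
rephased_summary = summarize_large_ibd(rephased)
```

In this toy case the retained length grows from 3.5 cM to 5.0 cM after correction. Across many pairs, fragments that were all below the threshold can also merge into segments above it, which would raise the segment count as well.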
We further demonstrated the power of the re-phased panel using an independent cohort of 106 unrelated Northeast Asian individuals (79 CHN and 27 JPN) 29,30. To measure imputation accuracy, we used MAF bins defined by 2,093 Northeast Asians from NARD and 1KGP3 (CHB, CHS, and JPT). In agreement with the imputation result for the KOR cohort, the re-phased panel provided the most accurate genotype dosages for very-rare (R² = 0.74), rare (R² = 0.75), and low-frequency (R² = 0.81) variants (Supplementary Fig. 4a). Moreover, the re-phased panel also generated the largest number of accurately imputed genotypes compared with the other panels, especially 1KGP3 (n = 7.3 million versus 7.0 million; Supplementary Fig. 4b).

NARD imputation server
We have developed a web-based imputation server that makes the NARD + 1KGP3 (re-phased) panel publicly available to researchers (see URLs). The NARD imputation server accepts a wide range of genotype data formats, including PLINK 31 (ped files paired with map files, or bed files paired with bim and fam files), 23andMe (Mountain View, CA) and AncestryDNA (Lehi, UT) files, and the variant call format (VCF). The data are processed through an imputation pipeline consisting of four major steps: pre-processing, phasing, imputation, and post-processing (Supplementary Fig. 5). The pre-processing step checks that the uploaded files are in a valid format and converts them into VCF files for the next steps.
If the input data are uploaded in a PLINK format, they are converted based on the human_g1k_v37 reference from 1KGP3 using GotCloud 32, whereas 23andMe/AncestryDNA files are converted by bcftools 33. The pre-processed data are phased using Eagle or SHAPEIT2 34, with or without a reference panel, respectively, following the pipeline of the Michigan Imputation Server (see URLs). Imputation is then performed with Minimac3. In the post-processing step, the output is assessed and provided as gzip-compressed VCF files.
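The four-step pipeline can be summarized as a small planning function. This is only a schematic under stated assumptions: the format labels, the function name, and the returned step descriptions are hypothetical, and no actual command lines of GotCloud, bcftools, Eagle, SHAPEIT2, or Minimac3 are reproduced. It encodes only the tool choices stated in the text.

```python
# Converter used in pre-processing for each input format, as stated in the text.
# The dictionary keys are hypothetical labels, not names used by the server.
CONVERTERS = {
    "plink": "GotCloud (human_g1k_v37 reference)",
    "23andme": "bcftools",
    "ancestrydna": "bcftools",
    "vcf": None,  # already VCF, so no conversion is needed
}

def plan_pipeline(input_format, phase_with_reference):
    """Return the four pipeline steps as human-readable descriptions."""
    fmt = input_format.lower()
    if fmt not in CONVERTERS:
        raise ValueError(f"unsupported input format: {input_format}")
    converter = CONVERTERS[fmt]
    if converter is None:
        pre = "pre-processing: validate VCF"
    else:
        pre = f"pre-processing: validate and convert to VCF with {converter}"
    # Eagle phases with a reference panel, SHAPEIT2 without one.
    phaser = "Eagle" if phase_with_reference else "SHAPEIT2"
    return [
        pre,
        f"phasing: {phaser}",
        "imputation: Minimac3",
        "post-processing: write gzip-compressed VCF",
    ]
```

For example, `plan_pipeline("PLINK", True)` lists a GotCloud conversion followed by Eagle phasing, whereas `plan_pipeline("vcf", False)` skips conversion and phases with SHAPEIT2.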

NARD for variant interpretation
We also evaluated the advantage of NARD as a population-specific reference panel for clinical variant interpretation. Common variants were excluded from this analysis because removing them is the first step in identifying rare disease-causing genes 35. To examine the potential advantage of NARD, we compared the frequencies of SNPs between the Genome Aggregation Database (gnomAD 2.1.1 release; see URLs) and NARD. We redefined the frequency of 1.8 million genome-wide SNPs that are rare in worldwide populations from gnomAD (gnomAD-ALL) to low-frequency or common (MAF ≥ 5%). Moreover, 0.9 million genome-wide SNPs that are rare in East Asians from gnomAD (gnomAD-EAS) were low-frequency or common variants in NARD (Fig. 3a).
We then simulated rare disease variant discovery using the 203 samples that were included in the two pseudo-GWAS panels for the imputation analysis. We applied the variant filtering criterion (MAF < 5%) from the guideline of the American College of Medical Genetics for the interpretation of sequence variants 36. Notably, the number of protein-altering variants (missense, nonsense, frameshift, and splicing variants) was significantly reduced when the exome catalogues of gnomAD-EAS and NARD were jointly applied for variant filtration (Fig. 3b). This result suggests that NARD could contribute to pathogenic variant classification as well as to genotype imputation for Northeast Asians.
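The effect of joint filtering can be sketched as follows. This is a minimal illustration, not the authors' filtering code: each frequency catalogue is assumed to be a plain dictionary from a variant identifier to its alternate allele frequency, and a variant absent from a catalogue is assumed to be rare in it. The variant names and frequencies are invented for the example.

```python
def minor_allele_frequency(allele_frequency):
    """Fold an alternate allele frequency into a minor allele frequency."""
    return min(allele_frequency, 1.0 - allele_frequency)

def is_rare(variant, catalogue, threshold=0.05):
    """A variant is rare if it is absent from the catalogue or has MAF < threshold."""
    frequency = catalogue.get(variant)
    return frequency is None or minor_allele_frequency(frequency) < threshold

def filter_candidates(variants, catalogues, threshold=0.05):
    """Keep only variants that are rare in every supplied catalogue."""
    return [v for v in variants
            if all(is_rare(v, c, threshold) for c in catalogues)]

# Invented frequencies: v2 is rare in one catalogue but not in the other.
gnomad_eas = {"v1": 0.01, "v2": 0.02}
nard = {"v1": 0.01, "v2": 0.12}
candidates = ["v1", "v2", "v3"]

single = filter_candidates(candidates, [gnomad_eas])
joint = filter_candidates(candidates, [gnomad_eas, nard])
```

Here `single` keeps all three candidates, while `joint` drops `v2` because it exceeds the 5% threshold in the second catalogue. This is the mechanism by which adding a population-specific catalogue reduces the candidate list.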

Discussion
Owing to cost reductions and technological advances in WGS, several groups have focused on building population-specific reference panels, especially for populations that are underrepresented in conventional panels such as 1KGP3 3,4,7-10. However, a Northeast Asian-specific panel with deep sequencing coverage and a large sample size has rarely been constructed. In this study, we integrated whole-genome sequence variants of 1,781 Northeast Asian individuals to construct a reference panel, NARD, to resolve the uncertainty in genotype imputation with the pre-existing panels and to facilitate more comprehensive genetic research in Northeast Asia.
Genotype imputation accuracy is known to be affected by several factors, and one of the major determinants is the size of the reference panel 5,34. Hence, most genetic studies of Northeast Asians 37-40 have relied on panels with large sample sizes, even though the ancestries of the study population and the reference panel are not matched. However, these panels showed lower power in genotype imputation than well-matched population-specific panels with smaller sample sizes 4,6-10,41. Recently, the HRC panel was constructed using the genotypes of more than 30,000 individuals (mostly Europeans), but a previous study demonstrated that this panel performs poorly for Northeast Asians, even worse than the 1KGP3 panel 42, and our analysis showed the same result. This might be because the HRC panel is skewed toward European ancestry and therefore contributes haplotypes that are unhelpful for Northeast Asians 43, which emphasizes the importance of a population-specific reference panel for accurate genotype imputation and subsequent genetic analysis.
Considering the importance of a population-specific reference panel, we generated a large-scale WGS dataset of KOR and MNG, which were not included in 1KGP3. In the population structure analysis, KOR and MNG were genetically differentiated from other East Asians. Our dataset therefore covers the major ancestries in Northeast Asia at population scale. In addition to these two populations, other Northeast Asians including JPN, CHN, and HKG were also sequenced to increase the imputation power through the sample size effect and to build NARD as a reference panel for Northeast Asians.
As previous studies achieved high imputation accuracy by combining their population-specific panels with 1KGP3 4,6,8-10, we also confirmed that combining NARD and 1KGP3 improves imputation performance, using the fast and simple approach described in UK10K and IMPUTE2. However, the imputed genotypes carry some uncertainty, since the missing genotypes in each panel were statistically estimated. Following the HRC study, calculating the genotype likelihood of each variant from the individual BAM files in the panel would resolve this issue, provided that the sequencing coverage is sufficient. In addition to merging NARD and 1KGP3, we further enhanced the power of the combined panel by applying the re-phasing strategy. This advanced process has not been applied in most previous studies 4,6,8-10, but the HRC study showed that it further improves imputation accuracy. With this strategy, the NARD + 1KGP3 (re-phased) panel produced more accurate genotype dosages than the NARD + 1KGP3 panel, especially for uncommon variants (MAF < 5%), possibly because LRP corrects haplotypes with the assistance of the haplotypes in the 1KGP3 panel.
In summary, we generated a large-scale reference panel for Northeast Asians, which will be a highly valuable resource for resolving the persistent deficiency of Asian genome data. We believe that our efforts will facilitate more extensive genetic research and contribute substantially to precision medicine in Northeast Asians.
[Figure labels: KOR cohort; Non-reference allele frequency (%); Estimated imputation accuracy (R² ≥ 0.9, 0.8 ≤ R² < 0.9, 0.7 ≤ R² < 0.8)]