Massive underrepresentation of Arabs in genomic studies of common disease

Bhattacharya, Romit; Chen, NingNing; Shim, Injeong; Kuwahara, Hiroyuki; Gao, Xin; Alkuraya, Fowzan S.; Fahed, Akl C.

doi:10.1186/s13073-023-01254-8

Comment
Open access
Published: 22 November 2023

Massive underrepresentation of Arabs in genomic studies of common disease

Romit Bhattacharya^1,2,3^na1,
NingNing Chen^4,5^na1,
Injeong Shim^2,3,6,
Hiroyuki Kuwahara^4,5,
Xin Gao^4,5^na2,
Fowzan S. Alkuraya⁷^na2 &
…
Akl C. Fahed ORCID: orcid.org/0000-0002-4849-6389^1,2,3^na2

Genome Medicine volume 15, Article number: 99 (2023) Cite this article

1589 Accesses
1 Citations
14 Altmetric
Metrics details

Abstract

Arabs represent 5% of the world population and have a high prevalence of common disease, yet remain greatly underrepresented in genome-wide association studies, where only 1 in 600 individuals are Arab. We highlight the persistent and unaddressed underrepresentation of Arabs in genomic databases and discuss its impact on public health genomics and missed opportunities for biological discovery.

Arabs, common complex disease, and healthcare disparities

Arab people represent about 5% of the world population and are an ancestrally diverse group with unique health considerations. Arabs can be defined by their geographic histories, religious practices, ethnicities, and countries of origin. The countries of the Arab world span the Arabian Peninsula, the Levant, and North Africa and are very diverse economically — including some of the wealthiest and poorest countries in the world — and genetically through extensive evidence of admixture found among present-day Arabs. However, while Arabs are ancestrally diverse, they have (1) genetically distinct signatures [1] and (2) unique risk factors for common diseases. These include waterpipe smoking — a key driver for risk of heart disease and cancer, some of the highest rates of vitamin D deficiency worldwide — attributed to dark skin and conservative clothing [2] and consanguinity leading to autozygosity.

Arabs are also present in diaspora, and as such, the risk of propagating healthcare disparities is a problem for not only the Arab world but also the West. Several Arab countries have witnessed large waves of immigration to the West due to conflict and political or economic breakdown. In the US alone, it is estimated that there are 3.7 million Arabs although it is likely that actual estimates are larger due to underreporting. Limited studies on Arab Americans observed a higher age-adjusted mortality rate than non-Arab Americans raising concerns for possible disparities in those populations [3]. With extensive European, Asian, and African admixture, Arabs can present as White, or non-White which allows misidentification in Western datasets and confounds attempts at disaggregation of data. In fact, even the US Census does not acknowledge Arabs as a separate category and instead sees them as “White.”

There is a rising epidemic of complex diseases in countries of the Arab world. The increasing rates of cardiovascular disease, diabetes, dyslipidemia, and cancer are often ascribed to environmental, behavioral, or economic factors. For decades, communicable diseases in low- and middle-income countries of the Arab world were the public health priority, but today nearly 80% of deaths are attributable to non-communicable diseases [4]. Common diseases are highly heritable and it remains unclear how genomic background may drive risk among Arabs for common diseases where there is a large public health burden. The interplay of genomic and nongenomic risk for many common diseases is increasingly appreciated in European-ancestry datasets and, more recently, other world populations with growing genomic and clinical datasets. For example, recent insights from individuals of South Asian ancestry identified increased risk for cardiovascular disease not captured by known conventional risk factors [5]. Whether this is true for Arabs is unknown.

Massive and persistent underrepresentation of Arabs in genome-wide association studies

Genomic databases globally have suffered from limited ancestral diversity, and Martin et al. highlighted how this could propagate health disparities through predictable overrepresentation of genetic minorities — demonstrating that in 2019 despite only 16% of the global population being of European descent, ~ 79% of all genome-wide association studies (GWAS) participants were of European descent [6]. Arab populations are among the populations least represented in GWAS and large population-based biobanks, along with peoples of Oceania, Southeast Asia, Native American nations, and parts of Africa. Given high levels of endogamy and consanguinity in the Arab population, the inclusion of Arabs in such datasets has been an area of interest, but has not been a focus globally. Currently, in the GWAS Catalog, 88% of individuals are of European ancestry and 5.8% are East Asian (Biobank Japan and China Kadoorie Biobank being among the largest contributors from East Asia). By contrast, the relative inclusion of other ancestry groups has consistently languished at < 1.6%, with Hispanic, South Asian, and Greater Middle Eastern populations making up < 1% of the participants in published GWAS studies.

We used the GWAS Catalog (https://www.ebi.ac.uk/gwas/) to quantify the representation trend for Arab individuals over 18 years — on aggregate and broken down by country — and find extraordinary underrepresentation in GWAS studies. When we examined Arab populations specifically defined by country of origin (in contrast to the grouping variable “Greater Middle Eastern” that is often used), we found approximately 0.17% of the GWAS Catalog was Arab — meaning about 1 in 600 individuals in published GWAS studies are currently of Arab ancestry (compared to 523 in 600 are of European ancestry). What is more discouraging is that this disparity has not been sufficiently addressed over time (Fig. 1a). When considering only non-European genomes, Arabs are noted to be consistently near 0% since the start of the GWAS Catalog (Fig. 1b). In fact, representation from the majority of countries considered to comprise the Arab world is lacking (Fig. 1d, e). Genomes from Algeria, Iraq, Libya, Oman, Somalia, and Syria are all completely absent.

The Qatar Biobank (QBB) has published data on ~ 15,000 individuals, and this represents the single largest contribution to Arab genomic representation to date (Fig. 1c) [7]. With even this modest addition in number, the contribution to genomics has been large with the initial GWAS from the QBB identifying 24.6 million previously unknown variants, and numerous novel disease-associated variants predominantly from 5 newly identified non-admixed subclusters within the Qatari population [7, 8]. The addition of QBB data to the GWAS Catalog represents a great advance, but additionally highlights the necessity and urgency of increasing representation from other Arab countries (Fig. 1d, e). While the contributions of the QBB are commendable, Qatar is not representative of all Arabs, for instance the distant North African or Levantine countries, as genetic ancestry varies greatly across the region.

Missed opportunities in disease prevention and discovery

Discoveries from GWAS have begun to move into the clinical space through implementation of polygenic risk scores (PRS). It is now well-established that lack of diversity in reference genomes and published GWAS limit portability and accuracy of PRS prediction in non-European ancestries. The issues begin with SNP array imputation where reference panels are predominantly of European origin, and thus, when linkage disequilibrium (LD) blocks differ across ancestries, variation in diverse populations can be completely obscured ⁶. Differences in minor allele frequency (MAF) and effect size of variants among individuals of different genetic ancestries also lead to reduced accuracy of prediction — and increase the likelihood of propagating healthcare disparities ⁶. When comparing the MAF of causal variants associated with 9 cardiometabolic traits in a large European-ancestry cohort (the UK Biobank) to their frequency in an indigenous Arab population, the MAF is on average 7.6% lower in Arabs (p-value = 4.2e − 06), explaining part of the reduced performance of PRS that are derived in European populations when they are applied to Arabs (Fig. 1f) [7].

The use of computational methods to improve cross-ethnic PRS performance in genetically diverse populations is encouraging, but not enough to allay real concerns regarding ongoing health disparities. While large biobanks, population cohort studies, and international consortia have allowed for serially larger GWAS, the genetic diversity of these GWAS has not kept pace, despite advocacy from groups like the American Society of Human Genetics to prioritize portability of scores. Polygenic scores are increasingly being returned to patients, and clinico-genomic models to predict and intervene on risk early in the life course through a precision medicine approach are likely to become the norm. Without the appropriate development or validation of such clinico-genomic models in Arabs, widespread implementation risks propagation of already existing disparities in care.

Beyond the serious impact on public health genomics, the massive underrepresentation of Arabs in genomic studies is a missed opportunity to discover new disease biology. A prior comment in the journal eloquently described the opportunity to identify homozygous loss of function variants and novel candidate genes for recessive disease through more sequencing of highly consanguineous Arab populations [9]. We highlight that also for common disease, homozygous loss of function variants (human knockouts) might inform interpretation of GWAS. Increasingly, the diversity of GWAS participants, more than the sample size, is advancing the discovery of new loci. For example, in a recent multi-ancestry GWAS of type 2 diabetes, 46% of new loci would not have been identified in a European-ancestry GWAS only [10]. Today, more than 30 million whole genomes have been sequenced which has advanced our understanding of genetic underpinnings of disease, but it is now well understood that in addition to the number of genomes, future opportunities for discovery lie in sequencing a diversity of genomes. First, genomes of different ancestries, when combined, allow leveraging of different LD block structures for better identification of causal variants within a specific genetic region. Second, different causal variants for similar diseases may be identified by studying diverse populations.

The unique ancestral, geographic, and cultural histories of the Arab people offer many opportunities for discovery and improved risk prediction and health for the 450 million Arabs in the world. Future directions and actions to reduce disparities and increase yield of novel variants would include (1) encouraging census bodies to disaggregate Arabs from White individuals within the diaspora, (2) active recruitment of Arabs in genomic databases, and (3) supporting infrastructure and training programs in Arab countries to develop their own biobanks genomic resources to add to the global collective pool of data.

Availability of data and materials

All data are publicly available at https://www.ebi.ac.uk/gwas/ and https://www.finucanelab.org/data.

Abbreviations

BMI:: Body mass index
CAD:: Coronary artery disease
CB2:: Cannabinoid receptor 2
DBP:: Diastolic blood pressure
DKD:: Diabetic kidney disease
GWAS:: Genome-wide association study
HDL:: High-density lipoprotein
Ht:: Height
LD:: Linkage disequilibrium
LDL:: Low-density lipoprotein
MAF:: Minor allele frequency
PC:: Principal components
PRS:: Polygenic risk score
QBB:: Qatar Biobank
SBP:: Systolic blood pressure
SNP:: Single nucleotide polymorphism
T2DM:: Type 2 diabetes mellitus
TG:: Triglycerides
US:: United States

References

Almarri MA, Haber M, Lootah RA, Hallast P, Al Turki S, Martin HC, Xue Y, Tyler-Smith C. The genomic history of the Middle East. Cell. 2021;184(18):4612-4625.e14. https://doi.org/10.1016/j.cell.2021.07.013.
Article CAS PubMed PubMed Central Google Scholar
Chakhtoura M, Rahme M, Chamoun N, El-Hajj Fuleihan G. Vitamin D in the Middle East and North Africa. Bone Reports. 2018;8:135–46. https://doi.org/10.1016/j.bonr.2018.03.004.
Article PubMed PubMed Central Google Scholar
El-Sayed AM, Tracy M, Scarborough P, Galea S. Ethnic inequalities in mortality: the case of Arab-Americans. PLoS ONE. 2011;6(12): e29185. https://doi.org/10.1371/journal.pone.0029185.
Article CAS PubMed PubMed Central Google Scholar
The World Bank Open Data, Cause of death, by non-communicable disease (% of total)—Middle East & North Africa. (n.d.). . Retrieved April 24, 2023, from https://data.worldbank.org/indicator/SH.DTH.NCOM.ZS?locations=ZQ
Patel AP, Wang M, Kartoun U, Ng K, Khera AV. Quantifying and understanding the higher risk of atherosclerotic cardiovascular disease among South Asian individuals: results from the UK Biobank prospective cohort study. Circulation. 2021;144(6):410–22. https://doi.org/10.1161/CIRCULATIONAHA.120.052430.
Article CAS PubMed PubMed Central Google Scholar
Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51(4):584–91. https://doi.org/10.1038/s41588-019-0379-x.
Article CAS PubMed PubMed Central Google Scholar
Thareja G, Al-Sarraj Y, Belkadi A, Almotawa M, Suhre K, Albagha OME. Whole genome sequencing in the Middle Eastern Qatari population identifies genetic associations with 45 clinically relevant traits. Nat Commun. 2021;12(1):1250. https://doi.org/10.1038/s41467-021-21381-3.
Article CAS PubMed PubMed Central Google Scholar
Mbarek H, Ismail SI. Mapping the Arab genome. Nat Genet. 2022;54(12):1761–3. https://doi.org/10.1038/s41588-022-01239-0.
Article CAS PubMed Google Scholar
Abou Tayoun AN, Rehm HL. Genetic variation in the Middle East-an opportunity to advance the human genetics field. Genome Medicine. 2020;12(1):116. https://doi.org/10.1186/s13073-020-00821-7.
Article PubMed PubMed Central Google Scholar
Ken Suzuki, Konstantinos Hatzikotoulas, Lorraine Southam, Henry J. Taylor, Xianyong Yin, Kim M. Lorenz, Ravi Mandla, Alicia Huerta-Chagoya, Nigel W. Rayner, Ozvan Bocher, S. V. Arruda Ana Luiza de, Kyuto Sonehara, Shinichi Namba, Simon S. K. Lee, Michael H. Preuss, Lauren E. Petty, Philip Schroeder, Brett Vanderwerff, Mart Kals, … Eleftheria Zeggini. (2023). Multi-ancestry genome-wide study in > 2.5 million individuals reveals heterogeneity in mechanistic pathways of type 2 diabetes and complications. MedRxiv, 2023.03.31.23287839. https://doi.org/10.1101/2023.03.31.23287839

Download references

Acknowledgements

None.

Funding

This study was supported by a sponsored research agreement between King Faisal Specialist Hospital and Research Center and Massachusetts General Hospital. RB is funded by the John S. LaDue Fellowship in Cardiovascular Disease Research at Harvard Medical School. HK, NC, and XG are supported by the King Abdullah University of Science and Technology (KAUST) Office of Research Administration (ORA) under Award No FCC/1/1976–44-01, FCC/1/1976–45-01, REI/1/5202–01-01, REI/1/5234–01-01, REI/1/4940–01-01, RGC/3/4816–01-01, and REI/1/0018–01-01. IS is supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, South Korea (HI19C1328). ACF is supported by a grant from the National Heart, Lung, and Blood Institute (1K08HL161448).

Author information

Romit Bhattacharya and NingNing Chen are co-first authors.
Xin Gao, Fowzan S. Alkuraya, and Akl C. Fahed jointly supervised this work.

Authors and Affiliations

Department of Medicine, Cardiovascular Research Center, Massachusetts General Hospital, Harvard Medical School, 185 Cambridge Street | CPZN 3.128, Boston, MA, 02114, USA
Romit Bhattacharya & Akl C. Fahed
Department of Medicine, Center for Genomic Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
Romit Bhattacharya, Injeong Shim & Akl C. Fahed
Cardiovascular Disease Initiative, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Romit Bhattacharya, Injeong Shim & Akl C. Fahed
Computational Biosciences Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
NingNing Chen, Hiroyuki Kuwahara & Xin Gao
Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
NingNing Chen, Hiroyuki Kuwahara & Xin Gao
Department of Digital Health, Samsung Medical Center, Samsung Advanced Institute for Health Sciences & Technology, Sungkyunkwan University, Seoul, Republic of South Korea
Injeong Shim
Department of Translational Genomics, Center for Genomic Medicine, King Faisal Specialist Hospital and Research Center, Riyadh, Kingdom of Saudi Arabia
Fowzan S. Alkuraya

Authors

Romit Bhattacharya
View author publications
You can also search for this author in PubMed Google Scholar
NingNing Chen
View author publications
You can also search for this author in PubMed Google Scholar
Injeong Shim
View author publications
You can also search for this author in PubMed Google Scholar
Hiroyuki Kuwahara
View author publications
You can also search for this author in PubMed Google Scholar
Xin Gao
View author publications
You can also search for this author in PubMed Google Scholar
Fowzan S. Alkuraya
View author publications
You can also search for this author in PubMed Google Scholar
Akl C. Fahed
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

RB wrote the manuscript and contributed to study data interpretation. NC analyzed and interpreted the data from the GWAS Catalog and contributed to writing. IS, HK, and XG contributed to data collection and analysis. FSA performed critical review of the manuscript and contributed to study design and oversight of the study. ACF oversaw the study design, data analysis, and manuscript preparation. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Fowzan S. Alkuraya or Akl C. Fahed.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have reviewed the text and consent to publication.

Competing interests

RB is a medical advisor to Casana Care Inc, unrelated to the present work. ACF is a co-founder of Goodpath and has received a grant from Abbott unrelated to the present work. The remaining authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Bhattacharya, R., Chen, N., Shim, I. et al. Massive underrepresentation of Arabs in genomic studies of common disease. Genome Med 15, 99 (2023). https://doi.org/10.1186/s13073-023-01254-8

Download citation

Received: 22 September 2023
Accepted: 08 November 2023
Published: 22 November 2023
DOI: https://doi.org/10.1186/s13073-023-01254-8

Massive underrepresentation of Arabs in genomic studies of common disease

Abstract

Arabs, common complex disease, and healthcare disparities

Massive and persistent underrepresentation of Arabs in genome-wide association studies

Missed opportunities in disease prevention and discovery

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Keywords

Genome Medicine

Contact us

Massive underrepresentation of Arabs in genomic studies of common disease

Abstract

Arabs, common complex disease, and healthcare disparities

Massive and persistent underrepresentation of Arabs in genome-wide association studies

Missed opportunities in disease prevention and discovery

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Genome Medicine

Contact us