For over two decades, the human reference genome has served as a crucial foundation for genetic research, enabling scientists to identify genes associated with specific diseases and trace the evolution of human traits.
However, this critical resource has also been recognized as a flawed tool due to its limited representation of human genetic diversity.
Now, researchers with the Human Pangenome Reference Consortium have made groundbreaking progress in capturing the fraction of human DNA that varies between individuals, addressing this long-standing issue.
The human reference genome, a consensus genetic sequence, has been the standard against which other genetic data is compared. The problem with this standard is that approximately 70 percent of its data came from a single individual of predominantly African-European background, whose DNA was sequenced during the Human Genome Project.
Consequently, the reference genome fails to provide sufficient insight into the 0.2 to one percent of genetic sequence that distinguishes each of the roughly eight billion people on our planet. This inherent bias in biomedical data is believed to contribute to the health disparities affecting patients today, with many genetic variants found in non-European populations not represented in the reference genome at all.
Researchers have long called for a more inclusive resource to better diagnose diseases and guide medical treatments. This demand has been answered by the Human Pangenome Reference Consortium, which recently published a study in the journal Nature.
In their research, the experts assembled genomic sequences of 47 individuals from diverse populations around the world into a pangenome. This new resource renders more than 99 percent of each sequence with high accuracy, revealing nearly 120 million DNA base pairs that had previously gone unseen.
Though still a work in progress, the pangenome is already publicly available and can be used by scientists worldwide as a new standard human genome reference.
According to Erich D. Jarvis, one of the primary investigators from The Rockefeller University, “This complex genomic collection represents significantly more accurate human genetic diversity than has ever been captured before.”
Jarvis further explains the implications of this breakthrough, stating that “with a greater breadth and depth of genetic data at their disposal, and greater quality of genome assemblies, researchers can refine their understanding of the link between genes and disease traits, and accelerate clinical research.”
Since its completion in 2003, the first draft of the human genome has gone through significant refinements, with gaps being filled, errors corrected, and sequencing technology advancing.
In 2022, researchers achieved another milestone by sequencing the final eight percent of the genome, which primarily consists of tightly coiled, non-coding DNA and repetitive DNA regions. Despite these advancements, the reference genome remained imperfect, particularly concerning the critical 0.2 to one percent of DNA representing human diversity.
To address this issue, the Human Pangenome Reference Consortium (HPRC) was launched in 2019. This government-funded collaboration between over a dozen research institutions in the United States and Europe sought to improve our understanding of human genetic diversity.
One of the consortium’s leaders, Erich D. Jarvis, was developing advanced sequencing and computational methods through the Vertebrate Genomes Project, which aims to sequence all 70,000 vertebrate species. Jarvis and other collaborating labs decided to apply these advances to reveal the variation within a single vertebrate: Homo sapiens.
To collect a variety of samples, the researchers turned to the 1000 Genomes Project, a public database of sequenced human genomes from more than 2,500 individuals representing 26 geographically and ethnically diverse populations. Most of the samples come from Africa, home to the planet’s greatest human genetic diversity.
“In many other large human genome diversity projects, the scientists selected mostly European samples,” said Jarvis. “We made a purposeful effort to do the opposite. We were trying to counteract the biases of the past.” It is likely that gene variants that could inform our knowledge of both common and rare diseases can be found among these populations.
To broaden the gene pool, the researchers had to create crisper, clearer sequences of each individual. They used the approaches developed by members of the Vertebrate Genomes Project and associated consortia to solve a longstanding technical problem in the field.
Every person inherits one genome from each parent, resulting in two copies of every chromosome, known as a diploid genome. When a person’s genome is sequenced, separating parental DNA can be challenging.
Older techniques and algorithms often made errors when merging the genetic data from an individual’s two parents, producing a blurred picture of the genome. As Jarvis puts it, “The differences between mom’s and dad’s chromosomes are bigger than most people realize. Mom may have 20 copies of a gene and dad only two.”
To avoid confusion when representing many genomes in a pangenome, the HPRC adopted a method developed by Adam Phillippy and Sergey Koren at the National Institutes of Health. They focused on parent-child “trios”—a mother, a father, and a child whose genomes had all been sequenced.
Using the data from the parents, they were able to clarify the lines of inheritance and arrive at a higher-quality sequence for the child, which they then used for pangenome analysis.
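The idea of using parental data to sort a child's DNA can be illustrated with a toy sketch. It assumes one common formulation of this kind of approach, often called trio binning: short DNA "words" (k-mers) found in only one parent are used to assign each of the child's sequencing reads to the maternal or paternal copy. The function names, the tiny sequences, and the word length k=5 are all hypothetical choices for illustration. This is not the consortium's actual pipeline, which works on real sequencing data at a vastly larger scale.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_read(read, maternal_only, paternal_only, k):
    """Assign one child read by counting parent-specific k-mers it contains."""
    read_kmers = kmers(read, k)
    m_hits = len(read_kmers & maternal_only)
    p_hits = len(read_kmers & paternal_only)
    if m_hits > p_hits:
        return "maternal"
    if p_hits > m_hits:
        return "paternal"
    return "unassigned"  # tie, or no parent-specific k-mers at all


def trio_bin(child_reads, mother_seqs, father_seqs, k=5):
    """Sort a child's reads into maternal, paternal, and unassigned bins."""
    mother_kmers = set().union(*(kmers(s, k) for s in mother_seqs))
    father_kmers = set().union(*(kmers(s, k) for s in father_seqs))
    # Only k-mers seen in exactly one parent are informative.
    maternal_only = mother_kmers - father_kmers
    paternal_only = father_kmers - mother_kmers
    bins = {"maternal": [], "paternal": [], "unassigned": []}
    for read in child_reads:
        bins[bin_read(read, maternal_only, paternal_only, k)].append(read)
    return bins


# Hypothetical toy data: the parents differ at a single base.
bins = trio_bin(
    child_reads=["GTACGTAA", "CGTTCGTA", "CGTAA"],
    mother_seqs=["ACGTACGTAA"],
    father_seqs=["ACGTTCGTAA"],
)
print(bins)
```

In this toy case the first read carries only mother-specific words, the second only father-specific ones, and the third consists of a word both parents share, so it stays unassigned. Once reads are sorted this way, each parental copy can be assembled separately.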
Their analysis of 47 individuals yielded 94 distinct genome sequences, one for each of the two sets of chromosomes a person inherits, plus the Y sex chromosome in males.
Using advanced computational techniques, the researchers aligned and layered the 94 sequences, uncovering approximately 120 million DNA base pairs that were previously unseen or located differently than recorded in the previous reference genome.
Around 90 million of these base pairs derive from structural variations, which are differences in people’s DNA that occur when chunks of chromosomes are rearranged, moved, deleted, inverted, or duplicated.
Jarvis emphasizes the importance of this discovery, as structural variants have been found to play a significant role in human health and population-specific diversity.
“They can have dramatic effects on trait differences, disease, and gene function,” he says. “With so many new ones identified, there’s going to be a lot of new discoveries that weren’t possible before.”
The pangenome assembly also fills in gaps caused by repetitive sequences or duplicated genes. One example is the major histocompatibility complex (MHC), a cluster of genes that code for proteins on cell surfaces that help the immune system recognize antigens, such as those from the SARS-CoV-2 virus.
Jarvis explains, “They’re really important, but it was impossible to study MHC diversity using the older sequencing methods. We’re seeing much greater diversity than we expected. This new information will help us understand how immune responses against specific pathogens vary among people.”
This could lead to improved methods for matching organ transplant donors with patients or identifying individuals at risk for autoimmune diseases.
The researchers also discovered surprising new characteristics of centromeres, which are critical for cell division and lie at the centers of chromosomes. Despite having highly repetitive DNA sequences, centromeres exhibit significant diversity from one haplotype to another. “The centromeres seem to be one of the most rapidly evolving parts of the chromosome,” Jarvis notes.
The current 47-person pangenome is just a starting point for the HPRC, whose ultimate goal is to produce high-quality, nearly error-free genomes from at least 350 individuals from diverse populations by mid-2024. This would allow the capture of rare alleles that confer essential adaptive traits, such as those related to oxygen use and UV light exposure in Tibetans living at high altitudes.
However, a significant challenge in collecting this data is gaining trust from communities that have seen past abuses of biological data. For instance, there are currently no samples in the study from Native American or Aboriginal peoples, who have historically been disregarded or exploited by scientific research.
Jarvis acknowledges the challenge. “It’s a complex situation that’s going to require a lot of relationship building. There’s greater sensitivity now,” he said.
Despite past mistrust, many groups are willing to participate in the project today. “There are individuals, institutions, and governmental bodies from different countries who are saying, ‘We want to be part of this. We want our population to be represented.’ We’re already making progress,” said Jarvis.
In conclusion, the development of the pangenome marks a significant step forward in addressing the limitations of the human reference genome. By providing a more accurate and inclusive representation of human genetic diversity, researchers can make strides in understanding the genetic underpinnings of diseases and ultimately improve the quality of medical care for all.
The Human Genome Project (HGP) was an international, collaborative research program whose primary goal was to determine the complete sequence of the human genome.
Launched in 1990 and completed in 2003, the HGP aimed to understand the genetic makeup of humans, identify the approximately 20,000-25,000 protein-coding genes, and map their locations on the 23 pairs of chromosomes that make up human DNA.
The project brought together researchers from multiple scientific disciplines, including genetics, molecular biology, and computer science. It led to significant advancements in DNA sequencing technologies, bioinformatics, and our understanding of the structure, organization, and function of the human genome.
One of the major outcomes of the HGP was the creation of the human reference genome, a consensus genetic sequence that serves as a standard against which other genetic data can be compared. The reference genome has been utilized in countless studies to identify genes implicated in specific diseases, trace the evolution of human traits, and better understand human biology.
However, the human reference genome has its limitations. One of its main drawbacks is that about 70 percent of its data came from a single individual of predominantly African-European background.
As a result, the reference genome can only provide limited information about the 0.2 to one percent of genetic sequences that make each of the roughly eight billion people on Earth different from each other. This limitation has led to an inherent bias in biomedical data, which is believed to contribute to health disparities affecting patients today.
Recognizing these limitations, researchers have called for a more inclusive resource that better captures human diversity for diagnosing diseases and guiding medical treatments.
Recent efforts, such as the Human Pangenome Reference Consortium, are working towards characterizing the fraction of human DNA that varies between individuals to create a more representative and accurate human reference genome.