DNA holds the fundamental blueprint for life, encoding essential information for the development and functioning of organisms. Deciphering how this information is stored and organized has been a monumental scientific challenge.
A new tool, GROVER, developed by researchers at the Biotechnology Center (BIOTEC) of Dresden University of Technology, offers an unprecedented approach to understanding the complex language of DNA.
This large language model, trained on human DNA sequences, promises to revolutionize genomics and personalized medicine.
Since the discovery of the double helix structure, scientists have sought to unravel the mysteries encoded in DNA. Seventy years later, it is evident that the genome’s information is intricate and multilayered, with only 1-2% consisting of genes that code for proteins.
“DNA has many functions beyond coding for proteins. Some sequences regulate genes, others serve structural purposes, and most sequences serve multiple functions at once,” said senior author Anna Poetsch, a research group leader at BIOTEC.
“Currently, we don’t understand the meaning of most of the DNA. When it comes to understanding the non-coding regions of the DNA, it seems that we have only started to scratch the surface. This is where AI and large language models can help.”
Drawing inspiration from the success of large language models like GPT in natural language processing, Poetsch and her team approached DNA as a form of language.
“DNA is the code of life. Why not treat it like a language?” she asked. The team trained GROVER (short for Genome Rules Obtained via Extracted Representations) on a reference human genome, enabling the model to decode and extract biological meaning from DNA sequences.
“GROVER learned the rules of DNA. In terms of language, we are talking about grammar, syntax, and semantics. For DNA, this means learning the rules governing the sequences, the order of the nucleotides and sequences, and the meaning of the sequences,” said lead author Melissa Sanabria, a scientist at BIOTEC.
“Like GPT models learning human languages, GROVER has basically learned how to ‘speak’ DNA.”
GROVER can not only predict subsequent DNA sequences but also extract contextual information, such as identifying gene promoters or protein binding sites, and even understanding epigenetic processes – regulatory mechanisms that act on top of the DNA sequence.
“It is fascinating that by training GROVER with only the DNA sequence, without any annotations of functions, we are actually able to extract information on biological function. To us, it shows that the function, including some of the epigenetic information, is also encoded in the sequence,” said Sanabria.
To enable GROVER to interpret DNA, the team had to create a “DNA dictionary.” Unlike human languages, DNA has no predefined words; it consists of four letters (A, T, G, and C) that form sequences with various biological functions.
The researchers employed byte-pair encoding, a technique borrowed from compression algorithms, to identify common combinations of these letters.
“We analyzed the whole genome and looked for combinations of letters that occur most often. We started with two letters and went over the DNA, again and again, to build it up to the most common multi-letter combinations,” said Sanabria.
“In this way, in about 600 cycles, we have fragmented the DNA into ‘words’ that let GROVER perform the best when it comes to predicting the next sequence.”
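The procedure Sanabria describes, repeatedly merging the most frequent adjacent pair of symbols, can be illustrated with a short Python sketch. This is a toy illustration under stated assumptions, not the team's implementation: the function name `build_dna_vocabulary`, the tie-breaking rule, and the example sequence are all hypothetical, and a real run would operate on a whole genome rather than a few letters.

```python
from collections import Counter


def build_dna_vocabulary(sequence, cycles):
    """Greedy pair merging in the style of byte-pair encoding (toy sketch).

    Hypothetical helper for illustration only; not GROVER's actual code.
    Returns the learned vocabulary and the sequence split into tokens.
    """
    tokens = list(sequence)            # start from single nucleotides
    vocabulary = sorted(set(tokens))   # e.g. ['A', 'C', 'G', 'T']

    for _ in range(cycles):
        # Count every adjacent pair of tokens in the current segmentation.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                      # nothing repeats, so stop merging

        merged = left + right
        vocabulary.append(merged)

        # Replace non-overlapping occurrences of the pair, left to right.
        new_tokens = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens

    return vocabulary, tokens


# Two merge cycles on a made-up sequence.
vocab, words = build_dna_vocabulary("ATATATGC", cycles=2)
print(vocab)   # ['A', 'C', 'G', 'T', 'AT', 'ATAT']
print(words)   # ['ATAT', 'AT', 'G', 'C']
```

Each cycle adds one new "word" to the dictionary, so the number of cycles directly controls vocabulary size. According to Sanabria, about 600 cycles gave GROVER its best next-sequence predictions.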
The creation of this DNA dictionary was a crucial step that set GROVER apart from previous attempts to model DNA language. This innovative approach allows the model to uncover the deeper layers of genetic information, providing insights into human biology, disease predispositions, and treatment responses.
GROVER’s potential impact on genomics and personalized medicine is immense. By decoding the rules of DNA as one would a language, researchers can gain a more profound understanding of the biological meaning hidden within our genetic code.
“We believe that understanding the rules of DNA through a language model is going to help us uncover the depths of biological meaning hidden in the DNA, advancing both genomics and personalized medicine,” Poetsch said.
As GROVER continues to analyze and interpret genetic data, it holds the promise of unlocking the mysteries of our genome, paving the way for new discoveries and advancements in medical science.
This tool not only enhances our understanding of the human genome but also sets the stage for a new era in genomics, where AI-driven insights can lead to more personalized and effective treatments for a wide range of conditions.
The study is published in the journal Nature Machine Intelligence.