“A Google for DNA”: Scientists Launch Groundbreaking Search Engine for Genetic Code

rael-science

Oct 19, 2025, 1:34:44 PM

to rael science

The Raelian Movement

for those who are not afraid of the future : http://www.rael.org

Get Rael-Science on Facebook: http://www.facebook.com/raelscience
Get Rael-Science on Twitter: https://twitter.com/rael_science

Source : https://scitechdaily.com/a-google-for-dna-scientists-launch-groundbreaking-search-engine-for-genetic-code/

By ETH Zurich, October 12, 2025

Sequencing has filled global archives with vast DNA and RNA reads, but finding signals in that noise has remained out of reach. ETH Zurich’s MetaGraph turns raw sequences into a compressed, full-text index, enabling near-instant matches that could speed research on pathogens, resistance, and more. Credit: Shutterstock

A new tool developed at ETH Zurich, MetaGraph, allows scientists to search through vast public DNA and RNA databases in seconds — like a “Google for DNA.”

DNA sequencing has transformed biomedical research, making it possible to identify rare hereditary disorders in patients and pinpoint specific mutations within tumor cells. Over the past several years, newer techniques (next-generation sequencing) have fueled remarkable scientific progress. During 2020 and 2021, for instance, these methods allowed researchers to rapidly decode and monitor the SARS-CoV-2 genome worldwide.

At the same time, an increasing number of scientists are sharing their sequencing results publicly. This openness has led to the accumulation of enormous datasets stored in major repositories such as the American SRA (Sequence Read Archive) and the European ENA (European Nucleotide Archive). These databases now contain around 100 petabytes of information, comparable to the total amount of text available across the internet. (One petabyte equals one million gigabytes.)

Until recently, searching through these immense archives to compare DNA sequences required vast computing resources, making efficient analysis nearly impossible. Researchers at ETH Zurich have now developed a solution to overcome this challenge.

Full-text search instead of downloading entire data sets

The scientists have developed a method that makes this search much faster and easier. The digital tool “MetaGraph” searches the raw data of all DNA or RNA sequences stored in the databases, much like a conventional Internet search engine. After entering a sequence of interest as full text into a search field, researchers can find out within seconds or minutes, depending on the query, where it has already appeared.
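
As a rough illustration of what full-text search over raw sequences can mean, the toy Python sketch below breaks each stored sequence into short overlapping substrings (k-mers), records which sample each k-mer came from, and then looks up a query. This is not MetaGraph's actual implementation or interface: the function names, the sample data and the choice of k = 4 are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer to the set of sample names that contain it."""
    index = defaultdict(set)
    for name, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def search(index, query, k):
    """Return the samples that contain every k-mer of the query."""
    hits = None
    for kmer in kmers(query, k):
        found = index.get(kmer, set())
        hits = set(found) if hits is None else hits & found
    return hits or set()


# Hypothetical toy data, not real archive records.
SAMPLES = {
    "sample_A": "ACGTACGTGA",
    "sample_B": "TTGACGTGAC",
    "sample_C": "GGGGCCCCAA",
}
INDEX = build_index(SAMPLES, k=4)
RESULT = search(INDEX, "ACGTGA", k=4)
print(sorted(RESULT))
```

Note that a sample sharing all of a query's k-mers is only a candidate match: the k-mers could in principle occur in a different order, so a real system would need a further verification or alignment step.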

“It’s a kind of Google for DNA,” says Professor Gunnar Rätsch, a data scientist at the Department of Computer Science at ETH Zurich. Until now, researchers could only search the databases by their descriptive metadata; to access the raw data, they had to download the respective data sets. These searches were incomplete, time-consuming and expensive.

MetaGraph is also comparatively inexpensive, as the researchers state in their study. The representation of all public biological sequences would fit on a few computer hard drives, and larger queries should cost no more than $0.74 per megabase.
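
To put the quoted price in perspective, the short Python sketch below computes an upper bound on the cost of a query from its length. It relies only on the figure given above and on the fact that one megabase is one million bases; the function name is hypothetical, and actual charges may well be lower, since the article gives the figure as a ceiling for larger queries.

```python
COST_PER_MEGABASE_USD = 0.74  # upper bound quoted in the article
BASES_PER_MEGABASE = 1_000_000


def max_query_cost_usd(query_length_bases):
    """Upper bound on query cost in US dollars (hypothetical helper)."""
    megabases = query_length_bases / BASES_PER_MEGABASE
    return megabases * COST_PER_MEGABASE_USD


# Example: a query totalling five million bases.
EXAMPLE_COST = max_query_cost_usd(5_000_000)
print(f"at most ${EXAMPLE_COST:.2f}")
```

Under these assumptions, a query of five million bases would cost at most about 3.70 dollars.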

Because the DNA search engine is also both precise and efficient, it can help to accelerate genetic research, for example into little-researched pathogens or new pandemics. The tool could also become a catalyst for research into antibiotic resistance, for instance by identifying resistance genes in the databases, or bacteriophages, the useful viruses that can destroy bacteria.

Compression by a factor of 300

In the study, published on 8 October in the journal Nature, the ETH researchers demonstrate how MetaGraph works: the tool indexes the data and presents it in compressed form. It does so using complex mathematical graphs that give the data a better structure, loosely comparable to the way spreadsheet programs such as Excel organize data into rows and columns. “Mathematically speaking, it is a huge matrix with millions of columns and trillions of rows,” says Rätsch.
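
The matrix picture can be made concrete with a toy Python sketch. It assumes, for illustration only, that each row stands for a short subsequence (a k-mer) and each column for a sample, and it stores only the cells that are actually set. The names, the data and this row/column interpretation are assumptions, not a description of MetaGraph's internal format.

```python
def build_sparse_matrix(samples, k):
    """Return (column names, {k-mer row: set of column ids that are set})."""
    columns = sorted(samples)
    col_id = {name: j for j, name in enumerate(columns)}
    rows = {}
    for name, seq in samples.items():
        for i in range(len(seq) - k + 1):
            rows.setdefault(seq[i:i + k], set()).add(col_id[name])
    return columns, rows


# Hypothetical toy data, not real archive records.
SAMPLES = {"s1": "ACGTAC", "s2": "CGTACG", "s3": "TTTTTT"}
COLUMNS, ROWS = build_sparse_matrix(SAMPLES, k=3)

DENSE_CELLS = len(ROWS) * len(COLUMNS)            # cells in a full table
NONZERO = sum(len(cols) for cols in ROWS.values())  # cells actually stored

print(f"{len(ROWS)} rows x {len(COLUMNS)} columns")
print(f"dense cells: {DENSE_CELLS}, stored entries: {NONZERO}")
```

Even in this tiny example, fewer entries are stored than a full table would hold. With millions of columns and trillions of rows, writing out every cell would be impractical, which is one reason a compact representation matters.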

Making large amounts of data searchable with the help of indexes is standard practice in computer science. What is new about the ETH researchers’ work is the complex linking of raw data and metadata and the compression by a factor of about 300. The result resembles a book summary: it no longer contains every word, but all the main storylines and connections remain intact. It is more compact, yet loses no relevant information.

“We are pushing the limits of what is possible in order to keep the data sets as compact as possible without losing necessary information,” says Dr André Kahles, who, like Rätsch, is a member of the Biomedical Informatics Group at ETH Zurich. Unlike other DNA search tools currently under development, the ETH researchers’ approach is scalable: the larger the amount of data queried, the less additional computing power the tool requires.

Half of the data is already available now

The ETH researchers first presented MetaGraph in 2020 and have been continuously improving it ever since. The tool is already available for queries. It provides a full-text search engine for millions of sequence sets from DNA and RNA, as well as proteins from viruses, bacteria, fungi, plants, animals, and humans. At present, just under half of the sequence data sets available worldwide are indexed. According to Gunnar Rätsch, the rest should follow by the end of the year. Because MetaGraph is available as open source, it could also be of interest to pharmaceutical companies that have large amounts of internal research data.

Kahles even believes it is possible that the DNA search engine will one day be used by private individuals: “In the early days, even Google didn’t know exactly what a search engine was good for. If the rapid development in DNA sequencing continues, it may become commonplace to identify your balcony plants more precisely.”

Reference: “Efficient and accurate search in petabase-scale sequence repositories” by Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Oleksandr Kulkov, Marc Zimmermann, Christopher Barber, Gunnar Rätsch and André Kahles, 8 October 2025, Nature.
DOI: 10.1038/s41586-025-09603-w