A new tool developed at ETH Zurich, MetaGraph, allows scientists to search through vast public DNA and RNA databases in seconds — like a “Google for DNA.”
DNA sequencing has transformed biomedical research, making it possible to identify rare hereditary disorders in patients and pinpoint specific mutations within tumor cells. Over the past several years, newer techniques (next-generation sequencing) have fueled remarkable scientific progress. During 2020 and 2021, for instance, these methods allowed researchers to rapidly decode and monitor the SARS-CoV-2 genome worldwide.
At the same time, an increasing number of scientists are sharing their sequencing results publicly. This openness has led to the accumulation of enormous datasets stored in major repositories such as the American SRA (Sequence Read Archive) and the European ENA (European Nucleotide Archive). These databases now contain around 100 petabytes of information, comparable to the total amount of text available across the internet. One petabyte equals one million gigabytes.
Until recently, searching through these immense archives to compare DNA sequences required vast computing resources, making efficient analysis nearly impossible. Researchers at ETH Zurich have now developed a solution to overcome this challenge.
Their method greatly shortens and simplifies this search. The “MetaGraph” digital tool searches the raw data of all DNA or RNA sequences stored in the databases, much as a conventional Internet search engine searches web pages. After entering a sequence of interest as full text into a search field, researchers can find out within seconds or minutes, depending on the query, where it has already appeared.
“It’s a kind of Google for DNA,” says Professor Gunnar Rätsch, a data scientist in the Department of Computer Science at ETH Zurich. Until now, researchers could search the databases only by descriptive metadata. To access the raw data, they had to download the respective data sets. These searches were incomplete, time-consuming and expensive.
In the study published on 8 October in the journal Nature, the ETH researchers demonstrate how MetaGraph works: the tool indexes the data and presents it in compressed form. This is achieved by way of complex mathematical graphs that give the data a better structure, similar to the way spreadsheet programs such as Excel organize entries into rows and columns. “Mathematically speaking, it is a huge matrix with millions of columns and trillions of rows,” says Rätsch.
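To make the matrix picture concrete, here is a minimal Python sketch of a sequence index. It is not MetaGraph's implementation. It assumes, for illustration only, that each row stands for a short subsequence (a "k-mer") and each column for a sample, which the article does not spell out. The names `kmers`, `build_index` and `query`, the sample labels and the toy sequences are all hypothetical.

```python
from collections import defaultdict

K = 4  # subsequence length; an illustrative assumption, far smaller than in practice


def kmers(seq, k=K):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k=K):
    """Map each k-mer (a 'row') to the set of sample labels (the 'columns') containing it."""
    index = defaultdict(set)
    for label, seq in samples.items():
        for km in kmers(seq, k):
            index[km].add(label)
    return index


def query(index, seq, k=K):
    """Return, per sample, the fraction of the query's k-mers found in that sample."""
    query_kmers = list(kmers(seq, k))
    hits = defaultdict(int)
    for km in query_kmers:
        for label in index.get(km, ()):
            hits[label] += 1
    return {label: n / len(query_kmers) for label, n in hits.items()}


# Hypothetical toy data, not taken from any real archive.
samples = {"sample_A": "ACGTACGTGA", "sample_B": "TTGACGTTCC"}
index = build_index(samples)
result = query(index, "ACGTAC")
print(result)
```

In this toy example the query matches sample_A completely and sample_B only partly. A lookup touches only the rows for the query's own subsequences, which is the general reason an index can answer without rereading the raw data.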
The idea of rendering large amounts of data searchable with the help of indexes is standard practice in computer science research. What is new about the work of the ETH researchers, however, is the complex linking of raw data and metadata and the compression by a factor of about 300. The result is similar to a book summary: it no longer contains every word, but all the main storylines and connections remain intact. It is more compact, yet without any relevant loss of information.
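As a rough sense of scale, the short calculation below applies the quoted factor of about 300 to the roughly 100 petabytes mentioned earlier. Applying the factor uniformly to the whole archive is an assumption made here for illustration. The article does not state the resulting index size.

```python
GB_PER_PETABYTE = 1_000_000  # one petabyte = one million gigabytes (from the article)
raw_petabytes = 100          # approximate archive size quoted in the article
compression_factor = 300     # approximate factor quoted in the article

raw_gb = raw_petabytes * GB_PER_PETABYTE
# Assumption: the factor applies evenly to all of the raw data.
compressed_gb = raw_gb / compression_factor
compressed_tb = compressed_gb / 1_000

print(f"about {compressed_tb:.0f} terabytes")
```

Under that assumption the compressed data would come to a few hundred terabytes rather than a hundred petabytes.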
“We are pushing the limits of what is possible in order to keep the data sets as compact as possible without losing necessary information,” says Dr André Kahles, who, like Rätsch, is a member of the Biomedical Informatics Group at ETH Zurich. In contrast to other DNA search tools currently being researched, the ETH researchers’ approach is scalable. This means that the larger the amount of data queried, the less additional computing power the tool requires.
Reference: “Efficient and accurate search in petabase-scale sequence repositories” by Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Oleksandr Kulkov, Marc Zimmermann, Christopher Barber, Gunnar Rätsch and André Kahles, 8 October 2025, Nature.
DOI: 10.1038/s41586-025-09603-w