Similarity Level


Desmond Hutchins

Aug 5, 2024, 12:43:23 AM
to gracticalwea
A 100% match means the assignment has no original work; it has most probably been submitted to Turnitin before. This can happen if the student is re-submitting their work and the file had already been added to the Turnitin database, or if they submitted to another assignment area by mistake. It can also indicate collusion or copying an essay from another student, whether in their class, from a previous year, or at another institution.

It is not uncommon to see this in a long assignment where the matches are made up of quotes or commonly used phrases. Filtering the bibliography and quotations may help to remove some of these and reveal matches of interest.

A large match to a single source.


This report shows a similarity score of 21%. There are a couple of larger matches to single sources; the sources with the larger percentages will need to be investigated to ensure they are referenced correctly.
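The quotation-filtering idea mentioned above can be sketched in a few lines. This is a toy illustration only: the regex and the essay text are invented, and real similarity checkers match against indexed source spans rather than applying a simple rule like this.

```python
import re

def strip_quotes(text):
    # Remove double-quoted spans so direct quotations are not
    # counted toward the similarity score.
    return re.sub(r'"[^"]*"', " ", text)

essay = 'My own analysis of "to be or not to be" follows here.'
print(strip_quotes(essay))  # quoted span removed, original words kept
```

After filtering, only the student's own wording remains to be compared, which is why excluding quotes and the bibliography can reveal the matches that actually matter.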




As global demand for digital storage capacity grows, storage technologies based on synthetic DNA have emerged as a dense and durable alternative to traditional media. Existing approaches leverage robust error correcting codes and precise molecular mechanisms to reliably retrieve specific files from large databases. Typically, files are retrieved using a pre-specified key, analogous to a filename. However, these approaches lack the ability to perform more complex computations over the stored data, such as similarity search: e.g., finding images that look similar to an image of interest without prior knowledge of their file names. Here we demonstrate a technique for executing similarity search over a DNA-based database of 1.6 million images. Queries are implemented as hybridization probes, and a key step in our approach was to learn an image-to-sequence encoding ensuring that queries preferentially bind to targets representing visually similar images. Experimental results show that our molecular implementation performs comparably to state-of-the-art in silico algorithms for similarity search.


The sequences of the PCR primers needed to retrieve a specific file are analogous to a filename, in that they must be stored and remembered separately from the data itself. In database terms, this is referred to as key-based retrieval. Although key-based retrieval might be sufficient for a well-maintained library or archive, modern search and recommendation systems do not assume users know the exact key of the document they are looking for, and thus make heavy use of content-based retrieval. For instance, they allow users to search for words, phrases or even topics that occur within documents, or enable them to use an image as a query to retrieve visually similar images.
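The distinction between the two retrieval paradigms can be made concrete with a small sketch. All filenames and feature vectors below are invented; key-based retrieval is an exact lookup, while content-based retrieval ranks every stored item by distance to the query.

```python
import math

# Hypothetical toy database: filename (the "key") -> feature vector.
database = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.8, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}

def key_based_retrieval(key):
    # Exact lookup: the caller must already know the filename.
    return database[key]

def content_based_retrieval(query_vec, k=2):
    # Rank every stored item by Euclidean distance to the query vector;
    # no key is needed, only content (here, a feature vector).
    def dist(name):
        return math.dist(query_vec, database[name])
    return sorted(database, key=dist)[:k]

print(key_based_retrieval("car.jpg"))
print(content_based_retrieval([0.9, 0.15, 0.0]))  # nearest images first
```

The in silico version above scans the whole database per query; the point of the molecular approach is that hybridization performs this comparison against all stored items in parallel.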


Executing key-based retrieval in a DNA database leverages DNA hybridization to perform parallel molecular computing: the PCR primers associated with a particular file are programmed to bind with their intended reverse complements, even in the presence of many millions of potentially off-target sequences. Early formulations of DNA databases17,18,19,20 proposed that hybridization could also be used to search through the content of the documents in the database. However, these approaches require that semantically similar documents are represented by similar sequences, and this is not possible in a DNA database that allows storage of arbitrary digital data. For instance, while a pair of JPEG and PNG files may represent visually similar images, we cannot rely on their binary encoding (or their DNA encoding) to be similar. The same is true for text or other media that may be encoded or compressed in unpredictable ways.
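The binding rule underlying key-based retrieval can be sketched with an idealized model: a probe binds a strand if the strand contains the probe's reverse complement. Real hybridization is thermodynamic and tolerates partial matches; the exact-match rule and sequences below are simplifications for illustration.

```python
# Watson-Crick base pairing: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    # Complement each base, then reverse, since strands bind antiparallel.
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def hybridizes(probe, strand):
    # Idealized model: exact reverse-complement match somewhere on the strand.
    return reverse_complement(probe) in strand

primer = "ACGTTG"
print(reverse_complement(primer))        # CAACGT
print(hybridizes(primer, "TTCAACGTAA"))  # True
```

In a real database the primer must bind its intended target while avoiding millions of off-target sequences, which is why primer sequences are designed to be mutually dissimilar.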


Designing a universal encoding from feature vectors to DNA sequences is difficult because of the high dimensionality of both feature space and DNA sequence space, and because of the nonlinear nature of DNA hybridization. An alternative is to use machine learning to optimize the encoding for a particular dataset. Early researchers19 achieved this by clustering the dataset using k-means, then mapping each cluster center to a known DNA sequence, which is assigned to each document in that cluster. By reducing content-based retrieval to exact key-based retrieval, their approach sidesteps any issues with unwanted DNA hybridization. However, there is no notion of more or less similar within a cluster: every item in the cluster is retrieved, regardless of its distance to the query. Additionally, once the clusters are chosen, they are static; any additional data added to the database must fit into the existing clusters, even if it would cause the cluster centers to change.
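The cluster-based scheme described above can be sketched as follows. The cluster centers and address sequences are invented; the point is that every document in a cluster shares one address, so within-cluster distance to the query is lost.

```python
import math

# Fixed cluster centers (from a prior k-means run, here invented),
# each mapped to a known DNA address sequence.
centers = {
    "TTGGAACC": [0.0, 0.0],  # address sequence for cluster 0
    "AACCGGTT": [1.0, 1.0],  # address sequence for cluster 1
}

def assign_address(feature_vec):
    # Every document is tagged with the address of its nearest center,
    # reducing content-based retrieval to exact key-based retrieval.
    # Retrieval is then all-or-nothing per cluster.
    def dist(addr):
        return math.dist(feature_vec, centers[addr])
    return min(centers, key=dist)

print(assign_address([0.1, 0.2]))  # TTGGAACC
print(assign_address([0.9, 0.7]))  # AACCGGTT
```

Note that the two feature vectors above are at very different distances from their cluster centers, yet retrieval treats them identically; this is the limitation a continuous feature-to-sequence encoding is meant to overcome.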


In this work, we greatly expand upon our prior proof-of-principle work25 and show how to scale up the computational workflow and molecular image search from tens of images to over 1.6 million. We show a path toward overcoming the limitations of fixed clusters by using machine learning techniques to create a continuous feature-to-sequence encoding that preserves similarity. Crucially, the in silico learning step happens only once, and its cost is amortized over the lifetime of the database. As with the feature extractor, the trained encoder can be applied to new documents not seen during the learning process, provided they share an underlying distribution (e.g., images of the natural world). The trained encoder must translate a new item or a query into its corresponding DNA sequence in silico, but the rest of the computation, which accounts for most of the work in a search query, is carried out molecularly. This allows new items to be freely added or used as queries without retraining, and all of our experiments were performed with documents that were not seen during training.


As in our prior work, we focus on encoding feature vectors derived from images, because large datasets and feature extractors are readily available, and similarity between images is easy to visualize. However, our approach can be applied to any type of media, as long as an appropriate feature extractor is available. We use OpenImages26,27, a collection of roughly 9 million images, and the FC2 intermediate layer of VGG1622, a convolutional neural network designed for image classification, to extract feature vectors. Unlike our prior work, we do not reduce the dimensionality of the VGG16-FC2 vectors prior to encoding. As shown in Fig. 2A, the encoder is a fully connected neural network with one hidden layer that directly translates the feature vectors into softmax-encoded DNA sequences that are 80 nucleotides in length, where each position is represented numerically by a four-channel vector (one channel for each possible base) whose components sum to one. This is a continuous approximation of a one-hot encoding, where one of the four channels would have the value one, and the rest would have the value zero. A continuous approximation is necessary because neural networks must have differentiable operations in order to be efficiently trainable via gradient descent. However, because the softmax encoding is continuous, the encoder may output an indeterminate base for a particular position (for instance, 75% A and 25% G). We do not treat this as a probabilistically random base; to output a sequence from a softmax encoding, we treat it as if it were one-hot and simply take the bases with the maximum values. To encourage the softmax-encoded sequences to have a high maximum value, indeterminate outputs are penalized during training. The goal of the encoder is to map feature vectors to DNA sequences such that a pair of neighboring feature vectors will produce a pair of sequences that are likely to hybridize when one of the sequences is reverse complemented.
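The decoding and penalty steps described above can be sketched directly. The softmax values below are invented; decoding takes the argmax channel at each position ("treat it as if it were one-hot"), and the penalty shown here is one simple way to discourage indeterminate outputs, not necessarily the paper's exact training loss.

```python
# One channel per possible base, in a fixed order.
BASES = "ACGT"

def decode(softmax_seq):
    # softmax_seq: list of 4-channel vectors, one per nucleotide position,
    # each summing to one. Emit the base with the maximum value.
    return "".join(BASES[ch.index(max(ch))] for ch in softmax_seq)

def indeterminacy_penalty(softmax_seq):
    # Grows as the winning channel falls below 1.0, i.e., as the output
    # drifts away from a clean one-hot encoding.
    return sum(1.0 - max(ch) for ch in softmax_seq) / len(softmax_seq)

enc = [
    [0.75, 0.00, 0.25, 0.00],  # indeterminate: 75% A, 25% G -> decoded as A
    [0.00, 1.00, 0.00, 0.00],  # fully determinate C
]
print(decode(enc))                 # AC
print(indeterminacy_penalty(enc))  # 0.125
```

In the actual encoder, the softmax relaxation is what keeps every operation differentiable, so the penalty can be minimized by gradient descent along with the similarity-preservation objective.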


Dashed gray and dashed-and-dotted gray lines represent chance performance and perfect performance, respectively. Not all of the algorithms could produce results towards the lower-left (low recall and low proportion retrieved). We assume these algorithms could be stopped early to produce fewer results with a linear decrease in recall; dashed continuations represent these linear interpolations.


To investigate the effect that increasing the database size might have on search performance, we ran NUPACK simulations on a database of 5.5 million additional images from OpenImages. Figure 5 shows that the highest simulated yields (which should correspond to the most sequencing reads in laboratory experiments) are reserved for images that are visually similar to the query, indicating that aggressive filtering is possible even in larger databases.


Here, we introduced an approach for performing massively parallel molecular image search. In our technique, data are encoded and stored in such a way that the storage substrate, synthetic DNA, also behaves as a computational element by performing DNA hybridization. Computer architects refer to this as in-memory computing or near-data processing, because it avoids the bottleneck of shuttling data between memory and the CPU. It is not a general-purpose computing paradigm, but it is still very powerful because it is capable of efficient parallel computation over high-dimensional data. This basic mechanism can be generalized to broader tasks such as pattern classification and time series analysis34.


A limitation of our approach is that the search paradigm (e.g., visual similarity search) is fixed when the database is created, so a user is limited in how they can search, but not in what they can search for. For example, we did not know that there would be images of tuxedo cats in the database before we conducted our search; the query image came from outside the database. We did, however, know that we were using a query image to search for similar images. Another potential limitation is the long latency (minutes to hours) to complete a single query. However, it is possible to compensate for this and achieve high throughput through batch processing. Furthermore, given a sufficiently large electronic database (e.g., one that does not fit in memory), a single query could require comparably long latency and significant energy consumption.
