Prefetch Maximum File Size Download Limit Is 20 GB


Laurence Jabali

Aug 4, 2024, 12:03:16 PM
to perscerseven
The combination of prefetch + fasterq-dump is the fastest way to extract FASTQ files from SRA accessions. The prefetch tool downloads all necessary files to your computer. It can be invoked multiple times if the download did not succeed: it will not start from the beginning every time; instead, it will pick up from where the last invocation failed.
After the download, you have the option to test the downloaded data with the vdb-validate tool. After the successful download, there is no need for network-connectivity. You can move the folder created by prefetch to a different location to perform the conversion to the fastq-format somewhere else (for instance to a compute-cluster without internet access).
The prefetch tool downloads into a directory named after the accession. E.g. prefetch SRR000001 will create a directory named SRR000001 in the current directory. If you move the SRR000001 directory, make sure you don't rename it, as the conversion tool will need to find the directory under its original name.
The prefetch tool has a default maximum download size of 20 GB. If the requested accession is bigger than 20 GB, you will need to raise that limit; you can set it arbitrarily high, regardless of the size of the requested accession. You can query the size of an accession with the vdb-dump tool and its --info option. For instance, vdb-dump SRR000001 --info tells you how large this accession is (among other information). The accession SRR000001 has 932,308,473 bytes, which is below the default limit, so no further action is necessary. The accession SRR1951777, however, has 410,112,373,995 bytes; to download it you have to lift the limit above that size.
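For example, using the --max-size option of prefetch (the 500G value is an assumption; any value above the accession size works):

```shell
# check the accession size first
vdb-dump SRR1951777 --info
# raise the download limit above the ~410 GB accession size, then download
prefetch SRR1951777 --max-size 500G
```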
Before you perform the extraction, you should make a quick estimate of the hard-drive space required. The final fastq files will be approximately 7 times the size of the accession, and the fasterq-dump tool needs temporary space (scratch space) of about 1.5 times the size of the final fastq files during the conversion. Overall, the space you need during the conversion is approximately 17 times the size of the accession. You can check how much space you have by running the command df -h . ; under the 4th column (Avail), you see the amount of space available. Please take into consideration that there might be quotas set by your administrator which are not always visible. If the limit is exceeded, the fasterq-dump tool will fail and a message will be displayed.
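As a quick sketch of that arithmetic, using the SRR000001 size from above (the 7x and 1.5x factors are the approximations just given):

```shell
# size of SRR000001 in bytes, as reported by: vdb-dump SRR000001 --info
ACCESSION_BYTES=932308473
# final FASTQ output is roughly 7x the accession size
FASTQ_BYTES=$((ACCESSION_BYTES * 7))
# temporary scratch space is roughly 1.5x the FASTQ size
SCRATCH_BYTES=$((FASTQ_BYTES * 3 / 2))
# overall: roughly 17x the accession size
TOTAL_BYTES=$((FASTQ_BYTES + SCRATCH_BYTES))
echo "plan for about $TOTAL_BYTES bytes (~$((TOTAL_BYTES / 1024 / 1024 / 1024)) GB) free"
# compare against the Avail column of: df -h .
```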
This assumes that you have previously 'prefetched' the accession into the current working directory. If the directory SRR000001 is not there, the tool will try to access the accession over the network. This will be much slower and might eventually fail due to network timeouts.
Notice that you use the accession as a command line argument. The tool will use the current directory as scratch-space and will also put the output-files there. When finished, the tool will delete all temporary files it created. You will now have 3 files in your working directory:
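For example, for a paired-end accession (the exact file set depends on the data; this assumes SRR000001 was prefetched into the current directory):

```shell
# convert; the ./SRR000001 directory from prefetch is found automatically
fasterq-dump SRR000001
# split-3 (the default) typically produces, for paired-end data:
#   SRR000001_1.fastq   forward reads
#   SRR000001_2.fastq   reverse reads
#   SRR000001.fastq     unmated reads
```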
The fasterq-dump tool performs a split-3 operation by default. Note that fasterq-dump is not identical to the former fastq-dump tool with regard to command-line options.
A vdbcache file is needed to speed up the processing of some accessions; currently there is no way to download a missing vdbcache file on its own. If a vdbcache file is available remotely, it will be used. If there is no internet access and a vdbcache file exists for a given accession but was not downloaded, the conversion of the accession will take a significant amount of time.
By default, run fasterq-dump [options] in the same directory where you ran prefetch; the fastq files will be created in the current directory. Use the --outdir option if you want these output files to be created in a different directory.
If you need to move the result of the prefetch download, move the entire directory; don't rename it. Then cd into the parent directory of the moved directory and run the fasterq-dump tool there.
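Putting that together (the paths below are placeholders):

```shell
# move the whole directory produced by prefetch, keeping its name
mv SRR000001 /scratch/work/
# run the conversion from the new parent directory
cd /scratch/work
fasterq-dump SRR000001
```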
A RabbitMQ stream is a persistent, replicated data structure that can complete the same tasks as a queue: it buffers messages from producers that are read by consumers. However, streams differ from queues in two important ways: how messages are stored and how they are consumed.
Data in a stream can be used via a RabbitMQ client library or through a dedicated binary protocol plugin and associated client(s). The latter option is highly recommended as it provides access to all stream-specific features and offers the best possible throughput (performance).
Streams were not introduced to replace queues but to complement them. They open up many opportunities for new RabbitMQ use cases, which are described in Use Cases for Using Streams.
You should also review the stream plugin information to learn more about the usage of streams with the binary RabbitMQ Stream protocol and the stream core and stream plugin comparison page for the feature matrix.
When wanting to deliver the same message to multiple subscribers, users currently have to bind a dedicated queue for each consumer. If the number of consumers is large, this becomes potentially inefficient, especially when wanting persistence and/or replication. Streams will allow any number of consumers to consume the same messages from the same queue in a non-destructive manner, negating the need to bind multiple queues. Stream consumers will also be able to read from replicas, allowing read load to be spread across the cluster.
As all current RabbitMQ queue types have destructive consume behaviour, i.e. messages are deleted from the queue when a consumer is finished with them, it is not possible to re-read messages that have been consumed. Streams will allow consumers to attach at any point in the log and read from there.
Most RabbitMQ queues are designed to converge towards an empty state and are optimised as such, and can perform worse when there are millions of messages on a given queue. Streams are designed to store larger amounts of data in an efficient manner with minimal in-memory overhead.
To declare a stream, set the x-queue-type queue argument to stream (the default is classic). This argument must be provided by the client at declaration time; it cannot be set or changed using a policy, because a policy definition or the applicable policy can be changed dynamically, while the queue type cannot.
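As a sketch, a stream can also be declared from the command line with the management plugin's rabbitmqadmin tool (the queue name is a placeholder; host and credentials are omitted):

```shell
rabbitmqadmin declare queue name=my-stream durable=true \
  arguments='{"x-queue-type": "stream"}'
```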
Declaring a queue with the x-queue-type argument set to stream creates a stream with a replica on each configured RabbitMQ node. Streams are quorum systems, so an uneven (odd) cluster size is strongly recommended.
While this argument can be configured via a policy, it is only applied to the stream if the policy exists at stream declaration time. If the argument is changed for a matching but pre-existing stream, the stream will not be changed, even if the effective policy of the queue record indicates otherwise.
As streams never delete any messages, any consumer can start reading/consuming from any point in the log. This is controlled by the x-stream-offset consumer argument. If it is unspecified, the consumer will start reading from the next offset written to the log after the consumer starts. The supported values include: first (the beginning of the log), last (the last written chunk of messages), next (the default behaviour just described), a specific offset (an integer), and a timestamp.
Single active consumer for streams is a feature available in RabbitMQ 3.11 and later. It provides exclusive consumption and consumption continuity on a stream. When several consumer instances sharing the same stream and name enable single active consumer, only one of those instances is active at a time, and only it will receive messages. The other instances will be idle.
Super streams are a way to scale out by partitioning a large stream into smaller streams.They integrate with single active consumer to preserve message order within a partition.Super streams are available starting with RabbitMQ 3.11.
A super stream is a logical stream made of individual, regular streams.It is a way to scale out publishing and consuming with RabbitMQ Streams: a large logical stream is divided into partition streams, splitting up the storage and the traffic on several cluster nodes.
It is possible to create the topology of a super stream with any AMQP 0.9.1 library or with the management plugin; this requires creating a direct exchange, the "partition" streams, and the bindings between them. It may be easier to use the rabbitmq-streams add_super_stream command, though. Here is how to use it to create an invoices super stream with 3 partitions:
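Assuming the stream plugin's CLI is available on the node, the command looks like this:

```shell
rabbitmq-streams add_super_stream invoices --partitions 3
```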
Super streams add complexity compared to individual streams, so they should not be considered the default solution for all use cases involving streams. Consider using super streams only if you are sure you have reached the limits of individual streams.
RabbitMQ Stream provides a server-side filtering feature that avoids reading all the messages of a stream and filtering them only on the client side. This helps to save network bandwidth when a consuming application needs only a subset of messages, e.g. the messages from a given geographical region.
Note that there must be some client-side filtering logic as well, because server-side filtering is probabilistic: messages that do not match the filter value can still be sent to the consumer. The server uses a Bloom filter, a space-efficient probabilistic data structure in which false positives are possible. Despite this, the filtering saves some bandwidth, which is its primary goal.
Streams are not really queues in the traditional sense and thus do not align very closely with AMQP 0.9.1 queue semantics. Many features that other queue types support are not supported, and never will be, due to the nature of this queue type.
Streams do not support global QoS prefetch, where a channel sets a single prefetch limit for all consumers using that channel. If an attempt is made to consume from a stream on a channel with global QoS enabled, a channel error will be returned.
Streams are implemented as an immutable append-only disk log. This means that the log will grow indefinitely until the disk runs out of space. To avoid this undesirable scenario, it is possible to set a retention configuration per stream, which will discard the oldest data in the log based on total log data size and/or age.
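For illustration, retention can be configured at declaration time through the x-max-length-bytes and x-max-age arguments (the values below are arbitrary examples; queue name, host, and credentials are placeholders):

```shell
# keep at most ~20 GB of log data and discard data older than 7 days
rabbitmqadmin declare queue name=my-stream durable=true \
  arguments='{"x-queue-type": "stream", "x-max-length-bytes": 20000000000, "x-max-age": "7D"}'
```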