How To Download A Large File Quickly

Laurelino Braendel

Aug 3, 2024, 5:37:51 PM

to diamolusne

dd will do the job, but reading from /dev/zero and writing to the drive can take a long time when you need a file several hundred GB in size for testing. If you need to do that repeatedly, the time really adds up.

dd from the other answers is a good solution, but it is slow for this purpose. On Linux we have fallocate (POSIX systems offer the related posix_fallocate() function), which reserves the desired space without actually writing anything to it. It works with most modern disk-based file systems and is very fast.

dd is the obvious first choice, but dd is essentially a copy, which forces you to write every block of data (thus initializing the file contents). That initialization is what takes up so much I/O time. (Want to make it take even longer? Use /dev/random instead of /dev/zero! Then you'll use CPU as well as I/O time!) In the end, though, dd is a poor choice, even though it is essentially the default used by the VM "create" GUIs.
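To make the cost concrete, here is a minimal Python sketch of what a command such as `dd if=/dev/zero of=zeros.img bs=1M count=8` does: it writes every block of the file. The function name `write_zero_file`, the file name, and the 8 MiB size are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile


def write_zero_file(path, size_bytes, block_size=1024 * 1024):
    """Create a file of size_bytes by writing zero-filled blocks, as dd would."""
    block = bytes(block_size)  # one reusable buffer of zero bytes
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = min(block_size, remaining)
            f.write(block[:chunk])  # every byte of the file is really written
            remaining -= chunk
    return os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "zeros.img")
        # 8 MiB keeps the demo quick; the work grows linearly with the size.
        print(write_zero_file(target, 8 * 1024 * 1024))
```

Because the loop touches every byte, the running time scales with the file size, which is exactly the problem described above.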

truncate is another choice, and is likely the fastest, but that is because it creates a "sparse file". Essentially, a sparse file is a file containing ranges ("holes") that have never been written; the underlying filesystem "cheats" by not storing those ranges at all and just "pretending" that they are full of zeros. Thus, when you use truncate to create a 20 GB drive for your VM, the filesystem doesn't actually allocate 20 GB; it reports 20 GB of zeros even though almost no disk space is really in use.
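The sparse-file behaviour can be observed with a short Python sketch, roughly what `truncate -s 1G disk.img` does on the command line. The helper name `make_sparse_file` and the file name are assumptions made for illustration, and the allocated figure depends on the filesystem.

```python
import os
import tempfile


def make_sparse_file(path, size_bytes):
    """Set the file's length without writing data, like `truncate -s`."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)  # extends the file; the new range is a hole
    st = os.stat(path)
    # st_blocks counts the 512-byte units actually allocated on disk.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        apparent, on_disk = make_sparse_file(os.path.join(tmp, "disk.img"), 1 << 30)
        print("apparent size:", apparent)
        # On filesystems with sparse-file support this is typically far smaller.
        print("bytes allocated:", on_disk)
```

Comparing the two numbers is the quickest way to tell whether a "large" file really occupies its space.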

fallocate is the final, and best, choice for use with VM disk allocation, because it essentially "reserves" (or "allocates") all of the space you're seeking but doesn't bother to write anything. So, when you use fallocate to create a 20 GB virtual drive, you really do get a 20 GB file (not a "sparse file") without having written anything to it. Reads from the untouched space return zeros, because the filesystem marks the reserved blocks as unwritten rather than exposing whatever was on the disk before.
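A minimal Python sketch of the same idea uses `os.posix_fallocate`, the counterpart of running `fallocate -l 16M disk.img`. The helper name `preallocate_file` and the sizes are assumptions for illustration; on filesystems without native support the C library may fall back to a slower emulation.

```python
import os
import tempfile


def preallocate_file(path, size_bytes):
    """Reserve size_bytes of disk space for path without writing file data."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size_bytes)  # offset 0, length size_bytes
    finally:
        os.close(fd)
    st = os.stat(path)
    # st_blocks counts the 512-byte units actually allocated on disk.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        apparent, on_disk = preallocate_file(os.path.join(tmp, "disk.img"), 16 * 1024 * 1024)
        print("apparent size:", apparent)
        print("bytes allocated:", on_disk)
```

Unlike the sparse file, the allocated figure here should be close to the apparent size on filesystems that support preallocation.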

Try mkfile <size> myfile as an alternative to dd. With the -n option the size is noted, but disk blocks aren't allocated until data is written to them. Without the -n option, the space is zero-filled, which means writing to the disk, which means taking time.

mkfile is derived from SunOS and is not available everywhere. Most Linux systems have xfs_mkfile, which works exactly the same way, and not just on XFS file systems despite the name. It's included in xfsprogs (for Debian/Ubuntu) or similarly named packages.

Most Linux systems also have fallocate, which only works on certain file systems (such as btrfs, ext4, ocfs2, and xfs), but is the fastest, as it allocates all the file space (creates non-holey files) but does not initialize any of it.

EDIT: as many have pointed out, this will not physically allocate the file on your device. With this you could actually create an arbitrarily large file, regardless of the available space on the device, as it creates a "sparse" file.

But here's a possibility that might work for your application. If you don't care about the contents of the file, how about creating a "virtual" file whose contents are the dynamic output of a program? Instead of fopen()ing the file, use popen() to open a pipe to an external program. The external program generates data whenever it's needed. Once the pipe is open, the program that opened it can read from it through the usual stdio calls, just as it would from a regular file. One limitation: a pipe is strictly sequential, so fseek() and rewind() will not work on it. You'll need to use pclose() instead of fclose() when you're done with the pipe.
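The pipe idea can be sketched in Python with `subprocess.Popen`, which plays the role of popen() here. The generator command (a child Python process that prints the same line forever) and the function name `read_generated` are hypothetical stand-ins for whatever external program you would use. The sketch also checks `seekable()` to show that a pipe is read sequentially and cannot be repositioned.

```python
import subprocess
import sys

# Hypothetical generator: a child Python process that emits the same line forever.
GENERATOR = [
    sys.executable,
    "-c",
    "import sys\nwhile True: sys.stdout.write('x' * 63 + '\\n')",
]


def read_generated(n_bytes):
    """Read n_bytes from the generator's pipe, then stop the child process."""
    proc = subprocess.Popen(GENERATOR, stdout=subprocess.PIPE)
    try:
        seekable = proc.stdout.seekable()  # False: a pipe has no file offset
        data = proc.stdout.read(n_bytes)   # blocks until n_bytes have arrived
    finally:
        proc.kill()          # the generator never exits on its own
        proc.stdout.close()
        proc.wait()          # reap the child, the counterpart of pclose()
    return data, seekable


if __name__ == "__main__":
    data, seekable = read_generated(1024)
    print(len(data), seekable)
```

No disk space is used at all with this approach, but it only suits consumers that read the data front to back.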

The GPL mkfile is just a (ba)sh script wrapper around dd; BSD's mkfile just memsets a buffer with a non-zero value and writes it repeatedly. I would not expect the former to out-perform dd. The latter might edge out dd if=/dev/zero slightly since it omits the reads, but anything that does significantly better is probably just creating a sparse file.

Absent a system call that actually allocates space for a file without writing data (Linux and BSD once lacked this, and probably Solaris as well), you might get a small improvement in performance by using ftruncate(2)/truncate(1) to extend the file to the desired size, mmap the file into memory, then write non-zero data to the first bytes of every disk block (use fstatvfs(3) to find the file system block size).
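As a rough sketch of that fallback, the Python below extends the file with `os.ftruncate`, maps it with `mmap`, and writes one non-zero byte at the start of every block. The function name `allocate_by_touching` is hypothetical, and using `os.fstatvfs(...).f_bsize` as the block size is an assumption that may not match the real allocation unit on every filesystem.

```python
import mmap
import os
import tempfile


def allocate_by_touching(path, size_bytes):
    """Extend path to size_bytes, then write one non-zero byte into every block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        os.ftruncate(fd, size_bytes)             # sparse file of the target size
        block = os.fstatvfs(fd).f_bsize          # reported file system block size
        with mmap.mmap(fd, size_bytes) as view:  # shared, writable mapping
            for offset in range(0, size_bytes, block):
                view[offset] = 1                 # dirty the block so it gets allocated
            view.flush()                         # push the dirty pages to the file
    finally:
        os.close(fd)
    return os.stat(path).st_size


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(allocate_by_touching(os.path.join(tmp, "touched.img"), 4 * 1024 * 1024))
```

Only one byte per block is written from user space, but the filesystem still has to allocate and flush every block, so the saving over dd is modest.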

One approach: if you can guarantee unrelated applications won't use the files in a conflicting manner, just create a pool of files of varying sizes in a specific directory, then create links to them when needed.
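The pool idea might look like the following Python sketch. The names `build_pool` and `borrow`, the file naming scheme, and the use of hard links are all assumptions for illustration; hard links require the pool and the link to live on the same filesystem.

```python
import os
import tempfile


def build_pool(pool_dir, sizes):
    """Pre-create one file per size in pool_dir and return {size: path}."""
    os.makedirs(pool_dir, exist_ok=True)
    pool = {}
    for size in sizes:
        path = os.path.join(pool_dir, f"pool_{size}.bin")
        with open(path, "wb") as f:
            f.truncate(size)  # placeholder; any creation method would do here
        pool[size] = path
    return pool


def borrow(pool, size, link_path):
    """Hard-link the pooled file of the given size to link_path; no data is copied."""
    os.link(pool[size], link_path)
    return link_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pool = build_pool(os.path.join(tmp, "pool"), [1 << 20, 1 << 24])
        work = borrow(pool, 1 << 24, os.path.join(tmp, "work.bin"))
        print(os.path.getsize(work))
```

The expensive creation step happens once, up front; handing out a file afterwards is just a directory entry. Anything written through a link changes the pooled file too, which is why the caveat about conflicting use matters.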

So I wanted to create a large file with repeated ASCII strings. "Why?" you may ask. Because I need to use it for some NFS troubleshooting I'm doing. I need the file to be compressible because I'm sharing a tcpdump of a file copy with the vendor of our NAS. I had originally created a 1 GB file filled with random data from /dev/urandom, but of course since it's random, it won't compress at all and I would need to send the full 1 GB of data to the vendor, which is difficult.
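One way to get such a file, sketched here in Python rather than taken from the original post, is to write the same ASCII line over and over. The function name `write_repeated_text`, the sample line, and the sizes are illustrative assumptions.

```python
import os
import tempfile
import zlib


def write_repeated_text(path, size_bytes,
                        line=b"The quick brown fox jumps over the lazy dog\n"):
    """Fill path with size_bytes of the same ASCII line repeated end to end."""
    block = line * 16384  # a large buffer that is a whole number of lines
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            chunk = block[: size_bytes - written]  # the last chunk may be shorter
            f.write(chunk)
            written += len(chunk)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "compressible.txt")
        size = write_repeated_text(target, 8 * 1024 * 1024)
        with open(target, "rb") as f:
            packed = len(zlib.compress(f.read()))
        print("raw bytes:", size, "compressed bytes:", packed)
```

The data is still really written to disk, so this is slower than fallocate, but the result compresses to a tiny fraction of its size, which is what matters when a capture has to be shipped to someone else.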

Edit: Before you ding me because the OP said, "I don't care about the contents," know that I posted this answer because it's one of the first replies to "how to create a large file linux" in a Google search. And sometimes, disregarding the contents of a file can have unforeseen side effects.

Edit 2: And fallocate seems to be unavailable on a number of filesystems, and creating a 1 GB compressible file in 1.2 s seems pretty decent to me (aka, "quickly").

You can call general support to confirm security permissions and put in a request to delete large numbers of files from your library. However, this can only really be done if you need a bulk deletion within certain parameters (for example, everything uploaded before 2023), because the team handling the request can't go through massive account libraries with a fine-tooth comb.

At my current company, the project I work on is coded in Java, at least for the systems / backend part. Whenever I get assigned a task dealing with the Java code, it takes me hours or even days to figure everything out and apply my solutions. The reasons are:

The second thing is to keep notes. I do this by writing my main question, something like, "Where do I make this change?", then indented under that I write the questions I need to answer to answer the main question, then questions to answer that question, and so forth. Eventually you get down to a question easy enough to answer, and you can work your way back up.

Without looking at the code base, it's hard to know what's wrong with it, if anything. Your description sounds as if previous developers fell in love with abstraction and applied it all over the place. This seems to happen occasionally when developers think a single principle or pattern is the key to good software and apply it indiscriminately wherever they think it can be used. In my experience, there is basically one single principle that helps code readability and maintainability if applied liberally: KISS. Others can be used in moderation and with good judgement.

However, you're not going to change the code base, so this doesn't help in your current situation. Karl Bielefeldt's answer is spot-on: Don't get lost by following every abstraction path down to the leaves, but initially trust that the implementations do what the abstractions promise. I found that stepping through code in a debugger helps a lot in seeing how such code is working, so that is something you can do when the control flow is really unclear or when you need to track a bug.

And you should use everything that helps you focus on the problem at hand. Dealing with a complex and largely unfamiliar code base is extremely taxing on your concentration, so take notes, shield yourself from distractions and noise, and take breaks as needed to digest what you read and let puzzle pieces fall into place.

These two principles could give you an idea of which direction to look in: It is OK to ask for the support you need! You should trust your colleagues to watch over their own "resource" of patience and to tell you if your questions become too much.

If you are using Scrum, you should definitely bring this up in a daily standup. It's an impediment, and you and your team should try to remove it. You can also talk with the Scrum Master or your supervisor about this.

Your team should have a plan for familiarizing new employees with the software. For example, there should be extensive documentation, or someone should explain to you how it works. Even more so if it's a complicated piece of software.

A lot of us developers are the sort of people who take pride in our problem-solving skills, and so it almost feels unnatural for us to ask for help. You have to train yourself out of that mindset - sometimes, asking for help is in everyone's best interest (and you and your teammates can figure out how to go about help/training in a way that's not disruptive). Over time, you'll pick up on things, and you'll need less and less help. Also, the activity of trying to sort out (what appears to you as) the impenetrable mess of concepts expressed in legacy code is not the kind of problem solving that's at the center of your job description. It's grunt work. You want to get to solving the actual problems that the software was made for; some grunt work will be required to do that, but a little help from the more experienced members of your team can get you there faster. So it's a matter of finding a balance. A bit of struggling to figure things out on your own will lead to developing a deeper understanding of things, but don't go overboard.

It's difficult to say what the exact source of the problem is without being there; every organization is different. In any case, because you are a new member of the team, and the codebase is the way that it is, it's a given that you'll need some help or training. Your organization should know this. The thing is, you are trying to solve the issue while working with limited information and untested assumptions (i.e., you don't know how your organization may (or may not) be able to help you with your issue, you assume your teammates will feel you are bothering them, etc.). You should ask around first; talk to people about these things. Then you'll know what the options are, who else to talk to, and so on. Then you'll be in a better position to decide how to act.