Xkcd Windows Copy

Gabriel Litke

Aug 3, 2024, 5:17:12 PM

to rungvolbestlen

I know that the Windows copy dialog (in Windows XP) stores the copy in memory first, and it is still copying after the dialog closes, so the time is off, but why is the estimation of the time it will take to make a copy so inaccurate, even when memory copying has been disabled (in Vista and Windows 7)? It seems so arbitrary! How does the whole copy procedure work, and why can't Windows estimate it correctly?

Not only the number of bytes but also the number of files to create plays a role here. A million 1 KB files and a thousand 1 MB files add up to roughly the same amount of data, but the situation is quite different because the former carries the overhead of creating a great many files. Depending on the filesystem used, this could take more time than actually transferring the data.
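To put rough numbers on that, here is a minimal sketch of a cost model: transfer time plus a fixed cost per file created. The throughput and per-file overhead figures are assumptions picked for illustration, not measurements of any real filesystem, and the function name is made up.

```python
def estimated_copy_seconds(file_sizes, throughput_bps, per_file_overhead_s):
    """Rough copy time: data transfer time plus a fixed cost per file created."""
    transfer = sum(file_sizes) / throughput_bps
    overhead = len(file_sizes) * per_file_overhead_s
    return transfer + overhead

KB = 1024
MB = 1024 * KB

# Assumed figures, for illustration only.
THROUGHPUT = 100 * MB       # bytes per second
PER_FILE_OVERHEAD = 0.005   # seconds spent creating one file

many_small = [KB] * 1_000_000   # a million 1 KB files
few_large = [MB] * 1_000        # a thousand 1 MB files

small_time = estimated_copy_seconds(many_small, THROUGHPUT, PER_FILE_OVERHEAD)
large_time = estimated_copy_seconds(few_large, THROUGHPUT, PER_FILE_OVERHEAD)

print(f"million small files: {small_time:.1f} s")
print(f"thousand large files: {large_time:.1f} s")
```

Under these assumed figures the per-file cost dominates the small-file case completely, even though both sets hold about 1 GB.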

"Because the copy dialog is just guessing. It can't predict the future, but it is forced to try. And at the very beginning of the copy, when there is very little history to go by, the prediction can be really bad.

Here's an analogy: Suppose somebody tells you, "I am going to count to 100, and you need to give continuous estimates as to when I will be done." They start out, "one, two, three...". You notice they are going at about one number per second, so you estimate 100 seconds. Uh-oh, now they're slowing down. "Four... ... ... five... ... ..." Now you have to change your estimate to maybe 200 seconds. Now they speed up: "six-seven-eight-nine" You have to update your estimate again.

Now somebody who is listening only to your estimates and not the person counting thinks you are off your rocker. Your estimate went from 100 seconds to 200 seconds to 50 seconds; what's your problem? Why can't you give a good estimate?

File copying is the same thing. The shell knows how many files and how many bytes are going to be copied, but it doesn't know how fast the hard drive or network or internet is going to be, so it just has to guess. If the copy throughput changes, the estimate needs to change to take the new transfer rate into account."

You have the same problem with file transfers. The speed of a file transfer is not constant; it speeds up and slows down based on a lot of factors. The reason the number jumps around so much is that Microsoft leaned toward the "only count the last interval" side of the spectrum.

There is nothing wrong with that side of the spectrum. It gives you more accurate "seconds per second" (one second in real time makes the counter go down by one second), but it causes the total ETA of the timer to jump around a lot.

A good example of the opposite side is 7-Zip when it is compressing. If the compression speed drops partway through, the ETA does not jump dramatically like a file transfer ETA, but it may take 2 to 3 real seconds before the timer ticks down one second (or it may even start counting up) until it stabilizes at the new speed.
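The two ends of that spectrum can be sketched side by side. This is only an illustration of the idea, not how Explorer or 7-Zip actually compute their estimates; the function names and the sample transfer are made up.

```python
def eta_last_interval(remaining, interval_amount, interval_seconds):
    """ETA from the most recent interval's speed only: reacts fast, jumps a lot."""
    rate = interval_amount / interval_seconds
    return float("inf") if rate == 0 else remaining / rate

def eta_cumulative(remaining, done, elapsed_seconds):
    """ETA from the average speed since the start: smooth, slow to react."""
    rate = done / elapsed_seconds
    return float("inf") if rate == 0 else remaining / rate

def simulate(total, per_second_samples):
    """Return (last_interval_eta, cumulative_eta) pairs, one per elapsed second."""
    done = 0
    history = []
    for elapsed, moved in enumerate(per_second_samples, start=1):
        done += moved
        remaining = total - done
        history.append((
            eta_last_interval(remaining, moved, 1.0),
            eta_cumulative(remaining, done, float(elapsed)),
        ))
    return history

# Hypothetical transfer of 1000 units: steady at first, then half speed.
samples = [10, 10, 10, 10, 5, 5, 5, 5]
for last, cumulative in simulate(1000, samples):
    print(f"last interval: {last:7.1f} s   cumulative: {cumulative:7.1f} s")
```

When the speed halves at the fifth second, the last-interval estimate doubles almost immediately, while the cumulative one only creeps upward and takes a long time to catch up with the new speed.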

Firstly, Windows is guessing. It knows how many files there are and how big they are, but the transfer rate per file is highly variable. It depends on things like file size, or even location on the drive in some cases. As time goes on, it adjusts its guess based on current and past conditions, which is why the estimated transfer speeds are inaccurate under real-world conditions.

The obvious reason is that the speed of the transfer varies over time, and so does the average, and so does the prediction. To explain this to a non-tech friend, I've used an analogy involving travel by air. You're going to fly over the Atlantic. When you arrive by taxi at the departing airport, your ETA, based on your average speed so far, is about two months. When you disembark at the arriving airport, based on your average speed so far, you will reach your friend's house in 5 seconds.

But you need to appreciate how much the speed can actually vary, even with what seems like a predictable scenario, like copying files within the same disk, or between two local disks. One of the new features I like in Windows 8 is the ability to graph the speed over time if you click "more details". If you don't have access to a Windows 8 machine, search images for Windows 8 copy dialog for a lot of examples. Many of them are fairly flat, but many of them are also disturbingly bumpy, to the point that you wonder whether the hard drive is actually healthy, when it dips to zero.

There are better and worse ETA prediction algorithms, but for an accurate prediction, the computer would have to be all-knowing. The risk of trying to make the algorithm "smart" is that it might create new, unforeseen, cases where it's even more hilariously wrong.

The only way to know how long it'll take to compress a set of files is to compress them. Sometimes Windows' best guess is close, sometimes it's wildly wrong. The same is true of copying large numbers of files, as I'm sure you've noticed.

Perhaps there's a program out there that can copy/compress files and make an alarm sound when it finishes. That would be truly useful. We could have a little nap while we wait for Windows to finish the housecleaning.

In order to expedite the copy process (and not spend too much time calculating time estimates instead of performing copy-related operations), the Windows copy utility built into Explorer maintains a limited amount of information about how fast previous write operations completed. Each time it needs to calculate the time remaining, it just figures out the average amount of time write operations have been taking, and then multiplies by the number of remaining write operations.
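A minimal sketch of that kind of estimator, assuming "limited amount of information" means a fixed-size window of recent write durations. The class name and window size are hypothetical; this is not Explorer's actual code.

```python
from collections import deque

class RollingEta:
    """Average the last few write durations and scale by the writes left."""

    def __init__(self, window=8):
        # deque with maxlen silently drops the oldest entry when full
        self.durations = deque(maxlen=window)

    def record(self, seconds):
        """Remember how long the most recent write operation took."""
        self.durations.append(seconds)

    def remaining_seconds(self, remaining_ops):
        """Return the estimate, or None if nothing has been recorded yet."""
        if not self.durations:
            return None
        average = sum(self.durations) / len(self.durations)
        return average * remaining_ops

eta = RollingEta(window=3)
for seconds in (1.0, 1.0, 1.0):
    eta.record(seconds)
print(eta.remaining_seconds(10))  # 10.0
eta.record(4.0)                   # one slow write enters the window
print(eta.remaining_seconds(10))  # 20.0
```

With such a short window, a single slow write is enough to double the estimate, which matches the jumpy behaviour people complain about.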

Numbers 1 and 3 would seem to have the most obvious effect on the transfer time calculation, but a great many people do not account for number 2. This can have a huge effect on how long the transfer will take, and is difficult to quantify.

Basically, each time a file is written, the filesystem needs to write a bit of metadata about the file, e.g. ownership, permissions, creation/modification/access times, etc. Depending on the particular filesystem, this information may be written to a part of the disk very 'far away' from where the file is being written. This filesystem overhead is what can make a seemingly simple transfer take a long time, and/or make the time estimate fluctuate wildly.

E.g., when transferring one large file you'll notice that the estimate holds steady and is fairly accurate, but transferring hundreds of files of varying sizes that add up to the same total size can take longer and cause the time estimate to pitch a fit.

As best I can tell, the reason most people writing the blogs, and people here, aren't aware of the possibility is field of study and breadth of schooling. A modest yet also very comfortable remedy should be possible for [a graduate with more recent training than the blog writers] [a multibillion dollar company] Microsoft.

Where a, b, and c have 3 states each: the file manager peeks at the files (or just the metadata) before copying, and F*(b x c) + d is not an expensive computation; if you want something more accurate use a lookup table with more states-- there's hardly any calculation at all.

The key difference between what I described and the implementations we've seen so far would be, in short, observing file size and file distribution/entropy on the disk and using it to [more] accurately account for the time element of disk usage.

There are a lot of "unknown" variables when you are trying to predict how long something is going to take. For example, while the program knows that there are 3500 files, and that the files amount to 3.5 GB (3500 MB), does that mean that each file is 1 MB? Not necessarily. There could be a lot of 4 KB files, a lot of 100 MB files, and some others in between. Also, you have to take into consideration where the files are coming from and where they're going (e.g. the media involved). What is the biggest bottleneck? How do you account for copying files from an HDD through a VPN tunnel? You give a best-case scenario, and then adjust your counters in real time. This is why you see those progress meters change on the fly.

Rather than invest a lot of time coming up with a low-confidence estimate that would be only slightly improved over the current one, we focused on presenting the information we were confident about in a useful and compelling way. This makes the most reliable information we have available to you so you can make more informed decisions.

Maintain a table of expected speeds for each storage device on the filesystem. Record how long it takes to read the filesystem information. When a device is mounted, if it's reasonable for the device type, seek to the middle and end, measuring speeds there, too. Get approximate curves for the read and write speeds across locations, and use those for future estimates. For future read and write operations, take note of where they are and how fast they go, and adjust the curves accordingly.
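As a rough sketch of what such a table could look like: one speed curve per device, sampled at a few positions, interpolated in between, and nudged whenever a real operation is observed. Everything here (the class, the device name, the speeds, the blending weight) is an assumption made for illustration, and it leaves out the hard parts such as actually measuring speed at a given disk location.

```python
from bisect import bisect_left

class SpeedCurve:
    """Piecewise-linear speed estimate by fractional position (0.0 to 1.0)."""

    def __init__(self, samples):
        # samples: {position: bytes_per_second}, must contain at least one point
        self.points = dict(samples)

    def speed_at(self, position):
        """Interpolate between the two nearest sampled positions."""
        xs = sorted(self.points)
        if position <= xs[0]:
            return self.points[xs[0]]
        if position >= xs[-1]:
            return self.points[xs[-1]]
        i = bisect_left(xs, position)
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = self.points[x0], self.points[x1]
        t = (position - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

    def observe(self, position, measured_speed, weight=0.25):
        """Blend a new measurement into the curve at that position."""
        old = self.speed_at(position)
        self.points[position] = old + weight * (measured_speed - old)

# Hypothetical table: one curve per device, sampled at start, middle and end.
curves = {
    "disk0": SpeedCurve({0.0: 120_000_000, 0.5: 100_000_000, 1.0: 60_000_000}),
}

curve = curves["disk0"]
print(curve.speed_at(0.25))        # expected speed a quarter of the way in
curve.observe(0.75, 40_000_000)    # a real write there turned out slower
print(curve.speed_at(0.75))        # the curve has moved toward the observation
```

An estimate for a pending operation would then be its size divided by `speed_at` for the region it touches, which is still a guess, just one informed by that device's history.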

Obviously none of this is easily implemented, and I only mentioned file copies. Similar work would need to be done for all sorts of transfers.
The question you have to ask yourself: would you rather Microsoft spend its time giving you a better estimate, or would you rather they make your files transfer faster?