
Some questions about sparse files


nickdu

Dec 18, 2008, 3:26:00 PM

I have a couple questions regarding sparse files.

1. We're looking for an easy and efficient way to store blobs (arrays of
bytes) in some sort of circular queue and return some sort of key someone
can use for later access to that blob. The blobs can have varying lengths.
I was wondering whether sparse files would be a reasonable approach for
this, as it appears to be an efficient mechanism for storing messages of
varying length (at least that's what I gather).

I guess you could also store each blob in its own file, but then I think the
overhead of creating a file per blob (we may have millions) might be costly.

I could also store all the blobs in a single "normal" (non-sparse) file, but
then I think the housekeeping of walking the chain of blobs (most likely
I'll need to chain them if I don't use a sparse file) might be costly in
terms of performance, and it also adds to the code I'll have to write.
Though I guess NTFS keeps its own list of the sections of the file that
contain data.
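For concreteness, the fixed-slot sparse-file layout described above might look something like the following minimal Python sketch. The slot size, the length-prefix format, and the store_blob/get_blob names are all illustrative, and on NTFS the file would additionally need to be marked sparse (FSCTL_SET_SPARSE) for the unwritten gaps to go unallocated:

```python
import os
import struct

SLOT = 1 << 20  # hypothetical 1 MiB slot per blob (Nick's example uses 1 GB)

def store_blob(f, index, blob):
    # Seek to the slot; the bytes between slots are never written, so a
    # sparse-aware filesystem allocates no space for them.
    f.seek(index * SLOT)
    f.write(struct.pack("<q", len(blob)))  # 8-byte length prefix
    f.write(blob)

def get_blob(f, index):
    # Retrieval is just array-style indexing: seek, read the length,
    # then read that many bytes.
    f.seek(index * SLOT)
    (length,) = struct.unpack("<q", f.read(8))
    return f.read(length)
```

The key handed back to callers would simply be the integer slot index.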

2. If I want to copy a sparse file to other Windows machines running NTFS,
do I need to write my own code to do that, or does CopyFile() handle sparse
files so that it only copies the parts of the file which contain data and
the copied file is exactly the same as the source file?

--
Thanks,
Nick

nickno...@community.nospam
remove "nospam" change community. to msn.com

Jialiang Ge [MSFT]

Dec 19, 2008, 2:33:37 AM

Hello Nick

For the first question about an easy and efficient way to store blobs, I
need some time to think it over more carefully and will report back as soon
as possible.

For the second question about copying a sparse file, neither xcopy nor
robocopy (http://technet.microsoft.com/en-us/library/cc733145.aspx) can
keep the sparse attribute in the target file, based on my tests. The
CopyFile API does not handle sparse files well either. We need to write our
own code or use a third-party solution. Sample code is provided by FlexHex:
http://www.flexhex.com/docs/articles/sparse-files.phtml
At the bottom of the page, you will find the source code of a utility that
copies sparse files.
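The general idea of such a utility can be sketched portably: read the source in fixed-size chunks and, instead of writing runs of zeros, seek past them in the destination so the target filesystem can leave holes there. This is a hypothetical sketch; a real Windows implementation, like the FlexHex sample, would query the source's allocated ranges with FSCTL_QUERY_ALLOCATED_RANGES and mark the target sparse with FSCTL_SET_SPARSE rather than scanning for zeros:

```python
import os

CHUNK = 64 * 1024  # hypothetical chunk size

def copy_sparse(src_path, dst_path):
    zero = bytes(CHUNK)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            if chunk == zero:
                # Skip over all-zero chunks instead of writing them; the
                # destination filesystem can leave these as holes.
                dst.seek(len(chunk), os.SEEK_CUR)
            else:
                dst.write(chunk)
        # Extend the destination to the full logical size in case the
        # source file ends in a hole.
        dst.truncate(src.tell())
```

The logical content of the copy is identical to the source; whether the holes are physically preserved depends on the destination filesystem.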

Regards,
Jialiang Ge (jia...@online.microsoft.com, remove 'online.')
Microsoft Online Community Support

Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
msd...@microsoft.com.

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/en-us/subscriptions/aa948868.aspx#notifications.

MSDN Managed Newsgroup support offering is for non-urgent issues where an
initial response from the community or a Microsoft Support Engineer within
2 business days is acceptable. Please note that each follow-up response may
take approximately 2 business days, as the support professional working with
you may need further investigation to reach the most efficient resolution.
The offering is not appropriate for situations that require urgent,
real-time or phone-based interactions. Issues of this nature are best
handled working with a dedicated Microsoft Support Engineer by contacting
Microsoft Customer Support Services (CSS) at
http://msdn.microsoft.com/en-us/subscriptions/aa948874.aspx
==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.

m

Dec 19, 2008, 6:16:39 PM

What technique do you plan to use for indexing your blobs that will be
better than a file system? This is, after all, what they are designed to
do.

"nickdu" <nickno...@community.nospam> wrote in message
news:ABA4112C-311F-48C1...@microsoft.com...

Alexander Grigoriev

Dec 19, 2008, 9:39:43 PM

I think any alternate solution will be more manageable than sparse files.

You didn't say anything about blob sizes, persistence, etc.

"nickdu" <nickno...@community.nospam> wrote in message
news:ABA4112C-311F-48C1...@microsoft.com...

nickdu

Dec 24, 2008, 11:13:01 AM

pseudo code:

long GetBlob(int index, byte *buffer, long bytes)
{
    long length;
    Seek(sparse, (long long)(index + 1) * 1GB);  /* 64-bit offset */
    Read(sparse, &length, sizeof(length));
    return Read(sparse, buffer, min(bytes, length));
}

Otherwise, if you don't use sparse files and put one blob right after the
next, then you either need to maintain an index in the file or you need to
walk the chain each time you want to retrieve a blob.

Most likely I'll use the first blob to store some info, like the next free
index.
--
Thanks,
Nick


nickdu

Dec 24, 2008, 11:15:01 AM

Do you have another solution in mind?

I indicated the blob sizes are variable and I can't pick a max. Well,
actually I am picking a max, but it's an unreasonably large value, much
larger than anyone should need. That's the beauty of using sparse files, or
so I gathered. There must be some reason why the OS chose sparse files for
the NTFS change journal.
--
Thanks,
Nick


Alexander Grigoriev

Dec 24, 2008, 11:33:07 AM

1. You mentioned some sort of key. Why can't the key be just a 64-bit file
offset? How do you pass the data length to the clients?
2. Does your file need to be persistent between sessions? Does it need
backup?
3. You said circular queue. Does that mean the number of blobs has an upper
limit? What happens when you need to overwrite a blob?

By the way, I don't see anything wrong with simply storing an index at the
start of the file. That will also add persistence. Any overhead would be no
greater than the filesystem overhead for sparse file access. Actually, it
would be much less.
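The offset-as-key idea can be sketched in a few lines. This is hypothetical Python with made-up names (append_blob, read_blob_at): the 64-bit end-of-file offset becomes the key, and the length prefix stored at that offset answers the question of how clients learn the blob's size:

```python
import os
import struct

def append_blob(f, blob):
    f.seek(0, os.SEEK_END)
    offset = f.tell()                      # this 64-bit offset is the key
    f.write(struct.pack("<q", len(blob)))  # length prefix precedes the data
    f.write(blob)
    return offset

def read_blob_at(f, offset):
    f.seek(offset)
    (length,) = struct.unpack("<q", f.read(8))
    return f.read(length)
```

No sparse regions are needed here; blobs pack densely, at the cost of needing extra bookkeeping later if space must be reclaimed.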

"nickdu" <nickno...@community.nospam> wrote in message

news:AA991457-B503-4C1E...@microsoft.com...

Tommy

Dec 24, 2008, 3:59:11 PM

nickdu wrote:
> Do you have another solution in mind?
>
> I indicated the blob sizes are variable and I can't pick a max. Well
> actually I am picking a max, but that should be an unreasonably large value
> that should be much larger than anyone needs. That's the beauty of using
> sparse files, or so I gathered. There must be some reason why the OS chose
> sparse files for the NTFS change journal.

The Change Journal needs something like a sparse file to store/read
its entries as fast as possible using pre-allocated slack space.
Using a circular queue method, you can discard old entries when writing
new ones.

Have you considered compression? This might be more reasonable today
given that the speed and overhead can be negligible.

Have you considered the Gather/Scatter I/O functions?

Scatter/gather may apply well to circular queues too, where you have a
predefined size for the queue. In that case it may be a queue of buffer
pointers, which can be passed to the WriteFileGather() and
ReadFileScatter() functions.

Also, you might want to consider using your own b-tree index system
with varying data size records (this is what we do). Each record has
a meta data block header that defines its length (also contains magic
numbers for integrity checking).

You can look at SQLite, which I believe the current version is said to be
pretty fast for blobs. It uses a b-tree concept on the back end. If it's
simply a back-end application (not serving users or multiple clients),
SQLite is something many of the top applications use as a fast,
lightweight SQL-ready framework for things like single-application
storage, configuration, etc. Google Apps, Thunderbird, and Firefox all
use SQLite for configuration and data storage.
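As a quick illustration of the SQLite route, Python's bundled sqlite3 module is enough to sketch a blob store that hands back an integer key. The names open_store/put_blob/fetch_blob and the table layout are made up for the example:

```python
import sqlite3

def open_store(path=":memory:"):
    # One table of blobs; the INTEGER PRIMARY KEY is SQLite's rowid,
    # which serves as the cookie handed back to callers.
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS blobs "
                "(id INTEGER PRIMARY KEY, data BLOB NOT NULL)")
    return con

def put_blob(con, blob):
    cur = con.execute("INSERT INTO blobs (data) VALUES (?)", (blob,))
    con.commit()
    return cur.lastrowid  # the key for later retrieval

def fetch_blob(con, key):
    row = con.execute("SELECT data FROM blobs WHERE id = ?",
                      (key,)).fetchone()
    return row[0] if row else None
```

SQLite also handles the crash-consistency and space-reclamation problems (via its journal and DELETE/VACUUM) that a hand-rolled file format would have to solve itself.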

--


nickdu

Dec 26, 2008, 8:19:00 AM

Hmmm, I replied to this yet it didn't seem to make it to this thread.

The indexing would look something like:

GetBlob(int index, byte *bytes, int size)
{
    long length;
    Seek(_sparse, (long long)(index + 1) * 1GB, SEEKOFFSET_ORIGIN);
    Read(_sparse, &length, sizeof(length));
    Read(_sparse, bytes, min(length, size));
}

Most likely I would use the first block to store information, like the next
free block. Other than that, the indexing would be as shown above, just
like indexing an array. That's the beauty of sparse files, right? You can
be wasteful and pick a huge block size, and the OS will only consume the
actual amount of space you use. Of course the OS must have to do some
bookkeeping, but better the OS than me.

The NTFS change journal uses a sparse file, so I assume Microsoft saw a
benefit in using one.
--
Thanks,
Nick


nickdu

Dec 26, 2008, 8:27:01 AM

Do you have an alternative solution in mind? If so, what?

The blob size, as I mentioned, is variable. I would pick an upper bound that
is way bigger than I can conceive of anyone using and choose that as the
size. It allows very efficient indexing without wasting space. At least
that's what I gather from what I read about sparse files. Also, the NTFS
change journal uses sparse files, so there must be some benefit, right?

What do you mean by persistence? Data will be persisted to the sparse file.
I will most likely have to age things out, though not based on an LRU
scheme, as that would require me to keep some sort of LRU list, which would
complicate things. I would probably just make the sparse file circular and
pick an upper bound on the index. When it wraps, I would end up writing
over the first blob.

This points out another thing I might have to do: write the index into the
slot so I can tell when data has been overwritten. If someone asks for
index x and the slot for index x contains data written for index y, then I
can return an error. In this case index x modulo the size of the circular
buffer equals index y modulo the size of the circular buffer.
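That overwrite check can be sketched in a few lines. This is an in-memory Python sketch with made-up names (CircularStore, a hypothetical CAPACITY of 8 slots); a file-based version would store the logical index in the slot header next to the length:

```python
CAPACITY = 8  # hypothetical number of slots in the circular store

class CircularStore:
    def __init__(self):
        # Each slot holds (logical index, blob), or None if never written.
        self.slots = [None] * CAPACITY

    def store(self, index, blob):
        self.slots[index % CAPACITY] = (index, blob)

    def get(self, index):
        entry = self.slots[index % CAPACITY]
        if entry is None or entry[0] != index:
            # Slot is empty, or a later blob that maps to the same slot
            # (same index modulo CAPACITY) has overwritten this one.
            raise KeyError("blob %d has been overwritten" % index)
        return entry[1]
```

Storing the logical index alongside each blob is what lets stale cookies be detected rather than silently returning the wrong data.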
--
Thanks,
Nick


m

Dec 26, 2008, 10:43:58 AM

Okay - how would you get the index? This is what you need to be able to do
better than the file system for this plan to make sense. And you need to
be able to say that the benefit of using your sparse file is sufficient to
outweigh the extra problems with backup and maintenance.

BTW: this pseudocode is inefficient and can be improved by using
page-size-aligned reads. The seek is unnecessary and will preclude multiple
readers; use overlapped I/O instead. Both reads can be consolidated for
small (less than one buffer-sized) blobs, and for large blobs multiple
reads will be required. All of this is true regardless of how the file(s)
that store the data are arranged.

Also, storing metadata creates the possibility of file corruption. Consider
the case of process termination during a blob insert. If the metadata is
updated and the blob data isn't, garbage will be returned when the app asks
for that blob; if the metadata isn't updated, the insert will be lost
completely; and, even worse, if a partial metadata update is made, the
whole file might become unreadable. These risks can be mitigated by using
FILE_FLAG_NO_BUFFERING + FILE_FLAG_WRITE_THROUGH, but you must write code
that can handle the consequences of corruption and/or check for these
conditions and try to correct them. This is something else that the file
system does for you directly.
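One common way to shrink the torn-write window described above is to order the writes so the metadata is published last: write the payload, force it to disk, and only then write the length prefix. A crash before the final step leaves the slot's length at zero, i.e. "no blob", instead of a length pointing at garbage. A hypothetical Python sketch (os.fsync stands in for FILE_FLAG_WRITE_THROUGH; the slot size and function names are made up):

```python
import os
import struct

SLOT = 1 << 20  # hypothetical slot size

def store_blob_crash_safe(f, index, blob):
    base = index * SLOT
    # 1. Write the payload first, leaving the 8-byte length prefix zeroed.
    f.seek(base + 8)
    f.write(blob)
    f.flush()
    os.fsync(f.fileno())  # payload durable before metadata
    # 2. Publish the length last; until this write lands, readers see
    #    length 0 and treat the slot as empty.
    f.seek(base)
    f.write(struct.pack("<q", len(blob)))
    f.flush()
    os.fsync(f.fileno())

def read_blob(f, index):
    f.seek(index * SLOT)
    raw = f.read(8)
    if len(raw) < 8:
        return None          # slot beyond end of file
    (length,) = struct.unpack("<q", raw)
    if length == 0:
        return None          # empty slot, or insert never completed
    return f.read(length)
```

This handles process termination between the two writes, but not a partially applied length write; a checksum over the slot header would be needed to detect that.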


"nickdu" <nickno...@community.nospam> wrote in message

news:A3F5E9E5-3FC4-49A7...@microsoft.com...

nickdu

Dec 28, 2008, 9:23:00 AM

You're correct, you could just use the file offset as a cookie, but I
believe if you go that route then you'll have a bunch of code for chaining
the blobs for when you need to recover space. And once you have that, there
is a greater chance of making the entire file unusable in the event of some
corruption.

nickdu

Dec 28, 2008, 9:25:00 AM

The reason you give for why the Change Journal uses sparse files is
exactly why I'm considering using them.
--
Thanks,
Nick


nickdu

Dec 28, 2008, 9:26:00 AM

The index is the cookie returned from my Store() method. The consumer was
given the cookie, and they give it back to me when they want to retrieve
the blob.

As it was pseudocode I provided, I was not too concerned with performance.

Alexander Grigoriev

Dec 28, 2008, 10:12:47 AM

How fast do you need to write data?

"nickdu" <nickno...@community.nospam> wrote in message

news:E9493378-5CA5-4FE8...@microsoft.com...

m

Dec 28, 2008, 11:21:41 AM

So your solution is to have another part of the app do the work?

Do these references persist across restarts? If so, then how do you expect
them to be stored? As part of a larger, already existing entity?

It was not the specifics of your pseudocode that I was responding to, but
the nature of it and the obvious indications that you are not considering
the full scope of the problems you will need to solve for this solution to
be fully functional and reliable.

However, as it is obvious that I can't dissuade you from your purpose, I
have endeavoured to point out some of the problems that you will face.

The main advantage of this design is eliminating the need to call CreateFile
when a new blob is to be written or retrieved, by having a single data file
always open. Depending on your data rate, this overhead is likely
insignificant; but if your data rate is really high, then you may need to
follow this approach. Of course, the primary reason to use a sparse file is
to simplify the code at the expense of I/O performance, and this would seem
counterproductive in this case. Also, the code complexity added by keeping a
blob-to-block map versus a static offset is not significant compared with
the complexity required to prevent file corruption during HW/OS failures.

And hence my original question: what is it about your program that makes it
possible to do this better than the filesystem?


"nickdu" <nickno...@community.nospam> wrote in message

news:3E3FF996-50D7-4179...@microsoft.com...

nickdu

Dec 29, 2008, 11:11:00 PM

Another part of my application is not doing the work either. The client of
my application will have to do the work of storing the cookie I give back
to them. It's similar to a DB, I guess. The DB just stores the data. If you
want to gain access to it, you need to construct the correct SQL to get to
the data that you stored. In my case they ask me to store a blob; I do so
and return a cookie, which they need to use when they want to retrieve the
blob they stored.

In addition, I have the ability to enumerate all the blobs in the store.

I won't need to have a blob-to-block map. When someone asks for the blob at
index 5, I seek to location 5 * 1GB, read the length of the blob, read the
blob, and return it to them. That's it.

The goal was to reduce the complexity of my code. The OS already gives me
this feature for free, so why not use it? The Change Journal designers seem
to have thought it was a good feature, and I have the same sort of
requirement (I think): I have a variable blob size and I would like not to
waste a lot of disk space. So I set an arbitrarily large size and let the
OS handle figuring out what space to consume.

Of course, I want to verify that the OS implementation is good and that my
requirements are such that using sparse files makes sense, which is why I'm
asking the question here.

nickdu

Dec 29, 2008, 11:18:04 PM

Good question. I don't have an exact number for you. This implementation
would replace an existing tracing implementation which makes use of ETW to
persist trace events. I think I heard someone mention the number 20,000/sec
in referring to ETW. I'm not sure what 20,000/sec means; I believe they
were indicating that ETW could handle 20,000 buffers a second. Again,
that's still vague, as I'm sure the throughput is dependent on the buffer
size. And any implementation will be limited by the sustained throughput of
the underlying storage subsystem.
--
Thanks,
Nick


Alexander Grigoriev

Dec 30, 2008, 12:56:45 AM

You'll have quite a problem issuing 20,000 separate I/Os per second to a
single disk. You have some chance with sequential reading/writing, but
sparse files will make it much worse.

"nickdu" <nickno...@community.nospam> wrote in message

news:CAFD464B-8725-41F4...@microsoft.com...

m

Dec 30, 2008, 6:54:03 PM

Since you clearly don't want to think about the issues involved, and have
already irrevocably decided on your solution, I leave you to your problems.

"nickdu" <nickno...@community.nospam> wrote in message

news:97AD404E-8038-404A...@microsoft.com...

nickdu

Dec 31, 2008, 8:31:01 AM

You didn't point out any problems yet. I did get an indication of a problem
from Alexander, who noted a performance problem if I don't stick to
sequential I/O requests.

I'm not stuck on this implementation at all. From what I know of the
sparse-file technology, it seems to address some of my requirements. It's
also used by the OS to implement the Change Journal, so I assume the
implementation is decent. I would rather not write something myself if the
OS has already implemented it for me; doing so would only increase the time
to market and the amount of code I would have to maintain.
