Building on recent discussions about memory-efficient lazy streams, and in an attempt to organize and share some thoughts, the following contemplates a scenario in which:
1) you have a large 2GB file (or other resource) to process
2) you'd like to leverage Ceylon's rich and growing support for working with lazy streams
3) the work may involve things like XML parsing, performing database updates, streaming transformation to an output stream, etc.
The idea is that we need a streaming solution. Working with such a large amount of input data in memory is impractical.
So, considering XML parsing, a function like:
[XmlEvent*] parseXml(Byte|Finished input());
simply won't work. The 2GB XML document is too large to fit in memory, and for our imaginary application, we don’t need it to.
Instead, a workable signature using both lazy input and lazy output streams is:
{XmlEvent*} parseXml({Byte*} input);
Further processing can easily be performed on the returned XmlEvent stream. (Note that this function imposes no restrictions on the returned stream: absent limitations of the provided input (hint: see below!), the returned {XmlEvent*} may be iterated as many times as desired.)
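For example, further processing might lazily extract element names. (A hypothetical sketch: XmlEvent, a StartElement subtype, and its name attribute are assumed here, not an existing API.) Nothing is read from the input until the result is actually iterated:

```ceylon
// Hypothetical sketch: assumes an XmlEvent hierarchy with a
// StartElement subtype carrying the element name.
{String*} elementNames({Byte*} input)
        => parseXml(input)
            .narrow<StartElement>()
            .map(StartElement.name);
```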
That leaves us with the final task of turning the 2GB file resource into a {Byte*}. Ideally, we'd be able to use something like:
{Byte*} fileToByteStream(File file);
Every call to iterator() would open a new file input stream to back the returned Iterator<Byte>. But, unfortunately, this would lead to file handle leaks, since there would be no way to close abandoned or unfinished iterators.
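To see the leak concretely (a sketch, assuming the hypothetical fileToByteStream above):

```ceylon
// Each iterator() call would open a fresh file input stream...
value bytes = fileToByteStream(file);
value firstByte = bytes.first;
// ...but 'first' abandons its iterator after one element, and
// Iterator has no close(), so the handle is never released.
```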
Instead, the best solution may be to use a class that satisfies both Destroyable and {Byte*}, and allows at most one iterator to be created:
class FileByteIterable(File f) satisfies {Byte*} & Destroyable {}
Instances of this class can safely be created in try blocks, ensuring that the underlying file handles are properly closed.
The following is a working demo of these ideas, but instead of processing XML, we’ll simply count uppercase characters:
shared void run() {
    value tempFile = temporaryDirectory.TemporaryFile(null, null);

    // create a 2GB+ file
    try (writer = tempFile.Overwriter()) {
        for (i in 0:40M) {
            writer.writeLine {
                "Lorem ipsum dolor sit amet, \
                 consectetur adipiscing elit";
            };
        }
    }

    // count uppercase characters using a stream
    try (bytes = FileByteIterable(tempFile)) {
        print {
            // Don't use collect here, or OOME!
            // Better would be a lazy utf8.decoder
            bytes.map((b) => b.unsigned.character)
                 .count(Character.uppercase);
        };
    }
    // prints 40000000
}
Note how convenient the file processing is, using familiar constructs! And, even with a much smaller 15MB input file, changing “map” to “collect” above causes the program to run *significantly* slower (multiple seconds vs. sub-second).
The FileByteIterable class is written in two parts: the class itself, which deals specifically with Files, and a reusable functionIterable() function, which can turn any generator function into a single-shot iterable:
class FileByteIterable(File f, Integer bufferSize = 1k)
        satisfies {Byte*} & Destroyable {

    assert (bufferSize >= 1);

    value reader = f.Reader();

    destroy = reader.destroy;

    iterator = expand {
        functionIterable {
            () => let (bytes = reader.readBytes(bufferSize))
                  if (bytes.empty)
                  then finished
                  else bytes;
        };
    }.iterator;
}
{Element*} functionIterable<Element>
        (Element|Finished generate()) => object
        satisfies {Element*} {

    variable value used = false;

    shared actual Iterator<Element> iterator() {
        "Only one iterator may be obtained from \
         each Iterable created by functionIterable()."
        assert (!used);
        used = true;
        return object satisfies Iterator<Element> {
            next() => generate();
        };
    }
};
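For what it's worth, functionIterable() can also be used on its own; here's a tiny (hypothetical) generator illustrating the single-shot restriction:

```ceylon
shared void demoFunctionIterable() {
    variable value n = 0;
    value numbers = functionIterable(
        () => if (n < 3) then n++ else finished);
    print(numbers.sequence()); // prints [0, 1, 2]
    // Obtaining a second iterator would now trip the assertion:
    // numbers.each(print); // AssertionError!
}
```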
The next step of course would be to add additional classes for dealing with other types of resources.
Note that this approximates two open issues that go a bit further, with the idea that resource Iterables wouldn't need the single-iterator restriction.
Of course, another option would be to allow multiple Iterators *until* the Iterable's destroy() is called, at which time all file handles would be destroyed and no further Iterators would be allowed. That may provide little practical benefit though, since only one Iterator should ever be necessary!
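That variant might look something like the following (an untested sketch; it simply tracks every Reader handed out so that destroy() can close them all):

```ceylon
import ceylon.collection { ArrayList }
import ceylon.file { File }

// Sketch only: multiple iterators are allowed until destroy().
class MultiUseFileByteIterable(File f, Integer bufferSize = 1k)
        satisfies {Byte*} & Destroyable {

    value readers = ArrayList<File.Reader>();
    variable value destroyed = false;

    shared actual Iterator<Byte> iterator() {
        "No iterators may be created after destroy()."
        assert (!destroyed);
        value reader = f.Reader();
        readers.add(reader);
        return expand(functionIterable(
            () => let (bytes = reader.readBytes(bufferSize))
                  if (bytes.empty) then finished else bytes)).iterator();
    }

    shared actual void destroy(Throwable? error) {
        destroyed = true;
        for (reader in readers) {
            reader.destroy(error);
        }
    }
}
```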
John