Fast forward-backward (write-read)

Virgil Stokes

Oct 23, 2012, 10:31:17 AM

to pytho...@python.org

I am working with some rather large data files (>100GB) that contain time series
data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
various types of processing on these data (e.g. moving median, moving average,
and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
number of these data need be stored in RAM when being processed. When performing
Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
Thus, I will need to input these variables saved to an external file from the
forward pass, in reverse order --- from last written to first written.

Finally, to my question --- What is a fast way to write these variables to an
external file and then read them in backwards?

Tim Chase

Oct 23, 2012, 12:09:58 PM

to Virgil Stokes, pytho...@python.org

Am I missing something, or would the fairly-standard "tac" utility
do the reversal you want? It should[*] be optimized to handle
on-disk files in a smart manner.

Otherwise, if you can pad the record-lengths so they're all the
same, and you know the total number of records, you can seek to
Total-(RecSize*OneBasedOffset) and write the record, optionally
padding if you need/can. At least on *nix-like OSes, you can seek
into a sparse-file with no problems (untested on Win32).
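
A minimal sketch of the fixed-size-record idea above, assuming each record is 11 float64 values (88 bytes) and that the record count is known before the forward pass starts. The names `RECORD`, `write_reversed`, and `read_forward` are illustrative, not anything from the thread. Record k goes into slot Total-1-k, so the backward pass becomes an ordinary sequential read:

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<11d")  # assumed layout: 11 float64 values per record


def write_reversed(path, records, total):
    """Write record k at slot (total - 1 - k), so the file ends up in reverse order."""
    with open(path, "wb") as f:
        f.truncate(total * RECORD.size)  # pre-size the file (sparse where supported)
        for k, values in enumerate(records):
            f.seek((total - 1 - k) * RECORD.size)
            f.write(RECORD.pack(*values))


def read_forward(path):
    """Plain sequential read; yields the records last-written first."""
    with open(path, "rb") as f:
        while True:
            raw = f.read(RECORD.size)
            if len(raw) < RECORD.size:
                break
            yield RECORD.unpack(raw)


# Small demonstration with 5 dummy records.
n = 5
demo = [tuple(float(k * 100 + j) for j in range(11)) for k in range(n)]
path = os.path.join(tempfile.mkdtemp(), "reversed.bin")
write_reversed(path, demo, n)
result = list(read_forward(path))
```

The trade-off is one seek per record during the forward pass; whether that beats seeking during the backward pass would have to be measured on the real data.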

-tkc

[*] Just guessing here. Would be disappointed if it *wasn't*.

Paul Rubin

Oct 23, 2012, 12:17:35 PM

Virgil Stokes <v...@it.uu.se> writes:
> Finally, to my question --- What is a fast way to write these
> variables to an external file and then read them in backwards?

Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to cache misprediction. If it's something
you're just running a few times, seeking backwards is the simplest
approach. If you're really trying to optimize the thing, you might
buffer up large chunks (like 1 MB) before writing. If you're writing
once and reading multiple times, you might reverse the order of records
within the chunks during the writing phase.
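
One way the chunking suggestion might look, as a sketch only: records are buffered, each chunk is written with its records already reversed, and the reader then needs one seek per chunk rather than one per record. The 11-float64 record layout and the names `write_chunked` and `read_backwards` are assumptions for illustration; the chunk is kept tiny here so the demo is easy to follow, where roughly 1 MB worth of records would be used in practice.

```python
import io
import struct

RECORD = struct.Struct("<11d")   # assumed record layout: 11 float64 values
CHUNK_RECORDS = 4                # tiny for the demo; about 1 MB worth in practice


def write_chunked(f, records, chunk_records=CHUNK_RECORDS):
    """Buffer records and write each chunk with its records in reverse order."""
    buf = []
    for values in records:
        buf.append(RECORD.pack(*values))
        if len(buf) == chunk_records:
            f.write(b"".join(reversed(buf)))
            buf = []
    if buf:
        f.write(b"".join(reversed(buf)))  # final, possibly short, chunk


def read_backwards(f, chunk_records=CHUNK_RECORDS):
    """Yield records last-written first, doing one seek and one read per chunk."""
    f.seek(0, 2)
    total = f.tell() // RECORD.size
    end = total
    size = total % chunk_records or chunk_records  # the last chunk may be short
    while end > 0:
        start = end - size
        f.seek(start * RECORD.size)
        data = f.read(size * RECORD.size)
        for values in RECORD.iter_unpack(data):
            yield values
        end = start
        size = chunk_records


demo = [tuple(float(k * 100 + j) for j in range(11)) for k in range(10)]
f = io.BytesIO()  # stands in for a real file opened in binary mode
write_chunked(f, demo)
backward = list(read_backwards(f))
```

Both sides must agree on `chunk_records`, since the reader infers the chunk boundaries from the file size.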

You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.

Of course (it should go without saying), you want to dump in a
binary format rather than converting to decimal.

Paul Rubin

Oct 23, 2012, 12:22:21 PM

Paul Rubin <no.e...@nospam.invalid> writes:
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach.

Oh yes, I should have mentioned, it may be simpler and perhaps a little
bit faster to use mmap rather than seeking.

Tim Chase

Oct 23, 2012, 12:53:37 PM

to pytho...@python.org

On 10/23/12 11:17, Paul Rubin wrote:
> Virgil Stokes <v...@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
>
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach. If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing. If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.

I agree with Paul here, it's been a while since I did it, and my
dataset was small enough (and passed through once) so I just let it
run. Writing larger chunks is definitely a good way to go.

> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.

I usually find that the I/O almost always overwhelms the actual
processing.

> Of course (it should go without saying), you want to dump in a
> binary format rather than converting to decimal.

Again, the conversion to/from decimal hasn't been a great cost in my
experience, as it's overwhelmed by the I/O cost of shoveling the
data to/from disk.

-tkc

Paul Rubin

Oct 23, 2012, 12:58:38 PM

Tim Chase <pytho...@tim.thechases.com> writes:
> Again, the conversion to/from decimal hasn't been a great cost in my
> experience, as it's overwhelmed by the I/O cost of shoveling the
> data to/from disk.

I've found that CPU costs both for processing and conversion are
significant. Also, using a binary format makes the file a lot smaller,
which decreases the I/O cost as well as eliminating the conversion cost.
And, the conversion can introduce precision loss, another thing to be
avoided. The famous "butterfly effect" was serendipitously discovered
that way.
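
A small illustration of the precision point, using only the standard library. The value and the `%.6f` format are arbitrary choices for the example, not anything the original poster described:

```python
import struct

x = 0.1 + 0.2  # a float64 with no short exact decimal form

# Binary: 8 bytes, bit-for-bit round trip.
packed = struct.pack("<d", x)
from_binary = struct.unpack("<d", packed)[0]

# Decimal with a fixed number of digits: compact to read, but lossy.
text = "%.6f" % x
from_text = float(text)

# repr() emits enough digits to round-trip, at the cost of a longer string.
from_repr = float(repr(x))

print(len(packed), len(text), len(repr(x)))
print(from_binary == x, from_text == x, from_repr == x)
```

So a decimal dump can be made lossless, but only by writing more characters per value than the 8 bytes the binary form needs.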

Virgil Stokes

Oct 23, 2012, 1:17:51 PM

to pytho...@python.org

On 23-Oct-2012 18:09, Tim Chase wrote:
> On 10/23/12 09:31, Virgil Stokes wrote:
>> I am working with some rather large data files (>100GB) that contain time series
>> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
>> various types of processing on these data (e.g. moving median, moving average,
>> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
>> number of these data need be stored in RAM when being processed. When performing
>> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
>> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
>> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
>> Thus, I will need to input these variables saved to an external file from the
>> forward pass, in reverse order --- from last written to first written.
>>

>> Finally, to my question --- What is a fast way to write these variables to an
>> external file and then read them in backwards?

> Am I missing something, or would the fairly-standard "tac" utility
> do the reversal you want? It should[*] be optimized to handle
> on-disk files in a smart manner.

Not sure about "tac" --- could you provide more details on this and/or a simple
example of how it could be used for fast reversed "reading" of a data file?

>
> Otherwise, if you can pad the record-lengths so they're all the
> same, and you know the total number of records, you can seek to
> Total-(RecSize*OneBasedOffset) and write the record, optionally
> padding if you need/can. At least on *nix-like OSes, you can seek
> into a sparse-file with no problems (untested on Win32).

The record lengths will all be the same and yes, seek could be used; but, I was
hoping for a faster method.

Thanks Tim! :-)

Virgil Stokes

Oct 23, 2012, 1:06:46 PM

to pytho...@python.org

On 23-Oct-2012 18:17, Paul Rubin wrote:
> Virgil Stokes <v...@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach. If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing. If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.

I am writing (forward) once and reading (backward) once.

>
> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.

I am currently using SciPy/NumPy

>
> Of course (it should go without saying), you want to dump in a
> binary format rather than converting to decimal.

Yes, I am doing this (but thanks for "underlining" it!)

Thanks Paul :-)

Virgil Stokes

Oct 23, 2012, 1:09:32 PM

to pytho...@python.org

On 23-Oct-2012 18:35, Dennis Lee Bieber wrote:
> On Tue, 23 Oct 2012 16:31:17 +0200, Virgil Stokes <v...@it.uu.se>
> declaimed the following in gmane.comp.python.general:

>
>> Finally, to my question --- What is a fast way to write these variables to an
>> external file and then read them in backwards?
>>

> Stuff them into an SQLite3 database and retrieve using a descending
> sort?
Have never worked with a database; but, could be worth a try (at least to
compare I/O times).
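
For comparison purposes, the SQLite suggestion could be tried along these lines. This is a sketch under assumptions: the table and column names (`fwd`, `k`, `payload`) are invented here, each record is taken to be 11 float64 values stored as one BLOB, and an in-memory database stands in for a file on disk.

```python
import sqlite3
import struct

RECORD = struct.Struct("<11d")  # assumed record layout: 11 float64 values

conn = sqlite3.connect(":memory:")  # a file path would be used for real data
conn.execute("CREATE TABLE fwd (k INTEGER PRIMARY KEY, payload BLOB)")

# Forward pass: insert one row per time step, keyed by the step index k.
demo = [tuple(float(k + j) for j in range(11)) for k in range(5)]
with conn:  # commits the inserts as a single transaction
    conn.executemany(
        "INSERT INTO fwd (k, payload) VALUES (?, ?)",
        ((k, RECORD.pack(*values)) for k, values in enumerate(demo)),
    )

# Backward pass: the descending sort returns the rows last-written first.
backward = [
    (k, RECORD.unpack(payload))
    for k, payload in conn.execute("SELECT k, payload FROM fwd ORDER BY k DESC")
]
conn.close()
```

Because `k` is the integer primary key, the descending scan can walk the table's own index; how the overall I/O time compares with a flat binary file is something only a timing run would show.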

Thanks Dennis :-)

Tim Chase

Oct 23, 2012, 1:56:29 PM

to Virgil Stokes, pytho...@python.org

On 10/23/12 12:17, Virgil Stokes wrote:
> On 23-Oct-2012 18:09, Tim Chase wrote:

>>> Finally, to my question --- What is a fast way to write these
>>> variables to an external file and then read them in
>>> backwards?

>> Am I missing something, or would the fairly-standard "tac"
>> utility do the reversal you want? It should[*] be optimized to
>> handle on-disk files in a smart manner.
> Not sure about "tac" --- could you provide more details on this
> and/or a simple example of how it could be used for fast reversed
> "reading" of a data file?

Well, if you're reading input.txt (and assuming it's one record per
line, separated by newlines), you can just use

tac < input.txt > backwards.txt

which will create a secondary file that is the first file in reverse
order. Your program can then process this secondary file in-order
(which would be backwards from your source).

I might have misunderstood your difficulty, but it _sounded_ like
you just want to reverse the order of a file.

-tkc

Virgil Stokes

Oct 23, 2012, 2:37:04 PM

to pytho...@python.org

Yes, I do wish to reverse the order, but the "forward in time" file will be in
binary.

--V

Cousin Stanley

Oct 23, 2012, 4:03:39 PM

Virgil Stokes wrote:

> Not sure about "tac" --- could you provide more details on this
> and/or a simple example of how it could be used for fast reversed
> "reading" of a data file ?

tac is available as a command under linux ....

$ whatis tac
tac (1) - concatenate and print files in reverse

$ whereis tac
tac: /usr/bin/tac /usr/bin/X11/tac /usr/share/man/man1/tac.1.gz

$ man tac

SYNOPSIS
tac [OPTION]... [FILE]...

DESCRIPTION

Write each FILE to standard output, last line first.

With no FILE, or when FILE is -, read standard input.

I only know that the tac command exists
but have never used it myself ....

--
Stanley C. Kitching
Human Being
Phoenix, Arizona

David Hutto

Oct 23, 2012, 5:50:55 PM

to Virgil Stokes, pytho...@python.org

On Tue, Oct 23, 2012 at 10:31 AM, Virgil Stokes <v...@it.uu.se> wrote:
> I am working with some rather large data files (>100GB) that contain time
> series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII
> format. I perform various types of processing on these data (e.g. moving
> median, moving average, and Kalman-filter, Kalman-smoother) in a sequential
> manner and only a small number of these data need be stored in RAM when
> being processed. When performing Kalman-filtering (forward in time pass, k =
> 0,1,...,N) I need to save to an external file several variables (e.g. 11*32
> bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother
> (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these
> variables saved to an external file from the forward pass, in reverse order
> --- from last written to first written.
>

> Finally, to my question --- What is a fast way to write these variables to
> an external file and then read them in backwards?

Don't forget to use timeit for an average OS utilization.

I'd suggest two list comprehensions for now, until I've reviewed it some more:

forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]

for var in forward:
    print(var)

for var in backward:
    print(var)

You could also use a dict, and iterate through a straight loop that
assigned a front and back to a dict_one = {0 : [0.100], 1 : [1.99]}
and then iterate through the loop, and call the first or second in the
dict's var list for frontwards, or backwards calls.

But there might be faster implementations, depending on other
function's usage of certain lower level functions.

--
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com

David Hutto

Oct 23, 2012, 6:36:33 PM

to Virgil Stokes, pytho...@python.org

> Don't forget to use timeit for an average OS utilization.
>
> I'd suggest two list comprehensions for now, until I've reviewed it some more:
>
> forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
> backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]
>
> for var in forward:
>     print(var)
>
> for var in backward:
>     print(var)
>
> You could also use a dict, and iterate through a straight loop that
> assigned a front and back to a dict_one = {0 : [0.100], 1 : [1.99]}
> and then iterate through the loop, and call the first or second in the
> dict's var list for frontwards, or backwards calls.
>
>
> But there might be faster implementations, depending on other
> function's usage of certain lower level functions.
>

Missed the part about it being a file. Use:

forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]

print forward,backward

David Hutto

Oct 23, 2012, 6:49:47 PM

to Virgil Stokes, pytho...@python.org

> Missed the part about it being a file. Use:
>
> forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
> backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]
>
> print forward,backward
>

This was a dud, let me rework it real quick, I deleted what i had, and
accidentally wrote the wrong function.

Steven D'Aprano

Oct 23, 2012, 6:53:42 PM

On Tue, 23 Oct 2012 17:50:55 -0400, David Hutto wrote:

> On Tue, Oct 23, 2012 at 10:31 AM, Virgil Stokes <v...@it.uu.se> wrote:
>> I am working with some rather large data files (>100GB)

[...]

>> Finally, to my question --- What is a fast way to write these variables
>> to an external file and then read them in backwards?
>
> Don't forget to use timeit for an average OS utilization.

Given that the data files are larger than 100 gigabytes, the time
required to process each file is likely to be in hours, not microseconds.
That being the case, timeit is the wrong tool for the job; it is
optimized for timing tiny code snippets. You could use it, of course,
but the added inconvenience doesn't gain you any added accuracy.

Here's a neat context manager that makes timing long-running code simple:

http://code.activestate.com/recipes/577896
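
For readers who cannot reach the link, here is a generic sketch of what such a timing context manager might look like. It is not the code from the linked recipe; the name `timed` and the optional `results` dict are assumptions made for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a wall-clock timer and report the duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print("%s took %.3f s" % (label, elapsed))


timings = {}
with timed("forward pass", timings):
    total = sum(range(100000))  # stand-in for the real processing
```

A single wall-clock measurement like this is usually adequate for runs that last minutes or hours, where per-call timer overhead is negligible.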

> I'd suggest two list comprehensions for now, until I've reviewed it some
> more:

I would be very surprised if the poster will be able to fit 100 gigabytes
of data into even a single list comprehension, let alone two.

This is a classic example of why the old external processing algorithms
of the 1960s and 70s will never be obsolete. No matter how much memory
you have, there will always be times when you want to process more data
than you can fit into memory.
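
As a concrete instance of processing more data than fits in memory, here is a sketch of a streaming moving average that holds only one window of samples at a time. It assumes each ASCII line is a whitespace-separated "t y" pair, which the original post does not actually specify; `moving_average` is an illustrative name.

```python
import io
from collections import deque


def moving_average(lines, window):
    """Yield (t, mean of the last `window` y values), keeping only `window` samples."""
    recent = deque()
    running = 0.0
    for line in lines:
        t_str, y_str = line.split()
        y = float(y_str)
        recent.append(y)
        running += y
        if len(recent) > window:
            running -= recent.popleft()  # drop the sample that left the window
        if len(recent) == window:
            yield float(t_str), running / window


# A StringIO stands in for the real file; an open file object iterates the same way.
sample = io.StringIO("0 1.0\n1 2.0\n2 3.0\n3 4.0\n4 5.0\n")
averages = list(moving_average(sample, 3))
```

Over billions of samples the running sum can accumulate rounding error, so recomputing it from the window every so often may be worth considering.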

--
Steven

Demian Brecht

Oct 23, 2012, 6:57:44 PM

to Steven D'Aprano, pytho...@python.org

> This is a classic example of why the old external processing algorithms
> of the 1960s and 70s will never be obsolete. No matter how much memory
> you have, there will always be times when you want to process more data
> than you can fit into memory.

But surely nobody will *ever* need more than 640k…

Right?

Demian Brecht
@demianbrecht
http://demianbrecht.github.com

David Hutto

Oct 23, 2012, 7:19:28 PM

to Virgil Stokes, pytho...@python.org

Whether this is fast enough, or not, I don't know:

filename = "data_file.txt"
f = open(filename, 'r')
forward = [line.rstrip('\n') for line in f.readlines()]
backward = [line.rstrip('\n') for line in reversed(forward)]
f.close()
print forward, "\n\n", "********************\n\n", backward, "\n"

David Hutto

Oct 23, 2012, 7:34:15 PM

to Steven D'Aprano, pytho...@python.org

On Tue, Oct 23, 2012 at 6:53 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> On Tue, 23 Oct 2012 17:50:55 -0400, David Hutto wrote:
>
>> On Tue, Oct 23, 2012 at 10:31 AM, Virgil Stokes <v...@it.uu.se> wrote:
>>> I am working with some rather large data files (>100GB)
> [...]
>>> Finally, to my question --- What is a fast way to write these variables
>>> to an external file and then read them in backwards?
>>
>> Don't forget to use timeit for an average OS utilization.
>
> Given that the data files are larger than 100 gigabytes, the time
> required to process each file is likely to be in hours, not microseconds.
> That being the case, timeit is the wrong tool for the job; it is
> optimized for timing tiny code snippets. You could use it, of course,
> but the added inconvenience doesn't gain you any added accuracy.

It depends on the end result, and the fact that if the iterations
themselves are about the same time, then just using a segment of the
iterations could be scaled down, and a full run might be worth it, if
you have a second computer running optimization.

>
> Here's a neat context manager that makes timing long-running code simple:
>
>
> http://code.activestate.com/recipes/577896

I'll test this out for big O notation later. For the OP:

http://en.wikipedia.org/wiki/Big_O_notation

>
>
>
>> I'd suggest two list comprehensions for now, until I've reviewed it some
>> more:
>
> I would be very surprised if the poster will be able to fit 100 gigabytes
> of data into even a single list comprehension, let alone two.

Again, these can be scaled depending on the operations of the function
in question, and the average time of aforementioned function(s)

>
> This is a classic example of why the old external processing algorithms
> of the 1960s and 70s will never be obsolete. No matter how much memory
> you have, there will always be times when you want to process more data
> than you can fit into memory

This is a common misconception. You can engineer a device that
accommodates this if it's a direct experimental necessity.

emile

Oct 23, 2012, 7:35:40 PM

to pytho...@python.org

On 10/23/2012 04:19 PM, David Hutto wrote:
> Whether this is fast enough, or not, I don't know:

well, the OP's original post started with
"I am working with some rather large data files (>100GB)..."

> filename = "data_file.txt"
> f = open(filename, 'r')
> forward = [line.rstrip('\n') for line in f.readlines()]

f.readlines() will be big(!) and have overhead... and forward results in
something again as big.

> backward = [line.rstrip('\n') for line in reversed(forward)]

and defining backward looks to me to require space to build backward and
hold reversed(forward)

So, let's see, at that point in time (building backward) you've got
probably somewhere close to 400-500 GB in memory.

My guess -- probably not so fast. Thrashing is sure to be a factor on
all but machines I'll never have a chance to work on.

> f.close()
> print forward, "\n\n", "********************\n\n", backward, "\n"

It's good to retain context.

Emile

Paul Rubin

Oct 23, 2012, 7:46:26 PM

Virgil Stokes <v...@it.uu.se> writes:
> Yes, I do wish to reverse the order, but the "forward in time" file
> will be in binary.

I really think it will be simplest to just write the file in forward
order, then use mmap to read it one record at a time. It might be
possible to squeeze out a little more performance with reordering tricks
but that's the first thing to try.
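
A sketch of the mmap approach with the standard library, assuming fixed-size records of 11 float64 values (the layout and the name `iter_records_reversed` are illustrative). The file is written forward as usual and then mapped read-only; records are unpacked from the last slot to the first, leaving the paging to the operating system.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<11d")  # assumed record layout: 11 float64 values


def iter_records_reversed(path):
    """Map the file and unpack fixed-size records from the last to the first."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = len(mm) // RECORD.size
            for k in range(count - 1, -1, -1):
                yield RECORD.unpack_from(mm, k * RECORD.size)
        finally:
            mm.close()


# Forward pass: plain sequential writes.
path = os.path.join(tempfile.mkdtemp(), "forward.bin")
demo = [tuple(float(k * 100 + j) for j in range(11)) for k in range(6)]
with open(path, "wb") as f:
    for values in demo:
        f.write(RECORD.pack(*values))

# Backward pass: walk the mapping from the end.
backward = list(iter_records_reversed(path))
```

Mapping a file of this size in one piece needs a 64-bit Python, and an empty file cannot be mapped at all, so real code would want to guard against both.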

David Hutto

Oct 23, 2012, 8:01:36 PM

to emile, pytho...@python.org

On Tue, Oct 23, 2012 at 7:35 PM, emile <em...@fenx.com> wrote:
> On 10/23/2012 04:19 PM, David Hutto wrote:
>>
>> Whether this is fast enough, or not, I don't know:
>
>
> well, the OP's original post started with
> "I am working with some rather large data files (>100GB)..."

Well, is this a dedicated system, and one that they have the budget to upgrade?

Data files have some sort of parsing, unless it's one huge dict, or
list, so there has to be an average size to the parse.

So big O notation should begin to refine without a full file.

>
>
>> filename = "data_file.txt"
>> f = open(filename, 'r')
>> forward = [line.rstrip('\n') for line in f.readlines()]
>
>
> f.readlines() will be big(!) and have overhead... and forward results in
> something again as big.
>

Not if an average can be taken, and then refined as the actual gigs
are being iterated through.

>
>> backward = [line.rstrip('\n') for line in reversed(forward)]
>
>
> and defining backward looks to me to require space to build backward and
> hold reversed(forward)
>
> So, let's see, at that point in time (building backward) you've got
> probably somewhere close to 400-500 GB in memory.
>
> My guess -- probably not so fast. Thrashing is sure to be a factor on all
> but machines I'll never have a chance to work on.

But does the OP have access? They never stated their hardware, and
upgradable budget.

>
>
>> f.close()
>> print forward, "\n\n", "********************\n\n", backward, "\n"
>
>
>
> It's good to retain context.

Trying to practice good form ;).

Oscar Benjamin

Oct 23, 2012, 8:06:13 PM

to Virgil Stokes, pytho...@python.org

On 23 October 2012 15:31, Virgil Stokes <v...@it.uu.se> wrote:
> I am working with some rather large data files (>100GB) that contain time
> series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII
> format. I perform various types of processing on these data (e.g. moving
> median, moving average, and Kalman-filter, Kalman-smoother) in a sequential
> manner and only a small number of these data need be stored in RAM when
> being processed. When performing Kalman-filtering (forward in time pass, k =
> 0,1,...,N) I need to save to an external file several variables (e.g. 11*32
> bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother
> (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these
> variables saved to an external file from the forward pass, in reverse order
> --- from last written to first written.
>

> Finally, to my question --- What is a fast way to write these variables to
> an external file and then read them in backwards?

You mentioned elsewhere that you are using numpy. I'll assume that the
data you want to read/write are numpy arrays.

Numpy arrays can be written very efficiently in binary form using
tofile/fromfile:

>>> import numpy
>>> a = numpy.array([1, 2, 5], numpy.int64)
>>> a
array([1, 2, 5])
>>> with open('data.bin', 'wb') as f:
...     a.tofile(f)
...

You can then reload the array with:

>>> with open('data.bin', 'rb') as f:
...     a2 = numpy.fromfile(f, numpy.int64)
...
>>> a2
array([1, 2, 5])

Numpy arrays can be reversed before writing or after reading using slicing:

>>> a2
array([1, 2, 5])
>>> a2[::-1]
array([5, 2, 1])

Assuming you wrote the file forwards you can make an iterator to yield
the file in chunks backwards like so (untested):

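def read_backwards(f, dtype, chunksize=1024 ** 2):
    dtype = numpy.dtype(dtype)
    nbytes = chunksize * dtype.itemsize
    f.seek(0, 2)  # jump to the end of the file
    fpos = f.tell()
    while fpos > nbytes:
        fpos -= nbytes  # step back one chunk before reading it
        f.seek(fpos, 0)
        yield numpy.fromfile(f, dtype, chunksize)[::-1]
    f.seek(0, 0)  # whatever is left sits at the start of the file
    yield numpy.fromfile(f, dtype, fpos // dtype.itemsize)[::-1]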
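
One caveat with reversing a flat array element by element: if each time step stores several values, the values inside each record come back reversed as well. A record-aware variant might look like the sketch below, which assumes 11 float64 values per record and a NumPy recent enough to support the `offset` argument of `numpy.fromfile` (1.17 or later). The name `read_records_backwards` is illustrative.

```python
import os
import tempfile

import numpy as np

FIELDS = 11  # assumed: each record holds 11 float64 values


def read_records_backwards(path, fields=FIELDS, chunk_records=1024):
    """Yield 2-D blocks of records, newest block first and newest row first."""
    itemsize = np.dtype(np.float64).itemsize
    total = os.path.getsize(path) // (fields * itemsize)
    end = total
    while end > 0:
        start = max(0, end - chunk_records)
        block = np.fromfile(
            path,
            dtype=np.float64,
            count=(end - start) * fields,
            offset=start * fields * itemsize,
        )
        # Reverse the row order only; fields inside each record keep their order.
        yield block.reshape(-1, fields)[::-1]
        end = start


# Forward pass: 10 dummy records written in time order.
path = os.path.join(tempfile.mkdtemp(), "forward.bin")
forward = np.arange(10 * FIELDS, dtype=np.float64).reshape(10, FIELDS)
forward.tofile(path)

# Backward pass in blocks of 4 records.
result = np.concatenate(list(read_records_backwards(path, chunk_records=4)))
```

Each yielded block is a view with reversed row order, so iterating over its rows gives the records from latest to earliest.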

Oscar

Tim Chase

Oct 23, 2012, 8:30:54 PM

to Virgil Stokes, pytho...@python.org

On 10/23/12 13:37, Virgil Stokes wrote:
> Yes, I do wish to reverse the order, but the "forward in time"
> file will be in binary.

Your original post said:

> The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format

making it hard to know what sort of data is in this file.

So I guess it would help to have some sample data to work with, even
if it's just some dummy data and a raw processing loop without doing
anything inside it. Something like the output of either of these

$ xxd forward_data.txt | head -50 > forward_head.txt
$ od forward_data.txt | head -50 > forward_head.txt

plus a basic loop to show how you're extracting the values:

for line in open("forward_head.txt"):
    data1, data2, data3 = process(line)

and how you want to reverse over them:

for line in open("reversed.txt"):
    if same_processing_as_forward_source:
        data1, data2, data3 = process(line)
    else:
        data1, data2, data3 = other_process(line)

or do you want something more like

for line in super_reverse_magic(open("forward_head.txt")):
    data1, data2, data3 = process(line)

?

-tkc

David Hutto

Oct 23, 2012, 10:29:09 PM

to Oscar Benjamin, pytho...@python.org

On Tue, Oct 23, 2012 at 8:06 PM, Oscar Benjamin
<oscar.j....@gmail.com> wrote:
> On 23 October 2012 15:31, Virgil Stokes <v...@it.uu.se> wrote:
>> I am working with some rather large data files (>100GB) that contain time
>> series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII
>> format. I perform various types of processing on these data (e.g. moving
>> median, moving average, and Kalman-filter, Kalman-smoother) in a sequential
>> manner and only a small number of these data need be stored in RAM when
>> being processed. When performing Kalman-filtering (forward in time pass, k =
>> 0,1,...,N) I need to save to an external file several variables (e.g. 11*32
>> bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother
>> (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these
>> variables saved to an external file from the forward pass, in reverse order
>> --- from last written to first written.
>>
>> Finally, to my question --- What is a fast way to write these variables to
>> an external file and then read them in backwards?
>
> You mentioned elsewhere that you are using numpy. I'll assume that the
> data you want to read/write are numpy arrays.

If that is the case, always use timeit. The following is an example of three
snippets, each run 100000 times; timeit reports the total time for each:

import timeit
#3 dimensional matrix
x_dim = -1
y_dim = -1
z_dim = -1
s = """\

x_dim = -1
y_dim = -1
z_dim = -1
dict_1 = {}

for i in range(0,6):
    x_dim = 1
    y_dim = 1
    z_dim = 1
    dict_1['%s' % (i) ] = ['x = %i' % (x_dim), 'y = %i' % (y_dim), 'z = %i' % (z_dim)]

"""

t = """\
import numpy
numpy.array([[ 1., 0., 0.],
[ 0., 1., 2.]])
"""

u = """\
list_count = 0
an_array = []
for i in range(0,10):

    if list_count > 3:
        break

    if i % 3 != 0:
        an_array.append(i)

    if i % 3 == 0:
        list_count += 1

"""
print(timeit.timeit(stmt=s, number=100000))
print(timeit.timeit(stmt=t, number=100000))
print(timeit.timeit(stmt=u, number=100000))

Virgil Stokes

Oct 24, 2012, 3:05:38 AM

to pytho...@python.org

On 24-Oct-2012 00:36, David Hutto wrote:
>> Don't forget to use timeit for an average OS utilization.
>>

>> I'd suggest two list comprehensions for now, until I've reviewed it some more:
>>

>> forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
>> backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]
>>

>> for var in forward:
>>     print(var)
>>
>> for var in backward:
>>     print(var)
>>
>> You could also use a dict, and iterate through a straight loop that
>> assigned a front and back to a dict_one = {0 : [0.100], 1 : [1.99]}
>> and then iterate through the loop, and call the first or second in the
>> dict's var list for frontwards, or backwards calls.
>>
>>
>> But there might be faster implementations, depending on other
>> function's usage of certain lower level functions.
>>

> Missed the part about it being a file. Use:
>
> forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
> backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]
>
> print forward,backward

Interesting approach for small data sets (or blocks from a much larger data set).

Thanks David :-)

Virgil Stokes

Oct 24, 2012, 3:12:29 AM

to pytho...@python.org

On 24-Oct-2012 02:06, Oscar Benjamin wrote:
> On 23 October 2012 15:31, Virgil Stokes <v...@it.uu.se> wrote:
>> I am working with some rather large data files (>100GB) that contain time
>> series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII
>> format. I perform various types of processing on these data (e.g. moving
>> median, moving average, and Kalman-filter, Kalman-smoother) in a sequential
>> manner and only a small number of these data need be stored in RAM when
>> being processed. When performing Kalman-filtering (forward in time pass, k =
>> 0,1,...,N) I need to save to an external file several variables (e.g. 11*32
>> bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother
>> (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these
>> variables saved to an external file from the forward pass, in reverse order
>> --- from last written to first written.
>>
>> Finally, to my question --- What is a fast way to write these variables to
>> an external file and then read them in backwards?
> You mentioned elsewhere that you are using numpy. I'll assume that the
> data you want to read/write are numpy arrays.
>

Ok Oscar,
Thanks for the tip and I will look into this more.

Virgil Stokes

Oct 24, 2012, 3:17:05 AM

to pytho...@python.org

On 24-Oct-2012 00:57, Demian Brecht wrote:
>> This is a classic example of why the old external processing algorithms
>> of the 1960s and 70s will never be obsolete. No matter how much memory
>> you have, there will always be times when you want to process more data
>> than you can fit into memory.
>

> But surely nobody will *ever* need more than 640k…
>
> Right?
>
> Demian Brecht
> @demianbrecht
> http://demianbrecht.github.com
>
>
>
>

Yes, I can still remember such quotes --- thanks for jogging my memory, Demian :-)

Virgil Stokes

Oct 24, 2012, 3:19:36 AM

to pytho...@python.org

On 24-Oct-2012 00:53, Steven D'Aprano wrote:
> On Tue, 23 Oct 2012 17:50:55 -0400, David Hutto wrote:
>
>> On Tue, Oct 23, 2012 at 10:31 AM, Virgil Stokes <v...@it.uu.se> wrote:
>>> I am working with some rather large data files (>100GB)
> [...]
>>> Finally, to my question --- What is a fast way to write these variables
>>> to an external file and then read them in backwards?
>> Don't forget to use timeit for an average OS utilization.
> Given that the data files are larger than 100 gigabytes, the time
> required to process each file is likely to be in hours, not microseconds.
> That being the case, timeit is the wrong tool for the job; it is
> optimized for timing tiny code snippets. You could use it, of course,
> but the added inconvenience doesn't gain you any added accuracy.
>
> Here's a neat context manager that makes timing long-running code simple:
>
>
> http://code.activestate.com/recipes/577896

Thanks for this link

>
>
>
>> I'd suggest two list comprehensions for now, until I've reviewed it some
>> more:
> I would be very surprised if the poster will be able to fit 100 gigabytes
> of data into even a single list comprehension, let alone two.

You are correct and I have been looking at working with blocks that are sized to
the RAM available for processing.

>
> This is a classic example of why the old external processing algorithms
> of the 1960s and 70s will never be obsolete. No matter how much memory
> you have, there will always be times when you want to process more data
> than you can fit into memory.
>
>
>

Thanks for your insights :-)

David Hutto

Oct 24, 2012, 3:19:30 AM

to Virgil Stokes, pytho...@python.org

On Wed, Oct 24, 2012 at 3:05 AM, Virgil Stokes <v...@it.uu.se> wrote:
> On 24-Oct-2012 00:36, David Hutto wrote:
>>>

>>> Don't forget to use timeit for an average OS utilization.
>>>

>>> I'd suggest two list comprehensions for now, until I've reviewed it some
>>> more:
>>>

>>> forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
>>> backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]
>>>
>>> for var in forward:
>>>     print var
>>>
>>> for var in backward:
>>>     print var
>>>
>>> You could also use a dict, and iterate through a straight loop that
>>> assigned a front and back to a dict_one = {0 : [0.100], 1 : [1.99]}
>>> and then iterate through the loop, and call the first or second in the
>>> dict's var list for forwards or backwards calls.
>>>
>>>
>>> But there might be faster implementations, depending on other
>>> function's usage of certain lower level functions.
>>>
>> Missed the part about it being a file. Use:
>>
>> forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
>> backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]
>>
>> print forward,backward
>
> Interesting approach for small data sets (or blocks from a much larger data
> set).
>
> Thanks David :-)

No problem.

I think this was for a file > 100GB, which might be able to be reduced for
parsing if I could see a snippet of the usual data being processed by
the function.

But it does come down to big O notation, and optimization of the average data
being passed through, unless the data varies over wide ranges, in which
case that could be optimized to go from smaller to larger, vice versa, or
other pieces of data with a higher priority level.

David Hutto

Oct 24, 2012, 3:26:30 AM

to Virgil Stokes, pytho...@python.org

On Wed, Oct 24, 2012 at 3:17 AM, Virgil Stokes <v...@it.uu.se> wrote:
> On 24-Oct-2012 00:57, Demian Brecht wrote:
>>>

>>> This is a classic example of why the old external processing algorithms
>>> of the 1960s and 70s will never be obsolete. No matter how much memory
>>> you have, there will always be times when you want to process more data
>>> than you can fit into memory.
>>
>>

>> But surely nobody will *ever* need more than 640k…
>>
>> Right?
>>
>> Demian Brecht
>> @demianbrecht
>> http://demianbrecht.github.com
>>
>>
>>
>>
> Yes, I can still remember such quotes --- thanks for jogging my memory,
> Demian :-)

This is only on equipment designed by others; otherwise, you could
engineer the hardware yourself to perform just certain functions for
you (RISC), and pass that back to the CISC (from a PCB design).

Virgil Stokes

Oct 24, 2012, 3:07:43 AM

to pytho...@python.org

Unfortunately, I may be forced to process the data on a Windows platform; but,
thanks Cousin for the Linux tip.

Steven D'Aprano

Oct 24, 2012, 4:05:02 AM

On Wed, 24 Oct 2012 01:23:58 -0400, Dennis Lee Bieber wrote:

> On Tue, 23 Oct 2012 16:35:40 -0700, emile <em...@fenx.com> declaimed the
> following in gmane.comp.python.general:

>
>> On 10/23/2012 04:19 PM, David Hutto wrote:
>> > forward = [line.rstrip('\n') for line in f.readlines()]
>>
>> f.readlines() will be big(!) and have overhead... and forward results
>> in something again as big.
>>

> Well, since file objects are iterable, could one just drop the
> .readlines() ? ( ... line in f )

Yes, but the bottleneck is still that the list comprehension will run to
completion, trying to process the entire 100+ GB file in one go.

[...]
> And since the line-ends have already been stripped from forward,
> backward should just be:
>
> backward = reversed(forward)

reversed returns a lazy iterator, but it requires that forward is a non-
lazy (eager) sequence. So again you're stuck trying to read the entire
file into RAM.
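
One way around both problems is a generator that seeks backwards through the file one block at a time, so only a single block is ever held in RAM. The sketch below assumes the forward pass writes fixed-size binary records; the layout (11 float64 values per record) and the names RECORD, write_forward and read_backward are illustrative assumptions, not something stated in the thread.

```python
import os
import struct
import tempfile

# Assumed record layout: 11 little-endian float64 values (88 bytes).
RECORD = struct.Struct("<11d")


def write_forward(path, records):
    """Write fixed-size binary records in forward (time) order."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def read_backward(path, records_per_block=4096):
    """Yield records last-to-first, holding only one block in memory."""
    size = os.path.getsize(path)
    if size % RECORD.size:
        raise ValueError("file size is not a whole number of records")
    remaining = size // RECORD.size  # records not yet yielded
    with open(path, "rb") as f:
        while remaining > 0:
            n = min(records_per_block, remaining)
            remaining -= n
            f.seek(remaining * RECORD.size)
            block = f.read(n * RECORD.size)
            # Walk the block from its last record to its first.
            for i in range(n - 1, -1, -1):
                yield RECORD.unpack_from(block, i * RECORD.size)


# Small demonstration: ten records, read back in blocks of four.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "filter_state.bin")
    write_forward(demo_path, ([float(k)] * 11 for k in range(10)))
    first_values = [rec[0] for rec in
                    read_backward(demo_path, records_per_block=4)]

print(first_values)
```

Each block is read with one seek and one sequential read, so the number of backward seeks is the number of blocks, not the number of records. The block size would need tuning against the available RAM and the disk.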

--
Steven

Tim Golden

Oct 24, 2012, 4:12:16 AM

to pytho...@python.org

On 24/10/2012 08:07, Virgil Stokes wrote:
> On 23-Oct-2012 22:03, Cousin Stanley wrote:

> Unfortunately, I may be forced to process the data on a Windows
> platform; but, thanks Cousin for the Linux tip.

Well, addressing that specific point, tac is available for Windows:

http://unxutils.sourceforge.net/

No idea how efficient it is...

TJG

Emile van Sebille

Oct 24, 2012, 9:47:51 AM

to pytho...@python.org

On 10/23/2012 4:35 PM, emile wrote:

> So, let's see, at that point in time (building backward) you've got
> probably somewhere close to 400-500Gb in memory.
>
> My guess -- probably not so fast. Thrashing is sure to be a factor on
> all but machines I'll never have a chance to work on.

I went looking for a machine capable of this and got about halfway there
with http://www.tech-news.com/publib/pl2818.html which allows up to
248Gb memory -- near as I can tell the price for the maxed out system is
$2,546,200. Plus $3k/mo maintenance. Memory's still not quite enough,
but I'll bet it's fast. And a lot more reasonable at $1000 per Gb of
memory particularly when contrasted to the $1000 I paid for a single Mb
of memory back in 1984 or thereabouts.

Emile

Grant Edwards

Oct 24, 2012, 9:56:57 AM

On 2012-10-23, Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> I would be very surprised if the poster will be able to fit 100
> gigabytes of data into even a single list comprehension, let alone
> two.
>
> This is a classic example of why the old external processing
> algorithms of the 1960s and 70s will never be obsolete. No matter how
> much memory you have, there will always be times when you want to
> process more data than you can fit into memory.

Too true. One of the projects I did in grad school about 20 years ago
was a plugin for some fancy data visualization software (I think it
was DX: http://www.research.ibm.com/dx/). My plugin would subsample
"on the fly" a selected section of a huge 2D array of data in a file.
IBM and SGI had all sorts of widgets you could use to sample,
transform and visualize data, but they all assumed that the input data
would fit into virtual memory.

--
Grant Edwards <grant.b.edwards at gmail.com>
Yow! I Know A Joke!!

Paul Rubin

Oct 24, 2012, 11:08:22 AM

Emile van Sebille <em...@fenx.com> writes:
>> probably somewhere close to 400-500Gb in memory....

> I went looking for a machine capable of this and got about halfway
> there with http://www.tech-news.com/publib/pl2818.html which allows up
> to 248Gb memory -- near as I can tell the price for the maxed out
> system is $2,546,200. Plus $3k/mo maintenance.

1x http://www.newegg.com/Product/Product.aspx?Item=N82E16816101317
@ $1400

+ 8x http://www.newegg.com/Product/Product.aspx?Item=N82E16820239276
@ $658

+ 4x http://www.newegg.com/Product/Product.aspx?Item=N82E16819113036
@ $520

+ 2x

= $8744 for a 64-core box with 512GB of ram. Add another $1000 or so
for disks/SSD's depending on configuration. Not bad. There were times
I coulda used something like this.

rusi

Oct 24, 2012, 11:11:37 AM

On Oct 23, 7:52 pm, Virgil Stokes <v...@it.uu.se> wrote:
> I am working with some rather large data files (>100GB) that contain time series
> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
> various types of processing on these data (e.g. moving median, moving average,
> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
> number of these data need be stored in RAM when being processed. When performing
> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
> Thus, I will need to input these variables saved to an external file from the
> forward pass, in reverse order --- from last written to first written.
>

> Finally, to my question --- What is a fast way to write these variables to an
> external file and then read them in backwards?

Have you tried gdbm/bsddb? They are meant for such (I believe).
Probably needs to be installed for Windows; works for Linux.
If I were you I'd try it out with the giant data on Linux and see if the
problem is solved, then see how to install for Windows.
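
A keyed store of this kind could be tried along the following lines. This sketch uses the standard-library dbm module (the Python 3 name, which picks whichever backend is installed) rather than gdbm or bsddb directly; the record layout, the key format and the helper name key are assumptions made for illustration. Whether a keyed database is actually faster than a flat file for this workload is untested here.

```python
import dbm
import os
import struct
import tempfile

# Assumed record layout: 11 little-endian float64 values per time step.
RECORD = struct.Struct("<11d")


def key(k):
    """Hypothetical key format: zero-padded step index as ASCII bytes."""
    return ("%012d" % k).encode("ascii")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filter_state")
    n = 5

    # Forward pass: store one record per time step k = 0, 1, ..., n-1.
    with dbm.open(path, "n") as db:
        for k in range(n):
            db[key(k)] = RECORD.pack(*([float(k)] * 11))

    # Backward pass: look the records up by key, k = n-1, ..., 0.
    backward = []
    with dbm.open(path, "r") as db:
        for k in range(n - 1, -1, -1):
            backward.append(RECORD.unpack(db[key(k)])[0])

print(backward)
```

The lookup order is driven entirely by the keys, so the backward pass needs no seeking logic of its own; the cost is the per-record overhead of the database.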

Virgil Stokes

Oct 25, 2012, 5:00:59 AM

to pytho...@python.org

Thanks Rusi :-)

Dave Angel

Oct 28, 2012, 7:18:01 AM

to Virgil Stokes, pytho...@python.org

On 10/24/2012 03:14 AM, Virgil Stokes wrote:

> Thanks Paul,
> I am working on this approach now...

If you're using mmap to map the whole file, you'll need 64bit Windows to
start with. I'd be interested to know if Windows will allow you to mmap
100gb at one stroke. Have you tried it, or are you starting by figuring
how to access the data from the mmap?
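
If mapping the whole file turns out to be a problem, mapping it one window at a time is an alternative that does not depend on the answer. The sketch below assumes fixed-size binary records (11 float64 values each, an assumption) and walks the file from the end; the function name read_backward_mmap and its records_per_window parameter are made up for illustration. The only platform detail it relies on is that the mmap offset must be a multiple of mmap.ALLOCATIONGRANULARITY.

```python
import mmap
import os
import struct
import tempfile

# Assumed record layout: 11 little-endian float64 values (88 bytes).
RECORD = struct.Struct("<11d")


def read_backward_mmap(path, records_per_window=65536):
    """Yield records last-to-first, mapping one window of the file at a time."""
    gran = mmap.ALLOCATIONGRANULARITY
    total = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f:
        stop = total  # records [stop, total) have already been yielded
        while stop > 0:
            start = max(0, stop - records_per_window)
            byte_start = start * RECORD.size
            # Round the mapping offset down to the allocation granularity.
            map_offset = (byte_start // gran) * gran
            length = stop * RECORD.size - map_offset
            m = mmap.mmap(f.fileno(), length,
                          access=mmap.ACCESS_READ, offset=map_offset)
            try:
                base = byte_start - map_offset
                for i in range(stop - start - 1, -1, -1):
                    yield RECORD.unpack_from(m, base + i * RECORD.size)
            finally:
                m.close()
            stop = start


# Small demonstration: 2000 records, read back in windows of 300.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "filter_state.bin")
    with open(demo_path, "wb") as f:
        for k in range(2000):
            f.write(RECORD.pack(*([float(k)] * 11)))
    values = [rec[0] for rec in
              read_backward_mmap(demo_path, records_per_window=300)]

print(values[:3])
```

Because each window is at most records_per_window records plus the alignment slack, the address space needed stays small regardless of the file size.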

--

DaveA

Virgil Stokes

Oct 28, 2012, 10:20:45 AM

to pytho...@python.org

On 28-Oct-2012 12:18, Dave Angel wrote:
> On 10/24/2012 03:14 AM, Virgil Stokes wrote:
>> On 24-Oct-2012 01:46, Paul Rubin wrote:

>> Thanks Paul,
>> I am working on this approach now...
> If you're using mmap to map the whole file, you'll need 64bit Windows to
> start with. I'd be interested to know if Windows will allow you to mmap
> 100gb at one stroke. Have you tried it, or are you starting by figuring
> how to access the data from the mmap?

Thanks very much for pursuing my query, Dave.

I have not tried it yet --- temporarily side-tracked; but, I will post my
findings on this issue.

Oscar Benjamin

Oct 28, 2012, 2:21:45 PM

to Virgil Stokes, pytho...@python.org

On 28 October 2012 14:20, Virgil Stokes <v...@it.uu.se> wrote:
> On 28-Oct-2012 12:18, Dave Angel wrote:
>>
>> On 10/24/2012 03:14 AM, Virgil Stokes wrote:
>>>
>>> On 24-Oct-2012 01:46, Paul Rubin wrote:
>>>>

>>> Thanks Paul,
>>> I am working on this approach now...
>>
>> If you're using mmap to map the whole file, you'll need 64bit Windows to
>> start with. I'd be interested to know if Windows will allow you to mmap
>> 100gb at one stroke. Have you tried it, or are you starting by figuring
>> how to access the data from the mmap?
>
> Thanks very much for pursuing my query, Dave.
>
> I have not tried it yet --- temporarily side-tracked; but, I will post my
> findings on this issue.

If you are going to use mmap then look at the numpy.memmap function.
This wraps Python's mmap so that you can access the contents of the
mapped binary file as if it were a numpy array. This means that you
don't need to handle the bytes -> float conversions yourself.

>>> import numpy
>>> a = numpy.array([4,5,6], numpy.float64)
>>> a
array([ 4., 5., 6.])
>>> with open('tmp.bin', 'wb') as f: # write forwards
...     a.tofile(f)
...     a.tofile(f)
...
>>> a2 = numpy.memmap('tmp.bin', numpy.float64)
>>> a2
memmap([ 4., 5., 6., 4., 5., 6.])
>>> a2[3]
4.0
>>> a2[5:2:-1] # read backwards
memmap([ 6., 5., 4.])

Oscar

Virgil Stokes

Oct 28, 2012, 6:36:04 PM

to pytho...@python.org

On 2012-10-28 19:21, Oscar Benjamin wrote:
> On 28 October 2012 14:20, Virgil Stokes <v...@it.uu.se> wrote:
>> On 28-Oct-2012 12:18, Dave Angel wrote:
>>> On 10/24/2012 03:14 AM, Virgil Stokes wrote:
>>>> On 24-Oct-2012 01:46, Paul Rubin wrote:

Thanks Oscar!