I am working with some rather large data files (>100GB) that contain time series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform various types of processing on these data (e.g. moving median, moving average, and Kalman-filter, Kalman-smoother) in a sequential manner and only a small number of these data need be stored in RAM when being processed. When performing Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these variables saved to an external file from the forward pass, in reverse order --- from last written to first written.
Finally, to my question --- What is a fast way to write these variables to an external file and then read them in backwards?
> I am working with some rather large data files (>100GB) that contain time series
> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
> various types of processing on these data (e.g. moving median, moving average,
> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
> number of these data need be stored in RAM when being processed. When performing
> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
> Thus, I will need to input these variables saved to an external file from the
> forward pass, in reverse order --- from last written to first written.
> Finally, to my question --- What is a fast way to write these variables to an
> external file and then read them in backwards?
Am I missing something, or would the fairly-standard "tac" utility
do the reversal you want? It should[*] be optimized to handle
on-disk files in a smart manner.
Otherwise, if you can pad the record-lengths so they're all the
same, and you know the total number of records, you can seek to
Total-(RecSize*OneBasedOffset) and write the record, optionally
padding if you need/can. At least on *nix-like OSes, you can seek
into a sparse-file with no problems (untested on Win32).
-tkc
[*]
Just guessing here. Would be disappointed if it *wasn't*.
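A minimal sketch of the fixed-record seek idea above, assuming every record has the same size and the total record count is known before the forward pass starts. The names N_FIELDS, RECORD, write_reversed and read_forward, and the layout of 11 little-endian float64 values per record, are illustrative assumptions and not from the thread.

```python
import os
import struct
import tempfile

N_FIELDS = 11                          # assumed: 11 float64 values per record
RECORD = struct.Struct("<%dd" % N_FIELDS)

def write_reversed(path, records, total):
    """Write record k (0-based) at slot total-1-k, so the file ends up reversed."""
    with open(path, "wb") as f:
        for k, rec in enumerate(records):
            # Same position as Total - RecSize*OneBasedOffset with offset k+1.
            f.seek((total - 1 - k) * RECORD.size)
            f.write(RECORD.pack(*rec))

def read_forward(path):
    """Read the (already reversed) file with plain sequential reads."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(RECORD.size)
            if len(buf) < RECORD.size:
                break
            yield RECORD.unpack(buf)

# Tiny demonstration with 5 records; record k holds the value k in every field.
demo_path = os.path.join(tempfile.mkdtemp(), "kalman_fwd.bin")
write_reversed(demo_path, ([float(k)] * N_FIELDS for k in range(5)), total=5)
first_fields = [rec[0] for rec in read_forward(demo_path)]
```

With this layout the backward pass becomes an ordinary front-to-back read, at the price of non-sequential writes during the forward pass.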
Virgil Stokes <v...@it.uu.se> writes:
> Finally, to my question --- What is a fast way to write these
> variables to an external file and then read them in backwards?
Seeking backwards in files works, but the performance hit is
significant. There is also a performance hit to scanning pointers
backwards in memory, due to cache misprediction. If it's something
you're just running a few times, seeking backwards is the simplest
approach. If you're really trying to optimize the thing, you might
buffer up large chunks (like 1 MB) before writing. If you're writing
once and reading multiple times, you might reverse the order of records
within the chunks during the writing phase.
You're of course taking a performance bath from writing the program in
Python to begin with (unless using scipy/numpy or the like), enough that
it might dominate any effects of how the files are written.
Of course (it should go without saying) that you want to dump in a
binary format rather than converting to decimal.
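The chunking idea above can be sketched roughly as follows. This assumes fixed-size records; RECORD, CHUNK_RECORDS, write_chunked and read_backward are hypothetical names, and the two-float64 record layout is only for illustration. Chunks are written in forward order, but the records inside each chunk are reversed at write time, so the reader only has to visit the chunks from last to first.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<2d")          # assumed layout: (t_k, value) as two float64
CHUNK_RECORDS = 4                      # tiny for the demo; think ~1 MB worth in practice

def write_chunked(path, records):
    """Buffer CHUNK_RECORDS records, then write each chunk with its records reversed."""
    with open(path, "wb") as f:
        buf = []
        for rec in records:
            buf.append(RECORD.pack(*rec))
            if len(buf) == CHUNK_RECORDS:
                f.write(b"".join(reversed(buf)))
                buf = []
        if buf:                         # final, possibly short, chunk
            f.write(b"".join(reversed(buf)))

def read_backward(path):
    """Visit chunks last-to-first; inside a chunk the records are already reversed."""
    chunk_bytes = CHUNK_RECORDS * RECORD.size
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        end = size
        # Only the last chunk may be short; all earlier chunks are full.
        start = (size - 1) // chunk_bytes * chunk_bytes if size else 0
        while end > 0:
            f.seek(start)
            data = f.read(end - start)
            for off in range(0, len(data), RECORD.size):
                yield RECORD.unpack_from(data, off)
            end = start
            start = max(0, start - chunk_bytes)

demo_path = os.path.join(tempfile.mkdtemp(), "chunks.bin")
write_chunked(demo_path, ((float(k), k * 0.5) for k in range(10)))
order = [int(t) for t, _ in read_backward(demo_path)]
```

Each backward step is then one seek followed by one large sequential read, rather than one seek per record.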
Paul Rubin <no.em...@nospam.invalid> writes:
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach.
Oh yes, I should have mentioned, it may be simpler and perhaps a little
bit faster to use mmap rather than seeking.
> Virgil Stokes <v...@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach. If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing. If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.
I agree with Paul here, it's been a while since I did it, and my
dataset was small enough (and passed through once) so I just let it
run. Writing larger chunks is definitely a good way to go.
> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.
I usually find that the I/O almost always overwhelms the actual
processing.
> Of course (it should go without saying) that you want to dump in a
> binary format rather than converting to decimal.
Again, the conversion to/from decimal hasn't been a great cost in my
experience, as it's overwhelmed by the I/O cost of shoveling the
data to/from disk.
Tim Chase <python.l...@tim.thechases.com> writes:
> Again, the conversion to/from decimal hasn't been a great cost in my
> experience, as it's overwhelmed by the I/O cost of shoveling the
> data to/from disk.
I've found that cpu costs both for processing and conversion are
significant. Also, using a binary format makes the file a lot smaller,
which decreases the i/o cost as well as eliminating the conversion cost.
And, the conversion can introduce precision loss, another thing to be
avoided. The famous "butterfly effect" was serendipitously discovered
that way.
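A small illustration of the precision and size point, using only the standard library. The "%.6f" format is an assumed example of a typical text dump, not something the poster specified. (In current Python, repr() of a float does round-trip exactly, so the loss comes from truncated formats rather than from text as such.)

```python
import struct

x = 0.1 + 0.2                          # a value with no short decimal form

text = "%.6f" % x                      # a typical fixed-precision text dump
from_text = float(text)

binary = struct.pack("<d", x)          # raw 8-byte IEEE 754 double
from_binary = struct.unpack("<d", binary)[0]

text_is_exact = (from_text == x)       # digits were dropped by the format
binary_is_exact = (from_binary == x)   # bit-for-bit round trip
full_text_len = len(repr(x))           # characters needed for an exact text form
```

The binary form always takes 8 bytes per value, while an exact decimal form of the same double can need more than twice that.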
> On 10/23/12 09:31, Virgil Stokes wrote:
>> I am working with some rather large data files (>100GB) that contain time series
>> data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format. I perform
>> various types of processing on these data (e.g. moving median, moving average,
>> and Kalman-filter, Kalman-smoother) in a sequential manner and only a small
>> number of these data need be stored in RAM when being processed. When performing
>> Kalman-filtering (forward in time pass, k = 0,1,...,N) I need to save to an
>> external file several variables (e.g. 11*32 bytes) for each (t_k, y(t_k)). These
>> are inputs to the Kalman-smoother (backward in time pass, k = N,N-1,...,0).
>> Thus, I will need to input these variables saved to an external file from the
>> forward pass, in reverse order --- from last written to first written.
>> Finally, to my question --- What is a fast way to write these variables to an
>> external file and then read them in backwards?
> Am I missing something, or would the fairly-standard "tac" utility
> do the reversal you want? It should[*] be optimized to handle
> on-disk files in a smart manner.
Not sure about "tac" --- could you provide more details on this and/or a simple example of how it could be used for fast reversed "reading" of a data file?
> Otherwise, if you can pad the record-lengths so they're all the
> same, and you know the total number of records, you can seek to
> Total-(RecSize*OneBasedOffset) and write the record,optionally
> padding if you need/can. At least on *nix-like OSes, you can seek
> into a sparse-file with no problems (untested on Win32).
The record lengths will all be the same and yes, seek could be used; but I was hoping for a faster method.
> Virgil Stokes <v...@it.uu.se> writes:
>> Finally, to my question --- What is a fast way to write these
>> variables to an external file and then read them in backwards?
> Seeking backwards in files works, but the performance hit is
> significant. There is also a performance hit to scanning pointers
> backwards in memory, due to cache misprediction. If it's something
> you're just running a few times, seeking backwards is the simplest
> approach. If you're really trying to optimize the thing, you might
> buffer up large chunks (like 1 MB) before writing. If you're writing
> once and reading multiple times, you might reverse the order of records
> within the chunks during the writing phase.
I am writing (forward) once and reading (backward) once.
> You're of course taking a performance bath from writing the program in
> Python to begin with (unless using scipy/numpy or the like), enough that
> it might dominate any effects of how the files are written.
I am currently using SciPy/NumPy
> Of course (it should go without saying) that you want to dump in a
> binary format rather than converting to decimal.
Yes, I am doing this (but thanks for "underlining" it!)
> On 23-Oct-2012 18:09, Tim Chase wrote:
>>> Finally, to my question --- What is a fast way to write these
>>> variables to an external file and then read them in
>>> backwards?
>> Am I missing something, or would the fairly-standard "tac"
>> utility do the reversal you want? It should[*] be optimized to
>> handle on-disk files in a smart manner.
> Not sure about "tac" --- could you provide more details on this
> and/or a simple example of how it could be used for fast reversed
> "reading" of a data file?
Well, if you're reading input.txt (and assuming it's one record per
line, separated by newlines), you can just use
tac < input.txt > backwards.txt
which will create a secondary file that is the first file in reverse
order. Your program can then process this secondary file in-order
(which would be backwards from your source).
I might have misunderstood your difficulty, but it _sounded_ like
you just want to inverse the order of a file.
> On 10/23/12 12:17, Virgil Stokes wrote:
>> On 23-Oct-2012 18:09, Tim Chase wrote:
>>>> Finally, to my question --- What is a fast way to write these
>>>> variables to an external file and then read them in
>>>> backwards?
>>> Am I missing something, or would the fairly-standard "tac"
>>> utility do the reversal you want? It should[*] be optimized to
>>> handle on-disk files in a smart manner.
>> Not sure about "tac" --- could you provide more details on this
>> and/or a simple example of how it could be used for fast reversed
>> "reading" of a data file?
> Well, if you're reading input.txt (and assuming it's one record per
> line, separated by newlines), you can just use
> tac < input.txt > backwards.txt
> which will create a secondary file that is the first file in reverse
> order. Your program can then process this secondary file in-order
> (which would be backwards from your source).
> I might have misunderstood your difficulty, but it _sounded_ like
> you just want to inverse the order of a file.
Yes, I do wish to inverse the order, but the "forward in time" file will be in binary.
Virgil Stokes wrote:
> Not sure about "tac" --- could you provide more details on this
> and/or a simple example of how it could be used for fast reversed
> "reading" of a data file?
tac is available as a command under linux ....
$ whatis tac
tac (1) - concatenate and print files in reverse
On Tue, Oct 23, 2012 at 10:31 AM, Virgil Stokes <v...@it.uu.se> wrote:
> I am working with some rather large data files (>100GB) that contain time
> series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII
> format. I perform various types of processing on these data (e.g. moving
> median, moving average, and Kalman-filter, Kalman-smoother) in a sequential
> manner and only a small number of these data need be stored in RAM when
> being processed. When performing Kalman-filtering (forward in time pass, k =
> 0,1,...,N) I need to save to an external file several variables (e.g. 11*32
> bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother
> (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these
> variables saved to an external file from the forward pass, in reverse order
> --- from last written to first written.
> Finally, to my question --- What is a fast way to write these variables to
> an external file and then read them in backwards?
Don't forget to use timeit for an average OS utilization.
I'd suggest two list comprehensions for now, until I've reviewed it some more:
forward = ["%i = %s" % (i, chr(i)) for i in range(33, 126)]
backward = ["%i = %s" % (i, chr(i)) for i in range(126, 32, -1)]
for var in forward:
    print(var)
for var in backward:
    print(var)
You could also use a dict, and iterate through a straight loop that
assigned a front and back to a dict_one = {0 : [0.100], 1 : [1.99]}
and then iterate through the loop, and call the first or second item in the
dict's value list for forwards or backwards calls.
But there might be faster implementations, depending on other
function's usage of certain lower level functions.
> Don't forget to use timeit for an average OS utilization.
> I'd suggest two list comprehensions for now, until I've reviewed it some more:
> forward = ["%i = %s" % (i, chr(i)) for i in range(33, 126)]
> backward = ["%i = %s" % (i, chr(i)) for i in range(126, 32, -1)]
> for var in forward:
>     print(var)
> for var in backward:
>     print(var)
> You could also use a dict, and iterate through a straight loop that
> assigned a front and back to a dict_one = {0 : [0.100], 1 : [1.99]}
> and the iterate through the loop, and call the first or second in the
> dict's var list for frontwards , or backwards calls.
> But there might be faster implementations, depending on other
> function's usage of certain lower level functions.
Missed the part about it being a file. Use:
forward = ["%i = %s" % (i,chr(i)) for i in range(33,126)]
backward = ["%i = %s" % (i,chr(i)) for i in range(126,32,-1)]
On Tue, 23 Oct 2012 17:50:55 -0400, David Hutto wrote:
> On Tue, Oct 23, 2012 at 10:31 AM, Virgil Stokes <v...@it.uu.se> wrote:
>> I am working with some rather large data files (>100GB) [...]
>> Finally, to my question --- What is a fast way to write these variables
>> to an external file and then read them in backwards?
> Don't forget to use timeit for an average OS utilization.
Given that the data files are larger than 100 gigabytes, the time required to process each file is likely to be in hours, not microseconds. That being the case, timeit is the wrong tool for the job; it is optimized for timing tiny code snippets. You could use it, of course, but the added inconvenience doesn't gain you any added accuracy.
Here's a neat context manager that makes timing long-running code simple:
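A timing context manager along these lines could be sketched as follows. This is an assumed illustration, not the poster's original code; the name stopwatch and its results parameter are hypothetical.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label, results=None):
    """Time the enclosed block with a wall clock and report the elapsed seconds."""
    start = time.time()
    try:
        yield
    finally:
        elapsed = time.time() - start
        if results is not None:
            results[label] = elapsed    # keep the number for later comparison
        print("%s took %.3f s" % (label, elapsed))

# Usage: wrap any long-running pass in a with-block.
timings = {}
with stopwatch("demo pass", timings):
    total = sum(i * i for i in range(100000))
```

For runs that take minutes or hours, a single wall-clock measurement like this is usually precise enough, which is the point being made about timeit.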
> I'd suggest two list comprehensions for now, until I've reviewed it some
> more:
I would be very surprised if the poster will be able to fit 100 gigabytes of data into even a single list comprehension, let alone two.
This is a classic example of why the old external processing algorithms of the 1960s and 70s will never be obsolete. No matter how much memory you have, there will always be times when you want to process more data than you can fit into memory.
> This is a classic example of why the old external processing algorithms
> of the 1960s and 70s will never be obsolete. No matter how much memory
> you have, there will always be times when you want to process more data
> than you can fit into memory.
But surely nobody will *ever* need more than 640k…
Whether this is fast enough, or not, I don't know:
filename = "data_file.txt"
with open(filename, 'r') as f:
    # Reads the whole file into memory, so this only suits files that fit in RAM.
    forward = [line.rstrip('\n') for line in f]
backward = list(reversed(forward))
print(forward)
print("********************")
print(backward)
<steve+comp.lang.pyt...@pearwood.info> wrote:
> On Tue, 23 Oct 2012 17:50:55 -0400, David Hutto wrote:
>> On Tue, Oct 23, 2012 at 10:31 AM, Virgil Stokes <v...@it.uu.se> wrote:
>>> I am working with some rather large data files (>100GB)
> [...]
>>> Finally, to my question --- What is a fast way to write these variables
>>> to an external file and then read them in backwards?
>> Don't forget to use timeit for an average OS utilization.
> Given that the data files are larger than 100 gigabytes, the time
> required to process each file is likely to be in hours, not microseconds.
> That being the case, timeit is the wrong tool for the job; it is
> optimized for timing tiny code snippets. You could use it, of course,
> but the added inconvenience doesn't gain you any added accuracy.
It depends on the end result. If the iterations each take about the
same time, then timing just a segment of the iterations and scaling
the result could be enough, and a full run might still be worth it if
you have a second computer running the optimization.
> Here's a neat context manager that makes timing long-running code simple:
>> I'd suggest two list comprehensions for now, until I've reviewed it some
>> more:
> I would be very surprised if the poster will be able to fit 100 gigabytes
> of data into even a single list comprehension, let alone two.
Again, these can be scaled depending on the operations of the function
in question, and the average time of aforementioned function(s)
> This is a classic example of why the old external processing algorithms
> of the 1960s and 70s will never be obsolete. No matter how much memory
> you have, there will always be times when you want to process more data
> than you can fit into memory
This is a common misconception. You can engineer a device that
accommodates this if it's a direct experimental necessity.
Virgil Stokes <v...@it.uu.se> writes:
> Yes, I do wish to inverse the order, but the "forward in time" file
> will be in binary.
I really think it will be simplest to just write the file in forward
order, then use mmap to read it one record at a time. It might be
possible to squeeze out a little more performance with reordering tricks
but that's the first thing to try.
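A minimal sketch of the mmap approach, assuming fixed-size binary records written in forward order. RECORD, iter_records_backward and the three-float64 layout are illustrative assumptions. Note that mapping a file larger than 100GB needs a 64-bit Python build.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<3d")          # assumed layout: three float64 per record

def iter_records_backward(path):
    """Map the file and unpack fixed-size records from the last one to the first."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            n = len(mm) // RECORD.size
            for k in range(n - 1, -1, -1):
                yield RECORD.unpack_from(mm, k * RECORD.size)
        finally:
            mm.close()

# Forward pass: plain sequential binary writes.
demo_path = os.path.join(tempfile.mkdtemp(), "forward.bin")
with open(demo_path, "wb") as f:
    for k in range(6):
        f.write(RECORD.pack(float(k), k + 0.25, k + 0.5))

# Backward pass: records come back as k = 5, 4, ..., 0.
ks = [int(rec[0]) for rec in iter_records_backward(demo_path)]
```

The operating system pages the file in as the loop walks toward the start, so no explicit seek or read calls are needed.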
On 23 October 2012 15:31, Virgil Stokes <v...@it.uu.se> wrote:
> I am working with some rather large data files (>100GB) that contain time
> series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII
> format. I perform various types of processing on these data (e.g. moving
> median, moving average, and Kalman-filter, Kalman-smoother) in a sequential
> manner and only a small number of these data need be stored in RAM when
> being processed. When performing Kalman-filtering (forward in time pass, k =
> 0,1,...,N) I need to save to an external file several variables (e.g. 11*32
> bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother
> (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these
> variables saved to an external file from the forward pass, in reverse order
> --- from last written to first written.
> Finally, to my question --- What is a fast way to write these variables to
> an external file and then read them in backwards?
You mentioned elsewhere that you are using numpy. I'll assume that the
data you want to read/write are numpy arrays.
Numpy arrays can be written very efficiently in binary form using
tofile/fromfile:
>>> import numpy
>>> a = numpy.array([1, 2, 5], numpy.int64)
>>> a
array([1, 2, 5])
>>> with open('data.bin', 'wb') as f:
... a.tofile(f)
...
You can then reload the array with:
>>> with open('data.bin', 'rb') as f:
... a2 = numpy.fromfile(f, numpy.int64)
...
>>> a2
array([1, 2, 5])
Numpy arrays can be reversed before writing or after reading using:
>>> a2
array([1, 2, 5])
>>> a2[::-1]
array([5, 2, 1])
Assuming you wrote the file forwards you can make an iterator to yield
the file in chunks backwards like so (untested):
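One way such an iterator might look is sketched below. This is an assumed illustration rather than the poster's original code; iter_backward and its chunk_items parameter are hypothetical names. It uses only numpy.fromfile, ndarray.tofile and ordinary file seeking.

```python
import os
import tempfile
import numpy

def iter_backward(path, dtype, chunk_items):
    """Yield the file's items last-to-first, reading chunk_items at a time."""
    itemsize = numpy.dtype(dtype).itemsize
    n_items = os.path.getsize(path) // itemsize
    with open(path, "rb") as f:
        stop = n_items
        while stop > 0:
            start = max(0, stop - chunk_items)
            f.seek(start * itemsize)
            chunk = numpy.fromfile(f, dtype, count=stop - start)
            yield chunk[::-1]           # reversed view of this chunk
            stop = start

# Write 0..9 forwards, then read back in chunks of 4 from the end.
demo_path = os.path.join(tempfile.mkdtemp(), "data.bin")
numpy.arange(10, dtype=numpy.int64).tofile(demo_path)
result = numpy.concatenate(list(iter_backward(demo_path, numpy.int64, 4)))
```

A larger chunk_items means fewer seeks; for multi-field records, a structured dtype or a reshape of each chunk would be needed before reversing.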
> Yes, I do wish to inverse the order, but the "forward in time"
> file will be in binary.
Your original post said:
> The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII format
making it hard to know what sort of data is in this file.
So I guess it would help to have some sample data to work with, even
if it's just some dummy data and a raw processing loop without doing
anything inside it. Something like the output of either of these
$ xxd forward_data.txt | head -50 > forward_head.txt
$ od forward_data.txt | head -50 > forward_head.txt
plus a basic loop to show how you're extracting the values:
for line in open("forward_head.txt"):
    data1, data2, data3 = process(line)
and how you want to reverse over them:
for line in open("reversed.txt"):
    if same_processing_as_forward_source:
        data1, data2, data3 = process(line)
    else:
        data1, data2, data3 = other_process(line)
or do you want something more like
for line in super_reverse_magic(open("forward_head.txt")):
    data1, data2, data3 = process(line)
<oscar.j.benja...@gmail.com> wrote:
> On 23 October 2012 15:31, Virgil Stokes <v...@it.uu.se> wrote:
>> I am working with some rather large data files (>100GB) that contain time
>> series data. The data (t_k,y(t_k)), k = 0,1,...,N are stored in ASCII
>> format. I perform various types of processing on these data (e.g. moving
>> median, moving average, and Kalman-filter, Kalman-smoother) in a sequential
>> manner and only a small number of these data need be stored in RAM when
>> being processed. When performing Kalman-filtering (forward in time pass, k =
>> 0,1,...,N) I need to save to an external file several variables (e.g. 11*32
>> bytes) for each (t_k, y(t_k)). These are inputs to the Kalman-smoother
>> (backward in time pass, k = N,N-1,...,0). Thus, I will need to input these
>> variables saved to an external file from the forward pass, in reverse order
>> --- from last written to first written.
>> Finally, to my question --- What is a fast way to write these variables to
>> an external file and then read them in backwards?
> You mentioned elsewhere that you are using numpy. I'll assume that the
> data you want to read/write are numpy arrays.
If that is the case, always use timeit. The following is an example of 3
functions, with repeated timings that give an average: