I am looking for a method to "shuffle" the lines of a large file.
I have a corpus of sorted and "uniqed" English sentences that has been produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only once in corpus.uniq plays an important role for the processes I use to involve my corpus in. Yet, the alphabetical order is an unwanted side effect of (1): Very often, I do not want (or rather, I do not have the computational capacities) to apply a program to all of corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a very lopsided set of English sentences.
So, it would be very useful to do one of the following things:
- produce corpus.uniq in a such a way that it is not sorted in any way - shuffle corpus.uniq > corpus.uniq.shuffled
Unfortunately, none of the machines that I may use has 80G RAM. So, using a dictionary will not help.
> I am looking for a method to "shuffle" the lines of a large file.
> I have a corpus of sorted and "uniqed" English sentences that has been > produced with (1):
> (1) sort corpus | uniq > corpus.uniq
> corpus.uniq is 80G large. The fact that every sentence appears only > once in corpus.uniq plays an important role for the processes > I use to involve my corpus in. Yet, the alphabetical order is an > unwanted side effect of (1): Very often, I do not want (or rather, I > do not have the computational capacities) to apply a program to all of > corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a > very lopsided set of English sentences.
> So, it would be very useful to do one of the following things:
> - produce corpus.uniq in a such a way that it is not sorted in any way > - shuffle corpus.uniq > corpus.uniq.shuffled
> Unfortunately, none of the machines that I may use has 80G RAM. > So, using a dictionary will not help.
There was a thread a while ago about choosing random lines from a file without reading the whole file into memory. Would that help? Instead of shuffling the file, shuffle the users. I can't find the thread though...
Not tested this, run it (or some derivation thereof) over the output to get increasing randomness. You will want to keep max_buffered_lines as high as possible really I imagine. If shuffle() is too intensize you could itterate over the buffer several times randomly removing and printing lines until the buffer is empty/suitibly small removing some more processing overhead.
### START ### import random
f = open('corpus.uniq')
buffer = [] max_buffered_lines = 1000
for line in f: if len(buffer) < max_buffered_lines: buffer.append(line) else: buffer.shuffle() for line in buffer: print line
random.shuffle(buffer) for line in buffer: print line
[mailto:python-list-bounces+alexs=advfn....@python.org]On Behalf Of Joerg Schuster Sent: 07 March 2005 13:37 To: python-l...@python.org Subject: shuffle the lines of a large file
Hello,
I am looking for a method to "shuffle" the lines of a large file.
I have a corpus of sorted and "uniqed" English sentences that has been produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only once in corpus.uniq plays an important role for the processes I use to involve my corpus in. Yet, the alphabetical order is an unwanted side effect of (1): Very often, I do not want (or rather, I do not have the computational capacities) to apply a program to all of corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a very lopsided set of English sentences.
So, it would be very useful to do one of the following things:
- produce corpus.uniq in a such a way that it is not sorted in any way - shuffle corpus.uniq > corpus.uniq.shuffled
Unfortunately, none of the machines that I may use has 80G RAM. So, using a dictionary will not help.
On Monday 07 March 2005 14:36, Joerg Schuster wrote:
> Any ideas?
The following program should do the trick (filenames are hardcoded, look at top of file):
### shuffle.py
import random import shelve
# Open external files needed for data storage. lines = open("test.dat","r") lineindex = shelve.open("test.idx") newlines = open("test.new.dat","w")
# Create an index of all lines of the file in an external flat file DB. # This means that nothing actually remains in memory, but in an extremely # efficient (g)dbm flatfile DB. def makeIdx(): i = 0L lastpos = 0L curpos = None while lines.readline(): # This is after the (\r)\n, which will be stripped() and rewritten # by writeNewLines(). curpos = long(lines.tell()) lineindex[hex(i)[2:-1]] = "%s:%s" % (hex(lastpos)[2:-1], hex(curpos-lastpos)[2:-1]) lastpos = curpos i += 1 return i
maxidx = makeIdx()
# To shuffle the file, just shuffle the index. Problem being: there is no # random number generator which even remotely has the possibility of yielding # all possible permutations. Thus, for simplicity: just exchange every element # in order 1..end with a random element from the rest of the file. This is # certainly no perfect shuffle, and in case the shuffling is too bad, just # rerun shuffleIdx() a couple of times. def shuffleIdx(): oldi = 0L # Use a while loop, as xrange doesn't work with longs. while oldi < maxidx: oi = hex(oldi)[2:-1] while True: ni = hex(long(random.randrange(maxidx)))[2:-1] if ni <> oi: break lineindex[oi], lineindex[ni] = lineindex[ni], lineindex[oi] oldi += 1
shuffleIdx()
# Write out the shuffled file. Do this by just walking the index 0..end. def writeNewLines(): i = 0L # Use a while loop, as xrange doesn't work with longs. while i < maxidx: # Extract line index and line length from the index file. lidx, llen = [long(x,16) for x in lineindex[hex(i)[2:-1]].split(":")] lines.seek(lidx) line = lines.read(llen).strip() newlines.write(line+"\n") i += 1
writeNewLines()
### End shuffle.py
I don't know how fast this program will run, but at least, it does as told... ;)
"Joerg Schuster" <joerg.schuster.REMOVET...@gmail.com> writes: >Hello, >I am looking for a method to "shuffle" the lines of a large file. >I have a corpus of sorted and "uniqed" English sentences that has been >produced with (1): >(1) sort corpus | uniq > corpus.uniq >corpus.uniq is 80G large. The fact that every sentence appears only >once in corpus.uniq plays an important role for the processes >I use to involve my corpus in. Yet, the alphabetical order is an >unwanted side effect of (1): Very often, I do not want (or rather, I >do not have the computational capacities) to apply a program to all of >corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a >very lopsided set of English sentences. >So, it would be very useful to do one of the following things: >- produce corpus.uniq in a such a way that it is not sorted in any way >- shuffle corpus.uniq > corpus.uniq.shuffled >Unfortunately, none of the machines that I may use has 80G RAM. >So, using a dictionary will not help. >Any ideas?
Instead of shuffling the file itself maybe you could index it (with dbm for instance) and select random lines by using random indexes whenever you need a sample.
[mailto:python-list-bounces+alexs=advfn....@python.org]On Behalf Of Alex Stapleton Sent: 07 March 2005 14:17 To: Joerg Schuster; python-l...@python.org Subject: RE: shuffle the lines of a large file
Not tested this, run it (or some derivation thereof) over the output to get increasing randomness. You will want to keep max_buffered_lines as high as possible really I imagine. If shuffle() is too intensize you could itterate over the buffer several times randomly removing and printing lines until the buffer is empty/suitibly small removing some more processing overhead.
### START ### import random
f = open('corpus.uniq')
buffer = [] max_buffered_lines = 1000
for line in f: if len(buffer) < max_buffered_lines: buffer.append(line) else: buffer.shuffle() for line in buffer: print line
random.shuffle(buffer) for line in buffer: print line
f.close()
### END ###
-----Original Message----- From: python-list-bounces+alexs=advfn....@python.org [mailto:python-list-bounces+alexs=advfn....@python.org]On Behalf Of Joerg Schuster Sent: 07 March 2005 13:37 To: python-l...@python.org Subject: shuffle the lines of a large file
Hello,
I am looking for a method to "shuffle" the lines of a large file.
I have a corpus of sorted and "uniqed" English sentences that has been produced with (1):
(1) sort corpus | uniq > corpus.uniq
corpus.uniq is 80G large. The fact that every sentence appears only once in corpus.uniq plays an important role for the processes I use to involve my corpus in. Yet, the alphabetical order is an unwanted side effect of (1): Very often, I do not want (or rather, I do not have the computational capacities) to apply a program to all of corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a very lopsided set of English sentences.
So, it would be very useful to do one of the following things:
- produce corpus.uniq in a such a way that it is not sorted in any way - shuffle corpus.uniq > corpus.uniq.shuffled
Unfortunately, none of the machines that I may use has 80G RAM. So, using a dictionary will not help.
> I am looking for a method to "shuffle" the lines of a large file.
Of the top of my head: decorate, randomize, undecorate. Prepend a suitable large random number or hash to each line and then use sort. You could prepend new line numbers instead but even storing the randomised indexes might use too much memory.
As far as I can tell, what you ultimately want is to be able to extract a random ("representative?") subset of sentences. Given the huge size of data, I would suggest not randomizing the file, but randomizing accesses to the file. E.g. (sorry for off-the-cuff pseudo python): [adjust 8196 == 2**13 to your disk block size] . while True: . byteno = random.randint(0,length_of_file) . #align to disk block to avoid unnecessary IO . byteno = (byteno >> 13) << 13 #zero out the bottom 13 bits . f.seek(byteno) #set the file pointer to a random position . bytes = r.read(8196) #read one block . sentences = bytes.splitlines()[2:-1] #omit ends with partial lines . do_something(sentences)
If you only need 1000 sentences, use only one sentence from each block, if you need 1M, then use them all. [I hope I understood you problem]
Joerg Schuster wrote: > Unfortunately, none of the machines that I may use has 80G RAM. > So, using a dictionary will not help.
> Any ideas?
Why don't you index the file? I would store the byte-offsets of the beginning of each line into an index file. Then you can generate a random number from 1 to Whatever, go get that index from the index file, then open your text file, seek to that position in the file, read one line, and close the file. Using this process you can then extract a somewhat random set of lines from your 'corpus' text file.
You probably should consider making a database of the file, keep the raw text file for sure, but create a converted copy in bsddb or pytables format.
On 7 Mar 2005 05:36:32 -0800, rumours say that "Joerg Schuster" <joerg.schuster.REMOVET...@gmail.com> might have written:
>Hello,
>I am looking for a method to "shuffle" the lines of a large file.
[snip]
>So, it would be very useful to do one of the following things:
>- produce corpus.uniq in a such a way that it is not sorted in any way >- shuffle corpus.uniq > corpus.uniq.shuffled
>Unfortunately, none of the machines that I may use has 80G RAM. >So, using a dictionary will not help.
To implement your 'shuffle' command in Python, you can do the following algorithm, with a couple of assumptions:
ASSUMPTION ----------
The total line count in your big file is less than sys.maxint.
The algorithm as given works for systems where eol is a single '\n'.
ALGORITHM ---------
Create a temporary filelist.FileList fl (see attached file) of struct.calcsize("q") bytes each (struct.pack and the 'q' format string is your friend), to hold the offset of each line start in big_file. fl[0] would be 0, fl[1] would be the length of the first line including its '\n' and so on.
Read once the big_file appending to fl the offset each time (if you need help with this, let me know).
random.shuffle(fl) # this is tested with the filelist.FileList as given
for offset_as_str in fl: offset= struct.unpack("q", offset_as_str)[0] big_file.seek(offset) sys.stdout.write(big_file.readline())
That's it. Redirect output to your preferred file. No promises for speed though :) -- TZOTZIOY, I speak England very best. "Be strict when sending and tolerant when receiving." (from RFC1958) I really should keep that in mind when talking with people, actually...
> I am looking for a method to "shuffle" the lines of a large file.
If speed and space are not a concern, I would be tempted to presume that this can be organised without too much difficulty. However, looking for speed handling a big file, while keeping equiprobability of all possible permutations, might be sensibly more difficult.
I vaguely remember having read something along these lines (not shuffling as you mean it, but still, reorganising a lengthy file) in Knuth's "Art of Computer Programming", in one of the exercises within the chapter on Sorting methods (volume 3). That's long ago, but if I remember well, Knuth did not consider this as an easy exercise.
Raymond Hettinger <vze4r...@verizon.net> wrote: > >>> from random import random > >>> out = open('corpus.decorated', 'w') > >>> for line in open('corpus.uniq'): > print >> out, '%.14f %s' % (random(), line),
Sort is truly excellent at very large datasets. If you give it a file bigger than memory then it divides it up into temporary files of memory size, sorts each one, then merges all the temporary files back together.
You tune the memory sort uses for in memory sorts with --buffer-size. Its pretty good at auto tuning though.
You may want to set --temporary-directory also to save filling up your /tmp.
In a previous job I did a lot of stuff with usenet news and was forever blowing up the server with scripts which used too much memory. sort was always the solution!
<simon.brunn...@gmail.com> wrote: > On 7 Mar 2005 06:38:49 -0800, g...@ll.mit.edu <g...@ll.mit.edu> wrote: > > As far as I can tell, what you ultimately want is to be able to extract > > a random ("representative?") subset of sentences.
> If this is what's wanted, then perhaps some variation on this cookbook > recipe might do the trick:
def randomLines(filename, lines=1): selected_lines = list(None for line_no in xrange(lines))
for line_index, line in enumerate(open(filename)): for selected_line_index in xrange(lines): if random.uniform(0, line_index) < 1: selected_lines[selected_line_index] = line
return selected_lines
This has the advantage that every line had the same chance of being picked regardless of its length. There is the chance that it'll pick the same line more than once, though.
On Tuesday 08 March 2005 15:28, Simon Brunning wrote:
> This has the advantage that every line had the same chance of being > picked regardless of its length. There is the chance that it'll pick > the same line more than once, though.
Problem being: if the file the OP is talking about really is 80GB in size, and you consider a sentence to have 80 bytes on average (it's likely to have less than that), that makes 10^9 sentences in the file. Now, multiply that with the memory overhead of storing a list of 10^9 None(s), and reconsider, whether that algorithm really works for the posted conditions. I don't think that any machine I have access to even has near enough memory just to store this list... ;)
On Tue, 8 Mar 2005 15:49:35 +0100, Heiko Wundram <modeln...@ceosg.de> wrote: > Problem being: if the file the OP is talking about really is 80GB in size, and > you consider a sentence to have 80 bytes on average (it's likely to have less > than that), that makes 10^9 sentences in the file. Now, multiply that with > the memory overhead of storing a list of 10^9 None(s), and reconsider, > whether that algorithm really works for the posted conditions. I don't think > that any machine I have access to even has near enough memory just to store > this list... ;)
Ah, but that's the clever bit; it *doesn't* store the whole list - only the selected lines.
On Tuesday 08 March 2005 15:55, Simon Brunning wrote:
> Ah, but that's the clever bit; it *doesn't* store the whole list - > only the selected lines.
But that means that it'll only read several lines from the file, never do a shuffle of the whole file content... When you'd want to shuffle the file content, you'd have to set lines=1 and throw away repeating lines in subsequent runs, or you'd have to set lines higher, and deal with the resulting lines too in some way (throw away repeating ones... :-). Doesn't matter how, you'd have to store which lines you've already read (selected_lines). And in any case you'd need a line cache of 10^9 entries for this amount of data...
On Fri, 11 Mar 2005 06:59:33 +0100, Heiko Wundram <modeln...@ceosg.de> wrote: > On Tuesday 08 March 2005 15:55, Simon Brunning wrote: > > Ah, but that's the clever bit; it *doesn't* store the whole list - > > only the selected lines.
> But that means that it'll only read several lines from the file, never do a > shuffle of the whole file content...
Err, thing is, it *does* pick a random selection from the whole file, without holding the whole file in memory. (It does hold all the selected items in memory - I don't see any way to avoid that.) Why not try it and see?
> When you'd want to shuffle the file > content, you'd have to set lines=1 and throw away repeating lines in > subsequent runs, or you'd have to set lines higher, and deal with the > resulting lines too in some way (throw away repeating ones... :-).
Eliminating duplicates is left as an exercise for the reader. ;-)
> Doesn't > matter how, you'd have to store which lines you've already read > (selected_lines). And in any case you'd need a line cache of 10^9 entries for > this amount of data...
> def randomLines(filename, lines=1): > selected_lines = list(None for line_no in xrange(lines))
> for line_index, line in enumerate(open(filename)): > for selected_line_index in xrange(lines): > if random.uniform(0, line_index) < 1: > selected_lines[selected_line_index] = line
> return selected_lines
> This has the advantage that every line had the same chance of being > picked regardless of its length. There is the chance that it'll pick > the same line more than once, though.
The idea is simple: have a chain of "pickers" that pass all items that are not picked on to the next picker. The current implementation, however, may hit the recursion limit for large samples.
import random
class PickOne(object): """Pick a random item from an iterable and store it in the 'picked' attribute. As an iterator, yield all items that are not picked.
Helper for the pick() function. """ def __init__(self, iterable): self.iterable = iterable def __iter__(self): it = enumerate(self.iterable) index, self.picked = it.next() for index, item in it: if random.uniform(0, index) < 1: yield self.picked self.picked = item else: yield item
def pick(iterable, N, accept_less=False): """Pick N random items from an iterable. """ # Set up the PickOne(PickOne( ... PickOne(iterable) ... )) chain keepers = [] for i in xrange(N): iterable = PickOne(iterable) keepers.append(iterable)
# We are interested in the side effects, i. e. the PickOne instances # picking random items from the initial iterable for item in iterable: pass
# The following could be # return [k.picked for k in keepers] # but we want to handle the case len(iterable) < N, too result = [] for keeper in keepers: try: picked = keeper.picked except AttributeError: if accept_less: return result raise ValueError("sequence contains less items than shall be picked") result.append(picked) return result
def demo(): """Pick random lines from a set of files. """ import optparse from itertools import chain, izip, repeat, count
parser = optparse.OptionParser() parser.usage = ("%prog [nl] file1 file2 ... fileN\n\n" "Choose random lines from files given on the command line.") parser.add_option("-n", "--count", type="int", default=10, help="number of lines that shall be picked (default=10)") parser.add_option("-l", "--less", action="store_true", help="accept less picked lines than specified, if there are less total lines in the file(s)") options, args = parser.parse_args()
index_name_line = chain(*[izip(count(), repeat(fn), file(fn)) for fn in args]) try: picked = pick(index_name_line, options.count, options.less) except ValueError: raise SystemExit("Cannot pick more lines than found in the file(s)")
print "".join([("%3d <%s> %s" % ifl) for ifl in picked])