Use a dictionary:

linedict = {}
for line in f:
    key = line[:3]
    # or store 'line' itself if you want to keep the key in the line
    linedict[key] = line[3:]

sortedlines = []
for key in sorted(linedict.keys()):
    sortedlines.append(linedict[key])

(untested)

This is the simplest, and probably an inefficient, approach, but it
should work. Note that a dict keeps only one value per key, so lines
sharing the same first three characters will overwrite each other.
>
> 2.) How do I sort file1.csv by column name; for example, if all the
> records have three column headings, “id”, “first_name”, “last_name”;
> here would be some sample data:
>
> a. Id, first_name,last_name
>
> b. 001,John,Filben
>
> c. 002,Joe, Smith
This is more complicated: I would make a list of lines, where each line
is a list split according to columns (like ['001', 'John', 'Filben']),
and then I would sort this list using operator.itemgetter, like this:

lines.sort(key=operator.itemgetter(num))  # where num is the column number, starting with 0 of course

Read up on operator.*, it's very useful.
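For instance, a minimal sketch with the csv module (sorting by
last_name here is my choice for illustration, not something the OP
specified; untested):

import csv
import operator

f = open('file1.csv')
reader = csv.reader(f)
header = next(reader)            # skip the "id,first_name,last_name" row
lines = list(reader)             # each row is now e.g. ['001', 'John', 'Filben']
lines.sort(key=operator.itemgetter(2))   # column 2 is last_name
f.close()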
>
> 3.) What about if I have millions of records and I am processing on a
> laptop with a large external drive – basically, are there space
> considerations? What are the work arounds.
The simplest is to use something like SQLite: define a table, fill it
up, and then do a SELECT with ORDER BY.
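A rough sketch of that approach (table and database names are made up
for illustration; untested):

import sqlite3

conn = sqlite3.connect('records.db')   # on-disk database, so RAM isn't the limit
conn.execute('CREATE TABLE people (id TEXT, first_name TEXT, last_name TEXT)')

f = open('file1.csv')
f.readline()                           # skip the header row
rows = (line.strip().split(',') for line in f)
conn.executemany('INSERT INTO people VALUES (?, ?, ?)', rows)
conn.commit()
f.close()

for row in conn.execute('SELECT * FROM people ORDER BY last_name'):
    print(row)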
But with a million records I wouldn't worry about it; it should fit in
RAM. Observe:
>>> import sys
>>> a = {}
>>> for i in range(1000000):
...     a[i] = 'spam' * 10
...
>>> sys.getsizeof(a)
25165960
So that's what, 25 MB? (Note that sys.getsizeof counts only the dict
structure itself, not the keys and values it refers to.)

Although I have to note that temporary RAM usage of the Python process
on my machine did go up to 113 MB.
Regards,
mk
Simpler would be:

lines = f.readlines()
lines.sort(key=lambda line: line[:3])

or even:

lines = sorted(f.readlines(), key=lambda line: line[:3])
#!/usr/bin/python

def sortit(fname):
    fo = open(fname)
    linedict = {}
    for line in fo:
        key = line[:3]
        linedict[key] = line
    sortedlines = []
    keys = linedict.keys()
    keys.sort()
    for key in keys:
        sortedlines.append(linedict[key])
    return sortedlines

if __name__ == '__main__':
    sortit('testfile.txt')
MRAB's solution is obviously better, provided you know about Python's
lambda.
Regards,
mk
> [snip]
> Simpler would be:
>
> lines = f.readlines()
> lines.sort(key=lambda line: line[:3])
>
> or even:
>
> lines = sorted(f.readlines(), key=lambda line: line[:3])
Sure, but a complete newbie (I have this impression of the OP) doesn't
have to know about lambda.
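For what it's worth, a plain named function does the same job as the
lambda, for example:

def first3(line):
    # sort key: the first three characters of the line
    return line[:3]

lines = f.readlines()
lines.sort(key=first3)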
I expected my solution to be slower, but it's not (on a file with
100,000 random string lines):
# time ./sort1.py
real 0m0.386s
user 0m0.372s
sys 0m0.014s
# time ./sort2.py
real 0m0.303s
user 0m0.286s
sys 0m0.017s
sort1.py:

#!/usr/bin/python

def sortit(fname):
    lines = open(fname).readlines()
    lines.sort(key=lambda x: x[:3])

if __name__ == '__main__':
    sortit('testfile.txt')
sort2.py:

#!/usr/bin/python

def sortit(fname):
    fo = open(fname)
    linedict = {}
    for line in fo:
        key = line[:3]
        linedict[key] = line
    sortedlines = []
    keys = linedict.keys()
    keys.sort()
    for key in keys:
        sortedlines.append(linedict[key])
    return sortedlines

if __name__ == '__main__':
    sortit('testfile.txt')
Any idea why? After all, I'm "manually" doing quite a lot: inserting
each key into a dict, then sorting the dict's keys, then iterating over
the keys and accessing the dict values.
Regards,
mk
You may also want to look at the GNU tools "sort" and "cut". If your
job is to process files, I'd recommend tools designed for exactly that
task.
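For example (untested; assumes the sort key is the first three
characters of each line, or the second CSV column, respectively):

sort -k1.1,1.3 unsorted.txt > sorted.txt    # sort by characters 1-3
sort -t, -k2,2 file1.csv > sorted.csv       # sort CSV by first_name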
--
Jonathan Gardner
jgar...@jonathangardner.net