Re: clarification

Thomas Jollans

unread,

Aug 17, 2007, 7:17:37 AM8/17/07

to pytho...@python.org

On Friday 17 August 2007, Beema shafreen wrote:
> hi everybody,
> i have a file with data separated by tab
> mydata:
> fhl1 fkh2
> dfp1 chk1
> mal3 alp14
> mal3 moe1
> mal3 spi1
> mal3 bub1
> mal3 bub3
> mal3 mph1
> mal3 mad3
> hob1 nak1
> hob1 wsp1
> hob1 rad3
> cdr2 cdc13
> cdr2 cdc2
> shows these two are separated by tab represented as columns
> i have to check the common data between these two coloumn1 an coloumn2
> my code:
> data = []
> data1 = []
> result = []
> fh = open('sheet1','r')
> for line in fh.readlines():
> splitted = line.strip().split('\t')
> data.append(splitted[0])
> data1.append(splitted[1])
> for k in data:
> if k in data1:
> result.append(k)
> print result
> fh.close()
>
> can you tell me problem with my script and what should is do for this

No, I have not tested it. You tell us the problem, and we might understand the
situation better than you.

--
Regards, Thomas Jollans
GPG key: 0xF421434B may be found on various keyservers, eg pgp.mit.edu
Hacker key <http://hackerkey.com/>:
v4sw6+8Yhw4/5ln3pr5Ock2ma2u7Lw2Nl7Di2e2t3/4TMb6HOPTen5/6g5OPa1XsMr9p-7/-6

Laurent Pointal

unread,

Aug 17, 2007, 7:43:03 AM8/17/07

to

Thomas Jollans a écrit :

> On Friday 17 August 2007, Beema shafreen wrote:
>> hi everybody,
>> i have a file with data separated by tab
>> mydata:
>> fhl1 fkh2

<zip>

>> shows these two are separated by tab represented as columns
>> i have to check the common data between these two coloumn1 an coloumn2
>> my code:
>> data = []
>> data1 = []
>> result = []
>> fh = open('sheet1','r')
>> for line in fh.readlines():
>> splitted = line.strip().split('\t')
>> data.append(splitted[0])
>> data1.append(splitted[1])
>> for k in data:
>> if k in data1:
>> result.append(k)
>> print result
>> fh.close()

Use set data type for data and data1 (you fill them with an algo like th
one you wrote - just use add() in place of appen()) then use set
intersection to get common data.

See doc for set data type:
http://docs.python.org/lib/types-set.html

Would look like (not tested):
data = set()
data1 = set()

fh = open('sheet1','r')
for line in fh.readlines():
splitted = line.strip().split('\t')

data.add(splitted[0])
data1.add(splitted[1])

result = data.intersection(data1)

Laurent Pointal

unread,

Aug 17, 2007, 7:44:16 AM8/17/07

to

Laurent Pointal a écrit :

[cleaning]
fh = open('sheet1')
for line in fh:

Scott David Daniels

unread,

Aug 17, 2007, 9:09:29 AM8/17/07

to

lefts = set()
rights = set()
with open('sheet1', 'r') as fh:
for line in fh:
trimmed = line.strip()
if trimmed: # Skip blanks (file end often looks like that)
left, right = line.strip().split('\t')
lefts.add(left)
rights.add(right)
result = lefts & rights

-Scott

samwyse

unread,

Aug 17, 2007, 10:32:02 PM8/17/07

to

Scott David Daniels wrote:
> lefts = set()
> rights = set()
> with open('sheet1', 'r') as fh:
> for line in fh:
> trimmed = line.strip()
> if trimmed: # Skip blanks (file end often looks like that)
> left, right = line.strip().split('\t')
> lefts.add(left)
> rights.add(right)
> result = lefts & rights
>
> -Scott

# change to however many columns may later exist
cols = [ set() for i in range(0, 2) ]

with open('sheet1', 'r') as fh:
for line in fh:

for col, data in zip(cols, line.strip().split('\t')):
col.add(data)
result = cols[0] & cols[1]

samwyse

unread,

Aug 18, 2007, 11:02:58 AM8/18/07

to

My laptop started complaining about low power before I was really
finished with this last night, so I just hit Send and went to bed. The
last line really should be:
result = reduce(operator.and_, cols)

I'll note that using 'zip' transparently deals with handling those cases
where there are either more or fewer columns of data than expected.

Finally, does anyone familar with P3K know how best to do the reduction
without using 'reduce'? Right now, sets don't support the 'add' and
'multiply' operators, so 'sum' and (the currently ficticious) 'product'
won't work at all; while 'any' and 'all' don't work as one might hope.
Are there an 'intersection' and 'union' built-ins anywhere?

Tim Williams

unread,

Aug 18, 2007, 11:37:42 AM8/18/07

to Beema shafreen, pytho...@python.org

On 17/08/07, Beema shafreen <beema.s...@gmail.com> wrote:
> hi everybody,
> i have a file with data separated by tab
> mydata:
> fhl1 fkh2

> dfp1 chk1
> mal3 alp14
> mal3 moe1
> mal3 spi1
> mal3 bub1
> mal3 bub3
> mal3 mph1
> mal3 mad3
> hob1 nak1
> hob1 wsp1
> hob1 rad3
> cdr2 cdc13
> cdr2 cdc2

> shows these two are separated by tab represented as columns
> i have to check the common data between these two coloumn1 an coloumn2
> my code:
> data = []
> data1 = []
> result = []
> fh = open('sheet1','r')
> for line in fh.readlines():
> splitted = line.strip().split('\t')
> data.append(splitted[0])
> data1.append(splitted[1])
> for k in data:
> if k in data1:
> result.append(k)
> print result
> fh.close()
>

> can you tell me problem with my script and what should is do for this

For a start, you are iterating k in data *everytime* you iterate a
line in fh which will give you a speed issue and give you duplicates
in the result. The following is probably what you intended to do

> for line in fh.readlines():
> do stuff
> for k in data:
> do stuff

.split() splits by spaces, newlines AND tabs so you just need

> splitted = line.split()

eg

>>> ln = 'fhl1\tfkh2\r\n'
>>> ln.split()
['fhl1', 'fkh2']

I think I would have done something like this (not tested)

Input = open('sheet1').read().split()
data = set(Input[::2])
data1 = set (Input[1::2])
result = data.intersection(data1)

or even this (if you don't need data and data1 later in the code)

Input = open('sheet1').read().split()
result = set(Input[::2]).intersection(set (Input[1::2]))

HTH :)

Alex Martelli

unread,

Aug 18, 2007, 12:26:10 PM8/18/07

to

samwyse <deja...@email.com> wrote:
...

> Finally, does anyone familar with P3K know how best to do the reduction
> without using 'reduce'? Right now, sets don't support the 'add' and
> 'multiply' operators, so 'sum' and (the currently ficticious) 'product'
> won't work at all; while 'any' and 'all' don't work as one might hope.
> Are there an 'intersection' and 'union' built-ins anywhere?

For intersection and union of a sequence of sets, I'd use:

def union_all(sos):
result = set()
for s in sos: result.update(s)
return result

def intersect_all(sos):
it = iter(sos)
result = set(it.next())
for s in it: result.intersection_update(s)
return result

The latter will raise an exception if sos is empty -- I don't think the
"intersection of no sets at all" has a single natural interpretation
(while the "union of no sets at all" appears to be naturally interpreted
as an empty set)... if you disagree, just wrap a try/except around the
initialization of result, and return whatever in the except clause.

Of course, hoisting the unbound method out of the loops can afford the
usual small optimization. But my point is that, in Python, these
operations (like, say, the concatenation of a sequence of lists, etc)
are best performed "in place" via loops calling mutator methods such as
update and intersection_update (or a list's extend, etc), rather than
"functionally" (building and tossing away many intermediate results).
E.g., consider:

brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
100000 loops, best of 3: 18.8 usec per loop

brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
10000 loops, best of 3: 87.2 usec per loop

Even in a case as tiny as this one, "reduce" is taking over 4 times
longer than the loop with the in-place mutator -- and it only gets
worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
this is part of what makes reduce an "attractive nuisance"...;-). [[And
so is sum, if used OTHERWISE than for the documented purpose, computing
"the sum of a sequence of numbers": a loop with r.extend is similarly
faster, to concatenate a sequence of lists, when compared to sum(sol,
[])...!!!]]

Alex

samwyse

unread,

Aug 19, 2007, 8:03:39 AM8/19/07

to

Alex Martelli wrote:

> Of course, hoisting the unbound method out of the loops can afford the
> usual small optimization. But my point is that, in Python, these
> operations (like, say, the concatenation of a sequence of lists, etc)
> are best performed "in place" via loops calling mutator methods such as
> update and intersection_update (or a list's extend, etc), rather than
> "functionally" (building and tossing away many intermediate results).
> E.g., consider:
>
> brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
> range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
> 100000 loops, best of 3: 18.8 usec per loop
>
> brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
> range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
> 10000 loops, best of 3: 87.2 usec per loop
>
> Even in a case as tiny as this one, "reduce" is taking over 4 times
> longer than the loop with the in-place mutator -- and it only gets
> worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
> this is part of what makes reduce an "attractive nuisance"...;-). [[And
> so is sum, if used OTHERWISE than for the documented purpose, computing
> "the sum of a sequence of numbers": a loop with r.extend is similarly
> faster, to concatenate a sequence of lists, when compared to sum(sol,
> [])...!!!]]

Curiously, when I attempt the actual problem at hand (set intersection)
rather than something altogether different (set union) on my Dell laptop
running Win2K Pro, I get the following results:

stmt = 'r=reduce(set.intersection, los)'
best of 3: 21.9 usec per loop
stmt = 'r=intersect_all(los)'
best of 3: 29.3 usec per loop

So, the "nuisance" is more attractive than you thought. Apparently, one
can't really depend on "in place" mutator methods being more efficient
that using functional techniques. BTW, the above figures reflect 10,000
iterations; using larger counts makes the difference even larger. I
suspect that this is because the Python interpreter has fewer chances to
be interrupted by external processes when it's using 'reduce'. This in
turn indicates that both implementations actually work about same and
your "O(n squared)" argument is irrelevant.

P.S. Here's the source I used for the timings:

from timeit import Timer

setup = """
def intersect_all(los):
it = iter(los)

result = set(it.next())
for s in it: result.intersection_update(s)
return result

not7=range(2,11)
not7.remove(7)
los=[set(range(0,360,step)) for step in not7]
"""

number = 10000
repeat = 3
precision = 3

for stmt in 'r=reduce(set.intersection, los)', 'r=intersect_all(los)':
t = Timer(stmt, setup)
r = t.repeat(repeat, number)
best = min(r)
usec = best * 1e6 / number
print "stmt =", repr(stmt)
print "best of %d: %.*g usec per loop" % (repeat, precision, usec)

Alex Martelli

unread,

Aug 19, 2007, 12:03:53 PM8/19/07

to

samwyse <deja...@email.com> wrote:
...

> > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
> > range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
> > 100000 loops, best of 3: 18.8 usec per loop
> >
> > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
> > range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
> > 10000 loops, best of 3: 87.2 usec per loop
> >
> > Even in a case as tiny as this one, "reduce" is taking over 4 times
> > longer than the loop with the in-place mutator -- and it only gets
> > worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
> > this is part of what makes reduce an "attractive nuisance"...;-). [[And

The set-union case, just like the list-catenation case, is O(N squared)
(when approached in a functional way) because the intermediate result
often _grows_ [whenever a new set or list operand adds items], and thus
a new temporary value must be allocated, and the K results-so-far
"copied over" (as part of constructing the new temp value) from the
previous temporary value; and sum(range(N)) grows quadratically in N.
The in-place approach avoids that fate by a strategy of proportional
over-allocation (used by both set and lists) that makes in-place
operations such as .update(X) and .extend(X) amortized O(K) where K is
len(X).

In the set-intersection case, the intermediate result _shrinks_ rather
than growing, so the amount of data "copied over" is a diminishing
quantity at each step, and so the analysis showing quadratic behavior
for the functional approach does not hold; behavior may be roughly
linear, influenced in minor ways by accidental regularities in the sets'
structure and order (especially likely for short sequences of small
sets, as in your case). Using a slightly longer sequence of slightly
larger sets, with little structure to each, e.g.:

random.seed(12345) # set seed to ensure total repeatability
los=[set(random.sample(range(1000), 990)) for x in range(200)]

at the end of the setup (the intersection of these 200 sets happens to
contain 132 items), I measure (as usual on my 18-months-old Macbook Pro
laptop):

stmt = 'reduce(set.intersection,los)'
best of 3: 1.66e+04 usec per loop
stmt = 'intersect_all(los)'
best of 3: 1.66e+04 usec per loop

and occasionally 1.65 or 1.67 instead of 1.66 for either or both,
whether with 10,000 or 100,000 loops. (Not sure whether your
observations about the reduce-based approach becoming faster with more
loops may be due to anomalies in Windows' scheduler, or the accidental
regularities mentioned above; my timings are probably more regular since
I have two cores, one of which probably ends up dedicated to whatever
task I'm benchmarking while the other one runs all "background" stuff).

> turn indicates that both implementations actually work about same and
> your "O(n squared)" argument is irrelevant.

It's indeed irrelevant when the behavior _isn't_ quadratic (as in the
case of intersections) -- but unfortunately it _is_ needlessly quadratic
in most interesting cases involving containers (catenation of sequences,
union of sets, merging of dictionaries, merging of priority-queues,
...), because in those cases the intermediate temporary values tend to
grow, as I tried to explain in more detail above.

Alex

Paul Rubin

unread,

Aug 20, 2007, 11:18:03 AM8/20/07

to

al...@mac.com (Alex Martelli) writes:
> > turn indicates that both implementations actually work about same and
> > your "O(n squared)" argument is irrelevant.
>
> It's indeed irrelevant when the behavior _isn't_ quadratic (as in the
> case of intersections) -- but unfortunately it _is_ needlessly quadratic
> in most interesting cases involving containers (catenation of sequences,
> union of sets, merging of dictionaries, merging of priority-queues,
> ...), because in those cases the intermediate temporary values tend to
> grow, as I tried to explain in more detail above.

Mostly directed to samwyse: Alex is correct here and I'd add that
Python and its libraries are written for an imperative, mutating
approach though there are some situations in which you can write in
functional style without a big efficiency hit. In particular, the
quadratic behavior Alex points out is because of Python's hash-based
implementation of sets, so you can't make a new set that adds or
removes one element from the old set, without copying all the elements
around.

A more purist functional approach would probably implement sets with
some other data structure such as AVL trees, so you can make a new one
that adds or deletes some node in O(log n) time (it would share almost
all of its structure with the old one). So instead of O(n) for the
hash-based mutating implementation or O(n**2) for the hash/copying
based functional implementation, you get O(n log n) for the functional
tree implementation. Haskell's Data.Map module does something like
this. There's a book "Purely Functional Data Structures" by Chris
Okasaki that shows how to implement dicts, priority queues, and
fancier objects in functional style based on similar principles.

If you want to try out functional programming in a lightweight
language (much easier to grok than Haskell) you might look at Hedgehog Lisp:

http://hedgehog.oliotalo.fi/

It includes a functional dictionary implementation based on AVL trees
but with a Python dictionary-like API.