No, I have not tested it. If you tell us the actual problem, we might
understand the situation better than you do.
--
Regards, Thomas Jollans
GPG key: 0xF421434B may be found on various keyservers, eg pgp.mit.edu
Use the set data type for data and data1 (fill them with an algorithm like the
one you wrote - just use add() in place of append()), then use set
intersection to get the common data.
See doc for set data type:
http://docs.python.org/lib/types-set.html
Would look like (not tested):
data = set()
data1 = set()
fh = open('sheet1', 'r')
for line in fh.readlines():
    splitted = line.strip().split('\t')
    data.add(splitted[0])
    data1.add(splitted[1])
result = data.intersection(data1)
[cleaning]

You can iterate the file object directly instead of calling readlines():

fh = open('sheet1')
for line in fh:
    ...

Putting it together:

lefts = set()
rights = set()
with open('sheet1', 'r') as fh:
    for line in fh:
        trimmed = line.strip()
        if trimmed:  # Skip blanks (file end often looks like that)
            left, right = trimmed.split('\t')
            lefts.add(left)
            rights.add(right)
result = lefts & rights
-Scott
# change to however many columns may later exist
cols = [ set() for i in range(0, 2) ]
with open('sheet1', 'r') as fh:
    for line in fh:
        for col, data in zip(cols, line.strip().split('\t')):
            col.add(data)
result = cols[0] & cols[1]
My laptop started complaining about low power before I was really
finished with this last night, so I just hit Send and went to bed. The
last line really should be:
result = reduce(operator.and_, cols)
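Put together, something like this (still untested; same tab-separated 'sheet1'
as before, and note that operator needs importing -- on Python 3, reduce would
too, from functools):

import operator
# from functools import reduce   # only needed on Python 3

cols = [ set() for i in range(0, 2) ]   # change to however many columns may later exist
with open('sheet1', 'r') as fh:
    for line in fh:
        for col, data in zip(cols, line.strip().split('\t')):
            col.add(data)
result = reduce(operator.and_, cols)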
I'll note that using 'zip' transparently deals with handling those cases
where there are either more or fewer columns of data than expected.
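A quick interpreter illustration of that zip behavior, with made-up column names:

>>> zip(['col0', 'col1'], 'a\tb\tc'.split('\t'))   # extra column ignored
[('col0', 'a'), ('col1', 'b')]
>>> zip(['col0', 'col1'], 'a'.split('\t'))         # missing column just dropped
[('col0', 'a')]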
Finally, does anyone familiar with Py3k know how best to do the reduction
without using 'reduce'? Right now, sets don't support the 'add' and
'multiply' operators, so 'sum' and (the currently fictitious) 'product'
won't work at all; while 'any' and 'all' don't work as one might hope.
Are there 'intersection' and 'union' built-ins anywhere?
For a start, you are iterating over k in data *every time* you iterate over
a line in fh, which will give you a speed issue and duplicates in the
result. The following is probably what you intended to do:
> for line in fh.readlines():
>     do stuff
> for k in data:
>     do stuff
.split() with no argument splits on spaces, newlines AND tabs, so you just need
> splitted = line.split()
eg
>>> ln = 'fhl1\tfkh2\r\n'
>>> ln.split()
['fhl1', 'fkh2']
I think I would have done something like this (not tested)
Input = open('sheet1').read().split()
data = set(Input[::2])
data1 = set(Input[1::2])
result = data.intersection(data1)
or even this (if you don't need data and data1 later in the code)
Input = open('sheet1').read().split()
result = set(Input[::2]).intersection(set(Input[1::2]))
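In case the extended slices look cryptic, they just pick alternating items
(made-up values):

>>> Input = ['a1', 'b1', 'a2', 'b2', 'a3', 'b3']
>>> Input[::2]      # every second item starting at index 0 (first column)
['a1', 'a2', 'a3']
>>> Input[1::2]     # every second item starting at index 1 (second column)
['b1', 'b2', 'b3']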
HTH :)
For intersection and union of a sequence of sets, I'd use:
def union_all(sos):
    result = set()
    for s in sos: result.update(s)
    return result

def intersect_all(sos):
    it = iter(sos)
    result = set(it.next())
    for s in it: result.intersection_update(s)
    return result
The latter will raise an exception if sos is empty -- I don't think the
"intersection of no sets at all" has a single natural interpretation
(while the "union of no sets at all" appears to be naturally interpreted
as an empty set)... if you disagree, just wrap a try/except around the
initialization of result, and return whatever in the except clause.
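E.g., one possible variant, arbitrarily choosing an empty set for the
empty-input case (just an illustration, not the only sensible choice):

def intersect_all(sos):
    it = iter(sos)
    try:
        result = set(it.next())
    except StopIteration:
        return set()   # arbitrary choice for "intersection of no sets"
    for s in it: result.intersection_update(s)
    return result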
Of course, hoisting the unbound method out of the loops can afford the
usual small optimization. But my point is that, in Python, these
operations (like, say, the concatenation of a sequence of lists, etc)
are best performed "in place" via loops calling mutator methods such as
update and intersection_update (or a list's extend, etc), rather than
"functionally" (building and tossing away many intermediate results).
E.g., consider:
brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
100000 loops, best of 3: 18.8 usec per loop
brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
10000 loops, best of 3: 87.2 usec per loop
Even in a case as tiny as this one, "reduce" is taking over 4 times
longer than the loop with the in-place mutator -- and it only gets
worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
this is part of what makes reduce an "attractive nuisance"...;-). [[And
so is sum, if used OTHERWISE than for the documented purpose, computing
"the sum of a sequence of numbers": a loop with r.extend is similarly
faster, to concatenate a sequence of lists, when compared to sum(sol,
[])...!!!]]
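For instance, with a small made-up sequence of lists:

sol = [[1, 2], [3], [4, 5, 6]]           # made-up list of lists

result = []
for chunk in sol: result.extend(chunk)   # amortized linear in total length

# versus sum(sol, []), which copies the growing result at every step:
result2 = sum(sol, [])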
Alex
> [...]
>
> brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
> range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
> 100000 loops, best of 3: 18.8 usec per loop
>
> brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in
> range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
> 10000 loops, best of 3: 87.2 usec per loop
>
> Even in a case as tiny as this one, "reduce" is taking over 4 times
> longer than the loop with the in-place mutator -- and it only gets
> worse, as we're talking about O(N squared) vs O(N) performance! Indeed,
> this is part of what makes reduce an "attractive nuisance"...;-). [[And
> so is sum, if used OTHERWISE than for the documented purpose, computing
> "the sum of a sequence of numbers": a loop with r.extend is similarly
> faster, to concatenate a sequence of lists, when compared to sum(sol,
> [])...!!!]]
Curiously, when I attempt the actual problem at hand (set intersection)
rather than something altogether different (set union) on my Dell laptop
running Win2K Pro, I get the following results:
stmt = 'r=reduce(set.intersection, los)'
best of 3: 21.9 usec per loop
stmt = 'r=intersect_all(los)'
best of 3: 29.3 usec per loop
So, the "nuisance" is more attractive than you thought. Apparently, one
can't really depend on "in place" mutator methods being more efficient
that using functional techniques. BTW, the above figures reflect 10,000
iterations; using larger counts makes the difference even larger. I
suspect that this is because the Python interpreter has fewer chances to
be interrupted by external processes when it's using 'reduce'. This in
turn indicates that both implementations actually work about same and
your "O(n squared)" argument is irrelevant.
P.S. Here's the source I used for the timings:
from timeit import Timer

setup = """
def intersect_all(los):
    it = iter(los)
    result = set(it.next())
    for s in it: result.intersection_update(s)
    return result
not7=range(2,11)
not7.remove(7)
los=[set(range(0,360,step)) for step in not7]
"""

number = 10000
repeat = 3
precision = 3

for stmt in 'r=reduce(set.intersection, los)', 'r=intersect_all(los)':
    t = Timer(stmt, setup)
    r = t.repeat(repeat, number)
    best = min(r)
    usec = best * 1e6 / number
    print "stmt =", repr(stmt)
    print "best of %d: %.*g usec per loop" % (repeat, precision, usec)
The set-union case, just like the list-catenation case, is O(N squared)
(when approached in a functional way) because the intermediate result
often _grows_ [whenever a new set or list operand adds items], and thus
a new temporary value must be allocated, and the K results-so-far
"copied over" (as part of constructing the new temp value) from the
previous temporary value; the total copying is therefore like
sum(range(N)), which grows quadratically in N.
The in-place approach avoids that fate by a strategy of proportional
over-allocation (used by both set and lists) that makes in-place
operations such as .update(X) and .extend(X) amortized O(K) where K is
len(X).
In the set-intersection case, the intermediate result _shrinks_ rather
than growing, so the amount of data "copied over" is a diminishing
quantity at each step, and so the analysis showing quadratic behavior
for the functional approach does not hold; behavior may be roughly
linear, influenced in minor ways by accidental regularities in the sets'
structure and order (especially likely for short sequences of small
sets, as in your case). Using a slightly longer sequence of slightly
larger sets, with little structure to each, e.g.:
import random
random.seed(12345) # set seed to ensure total repeatability
los=[set(random.sample(range(1000), 990)) for x in range(200)]
at the end of the setup (the intersection of these 200 sets happens to
contain 132 items), I measure (as usual on my 18-month-old MacBook Pro
laptop):
stmt = 'reduce(set.intersection,los)'
best of 3: 1.66e+04 usec per loop
stmt = 'intersect_all(los)'
best of 3: 1.66e+04 usec per loop
and occasionally 1.65 or 1.67 instead of 1.66 for either or both,
whether with 10,000 or 100,000 loops. (Not sure whether your
observations about the reduce-based approach becoming faster with more
loops may be due to anomalies in Windows' scheduler, or the accidental
regularities mentioned above; my timings are probably more regular since
I have two cores, one of which probably ends up dedicated to whatever
task I'm benchmarking while the other one runs all "background" stuff).
> turn indicates that both implementations actually work about same and
> your "O(n squared)" argument is irrelevant.
It's indeed irrelevant when the behavior _isn't_ quadratic (as in the
case of intersections) -- but unfortunately it _is_ needlessly quadratic
in most interesting cases involving containers (catenation of sequences,
union of sets, merging of dictionaries, merging of priority-queues,
...), because in those cases the intermediate temporary values tend to
grow, as I tried to explain in more detail above.
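The dictionary-merging case shows the same contrast (a small made-up lod,
just for illustration):

lod = [{'a': 1}, {'b': 2}, {'c': 3}]     # made-up list of dicts

merged = {}
for d in lod: merged.update(d)           # in-place: amortized O(total items)

# versus the "functional" spelling, which rebuilds and copies all the
# items-so-far at every step (quadratic as lod grows):
merged2 = reduce(lambda a, b: dict(a.items() + b.items()), lod, {})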
Alex
Mostly directed to samwyse: Alex is correct here, and I'd add that
Python and its libraries are written for an imperative, mutating
approach, though there are some situations in which you can write in
functional style without a big efficiency hit. In particular, the
quadratic behavior Alex points out is because of Python's hash-based
implementation of sets, so you can't make a new set that adds or
removes one element from the old set without copying all the elements
around.
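I.e., the non-mutating spelling has to rebuild the whole container:

old = set(range(1000))
new = old | set([1000])   # allocates a brand-new set and copies all 1000 elements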
A more purist functional approach would probably implement sets with
some other data structure such as AVL trees, so you can make a new one
that adds or deletes some node in O(log n) time (it would share almost
all of its structure with the old one). So instead of O(n) for the
hash-based mutating implementation or O(n**2) for the hash/copying
based functional implementation, you get O(n log n) for the functional
tree implementation. Haskell's Data.Map module does something like
this. There's a book "Purely Functional Data Structures" by Chris
Okasaki that shows how to implement dicts, priority queues, and
fancier objects in functional style based on similar principles.
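To make the structural-sharing idea concrete, here's a toy sketch in Python
(a plain unbalanced persistent binary search tree, not a real AVL tree; each
insert builds only the nodes along the search path and shares everything else
with the old tree):

class Node(object):
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    # Return a new tree containing key; the old tree is left untouched.
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node    # key already present: share the whole subtree

def contains(node, key):
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False

t1 = None
for k in (5, 2, 8, 1):
    t1 = insert(t1, k)
t2 = insert(t1, 9)    # t2 shares t1's entire left subtree; t1 still lacks 9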
If you want to try out functional programming in a lightweight
language (much easier to grok than Haskell) you might look at Hedgehog Lisp.
It includes a functional dictionary implementation based on AVL trees
but with a Python dictionary-like API.