
Aug 17, 2007, 7:17:37 AM

to pytho...@python.org

On Friday 17 August 2007, Beema shafreen wrote:

> hi everybody,
> i have a file with data separated by tab
> mydata:
> fhl1 fkh2
> dfp1 chk1
> mal3 alp14
> mal3 moe1
> mal3 spi1
> mal3 bub1
> mal3 bub3
> mal3 mph1
> mal3 mad3
> hob1 nak1
> hob1 wsp1
> hob1 rad3
> cdr2 cdc13
> cdr2 cdc2
> shows these two are separated by tab, represented as columns
> i have to check the common data between these two, column1 and column2
> my code:
> data = []
> data1 = []
> result = []
> fh = open('sheet1','r')
> for line in fh.readlines():
>     splitted = line.strip().split('\t')
>     data.append(splitted[0])
>     data1.append(splitted[1])
> for k in data:
>     if k in data1:
>         result.append(k)
> print result
> fh.close()
>
> can you tell me the problem with my script and what should I do for this


No, I have not tested it. You tell us the problem, and we might understand the situation better than you do.

--
Regards, Thomas Jollans
GPG key: 0xF421434B may be found on various keyservers, eg pgp.mit.edu
Hacker key <http://hackerkey.com/>:
v4sw6+8Yhw4/5ln3pr5Ock2ma2u7Lw2Nl7Di2e2t3/4TMb6HOPTen5/6g5OPa1XsMr9p-7/-6

Aug 17, 2007, 7:43:03 AM

Thomas Jollans wrote:

> On Friday 17 August 2007, Beema shafreen wrote:
>> hi everybody,
>> i have a file with data separated by tab
>> mydata:
>> fhl1 fkh2
<zip>
>> shows these two are separated by tab, represented as columns
>> i have to check the common data between these two, column1 and column2
>> my code:
>> data = []
>> data1 = []
>> result = []
>> fh = open('sheet1','r')
>> for line in fh.readlines():
>>     splitted = line.strip().split('\t')
>>     data.append(splitted[0])
>>     data1.append(splitted[1])
>> for k in data:
>>     if k in data1:
>>         result.append(k)
>> print result
>> fh.close()


Use the set data type for data and data1 (fill them with an algorithm like the one you wrote - just use add() in place of append()), then use set intersection to get the common data.

See the docs for the set data type:

http://docs.python.org/lib/types-set.html

Would look like (not tested):

data = set()
data1 = set()
fh = open('sheet1','r')
for line in fh.readlines():
    splitted = line.strip().split('\t')
    data.add(splitted[0])
    data1.add(splitted[1])
result = data.intersection(data1)
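[Editor's note: for completeness, here is a tested rendering of that sketch in modern Python 3. The input file is simulated with io.StringIO, and the sample rows are invented so that the two columns actually overlap; with the real file you would use open('sheet1') instead.]

```python
import io

# Stand-in for open('sheet1'): hypothetical tab-separated rows,
# chosen so the two columns share some names.
fh = io.StringIO("fhl1\tfkh2\ndfp1\tchk1\nchk1\tfhl1\n")

data = set()
data1 = set()
for line in fh:
    left, right = line.strip().split('\t')
    data.add(left)    # column 1
    data1.add(right)  # column 2

result = data.intersection(data1)
print(sorted(result))  # → ['chk1', 'fhl1']
```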

Aug 17, 2007, 7:44:16 AM

Laurent Pointal wrote:

[cleaning]

fh = open('sheet1')
for line in fh:

Aug 17, 2007, 9:09:29 AM

lefts = set()
rights = set()
with open('sheet1', 'r') as fh:
    for line in fh:
        trimmed = line.strip()
        if trimmed:  # Skip blanks (file end often looks like that)
            left, right = line.strip().split('\t')
            lefts.add(left)
            rights.add(right)
result = lefts & rights

-Scott

Aug 17, 2007, 10:32:02 PM

Scott David Daniels wrote:

> lefts = set()
> rights = set()
> with open('sheet1', 'r') as fh:
>     for line in fh:
>         trimmed = line.strip()
>         if trimmed:  # Skip blanks (file end often looks like that)
>             left, right = line.strip().split('\t')
>             lefts.add(left)
>             rights.add(right)
> result = lefts & rights
>
> -Scott


# change to however many columns may later exist
cols = [set() for i in range(0, 2)]
with open('sheet1', 'r') as fh:
    for line in fh:
        for col, data in zip(cols, line.strip().split('\t')):
            col.add(data)
result = cols[0] & cols[1]

Aug 18, 2007, 11:02:58 AM

My laptop started complaining about low power before I was really finished with this last night, so I just hit Send and went to bed. The last line really should be:

result = reduce(operator.and_, cols)

I'll note that using 'zip' transparently deals with handling those cases where there are either more or fewer columns of data than expected.

Finally, does anyone familiar with P3K know how best to do the reduction without using 'reduce'? Right now, sets don't support the 'add' and 'multiply' operators, so 'sum' and (the currently fictitious) 'product' won't work at all; while 'any' and 'all' don't work as one might hope. Are there 'intersection' and 'union' built-ins anywhere?
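[Editor's note: one reduce-free answer that has since become idiomatic: the union and intersection methods accept arbitrarily many arguments (in newer Python versions; in 2007's 2.5 they took a single set), so a list of column-sets can simply be unpacked into one call. The sample sets below are invented.]

```python
cols = [{'a', 'b', 'c'}, {'b', 'c', 'd'}, {'c', 'd'}]  # hypothetical column sets

u = set().union(*cols)       # safe even when cols == []
i = set.intersection(*cols)  # needs at least one set in cols

print(sorted(u))  # → ['a', 'b', 'c', 'd']
print(sorted(i))  # → ['c']
```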

Aug 18, 2007, 11:37:42 AM

to Beema shafreen, pytho...@python.org

On 17/08/07, Beema shafreen <beema.s...@gmail.com> wrote:

> hi everybody,
> i have a file with data separated by tab
> mydata:
> fhl1 fkh2
> dfp1 chk1
> mal3 alp14
> mal3 moe1
> mal3 spi1
> mal3 bub1
> mal3 bub3
> mal3 mph1
> mal3 mad3
> hob1 nak1
> hob1 wsp1
> hob1 rad3
> cdr2 cdc13
> cdr2 cdc2
> shows these two are separated by tab, represented as columns
> i have to check the common data between these two, column1 and column2
> my code:
> data = []
> data1 = []
> result = []
> fh = open('sheet1','r')
> for line in fh.readlines():
>     splitted = line.strip().split('\t')
>     data.append(splitted[0])
>     data1.append(splitted[1])
> for k in data:
>     if k in data1:
>         result.append(k)
> print result
> fh.close()
>
> can you tell me the problem with my script and what should I do for this

For a start, you are iterating k in data *every time* you iterate a line in fh, which will give you a speed issue and give you duplicates in the result. The following is probably what you intended to do:

> for line in fh.readlines():
>     do stuff
> for k in data:
>     do stuff

.split() splits by spaces, newlines AND tabs, so you just need:

> splitted = line.split()

eg

>>> ln = 'fhl1\tfkh2\r\n'
>>> ln.split()
['fhl1', 'fkh2']

I think I would have done something like this (not tested):

Input = open('sheet1').read().split()
data = set(Input[::2])
data1 = set(Input[1::2])
result = data.intersection(data1)

or even this (if you don't need data and data1 later in the code):

Input = open('sheet1').read().split()
result = set(Input[::2]).intersection(set(Input[1::2]))

HTH :)
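[Editor's note: the slicing idiom can be sanity-checked on an in-memory sample; the string below stands in for open('sheet1').read(), with invented rows chosen so the columns overlap. Note it relies on every row having exactly two whitespace-free fields.]

```python
text = "fhl1\tfkh2\ndfp1\tchk1\nchk1\tfhl1\n"  # stand-in for the file contents

tokens = text.split()        # splits on tabs and newlines alike
data = set(tokens[::2])      # even positions -> column 1
data1 = set(tokens[1::2])    # odd positions  -> column 2
result = data.intersection(data1)
print(sorted(result))  # → ['chk1', 'fhl1']
```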

Aug 18, 2007, 12:26:10 PM

samwyse <deja...@email.com> wrote:
...
> Finally, does anyone familiar with P3K know how best to do the reduction
> without using 'reduce'? Right now, sets don't support the 'add' and
> 'multiply' operators, so 'sum' and (the currently fictitious) 'product'
> won't work at all; while 'any' and 'all' don't work as one might hope.
> Are there 'intersection' and 'union' built-ins anywhere?


For intersection and union of a sequence of sets, I'd use:

def union_all(sos):
    result = set()
    for s in sos: result.update(s)
    return result

def intersect_all(sos):
    it = iter(sos)
    result = set(it.next())
    for s in it: result.intersection_update(s)
    return result
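[Editor's note: a quick check of the two helpers, ported to Python 3 where it.next() is spelled next(it); the sample sets are made up.]

```python
def union_all(sos):
    result = set()
    for s in sos:
        result.update(s)           # in-place union, no temporaries
    return result

def intersect_all(sos):
    it = iter(sos)
    result = set(next(it))         # it.next() in the 2.x original
    for s in it:
        result.intersection_update(s)
    return result

sets = [{1, 2, 3}, {2, 3, 4}, {2, 3, 5}]
print(sorted(union_all(sets)))      # → [1, 2, 3, 4, 5]
print(sorted(intersect_all(sets)))  # → [2, 3]
```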

The latter will raise an exception if sos is empty -- I don't think the "intersection of no sets at all" has a single natural interpretation (while the "union of no sets at all" appears to be naturally interpreted as an empty set)... if you disagree, just wrap a try/except around the initialization of result, and return whatever in the except clause.

Of course, hoisting the unbound method out of the loops can afford the usual small optimization. But my point is that, in Python, these operations (like, say, the concatenation of a sequence of lists, etc) are best performed "in place" via loops calling mutator methods such as update and intersection_update (or a list's extend, etc), rather than "functionally" (building and tossing away many intermediate results). E.g., consider:

brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
100000 loops, best of 3: 18.8 usec per loop

brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
10000 loops, best of 3: 87.2 usec per loop

Even in a case as tiny as this one, "reduce" is taking over 4 times longer than the loop with the in-place mutator -- and it only gets worse, as we're talking about O(N squared) vs O(N) performance! Indeed, this is part of what makes reduce an "attractive nuisance"...;-). [[And so is sum, if used OTHERWISE than for the documented purpose, computing "the sum of a sequence of numbers": a loop with r.extend is similarly faster, to concatenate a sequence of lists, when compared to sum(sol, [])...!!!]]
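[Editor's note: the list-concatenation analogue of that bracketed remark, sketched with made-up data; both spellings produce the same list, only the asymptotics differ.]

```python
lol = [[1, 2], [3], [4, 5, 6]]  # hypothetical list of lists

# In-place loop: each extend is amortized O(len(sub)), so O(N) overall.
flat = []
for sub in lol:
    flat.extend(sub)

# Functional spelling: every + builds a fresh list, re-copying all
# elements accumulated so far -- O(N**2) overall.
flat2 = sum(lol, [])

print(flat)  # → [1, 2, 3, 4, 5, 6]
assert flat == flat2
```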

Alex

Aug 19, 2007, 8:03:39 AM

Alex Martelli wrote:
> Of course, hoisting the unbound method out of the loops can afford the usual small optimization. But my point is that, in Python, these operations (like, say, the concatenation of a sequence of lists, etc) are best performed "in place" via loops calling mutator methods such as update and intersection_update (or a list's extend, etc), rather than "functionally" (building and tossing away many intermediate results). E.g., consider:
>
> brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
> 100000 loops, best of 3: 18.8 usec per loop
>
> brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
> 10000 loops, best of 3: 87.2 usec per loop
>
> Even in a case as tiny as this one, "reduce" is taking over 4 times longer than the loop with the in-place mutator -- and it only gets worse, as we're talking about O(N squared) vs O(N) performance! Indeed, this is part of what makes reduce an "attractive nuisance"...;-). [[And so is sum, if used OTHERWISE than for the documented purpose, computing "the sum of a sequence of numbers": a loop with r.extend is similarly faster, to concatenate a sequence of lists, when compared to sum(sol, [])...!!!]]

Curiously, when I attempt the actual problem at hand (set intersection) rather than something altogether different (set union) on my Dell laptop running Win2K Pro, I get the following results:

stmt = 'r=reduce(set.intersection, los)'
best of 3: 21.9 usec per loop
stmt = 'r=intersect_all(los)'
best of 3: 29.3 usec per loop

So, the "nuisance" is more attractive than you thought. Apparently, one can't really depend on "in place" mutator methods being more efficient than using functional techniques. BTW, the above figures reflect 10,000 iterations; using larger counts makes the difference even larger. I suspect that this is because the Python interpreter has fewer chances to be interrupted by external processes when it's using 'reduce'. This in turn indicates that both implementations actually work about the same and your "O(n squared)" argument is irrelevant.

P.S. Here's the source I used for the timings:

from timeit import Timer

setup = """
def intersect_all(los):
    it = iter(los)
    result = set(it.next())
    for s in it: result.intersection_update(s)
    return result

not7 = range(2,11)
not7.remove(7)
los = [set(range(0,360,step)) for step in not7]
"""

number = 10000
repeat = 3
precision = 3
for stmt in 'r=reduce(set.intersection, los)', 'r=intersect_all(los)':
    t = Timer(stmt, setup)
    r = t.repeat(repeat, number)
    best = min(r)
    usec = best * 1e6 / number
    print "stmt =", repr(stmt)
    print "best of %d: %.*g usec per loop" % (repeat, precision, usec)

Aug 19, 2007, 12:03:53 PM

samwyse <deja...@email.com> wrote:
...
> > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in range(0, 100, 3)]' 'r=set()' 'for x in sos: r.update(x)'
> > 100000 loops, best of 3: 18.8 usec per loop
> >
> > brain:~ alex$ python -mtimeit -s'sos=[set(range(x,x+4)) for x in range(0, 100, 3)]' 'r=reduce(set.union, sos, set())'
> > 10000 loops, best of 3: 87.2 usec per loop
> >
> > Even in a case as tiny as this one, "reduce" is taking over 4 times longer than the loop with the in-place mutator -- and it only gets worse, as we're talking about O(N squared) vs O(N) performance! Indeed, this is part of what makes reduce an "attractive nuisance"...;-). [[And


The set-union case, just like the list-catenation case, is O(N squared) (when approached in a functional way) because the intermediate result often _grows_ [whenever a new set or list operand adds items], and thus a new temporary value must be allocated, and the K results-so-far "copied over" (as part of constructing the new temp value) from the previous temporary value; and sum(range(N)) grows quadratically in N. The in-place approach avoids that fate by a strategy of proportional over-allocation (used by both set and lists) that makes in-place operations such as .update(X) and .extend(X) amortized O(K), where K is len(X).

In the set-intersection case, the intermediate result _shrinks_ rather than growing, so the amount of data "copied over" is a diminishing quantity at each step, and so the analysis showing quadratic behavior for the functional approach does not hold; behavior may be roughly linear, influenced in minor ways by accidental regularities in the sets' structure and order (especially likely for short sequences of small sets, as in your case). Using a slightly longer sequence of slightly larger sets, with little structure to each, e.g.:

random.seed(12345)  # set seed to ensure total repeatability
los = [set(random.sample(range(1000), 990)) for x in range(200)]

at the end of the setup (the intersection of these 200 sets happens to contain 132 items), I measure (as usual on my 18-months-old Macbook Pro laptop):

stmt = 'reduce(set.intersection,los)'
best of 3: 1.66e+04 usec per loop
stmt = 'intersect_all(los)'
best of 3: 1.66e+04 usec per loop

and occasionally 1.65 or 1.67 instead of 1.66 for either or both, whether with 10,000 or 100,000 loops. (Not sure whether your observations about the reduce-based approach becoming faster with more loops may be due to anomalies in Windows' scheduler, or the accidental regularities mentioned above; my timings are probably more regular since I have two cores, one of which probably ends up dedicated to whatever task I'm benchmarking while the other one runs all "background" stuff.)

> This in turn indicates that both implementations actually work about the same and
> your "O(n squared)" argument is irrelevant.

It's indeed irrelevant when the behavior _isn't_ quadratic (as in the case of intersections) -- but unfortunately it _is_ needlessly quadratic in most interesting cases involving containers (catenation of sequences, union of sets, merging of dictionaries, merging of priority-queues, ...), because in those cases the intermediate temporary values tend to grow, as I tried to explain in more detail above.

Alex

Aug 20, 2007, 11:18:03 AM

al...@mac.com (Alex Martelli) writes:

> > This in turn indicates that both implementations actually work about the same and
> > your "O(n squared)" argument is irrelevant.
>
> It's indeed irrelevant when the behavior _isn't_ quadratic (as in the case of intersections) -- but unfortunately it _is_ needlessly quadratic in most interesting cases involving containers (catenation of sequences, union of sets, merging of dictionaries, merging of priority-queues, ...), because in those cases the intermediate temporary values tend to grow, as I tried to explain in more detail above.


Mostly directed to samwyse: Alex is correct here, and I'd add that Python and its libraries are written for an imperative, mutating approach, though there are some situations in which you can write in functional style without a big efficiency hit. In particular, the quadratic behavior Alex points out is because of Python's hash-based implementation of sets, so you can't make a new set that adds or removes one element from the old set without copying all the elements around.

A more purist functional approach would probably implement sets with some other data structure such as AVL trees, so you can make a new one that adds or deletes some node in O(log n) time (it would share almost all of its structure with the old one). So instead of O(n) for the hash-based mutating implementation, or O(n**2) for the hash/copying-based functional implementation, you get O(n log n) for the functional tree implementation. Haskell's Data.Map module does something like this. There's a book "Purely Functional Data Structures" by Chris Okasaki that shows how to implement dicts, priority queues, and fancier objects in functional style based on similar principles.
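[Editor's note: a toy sketch of the structure-sharing idea, using a plain unbalanced binary search tree rather than a real AVL tree, so the O(log n) bound only holds for lucky insertion orders; all names here are invented for the example.]

```python
class Node:
    """Immutable BST node: inserting copies only the path to the change."""
    __slots__ = ('key', 'left', 'right')

    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    # Return a new tree; every untouched subtree is shared with the old one.
    if node is None:
        return Node(key)
    if key < node.key:
        return Node(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return Node(node.key, node.left, insert(node.right, key))
    return node  # key already present: reuse the old tree entirely

def contains(node, key):
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False

t1 = None
for k in (5, 2, 8):
    t1 = insert(t1, k)
t2 = insert(t1, 7)  # t1 is untouched; t2 exists alongside it

print(contains(t1, 7), contains(t2, 7))  # → False True
print(t2.left is t1.left)                # → True: left subtree is shared
```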

If you want to try out functional programming in a lightweight language (much easier to grok than Haskell), you might look at Hedgehog Lisp. It includes a functional dictionary implementation based on AVL trees, but with a Python dictionary-like API.
