efficient way to process data

Larry Martell

Jan 12, 2014, 2:23:17 PM
to pytho...@python.org
I have a Python app that queries a MySQL DB. The query has this form:

SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
FROM t
GROUP BY a, b, c, d, f

x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
10053.490, 2542.094).

The business requirement is that if either x or y in two rows in the
same a, b, c, d group are within 1 of each other, then those rows
should be grouped together. And to make it more complicated, the
tolerance is applied as a rolling continuum. For example, if the x
and y values in a set of grouped rows are:

row 1: 1.5, 9.5
row 2: 2.4, 20.8
row 3: 3.3, 40.6
row 4: 4.2, 2.5
row 5: 5.1, 10.1
row 6: 6.0, 7.9
row 7: 8.0, 21.0
row 8: 100, 200

Rows 1 through 6 get combined because all their X values are within
the tolerance of some other X in the set that's been combined. Row
7's Y value is within the tolerance of row 2's Y, so that should be
combined as well. Row 8 is not combined because neither its X nor its
Y value is within the tolerance of any X or Y in the combined set.
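
Here's a minimal, illustrative Python sketch of that rule applied to
the eight rows above (not the real query, just the grouping logic;
the 1.0 tolerance and the coordinates come from the example):

rows = [(1.5, 9.5), (2.4, 20.8), (3.3, 40.6), (4.2, 2.5),
        (5.1, 10.1), (6.0, 7.9), (8.0, 21.0), (100, 200)]

groups = []  # each group is a list of row indices
for i, (x, y) in enumerate(rows):
    # find every existing group this row is "near": either coordinate
    # within 1 of any row already in that group
    near = [g for g in groups
            if any(abs(x - rows[j][0]) <= 1 or abs(y - rows[j][1]) <= 1
                   for j in g)]
    merged = [i]
    for g in near:
        merged.extend(g)  # a new row can bridge groups, so merge them all
        groups.remove(g)
    groups.append(merged)

print([sorted(n + 1 for n in g) for g in groups])
# prints [[1, 2, 3, 4, 5, 6, 7], [8]]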

AFAIK, there is no way to do this in SQL. In Python I can easily parse
the data and identify the rows that need to be combined, but then I've
lost the ability to calculate the average and std across the combined
data set. The only way I can think of to do this is to remove the
grouping from the SQL and do all the grouping and aggregating myself.
But this query often returns 20k to 30k rows after grouping; it could
easily be 80k to 100k rows or more that I have to process if I remove
the grouping, and I think that will end up being very slow.

Anyone have any ideas how I can efficiently do this?

Thanks!
-larry

Petite Abeille

Jan 12, 2014, 2:53:02 PM
to pytho...@python.org

On Jan 12, 2014, at 8:23 PM, Larry Martell <larry....@gmail.com> wrote:

> AFAIK, there is no way to do this in SQL.

Sounds like a job for window functions (aka analytic functions) [1][2].

[1] http://www.postgresql.org/docs/9.3/static/tutorial-window.html
[2] http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions004.htm#SQLRF06174
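
For what it's worth, here is a small sketch of the idea from Python,
using SQLite's window functions (3.25+) purely as an illustration,
since MySQL doesn't have them; it only handles the x half of the
rolling rule, with made-up e values:

import sqlite3

con = sqlite3.connect(":memory:")  # needs SQLite >= 3.25 for window functions
con.executescript("""
    CREATE TABLE t (x REAL, y REAL, e REAL);
    INSERT INTO t VALUES (1.5, 9.5, 10), (2.4, 20.8, 20), (3.3, 40.6, 30),
                         (4.2, 2.5, 40), (5.1, 10.1, 50), (6.0, 7.9, 60),
                         (8.0, 21.0, 70), (100, 200, 80);
""")

# LAG() fetches the previous x; the running SUM() turns "gap larger than
# the tolerance" into an incrementing group id that can be grouped on.
rows = con.execute("""
    SELECT grp, COUNT(*), AVG(e)
    FROM (SELECT e,
                 SUM(CASE WHEN x - prev_x > 1 THEN 1 ELSE 0 END)
                     OVER (ORDER BY x) AS grp
          FROM (SELECT x, e, LAG(x) OVER (ORDER BY x) AS prev_x FROM t) AS s
         ) AS g
    GROUP BY grp
""").fetchall()
print(rows)  # e.g. [(0, 6, 35.0), (1, 1, 70.0), (2, 1, 80.0)]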

Chris Angelico

Jan 12, 2014, 5:18:42 PM
to pytho...@python.org
On Mon, Jan 13, 2014 at 6:53 AM, Petite Abeille
<petite....@gmail.com> wrote:
> On Jan 12, 2014, at 8:23 PM, Larry Martell <larry....@gmail.com> wrote:
>
>> AFAIK, there is no way to do this in SQL.
>
> Sounds like a job for window functions (aka analytic functions) [1][2].

That's my thought too. I don't think MySQL has them, though, so it's
either going to have to be done in Python, or the database back-end
will need to change. Hard to say which would be harder.

ChrisA

Chris Angelico

Jan 12, 2014, 6:27:02 PM
to pytho...@python.org
Trying to get my head around this a bit more. Are columns a/b/c/d
treated as a big category (eg type, brand, category, model), such that
nothing will ever be grouped that has any difference in those four
columns? If so, we can effectively ignore them and pretend we have a
table with exactly one set (eg stick a WHERE clause onto the query
that stipulates their values). Then what you have is this:

* Aggregate based on proximity of x and y
* Emit results derived from e

Is that correct?

So here's my way of writing it.

* Subselect: List all values for x, in order, and figure out which
ones are less than the previous value plus one
* Subselect: Ditto, for y.
* Outer select: Somehow do an either-or group. I'm not quite sure how
to do that part, actually!

A PGSQL window function would cover the two subselects - at least, I'm
fairly sure it would. I can't quite get the whole thing, though; I can
get a true/false flag that says whether it's near to the previous one
(that's easy), and creating a grouping column value should be possible
from that but I'm not sure how.

But an either-or grouping is a bit trickier. The best I can think of
is to collect all the y values for each group of x values, and then if
any two groups 'overlap' (ie have points within 1.0 of each other),
merge the groups. That's going to be seriously tricky to do in SQL, I
think, so you may have to go back to Python on that one.
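
Roughly, that merge step might look like this (a quadratic but
illustrative sketch, with made-up y values; tol is the 1.0 tolerance):

def overlap(g1, g2, tol=1.0):
    # two groups "overlap" if any of their y values are within tol
    return any(abs(y1 - y2) <= tol for y1 in g1 for y2 in g2)

def merge_overlapping(groups, tol=1.0):
    merged = True
    while merged:            # keep passing until nothing merges any more
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if overlap(groups[i], groups[j], tol):
                    groups[i].extend(groups.pop(j))
                    merged = True
                    break
            if merged:
                break
    return groups

print(merge_overlapping([[9.5, 20.8], [21.0], [200.0]]))
# -> [[9.5, 20.8, 21.0], [200.0]]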

My analysis suggests that, whatever happens, you're going to need
every single y value somewhere. So it's probably not worth trying to
do any grouping/aggregation in SQL, since you need to further analyze
all the individual data points. I can't think of any way better than
just leafing through the whole table (either in Python or in a stored
procedure - if you can run your script on the same computer that's
running the database, I'd do that, otherwise consider a stored
procedure to reduce network transfers) and building up mappings.

Of course, "I can't think of a way" does not equate to "There is no
way". There may be some magic trick that I didn't think of, or some
arcane incantation that gets what you want. Who knows? If you can
produce an ASCII art Mandelbrot set [1] in pure SQL, why not this!

ChrisA

[1] http://wiki.postgresql.org/wiki/Mandelbrot_set

Larry Martell

Jan 12, 2014, 10:17:09 PM
to Petite Abeille, pytho...@python.org
On Sun, Jan 12, 2014 at 2:53 PM, Petite Abeille
<petite....@gmail.com> wrote:
>
> On Jan 12, 2014, at 8:23 PM, Larry Martell <larry....@gmail.com> wrote:
>
>> AFAIK, there is no way to do this in SQL.
>
> Sounds like a job for window functions (aka analytic functions) [1][2].
>
Unfortunately, MySQL does not support this.

Larry Martell

Jan 12, 2014, 10:18:22 PM
to Chris Angelico, pytho...@python.org
On Sun, Jan 12, 2014 at 5:18 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Mon, Jan 13, 2014 at 6:53 AM, Petite Abeille
> <petite....@gmail.com> wrote:
>> On Jan 12, 2014, at 8:23 PM, Larry Martell <larry....@gmail.com> wrote:
>>
>>> AFAIK, there is no way to do this in SQL.
>>
>> Sounds like a job for window functions (aka analytic functions) [1][2].
>
> That's my thought too. I don't think MySQL has them, though, so it's
> either going to have to be done in Python, or the database back-end
> will need to change. Hard to say which would be harder.

Changing the database is not feasible.

Larry Martell

Jan 12, 2014, 10:25:57 PM
to Dennis Lee Bieber, pytho...@python.org
On Sun, Jan 12, 2014 at 5:43 PM, Dennis Lee Bieber
<wlf...@ix.netcom.com> wrote:
> On Sun, 12 Jan 2014 14:23:17 -0500, Larry Martell <larry....@gmail.com>
> declaimed the following:
>
>>I have a Python app that queries a MySQL DB. The query has this form:
>>
>>SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>>FROM t
>>GROUP BY a, b, c, d, f
>>
>>x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>>10053.490, 2542.094).
>>
>
> Decimal (Numeric) or floating/real? If the latter, the internal storage
> may not be exact (378.1811111111 and 378.179999999 may both "display" as
> 378.18, but will not match for grouping).

In the database they are decimal. They are being converted to char by
the CONCAT(x, ',', y).
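
(For what it's worth, a quick illustration of the display-vs-equality
caveat, had they been floats, using the two example values above:)

a, b = 378.1811111111, 378.179999999
print("%.2f %.2f" % (a, b))  # both display as 378.18
print(a == b)                # False -> exact GROUP BY matching would miss them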

>>The business issue is that if either x or y in 2 rows that are in the
>>same a, b, c, d group are within 1 of each other then they should be
>>grouped together. And to make it more complicated, the tolerance is
>>applied as a rolling continuum. For example, if the x and y in a set
>>of grouped rows are:
>>
> As I understand group by, it will first group by "a", WITHIN the "a"
> groups it will then group by "b"... Probably not a matter germane to the
> problem as you are concerning yourself with the STRING representation of
> "x" and "y" with a comma delimiter -- which is only looked at if the
> "a,b,c,d" are equal... Thing is, a string comparison is going to operate
> strictly left to right -- it won't even see your "y" value unless the
> whole "x" value is equal.

Yes, that is correct. The original requirement was to group by (X, Y),
so the CONCAT(x, ',', y) was correct and working. Then the requirement
was changed to apply the tolerance as I described.

>
> You may need to operate using subselects... So that you can specify
> something like
>
> where abs(s1.x -s2.x) < tolerance or abs(s1.y-s2.y) < tolerance
> and (s1.a = s2.a ... s1.d = s2.d)
>
> s1/s2 are the subselects (you may need a primary key <> primary key to
> avoid having it output a record where the two subselects are for the SAME
> record -- or maybe not, since you /do/ want that record also output). Going
> to be a costly query since you are basically doing
>
> foreach r1 in s1
> foreach r2 in s2
> emit r2 when...

Speed is an issue here, and while the current query performs well, in
my experience subqueries and self joins do not. I'm going to try to do
it all in Python and see how it performs. The other option is to
pre-process the data on the way into the database. Doing that would
eliminate some of the data partitioning, as all of the data that could
be joined will be in the same input file. I'm just not sure if it will
be OK to actually munge the data. I'll find that out tomorrow.

Larry Martell

Jan 12, 2014, 10:35:47 PM
to Chris Angelico, pytho...@python.org
On Sun, Jan 12, 2014 at 6:27 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry....@gmail.com> wrote:
>> I have a Python app that queries a MySQL DB. The query has this form:
>>
>> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
>> FROM t
>> GROUP BY a, b, c, d, f
>>
>> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
>> 10053.490, 2542.094).
>>
>> The business issue is that if either x or y in 2 rows that are in the
>> same a, b, c, d group are within 1 of each other then they should be
>> grouped together. And to make it more complicated, the tolerance is
>> applied as a rolling continuum. For example, if the x and y in a set
>> of grouped rows are:
>>
>> row 1: 1.5, 9.5
>> row 2: 2.4, 20.8
>> row 3: 3.3, 40.6
>> row 4: 4.2, 2.5
>> row 5: 5.1, 10.1
>> row 6: 6.0, 7.9
>> row 7: 8.0, 21.0
>> row 8: 100, 200
>>
>> 1 through 6 get combined because all their X values are within the
>> tolerance of some other X in the set that's been combined. 7's Y value
>> is within the tolerance of 2's Y, so that should be combined as well.
>> 8 is not combined because neither the X or Y value is within the
>> tolerance of any X or Y in the set that was combined.
>
> Trying to get my head around this a bit more. Are columns a/b/c/d
> treated as a big category (eg type, brand, category, model), such that
> nothing will ever be grouped that has any difference in those four
> columns? If so, we can effectively ignore them and pretend we have a
> table with exactly one set (eg stick a WHERE clause onto the query
> that stipulates their values). Then what you have is this:
>
> * Aggregate based on proximity of x and y
> * Emit results derived from e
>
> Is that correct?

There will be multiple groups of a/b/c/d. I simplified the query for
the purposes of posting my question. There is a WHERE clause with
values that come from user input. None, any, or all of a, b, c, or d
could be in the WHERE clause.

Thanks for the reply. I'm going to take a stab at removing the GROUP
BY and doing it all in Python. It doesn't look too hard, but I don't
know how it will perform.

Chris Angelico

Jan 13, 2014, 1:09:59 AM
to pytho...@python.org
On Mon, Jan 13, 2014 at 2:35 PM, Larry Martell <larry....@gmail.com> wrote:
> Thanks for the reply. I'm going to take a stab at removing the group
> by and doing it all in python. It doesn't look too hard, but I don't
> know how it will perform.

Well, if you can't switch to PostgreSQL or such, then doing it in
Python is your only option. There are such things as GiST and GIN
indexes that might be able to do some of this magic, but I don't think
MySQL has anything even remotely like what you're looking for.

So ultimately, you're going to have to do your filtering on the
database, and then all the aggregation in Python. And it's going to be
somewhat complicated code, too. Best I can think of is this, as
partial pseudo-code:

# Assumes `cursor` is an open DB-API cursor on which something like
# "SELECT x, y, e FROM t WHERE ... ORDER BY x" has been executed.
last_x = -999
x_map = []     # list of groups; each group is a list of (y, e) tuples
y_map = {}     # maps a y value to the group (list) it currently belongs to
merge_me = []  # pairs of groups that need to be merged later

for x, y, e in cursor:
    if x < last_x + 1:
        x_map[-1].append((y, e))  # within tolerance of previous x: same group
    else:
        x_map.append([(y, e)])    # gap in x: start a new group
    last_x = x
    if y in y_map:
        merge_me.append((y_map[y], x_map[-1]))
    y_map[y] = x_map[-1]

# At this point, you have x_map which is a list of lists, each one
# being one group, and y_map which maps a y value to its x_map list.

last_y = -999
last_x_map = None
for y in sorted(y_map.keys()):
    if y < last_y + 1:
        merge_me.append((y_map[y], last_x_map))
    last_y = y
    last_x_map = y_map[y]

for merge1, merge2 in merge_me:
    if merge1 is merge2:
        continue            # already the same group
    merge1.extend(merge2)
    merge2[:] = []          # empty out the merged-away list

for lst in x_map:
    if not lst:
        continue            # been emptied out, ignore it
    es = [float(e) for y, e in lst]
    mean = sum(es) / len(es)
    std = (sum((v - mean) ** 2 for v in es) / len(es)) ** 0.5
    # ...emit whatever aggregate stats are needed for this group

I think this should be linear complexity overall, but there may be a
few aspects of it that are quadratic. It's a tad messy though, and
completely untested. But that's an algorithmic start. The idea is that
lists get collected based on x proximity, and then lists get merged
based on y proximity. That is, if you have (1.0, 10.1), (1.5, 2.3),
(3.0, 11.0), (3.2, 15.2), they'll all be treated as a single
aggregation unit. If that's not what you want, I'm not sure how to
handle it.

ChrisA

Larry Martell

Jan 13, 2014, 1:27:37 PM
to Chris Angelico, pytho...@python.org
Thanks. Unfortunately this has been made a low priority task and I've
been put on to something else (I hate when they do that). I'll revive
this thread when I'm allowed to get back on this.

Chris Angelico

Jan 13, 2014, 1:32:17 PM
to pytho...@python.org
On Tue, Jan 14, 2014 at 5:27 AM, Larry Martell <larry....@gmail.com> wrote:
> Thanks. Unfortunately this has been made a low priority task and I've
> been put on to something else (I hate when they do that).

Ugh, I know that feeling all too well! Life's better when you're
unemployed, and you can choose the interesting problems to work on.
Apart from the "has to be in MySQL" restriction (dodged now that the
work's all being done in Python anyway), yours is _definitely_ an
interesting problem.

ChrisA

Mark Lawrence

Jan 13, 2014, 1:36:58 PM
to pytho...@python.org
On 13/01/2014 18:32, Chris Angelico wrote:
> On Tue, Jan 14, 2014 at 5:27 AM, Larry Martell <larry....@gmail.com> wrote:
>> Thanks. Unfortunately this has been made a low priority task and I've
>> been put on to something else (I hate when they do that).
>
> Ugh, I know that feeling all too well! Life's better when you're
> unemployed, and you can choose the interesting problems to work on.

No it ain't :(

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Mark Lawrence

Jan 13, 2014, 1:42:39 PM
to pytho...@python.org
> Thanks. Unfortunately this has been made a low priority task and I've
> been put on to something else (I hate when they do that). I'll revive
> this thread when I'm allowed to get back on this.
>

I've not followed this thread closely but would this help
http://pandas.pydata.org/ ? When and if you get back to it, that is!!!
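
A rough sketch of how pandas might tackle the x half of the rolling
rule (illustrative only: the e values are made up, and the either-or
y rule would still need a separate merge pass):

import pandas as pd

df = pd.DataFrame({"x": [1.5, 2.4, 3.3, 4.2, 5.1, 6.0, 8.0, 100.0],
                   "y": [9.5, 20.8, 40.6, 2.5, 10.1, 7.9, 21.0, 200.0],
                   "e": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]})

df = df.sort_values("x")
# a gap of more than 1 in x starts a new group; cumsum turns the gaps
# into consecutive group ids
df["grp"] = (df["x"].diff() > 1).cumsum()
print(df.groupby("grp")["e"].agg(["count", "mean", "std"]))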

Larry Martell

Jan 13, 2014, 2:51:52 PM
to Chris Angelico, pytho...@python.org
On Mon, Jan 13, 2014 at 1:32 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Tue, Jan 14, 2014 at 5:27 AM, Larry Martell <larry....@gmail.com> wrote:
>> Thanks. Unfortunately this has been made a low priority task and I've
>> been put on to something else (I hate when they do that).
>
> Ugh, I know that feeling all too well!

Right? You're deep in debugging and researching and waking up in the
middle of the night with potential solutions, and then they say put it
aside. It's so hard to put it down, and then when you pick it up later
(sometimes months) you're like WTF is this all about. I recently
picked up something I had to put down in September - spent an entire
day getting back to where I was, then it was put on the back burner
again.

> Life's better when you're
> unemployed, and you can choose the interesting problems to work on.

Ahhh .... I don't think so.

> Apart from the "has to be in MySQL" restriction (dodged now that the
> work's all being done in Python anyway),

It's a big existing Django app.

> yours is _definitely_ an
> interesting problem.

Thanks! I thought so too.

Petite Abeille

Jan 13, 2014, 3:47:01 PM
to Python List

On Jan 13, 2014, at 7:42 PM, Mark Lawrence <bream...@yahoo.co.uk> wrote:

> I've not followed this thread closely but would this help http://pandas.pydata.org/ ? When and if you get back to it, that is!!!

I doubt it. The main overhead by far would be shuffling pointless data between the server & client. Best to process data closest to its source.

On the other hand:

http://devour.com/video/never-say-no-to-panda/



Larry Martell

Jan 14, 2014, 9:18:50 AM
to Chris Angelico, pytho...@python.org
On Mon, Jan 13, 2014 at 11:32 AM, Chris Angelico <ros...@gmail.com> wrote:
> On Tue, Jan 14, 2014 at 5:27 AM, Larry Martell <larry....@gmail.com> wrote:
>> Thanks. Unfortunately this has been made a low priority task and I've
>> been put on to something else (I hate when they do that).
>
> Ugh, I know that feeling all too well! Life's better when you're
> unemployed, and you can choose the interesting problems to work on.
> Apart from the "has to be in MySQL" restriction (dodged now that the
> work's all being done in Python anyway), yours is _definitely_ an
> interesting problem.

If you're interested in what the application is: this is data
collected with an electron microscope from semiconductor wafers as
they are being manufactured. The x and y are the position on the
wafer where the data was collected, in microns. If 2 data points are
collected within 1 micron of each other, they need to be combined
when being analyzed.

Chris Angelico

Jan 14, 2014, 9:26:31 AM
to pytho...@python.org
On Wed, Jan 15, 2014 at 1:18 AM, Larry Martell <larry....@gmail.com> wrote:
> if you're interested in what the application is, this is data
> collected with an electron microscope from semiconductor wafers as
> they are being manufactured. The x and y are the position on the wafer
> that the data was collected, in microns. If 2 data points are
> collected within 1 micron of each other they need to be combined when
> being analyzed.

As far as I'm concerned, you won geek cred the moment you said
"electron microscope", and "semiconductor wafers as they are being
manufactured" is just gravy. I don't suppose you want to hire another
programmer? :)

Do you actually mean here that the two points need to be within 1
micron, or that data gets combined if it's nearby in *either*
coordinate? There are libraries for figuring out if two things are
near each other - I'm not 100% sure, but you might be able to do this
inside PostgreSQL (though that just gets back to the previous rule:
can't move off MySQL). Treat every data point as a circle or square,
and then look for overlap.

ChrisA

Larry Martell

Jan 14, 2014, 11:23:02 AM
to Chris Angelico, pytho...@python.org
On Tue, Jan 14, 2014 at 7:26 AM, Chris Angelico <ros...@gmail.com> wrote:
> On Wed, Jan 15, 2014 at 1:18 AM, Larry Martell <larry....@gmail.com> wrote:
>> if you're interested in what the application is, this is data
>> collected with an electron microscope from semiconductor wafers as
>> they are being manufactured. The x and y are the position on the wafer
>> that the data was collected, in microns. If 2 data points are
>> collected within 1 micron of each other they need to be combined when
>> being analyzed.
>
> As far as I'm concerned, you won geek cred the moment you said
> "electron microscope", and "semiconductor wafers as they are being
> manufactured" is just gravy.

Thanks! You made my day. I showed this to my wife and she asked "Is that good?"

> Do you actually mean here that the two points need to be within 1
> micron, or that data gets combined if it's nearby in *either*
> coordinate?

Either coordinate.

Chris Angelico

Jan 14, 2014, 11:37:42 AM
to pytho...@python.org
On Wed, Jan 15, 2014 at 3:23 AM, Larry Martell <larry....@gmail.com> wrote:
>> As far as I'm concerned, you won geek cred the moment you said
>> "electron microscope", and "semiconductor wafers as they are being
>> manufactured" is just gravy.
>
> Thanks! You made my day. I showed this to my wife and she asked "Is that good?"

Heh! Geek language can be a bit weird at times. Of course, it's not a
dream job if things like this get tantalizingly dangled in front of
you and then taken away as you're given something less fun and more
urgent to do...

>> Do you actually mean here that the two points need to be within 1
>> micron, or that data gets combined if it's nearby in *either*
>> coordinate?
>
> Either coordinate.

Okay, so the code I put together earlier in the thread is correct.

ChrisA