I know some differences between SELECT DISTINCT vs GROUP BY in MySQL, but I'd like to open this up as a discussion on the group. Each can accomplish the same thing depending on what you are doing, yet each has it's own purpose. I'd like to see what the group says before I say anything myself (call it a brain teaser if you wish). When do you use each of these? When do you choose one over the other? For other database engines, what's the equivalent (if there's any difference), and how do the same questions apply to your favorite database?
On Thu, Jan 15, 2009 at 07:16:04PM -0700, Brandon Stout wrote:
> I know some differences between SELECT DISTINCT vs GROUP BY in MySQL, > but I'd like to open this up as a discussion on the group. Each can > accomplish the same thing depending on what you are doing, yet each has > it's own purpose. I'd like to see what the group says before I say > anything myself (call it a brain teaser if you wish). When do you use > each of these? When do you choose one over the other? For other > database engines, what's the equivalent (if there's any difference), and > how do the same questions apply to your favorite database?
In the development version of PostgreSQL, SELECT DISTINCT is implemented using the same logic as GROUP BY, because the GROUP BY logic is a performance improvement over the code that was used.
Joshua Tolley wrote: > On Thu, Jan 15, 2009 at 07:16:04PM -0700, Brandon Stout wrote: >> I know some differences between SELECT DISTINCT vs GROUP BY in MySQL, >> but I'd like to open this up as a discussion on the group. Each can >> accomplish the same thing depending on what you are doing, yet each has >> it's own purpose. I'd like to see what the group says before I say >> anything myself (call it a brain teaser if you wish). When do you use >> each of these? When do you choose one over the other? For other >> database engines, what's the equivalent (if there's any difference), and >> how do the same questions apply to your favorite database?
> In the development version of PostgreSQL, SELECT DISTINCT is implemented > using the same logic as GROUP BY, because the GROUP BY logic is a > performance improvement over the code that was used.
> - Josh
Does SELECT DISTINCT also sort, then, if it does the same as GROUP BY in PostgreSQL?
Sasha Pachev wrote: > With MySQL, I would recommend using SELECT DISTINCT when you need just > distinct values, and GROUP BY when you need more than that. Even > though in a lot of cases the conceptual logic and the number of > examined rows may end up being the same, even then the optimizer will > breathe a sigh of relief when it notices there is no GROUP BY. So a > lot of potentially unnecessary code will not get executed.
> Additionally, MySQL GROUP BY always orders the results. You can even > say something like GROUP BY name ASC or GROUP BY name DESC. DISTINCT > does not have to. In some case the need to order will burden the > optimizer.
> In short, keep it simple if you can.
Thank you Sasha! I was looking mainly at the fact that GROUP BY sorts the results while SELECT DISTINCT, plus one more I'm not so sure of. I wonder about the processing order since GROUP BY is in a different location of the SELECT statement than SELECT DISTINCT is. Any comments on that one? How about in other databases? Does MS Access even have SELECT DISTINCT?
On Tue, Jan 20, 2009 at 10:56:33PM -0700, Brandon Stout wrote: > Does SELECT DISTINCT also sort, then, if it does the same as GROUP BY in > PostgreSQL?
Neither sorts, unless you explicitly tell it to with an ORDER BY clause. The grouping algorithm used (at least most of the time) is hash-based, so there's no sorting required.
Joshua Tolley wrote: > On Tue, Jan 20, 2009 at 10:56:33PM -0700, Brandon Stout wrote: >> Does SELECT DISTINCT also sort, then, if it does the same as GROUP BY in >> PostgreSQL?
> Neither sorts, unless you explicitly tell it to with an ORDER BY clause. The > grouping algorithm used (at least most of the time) is hash-based, so there's > no sorting required.
> - Josh
So, to further clarify, this table:
apple red apple red apple red apple green apple green banana yellow apple red
when selecting distinct would return this, since it's not sorted:
On Wed, Jan 21, 2009 at 03:00:42PM -0700, Brandon Stout wrote: > Joshua Tolley wrote: > > On Tue, Jan 20, 2009 at 10:56:33PM -0700, Brandon Stout wrote: > >> Does SELECT DISTINCT also sort, then, if it does the same as GROUP BY in > >> PostgreSQL?
> > Neither sorts, unless you explicitly tell it to with an ORDER BY clause. The > > grouping algorithm used (at least most of the time) is hash-based, so there's > > no sorting required.
> > - Josh
> So, to further clarify, this table:
> apple red > apple red > apple red > apple green > apple green > banana yellow > apple red
> when selecting distinct would return this, since it's not sorted:
> apple red > apple green > banana yellow > apple red
Nope -- it still calculates the DISTINCT correctly, despite not sorting. It finds distinct values by putting each datum into a hash table, and then returning all the hash table's keys when it finishes.
5432 josh@josh# select * from fruit ; fruit_name | fruit_color ------------+------------- apple | red apple | red apple | red apple | green apple | green banana | yellow apple | red (7 rows)
5432 josh@josh*# select distinct * from fruit; fruit_name | fruit_color ------------+------------- banana | yellow apple | red apple | green (3 rows)
Joshua Tolley wrote: > Nope -- it still calculates the DISTINCT correctly, despite not sorting. > It finds distinct values by putting each datum into a hash table, and > then returning all the hash table's keys when it finishes.
> 5432 josh@josh# select * from fruit ; > fruit_name | fruit_color > ------------+------------- > apple | red > apple | red > apple | red > apple | green > apple | green > banana | yellow > apple | red > (7 rows)
> 5432 josh@josh*# select distinct * from fruit; > fruit_name | fruit_color > ------------+------------- > banana | yellow > apple | red > apple | green > (3 rows)
> - Josh
Very good. That's what I hoped since that's what happens with MySQL, and that's what I would expect any true database to do. This brings out another weakness with spreadsheets - at least from the day I used them. You have to sort before you can do 'subtotals'.