Select * but apply distinct to one column

1,607 views
Skip to first unread message

Alex Hall

unread,
Mar 9, 2016, 1:38:23 PM3/9/16
to sqlalchemy
Hi all,
I want to select * from a table, getting all columns. However, the
only rows I want are where the item number is distinct. I've got:
items = session.query(itemTable)\
.distinct()\
.limit(10)
But that doesn't apply "distinct" to just item_number. I'm not the
best with SQL in general or I'd express the query I want so you could
see it. Hopefully my explanation is clear enough.

After my fighting with the iSeries here at work a few weeks ago, I set
up SA to access a Microsoft SQL database yesterday... In about two
hours. That includes setting up the DSN, getting the database name
wrong, getting the credentials wrong, and other non-SA problems. With
all that, SA itself was a breeze. It's amazing what happens when you
don't try to use IBM machines in the mix. :)

Jonathan Vanasco

unread,
Mar 9, 2016, 2:52:19 PM3/9/16
to sqlalchemy
It would probably be best for you to figure out the correct raw sql you want, then convert it to SqlAlchemy.  

Postgres is the only DB I know of that offers "DISTINCT ON (columns)" -- and even that works a bit awkward.

The query that you want to do isn't actually simple -- there are concerns with how to handle duplicate rows (based on the distinct field).    Often people will use "GROUP BY" + "ORDER BY" along with distincts, subselects and misc database functions.

If I were in your place, I would read through some DB tutorials and StackOverflow questions on how people are dealing with similar problems.  That should help you learn.

Alex Hall

unread,
Mar 9, 2016, 3:02:05 PM3/9/16
to sqlal...@googlegroups.com
Fair enough, thanks. I didn't realize it was such a complex task; I
figured it was just a matter of passing an argument to distinct() or
something equally easy. Speed isn't a huge concern, so I suppose I
could get around this by storing the item numbers I find and then
checking that the row I'm about to use doesn't have a number in that
set. Still, there could be hundreds of thousands of items, so that
might not be the best plan. Anyway, I'll look into it more.
> --
> You received this message because you are subscribed to the Google Groups
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to sqlalchemy+...@googlegroups.com.
> To post to this group, send email to sqlal...@googlegroups.com.
> Visit this group at https://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.
>

Mike Bayer

unread,
Mar 9, 2016, 3:02:51 PM3/9/16
to sqlal...@googlegroups.com


On 03/09/2016 01:38 PM, Alex Hall wrote:
> Hi all,
> I want to select * from a table, getting all columns. However, the
> only rows I want are where the item number is distinct. I've got:
> items = session.query(itemTable)\
> .distinct()\
> .limit(10)
> But that doesn't apply "distinct" to just item_number. I'm not the
> best with SQL in general or I'd express the query I want so you could
> see it. Hopefully my explanation is clear enough.

as mentioned here you can send ItemTable.item_number to the distinct()
function, but to my knowledge the "SELECT DISTINCT(x, y, z).." syntax
only works on Postgresql.

Jonathan Vanasco

unread,
Mar 9, 2016, 5:17:28 PM3/9/16
to sqlalchemy

On Wednesday, March 9, 2016 at 3:02:05 PM UTC-5, Alex Hall wrote:
Fair enough, thanks. I didn't realize it was such a complex task; I
figured it was just a matter of passing an argument to distinct() or
something equally easy.


Yeah PostgreSQL is the only db that supports "DISTINCT ON"... but it can be very awkward.

Let me try to explain this better, because I was you a few years ago -- and thought / believed the same things.  (and this still annoys me!)

Here's a table to represent an imaginary situation where  `id` is the primary key (it is unique) but the other columns aren't.

id | product_id | name
===+============+=====
1  | 1          | foo
2  | 1          | bar
3  | 2          | biz
4  | 2          | bang
5  | 3          | foo
6  | 1          | bar

The distinct column values are:

    id - 1,2,3,4,5
    product_id - 1, 2, 3
    name - foo, bar, biz, bang

If you want to get distinct data from the table though, you need to think in rows.  (unless you're querying for column data)

If you want "distinct" rows based on the product id, how should these 3 rows be handled?

1  | 1          | foo
2  | 1          | bar
6  | 1          | bar

They all have 1 for the product_id.

The rows are all distinct if we think of the primary id key being an attribute.
If we limit the distinction to the product_id and the name, we can drop the 3 down to 2 combinations:

1          | foo
1          | bar

But this probably won't work for your needs.  

The (1, foo) row corresponds to id 1; 
but the (1, bar) row could correspond to (2,1,bar) or (6,1,bar) rows in the table.

So when you say only want rows "where the item number is distinct.", you should try asking "What should I do with rows where the item_number isn't distinct?"

That should raise some red flags for you, and help you realize that you probably don't really want rows where the item number is distinct.    You probably want to do some other query and approach some other goal.

"DISTINCT" is (usually) a really complex situation because people often think it will do one thing, but it does something very different... and to accomplish the task they want, it's a totally different query.

Alex Hall

unread,
Mar 9, 2016, 5:29:50 PM3/9/16
to sqlal...@googlegroups.com
That makes sense. Part of my problem is that, as I've mentioned in the
past, I was recently hired. I didn't set anything up, and I still
don't know for sure what I can trust to be unique, or 6 versus 8
characters, or a lot of other small details. That said, SSMS shows the
item ID as a primary key, which means it is unique. I think I'm safe
to just apply distinct() to my entire query, since there's no way the
ID can ever be repeated. I've been looking at a bunch of tables today,
and I had it in my head that the id in this one was only *part of* the
PK and thus could be duplicated. At least I learned something from all
this. Thanks again for the help, guys.

On 3/9/16, Jonathan Vanasco <jona...@findmeon.com> wrote:
>

Jonathan Vanasco

unread,
Mar 10, 2016, 12:09:42 AM3/10/16
to sqlalchemy
If that ID is the primary key, then don't bother with a DISTINCT().  Just select everything from the table.  Otherwise you're going to make the backend select everything , then waste time doing the distinct.

A quick way to confirm would be to just run these 2 commands:

    SELECT COUNT(item_id) FROM table;
    SELECT COUNT(item_id) FROM (SELECT DISTINCT item_id FROM TABLE) q_distinct;

If they both return the same count, there are no dupes.
Reply all
Reply to author
Forward
0 new messages