require explicit FROMs

103 views
Skip to first unread message

Dave Vitek

unread,
Jun 24, 2017, 10:54:01 AM6/24/17
to sqlal...@googlegroups.com
Hi all,

I'll start with an example. Assume A and B are Table objects:

>>> print select([A.id], from_obj=[A], whereclause=(B.field == 123))
SELECT A.id FROM A, B WHERE B.id = 123

As a convenience, sqlalchemy has added B to the FROM clause on the
assumption that I meant to do this.

However, in my experience, this behavior has hidden many bugs. It isn't
a feature we intentionally rely on. Usually, someone forgot a join and
the query sqlalchemy invents is drastically underconstrained. These
bugs sometimes go unnoticed for a long time. I want to raise an
exception when this happens instead of letting the computer take a guess
at what the human meant.

I have modified _get_display_froms from selectable.py to return fewer
FROM entities. So far, this seems to be working. Can you think of any
pitfalls with this change?

I've attached a copy of the patch in case you want to take a look. If
there is interest in merging the change, it would need to be made opt-in
(how is this best done globally--not in a per-query fashion?).

- Dave


strict_froms.patch

Mike Bayer

unread,
Jun 24, 2017, 5:34:01 PM6/24/17
to sqlal...@googlegroups.com


On 06/24/2017 10:53 AM, Dave Vitek wrote:
> Hi all,
>
> I'll start with an example. Assume A and B are Table objects:
>
> >>> print select([A.id], from_obj=[A], whereclause=(B.field == 123))
> SELECT A.id FROM A, B WHERE B.id = 123
>
> As a convenience, sqlalchemy has added B to the FROM clause on the
> assumption that I meant to do this.
>
> However, in my experience, this behavior has hidden many bugs.

So yes, the implicit FROM behavior is for a very long time the ultimate
symptom of many other bugs, probably moreso in the ORM itself within
routines such as eager loading and joined table inheritance. A lot of
ORM bugs manifest as this behavior. However, as often as this is the
clear sign of a bug, it is actually not a bug in a whole bunch of other
cases when the "FROM comma list" is what's intended.

It isn't
> a feature we intentionally rely on. Usually, someone forgot a join and
> the query sqlalchemy invents is drastically underconstrained. These
> bugs sometimes go unnoticed for a long time. I want to raise an
> exception when this happens instead of letting the computer take a guess
> at what the human meant.

So to continue from my previous statement that "comma in the FROM
clause" is more often than not not what was intended, there is a large
class of exceptions to this, which is all those times when this *is*
what's intended.

Within the ORM, the most common place the FROM clause w/ comma is seen
is in the lazy load with a "secondary" table. In modern versions, the
two tables are added as explicit FROM, so your patch still allows this
behavior to work, though in older versions it would have failed:

from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


atob = Table(
'atob', Base.metadata,
Column('aid', ForeignKey('a.id'), primary_key=True),
Column('bid', ForeignKey('b.id'), primary_key=True)
)


class A(Base):
__tablename__ = 'a'
id = Column(Integer, primary_key=True)
bs = relationship("B", secondary=atob)


class B(Base):
__tablename__ = 'b'
id = Column(Integer, primary_key=True)

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)
s = Session(e)
s.add(A(bs=[B(), B()]))
s.commit()
a1 = s.query(A).first()
print a1.bs


lazyload at the end is:

SELECT b.id AS b_id
FROM b, atob
WHERE ? = atob.aid AND b.id = atob.bid

So this works with your patch because that routine is using explicit
select_from(). However, there are a lot of other codepaths in the ORM
that still rely on implicit FROM and will fail with this change.

>
> I have modified _get_display_froms from selectable.py to return fewer
> FROM entities.

I am pleased that you have the motivation to dig in and work on this
code, and you've produced a quality patch that has taken the time to
understand how this works. So I think we can work here to possibly
come up with a behavioral adjustment though it's not certain yet.


So far, this seems to be working. Can you think of any
> pitfalls with this change?

It's always a good idea to run the tests to see. The
README.unittests.rst file has lots of information on how to do this but
for a deep change like this it's as easy as "python setup.py test".
There are lots of cases that break, many because you've asked that the
feature be "global opt in" and that isn't yet implemented, but also many
that would need to be fixed even if the "global opt in" is enabled.

I've pushed up your patch to gerrit here:

https://gerrit.sqlalchemy.org/#/c/441/

The builds there are both the main SQLAlchemy test suite as well as a
selection of test suites from the Openstack project, and in this case
both fail:

https://jenkins.sqlalchemy.org/job/sqlalchemy_gerrit/962/

https://jenkins.sqlalchemy.org/job/openstack_gerrit/824/

(note that the above links are only good for a few days as the builds
get automatically truncated).

It's not surprising that the SQLAlchemy suite has failures, though it
has a lot, but even the Openstack suite has failures and that is much
more surprising, because Openstack does not use the fancy mapping
patterns that we see failing in the main SQLAlchemy tests.

Within the SQLAlchemy suite, we have failures in the aaa_profiling
suite, which is testing that the number of Python calls in a select() is
limited to a known value; for these particular tests the guidelines are
pretty strict, so the change adds a small performance overhead, for example:

Adjusted function call count 145 not within 5.0% of expected 157,
platform 2.7_postgresql_psycopg2_dbapiunicode_cextensions

That's not a deal breaker as those tests are intended just to alert when
the performance profile of something changes. Additionally, you've
stated you would want this to be "opt-in", so one would be opting in to
that small performance hit in any case.

The patch causes many failures just in the Core tutorial, meaning there
are a bunch of basic SELECT examples that have been there for years that
don't work under strict mode:

OperationalError: (sqlite3.OperationalError) no such column:
users.fullname [SQL: u'SELECT users.fullname || ? ||
addresses.email_address AS title \nWHERE users.id = addresses.user_id
AND users.name BETWEEN ? AND ? AND (addresses.email_address LIKE ? OR
addresses.email_address LIKE ?)'] [parameters: (', ', 'm', 'z',
'%@aol.com', '%@msn.com')]

But those are end-user queries and if they had opted in to the alternate
behavior, they'd not be writing them that way, OK.

Then, lots of ORM-inheritance kinds of tests, CTE tests, and others
break with this change, and in many cases, the Core / ORM itself is
relying on implicit FROM taking place in an internal sense, so the
application that "globally opts in" here would still fail on all these
other cases, unless fixed. Many if not all of these breakages may be
fixable by reworking all the various logic to no longer make use of the
implicit FROM logic, however that would be lots of complexity and
destabilization to make that happen.

One example is, using the Company/Person/Engineer set of models that are
similar to those introduced at
http://docs.sqlalchemy.org/en/latest/orm/inheritance.html#joined-table-inheritance,
the following query fails:

sess.query(Person)
.with_polymorphic(
Engineer,
people.outerjoin(engineers))
.order_by(Person.person_id)

The above query names only one entity, Person, and does not outwardly
break any rules about adding additional FROM objects implicitly, however
the logic that triggers when we go to load additional columns that have
to do with Manager and Boss objects (another form of lazy load) renders
an invalid SELECT:

OperationalError: (sqlite3.OperationalError) no such column:
managers.manager_name [SQL: u'SELECT managers.manager_name AS
managers_manager_name, managers.status AS managers_status,
boss.golf_swing AS boss_golf_swing \nWHERE ? = managers.person_id AND
managers.person_id = boss.boss_id'] [parameters: (3,)]

The query it wants to emit is:

SELECT managers.status AS managers_status, boss.golf_swing AS
boss_golf_swing, managers.manager_name AS managers_manager_name
FROM managers, boss
WHERE ? = managers.person_id AND managers.person_id = boss.boss_id

So fixing cases like that means a lot of new work and testing, not to
mention that all of these cases must be tested in *two* entirely
different behavioral contracts, making this a much greater maintenance
and testing burden. Any time we fix a new bug in the area of SELECT
rendering in Core or ORM, we will often have to write twice as many
tests to make sure they work in "alternate FROM mode" as well.

Over in the Openstack suite, here's a Keystone test that fails:

keystone.tests.unit.test_backend_sql.SqlIdentity.test_list_users_with_unique_id_and_idp_id

sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such
column: federated_user.user_id [SQL: u'SELECT user.enabled AS
user_enabled, user.id AS user_id, user.domain_id AS user_domain_id,
user.extra AS user_extra, user.default_project_id AS
user_default_project_id, user.created_at AS user_created_at,
user.last_active_at AS user_last_active_at \nFROM user LEFT OUTER JOIN
local_user ON user.id = local_user.user_id AND user.domain_id =
local_user.domain_id \nWHERE user.id = federated_user.user_id AND
federated_user.unique_id = ? AND federated_user.idp_id = ?']
[parameters: ('9a5a4599df20426c9e9e6c86c8e36d3b', 'ORG_IDP')]

There's seven more in that variety in Keystone, looking at the query
they are testing it is:

query = session.query(model.User).outerjoin(model.LocalUser)
query = query.filter(model.User.id == model.FederatedUser.user_id)
query = self._update_query_with_federated_statements(hints, query)
user_refs = sql.filter_limit_query(model.User, query, hints)
return [identity_base.filter_user(x.to_dict()) for x in user_refs]

We can guess that they are probably making use of the implicit FROM
feature within _update_query_with_federated_statements or
filter_limit_query(). Curious, I dug into Keystone to see what query
they intend to render; I thought hey, perhaps it would be funny if they
actually were getting a cartesian product here and your feature has
"found" it. Although, if that were the case, already I think this
points more to the kind of feature I'd prefer, which is that instead of
it just emitting the wrong SQL, it emits the SQL exactly as it does now
but emits a warning; that way, everyone gets to know about their poor
FROM listing and exactly why it is occurring, without having to
"globally opt in" to anything:

SELECT user.enabled AS user_enabled, user.id AS user_id, user.domain_id
AS user_domain_id, user.extra AS user_extra, user.default_project_id AS
user_default_project_id, user.created_at AS user_created_at,
user.last_active_at AS user_last_active_at
FROM federated_user, user LEFT OUTER JOIN local_user ON user.id =
local_user.user_id AND user.domain_id = local_user.domain_id
WHERE user.id = federated_user.user_id AND federated_user.unique_id = ?
AND federated_user.idp_id = ?

So above, we can see they are in fact using the dreaded comma in the
FROM and even in an unusual way, "FROM table1, table2 JOIN table3" which
is very odd for SQL, however they are setting up the join from
federated_user to the user table properly, and they know what they are
doing.

So let's think of "global opt in" from that perspective. Openstack
application are enormous, sprawling apps that are developed and
maintained by dozens of people who often haven't met each other. By
requiring that the feature is only available as "global opt in",
Keystone in order to use this feature would need to do the same work we
did earlier; that they'd need to locate these queries and add in the
explicit FROM clause, or otherwise alter their logic that is taking
advantage of the implicit FROM to simplify their code. Additionally,
developers who work on Keystone would observe that for some reason they
don't understand, SQLAlchemy now behaves completely differently than it
does when they work on another Openstack app like Nova that does not use
this feature.

What would more likely happen is that Keystone just wouldn't make this
change; "global opt in" is too coarse grained to make migrations towards
using this feature simple and risk-free. That would lead to a
balkanized chunk of very intricate code in the middle of Core that
doesn't get used often, which is a recipe for heavy maintenance burdens,
when the behavior of FROM logic itself needs to be maintained and now it
must be maintained across two very different behavioral contracts, one
of which is almost never used.

The original use case you had is, "the current behavior leads to subtle
bugs". Why don't we re-think how to fix that directly, which I would
propose is most easily grantable to the entire SQLAlchemy ecosystem at
once by using a warning.

The warning would be, if someone does this:

stmt = select([t1.c.a]).select_from(t1).where(t2.c.b == 5)

that is, they put explicit FROM but then suggested a second implicit
FROM, and therefore are mixing the approaches, would emit a warning, and
that's it.

For your case, you'd see these warnings in your logs, and I would argue
this is sufficient for you to prevent the thing you're concerned about,
which is the mixing of explicit / implicit FROM.

But I'm not even sure I'm comfortable with a warning here because as we
see in Keystone, they are using this feature successfully. They
shouldn't have to see a warning if they like how it works now; there is
of course the warnings filter but I've observed people are rightly still
annoyed about warnings that are perceived as undeserved.

So far I've spent a lot of time showing my thought process in great
detail here because I hope to illustrate that yes, the FROM comma-list
thing is a problem, but the use is too much a core of SQLAlchemy's being
for it to really be changed, noting that I'd love to find a way to
improve this situation. An alternative use contract is definitely an
option, but as given as a patch to Core I fear this produces a so-called
"balkanized block" that greatly adds to maintenance burden. I've had to
remove lots of "pet features" over the years that I was too liberal in
accepting up front. Usually, if we can really detect a pattern that
"smells" of an error, and there's a way to explicitly prevent it, we can
make a warning in that case, but even in that case I'm not sure people
would appreciate being pushed away from "implicit from"; it's a handy
feature when used correctly and simplifies code.

A more elaborate detection would be, look for implicit FROM, then look
inside the WHERE clause and detect that in fact the multiple FROM
entries aren't being related to each other. That logic would be very
complicated as well as a performance hit, and would often be incorrect.
We would essentially be reinventing a small fragment of what the
database's query planner already does.

We can instead just ask the database using EXPLAIN to analyze the query,
or even just turn on slow query logging. This is pretty much the
answer I'd normally have given in this area in general; your developers
may write an inefficient SQL statement whether or not SQLAlchemy makes
one particular class of them more likely, and to really look for errors
of this class, you should be looking at SQL logs, running EXPLAIN on
queries you see, and setting up slow query logging on your development /
CI servers. There are other query-planner error cases besides this
one, such as missing indexes or other table-scan scenarios, gratuitous
use of OFFSET, etc. There's lots of things that can make a query slow.
Generalized safeguards in CI would guard against this one very
specific class of error in any case.

After all that, what I can suggest for you is that since you are looking
to "raise an error", that is actually easy to do here using events or
compiles. All you need to do is compare the calculated FROM list to
the explicit FROM list:

from sqlalchemy import table, column, select
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql import Select
from sqlalchemy import exc


@compiles(Select)
def _strict_froms(stmt, compiler, **kw):
actual_froms = stmt.froms

# TODO: tune this as desired
if stmt._from_obj:
# TODO: we can give you non-underscored access to
# _from_obj
legal_froms = stmt._from_obj
else:
legal_froms = set([
col.table for col in stmt.inner_columns
])
if len(legal_froms) != len(actual_froms):
raise exc.CompileError("SELECT has implicit froms")

return compiler.visit_select(stmt, **kw)


t1 = table('t1', column('a'))
t2 = table('t2', column('b'))


# works
print select([t1]).where(t2.c.b == 5).select_from(t1).select_from(t2)

# raises
print select([t1]).where(t2.c.b == 5)

The issue of "FROM with comma" *is* a problem I'd like to continue to
explore solutions towards. I've long considered various rules in
_display_froms but none have ever gotten through the various issues I've
detailed above. I hope this helps to clarify the complexity and depth
of this situation.

Dave Vitek

unread,
Jun 25, 2017, 12:15:29 PM6/25/17
to sqlal...@googlegroups.com
Mike,

Thanks for the thorough feedback. It isn't surprising that there's a lot
of code out there relying on the current behavior, but internal
sqlalchemy reliance is certainly problematic. It sounds difficult to
walk back the existing behavior at this point and I understand the
apprehension about even adding a warning at this point.

Perhaps it could be an opt-in warning for projects that subscribe to the
"explicit from" requirement. People who contribute to multiple projects
might still be annoyed that sqlalchemy complains differently from one
project to the next, but I think there's plenty of precedent for
projects having rules about what subset of library and language features
they allow. Once a project has been bitten by this sort of bug one too
many times, they might look into enabling the warning. Not ideal, but
better than no recourse at all? The internal ORM uses of implicit FROMs
would need to be cleaned up for this to really work.

However, in my experience warnings are ignored unless they trigger test
failures. Furthermore, I am worried in part about these bugs being
security bugs, in which case I might prefer an exception even in a
production setting. Then again, I might not, since most of our implicit
FROMs that weren't recently introduced were producing intended behavior.

With respect to detecting implicit FROMs by noticing slow queries or
unexpected behavior, that works OK with some queries, but it still takes
significantly more human time than detecting things more explicitly and
earlier in the pipeline.

There are cases where an under-constrained query runs faster, and fairly
specific inputs would be required for a human to notice the bad
behavior. Specifically, I think under-constrained EXISTS clauses
frequently go undetected for some time. Some query that is supposed to
check whether a particular row in a table has some relationship ends up
checking whether any row in the table has some relationship. To make
matters worse, it isn't entirely unusual for these two behaviors to be
indistinguishable for typical data sets (perhaps because all the rows
behave the same, or the tables are frequently singletons). In my worst
nightmare, the intent might be to check if Alice has permission to do X
to Y, but the query ends up asking whether Alice has permission to do X
to anything, or whether anyone has permission to do X to Y.

The compile hook looks great and I will take a look at the test failures
caused by reliance on these features in the ORM, if only to see whether
we use or foresee using those features.

- Dave

Mike Bayer

unread,
Jun 25, 2017, 9:54:05 PM6/25/17
to sqlal...@googlegroups.com


On 06/25/2017 12:15 PM, Dave Vitek wrote:
> Mike,
>
> Thanks for the thorough feedback. It isn't surprising that there's a lot
> of code out there relying on the current behavior, but internal
> sqlalchemy reliance is certainly problematic. It sounds difficult to
> walk back the existing behavior at this point and I understand the
> apprehension about even adding a warning at this point.


The internals by and large don't rely on implicit FROM that much in any
case, since the internals need to do very complex transformations on SQL
fragments and queries that are completely arbitrary, and the FROM adds
information to the query that is otherwise often lost during these
transformations. There are of course places where implicit FROM does
come into play, and these areas currently work without issue. But
overall these use cases are not at all like the end-user use case I
think you're referring towards.

Implicit FROM is not a feature that works incorrectly, it's a feature
that works perfectly, but delays the symptoms of other particular bugs
from appearing as quickly as we'd like. I'd like to keep the focus here
on solving the problem of inadvertent bugs, and not on removal of the
feature itself which is extremely useful, intuitive, and widely relied upon.


>
> Perhaps it could be an opt-in warning for projects that subscribe to the
> "explicit from" requirement. People who contribute to multiple projects
> might still be annoyed that sqlalchemy complains differently from one
> project to the next, but I think there's plenty of precedent for
> projects having rules about what subset of library and language features
> they allow. Once a project has been bitten by this sort of bug one too
> many times, they might look into enabling the warning.
>
> However, in my experience warnings are ignored unless they trigger test
> failures.

I actually have not had much in the way of complaints about this issue
occurring in the wild. I see it a lot with the ORM features described
above but that's due to internal bugs within intricate systems that I
fix myself. It's actually a great idea to set the warnings filter so
that this does occur. py.test is aggressively moving towards making
warnings visible and making it easy to raise them also.

Furthermore, I am worried in part about these bugs being
> security bugs, in which case I might prefer an exception even in a
> production setting. Then again, I might not, since most of our implicit
> FROMs that weren't recently introduced were producing intended behavior.
>
> With respect to detecting implicit FROMs by noticing slow queries or
> unexpected behavior, that works OK with some queries, but it still takes
> significantly more human time than detecting things more explicitly and
> earlier in the pipeline.
>
> There are cases where an under-constrained query runs faster, and fairly
> specific inputs would be required for a human to notice the bad
> behavior. Specifically, I think under-constrained EXISTS clauses
> frequently go undetected for some time. Some query that is supposed to
> check whether a particular row in a table has some relationship ends up
> checking whether any row in the table has some relationship. To make
> matters worse, it isn't entirely unusual for these two behaviors to be
> indistinguishable for typical data sets (perhaps because all the rows
> behave the same, or the tables are frequently singletons). In my worst
> nightmare, the intent might be to check if Alice has permission to do X
> to Y, but the query ends up asking whether Alice has permission to do X
> to anything, or whether anyone has permission to do X to Y.

There's lots of ways to write SQL queries that don't SELECT from the
correct rows, or do get the correct rows but perform poorly (the most
common is that indexes are missed), or a combination of both. Overall,
you can't get away with only checking early or only checking late in the
pipeline, there need to be good unit tests that cover every query used
against decent test data, and there needs to be continuous analysis of
performance and SQL emitted in the running application. I know people
are doing this because that's how they alert me when the ORM is
mis-interpreting a query.

Back to the issue at hand, I still maintain that providing a "works
totally differently" mode is a recipe for headaches and complexity for
everyone, adding more codepaths that add to the long term testing
burden, diluting the usefulness of "implicit FROM" rather than fixing
it, confusing users with a choice they shouldn't have to make, and at
the end of the day providing "detects cartesian products" logic to only
a subset of applications that opt in to this alternate mode and write
their code in a totally new way.

Alternatively, logic that does do something like what I mentioned in my
other email, that is a "linter" that can scan a select() for cartesian
products would have the advantages of: 1. it can be "opt in" per
environment, e.g. development and not production, so that any codebase
can use it 2. it will find errors of the "forgot to add a JOIN" variety
no matter how it occurred, whether it be user error at the Core or ORM
level or if an ORM internal produced it 3. it would have no impact on
existing applications or internals as it would be a self-contained
routine that only runs if enabled. 4. it can be released as something
"experimental" so that people can just play with it for a year or so
having it only emit harmless warnings by default when turned on, until
we find all the problems.

One disadvantage is that it adds a performance hit. This is mitigated
by the fact that it is optional and can be enabled only in development /
tests. It also would be part of the statement compilation step so would
be part of what we cache when statement caching is used, which in 1.2 is
used more prevalently than before within the internals and is also
used by end-user code now in the form of the BakedQuery system.

Another disadvantage is that it is reinventing part of what the query
planner is sort of doing already, but as you mentioned, it's more of a
direct structural problem that is not as directly indicated within an
EXPLAIN, and at least moves up the detection of the issue.

Finally, it may be problematic that it is not always correct, more in
the case of emitting false positives, or even that there are some
queries for which it can't work correctly. It would need to be seen how
difficult this is.

Here is how it would work.

We are looking for "orphaned FROM" clauses, or even "split FROM
clauses". It's a graph problem (since SQL is an expression tree, all
these problems are just graph problems). Given a query like this:

select([a.something, b.something]).\
select_from(b).join(c, a.c_id == c.id).\
select_from(d).\
where(
and_(
a.id == b.id,
a.data == 10,
d.data == 5,
e.data = 10,
d.e_data > some_func(e.thing)
)
)

we then create a graph of the selectables that come from our FROM list,
which are a, b.join(c), d, and e:

a --- b.join(c) (joins on a.c_id == c.id)
a --- b.join(c) (a.id == b.id)
d --- e (d.e_data > some_func(e.thing))

then the graph shows that we have two disconnected graphs, the "abc"
graph and the "de" graph; that fails the lint.

Another situation would be dealing with correlation:

subq = select([b.something]).where(b.id == a.id).correlate(a).as_scalar()

stmt = select([a.something, subq])

the graph for subq would be "a --- b", even though "a" is not in the
FROM list, we know we are correlating to it at compile time so that is
given as an adequate "link".

the graph for "stmt" would be just "a" because subq is not in the FROM list.

This algorithm can of course be written and used right now within a
@compiles recipe as I stated before.

Keep in mind if you want to stick with the approach you have, you can
still just do that within @compiles. But this graph-linter could be a
neat thing, e.g. a mini-query planner that runs on the SQL expression
tree. It could even look for Index() objects in the schema and try to
anticipate table scans. Linters are popular, and this would likely be
the first of its kind. If it's really something people want to work on,
it would likely be popular.

Dave Vitek

unread,
Jun 26, 2017, 2:25:02 PM6/26/17
to sqlal...@googlegroups.com
In case anyone tries to use the compile visitor below, I'll document a
couple snags I ran into. For now, I think I'm satisfied to go back to
my original patch and deal with any merge conflicts during upgrades.

The hook has false positives when it runs on inner queries that
correlate to tables from outer queries. I think these could be
addressed by borrowing most of the logic from _get_display_froms having
to do with dropping extra FROMs to make this work right, although you
would need to get the values for explicit_correlate_froms and
implicit_correlate_froms from somewhere. I'm sure it could be made to
work eventually.

Additionally, it raises even in contexts where no sql is being
executed. For example, if someone renders a yet-to-be-nested select to
a string for debugging purposes, they might get an exception. It's also
a bit confusing to the client, because if they render the selectable, it
will show them the FROM entity that it simultaneously claims it missing
(that is, if you had a way of preventing the exception in the rendering
process). Sending the bad SQL to the RDBMS avoids these two caveats.

Also, I found it necessary to use col._from_objects instead of col.table
for non-trivial column elements.

Mike Bayer

unread,
Jun 28, 2017, 12:09:29 PM6/28/17
to sqlal...@googlegroups.com
On Mon, Jun 26, 2017 at 2:24 PM, Dave Vitek <dvi...@grammatech.com> wrote:
In case anyone tries to use the compile visitor below, I'll document a couple snags I ran into.  For now, I think I'm satisfied to go back to my original patch and deal with any merge conflicts during upgrades.

The hook has false positives when it runs on inner queries that correlate to tables from outer queries.  I think these could be addressed by borrowing most of the logic from _get_display_froms having to do with dropping extra FROMs to make this work right, although you would need to get the values for explicit_correlate_froms and implicit_correlate_froms from somewhere.  I'm sure it could be made to work eventually.

Here's a proof of concept for the "SELECT linter", this is totally something that could work itself into some extension, flag, or other kind of feature in SQLAlchemy but I'd need folks to help determine that this does in fact work in all cases.  I wired it into the compiler to see what the tests came up with, there's actually lots of tests that are only testing stringification which don't attempt to join the FROM clauses and these failed;  though I spot checked and it had the right answer in all those cases.


from sqlalchemy import table, column, select
from sqlalchemy.sql.expression import Select
from sqlalchemy.sql import visitors
from sqlalchemy.ext.compiler import compiles
from sqlalchemy import util
import itertools
import collections


@compiles(Select)
def _lint_select(element, compiler, **kw):
    froms = set(element.froms)
    if not froms:
        return
    edges = set()

    # find all "a <operator> b", add that as edges
    def visit_binary(binary_element):

        edges.update(
            itertools.product(
                binary_element.left._from_objects,
                binary_element.right._from_objects
            )
        )

    # find all "a JOIN b", add "a" and "b" as froms
    def visit_join(join_element):
        if join_element in froms:
            froms.remove(join_element)
            froms.update((join_element.left, join_element.right))

    # unwrap "FromGrouping" objects, e.g. parentheized froms
    def visit_grouping(grouping_element):
        if grouping_element in froms:
            froms.remove(grouping_element)

            # the enclosed element will often be a JOIN.  The visitors.traverse
            # does a depth-first outside-in traversal so the next
            # call will be visit_join() of this element :)
            froms.add(grouping_element.element)

    visitors.traverse(
        element, {}, {
            'binary': visit_binary, 'join': visit_join,
            'grouping': visit_grouping
        })

    # take any element from the list of FROMS.
    # then traverse all the edges and ensure we can reach
    # all other FROMS
    start_with = froms.pop()
    the_rest = froms
    stack = collections.deque([start_with])
    while stack and the_rest:
        node = stack.popleft()
        the_rest.discard(node)
        for edge in list(edges):
            if edge not in edges:
                continue
            elif edge[0] is node:
                edges.remove(edge)
                stack.appendleft(edge[1])
            elif edge[1] is node:
                edges.remove(edge)
                stack.appendleft(edge[0])

    # FROMS left over?  boom
    if the_rest:
        util.warn(
            "for stmt %s FROM elements %s are not "
            "joined up to FROM element %r" % (
                id(element),  # defeat the warnings filter
                ", ".join('"%r"' % f for f in the_rest),
                start_with
            )
        )
    return compiler.visit_select(element, **kw)

a = table('a', column('x'))
b = table('b', column('x'))
c = table('c', column('x'))
d = table('d', column('x'))


print "-------------------------------------"
print "everything is connected"
print select([a]).\
    select_from(a.join(b, a.c.x == b.c.x)).\
    select_from(c).select_from(d).\
    where(d.c.x == b.c.x).\
    where(c.c.x == d.c.x).where(c.c.x == 5)

print "-------------------------------------"
print "disconnect between a,b, c,d"
print select([a]).\
    select_from(a.join(b, a.c.x == b.c.x)).\
    select_from(c).select_from(d).\
    where(c.c.x == d.c.x).where(c.c.x == 5)

print "-------------------------------------"
print "c and d both disconnected"
print select([a]).\
    select_from(a.join(b, a.c.x == b.c.x)).\
    where(c.c.x == 5).where(d.c.x == 10)

print "-------------------------------------"
print "now connected"
print select([a]).\
    select_from(a.join(b, a.c.x == b.c.x)).\
    select_from(c.join(d, c.c.x == d.c.x)).\
    where(c.c.x == b.c.x).\
    where(c.c.x == 5).where(d.c.x == 10)


print "-----------------------------------"
print "test disconnected subquery"
subq = select([a]).where(a.c.x == b.c.x)
stmt = select([c]).select_from(subq)
print stmt

print "-----------------------------------"
print "now connect it"
subq = select([a]).where(a.c.x == b.c.x).alias()
stmt = select([c]).select_from(subq).where(c.c.x == subq.c.x)
print stmt


print "-----------------------------------"
print "right nested join w/o issue"

print select([a]).select_from(
    a.join(
        b.join(c, b.c.x == c.c.x), a.c.x == b.c.x
    )
)

print "-----------------------------------"
print "right nested join with an issue"

print select([a]).select_from(
    a.join(
        b.join(c, b.c.x == c.c.x), a.c.x == b.c.x
    )
).where(d.c.x == 5)


output:

-------------------------------------
everything is connected
SELECT a.x
FROM d, c, a JOIN b ON a.x = b.x
WHERE d.x = b.x AND c.x = d.x AND c.x = :x_1
-------------------------------------
disconnect between a,b, c,d
test.py:75: SAWarning: for stmt 139871395661008 FROM elements "<sqlalchemy.sql.selectable.TableClause at 0x7f3658d93bd0; b>", "<sqlalchemy.sql.selectable.TableClause at 0x7f3658d93a10; a>" are not joined up to FROM element <sqlalchemy.sql.selectable.TableClause at 0x7f3658d93c90; c>
  start_with
SELECT a.x
FROM c, d, a JOIN b ON a.x = b.x
WHERE c.x = d.x AND c.x = :x_1
-------------------------------------
c and d both disconnected
test.py:75: SAWarning: for stmt 139871395662864 FROM elements "<sqlalchemy.sql.selectable.TableClause at 0x7f3658d93bd0; b>", "<sqlalchemy.sql.selectable.TableClause at 0x7f3658d93d50; d>", "<sqlalchemy.sql.selectable.TableClause at 0x7f3658d93a10; a>" are not joined up to FROM element <sqlalchemy.sql.selectable.TableClause at 0x7f3658d93c90; c>
  start_with
SELECT a.x
FROM c, d, a JOIN b ON a.x = b.x
WHERE c.x = :x_1 AND d.x = :x_2
-------------------------------------
now connected
SELECT a.x
FROM a JOIN b ON a.x = b.x, c JOIN d ON c.x = d.x
WHERE c.x = b.x AND c.x = :x_1 AND d.x = :x_2
-----------------------------------
test disconnected subquery
test.py:75: SAWarning: for stmt 139871395659920 FROM elements "<sqlalchemy.sql.selectable.Select at 0x7f3658da4390; Select object>" are not joined up to FROM element <sqlalchemy.sql.selectable.TableClause at 0x7f3658d93c90; c>
  start_with
SELECT c.x
FROM c, (SELECT a.x AS x
FROM a, b
WHERE a.x = b.x)
-----------------------------------
now connect it
SELECT c.x
FROM c, (SELECT a.x AS x
FROM a, b
WHERE a.x = b.x) AS anon_1
WHERE c.x = anon_1.x
-----------------------------------
right nested join w/o issue
SELECT a.x
FROM a JOIN (b JOIN c ON b.x = c.x) ON a.x = b.x
-----------------------------------
right nested join with an issue
test.py:75: SAWarning: for stmt 139871393147280 FROM elements "<sqlalchemy.sql.selectable.TableClause at 0x7f3658d93d50; d>" are not joined up to FROM element <sqlalchemy.sql.selectable.TableClause at 0x7f3658d93a10; a>
  start_with
SELECT a.x
FROM d, a JOIN (b JOIN c ON b.x = c.x) ON a.x = b.x
WHERE d.x = :x_1
 
--
SQLAlchemy - The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+unsubscribe@googlegroups.com.
To post to this group, send email to sqlal...@googlegroups.com.
Visit this group at https://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/d/optout.

Reply all
Reply to author
Forward
0 new messages