Joomla! 2.5 + Virtuoso Universal Server (ODBC)

282 views
Skip to first unread message

Will Daniels

unread,
Mar 24, 2012, 1:39:54 PM3/24/12
to joomla-de...@googlegroups.com
Dear Joomla! Development Team,

Warning, this is _VERY_ LONG. I'm sorry, but I don't know how to write it all
differently. I have tried to summarise below enough to know whether you want to
read anything in more depth...

Executive Summary
-----------------

* I have created a Joomla! ODBC driver for OpenLink's Virtuoso Universal Server
* The main use cases for Virtuoso relate to publication for the Semantic Web
* Currently the implementation is "rudimentary" but essentially working.
* During this experiment I encountered some;
- ISO SQL standards compliance issues in core Joomla! queries
- problems and omissions with the JDatabase abstraction
- general issues with grouping long data types, auto-increment fields
- reserved words in the Virtuoso RDBMS that are part of the Joomla! schema
- other little bugs
* I'm seeking feedback about these issues to know how best to proceed because;
- I have made hundreds of changes to quote SQL reserved words in Joomla!
- it would help my planning to know if some issues can be addressed in core


Some background
---------------

OpenLink's Virtuoso Server has both Open Source and Commercial editions. At core
it is a high-performance RDBMS with close technological ties to ODBC and SQL
standardisation. Virtuoso is also a webserver with a PHP SAPI module, so it can
actually host Joomla! entirely within a single server process (as a point of
interest, not that it's necessarily a good idea or even useful in most cases).

In recent times Virtuoso has mostly differentiated itself from mainstream RDBMS
by way of it's built-in RDF store and SPARQL query processing abilities, making
it ideal for Semantic Web applications.

I don't work for or have any affiliation with OpenLink, I'm just a guy who loves
both Joomla! and Virtuoso, with an interest in Semantic Web technologies.
Basically, I want to play around with using Virtuoso to create RDF mappings of
Joomla! content, among other things. Traditionally, Drupal has always been the
CMS of choice for RDF and other Semantic Web stuff, but I wanted to use Joomla!
and I guess that is what Open Source is all about :P

I'm hoping that any work I do on Virtuoso's ODBC driver for Joomla! might help
someone else to at least get started with other ODBC drivers for Joomla! If
there is another ODBC implementation around already, I didn't find it. So maybe
parts of my class could be separated as a generic ODBC driver (from which the
Virtuoso one and other variants can inherit) once it's fully completed, tested
and matured.

The reason I'm writing to you is because I'm hoping that Joomla! wants to fix
some of the issues I found that could potentially help other RDBMS driver
efforts, especially while my changes are against recent source. I thought it
best to open discussion about these things before trying to file any bugs, to
find out which issues you might accept and how you might want to deal with those
(e.g. how many bugs to file, whether to provide patches, against which source
trees, at what level of granularity etc.).

And I thought you just might like to know anyway about Joomla! running on
Virtuoso as a curiosity if nothing else.


Issues related to ISO SQL Compliance
------------------------------------

Summary:

* Use <> instead of != for "not equal"
* Quote SQL reserved words LANGUAGE, MODULE, OPTION and TIME as identifiers [1]
* DATABASE operations are not abstracted by JDatabase
* Can't assume " denotes a string in prefix replacement

The ISO standard operator for "not equal to" is the more traditional <> symbol
instead of the PHP-like != symbol. Not all RDBMS will work with != but all that
I know of work with <> operator. As of 2.5.2 there were 27 occurrences of !=
used in SQL across 20 files [2].

LANGUAGE is a reserved word in ISO SQL-92, SQL-99 and also SQL-2003. I'm sure I
don't need to tell you that this is one is a biggie for Joomla! As of 2.5.2 I
found 77 occurrences across 40 files [3].

MODULE is also a reserved word in SQL-92, SQL-99 & SQL-2003. Not such a big one,
but still, as of 2.5.2 I found 13 occurrences across 6 files [4].

TIME is another [5] and also OPTION, but I only found one occurrence each of
these. The commit for OPTION accidentally ended up in my big commit for LANGUAGE
[3], basically because git hates me and I was too tired to bother fixing that. I
can't really imagine anyone would want one without the other anyway.

There is no abstraction within the driver class for database creation, and
JInstallationModelDatabase::createDatabase makes assumptions:

$query = 'CREATE DATABASE '.$db->quoteName($name).' CHARACTER SET `utf8`';

Similarly in setDatabaseCharset with ALTER DATABASE. Neither of these statements
are applicable to (or supported by) Virtuoso and `utf8` is using MySQL
identifier quotes, so I've assumed this part of the installation is already
slated for re-implementation and have not tried to patch the code here. Instead
I'm just using regex in setQuery to fix this and a couple of other issues. But
if you would welcome patches from outside the team to implement such things, I
would of course be happy to write them.

Finally, there is an issue in the base JDatabase class replacePrefix method,
where it tries to be too clever about strings. In Virtuoso (and as per SQL
standard) strings must be single-quoted and only identifiers are double-quoted.
So in the case of ISO standard SQL JDatabase will deliberately _not_ replace #__
in the double-quoted identifer names, wrongly assuming that they are string
literals.

But this is essentially a non-issue because it's easy to just overload that
method in the driver class. I note it here more for completeness and because it
might be "nice" to have JDatabase base class be fully RDBMS-agnostic about such
things, particularly where this runs contrary to standards.


Issues related to ODBC
----------------------

Summary:

* DATETIME literal formats should not assume to be quoted as string literals
* Connection parameters could be improved for DSNs instead of host names
* Ideally the database driver could pre-populate some connection parameters

The JDatabase base class defines getDateFormat to allow for differences between
RDBMS implementations for date and time literals. However, since the base
implementation does not automatically quote the usual ISO format date/time
strings, consumers of this function all do that themselves. However, this is an
unsafe assumption, for example in ODBC the following "brace escape" notations
are used:

{ts 'YYYY-MM-DD HH:MM:SS'}
{d 'YYYY-MM-DD'}
{t 'HH:MM:SS'}

Currently, if I were to override getDateTime to:

return '{ \\t\\s \'d.m.Y H:i:s\' }';

...the correct syntax is produced by PHP in DateTime::format, but this all gets
escaped and quoted as a string literal, and so it ends up in queries as
something like:

AND (foo OR a.publish_up <= '{ts \'24-03-2012 05:43:47\'}')

...instead of the required...

AND (foo OR a.publish_up <= {ts '24-03-2012 05:43:47'})

The problem would be the same trying to use ISO standard syntax here (similar to
ODBC with a type prefix to the string literal, only without the curly brackets),
or even a CAST. I might be wrong, but I think all RDBMS that will happily accept
solely a string literal for DATETIME values (i.e. all cases where the current
method is suitable), would accept the typical ISO-style format anyway.

So really, I think quoting as a string literal (where needed) should be included
as part of the format returned from getDateFormat, which would be more
convenient for the client code anyway. Without such a change, all I can do is
regex rewrite anything that looks like an unescaped string literal
date/time/timestamp in setQuery. While the date/time pattern can be identified
quite reliably, it will take some effort to properly harden that expression to
exclude strings, functions, DDL queries and probably other exceptions. It seems
to me that getDateFormat was intended to avoid all that.

On the user interface side, with ODBC the typical host/user/password/database
credentials are not entirely appropriate because of the Data Source Name (DSN)
method of setting up connection details in ODBC INI files. These named data
sources can be enumerated using PHP's odbc_data_source function [6]. For now I
am just taking the hostname option to be the DSN. It would obviously be nicer to
present the user with a list of DSNs to select from, or at least allow the
driver class to name it's own parameters.

Furthermore, Virtuoso has a nice little trick when it is also acting as the
webserver, whereby an ODBC DSN including username and password of the user
owning the hosting virtual directory can be retrieved by a function exported by
the PHP SAPI module: __virt_internal_dsn. Where this function exists it would be
nice to at least be able to pre-populate the connection fields during installation.


Issues that may also affect other RDBMS
---------------------------------------

Summary:

* PASSWORD is a reserved word in Virtuoso (+ ?)
* Long data types not supported in GROUP BY or ORDER BY clauses
* Auto-increment fields should not be set to zero
* Limits and offsets may need SQL rewrites

PASSWORD is a reserved word in Virtuoso. I'm not aware that this is a reserved
word in ISO SQL but it does seem like one that might be reserved in other RDBMS
also. I only found two occurrences of password used unquoted [7]. More
generally, I wonder if it might not be too difficult to parse identifiers and
automatically quote them.

In converting the schema for Virtuoso (still very much a work in progress, the
driver class itself doesn't need much more) it seemed unavoidable to make a
number of fields, particularly in #__content long data types (e.g. intro text).
Unless things have changed in recent years, it is not uncommon to find that
RDBMS will not group, sort, join or aggregate such columns. However, Joomla! has
a questionable requirement to do just that, which happens at the bottom of
getListQuery methods in many of the model classes. Example:

[from components/com_content/models/articles.php]

// Add the list ordering clause.
$query->order($this->getState('list.ordering', 'a.ordering').' '
.$this->getState('list.direction', 'ASC'));

$query->group('a.id, a.title, a.alias, a.title_alias, a.introtext,
a.checked_out, a.checked_out_time, a.catid, a.created, a.created_by,
a.created_by_alias, a.created, a.modified, a.modified_by, uam.name,
a.publish_up, a.attribs, a.metadata, a.metakey, a.metadesc, a.access,
a.hits, a.xreference, a.featured, a.fulltext, a.state, a.publish_down,
badcats.id, c.title, c.path, c.access, c.alias, uam.id, ua.name, ua.email,
contact.id, parent.title, parent.id, parent.path, parent.alias,
v.rating_sum, v.rating_count, c.published, c.lft, a.ordering, parent.lft,
fp.ordering, c.id, a.images, a.urls');

return $query;

It seems excessive, unless there is sometimes an aggregate in the select list
that I didn't spot with a cursory glance and has not yet shown up in my limited
testing.

I could not understand this behaviour and was surprised that MySQL handles it
with long data types. I know for sure than many RDBMS won't. For some it might
be avoidable if they have more generous limits on maximum row size than
Virtuoso, but in my case I would need to seriously limit the amount of text
across fields like introtext, params, all the metadata fields etc. to no more
than ~5000 characters in total to fit all of it within the row.

So in the end I just added some regex to strip out any GROUP BY clause that
contained some known long field names, which appears to have dealt with the
offending queries without any obvious ill-effects. In any case, I can see this
being an issue not only for Virtuoso.

MySQL interprets setting zero to an auto-increment field to mean that you want
to take the next serial in the sequence. Virtuoso interprets zero literally, as
do some other RDBMS which expect NULL if you want an automatic default from the
sequence. Since NULL works just fine for MySQL anyway, perhaps this could be
changed for JDatabase::insertObject in the base class to exclude the primary key
from the values set for INSERTs. Fortunately this method can just be overridden,
but it might make other driver implementations less painful. I'm afraid I don't
know whether the SQL standards have any position on that matter, I suspect not.

Similar to MS SQL Server, Virtuoso uses the TOP keyword before the SELECT list
for limits and offsets (although last time I used SQL Server, either v7 or 2000
I think, it did not support offsets in TOP) which means "rewriting" within the
query rather than just appending to the end as you would with the MySQL syntax.
So to do that reliably requires regex replacement, which is not ideal and could
be avoided if passing only the column list to the "select" method.


Issues specific to Virtuoso
---------------------------

Summary:

* Implicit casting string literals to timestamp values causes logic errors [8]
* Implicit casting to non-string types (even numeric) is noisy (QW004 errors)
* Null characters in NVARCHAR data types not handled properly
(primarily affects session data field)
* No unicode collations provided by OpenLink, must be user defined
* UTF-8 collations are not currently supported

[I will document this stuff in more detail elsewhere, I doubt it is of interest
to Joomla! except to note that Internationalisation is problematic for this
driver right now, although UTF-8 works fine besides sort order]


Simple bugs in Joomla!
----------------------

There were originally three of these on my list, but I just noticed that two
have since been fixed in 2.5.3 which leaves me with just:

* Default null date hard-coded as 000-00-00 in user notes form [9]


In conclusion
-------------

I was very pleased to see the progress in making Joomla! truly RDBMS-agnostic. I
once started to implement a PostgreSQL driver some years back but there just
wasn't nearly enough abstraction then, nor did there seem to be enough
motivation within the core team to really push the amount of rewriting and
refactoring needed to make it happen.

No doubt I have missed some instances of some of the issues identified above,
have many issues yet to discover, and much work to do yet on the Virtuoso
schema, but overall it was easier than I expected to get Joomla! up and running
with Virtuoso thanks to the work done in this area. I feel much more confident
now about investing my time into such projects!

I hope you regard the information I have collected here useful for improving
Joomla! further. For my part I welcome all feedback about whether I have gone
about this the right way, if there are some issues I have misunderstood, ways I
could present information or code changes that would fit better with your
development processes and so on...

Cheers!
-Will


[0] Virtuoso Open Source Edition <http://virtuoso.openlinksw.com/>

[1] SQL Reserved Words Reference
<http://developer.mimer.se/validator/sql-reserved-words.tml>
[2] ISO Not Equal Symbol

<https://github.com/wdaniels/joomla-cms/commit/4bc6a12a091d7781df7ff4f2735eeabc51f623c4>
[3] ISO Reserved Words LANGUAGE and OPTION

<https://github.com/wdaniels/joomla-cms/commit/155781a77b9ce8aa5f597f94db714c4279ff6346>
[4] ISO Reserved Word MODULE

<https://github.com/wdaniels/joomla-cms/commit/48715e01ba614cce8aaa64086bdc63569779f35e>
[5] ISO Reserved Word TIME

<https://github.com/wdaniels/joomla-cms/commit/d1268815033b2d2a3f3f7d39a99bc64f5f7705e6>
[6] PHP Function Reference [odbc_data_source]
<http://www.php.net/manual/en/function.odbc-data-source.php>
[7] Virtuoso Reserved Word PASSWORD

<https://github.com/wdaniels/joomla-cms/commit/72b3ea88c2c7ba852c92b02229edd392fbadeccd>
[8] Reported SQL Logic Error [virtuoso-devel]

<https://sourceforge.net/mailarchive/forum.php?thread_name=4F695C0A.4030007%40willdaniels.co.uk&forum_name=virtuoso-devel>
[9] NULL Date Issue In User Notes Form

<https://github.com/wdaniels/joomla-cms/commit/fb62cca49f5cf268d1898caaee55f56c3a49fbc5>

Rouven Weßling

unread,
Mar 24, 2012, 2:32:39 PM3/24/12
to joomla-de...@googlegroups.com
Wow, impressive list of things. I noticed that some of the files affected are platform files (basically everything in libraries/joomla). Could you make a pull request for those changes to the platform repro? Seems like it aren't a lot of changes so we should be able to get that sorted out quickly.

Best regards
Rouven

elin

unread,
Mar 24, 2012, 3:41:44 PM3/24/12
to joomla-de...@googlegroups.com
Whoa, that's great. We are also looking at some query changes in the CMS that are needed for Postgres, and if you wanted to propose what you have as patches that would be great. What is good is to keep the patches to a size that is manageable for people doing manual tests. If you could make a pull request and then a linked tracker item for each of the commits that would be great.

  I think you might want to repost some this on the platform list (specifically about standards compliance and I think most importantly the date issue) because that is where the people working on database issues tend to be most active. (This mailing list is more about extension, while the CMS and Platform lists are about the core code).  


Elin

Amy Stephen

unread,
Mar 24, 2012, 2:35:29 PM3/24/12
to joomla-de...@googlegroups.com


On Sat, Mar 24, 2012 at 12:39 PM, Will Daniels <ma...@willdaniels.co.uk> wrote:
However, Joomla! has a questionable requirement to do just that, which happens at the bottom of getListQuery methods in many of the model classes. Example:

[from components/com_content/models/articles.php]

 // Add the list ordering clause.
 $query->order($this->getState('list.ordering', 'a.ordering').' '
              .$this->getState('list.direction', 'ASC'));

 $query->group('a.id, a.title, a.alias, a.title_alias, a.introtext,
   a.checked_out, a.checked_out_time, a.catid, a.created, a.created_by,
   a.created_by_alias, a.created, a.modified, a.modified_by, uam.name,
   a.publish_up, a.attribs, a.metadata, a.metakey, a.metadesc, a.access,
   a.hits, a.xreference, a.featured, a.fulltext, a.state, a.publish_down,
   badcats.id, c.title, c.path, c.access, c.alias, uam.id, ua.name, ua.email,
   contact.id, parent.title, parent.id, parent.path, parent.alias,
   v.rating_sum, v.rating_count, c.published, c.lft, a.ordering, parent.lft,
   fp.ordering, c.id, a.images, a.urls');

 return $query;

It seems excessive, unless there is sometimes an aggregate in the select list that I didn't spot with a cursory glance and has not yet shown up in my limited testing.

I could not understand this behaviour and was surprised that MySQL handles it with long data types. I know for sure than many RDBMS won't. For some it might be avoidable if they have more generous limits on maximum row size than Virtuoso, but in my case I would need to seriously limit the amount of text across fields like introtext, params, all the metadata fields etc. to no more than ~5000 characters in total to fit all of it within the row.

So in the end I just added some regex to strip out any GROUP BY clause that contained some known long field names, which appears to have dealt with the offending queries without any obvious ill-effects. In any case, I can see this being an issue not only for Virtuoso.

Completely agree with this. This morning, someone else raised the issue on the CMS list and the answer given really didn't make sense.  Hopefully, we can sort this out. Very odd.

https://groups.google.com/forum/#!topic/joomla-dev-cms/sKFdMjRHNxI

Also, might I recommend that you post this on the Platform list? That's a better fit for this type of interaction.

https://groups.google.com/forum/#!forum/joomla-dev-platform

Appreciate your work.


Will Daniels

unread,
Mar 25, 2012, 12:10:56 AM3/25/12
to joomla-de...@googlegroups.com
On 24/03/12 18:35, Amy Stephen wrote:
> Completely agree with this. This morning, someone else raised the issue on
> the CMS list and the answer given really didn't make sense. Hopefully, we
> can sort this out. Very odd.
>
> https://groups.google.com/forum/#!topic/joomla-dev-cms/sKFdMjRHNxI

You are correct Amy, Mark has misunderstood the SQL spec. This is ringing some
bells, I came across the same misunderstanding once before, though I'm
struggling to remember how it came about now.

Anyway, I'm sure we can get that all clarified. I have a copy of the standard
(somewhere) and I'm not afraid to use it :D

I will post in that thread when I get approved for the group.


> Also, might I recommend that you post this on the Platform list? That's a
> better fit for this type of interaction.
>
> https://groups.google.com/forum/#!forum/joomla-dev-platform

Will do.

> Appreciate your work.

Thanks :)

Will Daniels

unread,
Mar 25, 2012, 12:17:28 AM3/25/12
to joomla-de...@googlegroups.com
On 24/03/12 19:41, elin wrote:
> Whoa, that's great. We are also looking at some query changes in the CMS
> that are needed for Postgres, and if you wanted to propose what you have as
> patches that would be great. What is good is to keep the patches to a size
> that is manageable for people doing manual tests. If you could make a pull
> request and then a linked tracker item for each of the commits that would
> be great.
>
> I think you might want to repost some this on the platform list
> (specifically about standards compliance and I think most importantly the
> date issue) because that is where the people working on database issues
> tend to be most active. (This mailing list is more about extension, while
> the CMS and Platform lists are about the core code).
>
>
> Elin
>
> On Saturday, March 24, 2012 2:32:39 PM UTC-4, Rouven We�ling wrote:
>>
>> Wow, impressive list of things. I noticed that some of the files affected
>> are platform files (basically everything in libraries/joomla). Could you
>> make a pull request for those changes to the platform repro? Seems like it
>> aren't a lot of changes so we should be able to get that sorted out quickly.
>>
>> Best regards
>> Rouven
>>
>


OK, so I will divide my changes by Platform and CMS. Then in the CMS by
Front/Admin, then by component also? I guess this should be OK for unit testing.

I have a lot of other work the next few days but should be done some time next
week and I'll register a pull request then.

So if I make patches also to solve the date literal quoting problem, these
should make it into core eventually? Or does that need further discussion on the
platform list?

Thanks for the encouragement :)

-Will

Mark Dexter

unread,
Mar 25, 2012, 1:11:33 AM3/25/12
to joomla-de...@googlegroups.com
Well, if a column is listed in the SELECT clause (without an aggregate function), and it is NOT listed in the GROUP BY clause, what is the SQL engine supposed to do with it? In this scenario, you are only displaying one value out of multiple possible values. Which one?

This is a quote from one of many SQL online tutorials:

"When grouping, keep in mind that all columns that appear in your SELECT column list, that are not aggregated (used along with one of the SQL aggregate functions), have to appear in the GROUP BY clause too."

I'm not sure what MySQL does in this scenario. It's possible that I have misunderstood the issue, but to me this is common sense as well as a SQL requirement.

Mark

On Sat, Mar 24, 2012 at 9:17 PM, Will Daniels <ma...@willdaniels.co.uk> wrote:
On 24/03/12 19:41, elin wrote:
Whoa, that's great. We are also looking at some query changes in the CMS
that are needed for Postgres, and if you wanted to propose what you have as
patches that would be great. What is good is to keep the patches to a size
that is manageable for people doing manual tests. If you could make a pull
request and then a linked tracker item for each of the commits that would
be great.

  I think you might want to repost some this on the platform list
(specifically about standards compliance and I think most importantly the
date issue) because that is where the people working on database issues
tend to be most active. (This mailing list is more about extension, while
the CMS and Platform lists are about the core code).


Elin

On Saturday, March 24, 2012 2:32:39 PM UTC-4, Rouven Weßling wrote:

Wow, impressive list of things. I noticed that some of the files affected
are platform files (basically everything in libraries/joomla). Could you
make a pull request for those changes to the platform repro? Seems like it
aren't a lot of changes so we should be able to get that sorted out quickly.

Best regards
Rouven




OK, so I will divide my changes by Platform and CMS. Then in the CMS by Front/Admin, then by component also? I guess this should be OK for unit testing.

I have a lot of other work the next few days but should be done some time next week and I'll register a pull request then.

So if I make patches also to solve the date literal quoting problem, these should make it into core eventually? Or does that need further discussion on the platform list?

Thanks for the encouragement :)

-Will


--
You received this message because you are subscribed to the Google Groups "Joomla! General Development" group.
To post to this group, send an email to joomla-dev-general@googlegroups.com.
To unsubscribe from this group, send email to joomla-dev-general+unsub...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/joomla-dev-general?hl=en-GB.


Will Daniels

unread,
Mar 25, 2012, 1:42:50 AM3/25/12
to joomla-de...@googlegroups.com
On 25/03/12 06:11, Mark Dexter wrote:
> Well, if a column is listed in the SELECT clause (without an aggregate
> function), and it is NOT listed in the GROUP BY clause, what is the SQL
> engine supposed to do with it? In this scenario, you are only displaying
> one value out of multiple possible values. Which one?
>
> This is a quote from one of many SQL online tutorials:
>
> "When grouping, keep in mind that all columns that appear in your SELECT
> column list, that are not aggregated (used along with one of the SQL
> aggregate functions), have to appear in the GROUP BY clause too."

The operative word here is _when_. The rule does not apply if you are not
grouping _anything_. I didn't notice any aggregates used in the places where I
have seen this. And without any aggregates, ORDER BY should be used to "group"
items in a list.

This reminds me now of where I have encountered the same misunderstanding
before. Some RDBMS automatically add any missing columns to GROUP BY internally,
in selection order, at the end of the list. It is not a good idea to rely on
that behaviour for portability of course, because other RDBMS are more strict
and I think some subtlety of that behaviour (perhaps related to the SELECT *
case) changed once with MS SQL Server between versions. I'm pretty sure that is
where I've seen this come up before (but I might be wrong, it would have been
years ago).

Sam Moffatt

unread,
Mar 25, 2012, 1:42:08 AM3/25/12
to joomla-de...@googlegroups.com
To be honest I think I've seen this before with Oracle where it gets
picky about the validity of an SQL statement if you don't include all
of the columns.

Cheers,

Sam Moffatt
http://pasamio.id.au

>> joomla-de...@googlegroups.com.


>> To unsubscribe from this group, send email to

>> joomla-dev-gene...@googlegroups.com.


>> For more options, visit this group at
>> http://groups.google.com/group/joomla-dev-general?hl=en-GB.
>>
>

> --
> You received this message because you are subscribed to the Google Groups
> "Joomla! General Development" group.

> To post to this group, send an email to joomla-de...@googlegroups.com.


> To unsubscribe from this group, send email to

> joomla-dev-gene...@googlegroups.com.

Will Daniels

unread,
Mar 25, 2012, 2:27:53 AM3/25/12
to joomla-de...@googlegroups.com
On 25/03/12 06:42, Sam Moffatt wrote:
> To be honest I think I've seen this before with Oracle where it gets
> picky about the validity of an SQL statement if you don't include all
> of the columns.

Hi Sam,

Yes, Oracle I remember is one of the picky ones.

I'm 99% sure it is also one of the ones that will not GROUP BY long data types.

Quite honestly, I thought all RDBMS refused to do that until now. Normally if
you want to select a long column in conjunction with aggregates it is necessary
to either do something like CONVERT and truncate to a shorter VARCHAR, or fetch
the long column data separately afterwards.

But the point is there are no aggregates in those queries. That I noticed. I
wish I'd looked more thoroughly now, perhaps there is :@

-Will

ma...@willdaniels.co.uk

unread,
Mar 25, 2012, 3:10:52 AM3/25/12
to joomla-de...@googlegroups.com
On Sunday, March 25, 2012 7:27:53 AM UTC+1, ma...@willdaniels.co.uk wrote:

But the point is there are no aggregates in those queries. That I noticed. I
wish I'd looked more thoroughly now, perhaps there is :@


And indeed there is, at least in some places:

  SELECT c.* ... COUNT(i."id") AS numitems ... FROM jos_categories

So probably nobody has misunderstood anything, we are just talking at cross-purposes.

But the issue I originally raised is that c.* will select all columns, some of which may be long data types. In categories it is probably not an issue, I doubt we really need long types in there. In content, however...a different SQL form may be needed for portability. Maybe count using a subquery. I can certainly look into that part of the code if you'd like, although my original intention was only to raise the issue.

Cheers!
-Will  

In this case, for example

Amy Stephen

unread,
Mar 25, 2012, 1:44:31 AM3/25/12
to joomla-de...@googlegroups.com
On Sun, Mar 25, 2012 at 12:42 AM, Will Daniels <ma...@willdaniels.co.uk> wrote:
On 25/03/12 06:11, Mark Dexter wrote:
 
"When grouping,

The operative word here is _when_.


+1 Thanks Will.

Amy Stephen

unread,
Mar 25, 2012, 4:24:52 AM3/25/12
to joomla-de...@googlegroups.com
Not certain which query you are looking at since I don't see that in the content query. But, there are subqueries that have been present since 1.6. Those are not SELECT level inclusions.

You can see that subqueries in old 1.6 releases and no group bys:

old query => http://joomlacode.org/gf/project/joomla/scmsvn/?action=browse&path=%2Fdevelopment%2Fbranches%2F1.6.4%2Fcomponents%2Fcom_content%2Fmodels%2Farticles.php&view=markup

The group by is new - but I don't see any select aggregation, just subqueries.

new query => https://github.com/joomla/joomla-cms/blob/master/components/com_content/models/articles.php#L207

It's a pretty intense query - and adding aggregation isn't going to help performance any.

I still can't see why it's needed.


84.le0n

unread,
Mar 25, 2012, 4:58:03 AM3/25/12
to joomla-de...@googlegroups.com

Il giorno 25/mar/2012, alle ore 07:11, Mark Dexter <dexter...@gmail.com> ha scritto:

> This is a quote from one of many SQL online tutorials:
>
> "When grouping, keep in mind that all columns that appear in your SELECT column list, that are not aggregated (used along with one of the SQL aggregate functions), have to appear in the GROUP BY clause too."
>

True for Postgresql too.

Eng. Gabriele Pongelli

ma...@willdaniels.co.uk

unread,
Mar 25, 2012, 6:45:42 AM3/25/12
to joomla-de...@googlegroups.com
On Sunday, 25 March 2012 09:24:52 UTC+1, Amy Stephen wrote:
Not certain which query you are looking at since I don't see that in the content query. But, there are subqueries that have been present since 1.6. Those are not SELECT level inclusions.

That one I found in JCategories::_load (libraries/joomla/application/categories.php). Logically there shouldn't be any need to do anything similar with content items (counting what?) I'm just trying to give everyone the benefit of the doubt and assuming there must be aggregates sometimes, somewhere in those queries using GROUP BY.

There seems no other reasonable explanation for doing that otherwise.

It's a pretty intense query - and adding aggregation isn't going to help performance any.

Agreed, it's usually not a good idea to try to "flatten" queries to kind of "do it all in one lump", it really is taking a chance on the SQL compiler's execution plan, you could end up constructing a very large temporary table for the product of all relevant tables to solve the query (well, maybe if the stats were off or some index missing etc. I don't know and probably shouldn't comment because I haven't looked at it properly).

Sometimes subqueries (either directly in the select list or joining as a derived table for selecting the long and non-indexed fields only after finding the rows to be returned) can produce worse best-case performance but scale more linearly and/or help to dictate the execution plan.

Sadly, this is one of the things that makes it hard to be completely agnostic about the RDBMS engine, some will optimise things one way, others might make a hash of it (no pun intended), and you can't even fall back on compiler "hints" within the query.

Anyway, it doesn't much matter how I would or wouldn't do it, at this point I only care about a solution that is portable, which ironically seems to have been the motivation for having done it that way.

I still can't see why it's needed.

I suspect that it isn't, in most of the places where it's happening. If you're not using an aggregate expression, you don't _want_ much less _need_ to GROUP BY anything at all, not least because if you do, you have to add all the columns, some of which you _can't_ GROUP BY in _some_ RDBMS, at least with usable schemas.

Whether or not this RDBMS or that RDBMS requires the complete SELECT list in GROUP BY is of no real consequence because _all_ of them internally, obviously, have to group by all non-aggregated columns anyway. It is merely a convenience if they don't make you write them all out, they will still fail if they can't group on any of the data types selected, and you would see in the error details that it added those columns silently in the background by itself.

So my feedback, for what it's worth, is that the general approach here is not sufficiently portable because many RDBMS (e.g. Virtuoso, Oracle, older versions of MS SQL Server) cannot GROUP BY long data types. But nobody has really answered that point yet, so I still don't know if we can try to find a solution together to patch the main codebase, or if I need to start brushing up on my regex for the driver :(

I think I've contributed all I can to the discussion, I hope I didn't upset anyone (Mark) with wrong choices of words or poor tone. If I did, be sure it was unintended. I know well what it's like to do good work solving 99% of a problem only to have spectators start picking at the bits that are left, often just trying to make themselves look clever.

I know SQL database products pretty well and have used most of them to some extent, at some point in time. But as of now I am certainly not an expert on the latest versions of _any_ of them and much of what I knew about older versions is since forgotten. I will continue to try to pursue whatever issues I find while playing around with Joomla! on Virtuoso, to see which things the team might like to address. Virtuoso is in many ways a good RDBMS to test with because it is the ultimate standards Nazi of all RDBMS I've ever used. And if there is anything more general database-related I might be able to help with then I'm always happy to get more involved with the project.

</ramble>

elin

unread,
Mar 25, 2012, 6:50:01 AM3/25/12
to joomla-de...@googlegroups.com
Most of those long lists were added as part of the work for SqlSrv which is very picky about these issues. At the time Sudhi did that work there was lots of discussion about this because it is different from what people are used to in MySQL.  

$nulldate is already part of JDatabase and it doesn't make sense to hard code it. I think the reason you just found it in one place is that it's unusual in the core to have a date that doesn't default to Now and also that we are much more likely to see small issues like that with code migrated from 1.5 which user notes was. I would think in user notes someone will eventually probably want to make a default time period something that can be configured (such as now+2 days). 

Elin

Sam Moffatt

unread,
Mar 25, 2012, 7:39:33 AM3/25/12
to joomla-de...@googlegroups.com
Sam Moffatt
http://pasamio.id.au

On Sun, Mar 25, 2012 at 8:45 PM, ma...@willdaniels.co.uk
<ma...@willdaniels.co.uk> wrote:
> On Sunday, 25 March 2012 09:24:52 UTC+1, Amy Stephen wrote:
>>
>> Not certain which query you are looking at since I don't see that in the
>> content query. But, there are subqueries that have been present since 1.6.
>> Those are not SELECT level inclusions.
>
>
> That one I found in JCategories::_load
> (libraries/joomla/application/categories.php). Logically there shouldn't be
> any need to do anything similar with content items (counting what?) I'm just
> trying to give everyone the benefit of the doubt and assuming there must be
> aggregates sometimes, somewhere in those queries using GROUP BY.

Do you mean this excerpt from "libraries/joomla/application/categories.php"?
<quote>
$query->select('COUNT(i.' .
$db->quoteName($this->_key) . ') AS numitems');
}

// Group by
$query->group('c.id, c.asset_id, c.access, c.alias,
c.checked_out, c.checked_out_time,
c.created_time, c.created_user_id,
c.description, c.extension, c.hits, c.language, c.level,
c.lft, c.metadata, c.metadesc, c.metakey,
c.modified_time, c.note, c.params, c.parent_id,
c.path, c.published, c.rgt, c.title,
c.modified_user_id');
</quote>

Ref:
https://github.com/joomla/joomla-cms/blob/master/libraries/joomla/application/categories.php#L344

Looks pretty clear to me there is a COUNT there. It is indeed
conditionally included a mere three or four lines above the group by
statement. It could potentially be rolled in to the if statement so
it's only included when the COUNT is also added.

Cheers,

Sam

ma...@willdaniels.co.uk

unread,
Mar 25, 2012, 8:10:52 AM3/25/12
to joomla-de...@googlegroups.com
On Sunday, 25 March 2012 12:39:33 UTC+1, Samuel Moffatt wrote:

Looks pretty clear to me there is a COUNT there. It is indeed
conditionally included a mere three or four lines above the group by
statement.


Yes that is why I picked it out. The problem is with using aggregates in the same select list as selecting *, because some fields types are unsuitable for grouping in some RDBMS. 

elin

unread,
Mar 25, 2012, 10:54:53 AM3/25/12
to joomla-de...@googlegroups.com
So setting aside all of the side discussion, the concern you have is that there are some mediumtext fields in the group by. In looking around about this this morning, I learned that  MySQL handles this by using max_sort_length to truncate but you are saying that some RDBMS don't handle this natively. For example apparently for Access you need to trim to the first 255 characters in memo fields.  While I doubt that anyone is running the CMS queries on Access ( though it sure would be fascinating to see ) we should strive to be as agnostic as possible if only because of the possibility that different systems are default truncating to different lengths (of course this could already be happening on different servers even for MySQL). So to me it seems like perhaps what the database package needs to do is to default to enforce truncation to a standard length for Group By and other places where this may come into play. I don't think this is a super urgent issue in the CMS but it's a good one to think about. It's kind of cool that we're at the point where this is what we can be worried about.

Elin 

Mark Dexter

unread,
Mar 25, 2012, 11:02:16 AM3/25/12
to joomla-de...@googlegroups.com
Another way to work around this issue might be to put the mediumtext column in an aggregate expression in the SELECT clause (for example, MAX(blah)) and then remove it from the GROUP BY. Don't know if that would have the same issues or if you would want to truncate it in the aggregate expression.

Also, I agree that any query with a GROUP BY should probably not use a.* in the select clause, since this would require manually editing the GROUP BY any time a column is added to the table.

Mark

--
You received this message because you are subscribed to the Google Groups "Joomla! General Development" group.
To view this discussion on the web, visit https://groups.google.com/d/msg/joomla-dev-general/-/Vt8wH-tdRrkJ.

Amy Stephen

unread,
Mar 25, 2012, 10:54:08 AM3/25/12
to joomla-de...@googlegroups.com


On Sun, Mar 25, 2012 at 6:39 AM, Sam Moffatt <pas...@gmail.com> wrote:
Sam Moffatt
http://pasamio.id.au


Do you mean this excerpt from "libraries/joomla/application/categories.php"?
<quote>
                       $query->select('COUNT(i.' .
$db->quoteName($this->_key) . ') AS numitems');
               }

               // Group by
               $query->group('c.id, c.asset_id, c.access, c.alias,
c.checked_out, c.checked_out_time,
                       c.created_time, c.created_user_id,
c.description, c.extension, c.hits, c.language, c.level,
                       c.lft, c.metadata, c.metadesc, c.metakey,
c.modified_time, c.note, c.params, c.parent_id,
                       c.path, c.published, c.rgt, c.title,
c.modified_user_id');
</quote>

Ref:
https://github.com/joomla/joomla-cms/blob/master/libraries/joomla/application/categories.php#L344

Definitely agree, Sam, that one is aggregation and therefore requires a GROUP BY.

The ones that are a question are all of the queries that run menu item result sets for the blogs and lists, etc. Those are not aggregates or unique list generators. For that reason, they do not require GROUP BY and it does take a performance hit to do that.

The standard Mark is pointing to makes the premise "If you are going to group data, then do this." Definitely agree with that. But the question is: "Why is this data being grouped?"

I've got *many* more years of SQL Server experience than MySQL. I also have strong DB2 and Oracle experience. I am perplexed by this. My use of SQL Server was for heavy duty data warehousing so aggregating data and using GROUP BY's is not foreign to me. This really isn't even a complex database issue.

Here's a quick example of a SQL Server query with a subquery -- note how the primary query has no group by --

http://www.simple-talk.com/sql/sql-training/subqueries-in-sql-server/

Anyway, I'm done. This will be one of those things that end up coming up again and again. As we get more experience working with these multiple database environments, these things will be refined. In reality, as long as the primary key is contained within the SELECT list, the result set will not be impacted by adding a GROUP BY. But, it takes a performance hit.

Thanks.

Will Daniels

unread,
Mar 25, 2012, 10:21:40 PM3/25/12
to joomla-de...@googlegroups.com

There are two issues that are interrelated:

1) GROUP BY is used in many places where it shouldn't be there
2) Where GROUP BY _is_ needed for an aggregate, the query form must be altered
to avoid aggregating on all columns for portability.

There has been much confusion in the thread about (2) because everybody has
assumed we are talking about not explicitly listing all the columns, which was
_correctly_ added for portability. The problem is that this is still not a
viable query pattern for portable SQL.


On 25/03/12 16:02, Mark Dexter wrote:
> Another way to work around this issue might be to put the mediumtext column
> in an aggregate expression in the SELECT clause (for example, MAX(blah))
> and then remove it from the GROUP BY. Don't know if that would have the
> same issues or if you would want to truncate it in the aggregate expression.

Sure, but it's an awful lot of work (for code maintenance as much as the RDBMS)
to do things like that in places where you don't need to. In most of the places
where we have these GROUP BY clauses there are no aggregates at all, the GROUP
BY can just be removed entirely.

> Also, I agree that any query with a GROUP BY should probably not use a.* in
> the select clause, since this would require manually editing the GROUP BY
> any time a column is added to the table.

Yes, this is the only reason a problem genuinely still exists in a few places
(like the categories.php I cited before). In those cases the best solution is to
not try to do the whole query in one. So instead of the form:

SELECT c.*, COUNT(foo)
FROM categories c <join content>
WHERE <access controls>
GROUP BY <everything>

...you would do something more like...

SELECT c.*, sub.* FROM categories
INNER JOIN
( SELECT c.id _id, COUNT(foo) numitems
FROM categories c <join content>
WHERE <access controls>
GROUP BY c.id
) sub ON (c.id = sub._id)
ORDER BY <sort orders>

Thus avoiding expensive aggregation on non-indexed or large columns, and also
avoiding the problem of RDBMS that won't automagically truncate your long
columns in the background (Oracle, Virtuoso, Access, no doubt many others).


> Mark
>
> On Sun, Mar 25, 2012 at 7:54 AM, elin<elin....@gmail.com> wrote:
>
>> the CMS queries on Access ( though it sure would be fascinating to see ) we
>> should strive to be as agnostic as possible if only because of the
>> possibility that different systems are default truncating to different
>> lengths (of course this could already be happening on different servers

I don't think the issue is that some RDBMS will truncate at different lengths,
because we're not genuinely interested in accurately grouping on those columns.
RDBMS that have a problem here simply won't truncate _at all_ and will throw and
exception when compiling the SQL.

It is not so much a feature deficiency in those RDBMS (well, it might be for
Access), it is more a case of not wanting to make it easy to do things that you
probably didn't want to be doing in the first place. If the query author
actually _does_ want to group rows in the result by a long column, he/she is
expected to write a suitable expression to reduce the problem. Grouping by
almost every column just to make the query work as one huge statement muddies
the logic, makes the code harder to maintain and runs the risk of performing
very badly.


Cheers!
-Will

Reply all
Reply to author
Forward
0 new messages