Key problems on tags table in latest svn

0 views
Skip to first unread message

Owen Winkler

unread,
May 1, 2008, 7:24:13 PM5/1/08
to habar...@googlegroups.com
For some reason or other, Habari just now started complaining about key
constraint issues on the tags table during my most recent upgrade.

michaeltwofish and I hashed through this issue on IRC last night and
came up with the following 2 SQL queries:


update tag2post, (select t1.id as from_id, t2.id as to_id from tags t1,
tags t2 where t1.tag_slug = t2.tag_slug and t1.id > t2.id) t set
tag2post.tag_id = t.to_id where t.from_id = tag2post.tag_id;

delete tags from tags left join tag2post on tag2post.tag_id = tags.id
where isnull(tag2post.tag_id);


The first query updates all posts to use only one of the two duplicate
tags. The second query removes the unused tags from the tags table.

Be sure to account for the table prefix if you apply this fix.

I'm not sure whether this is required for all installs, or even what
changed that caused this to happen, but at least three people in IRC
last night were reporting this issue, so perhaps we should consider
including it as part of the install process?

Also, for some reason, there is a single unique key on the tags table
and it's on the tag_text field. This makes no sense. It seems to me
that this index should be on the tag_slug table. Can anyone explain why
things are the way they are, or should we update things? ("update
things" can also mean "convert all of this tag stuff into a taxonomy" if
anyone feels up to it.)

Owen

Scott Merrill

unread,
May 1, 2008, 8:41:45 PM5/1/08
to habar...@googlegroups.com
Owen said:
> Can anyone explain why things are the way they are, or should we ...

> convert all of this tag stuff into a taxonomy

What are the long-term benefits of using a full taxonomy system?

What are the implementation challenges for migrating Habari to use
tags this late in the game?

Do you have some body of code already prepared to make this happen?
If not, how much work do you expect it will take from the community?

Owen Winkler

unread,
May 1, 2008, 10:23:30 PM5/1/08
to habar...@googlegroups.com
Scott Merrill wrote:
> Owen said:
>> Can anyone explain why things are the way they are, or should we ...
>> convert all of this tag stuff into a taxonomy
>
> What are the long-term benefits of using a full taxonomy system?

The long-term benefits of a full taxonomy system are manyfold.

First, we'd be able to re-implement and provide API access (which we
don't currently have) to the current "tags" vocabulary.

Second, we'd be able to implement hierarchical categories separately
from tags by enabling parent-child relationships between vocabulary terms.

Third, we'd be able to associate meaningful metadata to posts outside of
generic tagging. So in the cases where we'd tried to do things like add
tags of "displayon:homepage" to have certain posts appear only on the
homepage, we could create a vocabulary that was used for that purpose
and had specific values that a user could choose from.

> What are the implementation challenges for migrating Habari to use
> tags this late in the game?

I think it will not be too terrible.

We'll need to consider a taxonomy schema that works - a set of tables to
hold the taxonomy data. I suggest that we drop the tags-related tables
altogether as part of the conversion (after converting them), so as to
make a clean break from "tags" altogether. If things are done properly,
we should still be able to access all of the tag data via the Post
object, using an installed-by-default "tags" vocabulary.

As for plugins that do tag-cloud things, we'll need to do some things:
First, provide a taxonomy API that will allow access to "tags" and other
vocabularies so that accessing the tables directly is not part of the
standard procedure. Second, identify the plugins that will be affected
and help convert them to use the new API. These things should really
happen side-by-side.

> Do you have some body of code already prepared to make this happen?
> If not, how much work do you expect it will take from the community?

I do not personally have code to make this happen, but I suspect that a
dedicated weekend could get the basic API functional enough to replace
the tag system we have now, especially if there were people around to
code and test.

Owen

Christian Mohn

unread,
May 2, 2008, 3:42:31 AM5/2/08
to habar...@googlegroups.com
update habari__tag2post, (select t1.id as from_id, t2.id as to_id from
habari__tags t1,
habari__tags t2 where t1.tag_slug = t2.tag_slug and t1.id > t2.id) t set
habari__tag2post.tag_id = t.to_id where t.from_id = habari__tag2post.tag_id;

gives me #1062 - Duplicate entry '3-78' for key 1 ?

Rich Bowen

unread,
May 2, 2008, 6:58:33 AM5/2/08
to habar...@googlegroups.com
Yeah. It's called Drupal.

--
I have nature and art and poetry,
and if that is not enough, what is enough?
(Vincent van Gogh)



phil.gs

unread,
May 3, 2008, 1:37:09 PM5/3/08
to habari-dev
I'm glad this has come up. It's something I've been thinking about
since I first starting looking at Habari (which, granted, was about a
week ago).

Specifically, I think the current taxonomy API in WordPress is pretty
good. I've spent a good amount of time with it, so I feel like I have
a good idea of its strengths but also where it could use some
improvement. I've been thinking about the necessary schema, and I
could probably post a draft one tonight or at least this weekend
sometime. I'd also be willing to advise on a specification for the API
but I don't have any time to do any actual coding. I could probably be
talked into testing pretty easily.

Owen Winkler

unread,
May 3, 2008, 9:11:07 PM5/3/08
to habar...@googlegroups.com
phil.gs wrote:
>
> Specifically, I think the current taxonomy API in WordPress is pretty
> good. I've spent a good amount of time with it, so I feel like I have
> a good idea of its strengths but also where it could use some
> improvement. I've been thinking about the necessary schema, and I
> could probably post a draft one tonight or at least this weekend
> sometime. I'd also be willing to advise on a specification for the API
> but I don't have any time to do any actual coding. I could probably be
> talked into testing pretty easily.

I'm interested in hearing your ideas.

Idly, I was thinking about this today, and here are my embryonic schema
thoughts. Note that this is not a final concept and is open to
discussion and revision, even though I'm presenting it as "here's what
we'll do..."

We'd have three tables for taxonomy support: Vocabularies, Terms, and
TermObject

Vobabularies would be a table naming the vocabularies that are available
on the system and associating a numeric id to them. It would include a
type parameter that would allow Habari to distinguish "free"
vocabularies (one where new terms can be added at runtime), from "set"
vocabularies (one where terms are predefined). It would include a size
parameter that would specify the maximum number of terms that could be
associated to a single object, which could be null for unlimited. An
additional optional field would allow a Vocabulary to be applied to
specific things, like posts, entries, or comments.

The Terms table would associate a term, a term's slug, any parent term
it might have to a vocabulary. A term (and its slug) might appear more
than once in this table, but only once per vocabulary.

The TermsObjects table would associate terms to objects numerically.
The table would consist of two integer values corresponding to the term
id and the object id.

Using this system, we could reimplement the existing tag system by
creating a new vocabulary called "tags". The "tags" vocabulary would be
a "free" vocabulary with no limit that applies to posts.

The existing tags table would be migrated to the terms table, with the
only changes being to add a null value for the additional parent field,
and the id of the "tags" vocabulary.

The existing tags2post table would be migrated to the TermsObjects
table, pretty much intact. We'd still need to write some kind of
mapping system since we shouldn't assume that all of the id's will be
the same.


Some might prefer an additional separation of the Terms table, where its
split into a VocabulariesTerms table that holds two integers (the
vocabulary id, and a term id), and a Terms table that describes the
terms alone. This would prevent duplicate term values in the Terms
table, but add an additional level of complexity to querying for posts
by vocabulary terms. It also may be desirable to keep a single Terms
table for the purpose of allowing different slugs/terms for different
vocabularies. Not sure on this point. I'm interested in the opinion of
our resident database gurus on this topic.


It might also be worthwhile to add a flag to the vocabularies rows that
defines a vocabulary as directly user-editable. For example, if a
vocabulary is marked this way, then it could appear (in the case that it
applies to posts) in the additional fields area of the post editing
page, and automatically display appropriate options (a dropdown in the
case of a "set" type vocabulary, or a text box in the case of a "free"
type vocabulary). If this is done, the "tags" vocabulary would simply
be another of these user-editable types. For non-editable types,
plugins would either provide an interface for taxonomy access, or
manipulate the taxonomy data behind the scenes.


For those folks once again asking what all of this is for, I'll present
a more concrete example.

Assume that you are creating a site with four distinct areas to which
posts can be distributed. One might in the current system add specific
tags to a post that make it appear in a single area by virtue of a call
to Posts::get() and passing that special tag. But then those special
tags "pollute" the tag pool, potentially showing up in unexpected places.

With the taxonomy system, you'd do virtually the same thing, except
you'd define a vocabulary that represents the categorization of posts
into one of the four areas. Call that vocabulary "category" and define
it as one where each object in it can only be defined by one term among
four that are preset. That is, you will be allowed only to assign one
of four terms to any single post.

A call to Posts::get() that includes this specific taxonomy information
will return posts that qualify for that category just as with the tag
setup, but the categories will not pollute the tag system.

Also, you might define an additional vocabulary that could be used to
query for pages that make up the menu for each area. Posts without a
term defined in this vocabulary would not appear in the menu for that
category of posts.

Applying the taxonomy system would yield many benefits such as this,
which seem to provide features that people have been asking about.

Owen

phil.gs

unread,
May 4, 2008, 12:30:07 AM5/4/08
to habari-dev
On May 3, 9:11 pm, Owen Winkler <epit...@gmail.com> wrote:
> Idly, I was thinking about this today, and here are my embryonic schema
> thoughts. Note that this is not a final concept and is open to
> discussion and revision, even though I'm presenting it as "here's what
> we'll do..."

I think we independently arrived at similar positions. Here's my rough
schema, or at least a list of tables and column names:

Table: Vocabularies
* id
* name
* slug
* type - "free" or "set" as per Owen
* limit - maximum number of terms (NULL for unlimited)
* object_type - limit vocabulary to posts, comments, etc.
* active - boolean to allow user to deactivate vocabulary without
deleting/losing data

Table: terms
* id
* name
* slug
* description
* vocab_id
* parent - id of another term for hierarchical vocabularies

Table: term_info
* term_id
* key
* value

Table: term2object
* term_id
* object_id

I tried to include descriptions where the point of a column didn't
seem to be self-evident, but let me know if I need to give a better
explanation. It follows Owen's description pretty closely. The biggest
difference is the term_info table. (I think that's an appropriate name
based on the schema in the wiki. In WP, they'd call it term_meta.)
This table allows the database to capture additional information about
a term. I shall illustrate its usefulness with an example (one which
also shows why a taxonomy system is more useful than the current tags
setup).

A system like this can be used to organize a series of posts on a
topic (i.e. a multi-part post). You could do it with a tag or
category, but then you have to worry about where that tag is going to
show up. In this case, "series" is the vocabulary and the term might
be "civil war battles." We also have a plugin that displays a blurb at
the beginning of any post in a series term. There's a default blurb,
but we want to have a different one for this series on civil war
battles. So we just create an entry in term_info that specifies the
different template.

Another example might be if you had a vocabulary called "people" and
you used it to tag your friends in a photo gallery. You could then
associate additional information about each person (e.g. email
address, birthday) with that person's term.

I also think it's important for the end-user to be able to define new
vocabularies so that they can put it to any use they can envision. My
philosophy is that the more extensible your platform is, the better.

Ok, I hope this makes sense. It's been a long day. I look forward to
getting feedback from the group.

Owen Winkler

unread,
May 4, 2008, 1:24:45 AM5/4/08
to habar...@googlegroups.com
phil.gs wrote:
>
> I think we independently arrived at similar positions. Here's my rough
> schema, or at least a list of tables and column names:

Obviously, this looks good to me, too. ;)

To add to your initial take on the terminfo table, we'll need an
additional `type` field to differentiate serialized data from regular
data. A strange type of injection we're trying to avoid would involve a
plugin that stores user-entered values into info data, and a user
injects a serialized value. If we just checked to see if unserialize()
fails, then we run the risk of unserializing injected data, which is
bad. That's why we have that extra field. FYI.

As an aside to all of this: I'm looking now curiously at the table
names in the existing database, wondering why some have underscores and
some not. But I am opposed to table names with "2" in them used instead
of the word "to". I'd much prefer just the names of the two things (in
their respective plurality, depending on whether the relationship was
one-to-one, one-to-many, or many-to-many) that are being linked, ala
"groups_permissions". Call it a pet peeve.

>
> I also think it's important for the end-user to be able to define new
> vocabularies so that they can put it to any use they can envision. My
> philosophy is that the more extensible your platform is, the better.
>

I'm not opposed to this happening in the future. Right away though, I'd
simply like to build a functional schema and API. We can consider how
to build the UI for adding end-user custom data to this as we continue.

Owen

Geoffrey Sneddon

unread,
May 4, 2008, 10:00:44 AM5/4/08
to habar...@googlegroups.com

On 4 May 2008, at 05:30, phil.gs wrote:

> * parent - id of another term for hierarchical vocabularies

Creating multi-dimensional trees using parent IDs is actually rather
inefficient: the deeper you get, the more queries you need to fetch
the tree. See <http://www.sitepoint.com/print/hierarchical-data-
database> and <http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
>, which both discuss another way (the nested set model) to store
hierarchical data.


--
Geoffrey Sneddon
<http://gsnedders.com/>

Michael C. Harris

unread,
May 4, 2008, 10:35:31 PM5/4/08
to habar...@googlegroups.com
My contribution to this discussion is:

taxonomy++

and

On Sun, May 04, 2008 at 01:24:45AM -0400, Owen Winkler wrote:
> As an aside to all of this: I'm looking now curiously at the table
> names in the existing database, wondering why some have underscores and
> some not. But I am opposed to table names with "2" in them used instead
> of the word "to". I'd much prefer just the names of the two things (in
> their respective plurality, depending on whether the relationship was
> one-to-one, one-to-many, or many-to-many) that are being linked, ala
> "groups_permissions". Call it a pet peeve.

+1

--
Michael C. Harris, School of CS&IT, RMIT University
http://twofishcreative.com/michael/blog

Rich Bowen

unread,
May 5, 2008, 7:53:44 AM5/5/08
to habar...@googlegroups.com

On May 4, 2008, at 01:24, Owen Winkler wrote:

I also think it's important for the end-user to be able to define new
vocabularies so that they can put it to any use they can envision. My
philosophy is that the more extensible your platform is, the better.


I'm not opposed to this happening in the future.  Right away though, I'd 
simply like to build a functional schema and API.  We can consider how 
to build the UI for adding end-user custom data to this as we continue.


Yes, flexible is good, but the manner in which we expose this to the end-user (as opposed to the plugin developer) can either be useful, or monstrously confusing. Hence my earlier remark about Drupal. It's ok now, but in the earlier days, the terminology used in the UI was so completely baffling that it was unusable.

So, let's make sure that things are robust and well-documented before we start telling the end-user to do it themselves.


--
"That's what being alive IS ... It's being badly prepared for everything!
Because you only get one chance ... You only get one chance and then you die
and they don't let you go round again after you've got the hang of it!"
_The Bromeliad Trilogy_, Terry Pratchett


Reply all
Reply to author
Forward
0 new messages