Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

How best to detect duplicate values in a column?

6 views
Skip to first unread message

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 21, 2009, 12:08:59 PM12/21/09
to
I am building a PHP/MySQL application that lets users vote for
answers to survey questions, in order to determine the most popular
answers to each question. There's a table of users (each with
autoincr userID), a table of survey questions (each with autoincr
questID, and with varchar title), a table of answers for the
questions (each with link to questID, and with autoincr ansID, and
with varchar title), and a table of votes (each with links to
userID and ansID).

In the course of developing the code, I might accidently insert the
same-titled question or the same-titled answer twice. I want to
write an audit tool that checks whether any two different-ID
questions have exactly the same title. (Later I also want to write
an audit tool that checks whether any two different answers have
exactly the same title.) How best to do it?
- Join the table with itself, looking for matching title but
different questIDs. If resultset is empty there's no problem,
else deal with it.
- Subquery the table grouped by title, count each group, select
only when count>1. If resultset is empty there's no problem,
else deal with it.
- Any other way (other than doing all the work in PHP) that would work?

Note that "deal with it" means showing me the rest of the info for
the same-titled questions and allowing me to pick which of them to
delete from the table. I don't want to simply re-declare the table
as having that column forbidding duplicates and have MySQL decide
for itself to delete all but one of the duplicates, which might
delete the one I want and keep another that I don't want.

Banana

unread,
Dec 21, 2009, 1:06:32 PM12/21/09
to
Robert Maas, http://tinyurl.com/uh3t wrote:
> In the course of developing the code, I might accidently insert the
> same-titled question or the same-titled answer twice. I want to
> write an audit tool that checks whether any two different-ID
> questions have exactly the same title. (Later I also want to write
> an audit tool that checks whether any two different answers have
> exactly the same title.) How best to do it?

Why not create one more table to contain survey definition? That way you
can then create a UNIQUE index across the question IDs per survey and
thus enforce that every survey never will have duplicate questions?

It also will make it easier for you to figure which survey a given user
has answered by having the answer table reference the survey ID /
question ID rather than question ID alone.

Jerry Stuckle

unread,
Dec 21, 2009, 1:30:01 PM12/21/09
to
Robert Maas, http://tinyurl.com/uh3t wrote:
> I am building a PHP/MySQL application that lets users vote for
> answers to survey questions, in order to determine the most popular
> answers to each question. There's a table of users (each with
> autoincr userID), a table of survey questions (each with autoincr
> questID, and with varchar title), a table of answers for the
> questions (each with link to questID, and with autoincr ansID, and
> with varchar title), and a table of votes (each with links to
> userID and ansID).
>

Why do you have title in both survey questions and survey answers?

> In the course of developing the code, I might accidently insert the
> same-titled question or the same-titled answer twice. I want to
> write an audit tool that checks whether any two different-ID
> questions have exactly the same title. (Later I also want to write
> an audit tool that checks whether any two different answers have
> exactly the same title.) How best to do it?
> - Join the table with itself, looking for matching title but
> different questIDs. If resultset is empty there's no problem,
> else deal with it.
> - Subquery the table grouped by title, count each group, select
> only when count>1. If resultset is empty there's no problem,
> else deal with it.
> - Any other way (other than doing all the work in PHP) that would work?
>
> Note that "deal with it" means showing me the rest of the info for
> the same-titled questions and allowing me to pick which of them to
> delete from the table. I don't want to simply re-declare the table
> as having that column forbidding duplicates and have MySQL decide
> for itself to delete all but one of the duplicates, which might
> delete the one I want and keep another that I don't want.

We need more information to tell.

First of all, is it legitimate to have the same title for two
different questions? If so, is this going to be a common occurrence?
i.e. is your title something like "Favorite Food" and you want them them
to select from a list. One list might have veggies, another fruits.

If you are going to have a limited set of titles with multiple questions
per title, put your titles in a separate table. If titles are going to
be pretty much unique to the question, place the titles in the same
table as the questions themselves.

Then - simply create a unique index on the title id or title (depending
on if you have a separate table or not, respectively) and the question
itself.

Of course, this will only work if the two titles and questions are
absolutely identical. Any difference, including an extra space, a comma
instead of a semicolon, etc., will show up as a different question. But
that will be true for any simple tool you use.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstu...@attglobal.net
==================

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 22, 2009, 1:47:30 PM12/22/09
to
> From: Banana <Ban...@Republic.com>

> Why not create one more table to contain survey definition?

I don't know what you mean by "survey definition". I already have a
table that lists all the survey questions:
$ctcmd="CREATE TABLE surveys(
id int(8) NOT NULL auto_increment,
shortitl varchar(20) default NULL,
longtitl varchar(70) default NULL,
PRIMARY KEY (id),
UNIQUE id (id)
)";
Each suruey question is independent of any other, except that
question#1 is the meta-question of what survey question you'd most
like to answer. There is (at present) no intermediate structure
such as groups of related questions. If and when there are a large
number of different questions, I plan to set up a way to "index"
them via keywords, and have a search engine for the user to ask for
questions that mention a particular keyword, and a nearest-neighbor
browser. The nearest-neighbor graph will involve an additional
table showing pairs of survey questions which are directly linked.
But that's all in the distant future, after I get at least ten
active users who submit at least thirty different questions.

So what type of data do you propose would be in a "survey
definition" table??

> That way you can then create a UNIQUE index across the question
> IDs per survey and thus enforce that every survey never will have
> duplicate questions?

The word "survey" as used in the name of the table defined above is
in some way a misnomer. It really means a single question of the
survey type. But "surveyquestion" is too long a name for a table,
and "question" is totally ambiguous because I already have Turing
questions used to protect against spambots and to provide real-live
users with small amounts of credit, and I am already planning to
implement "truth futures" whereby claims are evauated as to their
truth value hence are in effect questions "is this true?", and also
I am planning to impleent several kinds of flashcard questions,
including personal advertisements, ESL, foreign languages, and
per-user security questions, so having a table just called
"questions" would be absurd if it really meant survey questions and
not all those other types of questions.

The word "survey" as I use it in the phrase "survey questions" is
an adjective, not a noun. It specifies that the question is for the
purpose of polling popularity, i.e. surveying my users to determine
which answer to the question is most popular. Survey questions are
thus discriminated from Turing questions, ESL and foreign language
questions, advertising slogan questions (expecting the user to
memorize a particular brand name), etc. There is no such thing as a
"survey" per se, only questions of the "survey" type (and other
questions of several other types).

> It also will make it easier for you to figure which survey a
> given user has answered

That makes no sense in the context of my system. The user can do
two things:
- At present: Answer the meta question of what survey he/she would
like to include as an actual survey question. Thus answers
(short titles and one-line answers) to the meta question will
become title and one-line-summary of new questions.
- If and when any such meta-answer gets enough votes, so that it
gets promoted to a new survey question: Answer that new question.
The table of survey investments tells how much funds each user has
invested in each answer. Grouping these per answer and adding all
the investments tells the score for each answer. Grouping the
answer scores per question gives a display for any particular
question showing how the various answers compare. At present, it
doesn't matter to me whether a user has looked at a particular
question and decided not to invest in any of its answers, or
whether the user hasn't even looked at the question (with its set
of answers) yet. Perhaps at some later time when there are more
than thirty different questions, I'll add a new table that keeps
track of each user's policy per each question, such as which
questions a particular user never wants to see again, or for which
questions the user wants to get an automatic alert whenever anyone
else invests in that question, or which questions should be at the
very top of the display, or what sort function should be used to
put the questions in sequence on that user's display (such as
sorting most popular first, or least popular first, or newest
questions first, etc.).

> by having the answer table reference the survey ID / question ID
> rather than question ID alone.

There's no such thing as a "survey", hence no such thing as a
"survey ID". There's only a survey-type-of-*question*
(distinguished from other types of questions, such as Turing ESL
etc. elsewhere), hence a (survey-)question ID here (and
turing-question ID elsewhere).

Aside: Currently there are only two types of Turing questions, a
single canned (hardcoded in PHP) question for restoring inactive
accounts whose credit has dropped below zero, and a table of
optional Turing questions that are all of the missing-word format:
$ctcmd="CREATE TABLE missing1(
id int(8) NOT NULL auto_increment,
pretext varchar(255) NOT NULL,
misstext varchar(255) NOT NULL,
posttext varchar(255) NOT NULL,
value int(3) NOT NULL,
PRIMARY KEY (id),
UNIQUE id (id)
)";
This same table of missing-word questions might also be used for
ESL lessons at some time in the future, hence the name mentionning
"missing" but not mentionning "turing".

Note: You didn't even attempt to answer my original question: Which
type of SQL query is best for detecting exactly-duplicate text in
the "shortitl" column of the "surveys" (i.e. survey-questions)
table, or in the "surans" (i.e. survey-answers) table as defined
here:
$ctcmd="CREATE TABLE surans(
id int(8) NOT NULL auto_increment,
surtab int(2) default 0,
idsur int(8) default 0,
shortitl varchar(20) default NULL,
linedesc varchar(70) default NULL,
fulldesc varchar(255) default NULL,
PRIMARY KEY (id),
UNIQUE id (id)
)"; }

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 22, 2009, 3:48:58 PM12/22/09
to
> From: Jerry Stuckle <jstuck...@attglobal.net>

> Why do you have title in both survey questions and survey answers?

Title in each answer is what will fit on a one-inch cell-phone screen
alongside the total vote (investment) and user's personal investment.
That title is a clickable link (a href=url?querystring) to get a
screen with the one-line description and full description too.
The title is long enough to be a reasonable clue as to the meaning,
and a reminder of the full meaning if the user has already seen the
one-line and full descripitions at least once, but is short enough
to crowd onto a single line of a one-inch screen, and not overwealm
users with verbosity on larger screen, and allow user on large
screen to shrink the window to a 2-inch-wide column to avoid
filling the whole screen thus allow the rest of the screen to be
used for other browsing at exactly the same time.

Title in the question has multiple purposes:
- It's shown at the top of the question-plus-answers display.
- It will be used as a virtual link between a meta-answer and the
corresponding promoted-to-question. Title will be copied exactly
from original meta-answer to promoted-to-question.

Here's the current form (including question+answers display
mentionned above) which the user sees when looking at the meta
question with its set of meta answers:
S.questions
TOTL SELF invests
1.12 .120 [Features]
.002 .001 [Worst TV ads]
.002 .001 [Best ST-TOS]
.002 .001 [BestOfNewEco]
.301 .000 [best TV host]
____________ [Nominate new answer]
______________________________________________________________________
______________________________________________________________________
The textfield to the left of the nominate button is for the title,
textfield just below the button is for the single-line description,
and final textfield is for the full description (up to 255
characters, plenty of room for passing notes back and forth between
this ordinary user who has nominated a new answer and the master
user acting as censor).

If this particular user clicks on the [Features] link, he sees:
Question#1 {S.questions} = {All survey questions}
Answer#1 {Features}
-line {What features would you most like installed in NewEco?}
-fullDescr {The more investment users make in expressing preference for
a particular feature, the more likely the NewEco-master will implement
it.}
User#2 has invested 120 milliseconds already.
You can adjust your investment to anywhere in the range 1 to 6095 ms.
Use the form to withdraw funds, or deposit funds.
1 < 120_____ < 6095 [New investment (ms)]
Note that the one-line description of the question, as well as the
one-line and full description of this particular answer, are shown
here, in addition to the short titles of each which already
appeared on the previous screen.

Alternatively, if this user clicks on the [best TV host] link, he sees:
Question#1 {S.questions} = {All survey questions}
Answer#5 {best TV host}
-line {Who is the best TV talk-show host?}
-fullDescr {}
User#2 has not yet invested here.
You can adjust your investment to anywhere in the range 0 to 5971 ms.
Use the form to increase your investment, or use back button to stay at
the minimum.
0 < 0_______ < 5971 [New investment (ms)]


> In the course of developing the code, I might accidently insert the
> same-titled question or the same-titled answer twice. I want to
> write an audit tool that checks whether any two different-ID
> questions have exactly the same title. (Later I also want to write
> an audit tool that checks whether any two different answers have
> exactly the same title.) How best to do it?
> - Join the table with itself, looking for matching title but
> different questIDs. If resultset is empty there's no problem,
> else deal with it.
> - Subquery the table grouped by title, count each group, select
> only when count>1. If resultset is empty there's no problem,
> else deal with it.
> - Any other way (other than doing all the work in PHP) that would work?
>
> Note that "deal with it" means showing me the rest of the info for
> the same-titled questions and allowing me to pick which of them to
> delete from the table. I don't want to simply re-declare the table
> as having that column forbidding duplicates and have MySQL decide
> for itself to delete all but one of the duplicates, which might
> delete the one I want and keep another that I don't want.

We need more information to tell.

> First of all, is it legitimate to have the same title for two
> different questions?

No, but in the course of developing the code I might execute the
very same MySQL INSERT statement more than one time until it get it
right. For ease of debugging one thing at a time, I want the
ability to mistakenly put more than one answer with the same title,
rather than having the database refuse to execute my code because
the title isn't unique. Then after I get that code working, I want
the ability to audit the table to detect if I have multiple
questions by the same title, and choose which is correct and delete
all the rest.

> If so, is this going to be a common occurrence?

Any time I'm changing my PHP code and might accidently create two
answers by the same title (or two questions by the same title), and
need to clean up my spazzes afterward. **RULE** of software: No
program is ever "complete". Every program is open to being changed
from time to time as circumstances warrant, as users request new
features, as the developer thinks of things to change for whatever
reason, so tolerance of mistakes during upgrading the code is a
good thing.

> i.e. is your title something like "Favorite Food"

That is more than 12 characters, so wouldn't be allowed as-is,
would have to be shortened, such as "FavoriteFood". The one line
description would then be "What is your favorite food?" or "What
food do you most like to eat?" etc. So the short answer is YES,
"like" "Favorite Food".

> and you want them them to select from a list.

Initially, the list is empty, and the user nominates a new answer.
If that answer is approved by the censor, then it gets approved,
and appears in the list to select from.

> One list might have veggies, another fruits.

If the question was favorite foods, that would include them all. If
that general quetion was instead broken into multiple more-specific
questions, favorite veggies, favorite fruits, favorite candies,
favorite desserts, favorite meats, favorite meat substitutes, etc.,
then each would be a separate question, each individually nominated
(proposed) by some user and all approved by the master user
(or eventually by a "reverse tree" consisting of trusted users,
sort of how Yahoo! Groups allow more than one "moderator" despite
only one of them the "owner" of the Group, except that in a
"reverse tree" each such moderator has only limited authority, and
a tree-concensus is needed to achieve final determination),
then each would be a separate question.

> If you are going to have a limited set of titles with multiple
> questions per title,

No, each question-title refers to a single question.

> If titles are going to be pretty much unique to the question,
> place the titles in the same table as the questions themselves.

Yes, that's the current design. Here's the table definition again:


$ctcmd="CREATE TABLE surveys(
id int(8) NOT NULL auto_increment,
shortitl varchar(20) default NULL,
longtitl varchar(70) default NULL,
PRIMARY KEY (id),
UNIQUE id (id)
)";

Note that each row in this table is one (1) question (of the survey
type, as opposed to the Turing type or ESL type etc., see my
response earlier to the other person who responded to my OP in this
thread for discussion of why the table is called "surveys" instead
of "questions".
<http://groups.google.com/group/comp.databases.mysql/msg/a66ddb2a0a04cedc?hl=en&dmode=source>
= Message-ID: <REM-2009...@Yahoo.Com>

> Then - simply create a unique index on the title id or title
> (depending on if you have a separate table or not, respectively)
> and the question itself.

That's only for efficiency, if it's too slow without an index, right?

> Of course, this will only work if the two titles and questions
> are absolutely identical.

Correct. Ordinary user code will get a bug trap if it's looking for
a question per some given (exact) title and it gets more than one
row in the result set, and I want my audit utility to make sure
that never happens (except in the middle of debugging new code
where execuating the same or very similar INSERT commands more than
once creates a temporary not-quite-valid situation. Hopefully this
inconsistent state will occur *only* in my private deveopment
hosting site, not in my primary online hosting site, so I'm the
only user who will ever see the bug trap. But on the chance that I
might need an emergency repair on a live site, and in the course of
the repair I might create a duplcate row in a table, I want a fully
automated audit utility I can run once at the end of such a session
just to make sure the situation is valid for regular users to see
later.

> Any difference, including an extra space, a comma instead of a
> semicolon, etc., will show up as a different question. But that
> will be true for any simple tool you use.

Yes. The primary use of the title where exact match is actually
*used* will be after an answer to the meta question (what survey
questions would you like) has been promoted to an actual question
(here's the question, what answers would you like for it). Somebody
starting from the meta question will see a link to the meta answer,
and upon clicking on it will *not* see a form for investing in that
meta-answer directly (see examples obove), but instead will see
that meta-answer as an actual question, and will see something like
this (manually synthesized here, not an actual screen-image):
Features
TOTL SELF invests
28.2 10.0 [TruFut]
7.24 1.00 [FilJob]
.101 .001 [RevTre]
.002 .001 [ESL]
____________ [Nominate new answer]
______________________________________________________________________
______________________________________________________________________
The code to map across from meta-answer to regular-question is via:
SELECT ... from surveys where shortitl = $metaans
Since this is done once per user keyclick from an answer within the
meta-question+answers form to the
one-meta-answer-promoted-to-regular-question form, there's no need
(at present) for it to be super-efficient by making an index for
surveys.shortitl.

So given that this is *only* for an audit tool to use during or
after a new-code-development session, used *only* by the master
user, so it doesn't have to be super efficient, just nice and clean
and fullproof, do you recommend joining table to itself, something
like (untested code):
SELECT surveys as s1, surveys as s2
WHERE s1.shortitl = s2.shortitl
AND s1.id < s2.id
which will return all pairs of different records with same title,
thus (n-1)*n/2 rows in result set for each group of n copies?

Or something involving COUNT of GROUP BY which would be much more
complicated for me to learn how to write but return just a single
resultset row for each group whose count was greater than 1? I've
never written anything like this, so this is almost surely not
correct code, but a wild stab at the idea:
SELECT * from
(SELECT COUNT(GROUP) as cnt
FROM (SELECT id,shortitl FROM surveys GROUP BY shortitl))
WHERE cnt > 1

Or maybe something closer to what I already know, again untested code:
SELECT cnt, skey.shortitl
FROM (SELECT count(sdata.id) as cnt ;;Count of how many rows per each title
FROM surveys as sdata ;;Multiple rows for each title
, ;;JOIN between sdata and unique titles
(SELECT UNIQUE shortitl FROM surveys as skey) ;;One row per different title
WHERE sdata.shortitl = skey.shortitl)
WHERE cnt > 1
Hey, that might actually be correct code already!
If nobody finds a coding mistake in a day or two,
I might copy it to a test PHP script and give it a royal try!

Or (UGH!!) list the entire table with SORT BY and have PHP do the
work of reading one row at a time from the resultset and detecting
adjacent rows of result set with same title?
That would work today, but wouldn't scale well when I have millions
of users and tens of thousands of different survey questions,
nevermind millions of different answers they've contributed to the
survey-answers table.

*NOT* list the entire table in default order and store it all in a
PHP array and then have PHP sort it into alphabetical order by
shortitl and then check for duplicates in adacent entries, wouldn't
scale well even with mere hundreds of users and corresponding tens
or hundreds of questions and thousands of answers. Double-UGH on that!!

Or something else entirely that I'm not even thinking MySQL would
provide? Like can I set an error-condition catcher around an
attempt to build an index, and if there's a duplicate then the
index-build fails, and reports back the duplicate key it attempted
to build, and then my code does a SELECT on the troublesome key to
learn what rows shared that key?

Gordon Burditt

unread,
Dec 22, 2009, 7:04:14 PM12/22/09
to
>I am building a PHP/MySQL application that lets users vote for
>answers to survey questions, in order to determine the most popular
>answers to each question. There's a table of users (each with
>autoincr userID), a table of survey questions (each with autoincr
>questID, and with varchar title), a table of answers for the
>questions (each with link to questID, and with autoincr ansID, and
>with varchar title), and a table of votes (each with links to
>userID and ansID).

>In the course of developing the code, I might accidently insert the
>same-titled question or the same-titled answer twice. I want to
>write an audit tool that checks whether any two different-ID
>questions have exactly the same title. (Later I also want to write
>an audit tool that checks whether any two different answers have
>exactly the same title.) How best to do it?
>- Join the table with itself, looking for matching title but
> different questIDs. If resultset is empty there's no problem,
> else deal with it.

I like this idea. Be sure that there is an index (non-unique, since
you want to allow duplicates to exist, then fix them) on the title
field.

>- Subquery the table grouped by title, count each group, select
> only when count>1. If resultset is empty there's no problem,
> else deal with it.

I think this puts too much into the potential result set until you
get to the HAVING count > 1 part. (That is, it's likely to be slow.)
This might work better, though, if you often have *massive* duplication
to clean up.

>- Any other way (other than doing all the work in PHP) that would work?

>Note that "deal with it" means showing me the rest of the info for
>the same-titled questions and allowing me to pick which of them to
>delete from the table. I don't want to simply re-declare the table
>as having that column forbidding duplicates and have MySQL decide
>for itself to delete all but one of the duplicates, which might
>delete the one I want and keep another that I don't want.

As far as duplicate possible answers go, you'd need to find duplicates
of the pair (quesID, answer_title). I think this is also best done
with the self-join. (Note: due to some legal oddities relating
to the elections in Florida, in all multiple-choice and true-false
questions in Florida, it is required to present "Al Gore" as a
choice.) There are probably a lot of questions with the same group
of answers, such as True/False/Al Gore or Agree/Disagree/Al Gore
or Strongly Agree/Agree/Neutral/Disagree/Strongly Disagree/Al Gore.

Jerry Stuckle

unread,
Dec 22, 2009, 8:29:55 PM12/22/09
to
Robert Maas, http://tinyurl.com/uh3t wrote:
>> From: Jerry Stuckle <jstuck...@attglobal.net>
>> Why do you have title in both survey questions and survey answers?
>
> Title in each answer is what will fit on a one-inch cell-phone screen
> alongside the total vote (investment) and user's personal investment.

<snip>

I don't care about how it is displayed. That has nothing to do with the
database design.

If it isn't right, you should either UPDATE or DELETE the bad one.

And if you don't want more than one title in the final result, then you
shouldn't allow it in development, either.

Auditing the table later is the wrong way to go.

>> If so, is this going to be a common occurrence?
>
> Any time I'm changing my PHP code and might accidently create two
> answers by the same title (or two questions by the same title), and
> need to clean up my spazzes afterward. **RULE** of software: No
> program is ever "complete". Every program is open to being changed
> from time to time as circumstances warrant, as users request new
> features, as the developer thinks of things to change for whatever
> reason, so tolerance of mistakes during upgrading the code is a
> good thing.
>

If that's not valid, then by definition your database is inconsistent.
Don't allow it in the first place. This has nothing to do with the code
being changed. It has everything to do with maintaining data validity -
which is completely separate.

>> i.e. is your title something like "Favorite Food"
>
> That is more than 12 characters, so wouldn't be allowed as-is,
> would have to be shortened, such as "FavoriteFood". The one line
> description would then be "What is your favorite food?" or "What
> food do you most like to eat?" etc. So the short answer is YES,
> "like" "Favorite Food".
>

I don't care if it's 12 or 1200 characters. It's a SAMPLE! Don't get
caught up in irrelevant details!

So what you're saying is you CAN have two questions with the same title?

>> and you want them them to select from a list.
>
> Initially, the list is empty, and the user nominates a new answer.
> If that answer is approved by the censor, then it gets approved,
> and appears in the list to select from.
>

So, what's the problem?

>> One list might have veggies, another fruits.
>
> If the question was favorite foods, that would include them all. If
> that general quetion was instead broken into multiple more-specific
> questions, favorite veggies, favorite fruits, favorite candies,
> favorite desserts, favorite meats, favorite meat substitutes, etc.,
> then each would be a separate question, each individually nominated
> (proposed) by some user and all approved by the master user
> (or eventually by a "reverse tree" consisting of trusted users,
> sort of how Yahoo! Groups allow more than one "moderator" despite
> only one of them the "owner" of the Group, except that in a
> "reverse tree" each such moderator has only limited authority, and
> a tree-concensus is needed to achieve final determination),
> then each would be a separate question.
>

That is what I'm asking. Can you have multiple questions with the same
title? First you say no - but then your responses indicate maybe the
answer is yes. Which is it?

>> If you are going to have a limited set of titles with multiple
>> questions per title,
>
> No, each question-title refers to a single question.
>

Are you sure about that? If so, you shouldn't have the title in two
different tables.

>> If titles are going to be pretty much unique to the question,
>> place the titles in the same table as the questions themselves.
>
> Yes, that's the current design. Here's the table definition again:
> $ctcmd="CREATE TABLE surveys(
> id int(8) NOT NULL auto_increment,
> shortitl varchar(20) default NULL,
> longtitl varchar(70) default NULL,
> PRIMARY KEY (id),
> UNIQUE id (id)
> )";
> Note that each row in this table is one (1) question (of the survey
> type, as opposed to the Turing type or ESL type etc., see my
> response earlier to the other person who responded to my OP in this
> thread for discussion of why the table is called "surveys" instead
> of "questions".
> <http://groups.google.com/group/comp.databases.mysql/msg/a66ddb2a0a04cedc?hl=en&dmode=source>
> = Message-ID: <REM-2009...@Yahoo.Com>
>

That is not what you indicated in your original post. To quote you:
"...a table of survey questions (each with autoincr
questID, and with ***varchar title***), a table of answers for the


questions (each with link to questID, and with autoincr ansID, and

with ***varchar title***)"

Which is it?

>> Then - simply create a unique index on the title id or title
>> (depending on if you have a separate table or not, respectively)
>> and the question itself.
>
> That's only for efficiency, if it's too slow without an index, right?
>

No, learn about unique indexes.

>> Of course, this will only work if the two titles and questions
>> are absolutely identical.
>
> Correct. Ordinary user code will get a bug trap if it's looking for
> a question per some given (exact) title and it gets more than one
> row in the result set, and I want my audit utility to make sure
> that never happens (except in the middle of debugging new code
> where execuating the same or very similar INSERT commands more than
> once creates a temporary not-quite-valid situation. Hopefully this
> inconsistent state will occur *only* in my private deveopment
> hosting site, not in my primary online hosting site, so I'm the
> only user who will ever see the bug trap. But on the chance that I
> might need an emergency repair on a live site, and in the course of
> the repair I might create a duplcate row in a table, I want a fully
> automated audit utility I can run once at the end of such a session
> just to make sure the situation is valid for regular users to see
> later.
>

You're going about it all wrong. Maintain data integrity in the database.

Which again is presentation - not database design.


The bottom line is - learn about database normalization and how to
maintain data integrity at the database level. Presentation of the data
is completely irrelevant.

You don't even know the basics of database design, much less program
design. You have a long ways to go.

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 23, 2009, 7:01:36 PM12/23/09
to
> From: gordonb.gu...@burditt.org (Gordon Burditt)

> I like this idea. Be sure that there is an index (non-unique, since
> you want to allow duplicates to exist, then fix them) on the title
> field.

Is that necessary only for efficiency? Would the basic MySQL stuff
still work without an index, but slower? Since I'd be doing this
audit task only at times when I've been developing code that might
have accidently added extra records with duplicate titles, and I
want to clean up that mess, it seems to me that high speed isn't
necessary.

> As far as duplicate possible answers go, you'd need to find
> duplicates of the pair (quesID, answer_title).

Hmm, at present, as I develop the code, there's only one question,
the meta-question (which question would like to install as
a survey question), so at present quesID is always 1. But later when
there are *regular* questions available, if two *different* questions
have a same-titled answer then I guess that's not a problem. So I think
you are correct and I need to amend my original question: I want to
detect pairs of answer records such that:
- QuestionIDs are the same (added condition now)
- Titles are the same
- IDs are different, i.e. they are different records

For duplicate *questions* with same title, the original spec is correct.

> I think this is also best done with the self-join.

Let me see if I can write the correct MySQL queries for each case now:

For duplicate answers within same question:
SELECT surans as s1, surans as s2
WHERE s1.shortitl = s2.shortitl
AND s1.surtab = s2.surtab
AND s1.idsur = s2.idsur
AND s1.id < s2.id

For duplicate questions:


SELECT surveys as s1, surveys as s2
WHERE s1.shortitl = s2.shortitl
AND s1.id < s2.id

In each case, for each group of n duplicates there would be n*(n-1)
rows in the resultset. With n typically in the range 5 to 10, that
means 20 to 5000 rows. But it occurs to me that really I only need
to know whether there is or is not such a duplicated, and in the
case of a duplicate I only need to know one (1) such pair, so that
I can then fetch the shortitl from that one row and use it to pull
up all the n table rows for that one shortitl and select all but
one to get rid of, and then run the self-join query again to see if
any *other* shortitl comes up as also duplicated somewhere. It
would be too messy to present menus for more than one title at a
time and try to select all but one of each group, so I really want
to clean up just one title at a time, so LIMIT to just one row in
resultset ought to be just what I need. Would such a LIMIT cause
MySQL to stop searching as soon as it finds that one duplicate
pair, *rather* than generate the complete set of n*(n-1) matches
and *then* trim it down to just one, thus it would be reasonably
efficient? So I should add this clause to each of the above
queries
LIMIT 1
right?

> There are probably a lot of questions with the same group of
> answers, such as True/False/Al Gore or Agree/Disagree/Al Gore
> or Strongly Agree/Agree/Neutral/Disagree/Strongly Disagree/Al Gore.

Yeah. See the extra clauses in the surans self-join above. Since
the investment table shows user's investments in the answer record
per se, not in some sort of virtual question-answer pair, this will
require complete duplicate set of answers for each question with
those possible answers. However since I generally plan surveys with
free-form answers like "What do you want to do in Iraq?" (Leave
immediately, drop hydrogen bombs on all cities, spread anthrax to
all villages, fly drones that shoot all vehicles larger than a
bicycle, etc.), rather than standard agree/disagree sorts of
responses, I don't see lots-of-duplicate-sets-of-answers as a
serious problem. For agree/disagree sorts of questions about
matters of fact, such as asking whether people agree there are WMD
in Iraq, I'll mostly use truth-futures instead. For questions about
how strongly people feel about some issue, instead of weak/strong
scale, I'll use amount of lebor they are willing to invest in some
answer as guage of their interest, on the philosophy that actions
speak louder than words in judging serious interest. Thus instead
of asking "agree/disagree that we should bomb Baghdad", I'll ask
what to do in Iraq and see how much investment there is total in
the bomb-Baghdad answer, effectively in competition (for funds)
with other answers to the same question and also answers to other
questions, thus votes (labor investments) determine both which
questions seem most important to my users and which answers to
those questions are preferred by my users.

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 23, 2009, 7:45:10 PM12/23/09
to
> >> Why do you have title in both survey questions and survey answers?
> > Title in each answer is what will fit on a one-inch cell-phone screen
> > alongside the total vote (investment) and user's personal investment.
> From: Jerry Stuckle <jstuck...@attglobal.net>

> I don't care about how it is displayed. That has nothing to do
> with the database design.

You asked why I have that field in that record. I answered that
it's because I want to display it (and I have a little bit of extra
info to explain why I have separate fields for title and one-line
description and full description in answer records). Basically
every piece of text that must be displayed differently for each
record, must be **in** each record, so that's why it's in the
record, because I want to be able to display it. Maybe I gave TMI,
but still I did answer your question *why* I have the field in the
first place.

Would you have preferred that I just say:
"The title for each record is different, so that's why the title is
a field in each record."
If you don't think the title of each record should be in the
record, then where do you think a record's title should be stored??

> If it isn't right, you should either UPDATE or DELETE the bad one.

That's the purpose of the audit utility, to show me where I have a
problem of more than one record with same short title, and generate
a HTML form showing which records I might want to DELETE, and then
I check which ones I want to immediately delete, and click the
submit button to do it.

> And if you don't want more than one title in the final result,
> then you shouldn't allow it in development, either.

I disagree. During development, I prefer to be able to re-test the
same code until I get it correct, and not have to delete the result
of each test before I can run the next test. Then when I am sure I
have it correct, I then clean up whatever duplicate records the
various tests created that were more than really wanted in the end.

> Auditing the table later is the wrong way to go.

I disagee. My style of code deveopment is different from yours, and
your way of doing things doesn't apply to mine.

> If that's not valid, then by definition your database is inconsistent.

That's fine by me during rapid development, to keep things simple,
to allow me a natural way of developing/testing new code. It's like
having more than one dirty dish in the sink. Ultimately you want
*all* your dishes to be clean, but it's not absolutely necessary to
immediately wash each dish after it gets dirty. It's sufficient to
prepare and eat a meal and create a whole pile of dirty dishes, and
*not* bother to clean them yet, and then later when the meal is
done and you have some extra energy you go wash a batch of dishes.

> Don't allow it in the first place.

That's too much trouble to program from the start. It takes one
line of code to INSERT a record. It takes *several* lines of code
to INSERT a record but first DELETE anything that it might conflict
with.

> So what you're saying is you CAN have two questions with the same title?

Only temporarily while debugging code to INSERT a new record in the
table of questions. (Likewise two answers with the same title
temporarily while debugging code to INSERT a new record in the
table of answers.) And usually only on the devleopment site, not on
the main online site.

> > Initially, the list is empty, and the user nominates a new answer.
> > If that answer is approved by the censor, then it gets approved,
> > and appears in the list to select from.
> So, what's the problem?

During development/debugging, I might execute the code to INSERT a
row more than once, and after I'm sure the code is correct I want
to get rid of all the copies of that row (or similar rows) that I
don't want to *remain* in the table permanently.

> Can you have multiple questions with the same title?

Only for a few minutes while I'm trying to get the code to work.
Then I want to delete all but one duplicates.

> you shouldn't have the title in two different tables.

There's one title in the question table, namely the meta-question,
which isn't in the answer table, because it's the **root** of the
tree, it has no ancestor.

There are lots of titles in the answers table which aren't in the
questions table, either because they haven't yet been promoted to
be questions, or because they aren't answers to the meta-question
so they will *never* be promoted to questions.

With one qustion title which is never an answer title, and lots of
answer titles that aren't question titles, the titles must be in
two different places. (Yeah, I could have a third table that is
just titles and nothing else, and reference from both questions and
answers to that table, thus sharing the text between questions and
answers. But since the short titles are limited to 12 characters
each, and only a very small fraction of the different titles are
shared between both tables anyway, I don't see how any extra
efficiency is obtained that is worth the extra trouble to normalize
the tables to such an extreme degree. COPYing the title from one
table to the other in those few cases of shared title seems
reasonably efficient.)

> To quote you:
> "...a table of survey questions (each with autoincr questID, and
> with ***varchar title***), a table of answers for the questions
> (each with link to questID, and with autoincr ansID, and with
> ***varchar title***)"
> Which is it?

Both: Each question has a short title, and each answer has a short title.
For example (truncated):


S.questions
TOTL SELF invests
1.12 .120 [Features]
.002 .001 [Worst TV ads]

The question title is shown in the first row, and the answer titles
are clickable links in the last two rows. (Second row has column
headers for total investments and investment by self.)

Jerry Stuckle

unread,
Dec 23, 2009, 9:49:45 PM12/23/09
to
Robert Maas, http://tinyurl.com/uh3t wrote:
>>>> Why do you have title in both survey questions and survey answers?
>>> Title in each answer is what will fit on a one-inch cell-phone screen
>>> alongside the total vote (investment) and user's personal investment.
>> From: Jerry Stuckle <jstuck...@attglobal.net>
>> I don't care about how it is displayed. That has nothing to do
>> with the database design.
>
> You asked why I have that field in that record. I answered that
> it's because I want to display it (and I have a little bit of extra
> info to explain why I have separate fields for title and one-line
> description and full description in answer records). Basically
> every piece of text that must be displayed differently for each
> record, must be **in** each record, so that's why it's in the
> record, because I want to be able to display it. Maybe I gave TMI,
> but still I did answer your question *why* I have the field in the
> first place.
>

You don't get it. Data storage is independent of data display.

Learn about database normalization and get your database right first.

> Would you have preferred that I just say:
> "The title for each record is different, so that's why the title is
> a field in each record."
> If you don't think the title of each record should be in the
> record, then where do you think a record's title should be stored??
>

You have the title in two different tables. Bad idea.

>> If it isn't right, you should either UPDATE or DELETE the bad one.
>
> That's the purpose of the audit utility, to show me where I have a
> problem of more than one record with same short title, and generate
> a HTML form showing which records I might want to DELETE, and then
> I check which ones I want to immediately delete, and click the
> submit button to do it.
>

You shouldn't need such a utility. Databases will handle it quite
nicely, if you design the database correctly.

>> And if you don't want more than one title in the final result,
>> then you shouldn't allow it in development, either.
>
> I disagree. During development, I prefer to be able to re-test the
> same code until I get it correct, and not have to delete the result
> of each test before I can run the next test. Then when I am sure I
> have it correct, I then clean up whatever duplicate records the
> various tests created that were more than really wanted in the end.
>

Now I see why you haven't been able to get a job in 18 years. You have
no idea about database design, program design - OR DEVELOPMENT.

>> Auditing the table later is the wrong way to go.
>
> I disagee. My style of code deveopment is different from yours, and
> your way of doing things doesn't apply to mine.
>

Fine. Do it your way.

>> If that's not valid, then by definition your database is inconsistent.
>
> That's fine by me during rapid development, to keep things simple,
> to allow me a natural way of developing/testing new code. It's like
> having more than one dirty dish in the sink. Ultimately you want
> *all* your dishes to be clean, but it's not absolutely necessary to
> immediately wash each dish after it gets dirty. It's sufficient to
> prepare and eat a meal and create a whole pile of dirty dishes, and
> *not* bother to clean them yet, and then later when the meal is
> done and you have some extra energy you go wash a batch of dishes.
>

Databases should never be inconsistent - during development or production.

>> Don't allow it in the first place.
>
> That's too much trouble to program from the start. It takes one
> line of code to INSERT a record. It takes *several* lines of code
> to INSERT a record but first DELETE anything that it might conflict
> with.
>

It's much more trouble to fix it later.

<snip the rest of the crap>

You don't want help. You want people to tell you how to make an even
bigger fool of yourself.

Well, I'm not biting. I tried to give you some help. You don't like it
- you're on your own.

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 23, 2009, 11:06:20 PM12/23/09
to
> From: Jerry Stuckle <jstuck...@attglobal.net>

> You don't get it. Data storage is independent of data display.

It's impossible to display data unless it is stored somewhere or it
can be computed from what's stored somewhere.

> Learn about database normalization and get your database right first.

Data specific to questions are in one table. Data specific to
answers are in another table. There's a many-to-one relationship
between questions and answers, hence a link from each answer to the
ID of the corresponding question. That sounds normalized to me.

There's a weighted many-to-many relationship between users and
answers, hence another table linking to IDs of questions and IDs of
answers, and directly containing weight values. That sounds
normalized to me.

> You have the title in two different tables. Bad idea.

I have titles of questions in the table of questions, and titles of
answers in the answers table. At present there are no question
titles which are also answer titles. The only question titles that
will ever be the same as answer titles are answers to the one
meta-question.

Several weeks before I implemented the current design of tables, I
posted a link to a Web page describing my idea, and soliciting
suggestions/advice. You didn't offer any except to poo-poo the
whole idea of making a Web site implementing a cooperative economic
system. After a few weeks of no suggestions/advice, I went ahead
and implemented my best idea. If you wanted to influence my table
design, you missed your chance. At this point, my current table
design is good enough that it's not worth the trouble of starting
from scratch and re-doing the whole Web site.

Gordon Burditt

unread,
Dec 24, 2009, 12:04:21 AM12/24/09
to
>> I like this idea. Be sure that there is an index (non-unique, since
>> you want to allow duplicates to exist, then fix them) on the title
>> field.
>
>Is that necessary only for efficiency? Would the basic MySQL stuff
>still work without an index, but slower?

Indexes are usually not mandatory. Setting up indexes is something
you should be testing as part of your development. They aren't
something you add when you go into production. Were you of the
opinion that all indexes are unique indexes? That's not true. Or
are you attempting premature pessimization of your site?

There are a few things that WON'T WORK without indexes. Auto-increment
fields are one example (and they require not just an index or a
unique index, but it must be a primary key). Your site can easily
become too slow to be usable without indexes.

>Since I'd be doing this
>audit task only at times when I've been developing code that might
>have accidently added extra records with duplicate titles, and I
>want to clean up that mess, it seems to me that high speed isn't
>necessary.

An (non-unique) index will not stop you from doing any of this.
Have all the duplicate records you want. Finding duplicates will
be faster.

>> As far as duplicate possible answers go, you'd need to find
>> duplicates of the pair (quesID, answer_title).
>
>Hmm, at present, as I develop the code, there's only one question,
>the meta-question (which question would like to install as
>a survey question), so at present quesID is always 1.

Stop bringing that up, as if it's going to be true for long, your
system is useless. For testing, populate your system with some
test questions, candidate answers, users, and actual test responses
to the survey questions. It's necessary to make It's perfectly
reasonable to assume that you're usually going to have a bunch of
questions and deal with that. The single-question situation is NOT
the appropriate case to try to optimize. You shouldn't have any
problems making the single-question case work if it will work for
many questions.

For performance testing, you should probably load up the database
with test data including 10,000 questions, maybe 10 possible answers
per question, 1,000 users, and maybe 1,000,000 investments spread
over those questions. Then see how well it performs. You might
need to consider how, at that level, choosing a question (out of
10,000 of them) is going to fit on a mobile-phone screen. Sub-categories
might be needed here.

And how will you fit a single question (once you selected it) and
10 possible answers (plus "add your own response") on the mobile
phone screen? You *WILL* get at least 10 answers a couple of months
before the primaries start to "Who should be the {Republican,
Democratic} presidential candidate in 2012?". You may get more than 10
answers with various spellings of the same guy's name, especially if
someone with a hard-to-spell name like Dan Kwayle runs again.

>But later when
>there are *regular* questions available, if two *different* questions
>have a same-titled answer then I guess that's not a problem.

It shouldn't be. It's going to happen a lot in real life.
If it's a problem it will give you a lot of grief.

>So I think
>you are correct and I need to amend my original question: I want to
>detect pairs of answer records such that:
>- QuestionIDs are the same (added condition now)
>- Titles are the same
>- IDs are different, i.e. they are different records
>
>For duplicate *questions* with same title, the original spec is correct.
>
>> I think this is also best done with the self-join.
>
>Let me see if I can write the correct MySQL queries for each case now:
>
>For duplicate answers within same question:
> SELECT surans as s1, surans as s2
> WHERE s1.shortitl = s2.shortitl
> AND s1.surtab = s2.surtab
> AND s1.idsur = s2.idsur
> AND s1.id < s2.id
>
>For duplicate questions:
> SELECT surveys as s1, surveys as s2
> WHERE s1.shortitl = s2.shortitl
> AND s1.id < s2.id

You're missing a FROM clause from both queries, although you've
got most of it but the FROM keyword in there. You also didn't
list any fields to retrieve. I'm not sure what the "surtab" field
is supposed to be. Have you considered TESTING queries?

Do you ever test queries with a command-line query tool? Or
phpMyAdmin or some other tool that will let you type a query into
a command line or window, then see the result? It will save you a
lot of grief compared to putting it into PHP and then trying to
figure out what went wrong.

>In each case, for each group of n duplicates there would be n*(n-1)
>rows in the resultset.

With the "AND s1.id < s2.id" in there (as opposed to "AND s1.id !=
s2.id"), divide by 2. You cut out half of the duplicate rows.

>With n typically in the range 5 to 10, that
>means 20 to 5000 rows.

Your math is off. Using your numbers and my formula, 5 duplicates
-> 10 rows and 10 duplicates -> 45 rows. You get approximately
5000 rows from 100 duplicates.

I realize you want to be able to test stuff, but you're actually
PLANNING on having 5-10 duplicates of each record? (May I suggest
sticking the date and time into each question title, so needing
duplicates isn't an issue?)

It
>would be too messy to present menus for more than one title at a
>time and try to select all but one of each group, so I really want
>to clean up just one title at a time, so LIMIT to just one row in
>resultset ought to be just what I need. Would such a LIMIT cause
>MySQL to stop searching as soon as it finds that one duplicate
>pair, *rather* than generate the complete set of n*(n-1) matches
>and *then* trim it down to just one, thus it would be reasonably
>efficient? So I should add this clause to each of the above
>queries
> LIMIT 1
>right?

LIMIT acts on the rows as they are sorted. You also want, don't you,
to with the one query find out that you've got duplicates in "Are you an
idiot?", more duplicates in "Are you a moron?" and even more in "Are you
a horse?" ? LIMIT will filter that out and restrict you to fixing one
set of duplicates at a time.

>> There are probably a lot of questions with the same group of
>> answers, such as True/False/Al Gore or Agree/Disagree/Al Gore
>> or Strongly Agree/Agree/Neutral/Disagree/Strongly Disagree/Al Gore.
>
>Yeah. See the extra clauses in the surans self-join above. Since
>the investment table shows user's investments in the answer record
>per se, not in some sort of virtual question-answer pair, this will
>require complete duplicate set of answers for each question with
>those possible answers. However since I generally plan surveys with
>free-form answers like "What do you want to do in Iraq?" (Leave
>immediately, drop hydrogen bombs on all cities, spread anthrax to
>all villages, fly drones that shoot all vehicles larger than a
>bicycle, etc.), rather than standard agree/disagree sorts of
>responses, I don't see lots-of-duplicate-sets-of-answers as a
>serious problem.

I can see all of the above set of answers being used for the all
questions:

What do you want to do in Iraq?

What do you want to do in Afghanistan?
What do you want to do in Mexico?
What do you want to do in Washington, DC?
What do you want to do at the motorcycle gang convention next week?

and then there's the guy who will answer all questions of the
form "What do you want to do ... " with "Smoke weed".

>For agree/disagree sorts of questions about
>matters of fact, such as asking whether people agree there are WMD
>in Iraq, I'll mostly use truth-futures instead.

I wasn't thinking of that (fact questions) as much as opinion questions
like "We *SHOULD* leave Iraq immediately. Agree/Disagree".

>For questions about
>how strongly people feel about some issue, instead of weak/strong
>scale, I'll use amount of lebor they are willing to invest in some
>answer as guage of their interest, on the philosophy that actions
>speak louder than words in judging serious interest. Thus instead

Ok, how do you do that? Given various investments in the question
"What do you want to do in Iraq?" with answers:
"Leave immediately"


"drop hydrogen bombs on all cities"

"Smoke weed"
"spread anthrax to all villages"
"fly drones that shoot all vehicles larger than a cockroach"
"steal all the oil"

how do you come up with a consensus answer you could report as a poll
result to the press? Escalate the war? Keep the same level? Reduce
funding and manpower? Cut it off entirely?

Jerry Stuckle

unread,
Dec 24, 2009, 8:46:18 AM12/24/09
to
Robert Maas, http://tinyurl.com/uh3t wrote:

Yes, your idea is pure crap. And you weren't looking for help - you
were looking for free consulting. There is a huge difference. The
former I provide. But not the latter.

Norman Peelman

unread,
Dec 24, 2009, 9:17:35 AM12/24/09
to
Robert Maas, http://tinyurl.com/uh3t wrote:
>>>> Why do you have title in both survey questions and survey answers?
>>> Title in each answer is what will fit on a one-inch cell-phone screen
>>> alongside the total vote (investment) and user's personal investment.

I don't think I would 'invest' in an answer...

>> From: Jerry Stuckle <jstuck...@attglobal.net>
>> I don't care about how it is displayed. That has nothing to do
>> with the database design.
>
> You asked why I have that field in that record. I answered that
> it's because I want to display it (and I have a little bit of extra
> info to explain why I have separate fields for title and one-line
> description and full description in answer records). Basically
> every piece of text that must be displayed differently for each
> record, must be **in** each record, so that's why it's in the
> record, because I want to be able to display it. Maybe I gave TMI,
> but still I did answer your question *why* I have the field in the
> first place.
>

Why would you do that? These should all be in separate tables and
linked together:

Table 1: id, Title, one-line desc, full desc.
Table 2: id, Question, Table 1 link, Table 3 link
Table 3: id, Answer, Table 1 link, Table 2 link

...if you want to keep count of how many times an answer has been used,
throw a 'count' field in Table 3.

> Would you have preferred that I just say:
> "The title for each record is different, so that's why the title is
> a field in each record."
> If you don't think the title of each record should be in the
> record, then where do you think a record's title should be stored??
>

see above. normalization. no duplication of data.

>> If it isn't right, you should either UPDATE or DELETE the bad one.
>
> That's the purpose of the audit utility, to show me where I have a
> problem of more than one record with same short title, and generate
> a HTML form showing which records I might want to DELETE, and then
> I check which ones I want to immediately delete, and click the
> submit button to do it.
>
>> And if you don't want more than one title in the final result,
>> then you shouldn't allow it in development, either.
>
> I disagree. During development, I prefer to be able to re-test the
> same code until I get it correct, and not have to delete the result
> of each test before I can run the next test. Then when I am sure I
> have it correct, I then clean up whatever duplicate records the
> various tests created that were more than really wanted in the end.
>
>> Auditing the table later is the wrong way to go.
>
> I disagee. My style of code deveopment is different from yours, and
> your way of doing things doesn't apply to mine.
>

Your method appears to be to code/store data as you want to see them
when output. So that when you examine your data you see where everything
fits. Problem is, you need to access the data the same way you expect
your (system) users would.

>> If that's not valid, then by definition your database is inconsistent.
>
> That's fine by me during rapid development, to keep things simple,
> to allow me a natural way of developing/testing new code. It's like
> having more than one dirty dish in the sink. Ultimately you want
> *all* your dishes to be clean, but it's not absolutely necessary to
> immediately wash each dish after it gets dirty. It's sufficient to
> prepare and eat a meal and create a whole pile of dirty dishes, and
> *not* bother to clean them yet, and then later when the meal is
> done and you have some extra energy you go wash a batch of dishes.
>
>> Don't allow it in the first place.
>
> That's too much trouble to program from the start. It takes one
> line of code to INSERT a record. It takes *several* lines of code
> to INSERT a record but first DELETE anything that it might conflict
> with.
>

INSERT ??? ON DUPLICATE UPDATE ? = ?

is so hard? The DB does it for you.

>> So what you're saying is you CAN have two questions with the same title?
>
> Only temporarily while debugging code to INSERT a new record in the
> table of questions. (Likewise two answers with the same title
> temporarily while debugging code to INSERT a new record in the
> table of answers.) And usually only on the devleopment site, not on
> the main online site.
>
>>> Initially, the list is empty, and the user nominates a new answer.
>>> If that answer is approved by the censor, then it gets approved,
>>> and appears in the list to select from.
>> So, what's the problem?
>
> During development/debugging, I might execute the code to INSERT a
> row more than once, and after I'm sure the code is correct I want
> to get rid of all the copies of that row (or similar rows) that I
> don't want to *remain* in the table permanently.
>

INSERT ??? ON DUPLICATE UPDATE ? = ?

is so hard? The DB does it for you.

>> Can you have multiple questions with the same title?
>
> Only for a few minutes while I'm trying to get the code to work.
> Then I want to delete all but one duplicates.
>
>> you shouldn't have the title in two different tables.
>
> There's one title in the question table, namely the meta-question,
> which isn't in the answer table, because it's the **root** of the
> tree, it has no ancestor.
>
> There are lots of titles in the answers table which aren't in the
> questions table, either because they haven't yet been promoted to
> be questions, or because they aren't answers to the meta-question
> so they will *never* be promoted to questions.
>
> With one qustion title which is never an answer title, and lots of
> answer titles that aren't question titles, the titles must be in
> two different places. (Yeah, I could have a third table that is
> just titles and nothing else, and reference from both questions and
> answers to that table, thus sharing the text between questions and
> answers. But since the short titles are limited to 12 characters
> each,

...couldn't your short title just be the first n chars of the full
title? Then you can get rid of one whole field of duplication.

> and only a very small fraction of the different titles are
> shared between both tables anyway, I don't see how any extra
> efficiency is obtained that is worth the extra trouble to normalize
> the tables to such an extreme degree. COPYing the title from one
> table to the other in those few cases of shared title seems
> reasonably efficient.)
>
>> To quote you:
>> "...a table of survey questions (each with autoincr questID, and
>> with ***varchar title***), a table of answers for the questions
>> (each with link to questID, and with autoincr ansID, and with
>> ***varchar title***)"
>> Which is it?
>
> Both: Each question has a short title, and each answer has a short title.
> For example (truncated):
> S.questions
> TOTL SELF invests
> 1.12 .120 [Features]
> .002 .001 [Worst TV ads]
> The question title is shown in the first row, and the answer titles
> are clickable links in the last two rows. (Second row has column
> headers for total investments and investment by self.)

I'm not really clear on what an answer/title is... wouldn't answers
link up to question/titles?

--
Norman
Registered Linux user #461062
-Have you been to www.mysql.com yet?-

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 24, 2009, 1:38:51 PM12/24/09
to
> From: gordonb.q9...@burditt.org (Gordon Burditt)

> Indexes are usually not mandatory.

So it's OK with you if I not bother adding any more indexes than I
already have?

> Setting up indexes is something you should be testing as part of
> your development.

What would "testing" entail for something that isn't needed to
begin with? When I write code that is becoming part of the
application, I test that code to see whether it fullfills its role
as part of that application. But a part that serves no useful
purpose, why and how would I "test" it??

> Were you of the opinion that all indexes are unique indexes?

I might have had that false idea, sorry. So I guess a non-unique
index would make retrieval and joins faster, if the test is an
exact match, such as
WHERE tab1.field1 = tab1.field2 (in a join, i.e. FROM tab1,tab2)
or
WHERE tab.field = $value (in a retrieval, i.e. a simple SELECT)

> There are a few things that WON'T WORK without indexes.
> Auto-increment fields are one example (and they require not just
> an index or a unique index, but it must be a primary key).

Each place I have an auto_increment field, the code looks like this:


id int(8) NOT NULL auto_increment,

PRIMARY KEY (id),
UNIQUE id (id)

There's no explicit mention of an index. I assumed the keyword pair
"PRIMARY KEY" automatically sets up an index. Is that assumption
correct? Is there anything I have failed to include for the 'id'
field that needs to be added?

> Your site can easily become too slow to be usable without indexes.

Since I'm charging the user for each script-run, rounded up to the
nearest whole millisecond, and I'm printing out the charge at the
end of each script-run, it would be quite apparent if scripts ever
run too slow. So-far I'm finding that once in a while there's a
long pause before the server respondes, and correspondingly the
start-end duration of the script shows several seconds have passed.
I assume that's not a problem with my script, but rather the server
itself suddenly getting too busy to finish my script, because
exactly the same script at other times completes in less than 5
milliseconds. But because of the reporting on every script-run, if
due to large number of rows in the tables, the script begins
running very slowly on a regular basis, it'll be readily apparent,
and of course I'll be watching for that.

> An (non-unique) index will not stop you from doing any of this.
> Have all the duplicate records you want. Finding duplicates will
> be faster.

It sounds like *whenever* I want to do a JOIN, either between two
tables, or from a table to itself, the column(s) that are involved
in the matching from one table to the other ought to be indexed,
that without an index the JOIN would be orders of magnitude slower.
Is that correct? Is it sufficient that just one of the two fields
is indexed, or must both of them be indexed? For example, if
there's a unique ID in one table, and a reference to it in another
table, and I want to JOIN per that ID-reference pair, does the
reference *also* need to be indexed, or is it sufficient if only
the ID itself is indexed? Example (truncated to just the relevant
fields):


CREATE TABLE surveys(
id int(8) NOT NULL auto_increment,

PRIMARY KEY (id),
UNIQUE id (id) ...)
CREATE TABLE surans(
idsur int(8) default 0 ...)
SELECT * FROM surveys,surans WHERE surveys.id = surans.idsur
Will that run very much slower if surans.idsur is (as shown) *not*
indexed, compared to if surans.idsur *is* indexed (not shown)?

> And how will you fit a single question (once you selected it) and
> 10 possible answers (plus "add your own response") on the mobile
> phone screen?

Scrolling (but just a little bit) will be required. It won't be
like Google's trip-planner on www.vta.org where it requires
pressing the down button more than fifty times to scroll from the
top of the screen to the place where the submit button is located.
With question title in first line, column headers in second line,
and each line after that (before the SUBMIT button) being one
answer, six answers will appear on the top of the screen, then four
presses of down button will show the next four answers then one
more press of down button will show the TextField and SUBMIT button
new nomination. I could even have a SKIP button on that first line
to jump all eight lines to second screen in a single click, thus
first screen would have title+skip + headers + 6 answers and second
screen would have 9 more answers and TextField+SUBMIT. That would
allow easy navigation of 15 answers to one question. For more than 6
but fewer than 15 answers to one question, I could have the SKIP
button jump by less than a full screen, having some overlap between
first and second screen. Or I could allow the TextFields for
one-line and full description to also fit on that second screen.
Eventually I'd support per-user configuration of sorting order for
the answers, LIMIT to just the top n answers, and skipping mode.

> You may get more than 10 answers with various spellings of the
> same guy's name, especially if someone with a hard-to-spell name
> like Dan Kwayle runs again.

Each user would see only his/her nomination with bad spelling. The
approved answer, that *all* users could see, would have only the
correct spelling. The primary purpose of censoring nominations is
to protect other users from abusive/offensive/obscene language, but
it would also serve to correct misspellings.

>>For duplicate answers within same question:
>> SELECT surans as s1, surans as s2
>> WHERE s1.shortitl = s2.shortitl
>> AND s1.surtab = s2.surtab
>> AND s1.idsur = s2.idsur
>> AND s1.id < s2.id

Sorry, I omitted the returned-values and FROM syntax. I meant to say:
|| SELECT * FROM surans as s1, surans as s2


|| WHERE s1.shortitl = s2.shortitl
|| AND s1.surtab = s2.surtab
|| AND s1.idsur = s2.idsur
|| AND s1.id < s2.id

>>For duplicate questions:
>> SELECT surveys as s1, surveys as s2
>> WHERE s1.shortitl = s2.shortitl
>> AND s1.id < s2.id

Likewise I meant to say:
|| SELECT * FROM surveys as s1, surveys as s2


|| WHERE s1.shortitl = s2.shortitl
|| AND s1.id < s2.id

> I'm not sure what the "surtab" field is supposed to be.

When I designed this application, I had in mind that there might be
more than one type of question, with grossly different record
structure, which however might be presented to users in a somewhat
uniform manner. Thus an answer would have to refer not just to the
index number of the question it is answering, but to the *table* in
which that question resides. For the moment, however, there's only
one table of questions to which the answers table can refer, so the
surtab field always has the value 1.

> Have you considered TESTING queries?

I almost always develop new PHP/MySQL code on a development site,
separate from the main online-for-users site, with ECHO statements
after every operation that does anything that returns a new value.
Only after it's been tested and found working, then I remove most
of the ECHO statements and copy the new scripts to the main site.

> Do you ever test queries with a command-line query tool?

AFAIK neither the development site (oxyhost.com), nor the online
site (uuuq.com), provide any TELNET-to-MySQLCommandLine service.
AFAIK the only way I have to execute MySQL commands is via a PHP
script, which I can upload only via FTP.

> Or phpMyAdmin or some other tool that will let you type a query
> into a command line or window, then see the result?

There ain't no way either oxyhost.com or uuuq.com is going to give
me access to phpMyAdmin on their system!!!

> It will save you a lot of grief compared to putting it into PHP
> and then trying to figure out what went wrong.

The development site (oxyhost.com) provides fine diagnostics for
any PHP syntax error. It even alerts me to undeclared variables
that I call (and then provides a NULL value for each, which
promptly breaks the next line of code in most cases, giving me
additional diagnostics which I can ignore).

But none of the other free PHP/MySQL hosting sites provide that
diagnostic service by default. On all the others, if there's a
syntax error, all I see back is a blank screen or a redirection to
a generic advertisement for the hosting service. Occasionally
oxyhost.com is down and I need to use a third site for development,
where no diagnostics are available Do you know if there's any
PHP/MySQL command that I can use to turn on diagnostics if they are
off by default, such as on that third site I sometimes need to use
for development when oxyhost.com is down?

>>In each case, for each group of n duplicates there would be n*(n-1)
>>rows in the resultset.

Oops, I forgot to include the /2 there, sorry.

> With the "AND s1.id < s2.id" in there (as opposed to "AND s1.id !=
> s2.id"), divide by 2. You cut out half of the duplicate rows.

Yes, I noticed that myself, and in fact included the /2 in the
arithmetic with an earlier posting a couple days ago, but somehow
forgot to type it in this time, sorry. My original code sample used
!= instead of < and later I realized what you say about cutting
result set in half.

I suppose I could cut the resultset down to just one row (per group
of duplicates) by using additional left/right joins to show *only*
the first-last pair within each set of duplicates. Basically:
firstcopy right-join secondcopy join thirdcopy left-oin fourthcopy
WHERE firstcopy.id IS NULL and fourthcopy.id IS NULL etc.
I don't know if that would work, but it'd be too much trouble in
any case, compared to just using "LIMIT 1".

> I realize you want to be able to test stuff, but you're actually
> PLANNING on having 5-10 duplicates of each record?

Only momentaily on development site while testing new code. I test
the code for both actual functionality and HTML formatting. When I
get both of those factors all correct, then I start deleting ECHO
statements, finally I have the code I really want. Between the time
I got the functionality working sorta and the time I get everything
else working and all debugging ECHO statements removed, the same
line of code to INSERT into the table might have been executed 5-10
times total.

> LIMIT acts on the rows as they are sorted.

So you're saying that if the JOIN would retrieve a hundred records,
but the MySQL statement has a "LIMIT 1" clause, MySQL do all the
work of finding all hundred records, and then discard almost all
that work to make just one of the records available to PHP via the
resultset?

> LIMIT will ... restrict you to fixing one set of duplicates at a time.

That's fine. Normally I'll be testing just one question-title or
answer-title at a time while developing new code, so there will be
only one set of duplicates at a time. If by chance I feel the need
to do two separate titles in the same test batch, for example to
test the code for detecting whether the title already exists or
not, then I might have two sets of duplicates, and fixing one then
the other is good enough.

> I can see all of the above set of answers being used for the all
> questions:
> What do you want to do in Iraq?

(Bomb Baghdad, kill all Sunni, close border with Syria.)

> What do you want to do in Afghanistan?

(Bomb Kabul, kill all Taliban, close border with Pakistan.)

> What do you want to do in Mexico?

(Bomb Cuidad whatever, kill all drug dealers, close border with USA.)

> What do you want to do in Washington, DC?

(Impeach Obama, Censor Pelosi, Tax lobbiests.)

> What do you want to do at the motorcycle gang convention next week?

(Knock over all their bikes, check their immigration status, remove
their tattoos.)

I don't see a lot of duplicate answers from one question to another.

> and then there's the guy who will answer all questions of the
> form "What do you want to do ... " with "Smoke weed".

One more reason for the master user, or a reverse tree, to censor
all frivolous answers, so that nobody else sees them.

It's my system, and I make the rules.
Note that I won't allow any offensive/abusive language *ever*, but
I will allow pornography to be viewed by anyone confirmed over 18.

> I wasn't thinking of that (fact questions) as much as opinion questions
> like "We *SHOULD* leave Iraq immediately. Agree/Disagree".

I'd rather keep that as an *answer* to a question of what we
*should* do regarding Iraq.

> Given various investments in the question
> "What do you want to do in Iraq?" with answers:
> "Leave immediately"
> "drop hydrogen bombs on all cities"

> "Smoke weed" (not included, unless the censor is smoking weed)


> "spread anthrax to all villages"
> "fly drones that shoot all vehicles larger than a cockroach"
> "steal all the oil"
> how do you come up with a consensus answer you could report as a
> poll result to the press?

Your question presumes a fact not in evidence.

So anyway, at present I'm leaning towards the self-join with
"LIMIT 1" clause but no additional indexes:
SELECT * FROM surans as s1, surans as s2


WHERE s1.shortitl = s2.shortitl
AND s1.surtab = s2.surtab
AND s1.idsur = s2.idsur
AND s1.id < s2.id

LIMIT 1

Robert Maas, http://tinyurl.com/uh3t

unread,
Dec 24, 2009, 2:19:31 PM12/24/09
to
> From: Norman Peelman <npeel...@cfl.rr.com>

> Table 1: id, Title, one-line desc, full desc.
> Table 2: id, Question, Table 1 link, Table 3 link
> Table 3: id, Answer, Table 1 link, Table 2 link

That's no good because it allows only one answer for a given question,
which defeats the whole purpose of this part of the application.

> Your method appears to be to code/store data as you want to see
> them when output.

Well, if there's only one format in which data will ever be shown,
namely as a plain string of text, then it makes sense that it'd be
stored logically in that same form. (It might actually be
compressed in the database itself, but if so that's invisible to
PHP.)

And if there's a logical "object", which together with a lot of
similarily-structured "obects" needs to be accessible in a uniform
way across the set of "objects", then it makes sense that the set
of "objects" be represented as a table and each row of the table
correspond to one "object". In my application, such objects are
connections, users, sessions, questions, answers, and investments.

> So that when you examine your data you see where everything fits.
> Problem is, you need to access the data the same way you expect
> your (system) users would.

I don't really see your point. Each item of data that is logically
part of an "object", is within the record corresponding to that
"object".

> couldn't your short title just be the first n chars of the full
> title?

Nope, not at all. If the question is "What would you like as a new
feature to be installed in NewEco"? then it's much better for the
title to be "Features" rather than "What would".

> > For example (truncated):
> > S.questions
> > TOTL SELF invests
> > 1.12 .120 [Features]
> > .002 .001 [Worst TV ads]
> > The question title is shown in the first row, and the answer titles
> > are clickable links in the last two rows. (Second row has column
> > headers for total investments and investment by self.)

> I'm not really clear on what an answer/title is.

It's right there in plain sight:
"Features" is an answer-title.
"Worst TV ads" is another answer-title.

> wouldn't answers link up to question/titles?

They would indeed. In the above example:
Answer with title "Features" links to question with title "S.questions".
Answer with title "Worst TV ads" links to question with title "S.questions".
SELECT * FROM surans WHERE idsur = $questid
finds all the answers that link to that one question and returns it
as a resultset, then
while ($rowarr = mysql_fetch_assoc($resanswers)) { ... generate HTML ... }
generates the display from that resultset.

It's actually a little bit more complicated because *only* answers
that have been approved by the master user are retrieved here. So
the actual query is a join between this table of answers and
another table of investments, where an investment by user#1 is
required. Thus the actual query is:
SELECT surans.shortitl, surans.id
FROM surans, surinvest
WHERE surans.id = surinvest.idans
AND surans.idsur = $questid
AND surinvest.userid = 1

Norman Peelman

unread,
Dec 24, 2009, 3:59:22 PM12/24/09
to
Robert Maas, http://tinyurl.com/uh3t wrote:
>> From: gordonb.q9...@burditt.org (Gordon Burditt)
>> Indexes are usually not mandatory.
>
> So it's OK with you if I not bother adding any more indexes than I
> already have?
>
>> Setting up indexes is something you should be testing as part of
>> your development.
>
> What would "testing" entail for something that isn't needed to
> begin with? When I write code that is becoming part of the
> application, I test that code to see whether it fullfills its role
> as part of that application. But a part that serves no useful
> purpose, why and how would I "test" it??
>

OMG... you can add and remove indexes at your whim without starting over
or deleting anything.

>> Were you of the opinion that all indexes are unique indexes?
>
> I might have had that false idea, sorry. So I guess a non-unique
> index would make retrieval and joins faster, if the test is an
> exact match, such as
> WHERE tab1.field1 = tab1.field2 (in a join, i.e. FROM tab1,tab2)
> or
> WHERE tab.field = $value (in a retrieval, i.e. a simple SELECT)
>
>> There are a few things that WON'T WORK without indexes.
>> Auto-increment fields are one example (and they require not just
>> an index or a unique index, but it must be a primary key).
>
> Each place I have an auto_increment field, the code looks like this:
> id int(8) NOT NULL auto_increment,
> PRIMARY KEY (id),
> UNIQUE id (id)
> There's no explicit mention of an index. I assumed the keyword pair
> "PRIMARY KEY" automatically sets up an index. Is that assumption
> correct? Is there anything I have failed to include for the 'id'
> field that needs to be added?
>

OMG

>> Your site can easily become too slow to be usable without indexes.
>
> Since I'm charging the user for each script-run, rounded up to the
> nearest whole millisecond, and I'm printing out the charge at the
> end of each script-run, it would be quite apparent if scripts ever
> run too slow. So-far I'm finding that once in a while there's a

You can't do this... you can't expect your users to pay for your scripts
random execution times.

> long pause before the server respondes, and correspondingly the
> start-end duration of the script shows several seconds have passed.
> I assume that's not a problem with my script, but rather the server

It's probably garbage collection.

> itself suddenly getting too busy to finish my script, because
> exactly the same script at other times completes in less than 5
> milliseconds. But because of the reporting on every script-run, if
> due to large number of rows in the tables, the script begins
> running very slowly on a regular basis, it'll be readily apparent,
> and of course I'll be watching for that.
>

OMG!!!
Readily to whom? How are you going to monitor this?

Their system - your database.

What does porn have to do with this?

>> I wasn't thinking of that (fact questions) as much as opinion questions
>> like "We *SHOULD* leave Iraq immediately. Agree/Disagree".
>
> I'd rather keep that as an *answer* to a question of what we
> *should* do regarding Iraq.
>
>> Given various investments in the question
>> "What do you want to do in Iraq?" with answers:
>> "Leave immediately"
>> "drop hydrogen bombs on all cities"
>> "Smoke weed" (not included, unless the censor is smoking weed)
>> "spread anthrax to all villages"
>> "fly drones that shoot all vehicles larger than a cockroach"
>> "steal all the oil"
>> how do you come up with a consensus answer you could report as a
>> poll result to the press?
>
> Your question presumes a fact not in evidence.
>
> So anyway, at present I'm leaning towards the self-join with
> "LIMIT 1" clause but no additional indexes:
> SELECT * FROM surans as s1, surans as s2
> WHERE s1.shortitl = s2.shortitl
> AND s1.surtab = s2.surtab
> AND s1.idsur = s2.idsur
> AND s1.id < s2.id
> LIMIT 1

You really need to create a small script that will populate a
database for you with consistent data that you can fiddle with to your
hearts content. If something fouls up that bad, just DROP the database
and reload. Then you can test your indexes and where you need them.

I'd say about a million rows

...then you can INSERT, UPDATE from there.

--
Norman
Registered Linux user #461062

-Have you been to www.php.net yet?-

Gordon Burditt

unread,
Dec 25, 2009, 12:22:51 AM12/25/09
to
>> Indexes are usually not mandatory.
>
>So it's OK with you if I not bother adding any more indexes than I
>already have?

I'm not the one who's going to complain if your site turns out to
be unbearably slow, or never goes production in the first place..

What do you have against considering adding indexes as part of
writing a query? Are you familiar with ALTER TABLE? Unless it's
a fairly big table, adding an index can be done from the command-line
in about 10 seconds. So can removing an index. Or adding another
field.

>> Setting up indexes is something you should be testing as part of
>> your development.
>
>What would "testing" entail for something that isn't needed to
>begin with?

Non-indexing is rarely worthwhile and you don't have to test
non-indexing if you don't intend to non-index.

Your site needs to have adequate performance under load under
realistic worst-case conditions: lots of users, lots of data in
the database, and a heck of a lot more than one question. It is a
good idea to test this, before your site goes live, just in case
your site is more popular than you ever dreamed it would be because
of unexpected national news coverage, and suddenly becomes unusable.
Then you lose most of the user base and get a bad reputation, from
which your site might not recover.

>When I write code that is becoming part of the
>application, I test that code to see whether it fullfills its role
>as part of that application. But a part that serves no useful
>purpose, why and how would I "test" it??

Performance testing serves a useful purpose. Along with that, have
you calculated whether you will have enough disk space for the
database when you have 1,000 users, 10,000 questions, and a million
votes on survey questions?

>> Were you of the opinion that all indexes are unique indexes?
>
>I might have had that false idea, sorry. So I guess a non-unique
>index would make retrieval and joins faster, if the test is an
>exact match, such as
> WHERE tab1.field1 = tab1.field2 (in a join, i.e. FROM tab1,tab2)
>or
> WHERE tab.field = $value (in a retrieval, i.e. a simple SELECT)

Indexes work not only for exact matches. They work for sorting
and grouping also. They also work for selecting values between
two limits.

>> Your site can easily become too slow to be usable without indexes.
>
>Since I'm charging the user for each script-run, rounded up to the
>nearest whole millisecond, and I'm printing out the charge at the
>end of each script-run, it would be quite apparent if scripts ever
>run too slow.

So why not fix it ahead of time, instead of pre-pessimizing your
site? Populate the database with a good set of test data. Use
EXPLAIN on each query you will be using often and ensure that it's
using appropriate indexes.

>So-far I'm finding that once in a while there's a
>long pause before the server respondes, and correspondingly the
>start-end duration of the script shows several seconds have passed.
>I assume that's not a problem with my script, but rather the server
>itself suddenly getting too busy to finish my script, because
>exactly the same script at other times completes in less than 5
>milliseconds.

Are you measuring CPU time or real (wall clock) time for script
completion? If you are measuring CPU time, the time for database
queries won't show up at all. If you are measuring wall clock time,
10 users hitting the site simultaneously might all have 10 times
the normal wall clock run time, even if the site is not struggling
under the load. Also, wall clock run time will include time
spent waiting for database locks by some other script. This may not
become an issue until you've actually got several copies of your
script running at the same time, which implies lots of traffic.

Both CPU time and wall clock time measurements for script run time
are not particularly repeatable and consistent measurements, and
using them to charge users can bring surprises. I wouldn't want
to buy gas at a station that will surprise me with the price of
gas, sometimes $1/gallon, sometimes $10/gallon, after I'd pumped
it.

>But because of the reporting on every script-run, if
>due to large number of rows in the tables, the script begins
>running very slowly on a regular basis, it'll be readily apparent,
>and of course I'll be watching for that.
>
>> An (non-unique) index will not stop you from doing any of this.
>> Have all the duplicate records you want. Finding duplicates will
>> be faster.
>
>It sounds like *whenever* I want to do a JOIN, either between two

Joins you're going to do regularly, yes. You don't necessarily
have to put in indexes for queries used only for the annual report
or when taking your company public.

>tables, or from a table to itself, the column(s) that are involved
>in the matching from one table to the other ought to be indexed,
>that without an index the JOIN would be orders of magnitude slower.

That depends on how many records there are in the table. It won't
speed up the one-question case. If you've got a table that's small
and is expected to stay small, for example:

Table sex
sexid sex_title
1 Male
2 Female
3 Unknown

you don't need to index it. (Although I probably would put a unique
index/primary key on sexid anyway). You probably won't have many
additions to that table. With thousands of records and table size
greater than, say, 100K, you should see improvement.


>Is that correct? Is it sufficient that just one of the two fields
>is indexed, or must both of them be indexed? For example, if

Usually only one field needs to be indexed for a given query.
Sometimes you want to index both for different reasons (different
queries).

I'm not sure how as a user that will help, unless you implement
"put the choice the user is going to select on top", which
is rather difficult without mind-reading.

I wonder how much bias will be introduced by having an answer
presented in the first 6 vs. having it in the last 4?

>> You may get more than 10 answers with various spellings of the
>> same guy's name, especially if someone with a hard-to-spell name
>> like Dan Kwayle runs again.
>
>Each user would see only his/her nomination with bad spelling. The
>approved answer, that *all* users could see, would have only the
>correct spelling. The primary purpose of censoring nominations is
>to protect other users from abusive/offensive/obscene language, but
>it would also serve to correct misspellings.

I don't recall seeing an "approved" flag in the answer table, to
indicate whether it should be visible or not.

That means you will have duplicate answers not only from *development*,
but also from *users*. In a question about who should be elected,
if your site is popular, you'll probably have a *second* vote for
candidate B before you've managed to approve the first one. Suppose
they both spell the name correctly. Then you'll need to transfer
the votes from the duplicates to the approved answer, or you'll end
up having a "split vote" between the candidate and him/herself.

I'll still suggest: if you get a vote for a new answer, and it
matches an old answer for the same question, you should apply the
vote to the existing answer, not insert a new one.

I also have to wonder if you aren't going to have problems eventually
where you don't know much about the subject of the question. For
example, you might not know much about school board elections in
Urkistan, but some of your users do. Unfortunately, you don't know
that D. Brown and B. Brown are *different* candidates (and the top
two candidates and married to each other), and you merge the two
as misspellings of a single name, really messing up the results.
You also don't know that the o in Brown is supposed to be accented,
and your keyboard doesn't have that character on it.


>When I designed this application, I had in mind that there might be
>more than one type of question, with grossly different record
>structure, which however might be presented to users in a somewhat
>uniform manner. Thus an answer would have to refer not just to the
>index number of the question it is answering, but to the *table* in
>which that question resides.

It is almost always a mistake to put a table name in a field of another
table.

>> Do you ever test queries with a command-line query tool?
>
>AFAIK neither the development site (oxyhost.com), nor the online
>site (uuuq.com), provide any TELNET-to-MySQLCommandLine service.
>AFAIK the only way I have to execute MySQL commands is via a PHP
>script, which I can upload only via FTP.

You also have the option (do this VERY CAREFULLY, though) of uploading
a carefully protected script which will let *YOU* (and not just any
user or spambot) enter a query into an input field, then run it,
and return the results.

>> Or phpMyAdmin or some other tool that will let you type a query
>> into a command line or window, then see the result?
>
>There ain't no way either oxyhost.com or uuuq.com is going to give
>me access to phpMyAdmin on their system!!!

Why not? In a typical hosting setup, your MySQL login gives you
pretty much unlimited access to *YOUR* database (create and drop
tables, select, insert, update, and delete data, etc.) and no access
to anyone else's. phpMyAdmin or any other web tool wouldn't give
you more than that. Some hosting companies provide tools like this.

>> It will save you a lot of grief compared to putting it into PHP
>> and then trying to figure out what went wrong.
>
>The development site (oxyhost.com) provides fine diagnostics for
>any PHP syntax error.

For MySQL syntax errors, you should check whether your query
worked, and if not, call and log what mysql_error() returns.

>It even alerts me to undeclared variables
>that I call (and then provides a NULL value for each, which
>promptly breaks the next line of code in most cases, giving me
>additional diagnostics which I can ignore).
>
>But none of the other free PHP/MySQL hosting sites provide that
>diagnostic service by default. On all the others, if there's a
>syntax error, all I see back is a blank screen or a redirection to
>a generic advertisement for the hosting service. Occasionally
>oxyhost.com is down and I need to use a third site for development,
>where no diagnostics are available Do you know if there's any
>PHP/MySQL command that I can use to turn on diagnostics if they are
>off by default, such as on that third site I sometimes need to use
>for development when oxyhost.com is down?

Have a look at the PHP function error_reporting(). It, unfortunately,
will not help if your PHP script won't *compile* due to a syntax
error, since in that situation, error_reporting() won't get run,
not even if it's first.

>
>> What do you want to do in Afghanistan?
>
>(Bomb Kabul, kill all Taliban, close border with Pakistan.)
>
>> What do you want to do in Mexico?
>
>(Bomb Cuidad whatever, kill all drug dealers, close border with USA.)
>
>> What do you want to do in Washington, DC?
>
>(Impeach Obama, Censor Pelosi, Tax lobbiests.)
>
>> What do you want to do at the motorcycle gang convention next week?
>
>(Knock over all their bikes, check their immigration status, remove
> their tattoos.)
>
>I don't see a lot of duplicate answers from one question to another.

"Kill everyone" and "Nuke it" could easily apply to all of those.
So could "leave immediately".

>> and then there's the guy who will answer all questions of the
>> form "What do you want to do ... " with "Smoke weed".
>
>One more reason for the master user, or a reverse tree, to censor
>all frivolous answers, so that nobody else sees them.
>
>It's my system, and I make the rules.
>Note that I won't allow any offensive/abusive language *ever*, but
>I will allow pornography to be viewed by anyone confirmed over 18.

Are you suggesting that "smoke weed" is *pornographic*?

>> Given various investments in the question
>> "What do you want to do in Iraq?" with answers:
>> "Leave immediately"
>> "drop hydrogen bombs on all cities"
>> "Smoke weed" (not included, unless the censor is smoking weed)
>> "spread anthrax to all villages"
>> "fly drones that shoot all vehicles larger than a cockroach"
>> "steal all the oil"
>> how do you come up with a consensus answer you could report as a
>> poll result to the press?
>
>Your question presumes a fact not in evidence.

I presume there's an intent that the results of a survey be useful
to someone. I have to wonder how useful they will be if the top
answer gets less than 10% of the investment.

Robert Maas, http://tinyurl.com/uh3t

unread,
Mar 30, 2010, 4:31:43 AM3/30/10
to
> >> Indexes are usually not mandatory.
> > So it's OK with you if I not bother adding any more indexes than I
> > already have?
> >> Setting up indexes is something you should be testing as part of
> >> your development.
> > What would "testing" entail for something that isn't needed to
> > begin with? When I write code that is becoming part of the
> > application, I test that code to see whether it fullfills its role
> > as part of that application. But a part that serves no useful
> > purpose, why and how would I "test" it??
> From: Norman Peelman <npeel...@cfl.rr.com>

> OMG... you can add and remove indexes at your whim without starting over
> or deleting anything.

Good. So I'm not going to bother writing the PHP script to add any
more indexes than I already have until and unless I get enough
users that working without an index slows down script-runs
significantly. Currently most scripts take only 5-10 milliseconds
to run, and only occasionally take as long as 100 ms. If they start
taking 100+ ms on a regular basis, then I'll start to worry about
if an index would speed up any of the JOINs.

> > There's no explicit mention of an index. I assumed the keyword pair
> > "PRIMARY KEY" automatically sets up an index. Is that assumption
> > correct? Is there anything I have failed to include for the 'id'
> > field that needs to be added?
> OMG

Why the worthless cursing? Am I wrong to assume that "PRIMARY KEY"
automatically sets up an index for that key (in MySQL)?

> >> Your site can easily become too slow to be usable without indexes.
> > Since I'm charging the user for each script-run, rounded up to the
> > nearest whole millisecond, and I'm printing out the charge at the
> > end of each script-run, it would be quite apparent if scripts ever
> > run too slow. So-far I'm finding that once in a while there's a
> You can't do this... you can't expect your users to pay for your
> scripts random execution times.

The execution charge is capped at 100 ms. One way of looking at
this is that user is charged for how long a script-run takes, but
the 100 ms limit is a protection. But another way of looking at it
is that I'm promising 100 ms script-run charges but most of the
time the user gets a lucky break when I actually charge even less.

Using the 100 ms figure, the user spending 10 seconds answering a
Turing question (fill in missing word in some text) gives that user
at least a hundred script runs, enough to last a good fraction of
an hour of heavy use of the system, probably as much as any user
would want to do in a full workday. Since just navigating the site
(loop: submit something, get response, eyeball response to decide
what to do next, plan whatever it is, get ready to submit :loop)
typically requires about 5-10 seconds each cycle, 100 script-runs
equals 500-1000 seconds of labor, so the extra 10 seconds to answer
a Turing question is only 1-2% of the total time, so *really* the
service is essentially free. If script-run times are more like 10
ms, that's less than a fifth of a percent of total time doing the
Turing tests.

So why bother forcing the user to do that extra less-than-2% labor?
Because if the user writes a spambot to flood my site with requests
thereby reducing navigation time to zero, then the 10 seconds extra
labor (which *can't* be automated) becomes the dominant time, thus
throttling spambots down to a level of usage only a couple orders
of magnitude larger than a live user, thus disrupting them from
swamping the system.

Note that a botnet is of no value in trying to swamp my system, in
fact it's actually counterproductive, for a spammer, because each
user gets credit accounted separately, and each user may have only
one login session active at any one time, and it's not feasible for
several machines within a botnet to share a single login session
cookie, so a botnet would require a different user account for each
different instance of the 'bot, and hence the spammer to manually
answer yet another randomly-generated Turing question for *each*
instance of the 'bot.

Turing tests are just a bootstrapping method of metering/throttling
usage, in lieu of user doing *useful* work for me about 10 seconds
per typical day of usage, which will take a lot more programming
work for me to implement, but I hope to have the energy to do it
within the next few months. If each 'bot-instance within a botnet
can somehow do *useful* work for me, such as high-quality automated
translation between languages (such as Chinese to English), thus
demonstrating true A.I., then I *welcome* a botnet using my system,
doing immense amounts of useful work on behalf of my users
including me personally.

> > long pause before the server respondes, and correspondingly the
> > start-end duration of the script shows several seconds have passed.
> > I assume that's not a problem with my script, but rather the server
> It's probably garbage collection.

GC in what? Inside MySQL somehow? Or in PHP?

> > itself suddenly getting too busy to finish my script, because
> > exactly the same script at other times completes in less than 5
> > milliseconds. But because of the reporting on every script-run, if
> > due to large number of rows in the tables, the script begins
> > running very slowly on a regular basis, it'll be readily apparent,
> > and of course I'll be watching for that.
> OMG!!!
> Readily to whom?

To the user.

> How are you going to monitor this?

I log in as the master user to perform various tasks such as
approving nominations for answers to survey questions. I also log
in as a regular user to test the software after each upgrade and
when I want to vote on the surveys. After other features are
implemented, I'll be actually using those features. So I'll notice
if one of the scripts within one of the applications starts taking
a lot longer than 10 ms to run each time.

> > There ain't no way either oxyhost.com or uuuq.com is going to give
> > me access to phpMyAdmin on their system!!!
> Their system - your database.

AFAIK the only facilities on the hosting service that I have access
to over VT100 term are FTP and running my own scripts. I'm very
seldom at a computer that has a standard Web browser whereby their
administer-database software would work at all such that I could
then explore to see whether phpMyAdmin is or is not available for
me to use. I'm allowed a total of only a half hour maximum per day,
one or two days per week. And that half hour is too noisy for me to
concentrate on anything serious, and I have many other things I
need during that half hour, so there simply isn't time available to
even learn what's available on the PHP/MySQL hosting services, much
less learn how to use them effectively.

> > The development site (oxyhost.com) provides fine diagnostics for
> > any PHP syntax error. It even alerts me to undeclared variables
> > that I call (and then provides a NULL value for each, which
> > promptly breaks the next line of code in most cases, giving me
> > additional diagnostics which I can ignore).

Update: oxyhost has been down for the past couple months, so I've
been forced to switch to a different hosting site for development
of new code. I tried adding this line of code to the top of each
script I'm debugging:
error_reporting(E_ALL);
which sometimes causes an error message instead of a blank screen,
but even with that some syntax errors just give a blank screen. So
now when that happens I do the --syntax-check locally on my shell
account. Once the syntax errors are fixed, then the hosting service
generates legible diagnostics for any other problems.

> > But none of the other free PHP/MySQL hosting sites provide that
> > diagnostic service by default. On all the others, if there's a
> > syntax error, all I see back is a blank screen or a redirection to
> > a generic advertisement for the hosting service.

See the comments about enabling *some* diagnostics on other hosting
services, and using --syntax-check for anything that gives blank
screen on the hosting service. So now I'm comfortable doing
debugging on J. Random hosting service when oxyhost is down for an
extended time, as it's been lately.

> > It's my system, and I make the rules.
> > Note that I won't allow any offensive/abusive language *ever*, but
> > I will allow pornography to be viewed by anyone confirmed over 18.
> What does porn have to do with this?

It's well known that porn is the most popular "service" on the
InterNet, the reason most users get on the net in the first place,
the activity they spend most of their time doing, the activity they
are most willing to pay money or jump through hoops to get more of.
So it may turn out that the only way to build my service to a large
enough userbase as it really needs to be of full value, would be to
make some type of porn available. In addition, I already plan to
show photos of semi-clad attractive women, in a sort of joke
service: "When you see an attractive woman, what's the first thing
you look at, and the part of her body you are most attracted to and
spend the most time looking at, so much so that she has to keep
reminding you to look at her face instead?" (Answer: Breasts!)
"Here are some breasts." (some JPEGs here) "Which breast do you
think belongs to Ellen Muth, Heather Thompson, and Christini
Ricci?" I'll use human labor, lured by being able to see semi-nude
images of attractive sexy women, in much the same way as Google
uses human labor in the "image tagging" game, to improve indexing
of images hence improve search results. This will be a variation
on the idea in the "Fahrenheit 451" movie, where suspects are
identified by the hair whorls on the backs of their heads, so that
they can be identified by their appearance while running away from
the scene of a crime. It's *joke*! Get it? Anyway, I bet a lot of
well-known women can be identified by their precise breast images.
And I bet most women can be categorized by the general shape/size
of their breasts even when wearing clothes. By asking men to say
which part of a woman they want to look at most, I can get them to
spend the most time and good effort identifying which breasts or
other favorite body part clusters together. By getting independent
clustering on face, breasts, buttox, legs, hands, etc., and the
combining all that info into a single database, I believe can get
better indexing of images than by clustering *only* on faces as
most such databases currently do.

> You really need to create a small script that will populate a
> database for you with consistent data that you can fiddle with to your
> hearts content. If something fouls up that bad, just DROP the database
> and reload. Then you can test your indexes and where you need them.
> I'd say about a million rows

That would be a good idea if I had a really good mathematical model
of what data I expect user to enter. But I don't have that model,
so the best I could is random data, which wouldn't be
representative of real-world expectations, thus would mislead me
about what really needs tuning.

By the way, since I last talked in this thread, I've essentially
finished implementing survey questions. You can not only create an
account and log in, but you can vote in the meta survey, including
nominating new questions, and after I promote a meta-answer to a
new question you can the nominate answers and vote for such anwers
within that new question. At the moment, two meta-answers are already
promoted to questions you can vote on:
- Features = which feature would you most like added to the system?
- TalkShowHosts = who is your favorite talk-show host?
So how about you-all give it a try now?
http://TinyURL.Com/NewEco = explanation of the overall idea/plan
http://TinyURL.Com/MayProj = big set of ideas for features to add to the system
http://TinyURL.Com/Portl1 = link to the portal to the live system
Please choose from NewEco or MayProj whatever feature you most
think would be worth my effort to implement, then in Portl1 do a
Turing question to get above 6 seconds of credit then select
Surveys > Features and nominate your favorite (or vote for it if
it's already nominated). Let's get at least ten new users voting
for what they'd most like me to implement, so that the survey will
have at least some significance.

By the way, since I previously posted I've developed a system for
building deeply-nested JOIN-trees in a way that's very easy to
debug, so now I'm comfortable nesting five or more JOINs within a
single query. For example, a single SQL statement JOINs all answers
for a given question with all investments by the master user, thus
filtering *only* the answers to that question that have been
approved already, then JOINs that subset of answers to *all*
investments whatsoever (a one-to-many relationship) then does GROUP
BY the answer and SUM of the investment amounts to achieve
subtotals per answer, then LEFT JOINs that to investments by *this*
one user who is doing the query to get that user's personal
investment in each answer (or 0 if LEFT JOIN gives a NULL
right-side because of no match), and finally sorts the result into
descending sequence per the total investment in that answer. Thus
no PHP array is needed to buffer the totals. I just read each
successive record from the result set and emit the corresponding
HTML directly down the screen in the sequence that MySQL delivers
the records. Here's just one of the functions involved in building
the deeply-nested SQL query described here:

// Given ID# of survey question, and master user#, get SQL query for
// JOIN between copy of each approved answer to that question and
// various corresponding investments, then group by answer ID and
// add investments per group, return ansid,shotit,groupTotal,
// largest total first.
function suridMasuidToqApransXQInvtot($surid,$masuid) {
$que = "SELECT mul.ansid, mul.anssho, SUM(mul.ms) as totinv
FROM (" . suridMasuidToqApransXQJInvs($surid,$masuid) . ") as mul
GROUP BY mul.ansid ORDER BY totinv DESC";
return $que;
}

Robert Maas, http://tinyurl.com/uh3t

unread,
Mar 30, 2010, 7:40:00 AM3/30/10
to
> >> Indexes are usually not mandatory.
> >So it's OK with you if I not bother adding any more indexes than I
> >already have?
> From: gordonb.jl...@burditt.org (Gordon Burditt)

> I'm not the one who's going to complain if your site turns out to
> be unbearably slow, or never goes production in the first place..

It's already in production, (see other message I posted earlier
Tuesday morning).

> What do you have against considering adding indexes as part of
> writing a query?

I don't understand what you said. I thought adding an index to a
table is something you do to *permanently* (but undoably) alter a
table, not something you do or don't do arbitrary during a given
query. I'm not aware you can add a clause like "WITH INDEX" to a
query and have it build an index just for that query and the
discard the index after the query was done (as if *that* could
*possibly* make any query faster overall?).

> Are you familiar with ALTER TABLE?

I've glanced by it from time to time, but never had a use for it
yet. When I was just starting to design this system, I discoverered
after I had CREATEd a table, before I had much code written for it,
before it was populated hardly at all, that I was missing a field I
now realized I needed, so I wrote a PHP script option to DROP the
old table and another PHP script option to CREATE the new table
(with a new name, so that listing the tables on any of the several
of the hosting services I was trying would immediately show me
which tables were still the old version needing to be DROPped and
replaced and which were already upgraded). But once I got past that
early code, I had all of the table definitions just the way I
wanted, and haven't need to re-define them ever again, just CREATE
new tables to support new types of capability. Currently I have
these tables:
$ctcmd="CREATE TABLE const1(
$ctcmd="CREATE TABLE connects(
$ctcmd="CREATE TABLE accounts5(
$ctcmd="CREATE TABLE sessions2(
$ctcmd="CREATE TABLE missing1(
$ctcmd="CREATE TABLE turingip1(
$ctcmd="CREATE TABLE surveys(
$ctcmd="CREATE TABLE surans(
$ctcmd="CREATE TABLE surinvest(
I haven't actually used const1 yet. It will be mostly for
mostly-constant parameters that are uploaded dynamically, such as
public-key cryptosystem products of primes and modular exponents.
accounts (user ID, password, last login, current funds on account,
etc. - switching from keeping track of seconds, my original idea
from experience with CGI/CMUCL which usually took a second or two
just to start up CMUCL and load all the code necessary to run a
script, to milliseconds, because PHP/MySQL scripts typically take
only 5-10 milliseconds, was the only FLAG DAY), 2 versions of
sessions, but all the other tables were just right the very first
time I CREATEd them.

> Unless it's a fairly big table, adding an index can be done from
> the command-line in about 10 seconds.

Command line is not available. The only way to alter a table on a
non-local host (i.e. where I don't have shell login account) is to
write a PHP/MySQL script that does it, which requires security
precautions to make sure nobody who guesses the name of the script
will run it when I don't want it to be run. Thus it takes several
hours to design the script and security precautions, write the
code, FTP-upload from where I edited it to where it will be run,
and test it via Web browser to see if it really does what I wanted
it to do.

> So can removing an index. Or adding another field.

Yup. Several hours to develop a new script-option for each such task.

NO!! I do *not* want to write a generic script that lets me type
just any old SQL command into a HTML form and it'll be executed.
Somebody who happens to guess the name of my script can wreak havoc
on my database without me having a clue why suddenly all my tables
are gone or emptied of data or re-CREATED in a way that is
incompatible with all my scripts.

> Your site needs to have adequate performance under load under
> realistic worst-case conditions: lots of users, lots of data in
> the database, and a heck of a lot more than one question.

If and when I get a million users and things start slowing down as
a thousand of them try to execute queries simultaneously that JOIN
ten million records per user, I'll spend the several hours it
requires to write a script to add an index where it affects
whichever JOIN is taking the longest time repeatedly.

> It is a good idea to test this, before your site goes live,

Too late. It's been online for months, with survey questions almost
fully implemented for the past several days.

> just in case your site is more popular than you ever dreamed it
> would be because of unexpected national news coverage, and
> suddenly becomes unusable.

Twitter went down for two days at a time because they provide
*UNMETERED* usage and a botnet in Russia decided to attack it. I
*METER* all usage, per user labor to get credit to spend running
scripts, so I won't suffer a Twitter-style outage, and if it's
sorta-down for three hours while I write a new add-INDEX script to
make it more efficient, that's tolerable.

Currently, the master key (the numeric ID) in each key is (I
believe) indexed, by default because it's a PRIMARY KEY (correct me
if I'm mistaken). For example:
$ctcmd="CREATE TABLE surans(


id int(8) NOT NULL auto_increment,

surtab int(2) default 0,
idsur int(8) default 0,
shortitl varchar(20) default NULL,
linedesc varchar(70) default NULL,
fulldesc varchar(255) default NULL,

PRIMARY KEY (id),
UNIQUE id (id)

)"; }
An index for field 'id' is automatically created, right??
(I need a definite answer at this point so that I can rest in my assumption.)

In regular usage, all JOINs have such a PRIMARY KEY as one of the
two fields being related per the ON field1 = field1 clause in the
SQL statement. I believe that means it'll be as efficient as
possible, and it wouldn't make it faster to *also* make an index
for the other field in the other table being collated with it via
the ON clause, right?
(I need a definite answer at this point so that I can rest in my assumption.)

There's just one rarely used task where I do a JOIN between two
string fields, neither of which has an index, and of course *that*
could be made faster if one or the other string field is indexed,
right? So I think that's the only field (one or the other of those)
where I might need to add an index if that particular task gets
done often in the future.

> Then you lose most of the user base and get a bad reputation, from
> which your site might not recover.

Twitter was down for two full days, and for major parts of three
more days in a week or so period, and they didn't lose their user
base. Spending a couple hours to write code to add an index to the
one of the two JOINed string fields, just *once* in the whole
lifetime of my project, shouldn't scare away my user base.

> Performance testing serves a useful purpose.

There are more important things to do, like finish the last of the
code for survey questions, solicit more users to try the system,
beg and plead for them to give me feedback about user interface
(how easy is it to figure out what to do and to actually navigate),
and get user opinion which feature they *most* want me to
implement, which is mostly likely to be the "killer application"
that gets me from 3 users to 100 users and then to 1000 users.
Right now my average number of users per day is less than 1. I
gotta get that up to 100, with 10 or 15 near-simultaneous before I
can even *detect* how load affects response time, much less measure
any slowdown.

> Along with that, have you calculated whether you will have enough
> disk space for the database when you have 1,000 users, 10,000
> questions, and a million votes on survey questions?

"Pray tell" how to calculate the total MySQL storage used by a
table with such-and-such definition and N rows in it? Is that info
documented anywhere so that I can plug in the parameters of my
table and calculate the predicted storage usage?

If I go to a semi-public computer which has JavaScript so that I'll
have access to database-management functions, where I can see how
many percent of my 5 gigabytes I'm using already, it'll surely show
less than 1%, which will give me no information whatsoever as to
whether a thousand times as many rows in the tables will still be
under 1% or not.

> Indexes work not only for exact matches. They work for sorting
> and grouping also.

The only place I'm sorting is on a derived field (SUM with GROUP
BY), where of course there'd be no value in building an index. It
has already selected approved answers to just one question, before
sorting, so that'd be at most twenty rows (answers) for a regular
question and maybe a thousand rows for the meta question, and all
numbers being sorted are medium-small integers (less than a
million), so I think sorting by brute-force N * Log(N) algorithm
would be plenty fast enough.

When there get to be more than about ten answers to any given
question, I plan to LIMIT the output to a "page" (segment) of the
sorted results, with MORE button at bottom to show next "page"
(segment). But this won't affect sorting of the whole set of rows,
which happens before MySQL knows where the "page" starts and ends
(which rows to discard, then which rows to return, before it stops
and never even deals with the remaining rows).

> They also work for selecting values between two limits.

I currently have no use for that.

> >> Your site can easily become too slow to be usable without indexes.
> >Since I'm charging the user for each script-run, rounded up to the
> >nearest whole millisecond, and I'm printing out the charge at the
> >end of each script-run, it would be quite apparent if scripts ever
> >run too slow.
> So why not fix it ahead of time, instead of pre-pessimizing your
> site?

Because my time/energy is limited, and I have more important things
to do with my time currently, as I listed above.

> Populate the database with a good set of test data.

I'm not going to take your advice on that because it's too much trouble.

> Use EXPLAIN on each query you will be using often and ensure that
> it's using appropriate indexes.

Is that a clause you can just add to a MySQL query-string that is
invoked from a PHP script via the mysql_query function??

> Are you measuring CPU time or real (wall clock) time for script
> completion?

Wall-clock time, by calling the microtime function once near start
of script-run and again just before exiting and subtracting the two
values.

> If you are measuring wall clock time, 10 users hitting the site
> simultaneously might all have 10 times the normal wall clock run
> time, even if the site is not struggling under the load.

Hmm, do you have any idea how to avoid this problem?

> Also, wall clock run time will include time spent waiting for
> database locks by some other script.

Since I'm ignoring the rare times when a script takes a lot longer
than usual, and capping the charge to customer as 100 ms no matter
how many thousands of milliseconds a given script-run unexpectedly
takes, I'm just looking at "typical" time ignoring those unusual
events to determine if my scripts are inefficient, I don't think
that's a problem.

> This may not become an issue until you've actually got several
> copies of your script running at the same time, which implies
> lots of traffic.

Even with ten copies running simultaneously (on occasion by
accidental massive collision, or like my proposed application to
have several users listen to the same radio station and all try to
click a button immediately when the station starts playing a new
song), timesharing against each other, that's only 50 ms elapsed
time instead of 5 ms usual time, not worth worrying about. In fact
ten people trying to click a button at the same time when the song
starts will be scattered over about a fifth of a second, that's 200
milliseconds, so the ten 5-ms-each script-runs will hardly ever be
overlapping even in twos much less 3 or more simultaneously.

> Both CPU time and wall clock time measurements for script run time
> are not particularly repeatable and consistent measurements, and
> using them to charge users can bring surprises.

That's yet another reason I made a good choice when I capped the
charge at 100 ms per script-run and advertised *that* as the
expected charge, with "lesser charge if you are lucky".

> I wouldn't want to buy gas at a station that will surprise me
> with the price of gas, sometimes $1/gallon, sometimes $10/gallon,
> after I'd pumped it.

How would you feel if the advertised price was $3/gallon, but 95%
of the time you got any particular gallon for only a tenth or less
of that price, so the average over ten gallons was more like $0.50
per gallon?

> >Is that correct? Is it sufficient that just one of the two fields
> >is indexed, or must both of them be indexed? For example, if
> Usually only one field needs to be indexed for a given query.
> Sometimes you want to index both for different reasons (different
> queries).

It seems to me that when I'm JOINing two tables each with ten
thousand rows, and the resultset will actually have a significant
number of those rows present because they *do* match, like maybe a
hundred rows matching, that I really want to have an index for one
of the two columns collated, which I already do.

But if there's a single SELECT * from tablename where id = $rowID
and that's the whole essence of the HTTP/PHP cycle, even if that
one SELECT slows to twenty or thirty milliseconds due to lack of
index it's not going to be noticed by the user. For example, the
user nominates an answer to a question, and that has to be compared
to answers already nominated to see whether this is a duplicate
nomination or a new nomination, hence different PHP code branch to
deal with the two cases (add new row to table if new, amend old
nomination if duplicate title by same user, REJECT nomination if
already approved or if already nominated by somebody else).

(splitting too-long reply here)

Robert Maas, http://tinyurl.com/uh3t

unread,
Mar 30, 2010, 7:43:32 AM3/30/10
to
(resuming too-long reply from before)

> >Eventually I'd support per-user configuration of sorting order for
> >the answers, LIMIT to just the top n answers, and skipping mode.

> From: gordonb.jl...@burditt.org (Gordon Burditt)


> I'm not sure how as a user that will help, unless you implement
> "put the choice the user is going to select on top", which
> is rather difficult without mind-reading.

The most obvious sort orders to implement are:
- Largest global score first = what I have now with that deeply
nested JOIN I described in previous article in this thread
- Least recently nominated (by somebody else) and not yet seen by
this user (in order to avoid missing something new)
- Largest personal score first (in the case where it didn't show up
in the first couple pages per default sorting, because other users
have voted for *other* answers, knocking *my* favorite down in
rank, to give me a chance to quickly find *my* favorites and vote
*more* for them to get them back up in rank)
- After user has entered a search query, best match first (this
will require a stored procedure inside MySQL database so that MySQL
itself can do the distance-from-query calculation and sorting
smallest-distance-first *before* passing the already-sorted result
to PHP)
I can't guess what the user wants, but those are three ideas what
the user might want, plus a search engine for the user to ask for
something explicitly by keywords.

> I wonder how much bias will be introduced by having an answer
> presented in the first 6 vs. having it in the last 4?

You mean the "join the bandwagon, vote for a known winner"
syndrome. I guess some user will want me to add an option to sort
in reverse of the default, present the current underdogs first.

Of course *some* underdogs really are DOGS, so I'll need a way for
the user to checkbox some of the DOGS as NEVER SHOW THIS TO ME EVER
AGAIN. Plus a global mode flag to be "ignore all those never-show,
show everything in default sequence, because I want to know how the
election really is going now".

> >> You may get more than 10 answers with various spellings of the
> >> same guy's name, especially if someone with a hard-to-spell name
> >> like Dan Kwayle runs again.
> >Each user would see only his/her nomination with bad spelling. The
> >approved answer, that *all* users could see, would have only the
> >correct spelling. The primary purpose of censoring nominations is
> >to protect other users from abusive/offensive/obscene language, but
> >it would also serve to correct misspellings.
> I don't recall seeing an "approved" flag in the answer table, to
> indicate whether it should be visible or not.

The master user is not allowed to vote, but instead can only
approve nominations. Approval is kluged by an investment of 1
millisecond by the master user. Yeah, I could add a special field
just to mark approval, but since the master user can't vote anyway,
why not have the investment field have dual meaning?

> That means you will have duplicate answers not only from
> *development*, but also from *users*. In a question about who
> should be elected, if your site is popular, you'll probably have a
> *second* vote for candidate B before you've managed to approve the
> first one.

No. A nomination is implemented as a 1 millisecond entry in the
investment table. The user can't up that to any larger amount until
it's approved. Two different users can "simultaneously" (before the
master user can see either) nominate the same answer. The master
user sees these duplicate answer-nominations to the same question
*together* in a group and can easily approve the one that is
best-worded/spelled and pass a note back to all inferior duplicates
explaining to please use the other (now approved) nomination
instead of your own.

> Suppose they both spell the name correctly.

Then I approve the 1 millisecond nomination from whichever appears
first on my screen, and pass note back to the rest. *Nobody* can
add additional time beyond the 1 millisecond until after it has
been approved, and the 1 millisecond advantage of the successful
nominator compared to 0 millisecond of all the other duplicates is
insignificant. *Every* one of them will want to add at least a
hundred or thousand or more milliseconds as soon as each such
person logs in and sees the approved answer ready for voting, so
they are all equal in chance of getting to vote first, as if that
made any difference in the end. (Well if they are viewing sorted in
default order, the *later* voters will have the advantage, because
the candate will *already* be near the top of the poll. But if they
are viewing in "underdog" mode, the first voter has the advantage.
If they are viewing "oldest not-yet-seen", they are all equal as to
how hard it is to find the item they want to vote for.)

> Then you'll need to transfer the votes from the duplicates to the
> approved answer,

I'm not going to bother fixing a 1 millisecond investment that was
really just a *tag* not an actual significant vote.

> I'll still suggest: if you get a vote for a new answer, and it
> matches an old answer for the same question, you should apply the
> vote to the existing answer, not insert a new one.

That will never happen, except for that 1-millisecond nomination.

> I also have to wonder if you aren't going to have problems
> eventually where you don't know much about the subject of the
> question. For example, you might not know much about school board
> elections in Urkistan, but some of your users do. Unfortunately,
> you don't know that D. Brown and B. Brown are *different*
> candidates (and the top two candidates and married to each other),

Each answer has 3 fields (in addition to ID and link to ID of question):
- Title
- Short description (<80 chars), used to explain the title
- Long description (256 chars), used to pass notes/questions back
and forth between user who made the nomination and the master user
who is considering whether to approve the answer.
There's room for a TinyURL within the short description, so if I
don't know the topic I can ask for a URL and then do some reading
to understand who each candidate really is.

> You also don't know that the o in Brown is supposed to be accented,
> and your keyboard doesn't have that character on it.

Actualy, for most Latin-1 characters, that problem got fixed with I
converted from direct modem-to-modem VT100 dialup using VersaTerm,
to PPP + TELNET using NiftyTelnet. When I started viewing Twitter
updates through NiftyTelnet, I suddenly started seeing accented
characters that VersaTerm hadn't been showing me before. I can even
type them (at great pain on Macintosh keyboard, but Key Caps helps
a lot there) and transmit them as part of my text.

But some Latin-1 characters show just fine but don't transmit
correctly into Lynx FORM, so then I need to pass the text through a
CMUCL program to convert non-USASCII characters to numeric HTML
entity notation.

But when I get beyond Latin-1 I'm "blind". Neither Nifty Telnet nor
the wireless-Web service in my cell-phone can display non-Latin-1
text.

> >When I designed this application, I had in mind that there might be
> >more than one type of question, with grossly different record
> >structure, which however might be presented to users in a somewhat
> >uniform manner. Thus an answer would have to refer not just to the
> >index number of the question it is answering, but to the *table* in
> >which that question resides.
> It is almost always a mistake to put a table name in a field of another
> table.

Yeah. I changed that to be a numeric code instead of a string name
several months ago. The PHP code will dispatch on that code, or SQL
will select different groups of data per that code, but for now the
value of that code is always exactly 1. If you've programmed in C,
think of it as an "enumeration datatype" which is implemented as a
small integer.

> You also have the option (do this VERY CAREFULLY, though) of uploading
> a carefully protected script which will let *YOU* (and not just any
> user or spambot) enter a query into an input field, then run it,
> and return the results.

Absolutely not. Way too dangerous. One mistake and my whole
database has been destroyed and I have no evidence what happened.

Now after I have public-key encryption implemented in PHP (already
done long ago in Lisp), I could PK-sign using my PK for my local
machine and PK-encrypt for the target machine, so nobody could
impersonate my signed command, and nobody eavesdropping could even
know what I'm doing. Note that several months ago I designed and
implemented a way to bootstrap such a system across a public
communication channel without any eavesdropper knowing what I'm
doing nor being able to impersonate me. I do that between PowerLisp
on my Mac and CMUCL on my shell account. I haven't yet translated
my algorithms to PHP, but it should be reasoanably easy when I'm in
need of it, probably for distributed processing rather than remote
secure commandline for MySQL.

> >> Or phpMyAdmin or some other tool that will let you type a query
> >> into a command line or window, then see the result?
> >There ain't no way either oxyhost.com or uuuq.com is going to give
> >me access to phpMyAdmin on their system!!!
> Why not? In a typical hosting setup, your MySQL login gives you
> pretty much unlimited access to *YOUR* database (create and drop
> tables, select, insert, update, and delete data, etc.) and no access
> to anyone else's.

MySQL admin login doesn't work from VT100 lynx because it requires JavaScript.

> For MySQL syntax errors, you should check whether your query
> worked, and if not, call and log what mysql_error() returns.

Yes, I have *always* from day-one checked whether it worked, and
called my own personal abort function if it ever failed, and after
advice from you and/or others here I appended the string that comes
back from mysql_error() after the regular error message:
$delcmd = "DELETE FROM surinvest WHERE id = $invid";
echo "[" . $delcmd . "] ";
$resdel = mysql_query($delcmd);
if (!$resdel) { timeDie("SQL [$delcmd] failed because: "
. mysql_error()); }
echo "SUCCESS.<br>\n";
and when I'm developing the code and I got the SQL syntax wrong, it
really does often give really nice messages explaining what was
wrong with the SQL syntax. Then when it's working and I no longer
need that extra debugging info, it ain't broke so why fix it?? And
the extra storage needed for the no-longer-needed call to
mysql_error() is so trivial I'm not going to spend any human time
to clean it out.

>It even alerts me to undeclared variables
>that I call (and then provides a NULL value for each, which
>promptly breaks the next line of code in most cases, giving me
>additional diagnostics which I can ignore).
>
>But none of the other free PHP/MySQL hosting sites provide that
>diagnostic service by default. On all the others, if there's a
>syntax error, all I see back is a blank screen or a redirection to
>a generic advertisement for the hosting service. Occasionally
>oxyhost.com is down and I need to use a third site for development,
>where no diagnostics are available Do you know if there's any
>PHP/MySQL command that I can use to turn on diagnostics if they are
>off by default, such as on that third site I sometimes need to use
>for development when oxyhost.com is down?

> Have a look at the PHP function error_reporting(). It,
> unfortunately, will not help if your PHP script won't *compile* due
> to a syntax error, since in that situation, error_reporting() won't
> get run, not even if it's first.

Yeah. I got burned by that after my primary development site had
been down for a month and I couldn't wait any longer to get back to
development, and blank screens because of an unassigned variable
that crashed the script very early, or whatever it was, were such a
pain, so I didn't see your info here, re-discovered it myself by a
couple hours of Google search and browsing the search results.

But oxyhost (my now-defunct primary development host) had better
syntax checking and error reporting than error_reporting(E_ALL)
provides, and I really miss that on my secondary site. But using
--SYNTAX-CHECK on my shell account elimiates the pain, so I'm good
now.

> >> What do you want to do in Afghanistan?
> >(Bomb Kabul, kill all Taliban, close border with Pakistan.)
> >> What do you want to do in Mexico?
> >(Bomb Cuidad whatever, kill all drug dealers, close border with USA.)
> >> What do you want to do in Washington, DC?
> >(Impeach Obama, Censor Pelosi, Tax lobbiests.)
> >> What do you want to do at the motorcycle gang convention next week?
> >(Knock over all their bikes, check their immigration status, remove
> > their tattoos.)
> >I don't see a lot of duplicate answers from one question to another.
> "Kill everyone" and "Nuke it" could easily apply to all of those.
> So could "leave immediately".

Hmm. I guess my current design couldn't handle that case. I'd have
to think whether using a compound instead of simple uniqueness
condition will solve the problem. Currently the title of the answer
must be *globally* unique. But perhaps the ID of the question
conjoined with the title of the answer is all that really needs to
be unique. I just need to figure out (if this need ever arises)
what other code would need to be fixed to handle this case.
Not any time soon...

> >Note that I won't allow any offensive/abusive language *ever*, but
> >I will allow pornography to be viewed by anyone confirmed over 18.
> Are you suggesting that "smoke weed" is *pornographic*?

No, it isn't even offensive/abusive, IMO.

Several weeks ago I posted a sequence of Twitter updates
proposing various mnemonics for remembering George Carlin's
seven words you can't say on TV. The first letter of each
word in the mnemonic matched the first letter in the corresponding
forbidden word. Unfortunately I didn't save a copy of my phrases,
and the Twitter search engine goes back only about 1.5 weeks,
so I've lost them. I can only remember:
Seven Pornographc Forbidden C? C?S? M?F? Television
Maybe one of my readers here will write a tool that fetches *all*
my Twitter updates of the past five months, looking for any that
have the words "Seven" and "Television" within each, and show me
where they are??

> I presume there's an intent that the results of a survey be
> useful to someone.

The results of the meta-question (what survey question would you
like to answer) already guides me to promote such meta-answers to
be live questions.

> I have to wonder how useful they will be if the top answer gets
> less than 10% of the investment.

I'm dealing with an even worse problem: With only 0 active users, 2
semi-active users, 1 hardly active user, and myself trying to
restrain myself from swamping the surveys with my *own* preference:
The largest *total* investment by all users for any one answer to
the meat question is less than one second. I had to engage in a
bidding war just to convince my first semi-active user to get up to
900 milliseconds of investment for "Who is your favorite talk-show
host" (I was bidding against him by investing in "What feature
would you most like to have installed in NewEco" with my last bid
at 950 milliseconds, and he hasn't topped that yet).

Come on folks! You can answer a 10 second Turing question to build
your accout balance up to nearly 15 seconds, then invest nearly 10
seconds (ten thousand milliseconds) of that in some survey-answer,
dropping you down to 5 seconds (the lowest you're allowed to drop
your main account when investing in surveys), lather-rinse-repeat
fifty or a hundred times to build up an *IMMENSE* investment in
your favorite answer! (Don't go past about a hundred, because there
are only 114 questions total, and each time you don't log in for
over 4 days you lose as many seconds as the no-login time exceeds 4
days, and then if you drop too low you *need* another Turing
question to get back up to where you can get into the survey
feature again. Keep those 14 spares. You'll need them for sure!
Actually I recommend not using more than about 30 to 50 of the
Turing questions for survey investments, keeping *lots* of spares
for emergency use. Note: Only a few of the questions are worth 10
points. The quickest is 4 and the average is about 7 or 8.)

I want to see EACH user invest at least twenty thousand
milliseconds in each of their favorite answers, with at least ten
users voting for each popular answer, thus two hundred thousand
milliseconds total for each of the most popular answers.
(I can dream??)
My software can handle up to 9999499 milliseconds before the
four-character "floating" point printing routine overflows.
So 200000 ms is nowhere near close to the maximum my program can support.
.001 = 1 ms
.010 = 10 ms
.100 = 100 ms
1.00 = 1000 ms
10.0 = 10000 ms
100. = 100000 ms
1000 = 1000000 ms
9999 = 9999499 ms
xxxxxx 9999500 ms crashes the printing function

pauldavidson

unread,
Aug 14, 2015, 8:10:51 PM8/14/15
to
Great article. Thanks for the info, this is really a helpful post. BTW, if anyone needs to fill out Duplicate Title form, I found a blank form here http://pdf.ac/4Hl9Ik. This site PDFfiller also has several related forms that you might find useful.


0 new messages