Help to Parse a VARCHAR field.

Tim Davidge

unread,

Jan 10, 2003, 12:56:47 AM1/10/03

to

Folks,

I have a text field in a table that contains a continous value of up to 250
characters. The values are separated by the character "+" [ie:
test+this+that ...]. I need to be able to parse each of the values by the
"+" and to count the number of the common values for all records in the
table.

There are 100 elements that can be either of two conditions. Therefore we
may see [1_a+52_b+87_a+94_a] as one record. I would need to be able to count
the number of records that each of the values [ie: 87] appears in for each
of the distinct values.

This is well beyond my mere basic SQL skills, so any assistance would be
greatly appreciated.

Many thanks.

TD

Knut Stolze

unread,

Jan 10, 2003, 4:07:31 AM1/10/03

to

Tim Davidge wrote on Friday 10 January 2003 06:56:

> Folks,
>
> I have a text field in a table that contains a continous value of up to 250
> characters. The values are separated by the character "+" [ie:
> test+this+that ...]. I need to be able to parse each of the values by the
> "+" and to count the number of the common values for all records in the
> table.
>
> There are 100 elements that can be either of two conditions. Therefore we
> may see [1_a+52_b+87_a+94_a] as one record. I would need to be able to count
> the number of records that each of the values [ie: 87] appears in for each
> of the distinct values.

How about this:

CREATE FUNCTION elements ( string CLOB(64K) )
RETURNS TABLE ( ordinal INTEGER, index INTEGER )
LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
CONTAINS SQL
RETURN
WITH t(ordinal, index) AS
( VALUES ( 0, 0 )
UNION ALL
SELECT ordinal+1, COALESCE(NULLIF(
LOCATE('+', string, index+1), 0), LENGTH(string)+1)
FROM t
-- to prevent a warning condition for infinite recursion
WHERE ordinal < 500 AND
LOCATE('+', string, index+1) <> 0 )
SELECT ordinal, index
FROM t
;

This function returns a table with two columns. The first column is a
counter, and the second contains the offset where each of the '+' signs can
be found in the string.

SELECT *
FROM TABLE ( elements('1_a+52_b+87_a+94_a') ) AS t;

ORDINAL INDEX
----------- -----------
0 0
1 4
2 9
3 14

4 record(s) selected.

And this table can now be joined with your table containing the strings to be
parsed and you can do any further processing you want. If you just want to
count the elements, do this:

SELECT yourTable.recordNumber, COUNT(*)
FROM yourTable, TABLE ( elements(<yourColumn>) ) AS t
GROUP BY yourTable.recordNumber;

--
Knut Stolze
DB2 Spatial Extender
IBM Germany / University of Jena

Knut Stolze

unread,

Jan 15, 2003, 6:20:05 AM1/15/03

to

Knut Stolze wrote on Friday 10 January 2003 10:07:

> How about this:
>
> CREATE FUNCTION elements ( string CLOB(64K) )
> RETURNS TABLE ( ordinal INTEGER, index INTEGER )
> LANGUAGE SQL
> DETERMINISTIC
> NO EXTERNAL ACTION
> CONTAINS SQL
> RETURN
> WITH t(ordinal, index) AS
> ( VALUES ( 0, 0 )
> UNION ALL
> SELECT ordinal+1, COALESCE(NULLIF(
> LOCATE('+', string, index+1), 0), LENGTH(string)+1)
> FROM t
> -- to prevent a warning condition for infinite recursion
> WHERE ordinal < 500 AND
> LOCATE('+', string, index+1) <> 0 )
> SELECT ordinal, index
> FROM t
> ;

Because the question came up what you can do with this, here is a more lengthy
response:

CREATE FUNCTION elemIdx ( string CLOB(64K) )

RETURNS TABLE ( ordinal INTEGER, index INTEGER )
LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
CONTAINS SQL
RETURN
WITH t(ordinal, index) AS
( VALUES ( 0, 0 )
UNION ALL
SELECT ordinal+1, COALESCE(NULLIF(
LOCATE('+', string, index+1), 0), LENGTH(string)+1)
FROM t
-- to prevent a warning condition for infinite recursion
WHERE ordinal < 500 AND
LOCATE('+', string, index+1) <> 0 )
SELECT ordinal, index
FROM t

UNION
SELECT MAX(ordinal)+1, LENGTH(string)+1
FROM t
;

Please note that I extended the function to return an additional
ordinal/index pair, pointing to the end of the string. This is needed
below in order to have an upper bound for the SUBSTR() funnction.

Here we have a function that uses the results from the above function and
returns all the elements you are interested in:

CREATE FUNCTION elements ( string CLOB(64K) )

RETURNS TABLE ( elements CLOB(64K) )

LANGUAGE SQL
DETERMINISTIC
NO EXTERNAL ACTION
CONTAINS SQL
RETURN

SELECT SUBSTR(string, t1.index+1, t2.index - t1.index - 1)
-- the join below makes sure that we have the lower and upper
-- index where we can find each of the '+' separating the
-- elements. (For this, we need the additional indexes
-- pointing to the beginning and end of the string)
FROM TABLE ( elemIdx(string) ) AS t1,
TABLE ( elemIdx(string) ) AS t2
WHERE t2.ordinal = t1.ordinal+1
;

This function uses "elemIdx" to determine the upper/lower bounds for each of
the elements and then cuts each element out of the string. The result is
returned in a table:

SELECT varchar(elem, 30)
FROM TABLE ( elements('abc+def+ghi') ) AS x(elem)

1
------------------------------
abc
def
dhi

3 record(s) selected.

Counting how many distinct elements you have, just do this:

SELECT COUNT(DISTINCT varchar(elem, 1000))
FROM TABLE ( elements('abc+def+abc+ghi') ) AS x(elem)

And now I assume you have all your strings in a table like this:

CREATE TABLE t ( id INT NOT NULL PRIMARY KEY, string CLOB(64K );
INSERT INTO t VALUES ( 1, 'abc+def+ghi' );
INSERT INTO t VALUES ( 2, 'xyz+nmo+xyz+abc+xyz+def' );

Here you have to do a join and do the grouping to count the distinct elements
for each of the strings:

SELECT COUNT(DISTINCT varchar(elem, 1000))
FROM t, TABLE ( elements(t1.string) ) AS x(elem)
GROUP BY t1.id;

1
-----------
3
4

2 record(s) selected.

If you want to know how often the element 'xyz' appears in the strings in
table t, you can use this:

SELECT id, COUNT(*)
FROM t1, TABLE ( elements(t1.string) ) AS x(elem)
WHERE varchar(elem, 1000) = 'xyz'
GROUP BY t1.id;

I think that joins using table functions are not that straight forward. They
work differently than normal joins. The processing goes something like this:
(1) fetch a row from the table t
(2) give this row to the table function elements
(3) all rows returned by elements for this single input parameter are now
(4) related to the row fetched from t in step 1 (and not to any other row)
(5) go to step 1 and fetch the next row there

p.s: All the "varchar(..., 1000)" are necessary because my "elements"
function above returns a CLOB(64K). If that function would return a
VARCHAR(32K) or so, then the casts would not be needed.