UDD contains names where spaces are not stripped

Andreas Tille

unread,

Dec 7, 2023, 4:00:03 AM12/7/23

to

Hi,

by chance I realised that the uploaders table contains some names where names
are not stripped:

udd=> select '"' || u.name || '"' as name_with_spaces, uploader from uploaders u where name like '% ' or name like ' %' ;
name_with_spaces | uploader
--------------------------+-------------------------------------------
" Mehdi Dogguy" | Mehdi Dogguy <me...@debian.org>
" David Paleino" | David Paleino <da...@debian.org>
" Stéphane Glondu" | Stéphane Glondu <glo...@debian.org>
" Stefano Zacchiroli" | Stefano Zacchiroli <za...@debian.org>
" Stefano Zacchiroli" | Stefano Zacchiroli <za...@debian.org>
" Stefano Zacchiroli" | Stefano Zacchiroli <za...@debian.org>
" Stefano Zacchiroli" | Stefano Zacchiroli <za...@debian.org>
" Stefano Zacchiroli" | Stefano Zacchiroli <za...@debian.org>
"Andreas Tille " | Andreas Tille <ti...@debian.org>
" LI Daobing" | LI Daobing <lida...@debian.org>
" David Paleino" | David Paleino <da...@debian.org>
" Stefano Zacchiroli" | Stefano Zacchiroli <za...@debian.org>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org>
" Nikita V. Youshchenko" | Nikita V. Youshchenko <yo...@debian.org>
"Colin Tuckley " | Colin Tuckley <col...@debian.org>
"Colin Tuckley " | Colin Tuckley <col...@debian.org>
"Colin Tuckley " | Colin Tuckley <col...@debian.org>
(20 rows)

This causes slight errors when counting uploads of people. My guess is this
is due to some old importer code (I've checked the hit for my name which
is a pretty old upload). Thus I wonder whether it might be the easiest
fix to simply fix this with some proper UPDATE statement to remove unneeded
spaces. This statement is doing the trick in my local clone:

UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;

If I'm not misleaded historic uploads will not importet from scratch so
this would cure the situation. Otherwise users need to always remember
adding some trim(name) when dealing with the uploaders.name column not
to mention that it gets even harder to deal with the uploader column
that might feature extra spaces in the middle.

What do you think?

Kind regards
Andreas.

--
http://fam-tille.de

Lucas Nussbaum

unread,

Dec 7, 2023, 2:10:04 PM12/7/23

to

Hi,

Uploaders is refreshed every few hours from archive data, so a one-time
UPDATE would not help. UDD usually tries to preserve inaccuracies, so
those might be interesting for QA work.
In your case, why don't you use the email address to identify uploaders?
(possibly combining it with the carnivore data to identify different emails
belonging to the same person ?)

Lucas

Andreas Tille

unread,

Dec 7, 2023, 2:30:04 PM12/7/23

to

> > ...

> > UPDATE uploaders SET name = trim(name), uploader = trim(name) || ' ' || email WHERE name like ' %' or name like '% ' ;
> >
>

> Uploaders is refreshed every few hours from archive data, so a one-time
> UPDATE would not help. UDD usually tries to preserve inaccuracies, so
> those might be interesting for QA work.

OK.

> In your case, why don't you use the email address to identify uploaders?

Since this also does not work:

udd=> SELECT count(*), uploader FROM uploaders WHERE name ilike '%tille%' GROUP BY uploader;
count | uploader
-------+------------------------------------
1 | Andreas Tille <ti...@debian.org>
1 | Andreas Tille <and...@an3as.eu>
8785 | Andreas Tille <ti...@debian.org>
(3 Zeilen)

> (possibly combining it with the carnivore data to identify different emails
> belonging to the same person ?)

I could fiddle around with carnivore but that's overkill for thst
purpose and I insist that not stripping blanks from names does not make
any sense, IMHO. (1 Zeile)

BTW: I found

udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
count | name
-------+---------------
16524 | Andreas Tille
(1 Zeile)

So why do I have 8707 uploads per uploaders but 16524 per upload_history?

Is my assumption wrong that both values should match (modulo some wrongly
spelled names)

Lucas Nussbaum

unread,

Dec 7, 2023, 2:40:04 PM12/7/23

to

If you look at the uploaders table, there are three columns:
- 'uploader', than contains the raw data
- 'name' and 'email' that contain the parsed (and trimmed) data

udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3;
uploader | name | email | count
------------------------------------+-----------------+------------------+-------
Andreas Tille <ti...@debian.org> | Andreas Tille | ti...@debian.org | 8785
Andreas Tille <and...@an3as.eu> | Andreas Tille | and...@an3as.eu | 1
Andreas Tille <ti...@debian.org> | Andreas Tille | ti...@debian.org | 1

So, just use name and/or email?

Lucas

Andreas Tille

unread,

Dec 8, 2023, 1:40:04 AM12/8/23

to

> > BTW: I found
> >
> > udd=> SELECT count(*), name FROM (SELECT CASE WHEN changed_by_name = '' THEN maintainer_name ELSE changed_by_name END AS name FROM upload_history) uh WHERE name ilike '%tille%' group by name;
> > count | name
> > -------+---------------
> > 16524 | Andreas Tille
> > (1 Zeile)
> >
> > So why do I have 8707 uploads per uploaders but 16524 per upload_history?

???

> > Is my assumption wrong that both values should match (modulo some wrongly
> > spelled names)

Could you please comment on these different results?

> If you look at the uploaders table, there are three columns:
> - 'uploader', than contains the raw data
> - 'name' and 'email' that contain the parsed (and trimmed) data
>
> udd=> select uploader, name, email, count(*) from uploaders where uploader ilike '%tille%' group by 1,2,3;
> uploader | name | email | count
> ------------------------------------+-----------------+------------------+-------
> Andreas Tille <ti...@debian.org> | Andreas Tille | ti...@debian.org | 8785
> Andreas Tille <and...@an3as.eu> | Andreas Tille | and...@an3as.eu | 1
> Andreas Tille <ti...@debian.org> | Andreas Tille | ti...@debian.org | 1
>
> So, just use name and/or email?

Well, I do not seek for a solution for this (non-)problem. I simply
think that not stripping values from spaces before injecting these into
UDD is wrong. I simply stumbled upon this when I did the query above.

I stumbled upon another reason which might be even worse:

select distinct done, done_name, done_email, owner, owner_name, owner_email from archived_bugs where done_name like '%"%' or owner_name like '%"%' order by done_name;
done | done_name | done_email | owner | owner_name | owner_email
---------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+-------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+----------------------------------------------
<der...@debian.org> | | der...@debian.org | "van...@gmail.com" <van...@gmail.com> | "van...@gmail.com" | van...@gmail.com
<twe...@debian.org> | | twe...@debian.org | "Varun Hiremath" <varunh...@gmail.com> | "Varun Hiremath" | varunh...@gmail.com
alex...@belikoff.net (Alexander L. Belikoff) | | alex...@belikoff.net | "Alexander L. Belikoff" <alex...@belikoff.net> | "Alexander L. Belikoff" | alex...@belikoff.net
an...@debian.org (Andreas B. Mundt) | | an...@debian.org | "Andreas B. Mundt" <an...@debian.org> | "Andreas B. Mundt" | an...@debian.org
antoine.ro...@gmail.com (Antoine R. Dumont (@ardumont)) | | antoine.ro...@gmail.com | "Antoine R. Dumont" <antoine.ro...@gmail.com> | "Antoine R. Dumont" | antoine.ro...@gmail.com
antoine.ro...@gmail.com (Antoine R. Dumont) | | antoine.ro...@gmail.com | "Antoine R. Dumont" <antoine.ro...@gmail.com> | "Antoine R. Dumont" | antoine.ro...@gmail.com
art...@hell.pl (Artur R. Czechowski) | | art...@hell.pl | "Artur R. Czechowski" <art...@hell.pl> | "Artur R. Czechowski" | art...@hell.pl
...

We have lots of names in probably more than archived_bugs which are not
stripped from '"'. You always find the very same names without the
quotes inside the same table. I think this is similarly wrong and even
more annoying than the spaces.

I wonder where we could sensibly discuss those issues which I consider
bugs in UDD. Would it make sense to add some udd category in
`reportbug other` ?

Lucas Nussbaum

unread,

Dec 8, 2023, 3:00:04 AM12/8/23

to

upload_history contains all uploads ever made to Debian.
uploaders contains packages currently in the suite (not superseded by
another upload)

> > So, just use name and/or email?
>
> Well, I do not seek for a solution for this (non-)problem. I simply
> think that not stripping values from spaces before injecting these into
> UDD is wrong. I simply stumbled upon this when I did the query above.

It has been like that for about 15 years. I'm not sure changing the API
because you think is wrong is a good idea.

> I stumbled upon another reason which might be even worse:
>
> select distinct done, done_name, done_email, owner, owner_name, owner_email from archived_bugs where done_name like '%"%' or owner_name like '%"%' order by done_name;
> done | done_name | done_email | owner | owner_name | owner_email
> ---------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+-------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------+----------------------------------------------
> <der...@debian.org> | | der...@debian.org | "van...@gmail.com" <van...@gmail.com> | "van...@gmail.com" | van...@gmail.com
> <twe...@debian.org> | | twe...@debian.org | "Varun Hiremath" <varunh...@gmail.com> | "Varun Hiremath" | varunh...@gmail.com
> alex...@belikoff.net (Alexander L. Belikoff) | | alex...@belikoff.net | "Alexander L. Belikoff" <alex...@belikoff.net> | "Alexander L. Belikoff" | alex...@belikoff.net
> an...@debian.org (Andreas B. Mundt) | | an...@debian.org | "Andreas B. Mundt" <an...@debian.org> | "Andreas B. Mundt" | an...@debian.org
> antoine.ro...@gmail.com (Antoine R. Dumont (@ardumont)) | | antoine.ro...@gmail.com | "Antoine R. Dumont" <antoine.ro...@gmail.com> | "Antoine R. Dumont" | antoine.ro...@gmail.com
> antoine.ro...@gmail.com (Antoine R. Dumont) | | antoine.ro...@gmail.com | "Antoine R. Dumont" <antoine.ro...@gmail.com> | "Antoine R. Dumont" | antoine.ro...@gmail.com
> art...@hell.pl (Artur R. Czechowski) | | art...@hell.pl | "Artur R. Czechowski" <art...@hell.pl> | "Artur R. Czechowski" | art...@hell.pl
> ...
>
> We have lots of names in probably more than archived_bugs which are not
> stripped from '"'. You always find the very same names without the
> quotes inside the same table. I think this is similarly wrong and even
> more annoying than the spaces.

Use done_email ? It's there for exactly that reason.

> I wonder where we could sensibly discuss those issues which I consider
> bugs in UDD. Would it make sense to add some udd category in
> `reportbug other` ?

There is a udd category in qa.debian.org. See https://wiki.debian.org/UltimateDebianDatabase

Lucas