Sentences deduplication script

50 views
Skip to first unread message

Trang HO

unread,
Nov 19, 2014, 2:56:29 AM11/19/14
to tatoeba...@googlegroups.com
As far as I know, the script to remove duplicate sentences[1] is more or less ready. I still need confirmation from AlanF and gillux about how confident they feel to run the script on the production database because I have not tested it myself.

However, before running the script, we will need to do some changes on the database[2]. These changes require to shut down access to Tatoeba temporarily, so we should ideally to set up some maintenance page.
We do have one[3], but it's a bit outdated, and it's not translated nor translatable[4].
Another thing is that the current implementation of the maintenance page requires to have Tatoeba still running. It's just that every URL would redirect to the maintenance page[5].

So my first question is: do we keep the current implementation or do we need to consider another implementation? Any thoughts?

Gilles Bedel

unread,
Nov 19, 2014, 4:19:09 AM11/19/14
to tatoeba...@googlegroups.com
On Wed, 19 Nov 2014 08:56:08 +0100
Trang HO <tr...@tatoeba.org> wrote:

> So my first question is: do we keep the current implementation or do
> we need to consider another implementation? Any thoughts?

I never saw the current implementation (using CakePHP routes) working
live, but if I understand correctly, it redirects everything using an
HTTP 301, which is very very bad because it tells crawlers the page
permanently moved, i.e. the new URL is meant to replace the old. We
mean a temporary redirect instead (HTTP 302/303). Unfortunately,
CakePHP 1.3 doesn’t seem to allow us to tweak the return code[1] (note
that CakePHP 2.0 does [2]). So the current implementation is a no-go
for me.

The “right way” is to return a page with a 503 code. That’s I did
during the last migration, by crafting a page and editing nginx’s
config to always return that page with 503. I think that not having the
Tatoeba code running during maintenance is a benefit, because if
Tatoeba or PHP is broken, the maintenance page isn’t. Nginx can handle
localized pages as well[3].

So how about something like this. Write a PHP script that 1) contains
the maintenance page strings so that they end up in the POT file and 2)
generates one html maintenance page per language + an nginx config file
to be included in the nginx site config.

If such a script is too much burden, we can still stick the single
multilingual page just like we did during the last migration.

[1] http://api.cakephp.org/1.3/class-Router.html#_connect
[2] http://book.cakephp.org/2.0/en/development/routing.html#redirect-routing
[3] https://stackoverflow.com/questions/3657614/how-to-rewrite-location-in-nginx-depending-on-the-client-browsers-language

— gillux

Trang HO

unread,
Nov 19, 2014, 7:14:27 AM11/19/14
to tatoeba...@googlegroups.com
Thanks gillux.
I guess we'll go with one HTML page that includes the text in several languages.

Next question: is the dedupliation script ready to run on prod?
If everything's good, then we can perhaps plan to run it on November 29. Or is that too soon?


Gilles Bedel

unread,
Nov 19, 2014, 6:34:41 PM11/19/14
to tatoeba...@googlegroups.com
On Wed, 19 Nov 2014 13:14:07 +0100
Trang HO <tr...@tatoeba.org> wrote:

> Next question: is the dedupliation script ready to run on prod?
> If everything's good, then we can perhaps plan to run it on November
> 29. Or is that too soon?

I’d just like to do some more testing, so the 29th is far enough. I’ll
try to install a dev website on the server (#494) today. If it is all
good, I’ll run the deduplication script on it and tell people on the
Wall to check it out. I don’t think we could run into any more trouble
if it runs without problems on the dev website.

— gillux

Trang HO

unread,
Nov 20, 2014, 6:59:10 AM11/20/14
to tatoeba...@googlegroups.com
Sacredceltic is reporting that Tatoeba is very slow today. I'm noticing the same thing.
I suppose that's due to the deduplication script running on the dev version at the moment?


--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.
To view this discussion on the web visit https://groups.google.com/d/msgid/tatoebaproject/20141120083318.e168bd24225b492c42fdf67e%40free.fr.
For more options, visit https://groups.google.com/d/optout.

Gilles Bedel

unread,
Nov 20, 2014, 10:49:53 AM11/20/14
to tatoeba...@googlegroups.com
On Thu, 20 Nov 2014 12:58:48 +0100
Trang HO <tr...@tatoeba.org> wrote:

> Sacredceltic is reporting that Tatoeba is very slow today. I'm
> noticing the same thing.
> I suppose that's due to the deduplication script running on the dev
> version at the moment?

No, not yet. I investigated a bit, and it sounds like the
pm.max_children value of nginx was bottlenecking requests. I saw many
of theses in /var/log/php5-fpm.log:
WARNING: [pool tatoeba] server reached pm.max_children setting (5),
consider raising it

I raised it to 16 and then I got:
WARNING: [pool tatoeba] seems busy (you may need to increase
pm.start_servers, or pm.min/max_spare_servers), spawning 32 children,
there are 0 idle, and 14 total children

So I raised these too. No more warnings and the site feels responsive
again. Note that according to Google Analytics, Tatoeba has been
getting steadily more and more popular over the last ten days: about
+60% of users/sessions/page views/etc. What puzzles me though is why
didn’t we notice it before, whereas php5-fpm was crying for more than a
month. Or did you?

— gillux

lool0 -

unread,
Nov 21, 2014, 12:39:50 AM11/21/14
to tatoeba...@googlegroups.com
> What puzzles me though is why
> didn’t we notice it before, whereas php5-fpm was crying for more than a
> month. Or did you?

we didn't because it wasn't noticeable to the users. yesterday there was a massive memory leak in sinoparserd and nihongoparserd. I took a look and restarted the daemons and watched the logs and cpu activity for a bit, but didn't want to change anything in production and was too tired to post here. btw, mysql logging is disabled I hope someone can re-enable it asap. at least the slow query logging. also, change the default prod nginx config on the repo so we're more aware of what changed in production.

Gilles Bedel

unread,
Nov 21, 2014, 12:04:12 PM11/21/14
to tatoeba...@googlegroups.com
Quick update.

I ran the InnoDB migration script on the development website. All good,
and it took 3 hours. I’m now running the deduplication script, and I
feel like it’s gonna run for quite some time. Of course, the production
server is still running and slowing everything down, so it will be
quicker when we’ll run it for real on the production server, but still.

— gillux

Gilles Bedel

unread,
Nov 21, 2014, 8:46:50 PM11/21/14
to tatoeba...@googlegroups.com
Giving the deduplication script is still running after 9.5 hours, I’m
considering splitting this into several passes. We could shut down the
site between like 0 h and 8 h (GMT+1) (when the site has less traffic
according to Google Analytics) several times. @lool0 can we use the -i
option of the dedup script for that?

I temporarily disabled the weekly sphinx reindex that was about to run
in a few hours.

— gillux

Gilles Bedel

unread,
Nov 21, 2014, 9:52:42 PM11/21/14
to tatoeba...@googlegroups.com
It finally completed!

# /usr/bin/time -v python manage.py deduplicate -c -b Horus
Running full scan at 2014-11-21 06:48 PM UTC
Running full table scan in 10 queries
Running duplicate filtering on sentence range: [1L, 364536L]
OK
Running duplicate filtering on sentence range: [364537L, 729072L]
OK
Running duplicate filtering on sentence range: [729073L, 1093608L]
OK
Running duplicate filtering on sentence range: [1093609L, 1458144L]
OK
Running duplicate filtering on sentence range: [1458145L, 1822680L]
OK
Running duplicate filtering on sentence range: [1822681L, 2187216L]
OK
Running duplicate filtering on sentence range: [2187217L, 2551752L]
OK
Running duplicate filtering on sentence range: [2551753L, 2916288L]
OK
Running duplicate filtering on sentence range: [2916289L, 3280824L]
OK
Running duplicate filtering on sentence range: [3280825L, 3645360L]
OK
OK full table scan and filtering done 3458882 duplicate sets found
Running deduplication step
OK 52177 sentences merged into 46183 sentences
Running verification step
All audio intact?
NO
All duplicates removed?
YES
All merged sentences intact?
YES
No links refer to deleted duplicates?
NO
Deduplication finished running successfully at 2014-11-22 03:51 AM UTC,
see full log at: http://downloads.tatoeba.org/dedup-2014-11-21T18:48.log
Command being timed: "python manage.py deduplicate -c -b Horus"
User time (seconds): 1364.96
System time (seconds): 48.13
Percent of CPU this job got: 4%
Elapsed (wall clock) time (h:mm:ss or m:ss): 9:03:05
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3733424
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 36
Minor (reclaiming a frame) page faults: 944403
Voluntary context switches: 3387042
Involuntary context switches: 179019
Swaps: 0
File system inputs: 3392
File system outputs: 465720
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

— gillux

lool0 -

unread,
Nov 22, 2014, 3:17:05 AM11/22/14
to tatoeba...@googlegroups.com
>can we use the -i option of the dedup script for that?

I don't recommend it unless we have a (text, lang) index or something. It just seems like it's an initial price we have to pay, then continuous small runs would be cheap with -i even without an index. In the current state, I suspect it might take a few hundred -i runs over a couple of days since you'll need to make 50k full table scans but you're welcome to try. btw, can we get access to the dedup log, it seems like the script hit audio duplicates and for some reason has left dangling links, I'd like to investigate further.

lool0

Gilles Bedel

unread,
Nov 22, 2014, 9:51:20 AM11/22/14
to tatoeba...@googlegroups.com
On Sat, 22 Nov 2014 11:17:05 +0300
lool0 - <saeb...@hotmail.com> wrote:

> btw, can we get access to the dedup
> log, it seems like the script hit audio duplicates and for some
> reason has left dangling links, I'd like to investigate further.

It’s lying in /var/www-dev/docs/tatoeba2-django/. Please have a look at
the issue #504 I opened too.

— gillux

Trang HO

unread,
Nov 29, 2014, 6:22:02 AM11/29/14
to tatoeba...@googlegroups.com
The migration to InnoDB is done so we can move on to the deduplication whenever it's ready.
Note: there is still a `wall2` table, I'm not sure why we have this table. Does anyone know? I assume we can delete it but I left it there in case.

Now, for the deduplication script, there are still these issues:
https://github.com/Tatoeba/tatoeba2/milestones/Sentences%20deduplication

Considering that I'm not much more available this weekend, it's probably better to take another week before we run the script. But @gillux, if you feel it can be done this weekend, then you can go ahead ^^



--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.

CK @ Tatoeba Project

unread,
Nov 29, 2014, 6:29:24 AM11/29/14
to tatoeba...@googlegroups.com
1. I glanced over the tickets and couldn't easily tell.

Does this duplicate-merging script already handle lists?
In other words, will all sentences I've approved on on OK lists still going to be on the lists, and are all sentences I've marked as "not to use" on my lists still going to be on those lists?

2. I'd suggest not using the English word "deduplication" for this script.  In my understanding of that word, it's not really what is happening.


I'd suggest in the future to use:

Duplicate-mergiing
or
Duplicate clean up



Gilles Bedel

unread,
Nov 30, 2014, 12:24:09 AM11/30/14
to tatoeba...@googlegroups.com
On Sat, 29 Nov 2014 20:29:23 +0900
"CK @ Tatoeba Project" <c...@tatoeba.org> wrote:

> 1. I glanced over the tickets and couldn't easily tell.
>
> Does this duplicate-merging script already handle lists?

Yes, it does. As well as comments, tags, favourites, logs and links.

> 2. I'd suggest not using the English word "deduplication" for this
> script. In my understanding of that word, it's not really what is
> happening.
>
> http://en.wikipedia.org/wiki/Data_deduplication
>
> I'd suggest in the future to use:
>
> Duplicate-mergiing
> or
> Duplicate clean up

Noted! I think it’s difficult to change words after people start using
them. :-) Maybe it’s fine to keep using it as long as it’s in the
context of Tatoeba’s development. Please tell us if you see this word
displayed anywhere else to contributors.

— gillux

CK @ Tatoeba Project

unread,
Nov 30, 2014, 8:18:55 PM11/30/14
to tatoeba...@googlegroups.com
Today, it occurred to me that the duplicate-merging script might mess up Jim Breen's Japanese annotations.  This is something that's hidden from most people and maybe the programmers aren't aware that they exist.

This is something you should probably consider before running the script.

Does the script move that data along with merging duplicates?
(I assume that in actuality very few, if any of these will be affected at this time, but in the future more and more may be affected.)


                — gillux

--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.

Gilles Bedel

unread,
Dec 1, 2014, 1:56:11 AM12/1/14
to tatoeba...@googlegroups.com
On Mon, 1 Dec 2014 10:18:55 +0900
"CK @ Tatoeba Project" <c...@tatoeba.org> wrote:

> Today, it occurred to me that the duplicate-merging script might mess
> up Jim Breen's Japanese annotations. This is something that's hidden
> from most people and maybe the programmers aren't aware that they
> exist.

Yes, the script is supposed to take care of sentences annotations
already. Though I’m not familiar with these and I can’t say if the
result is consistent. Someone eventually need to check this on
http://dev.tatoeba.org/.

I compiled a list of sentences ids in which annotations have been
merged.

merged sentence id: duplicate sentences
---------------------------------------
3596583: 204333
108652: 108650
2976776: 187670
2871629: 195502
77305: 226781
112280: 112288
2974407: 218415
3034176: 144614
77956: 77957
234171: 231710
119076: 188199
2714640: 90230
206008: 206059
2797160: 183369
148112: 195345
3238319: 217019
1502040: 204046
3401625: 187519
2684823: 80458
3221181: 157547
3477255: 219594
2850263: 171836
2946674: 103869
2976593: 198287
3621822: 145021
3101242: 228249
204035: 204202
2879676: 218306
119384: 119386
3357826: 137258
111845: 111848
3459026: 3468432, 194246
2505715: 216002
3360145: 215829
3003382: 174785

— gillux

Trang HO

unread,
Dec 1, 2014, 6:57:46 AM12/1/14
to tatoeba...@googlegroups.com
I think there's actually something missing for the sentence_annotations.







                — gillux

--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.

Trang HO

unread,
Dec 11, 2014, 7:54:25 PM12/11/14
to tatoeba...@googlegroups.com
Tell me if I'm wrong but I think the remaining issues are done. They're still open, but I'd rather leave them open until we can check them on the dev.
https://github.com/Tatoeba/tatoeba2/milestones/Sentences%20deduplication

@gillux, @lool0, could one of you run once more the deduplication script on the dev before the weekend? (hopefully for the last time)

Do you think it's reasonable to plan on running the script on prod this weekend?
Or are there other issues we should take care of?



Trang HO

unread,
Dec 13, 2014, 11:05:28 PM12/13/14
to tatoeba...@googlegroups.com
I've just started the deduplication script on the dev.

By the way, is it normal that it took a reaallly long time to sync the dev?
I didn't time it, but it took perhaps 6 hours.

Trang HO

unread,
Dec 13, 2014, 11:11:46 PM12/13/14
to tatoeba...@googlegroups.com
Another thing. Maybe we should fix the date on the server before we run the deduplication on prod...
https://github.com/Tatoeba/tatoeba2/issues/478

Right now the latest "contributions" of Horus are displayed as "one hour ago".

Trang HO

unread,
Dec 14, 2014, 12:02:54 AM12/14/14
to tatoeba...@googlegroups.com
Ran into some IntegrityError...

IntegrityError: (1062, "Duplicate entry '3253-401801' for key 'list_id'")
--> https://gist.github.com/trang/301a01beca5712693a74

$ tail dedup-2014-12-14T05\:58.log
--> https://gist.github.com/trang/3005d171998450f40b4a

lool0 -

unread,
Dec 14, 2014, 11:59:44 AM12/14/14
to tatoeba...@googlegroups.com
I fixed those issues and started the dedup script on the dev in a screen. This run is looking good so far.

lool0

Trang HO

unread,
Dec 15, 2014, 9:18:26 AM12/15/14
to tatoeba...@googlegroups.com
Is it on purpose that comments on duplicates are copied rather than moved to the main sentence?

If you look at http://dev.tatoeba.org/eng/sentences/show/3550769 for instance, there is still the comment from CK but the message from Horus says "If there were any linked translations, tags or comments here, they have been moved to #1926402", so you wouldn't expect any other comment to be left in there.


On Sun, Dec 14, 2014 at 5:59 PM, lool0 - <saeb...@hotmail.com> wrote:
I fixed those issues and started the dedup script on the dev in a screen. This run is looking good so far.

lool0

--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.

lool0 -

unread,
Dec 15, 2014, 12:34:52 PM12/15/14
to tatoeba...@googlegroups.com
>Is it on purpose that comments on duplicates are copied rather than moved to the main >sentence?
>If you look at http://dev.tatoeba.org/eng/sentences/show/3550769 for instance, there is still the comment from >CK but the message from Horus says "If there were any linked translations, tags or comments here, they >have been moved to #1926402", so you wouldn't expect any other comment to be left in there.

yes, early discussions about this settled for duplicating the logs and comments instead of updating them. I guess this is partially what makes the links in the logs look so confusing (https://github.com/Tatoeba/tatoeba2/issues/535), some of them are copied verbatim from the duplicates. Changing this isn't hard, we can use update_merge for comments and not do anything for the logs.

lool0

CK @ Tatoeba Project

unread,
Dec 16, 2014, 2:40:10 AM12/16/14
to tatoeba...@googlegroups.com
I agree with Trang that the only comment left on the deleted one, should be a comment that let's people find the non-deleted original.

This way, a project like mine that links directly to every sentence, will still allow users to find the existing sentence, even before I rebuild pages.

If other comments are on the page, especially if there are many, people wouldn't likely scroll down the page and realize the "redirect" link was there.



--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.

CK @ Tatoeba Project

unread,
Dec 16, 2014, 1:41:34 PM12/16/14
to tatoeba...@googlegroups.com
Perhaps since the duplicate-merging bot is likely to only "speak English," maybe the message should be re-worded in such a way that it is very easy to understand even for people with a very low understanding of English.  

A simple "Go to #00000" first, would make it very obvious to anyone that they should click that link to find what they were looking for.



FROM:

This sentence has been merged with #260105 because it was a duplicate of it. If there were any linked translations, tags or comments here, they have been moved to #260105.



TO:  (Adding the extra white space, might make it easier to read, too.)

Go to 
#260105.

This sentence was a duplicate.

If there were any linked translations, tags or comments here, they have been moved to that page.




OR: (This actually might be enough.)

Go to #260105.

This sentence was a duplicate.





OR:

Go to #260105.

This sentence was a duplicate.
(Go to my profile to find out more information.)




Trang HO

unread,
Dec 16, 2014, 2:28:53 PM12/16/14
to tatoeba...@googlegroups.com

OR: (This actually might be enough.)

Go to #260105.

This sentence was a duplicate.

Yes I think this would be enough. There is no real need to tell people that comments have been moved if we copy them. And I suppose they will assume that the tags and translations were re-linked to the other sentence.

To be honest I think just this would be enough:

    This sentence was a duplicate of #260105.

Or even:
   
    Duplicate of #260105.

CK @ Tatoeba Project

unread,
Dec 16, 2014, 2:41:10 PM12/16/14
to tatoeba...@googlegroups.com
The reason I suggested "Go to ...." is that likely the only people that will see the message are people that arrived on the page by being directed there from another website.    All they really want to do is find the page with the sentence.

The English words "go" and "to" are very simple and easier to understand than "sentence" and "duplicate."

At least in Japan, probably even elementary school students who haven't yet really even formally studied English would be able to figure out "Go to ....".



--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.

Trang HO

unread,
Dec 16, 2014, 3:34:00 PM12/16/14
to tatoeba...@googlegroups.com
The reason I suggested "Go to ...." is that likely the only people that will see the message are people that arrived on the page by being directed there from another website.    All they really want to do is find the page with the sentence.

​We could use an arrow symbol then.

    → #260105​

    This sentence was a duplicate.

Or

    Duplicate of → #260105​

CK @ Tatoeba Project

unread,
Dec 16, 2014, 7:43:30 PM12/16/14
to tatoeba...@googlegroups.com
Perhaps even drop any reference to "duplicate."

The only reason to explain would be for members, who would soon realize it was a duplicate as soon as they clicked the link.  New visitors wouldn't really need that information.

Maybe one of these would be enough.

  → #260105​

Or

Go to  #260105​

Or

 Go to → #260105​





--
You received this message because you are subscribed to the Google Groups "tatoeba" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tatoebaprojec...@googlegroups.com.
To post to this group, send email to tatoeba...@googlegroups.com.
Visit this group at http://groups.google.com/group/tatoebaproject.

CK @ Tatoeba Project

unread,
Dec 17, 2014, 7:22:11 AM12/17/14
to tatoeba...@googlegroups.com
Re: Considering that it will be tedious to change the text of the messages added by Horus after the deduplication ...

I wonder if there is a possibility of adding something unique that could be replaced with text.

For example, whenever the following was encountered within a comment....

###123

... it would be replaced with ... 

Go to #123

Or:

  → #123

Or any other way you may decide to do it in the future.

In the future, you may think of a better way to do it, so it would only require a simple change in one place, rather than in every comment.


You already have #123 being replace with a link, so you code use the same type of code to accomplish this.

It would even open up the possibility of having this in various languages sometime in the future, if you decided that was a good idea. I assume you could even substitute it with an image icon, too, if so desired.






Trang HO

unread,
Dec 17, 2014, 9:03:56 AM12/17/14
to tatoeba...@googlegroups.com

I wonder if there is a possibility of adding something unique that could be replaced with text.

It's possible. But the problem is not so the uniqueness of the text. Whether it's "###123" or "Go to #123", we can detect in both cases the pattern and do the necessary replacement.
We can definitely use some codified text, but then we would need to convert it, and the implementation of the conversion is not in the scope of the deduplication. It's some improvement for later, if anyone ever wants to take care of it, but it's not a must.
Until there is such an implementation, the messages have to be written in a way that is readable by humans.



CK @ Tatoeba Project

unread,
Dec 17, 2014, 9:07:30 AM12/17/14
to tatoeba...@googlegroups.com
True.  You're right.

Maybe just a simple "Go to #123." might be the best.

Some members use the arrow in comments, and it may not be unique enough.