Where does Xapian get its file names?


Andrei Cristian Petcu

Apr 19, 2016, 2:43:13 PM
to Alaveteli Dev
Hi,

I am restoring my data from my current production server to a test
server and when I run the rebuild_index command I get an error.

RAILS_ENV=production bundle exec rake xapian:rebuild_index models='InfoRequestEvent' verbose=true --trace &> /tmp/rebuild_xapian_issue.txt

The error:
https://my.owndrive.com/index.php/s/H3PjjYCrb81vxdX

The strange thing is that I do not have a directory on my production
server or on my testing server with that name
(/srv/www/releases/20160417215801/cache/attachments_production/a8e).
Where is Xapian trying to get it from? All the directory structure
exists except for a8e.

Thank you,
Andrei Petcu


Louise Crow

Apr 21, 2016, 11:43:48 AM
to alavet...@googlegroups.com
Hi Andrei,

It looks as though the indexing process is trying to regenerate a cached attachment file on the filesystem in order to index it. The filename for this cached attachment is generated by the FoiAttachment class based on a hexdigest of the contents of the attachment [1]. As the email parsing code in Alaveteli has changed over time, it may be that the exact hexdigest will not be the same as the one that was generated on your original server with a previous version of the code. However, if you set the permissions for the attachments_production directory such that it is writable by the application user (who should also be the user running the indexing), then an equivalent cached attachment file should be created, allowing the indexing to proceed.
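
To make the naming scheme concrete, here is a minimal sketch. It is not Alaveteli's actual code. It assumes the digest is MD5 (the 32-character file name in the error path reported later in this thread is consistent with that) and that the subdirectory is the first three characters of the digest, as in df5/df5183611e55cceb8a05720b2432da84. The helper name cached_attachment_path is hypothetical.

```ruby
require 'digest'

# Hypothetical helper, not Alaveteli's implementation: derive a cache path
# from the attachment body. The file is named after the hexdigest of the
# content and stored under a directory named after its first three characters.
def cached_attachment_path(cache_root, body)
  hexdigest = Digest::MD5.hexdigest(body) # MD5 is an assumption (32 hex chars)
  File.join(cache_root, hexdigest[0, 3], hexdigest)
end

puts cached_attachment_path('cache/attachments_production', 'example body')
```

Under this scheme, any change in how an email is parsed that alters the attachment body by even one byte gives a different digest, and so a different directory and file name.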

Cheers

Louise


--
You received this message because you are subscribed to the Google Groups "Alaveteli Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alaveteli-de...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrei Cristian Petcu

Apr 21, 2016, 11:57:14 AM
to alavet...@googlegroups.com
Hi Louise,

Thank you for your reply. I don't think it is a permissions issue.
My user has write access in that folder.

deploy@alaveteli:/srv/www/current$ ls -la /srv/www/releases/20160420203445/cache/attachments_production
total 88
drwxr-xr-x 20 deploy deploy 4096 Apr 21 18:52 .
drwxrwxr-x 4 deploy deploy 4096 Apr 21 00:02 ..
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 0d6
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 17c
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 1ac
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 200
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 224
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 69e
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 75e
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 7f4
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 8e6
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 91d
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 9dc
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 a8e
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 aaa
drwxr-xrwx 401 deploy deploy 12288 Feb 2 19:15 attachments_production
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 b68
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 c99
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 e2c
drwxrwxr-x 2 deploy deploy 4096 Apr 21 18:52 eb7
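
A listing like this can also be checked mechanically. The sketch below assumes that every cache subdirectory should be named with three lowercase hexadecimal characters, as almost all of the entries above are, and reports anything else. The helper name unexpected_entries is hypothetical, and the demonstration runs on a throwaway directory rather than the real cache.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical check: list entries in a cache directory whose names are not
# exactly three lowercase hexadecimal characters.
def unexpected_entries(cache_dir)
  names = Dir.entries(cache_dir) - %w[. ..]
  names.reject { |name| name =~ /\A[0-9a-f]{3}\z/ }.sort
end

# Demonstration on a temporary directory that mimics part of the listing
Dir.mktmpdir do |dir|
  %w[0d6 a8e attachments_production].each do |name|
    FileUtils.mkdir_p(File.join(dir, name))
  end
  p unexpected_entries(dir)
end
```

Run against the directory listed above, a check like this would flag the attachments_production entry that sits alongside the three-character directories.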

I will check the class you suggested.

Andrei Petcu

On 04/21/2016 06:43 PM, Louise Crow wrote:
> [1] https://github.com/mysociety/alaveteli/blob/develop/app/models/foi_attachment.rb#L38

Andrei Cristian Petcu

Apr 23, 2016, 6:04:15 AM
to alavet...@googlegroups.com
I think I figured out the problem. Alaveteli expects files to be unique
and to have different hashes but this is a false assumption. I have
several files with the same hash.

If two files have the same hash, Alaveteli generates a new cached file
for the first attachment and populates it. When the second file is
parsed, it deletes what it thinks is the old file.
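
One way to test a hypothesis like this is to group files by the digest of their contents and see which groups have more than one member. The following is a sketch only: it assumes MD5 as the digest, and duplicate_groups is a hypothetical helper that is not part of Alaveteli. It runs on throwaway files.

```ruby
require 'digest'
require 'tmpdir'

# Hypothetical helper: group the regular files under dir by the MD5 of their
# contents and keep only the groups that contain more than one file.
def duplicate_groups(dir)
  files = Dir.glob(File.join(dir, '**', '*')).select { |f| File.file?(f) }
  files.group_by { |f| Digest::MD5.file(f).hexdigest }
       .select { |_digest, paths| paths.size > 1 }
end

# Demonstration on temporary files: a.pdf and b.pdf share the same bytes
Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'a.pdf'), 'same bytes')
  File.write(File.join(dir, 'b.pdf'), 'same bytes')
  File.write(File.join(dir, 'c.pdf'), 'other bytes')
  duplicate_groups(dir).each do |digest, paths|
    puts "#{digest}: #{paths.map { |path| File.basename(path) }.sort.join(', ')}"
  end
end
```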

I still don't fully understand the issue but I will post more info when
I discover it.

Andrei Petcu

Andrei Cristian Petcu

Apr 28, 2016, 4:08:52 PM
to alavet...@googlegroups.com
I don't think it is due to the duplicate items anymore. I'm not quite
sure what it is.

I made a replica of my server at https://xapian.nuvasuparati.info.
If anybody wants to SSH into the box and take a look around, please
let me know so I can add your SSH key to the server.

How to reproduce:

$ ssh ro...@xapian.nuvasuparati.info
$ sudo su - deploy
$ cd /srv/www/current
$ RAILS_ENV=production bundle exec rake xapian:rebuild_index models='InfoRequestEvent' --trace

You will get an error similar to:
No such file or directory - /srv/www/releases/20160428162258/cache/attachments_production/df5/df5183611e55cceb8a05720b2432da84

I'm really close to having a fully automated Alaveteli setup with Ansible,
but this pesky Xapian issue is blocking me. I hope a fresh pair of
eyes can bring some insight into it.

Thank you,
Andrei Petcu

Louise Crow

May 3, 2016, 4:49:22 AM
to alavet...@googlegroups.com
Hi Andrei,

Could you post the full traceback from that error?

Cheers

Louise


Andrei Cristian Petcu

May 3, 2016, 5:42:30 PM
to alavet...@googlegroups.com
Hmmm... I'm trying to reproduce the issue but I'm now running out of
memory. Here is the error:
https://my.owndrive.com/index.php/s/SvzKbQbl5sXfSEN (download the
complete file)

I'll provision a bigger droplet and see if this happens again (1G of RAM
now). I'll go for 2G.

Thank you,
Andrei

Andrei Cristian Petcu

May 5, 2016, 5:48:47 PM
to alavet...@googlegroups.com
Hi Louise,

Here is the stack trace:
https://my.owndrive.com/index.php/s/kFPblvhfaO9JI4O

I added your ssh public key to the server. If you want to get more
details just ssh ro...@xapian.nuvasuparati.info

Thank you,
Andrei


Andrei Cristian Petcu

May 11, 2016, 1:11:38 PM
to alavet...@googlegroups.com
Hi Louise,

Did you have time to look at the logs? Did anybody have time to look at
them?

Any hint/advice on how to debug this is welcome.

Thank you,
Andrei Petcu

Louise Crow

May 20, 2016, 12:52:06 PM
to alavet...@googlegroups.com
Hi Andrei,

I've had a look at the problem and successfully rebuilt the Xapian index now.

I think the problem from the last stack trace was that the indexing process was having to rebuild all the cached attachments. I'm not sure whether you meant to copy them over from your previous server, but there's an attachments_production directory inside the attachments_production directory, so maybe you copied it across and it ended up one level too deep. When an individual attachment couldn't find its own cache file, it would trigger a reparse of the parent email. Because your previous install of Alaveteli was much older, the reparsed attachments would be different in some cases, so there was still no cache file for the original attachment. I solved this by running a reparse of all incoming messages from the console before rebuilding the index:

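# Run from the Rails console.
# find_each loads records in batches rather than all at once.
IncomingMessage.find_each do |incoming_message|
  # Passing true forces a reparse even if the message was parsed before.
  incoming_message.parse_raw_email!(true)
end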
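
For a repeatable run, the reparse loop above could be wrapped so that one unparseable email does not abort the whole pass. This is a sketch under assumptions, not something from the Alaveteli codebase: reparse_all is a hypothetical helper, and FakeMessage is a stand-in used only so that the example runs outside Rails. The only Alaveteli call it relies on is parse_raw_email!(true), as used above.

```ruby
# Hypothetical wrapper: reparse every message in the collection and collect
# failures instead of stopping at the first error.
def reparse_all(messages)
  failures = []
  messages.each do |message|
    begin
      message.parse_raw_email!(true)
    rescue StandardError => e
      failures << [message, e.message]
    end
  end
  failures
end

# Stand-in for IncomingMessage so the sketch runs without Rails
FakeMessage = Struct.new(:id, :broken) do
  def parse_raw_email!(_force = nil)
    raise 'could not parse' if broken
  end
end

failures = reparse_all([FakeMessage.new(1, false), FakeMessage.new(2, true)])
failures.each { |message, error| puts "message #{message.id}: #{error}" }
```

In the Rails console this could be called with any collection of messages that responds to each, for example IncomingMessage.all, though that loads every record at once and may be a concern on a small server.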

Hope you'll be able to get the site migrated now!

Cheers

Louise



Andrei Petcu

May 20, 2016, 1:14:48 PM
to alavet...@googlegroups.com, Louise Crow
Thank you Louise,
I will check it out this weekend. I have been debugging this issue for quite some time. I'm glad you figured it out. I hope I can automate the steps you did so it will be repeatable.

I'll keep you posted.

Andrei