[Dspace-tech] too many open files


Jose Blanco

Aug 24, 2015, 4:19:02 PM
to dspac...@lists.sourceforge.net

We are getting this error:

 

2006-12-04 04:47:04 ApplicationDispatcher[]: Servlet internal-error is currently unavailable

2006-12-04 04:48:24 StandardWrapperValve[bitstream]: Servlet.service() for servlet bitstream threw exception

java.io.FileNotFoundException: /l1/dspace/repository/prod/assetstore/14/58/45/145845832063558580850369699477251654488 (Too many open files)

 

 

2006-12-04 04:46:06 StandardWrapperValve[bitstream]: Servlet.service() for servlet bitstream threw exception

java.net.SocketException: Connection reset

 

2006-12-04 04:45:02 StandardWrapperValve[bitstream]: Servlet.service() for servlet bitstream threw exception

java.net.SocketException: Connection reset

 

 

These errors happen in the evening when Google is crawling our repository, and I believe they are bringing Tomcat down.  I get the impression that some PDF files are not being closed in the /tmp dir.  When Tomcat is restarted, all works well.
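
One way to confirm this would be to watch the descriptor count on the Tomcat process while the crawl is running; a rough sketch (the pgrep filter assumes Tomcat runs as the dspace user and may need adjusting for your setup):

# find Tomcat's PID and count its open file descriptors via /proc
TOMCAT_PID=$(pgrep -u dspace -f Bootstrap | head -n 1)
ls /proc/$TOMCAT_PID/fd | wc -l

# or, with lsof, see what is actually being held open
lsof -p $TOMCAT_PID | wc -l

If that count climbs toward the per-process limit during the crawl, it would match the FileNotFoundException above.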

 

Does anyone have any suggestions?

 

Thanks!

 

Jose

Jose Blanco

Aug 24, 2015, 4:19:07 PM
to dspac...@lists.sourceforge.net, dspace-...@mit.edu

A day ago I posted that we were getting the “too many open files” error, and today I found this thread discussing it:

 

http://sourceforge.net/mailarchive/forum.php?forum_id=39921&max_rows=25&style=flat&viewmonth=200408

 

I’m a bit confused as to what I need to do.  I have version 1.4 of DSpace, but I’m not sure what version of Lucene I have.  Can someone tell me how I can find that out?  Do I need to get the latest version of Lucene and run ./filter-media with a -f switch to force all items to be re-indexed to create compound files and get rid of this error?

 

Thanks!

 

Jose

Jose Blanco

Aug 24, 2015, 4:19:08 PM
to Mark Diggory, dspac...@lists.sourceforge.net, dspace-...@mit.edu

Mark:

 

Thanks for answering this question.

 

We run index-all nightly, and when I go to the <dspace>/search dir this is what I see:

 

-bash-3.00$ pwd

/l1/dspace/repository/prod/search

-bash-3.00$ ls -la

total 2102880

drwxr-xr-x   2 dspace dspace       4096 Dec  5 06:07 .

drwxr-xr-x  13 dspace dspace       4096 Dec  1 10:52 ..

-rw-r--r--   1 dspace dspace          4 Dec  5 06:07 deletable

-rw-r--r--   1 dspace dspace 2151226568 Dec  5 06:07 _s12.cfs

-rw-r--r--   1 dspace dspace         29 Dec  5 06:07 segments

 

Does this look OK to you?

 

Thanks!!

 


From: Mark Diggory [mailto:mdig...@MIT.EDU]
Sent: Tuesday, December 05, 2006 3:33 PM
To: Jose Blanco
Cc: dspac...@lists.sourceforge.net; dspace-...@MIT.EDU
Subject: Re: [Dspace-general] too many open files

 

Filter-media doesn't actually interact with Lucene directly, only indirectly, in that any generated text bitstreams will get picked up later when "index-all" is called. So, no, running filter-media will not solve your too-many-open-files issue.

 

The current version of <dspace>/bin/index-all will rebuild your entire Lucene search index (and this will be completely optimized as well). The usual suggestion is to run it nightly in a cron job on your DSpace server. If you look in <dspace>/search and see many, many "segment" files there, this may suggest that your index is not optimized.

 

Cheers,

Mark

 

 


 

Mark R. Diggory

~~~~~~~~~~~~~

DSpace Systems Manager

MIT Libraries, Systems and Technology Services

Massachusetts Institute of Technology



 

Mark Diggory

Aug 24, 2015, 4:19:09 PM
to Jose Blanco, dspac...@lists.sourceforge.net, dspace-...@mit.edu
Sorry, that was an erroneous statement I just made:

Filter-Media does actually call the same code that index-all is calling, but you can turn this off at the command line by adding the option "-n".

Note, if you are running index-all and filter-media consecutively every night, you want to either just run filter-media or run filter-media -n and then run index-all.

Either way, it is a reindexing of the Lucene indexes that could help you with this issue.
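
For example, a nightly pair of cron entries along these lines (a sketch only; the times are arbitrary and <dspace> stands for your install path):

# media filtering without the built-in index step, then a full optimized rebuild
0 2 * * * <dspace>/bin/filter-media -n
0 4 * * * <dspace>/bin/index-all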

-Mark

Mark Diggory

Aug 24, 2015, 4:19:12 PM
to Jose Blanco, dspac...@lists.sourceforge.net, dspace-...@mit.edu
Jose,

This may have much more to do with the number of available inodes on that disk than with search or filter-media. You might run "df -i" and post its result to this thread.

-Mark


Jose Blanco

Aug 24, 2015, 4:19:12 PM
to Mark Diggory, dspac...@lists.sourceforge.net, dspace-...@mit.edu

So why do you think we are getting the “too many open files” error?  It seems to be happening when Google is crawling our site.  It also seems like this error message has to do with the kernel limits on the number of open files, which by default is 1024 – that should be enough, no?  And we do just run ./filter-media nightly.
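
One way to check what limit the Tomcat process actually gets, rather than assuming the 1024 default (a sketch; the limits.conf values are examples only, using the stock Fedora mechanism):

ulimit -n                    # per-process limit in the dspace user's shell
cat /proc/sys/fs/file-max    # system-wide limit, for comparison

# to raise the per-process limit, add lines like these to
# /etc/security/limits.conf and restart Tomcat:
#   dspace  soft  nofile  4096
#   dspace  hard  nofile  8192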

 

Thanks for your thoughts on this.

 

 


Sent: Tuesday, December 05, 2006 3:44 PM
To: Jose Blanco
Cc: dspac...@lists.sourceforge.net; dspace-...@MIT.EDU

Subject: Re: [Dspace-tech] [Dspace-general] too many open files

 

Yes, that looks like an optimized search index. An unoptimized index would have many more files in it.

Jose Blanco

Aug 24, 2015, 4:19:13 PM
to Mark Diggory, dspac...@lists.sourceforge.net, dspace-...@mit.edu

Here are the results:

 

-bash-3.00$ df -i

Filesystem            Inodes   IUsed   IFree IUse% Mounted on

/dev/sda7             128520   27978  100542   22% /

/dev/sda1              64256      47   64209    1% /boot

/dev/sda8            5013504   10060 5003444    1% /l

/dev/sdb1            60325888 1881916 58443972    4% /l1

/dev/sdc1            58621952  119448 58502504    1% /l2

none                  223864       1  223863    1% /dev/shm

/dev/sda5             262144     176  261968    1% /tmp

/dev/sda2            3074176  107051 2967125    4% /usr

/dev/sda3             262144    1278  260866    1% /var

AFS                  9000000       0 9000000    0% /afs

Mark Diggory

Aug 24, 2015, 4:19:15 PM
to Jose Blanco, dspac...@lists.sourceforge.net, dspace-...@mit.edu
I'm going to stop dual-posting to both lists in my next email and just post to dspace-tech for this issue.

So you have lots of inodes for your filesystems; the next big question is: how many are you allocating for open files? That has a lot to do with which kernel/OS you're running. Can you post more detail on your operating system?

-Mark

Jose Blanco

Aug 24, 2015, 4:19:15 PM
to Jose Blanco, Mark Diggory, dspac...@lists.sourceforge.net

 

 


From: Jose Blanco [mailto:bla...@umich.edu]
Sent: Tuesday, December 05, 2006 4:15 PM
To: 'Mark Diggory'
Cc: 'dspac...@lists.sourceforge.net'
Subject: RE: [Dspace-tech] [Dspace-general] too many open files

 

Mark:

 

Does this help?

 

-bash-3.00$ uname -a

Linux “server_name” 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 22:57:02 EDT 2006 i686 i686 i386 GNU/Linux

-bash-3.00$

 

Fedora core 4

 

What do you mean by “allocating for open files”?

 

Thanks!


From: dspace-te...@lists.sourceforge.net [mailto:dspace-te...@lists.sourceforge.net] On Behalf Of Mark Diggory
Sent: Tuesday, December 05, 2006 4:03 PM
To: Jose Blanco
Cc: dspac...@lists.sourceforge.net; dspace-...@MIT.EDU
Subject: Re: [Dspace-tech] [Dspace-general] too many open files

 

I'm going to stop dual-posting to both lists in my next email and just post to dspace-tech for this issue.

Mark Diggory

Aug 24, 2015, 4:19:16 PM
to Jose Blanco, dspac...@lists.sourceforge.net
I think there are one or two things you can look into in this situation.

1.) Throttle your web server activity so that crawlers cannot open more than a limited number of connections at any one time.

And/Or 

2.) Increase the number of inodes available for open files on your filesystem. You want to look at the following settings and see where they are:

cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr

and post them.

file-max has the maximum number of inodes allowed on the system for open files; every open file requires one or more inodes. I think file-nr is the current number being used. There's a lot of documentation on the web about the subject. Here's a brief overview of the trade-offs of increasing inodes; it's Red Hat/Fedora centric.
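
If those numbers come back low, the system-wide ceiling can be raised; a minimal sketch for a Red Hat/Fedora box (the value is an arbitrary example, not a recommendation):

# one-off, as root; takes effect immediately
echo 262144 > /proc/sys/fs/file-max

# or persistently, via /etc/sysctl.conf:
#   fs.file-max = 262144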


-Mark


On Dec 5, 2006, at 4:14 PM, Jose Blanco wrote:

Mark:

 

Does this help?

 

-bash-3.00$ uname -a

Linux “server_name” 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 22:57:02 EDT 2006 i686 i686 i386 GNU/Linux

-bash-3.00$

 

Fedora core 4

 

What do you mean by “allocating for open files”?

 

Mark R. Diggory

Mark Diggory

Aug 24, 2015, 4:19:17 PM
to Jose Blanco, dspac...@lists.sourceforge.net
On Dec 5, 2006, at 4:30 PM, Mark Diggory wrote:

I think there are one or two things you can look into in this situation.

1.) Throttle your web server activity so that crawlers cannot open more than a limited number of connections at any one time.


Mod_bandwidth on your Apache 2.0 server could help you out in this area:


And/Or

2.) Increase the number of inodes available for open files on your filesystem. You want to look at the following settings and see where they are:

cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr

and post them.

file-max has the maximum number of inodes allowed on the system for open files; every open file requires one or more inodes. I think file-nr is the current number being used. There's a lot of documentation on the web about the subject. Here's a brief overview of the trade-offs of increasing inodes; it's Red Hat/Fedora centric.


-Mark


On Dec 5, 2006, at 4:14 PM, Jose Blanco wrote:

Mark:

 

Does this help?

 

-bash-3.00$ uname -a

Linux “server_name” 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 22:57:02 EDT 2006 i686 i686 i386 GNU/Linux

-bash-3.00$

 

Fedora core 4

 

What do you mean by “allocating for open files”?

 

Mark R. Diggory
~~~~~~~~~~~~~
DSpace Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology



Mark Diggory

Aug 24, 2015, 4:19:18 PM
to Jose Blanco, dspac...@lists.sourceforge.net, dspace-...@mit.edu
Yes, that looks like an optimized search index. An unoptimized index would have many more files in it.

-Mark

On Dec 5, 2006, at 3:36 PM, Jose Blanco wrote:


Jose Blanco

Aug 24, 2015, 4:19:18 PM
to Mark Diggory, dspac...@lists.sourceforge.net

Mark:

 

 

We do have a throttle in place, but it does not work on the number of connections opened.  It’s more to block access to items, so perhaps this is an issue.  I will talk this over with our sysadmin.

 

Here is the output you wanted me to post:

 

-bash-3.00$ cat /proc/sys/fs/file-max

206034

-bash-3.00$ cat /proc/sys/fs/file-nr

4224    0       206034

 

I’ll do some reading on this.

 

Thanks!

 

Jose

 

 

 


Sent: Tuesday, December 05, 2006 4:31 PM
To: Jose Blanco
Cc: dspac...@lists.sourceforge.net

Mark Diggory

Aug 24, 2015, 4:19:19 PM
to Jose Blanco, dspac...@lists.sourceforge.net
I did some more research on the various throttling modules that can be run with Apache 2; mod_bandwidth is really only for Apache 1. There are two others I've been looking at: "mod_cband", which I'm planning on using in our next production upgrade, and "mod_bw", which is a rewrite of mod_bandwidth for Apache 2. mod_bw seems a little more bleeding-edge (only one developer) than mod_cband.

I was expecting to see lower numbers on your file-max. Just to give you an example from our production DSpace service (on Gentoo Linux):

 ~ $ cat /proc/sys/fs/file-max
359563
5 ~ $ cat /proc/sys/fs/file-nr
640     0       359563
 ~ $ cat /proc/sys/fs/inode-
inode-nr     inode-state 
 ~ $ cat /proc/sys/fs/inode-nr
50087   15640
 ~ $ cat /proc/sys/fs/inode-state
50090   15640   0       0       0       0       0
 ~ $ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/md/0            1224000    8659 1215341    1% /
udev                  217552     745  216807    1% /dev
/dev/mapper/vg-usr   1310720  275633 1035087   22% /usr
/dev/mapper/vg-home  2621440   57411 2564029    3% /home
/dev/mapper/vg-var   1310720   38530 1272190    3% /var
/dev/mapper/vg-tmp    655360      98  655262    1% /tmp
shm                   217552       1  217551    1% /dev/shm
192.168.0.13:/mnt/staging
                     10134720  358260 9776460    4% /mnt/staging
192.168.0.13:/mnt/assetstore
                     10134720  358260 9776460    4% /mnt/prod/assetstore1
192.168.0.13:/mnt/assetstore0
                     10134720  358260 9776460    4% /mnt/prod/assetstore0


-Mark


Cory Snavely

Aug 24, 2015, 4:19:25 PM
to dspac...@lists.sourceforge.net
To clarify, we are not using Apache. This is Tomcat running as the
dspace user with iptables forwarding port 80 to Tomcat's HTTP service.

The throttling mechanism we use is integrated into the servlet and works
by limiting the frequency of requests and instituting a penalty (delay)
if the parameters are exceeded. Jose can speak more to its design and
integration with the DSpace servlet. We are seeing that it is regularly
blocking aggressive crawlers.
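
For reference, the forwarding itself is just a NAT redirect, something along these lines (assuming Tomcat's HTTP connector listens on the default 8080):

iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080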

If I am understanding your theory, you believe that the kernel's free
inode list of 359563 is becoming saturated, correct?

Thanks for your interest and help with this.

Cory Snavely
University of Michigan Library IT Core Services


Jose Blanco

Aug 24, 2015, 4:19:36 PM
to Mark Diggory, dspac...@lists.sourceforge.net

Mark:


We are suspecting that memory is being saturated during indexing and that this is causing trouble with Tomcat.  Are there any guidelines on how to set memory parameters for repositories based on size?

Here is a bit of information about our server:

Linux “server_name” 2.6.17-1.2142_FC4smp #1 SMP Tue Jul 11 22:57:02 EDT 2006 i686 i686 i386 GNU/Linux

Fedora core 4

 

 

Number of items: 32,710

We have two assetstore dirs with the following sizes

assetstore 0 : 86,119,348 K

assetstore 1 : 21,411,068 K

 

 

Here is what we have in our search dir:

-bash-3.00$ ls -la

total 2103528

drwxr-xr-x   2 dspace dspace       4096 Dec  7 14:07 .

drwxr-xr-x  13 dspace dspace       4096 Dec  1 10:52 ..

-rw-r--r--   1 dspace dspace          4 Dec  7 14:07 deletable

-rw-r--r--   1 dspace dspace 2151226568 Dec  7 08:56 _s12.cfs

-rw-r--r--   1 dspace dspace       4095 Dec  7 14:07 _s12.del

-rw-r--r--   1 dspace dspace     444666 Dec  7 11:51 _s1u.cfs

-rw-r--r--   1 dspace dspace         10 Dec  7 13:56 _s1u.del

-rw-r--r--   1 dspace dspace      81333 Dec  7 11:52 _s2e.cfs

-rw-r--r--   1 dspace dspace      85694 Dec  7 11:53 _s2y.cfs

-rw-r--r--   1 dspace dspace      27933 Dec  7 14:07 _s3e.cfs

-rw-r--r--   1 dspace dspace         65 Dec  7 14:07 segments

 

Thank you!

Jose

 

 

 

 

Robert Tansley

Aug 24, 2015, 4:19:38 PM
to Jose Blanco, dspac...@lists.sourceforge.net, Mark Diggory
On 12/7/06, Jose Blanco <bla...@umich.edu> wrote:
>
> We are suspecting that memory is being saturated during indexing and that
> this is causing trouble with tomcat. Are there any guidelines on how to set
> memory parameters for repositories based on size?

(Guessing the context here) I'm guessing the memory problems with
indexing a large repo come from the org.dspace.core.Context object cache,
which in the case of a full re-index will currently try to keep an
in-memory copy of every object in the DSpace instance. Yuck.

Try adding this line to org.dspace.search.DSIndexer.indexAllItems(),
after writeItemIndex(c, writer, target);

item.decache();

That should theoretically prevent memory from filling up, provided
that Java garbage collection does its job in a timely manner. Let us
know how you get on.

Rob

Jose Blanco

Aug 24, 2015, 4:19:49 PM
to Robert Tansley, dspac...@lists.sourceforge.net, Mark Diggory
Rob,

The line you want me to try should be

target.decache();

not

item.decache()

right?
Thanks!

-----Original Message-----
From: dspace-te...@lists.sourceforge.net
[mailto:dspace-te...@lists.sourceforge.net] On Behalf Of Robert
Tansley
Sent: Thursday, December 07, 2006 5:55 PM
To: Jose Blanco

Cory Snavely

Aug 24, 2015, 4:19:51 PM
to dspac...@lists.sourceforge.net
On Fri, 2006-12-08 at 10:37 -0500, Robert Tansley wrote:
> Yup, target, sorry.
>
> Also, in org.dspace.browse.indexAll(), you might need to do the same
> thing in the while loop -- that's another point where it'll try to
> load everything into memory at once!
>
> Let us know if these help remove reduce your server's memory problems!
>
> On 12/8/06, Jose Blanco <bla...@umich.edu> wrote:
> > Rob,
> >
> > The line you want me to try should be
> >
> > target.decache();
> >
> > not
> >
> > item.decache()
> >
> > right?
> > Thanks!
> >
> > -----Original Message-----
> > From: dspace-te...@lists.sourceforge.net
> > [mailto:dspace-te...@lists.sourceforge.net] On Behalf Of Robert
> > Tansley
> > Sent: Thursday, December 07, 2006 5:55 PM
> > To: Jose Blanco
> > Cc: dspac...@lists.sourceforge.net; Mark Diggory
> > Subject: Re: [Dspace-tech] Memory parameters to support a large repository
> >

Robert Tansley

Aug 24, 2015, 4:19:51 PM
to Jose Blanco, dspac...@lists.sourceforge.net, Mark Diggory
Yup, target, sorry.

Also, in org.dspace.browse.indexAll(), you might need to do the same
thing in the while loop -- that's another point where it'll try to
load everything into memory at once!

Let us know if these help reduce your server's memory problems!

On 12/8/06, Jose Blanco <bla...@umich.edu> wrote:
> Rob,
>
> The line you want me to try should be
>
> target.decache();
>
> not
>
> item.decache()
>
> right?
> Thanks!
>
> -----Original Message-----
> From: dspace-te...@lists.sourceforge.net
> [mailto:dspace-te...@lists.sourceforge.net] On Behalf Of Robert
> Tansley
> Sent: Thursday, December 07, 2006 5:55 PM
> To: Jose Blanco
> Cc: dspac...@lists.sourceforge.net; Mark Diggory
> Subject: Re: [Dspace-tech] Memory parameters to support a large repository
>

Cory Snavely

Aug 24, 2015, 4:19:52 PM
to Robert Tansley, dspac...@lists.sourceforge.net, Jose Blanco, Mark Diggory
Robert, after these changes are made, how do you think the memory
consumption would be related to the size of the collection?

Whereas before, it was linear, would it essentially now be constant? If
so, the only reason for allocating a large amount of memory would be to
make garbage collection less frequent.

Thanks again for posting this.

Cory Snavely
UM Library IT Core Services

On Fri, 2006-12-08 at 10:37 -0500, Robert Tansley wrote:

Jose Blanco

Aug 24, 2015, 4:19:53 PM
to Robert Tansley, dspac...@lists.sourceforge.net, Mark Diggory
Thanks! I'll keep you posted.

Robert Tansley

Aug 24, 2015, 4:19:54 PM
to Cory Snavely, dspac...@lists.sourceforge.net, Jose Blanco, Mark Diggory
On 12/8/06, Cory Snavely <csna...@umich.edu> wrote:
> Robert, after these changes are made, how do you think the memory
> consumption would be related to the size of the collection?
>
> Whereas before, it was linear, would it essentially now be constant? If
> so, the only reason for allocating a large amount of memory would be to
> make garbage collection less frequent.

Yes, memory use should be constant after this for a full index
(index-all), though it only affects the amount of memory you need to
put in dspace/bin/dsrun or dsrun.bat -- full indexing doesn't run in
the webapp (Tomcat) JVM.
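
In dsrun that is just the -Xmx flag on the java invocation; for example
(512m is an arbitrary value, and $FULLPATH is whatever classpath the
script already builds):

java -Xmx512m -classpath $FULLPATH "$@"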

I'm not sure how memory use relates to repo size for Tomcat --
optimally required memory probably depends more on your number of
users than the size of the repository in that case. OAI harvests are
probably the biggest load but should never need more than
DSpaceOAICatalog.MAX_RECORDS items in memory at once. The
community/collection list page (which was originally only supposed to
be a "temporary measure" but like many others has stuck) also scales
badly -- some caching combined with collapsible elements could help --
but this shouldn't affect memory use so much.

Rob

Jose Blanco

Aug 24, 2015, 4:20:24 PM
to Robert Tansley, Cory Snavely, dspac...@lists.sourceforge.net, Mark Diggory
Rob:

Well... we were doing pretty well for a while (4 days or so), but it seems
like Tomcat went down last night right after the indexer completed its work.
The good thing is that the indexer is able to run. We have dsrun set up as
follows:

java -Xmx256m -classpath $FULLPATH "$@"

and before this code change of calling decache for every item, we would get
an error that said we ran out of memory unless we upped the memory, so the
memory must be staying constant now.

Just before the system went down last night we were getting errors, like
this one:

Exception:
java.lang.NumberFormatException: multiple points
at
java.lang.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1067)
at java.lang.Double.parseDouble(Double.java:220)
at java.text.DigitList.getDouble(DigitList.java:127)

And this one that was kind of interesting:

Exception:
java.io.FileNotFoundException: /l1/dspace/repository/prod/search/_pn2.fnm
(No such file or directory)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
at
org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376
)


Indicating that the indexer is faulty or something?

We run filter-media every night, and at that time DSpace is up and available
to users (mainly crawlers at that hour). Is this some sort of issue? Is
it not safe to run filter-media and have DSpace up at the same time?

Mark has recently released some changes to DSIndexer, and I've emailed him
about these changes in a separate email to see if his changes could somehow
help us.

Your thoughts on this mystery would be most welcome.


Thanks!

Jose

-----Original Message-----
From: dspace-te...@lists.sourceforge.net
[mailto:dspace-te...@lists.sourceforge.net] On Behalf Of Robert
Tansley
Sent: Friday, December 08, 2006 11:04 AM
To: Cory Snavely
Cc: dspac...@lists.sourceforge.net; Jose Blanco; Mark Diggory
Subject: Re: [Dspace-tech] Memory parameters to support a large repository

Mark Diggory

Aug 24, 2015, 4:20:25 PM
to Jose Blanco, dspac...@lists.sourceforge.net, Cory Snavely, Robert Tansley

On Dec 13, 2006, at 10:40 AM, Jose Blanco wrote:

> Rob:
>
> And this one that was kind of interesting:
>
> Exception:
> java.io.FileNotFoundException: /l1/dspace/repository/prod/search/
> _pn2.fnm
> (No such file or directory)
> at java.io.RandomAccessFile.open(Native Method)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
> at
> org.apache.lucene.store.FSInputStream$Descriptor.<init>
> (FSDirectory.java:376
> )
>
>
> Indicating that the indexer is faulty or something?
>

Jose,

I've run into this before, and in reviewing the Lucene documentation
further, there are a couple of issues going on here.

1.) The default Java temp directories that Tomcat and your dsrun java
executable use may be different, in which case DSQuery is failing to
detect that there's a lock on the indexes. I've been working on a
patch that causes all Lucene Readers/Writers in DSpace to use the
same lock directory. It rewrites this code to use "Directory"
instead of the raw filesystem path to the search directory, and it
also forces a reference to the lock directory via a system property.

http://lucene.apache.org/java/docs/api/org/apache/lucene/store/Directory.html

2.) The "IndexSearcher" in DSQuery is not properly detecting that the
index has changed version and is attempting to open a stale reference
to a file that no longer exists. I've been working on a patch for
this as well.
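
For illustration, roughly what that pair of fixes could look like on the
DSQuery side (my own sketch against the Lucene 1.4/2.0-era API, not the
actual patch; the class name and lock-dir path are made-up placeholders):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearcherCache
{
    private static Directory dir;
    private static IndexSearcher searcher;
    private static long version = -1;

    static
    {
        // Point every JVM (Tomcat and dsrun alike) at one lock directory,
        // instead of each defaulting to its own java.io.tmpdir. This must
        // run before FSDirectory is first used.
        System.setProperty("org.apache.lucene.lockDir",
                           "/l1/dspace/lucene-locks");
    }

    public static synchronized IndexSearcher getSearcher(String indexPath)
        throws IOException
    {
        if (dir == null)
        {
            dir = FSDirectory.getDirectory(indexPath, false);
        }

        // Reopen the searcher whenever the index version has moved on, so
        // we never hold a stale reader against segment files that a fresh
        // index-all has already deleted.
        long current = IndexReader.getCurrentVersion(dir);
        if (searcher == null || current != version)
        {
            if (searcher != null)
            {
                searcher.close();
            }
            searcher = new IndexSearcher(dir);
            version = current;
        }
        return searcher;
    }
}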

This case exemplifies that running indexing and searching from
different JVMs can be problematic if all the hooks are not properly
in place.

One quick way around the issue would be to restart Tomcat or reload
the web application after your index-all/filter-media run has finished.

-Mark

p.s. I'll send you my new DSIndexer file shortly.

Jose Blanco

Aug 24, 2015, 4:20:27 PM
to Mark Diggory, Cory Snavely, dspac...@lists.sourceforge.net
Thanks! I'll take a look and see how it works over here.

-Jose

-----Original Message-----
From: Mark Diggory [mailto:mdig...@MIT.EDU]
Sent: Wednesday, December 13, 2006 1:29 PM
To: Jose Blanco; Cory Snavely
Cc: dspac...@lists.sourceforge.net
Subject: Re: [Dspace-tech] Memory parameters to support a large repository

Jose,

I had to regenerate the patch I posted earlier; it had some problems
if you're using Eclipse to apply it, due to my project's name. I have
also altered where the decache call was made in that patch, because it
could adversely affect other parts of DSpace (in
DSIndexer.writeItemIndex). I've since moved it to the location Rob was
directing you to add it, in the following code:

> /**
>  * iterate through all items, indexing each one
>  */
> private static void indexAllItems(Context c, IndexWriter writer)
>     throws SQLException, IOException
> {
>     ItemIterator i = Item.findAll(c);
>
>     while (i.hasNext())
>     {
>         Item target = (Item) i.next();
>
>         writeItemIndex(c, writer, target);
>
>         target.decache();
>     }
> }

This way, decaching only happens in the indexAllItems case, and not
any time an item is indexed via the web UI.

I've attached the patch and the java file:



Mark Diggory

Aug 24, 2015, 4:20:30 PM
to Jose Blanco, Cory Snavely, dspac...@lists.sourceforge.net
dsindexer-patch.txt
DSIndexer.java

Jose Blanco

Aug 24, 2015, 4:20:34 PM
to Mark Diggory, Cory Snavely, dspac...@lists.sourceforge.net
Mark:

I'm getting some compilation errors. I think I'm missing some Lucene objects.


[javac]
/l1/dspace/build/dev-blancoj/dspace/src/org/dspace/search/DSIndexer.java:868
: cannot resolve symbol
[javac] symbol : variable Index
[javac] location: class org.apache.lucene.document.Field
[javac] doc.add(new Field("location", location,
Field.Store.YES, Field.Index.TOKENIZED));
[javac]
^
[javac]
/l1/dspace/build/dev-blancoj/dspace/src/org/dspace/search/DSIndexer.java:869
: cannot resolve symbol
[javac] symbol : variable Store
[javac] location: class org.apache.lucene.document.Field
[javac] doc.add(new Field("default", location,
Field.Store.YES, Field.Index.TOKENIZED));
[javac] ^
[javac]
/l1/dspace/build/dev-blancoj/dspace/src/org/dspace/search/DSIndexer.java:869
: cannot resolve symbol
[javac] symbol : variable Index
[javac] location: class org.apache.lucene.document.Field
[javac] doc.add(new Field("default", location,
Field.Store.YES, Field.Index.TOKENIZED));


[javac]
/l1/dspace/build/dev-blancoj/dspace/src/org/dspace/search/DSIndexer.java:155
: cannot resolve symbol
[javac] symbol : method deleteDocuments (org.apache.lucene.index.Term)
[javac] location: class org.apache.lucene.index.IndexReader
[javac] ir.deleteDocuments(t);
[javac] ^
[javac]
/l1/dspace/build/dev-blancoj/dspace/src/org/dspace/search/DSIndexer.java:432
: cannot resolve symbol
[javac] symbol : method setMaxFieldLength (int)
[javac] location: class org.apache.lucene.index.IndexWriter
[javac] writer.setMaxFieldLength(Integer.MAX_VALUE);
[javac] ^
[javac]
/l1/dspace/build/dev-blancoj/dspace/src/org/dspace/search/DSIndexer.java:436
: cannot resolve symbol
[javac] symbol : method setMaxFieldLength (int)
[javac] location: class org.apache.lucene.index.IndexWriter
[javac] writer.setMaxFieldLength(maxfieldlength);
[javac] ^
[javac]
/l1/dspace/build/dev-blancoj/dspace/src/org/dspace/search/DSIndexer.java:552
: cannot resolve symbol
[javac] symbol : variable Store
[javac] location: class org.apache.lucene.document.Field
[javac] doc.add(new Field("name", name, Field.Store.YES,
Field.Index.TOKENIZED));
[javac] ^



I guess these were not part of our Lucene install.

-Jose

-----Original Message-----
From: dspace-te...@lists.sourceforge.net
[mailto:dspace-te...@lists.sourceforge.net] On Behalf Of Mark Diggory
Sent: Wednesday, December 13, 2006 1:29 PM
To: Jose Blanco; Cory Snavely
Cc: dspac...@lists.sourceforge.net
Subject: Re: [Dspace-tech] Memory parameters to support a large repository

Mark Diggory

Aug 24, 2015, 4:20:35 PM
to Jose Blanco, dspac...@lists.sourceforge.net, Cory Snavely
Jose, this patch is against 1.4.1. Are you on 1.4?

-Mark

Jose Blanco

Aug 24, 2015, 4:20:36 PM
to Mark Diggory, dspac...@lists.sourceforge.net, Cory Snavely
Yes. I'm planning to upgrade early next year. Does this mean I can't use
it?

-----Original Message-----
From: Mark Diggory [mailto:mdig...@MIT.EDU]
Sent: Wednesday, December 13, 2006 4:35 PM
To: Jose Blanco

Mark Diggory

Aug 24, 2015, 4:20:41 PM
to Jose Blanco, dspac...@lists.sourceforge.net, Cory Snavely
You'd have to upgrade the copies of both DSIndexer and DSQuery if
you're going to follow Scott's recommendation. That's fine; I did do
that in my last production tweak.

Otherwise, you caught me at a good time, because I have a copy for
both versions. 1.4.1 uses Lucene 2.0, and much of its old API was
deprecated and removed. This included these convenience methods. You
can't use this particular patch/class. Here's an older version of my
work against 1.4 that should work.

DSIndexer.java