cfindex is taking forever


hofar...@houseoffusion.com

Apr 8, 2015, 6:22:37 PM
to ColdFusion Technical Talk

I'm working on building a search interface for a "document depo" on a
site. The document folder has files going all the way back to 2005, and
includes a number of 10+ MB PDF files, a few that are over 20 MB,
countless Word and Excel files, PowerPoint presentations....

I don't have access to the ColdFusion Administrator, so:

<cfcollection
action = "create"
categories = "no"
collection = "docDEPO"
engine = "verity"
language = "English"
path = "#req.path#\collections\">

<cfindex
collection="docDEPO"
action="refresh"
type="path"
key="#req.path#\documentdepot\"
language="English"
status="info"
extensions=".pdf,.pptx,.docx,.doc,.xls,.xlsx,.ppsx,.txt,.ppt">


The collection was created successfully as far as I can tell. However,
indexing has been running (or at least the wheel on my browser is still
turning) for almost 3 hours now. I'm going to forget about it and go mow
my grass and see what's happening when I finish.

I'm thinking though ... too much stuff to index? Or is that amount of time
not out of line for a very large collection of files?
Also, I've not been able to find a list of supported extensions.
I might have something listed that's just going to cause it to crap out
anyway.

Thoughts? Try something else? What exactly?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
Order the Adobe Coldfusion Anthology now!
http://www.amazon.com/Adobe-Coldfusion-Anthology/dp/1430272155/?tag=houseoffusion
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360432

hofar...@houseoffusion.com

Apr 8, 2015, 6:31:41 PM
to ColdFusion Technical Talk

> The collection was created successfully as far as I can tell. However,
> indexing has been running (or at least the wheel on my browser is still
> turning) for almost 3 hours now. I'm going to forget about it and go mow
> my grass and see what's happening when I finish.
>
> I'm thinking though ... too much stuff to index? Or is amount of time
> not out of line for a very large collection of files?

That doesn't actually sound unreasonable, but it might be useful to
come up with a document count more specific than "very large".

> Thoughts? Try something else? What exactly?

Have you considered Solr instead of Verity? Not that this would solve
the problem of indexing a lot of files, specifically.

Dave Watts, CTO, Fig Leaf Software
1-202-527-9569
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Service-Disabled Veteran-Owned Small Business
(SDVOSB) on GSA Schedule, and provides the highest caliber vendor-
authorized instruction at our training centers, online, or onsite.

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360433

hofar...@houseoffusion.com

Apr 8, 2015, 6:32:09 PM
to ColdFusion Technical Talk

> I'm going to forget about it and go mow my grass and see what's
> happening when I finish.

Well crap, somebody stole my lawnmower. This is why we can't have nice
things....

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360434

hofar...@houseoffusion.com

Apr 8, 2015, 6:38:26 PM
to ColdFusion Technical Talk

You also have to take your disk IOPS into consideration. If you are on a
VPS, you will get much slower disk performance, especially if it's
not on SSD, and actions like this can take a lot longer.
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360435

hofar...@houseoffusion.com

Apr 8, 2015, 6:51:07 PM
to ColdFusion Technical Talk

> That doesn't actually sound unreasonable, but it might be useful to
> come up with a document count more specific than "very large".


Approx. 3000 documents, around 3 GB of data
... it's still running from what I can tell.

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360436

hofar...@houseoffusion.com

Apr 8, 2015, 7:46:55 PM
to ColdFusion Technical Talk

Not in front of a computer right now, but there is an option in the
cfcollection tag to list collections or get a collection's details
(something like that). Pretty sure that gives you the record or document
count and maybe even the size.

I think that is accessible while indexing is happening. You could possibly
write a quick script to see how far along things are.
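
A minimal sketch of such a progress check, assuming the collection is still named "docDEPO" as in the first post and that the page is requested separately from the one doing the indexing. The query name allCollections is made up for illustration. The name, doccount, size and lastmodified columns are the ones documented for cfcollection action="list"; whether doccount changes while cfindex is still running is an assumption that would need checking on the actual server.

```cfml
<!--- Sketch: check indexing progress from a separate request. --->
<!--- action="list" returns a query describing the collections on the server. --->
<cfcollection action="list" name="allCollections">

<cfoutput query="allCollections">
    <!--- "docDEPO" is the collection name used in the first post. --->
    <cfif allCollections.name EQ "docDEPO">
        Documents in collection: #allCollections.doccount#<br>
        Collection size: #allCollections.size#<br>
        Last modified: #allCollections.lastmodified#<br>
    </cfif>
</cfoutput>
```

Reloading this page every few minutes and comparing the count against the roughly 3000 files in the folder would give a rough idea of how far along the index is.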
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360437

hofar...@houseoffusion.com

Apr 9, 2015, 7:47:25 AM
to ColdFusion Technical Talk

This is not a CF solution, but it may at least help with what it has to trawl
through. In any case, it will help anything else that has to call or
access the documents.

This is for PDF files, but you might consider converting Office files to PDF
(at your discretion, of course). A properly prepared PDF version of an Office
document can be as little as a quarter of the file size of the source
document. That's useful, and print and view quality is not compromised.

I'm a big fan of PDF but unfortunately it's a file format that suffers a lot
from bad file preparation - the result is unnecessarily big files amongst
other things.

Try optimising all the PDFs to see if this reduces the size of some of the
files - I suspect it might. You'll need to check that the output settings
(e.g. print resolution, image resolution etc.) are suitable for the end
purpose of the document but from my experience the default settings are
usually quite good.

The good news is you can automate this process over the entire file system
with Acrobat Pro's batch feature.

Hope that helps in some way!


++++++++++
Kevin Parker

++++++++++
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360439

hofar...@houseoffusion.com

Apr 13, 2015, 10:43:04 AM
to ColdFusion Technical Talk

I've optimized things as much as I could by building a number of
collections and limiting each to a specific doc type.
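
For anyone wanting to try the same split, here is a rough sketch of what one collection per document type might look like. The collection names (docDEPO_pdf and so on) and the docTypes struct are hypothetical, not the ones actually used; the sketch assumes each collection has already been created with cfcollection and reuses the #req.path#\documentdepot\ folder from the first post.

```cfml
<!--- Hypothetical mapping of collection name to the extensions it indexes. --->
<cfset docTypes = structNew()>
<cfset docTypes["docDEPO_pdf"]    = ".pdf">
<cfset docTypes["docDEPO_word"]   = ".doc,.docx">
<cfset docTypes["docDEPO_excel"]  = ".xls,.xlsx">
<cfset docTypes["docDEPO_slides"] = ".ppt,.pptx,.ppsx">
<cfset docTypes["docDEPO_text"]   = ".txt">

<!--- Index the same folder once per collection, restricted by extension. --->
<cfloop collection="#docTypes#" item="collectionName">
    <cfindex
        collection="#collectionName#"
        action="refresh"
        type="path"
        key="#req.path#\documentdepot\"
        language="English"
        extensions="#docTypes[collectionName]#">
</cfloop>
```

Each cfindex call could also be run from its own page so that a slow file type does not hold up the others. The collection attribute of cfsearch accepts a comma-separated list of names, so the separate collections can still be searched together.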

Next question!!

I'm trying to return a few sentences from each doc with the search term
highlighted. So, I use "ContextPassages" like below.

<cfsearch name="searchResults"
collection="docDEPO"
criteria="#form.sch#"
ContextHighlightBegin="<b>"
ContextHighlightEnd="</b>"
ContextPassages="4"
ContextBytes="500">

However, #searchResults.context# very rarely gives me anything. Out of
30 returned documents, maybe only 3 return content for
#searchResults.context#. Usually it's empty/null.

Suggestions?

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:360462