Specifying per-field TermProcessor


Mohamed Yahya

Apr 11, 2014, 9:28:55 AM
to mg...@googlegroups.com
Is there a way to have per-field term processors in MG4J?

Sebastiano Vigna

Apr 11, 2014, 9:33:45 AM
to mg...@googlegroups.com
On 11 Apr 2014, at 3:28 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> Is there a way to have per-field term processors in MG4J?

Yes, in principle. Every Scan instance (which analyzes a single field) has a separate term processor. However, there is presently no way to set the processor from outside.

The current workaround is to run as many indexing rounds as you have term processors, restricting the index construction in each round to the fields that must be parsed by that specific term processor.

Ciao,

seba

Mohamed Yahya

Apr 22, 2014, 10:46:50 AM
to mg...@googlegroups.com
If I understand correctly, the solution you propose is to have an
independent indexing round per field. Is this correct?

--
Mohamed Yahya

Sebastiano Vigna

Apr 22, 2014, 11:37:23 AM
to mg...@googlegroups.com
On 22 Apr 2014, at 4:46 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> If I understand correctly, the solution you propose is to have an
> independent indexing round per field. Is this correct?

Well, I would rather group the fields by term processor and, in each round, index all the fields that use the same term processor. But, yes, you can have one round per field. How large is the collection?
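To make the grouping concrete, here is a minimal sketch of how the rounds could be planned. It is plain Java and does not call MG4J; the field names and the field-to-processor assignment are made up for illustration, and `RoundPlanner` and `planRounds` are hypothetical names. Each resulting group would then be built in its own indexing round, restricted to the listed fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RoundPlanner {
    /**
     * Groups field names by the term processor assigned to them.
     * Each entry of the result corresponds to one indexing round.
     */
    public static Map<String, List<String>> planRounds(Map<String, String> processorByField) {
        // LinkedHashMap keeps the rounds in the order processors are first seen.
        Map<String, List<String>> rounds = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : processorByField.entrySet()) {
            rounds.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return rounds;
    }

    public static void main(String[] args) {
        // Hypothetical assignment: which term processor each field should use.
        Map<String, String> processorByField = new LinkedHashMap<>();
        processorByField.put("title", "DowncaseTermProcessor");
        processorByField.put("text", "DowncaseTermProcessor");
        processorByField.put("id", "NullTermProcessor");

        int round = 1;
        for (Map.Entry<String, List<String>> e : planRounds(processorByField).entrySet()) {
            // One indexing run per line, limited to the fields shown.
            System.out.println("Round " + round++ + ": processor=" + e.getKey()
                    + " fields=" + e.getValue());
        }
    }
}
```

With 8 fields, the number of rounds is the number of distinct term processors, so it can never exceed 8 and is usually much smaller.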

Ciao,

seba

Mohamed Yahya

Apr 22, 2014, 11:40:15 AM
to mg4j
I see the point about working per term processor. We're talking about
4 million documents, and 8 fields per document.

Sebastiano Vigna

Apr 22, 2014, 11:41:44 AM
to mg...@googlegroups.com
On 22 Apr 2014, at 5:40 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> I see the point about working per term processor. We're talking about
> 4 million documents, and 8 fields per document.


Oh well. You can definitely do several rounds :).

Ciao,

seba

Mohamed Yahya

Apr 22, 2014, 2:04:16 PM
to mg4j
With everything being read and written from the same disk on a single
machine with 20GB of memory available to MG4J, how much time would you
expect indexing something like the text of Wikipedia (20GB after
cleaning, two fields: text, title) to take, running with the options
"--downcase -s 1000000"?

Sebastiano Vigna

Apr 23, 2014, 2:38:30 AM
to mg...@googlegroups.com
On 22 Apr 2014, at 8:04 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> With everything being read and written from the same disk on a single
> machine with 20GB of memory available to MG4J, how much time would you
> expect indexing something like the text of Wikipedia (20GB after
> cleaning, two fields: text, title) to take, running with the options
> "--downcase -s 1000000".


I'd say a few hours, but it depends on several factors.

Ciao,

seba
