Specifying per-field TermProcessor

Mohamed Yahya

Apr 11, 2014, 9:28:55 AM

to mg...@googlegroups.com

Is there a way to have per-field term processors in MG4J?

Sebastiano Vigna

Apr 11, 2014, 9:33:45 AM

to mg...@googlegroups.com

On 11 Apr 2014, at 3:28 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> Is there a way to have per-field term processors in MG4J?

Yes, in principle. Every Scan instance (which analyzes a single field) has a separate term processor. However, there is presently no way to set the processor from outside.

The current workaround is to run as many indexing rounds as you have term processors, restricting index construction in each round to the fields that must be parsed by that round's term processor.

Ciao,

seba

Mohamed Yahya

Apr 22, 2014, 10:46:50 AM

to mg...@googlegroups.com

If I understand correctly, the solution you propose is to have an
independent indexing round per field. Is this correct?

--
Mohamed Yahya

Sebastiano Vigna

Apr 22, 2014, 11:37:23 AM

to mg...@googlegroups.com

On 22 Apr 2014, at 4:46 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> If I understand correctly, the solution you propose is to have an
> independent indexing round per field. Is this correct?

Well, I would rather group the fields by term processor and, in each round, index all the fields that use the same term processor. But yes, you can have one round per field. How large is the collection?

Ciao,

seba
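
The grouping suggested above can be sketched in a few lines of plain Java. This is only an illustration of the bookkeeping, not MG4J code: the field names and the field-to-processor assignment are hypothetical, the processor names are used purely as labels, and how each round is actually restricted to its fields is left out because the thread does not spell out the mechanism.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexingRounds {
    /**
     * Groups field names by the term processor each field needs.
     * Each entry of the result corresponds to one indexing round.
     */
    static Map<String, List<String>> groupByProcessor(Map<String, String> processorByField) {
        // LinkedHashMap keeps rounds in the order processors are first seen.
        Map<String, List<String>> rounds = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : processorByField.entrySet()) {
            rounds.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return rounds;
    }

    public static void main(String[] args) {
        // Hypothetical assignment of fields to term processors (labels only).
        Map<String, String> processorByField = new LinkedHashMap<>();
        processorByField.put("title", "DowncaseTermProcessor");
        processorByField.put("text", "DowncaseTermProcessor");
        processorByField.put("id", "NullTermProcessor");

        int round = 1;
        for (Map.Entry<String, List<String>> e : groupByProcessor(processorByField).entrySet()) {
            System.out.println("round " + round++ + ": " + e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With this assignment the three fields collapse into two rounds, one per distinct term processor, rather than one round per field.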

Mohamed Yahya

Apr 22, 2014, 11:40:15 AM

to mg4j

I see the point about working per term processor. We're talking about
4 million documents, and 8 fields per document.

Sebastiano Vigna

Apr 22, 2014, 11:41:44 AM

to mg...@googlegroups.com

On 22 Apr 2014, at 5:40 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> I see the point about working per term processor. We're talking about
> 4 million documents, and 8 fields per document.

Oh well. You can definitely do several rounds :).

Ciao,

seba

Mohamed Yahya

Apr 22, 2014, 2:04:16 PM

to mg4j

With everything being read from and written to the same disk on a single
machine with 20GB of memory available to MG4J, how long would you
expect indexing something like the text of Wikipedia (20GB after
cleaning, two fields: text, title) to take when running with the options
"--downcase -s 1000000"?

Sebastiano Vigna

Apr 23, 2014, 2:38:30 AM

to mg...@googlegroups.com

On 22 Apr 2014, at 8:04 PM, Mohamed Yahya <yahya....@gmail.com> wrote:

> With everything being read from and written to the same disk on a single
> machine with 20GB of memory available to MG4J, how long would you
> expect indexing something like the text of Wikipedia (20GB after
> cleaning, two fields: text, title) to take when running with the options
> "--downcase -s 1000000"?

I'd say a few hours, but it depends on several factors.

Ciao,

seba
