Generate features for millions of documents

Yi Sun

Jul 15, 2016, 5:28:44 PM

to semanti...@googlegroups.com

Hi,

I have hundreds of millions of lines of text in one file, and each line contains a long paragraph of sentences. How can I generate the semantic vector for each line efficiently?

I do not want to split the original document into millions of small files.

Thanks,

Yi

Dominic Widdows

Jul 15, 2016, 5:44:37 PM

to semanti...@googlegroups.com

Hi Yi,

You should be able to do this by changing the way IndexFilePositions calls FilePositionDoc in these Lucene indexing classes / utilities:

https://github.com/semanticvectors/semanticvectors/blob/master/src/main/java/pitt/search/lucene/IndexFilePositions.java

https://github.com/semanticvectors/semanticvectors/blob/master/src/main/java/pitt/search/lucene/FilePositionDoc.java

That is, add something like an option where a "document" is added to the Lucene index for each line / paragraph. It might be enough to have a FileReader read the text one line at a time and add each line as a TextField object. (See https://lucene.apache.org/core/5_3_0/core/index.html?org/apache/lucene/document/TextField.html)
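The line-at-a-time reading suggested above could be sketched roughly as follows. This is only an illustration in plain Java, not code from SemanticVectors: the class LinePerDocument and the LineSink callback are hypothetical names, and the callback stands in for the step that would build one Lucene document per line (for example with a TextField) and add it to the index. Skipping blank lines is also an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LinePerDocument {

    /** Hypothetical callback standing in for the Lucene indexing step. */
    public interface LineSink {
        void accept(long lineNumber, String text) throws IOException;
    }

    /**
     * Streams the file one line at a time, so the whole file is never held
     * in memory, and hands each non-blank line to the sink as one "document".
     * Returns the number of lines handed to the sink.
     */
    public static long forEachLine(Path file, LineSink sink) throws IOException {
        long lineNumber = 0;
        long emitted = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (line.trim().isEmpty()) {
                    continue; // assumption: blank lines are not documents
                }
                sink.accept(lineNumber, line);
                emitted++;
            }
        }
        return emitted;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java LinePerDocument <text file>");
            return;
        }
        long count = forEachLine(Paths.get(args[0]), (lineNumber, text) -> {
            // A real indexer would create one Lucene document here,
            // storing the line number and the text, and add it to the index.
        });
        System.out.println("Lines emitted as documents: " + count);
    }
}
```

Keeping the reading loop separate from the indexing callback means the same loop can be tested without a Lucene index and later pointed at whatever document-building code is used.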

Hope that helps. If this or something like it works, do please reply to the list saying what you did. Good luck :)

Best wishes,

Dominic

Ron King

Oct 22, 2020, 7:19:00 PM

to Semantic Vectors

I see that this is a 4-year-old question, but I'm new to actually digging into using SV. Is it still necessary to modify code as described when using the latest version?

I'd like to process a multi-megabyte text file with 2 million sentences in it. I want each sentence to be treated as a separate document, without creating 2 million files.

Dominic

Oct 22, 2020, 7:25:07 PM

to Semantic Vectors

Hi there - it looks like a suitable "Doc from String" function was added a couple of years ago at https://github.com/semanticvectors/semanticvectors/blob/master/src/main/java/pitt/search/lucene/FilePositionDoc.java#L41. I haven't used it, but it looks like it should do what you want here, and I'd expect you can wire it in to be called from https://github.com/semanticvectors/semanticvectors/blob/master/src/main/java/pitt/search/lucene/IndexFilePositions.java#L122.

So the answer is probably "yes, it's still necessary to modify some code but not much".

Best wishes,

Dominic

Trevor Cohen

Oct 23, 2020, 12:16:47 AM

to semanti...@googlegroups.com

Evening all - This function can be called (without adding code) by running IndexFlatFilePositions from the command line, with the same syntax as IndexFilePositions but pointing to a file rather than a directory. Each line of the file will be treated as though it were a separate document.

-Trevor

Ron King

Oct 23, 2020, 11:59:19 AM

to Semantic Vectors

Thanks for the tip about pitt.search.lucene.IndexFlatFilePositions! I'll give it a try.

Ron King

Oct 23, 2020, 11:59:34 AM

to Semantic Vectors

Hmm, I made the suggested changes, and my new class IndexFilesPositionsLines worked fine.

But when I run BuildPositionalIndex, this happens:

Processed 2000000 documents
Created 268384 term vectors ...
About to write 268384 vectors of dimension 200 to Lucene format file: elementalvectors.bin ... finished writing vectors.
Initialized LuceneUtils from Lucene index in directory: positional_index
Fields in index are: line_number, modified, contents
Writing vectors incrementally to file docvectors.bin ...
Oct 22, 2020 7:47:05 PM pitt.search.semanticvectors.LuceneUtils getExternalDocId
SEVERE: Failed to get external doc ID from doc no. 0 in Lucene index.
This is almost certain to lead to problems.
Check that -docidfield was set correctly and exists in the Lucene index
Exception in thread "main" java.lang.NullPointerException
    at pitt.search.semanticvectors.LuceneUtils.getExternalDocId(LuceneUtils.java:200)
    at pitt.search.semanticvectors.IncrementalDocVectors.trainIncrementalDocVectors(IncrementalDocVectors.java:122)
    at pitt.search.semanticvectors.IncrementalDocVectors.createIncrementalDocVectors(IncrementalDocVectors.java:93)
    at pitt.search.semanticvectors.BuildPositionalIndex.main(BuildPositionalIndex.java:176)
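The SEVERE message asks for a check that the -docidfield value names a field that exists in the index. A minimal sketch of that check in plain Java, using the field list printed in the log above. The class and method names are hypothetical, and the value "path" for the configured field is an assumption for illustration only; it is not stated anywhere in this thread.

```java
import java.util.Arrays;
import java.util.List;

public class DocIdFieldCheck {

    /** Returns true if the configured doc ID field is one of the index's fields. */
    public static boolean hasField(List<String> indexFields, String docIdField) {
        return indexFields.contains(docIdField);
    }

    public static void main(String[] args) {
        // Field names copied from the "Fields in index are:" line of the log.
        List<String> indexFields = Arrays.asList("line_number", "modified", "contents");

        // Assumption: the configured doc ID field. Pass another name as the
        // first argument to test a different value.
        String docIdField = args.length > 0 ? args[0] : "path";

        if (hasField(indexFields, docIdField)) {
            System.out.println("Field '" + docIdField + "' exists in the index.");
        } else {
            System.out.println("Field '" + docIdField + "' is missing; index has " + indexFields);
        }
    }
}
```

If the configured field turns out to be missing, one thing to try would be setting -docidfield to a field the index does contain, such as line_number, though whether that resolves the error is not confirmed in this thread.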

Dominic

Oct 23, 2020, 12:01:29 PM

to Semantic Vectors

Thanks for the tip, Trevor - hopefully IndexFlatFilePositions works fine. (That's the advice so far; the messages above got out of order.)
