ElasticSearch


kla...@sfile.com

Jan 13, 2014, 12:51:50 PM
to semanti...@googlegroups.com
Does SemanticVectors work seamlessly with indexes created by Elastic Search?

Ken Larrey

Dominic

Jan 13, 2014, 8:46:18 PM
to semanti...@googlegroups.com
Hi Ken,

I have no specific information on Elastic Search. Given that it uses Lucene internally (see http://www.elasticsearch.org/overview/), it might be possible, depending on how Lucene is used. If you can find Lucene index files on disk, I'd try starting with that.

If anyone has more detailed experience on this topic to share, please chime in.

Best wishes,
Dominic

Ken Larrey

Jan 15, 2014, 5:27:17 PM
to semanti...@googlegroups.com
I tried running semanticvectors.BuildIndex on an index created by ElasticSearch (from a twitter stream) with shards set to 1, so that it would all be one contiguous index.  Files created in the index look similar to an index created by ordinary Lucene.  The files in the index are:

_0.cfe
_1.cfe
_2.cfe
_3.cfe
_4.cfe
_5.cfe
_6.cfe
_7.cfe
_0.cfs
_1.cfs
_2.cfs
_3.cfs
_4.cfs
_5.cfs
_6.cfs
_7.cfs
_checksums-1389822965544
segments_1
segments.gen
write.lock


Apparently SemanticVectors still doesn't like the format.  

Here is the error output, following the command:

java pitt.search.semanticvectors.BuildIndex -luceneindexpath "C:\ [...] \nodes\0\indices\my_twitter_river_hbw2\0\index" > RRIonESIndexError.txt

Seedlength: 10, Dimension: 200, Vector type: REAL, Minimum frequency: 0, Maximum frequency: 2147483647, Number non-alphabet characters: 2147483647, Contents fields are: [contents]
Exception in thread "main" org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource: ChecksumIndexInput(SimpleFSIndexInput(path="C:\Users\LANAdmin\Documents\KCL\ElasticSearchData\SFileES\nodes\0\indices\my_twitter_river_hbw2\0\index\segments_1"))): 1 (needs to be between 0 and 0)
at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:148)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:323)
at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:812)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
at pitt.search.semanticvectors.LuceneUtils.<init>(LuceneUtils.java:102)
at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:97)



Anyone have thoughts on what's wrong and what might be done to make it work?

Sfile Technology Corporation

1990 Post Oak Blvd., Suite D

Houston, TX  77056

Office: (713)228-3184

kla...@sfile.com

www.sfile.com



--
You received this message because you are subscribed to the Google Groups "Semantic Vectors" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semanticvecto...@googlegroups.com.
To post to this group, send email to semanti...@googlegroups.com.
Visit this group at http://groups.google.com/group/semanticvectors.
For more options, visit https://groups.google.com/groups/opt_out.


Dominic Widdows

Jan 15, 2014, 7:42:16 PM
to semanti...@googlegroups.com
Hi Ken,

It looks like a Lucene version-mismatch issue. If you're using SV 5.4 (latest jar release) or 5.5 (current source), these are up-to-date with Lucene 4.6.0, which is the latest Lucene AFAIK.

Is there a way to find out which Lucene version was used to write your ElasticSearch indexes?

Best wishes,
Dominic
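
The "1 (needs to be between 0 and 0)" in the trace above suggests the segments_1 file carries format version 1 while the Lucene jar on the classpath accepts only 0. One way to check what a segments_N file declares, without any Lucene jar, is to read its header directly. This is a minimal sketch, assuming the Lucene 4.x header layout (a big-endian magic int 0x3fd76c17, a length-prefixed codec name, then a big-endian version int). The class name CodecHeaderPeek is hypothetical and not part of SemanticVectors or Lucene, and the number it prints is a file-format version, which does not by itself name a Lucene release.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: reads the codec header at the start of a Lucene 4.x file. */
class CodecHeaderPeek {
    // Magic number written first by Lucene 4.x (assumption: the file is 4.x-style).
    static final int CODEC_MAGIC = 0x3fd76c17;

    /** Returns "codecName:version" parsed from the start of the stream. */
    static String readHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt(); // DataInputStream reads big-endian
        if (magic != CODEC_MAGIC) {
            throw new IOException("No Lucene 4.x codec header, first int was 0x"
                    + Integer.toHexString(magic));
        }
        byte[] name = new byte[readVInt(data)]; // name length is a variable-length int
        data.readFully(name);
        int version = data.readInt();
        return new String(name, StandardCharsets.UTF_8) + ":" + version;
    }

    /** Decodes a variable-length int: 7 bits per byte, high bit set means more bytes follow. */
    static int readVInt(DataInputStream data) throws IOException {
        int value = 0;
        for (int shift = 0; shift < 32; shift += 7) {
            int b = data.readUnsignedByte();
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IOException("Malformed variable-length int");
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a segments_N file inside the index directory.
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(readHeader(in));
        }
    }
}
```

Run it against a copy of the index rather than the live directory, since ElasticSearch holds write.lock while it is running.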

Otis Gospodnetic

Jan 15, 2014, 11:14:02 PM
to semanti...@googlegroups.com
Hi,

ES 1.0 RC1 uses Lucene 4.6.0, so that should work with SV that knows how to read Lucene 4.6.0 format.
4.6.1 should be out next week, FYI.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

Ken Larrey

Jan 16, 2014, 12:02:36 PM
to semanti...@googlegroups.com
Thanks, looks like you may both be right.

I updated SemanticVectors to the 5.4 binary distribution, downloaded Lucene 4.6.0, and updated all of the respective classpath variables.  Now I get a different error:

java pitt.search.semanticvectors.BuildIndex -luceneindexpath "C:\etc\nodes\0\indices\my_twitter_river_hbw2\0\index" > RRIonESIndexError.txt

Seedlength: 10, Dimension: 200, Vector type: REAL, Minimum frequency: 0, Maximum frequency: 2147483647, Number non-alphabet characters: 2147483647, Contents fields are: [contents]
Exception in thread "main" java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'es090' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.The current classpath supports the following names: [Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
at org.apache.lucene.codecs.PostingsFormat.forName(PostingsFormat.java:100)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:192)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:244)
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:115)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
at pitt.search.semanticvectors.LuceneUtils.<init>(LuceneUtils.java:102)
at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:96)


I've noticed that the ElasticSearch index changes frequently and seems to use some ElasticSearch-specific formats or codecs.

I checked the index again this morning and found that the contents of that index had changed completely to the following:

segments.gen
segments_5
_checksums-1389830792624
_u.fdt
_u.fdx
_u.fnm
_u.nvd
_u.nvm
_u_es090_0.blm
_u_es090_0.doc
_u_es090_0.pay
_u_es090_0.pos
_u_es090_0.tim
_u_es090_0.tip
_v.cfe
_v.cfs

Fired up ES again and it completely changed again. 

I tried adding the lucene-codecs-4.6.0.jar to the classpath as well as the elasticsearch-0.90.10.jar, but I still get the same error.  
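
The error message says the 'es090' name is resolved through SPI registrations on the classpath, and those live in META-INF/services/org.apache.lucene.codecs.PostingsFormat files. A minimal sketch for seeing which of those files the JVM actually finds, and which provider classes each one lists, follows. It uses only the JDK; the class name ListPostingsFormatProviders is hypothetical. It prints provider class names, not the short SPI names such as Lucene41 or es090.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic: lists PostingsFormat SPI registrations visible on the classpath. */
class ListPostingsFormatProviders {
    static final String SERVICE_FILE =
            "META-INF/services/org.apache.lucene.codecs.PostingsFormat";

    /** Extracts provider class names from service-file text, skipping blanks and '#' comments. */
    static List<String> parseProviders(BufferedReader reader) throws IOException {
        List<String> providers = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                providers.add(line);
            }
        }
        return providers;
    }

    public static void main(String[] args) throws IOException {
        // Every jar that registers postings formats contributes one copy of the service file.
        Enumeration<URL> files =
                ListPostingsFormatProviders.class.getClassLoader().getResources(SERVICE_FILE);
        while (files.hasMoreElements()) {
            URL url = files.nextElement();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                System.out.println(url + " -> " + parseProviders(reader));
            }
        }
    }
}
```

Running this with the same classpath as the BuildIndex command should show whether any service file from the ElasticSearch jar is being picked up at all.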

Any thoughts on whether use of the es090 format can be suppressed or handled by SemanticVectors some other way?  

If you can direct me to any resources that can (hopefully concisely) explain the formats of Lucene and ElasticSearch indexes, their differences, and the different file types/extensions used in each, I'd be grateful.  Google searches of elasticsearch.org for es090 or any of the various file extensions occurring in ES indexes all yield nothing (e.g. "es090 site:elasticsearch.org").

Thanks!
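
On the file names: entries such as _u_es090_0.tim in the second listing appear to follow Lucene's per-field postings naming, <segment>_<formatName>_<suffix>.<extension>, which is where the 'es090' in the error comes from. Segments packed into compound files (.cfs/.cfe), as in the first listing, do not expose the format name in the directory listing. A minimal sketch that pulls the format names out of a listing, assuming that naming pattern (the class name PostingsFormatNames is hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds postings-format names embedded in index file names. */
class PostingsFormatNames {
    // Assumed pattern: <segment>_<formatName>_<suffix>.<ext>, e.g. _u_es090_0.tim
    static final Pattern PER_FIELD_FILE =
            Pattern.compile("^(_[0-9a-z]+)_([A-Za-z0-9]+)_(\\d+)\\.([a-z0-9]+)$");

    /** Returns the distinct format names found in the given file names, sorted. */
    static Set<String> formatNames(List<String> fileNames) {
        Set<String> names = new TreeSet<>();
        for (String fileName : fileNames) {
            Matcher m = PER_FIELD_FILE.matcher(fileName);
            if (m.matches()) {
                names.add(m.group(2)); // group 2 is the format name
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A few names copied from the listing above.
        List<String> files = Arrays.asList(
                "segments.gen", "segments_5", "_checksums-1389830792624",
                "_u.fdt", "_u_es090_0.doc", "_u_es090_0.tim", "_v.cfs");
        System.out.println(formatNames(files));
    }
}
```

Any name reported here has to be resolvable by the Lucene on the classpath before the index can be opened.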




Trevor Cohen

Jun 12, 2015, 5:16:11 PM
to semanti...@googlegroups.com, kla...@sfile.com
Hi Ken,
I realize I'm a bit late to the party here, but I think I've found a way around this that I thought I'd post in case it is useful to anyone else in the group.

I was able to get SemanticVectors to operate on an ElasticSearch index by adapting instructions used for Luke, posted here: 


Specifically: 

(1) Download SV-5.6 (for consistency with the current ElasticSearch release's dependency on Lucene 4.10).


(2) Add ElasticSearch as a dependency to the end of the list of dependencies in the pom.xml file

<!-- ElasticSearch -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>1.5.0</version>
</dependency>

(3) mvn install

(4) In the semantic vectors directory, run the script below (adapted from the Luke instructions) to add the ElasticSearch postings format to the packaged .jar:

# Extract the PostingsFormat service file from the built jar into ./tmp/
unzip target/semanticvectors-5.6.jar META-INF/services/org.apache.lucene.codecs.PostingsFormat -d ./tmp/

# Append the ElasticSearch postings formats to the list of registered providers
echo "org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat"  >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
echo "org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat"   >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
echo "org.elasticsearch.search.suggest.completion.Completion090PostingsFormat"  >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat

# Write the updated service file back into the jar
jar -uf target/semanticvectors-5.6.jar -C tmp/ META-INF/services/org.apache.lucene.codecs.PostingsFormat

This resulted in a target/semanticvectors-5.6.jar that interacts comfortably with an index that I'm told was produced by the current release of ElasticSearch (v 1.6).

Regards,
Trevor 
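
One caveat with the echo lines in step (4): they append unconditionally, so re-running the script accumulates duplicate entries, and if the extracted service file does not end with a newline, the first appended class name is glued onto the last existing line. A minimal sketch of a merge that avoids both, using only the JDK (the class name ServiceFileMerge is hypothetical, and reading and writing the file itself is left out):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: merges provider class names into service-file text. */
class ServiceFileMerge {
    /**
     * Returns the existing lines followed by any additions not already present,
     * one per line, always ending with a newline.
     */
    static String merge(String existing, List<String> additions) {
        Set<String> lines = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (String line : existing.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        lines.addAll(additions);
        return String.join("\n", lines) + "\n";
    }

    public static void main(String[] args) {
        // The three class names are the ones appended in step (4) above.
        List<String> elasticsearchFormats = Arrays.asList(
                "org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat",
                "org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat",
                "org.elasticsearch.search.suggest.completion.Completion090PostingsFormat");
        // The existing content here is a stand-in for the file extracted from the jar.
        String existing = "org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat";
        System.out.print(merge(existing, elasticsearchFormats));
    }
}
```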



