Histogram generated by codegen


Bruno

May 13, 2013, 10:04:29 AM
to echo...@googlegroups.com

Hello all,

I am trying to create my own Solr index with some songs I have, and I am using codegen to generate the codes and timestamps for each track. The chart above shows the codes (blue) and timestamps (orange). I created a new Analyzer in Solr that uncompresses and decodes the code string, which gives me the above result. Using the HashFilter from echoprint-server (https://github.com/echonest/echoprint-server/blob/master/Hashr/src/com/echonest/knowledge/hashr/HashFilter.java) I get the following error in Solr:

ERROR - 2013-05-13 15:01:14.231; org.apache.solr.common.SolrException; null:java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=1513,endOffset=440
at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)

which I get because of those steep falls in the timestamp array. It seems that the OffsetAttribute assumes the timestamps are always increasing. I translated the fp.py script to Java, but both versions produce the same result, and I cannot find any further processing of the code samples. What am I doing wrong?
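
For reference, the decode step I implemented is roughly equivalent to the sketch below (the class name is mine; the layout, timestamps in the first half of the inflated hex text and hash codes in the second half, five hex digits per value, is what fp.py's inflate_code_string assumes):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public final class CodeStringDecoder {

    /** Decodes a codegen code string into parallel arrays:
     *  result[0] = timestamps, result[1] = hash codes. */
    public static int[][] decode(String codeString) throws DataFormatException {
        // Code strings are URL-safe base64 over a zlib-deflated hex payload.
        byte[] compressed = Base64.getUrlDecoder().decode(codeString);
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inflater.finished()) {
            int n = inflater.inflate(buf);
            if (n == 0) break; // needs more input or done
            out.write(buf, 0, n);
        }
        inflater.end();
        String hex = new String(out.toByteArray(), StandardCharsets.US_ASCII);

        // First half: timestamps. Second half: hash codes. 5 hex digits each.
        int half = hex.length() / 2;
        int count = half / 5;
        int[] times = new int[count];
        int[] hashes = new int[count];
        for (int i = 0; i < count; i++) {
            times[i] = Integer.parseInt(hex.substring(i * 5, (i + 1) * 5), 16);
            hashes[i] = Integer.parseInt(hex.substring(half + i * 5, half + (i + 1) * 5), 16);
        }
        return new int[][] { times, hashes };
    }
}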

Thanks & Regards
Bruno

Andrew Nesbit

May 13, 2013, 10:26:41 AM
to echo...@googlegroups.com
Hi Bruno,

The reason for the steep falls in the orange line is that the time codes increase for the first subband, then drop back to the beginning and start increasing again for the second subband, and so on, up to the eighth subband. So in a plot like the one you posted, you will generally see a sawtooth pattern with eight teeth.
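
If you ever need those per-subband segments explicitly, you can split the arrays at each drop. A hypothetical helper (not part of echoprint-server, untested) would look like this:

import java.util.ArrayList;
import java.util.List;

public final class SubbandSplitter {

    /** Splits the index range of a time-code array at each "steep fall",
     *  returning one [start, end) pair per tooth of the sawtooth; for
     *  codegen output you should get eight of them. */
    public static List<int[]> subbandRanges(int[] times) {
        List<int[]> ranges = new ArrayList<int[]>();
        int start = 0;
        for (int i = 1; i < times.length; i++) {
            if (times[i] < times[i - 1]) { // time code dropped: next subband
                ranges.add(new int[] { start, i });
                start = i;
            }
        }
        ranges.add(new int[] { start, times.length });
        return ranges;
    }
}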

There is no real processing of time codes in the Solr component. Solr searches only on the hash codes. The time codes are used in fp.py:actual_matches, to time-align the query FP with the FP returned from TTyrant.

Andrew



Bruno

May 13, 2013, 10:37:47 AM
to echo...@googlegroups.com
Hello Andrew,

Thanks for the fast answer. But HashFilter.java from echoprint-server uses the OffsetAttribute, which wants always-increasing timestamps. Is this class wrong? Do I take out the OffsetAttribute?

Regards
Bruno

public class HashFilter extends TokenFilter {

    protected final static Logger log = LoggerFactory.getLogger(
            HashFilter.class);

    private TermAttribute termAtt;

    private OffsetAttribute offAtt;

    private int prevOff;

    public HashFilter(TokenStream input) {
        super(input);
        termAtt = (TermAttribute) addAttribute(TermAttribute.class);
        offAtt = (OffsetAttribute) addAttribute(OffsetAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if(input.incrementToken()) {
            //
            // Save the state for this token, since we want the position in
            // the next token.
            State s = captureState();
            if(input.incrementToken()) {

                //
                // Parse the position from the next token. Wasteful, but what you
                // gonna do, unless we want to implement our own parseint.
                String ps = termAtt.term();
                int posn;
                try {
                    posn = Integer.parseInt(ps);
                } catch(NumberFormatException ex) {
                    throw new IOException(String.format("Bad offset %s", ps), ex);
                }
                restoreState(s);

                //
                // A finger print extends from the previous position to this one.
                offAtt.setOffset(prevOff, posn);
                prevOff = posn;
                return true;
            }
        }
        return false;
    }
}

Andrew Nesbit

May 14, 2013, 8:12:40 PM
to echo...@googlegroups.com
Hey Bruno,

Ah, now I see where your confusion is coming from. There are several files in that Java source directory, but only the HashQueryComponent class ever really gets used. You can see how it gets hooked into Solr here:

https://github.com/echonest/echoprint-server/blob/master/solr/solr/solr/conf/solrconfig.xml#L637

It's not immediately clear from the source, but HashFilter.java is a stub for functionality that was never really finished, so it's a bit misleading. I'm not sure what the original intention was, but I'm going to ask around. Thanks for bringing it up.

The bottom line is that, unless you have a specific reason otherwise, you should ignore everything in that directory except HashQueryComponent.java.

Andrew

Bruno

May 20, 2013, 5:25:00 AM
to echo...@googlegroups.com
Hi Andrew,

I have been checking the echoprint-server you provide on GitHub, and the schema.xml for the Solr instance is:

<types>
    <!-- A string field for the track IDs -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.DateField"/>

    <!-- A field for the fingerprint hashes and time offsets -->
    <fieldType name="fphash" class="solr.TextField">
      <analyzer type="index" class="com.echonest.knowledge.hashr.HashAnalyzer"/>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>

    <!-- The ID for a track, which is our unique key -->
    <field name="track_id" type="string" indexed="true" stored="true" required="true"/>

    <!-- The hashes associated with a track, which is our default search field -->
    <field name="fp" type="fphash" indexed="true" stored="false" required="true"/>
    
    <field name="artist" type="string" indexed="true" stored="true" required="false"/>
    <field name="release" type="string" indexed="true" stored="true" required="false"/>
    <field name="track" type="string" indexed="true" stored="true" required="false"/>
    <field name="length" type="int" indexed="true" stored="true" required="true"/>
    <field name="codever" type="string" indexed="true" stored="true" required="true"/>
    <field name="source" type="string" indexed="true" stored="true" required="true"/>
    <field name="import_date" type="date" indexed="true" stored="true" required="true"/>

  </fields>

So as you can see, you are using the Java code to create an Analyzer for the hash field that tweaks the hash before inserting it into the Solr index. I also found out why I am having these problems and you are not. In Solr 4.3 the OffsetAttributeImpl changed from this (version 3.1):

public void setOffset(int startOffset, int endOffset) {
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

to this (version 4.3):

public void setOffset(int startOffset, int endOffset) {

    // TODO: we could assert that this is set-once, ie,
    // current values are -1?  Very few token filters should
    // change offsets once set by the tokenizer... and
    // tokenizer should call clearAtts before re-using
    // OffsetAtt

    if (startOffset < 0 || endOffset < startOffset) {
      throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, "
          + "startOffset=" + startOffset + ",endOffset=" + endOffset);
    }

    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

So now the offsets always have to be increasing, a problem you did not have up to Solr 3. Do you have any suggestions to solve this? I added a constant to each new subband so that the offsets always increase. Do you think that is a good strategy? I already did some tests, and if the startOffset differs from the original start offsets I am not able to get anything from Solr.
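
For reference, my workaround looks roughly like this (a sketch of the idea, not my exact code):

public final class OffsetRebaser {

    /** Rebases each subband's time codes so the combined sequence is
     *  non-decreasing, which keeps OffsetAttribute#setOffset happy. */
    public static int[] makeNonDecreasing(int[] times) {
        int[] out = new int[times.length];
        int base = 0;
        int prev = 0;
        for (int i = 0; i < times.length; i++) {
            if (times[i] < prev) {
                base += prev; // new subband: shift it past the previous one
            }
            prev = times[i];
            out[i] = base + times[i];
        }
        return out;
    }
}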

Even so, I will try to use a strategy similar to your HashQueryComponent and see if the performance improves.

Regards
Bruno

Andrew Nesbit

May 20, 2013, 3:27:12 PM
to echo...@googlegroups.com
Hi Bruno,

Yes, you're right. Sorry about the mistake - the HashAnalyzer is from a long time ago, and I wasn't familiar with it until you pointed it out and I asked around. Thanks for the explanation.

The plan is to move this out of Java, get rid of the HashAnalyzer class altogether, and store the FP as a list of integers instead of a string. This means the FPs will be pre-processed in the Python library instead of the Java one.

Are you splitting the codes on ingestion into Solr? If yes, then the time codes in each document (FP segment) ingested into Solr should already be in non-decreasing order:

https://github.com/echonest/echoprint-server/blame/master/API/fp.py#L503

If not, you can use fp.chunker to convert the FP string into a list of tuples of the form (time-code, "hash-code time-code"), then sort that list and assemble a new FP string, similar to how it's done in fp.split_codes (omitting, of course, the parts that actually split the FP into segments).
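
In Java, an equivalent of that sort might look something like the following (a hypothetical sketch, not code from echoprint-server; it assumes the FP string alternates hash and time tokens):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public final class FpSorter {

    /** Re-assembles a "hash time hash time ..." FP string with its
     *  (hash, time) pairs sorted by time code. */
    public static String sortByTime(String fpString) {
        String[] tokens = fpString.trim().split("\\s+");
        List<String[]> pairs = new ArrayList<String[]>();
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            pairs.add(new String[] { tokens[i], tokens[i + 1] });
        }
        Collections.sort(pairs, new Comparator<String[]>() {
            public int compare(String[] a, String[] b) {
                return Integer.compare(Integer.parseInt(a[1]), Integer.parseInt(b[1]));
            }
        });
        StringBuilder sb = new StringBuilder();
        for (String[] p : pairs) {
            sb.append(p[0]).append(' ').append(p[1]).append(' ');
        }
        return sb.toString().trim();
    }
}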

I haven't tried or tested the above but I think it should work.

Andrew

Ravikant Bhargava

Dec 23, 2015, 11:31:35 PM
to echoprint
Hello all,

I am also facing a similar issue when indexing in Solr 4.0. I am using the Python code in fp.py to split the codes and then sort them by offset, as Andrew advised. But the HashAnalyzer piece still throws the following runtime exception when trying to index more than one segment of the same song.

IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset


Please correct me if I am wrong, but from the discussion on this thread it feels like whatever HashFilter is doing is redundant, and we can live with using a WhitespaceTokenizer and skipping the time offsets while indexing hash codes. If that is the case, then I think I can skip that setOffset call and be done with this error.
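
Something like this variant of HashFilter should do it (an untested sketch: it keeps only the hash tokens and never touches the OffsetAttribute, so the Lucene 4.x check is never hit):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class HashOnlyFilter extends TokenFilter {

    public HashOnlyFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false; // stream exhausted
        }
        // The stream alternates "hash time hash time ...": remember the hash
        // token, consume and discard the time token, then re-expose the hash.
        State hashToken = captureState();
        input.incrementToken();
        restoreState(hashToken);
        return true;
    }
}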



Thanks & Regards,

Ravikant
