API questions

32 views
Skip to first unread message

José Tomás Atria

unread,
Oct 20, 2015, 12:59:54 PM10/20/15
to Semantic Vectors
Hello all,

I just started playing around with semanticvectors, and I have a couple of questions about the current state of the package and its API.

First, why are tests disabled in the master branch? When enabled, two tests fail, but since I'm not familiar with the codebase, I can't tell if the error is in the test or the code...

Second, I see that the VectorStore interface used to have methods for getDimension() and getVectorType(), which were, of course, the first two methods that I found myself wanting when I tried to use the interface... Why were they removed? I understand that having these methods in the interface would force implementations to enforce homogeneity, but I would think this a good thing, and I haven't seen any use of heterogeneous vector stores... Is there a use case for that scenario?

Finally, I see that some parts of the code use some rather old or obsolete interfaces (most notoriously, Enumeration<E> all around). Are there any plans to update these parts of the library? Were I to start working in this direction, are there any gotchas that I should be aware of?

Thank you in advance for any help.
Best,
jta

Dominic

unread,
Oct 20, 2015, 1:15:52 PM10/20/15
to Semantic Vectors
Hi there,

Thanks for good questions, answers (some better than others!) inline.


On Tuesday, October 20, 2015 at 9:59:54 AM UTC-7, José Tomás Atria wrote:
Hello all,

I just started playing around with semanticvectors, and I have a couple of questions about the current state of the package and its API.

First, why are tests disabled in the master branch? When enabled, two tests fail, but since I'm not familiar with the codebase, I can't tell if the error is in the test or the code...

Tests are disabled by default because a couple have been flaky as you saw. I believe the flakiness is due to the underlying random nature of some of the elemental vector generation methods, but it's still disturbing.

It would be better to enable the tests in general and just disable the flaky ones until fixed.
 
Second, I see that the VectorStore interface used to have methods for getDimension() and getVectorType(), which were, of course, the first two methods that I found myself wanting when I tried to use the interface... Why were they removed? I understand that having these methods in the interface would force implementations to enforce homogeneity, but I would think this a good thing, and I haven't seen any use of heterogeneous vector stores... Is there a use case for that scenario?

The reason is because (after some mistakes years ago), we decided to make sure that dimension and vector type should be global, and they belong the the FlagConfig instance (of which there should typically be just one). Many methods and constructors in the package pass around one of these flagConfig objects, see e.g.,

The only times the dimension and vector type should be changed are i. at the command line when making new vector stores, and ii. when opening a vector store on disk. If you try to open two different vector stores with different dimensions or vector types in the same runtime, there will be trouble, e.g., you could end up trying to measure the scalar product between incompatible vectors. 

Like many things in the package, these implementation choices have been a tradeoff between (some) sensible encapsulation and error checks, while trying to make prototyping and research adaptations relatively easy.
 
Finally, I see that some parts of the code use some rather old or obsolete interfaces (most notoriously, Enumeration<E> all around). Are there any plans to update these parts of the library? Were I to start working in this direction, are there any gotchas that I should be aware of?

None that I know of - all you're seeing here is that the package has lived a long while and many parts are old.

We are thoroughly glad to review push / integrate requests to the github project. My only request here would be "keep the first push request small" - since I'm relatively inexperienced with github, I'd like to make sure there's a proof-of-concept for the integration process before you spend too much time preparing any large updates.

Best wishes,
Dominic

José Tomás Atria

unread,
Oct 20, 2015, 2:24:36 PM10/20/15
to semanti...@googlegroups.com
Hi Dominic,

Thanks a bunch for your answers! Of course now I have some additional questions :) See below.

On Tue, Oct 20, 2015 at 1:15 PM, Dominic <dwid...@gmail.com> wrote:
Tests are disabled by default because a couple have been flaky as you saw. I believe the flakiness is due to the underlying random nature of some of the elemental vector generation methods, but it's still disturbing.

It would be better to enable the tests in general and just disable the flaky ones until fixed.

Ok. I'll disable piecemeal and look into the failing tests... one of them is an ArrayOutOfBounds exception, so it's probably some offset error somewhere, the other one may be related to randomness because its a failed float comparison... I'll see what's up.​
 

The reason is because (after some mistakes years ago), we decided to make sure that dimension and vector type should be global, and they belong the the FlagConfig instance (of which there should typically be just one). Many methods and constructors in the package pass around one of these flagConfig objects, see e.g.,

The only times the dimension and vector type should be changed are i. at the command line when making new vector stores, and ii. when opening a vector store on disk. If you try to open two different vector stores with different dimensions or vector types in the same runtime, there will be trouble, e.g., you could end up trying to measure the scalar product between incompatible vectors. 

Like many things in the package, these implementation choices have been a tradeoff between (some) sensible encapsulation and error checks, while trying to make prototyping and research adaptations relatively easy.

​Mmm. I see. I confess that what I was trying to do when I bumped into the need for getVectorType and getDimension was precisely to do away with the requirement to use FlagConfig all over, in order to allow for other workflows that may use a different execution model. This might be overkill? I'm generally a little obsessive when it comes to coding practices, and it would be much nicer for my purposes if the interfaces and APIs could be decoupled from the use of FlagConfig, but this may be just too cumbesome to implement...

So far, I forked the repo in github and created a branch in which I added those methods back to the interface, and write in stubs in the different implementations of VectorStore, as well as adding alternative constructors that do not use FlagConfig where I've needed them and it wasn't too complicated to do so.

Does this sound like a major violation of the package's architecture to you? I understand that using a singleton FlagConfig is the easiest way ​to enforce consistency across sessions, so removing or allowing workarounds may lead to trouble later, but maybe it makes sense to have support an alternative execution method?

 
Finally, I see that some parts of the code use some rather old or obsolete interfaces (most notoriously, Enumeration<E> all around). Are there any plans to update these parts of the library? Were I to start working in this direction, are there any gotchas that I should be aware of?

None that I know of - all you're seeing here is that the package has lived a long while and many parts are old.

We are thoroughly glad to review push / integrate requests to the github project. My only request here would be "keep the first push request small" - since I'm relatively inexperienced with github, I'd like to make sure there's a proof-of-concept for the integration process before you spend too much time preparing any large updates.

Gotcha. I am also very inexperienced with the github collaboration model, so I'll look into what is the easiest way to encapsulate any changes I make, and let you know. In any case, before I do anything else, I'll get the tests fixed so I can check my changes against them.

Thanks again for the quick response!

Best,
jta

Best wishes,
Dominic
 

Thank you in advance for any help.
Best,
jta

--
You received this message because you are subscribed to the Google Groups "Semantic Vectors" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semanticvecto...@googlegroups.com.
To post to this group, send email to semanti...@googlegroups.com.
Visit this group at http://groups.google.com/group/semanticvectors.
For more options, visit https://groups.google.com/d/optout.



--
entia non sunt multiplicanda praeter necessitatem

Dominic

unread,
Oct 20, 2015, 2:44:02 PM10/20/15
to Semantic Vectors
Better encapsulation would be sensible and safer for other workflows, for sure. However I'd be worried that the flag config idiom pervades so much of the stack that you might find that codifying all the interfaces that use it would become very cumbersome. E.g., lots of things depend on the luceneindexpath property, and threading variables like this through the stack would make for a lot of complicated method signatures.

Of course, passing round a single object doesn't really tame this complexity, it just hides it, and says nothing about what fields are relevant to what processes.

So my advice would be feel free to try (at least making dimension and vector type explicit in vector stores), but please don't be surprised if it becomes a viral change.

Best wishes,
Dominic
To unsubscribe from this group and stop receiving emails from it, send an email to semanticvectors+unsubscribe@googlegroups.com.
To post to this group, send email to semanticvectors@googlegroups.com.

José Tomás Atria

unread,
Oct 20, 2015, 3:28:44 PM10/20/15
to semanti...@googlegroups.com
I'll see what I can come up with :) So far I only added a few method signatures here and there wherever I needed to get around FlagConfig, but I noticed that a more thorough change would require a lot of work... which is why I came here asking :)

In any case, having the tests working would make any efforts in this direction much easier, so I'll look into that first.

For now, I'll go ahead and set up patches (or whatever github uses) for both the addition to the VectorStore interface and the tests, if I can get fix them.

Thanks for your help!
jta

Dominic

To unsubscribe from this group and stop receiving emails from it, send an email to semanticvecto...@googlegroups.com.
To post to this group, send email to semanti...@googlegroups.com.



--
entia non sunt multiplicanda praeter necessitatem

--
You received this message because you are subscribed to the Google Groups "Semantic Vectors" group.
To unsubscribe from this group and stop receiving emails from it, send an email to semanticvecto...@googlegroups.com.
To post to this group, send email to semanti...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages