Use of VectorStoreRAM

Glen Newton

Jul 7, 2011, 11:13:58 AM

to semanti...@googlegroups.com

Hello,

It has been a couple of years since I delved into the SV source. But
given some questions on scalability and some work I am doing, I
recently downloaded the 2.4 source and briefly looked through it.

I noticed that VectorStoreRAM is used in various places. For example,
in TermTermVectorsFromLucene.java, the term vectors are: private
VectorStoreRAM termVectors; and this is used in trainTermTermVectors.
So as it is RAM based, it seems that it would be limited to the heap on
the machine.
VectorStoreRAM is also found in: BuildPositionalIndex,
CompareTermsBatch, DocVectors, IncrementalDocVectors.

I am working on 1) a Berkeley DB[1] implementation of VectorStore and 2)
an Infinispan[2] implementation of VectorStore.
These would use either a local key store (BDB) or a distributed
key store (Infinispan) to implement a VectorStore.
I am planning on altering the SV source to see how well they work, and how
much they reduce the memory footprint of SV (although
perhaps taking a hit in speed).

I would like SV to support the dynamic loading (plugin-esque) of
VectorStore implementations for internally used VectorStores. The
default would still be VectorStoreRAM, but this can be overridden by
some system property (or some other mechanism: my soon-to-be-released
(!) new version of LuSql uses a simple plugin mechanism...). So when
someone comes along with a faster and / or more scalable VectorStore,
it can be tested and used without changes to the SV source code.

As someone who has had to use up to 28GB of heap for some of my work,
I'd like to try to reduce the memory footprint to allow for large
datasets. While mainly a selfish interest, I suspect this might be of
use to others.

Please let me know if you think this is 1) sensible and 2) of general interest.

Thanks,
Glen Newton

[1]http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
[2]http://www.jboss.org/infinispan


Jim Snavely

Jul 7, 2011, 9:14:19 PM

to glen....@gmail.com, semanti...@googlegroups.com

Glen,

I'm very interested in your idea to use an alternate VectorStore
implementation to run SV with reduced memory requirements. I tried to
use the semantic vectors package to index a collection of 6 million
scientific titles.
Because there were so many unique terms, I ran out of memory building
the term vectors in RAM.

I'd love to give it another shot with a different VectorStore, I just
haven't gotten around to trying to patch it.

Thanks,
Jim Snavely


--
--Jim

widdows

Jul 7, 2011, 10:53:02 PM

to Semantic Vectors

This would be terrific, Glen. A distributed disk-based VectorStore
would be a very welcome addition, and maybe not too difficult to plug
in.

VectorStoreRAM is fast for queries but not very scalable for building
really large indexes. You'll probably find that the choices of where
to use different VectorStore implementations in the current codebase
is somewhat accidental in places - it's evolved more by necessity than
design. We can probably improve here.

Will either of your database serializations be disrupted if we try to
serialize vectors that aren't assumed to be float[] arrays? We've been
working on serializations for vectors from binary and complex vector
spaces as well, so the interfaces are going to become more abstract
(see
http://code.google.com/p/semanticvectors/source/browse/branches/typed-vectors#typed-vectors%2Fpitt%2Fsearch%2Fsemanticvectors%2Fvectors).
I hope the new formats can be accommodated, if not we'll review.
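
To make the serialization question concrete: one way a key-value backend (BDB or Infinispan) could stay agnostic about vector type is to store every vector as a byte[] with a small header. The following is only a sketch under stated assumptions. The class name VectorCodec, the one-byte type tag, and the layout are all hypothetical and are not the format used in the typed-vectors branch.

```java
import java.nio.ByteBuffer;

// Hypothetical codec, not part of the SemanticVectors API.
// Assumed layout: [1 byte type tag][4 bytes element count][payload].
final class VectorCodec {
  static final byte TYPE_REAL = 0;   // assumed tag value for float[] vectors
  static final byte TYPE_BINARY = 1; // assumed tag value for bit vectors packed in long[]

  static byte[] encodeReal(float[] v) {
    ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 4 * v.length);
    buf.put(TYPE_REAL).putInt(v.length);
    for (float x : v) buf.putFloat(x);
    return buf.array();
  }

  static byte[] encodeBinary(long[] words) {
    ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 8 * words.length);
    buf.put(TYPE_BINARY).putInt(words.length);
    for (long w : words) buf.putLong(w);
    return buf.array();
  }

  // Lets the store inspect the type without decoding the payload.
  static byte typeOf(byte[] data) {
    return data[0];
  }

  static float[] decodeReal(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    if (buf.get() != TYPE_REAL) throw new IllegalArgumentException("not a real vector");
    float[] v = new float[buf.getInt()];
    for (int i = 0; i < v.length; i++) v[i] = buf.getFloat();
    return v;
  }

  static long[] decodeBinary(byte[] data) {
    ByteBuffer buf = ByteBuffer.wrap(data);
    if (buf.get() != TYPE_BINARY) throw new IllegalArgumentException("not a binary vector");
    long[] words = new long[buf.getInt()];
    for (int i = 0; i < words.length; i++) words[i] = buf.getLong();
    return words;
  }
}
```

With a scheme like this the store only ever sees opaque byte arrays, so adding a complex vector type would mean adding a tag and a pair of encode/decode methods rather than changing the store.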

Thanks for the initiative, and I'll be glad to help.

Best wishes,
Dominic


Lance De Vine

Jul 7, 2011, 11:15:38 PM

to semanti...@googlegroups.com

Hello,

I would also be interested in helping out with more scalable vector store implementations. BDB is a good choice and Infinispan looks very interesting as well (I hadn't seen it before). Given Zipf's law, I would presume that the use of memory caches would go a long way towards maintaining performance across a distributed architecture when processing text, although this may depend on the particular use case.
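
As a rough illustration of the caching idea, here is a minimal read-through LRU cache placed in front of a slower lookup. It is a sketch only: CachedVectorLookup and its fields are hypothetical names, and the slow store is modelled as a plain Function rather than any real SV, BDB, or Infinispan API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through LRU cache in front of a slower vector lookup.
final class CachedVectorLookup {
  private final Function<String, float[]> slowStore; // stands in for a disk or network lookup
  private final Map<String, float[]> cache;
  int misses = 0; // number of lookups that had to go to the slow store

  CachedVectorLookup(Function<String, float[]> slowStore, final int capacity) {
    this.slowStore = slowStore;
    // accessOrder = true keeps entries ordered from least to most recently used.
    this.cache = new LinkedHashMap<String, float[]>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, float[]> eldest) {
        return size() > capacity; // evict the least recently used entry when over capacity
      }
    };
  }

  synchronized float[] get(String term) {
    float[] v = cache.get(term);
    if (v == null) {
      misses++;
      v = slowStore.apply(term);
      if (v != null) cache.put(term, v);
    }
    return v;
  }
}
```

If term frequencies are Zipf-like, a small capacity should absorb most lookups for the frequent terms, though how well this holds would depend on the access pattern of the particular job.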

Regards,

Lance


Danica Damljanovic

Jul 8, 2011, 5:38:16 AM

to semanti...@googlegroups.com

Hi Glen,

I think your proposal is great.

We hit the limit with our huge corpus by using 260G of RAM! So, the alternatives were to use smaller dimensionality, a minimum term frequency, etc., but not relying directly on RAM, even with the downside of reduced performance, would be terrific.

Cheers

Danica

--
Best,
Danica Damljanovic
--
Research Associate
GATE team http://gate.ac.uk
Natural Language Processing Group
University of Sheffield
http://www.dcs.shef.ac.uk/~danica

Glen Newton

Jul 18, 2011, 1:01:30 PM

to semanti...@googlegroups.com

Everyone,

Apologies: I was away on vacation, so I am just getting back online now! :-)

I will post a suggested way forward in the next day or so, & address
various points in your emails.

Thanks,
Glen :-)


Glen Newton

Jul 22, 2011, 5:12:46 PM

to semanti...@googlegroups.com

Hi Dominic,

On Thu, Jul 7, 2011 at 10:53 PM, widdows <wid...@google.com> wrote:
> This would be terrific, Glen. A distributed disk-based VectorStore
> would be a very welcome addition, and maybe not too difficult to plug
> in.

Thanks! :-)

> VectorStoreRAM is fast for queries but not very scalable for building
> really large indexes. You'll probably find that the choices of where
> to use different VectorStore implementations in the current codebase
> is somewhat accidental in places - it's evolved more by necessity than
> design. We can probably improve here.
>
> Will either of your database serializations be disrupted if we try to
> serialize vectors that aren't assumed to be float[] arrays? We've been
> working on serializations for vectors from binary and complex vector
> spaces as well, so the interfaces are going to become more abstract
> (see
> http://code.google.com/p/semanticvectors/source/browse/branches/typed-vectors#typed-vectors%2Fpitt%2Fsearch%2Fsemanticvectors%2Fvectors).
> I hope the new formats can be accommodated, if not we'll review.

I think this should be OK.

However, I was looking at this code, and I think there is a better,
more OO/Java way of doing this:
1 - Vector https://code.google.com/p/semanticvectors/source/browse/branches/typed-vectors/pitt/search/semanticvectors/vectors/Vector.java
should define an interface. If you then want an abstract base class
that implements the interface and that is the root for some
implementations, go ahead and do this.
2 - The whole enum VectorType {
BINARY,
REAL,
COMPLEX;
}
is not a good thing. :-(
I would get rid of it and would instead implement this in a fashion
that could allow arbitrary implementation. For example, I would change
VectorFactory.createZeroVector(VectorType type, int dimension)
https://code.google.com/p/semanticvectors/source/browse/branches/typed-vectors/pitt/search/semanticvectors/vectors/VectorFactory.java
to this:

public static Vector generateRandomVector(
    String vectorInstanceClassName, int dimension, int numEntries, Random random) {
  if (2 * numEntries > dimension) {
    throw new RuntimeException("Requested " + numEntries + " to be filled in sparse "
        + "vector of dimension " + dimension + ". This is not sparse and may cause problems.");
  }
  Vector vectorInstance = null;

  try {
    Class<?> vectorInstanceClass = Class.forName(vectorInstanceClassName);
    // Assumes each implementation has a public constructor
    // (int dimension, int numEntries, Random random).
    Constructor<?> constructor =
        vectorInstanceClass.getConstructor(int.class, int.class, Random.class);
    vectorInstance = (Vector) constructor.newInstance(dimension, numEntries, random);
  }
  catch (ClassNotFoundException e1) {
    // handle
  }
  catch (NoSuchMethodException e1) {
    // handle
  }
  catch (InstantiationException e1) {
    // handle
  }
  catch (IllegalAccessException e1) {
    // handle
  }
  catch (java.lang.reflect.InvocationTargetException e1) {
    // handle
  }
  return vectorInstance;
}

Usage:
Vector rv = VectorFactory.generateRandomVector("pitt.search.semanticvectors.vectors.BinaryVector",
200, 10, random); // dimension = 200, numEntries = 10

This is very plugin-like, and allows alternate implementations to
be easily dropped in place. I use a more complex version of this for the
plugins in LuSql. And this is how I am looking at implementing the
VectorStore plugins.
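
For anyone who wants to try the class-name approach in isolation, here is a small self-contained sketch. The Vector interface below is a one-method stand-in and ToySparseVector and ReflectiveVectorFactory are hypothetical names used only for the demonstration; the (int, int, Random) constructor convention is an assumption, not something the SV codebase defines.

```java
import java.lang.reflect.Constructor;
import java.util.Random;

// Stand-in for the proposed Vector interface; the real one would have more methods.
interface Vector {
  int getDimension();
}

// Toy implementation, present only so the factory has something to load.
class ToySparseVector implements Vector {
  private final int dimension;
  final int[] nonZeroIndexes;

  // Assumed plugin convention: public (int dimension, int numEntries, Random random).
  public ToySparseVector(int dimension, int numEntries, Random random) {
    this.dimension = dimension;
    this.nonZeroIndexes = new int[numEntries];
    for (int i = 0; i < numEntries; i++) nonZeroIndexes[i] = random.nextInt(dimension);
  }

  public int getDimension() {
    return dimension;
  }
}

final class ReflectiveVectorFactory {
  static Vector generateRandomVector(
      String className, int dimension, int numEntries, Random random) {
    try {
      // asSubclass fails early if the named class does not implement Vector.
      Class<? extends Vector> cls = Class.forName(className).asSubclass(Vector.class);
      Constructor<? extends Vector> ctor =
          cls.getConstructor(int.class, int.class, Random.class);
      return ctor.newInstance(dimension, numEntries, random);
    } catch (ReflectiveOperationException | ClassCastException e) {
      // One wrapped exception instead of five empty catch blocks.
      throw new IllegalArgumentException("Cannot create vector of class " + className, e);
    }
  }
}
```

Collapsing the reflection failures into a single exception is a design choice for the sketch: a misconfigured class name then fails loudly at creation time rather than returning null.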

Let me know what you think. We can talk about this off list if you like. :-)

Thanks,
Glen

Glen Newton

Jul 22, 2011, 6:46:59 PM

to semanti...@googlegroups.com

Hello all,

This proposal is to allow the dynamic loading of VectorStore
implementations, instead of the present situation of using
VectorStoreRAM (in most cases) in the SV codebase. Note that while this does
impose changes to the SV base, existing code using the API will run
exactly the same with no changes needed.

The motivation is to allow very large VectorStores (i.e. ones that exceed RAM) to
transparently use non-RAM solutions, like local disk (perhaps through
Berkeley DB JE) or
through a distributed key-store.

Any class that wants to create a VectorStore must:
1 - use the static instantiator: VectorStoreFactory.newVectorStore
2 - include a string key which uniquely identifies the VectorStore
_across_all_scopes_.
The key is used to look up the VectorStore implementor class that
the user wants to use. This information is made available in a
properties file passed in as a system property by the user to the VM

Changes to VectorStore interface:
Add:
public void init(Properties p);
public void close();

Example changes to SV codebase:
TermTermVectorsFromLucene.java

private VectorStoreRAM termVectors;
> private VectorStore termVectors;

> public final String TermVectorStoreKey1 = "TermTermVectorsFromLucene_vs1";

this.termVectors = new VectorStoreRAM(dimension);
> this.termVectors = VectorStoreFactory.newVectorStore(TermVectorStoreKey1, dimension);

When you run, you must set the system property at the command line so
that the VectorStoreFactory can find its properties file:
java -Dpitt.search.semanticvectors.VectorStoreFactory.propertiesFile="/home/gnewton/vsf.properties"
Note that if the system property
pitt.search.semanticvectors.VectorStoreFactory.propertiesFile is not
set, then the default VectorStore (VectorStoreRAM) is instantiated.

Contents of /home/gnewton/vsf.properties:
#Start vsf.properties
TermTermVectorsFromLucene_vs1.ClassName=pitt.search.semanticvectors.BDBJE.VectorStoreBDBJE
# ClassName is the class that implements VectorStore that is to be
# instantiated by the factory
# This is an example of a Berkeley DB Java Edition implementation
# http://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html

TermTermVectorsFromLucene_vs1.ClassPropertiesFile=/home/gnewton/bdbje.properties
# This is the properties file that contains the implementation-specific
# properties of the VectorStore implementation. The properties and
# their values will be specific to this implementation and defined by
# the implementor. This properties file
# is loaded and given to the VectorStore implementation in the
# init(Properties p) method. Not needed for VectorStoreRAM but needed
# for VectorStoreBDBJE, etc.
# Each VectorStore should have a unique properties file here (if needed)

#End vsf.properties

Notes:
- Of course, more than one VectorStore can be in one object. They just need to
have different keys. They all need to define their properties (ClassName and
properties file name) in the above properties file.
- Different class instances can instantiate a (local) VectorStore
which should be unique to the object (or to multiple threads in the same
object/method).
Clearly it should not be the same VectorStore, but for some
implementations the previous instance would be overwritten. How
to deal with this? --> In the ClassPropertiesFile, every {$} in a
property value is
replaced by an integer counter before use. So, for example, in the
ClassPropertiesFile for
TermTermVectorsFromLucene_vs1 using VectorStoreBDBJE, the BDB library
needs a local directory to keep the
keystore. Here is an example property for this:
bdbDir=/mnt/bigdisk/vectorStore/foobar_{$}
The first time this particular VectorStoreBDBJE is instantiated, its
BDB directory is /mnt/bigdisk/vectorStore/foobar_0
The second time, /mnt/bigdisk/vectorStore/foobar_1
So overwriting does not happen (this part of the instantiation is
wrapped in a lock). Note that the actual implementation goes a step
further to deal with multiple instances of the same app running: it
appends UUID.randomUUID().toString(), producing something like this:
/mnt/bigdisk/vectorStore/foobar_0_067e6162-3b6f-4ae2-a171-2470b63dff00
/mnt/bigdisk/vectorStore/foobar_1_067e6162-3b6f-4ae2-a171-2470b63dff00
- Applications using VectorStores should use the close() method to
free up resources.
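
The {$} substitution described in the notes could look roughly like the sketch below. PropertyMangler is a hypothetical name, and the sketch uses an AtomicInteger in place of the explicit lock mentioned above, which is a substitution on my part rather than the actual implementation.

```java
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for the {$} placeholder; not the actual implementation.
final class PropertyMangler {
  private static final String PLACEHOLDER = "{$}";
  // AtomicInteger replaces the explicit lock: each call gets a distinct counter value.
  private static final AtomicInteger COUNTER = new AtomicInteger();
  // One UUID per JVM, so several running copies of the same app do not collide.
  private static final String RUN_ID = UUID.randomUUID().toString();

  // Returns a copy of the properties with every {$} replaced by "<counter>_<uuid>".
  static Properties mangle(Properties in) {
    Properties out = new Properties();
    int n = COUNTER.getAndIncrement(); // the same value is used for all keys of one store
    for (String key : in.stringPropertyNames()) {
      out.setProperty(key, in.getProperty(key).replace(PLACEHOLDER, n + "_" + RUN_ID));
    }
    return out;
  }
}
```

Calling mangle twice on a properties object containing bdbDir=/mnt/bigdisk/vectorStore/foobar_{$} yields two different directories, so a second instance would not overwrite the first.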

The sequence for instantiation in VectorStoreFactory.newVectorStore(String
termVectorStoreKey, int dimension) is as follows (in pseudo Java, with
catch/throws missing and I/O simplified):

VectorStore newVectorStore(String termVectorStoreKey, int dimension)
// start
  String propFile =
      System.getProperty("pitt.search.semanticvectors.VectorStoreFactory.propertiesFile");
  if (propFile == null)
    return new VectorStoreRAM(dimension); // no system property set
  Properties vsProps = Properties.load(propFile); // pseudo: I/O simplified
  // Key names match the vsf.properties example above (property keys are case-sensitive).
  String className = vsProps.getProperty(termVectorStoreKey + ".ClassName");
  if (className == null)
    return new VectorStoreRAM(dimension); // no ClassName defined for this key
  // Instantiate the class instance (may fail if the class is not defined, is
  // not on the class path, or does not define the correct constructor:
  // public VectorStore(int dimension))
  VectorStore vectorStoreInstance = .........

  Properties classProperties = null;
  String classPropertiesFile =
      vsProps.getProperty(termVectorStoreKey + ".ClassPropertiesFile");
  if (classPropertiesFile != null) {
    classProperties = Properties.load(classPropertiesFile); // pseudo: I/O simplified
    // handle unique construction: modify properties if {$} is encountered
    lock.lock();
    try {
      classProperties = mangledClassProperties(classProperties);
    } finally {
      lock.unlock(); // always release, even if mangling fails
    }
  }
  vectorStoreInstance.init(classProperties);
  return vectorStoreInstance;
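
For reference, the sequence above can be fleshed out into something that compiles on its own. This is a sketch under assumptions: the VectorStore interface and VectorStoreRAM below are minimal stand-ins (the real SV types have more methods), the (int dimension) constructor convention is taken from the comment in the pseudocode, and the {$} handling is left out for brevity.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

// Minimal stand-ins so the sketch is self-contained; the real SV types differ.
interface VectorStore {
  void init(Properties p);
  void close();
}

class VectorStoreRAM implements VectorStore {
  final int dimension;
  public VectorStoreRAM(int dimension) { this.dimension = dimension; }
  public void init(Properties p) {}
  public void close() {}
}

final class VectorStoreFactory {
  static final String PROPS_KEY =
      "pitt.search.semanticvectors.VectorStoreFactory.propertiesFile";

  static VectorStore newVectorStore(String storeKey, int dimension) {
    String propFile = System.getProperty(PROPS_KEY);
    if (propFile == null) return new VectorStoreRAM(dimension); // no system property set

    Properties vsProps = load(propFile);
    String className = vsProps.getProperty(storeKey + ".ClassName");
    if (className == null) return new VectorStoreRAM(dimension); // key not configured

    VectorStore store;
    try {
      // Assumed plugin convention: a public constructor taking (int dimension).
      store = Class.forName(className).asSubclass(VectorStore.class)
          .getConstructor(int.class).newInstance(dimension);
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalStateException("Cannot instantiate " + className, e);
    }

    String classPropsFile = vsProps.getProperty(storeKey + ".ClassPropertiesFile");
    // Pass empty properties rather than null when no file is configured.
    Properties classProps = (classPropsFile == null) ? new Properties() : load(classPropsFile);
    store.init(classProps); // {$} substitution omitted in this sketch
    return store;
  }

  private static Properties load(String path) {
    Properties p = new Properties();
    try (InputStream in = new FileInputStream(path)) {
      p.load(in);
    } catch (IOException e) {
      throw new UncheckedIOException("Cannot read " + path, e);
    }
    return p;
  }
}
```

Callers would then hold a VectorStore reference and call close() when finished, regardless of which implementation the properties file selected.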

This is a simple method to implement this sort of plugin and much less
onerous than other ways of doing this.
I hope this does not seem too complicated.

Feedback appreciated. :-)

Thanks,
Glen


Dominic Widdows

Jul 22, 2011, 6:54:32 PM

to semanti...@googlegroups.com

Hi Glen,

Thanks for looking at this.

Options include (and these are just the ones I know of so far):

- Generic Vector<Type> construction. I have antibodies to this because it's so viral in the calling code, but my antibodies are from just a couple of situations that I didn't design very well, so this criticism is not from an expert.

- Parameterized implementation with fixed enum like we have now. Pretty static, but pretty clear.

- Class based implementation such as you propose, which makes sense to me in the main (though I don't understand the details well enough yet).

I hear you on the plugin-like architecture. To begin with I'd thought "How many cases can there be? According to Frobenius there aren't that many natural ground fields!" But as the implementations have progressed, we have different forms of sparse and dense representation with different serializations. Many mathematically simple concepts end up with several different implementations (set, list, collection, enumeration). So yes, it's perfectly possible that someone comes along not just with a quaternion or prime-p vector that we don't have yet, but with a new implementation of (say) RealVector that is optimized for different operations.

But another question is "How often will this happen for real?". A few too many times I've put a lot of work into making an architecture that will scale seamlessly to any number of plugin components, and then we only ever ended up with a handful of such components and it just wasn't worth the effort.

One option would be to release a new version with what we have at present and be prepared to switch if new implementations arise for which the current system doesn't work.

Please do write back and let me know if I've missed important pros and cons. I agree we should take this off-list if it becomes really too detailed, but I don't think we've hit that barrier yet - you never know who might be interested.

Best wishes,

Dominic

Lance De Vine

Jul 23, 2011, 9:15:07 PM

to semanti...@googlegroups.com

Hello,

I have been looking at the Infinispan javadocs, and in particular at Cache, CacheManager, CacheStore, etc..., and it provides what seems to be quite a good design which may map well to the design requirements of SemanticVectors (?). I am sure that there are other similar software with similar designs.

I would be in favour of having a VectorStoreManager class that basically would be a container for references to VectorStores with the ability to query for info on created VectorStores.
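
To sketch what such a manager might look like: the class below is a named registry of already-created stores that can be queried for its keys. Everything here is hypothetical, including the method names, and the type parameter S simply stands for whatever VectorStore type the codebase ends up with.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical manager: a container for references to created stores.
final class VectorStoreManager<S> {
  private final Map<String, S> stores = new ConcurrentHashMap<>();

  // Returns the store registered under key, creating it on first use.
  S getOrCreate(String key, Supplier<S> creator) {
    return stores.computeIfAbsent(key, k -> creator.get());
  }

  // Read-only view of the keys of all stores created so far.
  Set<String> keys() {
    return Collections.unmodifiableSet(stores.keySet());
  }

  // Forgets a store and hands it back so the caller can close it.
  S remove(String key) {
    return stores.remove(key);
  }
}
```

In this form the manager is an index of instantiated objects; creation is delegated to whatever supplier the caller passes in, which could be a factory.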

Regards,

Lance


Dominic Widdows

Jul 26, 2011, 2:40:04 PM

to semanti...@googlegroups.com

Hi Glen, Lance,

It's taken a few days to get back to you on the VectorStore design, apologies for that.

On Sat, Jul 23, 2011 at 9:15 PM, Lance De Vine <l.de...@qut.edu.au> wrote:

> Hello,
>
> I have been looking at the Infinispan javadocs, and in particular at Cache, CacheManager, CacheStore, etc..., and it provides what seems to be quite a good design which may map well to the design requirements of SemanticVectors (?). I am sure that there are other similar software with similar designs.
>
> I would be in favour of having a VectorStoreManager class that basically would be a container for references to VectorStores with the ability to query for info on created VectorStores.

Query for instances or for classnames? (So is the Manager more of a factory for new objects or more of an index of instantiated objects?)

> Regards,
>
> Lance
>
> -----Original Message-----
> From: semanti...@googlegroups.com [mailto:semanti...@googlegroups.com] On Behalf Of Glen Newton
> Sent: Saturday, 23 July 2011 8:47 AM
> To: semanti...@googlegroups.com
> Subject: Re: Use of VectorStoreRAM
>
>> Hello all,
>>
>> This proposal is to allow the dynamic loading of VectorStore
>> implementations, instead of the present situation of using
>> VectorStoreRAM (in most cases) in the SV codebase. Note that while this does
>> impose changes to the SV base, existing code using the API will run
>> exactly the same with no changes needed.
>>
>> The motivation is to allow very large VectorStores (i.e. that exceed RAM) to
>> transparently use non-RAM solutions, like local disk (perhaps through
>> Berkeley DB JE) or
>> through a distributed key-store.
>>
>> Any class that wants to create a VectorStore must:
>> 1 - use the static instantiator: VectorStoreFactory.newVectorStore
>> 2 - include a string key which uniquely identifies the VectorStore
>> _across_all_scopes_.

Is the fully qualified class name (e.g., pitt.search.semanticvectors.VectorStoreRAM) enough for this? Or are you trying to track versions as well?

>> The key is used to look up the VectorStore implementor class that
>> the user wants to use. This information is made available in a
>> properties file passed in as a system property by the user to the VM
>>
>> Changes to VectorStore interface:
>> Add:
>> public void init(Properties p);
>> public void close();
>>
>> Example changes to SV codebase:
>> TermTermVectorsFromLucene.java
>>
>> private VectorStoreRAM termVectors;
>> > private VectorStore termVectors;
>>
>> > public final String TermVectorStoreKey1 = "TermTermVectorsFromLucene_vs1";
>>
>> this.termVectors = new VectorStoreRAM(dimension);
>> > this.termVectors = VectorStoreFactory.newVectorStore(TermVectorStoreKey1, dimension);

Note that any VectorStore will be restricted by type of vector (real, binary, complex) as well as by dimension. It's been getting more and more complex to pass dimension around properly - for example, binary vectors need to have dimension a multiple of 64, but making this correction inside the vectors library left a mismatch between the dimension of the individual vectors and the dimension expected by the VectorStore. In the recent branch I've reverted to using the Flags.dimension flag for everything. But by now I regret this move, e.g., it makes testing brittle because you have to remember to reset the flags after tests, which is a good indication that the design is wrong.

Lance proposed passing some of this around in a config object, that may work.
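
As one possible shape for such a config object, the sketch below bundles vector type and dimension into an immutable value and applies the multiple-of-64 correction in exactly one place. The name VectorConfig is hypothetical, and rounding up (rather than rejecting or rounding down) is an assumption made for the example.

```java
// Same three types as the enum in the typed-vectors branch.
enum VectorType { BINARY, REAL, COMPLEX }

// Hypothetical immutable config passed to vectors and stores instead of global flags.
final class VectorConfig {
  final VectorType type;
  final int dimension;

  VectorConfig(VectorType type, int requestedDimension) {
    if (requestedDimension <= 0) {
      throw new IllegalArgumentException("dimension must be positive");
    }
    this.type = type;
    // Assumption: binary dimensions are rounded UP to the next multiple of 64,
    // once, here, so vectors and stores always agree on the value.
    this.dimension = (type == VectorType.BINARY)
        ? ((requestedDimension + 63) / 64) * 64
        : requestedDimension;
  }
}
```

Because the object is immutable and passed explicitly, tests can build their own configs without having to reset any global state afterwards.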

>> When you run, you must set the system property at the command line so
>> that the VectorStoreFactory can find its properties file:
>> java -Dpitt.search.semanticvectors.VectorStoreFactory.propertiesFile="/home/gnewton/vsf.properties"
>> Note if the system property
>> pitt.search.semanticvectors.VectorStoreFactory.propertiesFile system
>> property is not
>> set, then the default VectorStore is instantiated (VectorStoreRAM)

Where is this default configured? In the class that uses it or somewhere else?

So this is about how to manage several vector stores of the same type in memory at the same time? Is this prevalent enough that we need any automated way of managing these? So far each builder or searcher implementation only needs to know about (at most) two or three vector stores. I may be missing some of the motivation here.

Having a list of strings referring to particular vector store types sounds easy enough, and this sounds quite like our ongoing vector type discussions.

It certainly doesn't sound too complicated in itself. However it does sound like it touches several parts of the codebase (like the overused flags design), and I'd like to understand the relationship between any properties file, flag-based configuration, and how defaults arise. This will point to some deficiencies in the current codebase as well. It may be too ambitious of me, but I'd like to try and make sure that this design is consistent with solving some of the problems we know we have already.

Thanks again for putting this together. One suggestion: maybe we should be collaborating with comments on a shared Google doc? Easier to follow perhaps than email.

Best wishes,

Dominic

Glen Newton

Jul 26, 2011, 10:23:59 PM

to semanti...@googlegroups.com

Hello,

On Tue, Jul 26, 2011 at 2:40 PM, Dominic Widdows <wid...@google.com> wrote:
> Hi Glen, Lance,
> It's taken a few days to get back to you on the VectorStore design,
> apologies for that.
>
> On Sat, Jul 23, 2011 at 9:15 PM, Lance De Vine <l.de...@qut.edu.au> wrote:
>>
>> Hello,
>>
>> I have been looking at the Infinispan javadocs, and in particular at
>> Cache, CacheManager, CacheStore, etc..., and it provides what seems to be
>> quite a good design which may map well to the design requirements of
>> SemanticVectors (?). I am sure that there are other similar software with
>> similar designs.
>>
>> I would be in favour of having a VectorStoreManager class that basically
>> would be a container for references to VectorStores with the ability to
>> query for info on created VectorStores.
>
> Query for instances or for classnames? (So is the Manager more of a factory
> for new objects or more of an index of instantiated objects?)

I would prefer (and have implemented in the past) a factory for new
objects, and the new objects behave like VectorStores.
This way the code only needs to be altered in a few places.

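> Is the fully qualified class name (e.g.,
> pitt.search.semanticvectors.VectorStoreRAM) enough for this? Or are you
> trying to track versions as well?

FQCN is needed.
What do you mean by 'version'?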

>
>>
>> The key is used to look up the VectorStore implementor class that
>> the user wants to use. This information is made available in a
>> properties file passed in as a system property by the user to the VM
>>
>>
>> Changes to VectorStore interface:
>> Add:
>> public void init(Properties p);
>> public void close();
>>
>> Example changes to SV codebase:
>> TermTermVectorsFromLucene.java
>>
>> private VectorStoreRAM termVectors;
>> > private VectorStore termVectors;
>>
>> > public final String TermVectorStoreKey1 =
>> > "TermTermVectorsFromLucene_vs1";
>>
>> this.termVectors = new VectorStoreRAM(dimension);
>> > this.termVectors =
>> > VectorStoreFactory.newVectorStore(TermVectorStoreKey1, dimension);
>
> Note that any VectorStore will be restricted by type of vector (real,
> binary, complex) as well as by dimension. It's been getting more and more
> complex to pass dimension around properly - for example, binary vectors need
> to have dimension a multiple of 64, but making this correction inside the
> vectors library left a mismatch between the dimension of the individual
> vectors and the dimension expected by the VectorStore. In the recent branch
> I've reverted to using the Flags.dimension flag for everything. But by now I
> regret this move, e.g., it makes testing brittle because you have to
> remember to reset the flags after tests, which is a good indication that the
> design is wrong.
>
> Lance proposed passing some of this around in a config object, that may
> work.

I think this should be orthogonal to that.

>>
>>
>> When you run, you must set the system property at the command line so
>> that the VectorStoreFactory can find its properties file:
>> java
>> -Dpitt.search.semanticvectors.VectorStoreFactory.propertiesFile="/home/gnewton/vsf.properties"
>> Note if the system property
>> pitt.search.semanticvectors.VectorStoreFactory.propertiesFile system
>> property is not
>> set, then the default VectorStore is instantiated (VectorStoreRAM)
>
> Where is this default configured? In the class that uses it or somewhere
> else?

The default is VectorStoreRAM.

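> So this is about how to manage several vector stores of the same type in
> memory at the same time?

Yes.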

>Is this prevalent enough that we need any automated
> way of managing these?

I believe it is, and in general it is nice to make things flexible
going in, if the overhead is not too high.

>So far each builder or searcher implementation only
> needs to know about (at most) two or three vector stores. I may be missing
> some of the motivation here.

I'm not sure I understand your question. :-)
Perhaps ask it in a different way...

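> Having a list of strings referring to particular vector store types sounds
> easy enough, and this sounds quite like our ongoing vector type discussions.

The strings are the class names, and this uses the Java classloader mechanism.
I think - correct me if I am mistaken - that this is different from
what you were proposing.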

> It certainly doesn't sound too complicated in itself. However it does sound
> like it touches several parts of the codebase (like the overused flags
> design), and I'd like to understand the relationship between any properties
> file, flag-based configuration, and how defaults arise. This will point to
> some deficiencies in the current codebase as well. It may be too ambitious
> of me, but I'd like to try and make sure that this design is consistent with
> solving some of the problems we know we have already.
> Thanks again for putting this together. One suggestion: maybe we should be
> collaborating with comments on a shared Google doc? Easier to follow perhaps
> than email.

Yes, this sounds fine.

> Best wishes,
> Dominic

Thanks,
Glen


Dominic Widdows

Jul 27, 2011, 10:23:26 AM

to semanti...@googlegroups.com

Hi Glen,

On Tue, Jul 26, 2011 at 10:23 PM, Glen Newton <glen....@gmail.com> wrote:

> Hello,
>
> On Tue, Jul 26, 2011 at 2:40 PM, Dominic Widdows <wid...@google.com> wrote:
>> Hi Glen, Lance,
>> It's taken a few days to get back to you on the VectorStore design,
>> apologies for that.
>>
>> On Sat, Jul 23, 2011 at 9:15 PM, Lance De Vine <l.de...@qut.edu.au> wrote:
>>>
>>> Hello,
>>>
>>> I have been looking at the Infinispan javadocs, and in particular at
>>> Cache, CacheManager, CacheStore, etc..., and it provides what seems to be
>>> quite a good design which may map well to the design requirements of
>>> SemanticVectors (?). I am sure that there are other similar software with
>>> similar designs.
>>>
>>> I would be in favour of having a VectorStoreManager class that basically
>>> would be a container for references to VectorStores with the ability to
>>> query for info on created VectorStores.
>>
>> Query for instances or for classnames? (So is the Manager more of a factory
>> for new objects or more of an index of instantiated objects?)
>
> I would prefer (and have implemented in the past) a factory for new
> objects, and the new objects behave like VectorStores.
> This way the code only needs to be altered in a few places.

I do agree that this makes a lot of sense.

> FQCN is needed.
> What do you mean by 'version'?

That (for example) the VectorStoreRAM constructor may in some versions of the package take no arguments, or it may take a dimension, and in future versions of the package it may take a vector type (real, complex, binary or something).

> I think this should be orthogonal to that.

I'm not sure they can be orthogonal, because of the above consideration. If the VectorStores take dimension and vectortype as construction parameters, then the VectorStoreFactory needs to know about them as well. If not, and instead we have global flags / configs, that at least raises the question "What happens when the VectorStoreFactory tries to create a VectorStore from a file on disk built with a different configuration?"
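
One way to make that question answerable is for each store on disk to record the configuration it was built with, so that a mismatch is detected when the store is opened. The sketch below is purely illustrative: StoreHeader and its layout are hypothetical and are not the actual SV file format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical header written at the start of a store file.
final class StoreHeader {
  final String vectorType; // e.g. "REAL", "BINARY", "COMPLEX"
  final int dimension;

  StoreHeader(String vectorType, int dimension) {
    this.vectorType = vectorType;
    this.dimension = dimension;
  }

  void writeTo(DataOutputStream out) throws IOException {
    out.writeUTF(vectorType);
    out.writeInt(dimension);
  }

  static StoreHeader readFrom(DataInputStream in) throws IOException {
    String type = in.readUTF();
    int dim = in.readInt();
    return new StoreHeader(type, dim);
  }

  // Fails fast if the store was built with a different configuration.
  void requireCompatible(String expectedType, int expectedDimension) {
    if (!vectorType.equals(expectedType) || dimension != expectedDimension) {
      throw new IllegalStateException("Store was built as " + vectorType + "/" + dimension
          + " but the current configuration is " + expectedType + "/" + expectedDimension);
    }
  }
}
```

A factory could call requireCompatible right after opening an existing store, or alternatively adopt the header's values instead of the global flags; which of the two is right is exactly the design question raised above.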

> I'm not sure I understand your question. :-)
> Perhaps ask it in a different way...

At the moment, very few classes need to know about many vector stores. The Search class may need a queryvectorstore and a searchvectorstore, and vector learning classes typically need to know about input (often elemental) vectors and semantic vectors. Right now I can't think of any class that uses more than two vector stores, and they're just passed in as arguments or constructed in the body of the class. Having a VectorStore registry and passing around the keys for looking them up sounds more complex than this.

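> The strings are the class names, and this uses the Java classloader mechanism.
> I think - correct me if I am mistaken - that this is different from
> what you were proposing.

If we don't need to track versions of the classes then we're on the same page here.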

>> It certainly doesn't sound too complicated in itself. However it does sound
>> like it touches several parts of the codebase (like the overused flags
>> design), and I'd like to understand the relationship between any properties
>> file, flag-based configuration, and how defaults arise. This will point to
>> some deficiencies in the current codebase as well. It may be too ambitious
>> of me, but I'd like to try and make sure that this design is consistent with
>> solving some of the problems we know we have already.
>> Thanks again for putting this together. One suggestion: maybe we should be
>> collaborating with comments on a shared Google doc? Easier to follow perhaps
>> than email.
>
> Yes, this sounds fine.

I just started a blank doc and sent you an invitation. Please let me know if you run into any problems with this.