Some random thoughts for online retriever && offline store


Blake -Qiulei Guo

Mar 9, 2021, 3:09:30 AM3/9/21
to feast-dev
Just sharing some random thoughts on a couple of potential Feast features.

(1) online retriever: For online retrieval (get_online_features), it seems users must explicitly provide all the entities. How about a filter-style retriever? For example, suppose there is a feature table with multiple entities such as customer_city, customer_id, etc. Wouldn't it be more useful if the user (a data scientist) could retrieve all the feature values matching some specific entity or entities (e.g., customer_city) without explicitly supplying values for every entity (e.g., every customer_id)?
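A minimal sketch of what such a filter-style lookup could look like, with a plain Python dict standing in for the online store; the helper name and store layout are illustrative, not part of Feast's API:

```python
# Online store keyed by the full composite entity key (customer_city, customer_id).
online_store = {
    ("NYC", "cust_1"): {"avg_order_value": 42.0},
    ("NYC", "cust_2"): {"avg_order_value": 17.5},
    ("SF", "cust_3"): {"avg_order_value": 99.9},
}

def get_online_features_by_filter(store, customer_city):
    """Return features for every entity in the given city, without the
    caller having to list the customer_ids explicitly."""
    return {key: feats for key, feats in store.items() if key[0] == customer_city}

nyc_features = get_online_features_by_filter(online_store, "NYC")
```

A real key-value store would need either a secondary index or a scan to answer this query; that trade-off is what the replies in this thread dig into.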


(2) offline store: I notice there have been a few discussions about the offline store, and it seems the final decision is BigQuery/Snowflake/Redshift? Although they are good, they are all commercial solutions, and some companies might use none of them. How about an open-source option such as Hive Metastore + Iceberg/Hudi?
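For illustration, this is roughly what wiring Spark to an Iceberg catalog backed by a Hive Metastore looks like (spark-defaults.conf style; the catalog name feast_offline and the metastore host are assumptions, not Feast configuration):

```
spark.sql.catalog.feast_offline        org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.feast_offline.type   hive
spark.sql.catalog.feast_offline.uri    thrift://metastore-host:9083
```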



Best,
QiuLei (Blake)

Willem Pienaar

Mar 15, 2021, 2:48:25 PM3/15/21
to Blake -Qiulei Guo, feast-dev
Hey Blake,
  1. Online Retriever: It would definitely be useful to have that functionality, but the more query patterns we expose through our APIs (and the fewer we give up), the harder it is for us to model our data and the fewer databases we can actually use. If we allow range scanning then Redis isn't viable anymore, and we also won't be able to provide strict performance guarantees. I'd love to find out whether you have a specific blocker in fetching these entities through an upstream call, or whether you think the feature store should be used for candidate selection.
  2. We're going to try to make it super easy to add new offline stores, and there won't be any restriction on the types of stores that folks can use. Open source is always a priority for us (and many of the teams that use Feast).
Regards,
Willem


Jake Mannix

Mar 16, 2021, 1:24:15 AM3/16/21
to Willem Pienaar, Blake -Qiulei Guo, feast-dev
Hi Willem (and Blake),

  Re: 1) - Fetching by KV is obviously the "lowest common denominator" of all random-access stores, but one tiny step beyond that is prefix scanning, which *is* supported by Redis to my understanding (SCAN), as well as by HBase and Cassandra (although I'm sure these queries aren't "free" relative to KV queries). Alternatively, a Redis API for appending (APPEND) to a List-typed value would cover the following common use case: essentially any feature situation where a collection is the logical value and we don't want to fetch the full collection, append to it, and then write back the size-N+1 collection. This could be implemented *either* as APPEND *or* by writing to a compound key with a prefix (doc1234_) that is the known part of the key at query time, followed by even just a timestamp, so that at query time you can say "get me the first N values for keys starting with doc1234_" (where N is some smallish, reasonable size: 10-1000, say).
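A minimal sketch of the compound-key variant, with an ordinary dict standing in for a store that supports ordered or prefix key scans (Redis SCAN with MATCH, HBase, Cassandra); all names here are illustrative:

```python
store = {}

def append_event(entity_id, timestamp, value):
    # Zero-pad the timestamp so lexicographic key order matches numeric order.
    store[f"{entity_id}_{timestamp:020d}"] = value

def first_n(entity_id, n):
    # "Get me the first N values for keys starting with doc1234_".
    prefix = f"{entity_id}_"
    keys = sorted(k for k in store if k.startswith(prefix))
    return [store[k] for k in keys[:n]]

append_event("doc1234", 1615852800, "click")
append_event("doc1234", 1615852900, "view")
append_event("doc9999", 1615853000, "click")
recent = first_n("doc1234", 10)
```

In a real backing store the `sorted`/`startswith` pass would be the store's own prefix scan, so each write stays O(1) and each read touches only the matching keys.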

  This is important for a lot of use cases where the features are essentially immutable, append-only collections: the most recent recordIds a user has interacted with, the most recent search queries a user has issued; basically any time you've got "the most recent X's which have done 'a thing' to Y" and you want to later use Y as your query key. Sure, you could do this with an immutable collection as your value, fetch the whole value each time you want to update it, and write it back, but for write-heavy workloads and collection sizes larger than "tiny", this is going to be pretty ugly performance-wise (amortized O(N^2) cost to build a size-N collection per key).
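A quick back-of-envelope check on that O(N^2) point: update number i has to read the current i elements and write back i+1, so building a size-N collection via read-modify-write moves 1 + 3 + ... + (2N-1) = N^2 elements in total, versus N for a true append (illustrative Python, not Feast code):

```python
def read_modify_write_cost(n):
    # Each of the n updates fetches the current collection (i elements)
    # and writes back i + 1 elements, i.e. 2i + 1 elements moved.
    return sum(2 * i + 1 for i in range(n))

def append_cost(n):
    # A true append writes one element per update.
    return n

n = 1000
copies_rmw = read_modify_write_cost(n)     # 1,000,000 elements moved
copies_append = append_cost(n)             # 1,000 elements moved
```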

  Any chance efficient mutable-collection support could be added to the API, for suitably constrained collections (like append-only lists)? Either composite keys with range scans or a literal "append" (ideally with some limit on the collection size, I guess) seems generally supported by most backing storage systems (heck, even S3 supports prefix queries in the way it "fakes" a directory structure).

  -jake




Willem Pienaar

Mar 16, 2021, 6:25:11 PM3/16/21
to Jake Mannix, Blake -Qiulei Guo, feast-dev
Hi Jake,

Thanks for the input. I think there are a couple of considerations:
  • Implementation: While it's true that Redis supports SCAN, it doesn't support in-memory joins as far as I can tell, which means more of that logic would move into our serving layer, adding further complexity to the way features are retrieved. It's not clear how this would affect performance, so we'd need to spend a bit of time evaluating that. So while I don't think it's particularly challenging to implement, I wouldn't call it a "tiny step", especially since we've learnt from other feature-store builders that implemented wildcard scanning and ended up spending a lot of time optimizing queries and data models.
  • Use cases: I think the real question is whether we are excluding important use cases by not having range scans on an entity column. I'd love for you to expand on the two use cases above so that I understand what the query (request/response) would look like for training and for serving. Specifically, what I'm looking for is whether the two stages are consistent, and whether you'd be serving intermediate aggregations or final features.
Regards,
Willem

Michał D

Mar 18, 2021, 4:56:09 AM3/18/21
to feast-dev
Hi guys

Just wanted to chime in here with a use case which Feast doesn't seem to support yet. I hope this is not too off-topic; you seem to be seeking use cases. :)

I need to retrieve two features of *all the entities* at inference time (i.e., from the online store). I don't know the list of entities.

What I want to achieve is a library lookup: the live model converts the given data into an embedded representation, then compares this representation with known embeddings. The library of known embeddings is ever growing, and I intend to store it in the Feast offline store, because the embeddings are essentially features of one of my entities. I intend to ingest two features from the offline to the online store: entity name and entity embedding. This way, at inference time, I would be able to get all known entities, find the one closest to the new embedding, and return its name. The problem is that I don't store the list of known entities anywhere, especially not at inference time.
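Assuming a hypothetical "return features for all entities" call whose result looks like the embedding_store dict below, the lookup itself is a straightforward nearest-neighbour scan; a minimal sketch:

```python
import math

# Stand-in for the result of scanning the online store: entity name -> embedding.
embedding_store = {
    "alice": [1.0, 0.0],
    "bob":   [0.0, 1.0],
    "carol": [0.7, 0.7],
}

def nearest_entity(query_embedding, store):
    """Return the name of the entity whose embedding is closest (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(store, key=lambda name: dist(store[name], query_embedding))

match = nearest_entity([0.9, 0.1], embedding_store)
```

At ~20k entities a brute-force scan like this is cheap; the open question in the thread is only how the "fetch everything" step would be exposed by the store.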

Best,
Michał

Willem Pienaar

Mar 31, 2021, 2:53:18 PM3/31/21
to Michał D, feast-dev
Hey Michał,

Thanks for sharing your use case. This is exactly what I was looking for. It seems that what you want could reasonably be achieved by a feature lookup where we don't ask you to provide entities, but instead scan and return the features for all entities.

If your list of entities is "ever growing", you'd probably take a latency hit. If you only have a few thousand entities, you'll probably get a response in milliseconds, but scanning over hundreds of thousands of entities to return their embeddings would probably take on the order of seconds.

How important is the online response latency to you? It would be interesting if you had a specific SLO in mind (entity key space and latency target).

Regards,
Willem

Michał D

Apr 7, 2021, 9:35:02 AM4/7/21
to Willem Pienaar, feast-dev
In my case, latency probably won't be a big concern: the model will be exposed via an API used occasionally by humans, not by a service (at least for now).
Yes, my list of entities is ever growing, but growing slowly. The size of my whole dataset will be ~20k records, and I don't expect it to quickly grow to the next order of magnitude.
I don't have a specific SLO (had to look up the term :P), but I'd say that, since my project is more research than production, scanning 20k records and returning their embeddings (vectors of doubles) and names (strings) should take no longer than 3 seconds.

Hope this answers your question. :)

Cheers
Michał