Scoring a matrix of elements

Kostya Salomatin

Dec 15, 2017, 10:45:41 PM
to Scalding Development
Hi scalding experts,

I need advice on a workflow to make my job more efficient. I need to score a matrix of <Users, Items> (a recommendation setup). I have three input sources:

users: TypedPipe[(UserId, UserFeatures)] // user features, keyed by user id
items: TypedPipe[(ItemId, ItemFeatures)] // item features, keyed by item id
pairs: TypedPipe[(UserId, Seq[ItemId])] // matrix records that need scoring.

There is also a scoring function, score(UserFeatures, ItemFeatures).

Let us assume that the number of users is very large and the number of items is modest (thousands). UserFeatures and ItemFeatures are relatively large (thousands of values each).
The matrix is relatively dense: each user should be scored against about a thousand items.

I have several ideas that depend on the scale of the problem.

Case 1. All item features fit into memory.
I load the item features into a ValuePipe as a map: ValuePipe[Map[ItemId, ItemFeatures]]
Then I join users with pairs and cross the result with the items ValuePipe (leftCross is the TypedPipe method that accepts a ValuePipe; it yields an Option):
users.group.join(pairs.group).toTypedPipe.leftCross(itemsValuePipe)
  .map { case ((userId, (userFeatures, itemsToScore)), itemFeaturesMapOpt) => /* do my scoring */ }
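To make the data flow of case 1 concrete, here is a minimal sketch in plain Scala collections rather than Scalding. Everything named here is an assumption for illustration: the type aliases, the dot-product score, and the Case1Sketch object. The in-memory Map plays the role of the ValuePipe contents.

```scala
object Case1Sketch {
  // Hypothetical stand-ins for the real id and feature types.
  type UserId = Long
  type ItemId = Long
  type Features = Vector[Double]

  // Assumed scoring function: dot product of the two feature vectors.
  def score(u: Features, i: Features): Double =
    u.zip(i).map { case (a, b) => a * b }.sum

  // Join users with pairs, then look each item up in the in-memory map.
  def scoreAll(
      users: Seq[(UserId, Features)],
      pairs: Seq[(UserId, Seq[ItemId])],
      items: Map[ItemId, Features]
  ): Seq[(UserId, ItemId, Double)] = {
    val pairsByUser = pairs.toMap
    for {
      (userId, userFeatures) <- users
      itemId <- pairsByUser.getOrElse(userId, Seq.empty)
      itemFeatures <- items.get(itemId).toSeq // items without features are skipped
    } yield (userId, itemId, score(userFeatures, itemFeatures))
  }
}
```

The point of this shape is that each user's features are read once and never copied per item; only the item map is shared across all users.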

Case 2. This is where I need help. The item features do not fit into memory, but they might if I partition them into ~5-10 buckets.
The obvious solution is to replace the cross from case 1 with a join by item id, but then HDFS I/O becomes a killer: each UserFeatures object is replicated once for every item we need to score for that user.

The solution I am considering right now: manually partition the item features into, say, 5 buckets, apply the case 1 solution to each bucket (i.e. do a cross) in a loop, and then combine and re-group the results of these 5 computations.
partitions.map { p =>
  // cross against only the items in partition p, which fit in memory
  users.group.join(pairs.group).cross("items in partition p").map {/* do my scoring */}
}.fold(TypedPipe.empty)(_ ++ _)
 .groupBy( something => memberId) // re-group the combined results by user
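The bucketed idea for case 2 can be sketched the same way in plain Scala collections (again not Scalding). The bucketing by item id modulo the bucket count, the dot-product score, and all names here are assumptions for illustration; any partitioning that keeps each bucket small enough would do.

```scala
object Case2Sketch {
  // Hypothetical stand-ins for the real id and feature types.
  type UserId = Long
  type ItemId = Long
  type Features = Vector[Double]

  // Assumed scoring function: dot product of the two feature vectors.
  def score(u: Features, i: Features): Double =
    u.zip(i).map { case (a, b) => a * b }.sum

  // Assumed bucketing rule; floorMod keeps the bucket index non-negative.
  def bucketOf(itemId: ItemId, numBuckets: Int): Int =
    java.lang.Math.floorMod(itemId, numBuckets.toLong).toInt

  def scoreInBuckets(
      users: Seq[(UserId, Features)],
      pairs: Seq[(UserId, Seq[ItemId])],
      items: Seq[(ItemId, Features)],
      numBuckets: Int
  ): Map[UserId, Seq[(ItemId, Double)]] = {
    val pairsByUser = pairs.toMap
    val perBucket = (0 until numBuckets).map { b =>
      // Only this bucket's item features are held in the map at once.
      val bucketItems: Map[ItemId, Features] =
        items.filter { case (id, _) => bucketOf(id, numBuckets) == b }.toMap
      for {
        (userId, userFeatures) <- users
        itemId <- pairsByUser.getOrElse(userId, Seq.empty)
        itemFeatures <- bucketItems.get(itemId).toSeq
      } yield (userId, (itemId, score(userFeatures, itemFeatures)))
    }
    // Concatenate the per-bucket outputs, then re-group by user id.
    perBucket.flatten.groupBy(_._1).map { case (userId, rows) =>
      userId -> rows.map(_._2)
    }
  }
}
```

Note the trade-off this makes visible: the users are scanned once per bucket, so the user data is read numBuckets times instead of being copied once per item.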

So my questions are:
Does my solution for case 2 make sense?
Is there some smart scalding functionality that already does something similar?
Maybe the way I organize my data for this task is completely wrong and I need to do everything differently?

Thanks,
Kostya