Optimising Paginated Queries - Advise on best methodoloy

68 views
Skip to first unread message

Sky

unread,
Aug 12, 2015, 6:43:10 AM8/12/15
to OrientDB
Hey

So I've spent around 10hrs now looking at various ways to model and query something that I can do easily in an RDMS and have yet to find a solution that feels right using OrientDB though I know it must exist!

The scenario:

A user can watch / love / own a movie. When browsing movies, it should show what actions they have applied, and the total counts of each action for all users.

Use Edge per action (the first attempt) with a updatedAt:dateTime property
1) User (V) -> likes[updatedAt:dateTime] (E) -> Film (V) 
or User (V) -> owns[updatedAt:dateTime] (E) -> Film (V)
or User (V) -> watched[updatedAt:dateTime] (E) -> Film (V)

Use a single Edge with action properties
2) User (V) -> actioned[loved:dateTime, watched:dateTime, owns:dateTime, updatedAt:dateTime] (E) -> Film (V)

User a join table (RDBMS design)
3) User (V) <-> FilmAction (V) <->  Film (V)

I got far with 1, then found it hard to order the films by the last actioned as each edge has it's own updatedAt property and seems to slow on ordering.

With 2, that was resolved, but became very slow when ordering by films which are most loved / owns or watched etc as it has to look at the properties of every edge

3 I've not done yet in OrientDB but had it working fine in PostGres using 2 queries - one to get the films, then another IN clause to get the actions for that film for the logged in user


With design 1, which I feel is the most suitable, I find it very fast  (< 0.1s) for something like this:

select @rid,  in_watched.size(), in_owns.size(), in_loved.size(), title, inE('watched','loved','owns')[out=#12:25903].@class.asList() 
from film

Which gives me the counts and the actions of the relevant user if they exist

However, as soon as I add an order clause, e.g. 

select @rid,  in_watched.size(), in_owns.size(), in_loved.size(), title, inE('watched','loved','owns')[out=#12:25903].@class.asList() 
from film
order by in_watched.size()

query time drops to around 600-700ms which I'm not very happy with knowing I should be getting faster and the test data isn't massively populated

I've tried use LET $a = in_watched etc but doesn't help. Removing the query for the users actions, but keeping the ordering, it runs fast, <100ms so I am faced with one or the other right now

Any advice on the best way to model something like this would be great appreciated or how to optimise a query that gets the count of all the edges, as well as if that edge exists for a specific user.

Kindest Regards

Sky






SavioL

unread,
Aug 12, 2015, 8:41:16 AM8/12/15
to OrientDB
Hi,
certainly a way to increase performance is to index the fields on which you make the query (i do not know maybe you've already done).
For stable which approach is better, I think the only way is to experiment and compare between them.

Regards,
Savio L.

Sky

unread,
Aug 12, 2015, 9:03:00 AM8/12/15
to OrientDB
Hi Savio,

Thanks for the reply. I have indexes and unique keys on my edges like so:

edges: loved UK:in_out, IDX:out, IDX:updatedAt

It's really when I combine the order by e.g. order by in_watched.size() with the filter clause e.g. inE('watched','loved','owns')[out=#12:25903] does it drop my 500ms

Leaving either out, and it is incredibly fast. I agree, that an index could be the solution, as the explain shows: 

@version
resultSize
documentReads
current
documentAnalyzedCompatibleClass
recordReads
fetchingFromTargetElapsed
fullySortedByIndex
orderByElapsed
evaluated
user
elapsed
resultType
indexIsUsedInOrderBy
0
2999
2999
2999
2999
650
false
3
2999
654.3758
collection
false



But I don't know how you index an edge count?

Look forward to hearing people's ideas! After 10hrs of experimenting, it's getting very unproductive! :(

thanks

Sky

Sky

unread,
Aug 15, 2015, 4:34:52 PM8/15/15
to OrientDB
Hey

I would like to post my own solution ;)

I found a very good page in OrientDB docs that I had missed

https://orientdb.com/docs/last/Pivoting-With-Query.html

This gave me the a-ha moment. Work form the smallest subset, and join out. This is somewhat a different conceptual model to a RDMS where one would try to avoid many nested subqueries.

Taking this on board, and using the example given to be clear on syntax, I wrote something like this based on single edge, with a label of the action, and createdAt stamp (so hybrid of designs 1 and 2):

SELECT actions, lastUpdated, Film.title, Film.watched_count as watched_count FROM (
  SELECT LIST(action) as actions, MAX(createdAt) AS lastUpdated, in AS Film from (
    SELECT  flatten(outE('filmAction')) AS actions FROM #12:25903 )
  GROUP BY in 
)
WHERE "owns" IN actions
ORDER BY lastUpdated DESC

The key here, is that I decided to use cache counters for the film totals. Without that, it's painful

Query time runs insanely fast < 100ms over 1 million filmActions (this query finds every film a user has actioned, and brings back the film data, along with the actions they user has made against that film, and the films total action counters)

This design also allows for the LIMIT, ORDER BY and if need be, WHERE clause on the action type. Set operations I was struggling with until I read that OrientDB section.

So..happy...but not quite. I wanted to go back to my Design 1, where the edge is direct, not discriminated by a label and avoid the need for cache counters

SELECT actions, lastUpdated, Film.title, Film.in_loved.size() AS loved, Film.in_owns.size() AS owns, Film.in_watched.size() AS watched FROM (
  SELECT LIST(@class) as actions, MAX(createdAt) AS lastUpdated, in AS Film from (
    SELECT  flatten(outE('loved','owns','watched')) AS actions FROM #12:25903 )
  GROUP BY in 
)
WHERE "owns" IN actions
ORDER BY loved DESC
LIMIT 10

this is giving me around ~230ms over 600k edges. So, whilst this is a more purist approach to Graph design, I do feel it won't scale as well already after prelim studies.

So, I'm now down to choosing between these 2 designs. If anyone can let me know how I can improve the last query here, then I would choose that design, but I think for now, I will use counters 

Thanks

Sky

SavioL

unread,
Aug 17, 2015, 5:53:16 AM8/17/15
to OrientDB
hi Sky,
about indices there is no way to index the count, the important thing is to have indexed fields that are used.
Great post about the pivot-class:), I try to do some tests on my db and see how you can eventually possibly optimize ...

regards,
Savio L.

SavioL

unread,
Aug 17, 2015, 6:10:13 AM8/17/15
to OrientDB
it would be possible for you export your db? (if it is not a problem and there are no sensitive data) .. because with the database orient populated, the tests are more significant.
I repeat if these ores is not a problem for you

regards,
Savio L.
Reply all
Reply to author
Forward
0 new messages