Slow performance when building a weighted table out of edge-vertex relationships

Jean-Sebastien Lemay

unread,

Apr 24, 2015, 7:13:46 AM4/24/15

to orient-...@googlegroups.com

To validate if OrientDB is the right fit for my project, I've created a simple scenario:

Vertex classes (2):

User: represents a user
Tag: represents a tag that a user can interact with (e.g. post a comment using that tag). Contains a name property

Edge classes (1):

UserUsedTag: connects a user to a tag. Contains a timestamp property (which is indexed with NOTUNIQUE)

To sum it up:

[V:User] ---[E:UserUsedTag]--> [V:Tag]

I'm trying to build a query that will let me know which tags have been the most popular in the past [x] minutes/hours/days/months...

As such, here is an example of the query I've got right now:

SELECT inV().name as name, COUNT(in) AS weight 
FROM UserUsedTag 
WHERE timestamp < date('2015-04-24 09:40:00') 
GROUP BY in 
ORDER BY weight DESC

The query works, and I get a proper result set:

name	weight
baseball	6117
soccer	5003

My problem is the performance:

Query executed in 0.311 sec. Returned 2 record(s)

If it takes 1/3 of a second to sift through ~11,000 results, I can only imagine how crippled the performance will be if I am dealing with millions of edges, which I expect to end up with. In fact, as I add new edges, it seems like the query time increases linearly.

Here is the EXPLAIN for the above query:

METADATA	PROPERTIES
@version	resultSize	fullySortedByIndex	documentAnalyzedCompatibleClass	recordReads	fetchingFromTargetElapsed	indexIsUsedInOrderBy	compositeIndexUsed	current	documentReads	projectionElapsed	limit	orderByElapsed	evaluated	groupByElapsed	user	elapsed	resultType	involvedIndexes
0	2	false	11120	11120	313	false	1	#14:11120	11120	6	-1	0	11120	0	#5:0	353.77362	collection	["UserUsedTag.timestamp"]

It seems that 'fetchingFromTargetElapsed' is the biggest bottleneck here. Is this due to the nature of my query? Perhaps my query is not optimal? I am new to graph DBs so I'd like to know if there is any way I can rephrase my query to end up with the same result.

The problem for me is that if I was using a relational DB, I would probably get better performance, if I was to query my "link" table and, with my final two (2) rows, perform one (1) query for each to retrieve the name of the corresponding Tag via the foreign key. Is there any way I can split my query similarly with OrientDB, instead of fetching the same Tag name multiple times?

Luca Garulli

unread,

Apr 24, 2015, 10:39:42 AM4/24/15

to orient-...@googlegroups.com

Hi Jean-Sebastien,

If you have X records in your resultset and execution time is Y, then with X2 records it will be about Y2. There is not magic here.

The most common solution, instead, is to pre-aggregate weight in a tree (year->month->day->hour->minute->second),

so every query takes a constant time, no matter by the number of events.

You can do this by writing a Hook that automatically updates counters upon create/update/delete.

In this way the query above would be super fast.

--

Best Regards,

Luca Garulli

CEO at Orient Technologies LTD

the Company behind OrientDB

http://about.me/luca.garulli

--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jean-Sebastien Lemay

unread,

Apr 24, 2015, 11:02:43 AM4/24/15

to orient-...@googlegroups.com

Understood about X2 - - > Y2. However I wanted to confirm if the bottleneck was due to the system navigating the edge 6000 times to retrieve the "name" field from the same vertex.

I'm afraid hooks are out of the question since my next query is to determine, for a specific user, which tags he used the most. I cannot start creating a table for each user for each day month year etc. Especially since I mean past [x] months from this instant, not a calendar month.

Can you confirm if my query is slow due to the repeated re-fetching of the "name" field of my Tag vertex?

You received this message because you are subscribed to a topic in the Google Groups "OrientDB" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/orient-database/rLBqzmENmcU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to orient-databa...@googlegroups.com.

Luca Garulli

unread,

Apr 24, 2015, 12:31:02 PM4/24/15

to orient-...@googlegroups.com

The bottleneck there is the 11k records that match the condition:

timestamp < date('2015-04-24 09:40:00')

Best Regards,

Luca Garulli

CEO at Orient Technologies LTD

the Company behind OrientDB

http://about.me/luca.garulli

Jean-Sebastien Lemay

unread,

Apr 25, 2015, 10:41:38 AM4/25/15

to orient-...@googlegroups.com

Hi there,

OK -- I'm a bit surprised, in a way; this is a straightforward SELECT statement.

I decided to run another experiment. My scenario: determine which tags are a given user's favourite.

My query in OrientDB:

SELECT in.name AS name, COUNT(in.name) AS weight 
FROM UserUsedTag
WHERE "john.doe" IN out.username
AND timestamp < date()
GROUP BY in.name
ORDER BY weight DESC

Query executed in 0.487 sec. Returned 2 record(s)

soccer 5042

baseball 4958

I've built up an equivalent structure in PostgreSQL to compare performance: 3 tables (user, tag, user_used_tag_link), with similar data.

My query in PostgreSQL:

SELECT public.tag.name, selection.weight FROM (
 SELECT tag_id, COUNT(tag_id) AS weight FROM public.user_used_tag_link 
 WHERE timestamp < CURRENT_TIMESTAMP
 AND user_id = 1
 GROUP BY tag_id
 ORDER BY weight DESC
) AS selection
JOIN public.tag ON selection.tag_id = public.tag.id
ORDER BY selection.weight DESC;

Query executes in 12ms.

"soccer";6035

"baseball";4811

Seems that I am getting significantly better performance with PostgreSQL (12ms vs 487ms).

Upon further digging, indeed it seems that it is simply because I am sifting through 11k records on OrientDB that I am getting slow performance (lag not caused by traversal, the SELECT on its own is slow). But surely I must be doing something wrong -- why is OrientDB lagging for my straightforward SELECT on an indexed field, although it is a database built for Big Data queries?

Best regards,

Jean

Luca Garulli

unread,

Apr 26, 2015, 6:33:38 AM4/26/15

to orient-...@googlegroups.com

On 25 April 2015 at 16:41, Jean-Sebastien Lemay <jean.sebas...@gmail.com> wrote:

Hi there,

Hi Jean-Sebastien

OK -- I'm a bit surprised, in a way; this is a straightforward SELECT statement.

I decided to run another experiment. My scenario: determine which tags are a given user's favourite.

My query in OrientDB:
SELECT in.name AS name, COUNT(in.name) AS weight FROM UserUsedTag WHERE "john.doe" IN out.username AND timestamp < date() GROUP BY in.name ORDER BY weight DESC
Query executed in 0.487 sec. Returned 2 record(s)
soccer 5042
baseball 4958

This is not the most efficient way to do that, because the index on username wouldn't be used. With a Graph Database, you should lookup at the stating vertex and then cross relationships. Try this:

select name, count(*) as weight

from (

select expand( in() ) from User where username = "john.doe"

) group by name, order by weight

Assure to have an index on User.username. I'd suggest a Unique index of type HASH_INDEX.

Lvc@

Jean-Sebastien Lemay

unread,

Apr 26, 2015, 9:58:11 AM4/26/15

to orient-...@googlegroups.com

Hi Luca,

Thanks for the advice. However in your query I am unable to filter by timestamp on the edge. Instead, I ended up with:

SELECT inV().name AS name, COUNT(*) AS weight FROM (

 SELECT FROM (
  SELECT expand(outE("UserUsedTag")) 
  FROM user
 WHERE username = "john.doe"
 ) WHERE timestamp > date('2015-04-25 05:00:22')
 AND timestamp < date('2015-04-25 05:01:10')


)
GROUP BY in
ORDER BY weight DESC

soccer 5020

baseball 4939

Query executed in 0.468 sec. Returned 2 record(s)

The above query seems to take into account all indexes (index on User.username and UserUsedTag.timestamp).

Still, 468ms for ~10k records. Am I correct then that OrientDB in the end is a tradeoff, i.e. it's a lot more developer friendly and has tons of features, but at the cost of rougly ~40x the performance of a traditional relational DB such as PostgreSQL?

Izzet Pembeci

unread,

Apr 27, 2015, 6:21:04 AM4/27/15

to orient-...@googlegroups.com

This is surprizing indeed. Can't you put the timestamp where clause just before outer group by so that you eliminate one subquery:

SELECT inV().name AS name, COUNT(*) AS weight FROM (


  SELECT expand(outE("UserUsedTag")) 
  FROM user
  WHERE username = "john.doe"
 )

WHERE timestamp > date('2015-04-25 05:00:22')
 AND timestamp < date('2015-04-25 05:01:10')

GROUP BY in
ORDER BY weight DESC

What does the EXPLAIN return for this latest one and can you share your index definitions (may be there is a problem with those)?

iZzeT

Jean-Sebastien Lemay

unread,

Apr 27, 2015, 11:23:36 AM4/27/15

to orient-...@googlegroups.com

Hi, thanks for looking into this situation, it's really appreciated!

I repopulated my database and ran your query:

SELECT inV().name AS name, COUNT(*) AS weight FROM (
  SELECT expand(outE("UserUsedTag")) 
  FROM user
  WHERE username = "john.doe"
)

WHERE timestamp > date('2015-04-27 15:13:00')
AND timestamp < date('2015-04-27 15:17:00')

GROUP BY in
ORDER BY weight DESC

Query executed in 0.474 sec. Returned 2 record(s)

baseball 5013

soccer 4987

Here is what the EXPLAIN returns:

METADATA	PROPERTIES
@version	resultSize	current	recordReads	rangeQueryConvertedInBetween	limit	fetchingFromTargetElapsed	projectionElapsed	documentReads

orderByElapsed	evaluated	groupByElapsed	user	elapsed	resultType

expandElapsed
0	2	#14:24599	10000	1	-1	272	7	10000	0	10000	0	#5:0	445.29556	collection	3

My schema is as follows:

(User extends V, UNIQUE_HASH_INDEX on User.username)

(Tag extends V, UNIQUE_HASH_INDEX on Tag.name)

(UserUsedTag extends E, NOTUNIQUE index on UserUsedTag.timestamp)

Let me know if you can determine what causes my query to take almost half a second. Hopefully we uncover something I'm doing wrong -- the database is great beyond this particular issue.