Slow performance when building a weighted table out of edge-vertex relationships

95 views
Skip to first unread message

Jean-Sebastien Lemay

unread,
Apr 24, 2015, 7:13:46 AM4/24/15
to orient-...@googlegroups.com
To validate if OrientDB is the right fit for my project, I've created a simple scenario:
  • Vertex classes (2):
    • User: represents a user
    • Tag: represents a tag that a user can interact with (e.g. post a comment using that tag). Contains a name property

  • Edge classes (1):
    • UserUsedTag: connects a user to a tag. Contains a timestamp property (which is indexed with NOTUNIQUE)
To sum it up:
[V:User] ---[E:UserUsedTag]--> [V:Tag]

I'm trying to build a query that will let me know which tags have been the most popular in the past [x] minutes/hours/days/months...
As such, here is an example of the query I've got right now:
SELECT inV().name as name, COUNT(in) AS weight
FROM
UserUsedTag
WHERE timestamp
< date('2015-04-24 09:40:00')
GROUP BY
in
ORDER BY weight DESC

The query works, and I get a proper result set:
name
weight
baseball6117
soccer5003
My problem is the performance:
Query executed in 0.311 sec. Returned 2 record(s)
If it takes 1/3 of a second to sift through ~11,000 results, I can only imagine how crippled the performance will be if I am dealing with millions of edges, which I expect to end up with. In fact, as I add new edges, it seems like the query time increases linearly.

Here is the EXPLAIN for the above query:
METADATAPROPERTIES
@version
resultSize
fullySortedByIndex
documentAnalyzedCompatibleClass
recordReads
fetchingFromTargetElapsed
indexIsUsedInOrderBy
compositeIndexUsed
current
documentReads
projectionElapsed
limit
orderByElapsed
evaluated
groupByElapsed
user
elapsed
resultType
involvedIndexes
02false1112011120313false1#14:11120111206-10111200#5:0353.77362collection["UserUsedTag.timestamp"]
It seems that 'fetchingFromTargetElapsed' is the biggest bottleneck here. Is this due to the nature of my query? Perhaps my query is not optimal? I am new to graph DBs so I'd like to know if there is any way I can rephrase my query to end up with the same result.

The problem for me is that if I was using a relational DB, I would probably get better performance, if I was to query my "link" table and, with my final two (2) rows, perform one (1) query for each to retrieve the name of the corresponding Tag via the foreign key. Is there any way I can split my query similarly with OrientDB, instead of fetching the same Tag name multiple times?

Luca Garulli

unread,
Apr 24, 2015, 10:39:42 AM4/24/15
to orient-...@googlegroups.com
Hi Jean-Sebastien,
If you have X records in your resultset and execution time is Y, then with X2 records it will be about Y2. There is not magic here.

The most common solution, instead, is to pre-aggregate weight in a tree (year->month->day->hour->minute->second),
so every query takes a constant time, no matter by the number of events.

You can do this by writing a Hook that automatically updates counters upon create/update/delete.

In this way the query above would be super fast.

-- 
Best Regards,

Luca Garulli
CEO at Orient Technologies LTD
the Company behind OrientDB


--

---
You received this message because you are subscribed to the Google Groups "OrientDB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orient-databa...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Jean-Sebastien Lemay

unread,
Apr 24, 2015, 11:02:43 AM4/24/15
to orient-...@googlegroups.com

Understood about X2 - - > Y2. However I wanted to confirm if the bottleneck was due to the system navigating the edge 6000 times to retrieve the "name" field from the same vertex.

I'm afraid hooks are out of the question since my next query is to determine, for a specific user, which tags he used the most. I cannot start creating a table for each user for each day month year etc. Especially since I mean past [x] months from this instant, not a calendar month.

Can you confirm if my query is slow due to the repeated re-fetching of the "name" field of my Tag vertex?

You received this message because you are subscribed to a topic in the Google Groups "OrientDB" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/orient-database/rLBqzmENmcU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to orient-databa...@googlegroups.com.

Luca Garulli

unread,
Apr 24, 2015, 12:31:02 PM4/24/15
to orient-...@googlegroups.com
The bottleneck there is the 11k records that match the condition: 

timestamp < date('2015-04-24 09:40:00') 

Best Regards,

Luca Garulli
CEO at Orient Technologies LTD
the Company behind OrientDB


Jean-Sebastien Lemay

unread,
Apr 25, 2015, 10:41:38 AM4/25/15
to orient-...@googlegroups.com
Hi there,

OK -- I'm a bit surprised, in a way; this is a straightforward SELECT statement.

I decided to run another experiment. My scenario: determine which tags are a given user's favourite.

My query in OrientDB:
SELECT in.name AS name, COUNT(in.name) AS weight
FROM
UserUsedTag
WHERE
"john.doe" IN out.username
AND timestamp
< date()
GROUP BY
in.name
ORDER BY weight DESC
Query executed in 0.487 sec. Returned 2 record(s)
soccer 5042
baseball 4958

I've built up an equivalent structure in PostgreSQL to compare performance: 3 tables (user, tag, user_used_tag_link), with similar data.
My query in PostgreSQL:
SELECT public.tag.name, selection.weight FROM (
 SELECT tag_id
, COUNT(tag_id) AS weight FROM public.user_used_tag_link
 WHERE timestamp
< CURRENT_TIMESTAMP
 AND user_id
= 1
 GROUP BY tag_id
 ORDER BY weight DESC
) AS selection
JOIN
public.tag ON selection.tag_id = public.tag.id
ORDER BY selection
.weight DESC;
Query executes in 12ms.
"soccer";6035
"baseball";4811

Seems that I am getting significantly better performance with PostgreSQL (12ms vs 487ms).

Upon further digging, indeed it seems that it is simply because I am sifting through 11k records on OrientDB that I am getting slow performance (lag not caused by traversal, the SELECT on its own is slow). But surely I must be doing something wrong -- why is OrientDB lagging for my straightforward SELECT on an indexed field, although it is a database built for Big Data queries?

Best regards,
Jean

Luca Garulli

unread,
Apr 26, 2015, 6:33:38 AM4/26/15
to orient-...@googlegroups.com
On 25 April 2015 at 16:41, Jean-Sebastien Lemay <jean.sebas...@gmail.com> wrote:
Hi there,

Hi Jean-Sebastien
 

OK -- I'm a bit surprised, in a way; this is a straightforward SELECT statement.

I decided to run another experiment. My scenario: determine which tags are a given user's favourite.

My query in OrientDB:
SELECT in.name AS name, COUNT(in.name) AS weight
FROM
UserUsedTag
WHERE
"john.doe" IN out.username
AND timestamp
< date()
GROUP BY
in.name
ORDER BY weight DESC
Query executed in 0.487 sec. Returned 2 record(s)
soccer 5042
baseball 4958

This is not the most efficient way to do that, because the index on username wouldn't be used. With a Graph Database, you should lookup at the stating vertex and then cross relationships. Try this:

select name, count(*) as weight
from (
  select expand( in() ) from User where username = "john.doe"
) group by name, order by weight

Assure to have an index on User.username. I'd suggest a Unique index of type HASH_INDEX.

Lvc@

Jean-Sebastien Lemay

unread,
Apr 26, 2015, 9:58:11 AM4/26/15
to orient-...@googlegroups.com
Hi Luca,

Thanks for the advice. However in your query I am unable to filter by timestamp on the edge. Instead, I ended up with:
SELECT inV().name AS name, COUNT(*) AS weight FROM (

 SELECT FROM
(
  SELECT expand
(outE("UserUsedTag"))
  FROM user
 WHERE username
= "john.doe"
 
) WHERE timestamp > date('2015-04-25 05:00:22')
 AND timestamp
< date('2015-04-25 05:01:10')

)
GROUP BY
in
ORDER BY weight DESC
soccer 5020
baseball 4939
Query executed in 0.468 sec. Returned 2 record(s)

The above query seems to take into account all indexes (index on User.username and UserUsedTag.timestamp).

Still, 468ms for ~10k records. Am I correct then that OrientDB in the end is a tradeoff, i.e. it's a lot more developer friendly and has tons of features, but at the cost of rougly ~40x the performance of a traditional relational DB such as PostgreSQL?

Izzet Pembeci

unread,
Apr 27, 2015, 6:21:04 AM4/27/15
to orient-...@googlegroups.com
This is surprizing indeed. Can't you put the timestamp where clause just before outer group by so that you eliminate one subquery:

SELECT inV().name AS name, COUNT(*) AS weight FROM (

  SELECT expand
(outE("UserUsedTag"))
  FROM user
  WHERE username
= "john.doe"
 
)

WHERE timestamp > date('2015-04-25 05:00:22')
 AND timestamp
< date('2015-04-25 05:01:10')
GROUP BY in
ORDER BY weight DESC

What does the EXPLAIN return for this latest one and can you share your index definitions (may be there is a problem with those)?

iZzeT

Jean-Sebastien Lemay

unread,
Apr 27, 2015, 11:23:36 AM4/27/15
to orient-...@googlegroups.com
Hi, thanks for looking into this situation, it's really appreciated!

I repopulated my database and ran your query:
SELECT inV().name AS name, COUNT(*) AS weight FROM (
  SELECT expand(outE("UserUsedTag")) 
  FROM user
  WHERE username = "john.doe"
WHERE timestamp > date('2015-04-27 15:13:00')
AND timestamp < date('2015-04-27 15:17:00')
GROUP BY in
ORDER BY weight DESC
Query executed in 0.474 sec. Returned 2 record(s)
baseball 5013
soccer 4987

Here is what the EXPLAIN returns:
METADATAPROPERTIES
@version
resultSize
current
recordReads
rangeQueryConvertedInBetween
limit
fetchingFromTargetElapsed
projectionElapsed
documentReads
orderByElapsed
evaluated
groupByElapsed
user
elapsed
resultType
expandElapsed
02#14:24599100001-12727100000100000#5:0445.29556collection3
My schema is as follows:

(User extends V, UNIQUE_HASH_INDEX on User.username)

(Tag extends V, UNIQUE_HASH_INDEX on Tag.name)

(UserUsedTag extends E, NOTUNIQUE index on UserUsedTag.timestamp)


Let me know if you can determine what causes my query to take almost half a second. Hopefully we uncover something I'm doing wrong -- the database is great beyond this particular issue.

Best regards,
Jean
Reply all
Reply to author
Forward
0 new messages