ArangoDB query very slow

512 views
Skip to first unread message

Nicolas Clairon

unread,
Jun 24, 2013, 5:44:45 AM6/24/13
to aran...@googlegroups.com
Hi,

I'm digging a little deeper with ArangoDB and I find out that quering is very slow, even with indexes:

here's the data ~30 000 documents:

  "_id" : "publications/51773704315", 
  "_rev" : "51773704315", 
  "_key" : "51773704315", 
  "Pmid" : 11425574, 
  "author" : 43438, 
  "lel_is_del" : 0, 
  "lel_validation" : 1, 
  "lib_id" : 15033, 
  "volume" : 9, 
  "lang" : "en", 
  "pagination" : "1731-8", 
  "lel_summary" : "As a part of our ongoing work on beta-L-nucleoside analogues as potential antiviral drugs, we have synthesized 5-(trifluoromethyl)-beta-L-2'-deoxyuridine (L-TFT), the hitherto unknown L-enantiomer of trifluorothymidine (CF(3)dUrd, TFT). We have also studied the effect of L-TFT on human and herpes simplex virus (HSV) type 1 and 2 thymidine kinases, and human thymidine phosphorylase, as well as its anti-HSV-1 and anti-HSV-2 activities in cell cultures. L-TFT has been found: (i) to inhibit HSV-1 TK with activity comparable to TFT, with no effect on human TK, (ii) to be phosphorylated by the viral enzyme with similar efficiency to TFT, (iii) to be resistant, in contrast to TFT, to hydrolysis by human thymidine phosphorylase. Unfortunately, when evaluated in cell cultures, L-TFT did not show any anti-HSV-1 and anti-HSV-2 activities.", 
  "lel_title" : "5-(Trifluoromethyl)-beta-l-2'-deoxyuridine, the L-enantiomer of trifluorothymidine: stereospecific synthesis and antiherpetic evaluations.", 
  "lib_date" : "0000-00-00", 
  "paysJournal" : "England", 
  "titrejournal" : "Bioorganic & medicinal chemistry" 
}

I want all the documents which have "lang" as "en":

arangosh> db.publications.byExample({lang: "en"}).next()
  "_id" : "publications/51773704315", 
  "_rev" : "51773704315", 
  "_key" : "51773704315", 
  "Pmid" : 11425574, 
  "author" : 43438, 
  "lel_is_del" : 0, 
  "lel_validation" : 1, 
  "lib_id" : 15033, 
  "volume" : 9, 
  "lang" : "en", 
  "pagination" : "1731-8", 
  "lel_summary" : "As a part of our ongoing work on beta-L-nucleoside analogues as potential antiviral drugs, we have synthesized 5-(trifluoromethyl)-beta-L-2'-deoxyuridine (L-TFT), the hitherto unknown L-enantiomer of trifluorothymidine (CF(3)dUrd, TFT). We have also studied the effect of L-TFT on human and herpes simplex virus (HSV) type 1 and 2 thymidine kinases, and human thymidine phosphorylase, as well as its anti-HSV-1 and anti-HSV-2 activities in cell cultures. L-TFT has been found: (i) to inhibit HSV-1 TK with activity comparable to TFT, with no effect on human TK, (ii) to be phosphorylated by the viral enzyme with similar efficiency to TFT, (iii) to be resistant, in contrast to TFT, to hydrolysis by human thymidine phosphorylase. Unfortunately, when evaluated in cell cultures, L-TFT did not show any anti-HSV-1 and anti-HSV-2 activities.", 
  "lel_title" : "5-(Trifluoromethyl)-beta-l-2'-deoxyuridine, the L-enantiomer of trifluorothymidine: stereospecific synthesis and antiherpetic evaluations.", 
  "lib_date" : "0000-00-00", 
  "paysJournal" : "England", 
  "titrejournal" : "Bioorganic & medicinal chemistry" 
}

It took 6 seconds to have the results.

Idem with AQL:

arangosh> db._createStatement({query: "FOR pub IN publications FILTER pub.lang == 'en' RETURN pub"}).execute().next()

(6 seconds to have the results)

I even can't have the number of documents which have lang == "en":

arangosh> db.publications.byExample({lang: "en"}).count()
null

I made an HashIndex on the "lang" field but it is not faster:

arangosh> db.publications.getIndexes()
  { 
    "id" : "publications/0", 
    "type" : "primary", 
    "unique" : true, 
    "fields" : [ 
      "_id" 
    ] 
  }, 
  { 
    "id" : "publications/59863074939", 
    "type" : "fulltext", 
    "unique" : false, 
    "minLength" : 2, 
    "fields" : [ 
      "lel_summary" 
    ] 
  }, 
  { 
    "id" : "publications/59868121211", 
    "type" : "hash", 
    "unique" : false, 
    "fields" : [ 
      "lang" 
    ] 
  } 
]

Where I have wrong ? What do I have to do to have the number of publications in english language ?

Thanks for the help

Jan Steemann

unread,
Jun 24, 2013, 6:11:05 AM6/24/13
to aran...@googlegroups.com, Nicolas Clairon
It may be a problem with the total data size of the results.
The queries will construct a full result set and may return all the
documents which match, even if you only want to access the first one (as
you did in the examples). The results will contain the FULL documents,
with all the attributes.

I am not sure how the data is distributed, but when I execute the query
here with the example data from last time, the result set gets very big.
Approx. 17,000 documents are returned, along with all their attribute
data. The server needs to allocate a lot of memory for these result sets
and needs to package them into JSON, which the client must unpack. This
takes a lot of processing time and memory.

If you can avoid constructing the full documents, you should try to do
that. For example, instead of running this query

> arangosh> db._createStatement({query: "FOR pub IN publications FILTER
> pub.lang == 'en' RETURN pub"}).execute().next()

you may save a lot of time by running it like this:

> arangosh> db._createStatement({query: "FOR pub IN publications FILTER
> pub.lang == 'en' RETURN pub._id"}).execute().next()


This returns a much smaller result set, which is of course different
from the original result set. But restricting the size and amount of
attributes will definitely make it much better.

Best regards
Jan
> --
> You received this message because you are subscribed to the Google
> Groups "ArangoDB" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to arangodb+u...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

--
eMail: j.ste...@triagens.de
Telefon: +49-221-2722999-37
Fax: +49-221-2722999-88

triAGENS GmbH
Hohenstaufenring 43-45
50674 K�ln

Sitz der Gesellschaft: K�ln
Registergericht K�ln; HRB 53597

Gesch�ftsf�hrung:
Dr. Frank Celler
Martin Sch�nert
Claudius Weinberger


Diese e-Mail enth�lt vertrauliche und/oder rechtlich gesch�tzte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese
e-Mail irrt�mlich erhalten haben, informieren Sie bitte sofort den
Absender und vernichten Sie diese e-Mail. Wir haben alle
verkehrs�blichen Ma�nahmen unternommen, um das Risiko der Verbreitung
virenbefallener Software oder e-Mails zu minimieren, dennoch raten wir
Ihnen, Ihre eigenen Virenkontrollen auf alle Anh�nge an dieser e-Mail
durchzuf�hren. Wir schlie�en au�er f�r den Fall von Vorsatz oder grober
Fahrl�ssigkeit die Haftung f�r jeglichen Verlust oder Sch�den durch
virenbefallene Software oder e-Mails aus.

This e-mail may contain confidential and/or privileged information. If
you are not the intended recipient (or have received this e-mail in
error) please notify the sender immediately and destroy this e-mail. We
have taken precautions to minimize the risk of transmitting software
viruses but nevertheless advise you to carry out your own virus checks
on any attachment of this message. We accept no liability for loss or
damage caused by software.

Nicolas Clairon

unread,
Jun 24, 2013, 7:55:57 AM6/24/13
to Jan Steemann, aran...@googlegroups.com
I understand but how can I get then the number of matches (count()) from the query ? This doesn't work:

arangosh> db._createStatement({query: "FOR pub IN publications FILTER pub.lang == 'en' RETURN pub._id"}).execute().count()
null

other question: with mongodb (I sorry to compare), I can have a cursor to fetch all the documents:

for doc in db.publications.find():
    # do something with doc

This work even with billions of documents. How can I do that with ArangoDB ? 



For more options, visit https://groups.google.com/groups/opt_out.


--
eMail:   j.ste...@triagens.de
Telefon: +49-221-2722999-37
Fax:     +49-221-2722999-88

triAGENS GmbH
Hohenstaufenring 43-45
50674 Köln

Sitz der Gesellschaft: Köln
Registergericht Köln; HRB 53597

Geschäftsführung:
Dr. Frank Celler
Martin Schönert
Claudius Weinberger


Diese e-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese e-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese e-Mail. Wir haben alle verkehrsüblichen Maßnahmen unternommen, um das Risiko der Verbreitung virenbefallener Software oder e-Mails zu minimieren, dennoch raten wir Ihnen, Ihre eigenen Virenkontrollen auf alle Anhänge an dieser e-Mail durchzuführen. Wir schließen außer für den Fall von Vorsatz oder grober Fahrlässigkeit die Haftung für jeglichen Verlust oder Schäden durch virenbefallene Software oder e-Mails aus.

Jan Steemann

unread,
Jun 24, 2013, 8:14:29 AM6/24/13
to Nicolas Clairon, aran...@googlegroups.com
Hi,

to get the number of documents returned, please set the "count"
attribute when you create the query.

Instead of

db._createStatement({
query: "FOR ..."
}).execute().count()

you can do this:

db._createStatement({
query: "FOR ...",
count: true
}).execute().count()

This will return number of documents too.

Another way is to not ship the individual documents from the server to
the client and count them there, but to make the query return the
document count:

db._createStatement({
query: "RETURN LENGTH((FOR pub IN publications FILTER pub.lang ==
'en' RETURN pub._id))"
}).execute().next();

This will make the server ship a very small result set to the client.
However, in this case you wouldn't have access to the individual
documents, but just the number of them.


Re MongoDB: I think there is a difference in how cursors work in
ArangoDB and MongoDB.

In ArangoDB, a cursor is created on the server-side. When you issue a
query, the full query result set will be constructed and stored for the
cursor. The client can then use the cursor to iterate over the
(unmodified) result set. The result set needs to fit into memory on the
server side, but not necessarily on the client-side, as the client can
fetch the results lazily.
The result set is consistent even if the underlying collection data gets
modified after the query, but before the client has fulled fetched the
result data from cursor.

I think in MongoDB it is different so that the cursor may return the
first few documents already, and then calculate the outstanding ones
lazily if you query it for more results. I am not sure if and how
MongoDB protects a cursor result from intermittent updates/deletions of
the underlying data.

That means in ArangoDB you wouldn't construct a query that returns
billions of documents, as they would need to fit into memory completely.
You would rather try to keep the result set as small as possible and try
to query only a small fraction of the overall data (that's also where
indexes help - they are of no help if a query returns most of the
documents anyway).

Best regards
Jan
> 5-(trifluoromethyl)-beta-L-2'-__deoxyuridine (L-TFT), the
> hitherto unknown
> L-enantiomer of trifluorothymidine (CF(3)dUrd, TFT). We have also
> studied the effect of L-TFT on human and herpes simplex virus
> (HSV) type
> 1 and 2 thymidine kinases, and human thymidine phosphorylase, as
> well as
> its anti-HSV-1 and anti-HSV-2 activities in cell cultures. L-TFT has
> been found: (i) to inhibit HSV-1 TK with activity comparable to TFT,
> with no effect on human TK, (ii) to be phosphorylated by the viral
> enzyme with similar efficiency to TFT, (iii) to be resistant, in
> contrast to TFT, to hydrolysis by human thymidine phosphorylase.
> Unfortunately, when evaluated in cell cultures, L-TFT did not
> show any
> anti-HSV-1 and anti-HSV-2 activities.",
> "lel_title" : "5-(Trifluoromethyl)-beta-l-2'__-deoxyuridine, the
> L-enantiomer of trifluorothymidine: stereospecific synthesis and
> antiherpetic evaluations.",
> "lib_date" : "0000-00-00",
> "paysJournal" : "England",
> "titrejournal" : "Bioorganic & medicinal chemistry"
> }
>
> I want all the documents which have "lang" as "en":
>
> arangosh> db.publications.byExample({__lang: "en"}).next()
> {
> "_id" : "publications/51773704315",
> "_rev" : "51773704315",
> "_key" : "51773704315",
> "Pmid" : 11425574,
> "author" : 43438,
> "lel_is_del" : 0,
> "lel_validation" : 1,
> "lib_id" : 15033,
> "volume" : 9,
> "lang" : "en",
> "pagination" : "1731-8",
> "lel_summary" : "As a part of our ongoing work on
> beta-L-nucleoside
> analogues as potential antiviral drugs, we have synthesized
> 5-(trifluoromethyl)-beta-L-2'-__deoxyuridine (L-TFT), the
> hitherto unknown
> L-enantiomer of trifluorothymidine (CF(3)dUrd, TFT). We have also
> studied the effect of L-TFT on human and herpes simplex virus
> (HSV) type
> 1 and 2 thymidine kinases, and human thymidine phosphorylase, as
> well as
> its anti-HSV-1 and anti-HSV-2 activities in cell cultures. L-TFT has
> been found: (i) to inhibit HSV-1 TK with activity comparable to TFT,
> with no effect on human TK, (ii) to be phosphorylated by the viral
> enzyme with similar efficiency to TFT, (iii) to be resistant, in
> contrast to TFT, to hydrolysis by human thymidine phosphorylase.
> Unfortunately, when evaluated in cell cultures, L-TFT did not
> show any
> anti-HSV-1 and anti-HSV-2 activities.",
> "lel_title" : "5-(Trifluoromethyl)-beta-l-2'__-deoxyuridine, the
> L-enantiomer of trifluorothymidine: stereospecific synthesis and
> antiherpetic evaluations.",
> "lib_date" : "0000-00-00",
> "paysJournal" : "England",
> "titrejournal" : "Bioorganic & medicinal chemistry"
> }
>
> It took 6 seconds to have the results.
>
> Idem with AQL:
>
> arangosh> db._createStatement({query: "FOR pub IN publications
> FILTER
> pub.lang == 'en' RETURN pub"}).execute().next()
>
> (6 seconds to have the results)
>
> I even can't have the number of documents which have lang == "en":
>
> arangosh> db.publications.byExample({__lang: "en"}).count()
> an email to arangodb+unsubscribe@__googlegroups.com
> <mailto:arangodb%2Bunsu...@googlegroups.com>.
> For more options, visit
> https://groups.google.com/__groups/opt_out
> <https://groups.google.com/groups/opt_out>.
>
>
>
> --
> eMail: j.ste...@triagens.de <mailto:j.ste...@triagens.de>
> Telefon: +49-221-2722999-37
> Fax: +49-221-2722999-88
>
> triAGENS GmbH
> Hohenstaufenring 43-45
> 50674 K�ln
>
> Sitz der Gesellschaft: K�ln
> Registergericht K�ln; HRB 53597
>
> Gesch�ftsf�hrung:
> Dr. Frank Celler
> Martin Sch�nert
> Claudius Weinberger
>
>
> Diese e-Mail enth�lt vertrauliche und/oder rechtlich gesch�tzte
> Informationen. Wenn Sie nicht der richtige Adressat sind oder diese
> e-Mail irrt�mlich erhalten haben, informieren Sie bitte sofort den
> Absender und vernichten Sie diese e-Mail. Wir haben alle
> verkehrs�blichen Ma�nahmen unternommen, um das Risiko der
> Verbreitung virenbefallener Software oder e-Mails zu minimieren,
> dennoch raten wir Ihnen, Ihre eigenen Virenkontrollen auf alle
> Anh�nge an dieser e-Mail durchzuf�hren. Wir schlie�en au�er f�r den
> Fall von Vorsatz oder grober Fahrl�ssigkeit die Haftung f�r
> jeglichen Verlust oder Sch�den durch virenbefallene Software oder
> e-Mails aus.
>
> This e-mail may contain confidential and/or privileged information.
> If you are not the intended recipient (or have received this e-mail
> in error) please notify the sender immediately and destroy this
> e-mail. We have taken precautions to minimize the risk of
> transmitting software viruses but nevertheless advise you to carry
> out your own virus checks on any attachment of this message. We
> accept no liability for loss or damage caused by software.
>
>

--
eMail: j.ste...@triagens.de
Telefon: +49-221-2722999-37
Fax: +49-221-2722999-88

triAGENS GmbH
Hohenstaufenring 43-45
50674 K�ln

Sitz der Gesellschaft: K�ln
Registergericht K�ln; HRB 53597

Gesch�ftsf�hrung:
Dr. Frank Celler
Martin Sch�nert
Claudius Weinberger


Diese e-Mail enth�lt vertrauliche und/oder rechtlich gesch�tzte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese
e-Mail irrt�mlich erhalten haben, informieren Sie bitte sofort den
Absender und vernichten Sie diese e-Mail. Wir haben alle
verkehrs�blichen Ma�nahmen unternommen, um das Risiko der Verbreitung
virenbefallener Software oder e-Mails zu minimieren, dennoch raten wir
Ihnen, Ihre eigenen Virenkontrollen auf alle Anh�nge an dieser e-Mail
durchzuf�hren. Wir schlie�en au�er f�r den Fall von Vorsatz oder grober
Fahrl�ssigkeit die Haftung f�r jeglichen Verlust oder Sch�den durch

Nicolas Clairon

unread,
Jun 24, 2013, 8:31:12 AM6/24/13
to Jan Steemann, aran...@googlegroups.com
Thank you very much for your answer. I understand better the design of ArangoDB now and it is much faster :)


    50674 Köln

    Sitz der Gesellschaft: Köln
    Registergericht Köln; HRB 53597

    Geschäftsführung:
    Dr. Frank Celler
    Martin Schönert
    Claudius Weinberger


    Diese e-Mail enthält vertrauliche und/oder rechtlich geschützte
    Informationen. Wenn Sie nicht der richtige Adressat sind oder diese
    e-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den

    Absender und vernichten Sie diese e-Mail. Wir haben alle
    verkehrsüblichen Maßnahmen unternommen, um das Risiko der

    Verbreitung virenbefallener Software oder e-Mails zu minimieren,
    dennoch raten wir Ihnen, Ihre eigenen Virenkontrollen auf alle
    Anhänge an dieser e-Mail durchzuführen. Wir schließen außer für den

    Fall von Vorsatz oder grober Fahrlässigkeit die Haftung für
    jeglichen Verlust oder Schäden durch virenbefallene Software oder

    e-Mails aus.

    This e-mail may contain confidential and/or privileged information.
    If you are not the intended recipient (or have received this e-mail
    in error) please notify the sender immediately and destroy this
    e-mail. We have taken precautions to minimize the risk of
    transmitting software viruses but nevertheless advise you to carry
    out your own virus checks on any attachment of this message. We
    accept no liability for loss or damage caused by software.



--
eMail:   j.ste...@triagens.de
Telefon: +49-221-2722999-37
Fax:     +49-221-2722999-88

triAGENS GmbH
Hohenstaufenring 43-45
50674 Köln

Sitz der Gesellschaft: Köln
Registergericht Köln; HRB 53597

Geschäftsführung:
Dr. Frank Celler
Martin Schönert
Claudius Weinberger


Diese e-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese e-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese e-Mail. Wir haben alle verkehrsüblichen Maßnahmen unternommen, um das Risiko der Verbreitung virenbefallener Software oder e-Mails zu minimieren, dennoch raten wir Ihnen, Ihre eigenen Virenkontrollen auf alle Anhänge an dieser e-Mail durchzuführen. Wir schließen außer für den Fall von Vorsatz oder grober Fahrlässigkeit die Haftung für jeglichen Verlust oder Schäden durch virenbefallene Software oder e-Mails aus.
Reply all
Reply to author
Forward
0 new messages