Index setup, slow queries

136 views
Skip to first unread message

Patrik Pahulák

unread,
May 15, 2024, 12:09:26 PM5/15/24
to Magnolia User Mailing List
Hello, we have this one specific workspace that is kinda the most important workspace in our project. The workspace contains 7k ish nodes atm. The issue we are faced with are slow(ish) queries. The workspace is structured into folders that contain the nodes (all of these nodes are of the same type). This workspace is queried quite often. We have a rest api endpoint that handles querying this workspace with filtering and pagination (around 10 properties the filter works with). To make this work, we limit the queries to a couple of nodes per query (10-50). However we also need to return a total number of nodes that fit the query for our pagination (first, page, next page, last page, ...). AFAIK there isnt really a COUNT or anything like that to get this result so we are basically forced to run the query with no limit/offset to get the total number of nodes. This can get pretty slow, atm it's at around 1.5 seconds, which is not that bad, but the number of nodes is expected to grow, which will slow this down even more. We know for a fact that the slow part is the process of running the query with no limit.

There is no index setup on our side. I have tried setting up an index for the workspace, but I have most likely failed, since I do not see any speed improvements. I was wondering if setting up an index would even help us out in the first place and if so I was wondering if anyone could share some example index setup with Magnolia. Our repository setup is basically identical to this one, atleast the SearchIndex part
We are running magnolia 6.2.44.

Thanks a lot

Mykola Soldatenkov

unread,
May 15, 2024, 2:33:56 PM5/15/24
to user...@magnolia-cms.com

--
You received this message because you are subscribed to the Google Groups "Magnolia User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to user-list+...@magnolia-cms.com.
To view this discussion on the web, visit https://groups.google.com/a/magnolia-cms.com/d/msgid/user-list/04467de1-0351-48a5-8e02-b341ac9af8e9n%40magnolia-cms.com.


--
-- 
Powyższa wiadomość wraz z ewentualnymi załącznikami jest przeznaczona wyłącznie do wiadomości osób lub podmiotów, do których jest w sposób zamierzony zaadresowana. W przypadku otrzymania powyższej wiadomości w wyniku pomyłki, prosimy o pilne powiadomienie o tym fakcie nadawcy oraz o usunięcie wiadomości z systemu, a także o nie rozpowszechnianie treści w niej zawartych. Otrzymanie powyższej wiadomości nie może stanowić podstawy do jakichkolwiek zobowiązań nadawcy tej wiadomości wobec osób nie będących jej zamierzonymi adresatami. 
This e-mail message and any attachments is intended solely for the use of the persons or entities to whom it is intentionally addressed. If you have received this message in error, please notify the sender about this fact immediately and delete it completely from your computer system. Please do not disseminate this message unless you are the intended recipient of it. This e-mail message shall not create any legal obligation on the part of the sender to persons who are not intended recipients of it.

Richard Gange

unread,
May 16, 2024, 7:28:57 AM5/16/24
to Magnolia User Mailing List, undi...@gmail.com
Maybe you might consider adjusting the index configuration file to index only the properties you need for the queries. This page goes into some of the details https://docs.magnolia-cms.com/product-docs/6.2/features/search/jackrabbit-search/ and there is also this https://jackrabbit.apache.org/archive/wiki/JCR/IndexingConfiguration_115513411.html

HTH
Rich

On Wednesday, May 15, 2024 at 2:33:56 PM UTC-4 undi...@gmail.com wrote:
On Wed, May 15, 2024 at 6:09 PM Patrik Pahulák <patrik.pahulak@servermechanics.cz> wrote:
Hello, we have this one specific workspace that is kinda the most important workspace in our project. The workspace contains 7k ish nodes atm. The issue we are faced with are slow(ish) queries. The workspace is structured into folders that contain the nodes (all of these nodes are of the same type). This workspace is queried quite often. We have a rest api endpoint that handles querying this workspace with filtering and pagination (around 10 properties the filter works with). To make this work, we limit the queries to a couple of nodes per query (10-50). However we also need to return a total number of nodes that fit the query for our pagination (first, page, next page, last page, ...). AFAIK there isnt really a COUNT or anything like that to get this result so we are basically forced to run the query with no limit/offset to get the total number of nodes. This can get pretty slow, atm it's at around 1.5 seconds, which is not that bad, but the number of nodes is expected to grow, which will slow this down even more. We know for a fact that the slow part is the process of running the query with no limit.

There is no index setup on our side. I have tried setting up an index for the workspace, but I have most likely failed, since I do not see any speed improvements. I was wondering if setting up an index would even help us out in the first place and if so I was wondering if anyone could share some example index setup with Magnolia. Our repository setup is basically identical to this one, atleast the SearchIndex part
We are running magnolia 6.2.44.

Thanks a lot

--
You received this message because you are subscribed to the Google Groups "Magnolia User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to user-list+unsubscribe@magnolia-cms.com.

Patrik Pahulák

unread,
Aug 29, 2024, 6:32:01 AM8/29/24
to Magnolia User Mailing List, Richard Gange, undi...@gmail.com
Hello, I've tried playing around with JCR queries/the index. I've realized that the slow part about the queries is the sorting. Sorting by any property makes the queries dreadfully slow. I've noticed this after rewriting pagination -> infinite scroll. So the queries end up being limited (X<=50 per batch, with setLimit and setOffset used). I am mainly testing it on a simple query, that contains the most results, such as:
SELECT * FROM [mgnl:myCustomNodeType] where state='some-value' ORDER BY [jcr:created]  // tried using CONTAINS(state,'some-value') - did not help, note that sorting by status does not help either or any other property (indexed/nonindexed)
The queries usually work with this one specific property(state) that has 4 possible values(states). These queries take up to around 7s for around 12k nodes right now when looking for a specific property value (specific state - 11K nodes with this specific property value - specific state). We also make more complex queries but those end up actually filtering the nodes by a significant amount, so the sorting slowing the query down is negligible...

I've tried creating a simple index, that just extends  the default magnolia index - /info/magnolia/jackrabbit/indexing_configuration_default.xml by adding another index-rule:
<index-rule nodeType="mgnl:myCustomNodeType">
<property>status</property>
<property>jcr:created</property>
<property isRegexp="true" nodeScopeIndex="false">mgnl:.*</property>
<property isRegexp="true" nodeScopeIndex="false">jcr:.*</property>
<property isRegexp="true">.*:.*</property>
</index-rule>
<index-rule nodeType="nt:base"> ...same as the default config file... </index-rule>
(btw: The index contains more properties), so in the end the index contains the  
the index-rule for my custom node and the default index-rule for nodeType="nt:base" from the default config file (in this order, tried reversing the order  as well, tried without the default index-rule for nodeType="nt-base" as well)

But despite this, the query speed remains the same (the magnolia lucene SearchIndex seems to be using my search index rules, since it finds the config file - it does not default to the previously mentioned default config file - it does not log anything about using the file + it does not get used while debugging)
Note: the queries are slow when making queries in the jcr-tools app and making queries programatically (rest api) 

I was thinking of simply forcing another extra condition in the queries (date range) to get the number of possible hits down -> which would speed things up


Any help would be appreciated
On Thursday 16 May 2024 at 13:28:57 UTC+2 Richard Gange wrote:
Maybe you might consider adjusting the index configuration file to index only the properties you need for the queries. This page goes into some of the details https://docs.magnolia-cms.com/product-docs/6.2/features/search/jackrabbit-search/ and there is also this https://jackrabbit.apache.org/archive/wiki/JCR/IndexingConfiguration_115513411.html

HTH
Rich

On Wednesday, May 15, 2024 at 2:33:56 PM UTC-4 undi...@gmail.com wrote:
On Wed, May 15, 2024 at 6:09 PM Patrik Pahulák <patrik....@servermechanics.cz> wrote:
Hello, we have this one specific workspace that is kinda the most important workspace in our project. The workspace contains 7k ish nodes atm. The issue we are faced with are slow(ish) queries. The workspace is structured into folders that contain the nodes (all of these nodes are of the same type). This workspace is queried quite often. We have a rest api endpoint that handles querying this workspace with filtering and pagination (around 10 properties the filter works with). To make this work, we limit the queries to a couple of nodes per query (10-50). However we also need to return a total number of nodes that fit the query for our pagination (first, page, next page, last page, ...). AFAIK there isnt really a COUNT or anything like that to get this result so we are basically forced to run the query with no limit/offset to get the total number of nodes. This can get pretty slow, atm it's at around 1.5 seconds, which is not that bad, but the number of nodes is expected to grow, which will slow this down even more. We know for a fact that the slow part is the process of running the query with no limit.

There is no index setup on our side. I have tried setting up an index for the workspace, but I have most likely failed, since I do not see any speed improvements. I was wondering if setting up an index would even help us out in the first place and if so I was wondering if anyone could share some example index setup with Magnolia. Our repository setup is basically identical to this one, atleast the SearchIndex part
We are running magnolia 6.2.44.

Thanks a lot

--
You received this message because you are subscribed to the Google Groups "Magnolia User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to user-list+...@magnolia-cms.com.

Patrik Pahulák

unread,
Aug 29, 2024, 6:37:55 AM8/29/24
to Magnolia User Mailing List, Patrik Pahulák, Richard Gange, undi...@gmail.com
EDIT: forgot to mention that when I rewrote the whole querying from pagination -> infinite scroll ("I've noticed this after rewriting pagination -> infinite scroll" - previous message), that I no longer make two queries - one to get all orders for the queries (pages) and one for the actual subset of data. Now I only make one query, one with a limit and offset applied. This query is always sorted by JCR and this one remains slow with/without the index. Also forgot to mention that the current properties I am storing in my nodes are most likely not indexed by the default index-rule  (they dont contain a ":") , so this catch-all rule: <property isRegexp="true">.*:.*</property> can't index them.

Thanks a lot

Patrik Pahulák

unread,
Nov 7, 2024, 4:45:28 AM11/7/24
to Magnolia User Mailing List, Patrik Pahulák, Richard Gange, undi...@gmail.com

Update: I've over this again because I can't seem to make it work. I want to reiterate a little. I feel like I left out an important detail about the workspace structure. The workspace is structured as follows:
/(root node) -> folder -> nodes of type mgnl:myNode, where each folder contains nodes whose name starts with folder name, eg. folder named 123 contains nodes 1234, 1235, etc.
Now my main query is trying to basically get all of these nodes based on queries such as this:
SELECT * FROM [mgnl:myNode] WHERE property='some-value' ORDER BY [jcr:created]. In this case a problematic property is called 'status', which has 4 possible values (node uuids). In the end most nodes end up with one specific value of this property, so their queries are quite slow. Unfortunately I need to load all of the nodes based on an equality check of this one specific property value. I was thinking of constraining the query by forcing my users to select a range (eg. the current year, the previous year etc.), which would make these specific queries faster at the moment. Unfortunately if there were to be around 10k+ nodes / year created like this, the queries would just end up taking the same amount of time as now (7+ seconds) or more. I am getting quite desperate at this point. I've tried using a custom index for these nodes, but unfortunately there is no performance gain at all. I've started thinking about moving the querying off to a search server/service. Before I do so, is there anything else I could possibly do to make this perform better (as I've mentioned multiple times, I do infact need to get all of my nodes, well paginated results are fine as well - currently using paginated results with limit and offset)?

Patrik Savior

unread,
Nov 14, 2024, 1:57:18 AM11/14/24
to Magnolia User Mailing List, Patrik Pahulák, Richard Gange, undi...@gmail.com
Update: I've over this again because I can't seem to make it work. I want to reiterate a little. I feel like I left out an important detail about the workspace structure. The workspace is structured as follows:
/(root node) -> folder -> nodes of type mgnl:myNode, where each folder contains nodes whose name starts with folder name, eg. folder named 123 contains nodes 1234, 1235, etc.
Now my main query is trying to basically get all of these nodes based on queries such as this:
SELECT * FROM [mgnl:myNode] WHERE property='some-value' ORDER BY [jcr:created]. In this case a problematic property is called 'status', which has 4 possible values (node uuids). In the end most nodes end up with one specific value of this property, so their queries are quite slow. Unfortunately I need to load all of the nodes based on an equality check of this one specific property value. I was thinking of constraining the query by forcing my users to select a range (eg. the current year, the previous year etc.), which would make these specific queries faster at the moment. Unfortunately if there were to be around 10k+ nodes / year created like this, the queries would just end up taking the same amount of time as now (7+ seconds) or more. I am getting quite desperate at this point. I've tried using a custom index for these nodes, but unfortunately there is no performance gain at all. I've started thinking about moving the querying off to a search server/service. Before I do so, is there anything else I could possibly do to make this perform better (as I've mentioned multiple times, I do infact need to get all of my nodes, well paginated results are fine as well - currently using paginated results with limit and offset)?

Dne čtvrtek 29. srpna 2024 v 12:37:55 UTC+2 uživatel Patrik Pahulák napsal:
Reply all
Reply to author
Forward
0 new messages