Huge Memory Usage of scorch


Denis Lamotte

Mar 23, 2021, 9:00:37 AM
to bleve

Hi everyone, hi Marty

I've set up a directory application using 5 scorch indexes, totaling 8.3 GB on disk.

Indexing took a bit more than 3 hours to run (a bit less than 4 million rows for the main indexes).

I'm using github.com/blevesearch/bleve v1.0.14 with the patch for autocompletion, in a Beego application (github.com/astaxie/beego v1.12.1).

The app is deployed in containers (HAProxy + 2 backend servers where the app runs).

Scorch is using 22 GB of memory after 12 hours. After 24 hours it often uses more than 36 GB, sometimes even sooner, depending mainly on bot traffic; that represents fewer than 3 million searches in 24 hours. Once it passes 36 GB the system begins to be unresponsive, and in the worst case the server needs to be rebooted.

The indexes are opened once when the app starts. I can give access to the code privately only.

I've attached a memory profile as an SVG to show the memory consumption.

For now I manage memory by restarting the application based on its usage, but that feels like a quick-and-dirty solution.

Is there a way to manage the memory consumption more properly?

Any help is welcome :-)

Thanks in advance


profile006.svg

Marty Schoch

Mar 23, 2021, 10:55:48 AM
to bl...@googlegroups.com
So, there are several things to consider here.

First, I'm unfamiliar with the "patch for autocompletion" in Beego, so I have no idea how that affects or may be related to this.

Second, I see from the memory profile that you are using zap v11.  This is the default in Bleve v1.x for backwards-compatibility reasons, but v11 has known problems.  Specifically, it increasingly wastes space as the segment files get larger (the opposite of what should happen).  Zap v15 is the latest version, and it is supported in v1.x; however, if you are going to upgrade, I recommend considering a full upgrade to Bleve 2.x, as community support only really covers the latest version.  This may not fix or even help your issue at all, but generally smaller indexes on disk take less memory to search at query time.
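If you do go down the newer-segment route, the sketch below shows roughly how an index can be created with a forced segment version.  The scorch option keys ("forceSegmentType", "forceSegmentVersion") and the v2 import paths are from my reading of scorch's config handling, not something verified against your exact setup, so double-check them against the bleve version you actually run:

```go
package main

import (
	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/index/scorch"
	"github.com/blevesearch/bleve/v2/mapping"
)

// newZap15Index creates a scorch index that is forced to write zap v15
// segments instead of the backwards-compatible default.
func newZap15Index(path string, m mapping.IndexMapping) (bleve.Index, error) {
	return bleve.NewUsing(path, m, scorch.Name, scorch.Name, map[string]interface{}{
		"forceSegmentType":    "zap", // assumed key names; verify for your version
		"forceSegmentVersion": 15,
	})
}
```

Note this only affects newly written segments; an existing index would need to be rebuilt (reindexed) to get the smaller on-disk format.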

Next, was this profile taken on a production system with multiple queries active?  Or was this taken with just a single query running?  The memory usage would appear to be excessive for a single query, but could be quite normal if it represents multiple queries in aggregate.

You asked if there is a "way to more properly manage the memory consumption".  There are a few things, some related to memory used while indexing, and some while searching.  I'll only go into the ones at query time based on the context of your question.  We offer two callback functions which can be set, one which fires before a query starts and one that fires after.


Both functions take one argument: an estimate, in bytes, of the memory required to complete the search.  This estimate is often not very good, but it is something.  If the first function returns an error, we abort execution of the search and return the error you gave us.  You can use these functions to track how much memory is used by active searches, and begin rejecting queries when it gets too high.  Again, because the estimates are not great, the usage you track will not be accurate, but it will give you some idea, and you can try to dial in to an acceptable level.

But, at the end of the day, all this does is reject queries, and nobody really wants to do that, unless the index is replicated and you can offload query workload to other servers.  So, you really want to look at what types of queries you are executing, and whether there is anything you can change to make them require less memory.  Things like large values for size (to return all matches) or computing large facets/aggregations all contribute to using a lot of memory.
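For instance, instead of one request that returns all matches, a bounded, paged request keeps per-query memory small.  The sketch below uses bleve's standard SearchRequest fields; the query and values are illustrative only:

```go
// Illustrative fragment: cap the per-request result size and page through
// matches rather than requesting everything at once (e.g. a huge Size).
q := bleve.NewMatchQuery("lamotte")
req := bleve.NewSearchRequest(q)
req.Size = 20           // bounded page size
req.From = 2 * req.Size // third page of results
res, err := index.Search(req)
```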

Finally, you mentioned you can only share the code in private.  Reviewing code in private is a service I make available to my sponsors, usually with some additional hourly contract work set up.  Reach out to me directly if you're interested.

marty

--
You received this message because you are subscribed to the Google Groups "bleve" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bleve+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bleve/8bbc6793-55a7-4358-a7f2-9c9a152457bfn%40googlegroups.com.

Denis Lamotte

Mar 23, 2021, 11:31:02 AM
to bleve
Thank you Marty for your quick response, always very valuable :-)


> So, there are several things to consider here.
>
> First, I'm unfamiliar with the "patch for autocompletion" in Beego, so I have no idea how that affects or may be related to this.

Well, it is this pull request (https://github.com/blevesearch/bleve/pull/858), which added MatchPhrasePrefixQuery (autocomplete functionality).
I think it is not compatible with version 2 and would need to be adapted.
 

> Second, I see from the memory profile that you are using zap v11.  This is the default in Bleve v1.x for backwards-compatibility reasons, but v11 has known problems.  Specifically, it increasingly wastes space as the segment files get larger (the opposite of what should happen).  Zap v15 is the latest version, and it is supported in v1.x; however, if you are going to upgrade, I recommend considering a full upgrade to Bleve 2.x, as community support only really covers the latest version.  This may not fix or even help your issue at all, but generally smaller indexes on disk take less memory to search at query time.
OK, the only problem I may have with the upgrade is the autocomplete point, but I'll try to upgrade to Zap v15.

> Next, was this profile taken on a production system with multiple queries active?  Or was this taken with just a single query running?  The memory usage would appear to be excessive for a single query, but could be quite normal if it represents multiple queries in aggregate.
 
It was taken in production with multiple queries.

> You asked if there is a "way to more properly manage the memory consumption".  There are a few things, some related to memory used while indexing, and some while searching.  I'll only go into the ones at query time based on the context of your question.  We offer two callback functions which can be set, one which fires before a query starts and one that fires after.
>
> Both functions take one argument: an estimate, in bytes, of the memory required to complete the search.  This estimate is often not very good, but it is something.  If the first function returns an error, we abort execution of the search and return the error you gave us.  You can use these functions to track how much memory is used by active searches, and begin rejecting queries when it gets too high.  Again, because the estimates are not great, the usage you track will not be accurate, but it will give you some idea, and you can try to dial in to an acceptable level.

> But, at the end of the day, all this does is reject queries, and nobody really wants to do that, unless the index is replicated and you can offload query workload to other servers.  So, you really want to look at what types of queries you are executing, and whether there is anything you can change to make them require less memory.  Things like large values for size (to return all matches) or computing large facets/aggregations all contribute to using a lot of memory.
I'll review those queries.

> Finally, you mentioned you can only share the code in private.  Reviewing code in private is a service I make available to my sponsors, usually with some additional hourly contract work set up.  Reach out to me directly if you're interested.
 
I will consider making a fully public example, without the private logic related to my client, so that the whole community can follow along.
I will reach out to you if needed.

Thanks again
 