Conifer Client issue


Alain Lamothe

Feb 7, 2014, 8:54:42 AM
to conifer...@googlegroups.com, Aline Krause, Lorraine Racine, Marlene Bonin, Noella Cliche, Rachelle Larcher

Hi everyone,


Laurentian is continuing to experience issues with the Client. Circulation can't sign any books out or in, and Cataloguing can't import or save records. They keep getting network error messages or extremely slow response times.


On the other hand, the OPAC is functioning a-ok.


Alain



Alain Lamothe, M.Sc., M.L.I.S.
Chair, Department of Library and Archives
Head, Collections and Technical Services
J.N. Desmarais Library
Laurentian University
Sudbury, Ontario
Canada
P3E 2C6
(705) 675-1151 ext. 3304
alam...@laurentian.ca



Richard Scott

Feb 7, 2014, 9:05:37 AM
to conifer...@googlegroups.com
Good morning,

Some of the Conifer services that the staff client needs in order to sign in appear to be experiencing issues. We're restarting them now; hopefully that will bring things back into line.


Cheers,
Rick
--
Rick Scott // Library Technologies Specialist // Wishart Library @ AlgomaU

Good judgment comes from experience.
Experience comes from bad judgment. :Nasrudin


Allan Laporte

Feb 7, 2014, 9:10:23 AM
to conifer...@googlegroups.com
Windsor is experiencing the same issues.

Allan Laporte, BCS
Systems Technical Support Specialist • Leddy Library
University of Windsor • 401 Sunset Ave. • Windsor ON Canada • N9B 3P4
tel: 519/253-3000 (3164)

Support the University of Windsor •
http://www.uwindsor.ca/donations






Richard Scott

Feb 7, 2014, 9:17:21 AM
to conifer...@googlegroups.com
The service restart appears to have resolved the issue for me -- I can log in to the staff client and do circs. Can you have a look and see if things have returned to normal for y'all as well?


Cheers,
Rick
--
Rick Scott // Library Technologies Specialist // Wishart Library @ AlgomaU

Good judgment comes from experience.
Experience comes from bad judgment. :Nasrudin


Alain Lamothe

Feb 7, 2014, 9:28:38 AM
to conifer...@googlegroups.com

Thanks Richard!


Out of curiosity, are these issues related to the DoS attack of last week?


Alain




Alain Lamothe, M.Sc., M.L.I.S.
Chair, Department of Library and Archives
Head, Collections and Technical Services
J.N. Desmarais Library
Laurentian University
Sudbury, Ontario
Canada
P3E 2C6
(705) 675-1151 ext. 3304
alam...@laurentian.ca




Dan Scott

Feb 7, 2014, 10:03:54 AM
to Conifer testing and discussion
Hi Alain and everyone else:

The issues may be related to the approach we've been using to address the main symptom of what looks like (but might not be) a denial of service attack. This morning we identified the source of the queries (a link checker system, which could still be a front for a deliberate attack... but might also be entirely innocent) and have taken steps to prevent that source from issuing further such queries--but we'll see if that's effective.
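For anyone wondering how a source like this gets identified: the usual first stop is the web server's access logs. A rough sketch of the sort of thing involved -- the log path here is just the Debian/Ubuntu Apache default, and the address is a documentation-range placeholder, not the actual source:

    # Count requests per client IP to spot an unusually chatty source
    awk '{print $1}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head

    # Then look at what that address is actually asking for (and what
    # user agent it claims to be, in the last quoted field of each line)
    grep '^203\.0\.113\.45 ' /var/log/apache2/access.log | tail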

In a nutshell: almost everything in Evergreen needs to go through the database. The database is capable of handling multiple concurrent queries, but in practice (due in part to the way our database server is set up, and in part to how our database has grown over time) you can think of it as handling every request in sequence.

In a best case scenario, read-only queries like searches can be resolved entirely in RAM. Over time the operating system learns which parts of the database are frequently accessed and tries to cache those in RAM, which is by far the fastest data access method. These can run in a second or less.
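As an aside for the technically curious: the operating system's page cache is invisible to PostgreSQL itself, but PostgreSQL's own buffer-cache statistics give a rough proxy for how often reads are being served from memory. A minimal sketch, assuming psql access to the database server (connection details are placeholders):

    # Rough shared-buffers hit ratio; numbers near 100 mean most reads
    # are coming from memory rather than disk
    psql -c "SELECT round(sum(blks_hit) * 100.0
                          / sum(blks_hit + blks_read), 2) AS hit_pct
               FROM pg_stat_database;"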

When a really pathological search query comes in, however (and by pathological I mean one that ends up having to touch almost every row in the bibliographic database), problems can occur. For one thing, over the past four years our database has grown enough that the server no longer has enough RAM to hold it all in memory. Once the database has to go to disk, access to the data is up to a thousand times slower than access to RAM. In addition, the cached data in RAM starts getting pushed out... so subsequent queries are more likely to have to go back to disk. When a query takes minutes to handle rather than a few seconds, things start to fall apart: other requests that need to update the data the search query is looking at have to wait until the rows they need to touch are free, and those requests can start to build up in a queue. Evergreen also has a 60-second timeout built in for requests; it terminates long database queries, which automatically rolls back the changes they had requested.
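That 60-second limit lives in Evergreen's own configuration, but PostgreSQL has an analogous safety valve of its own. A hedged illustration of the mechanism -- I'm not saying Conifer has this set, just showing what a database-level timeout looks like:

    psql <<'SQL'
    -- Abort any statement on this session that runs longer than 60 seconds
    SET statement_timeout = '60s';
    -- This one gets cancelled after a minute, and its work rolled back
    SELECT pg_sleep(120);
    SQL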

What we've been seeing, for the first time since Conifer started running, is repeated sets of identical pathological queries coming in at the same time. This escalates the problem significantly, because now there is a whole lot of contention going on in the system.

Short term, what we've been doing to handle the pathological queries is killing them directly at the database. Manually. Which is really pretty crazy, when you think about it, but necessity / motherhood / trying to keep the system alive / next thing you know you're a slave to the machine. I am beginning to suspect (but am not sure yet) that killing these queries may also have the side effect of disturbing the Evergreen processes that were connected to the database, potentially causing subsequent requests from the same Evergreen process to fail.
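For the curious, "killing them directly at the database" looks roughly like the following on our PostgreSQL 9.1 (note that the pg_stat_activity column names changed to pid/query in 9.2). This is a sketch, not our exact commands, and 12345 is a placeholder process ID:

    psql <<'SQL'
    -- Find queries that have been running for more than a minute
    SELECT procpid, now() - query_start AS runtime, current_query
      FROM pg_stat_activity
     WHERE current_query <> '<IDLE>'
       AND now() - query_start > interval '1 minute'
     ORDER BY runtime DESC;

    -- Ask one of them to cancel itself (the backend stays alive)...
    SELECT pg_cancel_backend(12345);

    -- ...or, if it won't let go, kill its backend outright. This is the
    -- heavy hammer that may be what disturbs the attached Evergreen process.
    SELECT pg_terminate_backend(12345);
    SQL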

This morning I had hoped to upgrade PostgreSQL to the latest minor version (we're at 9.1.9, and 9.1.11 is available with many important bug fixes and performance improvements), but unfortunately 9.1.11 is not yet available through the channel we've been using. So that will wait until the weekend; that way we'll be able to use our test system to ensure there are no surprises.

Longer term, when we move the system from Guelph to our new hosts in the June timeframe, we're going to get a whole lot of advantages:

a) From a hardware perspective, instead of the old-school spinning hard drives that we rely on now, they're using SSDs -- which are much closer to RAM in terms of performance.
b) They also plan to spec out the database servers with far more RAM than the actual size of the database, so read-only queries like searches should always be able to run in RAM.
c) They'll be running the latest major version of PostgreSQL (9.3), which brings many other performance improvements.

In theory, we'll also be able to rely on our hosts for Evergreen support in situations like this where we don't really have full-time people dedicated to Conifer support.

Shortly after we're on the new hardware platform and have ensured that we're stable, we'll be able to upgrade to the latest versions of OpenSRF and Evergreen (we're on Evergreen 2.4, while 2.5 has been out for a while now and 2.6 is just around the corner), both of which include performance and robustness improvements for cases like these long-running processes.

Sorry for the relative quiet on my side; I've been mostly working behind the scenes to try and support Robin, Rick, and Art. In the interests of transparency during what must be a very frustrating time for you all, though, I wanted to let you know what we're seeing on the systems side and what we've been trying to do.

Dan

Dan Scott

Feb 7, 2014, 10:52:01 AM
to Conifer testing and discussion
By the way, having just seen a fresh batch of the same requests come in--either because the source didn't honour our request, or just hasn't processed it yet--I have now resorted to blocking the IP range associated with the requests. Which is pretty severe, but should buy us breathing room.
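Blocking a range on a Linux host is typically a one-line firewall rule. A sketch using iptables, with a documentation-range placeholder network (the real offending range isn't something we'd post to a list):

    # Insert a DROP for the offending range at the top of the INPUT chain
    # so it is evaluated before any broader ACCEPT rules
    iptables -I INPUT 1 -s 203.0.113.0/24 -j DROP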

Due to that batch of requests, though, we may end up having to restart the services. Again.

Noella Cliche

Feb 7, 2014, 11:55:59 AM
to Conifer testing and discussion
We are back to a crawl here at Laurentian once again... unable to search the catalogue as well.

 
Noëlla Cliche
Library Assistant, Access Services
Bibliothèque J.N. Desmarais Library
Laurentian University
935 Ramsey Lake Road, Sudbury, Ontario, Canada P3E 2C6

Richard Scott

Feb 7, 2014, 1:50:58 PM
to conifer...@googlegroups.com
Any improvement now? I cleared out another round of long-running queries some time ago, and at least on this end, things seem to have been fairly responsive since.


Cheers,
Rick
--
Rick Scott // Library Technologies Specialist // Wishart Library @ AlgomaU

Good judgment comes from experience.
Experience comes from bad judgment. :Nasrudin


Dan Scott

Feb 7, 2014, 3:38:58 PM
to Conifer testing and discussion
On our continuing day of pain: around 3:00 we seem to have started experiencing more problems from the same source. It turns out that when I added a rule to the firewall, it came in the wrong order and so wasn't being applied. So... we're trying again. One more round of stops/starts.
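(For context: iptables evaluates rules top to bottom, so a DROP appended below an earlier catch-all ACCEPT never fires. A sketch of the check and the fix -- the rule number and the range are placeholders:)

    # Show INPUT rules with their positions; the DROP has to come before
    # any broader ACCEPT that would match the same traffic
    iptables -L INPUT -n --line-numbers

    # Remove the misplaced rule and re-insert it at the top of the chain
    iptables -D INPUT 7
    iptables -I INPUT 1 -s 203.0.113.0/24 -j DROP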

Sorry, again.

Dan