How to control the state of the agents


Pablo Maldonado

Feb 12, 2019, 11:09:57 AM
to schedulix
Hello, 

I need to find out which agents are offline or down by querying the database. I put together the query below, but it does not report the correct agent status.

SELECT * FROM (SELECT rsn.name,
                      CASE 
                          WHEN rss.is_online = 1 THEN "ONLINE"
                          WHEN rss.is_online = 0 THEN "OFFLINE"
                      END AS status
               FROM   RESSOURCE rss,
                      NAMED_RESOURCE rsn
               WHERE      rss.nr_id = rsn.id
                      AND rsn.name like 'AGENT@%'
              ) rs
WHERE rs.status = 'OFFLINE';


Pablo Maldonado

Ronald Jeninga

Feb 12, 2019, 7:06:32 PM
to schedulix
Hi Pablo,

The easiest method is to execute "list sessions;" through sdmsh.
The output can be easily parsed.

You can also retrieve the list of agents by querying the SCOPE table.
E.g.

[SYSTEM@localhost:2506] SDMS> exit   
-bash-4.2$ sdmsh

Connect

CONNECT_TIME : 13 Feb 2019 00:04:33 GMT

Connected

[SYSTEM@localhost:2506] SDMS> select id from sci_scope where type = 'SERVER' with id scope;

Selected Values

ID                              
--------------------------------
GLOBAL.EXAMPLES.HOST_1.SERVER   
GLOBAL.EXAMPLES.HOST_2.SERVER   
GLOBAL.EXAMPLES.LOCALHOST.SERVER


3 Row(s) selected

[SYSTEM@localhost:2506] SDMS> 

The names that are in the full list, but not in the list of sessions are the ones that aren't connected.

Querying the *res(s)ource tables won't help, as the required information isn't stored there.
In fact, whether an agent is connected isn't stored anywhere on disk, so there's no alternative to asking the server.
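
The comparison of the two lists can be sketched as follows. This is only an illustration, assuming both lists have already been reduced to one name per line; the helper name missing_agents and the sample data are made up, and no sdmsh call is involved.

```shell
#!/bin/bash
# missing_agents ALL_FILE CONNECTED_FILE  (hypothetical helper name)
# Prints every line of ALL_FILE that does not appear, as a whole line,
# in CONNECTED_FILE.
missing_agents() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -vxFf "$2" "$1"
}

# Made-up sample data standing in for the parsed sdmsh output.
all=$(mktemp) && connected=$(mktemp) || exit 1
printf '%s\n' GLOBAL.EXAMPLES.HOST_1.SERVER \
              GLOBAL.EXAMPLES.HOST_2.SERVER \
              GLOBAL.EXAMPLES.LOCALHOST.SERVER > "$all"
printf '%s\n' GLOBAL.EXAMPLES.LOCALHOST.SERVER \
              GLOBAL.EXAMPLES.HOST_1.SERVER > "$connected"

missing_agents "$all" "$connected"    # prints GLOBAL.EXAMPLES.HOST_2.SERVER
rm -f "$all" "$connected"
```

Because grep compares whole lines as fixed strings, neither file has to be sorted first.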

HTH

Best regards,

Ronald

Pablo Maldonado

Feb 13, 2019, 12:45:53 PM
to schedulix
Hello Ronald,

How can I write the output of the query you ran in sdmsh to a file?

Pablo

Pablo Maldonado

Feb 14, 2019, 5:04:52 PM
to schedulix
Hello Ronald,

It's very simple:
sdmsh < query.sdms > query.log

Thank you

Pablo Maldonado

Pablo Maldonado

Feb 14, 2019, 5:47:42 PM
to schedulix
Hello Ronald,

This is far from being a good script, but as a test it works well.

*************************************************************************************************************

#!/bin/bash

sdmsh --host xxx --port xxx --user SYSTEM --pass xxx < get_sessions.sql  > get_sessions.log
sdmsh --host xxx --port xxx --user SYSTEM --pass xxx < get_agent.sql     > get_agent.log

### apply grep
grep GLOBAL get_sessions.log | awk '{ print $10 }'   > get_agentdown.log
grep GLOBAL get_agent.log    | awk '{ print $1  }'  >> get_agentdown.log

sort -n get_agentdown.log > tmp.log
mv tmp.log get_agentdown.log

uniq -uc get_agentdown.log > tmp.log

cat tmp.log | awk '{ print $2 }' > get_agentdown.log


cat get_agentdown.log

rm -f get_sessions.log
rm -f get_agent.log
rm -f tmp.log

exit

----------------------------------------------------------

get_sessions.sql  :  list sessions;

get_agent.sql : select id from SCI_SCOPE where type = 'SERVER' with id scope;



Ronald Jeninga

Feb 15, 2019, 4:06:52 AM
to schedulix
Hi Pablo,

I'm glad you managed to solve the problem yourself.

And I learned something :-)
I wasn't aware of the count option in uniq. (Although you add the count and then use awk to get rid of it; I think omitting the "-c" and the subsequent awk would be worth evaluating.)
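
A minimal sketch of that simplification, with made-up names instead of real sdmsh output; the function name only_once is just for illustration.

```shell
#!/bin/bash
# Reads names on stdin and prints those that occur exactly once.
only_once() {
    sort | uniq -u    # uniq compares adjacent lines only, hence the sort
}

# Made-up data: the agent list followed by the names found in the session list.
printf '%s\n' HOST_1 HOST_2 LOCALHOST HOST_1 LOCALHOST | only_once   # prints HOST_2
```

One caveat of this approach: a name that shows up only in the session list is printed as well, so it relies on every session name also being present in the agent list.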

Have a look at the trap statement. That'll enable you to remove all the created temporary files, even if your script dies unexpectedly.
E.g.

 trap "rm -f get_sessions.log get_agent.log tmp.log" 0 1 2 3

Thank you,

Ronald

Pablo Maldonado

Feb 19, 2019, 10:57:07 AM
to schedulix
Hello Ronald,

I improved it a little, and the script now looks like this. Now I just need to integrate it with Nagios.
Thank you very much for the help

#!/bin/bash

LOG_PATH=/home/schedulix/ctrlAgent/log
SQL_PATH=/home/schedulix/ctrlAgent/sql
SQL_FILE_AGENT=ctrlAgents
SQL_FILE_SESS=ctrlSessions
LOG_FINAL=listAgentDown
LOG_TMP=listAgentDown2

########################################################
# Retrieve the agents and sessions.                    #
########################################################
sdmsh --host xxx --port xxx --user SYSTEM --pass xxx < $SQL_PATH/$SQL_FILE_AGENT.sql > $LOG_PATH/$SQL_FILE_AGENT.log
sdmsh --host xxx --port xxx --user SYSTEM --pass xxx < $SQL_PATH/$SQL_FILE_SESS.sql  > $LOG_PATH/$SQL_FILE_SESS.log


########################################################
# Build a combined log of the GLOBAL lines.            #
########################################################
grep GLOBAL $LOG_PATH/$SQL_FILE_AGENT.log | awk '{ print  $1 }' >  $LOG_PATH/$LOG_FINAL.log
grep GLOBAL $LOG_PATH/$SQL_FILE_SESS.log  | awk '{ print $10 }' >> $LOG_PATH/$LOG_FINAL.log

sort -n $LOG_PATH/$LOG_FINAL.log | uniq -u > $LOG_PATH/$LOG_TMP.log
mv $LOG_PATH/$LOG_TMP.log $LOG_PATH/$LOG_FINAL.log

########################################################
# Show the agents in DOWN state.                       #
########################################################
cat $LOG_PATH/$LOG_FINAL.log

########################################################
# Remove the log files.                                #
########################################################
trap "rm -f $LOG_PATH/$SQL_FILE_AGENT.log $LOG_PATH/$SQL_FILE_SESS.log $LOG_PATH/$LOG_FINAL.log" 0 1 2 3

exit 0

Thank you

Pablo Maldonado

Ronald Jeninga

Feb 19, 2019, 11:19:13 AM
to schedulix
Hi Pablo, 

maybe you should consider using a construct like

SQL_FILE_AGENT="select id from SCI_SCOPE where type = 'SERVER' with id scope;"
SQL_FILE_SESS="list sessions;"

echo "$SQL_FILE_AGENT" | sdmsh --ini some_ini_file > $LOG_PATH/ctrlAgents.log
echo "$SQL_FILE_SESS" | sdmsh --ini some_ini_file > $LOG_PATH/ctrlSessions.log

...

At least that's more like I would do it.
The big advantage is that you only need to take care of a single (executable) file, not a set of 3 files.
The some_ini_file reduces that benefit, since you'll now need two files, but it keeps the passwords hidden instead of visible through ps.
You could even generate the file from the information available in the script (Host, Port, User, Password) and delete it after the script has run.
E.g.

INIFILE=.sdmshrc.$$

echo "
Host=ocelot
Port=2506
User=ronald
Password=unknown
" > $LOG_PATH/$INIFILE

That again would eliminate both the extra file and the passwords on the command line.
In my example some_ini_file would be $LOG_PATH/$INIFILE

The trap statement basically installs a signal handler.
This is why it should be at the start of your script (after setting your constants), not at the end.
The "0 1 2 3" means:

0 : at termination
1 : after a SIGHUP
2 : after a SIGINT
3 : after a SIGQUIT

You can add other signal numbers if you like (9, SIGKILL, cannot be trapped).
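
A small sketch of installing the handler before any work is done. The helper name with_scratch_dir and the example job are made up for illustration. The handler for HUP, INT and QUIT calls exit explicitly, because in bash a trapped signal otherwise lets the script carry on after the handler returns.

```shell
#!/bin/bash
# with_scratch_dir CMD [ARGS...]  (hypothetical helper)
# Runs CMD with the path of a fresh temporary directory appended as last
# argument and removes that directory when the run ends, normally or
# through HUP/INT/QUIT.
with_scratch_dir() (
    # The function body is a subshell, so the traps below stay local to it.
    scratch=$(mktemp -d) || exit 1
    trap 'rm -rf "$scratch"' EXIT    # EXIT is signal number 0
    trap 'exit 1' HUP INT QUIT       # 1 2 3: exit, which then fires the EXIT trap
    "$@" "$scratch"
)

# Example job: write a file into the scratch directory and read it back.
job() {
    echo "list sessions;" > "$1/query.sdms"
    cat "$1/query.sdms"
}

with_scratch_dir job
```

Using a directory from mktemp also avoids having to list every temporary file in the trap command.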

Just my 2 cents.

Please tell me about the integration in Nagios as soon as you get it working.

Best regards,

Ronald

John Tsui

Feb 25, 2019, 4:47:09 AM
to schedulix
Hi Ronald,

To check the agent status, I use the following:
echo "LIST JOB_SERVER GLOBAL WITH EXPAND = ALL;" | sdmsh | awk '{print $4,$2,$10}' | grep "^SERVER"

Is it better to query "SCI_SCOPE" instead?

Best Regards,
John Tsui
25.Feb.2019

Dieter Stubler

Feb 25, 2019, 7:54:43 AM
to schedulix
Hi,

When using sci_scope it is a bit more difficult to check whether a job server is alive.
With list job_server you get an idle time, which you can use to detect an inactive job server without remembering values between checks.
Using sci_scope you have to read the last_active time and remember it between checks, so that you can compare it with the previous value and see whether it is advancing.
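
What "remembering it between checks" could look like is sketched below. This is an assumption-laden illustration: the helper name has_advanced and the state file are made up, the timestamps are invented, and the value is compared as an opaque string because the exact format of last_active is not shown in this thread.

```shell
#!/bin/bash
# has_advanced STATE_FILE CURRENT_VALUE  (hypothetical helper)
# Succeeds if CURRENT_VALUE differs from the value stored by the previous
# call, then stores CURRENT_VALUE for the next call.
has_advanced() {
    previous=""
    [ -f "$1" ] && previous=$(cat "$1")
    printf '%s\n' "$2" > "$1"
    [ "$2" != "$previous" ]
}

# Demo with made-up timestamps.
state=$(mktemp) || exit 1
has_advanced "$state" "2019-02-25 07:50:00" && echo "advancing"
has_advanced "$state" "2019-02-25 07:50:00" || echo "stalled"
rm -f "$state"
```

Note that the very first check has nothing to compare against and therefore reports the value as advancing.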

Regards
Dieter

Pablo Maldonado

Feb 25, 2019, 1:27:06 PM
to schedulix
Hi Ronald

In the end, the script with the Nagios integration looks like this. At least it works well for now.

#!/bin/bash

export BICSUITEHOME=/Software/Schedulix/schedulix
export BICSUITECONFIG=/Software/Schedulix/etc
export BICSUITELOGDIR=/Software/Schedulix/log
export PATH=$BICSUITEHOME/bin:$PATH
export SWTJAR=/Software/Schedulix/swt/swt.jar
export JNAJAR=/usr/share/java/jna.jar

export WRK_PATH=/Software/Schedulix/admin
export SQL_FILE_AGENT="select id from SCI_SCOPE where type = 'SERVER' with id scope;"
export LOG_FILE_AGENT=ctrlAgents.log
export SQL_FILE_SESS="list sessions;"
export LOG_FILE_SESS=ctrlSessions.log
export LOG_FINAL=listAgentDown.log
export LOG_TMP=listAgentDown2.log
export NUM_AGENT=0
export INIFILE=.sdmshrc
export MESSAGE="Critical"
export RETURN=2

########################################################
# If INIFILE already exists, stop execution.           #
########################################################
if [ -f $WRK_PATH/$INIFILE ]; then
        exit 0
fi

########################################################
# Remove the log files.                                #
########################################################
trap "rm -f $WRK_PATH/$LOG_FILE_AGENT $WRK_PATH/$LOG_FILE_SESS $WRK_PATH/$LOG_FINAL $WRK_PATH/$INIFILE" 0 1 2 3

########################################################
# Create the temporary INIFILE.                        #
########################################################
echo "
User=SYSTEM
Password=xxxx
Host=xxxx
Port=xxxx
Timeout=0
" > $WRK_PATH/$INIFILE

########################################################
# Retrieve the agents and sessions.                    #
########################################################
echo $SQL_FILE_AGENT | sdmsh --ini $WRK_PATH/$INIFILE > $WRK_PATH/$LOG_FILE_AGENT
echo $SQL_FILE_SESS  | sdmsh --ini $WRK_PATH/$INIFILE > $WRK_PATH/$LOG_FILE_SESS

########################################################
# Build a combined log of the GLOBAL lines.            #
########################################################
grep GLOBAL $WRK_PATH/$LOG_FILE_AGENT | awk '{ print  $1 }' >  $WRK_PATH/$LOG_FINAL
NUM_AGENT=`wc -l $WRK_PATH/$LOG_FINAL | awk '{ print $1 }'`
grep GLOBAL $WRK_PATH/$LOG_FILE_SESS  | awk '{ print $10 }' >> $WRK_PATH/$LOG_FINAL 

#### echo "Number of registered agents : $NUM_AGENT"

sort -n $WRK_PATH/$LOG_FINAL | uniq -u > $WRK_PATH/$LOG_TMP
mv $WRK_PATH/$LOG_TMP $WRK_PATH/$LOG_FINAL

########################################################
# Show the agents in DOWN state.                       #
########################################################
## cat $WRK_PATH/$LOG_FINAL 


########################################################
# Report the agents in DOWN state to Nagios.           #
########################################################

if [ ! -s $WRK_PATH/$LOG_FINAL ]; then
    ## exit with ok
    MESSAGE="Ok"
    RETURN=0
else
    ## exit with warning or critical

    txtOutput=""
    countAgent=0

    while read linea
    do
        count=1
        for host in $(echo $linea | tr "." "\n")
        do
            if [ $count -eq 4 ]; then
                txtOutput="$txtOutput [$host] "
                countAgent=$((countAgent + 1))
            fi
            count=$((count + 1))
        done
    done < $WRK_PATH/$LOG_FINAL

    if [ $countAgent -eq $NUM_AGENT ]; then
        ## exit with critical
        MESSAGE="Critical Agent Down : $txtOutput"
        RETURN=2
    else
        ## exit with warning
        MESSAGE="Warning Agent Down : $txtOutput"
        RETURN=1
    fi

fi

## Final return to Nagios
echo $MESSAGE
exit $RETURN;
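
The mapping from the number of agents down to a Nagios result can also be isolated in a small function. This is only a sketch: the function name nagios_result is made up, the messages are copied from the script above, and the exit codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```shell
#!/bin/bash
# nagios_result DOWN TOTAL [LIST]  (hypothetical helper)
# Prints the status message and returns the matching Nagios exit code.
nagios_result() {
    if [ "$1" -eq 0 ]; then
        echo "Ok"
        return 0
    elif [ "$1" -ge "$2" ]; then
        # every known agent is down
        echo "Critical Agent Down : $3"
        return 2
    else
        # some, but not all, agents are down
        echo "Warning Agent Down : $3"
        return 1
    fi
}

# Demo: one of three agents is down.
nagios_result 1 3 "[HOST_2]" || echo "exit code: $?"
```

In a plugin the call would typically be the last statement, followed by exit $? so that Nagios sees the code.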


Thank you

Pablo Maldonado

Feb 25, 2019, 4:49:57 PM
to schedulix
Hi Dieter,

If I understand correctly, it would be better like this:

#!/bin/bash

export BICSUITEHOME=/Software/Schedulix/schedulix
export BICSUITECONFIG=/Software/Schedulix/etc
export BICSUITELOGDIR=/Software/Schedulix/log
export PATH=$BICSUITEHOME/bin:$PATH
export SWTJAR=/Software/Schedulix/swt/swt.jar
export JNAJAR=/usr/share/java/jna.jar

export WRK_PATH=/Software/Schedulix/admin
export SQL_LIST_AGENTS="LIST JOB_SERVER GLOBAL WITH EXPAND = ALL;"
export LOG_LIST_AGENTS=listAgents.log
export INIFILE=.sdmshrc
export MESSAGE="Critical"
export RETURN=2

########################################################
# If INIFILE already exists, stop execution.           #
########################################################
if [ -f $WRK_PATH/$INIFILE ]; then
        exit 0
fi

########################################################
# Remove the log files.                                #
########################################################
trap "rm -f $WRK_PATH/$LOG_LIST_AGENTS $WRK_PATH/$INIFILE" 0 1 2 3

########################################################
# Create the temporary INIFILE.                        #
########################################################
echo "
User=SYSTEM
Password=xxxx
Host=xxxx
Port=xxxx
Timeout=0
" > $WRK_PATH/$INIFILE

########################################################
# Retrieve the agents and their states.                #
########################################################
echo $SQL_LIST_AGENTS | sdmsh --ini $WRK_PATH/$INIFILE | grep "SERVER" | awk '{print $2,$10}' > $WRK_PATH/$LOG_LIST_AGENTS

count_line=`wc -l $WRK_PATH/$LOG_LIST_AGENTS | awk '{ print $1 }'`
count_down=0
txtOutput=""

while read line
do
    host=`echo $line | tr "." " " | awk '{ print $4 }'`
    status=`echo $line | awk '{ print $2 }'`

    if [ "$status" = "false" ]; then
        count_down=$((count_down + 1))
        txtOutput="$txtOutput [ $host ] "
    fi

done < $WRK_PATH/$LOG_LIST_AGENTS

########################################################
# Report the agents in DOWN state to Nagios.           #
########################################################

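if [ $count_down -eq 0 ]; then
    ## exit with ok
    MESSAGE="Ok"
    RETURN=0
else
    if [ $count_down -eq $count_line ]; then
        ## exit with critical
        MESSAGE="Critical Agent Down : $txtOutput"
        RETURN=2
    else
        ## exit with warning
        ## (the post is cut off here; the remaining lines are assumed to
        ##  mirror the end of the previous script)
        MESSAGE="Warning Agent Down : $txtOutput"
        RETURN=1
    fi
fi

## Final return to Nagios
echo $MESSAGE
exit $RETURN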

Ronald Jeninga

Feb 27, 2019, 4:04:36 AM
to schedulix
Hi Pablo,

I like your implementation.
I'm sorry that the fact that the LIST SCOPE statement provides all required information slipped my mind.
But I think you've learned a bit about shell programming on the way.

Maybe it's a good idea to set the umask to 077 when creating the ini file. That would make it harder to spy out passwords.
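
A minimal sketch of that idea, assuming the ini file is generated by the script itself. The helper name write_ini and the placeholder credentials are made up; the key names are the ones used in the scripts in this thread.

```shell
#!/bin/bash
# write_ini FILE USER PASSWORD HOST PORT  (hypothetical helper)
write_ini() (
    # Subshell body: the umask change does not leak into the calling script.
    umask 077    # newly created files get mode 0600 (owner read/write only)
    printf 'User=%s\nPassword=%s\nHost=%s\nPort=%s\nTimeout=0\n' \
        "$2" "$3" "$4" "$5" > "$1"
)

# Demo with placeholder credentials in a throwaway directory.
dir=$(mktemp -d) || exit 1
write_ini "$dir/.sdmshrc" SYSTEM xxxx localhost 2506
ls -l "$dir/.sdmshrc"
rm -rf "$dir"
```

The umask only affects files that are newly created; a file that already exists keeps its previous permissions.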

Best regards,

Ronald

Pablo Maldonado

Feb 28, 2019, 9:58:56 AM
to schedulix
Hi Ronald,

Thank you, Ronald. You helped me a lot, including with making my script more secure.
I'm already using it with Nagios today and it works perfectly, just as I expected.

The final script is attached. Sorry for my English.

Thank you


Pablo Maldonado
ctrlAgents.sh