Proposal for statistics schema


Álvaro Justen [Turicas]

Jun 7, 2012, 3:54:30 PM
to py...@googlegroups.com
Hi folks,
I've created a schema for statistics. These statistics will be used by
the monitor to show relevant information about the cluster and will be
saved in MongoDB (the save interval will be specified in the manager
configuration, so the manager will propagate it to all brokers,
pipeliners and monitors).
Basically, the document saved in MongoDB needs to have 2 keys:

- host: a dictionary with host information, like IP address, number of
CPUs, total memory, used memory (by all processes) etc.
- processes: a list of dictionaries with information about the processes
relevant to us (for now, only the broker and workers). The dictionaries
in this list will have the following keys about the process:
* type: 'broker' or 'worker'
* cpu percent: percentage of CPU used by this process
* pid: process ID
* resident memory: resident memory used by the process
* virtual memory: virtual memory used by the process

I've created a gist with a script that can generate a sample of this schema:
https://gist.github.com/2891134
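Just to make the idea concrete, here is a minimal sketch of how such a
script could collect these two keys, assuming psutil is used for the
measurements (the gist is the authoritative version; the key names and
helper names below are only illustrative):
----- 8< -----
import psutil

def host_info():
    mem = psutil.virtual_memory()
    return {'cpu percent': psutil.cpu_percent(),
            'number of cpus': psutil.cpu_count(),
            'memory total': mem.total,
            'memory used': mem.used,
            'memory free': mem.free,
            'memory percent': mem.percent}

def process_info(pid, process_type):
    # process_type is 'broker' or 'worker'
    process = psutil.Process(pid)
    mem = process.memory_info()
    return {'type': process_type,
            'pid': pid,
            'cpu percent': process.cpu_percent(interval=0.1),
            'resident memory': mem.rss,
            'virtual memory': mem.vms}

def statistics_document(broker_pid, worker_pids):
    # broker_pid and worker_pids would come from the broker itself
    processes = [process_info(broker_pid, 'broker')]
    processes += [process_info(pid, 'worker') for pid in worker_pids]
    return {'host': host_info(), 'processes': processes}
----- >8 -----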

An example:
----- 8< -----
{'host': {'cpu percent': 5.1,
          'ip': '...',
          'memory buffers': 15491072L,
          'memory cached': 410517504,
          'memory free': 129654784L,
          'memory free virtual': 0,
          'memory percent': 96.70761604316898,
          'memory real free': 555663360L,
          'memory real percent': 85.88978304215279,
          'memory real used': 3382358016L,
          'memory total': 3938021376L,
          'memory total virtual': 0,
          'memory used': 3808366592L,
          'memory used virtual': 0,
          'number of cpus': 4,
          'pid': 5687},
 'processes': [{'cpu percent': 0.0,
                'pid': 5687,
                'resident memory': 7254016,
                'type': 'broker',
                'virtual memory': 44728320}]}
----- >8 -----

Notes to Flávio:
1- You can use this script to create fake entries in MongoDB and test the monitor.
2- When I merge feature/new-backend into develop (I'll do it later
this week), I think the best option is for monitor to get its
configuration from manager (as broker already does, by sending the
'get configuration' command to manager's API). This way we will avoid
a lot of configuration files (and the need to sync them across all
cluster nodes when something changes).
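To make note 2 concrete, a rough sketch of monitor asking manager for
its configuration, assuming the same ZeroMQ REQ/REP + JSON command
style broker uses (the endpoint address here is hypothetical and the
real message format is whatever manager's API already defines):
----- 8< -----
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect('tcp://manager-host:5555')  # hypothetical address/port

socket.send_json({'command': 'get configuration'})
configuration = socket.recv_json()
print(configuration)
----- >8 -----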

[]s
--
 Álvaro Justen "Turicas"
   http://blog.justen.eng.br http://twitter.com/turicas
   http://CursoDeArduino.com.br http://github.com/turicas
   +55 21 9898-0141

Flavio Coelho

Jun 7, 2012, 4:16:47 PM
to py...@googlegroups.com
Alvaro,

I think we also need to include active jobs. I'd like to be able to track jobs, their info, running time and a possible traceback when they fail. And we need a timestamp as well, so that we can create time-series plots with this data.

I think we should configure the statistics collection as a capped collection, so that we keep only, say, a month of statistics. Or come up with a way to automatically archive older data.
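Something like this with pymongo, for instance (the database/collection
names and the size cap below are placeholders):
----- 8< -----
import pymongo

db = pymongo.MongoClient()['pypln']  # hypothetical database name
# Capped collection: MongoDB drops the oldest documents once the size
# limit is reached, so we keep roughly a fixed window of statistics.
db.create_collection('statistics', capped=True, size=100 * 1024 * 1024)
----- >8 -----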

I am anxious for this merge!

cheers,

Flávio 

--
Flávio Codeço Coelho
================
+55(21) 3799-5567
Professor
Escola de Matemática Aplicada 
Fundação Getúlio Vargas
Rio de Janeiro - RJ
Brasil


Álvaro Justen [Turicas]

Jun 7, 2012, 4:47:55 PM
to py...@googlegroups.com


On Jun 7, 2012, 5:17 PM, "Flavio Coelho" <fcco...@gmail.com> wrote:
>
> Alvaro,
>
> I think we also need to include active jobs. I'd like to be able to track jobs, their info, running time and a possible traceback when they fail. And we need a timestamp as well, so that we can create time-series plots with this data.

You are right. I'll add this information as soon as I'm back at my notebook.
I'll add a 'timestamp' key to the document's root, and worker info (worker name, start timestamp, document id) to the worker process dictionaries.
The total number of jobs processed by each broker and the total amount of 'worker time' (since the broker started) will probably be interesting to us too.
I think failure information should go in another collection, since it will consist of exceptions rather than 'node usage' metrics (but I'll only work on this when creating the timeline tests, after the merge -- that's when I'll deal with lost jobs, job timeouts etc.).
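A hypothetical shape for such a 'failures' collection, with illustrative
names only -- this is not the final schema:
----- 8< -----
from datetime import datetime
import traceback

import pymongo
from bson import ObjectId

db = pymongo.MongoClient()['pypln']  # hypothetical database name

def record_failure(worker_name, document_id, job_id, exc):
    # Store the exception separately from the 'node usage' statistics.
    db['failures'].insert_one({
        'timestamp': datetime.utcnow(),
        'worker': worker_name,
        'document id': ObjectId(document_id),
        'job id': job_id,
        'traceback': ''.join(traceback.format_exception(type(exc), exc,
                                                        exc.__traceback__)),
    })
----- >8 -----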

> I think we should configure the statistics collection as a capped collection, so that we keep only, say, a month of statistics. Or come up with a way to automatically archive older data.

It depends! What if we increase the number of brokers in the middle of a month, or change the interval at which brokers send statistics?
I think the best approach here is to always keep the working set in memory, so MongoDB will answer quickly regardless of the collections' size. We also need to remember that the monitor won't be used 100% of the time, so I see no problem in loading more data into memory (on MongoDB's machine) when you open the monitor and then dropping it when you stop using it.

Álvaro Justen [Turicas]

Jun 8, 2012, 5:58:15 PM
to py...@googlegroups.com
I've updated the script with a lot more information (timestamp,
storage, network and some worker-related information).
An example of what will be saved in MongoDB:
https://gist.github.com/2891134#file_sample.py

The process of type 'broker' now has 'active workers'; the processes
of type 'worker' now have 'worker', 'started at' and 'document id'.
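In other words, the 'processes' part of a document now looks roughly
like this (the values and exact key names below are illustrative --
the gist has the real sample):
----- 8< -----
'processes': [{'type': 'broker',
               'pid': 5687,
               'cpu percent': 0.0,
               'resident memory': 7254016,
               'virtual memory': 44728320,
               'active workers': 2},
              {'type': 'worker',
               'pid': 5702,
               'cpu percent': 12.3,
               'resident memory': 9437184,
               'virtual memory': 51200000,
               'worker': 'tokenizer',
               'started at': 1339177095.0,
               'document id': '4fd3a2b1e6fb1a1f2c000001'}]
----- >8 -----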

Flávio, is it ok for you?

Flavio Coelho

Jun 8, 2012, 6:38:25 PM
to py...@googlegroups.com
Much better now, though I wonder if this is not too much... Anyway... if the stats collection starts growing too fast we can always increase the interval between reports. Maybe we can store it gzipped, if we are going to consume it sequentially.
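If we ever go that route, a rough sketch of the idea -- zlib-compress
the BSON and store just the bytes, at the cost of not being able to
query individual fields:
----- 8< -----
import zlib

import bson  # comes with pymongo

def pack(stats_document):
    # Store {'data': <compressed BSON>} instead of the document itself.
    return {'data': zlib.compress(bson.encode(stats_document))}

def unpack(stored):
    return bson.decode(zlib.decompress(stored['data']))
----- >8 -----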

The document id is the id of the document being processed, right? Not the job id? Sounds good to me.

I'll make the changes in the webmon right away. When do you plan to merge all this?

Flávio


Álvaro Justen [Turicas]

Jun 9, 2012, 5:14:36 AM
to py...@googlegroups.com
On Fri, Jun 8, 2012 at 7:38 PM, Flavio Coelho <fcco...@gmail.com> wrote:
> Much better now, though I wonder if this is not too much... Anyway... if the
> stats collection starts growing too fast we can always increase the interval
> between reports. Maybe we can store it gzipped, if we are going to consume
> it sequentially.

The BSON size of a document with information of 4 workers + 1 disk
partition + 4 network interfaces is 1747 bytes. We probably won't
*always* have 4 active workers, so the average size will be smaller
than that.
So, the maximum amount of monitoring information a broker will store
if it saves this information every minute is: 60 * 24 * 1747 =~
2.4MB/day. In a year, it'll be ~876MB.
So, if we have 4 machines running a broker each, with each machine
having 4 CPUs and active workers 100% of the time, we'll have ~3.42GB
of monitoring information in a year -- I think that's not much (but
note that we are not counting logs, information about jobs/pipelines,
or failures).
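Spelling out the arithmetic (the 1747-byte figure comes from measuring
the sample document, e.g. with len(bson.encode(doc)) from pymongo's
bson module):
----- 8< -----
doc_size = 1747                       # bytes, sample with 4 workers etc.
per_day = 60 * 24 * doc_size          # one document per minute
per_year = per_day * 365
print(per_day / 1024.0 ** 2)          # ~2.4 MB/day per broker
print(per_year / 1024.0 ** 2)         # ~876 MB/year per broker
print(4 * per_year / 1024.0 ** 3)     # ~3.42 GB/year for 4 brokers
----- >8 -----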

> The document id is the id of the document being processed, right? Not the job id?
> Sounds good to me.

Yes, so we can figure out which worker is consuming the most processing
time for which document (and manually check that document to correct
the worker, if there is a bug, or to reschedule the job). I've made a
little change to store the ObjectId instead of just its string
representation.
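The change in a nutshell (the id string below is hypothetical):
----- 8< -----
from bson import ObjectId

doc_id = '4fd3a2b1e6fb1a1f2c000001'  # hypothetical id string
process_entry = {'type': 'worker',
                 'worker': 'tokenizer',
                 'document id': ObjectId(doc_id)}  # ObjectId, not the string
----- >8 -----
This way monitor can query the corresponding document directly by _id.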

> I'll make the changes in the webmon right away. When do you plan to merge
> all this?

The merge is done! Please see my next email.