Ruby:
def is_prime(number)
for i in 2..(number-1)
if number%i==0 then return false
end
end
return true
end
######### start here: How many primes in 2 to 50000?
max_number=50000
start_number=2
total=0
time=Time.now.to_i
for i in start_number..max_number
if is_prime(i)
#puts i
total+=1
end
end
time=Time.now.to_i-time
puts "There are #{total} primes between #{start_number} and #{max_number}"
puts "Time taken #{time} seconds"
Java :
import java.io.*;
class prime
{
private static boolean isPrime(long i)
{
for(long test = 2; test < i; test++)
{
if((i%test) == 0)
{
return false;
}
}
return true;
}
public static void main(String[] args) throws IOException
{
long start_time = System.currentTimeMillis();
long n_loops = 50000;
long n_primes = 0;
for(long i = 2; i <= n_loops; i++)
{
if(isPrime(i))
{
n_primes++;
}
}
long end_time = System.currentTimeMillis();
System.out.println(n_primes + " primes found");
System.out.println("Time taken = " + (end_time - start_time) + " milliseconds");
}
}
Perl:
######### start here: How many primes in 2 to 50000?
# start timer
$start = time();
$max_number=50000;
$start_number=2;
$total=0;
for ($ii=$start_number;$ii<=$max_number;$ii++){
if (&is_prime($ii)==1)
{$total+=1}
}
# end timer
$end = time();
# report
print "There are ",$total," primes between ",$start_number," and ",$max_number,"\n";
print "Time taken was ", ($end - $start), " seconds";
sub is_prime() {
my ($number) = $_[0];
my ($si)=2;
for ($si;$si<$number;$si++){
if ($number % $si == 0) {
return 0;}
}
return 1
}
Well, Ruby is a scripting language and not compiled, so results
in the ballpark of Perl are about right. Only 10x
as slow as the Java JVM? Wow, that's pretty good, Ruby rocks!
Also, for more fun, run your Ruby through YARV and also
run zenoptimize on it
Benchmarks are great fun and prove nothing.
Ralph "PJPizza" Siegler
for i in 2...number
to
for i in 2..Math.sqrt(number).to_i
Which will significantly reduce the number of tests you will perform.
David
--
David Mitchell
Software Engineer
Telogis
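A minimal sketch of the change suggested above, applied to the original is_prime. A composite number always has a divisor no larger than its square root, so the loop can stop there. The guard for numbers below 2 is an addition that is not in the original post.

```ruby
# Trial division, testing divisors only up to the integer square root.
def is_prime(number)
  return false if number < 2            # added guard, not in the original
  for i in 2..Math.sqrt(number).to_i
    return false if number % i == 0
  end
  true
end

total = (2..50000).count { |n| is_prime(n) }
puts "There are #{total} primes between 2 and 50000"
```

For 50000 the inner loop now runs at most 223 times instead of nearly 50000 times.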
>Just new to Ruby since last week, running my same functional program on the windows XP(Pentium M1.5G), the Ruby version is 10 times slower than the Java version. The program is to find the prime numbers like 2, 3,5, 7, 11, 13... Are there setup issues? or it is normal?
>
>
Your programs aren't the same. The Java and Perl programs don't use
objects; the Ruby version does. Both the Perl and the Java versions could
overflow if the numbers get too big; the Ruby program doesn't suffer from this.
And look, I just made the Ruby version faster, smaller and more object
oriented:
class Integer
def prime?
not (2..Math.sqrt(self).floor).find { |i| self % i == 0 }
end
end
######### start here: How many primes in 2 to 50000?
start_number, max_number = 2, 50000
time = Time.now.to_f
total = (start_number..max_number).inject(0) { |s,i| s + (i.prime? ? 1 : 0) }
time = Time.now.to_f - time
puts "There are #{total} primes between #{start_number} and #{max_number}"
puts "Time taken %.3f seconds" % time
Now do the same in the other two languages...
--
Florian Frank
For correctness, perhaps even:
> class Integer
> def prime?
return false if self < 2
> not (2..Math.sqrt(self).floor).find { |i| self % i == 0 }
> end
> end
--
Florian Frank
> def prime?
> not (2..Math.sqrt(self).floor).find { |i| self % i == 0 }
> end
> end
Have you tried skipping even numbers (excluding 2 of course)?
That should give you a little more speed.
--
Jim Freeze
>Have you tried skipping even numbers (excluding 2 of course)?
>
>
Not me, but some guy did 2500 years ago.
>That should give you a little more speed.
>
It depends on how far you take your idea. It could get really slow on a
real computer, if you do it with big numbers.
--
Florian Frank
In my informal tests, Florian's version is already about 70 times faster
than the original one. It is also about 3.5 times faster than the naive
Java version.
Good job Florian!
Also, if the algorithm cannot be optimized in this way, the slow parts can
be coded in C (which is pretty easy with RubyInline.)
When you consider that coding in Java is like having your hands tied
compared to Ruby, I'd take the "slow" Ruby any day (and I've been
programming Java since 1997...Ruby since 2001.) Just the fact that you
need a monster like Eclipse to productively program Java makes me a bit
hesitant to continue using it. But I don't want to start language
wars...Java has its uses, like any language.
Ryan
> Just new to Ruby since last week, running my same functional program on the windows XP(Pentium M1.5G), the Ruby version is 10 times slower than the Java version. The program is to find the prime numbers like 2, 3,5, 7, 11, 13... Are there setup issues? or it is normal?
> 1. Ruby result: 101 seconds
> 2. Java result:9.8 seconds
> 3. Perl result:62 seconds
if you just want a fast ruby version:
harp:~ > cat prime_inline.rb
def benchmark label, code
STDOUT.sync = true
puts "#{ '=' * 16 }\n#{ label }\n#{ '=' * 16 }"
fork do
GC.disable
a = Time::now.to_f
code.call
b = Time::now.to_f
puts " @ #{ b - a }"
end
Process::wait
end
prime_test = lambda{ (2 ... (2 ** 14)).each{|n| n.prime?} }
class Fixnum
def prime?
(2...self).each{|i| return false if self % i == 0}
return true
end
end
benchmark 'pure ruby', prime_test
require 'inline'
class Fixnum
inline do |builder|
builder.c_raw <<-src
static VALUE
is_prime_c (int argc, VALUE *argv, VALUE self) {
long i, n = FIX2LONG (self);
for (i = 2; i < n; i++) if (n % i == 0) return Qfalse;
return Qtrue;
}
src
end
alias prime? is_prime_c
end
benchmark 'inline c', prime_test
harp:~ > ruby prime_inline.rb
================
pure ruby
================
@ 14.9205560684204
================
inline c
================
@ 0.539831876754761
so that's about two orders of magnitude speed-up for less than 10 extra lines
of code - and you still get nice things like overflow safety.
hth.
-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple. My religion is kindness.
| --Tenzin Gyatso
===============================================================================
I was thinking a RubyInline example would be nice, thanks for providing it
Ara. Yet again this shows that Ruby (and the solid community behind it)
rocks!
Ryan
ruby inline is simply amazing - three cheers for ryan (the other one ;-) )
ciao.
Big frickin' deal! Come on, guys, you are all smarter than that.
Recode the Java version using the same algorithm and *then* compare.
It's not like Java won't benefit just as much from the algorithmic
improvements.
--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/>
> Now do the same in the other two languages...
Yes, tweaking Ruby to be algorithmically faster than the others results
in an unfair comparison.
However, as Florian already pointed out, the Ruby version of the
algorithm was different from the Java and Perl ones from the start. (Or
so he wrote...I don't know enough of the others to know for sure.)
class Integer
def is_prime?
return false if self < 2
return self == 2 if self % 2 == 0   # 2 is the only even prime
max_factor = Math.sqrt(self).to_i
3.step(max_factor, 2) {|i| return false if self % i == 0 }
return true
end
end
And here's a benchmark comparison:
require 'benchmark'
Benchmark.bm(12) do |x|
x.report("original:") { 2.upto(50000) {|i| is_prime(i) } }
x.report("Florian's:") { 2.upto(50000) {|i| i.prime? } }
x.report("Ken's:") { 2.upto(50000) {|i| i.is_prime? } }
end
user system total real
original: 171.767000 0.020000 171.787000 (174.627000)
Florian's: 2.804000 0.010000 2.814000 ( 2.814000)
Ken's: 1.041000 0.000000 1.041000 ( 1.042000)
What's more important, how fast a program _runs_, or how fast I can
_write_ it? With Ruby, I can almost always write something that is
"fast enough", and I can write it a whole lot quicker than in Java or
C. The above (semi-) optimized version only took 2 minutes to write.
(And as a side note... it was more fun to write it in Ruby :) )
And when you need some extra umph... there's always YARV or ruby2c...
and when you really need it (rarely), manually optimized C.
Cheers,
Ken
Yes, that is why I said the naive version. I'm sure the optimized Java
version would be faster than the Ruby one.
But I really think we are getting off track here. For one thing, Java was
really slow in the old days and people made the same comparisons between
it and C that we are making between it and Ruby. But people still used
Java because it was easier to use and more powerful than C. Plus after
years and years of R&D and tons of money, Sun (and other companies) made
some pretty fast implementations (plus computers are really, really fast
these days.)
If some company threw a couple million dollars at the problem I'm sure
Ruby could be made quite a bit faster. But as has been said many times
before, for most problems Ruby is more than fast enough. Plus we do have
some good work being done with YARV, which is essentially a one man
project!
If you prefer Java to Ruby for its performance and whatever other factors,
that is cool. But I fail to see the point in constantly making performance
comparisons, especially from people who seem to want to program in Ruby.
If the performance of Ruby is that much of an issue, I'd suggest putting
your money where your mouth is and helping with YARV or trying to make
your own Ruby VM. It isn't exactly child's play.
Ryan
I definitely would not be here if I preferred Java.
> But I fail to see the point in constantly making performance
> comparisons, especially from people who seem to want to program in Ruby.
Well, I think it's valid to make the comparisons every once in a while,
since it highlights a weakness that Ruby implementations will have a
hard time overcoming on their own. When you lag behind Perl by a factor
of two on *everything*, then you have a real problem.
> If the performance of Ruby is that much of an issue, I'd suggest putting
> your money where your mouth is and helping with YARV or trying to make
> your own Ruby VM. It isn't exactly child's play.
Would throwing money at the problem help YARV? I'm serious.
I'll second that. I'd be more than willing to make a monetary donation
if it would help speed/improve YARV development.
--
Regards,
John Wilger
-----------
Alice came to a fork in the road. "Which road do I take?" she asked.
"Where do you want to go?" responded the Cheshire cat.
"I don't know," Alice answered.
"Then," said the cat, "it doesn't matter."
- Lewis Carrol, Alice in Wonderland
Sounds like a good candidate to add to the RubyCentral projects list
[1]. Something similar to the codefest grants (which seem to have
worked fairly well), perhaps on a larger scale.
[1] http://www.rubycentral.org/index.rb?dest=projects&css=base
> > --
> > Regards,
> > John Wilger
> >
>
> +1
>
> -Ezra Zygmuntowicz
> Yakima Herald-Republic
> WebMaster
--
Bill Guindon (aka aGorilla)
Refer to the simple programs on The Great Computer Language Shootout -
then you can blame those guys for doing meaningless benchmarking rather
than taking the heat yourself :-)
http://shootout.alioth.debian.org/great/benchmark.php?test=all&lang=ruby&lang2=java&sort=fullcpu
> Well, I think it's valid to make the comparisons every once in a
> while, since it highlights a weakness that Ruby implementations will
> have a hard time overcoming on their own. When you lag behind Perl by
> a factor of two on *everything*, then you have a real problem.
Not in everything, only in the stupid "comparisons" and "benchmarks" the
trolls keep posting here. Compare the Ruby version
(flori@lambda:~ 0)$ time ruby -e 'class A; def initialize(a,b) @a, @b =
a, b end ; attr_accessor :a, :b; end ; 1_000_000.times { o = A.new(1,
2); o.a; o.b; o.a = 2 ; o.b = 2 }'
real 0m3.838s
user 0m3.822s
sys 0m0.009s
to the wonderful Perl version here:
(flori@lambda:~ 0)$ time perl -e 'package A; sub new { my $p = $_[0]; my
$c = ref $p || $p; bless { "a" => $_[1], "b" => $_[2] } => $c } sub a {
@_ == 1 ? $_[0]->{a} : ($_[0]->{a} = $_[1]) } sub b { @_ == 1 ?
$_[0]->{b} : ($_[0]->{b} = $_[1]) }; package main; for (1..1_000_000) {
my $o = A->new(1, 2); $o->a ; $o->b; $o->a(2); $o->b(3) }'
real 0m9.420s
user 0m9.420s
sys 0m0.002s
And this doesn't include the quality time you can spend debugging Perl,
while tracking down cyclic references in your object graph, or finding
missing reference counter decreases in your extensions.
--
Florian Frank
> user system total real
> original: 171.767000 0.020000 171.787000 (174.627000)
> Florian's: 2.804000 0.010000 2.814000 ( 2.814000)
> Ken's: 1.041000 0.000000 1.041000 ( 1.042000)
That's better than I expected. I'm not sure why Florian thought
it would be slower.
--
Jim Freeze
Indeed it will. However, I think the point was that with just a little
bit of thought the Ruby version can outperform a 'first cut' Java
implementation. If the OP considered the original Java version 'fast
enough' and considered the original Ruby version to be unacceptably slow,
well a little rethinking of the algorithm and the Ruby version ends up
being faster than the Java version which was 'fast enough'. If he really
needs the Java version to be faster, well, now he can go back and recode it.
Phil
>That's better than I expected. I'm not sure why Florian thought
>it would be slower.
>
>
If you compute a sieve to find out, if a very high number is prime...
--
Florian Frank
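For reference, a minimal sketch of a sieve of Eratosthenes in Ruby (the method name `sieve` is hypothetical). It is fast for finding all primes up to a limit, but it needs one array slot per number up to that limit. That memory cost is presumably the problem Florian has in mind when the goal is to test a single very large number.

```ruby
# Sieve of Eratosthenes: cross out the multiples of each prime.
def sieve(limit)
  is_prime = Array.new(limit + 1, true)
  is_prime[0] = is_prime[1] = false
  i = 2
  while i * i <= limit
    if is_prime[i]
      # multiples below i*i were already crossed out by smaller primes
      (i * i).step(limit, i) { |m| is_prime[m] = false }
    end
    i += 1
  end
  (0..limit).select { |n| is_prime[n] }
end

primes = sieve(50000)
puts "There are #{primes.length} primes between 2 and 50000"
```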
For theoretical description see:
http://www.ams.org/mcom/2004-73-246/S0025-5718-03-01501-1/S0025-5718-03-01501-1.pdf
And for a sample implementation in C
http://cr.yp.to/primegen.html
cheers,
zsombor
--
http://deezsombor.blogspot.com
Ruby about half the speed of Perl is about right, judging from my own
comparisons.
Personally, I consider it a good tradeoff to have code half the speed,
if it means I never have to write in Perl again.
mathew
It implements the Miller-Rabin-Test.
See http://www.hmug.org/man/3/BN_is_prime.php
Dee Zsombor wrote:
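To illustrate the Miller-Rabin test itself, here is a minimal pure-Ruby sketch. It is not the code of the library linked above. The method name is hypothetical, and the fixed bases 2, 3, 5, 7 make the result deterministic only for n below 3,215,031,751. `Integer#pow` with a modulus needs Ruby 2.5 or later.

```ruby
# Miller-Rabin with fixed bases; deterministic for n < 3_215_031_751.
def miller_rabin_prime?(n)
  return false if n < 2
  bases = [2, 3, 5, 7]
  bases.each do |q|
    return true if n == q
    return false if (n % q).zero?
  end
  # write n - 1 as d * 2**s with d odd
  d = n - 1
  s = 0
  while d.even?
    d /= 2
    s += 1
  end
  bases.all? do |a|
    x = a.pow(d, n)                     # modular exponentiation
    next true if x == 1 || x == n - 1
    # square up to s - 1 times, looking for n - 1
    (s - 1).times.any? do
      x = x * x % n
      x == n - 1
    end
  end
end

total = (2..50000).count { |n| miller_rabin_prime?(n) }
puts "There are #{total} primes between 2 and 50000"
```

Unlike trial division, the cost per number grows with the number of digits rather than with the square root of the value, which is why this kind of test is used for very large candidates.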
Okay, here's the Perl version:
#!/usr/bin/perl -w
use Math::Big 'primes';
my $max = 50000;
my $start = time();
my $total = primes($max);
my $time = time() - $start;
print "There are $total primes below $max.\n";
print "Time taken was $time\n";
On my system I get 27 vs. 192 for the original Perl version (which is
not written very perlishly). Sure, this is cheating. CPAN is like
institutionalized cheating, and I love it. :-) In fact, one of the
reasons I started lurking in the Ruby group is that I think Ruby (with
Gems) is closer to developing a CPAN-like system than Python, and thus
I have decided to learn Ruby instead of Python.
- Mark.
Mark
> Just new to Ruby since last week, running my same functional
> program on the windows XP(Pentium M1.5G), the Ruby version is 10
> times slower than the Java version. The program is to find the
> prime numbers like 2, 3,5, 7, 11, 13... Are there setup issues? or
> it is normal?
> 1. Ruby result: 101 seconds
> 2. Java result:9.8 seconds
> 3. Perl result:62 seconds
With some very minor modifications to the original code (I did upto
instead of for and wrapped is_prime in a class), none of which were
algorithmic improvements (ugh):
Modified is_prime code:
class Primer
def is_prime(number)
2.upto(number-1) do |i|
return false if number % i == 0
end
return true
end
end
NORMAL:
% rm -rf ~/.ruby_inline/; ruby primes.rb 50000
There are 5133 primes between 2 and 50000
Time taken 237.315160036087 seconds
(whoa. my laptop is a slowpoke! oh yeah, I'm on battery!)
ONE EXTRA CMD-LINE TOKEN:
% rm -rf ~/.ruby_inline/; ruby -rzenoptimize primes.rb 50000
*** Optimizer threshold tripped!! Optimizing Primer.is_prime
There are 5133 primes between 2 and 50000
Time taken 2.81669783592224 seconds
That said... what did we learn from this thread?
Yes... benchmarks are dumb (see also, 3 lines down)
That there is no accounting for taste (ugliest code/
algorithm ever).
A good algorithm goes a long way.
2/3rds of the ppl are going to mis-analyze anyhow.
.. Nothing really.
--
ryand...@zenspider.com - Seattle.rb - http://www.zenspider.com/
seattle.rb
http://blog.zenspider.com/ - http://rubyforge.org/projects/ruby2c
It's perhaps worth mentioning that although Ruby is about 1/3rd of the size
of Perl, it has a far more complete standard library.
A base Ruby install includes, amongst other things: SSL, base64 encoding/
decoding, MD5/SHA1 hashing, XML/YAML parsing, HTTP client and server, and
remote method calls (DRb, SOAP, XMLRPC).
I think recent versions of Perl have accumulated MIME::Base64, but the others
are still extensions you have to download and install.
What we learn depends on many things, including
- initial expectations, are we even in the right ball-park?
- openmindedness, are we willing to hear other people's interpretations?
- ...
So if we initially expected a little Ruby program to have the same
performance characteristics as a roughly equivalent Java program, we
learned quite a lot.
(We don't all know the same things.)
At Wed, 22 Jun 2005 04:55:58 +0900, Michael Tan wrote:
> 1. Ruby result: 101 seconds
> 2. Java result:9.8 seconds
> 3. Perl result:62 seconds
My Ruby implementation of a sieve fishing for primes takes 4.8 seconds
or so on an AMD K6 running at 350 MHz (128 MB, Aurox Linux) while the
original program takes about 1362 seconds. Rule of thumb: Improving
algorithm good, improving hardware or language bad (to use the famous
animal farm style :-)
time=Time.now.to_f
p, f = [ 2 ], 2
3.step(50000, 2) do |i|
r = Math.sqrt(i).to_i
p.each { |f| break if (i%f).zero? or f > r}
p.push(i) if (i%f).nonzero?
end
puts p
puts p.length
puts Time.now.to_f - time
What does this program do? First it assumes that 2 is the only even
prime so after initializing the list of primes to 2 it can restrict
the search to odd numbers. Iterating over the candidates for primes it
first computes the square root (to improve efficiency of comparisons
the value is converted into an integer). Iterating over all primes
already collected the program tries to find a prime factor of the
candidate for prime - i.e. a number where the remainder after division
by the prime in question is zero. If such a factor is found the loop
searching for a prime factor is aborted. The same happens if the
factor in question exceeds the pre-computed square root. Just after
the checking loop the program looks if the most recently checked
potential prime factor is no prime factor - i.e. division of the
candidate for prime by the potential factor yields nonzero remainder.
If that is the case a new prime has been found and the candidate for
prime is added to the list of primes.
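One caveat about the Ruby listing above: it relies on Ruby 1.8 behaviour, where the block parameter `f` overwrites the outer local `f`. Since Ruby 1.9, block parameters are local to the block, so the outer `f` would stay at 2 and every odd number would be pushed. A sketch of the same algorithm for current Ruby follows. It uses `find` to return the stored prime that decided the test, and the variable name `primes` is mine.

```ruby
primes = [2]
3.step(50000, 2) do |i|
  r = Math.sqrt(i).to_i
  # first stored prime that either divides i or exceeds sqrt(i)
  f = primes.find { |q| (i % q).zero? || q > r }
  # i is prime when that deciding value is not a factor of i
  primes.push(i) if f.nil? || (i % f).nonzero?
end
puts primes.length
```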
Josef 'Jupp' SCHUGT
--
Preposition: Microsoft uses Power PC CPUs for their Xbox while Apple
uses Intel CPUs for their Mac.
Theorem: Hell has been invaded by flyng pigs, then frozen.
Proof: Uncertainty drive manual, appendix A, 42nd edition or later.
At Fri, 24 Jun 2005 00:02:42 +0200, Josef 'Jupp' SCHUGT wrote:
> time=Time.now.to_i
> p, f = [ 2 ], 2
>
> 3.step(50000, 2) do |i|
> r = Math.sqrt(i).to_i
> p.each { |f| break if (i%f).zero? or f > r}
> p.push(i) if (i%f).nonzero?
> end
> puts p
> puts p.length
> puts Time.now.to_i - time
A C implementation follows:
#include <stdio.h>
#include <math.h>
int main(void) {
unsigned p[25000], f, r, idx = 1;
int i, j;
p[0] = 2;
for (i = 3; i <= 50000; i += 2) {
r = sqrt(i);
for (f = p[j = 0]; j < idx; f = p[++j]) {
if (!(i%f) || f > r) break;
}
if (i%f) p[idx++] = i;
}
for (f = p[i = 0]; i < idx; i++) printf("%u\n", p[i]);
printf("%u\n", idx);
}
Runtime (this time estimated using 'time' command):
real: 0.113s
user: 0.064s
sys: 0.009s
I should add that I of course redirect the output to a file, not to
stdout because otherwise I would essentially measure the terminal's
scroll speed.
The C program is almost a 1:1 equivalent of the Ruby one. I know that
I am wasting memory using "unsigned p[25000]" but I wanted to avoid
the dynamic memory allocation overhead while assuming not to know the
actual number of primes. Obviously 25000 is the upper limit for the
number of primes up to 50000.
Note that the C program could be optimized further (using pointer
arithmetic) but the speedup would be the difference between "fast as hell" and
"ridiculously fast" - pure nonsense.
What does one learn from the speedup? That using prior knowledge can
tremendously improve speed.
In this case the algorithm knowledge is that one need not check if a
number can be divided by any number smaller than it but that it is
sufficient to check divisibility by all primes smaller than its square
root (and that by definition besides 2 no even number can be prime).
The fact knowledge is the list of all primes smaller than the present
candidate for prime.
Note that it is not a must to collect all primes! To find all primes
up to 50000 one only needs to store the primes up to 223 - the
integer part of square root of 50000. I store all of them because in
this case memory is not a problem and adding checks would slow down
the programs.
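A minimal sketch of that last point, assuming nothing beyond what is described above: only the primes up to 223 are stored, and every candidate up to 50000 is tested against them. Note that 223 itself is needed, since 223 * 223 = 49729 is below 50000. The names `small` and `count` are hypothetical.

```ruby
limit = 50000
root  = Math.sqrt(limit).to_i            # integer part of the square root

# Store only the primes needed as trial divisors.
small = (2..root).select do |n|
  (2..Math.sqrt(n).to_i).none? { |d| (n % d).zero? }
end

# A candidate is prime if no stored prime up to its square root divides it.
count = (2..limit).count do |i|
  small.none? { |q| q * q <= i && (i % q).zero? }
end
puts "#{small.length} stored primes, #{count} primes found up to #{limit}"
```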
I'd like to have a daemon process that watches for new files in a
specific directory and then runs a command on them once they're there.
It seems like a problem someone would have solved before but I haven't
been able to dig up anything about it on the web yet (perhaps I'm not
phrasing it correctly). Just thought I'd bounce a question here before
I started coding it myself.
Any thoughts?
thanks,
Keith
There's Ara Howard's "dirwatch". For Win32, there's win32-changejournal.
I don't know if dirwatch works on Win32.
Regards,
Dan
Take a look at "Daedalus"; it's part of the FreeBSD Sysutils
(http://www.freebsd.org/es/ports/sysutils.html). For some help on how
to configure and use it, see
http://manuals.textdrive.com/read/chapter/61#page147.
Basically, it runs in the background and will execute any system
commands you want (which in your case might be another Ruby script
which can detect if a file has been added to the dir). It will
respond by executing another script of your choosing.
Matt
nope - though it could be made to pretty easily. basically i wrote it because
of the lack of a changejournal type functionality for *nix filesystems in
general. plus dirwatch is really designed to setup a processing system which
runs external programs on files as they arrive in directories vs. running a
ruby block or some such.
cheers.
-a
A very simple one is here:
http://www.ntecs.de/viewcvs/viewcvs/Utils/file_change_notify.rb?rev=232&view=auto
Regards,
Michael
Simple implementation I did:
http://phrogz.net/RubyLibs/rdoc/classes/Dir/DirectoryWatcher.html
> I'd like to have a daemon process that watches for new files in a
> specific directory and then runs a command on them once they're there.
If you are on Linux, it would be trivial to wrap something around
/dev/inotify
From /usr/src/linux/Documentation/filesystems/inotify.txt
or..
http://www.ibiblio.org/peanut/Kernel-2.6.12/filesystems/inotify.txt
inotify
a powerful yet simple file change notification system
Document started 15 Mar 2005 by Robert Love <r...@novell.com>
(i) User Interface
Inotify is controlled by a device node, /dev/inotify. If you do not use udev,
this device may need to be created manually. First step, open it
int dev_fd = open ("/dev/inotify", O_RDONLY);
Change events are managed by "watches". A watch is an (object,mask) pair where
the object is a file or directory and the mask is a bitmask of one or more
inotify events that the application wishes to receive. See <linux/inotify.h>
for valid events. A watch is referenced by a watch descriptor, or wd.
Watches are added via a file descriptor.
Watches on a directory will return events on any files inside of the directory.
Adding a watch is simple,
/* 'wd' represents the watch on fd with mask */
struct inotify_request req = { fd, mask };
int wd = ioctl (dev_fd, INOTIFY_WATCH, &req);
You can add a large number of files via something like
for each file to watch {
struct inotify_request req;
int file_fd;
file_fd = open (file, O_RDONLY);
if (file_fd < 0) {
perror ("open");
break;
}
req.fd = file_fd;
req.mask = mask;
wd = ioctl (dev_fd, INOTIFY_WATCH, &req);
close (file_fd);
}
John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john....@tait.co.nz
New Zealand
Carter's Clarification of Murphy's Law.
"Things only ever go right so that they may go more spectacularly wrong later."
From this principle, all of life and physics may be deduced.
I don't know the internals nor the api for dirwatch, but could you
explain where the difference would be ?
well, dirwatch is an application vs. an api. so you don't have something
like
open('directory').on('created') do |file|
puts "#{ file } created"
end
or however you might imagine an api for watching directory events...
with dirwatch, which is a command line tool, you'd do something like this to
setup a watch
~ > dirwatch some_directory create
this initializes an sqlite database, config files, log files, generates sample
scripts, etc. all this will end up in ./some_directory/.dirwatch/. example:
jib:~ > mkdir some_directory
jib:~ > dirwatch some_directory/ create
---
/home/ahoward/some_directory:
dirwatch_dir : /home/ahoward/some_directory/.dirwatch
db : /home/ahoward/some_directory/.dirwatch/db
logs_dir : /home/ahoward/some_directory/.dirwatch/logs
config : /home/ahoward/some_directory/.dirwatch/dirwatch.conf
commands_dir : /home/ahoward/some_directory/.dirwatch/commands
if we peeked in dirwatch.conf we'd see something like
...
...
...
actions:
updated :
-
command: simple.sh
type: simple
pattern: ^.*$
timing: sync
-
command: yaml.rb
type: simple
pattern: ^.*$
timing: sync
...
...
...
(did i mention i love yaml? ;-) )
the 'actions' section is where you setup what to do on certain events. the
possible events are 'created', 'modified', 'deleted', or 'existing' (all of
which are pretty obvious) and the action 'updated' which is the union of
'created' or 'modified'. so this config is saying that, whenever a file is
updated we'll run two commands 'simple.sh' and 'yaml.rb'. note that a list of
commands can be specified - they will be run in that order. the list of
commands themselves are configured with a few parameters
command:
the command to run. the .dirwatch/commands_dir/ is pre-pended to PATH
when running commands so it's convenient to put them there. the
example/auto-generated commands are in that directory.
type:
this is the calling convention. for example simple commands are called
like
simple.sh file_that_was_updated mtime_of_that_file
and is called once for each file. yaml commands are called like
yaml.rb < (list of __every__ updated file and its mtime on stdin in yaml format)
there are two other types but essentially you just have a choice - your
script is run once with every file or it gets all the files at once on
stdin.
pattern:
only files matching this regex will get passed to this command. dirwatch
itself has a --pattern option which causes it to see only files matching
that pattern but that affects everything. this is on a per command basis.
so you might see
updated :
-
command: gif2png
type: simple
pattern: ^.*\.gif$
timing: sync
-
command: png2ps
type: simple
pattern: ^.*\.png$
timing: sync
timing:
whether we wait for each command to finish or just spawn in the background
and collect exit_status later. this is extremely dangerous on systems
that could update 1,000,000 files at once.
next you'd simply start dirwatch using
jib:~ > dirwatch some_directory/ watch
I, [2005-07-21T09:04:48.668571 #27750] INFO -- : ** STARTED **
I, [2005-07-21T09:04:48.669050 #27750] INFO -- : config </home/ahoward/some_directory/.dirwatch/dirwatch.conf>
I, [2005-07-21T09:04:48.669252 #27750] INFO -- : flat <false>
I, [2005-07-21T09:04:48.669324 #27750] INFO -- : files_only <false>
I, [2005-07-21T09:04:48.682278 #27750] INFO -- : no_follow <false>
I, [2005-07-21T09:04:48.682358 #27750] INFO -- : pattern <>
I, [2005-07-21T09:04:48.682461 #27750] INFO -- : n_loops <>
I, [2005-07-21T09:04:48.682629 #27750] INFO -- : interval <00:05:00>
I, [2005-07-21T09:04:48.683028 #27750] INFO -- : lockfile </home/ahoward/some_directory/.dirwatch.lock>
I, [2005-07-21T09:04:48.683147 #27750] INFO -- : tmpwatch[all] <false>
I, [2005-07-21T09:04:48.683213 #27750] INFO -- : tmpwatch[nodirs] <false>
I, [2005-07-21T09:04:48.683278 #27750] INFO -- : tmpwatch[force] <true>
I, [2005-07-21T09:04:48.683454 #27750] INFO -- : tmpwatch[age] <30 days> == <2592000.0s>
I, [2005-07-21T09:04:48.683530 #27750] INFO -- : tmpwatch[rm] <rm_rf>
...
...
...
now, if i dropped a file into some_directory/ in another terminal:
jib:~/some_directory > touch a
i'd see this in the terminal running dirwatch
I, [2005-07-21T09:06:13.721967 #27839] INFO -- : ACTION.UPDATED.0.0 - cmd : simple.sh '/home/ahoward/some_directory/a' '2005-07-21 15:05:38.000000'
I, [2005-07-21T09:06:13.795296 #27839] INFO -- : ACTION.UPDATED.0.0 - exit_status : 0
the 'ACTION.UPDATED.0.0' is a unique tag that makes finding the exit_status easy
in the event that the command was run 'async' and its exit_status ends up in
the log 4000 lines later...
when running from the console like this the stdout of the command run shows
too, so i also saw this - the output of running simple.sh - in the terminal
running dirwatch:
dirwatch_dir: </home/ahoward/some_directory>
dirwatch_action: <updated>
dirwatch_type: <simple>
dirwatch_n_paths: <1>
dirwatch_path_idx: <0>
dirwatch_path: </home/ahoward/some_directory/a>
dirwatch_mtime: <2005-07-21 15:05:38.000000>
dirwatch_pid: <27839>
dirwatch_id: <ACTION.UPDATED.0.0>
command_line: </home/ahoward/some_directory/a 2005-07-21 15:05:38.000000>
path: </home/ahoward/some_directory/a>
mtime: <2005-07-21 15:05:38.000000>
simple.sh basically just prints its environment and the argv it was called
with, here's the whole script:
jib:~/some_directory > cat .dirwatch/commands/simple.sh
#!/bin/sh
echo "dirwatch_dir: <$DIRWATCH_DIR>"
echo "dirwatch_action: <$DIRWATCH_ACTION>"
echo "dirwatch_type: <$DIRWATCH_TYPE>"
echo "dirwatch_n_paths: <$DIRWATCH_N_PATHS>"
echo "dirwatch_path_idx: <$DIRWATCH_PATH_IDX>"
echo "dirwatch_path: <$DIRWATCH_PATH>"
echo "dirwatch_mtime: <$DIRWATCH_MTIME>"
echo "dirwatch_pid: <$DIRWATCH_PID>"
echo "dirwatch_id: <$DIRWATCH_ID>"
echo "command_line: <$@>"
path=$1
mtime=$2
echo "path: <$path>"
echo "mtime: <$mtime>"
you'll notice quite a bit of information is passed via the environment and
that the mtime is also passed in on the command line. typical programs won't
use all this - but it's there. 'dirwatch --help' explains the meaning of
these environment variables.
so, normally you don't run like that (from the console) and instead have
something like this in your crontab to maintain an 'immortal' daemon
*/15 * * * * dirwatch /home/ahoward/some_directory watch --daemon
this does NOT start a daemon every fifteen minutes. the daemon always sets up
a lockfile and refuses to start if one is already running. so, this just
makes sure exactly one daemon is running at all times - even after machine
reboots or if some bug causes dirwatch to crash. this may seem a bit odd but
those of you that don't have root on all your boxes in the office will
understand why it can work like that - you can setup robust daemons without
any special privileges. of course you can start it from init.d and it
supports 'start', 'stop', and 'restart' arguments too so this is trivial.
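the post does not show how dirwatch implements its lockfile, so the following is only a sketch of one common way to get the "refuse to start if one is already running" behaviour in ruby: a non-blocking exclusive flock. the method name acquire_lock and the lockfile name are hypothetical.

```ruby
require 'tmpdir'

# Returns the open lock file if the lock was obtained, or nil if some
# other holder already has it. The lock is released when the file is
# closed or the process exits, so a crashed daemon never blocks a restart.
def acquire_lock(path)
  file = File.open(path, File::RDWR | File::CREAT, 0644)
  if file.flock(File::LOCK_EX | File::LOCK_NB)
    file
  else
    file.close
    nil
  end
end

lock = acquire_lock(File.join(Dir.tmpdir, "example_daemon.lock"))  # hypothetical name
if lock
  puts "lock acquired, daemon may start"
else
  puts "already running, exiting"
end
```

with a check like this at startup, a cron entry that fires every fifteen minutes is harmless: every invocation after the first finds the lock taken and exits.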
so that's it basically. dirwatch simply scans a directory, compares what it
finds to what's in its database (sqlite), runs appropriate actions in the
way you've configured it to, and then sleeps for a while. it never stops,
automatically rolls logs, and does some other stuff too. there's a whole lot
of options like recursing into subdirectories, ignoring anything that's not a
file, a tmpwatch like facility built-in, etc. but you can read about that
with --help.
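the scan / compare / run / sleep cycle might be paced roughly like this. it's a sketch under one reading of the --help text (the interval is a minimum, and if the work takes longer than the interval the next scan starts right away); pause_after_scan and watch_loop are made-up names and the block stands in for the real scanning work.

```ruby
# sketch of the pacing rule: the interval is a minimum, not a fixed period.
# both arguments are in seconds; the names are made up for illustration.
def pause_after_scan(interval, elapsed)
  remaining = interval - elapsed
  remaining > 0 ? remaining : 0
end

# hypothetical main loop shape; the block stands in for scan + compare + run
def watch_loop(interval, n_loops)
  n_loops.times do
    started = Time.now
    yield
    sleep pause_after_scan(interval, Time.now - started)
  end
end
```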
cheers.
btw. i inlined the output of --help below. note that i just did a massive
re-write so some of this is a little off, but it's close.
-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| My religion is very simple. My religion is kindness.
| --Tenzin Gyatso
===============================================================================
NAME
dirwatch v0.9.0
SYNOPSIS
dirwatch [ options ]+ mode [ directory = ./ ]
DESCRIPTION
dirwatch is a tool used to rapidly build processing systems from file system
events.
dirwatch manages an sqlite database that mirrors the state of a directory and
then triggers user definable event handlers for certain filesystem activities
such as file creation, modification, deletion, etc. dirwatch can also implement
a tmpwatch like behaviour to ensure files of a certain age are removed from
the directory being watched. dirwatch normally runs as a daemon process by
first synchronizing the database inventory with that of the directory and then
firing appropriate triggers as they occur.
-----------------------------------------------------------------------------
the following actions may have triggers configured for them
-----------------------------------------------------------------------------
created -> a file was detected that was not already in the database
modified -> a file in the database was detected as being modified
updated -> a file was created or modified (union of these two actions)
deleted -> a file in the database is no longer in the directory
existing -> a file in the database still exists in the directory and has not
been modified
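as a rough illustration of how these five actions relate, here's a sketch that compares two snapshots, each a hash of path => mtime. it assumes a modification shows up as a changed mtime, which is a guess at the mechanism, and classify is a made-up name, not part of dirwatch.

```ruby
# sketch: classify paths by comparing two snapshots, each a hash of
# path => mtime. 'previous' plays the database, 'current' the directory.
def classify(previous, current)
  created  = current.keys - previous.keys
  deleted  = previous.keys - current.keys
  common   = current.keys & previous.keys
  modified = common.select { |path| current[path] != previous[path] }
  existing = common - modified
  {
    'created'  => created,
    'modified' => modified,
    'updated'  => created + modified, # union of created and modified
    'deleted'  => deleted,
    'existing' => existing
  }
end
```

note that every path lands in exactly one of created, modified, deleted, or existing, while updated is only a convenience grouping.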
-----------------------------------------------------------------------------
the command line 'mode' must be one of the following
-----------------------------------------------------------------------------
create (c) -> initialize the database and supporting files
watch (w) -> monitor directory and trigger actions in the foreground
start (S) -> spawn a daemon watcher in the background
restart (R) -> (re)spawn a daemon watcher in the background
stop (H) -> stop/halt any currently running watcher
status (T) -> determine if any watcher is currently running
truncate (D) -> truncate/delete all entries from the database
archive (a) -> create a tar.gz archive of a watch's directory contents
list (l) -> dump database to stdout in silky smooth yaml format
for all modes the command line argument must be the name of the directory to
which to apply the operation - which defaults to the current directory.
-----------------------------------------------------------------------------
mode: create (c)
-----------------------------------------------------------------------------
initializes a storage directory with all required database files, logs,
command directories, sample configuration, sample programs, etc.
by default the storage dir will be created as a subdirectory of the directory
specified by the 'directory' command line argument, eg:
directory/.dirwatch/
the --dirwatch_dir option can be used to specify an alternate location. this
is particularly important to use if you, for instance, have an external
program like tmpwatch running which might delete this directory!
when a dirwatch storage directory is created a few files and directories are
created underneath it. the hierarchy is
directory/.dirwatch/
commands/
logs/
db
dirwatch.conf
dirwatch.pid
where
commands/ -> any programs placed here will be automatically found as
this location is added to PATH
logs/ -> logs are kept here and are auto-rolled so no scrubbing is needed
db -> this is an sqlite database file
dirwatch.conf -> a yaml configuration file used to configure which commands
to trigger for which actions
dirwatch.pid -> a file containing the pid of the daemon process
examples:
0) initialize the directory incoming_data/ to be dirwatched using all
defaults
~ > dirwatch create incoming_data/
1) initialize the directory incoming_data/ to be dirwatched storing all
metadata in /usr/local/dirwatch/incoming_data
~ > dirwatch create incoming_data/ --dirwatch_dir=/usr/local/dirwatch/incoming_data/
-----------------------------------------------------------------------------
mode: start (S)
-----------------------------------------------------------------------------
dirwatch is normally run in daemon mode. the start mode is equivalent to
running in 'watch' mode with the '--daemon' and '--quiet' flags.
examples:
~ > dirwatch start incoming_data/
-----------------------------------------------------------------------------
mode: restart (R)
-----------------------------------------------------------------------------
'restart' mode checks a watcher's pidfile and either restarts the currently
running watcher or starts a new one as in 'start' mode. this is equivalent to
sending SIGHUP to the watcher daemon process.
examples:
~ > dirwatch restart incoming_data/
-----------------------------------------------------------------------------
mode: stop (H)
-----------------------------------------------------------------------------
'stop' mode checks for any process watching the specified directory and kills
this process if it exists. this is equivalent to sending TERM to the watcher
daemon process. the process will not exit immediately but will do at the
first possible safe opportunity. do not kill -9 the daemon process.
examples:
~ > dirwatch stop incoming_data/
-----------------------------------------------------------------------------
mode: status (T)
-----------------------------------------------------------------------------
'status' mode reports whether or not a watcher is running for the given
directory.
examples:
~ > dirwatch status incoming_data/
-----------------------------------------------------------------------------
mode: truncate (D)
-----------------------------------------------------------------------------
'truncate' (delete) mode atomically empties the database of all state.
examples:
~ > dirwatch truncate incoming_data/
-----------------------------------------------------------------------------
mode: archive (a)
-----------------------------------------------------------------------------
archive mode is used to atomically create a tgz file of the storage
directory for a given directory while respecting the locking subsystem.
examples:
~ > dirwatch archive incoming_data/
essentially this is useful for making hot backups. your system must have the
tar command for this to operate.
-----------------------------------------------------------------------------
mode: watch (w)
-----------------------------------------------------------------------------
this is the biggie.
dirwatch is designed to run as a daemon, updating the database inventory at
the interval specified by the '--interval' option (5 minutes by default) and
firing appropriate trigger commands. two watchers may not watch the same
dir simultaneously and attempting to start a second watcher will fail when
the second watcher is unable to obtain the pid lockfile. it is a non-fatal
error to attempt to start another watcher when one is running and this failure
can be made silent by using the '--quiet' option. the reason for this is to
allow a crontab entry to be used to make the daemon 'immortal'. for example,
the following crontab entry
*/15 * * * * dirwatch directory --daemon --dbdir=0 \
--files_only --flat \
--interval=10minutes --quiet
or (same but shorter)
*/15 * * * * dirwatch directory -D -d0 -f -F -i10m -q
will __attempt__ to start a daemon watching 'directory' every fifteen minutes.
if the daemon is not already running one will be started, otherwise dirwatch will
simply fail silently (no cron email sent due to stderr).
this feature allows a normal user to set up daemon processes that will not only
run after a machine reboot, but will also be restarted after a crash or other
terminal program behaviour.
the meaning of the options in the above crontab entry are as follows
--daemon -> become a child of init and run forever
--dbdir -> the storage directory, here the default is specified
--files_only -> inventory files only (default is files and directories)
--flat -> do not recurse into subdirectories (default recurses)
--interval -> generate inventory, at minimum, every 10 minutes
--quiet -> be quiet when failing due to another daemon already watching
as the watcher runs and maintains the inventory it is noted when
files/directories (entries) have been created, modified, updated, deleted, or
are existing. these entries are then handled by user definable triggers as
specified in the config file. the config file is of the format
...
actions :
created :
commands :
...
updated :
commands :
...
...
...
where the commands to be run for each trigger type are enumerated. each
command entry is of the following format:
...
-
command : command to run
type : calling convention
pattern : filter files further by this pattern
timing : synchronous or asynchronous execution
...
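to make the nesting concrete, here's a sketch that loads a config of this shape and lists which command runs for which action. the keys follow the format above, but every value (command names, patterns, the spelling of the timing values) is invented for illustration and may not match what dirwatch actually accepts; the auto-generated sample config is the real reference.

```ruby
require 'yaml'

# hypothetical config: the keys follow the format above, every value is
# invented for illustration and may not match what dirwatch accepts.
SAMPLE_CONFIG = <<~YAML
  actions :
    created :
      commands :
        -
          command : simple.sh
          type    : simple
          pattern : txt
          timing  : synchronous
    updated :
      commands :
        -
          command : filter.sh
          type    : filter
          pattern : dat
          timing  : asynchronous
YAML

# flatten the config into [action, command hash] pairs
def commands_by_action(config)
  config.fetch('actions', {}).flat_map do |action, body|
    (body['commands'] || []).map { |cmd| [action, cmd] }
  end
end

config = YAML.safe_load(SAMPLE_CONFIG)
commands_by_action(config).each do |action, cmd|
  puts "#{action}: #{cmd['command']} (#{cmd['type']}, #{cmd['timing']})"
end
```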
the meaning of each field is as follows:
command: this is the program to run. the search path for the program is
determined dynamically by the action run. for instance, when a
file is discovered to be 'modified' the search path for the
command will be
dbdir/commands/modified/ + dbdir/commands/ + $PATH
this dynamic path setting simply allows for short pathnames if
commands are stored in the dbdir/commands/* subdirectories.
type: there are four types of commands. the type merely indicates the
calling convention of the program. when commands are run there
are two pieces of information which must be passed to the
program, the file in question and the mtime of that file. the
mtime is less important but programs may use it to know if the file
has been changed since they were spawned. mtime will probably be
ignored for most commands. the four types of commands fall into
two categories: those commands called once for each file and those
types of commands called once with __all__ files
each file:
simple: the command will be called with three arguments: the file
in question, the mtime date, and the mtime time. eg:
command foobar.txt 2002-11-04 01:01:01.1234
expanded: the command will have the strings '@file' and
'@mtime' replaced with appropriate values. eg:
command '@file' '@mtime'
expands to (and is called as)
command 'foobar.txt' '2002-11-04 01:01:01.1234'
all at once:
filter: the stdin of the program will be given a list where each
line contains three items, the file, the mtime date, and
the mtime time.
yaml: the stdin of the program will be given a list where each
entry contains two items, the file and the mtime. the
format of the list is valid yaml and the schema is an
array of hashes with the keys 'path' and 'mtime'.
pattern: all the files for a given action are filtered by this pattern,
and only those files matching pattern will have triggers fired.
timing: if timing is asynchronous the command will be run and not waited
for before starting the next command. asynchronous commands may
yield better performance but may also result in many commands
being run at once. asynchronous commands should not load the
system heavily unless one is looking to freeze a machine.
synchronous commands are spawned and waited for before the next
command is started. a side effect of synchronous commands is
that the time spent waiting may sum to an amount of time greater
than the interval ('--interval' option) specified - if the amount
of time running commands exceeds the interval the next inventory
simply begins immediately with no pause. because of this one
should think of the interval used as a minimum bound only,
especially when synchronous commands are used.
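for the two 'all at once' types, a handler has to read its work list from stdin. the following is a sketch of what that parsing might look like in ruby. it assumes the 'filter' fields are separated by whitespace (the text above doesn't say), and the method names are made up; check the auto-generated sample commands for the real conventions.

```ruby
require 'yaml'

# 'filter' input: one line per entry holding file, mtime date, mtime time.
# assumes whitespace separated fields; paths containing spaces are kept
# intact by peeling the last two fields off the end of each line.
def parse_filter_input(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    fields = line.split(/\s+/)
    time = fields.pop
    date = fields.pop
    { 'path' => fields.join(' '), 'mtime' => "#{date} #{time}" }
  end
end

# 'yaml' input: an array of hashes with the keys 'path' and 'mtime'.
# Time is permitted in case the mtimes arrive as unquoted timestamps.
def parse_yaml_input(text)
  YAML.safe_load(text, permitted_classes: [Time]) || []
end
```

a filter handler would then do something like parse_filter_input($stdin.read).each { |entry| ... }.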
note that sample commands of each type are auto-generated in the
dbdir/commands directory. reading these should answer any questions regarding
the calling conventions of any of the four types. for other questions see
the sample config, which is also auto-generated.
-----------------------------------------------------------------------------
mode: list (l)
-----------------------------------------------------------------------------
dump the contents of the database in yaml format for easy viewing/parsing
ENVIRONMENT
for dirwatch itself:
export SLDB_DEBUG=1 -> cause sldb library actions (sql) to be logged
export LOCKFILE_DEBUG=1 -> cause lockfile library actions to be logged
for programs run by dirwatch the following environment variables will be set:
DIRWATCH_DIR -> the directory being watched
DIRWATCH_ACTION -> action type, one of 'instance', 'created', 'modified',
'updated', 'deleted', or 'existing'
DIRWATCH_TYPE -> command type, one of 'simple', 'expanded', 'filter', or
'yaml'
DIRWATCH_N_PATHS -> the total number of paths for this action. the paths
themselves will be passed to the program in a different
way depending on DIRWATCH_TYPE, for instance on the
command line or on stdin, but this number will always
be the total number of paths the program should expect.
DIRWATCH_PATH_IDX -> for some command types, like 'simple', the program will
be run more than once to handle all paths since calling
convention only allows the program to be called with
one path at a time. this number is the index of the
current path in such cases. for instance, a 'simple'
program may only be called with one path at a time so
if 10 files were created in the directory that would
result in the program being called 10 times. in each
case DIRWATCH_N_PATHS would be 10 and DIRWATCH_PATH_IDX
would range from 0 to 9 for each of the 10 calls to the
program. in the case of 'filter' and 'yaml' command
types, where every path is given at once on stdin this
value will be equal to DIRWATCH_N_PATHS
DIRWATCH_PATH -> for 'simple' and 'expanded' command types, which are
called once for each path, this will contain the path
the program is being called with. in the case of
'filter' or 'yaml' command types the variable contains
the string 'stdin' implying that all paths are
available on stdin.
DIRWATCH_MTIME -> for 'simple' and 'expanded' command types, which are
called once for each path, this will contain the mtime
the program is being called with. in the case of
'filter' or 'yaml' command types the variable contains
the string 'stdin' implying that all mtimes are
available on stdin.
DIRWATCH_PID -> the pid of dirwatch watcher process
DIRWATCH_ID -> an identifier for this action that will be unique for
any given run of a dirwatch watcher process.
restarting the watcher resets the generator. this
identifier is logged in the dirwatch watcher logs so it is
useful to match program logs with dirwatch logs
PATH -> the normal shell path. for each program run the PATH
is modified to contain the commands dir of the dirwatch
watcher process. normally this is
$DIRWATCH_DIR/.dirwatch/commands/
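one use for DIRWATCH_N_PATHS and DIRWATCH_PATH_IDX is letting a per-path handler notice when it is on the final call of a batch, say to send one summary instead of ten. a minimal sketch, assuming the variables hold plain decimal integers as described above; batch_position is a made-up helper name.

```ruby
# sketch: use DIRWATCH_N_PATHS and DIRWATCH_PATH_IDX to notice the final call
# of a batch in a 'simple' or 'expanded' handler. env is any hash-like object
# that responds to fetch, such as ENV.
def batch_position(env)
  total = Integer(env.fetch('DIRWATCH_N_PATHS'), 10)
  index = Integer(env.fetch('DIRWATCH_PATH_IDX'), 10)
  { index: index, total: total, last: index == total - 1 }
end
```

in a real handler this would be called as batch_position(ENV). for 'filter' and 'yaml' handlers the index equals the total, so last is never true there and the check only makes sense for the per-path types.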
FILES
directory/.dirwatch/ -> dirwatch data files
directory/.dirwatch/dirwatch.conf -> default configuration file
directory/.dirwatch/commands/ -> default location for triggers
directory/.dirwatch/db -> sldb/sqlite database
directory/.dirwatch/dirwatch.pid -> default pidfile
directory/.dirwatch/logs/ -> automatically rolled log files
DIAGNOSTICS
success -> $? == 0
failure -> $? != 0
AUTHOR
ara.t....@noaa.gov
BUGS
1 < bugno && bugno < 42
OPTIONS
--help, -h
this message
--log=path, -l
set log file - (default stderr)
--verbosity=verbosity, -v
0|fatal < 1|error < 2|warn < 3|info < 4|debug - (default info)
--config=path
valid path - specify config file (default nil)
--template=[path]
valid path - generate a template config file in path (default stdout)
--dirwatch_dir=dirwatch_dir
specify dirwatch storage dir
--daemon, -d
specify daemon mode
--quiet, -q
be wery wery quiet
--flat, -F
do not recurse into subdirectories
--files_only, -f
consider only files
--no_follow, -n
do not follow links
--pattern=pattern, -p
consider only entries that match pattern
--n_loops=n_loops, -N
loop only this many times before exiting
--interval=interval, -i
sleep at least this long between loops
--lockfile=[lockfile], -k
specify a lockfile path
--show_input, -s
show input to all commands run