[CorpLing with R] Progress counter

Kevin Parent

May 13, 2010, 10:52:46 PM
to corpling-with-r
This is a trivial matter but I thought I'd ask.

Certain tasks performed on a large corpus take considerable time to execute, so there needs to be a progress counter to show that the script is still working. One typo and the program may try executing the same command over and over without moving on. Here is a mock example of how I do it:

corpus<-scan("really-big-corpus.txt",what="char")
corpus<-strsplit(corpus," ")
for(i in 1:length(corpus)){
corpus(i)<-paste("<TAG>",corpus(i),sep="")
cat(as.integer(i/length(corpus)*100),"% done.\n")
}

Now that works perfectly fine, though it's a bit sloppy. You might get "0% done." printed dozens of times before it increments to 1%, etc. I'm sure there must be a more 'elegant' way of doing this. Any suggestions?

Incidentally, the way I describe seems to slow down the processing as well. I can live with that but is it inevitable?

--
Kevin Parent, Ph.D
VP-PR, Schoolmasters www.schoolmasters.ning.com
National Korea Toastmasters webmaster www.koreatoastmasters.ning.com


Stefan Th. Gries

May 13, 2010, 11:12:10 PM
to corplin...@googlegroups.com
Answer 1 (very trivial): why not just leave out the as.integer? Then
you get decimals, which are very likely to change on every iteration.
Answer 2: For really long stuff, hours or days, I wrote two functions
that basically output a numeric version of a progress bar: on the
basis of how much time the previous iterations have taken and how many
are left, they extrapolate how much time the rest will take and output
that in hours, minutes, and seconds. The following script may serve as
an example, the lines that are relevant here are the ones with the
three comment marks:

############
corpus.files <- dir("/home/stgries/Corpora/BNCwe_SGML", full.names=TRUE)
sentence.numbers <- vector(length=length(corpus.files))
starting.time <- Sys.time() ###
for (i in 1:length(corpus.files)) {
   current.file <- scan(corpus.files[i], what=character(0), sep="\n", quiet=TRUE)
   sentence.numbers[i] <- sum(grepl("<s n=", current.file))
   cat(seconds.to.time(remaining.time(starting.time, i, length(corpus.files)), 2), "\n") ###
}
############

The output of this looks like this:

1 minute(s), 0.1 second(s)
57.3 second(s)
53.8 second(s)
52.8 second(s)
53 second(s)
49.8 second(s)
47.3 second(s)
44.9 second(s)
42.8 second(s)
...

This could be a nice programming exercise ;-)
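
For anyone who wants to try the exercise, here is one minimal sketch of how two such helpers might look. The names eta_seconds and seconds_to_text are hypothetical stand-ins and are not the original functions, whose code is not shown in this thread. The sketch assumes simple linear extrapolation from the elapsed time, and the digits argument is only a guess at what the second argument in the call above might control.

```r
# Hypothetical stand-ins for the two helpers described above;
# these are NOT the original implementations.

# Estimate the remaining seconds by linear extrapolation:
# average time per finished iteration times the iterations still left.
eta_seconds <- function(start, done, total, now = Sys.time()) {
  elapsed <- as.numeric(difftime(now, start, units = "secs"))
  elapsed / done * (total - done)
}

# Turn a number of seconds into text such as "1 minute(s), 0.1 second(s)".
# Hours and minutes are only shown when they are greater than zero.
seconds_to_text <- function(secs, digits = 1) {
  secs    <- round(secs, digits)
  hours   <- secs %/% 3600
  minutes <- (secs %% 3600) %/% 60
  seconds <- round(secs %% 60, digits)  # round again to hide floating-point noise
  parts <- c(
    if (hours > 0) paste(hours, "hour(s)"),
    if (minutes > 0) paste(minutes, "minute(s)"),
    paste(seconds, "second(s)")
  )
  paste(parts, collapse = ", ")
}
```

Inside a loop like the one above, the call would then be something like cat(seconds_to_text(eta_seconds(starting.time, i, length(corpus.files))), "\n"). The estimate is only as good as the assumption that all iterations take about the same time.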

> Incidentally, the way I describe seems to slow down the processing as well. I can live with that but is it inevitable?
I would think so, yes. You want to compute something, you want to
output it, ergo you need processing time, but only very little. BTW,
sometimes it seemed to me as if what takes more time than the
processing is the catting into a large console window when many
numbers have to be printed and moved downwards. With big tasks, I
always make the console so small that only two lines/numbers are
visible. AFAIK, that makes a difference.
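
If the console output itself is what costs time, another option is to print less often. The following is a minimal sketch under that assumption: it prints only when the whole-number percentage changes, so roughly a hundred lines reach the console no matter how long the loop is. The function name tag_with_progress is made up for illustration, and the tagging step is just the mock task from the question.

```r
# Hypothetical helper: tag every token, but report progress only
# when the whole-number percentage has changed since the last report.
tag_with_progress <- function(tokens) {
  total <- length(tokens)
  last.percent <- -1
  for (i in seq_along(tokens)) {
    tokens[i] <- paste0("<TAG>", tokens[i])
    # Integer division avoids floating-point surprises such as 28.999...
    percent <- (100 * i) %/% total
    if (percent > last.percent) {
      cat(percent, "% done.\n", sep = "")
      last.percent <- percent
    }
  }
  tokens
}
```

The comparison of two numbers on every iteration is cheap next to a call to cat(), so most of the per-iteration reporting overhead should disappear, though I have not timed it.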

BTW: you probably mean corpus[i], not corpus(i), right?
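
With square brackets, the loop from the question could look like the sketch below. It swaps the hand-made counter for txtProgressBar() and setTxtProgressBar(), which ship with R in the utils package; that is a different technique from the two discussed above, shown here only as one possible alternative. The six-word corpus is a toy stand-in so that the example runs on its own.

```r
# Toy stand-in for the corpus; the real one would come from scan().
corpus <- c("the", "cat", "sat", "on", "the", "mat")

# style = 3 draws a bar with a percentage that is redrawn in place.
pb <- txtProgressBar(min = 0, max = length(corpus), style = 3)
for (i in seq_along(corpus)) {
  # Square brackets index the vector; corpus(i) would try to call a function.
  corpus[i] <- paste("<TAG>", corpus[i], sep = "")
  setTxtProgressBar(pb, i)
}
close(pb)
```

Because the bar overwrites a single line instead of scrolling, it also avoids filling the console with repeated "0% done." lines.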

Cheers,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------