Loading large datasets: is it possible to avoid this while not caching?

nicky

May 28, 2012, 11:12:52 AM
to knitr
Hello,

When I have a large dataset to load, cache=TRUE can help me avoid
loading the data repeatedly. However, the dataset is so large that
caching is not cheap. I wonder if there is a way to skip this
data-loading chunk and still keep the objects from the last run.

Outside knitr, I usually handle this with a testObject() function.
Within knitr, however, it doesn't work. I guess this is because knitr
is memoryless: it doesn't remember which objects already exist from
the last run.

The following code illustrates the point. When it is run in the R
console (outside knitr) for the first time, matrix m doesn't exist, so
case (2) executes. Run it again and case (1) executes (since matrix m
was already created by the last run).
Within knitr, however, it always executes case (2). This is why I call
knitr memoryless.

Best regards,

Nicky

----
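# Return TRUE if a variable with the given (unquoted) name already exists.
testObject <- function(object) {
  exists(as.character(substitute(object)))
}

if (testObject(m)) {
  # case (1): m is still around from a previous run in this R session
  cat('matrix exists\n')
} else {
  # case (2): m is not defined yet, so create it
  cat('matrix does not exist\n')
  m <- matrix(1:4, nrow = 2)
  cat('newly created matrix m:', m, '\n')
}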
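
A hedged variant of the check above, as a sketch only: objectPresent is a hypothetical name, not part of knitr or base R. It looks only in the caller's environment (inherits = FALSE), so a base function such as c is not reported as an existing object. A fresh R session starts without demoMatrix, which is why the first check is FALSE.

```r
# Hypothetical helper: TRUE if the unquoted name is bound in 'envir' itself.
objectPresent <- function(object, envir = parent.frame()) {
  name <- deparse(substitute(object))  # turn the unquoted symbol into a string
  exists(name, envir = envir, inherits = FALSE)
}

before <- objectPresent(demoMatrix)    # nothing has been created yet
demoMatrix <- matrix(1:4, nrow = 2)
after <- objectPresent(demoMatrix)     # now bound in the calling environment
```

The result only carries over to a second run if both runs share the same R session.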

Carl Boettiger

May 28, 2012, 1:13:17 PM
to nicky, knitr
Hi Nicky,

Not sure I understand what behavior you are looking for in this case.  Perhaps if you sent some knitr code and a description of the desired output it would be easier to follow?

You say the testObject() function always executes case 2 in knitr -- when do you see this behavior?  I get the expected behavior (case 1) in this example.  (source; output).  

If you have a large dataset to load, how does cache=TRUE avoid loading the data repeatedly?  Under what circumstances would you ever need to load the same dataset more than once, even with cache=FALSE?  knitr chunks (just like the original Sweave chunks) already have access to everything created in previous chunks, so there's no need to reload data.  If you are loading a large dataset from a file and don't want it cached, then perhaps you want cache=FALSE for that chunk?


hope this helps,

Carl

--
Carl Boettiger
UC Davis
http://www.carlboettiger.info/

nicky

May 28, 2012, 5:53:15 PM
to knitr
Thanks Carl!

Here is the knitr code: https://github.com/nickytong/large-dataset-without-caching
In the PDF file (and the Rnw file), section 1 covers the testObject()
issue; section 2 covers data loading.

When I have a chunk that loads a large dataset, setting cache=TRUE
makes it fast when I rerun knit(). This is why I claim caching avoids
repeatedly executing this data-loading chunk. Of course, the first
run, which builds the cache, is extremely slow and consumes a lot of
disk space.

I do agree there is no need to reload a large dataset. This is
actually what I'm trying to avoid.

Best,

Nicky

On May 28, 12:13 pm, Carl Boettiger <cboet...@gmail.com> wrote:
> Hi Nicky,
>
> Not sure I understand what behavior you are looking for in this case.
>  Perhaps if you sent some knitr code and a description of the desired
> output it would be easier to follow?
>
> You say the testObject() function always executes case 2 in knitr -- when
> do you see this behavior?  I get the expected behavior (case 1) in this
> example.  (source<https://github.com/cboettig/sandbox/blob/5e9dc18413d824962fe2a5df7f11...>;
> output<https://github.com/cboettig/sandbox/blob/5e9dc18413d824962fe2a5df7f11...>).

nicky

May 28, 2012, 6:06:35 PM
to knitr
I do expect a chunk to do something different when I run it a second
time without caching. The second time I run the code, I expect it to
detect that some objects have already been computed or loaded, so it
doesn't repeat that step. That's why I wrote the testObject function.
I agree we'd better not cache this whole procedure.

I'd rather not cache, since it would consume a huge amount of disk
space; e.g. my loaded data might be more than 1 GB.

nicky

May 28, 2012, 11:07:19 PM
to knitr
I realized that the problem comes from RStudio. When I generate the PDF
report, I always click the 'Compile PDF' button. This treats each
execution as brand new, so it behaves as if every object from the
previous run is gone. That's why the data is reloaded every time.
In contrast, calling knit('***.Rnw') in the same R session keeps the
objects computed in the last run, so testObject() takes effect.
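
The difference between the two workflows can be sketched without knitr at all. In this minimal sketch an environment stands in for an R session; loadBigData and runChunk are hypothetical names, and the tiny matrix is only a stand-in for the real dataset. Reusing one environment mimics repeated knit() calls in the same session, while a new environment per run mimics 'Compile PDF'.

```r
# Hypothetical stand-in for a slow load; counts how often it really runs.
loadCount <- 0
loadBigData <- function() {
  loadCount <<- loadCount + 1
  matrix(1:4, nrow = 2)
}

# The guarded chunk body: load only when bigData is missing from 'session'.
runChunk <- function(session) {
  if (!exists("bigData", envir = session, inherits = FALSE)) {
    assign("bigData", loadBigData(), envir = session)
  }
  invisible(session)
}

# Same session reused: the second run finds bigData and skips the load.
sameSession <- new.env()
runChunk(sameSession)
runChunk(sameSession)
loadsSameSession <- loadCount

# Fresh session per run: nothing survives, so the data is loaded each time.
loadCount <- 0
runChunk(new.env())
runChunk(new.env())
loadsFreshSessions <- loadCount
```

The guard only helps when the objects from the previous run are still in memory.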

I'll ask RStudio to add another button or shortcut for the
knit('**.Rnw') operation.

Thanks for your attention and the discussion!

Best,

Nicky

Yihui Xie

May 28, 2012, 11:19:49 PM
to nicky, knitr
In terms of the RStudio problem, I guess you do not need to report it,
since it is by design. It starts a new R session each time to make
sure your document is reproducible (objects not polluted by the global
environment), which is good.

I do not quite understand your question here. It seems that you want
to avoid evaluating a chunk but do not want to save the huge object to
the cache database. That is impossible. If you want to use cache, you
have to write the objects to cache files; otherwise you should not use
cache.
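
As a rough illustration of why skipping evaluation implies writing to disk, here is a minimal sketch of the general idea behind caching. It is not knitr's actual implementation: cachedValue and slowLoad are hypothetical names, and saveRDS()/readRDS() are used only as a simple stand-in for a cache database.

```r
# Hypothetical helper: call 'compute' only if no cache file exists yet.
cachedValue <- function(cacheFile, compute) {
  if (file.exists(cacheFile)) {
    return(readRDS(cacheFile))   # cache hit: read the stored object from disk
  }
  value <- compute()             # cache miss: do the slow work once
  saveRDS(value, cacheFile)      # the object has to be written to disk here
  value
}

cacheFile <- tempfile(fileext = ".rds")
computeCount <- 0
slowLoad <- function() {
  computeCount <<- computeCount + 1
  matrix(1:4, nrow = 2)          # stand-in for a large dataset
}

first  <- cachedValue(cacheFile, slowLoad)   # computes and writes the file
second <- cachedValue(cacheFile, slowLoad)   # reads the file, no recompute
cacheBytes <- file.info(cacheFile)$size      # disk space taken by the cache
```

The second call is cheap only because the first one paid the disk cost, which is the trade-off described above.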

Can you simplify your example a little bit? I'm lost in LaTeX. At
least the first 60 lines are not helpful. Start making an example with
three lines:

\documentclass{article}
\begin{document}
\end{document}

Regards,
Yihui
--
Yihui Xie <xiey...@gmail.com>
Phone: 515-294-2465 Web: http://yihui.name
Department of Statistics, Iowa State University
2215 Snedecor Hall, Ames, IA

nicky

May 28, 2012, 11:44:34 PM
to knitr
I agree. The Compile PDF button is a good design. It would still be
nice to have another button for the knit('***.Rnw') operation. This
would be useful even though there is the Ctrl+Alt+R shortcut that runs
all chunks, because Ctrl+Alt+R also re-evaluates cached chunks, which
sometimes takes a long time. A knit button (e.g. in the console
window) wouldn't.

A simplified version is available at: https://github.com/nickytong/large-dataset-without-caching
Please forgive me as a knitr newbie. I didn't know the Rnw file can be
so concise!

The "largeDataNoCaching_byClickingCompilePDF.pdf" file is generated by
clicking the Compile PDF button and cannot avoid reloading the data.
The "largeDataNoCaching_by_knit().pdf" file is generated by knit() and
gives what I need.

Yihui Xie

May 28, 2012, 11:57:40 PM
to nicky, knitr
This is entirely expected. Note it is really bad to have to depend on
the global environment; a document should be reproducible with a
single call to knit() instead of calling knit() twice in the same R
session to change the behavior of the same chunk. This kind of
inconsistency can bite you hard without letting you know because there
will be no error messages.

If you do not want knitr to cache x in files, you can just
load('small.RData') *without* cache when x has already been saved in
small.RData. The whole thing boils down to:

\documentclass{article}
\begin{document}
<<load-data, cache=FALSE>>=
load('small.RData')
@
\end{document}
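
For completeness, a minimal sketch of how such a file could be prepared once and then restored by the uncached chunk. The matrix is only a stand-in for the real object x, and the temporary directory is an assumption made so that the sketch does not touch the working directory.

```r
dataFile <- file.path(tempdir(), "small.RData")

# One-off preparation, run outside the report: build x and save it.
x <- matrix(1:4, nrow = 2)   # stand-in for the real, expensive object
save(x, file = dataFile)
rm(x)                        # pretend we are now in a new R session

# What the uncached chunk does on every knit: restore x under its own name.
restored <- load(dataFile)   # load() returns the names of the restored objects
```

The load still runs on every knit, so this keeps the document reproducible but does not remove the loading time itself.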

Regards,
Yihui


nicky

May 29, 2012, 12:15:35 AM
to knitr
Suppose small.RData is huge, e.g. over 1 GB; I wouldn't want to load it
every time I knit. I agree that, in general, a code chunk shouldn't
depend on the global environment. However, in this case it's the
only way I can think of to avoid reloading the data when knitting
(I need to knit many times interactively while producing the final
analysis report). It's a "use with caution" approach that just
simplifies my life. Otherwise, I have to wait minutes to see the report.

Yihui Xie

May 29, 2012, 12:19:21 AM
to nicky, knitr
OK, that makes sense. You might try to ask them to provide such a
feature to knit() a file in the same R session. And you can also use
the command history to run knit() again and again.

Regards,
Yihui

