Optimizing memory usage

Gwern Branwen

May 9, 2012, 9:35:10 PM
to hakyll, Jasper Van der Jeugt
So compiling my site has become increasingly difficult because hakyll
eats so much memory that it gets killed by the OOM killer. Some files
are particularly bad: for example, my page of Hofstadter's
superrationality essays causes RAM consumption to go from a final 60%
to 80%+ when profiling is turned on, and I have to delete it just to
get the profiling runs to finish. (I don't know why this would happen,
when on the command line Pandoc can apparently compile it in a fraction
of a second.)

Naturally, the first step is to profile http://www.gwern.net/hakyll.hs
(somewhat out of date repo: https://github.com/gwern/gwern.net )

$ ghc -prof -auto-all -rtsopts -O2 -fforce-recomp -optl-s --make hakyll.hs \
    && nice ./hakyll rebuild +RTS -p -m10 -RTS

which produces bottomUp.prof.

The top culprits:

COST CENTRE             MODULE                            %time %alloc

pandocTransform         Main                               20.5   17.9
anyLine                 Text.Pandoc.Parsing                12.8   25.2
main                    Main                               10.6    9.1
convertInterwikiLinks   Main                                4.2    0.0
likelyAbbrev            Text.Pandoc.Readers.Markdown        4.1    3.8
CAF                     Main                                4.0    5.3
str                     Text.Pandoc.Readers.Markdown        4.0    4.8
myPageCompiler          Main                                3.6    5.4
convertHakyllLinks      Main                                2.1    0.0
inline                  Text.Pandoc.Readers.Markdown        1.8    0.9
spaceChar               Text.Pandoc.Parsing                 1.3    1.7
inlineListToHtml        Text.Pandoc.Writers.HTML            1.0    0.6
variable                Text.Pandoc.Templates               1.0    1.2
charOrRef               Text.Pandoc.Parsing                 1.0    1.0
characterReference      Text.Pandoc.CharacterReferences     0.9    1.0
htmlTag                 Text.Pandoc.Readers.HTML            0.7    1.1

There's probably nothing to be done about main or anyLine or
likelyAbbrev, so what is pandocTransform?

pandocTransform :: Pandoc -> Pandoc
pandocTransform = bottomUp (map (convertInterwikiLinks . convertHakyllLinks))

That's probably indicating one of the 2 convert functions is using too
much memory. So we look at it and its sub-nodes:

COST CENTRE              MODULE    no.      entries   individual     inherited
                                                      %time %alloc   %time %alloc
pandocTransform          Main    22933            0    20.5   17.9    26.8   17.9
convertInterwikiLinks    Main    23046    247843051     4.2    0.0     4.2    0.0
inlinesToString          Main    23048         6102     0.0    0.0     0.0    0.0
convertHakyllLinks       Main    23045    247841141     2.1    0.0     2.1    0.0
inlinesToURL             Main    23362          121     0.0    0.0     0.0    0.0
inlinesToString          Main    23363          121     0.0    0.0     0.0    0.0

The inlines* are irrelevant, but the two convert* functions account for
0% of allocation and just 6.3% of CPU time! Where are the other 20.5%
of time and *17.9*% of allocation going? The only other functions are
'map' and 'bottomUp'!

Looking at bottomUp
http://hackage.haskell.org/packages/archive/pandoc-types/1.8.2/doc/html/Text-Pandoc-Generic.html
it sounds like it might be inefficient. Unfortunately, if we swap
bottomUp for topDown, we find the profile output changes only
trivially (hakyll.prof). The module offers no other options. Am I
stuck?
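
One possible explanation, offered only as a sketch and not confirmed
anywhere in this thread: a generic bottom-up traversal calls its
function on every subterm of the matching type, and every tail of a
list is itself a list. A function of type [Inline] -> [Inline] such as
'map f' would therefore run on each tail, so the i-th element of a list
gets 'f' applied i times. That would be consistent with the roughly 248
million entries recorded for each convert* function. The code below
uses only base: 'Inline', 'visit', 'mkT'' and 'bottomUp'' are toy
stand-ins written for illustration, not the pandoc-types definitions.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
module Main where

import Data.Data (Data, Typeable, cast, gmapT)
import Data.Maybe (fromMaybe)

-- Toy stand-in for Pandoc's Inline (hypothetical): the Int counts
-- how many times the element has been visited.
data Inline = Str Int deriving (Show, Eq, Data)

-- Lift a type-specific function to all types: apply it where the
-- types match, act as the identity everywhere else.
mkT' :: (Typeable a, Typeable b) => (b -> b) -> a -> a
mkT' f = fromMaybe id (cast f)

-- Generic bottom-up traversal: transform the children first, then
-- the node itself. A sketch in the style of SYB's 'everywhere'.
bottomUp' :: (Data a, Data b) => (a -> a) -> b -> b
bottomUp' f = go
  where
    go :: Data c => c -> c
    go = mkT' f . gmapT go

-- Record one visit on an element.
visit :: Inline -> Inline
visit (Str n) = Str (n + 1)

main :: IO ()
main = do
  let doc = replicate 4 (Str 0)
  -- List-typed function: runs on the whole list and on every tail.
  print (bottomUp' (map visit) doc)
  -- Element-typed function: runs exactly once per element.
  print (bottomUp' visit doc)
```

The first line printed is [Str 1,Str 2,Str 3,Str 4] and the second is
[Str 1,Str 1,Str 1,Str 1]. If the real bottomUp behaves the same way,
and assuming both convert* functions have type Inline -> Inline,
dropping the 'map' from pandocTransform would give the same result
with one call per element.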

--
gwern
http://www.gwern.net
hakyll.prof
bottomUp.prof

Jasper Van der Jeugt

May 12, 2012, 8:04:10 AM
to Gwern Branwen, hakyll
I've looked at it for a bit, and added an option which allows you to
turn off in-memory caching in Hakyll [1]. This improves the situation
a little, but not enough.

I'll try to look at it further later, see if I can reproduce it for a
single page using only pandoc.

[1]: https://github.com/jaspervdj/hakyll/commit/759f1e61eadc29708e60fd51bfb92b9fa5c90ec2

Cheers,
Jasper

Gwern Branwen

May 23, 2012, 12:01:22 AM
to pandoc-...@googlegroups.com, hakyll, Jasper Van der Jeugt, John MacFarlane
I thought I'd spend a little time bisecting my Hofstadter file to
figure out what the problem is. 5 hours and hundreds of iterations
later...

Simple bisection doesn't work. It's not a single line, it's an
interaction between lines. I think it must have something to do with
how many rulers or tables there are, because by the time I got the
file down to 150 lines, the explosion was starting to hold constant at
50-70%. It has to do with the tables: the output comes out malformed
and cut across a table. It's weird, and the Hakyll output differs from
the Pandoc output. My best guess is that use of the horizontal ruler
syntax ('---') is somehow forcing backtracking or forwardtracking
caused by the existence of a table like

Bulletin Clock (minutes before midnight)   Wald's percentage (probability per year)
----------------------------------------   ----------------------------------------
1 min                                      20%

If I replace all the horizontal rulers with '* * *', the problem goes
away and Hakyll memory use peaks at a more doable 43% of RAM.
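
The substitution could be automated along these lines. This is a
minimal sketch under stated assumptions: 'isDashRuler' and
'replaceRulers' are hypothetical helpers, and the rule "a line of only
hyphens that follows a blank line is a ruler" is a heuristic, so dashed
lines directly under text (header underlines, table borders) are left
alone. It is not how Pandoc itself decides what a ruler is.

```haskell
module Main where

import Data.Char (isSpace)

-- True for a line made only of three or more hyphens
-- (trailing whitespace is ignored).
isDashRuler :: String -> Bool
isDashRuler line =
  let body = reverse (dropWhile isSpace (reverse line))
  in length body >= 3 && all (== '-') body

-- Replace dash rulers that follow a blank line (or start the input)
-- with "* * *". Dashed lines directly under text are kept as they are.
replaceRulers :: String -> String
replaceRulers = unlines . go True . lines
  where
    go _ [] = []
    go prevBlank (l : ls)
      | prevBlank && isDashRuler l = "* * *" : go False ls
      | otherwise                  = l : go (all isSpace l) ls

-- Filter stdin to stdout.
main :: IO ()
main = interact replaceRulers
```

Run as a filter over the Markdown source before handing it to Hakyll.
A headerless single-column table that opens with an unbroken dashed
line would be misread as a ruler, so check the diff before committing.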

I'm not sure what the root problem is, and at this point I am
sufficiently tired that I am going to simply use this hack and not
inquire further.

--
gwern
http://www.gwern.net

John MacFarlane

May 23, 2012, 2:31:59 AM
to Gwern Branwen, pandoc-...@googlegroups.com, hakyll, Jasper Van der Jeugt
[resending to all]

This may be related to
https://github.com/jgm/pandoc/commit/5b49c47414878c6d907435bb24b8923627af43e1
Are you using a recent version of pandoc?


Gwern Branwen

May 23, 2012, 11:36:33 AM
to John MacFarlane, pandoc-...@googlegroups.com, hakyll, Jasper Van der Jeugt
On Wed, May 23, 2012 at 2:31 AM, John MacFarlane <fiddlo...@gmail.com> wrote:
> This may be related to
> https://github.com/jgm/pandoc/commit/5b49c47414878c6d907435bb24b8923627af43e1
> Are you using a recent version of pandoc?

ghc-pkg says I only have installed pandoc-1.8.2.1.

--
gwern
http://www.gwern.net