Alright, I re-discovered Ryan Culpepper's talk, "The Cost of Sugar," from the RacketCon 2018 video stream (https://youtu.be/CLjXhr_TgP8?t=5908) and made some progress by following along.
Here are the .zo files larger than 100K:
993K ./vector/compiled/tests_rkt.zo
830K ./scribblings/compiled/glm_scrbl.zo
328K ./vector/compiled/relational_rkt.zo
295K ./vec4/compiled/bool_rkt.zo
291K ./vec4/compiled/int_rkt.zo
290K ./vec4/compiled/uint_rkt.zo
290K ./vec4/compiled/double_rkt.zo
289K ./vec4/compiled/float_rkt.zo
280K ./vec3/compiled/bool_rkt.zo
276K ./vec3/compiled/int_rkt.zo
275K ./vec3/compiled/uint_rkt.zo
275K ./vec3/compiled/double_rkt.zo
274K ./vec3/compiled/float_rkt.zo
262K ./vec2/compiled/bool_rkt.zo
258K ./vec2/compiled/uint_rkt.zo
258K ./vec2/compiled/int_rkt.zo
258K ./vec2/compiled/double_rkt.zo
257K ./vec2/compiled/float_rkt.zo
213K ./vec1/compiled/bool_rkt.zo
210K ./vec1/compiled/uint_rkt.zo
210K ./vec1/compiled/int_rkt.zo
210K ./vec1/compiled/double_rkt.zo
209K ./vec1/compiled/float_rkt.zo
102K ./compiled/main_rkt.zo
101K ./compiled/vector_rkt.zo
I'm pretty sure that's a lot of big files. It's for a port of GLM, a graphics math library that implements (among other things) fixed-length vectors of up to 4 components over 5 distinct scalar types, for a total of 20 distinct type-length combinations with many small variations in their APIs and implementations.
The variations I'm targeting either require a macro or, when implemented as functions instead, add developer-time or run-time overhead. For example, the base component accessors for a four-component vector of doubles are:
dvec4-x
dvec4-y
dvec4-z
dvec4-w
Each of the "xyzw" components has two aliases -- one from "rgba" and another from "stpq". Each accessor also has a corresponding mutator, e.g., dvec4-g and set-dvec4-g!.
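The alias scheme described above can be sketched with a small macro. This is only an illustration under assumptions: the mutable struct layout and the macro name define-component-aliases are hypothetical and are not taken from the actual glm sources.

```racket
#lang racket/base
;; Hypothetical sketch: the struct layout and the macro name
;; define-component-aliases are assumptions, not the library's code.
(require (for-syntax racket/base racket/syntax))

(struct dvec4 (x y z w) #:mutable)

;; (define-component-aliases dvec4 x (r s)) binds dvec4-r, dvec4-s,
;; set-dvec4-r!, and set-dvec4-s! to the existing x accessor and mutator.
(define-syntax (define-component-aliases stx)
  (syntax-case stx ()
    [(_ type base (alias ...))
     (with-syntax ([getter (format-id #'type "~a-~a" #'type #'base)]
                   [setter (format-id #'type "set-~a-~a!" #'type #'base)]
                   [(alias-getter ...)
                    (for/list ([a (in-list (syntax->list #'(alias ...)))])
                      (format-id #'type "~a-~a" #'type a))]
                   [(alias-setter ...)
                    (for/list ([a (in-list (syntax->list #'(alias ...)))])
                      (format-id #'type "set-~a-~a!" #'type a))])
       #'(begin
           (define alias-getter getter) ...
           (define alias-setter setter) ...))]))

(define-component-aliases dvec4 x (r s))
(define-component-aliases dvec4 y (g t))
(define-component-aliases dvec4 z (b p))
(define-component-aliases dvec4 w (a q))

(define v (dvec4 1.0 2.0 3.0 4.0))
(set-dvec4-g! v 20.0) ; writes the y field through its "rgba" alias
```

After the last line, (dvec4-y v), (dvec4-g v), and (dvec4-t v) all read the same field. Binding each alias to the existing accessor keeps the expansion to one small definition per alias.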
For another example, whereas adding two dvec4's sums four components,
(dvec4
  (fl+ (dvec4-x v1) (dvec4-x v2))
  (fl+ (dvec4-y v1) (dvec4-y v2))
  (fl+ (dvec4-z v1) (dvec4-z v2))
  (fl+ (dvec4-w v1) (dvec4-w v2)))
the same operation on dvec2's sums only the first two components.
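One way to generate both variants from a single definition is a macro that unrolls over a list of component names. The following is a minimal sketch, assuming plain immutable structs; the names define-componentwise, dvec4+, and dvec2+ are hypothetical and may not match the real library.

```racket
#lang racket/base
;; Hypothetical sketch: define-componentwise, dvec4+, and dvec2+ are
;; illustrative names, not the library's API.
(require racket/flonum
         (for-syntax racket/base racket/syntax))

(struct dvec2 (x y))
(struct dvec4 (x y z w))

;; (define-componentwise name type op (component ...)) defines a binary
;; function that applies op to each listed component, fully unrolled.
(define-syntax (define-componentwise stx)
  (syntax-case stx ()
    [(_ name type op (component ...))
     (with-syntax ([(getter ...)
                    (for/list ([c (in-list (syntax->list #'(component ...)))])
                      (format-id #'type "~a-~a" #'type c))])
       #'(define (name v1 v2)
           (type (op (getter v1) (getter v2)) ...)))]))

(define-componentwise dvec4+ dvec4 fl+ (x y z w))
(define-componentwise dvec2+ dvec2 fl+ (x y))

(define sum4 (dvec4+ (dvec4 1.0 2.0 3.0 4.0) (dvec4 10.0 20.0 30.0 40.0)))
(define sum2 (dvec2+ (dvec2 1.0 2.0) (dvec2 10.0 20.0)))
```

The unrolled body is what makes the generated code fast, and it is also what makes each expansion grow with the vector length.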
Furthermore, the sheer volume of the target code base makes writing everything out by hand a mind-numbing exercise in frustration, and that's when looking at a mere 20% of the pile. It's going to get much worse very quickly. To add fixed-length matrices up to shape 4x4 over the same scalar types, I'm looking at 16x5 = 80 more distinct type-shape combinations!
Getting back to the .zo files, I had no luck running "raco macro-profiler" on the top end of the list. It appears to diverge. My dev laptop probably doesn't have enough RAM, so I'll have to try again on a bigger machine.
Here's an excerpt from a file on the bottom end:
[eric@walden racket-glm]$ raco macro-profiler glm/vec4/double
profiling (lib "glm/vec4/double.rkt")
Initial code size: 87
Final code size : 86531
========================================
Phase 0
the-template (defined as the-template.1 in glm/vector/template)
total: 31536, mean: 31536
direct: 2054, mean: 2054, count: 1, stddev: 0
define-dvec4-unop (defined in "this module")
total: 7300, mean: 730
direct: 7480, mean: 748, count: 10, stddev: 0
define/contract (defined in racket/contract/region)
total: 6666, mean: 44
direct: 3572, mean: 23, count: 153, stddev: 1.48
define-dvec4-binop (defined in "this module")
total: 6200, mean: 620
direct: 6380, mean: 638, count: 10, stddev: 0
...
Phase 1
for/list (defined in racket/private/for)
total: 6558, mean: 273
direct: 2274, mean: 95, count: 24, stddev: 14.94
for/fold/derived/final (defined in racket/private/for)
total: 4332, mean: 180
direct: 336, mean: 14, count: 24, stddev: 0
for/fold/derived (defined in racket/private/for)
total: 4284, mean: 178
direct: 240, mean: 10, count: 24, stddev: 0
for/foldX/derived (defined in racket/private/for)
total: 3996, mean: 24
direct: 3164, mean: 19, count: 170, stddev: 48.16
Wow, does that look like nearly 1000x compression? Three orders of magnitude seems right, given what I know about how these macros interact.
The "the-template" macro is defined inside a module generated by my custom #%module-begin. It defines 4 type-agnostic, fixed-length module templates (e.g., glm/vec4/template), which are instantiated once for each of the 5 scalar types. Those fixed-length module templates are based, in turn, on another module template (glm/vector/template) that takes a length argument and uses the other profiled macros (define-dvec4-unop, define/contract, define-dvec4-binop) to create 20 component-wise operations per instance. All together, that should inflate the size of the output to somewhere near the middle of the interval 4x20x5x[1,4], which is 1000.
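The back-of-the-envelope estimate above can be checked against the profiler's reported code sizes. This sketch only restates the arithmetic; the variable names are made up for illustration.

```racket
#lang racket/base
;; Arithmetic check only; the names below are illustrative.
(define base (* 4 20 5))              ; 4 templates x 20 operations x 5 scalar types
(define low  (* base 1))              ; lower end of the interval
(define high (* base 4))              ; upper end of the interval
(define midpoint (/ (+ low high) 2))  ; middle of the interval
(define observed (/ 86531.0 87))      ; final code size / initial code size
```

Here base is 400, the interval runs from 400 to 1600, and its midpoint is 1000, while the observed ratio from the profiler output is a little under 995.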
At phase 1, the comprehension forms are busy churning out component aliases and unrolling component-wise operations at "compile" time. I'm reluctant to anti-inline these because they keep the written code small and the generated code fast.
I guess the next step is to anti-inline define-dvec4-unop and define-dvec4-binop, maybe eliminate some define/contract's, and re-profile.
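As a rough illustration of what anti-inlining could look like here, the macro can expand into a short call to a shared run-time helper instead of repeating the component-wise plumbing at every use. This is a sketch under assumptions: the real define-dvec4-unop takes arguments not shown in this post, so the two-argument form, the helper dvec4-map1, and the operation names below are all hypothetical.

```racket
#lang racket/base
;; Hypothetical sketch: the macro's argument shape, the helper dvec4-map1,
;; and the names dvec4-abs and dvec4-sqrt are assumptions.
(require racket/flonum)

(struct dvec4 (x y z w))

;; Shared run-time helper: the component-wise plumbing is compiled once.
(define (dvec4-map1 f v)
  (dvec4 (f (dvec4-x v)) (f (dvec4-y v)) (f (dvec4-z v)) (f (dvec4-w v))))

;; Each use now expands to a single small definition that calls the helper.
(define-syntax-rule (define-dvec4-unop name op)
  (define (name v) (dvec4-map1 op v)))

(define-dvec4-unop dvec4-abs flabs)
(define-dvec4-unop dvec4-sqrt flsqrt)

(define roots (dvec4-sqrt (dvec4 4.0 9.0 16.0 25.0)))
(define mags  (dvec4-abs (dvec4 -1.0 -2.0 3.0 4.0)))
```

The trade-off is that op is now passed as a first-class function, so whether the compiler recovers the speed of the unrolled version is something only re-profiling and benchmarking would show.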
Eric