This will be tremendous, Dahua. Interestingly, a lot of this stuff is not actually that Julia-specific, but is what "the pros" do in C/Fortran. In most high-level languages, the *only* chance of writing fast code is to vectorize as much as possible and then rely on someone else having already done this low-level work in the code that implements the vectorized operations. The unusual combination here is that in Julia you're writing high-level code, yet it still makes sense to apply low-level optimization approaches.