I want to give some feedback in case more people run into this. We have figured out most of the problems with the memory bandwidth.
We first optimized our streaming benchmark. I assume streaming is the most common data access pattern, and one that contemporary CPUs handle without any bandwidth problems. Unrolling our streaming loops to a width of 8 increased our DDR3 bandwidth to 2.6 GiB/s (see picture below).
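A minimal sketch of what a load-streaming loop unrolled to a width of 8 could look like in C. This is an illustration, not the benchmark's actual code: the function name `stream_sum_unrolled8` and the choice of a summing kernel are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical load-streaming kernel: sums n 64-bit words,
   issuing eight independent loads per loop iteration. */
uint64_t stream_sum_unrolled8(const uint64_t *src, size_t n)
{
    assert(src != NULL || n == 0);

    /* Eight separate accumulators, so no iteration-internal dependency
       chain serializes the loads. */
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    uint64_t s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    size_t i = 0;

    /* Main body: unrolled to a width of 8. */
    for (; i + 8 <= n; i += 8) {
        s0 += src[i + 0];
        s1 += src[i + 1];
        s2 += src[i + 2];
        s3 += src[i + 3];
        s4 += src[i + 4];
        s5 += src[i + 5];
        s6 += src[i + 6];
        s7 += src[i + 7];
    }

    /* Remainder: up to 7 trailing elements. */
    for (; i < n; i++) {
        s0 += src[i];
    }

    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```

The point of the unrolling is to expose more independent loads to the out-of-order core per loop iteration, so that more cache misses can be outstanding at once.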

But one can see that the bandwidth is still extremely low beyond the L1. So we went with a configuration that removes everything below the L1 (DDR3, L3 and L2), and the bandwidth beyond the L1 increased to 4 GB/s, showing that there are more problems to solve than just the memory bus or the L3.
The first and most important problem was that the MegaBoom (4-wide) configuration is simply too small to saturate the MSHRs of the L1 data cache. We have 8 MSHRs in the L1d and 32 STQ/LDQ entries, but a streaming benchmark can use at most 4 MSHRs in this configuration. So we increased the resources of the MegaBoom configuration by 50% - 100% (issue queues, ROB, LDQ, STQ...) until all MSHRs of the L1d were used. We also applied a fix in the BOOM for a bug that prevented the use of the last LDQ/STQ entry (you can find the fix in the open issues on the BOOM GitHub) and made a fix in the MSHRs that allows reading and writing the MSHR line buffer at the same time (credits to David Metz).
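Why the number of MSHRs in use matters can be sketched with Little's law: the sustainable miss bandwidth is bounded by the number of misses in flight times the line size, divided by the miss latency. The function below is only an illustration; its name is hypothetical, and the line size and latency passed to it are assumptions that have to come from your own system, not values measured here.

```c
#include <assert.h>

/* Upper bound on miss bandwidth in bytes per nanosecond (numerically
   equal to GB/s), from Little's law:
   bandwidth = misses in flight * line size / miss latency.
   All inputs are assumptions supplied by the caller. */
double mlp_bandwidth_bound(unsigned misses_in_flight,
                           unsigned line_bytes,
                           double miss_latency_ns)
{
    assert(miss_latency_ns > 0.0);
    return (double)misses_in_flight * (double)line_bytes / miss_latency_ns;
}
```

For any fixed latency and line size, going from 4 to 8 misses in flight doubles this bound, which is why a core that can only keep 4 of the 8 MSHRs busy leaves bandwidth on the table.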
While all this improved the bandwidth a bit, it was still not enough. We also widened the memory bus to 128 bit, which in my eyes should actually be the default, since the double data rate of DDR3 cannot be sustained by a 64 bit memory bus that only transfers data on a single clock edge. The next problem was that the clock domain crossing buffers were causing backpressure, so we enlarged them as well. And lastly, the coherence trackers were also not sufficient: waveforms showed that around 9-10 are in use when all MSHRs of the L1d are fully used. With all this the bandwidth curve looks like this:

Lastly we disabled the L3, which, as you said, David, is not ideal in this scenario:

This is up to 68% of the theoretical load bandwidth of the 128 bit memory bus.
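For reference, a small sketch of how such a utilization figure can be computed. It assumes the bus moves one full-width beat per clock cycle (single clock edge); the function names are hypothetical and the clock frequency is a parameter you have to fill in for your own design.

```c
#include <assert.h>

/* Theoretical peak in bytes per second of a bus that transfers
   bus_bits once per clock cycle (assumption: one beat per cycle). */
double bus_peak_bytes_per_s(unsigned bus_bits, double clock_hz)
{
    assert(bus_bits % 8 == 0);
    return (double)(bus_bits / 8) * clock_hz;
}

/* Fraction of the theoretical peak reached by a measured bandwidth. */
double bus_utilization(double measured_bytes_per_s,
                       unsigned bus_bits,
                       double clock_hz)
{
    double peak = bus_peak_bytes_per_s(bus_bits, clock_hz);
    assert(peak > 0.0);
    return measured_bytes_per_s / peak;
}
```

At the same clock, a 128 bit bus has twice the peak of a 64 bit bus, so the same measured bandwidth corresponds to half the utilization.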
Cheers,
Björn