Just to confirm what I think you're saying:
- Are you exporting two different LUNs (SCST devices mapped as LUNs), or the same SCST device (LUN) to both initiators?
- If the LUNs are different SCST devices, are they the same type (e.g., both vdisk_fileio pointing to similar block devices, or both vdisk_fileio pointing to virtual disk files)?
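For reference, here's roughly what I'd expect a comparable two-LUN setup to look like in /etc/scst.conf (device names, backing paths, and the target IQN are placeholders, not your actual config):

```
HANDLER vdisk_fileio {
        # Two devices of the same type, backed by similar storage,
        # so the only variable left is the initiator side.
        DEVICE disk_esxi {
                filename /dev/vg0/lv_esxi
        }
        DEVICE disk_guest {
                filename /dev/vg0/lv_guest
        }
}

TARGET_DRIVER iscsi {
        TARGET iqn.2012-01.com.example:storage {
                LUN 0 disk_esxi
                LUN 1 disk_guest
                enabled 1
        }
}
```

If your config differs from that shape (different handlers, or one LUN shared by both initiators), that alone could explain part of the gap.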
If both cases use the same back-end storage + SCST configuration, then you're saying the only difference is: the ESXi iSCSI initiator, on which you put a VMFS volume and then create a VMDK file, vs. the Windows iSCSI initiator inside the guest OS, getting an iSCSI volume that way and putting NTFS on it?
If that is what you're saying, I believe the SCST documentation has a blurb about this exact scenario, and it identifies using an iSCSI initiator directly inside the guest OS as always the top performer. Your gap seems unusually large, but there is a decent amount of overhead in the first scenario (VMFS + VMDK). I wonder if you ran the same benchmark with multiple test VMs on top of the VMFS volume (not direct from inside the guest), whether you could squeeze more performance out overall (not necessarily per VM). In any performance benchmarks I've seen with VMware ESXi, they never use just one VM to push numbers; they run a number of test instances and measure the performance as a whole.
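If you want to try the aggregate test, something like this fio job, run simultaneously in each test VM, would keep the workloads comparable (the filename and sizes here are placeholders; adjust to whatever test disk each VM sees):

```
[global]
ioengine=libaio
direct=1
bs=64k
iodepth=16
runtime=60
time_based

[vmfs-test]
rw=randrw
rwmixread=70
size=4g
filename=/dev/sdb   ; placeholder: the VMDK-backed test disk in each VM
```

Then sum the per-VM throughput to get the aggregate number for the VMFS scenario, rather than comparing single-VM results.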
That being said, maybe the tuning you're looking for isn't in storage parameters per se, but in the resource scheduling done in the hypervisor -- maybe it's intentionally throttling the one VM so there isn't a chance for contention? Just guesses, I don't know anything for sure.
--Marc