Re: LIO & SCST Convergence

Bart Van Assche

Nov 8, 2008, 2:29:51 PM
to Nicholas A. Bellinger, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel, Linux-iSCSI.org Target Dev
(cc-ed SCST and LIO mailing lists)

On Sat, Nov 8, 2008 at 12:56 AM, Nicholas A. Bellinger
<n...@linux-iscsi.org> wrote:
> Some of them just happen to be the most advanced 10 Gb/sec networking
> and virtualization ASICs on the planet, that I will be porting LIO DDP
> (and other) code to using the true zero-copy memory mapping algorithms
> and ERL=2 code from years past that have set multi-platform records and
> whitepapers, etc.

Please explain what you mean by "zero-copy memory mapping". I agree
that not copying data (zero-copy) in a storage target is desirable.
But "memory mapping" or any other manipulation of the CPU's TLB must
definitely be avoided while transferring data, because it introduces
more latency than is acceptable.

Bart.

Bart Van Assche

Nov 9, 2008, 7:57:24 AM
to Nicholas A. Bellinger, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel, Linux-iSCSI.org Target Dev

I'll try to clarify myself a little bit further. Zero-copy means that
data is copied as few times as possible, both in the initiator and in
the target system. What follows is based on these assumptions:
* The code paths on the target that perform I/O are implemented in the
kernel, not in user space (as is the case for SCST, LIO and IET, but
not for STGT).
* The initiator kernel driver enables access to a block device on the
target system, and imports the target data as a block device on the
initiator system.
* The approach must also work for storage devices that are several
terabytes large.

At the initiator side, user space processes have the following options
to perform zero-copy I/O:
1. Mapping the initiator-side block device in memory through mmap().
2. Using direct I/O to access the initiator-side block device.
3. Using the splice() or tee() system calls for copying data between
two file descriptors without copying the data between kernel space and
user space (sketched right after this list). Note: sendfile() is
implemented in terms of splice().
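To make option (3) concrete, here is a minimal userspace sketch. It is
an illustration only, not code from any of the targets discussed: it
relays data from one file descriptor to another through a pipe, so the
payload never passes through a userspace buffer.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Relay 'len' bytes from in_fd to out_fd through a pipe. The payload
 * stays in kernel buffers the whole time; userspace never touches it.
 * Error handling is reduced to the bare minimum. */
static int relay(int in_fd, int out_fd, size_t len)
{
        int pipefd[2];

        if (pipe(pipefd) < 0)
                return -1;

        while (len > 0) {
                /* Pull up to 64 KB from the source into the pipe. */
                ssize_t n = splice(in_fd, NULL, pipefd[1], NULL,
                                   len < 65536 ? len : 65536,
                                   SPLICE_F_MOVE);
                ssize_t pushed = 0;

                if (n <= 0)
                        break;
                /* Push the same bytes from the pipe to the destination. */
                while (pushed < n) {
                        ssize_t m = splice(pipefd[0], NULL, out_fd, NULL,
                                           n - pushed, SPLICE_F_MOVE);
                        if (m <= 0)
                                goto out;
                        pushed += m;
                }
                len -= n;
        }
out:
        close(pipefd[0]);
        close(pipefd[1]);
        return len ? -1 : 0;
}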

Approach (2) is taken, among others, by database software running in
user space on the initiator system. In this scenario, the initiator-side
kernel driver has to call get_user_pages() before it can do anything
with the user space buffer. This call faults in and pins the relevant
user pages. These user pages can then be transferred via DMA through
an Ethernet NIC or via RDMA through an InfiniBand or iWARP NIC. I am
assuming here that an IOMMU is present, since an IOMMU is required to
transfer, in a single DMA operation, buffers that span multiple virtual
memory pages which are contiguous in virtual memory but noncontiguous
in physical memory.
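As an illustration of what the initiator-side driver has to do for
approach (2), here is a rough kernel-style sketch. Take it as a sketch
under assumptions: the helper is hypothetical, and the exact
get_user_pages() variant and its flags differ between kernel versions.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* Pin the user pages behind an O_DIRECT buffer and describe them as a
 * scatterlist, so a NIC or HBA can DMA straight from or to them.
 * get_user_pages_fast() is shown with its current signature; older
 * kernels spell this call differently. */
static int pin_user_buffer(unsigned long uaddr, size_t len, bool write,
                           struct scatterlist *sgl, struct page **pages)
{
        unsigned int off = offset_in_page(uaddr);
        int npages = DIV_ROUND_UP(off + len, PAGE_SIZE);
        int i, pinned;

        /* Fault in and pin the pages backing [uaddr, uaddr + len). */
        pinned = get_user_pages_fast(uaddr, npages,
                                     write ? FOLL_WRITE : 0, pages);
        if (pinned < npages)
                return -EFAULT; /* real code would release what it pinned */

        sg_init_table(sgl, npages);
        for (i = 0; i < npages; i++) {
                unsigned int seg = min_t(size_t, len, PAGE_SIZE - off);

                sg_set_page(&sgl[i], pages[i], seg, off);
                len -= seg;
                off = 0;
        }
        /* sgl can now go to dma_map_sg() and on to the (R)NIC or HBA. */
        return npages;
}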

At the target side, all data transfers happen inside the kernel, which
makes things easier. The following is needed at the target to implement
zero-copy:
- The storage target implementation allocates sufficiently large
buffers. These buffers have to be contiguous in the kernel's virtual
address space but may be noncontiguous in physical memory. I'm again
assuming that an IOMMU is present. (A sketch of such a buffer follows
after this list.)
- The NIC supports (R)DMA, and the storage devices support DMA too.
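The sketch: a buffer that is contiguous in the kernel's virtual address
space but not in physical memory can be obtained with vmalloc(), and its
individual pages can still be described to a DMA engine through a
scatterlist. This is an illustration only, not SCST or LIO code, and the
helper name is made up.

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
#include <linux/vmalloc.h>

/* Expose a vmalloc()ed buffer (virtually contiguous, possibly
 * physically fragmented) as a scatterlist, so each physical page can be
 * handed to a DMA engine individually. */
static struct scatterlist *vbuf_to_sgl(void *vbuf, size_t len, int *nents)
{
        int npages = DIV_ROUND_UP(len, PAGE_SIZE);
        struct scatterlist *sgl;
        int i;

        sgl = kmalloc(npages * sizeof(*sgl), GFP_KERNEL);
        if (!sgl)
                return NULL;

        sg_init_table(sgl, npages);
        for (i = 0; i < npages; i++) {
                /* Resolve each virtual page to its physical page. */
                struct page *pg =
                        vmalloc_to_page((char *)vbuf + i * PAGE_SIZE);
                unsigned int seg = min_t(size_t, len - i * PAGE_SIZE,
                                         PAGE_SIZE);

                sg_set_page(&sgl[i], pg, seg, 0);
        }
        *nents = npages;
        return sgl;
}

/* Usage, roughly:
 *     void *buf = vmalloc(65536);
 *     int nents;
 *     struct scatterlist *sgl = vbuf_to_sgl(buf, 65536, &nents);
 *     followed by dma_map_sg() and posting to the NIC / HBA. */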

Regarding buffer sizes: measurements have shown that a buffer size of
64 KB is sufficient to utilize more than 90% of the bandwidth
available on a 10 Gb/s InfiniBand network ([3]).

A big question is whether the initiator and target kernel code should
allocate buffers for RDMA once, or whether these should be allocated
dynamically each time I/O happens. The latter scenario allows zero-copy
I/O, but there is a high overhead associated with setting up RDMA
buffers. The former removes significant overhead from the I/O path but
necessitates copying data.

We should keep the following in mind (a quote from [2]):
...
That said, it is important to recognize that direct I/O does not
always provide the performance boost that one might expect. The
overhead of setting up direct I/O can be significant, and the benefits
of buffered I/O are lost.
...

In other words: any decision about whether to use zero-copy I/O or
buffered I/O should be based on measurements, not on theoretical
arguments.

See also:
[1] Zero Copy I: User-Mode Perspective, Linux Journal, January 2003,
http://www.linuxjournal.com/article/6345
[2] Jonathan Corbet et al., Linux Device Drivers, Third Edition, 2005,
http://lwn.net/Kernel/LDD3/.
[3] Jiuxing Liu, High Performance VMM-Bypass I/O in Virtual Machines,
USENIX 2006, http://www.usenix.org/events/usenix06/tech/full_papers/liu/liu_html/index.html

Bart.

Vladislav Bolkhovitin

Nov 10, 2008, 8:02:40 AM
to Bart Van Assche, Nicholas A. Bellinger, Rafiu Fakunle, scst-devel, Linux-iSCSI.org Target Dev, Vu Pham

Bart,

Don't let Nicholas mislead you. He uses the term "zero-copy" in a
rather misleading marketing manner, to make people think that only he
has the "real", "right" thing and that everybody else is bad and copies
data somewhere.

First, as I already wrote, RDMA memory registration and zero-copy
caching are fundamentally incompatible, unless you are able to register
all the system memory. This is because once data has been written into
a registered chunk of memory, it must be:

- Either copied to the cache, so that the registered chunk can be used
again,

- Or unregistered and left in the cache in a zero-copy manner. Then,
for the next transfer, a new memory chunk has to be registered again.

Second, SCST always and everywhere uses zero-copy, except for buffered
IO, and I wrote up a simple approach for eliminating that memory copy on
the "Contributing" page (http://scst.sourceforge.net/contributing.html).
For READ it is almost trivial.

Third, the current deficiencies in SCST memory management can be
addressed in a few days of work. Namely:

- Memory registration support in the SGV cache has not been
implemented, because before doing anything I want to hear requirements
from somebody practically working in this area who is also able to
test the new interface. This interface was designed long ago and is
simply waiting for its time. I asked Vu if he is interested in such an
interface and he replied that the current SCST memory management
features are sufficient to get maximum performance from his SRPT driver.

- The order 0 allocations, which have issues in pass-through mode, can
also be simply addressed. See the "Contributing" page
(http://scst.sourceforge.net/contributing.html) for a possible patch.

- I believe support for "page_link" isn't worth the effort in a target
application. If someone is interested, I can elaborate on why.

Nicholas A. Bellinger

Nov 10, 2008, 2:51:49 PM
to linux-iscsi...@googlegroups.com, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel
On Sun, 2008-11-09 at 13:57 +0100, Bart Van Assche wrote:
> On Sat, Nov 8, 2008 at 8:29 PM, Bart Van Assche
> <bart.va...@gmail.com> wrote:
> > (cc-ed SCST and LIO mailing lists)
> >
> > On Sat, Nov 8, 2008 at 12:56 AM, Nicholas A. Bellinger
> > <n...@linux-iscsi.org> wrote:
> >> Some of them just happen to be the most advanced 10 Gb/sec networking
> >> and virtualization ASICs on the planet, that I will be porting LIO DDP
> >> (and other) code to using the true zero-copy memory mapping algorithms
> >> and ERL=2 code from years past that have set multi-platform records and
> >> whitepapers, etc.
> >
> > Please explain what you mean with "zero-copy memory mapping". I agree
> > that not copying data (zero-copy) in a storage target is desirable.
> > But "memory mapping" or any other manipulation of the CPU's TLB must
> > definitely be avoided while transferring data, because it introduces
> > more latency than acceptable.
>
> I'll try to clarify myself a little bit further. Zero-copy means that
> data is copied as few times as possible,

Ok, that is where we differ in terms. When I say zero-copy, I mean
there is *NO* data copy, i.e. the physical memory segments (struct
page) get copied ZERO times, not "as few times as possible".

I call that doing a memory copy one (or more) times, whether by the
APIs you are using or explicitly with memcpy().

Let's please leave the initiator side out of the discussion for now..

> At the target side, all data transfers happens inside the kernel which
> makes things easier. The following is needed at the target to
> implement zero-copying:
> - The storage target implementation allocates sufficiently large
> buffers. These buffers have to be contiguous in the kernel's virtual
> address space but may be noncontiguous in physical memory. I'm again
> assuming that an IOMMU is present.
> - The NIC supports (R)DMA, and the storage devices support DMA too.
>

Ok, in the true zero-copy case that I implemented with
target_core_mod, the generic target engine can accept a linked list of
struct page (i.e. physical memory segments) whose pointers are then set
into struct scatterlist->page_link and sent down to Linux/SCSI and
Linux/BLOCK.

The FILEIO case (be it through kernel or userspace) is different,
because it uses virtual memory addresses (as you mention). However, the
source/destination for WRITE/READ ops is still the physical memory
holding the validated packets (from the RNIC), converted to virtual
memory addresses in struct iovec for the struct file_operations API.

Anyway, let's keep the discussion on how true DDP zero-copy works
between the different subsystems (Linux/SCSI, Linux/BLOCK and Linux/VFS)
on a *SINGLE* codepath. This is the design of target_core_mod in
lio-core-2.6.git on kernel.org.

> Regarding buffer sizes: measurements have shown that a buffer size of
> 64 KB is sufficient to utilize more than 90% of the bandwidth
> available on an 10 Gb/s InfiniBand network ([3]).
>

Would you mind being more specific about what you mean by "buffer
size"? Do you mean the SCSI CDB LBA count (i.e. sector_count *
sector_size) or the RCaP fabric's max request size..?

> A big question is whether the initiator and target kernel code should
> allocate buffers for RDMA once, or that these should be allocated
> dynamically each time I/O happens. The last scenario allows zero-copy
> I/O but there is a high overhead associated with setting up RDMA
> buffers. The first scenario removes significant overhead from the I/O
> path but necessitates copying data.
>

Again, it all comes down to the following with the code in
lio-core-2.6.git:

*) target_core_mod has a single code path to handle points I, II, III
and IV from my list.
*) target_core_mod uses that same code path, but presents an API that
accepts a linked list of physical memory segments (struct page), whose
pointers target_core_mod then sets into struct scatterlist->page_link,
which is then sent down to Linux/SCSI (accepts struct
scatterlist->page_link), Linux/BLOCK (accepts struct
scatterlist->page_link) or Linux/VFS (converted to virtual memory
addresses; can be buffered or O_DIRECT). A rough Linux/BLOCK sketch
follows below.
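The sketch, against the 2.6-era block API and simplified for this
thread (it is not a copy of target_core_mod code; the function name and
parameters are made up):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* Take pages that an RNIC has already filled with validated DDP payload
 * and hand them to a struct block_device with zero intermediate copies.
 * A real submitter would also set bio->bi_end_io and handle queue-limit
 * splitting. */
static int submit_pages_to_bdev(struct block_device *bdev, sector_t lba,
                                struct page **pages, int npages, int rw)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, npages);
        int i;

        if (!bio)
                return -ENOMEM;

        bio->bi_bdev = bdev;
        bio->bi_sector = lba;   /* bi_iter.bi_sector on later kernels */

        for (i = 0; i < npages; i++) {
                /* bio_add_page() only hangs the page off a bio_vec;
                 * no data is copied here. */
                if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) != PAGE_SIZE)
                        break;  /* hit a queue limit; sketch stops here */
        }

        submit_bio(rw, bio);    /* 2.6-era two-argument form */
        return 0;
}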


> We should keep the following in mind (a quote from [2]):
> ...
> That said, it is important to recognize that direct I/O does not
> always provide the performance boost that one might expect. The
> overhead of setting up direct I/O can be significant, and the benefits
> of buffered I/O are lost.
> ...
>

Obviously, the kernel-level O_DIRECT Linux/VFS FILEIO target_core_mod
case needs the changes to fs/dio.c, but that patch exists *OUTSIDE* of
target_core_mod.

Anyway, Jonathan is talking about O_DIRECT to a struct file here, not
about struct pages being directly submitted to Linux/SCSI via struct
scatterlist and to Linux/BLOCK via struct bio_vec.

Anyway, I think it's about time to start adding these points to the
Linux-iSCSI.org wiki for the 10 Gb/sec DDP RFC. Thanks for the
references..

--nab

Bart Van Assche

Nov 11, 2008, 5:05:19 AM
to linux-iscsi...@googlegroups.com, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel
On Mon, Nov 10, 2008 at 8:51 PM, Nicholas A. Bellinger
<n...@linux-iscsi.org> wrote:
> > I'll try to clarify myself a little bit further. Zero-copy means that
> > data is copied as few times as possible,
>
> Ok, that is where we differ in terms. When I say zero-copy, I mean
> there is *NO* data copy, i.e. the physical memory segments (struct
> page) get copied ZERO times, not "as few times as possible".

It depends on what you count as a copy and what you do not. Zero-copy
algorithms always create at least one copy of the data in memory. Look
e.g. at the system call sendfile(), which allows transferring disk data
over the network. Before sendfile() can hand the data over to the
network driver, the data has to be copied from disk to kernel memory,
e.g. via DMA. In other words: the disk data is not transferred directly
from the disk controller to the network controller, but one copy of the
data is made in memory in order to make the transfer possible. Yet this
is called a zero-copy transfer.

> Aways, lets keep the discussion between how true DDP zero-copy works
> between the different subsystems (Linux/SCSI, Linux/BLOCK and Linux/VFS)
> on a *SINGLE* codepath. This is the design of target_core_mod in
> lio-core-2.6.git on kernel.org.

Memory usage by a storage target implementation can be very dynamic.
With current network and storage technologies it is easy to build a
storage target system where data can be transferred faster over the
network interface than to the persistent storage inside the storage
target. When initiators are writing data faster than the target can
store it, memory consumption of the storage target will increase
rapidly. It is e.g. possible that half of the memory of the storage
target ends up being used for data buffers. Using zero-copy I/O in a
storage target system with an RDMA NIC implies the following:
* Before the storage target receives any data from the initiator, a
new buffer is allocated (e.g. 64 KB in size).
* This buffer has to be registered as an RDMA buffer.
* Now the target can receive data from the initiator through RDMA.
* The target can now transfer the received data from the RDMA buffer
through DMA to persistent storage.
* Once the DMA has finished, the buffer can be unregistered and
deallocated.
The problem with the above approach is that RDMA registration is
expensive. According to the articles [1] and [2], the time needed for
registration and deregistration is 10 microseconds or more. This is
much more than the total time needed by SCST to send a single 512-byte
message through direct I/O over an SRP link (1.2 microseconds, see
also [3]). Since many applications transfer data in small blocks
(databases: 4 KB, filesystems: 4 KB - 64 KB), this approach would kill
the performance of a storage target.

A similar problem has been encountered by the people that implemented
RDMA in MPI [1].
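The registration cost is easy to see from userspace with the verbs
API. Below is a small sketch that measures the register/deregister
round trip for a 64 KB buffer; it is only an illustration (error
checking omitted, link with -libverbs -lrt), and the kernel verbs have
their own equivalents of these calls.

#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
        /* Error checking omitted: assumes at least one RDMA device. */
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        void *buf = malloc(65536);
        struct timespec t0, t1;
        int i, iters = 1000;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++) {
                /* The pattern a per-I/O registration scheme would use. */
                struct ibv_mr *mr = ibv_reg_mr(pd, buf, 65536,
                                               IBV_ACCESS_LOCAL_WRITE |
                                               IBV_ACCESS_REMOTE_WRITE);
                ibv_dereg_mr(mr);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("reg + dereg: %.1f us per iteration\n",
               ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3) / iters);

        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
}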

> > Regarding buffer sizes: measurements have shown that a buffer size of
> > 64 KB is sufficient to utilize more than 90% of the bandwidth
> > available on an 10 Gb/s InfiniBand network ([3]).
>
> Would you mind to be more specific on what you mean by "buffer size"?
> Do you mean the SCSI CDB LBA Count..? (eg: sector_count * sector_size)
> or RCaP fabric's max request size..?

I was referring to the size of the data blocks passed at once to the
RDMA API (verbs).

See also:

[1] Patrick Geoffray (Myricom), A Critique of RDMA, HPCwire, 2006,
http://www.hpcwire.com/features/17886984.html.
[2] Ariel Cohen (Topspin Communications), A Performance Analysis of 4X
InfiniBand Data Transfer Operations, Proceedings of the 17th
International Symposium on Parallel and Distributed Processing, 2003,
http://nowlab.cse.ohio-state.edu/cac/cac03_slides/cohen.pdf.
[3] Bart Van Assche, Performance of SCST versus STGT, January 2008,
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.

Bart.

Nicholas A. Bellinger

Nov 11, 2008, 11:47:48 PM
to linux-iscsi...@googlegroups.com, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel
On Tue, 2008-11-11 at 11:05 +0100, Bart Van Assche wrote:
> On Mon, Nov 10, 2008 at 8:51 PM, Nicholas A. Bellinger
> <n...@linux-iscsi.org> wrote:
> > > I'll try to clarify myself a little bit further. Zero-copy means that
> > > data is copied as few times as possible,
> >
> > Ok, that is where we differ in terms. When I say zero-copy, I mean
> > there is *NO* data copy, i.e. the physical memory segments (struct
> > page) get copied ZERO times, not "as few times as possible".
>
> It depends what you count as a copy and what you do not count as a
> copy. Zero-copy algorithms always create at least one copy of the data
> in memory.

Obviously, otherwise you would be copying empty memory and data back
and forth, i.e. you would have no Data Source.

> Look e.g. at the system call sendfile(), which allows to
> transfer disk data over the network. Before sendfile() can hand over
> the data to the network driver, the data has to be copied from disk to
> kernel memory, e.g. via DMA. Or: the disk data is not transferred
> directly from the disk controller to the network controller, but one
> copy of the data is made in memory in order to make the transfer
> possible. Yet this is called a zero-copy transfer.

So yeah, this is one simple example of using zero-copy on the packet
transmit (TX) path (i.e. SCSI READ) from Linux/VFS (FILEIO) and struct
file.

>
> > Aways, lets keep the discussion between how true DDP zero-copy works
> > between the different subsystems (Linux/SCSI, Linux/BLOCK and Linux/VFS)
> > on a *SINGLE* codepath. This is the design of target_core_mod in
> > lio-core-2.6.git on kernel.org.
>
> Memory usage by a storage target implementation can be very dynamic.
> With current network and storage technologies it is easy to build a
> storage target system where the data can be transferred faster over
> the network interface than to the persistent storage inside the
> storage target.

Obviously with traditional magnetic disks this is the case; however,
this is quickly changing as the 2nd generation solid state disks (SSDs)
start to ship. These are the ones that are not limited by the flash
bank alignment issues that a lot of the 1st generation designs appear
to have (especially if you are using a dumb filesystem..).

Suddenly, SSDs present the system with 50,000 IOPs *PER* SSD disk and
serious streaming READ/WRITE bandwidth. What happens when you put 24 of
these drives on a PCI-e adapter with a multi-port 10 Gb/sec Ethernet
adapter capable of DDP..? Yes, memory copies become a big problem.

> When initiators are writing data faster than the
> target can store it, memory consumption of the storage target will
> increase rapidly. This way it is e.g. possible that half of the memory
> of the storage target is used for data buffers. Using zero-copy I/O in
> a storage target system with an RDMA NIC implies the following:

Are you referring to DDP (RFC-5041) or generically to an RCaP (RDMA
Capable fabric) when you say RDMA in your examples..?

> * Before the storage target receives any data from the initiator, a
> new buffer is allocated (e.g. 64 KB in size)
> * This buffer has to be registered as an RDMA buffer.

With RNIC hardware, you want to use contiguous segments of physical
memory (when available) for use with MMIO, or have the RCaP hardware
reserve the contiguous segments at boot before fragmentation occurs.
For the optimized software RNIC/iWARP stack case, this however gets
more interesting..

> * Now the target can receive data from the initiator through RDMA.
> * The target can now transfer the received data from the RDMA buffer
> through DMA to persistent storage.
> * Once DMA finished the buffer can be unregistered and deallocated.
> The problem with the above approach is that RDMA registration is
> expensive.

Bart, it is not as if the expensive memory registration is done for
each packet of bulk I/O that is pushed with DDP (RFC-5041) and iSER
(RFC-5046).. Do you think someone would pay $10k for a 10 Gb/sec RNIC
that required such expensive memreg ops to push bulk I/O..?

> According to the articles [1] and [2], the time needed for
> registration and deregistration is 10 microseconds or more. This is
> much more than the total time needed by SCST to send a single 512 byte
> message through direct I/O over an SRP link (1.2 microseconds, see
> also [3]).

Great!

> Since many applications transfer data in small blocks
> (databases: 4 KB, filesystems: 4 KB - 64 KB),

Obviously.

> this approach would kill
> the performance of a storage target.

Again, you are mixing up a bunch of different technical items in this
thread (RCaP fabrics vs. generic target engines, buffer size
limitations, how OFA verbs memory registration works, etc.). I
understand that you and some other folks are working with SRP (a
completely different protocol from iSER, btw) on IB hardware, and that
you have some experience in this field; that is fine.

But I have *NO* idea why you think memory registration for any type of
RDMA (code which lives outside of a generic target engine, btw) has any
effect on the set of algorithms I have detailed to you in I through V
for the target_core_mod upstream discussion. When target_core_mod
*ACCEPTS* physical memory segments from RDMA Capable fabric (RCaP)
hardware, the *POINTERS* to the physical memory segments are set into
the following v2.6 storage subsystem abstractions using a *SINGLE*
codepath for all real and virtual device types:

*) Linux/SCSI using struct scsi_device w/ struct scatterlist->page_link

*) Linux/BLOCK using struct block_device w/ struct bio_vec and
bio_add_page()

*) Linux/VFS using struct file w/ struct file_operations and physical ->
virtual memory conversion.

Again, my point this whole time has been that a generic target engine
MUST be able to accept physical memory segments (pointers to struct
page) containing the validated packets for RDMA READ (SCSI WRITE), or
pre-registered memory for outgoing RDMA WRITEs (SCSI READ), that will be
going down to one of the $STORAGE_OBJECTs listed above.

>
> A similar problem has been encountered by the people that implemented
> RDMA in MPI [1].

Doing RDMA via MPI or DAPL is dumb (and would never make it upstream,
because they are intended to be OS independent anyway). OFA verbs is
my choice for upstream work.

Also, please understand that the RDMA stuff is not the end-all for
upstream acceptance, but because of my experience in the 10 Gb/sec
Ethernet field, I have been thinking *VERY* hard about the issues
involved in generic target engine design.

--nab

Bart Van Assche

Nov 12, 2008, 6:03:19 AM
to linux-iscsi...@googlegroups.com, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel
On Wed, Nov 12, 2008 at 5:47 AM, Nicholas A. Bellinger
<n...@linux-iscsi.org> wrote:
>> When initiators are writing data faster than the
>> target can store it, memory consumption of the storage target will
>> increase rapidly. This way it is e.g. possible that half of the memory
>> of the storage target is used for data buffers. Using zero-copy I/O in
>> a storage target system with an RDMA NIC implies the following:
>
> Are you referring to DDP (RFC-5041) or generically to a RCaP (RDMA
> Capable fabric) when you say RDMA in your examples..?

I was using the abbreviation RDMA to refer to any network that
supports remote DMA between virtual memory addresses (see also
http://en.wikipedia.org/wiki/Virtual_Interface_Architecture).

> With RNIC hardware, you want to use contiguous segments of physical
> memory (when available) for use with MMIO, or have the RCaP hardware
> reserve the contiguous segments at boot before fragmentation occurs.
> For the optimized software RNIC/iWARP stack case, this however gets
> more interesting..

No. With RDMA you don't need physically contiguous segments --
segments that are contiguous in virtual memory are sufficient. This is
one of the important points of RDMA, since RDMA works with virtual
memory addresses. I already tried to explain this to you before, but
apparently you ignored it.

> Again, my point this whole time has been that a generic target engine
> MUST be able to accept physical memory segments (pointers to struct
> page) containing the validated packets for RDMA READ (SCSI WRITE), or
> pre-registered memory for outgoing RDMA WRITEs (SCSI READ) that will be
> going down to one of the $STORAGE_OBJECTs listed above.

Again: segments that are contiguous in virtual memory are sufficient
in systems that are equipped with an IOMMU. Trying to allocate large
physically contiguous segments is asking for trouble -- how to avoid
fragmentation of physical memory in the Linux kernel is an active
research topic.

> Also, please understand that RDMA stuff is not the end-all for upstream
> acceptance, but because of my experience in the 10 Gb/sec Ethernet
> field, I have been thinking *VERY* hard about the issues involved for
> generic target engine design.

By the way, as the SRPT implementation in SCST shows, zero-copy RDMA
is already possible today with the SCST storage target framework.

Bart.

Nicholas A. Bellinger

Nov 12, 2008, 6:50:23 AM
to linux-iscsi...@googlegroups.com, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel
On Wed, 2008-11-12 at 12:03 +0100, Bart Van Assche wrote:
> On Wed, Nov 12, 2008 at 5:47 AM, Nicholas A. Bellinger
> <n...@linux-iscsi.org> wrote:
> >> When initiators are writing data faster than the
> >> target can store it, memory consumption of the storage target will
> >> increase rapidly. This way it is e.g. possible that half of the memory
> >> of the storage target is used for data buffers. Using zero-copy I/O in
> >> a storage target system with an RDMA NIC implies the following:
> >
> > Are you referring to DDP (RFC-5041) or generically to a RCaP (RDMA
> > Capable fabric) when you say RDMA in your examples..?
>
> I was using the abbreviation RDMA to refer to any network that
> supports remote DMA between virtual memory addresses (see also
> http://en.wikipedia.org/wiki/Virtual_Interface_Architecture).
>

Again, we are talking about very different things here. I am concerned
with *KERNEL* level zero-copy between RCaP hardware, the different
Linux storage subsystems and generic target engine infrastructure whose
algorithms accept pre-registered physical memory segments for true
zero-copy ops using a *SINGLE* codepath to all possible Linux storage
subsystems.

Your link for VIA above is about *USERSPACE*, again a very different
topic from zero-copy between RCaP and the Linux storage subsystems
through a set of target engine algorithms. Not to mention that your
link above is intended to be OS independent, just like MPI and DAPL.

What does an OS independent API have to do with the kernel discussion
at hand..?

> > With RNIC hardware, you want to use contiguous segments of physical
> > memory (when available) for use with MMIO, or have the RCaP hardware
> > reserve the contiguous segments at boot before fragmentation occurs.
> > For the optimized software RNIC/iWARP stack case, this however gets
> > more interesting..
>
> No. With RDMA you don't need physically contiguous segments --
> segments that are contiguous in virtual memory are sufficient. This is
> one of the important points of RDMA, since RDMA works with virtual
> memory addresses.

If you are talking about RDMA in the context of this VIA userspace
stuff, then yes.

If you think that RCaP hardware needs to hand virtual memory segments
down to the upstream Linux storage subsystems (which all, except for
FILEIO, use physical memory segments btw) you are *DEAD* wrong.

> I already tried to explain this to you before, but
> apparently you ignored it.
>

So, please stop throwing around *USERSPACE* API definitions that are
intended to be OS independent when we are talking about zero-copy of
kernel pages from RCaP hardware registered memory down to the Linux
storage subsystems. I don't know how much clearer I can make this to
you without you looking at the actual code.

> > Again, my point this whole time has been that a generic target engine
> > MUST be able to accept physical memory segments (pointers to struct
> > page) containing the validated packets for RDMA READ (SCSI WRITE), or
> > pre-registered memory for outgoing RDMA WRITEs (SCSI READ) that will be
> > going down to one of the $STORAGE_OBJECTs listed above.
>
> Again: segments that are contiguous in virtual memory are sufficient
> in systems that are equipped with an IOMMU. Trying to allocate large
> physically contiguous segments is asking for trouble -- how to avoid
> fragmentation of physical memory in the Linux kernel is an active
> research topic.
>

For some reason you are still stuck on virtual memory in userspace with
mmap().. We are talking about Linux kernel methods and IETF standards
here, not a userspace POSIX API definition, please..

> > Also, please understand that RDMA stuff is not the end-all for upstream
> > acceptance, but because of my experience in the 10 Gb/sec Ethernet
> > field, I have been thinking *VERY* hard about the issues involved for
> > generic target engine design.
>
> By the way, as the SRPT implementation in SCST shows, zero-copy RDMA
> is already possible today with the SCST storage target framework.
>

No, SCST core allows a $FABRIC_MOD (SRP over InfiniBand in your
example) to set up its own physical memory segments using a function
pointer in the SCST target template. You can't really call it "generic
target infrastructure" if every $FABRIC_MOD is expected to implement
this itself, now can you..? Not to mention that the SCST core
algorithms still lack III and IV from my checklist to handle greater
than PAGE_SIZE memory segments and received cdb sector count *
sector_size > $STORAGE_OBJECT->max_sectors.

So, you don't seem to think there is a problem, and Vlad says the
problem can be fixed in a few days.. In any event, you are more than
welcome to try to fix the basic design issues with SCST core's memory
mapping algorithms to achieve generic true zero-copy across all
$FABRIC_MODs and all Linux storage subsystems, but I find it
unacceptable for my purposes as is (which is how this whole thread
started btw, when Vlad asked me why I did not use SCST core as a base
for target_core_mod in lio-core-2.6.git). Regardless, I will move
forward with merging the parts of SCST core of my choosing and getting
the first non-iSCSI fabric on top of target_core_mod/ConfigFS v3.0.

There are leaders in both the LIO and SCST communities that have
contacted me privately expressing the desire to finally see this
happen.. Perhaps we can put our differences aside and work toward the
common goal..?

Cheers,

--nab


Vladislav Bolkhovitin

Nov 13, 2008, 7:14:02 AM
to Bart Van Assche, Vu Pham, Rafiu Fakunle, scst-devel, Linux-iSCSI.org Target Dev

I asked Vu to clarify and he explained to me that a kernel-space RDMA
target driver can simply register all the system memory at once and
then not bother with any memory registration/deregistration anymore.
This greatly simplifies the code, improves performance and makes
zero-copy with a data cache possible. This is one more advantage of a
target driver being in-kernel, which isn't available in user space.
User space RDMA applications, including target drivers, for various
reasons, including security, can't use such an approach and have to
(pre)register each data segment before use. People familiar with RDMA,
correct me if I'm wrong.

Thus, there is *NO* need for memory (pre)registration support in a
generic SCSI target subsystem *AT ALL*.
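For those who want to see what this looks like in code, below is a
rough sketch of the kind of one-time setup the in-kernel RDMA target
drivers of that era (SRPT, iSER) perform. It is written against the
2.6-era kernel verbs API, where ib_get_dma_mr() returned a memory
region covering all of physical memory; treat the exact calls as an
assumption, since this interface has changed in later kernels.

#include <linux/err.h>
#include <rdma/ib_verbs.h>

static struct ib_pd *pd;
static struct ib_mr *dma_mr;

/* One-time registration of all physical memory.  After this, every
 * data buffer can be posted for RDMA using dma_mr->lkey/rkey with no
 * further registration work in the I/O path. */
static int register_all_memory(struct ib_device *dev)
{
        pd = ib_alloc_pd(dev);
        if (IS_ERR(pd))
                return PTR_ERR(pd);

        /* Covers all of physical memory; only possible in the kernel. */
        dma_mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                                   IB_ACCESS_REMOTE_READ |
                                   IB_ACCESS_REMOTE_WRITE);
        if (IS_ERR(dma_mr))
                return PTR_ERR(dma_mr);

        /* dma_mr->lkey is now used for every ib_sge in the data path. */
        return 0;
}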

> - The order 0 allocations, which have issues in pass-through mode, can
> also be simply addressed. See the "Contributing" page
> (http://scst.sourceforge.net/contributing.html) for a possible patch.
>
> - I believe support for "page_link" isn't worth the effort in a target
> application. If someone is interested, I can elaborate on why.
>
>> See also:
>> [1] Zero Copy I: User-Mode Perspective, Linux Journal, January 2003,
>> http://www.linuxjournal.com/article/6345
>> [2] Jonathan Corbet e.a., Linux Device Drivers, Third Edition, 2005,
>> http://lwn.net/Kernel/LDD3/.
>> [3] Jiuxing Liu, High Performance VMM-Bypass I/O in Virtual Machines,
>> USENIX 2006, http://www.usenix.org/events/usenix06/tech/full_papers/liu/liu_html/index.html
>>
>> Bart.
>>
>
>


Nicholas A. Bellinger

Nov 13, 2008, 1:59:20 PM
to linux-iscsi...@googlegroups.com, Bart Van Assche, Vu Pham, Rafiu Fakunle, scst-devel

Obviously. Again, Vlad, here is the main point from my previous email
to Bart.

<SNIP>

But I have *NO* idea why you think memory registration for any type of
RDMA (code which lives outside of a generic target engine, btw) has any
effect on the set of algorithms I have detailed to you in I through V
for the target_core_mod upstream discussion. When target_core_mod
*ACCEPTS* physical memory segments from RDMA Capable fabric (RCaP)
hardware, the *POINTERS* to the physical memory segments are set into
the following v2.6 storage subsystem abstractions using a *SINGLE*
codepath for all real and virtual device types:

*) Linux/SCSI using struct scsi_device w/ struct scatterlist->page_link

*) Linux/BLOCK using struct block_device w/ struct bio_vec and
bio_add_page()

*) Linux/VFS using struct file w/ struct file_operations and physical ->
virtual memory conversion.

Again, my point this whole time has been that a generic target engine
MUST be able to accept physical memory segments (pointers to struct
page) containing the validated packets for RDMA READ (SCSI WRITE), or
pre-registered memory for outgoing RDMA WRITEs (SCSI READ), that will be
going down to one of the $STORAGE_OBJECTs listed above.

</SNIP>

--nab

Vladislav Bolkhovitin

Nov 13, 2008, 2:04:41 PM
to Nicholas A. Bellinger, linux-iscsi...@googlegroups.com, Rafiu Fakunle, Vu Pham, scst-devel

For what?

> </SNIP>
>
> --nab

Bart Van Assche

Nov 13, 2008, 2:37:07 PM
to linux-iscsi...@googlegroups.com, Vu Pham, Rafiu Fakunle, scst-devel
On Thu, Nov 13, 2008 at 7:59 PM, Nicholas A. Bellinger
<n...@linux-iscsi.org> wrote:
> Obviously. Again, Vlad, here is the main point from my previous email
> to Bart.

And the text you cited contains (for the third time) the erroneous
assumption that RDMA works with physical addresses. Why do you want to
make statements about RDMA if you don't understand some of the most
basic aspects of it?

Bart.

Nicholas A. Bellinger

Nov 13, 2008, 2:47:39 PM
to linux-iscsi...@googlegroups.com, Rafiu Fakunle, Vu Pham, scst-devel
What the lio-core-2.6.git design allows for (and what the whole point
of I -> V in my previous emails to you has been) is generic target
engine (target_core_mod) infrastructure that *ALL* $FABRIC_MODs can
take advantage of to reach upstream kernel storage subsystem
$STORAGE_OBJECTS with true zero-copy ops, for DDP (RFC-5041) capable IP
networks, other RDMA Capable (RCaP) fabrics, or whatever.

Perhaps it would make sense for you to comment on the technical issues
as I have listed them in the email to Bart, so we can all stay on the
same (physical memory segment) page? Replying to your own messages from
a different thread in emails with the same topic as what Bart and I
have been discussing is surely confusing for the technical folks
following along.

--nab

Nicholas A. Bellinger

Nov 13, 2008, 2:55:42 PM
to linux-iscsi...@googlegroups.com, Vu Pham, Rafiu Fakunle, scst-devel
Heh.. What type of memory do you think the storage subsystems in
linux/drivers/scsi and linux/block use..? Since it is my job to figure
out how to make things generic when going to *ALL* of the Linux
subsystems (not just to struct file in linux/fs), perhaps it would make
sense for you and Vlad to check out
lio-core-2.6.git/drivers/lio-core/target_core_*.c to see the real
issues involved with *ACCEPTING* physical memory *POINTERS* to struct
page.

Again, what type of memory do you think is registered with the RCaP
hardware when giving struct page into the generic target core as it
exists today in lio-core-2.6.git on kernel.org? Virtual memory
addresses..?

--nab

> Bart.
>
> >

Bart Van Assche

Nov 13, 2008, 3:09:35 PM
to linux-iscsi...@googlegroups.com, Vu Pham, Rafiu Fakunle, scst-devel, Vladislav Bolkhovitin
On Thu, Nov 13, 2008 at 8:55 PM, Nicholas A. Bellinger <n...@linux-iscsi.org> wrote:
>
> Again, what type of memory do you think is registered with the RCaP
> hardware when giving struct page into the generic target core as it
> exists today in lio-core-2.6.git on kernel.org? Virtual memory
> addresses..?

Both InfiniBand and iWARP are based on the VIA architecture. A comprehensive description of this architecture can be found in http://www.intel.com/intelpress/chapter-via.pdf. A quote from this document (page 10):

The VI provider is directly responsible for a number of functions
normally supplied by the operating system. The VI provider manages the
protected sharing of the network controller, virtual to physical
translation of buffer addresses, and the synchronization of completed
work via interrupts. The VI provider also provides a reliable transport
service, with the level of reliability depending upon the capabilities
of the underlying network.


Bart.

Nicholas A. Bellinger

Nov 13, 2008, 3:16:36 PM
to linux-iscsi...@googlegroups.com, Vu Pham, Rafiu Fakunle, scst-devel, Vladislav Bolkhovitin
Again, what does this have to do with the fact that Linux/SCSI and
Linux/BLOCK only accept *PHYSICAL* memory segments..? And that the
current work in lio-core-2.6.git also accepts *PHYSICAL* memory
segments..?

Can we please stay on topic with the generic target engine and how it
interfaces between $FABRIC_MOD and Linux storage subsystem memory..?

--nab

> Bart.
>
> >

Vladislav Bolkhovitin

Nov 14, 2008, 12:47:25 PM
to Nicholas A. Bellinger, linux-iscsi...@googlegroups.com, Rafiu Fakunle, scst-devel, Vu Pham

What's the point of adding something which is going to have no users?

Nicholas A. Bellinger

Nov 14, 2008, 3:40:12 PM
to Vladislav Bolkhovitin, linux-iscsi...@googlegroups.com, Rafiu Fakunle, scst-devel, Vu Pham


Unfortunately, I think that is where you are mistaken. The generic
target infrastructure memory algorithms (and the ability to accept
struct page generically) that I have been detailing in this thread are
*ALREADY* included in lio-core-2.6.git (v3.0) and in v2.9-STABLE. I do
not have to "add something", because this requirement has been part of
the design of the generic target engine from early on.

So again, the users of V) from my checklist are RCaP (RDMA Capable)
$FABRIC_MODs, because it reduces the complexity of implementing a
$FABRIC_MOD when all of the $FABRIC_MODs can use generic target
infrastructure and common code that applies to all existing Linux
storage subsystems.

In any event, since you and Bart seem not to want to acknowledge the
technical items for long-term upstream work that I have laid out for
the community, I will be refraining from this thread until you guys can
show me that you have actually looked at the code in question (before
responding) and understand and acknowledge the real issues at hand.
Until this happens and you can respond in a concrete and professional
manner, I feel my efforts at trying to educate you on these subjects
are falling on deaf ears.

--nab

Bart Van Assche

Nov 15, 2008, 2:57:55 AM
to linux-iscsi...@googlegroups.com, Vladislav Bolkhovitin, Rafiu Fakunle, scst-devel, Vu Pham
On Fri, Nov 14, 2008 at 9:40 PM, Nicholas A. Bellinger
<n...@linux-iscsi.org> wrote:
>
> I will be refraining from this thread until you guys can show
> me that you have actually looked at the code in question (before
> responding) and understand and acknowledge the real issues at hand.

I'm not going to spend any additional time looking at the LIO code.

Regarding the issue that was the start of this thread, namely
higher-order allocations in a storage target: I'm surprised that you
are insisting on this subject without showing any actual performance
measurements. All the time you are assuming that higher-order
allocations improve performance, but you have never proven it with
numbers. While hardware that has no built-in scatter-gather support may
perform I/O faster through higher-order allocations, allocation of
higher-order pages is slower than allocation of order zero pages.

Regarding SCST: SCST already clusters allocated pages itself into
higher-order pages. And making SCST allocate higher-order pages only
requires small and well-localized changes. A patch that implements
this can be downloaded from SCST's "Contributing" page.

Bart.

Vladislav Bolkhovitin

Nov 15, 2008, 5:32:18 AM
to Nicholas A. Bellinger, Vu Pham, Rafiu Fakunle, scst-devel, linux-iscsi...@googlegroups.com

OK, then, what's the point of having something which doesn't have, and
is not going to have, any users?

> So again, the users of V) from my checklist are RCaP (RDMA Capable)
> $FABRIC_MODs, because it reduces the complexity of implementing
> $FABRIC_MOD when all of the $FABRIC_MODs can use generic target
> infrastructure and common code that applies to all existing Linux storage
> subsystems.
>
> In any event, since you and Bart seem to not want to acknowledge the
> technical items for long term upstream work that I have laid out for the
> community, I will be refraining from this thread until you guys can show
> me that you have actually looked at the code in question (before
> responding) and understand and acknowledge the real issues at hand.
> Until this happens and you can respond in a concrete and professional
> manner, I feel my efforts at trying to educate you on these subjects are
> falling on deaf ears.
>
> --nab
>
>>> Perhaps it would make sense for you to comment on technical issues as I
>>> have listed them in the email to Bart so we can all stay on the same
>>> (physical memory segment) page? Replying to your own messages from a
>>> different thread on emails with the same topic as what Bart and I have
>>> been discussing is surely confusing for the technical folks following
>>> along.
>>>
>>> --nab
>>>
>>>>> </SNIP>
>>
>
>
