On 29/08/2022 15.17, shw...@snu.ac.kr wrote:
> From: Heewon Shin <shw...@snu.ac.kr>
>
> This commit contains the implementation of the append command in the
> file-related classes. Most of the changes are copies of the write-related
> functions, except for the returned value. Append returns a struct
> io_result that contains both the status value and the assigned block address.
>
> Users should be aware that the 'pos' input parameter must be the start
> address of a zone. If 'pos' is not aligned with a zone's starting
> address, the call fails with errno EINVAL.
I wonder if we should introduce a new function here, or instead use the
regular dma_write in a new subclass and make dma_write call the
low-level append. The new zoned_nvme_file::dma_write can take care of
supplying the correct "pos" to the lower layer.
Using dma_write means that wrapping the file with output_stream will
work as expected, so integration with applications becomes much simpler.
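Roughly what that could look like, as a sketch only (the wrapper class, the
dma_append() signature and the io_result field names below are assumptions,
not what the patch defines):

#include <seastar/core/file.hh>
#include <seastar/core/future.hh>

using namespace seastar;

// Assumed shape of the append result (per the commit message: a status plus
// the block address the device chose). Field names are hypothetical.
struct io_result {
    size_t written;         // bytes written
    uint64_t assigned_lba;  // address the device placed the data at
};

class zoned_nvme_file {
    file _dev;              // underlying zoned block device
    uint64_t _zone_start;   // byte offset of the zone being filled
public:
    zoned_nvme_file(file dev, uint64_t zone_start)
        : _dev(std::move(dev)), _zone_start(zone_start) {}

    // Looks like a regular dma_write, so output_stream can wrap it, but
    // internally it always appends to the zone; the caller-supplied pos is
    // only useful for sanity-checking write ordering.
    future<size_t> dma_write(uint64_t pos, const void* buf, size_t len) {
        (void)pos;
        return dma_append(_zone_start, buf, len).then([] (io_result r) {
            // A real implementation would record r.assigned_lba somewhere
            // so the data can be read back later.
            return r.written;
        });
    }

private:
    // Stand-in for the low-level append added by this patch.
    future<io_result> dma_append(uint64_t zone_start, const void* buf, size_t len);
};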
On 07/09/2022 12.28, Heewon Shin / Student / Dept. of Computer Science and Engineering wrote:
There are some reasons why we added dma_append() as a separate interface.
We thought that some users would still want to place their data at an exact block address via the legacy dma_write(), even though dma_append() is provided.
Isn't that impossible with a zoned device?
No, there is no problem writing data to an exact block address as long as that address does not violate the write pointer managed internally by the device. But it is true that the user has to manage block addresses and the ordering of multiple write requests carefully.
Ok.
Furthermore, because the block address is determined only after block I/O completion, it is hard to fold the append command into the implementation of the legacy dma_write(), which is expected to place data at the block address provided by the user.
I thought we'd just ignore the block address (apart from sequencing concurrent write operations generated by output_stream).
As we mentioned on the "[PATCH 1/2] reactor: add passthru functionality to io_uring backend" mailing thread, the user has to remember the assigned LBA in order to read the data back from the correct address. Here's our document on why we need the append command in Ceph SeaStore (https://docs.google.com/document/d/1DvADmtqVAoXKSmds8ByqPhVTUoexDQvRlkPvB2xzSfc/edit?usp=sharing).
It's protected. I asked for access, but please consider making it
public.
We also attached a presentation that introduces the necessity of the append command (https://www.usenix.org/system/files/vault20_slides_bjorling.pdf).
This is very helpful.
I think we need an API to allocate new zones, no?
Having read the specification, I see the number of zones is fixed. But we need APIs to list the available zones and their capacities, so that the application can implement a zone allocator itself.
What is the typical zone size? Is it something on the order of a
few megabytes?
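To make the zone-listing idea concrete, here is one possible shape for such an
API (purely a sketch; none of these names exist in Seastar today):

#include <seastar/core/future.hh>
#include <cstdint>
#include <vector>

// Hypothetical descriptor an application-level zone allocator would consume;
// the field names are illustrative only.
struct zone_descriptor {
    uint64_t start;          // first byte offset of the zone
    uint64_t length;         // total zone size
    uint64_t capacity;       // writable capacity (can be smaller than length)
    uint64_t write_pointer;  // where the next append will land
};

// Hypothetical: report all zones of a zoned block device, so the application
// can build its own zone allocator on top.
seastar::future<std::vector<zone_descriptor>> list_zones(/* zoned device */);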
On 19/09/2022 11.53, Heewon Shin / Student / Dept. of Computer Science and Engineering wrote:
We think that zone management is a separate piece of functionality. Users can do it with libzbd (a library for zone management) or via ioctl (available in the kernel). Should it be provided at the Seastar library level?
It's best to provide it via Seastar so users don't have to glue
together stuff from random places.
Typically, the zone size depends on the manufacturer, roughly 100 MB to 2 GB. The user can obtain it from libzbd or the ioctl mentioned just above.
I see, I'm interested in using it for ScyllaDB too.
What happens if a zone is not completely filled? Is the space
there wasted, or can it be reused? In ScyllaDB we have a large
variety of file sizes.
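For reference, the kernel route mentioned above looks roughly like this: a
minimal sketch using the BLKREPORTZONE ioctl from <linux/blkzoned.h> (the same
interface libzbd wraps) to read zone geometry. Note that it is a plain blocking
syscall, error handling is minimal, and the capacity field needs a kernel with
zone-capacity reporting.

#include <linux/blkzoned.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <zoned block device>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    const unsigned nr = 16;   // only the first 16 zones in this sketch
    size_t bytes = sizeof(blk_zone_report) + nr * sizeof(blk_zone);
    auto* rep = static_cast<blk_zone_report*>(calloc(1, bytes));
    rep->sector = 0;          // start reporting from the beginning of the device
    rep->nr_zones = nr;

    if (ioctl(fd, BLKREPORTZONE, rep) < 0) { perror("BLKREPORTZONE"); return 1; }

    // The kernel fills in the zones and updates nr_zones to the number reported.
    for (unsigned i = 0; i < rep->nr_zones; i++) {
        const blk_zone& z = rep->zones[i];
        // start, len, capacity and wp are all in 512-byte sectors.
        printf("zone %u: start=%llu len=%llu cap=%llu wp=%llu cond=%u\n", i,
               (unsigned long long)z.start, (unsigned long long)z.len,
               (unsigned long long)z.capacity, (unsigned long long)z.wp,
               (unsigned)z.cond);
    }
    free(rep);
    close(fd);
    return 0;
}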
ioctl will be a blocking call, no? So users will have to use alien.
Yes, it will be a blocking call.
Do you mean we should expose NVME_CMD_URING to users? This will
eliminate any questions about how the API should look, since users
can just send append via NVME_CMD_URING. After Ceph gains experience
with it, we can think of a better way to expose it to users.
The downside is that Seastar I/O scheduling is bypassed (but append
is likely cheaper than a regular overwrite).
The API could look like
future<uring_cmd_respond>
file::io_uring_command(const void* command, size_t len);
which is then implemented only for block devices.
Perhaps we can pass optional information to tell Seastar how to
classify the request for I/O scheduling purposes.
This can then be used to implement append, administrative requests,
key/value, or whatever the NVMe spec comes up with.
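To illustrate, a sketch of how an application could drive zone append through
such a passthrough call. The file::io_uring_command() entry point is the
hypothetical API proposed above and uring_cmd_respond is its assumed result
type; the command encoding uses struct nvme_uring_cmd from <linux/nvme_ioctl.h>
(kernel 5.19+) and the NVMe ZNS Zone Append definition (opcode 0x7d,
CDW10/11 = zone start LBA, CDW12 low bits = number of blocks minus one):

#include <linux/nvme_ioctl.h>   // struct nvme_uring_cmd
#include <cstdint>

// Build an NVMe Zone Append passthrough command. The device returns the LBA
// it actually wrote to in the completion entry (dword 0/1), which the caller
// must keep in order to read the data back.
nvme_uring_cmd make_zone_append(uint32_t nsid, uint64_t zone_start_lba,
                                void* data, uint32_t nr_blocks,
                                uint32_t block_size) {
    nvme_uring_cmd cmd{};                        // zero-initialize all fields
    cmd.opcode = 0x7d;                           // Zone Append
    cmd.nsid = nsid;
    cmd.addr = reinterpret_cast<uint64_t>(data); // data buffer
    cmd.data_len = nr_blocks * block_size;
    cmd.cdw10 = uint32_t(zone_start_lba);        // ZSLBA, low 32 bits
    cmd.cdw11 = uint32_t(zone_start_lba >> 32);  // ZSLBA, high 32 bits
    cmd.cdw12 = nr_blocks - 1;                   // NLB is zero-based
    return cmd;
}

// Hypothetical usage with the API proposed above:
//   auto cmd = make_zone_append(nsid, zone_start_lba, buf, nr_blocks, 4096);
//   future<uring_cmd_respond> r = f.io_uring_command(&cmd, sizeof(cmd));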
We sent the new patches as a GitHub pull request (https://github.com/scylladb/seastar/pull/1218).
If you have any comments, please give us your feedback there.