--
You received this message because you are subscribed to the Google Groups "ScyllaDB development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scylladb-dev...@googlegroups.com.
To post to this group, send email to scylla...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/scylladb-dev/CAD-J%3DzaKjCBCrWX71P%2BVCJJrujorf7jD%3DQDfRjiJ-PyHhkAy9g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
Currently, code interacting with range_tombstone_streams uses the get_next() API
to convert the next available tombstone in the list to a mutation fragment.
To account for the range tombstone properly, it is desirable to do so
earlier, since creating the mutation fragment may move the data to a
different region.
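The idea above - account an element while it still lives in its original region, before converting it into an output fragment - can be illustrated with a minimal, self-contained sketch. All types and names here are hypothetical stand-ins, not Scylla's actual classes:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for a range tombstone held in a region-managed list.
struct tombstone {
    std::string start, end;
    size_t memory_usage() const { return start.size() + end.size(); }
};

// Stand-in for the mutation fragment the stream hands out.
struct fragment {
    tombstone rt;
};

// Accounter that tallies bytes as elements are consumed.
struct size_accounter {
    size_t total = 0;
    void operator()(const tombstone& t) { total += t.memory_usage(); }
};

// A stream that accounts each tombstone *before* converting it into a
// fragment, while the data still lives in its original memory region.
template <typename Accounter>
class tombstone_stream {
    std::deque<tombstone> _list;
    Accounter& _acct;
public:
    tombstone_stream(std::deque<tombstone> list, Accounter& acct)
        : _list(std::move(list)), _acct(acct) {}
    bool empty() const { return _list.empty(); }
    fragment get_next() {
        tombstone& t = _list.front();
        _acct(t);                  // account first, in the source region
        fragment f{std::move(t)};  // only then move into the fragment
        _list.pop_front();
        return f;
    }
};
```

The ordering is the whole point: once the move into the fragment happens, the bytes may live in a different allocator region, so any size taken afterwards would be measured in the wrong place.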
--
2.5.5
I have never tested it explicitly in the past, but I don't remember the slowdown being that bad. Maybe my disk is getting old?
Available at:
    To g...@github.com:glommer/scylla.git  virtual-dirty-v3
Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to fill up before we stop
accepting requests, we now do so when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them as soon as some memory has been written.
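The throttling scheme described here can be sketched as a toy model (all names are illustrative, not Scylla's actual classes): admission stops once virtual dirty memory reaches half the real limit, and bytes flushed to disk are credited back incrementally as the flush progresses, instead of all at once when the SSTable is sealed.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of "virtual dirty" throttling.
class dirty_memory_manager {
    size_t _limit;               // total memtable space
    size_t _virtual_dirty = 0;   // dirty bytes not yet (virtually) freed
public:
    explicit dirty_memory_manager(size_t limit) : _limit(limit) {}

    // Admit new requests only while virtual dirty is below half the limit.
    bool can_admit() const { return _virtual_dirty < _limit / 2; }

    void on_write(size_t bytes) { _virtual_dirty += bytes; }

    // Called as the flush makes progress, not when the SSTable is sealed:
    // this is what turns the brick wall into a gradual disk-speed throttle.
    void on_flushed(size_t bytes) {
        _virtual_dirty -= bytes < _virtual_dirty ? bytes : _virtual_dirty;
    }

    size_t virtual_dirty() const { return _virtual_dirty; }
};
```

With this shape, once the 50 % mark is hit, requests are admitted at roughly the rate flushed bytes come back, which is the disk-speed behavior the cover letter describes.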
--
2.5.5
On Mon, Sep 19, 2016 at 10:51 AM, Glauber Costa <gla...@scylladb.com> wrote:
Available at:
    To g...@github.com:glommer/scylla.git  virtual-dirty-v3
Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

This indicates that Scylla frees memory from the memtable as the write proceeds, right? What if the sstable write fails (for example, out of disk space)? Would we be able to recover the mutations written to the partial sstable and freed from memory?
--
2.5.5
This is so we can template it without worrying about declaring the
specializations in the .cc file.
--
2.5.5
By default, we don't do any accounting. By specializing this class and providing
an accounter class, we can account for how much memory we are reading as we read
through the elements.
Signed-off-by: Glauber Costa <gla...@scylladb.com>
---
 partition_version.hh | 42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)
diff --git a/partition_version.hh b/partition_version.hh
index 6c57ab0..0bc5258 100644
--- a/partition_version.hh
+++ b/partition_version.hh
@@ -271,7 +271,15 @@ inline partition_version_ref& partition_snapshot::version()
   }
 }
-class partition_snapshot_reader : public streamed_mutation::impl {
+struct partition_snapshot_reader_dummy_accounter {
+    void operator()(const clustering_row& cr) {}
+    void operator()(const static_row& sr) {}
+    void operator()(const range_tombstone& rt) {}
+};
+extern partition_snapshot_reader_dummy_accounter no_accounter;
+
+template <typename MemoryAccounter = partition_snapshot_reader_dummy_accounter>
+class partition_snapshot_reader : public streamed_mutation::impl, public MemoryAccounter {
   struct rows_position {
     mutation_partition::rows_type::const_iterator _position;
     mutation_partition::rows_type::const_iterator _end;
@@ -308,6 +316,10 @@ class partition_snapshot_reader : public streamed_mutation::impl {
   logalloc::region& _lsa_region;
   logalloc::allocating_section& _read_section;
+    MemoryAccounter& mem_accounter() {
+        return *this;
+    }
+
   uint64_t _reclaim_counter;
   unsigned _version_count = 0;
 private:
@@ -384,11 +396,16 @@ class partition_snapshot_reader : public streamed_mutation::impl {
     return _range_tombstones.get_next();
   }
+    void emplace_mutation_fragment(mutation_fragment&& mfopt) {
+        mfopt.visit(mem_accounter());
+        _buffer.emplace_back(std::move(mfopt));
 }
--
2.5.5
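The template-parameter-with-empty-default pattern used by the patch above can be shown in a compilable, stand-alone form. All types here are simplified stand-ins, not Scylla's actual classes:

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-ins for the element types the reader produces.
struct clustering_row { size_t memory_usage() const { return 64; } };
struct range_tombstone { size_t memory_usage() const { return 32; } };

// No-op accounter: an empty base class, so the default instantiation
// pays nothing (empty base optimization) for the accounting hooks.
struct dummy_accounter {
    void operator()(const clustering_row&) {}
    void operator()(const range_tombstone&) {}
};

// The reader mixes the accounter in as a base class; each element is
// passed to it just before being buffered.
template <typename MemoryAccounter = dummy_accounter>
class reader : public MemoryAccounter {
    MemoryAccounter& acct() { return *this; }
public:
    template <typename Element>
    void emit(const Element& e) {
        acct()(e);  // account before the element leaves the reader
        // ...then push e into the output buffer...
    }
};

// A real accounter tallies bytes read through the reader.
struct counting_accounter {
    size_t bytes = 0;
    void operator()(const clustering_row& cr) { bytes += cr.memory_usage(); }
    void operator()(const range_tombstone& rt) { bytes += rt.memory_usage(); }
};
```

Because the accounter is a template parameter rather than a virtual interface, the default (no-accounting) instantiation compiles down to nothing, which is why the patch avoids virtual calls on this path.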
On Sep 27, 2016 8:38 AM, "Paweł Dziepak" <pdzi...@scylladb.com> wrote:
>
> On 19 September 2016 at 14:51, Glauber Costa <gla...@scylladb.com> wrote:
>>
>> By default, we don't do any accounting. By specializing this class and providing
>> an accounter class, we can account for how much memory we are reading as we read
>> through the elements.
>>
>> Signed-off-by: Glauber Costa <gla...@scylladb.com>
>> ---
>>  partition_version.hh | 42 ++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 36 insertions(+), 6 deletions(-)
>>
>> diff --git a/partition_version.hh b/partition_version.hh
>> index 6c57ab0..0bc5258 100644
>> --- a/partition_version.hh
>> +++ b/partition_version.hh
>> @@ -271,7 +271,15 @@ inline partition_version_ref& partition_snapshot::version()
>>    }
>>  }
>>
>> -class partition_snapshot_reader : public streamed_mutation::impl {
>> +struct partition_snapshot_reader_dummy_accounter {
>> +    void operator()(const clustering_row& cr) {}
>> +    void operator()(const static_row& sr) {}
>> +    void operator()(const range_tombstone& rt) {}
>> +};
>> +extern partition_snapshot_reader_dummy_accounter no_accounter;
>
> Apparently, no_accounter played tricks on you during rebase.
>
Thanks for spotting it.
>>
>> +
>> +template <typename MemoryAccounter = partition_snapshot_reader_dummy_accounter>
>> +class partition_snapshot_reader : public streamed_mutation::impl, public MemoryAccounter {
>>    struct rows_position {
>>      mutation_partition::rows_type::const_iterator _position;
>>      mutation_partition::rows_type::const_iterator _end;
>> @@ -308,6 +316,10 @@ class partition_snapshot_reader : public streamed_mutation::impl {
>>    logalloc::region& _lsa_region;
>>    logalloc::allocating_section& _read_section;
>>
>> +    MemoryAccounter& mem_accounter() {
>> +        return *this;
>> +    }
>> +
>>    uint64_t _reclaim_counter;
>>    unsigned _version_count = 0;
>>  private:
>> @@ -384,11 +396,16 @@ class partition_snapshot_reader : public streamed_mutation::impl {
>>      return _range_tombstones.get_next();
>>    }
>>
>> +    void emplace_mutation_fragment(mutation_fragment&& mfopt) {
>> +        mfopt.visit(mem_accounter());
>
> Well, it looks like we could generalise this to mutation_fragment_visitor, since the reader can work with anything that visits a mutation fragment even if it doesn't do any memory accounting. I am not sure, though, that there is much point in that.
>
>>
>> +        _buffer.emplace_back(std::move(mfopt));
>
> This is existing code, but push_mutation_fragment() should be used here (and _buffer made private).
>
> I just want to make sure you are aware that it may take some time until the mutation fragment queued in the buffer is actually consumed and written to the sstable. Basically, this means that the memory accounting is going to be done in bursts of 16 mutation fragments and, later, when we make the buffer size dynamic, in bursts of X kB. Still, I don't think it would ruin flush progress estimates.
>
I am aware of that. It doesn't matter as long as the rate at which we consume is tied to the rate at which we write the sstables.
If we were to read the whole thing beforehand it wouldn't work. As we read on a need-to-use basis, it works, even if time elapses between accounting and usage.
Looks good to me. Paweł, you had some comments, were they addressed?
--
2.5.5
 };
 extern standard_allocation_strategy standard_allocation_strategy_instance;
diff --git a/utils/logalloc.cc b/utils/logalloc.cc
index c244324..17a81f4 100644
--- a/utils/logalloc.cc
+++ b/utils/logalloc.cc
@@ -517,7 +517,7 @@ class segment_pool {
   segment* new_segment(region::impl* r);
   segment_descriptor& descriptor(const segment*);
   // Returns segment containing given object or nullptr.
-    segment* containing_segment(void* obj) const;
+    segment* containing_segment(const void* obj) const;
   void free_segment(segment*) noexcept;
   void free_segment(segment*, segment_descriptor&) noexcept;
   size_t segments_in_use() const;
@@ -720,7 +720,7 @@ segment_pool::descriptor(const segment* seg) {
 }
 segment*
-segment_pool::containing_segment(void* obj) const {
+segment_pool::containing_segment(const void* obj) const {
   auto addr = reinterpret_cast<uintptr_t>(obj);
   auto offset = addr & (segment::size - 1);
   auto index = (addr - _segments_base) >> segment::size_shift;
@@ -1378,6 +1378,17 @@ class region_impl : public allocation_strategy {
     }
   }
+    virtual size_t object_memory_size_in_allocator(const void* obj) const noexcept override {
+        segment* seg = shard_segment_pool.containing_segment(obj);
+
+        if (!seg) {
+            return standard_allocator().object_memory_size_in_allocator(obj);
+        } else {
+            auto desc = reinterpret_cast<object_descriptor*>(reinterpret_cast<uintptr_t>(obj) - sizeof(object_descriptor));
+            return sizeof(object_descriptor) + desc->size();
The code that is common will live in its own reader, the iterator_reader. All
friendly private access to memtable attributes and methods happen through the
iterator reader.
After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.
Signed-off-by: Glauber Costa <gla...@scylladb.com>
---
 memtable.hh |  2 +-
 memtable.cc | 127 ++++++++++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 98 insertions(+), 31 deletions(-)
diff --git a/memtable.hh b/memtable.hh
index 98fc7eb..a0a5cbd 100644
--- a/memtable.hh
+++ b/memtable.hh
@@ -161,5 +161,5 @@ class memtable final : public enable_lw_shared_from_this<memtable>, private loga
     return _replay_position;
   }
-    friend class scanning_reader;
+    friend class iterator_reader;
 };
diff --git a/memtable.cc b/memtable.cc
index 93f3d02..3a1743b 100644
--- a/memtable.cc
+++ b/memtable.cc
+
+    mutation_reader delegate_reader(const query::partition_range& delegate,
   virtual future<streamed_mutation_opt> operator()() override {
     if (_delegate_range) {
       return _delegate();
     }
-        // We cannot run concurrently with row_cache::update().
-        if (_memtable->is_flushed()) {
-            // FIXME: Use cache. See column_family::make_reader().
-            _delegate_range = _last ? _range.split_after(*_last, dht::ring_position_comparator(*_memtable->_schema)) : _range;
-            _delegate = make_mutation_reader<sstable_range_wrapping_reader>(
-                _memtable->_sstable, _schema, *_delegate_range, _slice, _pc);
-            _memtable = {};
-            _last = {};
+        // FIXME: Use cache. See column_family::make_reader().
+        _delegate_range = get_delegate_range();
+        if (_delegate_range) {
+            _delegate = delegate_reader(*_delegate_range, _slice, _pc);
       return _delegate();
     }
-        logalloc::reclaim_lock _(*_memtable);
+        logalloc::reclaim_lock _(region());
     managed_bytes::linearization_context_guard lcg;
-        update_iterators();
-        if (_i == _end) {
+        memtable_entry* e = fetch_next_entry();
+        if (!e) {
+            return make_ready_future<streamed_mutation_opt>(stdx::nullopt);
+        } else {
+            return make_ready_future<streamed_mutation_opt>(e->read(mtbl(), schema(), _slice));
+        }
+    }
+};
+
+class flush_reader final : public iterator_reader {
+public:
+    flush_reader(schema_ptr s, lw_shared_ptr<memtable> m)
+        : iterator_reader(std::move(s), std::move(m), query::full_partition_range)
+    {}
+
+    virtual future<streamed_mutation_opt> operator()() override {
+        logalloc::reclaim_lock _(region());
+        managed_bytes::linearization_context_guard lcg;
+        memtable_entry* e = fetch_next_entry();
+        if (!e) {
       return make_ready_future<streamed_mutation_opt>(stdx::nullopt);
+        } else {
+            return make_ready_future<streamed_mutation_opt>((*e).read(mtbl(), schema(), query::full_slice));
     }
-        memtable_entry& e = *_i;
-        ++_i;
-        _last = e.key();
-        _memtable->upgrade_entry(e);
-        return make_ready_future<streamed_mutation_opt>(e.read(_memtable, _schema, _slice));
   }
 };
@@ -219,7 +286,7 @@ memtable::make_reader(schema_ptr s,
 mutation_reader
 memtable::make_flush_reader(schema_ptr s) {
-    return make_mutation_reader<scanning_reader>(std::move(s), shared_from_this(), query::full_partition_range, query::no_clustering_key_filtering, default_priority_class());
+    return make_mutation_reader<flush_reader>(std::move(s), shared_from_this());
 }
 void
--
2.5.5
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Signed-off-by: Glauber Costa <gla...@scylladb.com>
---
 database.cc | 18 ++++++++++++++--
 memtable.cc | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 86 insertions(+), 3 deletions(-)
diff --git a/database.cc b/database.cc
index 012eaf2..07224d3 100644
--- a/database.cc
+++ b/database.cc
@@ -1548,8 +1548,22 @@ database::database(const db::config& cfg)
   // in a different region group. This is because throttled requests are serviced in FIFO order,
   // and we don't want system requests to be waiting for a long time behind user requests.
   , _system_dirty_memory_manager(*this, _memtable_total_space + (10 << 20))
-    , _dirty_memory_manager(*this, &_system_dirty_memory_manager, _memtable_total_space)
-    , _streaming_dirty_memory_manager(*this, &_dirty_memory_manager, _streaming_memtable_total_space)
+    // The total space that can be used by memtables is _memtable_total_space, but we will only
+    // allow the region_group to grow to half of that. This is because of virtual_dirty: memtables
+    // can take a long time to flush, and if we are using the maximum amount of memory possible,
+    // then requests will block until we finish flushing at least one memtable.
+    //
+    // We can't free memory until the whole memtable is flushed, because we need to keep it in
+    // memory until the end, but we can fake freeing memory. When we are done with an element of
+    // the memtable, we will update the region group pretending memory just went down by that amount.
+    //
+    // Because the amount of memory that we pretend to free should be close enough to the actual
+    // memory used by the memtables, that effectively creates two sub-regions inside the dirty
+    // region group, of equal size. In the worst case, we will have _memtable_total_space dirty
+    // bytes used, and half of that already virtually freed.
+    , _dirty_memory_manager(*this, &_system_dirty_memory_manager, _memtable_total_space / 2)
+    // The same goes for streaming with respect to virtual dirty.
+    , _streaming_dirty_memory_manager(*this, &_dirty_memory_manager, _streaming_memtable_total_space / 2)
   , _version(empty_version)
   , _enable_incremental_backups(cfg.incremental_backups())
 {
diff --git a/memtable.cc b/memtable.cc
index 3a1743b..33bf39d 100644
--- a/memtable.cc
+++ b/memtable.cc
+            return current_allocator().object_memory_size_in_allocator(&e) +
+                   current_allocator().object_memory_size_in_allocator(&*(partition_snapshot(e.schema(), &(e.partition())).version())) +
+                   e.key().key().memory_usage();
+        });
+        update_bytes_read(delta);
+    }
+};
+
+class partition_snapshot_accounter {
+    flush_memory_accounter& _accounter;
+public:
+    partition_snapshot_accounter(flush_memory_accounter& acct): _accounter(acct) {}
+
+    // We will be passed mutation fragments here, and they are allocated using the standard
+    // allocator, so we can't compute the size in the memtable precisely. However, precise
+    // accounting is hard anyway, since we may be holding multiple snapshots of the partitions,
+    // and the partition_snapshot_reader may compose them. In doing so, we move memory to the
+    // standard allocator. As long as the size we read here is less than or equal to the size in
+    // the memtables, we are safe, and in the worst case we will let a few fewer requests in.
+    void operator()(const range_tombstone& rt) {
+        _accounter.update_bytes_read(rt.memory_usage());
+    }
+
+    void operator()(const static_row& sr) {
+        _accounter.update_bytes_read(sr.memory_usage());
+    }
+
+    void operator()(const clustering_row& cr) {
+        // Every clustering row is stored in a rows_entry object, and that has some significant
+        // overhead - so add it here. We will be a bit short on our estimate because we can't know
+        // the size in the allocator of this rows_entry object: we may have many snapshots,
+        // and we don't know which one(s) contributed to the generation of this mutation fragment.
+        //
+        // We will add the size of the struct here, and that should be good enough.
+        _accounter.update_bytes_read(sizeof(rows_entry) + cr.memory_usage());
+    }
+};
+
 class flush_reader final : public iterator_reader {
+    flush_memory_accounter _flushed_memory;
 public:
   flush_reader(schema_ptr s, lw_shared_ptr<memtable> m)
     : iterator_reader(std::move(s), std::move(m), query::full_partition_range)
+        , _flushed_memory(region())
   {}
+    flush_reader(const flush_reader&) = delete;
+    flush_reader(flush_reader&&) = delete;
+    flush_reader& operator=(flush_reader&&) = delete;
+    flush_reader& operator=(const flush_reader&) = delete;
   virtual future<streamed_mutation_opt> operator()() override {
     logalloc::reclaim_lock _(region());
@@ -253,7 +318,11 @@ class flush_reader final : public iterator_reader {
     if (!e) {
       return make_ready_future<streamed_mutation_opt>(stdx::nullopt);
     } else {
-            return make_ready_future<streamed_mutation_opt>((*e).read(mtbl(), schema(), query::full_slice));
+            auto cr = query::clustering_key_filter_ranges::get_ranges(*schema(), query::full_slice, e->key().key());
+            auto snp = e->partition().read(schema());
+            auto mpsr = make_partition_snapshot_reader<partition_snapshot_accounter>(schema(), e->key(), std::move(cr), snp, region(), read_section(), mtbl(), _flushed_memory);
+            _flushed_memory.account_component(*e);
+            return make_ready_future<streamed_mutation_opt>(std::move(mpsr));
     }
   }
 };
--
2.5.5
Available at:
    To g...@github.com:glommer/scylla.git  virtual-dirty-v3
Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
components at this point. It is still unclear why, but it seems to stem from
the same issue that is plaguing the commitlog.
In any case, while we could mitigate this issue, we can't really fix it. Even
if we decide to optimistically add the free_space() remaining to the virtual
dirty area - allowing more requests to come in, a new flush cannot be initiated
until this one finishes (assuming max concurrency already used). This means
requests will be blocked for this period.
Changes from V2:
================
- The delay-blocked_requests_max counter is removed. It is important to have but
 not crucial, and in its current form it lacks important information about peaks.
 It will be done later as a quantile.
- Revert back to accounting mutation fragments.
- Fix the problem with accounting snapshots, which can lead to overaccounting. Do
 this by reverting back to accounting MFs. We lose a bit of accuracy, but there
 seems to be no simple way to properly handle snapshot partitions otherwise.
Changes from V1:
================
- Only used space is accounted for, and we don't account for padding or free
 space in the region. While this slows us down a bit, it is not significant and
 it simplifies the code a lot, since we don't have to have extra protections
 against compactions.
- get rid of virtual functions
- account objects before they are transformed into a mutation fragment -
 guaranteeing that they will be in the correct region.
Changes from RFC:
=================
- The accounting was moved to the reader side (mutation_reader), instead of being
 done by at the SSTable side.
- Accounting of relevant internal structures (memtable_entry, rows entry) is more
 precise.
Glauber Costa (9):
 database: export virtual dirty bytes region group
 LSA: export information about size of the throttle queue
 LSA: export information about object memory footprint
 sstables: use special reader for writing a memtable
 memtables: split scanning reader in two
 LSA: allow a group to query its own region group
 move partition_snapshot_reader code to header file
 add accounting of memory read to partition_snapshot_reader
 database: allow virtual dirty memory management
 database.hh         | 25 ++++++
 memtable.hh         |  5 +-
 partition_version.hh     | 204 ++++++++++++++++++++++++++++++++++++++++---
 utils/allocation_strategy.hh |  9 ++
 utils/logalloc.hh      |  6 ++
 database.cc         | 37 +++++++-
 memtable.cc         | 199 +++++++++++++++++++++++++++++++++++------
 partition_version.cc     | 178 -------------------------------------
 sstables/sstables.cc     |  2 +-
 utils/logalloc.cc      | 23 ++++-
 10 files changed, 462 insertions(+), 226 deletions(-)
--
2.5.5
>
>> +            return current_allocator().object_memory_size_in_allocator(&e) +
>
> Since we can't have perfect estimates anyway, perhaps we could save on the
> virtual calls to the allocator and just use sizeof().
As I have described in my opening letter, I am achieving > 99 %
accuracy with this method. We don't need to be 100 % accurate, but the
more we virtually free, the more requests we can allow.
The allocator overhead accounts for 5-10 % of the size of each
structure I am calling into the allocator for, and those calls happen
at every partition. If we have a lot of small partitions, the
difference will be significant.
It's not the end of the world, and we could ultimately do it. I just
think that the overhead here is justified given what I have outlined
above.
I agree. I will include the header in the next version.