Re: [capnproto] Memory Mapped Reading

Kenton Varda

Jun 21, 2023, 4:45:21 PM
to Adrian, Cap'n Proto
Hi Adrian,

How are you measuring memory usage, exactly?

When using mmap, measuring memory usage gets a bit complicated. The kernel will load pages of the file into memory when you read them, and then it is free to discard those pages at any time -- because it can always load them again later if needed. But the kernel will only actually discard pages if it needs the memory for something else. So if you read the entire file by mmap-ing it and reading every page, and nothing else needs memory, then all those pages will stay resident in memory. But this isn't really the same as your program allocating memory, because, again, all those pages can be freed up instantly whenever memory is needed.
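
One way to observe this on Linux is the mincore(2) system call, which reports which pages of a mapping are currently resident. The following is a minimal sketch, not something from the Cap'n Proto library; the helper name countResidentPages is made up for illustration, and the `unsigned char *` vector type is the Linux signature (BSD and macOS use `char *`).

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: counts how many pages of the mapping
// [addr, addr + length) are currently resident in RAM.
// Returns -1 if mincore() fails. `addr` must be page-aligned,
// which holds for any pointer returned by mmap().
long countResidentPages(void *addr, size_t length)
{
    const size_t pageSize = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    const size_t pageCount = (length + pageSize - 1) / pageSize;

    // mincore() fills one byte per page of the range.
    std::vector<unsigned char> residency(pageCount);
    if (mincore(addr, length, residency.data()) != 0)
    {
        return -1;
    }

    long resident = 0;
    for (unsigned char entry : residency)
    {
        resident += entry & 1; // lowest bit set means the page is in memory
    }
    return resident;
}
```

Calling this on the mapped pointer and file size just before munmap() would show how many pages of the file are resident at that moment. The count can be larger than the number of pages the program actually touched, since the kernel may read ahead and since pages can already be cached from earlier runs.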

In order to fully understand what is going on you may have to dig into more detailed memory stats. If your OS is just giving you a single number for memory usage, it isn't telling the full story. Usually you can find a bunch of different statistics if you dig in a little more.
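
As one example of such a breakdown on Linux, /proc/self/status splits the resident set into anonymous memory (RssAnon), file-backed memory such as mmap-ed files (RssFile), and shared memory (RssShmem) on reasonably recent kernels. The sketch below parses those fields; the function name readRssBreakdown and its path parameter are assumptions for illustration only.

```cpp
#include <fstream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parses "Key:   <number> kB" lines from a
// /proc/<pid>/status style file and returns the values (in kB) of
// VmRSS and of every field whose name starts with "Rss".
std::map<std::string, long> readRssBreakdown(const std::string &path = "/proc/self/status")
{
    std::map<std::string, long> result;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line))
    {
        // Skip everything except VmRSS and the Rss* breakdown fields.
        if (line.rfind("VmRSS:", 0) != 0 && line.rfind("Rss", 0) != 0)
        {
            continue;
        }
        std::istringstream fields(line);
        std::string key;
        long kilobytes = 0;
        if (fields >> key >> kilobytes)
        {
            key.pop_back(); // drop the trailing ':'
            result[key] = kilobytes;
        }
    }
    return result;
}
```

If most of VmRSS turns out to be RssFile rather than RssAnon, the number is dominated by file pages the kernel can drop again, not by memory the program allocated for itself.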

-Kenton

On Wed, Jun 21, 2023 at 9:47 AM Adrian <adriannr...@gmail.com> wrote:
Hello

I have been experimenting with Cap'n Proto for some time. My aim is to read small chunks of a large serialized message while keeping total memory consumption low. For that purpose, I used memory-mapped reading and wrote a simple example to test memory usage.

In the tests, I found that even if I read only a small field of each entry (such as "address", which contains only the short string "Address"), the total memory usage of the test program below is 512 MB on my machine (the capnp database is 2.1 GB). I am wondering what I am doing wrong. Note: I run the program only in "read" mode. I called "write" once to create the capnp database.

If you have any thoughts, I would be very happy if you shared them with me.

Proto file
----------------------------------------------------------------------------------------------
@0xa5af5d9c9e54c04a;

struct Person {
  name @0 :Text;
  id @1 :UInt32;
  email @2 :Text;
  address @3 :Text;
}

struct AddressBook {
  people @0 :List(Person);
}
----------------------------------------------------------------------------------------------

Source code of example
----------------------------------------------------------------------------------------------
#include "test.capnp.h"
#include <capnp/message.h>
#include <capnp/serialize-packed.h>
#include <capnp/serialize.h>
#include <cstring>   // std::strcmp
#include <iostream>
#include <stdexcept> // std::runtime_error
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>

void writeAddressBook(int fd)
{
    constexpr const size_t NodeNumber = 1024 * 8;

    ::capnp::MallocMessageBuilder message;

    AddressBook::Builder addressBook = message.initRoot<AddressBook>();
    ::capnp::List<Person>::Builder people = addressBook.initPeople(NodeNumber);

    // Each string will be 128KB.
    constexpr const size_t size = 1024 * 128;

    for (unsigned int i = 0; i < NodeNumber; i++)
    {
        Person::Builder person = people[i];
        person.setId(i);
        person.setName(std::string(size, 'A').c_str());
        person.setEmail(std::string(size, 'A').c_str());
        person.setAddress("Address");
    }

    // Serialize the whole message into an in-memory buffer first.
    kj::VectorOutputStream output;
    writeMessage(output, message);

    auto serializedData = output.getArray();

    void *dataPtr = const_cast<void *>(static_cast<const void *>(serializedData.begin()));
    size_t dataSize = serializedData.size();

    // write() may write fewer bytes than requested, so loop until done.
    size_t totalBytesWritten = 0;
    while (totalBytesWritten < dataSize)
    {
        auto numberOfBytesWritten = write(fd, static_cast<const char *>(dataPtr) + totalBytesWritten, dataSize - totalBytesWritten);
        if (numberOfBytesWritten == -1)
        {
            throw std::runtime_error{"Error during creating capnp database"};
        }
        totalBytesWritten += numberOfBytesWritten;
    }
}

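void readAddressBook(int fd)
{
    struct stat st;
    if (fstat(fd, &st) == -1)
    {
        throw std::runtime_error{"Error getting size of capnp database"};
    }
    size_t fileSize = st.st_size;

    // Map the whole file read-only; pages are loaded lazily as they are touched.
    void *mapping = mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapping == MAP_FAILED)
    {
        throw std::runtime_error{"Error mapping capnp database"};
    }
    char *mappedData = static_cast<char *>(mapping);

    capnp::FlatArrayMessageReader reader(kj::ArrayPtr<const capnp::word>(
        reinterpret_cast<const capnp::word *>(mappedData), fileSize / sizeof(capnp::word)));

    AddressBook::Reader addressBook = reader.getRoot<AddressBook>();

    // Only the small id field of each Person is read here.
    for (Person::Reader person : addressBook.getPeople())
    {
        person.getId();
    }

    munmap(mappedData, fileSize);
    close(fd);
}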

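int main(int argc, char **argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: " << argv[0] << " --write|--read" << std::endl;
        return 1;
    }

    // O_CREAT so that the first "--write" run can create the file.
    int fd = open("./data.bin", O_RDWR | O_CREAT, 0644);
    if (fd == -1)
    {
        std::cerr << "Error opening ./data.bin" << std::endl;
        return 1;
    }

    if (!std::strcmp(argv[1], "--write"))
    {
        writeAddressBook(fd);
        close(fd);
    }

    if (!std::strcmp(argv[1], "--read"))
    {
        readAddressBook(fd); // closes fd itself
    }

    return 0;
}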

Kenton Varda

Jun 21, 2023, 5:32:39 PM
to Adrian, Cap'n Proto
Hi Adrian,

The memory usage you are seeing happens whether or not you use mmap; it's just accounted differently. If you read the file using many small read() calls, the operating system will still load all of the pages of the file into memory, and will only remove them from memory when the memory is needed for something else. That's called caching. But when you use read(), the memory isn't attached directly to your program. It lives in kernel space, so it doesn't look like your program is using a lot of memory, even though it is.

But using memory this way is not really consuming it. The memory is still available for anything else that needs it. Since the memory is still available, it's incorrect to think of it the same as memory your program allocated for private use.

Put simply, your program is not using the memory you think it is. You need to understand what the numbers actually mean.

-Kenton

On Wed, Jun 21, 2023 at 3:53 PM Adrian <adriannr...@gmail.com> wrote:
Hi, thanks for your reply.

I really appreciate your work in this library.

I used the /bin/time utility on Linux, but I also saw the same result with another memory analyzer.

As I mentioned, the file could be very big, so my aim is to reduce memory usage when reading data from the capnp database. When I read small portions of that database, I want my program not to consume much memory. In the documentation, you refer to mmap as a way to achieve this. Do you think the approach I implemented in my code is wrong for that purpose?

Thanks