[boost] Review of Boost.MySql


Marcelo Zimbres Silva via Boost

May 10, 2022, 4:33:28 PM
to Boost, Marcelo Zimbres Silva
Hi, here is my review.

1) ================================

> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__resultset/async_read_all.html
> The handler signature for this operation is
> void(boost::mysql::error_code, std::vector<boost::mysql::row>).
> etc.

Having a std::vector in the signature of the completion handler is
unusual in ASIO. In general, large objects should be passed as
parameters to the initiating function.

Additionally, this operation does not support custom allocators. That
wouldn't be a big problem to me if I could reuse the same memory for
each request but the interface above prevents me from doing so.

It is also a pessimized storage as each row will allocate its own
chunk of memory. A single block of memory would be better.
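
For illustration, an initiating function along these lines would address
all three points (a hypothetical signature, not the library's current API):

template <class Allocator, class CompletionToken>
auto async_read_all(
    std::vector<boost::mysql::row, Allocator>& output, // caller-owned storage,
                                                       // reusable across requests
    CompletionToken&& token);                          // handler: void(error_code)

The caller keeps ownership of the vector (and its allocator), so the same
memory can be reused for each request.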

2) ================================

> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__connection/async_query/overload1.html
> The handler signature for this operation is
> void(boost::mysql::error_code, boost::mysql::resultset<Stream>)

Here the completion handler is returning an I/O object, i.e.
boost::mysql::resultset, which is also unusual as far as I can see.
Wouldn't it make sense for users to specify a different executor for
the resultset? This interface is preventing that.

I prefer having the completion handler return config parameters with
which I can instantiate a boost::mysql::resultset myself on my desired
executor.
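
For illustration, something like this (resultset_params and the
executor-taking constructor are hypothetical, sketched only to show the
shape):

conn.async_query("SELECT 1",
    [&](boost::mysql::error_code ec, boost::mysql::resultset_params params)
    {
        // construct the I/O object myself, on the executor of my choice
        boost::mysql::resultset<Stream> result(my_executor, std::move(params));
    });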

3) ================================

> https://github.com/anarthal/mysql/blob/13d3615464f41222052d9ad8fa0c86bcdddee537/include/boost/mysql/value.hpp#L72

I am not convinced the design of the value class is a good idea. The
reason is that users know at compile time what types they are
expecting from a given query. That means it would be possible for them
to specify that before doing the query. So instead of doing this

https://github.com/anarthal/mysql/blob/13d3615464f41222052d9ad8fa0c86bcdddee537/example/query_async_coroutines.cpp#L107

boost::mysql::row row;
while (result.async_read_one(row, additional_info, yield[ec]))

users could do

boost::mysql::row<T1, T2, T3, T3> row;
while (result.async_read_one(row, additional_info, yield[ec]))

avoiding unnecessary further copies. In principle it would even be
possible to parse directly into a custom data structure, for example,
when storing JSON strings in the database.

4) ================================

> https://anarthal.github.io/mysql/mysql/examples.html

IMO, the examples are too limited and show variations of the same
functionality. For example:

- Text query, async with callbacks
- Text query, async with futures
- Text query, async with Boost.Coroutine coroutines
- Text query, async with C++20 coroutines
- Default completion tokens

Here you are showing nice Asio features. I am not saying you should
remove them, but if I am not mistaken there is no example that shows

- How to keep long-lasting connections open (especially for SSL).
- Timers for resolve, connect, read and write.
- Synchronized writes.
- Health checks, failover, etc.

I estimate that to address these points users will be required to
write a further 500-1000 LOC on top of your library. Otherwise they
will be stuck with underperforming, short-lived connections.

It would be helpful for users to have a starting point to begin with.
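
For instance, a periodic health check is the kind of starting point I
have in mind. A rough sketch with C++20 coroutines, assuming a
tcp_connection alias for connection<tcp::socket> and using a dummy query
as the ping:

boost::asio::awaitable<void> health_check(boost::mysql::tcp_connection& conn)
{
    boost::asio::steady_timer timer(co_await boost::asio::this_coro::executor);
    for (;;)
    {
        timer.expires_after(std::chrono::seconds(30)); // arbitrary period
        co_await timer.async_wait(boost::asio::use_awaitable);
        try
        {
            auto result = co_await conn.async_query("SELECT 1", boost::asio::use_awaitable);
            // the resultset must be read to completion before the next operation
            co_await result.async_read_all(boost::asio::use_awaitable);
        }
        catch (const boost::system::system_error&)
        {
            // the connection is dead: tear it down and reconnect here
        }
    }
}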

There are some minor things I have also noticed:

5) ================================

> https://github.com/anarthal/mysql/blob/13d3615464f41222052d9ad8fa0c86bcdddee537/include/boost/mysql/impl/error.hpp#L35

I expect this function to be O(1).
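
For instance, a switch over the enum, which compilers lower to a jump
table, would be O(1) (function and enumerator names below are
illustrative only):

inline const char* error_to_string(boost::mysql::errc v) noexcept
{
    switch (v)
    {
    case boost::mysql::errc::ok: return "no error";
    // ... one case per enumerator ...
    default: return "<unknown MySQL error>";
    }
}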

6) ================================

> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_std_optional.html
> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_optional.html

It also sounds unusual to have two member functions for the different
versions of optional. I suggest using the Boost version until we
transition to C++17.

7) ================================

> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__errc.html

errc should be error conditions and not errors, AFAICS.

============= Review End ================

I vote for conditional acceptance. Points 1, 2, 3 and 4 should be
addressed before acceptance. My acceptance criteria sum up to:

1) Remove functionality that provides low performance. With
coroutines it will be easy for users to loop on async_query to create
a std::vector of rows (see the sketch after this list).

2) No IO object should be returned as a parameter of the completion handler.

3) Review the value type. Can it be improved?

4) Improve examples.
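
As a sketch of criterion 1: inside a C++20 coroutine, collecting the
rows with async_read_one is a short loop (the signature is taken from
the example quoted in point 3):

std::vector<boost::mysql::row> rows;
boost::mysql::row row;
boost::mysql::error_info info;
while (co_await result.async_read_one(row, info, boost::asio::use_awaitable))
    rows.push_back(row);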

Regards,
Marcelo

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Richard Hodges via Boost

May 10, 2022, 4:54:22 PM
to boost@lists.boost.org List, Richard Hodges
I seem to remember this being commented on during the development of the
library.
It is worth remembering that because this is MySQL, nothing can be done
with the connection object until the resultset has been fully read.
In effect, the resultset can be thought of as a modal extension to the
connection.
Causing it to run on a different executor would mean that calls to the
resultset would need to be marshalled to the connection's executor in any
case, and vice versa.
This may be of questionable value given the limitation of the underlying
protocol.
This arguably strengthens the case for the use of a connection pool, since
without one, as you say, frequent queries will have setup overhead.

The point about preferring not to return IO objects by value is a separate
stylistic issue, on which I offer no opinion.
However it is worth noting that Asio itself does have a form of the
ip::tcp::acceptor::async_accept initiation function whose completion
handler is passed a socket:
https://www.boost.org/doc/libs/1_79_0/doc/html/boost_asio/reference/basic_socket_acceptor/async_accept/overload3.html

Dennis Luehring via Boost

May 11, 2022, 12:16:51 AM
to bo...@lists.boost.org, Dennis Luehring
On 10.05.2022 at 22:33, Marcelo Zimbres Silva via Boost wrote:
> users could do
>
> boost::mysql::row<T1, T2, T3, T3> row;
> while (result.async_read_one(row, additional_info, yield[ec]))
>
> avoiding unnecessary further copies. In principle it would be even
> possible to parse directly in a custom data structure, for example,
> when storing json string in the database.

I've developed a small C++17 example of how to work with
input/output transformers to write the statements more compactly
and type-safely


the base idea is that the MySQL <-> C++ mapping sometimes needs some
sort of conversion, and that combining most of the info at one point
makes it easier, or at least possible, to optimize better - the current
interface looks very runtime-allocation heavy :)


the input and output transformers help to work with basic types
and also with SQL/MySQL special types like nullable strings etc. - it's
not always possible to map MySQL types 1:1 to C++ and back, and
sometimes you want the transformation to behave differently

the transformers aren't visible when used with basic types


it's just an example to promote the idea: https://pastebin.com/raw/vepbTAKL


the best combination would be some sort of fluent SQL interface like:
https://github.com/rbock/sqlpp11

but I still think a plain string+binding interface like Boost.MySql
currently has is also needed

Julien Blanc via Boost

May 11, 2022, 3:54:58 AM
to bo...@lists.boost.org, Julien Blanc
On 2022-05-10 22:33, Marcelo Zimbres Silva via Boost wrote:

> 6) ================================
>
>> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_std_optional.html
>> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_optional.html
>
> Sounds also unusual to have two member functions for the different
> versions of optional. I suggest using the boost version until we
> transition to C++17.

I probably won't have the time to do a proper review, but had a quick
look at the documentation and this interface surprised me a bit. I was
expecting optional (whether std or boost) to be used to map nullable
values, but it seems that's not the case. It seems this interface will
return an empty optional if using a wrong type (like trying to get an
int from a string column), but the doc is silent on what it will do if
trying to get an int from a nullable int which is actually NULL (my
understanding is that it would return an empty optional). It bothers me
a bit that these two use cases are handled the same way, because one of
them is a valid use, and the other is a mismatch between the program and
the database schema that I would like to diagnose.

I understand this issue is non-trivial, since when the value is
retrieved from the row object, the information about the type of the
data in the DB (and thus its nullability) is lost. However, it seems
odd. It may be of interest to store whether the field is null in a
way independent of its type, instead of relying on a single null type.

Or maybe I just missed something obvious.

Regards,

Julien

Dominique Devienne via Boost

May 11, 2022, 4:19:07 AM
to bo...@lists.boost.org, Dominique Devienne
On Tue, May 10, 2022 at 10:54 PM Richard Hodges via Boost
<bo...@lists.boost.org> wrote:
> It is worth remembering that because this is MySQL, nothing can be done
> with the connection object until the resultset has been fully read.

Interesting, didn't know that (not being familiar with MySQL).

Note that PostgreSQL's official C client API now offers a *pipeline* mode [1],
which was part of the protocol for a while, but not exposed
client-side until v14.

The protocol also supports two *encoding* modes, text or binary. Is
MySQL's protocol text only?
We tested both, and binary is faster (surprise, right), which matters to me/us.

There's also a special COPY mode, for higher-performance (and
incremental IO) bulk row access.
Any equivalent on the MySQL side? Again, COPY IN and OUT is faster than
regular DML in PostgreSQL.

A little off-topic for this review, but OTOH answers to the above
would help understand the kind of async
possible with MySQL in general, and thus the proposed Boost.MySQL, at
the protocol level.

More related to this review, I don't see any performance chapter in
the doc's TOC comparing with the official MySQL (C, I guess) API.
Doing away with the official client code is a risk, and using ASIO
brings in significant complexity (from my POV),
so how is proposed Boost.MySQL better than the official C client, or a
good C++ wrapper on top of it?

Or is the target market of this library only existing ASIO users who
want to perform MySQL queries/statements?
What's the target use-case? Small (in resultset size) and frequent
(thus low duration / overhead) queries?
Any wins on large bulk data loads for ETL tools, especially compared
to (simpler?) code using the official client?

Efficiency is listed as a goal, but there's no absolute metric, nor
comparison to the obvious alternative that's the official client API.

Again, I'm not a MySQL user at this point. But I'm a heavy user of
PostgreSQL and SQLite.
And also of Oracle OCI in the past. So I have interest and perspective
on the domain.

I'm also aware of the Boost.PostgreSQL demo/proto Ruben made in the
past, in the same style as Boost.MySQL.
Thus IF I was a user of MySQL, the questions above would be first and
foremost in my mind, about Boost.MySQL.
And OTOH, if Boost.MySQL really has benefits, then maybe I can
translate those benefits to that Boost.PostgreSQL proto?

Beyond the design and implementation of such an ASIO-based
*from-scratch* C++ client,
I feel like there's too much missing for me to evaluate the "What's-In-It-For-Me"
and tradeoffs associated with proposed Boost.MySQL.

Hopefully that makes sense :). Thanks, --DD

[1]: https://www.postgresql.org/docs/current/libpq-pipeline-mode.html

Ruben Perez via Boost

May 11, 2022, 8:14:52 AM
to bo...@lists.boost.org, Ruben Perez
On Tue, 10 May 2022 at 22:33, Marcelo Zimbres Silva via Boost
<bo...@lists.boost.org> wrote:
>
> Hi, here is my review.

Thanks for taking the time to write your review.

>
> 1) ================================
>
> > https://anarthal.github.io/mysql/mysql/ref/boost__mysql__resultset/async_read_all.html
> > The handler signature for this operation is
> > void(boost::mysql::error_code, std::vector<boost::mysql::row>).
> > etc.
>
> Having a std::vector in the signature of the completion handler is
> unusual in ASIO. In general, large objects should be passed as
> parameters to the initiating function.
>
> Additionally, this operation does not support custom allocators. That
> wouldn't be a big problem to me if I could reuse the same memory for
> each request but the interface above prevents me from doing so.
>
> It is also a pessimized storage as each row will allocate its own
> chunk of memory. A single block of memory would be better.

Those functions don't strive for efficiency, that is true. I think they can be
useful for people using callbacks, for example. I see two options here,
and I'm happy with both of them:

a) Change the function signature to
async_read_all(vector<row, Allocator>&, CompletionToken&&), with the completion
handler signature being void(error_code)
b) Remove the functionality altogether.

I'm happy to make a) or b) acceptance criteria for inclusion. I've raised
https://github.com/anarthal/mysql/issues/58 to track this.
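
For reference, option a) would look like this at the call site (a
sketch, assuming `row` stays as it is today):

std::vector<boost::mysql::row> rows; // owned by the caller; capacity is reused
rows.reserve(512);
result.async_read_all(rows, [](boost::mysql::error_code ec) {
    // rows has been populated in place; nothing large travels through the handler
});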

>
> 2) ================================
>
> > https://anarthal.github.io/mysql/mysql/ref/boost__mysql__connection/async_query/overload1.html
> > The handler signature for this operation is
> > void(boost::mysql::error_code, boost::mysql::resultset<Stream>)
>
> Here the completion handler is returning an io object i.e.
> boost::mysql::resultset, which is also unusual as far as I can see.
> Wouldn't it make sense for users to specify a different executor for
> the resultset? This interface is preventing that.

Please note that `connection`, `resultset` and `prepared_statement`
always use the same executor as the underlying `Stream` object, as
returned by `connection::next_layer`. `resultset` and `prepared_statement`
are proxy I/O objects to make the operations more semantic, but end up
in calls to the underlying `Stream` read and write functions. I don't think
it is possible at all to have separate executors for different objects
in this context.

>
> I prefer having the completion handler return config parameters with
> which I can instantiate a boost::mysql::resultset myself in my desired
> executor.

As pointed out above, I don't think this makes sense, unless I'm missing
something. However, returning resultsets by lvalue reference instead of
as completion handler arguments may be positive for efficiency, as it
would allow more memory reuse. Resultsets hold table metadata,
which is required to parse rows and lives in dynamic storage.
There would be no performance gain for prepared_statement's,
as these perform no allocations.

I'm happy for this to become an acceptance condition, too,
but please note that it won't allow different executors in any case.
I've raised https://github.com/anarthal/mysql/issues/59
to track the issue.

>
> 3) ================================
>
> > https://github.com/anarthal/mysql/blob/13d3615464f41222052d9ad8fa0c86bcdddee537/include/boost/mysql/value.hpp#L72
>
> I am not convinced the design of the value class is a good idea. The
> reason is that users know at compile time what types they are
> expecting from a given query. That means it would be possible for them
> to specify that before doing the query. So instead of doing this
>
> https://github.com/anarthal/mysql/blob/13d3615464f41222052d9ad8fa0c86bcdddee537/example/query_async_coroutines.cpp#L107
>
> boost::mysql::row row;
> while (result.async_read_one(row, additional_info, yield[ec]))
>
> users could do
>
> boost::mysql::row<T1, T2, T3, T3> row;
> while (result.async_read_one(row, additional_info, yield[ec]))
>
> avoiding unnecessary further copies. In principle it would be even
> possible to parse directly in a custom data structure, for example,
> when storing json string in the database.

I see different use cases here:

a) Parsing a row into a data structure known at compile-time. Rather than
a row<T1, T2, T3, T4> I would go for a std::tuple<T1, T2, T3, T4> and
a describe-able struct (annotated by BOOST_DESCRIBE_STRUCT).
b) Parsing into custom data structures with arbitrary user code. This
would be the case when parsing a JSON directly into a user-provided
data structure.
c) Parsing a row where the schema is not known at compile-time. This
would be the case when implementing a MySQL command-line interface,
for example.

The current interface exposes c) natively, and would require a) and b)
to be implemented on top of it by the user. Support for a) can be added by
adding overloads to resultset::read_one. Supporting b) natively would require
something like Beast's HTTP parser.

I would say the main gain here is on user code simplicity rather than
efficiency. There may be some efficiency gains when working with long
strings, but I don't think they are relevant when compared to I/O operation
overhead.

Implementing a) natively is a decent amount of work but can be done.
I would be reluctant to expose an interface for b) from the beginning as
it increases the API surface of the library. In any case, I wouldn't get
rid of the variant interface, as I think it can be useful for some use cases.

I've raised https://github.com/anarthal/mysql/issues/60 to address this.

>
> 4) ================================
>
> > https://anarthal.github.io/mysql/mysql/examples.html
>
> IMO, the examples are too limited and show variation of the same
> functionality. For example
>
> - Text query, async with callbacks
> - Text query, async with futures
> - Text query, async with Boost.Coroutine coroutines
> - Text query, async with C++20 coroutines
> - Default completion tokens
>
> Here you are showing nice Asio features. I am not meaning you should
> remove them but if I am not mistaken there is no example that shows
>
> - How to keep long lasting connections open (specially for ssl).

Unless I'm missing something obvious, I don't think there is anything to
do on the user side to keep connections open. Once you connect the
underlying `Stream` and call `connection::handshake` (or the composed
function `connection::connect`), the connection will remain open until:

a) It is closed by the user (e.g. by calling `connection::close`).
b) The connection remains idle for a certain amount of time, as defined
by the MySQL variable
https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_wait_timeout.
By default, this timeout is 8h. It can be changed at the database level and on a
per-connection basis, by issuing a `SET SESSION` command
(e.g. `conn.query("SET SESSION wait_timeout = 3600");` would set the
connection timeout to 1h). Note that this is set *after* the connection is
established.

Unless I'm missing something, there is nothing special in handling
SSL long-lasting connections - keep-alives happen at the TCP level.
There is no keep-alive mechanism at the MySQL protocol level.

I've raised https://github.com/anarthal/mysql/issues/61
to add an example demonstrating b).

> - Timers for resolve, connect, read and write.

There is an example on how to use Asio cancellation
mechanism together with C++20 coroutines and Asio timers to implement
timeouts for name resolution, connection establishment and queries here:
https://anarthal.github.io/mysql/mysql/examples/timeouts.html
Is this what you mean?

> - Synchronized writes.

Not following you here, could you please elaborate on this particular
use case?

> - Healthy checks, failover etc.

At the protocol level, there is a COM_PING command
(https://dev.mysql.com/doc/dev/mysql-server/latest/page_protocol_com_ping.html)
that allows the user to check whether the server is still alive or
not. Issuing that
requires additional library support. The interface for this would be
something like
`void connection::ping(error_code&)`. It would return an empty error
code on success and an appropriate error code on failure. I guess this
can currently be worked around by issuing a dummy SQL query to the
server as a health check.
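
The workaround, sketched (assuming the sync overload taking error_code
and error_info by reference, as in the other sync functions):

boost::mysql::error_code ec;
boost::mysql::error_info info;
auto result = conn.query("SELECT 1", ec, info); // throwaway ping query
if (!ec)
    result.read_all(ec, info); // drain it before reusing the connection
bool alive = !ec;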

I've raised https://github.com/anarthal/mysql/issues/62
to track the PING addition.

I need to further investigate failover in replica set scenarios.
I hope to do it this evening. I've raised
https://github.com/anarthal/mysql/issues/63
to track this.

>
> I estimate that to address these points users will be required to
> write further 500 - 1000 LOC on top of your library. Otherwise they
> will be stuck on underperforming short lived connections.
>
> It would be helpful for users to have a starting point to begin with.
>
> There some minor things I have also noticed:
>
> 5) ================================
>
> > https://github.com/anarthal/mysql/blob/13d3615464f41222052d9ad8fa0c86bcdddee537/include/boost/mysql/impl/error.hpp#L35
>
> I expect this function to be O(1).

Raised https://github.com/anarthal/mysql/issues/64
to track it.

>
> 6) ================================
>
> > https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_std_optional.html
> > https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_optional.html
>
> Sounds also unusual to have two member functions for the different
> versions of optional. I suggest using the boost version until we
> transition to C++17.

Why do you think it can be a problem? I would say having both just targets
a broader audience: those using `boost::optional` and those using
`std::optional`,
without compromising either.

>
> 7) ================================
>
> > https://anarthal.github.io/mysql/mysql/ref/boost__mysql__errc.html
>
> errc should be conditions and not error AFAICS.

Not following you here - errc here is just acting the same as the
`boost::beast::http::error` enum.

>
> ============= Review End ================
>
> I vote for conditional acceptance. Points 1, 2, 3 and 4 should be
> addressed before acceptance. My acceptance criteria sums up to
>
> 1) Remove functionality that provides low performance. With
> coroutines it will be easy for users to loop on async_query to create
> a std::vector of rows.

As I mentioned above, either removing or switching to return-by-lvalue
seems OK as an acceptance condition for me.

>
> 2) No IO object should be returned as a parameter of the completion handler.

I can accept this as a performance improvement, but please read the note
on executors above.

>
> 3) Review the value type. Can it be improved?

Does this mean that you consider parsing into compile-time
user-defined data structures (e.g. `tuple`) an acceptance condition?

>
> 4) Improve examples.

Will inform as soon as I have an answer on failover, happy to provide
the other ones.

>
> Regards,
> Marcelo

Regards,
Ruben.

Ruben Perez via Boost

May 12, 2022, 3:46:35 AM
to bo...@lists.boost.org, Ruben Perez
On Wed, 11 May 2022 at 06:16, Dennis Luehring via Boost
<bo...@lists.boost.org> wrote:
>
> On 10.05.2022 at 22:33, Marcelo Zimbres Silva via Boost wrote:
> > users could do
> >
> > boost::mysql::row<T1, T2, T3, T3> row;
> > while (result.async_read_one(row, additional_info, yield[ec]))
> >
> > avoiding unnecessary further copies. In principle it would be even
> > possible to parse directly in a custom data structure, for example,
> > when storing json string in the database.
>
> i've developed a small C++17 example of how to work with
> Input/Output transformers to write the statements more compact
> and type-safe
>
>
> the base idea is that the MySQL <-> C++ needs sometimes some sort
> of conversions and to combine most of the infos at one point so its
> easier or at least possible to optimize better - the current interface
> looks very runtime allocation stylish :)

Thanks for sharing. I think I get the idea of "allowing the user to define
custom MySQL <=> C++ mappings".

I'm not really getting the interface that your input/output
transformers would expose, though. Let's take this one, for example:

template <>
struct OutputTransformer<NullAsEmptyString>
{
    using value_type = std::string;

    static std::string get_value( int /*index_*/ )
    {
        static std::string x = "hello";
        return x;
    };
    static void get_value( int index_, std::string& value_ )
    {
        value_ = get_value( index_ );
    };
};

I assume this is user-provided code. Where does the user get the
string value from? What does index_ mean in this context?

Let's also take a look at actually using the statements:

My_select my_select( db_connection,
                     "select a, b, c from test where d == ?1" );

{
    // fetch into ref tuple
    int some_int{};
    float some_float{};
    std::string some_string;
    my_select( { some_int, some_float, some_string }, { 123 } );
}

How is that different from the following snippet?

resultset r = conn.query("select a, b, c from test where d == ?1" );
tuple<int, float, string> row;
r.read_one(row);

>
>
> the Input- and Output Transformers help to work with basic types
> and also with SQL/MySQL special types like Null-string etc. - its not
> possible to map MySQL types always 1:1 to C++ and back, sometimes you
> want to
> behave the transformation different
>
> the Transformers aren't visible when used with basic types
>
>
> its just an example to promote the Idea: https://pastebin.com/raw/vepbTAKL
>
>
> the best combination would be some sort of fluent SQL interface like:
> https://github.com/rbock/sqlpp11

This library is supposed to be a protocol driver library, so it
provides primitives
close to the MySQL protocol. sqlpp11 is great, but it's a higher level library.
I don't think it makes sense trying to incorporate this kind of features here.
It would make more sense for a higher level library like sqlpp11 to build
on top of Boost.MySQL, instead.

Ruben Perez via Boost

May 12, 2022, 3:54:59 AM
to bo...@lists.boost.org, Ruben Perez
On Wed, 11 May 2022 at 09:54, Julien Blanc via Boost
<bo...@lists.boost.org> wrote:
>
> On 2022-05-10 22:33, Marcelo Zimbres Silva via Boost wrote:
>
> > 6) ================================
> >
> >> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_std_optional.html
> >> https://anarthal.github.io/mysql/mysql/ref/boost__mysql__value/get_optional.html
> >
> > Sounds also unusual to have two member functions for the different
> > versions of optional. I suggest using the boost version until we
> > transition to C++17.
>
> I probably won't have the time to do a proper review, but had a quick
> look at the documentation and this interface surprised me a bit. I was
> expecting optional (whether std or boost) to be used to map nullable
> values, but it seems that's not the case. It seems this interface will
> return an empty optional if using a wrong type (like trying to get an
> int from a string column), but the doc is silent on what it will do if
> trying to get an int from a nullable int which is actually NULL (my
> understanding is that it would return an empty optional). It bothers me
> a bit that these two use cases are handled the same way, because one of
> them is a valid use, and the other is a mismatch between the program and
> the database schema that i would like to diagnose.

That's right, value::get_optional<uint64_t>() will return an empty optional
either if your value is not an int (e.g. a string) or if it is an actual NULL value.
You can distinguish both using value::is_null(). For the use case you are
proposing, I would suggest this kind of code:

value v = /* get your value */;
if (v.is_null())
{
    // handle NULL case
}
else
{
    uint64_t my_int = v.get<uint64_t>(); // This will throw on type mismatch
}

Of course, if we end up implementing reading rows into compile-time known
data structures, we can do a better job here.

>
> I understand this issue is non-trivial, since when the value is
> retrieved from the row object, the information about the type of the
> data in the DB (and thus its nullability) is lost. However, it seems
> odd. It may be of interest to store whether the field is null in a
> independent way from its type, instead of relying on a single null type.

In addition to value::is_null, you can also access field metadata
using resultset.fields()[index]. This returns a field_metadata object
https://anarthal.github.io/mysql/mysql/ref/boost__mysql__field_metadata.html
which contains an is_not_null() function, which will return true if the
field was created with a NOT NULL clause.
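
Putting the two together, a sketch of how the two cases can be told
apart with the current interface:

boost::mysql::value v = /* obtained from the row */;
auto opt = v.get_optional<std::uint64_t>();
if (!opt)
{
    if (v.is_null())
    {
        // a legitimate NULL in a nullable column
    }
    else
    {
        // type mismatch between the program and the schema: diagnose it
    }
}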

Mateusz Loskot via Boost

May 12, 2022, 4:06:53 AM
to bo...@lists.boost.org, Mateusz Loskot
On Thu, 12 May 2022 at 09:46, Ruben Perez via Boost
<bo...@lists.boost.org> wrote:
> On Wed, 11 May 2022 at 06:16, Dennis Luehring via Boost <bo...@lists.boost.org> wrote:
> >
> > the best combination would be some sort of fluent SQL interface like:
> > https://github.com/rbock/sqlpp11
>
> This library is supposed to be a protocol driver library, so it
> provides primitives close to the MySQL protocol.
> sqlpp11 is great, but it's a higher level library.

Ruben, this is a very important feature that distinguishes
your library from high level 'common denominator'
database libraries like sqlpp11, SOCI and others.

It may be worth making this difference prominent
somewhere in the docs.

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net

Dennis Luehring via Boost

May 12, 2022, 4:21:33 AM
to Ruben Perez, bo...@lists.boost.org, Dennis Luehring


sorry for being unclear (and for sending an example that doesn't fit directly)

that code should be library code - more or less a collection of
base MySQL concepts that can be used - this sample transformer lets you
treat empty strings as NULL in MySQL - the implementation is a dummy,
only there to give a feeling for the data flow

my adaptation is used with SQLite, and the index is the parameter index
that would then map to the SQLite bind functions - or, as in this case,
checks if the value is null and returns ""

plus several other "typical" helpers for adaptation problems


the transformers also get used for all fetch routines


> Let's also take a look at actually using the statements:
>
> My_select my_select( db_connection, "select a, b, c from test
> where d == ?1" );
>
> {
> // fetch into ref tuple
> int some_int{};
> float some_float{};
> std::string some_string;
> my_select( { some_int, some_float, some_string }, { 123 } );
> }
>
> How is that different from the following snippet?
>
> resultset r = conn.query("select a, b, c from test where d == ?1" );
> tuple<int, float, string> row;
> r.read_one(row);

my goal was to keep the SQL string combined with the Prepared_fetch_1
instantiation, but string use in templates is a little bit limited

and I also map input types for inserts or where clauses - that's also
possible with split tuples for the input/output data, but then it's even
more separated from the statement (which is tied to the input/output
types)

knowing as much as possible beforehand maybe allows deeper
optimization etc. - for example, the my_select instance can use prepared
statements by default (and this is connection-oriented with SQLite)

the "readers" are just variants (that also allow to beeing const - see
const auto tuple):

// fetch into ref tuple

my_select( { ein_int, ein_float, ein_string }, { 123 } );

// return value tuple
const auto [ein_int2, ein_float2, ein_string2] = my_select( { 123 } );

// fetch into class/struct...
Result3 result;
my_select( result, { 123 } );


the real optimization party starts with multi-row fetches


// multi row fetch
using My_select = Prepared_fetch<std::tuple<int>,
                                 std::tuple<int, float, NullAsEmptyString>>;
My_select my_select( db_connection, "select a, b from test where c == ?1" );

std::vector<Result2> result;
my_select.fetch_copy( std::back_inserter( result ), 34 );
my_select.fetch_copy( result, 34 );

auto fetch_func = []( const int& /*ein_int_*/, const float& /*ein_float_*/,
                      const std::string& /*ein_string_*/ ) {};
my_select.fetch_func( fetch_func, 34 );

auto fetch_func_cancel = []( const int& /*ein_int_*/, const float& /*ein_float_*/,
                             const std::string& /*ein_string_*/ ) { return false; };
my_select.fetch_func_with_cancel( fetch_func_cancel, 34 );


because I know at instantiation time what parts are fixed, the variant
size etc., I can further reduce the memory overhead etc. - you could
directly combine the protocol result parsing with the result-set content
etc. - it's not implemented in my SQLite wrapper so far, but the
interface allows such optimization (if the backend is deep enough - like
yours)

that means fetch_copy can be prepared at compile time for exactly the
data etc., which would allow zero-copy or less-copy concepts

> >
> >
> > the Input- and Output Transformers help to work with basic types
> > and also with SQL/MySQL special types like Null-string etc. - its not
> > possible to map MySQL types always 1:1 to C++ and back, sometimes you
> > want to
> > behave the transformation different
> >
> > the Transformers aren't visible when used with basic types
> >
> >
> > its just an example to promote the Idea: https://pastebin.com/raw/vepbTAKL
> >
> >
> > the best combination would be some sort of fluent SQL interface like:
> > https://github.com/rbock/sqlpp11
>
> This library is supposed to be a protocol driver library, so it
> provides primitives
> close to the MySQL protocol. sqlpp11 is great, but it's a higher level library.
> I don't think it makes sense trying to incorporate this kind of features here.
> It would make more sense for a higher level library like sqlpp11 to build
> on top of Boost.MySQL, instead.

Boost only provides low-level stuff for real low-level concepts
(smart pointers, maps etc.),
but most other libraries are always introducing very high-level concepts

Mateusz Loskot via Boost

May 12, 2022, 4:35:40 AM
to bo...@lists.boost.org, Mateusz Loskot
On Thu, 12 May 2022 at 10:21, Dennis Luehring via Boost
<bo...@lists.boost.org> wrote:
> Boost does only provide low level stuff for real low level concepts

This is a misconception.
There is nothing that should prevent authors from submitting
high-level libraries or libraries specific to protocols or domains.
See also https://lists.boost.org/Archives/boost/2022/04/252823.php

Best regards,
--
Mateusz Loskot, http://mateusz.loskot.net

Ruben Perez via Boost

May 12, 2022, 6:12:32 AM
to bo...@lists.boost.org, Ruben Perez
On Wed, 11 May 2022 at 10:19, Dominique Devienne via Boost
<bo...@lists.boost.org> wrote:
>
> On Tue, May 10, 2022 at 10:54 PM Richard Hodges via Boost
> <bo...@lists.boost.org> wrote:
> > It is worth remembering that because this is MySQL, nothing can be done
> > with the connection object until the resultset has been fully read.

To be clear here, this is what the protocol for a query (or a prepared statement
execution) looks like. Every "message" in the below diagram is a 4-byte header
plus a payload.

client                                server
----- query request ---> // written by connection::query
<------- query OK ---- // read by connection::query
<------- metadata ---- // read by connection::query
<------- rows -------- // read by resultset::read_one
<------- EOF --------- // read by resultset::read_one

When sending two queries, you can follow a strictly sequential model.
Let's call this model a). It is the one implemented by this library:

client                                server
----- query request ---> // written by connection::query
<------- query OK ---- // read by connection::query
<------- metadata ---- // read by connection::query
<------- rows -------- // read by resultset::read_one
<------- EOF --------- // read by resultset::read_one
----- query request ---> // written by connection::query
<------- query OK ---- // read by connection::query
<------- metadata ---- // read by connection::query
<------- rows -------- // read by resultset::read_one
<------- EOF --------- // read by resultset::read_one

You could also initiate the next query without completely
reading all the packets sent by the server. This would look like this:

client                                server
--- query request 1 --->
--- query request 2 --->
<----- query 1 OK ----
<----- metadata 1 ----
<----- rows 1 --------
<----- EOF 1 ---------
<----- query 2 OK ----
<----- metadata 2 ----
<----- rows 2 --------
<----- EOF 2 ---------

This is possible at the protocol level. Note that the server won't
do any special handling here: it will process the two queries sequentially.
It will read the first one, process it, then send all the response packets,
then repeat for the second one. This second model b) is *currently not
possible* with the current interface. It would require a batch interface like:

serializer sr;
sr.add_query("SELECT * FROM table1");
sr.add_query("SELECT * FROM table2");
connection.write(sr);
resultset r1 = connection.read_query_result();
// Now you *MUST* read all the rows in this resultset
// before moving on to the next one
resultset r2 = connection.read_query_result();
// Same for r2

Note that this is different from a proper pipeline mode,
as described by Postgres. I see two major differences:

1) In a real pipeline mode, the server processes the queries
in batch. Here, the server still processes the queries sequentially.
2) Pipeline modes usually specify an option on what to do when a query
in the pipeline fails. Here, you don't have that - subsequent
queries will be executed regardless of the result of previous ones.

If you think this interface should be provided, please let me know,
and I will raise the relevant issues.

More recent versions of MySQL (v8.x) include a plugin with a new
version of the protocol, called the X protocol. I'm not an expert on it,
AFAIK it was created with the idea of using MySQL as a document store,
but can also be used for regular SQL ops. The documentation is here:
https://dev.mysql.com/doc/dev/mysql-server/8.0.26/page_mysqlx_protocol.html
This protocol does have a pipeline mode.

Please note that this library does *NOT* implement this protocol.
There are actually
two problems with it:

1) It employs Google's protobufs as message format, which creates licensing
conflicts with anything submitted to Boost.
2) It only targets MySQL 8+, not MariaDB or MySQL v5.x.

In the classic protocol (the one implemented by this library), there
are a couple extra concepts that haven't been implemented yet:

1) Multi-statement. This is a primitive form of pipelining, where you specify
several queries to connection::query() as a single string, separated
by semicolons.
The server sends several resultsets after that. I haven't focused a lot on
this because it sounded risky (in terms of security) to me.
2) Multi-resultset. This is used with stored procedures, when a procedure
returns more than one resultset.

This issue tracks both: https://github.com/anarthal/mysql/issues/8

MySQL docs on this:
https://dev.mysql.com/doc/dev/mysql-server/8.0.26/page_protocol_command_phase_sp.html

>
> Interesting, didn't know that (not being familiar with MySQL).
>
> Note that PostgreSQL's official C client API now offers a *pipeline* mode [1],
> which was part of the protocol for a while, but not exposed
> client-side until v14.
>
> The protocol also supports two *encoding* modes, text or binary. Is
> MySQL's protocol text only?
> We tested both, and binary is faster (surprise, right), which matter to me/us.

MySQL does define a text and a binary encoding. It will use the text
encoding when using text queries (i.e. connection::query) and the
binary encoding when using prepared statements (i.e.
prepared_statement::execute).
Resultset objects remember where they come from and will use the relevant
encoding.

>
> There's also a special COPY mode, for higher-performance (and
> incremental IO) bulk row access.
> Any equivalent MySQL side? Again, COPY IN and OUT is faster than
> regular DML in PostgreSQL.

There are two operations on the MySQL side:
1) The LOAD DATA statement
(https://dev.mysql.com/doc/refman/8.0/en/load-data.html#load-data-local).
You can use it like "LOAD DATA LOCAL INFILE 'test.csv' INTO TABLE test;"
to do bulk loads from the client. This needs library support and is
currently *NOT* implemented (I actually wasn't aware of the LOCAL option,
thanks for bringing this up). I've raised
https://github.com/anarthal/mysql/issues/67
to track this issue.
2) The SELECT INTO OUTFILE statement
(https://dev.mysql.com/doc/refman/8.0/en/select-into.html).
Unfortunately, this only writes files on the server host and not to the client.
This means there is no special support required in the client.
I'd say the usefulness of this statement is more limited than 1)

>
> A little off-topic for this review, but OTOH answers to the above
> would help understand the kind of async
> possible with MySQL in general, and thus the proposed Boost.MySQL, at
> the protocol level.
>
> More related to this review, I don't see any performance chapter in
> the doc's TOC with the official MySQL (C I guess) API.
> Doing away with the official client code is a risk, and using ASIO
> brings in significant complexity (from my POV),
> so how is proposed Boost.MySQL better than the official C client, or a
> good C++ wrapper on top of it?

I've updated https://github.com/anarthal/mysql/issues/50
to include this performance page.

Traditionally, the C interface had only synchronous functions,
so you would have to spawn a separate thread for each connection you had.
With an async API, you can likely have much higher throughput.
I've noticed that the official client has very recently added non-blocking
functions. It's a curious interface, as it seems you have to repeatedly call
the same function with the same parameters until the operation completes.
https://dev.mysql.com/doc/c-api/8.0/en/c-api-asynchronous-interface-usage.html
if you're curious.
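
For reference, the polling pattern from that page looks roughly like
this (a sketch based on the linked docs, MySQL 8.0.16+):

#include <mysql.h>  // official client, non-blocking API
#include <cstring>

void run_query_nonblocking(MYSQL* conn, const char* sql)
{
    enum net_async_status status;
    do {
        // must be called repeatedly with the same arguments until done
        status = mysql_real_query_nonblocking(conn, sql,
                                              (unsigned long)std::strlen(sql));
        // an event loop would do other work here instead of spinning
    } while (status == NET_ASYNC_NOT_READY);
    if (status == NET_ASYNC_ERROR)
    {
        // inspect mysql_error(conn)
    }
}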

>
> Or is the target market of this library only existing ASIO users who
> want to perform MySQL queries/statements?

This was definitely my initial target audience. But I think
we can reach more than that.

> What's the target use-case? Small (in resultset size) and frequent
> (thus low duration / overhead) queries?
> Any wins on large bulk data loads for ETL tools, especially compared
> to (simpler?) code using the official client?

I've been more focused on that first case, typical in web apps.
I'd say ETL loads can be a target in the future, but we may have
some missing features (like the LOAD DATA I mentioned above).

>
> Efficiency is listed as a goal, but there's no absolute metric, nor
> comparison to the obvious alternative that's the offical client API.

This is true. I'd say being able to use an async API can already
grant the user some performance benefits over a sync-only API,
but I agree that we should have benchmarks. Which benchmarks
would you, as a user, find useful? I'm now thinking of single-query
execution time as a measurement of latency, and bulk query
execution time as a measurement of throughput. I'm open to suggestions.

Ruben Perez via Boost

May 12, 2022, 7:04:05 AM
to bo...@lists.boost.org, Ruben Perez
On Thu, 12 May 2022 at 10:07, Mateusz Loskot via Boost
<bo...@lists.boost.org> wrote:

> >
> > This library is supposed to be a protocol driver library, so it
> > provides primitives close to the MySQL protocol.
> > sqlpp11 is great, but it's a higher level library.
>
> Ruben, this is a very important feature that distinguishes
> your library from high level 'common denominator'
> database libraries like sqlpp11, SOCI and others.
>
> It may be worth to make this difference prominent,
> somewhere in the docs.

I honestly thought it was on the main page but it is not.
Raised https://github.com/anarthal/mysql/issues/70.

Phil Endecott via Boost

May 12, 2022, 10:25:31 AM
to bo...@lists.boost.org, Phil Endecott
Hi Ruben,

Ruben Perez wrote:
> This library is supposed to be a protocol driver library, so it
> provides primitives close to the MySQL protocol.

It's frustrating to see this now, after writing my review.

I thought I was reviewing a library that, quoting your docs,
made "ease of use" its first design goal and aimed to "hide the
MySQL protocol complexities". Make up your mind, are you trying
to hide the protocol or provide something close to it?


And in a later message:

Ruben Perez wrote:
> When sending two queries
[...]
> You could also initiate the next query without completely reading
> all the packets sent by the server.
[...]
> This is possible at the protocol level.
[...]
> This [...] is *currently not possible* with the current [library] interface.

Contrast that with what your docs say:

    Warning
    Because of how the MySQL protocol works, once you obtain a resultset,
    you must read it entirely (i.e. until it's complete) before engaging
    in any subsequent operation that implies communication with the server
    (e.g. issuing another query, preparing a statement, closing a
    statement...). Failing to do so results in undefined behavior.

You are explicitly blaming the protocol for this restriction in
that quote. It seems that that is not true, and that it is a
limitation of the library.

If I were writing my review again on the basis of this being a low-level
protocol library - which I'm not going to do, I don't have time - I would
probably say that support for this sort of pipelining is essential. I
would perhaps hope to see "low-level" functions that encode and decode
the protocol packets, which the user could even use with their own
networking layer, if they wanted to.

> Pipeline modes usually specify an option on what to do when a query
> in the pipeline fails. Here, you don't have that - subsequent queries
> will be executed regardless of the result of previous ones.

Transactions help here though. For example:

Transaction t(conn);
for (auto&& a: lots_of_things) {
    execute_insert_stmt(a);
}
t.commit();

I would expect pipelining to significantly improve the performance of
that, especially if there were significant network latency.



Regards, Phil.

Dominique Devienne via Boost

May 12, 2022, 11:37:13 AM
to bo...@lists.boost.org, Dominique Devienne
On Thu, May 12, 2022 at 4:25 PM Phil Endecott via Boost
<bo...@lists.boost.org> wrote:
> Ruben Perez wrote (in the doc):
> > Warning
> > Because of how the MySQL protocol works, once you obtain a resultset,
> > you must read it entirely (i.e. until it's complete) before engaging
> > in any subsequent operation that implies communication with the server
> > (e.g. issuing another query, preparing a statement, closing a
> > statement...). Failing to do so results in undefined behavior.

> You are explicitly blaming the protocol for this restriction in
> that quote. It seems that that is not true, and that it is a
> limitation of the library.

That's not quite how I interpret it. The (basic) MySQL protocol does indeed
require the client to fully consume a resultset, and not interleave any other
operations until that's done, from what Ruben described.

The alternate sequence diagram Ruben presents shows writing the next
query *before* starting to read the first one. It's once you start reading
that you can't do anything else. And that mode is essentially like having
the two queries in the same packet, semicolons-separated, something
also not supported at this point, but which probably should be.

> > Pipeline modes usually specify an option on what to do when a query
> > in the pipeline fails. Here, you don't have that - subsequent queries
> > will be executed regardless of the result of previous ones.
>
> Transactions help here though. For example:
>
> Transaction t(conn);
> for (auto&& a: lots_of_things) {
> execute_insert_stmt(a);
> }
> t.commit();

That's the pattern I also use, because my APIs throw, yielding the
implicit rollback.
Although I'm not sure it will translate well to async code.

> I would expect pipelining to significantly improve the performance of
> that, especially if there were significant network latency.

It depends Phil. It really helps on slow, high-latency networks.
In a demo example, manually adding 50ms latency, Laurenz Albe got a 4x
speedup [1].
But if you run on the LAN with sub-1ms latencies, and/or fast
Multi-Gb/sec (Cloud) networks,
it's less likely to matter that much I suspect.

[1]: https://www.cybertec-postgresql.com/en/pipeline-mode-better-performance-on-slow-network/

Ruben Perez via Boost

May 12, 2022, 3:03:10 PM
to Dominique Devienne, bo...@lists.boost.org, Ruben Perez
> Indeed. But even your imagined batch interface only works well for
> queries/selects,
> while for inserts (or updates), the client does not need to just send
> a small textual
> SQL query, but potentially a bunch of data for the rows too. A true
> pipeline allows
> sending the rows for a second insert while the first insert is being processed.

It would work for inserts too, as values are either part of the query string
or part of the statement execute packet. In both cases, part of the request.

> Such a mode may be useful for schema creation OTOH. We have large schemas,
> with hundreds of tables, indexes, triggers, etc... Done from C++ code
> client-side,
> not via a DBA manually executing on the server using SQL files and the
> native CLI.
> For that use-case, the ability to send many DDLs in a single batch
> would definitely
> save on the round-trips to the server. We try hard to minimize roundtrips!

Thanks for sharing this use case, I definitely wasn't aware of it,
and it seems a reason in favor of implementing multi-statement.

> You don't need to necessarily use Google's protobufs library.
> There's https://github.com/mapbox/protozero for example, and similarly, a
> from-scratch implementation to just encode-decode a specific protocol
> can also be written.

I wasn't aware of this. Thanks.

> > The server sends several resultsets after that. I haven't focused a lot on
> > this because it sounded risky (in terms of security) for me.
>
> Not sure what's risky here. Maybe I'm missing something.

I was just imagining users concatenating queries. That may be a
misconception; yours is a legitimate use case.

> Note that PostgreSQL's COPY is not file-specific, and a first-class citizen
> at the protocol level, depending on a pure binary format (with text mode too).
> I use COPY with STDIN / STDOUT "files", i.e. I prepare memory buffers and
> send them; and read memory buffers, and decode them. No files involved.

Unfortunately, that does not seem to be the case for MySQL. You issue the
LOAD DATA statement via a regular query packet, and the server returns
you another packet with the file path it wants. You then read it in the client
and send it to the server in another packet. I've made it work with CSV files,
and I'd say it's the only format allowed, AFAIK.

> Maybe our use-case of very large data (with large blobs), *and* very
> numerous smaller data,
> that's often loaded en-masse, by scientific desktop applications, and
> increasingly mid-tier services
> for web-apps, is different from the more common ecommerce web-app
> use-case of many people.

It is good to hear about use cases different from the regular web server
we all have in mind. It will help me during further development.

>
> When I evaluate DB performance, I tend to concentrate on "IO" performance,
> in terms of throughput and latency, independent of the speed of the
> SQL engine itself.
> There's nothing I can do about the latter, while the way one uses the Client API
> (or underlying protocol features) is under my control. So I do mostly
> INSERTs and
> SELECTs, with and without WHERE clauses (no complex JOINs, CTEs, etc...).
>
> Because we are in the scientific space, we care about both many small rows
> (millions, of a few bytes to KBs each at most), and a few (hundreds / thousands)
> much larger rows with blobs (with MBs to GBs sizes). The truly large
> "blobs" (files) are
> kept outside the DB, since mostly read only (in the GBs to TBs sizes each, that
> can accumulate to 2.6 PB for all a client's data I heard just
> yesterday for example).
>
> I'll also compare inserting rows 1-by-1, with and without prepared statements,
> to inserting multi-rows per-statement (10, 100, 1000 at a time), to the "bulk"
> interface (COPY for PostgreSQL, LOCAL for MySQL, Direct-Path load in OCI).
> For example with SQLite:
> https://sqlite.org/forum/forumpost/baf9c444d9a38ca6e59452c1c568044aaad50bbaadfff113492f7199c53ecfed
> (SQLite as no "bulk" interface, doesn't need one, since "embedded"
> thus "zero-latency")
>
> For PostgreSQL, we also compared text vs binary modes (for binds and
> resultsets).
>
> For blobs, throughput of reading and writing large blobs, whole and in-part,
> with different access patterns, like continuous ranges, or scattered inside).
>
> A very important use-case for us, for minimizing round-trips, is how
> to load a subset of rows,
> given a list of primary keys (typically a surrogate key, like an
> integer or a uuid). For that,
> we bind a single array-typed value for the WHERE clause's placeholder
> to the query,
> and read the resultset, selecting 1%, 5%, 10%, etc... of the rows from
> the whole table.
> (selecting each row individually, even with a prepared query, is terribly slow).

Thanks for sharing. It will definitely help me during benchmarking.

Regards,
Ruben.

Ruben Perez via Boost

May 13, 2022, 9:14:58 AM
to bo...@lists.boost.org, Ruben Perez, Phil Endecott
> It's frustrating to see this now, after writing my review.

Hi Phil, please accept my apologies for the misunderstanding.
I still think many of your comments were useful, and a lot of them
will end up as improvements in the library.

>
> I thought I was reviewing a library that, quoting your docs,
> made "ease of use" its first design goal and aimed to "hide the
> MySQL protocol complexities". Make up your mind, are you trying
> to hide the protocol or provide something close to it?

Ease of use is written there because it is really important to me.
I've tried to hide as many complexities of the protocol as I've been
able to, without compromising performance (maybe too many for
a protocol library, as we'll discuss below). I've abstracted the handshake
algorithm, the different encodings, and tried to expose an interface
as simple as possible. But the library is not an ORM or a SQL framework,
and the documentation may not be clear enough about its scope.
https://github.com/anarthal/mysql/issues/70 addresses this rewording.

>
> You are explicitly blaming the protocol for this restriction in
> that quote. It seems that that is not true, and that it is a
> limitation of the library.
>
> If I were writing my review again on the basis of this being a low-level
> protocol library - which I'm not going to do, I don't have time - I would
> probably say that support for this sort of pipelining is essential. I
> would perhaps hope to see "low-level" functions that encode and decode
> the protocol packets, which the user could even use with their own
> networking layer, if they wanted to.

I've raised https://github.com/anarthal/mysql/issues/76 to correct
that wording, as it is definitely not true. This section should explain
the whole story, with both the library limitation and the protocol capabilities.
I've raised https://github.com/anarthal/mysql/issues/75 for pipeline mode.

Regards,
Ruben.

Peter Dimov via Boost

May 13, 2022, 9:42:17 AM
to bo...@lists.boost.org, Peter Dimov
Ruben Perez wrote:
> But the library is not an ORM or a SQL framework, and the documentation
> may not be clear enough in scope.

I have little use for ORM or SQL frameworks. I do, however, have use for
two things:

1. A way to say

conn.query( "SELECT * FROM tb_something WHERE id = ?", id );

There is a school of thought that holds that I shouldn't be doing that
and should be using prepared statements instead, but managing the lifetime of
prepared statements is inconvenient enough to steer people, including
myself, towards sprintf-ing their way to victory, which of course invites
security issues.

The above interface is easier than sprintf, which will steer people away
from sprintf, which will be a big win in both usability and safety.

2. A way to obtain a described struct from a result row of the above
query.

Pretty self-explanatory (*). This is almost always what one wants, if
the query returns more than one column, and is what everyone ends up
reinventing and coding by hand.
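
For instance, with Boost.Describe (the row_to function is hypothetical,
shown only to illustrate the desired shape):

#include <boost/describe.hpp>
#include <string>

struct employee
{
    std::string name;
    double salary;
};
BOOST_DESCRIBE_STRUCT(employee, (), (name, salary))

// a hypothetical helper that iterates the described members:
employee e = row_to<employee>(row);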

Both of these features contain a significant amount of
protocol-independent code, which makes them out of scope under a
strict reading, but they are worth having IMO.

(*) Except for the question of whether out of order columns should be
automatically put into proper order based on the member names.
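
For illustration, a rough sketch of what 2. could look like on top of
Boost.Describe (row_to_struct is a hypothetical helper; it assumes the
row::values() and value::get<T>() accessors from the reviewed API, and
maps columns by position, ignoring the (*) question):

#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>
#include <boost/describe.hpp>
#include <boost/mp11.hpp>
#include <boost/mysql/row.hpp>

struct employee
{
    std::int64_t id;
    std::string first_name;
    std::string last_name;
};
BOOST_DESCRIBE_STRUCT(employee, (), (id, first_name, last_name))

// Hypothetical helper: copy the i-th field of the row into the i-th
// described member, converting through the library's value accessor.
template <class T>
T row_to_struct(const boost::mysql::row& r)
{
    T result;
    std::size_t i = 0;
    boost::mp11::mp_for_each<
        boost::describe::describe_members<T, boost::describe::mod_public>>(
        [&](auto D) {
            using member_type =
                std::remove_reference_t<decltype(result.*D.pointer)>;
            result.*D.pointer = r.values().at(i++).get<member_type>();
        });
    return result;
}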

Dominique Devienne via Boost

unread,
May 13, 2022, 9:44:45 AM5/13/22
to Ruben Perez, bo...@lists.boost.org, Dominique Devienne
Meant to send this to the list, not just Ruben. Re-sending. --DD

On Fri, May 13, 2022 at 11:59 AM Dominique Devienne <ddev...@gmail.com> wrote:
>
> On Thu, May 12, 2022 at 9:02 PM Ruben Perez <rubenp...@gmail.com> wrote:
> > > > The server sends several resultsets after that. I haven't focused a lot on
> > > > this because it sounded risky (in terms of security) for me.
> > >
> > > Not sure what's risky here. Maybe I'm missing something.
> >
> > I was just imagining users concatenating queries. May be a misconception,
> > yours is a legitimate use case.
>
> I've thought more about this, and I'm guessing you were thinking
> about this XKCD https://xkcd.com/327/ SQL Injection.
>
> The proper and ideal remedy is not the XKCD one though,
> it is using prepared statements and binding values.
>
> The still proper but less ideal next remedy is a "safe" way to concatenate
> user inputs into textual SQL. I.e. the equivalent of
> SQLite: https://www.sqlite.org/printf.html with %q and %Q
> PostgreSQL: https://www.postgresql.org/docs/current/libpq-exec.html#LIBPQ-PQESCAPELITERAL
> Since you want to avoid the MySQL client, which I suppose has this
> too, you have to provide your own.
>
> Third and finally, the XKCD hack indeed only works if one can
> execute several semicolon-separated statements.
> So on second thought, it would *definitely* be a good thing to have
> separate APIs to execute a single statement, as opposed to executing
> "scripts" of statements, plural. With a single-statement API, *even*
> the XKCD hack won't work, since it would yield an invalid single
> statement.
> This is what you had in mind IMHO :).
>
> So the ability to execute multi-statement scripts is important, especially in
> my DDL use-case,
> but it should be a separate API from the more regular single-statement
> APIs, and come
> with extra warnings in the documentation about such SQL injections (with
> the obligatory link to XKCD ;)
>
> PS: Fourth, the XKCD hack is possible only if the SQL is executed as
> the owner of the schema,
> or a DBA/Superuser, which is of course a mistake and a big no-no,
> yet all too common...
> Users of the DB, including mid-tier services, should NOT have DDL
> privileges on the schema and its objects.
> Change the DROP into a DELETE, and that's still nasty, but then you
> use ROLEs to restrict some DMLs,
> and possibly also RLS (Row Level Security) for finer-grained
> authorization. But we are getting off-topic :)

Ruben Perez via Boost

unread,
May 13, 2022, 12:48:30 PM5/13/22
to Dominique Devienne, bo...@lists.boost.org, Ruben Perez
> >
> > I was just imagining users concatenating queries. May be a misconception,
> > yours is a legitimate use case.
>
> I've thought more about this, and I'm guessing you were thinking
> about this XKCD https://xkcd.com/327/ SQL Injection.

Yes, this was definitely what I was thinking of ;)

>
> The proper and ideal remedy is not the XKCD one though,
> it is using prepared statements and binding values.
>
> The still proper but less ideal next remedy is a "safe" way to concatenate
> user inputs into textual SQL. I.e. the equivalent of
> SQLite: https://www.sqlite.org/printf.html with %q and %Q
> PostgreSQL: https://www.postgresql.org/docs/current/libpq-exec.html#LIBPQ-PQESCAPELITERAL
> Since you want to avoid the MySQL client, which I suppose has this
> too, you have to provide your own.

Since not everything can be done with prepared statements
(I've been shown some legitimate use cases that are not
covered), I've raised https://github.com/anarthal/mysql/issues/69
to address this, and I will likely implement something like this.
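
As a rough sketch of the direction (simplified; the function name is
illustrative, and a real implementation must honor the connection
character set and the NO_BACKSLASH_ESCAPES SQL mode):

#include <string>
#include <string_view>

// Hypothetical helper in the spirit of SQLite's %q and libpq's
// PQescapeLiteral: backslash-escape quotes and backslashes, then
// wrap the result in single quotes.
std::string quote(std::string_view s)
{
    std::string out;
    out.reserve(s.size() + 2);
    out.push_back('\'');
    for (char c : s)
    {
        if (c == '\'' || c == '\\')
            out.push_back('\\');
        out.push_back(c);
    }
    out.push_back('\'');
    return out;
}

// Usage sketch:
//   conn.query("SELECT * FROM employee WHERE name = " + quote(name));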

>
> Third and finally, the XKCD hack indeed only works if one can
> execute several semicolon-separated statements.
> So on second thought, it would *definitely* be a good thing to have
> separate APIs to execute a single statement, as opposed to executing
> "scripts" of statements, plural. With a single-statement API, *even*
> the XKCD hack won't work, since it would yield an invalid single
> statement.
> This is what you had in mind IMHO :).

The trouble with this is that multi-statement is not a separate protocol
primitive, but rather something you turn on or off for the entire session.
So your connections either support it or don't. If you include such a semicolon-
separated set of queries in your query request, and this capability is turned
on, the server will happily execute it. So maybe this should be implemented
as an extra value in connection_params
(https://anarthal.github.io/mysql/mysql/connparams.html)
with adequate security warnings.
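
Something along these lines, say (the setter and the surrounding names
are hypothetical, just to illustrate; it would map to the MySQL
CLIENT_MULTI_STATEMENTS capability and be off by default):

boost::mysql::connection_params params("user", "password", "database");
params.set_multi_statements(true); // hypothetical option, off by default
conn.connect(endpoint, params);    // ';'-separated scripts now allowed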

Ruben Perez via Boost

unread,
May 13, 2022, 12:53:50 PM5/13/22
to bo...@lists.boost.org, Ruben Perez, Peter Dimov
>
> I have little use for ORM or SQL frameworks. I do, however, have use for
> two things:
>
> 1. A way to say
>
> conn.query( "SELECT * FROM tb_something WHERE id = ?", id );

It has become obvious to me during this review that this functionality
should be high priority, one way or another.
https://github.com/anarthal/mysql/issues/69 tracks it.

> 2. A way to obtain a described struct from a result row of the above
> query.

This one has been mentioned in almost all reviews, too.
https://github.com/anarthal/mysql/issues/60 tracks it.

> Both of these features contain a significant amount of
> protocol-independent code, which makes them out of scope under a
> strict reading, but they are worth having IMO.

That may be the case for 1., although escaping is MySQL-specific,
so it's not insane for the library to provide it. First-class support for
2. has advantages, too.

Regards,
Ruben.

Vinnie Falco via Boost

unread,
May 13, 2022, 1:19:56 PM5/13/22
to boost@lists.boost.org List, Vinnie Falco
On Tue, May 10, 2022 at 1:33 PM Marcelo Zimbres Silva via Boost
<bo...@lists.boost.org> wrote:
> Having a std::vector in the signature of the completion handler is
> unusual in ASIO. In general, large objects should be passed as
> parameters to the initiating function.

I agree that this is weird. Instead of a vector, the library could
introduce a new concept, "RowAccumulator": a lightweight,
movable value that is invoked to append elements to a notional
container. This solves all the problems with allocators and
memory reuse.

I suggest that if the library is accepted, it should be conditional
upon removing the vector and replacing it with a concept. And the
library should provide - either in the examples or as a public API - a
wrapper that implements accumulation to a vector.
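
To make that concrete, a rough sketch (all names illustrative, not an
agreed design):

#include <utility>
#include <vector>
#include <boost/mysql/row.hpp>

// A notional RowAccumulator is anything invocable with a row. The
// vector wrapper could then look like this; the caller owns the
// storage, so allocators and memory reuse come for free.
class vector_accumulator
{
    std::vector<boost::mysql::row>& v_;
public:
    explicit vector_accumulator(std::vector<boost::mysql::row>& v) : v_(v) {}
    void operator()(boost::mysql::row&& r) { v_.push_back(std::move(r)); }
};

// Usage sketch: the same vector (and its memory) can be cleared and
// reused across reads.
//   std::vector<boost::mysql::row> rows;
//   result.async_read_all(vector_accumulator(rows), token);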

Thanks

Vinnie Falco via Boost

unread,
May 13, 2022, 1:29:39 PM5/13/22
to boost@lists.boost.org List, Vinnie Falco
On Thu, May 12, 2022 at 3:12 AM Ruben Perez via Boost
<bo...@lists.boost.org> wrote:
> ...Google's protobufs...creates licensing
> conflicts with anything submitted to Boost.

If you are only reading protobufs and using the most common subset of
its features, it should be fairly easy to do a clean-room
implementation of just the part of the protobuf format that you need
to make your library work. It wouldn't be that much code either. It
would not be necessary (or desirable) to require linking with the
stock protobuf library.

Thanks

Ruben Perez via Boost

unread,
May 15, 2022, 9:57:50 AM5/15/22
to bo...@lists.boost.org, Ruben Perez
On Fri, 13 May 2022 at 19:20, Vinnie Falco via Boost
<bo...@lists.boost.org> wrote:
>
> On Tue, May 10, 2022 at 1:33 PM Marcelo Zimbres Silva via Boost
> <bo...@lists.boost.org> wrote:
> > Having a std::vector in the signature of the completion handler is
> > unusual in ASIO. In general, large objects should be passed as
> > parameters to the initiating function.
>
> I agree that this is weird. Instead of a vector, the library could
> introduce a new concept, "RowAccumulator": a lightweight,
> movable value that is invoked to append elements to a notional
> container. This solves all the problems with allocators and
> memory reuse.

How would that RowAccumulator concept be different from
a regular output iterator with boost::mysql::row value type?

>
> I suggest that if the library is accepted, it should be conditional
> upon removing the vector and replacing it with a concept. And the
> library should provide - either in the examples or as a public API - a
> wrapper that implements accumulation to a vector.

This makes sense. I've updated
https://github.com/anarthal/mysql/issues/58
to include your comments.

Regards,
Ruben.

Ruben Perez via Boost

unread,
May 15, 2022, 9:59:20 AM5/15/22
to Vinnie Falco, Ruben Perez, boost@lists.boost.org List
On Fri, 13 May 2022 at 19:29, Vinnie Falco <vinnie...@gmail.com> wrote:
>
> On Thu, May 12, 2022 at 3:12 AM Ruben Perez via Boost
> <bo...@lists.boost.org> wrote:
> > ...Google's protobufs...creates licensing
> > conflicts with anything submitted to Boost.
>
> If you are only reading protobufs and using the most common subset of
> its features, it should be fairly easy to do a clean-room
> implementation of just the part of the protobuf format that you need
> to make your library work. It wouldn't be that much code either. It
> would not be necessary (or desirable) to require linking with the
> stock protobuf library.

It's good to know that this is feasible, as I could
see this library implementing the X protocol at some point.
I don't have the time to do this right now, but it's encouraging
that it can be done.

Regards,
Ruben.

Ruben Perez via Boost

unread,
May 15, 2022, 12:26:11 PM5/15/22
to Dennis Luehring, Ruben Perez, boost@lists.boost.org List
> > I'm not really getting the interface that your input/output
> > transformers would expose, though. Let's take this one, for example:
> >
> > template <>
> > struct OutputTransformer<NullAsEmptyString>
> > {
> >     using value_type = std::string;
> >
> >     static std::string get_value( int /*index_*/ )
> >     {
> >         static std::string x = "hello";
> >         return x;
> >     };
> >     static void get_value( int index_, std::string& value_ )
> >     {
> >         value_ = get_value( index_ );
> >     };
> > };
> >
> > I assume this is user-provided code. Where does the user get the
> > string value from? What does index_ mean in this context?
>
>
> sorry for being unclear (and for sending an example that doesn't directly fit)
>
> that code should be library code - more or less a collection of
> basic MySQL concepts
> that can be used - this sample transformer lets you treat empty strings as
> NULL in MySQL
> - the implementation is a dummy - only to give a feeling for how the data flows
>
> my adaptation is used with SQLite, and the index is the parameter index
> that would then map to the SQLite bind functions or, as in this case, checks
> if the
> value is null and returns ""
>
> plus several other "typical" helpers for adaptation problems

What kind of transformers do you propose? For instance, making a NULL
string be treated as empty is a pattern I don't especially like. I guess
string-to-JSON parsing is the obvious one. That would end up having the
form of custom types that can be specified in the tuple/struct representing
a row, so something like:

tuple<int, std::string, null_as_empty_string> row;
resultset.read_one(row);

>
> knowing as much as possible beforehand maybe allows deeper
> optimization etc. - for example, the my_select instance
> can use prepared statements by default (and this is connection-oriented
> with SQLite)

I'd say you know as much at compile time in both cases.

>
> // return value tuple
> const auto [ein_int2, ein_float2, ein_string2] = my_select( { 123 } );

I'd suggest something like

struct my_row {
    int ein_int2;
    float ein_float2;
    std::string ein_string2;
};
BOOST_DESCRIBE_STRUCT(my_row, (), (ein_int2, ein_float2, ein_string2))

my_row row;
resultset.read_one(row);

Returning it by value works great for sync ops but can have bad implications
in async ops.

> // multi row fetch
> using My_select = Prepared_fetch<std::tuple<int>,
>                                  std::tuple<int, float, NullAsEmptyString>>;
> My_select my_select( db_connection, "select a, b from test where c == ?1" );
>
> std::vector<Result2> result;
> my_select.fetch_copy( std::back_inserter( result ), 34 );
> my_select.fetch_copy( result, 34 );
>
> auto fetch_func = []( const int& /*ein_int_*/, const float& /*ein_float_*/,
>                       const std::string& /*ein_string_*/ ) {};
> my_select.fetch_func( fetch_func, 34 );
> auto fetch_func_cancel = []( const int& /*ein_int_*/, const float& /*ein_float_*/,
>                              const std::string& /*ein_string_*/ ) {
>     return false; };
> my_select.fetch_func_with_cancel( fetch_func_cancel, 34 );

I can see something like fetch_copy, as a generalization of
resultset::read_many.
Other users have also brought it up.
https://github.com/anarthal/mysql/issues/58 tracks it.
It can be extended to work with tuples or structs.

>
> Boost only provides low-level stuff for real low-level concepts
> (smart pointers, maps, etc.)
> but most other libraries are always introducing very high-level concepts
>

As Mateusz pointed out, I think there is room for everything - Beast has very
low level stuff, too.

Thanks,
Ruben.

Ruben Perez via Boost

unread,
May 15, 2022, 12:31:02 PM5/15/22
to boost@lists.boost.org List, Ruben Perez
On Thu, 12 May 2022 at 17:37, Dominique Devienne via Boost
<bo...@lists.boost.org> wrote:
>
> On Thu, May 12, 2022 at 4:25 PM Phil Endecott via Boost
> <bo...@lists.boost.org> wrote:
> > Ruben Perez wrote (in the doc):
> > > Warning
> > > Because of how the MySQL protocol works, once you obtain a resultset,
> > > you must read it entirely (i.e. until it's complete) before engaging
> > > in any subsequent operation that implies communication with the server
> > > (e.g. issuing another query, preparing a statement, closing a
> > > statement...). Failing to do so results in undefined behavior.
>
> > You are explicitly blaming the protocol for this restriction in
> > that quote. It seems that that is not true, and that it is a
> > limitation of the library.
>
> That's not quite how I interpret it. The (basic) MySQL protocol does indeed
> require the client to fully consume a resultset, and not interleave any other
> operations until that's done, from what Ruben described.
>
> The alternate sequence-diagram Ruben presents, shows writing the next
> query *before* starting to read the first one. It's once you start reading
> that you can't do anything else. And that mode is essentially like having
> the two queries in the same packet, semicolons-separated, something
> also not supported at this point, but which probably should be.

What I meant by that is that you will have to consume the entire resultset
before starting to consume the next resultset (or the response to any
subsequent request). You can still write stuff to the server, but the
replies will be queued until you read that resultset. I do think the docs
are misleading, and I've raised
https://github.com/anarthal/mysql/issues/76 to fix it.

>
> It depends, Phil. It really helps on slow, high-latency networks.
> In a demo example, manually adding 50ms latency, Laurenz Albe got a 4x
> speedup [1].
> But if you run on the LAN with sub-1ms latencies, and/or fast
> Multi-Gb/sec (Cloud) networks,
> it's less likely to matter that much I suspect.

I have raised https://github.com/anarthal/mysql/issues/75 to address pipeline
mode.

Regards,
Ruben.

Dennis Luehring via Boost

unread,
May 16, 2022, 11:29:22 AM5/16/22
to Ruben Perez, Dennis Luehring, boost@lists.boost.org List
std::optional would be more correct

clearly, all basic types:

TEXT NULL <-> std::optional<std::string>
BasicType NULL <-> std::optional<BasicType>

> >
> > knowing as much as possible beforehand maybe allows deeper
> > optimization etc. - for example, the my_select instance
> > can use prepared statements by default (and this is connection-oriented
> > with SQLite)
>
> I'd say you know as much at compile time in both cases.
>
> >
> > // return value tuple
> > const auto [ein_int2, ein_float2, ein_string2] = my_select( { 123 } );
>
> I'd suggest something like
>
> struct my_row {
>     int ein_int2;
>     float ein_float2;
>     std::string ein_string2;
> };
> BOOST_DESCRIBE_STRUCT(my_row, (), (ein_int2, ein_float2, ein_string2))
>
> my_row row;
> resultset.read_one(row);

i like it flexible, to adapt to my structures, tuples etc. - but yours
seems flexible enough so far
i'm a big nearly/zero-copy fan

> Returning it by value works great for sync ops but can have bad implications
> in async ops.

that's true - but that applies to everything returned by value when using async

>
> > // multi row fetch
> > using My_select = Prepared_fetch<std::tuple<int>,
> >                                  std::tuple<int, float, NullAsEmptyString>>;
> > My_select my_select( db_connection, "select a, b from test where c == ?1" );
> >
> > std::vector<Result2> result;
> > my_select.fetch_copy( std::back_inserter( result ), 34 );
> > my_select.fetch_copy( result, 34 );
> >
> > auto fetch_func = []( const int& /*ein_int_*/, const float& /*ein_float_*/,
> >                       const std::string& /*ein_string_*/ ) {};
> > my_select.fetch_func( fetch_func, 34 );
> > auto fetch_func_cancel = []( const int& /*ein_int_*/, const float& /*ein_float_*/,
> >                              const std::string& /*ein_string_*/ ) {
> >     return false; };
> > my_select.fetch_func_with_cancel( fetch_func_cancel, 34 );
>
> I can see something like fetch_copy, as a generalization of
> resultset::read_many.
> Other users have also brought it up.
> https://github.com/anarthal/mysql/issues/58 tracks it.
> It can be extended to work with tuples or structs.

don't forget the ability to read cursor/stream-based and to allow breaking
off the read early

therefore I have this callable lambda that allows ending the read based
on a runtime decision


> >
> > Boost only provides low-level stuff for real low-level concepts
> > (smart pointers, maps, etc.)
> > but most other libraries are always introducing very high-level concepts
> >
>
> As Mateusz pointed out, I think there is room for everything - Beast has very
> low level stuff, too.
>

you're right about that

Ruben Perez via Boost

unread,
May 18, 2022, 6:16:48 AM5/18/22
to Dennis Luehring, Ruben Perez, boost@lists.boost.org List
> clearly, all basic types:
>
> TEXT NULL <-> std::optional<std::string>
> BasicType NULL <-> std::optional<BasicType>

When we do https://github.com/anarthal/mysql/issues/60
we will have those mappings. Nullable types correspond to
optionals, surely. We will allow boost::optional, too.
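
Something along these lines, once that issue lands (assumed future API):

// Sketch: NULLs map to empty optionals, non-NULL values to engaged ones.
std::tuple<std::int64_t, std::optional<std::string>> row;
resultset.read_one(row);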

> > I can see something like fetch_copy, as a generalization of
> > resultset::read_many.
> > Other users have also brought it up.
> > https://github.com/anarthal/mysql/issues/58 tracks it.
> > It can be extended to work with tuples or structs.
>
> don't forget the ability to read cursor/stream-based and to allow breaking
> off the read early
>
> therefore I have this callable lambda that allows ending the read based
> on a runtime decision

The trouble here is that when calling connection::query, MySQL sends
all the results beforehand. So you will have to read those packets even
if you want to discard them; otherwise, you won't be able to use the
connection further.

Prepared statements allow fetching a limited number of rows, using
cursors. This is not implemented yet, and is tracked by
https://github.com/anarthal/mysql/issues/20. Even in this case,
you have to manually ask the server "I want 5 rows", and you will
have to read those 5 rows before going further.
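
As a very rough sketch of what that could look like once implemented
(execute_with_cursor and fetch are hypothetical names, not part of the
current API):

// Hypothetical cursor-based fetching: the client explicitly requests
// batches and must drain each batch before issuing anything else.
auto stmt = conn.prepare_statement("SELECT * FROM big_table");
auto rs = stmt.execute_with_cursor(boost::mysql::no_statement_params);
std::vector<boost::mysql::row> batch;
while (!rs.complete())
{
    rs.fetch(5, batch);                     // "I want 5 rows"
    for (const auto& r : batch) { /* ... */ }
}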

Regards,
Ruben.

Ruben Perez via Boost

unread,
May 18, 2022, 6:28:14 AM5/18/22
to Kostas Savvidis, Ruben Perez, boost@lists.boost.org List
On Wed, 18 May 2022 at 11:05, Kostas Savvidis <ksav...@gmail.com> wrote:
>
> Hi Ruben,
>
> Some people have already suggested making the tcp_ssl connection the default. In that spirit, it seems to me that it would be better to
> remove such prefixes as tcp_ssl from tcp_ssl_prepared_statement and tcp_ssl_resultset
> and instead add appropriate prefixes to the non-default connections. Actually, since the only other type is unix socket,
> one only needs to add the prefix plaintext or nonssl for the unencrypted connections.


I like how this sounds. I've updated https://github.com/anarthal/mysql/issues/73
to reflect your comments. Note that we may want to add named_pipe_connection
for Windows in the future.

> Moreover, is there no technical solution that can completely remove the need to deal with these distinct names for resultset and prepared_statement after the appropriate connection is established?

They depend on the Stream type because they have I/O operations. E.g.
for a prepared_statement, you need the Stream type to implement
the prepared_statement::execute function. Internally, these are really
just pointers to the connection object (more or less). One possible
option is making prepared_statement and resultset not be I/O objects,
and moving these functions to the connection object. So the following signatures:

prepared_statement::execute(const params<ValueIterator>&)
prepared_statement::close()
resultset::read_one(row&)

Would change to these ones:

connection::execute_statement(const prepared_statement&, const params<ValueIterator>&)
connection::close_statement(const prepared_statement&)
connection::read_row(resultset&, row&)
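
In usage, the second option would read something like this (a sketch;
make_values as in the current examples; note that under this option
prepared_statement and resultset no longer depend on the Stream type):

boost::mysql::prepared_statement stmt =
    conn.prepare_statement("SELECT first_name FROM employee WHERE company_id = ?");
boost::mysql::resultset rs =
    conn.execute_statement(stmt, boost::mysql::make_values(42));
boost::mysql::row r;
while (conn.read_row(rs, r))
{
    // use r
}
conn.close_statement(stmt);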

What do you think?

Dominique Devienne via Boost

unread,
May 18, 2022, 8:35:16 AM5/18/22
to bo...@lists.boost.org, Dominique Devienne
On Wed, May 18, 2022 at 12:16 PM Ruben Perez via Boost
<bo...@lists.boost.org> wrote:
> The trouble here is that when calling connection::query, MySQL sends
> all the results beforehand. So you will have to read those packets even
> if you want to discard them; otherwise, you won't be able to use the
> connection further.
>
> Prepared statements allow fetching a limited number of rows, using
> cursors. This is not implemented yet, and is tracked by
> https://github.com/anarthal/mysql/issues/20. Even in this case,
> you have to manually ask the server "I want 5 rows", and you will
> have to read those 5 rows before going further.

LibPQ (and the protocol below it, I assume) is similar.
You must wait and "consume" the whole resultset of a prepared statement,
while with a cursor, you fetch "batches" of your chosen size(s).

BUT at least with LibPQ, it turns out the Cursor approach is slower
overall, in my testing.

The Cursor approach gives you faster time-to-first-row (in sync mode
at least), and you
don't need to have libpq accumulate a large resultset in its own
memory, since you can
process and discard the smaller batches, but you pay for that with
being up to 2X slower overall,
probably from the increased round-trips I guess (the slowdown probably
depends on the batch sizes too).

In your case, you are async, so unlike libpq (sync mode), you have
good time-to-first-row in both cases,
and still allow processing the rows as they arrive. But I'd double-check
the overhead of Statement vs Cursor,
with different batch sizes, on a larger resultset, in your performance
testing. That's maybe straying a bit outside
the scope of your library. But you have high performance as
clearly in-scope, so tradeoffs like these,
which granted are outside your control, are still important to bench
for, and to mention in the docs IMHO. My $0.02.

Kostas Savvidis via Boost

unread,
May 19, 2022, 3:21:42 AM5/19/22
to bo...@lists.boost.org, Kostas Savvidis


> On May 18, 2022, at 13:27, Ruben Perez via Boost <bo...@lists.boost.org> wrote:
>
> So the following signatures:
>
> prepared_statement::execute(const params<ValueIterator>&)
> prepared_statement::close()
> resultset::read_one(row&)
>
> Would change to these ones:
>
> connection::execute_statement(const prepared_statement&, const params<ValueIterator>&)
> connection::close_statement(const prepared_statement&)
> connection::read_row(resultset&, row&)
>
> What do you think?

Someone with experience using the library and a better understanding than mine should comment.

But I noticed that in the postgres-asio example you declare
auto stmt = conn.prepare_statement("UPDATE mytable SET f1 = $1 WHERE f2 = $2");
In the mysql examples you did not use auto. I presume this style is also possible in mysql,
so maybe it's not worth changing the signatures, as it is not such a big deal.

Cheers,
Kostas

Ruben Perez via Boost

unread,
May 19, 2022, 5:30:20 AM5/19/22
to Dominique Devienne, Ruben Perez, boost@lists.boost.org List
It will likely have similar effects to what you describe. While I don't
think this should be in scope for this first library version, it definitely
is something users may want at some point and that requires library support,
so it should be taken into account. https://github.com/anarthal/mysql/issues/20
tracks it. I've noted your comments on performance benchmarks
so they don't get lost.

Regards,
Ruben.