Postgres pipelining vs. batch requirements


Stefano Casazza

Jul 7, 2018, 1:00:54 PM
to framework-benchmarks
Hi,
I think I have understood the issue 'pipelining vs. batch requirements' in the Postgres context.

In general, we wish to avoid a client/server round-trip after each query to get the results before issuing the next query, so we can queue up queries into a pipeline to be executed as a batch on the server. Batched operations will be executed by the server in the order the client sends them, and the server will send the results in the order the statements executed. The server may begin executing the batch before all commands in the batch are queued and the end-of-batch command is sent. If any statement encounters an error, the server aborts the current transaction and skips processing the rest of the batch; query processing resumes after the end of the failed batch.

It's fine for one operation to depend on the results of a prior one: one query may define a table that the next query in the same batch uses, and similarly, an application may create a named prepared statement and then execute it with later statements in the same batch.

We can delimit the end of a set of batched commands by sending a sync message and flushing the send buffer. The end of a batch serves as the delimiter of an implicit transaction and an error recovery point, so the individual queries in the batch share the same "virtual transaction id". This is the batching scenario. To make this transparent at the network level, it is sufficient to put a sync message in the pipeline queue after every single query; in this way we replicate the network flow between the client and the server without the batch context. I think this is what is required by TechEmpower.
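As a toy model of the difference between the two flows (the message names below are illustrative, not the real wire-protocol messages):

```python
# Toy model of the extended-protocol error semantics described above.
# Message names are illustrative; they are not the real wire messages.

def run_server(messages):
    """Process a stream of query and 'sync' messages in order.

    After an error, skip everything up to the next 'sync', which acts as
    the error recovery point / implicit-transaction boundary."""
    results, aborted = [], False
    for msg in messages:
        if msg == "sync":
            results.append("ready_for_query")
            aborted = False
        elif aborted:
            results.append("skipped")
        elif msg == "boom":  # stand-in for a query that fails
            results.append("error")
            aborted = True
        else:
            results.append("rows")
    return results

# Batching: one sync delimits the whole batch, so q3 is skipped after the
# failure.
print(run_server(["q1", "boom", "q3", "sync"]))
# ['rows', 'error', 'skipped', 'ready_for_query']

# Network transparency: a sync after every query, so q3 still executes,
# exactly as if the client had waited for each result before sending the
# next query.
print(run_server(["q1", "sync", "boom", "sync", "q3", "sync"]))
# ['rows', 'ready_for_query', 'error', 'ready_for_query', 'rows',
#  'ready_for_query']
```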
Can someone more expert than me validate this?

-Stefano

Michael Hixson

Jul 7, 2018, 2:49:09 PM
to Stefano Casazza, framework-benchmarks
Hi Stefano,

The thing we want to avoid in the multi-query test is implementations
that are written like this:

* For each incoming HTTP request, construct a batch of queries.
* In one operation, send the entire batch as a unit to the database.
* Don't do anything else until the response for the entire batch is received.

The multi-query test is a stand-in for a "real" application where such
an approach isn't possible. For the dependent query requirement,
imagine the result of each query has to be read *by the application
server* before it can know what the next query should be. Or you
could imagine it is interleaving queries between multiple databases or
other external services.

If we allowed batch queries in the multi-query test, that would be a
different sort of test than we have now. I think it would be too
similar to the single-query test. I think the multi-query test that
we have now (where batching is disallowed) is more useful. If we were
to allow batch queries in the multi-query test, in my mind we might as
well allow "select * from world where id in (...)" too. Even if we
could update all the existing implementations for free, I don't think
that's a better place to end up than where we are now.

All that said, I take it that you're interested in building a
compliant pipelining implementation from scratch, however we define
compliant. So we're not really arguing about batch queries; we're
trying to produce a second pipelining implementation (with
reactive-pg-client being the first). Let me do my best to help there.

These aren't hard requirements, but I bet a compliant pipelining
implementation would:

* Deliver each individual query in the multi-query test to the
database separately, probably. Or it's at least theoretically capable
of doing this.
* Not have any HTTP-request-level query queues. If there is a queue
anywhere, it is centralized (for example, application-scope or
database-connection-level scope). The HTTP request handling code
wouldn't directly call any "flush" operation on the queue.
* *Perhaps* deliver multiple queries to the database together (in a
batch) some of the time, but those queries would not always originate
from the same HTTP request. For example, in the single-query test, it
might send a single packet to the database with queries that came from
4 different HTTP requests.

That last bullet point about incidental batching is part of the reason
I keep mentioning failed queries and virtual transactions. If the
application has a centralized queue of queries, and one query in that
queue results in failure, that shouldn't affect the other queries that
happened to be in the same batch. The queries in each batch may come
from logically unrelated parts of the application. If you end up
implementing a centralized queue like this, then what you said about
sync messages between each query sounds correct to me. I'm not
familiar with the low-level details of the postgres protocol, though.
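A rough sketch of such a centralized, connection-level queue (all names are invented for illustration; this is not any particular driver's API):

```python
# Hypothetical sketch of a centralized query queue: request handlers
# enqueue work and never touch the connection or call "flush"; a single
# writer loop drains whatever happens to be queued, so one write to the
# socket may carry queries that came from several unrelated HTTP requests.
import asyncio

class PipelinedConnection:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def execute(self, sql):
        # Called from request-handling code; no exclusive connection access.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((sql, fut))
        return await fut

    async def run(self, send_batch):
        # Connection-level writer loop: coalesces whatever is queued.
        while True:
            pending = [await self.queue.get()]
            while not self.queue.empty():
                pending.append(self.queue.get_nowait())
            rows = send_batch([sql for sql, _ in pending])
            for (_, fut), row in zip(pending, rows):
                fut.set_result(row)

async def main():
    conn = PipelinedConnection()
    fake_db = lambda sqls: [f"rows for {s}" for s in sqls]  # stand-in server
    writer = asyncio.create_task(conn.run(fake_db))
    # Two logically unrelated "requests" sharing one connection:
    results = await asyncio.gather(conn.execute("select 1"),
                                   conn.execute("select 2"))
    writer.cancel()
    return results

print(asyncio.run(main()))  # ['rows for select 1', 'rows for select 2']
```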

-Michael

Michael Hixson

Jul 7, 2018, 6:07:05 PM
to Stefano Casazza, framework-benchmarks
Oh one other thing. You said:

> In general, we wish to avoid a client/server round-trip after each query to get the results before issuing the next query, so we can queue up queries into a pipeline to be executed as a batch on the server.

I don't think the goal of pipelining is "avoid round-trips". I think
the goal is "use the connections to the database more efficiently".

As a counter-example, I'm pretty sure that all of the Java frameworks
that rely on JDBC (which does not support pipelining) work like this:

* There is an application-wide connection pool.
* When handling an incoming HTTP request, the code "rents out" one of
the connections from this pool, gaining exclusive rights to use that
connection. No other part of the application uses that particular
connection concurrently.
* When issuing each query, the code sends the query out over the
connection then waits for a response from the database before sending
the next query. The connection is idle between sending the query and
receiving the response, while the database is executing the query.
* Therefore there can only be ${size_of_connection_pool} queries
executing at any given time.

If pipelining was used instead, then the request-handling code would
never gain exclusive access to any database connection. That code
would say, "Here's a query I want to send out, and here's the action I
want to perform when the result comes back." The application would
then send that query out over *some* connection, probably immediately.
That connection would not sit idle waiting for a result; the
application would send more queries over that same connection coming
from other contexts before the first result comes back. Since each
connection can process more than one query concurrently, the
application isn't limited to ${size_of_connection_pool} queries at a
time. That application can probably get away with having a smaller
connection pool to achieve the same throughput that a non-pipelining
application can.
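The throughput difference can be made concrete with a back-of-the-envelope model (all numbers below are invented purely for illustration, not measurements):

```python
# Back-of-the-envelope model of the two designs described above.
pool_size = 20
round_trip_ms = 1.0          # assumed latency per query (network + execution)

# JDBC-style blocking pool: each connection sits idle while its one query
# is in flight, so at most pool_size queries are executing at any time.
blocking_qps = pool_size * (1000 / round_trip_ms)

# Pipelined/multiplexed: each connection keeps sending while earlier
# results are still in flight (assume 10 queries in flight per connection).
in_flight_per_conn = 10
pipelined_qps = pool_size * in_flight_per_conn * (1000 / round_trip_ms)

print(blocking_qps, pipelined_qps)  # 20000.0 200000.0
```

The same model shows why the pipelining application can shrink its pool: with the assumed depth of 10, a pool of 2 connections matches the blocking pool of 20.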

The pipelining application *may* avoid some round trips if it decides
to pack more than one concurrent query into each packet and/or the
database server decides to pack more than one result into each packet.
But I see that as a secondary optimization. And I see that
optimization being handled by code that is in low-level networking
parts of the application and database, not by HTTP request-handling
code.

-Michael

Shay Rojansky

Jul 9, 2018, 10:35:15 AM
to framework-benchmarks
Hi Stefano, this is Shay, the .NET PostgreSQL driver guy.

Your description below is pretty accurate. One small detail regarding failure... A batch isn't exactly the same as a transaction, although the two concepts are indeed related. In PostgreSQL, any sort of error that happens in a transaction automatically fails that transaction; more precisely, it puts the transaction in a failed state, after which you must roll it back (any attempt to execute further commands will fail). Now, if a batch is executed outside of any explicit transaction (BEGIN), it indeed executes within its own implicit transaction. However, if the batch is executed inside an existing explicit transaction, no additional (implicit) transaction is created and the existing explicit transaction is failed. PostgreSQL does not support any kind of nested/concurrent transaction model. But this isn't very important for this conversation.

It's true that if Npgsql were changed to add a "sync" message after every single individual query, that would change the semantics (and indeed the actual wire protocol messages sent) to be identical to a scenario where batching isn't used at all: failure of an earlier command wouldn't cause the skipping of a later command. That also seems like a pretty bad way for a database driver to work: assuming no transaction is in progress ("auto-commit mode"), do you really want later commands to execute after earlier commands have failed?

Regardless and most importantly, as I tried to argue in the other conversation, I agree that the TechEmpower rules can be understood in a way that allows for the 2nd sort of batching (with sync after each query), and forbids the 1st sort of batching (with sync at the end of all the queries). What I can't understand after all this discussion is *why* the rules are the way they are - what do we gain by excluding Npgsql's current way of batching? Once again, it would make sense to exclude *all* kinds of batching (i.e. both types), but I can't understand why allow one but not the other.

Shay Rojansky

Jul 9, 2018, 11:02:19 AM
to framework-benchmarks
Hi Michael, see some comments below.


> The thing we want to avoid in the multi-query test is implementations
> that are written like this:
>
> * For each incoming HTTP request, construct a batch of queries.
> * In one operation, send the entire batch as a unit to the database.
> * Don't do anything else until the response for the entire batch is received.
>
> The multi-query test is a stand-in for a "real" application where such
> an approach isn't possible.  For the dependent query requirement,
> imagine the result of each query has to be read *by the application
> server* before it can know what the next query should be.  Or you
> could imagine it is interleaving queries between multiple databases or
> other external services.

This is quite confusing to me... First, why do you say that such an approach isn't possible in a real application? Assuming there's no dependency between your queries, this actually seems like quite a sensible way to execute - in fact I've written code like this many times. Most database drivers indeed give you "exclusive" rights to a pooled connection: after you're assigned a connection, indeed nobody else can use it. This is how Npgsql works, for example. When using such a driver, why would you *not* batch all your queries rather than wait for each query's results before sending the next (again, assuming there's no dependency)?

I suspect that there's some confusion here around terminology. In your 2nd mail you write that you don't think the goal of pipelining is to avoid round-trips, but rather to "use the connections to the database more efficiently". At least if you look at what HTTP pipelining is, "avoiding round-trips" is exactly what pipelining is about: it's simply about not waiting for a previous response before sending another request. In fact, the PostgreSQL protocol really is very similar to how HTTP pipelining works: you can send requests before waiting for responses, but responses will always come back in FIFO order.

Now, a small minority of drivers do seem to allow the same connection to be shared by several "users", allowing user B to send query 2 just after user A sent query 1, but before the results of query 1 have been received. I preferred to call this "multiplexing" as opposed to "pipelining" (because pipelining does seem to have the somewhat standard meaning of HTTP pipelining), but terminology is secondary here. The important thing is whether multiple users can have queries in flight on the same connection.

But to go back to what TechEmpower allows or does not allow... Once again, from my point of view:
  • If the goal of the multi-query benchmark is to measure multiple roundtrips (i.e. one roundtrip per query), that's fine and it makes some sense. It can be explicitly required in the benchmark requirements, and it can be "enforced" by introducing a dependency between the queries: query 2 contains a parameter that we only get from the resultset of query 1.
  • If the goal of the multi-query benchmark is to allow programs and drivers to execute multiple queries as fast as possible, that also makes sense, but then why not allow whatever form of speed-up (batching/pipelining) is available? Why do you care what the wire protocol messages look like, and forbid Npgsql's style of batching?

Stefano Casazza

Jul 9, 2018, 11:04:22 AM
to Shay Rojansky, framework-benchmarks
Hi Shay,

> do you really want later commands to execute after earlier commands have failed?

No. Batched operations will be executed by the server in the order the client sends them, and the server will send the results in the order the statements executed. The server may begin executing the batch before all commands in the batch are queued and the end-of-batch command is sent. If any statement encounters an error, the server aborts the current transaction and skips processing the rest of the batch; query processing resumes after the end of the failed batch.

PQgetResult() behaves the same as for normal asynchronous processing, except that it may contain the new PGresult types PGRES_BATCH_END and PGRES_BATCH_ABORTED. PGRES_BATCH_END is reported exactly once for each PQbatchSendQueue() call, at the corresponding point in the result stream and at no other time. PGRES_BATCH_ABORTED is emitted during error handling: when a query in a batch causes an ERROR, the server skips processing all subsequent messages until the end-of-batch message, and the open transaction is aborted. From the client perspective, after the client gets a PGRES_FATAL_ERROR return from PQresultStatus(), the batch is flagged as aborted; libpq will report a PGRES_BATCH_ABORTED result for each remaining queued operation in an aborted batch. The result for PQbatchSendQueue() is reported as PGRES_BATCH_END to signal the end of the aborted batch and the resumption of normal result processing.

If the batch used an implicit transaction, then operations that have already executed are rolled back and operations that were queued for after the failed operation are skipped entirely. The same behaviour holds if the batch starts and commits a single explicit transaction (i.e. the first statement is BEGIN and the last is COMMIT), except that the session remains in an aborted transaction state at the end of the batch. If a batch contains multiple explicit transactions, all transactions that committed prior to the error remain committed, the currently in-progress transaction is aborted, and all subsequent operations in the current and all later transactions in the same batch are skipped completely.

The client must not assume that work is committed when it sends a COMMIT, only when the corresponding result is received to confirm the commit is complete. Because errors arrive asynchronously, the application needs to be able to restart from the last received committed change and resend work done after that point if something goes wrong.
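As a toy model of that result stream (the PGRES_BATCH_* names follow the proposed libpq batch patch quoted above; the helper function itself is purely illustrative):

```python
# Illustrative model of the result-status sequence for an aborted batch:
# after PGRES_FATAL_ERROR, every remaining queued operation reports
# PGRES_BATCH_ABORTED, and PGRES_BATCH_END marks the recovery point.
def result_stream(queued, failing_index):
    out = []
    for i, _ in enumerate(queued):
        if i < failing_index:
            out.append("PGRES_TUPLES_OK")
        elif i == failing_index:
            out.append("PGRES_FATAL_ERROR")
        else:
            out.append("PGRES_BATCH_ABORTED")
    out.append("PGRES_BATCH_END")  # resumption of normal result processing
    return out

print(result_stream(["q1", "q2", "q3", "q4"], failing_index=1))
# ['PGRES_TUPLES_OK', 'PGRES_FATAL_ERROR', 'PGRES_BATCH_ABORTED',
#  'PGRES_BATCH_ABORTED', 'PGRES_BATCH_END']
```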

> I agree that the TechEmpower rules can be understood in a way that allows for the 2nd
> sort of batching (with sync after each query), and forbids the 1st sort of batching (with
> sync at the end of all the queries)

I agree too.

> I can't understand why allow one but not the other

For me, the key point is network transparency, so
C1 <-> S1
C2 <-> S2
is equivalent to
C1, C2 <-> S1,S2







Shay Rojansky

Jul 9, 2018, 11:15:10 AM
to framework-benchmarks
I agree with everything you describe above. In your initial message, you seemed to be suggesting (maybe as a theoretical possibility) that Npgsql send a Sync message after each query, in order to adhere to the TechEmpower rules - making the wire messages identical to what they would be if the application sent query1, waited for the result, sent query2, waited for the result. In this theoretical scenario, assuming there's no explicit transaction, an earlier failing message would *not* prevent a later message from executing, because each message is in its own implicit transaction and has a Sync right after it. This seems like a problematic state of affairs for a database driver batching API.

> I can't understand why allow one but not the other

> For me, the key point is network transparency, so
> C1 <-> S1
> C2 <-> S2
> is equivalent to
> C1, C2 <-> S1,S2

I guess this is indeed the key point, but can you please explain *why* network transparency is important? I can understand that the rules are set up that way at the moment, I just don't think that makes sense. Why do you want to reject Npgsql's batching support only because it has a Sync after all batched queries rather than after each and every query?
 

Stefano Casazza

Jul 9, 2018, 11:25:44 AM
to Shay Rojansky, framework-benchmarks
Hi Shay,
You speak to me as if I am the one who decided the requirements; I am only
trying to implement them. Personally, I think that 'batching' is in general
more natural for a real application, but I don't see why 'pipelining' is a
'problematic state of affairs for a database driver batching API'.

Shay Rojansky

Jul 9, 2018, 12:54:02 PM
to framework-benchmarks
Hi Stefano,

> You speak to me as if I am the one who decided the requirements; I am
> only trying to implement them. Personally, I think that 'batching' is in
> general more natural for a real application, but I don't see why
> 'pipelining' is a 'problematic state of affairs for a database driver
> batching API'.

I'm sorry if it came across that way, I actually have no idea who you are or what you're responsible for! It's just that I haven't been able to get an answer from anybody on why "network transparency" is a deciding factor here, and was hoping maybe you know.

I don't have any problem with pipelining or multiplexing - in fact I'm considering implementing it in Npgsql (https://github.com/npgsql/npgsql/issues/1982). There are definitely some perf benefits in supporting what I call multiplexing, i.e. not reserving a connection exclusively for one user, but dispatching queries from multiple users via multiple connections. My comment above was only on the (theoretical) error handling behavior: if a database driver exposes a batching API as Npgsql does, then I'd expect earlier query failures to skip later ones; this isn't the case when one sends a Sync message after every query in the batch.


Michael Hixson

Jul 9, 2018, 6:38:18 PM
to Shay Rojansky, framework-benchmarks
I'll try to have more complete responses for both of you (Shay and
Stefano) later, but that probably won't happen until this coming
weekend at the earliest. I apologize in advance for the delay. I am
reading all of your replies, though.

I'll give a quick response on one point now though:

> I don't have any problem with pipelining or multiplexing - in fact I'm
> considering implementing it in Npgsql
> (https://github.com/npgsql/npgsql/issues/1982). There are definitely some
> perf benefits in supporting what I call multiplexing, i.e. not reserving
> connection exclusively to one user, but dispatching queries from multiple
> users via multiple connections. My comment above was only on the
> (theoretical) error handling behavior: if a database driver exposes a
> batching API as Npgsql does, then I'd expect earlier query failures to skip
> later ones; this isn't the case when one sends a Sync message after every
> query in the batch.

I would also expect a batching API to work like Npgsql's does. The
thing is, we're not interested in having implementations that use a
batching API in the multi-query test. The requirements aren't all
"black box"; we do care what the client side code is doing in ways
that only a manual review can verify. We do care that the client side
code is executing independent queries in a way that batching doesn't
satisfy, no matter how it looks over the wire. I've tried to explain
the reasons why the requirements are this way with respect to
batching, and you may not agree with all of those reasons, but I hope
that everyone realizes that these are still the requirements. I told
Stefano recently in a PR that we don't allow batching, and then I said
that again here in this thread, and yet here we are discussing
additions to libpq where every new function name contains the word
"batch".

-Michael

Shay Rojansky

Jul 10, 2018, 12:19:36 AM
to Michael Hixson, framework-benchmarks
Michael,

> I'll try to have more complete responses for both of you (Shay and
> Stefano) later, but that probably won't happen until this coming
> weekend at the earliest.  I apologize in advance for the delay.  I am
> reading all of your replies, though.

No problem at all, of course there's no rush here whatsoever.

> I would also expect a batching API to work like Npgsql's does.  The
> thing is, we're not interested in having implementations that use a
> batching API in the multi-query test.  The requirements aren't all
> "black box"; we do care what the client side code is doing in ways
> that only a manual review can verify.  We do care that the client side
> code is executing independent queries in a way that batching doesn't
> satisfy, no matter how it looks over the wire.

I do understand that you guys forbid batching, and I do understand why one may want to do so in certain benchmark scenarios, but I'm still lost on why pipelining would be allowed on the same benchmark (by which I mean any technique that sends a query on a physical connection which still has another query in-flight, i.e. one whose results haven't been fully consumed). If you could point me to where that's explained that would be great.

Michael Hixson

Jul 21, 2018, 8:22:03 PM
to Stefano Casazza, framework-benchmarks
Hi Stefano,

Well, I was unable to explain the difference between pipelining and
batch queries in this discussion thread. I gave it my best.

A few weeks ago I met with the TFB team at TechEmpower and
whiteboarded all the different ways to query the database. My
colleagues all agreed this was much easier to understand than
everything I had said on the mailing list. Also, it's not a great
state of affairs when in order to understand our requirements, one has
to read through multiple large email threads.

With those lessons learned, I've created a new task to use diagrams to
show which kinds of querying we allow and which we don't:
https://github.com/TechEmpower/FrameworkBenchmarks/issues/3946

Hopefully, whenever we get around to doing that, that'll prevent some
misunderstandings like we had in this thread.

-Michael

Michael Hixson

Jul 21, 2018, 9:41:26 PM
to Shay Rojansky, framework-benchmarks
Hi Shay,

Responses inline:

On Mon, Jul 9, 2018 at 8:02 AM, Shay Rojansky <ro...@roji.org> wrote:
> Hi Michael, see some comments below.
>
>> The thing we want to avoid in the multi-query test is implementations
>> that are written like this:
>>
>> * For each incoming HTTP request, construct a batch of queries.
>> * In one operation, send the entire batch as a unit to the database.
>> * Don't do anything else until the response for the entire batch is
>> received.
>>
>> The multi-query test is a stand-in for a "real" application where such
>> an approach isn't possible. For the dependent query requirement,
>> imagine the result of each query has to be read *by the application
>> server* before it can know what the next query should be. Or you
>> could imagine it is interleaving queries between multiple databases or
>> other external services.
>
>
> This is quite confusing to me... First, why do you say that such an approach
> isn't possible in a real application? Assuming there's no dependency between
> your queries, this actually seems like quite a sensible way to execute - in
> fact I've written code like this many times. Most database drivers indeed
> give you "exclusive" rights to a pooled connection: after you're assigned a
> connection, indeed nobody else can use it. This is how Npgsql works, for
> example. When using such a driver, why would you *not* batch all your
> queries rather than wait for each query's results before sending the next
> (again, assuming there's no dependency)?

I was not saying that batch queries aren't possible in any
application. I can see why you'd find that opinion confusing. :)

I was saying the particular real application we're simulating is one
where batch queries aren't a solution. Why? Because that's how we've
defined it. I gave a couple of examples of problems that can't be
solved by batch queries. Imagine that our multi-query application is
solving those problems.

To make it easier for people to contribute solutions, we've simplified
the actual problem somewhat. But we retain some requirements from the
more complex theoretical problems.

>
> I suspect that there's some confusion here around terminology. In your 2nd
> mail you write that you don't think the goal of pipelining is to avoid
> round-trips, but rather to "use the connections to the database more
> efficiently". At least if you look at what HTTP pipelining is, "avoiding
> round-trips" is exactly what pipelining is about: it's simply about not
> waiting for a previous response before sending another request. In fact, the
> PostgreSQL protocol really is very similar to how HTTP pipelining works: you
> can send requests before waiting for responses, but responses will always
> come back in FIFO order.

Yes, we're probably using the term "round-trips" differently. A batch
query solution might issue 4 queries by sending a single packet to the
database and receiving all 4 result sets in a single packet, which I
was calling one round-trip. Meanwhile, a pipelining solution might
issue 4 queries by sending out 4 packets and receiving 4 back, which I
was calling 4 round-trips.

I suspect that your definition of "round-trips" is the more popular
one, so I'll try to avoid using my definition in the future.

(Aside: A pipelining solution might opportunistically squash the
queries and/or result sets into fewer packets. This was the
contentious part of what vertx-postgres was doing, which started the
debate about whether to allow pipelining. I don't think anyone was
concerned about *when* queries/results were sent/received or how many
connections were used.)

> Now, a small minority of drivers do seem to allow
> the same connection to be shared by several "users", allowing user B to send
> query 2 just after user A sent query 1, but before the results of query 1
> have been received. I preferred to call this "multiplexing" as opposed to
> "pipelining" (because pipelining does seem to have the somewhat standard
> meaning of HTTP pipelining), but terminology is secondary here. The
> important thing is to know whether it's about having multiple users having
> queries in flight on the same connection.

Right. I saw your definitions of batching, pipelining, and
multiplexing in the GitHub issue you linked. They're great.

For what it's worth, in all our discussions I have been saying
"pipelining" when I mean "multiplexing" in your terminology. The
client that vertx-postgres uses, reactive-pg-client, is doing
multiplexing.

>
> But to go back to what TechEmpower allows or does not allow... Once again,
> from my point of view:
>
> If the goal of the multi-query benchmark is to measure multiple roundtrips
> (i.e. one roundtrip per query), that's fine and it makes some sense. It can
> be explicitly required in the benchmark requirements, and it can be
> "enforced" by introducing a dependency between the query: query 2 contains a
> parameter that we only get from the resultset of query 1.

It *could* be enforced like you say, but we'd have to rewrite all of
the implementations, which would be a lot of work. Instead, we're
enforcing it manually.

> If the goal of the multi-query benchmark is to allow programs and drivers to
> execute multiple queries as fast as possible, that also makes sense, but
> then why not allow whatever form of speed-up (batching/pipelining) is
> available?

If this was the goal, we'd allow "SELECT * FROM world WHERE id IN
(...)" and all of the implementations would be written that way.

> Why do you care what the wire protocol messages look like, and
> forbid Npgsql's style of batching?

We really don't want to care how the wire protocol messages look.
That was part of my original announcement that explained why we're
allowing pipelining!

However, we do care about the general solution being used from the
application's point of view. We forbid *all* styles of batching. The
whole discussion about Sync messages was a red herring. Forget the
points I made about failed queries; that was in response to a
theoretical solution that no one actually proposed. (Imagine a
framework that solves the single-query test using batch queries. Or,
if you have no idea what I mean by that, then just forget it - it's
not important.)

-Michael

Michael Hixson

Jul 21, 2018, 10:12:45 PM
to Shay Rojansky, framework-benchmarks
I believe you've read all of my messages on this topic. We've had
minor misunderstandings here and there, but based on your replies I
think you know what my position is and what the current requirements
for TFB are. You disagree with the requirements and I can't think of
anything to say that might change your mind. Oh well.

I'd like to wrap up this discussion in the next reply or two. No
one's done anything wrong and the topic was really interesting, but to
be honest I will be glad if I don't hear the word "pipelining" again
for several months. :)

-Michael

Shay Rojansky

Jul 31, 2018, 1:57:11 AM
to framework-benchmarks
Hi Michael, thanks for your continued responses and the interesting discussion. I'm sorry it's taking me a while to respond each time; life is very busy at the moment. I think I understand the source of the confusion in this conversation: it's indeed related to terminology.

> I was saying the particular real application we're simulating is one
> where batch queries aren't a solution.  Why?  Because that's how we've
> defined it.  I gave a couple of examples of problems that can't be
> solved by batch queries.  Imagine that our multi-query application is
> solving those problems.

I do understand the idea of simulating a real application where batch queries aren't a possible solution (and agree that it's a valuable benchmark).

The point that still seems a bit "muddled" to me in the conversation is that wherever batch queries aren't a possible solution (e.g. because of inter-query dependencies), pipelining shouldn't be a possible solution either. Note that I'm using "pipelining" in the sense I defined above: the "client" (handling a single web request) keeps exclusive ownership of the database connection while the first query is in flight, and nobody else can use it concurrently at that point (like HTTP pipelining). In that case, the query dependency forces the client to wait for the first query's response before sending the second query, making the scenario identical in all respects to batching.
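The dependency point above can be sketched with a tiny asyncio simulation. Everything here is invented for illustration (the fake run_query, the SQL text, the timings); it is not a real driver:

```python
import asyncio

RTT = 0.01  # pretend network round-trip time, in seconds


async def run_query(sql, params):
    # Stand-in for a driver's execute(): one full round trip per query.
    await asyncio.sleep(RTT)
    return {"fortune_id": 42}


async def dependent_queries():
    loop = asyncio.get_running_loop()
    start = loop.time()
    row = await run_query("SELECT fortune_id FROM world WHERE id = $1", [7])
    # The parameter below only exists once the first result has arrived,
    # so the second query cannot be put on the wire behind the first:
    # pipelining degenerates into two sequential round trips.
    await run_query("SELECT * FROM fortune WHERE id = $1", [row["fortune_id"]])
    return loop.time() - start


elapsed = asyncio.run(dependent_queries())
print(elapsed >= 1.5 * RTT)  # True: the round trips cannot overlap
```

With an inter-query dependency, a pipelining client and a batching client end up with the same wire behavior, which is the point being made above.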

However, if we're talking about "multiplexing" in my sense - non-exclusive ownership of database connections - the situation is different. While the first query is in-flight, a *different* client (e.g. thread) can still use the connection to send its own (unrelated) query, thus achieving better utilization of connections. This means that a multiplexing driver should indeed perform much better than either a pipelining or batching driver (again, I see the last two mostly as API variations over the exact same thing).
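The ownership distinction can be modeled the same way. In this toy sketch (SharedConnection and all names are invented), exclusive ownership caps in-flight queries at one, while multiplexing lets unrelated requests share the wire:

```python
import asyncio


class SharedConnection:
    """Toy connection that records how many queries are in flight at once."""

    def __init__(self):
        self.in_flight = 0
        self.max_in_flight = 0

    async def execute(self, sql):
        self.in_flight += 1
        self.max_in_flight = max(self.max_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # pretend round trip
        self.in_flight -= 1


async def run(exclusive):
    conn = SharedConnection()
    lock = asyncio.Lock()

    async def handler():
        if exclusive:
            # Pipelining-style ownership: the connection belongs to one
            # request at a time, so unrelated queries never overlap.
            async with lock:
                await conn.execute("SELECT ...")
        else:
            # Multiplexing: no exclusive ownership, queries interleave freely.
            await conn.execute("SELECT ...")

    await asyncio.gather(*(handler() for _ in range(5)))
    return conn.max_in_flight


exclusive_max = asyncio.run(run(exclusive=True))
multiplexed_max = asyncio.run(run(exclusive=False))
print(exclusive_max, multiplexed_max)  # 1 5
```

Five concurrent request handlers see at most one in-flight query under exclusive ownership, but all five overlap under multiplexing, which is why a multiplexing driver can get better utilization out of the same number of connections.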

In other words, the whole discussion may have been the result of a terminological mismatch :) I'm not sure if there's an actual driver out there which does pipelining but not multiplexing (I'm sticking to my terminology because it clearly distinguishes between the two), so I may have wasted everyone's time to a certain extent here...

> (Aside:  A pipelining solution might opportunistically squash the
> queries and/or result sets into fewer packets.  This was the
> contentious part of what vertx-postgres was doing, which started the
> debate about whether to allow pipelining.  I don't think anyone was
> concerned about *when* queries/results were sent/received or how many
> connections were used.)

OK. I don't really see why that would be contentious - whether requests or responses are packed into fewer or more packets still seems irrelevant to the discussion here... A multiplexing driver may decide to do a form of "Nagling": before it sends out a query, it may wait a bit of time to see if some other client/thread wants to send out another, and combine them into the same packet. This may of course be more or less efficient depending on many factors, but this question isn't very important in the context of this discussion.

> Right.  I saw your definitions of batching, pipelining, and
> multiplexing in the GitHub issue you linked.  They're great.

I'm glad they can help clear things up :) The thing that was lost here in my opinion was the question of exclusive ownership of the database connection (whether it's usable by other clients as a query is in-flight). It may be worth making this point more explicit in the rules.
 
> For what it's worth, in all our discussions I have been saying
> "pipelining" when I mean "multiplexing" in your terminology.  The
> client that vertx-postgres uses, reactive-pg-client, is doing
> multiplexing.

That's what I now understand :) Everything makes a lot more sense now.
 
> If the goal of the multi-query benchmark is to measure multiple roundtrips
> (i.e. one roundtrip per query), that's fine and it makes some sense. It can
> be explicitly required in the benchmark requirements, and it can be
> "enforced" by introducing a dependency between the queries: query 2 contains a
> parameter that we only get from the resultset of query 1.

> It *could* be enforced like you say, but we'd have to rewrite all of
> the implementations, which would be a lot of work.  Instead, we're
> enforcing it manually.

That's perfectly understandable. However, I think this whole discussion shows the weakness of forbidding/allowing things "manually", via rules, as opposed to simply structuring the benchmark in the right way... Had the multi-query benchmark simply been designed with inter-query dependencies, this whole discussion would never have needed to happen, since batching and pipelining (in my sense) would have been naturally prevented. Rules are subject to interpretation and discussion, whereas when there's a simple way to avoid the whole problem structurally, that seems much better...

Anyway, thanks again for your patience in these discussions. If you think I may contribute to any discussion here (PostgreSQL low-level details or anything else) please don't hesitate to include me (I don't regularly follow this group at the moment).

Ben Adams

Jul 31, 2018, 4:55:12 AM
to framework-benchmarks
Aside: the single query test would gain advantages from multiplexing; but it wouldn't from batching or pipelining.

Shay Rojansky

Jul 31, 2018, 5:54:07 AM
to Ben Adams, framework-benchmarks
> Aside: the single query test would gain advantages from multiplexing; but it wouldn't from batching or pipelining.

Yeah, multiplexing helps pretty much any scenario where physical connections are saturated. Batching/pipelining only helps when the application has two or more independent queries that it knows it can send together.

Pipelining can technically help in another scenario where batching can't: if your application's *input* does pipelining, i.e. you're receiving new incoming requests before sending responses to previous requests. If each of these requests gets translated to an outgoing database request, then pipelining allows you to send those out before waiting for the previous database response. This isn't possible with batching, where you have to know all the queries you're sending up-front and can't add more once the batch has been sent.
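That streaming-input case can be sketched with a toy simulation too. An in-memory queue stands in for the socket, and all names below are invented; nothing here is a real driver API:

```python
import asyncio


async def run():
    log = []
    responses = asyncio.Queue()

    async def server():
        # FIFO: the server answers queries in the order they were sent.
        for i in range(3):
            await asyncio.sleep(0.01)
            await responses.put(i)

    async def pipelining_client():
        # Each incoming web request immediately produces an outgoing query,
        # before any earlier database response has come back. A batch would
        # have to wait until all three requests had arrived.
        for i in range(3):
            await asyncio.sleep(0.002)  # the next request arrives a bit later
            log.append(f"send {i}")
        for _ in range(3):
            i = await responses.get()
            log.append(f"recv {i}")

    await asyncio.gather(server(), pipelining_client())
    return log


log = asyncio.run(run())
print(log[:3])  # ['send 0', 'send 1', 'send 2'] -- all sent before any recv
```

All three queries go out as their requests arrive, and the responses are drained afterwards in order, which is exactly the overlap that an up-front batch cannot achieve.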
