You are right. However, I think there is a reasonable solution for both problems you mention:
1. (Breaking existing apps) The SUBSCRIBE and PUBLISH commands could be extended with a new optional parameter "broadcast" which defaults to true. This parameter indicates whether the receiving node should broadcast the PUBLISH command over the cluster bus. Existing apps would behave as before, while all client drivers (e.g. Jedis) could add the feature of treating non-broadcast PUBLISH calls using hash slot semantics. This part is rather easy because clients essentially just have to treat PUBLISH like any other data command.
2. (Subscriptions during hash slot migration) This is harder but totally possible. This is what I would propose:
- When migration starts, PUBLISH calls are redirected with -ASK to the new hash slot owner.
- The node taking over the hash slot keeps a temporary log of PUBLISH messages for some configurable time (SUBSCRIPTION_MIGRATION_TIMEOUT).
- Atomically with the start of the -ASK redirections, subscribed clients receive a message indicating that they should reconnect to the new node.
- On connecting to the new node, clients first receive (either by default or as an option) all messages the node has accumulated since the migration started and before SUBSCRIPTION_MIGRATION_TIMEOUT has passed.
Using this scheme, no messages will be lost as long as subscribed clients are able to reconnect to the new node within SUBSCRIPTION_MIGRATION_TIMEOUT. In Redis Cluster, this requires implementing the temporary PubSub migration log. Clients need to extend their subscribe connections to handle reconnection to a new node. This part can be combined with 1. by giving SUBSCRIBE a parameter that indicates whether a client is capable of a handover and that defaults to false.
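To make both parts a bit more concrete, here is a rough Python sketch, not an actual Redis or Jedis API. It assumes that non-broadcast channels would be hashed to slots the same way keys are today (CRC16 mod 16384, with hash tags), and it models the temporary log on the new slot owner as a simple in-memory buffer. The names channel_slot and MigrationLog, and the injectable clock, are made up for illustration.

```python
import binascii
import time
from collections import deque

SLOTS = 16384  # number of hash slots in Redis Cluster


def channel_slot(channel: bytes) -> int:
    """Map a channel to a hash slot, assuming channels are hashed like keys."""
    # Hash tags: if there is a non-empty {...} section, hash only its content.
    start = channel.find(b"{")
    if start != -1:
        end = channel.find(b"}", start + 1)
        if end > start + 1:
            channel = channel[start + 1:end]
    # crc_hqx with initial value 0 is the CRC16 variant Redis Cluster uses.
    return binascii.crc_hqx(channel, 0) % SLOTS


class MigrationLog:
    """Hypothetical buffer kept by the node taking over a hash slot."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds  # SUBSCRIPTION_MIGRATION_TIMEOUT
        self.clock = clock
        self.started = clock()          # moment the -ASK redirections begin
        self.messages = deque()         # (channel, payload) in arrival order

    def expired(self):
        return self.clock() - self.started > self.timeout

    def record(self, channel, payload):
        """Buffer a redirected PUBLISH; returns False once the timeout passed."""
        if self.expired():
            self.messages.clear()       # late subscribers get no replay
            return False
        self.messages.append((channel, payload))
        return True

    def replay(self, channels):
        """Messages a reconnecting subscriber missed, oldest first."""
        if self.expired():
            return []
        return [(c, p) for c, p in self.messages if c in channels]
```

A client that reconnects in time would call something like replay({b"news"}) once and then continue with live delivery. How to deduplicate messages that a client already saw on the old node is left open here.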
Also, there is a rich body of recent research on live migration in database systems (e.g. Slacker, Albatross, Zephyr [1]) that we can learn from.
Do you have a suggestion on how to get Salvatore to take a look at this problem and proposal?
As of now, I think Apache Kafka [2] is far superior to Redis Cluster PubSub in terms of reliability, functionality and scalability, but there is no reason to leave it at that. In terms of operational simplicity and single-server performance, there is still a large niche for a scalable Redis Cluster PubSub.