My first approach would be to measure.
* The full system will have a bottleneck, as per Amdahl's law. You need to take a holistic view, since the bottleneck may be outside Go (i.e., the kernel, the hardware, ...). Knowing the bottleneck will usually suggest an angle of attack on the problem at hand.
* High memory usage suggests either an overload situation (because you don't have enough resources) or that excessive copying of data is going on. Try doing some napkin math: a million active users with 1 KB of data each take about one gigabyte of memory, and so on.
* High CPU usage can indicate real work, a CPU that is stalled on memory bandwidth, or lock contention. Kernels usually do not distinguish between these states in what they report, so some investigation is necessary.
* Keeping thousands of connections on small hardware is expensive. Each TCP connection needs some kernel space in addition to userland space. It quickly adds up.
* TCP sending is going to cost a lot of time. In laboratory tests, the network is fast. Inside a datacenter, such as one operated by Google, network transmission is fast. The internet in general is slow, latency-inducing, and brittle. This forces your system to keep data lingering for longer, and this puts more pressure on memory usage.
* Observation: in a noisy chatroom, you want to skip messages if they flow out of the view on the client. You don't need to process every message in this case. Just the messages that are visible. This suggests a polling construction like the disruptor: only care about the K newest messages in the fast path. The slow path does historical lookups. Keep an "epoch" count of where we are in the message flow. When a socket is ready for data, use the epoch count to figure out what happened in the meantime.
* If messages are immutable, they can be read concurrently with little overhead. Edits to messages can be handled by a patching construction, where a later message overrides an earlier one.
* The *publisher* should take the effort of constructing as much of the payload as possible and place it into a buffer that every subscriber can write directly to its network socket. If every subscriber has to do this work itself, things get expensive.
* Channels are likely to be fast enough. Other resources are likely to run out first, especially memory on a 512 megabyte computer.
* Channels should be used to send epochs around. The actual payload ought to be somewhere else where it is ready and barriered appropriately for read-only consumption. Alternative: pass a reference to the data around. This is simple to do and is likely to be fast.
* Your goal is to get data to the socket so the kernel can do work. If you do this correctly, it is likely the socket is going to be the bottleneck of the system. Also, consider the possibility that you will block a goroutine on data transfer to the outside world while its channel buffer fills up. The publisher should not block in this situation.
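The epoch idea from the list above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the names `Log`, `Publish`, `Since`, and `Notify` are made up for this example, and it assumes payloads are already encoded by the publisher and never modified afterwards. The channel carries only a wake-up epoch; the payload lives in a ring of the K newest messages.

```go
package main

import (
	"fmt"
	"sync"
)

// Log keeps only the K newest messages (hypothetical type for illustration).
type Log struct {
	mu    sync.RWMutex
	ring  [][]byte // ring[e%K] holds the payload published at epoch e
	epoch uint64   // number of messages published so far
}

func NewLog(k int) *Log {
	return &Log{ring: make([][]byte, k)}
}

// Publish stores an already-encoded, immutable payload and returns the
// new epoch. The oldest entry in the ring is overwritten.
func (l *Log) Publish(payload []byte) uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.ring[l.epoch%uint64(len(l.ring))] = payload
	l.epoch++
	return l.epoch
}

// Since returns the messages published after epoch seen, limited to the
// K newest, together with the current epoch and the number skipped.
func (l *Log) Since(seen uint64) (msgs [][]byte, now, skipped uint64) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	k := uint64(len(l.ring))
	if seen > l.epoch {
		seen = l.epoch
	}
	start := seen
	if l.epoch-seen > k {
		start = l.epoch - k // the subscriber fell behind: skip old messages
		skipped = start - seen
	}
	for e := start; e < l.epoch; e++ {
		msgs = append(msgs, l.ring[e%k])
	}
	return msgs, l.epoch, skipped
}

// Notify wakes a subscriber without ever blocking the publisher. If the
// channel is full, the subscriber is busy and will catch up via Since.
func Notify(ch chan<- uint64, epoch uint64) {
	select {
	case ch <- epoch:
	default:
	}
}

func main() {
	room := NewLog(4)
	wake := make(chan uint64, 1)
	for i := 1; i <= 10; i++ {
		Notify(wake, room.Publish([]byte(fmt.Sprintf("msg %d", i))))
	}
	<-wake // the socket became ready; find out what happened meanwhile
	msgs, now, skipped := room.Since(0)
	fmt.Println(len(msgs), now, skipped)
	for _, m := range msgs {
		fmt.Printf("%s\n", m)
	}
}
```

A slow subscriber here costs one `uint64` of state plus a channel slot, and the publisher never waits on it. The skipped count is what the slow path (historical lookup) would have to fetch if the client asks for it.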
In general, system engineering tends to trump local tuning. Effort is a constrained resource, so it is usually best spent in the areas where the cost/benefit analysis falls out nicely.
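As a starting point for the measuring and napkin math suggested above, something like the following could be used. It is only a sketch: `EstimateBytes` and `HeapInUse` are hypothetical helpers, the 1 KB per-connection figure is an assumption, and `runtime.ReadMemStats` only sees the Go heap, not kernel-side socket buffers.

```go
package main

import (
	"fmt"
	"runtime"
)

// EstimateBytes is the napkin estimate: connections times bytes each.
func EstimateBytes(conns, perConn uint64) uint64 {
	return conns * perConn
}

// HeapInUse reports the heap bytes currently in use by the Go runtime.
// Kernel memory for TCP connections is not included.
func HeapInUse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func main() {
	const budget = 512 << 20 // the 512 megabyte machine
	est := EstimateBytes(1_000_000, 1<<10)
	fmt.Printf("estimate: %d MiB, budget: %d MiB\n", est>>20, budget>>20)

	// Compare the estimate with a measurement on a smaller sample.
	before := HeapInUse()
	bufs := make([][]byte, 10_000)
	for i := range bufs {
		bufs[i] = make([]byte, 1<<10) // assumed per-connection state
	}
	after := HeapInUse()
	if after > before {
		fmt.Printf("measured: about %d bytes per buffer\n",
			(after-before)/uint64(len(bufs)))
	}
	runtime.KeepAlive(bufs)
}
```

If the measured per-connection cost is far above the estimate, that points at copying or overhead worth investigating before any local tuning.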