Hi Claudia,
Your benchmark numbers seem to be what I would expect from a single client pushing to a single endpoint.
With a single client, the bottleneck is round-trip latency and the single endpoint thread handling almost all of the load.
When multiple clients hammer one endpoint, the bottleneck becomes the endpoint's CPU and that single node's communication limits with HBase.
However, if multiple clients hit multiple endpoints (ideally through a load balancer), throughput can grow up to the write limits of the whole HBase cluster. In that case it may make sense to tune the maximum connections and the memory configuration of HBase.
Even then, HBase has write performance limits determined by how the load is distributed across regions. In any case you should pre-split the target table into the expected number of regions, so the region servers share the load.
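For illustration, here is a minimal sketch (plain Python, not HBase's native API) of how evenly spaced split points could be derived, assuming your row keys begin with a uniformly distributed hash byte; the byte strings it produces are the kind of values you would hand to table creation (e.g. the SPLITS option in the hbase shell, or the split-keys argument of Admin.createTable in the Java API):

```python
# Sketch: compute num_regions - 1 evenly spaced single-byte split points
# over the 0x00-0xFF keyspace. Assumes row keys start with a uniformly
# distributed one-byte hash prefix; adjust the width for longer prefixes.

def split_keys(num_regions):
    """Return num_regions - 1 split points as single-byte strings."""
    step = 256 / num_regions
    return [bytes([round(step * i)]) for i in range(1, num_regions)]

# e.g. for 4 regions the splits land at 0x40, 0x80, 0xC0
print(split_keys(4))
```

If the key prefix is not uniformly distributed, splits should instead be derived from a sample of the real keys, otherwise some regions will still take most of the write load.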
If you expect to work with really big data, I would suggest pre-splitting into a large number of regions and, ideally, chunking the data and bulk-loading it.
In this scenario, performance is limited only by the hardware. My measurements in this case are around 40,000 triples per cluster node per second, i.e. roughly 100 billion triples on a 10-node cluster in 72 hours.
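For what it's worth, those two figures are consistent with each other; a quick sanity computation:

```python
# Sanity-check the quoted throughput:
# 40,000 triples/node/s on 10 nodes sustained for 72 hours.
triples_per_node_per_sec = 40_000
nodes = 10
hours = 72

total = triples_per_node_per_sec * nodes * hours * 3600
print(f"{total:.3e} triples")  # ~1.04e11, i.e. roughly 100 billion
```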
Thanks,
Adam