You may want to take a look at this issue on GitHub:
https://github.com/cp2k/cp2k/issues/73
In your particular case, a node with 8 V100s is a pretty extreme setup, and it would take a large calculation to make good use of it. Which test are you using for benchmarking?
Your setup of 8 ranks with 5 threads each should be OK. CP2K assigns ranks to GPUs in a round-robin manner, so in your case there is one rank talking to each GPU.
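To illustrate what round-robin assignment means here, a minimal sketch in Python. It assumes the mapping is simply the rank index modulo the number of devices on the node; the function name is hypothetical and the actual CP2K/DBCSR code may differ in detail (for example, by using a node-local rank).

```python
def rank_to_device(rank: int, devices_per_node: int) -> int:
    """Illustrative round-robin mapping (assumption: rank modulo device count)."""
    if devices_per_node <= 0:
        raise ValueError("devices_per_node must be positive")
    return rank % devices_per_node


# With 8 ranks and 8 GPUs, every rank gets its own device.
mapping = {rank: rank_to_device(rank, 8) for rank in range(8)}

# With more ranks than GPUs, the assignment wraps around and devices are shared.
shared = [rank_to_device(rank, 8) for rank in range(8, 12)]
```

Under this assumption, running fewer ranks than GPUs would leave some devices idle, which is why the rank count matters on an 8-GPU node.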
We don't have much experience with multi-GPU nodes, so I would suggest running a scalability test with 1 rank, 2 ranks, ... 8 ranks (always 5 threads) to check how the performance scales. BTW, make sure CP2K recognizes all 8 GPUs by checking the following line at the beginning of the output:
DBCSR| ACC: Number of devices/node 1
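If it helps, here is a rough Python sketch of both steps: reading the device count from a CP2K output file and building the command lines for the rank sweep. The regular expression is based only on the line quoted above; the launcher (`mpirun -np`), the `cp2k.psmp` binary name, and the file names `bench.inp` / `bench_<N>ranks.out` are assumptions that you would need to adapt to your own installation.

```python
import re

# Pattern for the DBCSR line quoted above; spacing in real output may vary.
DEVICE_LINE = re.compile(r"DBCSR\|\s*ACC:\s*Number of devices/node\s+(\d+)")


def devices_per_node(output_text):
    """Return the device count reported by DBCSR, or None if the line is absent."""
    match = DEVICE_LINE.search(output_text)
    return int(match.group(1)) if match else None


def sweep_commands(max_ranks=8, threads=5, exe="cp2k.psmp", inp="bench.inp"):
    """Build one command line per rank count (1..max_ranks), threads kept fixed.

    The launcher, binary name, and file names are placeholders (assumptions).
    """
    return [
        f"OMP_NUM_THREADS={threads} mpirun -np {n} {exe} -i {inp} -o bench_{n}ranks.out"
        for n in range(1, max_ranks + 1)
    ]


if __name__ == "__main__":
    for cmd in sweep_commands():
        print(cmd)
```

After each run, `devices_per_node(open("bench_8ranks.out").read())` should return 8 on your node if all GPUs are visible.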
You might also consider reoptimizing the kernels for the V100 at some point, but this is not a priority...
Alfio