Congratulations on successfully completing lab 2b.
Binning and building the neighborhood offset list are done on the CPU. The calculation for atoms that are not binned is also part of the CPU execution time (part 2 of the pseudocode for step 3). The "Compute" time below includes all of these. In this lab, the CPU/GPU overlap time is not very meaningful because the code is serialized, so IO + GPU + Copy + Compute is a very close approximation of the overall execution time.
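As a small illustration of that approximation, here is a minimal sketch that adds up the four timer categories. The function name and the timer values are made up for the example and are not taken from your run.

```python
def approx_total_time(timers):
    """Approximate total run time of a serialized program by summing its timer categories."""
    categories = ("IO", "GPU", "Copy", "Compute")
    return sum(timers[name] for name in categories)


# Placeholder values in seconds, invented for illustration only.
example_timers = {"IO": 0.50, "GPU": 0.20, "Copy": 0.10, "Compute": 0.30}
total = approx_total_time(example_timers)
print(f"approximate total: {total:.2f} s")
```

Because nothing overlaps in serialized code, no category needs to be subtracted from the sum.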
Your comment on the optimization is generally correct according to Amdahl's law. However, the lab is deliberately designed not to put a heavy load on the system. Once you consider how large realistic data can be, you may still need to optimize the kernel further. Lab 3 is an example: even with a few gigabytes of memory, it failed to work.
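For reference, Amdahl's law bounds the overall speedup by the fraction of run time that the optimization actually touches. The sketch below is only an illustration: the function name, the 20% kernel fraction, and the 10x factor are hypothetical, not measurements from this lab.

```python
def amdahl_speedup(fraction_accelerated, factor):
    """Overall speedup when `fraction_accelerated` of the run time becomes `factor` times faster."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    # The untouched part keeps its cost; the accelerated part shrinks by `factor`.
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / factor)


# Hypothetical case: the kernel is 20% of the total time and is made 10x faster.
speedup = amdahl_speedup(0.20, 10.0)
print(f"overall speedup: {speedup:.2f}x")
```

When the kernel is a small share of the total, even a large kernel speedup changes the overall time very little. With larger inputs the kernel's share can grow, and then kernel optimization matters more.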
Further optimization is an open question for lab 2b. Once you understand the algorithm, it is up to you to experiment with it. In that sense, I believe you are on the right track.