Interesting situation. I haven't seen too many applications that try to work with a data structure that large for every element. So yes, you'll probably run into some less common issues.
My first comment is that registers are used for more than local variables, and some local variables may not take up an "extra" register at all. The compiler decides which values go in registers, and therefore how many registers a function needs. It tracks every value you load, store, or compute, whether that value belongs to a variable or to a memory location, and places it in a register at the point where your code needs it.
So when you allocated scratch space in the data structure to avoid using registers, the compiler still has the option to keep the value in a register, especially if it knows that your code is using it later.
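To make that concrete, here is a minimal sketch, assuming CUDA C++. The `Element` struct, its field names, and the `process` kernel are hypothetical stand-ins for your data structure, not your actual code. It shows a scratch field in memory whose value the compiler is still free to hold in a register.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-element record. `scratch` is the field that was added
// in the hope of keeping a temporary out of the register file.
struct Element {
    float input;
    float scratch;
    float output;
};

__global__ void process(Element *elems, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // The temporary is written to the scratch field in global memory.
    elems[i].scratch = elems[i].input * 2.0f;

    // The value is needed again right away. The compiler is free to reuse
    // the copy it just computed in a register rather than reload it from
    // memory, so the scratch field does not necessarily save a register.
    elems[i].output = elems[i].scratch + 1.0f;
}
```

Whether the reload is actually elided depends on the compiler version and optimization settings, so treat this as an illustration of what is allowed, not a guarantee of what happens.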
To be more explicit with the compiler, think about putting your scratch space in local variables declared with the __local__ keyword. That is a more direct way of telling the compiler "I want every thread to have space for these values, and I don't want that space to come from the registers." The compiler can't tell that this is what you meant from the code you have now, so it is essentially undoing your workaround in an attempt to increase performance. Using the __local__ keyword should help it understand what you actually intended.
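As a rough sketch of per-thread scratch space, assuming CUDA C++: the kernel name, `SCRATCH_WORDS`, and the arithmetic are all hypothetical. I have written the scratch array as an ordinary local array without a qualifier, because I am not certain the `__local__` spelling is accepted by every toolchain version. A per-thread array that is indexed with a value only known at run time is commonly placed in local memory rather than registers, but the final placement is still the compiler's decision.

```cuda
#include <cuda_runtime.h>

#define SCRATCH_WORDS 64  // hypothetical scratch size per thread

__global__ void process_with_scratch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Per-thread scratch space: every thread gets its own copy.
    float scratch[SCRATCH_WORDS];

    for (int k = 0; k < SCRATCH_WORDS; ++k)
        scratch[k] = in[i] * (float)k;

    // The index is only known at run time, which makes it hard for the
    // compiler to map the array onto individual registers.
    int pick = i % SCRATCH_WORDS;
    out[i] = scratch[pick];
}
```

Compiling with `nvcc -Xptxas -v` prints the register count and local memory usage per kernel, which is a quick way to check where the scratch space actually ended up.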
My other question: are you putting all those hundreds of functions into a single kernel, or declaring a separate kernel for each? A very large kernel can also cause problems for the compiler and the hardware in terms of resource usage.