Generating set partitions is efficiently performed by [the second] Algorithm H in Knuth's Art of Computer Programming, Volume 4A, Combinatorial Algorithms, Part 1. It appears on page 416.
Counting the number of set partitions means evaluating the Bell numbers, which is easily done by table summation.
Efficiently programming the generation of set partitions depends on what you need contextually, either all of the partitions at once as a list of solutions, or one at a time in the spirit of a solution generator.
In the second case, a generator for set partitions, it is possible to generate them in "shape order"; a 4-tuple broken into blocks of length 4, 3+1, 2+2, 2+1+1, 1+1+1+1. (The five "shapes" being the integer partition of the number 4). If you were to create slices of these cardinalities in Go and just fill in the values, then the generator would have a minimal allocation overhead, given that there may be many instances of some shapes.
It is also possible to skip the generation of the physical subsets altogether, by generating the manifests for the various blocks. For example, dividing "ABCD" into AD+B+C can be represented by the value "1231" meaning that A and D are in group 1, B is in 2, and C is in group 3. I have done this scheme in a combination generator before in Go with respectable efficiency:
BenchmarkOrderedRate1-8 300000000 4.65 ns/op
BenchmarkOrderedRate2-8 100000000 10.5 ns/op
BenchmarkOrderedRate3-8 100000000 17.1 ns/op
BenchmarkOrderedRate4-8 100000000 24.1 ns/op
BenchmarkOrderedRate5-8 50000000 32.5 ns/op
BenchmarkOrderedRate6-8 50000000 39.5 ns/op
BenchmarkOrderedRate7-8 30000000 46.5 ns/op
BenchmarkOrderedRate8-8 30000000 53.0 ns/op
BenchmarkOrderedRate9-8 20000000 59.6 ns/op
BenchmarkOrderedRate10-8 20000000 66.9 ns/op
BenchmarkOrderedRate11-8 20000000 75.9 ns/op
BenchmarkOrderedRate12-8 20000000 82.6 ns/op
BenchmarkOrderedRate13-8 20000000 88.9 ns/op
BenchmarkOrderedRate14-8 20000000 95.6 ns/op
BenchmarkOrderedRate15-8 20000000 102 ns/op
...and because i was able to use the "manifest" representation in my code, there is no allocation in the loops and even 15-element generation proceeds at 10 million iterations per second. I did not need a set partition generator, so I don't have code to offer, but the work is similar and the speeds should not be too different, at least for the manifest (aka "index") representation.