Doing that can get expensive. Instead of just a mark bit per object and a queue of pointers to mark, you need a mark bit per word and a queue of (pointer, length) pairs. You can also end up doing more than constant work per mark.
x := [10]*int{ ... 10 pointers ... }
a := x[:3:3]
b := x[7:]
x is now dead, but a and b are live. The GC needs to scan x[0:3] and x[7:10]. What about x[3:7]?
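For reference, here is a runnable version of that example. It is only a sketch of the language-level view: `makeSlices` is a hypothetical helper, and the pointed-to values are made up for illustration. It shows that neither a nor b can reach x[3:7] through normal slicing, and says nothing about what the collector actually does with those words.

```go
package main

import "fmt"

// makeSlices (hypothetical helper) builds the 10-pointer array from the
// example and returns only the two slices, so x itself is dead on return.
func makeSlices() (a, b []*int) {
	var x [10]*int
	for i := range x {
		v := i // illustrative payload: element i points to the value i
		x[i] = &v
	}
	a = x[:3:3] // len 3, cap 3: cannot be resliced past x[2]
	b = x[7:]   // len 3, cap 3: covers x[7] through x[9]
	return a, b
}

func main() {
	a, b := makeSlices()
	fmt.Println(len(a), cap(a)) // 3 3
	fmt.Println(len(b), cap(b)) // 3 3
	fmt.Println(*a[2], *b[0])   // 2 7
}
```

Since cap(a) is 3, an expression like a[:4] panics at run time, and b starts at x[7], so the middle four words are unreachable from either slice.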
x := [10]*int{ ... 10 pointers ... }
a1 := x[:1:1]
a2 := x[:2:2]
a3 := x[:3:3]
...
a10 := x[:10:10]
If the GC encounters the references in the order a1, a2, ..., a10, then at each step there is only one new word of x to scan. Using just a mark bit per word, though, it has to walk over the already-marked prefix again to find that word, so it will end up taking quadratic time to process these references.
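To make the quadratic behavior concrete, here is a small simulation. It is a sketch under stated assumptions, not how any real collector is written: `markRange` and `totalVisits` are hypothetical names, and the "collector" is just a slice of bools standing in for one mark bit per word. It counts how many words get visited when the prefix slices arrive in order.

```go
package main

import "fmt"

// markRange simulates marking words [start, end) of a backing array when the
// only bookkeeping available is one mark bit per word. Every word in the
// range is visited to check its bit, even if it was marked earlier.
func markRange(marked []bool, start, end int) (visited int) {
	for i := start; i < end; i++ {
		visited++
		if !marked[i] {
			marked[i] = true // a real collector would scan the pointer here
		}
	}
	return visited
}

// totalVisits processes the prefix slices x[:1], x[:2], ..., x[:n] in that
// order and returns the total number of word visits.
func totalVisits(n int) int {
	marked := make([]bool, n)
	total := 0
	for k := 1; k <= n; k++ {
		total += markRange(marked, 0, k)
	}
	return total
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Printf("n=%d visits=%d\n", n, totalVisits(n))
	}
}
```

For n words the total is 1 + 2 + ... + n = n(n+1)/2 visits, so the 10-element example costs 55 visits even though only 10 words ever need scanning.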