This is highly reminiscent of some work I did over 20 years ago on "data flow" computer architectures, which today underpin DSPs, GPGPUs, FPGAs, and other signal-processing hardware. The general notion actually pre-dates digital processing: it was the fundamental architecture of the analog computers used for sonar and radar processing. Much radar processing is still performed in the analog domain, because digital systems still aren't fast enough.
In high-throughput environments, it can be very effective to have the data "flow" as streams through the code, branching and merging as needed, rather than having the code "get" and "put" the data. At the extreme, it may be better to bring the code to the data rather than the other way around. That is the idea behind embedded database actions/functions, where it can be far cheaper to compute some values within the database itself than to perform a fetch-compute-store cycle. It is also similar to parallel pipelined CPU architectures, where a given processing element (ALU, FPU, etc.) "fires" only when data is presented to it, and is idle (saving power) otherwise.
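Go's channels express this style quite directly. Here is a minimal sketch (the stage names generate and square are mine, not from any particular codebase): each stage is a goroutine that fires only when data arrives on its input stream, and the data flows through the stages rather than being fetched and stored by a central routine.

    package main

    import "fmt"

    // generate emits values onto a channel, closing it when done.
    func generate(nums ...int) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for _, n := range nums {
                out <- n
            }
        }()
        return out
    }

    // square is one processing stage: it "fires" only when input arrives.
    func square(in <-chan int) <-chan int {
        out := make(chan int)
        go func() {
            defer close(out)
            for n := range in {
                out <- n * n
            }
        }()
        return out
    }

    func main() {
        // The data flows through the stages; no stage "gets" or "puts"
        // anything outside its own input and output streams.
        for v := range square(generate(1, 2, 3, 4)) {
            fmt.Println(v)
        }
    }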
For me, the various features and restrictions of Go and goroutines make much more sense when viewed from a fine-grained cascaded/parallel signal processing perspective, rather than a traditional coarse-grained task/thread-oriented parallel processing perspective.
Current languages and tools often force programs to be designed around the capabilities and resources of a particular large-scale system, with little inherent flexibility when moved to systems with different resources. Go may instead encourage programming at the finest natural granularity, letting the Go runtime manage how goroutines are aggregated and mapped onto the resources actually present. That's a difficult problem that should not need to be solved anew by each programmer for each program on each system. How many goroutines per thread? How many threads per core? Make "enough" goroutines, and let the Go runtime handle the rest.
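As a rough illustration of "make enough goroutines" (the grain size pieces and the per-item work below are placeholders I chose, not anything prescribed), the program splits the job at its natural grain and leaves the mapping onto threads and cores entirely to the Go scheduler:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        // Split the job at its natural grain, not the machine's:
        // many small goroutines, and let the scheduler map them onto
        // however many threads and cores actually exist.
        const pieces = 10000 // placeholder grain size
        results := make([]int, pieces)

        var wg sync.WaitGroup
        for i := 0; i < pieces; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                results[i] = i * i // stand-in for a small unit of work
            }(i)
        }
        wg.Wait()

        fmt.Println("cores available:", runtime.NumCPU())
        fmt.Println("last result:", results[pieces-1])
    }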
Only recently has the Linux kernel begun to manage "task affinity" to run related programs on the same core, and unrelated programs (or program instances) on different cores. Go will need to do similar work for goroutines, and possibly for threads, in close cooperation with the OS.
Ultimately, the only thing a Go programmer may need to care about is the relative cost of a local function call versus the cost of using a channel and a goroutine in the same thread. And that will only matter when cores are few and/or when algorithms are inherently serial (Amdahl's Law). In that case, the goal will be to minimize overhead when forced to run on a single core, and maximize performance when many cores/sockets are available.
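That relative cost is measurable with an ordinary Go benchmark. The sketch below (the work function is a stand-in; put it in a _test.go file) compares a direct call against the same work routed through a goroutine and unbuffered channels; running it with something like "go test -bench=. -cpu=1" shows the single-core overhead rather than any parallel speedup.

    package cost

    import "testing"

    func work(n int) int { return n * 2 } // stand-in for the real computation

    // Direct call: the serial baseline.
    func BenchmarkCall(b *testing.B) {
        sum := 0
        for i := 0; i < b.N; i++ {
            sum += work(i)
        }
        _ = sum
    }

    // The same work routed through a goroutine and unbuffered channels.
    // On a single logical processor this measures pure overhead,
    // not any parallel speedup.
    func BenchmarkChannel(b *testing.B) {
        in := make(chan int)
        out := make(chan int)
        go func() {
            for n := range in {
                out <- work(n)
            }
            close(out)
        }()
        sum := 0
        for i := 0; i < b.N; i++ {
            in <- i
            sum += <-out
        }
        close(in)
        _ = sum
    }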
But why should a Go programmer even need to care about that? I would strongly favor a Go runtime optimization that, when connected goroutines are scheduled into the same thread, automatically collapses simple goroutine invocations into function calls (converting parallel to serial). That is a *much* easier problem than the reverse (automatic parallelization of function calls)! It would also let a given Go program run as efficiently as possible on both a single-core phone and a 100-core workstation.
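To be concrete about what such a collapse would mean, here is a hand-written sketch of the equivalence (the names process, viaGoroutine, and viaCall are hypothetical, and today's Go runtime does not perform this rewrite): when producer and consumer are known to share a thread, the channel-and-goroutine form is semantically just the plain call.

    package main

    import "fmt"

    func process(x int) int { return x + 1 }

    // Parallel form: a goroutine fed through a channel.
    func viaGoroutine(x int) int {
        out := make(chan int, 1)
        go func() { out <- process(x) }()
        return <-out
    }

    // Collapsed form: the same computation as a plain call. A runtime
    // that knows both ends live in the same thread could, in principle,
    // rewrite the first form into this one automatically.
    func viaCall(x int) int {
        return process(x)
    }

    func main() {
        fmt.Println(viaGoroutine(41), viaCall(41)) // both print 42
    }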
When starting a Go program, it should be possible to pass "hints" to the Go runtime to support this optimization, based on previous patterns of execution (information I'd put into the resource fork of the program file, or into the ELF file structure).
If Go had such capabilities, I'd probably call far fewer functions, and make much greater use of channels and goroutines.
-BobC