I think of it this way.
in an Object-Oriented style, both sender and receiver share an understanding about the possible objects that exist in the domain model. that means they share understanding not just about the "names" of objects, but also the component properties of those objects and the relations between various objects in the problem domain.
in a REST-oriented style, both sender and receiver share an understanding about the possible message formats (media types) used in client-server communications. that means they share understanding not just about the media type names (application/atom+xml, image/svg+xml, application/hal+json, etc.) but also the way in which the message carries state (data elements) and describes transitions (hypermedia controls).
REST-style systems are designed to share "state", not "objects", therefore the significance of "types" is found in the media "types". this allows both client and server implementors to use any code style (OO, functional, procedural, etc.) to manipulate state *locally* w/o requiring the other party to share their coding style, object names, etc.
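to make that concrete, here is a minimal sketch (not anyone's official client) of a receiver that keys its handling off the media type name and then keeps the state in whatever local shape it likes. the order fields, the URLs, and the `parse_hal`/`handle` helper names are all made up for illustration, and the parser assumes each `_links` entry is a single link object (HAL also allows arrays).

```python
import json

# Hypothetical HAL-style message; the field names and URLs are assumptions.
MESSAGE = json.dumps({
    "_links": {
        "self": {"href": "/orders/42"},
        "payment": {"href": "/orders/42/payment"},
    },
    "drink": "latte",
    "quantity": 2,
})


def parse_hal(body):
    """Split a HAL-style JSON document into state and transitions."""
    doc = json.loads(body)
    # Transitions: link relation name -> target URL.
    # Assumes every entry is a single link object, not an array.
    links = {rel: link["href"] for rel, link in doc.get("_links", {}).items()}
    # State: everything that is not a reserved "_" property.
    state = {key: value for key, value in doc.items() if not key.startswith("_")}
    return state, links


# The only shared knowledge is the media type name and its rules.
PARSERS = {"application/hal+json": parse_hal}


def handle(content_type, body):
    """Pick a parser from the Content-Type header value."""
    media_type = content_type.split(";")[0].strip().lower()
    parser = PARSERS.get(media_type)
    if parser is None:
        raise ValueError("no parser for media type: " + media_type)
    return parser(body)


state, links = handle("application/hal+json; charset=utf-8", MESSAGE)
print(state)
print(links["payment"])
```

nothing here depends on how the server models an order internally. the client could just as well load `state` into a class, a tuple, or a database row.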
on a few of your points...
"client to post a request for two coffees without understanding the content"
there is no requirement in Fielding's description that clients act w/o understanding _content_. the point is that clients should be able to act w/o sharing knowledge of server-side internal coding details.
"Application specific media types where the client is programmed to look for domain specific stuff defeat the whole idea of avoiding out of band information,"
Fielding is not saying that "app specific media types...defeat" anything. what he is addressing is the "shared understanding" between client and server - which he claims should be limited to the message (media type) itself and not leak into the source code implementation details (e.g. the object model, storage model, etc.) of any server or client.
"What are the conditions for a media type suitable for hypermedia?"
Fielding offered a quote that might help answer that Q:
"Hypermedia is defined by the presence of application control information embedded within, or as a layer above, the presentation of information."[0] most media types lack this "application control information". For example, XML, JSON, CSV have no hypermedia controls. HTML, VoiceXML, Atom, HAL, Siren, Collection+JSON all have hypermedia controls defined as part of a valid message.
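a rough sketch of that distinction, assuming a HAL-style body and a made-up "cancel" link relation: the same bytes yield controls when labeled application/hal+json, but none when labeled application/json, because plain JSON defines no place for them. the `controls` and `hal_controls` names are hypothetical helpers, not part of any library.

```python
import json


def hal_controls(doc):
    """Return the link relation names HAL defines under "_links"."""
    return sorted(doc.get("_links", {}))


# Only formats that define hypermedia controls get a reader here.
CONTROL_READERS = {
    "application/hal+json": hal_controls,
}


def controls(media_type, body):
    """List the controls a client may rely on for this media type."""
    reader = CONTROL_READERS.get(media_type)
    if reader is None:
        # The format itself defines no controls, whatever the body contains.
        return []
    return reader(json.loads(body))


# Hypothetical order message; "cancel" is an illustrative relation name.
body = json.dumps({
    "drink": "latte",
    "_links": {
        "self": {"href": "/orders/42"},
        "cancel": {"href": "/orders/42/cancel"},
    },
})

print(controls("application/hal+json", body))  # ['cancel', 'self']
print(controls("application/json", body))      # []
```

a client could still guess at links inside plain JSON, but that guess would come from out-of-band knowledge rather than from the media type.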
I talk about the concept of "hypermedia types" via what I named "H-Factors" in 2010[1]. I expanded on this idea in 2011 and covered one approach to designing hypermedia messages in "Building Hypermedia APIs"[2]. I also recently wrote a blog post that describes a mental model for dissecting the messages sent between client and server. It's titled "Three Levels of Hypermedia Messages"[3] and talks about Structure-, Protocol-, and Application-level semantics.
hopefully, this will give you some ideas.
feel free to continue discussion here or start additional threads, if you like.
Cheers.