What's your definition of peer-to-peer? You could use WebRTC to implement client-to-client communication, use WebSockets, or even go bare-bones with raw TCP sockets.
WebRTC needs a signalling server so that peers can discover each other, but after the initial handshake the connection is peer-to-peer (unless NAT traversal fails and traffic has to be relayed through a TURN server). Raw TCP sockets can also connect clients directly, but you would have to implement your own protocol on top of them, and connecting two clients that sit behind NAT is a major burden. WebSockets are the easiest option, but data has to pass through the server before reaching the other client, which adds latency compared with the first two options. As a rough estimate, that could mean something like 100-150 ms in near-optimal conditions.
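On the "implement a sort of protocol yourself" point: TCP gives you a byte stream, not messages, so even the simplest protocol needs framing. Here is a minimal sketch, assuming a 4-byte big-endian length prefix (an arbitrary choice for illustration, not a standard) and leaving NAT traversal out entirely. The names `send_message` and `recv_message` are hypothetical.

```python
import socket
import struct

# Assumed framing: a 4-byte big-endian unsigned length ("!I") before every message.
HEADER = struct.Struct("!I")


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send one message: the length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; TCP may deliver data split into arbitrary pieces."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_message(sock: socket.socket) -> bytes:
    """Receive one message framed by send_message."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return _recv_exact(sock, length)


if __name__ == "__main__":
    # A connected socket pair stands in for two peers that have already connected.
    a, b = socket.socketpair()
    send_message(a, b"hello")
    print(recv_message(b))
    a.close()
    b.close()
```

Everything beyond this (reconnection, message types, hole punching) would also be yours to build, which is where most of the burden lies.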
You could record the microphone's audio with the Media plugin (http://plugins.cordova.io/#/package/org.apache.cordova.media), encode it in real time (base64, base128, or raw binary), and decode it on the receiver. This adds some latency, though, so you may want to use native calls or a native plugin at least for the decoding part.
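To make the encode/decode step concrete, here is a sketch of wrapping one audio chunk in a text message with a base64 payload and unwrapping it on the receiver. The JSON shape (`seq`, `data`), the function names, and the 16-bit mono PCM at 16 kHz used in the example are all assumptions for illustration; how you obtain raw chunks from the recording is not shown.

```python
import base64
import json
from typing import Tuple


def encode_chunk(seq: int, pcm: bytes) -> str:
    """Hypothetical message format: JSON with a sequence number and base64 audio."""
    return json.dumps({"seq": seq, "data": base64.b64encode(pcm).decode("ascii")})


def decode_chunk(message: str) -> Tuple[int, bytes]:
    """Inverse of encode_chunk: return the sequence number and the raw bytes."""
    obj = json.loads(message)
    return obj["seq"], base64.b64decode(obj["data"])


if __name__ == "__main__":
    # Assumed format: 20 ms of 16-bit mono PCM at 16 kHz = 320 samples * 2 bytes.
    chunk = bytes(640)
    seq, data = decode_chunk(encode_chunk(7, chunk))
    print(seq, data == chunk)
    print(len(base64.b64encode(chunk)))  # 4 * ceil(640 / 3) = 856 characters
```

Base64 inflates the payload by about a third, so if the transport carries binary (WebSockets have binary frames), sending the raw bytes avoids that overhead.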
I haven't looked into the video part, but it shouldn't be too difficult to build a plugin that gets the camera feed. Once that is done, you would proceed the same way as with the audio.
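Encoded video frames are typically much larger than a short audio chunk, so proceeding the same way would likely also mean splitting each frame into several messages. A hypothetical sketch of that step, assuming frames are already encoded to bytes and using a 1200-byte payload limit chosen only for illustration (the right limit depends on the transport):

```python
from typing import Dict, List, Optional

# Hypothetical per-message payload limit; not taken from any specific transport.
MAX_PAYLOAD = 1200


def fragment(frame_id: int, frame: bytes, max_payload: int = MAX_PAYLOAD) -> List[dict]:
    """Split one encoded frame into numbered pieces (at least one, even if empty)."""
    pieces = [frame[i:i + max_payload] for i in range(0, len(frame), max_payload)] or [b""]
    return [
        {"frame": frame_id, "index": i, "count": len(pieces), "data": piece}
        for i, piece in enumerate(pieces)
    ]


class Reassembler:
    """Collect pieces per frame and return the frame once all pieces have arrived."""

    def __init__(self) -> None:
        self._pending: Dict[int, Dict[int, bytes]] = {}

    def add(self, piece: dict) -> Optional[bytes]:
        parts = self._pending.setdefault(piece["frame"], {})
        parts[piece["index"]] = piece["data"]
        if len(parts) < piece["count"]:
            return None  # still waiting for more pieces of this frame
        del self._pending[piece["frame"]]
        return b"".join(parts[i] for i in range(piece["count"]))


if __name__ == "__main__":
    frame = bytes(range(256)) * 20  # stand-in for one encoded frame (5120 bytes)
    pieces = fragment(1, frame)
    r = Reassembler()
    out = None
    for p in reversed(pieces):  # reassembly does not depend on arrival order
        out = r.add(p)
    print(len(pieces), out == frame)
```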