The very idea of a convolution relies on the data having structure. In images, neighboring pixels form a grid that you can iterate over and that carries context information. In an XYZ point cloud there is no such relation: you just have a list of points with (generally) no structure in it. Just finding the neighbors of a point in the cloud (something we take for granted when processing images) is a problem with dedicated algorithms to solve it.
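To make the neighbor problem concrete, here is a minimal sketch, assuming the cloud is a plain unordered list of (x, y, z) tuples. The function name `k_nearest` and the toy data are made up for illustration. In an image the neighbors of pixel (r, c) are found by index arithmetic alone; in a cloud you have to search.

```python
import math

def k_nearest(points, query, k):
    """Return the k points closest to `query` by brute-force search."""
    # With no grid, every point is compared against the query: O(n) distance
    # computations per lookup (plus the sort).
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

# Hypothetical toy cloud: the list order says nothing about spatial proximity.
cloud = [
    (0.0, 0.0, 0.0),
    (5.0, 5.0, 5.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.2, 0.0),
    (9.0, 1.0, 3.0),
]
neighbors = k_nearest(cloud, (0.0, 0.0, 0.0), k=3)
```

Brute force is fine for a handful of points; for real clouds the usual approach is a spatial index such as a k-d tree or an octree, which is what the dedicated algorithms mentioned above build.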
Voxels are a different matter, and you can certainly do 3D convolution over them. Take a look at Maturana & Scherer (2015), for example: they convert a point cloud to a voxel model via occupancy grids and convolve over it normally. 3D convolution has also been used for computed tomography (CT) scan segmentation (as convolution over a 3D image) and for pose estimation (3D "images" made of sequences of images, where the 3rd dimension is actually time). I'm sure you can find more examples.
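As a rough illustration of the voxel route, here is a sketch in plain Python that bins points into a binary occupancy grid and slides a 3D kernel over it. This is not the method from the paper, only the general idea; `voxelize`, `conv3d`, the grid size and the box kernel are all assumptions chosen for the example.

```python
def voxelize(points, voxel_size, dims):
    """Binary occupancy grid: grid[i][j][k] is 1.0 if any point falls in that voxel."""
    nx, ny, nz = dims
    grid = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for x, y, z in points:
        i, j, k = int(x // voxel_size), int(y // voxel_size), int(z // voxel_size)
        # Points outside the grid bounds are simply dropped.
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k] = 1.0
    return grid

def conv3d(grid, kernel):
    """'Valid' 3D cross-correlation (what CNN layers call convolution), stride 1."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    kx, ky, kz = len(kernel), len(kernel[0]), len(kernel[0][0])
    ox, oy, oz = nx - kx + 1, ny - ky + 1, nz - kz + 1
    out = [[[0.0] * oz for _ in range(oy)] for _ in range(ox)]
    for i in range(ox):
        for j in range(oy):
            for k in range(oz):
                out[i][j][k] = sum(
                    grid[i + a][j + b][k + c] * kernel[a][b][c]
                    for a in range(kx) for b in range(ky) for c in range(kz)
                )
    return out

# Hypothetical cloud; the first two points land in the same voxel.
points = [(0.2, 0.3, 0.1), (0.4, 0.1, 0.3), (1.5, 1.5, 1.5), (2.9, 2.9, 2.9)]
grid = voxelize(points, voxel_size=1.0, dims=(3, 3, 3))
# 2x2x2 box kernel: each output cell counts occupied voxels in its window.
box = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
counts = conv3d(grid, box)
```

Once the points are in a grid, the neighborhood of a voxel is again just index arithmetic, which is why ordinary convolution works. In practice you would hand the grid to a framework's 3D convolution layer rather than loop in Python.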
Caffe supports 3D convolution; you can start reading about how to accomplish it here.