Every tracking solution produces some noise, even with very high-end capture hardware that can cost as much as a car or more.
A Kinect sensor is a cheap consumer device, and its output contains a lot of noise.
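Both my software and Faceshift include functionality to filter/smooth the data, but smoothing is always a tradeoff between smoothness and fidelity: the more you smooth, the less jitter you get, and the more fast, subtle movements get flattened out.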
Keep in mind that Faceshift costs about 10 times as much. It delivers higher quality, but it also takes a lot more time to learn and fine-tune.
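
To make the smoothness/fidelity tradeoff concrete, here is a minimal sketch of one common smoothing technique, an exponential moving average. This is only an illustration and not necessarily the filter that either program actually uses; the function name `smooth`, the `alpha` parameter, and the sample values are all made up for the example.

```python
def smooth(samples, alpha):
    """Exponential moving average (illustrative only).

    alpha close to 1.0 -> little smoothing, output follows the input closely.
    alpha close to 0.0 -> heavy smoothing, less jitter but more lag.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    previous = None
    for value in samples:
        if previous is None:
            previous = value  # first sample passes through unchanged
        else:
            # blend the new sample with the running estimate
            previous = alpha * value + (1.0 - alpha) * previous
        smoothed.append(previous)
    return smoothed


# Hypothetical tracking channel: jitter around 0, then a quick real move to 1.
noisy = [0.0, 0.1, -0.1, 0.1, -0.1, 1.0, 1.0, 1.0]

light = smooth(noisy, alpha=0.8)  # keeps fidelity, but jitter leaks through
heavy = smooth(noisy, alpha=0.2)  # hides jitter, but lags behind the real move

for raw, a, b in zip(noisy, light, heavy):
    print(f"raw={raw:+.2f}  light={a:+.3f}  heavy={b:+.3f}")
```

With the heavier setting the jitter in the first few samples mostly disappears, but the smoothed value is still well short of 1.0 three frames after the real movement. That lag is the fidelity you give up.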