Assuming your camera's viewport has the same dimensions as the input image, you can use the 2D pose landmarks, which are normalized to the width/height of the input image in the range [0, 1].
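Note that multiplying the X, Y values by your image width/height gives pixel (screen space) coordinates, not clip space. To get back to world space, first map the normalized values to normalized device coordinates in [-1, 1]: x_ndc = 2x - 1 and y_ndc = 1 - 2y (the flip is needed in most setups because image y points down while NDC y points up). Then, depending on your setup, you can multiply by the inverse projection matrix, divide by w, and multiply by the inverse view matrix to get the world position.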
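Here is a minimal sketch of that unprojection in plain Python. It assumes a symmetric perspective camera that looks down -Z (OpenGL-style), in which case the inverse projection reduces to scaling by tan(fov/2). The names `landmark_to_ndc`, `unproject`, `fov_y_deg`, `aspect`, and `cam_to_world` are mine, not from any library, and `depth` is whatever distance from the camera you decide to place the point at.

```python
import math

def landmark_to_ndc(x, y):
    """Map a normalized image landmark (origin top-left, y down)
    to normalized device coordinates in [-1, 1] (y up)."""
    return 2.0 * x - 1.0, 1.0 - 2.0 * y

def unproject(x, y, depth, fov_y_deg, aspect, cam_to_world):
    """Return the world position of a landmark placed `depth` units
    in front of the camera.

    Assumptions: symmetric perspective frustum, camera looks down -Z,
    `cam_to_world` is the inverse view matrix as a row-major 4x4 list.
    """
    ndc_x, ndc_y = landmark_to_ndc(x, y)
    half_h = math.tan(math.radians(fov_y_deg) / 2.0)
    # Equivalent to applying the inverse projection and dividing by w
    # for this kind of camera.
    view = (
        ndc_x * half_h * aspect * depth,
        ndc_y * half_h * depth,
        -depth,
        1.0,
    )
    # Apply the inverse view matrix to go from view space to world space.
    return tuple(
        sum(cam_to_world[r][c] * view[c] for c in range(4)) for r in range(3)
    )
```

With an identity camera transform, the image centre (0.5, 0.5) at depth 2 ends up at (0, 0, -2), directly in front of the camera. If your engine exposes its own unproject helper, that is probably the better choice, since it will match the engine's conventions.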
For z I'm not completely sure if there's an accurate way of getting this, but I just discard the z value of the 2D pose landmark and take the z value from the 3D world landmark as it is.
This should be enough to move the hips of a rigged 3D model in a convincing way. You may want to explore other options for getting a more accurate z value though.
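To make the hips case concrete, here is a rough sketch of how I would combine the two landmark sets. It assumes both lists use the same indexing and that indices 23 and 24 are the left and right hips (that is the MediaPipe Pose layout; adjust for your model). `Landmark` and `hip_target` are hypothetical names for illustration only.

```python
from typing import NamedTuple, Sequence

class Landmark(NamedTuple):
    x: float
    y: float
    z: float

# Assumption: MediaPipe Pose style indices for the hips.
LEFT_HIP, RIGHT_HIP = 23, 24

def hip_target(image_landmarks: Sequence[Landmark],
               world_landmarks: Sequence[Landmark]) -> Landmark:
    """Midpoint of the hips: x, y from the normalized 2D landmarks,
    z taken as-is from the 3D world landmarks."""
    l2, r2 = image_landmarks[LEFT_HIP], image_landmarks[RIGHT_HIP]
    l3, r3 = world_landmarks[LEFT_HIP], world_landmarks[RIGHT_HIP]
    return Landmark(
        (l2.x + r2.x) / 2.0,
        (l2.y + r2.y) / 2.0,
        (l3.z + r3.z) / 2.0,  # the 2D landmarks' z is discarded
    )
```

The returned x and y are still normalized to [0, 1], so they need to be unprojected before you apply them to the model.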