I shot some music videos with Pointcloud v2 years ago and found a solution, but I set it up in-camera -- you might be able to use part of the technique in post, though it will be harder.
Since the Kinect has a variable framerate (my system fluctuated between roughly 28 and 32 fps), I used a phone running a timecode-generator app, positioned somewhere in shot, to give me a fixed reference to real time in the RGB data Pointcloud recorded.

I then took that image sequence and mapped it to a time-time curve in Houdini (any 3D app should work, though). I set keyframes on the first and last frames of the song to match the timecode visible on the phone, then scrubbed through the track looking for places where the phone timecode didn't match the playbar in Houdini and added keys to nudge it back into place. With a few dozen keyframes I had a very steady match. I then copied that time curve to the Alembic sequence, and my pointcloud lined up well.
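For anyone trying this in post rather than in Houdini, the time-time curve amounts to a piecewise-linear remap built from sparse (captured frame, real time) pairs read off the in-shot timecode. Here's a minimal Python sketch of that idea -- the function names and the example numbers are mine, not from my actual setup, and a real pipeline would read the pairs from your own footage:

```python
# Hypothetical sketch: retime a variable-framerate capture using sparse
# (frame_index, real_time_seconds) observations read off an in-shot timecode.
from bisect import bisect_right

def real_time_of_frame(frame, keys):
    """Piecewise-linear interpolation: real time at a captured frame.
    `keys` is a sorted list of (frame_index, seconds) observations."""
    frames = [f for f, _ in keys]
    i = bisect_right(frames, frame) - 1
    i = max(0, min(i, len(keys) - 2))          # clamp to a valid segment
    (f0, t0), (f1, t1) = keys[i], keys[i + 1]
    return t0 + (t1 - t0) * (frame - f0) / (f1 - f0)

def frame_for_time(t, keys):
    """Inverse lookup: which captured frame lands nearest real time t."""
    times = [s for _, s in keys]
    i = bisect_right(times, t) - 1
    i = max(0, min(i, len(keys) - 2))
    (f0, t0), (f1, t1) = keys[i], keys[i + 1]
    return round(f0 + (f1 - f0) * (t - t0) / (t1 - t0))

# Example observations: capture drifted between ~28 and ~32 fps,
# so 900 frames cover 30.1 seconds of real time (numbers are illustrative).
keys = [(0, 0.0), (300, 10.2), (600, 19.8), (900, 30.1)]

# Resample the capture onto a fixed 30 fps output timeline: for each
# output frame n, pick the captured frame closest to its real time.
fps = 30
remap = [frame_for_time(n / fps, keys) for n in range(int(30.1 * fps))]
```

The few dozen keyframes I set by hand play the role of `keys` here; the denser they are around the spots where the timecode drifts, the tighter the match.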
Without the phone reference in your RGB data this would be much harder, but if there are musical instruments or dialog you can sync up visually with the song, you could probably at least get close.