Segments are user-specified in the request and capture labels within a specific time range. If no segments are specified, the entire video is treated as a single segment.
A new shot begins each time a video cut occurs or the content of the video changes significantly. When a new shot is detected, labels are annotated for that shot.
Frames are the individual images that make up the video (e.g. '24 fps' means 24 frames per second). Each frame can be annotated to identify labels.
You can see an example here of how to parse out the labels for all Segments, Shots, and Frames.
By default, 'labelDetectionMode' is set to 'SHOT_MODE', so detected labels are returned at the Shot level.
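The three annotation levels above can be walked uniformly. The following is a minimal sketch in plain Python: the nested dict only mimics the shape of a label-detection response (the `segment_label_annotations`, `shot_label_annotations`, and `frame_label_annotations` field names follow the API, but the data itself is made up for illustration), and `labels_by_level` is a hypothetical helper, not part of any client library.

```python
# Illustrative response-shaped data; values are invented for the example.
response = {
    "segment_label_annotations": [
        {"entity": {"description": "dog", "entity_id": "/m/0bt9lr"},
         "segments": [{"confidence": 0.91}]},
    ],
    "shot_label_annotations": [
        {"entity": {"description": "dog", "entity_id": "/m/0bt9lr"},
         "segments": [{"confidence": 0.88}]},
        {"entity": {"description": "park", "entity_id": "/m/0k1tl"},
         "segments": [{"confidence": 0.74}]},
    ],
    "frame_label_annotations": [
        {"entity": {"description": "dog", "entity_id": "/m/0bt9lr"},
         "frames": [{"confidence": 0.95}]},
    ],
}

def labels_by_level(resp):
    """Collect label descriptions at each annotation level present."""
    levels = {
        "segment": "segment_label_annotations",
        "shot": "shot_label_annotations",
        "frame": "frame_label_annotations",
    }
    return {
        level: [a["entity"]["description"] for a in resp.get(key, [])]
        for level, key in levels.items()
    }

print(labels_by_level(response))
```

Note that the same "dog" label appears at all three levels, which is exactly the duplication discussed below.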
In general, since a segment is made up of shots and frames, any label detected within a segment should also be detected at the shot or frame level. It is therefore reasonable to skip segment annotations and rely only on shot and/or frame annotations to avoid excess duplicates. Note that a video can contain many frames and shots with the same imagery, so duplicate labels may occur at every level. If your application requires unique results for the entire video, it is recommended to manually perform de-duplication using the 'entityId', regardless of the annotation type.
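The de-duplication step can be sketched as follows. This is an illustrative helper, not part of the API: the annotation dicts only mimic the response shape, and `unique_labels` is a hypothetical function name. It merges annotations from any number of levels and keeps one label per 'entityId'.

```python
# Illustrative shot- and frame-level annotations; data is invented.
shot_annotations = [
    {"entity": {"description": "dog", "entity_id": "/m/0bt9lr"}},
    {"entity": {"description": "park", "entity_id": "/m/0k1tl"}},
]
frame_annotations = [
    {"entity": {"description": "dog", "entity_id": "/m/0bt9lr"}},
]

def unique_labels(*annotation_lists):
    """Merge annotation lists, keeping one label per entity_id."""
    unique = {}
    for annotations in annotation_lists:
        for ann in annotations:
            entity = ann["entity"]
            # setdefault keeps the first occurrence of each entity_id.
            unique.setdefault(entity["entity_id"], entity["description"])
    return unique

print(unique_labels(shot_annotations, frame_annotations))
# {'/m/0bt9lr': 'dog', '/m/0k1tl': 'park'}
```

The "dog" label, present at both levels, survives only once because both occurrences share the same 'entityId'.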