Thanks, Casey. My team is excited to work with the improved author disambiguation!
One issue we're running into: It appears that the data schema for documents in the earlier "updated_date=*" files differs from that in later files. For example, the schema in "updated_date=2021-11-03/part_000.gz" is different from the schema in "updated_date=2023-08-18/part_000.gz". The 2023 file follows the current documentation, but the older file seems to use an outdated schema.
I haven't done a full comparison to see how common these schema differences are, but any differences are going to make automated processing of the snapshot files very challenging.
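In case it helps with reproducing this, here is a minimal sketch of how two part files could be compared. It assumes each part file is gzipped JSON Lines (one JSON object per line), which may not hold for every file. The function names and the `max_records` limit are hypothetical placeholders, and only top-level keys are compared, so differences in nested fields would not show up.

```python
import gzip
import json


def top_level_keys(path, max_records=1000):
    """Return the union of top-level keys seen in the first max_records records.

    Assumes a gzipped JSON Lines file where every non-empty line is a JSON object.
    """
    keys = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_records:
                break
            line = line.strip()
            if line:
                keys.update(json.loads(line).keys())
    return keys


def compare_schemas(old_path, new_path, max_records=1000):
    """Report which top-level keys appear in only one of the two files."""
    old_keys = top_level_keys(old_path, max_records)
    new_keys = top_level_keys(new_path, max_records)
    return {
        "only_in_old": sorted(old_keys - new_keys),
        "only_in_new": sorted(new_keys - old_keys),
    }


# Example call with the two files mentioned above (paths relative to the snapshot root):
# compare_schemas("updated_date=2021-11-03/part_000.gz",
#                 "updated_date=2023-08-18/part_000.gz")
```

Running this over every "updated_date=*" directory would show where the schema changes, but it only samples the start of each file.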
Can you offer any insight?
Thanks,
Jason