I'm sure that NIST has already been thinking about compliance testing of Post-Quantum Cryptography, perhaps more than anyone. But I think that this is something where implementers, vendors, and especially cryptographic design teams can do more to help.
In the semiconductor industry, people repeat the mantra of "design for test" a lot. PQC algorithms tend to be quite a bit more complex than the older algorithms they replace, so perhaps the community can present arguments on how to get good test coverage for the correctness of implementations, and also for testing implementation security.
After standardization, PQC algorithms can be used in "certified cryptographic modules," which are tested by third-party security testing labs against FIPS 140-3 or some Common Criteria profile. Much of commercial cryptography works like this: a lack of that FIPS certificate will prevent government sales, and those standards are also handy for specifying requirements and acceptance criteria in contracts. However, those third-party labs will typically test only the specific things that they are asked to test, on paper, so these considerations are essential.
I have a talk scheduled today (Thursday) at ICMC '21 in the post-quantum track (right after Dustin Moody): 16:30 (ET), "PQC Modules: Requirement Specifications, Integration, and Testing (Q23b)". https://icmconference.org/
Here are the slides: https://mjos.fi/doc/20210902-icmc-q23b-saarinen-pqc-reqspec.pdf
The talk is mostly about how testing is done currently, not about the best possible way it could be done in the future.
On this topic I do have a miscellaneous wish-list for NIST and the community:
1. Testing coverage (and KATs)
Test cases for failures: This may seem awfully mundane to academic cryptographers but is essential to implementers. It is obvious that malformed and mismatched KEM ciphertexts and signatures need to be tested. I may use random bit flips for that, and mismatched public and secret keys, but perhaps there is more that ought to be done; a minimal sketch of such a negative test follows.
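For illustration, here is a minimal sketch of this kind of test in Python. The kem_keygen / kem_encaps / kem_decaps functions are hypothetical stand-ins for whatever byte-oriented KEM API is under test:

    import os

    def flip_random_bit(data):
        # Return a copy of `data` with one uniformly random bit flipped.
        i = int.from_bytes(os.urandom(4), "big") % (len(data) * 8)
        buf = bytearray(data)
        buf[i // 8] ^= 1 << (i % 8)
        return bytes(buf)

    def test_malformed_ciphertext(trials=100):
        pk, sk = kem_keygen()                        # hypothetical KEM API
        for _ in range(trials):
            ct, ss = kem_encaps(pk)
            ss2 = kem_decaps(sk, flip_random_bit(ct))
            # The corrupted ciphertext must not yield the good shared secret;
            # with implicit rejection the output should be pseudorandom instead.
            assert ss2 != ss, "malformed ciphertext was accepted"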
Technical lemmas and formal models for internal components: Sometimes internal traces are not particularly useful, as internal representations can vary (lazy reduction, masked representations, etc.). However, formal models for internal components are extremely useful. Some candidate specifications have "technical lemmas" that can be converted into formal assertions, and more would be great. I have also found errors in some of them, so it seems that these technical lemmas have not always been formally derived.
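As a toy example of what such an assertion can look like (a norm bound on a signature response vector, in the style of the Dilithium specification's lemmas; the constants here are illustrative):

    GAMMA1, BETA = 2**17, 196   # illustrative values, not from any one spec

    def assert_response_norm(z):
        # Lemma-style assertion: an honestly generated response vector z
        # must satisfy max|z_i| < GAMMA1 - BETA, regardless of how z was
        # represented internally (lazy reduction, masking, etc.).
        assert max(abs(c) for c in z) < GAMMA1 - BETA, "technical lemma violated"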
Fully deterministic functions with a seed argument: For automated testing (ACVTS style), the primitives could be specified as fully deterministic functions, which take a random "seed" as an extra API argument. This way one doesn't need to test with a dummy RBG. The spec can just state how much "full entropy" (SP 800-90C term) the algorithm needs and use an XOF expander internally. Many candidates do this already. The common use of SHAKE for this purpose was originally done for performance, but it has many advantages beyond that.
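A minimal sketch of such a seed-driven API, with hypothetical names (keygen_from_coins stands for the deterministic core of whatever scheme is under test):

    import hashlib

    def kem_keygen_derand(seed):
        # The spec states the required "full entropy" (here: 32 bytes) and
        # expands the seed internally with an XOF.
        assert len(seed) == 32
        coins = hashlib.shake_256(seed).digest(128)   # illustrative length
        return keygen_from_coins(coins)               # hypothetical core

An ACVTS-style KAT then fixes the seed and compares the resulting (pk, sk) byte-for-byte; no dummy RBG hook is needed.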
Additional test vectors: Detached signatures (i.e., signatures without the message). Furthermore, since hash-and-sign is clearly no longer the preferred paradigm, a definition of how to use other hash functions with the signature schemes would be great too. The already-ratified SP 800-208 hash-based signatures do randomized hashing in two different ways (XMSS vs. LMS/HSS), and it now looks like there may be even more variants coming. Perhaps there could be a uniform way of doing this; one conceivable shape is sketched below.
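Purely as a hedged illustration of what such a uniform construction could look like (a per-signature randomizer in the spirit of SP 800-208; sign_detached is a hypothetical core signer):

    import hashlib, os

    def prehash_sign(sk, msg, hash_name="sha3_256"):
        rnd = os.urandom(32)                         # per-signature randomizer
        digest = hashlib.new(hash_name, rnd + msg).digest()
        return rnd, sign_detached(sk, digest)        # hypothetical core signer

    # The verifier recomputes Hash(rnd || msg) and checks the detached signature.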
2. Let cryptographers specify serialization
We have seen how modern elliptic-curve systems moved from ASN.1 point encodings to carefully considered octet-to-number transformations that have become essential parts of the algorithm definitions themselves.
Most submission teams have reasonably efficient encodings in their specifications. I hope that bit-level specifications for ciphertexts, detached signatures, public keys, and private keys will be included in the upcoming standards, together with rules for input validation; see the toy example below.
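As a toy example of the level of detail this implies, here is a sketch of a 12-bit coefficient packing with canonical-range validation (q = 3329, n = 256, in the style of the Kyber submission's byte encoding; treat it as illustration, not as any standard's final format):

    Q, N = 3329, 256

    def encode_poly(coeffs):
        assert len(coeffs) == N and all(0 <= c < Q for c in coeffs)
        out = bytearray()
        for i in range(0, N, 2):
            # Pack two 12-bit coefficients into three bytes.
            a, b = coeffs[i], coeffs[i + 1]
            out += bytes((a & 0xFF, (a >> 8) | ((b & 0x0F) << 4), b >> 4))
        return bytes(out)

    def decode_poly(data):
        assert len(data) == 3 * N // 2
        coeffs = []
        for i in range(0, len(data), 3):
            coeffs.append(data[i] | ((data[i + 1] & 0x0F) << 8))
            coeffs.append((data[i + 1] >> 4) | (data[i + 2] << 4))
        # Input validation: reject non-canonical (out-of-range) encodings.
        if any(c >= Q for c in coeffs):
            raise ValueError("coefficient out of range")
        return coeffs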
3. Try to make sure that FIPS 140-3 non-invasive testing will catch side-channel attacks
In the crypto module world, side-channel countermeasures are called "non-invasive attack mitigations." TVLA (in the ISO 17825 variant) has emerged as a way to do basic side-channel testing (Timing, DPA, Emissions) of PQC modules. It is not perfect but it is suitable for third-party testing labs to use, as there are reasonably clear fail/pass criteria and only a limited amount of creativity required when applying the test.
My impression is that ISO 17825 testing will probably become mandatory in FIPS 140-3 in the near future (i.e., a reference to that standard will go into an SP 800-140F update).
What to test: Design teams and the community can comment on which TVLA tests need to be run (e.g., non-specific random key vs. static key, malformed ciphertext vs. good ciphertext, etc.). As noted elsewhere, these must capture at least decapsulation failures. The core statistic itself is simple, as sketched below.
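For reference, the core statistic is just Welch's t-test between a fixed-input and a random-input trace set at each sample point; the |t| > 4.5 threshold below is the one commonly used in TVLA / ISO 17825-style testing (numpy assumed):

    import numpy as np

    def tvla_t(fixed, rnd):
        # fixed, rnd: (n_traces, n_samples) arrays of measurements.
        mf, mr = fixed.mean(axis=0), rnd.mean(axis=0)
        vf, vr = fixed.var(axis=0, ddof=1), rnd.var(axis=0, ddof=1)
        return (mf - mr) / np.sqrt(vf / len(fixed) + vr / len(rnd))

    def leaks(fixed, rnd, threshold=4.5):
        # Fail/pass criterion: any |t| above the threshold flags leakage.
        return bool(np.any(np.abs(tvla_t(fixed, rnd)) > threshold))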
Do we need to improve on ISO 17825? When looking at a given side-channel key-recovery attack, one should estimate how likely it is to be applicable to a module that has passed TVLA testing at some level. This will determine the practical impact of those attacks in a near-future market where many modules will generally try to be at least ISO 17825 compliant, and whether entirely new tests need to be introduced.