mCDF Draft Specification


John Dziurlaj

May 2, 2022, 12:43:50 PM
to cdf-ball...@list.nist.gov

Good Morning!

 

I am pleased to share with you the draft specification of the Micro Common Data Format. Please find it attached to this email, and take a look in advance of our Wednesday meeting. We will discuss the mCDF in detail during the CDF Research Group Meeting.

 

I am also attaching a prototype that allows you to mark a ballot and then view the mCDF representation. (NB: This PDF requires Adobe Reader; it cannot be viewed in third-party viewers.) It will be demoed on our Wednesday call at noon ET.

 

Thanks and talk to you soon!

 

John Dziurłaj /d͡ʑurwaj/

 

 

mCDF Prototype.pdf
mCDF Special Publication Draft - WG.docx

Ray Lutz

May 10, 2022, 1:39:16 PM
to cdf-ball...@list.nist.gov
Dear Ballot Styles working group:

Here are some initial comments from my point of view on the "micro CDF" proposal.

What is proposed here is a profile of the CDF CVR records, forming a micro-CDF format thrifty enough to be easily encoded into a QR code.

> This general idea is good. The CDF CVR standard is based on extremely un-thrifty formats, with long identifiers that are repeated to create many gigabytes of data. XML in particular is not concise, because these long identifier tags are both opened and closed with the same identifier. JSON is a bit less verbose, because closing brackets can be used to close the data item. Meanwhile, the actual data, such as contest names and option names, are replaced with integers defined by enumerations. That makes the repeated identifier names easy to read but the value data hard to identify, because it has to be run through the enumerations. Using these enumerations unfortunately would also make it fairly easy to swap the results for two candidates, just by changing their integer labels in the candidate manifest file.

> This is NOT a ballot styles standard, but instead a re-turn of the CDF CVR standard, something the group said was out of scope. I like the idea of discussing this, but I think it is a bit strange that the group first says it won't discuss the CDF CVR standard (and repeats it at every meeting in the intro), then proceeds to work on a method of making the resulting data more terse, while avoiding the ballot styles issue.

> And, don't get me wrong, I like the idea that this is being worked on. But I do not support jumping to a strange syntax from the healthcare industry, while avoiding the obvious: Simple JSON. The industry has already STRONGLY EMBRACED JSON as a platform and language independent standard. It is much more terse, and there are flavors of JSON that are being used by some implementations that take it a bit further, by eliminating quotes around identifiers. If the identifier names are reduced in size, then we have something that is directly compatible with the CDF CVR and also much more terse, which we sorely need! I have already proposed that JSON become the preferred format to the CDF CVR github comments, and to deprecate XML, but there appears to be no way to move that idea forward, since this group claims not to want to work on it (and yet does anyway).

> A number of alternative realizations of the CDF CVR should be available. One is simple JSON (no quotes around identifiers) with shorter identifier names (not "OutstackConditionIds", for example, but perhaps "OCI", or in fact eliminating this redundant list). A flat realization may be much more thrifty: a hex value identifies the contests included on the ballot (i.e. the ballot style), followed by bits indicating whether options are selected. This partly solves the ballot style indication and also allows a compressed flat expression using only the ASCII characters 0-9a-f. Total characters per ballot is (total number of contests in the election)/4 + (total options in active contests)/4, each rounded up. RCV would be decomposed into separate contests per rank (resulting in bits). For the example, assuming 23 contests and 58 total options (in this style): 6 characters + 15 characters, for a total of 21 characters. Base64 encoding could reduce this further by using 6 bits per character instead of 4, thus 4 characters + 10, or 14 characters, and it is really no more difficult to decode than the healthcare encoding, HL7v2.
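The character counts in the paragraph above can be reproduced with a short sketch. It assumes one bit per contest in the election and one bit per option in the active contests, each field rounded up to whole characters; the function names `flat_size` and `encode_hex_bits` are hypothetical and not part of any proposal.

```python
import math


def flat_size(num_contests, num_options, bits_per_char):
    """Characters needed for the contest mask plus the option bits."""
    style_chars = math.ceil(num_contests / bits_per_char)
    option_chars = math.ceil(num_options / bits_per_char)
    return style_chars + option_chars


def encode_hex_bits(bits):
    """Pack booleans (most significant bit first) into lowercase hex digits."""
    padded = list(bits) + [False] * (-len(bits) % 4)  # zero-pad to a multiple of 4
    digits = []
    for i in range(0, len(padded), 4):
        nibble = sum(int(b) << (3 - j) for j, b in enumerate(padded[i:i + 4]))
        digits.append(format(nibble, "x"))
    return "".join(digits)


hex_chars = flat_size(23, 58, 4)     # 6 + 15 characters
base64_chars = flat_size(23, 58, 6)  # 4 + 10 characters
print(hex_chars, base64_chars)
```

The bit order and the zero padding are assumptions made for the sketch; a real profile would have to fix both.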

Not including the header info, which should be further reduced, the proposal in the document gives:

/Tla9;SEL|1GO|1AEF^^^1~1AAR^^^2;SEL|3AS|1CDY;SEL|4SS|
1DNT;SEL|5TS|1EJM;SEL|6RC|1FMZ;SEL|8SR34|1GCB;SEL|9CC|1HDW~1HSK;SEL|10SB|
1IMC;SEL|11JS1|1JSK;SEL|11JS2|1KJO;SEL|12CA9|1LTL;SEL|13CP1|1MTO;SEL|18CP6|1RRM;


This portion is encoded to 313 characters, which is still not terse enough! The "compressed flat" approach, which is also a direct conversion of the CVR data, takes only 21 (or 14) characters for this portion.


2. I do not support the notion that the group should ignore the support of hand-marked paper ballots, and is focused only on BMDs. The terse version of the CVR should use a syntax which is robust enough to easily accommodate the reduction of the CDF CVR for any type of ballot. This is important!

3. The current version proposes that the jurisdiction should be written out as a text string, like "ocd-division/country:us/state:oh/county:summit". I disagree with this approach, as there are already standards that can be adopted for the FIPS id of the jurisdiction. This is a simple lookup, and a single number is better, with date of the election.

4. Referring to another document using a URL in the terse format should be replaced instead with an identifier of other data needed. Even the CDF CVR requires a number of manifest tables to decode the JSON or XML. These can be provided by the jurisdiction and identified for the election in question. Thus, yes, we need other data, but since the jurisdiction is identified, we need only a short identifier of the data package used with this. I would guess perhaps the first 5 characters of the MD5 checksum of the data might be good enough.

5. I am surprised and dismayed that the standards group is not operating by requesting recommendations from the election community, and also not addressing the ballot styles issue that it first claimed to be working on. Yet I support improving the CDF CVR standard. I do not support using an outdated and obtuse syntax from the healthcare industry, when simple JSON would be far easier to support and in harmony with current trends in the industry in general. I would like to see at least the options evaluated, and how they compare.

For example:
a. How does simple JSON compare with shortened identifiers?
b. How would a flat CVR format compare, using hex bitfields?

Just saying "the winner is" in the presentation does not provide any justification for why this is considered the winner.

6. The proposal says "Enumeration values in CDF are String literals and in mCDF are Integer literals." But it seems the integers are also expressed as ASCII and not binary, so this appears to be a mistake unless I am misunderstanding it (from Appendix A).

7. I suggest encoding write-ins at the end rather than embedding them next to each contest. This way, the initial QR code can encompass all the main selections, and continuation codes are needed only when write-ins are present and contain only the list of write-ins.

Thanks for allowing me to comment on this work.

--Ray Lutz

-- 
-------
Ray Lutz
Citizens' Oversight Projects (COPs)
http://www.citizensoversight.org
619-820-5321

Susan Eustis

May 10, 2022, 2:59:10 PM
to Ray Lutz, cdf-ballot-styles
Ray,
I concur with your comments
Susan
--

Susan Eustis
President
WinterGreen Research
6 Raymond Street
Lexington, Massachusetts
phone 781 863 5078
cell     617 852 7876

John Dziurłaj

May 19, 2022, 7:32:35 AM
to cdf-ballot-styles, ray...@citizensoversight.org

Good Morning Ray,

Thank you for your feedback.

Regarding the Cast Vote Record Common Data Format Specification, we need to give manufacturers and vendors the opportunity to implement the versions of the CDFs that have been published. Additionally, the CVR CDF is incorporated into the Voluntary Voting System Guidelines, Version 2.0, with the citation:

Wack et al. Cast Vote Records Common Data Format Specification (NIST SP 1500-103), Version 1.0. February 2019.

Thus, the systems certified to VVSG 2.0 (as currently written) would need to support JSON and XML regardless of what new serialization is developed in a subsequent revision. A major update to the CVR CDF at this juncture would be premature; we simply do not have enough real-world experience from vendors and other stakeholders on the standard’s use.

Instead, it is our desire to work towards a minor point release of the CVR CDF that would incorporate the “mCDF Profile for Contest Selection Capture” (currently Appendix D in mCDF Specification). This addition would have no impact on any current VVSG 2.0 use-case for CVRs, unlike the deprecation of XML or the addition of a new serialization that could replace JSON/XML.

Regarding the development of a more terse serialization that could be exported from scanners, note on the first page of the mCDF spec it says:

mCDF is not intended to offer an alternative to the JSON and XML serializations of the CDFs in environments that are not store-space constrained.

Thus, we are not looking to create a compact serialization for any non-paper application. This effort is scoped to the reading of ballots, and the mCDF for this purpose is a profile of the CVR that can be read by scanners. The mCDF instance would appear on the ballot and its use would be defined in the ballot definition.

Regarding the use of HL7, this was a result of analysis that showed that any name/value paired style format is far too verbose, and in light of this, how can we leverage the model driven approach that was used to develop the JSON and XML schemas? The design principles section of the mCDF Specification draft (pg. 1) describes several requirements we had for a new serialization. An HL7 style serialization meets them.

Regarding a few other items:

·       You mention “OutstackConditionIds”; this is not part of the NIST CVR CDF, perhaps you are looking at a vendor specific CVR output?

·       You mention FIPS codes. FIPS is supported by the specification. See here. (3)

·       I need a little bit more information to understand (4)

·       You mention enumeration values. String literals are of the form `fips`, `local-level`, etc. integer literals are of the form 1,2, etc. (6)

Regards,

John Dziurlaj

Ray Lutz

Jun 3, 2022, 12:35:14 PM
to cdf-ball...@list.nist.gov
Sorry for my delay. Here is my response.


On 5/19/2022 4:32 AM, 'John Dziurłaj' via cdf-ballot-styles wrote:

> Good Morning Ray,
>
> Thank you for your feedback.
>
> Regarding the Cast Vote Record Common Data Format Specification, we need to give manufacturers and vendors the opportunity to implement the versions of the CDFs that have been published. Additionally, the CVR CDF is incorporated into the Voluntary Voting System Guidelines, Version 2.0, with the citation:
>
> Wack et al. Cast Vote Records Common Data Format Specification (NIST SP 1500-103), Version 1.0. February 2019.
>
> Thus, the systems certified to VVSG 2.0 (as currently written) would need to support JSON and XML regardless of what new serialization is developed in a subsequent revision. A major update to the CVR CDF at this juncture would be premature; we simply do not have enough real-world experience from vendors and other stakeholders on the standard’s use.

I am actually suggesting a lighter-weight version of JSON without making too many changes, and certainly not adopting an ancient healthcare standard for encoding which has nothing to do with any implementation in the election sphere.

Thus, JSON without quotes, and with short keys in the key:value pair format.
There are now a number of variants of JSON, that can be easier for humans and also can be more terse.

HJSON: https://hjson.github.io/
JSON5: https://json5.org/

These relax JSON's quoting rules: both allow unquoted keys, and Hjson also allows unquoted string values.

JSON: {"key1":"123", "key2":"456", "Alist": ["a","b","c"]}
HJSON: {key1:123, key2:456, Alist: [a,b,c]}


> Instead, it is our desire to work towards a minor point release of the CVR CDF that would incorporate the “mCDF Profile for Contest Selection Capture” (currently Appendix D in mCDF Specification). This addition would have no impact on any current VVSG 2.0 use-case for CVRs, unlike the deprecation of XML or the addition of a new serialization that could replace JSON/XML.

I strongly disagree with this proposal. XML should be deprecated and we should stick with JSON or JSON-like syntax and not use healthcare syntax.

> Regarding the development of a more terse serialization that could be exported from scanners, note on the first page of the mCDF spec it says:
>
> mCDF is not intended to offer an alternative to the JSON and XML serializations of the CDFs in environments that are not store-space constrained.

Why not? I say delete that sentence.

> Thus, we are not looking to create a compact serialization for any non-paper application. This effort is scoped to the reading of ballots, and the mCDF for this purpose is a profile of the CVR that can be read by scanners. The mCDF instance would appear on the ballot and its use would be defined in the ballot definition.

That is out of scope from the original intent of the working group, which was to work on ballot styles. This is not a ballot style indicator, but an alternative cast vote record encoding, the thing you explicitly say is out of scope at the start of every meeting.

> Regarding the use of HL7, this was a result of analysis that showed that any name/value paired style format is far too verbose,

Please provide your analysis that reached that conclusion. I have seen no analysis of this type.

> and in light of this, how can we leverage the model driven approach that was used to develop the JSON and XML schemas? The design principles section of the mCDF Specification draft (pg. 1) describes several requirements we had for a new serialization. An HL7 style serialization meets them.

JSON can also express hierarchical relationships. But just because the data is modeled one way does not mean the encoding has to follow that directly. We also need to be aware of the real concerns of space and time when we move from a UML model to a realization of the model.

> Regarding a few other items:
>
> ·       You mention “OutstackConditionIds”; this is not part of the NIST CVR CDF, perhaps you are looking at a vendor specific CVR output?

I have been working only with the Dominion JSON implementation of the CVR which is quite similar to the NIST final recommendation, and apparently this was from an earlier draft. It probably would be a good use of time for those at NIST and your group to study the Dominion implementation and see how it compares and differs with the NIST CVR CDF. Maybe it can be a profile because it is pretty close.

> ·       You mention FIPS codes. FIPS is supported by the specification. See here. (3)

In a concise mCVR, the long-winded alphanumeric expression of the jurisdiction should be avoided: use only the FIPS code and do not allow any other method.

> ·       I need a little bit more information to understand (4)

4. Referring to another document using a URL in the terse format should be replaced instead with an identifier of other data needed.

--> I do not like the idea that it will be necessary to provide or access a URL to decode the mCVR. This is a security hole.


Even the CDF CVR requires a number of manifest tables to decode the JSON or XML. These can be provided by the jurisdiction and identified for the election in question. Thus, yes, we need other data, but since the jurisdiction is identified, we need only a short identifier of the data package used with this. I would guess perhaps the first 5 characters of the MD5 checksum of the data might be good enough.

--> So for example, let's say we continue to use integers as labels for the candidates. These are provided in a "manifest" table now, which simply lists the candidates' full names and perhaps other information, and also provides the Idx number for each candidate. Similarly for contests. All these tables are already expressed as JSON and are part of the CVR, but they do not need to be repeated in every record. So for the microCVR, we could calculate the MD5 hash value of those tables when expressed as JSON (and you have to specify the line endings to make sure they are the same across the various hosts), and then take the first five hex digits of that hash code, such as '345af', so the user will know they have the right set of other files.
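A minimal sketch of that identifier, assuming the manifest tables are available as a Python dictionary. Serializing with sorted keys and no insignificant whitespace is one way to sidestep the line-ending concern, since the canonical form then contains no line breaks at all. The function name `manifest_id` and the sample manifest are hypothetical.

```python
import hashlib
import json


def manifest_id(manifest, digits=5):
    """First `digits` hex characters of the MD5 of a canonical JSON form."""
    # Sorted keys and fixed separators make the bytes independent of key
    # order, indentation and platform line endings.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()[:digits]


# Hypothetical manifest content, for illustration only.
manifest = {
    "candidates": [{"Idx": 1, "Name": "Jane Doe"}, {"Idx": 2, "Name": "John Smith"}],
    "contests": [{"Idx": 1, "Name": "Governor"}],
}
print(manifest_id(manifest))
```

Five hex digits carry 20 bits, so the prefix distinguishes about a million values. That is enough to catch an accidental mismatch of data packages, but it is not a defense against deliberate tampering.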

> ·       You mention enumeration values. String literals are of the form `fips`, `local-level`, etc. integer literals are of the form 1,2, etc. (6)

My comment here is specifically about the text you are using in that section and is a technicality.


6. The proposal says "Enumeration values in CDF are String literals and in mCDF are Integer literals." But it seems the integers are also expressed as ASCII and not binary, so this appears to be a mistake unless I am misunderstanding it (from Appendix A).
I think the only difference is whether they are quoted. Strict JSON requires quotes but some JSON variants do not.
Certainly, the enumeration values are always numbers (thus the reason for the term 'enumeration') but they are all encoded as ASCII
or equivalent, and not a binary encoding. Thus, they are both strings in CVR and mCVR.

I would rather that you use the term Cast Vote Record, rather than the more general "common data format" which is essentially meaningless.
This is a cast vote record, in any reading of the term.

--Ray

Ray Lutz

Jun 3, 2022, 2:25:01 PM
to cdf-ball...@list.nist.gov
I'd like to propose a different direction for the micro CVR implementation for QRCodes:

CBOR appears to be a standard concise method to encode JSON-like data.
See IETF RFC-7049  https://www.rfc-editor.org/rfc/rfc7049  "Concise Binary Object Representation (CBOR)" (since obsoleted by RFC 8949)
see also IETF RFC-8392  https://datatracker.ietf.org/doc/html/rfc8392  "CBOR Web Token (CWT)"

QRCode for EU Green Pass (covid vax) uses the following:

CBOR Encode
zip encode
base45 encode

See https://github.com/ehn-dcc-development/eu-dcc-hcert-spec/blob/main/README.md


Design objectives apply here:
  1. Use an encoding which is as compact as practically possible whilst ensuring reliable decoding using optical means.

    Example: CBOR in combination with deflate compression and QR encoding.

  2. Use existing, proven and modern open standards, with running code available (when possible) for all common platforms and operating environments to limit implementation efforts and minimise risk of interoperability issues.

    Example: CBOR Web Tokens (CWT).

  3. When existing standards do not exist, define and test new mechanisms based on existing mechanisms and ensure running code exists.

    Example: Base45 encoding per new Internet Draft.

  4. Ensure compatibility with existing systems for optical decoding.

    Example: Base45 encoding for optical transport.


Please give these ideas consideration.

--Ray

John Dziurlaj

Jul 5, 2022, 12:11:14 PM
to Ray Lutz, cdf-ball...@list.nist.gov

Good Afternoon Ray,

 

Thanks for sharing this with the group. Have you attempted to apply this approach to our domain? Would you be able to share your findings in terms of storage space or features?

 

Regards,

 

John Dziurlaj

Ray Lutz

Jul 5, 2022, 5:22:07 PM
to John Dziurlaj, cdf-ball...@list.nist.gov
Hi John:

I think the best way to approach your question is if you provide the basic JSON version of the CVR that was the basis for the encoding you proposed, and I will make a proposed alternative.

The JSON/CBOR/COSE/Zlib/Base45 pipelines have already been developed and are easily available. These are already being used for QR Code data encoding.

See https://github.com/ehn-dcc-development/ehn-sign-verify-python-trivial

Notice the "payload" is json but using shortened names for each field.

{
    "dob": "XXXX-XX-XX",
    "nam": {
        "fn": "xxx Xxxxx",
        "fnt": "XXX<XXXXX",
        "gn": "Xxxx Xxxxxx",
        "gnt": "XXXX<XXXXXX"
    },
    "v": [
        {
            "ci": "URN:UCI:01:NL:......#:",
            "co": "NL",
            "dn": 1,
            "dt": "2021-06-07",
            "is": "Ministry of Health Welfare and Sport",
            "ma": "ORG-100001417",
            "mp": "EU/1/20/1525",
            "sd": 1,
            "tg": "840539006",
            "vp": "J07BX03"
        }
    ],
    "ver": "1.3.0"
}

Thus, the only thing we really have to do is decide how to effectively shorten the field names as used in the wasteful CVR CDF spec for a single ballot entry, and not the entire set of CVRs for a tabulator or an election.
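As a rough way to measure what field-name shortening buys, the sketch below serializes one hypothetical contest fragment with long and short keys and prints the sizes before and after zlib compression. The long names are modelled on CVR CDF style but should be treated as illustrative, and the short names in `KEY_MAP` are invented for this comparison.

```python
import json
import zlib

# Hypothetical single-contest fragment; not taken from a real CVR export.
long_form = {
    "CVRContest": [
        {
            "ContestId": "c1",
            "CVRContestSelection": [
                {
                    "ContestSelectionId": "s2",
                    "SelectionPosition": [{"HasIndication": "yes", "NumberVotes": 1}],
                }
            ],
        }
    ]
}

# Invented short names, chosen only for illustration.
KEY_MAP = {
    "CVRContest": "cc",
    "ContestId": "ci",
    "CVRContestSelection": "cs",
    "ContestSelectionId": "si",
    "SelectionPosition": "sp",
    "HasIndication": "hi",
    "NumberVotes": "nv",
}


def shorten(value):
    """Recursively rename dictionary keys according to KEY_MAP."""
    if isinstance(value, dict):
        return {KEY_MAP.get(k, k): shorten(v) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten(v) for v in value]
    return value


def compact(obj):
    """Serialize to JSON without insignificant whitespace."""
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


long_bytes = compact(long_form)
short_bytes = compact(shorten(long_form))
print("plain:", len(long_bytes), len(short_bytes))
print("zlib: ", len(zlib.compress(long_bytes, 9)), len(zlib.compress(short_bytes, 9)))
```

On a payload this small, compression overhead can outweigh its savings, so the compressed figures are only meaningful on a full ballot's worth of contests.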

Maybe it was in the presentation but the document does not have an easily referenced example.

--> Please provide an example I can work with. Please encode your way and I will encode this way to compare.

CBOR is heavily used now, and supports a superset of JSON capabilities. This spec also includes a method of signing the data, which is missing from your current proposal.

--Ray

Carl Hage

Aug 2, 2022, 9:39:51 PM
to cdf-ball...@list.nist.gov
I agree with Ray that the proposed mCDF is too large and requires more
complex software than necessary:

On 5/10/22 10:39 AM, Ray Lutz wrote:
...
> ...  A flat
> realization may be much more thrifty, where a hex value identifies the
> contests included on the ballot (i.e. the ballot style), and then bits
> indicating whether options are selected...
> This portion is encoded to 313 characters, which is still not terse
> enough! The "compressed flat" approach, which is also a direct
> conversion of the CVR data, takes only 21 (or 14) characters for this
> portion.

To represent contest selections, I do not see the need for a complex
format that can represent arbitrary data structures. Besides being
unnecessarily large, this requires substantially more software (10-100x)
to encode and decode vs a simple custom format specific to representing
contest selection capture (CSC).

Ray mentions a simple "flat" format that is binary formatted. This
proposal is better than the HL7-style proposal both in size and software
requirements. Other compact and efficient alternatives can be invented.

I am attaching another example format that is substantially more compact
than HL7, one character per possible selection, also human-readable.
[It's based on a CVR format I used in some in-progress open-source
ballot counting software.]

The format described below may be of no interest to you, so in that case
it's just an example of why I think the HL7 mCDF is excessively complex,
both in size, readability, and software required.

If there is interest, I could create a prototype HTML+Javascript sample
ballot with QR code generator.

The following is a markdown description of the alternative format, and a
PDF version is attached.

You can get a good idea of the format from the "Summary and Example"
section on the first page.

----

# Simple Contest Selection Capture

Described below is a proposal for a simple and compact format for
representing contest selections on a ballot, optimized for
representation as a QR code. The QR code could be used to pre-load a
ballot marking/capture device prepared in advance using a home computer,
or could be printed on a ballot by a BMD (Ballot Marking Device) as a
check of optical scan or as primary mark with optical scan as an
integrity check.

The goals of this format include:
* Compact representation to minimize QR size
* Easy to encode/decode with minimal straightforward software
* Human readable/comprehensible
* Core is usable as an efficient CVR format
* Simple representation that can also act as internal CVR data structure

The Contest Selection Capture (CSC) must be associated with a ballot
definition file that contains a definition of the contests and, for each,
the type of voting and definition of selections. The particular format of
the ballot definition is separate from this proposal; the header of the CSC
data must identify the ballot definition, retrieved separately. We also
assume the ballot style is defined in the referenced external file (one
per style) with a set ordered list of contests, each with a set ordered
list of selections/candidates.

To enable compact representation as a QR code we restrict the data to be
the base45 character set, A-Z, 0-9, space, and `$%*+-./:`

We assume there are `num_contests` and within each contest we have
`num_selections` (choices) and `num_votes` the maximum number of
selections to be made for the contest.

## Summary and Example

The CSC data consists of the following sections described in detail below:
```
{header}{body}{write-ins}
```
For the most common case, the body has 1 character for each contest
(vote for 1) or 1 character per vote (rank up to n or vote for no more
than n). The header identifies the jurisdiction and can be used to
retrieve and validate the ballot definition.

An extended example using the sample ballot distributed on the NIST
ballot CDF working group:

```
HTTPS://CSC.OHIOSOS.GOV/F315D808/++CBABABBDE BAAABA BAA B A JANE DOE+JOHN SMITH
```
This data can be base45 encoded in a 41x41 pixel QR code with the
highest (33%) level of error correction.

This example includes a RCV contest with 2 write-in candidates. The
body, `++CBABABBDE BAAABA BAA B A `, indicates the selections made,
followed by `+` separated write-in names `JANE DOE+JOHN SMITH`.

The letters `A`, `B`, `C`, ... represent selections 1, 2, 3, ..., space
represents no selection, and `+` represents a write-in selection. There
are 3 characters for the RCV contest (rank up to 3) and the "vote for no
more than 3" contest, otherwise a single character per contest.

The header, `HTTPS://CSC.OHIOSOS.GOV/F315D808` can function simply as
identification in the case the ballot definition data is available in a
local file (the usual case), or can function as a URL to retrieve the
ballot definition file. The first 8 hex digits of the SHA256 sum of the
ballot definition serve as an ID as well as a verification check.
Rather than include coding for the election date, election
administration, ballot style, CSC format, etc. in the CSC, we rely on
this content within the ballot definition. In this example, we assume
the Secretary of State collects and archives ballot definition files for
all jurisdictions in the state, and also ensures there are no hash
collisions (a 1 in 4 billion chance between any 2).
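A minimal sketch of how such a header could be built and checked, assuming the ballot definition is available as raw bytes. The function names are hypothetical, and the base URL is the one from the example above.

```python
import hashlib


def ballot_definition_id(definition):
    """First 8 hex digits (32 bits) of the SHA256 of the definition bytes."""
    return hashlib.sha256(definition).hexdigest()[:8].upper()


def make_header(base_url, definition):
    """Header of the form BASE/ID/ that precedes the selections body."""
    return f"{base_url}/{ballot_definition_id(definition)}/"


def verify_header(header, definition):
    """True when the ID in the header matches the local ballot definition."""
    claimed = header.rstrip("/").rsplit("/", 1)[-1]
    return claimed == ballot_definition_id(definition)


definition = b"abc"  # placeholder bytes standing in for a real definition file
header = make_header("HTTPS://CSC.OHIOSOS.GOV", definition)
print(header)
```

A decoder holding the ballot definition locally only needs `verify_header`; nothing has to be fetched from the URL for the check to work.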

Representing the example above with one write-in in the RCV contest
(`HTTPS://CSC.OHIOSOS.GOV/F315D808/+CABABABBDE BAAABA BAA B A JANE DOE`)
in the proposed HL7-style mCDF requires 2 QR codes, split into:

```
NS1|^~&;CSC|1|1|;ELE|ocd-division/country:us/state:oh/county:summit&4|20141104;CBK|052001|http://go.usa.gov/Tla9;SEL|1GO|1AEF^^^3~1AAR^^^2~1AWI^^^1^^^^JANE DOE;SEL|2AG|1BMD;SEL|3AS|1CBB;SEL|4SS|1DKK;SEL|5TS|1ECP;SEL|6RC|1FMF;SEL|8SR34|1GES;DSC|123;

NS1|^~&;CSC|1|1|123;SEL|9CC|1HJD~1HGH;SEL|10SB|1IMC;SEL|11JS1|1JTL;SEL|11JS2|1KJO;SEL|12CA9|1LTL;SEL|13CP1|1MRC;SEL|14CP2|1NTG;SEL|18CP6|1RRM;SEL|19CP7|1SJO;SEL|20CP8|1TLT;SEL|22CP10|1VKO;SEL|24CA1|2DY;
```
The HL7-style format requires 248+202 (450) bytes in 2 QR codes, 61x61
or 81x81 QR for 15/33% correction (for the first QR).

The proposed SCSC (with 1 write-in RCV) equivalent is 70 bytes, 33x33 or
41x41 QR for 15/33% correction.


## Selections Body Format

### Usual vote-for-one choice format

For a vote-for-one voting method (1-of-m, approval, plurality, etc. with
one choice) and 26 or fewer selections, the body will consist of
`num_contests` columns (1..`num_contests`), each a character
representing the selection. The character in each column will be:

* `A-Z` for choice 1..`num_selections`
* space for no/blank vote
* `-` reserved for intentional "no choice made" (abstain)
* `+` indicates a write-in vote with name in a CSC data suffix
* `%` indicates a repeat of the prior write-in (for voting methods
that allow a selection to be repeated)

When the body format is used in a CVR of a scanned ballot we have some
additional possible characters reserved:

* `*` for an overvote
* `.` for an ambiguous mark, no identified selection
* `a-z` for an ambiguous selection requiring adjudication to confirm

Note write-in is inherently ambiguous unless created from text with a
BMD. If the `+` selection is ambiguous we can use the `.`, but then we
cannot add a name. We could reserve the `$` as a code for ambiguous
write-in selection.

A scanner might detect a selected write-in choice, but not try to
read a handwritten text area. If the end of the record is reached, then
we assume the write-in name is undetermined (could be blank or text to
be identified).
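The character table for a single column can be sketched as a small decoder. This assumes a contest with 26 or fewer selections given as an ordered list; the function name `decode_mark` and the result keys are hypothetical.

```python
SPECIAL_MARKS = {
    " ": "blank",
    "-": "abstain",
    "+": "write-in",
    "%": "repeat-write-in",
    "*": "overvote",
    ".": "ambiguous-mark",
}


def decode_mark(ch, selections):
    """Interpret one body character for a contest with 26 or fewer selections."""
    if len(ch) == 1 and ch.isascii() and ch.isalpha():
        index = ord(ch.upper()) - ord("A")
        if index >= len(selections):
            raise ValueError(f"{ch!r} is outside the contest's selection list")
        # Lowercase letters mark a selection that still needs adjudication.
        return {"kind": "selection", "selection": selections[index],
                "ambiguous": ch.islower()}
    if ch in SPECIAL_MARKS:
        return {"kind": SPECIAL_MARKS[ch], "selection": None,
                "ambiguous": ch == "."}
    raise ValueError(f"unexpected character {ch!r}")


selections = ["Smith", "Jones", "Lee"]  # hypothetical contest
print(decode_mark("B", selections))
```

The suggested `$` code for an ambiguous write-in is left out because the text only floats it as a possibility.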

### Write-in selections

In many jurisdictions, candidates who wish to be eligible as a
write-in candidate must file prior to the election. The electronic
ballot definition could be updated prior to the election to include
write-in candidates, so the contest definition would have a set of names
that appear on the ballot, and an additional set of names (following
names actually appearing on the ballot) defined. A BMD or web page used
to create the CSC data could allow a pull down selection in the write-in
area of the ballot with names of eligible write-in candidates. In that
case we simply extend the selection list so `num_selections` includes
write-ins, e.g. `A-D` might represent candidates appearing on a ballot,
and `E-F` eligible write-in candidates not on the ballot.

If the selection is a write-in not in the ballot definition file, we
use the `+` character, then take the name in a suffix following the body
columns. Each `+` indicates the next name in the suffix, where a name is
terminated by a `+` character or end-of-data. (A `+` character is used
as a name separator in the `{write-ins}` suffix.)

We can use a `%` character to repeat the prior `+` write-in to
support voting methods that allow a candidate to be selected more than once.
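The pairing of `+` and `%` marks with names in the suffix can be sketched as follows. It assumes single-character columns and that the body and the suffix have already been separated using the column count from the ballot definition; `resolve_write_ins` is a hypothetical name.

```python
def resolve_write_ins(body, suffix):
    """Map each write-in position in the body to its name from the suffix.

    A name of None means the suffix ended before a name was supplied,
    i.e. the write-in is undetermined.
    """
    names = suffix.split("+") if suffix else []
    resolved = {}
    next_name = 0
    last_name = None
    for position, ch in enumerate(body):
        if ch == "+":
            last_name = names[next_name] if next_name < len(names) else None
            next_name += 1
            resolved[position] = last_name
        elif ch == "%":
            # Repeat of the prior write-in.
            resolved[position] = last_name
    return resolved


print(resolve_write_ins("++C", "JANE DOE+JOHN SMITH"))
```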

### More than 26 choices

If the `num_selections` is greater than 26, we use `AA-AZ` for
choices 1-26, `BA-BZ` for 27-52, etc. In this case the columns are 2
characters. The non-alphabetic characters are doubled (e.g. `++`) so that
every selection occupies 2 characters.
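The one- and two-letter codes map to 1-based choice numbers as in this sketch; the function names are hypothetical.

```python
A = ord("A")


def selection_code(index, width):
    """Code for a 1-based choice number: 'A'..'Z' or 'AA'..'ZZ'."""
    limit = 26 if width == 1 else 26 * 26
    if width not in (1, 2) or not 1 <= index <= limit:
        raise ValueError("choice number out of range for this column width")
    if width == 1:
        return chr(A + index - 1)
    high, low = divmod(index - 1, 26)
    return chr(A + high) + chr(A + low)


def selection_index(code):
    """Inverse of selection_code: 1-based choice number for a code."""
    if len(code) == 1:
        return ord(code) - A + 1
    return (ord(code[0]) - A) * 26 + (ord(code[1]) - A) + 1


print(selection_code(27, 2), selection_index("BZ"))
```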

### `n-of-m` voting

When we have n-of-m voting with `n`>1 (e.g. vote for no more than 3),
we will have `n` characters (or `2n` for more than 26 selections) to
represent the column. To implement software to encode or decode the CSC
body, we save the starting character index in the CSC body, then use
that to identify the location of the `n` or `2n` characters for the contest.
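The starting-index bookkeeping can be sketched as below, assuming the ballot definition supplies `(num_selections, num_votes)` for each contest in ballot order. The function names are hypothetical, and range or proportional contests, which use a different width rule, are not covered.

```python
def column_layout(contests):
    """Return (start, length) of each contest's characters in the body.

    `contests` is an ordered list of (num_selections, num_votes) pairs.
    """
    layout = []
    start = 0
    for num_selections, num_votes in contests:
        width = 1 if num_selections <= 26 else 2  # characters per vote
        length = num_votes * width
        layout.append((start, length))
        start += length
    return layout


def split_body(body, contests):
    """Cut the body into one string per contest."""
    return [body[start:start + length] for start, length in column_layout(contests)]


# Hypothetical style: rank up to 3 of 3, vote for 1 of 4, vote for 2 of 30.
contests = [(3, 3), (4, 1), (30, 2)]
print(split_body("++CBAABC", contests))
```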

### Ranked Choice, Cumulative, Borda Voting

Ranked Choice or Borda voting is the same as n-of-m voting except the
order of appearance indicates rank 1-`n`. In this case `n` is the
maximum selections to be ranked. Cumulative voting is the same as n-of-m
except a selection can be repeated. We have `n` or `2n` characters for
the contest, same as n-of-m.

### Proportional/Range Voting

For a range or proportional voting system, the voter indicates a
range `0-9` or `0-99` (percentage) assigned to each possible selection,
and we use `num_selections` or `2*num_selections` characters of `0-9` or
`00-99`, one per possible selection. The ballot definition file must
indicate the number of digits in the range/proportion and the number of
additional write-ins allowed. If the last `num_writeins` column is
non-blank, then the name is obtained from the `{write-ins}` suffix.

If a percentage or range can be assigned to only `n` of `m` choices
and `2n<m` we can use pairs of (selection,range), with the first
character indicating `A-Z` (or 2 characters `AA-ZZ`) for a selection,
followed by the range/proportion of `0-9` or `00-99`. The ballot
definition file must indicate if the compact paired representation is used.
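
The two range layouts could be decoded roughly as follows. This sketch assumes every position holds digits (no blank values) and that the paired form uses single-letter selection codes; both function names are hypothetical.

```python
def decode_range_column(column: str, digits: int) -> list[int]:
    """One fixed-width value per selection, in ballot-definition order."""
    if len(column) % digits != 0:
        raise ValueError("column length is not a multiple of the digit width")
    return [int(column[i:i + digits]) for i in range(0, len(column), digits)]


def decode_range_pairs(column: str, digits: int) -> dict[str, int]:
    """Compact form: repeated (selection letter, value) pairs."""
    step = 1 + digits
    if len(column) % step != 0:
        raise ValueError("column length is not a multiple of the pair width")
    return {
        column[i]: int(column[i + 1:i + step])
        for i in range(0, len(column), step)
    }
```

With 1-digit values, `decode_range_column("0730", 1)` reads four selections, while `decode_range_pairs("B7D3", 1)` carries the same two non-zero values in paired form.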

## Header Format

At minimum, we need to identify the ballot definition data associated
with the CSC data. We can use a portion of the SHA256 checksum of the
ballot definition data as both identifier and error check. We
arbitrarily choose the first 32 bits of the SHA256. We could represent
this with 6 base45 encoded characters, and with a prefix identifying the
format and version, we might have:
`CSA*WU:DR`, where `CS` is a format prefix, `A` is the version of
that format, and `*WU:DR` is the base45 encoding of `F315D808`. However,
it seems not worth the savings of 2 characters vs the more human
readable hex, so `CSAF315D808` would be better.
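
To make the size comparison concrete, here is a sketch that derives both forms of the ID from ballot definition bytes. Python has no base45 encoder in its standard library, so a small one following the usual base45 rules (2 bytes to 3 characters) is written out; `definition_ids` is a hypothetical helper name.

```python
import hashlib

BASE45_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def base45_encode(data: bytes) -> str:
    """Encode bytes as base45: each byte pair becomes 3 characters."""
    out = []
    for i in range(0, len(data) - 1, 2):
        value = data[i] * 256 + data[i + 1]
        low, rest = value % 45, value // 45
        mid, high = rest % 45, rest // 45
        out.append(BASE45_ALPHABET[low] + BASE45_ALPHABET[mid] + BASE45_ALPHABET[high])
    if len(data) % 2:                              # trailing single byte: 2 chars
        value = data[-1]
        out.append(BASE45_ALPHABET[value % 45] + BASE45_ALPHABET[value // 45])
    return "".join(out)


def definition_ids(ballot_definition: bytes) -> tuple[str, str]:
    """Return (hex, base45) forms of the first 32 bits of the SHA256."""
    prefix = hashlib.sha256(ballot_definition).digest()[:4]
    return prefix.hex().upper(), base45_encode(prefix)


# The 4 bytes F315D808 from the text encode to the 6 characters *WU:DR.
assert base45_encode(bytes.fromhex("F315D808")) == "*WU:DR"
```

The hex form is always 8 characters and the base45 form 6, which is the 2-character difference discussed above.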

### URL-Based Header

With a simple format and ballot definition ID we rely on the CSC
being combined with pre-stored ballot definition data (for all valid
ballot styles). We can use a URL as a prefix/header that can serve as
both an identification of the jurisdiction and scope of the ballot
definition ID, plus serve as a means to retrieve the ballot definition
data for generic validation apps independent of a vote center or
election admin web site.

In the example above we use `HTTPS://CSC.OHIOSOS.GOV/` to identify
the jurisdiction (`OHIOSOS.GOV`) and use the subdomain `CSC` in lieu of
`WWW` to select the CSC definition app. The full URL prefix
`HTTPS://CSC.OHIOSOS.GOV/F315D808` retrieves the ballot definition data
associated with the sample CSC.

We assume the rightmost `/` in the QR coded CSC data separates the
header from CSC body and write-in names. We assume the second rightmost
`/` separates the SHA256 prefix from the header prefix. Each election
admin could format the URL with additional content following the domain
name, e.g. a 2 digit year and letter sequence election ID, county
subcode, precinct, etc. To allow a QR code to work with software
independent of the URL requirements, we assume the SHA prefix is on the
right end of the header.
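
The two "rightmost `/`" assumptions amount to splitting from the right at most twice. A minimal sketch, assuming neither the body nor the write-in names contain a `/` (the name `split_csc` is illustrative):

```python
def split_csc(qr_text: str) -> tuple[str, str, str]:
    """Split QR text into (URL prefix, SHA256 prefix, body plus write-ins)."""
    parts = qr_text.rsplit("/", 2)
    if len(parts) != 3:
        raise ValueError("expected at least two '/' separators")
    url_prefix, definition_id, payload = parts
    return url_prefix, definition_id, payload


url_prefix, definition_id, payload = split_csc(
    "HTTPS://CSC.OHIOSOS.GOV/0520AC/F315D808/ BABABBDE BAAABA BAA B A"
)
# url_prefix + "/" + definition_id is the URL for the ballot definition data.
```

Because the split works from the right, any extra path content an election admin adds after the domain name stays in `url_prefix` and does not affect generic software.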

### BMD Output

A BMD can produce a paper printout of a ballot that has human
readable and machine scannable text. A typical blank ballot includes a
preprinted precinct ID and bar coded or timing mark coded machine
readable precinct ID.

The BMD can print a QR coded CSC as a check for scanned human
readable text and marks or vice versa. The CSC QR could be combined with
the machine readable precinct ID markings to extract the CVR.
Alternatively, the CSC header could be extended to include a precinct
and BMD machine ID, e.g.
```
HTTPS://CSC.OHIOSOS.GOV/0520AC/F315D808/ BABABBDE BAAABA BAA B A
```

Here, the precinct `0520` is combined with the BMD ID `AC`.

## Ballot Definition File

The format of the ballot definition data content is separate from
the scope of this document, but the supported format must be
identifiable from the first part of the data. Given below is a
discussion of possible formatting and what content needs to be included.


The ballot definition file could be CSV, XML, JSON, or even an HTML
representation of the ballot using standardized attributes and classes
to hold the SCSC format version assumed. Attributes per candidate/choice
can hold the response codes, e.g. `B` for second candidate. Attributes on
a contest would indicate the voting type and number of votes per
contest, and starting character index in the body per contest.
Standardized class names could identify the HTML element corresponding
to a contest, and selection.

The ballot definition should include the following:

* Identification of the standard and version of the formatting of the
ballot definition data as well as formatting of the associated CSC.

* Identification of the ballot style (set of contests within the
election).

* Total number of characters in the CSC body. This could be computed,
but an explicit number could be used as a verification.

* Number of selections (candidates) for each contest. This could be
derived by counting candidates.

* Number of additional write-in blanks allowed for a contest.

* Order of contests within the CSC. This could be implied if the
contests within the ballot definition are ordered. In the case of
Proportional/Range Voting the number of digits must be indicated, and
whether n (selection,value) pairs are used or there is one value per selection.

* For each contest, the voting style, and number of votes/ranks allowed.

* The starting character index for a contest in the CSC body. This
could be derived from the contest order, voting type, and number of
votes, plus max number of candidates (in case there are more than 26
choices).

* For each choice within a contest, the choice value `A-Z` or
`AA-ZZ`. This could be implied by the candidate order, however, ballot
counting could be simplified by using a standard order independent of
all candidate rotations across all ballot styles containing a contest.

* The name and ID (used by the election admin) for each contest
required to associate the contest within other election files. The ID of
the geographic area associated with the contest.

* The name and/or ID (used by the election admin) for each selection,
used to associate the choice within other election files.

* Optional (but helpful) definitions of the election date, election
administration identification and geographic area limiting the scope of
voters.

* Precinct identification if used on a QR code for a BMD printout.
Alternatively, the precinct ID (reporting unit) could be an addition to
the URL prefix, e.g. `HTTPS://CSC.OHIOSOS.GOV/0520/F315D808/ BABABBDE
BAAABA BAA B A `. The ballot definition file would indicate the number
of characters in a precinct ID and number of characters in a machine ID
used to create the QR code (if desired).

For contests that span multiple election administrations, e.g. a school
district crossing county boundaries, we typically have results reported
separately within each county. To determine the combined total, we would
need to associate the contest IDs and selection IDs across election
administrations. For a state senator, the geographic area ID is
sufficient to match a contest, but election admins routinely spell
district names and office titles differently, making name matching
difficult. Election admins also sometimes spell candidate/choice names
differently, though matching last names is generally reliable.

### Versioning of Ballot Definition Files

Throughout an election cycle, the candidate list might be updated, e.g.
qualified write-in candidates might be added. Any time the ballot
definition file is altered, its SHA256 hash will change. When an update
is made, the prior SHA256 hash values can be included in the updated
file, with backwards compatibility definitions. This way, a QR code
generated from a prior version can be used with the updated definition.

Prior versions could be of 2 types:

* Backwards compatible additions. New candidates (write-ins) are
appended without changes to the prior `A`, `B`, ... letter representations.

* Incompatible changes. A contest or candidate appearing on the
ballot could be inserted/deleted, the number of rankings could change,
etc. In this case, we need a mapping file to convert the prior contest
starting column to new starting column (if contests change or number of
ranks/votes per contest changes), and for each candidate, the prior
letter to new letter.
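
A mapping for an incompatible change might be applied along these lines. The mapping layout (per-contest `old_start`, `new_start`, `width`, and a `letters` table) is an assumption made for this sketch, which also assumes single-character columns and an unchanged width per contest.

```python
def remap_body(old_body: str, mapping: list[dict], new_length: int) -> str:
    """Rewrite a CSC body created under a prior ballot definition."""
    new_body = [" "] * new_length                 # unmapped columns stay blank
    for contest in mapping:
        old_start = contest["old_start"]
        old_column = old_body[old_start:old_start + contest["width"]]
        for offset, ch in enumerate(old_column):
            # Letters are translated; blank, '*', '+', '%' pass through.
            new_body[contest["new_start"] + offset] = contest["letters"].get(ch, ch)
    return "".join(new_body)


# Hypothetical update: a candidate was inserted as the new 'B' in contest 1,
# and a new contest was inserted at column 1.
mapping = [
    {"old_start": 0, "new_start": 0, "width": 1, "letters": {"A": "A", "B": "C"}},
    {"old_start": 1, "new_start": 2, "width": 1, "letters": {}},
]
```

Here `remap_body("BA", mapping, 3)` moves the second contest to column 2 and leaves the newly inserted contest blank.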

## Contest Selection Capture within a CVR (Cast Vote Record)

The above proposed format could be used as a Cast Vote Record, however
the header for a CVR record should contain different information. We
assume all CVR data collected applies to a particular election and
election admin, so this common content can be omitted.

The ballot style can be used to identify the ballot definition data
assumed for the CSC body, so a SHA256 prefix is redundant with the
ballot style.

A CVR header should include the following:

* Capture Device ID (scanner or BMD ID)
* Batch ID (collection, e.g. box of ballots)
* CVR/Ballot ID (sequence/imprinted ID)
* Reporting Group ID (e.g. election day, vote-by-mail)
* Reporting Unit ID (consolidated precinct)
* Ballot style
* Card/Page ID (for separately scanned sheets)
* Record Status/Adjudication level

The results may be subtotaled in independent batches of ballots
processed rather than combined together. We can include the batch ID to
facilitate subtotaling as well as identify the location (box) containing
a ballot.

Scanners can identify ambiguous marks requiring adjudication. We can
include the original CVR as well as corrected values. We can use a sort
to order all CVRs for easy processing, then place the most recent
adjudicated value before prior versions (latest first to easily identify
the latest and skip prior). So we could use `5` for a normal scanned
output, `2` for modified data, `0` for final adjudicated value, `8` for
record created during an audit.
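
The sort-based selection can be sketched as below, assuming each CVR is a (ballot ID, record status, body) tuple and that a plain ascending sort on the status code puts the preferred record first. The tuple layout is illustrative, not a proposed format.

```python
from itertools import groupby


def latest_records(records):
    """Keep the first record per ballot after sorting by (ballot ID, status)."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    # Within each ballot, the lowest status code sorts first.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]


records = [
    ("0001", "5", "BA A"),   # normal scanned output
    ("0002", "5", "AAB "),
    ("0001", "0", "BABA"),   # final adjudicated value for ballot 0001
]
current = latest_records(records)
```

Ballot `0001` resolves to its status `0` record, while ballot `0002`, which was never adjudicated, keeps its scanned record.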

The record status might also be used to indicate a reassignment of
the group/precinct/style/card when a voter submits a provisional ballot
at the wrong precinct, possibly with a different ballot style, so the
wrong precinct CVR may be recast as a corrected CVR with a different
ballot style and precinct assignment.

The above remarks are not intended as a specific compact CVR format
specification, rather to give an idea of how the Simple CSC body can be
used for CVR capture and tabulation.

## Digital Signatures

The SHA256 prefix is suitable as a verification check but not a
cryptographically secure ID. We could use a separate digital signature
QR code on a BMD printout to authenticate the CSC as having been printed
by a particular authorized BMD. Each BMD could have a private key
specific to an election, with certificate issued by the election admin.
At the close of the election, the secret key can be erased.

To sign a CSC, we need to combine the ballot definition data with the
CSC data, and sign the concatenated data. The full signature then
authenticates the QR coded data with the ballot definition used to
create and interpret the QR CSC content.

A digital signature is around 300 bytes, so might require a 69x69 QR
code at error correction level M (15%) or 89x89 at level H (30%).

## Formatting Alternatives

In the proposal above, blank, overvote, and undetermined write-in are
represented by special characters, '` `', `*`, and `+`. There are some
advantages to reserve `A` for blank vote, `B` for overvote, and assign a
letter for a fill-in write-in vote, e.g. `C`, `D`, `E`, are candidates
1, 2, 3, `F` is a write-in with a fill-in-the-blank for name, `G` might
be assigned to a qualified write-in candidate to be counted. When used
for a scanned CVR body, a lower case letter can represent an ambiguous
reading.

If we store a CVR this way, and compare adjudicated versions, we can
format a difference with blank being the same and a character an updated
value.

### Compact Binary Format

Rather than use a full character `A`, `B`, ... to represent a selection
(and under/overvote) we could use a variable width binary encoding, e.g.
3 bits for a typical contest (7 or fewer choices including write-in plus
undervote), 2 bits for a yes/no ballot measure. The bits would be packed
into a byte stream. The resulting binary data could be base45 encoded.
The sample ballot mentioned above would require seven 3-bit fields and
twenty 2-bit fields for a total of 61 bits, or 8 bytes, or 12 QR code characters.
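
The size arithmetic and a simple packer are sketched below, assuming fields are packed most-significant-bit first and the final byte is zero padded. The function names and the packing order are assumptions, since the proposal does not fix them.

```python
import math


def packed_size(widths: list[int]) -> tuple[int, int, int]:
    """Return (bits, bytes, base45 characters) for a list of field widths."""
    total_bits = sum(widths)
    n_bytes = math.ceil(total_bits / 8)
    # base45: 3 characters per byte pair, 2 for a trailing single byte.
    n_chars = 3 * (n_bytes // 2) + 2 * (n_bytes % 2)
    return total_bits, n_bytes, n_chars


def pack_fields(values: list[int], widths: list[int]) -> bytes:
    """Pack variable-width fields into bytes, MSB first, zero padded."""
    acc = 0
    n_bits = 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError("value does not fit in its field")
        acc = (acc << width) | value
        n_bits += width
    padding = (-n_bits) % 8
    return (acc << padding).to_bytes((n_bits + padding) // 8, "big")


sample_widths = [3] * 7 + [2] * 20              # 7 contests, 20 yes/no measures
assert packed_size(sample_widths) == (61, 8, 12)
```

Even this small packer needs per-field width bookkeeping on both the encode and decode side, which is the extra software cost weighed against the shorter code.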

The reduction from 27 to 12 characters is greater than 50%, but requires
much more software to pack and unpack the binary format and the
selections are not directly human readable.

scsc.pdf