I agree with Ray that the proposed mCDF is too large and requires more
complex software than necessary:
On 5/10/22 10:39 AM, Ray Lutz wrote:
...
> ... A flat
> realization may be much more thrifty, where a hex value identifies the
> contests included on the ballot (i.e. the ballot style), and then bits
> indicating whether options are selected...
> This portion is encoded to 313 characters, which is still not terse
> enough! The "compressed flat" approach, which is also a direct
> conversion of the CVR data, takes only 21 (or 14) characters for this
> portion.
To represent contest selections, I do not see the need for a complex
format that can represent arbitrary data structures. Besides being
unnecessarily large, this requires substantially more software (10-100x)
to encode and decode vs a simple custom format specific to representing
contest selection capture (CSC).
Ray mentions a simple "flat" format that is binary formatted. This
proposal is better than the HL7-style proposal both in size and software
requirements. Other compact and efficient alternatives can be invented.
I am attaching another example format that is substantially more compact
than HL7, one character per possible selection, also human-readable.
[It's based on a CVR format I used in some in-progress open-source
ballot counting software.]
The format described below may be of no interest to you; in that case,
treat it simply as an example of why I think the HL7 mCDF is excessively
complex in size, readability, and software required.
If there is interest, I could create a prototype HTML+Javascript sample
ballot with QR code generator.
The following is a markdown description of the alternative format, and a
PDF version is attached.
You can get a good idea of the format from the "Summary and Example"
section on the first page.
----
# Simple Contest Selection Capture
Described below is a proposal for a simple and compact format for
representing contest selections on a ballot, optimized for
representation as a QR code. The QR code could be used to pre-load a
ballot marking/capture device prepared in advance using a home computer,
or could be printed on a ballot by a BMD (Ballot Marking Device) as a
check of optical scan or as primary mark with optical scan as an
integrity check.
The goals of this format include:
* Compact representation to minimize QR size
* Easy to encode/decode with minimal straightforward software
* Human readable/comprehensible
* Core is usable as an efficient CVR format
* Simple representation that can also act as internal CVR data structure
The Contest Selection Capture (CSC) must be associated with a ballot
definition file that contains a definition of the contests and, for each
contest, the type of voting and the definition of its selections. The
particular format of the ballot definition is separate from this
proposal; the header of the CSC data must identify the ballot
definition, which is retrieved separately. We also assume the ballot
style is defined in the referenced external file (one per style) with a
fixed, ordered list of contests, each with a fixed, ordered list of
selections/candidates.
To enable compact representation as a QR code, we restrict the data to
the base45 character set: `A-Z`, `0-9`, space, and `$%*+-./:`
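
As a rough illustration of this restriction, the following Python sketch
checks that a CSC string stays within the 45 allowed characters. The
function name `invalid_chars` is made up for this example and is not
part of the proposal.

```python
# The 45 characters listed above: A-Z, 0-9, space, and $%*+-./:
ALLOWED = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")


def invalid_chars(csc: str) -> list:
    """Return the sorted characters in csc that fall outside the allowed set."""
    return sorted(set(csc) - ALLOWED)


print(invalid_chars("HTTPS://CSC.OHIOSOS.GOV/F315D808/+CAB"))  # []
print(invalid_chars("Jane Doe"))  # lower-case letters are reported
```

A write-in name typed in mixed case would therefore need to be
upper-cased (or otherwise mapped) before being placed in the QR code.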
We assume there are `num_contests` contests, and that each contest has
`num_selections` possible selections (choices) and `num_votes`, the
maximum number of selections that can be made for the contest.
## Summary and Example
The CSC data consists of the following sections described in detail below:
```
{header}{body}{write-ins}
```
For the most common case, the body has 1 character for each contest
(vote for 1) or 1 character per vote (rank up to n or vote for no more
than n). The header identifies the jurisdiction and can be used to
retrieve and validate the ballot definition.
An extended example using the sample ballot distributed on the NIST
ballot CDF working group:
```
HTTPS://CSC.OHIOSOS.GOV/F315D808/++CBABABBDE BAAABA BAA B A JANE DOE+JOHN SMITH
```
This data can be base45 encoded in a 41x41 pixel QR code with the
highest (33%) level of error correction.
This example includes an RCV contest with 2 write-in candidates. The
body, `++CBABABBDE BAAABA BAA B A `, indicates the selections made,
followed by the `+`-separated write-in names `JANE DOE+JOHN SMITH`.
The letters `A`, `B`, `C`, ... represent selections 1, 2, 3, ..., space
represents no selection, and `+` represents a write-in selection. There
are 3 characters for the RCV contest (rank up to 3) and the "vote for no
more than 3" contest, otherwise a single character per contest.
The header, `HTTPS://CSC.OHIOSOS.GOV/F315D808`, can function simply as
identification when the ballot definition data is available in a local
file (the usual case), or can function as a URL to retrieve the ballot
definition file. The first 8 hex digits of the SHA256 sum of the ballot
definition serve as an ID as well as a verification check.
Rather than include coding for the election date, election
administration, ballot style, CSC format, etc. in the CSC, we rely on
this content being within the ballot definition. In this example, we
assume the Secretary of State collects and archives ballot definition
files for all jurisdictions in the state, and also ensures there are no
hash collisions (a 1 in about 4.3 billion chance for any given pair).
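
A minimal sketch of how the ID could be derived and checked, assuming the
ballot definition is available as raw bytes. The function names
`definition_id` and `matches_header` are hypothetical; only
`hashlib.sha256` is a real library call.

```python
import hashlib


def definition_id(definition_bytes: bytes) -> str:
    """First 8 hex digits (32 bits) of the SHA256 digest, upper-cased for QR use."""
    return hashlib.sha256(definition_bytes).hexdigest()[:8].upper()


def matches_header(definition_bytes: bytes, header_id: str) -> bool:
    """Check a locally stored definition file against the ID found in a CSC header."""
    return definition_id(definition_bytes) == header_id.upper()


# Placeholder content; a real ballot definition file would be read from disk.
print(definition_id(b"abc"))  # BA7816BF
```

With 32 bits there are 2^32 (about 4.3 billion) possible IDs, which is
where the pairwise collision estimate comes from.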
The example above with one write-in RCV contest
(`HTTPS://CSC.OHIOSOS.GOV/F315D808/+CABABABBDE BAAABA BAA B A JANE DOE`)
requires 2 QR codes when represented in the proposed HL7-style mCDF,
split into:
```
NS1|^~&;CSC|1|1|;ELE|ocd-division/country:us/state:oh/county:summit&4|20141104;CBK|052001|http://go.usa.gov/Tla9;SEL|1GO|1AEF^^^3~1AAR^^^2~1AWI^^^1^^^^JANE DOE;SEL|2AG|1BMD;SEL|3AS|1CBB;SEL|4SS|1DKK;SEL|5TS|1ECP;SEL|6RC|1FMF;SEL|8SR34|1GES;DSC|123;
NS1|^~&;CSC|1|1|123;SEL|9CC|1HJD~1HGH;SEL|10SB|1IMC;SEL|11JS1|1JTL;SEL|11JS2|1KJO;SEL|12CA9|1LTL;SEL|13CP1|1MRC;SEL|14CP2|1NTG;SEL|18CP6|1RRM;SEL|19CP7|1SJO;SEL|20CP8|1TLT;SEL|22CP10|1VKO;SEL|24CA1|2DY;
```
The HL7-style format requires 248+202 (450) bytes in 2 QR codes, 61x61
or 81x81 QR for 15/33% correction (for the first QR).
The proposed SCSC (with 1 write-in RCV) equivalent is 70 bytes, 33x33 or
41x41 QR for 15/33% correction.
## Selections Body Format
### Usual vote-for-one choice format
For a vote-for-one voting method (1-of-m, approval, plurality, etc. with
one choice) and 26 or fewer selections, the body will consist of
`num_contests` columns (1..`num_contests`), each a character
representing the selection. The character in each column will be:
* `A-Z` for choice 1..`num_selections`
* space for no/blank vote
* `-` reserved for intentional "no choice made" (abstain)
* `+` indicates a write-in vote with name in a CSC data suffix
* `%` indicates a repeat of the prior write-in (for voting methods
that allow a selection to be repeated)
When the body format is used in a CVR of a scanned ballot we have some
additional possible characters reserved:
* `*` for an overvote
* `.` for an ambiguous mark, no identified selection
* `a-z` for an ambiguous selection requiring adjudication to confirm
Note that a write-in is inherently ambiguous unless created from text
with a BMD. If the `+` selection is ambiguous we can use `.`, but then
we cannot add a name. We could reserve `$` as a code for an ambiguous
write-in selection.
A scanner might detect a selected write-in choice but not try to read
the handwritten text area. If the end of the record is reached before a
name is found for a `+`, we assume the write-in name is undetermined
(it could be blank or text yet to be identified).
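
The column characters listed in this section can be interpreted with a
small lookup, sketched here in Python for the single-character case. The
names `SPECIAL` and `describe_column` are illustrative only, and `$` is
left unhandled because its use is only suggested above.

```python
# Meanings of the non-letter column characters listed in this section.
SPECIAL = {
    " ": "blank (no vote)",
    "-": "abstain",
    "+": "write-in, name in suffix",
    "%": "repeat of prior write-in",
    "*": "overvote",
    ".": "ambiguous mark",
}


def describe_column(ch: str) -> str:
    """Describe one single-character column of a CSC/CVR body."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    if "A" <= ch <= "Z":
        return f"selection {ord(ch) - ord('A') + 1}"
    if "a" <= ch <= "z":
        # Lower case is reserved for scanned CVRs needing adjudication.
        return f"selection {ord(ch) - ord('a') + 1}, needs adjudication"
    if ch in SPECIAL:
        return SPECIAL[ch]
    raise ValueError(f"unexpected character {ch!r}")


for ch in "B +*c":
    print(repr(ch), "->", describe_column(ch))
```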
### Write-in selections
In many jurisdictions, candidates who wish to be eligible as a
write-in candidate must file prior to the election. The electronic
ballot definition could be updated prior to the election to include
write-in candidates, so the contest definition would have a set of names
that appear on the ballot, and an additional set of names (following
names actually appearing on the ballot) defined. A BMD or web page used
to create the CSC data could allow a pull down selection in the write-in
area of the ballot with names of eligible write-in candidates. In that
case we simply extend the selection list so `num_selections` includes
write-ins, e.g. `A-D` might represent candidates appearing on a ballot,
and `E-F` eligible write-in candidates not on the ballot.
If the selection is a write-in not in the ballot definition file, we
use the `+` character, then take the name from a suffix following the
body columns. Each `+` in the body refers to the next name in the
suffix, where a name is terminated by a `+` character or end-of-data.
(That is, `+` also acts as the name separator in the `{write-ins}` suffix.)
We can use a `%` character to repeat the prior `+` write-in to
support voting methods that allow a candidate to be selected more than once.
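
The pairing of `+` and `%` columns with names in the suffix might look
like the following sketch. It assumes single-character columns and that
the body and suffix have already been separated; `resolve_write_ins` is
a hypothetical helper name.

```python
from typing import List, Optional, Tuple


def resolve_write_ins(body: str, suffix: str) -> List[Tuple[int, Optional[str]]]:
    """Return (body index, name) for every `+` or `%` column.

    A name of None means the suffix ran out, i.e. the write-in is undetermined.
    """
    names = suffix.split("+") if suffix else []
    resolved = []
    next_name = 0
    prior = None  # most recent `+` write-in, reused by `%`
    for index, ch in enumerate(body):
        if ch == "+":
            prior = names[next_name] if next_name < len(names) else None
            next_name += 1
            resolved.append((index, prior))
        elif ch == "%":
            resolved.append((index, prior))
    return resolved


print(resolve_write_ins("++CBABABBDE BAAABA BAA B A ", "JANE DOE+JOHN SMITH"))
# [(0, 'JANE DOE'), (1, 'JOHN SMITH')]
```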
### More than 26 choices
If `num_selections` is greater than 26, we use `AA-AZ` for
choices 1-26, `BA-BZ` for 27-52, etc. In this case the columns are 2
characters wide. The non-alphabetic characters are repeated so that they
also occupy 2 characters per selection.
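
The two-letter numbering can be computed directly, as in this small
sketch (the function names are invented for illustration):

```python
def choice_to_code(n: int) -> str:
    """Map choice number 1..676 to a two-letter code: 1 -> AA, 27 -> BA."""
    if not 1 <= n <= 26 * 26:
        raise ValueError("choice number out of range")
    first, second = divmod(n - 1, 26)
    return chr(ord("A") + first) + chr(ord("A") + second)


def code_to_choice(code: str) -> int:
    """Inverse of choice_to_code."""
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two letters A-Z")
    return (ord(code[0]) - ord("A")) * 26 + (ord(code[1]) - ord("A")) + 1


print(choice_to_code(26), choice_to_code(27), code_to_choice("BZ"))  # AZ BA 52
```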
### `n-of-m` voting
When we have n-of-m voting with `n`>1 (e.g. vote for no more than 3),
we will have `n` characters (or `2n` with more than 26 selections) to
represent the contest. To implement software to encode or decode the CSC
body, we save the starting character index of each contest in the CSC
body, then use that to locate the `n` or `2n` characters for the contest.
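
One way to derive the starting index of each contest is a running sum
over the contests in ballot order. This is a sketch under the assumption
that every contest uses the letter-per-vote layout (not the range voting
layout); `contest_offsets` and the sample numbers are made up.

```python
from typing import List, Tuple


def contest_offsets(contests: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Given (num_votes, num_selections) per contest, return (start, width) per contest."""
    offsets = []
    start = 0
    for num_votes, num_selections in contests:
        chars_per_vote = 2 if num_selections > 26 else 1
        width = num_votes * chars_per_vote
        offsets.append((start, width))
        start += width
    return offsets


# Rank-up-to-3 with 4 choices, two vote-for-one contests (one with 30 choices),
# then a vote-for-no-more-than-3 contest with 5 choices.
print(contest_offsets([(3, 4), (1, 2), (1, 30), (3, 5)]))
# [(0, 3), (3, 1), (4, 2), (6, 3)]
```

The characters for a contest are then `body[start:start + width]`.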
### Ranked Choice, Cumulative, Borda Voting
Ranked Choice or Borda voting is the same as n-of-m voting except that
the order of appearance indicates rank 1-`n`. In this case `n` is the
maximum number of selections to be ranked. Cumulative voting is the same
as n-of-m except that a selection can be repeated. We have `n` or `2n`
characters for the contest, the same as n-of-m.
### Proportional/Range Voting
For a range or proportional voting system, the voter indicates a
range `0-9` or `0-99` (percentage) assigned to each possible selection,
and we use `num_selections` or `2*num_selections` characters of `0-9` or
`00-99`, one per possible selection. The ballot definition file must
indicate the number of digits in the range/proportion and the number of
additional write-ins allowed. If any of the last `num_writeins` columns
is non-blank, then its name is obtained from the `{write-ins}` suffix.
If a percentage or range can be assigned to only `n` of `m` choices
and `2n<m` we can use pairs of (selection,range), with the first
character indicating `A-Z` (or 2 characters `AA-ZZ`) for a selection,
followed by the range/proportion of `0-9` or `00-99`. The ballot
definition file must indicate if the compact paired representation is used.
## Header Format
At minimum, we need to identify the ballot definition data associated
with the CSC data. We can use a portion of the SHA256 checksum of the
ballot definition data as both identifier and error check. We
arbitrarily choose the first 32 bits of the SHA256. We could represent
this with 6 base45 encoded characters, and with a prefix identifying the
format and version, we might have:
`CSA*WU:DR`, where `CS` is a format prefix, `A` is the version of
that format, and `*WU:DR` is the base45 encoding of `F315D808`. However,
the savings of 2 characters does not seem worth it compared with the
more human-readable hex, so `CSAF315D808` would be better.
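
The `*WU:DR` value can be reproduced with a short base45 encoder. The
sketch below follows the byte-pair scheme of RFC 9285 (each 2 bytes
become 3 characters, least significant digit first); the function name
is our own.

```python
BASE45_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def base45_encode(data: bytes) -> str:
    """Encode bytes as base45: 3 characters per byte pair, 2 for a trailing byte."""
    out = []
    for i in range(0, len(data) - 1, 2):
        value = data[i] * 256 + data[i + 1]
        value, c = divmod(value, 45)  # c: least significant digit
        e, d = divmod(value, 45)      # d: middle digit, e: most significant
        out += [BASE45_ALPHABET[c], BASE45_ALPHABET[d], BASE45_ALPHABET[e]]
    if len(data) % 2:
        d, c = divmod(data[-1], 45)
        out += [BASE45_ALPHABET[c], BASE45_ALPHABET[d]]
    return "".join(out)


print(base45_encode(bytes.fromhex("F315D808")))  # *WU:DR
```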
### URL-Based Header
With a simple format and ballot definition ID we rely on the CSC
being combined with pre-stored ballot definition data (for all valid
ballot styles). We can use a URL as a prefix/header that can serve as
both an identification of the jurisdiction and scope of the ballot
definition ID, plus serve as a means to retrieve the ballot definition
data for generic validation apps independent of a vote center or
election admin web site.
In the example above we use `HTTPS://CSC.OHIOSOS.GOV/` to identify
the jurisdiction (`OHIOSOS.GOV`) and use the subdomain `CSC` in lieu of
`WWW` to select the CSC definition app. The full URL prefix
`HTTPS://CSC.OHIOSOS.GOV/F315D808` retrieves the ballot definition data
associated with the sample CSC.
We assume the rightmost `/` in the QR coded CSC data separates the
header from CSC body and write-in names. We assume the second rightmost
`/` separates the SHA256 prefix from the header prefix. Each election
admin could format the URL with additional content following the domain
name, e.g. a 2 digit year and letter sequence election ID, county
subcode, precinct, etc. To allow a QR code to work with software
independent of the URL requirements, we assume the SHA prefix is on the
right end of the header.
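
The two "rightmost `/`" rules translate into a pair of right-hand splits.
In this sketch `split_csc` is a hypothetical name, and `body_length`
(27 here) is assumed to come from the ballot definition. It also assumes
write-in names never contain `/`.

```python
from typing import NamedTuple


class CscParts(NamedTuple):
    header: str      # everything before the rightmost "/"
    sha_prefix: str  # right end of the header, after the second rightmost "/"
    body: str
    write_ins: str


def split_csc(data: str, body_length: int) -> CscParts:
    """Split QR-decoded CSC data into header, SHA256 prefix, body, and write-ins."""
    header, sep, rest = data.rpartition("/")
    if not sep:
        raise ValueError("no '/' separating header from body")
    sha_prefix = header.rpartition("/")[2]
    return CscParts(header, sha_prefix, rest[:body_length], rest[body_length:])


sample = "HTTPS://CSC.OHIOSOS.GOV/F315D808/++CBABABBDE BAAABA BAA B A JANE DOE+JOHN SMITH"
parts = split_csc(sample, body_length=27)
print(parts.sha_prefix, repr(parts.body), parts.write_ins)
```

Because only the right end of the header is examined, extra path
segments after the domain name do not affect the split.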
### BMD Output
A BMD can produce a paper printout of a ballot that has human
readable and machine scannable text. A typical blank ballot includes a
preprinted precinct ID and bar coded or timing mark coded machine
readable precinct ID.
The BMD can print a QR coded CSC as a check for scanned human
readable text and marks or vice versa. The CSC QR could be combined with
the machine readable precinct ID markings to extract the CVR.
Alternatively, the CSC header could be extended to include a precinct
and BMD machine ID, e.g.
```
HTTPS://CSC.OHIOSOS.GOV/0520AC/F315D808/ BABABBDE BAAABA BAA B A
```
Here, the precinct `0520` is combined with the BMD ID `AC`.
## Ballot Definition File
The format of the ballot definition data content is separate from
the scope of this document, but the supported format must be
identifiable from the first part of the data. Given below is a
discussion of possible formatting and what content needs to be included.
The ballot definition file could be CSV, XML, JSON, or even an HTML
representation of the ballot using standardized attributes and classes
to hold the SCSC format version assumed. Attributes per candidate/choice
can hold the response codes, e.g. `B` for second candidate. Attributes on
a contest would indicate the voting type and number of votes per
contest, and starting character index in the body per contest.
Standardized class names could identify the HTML element corresponding
to a contest, and selection.
The ballot definition should include the following:
* Identification of the standard and version of the formatting of the
ballot definition data as well as formatting of the associated CSC.
* Identification of the ballot style (set of contests within the
election).
* Total number of characters in the CSC body. This could be computed,
but an explicit number could be used as a verification.
* Number of selections (candidates) for each contest. This could be
derived by counting candidates.
* Number of additional write-in blanks allowed for a contest.
* Order of contests within the CSC. This could be implied if the
contests within the ballot definition are ordered. In the case of
Proportional/Range Voting the number of digits must be indicated, and if
n (selection,value) pairs are used or there is one value per selection.
* For each contest, the voting style, and number of votes/ranks allowed.
* The starting character index for a contest in the CSC body. This
could be derived from the contest order, voting type, and number of
votes, plus max number of candidates (in case there are more than 26
choices).
* For each choice within a contest, the choice value `A-Z` or
`AA-ZZ`. This could be implied by the candidate order, however, ballot
counting could be simplified by using a standard order independent of
all candidate rotations across all ballot styles containing a contest.
* The name and ID (used by the election admin) for each contest
required to associate the contest within other election files. The ID of
the geographic area associated with the contest.
* The name and/or ID (used by the election admin) for each selection,
used to associate the choice within other election files.
* Optional (but helpful) definitions of the election date, election
administration identification and geographic area limiting the scope of
voters.
* Precinct identification if used on a QR code for a BMD printout.
Alternatively, the precinct ID (reporting unit) could be an addition to
the URL prefix, e.g.
`HTTPS://CSC.OHIOSOS.GOV/0520/F315D808/ BABABBDE BAAABA BAA B A `.
The ballot definition file would indicate the number
of characters in a precinct ID and number of characters in a machine ID
used to create the QR code (if desired).
For contests that span multiple election administrations, e.g. a school
district crossing county boundaries, we typically have results reported
separately within each county. To determine the combined total, we would
need to associate the contest IDs and selection IDs across election
administrations. For a state senator, the geographic area ID is
sufficient to match a contest, but election admins routinely spell
district names and office titles differently, making name matching
difficult. Election admins also sometimes spell candidate/choice names
differently, though matching last names is generally reliable.
### Versioning of Ballot Definition Files
Throughout an election cycle, the candidate list might be updated, e.g.
qualified write-in candidates might be added. Any time the ballot
definition file is altered, its SHA256 hash will change. When an update
is made, the prior SHA256 hash values can be included in the updated
file, with backwards compatibility definitions. This way, a QR code
generated from a prior version can be used with the updated definition.
Prior versions could be of 2 types:
* Backwards compatible additions. New candidates (write-ins) are
appended without changes to the prior `A`, `B`, ... letter representations.
* Incompatible changes. A contest or candidate appearing on the
ballot could be inserted/deleted, the number of rankings could change,
etc. In this case, we need a mapping file to convert the prior contest
starting column to new starting column (if contests change or number of
ranks/votes per contest changes), and for each candidate, the prior
letter to new letter.
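
For the incompatible case, the mapping could be applied to a body from a
prior version roughly as follows. The `ContestRemap` structure and
`remap_body` function are assumptions made for this sketch, not a
proposed mapping file format.

```python
from typing import Dict, List, NamedTuple


class ContestRemap(NamedTuple):
    old_start: int           # starting column in the prior body
    new_start: int           # starting column in the updated body
    width: int               # characters copied for this contest
    letters: Dict[str, str]  # prior letter -> new letter; others kept as-is


def remap_body(old_body: str, new_length: int, remaps: List[ContestRemap]) -> str:
    """Rebuild a body for the updated definition; unmapped columns stay blank."""
    new_body = [" "] * new_length
    for remap in remaps:
        for i in range(remap.width):
            ch = old_body[remap.old_start + i]
            new_body[remap.new_start + i] = remap.letters.get(ch, ch)
    return "".join(new_body)


# A candidate was inserted ahead of A and B in contest 1, and a new contest
# was inserted between the two original vote-for-one contests.
remaps = [
    ContestRemap(0, 0, 1, {"A": "B", "B": "C"}),
    ContestRemap(1, 2, 1, {}),
]
print(repr(remap_body("BA", 3, remaps)))  # 'C A'
```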
## Contest Selection Capture within a CVR (Cast Vote Record)
The above proposed format could be used as a Cast Vote Record, however
the header for a CVR record should contain different information. We
assume all CVR data collected applies to a particular election and
election admin, so this common content can be omitted.
The ballot style can be used to identify the ballot definition data
assumed for the CSC body, so a SHA256 prefix is redundant with the
ballot style.
A CVR header should include the following:
* Capture Device ID (scanner or BMD ID)
* Batch ID (collection, e.g. box of ballots)
* CVR/Ballot ID (sequence/imprinted ID)
* Reporting Group ID (e.g. election day, vote-by-mail)
* Reporting Unit ID (consolidated precinct)
* Ballot style
* Card/Page ID (for separately scanned sheets)
* Record Status/Adjudication level
The results may be subtotaled in independent batches of ballots
processed rather than combined together. We can include the batch ID to
facilitate subtotaling as well as identify the location (box) containing
a ballot.
Scanners can identify ambiguous marks requiring adjudication. We can
include the original CVR as well as corrected values. We can use a sort
to order all CVRs for easy processing, then place the most recent
adjudicated value before prior versions (latest first to easily identify
the latest and skip prior). So we could use `5` for a normal scanned
output, `2` for modified data, `0` for final adjudicated value, `8` for
record created during an audit.
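
With status codes chosen this way, a plain sort puts the authoritative
record first for each ballot. The tuple layout
`(ballot_id, status, body)` and the name `latest_records` are assumed
only for this sketch.

```python
from itertools import groupby
from typing import List, Tuple

Record = Tuple[str, str, str]  # (ballot_id, status, body), hypothetical layout


def latest_records(records: List[Record]) -> List[Record]:
    """Sort by ballot ID then status, keeping the first (lowest status) per ballot."""
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]


records = [
    ("0001", "5", "B.A"),  # normal scanned output with an ambiguous mark
    ("0002", "5", "AAB"),
    ("0001", "0", "BCA"),  # final adjudicated value for ballot 0001
]
print(latest_records(records))
# [('0001', '0', 'BCA'), ('0002', '5', 'AAB')]
```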
The record status might also be used to indicate a reassignment of
the group/precinct/style/card when a voter submits a provisional ballot
at the wrong precinct, possibly with a different ballot style, so the
wrong precinct CVR may be recast as a corrected CVR with a different
ballot style and precinct assignment.
The above remarks are not intended as a specific compact CVR format
specification, rather to give an idea of how the Simple CSC body can be
used for CVR capture and tabulation.
## Digital Signatures
The SHA256 prefix is suitable as a verification check but not a
cryptographically secure ID. We could use a separate digital signature
QR code on a BMD printout to authenticate the CSC as being printed
by a particular authorized BMD. Each BMD could have a private key
specific to an election, with a certificate issued by the election admin.
At the close of the election, the private key can be erased.
To sign a CSC, we need to combine the ballot definition data with the
CSC data, and sign the concatenated data. The full signature then
authenticates the QR coded data with the ballot definition used to
create and interpret the QR CSC content.
A digital signature is around 300 bytes, so might require a 69x69 QR
code at error correction level M (15%) or 89x89 at level H (33%).
## Formatting Alternatives
In the proposal above, blank, overvote, and undetermined write-in are
represented by special characters: space, `*`, and `+`. There are some
advantages to reserving `A` for blank vote, `B` for overvote, and assigning a
letter for a fill-in write-in vote, e.g. `C`, `D`, `E`, are candidates
1, 2, 3, `F` is a write-in with a fill-in-the-blank for name, `G` might
be assigned to a qualified write-in candidate to be counted. When used
for a scanned CVR body, a lower case letter can represent an ambiguous
reading.
If we store a CVR this way, and compare adjudicated versions, we can
format a difference in which a blank means the value is unchanged and a
character gives the updated value.
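
Such a difference is a one-line comparison when both bodies have the same
length. The sketch below assumes the alternative coding in which space is
not a vote value; `body_diff` is a made-up name.

```python
def body_diff(original: str, updated: str) -> str:
    """Blank where the two bodies agree, the updated character where they differ."""
    if len(original) != len(updated):
        raise ValueError("bodies must have the same length")
    return "".join(" " if o == u else u for o, u in zip(original, updated))


# Column 3 was an ambiguous (lower-case) reading, confirmed by adjudication.
print(repr(body_diff("CDcAE", "CDCAE")))  # '  C  '
```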
### Compact Binary Format
Rather than use a full character `A`, `B`, ... to represent a selection
(and under/overvote) we could use a variable width binary encoding, e.g.
3 bits for a typical contest (7 or fewer choices including write-in plus
undervote), 2 bits for a yes/no ballot measure. The bits would be packed
into a byte stream. The resulting binary data could be base45 encoded.
The sample ballot mentioned above would require 7 3-bit fields and 20
2-bit fields for a total of 61 bits, or 8 bytes, or 12 QR code characters.
The reduction from 27 to 12 characters is greater than 50%, but requires
much more software to pack and unpack the binary format and the
selections are not directly human readable.
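
The size arithmetic can be checked with a short packing sketch. The field
values below are arbitrary placeholders, and `pack_fields` and
`base45_length` are illustrative names; only the field widths come from
the text.

```python
from typing import List, Tuple


def pack_fields(fields: List[Tuple[int, int]]) -> bytes:
    """Pack (value, bit_width) fields MSB-first, zero-padding the last byte."""
    acc = 0
    nbits = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        acc = (acc << width) | value
        nbits += width
    pad = -nbits % 8  # bits needed to reach a whole byte
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def base45_length(num_bytes: int) -> int:
    """Base45 emits 3 characters per 2 bytes, plus 2 for a trailing odd byte."""
    return 3 * (num_bytes // 2) + 2 * (num_bytes % 2)


fields = [(1, 3)] * 7 + [(2, 2)] * 20  # 7*3 + 20*2 = 61 bits
packed = pack_fields(fields)
print(len(packed), base45_length(len(packed)))  # 8 12
```

Even this small packer needs the per-contest bit widths from the ballot
definition before any field can be read back, which is part of the added
software cost.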