SSN encryption

feen...@gmail.com

unread,

Sep 29, 2005, 8:25:17 AM9/29/05

to

We want to encrypt social security numbers in a database. The special
requirement is that the same SSN should encrypt to the same value
regardless of context. That way when new records are added to the
database it will be possible to easily match them to existing records.
Also, the tables can be sorted or selected and still matched to other
tables in the database.

This has been discussed in this newgroup before, but as far as I can
tell, the solutions are of the form "Map each SSN to a sequence number,
and compare to the list of already seen SSNs to know when to assign a
new one." However this means keeping the cross ref online for each new
file, which almost defeats the purpose of encrypting the SSN (at least
in our situation).

Since the intruder can probably identify his own record (and those of
close relations) from the non-SSN content, the method needs to have
some degree of security from known plaintext attacks. There are weak
clues to the first 5 digits in the data content (such as region codes),
if that is a consideration. Although the raw files are sorted by SSN,
we would resort tables by the encrypted SSN, to destroy that bit of
information. (An algorithm that maintained sort would be great, but we
assume insecure).

Daniel Feenberg
feenberg isat nber dotte org

tomst...@gmail.com

unread,

Sep 29, 2005, 8:42:30 AM9/29/05

to

drfr...@nber.org wrote:
> We want to encrypt social security numbers in a database. The special
> requirement is that the same SSN should encrypt to the same value
> regardless of context. That way when new records are added to the
> database it will be possible to easily match them to existing records.
> Also, the tables can be sorted or selected and still matched to other
> tables in the database.

Use ... a ... salt [and not a known fixed key]!!!

SSNs are like what 10 digits? that's a 33-bit number... fill the
remainder of the bits with random data and encrypt, voila 128-bit
"token" which when decoded gets you a SSN.

Tom

feen...@gmail.com

unread,

Sep 29, 2005, 9:23:54 AM9/29/05

to

tomst...@gmail.com wrote:
> drfr...@nber.org wrote:
> > We want to encrypt social security numbers in a database. The special
> > requirement is that the same SSN should encrypt to the same value
> > regardless of context. That way when new records are added to the
> > database it will be possible to easily match them to existing records.
> > Also, the tables can be sorted or selected and still matched to other
> > tables in the database.
>
> Use ... a ... salt [and not a known fixed key]!!!
>
> SSNs are like what 10 digits? that's a 33-bit number... fill the

Actually, 9 digits, but close enough.

> remainder of the bits with random data and encrypt, voila 128-bit
> "token" which when decoded gets you a SSN.
>
> Tom

I guess I am too much of a beginner to understand your suggestion. If
we add 85 (33+85=128) random bits and then encrypt, won't the same SSN
encrypt to different values depending on the random bits? (Or are the
random bits a maintained key)? We specifically don't need to decrypt
the SSNs - only use the encrypted values for matching. If we have to
decrypt to use the field, then our purpose is defeated, since the file
user would then be given the actual SSN, which we want to avoid. We
only want to give them enough information to match records across
tables.

Sorry for my naivity prolonging the exchange.

tomst...@gmail.com

unread,

Sep 29, 2005, 9:32:47 AM9/29/05

to

drfr...@nber.org wrote:
> > remainder of the bits with random data and encrypt, voila 128-bit
> > "token" which when decoded gets you a SSN.
> >
> > Tom
>
> I guess I am too much of a beginner to understand your suggestion. If
> we add 85 (33+85=128) random bits and then encrypt, won't the same SSN
> encrypt to different values depending on the random bits? (Or are the
> random bits a maintained key)? We specifically don't need to decrypt
> the SSNs - only use the encrypted values for matching. If we have to
> decrypt to use the field, then our purpose is defeated, since the file
> user would then be given the actual SSN, which we want to avoid. We
> only want to give them enough information to match records across
> tables.

what you need then is a keyed PRF based on the SSN then the salt bits
are the output of the PRF [e.g. use CMAC with a fixed secret key on the
SSN, get an 85-bit tag, append that to the SSN and encrypt].

That way a dictionary [e.g. MITM] attack won't work and you can even
verify entries by checking the MAC.

Tom

Peter Pearson

unread,

Sep 29, 2005, 12:23:06 PM9/29/05

to

drfr...@nber.org wrote:

> We want to encrypt social security numbers in a database.

You've triggered Pearson's predictable "clarify your requirements"
lecture. Regular sci.crypt readers can move along.

You need clarity in your requirements. Without clarity,
(a) you won't get useful help, and (b) you can't tell
whether or not the final design meets your requirements.
If you're not sure where you're going, you'll never know
whether you've arrived.

"We want to encrypt X" is not a requirement; rather, it's
an implementation suggestion. A good requirement sounds
like this: "It should be computationally infeasible for
someone knowing A, B, and C, but not knowing D, to guess
the value of E with probability of success greater than F."

If you can't separate the data-security function from the
rest of your application cleanly enough to articulate the
necessary requirements, then you'll have to hire a cryptographer
who can study the whole application.

So .... you have a database that includes SSNs. It appears that
you're using the SSN as an index into the database, and that
you want to deny somebody the ability to extract certain
information from the database.

Why not just encrypt the entire database with a secret key?

Is the database shared? If so, how much are the users
trusted? Can they share a secret key? Must they all be
able to modify the database?

--
Peter Pearson
To get my email address, substitute:
nowhere -> spamcop, invalid -> net

drfr...@nber.org

unread,

Sep 29, 2005, 4:01:26 PM9/29/05

to

Peter Pearson wrote:
> drfr...@nber.org wrote:
>
> > We want to encrypt social security numbers in a database.
>
> You've triggered Pearson's predictable "clarify your requirements"
> lecture. Regular sci.crypt readers can move along.
>
> You need clarity in your requirements. Without clarity,
> (a) you won't get useful help, and (b) you can't tell
> whether or not the final design meets your requirements.
> If you're not sure where you're going, you'll never know
> whether you've arrived.
>
> "We want to encrypt X" is not a requirement; rather, it's
> an implementation suggestion. A good requirement sounds
> like this: "It should be computationally infeasible for
> someone knowing A, B, and C, but not knowing D, to guess
> the value of E with probability of success greater than F."
>
> If you can't separate the data-security function from the
> rest of your application cleanly enough to articulate the
> necessary requirements, then you'll have to hire a cryptographer
> who can study the whole application.
>
> So .... you have a database that includes SSNs. It appears that
> you're using the SSN as an index into the database, and that
> you want to deny somebody the ability to extract certain
> information from the database.
>

In the United States, there are many datasets whose content need not be
kept secret, as long as the content can not be associated with a
particular individual. For example, the census department releases a
public use dataset which anyone can purchase which includes much
private informatio on a 1% sample of all households, but since none of
the information included allows one to identify exactly which household
is referred to, there is no violation of confidentiality. Most
government sponsored surveys promise respondents that their responses
will not be released in "indentifiable" form. The content can be
released, but not with identifiers, or with sufficient detail that
individuals could be identified. So often bith year will be included,
but not birth day. Name, address and SSN are always excluded. The
national archives and the census department offer hundreds of such
surveys for sale.

Now we have a dataset which is not actually public use, but is used by
a dozen or so researchers who have signed confidentiality agreements.
So there is no requirement that the SSNs be hidden. However, we have
been asked if it wouldn't be possible to conceal the identifiers from
the research datasets, as an extra level of protection. That way an
accidental release wouldn't be quite so serious, since the intruder
would have difficulty finding identifying who any particular record
referred to. It seems to be a reasonable request, and several
government agencies have published statements that they do this with
similar datasets. We are interested in following that lead.

> Why not just encrypt the entire database with a secret key?
>

Then we would have to give each researcher the secret key, and they
would have to work with the unencrypted dataset, leaving it possibly
exposed if there were a breakdown in the other security precautions.

Encrypting a dataset that is in constant use doesn't seem to accomplish
much. I understand it might be effective for backup tapes - then the
tapes could be stored in an insecure location. But if the data is used
every day, then the unencrypted dataset will always be available, and
the fact that there is another encrypted copy doesn't seem to be very
significant.

The statistical software used in analysis does not support encryption.

> Is the database shared? If so, how much are the users
> trusted? Can they share a secret key? Must they all be
> able to modify the database?

The database has several users. They are trustworthy but may possibly
make mistakes. They are subject to "shoulder surfing", etc.

They do not update the database, but they make extracts and merge
extracts from multiple tables for analysis. These temporary tables
persist for weeks or months, and may be used frequently.

Now perhaps you can see the logic to my request. If the SSN can be
suitably hashed, then the original records including the SSN can be
stored offline, and analysis can go on as before, with table merges
easily done on the hashed SSN, but the accidental release of a record
will not include an individual identifier, mitigating concerns about
violations of outer security barriers.

>From a cyptographic point of view, I thought the main difficulty is
that the intruder can probably identify himself in the database, and
perhaps some close relations, if he knows enough about them. This would
give him plaintext for several encrypted fields, and we wouldn't want
that to be sufficient to recover the key. Practically, it would be nice
if the hash were no more than 9 characters, so we didn't have to modify
any of the documentation when we replaced the field. (Beyond pointing
out that SSN was encrypted).

>
> --
> Peter Pearson
> To get my email address, substitute:
> nowhere -> spamcop, invalid -> net

Hope this is enough information to make clear what we are looking for.
It really isn't that obscure.

Joseph Ashwood

unread,

Sep 29, 2005, 5:13:09 PM9/29/05

to

"drfr...@nber.org" <feen...@gmail.com> wrote in message
news:1127996717....@g14g2000cwa.googlegroups.com...
> The requirement is that the same SSN [and only the SSN] should encrypt to
> the same value [for later comparison]

Let's begin by attempting to find the security requirements (I'll give you a
hint, any system that has to hold only the SSN like this will not be
secure). First the SSN is not anything approaching random, in fact the
Social Security Administration has a website on they are generated
(http://www.ssa.gov/history/ssn/geocard.html) so there is very little that
is unguessable, for example 602-12-3456 (apologies to whomever has this
number, it has been issued and I only guessed) is from California,
eliminating 3 digits immediately, and the 12 identifies that it is fairly
early in the usage, in fact while I don't have the timing information in
front of me 602-12-xxxx is probably mostly retired if not permanently so. So
the level of protection has to be extremely high in order to prevent the
leakage of the last group of numbers (serial number).

The requirement of a 1-1 mapping leads to low levels of protection. Even
assuming all 9 digits are unknown that is less than 2^30 work, so a modern
computer can run through all combinations in a matter of minutes. This means
that for security you MUST have a random component. Unfortunately, this
violates the aforementioned requirement.

Your best bet would be to use a hash function (e.g. MD5 chosen because it's
fast and the security levels here cannot be high anyway). Your other option
would be to use the suggestion by Tom, and depend on decrypting each of the
SSNs for comparison on addition to the database (this can be secure).
Joe

Andrew Swallow

unread,

Sep 29, 2005, 6:07:12 PM9/29/05

to

drfr...@nber.org wrote:
[snip]

> Now we have a dataset which is not actually public use, but is used by
> a dozen or so researchers who have signed confidentiality agreements.
> So there is no requirement that the SSNs be hidden. However, we have
> been asked if it wouldn't be possible to conceal the identifiers from
> the research datasets, as an extra level of protection. That way an
> accidental release wouldn't be quite so serious, since the intruder
> would have difficulty finding identifying who any particular record
> referred to. It seems to be a reasonable request, and several
> government agencies have published statements that they do this with
> similar datasets. We are interested in following that lead.

Assuming the researches are using their own computers and that they will
never need to work back to the original SSN.
Generate a secret key variable.

Copy the database a record at a time.
Set each name and most of the address field to spaces. (The researches
may need to know the town or district.)
Use AES encryption to encrypt all zeros to produce 128 random bits;
using the secret key variable and an IV equal to the SSN.
Take the SSN and exclusive (XOR) it with the appropriate number of
random bits. This is the encrypted SSN, insert the number into SSN
field of the copied record. Insert both into the encrypted databases.

If the same key variable is used this method will produce the same
result each time allowing changes to be detected in say 10 years time.
Note: There is a possibility that two different records will be given
the same encrypted number.

If there may be a need to reverse the encryption, for instance the
researches may have discovered people will a high probability of illness
who therefore need a medical inspection.
Set the SSN field in the copied records to a simple count and keep a
secret unmodified copy of the database.

Andrew Swallow

feen...@gmail.com

unread,

Sep 29, 2005, 6:30:37 PM9/29/05

to

Joseph Ashwood wrote:
> "drfr...@nber.org" <feen...@gmail.com> wrote in message
> news:1127996717....@g14g2000cwa.googlegroups.com...
> > The requirement is that the same SSN [and only the SSN] should encrypt to
> > the same value [for later comparison]
>
> Let's begin by attempting to find the security requirements (I'll give you a
> hint, any system that has to hold only the SSN like this will not be
> secure). First the SSN is not anything approaching random, in fact the
> Social Security Administration has a website on they are generated
> (http://www.ssa.gov/history/ssn/geocard.html) so there is very little that
> is unguessable, for example 602-12-3456 (apologies to whomever has this
> number, it has been issued and I only guessed) is from California,
> eliminating 3 digits immediately, and the 12 identifies that it is fairly
> early in the usage, in fact while I don't have the timing information in
> front of me 602-12-xxxx is probably mostly retired if not permanently so. So
> the level of protection has to be extremely high in order to prevent the
> leakage of the last group of numbers (serial number).

The dataset does include state of residence, which is correlated with
state of issuance of SSN. However, if I am looking for a particular
person, knowing that they are from California allows me to exclude 80%
of the records. Even if that knowledge also allows me to guess at part
of the SSN, once I have done so it really only allows me to exclude the
same 80%, so SSN structure isn't a source of weakness unless it helps
the intruder solve for the remaining digits. With the right encryption
technology, it shouldn't have that effect.

>
> The requirement of a 1-1 mapping leads to low levels of protection. Even
> assuming all 9 digits are unknown that is less than 2^30 work, so a modern
> computer can run through all combinations in a matter of minutes. This means

Let me make it clear that the threat model is not that someone would
take a record and solve for the SSN that goes with it, but that they
would locate the record in the database that had a particular SSN of
interest to them (an ex-spouse, for instance). With 100 million SSNs in
the database, that may be much harder (again, depending on the method).

I understand that for a 9 character message, there is no problem for a
computer to run through all possible messages, but this isn't the crux
of the problem, which would be to identify which solution was the
correct one. With some methods of encryption, it would be possible to
tell immediately that the correct key had been found - if the decrypt
contained only digits that would be strong evidence the key had been
found if invalid keys would decrypt to random bits which would rarely
be valid SSNs. With another method every key, correct or not, might
decrypt to a string of digits valid as an SSN. In that case it would be
more difficult to tell if the correct key had been found. The intruder
could look at the content of the record for clues, but by assumption
the intruder doesn't know enough of the record content to use that to
identify the individual. That is why I resist just hashing to a much
longer string.

> that for security you MUST have a random component. Unfortunately, this
> violates the aforementioned requirement.
>
> Your best bet would be to use a hash function (e.g. MD5 chosen because it's
> fast and the security levels here cannot be high anyway). Your other option

The intruder would take the known SSN, hash it using any MD5 program,
and search the database for the matching hash value. There is no key to
solve for - the equivalent of ROT13 security. This is much weaker than
necessary. For example, we could add (without carry) a constant K to
every SSN. Then the intruder would need to have alternate means of
identifying at least one record before he could solve for K and with
that search the database for the target record. That would be the ROTN
level of security. (Admittedly he could probably solve for the first
few digits from knowledge of the SSN structure - that isn't the point).
I expect it is possible to do much better. For example, if I make a
separate substitution map for each character position, I need 9 maps,
each with 10 elements. An intruder with knowledge of the true SSN for
as few as 9 records could solve for all the maps, still a substantial
improvement over ROTN, which beats ROT13. I'd like to do much better
than that.

> would be to use the suggestion by Tom, and depend on decrypting each of the
> SSNs for comparison on addition to the database (this can be secure).

If we are going make the comparisons to prior entries, then we can
assign serial numbers to individuals as they are encountered and no
encryption technology is required. However the comparison to the entire
database is expensive and something we were hoping to avoid with
appropriate encryption technology.

> Joe

karl malbrain

unread,

Sep 29, 2005, 6:51:28 PM9/29/05

to

Why do you need to retain the original SSN? Can't you just renumber
the records starting at 1 and running through 999,999,999 as needed? or
use random numbers? If you need the original SSN for a newer-version
comparison of records from the issuing authority, you can create an
index when you renumber and keep that index secure, off-line.

karl m

Paul Rubin

unread,

Sep 29, 2005, 7:29:21 PM9/29/05

to

"drfr...@nber.org" <feen...@gmail.com> writes:
> If we are going make the comparisons to prior entries, then we can
> assign serial numbers to individuals as they are encountered and no
> encryption technology is required. However the comparison to the entire
> database is expensive and something we were hoping to avoid with
> appropriate encryption technology.

How does this sound:

1. Let K = some secret string
2. Define F(x,i) = sha1(K # x # ',' # hex(i)) where # means concatenation
3. Convert the SSN to a 30-bit number.
4. Let A = upper 15 bits of the SSN, B = lower 15 bits.
5. For i = 1,2,3,...,10, do:
5.1 set A = (A + F(B,i)) mod 32768
5.2 swap A and B
6. Let T = (A*32768 + B)
7. If T > 10**9 then go back to step 5. This will rarely happen.
8. T (with leading zeros if needed) is the new 9-digit "encrypted" SSN.

Sample implementation in Python:

import sha,re

key = 'shhh, this is the secret key'

def F(x,i): # feistel round function
a = sha.new(key + hex(x) + ',' + hex(i)).hexdigest()
return int(a,16) % 32768

def validate_ssn(ssn):
# check that ssn is 9 decimal digits
assert re.match('\d{9}', ssn)

def new_ssn(ssn):
validate_ssn(ssn)
s = int(ssn)
a,b = divmod(s, 32768)
while True:
for i in range(10):
a = (a + F(b,i)) % 32768
a,b = b,a
t = 32768*a + b
if t < 10**9:
break
return '%09d'% t

# =================================================
# a few tests
assert new_ssn('000000000') == '319946843'
assert new_ssn('123456789') == '047014243'

import random, sets
old_seen = sets.Set()
new_seen = sets.Set()

# generate some random pairs and make sure the mapping is unique
for i in xrange(10000):
ss = random.randint(0,10**9 - 1)
s = '%09d'% ss
if s in old_seen: continue
old_seen.add(s)
t = new_ssn(s)
assert t not in new_seen, (i,s,t)
new_seen.add(t)

Joseph Ashwood

unread,

Sep 29, 2005, 8:06:15 PM9/29/05

to

"drfr...@nber.org" <feen...@gmail.com> wrote in message

news:1128033037.0...@f14g2000cwb.googlegroups.com...

> The dataset does include state of residence, which is correlated with
> state of issuance of SSN. However, if I am looking for a particular
> person, knowing that they are from California allows me to exclude 80%
> of the records. Even if that knowledge also allows me to guess at part
> of the SSN, once I have done so it really only allows me to exclude the
> same 80%, so SSN structure isn't a source of weakness unless it helps
> the intruder solve for the remaining digits. With the right encryption
> technology, it shouldn't have that effect.

You're overestimating the ability of encryption to solve the problem. But
I'll cover that more later.

>> The requirement of a 1-1 mapping leads to low levels of protection.

> Let me make it clear that the threat model is not that someone would

> take a record and solve for the SSN that goes with it, but that they
> would locate the record in the database that had a particular SSN of
> interest to them (an ex-spouse, for instance). With 100 million SSNs in
> the database, that may be much harder (again, depending on the method).

Actually it's even easier. Given any unkeyed 1-1 mapping this is simply a
database lookup. Using a keyed 1-1 mapping gets you some, but not much, the
only solution is the one that Tom gave, it is not perfect, and actually
violates your original statement, but it is the only secure way to do it.

> I understand that for a 9 character message, there is no problem for a
> computer to run through all possible messages, but this isn't the crux
> of the problem, which would be to identify which solution was the
> correct one.

Actually, that is the problem. If you have a 1-1 mapping where each output
is unique, there exists an onto mapping, and with only 2^30 of them even the
smallest hard drive on the market could store all of them making it actually
faster to break than compute in the first place.

> With some methods of encryption, it would be possible to
> tell immediately that the correct key had been found - if the decrypt
> contained only digits that would be strong evidence the key had been
> found if invalid keys would decrypt to random bits which would rarely
> be valid SSNs.

You seem to have a high opinion of encryption, unfortunately the limits of
encryption are far below where you put them. What you have is an unsolvable
problem, if you change a requirement (either security is not required or 1-n
is acceptable) there are solutions, without changing requirements the onto
hole will break your security.

> With another method every key, correct or not, might
> decrypt to a string of digits valid as an SSN.

Such systems would rely on compressing the plaintext first, or something
that is provably equivalent (see Unicity Distance).

> In that case it would be
> more difficult to tell if the correct key had been found.

Actually it wouldn't. Your original statement of the problem is that
encrypting the same data will lead to the same result. In the example given
you have an attacker who knows the information being encrypted, the
ciphertext and key are unknown. The odds of a collision in
plaintext:(ciphertext*) for even 4.2 billion (2^32) entries with a 128-bit
cipher are 1/2^96, so the key will be found. Again see Unicity Distance.

> The intruder
> could look at the content of the record for clues, but by assumption
> the intruder doesn't know enough of the record content to use that to
> identify the individual.

Earlier you said "ex-spouse" with known SSN, I should hope someone knows
enough about their ex-spouse to fill in the rest of the information. But if
the extra information is unique, use it in the hash, that will deliver the
security be eliminating one of the limiting assumptions.

>That is why I resist just hashing to a much
> longer string.

You won't find encryption that delivers better, going with 64-bit blocks
(e.g. 3DES) you have severe limiting factors on your database size before
hitting collisions (don't think you'll reach them with an SSN index, but
they are worth considering), also 64-bit blocks should generally not be
considered for new systems because of various limitations. 128-bit blocks
are the same size as the MD5 hash I suggested (although if you want it to
actually be secure I'd suggest a different hash, perhaps SHA-256, or
Whilrpool).

>
>> that for security you MUST have a random component. Unfortunately, this
>> violates the aforementioned requirement.
>>
>> Your best bet would be to use a hash function (e.g. MD5 chosen because
>> it's
>> fast and the security levels here cannot be high anyway). Your other
>> option
>
> The intruder would take the known SSN, hash it using any MD5 program,
> and search the database for the matching hash value. There is no key to
> solve for - the equivalent of ROT13 security.

Exactly the problem you will have unless you can add randomness to the
system. Without a random component your key is not random either, you have
in effect created an encryption oracle in the database add area, a
knowledgable attacker can simply use that to generate the colliding
information regardless of the system used. It is absolutely critical that
you eliminate one of your requirements or the problem is completely
intractable.

> This is much weaker than
> necessary.

Exactly my point, your problem as given is intractable, in my previous
example I chose to sacrifice the security requirement simply because with
the other requirements the security requirement appeared the softest.

> For example, we could add (without carry) a constant K to
> every SSN. Then the intruder would need to have alternate means of
> identifying at least one record before he could solve for K and with
> that search the database for the target record. That would be the ROTN
> level of security. (Admittedly he could probably solve for the first
> few digits from knowledge of the SSN structure - that isn't the point).
> I expect it is possible to do much better. For example, if I make a
> separate substitution map for each character position, I need 9 maps,
> each with 10 elements. An intruder with knowledge of the true SSN for
> as few as 9 records could solve for all the maps, still a substantial
> improvement over ROTN, which beats ROT13. I'd like to do much better
> than that.

Then you'll have to sacrifice one of the other requirements. In the mean
time it still won't beat the MD5 suggestion I made, you'll be less secure,
and it will likely be slower. It is necessary to sacrifice one of the
requirements, pick one, any one.

> If we are going make the comparisons to prior entries, then we can
> assign serial numbers to individuals as they are encountered and no
> encryption technology is required. However the comparison to the entire
> database is expensive and something we were hoping to avoid with
> appropriate encryption technology.

"Hoping to avoid" sounds like a soft requirement to me, maybe it should be
on the short list of ones to sacrifice.
Joe

Ben Livengood

unread,

Sep 29, 2005, 10:10:11 PM9/29/05

to

> Now perhaps you can see the logic to my request. If the SSN can be
> suitably hashed, then the original records including the SSN can be
> stored offline, and analysis can go on as before, with table merges
> easily done on the hashed SSN, but the accidental release of a record
> will not include an individual identifier, mitigating concerns about
> violations of outer security barriers.

If this is the case, then you could keep a table mapping each SSN to a
unique random key and store that mapping table safely with the original
data, while replacing each SSN with its random key for the "work" data
set. Merging new data would just involve a lookup in the mapping table
to map existing SSNs to the proper keys, and create new keys for new
SSNs. The important requirement is that the new key is totally
unrelated to the SSN it represents. The safest way is to randomly
generate a key for each SSN that's not already in the mapping table and
use the mapping table to determine if there are any collisions. If you
choose large enough keys (128 bit), you shouldn't get any collisions
either. In this case the keys aren't use to encrypt or decrypt
anything, they're just keys in the database context.

arachn...@gmail.com

unread,

Sep 29, 2005, 10:56:42 PM9/29/05

to

Unless I'm missing something fairly major, why not just use an HMAC
instead of all these complicated protocols people have proposed? Unless
they know the key to the HMAC, no number of known plaintext/hash pairs
will allow them to discover others. The HMAC result can be
shortened/formatted as appropriate, though the shorter it is, the
higher the probability of collision.

Alternately, just number sequentially, as someone else suggested.

Paul Rubin

unread,

Sep 29, 2005, 11:14:17 PM9/29/05

to

arachn...@gmail.com writes:
> Unless I'm missing something fairly major, why not just use an HMAC...

> Alternately, just number sequentially, as someone else suggested.

The problem with HMAC is the output has to be much longer than the
input or else collisions are likely. The problem with numbering
sequentially is it requires preprocessing a huge database rather than
encrypting query output on the fly. Whether those problems are
insurmountable is a different question.

drfr...@nber.org

unread,

Sep 30, 2005, 8:09:15 AM9/30/05

to

Andrew Swallow wrote:
> drfr...@nber.org wrote:
> [snip]
>
> > Now we have a dataset which is not actually public use, but is used by

>...

>
> Assuming the researches are using their own computers and that they will
> never need to work back to the original SSN.
> Generate a secret key variable.
>
> Copy the database a record at a time.
> Set each name and most of the address field to spaces. (The researches
> may need to know the town or district.)
> Use AES encryption to encrypt all zeros to produce 128 random bits;
> using the secret key variable and an IV equal to the SSN.
> Take the SSN and exclusive (XOR) it with the appropriate number of
> random bits. This is the encrypted SSN, insert the number into SSN
> field of the copied record. Insert both into the encrypted databases.
>
> If the same key variable is used this method will produce the same
> result each time allowing changes to be detected in say 10 years time.
> Note: There is a possibility that two different records will be given
> the same encrypted number.
>

Thanks, this is actually responsive to my posting and it seems
practical. As for the possibility that two different records will be
given the same encrypted SSN, that already exists to a much greater
degree from human error at data input.

> ...
> Andrew Swallow

drfr...@nber.org

unread,

Sep 30, 2005, 9:53:40 AM9/30/05

to

Joseph Ashwood wrote:
> "drfr...@nber.org" <feen...@gmail.com> wrote in message
> news:1128033037.0...@f14g2000cwb.googlegroups.com...

> Then you'll have to sacrifice one of the other requirements. In the mean

> time it still won't beat the MD5 suggestion I made, you'll be less secure,
> and it will likely be slower. It is necessary to sacrifice one of the
> requirements, pick one, any one.

Does Palu Rubin's suggestion in

http://groups.google.com/group/sci.crypt/msg/c4bdd165ba12b92a?dmode=source&hl=en

not make sense? It appears to me to be valid. To be invalid, there
would need to be a way for an intruder to search the file for a
particular SSN, which would require knowledge of the key, no? Is that
something easy to do? Could the key be determined from knowledge of
some *other* SSN/hash pairs? It isn't a problem that an intruder can
find himself in the database, only if he can use that information to
determine other SSN hashs, or the key itself.

Daniel Feenberg
feenberg isat nber dotte org

> Joe

Paul Rubin

unread,

Sep 30, 2005, 10:07:48 AM9/30/05

to

drfr...@nber.org writes:
> Does Paul Rubin's suggestion in

>
> http://groups.google.com/group/sci.crypt/msg/c4bdd165ba12b92a?dmode=source&hl=en
>
> not make sense? It appears to me to be valid. To be invalid, there
> would need to be a way for an intruder to search the file for a
> particular SSN, which would require knowledge of the key, no?

Unless I made an error (which is very possible), that post is just a
standard Feistel cipher on 30-bit numbers followed by a trick possibly
due to Schroeppel for making sure that 9-digit decimal inputs map to
9-digit outputs. As such, if the key is secret, it should be very
hard to distinguish from a random permutation on the 9-digit numbers.

As others have said, though, even revealing that two records have the
same SSN can leak private info about the person.

See the book "Database Nation" by Simson Garfinkel for where some of
this leads to.

Peter Pearson

unread,

Sep 30, 2005, 12:59:37 PM9/30/05

to

drfr...@nber.org wrote:

> Does Palu Rubin's suggestion in
>
> http://groups.google.com/group/sci.crypt/msg

c4bdd165ba12b92a?dmode=source&hl=en
>
> not make sense? It appears to me to be valid.

It is a sensible suggestion. Paul Rubin is a knowledgeable
and long-standing participant in this newsgroup. I have
confirmed that his sample code runs without errors, and
extended the collision-detection test to 100,000 samples,
which (thanks to the birthday paradox) would almost
certainly produce a collision if the algorithm were just
producing random numbers. I would happily bet real money
that Rubin's billion-to-billion cipher cannot be broken
by all the expertise on sci.crypt.

Note that although Rubin doesn't provide the decryption
algorithm, a decryption algorithm exists and is easy to
program. Naturally it requires the secret key.

To make it perfectly clear, what Rubin has given you is
a secret-key cipher that encrypts 9-digit plaintexts into
9-digit ciphertexts. An attacker not knowing the secret key
could not, given a list of distinct plaintexts and a scrambled
list of the corresponding ciphertexts, identify a plaintext-
ciphertext pair any better than by chance. (However, if a
plaintext appears more than once, the corresponding ciphertext
will appear the same number of times. Therefore, if your database has
several records bound to a given SSN, an attacker can recognize
that those records belong together.)

Peter Pearson

unread,

Sep 30, 2005, 1:13:19 PM9/30/05

to

drfr...@nber.org wrote:

[regarding Paul Rubin's suggestion in]
>
> http://groups.google.com/group/sci.crypt/msg
c4bdd165ba12b92a?dmode=source&hl=en

> Could the key be determined from knowledge of
> some *other* SSN/hash pairs?

No. Resistance to a known-plaintext attack is required
of a good cipher, and Paul Rubin has given you a good
cipher. The attacker's best option is to get a known
plaintext-ciphertext pair (e.g., by recognizing his
own record in the database) and then to guess keys
until he finds one that maps that plaintext onto that
ciphertext. So be sure to choose a key that he won't
guess early on. You can choose a nice key like this,
but with a longer randomish string:

echo 'asldkjfie35nav982hu23asd;klgjiup2389hb' | md5sum

Ari Silversteinn

unread,

Sep 30, 2005, 2:58:20 PM9/30/05

to

On 29 Sep 2005 05:42:30 -0700, tomst...@gmail.com wrote:

> SSNs are like what 10 digits?

Like, no.
--
Drop the alphabet for email

Paul Rubin

unread,

Sep 30, 2005, 5:44:47 PM9/30/05

to

Peter Pearson <ppea...@nowhere.invalid> writes:
> > Does Paul Rubin's suggestion in

> >
> > http://groups.google.com/group/sci.crypt/msg
> c4bdd165ba12b92a?dmode=source&hl=en
> >
> > not make sense? It appears to me to be valid.
>
> It is a sensible suggestion.

Bah, there are several small but minor errors in the spec and the
sample code.

2. Define F(x,i) = sha1(K # x # ',' # hex(i)) where # means concatenation

should say:

Define F(x,i) = sha1(K # hex(x) # ',' # hex(i))

and

7. If T > 10**9 then go back to step 5. This will rarely happen.

should say if T >= 10**9. I spotted both of these while writing the
sample code but forgot to fix them before posting the article. The
sample code does the right thing for both. Also:

# check that ssn is 9 decimal digits
assert re.match('\d{9}', ssn)

should say

assert re.match('\d{9}$', ssn)

which I just noticed.

More substantively, I picked the number of rounds (10) out of the air
and it might be low enough to make a detectably nonuniform scrambling,
given the small block size. For large blocks, four rounds are enough,
and ten seems pretty conservative for this size (i.e. it should be
plenty). But proving it's enough would take a somewhat messy
calculation that I'm not completely sure how to do. I don't think
this is likely to cause any practical security failures even if some
slight bias can be shown.

John E. Hadstate

unread,

Sep 30, 2005, 6:57:42 PM9/30/05

to

"drfr...@nber.org" <feen...@gmail.com> wrote in message

news:1127996717....@g14g2000cwa.googlegroups.com...

> We want to encrypt social security numbers in a database.
> The special
> requirement is that the same SSN should encrypt to the
> same value
> regardless of context. That way when new records are added
> to the
> database it will be possible to easily match them to
> existing records.
> Also, the tables can be sorted or selected and still
> matched to other
> tables in the database.
>

Pardon me if this suggestion has already been advanced. I'd
suggest that for each SSN in your file you:

1. pad it to 128 bits and,

2. encrypt the padded SSN using AES128 with the padded SSN
as
the key.

In the absence of collisions, each SSN will map to a unique
value and the work required to reverse the encryption will
be incomprehensible. There are probably no guarantees
against a collision, but I suspect that the probability of
finding one is very small.

David Wagner

unread,

Sep 30, 2005, 7:18:31 PM9/30/05

to

John E. Hadstate wrote:
>Pardon me if this suggestion has already been advanced. I'd
>suggest that for each SSN in your file you:
>
>1. pad it to 128 bits and,
>
>2. encrypt the padded SSN using AES128 with the padded SSN
>as
>the key.
>
>In the absence of collisions, each SSN will map to a unique
>value and the work required to reverse the encryption will
>be incomprehensible.

By "incomprehensible", do you mean "negligible"?

One can exhaustively search the space of SSNs in minutes.
This means that a dictionary search suffices to recover the
SSN from this obfuscated version of the SSN in minutes. The
work factor is at most 2^33, and that's an overestimate for
several reasons.

One can even precompute a lookup table that allows to subsequently
invert obfuscated SSNs in no time at all.

John Hadstate

unread,

Sep 30, 2005, 7:38:32 PM9/30/05

to

David Wagner wrote:
>
> By "incomprehensible", do you mean "negligible"?
>

Yes, I realized that I had invented a new meaning for the term about 15
seconds after I sent the reply. Another reason why well-meaning
amateurs should stay out of crypto.

Unruh

unread,

Sep 30, 2005, 8:11:56 PM9/30/05

to

Peter Pearson <ppea...@nowhere.invalid> writes:

>drfr...@nber.org wrote:

>[regarding Paul Rubin's suggestion in]
>>
>> http://groups.google.com/group/sci.crypt/msg
>c4bdd165ba12b92a?dmode=source&hl=en

>> Could the key be determined from knowledge of
>> some *other* SSN/hash pairs?

>No. Resistance to a known-plaintext attack is required
>of a good cipher, and Paul Rubin has given you a good
>cipher. The attacker's best option is to get a known
>plaintext-ciphertext pair (e.g., by recognizing his
>own record in the database) and then to guess keys
>until he finds one that maps that plaintext onto that
>ciphertext. So be sure to choose a key that he won't
>guess early on. You can choose a nice key like this,
>but with a longer randomish string:

>echo 'asldkjfie35nav982hu23asd;klgjiup2389hb' | md5sum

Actually I do not think that it is actually collision free. On step 7, 7%
of the values will be larger than 10^9. In that case, they go through
another set of rounds and get mapped into numbers below 10^9 But that
number they are mapped to is liable to be one that one of the previous
things already mapped to. Ie, the probablility seems to be that 7% of the numbers
have a collision. It would be better to use 28 bits (14+14) which
are always less than 10^9. Ie, you would not have all of the numbers in
your hash but you would be guarenteed collision free.
It is however a horribly slow as cyphers go, but I guess that is not really
a problem.

Paul Rubin

unread,

Sep 30, 2005, 8:31:52 PM9/30/05

to

Unruh <unruh...@physics.ubc.ca> writes:
> Actually I do not think that it is actually collision free. On step
> 7, 7% of the values will be larger than 10^9. In that case, they go
> through another set of rounds and get mapped into numbers below 10^9
> But that number they are mapped to is liable to be one that one of
> the previous things already mapped to.

No, the function is invertible, so that cannot happen.

> Ie, the probablility seems to
> be that 7% of the numbers have a collision. It would be better to
> use 28 bits (14+14) which are always less than 10^9. Ie, you would
> not have all of the numbers in your hash but you would be guarenteed
> collision free.

But now you don't cover the whole space of SSN's.

Andrew Swallow

unread,

Sep 30, 2005, 8:59:15 PM9/30/05

to

To avoid collisions you also have to store the encrypted
SSN as 128 bits.

Andrew Swallow

Mike Amling

unread,

Oct 4, 2005, 10:31:00 PM10/4/05

to

The output has to be longer than the square of the number of distinct
inputs, not longer than the inputs themselves. Even if the OP only has
room for 45 bits of MAC output (9 ALPHAnumerics), collisions are not
expected until there are over 4 million distinct SSNs in the DB. 2**30
distinct SSNs can be accommodated by allowing 12 uppercase/numeric
characters or 10 base-64 characters.

--Mike Amling

Mike Amling

unread,

Oct 4, 2005, 10:45:03 PM10/4/05

to

Doh! That would be better phrased as "The output has to be longer
than the log of the square of the number of distinct inputs."

--Mike Amling

Paul Rubin

unread,

Oct 4, 2005, 11:50:45 PM10/4/05

to

Mike Amling <nos...@nospam.com> writes:
> The output has to be longer than the square of the number of
> distinct inputs, not longer than the inputs themselves. Even if the OP
> only has room for 45 bits of MAC output (9 ALPHAnumerics), collisions
> are not expected until there are over 4 million distinct SSNs in the DB.

With that many SSN's, collisions would be ~ 50% likely. With quite a
lot fewer, collisions would be > 1% likely, which might or might not be ok.

Message has been deleted