How to hash fields and detect changes in a record

Mike Dewhirst

Jun 10, 2022, 3:53:50 AM
to Django users
The use case is auto-deletion of out-of-date records if they have not
changed.

That might sound weird but it is the solution I have come to for a
particular problem. My software analyses chemical properties and writes
note records containing advice, each with a FK to the chemical.

When values change sufficiently on the chemical, the software would
construct a set of mostly different note records. The problem is that
note records still exist from the previous set of properties. These
would definitely confuse the user and thereby invalidate the advice.

The workaround is for the user to delete all notes *prior* to re-saving
and auto-generating a new correct set of notes. There is a proviso that
you wouldn't want to delete notes altered by users. I would document
that so users understand why the software skipped deleting those notes.

I think the solution might be to hash note.title and note.note into a
new field note.hash on being auto-created. On subsequent saves, compare
the latest hash with note.hash to decide whether to delete auto-inserted
notes prior to generating the next set. Those subsequent saves could be
months or years later.

If unchanged, the old note is safe to delete because it is no longer
relevant.

I've googled around and there are lots of possible solutions but it
seems the major problem might be that hashes are difficult to guarantee
when the environment - such as the version of Python - changes.

Also, I'm not convinced I have chosen the correct strategy.

Hope I've explained the problem adequately.

Thoughts appreciated

Cheers

Mike

--
Signed email is an absolute defence against phishing. This email has
been signed with my private key. If you import my public key you can
automatically decrypt my signature and be sure it came from me. Just
ask and I'll send it to you. Your email software can handle signing.

Ryan Nowakowski

Jun 10, 2022, 9:25:33 AM
to Django users
On Fri, Jun 10, 2022 at 05:52:48PM +1000, Mike Dewhirst wrote:
> The use case is auto-deletion of out-of-date records if they have not
> changed.
>
> That might sound weird but it is the solution I have come to for a
> particular problem. My software analyses chemical properties and writes note
> records containing advice, each with a FK to the chemical.
>
> When values change sufficiently on the chemical, the software would
> construct a set of mostly different note records. The problem is that note
> records still exist from the previous set of properties. These would
> definitely confuse the user and thereby invalidate the advice.

You might consider versioning your chemical model objects. Then when
values change sufficiently on the chemical model object, you can create
a new version of the chemical object, leaving the old notes associated
with the old version of the chemical object. In your web app, you could
allow the users to browse old versions of the chemical including the
notes which may have been altered.

> The workaround is for the user to delete all notes *prior* to re-saving and
> auto-generating a new correct set of notes. There is a proviso that you
> wouldn't want to delete notes altered by users. I would document that so
> users understand why the software skipped deleting those notes.
>
> I think the solution might be to hash note.title and note.note into a new
> field note.hash on being auto-created. On subsequent saves, compare the
> latest hash with note.hash to decide whether to delete auto-inserted notes
> prior to generating the next set. Those subsequent saves could be months or
> years later.

Hashing is useful if you want to check that something has been
unexpectedly changed. I assume the note can only be changed through
your web app so you know when a user is changing a note. Since you're
expecting users to change some of the notes and you know when they do,
hashing might be overkill. Instead, add a boolean `altered_by_user`
field to the note model. Initially when you automatically create the
note altered_by_user would be set to False. If a user changes the note,
set altered_by_user to True.

Mike Dewhirst

Jun 10, 2022, 10:14:58 AM
to django...@googlegroups.com
On 10/06/2022 11:24 pm, Ryan Nowakowski wrote:
> On Fri, Jun 10, 2022 at 05:52:48PM +1000, Mike Dewhirst wrote:
>> The use case is auto-deletion of out-of-date records if they have not
>> changed.
>>
>> That might sound weird but it is the solution I have come to for a
>> particular problem. My software analyses chemical properties and writes note
>> records containing advice, each with a FK to the chemical.
>>
>> When values change sufficiently on the chemical, the software would
>> construct a set of mostly different note records. The problem is that note
>> records still exist from the previous set of properties. These would
>> definitely confuse the user and thereby invalidate the advice.
> You might consider versioning your chemical model objects. Then when
> values change sufficiently on the chemical model object, you can create
> a new version of the chemical object, leaving the old notes associated
> with the old version of the chemical object. In your web app, you could
> allow the users to browse old versions of the chemical including the
> notes which may have been altered.

That's not really appropriate. The user doesn't care about older
versions beyond an annual summary of the calculated analysis. As volumes
(manufactured and/or imported) change, the analysis and therefore the
current advice change. There is no need to keep track of out-of-date
advice notes.

What really matters is that *when* things change the advice needs to
change and the old advice needs to be deleted.

The only reason I need to avoid deleting old notes is if the user has
edited the advice itself - in any of the individual notes. Probably
it would be OK to delete an edited note because it is old advice BUT I
feel it would be wrong for software to make that decision. As I said,
I'm happy to document why.

Just thinking about that, I could maybe adjust the note.title to append
something like "Out of date" if I detect it has been edited.

>
>> The workaround is for the user to delete all notes *prior* to re-saving and
>> auto-generating a new correct set of notes. There is a proviso that you
>> wouldn't want to delete notes altered by users. I would document that so
>> users understand why the software skipped deleting those notes.
>>
>> I think the solution might be to hash note.title and note.note into a new
>> field note.hash on being auto-created. On subsequent saves, compare the
>> latest hash with note.hash to decide whether to delete auto-inserted notes
>> prior to generating the next set. Those subsequent saves could be months or
>> years later.
> Hashing is useful if you want to check that something has been
> unexpectedly changed. I assume the note can only be changed through
> your web app so you know when a user is changing a note.

These are automatically generated notes which taken together constitute
advice on how to deal with the analysis. Users can edit them. For
example, someone might record some action taken regarding the advice. I
don't want to delete that. If nothing has been edited, it is safe to delete.

So how do I know it is the same as when originally generated - and safe
to delete - except by storing a hash of the interesting fields?

And if that is the best approach, what sort of hashing will survive
Python upgrades etc?

> Since you're
> expecting users to change some of the notes and you know when they do,
> hashing might be overkill. Instead, add a boolean `altered_by_user`
> field to the note model. Initially when you automatically create the
> note altered_by_user would be set to False. If a user changes the note,
> set altered_by_user to True.

Not sure this would work. Note creation and eventually automatic
deletion is all driven from model methods executed on saving.

>
>> If unchanged, the old note is safe to delete because it is no longer
>> relevant.
>>
>> I've googled around and there are lots of possible solutions but it seems
>> the major problem might be that hashes are difficult to guarantee when the
>> environment - such as the version of Python - changes.
>>
>> Also, I'm not convinced I have chosen the correct strategy.
>>
>> Hope I've explained the problem adequately.
>
>


Mike Dewhirst

Jun 10, 2022, 9:41:29 PM
to django...@googlegroups.com
Ryan

Thanks very much - you triggered the necessary amount of thinking and I
reckon you are correct - hashing is overkill.

I just need a self-referential value comparison where the value is
independent of outside influences. That means I should not use a hash
library from anywhere.

I'll just convert all the chars in all the fields I'm interested in into
integers and sum them into my "hash" field.

Should be quick and easy!

Cheers

Mike

Ryan Nowakowski

Jun 12, 2022, 5:10:09 PM
to django...@googlegroups.com
On Sat, Jun 11, 2022 at 12:13:16AM +1000, Mike Dewhirst wrote:
> On 10/06/2022 11:24 pm, Ryan Nowakowski wrote:
> > On Fri, Jun 10, 2022 at 05:52:48PM +1000, Mike Dewhirst wrote:
> > > I think the solution might be to hash note.title and note.note into a new
> > > field note.hash on being auto-created. On subsequent saves, compare the
> > > latest hash with note.hash to decide whether to delete auto-inserted notes
> > > prior to generating the next set. Those subsequent saves could be months or
> > > years later.
> > Hashing is useful if you want to check that something has been
> > unexpectedly changed. I assume the note can only be changed through
> > your web app so you know when a user is changing a note.
>
> These are automatically generated notes which taken together constitute
> advice on how to deal with the analysis. Users can edit them. For example,
> someone might record some action taken regarding the advice. I don't want to
> delete that. If nothing has been edited, it is safe to delete.
>
> So how do I know it is the same as when originally generated - and safe to
> delete - except by storing a hash of the interesting fields.

Because when the user edits a note, during the form.save() (assuming
you're using Django forms), you'll set `altered_by_user` to True.

> And if that is the best approach, what sort of hashing will survive Python
> upgrades etc?

Pick a hash algorithm [1] (e.g. sha256). The output will remain the same
even with Python upgrades.

[1] https://docs.python.org/3/library/hashlib.html
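A minimal sketch of that suggestion, assuming the two fields are plain strings. The helper name `note_digest` and the NUL separator are illustrative choices, not anything from the project. SHA-256 is a fixed algorithm, so the same bytes give the same digest on any Python version. The built-in `hash()` is a different matter: for `str` it is randomized per process unless PYTHONHASHSEED is set, so it is unsuitable for stored values.

```python
import hashlib


def note_digest(title: str, note: str) -> str:
    """Return a SHA-256 hex digest of the two text fields.

    Hypothetical helper: the name and the separator are illustrative only.
    """
    # Encode explicitly as UTF-8 so the bytes fed to the hash do not depend
    # on platform defaults. The NUL byte separates the fields so that
    # ("ab", "c") and ("a", "bc") produce different digests.
    payload = title.encode("utf-8") + b"\x00" + note.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Store this value when the note is auto-created.
stored = note_digest("Storage", "Keep away from heat.")

# Later, possibly under a newer Python, recompute and compare.
unchanged = note_digest("Storage", "Keep away from heat.") == stored
edited = note_digest("Storage", "Keep away from heat. Done.") == stored
```

The hex digest is always 64 characters long, so a fixed-width character column would be enough to hold it.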

> > Since you're
> > expecting users to change some of the notes and you know when they do,
> > hashing might be overkill. Instead, add a boolean `altered_by_user`
> > field to the note model. Initially when you automatically create the
> > note altered_by_user would be set to False. If a user changes the note,
> > set altered_by_user to True.
>
> Not sure this would work. Note creation and eventually automatic deletion is
> all driven from model methods executed on saving.

Why wouldn't this work? During note creation, altered_by_user would be
set to False automatically because that's the default. When
automatically deleting, do:

    Note.objects.filter(altered_by_user=False).delete()
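The flag approach can be sketched without a database. The `Note` dataclass below is a plain-Python stand-in for the Django model, and `user_edit` and `purge_unaltered` are hypothetical helpers standing in for the form save and the queryset delete; only the `altered_by_user` field name comes from the discussion.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Note:
    """Plain-Python stand-in for the Django Note model (illustrative only)."""
    title: str
    note: str
    altered_by_user: bool = False  # default for auto-generated notes


def user_edit(n: Note, new_text: str) -> None:
    """What a form save would do: change the text and set the flag."""
    n.note = new_text
    n.altered_by_user = True


def purge_unaltered(notes: list[Note]) -> list[Note]:
    """Return the notes that survive, mirroring filter(altered_by_user=False).delete()."""
    return [n for n in notes if n.altered_by_user]


notes = [Note("Storage", "Keep cool."), Note("Transport", "Label as hazardous.")]
user_edit(notes[1], "Label as hazardous. Labels ordered.")
remaining = purge_unaltered(notes)  # only the edited "Transport" note is left
```

In a real project the delete would presumably also be restricted to the notes of the chemical being re-saved.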

Mike Dewhirst

Jun 13, 2022, 12:46:19 AM
to django...@googlegroups.com




--
(Unsigned mail from my phone)



-------- Original message --------
From: Ryan Nowakowski <tub...@fattuba.com>
Date: 13/6/22 07:09 (GMT+10:00)
Subject: Re: How to hash fields and detect changes in a record

On Sat, Jun 11, 2022 at 12:13:16AM +1000, Mike Dewhirst wrote:
> On 10/06/2022 11:24 pm, Ryan Nowakowski wrote:
> > On Fri, Jun 10, 2022 at 05:52:48PM +1000, Mike Dewhirst wrote:
> > > I think the solution might be to hash note.title and note.note into a new
> > > field note.hash on being auto-created. On subsequent saves, compare the
> > > latest hash with note.hash to decide whether to delete auto-inserted notes
> > > prior to generating the next set. Those subsequent saves could be months or
> > > years later.
> > Hashing is useful if you want to check that something has been
> > unexpectedly changed.  I assume the note can only be changed through
> > your web app so you know when a user is changing a note.
>
> These are automatically generated notes which taken together constitute
> advice on how to deal with the analysis. Users can edit them. For example,
> someone might record some action taken regarding the advice. I don't want to
> delete that. If nothing has been edited, it is safe to delete.
>
> So how do I know it is the same as when originally generated - and safe to
> delete - except by storing a hash of the interesting fields.

Because when the user edits a note, during the form.save()(assuming
you're using Django forms), you'll set `altered_by_user` to True.

Notes can also be altered in the Admin.


> And if that is the best approach, what sort of hashing will survive Python
> upgrades etc?

Pick a hash algorithm[1](ex: sha256).  The output will remain the same
even with Python upgrades.

So the mechanism doesn't need to be a hash - as you said. I now just sum ord(char) for the title and the note and keep that in a flag field.

Only the auto-notes get a flag because they are the only ones I would consider deleting. 



[1] https://docs.python.org/3/library/hashlib.html

> > Since you're
> > expecting users to change some of the notes and you know when they do,
> > hashing might be overkill.  Instead, add a boolean `altered_by_user`
> > field to the note model.  Initially when you automatically create the
> > note altered_by_user would be set to False.  If a user changes the note,
> > set altered_by_user to True.
>
> Not sure this would work. Note creation and eventually automatic deletion is
> all driven from model methods executed on saving.

Why wouldn't this work? During note creation, altered_by_user would be
set to False automatically because that's the default.  When
automatically deleting, do:

    Note.objects.filter(altered_by_user=False).delete()

Ryan Nowakowski

Jun 14, 2022, 9:21:15 AM
to django...@googlegroups.com

You have a couple of choices then.  You could alter the note details view in the admin to set the altered_by_user field.  Alternatively and more generically, you could check the pk field in your model save method.  If it is None, then you are creating a new note.  If the pk field is not None, then you are updating an existing note so you can set altered_by_user to True.
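The pk check can be sketched as follows. This is a plain-Python stand-in, not a Django model: the class, the counter that hands out primary keys, and the `save` body are assumptions for illustration. Note that it treats every save after the first as a user edit, including any re-save done by the software itself, which would need checking against how the notes are actually regenerated.

```python
import itertools

_next_pk = itertools.count(1)  # stand-in for the database's auto-increment


class Note:
    """Plain-Python stand-in for a Django model (illustrative only)."""

    def __init__(self, title, note):
        self.pk = None  # like a model instance, None until first saved
        self.title = title
        self.note = note
        self.altered_by_user = False

    def save(self):
        if self.pk is None:
            # First save: the note is being created.
            self.pk = next(_next_pk)
        else:
            # Any later save is treated as an edit.
            self.altered_by_user = True


n = Note("Storage", "Keep cool.")
n.save()                          # creation
created_flag = n.altered_by_user  # still False
n.note += " Fridge installed."
n.save()                          # update
updated_flag = n.altered_by_user  # now True
```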


> And if that is the best approach, what sort of hashing will survive Python
> upgrades etc?

Pick a hash algorithm[1](ex: sha256).  The output will remain the same
even with Python upgrades.

So the mechanism doesn't need to be a hash - as you said. I now just sum ord(char) for the title and the note and keep that in a flag field.

Summing the ordinal of the characters won't catch transposition:

>>> chars = 'ab'
>>> sum([ord(c) for c in chars])
195
>>> chars = 'ba'
>>> sum([ord(c) for c in chars])
195

Better to use a real hash algorithm if you're trying to detect changes.  My earlier point that hashing isn't required was that you don't need to detect changes at all, since you already know explicitly when they are being made.
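For comparison, a short sketch that runs the same two strings through both methods. The function names are illustrative. The ordinal sum cannot tell 'ab' from 'ba', while SHA-256 from the standard library gives different digests.

```python
import hashlib


def ordinal_sum(text: str) -> int:
    """Sum of code points: insensitive to character order."""
    return sum(ord(c) for c in text)


def sha256_hex(text: str) -> str:
    """SHA-256 of the UTF-8 bytes: sensitive to character order."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


same_sum = ordinal_sum("ab") == ordinal_sum("ba")     # True: collision
same_digest = sha256_hex("ab") == sha256_hex("ba")    # False: change detected
```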

Mike Dewhirst

Jun 14, 2022, 11:30:35 PM
to django...@googlegroups.com
On 14/06/2022 11:20 pm, Ryan Nowakowski wrote:
>
> Summing the ordinal of the characters won't catch transposition:
>
> >>> chars = 'ab'
> >>> sum([ord(c) for c in chars])
> 195
> >>> chars = 'ba'
> >>> sum([ord(c) for c in chars])
> 195
>
> Better to use a real hash algorithm if you're trying to detect
> changes.  My note above about hashing not being required is because
> you don't need to detect changes because you explicitly already know
> when changes are being made.
>

Thanks Ryan.

It is all working now. I append " - No longer relevant" to the note
title if any change is detected. Otherwise the note gets deleted.
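A rough sketch of that clean-up rule, under stated assumptions. The `Note` dataclass, `fingerprint`, `auto_note` and `retire_old_notes` are hypothetical names, and SHA-256 is swapped in here for the ordinal sum described earlier in the thread. Only auto-generated notes carry a stored value in `flag`; notes without one are never touched.

```python
import hashlib
from dataclasses import dataclass

SUFFIX = " - No longer relevant"


def fingerprint(title: str, note: str) -> str:
    """SHA-256 over both fields, NUL-separated (illustrative helper)."""
    data = title.encode("utf-8") + b"\x00" + note.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


@dataclass
class Note:
    title: str
    note: str
    flag: str = ""  # set only when the note is auto-created


def auto_note(title: str, note: str) -> Note:
    """Create an auto-generated note and record its fingerprint."""
    return Note(title, note, fingerprint(title, note))


def retire_old_notes(notes):
    """Drop unchanged auto-notes, mark edited ones, keep manual ones."""
    kept = []
    for n in notes:
        if not n.flag:
            kept.append(n)          # manual note: leave alone
        elif fingerprint(n.title, n.note) == n.flag:
            continue                # unchanged auto-note: delete
        else:
            if not n.title.endswith(SUFFIX):
                n.title += SUFFIX   # edited auto-note: mark once, keep
            kept.append(n)
    return kept


a = auto_note("Storage", "Keep cool.")
b = auto_note("Transport", "Label as hazardous.")
b.note += " Labels ordered."        # simulated user edit
c = Note("Reminder", "Call supplier.")
kept = retire_old_notes([a, b, c])
```

The `endswith` check keeps the suffix from being appended again on a later pass, since an edited note will never match its stored fingerprint.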

Cheers

Mike

Ryan Nowakowski

Jun 15, 2022, 8:18:59 AM
to django...@googlegroups.com


On June 14, 2022 10:29:40 PM CDT, Mike Dewhirst <mi...@dewhirst.com.au> wrote:
>On 14/06/2022 11:20 pm, Ryan Nowakowski wrote:
>>
>> Summing the ordinal of the characters won't catch transposition:
>>
>> >>> chars = 'ab'
>> >>> sum([ord(c) for c in chars])
>> 195
>> >>> chars = 'ba'
>> >>> sum([ord(c) for c in chars])
>> 195
>>
>> Better to use a real hash algorithm if you're trying to detect changes.  My note above about hashing not being required is because you don't need to detect changes because you explicitly already know when changes are being made.
>>
>
>Thanks Ryan.
>
>It is all working now. I append " - No longer relevant" to the note title if any change is detected. Otherwise the note gets deleted.
>

Good to hear! Seems like an interesting project.