Programmatic management of resources?

62 views
Skip to first unread message

David Osborne

unread,
Feb 11, 2020, 6:12:54 PM2/11/20
to Arches Project
It seems I've inadvertently duplicated some resource records by accidentally assigning them new uuids before loading from CSVs, so instead of overwriting existing records, duplicates were created.

I was wondering if anyone had developed any 'grep'-like command-line utilities that perhaps returned uuids for resources where, say, a node contains a particular value? If there were also a command to delete resources given their uuid, that would help to remove the duplicate records. If nothing like this exists, is there existing Python code in the codebase which would make a good starting point? We're still on Arches 4.4.2.

A related question is whether it's possible to delete resource-to-resource relationship records? They can obviously be loaded from CSV but there's nothing documented which can remove them. What happens to these links when one or more of the associated resource records are deleted?

Finally, I've always been puzzled that in our resource reports, related resources are followed by the uuid for the relationship (00000000-0000-0000-0000-000000000007), rather than the CIDOC-CRM property phrase ("is-related-to"or "refers-to", I forget which), although I see that Lincoln's Arcade has the almost equally obscure URL for CRM property P67.

Cheers
David

Alexei Peters

unread,
Feb 18, 2020, 1:29:16 PM2/18/20
to Arches Project
Hi David,
There isn't a baked in "grep"-like command to do what you want but I've crafted a bit of sql that you might be able to run to retrieve resource instances based on node values.

SELECT t.tileid,
t.resourceinstanceid,
t.tiledata,
n.nodeid
FROM tiles t
LEFT JOIN nodes n ON t.nodegroupid = n.nodegroupid
WHERE 
(t.tiledata::json -> n.nodeid::text)::text = '12'

In this case you'd replace "12" with the value you're interested in.  This won't work for geometry nodes but should work for the majority of other datatypes.

As for deleting the resources once identified, the surest way to do that would be through the UI.  That might be impractical if you have anything more than 50 records (or less depending on how much perseverance you have).
You could always use SQL to delete resource instances based on the uuid, which should cascade delete the tiles as well.  This will still leave traces of the resource instances in the edit log table (which you could also delete if you wish). Resource-to-resrouce relationships could also be deleted the same way.

I will caveat all this by saying that I haven't tested any of this and you should most definitely back up your database before you attempt to delete anything via SQL.

If you do delete resources via SQL, then you'll need to reindex the data in Elasticsearch for which there is a command
python manange.py es index_database

As for the issue with reports showing the uuid of the relationship rather than the text, that might be something that could be fixed the default report template.
Hope that helps!

Cheers,
Alexei

David Osborne

unread,
Feb 23, 2020, 12:37:36 PM2/23/20
to Arches Project
Hi Alexei

Thank you for the details. I should have asked about doing this using SQL, which seems the more obvious way of doing it but I had looked at the database structure and couldn't see how it related to the data models. Is there anything for developers such as an entity relationship diagram for the database, to explain how to query it?
I'll try your SQL and proceed with caution (and backups!) when I get to deletions.

Cheers
David

Andrew Jones

unread,
Feb 24, 2020, 4:21:32 AM2/24/20
to arches...@googlegroups.com
I would recommend that you use the django model to delete the resources. Like you, we were not confident that we knew enough about the database structure to confidently execute SQL without leaving various elements all over the place. You also need to update the index etc.

I would write a command that allows you to pass in a list of UUIDs (.json UUID list would be easiest) and use the django model functions to delete the Resource objects.  The following command is TOTALLY untested but is what I'm thinking about...

import os
import json
from django.core.management.base import BaseCommand, CommandError
from arches.app.models.resource import Resource

class Command(BaseCommand):

    def add_arguments(self, parser):

        parser.add_argument('operation', nargs='?',
            choices=['delete_resources',],
            help='Operation Type; ' +
            '\'delete_resources\'=Deletes resources from the database')

        parser.add_argument('-s', '--source', action='store', dest='source', default='',
            help='Json file containing Resource instance UUIDs in an array')

    def handle(self, *args, **options):
        if options['operation'] == 'delete_resources':
            self.delete_resources(json_source=options['source'])
                                    
    def load_json(self, json_source):
        """
            json should be formatted ["",""]
        """
        json_data = open(json_source)   
        return json.load(json_data)

    def delete_resources(self,json_source)
        uuids = load_json(json_source=source)
        resources_to_delete = Resource.objects.filter(resourceinstanceid__in=uuids)
        for r in resources_to_delete:
            try:
                r.delete()
            except Exception as e:
                print(str(e))

Have a look at django management commands to see where this needs to be placed in your django project.

Andrew

Alexei Peters

unread,
Feb 24, 2020, 1:03:43 PM2/24/20
to Andrew Jones, Arches Project
Hi David,
I don't really have an ERD available, although someone on our team might.  I'll ask around.
-Alexei

Director of Web Development - Farallon Geographics, Inc. - 971.227.3173


--
-- To post, send email to arches...@googlegroups.com. To unsubscribe, send email to archesprojec...@googlegroups.com. For more information, visit https://groups.google.com/d/forum/archesproject?hl=en
---
You received this message because you are subscribed to the Google Groups "Arches Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to archesprojec...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/archesproject/3bcf5628-d30f-47f6-a825-9031fa7f6407%40googlegroups.com.

Adam Lodge

unread,
Feb 24, 2020, 5:26:04 PM2/24/20
to Andrew Jones, Arches Project, Alexei Peters
Andy,

To add a little help here, it’s best to think of the underlying physical data model as a relational model that relates relevant json documents.  As such, it’s a hybrid of relational model and document store.

The attached UML diagram describes the Django models and their relationships to each which, in turn, mostly mirror the underlying physical data model.  For your purposes, you’ll want to focus on ResourceInstance and TileModel.  The business data that lives a given Arches instance exists in json documents stored in the Tiledata.data column.  Further, as you can see from Alexei’s sql snippet, you may need to peel apart the json for individual “tiles” to identify the duplicates you are looking for.

Hoping this is of some assistance to you.

Adam Lodge

Martha S

unread,
Feb 25, 2020, 2:07:01 PM2/25/20
to Arches Project
I installed DBeaver Community Edition (https://dbeaver.io/) to generate an ERD. Here is a copy I just exported to PNG and converted to PDF. I hope it helps.

Martha 


On Monday, February 24, 2020 at 10:03:43 AM UTC-8, Alexei Peters wrote:
Hi David,
I don't really have an ERD available, although someone on our team might.  I'll ask around.
-Alexei

Director of Web Development - Farallon Geographics, Inc. - 971.227.3173

-- To post, send email to arches...@googlegroups.com. To unsubscribe, send email to arches...@googlegroups.com. For more information, visit https://groups.google.com/d/forum/archesproject?hl=en

---
You received this message because you are subscribed to the Google Groups "Arches Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to arches...@googlegroups.com.
HPLA Database 2-25-2020.pdf

David Osborne

unread,
Feb 28, 2020, 10:57:02 AM2/28/20
to Arches Project
Thanks, Martha, that will be useful!

David

David Osborne

unread,
Feb 28, 2020, 11:23:13 AM2/28/20
to Arches Project
Thanks to everyone who posted suggestions to help with our problem of duplications. In the end, we simply deleted the few duplicate resource records using the web user interface: there weren't as many duplicates as I first thought.

That left the duplicate resource-to-resource link records, of which we had over 15,000. Once I found that these are contained in the resource_x_resource database table, I checked them using the following SQL

SELECT resourceinstanceidfrom, resourceinstanceidto, COUNT(*)
FROM resource_x_resource
GROUP BY resourceinstanceidfrom
, resourceinstanceidto
ORDER BY resourceinstanceidfrom
, resourceinstanceidto;

and deleted the duplicates using

DELETE FROM resource_x_resource
WHERE resourcexid IN
(
SELECT A
.resourcexid
  FROM resource_x_resource A
 WHERE EXISTS
(SELECT B.resourcexid
                 FROM resource_x_resource B
                WHERE B
.resourceinstanceidfrom = A.resourceinstanceidfrom
                  AND B
.resourceinstanceidto = A.resourceinstanceidto
                  AND B
.resourcexid < A.resourcexid));

Having noticed that the relationshiptype field is simply text, I replaced all the incorrect 00000000-0000-0000-0000-000000000007 values with an appropriate CIDOC CRM property

UPDATE resource_x_resource SET relationshiptype = 'http://www.cidoc-crm.org/cidoc-crm/P67_refers_to' WHERE relationshiptype='00000000-0000-0000-0000-000000000007';

Finally, a run of
python manage.py es index_resource_relations
made those changes visible in the resource reports in the user interface.

David
Reply all
Reply to author
Forward
0 new messages