Best practices around archiving working copies and managing change history?


cbur...@healthwise.org

Sep 9, 2019, 4:46:36 PM
to TopBraid Suite Users
I hope anyone reading this can share their experiences with managing similar issues...

We have several enterprise EDG projects (taxonomies and ontologies) whose ".tch" change history graphs are getting very large, since they have been actively edited for several years now. Separately, two of our projects each have a large number (1000+) of committed but un-archived workflows. These two conditions seem to be slowing down performance in several areas:
  • Loading the list of workflows
  • Loading the change history for an individual resource (this is painfully slow now)
  • Sending these projects to another server (if we do choose to also send the .tch files)
  • Updating the projects available to Explorer users. (EDG seems to automatically send the .tch file when you do this, but if it were up to me I would choose not to, since the change history is not relevant to our Explorer users.)
I am considering different methods of archiving working copies and reducing the size of our change history graphs, including:
  • Activating the "Archive Working Copies on Commit" option... but I have several questions around what this entails:
    • Will doing this automatically archive all our already-committed workflows? If not, is there a way to do that in a batch? (I can't find one.)
    • Will doing this reduce the size of the .tch graph in any way? Or does it just remove the workflows from the UI?
  • Manually editing the .tch graphs to remove all changes of particular types, or made before a certain date.
    • Is there any easy way of doing this? (I can only think of exporting the graphs, editing them in TBC or a text editor, and then re-importing them.)
It would be nice if there were a built-in setting to "only keep the last x years of changes in the change history" or something similar. It would also be nice if there were a way to only "keep forever" a particular type of change data: who made which edits to a project and when. Over the long term, other types of change data (e.g. details about workflows and which changes they contained) are not useful to us and could be deleted after a certain period (say, a year).

For example, suppose I start a new workflow for a taxonomy, add 30 altLabels to concepts, send the workflow on to a colleague for review, and then commit the changes to production. The only data I'd want to know forever by consulting the change history is the date, time, and creator of those 30 edits to altLabels -- not the fact that they were part of a certain workflow, or the facts that the workflow was created/transitioned/committed/archived by certain people at certain dates and times.

Thanks for any insight that either TQ users or team members can provide.

-Carl

Holger Knublauch

Sep 9, 2019, 11:18:31 PM
to topbrai...@googlegroups.com

Hi Carl,

I am not really a user but know the code behind those features.

As you know, the TCH graph stores the change records that make up each working copy/workflow. This graph-based approach means that anyone (with admin permissions) can manipulate that graph using SPARQL or scripts such as SWP. In SPARQL, someone could potentially issue an UPDATE request to delete all change records related to certain workflows. Assuming the .tch graph is the default graph, one low-level maintenance operation would be

DELETE {
    ?change ?p ?o .
    ?triple ?tp ?to .
}
WHERE {
    ?change a teamwork:Change .
    ?change teamwork:tag ?tag .
    ?change teamwork:added|teamwork:deleted ?triple .
    ?change ?p ?o .
    ?triple ?tp ?to .
}

which would delete any workflow-related teamwork:Change entries plus their linked triples (which are the bulk of data). This keeps the label and comment of the workflows themselves (instances of teamwork:Tag), and you may want to get rid of those too, e.g.

DELETE {
    ?tag ?p ?o .
}
WHERE {
    ?tag a teamwork:Tag .
    ?tag teamwork:status teamwork:Committed .
    ?tag ?p ?o .
}

to delete all committed workflows completely. This should significantly reduce the size of the TCH graphs.

In all of these queries you can add a FILTER, e.g. to restrict the match to workflows committed before a given date:

WHERE {
    ?tag a teamwork:Tag .
    ...
    ?tag teamwork:statusChange ?change .
    ?change teamwork:newStatus teamwork:Committed .
    ?change dcterms:created ?date .
    FILTER (?date < "2019-01-01"^^xsd:date)
}
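
Putting the pieces above together, a date-limited cleanup might look like the following sketch. It is not a tested script: the teamwork namespace URI is an assumption (check the prefix declared in your own TCH graph), the status-change variable is renamed to ?statusChange so it does not clash with the ?change used for teamwork:Change records, and the datatype of the date literal may need to match whatever dcterms:created actually stores in your data.

```sparql
# Assumed namespace for teamwork:, verify against your installation.
PREFIX teamwork: <http://topbraid.org/teamwork#>
PREFIX dcterms:  <http://purl.org/dc/terms/>
PREFIX xsd:      <http://www.w3.org/2001/XMLSchema#>

# Delete change records (and their linked triple descriptions) only for
# workflows whose status changed to Committed before the cutoff date.
DELETE {
    ?change ?p ?o .
    ?triple ?tp ?to .
}
WHERE {
    # Select the workflows committed before the cutoff.
    ?tag a teamwork:Tag .
    ?tag teamwork:statusChange ?statusChange .
    ?statusChange teamwork:newStatus teamwork:Committed .
    ?statusChange dcterms:created ?date .
    # If the stored values are xsd:dateTime, use a dateTime literal here.
    FILTER (?date < "2019-01-01"^^xsd:date)

    # Collect the change records that belong to those workflows.
    ?change a teamwork:Change .
    ?change teamwork:tag ?tag .
    ?change teamwork:added|teamwork:deleted ?triple .
    ?change ?p ?o .
    ?triple ?tp ?to .
}
```

Because the teamwork:Tag resources themselves are left in place, the workflow labels and comments remain after this runs; only the bulk change data is removed.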

I have shown the SPARQL also as a way to illustrate the format of workflows in the RDF data model.

When you open teamwork.ui.ttlx in TBC and look at teamwork:ArchiveChangesToFile you also see a script that can be called from the outside and probably does some of what you are interested in. If you need a variation of this, create a clone and call that.

In any case, before making any such calls, try them in a safe environment, e.g. from TBC-ME, not on the actual data!
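
One way to rehearse a cleanup without changing anything is to run a read-only SELECT over the same patterns first. The sketch below counts, per workflow, how many change records and linked triple descriptions the first DELETE above would touch. The namespace URI is again an assumption to verify locally.

```sparql
# Assumed namespace for teamwork:, verify against your installation.
PREFIX teamwork: <http://topbraid.org/teamwork#>

# Read-only: report how much change data each workflow contributes.
SELECT ?tag
       (COUNT(DISTINCT ?change) AS ?changes)
       (COUNT(DISTINCT ?triple) AS ?triples)
WHERE {
    ?change a teamwork:Change .
    ?change teamwork:tag ?tag .
    ?change teamwork:added|teamwork:deleted ?triple .
}
GROUP BY ?tag
ORDER BY DESC(?triples)
```

Running this before and after a test deletion on a copy of the data gives a quick check that only the intended workflows were affected.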

Just activating the "Archive Working Copies on Commit" option will not retrospectively archive existing committed workflows.

Holger
--
You received this message because you are subscribed to the Google Groups "TopBraid Suite Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to topbraid-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/topbraid-users/56722c17-4adc-4816-808f-c1f5869f61e5%40googlegroups.com.

cbur...@healthwise.org

Sep 10, 2019, 12:17:25 PM
to TopBraid Suite Users
Thanks, Holger. This is quite helpful.

What exactly does archiving an already-committed workflow do, other than removing it from the Workflows user interface?

Do you agree that it might be useful to have some of the workflow management functions we are discussing available to admins via a UI in EDG? I am considering opening a feature request ticket for those.

-Carl

Holger Knublauch

Sep 10, 2019, 7:29:06 PM
to topbrai...@googlegroups.com

On 11/09/2019 02:17, cbur...@healthwise.org wrote:

> Thanks, Holger. This is quite helpful.
>
> What exactly does archiving an already-committed workflow do, other than removing it from the Workflows user interface?

Archiving removes the metadata about which triples were added or deleted from the TCH graph. As a result, it later becomes impossible to track the origin of those triples (if that's of interest), but the footprint in the database becomes significantly smaller.

> Do you agree that it might be useful to have some of the workflow management functions we are discussing available to admins via a UI in EDG? I am considering opening a feature request ticket for those.

Yes, I would encourage you to file a ticket so that we can discuss specific ideas. I assume a feature to archive all committed working copies (up to a certain date?) would be useful, if it's not already exposed through the UI somewhere.

Holger

