concurrent development of large ontologies

Todd Detwiler

Apr 6, 2015, 5:50:49 PM

to fma-ow...@googlegroups.com

A recent problem for the FMA team has been how to continue to support
concurrent development now that we are adding assertions directly in
OWL. The attached document is just some internal notes that I made to
document the methods we've tried or are trying now. I am posting them
here, in case anyone is curious what routes we've attempted. We'd be
very interested to know what others are using for concurrent OWL
development.
Thanks,
Todd

ConcurrentOntologyEditting.pdf

Chris Mungall

Apr 6, 2015, 6:28:57 PM

to fma-ow...@googlegroups.com

Hi Todd,

Yes, one of the main observations in my post was that the
combination of the RDF/XML serialization *and* axiom annotations was
particularly bad. Note that most of us keep the source in functional
syntax (or obo; FMA actually fits in the obo subset AFAICR, and the next
beta of Protege 5 should save obo). With functional syntax it's still
necessary to coordinate people to make sure they use the same version of
Protege.

The blog post you mention got some attention, and there is definitely
more widespread interest in making OWL serializations more VCS friendly.

Re: your comments in the doc about modules. In the past Alan has talked
about a modularization strategy that would help with individual file
sizes and may have other advantages on top. I believe this was
modularization in terms of the biology, e.g. a heart module. Another
approach would be to slice by axiom type: labels in one file, SubClassOf
axioms in another, with a lightweight script to occasionally organize
everything into the right place (axioms have a habit of going into the
working ontology in Protege). Yet another approach would be to keep your
combinatorial classes somewhere else. I expect you don't often make
manual edits on classes like 'Compact bone of proximal epiphysis of
proximal phalanx of left index finger' using the standard Protege
editing interface (is the workflow for these documented? I assume
there was some automation at some point?).
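
As a rough illustration of the "slice by axiom type" idea, here is a minimal Python sketch. It assumes the source is kept in functional syntax with one axiom per line, and it matches on text only; a real script would more likely go through the owlapi. The function name, the bucket names and the sample axioms are all made up for the example.

```python
from collections import defaultdict

# Hypothetical mapping from axiom keyword to module name; anything
# not listed here falls into a catch-all "other" bucket.
BUCKETS = {
    "SubClassOf": "subclass",
    "AnnotationAssertion": "annotations",
    "Declaration": "declarations",
}


def split_by_axiom_type(lines):
    """Group functional-syntax axiom lines by their leading keyword."""
    modules = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword = line.split("(", 1)[0].strip()
        modules[BUCKETS.get(keyword, "other")].append(line)
    # Sort each bucket so the output is stable from run to run.
    return {name: sorted(axioms) for name, axioms in modules.items()}


sample = [
    "Declaration(Class(:Heart))",
    "SubClassOf(:Heart :Organ)",
    'AnnotationAssertion(rdfs:label :Heart "Heart")',
    "DisjointClasses(:Heart :Lung)",
]
modules = split_by_axiom_type(sample)
```

Each bucket could then be written to its own file under a root importer ontology. The sketch ignores prefixes, the `Ontology(...)` wrapper and any axiom that spans several lines.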

> --
>
> --- You received this message because you are subscribed to the Google
> Groups "fma-owl-2009" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to fma-owl-2009...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
> [ConcurrentOntologyEditting.pdf]

Todd Detwiler

Apr 7, 2015, 12:46:14 PM

to fma-ow...@googlegroups.com

Hey Chris,
Thank you for responding. In your forum post you recommended moving away
from OBO format, in spite of its predictable serialization. So I have
not experimented with that particular serialization format. I did try
using functional syntax, but the owldiff tool had trouble loading it
(memory problems if I recall correctly) and the owl2diff tool couldn't
read it at all. I will take another look at functional syntax and see if
I made any errors (and try using the purely text-based diff tools).

Regarding modularizing the ontology, our problem has always been to
determine the correct modularization. To use an anatomical approach, for
example body regions like head, thorax, abdomen, upper limbs, ... seems
sensible. But someone working on the arterial system will be working
across modules. We have also considered the approach you suggested, to
break up the ontology based on axiom type. The problem with that is that
an author trying to flesh out all of the details of a particular class
would potentially touch all or many of the modules. This is a problem
only because:

1. As you suggested, new axioms tend to end up in the open ontology
(e.g. parent ontology in a network of imports) in Protege. To get each
axiom to go into the right place, to fill in the details of a class,
would potentially require each axiom module to be opened individually.
2. If all authors are touching all or most modules, we end up with a lot
more potential conflict management than you might expect from
modularized ontologies.

That said, we do think that modularization is a good idea. There are
just some lingering details that we have not worked out.

Anyway, I'm not being contrary, I'm just commenting on the difficulties
I've encountered. I appreciate your feedback/suggestions on this. I can
tell from your post that you've given this a great deal of thought and
effort. I will explore further.

Cheers,
Todd

Nolan Nichols

Apr 9, 2015, 12:45:41 AM

to fma-ow...@googlegroups.com

Hi Chris and Todd,

Possibly relevant to the conversation is a recent announcement from GitHub on large file support (https://git-lfs.github.com/).

It's unclear whether it will produce the same kind of diffs for large files as for small ones, but I've found GitHub to provide a great workflow for curating terms in a community setting using desktop Protege, one that allows for line-by-line commenting and discussion before merging a change. Granted, the files are much, much smaller, but the notifications and the workflow itself are quite nice.

GitHub currently cuts off support at 100MB, so I trimmed down the FMA a bit to work up a trivial example of editing a large 90MB+ OWL file and commenting on it, which appears to show the kind of diffs that I have found useful: https://github.com/nicholsn/fma/pull/1/files

Also of potential interest is git-annex (http://git-annex.branchable.com/), which supports large files, though I'm not sure what the diffs look like with it.

I'm sure these solutions fall short (and are too new to really use!), but I'm curious to hear your thoughts, recognizing that they don't address the modularity and axiom placement issues.

Cheers,

Nolan

Chris Mungall

Apr 9, 2015, 1:10:35 AM

to fma-ow...@googlegroups.com

Thanks Nolan

I'm pretty excited about LFS in github. It will be extremely convenient
for a lot of the large derived files we (wider OBO world) tend to keep
in VCS.

Thanks for the test. This particular example looks rather lovely.
However, because you've chosen an RDF-based serialization, I'll bet
you'd see random huge spurious diffs once you start doing things like
axiom annotation modification, which is unfortunate. But this is
something that can really only be addressed at the owlapi end, and they
are working on it, just without as many resources as we'd like.

Chris Mungall

Apr 9, 2015, 1:21:53 AM

to fma-ow...@googlegroups.com

On 7 Apr 2015, at 9:46, Todd Detwiler wrote:

> Hey Chris,
> Thank you for responding. In your forum post you recommended moving
> away from OBO format, in spite of its predictable serialization.

I've spent a lot of time over the last few years helping people migrate,
but one thing we really got right with obo was developing a
serialization that works in conjunction with a VCS, including
visualizing diffs easily without additional layers. Over its history GO
has frequently had multiple concurrent editors, with some working on
branches that are later merged in.

> So I have not experimented with that particular serialization format.
> I did try using functional syntax, but the owldiff tool had trouble
> loading it (memory problems if I recall correctly)

odd.

> and the owl2diff tool couldn't read it at all.

anything based off the owlapi should handle it

I would be immediately suspicious of any owl-level tool not built off
the owlapi

> I will take another look at functional syntax and see if I made any
> errors (and try using the purely text based diff tools).
>
> Regarding modularizing the ontology, our problem has always been to
> determine the correct modularization. To use an anatomical approach,
> for example body regions like head, thorax, abdomen, upper limbs, ...
> seems sensible. But someone working on the arterial system will be
> working across modules. We have also considered the approach you
> suggested, to break up the ontology based on axiom type. The problem
> with that is that an author trying to flesh out all of the details of
> a particular class would potentially touch all or many of the modules.
> This is a problem only because:
>
> 1. As you suggested, new axioms tend to end up in the open ontology
> (e.g. parent ontology in a network of imports) in Protege. To get each
> axiom to go into the right place, to fill in the details of a class,
> would potentially require each axiom module to be opened individually.

well I was imagining a root importer ontology. yes, it would be terrible
to have to open each individually.

of course with a root importer it's easy for new axioms to end up here
accidentally, but in theory a bit of post-processing could take care of
this

> 2. If all authors are touching all or most modules, we end up with a
> lot more potential conflict management than you might expect from
> modularized ontologies.

perhaps. but if spurious diffs are minimized (as should hopefully happen
with this approach), VCS-level conflicts are rarer.

> That said, we do think that modularization is a good idea. There are
> just some lingering details that we have not worked out.

'lingering details' is probably an understatement :-(

> Anyway, I'm not being contrary, I'm just commenting on the
> difficulties I've encountered. I appreciate your feedback/suggestions
> on this. I can tell from your post that you've given this a great deal
> of thought and effort. I will explore further.

no worries, no contrariness detected, we're all in similar boats here
(though you're in the slightly deeper end as the FMA in OWL is that bit
larger than some of the others). For what it's worth, I'd recommend
commenting on some of the Protege/OWLAPI tracker issues, if only to let
people know this is important. A few tickets:

* https://github.com/owlcs/owlapi/issues/332
* https://github.com/owlcs/owlapi/issues/375
* https://github.com/owlcs/owlapi/issues/273

Todd Detwiler

Apr 9, 2015, 1:42:36 PM

to fma-ow...@googlegroups.com

I've read just a bit about LFS in Git. My understanding was that it was primarily for binary files, like images, with the intent of supporting better diff and conflict management than simply looking at the file version number. You can imagine having an image in a layered image format, like PSD. You and I both check out this image, and we both add a layer (say you add a filter and I add a text layer). It is conceivable that these could be merged, by including both layers, rather than having to choose either my version or yours. But the trick would be to have tools that know how to identify and merge the differences. And that is where I saw VCS falling down with large ontologies.

I was able to check the entire FMA into Git on BitBucket (i.e. it allowed files of the size I needed). And I was able to update and commit, so long as I was the only editor. But when it came to potential conflict resolution, requiring diff and merge, it fell flat on its face (as did our internal SVN server). One of the big issues was that some changes were not changes at all, at the ontology level, but were occurring during serialization (order differences). Another issue, related to the first, was that the diff was purely textual. We need diff and merge tools that understand ontologies. Tools that understand that two identical sets of assertions are the same regardless of order. Some exist, but were not sufficient. I found that either they were producing erroneous diffs or they hung on large ontologies (not to mention potential issues of integrating them with the VCS).
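
To make the "same regardless of order" point concrete, here is a small Python sketch that compares two serializations as sets of axiom lines rather than as ordered text. It assumes functional syntax with one axiom per line, and the function names and sample axioms are invented for the example. It is only a textual approximation: it knows nothing about blank nodes, about annotations reordered inside an axiom, or about equivalent axioms that are written differently.

```python
def axiom_set(text):
    """Return the non-blank, non-comment lines of `text` as a set."""
    return {
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def axiom_diff(old_text, new_text):
    """Compare two serializations as sets of axioms, ignoring order."""
    old, new = axiom_set(old_text), axiom_set(new_text)
    return sorted(new - old), sorted(old - new)  # (added, removed)


old = "SubClassOf(:Heart :Organ)\nSubClassOf(:Lung :Organ)\n"
# The same two axioms in a different order, plus one genuinely new axiom.
new = (
    "SubClassOf(:Lung :Organ)\n"
    "SubClassOf(:Liver :Organ)\n"
    "SubClassOf(:Heart :Organ)\n"
)
added, removed = axiom_diff(old, new)
```

Here only the `:Liver` axiom is reported as added and nothing is reported as removed, whereas a line-based diff would also flag the reordering.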

Thanks for pointing me to Git LFS. I will look more into the diff engine used in Git LFS. Perhaps it is more efficient and would be usable (even as a text-based diff) on large ontologies. But we will still need to ensure consistent serialization (as Chris mentioned in his response) or else fast text diff won't be enough.
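
One way to picture "consistent serialization" is a normalization pass run before every commit, so that a plain text diff only sees real changes. The Python sketch below is a rough illustration under several assumptions: the file is in functional syntax, each axiom sits on one line, the last non-blank line is the closing `)` of `Ontology(...)`, and comment lines can be dropped. The `normalize` function and the sample ontology IRI are hypothetical, and this is not how the owlapi itself orders its output.

```python
# Lines that must stay ahead of the axioms inside Ontology(...).
HEADER_PREFIXES = ("Import(", "Annotation(")


def normalize(text):
    """Sort the axiom lines of a functional-syntax document."""
    lines = [
        line.rstrip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # Everything up to and including the "Ontology(" line is kept as-is.
    start = next(i for i, l in enumerate(lines) if l.startswith("Ontology(")) + 1
    end = len(lines) - 1  # assumed to be the closing ")"
    # Leave imports and ontology-level annotations where they are.
    while start < end and lines[start].startswith(HEADER_PREFIXES):
        start += 1
    body = sorted(set(lines[start:end]))  # de-duplicate and sort axioms
    return "\n".join(lines[:start] + body + lines[end:]) + "\n"


a = (
    "Prefix(:=<http://example.org/#>)\n"
    "Ontology(<http://example.org/demo>\n"
    "SubClassOf(:Lung :Organ)\n"
    "SubClassOf(:Heart :Organ)\n"
    ")\n"
)
b = (
    "Prefix(:=<http://example.org/#>)\n"
    "Ontology(<http://example.org/demo>\n"
    "SubClassOf(:Heart :Organ)\n"
    "SubClassOf(:Lung :Organ)\n"
    ")\n"
)
same = normalize(a) == normalize(b)
```

With both editors running the same pass, `a` and `b` normalize to identical text, so the reordering never reaches the VCS.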

Todd

Todd Detwiler

Apr 9, 2015, 1:45:18 PM

to fma-ow...@googlegroups.com

owl2vcs is based on the OWLAPI, according to their docs. I should have
linked to that work:

https://github.com/utapyngo/owl2vcs

I cannot explain why it will not load all OWL serializations. But, that
was my experience.

Todd

Chris Mungall

Apr 9, 2015, 1:53:26 PM

to fma-ow...@googlegroups.com

I agree we definitely need better diff tools. But it's hard to make
these ubiquitous - I look at most diffs via the github web diff
(especially when the commit is linked to a tracker message), and it
would require negotiating with the github developers to integrate this
(I think they might be open to this). So it's always nice to have the
textual diff be as meaningful as possible. Hence the (still slow)
efforts on the owlapi front to make serializations more "obo-like"
(adding labels as comments etc)
