Reference collections

Guus Lange

Aug 1, 2011, 10:46:13 AM

to caa-sema...@googlegroups.com

Hi all,

We are presently on the verge of SKOSifying/CRMificating our Dutch Archaeological Basic Register (ABR), which contains all object type names classified in a (poly)hierarchy. This sounds easier than it is. I am puzzled by several crucial issues due to lack of knowledge. I have been talking to some of you about this, asking for advice and guidance. Because the issue is perhaps relevant to others as well, I will try to restate it here, in the hope that we can solve it together.

It all started some years ago when we discussed the role of Reference Collections as one of the central knowledge bases of (traditional, uhh, analogue) archaeology: http://www.cultureelerfgoed.nl/sites/default/files/u4/RefColl_web2.pdf

From that discussion grew the notion that Reference Collections, in one way or another, should also play a central role in the retrieval of relevant data from (digital) databases and unstructured texts. Allow me this illustration: it does not matter what somebody has been inputting, recently or a long time ago, as long as all the different notations and different scientific insights (classifications) are kept available at a central facility for linking and comparison. This central facility, with all the knowledge of the reference collections stored in it, can then be seen as a sort of lavatory for terms, delivering "clean" (in terms of Precision & Recall) answers from "filthy" (= unstandardized) data structures and equally "filthy" (= imprecise) questions. That is the ideal.
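As a rough illustration of the "central facility" idea, the following Python sketch keeps one label index that maps every known notation of a type to a single concept identifier, so that records entered with different notations resolve to the same concept at query time. The concept identifiers, the variant notations and the sample records are all invented for illustration; none of them is taken from the ABR.

```python
# Hypothetical label index: every notation ever used for a type points to
# one concept identifier (identifiers and variants are made up).
LABELS = {
    "abr:dressel20": ["Dressel 20", "Dr. 20", "DR20"],
    "abr:haltern70": ["Haltern 70", "H70"],
}

def normalise(text):
    """Lower-case and keep only letters and digits so notations compare loosely."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

# Reverse index: normalised notation -> concept identifier.
INDEX = {normalise(label): concept
         for concept, labels in LABELS.items()
         for label in labels}

def concept_for(term):
    """Return the concept identifier for a notation, or None if unknown."""
    return INDEX.get(normalise(term))

def search(records, query):
    """Records whose recorded type resolves to the same concept as the query."""
    wanted = concept_for(query)
    if wanted is None:
        return []
    return [rec for rec in records if concept_for(rec["type"]) == wanted]

# Invented records from sources that use different notations.
RECORDS = [
    {"id": 1, "type": "Dr. 20"},
    {"id": 2, "type": "dressel 20"},
    {"id": 3, "type": "H70"},
]

print([rec["id"] for rec in search(RECORDS, "DR20")])
```

The sources themselves stay untouched in this arrangement; only the label index has to be maintained.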

The CIDOC CRM emerged as a promising and obvious environment. But, as I found out only recently, the CRM is precisely NOT developed to "...define any of the terminology appearing typically as data in the respective data structures,..." (Objectives of the CIDOC CRM, p. i of the Official Release); it aims at the data structures themselves. (Please correct me if I am wrong.)

As a logical step, perhaps, terminologies and thesauri are left to SKOS. In our case (ABR) this entails a severe loss of information, because all kinds of classificatory information need to be left out (the exact reason why is still not completely clear to me). I wonder if this is the right thing to do, given the more limited scope of SKOS compared to the CRM.

Should we not try to model reference material in the CRM? Evidently it is concepts we are talking about, not just terms.

For instance, if we take a Dressel 20, we talk about a "container made of ceramics [coarse ware {yellow/white}] for the transport of liquids [like fish sauce] from the Roman period" or something akin (for a much richer description see of course Paul Tyers, for instance: http://intarch.ac.uk/journal/issue1/tyers/DR20.html). In the ABR hierarchy a distinction is made between the Broader Terms of amphorae from the Mediterranean and those from Gaul. Amphorae as a whole belong to the BT Roman pottery, BT Wheel cast pottery, BT Ceramics.
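The broader-term chain described here can be written down as SKOS-style statements. The sketch below uses plain Python tuples as triples, with prefixed names as strings. skos:broader and skos:broaderTransitive are real SKOS properties, but the abr: identifiers are invented, the chain is only one reading of the description in this message, and the second parent of Dressel 20 is a purely hypothetical addition to show a poly-hierarchy.

```python
# Triples as (subject, predicate, object); abr: identifiers are hypothetical.
TRIPLES = [
    ("abr:dressel20", "skos:broader", "abr:mediterranean_amphorae"),
    ("abr:mediterranean_amphorae", "skos:broader", "abr:amphorae"),
    ("abr:gaulish_amphorae", "skos:broader", "abr:amphorae"),
    ("abr:amphorae", "skos:broader", "abr:roman_pottery"),
    ("abr:roman_pottery", "skos:broader", "abr:wheel_cast_pottery"),
    ("abr:wheel_cast_pottery", "skos:broader", "abr:ceramics"),
    # Second parent, invented only to illustrate a poly-hierarchy.
    ("abr:dressel20", "skos:broader", "abr:transport_containers"),
]

def broader(concept):
    """Direct broader concepts of a concept."""
    return [o for s, p, o in TRIPLES if s == concept and p == "skos:broader"]

def all_broader(concept):
    """Direct and indirect broader concepts (what skos:broaderTransitive expresses)."""
    seen = []
    queue = broader(concept)
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(broader(current))
    return seen

print(all_broader("abr:dressel20"))
```

A search for "Ceramics" can then be expanded to every concept that has abr:ceramics among its broader concepts, which is the kind of retrieval the classification is meant to support.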

We still believe that if we model the classification of all types of amphorae in the CRM, together with all the other typologies of archaeological material, we just might be able to develop this central facility, freeing ourselves, by the way, from having to model every source there is or will be.

My questions are:

1) Is this sketch a remotely accurate picture of the present situation (where am I going wrong)?

2) If so, what should we do? How can we model the Dressel 20 amphora and all other reference collections (Terra Sigillata or Samian ware has its own classification, and so has every tiny group of material) in the CRM, or should we leave it at SKOS (and why)?

Perhaps this is not the place to discuss these kinds of vague notions and ideas, but I know of no other or better one.

Thanks

Guus

Leif Isaksen

Aug 2, 2011, 4:48:20 AM

to caa-sema...@googlegroups.com

Hi Guus

A very interesting question and I hope this is very much the forum for it!

Here's the way I see it but if anyone disagrees (on either specifics
or generalities) then please pipe up!

SKOS and the CRM are in no way alternatives to one another and your
distinction is essentially correct. CRM gives you the vocabulary to
talk about 'stuff that really happened in the world': the people,
actions, spaces/places, times/periods and intersections of all four
('events'). As such, and although it is an abstract language itself,
it is essentially bound to talking about real instances of things
(Winston Churchill, Malta, etc.). In contrast, SKOS is a
metavocabulary for talking about abstract concepts: occupations,
amphora typologies, art styles, biological taxa etc. Essentially word
lists and thesauri and not the things themselves. The fact that these
metavocabularies, thesauri and instances and data sets all tend to get
referred to as 'ontologies' is yet another example of how bad we
'semanticians' are with semantics ;-)

The problem you're facing, as I see it, is that a reference collection
is in some ways both these things: it is a collection of real
instances that stand as proxies for abstract concepts. As a result
it raises a very interesting question as to how you should model them.
Pragmatically speaking, SKOS is probably what you are after.
Ultimately it's a lot easier for people to reference a single SKOS
term than a complex CRM description. However, if you really want to
get into some heavy inferencing (and have enough faith in the data to
support it) then you could conceivably model all those instances and
then set a skos:definition property to the relevant set of CRM-encoded
triples. Or you could simply define it with some images and
descriptive terms - a lot will depend on who you think will use it and
how.

As ever though, the key thing that you really want are stable and
persistent URIs. If they aren't then no-one will be able to reference
any of it and all the work will be in vain.

Hope that helps? What do others think?

L.

Matthias Lang

Aug 2, 2011, 7:03:09 AM

to caa-sema...@googlegroups.com

Hi Guus,

I support Leif's opinion. I think the CRM is not the right tool to model classifications, because you are dealing with real things not with abstract classes and relations. So we decided to use SKOS vocabularies to put some flesh on the bones of our CIDOC-based model. Another reason for this decision was a practical one: it is quite easy and fast to build big vocabularies, even for archaeologists who are not used to semantic-web stuff. It is also very easy to exchange thesauri. At the moment we are developing a versioned web-based platform for SKOS vocabularies and an easy-to-use SKOS editor. We are highly interested in discussions and collaborations on SKOS vocabularies.

Regards Matthias

Dr. Matthias Lang,
Archäologisches Institut und Sammlung der Gipsabgüsse
Georg-August-Universität Göttingen
Nikolausberger Weg 15
D-37073 Göttingen
Telefon: (0551) 39-7506
Fax: (0551) 39-22062
Mobil: (0151) 27006392
Mail: Matthi...@phil.uni-goettingen.de

Reinhard Foertsch

Aug 2, 2011, 10:32:59 AM

to caa-sema...@googlegroups.com

Hi,

I agree with Leif pointing out that also with SKOS you are (or should be) in the world of higher abstraction of certain sectors that carry the necessity for this.

One reason for Conceptual Reference Models, though, was to reduce diversity-overkill. Nothing against terminological biodiversity in the humanities, but it's an ongoing process to define the corridor of viability between unilateral prescriptiveness and heating up the diversity-overkill. The latter was never very helpful, not today and not in the pre-computing era. There are positions, though, that try to see it as a constituent of the humanities, I know, but they never convinced me.

And while SKOS is technically able to handle many vocabularies, the semantic handling becomes more questionable with each new vocabulary. The same is naturally true of mapping different things to the same CIDOC-CRM, but my feeling is that hyper-classification sets in earlier when aligning different vocabularies in SKOS.

So the use of SKOS as a sort of simplifier of certain aspects, as Leif put it, makes great sense to me. But seeing SKOS as a tool to build big vocabularies in a short time, and as an encouragement to increase their number, isn't that a way of figuring out how to take the medicine while carrying on with the disease?

Reinhard

Vlachidis Andreas

Aug 4, 2011, 8:24:57 AM

to caa-sema...@googlegroups.com

Hi all,
Leif is right: the two models are not alternatives, but they can be understood as complementary. I usually distinguish the two by referring to the CRM as the ontological model, which describes the "world", and to SKOS as the terminological model, which describes the "words". In the STAR project (http://hypermedia.research.glam.ac.uk/kos/star/) we have used both models, and the STAR demonstrator enables cross-searching by utilizing both CRM-EH (an extension of the CRM for archaeology) and SKOS. In order to link the two models we have used, in the absence of a standard, a project-specific property which is given the label "is_represented_by". Thus an instance of the concept "EHE0007.Context", which is an extension of the CRM concept "E53.Place", "is_represented_by" a terminological reference, for example "concept#91930", which links to a specific thesaurus/glossary.

I found this arrangement particularly helpful when dealing with polysemy in an information extraction context, especially with what I refer to as ontological polysemy. There are cases, for example, where a particular lexical instance, e.g. "brick", could be instantiated as an E19.Physical Object or as an E57.Material, depending on the contextual evidence of a sentence. Therefore, in one case the lexical instance can be modelled as E19, which is_represented_by a terminological entry of an Objects Thesaurus, and in another case as E57, where the instance is_represented_by a terminological entry of a Material Thesaurus. Even where a single terminological resource provides the terminological references, e.g. an Archaeological Thesaurus with only a single terminological definition for "brick", the distinction between Physical Object and Material can still be made on the ontological level. In addition, the use of the two models assists in the modelling of negated findings. For example, in the phrase "no evidence of coins" we might choose to model "coins" as a negated find, not as a find. However, whether or not there is evidence of coins, the lexical instance "coins" does occur in the text. Thus, by linking the lexical instance to a terminological reference we enable an information retrieval application to show results which are represented by "coin", regardless of whether such coins are negated.
You can have a closer look at examples from grey literature reports at http://andronikos.kyklos.co.uk
Here is a STAR RDF example of an EHE0007.Context (that is, an archaeological context, an extension of CRM E53.Place) which is represented by the terminological reference "field drain":

<crmeh:EHE0007.Context rdf:about="http://tempuri/star/base#aberdeen3-34973_1.36462">
   <dc:source rdf:resource="http://tempuri/star/base#aberdeen3-34973_1"/>
   <dc:source rdf:resource="http://tempuri/star/base#ehe0001.oasis"/>
   <crm:P2F.has_type>
      <crm:E55.Type>
         <rdf:value>field drain</rdf:value>
         <crmeh:EXP10F.is_represented_by rdf:resource="http://tempuri/star/concept#91930"/>
      </crm:E55.Type>
   </crm:P2F.has_type>
</crmeh:EHE0007.Context>
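
The retrieval behaviour described in this message (one term, two ontological readings, and negated finds that remain searchable) can be sketched in Python. Each annotation pairs an ontological class with a terminological reference, in the spirit of the is_represented_by link. The Annotation record, the negated flag and the concept identifiers for "brick" and "coin" are assumptions made for illustration and do not reproduce the STAR data structures.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    text: str            # lexical instance as found in the report
    crm_class: str       # ontological reading, e.g. "E19.Physical Object"
    represented_by: str  # terminological reference (hypothetical identifiers)
    negated: bool = False

ANNOTATIONS = [
    Annotation("brick", "E19.Physical Object", "concept#brick"),
    Annotation("brick", "E57.Material", "concept#brick"),
    Annotation("coins", "E19.Physical Object", "concept#coin", negated=True),
    Annotation("coin", "E19.Physical Object", "concept#coin"),
]

def by_concept(concept, include_negated=True):
    """Annotations represented by a concept; negated finds are kept unless excluded."""
    return [a for a in ANNOTATIONS
            if a.represented_by == concept and (include_negated or not a.negated)]

def by_class(concept, crm_class):
    """Narrow a terminological match to one ontological reading."""
    return [a for a in by_concept(concept) if a.crm_class == crm_class]

print(len(by_concept("concept#coin")))                         # 2
print(len(by_concept("concept#coin", include_negated=False)))  # 1
print([a.crm_class for a in by_class("concept#brick", "E57.Material")])
```

The terminological link answers "where is this term mentioned at all", while the class and the negation flag let an application refine the result afterwards.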

hope that helps

Andreas

Guus Lange

Aug 4, 2011, 6:14:47 PM

to caa-sema...@googlegroups.com, j.vande...@cultureelerfgoed.nl, fvande...@gmail.com, l....@cultureelerfgoed.nl, r.oo...@cultureelerfgoed.nl

Dear Leif, Matthias, Reinhard and Andreas

Thank you very much for your illuminating answers to my questions (and for the invitation, Matthias, which we will very gladly accept, once we have got our kit together, I reluctantly have to add...). Co-operation at the present level is something I (we) dearly value and appreciate.

We have been discussing your remarks internally in our small task force (they are the names in the cc). Perhaps it is practical first to give some comments back to you and see where that leads us, OK? Some of the remarks are mine personally, some stem from these discussions.

First to Reinhard. I am not sure if I understand you right (funny: this is all about understanding, isn't it?), but it is also our feeling that SKOS, being too poor, might not be the final solution for ending the semantic chaos when trying to align several thesauri in one environment. As it is, however, there seems to be no other solution available (except for the CRM?). Fortunately Matthias and Leif are rather positive. This gives hope.

But, if I am right, you also address the practice in the humanities that causes all these semantic troubles: (in my words) the sheer unbounded fantasies of the scientists, the resulting concepts and ultimately their language. I think we have to cope with that ever-growing diversity of terms and meanings as a fact of life (fortunately it is one!). As I mentioned in my original post, we have this rather simplistic model that is a bit similar to that of the Encyclopaedists of the 18th century: only, we should not have the ambition to collect all knowledge in one system, but "only" to make it available through one system. This of course is also similar to the goals of Linked Data and the Semantic Web.

There seem to be two roads to follow: either one changes the language used in the sources, texts and databases themselves, or adds translation metadata to the existing data; or, the other way, one translates the originally used language on the fly, only when needed. The latter is more what we had in mind: the sources are left as they are, and for maintenance one only touches the translator. Unfortunately the present solutions all seem to opt for the former way: transforming the sources by adding information to them. I am convinced it can be successful (and this has been proved already, for instance by CLAROS). It is, however, a lot of work to model each source (database/text) separately in this way, and there is more to it: the user of the stored information still has to have some knowledge of how the system was modelled, and each source can be modelled (slightly) differently. We therefore prefer the development of a digital apparatus with much the same function as an analogue encyclopaedia, where vague and ambiguous concepts are put in, interpreted and disambiguated, and which gives various but precise answers as output. Our hope was that the CIDOC CRM could play the central role here. Perhaps it could, right? One of the questions is what exactly the nature of our classificatory schemes is. As Leif rightly pointed out, they seem to be a mix of both thesauri (vocabularies, terminologies) and descriptions of real world objects.

I am not a linguist and what follows is not hindered by any knowledge of the subject. But if we confine ourselves to the domain of archaeology, I think our language is always classificatory. By that I mean that we speak to each other in concepts and classes, like warm and cold, good or bad, or about archaeological types like Dressel 20. However, there is not one Dressel 20 to refer to as THE Dressel 20 (as in Natural History); it is a concept that has been developed through more than 100 years of experience, analysis, thinking and discourse about a certain recurring phenomenon, in this case Man-Made Objects, examples of which together form a reference collection. This process makes the archaeological type very rich in facets and characteristics, which set it apart from other types. In this sense a type has dimensions, colour, form and surface treatment, like the real world objects it is constructed from. As Leif points out, CIDOC-CRM is about real world observations and not about abstract concepts. For the latter we use SKOS. (Matthias, I guess you unintentionally mixed up the two in your first sentence; it became very clear in the following part what you meant.)

Matthias’ remark about adding flesh to the bones of the CRM with SKOSified thesauri is very illuminating and makes a lot of sense. The problem remains the limited scope of (at least our) SKOS solution, which throws away a lot of information (this could, however, be a problem of our local SKOS implementation and of my limited knowledge of it). It seems a bit awkward to have a very intricate data model in CIDOC CRM and then add only rather meagre meat to the bone (if one goes for meagre, that’s OK of course).

So I think, coming back to Leif, we could eventually try to model our archaeological types (developed from real world reference collections) in CIDOC CRM, although they are abstractions of real world phenomena (adding images and tons of extra context).

But having said that, I realise that I am still in the dark about how we actually model the Things in practice. Here Andreas came in: he shows a nice example and adds a whole new dimension to the issue by introducing the negation argument. That CIDOC CRM is probably not the end of the line is shown by the way Andreas models “brick”, differently from what I would have expected (albeit only for the sake of argument). Apparently, even the CRM leaves much room for different views and interpretations.

Please have a look at the (fictional, AAT-like) structure below

Roman Period
  - Containers <according to Material - ceramics>
    - Ceramics <according to Function - transport>
      - Amphorae <according to Manufacture - coarse ware>
        - Mediterranean Amphorae <pointed bottom>
          - Dressel 20
          - Dressel 21
        - Gaulish Amphorae <flat bottom>
          - Haltern 70

In an earlier conversation Doug Tudhope and Ceri Binding suggested using the following (I hope they do not mind me citing them here):

E1_CRM_Entity --> P137_exemplifies --> E55_Type, for which the documentation refers to an associated ‘taxonomic role’ e.g. “prototypical”, “archetypical”, “lectotype”.

And:

E22_Man-Made_Object (Dressel 20) --> P45_consists_of --> E57_Material (ceramic)

E22_Man-Made_Object (Dressel 20) --> P101_had_as_general_use --> E55_Type (transportation)
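
Spelled out as statements about one exemplary specimen, the three CRM paths above might look like the following Python sketch, again with plain tuples as triples. The crm: class and property names are the ones cited in this message; the refcoll:, type:, material: and use: identifiers are hypothetical, and treating the E22 as a single specimen in a reference collection is an assumption of this sketch.

```python
# Hypothetical identifiers; only the crm: names come from the message above.
SPECIMEN = "refcoll:specimen_001"

TRIPLES = [
    (SPECIMEN, "rdf:type", "crm:E22_Man-Made_Object"),
    (SPECIMEN, "crm:P137_exemplifies", "type:dressel20"),
    (SPECIMEN, "crm:P45_consists_of", "material:ceramic"),
    (SPECIMEN, "crm:P101_had_as_general_use", "use:transportation"),
    ("type:dressel20", "rdf:type", "crm:E55_Type"),
    ("material:ceramic", "rdf:type", "crm:E57_Material"),
    ("use:transportation", "rdf:type", "crm:E55_Type"),
]

def describe(subject):
    """Collect predicate -> objects for one subject."""
    description = {}
    for s, p, o in TRIPLES:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description

def exemplars_of(type_id):
    """Specimens that exemplify a given type."""
    return [s for s, p, o in TRIPLES
            if p == "crm:P137_exemplifies" and o == type_id]

for predicate, objects in describe(SPECIMEN).items():
    print(predicate, objects)
```

In this reading the material and use are recorded on the specimen, and the type is reached only through P137_exemplifies.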

An alternative they proposed was to use SKOS to describe the (poly)hierarchical classification by means of ‘non-indexing’ terms for different characteristics, such as form, function or material, e.g. <containers by material>, <ceramics by function>, <amphorae by shape>, and then to use the CRM to describe the link to exemplary specimens. (To clarify, crm:E55_Type could also be viewed as skos:Concept in the diagram below.) Concepts such as “Dressel 20” may then potentially be a member of multiple groups.
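
A minimal Python sketch of this alternative follows. It assumes the 'non-indexing' terms are expressed as skos:Collection resources with skos:member links, which is one common SKOS way of handling such grouping labels. skos:Collection, skos:member, skos:prefLabel and crm:P137_exemplifies are existing terms; every abr: and refcoll: identifier, and the decision to make Dressel 20 a direct member of all three groups, is invented for illustration.

```python
# Triples as (subject, predicate, object); identifiers are hypothetical.
TRIPLES = [
    ("abr:containers_by_material", "rdf:type", "skos:Collection"),
    ("abr:containers_by_material", "skos:prefLabel", "containers by material"),
    ("abr:ceramics_by_function", "rdf:type", "skos:Collection"),
    ("abr:ceramics_by_function", "skos:prefLabel", "ceramics by function"),
    ("abr:amphorae_by_shape", "rdf:type", "skos:Collection"),
    ("abr:amphorae_by_shape", "skos:prefLabel", "amphorae by shape"),
    ("abr:dressel20", "rdf:type", "skos:Concept"),
    ("abr:containers_by_material", "skos:member", "abr:dressel20"),
    ("abr:ceramics_by_function", "skos:member", "abr:dressel20"),
    ("abr:amphorae_by_shape", "skos:member", "abr:dressel20"),
    # CRM side: an exemplary specimen points at the same concept.
    ("refcoll:specimen_001", "crm:P137_exemplifies", "abr:dressel20"),
]

def objects(subject, predicate):
    """Objects of all triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

def subjects(predicate, obj):
    """Subjects of all triples with the given predicate and object."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]

def groups_of(concept):
    """Labels of every collection the concept is a member of."""
    return sorted(label
                  for collection in subjects("skos:member", concept)
                  for label in objects(collection, "skos:prefLabel"))

print(groups_of("abr:dressel20"))
print(subjects("crm:P137_exemplifies", "abr:dressel20"))
```

Here the classification lives entirely on the SKOS side, and the CRM contributes only the link from a physical specimen to the concept it exemplifies.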

To sum up, the bottom line remains: we should use SKOS to model the classifications and could add the CRM for modelling the actual reference objects/collection. We will try to show you that one day.

Leif’s remark “…then you could conceivably model all those instances and then set a skos:definition property to the relevant set of CRM-encoded triples” leaves me a little puzzled as to what that would look like in real code. It would help tremendously if someone could explain this and has an example available of a modelled set of archaeological classificatory terms (like the one above, perhaps), preferably in both CIDOC CRM and SKOS as far as it goes. In any case we will also try ourselves.

Thank you for bearing with me on this path of puzzles

Guus

(PS Unfortunately I cannot follow up immediately on this discussion, being off to Italy until the 29th of Augusto)

Guus Lange

Aug 5, 2011, 4:11:32 AM

to caa-sema...@googlegroups.com, j.vande...@cultureelerfgoed.nl, fvande...@gmail.com, l....@cultureelerfgoed.nl, r.oo...@cultureelerfgoed.nl

Oops!
That was quite some bits....!
Sorry
I meant to send this (thanks to Ceri)

https://webxsocw.secure-access.biz/owa/,DanaInfo=.awfdpenrGwl6Kx1qp1,SSL+attachment.ashx?id=RgAAAACi6LM8WwTmR4%2bYoQTgplvRBwDfKmNIcTENQrwwk2r85TCLAAAAFJJ8AACMGTaOqb%2bpSq2NXN5vpOXoABl2UbZgAAAJ&attcnt=1&attid0=EACgIZMdFUjtRb0LSqPRpuS9

Guus

Leif Isaksen

Aug 5, 2011, 4:29:48 AM

to caa-sema...@googlegroups.com, j.vande...@cultureelerfgoed.nl, fvande...@gmail.com, l....@cultureelerfgoed.nl, r.oo...@cultureelerfgoed.nl

Hi Guus

Great conversation :-)

some very quick thoughts:

- That AAT thesaurus looks like a perfect candidate for SKOS to me.
It's dealing with abstract terms, rather than concrete instances.

- One way of tying SKOS to CIDOC CRM would be to use skos:definition
to refer to either a file or triplestore subgraph that contains the
collection of triples that describe the collection in CRM. So
something like:

<http://concepts.culturefeelgood.nl/skos_concept_a> skos:definition
<http://collections.culturefeelgood.nl/reference_collection_a.rdf>

I'm not sure if that's making a high level 'semantic jump' though and
would be keen to know what others think.

Alternatively, I think there are a number of pre-established ways of
referring to SKOS from the CRM and presumably it's possible to
navigate that link in both directions. I'll leave the experts to
comment on that :-)
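
As a sketch of what following such a link in both directions could look like, the Python below stores the CRM description as a small named sub-graph keyed by the document URI and keeps the SKOS-side link separately. It uses skos:definition, which is one of the SKOS documentation properties. The two URIs are the illustrative ones from this message, the CRM triples inside the sub-graph are invented, and a real triplestore would of course replace the dictionary.

```python
# Illustrative URIs from the message; the stored CRM triples are invented.
CONCEPT = "http://concepts.culturefeelgood.nl/skos_concept_a"
COLLECTION_DOC = "http://collections.culturefeelgood.nl/reference_collection_a.rdf"

# Named sub-graphs: document URI -> CRM triples held in that document.
GRAPHS = {
    COLLECTION_DOC: [
        ("refcoll:specimen_001", "rdf:type", "crm:E22_Man-Made_Object"),
        ("refcoll:specimen_001", "crm:P45_consists_of", "material:ceramic"),
    ],
}

# SKOS side: the concept points at the document that holds its CRM description.
LINKS = [(CONCEPT, "skos:definition", COLLECTION_DOC)]

def crm_description(concept):
    """Follow the link from a concept to the CRM triples that describe it."""
    triples = []
    for s, p, o in LINKS:
        if s == concept and p == "skos:definition":
            triples.extend(GRAPHS.get(o, []))
    return triples

def concepts_described_by(document):
    """Follow the same link in the other direction."""
    return [s for s, p, o in LINKS if p == "skos:definition" and o == document]

print(len(crm_description(CONCEPT)))
print(concepts_described_by(COLLECTION_DOC))
```

Whichever property carries the link, this only works as long as both URIs stay stable.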

Best

L.

Vlachidis Andreas

Aug 5, 2011, 9:13:22 AM

to caa-sema...@googlegroups.com

Hi,
Indeed a very interesting discussion. From my experience of working with unstructured text I have found that it is worth keeping the two models separate, using SKOS for classification and the CRM for actual objects/collections. However, I believe that the "actuality" of objects in unstructured text can be vague. There are cases, for example museum catalogues, where the text refers to actual objects of a collection, so an NLP process can extract both the type of an object, e.g. Amphora, and its catalogue reference, thereby identifying an E22 Man-Made Object. There are other cases, for example excavation reports, where authors refer to finds and contexts but it is not always clear exactly which instances they refer to. For example, in the phrase "The first phase was a large, discrete, cluster of 22 pits, dating from the Late Neolithic/Early Bronze Age" the author does mention 22 pits, which are excavation contexts; however, we cannot be sure which pits these are, since there is no geospatial information or any other reference which we can use to specify the actual instances. Possibly a few paragraphs further on the author gives this information, but in terms of NLP it is extremely hard, if not impossible, to resolve. There is also another aspect which affects the "actuality" of instances. There are cases where the authors are not 100% sure about a finding and might say something like "possible ditch". It is not a negation and it is not a 100% confirmation. How to model this is not yet very clear to me, but it seems that the science of archaeology deals with uncertainties which are hard to model as precise instances.

Best wishes
Andreas
