IAO questions

P. Def

Oct 20, 2009, 3:21:59 AM

to informatio...@googlegroups.com

Hi all,
I recently came across the IAO and am currently working out its hierarchical structure and the reasons why it was designed that way.
In particular, I have a few questions about the hierarchy under the IAO information_content_entity:

(1) a textual_entity (IAO_0000300) is defined as:

"A textual entity is a part of a manifestation (FRBR sense), a generically dependent continuant whose concretizations are patterns of glyphs intended to be interpreted as words, formulas, etc."

"textual entities live at the FRBR manifestation level. Everything is significant: line break, pdf and html versions of same document are different textual entities."

"a document as a whole is not typically a textual entity, because it has pictures in it - rather there are parts of it that are textual entities. Examples: The title, paragraph 2 sentence 7, etc."

textual_entity then has a series of subclasses, such as:
citation, author identification, institutional identification, caption, document title, table

(2) a document (IAO_0000310) is defined as:

"A collection of information content entities intended to be understood together as a whole, e.g. a journal article, patent application, laboratory notebook, or a book"

(3) a document_part (IAO_0000314) is defined as:

"An information content entity that is part of a document, e.g. an abstract, introduction, method or results section."

Firstly, according to the definition of textual_entity ("a document as a whole is not typically a textual entity, because it has pictures in it - rather there are parts of it that are textual entities"), it seems to me that the document as a whole could be regarded as a superclass that may contain a series of document_parts, which in turn may or may not contain textual_entities, depending on their type. So I was wondering why these 3 entities have been put on the same hierarchical level. A textual entity can always be regarded as part of a document (i.e. a document_part), which according to the principle of granularity is itself ultimately a document, so it would seem that the is_a relationship would hold?

Secondly, I was wondering why entities such as symbol, data item, label, and directive_information_entity are on the same hierarchical level as textual_entity, as opposed to being subclasses of it?

Thirdly, considering the subclasses of textual_entity, I am confused as to why it has been decided that it should "live at the FRBR manifestation level", since this would require all of its subclasses to live at the FRBR manifestation level as well. That seems slightly inconvenient to me: entities such as citation, author identification, document title, etc. appear to be content-related entities (as opposed to layout-oriented ones) and, in my view, should therefore live at the FRBR expression level rather than the manifestation level.

Besides, assuming that textual_entity cannot be considered a subclass of document or document_part, it would make more sense to me to have textual_entity and all its subclasses exist at the level of the FRBR expression, and to have the document entity and all of its subclasses exist at the level of the FRBR manifestation instead. A textual_entity would then acquire a format or a layout only once it has been incorporated into a particular document.

Finally, the concept of a narrative_object (IAO_0000006) defined as:
"A narrative object is an information content entity that is a set of propositions, e.g. reports, journal articles, and patents submission." is very confusing to me: the definition seems to coincide with that of "a document of literary nature", so it would seem to qualify as a class that should subsist as a subclass of document_part (and possibly as a superclass of the textual_entities it incorporates?)

And likewise, it would seem natural to me that report and study_interpretation, which are fundamentally more specific kinds of narrative_object, should be subclasses of narrative_object, with report_element being a subclass of report. But please correct me if I'm wrong.

Again, I am just starting to learn about the IAO. Please do not regard these questions as any kind of criticism, but only as an attempt to better understand the underlying structure of the IAO.

Larry Hunter

Oct 20, 2009, 11:25:24 AM

to P. Def, informatio...@googlegroups.com

Dear P. Def,

Thanks for your interest in the IAO. You raise reasonable questions,
and, since I am the author of most of the terms you question, I offer
my thoughts on them below.

On Oct 20, 2009, at 1:21 AM, P. Def wrote:

> Firstly, according to the definition of the textual_entity ("a
> document as a whole is not typically a textual entity, because it
> has pictures in it - rather there are parts of it that are textual
> entities"), it seems to me as if the document as a whole could be
> regarded as a superclass which may contain a series of
> document_parts, which in turn may or may not contain a series of
> textual_entities according to their type. So I was wondering what is
> the reason it has been decided to put these 3 entities on the same
> hierarchical level, even though a textual entity can always be
> regarded as part of a document (ie. a document_part), which
> according to the principle of granularity is in itself ultimately a
> document, so it would seem that the is_a relationship would stand ?

Not all textual entities are part of a document. A fragmentary note,
say, is a textual entity that is not part of any larger whole.

The hierarchical level of an entity in an OBO ontology is not, in
itself, meaningful. The subsumption hierarchy may well be deeper (or
more deeply specified) for some branches than others. There is no
implication that the relationship "at the same depth in the tree" is
meaningful in any way.
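
To make the point about depth concrete, here is a minimal sketch. It
assumes a toy is_a table built only from the terms named in this thread
(it is not an extract of the actual IAO file), and the helper names are
made up for illustration. Two classes can sit at the same depth without
either subsuming the other.

```python
# Toy is_a hierarchy assembled from terms mentioned in this thread.
# Illustrative only: this is NOT the real IAO class tree.
PARENT = {
    "textual_entity": "information_content_entity",
    "document": "information_content_entity",
    "symbol": "information_content_entity",
    "citation": "textual_entity",
    "document_title": "textual_entity",
}


def ancestors(term):
    """Return every superclass of `term`, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def depth(term):
    """Number of is_a steps from `term` up to the root of the toy tree."""
    return len(ancestors(term))


def is_subclass_of(child, parent):
    """True only if `parent` is reachable from `child` via is_a links."""
    return parent in ancestors(child)


# Same depth, yet no subsumption in either direction.
print(depth("symbol"), depth("textual_entity"))
print(is_subclass_of("symbol", "textual_entity"))
print(is_subclass_of("citation", "textual_entity"))
```

Only the is_a links carry meaning here; the depth is a by-product of
how finely a branch happens to have been specified.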

> Secondly, I was wondering why are entities such as symbol, data
> item, label, and directive_information_entity on the same
> hierarchical level as the textual_entity as opposed to being a
> subclass thereof?

Textual entity and friends are relatively new additions to the IAO
compared to those above, and there may well be errors in the
subsumption hierarchy with respect to them.

Symbol is a problematic term that will probably have to remain
primitive. Intuitively, there are symbols that are not
textual_entities (such as a standardized image, perhaps the symbol for
a resistor in a circuit diagram). Data items, which could include
images, are not solely textual_entities. Label strikes me as a possible
error; I think all labels have to be textual; can anyone think of a
counterexample? Directive_information_entity could be a diagram (such
as the ones you frequently get to explain the installation of a
technological device) and so is probably not a subclass of
textual_entity.

> Thirdly, considering the sub-classes of a textual_entity, I am
> confused as to why it has been decided that it should "live at the
> FRBR manifestation level", in that this would necessarily require
> all the sub-classes to live at the FRBR manifestation level as well,
> which seems slightly inconvenient to me, since entities such as
> citation, author identification, document title, etc appear to be
> more like a content-related kind of entities (as opposed to a layout-
> oriented kind) and, to my view, they should thus rather live at the
> FRBR expression level rather than the manifestation level.

This was a pragmatic decision so that we could have a clear criterion
for determining equality (and hence counts) of textual_entities. I
think this is only a slight inconvenience for the use cases you
mention, since it would be straightforward to state, for example, that
several citations (different manifestations) are all about the same
document. It was very hard to determine in the general case when two
FRBR expressions were the same (e.g. the HTML vs. PDF versions of a
document might contain different information). It is also the case
that even two identical manifestations might not be about the same
entity (consider distinct authors with a common name, or two different
documents that have the same title). If there were a use case where
FRBR expression level statements were clearly needed (and could be
clearly defined), I think we would consider adding IAO terms that
modeled them. Can you provide such an example and definitions? Any
definition should be specific enough to determine exactly how many
expressions there are in any situation.
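
A rough sketch of the "several citations are all about the same
document" idea follows. It is an assumption-laden illustration, not how
IAO represents anything: the TextualEntity class, the is_about field,
the file names and the "doc:1" style identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextualEntity:
    glyphs: str    # exact glyph pattern, formatting included
    source: str    # hypothetical label for where the glyphs occur
    is_about: str  # hypothetical identifier of the thing referred to


# Made-up example data: two differently formatted citations of one
# document, plus an identically formatted citation of another document
# that happens to share the title.
citations = [
    TextualEntity("Smith J. (2009) <i>On Genes</i>.", "paper-a.html", "doc:1"),
    TextualEntity("SMITH, J. On Genes. 2009", "paper-b.pdf", "doc:1"),
    TextualEntity("Smith J. (2009) <i>On Genes</i>.", "paper-c.html", "doc:2"),
]


def about(entities, referent):
    """Select textual entities by what they are about, ignoring glyphs."""
    return [e for e in entities if e.is_about == referent]


# Counting is unambiguous: every manifestation-level entity is distinct.
print(len(set(citations)))
# Retrieval goes through the aboutness link, not through glyph equality.
print([e.source for e in about(citations, "doc:1")])
```

The third entry is the case mentioned above: glyphs identical to the
first, yet about a different document, so glyph equality alone would
give the wrong answer.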

> Besides, assuming that a textual_entity cannot be considered as a
> subclass of a document or a document_part, then it would have more
> sense to me to have the textual_entity and all its subclasses exist
> at the level of the FRBR expression, while having instead the
> document entity and all of its subclasses exist at the level of the
> FRBR Manifestation, thereby allowing for a textual_entity to acquire
> a format or a layout only once it has been incorporated into a
> particular document.

Determining equality among expressions (and, hence, the ability to
count the number of expressions in some instance) is very difficult.
Often, the formatting and layout convey information (e.g. an italicized
vs. roman gene name often indicates species or gene vs. gene
product). How could you tell what expressions were the same without
knowing the layout or format?

> Finally, the concept of a narrative_object (IAO_0000006) defined as:
> "A narrative object is an information content entity that is a set
> of propositions, e.g. reports, journal articles, and patents
> submission." is very confusing to me, in that the definition seems
> to coincide with that of "a document of literary nature" and seems
> therefore to qualify as an entity/class that should theoretically
> subsists as a subclass of the document_part (and eventually as a
> superclass of the textual_entities it incorporates?)
>
> And likewise, it would seem natural to me that the report and the
> study_interpretation entities, which are fundamentally a more
> specific definition of a narrative_object, should ultimately subsist
> as a sub-class of the narrative_object, with the report_element
> entity being a subclass of the report class? but please correct me
> if I'm wrong.

Narrative_object, report and report_element are older terms that are
intended to be replaced by the document, textual_entity, etc. terms.
They have not been obsoleted yet because of dependencies in other
ontologies (e.g. in the OBI) whose update needs to be coordinated.
When that occurs, study_interpretation will become a subclass of
document.

> Again, I am just starting to learn about the IAO. Please do not
> regard these questions as any kind of criticism, but only as an
> attempt to better understand the underlying structure of the IAO.

Your questions and comments are much appreciated. IAO will meet the
needs of the community only to the extent the community participates
in its creation and maintenance. Your engagement in this process is
very welcome.

Larry

P. Def

Oct 24, 2009, 11:21:09 AM

to Larry Hunter, informatio...@googlegroups.com

Dear Larry,

Thanks for your detailed answers; I now have a better understanding of the structure of the IAO.

I would however like to follow up with two more comments or questions:

On Tue, Oct 20, 2009 at 5:25 PM, Larry Hunter <Larry....@ucdenver.edu> wrote:

> On Oct 20, 2009, at 1:21 AM, P. Def wrote:
>
>> Thirdly, considering the sub-classes of a textual_entity, I am confused as to why it has been decided that it should "live at the FRBR manifestation level", in that this would necessarily require all the sub-classes to live at the FRBR manifestation level as well, which seems slightly inconvenient to me, since entities such as citation, author identification, document title, etc appear to be more like a content-related kind of entities (as opposed to a layout-oriented kind) and, to my view, they should thus rather live at the FRBR expression level rather than the manifestation level.
>
> This was a pragmatic decision so that we could have a clear criterion for determining equality (and hence counts) of textual_entities. I think this is only a slight inconvenience for the use cases you mention, since it would be straightforward to state, for example, that several citations (different manifestations) are all about the same document. It was very hard to determine in the general case when two FRBR expressions were the same (e.g. the HTML vs. PDF versions of a document might contain different information). It is also the case that even two identical manifestations might not be about the same entity (consider distinct authors with a common name, or two different documents that have the same title). If there were a use case where FRBR expression level statements were clearly needed (and could be clearly defined), I think we would consider adding IAO terms that modeled them. Can you provide such an example and definitions? Any definition should be specific enough to determine exactly how many expressions there are in any situation.
>
>> Besides, assuming that a textual_entity cannot be considered as a subclass of a document or a document_part, then it would have more sense to me to have the textual_entity and all its subclasses exist at the level of the FRBR expression, while having instead the document entity and all of its subclasses exist at the level of the FRBR Manifestation, thereby allowing for a textual_entity to acquire a format or a layout only once it has been incorporated into a particular document.
>
> Determining equality among expressions (and, hence, the ability to count the number of expressions in some instance) is very difficult. Often, the formating and layout convey information (e.g. an italicized vs. roman gene name often indicates species or gene vs. gene product). How could you tell what expressions were the same without knowing the layout or format?

(1)

So if I understand correctly, having textual_entity exist at the level of the FRBR manifestation was basically meant as a mechanism to allow a machine to determine equality between two textual_entities, so as to be able to count unequivocally the number of times they appear in a document. A textual_entity thus comes into being as soon as a FRBR expression is embodied in a particular document, and its attributes depend both on the type of document in which it has been embodied and on the way it has been presented in that document.

Now I understand that, from a technical point of view, it is much easier to determine equality between 2 textual_entities when their manifestation is also taken into account. However, this would seem to be limited to counting identical textual_entities within one single document, or within one particular class of documents that use a similar formatting style. Suppose instead that you have two documents that contain the same FRBR expression, e.g. the name of a person, inserted into each document with a different typeface, a different font size, or more generally just different formatting. In this case, the two resulting textual_entities will not be regarded as equal, because they differ in their Manifestation. Hence, if I attempt to retrieve all the documents that refer to that particular person, I may be unable to retrieve both documents, only because the name of that person has been embodied in two Manifestations that are not the same.

FRBR distinguishes between the Expression and the Manifestation of a work according to what is semantically relevant (which pertains to the Expression) and what is a mere matter of "presentation" (which pertains to the Manifestation). Accordingly, in the case of gene vs. gene product, the italicization would be considered part of the Expression because it is semantically significant, whereas an italicization justified only on the basis of style would instead be regarded as part of the Manifestation.

In some sense, the formatting of a textual_entity can be seen as a kind of encoding: the very same information entity can be represented in a variety of formats, which are nonetheless correlated by the fact that they all incorporate the same Expression. For instance, most of the time when I am looking for a particular book written by a particular author, I do not care about the edition or the font in which the book has been published, as I consider them all to be equivalent provided that they all incorporate the same Expression.

I thus think it would sometimes be very useful to be able to identify when 2 information entities are identical at the FRBR expression level, which is the main reason the FRBR expression entity was originally conceived, i.e. for the purpose of categorizing and retrieving bibliographical records. Also, in the context of copyright law, the FRBR expression is the actual entity that is eligible for protection. It would therefore be useful to be able to determine when 2 information entities represent the same Expression, albeit in different Manifestations, and when they are instead two distinct expressions of the same work and are therefore not likely to infringe upon each other.

For this reason, I would advocate for the insertion into the IAO of an additional term that would describe an information content entity at the level of the FRBR expression. What is your position on this matter?

(2)

Now when it comes to the technical part, I am afraid I am unable to offer any viable solution to unequivocally determine equality between two information entities at the level of their expression. My first guess would be to suggest that there is equality whenever the 2 information entities can be converted back and forth from one format to the other without any loss of data, i.e. whenever there exists at least one algorithm that allows for a 1:1 mapping between the two formats. Accordingly, any given textual entity could be represented either in UTF-8 or Latin-1 and still be considered the same at the level of the Expression. Similarly, an image encoded in JPEG format would be regarded as equivalent to an image encoded in PNG format insofar as conversion from one format to the other does not lead to any data loss. An information entity embodied in a Word document, an HTML document, or a PDF document could also be regarded as equivalent to the extent that the only data lost when converting from one format to another pertains to layout, format, or presentation (i.e. Manifestation-related data) as opposed to the actual content of the information entity (i.e. Expression-related data).
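
For the UTF-8 vs. Latin-1 case, the round-trip criterion can be sketched as below. This is only an illustration under my own assumptions: the function name round_trips is made up, and it tests plain character encodings, not document formats such as HTML or PDF.

```python
def round_trips(text, first, second):
    """True if `text` survives first -> second -> first with identical bytes."""
    try:
        original = text.encode(first)
        # Convert the bytes into the second encoding...
        converted = original.decode(first).encode(second)
        # ...and back again, then compare with what we started from.
        return converted.decode(second).encode(first) == original
    except UnicodeError:
        # The text contains a character that one of the encodings lacks.
        return False


# Every character of the first string exists in both encodings.
print(round_trips("café", "utf-8", "latin-1"))
# The Greek alpha has no Latin-1 code point, so the conversion is lossy.
print(round_trips("gene α", "utf-8", "latin-1"))
```

Under this criterion the first string would count as the same Expression in either encoding, while the second would not, because the conversion cannot be undone.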

On second thought, however, this seems to be an unsatisfactory solution, since in practice almost every conversion between document formats involves some loss of information, and I am not sure how a machine could identify the nature of the data loss in order to determine whether it pertains to the expression or the manifestation.

I believe this is nonetheless an important question, which seems to me to be related to the more general question of encoding (i.e. whether and to what extent different encodings of the same information should be considered equivalent). I would be willing to give it some more thought, if that has not been done already and/or if you think it could be relevant for the future development of the IAO.
