GD and Paul -
Thanks for the encouragement on our APML export. We're looking
forward to putting it out there to see how it's consumed.
Here's a bit more clarity on what I was proposing as additions to the
mix:
1. Context / Universe of Discourse Identification
I believe this is pretty straightforward, and not terribly
controversial. The basic concept here is that some prior knowledge
inbound will greatly improve our ability to service users who provide
us with their APML. The most obvious solution is to shift toward
using RDF to support ontology identification.
2. Concept Key Value Distribution Data
I know this is a bit more "out there", but it seems that it'd be
incredibly beneficial for consumers of APML. This is basically the
same concept as ontology identification, but pointing to the
"mathematical" universe of discourse rather than the "linguistic"
one. I have no idea if this has been proposed (or is in use)
elsewhere, but my brief discussions with =Drummond, Kingsley, and
Danny about it seem to indicate it's a relatively novel (and useful)
idea.
As it stands now, we have no way of interpreting the range of Values
across Sources. For example, a single APML file may include a set of
Values in the range [0.0023 - 0.0071] for a given Concept Key Source.
We're assuming that each Source will be relatively consistent in its
application of Values (not a great assumption, but there you go);
however, we can't assume the same across Sources. For example, another
Source may have Values in the range of [0.15 - 0.87] within the same
APML file. Further, each range could be representative of different
distributions.
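To make the problem concrete, here is a rough Python sketch. It uses the two ranges from the example above; the individual Values and the relative_position helper are made up for illustration. Note that a linear rescale like this quietly assumes both Sources share the same distribution shape, which is exactly what we can't assume.

```python
def relative_position(value, low, high):
    """Where a Value sits inside its Source's observed range, from 0.0 to 1.0."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)

# Ranges come from the example above; the individual Values are invented.
source_a_range = (0.0023, 0.0071)
source_b_range = (0.15, 0.87)

value_a = 0.0065  # near the top of Source A's range
value_b = 0.20    # near the bottom of Source B's range

position_a = relative_position(value_a, *source_a_range)
position_b = relative_position(value_b, *source_b_range)

# Compared raw, value_b looks far stronger than value_a. Compared within
# each Source's own range, the ordering flips.
print(value_a < value_b, position_a > position_b)
```

Even this only gets us a position within a range. Without knowing the distribution behind each range, we still can't say what that position means.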
The goal of this proposal is to identify a method by which an APML
consumer can interpret the Concept Key Value. My suggestion is to
take a page out of the RDF playbook, and define something akin to a
"mathematical equation / statistical distribution" ontology. For
example, one set of Values may be from a Poisson distribution, while
another may be Gaussian. Another view may be that the Values should
be considered using more of a sigmoidal curve described by a
cumulative distribution function. Each of which may be offset by some
additional factor to be used for normalization.
In practice, then, perhaps each Concept Key Value field could be
identified (ala RDF) with a pointer to the associated interpretation.
In this way, APML consumers would be able to more effectively assign
meaning to the Value. For example, we would know that a Value of
0.0051 may represent a typical interest from Source A while a typical
user may be represented by 0.87 in Source B.
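As a minimal sketch of what a consumer might do with such a pointer, assume (purely for illustration) that both Sources declare a Gaussian distribution and publish a mean and standard deviation. The parameter values, the sources table, and the function names below are all hypothetical; nothing like this exists in APML today.

```python
import math

def gaussian_percentile(value, mean, std_dev):
    """Fraction of a Gaussian population expected at or below value (the CDF)."""
    return 0.5 * (1.0 + math.erf((value - mean) / (std_dev * math.sqrt(2.0))))

# Hypothetical per-Source annotations. The means echo the "typical" Values
# mentioned above; the standard deviations are invented.
sources = {
    "A": {"mean": 0.0051, "std_dev": 0.0010},
    "B": {"mean": 0.87, "std_dev": 0.05},
}

def interpret(source_name, value):
    """Map a raw Concept Key Value onto a 0..1 percentile for its Source."""
    params = sources[source_name]
    return gaussian_percentile(value, params["mean"], params["std_dev"])

typical_a = interpret("A", 0.0051)
typical_b = interpret("B", 0.87)
strong_a = interpret("A", 0.0071)
```

Under those assumptions, 0.0051 from Source A and 0.87 from Source B both land at the 50th percentile, so a consumer can treat them as equivalent levels of interest. A Poisson or sigmoidal Source would simply point at a different interpretation function.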
In my discussion with =Drummond, it's also possible we may be able to
extend this idea to a machine-readable solution (ala XML Schema). It'd
be really neat to get to this level, but the first step is adoption of
the general concept.
3. Concept Key Confidence Value
In the same vein of "more prior knowledge is good", it would be
helpful for APML consumers to gauge the level of confidence a Source
has in the Concept Key Value. Following as a logical extension of the
proposed distribution guidance, the Confidence Value could also be
tagged with its statistical confidence curve. For example, it's
possible that the confidence in a particular Concept Key Value is
highly dependent upon transient popularity, in which case its value
would decay over time. In this case, the Value should be interpreted
as an exponential decay function of the Date provided by the Source.
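Here is a rough sketch of that decay case. The half-life parameter is my own assumption about how a Source might describe its curve, and the function name and dates are made up for the example.

```python
from datetime import date

def decayed_confidence(confidence, source_date, as_of, half_life_days):
    """Confidence halved for every half_life_days elapsed since source_date."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    age_days = (as_of - source_date).days
    if age_days < 0:
        raise ValueError("as_of is earlier than source_date")
    return confidence * 0.5 ** (age_days / half_life_days)

# Hypothetical: a Source reported 0.8 confidence on 1 Jan 2008 and declared
# a 30-day half-life. Thirty days later the consumer reads it as half that.
current = decayed_confidence(0.8, date(2008, 1, 1), date(2008, 1, 31), 30)
```

A Source whose confidence doesn't depend on popularity would point at a different curve (or none at all), and consumers who don't understand the annotation could just ignore it.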
The end goal with the Confidence Value is simply to enable the APML
syntax to allow for it. There are a lot of producers who won't supply
it (or it may be suspect), but in some contexts it'll be highly
valuable.
I hope this detail helps shed light on what I was proposing. Let me
know, though, if it's still too nebulous and I'll see if I can dive a
bit deeper. In the end, though, if we move to an RDF-enabled APML
specification, we should be able to support these cases relatively
easily (without breaking anything, if they're ignored).
Any other questions or suggestions?
- Trent