Hi!
> I'll chime in on the value of allowing multiple formats. Allowing
> multiple formats:
>
> 1) Allows tools which migrate data to bibjson to preserve strings with
> their original markup while also creating new strings in a more
> standard format.
> 2) Allows data set publishers to provide alternative formats for
> strings to ensure that records can be displayed on a wide variety of
> systems.
>
Yes
> In fact, I think we can do better than support multiple formats, and
> support multiple alternative strings. In this way we can attach
> arbitrary
> attributes to a string. We could, for example, attach
> language attributes, which would be especially useful for user
> interfaces. (Users will probably prefer to see abstracts and subject
> headings in their own language.)
>
Since the beginning of these BibJSON discussions (what, 8 months to a
year ago?) I thought the goal was to make this set of record
descriptions as simple as possible: just describe the information about
the records, nothing more, nothing less, in a more or less free form.
What you are describing below certainly works; however, I would like to
mention a couple of things that could make the whole thing cleaner, and
that explain some of the process we have been trying to explain for a
couple of weeks now.
The problem I see with this solution is that you are mixing two totally
different things in the same *instance records* description file:
(1) all properties related to the records themselves (title, authors, etc.)
(2) system processing properties, such as the language of a string, the
format of a string, and so on.
This can be OK, don't get me wrong, but it certainly makes instance
records files (which are what data publishers really care about in the
end) much more complex.
Let's take the BKN use case: sharing research-related information between
departments. In this use case, the data publishers are professors,
students, etc.: people with or without any knowledge of data description
and maintenance. They don't know what Unicode is, what a charset is, what
the ISO standard for language codes is, what an "original" record is,
and so on.
They know that their article has a title, an ISBN, authors, affiliated
institutions, etc. They only want to know which "attribute" to use to
describe that information. (Give me the cookbook, but don't talk about
the chemistry of the interactions between all the nutrients.)
It is with this specific vision, this specific use case, and this kind
of user in mind that we created the schemas, and that we put all the
"system oriented" information directly in the schemas. We could have
done what XML, N3, and other serialization formats do and added the
format, language, and datatype information directly in the instance
records files, but given the use case, we chose not to, and put that
in the schema instead.
We want the record files to be as simple as possible.
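For illustration, the schema side might look something like this (a
sketch only; the attribute names and the schema syntax here are
hypothetical, not an agreed BibJSON schema):

```json
{
  "attributes": {
    "title":        { "type": "string", "format": "ascii" },
    "titleUnicode": { "type": "string", "format": "unicode" },
    "titleLatex":   { "type": "string", "format": "latex" }
  }
}
```

Defined once here, the format information never needs to appear in any
instance record.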
That is why we created a "format" attribute that is defined in the
*schema* and not in the *instance records* file. As for language, is it
a necessary addition of complexity? We could add it to the schema too;
however, wouldn't it be better to create language-specific datasets
instead of putting all languages in the same one? Take Wikipedia: for
the record "USA", would you prefer having one dataset per language, or
having the USA record repeat the same attribute 45 times in different
languages?
Also, isn't it a little complex just to describe a title literal?
> "title" : {
> "type" : "RichStringList",
> "strings" : [
> {
> "type" : "RichString"
> "value" : "An L(⅓ + ε) Algorithm for the Discrete Logarithm
> Problem for Low Degree Curves",
> "format" : "unicode",
> "language" : "en"
> },
> {
> "type" : "RichString"
> "value" : "An $L (1/3 + \epsilon)$ Algorithm for the Discrete
> Logarithm Problem for Low Degree Curves",
> "format" : "latex",
> "language" : "en",
> "original" : "true",
> "preferred" : "true"
> },
> {
> "type" : "RichString"
> "value" : "An L(1/3 + Epsilon) Algorithm for the Discrete
> Logarithm Problem for Low Degree Curves",
> "format" : "ascii",
> "language" : "en"
> }
> ]
> }
>
Wouldn't this be easier, faster, and more efficient?
{
  "title": "An L(1/3 + Epsilon) Algorithm for the Discrete Logarithm
Problem for Low Degree Curves",
  "titleUnicode": "An L(⅓ + ε) Algorithm for the Discrete Logarithm
Problem for Low Degree Curves",
  "titleLatex": "An $L (1/3 + \epsilon)$ Algorithm for the Discrete
Logarithm Problem for Low Degree Curves"
}
All the burden is put in the schema. Data publishers won't have to care
about defining all these properties, record after record after record.
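On the consumer side, a display tool can still pick the best variant
with a small lookup. Here is a sketch, assuming the flat attribute names
from the example above (title, titleUnicode, titleLatex); in a real
system the format-to-attribute mapping would come from the schema:

```python
# Hypothetical helper: pick the best available title variant for the
# formats a display system supports, in preference order.
def pick_title(record, supported_formats):
    # Maps a format name to the attribute that, by convention, carries it.
    attribute_for_format = {
        "unicode": "titleUnicode",
        "latex": "titleLatex",
        "ascii": "title",
    }
    for fmt in supported_formats:
        attr = attribute_for_format.get(fmt)
        if attr and attr in record:
            return record[attr]
    return record.get("title")  # fall back to the plain title

record = {
    "title": "An L(1/3 + Epsilon) Algorithm for the Discrete "
             "Logarithm Problem for Low Degree Curves",
    "titleLatex": "An $L (1/3 + \\epsilon)$ Algorithm for the Discrete "
                  "Logarithm Problem for Low Degree Curves",
}
print(pick_title(record, ["unicode", "latex", "ascii"]))  # the LaTeX variant wins here
```

The record itself stays a flat bag of strings; all the format knowledge
sits in one small table that the schema can supply.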
But there are even more considerations here: your example is 712
characters; mine is 335, so yours is more than twice the size (mine is
about 53% smaller). If we extrapolate this to a whole dataset, your
dataset would be considerably bigger than mine just because we put all
these technical considerations in the instance record file instead of
the schema. (In reality the overhead would be lower, maybe 20 to 30%
for an entire dataset.)
But it is not only a matter of physical space; it is also a matter of
parsing/processing time for these datasets.
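To make the size argument concrete, here is a quick Python sketch
comparing the two shapes (the strings come from the title examples
above; exact counts depend on whitespace and on how many variants a
record carries):

```python
import json

# Nested RichString form: format/language metadata is repeated per string.
nested = {
    "title": {
        "type": "RichStringList",
        "strings": [
            {
                "type": "RichString",
                "value": "An L(1/3 + Epsilon) Algorithm for the Discrete "
                         "Logarithm Problem for Low Degree Curves",
                "format": "ascii",
                "language": "en",
            },
            {
                "type": "RichString",
                "value": "An $L (1/3 + \\epsilon)$ Algorithm for the Discrete "
                         "Logarithm Problem for Low Degree Curves",
                "format": "latex",
                "language": "en",
                "original": "true",
                "preferred": "true",
            },
        ],
    }
}

# Flat form: the format is implied by the attribute name,
# which the schema defines once for the whole dataset.
flat = {
    "title": "An L(1/3 + Epsilon) Algorithm for the Discrete "
             "Logarithm Problem for Low Degree Curves",
    "titleLatex": "An $L (1/3 + \\epsilon)$ Algorithm for the Discrete "
                  "Logarithm Problem for Low Degree Curves",
}

nested_size = len(json.dumps(nested))
flat_size = len(json.dumps(flat))
print(nested_size, flat_size)  # the nested form is substantially larger
```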
So, by putting all these considerations at the level of the schema (and
maybe of the dataset description, for the language), we gain:
(1) simpler instance records files
(1.1) easier to use
(1.2) easier to understand
(1.3) easier to analyze
(2) smaller instance records files
(3) faster processing of records files
(4) fewer transmission/transformation errors, because all this
information lies in one place, in one record: the description of types
and attributes
(5) easier dataset maintenance, since a single change in the schema
impacts all the records; otherwise, you have to change *all* the
records if you want to change one of these things
(6) the ability to validate and process the same dataset using
different schemas for different purposes (think of all the different
HTML validation profiles: strict, relaxed, etc.)
(7) anything else I haven't thought of?
The drawbacks are:
(1) defining more attributes, such as title, titleLatex, titleUnicode
(2) anything else I haven't thought of?
> The specification would have to provide a schema for these types and
> provide vocabularies for the language and format attributes.
>
> This seems preferable to having separate attributes for different
> formats as it allows us to indicate which string should be preferred,
> which string is the original, the language used, or any other metadata
> that the dataset publisher wishes to include. It also would keep the
> number of attributes in our vocabulary relatively small, since their
> wont be a _html and _latex variant for every attribute that can take a
> text value.
>
Yes, but every solution has advantages and disadvantages. If a perfect
solution existed, we wouldn't have all these serialization formats, and
all these programming languages, and all these cars, and all these
different kinds of paper, light, etc. :)
So we have to think about the use case. Does something already exist
that does what we want? If yes, we reuse; if no, we create.
What I stated above is the use case I understood from Jim and Nitin
months and months ago. And as this discussion continues, I see this
good vision fading away in favor of other considerations. There was a
need, there was a use case that created that need, and a solution
started to be created. I don't think that your example above helps
anybody meet this need for the core BKN use case.
So, this is what I have to say about it. I can be right and I can be
wrong. I am not the one who will take any final decision for BKN &
BibJSON, but these are the things to watch out for and to consider,
given my experience and knowledge. From there, it is up to BKN to
choose what it thinks is best for itself.
Thanks!
Take care,
Fred