The reason I keep bringing up multimethods is that I keep thinking
there's not going to be any getting around specialized data types.
In a simple key:val pair we could assume that the value is a string.
That's a fair assumption, but as I implied in an earlier mail, that
won't take us very far. If one feed comes in with a "2008-06-30" value
and another comes in with "December 2, 1970", which is sorted first?
So of course the inputter needs to say "Field so-and-so is a date".
And then we can decide in the outputter what date format we want to
use.
Similar situations will happen with other types of data, like geo-encoded data.
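
To make the date problem concrete, here is a minimal sketch in Python
(not the language we'd necessarily use). The format constants and
function names are made up for illustration, and it assumes an English
locale for the month name:

```python
from datetime import date, datetime

# Hypothetical per-feed declarations: each inputter states how its dates look.
FEED_A_FORMAT = "%Y-%m-%d"   # e.g. "2008-06-30"
FEED_B_FORMAT = "%B %d, %Y"  # e.g. "December 2, 1970" (English month names assumed)


def parse_date(raw: str, fmt: str) -> date:
    """Inputter side: turn the raw string into a real date object."""
    return datetime.strptime(raw, fmt).date()


def format_date(value: date, fmt: str = "%Y-%m-%d") -> str:
    """Outputter side: pick whatever date format the output needs."""
    return value.strftime(fmt)


# Sorting the raw strings compares characters, not points in time.
as_strings = sorted(["2008-06-30", "December 2, 1970"])

# Sorting the parsed values compares actual dates.
as_dates = sorted([
    parse_date("2008-06-30", FEED_A_FORMAT),
    parse_date("December 2, 1970", FEED_B_FORMAT),
])
```

As strings, "2008-06-30" sorts first because "2" comes before "D"; as
dates, the 1970 value sorts first.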
The nice part here is we can largely ignore this. So long as we have
generic "print" methods for most of these, we're okay.
If a different print format has to be employed, or if those methods
aren't available for the type, then the outputter author will need to
provide them.
This is somewhat akin to a Ruby programmer who monkey-patches a class.
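
A rough sketch of the generic "print" idea, again in Python. Note that
functools.singledispatch dispatches on the type of the first argument
only, so it is a stand-in for real multimethods rather than the same
thing. GeoPoint and the function name render are hypothetical:

```python
from dataclasses import dataclass
from datetime import date
from functools import singledispatch


@singledispatch
def render(value) -> str:
    """Generic fallback: most values can simply use their string form."""
    return str(value)


@render.register
def _(value: date) -> str:
    # A default print method for dates, shipped with the tool.
    return value.isoformat()


@dataclass
class GeoPoint:
    """Hypothetical geo-encoded value, used only for illustration."""
    lat: float
    lon: float


@render.register
def _(value: GeoPoint) -> str:
    # An outputter author adds this after the fact, without touching GeoPoint.
    return f"{value.lat:.4f},{value.lon:.4f}"
```

The last registration is the part that resembles monkey-patching: the
behaviour is attached from outside the class.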
This also means the scope of work is reduced for transformer authors
(for the most part). If a new data type comes in, they may want some
specialized transformation based on it, but they're more than likely
just going to want to do the same sorts of things that are done
normally, that is, sort it or possibly bound it[1] (i.e. filter it with a
specialized filter).
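
As a sketch of why the transformer author rarely needs to care about
the type: a bounding filter only needs the values to be comparable.
The function name bound and the record layout (plain dicts) are
assumptions for illustration. A geo-encoded field would need its own
specialized filter, which is not shown here.

```python
from datetime import date


def bound(records, field, lower, upper):
    """Keep records whose field lies within [lower, upper].

    Relies only on the values supporting <=, so the same code serves
    numbers, dates, or any other ordered type.
    """
    return [r for r in records if lower <= r[field] <= upper]


numbers = [{"n": 1}, {"n": 5}, {"n": 9}]
dates = [{"d": date(1970, 12, 2)}, {"d": date(2008, 6, 30)}]

in_range_numbers = bound(numbers, "n", 2, 9)
in_range_dates = bound(dates, "d", date(2000, 1, 1), date(2010, 1, 1))
```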
The work for handling new data types gets pushed out to the edges, where
it matters most. It has to be handled by the inputter but the place it's
needed for a particular application is something the designer will
need to work out.[2]
Does this seem reasonable?
- Serge
[1] If this is a numerical sequence, put an upper and lower bound, if
it's time, same thing, but if it's data like geo-encoded, provide a
polygonal bounding box
[2] In a version 2.0 of such a tool, I can imagine a "data mapping"
application in which one describes the data and an inputter and
outputter are generated which follow the convention.
You brought up an interesting point, but I think ultimately we've
either come to a misunderstanding, different conclusions, or we're
looking at different parts of the elephant, so let me restate my view
of the world and then try to address what you wrote.
(As you point out) we're going to need to take different inputs, slurp
them into the system, operate on them and then output them.
One thing that may be a difference between us is that I'm thinking
mostly about end users, rather than developers, and also, I'm thinking
that a vast majority of the time, the format will be a standard file
format, rather than something that would have to be handled by either
the user or a developer.
I'm going to work on replies out of order since I think they'll make more sense.
> For example, if we are reading data that are coming in as strings, and
> need to convert some fields to other kinds of objects, e.g. dates,
> then I think we'd want to separate the inputting from the converting
> into separate components. That way, the inputters are more pluggable;
> for example, it would be trivial to swap an XML inputter for a JSON
> inputter, as there would be no need to copy the conversion
> specification from one inputter to the other. Same on the other end
> for outputters; we'd like to be able to easily swap a socket for a
> text file.
I think there are a few things to note here.
The first thing I notice is that your example uses meta-formats (XML
and JSON) rather than the final formats. This is where much of the
divergence is and why we come to different conclusions.
I think it'll be rare that a user will want a file format that's not
already pre-defined. Most of the time the format will be standardized,
like RSS. JSON data isn't as formally defined, but we'll know that the
format from a particular site always follows certain conventions.
Going from here, I believe that 99% of the time, the user will want to
use the "RSS inputter" and then just expect it to do the right thing.
You are right in that it should be as easy as possible to write
functionality that slurps data in, but I suppose where we diverge (and
it's a minor divergence at this point) is that I think that at the point
where the data leaves the inputter, it should be generic and
immediately usable, whereas you're saying that you want a second
"identification" process.
I think in most cases, though
> 1) I think that in general we want to architect a system with highly
> cohesive components; i.e. components that do only one thing, and do it
> very well. If so, then data conversions should probably be in
> transformer components and not in inputters or outputters
I think it's hard to argue against the rule of modularity, and I
wouldn't even try.
Where I don't agree with you is that I feel data identification is
part of understanding a file, and will be needed by both the inputter
and the outputter.
Let's take a practical example from RSS 2.0
<pubDate>Sat, 07 Sep 2002 00:00:01 GMT</pubDate>
I look at that and think "Any RSS reader/writer should know that the
date must be formatted in this way."
If we omit this step, we're left with (in key/val pairs):
pubDate: "Sat, 07 Sep 2002 00:00:01 GMT"
Then (according to my understanding of what you've written), we'd have
a separate step in a transformer that says "Turn pubDate into a time
object"
I don't want to pontificate too much more on this point (this mail is
already far, far too long for this stage), but I imagine the inputter
and outputters both working against the same file specification
document.
If you decouple the process, you're going to have to remember to add
the "converters" at exactly the right times both for inputting and
outputting, and that just seems cumbersome.
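
Here is roughly what I have in mind, sketched in Python. RSS_SPEC,
read_item and write_item are hypothetical names; the only real library
calls are parsedate_to_datetime and format_datetime from the standard
email.utils module, which handle RFC 822 style dates:

```python
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical shared "file specification": field name -> (parse, format).
# Both the inputter and the outputter consult this one table.
RSS_SPEC = {
    "pubDate": (
        parsedate_to_datetime,
        lambda dt: format_datetime(dt, usegmt=True),
    ),
}

# Fields not listed in the spec are treated as plain strings.
DEFAULT = (str, str)


def read_item(raw: dict) -> dict:
    """Inputter: typed values leave here, ready for transformers."""
    return {k: RSS_SPEC.get(k, DEFAULT)[0](v) for k, v in raw.items()}


def write_item(item: dict) -> dict:
    """Outputter: the same spec says how each field is written back."""
    return {k: RSS_SPEC.get(k, DEFAULT)[1](v) for k, v in item.items()}


item = read_item({"title": "Hello", "pubDate": "Sat, 07 Sep 2002 00:00:01 GMT"})
round_trip = write_item(item)
```

With this arrangement nobody has to remember to add a converter step;
anything in between sees item["pubDate"] as a datetime.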
> 2) Regarding formatting of dates (and numbers and currencies, for that
> matter), the Java libraries have some pretty good I18N (internationali
> [sz]ation) built in, such that in some cases, specifying the locale is
> all that's needed to produce a correct result
I think this is another reflection of our different experiences. My
experience is that these formats are predefined. RSS uses RFC822
dates- period. If we input a date, it's in that format. If we output a
date, it's in that format.
> 4) Regarding sorting, as in dates, I think we would usually need to
> normalize all date strings into date objects, or some other
> mathematical (as opposed to string) representation.
Honestly this is where I felt I lost your argument. The point of my
email was that we'd need to transform data formats we understood into
neutral representations that could be later outputted, and unless I'm
missing something, this paragraph basically reiterates that point.
> 5) We need to decide whether or not to assume that all input data will
> be strings.
Well my point was we don't need to decide on any internal
representation, just a standard set of methods.
You may be right that there will be data with no possible character
representation, but I'm hard pressed to think of any examples.
You're certainly right that not all methods make sense on all data,
though. I don't know what "sort by" a JPEG image would mean.
I would argue that by the time the data leaves the inputter, it should
be in our standard format, ready to be used (and this is why I feel so
strongly that the data shouldn't be left raw, and hence why I disagree
with you on point 1).
> 3) Multimethods may well be useful in the implementation. For example,
> a conversion function could receive a value as one parameter, and a
> map of instructions as another.
I think this is one of those places where vocabulary is getting us
into trouble.
I don't know what you mean by a "conversion function".
> (Obviously in some cases it might be simpler *not* to use
> multimethods to the fullest, but the great thing is that we'd have the
> freedom to choose.)
The idea of multimethods would be to simplify the creation and
handling of these data types. For example the sort algorithm might be
the same between two types, but representation might be different, or
vice versa, with the idea being that programmers could use standard
transformations on the various data fields.
If I do a "sort-by" and the field is a date field it shouldn't matter
any more than a "sort-by" on a string field.
So I don't understand what you mean here.
Lastly, I want to emphasize I'm not prescribing any design; I'm just
sharing thoughts I have on the current direction it seems we're going.
- Serge
But I was pretty much done anyway.
I'm happy to be shown a better way (that means I'm around people who
are smarter than I am, and while that's no big feat, it is always
reassuring).
- Serge