(Maybe) Coming to consensus on the internal representation


Serge Wroclawski

Apr 7, 2009, 9:08:12 AM
to Clojure Study Group Washington DC
I was thinking last night that while we all discussed going our own
separate ways, we've come to (largely at least) the same conclusion on
the data feed, that is the internal representation will be a Seq which
contains maps.

The reason I keep bringing up multimethods is that I keep thinking
there's not going to be any getting around specialized data types.

In a simple key:val pair we could assume that the value is a string.
That's a fair assumption, but as I implied in an earlier mail, that
won't take us very far. If one feed comes in with a "2008-06-30" value
and another comes in with "December 2, 1970", which is sorted first?
So of course the inputter needs to say "Field so-and-so is a date".
And then we can decide in the outputter what date format we want to
use.
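A minimal sketch of the date problem, assuming two hypothetical feeds whose records carry a `:date` field (the feed contents, `parse-date`, and `tag-dates` are illustrative names, not part of any agreed design). Each inputter declares its date format, and once both feeds hold `java.util.Date` values they sort together:

```clojure
(import '(java.text SimpleDateFormat)
        '(java.util Locale))

;; Hypothetical feed records: the internal representation is a seq of maps.
(def feed-a [{:title "a" :date "2008-06-30"}])
(def feed-b [{:title "b" :date "December 2, 1970"}])

(defn parse-date
  "Parse s with a SimpleDateFormat pattern, using US month names."
  [pattern s]
  (.parse (SimpleDateFormat. pattern Locale/US) s))

(defn tag-dates
  "Inputter step: mark :date as a date by replacing the string with a Date."
  [pattern records]
  (map #(update % :date (partial parse-date pattern)) records))

;; Dates are Comparable, so sort-by works across both feeds.
(def merged
  (sort-by :date
           (concat (tag-dates "yyyy-MM-dd" feed-a)
                   (tag-dates "MMMM d, yyyy" feed-b))))
```

Sorting the raw strings instead would compare "2008-06-30" and "December 2, 1970" character by character, which says nothing about chronology.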

Similar situations will happen with other types of data, like geo-encoded data.

The nice part here is we can largely ignore this. So long as we have
generic "print" methods for most of these, we're okay.

If a different print format has to be employed, or if those methods
aren't available for the type, then the outputter author will need to
provide them.
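One way this could look with a multimethod, as a sketch only: `render` is a hypothetical name, and dispatching on the value's class is an assumption rather than a decided design. The default method covers most types, and an outputter author adds a method only where a special format is needed:

```clojure
(import '(java.text SimpleDateFormat)
        '(java.util Date Locale TimeZone))

;; Hypothetical generic print method, dispatching on the value's class.
(defmulti render class)

;; Generic fallback: most values print fine with str.
(defmethod render :default [v] (str v))

;; Supplied by an outputter author who needs a specific date format.
(defmethod render Date [d]
  (let [fmt (doto (SimpleDateFormat. "yyyy-MM-dd" Locale/US)
              (.setTimeZone (TimeZone/getTimeZone "UTC")))]
    (.format fmt d)))
```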

This is somewhat akin to a Ruby programmer who monkey-patches a class.

This also means the scope of work is reduced for transformer authors
(for the most part). If a new data type comes in, they may want some
specialized transformation based on it, but they're more than likely
just going to want to do the same sorts of things that are done
normally, that is, sort it or possibly bound it[1] (i.e. filter it with
a specialized filter).

The work for handling new data types gets pushed out to the edges,
where it matters most. It has to be handled by the inputter, but the
place it's needed for a particular application is something the
designer will need to work out.[2]


Does this seem reasonable?

- Serge

[1] If this is a numerical sequence, put an upper and lower bound on
it; if it's time, do the same thing; but if it's something like
geo-encoded data, provide a polygonal bounding box.
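A sketch of the numeric and time case only, assuming records are maps and the bounded values are mutually comparable (`within-bounds` is a hypothetical name). A polygonal bounding box for geo-encoded data would need its own filter and is not covered here:

```clojure
(defn within-bounds
  "Keep records whose field k lies between lo and hi, inclusive.
  Works for any mutually Comparable values (numbers, strings, dates)."
  [k lo hi records]
  (filter #(let [v (get % k)]
             (and (<= (compare lo v) 0)
                  (<= (compare v hi) 0)))
          records))
```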

[2] In a version 2.0 of such a tool, I can imagine a "data mapping"
application in which one describes the data and an inputter and
outputter are generated which follow the convention.

Keith Bennett

Apr 7, 2009, 2:51:59 PM
to Clojure Study Group Washington DC
Serge -

Interesting stuff...

I'm not sure that this will exactly address the issues you raised, but
here are some thoughts...

1) I think that in general we want to architect a system with highly
cohesive components; i.e. components that do only one thing, and do it
very well. If so, then data conversions should probably be in
transformer components and not in inputters or outputters.

For example, if we are reading data that are coming in as strings, and
need to convert some fields to other kinds of objects, e.g. dates,
then I think we'd want to separate the inputting from the converting
into separate components. That way, the inputters are more pluggable;
for example, it would be trivial to swap an XML inputter for a JSON
inputter, as there would be no need to copy the conversion
specification from one inputter to the other. Same on the other end
for outputters; we'd like to be able to easily swap a socket for a
text file.

True, Java's libraries may be able to offer some help, through the
Readers and Writers, for example, but I think we should hesitate to
glom orthogonal tasks into a single component.

One of the beauties of doing this in Clojure rather than Java is that
it is trivial to specify a conversion function, either by name, or its
implementation, as part of the configuration.

2) Regarding formatting of dates (and numbers and currencies, for that
matter), the Java libraries have some pretty good I18N
(internationali[sz]ation) built in, such that in some cases, specifying
the locale is all that's needed to produce a correct result. The date
formats are
obviously different, as there are many variations (e.g. all numeric
vs. months-as-strings). Given that we're contemplating an open source
project, we should expect users and developers to be in non-U.S.
locales, so I suggest not building in any assumptions regarding
locale.

In general, we can provide reasonable defaults, for example that
"04/07/2009" will be interpreted as April 7, 2009 in the U.S. locale,
and July 4th, 2009 in certain others. It is possible, but not
necessary, to specify which locale to use in parsing and formatting,
but the default locale used by Java is the one in which the Java
runtime was started (try (.. java.util.Locale getDefault) in the
REPL).

3) Multimethods may well be useful in the implementation. For example,
a conversion function could receive a value as one parameter, and a
map of instructions as another. A locale could be one of the values
in the map; the style (e.g. SHORT, MEDIUM, LONG, FULL, see
http://java.sun.com/javase/6/docs/api/java/text/DateFormat.html) could
be another. Or, a function could be another, which would trump the
others. (Obviously in some cases it might be simpler *not* to use
multimethods to the fullest, but the great thing is that we'd have the
freedom to choose.)

4) Regarding sorting, as in dates, I think we would usually need to
normalize all date strings into date objects, or some other
mathematical (as opposed to string) representation.

5) We need to decide whether or not to assume that all input data will
be strings. While this will usually be true for data downloaded from
the web (XML, JSON, etc.), it would certainly not be true for a more
generic data application, since so much data comes from data bases. I
may be working on a project soon where we will be pulling data from
multiple stovepipe systems into a unified format; it would be great
for our framework to be able to use JDBC for that.
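To illustrate point 2 above, here is a small sketch of how the locale changes the reading of "04/07/2009". It uses the standard `java.text.DateFormat` SHORT style; `parse-short-date` and `month-of` are hypothetical helper names, and the exact SHORT pattern for a locale comes from the Java runtime's locale data:

```clojure
(import '(java.text DateFormat)
        '(java.util Calendar Locale))

(defn parse-short-date
  "Parse s using the SHORT date style of the given locale."
  [locale s]
  (.parse (DateFormat/getDateInstance DateFormat/SHORT locale) s))

(defn month-of
  "1-based month of a java.util.Date, read in the default time zone."
  [date]
  (let [cal (doto (Calendar/getInstance) (.setTime date))]
    (inc (.get cal Calendar/MONTH))))

;; The same string read under two locales:
(def us-month (month-of (parse-short-date Locale/US "04/07/2009")))
(def uk-month (month-of (parse-short-date Locale/UK "04/07/2009")))
```

Passing the locale explicitly, rather than relying on `(java.util.Locale/getDefault)`, keeps a data flow reproducible across machines.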

- Keith

Serge Wroclawski

Apr 8, 2009, 9:39:53 AM
to clojure-...@googlegroups.com
Keith,

You brought up an interesting point, but I think ultimately we've
either come to a misunderstanding, different conclusions, or we're
looking at different parts of the elephant, so let me restate my view
of the world and then try to address what you wrote.

(As you point out) we're going to need to take different inputs, slurp
them into the system, operate on them and then output them.

One thing that may be a difference between us is that I'm thinking
mostly about end users, rather than developers, and also, I'm thinking
that a vast majority of the time, the format will be a standard file
format, rather than something that would have to be handled by either
the user or a developer.

I'm going to work on replies out of order since I think they'll make more sense.

> For example, if we are reading data that are coming in as strings, and
> need to convert some fields to other kinds of objects, e.g. dates,
> then I think we'd want to separate the inputting from the converting
> into separate components. That way, the inputters are more pluggable;
> for example, it would be trivial to swap an XML inputter for a JSON
> inputter, as there would be no need to copy the conversion
> specification from one inputter to the other. Same on the other end
> for outputters; we'd like to be able to easily swap a socket for a
> text file.

I think there are a few things to note here.

The first thing I notice is that your example uses meta-formats (XML
and JSON) rather than the final formats. This is where much of the
divergence is and why we come to different conclusions.

I think it'll be rare that a user will want a file format that's not
already pre-defined. Most of the time the RSS will be standardized,
like RSS. JSON data isn't as formally defined, but we'll know that the
format from a particular site always follows certain conventions.

Going from here, I believe that 99% of the time, the user will want to
use the "RSS inputter" and then just expect it to do the right thing.

You are right in that it should be as easy as possible to write
functionality that slurps data in, but I suppose where we diverge (and
it's a minor divergence at this point) is that I think that at the point
where the data leaves the inputter, it should be generic and
immediately usable, whereas you're saying that you want a second
"identification" process.

I think in most cases, though

> 1) I think that in general we want to architect a system with highly
> cohesive components; i.e. components that do only one thing, and do it
> very well. If so, then data conversions should probably be in
> transformer components and not in inputters or outputters

I think it's hard to argue against the rule of modularity, and I
wouldn't even try.

Where I don't agree with you is that I feel data identification is
part of understanding a file, and will be needed by both the inputter
and the outputter.

Let's take a practical example from RSS 2.0

<pubDate>Sat, 07 Sep 2002 00:00:01 GMT</pubDate>

I look at that and think "Any RSS reader/writer should know that the
date must be formatted in this way."

If we omit this step, we're left with (in key/val pairs):

pubDate: "Sat, 07 Sep 2002 00:00:01 GMT"

Then (according to my understanding of what you've written), we'd have
a separate step in a transformer that says "Turn pubDate into a time
object"

I don't want to pontificate too much more on this point (this mail is
already far, far too long for this stage), but I imagine the inputter
and outputters both working against the same file specification
document.

If you decouple the process, you're going to have to remember to add
the "converters" at exactly the right times both for inputting and
out, and that just seems cumbersome.
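A rough sketch of what "the same file specification" might mean in code, assuming a hypothetical `rss-spec` map in which each typed field carries a reader and a writer (none of these names come from an agreed design). The pattern string is the usual `SimpleDateFormat` rendering of the RSS pubDate layout shown above:

```clojure
(import '(java.text SimpleDateFormat)
        '(java.util Locale TimeZone))

(defn rfc822-format
  "A fresh formatter for dates like \"Sat, 07 Sep 2002 00:00:01 GMT\"."
  []
  (doto (SimpleDateFormat. "EEE, dd MMM yyyy HH:mm:ss z" Locale/US)
    (.setTimeZone (TimeZone/getTimeZone "GMT"))))

;; Hypothetical shared specification: one entry per typed field.
(def rss-spec
  {:pubDate {:read  #(.parse (rfc822-format) %)
             :write #(.format (rfc822-format) %)}})

(defn apply-spec
  "Apply the :read or :write side of spec to each field that has an entry."
  [spec direction record]
  (reduce-kv (fn [m k v]
               (if-let [f (get-in spec [k direction])]
                 (assoc m k (f v))
                 m))
             record
             record))
```

The inputter would call `(apply-spec rss-spec :read record)` and the outputter `(apply-spec rss-spec :write record)`, so neither side restates the date format.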


> 2) Regarding formatting of dates (and numbers and currencies, for that
> matter), the Java libraries have some pretty good I18N
> (internationali[sz]ation) built in, such that in some cases, specifying
> the locale is all that's needed to produce a correct result

I think this is another reflection of our different experiences. My
experience is that these formats are predefined. RSS uses RFC 822
dates, period. If we input a date, it's in that format. If we output a
date, it's in that format.

> 4) Regarding sorting, as in dates, I think we would usually need to
> normalize all date strings into date objects, or some other
> mathematical (as opposed to string) representation.

Honestly this is where I felt I lost your argument. The point of my
email was that we'd need to transform data formats we understood into
neutral representations that could be later outputted, and unless I'm
missing something, this paragraph basically reiterates that point.

> 5) We need to decide whether or not to assume that all input data will
> be strings.

Well my point was we don't need to decide on any internal
representation, just a standard set of methods.

You may be right that there will be data with no possible character
representation, but I'm hard pressed to think of any examples.

You're certainly right that not all methods make sense on all data,
though. I don't know what "sort by" a JPEG image would mean.

I would argue that by the time the data leaves the inputter, it should
be in our standard format, ready to be used (and this is why I feel so
strongly that the data shouldn't be left raw and hence disagreeing
with you on point 1).

> 3) Multimethods may well be useful in the implementation. For example,
> a conversion function could receive a value as one parameter, and a
> map of instructions as another.

I think this is one of those places where vocabulary is getting us
into trouble.

I don't know what you mean by a "conversion function".

> (Obviously in some cases it might be simpler *not* to use
> multimethods to the fullest, but the great thing is that we'd have the
> freedom to choose.)

The idea of multimethods would be to simplify the creation and
handling of these data types. For example the sort algorithm might be
the same between two types, but representation might be different, or
vice versa, with the idea being that programmers could use standard
transformations on the various data fields.

If I do a "sort-by" and the field is a date field it shouldn't matter
any more than a "sort-by" on a string field.

So I don't understand what you mean here.

Lastly, I want to emphasize I'm not prescribing any design; I'm just
sharing thoughts I have on the current direction it seems we're going.

- Serge

David James

Apr 8, 2009, 10:49:08 AM
to Clojure Study Group Washington DC
What's our next step? If I remember right, we were going to
independently come up with some ideas (and possibly code) and share
them at our next meeting? Can you remind us of the next meeting times
as well? I'll have a conflict for the next one, but I'd like to stay
in the loop.

Luke VanderHart

Apr 8, 2009, 11:03:23 AM
to Clojure Study Group Washington DC
I think we're just writing our own implementations (as much or as
little as we want) and next time we meet about it we'll borrow the
best parts of each person's implementation/ideas to put together a
basic shared code repository.

Next meeting is Sunday, April 19th, 1pm at HacDC, IIRC.

Thanks,
-Luke

Luke VanderHart

Apr 8, 2009, 11:13:44 AM
to Clojure Study Group Washington DC
Also, I posted a reply dealing with a lot of the content matter of
this thread in the other thread:
http://groups.google.com/group/clojure-study-dc/browse_thread/thread/94a1cb6b0bf6c9f9/f6f695848d989b44#f6f695848d989b44.
You can check it out if you find it interesting.

Thanks,
-Luke

Keith Bennett

Apr 8, 2009, 6:00:28 PM
to Clojure Study Group Washington DC
Serge -

> You brought up an interesting point, but I think ultimately we've
> either come to a misunderstanding, different conclusions, or we're
> looking at different parts of the elephant, so let me restate my view
> of the world and then try to address what you wrote.
>
No worries, mate. ;)

I think our differences in viewpoint stem from our different intended
use cases. I was hoping it could be a tool that I could use in the
enterprise, with relational data bases for example. You seem to be
focused more on emulating Yahoo Pipes, which was, I confess, our
original stated intention. So take what I say with a few grains of
salt, and if you guys don't want to accommodate my use cases, then
that's fine, it will still be a learning experience for me.

More below...

On Apr 8, 9:39 am, Serge Wroclawski <emac...@gmail.com> wrote:
> Keith,
>
> (As you point out) we're going to need to take different inputs, slurp
> them into the system, operate on them and then output them.
>
> One thing that may be a difference between us is that I'm thinking
> mostly about end users, rather than developers, and also, I'm thinking
> that a vast majority of the time, the format will be a standard file
> format, rather than something that would have to be handled by either
> the user or a developer.

It sounds like you are speaking of the output file format, right?
Providing a flexible solution does not preclude defaulting to a
prescribed one.

In general I agree with what I think is your implication, that end
users should not be expected to write Clojure code. However, if we do
our jobs well, then we can create a DSL that doesn't look at all like
Clojure, and that is arguably easier to understand than XML.
Developers could write functions for use by the users; then the users
merely specify a function name, possibly with arguments. We could
provide predefined functions for the most common cases, so writing
serious Clojure code would rarely be necessary.

Of course if we had a GUI, then the users wouldn't need to understand
*any* instruction/configuration file.

>
> I'm going to work on replies out of order since I think they'll make more sense.
>
> > For example, if we are reading data that are coming in as strings, and
> > need to convert some fields to other kinds of objects, e.g. dates,
> > then I think we'd want to separate the inputting from the converting
> > into separate components. That way, the inputters are more pluggable;
> > for example, it would be trivial to swap an XML inputter for a JSON
> > inputter, as there would be no need to copy the conversion
> > specification from one inputter to the other. Same on the other end
> > for outputters; we'd like to be able to easily swap a socket for a
> > text file.
>
> I think there are a few things to note here.
>
> The first thing I notice is that your example uses meta-formats (XML
> and JSON) rather than the final formats. This is where much of the
> divergence is and why we come to different conclusions.

I was referring to the input data; by "final formats" I assume you
mean output data?

> I think it'll be rare that a user will want a file format that's not
> already pre-defined. Most of the time the RSS will be standardized,
> like RSS. JSON data isn't as formally defined, but we'll know that the
> format from a particular site always follows certain conventions.

> Going from here, I believe that 99% of the time, the user will want to
> use the "RSS inputter" and then just expect it to do the right thing.

I hear you saying "*the* RSS", and this underlines our differences in
outlook. I was assuming that RSS would be only one of a multitude of
supported input formats and use cases.

There is nothing to prevent us from providing compound components as
conveniences, or enabling our users to create such a thing. However,
I still maintain that in the interest of cohesion, the "understanding"
of the data, as you put it, *should* be separate from the reading of
it. If the specification of the data source were *always* trivial, as
in merely specifying a URL, then I would agree with you; but if we
want to support other data sources, such as RDBMS's, then it's not.

>
> You are right in that it should be as easy as possible to write
> functionality that slurps data in, but I suppose where we diverge (and
> it's a minor divergence at this point) is that I think that at the point
> where the data leaves the inputter, it should be generic and
> immediately usable, whereas you're saying that you want a second
> "identification" process.

The problem with this is that "immediately usable" is variable, not
fixed. It's context dependent. The idea of a component framework is
to allow the user ultimate flexibility. What if the user really
*does* want the date to be a java.lang.String, and not a
java.util.Date?

Yes, I am saying I want a second identification process, or really,
conversion process. I'll give an example. Let's say there's a data
flow we want to support, and normally its input is an RSS feed.
Wouldn't it be nice if we could substitute inputs such that input
could come from a data base, or even a string in memory? If nothing
else, this could facilitate simple and effective testing of the data
flow and its components. This applies to outputs as well as inputs.
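A minimal sketch of that substitution, assuming an inputter is simply a no-argument function returning a seq of maps and a transformer is a function from seq to seq (`run-flow` and `memory-inputter` are hypothetical names):

```clojure
(defn run-flow
  "Run the records from inputter through each transformer in order,
  then hand the result to outputter."
  [inputter transformers outputter]
  (outputter (reduce (fn [records t] (t records))
                     (inputter)
                     transformers)))

;; Hypothetical in-memory inputter, standing in for an RSS or JDBC
;; inputter when testing a flow.
(defn memory-inputter [records]
  (fn [] records))

(def sorted-result
  (run-flow (memory-inputter [{:n 3} {:n 1}])
            [(partial sort-by :n)]
            vec))
```

Because the flow only sees a function, swapping the in-memory source for a feed or a database query would not touch the transformers or the outputter.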

A distant second best, IMO, would be to have our input components
include the data conversions, but specify those conversions in a
format uniform to all input components.

> I think in most cases, though
>
> > 1) I think that in general we want to architect a system with highly
> > cohesive components; i.e. components that do only one thing, and do it
> > very well. If so, then data conversions should probably be in
> > transformer components and not in inputters or outputters
>
> I think it's hard to argue against the rule of modularity, and I
> wouldn't even try.
>
> Where I don't agree with you is that I feel data identification is
> part of understanding a file, and will be needed by both the inputter
> and the outputter.
>
> Let's take a practical example from RSS 2.0
>
> <pubDate>Sat, 07 Sep 2002 00:00:01 GMT</pubDate>
>
> I look at that and think "Any RSS reader/writer should know that the
> date must be formatted in this way."
>
> If we omit this step, we're left with (in key/val pairs):
>
> pubDate: "Sat, 07 Sep 2002 00:00:01 GMT"
>
> Then (according to my understanding of what you've written), we'd have
> a separate step in a transformer that says "Turn pubDate into a time
> object"

Yes, that's what I'm saying, except that the transformer component
could operate on all data values in the record (with a map of field
names (data map keys) and conversion functions, or nil for no
conversion or default conversion). Again, we could prepackage a
compound component tailored to RSS feeds, but there would be
individual components doing each task. Otherwise, reuse is hampered;
developers wind up rewriting each part in multiple components, and
that is very bad. (While the conversion of string to date is
relatively simple, there could be more complex conversions.) Having a
behavior in a single place makes the code far more reliable,
maintainable, and testable.
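As a sketch of such a transformer, assuming records are maps and the configuration is a map from field key to conversion function, with nil meaning "leave as is" (`convert-record` and `converter` are hypothetical names):

```clojure
(defn convert-record
  "Apply each converter in conversions (field key -> function) to the
  matching field of record. A nil converter, or a field the record does
  not have, leaves the record unchanged for that key."
  [conversions record]
  (reduce-kv (fn [m k f]
               (if (and f (contains? m k))
                 (update m k f)
                 m))
             record
             conversions))

(defn converter
  "Build a transformer that converts every record in a seq."
  [conversions]
  (fn [records]
    (map (partial convert-record conversions) records)))
```

For example, `((converter {:n #(Long/parseLong %) :title nil}) records)` would turn the `:n` strings into numbers and pass `:title` through untouched.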


>
> I don't want to pontificate too much more on this point (this mail is
> already far, far too long for this stage), but I imagine the inputter
> and outputters both working against the same file specification
> document.

Could you elaborate on that? I don't understand. Are you saying that
input and output file specifications would be expressed as a single
specification (file, etc.) rather than two?

>
> If you decouple the process, you're going to have to remember to add
> the "converters" at exactly the right times both for inputting and
> out, and that just seems cumbersome.

While I agree that it adds a little to the verbosity of the data flow,
I think the benefits are worth it. Also, I think it would be pretty
obvious to the data flow author (user or developer) where the
converters would need to be. While it may seem obvious that they
would go just after the inputter and just before the outputter, that
is not necessarily true. For example, some cleanup may need to be
done to a string before converting it to another kind of object.
Grouping all that functionality in the inputter would, IMO, contradict
the modular design of the product.


>
> > 2) Regarding formatting of dates (and numbers and currencies, for that
> > matter), the Java libraries have some pretty good I18N
> > (internationali[sz]ation) built in, such that in some cases, specifying
> > the locale is all that's needed to produce a correct result
>
> I think this is another reflection of our different experiences. My
> experience is that these formats are predefined. RSS uses RFC 822
> dates, period. If we input a date, it's in that format. If we output a
> date, it's in that format.

If we confine the mission of this product to RSS, then I agree with
you. However, if it is to be used for a general purpose data
manipulator, then it is necessary to accommodate unexpected formats.
Again, we could provide RFC822 format as a reasonable default format,
without preventing flow designers from overriding it.

>
> > 4) Regarding sorting, as in dates, I think we would usually need to
> > normalize all date strings into date objects, or some other
> > mathematical (as opposed to string) representation.
>
> Honestly this is where I felt I lost your argument. The point of my
> email was that we'd need to transform data formats we understood into
> neutral representations that could be later outputted, and unless I'm
> missing something, this paragraph basically reiterates that point.

I apologize if I implied that I was disagreeing with you -- as I
mentioned in the beginning, "I'm not sure that this will exactly
address the issues you raised, but here are some thoughts...". I was
using your message as a starting
point, and then elaborating. I certainly didn't mean to imply that I
was debating you on all your points. I appreciate your putting all
this in writing (and, as you can see, I have no problem with long
messages ;) ).

>
> > 5) We need to decide whether or not to assume that all input data will
> > be strings.
>
> Well my point was we don't need to decide on any internal
> representation, just a standard set of methods.
>
> You may be right that there will be data with no possible character
> representation, but I'm hard pressed to think of any examples.

I didn't mean that. I meant that if we support JDBC, for example,
then some of the objects will already be in their natural
representation (e.g. numbers and dates) when they arrive from the data
source.


>
> You're certainly right that not all methods make sense on all data,
> though. I don't know what "sort by" a JPEG image would mean.

As for the sort, again, I don't think we need to dictate what is a
supported sort -- while we can provide common ones, specifying one's
own sort format in Clojure is trivial, usually a one-liner, and a
short one at that. A user may very well want JPEG images sorted, by
size, a metadata item, etc.; there's no reason we should prohibit
this. IMO, sorting should be just as flexible as filtering; that is,
one should be able to specify any arbitrary function to apply.
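A short sketch of sorting by an arbitrary function, using hypothetical image records in which `:bytes` stands in for JPEG data (`sorter` is an illustrative name):

```clojure
;; Hypothetical image records; :bytes stands in for the JPEG data.
(def images
  [{:name "big.jpg"   :bytes (byte-array 300)}
   {:name "small.jpg" :bytes (byte-array 10)}])

(defn sorter
  "Build a transformer that sorts records by any key function."
  [keyfn]
  (fn [records] (sort-by keyfn records)))

;; Sort by size in bytes: the user-supplied part is a one-liner.
(def by-size (sorter (comp count :bytes)))
```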

>
> I would argue that by the time the data leaves the inputter, it should
> be in our standard format, ready to be used (and this is why I feel so
> strongly that the data shouldn't be left raw and hence disagreeing
> with you on point 1).

But who's to say that the data is raw? It may very well be in the
desired format when it arrives from the input source. To assume
otherwise is to unnecessarily limit the user.

Again, I'm going for the more flexible approach. While we agree that
data need to be normalized, I don't think it's necessary for us to
dictate what that normalized format is. We can certainly provide
defaults (Java number and date objects, for example).

> > 3) Multimethods may well be useful in the implementation. For example,
> > a conversion function could receive a value as one parameter, and a
> > map of instructions as another.
>
> I think this is one of those places where vocabulary is getting us
> into trouble.
>
> I don't know what you mean by a "conversion function".

I mean a function that converts data from one format to another. The
value is the data item to be converted, and the instructions are the
information needed by the function to accomplish it. As an example
(though maybe not one we would use in practice), a function that
converts a number to a string. The number itself would be the value,
and the instructions might include precision and scale, and even
locale (since thousands and decimal separators differ across locales).
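A sketch of such a conversion function written as a multimethod. The name `convert`, the instruction keys `:locale`, `:scale`, and `:fn`, and the rule that an explicit `:fn` trumps the rest are all assumptions for illustration:

```clojure
(import '(java.text NumberFormat)
        '(java.util Locale))

;; Hypothetical dispatch: an explicit :fn in the instructions wins;
;; otherwise dispatch on the class of the value.
(defmulti convert
  (fn [value instructions]
    (if (:fn instructions) ::custom (class value))))

(defmethod convert ::custom [value {f :fn}]
  (f value))

;; Numbers: :scale fixes the number of fraction digits, :locale picks
;; the grouping and decimal separators.
(defmethod convert Number [value {:keys [locale scale]}]
  (let [^NumberFormat fmt (NumberFormat/getNumberInstance
                           (or locale (Locale/getDefault)))]
    (when scale
      (.setMinimumFractionDigits fmt (int scale))
      (.setMaximumFractionDigits fmt (int scale)))
    (.format fmt (double value))))

;; Anything else falls back to str.
(defmethod convert :default [value _]
  (str value))
```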

>
> > (Obviously in some cases it might be simpler *not* to use
> > multimethods to the fullest, but the great thing is that we'd have the
> > freedom to choose.)
>
> The idea of multimethods would be to simplify the creation and
> handling of these data types. For example the sort algorithm might be
> the same between two types, but representation might be different, or
> vice versa, with the idea being that programmers could use standard
> transformations on the various data fields.
>
> If I do a "sort-by" and the field is a date field it shouldn't matter
> any more than a "sort-by" on a string field.

Again, we can provide simple defaults without requiring their use.

>
> So I don't understand what you mean here.
>
> Lastly, I want to emphasize I'm not prescribing any design; I'm just
> sharing thoughts I have on the current direction it seems we're going.

Serge, your vision of this product is probably much closer to that of
the original design than mine. I confess that I have not even
finished viewing the Yahoo Pipes screencast. If you guys view my
extending this product's mission to include non-web data as an
unwelcome distraction, then I'll stop talking about it. However, even
if we start with an assumption that input data is from RSS, I think it
would be a shame for us to make architectural decisions that would
make it more difficult to provide more flexibility in the future.

- Keith

Serge Wroclawski

Apr 10, 2009, 6:48:16 AM
to clojure-...@googlegroups.com
On Wed, Apr 8, 2009 at 6:00 PM, Keith Bennett <keithr...@gmail.com> wrote:

> I think our differences in viewpoint stem from our different intended
> use cases.  I was hoping it could be a tool that I could use in the
> enterprise, with relational data bases for example.

There's no reason it couldn't interface with non-textual data. If you
look at the wiki, I've added some use cases, including wanting to be
able to throw log data at it. I imagine being able to do counts and
things and generate graphs, either directly or via Javascript graphing
libs that use JSON.

The data types listed in the PostgreSQL manual look a lot
like the primitives I think we should support eventually:
http://www.postgresql.org/docs/7.4/interactive/datatype.html

At the same time, I think it'll be rare for users to get data in this
form; I suspect most data will already have been serialized somehow
and so it makes sense to think of data coming in pre-serialized.

> You seem to be focused more on emulating Yahoo Pipes, which was, I confess, our
> original stated intention.

I think once we emulate the functionality of Pipes, how to solve the
other issues will be clear.

I'd be willing to sacrifice a lot of functionality, including data
types, for something that works.

>> One thing that may be a difference between us is that I'm thinking
>> mostly about end users, rather than developers, and also, I'm thinking
>> that a vast majority of the time, the format will be a standard file
>> format, rather than something that would have to be handled by either
>> the user or a developer.
>
> It sounds like you are speaking of the output file format, right?

Nope. Look at Yahoo Pipes- the input formats are fairly limited. I
think it's reasonable to think that for a vast majority of users, the
input data will be in a standard file format like RSS or the JSON
provided by one of the more popular search engines, etc.

> In general I agree with what I think is your implication, that end
> users should not be expected to write Clojure code.  However, if we do
> our jobs well, then we can create a DSL that doesn't look at all like
> Clojure, and that is arguably easier to understand than XML.

Having users not write Clojure is not a goal I had in mind. I don't
feel strongly about it but I don't see a reason why they couldn't.
They don't have to "know" it's Clojure, just that the config has
certain rules. Then the file itself is fed into the reader at runtime.
Many programs written in {Perl|Python|Ruby} do this, and this is also
what Emacs does.

Imagine a config for a file format that looked like

(
:title string
:pubDate date
:author string
)

That's Clojure, and it looks pretty neutral.
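A sketch of how such a config could be read at runtime. It swaps in `clojure.edn/read-string` for the full Clojure reader, since the config here is plain data, and `type-converters` is a hypothetical table mapping the type names to functions:

```clojure
(require '[clojure.edn :as edn])
(import '(java.text SimpleDateFormat)
        '(java.util Locale))

;; Hypothetical mapping from config type names to converter functions.
(def type-converters
  {'string str
   'date   #(.parse (SimpleDateFormat. "EEE, dd MMM yyyy HH:mm:ss z"
                                       Locale/US)
                    %)})

(defn read-format-config
  "Read a config such as \"(:title string :pubDate date)\" into a map of
  field key to converter function."
  [text]
  (into {}
        (for [[field type-name] (partition 2 (edn/read-string text))]
          [field (type-converters type-name)])))
```

The user only edits the list of field and type names; whether they think of it as Clojure or as a config file with certain rules makes no difference to the reader.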

But I'd actually encourage them to use Clojure. After all, in this
case they'll be running their own instances, so they should have full
access to the system.

>> The first thing I notice is that your example uses meta-formats (XML
>> and JSON) rather than the final formats. This is where much of the
>> divergence is and why we come to different conclusions.
>
> I was referring to the input data; by "final formats" I assume you
> mean output data?

I wasn't clear here. You mentioned "XML files" a number of times. I'm
pointing out that XML is a specification for writing formats rather
than a format in itself. There's an important distinction.

If I give you a file in XML, it's not meaningful without knowing the
schema description, formally or informally.

JSON is the same way.

That's why I avoid talking about XML or JSON and talk about the "final
format" (which is admittedly a confusing term), to mean "the actual
format", eg "RSS".

> I hear you saying "*the* RSS", and this underlines our differences in
> outlook.  I was assuming that RSS would be only one of a multitude of
> supported input formats and use cases.

Yes, RSS is just one input format, just as Yahoo provides multiple
input formats. RSS is just nice to talk about because it's so clean.

> There is nothing to prevent us from providing compound components as
> conveniences, or enabling our users to create such a thing.  However,
> I still maintain that in the interest of cohesion, the "understanding"
> of the data, as you put it, *should* be separate from the reading of
> it.

> The problem with this is that "immediately usable" is variable, not
> fixed.  It's context dependent. The idea of a component framework is
> to allow the user ultimate flexibility.  What if the user really
> *does* want the date to be a java.lang.String, and not a
> java.util.Date?

> Yes, I am saying I want a second identification process, or really,
> conversion process. I'll give an example. Let's say there's a data
> flow we want to support, and normally its input is an RSS feed.
> Wouldn't it be nice if we could substitute inputs such that input
> could come from a data base, or even a string in memory?

>> Then (according to my understanding of what you've written), we'd have
>> a separate step in a transformer that says "Turn pubDate into a time
>> object"
>
> Yes, that's what I'm saying, except that the transformer component
> could operate on all data values in the record (with a map of field
> names (data map keys) and conversion functions, or nil for no
> conversion or default conversion). Again, we could prepackage a
> compound component tailored to RSS feeds, but there would be
> individual components doing each task.

I think this is one of those "show me the code" details: once you
see it in a program, the solution will become clearer to everyone.
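
To make the quoted proposal concrete, here is a minimal sketch of a transformer that takes a map of field names to conversion functions, with nil (or a missing entry) meaning "leave the value alone". The names convert-record and rss-conversions are hypothetical, and parsing the date with java.time.LocalDate is just one possible choice of conversion.

```clojure
(defn convert-record
  "Applies each conversion function to the matching field of record.
  Fields mapped to nil, or not mentioned at all, are left unchanged."
  [conversions record]
  (reduce-kv (fn [acc field value]
               (if-let [convert (get conversions field)]
                 (assoc acc field (convert value))
                 acc))
             record
             record))

;; Hypothetical conversions for an RSS-like record.
(def rss-conversions
  {:pubDate #(java.time.LocalDate/parse %) ; ISO date string -> date object
   :title   nil})                          ; nil: no conversion

;; :pubDate becomes a LocalDate; :title stays a string.
(convert-record rss-conversions
                {:title "Hello" :pubDate "2008-06-30"})
```

Since the internal representation is a seq of maps, a whole feed could be converted with (map (partial convert-record rss-conversions) records).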

What I'm not clear on is, in your mind, what the role of the inputter is?

Is it to parse the raw feed into chunks? Is it to provide key:val pairs?

At first I thought it might be only to parse the data into records.
That would make sense except that so many formats work differently.

An edge case I like is that of CSV files where the key of the key:val
pair is provided as the first line of the file, like this:

Beverage,Units Purchased, Units Sold,Remaining
Lemonade,70,50,20
Apple Juice, 90, 40,50
Orange Juice, 200,150,50

In this example the key:val pairs would look like

Beverage:Lemonade
Units Purchased: 70
Units Sold: 50
Remaining: 20

But there's no way to derive that by simply parsing the CSV file into records.
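
For illustration, here is a sketch of an inputter that knows the first line carries the keys and uses it to build the key:val pairs for every later line. The names csv-text, split-row and csv->records are invented for this example, and the naive split on commas is an assumption that ignores quoted fields and embedded commas.

```clojure
(require '[clojure.string :as str])

(def csv-text
  "Beverage,Units Purchased, Units Sold,Remaining
Lemonade,70,50,20
Apple Juice, 90, 40,50
Orange Juice, 200,150,50")

(defn split-row
  "Splits one line on commas and trims stray whitespace."
  [line]
  (mapv str/trim (str/split line #",")))

(defn csv->records
  "Treats the first line as the keys and returns a seq of maps,
  one per remaining line."
  [text]
  (let [[header & rows] (map split-row (str/split-lines text))]
    (map #(zipmap header %) rows)))

(first (csv->records csv-text))
;; => {"Beverage" "Lemonade", "Units Purchased" "70",
;;     "Units Sold" "50", "Remaining" "20"}
```

Note that every value is still a string at this point; nothing here says which columns are numbers.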

Then I thought maybe you wanted to do this parsing, but not provide
any typing, so going back to the above example...

If we provide no semantic data saying the Units Purchased, Units Sold
and Remaining Values are integers, we'd be stuck using them as
strings.
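
One way the missing semantic data could be supplied is a small table of field types plus a multimethod that knows how to read each type. This is only a sketch: coerce, field-types and type-record are hypothetical names, and the :integer tag is an assumed convention.

```clojure
(require '[clojure.string :as str])

;; Dispatch on the type tag; the raw value is always a string.
(defmulti coerce (fn [type-tag _raw] type-tag))

(defmethod coerce :integer [_ raw]
  (Long/parseLong (str/trim raw)))

;; Any tag without its own method leaves the value as a string.
(defmethod coerce :default [_ raw] raw)

;; Hypothetical description of which fields carry which type.
(def field-types
  {"Units Purchased" :integer
   "Units Sold"      :integer
   "Remaining"       :integer})

(defn type-record
  "Coerces each value according to field-types; unlisted fields
  are treated as :string and pass through unchanged."
  [field-types record]
  (into {}
        (map (fn [[field raw]]
               [field (coerce (get field-types field :string) raw)]))
        record))

(type-record field-types
             {"Beverage" "Lemonade" "Units Purchased" "70"
              "Units Sold" "50" "Remaining" "20"})
;; => {"Beverage" "Lemonade", "Units Purchased" 70,
;;     "Units Sold" 50, "Remaining" 20}
```

Adding a new data type then means adding one defmethod, without touching the code that walks the records.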

That's okay, I guess, as long as I could do a transformation later,
but earlier in this mail, you argue for inputters from databases, and
in those cases you'd have data coming in directly in the final format.


>> I don't want to pontificate too much more on this point (this mail is
>> already far, far too long for this stage), but I imagine the inputter
>> and outputters both working against the same file specification
>> document.
>
> Could you elaborate on that?  I don't understand. Are you saying that
> input and output file specifications would be expressed as a single
> specification (file, etc.) rather than two?

That was my idea, yeah.

You'd describe the format once and the parser could parse it and the
outputter could output it.

In this case you need to have knowledge of how the data is serialized
so you can output it, and the inputter can use it to slurp it in.

I'm not attached to any implementation but this was the idea I was
working off of.
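
A minimal sketch of that idea, assuming a delimited-text format: one description of the format drives both reading and writing. The names beverage-format, parse-line and emit-line, and the shape of the spec map itself, are all invented for illustration.

```clojure
(require '[clojure.string :as str])

;; One hypothetical format description shared by both directions.
(def beverage-format
  {:delimiter ","
   :fields    ["Beverage" "Units Purchased" "Units Sold" "Remaining"]})

(defn parse-line
  "Inputter side: turns one line into a map keyed by the spec's fields."
  [{:keys [delimiter fields]} line]
  (let [pattern (re-pattern (java.util.regex.Pattern/quote delimiter))]
    (zipmap fields (map str/trim (str/split line pattern)))))

(defn emit-line
  "Outputter side: writes a record back out in the spec's field order."
  [{:keys [delimiter fields]} record]
  (str/join delimiter (map #(get record %) fields)))

(emit-line beverage-format
           (parse-line beverage-format "Lemonade,70,50,20"))
;; => "Lemonade,70,50,20"
```

Changing the :delimiter or the :fields order in the one spec changes both the parser and the outputter together.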

>> You're certainly right that not all methods make sense on all data,
>> though. I don't know what "sort by"  a JPEG image would mean.
>
> As for the sort, again, I don't think we need to dictate what is a
> supported sort -- while we can provide common ones, specifying one's
> own sort format in Clojure is trivial, usually a one-liner, and a
> short one at that.  A user may very well want JPEG images sorted, by
> size, a metadata item, etc.; there's no reason we should prohibit
> this. IMO, sorting should be just as flexible as filtering; that is,
> one should be able to specify any arbitrary function to apply.

My only point originally was that we'd have a "sort by" method and a
"filter by" method and we should expect them to work no matter what
the underlying data structure is, all "polymorphic-like". Tha

Serge Wroclawski

Apr 10, 2009, 6:56:00 AM
to clojure-...@googlegroups.com
And this is why GMail introduced "Undo Send"- which I've just turned on *sigh*

But I was pretty much done anyway.

I'm happy to be shown a better way (that means I'm around people who
are smarter than I am, and while that's no big feat, it is always
reassuring).

- Serge
