"Semantic" shell? (for lack of better name)

Bryan Bishop

Jan 2, 2009, 10:00:06 PM

Hi all,

I'd like to work on a method to search for packages based off of
recognized input file formats and recognized output file formats of
the contained program(s). Maybe by MIME-type (RFC 2046), such as:

image/gif
image/jpeg
image/png
image/tiff
video/mp4
video/mpeg
application/x-latex

Here's the list of MIME-type assignments:
http://www.iana.org/assignments/media-types/

However, I am by no means permanently attached to MIME. It would also
be interesting to revise the typical --help message with some
standardized markup for formally specifying which parameters would
prefer what type of information. Typically, when I write my quick
scripts, I just do a few print statements and spit out some text for
help messages, and sometimes clean it up a bit, so to replace that
laziness I'd have to write a tool to make that less of a pain, maybe
throw it in next to autoproject or something.

So, this might just mean an extra file in a package, with two lines,
the first one for input recognized, the second one for types of
output, but this of course isn't a good map for what each parameter
will trigger in terms of output, esp. in programs that change their output
depending on what they discover about the input. Also, this only really
works for single-program packages, otherwise this needs to be done at
some other level, i.e. a file next to each binary? Is that where this
should go??
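
A minimal sketch of the two-line file idea, in Python. Everything concrete here is an assumption made for illustration: the file layout (whitespace-separated MIME types, inputs on the first line, outputs on the second), the package names, and the function names.

```python
from typing import Dict, List, Tuple


def parse_io_types(text: str) -> Tuple[List[str], List[str]]:
    """Parse the hypothetical two-line file: inputs on line 1, outputs on line 2."""
    lines = text.splitlines()
    inputs = lines[0].split() if len(lines) > 0 else []
    outputs = lines[1].split() if len(lines) > 1 else []
    return inputs, outputs


def find_packages(index: Dict[str, str], reads: str, writes: str) -> List[str]:
    """Return the packages that list `reads` as an input and `writes` as an output."""
    matches = []
    for name, text in index.items():
        inputs, outputs = parse_io_types(text)
        if reads in inputs and writes in outputs:
            matches.append(name)
    return sorted(matches)


# Made-up package names and file contents, for illustration only.
INDEX = {
    "pkg-img2png": "image/gif image/jpeg\nimage/png\n",
    "pkg-tex2img": "application/x-latex\nimage/png image/gif\n",
}

print(find_packages(INDEX, "image/gif", "image/png"))
```

As noted above, a flat pair of lists like this cannot say which parameter triggers which output, and it only describes one program per file.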

Personally this seems kind of an obvious thing to do, but it hasn't
happened yet, so I'm posting to ask specifically--

(1) Has this been proposed before? Can anyone give me names, links,
addresses, or what went wrong?

(2) Anything better than MIME for these purposes?

(3) Search terms other than 'semantic shell', anyone?

(4) What should I be asking?

I've basically written up this email on a site as well-
http://heybryan.org/shell.html

Happy new year,

- Bryan
1 512 203 0507


Guus Sliepen

Jan 3, 2009, 3:00:11 AM

On Fri, Jan 02, 2009 at 08:51:12PM -0600, Bryan Bishop wrote:

> I've basically written up this email on a site as well-
> http://heybryan.org/shell.html

I do not see what this has to do with Debian Development. It also sounds
awfully similar to what doxygen does. However, most manuals generated with
doxygen are absolutely worthless: first and foremost they describe the syntax,
not the semantics. It's up to the programmer to describe the semantics, and most
of them forget.

--
Met vriendelijke groet / with kind regards,
Guus Sliepen <gu...@debian.org>


Stefano Zacchiroli

Jan 3, 2009, 8:50:08 AM

On Fri, Jan 02, 2009 at 08:51:12PM -0600, Bryan Bishop wrote:

> I'd like to work on a method to search for packages based off of
> recognized input file formats and recognized output file formats of
> the contained program(s). Maybe by MIME-type (RFC 2046), such as:

I'm not really able to reconcile this paragraph about "searching
packages on a recognized format basis" with the remaining part of your
post ...

Anyhow, debtags does offer a way to search for packages on the basis
of which file format they "work with". Have a look at the
"works-with-format" facet, e.g. in the debtags tag cloud [1]. It might
be less specific than what you need, for example it does not consider
the "direction" (input vs output) of the supported format. YMMV.

[1] http://debtags.alioth.debian.org/cloud/

Cheers

PS preserving the fully quoted version of your post, to the benefit of
debtag...@l.a.d.o readers

--
Stefano Zacchiroli -o- PhD in Computer Science \ PostDoc @ Univ. Paris 7
zack@{upsilon.cc,pps.jussieu.fr,debian.org} -<>- http://upsilon.cc/zack/
Dietro un grande uomo c'è ..| . |. Et ne m'en veux pas si je te tutoie
sempre uno zaino ...........| ..: |.... Je dis tu à tous ceux que j'aime


Josselin Mouette

Jan 3, 2009, 11:30:11 AM

On Friday, 2 January 2009 at 20:51 -0600, Bryan Bishop wrote:
> I'd like to work on a method to search for packages based off of
> recognized input file formats and recognized output file formats of
> the contained program(s).

That doesn’t exist for output file formats, but for input file formats,
we already have such a registry in the app-install-data package. If you
want to look for code making use of this data, you can have a look at
gnome-app-install or at the relevant nautilus code[0].

[0] http://patch-tracking.debian.net/patch/series/view/nautilus/2.24.2-1/20_open-with_install.patch

--
.''`.
: :' : We are debian.org. Lower your prices, surrender your code.
`. `' We will add your hardware and software distinctiveness to
`- our own. Resistance is futile.


Erich Schubert

Jan 3, 2009, 8:50:07 PM

Hello Bryan,
I've thought about similar efforts, many of which centered on having a
generic "command line syntax definition language".
Not every application can be squeezed into the input, output scheme. The
situation with multiple inputs, single output is common.
Neither can every application convert every input to every output,
sometimes just particular combinations might be possible (in particular,
input format might have to be the same as the output format).
Then there are "meta formats", especially compression.
For example gzip will convert any file type to the same file type but
gzip-compressed (or the other way round using gunzip).
So a tool trying to "magically" build chains would need to understand
that while gzip can process the "*/*" mime type, it won't convert the
file type, whereas 'convert' can convert next to any image file type to
next to any other image file type.
But for example to convert text/plain to image/gif with convert, you
should also specify a font...
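
To make the chain-building problem concrete, here is a rough Python sketch. The tool table is an assumption: the "=" marker for "output type equals input type" is an invented convention, the list of output types for convert is a small sample, and "latex2png" is a made-up tool name. It does not model the "compressed" meta format or extra requirements such as the font.

```python
from collections import deque
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Hypothetical tool table: (name, accepted input pattern, output types).
# An output of "=" means "same type as the input" (a pass-through tool).
TOOLS: List[Tuple[str, str, List[str]]] = [
    ("gzip", "*/*", ["="]),
    ("convert", "image/*", ["image/gif", "image/jpeg", "image/png"]),
    ("latex2png", "application/x-latex", ["image/png"]),  # made-up tool
]


def find_chain(source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest tool chain from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        mime, chain = queue.popleft()
        if mime == target:
            return chain
        for name, pattern, outputs in TOOLS:
            if not fnmatchcase(mime, pattern):
                continue
            for out in outputs:
                result = mime if out == "=" else out
                # A pass-through tool never yields a new type, so it is
                # never picked as a conversion step.
                if result not in seen:
                    seen.add(result)
                    queue.append((result, chain + [name]))
    return None


print(find_chain("application/x-latex", "image/gif"))
print(find_chain("text/plain", "image/gif"))
```

With this table the first call finds a two-step chain and the second finds none, because nothing in the table declares that it reads text/plain and produces an image.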

The debtags efforts take a very minimal approach here: they use 'looser'
file types than MIME and they do not differentiate between input, output
or whatever-put. There are some benefits from that, including
- less information needs to be collected and updated
- the information is more likely to be accurate
obviously at the cost of the information being less useful. At some
point you need to make a cut.

At some point I was considering actually using RDF-like triples such
as "app1 reads image/gif" "app1 writes image/jpeg" etc., but we ended up
going with a tuple-only approach for complexity reasons.
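
For illustration, a small Python sketch of what the extra verb buys you. The "app1" statements come from the example above; "app2" and all function names are made up, and a real implementation would more likely sit on top of existing RDF tooling.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    verb: str
    obj: str


TRIPLES = [
    Triple("app1", "reads", "image/gif"),
    Triple("app1", "writes", "image/jpeg"),
    Triple("app2", "reads", "image/jpeg"),   # app2 is made up
    Triple("app2", "writes", "image/png"),
]


def query(subject: Optional[str] = None, verb: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples matching every field that is not None."""
    return [t for t in TRIPLES
            if (subject is None or t.subject == subject)
            and (verb is None or t.verb == verb)
            and (obj is None or t.obj == obj)]


def converters(src: str, dst: str) -> List[str]:
    """Applications that read `src` and write `dst`."""
    readers = {t.subject for t in query(verb="reads", obj=src)}
    writers = {t.subject for t in query(verb="writes", obj=dst)}
    return sorted(readers & writers)


def works_with(mime: str) -> List[str]:
    """Direction-less lookup, which is all a plain tag can express."""
    return sorted({t.subject for t in query(obj=mime)})


print(converters("image/gif", "image/jpeg"))
print(works_with("image/jpeg"))
```

The direction-less lookup returns both applications for image/jpeg, while the triple query can tell the one that writes it from the one that reads it.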

Of course things have made progress since. For example, the .desktop
files usually include useful information about which MIME types an
application supports (unfortunately, many non-GUI applications still do
not ship with .desktop files), but the information there also has some
kind of "vagueness".
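
As a sketch of how that information could be harvested, the Python below reads the MimeType key from a .desktop file with the standard configparser module. The sample file content and the function name are made up, and treating a .desktop file as a plain INI file is a simplification that may not hold for every real file.

```python
import configparser
from typing import List

# Made-up sample; real files carry more keys (localized names, icons, ...).
SAMPLE_DESKTOP = """\
[Desktop Entry]
Type=Application
Name=Example Viewer
Exec=example-viewer %f
MimeType=image/gif;image/jpeg;image/png;
"""


def mime_types_from_desktop(text: str) -> List[str]:
    """Return the MIME types listed in the MimeType key, or [] if it is absent."""
    # interpolation=None so that field codes such as %f are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keys in .desktop files are case-sensitive
    parser.read_string(text)
    raw = parser.get("Desktop Entry", "MimeType", fallback="")
    return [m for m in raw.split(";") if m]


print(mime_types_from_desktop(SAMPLE_DESKTOP))
```

Note that the list says nothing about whether the application reads or writes a given type, which is part of the vagueness mentioned above.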

So it might well be time to do the next step and collect such meta
information on a "reads" "writes" "displays" "prints" and whatever
basis. However collecting all this data sounds like a huge task to me.

I mentioned before that I was also thinking about a "command line syntax
definition language". The reason is that command line programs vary a
lot in how parameters are passed. There are certain common standards
such as GNU getopt command line syntax (i.e. single letter options with
a single dash, long options with a double dash, single letter options
can be joined ...), but there are also tons of exceptions
(e.g. "java -version" is different from "java -v -e -r -s -i -o -n" and
would have been "java --version" in getopt style).
A specification of the available options in some meta format ideally
would also give an indication of valid file types for file name
parameters. It should also note mutually exclusive options. And it is
obvious that not all command lines can be described this way completely
(e.g. to fully validate "perl -e 'perl expression'" you'd need to be
able to validate perl syntax ... and "only perl can parse perl". So
you'll never know what MIME types that statement accepts ...)
A solution covering 90% might still be very nice to have.
I believe that a "semantic shell" might need to be based around the
command line interface of the applications.
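
A very rough Python sketch of what such a description could look like. The class names, fields, and the "sometool" program are all invented for illustration; only the java -version versus --version contrast comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Option:
    name: str                                            # logical name, e.g. "version"
    file_types: List[str] = field(default_factory=list)  # MIME types accepted as value


@dataclass
class CommandSpec:
    program: str
    long_prefix: str          # "--" for getopt style, "-" for java style
    options: Dict[str, Option]

    def render(self, name: str, value: Optional[str] = None) -> List[str]:
        """Build an argv for one logical option in the program's own syntax."""
        opt = self.options[name]
        argv = [self.program, self.long_prefix + opt.name]
        if value is not None:
            argv.append(value)
        return argv

    def accepts(self, name: str, mime: str) -> bool:
        """Does the file-name parameter `name` accept files of type `mime`?"""
        return mime in self.options[name].file_types


JAVA = CommandSpec("java", "-", {"version": Option("version")})
GNU_TOOL = CommandSpec("sometool", "--", {           # made-up program
    "version": Option("version"),
    "input": Option("input", file_types=["image/png"]),
})

print(JAVA.render("version"))
print(GNU_TOOL.render("version"))
print(GNU_TOOL.accepts("input", "image/png"))
```

A caller can then ask for the logical option "version" without knowing which dash convention the program follows.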

best regards,
Erich Schubert
--
erich@(vitavonni.de|debian.org) -- GPG Key ID: 4B3A135C (o_
Reality continues to ruin my life --- Calvin //\
Der Anfang aller Erkenntnis ist das Staunen. --- Aristoteles V_/_

Bryan Bishop

Jan 4, 2009, 5:50:08 PM

On Sat, Jan 3, 2009 at 7:16 PM, Erich Schubert wrote:
> I've thought about similar efforts, many of which centered on having a
> generic "command line syntax definition language".

I don't know if formal grammars (BNF, etc.) would apply there.

> Not every application can be squeezed into the input, output scheme. The

Right. Fundamentally all steps in a program take some input and
generate some output. The GNU approach of just splitting everything
into smaller and smaller programs is one way to do this, but I'd
prefer to not have to rewrite everything ever.

Part of what I want to do here is "substitutions", just like you can
find so many substitutions for your favorite text editor (insert
flamewar here). But in a way not dependent on requiring people to know
every text editor ever.

> situation with multiple inputs, single output is common.
> Neither can every application convert every input to every output,
> sometimes just particular combinations might be possible (in particular,
> input format might have to be the same as the output format).

Yes, so there's always going to be exceptions to the rule of thumb of
program structure, but in those exceptional cases is it something that
would qualify for inclusion in this system anyway?

> But for example to convert text/plain to image/gif with convert, you
> should also specify a font...

Right, but there's also a known command for spitting out a list of
valid fonts, so that could be a note attached to that defined
parameter to the 'convert' program.

> The debtags efforts take a very minimal approach here: they use 'looser'
> file types than MIME and they do not differentiate between input, output
> or whatever-put. There are some benefits from that, including
> - less information needs to be collected and updated
> - the information is more likely to be accurate
> obviously at the cost of the information being less useful. At some
> point you need to make a cut.
>
> At some point I was considering actually using RDF-like triples such
> as "app1 reads image/gif" "app1 writes image/jpeg" etc., but we ended up
> going with a tuple-only approach for complexity reasons.

we? who? Any working code that I could go poke?

> Of course things have made progress since. For example, the .desktop
> files usually include useful information about which MIME types an
> application supports (unfortunately, many non-GUI applications still do
> not ship with .desktop files), but the information there also has some
> kind of "vagueness".

Another response to my email included a link over to
'open-with-install', a program to query the apt repositories to find a
program to open a certain file with, which is a good step in that
direction. But yes, there is a vagueness involved in .desktop too.

> So it might well be time to do the next step and collect such meta
> information on a "reads" "writes" "displays" "prints" and whatever
> basis. However collecting all this data sounds like a huge task to me.

Yikes, there are so many of those verbs though - you'd basically be
doing Cyc or WordNet all over again. Each program technically
qualifies as a new verb too, so the usefulness seems to deteriorate.

> I mentioned before that I was also thinking about a "command line syntax
> definition language". The reason is that command line programs vary a
> lot in how parameters are passed. There are certain common standards
> such as GNU getopt command line syntax (i.e. single letter options with

This is why I first thought that this might require a new type of
shell standard, or doing these edits to the system somewhere else
other than at the Debian level. Environment variables for passing
commands (like in CGI) might not be enough metadata being passed
around (or something). I'm very uncertain about all this.

> A specification of the available options in some meta format ideally
> would also give an indication of valid file types for file name
> parameters. It should also note mutually exclusive options. And it is

What about logic languages for expressing valid and invalid uses of
the parameter pool? I.e., define the valid breadth-first set of
parameters, and then some logic conditionals that describe validity.
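
Something like the following Python sketch is one way to read that idea: a flat set of known parameters, plus rules expressed as predicates over the chosen set. The parameter names, rule helpers, and messages are all made up for illustration.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Params = FrozenSet[str]
Rule = Tuple[str, Callable[[Params], bool]]  # (description, validity predicate)


def mutually_exclusive(*names: str) -> Rule:
    """At most one of the given parameters may be present."""
    return (f"at most one of {', '.join(names)}",
            lambda p: len(p & set(names)) <= 1)


def requires(name: str, needed: str) -> Rule:
    """If `name` is present, `needed` must be present too."""
    return (f"{name} requires {needed}",
            lambda p: name not in p or needed in p)


# Hypothetical parameter pool and rules for some imaginary program.
KNOWN: Params = frozenset({"--quiet", "--verbose", "--output", "--format"})
RULES: List[Rule] = [
    mutually_exclusive("--quiet", "--verbose"),
    requires("--format", "--output"),
]


def violations(params: Iterable[str]) -> List[str]:
    """Return a description of every unknown parameter and broken rule."""
    chosen = frozenset(params)
    errors = [f"unknown parameter {p}" for p in sorted(chosen - KNOWN)]
    errors += [msg for msg, ok in RULES if not ok(chosen)]
    return errors


print(violations(["--quiet", "--verbose"]))
print(violations(["--output", "--format", "--quiet"]))
```

This only checks which parameters appear together; it says nothing about their values or their order.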

> obvious that not all command lines can be described this way completely
> (e.g. to fully validate "perl -e 'perl expression'" you'd need to be
> able to validate perl syntax ... and "only perl can parse perl". So
> you'll never know what MIME types that statement accepts ...)

"perl -e" accepts a string of text data (I don't think binary data
works). Whether or not the string is valid in terms of perl syntax
(etc.) is another issue entirely, isn't it?

> A solution covering 90% might still be very nice to have.
> I believe that a "semantic shell" might need to be based around the
> command line interface of the applications.

Another area that I'm applying this to is the interconnection of
manufacturing processes, but I'll digress for the moment.

- Bryan
http://heybryan.org/
1 512 203 0507

Erich Schubert

Jan 4, 2009, 7:10:10 PM

Hello Bryan,

> > At some point I was considering actually using RDF-like triples such
> > as "app1 reads image/gif" "app1 writes image/jpeg" etc., but we ended up
> > going with a tuple-only approach for complexity reasons.
>
> we? who? Any working code that I could go poke?

The Debtags developers, in particular Enrico and I. But that was at the
earlier planning stage, and to a certain extent the current tags can be
read as triples (works-with, uitoolkit can be read as "uses ui toolkit"
etc.), but we have a very restricted set of verbs, and the
implementation treats verb+object pretty much as a union.

A triplet-based approach should probably really be based on RDF and use
the existing tools for that instead of reinventing the wheel.

> Another response to my email included a link over to
> 'open-with-install', a program to query the apt repositories to find a
> program to open a certain file with, which is a good step in that
> direction. But yes, there is a vagueness involved in .desktop too.

This apps data might actually be just derived from the .desktop files.
I don't know the details, but that is what comes to mind. In particular
since Gnome would only suggest opening the file with the application
after installation when it has a matching .desktop file IIRC.

> > So it might well be time to do the next step and collect such meta
> > information on a "reads" "writes" "displays" "prints" and whatever
> > basis. However collecting all this data sounds like a huge task to me.
>
> Yikes, there are so many of those verbs though - you'd basically be
> doing Cyc or WordNet all over again. Each program technically
> qualifies as a new verb too, so the usefulness seems to deteriorate.

In fact, using data from WordNet etc. is something that immediately
comes to mind, especially for not reinventing the wheel.
You will need to make a cut somewhere though, at the latest when it comes
to UI. Probably already for data input. Selecting a subset of WordNet
might still be a good idea for data exchange.

> "perl -e" accepts a string of text data (I don't think binary data
> works). Whether or not the string is valid in terms of perl syntax
> (etc.) is another issue entirely, isn't it?

Not really. It's just a question of complexity.
Many applications have some "file type" parameter. For "convert" IIRC
you can prefix the file name with a file type, e.g. "gif:-" to force
output in gif format on stdout.
So where is the difference in having "gif:" as a prefix of the output
parameter (which happens to not just be a plain filename!) or having a
full perl program there that does the gif generation?
Except that you can cover one with a regexp I guess, and the other not.
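
To illustrate the regexp side of that comparison, here is a Python sketch that splits a "gif:-" style argument into its parts. The pattern is a simplification written for this example, not the actual parsing rules of convert, and it would misread things like a Windows drive letter as a format prefix.

```python
import re
from typing import NamedTuple, Optional


class OutputArg(NamedTuple):
    fmt: Optional[str]   # explicit format prefix, if any
    path: str            # file name, or "-" for stdout
    is_stdout: bool


# Simplified pattern: an alphanumeric format name, a colon, then the rest.
_PREFIXED = re.compile(r"^(?P<fmt>[A-Za-z0-9]+):(?P<path>.+)$")


def parse_output_arg(arg: str) -> OutputArg:
    """Split an optional 'format:' prefix off an output argument."""
    m = _PREFIXED.match(arg)
    if m:
        fmt, path = m.group("fmt").lower(), m.group("path")
    else:
        fmt, path = None, arg
    return OutputArg(fmt, path, path == "-")


print(parse_output_arg("gif:-"))
print(parse_output_arg("out.png"))
```

No comparable pattern exists for the perl -e case, since deciding what an arbitrary program writes means running or fully analysing it.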

You have to make compromises somewhere.

P.S. autocompletion data files of zsh and bash might be a useful
starting point, too, btw.

best regards,
Erich Schubert
--
erich@(vitavonni.de|debian.org) -- GPG Key ID: 4B3A135C (o_

Friends are those who reach out for //\
your hand but touch your heart. V_/_
Das größte Hindernis beim Erkennen der Wahrheit ist nicht die
Falschheit, sondern die Halbwahrheit. --- L. N. Tolstoi