The only reason to use XML is its DTD support and, specifically,
when you want strict typing. There are other
possible frameworks though, like duck typing. That's why, so far, skdb
has been using YAML. There have been people arguing for DTDs though
(Smári McCarthy, Sam Rose, and maybe Matt Campbell IIRC).
> Pros and Cons of XML:
>
> Pro:
> * Human readable
debatable
> * Easy to edit
debatable
> * 1 XML sheet per material.
meh, a lot of people violate this with giant XML archives (but whatever)
> Cons:
> * Text-based, inherently takes more storage
> * A comprehensive material list would require >50,000 files
number of files doesn't really matter to anyone these days. Besides,
once you have a data set, you can translate it back and forth after
writing translators, and if you use XML with a schema or DTD you can
use one of the automatic dtd2sql scripts.
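for illustration, a rough sketch of the XML-to-SQL direction in
Python (the element names, columns, and file names here are all made
up; a real dtd2sql script would derive the schema from the DTD
itself rather than hardcoding it):

    import sqlite3
    import xml.etree.ElementTree as ET

    # hypothetical: assumes one file per material shaped like
    # <material><name>steel 1018</name><density>7.87</density></material>
    conn = sqlite3.connect("materials.db")
    conn.execute("CREATE TABLE IF NOT EXISTS materials (name TEXT, density REAL)")
    for path in ["steel_1018.xml", "aluminum_6061.xml"]:  # made-up filenames
        root = ET.parse(path).getroot()
        conn.execute("INSERT INTO materials VALUES (?, ?)",
                     (root.findtext("name"), float(root.findtext("density"))))
    conn.commit()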
> * XML can be slow at times.
Those are not the real "cons" of XML, my friend. ;-) Well, perhaps in
this context, just maybe. There's a broader overview of problems with
XML, not really relevant in this context but worth knowing about; I
owe you a link (I have someone searching for it, so I'll get back to
this shortly and send this email for now).
> Pros and Cons of a DB:
>
> Pro:
> * Smaller storage
> * Faster Execution
btw, just about everything on the face of the planet is faster than YAML parsing
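if you want to check that claim yourself, here's a rough
micro-benchmark sketch (PyYAML vs the stdlib json module; the
document is made up and the exact numbers will vary by parser and
machine):

    import json
    import timeit
    import yaml  # PyYAML

    doc = {"materials": [{"name": "steel", "density": 7.85}
                         for _ in range(1000)]}
    yaml_text = yaml.safe_dump(doc)
    json_text = json.dumps(doc)

    # pure-python yaml.safe_load is typically much slower than json.loads;
    # libyaml's CSafeLoader narrows the gap but rarely closes it
    print("yaml:", timeit.timeit(lambda: yaml.safe_load(yaml_text), number=10))
    print("json:", timeit.timeit(lambda: json.loads(json_text), number=10))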
> Cons:
> * Binary Format
> * Potential unused rows
> * More difficult to create subsets
i'd also add "you can't commit it to a distributed revision control
system and expect to get out a usable, human-readable diff", which is
a biggie!
> * Harder to add additional properties (possible, depends on DB
> design)
that's the same with DTDs in general
> Personally I'm leaning toward an XML based format, however I'd like
> some input into that decision from you.
Have you checked the skdb samples?
a possible way to represent materials in yaml
http://designfiles.org/skdb/doc/proposals/materials.yaml
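to give the flavor without pasting the whole file, a material entry
might load roughly like this (the entry below is my own hypothetical
illustration, not a verbatim excerpt from materials.yaml):

    import yaml  # PyYAML

    # hypothetical sketch of a material entry
    doc = """
    steel_1018:
      category: metal
      density: 7.87            # g/cm^3
      tensile_strength: 440    # MPa
    """
    materials = yaml.safe_load(doc)
    print(materials["steel_1018"]["density"])  # 7.87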
if not, and if you haven't seen any of the *.yaml stuff yet, you
should spend some time clicking around here:
http://designfiles.org/skdb/
there's a readme for the directory structure stuff here:
http://designfiles.org/skdb/readme
also, i spent some time (and so did others) documenting different
material properties in a list:
http://designfiles.org/skdb/doc/lists/material_properties.txt
someone (probably Smári or Christian Siefkes) did write an XML example
of a manufacturing process, though:
http://designfiles.org/skdb/doc/proposals/hall-heroult.process
based on this DTD:
http://www.tangiblebit.org/xml/process-1.0.dtd
... and fenn spent a lot of time representing some manufacturing processes:
http://designfiles.org/skdb/processes.yaml
a really poorly thought-out dependency tree of parts/tools for transhuman tech:
http://designfiles.org/skdb/doc/proposals/trans-tech.yaml
basic example of using tags (!foo) in yaml:
http://designfiles.org/skdb/doc/proposals/tags.yaml
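the tags are what let a loader map entries onto program objects; as a
rough sketch of the mechanism (the !material tag and Material class
here are my own illustration, not necessarily what tags.yaml uses):

    import yaml  # PyYAML

    class Material:
        def __init__(self, **props):
            self.props = props

    def material_constructor(loader, node):
        # build a Material from the mapping under the !material tag
        return Material(**loader.construct_mapping(node))

    yaml.SafeLoader.add_constructor("!material", material_constructor)

    doc = yaml.safe_load("steel: !material {density: 7.87, category: metal}")
    print(doc["steel"].props["density"])  # 7.87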
general architecture description readme thing:
http://designfiles.org/skdb/doc/architecture
there's a lot of background, but suffice it to say that as part of the
Automated Design Lab i gave an update to some of the other VOICED
participants on some of the technical details, though this email has
probably been more informative:
http://adl.serveftp.org/lab/presentations/updates-from-austin.pdf
Ben Lipkowitz, Smári McCarthy, and I were talking about this in
#hplusroadmap (on irc.freenode.net) almost exactly a year ago, if you want
to see the logs:
http://gnusha.org/logs/2009-07-17.log
and if you feel like downloading 20 MB of logs: http://gnusha.org/irclogs.txt
> Also, if this works out, I wouldn't mind people emailing me any
> material spec sheets that they have, I have a few thousand, but more
> never hurts!
Has anyone noticed that octopart.com has somehow been able to convince
some of the suppliers to submit data for their electronics? Not just
pdf datasheets, but actual data. I talked with Andre once and I think
what they are doing is a little sly: (1) sometimes they actually get
data from their supplier in a Microsoft Excel spreadsheet, CSV file,
or something else, and they are very happy, but most commonly (2) they
just have their pdf text search engine look at the different values
and parameters in each of the pdf data sheets. It's a horrible,
horrible data wrangling problem, but if it could be solved for all the
millions of data sheets out there, we all know how wonderful life can
become.
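a toy sketch of approach (2), for flavor (this assumes you've already
run something like pdftotext; the excerpt and the regex are made up,
and real data sheets are far messier than this):

    import re

    # made-up excerpt of what pdftotext might give you from a datasheet
    text = """
    Supply Voltage ............ 2.7 V to 5.5 V
    Operating Temperature ..... -40 C to +85 C
    """

    # crude pattern: a label, dot leaders, then the value with units
    pattern = re.compile(r"^(?P<param>[A-Za-z ]+?)\s*\.+\s*(?P<value>.+)$", re.M)
    for m in pattern.finditer(text):
        print(m.group("param").strip(), "=>", m.group("value").strip())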
Strangely nobody has figured out how to *do* this from a practical
standpoint. Octopart.com's strategy seems to be "become popular and
use that to get the electronics manufacturers to submit data to us,
and we pray that they have digitized their catalog sufficiently". I
have been thinking about spending some of my funds for SKDB on just
converting data sheets over to a parseable format, for some subset of
interesting components or whatever, but my funds are not infinite and
chipping away at this problem by using manual human labor (like via
Amazon's Mechanical Turk or even just hiring goons off craigslist) is
a really easy way to burn money for little comparable gain. My
preference would be to find something with more of a geometric
return, where data sheets gained grow faster than funds spent, or
something.
So anyway.. regardless of the initial file format, a big remaining
unsolved problem is the process by which data gets put into the
system. fenn has talked about this a few times on this mailing list,
like giant proprietary materials data sets, whether or not the
hardness of steel is "public data" or what (legal issues), whether we
could just start transcribing from library books without any fretting,
and how that is all supposed to work. Are package maintainers for individual
hardware projects to be responsible for documenting the unique
materials in their projects? Or is openmaterials.org going to work on
some data sets for us (hi Catarina!)? Or is there a way that a
"business" front like octopart.com can convince manufacturers,
material suppliers, etc., to become more standardized? My guess is no:
business standardization initiatives are ridiculously hard work and
have been around forever; something different has to happen (like
public discussion of WTF, as we're doing now).
Also, I'll get that link about actual cons re: XML soon. Thank you for
the email!
XML is slow if you use a DOM parser to read the whole thing into
memory. When multi-GB files are involved (e.g., using XML as an
intermediate format for SPICE-to-Verilog-AMS netlist translation), you
definitely want a more streamed/serialized approach, doing multiple
passes over the data to access what you need.
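In Python terms that's roughly the difference between ElementTree's
parse() and iterparse(); a minimal sketch of the streamed version (the
file name and the <part> tag are placeholders):

    import xml.etree.ElementTree as ET

    # streamed parse: only one element subtree lives in memory at a time
    for event, elem in ET.iterparse("netlist.xml", events=("end",)):
        if elem.tag == "part":
            print(elem.get("name"))  # do this pass's work here
            elem.clear()             # free the subtree so memory stays bounded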
Using the streamed XML approach also means you can zip the text for
better compression. Then you can pipe the unzipped stream straight
into the XML parser; there's no need to decompress the contents to an
intermediate file or hold them in memory.
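And since iterparse() accepts any file-like object, the pipe is one
line, with gzip here as a stand-in for whatever compressor you use:

    import gzip
    import xml.etree.ElementTree as ET

    # decompress on the fly: no intermediate file, no full in-memory copy
    with gzip.open("netlist.xml.gz", "rb") as stream:
        for event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == "part":
                elem.clear()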
Andrew.
--
"The future is already here. It's just not very evenly distributed" -- William Gibson