I'm working on some code to digest daily XML files (about 3.5M when they are all gzipped). It seems to me that a way to handle the data is to create a hash-table with the Values being structures, e.g.
    (dolist (game games)
      (let (( game (make-gamedata))
            ( parsing vars ))
        ...parse 3 xml files, stuff data into structure 'game'
        ...including identifying hash-key - which happens to be a string,
           e.g. "2005/09/12/sdnmlb-sfnmlb-1"
        ...store structure game into *game-hash* with key
        )) ; end of 'dolist
That picks up the first layer of indexing; the gathered info is then used for a second layer of dependent XML files. I'll stuff more data into 'gamedata'. I can use the hash-key to pull out the game of interest, and since it is a hash-table, I can do that quickly....fastest?
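For readers following along, a minimal sketch of the layout described above. The slot names are hypothetical (the post does not list them); only the string key format comes from the post. The one detail that matters is the table test, since string keys need EQUAL rather than the default EQL.

```lisp
;; Hypothetical slots; the original post only says the values are structures.
(defstruct gamedata
  key                                        ; e.g. "2005/09/12/sdnmlb-sfnmlb-1"
  home
  away
  (players (make-hash-table :test 'equal)))  ; second layer, filled in later

;; String keys need an EQUAL test: under the default EQL test, two
;; separately created copies of the same string count as different keys.
(defvar *game-hash* (make-hash-table :test 'equal))

(defun store-game (game)
  "Store GAME in *GAME-HASH* under its own key and return GAME."
  (setf (gethash (gamedata-key game) *game-hash*) game))

(defun find-game (key)
  "Return the game stored under KEY, or NIL if there is none."
  (values (gethash key *game-hash*)))
```

Lookup is then a single GETHASH on the key string.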
After all that work, I think saving the result is a good idea. Saving it in the filesystem at the head of the local tree of XML files seems best. That way, I can check that it exists, load it, read it, modify it and put it back into the fs when done. Since this is daily data, and various incremental searches and combinations might need to be constructed, something simple that handles a single hash-table seems appropriate.[1]
Looking at the Persistence and Serialization packages leaves me a little lost. I strongly prefer something that runs inside CMUCL and doesn't need a bunch of support packages too. Something very reliable that clueless noobs can have running in seconds. Speed really isn't of any concern; it just has to be faster than a full reparse of the XML.
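One dependency-free baseline, sketched here as an assumption rather than anything proposed in the thread: write the table out as a list of key/value pairs with the standard printer and rebuild it with the reader. The names SAVE-TABLE and LOAD-TABLE are made up for illustration, and the sketch only works when every key and value has a readable printed form (numbers, strings, lists, structures with default printing; not nested hash tables).

```lisp
(defun save-table (table pathname)
  "Write TABLE to PATHNAME as one readable form: (test (key . value) ...)."
  (with-open-file (out pathname :direction :output :if-exists :supersede)
    (with-standard-io-syntax
      (let ((*print-circle* t))          ; tolerate shared or circular data
        (prin1 (cons (hash-table-test table)
                     (loop for k being the hash-keys of table
                             using (hash-value v)
                           collect (cons k v)))
               out))))
  pathname)

(defun load-table (pathname)
  "Rebuild the table written by SAVE-TABLE; return NIL if the file is absent."
  (when (probe-file pathname)
    (with-open-file (in pathname)
      (with-standard-io-syntax
        (let* ((*read-eval* nil)         ; plain data only, no #. evaluation
               (form (read in))
               (table (make-hash-table :test (first form))))
          (loop for (k . v) in (rest form)
                do (setf (gethash k table) v))
          table)))))
```

The PROBE-FILE check covers the "check that it exists, load it" step; modifying the table and calling SAVE-TABLE again puts it back.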
Recommendations? Extra credit for packages with sample code
TIA
[1] the Wikipedia entry for Serialization has a subsection for Common Lisp that sorta sucks compared to the others.
-- "Most programmers use this on-line documentation nearly all of the time, and thereby avoid the need to handle bulky manuals and perform the translation from barbarous tongues." CMU CL User Manual
On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:
> Recommendations? Extra credit for packages with sample code
I looked at CL-Store; it seems that I would need to maintain any structures in three places: once at definition, once at storage (which seems to be slot by slot) and again at restore.
PLOB - "The inability to licence POSTORE essentially orphans PLOB!"
Elephant didn't work on cmucl last time I tried.
Perec was SBCL specific and I think only worked with a database.
---
Most of the packages I look at sound like global stores only, I want/need multiple local stores (about 180 per year of interest).
Hi! with-standard-io-syntax binds *print-circle* to nil. Will it loop if the object tree has circular references? Will it work if we bind *print-circle* to t? Is a 10 MB fasl file really possible in CMUCL?
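The printer side of that question can be tried in isolation. The sketch below (helper names are made up) builds a small circular list and prints it with *print-circle* rebound to T inside with-standard-io-syntax; it says nothing about how compile-file dumps circular literals.

```lisp
(defun make-circular-list ()
  "Return a two-element list whose tail points back at its own head."
  (let ((cells (list 1 2)))
    (setf (cddr cells) cells)
    cells))

(defun print-circular (object)
  "Print OBJECT to a string with standard syntax, but with *PRINT-CIRCLE* on."
  (with-standard-io-syntax
    (let ((*print-circle* t))
      (prin1-to-string object))))

;; (print-circular (make-circular-list)) returns "#1=(1 2 . #1#)".
;; With *print-circle* left at NIL the same call would not terminate.
```

The #1= / #1# labels are read back by the standard reader, so the circularity survives a print/read round trip.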
On Fri, 16 Jan 2009 15:03:14 +0530, <enom...@meer.net> wrote:
> * GP lisper <slrngn0ib8.31r.spamb...@phoenix.clouddancer.com> :
> Wrote on Fri, 16 Jan 2009 00:46:32 -0800:

>| On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:
>|>
>|> Recommendations? Extra credit for packages with sample code

> While there are any number of serialization options --- I think your
> size is not too big and speed is not a showstopping issue --- I'd
> just use XMLisp and READ the XML file into a CLOS object. Then frob the
> CLOS object as required into hash tables, modify it back in the CLOS
> object and WRITE it out.
Well, there are about 410 XML files involved in a single day. There are a lot of overlapping data-keys in all those files, and of course they don't have identical values. So I need to pick out only some data from each file, i.e. I have custom readers for them. Since speed becomes an issue once the data is collected, I'm avoiding objects for now; I don't need anything more than the features structures offer. If XMLisp will create structures instead of objects, then it would probably be useful to employ it for reading the files. I can always copy slots if overall it is simpler code... I suppose I could do that for objects -> structures too.
I see that XMLisp isn't run on CMUCL however...
On Fri, 16 Jan 2009 15:11:36 +0530, <enom...@meer.net> wrote:
> * Madhu <m34ozzzjc5....@moon.robolove.meer.net> :
> Wrote on Fri, 16 Jan 2009 15:03:14 +0530:

>| * GP lisper <slrngn0ib8.31r.spamb...@phoenix.clouddancer.com> :
>| Wrote on Fri, 16 Jan 2009 00:46:32 -0800:
>|
>| | On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:
>| |>
>| |> Recommendations? Extra credit for packages with sample code
>|
>| While there are any number of serialization options --- I think your

> Re-reading your original message, I understand you do not want to go the
> XML route. Perhaps you don't need a serialization package and just
> something like
Doesn't that require the object to be printable? I would have a Structure that will have structures in some of the slots. Doesn't sound printable. One of the other simple schemes I recall was to mmap the 'object' (in this case a hash-table) and just do a copy/load from memory to a file. I'll track that stuff down again, maybe I will understand it this time.
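On the printability worry by itself: structures defined with plain DEFSTRUCT (no custom print function) print in #S(...) syntax, which the reader accepts, and that includes structures nested in slots. A small sketch, with PLAYER slots taken from the output shown later in the thread and LINEUP as a made-up second structure. Hash tables are the part with no standard readable syntax.

```lisp
(defstruct player pid name performance)
(defstruct lineup team captain)          ; hypothetical container structure

(defun print-read-round-trip (object)
  "Print OBJECT readably to a string, then read a fresh copy back."
  (with-standard-io-syntax
    (read-from-string (prin1-to-string object))))

(defun demo-nested-structure ()
  "Return true if a structure holding another structure survives the trip."
  (let ((original (make-lineup
                   :team "sfn"
                   :captain (make-player :pid 4519 :name "Smith"
                                         :performance 5))))
    ;; EQUALP compares structures slot by slot.
    (equalp original (print-read-round-trip original))))
```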
* GP lisper <slrngn0v9q.9el.spamb...@phoenix.clouddancer.com> :
Wrote on Fri, 16 Jan 2009 04:27:38 -0800:

|> Re-reading your original message, I understand you do not want to go the
|> XML route. Perhaps you don't need a serialization package and just
|> something like
|>
|> <URL:http://paste.lisp.org/display/898>
|
| Doesn't that require the object to be printable?
No. The hashtable/structure is never printed. The fasl file is created from the read-time value of a variable bound to the object. That's the trick.
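The read-time part can be seen without any file or compiler involved. In this sketch *DEMO-HOOK* and READ-WITH-HOOK are illustrative names, not part of the pasted code: the reader evaluates the form after #. and splices the resulting object itself into the form it returns, so no printed representation of the object is ever produced.

```lisp
(defvar *demo-hook* nil
  "Stand-in for the paste's hook variable; holds the object to splice in.")

(defun read-with-hook (object)
  "Read a SETF form whose quoted argument is OBJECT itself, not a copy."
  (let ((*demo-hook* object)
        (*read-eval* t)                             ; #. requires this
        (*package* (symbol-package '*demo-hook*)))  ; so the names resolve
    (read-from-string "(setf *demo-hook* '#.*demo-hook*)")))

;; The result has the shape (SETF *DEMO-HOOK* (QUOTE <object>)),
;; where <object> is EQ to the argument passed in.
```

Compiling a file containing such a form is what then asks the file compiler to dump the object as a literal.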
Now there are other issues you'd have to take care of, such as ensuring the right environment is in place when the object in the fasl file is read back in. But this is not an issue.
| I would have a Structure that will have structures in some of the
| slots. Doesn't sound printable.

I'd recommend you try it before dissing it :) -- You don't have portability requirements and you can recreate the dump on demand. It is unlikely to get simpler than this.

| One of the other simple schemes I recall was to mmap the 'object' (in
| this case a hash-table) and just do a copy/load from memory to a file.
| I'll track that stuff down again, maybe I will understand it this
| time.
> I don't see why this is relevant. *print-readably* is what matters.
I think it is relevant, and that print would just loop if *print-circle* is nil and the structure contains circular references. I think this is stated in the standard:

> No. The hashtable/structure is never printed. The fasl file is created
> from the read-time value of a variable bound to the object. That's the
> trick.

Wrong. Please look more carefully at what you're advising :)
On Jan 16, 8:46 am, GP lisper <spamb...@CloudDancer.com> wrote:
> On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:
> > Recommendations? Extra credit for packages with sample code
> I looked at CL-Store, it seems that I would need to maintain any
> structures in three places, once at definition, once at storage (which
> seems to be slot by slot) and again at restore.

Hmmm, I'm confused. I don't understand what would require you to do manual structure maintenance. cl-store should `give back` the hash-table with structures which are roughly equivalent, somewhat like Python's 'pickle'. If it's behaving in a different manner then it's a bug.

I used almost exactly this approach when storing data about a parsed spam corpus for my spam filter.
On 16 Jan, 16:32, budden <budden-l...@mail.ru> wrote:

> > No. The hashtable/structure is never printed. The fasl file is created
> > from the read-time value of a variable bound to the object. That's the
> > trick.

> Wrong. Please look more carefully at what you're advising :)
On Fri, 16 Jan 2009 05:33:47 -0800 (PST), <ros...@gmail.com> wrote:
> On Jan 16, 8:46 am, GP lisper <spamb...@CloudDancer.com> wrote:
>> On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:

>> > Recommendations? Extra credit for packages with sample code

>> I looked at CL-Store, it seems that I would need to maintain any
>> structures in three places, once at definition, once at storage (which
>> seems to be slot by slot) and again at restore.

> Hmmm, I'm confused, I don't understand what would require you to do
> manual structure maintenance.
The CL-Store examples actually call out the slot names when storing.
cl-user(1): (defclass foo () ((bar :accessor bar :initarg :bar)))
I looked at that and thought, "oh no, I need to list the slot names?" I realized I could store anything if I disassembled it into components first, saw that CL-Store saves structure definitions, and decided that a workable approach to serialization is divide-and-conquer, as long as you store details on re-assembly too. I keep thinking that the BerkeleyDB code is what I need and that is distracting.

> cl-store should `give back` the hash-table with structures which are
> roughly equivalent, somewhat like Python's 'pickle'.
> If it's behaving in a different manner then it's a bug.
Well that brings up something I was just thinking about. In order to bring back a hash-table, did you declare a hash-table type? How else would you re-obtain a hash-table?
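For what it is worth, a sketch of what that round trip looks like with CL-STORE's two entry points, STORE and RESTORE. It assumes the library is already loaded (for example with (asdf:load-system :cl-store)) and that the implementation is one where CL-STORE supports structure instances; STORED-PLAYER and the function name are invented for the example. No type is declared on the way back in, since the stored file records what kind of object it holds.

```lisp
;; Assumes CL-STORE is loaded before this code is read.
(defstruct stored-player pid name)       ; hypothetical structure

(defun cl-store-round-trip (pathname)
  "Store a small table of structures in PATHNAME and return the restored copy."
  (let ((games (make-hash-table :test 'equal)))
    (setf (gethash "2005/09/12/sdnmlb-sfnmlb-1" games)
          (make-stored-player :pid 4519 :name "Smith"))
    (cl-store:store games pathname)      ; whole object graph, no slot list
    (cl-store:restore pathname)))        ; comes back as a hash table
```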
> I used, almost, this exact same approach when storing data about a > parsed spam corpus for my spam filter.
Well that is very encouraging.
On Fri, 16 Jan 2009 18:32:24 +0530, <enom...@meer.net> wrote:
>| I would have a Structure that will have structures in some of the
>| slots. Doesn't sound printable.

> I'd recommend you try it before dissing it :) -- You don't have
> portability requirements and you can recreate the dump on demand. It is
> unlikely to get simpler than this

Yes, I will write up a sample case with a Structure, some of whose slots are hash-tables of other structures (the game players), and try that out. At least that will provide some code to discuss further and point out errors.
> This code below from the lisp paste site below need not pay attention to which
> package you are in and can expect BINDUMP to work correctly. All you need
> to do is ensure the environment is correct when you want to LOAD the fasl.

Yeah, you're right, I've missed a comman. No, I do not understand at all. In fact, if we just print all symbols qualified (as I suggested) it will not depend on package.
    ;;; modified from http://paste.lisp.org/display/898/
    ;;; not tested at all! might even be unparseable!!!
    (defpackage :binstore
      (:use :cl)
      (:export #:bindump #:binload))

    ;; Added so that *HOOK* is interned in BINSTORE, matching the
    ;; binstore::*hook* references written into the temporary file.
    (in-package :binstore)

    (defvar *hook*)

    (defun bindump (object pathname)
      "Dumps OBJECT as a FASL file designated by PATHNAME."
      (let ((tmp (make-pathname :type "lisp" :defaults pathname)))
        (unwind-protect
             (let ((*hook* object))
               (with-open-file (f tmp :direction :output :if-exists :supersede)
                 ;; Symbols are written fully qualified, so the current
                 ;; package does not matter when the file is read.
                 (format f "(cl:setf binstore::*hook* '#.binstore::*hook*)~%"))
               ;; Reading the file evaluates #. while *HOOK* is still bound.
               (compile-file tmp :output-file pathname))
          (delete-file tmp)))
      pathname)

    (defun binload (pathname)
      "Loads an object dumped by BINDUMP from PATHNAME."
      (let ((*hook* nil))
        (load pathname)
        *hook*))
> On Fri, 16 Jan 2009 05:33:47 -0800 (PST), <ros...@gmail.com> wrote:
> > On Jan 16, 8:46 am, GP lisper <spamb...@CloudDancer.com> wrote:
> >> On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:

> >> > Recommendations? Extra credit for packages with sample code

> >> I looked at CL-Store, it seems that I would need to maintain any
> >> structures in three places, once at definition, once at storage (which
> >> seems to be slot by slot) and again at restore.

> > Hmmm, I'm confused, I don't understand what would require you to do
> > manual structure maintenance.
> The Cl-Store examples actually call out the slot names when storing.
> I looked at that and thought, "oh no, I need to list the slot names?"
Well, not exactly. This example stores a list of 3 objects: the class named FOO, the generic function named BAR (although you can't really consider this to be serialized, since all that is saved is the name), and an instance of FOO with the slot BAR set to "bar".
On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:
> I'm working on some code to digest daily XML files (about 3.5M when
> they are all gzipped). It seems to me that a way to handle the data
> is to create a hash-table with the Values being structures, e.g.
> ...
> After all that work, I think saving the result is a good idea. Saving
> it in the filesystem at the head of the local tree of XML files seems
> best. That way, I can check that it exists, load it, read it, modify
> it and put it back into the fs when done. Since this is daily data,
> and various incremental searches and combinations might need to be
> constructed, something simple that handles a single hash-table seems
> appropriate.

Well, I wrote up a sample and tried a suggestion. I had suspected that the suggestion wouldn't work, and so far it has not worked.
Any pointers?
Sample Code:
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ; code posted at http://paste.lisp.org/display/74033
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
    ;
    ; Sample Code GOAL:
    ;   put some non-simple data into a hash table
    ;   store the hash table in the filesystem
    ;---
    ;   recover the hash table from the filesystem
    ;   verify data
    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
On Fri, 23 Jan 2009 04:45:58 +0530, <enom...@meer.net> wrote:
> I should have added just to avoid another point of possible confusion
> - Just use the dribble file name without specifying an extension. This
> way the dump file gets saved with an architecture-specific extension and
> load usually will work right. (OR tweak the binstore code)

I needed a tag to indicate what the file was; I'll just move it into the filename.

> ;; It would have helped if you gave a fixed test case in addition to
> ;; random data
> (defun get-some-player (games-hash)
>   (loop for x being each hash-key of (bs-players (gethash "II" games-hash))
>         using (hash-value v)
>         return v))
> * (daze-games)
If you run (daze-games) a few times, the player hash-table fills up, since there isn't any hash-clearing going on. I needed the random effects for other testing; the overall data structure design is new to me.
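If repeatable runs are ever wanted, clearing the table at the start of each run is a one-liner with CLRHASH. A small sketch with a made-up helper, REFILL-TABLE, standing in for whatever (daze-games) does to populate the table:

```lisp
(defun refill-table (table keys)
  "Empty TABLE, then store each of KEYS with a random value below 100."
  (clrhash table)                        ; drop entries left by earlier runs
  (dolist (key keys table)
    (setf (gethash key table) (random 100))))
```

After each call the table holds exactly the keys passed in, however many times it ran before.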
On Sat, 24 Jan 2009 12:45:53 -0800 (PST), <ros...@gmail.com> wrote:
> On Jan 22, 6:06 pm, GP lisper <spamb...@CloudDancer.com> wrote:
>> On Fri, 16 Jan 2009 00:18:34 -0800, <spamb...@CloudDancer.com> wrote:

>> > I'm working on some code to digest daily XML files (about 3.5M when
>> > they are all gzipped). It seems to me that a way to handle the data
>> > is to create a hash-table with the Values being structures, e.g.
>> > ...
>> > After all that work, I think saving the result is a good idea. Saving
>> > it in the filesystem at the head of the local tree of XML files seems
>> > best. That way, I can check that it exists, load it, read it, modify
>> > it and put it back into the fs when done. Since this is daily data,
>> > and various incremental searches and combinations might need to be
>> > constructed, something simple that handles a single hash-table seems
>> > appropriate.

> #S(PLAYER :PID 4519 :NAME "Smith" :PERFORMANCE 5)
> T
I have two versions of the sample code, one for CL-STORE. The file format isn't important, and I will be trying the CL-STORE solution later when the XML parsing gets to me. The test code randomization is part of the 'speed trials', so I can run both versions against the clock (I want to know what the additional flexibility of CL-STORE costs). CL-STORE looks to be an old and well-tested set of code, since I found references to it dating back about a decade. I wouldn't ignore that.
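A rough way to run such a comparison, sketched with a made-up helper: wrap each save or load in a closure and measure wall-clock time with GET-INTERNAL-REAL-TIME. The standard TIME macro gives the same information interactively; this form just returns a number that can be compared.

```lisp
(defun elapsed-seconds (thunk &optional (repetitions 1))
  "Call THUNK REPETITIONS times; return the elapsed real time in seconds."
  (let ((start (get-internal-real-time)))
    (dotimes (i repetitions)
      (declare (ignorable i))
      (funcall thunk))
    (float (/ (- (get-internal-real-time) start)
              internal-time-units-per-second))))

;; Usage sketch, with hypothetical save functions for the two versions:
;;   (elapsed-seconds (lambda () (save-with-bindump *games*)) 10)
;;   (elapsed-seconds (lambda () (save-with-cl-store *games*)) 10)
```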
> I have two versions of the sample code, one for CL-STORE. The file
> format isn't important, and I will be trying the CL-STORE solution
> later when the XML parsing gets to me. The test code randomization is
> part of the 'speed trials', so I can run both versions against the
> clock (I want to know what does the additional flexibility of CL-STORE
> cost).
Well, the CL-STORE solution is much better in storage.
    -rw-r--r-- users 191536 Jan 24 10:06 dribble-ht.x86f
    -rw-r--r-- users   5080 Jan 26 09:58 dribble.ht
Looking into the bindump version at 191k shows a lot of repetition; it appears that every element is wrapped in a descriptive environment. I'd expect that once a recovered hash-table is reloaded, any CL-STORE or bindump differences would disappear. Since save/reload are supposed to be one-shot operations, the space saving is the important factor.
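The two sizes listed above work out to roughly 38 to 1 (191536 / 5080). To make the same comparison from inside Lisp rather than from a directory listing, a minimal sketch (the function name is invented) is to open each file as a byte stream and ask for its length:

```lisp
(defun file-size-in-bytes (pathname)
  "Return the size of the file at PATHNAME in bytes."
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (file-length in)))

;; Usage sketch with the two file names from the listing:
;;   (/ (file-size-in-bytes "dribble-ht.x86f")
;;      (file-size-in-bytes "dribble.ht"))
```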
This is certainly looking a lot better than the current SQL versions.