On Nov 9, 2005, at 10:57, Robert Klemme wrote:
> Matthew Smillie wrote:
>> I wonder if I could trouble the list a bit further on this one: I've
>> got a collection of newswire articles, and I was thinking of using
>> symbols to represent the words in each.
>
> Also, this smells a bit like premature optimization. Do you actually
> need to have those words separately? What exactly are you doing with
> your articles?
They're being tokenised, lemmatised/stemmed, tagged for part of
speech, checked for named entities, and then I generate a dependency
graph from the parse of each sentence. I need to glue all that
information to the relevant words, and then I spend a couple of years
writing a dissertation on what comes out when I poke it with a sharp
stick.
The reasoning behind using symbols was that it struck me as a sort of
built-in flyweight pattern. For example, if I have a couple of
hundred instances of the word 'and', using :and struck me as a simple
way to mash them into the same bit of memory without writing my own
lookup table. e.g:
articles[x].sentences[y].add( Word.new(:"@#{current_word}", :conjunction, etc...) )
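To make the flyweight idea concrete, here is a minimal sketch of what symbols buy you compared with plain strings. The word list and the helper name distinct_objects are made up for illustration; nothing here comes from the actual corpus code.

```ruby
# Count how many distinct objects back a list of tokens.
def distinct_objects(items)
  items.map { |item| item.object_id }.uniq.size
end

words = %w[and the and of and]   # hypothetical token stream

as_symbols = words.map { |w| w.to_sym }   # equal names share one Symbol
as_strings = words.map { |w| w.dup }      # every token is its own String

puts distinct_objects(as_symbols)   # → 3, one object per distinct word
puts distinct_objects(as_strings)   # → 5, one object per token
```

The sharing comes for free, but so does the entry in the interpreter's symbol table, which is what the rest of this thread is about.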
> RK:
> I would not use Symbols for that and I'd also not change the
> application architecture (i.e. using fork) because forking essentially
> would try to fix a problem introduced by using Symbols.
Using fork actually isn't a big change in the architecture, since
these subsets are going to have to be spun off to separate machines
for processing at some point, otherwise I'd never get through the
entire document corpus. I didn't plan to do it at this (prototyping)
stage, but it's fairly trivially parallel, so the fork isn't likely
to cause many problems.
This, on the other hand, might do:
> On Nov 9, 2005, at 10:37, daz wrote:
>
> I'll wave my hands (in a completely non-scientific way) at 300..500
> max.
> (Table look-up isn't hot.)
Is it really that low? I thought that since symbols are generated
for every name in Ruby (aren't they?), the ceiling would be a little
higher before they started being problematic.
thanks once again,
matthew smillie.
Sounds interesting.
> The reasoning behind using symbols was that it struck me as a sort of
> built-in flyweight pattern. For example, if I have a couple of
> hundred instances of the word 'and', using :and struck me as a simple
> way to mash them into the same bit of memory without writing my own
> lookup table. e.g:
>
> articles[x].sentences[y].add( Word.new(:"@#{current_word}", :conjunction, etc...) )
I guess in this case (i.e. with high volume expected) it makes sense to
build a more complex solution. You can even start with a simple class
implementation and make it more intelligent as you go. Off the top of my
head, some options:
1. Use a hash where keys and values are identical:
   # a constant, so that the method below can see the table
   WORDS = Hash.new {|h,s| s.freeze; h[s]=s}
   def make_word(s) WORDS[s] end
   Pro: fast, easy
   Con: need to somehow remove unused stuff
2. Similar, but using WeakRef (from weakref.rb) for easy GC
   Pro: less memory consumption
3. Build a trie-like structure where each string is stored quite efficiently.
   A trie is basically a tree-like data structure where each node represents
   a char. Then you just need to store nodes of this structure.
   Pro: that way you might even save so much memory that you can forget cleanup
4. Do 3 in a C extension where you can save even more memory.
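Option 3 could be sketched roughly as follows. This is only an illustration, not the sample implementation mentioned elsewhere in the thread: the class names TrieNode and WordTrie and the method intern are invented here, and whether a pure-Ruby trie really uses less memory than a table of strings would have to be measured.

```ruby
# One node per character; a word is identified by the node that ends it.
class TrieNode
  attr_reader :char, :parent

  def initialize(char = nil, parent = nil)
    @char = char
    @parent = parent
    @children = {}
  end

  # Return the child node for +char+, creating it on first use.
  def child(char)
    @children[char] ||= TrieNode.new(char, self)
  end

  # Rebuild the word by walking back up to the root.
  def to_s
    chars = []
    node = self
    while node.parent
      chars << node.char
      node = node.parent
    end
    chars.reverse.join
  end
end

class WordTrie
  def initialize
    @root = TrieNode.new
  end

  # Return the node standing for +word+; equal words share one node.
  def intern(word)
    word.each_char.inject(@root) { |node, c| node.child(c) }
  end
end

trie = WordTrie.new
a = trie.intern("and")
b = trie.intern("and")
c = trie.intern("ant")

puts a.equal?(b)                 # same word, same node
puts a.to_s                      # the word can be recovered from the node
puts a.parent.equal?(c.parent)   # "and" and "ant" share the prefix "an"
```

A Word object would then hold a reference to its node instead of a String or Symbol, and words with common prefixes share the nodes for those prefixes.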
>> RK:
>> I would not use Symbols for that and I'd also not change the
>> application architecture (i.e. using fork) because forking
>> essentially would try to fix a problem introduced by using Symbols.
>
> Using fork actually isn't a big change in the architecture, since
> these subsets are going to have to be spun off to separate machines
> for processing at some point, otherwise I'd never get through the
> entire document corpus. I didn't plan to do it at this (prototyping)
> stage, but it's fairly trivially parallel, so the fork isn't likely
> to cause many problems.
DRb may help here, too.
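For the prototyping stage, the fork-per-subset idea quoted above might look something like this. It is a sketch under assumptions: the method name distinct_words_per_subset and the sample subsets are invented, the per-subset "work" is reduced to counting distinct words, and it needs a platform where fork is available (so not Windows). Once the subsets move to separate machines, something like DRb could take the place of the pipes.

```ruby
# Process each subset in its own child process and return the number of
# distinct words found in each one, in the original order.
def distinct_words_per_subset(subsets)
  children = subsets.map do |subset|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      # Symbols created here exist only in the child process.
      distinct = subset.map { |w| w.to_sym }.uniq.size
      writer.write(Marshal.dump(distinct))
      writer.close
      exit!(0)
    end
    writer.close          # the parent only reads
    [pid, reader]
  end

  # Collect the results once all children have been started.
  children.map do |pid, reader|
    data = reader.read    # returns when the child closes its end
    reader.close
    Process.wait(pid)
    Marshal.load(data)
  end
end

subsets = [%w[and the of and], %w[a and in]]   # hypothetical article subsets
p distinct_words_per_subset(subsets)
```

Because every child exits when its subset is done, whatever it added to its own symbol table goes away with it and the parent's table does not grow.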
> This, on the other hand, might do:
>
>> On Nov 9, 2005, at 10:37, daz wrote:
>>
>> I'll wave my hands (in a completely non-scientific way) at 300..500
>> max.
>> (Table look-up isn't hot.)
>
> Is it really that low? I thought that since symbols are generated
> for every name in Ruby (aren't they?), the ceiling would be a little
> higher before they started being problematic.
No idea. You might obtain better info by trying it out and / or looking
at the source.
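One way of "trying it out" is to ask the interpreter how many symbols it already holds and watch the number while new ones are created. A minimal sketch, using Symbol.all_symbols; the helper name symbol_count and the generated word forms are placeholders, and the figures will differ between Ruby versions. Note also that Ruby 2.2 and later can garbage-collect symbols created at run time, which the interpreters of 2005 could not.

```ruby
# Number of symbols the interpreter currently knows about.
def symbol_count
  Symbol.all_symbols.size
end

before = symbol_count

# Hypothetical stand-ins for distinct word forms from the corpus.
# Keeping them in an array keeps them referenced.
held = (1..1000).map { |i| "corpusword#{i}".to_sym }

after = symbol_count

puts "symbols at start:       #{before}"
puts "after 1000 new symbols: #{after}"
```

Timing a lookup-heavy loop before and after a much larger batch would be the next step towards an actual ceiling.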
Kind regards
robert
I'm saying that a typical advantage to using :symbols is in code
that creates them using the literal syntax (:name) in contrast to
'name'.to_sym (formerly 'name'.intern) e.g.:
cust = {:name => 'Acme', :location => 'some locn', :phone => 'nn-nnn',
        :joindate => Date.new('yy-mm-dd')}
cust[:lastupdate] = Date.new('today') if custname == cust[:name]
if cust[:lastupdate] > ...
  custlist[:lastupdate] << cust[:name]
end
i.e. repeating use of keys/tags within code; same key name
into multiple hashes/structs etc.
I believe that typing :name in a script is the only way
to create a symbol without it having been a ruby String, first.
Once you've typed 300 unique :symbols, you've probably
reached 6000+ lines of code (?)
I don't see any advantage to having nK unique :symbols
over nK unique 'strings'.
(That doesn't imply there isn't one. :)
daz
<st.c>
/* This is a public domain general purpose hash table package
written by Peter Moore @ UCB. */
/* static char sccsid[] = "@(#) st.c 5.1 89/12/14 Crucible"; */
[...]
</st.c>
Sample impl attached.
robert