proton <
leosa...@gmail.com> writes:
> Hi all,
>
> I have a huge amount of text (more than 2 TB) that I want to process
> to create a database. Basically, I read this text from disk, I split
> it into words, do some processing, and store it in a hash-table or in
> a file.
>
> My question is: what is the most efficient way to treat the words, as
> strings or as symbols? I am interested in possible issues due to the
> huge number of words (about 700K), will the string space run out
> faster than the symbol space,
Only toy implementations distinguish memory spaces for different types.
An out of memory situation will apply to any type of object at the same
time.
Given the average length of words is usually 5 characters, assuming an
implementation supporting unicode therefore using 32-bit per character,
assuming one more 32-bit overhead, 700K words will require
(* 700 1024 6 4) 17,203,200 bytes of core. An insignificant amount of
memory.
Now, the problem you have is that you want to avoid just loading the 2
TB of text in core memory, but instead unify the words. That's what
symbols offer you, with the INTERN function. As mentionned by Barry,
there may be some overhead on symbols. 5 or 6 words if something like
defstruct is used, but perhaps less (implementation may implement
symbols differently, amortizing some of the common or unused slots).
Anyways, assuming an overhead for symbols of 6 words, that's (* 700 1024
6 4) = 17,203,200 more bytes (plus some more for the hash-table in the
package where those symbols are interned), again an insignificant amount.
So, using symbols to represent words is a perfectly good solution.
Symbols were designed for that, and are used a lot to represent words in
NLP software written in lisp. Other advantages are that you can easily
add attributes (using the symbol-plist of the symbol-function) to the
words to help parsing or understanding, or other NLP algorithms.
So already, we're discussing O(48n) = O(n) space. We could do O(24n) by
using and unifying strings, but that'd still just be O(n) space, using
something like:
(defvar *strings* (make-hash-table :test (function equal)))
(defun string-intern (string)
(or (gethash string *strings*)
(setf (gethash string *strings*) string)))
(loop while (word-available-p input)
collect (string-intern (parse-word input)))
But if you need to add attributes to those "words", then you'll have to
define a structure or a class yourself, or otherwise add the overhead
that's already provided by symbols.
But you can use tries to store words more efficiently: all the words
starting by #\a share this initial letter. You could store it only
once, and then store in the children only the rest of the words starting
by #\a. And so on. Therefore having on the order of O(nlogn) space
requirements for words. Well, if you're careful, but notice now that a
word is represented by m characters and m pointers (only some of the
characters and pointers are shared with other words).
write 5
written 7
---
12
writ-e 4+1
\ten +3
----
8
http://en.wikipedia.org/wiki/Trie
https://www.google.com/search?q=cl+tries
> will it be GCed properly once a word is
> not used, etc...
Yes, garbage collectors work correctly.
--
__Pascal Bourguignon__
http://www.informatimago.com/
A bad day in () is better than a good day in {}.