The following code uses a map to store unique bigrams (pairs of
words) from a text file (and an array to sort them by their
frequency of occurrence).
The file used for the test is: http://www.gutenberg.org/cache/epub/100/pg100.txt
(concatenated a couple of times for the sake of the test).
At no point during the run of the program does the map store an amount of
data comparable to the size of the text file, since the most frequent
bigrams appear thousands of times in said text file.
One would therefore expect the executable to use an amount of RAM
which is less than the size of the text file.
However, it seems to be quite the opposite in practice.
Here is the source code:
The version of 6g is: 6g version 5189 release.2010-03-30 release
The hardware is: Mac Mini, Intel Core Duo
Serge Hulne
-------------------
package main
import (
"fmt"
"os"
"bufio"
"unicode"
"strings"
"time"
"runtime"
"sort"
)
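// monitor prints the number of bytes currently allocated by the Go runtime, once per second.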
func monitor() {
for {
fmt.Printf("Memory usage = %d\n",
runtime.MemStats.Alloc)
time.Sleep(1e9)
}
}
//---
type Bigram struct {
key string
value string
freq uint
}
type BigramArray []Bigram
// Methods required by sort.Interface to sort structures of type BigramArray.
func (s BigramArray) Len() int { return len(s) }
func (s BigramArray) Less(i, j int) bool { return s[i].freq > s[j].freq } // (reverse sort)
func (s BigramArray) Swap(i, j int) { s[i], s[j] = s[j], s[i] }
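// hash_bigram computes a djb2-style hash (h = h*33 + c) over key, '_', and value.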
func hash_bigram(b *Bigram) uint {
var h uint = 5381
for _, r := range (*b).key {
h = (h << 5) + h + uint(r)
}
h = (h << 5) + h + uint('_')
for _, r := range (*b).value {
h = (h << 5) + h + uint(r)
}
return h
}
//---
func main() {
l_cnt := 0
w_cnt := 0
cpt_chars := 0
inword := false
buf := ""
old_word := ""
new_word := ""
var bigram Bigram
T := map [uint] Bigram{}
//f, err := os.Open("faust.txt", os.O_RDONLY, 0666)
//f, err := os.Open("test.txt", os.O_RDONLY, 0666)
//f, err := os.Open("hamlet.txt", os.O_RDONLY, 0666)
f, err := os.Open("../shakespeare.txt", os.O_RDONLY, 0666)
//f, err := os.Open("/Users/serge2/development/NLP/B60W20.txt",
os.O_RDONLY, 0666)
//f, err := os.Open("../B60W20.txt", os.O_RDONLY, 0666)
go monitor()
if err != nil {
fmt.Printf("\nError => %s\n\n", err)
}
reader := bufio.NewReader(f) //Buffered reader
for {
c, _ ,err := reader.ReadRune() //"ReadRune": reads unicode chars
cpt_chars++
if err != os.EOF && err == nil {
if c == '\n' {
l_cnt++
}
///////////////////////////////
if unicode.IsSpace(c) == false { //Scanner : filters out whitespace
if inword == false {
buf = ""
buf += string(c)
inword = true
w_cnt++
} else {
buf += string(c)
}
} else if inword == true {
//---
buf = strings.ToLower(buf)
new_word = buf
if len(new_word)>0 && len(old_word)>0 {
bigram.key = old_word
bigram.value = new_word
h := hash_bigram(&bigram)
if _, ok := T[h]; ok {
b := T[h]
b.freq++
T[h] = b
} else {
T[h] = bigram
}
}
old_word = new_word
//---
//fmt.Printf("%d\t buf = (%s)\n", w_cnt ,buf)
inword = false
buf = ""
}
///////////////////////////////
} else { //EOF detected
if err == os.EOF {
break
}
} //end if (err=nil)
} // end for (main loop)
//---
var bigramsArray BigramArray
bigramsArray = make (BigramArray, len(T))
i := 0
for _, v := range T{
bigramsArray[i] = v
i++
}
sort.Sort(bigramsArray)
for _, v := range bigramsArray[0:20]{
fmt.Printf("%d\t (%s,%s)\n", v.freq, v.key, v.value)
}
//---
fmt.Printf("\nlines = %d, words = %d, chars = %d\n", l_cnt, w_cnt,
cpt_chars)
return 0
}
-------------------
Here is the result of the run:
serge-hulnes-mac-mini:Go_test serge2$ ls -l ../shakespeare.txt
-rw-r--r-- 1 serge2 staff 15862370 5 mar 11:07 ../shakespeare.txt
serge-hulnes-mac-mini:Go_test serge2$ time ./bigrammes_7
Memory usage = 347648
Memory usage = 18181624
Memory usage = 33399512
Memory usage = 22003136
Memory usage = 32615280
Memory usage = 19154104
Memory usage = 29662040
Memory usage = 18441376
Memory usage = 28763680
Memory usage = 18525536
Memory usage = 28688216
Memory usage = 18481616
Memory usage = 28727720
6809 (of,the)
5619 (in,the)
5539 (to,the)
4699 (i,am)
4149 (i,will)
3459 (i,have)
3079 (it,is)
2769 (if,you)
2609 (that,i)
2589 (and,the)
2569 (by,the)
2469 (to,be)
2099 (and,i)
2059 (my,lord,)
2039 (of,syracuse.)
2019 (you,are)
1999 (is,the)
1999 (for,the)
1989 (antipholus,of)
1899 (of,my)
lines = 495200, words = 2703240, chars = 15862351
real 0m12.725s
user 0m12.602s
sys 0m0.104s
But this program probably holds on to a separate string copy for most
of the word occurrences in the file, because if you have the text
Enter ANTIPHOLUS OF SYRACUSE, DROMIO OF SYRACUSE, and
then both SYRACUSEs get saved, as part of
("SYRACUSE", "DROMIO") and ("SYRACUSE", "and").
The first SYRACUSE also gets recorded as part of
("OF", "SYRACUSE"). If you have dialogue like
ANTIPHOLUS OF SYRACUSE. A trusty villain, sir, that very oft,
ANTIPHOLUS OF SYRACUSE. Farewell till then. I will go lose myself,
ANTIPHOLUS OF SYRACUSE. He that commends me to mine own content
ANTIPHOLUS OF SYRACUSE. Stop in your wind, sir; tell me this, I pray:
ANTIPHOLUS OF SYRACUSE. I am not in a sportive humour now;
There's only one copy of "ANTIPHOLUS", for ("ANTIPHOLUS", "OF"),
but there are five copies of "SYRACUSE" in the five pairs in which
it appears on the left (with A, Farewell, He, Stop, and I on the right).
In short, if you have a word appearing multiple times on the left
side of a pair, then each time is a new copy, and similarly for
right sides, although an individual left side instance might be
shared with an individual right side instance. You can get a plausible
estimate by assuming no overlap between left and right side:
$ cat shakespeare.txt |
> tr A-Z a-z |
> tr -c a-z '\n' |
> grep . |
> awk '{m[last","$0]=1; last=$0} END{for(s in m) print s}' |
> wc
346473 346473 4104415
$ ls -l shakespeare.txt
-rw-r--r-- 1 rsc staff 5582655 Apr 9 01:59 shakespeare.txt
$
In this case, the pair string data is about 75% of the original file size
(4104415 / 5582655 ≈ 0.74).
The estimate can be off by at most a factor of two (each left instance
might overlap with one right instance), so you're probably looking at
keeping around at least half of the file, and your memory usage looks
like it is not much more than 100% of the file when a garbage collection
round finishes. Given the various map and struct overhead, that seems
plausible.
In contrast, saving just one copy of each word would only be
about 3% of the file size. You can avoid the duplication by keeping
one canonical copy of each string, with something like
var cmap = make(map[string]string)
func canonical(s string) string {
t, ok := cmap[s]
if !ok {
t = s
cmap[t] = t
}
return t
}
and use new_word = canonical(buf).
One more note that shouldn't affect overall memory usage but
does affect garbage collection frequency: all those += of strings
are kicking off quadratic garbage, but the n is small (word length).
Instead, the program could use a long-running bytes.Buffer with
the WriteRune, String, and Reset methods.
It might also be useful to call runtime.GC in the monitor before
printing the statistics.
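For illustration only, the two suggestions might look roughly like this
(a sketch reusing the variable names of the original program and
requiring an import of "bytes"; not a tested patch):

var buf bytes.Buffer // long-lived buffer, reused for every word

// inside the read loop:
if !unicode.IsSpace(c) {
    if !inword {
        inword = true
        w_cnt++
    }
    buf.WriteRune(c) // append one rune; no new string per character
} else if inword {
    new_word = strings.ToLower(buf.String()) // one allocation per word
    buf.Reset() // reuse the buffer's storage for the next word
    inword = false
    // ... bigram handling as before ...
}

// and in monitor, before printing:
runtime.GC() // collect first, so Alloc reflects live data only
fmt.Printf("Memory usage = %d\n", runtime.MemStats.Alloc)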
Good luck.
Russ
Thanks for taking the time to reply to my message!
I have tried the solution you suggested, namely:
var cmap = make(map[string]string)
func canonical(s string) string {
t, ok := cmap[s]
if !ok {
t = s
cmap[t] = t
}
return t
}
....
new_word = canonical(buf)
....
and I call:
runtime.GC()
on the first line of the monitor() routine.
However, it does not seem to change the amount of RAM used by the
application.
The same application, written in "D", takes only a small fraction of
the time to execute and uses only a small fraction of the memory used
by its "Go" counterpart, although the algorithm is the same.
The most striking difference, though, is the execution time:
I have tried the same application (and its "D" equivalent) on a 160
MB text file, and the D executable does the job in about a
minute, whereas the application coded in "Go" takes more than 10
minutes.
Serge.
> and I call:
> runtime.GC()
> on the first line of the monitor() routine.
> The most striking difference, though, is the execution time:
> I have tried the same application (and its "D" equivalent) on a 160
> MB text file, and the D executable does the job in about a
> minute, whereas the application coded in "Go" takes more than 10
> minutes.
How does the runtime compare if you remove the call to runtime.GC?
Ian
I am not sure I understand your question, but here is the modified
version, if you want to compare the two versions yourself:
Serge.
===============================
second version (only the changed lines are shown)
===============================
package main
import (
"fmt"
"os"
"bufio"
"unicode"
"strings"
"time"
"runtime"
"sort"
)
var cmap = make(map[string]string)
func canonical(s string) string {
t, ok := cmap[s]
if !ok {
t = s
cmap[t] = t
}
return t
}
func monitor() {
runtime.GC()
type BigramArray []Bigram
func main() {
//f, err := os.Open("../shakespeare.txt", os.O_RDONLY, 0666)
go monitor()
//---
buf = strings.ToLower(buf)
//new_word = buf
new_word = canonical(buf)
sort.Sort(bigramsArray)
===============================
On Apr 9, 6:50 pm, Ian Lance Taylor <i...@google.com> wrote:
> I am not sure I understand your question, but here is the modified
> version, if you want to compare the two versions yourself:
I was just asking how much time is taken by calling runtime.GC(). I
misunderstood your program--I thought you were calling runtime.GC() in
a loop.
> if inword == false {
> buf = ""
> buf += string(c)
> inword = true
> w_cnt++
> } else {
> buf += string(c)
> }
Because Go strings are immutable, this is a relatively slow way to do
things. It would be better to do something more like
alc := 100
buf := make([]int, alc)
cnt := 0
for {
if unicode.IsSpace(c) == false { //Scanner : filters out whitespace
if !inword {
buf[0] = c
cnt = 1
inword = true
w_cnt++
} else {
if cnt >= alc {
alc *= 2
nbuf := make([]int, alc)
copy(nbuf, buf)
buf = nbuf
}
buf[cnt] = c
cnt++
}
} else if inword {
h := hash_bigram(buf)
...
}
}
In fact, with your current program, I don't think there is ever a
reason to use strings. If I missed something, then it's easy enough
to insert string(buf) when you have collected a complete word and are
ready to get a string.
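For example (a hypothetical fragment following the sketch above, and
borrowing canonical from the earlier suggestion), the string could be
built only once, at the point where the word is complete:

} else if inword {
    word := canonical(string(buf[0:cnt])) // one string allocation per word occurrence
    // ... use word as new_word in the bigram handling ...
    inword = false
}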
Ian
The program below is a variant of yours. It uses a
simpler loop based on a bytes.Buffer to avoid the
quadratic string allocation, and when it finishes it
prints a memory profile to prof.out. After running it,
you can run
gopprof --web 6.out prof.out
to open the memory profile in your web browser.
When I run the program I get these two profiles:
http://swtch.com/~rsc/before.svg # not using canonical
http://swtch.com/~rsc/after.svg # using canonical
Using canonical does have a noticeable effect:
there's about 12 MB of memory saved.
But the numbers also make it look like there is
a memory leak or some other inefficiency
in the map implementation.
If you change the map and array to hold *Bigram
instead of Bigram then that cuts the memory usage
from 133 MB to 86 MB, which is better but still
too big (by the way, it's a 3-character change to the
program: two * and one &).
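Sketched out (reconstructed here, not quoted from an actual diff), the
pointer variant of the program below would be:

type BigramArray []*Bigram   // *
...
m := make(map[uint]*Bigram)  // *
...
m[h] = &Bigram{last, s, 1}   // &

With *Bigram values in the map, bb.freq++ then updates the shared record
in place, so the m[h] = bb store becomes redundant (though harmless).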
Russ
package main
import (
"bufio"
"bytes"
"fmt"
"log"
"os"
"runtime"
"runtime/pprof"
"sort"
"unicode"
)
var ns int
var cmap = make(map[string]string)
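// canonical interns s: every later occurrence of an equal string is mapped to one shared copy.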
func canonical(s string) string {
t, ok := cmap[s]
if !ok {
t = s
cmap[s] = t
ns += 16 + len(s)
}
return t
}
type Bigram struct {
key, value string
freq uint
}
type BigramArray []Bigram
func (s BigramArray) Len() int { return len(s) }
func (s BigramArray) Less(i, j int) bool { return s[i].freq > s[j].freq }
func (s BigramArray) Swap(i, j int) { s[i], s[j] = s[j], s[i] }
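// hash is a djb2-style hash of the word pair, with a space as separator.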
func hash(key, value string) uint {
h := uint(5381)
for i := 0; i < len(key); i++ {
h = h<<5 + h + uint(key[i])
}
h = h<<5 + h + ' '
for i := 0; i < len(value); i++ {
h = h<<5 + h + uint(value[i])
}
return h
}
func main() {
runtime.GOMAXPROCS(2)
f := os.Stdin
if len(os.Args) > 1 {
var err os.Error
f, err = os.Open(os.Args[1], os.O_RDONLY, 0)
if err != nil {
log.Exit("opening:", err)
}
}
r := bufio.NewReader(f)
var (
buf bytes.Buffer
last string
nc, nw, nl int
)
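// m maps the hash of a word pair to its Bigram (the pair plus its running count).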
m := make(map[uint]Bigram)
for {
c, _, err := r.ReadRune()
nc++
if c == '\n' {
nl++
}
if unicode.IsLetter(c) {
buf.WriteRune(unicode.ToLower(c))
continue
}
if buf.Len() > 0 {
nw++
s := canonical(buf.String())
if last != "" {
h := hash(last, s)
if bb, ok := m[h]; ok {
bb.freq++
m[h] = bb
} else {
m[h] = Bigram{last, s, 1}
}
}
buf.Reset()
last = s
}
if err == os.EOF {
break
}
if err != nil {
log.Exit("reading:", err)
}
}
a := make(BigramArray, len(m))
i := 0
for _, v := range m {
a[i] = v
i++
}
sort.Sort(a)
for i, v := range a {
if i >= 20 {
break
}
fmt.Printf("%d\t(%s,%s)\n", v.freq, v.key, v.value)
}
fmt.Printf("%d\n", runtime.MemStats.Alloc)
runtime.GC()
fmt.Printf("%d\n", runtime.MemStats.Alloc)
fmt.Printf("%11d %11d %11d %11d %11d\n", nc, nw, nl, ns, len(m))
f, _ = os.Open("prof.out", os.O_WRONLY|os.O_CREAT, 0666)
pprof.WriteHeapProfile(f)
}
i also rewrote the code trying to see if memory usage would converge
on large files. i opted to try and use only the standard Go types. my
results are similar in speed to Russ's, but my memory usage is down to
~29mb for a 5mb file (pg100.txt) and 23mb for the same file
concatenated multiple times to the size of 240mb. code is attached.
comparison svg's with rsc's version, run on the same files:
http://mirtchovski.com/go/aam-5mb.svg
http://mirtchovski.com/go/russ-5mb.svg
http://mirtchovski.com/go/aam-240mb.svg
http://mirtchovski.com/go/russ-240mb.svg
i forgo the sorting part, but even without it the speed of the two
versions (Russ's and this one) is identical; i thought i'd be faster
:). catstring (the runtime's string-concatenation routine) allocations
were unexpected to this layman.
In the runtime source there is the line
#define free(a) USED(a)
Delete that line and then make clean; make install in
runtime, and recompile your program, and you should
find that it takes up maybe 2/3 of what it had been using.
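Assuming the usual source layout of that release (an assumption; adjust
the path for your tree), the rebuild would be along the lines of:

$ cd $GOROOT/src/pkg/runtime   # after removing the #define
$ make clean; make install

then recompile your own program as usual.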
I'm not going to check that fix in just yet though: I'd
like to understand why the garbage collector isn't getting
the job done.
For the file I'm running, there are 815,101 entries in the map.
Each entry is, after rounding, 48 bytes on amd64 for key+value.
The map itself has a two-word-per-entry overhead, so 64 bytes.
64*815101 is 52 MB, and the latest memory profiles
(with the explicit call to free in the map code) shows around
72 MB allocated for the map, which is reasonable
(maps are never completely full).
The array in main, for sorting, is probably 40 bytes per entry,
so that's 32 MB, which agrees with the profile.
So that's where your memory is going.
In your original post you said that the equivalent D program
used only a small fraction of the memory used by Go, but
I don't see how that is possible - the original Go program
used 130 MB to store 72 MB of data; with the fix to map
it uses 100 MB. But I don't see how the D program could
possibly use less than 72 MB, and 72 / 130 doesn't seem
like a small fraction.
Russ