The specific issue I am worried about is things like the Unicode tables. They
were publicly mutable the last time I looked. It is a serious security
hole. In theory, I can change a table and try to use reflection to access
private fields and the like.
I'm trying to learn a bit more about concurrency (in general) and its
implementation in Go (in particular), and I would appreciate any help in
working out the best strategy for this (slightly simplified) case:
Suppose that I want to index all the "words" (not really words but DNA
hexamers) in a set of documents for downstream processing.
The sequential process for one document would involve reading the text
document and populating a map of strings to vectors of ints, where the
keys are the different possible "words" and each int vector represents
the word's positions in the document.
ACGAGT &[37 115 163 69] // ACGAGT appears in the document at
positions 37, 115, 163 and 69
ATAACG &[45 155]
...
Each document has millions of such words of 6 letters over an alphabet
of 4 letters, i.e. only 4096 (4^6) words (keys) are possible. The
number of documents used at the same time in one analysis is between 2
and 1000.
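For concreteness, a minimal sequential sketch of the per-document indexing (the file name and helper are placeholders, not part of the real program):

package main

import (
	"fmt"
	"os"
)

// indexWords records every position of each 6-letter word in seq.
func indexWords(seq string) map[string][]int {
	index := make(map[string][]int)
	for i := 0; i+6 <= len(seq); i++ {
		word := seq[i : i+6]
		index[word] = append(index[word], i)
	}
	return index
}

func main() {
	data, err := os.ReadFile("doc1.txt") // hypothetical input file
	if err != nil {
		panic(err)
	}
	index := indexWords(string(data))
	fmt.Println(index["ACGAGT"]) // e.g. [37 69 115 163]
}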
If I want to load several documents concurrently, there seem to be
various possibilities:
1.- Open one thread per document and populate the map with the
position of each word and the document where it has been found:
This &[{"doc1" 32} {"doc1" 134} {"doc2" 34}]
...
Is there any problem with having various goroutines inserting such
values into the same data structure (the map of strings to structs)? If
so, I suppose that some kind of lock is needed; will this degrade the
performance gain from the concurrency? Would this be the best
approach? (See the first sketch after this list.)
2.- Another option would be to create one map per document, so every
goroutine will write its own map:
"doc1" =>
This &[32 134]
is &.....
"doc2" =>
This &[34]
is &....
.....
The drawback is that to look up a word (and I have to look up hundreds of
thousands of them), I have to search N maps instead of just one
(N == number of different documents).
3.- A third approach would be to mix the previous two by building one map
per document and merging them before starting the lookups (see the
second sketch after this list).
4.- Other?
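For reference, a rough sketch of option 1, with one shared map guarded by a mutex (the Hit struct and the names are placeholders of mine):

package indexer

import "sync"

// Hit records one occurrence of a word: the document and the position.
type Hit struct {
	Doc string
	Pos int
}

var (
	mu    sync.Mutex
	index = make(map[string][]Hit)
)

// add is called from one goroutine per document; the mutex serialises
// the read-append-store sequence on the shared map.
func add(word, doc string, pos int) {
	mu.Lock()
	index[word] = append(index[word], Hit{doc, pos})
	mu.Unlock()
}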
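And a rough sketch of options 2 and 3, where each goroutine fills its own map and the maps are merged before the lookups start (again, the names are placeholders):

package indexer

// indexDoc builds a private map for one document: word -> positions.
// No lock is needed because only one goroutine touches this map.
func indexDoc(seq string) map[string][]int {
	m := make(map[string][]int)
	for i := 0; i+6 <= len(seq); i++ {
		word := seq[i : i+6]
		m[word] = append(m[word], i)
	}
	return m
}

// merge flattens per-document maps into word -> document -> positions,
// so a word is looked up once instead of once per document.
func merge(perDoc map[string]map[string][]int) map[string]map[string][]int {
	merged := make(map[string]map[string][]int)
	for doc, m := range perDoc {
		for word, positions := range m {
			if merged[word] == nil {
				merged[word] = make(map[string][]int)
			}
			merged[word][doc] = positions
		}
	}
	return merged
}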
Which strategy do you anticipate will have the biggest gain in
performance?
Also, any good reference text in concurrency/parallelism would be
welcome,
Thanx in advance,
M;
One issue that jumps out at me right away is that if all the documents
are on the same drive and you aren't careful, you could get
massive hard disk thrashing that slows things down considerably. This
can be taken care of by reading each document's raw data in its
entirety, one at a time, at the beginning, and then processing the documents
concurrently while in memory.
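A rough sketch of that read-first-then-process idea (the file names and the process function are placeholders):

package main

import (
	"os"
	"sync"
)

func process(name string, data []byte) {
	// the per-document indexing would happen here
	_ = name
	_ = data
}

func main() {
	paths := []string{"doc1.txt", "doc2.txt"} // hypothetical inputs

	// Phase 1: sequential I/O, one whole file at a time, to avoid seek thrashing.
	contents := make(map[string][]byte)
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			panic(err)
		}
		contents[p] = data
	}

	// Phase 2: CPU-bound processing, one goroutine per in-memory document.
	var wg sync.WaitGroup
	for name, data := range contents {
		wg.Add(1)
		go func(name string, data []byte) {
			defer wg.Done()
			process(name, data)
		}(name, data)
	}
	wg.Wait()
}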
Beyond that, the questions seem to be "Is map threadsafe?" (I don't
know) and "Is there an efficient way to merge two maps?" (Still don't
know!). I'm sure some other readers of this list will be able to
answer these questions even though I cannot.
- John
On Jan 18, 11:38 am, Miguel Pignatelli <miguel.pignate...@uv.es>
wrote:
Or, as a middle ground: define a maximum for CPU-bound processing
goroutines, typically $GOMAXPROCS. Have one goroutine, possibly your
main program, read files from disk and spawn a goroutine for each
file it has just read, as long as the maximum number of processing
goroutines has not been reached.
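A rough sketch of that middle ground, using a buffered channel as a counting semaphore to cap the number of processing goroutines (the file names and the index function are placeholders):

package main

import (
	"os"
	"runtime"
	"sync"
)

func index(name string, data []byte) { _ = name; _ = data } // placeholder

func main() {
	paths := []string{"doc1.txt", "doc2.txt"} // hypothetical inputs

	maxWorkers := runtime.GOMAXPROCS(0) // current GOMAXPROCS setting
	sem := make(chan struct{}, maxWorkers)
	var wg sync.WaitGroup

	for _, p := range paths {
		data, err := os.ReadFile(p) // only this goroutine touches the disk
		if err != nil {
			panic(err)
		}
		sem <- struct{}{} // blocks while maxWorkers goroutines are busy
		wg.Add(1)
		go func(name string, data []byte) {
			defer wg.Done()
			defer func() { <-sem }()
			index(name, data)
		}(p, data)
	}
	wg.Wait()
}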
> Beyond that, the questions seem to be "Is map threadsafe?" (I don't
> know)
It is not. You will have to write some kind of server managing the
map.
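A rough sketch of such a server: a single goroutine owns the map and applies updates sent over a channel, so no explicit lock is needed (the update type and names are mine):

package indexer

type update struct {
	word string
	pos  int
}

// runServer is the only goroutine that touches index, so no lock is needed.
// It returns once the updates channel has been closed and drained.
func runServer(index map[string][]int, updates <-chan update) {
	for u := range updates {
		index[u.word] = append(index[u.word], u.pos)
	}
}

Producer goroutines send updates on the channel and close it when they are
done; the map is only read after runServer has returned.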
> and "Is there an efficient way to merge two maps?" (Still don't
> know!). I'm sure some other readers of this list will be able to
> answer these questions even though I cannot.
That I don't know either.
cheers,
Michael
thanks for the input,
On 18/01/2010, at 17:51, John Asmuth wrote:
> Hi Miguel,
>
> One issue that jumps out to me right away is that if all the documents
> are on the same drive, if you aren't careful then you could have
> massive hard disk thrashing and slow things down considerably. This
> can be taken care of by reading in each document's raw data in its
> entirety one at a time, in the beginning, and then process them
> concurrently while in memory.
Hmmm, I thought that my alternative avoids that overhead. If I
understand correctly, you suggest accumulating all the disk access at
the beginning of the execution (by having all the goroutines
read the entire files into memory first); wouldn't this cause the
massive disk thrashing? I was thinking of alternating disk access
and record processing by reading small chunks of text and processing
them.
"distributed" points out a good alternative: having a goroutine for
disk reading and span other goroutines for processing each document.
Also, regarding the locking of the map, only the current entry of the
map should be blocked at any time by each goroutine (not the entire
map, other goroutines can "write" to other entries of the map). One
important question would be: which is the cost of blocking a map entry
compared to the accession time to it?
I should test some of these alternatives to answer this myself, but any
comments are also welcome,
Cheers,
M;
On Jan 18, 2:45 pm, Miguel Pignatelli <miguel.pignate...@uv.es> wrote:
> > One issue that jumps out to me right away is that if all the documents
> > are on the same drive, if you aren't careful then you could have
> > massive hard disk thrashing and slow things down considerably. This
> > can be taken care of by reading in each document's raw data in its
> > entirety one at a time, in the beginning, and then process them
> > concurrently while in memory.
>
> Hmmm, I thought that my alternative avoids that overhead. If I
> understand correctly, you suggest to accumulate all the disk access at
> the beginning of the execution time (by having all the goroutines
> reading the entire files into memory first), wouldn't this cause the
> massive disk thrashing?
Not if you read each document one-at-a-time, in its entirety.
I'm not saying this is the best solution, just a solution that would
avoid hard disk thrashing :)
> I was thinking of alternating the disk access
> and the record processing by reading small chunks of text and process
> them.
Done right, this could work. But setting it up in a way that avoids
disk thrashing seems like a very challenging problem to me.
- John
Yes, some kind of lock is needed. There are no atomicity
guarantees about map accesses: if there are multiple writers
or a single writer and multiple readers, a lock is needed.
Even if individual map accesses were atomic, you would
still need a lock, because multiple goroutines would be doing:
array := m[key]
newArray := append(array, value)
m[key] = newArray
The lock is needed to make sure two simultaneous appends
to the array don't step on each other. Even if line 1 and line 3
happen as atomic operations, you need something making the
whole sequence atomic. That's exactly why Go doesn't bother
to make line 1 and line 3 individually atomic.
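In other words, something along the lines of this sketch:

package indexer

import "sync"

var (
	mu sync.Mutex
	m  = make(map[string][]int)
)

// add performs the read-append-store as one unit under the lock.
func add(key string, value int) {
	mu.Lock()
	m[key] = append(m[key], value)
	mu.Unlock()
}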
> Which strategy do you anticipate that will have the biggest gain in
> performance?
> Also, any good reference text in concurrency/parallelism would be welcome,
I think you'd have to build them and see. Luckily none of them
is a particularly large amount of code.
However, given that the key space is both small and likely to
be fairly dense, you'd probably be better off with a simple array,
something like:
type List struct {
	sync.Mutex
	positions []int // the actual list data, e.g. word positions
}
var all [4096]List
then to update key k, you can use
all[k].Lock(); update all[k]; all[k].Unlock()
and not block updates involving other keys.
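Fleshed out into a complete snippet, and assuming the alphabet is ACGT so a hexamer packs into an index in 0..4095 (that encoding is an assumption, not something fixed by the thread):

package indexer

import "sync"

type List struct {
	sync.Mutex
	positions []int
}

var all [4096]List

// keyIndex packs a 6-letter ACGT word into 0..4095, two bits per letter.
func keyIndex(word string) int {
	k := 0
	for i := 0; i < len(word); i++ {
		k <<= 2
		switch word[i] {
		case 'C':
			k |= 1
		case 'G':
			k |= 2
		case 'T':
			k |= 3
		}
	}
	return k
}

// record locks only this word's entry, so updates to other keys proceed.
func record(word string, pos int) {
	l := &all[keyIndex(word)]
	l.Lock()
	l.positions = append(l.positions, pos)
	l.Unlock()
}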
A common middle ground would be to have a smaller
number of locks, L, with each protecting 4096/L keys.
Then each goroutine could wait until it has some
threshold K of updates queued up for a lock's section of
keys, acquire that lock, apply all the updates, and release it.
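A rough sketch of that variant (the stripe count and batch threshold are arbitrary placeholders):

package indexer

import "sync"

const (
	numKeys    = 4096
	numStripes = 16  // L: number of locks
	batchSize  = 256 // K: updates queued up before flushing a stripe
)

var (
	locks     [numStripes]sync.Mutex
	positions [numKeys][]int
)

func stripe(key int) int { return key * numStripes / numKeys }

// flush applies a batch of (key, position) updates that all fall in stripe s.
func flush(s int, batch [][2]int) {
	locks[s].Lock()
	for _, kp := range batch {
		positions[kp[0]] = append(positions[kp[0]], kp[1])
	}
	locks[s].Unlock()
}

// indexDoc queues updates per stripe and flushes a stripe once it holds
// batchSize of them, plus one final flush per stripe at the end.
func indexDoc(keys []int) {
	pending := make([][][2]int, numStripes)
	for pos, key := range keys {
		s := stripe(key)
		pending[s] = append(pending[s], [2]int{key, pos})
		if len(pending[s]) >= batchSize {
			flush(s, pending[s])
			pending[s] = pending[s][:0]
		}
	}
	for s, batch := range pending {
		if len(batch) > 0 {
			flush(s, batch)
		}
	}
}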
This last approach translates nicely into a program
with goroutines instead of locks: L goroutines, each
in charge of a section of keys and reading from a
suitably buffered channel, and one goroutine per
file sending each update on the correct one of the L
channels. I used this structure in a program once
where both kinds of goroutine were quite a bit more complex
than in this example, and being able to split the two tasks
and have the many channels act as a cross-connect
helped quite a lot in keeping the complexity down.
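A rough sketch of that structure (the shard count and channel buffer size are arbitrary placeholders):

package indexer

import "sync"

const (
	numKeys   = 4096
	numShards = 8
)

type update struct {
	key, pos int
}

// buildIndex runs one goroutine per shard (the only writer for its keys)
// and one goroutine per document routing updates to the right shard.
func buildIndex(docs [][]int) *[numKeys][]int {
	positions := new([numKeys][]int)
	chans := make([]chan update, numShards)

	var shardWG sync.WaitGroup
	for s := range chans {
		chans[s] = make(chan update, 1024) // suitably buffered
		shardWG.Add(1)
		go func(s int) {
			defer shardWG.Done()
			for u := range chans[s] {
				positions[u.key] = append(positions[u.key], u.pos)
			}
		}(s)
	}

	var docWG sync.WaitGroup
	for _, keys := range docs {
		docWG.Add(1)
		go func(keys []int) {
			defer docWG.Done()
			for pos, key := range keys {
				chans[key*numShards/numKeys] <- update{key, pos}
			}
		}(keys)
	}

	docWG.Wait()
	for _, ch := range chans {
		close(ch)
	}
	shardWG.Wait()
	return positions
}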
It's true that once you get up to the GBs of documents,
the I/O should be a bigger concern than CPU
saturation, especially if the data is coming off a single
disk. In the program I just mentioned, I had data coming
off many disks in parallel, with one goroutine per disk.
Russ