Finding unique values from very large data structure that has maps array

1,507 views
Skip to first unread message

Abhijit Kadam

unread,
Apr 29, 2014, 1:13:58 AM4/29/14
to golan...@googlegroups.com
I have a structure which is section of a json file parsed into memory, and this section is array of maps. 

dataMap []map[string]interface{}

The 'interface{}' is because the json has no specific structure/schema and this is just one of the section retrieved after applying rules from template.

The 'interface{}' value will be seen as string and hence will be converted to string before comparison. I think using 'fmt.Sprint' would be fine to convert the value to string. hence the above map can be think of array of map[string]string.

There are over 100000 array elements and I have to quickly find out unquie values for certain keys across maps in array. Such that if maps have value 
{"F1":"value a" ,"F2":"value x"}
in another map 
{"F1":"value a", "F2":"value y"}
in some other map
{"F1":"value b", "F2":"value x"}

Processing for unique keys F1 & F2 & after processing I should get:
"F1":["value a", "value b"]
"F2":["value x", "value y"]

Approach:
1 For each key "F1" have a map like kmap map[string}bool. Create go routine for each key. Each routine will iterate over array and look for values inside maps for key like "F1"

2a Then store the "F1" values as keys inside kmap. 
kmap["values a"]=true ? what will be the most efficient dummy type for value?

2b Also before inserting values check if the values exists in kmap. Once done the kmap  keys are the unique values for the desired key in the original data structure.

Is the above approach efficient because I have 100000+ elements? I thought of using slice to collects values however look up will not be fast. Maps would be good at that. But then when I retrieve values from map they are not in proper sequence. This may not be necessary however if it turns out to be then is there any way to track the sequence?

Also is it worth to have several routines work to find unique values for single key "F1". In such case I need to put read- write lock on each kmap s?


Ahmed Waheed

unread,
Apr 29, 2014, 10:15:45 AM4/29/14
to golan...@googlegroups.com
You could use something like this, not sure how "slow" it would be with a bigger dataset though

egon

unread,
Apr 29, 2014, 10:41:39 AM4/29/14
to golan...@googlegroups.com
Explain the full story, talk about the actual context; it allows people to orient to the actual problem much faster.

What do you mean very large? 1GB/1TB/1PB?
How many times do you need to find those?
How will it be used?
Does it need to be offline/online? How often is the data updated?
What algorithm are you trying to implement on top of that?

As a recommendation, do the the simplest approach first and see whether it works or not; maybe it's already sufficient.

Also, without proper context it's difficult to recommend any advanced data structures and/or algorithms to solve the actual problem.

Most efficient dummy type is struct{}.

+ egon

Abhijit Kadam

unread,
Apr 29, 2014, 2:07:59 PM4/29/14
to golan...@googlegroups.com
File is 50 MB size and the array size is 100000. The data is not updated, it is a dump. The unique values to be found out from the columns of the matrix. 
"F1":["value a", "value b"]
"F2":["value x", "value y"]

Then printed like :
F1                F2
--------------------------
value a         value x
value b         value y

Here is how I have done it and it takes few seconds to do it. Below is code extract. "rowMeta.ColsInfo" has information on what is to be processed, i.e. which keys to extract unique values for as there are many other columns that should be ignored. For each column to be looked for unique values I am starting a routine. That routine will find the unique values using a temp map and appends it to the allocated internal slice  inside rowsTowrite.
//Certain t

func ProcessArrayForUniqueValues(repeatData []interface{}, rowMeta *excel.RowInfo, csvfile *csvfile.CSVReport) {
var rowsTowrite = make([][]string, len(rowMeta.ColsInfo), len(rowMeta.ColsInfo))
var wg sync.WaitGroup
for cn, col := range rowMeta.ColsInfo {
if col.Key != "" {
wg.Add(1)
go func(index int, colInfo excel.ColInfo, data []interface{}) {
defer wg.Done()
kmap := make(map[string]bool)
uvalues := make([]string, 0, 500)
for _, rowRefTmp := range data {
rowref := rowRefTmp.(map[string]interface{})
uvalue := fmt.Sprint(rowref[colInfo.Key])
_, ok := kmap[uvalue]
if !ok {
kmap[uvalue] = true
uvalues = append(uvalues, uvalue)
}
}
rowsTowrite[index] = uvalues
}(cn, *col, repeatData)
}
}
wg.Wait()

//write to file
maxColLength := 0
for cn, _ := range rowsTowrite {
slicelen := len(rowsTowrite[cn])
if slicelen > maxColLength {
maxColLength = slicelen
}
}

for i := 0; i < maxColLength; i++ {
var record []string
for j := 0; j < len(rowsTowrite); j++ {
tmpslice := rowsTowrite[j]
if len(tmpslice) > i {
record = append(record, tmpslice[i])
} else {
record = append(record, "")
}

}
csvfile.Write(&record)//todo: send in batches of [][]string to WriteAll
}
}

Abhijit Kadam

unread,
Apr 29, 2014, 2:10:46 PM4/29/14
to golan...@googlegroups.com
When I said it takes few seconds to do it above, I mean it takes few seconds to run the code and find unique values from data structure :)

egon

unread,
Apr 29, 2014, 2:27:27 PM4/29/14
to golan...@googlegroups.com
That's still not the context of the problem. If I'm not able to write the whole program instead of you, then it's not the full context. Please read https://code.google.com/p/go-wiki/wiki/HowToAsk it will make your question and problem clearer. I can't see the business value for your question, which means I have no clue what the best code for solving the problem is. Nor do I have any clue what the actual input data looks like, or what domain it's related to.

Anyways, things I noticed... you are passing in []interface{} instead of concrete type. Why won't you send in a specific type where the conversion to string has already been done?

You can use map[string]struct{} instead of map[string]bool; the first one uses less memory.

Did you intend to pass in *col instead of col? You are copying the data inside excel.ColInfo.

(lot of assumptions made in this suggestion) I expect most time you are spending on fmt.Sprint, just normalize the input json and don't unmarshal the whole content by using json.RawMessage.

You are spinning up a lot of goroutines, it might be faster to split the data (i.e. batching) between a few goroutines.

+ egon

Bakul Shah

unread,
Apr 29, 2014, 2:45:39 PM4/29/14
to Abhijit Kadam, golan...@googlegroups.com
On Mon, 28 Apr 2014 22:13:58 PDT Abhijit Kadam <abhij...@gmail.com> wrote:
> ------=_Part_55_20676219.1398748438065
> Content-Type: text/plain; charset=UTF-8
>
> I have a structure which is section of a json file parsed into memory, and
> this section is array of maps.
>
> dataMap []map[string]interface{}
>
> The 'interface{}' is because the json has no specific structure/schema and
> this is just one of the section retrieved after applying rules from
> template.
>
> The 'interface{}' value will be seen as string and hence will be converted
> to string before comparison. I think using 'fmt.Sprint' would be fine to
> convert the value to string. hence the above map can be think of array of
> map[string]string.
>
> There are over 100000 array elements and I have to quickly find out unquie
> values for certain keys across maps in array. Such that if maps have value
> {"F1":"value a" ,"F2":"value x"}
> in another map
> {"F1":"value a", "F2":"value y"}
> in some other map
> {"F1":"value b", "F2":"value x"}
>
> Processing for unique keys F1 & F2 & after processing I should get:
> "F1":["value a", "value b"]
> "F2":["value x", "value y"]

How about something like this? [Not a solution, just to give
you an idea]

// create a map of string to string-bag
for _, m := range dataMap {
for k, v := range m {
append(m1[k], v)
}
}
// replace bags with sets
for k, v := range m1 {
m1[k] = bagToSet(v)
}

bag == where unlike in a set an object may appear more than once.

Now you have a simpler (but more generally useful) problem to solve : )

Abhijit Kadam

unread,
Apr 29, 2014, 2:53:16 PM4/29/14
to golan...@googlegroups.com
Egon, I apologize for not being much clear, because of certain reasons cannot talk on domain and business. 
Replies:

Anyways, things I noticed... you are passing in []interface{} instead of concrete type. Why won't you send in a specific type where the conversion to string has already been done?
==> The json file has no specific structure/schema. A template has some information about the schema. The json and template comes as input. They can deviate little here and there.

You can use map[string]struct{} instead of map[string]bool; the first one uses less memory.
==> ok.

Did you intend to pass in *col instead of col? You are copying the data inside excel.ColInfo.
==> yes it is little data. will change it to use pointers. Thanks for catching that.

(lot of assumptions made in this suggestion) I expect most time you are spending on fmt.Sprint, just normalize the input json and don't unmarshal the whole content by using json.RawMessage.
===> The processing is directed by template data and not the json file itself. The template can direct to access any data from any struct. Besides I have little understanding of json.RawMessage. Will see that

You are spinning up a lot of goroutines, it might be faster to split the data (i.e. batching) between a few goroutines.
==> yes, will think on this, will take time for me. Right now I have limited columns in the reports to look for uniqueness. 

egon

unread,
Apr 29, 2014, 3:28:57 PM4/29/14
to golan...@googlegroups.com
On Tuesday, April 29, 2014 9:53:16 PM UTC+3, Abhijit Kadam wrote:
Egon, I apologize for not being much clear, because of certain reasons cannot talk on domain and business.

No problem... just mention that you cannot talk about the domain next time. That way, people will know that it's not because you forgot to mention it or just skipping it; and they won't bother asking for it :).
 
Replies:
Anyways, things I noticed... you are passing in []interface{} instead of concrete type. Why won't you send in a specific type where the conversion to string has already been done?
==> The json file has no specific structure/schema. A template has some information about the schema. The json and template comes as input. They can deviate little here and there.


Then probably the best solution would implement a different kind of json parser; one that takes into account your template as well. This would be specially tailored approach for the problem that you are describing. As a rough sketch what I would do:

First implement a json Decoder that allows you to do something like:
type Decoder {
Index() int
Key() string
Enter() Decoder, error
AsString() string
Skip() error
}

Then write your json decoding into "unique" maps directly on that... Of course the idea is a rough sketch; I'm not sure what the proper interface would look like.  Maybe, you can start with the http://golang.org/src/pkg/encoding/json/scanner.go and implementing a custom "step" on the scanner struct. (by copying the scanner code)...

tl;dr; the "json -> go structures -> map[string]struct{}" is unnecessary... you would probably want "json -> map[string]struct{}" for best performance... there's no builtin stuff for that, so you would need to build it yourself.

You can use map[string]struct{} instead of map[string]bool; the first one uses less memory.
==> ok.

Did you intend to pass in *col instead of col? You are copying the data inside excel.ColInfo.
==> yes it is little data. will change it to use pointers. Thanks for catching that.

If it's very small amounts of data then using a pointer might be also slower... just measure :)
 

(lot of assumptions made in this suggestion) I expect most time you are spending on fmt.Sprint, just normalize the input json and don't unmarshal the whole content by using json.RawMessage.
===> The processing is directed by template data and not the json file itself. The template can direct to access any data from any struct. Besides I have little understanding of json.RawMessage. Will see that

json.RawMessage just means it unserializes rest of the structure as a string.

Abhijit Kadam

unread,
May 6, 2014, 6:19:09 AM5/6/14
to golan...@googlegroups.com
Egon, Any example of batching between the go routines?

egon

unread,
May 6, 2014, 6:36:09 AM5/6/14
to golan...@googlegroups.com
On Tuesday, May 6, 2014 1:19:09 PM UTC+3, Abhijit Kadam wrote:
Egon, Any example of batching between the go routines?


I didn't bother doing all the tiny variations; see what performs better or looks nicer and then just profile. (you probably want to increase the size of buffers and batches in real world)
Reply all
Reply to author
Forward
0 new messages