Hello, in an attempt to drag my knowledge of Big Data and NoSQL into the 21st century, I thought I would have a play around with some freely available UK government datasets. I chose the dataset available at this page
http://download.companieshouse.gov.uk/en_output.html specifically the 336MB download, which unzips to a 2GB CSV file. Armed with the field information available here
http://resources.companieshouse.gov.uk/toolsToHelp/pdf/freeDataProductDataset.pdf I was able to build some Golang code to read the CSV file and (hopefully) write the MongoDB equivalent. The field structure of the CSV file was modelled with a struct and several nested structs such as:
// Company parent struct
type Company struct {
    Details      *CompanyData      `json:"details,omitempty"`
    Corporate    *CorporateData    `json:"corporate,omitempty"`
    Accounts     *AccountsData     `json:"accounts,omitempty"`
    Returns      *ReturnsData      `json:"returns,omitempty"`
    Mortgages    *MortgagesData    `json:"mortgages,omitempty"`
    SICCodes     *SICCodesData     `json:"sic_codes,omitempty"`
    LtdPartners  *LtdPartnersData  `json:"ltd_partners,omitempty"`
    Web          *WebData          `json:"web,omitempty"`
    Old          *OldData          `json:"old,omitempty"`
    Confirmation *ConfirmationData `json:"confirmation,omitempty"`
}

// CompanyData child struct
type CompanyData struct {
    CompanyName   string `json:"companyname,omitempty"`    // 160
    CompanyNumber string `json:"company_number,omitempty"` // 8
    Careof        string `json:"careof,omitempty"`         // 100
    POBox         string `json:"po_box,omitempty"`         // 10
    AddressLine1  string `json:"address_line_1,omitempty"` // (HouseNumber and Street) 300
    AddressLine2  string `json:"address_line_2,omitempty"` // (area) 300
    PostTown      string `json:"post_town,omitempty"`      // 50
    County        string `json:"county,omitempty"`         // (region) 50
    Country       string `json:"country,omitempty"`        // 50
    PostCode      string `json:"post_code,omitempty"`      // 10
}
and so on...
I think, on reflection, that the json: tags were unnecessary: mgo.v2 marshals using bson: tags, not json: ones, and when no bson: tag is present it simply lowercases the Go field name, which is why the resulting MongoDB field names have no underscores.
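If I had wanted the underscores, I now believe the tags should have been bson: ones, along these lines (untested):

// CompanyData with bson tags, which I understand is what mgo.v2
// actually reads when deciding the MongoDB field names
type CompanyData struct {
    CompanyName   string `bson:"company_name,omitempty"`
    CompanyNumber string `bson:"company_number,omitempty"`
    // ... and so on for the remaining fields
}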
The data was put into MongoDB by reading each line of the CSV file into a variable called `record`, a slice of 55 strings. The indexed elements were then written to the mgo.v2 collection with code such as this:
// create the thousands-separator printer and the collection once, before the loop
p := message.NewPrinter(language.English)
collection := session.DB("Companies").C("Companies")

// read loop
for {
    // read an entire record of CSV values
    record, err := reader.Read()
    if err == io.EOF {
        p.Printf("\nEOF reached")
        p.Printf("\n%d records read", recordCount)
        break
    }
    checkError(err)
    // count only successfully read records
    recordCount++

    // print bytesRead / fileSize as a progress indicator
    p.Printf("\r%d / %d", cr.bytesRead, fileSize)

    // insert the record into the MongoDB collection
    err = collection.Insert(&Company{
        Details: &CompanyData{
            CompanyName:   record[0],
            CompanyNumber: record[1],
            Careof:        record[2],
            POBox:         record[3],
            AddressLine1:  record[4],
            AddressLine2:  record[5],
            PostTown:      record[6],
            County:        record[7],
            Country:       record[8],
            PostCode:      record[9],
        },
and so on...
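As an aside, I suspect inserting one document per call is the slow way to do this; my reading of the mgo.v2 docs suggests the Bulk API could batch the writes instead. A hypothetical helper (not something my program actually uses):

// insertBatch is a hypothetical helper showing how I understand mgo.v2's
// Bulk API: queue a batch of companies, then send them to the server in a
// single round trip with Run.
func insertBatch(collection *mgo.Collection, companies []*Company) error {
    bulk := collection.Bulk()
    for _, c := range companies {
        bulk.Insert(c)
    }
    _, err := bulk.Run()
    return err
}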
This appears to have been successful, since I am now able to connect to MongoDB using Studio 3T, which confirms that I have read in 4,106,757 documents. Studio 3T also confirms that the field names are all present (correct is another matter!).
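For what it's worth, I assume the same count could be confirmed from Go itself with something like the following, reusing the session, printer and checkError from above:

// sanity check: count the documents in the collection
n, err := session.DB("Companies").C("Companies").Count()
checkError(err)
p.Printf("%d documents in the Companies collection\n", n)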
The problem now is that I want to construct a fresh program that, initially at least, allows me to query the existing MongoDB dataset. Whereas before I effectively imposed the structure of the documents being built, now I want to discover the structure of the existing data from the database itself. How do I do that? How would I, for example, write an mgo.v2 Golang program that reads the first 50 documents from the collection (I'm trying to be careful to use NoSQL terminology here!) and prints the field names and contents based on the documents themselves rather than a predefined struct? My best guess so far is sketched below.
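From the little documentation I have found, I suspect the answer involves decoding into bson.M, a generic map[string]interface{}, instead of my Company struct, so that the field names come from the stored documents. A rough, untested sketch of what I mean (assuming a local MongoDB and the Companies database created above):

package main

import (
    "fmt"
    "log"

    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

func main() {
    // assumes MongoDB is running locally on the default port
    session, err := mgo.Dial("localhost")
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    collection := session.DB("Companies").C("Companies")

    // Find(nil) matches every document; Limit(50) keeps just the first 50
    var docs []bson.M
    if err := collection.Find(nil).Limit(50).All(&docs); err != nil {
        log.Fatal(err)
    }

    // bson.M is a map[string]interface{}, so the field names come from the
    // documents themselves rather than from a struct defined in advance
    for i, doc := range docs {
        fmt.Printf("--- document %d ---\n", i+1)
        for name, value := range doc {
            fmt.Printf("%s: %v\n", name, value)
        }
    }
}

(I gather that a plain Go map does not preserve field order; if order matters, I believe decoding into bson.D, a slice of ordered name/value pairs, would keep the fields in document order.)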
I'm sure this must be possible because mgo.v2 seems like a powerful and well-written package, but sadly I haven't been able to find much example code for it. If anyone could suggest some sources, as well as answer my specific questions, I'd be really grateful.