Adventures with BigData and mgo.v2

Carl Caulkett

Nov 22, 2017, 12:55:08 PM
to mgo-users
Hello, in an attempt to drag my knowledge of BigData and NoSQL into the 21st century, I thought I would have a play around with some freely available UK government datasets. I chose the dataset available at this page http://download.companieshouse.gov.uk/en_output.html, specifically the 336 MB download, which unzips to a 2 GB CSV file. Armed with the field information available here http://resources.companieshouse.gov.uk/toolsToHelp/pdf/freeDataProductDataset.pdf I was able to build some Golang code to read the CSV file and (hopefully) successfully write the MongoDB equivalent. The structure of each CSV record was modelled by a struct and several nested structs such as:

// Company parent struct
type Company struct {
   Details      *CompanyData      `json:"details,omitempty"`
   Corporate    *CorporateData    `json:"corporate,omitempty"`
   Accounts     *AccountsData     `json:"accounts,omitempty"`
   Returns      *ReturnsData      `json:"returns,omitempty"`
   Mortgages    *MortgagesData    `json:"mortgages,omitempty"`
   SICCodes     *SICCodesData     `json:"sic_codes,omitempty"`
   LtdPartners  *LtdPartnersData  `json:"ltd_partners,omitempty"`
   Web          *WebData          `json:"web,omitempty"`
   Old          *OldData          `json:"old,omitempty"`
   Confirmation *ConfirmationData `json:"confirmation,omitempty"`
}

// CompanyData child struct
type CompanyData struct {
   CompanyName   string `json:"companyname,omitempty"`    // 160
   CompanyNumber string `json:"company_number,omitempty"` // 8
   Careof        string `json:"careof,omitempty"`         // 100
   POBox         string `json:"po_box,omitempty"`         // 10
   AddressLine1  string `json:"address_line_1,omitempty"` // (HouseNumber and Street) 300
   AddressLine2  string `json:"address_line_2,omitempty"` // (area) 300
   PostTown      string `json:"post_town,omitempty"`      // 50
   County        string `json:"county,omitempty"`         // (region) 50
   Country       string `json:"country,omitempty"`        // 50
   PostCode      string `json:"post_code,omitempty"`      // 10
}

and so on...
 
I think, on reflection, that the `json:` tags were unnecessary, since the resultant MongoDB structure does not have the underscores in the names.

The data was put into MongoDB by reading each line of the CSV file into a variable called `record`, a slice of 55 elements. The indexed elements were then written to the mgo.v2 collection with code such as this:

    // enable printing of thousands separators
    p := message.NewPrinter(language.English)
    // look up the MongoDB collection once, outside the loop
    collection := session.DB("Companies").C("Companies")
    // read loop
    for {
        // read an entire record of CSV values
        record, err := reader.Read()
        if err == io.EOF {
            p.Printf("\nEOF reached")
            p.Printf("\n%d records read", recordCount)
            break
        }
        checkError(err)
        recordCount++
        // print bytesRead / fileSize
        p.Printf("\r%d / %d", cr.bytesRead, fileSize)
        // insert data into MongoDB
        err = collection.Insert(&Company{
            Details: &CompanyData{
                CompanyName:   record[0],
                CompanyNumber: record[1],
                Careof:        record[2],
                POBox:         record[3],
                AddressLine1:  record[4],
                AddressLine2:  record[5],
                PostTown:      record[6],
                County:        record[7],
                Country:       record[8],
                PostCode:      record[9],
            },

and so on...

This appears to have been successful, since I am now able to connect with MongoDB using Studio 3T, which confirms that I have read in 4,106,757 documents. Studio 3T also confirms that the field names are all present (correct is another matter!).


The problem now is that I want to construct a fresh program that, initially at least, allows me to query the existing MongoDB dataset. Whereas before I effectively imposed the structure of the documents being built, now I want to discover the structure held within the existing data. How do I do that? How would I, for example, write an mgo.v2 Golang program that reads the first 50 documents from the collection (I'm trying to be careful to use NoSQL terminology here!) and prints the field names and contents based on the internally held metadata?

I'm sure this will be possible because mgo.v2 seems like a powerful and well-written package, but sadly I haven't been able to find much example code for it. If anyone could suggest some sources as well as answer my specific questions, I'd be really grateful.



 

Diego Medina

Nov 30, 2017, 6:56:40 AM
to mgo-users
Hi Carl,

Great start!
Regarding the MongoDB field names not having underscores as you hoped: if you change the tags from `json` to `bson`, then mgo will read those and use them as field names. You'll have to drop your collection and load your data again to get the new names.

About printing field names and content: normally, when you query Mongo for data, you load the document(s) into a variable of the expected type, and mgo will match the field names from your collection to the field names on your struct (using the bson tags if available).

If you are just testing things, you could print the field names of the struct along with the data using

fmt.Printf("document is: %+v\n", variablehere)
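To answer the "first 50 documents" question without a predeclared struct, you can decode into `bson.M` (a map), so the field names come from the stored documents themselves. A minimal sketch, assuming a MongoDB on localhost and the `Companies` database/collection names from the earlier post:

```go
package main

import (
	"fmt"
	"log"

	mgo "gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	c := session.DB("Companies").C("Companies")

	// Decode into bson.M rather than a struct, so each document
	// reports whatever field names it actually contains.
	var docs []bson.M
	if err := c.Find(nil).Limit(50).All(&docs); err != nil {
		log.Fatal(err)
	}
	for i, doc := range docs {
		fmt.Printf("document %d:\n", i)
		for name, value := range doc {
			fmt.Printf("  %s = %v\n", name, value)
		}
	}
}
```

Note that map iteration order in Go is randomized, so the fields print in no particular order; this needs a running MongoDB to try.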


But for most cases in real apps, you would want to pick the fields to print, export, etc., somewhat manually. What I mean is: if you wanted to export all your data back to CSV files, you would export each field in a particular order based on the struct fields that are holding your data.

If this doesn't make sense, let us know and I can give you an example.

Regards,

Diego

Carl Caulkett

Nov 30, 2017, 7:45:58 AM
to mgo-...@googlegroups.com
Hi Diego, thanks for the reply. I've actually made some progress on this issue. I did indeed change the tags to `bson`, and I was able to import a 2 GB CSV file into a MongoDB dataset using Golang. I'm exploring the finer details of MongoDB queries and mgo.v2 and thoroughly enjoying it so far!

Cheers,
Carl