how to retrieve the original documents from the search results?


Indraniel Das

Mar 11, 2015, 12:05:38 AM
to bl...@googlegroups.com

Is it possible to retrieve the original data structure, or JSON encoded string of the document, from the results of a bleve index.Search?

After looking at https://github.com/blevesearch/bleve/blob/master/http/doc_get.go, I tried something like this:

package main

import (
    "encoding/json"
    "fmt"
    "time"

    "github.com/blevesearch/bleve"
    "github.com/blevesearch/bleve/document"
)

func main() {
    index, _ := bleve.Open("MyIndex.bleve")
    query := bleve.NewMatchQuery("<my search query>")
    searchRequest := bleve.NewSearchRequest(query)
    searchResult, _ := index.Search(searchRequest)

    // collect the original documents
    for _, val := range searchResult.Hits {
        id := val.ID
        doc, _ := index.Document(id)

        rv := struct {
            ID     string                 `json:"id"`
            Fields map[string]interface{} `json:"fields"`
        }{
            ID:     id,
            Fields: map[string]interface{}{},
        }

        for _, field := range doc.Fields {
            var newval interface{}
            switch field := field.(type) {
            case *document.TextField:
                newval = string(field.Value())
            case *document.NumericField:
                n, err := field.Number()
                if err == nil {
                    newval = n
                }
            case *document.DateTimeField:
                d, err := field.DateTime()
                if err == nil {
                    newval = d.Format(time.RFC3339Nano)
                }
            }
            existing, existed := rv.Fields[field.Name()]
            if existed {
                switch existing := existing.(type) {
                case []interface{}:
                    rv.Fields[field.Name()] = append(existing, newval)
                case interface{}:
                    arr := make([]interface{}, 2)
                    arr[0] = existing
                    arr[1] = newval
                    rv.Fields[field.Name()] = arr
                }
            } else {
                rv.Fields[field.Name()] = newval
            }
        }

        js, _ := json.MarshalIndent(rv, "", "    ")
        fmt.Printf("%s\n", js)
    }
}

but the result isn’t exactly like the original document that was indexed. Is there a better way to do this?

My intent is to get the IDs from the results of a search, somehow fetch the original documents corresponding to the IDs from the key-value store, and then render those documents into customized text and/or web views.

Is trying to regenerate the original document from a search result an inappropriate use of bleve?

-Indraniel

Marty Schoch

Mar 11, 2015, 8:46:33 AM
to bl...@googlegroups.com
On Wed, Mar 11, 2015 at 12:05 AM, Indraniel Das <indr...@gmail.com> wrote:

Is it possible to retrieve the original data structure, or JSON encoded string of the document, from the results of a bleve index.Search?

In general, the approach bleve takes is that you can store original values on a per-field basis.  In theory if you store all the fields, you can re-compose the original source, but this is not perfect right now.  I think our support for storing date/number fields is broken (works, but not in the correct spirit of stored fields, which should store the original unparsed bytes).  So, this could get better over time if we fix that, but it may always be sub-optimal. 

After looking at https://github.com/blevesearch/bleve/blob/master/http/doc_get.go, I tried something like this:

but the result isn’t exactly like the original document that was indexed. Is there a better way to do this?

You didn't mention how it differed, but I'm assuming it was date and number fields that were wrong?  If not, it might be useful to share the problems, some of them might be fixable. 

My intent is to get the IDs from the results of a search, somehow fetch the original documents corresponding to the IDs from the key-value store, and then render those documents into customized text and/or web views.

Is trying to regenerate the original document from a search result an inappropriate use of bleve?


I think it's a valid use case to try to support, but keep in mind this is not bleve's primary purpose. It's possible bleve could get an option in the mapping to store the entire original source in a special field. This is what Elasticsearch does in the "_source" field. This is somewhat straightforward if the original content came in as a []byte, but it's not obvious what we do if users index a struct directly. I'm willing to give this some more thought, so I've opened an issue for it here: https://github.com/blevesearch/bleve/issues/174

There is one other possibility. Bleve lets an application store any side-channel information it wants inside the underlying KV store. Normally this is used by apps that store sequence numbers or progress-tracking information so they can safely resume indexing streams of data. But there is no restriction on how it is used; your application has an entire key space to use as it sees fit. In your case, you could store the document source there directly by calling:

index.SetInternal(docID, docSource)

Then, simply retrieve it using:

index.GetInternal(docID)

If you think you also might need to store other things here, then it is up to you to further partition the key space with some prefix.

Bleve won't know about any of this data and won't use it in any way, but it might be a quick workaround you could use for now.
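
For illustration only, here is a minimal sketch of that workaround with a key prefix. It assumes the `SetInternal(key, val []byte) error` and `GetInternal(key []byte) ([]byte, error)` methods declared on `bleve.Index`. The `internalStore` interface, the in-memory `memStore` stand-in, the `src:` prefix, and the helper names are all hypothetical, introduced here so the example runs without an index on disk.

```go
package main

import (
    "fmt"
)

// internalStore lists the two bleve.Index methods this sketch relies on.
// Assumption: a real bleve.Index value satisfies it in your bleve version.
type internalStore interface {
    SetInternal(key, val []byte) error
    GetInternal(key []byte) ([]byte, error)
}

// memStore is a hypothetical in-memory stand-in used only for this demo.
type memStore map[string][]byte

func (m memStore) SetInternal(key, val []byte) error {
    // Copy the value so later changes by the caller do not affect the store.
    m[string(key)] = append([]byte(nil), val...)
    return nil
}

func (m memStore) GetInternal(key []byte) ([]byte, error) {
    return m[string(key)], nil
}

// sourcePrefix partitions the internal key space so that document
// sources do not collide with other application data.
const sourcePrefix = "src:"

func sourceKey(docID string) []byte {
    return []byte(sourcePrefix + docID)
}

// saveSource stores the original document bytes under the prefixed key.
func saveSource(s internalStore, docID string, src []byte) error {
    return s.SetInternal(sourceKey(docID), src)
}

// loadSource returns whatever was stored for docID by saveSource.
func loadSource(s internalStore, docID string) ([]byte, error) {
    return s.GetInternal(sourceKey(docID))
}

func main() {
    store := memStore{}
    doc := []byte(`{"HeroName":"Superman"}`)
    if err := saveSource(store, "1", doc); err != nil {
        panic(err)
    }
    got, err := loadSource(store, "1")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s\n", got)
}
```

With a real index, the idea would be to pass the `bleve.Index` value in place of `memStore` and call `saveSource` right after indexing each document, then `loadSource` with each hit's ID after a search.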

marty

Indraniel Das

Mar 12, 2015, 2:20:22 PM
to bl...@googlegroups.com
On Wednesday, March 11, 2015 at 7:46:33 AM UTC-5, Marty Schoch wrote:

but the result isn’t exactly like the original document that was indexed. Is there a better way to do this?

You didn't mention how it differed, but I'm assuming it was date and number fields that were wrong?  If not, it might be useful to share the problems, some of them might be fixable.

My apologies for not showing a concrete example in my earlier post. Here's a toy example I made with a 2-level nested document data structure:

```Go
package main


import (
    "encoding/json"
    "fmt"
    "github.com/blevesearch/bleve"
    "github.com/blevesearch/bleve/document"
    "log"
    "os"
    "strconv"
    "time"
)

type Person struct {
    Name string
    Age  int
}

type Occupation struct {
    Institution string
    JobTitle    string
    StartDate   time.Time
    Supervisor  string
}

type SuperPowers struct {
    StrengthLevel int
    SpeedLevel    int
    CanFly        bool
}

type SuperHero struct {
    ID int
    Person
    HeroName string
    DayJob   Occupation
    Powers   SuperPowers
}

func main() {
    // setup
    indexName := "heros.bleve"
    index := makeBleveIndex(indexName)
    id, document, pretty := getData()
    indexData(id, document, index)

    // search for some text
    query := bleve.NewMatchQuery("Superman")
    search := bleve.NewSearchRequest(query)
    searchResults, err := index.Search(search)
    if err != nil {
        log.Fatalln("Trouble with search request!")
    }

    // retrieve document
    searchDocs := getDocsFromSearchResults(searchResults, index)
    if len(searchDocs) != 1 {
        log.Fatalln("Trouble retrieving docs from search!")
    }

    // show before and after
    fmt.Println("Original input document is:\n")
    fmt.Printf("%s\n\n", pretty)

    fmt.Println("Re-created document with the index and search results is:\n")
    fmt.Printf("%s\n", searchDocs[0])

    // clean up
    if err := os.RemoveAll(indexName); err != nil {
        log.Fatalln("Trouble removing index file:", indexName)
    }
}

func getDocsFromSearchResults(
    results *bleve.SearchResult,
    index bleve.Index,
) [][]byte {
    docs := make([][]byte, 0)

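    for _, val := range results.Hits {
        id := val.ID
        doc, _ := index.Document(id)

        // Wrapper that holds the document ID and its stored fields.
        rv := struct {
            ID     string                 `json:"id"`
            Fields map[string]interface{} `json:"fields"`
        }{
            ID:     id,
            Fields: map[string]interface{}{},
        }

        for _, field := range doc.Fields {
            // Convert each stored field back to a Go value by type.
            var newval interface{}
            switch field := field.(type) {
            case *document.TextField:
                newval = string(field.Value())
            case *document.NumericField:
                n, err := field.Number()
                if err == nil {
                    newval = n
                }
            case *document.DateTimeField:
                d, err := field.DateTime()
                if err == nil {
                    newval = d.Format(time.RFC3339Nano)
                }
            }

            // Repeated field names are collected into a slice.
            existing, existed := rv.Fields[field.Name()]
            if existed {
                switch existing := existing.(type) {
                case []interface{}:
                    rv.Fields[field.Name()] = append(existing, newval)
                case interface{}:
                    rv.Fields[field.Name()] = []interface{}{existing, newval}
                }
            } else {
                rv.Fields[field.Name()] = newval
            }
        }

        j2, _ := json.MarshalIndent(rv, "", "    ")
        docs = append(docs, j2)
    }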

    return docs
}

func makeBleveIndex(indexName string) bleve.Index {
    mapping := bleve.NewIndexMapping()
    index, err := bleve.New(indexName, mapping)
    if err != nil {
        log.Fatalln("Trouble making index!")
    }
    return index
}

func getData() (id int, document []byte, pretty []byte) {
    hero := SuperHero{
        ID:       1,
        HeroName: "Superman",
        Person:   Person{Name: "Clark Kent", Age: 30},
        DayJob: Occupation{
            Institution: "Daily Planet",
            JobTitle:    "news reporter",
            StartDate:   time.Date(1938, time.April, 18, 12, 0, 0, 0, time.UTC),
            Supervisor:  "Perry White",
        },
        Powers: SuperPowers{
            StrengthLevel: 10,
            SpeedLevel:    10,
            CanFly:        true,
        },
    }

    document, err := json.Marshal(hero)
    if err != nil {
        log.Fatalln("Trouble json encoding hero (as document)!")
    }

    pretty, err = json.MarshalIndent(hero, "", "    ")
    if err != nil {
        log.Fatalln("Trouble json encoding hero (as pretty JSON)!")
    }

    return hero.ID, document, pretty
}

func indexData(id int, doc []byte, index bleve.Index) {
    err := index.Index(strconv.Itoa(id), doc)
    if err != nil {
        log.Fatal("Trouble indexing data!")
    }
}
```

Running the code yields an output like so:

```
Original input document is:

{
    "ID": 1,
    "Name": "Clark Kent",
    "Age": 30,
    "HeroName": "Superman",
    "DayJob": {
        "Institution": "Daily Planet",
        "JobTitle": "news reporter",
        "StartDate": "1938-04-18T12:00:00Z",
        "Supervisor": "Perry White"
    },
    "Powers": {
        "StrengthLevel": 10,
        "SpeedLevel": 10,
        "CanFly": true
    }
}

Re-created document with the index and search results is:

{
    "id": "1",
    "fields": {
        "Age": 30,
        "DayJob.Institution": "Daily Planet",
        "DayJob.JobTitle": "news reporter",
        "DayJob.StartDate": "1938-04-18T12:00:00Z",
        "DayJob.Supervisor": "Perry White",
        "HeroName": "Superman",
        "ID": 1,
        "Name": "Clark Kent",
        "Powers.SpeedLevel": 10,
        "Powers.StrengthLevel": 10
    }
}
```

Are documents with nested levels of data structures considered bad search design?

-Indraniel

Indraniel Das

Mar 12, 2015, 2:24:52 PM
to bl...@googlegroups.com

Sorry, I forgot to apply the formatting in my earlier post. Hopefully this edition is a bit easier to read in a browser.


Indraniel Das

Mar 12, 2015, 2:40:13 PM
to bl...@googlegroups.com

On Wednesday, March 11, 2015 at 7:46:33 AM UTC-5, Marty Schoch wrote:

There is one other possibility. Bleve lets an application store any side-channel information it wants inside the underlying KV store. Normally this is used by apps that store sequence numbers or progress-tracking information so they can safely resume indexing streams of data. But there is no restriction on how it is used; your application has an entire key space to use as it sees fit. In your case, you could store the document source there directly by calling:

index.SetInternal(docID, docSource)

Then, simply retrieve it using:

index.GetInternal(docID)

If you think you also might need to store other things here, then it is up to you to further partition the key space with some prefix.

Bleve won't know about any of this data and won't use it in any way, but it might be a quick workaround you could use for now.


This technique is nice to know about.  Thanks!

-Indraniel

Marty Schoch

Mar 12, 2015, 3:47:48 PM
to bl...@googlegroups.com
Ah OK, thanks for sharing more of the details...


On Thu, Mar 12, 2015 at 2:20 PM, Indraniel Das <indr...@gmail.com> wrote:

Are documents with nested levels of data structures considered bad search design?


No, I don't think it's bad search design. In fact, I would say that generally you should not have to alter your domain model to accommodate search. So nested structs and arrays are fine, and bleve attempts to deal with them in a straightforward way.

However, as you can see, we internally flatten the nested structure into one level of field names. For stored fields we can also keep track of array positions (there is an open issue to do this for indexed fields as well). But as you have seen, the index mapping is basically a one-way process, mapping from your domain model into the index documents/fields. I briefly experimented with making the mapping bi-directional, but it didn't seem worth the effort. There were cases, like non-exported structs, where we could get the data out but could never recreate the object on the other side.

I actually think that, if you wanted, you could tweak the code to better understand how we flatten the names using a "." and reconstruct something closer to the original. But I'm not sure it's worth it. If you need to get back the original source information that you indexed, you should store that directly.
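
As a rough illustration of that idea, the sketch below regroups dotted field names such as `DayJob.Institution` into nested maps. It is not bleve code: `unflatten` is a hypothetical helper, and it assumes that no original field name contains a literal "." and that no name is both a value and a prefix of another name.

```go
package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// unflatten rebuilds nested maps from field names joined with ".".
// Assumption: "." never appears inside an original field name.
func unflatten(flat map[string]interface{}) map[string]interface{} {
    root := map[string]interface{}{}
    for name, value := range flat {
        parts := strings.Split(name, ".")
        node := root
        // Walk (and create as needed) one nested map per path segment,
        // stopping before the final segment, which names the value itself.
        for _, part := range parts[:len(parts)-1] {
            child, ok := node[part].(map[string]interface{})
            if !ok {
                child = map[string]interface{}{}
                node[part] = child
            }
            node = child
        }
        node[parts[len(parts)-1]] = value
    }
    return root
}

func main() {
    // Field names as they appear in the re-created document above.
    fields := map[string]interface{}{
        "HeroName":           "Superman",
        "DayJob.Institution": "Daily Planet",
        "DayJob.JobTitle":    "news reporter",
        "Powers.SpeedLevel":  10,
    }
    out, err := json.MarshalIndent(unflatten(fields), "", "    ")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%s\n", out)
}
```

This only recovers the nesting. Fields that were never stored, or whose values were converted while being read back, would still differ from the original, which is why storing the source directly is the simpler route.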

marty

Philip O'Toole

Mar 14, 2015, 3:34:29 PM
to bl...@googlegroups.com
I tried this approach, and it worked quite well for the docs I am storing. What I really like is that SetInternal() also exists on the Batch object. Nice.

Philip

Indraniel Das

Mar 15, 2015, 2:32:12 PM
to bl...@googlegroups.com

Using index.GetInternal seems to work well for me too. I’ve updated my toy example with the technique and placed it at this github gist:

https://gist.github.com/indraniel/8108bd7def9b5e222417

Thanks again for the explanations!
