Re: [RavenDB] Person/Company Model + Searching


Oren Eini (Ayende Rahien)

Aug 15, 2012, 7:48:44 AM
to rav...@googlegroups.com
Rei,
Yes, you would use a multi-map/reduce index here.
And hundreds of facets are going to present a problem, I would guess. We optimized that, but it is still not easy.

On Tue, Aug 14, 2012 at 7:07 AM, rei <reib...@gmail.com> wrote:
This is a two part question. I think I know the answer to the first, but not so sure about the second.

I have 2 documents, Person and Company. I've modeled both, created indexes on each, and that's all fine. The problem is that the user needs to also be able to search on fields from both documents at the same time.

Just a quick overview of how things look:
- I'm dealing with about 4 million Person documents and 3 million Company documents.
- These documents are quite beefy (imagine 50+ fields on each, plus some custom fields specific to certain clients).
- Nearly every field needs to be searchable.

public class Person
{
        public string Id { get; set; }
        public string CompanyId { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public string EmailAddress { get; set; }
        public Address PhysicalAddress { get; set; }
        public string JobTitle { get; set; }
        // etc. You get the idea. Tons of fields about the person.
}

public class Company
{
        public string Id { get; set; }
        public string Name { get; set; }
        public string Industry { get; set; }
        public string PhoneNumber { get; set; }
        public string SicCode { get; set; }
        public int Revenue { get; set; }
        // etc. 50+ fields
}

With 2 separate indexes, I can't reasonably write the following query: "Get all the CEOs in the computing industry of companies with a revenue over $5,000,000"

1) This calls for a multi-map + reduce? I need an index that spans both these documents essentially, right? Are there any alternatives? I'm also thinking about performance here.

2) What happens to this if I wanted to support faceted search? Is having 100 facets okay? I foresee that being a performance problem, but I'm still relatively new to Lucene, so I'm not actually sure. Yes, I realize giving the user 100 facet groups to select from would just be pointless and overwhelming. What I was planning on doing was showing the 5 most relevant facets. By relevant, I mean I would score the facets by how well the split is distributed among the different facet values. I'd score "Industry1 (500), Industry2 (500), Industry3 (500)" higher than "United States (1497), Canada (2), Mexico (1)" for example. I wouldn't even show the country facet to the user in this case.
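One way to express the "how well is the split distributed" score is normalized Shannon entropy over the facet counts. This is only a sketch under that assumption, not something RavenDB provides: `EvennessScore` is a hypothetical helper, and the counts are the ones from the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not a RavenDB API): scores a facet by how evenly its
// hits are spread across its values, using normalized Shannon entropy.
// A score near 1 means an even split; near 0 means one value dominates.
static double EvennessScore(IEnumerable<int> counts)
{
    var nonZero = counts.Where(c => c > 0).ToList();
    if (nonZero.Count < 2) return 0.0; // a single value cannot split the results

    double total = nonZero.Sum();
    double entropy = -nonZero.Sum(c => (c / total) * Math.Log(c / total));
    return entropy / Math.Log(nonZero.Count);
}

// Facet counts taken from the example in the question.
var facets = new Dictionary<string, int[]>
{
    ["Industry"] = new[] { 500, 500, 500 },
    ["Country"] = new[] { 1497, 2, 1 },
};

// Keep the (at most) five most evenly distributed facets.
var ranked = facets
    .OrderByDescending(f => EvennessScore(f.Value))
    .Take(5)
    .Select(f => f.Key)
    .ToList();

Console.WriteLine(string.Join(", ", ranked));
```

To hide a facet such as Country entirely, a minimum score cutoff could be applied before `Take(5)`; the right threshold would have to be tuned against real data.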

I'd appreciate your thoughts on this. Thanks!

rei

Aug 17, 2012, 9:51:50 PM
to rav...@googlegroups.com
I tried the multi-map/reduce, but I seem to have run into a logistical problem. It would work fine if Person<->Company were a 1:1 relationship, but I'm not sure how to get it to work given that multiple People can be associated with a single Company.

Just for simplicity, assume I have 4 People and 2 Companies. In my reduce, if I group by CompanyId then I obviously only get 2 People back after the reduce. I want all 4 People with their Company information present. And I can't group by PersonId, because then all the company information is lost in the reduce (since Company doesn't know anything about Person).

AddMap<Company>(companies => from c in companies
                             select new
                             {
                                 PersonId = (string)null, // Here lies the issue
                                 CompanyId = c.Id,        // Company exposes Id, not CompanyId
                                 // All Company fields populated here
                                 // All Person fields defined here and set to null
                             });

AddMap<Person>(people => from p in people
                         select new
                         {
                             PersonId = p.Id,
                             p.CompanyId,
                             // All Person fields populated here
                             // All Company fields defined here and set to null
                         });

Reduce = results => from result in results
                    group result by result.PersonId
                    into g
                    select new
                    {
                        PersonId = g.Key,
                        FirstName = g.Select(x => x.FirstName).FirstOrDefault(x => x != null),
                        // etc.
                        CompanyId = g.Select(x => x.CompanyId).FirstOrDefault(x => x != null),
                        Industry = g.Select(x => x.Industry).FirstOrDefault(x => x != null),
                        // etc.
                    };

Does this mean I need to keep a list of PersonIds in the Company document? My AddMap<Company> would then use 2 "from" clauses (one for Company and the other for the list of PersonIds). That's really the only thing I can think of. Do I have any other options, other than cramming all this into one document called CompanyPerson? That would cause me more headaches than just keeping a list of PersonIds in Company.

Also, just as an aside, I tried out the facets, and to my surprise, they performed quite well. And I'm quite looking forward to trying out the facet approach proposed here:

Oren Eini (Ayende Rahien)

Aug 18, 2012, 5:03:57 AM
to rav...@googlegroups.com
Here is an example of how to do this:

public class Orders_Search : AbstractMultiMapIndexCreationTask<SearchResult>
{
    public Orders_Search()
    {
        // One entry per customer; it carries the name but no order.
        AddMap<Customer>(customers =>
            from customer in customers
            select new
            {
                CustomerId = customer.Id,
                CustomerName = customer.Name,
                OrderId = (string)null
            });

        // One entry per order; it carries the customer id but no name.
        AddMap<Order>(orders =>
            from order in orders
            select new
            {
                OrderId = order.Id,
                CustomerName = (string)null,
                order.CustomerId
            });

        // Group by customer, pick up the name once, then emit one result per entry in the group.
        Reduce = results =>
            from searchResult in results
            group searchResult by searchResult.CustomerId
            into g
            let customerName = g.FirstOrDefault(x => x.CustomerName != null).CustomerName
            from item in g
            select new
            {
                CustomerName = customerName,
                item.OrderId,
                item.CustomerId
            };
    }
}
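
To see what this reduce shape produces, the same grouping can be run over an in-memory list with plain LINQ to Objects. This is only an illustrative sketch, not RavenDB code: the sample ids and names are made up, and `FirstOrDefault` is applied to the projected names so the lookup is null-safe outside the index.

```csharp
using System;
using System.Linq;

// Combined map output (made-up sample data): one entry per customer with a
// null OrderId, and one entry per order with a null CustomerName.
var mapped = new (string CustomerId, string CustomerName, string OrderId)[]
{
    ("customers/1", "Acme", null),
    ("customers/2", "Globex", null),
    ("customers/1", null, "orders/1"),
    ("customers/1", null, "orders/2"),
    ("customers/2", null, "orders/3"),
};

// Same shape as the Reduce above: group by customer, find the name once,
// then fan back out to one row per entry in the group.
var reduced =
    (from r in mapped
     group r by r.CustomerId into g
     let customerName = g.Select(x => x.CustomerName).FirstOrDefault(n => n != null)
     from item in g
     select (item.CustomerId, CustomerName: customerName, item.OrderId)).ToList();

foreach (var row in reduced)
    Console.WriteLine($"{row.CustomerId}\t{row.CustomerName}\t{row.OrderId ?? "(no order)"}");
```

Every order row now carries its customer's name. The customer's own map entry also passes through the reduce, so each customer contributes one extra row whose OrderId is null.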

rei

Aug 18, 2012, 9:27:13 PM
to rav...@googlegroups.com
Thank you. Nearly there. I used your example and changed it to use Person/Company. At first it wasn't giving me anything close, but I changed:

let companyName = g.FirstOrDefault(x => x.CompanyName != null).CompanyName
to
let companyName = g.Select(x => x.CompanyName).FirstOrDefault(x => x != null)

After doing that, it almost gives me the desired results, except I'm getting a few extra results with the Company information being null. Here is what I have:
https://gist.github.com/d7118cbaebdee5a56a60

It's printing out:

company1
company1        Person1
company1        Person2
company2
company2        Person3
company2        Person4

Rather than the desired result of:

company1        Person1
company1        Person2
company2        Person3
company2        Person4

I'm not sure how to get rid of those nulls.

Oren Eini (Ayende Rahien)

Aug 18, 2012, 11:32:32 PM
to rav...@googlegroups.com
You filter them out during the query.
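
A minimal sketch of that filter, run here over an in-memory copy of the printed rows rather than against RavenDB. The tuple field names are assumptions for illustration. Against the real index the same predicate would presumably go on the query, along the lines of `session.Query<Result, IndexName>().Where(x => x.PersonId != null)`, where `Result` and `IndexName` are placeholders for your own types.

```csharp
using System;
using System.Linq;

// In-memory stand-in for the index results printed above: each company
// appears once without a person, then once per associated person.
var rows = new (string CompanyName, string PersonName)[]
{
    ("company1", null),
    ("company1", "Person1"),
    ("company1", "Person2"),
    ("company2", null),
    ("company2", "Person3"),
    ("company2", "Person4"),
};

// Drop the company-only rows at query time.
var peopleOnly = rows.Where(r => r.PersonName != null).ToList();

foreach (var row in peopleOnly)
    Console.WriteLine($"{row.CompanyName}\t{row.PersonName}");
```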

rei

Aug 19, 2012, 12:36:30 AM
to rav...@googlegroups.com
I see. I can live with that, thank you. :)

In a way, this might not be so bad, considering that the data I'm working with also has Companies with no People associated with them. So I'll be able to search on those Companies as well without the need for a separate index.