How can I fan out/distribute 500,000 records with Firebase?


Uri Brodsky

Dec 3, 2016, 11:40:31 PM
to Firebase Google Group, Nitzan Tal
I currently have this structure:

Companies (20) 
{
    companyId:  {
        ...companyInfo
    }
}

Users (500,000)
  uid: {
     ...userInfo
   }

User_Companies (500,000 * max(20)) 
 uid: {
    company1: true,
    company2: true,
    ....
 }   

 Company_Users (500,000 * max(20)) 
 companyId: {
    uid1: true,
    uid2: true,
    ....
 }   

 Company_POSTS (2000 posts per company) 
 {
    companyId: {
        postId: {
            expired
            ...postInfo
        }
    }
 }


 User_Feed (for each user (500,000))
  {
    uid: {
        postId: {
            like: true
        }
    }
 }


 Requirements: 
 A user should get only the posts of the companies they are registered with, and a user can register/unregister.
 A user shouldn't see posts they have already liked/unliked.

 So currently, when a user joins a company, I copy the relevant posts (registered, not expired) to User_Feed.
 That way the client can fetch the user's relevant posts in pages.

 Now assume I have 4000k posts: I need to bring all that data to the server in order to copy it to User_Feed.
 That is a lot of data to fetch, and of course after some users join companies the server hangs.

 Another case: when a company publishes a new post, it should be distributed to the User_Feed of every registered user (up to 500,000).
 That requires fetching a massive amount of data in order to fan it out in Firebase.

 I am not sure how to solve these issues. Even a shallow fetch of 500,000 keys is a lot of data.
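
 For a sense of scale, the size of a shallow key listing can be estimated with simple arithmetic. The sketch below assumes each entry is serialized as compact JSON of the form "key":true and that keys are 28 characters long (an assumption here, not something stated in the question). `shallow_listing_bytes` is a hypothetical helper, not a Firebase API.

```python
import json


def shallow_listing_bytes(n_keys: int, key_len: int) -> int:
    """Bytes of a compact JSON object {"<key>":true,...} with n_keys keys.

    Each entry costs key_len + 7 bytes (two quotes, a colon, and `true`),
    entries are joined by n_keys - 1 commas, and the braces add 2 bytes.
    """
    if n_keys == 0:
        return 2  # just "{}"
    return n_keys * (key_len + 7) + (n_keys - 1) + 2


# Sanity check against a real serialization for a small case.
sample = {f"uid{i:025d}": True for i in range(10)}  # 28-character keys
encoded = json.dumps(sample, separators=(",", ":"))
assert len(encoded) == shallow_listing_bytes(10, 28)

# Assumed 28-character keys, 500,000 users as in the question.
total = shallow_listing_bytes(500_000, 28)
print(f"{total / 1_000_000:.1f} MB")
```

 Under these assumptions, one shallow listing of 500,000 uids comes to roughly 18 MB per read, before any post data is touched.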







Kato Richardson

Dec 5, 2016, 3:53:15 PM
to Firebase Google Group
Hi Uri,

My first suspicion here would be that your scope may be a bit off for your first attempt at tackling Firebase data structures. A million followers is a Twitter-level problem. And companies with 500,000+ employees are approaching the top 10 range. It may help to start with a more realistic evaluation and work upward from there.

My first thought here is that if all employees at a company are delivered the same feed, that maybe there's no reason to fan this out to everyone. My second thought is that, since we can't read 500k articles in a run, there's really no reason to fetch that many. Using limit queries to reduce the scope to a consumable amount will go a long way here. As would some segmentation of data by time frame or group. 
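
One way to read the suggestion above is a pull-based feed: instead of copying every post into every user's feed, read a limited number of recent posts per registered company and merge them at read time. The following is a minimal sketch over in-memory dictionaries that stand in for per-company limit queries. The `createdAt` field, the function names, and the `reacted` set (post ids the user already liked/unliked) are assumptions for illustration and are not part of the original structure.

```python
import heapq
from itertools import islice
from typing import Any, Dict, Iterable, List, Set, Tuple

Post = Dict[str, Any]


def latest_posts(posts: Dict[str, Post], limit: int) -> List[Tuple[int, str]]:
    """Stand-in for a per-company limit query: newest `limit` live posts.

    Returns (createdAt, postId) pairs, newest first. `createdAt` is an
    assumed field; the original structure only names `expired`.
    """
    live = (
        (int(post["createdAt"]), post_id)
        for post_id, post in posts.items()
        if not post.get("expired")
    )
    return heapq.nlargest(limit, live)


def feed_page(
    company_posts: Dict[str, Dict[str, Post]],
    user_companies: Iterable[str],
    reacted: Set[str],
    page_size: int,
) -> List[str]:
    """Merge the newest posts of the user's companies into one page.

    Reading page_size + len(reacted) posts per company guarantees that
    enough unseen posts survive the filter; a real client would more
    likely page with a cursor instead (simplification).
    """
    per_company = [
        latest_posts(company_posts.get(company_id, {}), page_size + len(reacted))
        for company_id in user_companies
    ]
    merged = heapq.merge(*per_company, reverse=True)  # inputs are newest-first
    unseen = (post_id for _, post_id in merged if post_id not in reacted)
    return list(islice(unseen, page_size))
```

With this shape, a new post is written once under its company, and the per-user state shrinks to the set of posts the user has reacted to.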

It would be good to start off with some NoSQL data modeling techniques and try out some simpler scenarios before applying too much early optimization. But ultimately, if you're talking this kind of scale, it's going to take a good team of talented engineers some time to sit down and hash out the tradeoffs and decide on the correct structures for your needs.

Housekeeping: Looks like you already posted this question on Stack Overflow. As a general rule of etiquette, you can show respect for the community's time by always letting developers know when you cross-post your questions, so we don't duplicate effort.

☼, Kato

--
You received this message because you are subscribed to the Google Groups "Firebase Google Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to firebase-talk+unsubscribe@googlegroups.com.
To post to this group, send email to fireba...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/firebase-talk/cd60b06e-ad31-454f-9f92-11bd6c04c537%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Kato Richardson | Developer Programs Eng | kato...@google.com | 775-235-8398

Uri Brodsky

Dec 6, 2016, 10:45:37 AM
to Firebase Google Group
Hi Kato, 
Thanks for your answer. I am sorry, I didn't know about this rule; I have edited the SO post.

In our case we are already in progress with users and we need to support this kind of scale.
Since Firebase doesn't allow us to filter by multiple criteria, Firebase best practices say we need to structure the data so that we can fetch it easily using a single criterion.
This, of course, requires data duplication, and we are OK with that, but as I mentioned, since we have a lot of data the fan-out process becomes very difficult.
We don't need to fetch 500k on the client side, of course. We fetch the posts paged for the signed-in user.
The problem is in data distribution: when a new post arrives in the system, it needs to be distributed to 500k users in the background (server side).

I have read about firebase-queue; maybe it can help us. I thought of creating tasks for the fan-out process and workers that will process the fan-out in chunks.
Do you think that is the correct way?
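
The chunked fan-out idea described above could be sketched roughly as follows. Each task carries a bounded map of paths to write, which a worker could hand to a single multi-location update (the worker and the queue itself are not shown). The task shape, the function names, and the placeholder value `True` are assumptions for illustration; this is not the task format firebase-queue requires.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fanout_tasks(
    company_id: str, post_id: str, uids: Iterable[str], chunk_size: int
) -> Iterator[Dict[str, Any]]:
    """Build one hypothetical task per chunk of registered users.

    `updates` maps User_Feed/<uid>/<postId> to a placeholder value, so a
    worker never needs more than `chunk_size` entries in memory.
    """
    for index, chunk in enumerate(chunked(uids, chunk_size)):
        yield {
            "companyId": company_id,
            "postId": post_id,
            "chunk": index,
            "updates": {f"User_Feed/{uid}/{post_id}": True for uid in chunk},
        }
```

Chunking bounds the memory and payload of each task, but the total number of writes stays the same: 500,000 users still means 500,000 feed entries per post.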



 
 

Kato Richardson

Dec 6, 2016, 12:42:31 PM
to Firebase Google Group
Firebase-queue may help, but it can't solve the throughput or bandwidth issues for you. 

I'd start with that article on NoSQL data modeling techniques though. There are some great approaches in there for aggregating on multiple criteria and so on. Keep in mind that, for the most part, you know the criteria ahead of time, so aggregated keys or map-reduce like functionality might be a good starting point.
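
As an illustration of the aggregated-key idea, two known criteria (company and expiry state) plus a timestamp can be folded into one sortable string, so that a single ordered range query answers "live posts of company X, by time". This is only a sketch: the key layout, the `createdAt`-style timestamp, and the function names are assumptions, and the sorted list stands in for an ordered index. The `"\uf8ff"` suffix is the usual trick for ending a prefix range on a very high code point.

```python
from bisect import bisect_left
from typing import List, Tuple


def aggregated_key(company_id: str, expired: bool, created_at: int) -> str:
    """Combine company, expiry state and timestamp into one sortable key.

    The timestamp is zero-padded to 13 digits so that string order
    matches numeric order for millisecond timestamps.
    """
    state = "expired" if expired else "live"
    return f"{company_id}_{state}_{created_at:013d}"


def range_query(index: List[Tuple[str, str]], prefix: str) -> List[str]:
    """Stand-in for an ordered range query over (key, postId) pairs.

    `index` must be sorted by key. Returns the post ids whose key starts
    with `prefix`, in key order.
    """
    keys = [key for key, _ in index]
    low = bisect_left(keys, prefix)
    high = bisect_left(keys, prefix + "\uf8ff")
    return [post_id for _, post_id in index[low:high]]
```

The cost of this approach is that the key must be rewritten whenever one of its parts changes, for example when a post expires.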

☼, Kato

