Performance of filtering on the Bigtable side vs. the client side


Shubham Choudhary

Feb 2, 2023, 9:24:09 AM
to Google Cloud Bigtable Discuss
We have a use case where we store users by language, and each language's users are divided into 19 shards. We want to perform a business operation on the users of a given language, so we stream the shards in parallel (one shard is read by one pod) and fetch the user data.

We have one column family, "a1", with columns: expire, language, state, tenant, userId.

The tenant column can have the value "A" or "B".

Suppose we want to process the "A" tenant users. Right now we stream all the users from Bigtable and drop the "B" rows at the code level (and vice versa: we drop the "A" rows when we want to process the "B" users).

```
// createReadStream returns a Node.js stream synchronously, so no await is needed
const readStream = this.bigtable.createReadStream({
    decode: false,
    ranges: [
        {
            start: startRowPointer,
            end: `${shardIndex}_${language}$`,
        },
    ],
    filter: {
        column: {
            cellLimit: 1,
        },
    },
});
```
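For reference, the "dropping on the code level" approach described above can be sketched as a plain filter over the decoded rows. This is a minimal sketch, not the poster's actual code: the row shape (family `a1`, qualifier `tenant`, latest cell first) follows the schema described in this thread, and the `rows` data is hypothetical.

```javascript
// Client-side filtering sketch: keep only rows whose latest "tenant"
// cell matches the requested tenant. Row shapes mimic what the Node.js
// Bigtable client emits after decoding; the sample data is hypothetical.
function keepTenant(rows, tenant) {
  return rows.filter((row) => {
    const cells = row.data && row.data.a1 && row.data.a1.tenant;
    // Each qualifier holds an array of cell versions; index 0 is the latest.
    return Boolean(cells) && cells[0].value === tenant;
  });
}

// Hypothetical decoded rows, mimicking what the stream would emit:
const rows = [
  { id: '0_en#u1', data: { a1: { tenant: [{ value: 'A' }] } } },
  { id: '0_en#u2', data: { a1: { tenant: [{ value: 'B' }] } } },
];

console.log(keepTenant(rows, 'A').map((r) => r.id)); // logs ['0_en#u1']
```

Note that every row still crosses the network before being dropped, which is exactly the cost question 1 is asking about.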

Here are my questions:

1. Which is more performant: filtering at the Bigtable level (if possible) or filtering at the code level (i.e. bringing all rows over the network and eventually dropping them in code), given that we are already doing a full shard read anyway?

So, I could use the filter below:

```
filter: [
    {
        condition: {
            test: [
                { column: "tenant" },
                { value: "A" },
            ],
            pass: [
                { all: true },
            ],
            fail: [
                { all: false },
            ],
        },
    },
]
```
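For completeness, here is a sketch of how that condition filter could be combined with the existing range read in a single options object. The start/end keys are hypothetical placeholders for the shard/language keys used above, and this has not been run against a live table.

```javascript
// Hypothetical read options combining the shard range with the
// tenant = "A" condition filter; the key values are placeholders.
const readOptions = {
  decode: false,
  ranges: [{ start: '0_en', end: '0_en$' }], // placeholder shard/language keys
  filter: [
    {
      condition: {
        test: [{ column: 'tenant' }, { value: 'A' }],
        pass: [{ all: true }],
        fail: [{ all: false }],
      },
    },
  ],
};
// this.bigtable.createReadStream(readOptions) would then emit only rows
// whose tenant cell matches "A" (condition filters carry performance
// caveats, though).
```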

2. How can I chain filters with a logical AND, say to filter users where tenant = 'A' AND state = 'Delhi'? Again, is it even feasible to do this on the Bigtable side, given that it's a wide-column store?

But the filter below returns zero rows.

```
filter: [
    {
        condition: {
            test: [
                { column: "tenant" },
                { value: "A" },
            ],
            pass: [
                { all: true },
            ],
            fail: [
                { all: false },
            ],
        },
    },
    {
        condition: {
            test: [
                { column: "state" },
                { value: "Delhi" },
            ],
            pass: [
                { all: true },
            ],
            fail: [
                { all: false },
            ],
        },
    },
]
```

Anwesha Das

Feb 3, 2023, 6:00:33 PM
to Google Cloud Bigtable Discuss
Hi Shubham,

Thanks for the question. It looks like you're trying to use a condition filter to achieve the desired filtering. As noted in our docs (https://cloud.google.com/bigtable/docs/using-filters#condition), condition filters are much slower than other filters and are generally not recommended. They can also lead to inconsistent or unexpected results due to their non-atomic nature. To your second question: it looks like you are trying to use multiple levels of chaining, or nested chaining, and this should be possible. But again, be mindful of the performance and CPU impact of filters on read operations: https://cloud.google.com/bigtable/docs/reads#performance.
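Picking up the note above that nested chaining should be possible: one way the tenant = 'A' AND state = 'Delhi' requirement might be expressed is by nesting the second condition filter inside the pass branch of the first, so a row's cells survive only when both tests match. This is an unverified sketch using the column names from this thread; it has not been run against a live table, and the condition-filter performance caveats still apply.

```javascript
// Sketch of a logical AND via nested condition filters: the outer
// condition tests tenant = "A"; only rows that pass are then tested
// against state = "Delhi" by the inner condition. Unverified.
const andFilter = [
  {
    condition: {
      test: [{ column: 'tenant' }, { value: 'A' }],
      pass: [
        {
          condition: {
            test: [{ column: 'state' }, { value: 'Delhi' }],
            pass: [{ all: true }],
            fail: [{ all: false }],
          },
        },
      ],
      fail: [{ all: false }],
    },
  },
];
```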

Best,
Anwesha