Hi Everyone,
I am currently looking for a solution for doing batch, fast LPM in a timeseries manner. If you just want to get to the question, skip the explanation and go to the bottom of the text.
Explanation:Basically, what i have is a stream IP addresses (as either String or Int64) which are timestamped. I also have a batch of Prefix to Properties nodes (AS Numbers, Geolocation, etc).
The first table with the streaming data looks like this:
DATA:
┌─name────────────┬─type───┬─default_type─┬─default_expression─┐
│ date │ Date │ │ │
│ TIMESTAMP_START │ UInt32 │ │ │
│ TIMESTAMP_END │ UInt32 │ │ │
│ EXP_IP │ String │ │ │
│ SRC_IP │ String │ │ │
│ DST_IP │ String │ │ │
│ BYTES │ Int64 │ │ │
└─────────────────┴────────┴──────────────┴────────────────────┘And the second with the prefixes looks like this (i've removed all features that are not vital here),
Note: subnet is in cidr notation, e.g.
192.168.0.0/24PREFIX:
┌─name─────────────┬─type───┬─default_type─┬─default_expression─┐
│ date │ Date │ │ │
│ timestamp │ UInt32 │ │ │
│ Subnet │ String │ │ │
└──────────────────┴────────┴──────────────┴────────────────────┘Basically, what i want to do is to take the IPs (any column) from the first table and find the Subnet that they belong to in the second table, taking into account the date and Timestamp (which i do not do in my current queries).
The only way (i can currently see) see is to separate v4 and v6 in the Data table, convert the IP to an UInt32/UInt64 and then explode it with all possible cidr masks that the IP can be in. This would look something like this (for v4 only)
SELECT
ipInt,
cidr,
exp2(32) - exp2(32 - cidr) AS mask,
bitAnd(ipInt, mask) AS subnetInt,
bytes
FROM
(
SELECT
1 AS joinCol,
toUInt32(IPv4StringToNum(DST_NET)) AS ipInt,
sum(BYTES) AS bytes
FROM test.DATA
WHERE ipInt > 0
GROUP BY DST_NET
)
ALL INNER JOIN
(
SELECT
1 AS joinCol,
cidr
FROM
(
SELECT arrayMap(lambda(tuple(x), toUInt8(log2(x))), bitmaskToArray(toUInt64(exp2(33) - 1))) AS cidr
)
ARRAY JOIN cidr
) USING (joinCol)
LIMIT 1
Basically, what i do here is to explode the table to 33 times it size (for v4) while, for v6, this would cause a factor of 65
now, once that is done, i can convert PREFIX into something that can i can use to do LPM. converting it looks like this:
SELECT
Subnet,
toInt64(IPv4StringToNum(extract(Subnet, '([0-9.]+)/'))) AS subnetInt,
toUInt8(extract(Subnet, '/([0-9]+)')) AS cidr,
exp2(32) - exp2(32 - cidr) AS mask,
FROM test.PREFIX
once this is done, the full query for the LPM looks like this:
SELECT
IPv4NumToString(ipInt) AS IP,
groupArray(DestSubnet)[1] AS Subnet,
groupArray(cidr)[1] AS cidr,
groupArray(bytes)[1] AS bytes,
FROM
(
SELECT
ipInt,
Subnet,
cidr,
bytes
FROM
(
SELECT
ipInt,
cidr,
exp2(32) - exp2(32 - cidr) AS mask,
bitAnd(ipInt, mask) AS subnetInt,
bytes
FROM
(
SELECT
1 AS joinCol, /*dummy column for table explosion*/
toUInt32(IPv4StringToNum(DST_NET)) AS ipInt,
sum(BYTES) AS bytes
FROM test.DATA
WHERE ipInt > 0
GROUP BY DST_NET
)
ALL INNER JOIN
(
SELECT
1 AS joinCol, /*dummy column for table explosion*/
cidr
FROM
(
SELECT arrayMap(lambda(tuple(x), toUInt8(log2(x))), bitmaskToArray(toUInt64(exp2(33) - 1))) AS cidr
)
ARRAY JOIN cidr
) USING (joinCol)
)
ANY INNER JOIN
(
SELECT
Subnet,
toInt64(IPv4StringToNum(extract(Subnet, '([0-9.]+)/'))) AS subnetInt,
toUInt8(extract(Subnet, '/([0-9]+)')) AS cidr,
exp2(32) - exp2(32 - cidr) AS mask,
FROM test.PREFIX
ORDER BY cidr DESC /*either order here or use an all inner join - don't know what is more effective...*/
) USING (subnetInt, cidr)
ORDER BY cidr DESC
)
GROUP BY ipInt
So, this works and gets me the requires results... but it is ineffective.
Question:
While the scheme there works, it is very hungry on resources, since the table needs to be exploded, effectively using a magnitude of order more RAM than necessary.
The reason here is that the join only works on a direct comparison and cannot use any kind of range logic.
I would like to suggest two alternative solutions to this, both of which i understand conceptually, but are not able to put into code (otherwise, i would do it and post it back to you guys)
1.) Introduce a new Datatype, lets call it subnet that has exactly two values: SubnetInt Int64 and cidr UInt8. Basically, this replaces the subnetInt/cidr from my tables. The point to this Datatype is that the == operator (which is used during joining) returns true if the one subnet is part of the other. Here are some examples for this:
192.168.0.0/24 == 192.168.0.0/28 --> true
192.168.0.0/24 == 192.168.0.0/23 --> true
192.168.0.0/24 == 192.168.1.0/24 --> false
Now, i realize that this would break the semantics of the == operator. But... it would work on streaming data without the needs for grouping or conversions. In the case of PREFIX being ordered descendingly by cidr, an any join would do an immediate LPM without the needs to group or sort afterwards. This would also work for v4 as well as v6 (subnets), without the need to even distinguish between them.
2.) This is the suggestion to allow a more flexible statement in using. If the subnetInt from the first select can be dynamically and'ed in the using statement with the cidr from the second select, then the LPM works without the needs for table explosion (since it is done on the fly during the join)
Basically, the using would look like this
using (bitAnd(subnetInt{1}, mask{2}) == subnetInt{2})
where the index in the {} references the first or second select statement.
but this probably means a major rewrite on how joining works (as the documentation specicially states that complex join statements are not supported), and it would not (without further hassle) allow for oblivious v4/v6 handling (i think)...
I think, now that i am done with this post, this is most likely a feature request...