table with market data terribly slow to retrieve data

101 views
Skip to first unread message

pantonis

unread,
May 6, 2023, 1:15:48 PM5/6/23
to MariaDB ColumnStore
I have the following table

CREATE TABLE `FinancialTicks` (
  `Asset` longtext NOT NULL,
  `Timestamp` int(11) NOT NULL,
  `Bid` double NOT NULL,
  `Ask` double NOT NULL
) ENGINE=Columnstore DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

I have loaded about 1 billion rows (about 4 months of data) and when I run a simple query

SELECT *
FROM FinancialTicks
WHERE Timestamp >= 1657279800
AND Timestamp <= 1657279830
AND Asset = 'Apple'

it takes 10 minutes to return some data. I thought Columnstore was the way to go for time series data but it seems that this is not the case as it is not usable at all.
Any idea why is it so slow. Running on a 8 core vm with 16gb of ram and 400GB SSD.

Thanks in advance



Emmanuel Bull

unread,
May 6, 2023, 2:55:24 PM5/6/23
to pantonis, MariaDB ColumnStore
Data directory is in which of your mounts/volumes? Make sure that it is not competing with other costly applications on that mount. Check the IOPS.
Look at your perfmon when running the query against when not. 
Your table and query looks OK for 1 billion records. 

--
You received this message because you are subscribed to the Google Groups "MariaDB ColumnStore" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mariadb-columns...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mariadb-columnstore/8bd208d3-3a84-4e45-bcd2-2a42d649f9c8n%40googlegroups.com.

Todd Stoffel

unread,
May 6, 2023, 7:00:18 PM5/6/23
to MariaDB ColumnStore
I tried to reproduce this using 1 billion rows of fake data on a system with 6 CPU and 8GB of RAM.

[root@mcs1 tmp]# head financialticks.csv

yellow,1252060378,91.6,49.6

white,1394759195,40.23,99.61

silver,1093083353,49.11,37.0

silver,918112014,85.3,76.8

navy,1621380770,20.36,28.86

aqua,850124079,57.9,39.18

purple,467173058,84.71,32.7

gray,1165210570,72.18,35.44

lime,278593393,84.77,68.73


[root@mcs1 tmp]# cpimport -m 1 -s ',' -E '"' google_groups FinancialTicks -l financialticks.csv

2023-05-06 22:43:55 (1257) INFO : Running distributed import (mode 1) on all PMs...

2023-05-06 22:47:52 (1257) INFO : For table google_groups.FinancialTicks: 1000000000 rows processed and 1000000000 rows inserted.

2023-05-06 22:47:52 (1257) INFO : Bulk load completed, total run time : 236.949 seconds


[root@mcs1 tmp]# mariadb -vvv google_groups -e "SELECT * FROM FinancialTicks WHERE Timestamp >= 594075923 AND Timestamp <= 1037354268 AND Asset = 'teal';"



| teal  |  809763260 | 79.16 | 31.45 |

| teal  |  803870149 | 96.57 | 48.87 |

| teal  |  955395282 | 94.59 |  32.7 |

| teal  | 1025037492 | 83.71 | 57.41 |

| teal  |  771100012 | 68.71 | 70.69 |

+-------+------------+-------+-------+

17561280 rows in set (36.045 sec)


pantonis

unread,
May 7, 2023, 3:27:02 AM5/7/23
to MariaDB ColumnStore
This is a brand new VM with nothing installed on apart from mariadb.

Regarding the data directory:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       391G   40G  332G  11% /
root@vm01010:/var/lib/mysql#


htop while running query output is:
 scrn.jpg

pantonis

unread,
May 7, 2023, 3:27:44 AM5/7/23
to MariaDB ColumnStore
36 seconds isn't too much?

Todd Stoffel

unread,
May 7, 2023, 4:03:51 PM5/7/23
to MariaDB ColumnStore
The time includes delivery of the results to my client. It was 7,561,911 records for my sample query.

So that is about right.

allen....@mariadb.com

unread,
May 7, 2023, 4:23:53 PM5/7/23
to MariaDB ColumnStore
If looking to further improve,
1) Add more CPU  - with 1 billion records there are 125 extents of 8 million values per column. With only 4 or 6 CPU, there are many iterations potentially which can be massively parallelized with more cpus which can dramatically reduce query time.
Id recommend 32 - 64 cpus
once you add more cpu you can adjust some parameters to allocate extra CPU ontop of the defaults
sudo mcsSetConfig JobList RequestSize 2
sudo mcsSetConfig JobList ProcessorThreadsPerScan 32
restart columnstore after adjusting

after more CPU, more DISK IOPS would help next.

2) Review extent ordering and elimination
select objectid from calpontsys.syscolumn where tablename='FinancialTicks' and columnname='Timestamp';
select PARTITION_ID,SEGMENT_ID,DBROOT, MIN_VALUE, MAX_VALUE from information_schema.columnstore_extents where object_id=XXXXX order by PARTITION_ID ASC, BLOCK_OFFSET ASC, SEGMENT_ID ASC;

On these larger tables, inserting the data in order is extremely important so that the summary block information for each extent is optimally setup to eliminate entire 8 million records blocks based on the min/max values.The two queries above will print out the min/max values of each extent for the column id from the first query.

3)  review the execution plan in columnstore
  select calSetTrace(1);
# SQL Statements
select calGetTrace();

If you have good data ordering in step 2, the PBE value in the first step or two should be as high as possible to skip over massive chunks of data that is irrelevant to the query.

4) Lastly this query select * from x where y =z; isnt an analytical query. nor does it lean into a columnar engines strength as no aggregate functions are being applied to the query.
Columnar engines are best for queries that run AVG(), SUM(), MIN(),MAX(), GROUP BY x functions. or if you only need to select a subset of the columns, but with "select *" you arent reviewing/testing the strength of columnar.

Todd Stoffel

unread,
May 7, 2023, 4:54:08 PM5/7/23
to MariaDB ColumnStore
I suspect this is a VM issue causing a bottleneck somewhere.  I could not reproduce in AWS or on Docker.

pantonis

unread,
May 8, 2023, 1:20:42 AM5/8/23
to MariaDB ColumnStore
"The time includes delivery of the results to my client. It was 7,561,911 records for my sample query"
What is the query execution time not the retrieval of data to client?

Todd Stoffel

unread,
May 8, 2023, 1:21:25 AM5/8/23
to MariaDB ColumnStore
Less than 700 ms

pantonis

unread,
May 8, 2023, 2:17:17 AM5/8/23
to MariaDB ColumnStore
Thank you for your reply.

"Add more CPU  - with 1 billion records there are 125 extents of 8 million values per column. With only 4 or 6 CPU, there are many iterations potentially which can be massively parallelized with more cpus which can dramatically reduce query time.
Id recommend 32 - 64 cpus"
Adding more CPUs (32 CPUs would sky rocket costs) We ran the test with data for 4 months. Imagine that we have to keep history for 5 years due to regulation. this would cost tens of thousands of euros per month which is not sustainable for our project.

 "Review extent ordering and elimination"
Data inserted are ordered ascending by timestamp. Attached you can check results of query 


"Lastly this query select * from x where y =z; isnt an analytical query. nor does it lean into a columnar engines strength as no aggregate functions are being applied to the query. Columnar engines are best for queries that run AVG(), SUM(), MIN(),MAX(), GROUP BY x functions. or if you only need to select a subset of the columns, but with "select *" you arent reviewing/testing the strength of columnar."
Our use case is to retrieve all data (bid,Ask, asset and time) with no aggregations. I read that MariaDb Columnstore is a very  good solution to use when having to do with time-series data. And because we are already using MariaDb dont wanted to involve a time-series database such as timescaledb. In any case I tried to get number of ticks per hour for an asset 

SELECT
    FROM_UNIXTIME(TimestampUnix, '%Y-%m-%d %H:00:00') AS Hour,
    COUNT(*) AS NumberOfTicks
FROM
    Tick
WHERE
    Asset = 'Apple'
    AND DATE(FROM_UNIXTIME(TimestampUnix)) = '2023-03-15'
GROUP BY
    Hour
ORDER BY
    Hour;

and still it takes ages.

Thanks again.



data.csv

pantonis

unread,
May 8, 2023, 2:17:45 AM5/8/23
to MariaDB ColumnStore
VM is brand new empty.  How can I check if it is a VM issue? 

Thank you

drrtuy

unread,
May 9, 2023, 11:57:45 AM5/9/23
to MariaDB ColumnStore
Hi,

Performance issues are hard to solve sometimes but in most cases there will be a hardware bottleneck. I find tool called atop very useful to monitor potential hardware bottlenecks.
From the bits you shared @pantonis, namely the fact that every PrimProc thread utilizes 25% on average,  I assume there might be IO bottleneck which limits  the performance.
Despite the fact the version that is available for community is 2 years old comparing with the version that is going to be released soon with MariaDB 11.1, there should be no 10 minutes delays w/o an obvious hardware bottleneck.

Regards,
Roman

понедельник, 8 мая 2023 г. в 09:17:45 UTC+3, pantonis:

pantonis

unread,
May 10, 2023, 1:27:03 AM5/10/23
to MariaDB ColumnStore
Hello and thanks for your reply Roman,

I will try to run the tests on a more powerful VM to see if the issue is the IOPS although I doubt it because I have run tests also on Postgresql/TimescaleDb with exactly the same data and the query returns under 500ms.
Regarding versions. I m using Mariadb 10.11.2-MariaDB-1:10.11.2+maria~ubu2004 Community edition and ColumnStore version is 6.4. The versioning of MariaDb and Columnstore is kind of confusing. Been looking to find versions and releases but their website is not user friendly at all and you cannot find easily what you are looking for.
What about MariaDb Enterprise Server? What is the version of ColumnStore for the enterprise version? Still 2 years old?
Based on release notes of 11.X I don't see any changes in ColumnStore.

Thank you again

drrtuy

unread,
May 10, 2023, 8:28:59 AM5/10/23
to MariaDB ColumnStore
Though I don't think you should see a 10 minute delay even with this capacity. atop is the way to go IMHO. At least to see where the bottleneck is.
After some consideration I think you might also hit the peculiar behavior that affects timings when MariaDB server returns a relatively big data set. MDB by default buffers the whole data set  before it returns the set to the consumer. According with my experience this buffering might add couple minutes of extra delay. You can turn the server side buffering off with '-q' starting mariadb client.
Community Columnstore 6.x is 2 years old now. Enterprise has the most recent stable version 23.02.03.
The upcoming Community MDB 11.1.1 will have the same 23.02.03 though. JFYI Columnstore will be available for relatively modern distros, e.g. Rocky8/9, Ubuntu 20 etc , e.g. there will be no MCS 23.02.3 for centos/RHEL 7 with MDB 11.1.1. Enterprise version has it though.

Regards,
Roman

среда, 10 мая 2023 г. в 08:27:03 UTC+3, pantonis:

pantonis

unread,
May 12, 2023, 3:56:16 AM5/12/23
to MariaDB ColumnStore
Roman,

Many thanks for your reply.
I will arrange for a new VM so that tests can be rerun. I prefer to use MariaDb CS at the end instead of other engines, so I will give it another try. Can you please suggest hardware specifications, if it is possible please, for my scenario so that I can run my tests?

Regarding the second part of your answer:
"The upcoming Community MDB 11.1.1 will have the same 23.02.03 though. JFYI Columnstore will be available for relatively modern distros, e.g. Rocky8/9, Ubuntu 20 etc ," e.g. there will be no MCS 23.02.3 for centos/RHEL 7 with MDB 11.1.1. Enterprise version has it though"
If I understood correctly you are saying that for some distros MCS Community will not be available after 11.1.1 and for some it will. If this is the case is there anywhere that it shows which distros will be supported and which ones not?
Also on Enterprise version all distros will be supported?

Thank you again

Antonis

pantonis

unread,
May 14, 2023, 5:30:02 AM5/14/23
to MariaDB ColumnStore
@Roman, Can you advice based on your experience what hardware would you select?

Thank you

Antonis

On Wednesday, 10 May 2023 at 15:28:59 UTC+3 drrtuy wrote:

drrtuy

unread,
May 14, 2023, 6:41:24 AM5/14/23
to MariaDB ColumnStore
Originally MCS was available for a subset of distros that MDB supports.
AFAIK MCS after 11.1.1 will be available at Rocky 8,9(that are compatible with RHEL), Ubuntu 20, 22, Debian 11 for x86 and arm64.
ES version addiotionally supports Centos 7 for x86 but not for ARM, Debian 10.

Regards,
Roman

пятница, 12 мая 2023 г. в 10:56:16 UTC+3, pantonis:

drrtuy

unread,
May 14, 2023, 6:46:07 AM5/14/23
to MariaDB ColumnStore
IMHO the hardware you picked for a quick test should be OK. I mean the 10 minutes to get the results timings tells something went wrong and given that we didn't find the root cause I struggle to answer what should be altered. I expects the results to be closer to what @Todd shared + it is worth to disable server side result caching that should speed up the time to return the results. Moreover I suggest to assess performance using a software client b/c this will give you more realistic timings.

Regards,
Roman

воскресенье, 14 мая 2023 г. в 12:30:02 UTC+3, pantonis:

pantonis

unread,
May 14, 2023, 6:56:44 AM5/14/23
to MariaDB ColumnStore
Todd said that he received about 7 millions of results. my data based on the query return only 250 records. so I don't think should be the case. 

" Moreover I suggest to assess performance using a software client b/c this will give you more realistic timings."
Do you have any in mind?

Thank you again

Antonis

drrtuy

unread,
May 14, 2023, 10:41:06 AM5/14/23
to MariaDB ColumnStore
I read my answer again and realized that my suggestion is not clear.
I suggest you to measure the time you need to retrieve the answer set using one of the connectors available instead of relying on client timings.

There are some optimization bits for the query/schema. There is a mechanism called Extent Elimination. You can treat it as a dumb-simple index. It stores min/max for a series of column values and if the filter literal doesn't fit into the range MCS doesn't scan this column values series. In the context of Columnstore this series is called an extent.
Moreover the original schema and charset used turns the Asset column into what we call a dictionary column. It is basically a pair of columns where the first contains the column value offset for the second.
The query optimizer prefers to run char/text column filters first in the Columnstore. The current Columnstore version doesn't do extent elimination for dictionary columns. We have it in the code but as of now it is disabled b/c of technical reasons. 
So here is how you can turn character first, no EE filtering on Aspect into a vectorized no-char filter.
CREATE TABLE `FinancialTicks` (
  `Asset` varchar(7) NOT NULL,

  `Timestamp` int(11) NOT NULL,
  `Bid` double NOT NULL,
  `Ask` double NOT NULL
) ENGINE=Columnstore DEFAULT CHARSET=ascii COLLATE=ascii_bin;

The explain output for MCS tables doesn't bring much info, however if you run select mcssettrace(1) before the query and select mcsgettrace(1) after the query you will have an execution plan from the perspective of MCS.

Here is the filtering section for the original schema:
                                                 -- scan driven
                                                 3 filter steps:
                                                    ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3013 name=asset (scan) (tokens)
                                                    DictStepJL: 1 filters, BOP=0, oid=3017 name=asset
                                             
                                                    ColumnCommandJL: 2 filters, BOP=1, colwidth=4 oid=3014 name=timestamp
                                                 4 projection steps:
                                                    RTSCommandJL: oid=3017 colName=asset
                                                 ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3013 name=asset (tokens)
                                                 DictStepJL: 0 filters, BOP=0, oid=3017 name=asset
                                             
                                                    PassThruCommandJL: colwidth=4 oid=3014 colName=timestamp
                                                    ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3015 name=bid
                                                    ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3016 name=ask


And here is the one for the modified:
                                                 -- scan driven
                                                 3 filter steps:
                                                    ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3013 name=asset (scan) (tokens)
                                                    DictStepJL: 1 filters, BOP=0, oid=3017 name=asset
                                             
                                                    ColumnCommandJL: 2 filters, BOP=1, colwidth=4 oid=3014 name=timestamp
                                                 4 projection steps:
                                                    RTSCommandJL: oid=3017 colName=asset
                                                 ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3013 name=asset (tokens)
                                                 DictStepJL: 0 filters, BOP=0, oid=3017 name=asset
                                             
                                                    PassThruCommandJL: colwidth=4 oid=3014 colName=timestamp
                                                    ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3015 name=bid
                                                    ColumnCommandJL: 0 filters, BOP=0, colwidth=8 oid=3016 name=ask


Moreover you pointed out that we should re-evaluate char over integer filter preference default we have now in the code.

Regards,
Roman

воскресенье, 14 мая 2023 г. в 13:56:44 UTC+3, pantonis:

pantonis

unread,
May 15, 2023, 8:08:25 AM5/15/23
to MariaDB ColumnStore
Hello Roman and thanks for your reply.

Regarding connector we are using C# .NET and I can see that is not available in the list.
For the query just wanted to point out that the filtering sections you posted are identical. 
Also one more thing that I find it difficult.
I would like to give our feedback as of now. One of the main goals of choosing a database engine is to avoid all these technicalities and to spend time developing our apps without spending 90% of our time to troubleshoot the database engine and optimize it. Our experience as of now with MCS shows that in order to use it you need to spend a huge amount of time and resources to finetune it and I suppose to later maintain. Just want to point out that MCS has to consider making MCS more developer friendly because at the end our job is to develop software systems. I know that databases are complex but this is something we didn't experience out there with other competitor engines although our systems use MariaDb Inno for transactional processes and it would be ideal to use the same Db for our data warehouse project,  we are thinking not to use MCS due to the complexity and maintenance factors mentioned above.

Furthermore I m highly grateful for all of your help that you provided. Much appreciated.

Thank you again, and have a nice day

drrtuy

unread,
May 17, 2023, 6:59:18 AM5/17/23
to MariaDB ColumnStore
Hi,

AFAIK for C#.NET you can use MySQL connector.
Right, the filtering sections are identical and this is the SQL data type that makes the difference.
Thank you for the valuable input as a prospect, we will take it into account.

Regards,
Roman

понедельник, 15 мая 2023 г. в 15:08:25 UTC+3, pantonis:

pantonis

unread,
May 19, 2023, 11:15:05 AM5/19/23
to MariaDB ColumnStore
Hello,

I setup a new VM for the tests again. Since Im new to MCS can you suggest optimal configuration of MCS for my scenario

Thank you in advance

drrtuy

unread,
May 23, 2023, 5:42:43 PM5/23/23
to MariaDB ColumnStore
Hi!

AFAIK the server team found last-moment bug that delays 11.1.1 release. I expect the release to be available this or next week.

Regards,
Roman

пятница, 19 мая 2023 г. в 18:15:05 UTC+3, pantonis:

Ferran Gil Cuesta

unread,
Aug 29, 2023, 9:14:34 AM8/29/23
to MariaDB ColumnStore
I've been trying to install MariaDB 11.1.2 (stable) and also 11.2.1 (RC) in a Debian 11 but so far I don't see the columnstore engine, ever. "Show engines" never lists it, although mariadb updates just fine from a previous version.
As per Roman's message on May 14, 2023, Debian 11 should support MCS...
Is there any (updated) information on how to proceed?
Thanks
Ferran

drrtuy

unread,
Sep 22, 2023, 5:35:46 AM9/22/23
to MariaDB ColumnStore
This is strange b/c I see the Columnstore package available at this mirror:
https://mirrors.ptisp.pt/mariadb/repo/11.1/debian/dists/bullseye/main/binary-amd64/Packages

Regards,
Roman
вторник, 29 августа 2023 г. в 16:14:34 UTC+3, Ferran Gil Cuesta:

Ferran Gil Cuesta

unread,
Oct 19, 2023, 5:36:55 AM10/19/23
to MariaDB ColumnStore
I managed to install/enable columnstore into MariaDB by doing:
sudo apt install mariadb-plugin-columnstore and after a reboot I can now query all columnstore tables.
It seems strange that ColumnStore is not included by default, as InnoDB, etc... This used to be like that in previous versions.
Reply all
Reply to author
Forward
0 new messages