1 Billion Row Challenge


Nivethan T

Jan 12, 2024, 11:29:25 AM
to Pick and MultiValue Databases
Other communities are taking a shot at the 1 Billion Row Challenge, and I think it could be fun to do it with Pick.


The challenge is specifically for Java, but we should be able to modify it to load the data into Pick and run a query. It would be fun to see how fast it is.

Someone's done it with SQL, so we can compare against it:


I'm planning to do this with UniVerse and if someone tries scarletdme, that would also be interesting.
Message has been deleted

Joe G

Jan 12, 2024, 1:33:48 PM
to Pick and MultiValue Databases
P.S. I left off that the machine I ran it on is an AMD Ryzen 9 5950X 16-core processor with 60 GB of memory.

Joe G

Jan 12, 2024, 1:48:40 PM
to Pick and MultiValue Databases
I'm not sure what happened there but my P.S. went through and my original post didn't.

Doing this with Pick really is a challenge. Are you planning on loading it into a hashed file and using Retrieve to extract the stats? If so, the results will be interesting.

I thought I'd give the challenge a try, but not with Pick. I don't have it running on my local machine, and I don't think my clients would appreciate me putting a 13.8 GB file on their machines.

Instead, I thought I'd use Python. Python isn't high-performance, but it has two libraries that are used a lot in data science: pandas and NumPy. Those two libraries are at the center of a lot of the AI development. NumPy provides in-memory arrays that can store huge datasets. Pandas makes it easy to query the data.

The first step was to create the 1-billion-row measurements.txt file. The challenge author provided a Java program to do it. I took that program and had ChatGPT convert it to Python. I then ran it and created the 18.3 GB file. It took 21:13 minutes! I was surprised it took that long since I have a very fast SSD. I could probably cut that time down with some optimizing, but not right now.
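As a rough illustration, a generator along these lines can be sketched in a few lines of Python. This is not the converted program; the station list and temperature ranges here are placeholders, and the real generator draws from all 413 stations with per-station means. It just assumes the challenge's "station;temperature" line format:

import random

# Placeholder station list; the real 1BRC set has 413 names.
STATIONS = ["Abha", "Abidjan", "Accra", "Zagreb", "Zürich"]

def generate(path, rows):
    # Write one "station;temp" line per row, buffered to keep writes cheap.
    with open(path, "w", encoding="utf-8", buffering=1024 * 1024) as f:
        for _ in range(rows):
            station = random.choice(STATIONS)
            temp = random.uniform(-99.9, 99.9)
            f.write(f"{station};{temp:.1f}\n")

if __name__ == "__main__":
    generate("measurements.txt", 1_000_000)   # 1_000_000_000 for the real challenge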

The next step was to write a Python program to read the data into pandas. That was easy. I just asked ChatGPT to write one that would process the output from the program it converted from Java.
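The core of that pandas approach is only a few lines. This is a minimal sketch rather than the attached script (the column names here are assumed; the file format is the challenge's "station;temperature"):

import time
import pandas as pd

start = time.time()
# Read the semicolon-delimited measurements file into a DataFrame.
df = pd.read_csv("measurements.txt", sep=";", header=None,
                 names=["station", "temp"])
print(f"{time.time() - start:.1f} seconds to read the file")

# Group by station and compute min/mean/max, sorted by station name.
stats = df.groupby("station")["temp"].agg(["min", "mean", "max"]).sort_index()
print(stats)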

Finally, I added some timings and ran the program. This is the result:

1705082258.0025427 - Starting the script
                min       mean   max
id                                  
Abha          -29.0  18.003993  70.1
Abidjan       -28.1  25.993178  74.8
Abéché        -27.4  29.396475  82.6
Accra         -24.4  26.389850  76.7
Addis Ababa   -34.5  15.994150  67.0
...             ...        ...   ...
Zagreb        -38.2  10.692418  59.2
Zanzibar City -21.9  25.995663  77.8
Zürich        -38.5   9.293477  55.7
Ürümqi        -44.8   7.399864  56.1
İzmir         -35.6  17.896077  68.4

[413 rows x 3 columns]
It took 492.3570783138275 seconds to read the file
It took 70.14955878257751 seconds to calculate the statistics
It took 0.008239507675170898 seconds to sort the results
It took 562.5148766040802 seconds in total


So it took over 8 minutes to load and then a little over a minute to generate the stats.

So what do we learn? Nothing about my programming ability. This wasn't as fun as generating optimized code using obscure tricks to make it really fast. On the other hand, I was able to create the data file and generate the stats in less than an hour of time. That's actually pretty cool.

Joe Goldthwaite

Jan 12, 2024, 1:51:36 PM
to mvd...@googlegroups.com
I forgot to attach my code if anyone wants to look at it.


1brc.zip

Nivethan T

Jan 12, 2024, 11:20:18 PM
to Pick and MultiValue Databases
Hi Joe, I'm way underpowered compared to you! I'm using an Intel Xeon E5520 with 16 GB of RAM.

Loading 1 million records is slow so I didn't even bother with 1 billion.

92.58 seconds to load the data.

74.85 seconds to execute the listing with the query language.

I have a feeling there are some easy gains somewhere but I'll need to look.

If anyone sees something obvious in the code please let me know:

Jim Idle

Jan 13, 2024, 12:34:06 AM
to mvd...@googlegroups.com
Loading the whole thing into memory is not going to work in any efficient way. It will need to use mmap and traverse the file efficiently. READSEQ will work, but won't be efficient at that scale. Maybe I will give this a go, in Go. I suspect UniVerse will be poor at this; you have better options in jBASE, but I would say that.
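For what it's worth, here is the shape of that memory-mapped, streaming idea as a Python sketch (the plan above is Go; this only shows the pattern of walking the file without loading it all into memory):

import mmap

def scan(path):
    counted = 0
    with open(path, "rb") as f:
        # Map the whole file read-only; the OS pages it in as we walk it.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = 0
            while True:
                nl = mm.find(b"\n", pos)
                if nl == -1:
                    break
                # mm[pos:nl] is one "station;temp" record; aggregate it here.
                pos = nl + 1
                counted += 1
    return counted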

Jim

Wols Lists

Jan 13, 2024, 7:31:29 AM
to mvd...@googlegroups.com
On 13/01/2024 04:20, Nivethan T wrote:
> I have a feeling there are some easy gains somewhere but I'll need to look.

First gain, MINIMUM.MODULO

Second gain, indexing on station name, I guess.

Third gain, something I'm planning to do for my first task at work
(which oddly enough is vaguely similar to this): use station name as the
key, and aggregate the records as you read them in, storing the data as
multi-values. You still want that index from the second gain, though ...

So, minimum.modulo doesn't gain us much, but splitting groups is still
an unnecessary cost. Indexing on station name gives us a sorted list
with no effort whatsoever. And storing the data as multi-values rather
than values clusters it neatly in exactly the form required for output :-)

Cheers,
Wol

Joe G

Jan 13, 2024, 11:31:56 AM
to Pick and MultiValue Databases
Hi Nivethan, 

This takes me back to the first utility I wrote on a Pick system, a plist program to print formatted Basic code. I would start it running and it would display the first half of the program really quickly then go slower and slower until it reached the end.

The problem was that I was looping through the program and extracting it line by line with the dynamic array reference LINE = PROGRAM<LINE.NO>. Every time I extracted the next line, the extract would start at the beginning of the program and count attribute marks until it got to the requested line, then extract it. The further into the program, the longer it took.

The solution then was to QSELECT the program and read the lines with the READNEXT statement. I've heard that Universe and other implementations have code in the extract function that helps mitigate this issue. Universe has the "REMOVE VAR FROM DYNAMIC.ARRAY SETTING DELIM" statement that keeps track of where it is so it doesn't have to recount.
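A quick Python analogy makes the difference concrete (this is only an analogy for the behaviour described above, not how any implementation actually does it). Extracting PROGRAM<LINE.NO> has to count attribute marks from the start every time, so pulling out all N lines is O(N^2); a REMOVE-style cursor remembers where it left off and makes one pass:

AM = "\xfe"   # stand-in for the attribute mark
program = AM.join(f"line {i}" for i in range(50_000))

# Quadratic overall: each extract counts marks from the start of the string.
def extract(dyn, n):
    pos = 0
    for _ in range(n - 1):
        pos = dyn.find(AM, pos) + 1
    end = dyn.find(AM, pos)
    return dyn[pos:] if end == -1 else dyn[pos:end]

# Linear overall: keep a cursor, like REMOVE ... SETTING DELIM.
def remove_all(dyn):
    pos = 0
    while True:
        nxt = dyn.find(AM, pos)
        if nxt == -1:
            yield dyn[pos:]
            return
        yield dyn[pos:nxt]
        pos = nxt + 1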

So, looking at your code, the biggest performance problem is going to be here:

      READU ANYTHING FROM STATION.FILE,ITEM.ID THEN
         ANYTHING<1,-1> = TEMP
         WRITE ANYTHING ON STATION.FILE,ITEM.ID
      END ELSE
         WRITE TEMP ON STATION.FILE,ITEM.ID
      END

There are only 413 weather stations in the file. With a million rows that's 2421 records per station. At a billion rows it's 24,421,000. I wrote a program to test appending values onto a dynamic array. These are the timings:

0.0169 1000
0.0393 2000
0.0737 3000
0.1227 4000
0.1864 5000
0.2645 6000
...
75.1228 95000
76.7386 96000
78.3794 97000
80.0271 98000
81.6959 99000

Each line represents adding 1000 values to an array using the minus index, ANYTHING<1,-1> = TEMP.  As you can see, it starts to slow down after 2000 but if I had continued to the billion mark we'd still be waiting. This doesn't even include the overhead of reading, locking and writing the items on the file. That all also takes longer as the items get bigger. Once you go through all that and have a file, those large items will each have to be processed by the select statement. That means reading the large items, extracting the data and doing the calculations on it.

This illustrates the weakness in the Pick database structure. Hashed files are really really great at handling large data sets where you have to update individual records. With a properly sized file, key selection and hashing algorithm, you can go directly to the section that holds your data even if the file has 2 billion rows. You can pull the record out, update it and write it back without disturbing most of the other records.

Reporting, on the other hand, suffers. It will get slower and slower as the file sizes get bigger. You can address this with various indexing methods, but that's adding overhead and complexity to the record updates.

In the real world with most datasets this doesn't matter and Pick is a great choice. It's not the right tool for the billion record challenge.

With all that said, you don't have to load the data into a database or memory to calculate the numbers. All you need is one line per station with the count, max, min and total. Since there are only 413 stations, we can write a program that makes one pass through the file using READSEQ to aggregate the numbers. I've attached an attempt. I'm simulating a million rows by running the measurements file through the program 2430 times. It takes about 8 seconds. Extrapolating that to 1 billion would come out to about 2 hours 20 minutes. Still not close to the competition, but doable.
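In Python terms (not the attached BASIC program, just the same one-pass idea rendered in another language), the whole thing is a dictionary of running aggregates:

def aggregate(path):
    stats = {}   # station -> [count, min, max, total]
    with open(path, encoding="utf-8") as f:
        for line in f:
            station, temp_s = line.rstrip("\n").split(";")
            temp = float(temp_s)
            s = stats.get(station)
            if s is None:
                stats[station] = [1, temp, temp, temp]
            else:
                s[0] += 1
                if temp < s[1]: s[1] = temp
                if temp > s[2]: s[2] = temp
                s[3] += temp
    # One output line per station, sorted by name.
    for station in sorted(stats):
        cnt, mn, mx, tot = stats[station]
        print(f"{station}: min={mn:.1f} mean={tot / cnt:.1f} max={mx:.1f}")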
1BRC.PICK.zip

Nivethan T

Jan 13, 2024, 11:54:56 AM
to Pick and MultiValue Databases
Thanks Wol, I tried some of your suggestions.

Jim, I'm glad you got to the read and write issue. I added a hashmap implementation that used internal subroutines and separated out the building of the data from the writing of the data. This brought it down by about 3x, which is still bad but not as bad.

Tried so far:
MINIMUM.MODULUS didn't do much.
Indexing on @ID didn't do anything.
The REMOVE statement that lets you track the delimiters didn't make much of a difference.
LOCATE with sorted order didn't do much.

This is definitely not something Pick was designed for, but it is fun to see the edges. Really interesting that reading the file line by line and keeping track of things still takes this long.

Nivethan T

Jan 13, 2024, 12:02:52 PM
to Pick and MultiValue Databases
This is for just 1 million records:

Current timing is 25.04 seconds to load the data into a hashed file.
Reporting on that data took 80.34 seconds.

Joe Goldthwaite

Jan 13, 2024, 12:46:22 PM
to mvd...@googlegroups.com
25.04 seconds is faster than I thought it would be. 80 seconds is also faster. Still, increasing the load by 1,000x will take a long time.

In my example, most of the time is being taken up in the locate statement used to get the measurement positions from the station name. Depending on where the station is in the sorted list, it takes different amounts of time. Here are some timings where I did 1,000,000 locate statements against the first, middle and last items in the list:

Locating Abha    =0.0982
Locating Loadwar =4.3245
Locating İzmir   =9.5638

Since my timing took about 8 seconds to process a simulated 1,000,000 rows, I figure that most of the time was doing that locate. I don't know how I could speed that up with mvdb basic. 

Maybe reading the position value from a hashed file would be quicker? If the file was large enough that there was only one record per group it might speed things up. In the old days disk reads were orders of magnitude slower, but now all those operations are happening in memory, so maybe?
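A quick way to see the same effect outside of Pick is to time a linear search against a hash lookup in Python (the station names below are placeholders; list.index stands in for LOCATE and a dict for the hashed lookup):

import timeit

stations = [f"station{i:03d}" for i in range(413)]
positions = {name: i for i, name in enumerate(stations)}

# Time lookups for the first, middle and last names in the list.
for name in (stations[0], stations[206], stations[412]):
    linear = timeit.timeit(lambda: stations.index(name), number=100_000)
    hashed = timeit.timeit(lambda: positions[name], number=100_000)
    print(f"{name}: linear={linear:.3f}s hashed={hashed:.3f}s")

The linear search gets slower the further down the list the name sits, while the dict lookup stays flat, which is the same pattern as the LOCATE timings above.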

P.S. I got the decimal place wrong on my first post. A billion divided by 413 is 2,421,308 not 24 million.

Joe Goldthwaite

Jan 13, 2024, 1:03:28 PM
to mvd...@googlegroups.com
I answered my question. I changed the program to read the index from a hashed file. Now it's taking a little less than 3.5 seconds to process the simulated million rows.

1BRC.BASIC.2.txt

Nivethan T

Jan 13, 2024, 4:00:41 PM
to Pick and MultiValue Databases
This was exactly my impetus to write a hashing library for BASIC.

I found LOCATEs, even when sorted, took linear time. The hashmap implementation brings the lookup down to constant time.


My current best attempt:


I really like that position trick; it's got the same idea as my hashmap idea but simpler. That's something I'll need to chew on, as I imagine it might be useful in other places where I want to do something quickly instead of pulling in a library. Would you be able to do the position tracking in memory rather than writing out to a file? I think that would speed things up quite a bit.

A raw READSEQ of 1 million items takes 3 seconds on my server. I feel like there should be a way to get 25 seconds down to at least the same order of magnitude.

Nivethan T

Jan 13, 2024, 4:10:17 PM
to Pick and MultiValue Databases
To elaborate on Joe's testing,

I'm using a bit of jargon which I think would be better served if I laid it out a bit more clearly. The LOCATE takes some amount of time X based on the length of the string. The longer the string, the longer the LOCATE will take in the worst case of searching for something at the end. These are usually string find functions.

Hashmaps can find any element in the same time X, and this is true for all the elements. It's like using array positions in a language like Python/JavaScript, or a dimensioned array in BASIC. Dynamic arrays in BASIC don't seem to have this property: <1> is faster than <255>. (This might be specific to the flavour of Pick being used.)

Joe Goldthwaite

Jan 13, 2024, 5:32:57 PM
to mvd...@googlegroups.com
A hashed lookup function in Basic would be better than writing to a file. In fact, that's how dictionaries in Python work. If you wanted to add the functionality to Pick, you could follow the same syntax:

a = {} ;* using {} to create a dictionary object since they're not being used in Pick basic and that's what Python uses.

a{'key'} = some.value ;* python uses [] but with basic you might want some way to tell dictionary extracts from the other {} functions.

print a{'key'}

It would require changes to the compiler. I don't know if you could write a library in MVDB Basic that would perform well. Maybe. Basic also doesn't have things that make it easier to write library extensions. There's no mechanism to store values inside an object, which you would need in order to keep track of the arrays, keys and such.

Remember too that even though my sample was working with a file, that's probably not what's really happening. I created the file with a modulo of 101. That allocates around 20k of space. That's all going to be in the cache, so you'll be pulling the data out of RAM anyway. Having a string library would avoid the overhead of the disk calls, but it's a lot more work to do it correctly.

The other option would be to use something like the Universe python interface. That would give you direct access to the dictionaries.



Peter Schellenbach

Jan 13, 2024, 8:09:43 PM
to Pick and MultiValue Databases
Sorry, I have not followed this thread carefully, but noticed Joe's desire for Python-like dictionaries in Pick. OpenQM data collections and jBASE dynamic objects both implement this behavior. I am not sure how either of them would work with massive amounts of data, but just wanted to mention that the functionality exists.

Brian Speirs

Jan 14, 2024, 3:08:49 AM
to Pick and MultiValue Databases
OK. I took Joe's list of weather stations and created a program to process those in QM. This is running on modest hardware relative to that mentioned elsewhere - an AMD Ryzen 5 4500U with 16 GB of RAM and a 500 GB NVMe SSD.
I did the MV thing and converted the station names to numeric codes. To make things a bit more realistic, the codes weren't contiguous and weren't in alphabetic order. But I did work with the names in memory, along with a cross-reference from the station number to the sort order.
I didn't run the code for a billion records, but I did run it for 100 million records. That took 700 seconds to build the data file, and 1,014 seconds (nearly 17 minutes) to process the records and build the output.
The increase in time from running with 1 million records wasn't quite linear - but wasn't far off it. For one million records, it took 6.3 seconds to build the file, and 9.4 seconds to process them.
Code attached.
Cheers,

Brian
BRC2.zip

Joe G

Jan 15, 2024, 3:15:57 PM
to Pick and MultiValue Databases
That's good to know, Pete. I have no way to play with jBASE. I'm not surprised that OpenQM has it. Is it still called "OpenQM" when it's no longer open? Just curious.


Joe G

Jan 15, 2024, 3:18:22 PM
to Pick and MultiValue Databases
Hi Brian,

Interesting approach. Translating the station names to a cross-referenced number eliminates the time to look up the number for each detail record using a LOCATE or hashed array. That speeds things up a lot.

Joe G.

Wol

Jan 15, 2024, 5:21:02 PM
to mvd...@googlegroups.com
Well, it's sort of open, in that the source for 2.6 is available.

Now that Rocket's in charge, there's no hope of getting it re-opened,
but I'm planning to offer an olive branch. I don't expect it to be
accepted, but one can but try ...

I now have to have a fight with my employer's legal eagles, but I'm not
expecting trouble there. If everything goes to plan I will soon be
devoting a few work hours to improving Scarlet - wayhey !!!

Cheers,
Wol

Nivethan T

Jan 16, 2024, 12:15:35 AM
to Pick and MultiValue Databases
I was able to do some inlining and simplification of my hashing logic and cut the time down to 8ish seconds to load 1 million records.

This is about a 10x improvement over what I had originally, but it is way more finicky. I'd bet that my simplistic hashing will result in collisions, but it seems to work on my test data set.


The ultimate solution in my eyes would be to rip through the data to find all the keys, give them an equate, and then do the load. I think the load would then be basically at READSEQ speed.
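A rough sketch of that first pass in Python (hypothetical, just to show the idea): rip through the file once to collect the distinct station names and give each one a fixed slot, so the load pass can aggregate into positional buckets instead of doing a LOCATE or hash lookup design of its own per line (in BASIC the slots could become equates).

def build_slots(path):
    # Pass 1: collect distinct station names and assign each a fixed slot.
    names = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            names.add(line.split(";", 1)[0])
    return {name: i for i, name in enumerate(sorted(names))}

# Pass 2 would keep one [count, min, max, total] bucket per slot and
# update buckets[slots[station]] for each line.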

Now to see if I can actually get the LIST to speed up: dictionaries, correlatives and indices are on the table. If anyone has any ideas, feel free to chime in.

martinp...@ladybridge.com

Jan 16, 2024, 7:01:28 AM
to mvd...@googlegroups.com
> Is it still called "OpenQM" when it's no longer open? Just curious.

 

Some time ago, I spent a while looking back through archived emails and other documents to confirm that my memory regarding the product name was correct. The “open” first appeared in internal documents relating to the original Windows version being ported to open systems (Linux). This name change was partially motivated by the need to emphasise that we had a high degree of compatibility with PI/Open. When we did the open source version of QM, the “open” got a second meaning as cited in the history notes in the source code.

 

Martin

 

 

Brian Speirs

Jan 16, 2024, 1:39:43 PM
to Pick and MultiValue Databases
Thanks, Martin.

I think that illustrates perfectly that names don't always mean what we think they mean! <g>

Brian

Nivethan T

Jan 16, 2024, 10:23:38 PM
to Pick and MultiValue Databases
READBLK is drastically faster than READSEQ, as one might guess.

1 million records is about 14 MB. READBLK will read the entire thing in 0.012 seconds while READSEQ will take 2.5 seconds.

READBLK, however, has a limitation of around 2 GB in UniVerse, which feels like it's a tunable parameter. Maybe not.

Switching my program to use READBLK did speed things up, and now it's at 5 seconds for a million records. That translates to about 90 minutes to load 1 billion records if I'm super optimistic. However, this is only possible if I load the entire thing into memory, which it looks like I can't do in UniVerse.
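For comparison, the same block-at-a-time pattern in Python looks like this (a sketch, not UniVerse code; the 4 MB block size is arbitrary). Reading fixed-size blocks and carrying the partial last line into the next block avoids per-line reads without ever holding the whole file in memory, which sidesteps the 2 GB issue:

def read_blocks(path, block_size=4 * 1024 * 1024):
    with open(path, "rb") as f:
        leftover = b""
        while True:
            block = f.read(block_size)
            if not block:
                if leftover:
                    yield leftover
                return
            block = leftover + block
            lines = block.split(b"\n")
            leftover = lines.pop()   # partial last line, carried to next block
            for line in lines:
                yield line

# usage: for line in read_blocks("measurements.txt"): ...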
