I took a stab at the Billion Row Challenge. The hash function is interesting, but it only does half the job. I used an index in D3: no data in the file, just data in the index, using the ‘c’reate operation of the key statement. This not only gives the pointer into the dimensioned array where each weather station's stats live, it also gives a nice sorted list for the final listing.
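For anyone without a D3 system handy, here is a rough Python analogue of that index trick (a sketch only: a plain dict stands in for the B-tree index, which in D3 keeps the keys sorted as a side effect, so the dict needs an explicit sort for the final listing):

```python
# name -> slot map plus a stats array, mimicking the D3 index that
# hands back a pointer into the dimensioned array.
slots = {}                     # station name -> slot number
stats = []                     # one [min, max, total, count] per slot

def slot_for(city):
    """Return the slot for city, creating one on first sight
    (like the key('a', ...) call in the D3 program below)."""
    sid = slots.get(city)      # like key('r', ...) - read
    if sid is None:
        sid = len(stats)
        slots[city] = sid
        stats.append([float("inf"), float("-inf"), 0.0, 0])
    return sid

def add_reading(city, temp):
    """Fold one temperature into that station's running stats."""
    s = stats[slot_for(city)]
    s[0] = min(s[0], temp)     # min
    s[1] = max(s[1], temp)     # max
    s[2] += temp               # total
    s[3] += 1                  # count

# Final listing: D3's key('n', ...) walk comes back sorted already;
# with a dict we sort the keys explicitly.
def listing():
    return [(city, stats[slots[city]]) for city in sorted(slots)]
```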
I went all the way up to a billion records, using Joe’s Python program to generate these files:
ls -lrS *.txt
-rw-r--r-- 1 root root 13763620 Jan 20 16:31 mrc.txt
-rw-r--r-- 1 root root 137672081 Jan 22 03:52 10mrc.txt
-rw-r--r-- 1 root root 1376639587 Jan 22 04:50 100mrc.txt
-rw-r--r-- 1 root root 13766972087 Jan 22 06:43 brc.txt
These were the times I got:
mrc.txt:       10.707s to process temps   0.004s to list sorted data     10.711s Total
10mrc.txt:    107.883s to process temps   0.004s to list sorted data    107.887s Total
100mrc.txt:  1073.284s to process temps   0.005s to list sorted data   1073.289s Total
brc.txt:    10763.750s to process temps   0.139s to list sorted data  10763.889s Total
Pretty linear. The only other trickiness I needed to employ was to run this at precision 0. The quick math: 1 billion rows divided by 413 weather stations is, on average, just over 2.4 million temperatures per station. Multiplying that by the highest average temperature, 30.5 (from the Python file), gives about 74 million. Scaling that up by 10,000 (precision 4) comes to approximately 740 billion; but since I had already scaled the temperatures up by 10 myself to eliminate the decimal point, the worst-case total would be 7.4 trillion, well beyond the trigger that makes D3 fall back to either string math (bad performance) or floating-point math (bad accuracy). Running at precision 0 scales that back down to 74 million, which is well below the trigger. And since I had already removed the decimal, as all good programmers should, running at precision 0 is no problem; the masking puts the decimal back at the end.
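The arithmetic above checks out; here it is worked through explicitly (the 413 stations and the 30.5 maximum average are taken from the generator's station list, per the text):

```python
# Worst-case running total per station at each precision setting.
rows = 1_000_000_000
stations = 413
max_avg_temp = 30.5

per_station = rows / stations          # just over 2.4 million readings
running_total = per_station * max_avg_temp   # about 74 million

# The program already scales temps by 10 to drop the decimal point;
# precision 4 would scale the stored values by another 10,000.
worst_case = running_total * 10 * 10_000     # about 7.4 trillion

print(f"{per_station:,.0f} readings per station")
print(f"{running_total:,.0f} worst-case total at precision 0")
print(f"{worst_case:,.0f} worst-case total at precision 4")
```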
This was done on a single-core, single-processor VM instance (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz) with 1 GB of memory, so not the latest 5+ GHz CPU, but decent.
Thanks to RDM Infinity for the use of the VM.
If I were to do this for real, I would split the work into multiple processes, each processing part of the file, then do a final aggregation. I would base the split on how many cores my processor had; since mine had only one core, this method would not have gained me anything.
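That split-then-aggregate idea can be sketched like so (in Python rather than D3, as an illustration only: the file name matches the program's fopen, but the byte-range chunking and the use of os.cpu_count() for the split are my assumptions, not part of the D3 code):

```python
import os
from multiprocessing import Pool

FILE = "measurements.txt"

def chunk_bounds(path, n):
    """Split the file into n byte ranges, each ending on a newline."""
    size = os.path.getsize(path)
    bounds, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()                 # advance to the next line break
            end = f.tell()
            bounds.append((start, end))
            start = end
    bounds.append((start, size))
    return bounds

def aggregate(span):
    """Fold one byte range into city -> [min, max, total, count]."""
    start, end = span
    with open(FILE, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    stats = {}
    for line in data.splitlines():
        city, _, temp = line.partition(b";")
        t = float(temp)
        s = stats.get(city)
        if s is None:
            stats[city] = [t, t, t, 1]
        else:
            if t < s[0]: s[0] = t
            if t > s[1]: s[1] = t
            s[2] += t
            s[3] += 1
    return stats

def merge(parts):
    """Final aggregation: combine the per-chunk partial results."""
    out = {}
    for part in parts:
        for city, s in part.items():
            o = out.get(city)
            if o is None:
                out[city] = list(s)
            else:
                o[0] = min(o[0], s[0])
                o[1] = max(o[1], s[1])
                o[2] += s[2]
                o[3] += s[3]
    return out

if __name__ == "__main__":
    spans = chunk_bounds(FILE, os.cpu_count() or 1)
    with Pool(len(spans)) as pool:
        final = merge(pool.map(aggregate, spans))
    for city in sorted(final):
        mn, mx, tot, cnt = final[city]
        print(f"{city.decode()}={mn:.1f}/{tot / cnt:.1f}/{mx:.1f}")
```

One worker per core, one chunk per worker; the merge step is cheap because it only touches one row per station, not per reading.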
Here is my code:
:ct bp brc
brc
001 precision 0
002 dim wStation(500,4)
003 mat wStation = ''
004 root 'tmp','a1' to rv else stop "No index"
005 fs = (char*)%fopen('./measurements.txt','r')
006 nextid = 1; listkey = ""; listid = ""
007 printer on
008 time1 = system(12)
009 loop
010 char line[80]; * make a buffer
011 ptr = (char*)%fgets(line, 80, (char*) fs)
012 city = trim(field(line,';',1))
013 temp = field(field(line,';',2),char(10),1)*10
014 until ptr = 0 do
015 keyval = city; id = ''
016 key('r',rv,keyval,id) then
017 * Found the city in the index
018 if temp < wStation(id,1) then wStation(id,1) = temp; * min
019 if temp > wStation(id,2) then wStation(id,2) = temp; * max
020 wStation(id,3) += temp; * total
021 wStation(id,4) += 1; * count
022 end else
023 id = nextid
024 wStation(id,1) = temp; * min
025 wStation(id,2) = temp; * max
026 wStation(id,3) = temp; * total
027 wStation(id,4) = 1; * count
028 nextid += 1
029 key('a',rv,city,id) else null
030 end
031 repeat
032 *
033 time2 = system(12)
034 *
035 print "{":
036 key('n',rv,listkey,listid) then
037 print listkey:"/":(wStation(listid,1) 'mr11'):"/":
038 print (int(wStation(listid,3)/wStation(listid,4)) 'mr11'):"/":
039 print wStation(listid,2) 'mr11':
040 end
041 loop
042 key('n',rv,listkey,listid) then
043 print ", ":listkey:"/":(wStation(listid,1) 'mr11'):"/":
044 print (int(wStation(listid,3)/wStation(listid,4)) 'mr11'):"/":
045 print wStation(listid,2) 'mr11':
046 end else
047 exit
048 end
049 repeat
050 print "}"
051 time3 = system(12)
052 *
053 print oconv((time2 - time1),'mr33') 'r#12':"s to process temps"
054 print oconv((time3 - time2),'mr33') 'r#12':"s to list sorted data"
055 print oconv((time3 - time1),'mr33') 'r#12':"s Total"
056 printer off
--
You received this message because you are subscribed to
the "Pick and MultiValue Databases" group.
To post, email to: mvd...@googlegroups.com
To unsubscribe, email to: mvdbms+un...@googlegroups.com
For more options, visit http://groups.google.com/group/mvdbms