
Tim Rude

Feb 2, 2024, 12:46:04 PM
to mvd...@googlegroups.com

I took a stab at the Billion record challenge. The hash function is interesting, but it only does half the job. I used an index in D3: no data in the file, just data in the index, using the ‘c’reate operation of the key statement. The index not only gives the position in the dimensioned array where each weather station is held, it also gives a nicely sorted list for the final listing.
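For readers without D3 handy, the idea of an index that maps each station name to a row of a preallocated array, and then doubles as the sorted listing, can be sketched in Python. This is only a rough stand-in, not the D3 program: a dict plus sorted() takes the place of the D3 index, the names summarize, fmt and report are made up for illustration, the input is assumed to carry exactly one decimal digit per temperature, and truncating the mean toward zero is an assumption about how the original rounds.

```python
def summarize(lines):
    """Collect [min, max, total, count] per station, in tenths of a degree."""
    slot_of = {}  # station name -> row number (stand-in for the D3 index)
    table = []    # one [min, max, total, count] row per station
    for line in lines:
        name, _, value = line.strip().partition(";")
        # Drop the decimal point: "12.3" -> 123 (assumes one decimal digit).
        tenths = int(value.replace(".", ""))
        slot = slot_of.get(name)
        if slot is None:
            # First sighting: claim the next free row and record its number.
            slot_of[name] = len(table)
            table.append([tenths, tenths, tenths, 1])
        else:
            row = table[slot]
            row[0] = min(row[0], tenths)
            row[1] = max(row[1], tenths)
            row[2] += tenths
            row[3] += 1
    # Walking the names in sorted order gives the final listing order.
    return [(name, table[slot_of[name]]) for name in sorted(slot_of)]


def fmt(tenths):
    """Put the decimal point back: 123 -> '12.3'."""
    return f"{tenths / 10:.1f}"


def report(stats):
    """Render name/min/mean/max entries inside braces, comma separated."""
    parts = []
    for name, (lo, hi, total, count) in stats:
        mean = int(total / count)  # assumption: truncate toward zero
        parts.append(f"{name}/{fmt(lo)}/{fmt(mean)}/{fmt(hi)}")
    return "{" + ", ".join(parts) + "}"
```

Python's sorted() orders by code point, which may differ from the collation a D3 index uses for names with accents or mixed case.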


I ran it all the way up to a billion records, using Joe’s Python program to generate these files:


ls -lrS  *.txt
-rw-r--r-- 1 root root    13763620 Jan 20 16:31 mrc.txt
-rw-r--r-- 1 root root   137672081 Jan 22 03:52 10mrc.txt
-rw-r--r-- 1 root root  1376639587 Jan 22 04:50 100mrc.txt
-rw-r--r-- 1 root root 13766972087 Jan 22 06:43 brc.txt


These were the times I got:


    10.707s to process temps       0.004s to list sorted data       10.711s Total
   107.883s to process temps       0.004s to list sorted data      107.887s Total
  1073.284s to process temps       0.005s to list sorted data     1073.289s Total
 10763.750s to process temps       0.139s to list sorted data    10763.889s Total
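A quick way to see how evenly the run time scales is to divide each file size from the ls listing by its processing time. The sketch below assumes the four timings correspond to the four files in the order listed (smallest to largest), which the post implies but does not state; the names RUNS and bytes_per_second are made up for illustration.

```python
# (file size in bytes, seconds to process temps); the pairing is an assumption
RUNS = [
    (13_763_620, 10.707),         # mrc.txt
    (137_672_081, 107.883),       # 10mrc.txt
    (1_376_639_587, 1073.284),    # 100mrc.txt
    (13_766_972_087, 10763.750),  # brc.txt
]


def bytes_per_second(runs):
    """Throughput of each run in bytes per second."""
    return [size / seconds for size, seconds in runs]


rates = bytes_per_second(RUNS)
# Relative gap between the fastest and slowest run.
spread = (max(rates) - min(rates)) / min(rates)

for (size, seconds), rate in zip(RUNS, rates):
    print(f"{size:>14} bytes  {seconds:>10.3f}s  {rate / 1e6:.3f} MB/s")
print(f"spread between fastest and slowest: {spread:.2%}")
```

Under that pairing, every run lands between 1.27 and 1.29 MB/s, so the throughput stays essentially constant across three orders of magnitude of input size.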


Pretty linear. The only other trickiness I needed to employ was to run this at precision 0. The quick math: 1 billion rows divided by 413 weather stations is on average just over 2.4 million temperatures per station. Multiplying that by the highest average temperature, 30.5 (from the Python file), gives about 74 million. Scaled up by 10000 (precision 4), that is approximately 740 billion, and since I had already scaled the values up by 10 myself to eliminate the decimal point, it would be 7.4 trillion, well beyond the trigger that makes D3 use either string math (bad performance) or floating-point math (bad accuracy). Running at precision 0 brings that back down to about 740 million (the 74 million times my own factor of 10), which is well below the trigger. And since I had already removed the decimal point, like all good programmers should, running at precision 0 is no problem; the ‘mr11’ mask puts the decimal point back at the end.
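The arithmetic in that estimate can be reproduced with plain integers. This is a sketch of the reasoning only: it models the stored value as the running total in tenths of a degree multiplied by 10 to the power of the precision, which is my reading of the post rather than a statement about D3 internals, and it does not attempt to state where D3's trigger actually lies. The names are made up for illustration.

```python
ROWS = 1_000_000_000
STATIONS = 413
HIGHEST_MEAN_TENTHS = 305  # 30.5 degrees, with the decimal point removed

# Average number of readings per station (just over 2.4 million).
rows_per_station = ROWS // STATIONS


def worst_case_total(precision):
    """Largest running total for one station, as scaled at the given precision."""
    total_tenths = rows_per_station * HIGHEST_MEAN_TENTHS
    return total_tenths * 10 ** precision


for precision in (4, 0):
    print(f"precision {precision}: {worst_case_total(precision):,}")
```

At precision 4 the total is in the trillions (about 7.4 trillion); at precision 0 it is in the hundreds of millions (about 740 million).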


This was done on a single-core, single-processor VM instance (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz) with 1 GB of memory, so not the latest 5+ GHz CPU, but decent.

Thanks to RDM Infinity for the use of the VM.


If I were to do this for real, I would split it up into multiple processes, each processing part of the file, and then do a final aggregation.  I would base the split on how many cores my processor had. Since mine had only one core, this method would not have gained me anything.
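The split-then-aggregate approach described above might look roughly like this in Python. It is a sketch under assumptions, not something from the original run: the file is cut into byte ranges that each end on a newline, each range is summarized independently, and the partial results are merged. The names chunk_ranges, process_chunk and merge are hypothetical, and temperatures are assumed to have exactly one decimal digit.

```python
import os


def chunk_ranges(path, parts):
    """Cut the file into up to `parts` contiguous byte ranges ending on newlines."""
    size = os.path.getsize(path)
    ranges, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, parts + 1):
            if start >= size:
                break
            end = size if i == parts else max(start, size * i // parts)
            if end < size:
                f.seek(end)
                f.readline()  # move forward to the end of the current line
                end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges


def process_chunk(path, start, end):
    """Return {station: [min, max, total, count]} in tenths for one byte range."""
    stats = {}
    with open(path, "rb") as f:
        f.seek(start)
        pos = start
        while pos < end:
            line = f.readline()
            if not line:
                break
            pos += len(line)
            name, _, value = line.decode().rstrip("\r\n").partition(";")
            tenths = int(value.replace(".", ""))  # "12.3" -> 123
            row = stats.get(name)
            if row is None:
                stats[name] = [tenths, tenths, tenths, 1]
            else:
                row[0] = min(row[0], tenths)
                row[1] = max(row[1], tenths)
                row[2] += tenths
                row[3] += 1
    return stats


def merge(partials):
    """Final aggregation: combine the per-chunk dictionaries into one."""
    merged = {}
    for stats in partials:
        for name, (lo, hi, total, count) in stats.items():
            row = merged.get(name)
            if row is None:
                merged[name] = [lo, hi, total, count]
            else:
                row[0] = min(row[0], lo)
                row[1] = max(row[1], hi)
                row[2] += total
                row[3] += count
    return merged
```

Each (start, end) pair can be handed to a separate worker process, one per core, with merge run once over the results. Because min, max, total and count all combine associatively, the merged result matches a single pass over the whole file.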


Here is my code:


:ct bp brc

precision 0
dim wStation(500,4)
mat wStation = ''
root 'tmp','a1' to rv else stop "No index"
fs = (char*)%fopen('./measurements.txt','r')
nextid = 1; listkey = ""; listid = ""
printer on
time1 = system(12)
loop
  char line[80]; * make a buffer
  ptr = (char*)%fgets(line, 80, (char*) fs)
  city = trim(field(line,';',1))
  temp = field(field(line,';',2),char(10),1)*10
until ptr = 0 do
  keyval = city; id = ''
  key('r',rv,keyval,id) then
    * Found the city in the index
    if temp < wStation(id,1) then wStation(id,1) = temp; * min
    if temp > wStation(id,2) then wStation(id,2) = temp; * max
    wStation(id,3) += temp; * total
    wStation(id,4) += 1; * count
  end else
    id = nextid
    wStation(id,1) = temp; * min
    wStation(id,2) = temp; * max
    wStation(id,3) = temp; * total
    wStation(id,4) = 1; * count
    nextid += 1
    key('a',rv,city,id) else null
  end
repeat
*
time2 = system(12)
*
print "{":
key('n',rv,listkey,listid) then
  print listkey:"/":(wStation(listid,1) 'mr11'):"/":
  print (int(wStation(listid,3)/wStation(listid,4)) 'mr11'):"/":
  print wStation(listid,2) 'mr11':
end
loop
  key('n',rv,listkey,listid) then
    print ", ":listkey:"/":(wStation(listid,1) 'mr11'):"/":
    print (int(wStation(listid,3)/wStation(listid,4)) 'mr11'):"/":
    print wStation(listid,2) 'mr11':
  end else
    exit
  end
repeat
print "}"
time3 = system(12)
*
print oconv((time2 - time1),'mr33') 'r#12':"s to process temps"
print oconv((time3 - time2),'mr33') 'r#12':"s to list sorted data"
print oconv((time3 - time1),'mr33') 'r#12':"s Total"
printer off




Christopher Jeune

Feb 3, 2024, 2:42:37 AM
to mvd...@googlegroups.com
Where can I get an old copy of D3/NT from the late '90s to throw into an NT virtual machine, so I can play with some old software on some Pseudo files I have had for ages?

Thanks!
