I took a stab at the Billion Row Challenge. The hash function is interesting, but it only does half the job. I used an index in D3: no data in the file, just data in the index, using the ‘c’reate operation of the key statement. This not only gives the pointer into the dimensioned array where each weather station's stats live, it also gives a nice sorted list for the final listing.
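For anyone without a D3 system handy, here is a rough Python analogue of that index trick (a sketch only: a plain dict stands in for the B-tree index, which in D3 keeps the keys sorted as a side effect, so the dict needs an explicit sort for the final listing):

```python
# name -> slot map plus a stats array, mimicking the D3 index that
# hands back a pointer into the dimensioned array.
slots = {}                     # station name -> slot number
stats = []                     # one [min, max, total, count] per slot

def slot_for(city):
    """Return the slot for city, creating one on first sight
    (like the key('a', ...) call in the D3 program below)."""
    sid = slots.get(city)      # like key('r', ...) - read
    if sid is None:
        sid = len(stats)
        slots[city] = sid
        stats.append([float("inf"), float("-inf"), 0.0, 0])
    return sid

def add_reading(city, temp):
    """Fold one temperature into that station's running stats."""
    s = stats[slot_for(city)]
    s[0] = min(s[0], temp)     # min
    s[1] = max(s[1], temp)     # max
    s[2] += temp               # total
    s[3] += 1                  # count

# Final listing: D3's key('n', ...) walk comes back sorted already;
# with a dict we sort the keys explicitly.
def listing():
    return [(city, stats[slots[city]]) for city in sorted(slots)]
```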
I went all the way up to a billion records, using Joe’s Python program to generate these files:
ls -lrS *.txt
-rw-r--r-- 1 root root 13763620 Jan 20 16:31 mrc.txt
-rw-r--r-- 1 root root 137672081 Jan 22 03:52 10mrc.txt
-rw-r--r-- 1 root root 1376639587 Jan 22 04:50 100mrc.txt
-rw-r--r-- 1 root root 13766972087 Jan 22 06:43 brc.txt
These were the times I got:
mrc.txt:       10.707s to process temps   0.004s to list sorted data     10.711s Total
10mrc.txt:    107.883s to process temps   0.004s to list sorted data    107.887s Total
100mrc.txt:  1073.284s to process temps   0.005s to list sorted data   1073.289s Total
brc.txt:    10763.750s to process temps   0.139s to list sorted data  10763.889s Total
Pretty linear. The only other trickiness I needed to employ was to run this at precision 0. The quick math: 1 billion rows divided by 413 weather stations is, on average, just over 2.4 million temperatures per station. Multiplying that by the highest average temperature, 30.5 (from the Python file), gives about 74 million. Scaling that up by 10,000 (precision 4) comes to approximately 740 billion; but since I had already scaled the temperatures up by 10 myself to eliminate the decimal point, the worst-case total would be 7.4 trillion, well beyond the trigger that makes D3 fall back to either string math (bad performance) or floating-point math (bad accuracy). Running at precision 0 scales that back down to 74 million, which is well below the trigger. And since I had already removed the decimal, as all good programmers should, running at precision 0 is no problem; the masking puts the decimal back at the end.
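The arithmetic above checks out; here it is worked through explicitly (the 413 stations and the 30.5 maximum average are taken from the generator's station list, per the text):

```python
# Worst-case running total per station at each precision setting.
rows = 1_000_000_000
stations = 413
max_avg_temp = 30.5

per_station = rows / stations          # just over 2.4 million readings
running_total = per_station * max_avg_temp   # about 74 million

# The program already scales temps by 10 to drop the decimal point;
# precision 4 would scale the stored values by another 10,000.
worst_case = running_total * 10 * 10_000     # about 7.4 trillion

print(f"{per_station:,.0f} readings per station")
print(f"{running_total:,.0f} worst-case total at precision 0")
print(f"{worst_case:,.0f} worst-case total at precision 4")
```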
This was done on a single-core, single-processor VM instance (Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz) with 1 GB of memory, so not the latest 5+ GHz CPU, but decent.
Thanks to RDM Infinity for the use of the VM.
If I were to do this for real, I would split the work into multiple processes, each processing part of the file, then do a final aggregation. I would base the split on how many cores my processor had; since mine had only one core, this method would not have gained me anything.
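That split-then-aggregate idea can be sketched like so (in Python rather than D3, as an illustration only: the file name matches the program's fopen, but the byte-range chunking and the use of os.cpu_count() for the split are my assumptions, not part of the D3 code):

```python
import os
from multiprocessing import Pool

FILE = "measurements.txt"

def chunk_bounds(path, n):
    """Split the file into n byte ranges, each ending on a newline."""
    size = os.path.getsize(path)
    bounds, start = [], 0
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()                 # advance to the next line break
            end = f.tell()
            bounds.append((start, end))
            start = end
    bounds.append((start, size))
    return bounds

def aggregate(span):
    """Fold one byte range into city -> [min, max, total, count]."""
    start, end = span
    with open(FILE, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    stats = {}
    for line in data.splitlines():
        city, _, temp = line.partition(b";")
        t = float(temp)
        s = stats.get(city)
        if s is None:
            stats[city] = [t, t, t, 1]
        else:
            if t < s[0]: s[0] = t
            if t > s[1]: s[1] = t
            s[2] += t
            s[3] += 1
    return stats

def merge(parts):
    """Final aggregation: combine the per-chunk partial results."""
    out = {}
    for part in parts:
        for city, s in part.items():
            o = out.get(city)
            if o is None:
                out[city] = list(s)
            else:
                o[0] = min(o[0], s[0])
                o[1] = max(o[1], s[1])
                o[2] += s[2]
                o[3] += s[3]
    return out

if __name__ == "__main__":
    spans = chunk_bounds(FILE, os.cpu_count() or 1)
    with Pool(len(spans)) as pool:
        final = merge(pool.map(aggregate, spans))
    for city in sorted(final):
        mn, mx, tot, cnt = final[city]
        print(f"{city.decode()}={mn:.1f}/{tot / cnt:.1f}/{mx:.1f}")
```

One worker per core, one chunk per worker; the merge step is cheap because it only touches one row per station, not per reading.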
Here is my code:
:ct bp brc
brc
001 precision 0
002 dim wStation(500,4)
003 mat wStation = ''
004 root 'tmp','a1' to rv else stop "No index"
005 fs = (char*)%fopen('./measurements.txt','r')
006 nextid = 1; listkey = ""; listid = ""
007 printer on
008 time1 = system(12)
009 loop
010 char line[80]; * make a buffer
011 ptr = (char*)%fgets(line, 80, (char*) fs)
012 city = trim(field(line,';',1))
013 temp = field(field(line,';',2),char(10),1)*10
014 until ptr = 0 do
015 keyval = city; id = ''
016 key('r',rv,keyval,id) then
017 * Found the city in the index
018 if temp < wStation(id,1) then wStation(id,1) = temp; * min
019 if temp > wStation(id,2) then wStation(id,2) = temp; * max
020 wStation(id,3) += temp; * total
021 wStation(id,4) += 1; * count
022 end else
023 id = nextid
024 wStation(id,1) = temp; * min
025 wStation(id,2) = temp; * max
026 wStation(id,3) = temp; * total
027 wStation(id,4) = 1; * count
028 nextid += 1
029 key('a',rv,city,id) else null
030 end
031 repeat
032 *
033 time2 = system(12)
034 *
035 print "{":
036 key('n',rv,listkey,listid) then
037 print listkey:"/":(wStation(listid,1) 'mr11'):"/":
038 print (int(wStation(listid,3)/wStation(listid,4)) 'mr11'):"/":
039 print wStation(listid,2) 'mr11':
040 end
041 loop
042 key('n',rv,listkey,listid) then
043 print ", ":listkey:"/":(wStation(listid,1) 'mr11'):"/":
044 print (int(wStation(listid,3)/wStation(listid,4)) 'mr11'):"/":
045 print wStation(listid,2) 'mr11':
046 end else
047 exit
048 end
049 repeat
050 print "}"
051 time3 = system(12)
052 *
053 print oconv((time2 - time1),'mr33') 'r#12':"s to process temps"
054 print oconv((time3 - time2),'mr33') 'r#12':"s to list sorted data"
055 print oconv((time3 - time1),'mr33') 'r#12':"s Total"
056 printer off
--
You received this message because you are subscribed to
the "Pick and MultiValue Databases" group.
To post, email to: mvd...@googlegroups.com
To unsubscribe, email to: mvdbms+un...@googlegroups.com
For more options, visit http://groups.google.com/group/mvdbms