On Monday, September 25, 2017 at 12:27:34 AM UTC-5,
minf...@arcor.de
wrote:
>
> Is this just experiments using laptop-generated series as
> proof-of-concept?
>
I'm interested in time-series databases. I looked around and didn't
find any "principles of time series data", "write your own tsdb from
scratch", etc. I found some installation guides basically, but neither
theory nor practice on how such databases actually work, what
performance or resource problems people run into with them, etc.
> In the SCADA world, time series management / compression / storage /
> math
> is quite another beast.
I'd hope that any practical use of time series data is a beast apart
from "dump all points" and "get the average of one value across all
points". Or do you mean that you couldn't use something like influxdb
for SCADA? Its capabilities are documented here:
https://docs.influxdata.com/influxdb/v1.3/
I don't know anything about SCADA though. How about, I'll tell you a
little bit about what I've been doing, and you can say how that
differs from things in the SCADA world.
For about six months out of this last year I was responsible for some
projects related to time-series data. This started out with my taking
over some completely custom webapps whose function was to consume
external monitoring and to, when monitoring said things were bad,
display a lot of red boxes so that people would start making phone
calls. Early on I adopted influxdb for storing time-series data and
grafana for displaying it.
There were two big sources of data: our internal monitoring system
(let's call it, to not offend anyone, Zabbix--because that's what it
is and Zabbix is great), and an external monitoring system (let's call
it, to not offend anyone, Cassandra--because they use Apache Cassandra
and their performance characteristics are 'interesting'). Zabbix and
Cassandra both monitored a really big server farm. Mainly Cassandra
would connect to two differently configured websites running on each
server and report how fast each site loaded from each regional
requester or, if a site failed to load, how long the outage lasted.
Zabbix meanwhile maintained thousands of items per server and would
alert on dozens of issues--a service not running, a service not
remotely accessible, a service appearing to have performance problems,
and so on.
I couldn't just straightforwardly get time-series data from either
system. For Zabbix I had a daemon that listened to the 'event
firehose', kept track of all active alerts, regularly reported ongoing
alerts to influxdb, and then reported again when they closed. In
influxdb's line format these alerts might look something like:
alert,server=big43.bigserver.com,dc=Rome,brand=BigHost,zabbix=zbx1,platform=Big--Shared--Europe,acked=null,description=Nginx\ doesn\'t\ like\ its\ configuration value=120,norm=0.0012 1506398311
So: add two values at that Unix epoch timestamp to the 'alert'
measurement in whatever database, the first named 'value' with a value
of 120 (seconds: how long the alert has lasted so far, or how long it
lasted in total); and the second named 'norm' with a value of 0.0012
(seconds: just value divided by the number of servers in the group,
here 1000). The rest of that line is tags. You could then ask influxdb
a question like
SELECT count(value) FROM alert WHERE dc=Rome AND time > now() - 7d GROUP BY time(1d), brand
and the result could be graphed, with one line per brand: how many
alerts per day did we see from this brand in the last week from the
datacenter in Rome? The graph would have time on the X axis, there
would be one point per day (per brand), and the lines would connect
those points.
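Assembling those line-protocol strings is mostly mechanical. Here's a
minimal sketch in Python of how such a point might be built--the
helper names are invented, and the real daemon did more (batching,
field type handling, etc.):

```python
def escape_tag(value):
    """Escape commas, spaces and equals signs in tag values,
    per the influxdb line-protocol escaping rules."""
    for ch in (',', ' ', '='):
        value = value.replace(ch, '\\' + ch)
    return value

def make_point(measurement, tags, fields, timestamp):
    """Build one line-protocol string:
    measurement,tag=v,... field=v,... timestamp"""
    tag_str = ','.join('%s=%s' % (k, escape_tag(str(v)))
                       for k, v in sorted(tags.items()))
    field_str = ','.join('%s=%s' % (k, v)
                         for k, v in sorted(fields.items()))
    return '%s,%s %s %d' % (measurement, tag_str, field_str, timestamp)

point = make_point(
    'alert',
    {'server': 'big43.bigserver.com', 'dc': 'Rome',
     'description': "Nginx doesn't like its configuration"},
    {'value': 120, 'norm': 0.0012},
    1506398311)
```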
For Cassandra, the external monitoring, I made API calls to ask them
for various reports, then massaged the data into influxdb. The
massaged data might look something like:
outage,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=1 1506397759
outage,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=1 1506398759
outage,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=1 1506398859
availability,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe value=99.98 1506398859
performance,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe--Cached,continent=NorthAmerica,state=VA value=120 1506398859
performance,server=big12.bigserver.com,dc=Rome,brand=BigHost,platform=Big--Shared--Europe,continent=NorthAmerica,state=VA value=330 1506398859
Outages are stored as a single value=1 with no duration: "something
happened at this time". Uptime is stored as a float in [0,100].
Performance is stored as milliseconds to page load, with the
continent= and state= tags indicating where the request *initiated*,
not the location of the server.
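The massaging itself was just reshaping. A sketch, with an invented
report structure (the real API's shape and field names differed):

```python
def massage(entry):
    """Turn one entry of a (hypothetical) external-monitor report
    into influxdb line-protocol strings like those above."""
    tags = 'server=%s,dc=%s,brand=%s,platform=%s' % (
        entry['server'], entry['dc'], entry['brand'], entry['platform'])
    lines = []
    # Each outage becomes a value=1 point at the time it happened.
    for ts in entry.get('outage_timestamps', []):
        lines.append('outage,%s value=1 %d' % (tags, ts))
    # Availability is one float point per reporting window.
    lines.append('availability,%s value=%.2f %d'
                 % (tags, entry['availability'], entry['timestamp']))
    return lines

entry = {'server': 'big12.bigserver.com', 'dc': 'Rome',
         'brand': 'BigHost', 'platform': 'Big--Shared--Europe',
         'outage_timestamps': [1506397759, 1506398759, 1506398859],
         'availability': 99.98, 'timestamp': 1506398859}
lines = massage(entry)
```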
One cute thing you could do with this data is grab all of the
performance data for all servers everywhere, and then demonstrate
that, regrettably, the speed of light is still in effect: most of our
stuff is in NA, therefore NA speeds are on average the best, followed
by Europe, then South America:
SELECT mean(value) FROM performance WHERE time > now() - 30d GROUP BY time(1d), continent
The purpose was mainly trending: when things turn bad, someone should
notice and ask questions about it. What is measured, improves. That
only ever needed the last month of data, mostly.
A secondary purpose: "showing that a thing was real". One part of the
company gets customer complaints on the 4th of July. Is this a big
fire, or can people still enjoy their holidays? Well, this graph shows
that yes, there's a huge spike in problems related to the system the
customers are complaining about; sorry. Again this only needs the most
recent data.
An uncommon purpose: "did we do well? / what the hell happened?" It's
noticed after the fact that some metric has slipped, or a team
introduced a new feature and would like to see what impact it made,
and someone gets to prepare, in May, a dashboard that compares some
numbers from February vs. those from March.
The most advanced math was turning uptime figures into periods of
time, and vice-versa. I could for example produce an 'Nginx
Configuration Uptime' for a platform, for a day or week, based on how
long the nginx configuration alerts lasted for servers on that
platform.
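That arithmetic is simple enough; a sketch of both directions for a
single reporting window:

```python
def downtime_seconds(availability_pct, window_seconds):
    """How many seconds of outage does an availability
    percentage imply over a given window?"""
    return window_seconds * (100.0 - availability_pct) / 100.0

def availability_pct(downtime_secs, window_seconds):
    """The inverse: summed alert durations over a window
    back to an uptime percentage."""
    return 100.0 * (1.0 - downtime_secs / window_seconds)

# 99.98% availability over one day is about 17.3 seconds down.
secs = downtime_seconds(99.98, 86400)
```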
Compression and storage issues didn't come up. Hundreds of millions of
points just don't take up enough space to care about. I was so
conservative with Cassandra's influxdb usage that an entire year of
it probably wouldn't take up 1G on the disk. But that's influxdb
handling things--maybe it actually compresses data on the disk, I
don't know.
Performance issues did come up, though, with influxdb regrettably
requiring infinite amounts of RAM to answer some queries. I was much
less conservative with Zabbix's use of influxdb. In the end every
single grafana panel that needed Zabbix data had to deal with, not the
raw measurements, but aggregate measurements built by influxdb
'continuous queries'.
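For reference, a continuous query in influxdb 1.x looks roughly like
this (the database and measurement names here are invented). It
downsamples in the background, so panels can query the small aggregate
series instead of the raw one:

```
CREATE CONTINUOUS QUERY "alert_counts_1h" ON "monitoring"
BEGIN
  SELECT count("value") INTO "alert_1h"
  FROM "alert"
  GROUP BY time(1h), *
END
```

The `GROUP BY time(1h), *` keeps every tag, so the aggregate series
can still be filtered by dc, brand, and so on.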
So, how's the SCADA world?