Non-consuming linear stream

109 views
Skip to first unread message

Dambaev Alexander

unread,
Jul 9, 2022, 5:26:08 AM7/9/22
to ats-lan...@googlegroups.com
Hi,

I had tried to implement function of type:

```
fun
  {a:t0p}
  isPrime
  ( xs: !stream_vt( int)
  , x: int
  ):<cloptr1>
  bool
```

Ie, I want to force evaluation of some portion of a stream, but I need to preserve it for a later use. 

I had tried to make a similar version:
```
fun
  {a:t0p}
  isPrime
  ( xs: stream_vt( int)
  , x: int
  ):<cloptr1>
  ( stream_vt( int), bool)
```

but failed as well, so I decided to ask for a direction if someone had tried to do similar stuff

Hongwei Xi

unread,
Jul 9, 2022, 7:11:49 AM7/9/22
to ats-lan...@googlegroups.com
By looking at your first version, my simple answer is that a linear stream cannot be used
in this way. The second version is possible but I am not sure what you wanted to do exactly.
If you show me how to use a non-linear stream to do it, then I could probably say more.

--
You received this message because you are subscribed to the Google Groups "ats-lang-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ats-lang-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ats-lang-users/CAHjn2KwFq7JH%2BiZE7bWCJ_L7oZ38K-kmGBFii7DZsdxWLDsGmg%40mail.gmail.com.

Dambaev Alexander

unread,
Jul 9, 2022, 12:50:11 PM7/9/22
to ats-lan...@googlegroups.com
My initial motivation was this Haskell source code: https://github.com/dambaev/mobt2/blob/master/haskell/app/Main.hs which is using lazy list (of course) and recursive binding and I decided to check if it will be possible to get something similar

ATS version using non-linear stream is here: https://github.com/dambaev/mobt2/blob/master/ats2/src/TEST/test1.dats , but it takes to much memory as `stream_take_while` duplicates data, as I have got, that datatype constructor can't be unfolded with `@stream_cons` pattern match

There is another version, that generates primes with non-linear stream and then, converting it to linear stream https://github.com/dambaev/mobt2/blob/master/ats2/src/TEST/test3.dats . This is the closest version to haskell's one. but still is using more space and as twice as slow as Haskell, so I had started to think of how to eliminate intermediate data structure.

So, not a production issue, hehe, just found an interesting topic to dig in :)

сб, 9 июл. 2022 г. в 11:11, Hongwei Xi <gmh...@gmail.com>:

gmhwxi

unread,
Jul 9, 2022, 5:33:43 PM7/9/22
to ats-lang-users
I took a look. Your ATS code looks very fine to me.

I did some profiling on my own. In my trials, your ATS code is actually a lot faster than
the Haskell code you posted. Note that my ghc version is very old: GHC 7.10.3

###### ATS ######

(* Your code using non-linear stream *)
hwxi@hongwei-t440p:/tmp$ patscc -O2 -o primes -DATS_MEMALLOC_LIBC primes.dats
hwxi@hongwei-t440p:/tmp$ time ./primes    
1077871

real    0m3.118s
user    0m3.064s
sys     0m0.048s

###### Haskell ######
(* Your haskell version *)
(*
Glasgow Haskell Compiler, Version 7.10.3, stage 2 booted by GHC version 7.10.3
*)

hwxi@hongwei-t440p:/tmp$ ghc -O2 primes.hs
Linking primes ...
hwxi@hongwei-t440p:/tmp$ time ./primes    
1077871

real    0m8.195s
user    0m8.152s
sys     0m0.040s

################

###### ATS ######
(*My version using linear stream that is based on yours*)

hwxi@hongwei-t440p:/tmp$ patscc -O2 -o primes2 -DATS_MEMALLOC_LIBC primes2.dats
hwxi@hongwei-t440p:/tmp$ time ./primes2    
1077871

real    0m2.120s
user    0m2.092s
sys     0m0.000s

gmhwxi

unread,
Jul 9, 2022, 10:21:27 PM7/9/22
to ats-lang-users
Here is a version I wrote in ATS3:


Currently, I can only compile the ATS3 code to JS. The generated JS code runs about 10 times slower
than the C code generated from compiling a comparable ATS2 implementation:

|thePrimes2| = 1077871

real    0m23.060s
user   0m23.380s
sys     0m0.188s

Dambaev Alexander

unread,
Jul 9, 2022, 10:37:03 PM7/9/22
to ats-lan...@googlegroups.com
I will check ATS3 version, meanwhile GHC 8.10 produces much more optimised code:

```
[nix-shell:/data/devel/mobt2/haskell]$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.10.7
[nix-shell:/data/devel/mobt2/haskell]$ ghc -O2 --make app/Main.hs -o haskell
[1 of 1] Compiling Main             ( app/Main.hs, app/Main.o )
Linking haskell ...
[nix-shell:/data/devel/mobt2/haskell]$ $(which time) ./haskell
1077871
1.04user 0.00system 0:01.05elapsed 100%CPU (0avgtext+0avgdata 72140maxresident)k
0inputs+0outputs (0major+17298minor)pagefaults 0swaps
```
in comparison to ATS2 versions from repo above:

```
[nix-shell:/data/devel/mobt2/ats2/src]$ make
patscc -O2 -DATS_MEMALLOC_LIBC -I"../libs/" -o test3   TEST/test3.dats
/run/current-system/sw/bin/time ./test3
1077871
1.11user 0.01system 0:01.12elapsed 100%CPU (0avgtext+0avgdata 102456maxresident)k
0inputs+0outputs (0major+25327minor)pagefaults 0swaps
patscc -O2 -DATS_MEMALLOC_LIBC -I"../libs/" -o test1   TEST/test1.dats
/run/current-system/sw/bin/time ./test1
1077871
2.11user 0.03system 0:02.14elapsed 99%CPU (0avgtext+0avgdata 203448maxresident)k
0inputs+0outputs (0major+50589minor)pagefaults 0swaps
```
ie, non-linear version (test1) is twice slower and using 2x memory of test3 (which converts stream into stream_vt). But still, haskell version is faster and using less memory :)

вс, 10 июл. 2022 г. в 02:21, gmhwxi <gmh...@gmail.com>:

gmhwxi

unread,
Jul 9, 2022, 11:12:36 PM7/9/22
to ats-lang-users

I see.

By the way, when I tried test1 and test3 on my machine, the latter is only about 10% faster than the former.

It should be easy to have a version in ATS that beats Haskell:

You first produce a non-linear thePrimes. Then you produce a linear thePrimes2 (where isPrime uses thePrimes).
In this way, the memory foot print should be very small. And I bet the running time is much faster. Try 2^28 or 2^30 :)

Dambaev Alexander

unread,
Jul 11, 2022, 7:20:09 AM7/11/22
to ats-lan...@googlegroups.com
I had checked some more and was able to make a stateful numbers generator using list_vt for accumulating of evaluated results here https://github.com/dambaev/mobt2/blob/master/ats2/src/TEST/test5.dats but for some reason, it takes forever to complete with default constraint <= g0int_npow( 2, 24) using linear streams

I guess is due to list_vt_extent, which is O(n)

I was trying to pass pointer to a tail of the list_vt data to pass through auxmain/auxmain_con to make extend to be O(1), but I was not able to satisfy type checker :) So I decided to ask if it is possible to use such pointer/hole to the tail in the environment of $ldelay?

вс, 10 июл. 2022 г. в 03:12, gmhwxi <gmh...@gmail.com>:

Hongwei Xi

unread,
Jul 11, 2022, 8:46:02 AM7/11/22
to ats-lan...@googlegroups.com
There is 'qlist' in libats that does what you want.

By using the two-stream approach I mentioned earlier, you can readily get to 2^30:

54400028
real    18m47.053s
user    18m46.372s
sys     0m0.424s

(*
This info is for the number of primes up to 2^28:

ATS2:
14630843
real    2m35.548s
user    2m35.532s
sys     0m0.000s

Haskell (GHC-7.10.3)
14630843
real    7m41.733s
user    7m36.208s
sys     0m1.688s
*)

I could not get the haskell implementation to finish for 2^30. I had to stop its running once
its memory usage reached 25% after about 10 minutes (I was monitoring using 'top').


gmhwxi

unread,
Jul 11, 2022, 7:02:25 PM7/11/22
to ats-lang-users

I put my ATS2 code here:


I will put some code similar to what you did in test5 later. I think that test5-style
is going to be faster for a relative small input (e.g., 2^24) but it may not be able to
beat test02 for larger input (e.g., 2^30).

Dambaev Alexander

unread,
Jul 12, 2022, 2:54:34 AM7/12/22
to ats-lan...@googlegroups.com
Hi,

I had replaced list_vt with qlist and now it is the most efficient version I could get now:
```
patscc -O2 -DATS_MEMALLOC_LIBC -I"../libs/" -o test7   TEST/test7.dats
/run/current-system/sw/bin/time ./test7
1077871
0.84user 0.01system 0:00.85elapsed 100%CPU (0avgtext+0avgdata 35108maxresident)k
0inputs+0outputs (0major+8486minor)pagefaults 0swaps
```
in comparison to haskell's:
```
1.16user 0.02system 0:01.18elapsed 99%CPU (0avgtext+0avgdata 74492maxresident)k
0inputs+0outputs (0major+17827minor)pagefaults 0swaps
```
I have to note, that test4 is more correct than test3: it is the same, except that it is using GC to collect stream()
```
patscc -O2 -DATS_MEMALLOC_GCBDW -lgc -o test4   TEST/test4.dats
/run/current-system/sw/bin/time ./test4
1077871
2.13user 0.03system 0:02.13elapsed 101%CPU (0avgtext+0avgdata 97344maxresident)k
```
which is expected to use less memory, but increases time to GC (as expected as well), but, technically, this version is the closest to haskell's one.

for 2^30 constraint:
```
patscc -O2 -DATS_MEMALLOC_LIBC -I"../libs/" -o test7   TEST/test7.dats
/run/current-system/sw/bin/time ./test7
54400028
227.79user 0.63system 3:48.43elapsed 99%CPU (0avgtext+0avgdata 1701428maxresident)k
0inputs+0outputs (0major+425064minor)pagefaults 0swaps
```
haskell version with GHC8.10:
```
54400028
278.48user 0.62system 4:38.75elapsed 100%CPU (0avgtext+0avgdata 3016564maxresident)k
0inputs+0outputs (0major+753319minor)pagefaults 0swaps
```
and test4
```
patscc -O2 -DATS_MEMALLOC_GCBDW -lgc -o test4   TEST/test4.dats
/run/current-system/sw/bin/time ./test4
54400028
704.39user 1.81system 11:46.22elapsed 99%CPU (0avgtext+0avgdata 4966216maxresident)k
0inputs+0outputs (0major+1241156minor)pagefaults 0swaps
```
I am wondering why test4 performs quite poor(I guess, Boehm GC is just too general/unoptimized), because, I had expected, that GC should free stream_con() instances as soon as they will be converted into stream_vt_con() by stream_t2vt... That is what haskell's GC is doing with intermediate data. I wish I was able to free stream_con() instance forcibly in stream_t2vt(). In this case, I expect it will be on par with haskell (I assume, that qlist is implemented as a flat array (I haven't actually checked, hehe))

Meanwhile: 
1. I had started this topic, because Haskell's source code is a good example of composability of lazy evaluation: it is short, but quite powerful as I had spent quite some time to produce an outperforming version which is more difficult as well. Basically, I had to implement memoization using qlist() which is built-in into Haskell's runtime;
2. now I have an example of stateful lazy stream processing in ATS2 and it was very interesting as I had the impression that it is hard in ATS to keep an implicit state (as resource usage is harder without GC);
3. I have a food for thoughts, because, even if Haskell's version is short but powerful, lazy-by-default evaluation has a side-effect of having to dig and decide what is worse in particular case: code being too lazy or code being too strict. Just a simple example is a `foldr` implementation (https://downloads.haskell.org/~ghc/latest/docs/html/libraries/base-4.16.1.0/src/Data.Foldable.html#foldr): which requires some time to understand how it is being done. Also it is hard to not to meet https://downloads.haskell.org/~ghc/latest/docs/html/libraries/deepseq-1.4.6.1/Control-DeepSeq.html for forcing evaluation. So Haskell hides laziness but requires understanding how to force evaluation and marshall data to fit GC's expectations. ATS on its turn, exposes laziness primitives and requires understanding how to use laziness. For me, it seems that it is a more fair approach: if you don't use laziness, then you don't pay for it, unlike with haskell: if you write a strict code, you will still need to be aware of laziness/GC/marshalling and etc...;
4. I was missing `qlist_free` function;
5. it is still interesting if there a way how to improve test4: I understand, that having stream_free() will be impossible, as stream() is a datatype and type checker will not be able to protect from double free, as well as having possibility to have a recursive linear stream like this:
```
val rec ones: stream_vt(int) = $ldelay( stream_vt_cons( 1, ones))
```
will probably won't help as well, as it will require a possibility to forcibly walk through a stream without consuming it, which is hard due to requirement of usage of $ldelay....
So it looks like some Boehm GC optimization is the correct answer... Or, maybe it means that linear streams and qlist is the "only" correct version for such task... Especially, as we already know that older GHC version produces worse results

I hope that the topic of lazy streams in ATS is interesting for others :)



пн, 11 июл. 2022 г. в 23:02, gmhwxi <gmh...@gmail.com>:

Hongwei Xi

unread,
Jul 12, 2022, 8:44:16 AM7/12/22
to ats-lan...@googlegroups.com
Thanks for this very interesting example!

I did my test03 that more or less matches your test07:


Yes, it is significantly more efficient (time-wise) than both of test01
and test02 that I did earlier. However, it is not memory-efficient. You
can find some measurements in the source code.

A qlist is essentially a list. To free it, you can take out the list and free
it and then free the (empty) qlist. Please see my test03.dats. Also, if
you use valgrind on test03, you can see that all the allocated blocks
are freed at the end (except the one for the closure passed to the higher-order
function stream_vt_take_while):

==23360== Memcheck, a memory error detector
==23360== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==23360== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==23360== Command: ./test03
==23360==
nprime(1048576) = 82025
==23360==
==23360== HEAP SUMMARY:
==23360==     in use at exit: 16 bytes in 1 blocks
==23360==   total heap usage: 328,108 allocs, 328,107 frees, 6,563,160 bytes allocated
==23360==
==23360== LEAK SUMMARY:
==23360==    definitely lost: 16 bytes in 1 blocks
==23360==    indirectly lost: 0 bytes in 0 blocks
==23360==      possibly lost: 0 bytes in 0 blocks
==23360==    still reachable: 0 bytes in 0 blocks
==23360==         suppressed: 0 bytes in 0 blocks
==23360== Rerun with --leak-check=full to see details of leaked memory
==23360==
==23360== For counts of detected and suppressed errors, rerun with: -v
==23360== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

For efficiency, it is important to minimize the use of non-linear streams.
For instance, after seeing your test04, my immediate "solution" is to use primes()
to implement primes_vt() and then uses primes_vt() to generate a linear stream of
prime numbers.

Again, thanks for the example. I hope that ATS3 will be a lot more convenient to use
than ATS2 while being at least equally performant both time-wise and memory-wise :)

Cheers!

--Hongwei


Hongwei Xi

unread,
Jul 12, 2022, 8:51:05 AM7/12/22
to ats-lan...@googlegroups.com
By the way, qlist is not flat; it is based on list. To have a flat one, you can use dynarray
(which is like vector in C++). I suppose that dynarray can further shorten the running time.

Dambaev Alexander

unread,
Jul 14, 2022, 8:20:39 AM7/14/22
to ats-lan...@googlegroups.com
I had added test8 with dynarray instead of qlist:

```
patscc -O2 -DATS_MEMALLOC_LIBC -I"../libs/" -o test8   TEST/test8.dats
/run/current-system/sw/bin/time ./test8
1077871
0.82user 0.00system 0:00.82elapsed 100%CPU (0avgtext+0avgdata 9236maxresident)k
0inputs+0outputs (0major+3172minor)pagefaults 0swaps
patscc -O2 -DATS_MEMALLOC_LIBC -I"../libs/" -o test7   TEST/test7.dats
```
for 2^30 constraint:
```
patscc -O2 -DATS_MEMALLOC_LIBC -I"../libs/" -o test8   TEST/test8.dats
/run/current-system/sw/bin/time ./test8
54400028
223.48user 0.18system 3:43.70elapsed 99%CPU (0avgtext+0avgdata 263012maxresident)k
0inputs+0outputs (0major+118738minor)pagefaults 0swaps
```

So a bit faster, but I was expecting that qlist should have more cache misses, so I expected a bit more improvement. But anyway)

вт, 12 июл. 2022 г. в 12:51, Hongwei Xi <gmh...@gmail.com>:

gmhwxi

unread,
Jul 15, 2022, 12:13:45 PM7/15/22
to ats-lang-users
Memory-wise, this one (test8) is actually much more efficient than test7.
If you use int64 and try 2^32 or something, then I think test8 is likely to perform
significantly better. Once memory thrashing happens, it is no longer an efficiency
issue; it really becomes a "survival" issue :)

My rules of thumb for writing efficient programs are essentially (1) smaller memory
footprint and (2) flatter data structures and (3) whole program compilation (inlining).
In ATS, we have linear types for (1) and templates for (2) and (3).
Reply all
Reply to author
Forward
0 new messages