
expressive iteration with macros


Kaz Kylheku

Apr 23, 2022, 10:43:47 PM
$ cppawk '
#include <iter.h>
#include <cons.h>

#define NAME $1
#define UID $3

BEGIN {
  FS = ":"
  loop (records("/etc/passwd"),
        maximizing(max_uid, UID),
        argmax(longest_name_uid, UID, length(NAME)),
        argmax(longest_name, NAME, length(NAME)),
        from(line, 1),
        counting(roots, UID == 0),
        argmax(longest_name_line, line, length(NAME)),
        if (NAME ~ /^r/, collect(start_with_r, NAME)))
    ; // empty
  print "highest observed UID:", max_uid
  print "number of superuser aliases:", roots
  print "UID with longest name:", longest_name_uid
  print "line in file with longest name:", longest_name_line
  print "longest name:", longest_name
  print "names starting with 'r':", sexp(start_with_r)
}'
highest observed UID: 65534
number of superuser aliases: 1
UID with longest name: 123
line in file with longest name: 43
longest name: gnome-initial-setup
names starting with r: ("root" "rtkit")
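
For readers without cppawk, the clauses above correspond roughly to the hand-written bookkeeping below in plain awk. This is only a sketch, not cppawk's macro expansion: the function name passwd_stats and the variable best_len are made up here, ties are resolved in favor of the first record (the post does not say how argmax breaks ties), and the collected names are joined into a space-separated string rather than a list.

```shell
# Hypothetical helper: plain-awk bookkeeping similar to the loop clauses.
# Reads passwd-format records from the named files, or from stdin if none.
passwd_stats() {
    awk -F: '
    {
        name = $1; uid = $3
        if (NR == 1 || uid + 0 > max_uid)          # maximizing(max_uid, UID)
            max_uid = uid + 0
        if (NR == 1 || length(name) > best_len) {  # the three argmax clauses
            best_len = length(name)
            longest_name = name
            longest_name_uid = uid
            longest_name_line = NR                 # from(line, 1) is NR here
        }
        if (uid == 0)                              # counting(roots, UID == 0)
            roots++
        if (name ~ /^r/)                           # collect(start_with_r, NAME)
            start_with_r = (start_with_r == "" ? "" : start_with_r " ") name
    }
    END {
        print "highest observed UID:", max_uid
        print "number of superuser aliases:", roots + 0
        print "UID with longest name:", longest_name_uid
        print "line in file with longest name:", longest_name_line
        print "longest name:", longest_name
        print "names starting with r:", start_with_r
    }' "$@"
}

# Small made-up sample instead of a real /etc/passwd:
printf '%s\n' 'root:x:0:0::/root:/bin/sh' 'daemon:x:1:1::/:/bin/false' \
    'rtkit:x:118:125::/proc:/bin/false' 'nobody:x:65534:65534::/:/bin/false' |
    passwd_stats
```

The point of the loop macro is that each of these hand-maintained variables and conditions collapses into one declarative clause.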


--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Kpop 2GM

May 28, 2022, 7:42:57 AM
so you implemented something resembling the functionality of SQL SELECT statement GROUP BY ?

Kaz Kylheku

May 28, 2022, 12:43:19 PM
On 2022-05-28, Kpop 2GM <jason....@gmail.com> wrote:
> so you implemented something resembling the functionality of SQL
> SELECT statement GROUP BY ?

No such thing appears in the example you replied to, so funny you should
mention it; but in fact I have a group_by function in the <array.h>
header, which is still undocumented.

https://www.kylheku.com/cgit/cppawk/tree/cppawk-include/array.h

I don't know SQL, but this is like the group-by function you
find in some dynamic programming languages.

Here is a quick demo. First, a background warmup. Let's write
an unconditional action which builds a list of cons cell pairs
made from fields $1 and $2, pushing them onto the lst variable:

./cppawk '
#include <cons.h>
#include <array.h>
{ push(cons($1, $2), lst) }
END { print sexp(lst) }'
a 1
a 2
a 3
b 1
a 4
c 2
c 3
[Ctrl-D][Enter]
(("c" . 3) ("c" . 2) ("a" . 4) ("b" . 1) ("a" . 3) ("a" . 2) ("a" . 1))

OK, now let's introduce group-by:

./cppawk '
#include <cons.h>
#include <array.h>
#include <fun.h>
{ push(cons($1, $2), lst) }
END { group_by(fun(car), lst, arr);
      for (i in arr) print i, sexp(arr[i]) }'
a 1
a 2
a 3
b 1
a 4
c 2
c 3
[Ctrl-D][Enter]
a (("a" . 4) ("a" . 3) ("a" . 2) ("a" . 1))
b (("b" . 1))
c (("c" . 3) ("c" . 2))

group_by has populated the array arr with keys a, b, c,
each one tied to a list of those cons pair items which
have that key.

group_by(fun(car), lst, arr) means: for each item x in the list lst,
apply the car function to extract the key k as if by k = car(x).
Then collect the item x into a list that is specific to k.
Each such collected list then appears as arr[k] in the array.
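
Without Gawk's indirect functions, the same grouping idea can be approximated in portable awk by hard-coding the key extraction. The sketch below is an illustration, not cppawk code: the name group_by_first is made up, the key is always field 1, and since plain awk arrays cannot hold lists, each group is a space-separated string of the $2 values in input order (the cppawk demo above, built with push, lists items newest first).

```shell
# Hypothetical helper: group the second field by the first field.
group_by_first() {
    awk '
    {
        if ($1 in groups) {
            groups[$1] = groups[$1] " " $2   # append to an existing group
        } else {
            order[++n] = $1                  # remember first-seen key order
            groups[$1] = $2                  # start a new group
        }
    }
    END {
        for (i = 1; i <= n; i++)
            print order[i], groups[order[i]]
    }' "$@"
}

printf '%s\n' 'a 1' 'a 2' 'a 3' 'b 1' 'a 4' 'c 2' 'c 3' | group_by_first
# a 1 2 3 4
# b 1
# c 2 3
```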

Kpop 2GM

May 28, 2022, 4:36:31 PM
impressive library indeed. i took a look at your Git tree.

I guess i come from a completely different angle in terms of adding features to awk - i made mine to be

- all still at scripting level,
- make close to zero amount of external calls (other than benchmarking utility - only mawk2 gives me sub-second timestamps, the rest i need to go to gnu-date)

- the same code base being able to run on all of at least 4 variants of awk that i have (so it can't leverage any of the extra goodies from gawk, and i have to devise equivalent ones),

that includes having them self-identify which awk variant it's running on, fingerprinted entirely from intrinsic behavior of that awk that cannot be tricked by setting a variable (at the shell, in awk, or in a file) or by naming the binary differently in the directory, nor does it rely on what it says at ARGV[ 0 ]

- need my functions to be able to account for nuisances and caveats of each awk variant, and have a single unified function that can work around their unique weaknesses (like the stupid 2^31-1 limit of mawk 1.3.4), and

- regardless of locale setting, i designed my library such that byte-mode awks are fully unicode aware, while gawk in unicode mode can handle any arbitrary combination of unsafe bytes, and process them without incurring any warning messages (without needing to force suppress them) -

i even managed to have it stay entirely within gawk unicode mode, not make any external calls, and base64 decode out an mp3 file byte-for-byte, entirely from my own library's features.

Kaz Kylheku

May 28, 2022, 9:18:01 PM
On 2022-05-28, Kpop 2GM <jason....@gmail.com> wrote:
> I guess i come from a completely different angle in terms off adding on features to awk - i made mine to be
>
> - all still at scripting level,
> - make close to zero amount of external calls (other than benchmarking
> utility - only mawk2 gives me sub-second timestamps, the rest i need
> to go to gnu-date)

cppawk is a shell script, and calls the preprocessor and awk; but you
can capture the preprocessor output and then you just have an awk script
you can pass to awk; basically you can use it like a "compiler" to
produce a single "executable" out of one or more files and/or a command
line program.

That would be a use case when preparing something for an embedded
system, where you might not want the preprocessor, or if you
don't want the preprocessing overhead each time you run some
frequently run program.

> - same code base being able to all run from at least 4 variants of awk
> that i have (so it can't leverage any of the extra goodies from gawk,
> and i have to devise equivalent ones),

Not only do I have that, but in cppawk you can test which Awk you're
using at preprocessing time:

#if __gawk__
...
#else
...
#endif

That is, you can use #if to test which Awk you're running on. There are
command line options to tell cppawk which Awk to generate code for and
execute.

One big exception to portability is Gawk indirect functions which
group_by depends on. So group_by will not be available if you don't
have Gawk.

Take a look at <case.h>; it provides a portable case statement syntax
that becomes switch if you have Gawk, or else portable code for other
Awks.

Indirect function stuff failing on mawk:

$ ./cppawk --awk=mawk '
#include <cons.h>
#include <array.h>
#include <fun.h>
{ push(cons($1, $2), lst) }
END { group_by(fun(car), lst, arr);
      for (i in arr) print i, sexp(arr[i]) }'
In file included from ./cppawk-include/fun.h:32:0,
from <stdin>:4:
./cppawk-include/fun-priv.h:40:2: warning: #warning "<fun.h> requires an Awk with function indirection like newer GNU Awk" [-Wcpp]
#warning "<fun.h> requires an Awk with function indirection like newer GNU Awk"
^~~~~~~
mawk: /dev/fd/63: line 835: function group_by never defined

Things not requiring <fun.h> are good to go, though.

Kpop 2GM

May 29, 2022, 8:26:16 AM

> > - same code base being able to all run from at least 4 variants of awk
> > that i have (so it can't leverage any of the extra goodies from gawk,
> > and i have to devise equivalent ones),
> I not only have that, but you in cppawk you can test which Awk you're
> using at preprocessing time:
>
> #if __gawk__
> ...
> #else
> ...
> #endif
>
> can test using #if which Awk you're running on. There are command
> line options to tell cppawk which Awk to generate code for and execute.


i actually meant it as being able to tell whether gawk was invoked with -c flag or -n flag or -P flag or -M flag , multiply all that by unicode-ness - without relying on looking at the invocation call, or peek at "ps" output.

e.g. my system has

awk is /usr/local/bin/awk
awk is /usr/bin/awk
awk is /opt/homebrew/bin/awk

nawk is /usr/local/bin/nawk
gawk is /usr/local/bin/gawk
gawk is /opt/homebrew/bin/gawk

mawk is /usr/local/bin/mawk
mawk is /opt/homebrew/bin/mawk
mawk2 is /usr/local/bin/mawk2

#1 and #3 are aliases to gawk, but #2 is for nawk, so i clearly can't rely on just the binary name alone. nawk is close to pure junk at this point, its ONLY usefulness being a debugging interface like no other.

An example of where detection matters is measuring array length. Gawk -P mode (posix) disables length(array), so calling it like that errors out. So my library auto-detects gawk -P and routes it to counting with a for-loop, but since the looping method is very slow, i send every other invocation combo to just length(array).
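
The for-loop fallback described here could look like the sketch below. The function name count and the wrapper array_count_demo are made up for illustration; the library's real code is not shown in this thread, and the dispatch to length(array) on other awks is left out.

```shell
# Hypothetical demo: count array elements without length(array).
array_count_demo() {
    awk '
    # Portable element count; k and n are locals.
    function count(arr,    k, n) {
        n = 0
        for (k in arr)
            n++
        return n
    }
    BEGIN {
        split("alpha beta gamma delta", t, " ")
        print count(t)
    }'
}

array_count_demo    # prints 4
```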

to detect just gawk -P, i came up with this strange test that only hits for gawk -P and no other

function _testawk_util8(){ # only gawk -P
return \
("x\4"<"\x4")
}
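
One plausible reading of why this test works (an interpretation, not something stated in the post): the gawk manual says that \x escape sequences are not recognized in POSIX mode, so there "\x4" would be the two characters x and 4, and "x\4" sorts before it; elsewhere "\x4" is the single byte 0x04 and the comparison is false. A small wrapper, with the made-up name probe_hex_escape, lets the expression be tried against any awk command line:

```shell
# Hypothetical wrapper: run the probe under the given awk command line,
# e.g. "probe_hex_escape gawk -P" or "probe_hex_escape mawk".
probe_hex_escape() {
    "$@" 'BEGIN { print ("x\4" < "\x4") }'
}

probe_hex_escape awk
```

The printed value is 1 where the probe hits and 0 otherwise; which awks land on which side is exactly what the fingerprint relies on, so no output is claimed here.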

For now, i could detect these splits properly :

## gawk -e |- 06 gawk -ne |- 01 gawk -nMbe |- 93
## gawk -be |- 90 gawk -nbe |- 85 mawk1 -- |- 29
## gawk -ce |- 49 gawk -Me |- 76 mawk2 -- |- 21
## gawk -cbe |- 33 gawk -Mbe |- 98 nawk[UTF8] |- 12
## gawk -Pe |- 39 gawk -nMe |- 09 nawk[byte] |- 11

i can't get the god damn mpfr extension to compile properly on M1, c'est la vie

Kpop 2GM

May 29, 2022, 11:04:47 AM
@Kaz : here's a quick illustration of what my library does (among others) - this was run on mawk2, which is completely unicode-blind in its own right :

- it could map hangul to latin letter syllables
- calculate CRC32 on it
- URL encoding, base64 encoding, and dump out the byte composition in octal
- emulate "xxd -ps" with a pure hex dump
- make it split $0 into individual UTF-8 characters (for something that's not UTF-8 aware)
- and show a fully decomposed view of them (per UTF-8 NFC/NFD setup) :

mawk2x 'BEGIN { OFS="\f" } { print NF, $0, $1, $NF, hangulk2e($NF); print crc32($0), urlencode($0), base64enc($0); print xxdps($0); print strdump($0) } { split0uc(); OFS="\f"; print NF, $0; NF=NF; print } { for(_^=_<_;_<=NF;_++) { print _,$_, ordC($_), decomposeHangul($_) } }' <<<'오복녀'

1
오복녀
오복녀
오복녀
오복녀=Oh Bok-Nyeo=Oh Bok-Nyeo 오복녀=
crc32#0x7A39C92A
%EC%98%A4%EB%B3%B5%EB%85%80
7Jik67O164WA
ec98a4ebb3b5eb8580
\354\230\244\353\263\265\353\205\200
3
오복녀



1

50724
U+C624:오:50724:S:6692:ᄋ:L:11:ᅩ:V:8:~_:_O_:~_:_sLV:T:0
2

48373
U+BCF5:복:48373:S:4341:ᄇ:L:7:ᅩ:V:8:B_:_O_:K_:ᆨ:T:1
3

45376
U+B140:녀:45376:S:1344:ᄂ:L:2:ᅧ:V:6:N_:YEO:~_:_sLV:T:0

Another different chunk of my library pertains to my own big int functions instead of relying on GMP, something like this

echo 127 | mawk2x '{ timerF(!(_______=___=__=$NF)); for(____=_^=_<_;_<11;) {__=pow(__,_+=1); print timerF(2,1),sprintf("%s ^ %6.f",_______,____*=_),(___=(___~"\\^")? (___)(" * ")_:___"^("_)")",anylog(__,2,14) } }'

0.0000260=127 ^ 2=127^(2)=13.97736937354433
0.0000600=127 ^ 6=127^(2 * 3)=41.93210812063299
0.0000910=127 ^ 24=127^(2 * 3 * 4)=167.72843248253198
0.0002110=127 ^ 120=127^(2 * 3 * 4 * 5)=838.64216241265990
0.0025120=127 ^ 720=127^(2 * 3 * 4 * 5 * 6)=5031.85297447595985
0.1265000=127 ^ 5040=127^(2 * 3 * 4 * 5 * 6 * 7)=35222.97082133172079
3.5375310=127 ^ 40320=127^(2 * 3 * 4 * 5 * 6 * 7 * 8)=281783.76657065376639

The slowness of it only starts to show when log2 is 125K+
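
The big int functions themselves are not shown in the thread. As a rough idea of how exact integer powers can be computed in plain awk without GMP, here is a minimal sketch that keeps the number as a decimal digit string and multiplies it by a small integer, schoolbook style. The names bigpow and bigmul_small are invented for this example, and this is almost certainly far simpler and slower than the library being described.

```shell
# Hypothetical helper: print base^n exactly, for small non-negative integers.
bigpow() {
    awk -v base="$1" -v n="$2" '
    # Multiply decimal digit string a by small integer m, digit by digit
    # from the right, carrying as in pencil-and-paper multiplication.
    function bigmul_small(a, m,    i, carry, d, out) {
        carry = 0
        out = ""
        for (i = length(a); i >= 1; i--) {
            d = substr(a, i, 1) * m + carry
            out = (d % 10) out
            carry = int(d / 10)
        }
        while (carry > 0) {
            out = (carry % 10) out
            carry = int(carry / 10)
        }
        return out
    }
    BEGIN {
        r = "1"
        for (i = 0; i < n; i++)
            r = bigmul_small(r, base)
        print r
    }'
}

bigpow 2 100    # prints 1267650600228229401496703205376
```

Every intermediate value stays well below 2^53, so ordinary awk doubles never lose precision; the cost is quadratic string work, which is where a real implementation would use larger digit chunks.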

Kaz Kylheku

May 29, 2022, 12:16:22 PM
On 2022-05-29, Kpop 2GM <jason....@gmail.com> wrote:
>
>> > - same code base being able to all run from at least 4 variants of awk
>> > that i have (so it can't leverage any of the extra goodies from gawk,
>> > and i have to devise equivalent ones),
>> I not only have that, but you in cppawk you can test which Awk you're
>> using at preprocessing time:
>>
>> #if __gawk__
>> ...
>> #else
>> ...
>> #endif
>>
>> can test using #if which Awk you're running on. There are command
>> line options to tell cppawk which Awk to generate code for and execute.
>
> i actually meant it as being able to tell whether gawk was invoked
> with -c flag or -n flag or -P flag or -M flag , multiply all that by

I could add support for that in cppawk. It parses all the options in
order to support a few cpp options, and a couple of its own. The
rest are passed to awk. I could have it recognize -c being passed
to gawk, to set some preprocessor symbol.

> unicode-ness - without relying on looking at the invocation call, or
> peek at "ps" output.

The advantage of having a preprocessing layer is that in many
situations it may be acceptable that the output of preprocessing just
*assumes* it is running on a certain brand of awk, of a certain
version, invoked in a certain way. Then you don't have any run-time
detection and switching overheads in the code.

I suspect that in many cases, the user of a portable Awk library
is actually just using one specific awk, and doesn't care whether
the preprocessed output works with other awks.

Or else if they do care about their code running on other awks also,
many users may be accepting of the limitations of doing it statically:
being able to generate efficient code that is tuned to a particular awk,
or else inefficient code that works with more awks, rather than one body
of code which switches at run-time.

Kpop 2GM

May 30, 2022, 4:09:37 PM
there's only one single user of that "portable library" - me

i wrote it for myself only.

it's not properly documented, it's not fully debugged, and as much as i tried to give it a best shot, i still have a handful of locations where i couldn't figure out the place i originally got the idea from, and give proper credit, so i refrained from sharing it in full for the sake of propriety

previously i wrote a small library specific to gawk features (mostly the true multi-dimensional array and unicode bits), and their built-in sorting.

so when i rewrote the entire library to account for all those awks, it took some creative thinking on how to circumvent it, but it's totally worth the effort, cuz now my mawks can perform unicode substring-ing sometimes even faster than gawk does

i also have no use-case at all for something like arabic, so i won't be wasting time implementing right-to-left text in unicode

but i *do* now have a module that allows typing in the 2 letter country code of any nation, and get back its emoji flag

but if you type the codes for russia or belarus, it force overrides it and outputs the Ukrainian flag instead =p

Kpop 2GM

Aug 4, 2022, 4:02:02 AM
@Kaz : here's what I meant by awk variant tester - the objective is simple -

get as many of them as possible to print out a different value for the exact same function call, and use their differences to create indicator flags in order to properly account for that behavior in the code.

The example isn't perfect because some still print out the same values, but it already covers a wide swath :

Using this code, one could automate their testing on multiple awk variants, with a built-in 5-minute auto-timeout for each variant if the testing still hasn't finished by then.

The first test resulting in 0/1 is only true whenever GMP is invoked (because it prints out "-nan" instead of "+nan"/"nan" for everything else).
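
The first test can be tried in isolation with a stripped-down probe. This is a simplification of the expression in the command below, with the invented wrapper name nan_probe; it prints the two NaN spellings that the full test compares, and the exact spelling (NAN, -NAN, +NAN) depends on the awk and the platform.

```shell
# Hypothetical wrapper: show how a given awk spells the two NaN values,
# e.g. "nan_probe gawk -M" versus "nan_probe gawk".
nan_probe() {
    "$@" 'BEGIN { inf = -log(0); print toupper(-inf / inf), toupper(inf / inf) }'
}

nan_probe awk
```

When the two printed fields differ, the first test yields 1; when they match, it yields 0.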

The second test incorporates many of their bespoke nuances, each of which changes the exponent applied to a base of 127:

=============================

cmd='function ____(_,__,___) { __="\333\222"; ___=(toupper(-(_=-log(_<_))/_) != toupper(_/_)); ___=___ "......" sprintf("%c%c%c",12,8,8); ___=(___)\
((_=(_+=_^=_=_<_)+(++_+--_)^++_)^((_%100)+length(__)+(0x4)+8*(sprintf("%u",3E10)%2)+("0x10")+32*("x\4"<"\x4")+64*(__~"[^"(__)"]"))); return ___ } BEGIN { print ____() }'; echo "\n\n code tested ::\n\n$( gawk -o- -e "${cmd}" | mawk 'sub("^",(_)_)^_+gsub(/\14/,"\\f")+gsub(/\13/,"\\v")+gsub(/\11/,"\\t")+gsub(/\10/,"\\b")+gsub(/\7/,"\\a")+gsub(/\15/,"\\r")+gsub(/\33/,"\\33")+gsub(/\34/,"\\34")+gsub(/\177/,"\\177")+gsub(/\0/,"\\0")' FS='^$' \_=' ' | mawk 3 FS='^$' RS='(\n )+\n' ORS='\n\n' | gsed -zE 's/ ([+<*/^>=%-][=]|[/^*=~]) /\1/g' | mawk 'gsub(/\t/,(_=" ")_ (_)_)+gsub((_)_," \140")^_+gsub(/\t/,"\140 \140 ")' ORS= RS='^$' FS='^$' )\n"; for idx in 1 ; do for awk0 in gawk nawk mawk mawk2 ; do for flg in $( <<< "${awk0}" mawk '{ print (($_)=="gawk") ? "te Mte e b Se Sbe ce cbe ne nbe Me Mbe nMe nMbe Pe MPe " : "-" }' ); do timeout --foreground 300 printf ' %-6s -%-5s :: %s\n' "${awk0}" "${flg}" "$( eval " \"\${awk0}\" -\"\${flg}\" \"\${cmd}\" " )" ; done ; done ; done | lgp3 4 | gcat -b | mawk 'gsub(/\t/,(_=" ")_ (_)_)+gsub((_)_," \140")^_' RS='^$' FS='^$'


code tested ::

` `BEGIN {
` ` print ____()
` `}

` `function ____(_, __, ___)
` `{
` ` __="\333\222"
` ` ___=(toupper(-(_=-log(_ < _))/_) != toupper(_/_))
` ` ___=___ "......" sprintf("%c%c%c", 12, 8, 8)
` ` ___=(___) ((_=(_+=_^=_=_ < _) + (++_ + --_)^++_)^((_ % 100) + length(__) + (0x4) + 8*(sprintf("%u", 3E10) % 2) + ("0x10") + 32*("x" < "") + 64*(__~("[^" (__) "]"))))
` ` return ___
` `}

gawk: cmd. line:1: warning: `function' is not supported in old awk
gawk: cmd. line:1: warning: `toupper' is not supported in old awk
gawk: cmd. line:2: warning: operator `^=' is not supported in old awk
gawk: cmd. line:2: warning: operator `^' is not supported in old awk
gawk: cmd. line:2: warning: `return' is not supported in old awk
gawk: cmd. line:1: warning: `function' is not supported in old awk
gawk: cmd. line:1: warning: `toupper' is not supported in old awk
gawk: cmd. line:2: warning: operator `^=' is not supported in old awk
gawk: cmd. line:2: warning: operator `^' is not supported in old awk
gawk: cmd. line:2: warning: `return' is not supported in old awk

` ` 1 ` ` gawk ` -te ` `:: 0......
20975825942850833350709513021069607858612813027289575844399267446784
` ` 2 ` ` gawk ` -Mte ` :: 1......
20975825942850835435487946371038619427046538071435595461971578449921
` ` 3 ` ` gawk ` -e ` ` :: 0......
20975825942850833350709513021069607858612813027289575844399267446784
` ` 4 ` ` gawk ` -b ` ` :: 0......
2663929894742055667923408371469246315099621159900709173939115476910080

` ` 5 ` ` gawk ` -Se ` `:: 0......
20975825942850833350709513021069607858612813027289575844399267446784
` ` 6 ` ` gawk ` -Sbe ` :: 0......
2663929894742055667923408371469246315099621159900709173939115476910080
` ` 7 ` ` gawk ` -ce ` `:: 0......
80631397449585884603480471012312324983606434321474214952960
` ` 8 ` ` gawk ` -cbe ` :: 0......
10240187476097406396860348881012181757649990571271861294858240

` ` 9 ` ` gawk ` -ne ` `:: 0......
96067968254367458374461558004074452761005365968933701617309045473364966403804172795092679764568178688
` `10 ` ` gawk ` -nbe ` :: 0......
12200631968304669482593883986169010334579188647924663063729436788016311555400024520016975449783684562944
` `11 ` ` gawk ` -Me ` `:: 1......
20975825942850835435487946371038619427046538071435595461971578449921
` `12 ` ` gawk ` -Mbe ` :: 1......
2663929894742056100306969189121904667234910335072320623670390463139967

` `13 ` ` gawk ` -nMe ` :: 1......
96067968254367481276365696967779773877012643204635128378812984311724350641524683124538450810649569281
` `14 ` ` gawk ` -nMbe `:: 1......
12200631968304670122098443514908031282380605686988661304109249007588992531473634756816383252952495298687
` `15 ` ` gawk ` -Pe ` `:: 0......
7746094530492103116603130830240466614297156825794967015584458761805042780817132149091242541651768498124851548434645116915842513649999335390349056080566380658688
` `16 ` ` gawk ` -MPe ` :: 1......
1691310158431340276446001387259617718107722316775496316977546955395900025960587599577785437677679318686381164900905921415897601

` `17 ` ` nawk ` -- ` ` :: 0......
4.68994168836430963037851433017e+94
` `18 ` ` mawk ` -- ` ` :: 0......
3.17393e+111
` `19 ` ` mawk2 `-- ` ` :: 0......
4.50553e+195