
AWK code repository request for help (long)


jh

Dec 30, 2008, 10:21:27 PM

Tim Menzies and I have been talking about an AWK code library system.

Parsing through all our emails, we seem to be discussing the following
options. In what follows, I am for the things marked (*) and Tim is for
the (+) items. We'll defend those positions, if necessary, in a week or two.

But, before debating the options, we'd like to know what all the options
are. So we post this list, not to argue any particular point, but to
canvass options from the community.

We'd like back, at least for now, not arguments for options, but the
options themselves.

the plan is this:

- post this to the newsgroup
- watch it to see what comments come back
- post a revised version in a week
- start the real debate then

over to you all!

Jim & Tim


--------

0) Repository name?

0a+) planetawk
0b) libawk
0c) address of the repository must be in the domain name that
corresponds to the chosen name
0d) address of the repository may be <repository name>.<name of hosting
service> or <hosting service full domain name>/<repository name>

1) goal

1a*+) an open source project with methods for promoting stable AWK code
to some special place,

1b*) and providing the kind of search and install features found in CPAN
(for Perl) and PEAR (for PHP).

--------
2) type of extensions

2a+*) scripting based, using the current version of gawk

2b*) scripting based, using any version of AWK

2c*) compiler-based requiring a new gawk executable (e.g. xgawk)

2d*) based on other languages (e.g. runawk is a C program)


--------
3) formatting standards

3a+) one liners, if short. e.g.

function some() { return 1/(rand() * Inf) }.

--------
4) dependency mechanism

4a) m4 and needz, e.g.
http://code.google.com/p/crusty/source/browse/trunk/needz-apps which
shows Tim's last attempt (2 years ago) to define a library system in gawk.

- there is a directory called needz-app/fred containing these files:
    needz-app/fred/about.awk   # command line args and help text
    needz-app/fred/fred        # main file. compilation starts here
    needz-app/fred/XXX         # support file
    needz-app/fred/YYY         # support file

- there is a directory called needz-app/fred/eg for unit tests:
    needz-app/fred/eg/1        # unit test 1
    needz-app/fred/eg/1.want   # expected output for unit test 1

4b) Debian package system

4c) CPAN or PEAR software, if it's available

--------
5) hosting service

5a+) code.google.com <http://code.google.com>
5b) sites.google
5c) github
5d) launchpad
5e) sourceforge
5f) savannah
5g) custom

--------
6) repository

6a+) subversion
6b) git
6c) bazaar
6d) cvs
6e) (s)ftp
6f) Web pages with upload
6g*) CPAN, PEAR or similar

--------
7) cross-indexing method to describe library (i.e. a way to describe
functions and collections of functions)

7a+) use scripts to auto-generate wiki pages in code.google.com
<http://code.google.com>

7b) documentation that's required for acceptance into the repository

7c) structured comments in the code ala JavaDoc

-------
8) coding standards

8a*) functions don't change globals directly

8b+) reduce use of /pattern/ {action} in favor of while(getline) loops
inside functions

8d+) use a[0] to store size of array

8e+) require all "local" (see 8g) variable names be lower case (use '_'
to separate words?)

8f*) allow naming variables with some version of Hungarian notation

8g*) require all function variables to be "localized" by including them
in the function parameter list

-------
9) define some standard macros (using m4)

9a+) tim's iteration trick

function fred(a,thing,   _2){
  foreach2(thing,a) {
    do something with thing }
}

expands to

function fred(a,thing,   max1,i,max2,j){
  max1=a[0]; for(i=1;i<=max1;i++){ max2=a[i,0]; for(j=1;j<=max2;j++)
  {thing=a[i,j];
    do something with thing }}
}
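
For comparison, the same double loop can be written by hand in plain awk, without m4. This is only a sketch: it assumes the 8d+ convention that a[0] holds the number of rows and a[i,0] holds the length of row i, and the names visit and foreach2_demo and the sample data are made up for illustration.

```sh
foreach2_demo() {
  awk '
    # Hypothetical helper: walk a two-level array stored as a[i,j],
    # where a[0] is the row count and a[i,0] is the size of row i.
    function visit(a,    max1, i, max2, j, thing) {
      max1 = a[0]
      for (i = 1; i <= max1; i++) {
        max2 = a[i,0]
        for (j = 1; j <= max2; j++) {
          thing = a[i,j]
          print i, j, thing    # "do something with thing"
        }
      }
    }
    BEGIN {
      a[0] = 2
      a[1,0] = 2; a[1,1] = "x"; a[1,2] = "y"
      a[2,0] = 1; a[2,1] = "z"
      visit(a)
    }'
}
foreach2_demo
```

The extra parameters after the wide gap are the loop locals that the macro would otherwise generate.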

10) Label for a collection of functions

10a*) package
10b*) module
10c+) gem

11) Extension and preprocessor standardization

11a) Standardize syntax for all extensions that change or add to the
(G)AWK standards; compliant changes made to all extension
programs/projects that provide the same feature(s)

11b) All preprocessor programs provide an option to turn them into
stdin-->stdout filters

11c*) With 11b, overlapping features in extension programs/projects get
"refactored" into separate programs or projects.

For example, only igawk pulls in external code files, then its
preprocessed code is, optionally, written to stdout for further
processing by other programs.

12) Function standardization

12a) Multiple functions that do the same thing get merged through some
agreed upon process.

ggrothendieck

Dec 31, 2008, 12:14:24 AM

On Dec 30, 10:21 pm, jh <jh...@mail.avcnet.org> wrote:
> Tim Menzies and I have been talking about an AWK code library system.

Another possibility, especially if the library is not too large, is
just to create a batteries included distro that bundles them all
into the executable.

Aharon Robbins

Dec 31, 2008, 1:17:36 AM

In article <0LOdnWfIfaiqeMfU...@neonova.net>,

jh <jh...@mail.avcnet.org> wrote:
>Tim Menzies and I have been talking about an AWK code library system.

Really cool! I have a few things that I've made available which I
would be happy to forward on for inclusion, as well as about 80
gazillion things just sitting in my inbox.

>Parsing through all our emails, we seem to be discussing the following
>options. In what follows, I am for the things marked (*) and Tim is for
>the (+) items. We'll defend those positions, if necessary, in a week or two.
>
>But, before debating the options, we'd like to know what all the options
>are. So we post this list, not to argue any particular point, but to
>canvas options from the community.
>
>We'd like back, at least for now, not arguments for options, but the
>options themselves.
>
>the plan is this:
>
>- post this newsgroup
>- watch it to see what comments come back
>- post a revised version in a week
>- start the real debate then
>
>over to you all!
>
>Jim & Tim

Very nice. Some comments below.

>0) Repository name?
>
>0a+) planetawk
>0b) libawk
>0c) address of the repository must be in the domain name that
>corresponds to the chosen name
>0d) address of the repository may be <repository name>.<name of hosting
>service> or <hosting service full domain name>/<repository name>

You may even want subdomains, such as

csv-parsing.repo-name.whatever.top-level
xml-parsing.repo-name.whatever.top-level
...

>1) goal
>
>1a*+) an open source project with methods for promoting stable AWK code
>to some special place,
>
>1b*) and providing the kind of search and install features found in CPAN
>(for Perl) and PEAR (for PHP).

Both are laudable and orthogonal.

>--------
>2) type of extensions
>
>2a+*) scripting based, using the current version of gawk
>
>2b*) scripting based, using any version of AWK

Both are fine, just keep them in separate areas, or clearly mark
each item as to whether it is POSIX compliant or requires gawk
or another version.

>3) formatting standards

I'm not sure you should mandate formatting standards; people have
their own styles. Or you can settle on the pretty-printed style
of gawk --profile, except that that loses comments and merges
BEGIN / END blocks.

>5) hosting service
>
>5a+) code.google.com <http://code.google.com>
>5b) sites.google
>5c) github
>5d) launchpad
>5e) sourceforge
>5f) savannah
>5g) custom

awk.info...

>6) repository
>
>6a+) subversion
>6b) git
>6c) bazaar
>6d) cvs
>6e) (s)ftp
>6f) Web pages with upload
>6g*) CPAN, PEAR or similar

This needs to be very mainstream; particularly bear in mind that Windows
users need a way to get to things. A browsable web repository is the
most generic and easy to use. Internally, you can use whatever you
want and export it to the web repository.

>8) coding standards
>
>8a*) functions don't change globals directly
>
>8b+) reduce use of /pattern/ {action} in favor of while(getline) loops
>inside functions
>
>8d+) use a[0] to store size of array

Gawk & Bell Labs awk support length(array), FWIW.

>8e+) require all "local" (see 8g) variable names be lower case (use '_'
>to separate words?)
>
>8f*) allow naming variables with some version of Hungarian notation
>
>8g*) require all function variables to be "localized" by including them
>in the function parameter list

You should have a separate section for uploads that don't follow your
coding standards, in order to be as inclusive as possible.

>9) define some standard macros (using m4)

I will reserve comments on this until the discussion period.

>10) Label for a collection of functions
>
>10a*) package
>10b*) module
>10c+) gem

"Collection". :-)

This is a great initiative, I really hope it bears fruit!

Thanks,

Arnold
--
Aharon (Arnold) Robbins arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL

Manel Collado

Jan 4, 2009, 10:59:44 AM

jh wrote:

> Tim Menzies and I have been talking about an AWK code library system.
> ...

> 0) Repository name?
>
> 0a+) planetawk
> 0b) libawk
> 0c) address of the repository must be in the domain name that
> corresponds to the chosen name
> 0d) address of the repository may be <repository name>.<name of hosting
> service> or <hosting service full domain name>/<repository name>

CAWKAN, by analogy with CPAN (Perl) and CTAN (TeX).

--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Ed Morton

Jan 6, 2009, 10:08:51 AM

On Dec 30 2008, 9:21 pm, jh <jh...@mail.avcnet.org> wrote:
<snip>

> 8) coding standards
>
> 8a*) functions don't change globals directly

It's sometimes useful/necessary for functions to change globals
directly. A better coding standard to adopt would be one that
identifies globals so when we see a variable being modified in a
function and that variable isn't included in the function argument
list, we can tell if it's deliberate or a bug. For example:

function foo(   i) {
    for (i=1;i<=10;i++) {
        Things[i] = "x"
        stuff[i] = "y"
    }
}

If we adopt the convention that all globals start with an upper case
letter (there may be better alternatives), we can see immediately that
"Things" is global so its use above is OK but "stuff" was intended to
be local so it should've been listed as a function argument.
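
A small sketch of what goes wrong in that example, assuming any POSIX awk; the wrapper name naming_demo and the shortened loop are made up for illustration. Because "stuff" is not in the parameter list it survives the call, while "i" stays private to foo():

```sh
naming_demo() {
  awk '
    # Things is meant to be global (leading capital). stuff was meant to be
    # local but is missing from the parameter list, so it leaks out.
    function foo(   i) {
      for (i = 1; i <= 3; i++) {
        Things[i] = "x"
        stuff[i] = "y"
      }
    }
    BEGIN {
      foo()
      n = 0
      for (k in stuff) n++
      # The global i was never touched, so it is still the empty string.
      print "stuff leaked " n " elements; i is [" i "]"
    }'
}
naming_demo
```

With the naming convention, the lower-case "stuff" assigned inside a function without appearing in its parameter list stands out as the bug.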

> 8b+)  reduce use of /pattern/ {action} in favor of while(getline) loops
> inside functions

What does that mean?

> 8d+) use a[0] to store size of array

I've never needed that. I suppose it's a reasonable idea if required
(e.g. in an awk where length(array) doesn't return the size of the
array), but there could be other things you want to store in array[0]
too, e.g. an original string that you split() into array[] (analogous
with $0 in the "array" of fields, "$"). I wouldn't make this part of
coding standards.

> 8e+) require all "local" (see 8g) variable names be lower case  (use '_'
> to separate words?)

I'd recommend you just have them start with lower case. I much prefer
"namesArray" over "names_array" and I won't be changing unless there's
a very good reason.

So, I'm recommending that globals start with a capital letter, locals
with lower case.

> 8f*) allow naming variables with some version of Hungarian notation

Fine, just don't require it. I wouldn't even mention that in coding
standards.

> 8g*) require all function variables to be "localized" by including them
> in the function parameter list

Right and, as typically done, put some white space before them to
separate them from the real function arguments.

You might want to add some things like:

9) Do not use all upper case for user-defined variable names as that's
reserved for builtin variables.

10) Always use parentheses when they're optional (e.g. loop bodies).

11) Do not use one-character variable names as they're hard to find
when searching the code later (e.g. use "idx" instead of "i" for a
loop index)

12) Always give printf at least 2 arguments, the format plus at least
1 data item (e.g. use printf "%s",$1 instead of printf $1, and use
printf "%s","foo" instead of printf "foo".)

13) If using getline, always use one of these forms (see http://tinyurl.com/yn9ka9):
if/while ( (getline var < file) > 0)
if/while ( (command | getline var) > 0)
if/while ( (command |& getline var) > 0)

Regards,

Ed.

Tim Menzies

Jan 7, 2009, 12:40:39 PM

>> 8d+) use a[0] to store size of array
>
> I've never needed that. I suppose it's a reasonable idea if required
> (e.g. in an awk where length(array) doesn't return the size of the
> array), but there could be other things you want to store in array[0]
> too, e.g. an original string that you split() into array[] (analogous
> with $0 in the "array" of fields, "$"). I wouldn't make this part of
> coding standards.

agreed - should not be part of a standard.

but, fyi, the fix proposed by arnold won't work. he suggested using
a[length(a)+1]=x for a "push" operation but length(a) assumes a is a
string for uninitialized "a". i tried the obvious fix but it did not
work. in the following code push1 crashes for uninitialized "a" but
not push2. so i would still argue for using a[0] to store the length

function empty(a, i) { for(i in a) return 0; return 1}
function push1(a,x) {
if (empty(a)) {print 1; split("",a,"")};
a[length(a)+1]=x
}
function push2(a,x) { a[++a[0]]=x }
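
A minimal sketch of push2 in use, assuming any POSIX awk; the names stack and push_demo are made up for illustration. It only shows the a[0] convention working on an array that was never initialized; it does not try to reproduce the push1 failure, which depends on the awk version.

```sh
push_demo() {
  awk '
    # push2 as posted above: a[0] holds the element count.
    function push2(a, x) { a[++a[0]] = x }
    BEGIN {
      push2(stack, "first")     # stack has never been used before this call
      push2(stack, "second")
      print stack[0], stack[1], stack[2]
    }'
}
push_demo
```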

>
>> 8e+) require all "local" (see 8g) variable names be lower case  (use '_'
>> to separate words?)
>
> I'd recommend you just have them start with lower case. I much prefer
> "namesArray" over "names_array" and I won't be changing unless there's
> a very good reason.
>
> So, I'm recommending that globals start with a capital letter, locals
> with lower case.

i concur


> 12) Always give printf at least 2 arguments, the format plus at least
> 1 data item (e.g. use printf "%s",$1 instead of printf $1, and use
> printf "%s","foo" instead of printf "foo".)

what is the reason for this?

> 13) If using getline, always use one of these forms (see http://tinyurl.com/yn9ka9):
> if/while ( (getline var < file) > 0)
> if/while ( (command | getline var) > 0)
> if/while ( (command |& getline var) > 0)


er... so you are saying always check for errors?

> Regards,
>
> Ed.

great comments. thanks a lot

timm



Ed Morton

Jan 7, 2009, 1:56:33 PM

We frequently see people use

printf $1

when they want to print the first field with no trailing newline, then
they get surprised by the result when their input data contains a
character that has meaning in a printf format string:

$ echo "abc" | awk '{printf $1}'
abc$
$ echo "ab%c" | awk '{printf $1}'
awk: (FILENAME=- FNR=1) fatal: not enough arguments to satisfy format string
        `ab%c'
           ^ ran out for this one
$

So, you must always use 2 arguments to printf when you want to print
input data, so it's a good habit to get into even if you're printing
the values of variables or literal strings which can still bite you
unexpectedly on a future modification.
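
The safe form, run on the same input, can be sketched as follows (the wrapper name printf_demo is made up for illustration, and a newline is added to the format so the output ends cleanly):

```sh
printf_demo() {
  # The data is passed as an argument, never as the format string,
  # so the "%c" in the input is printed literally.
  echo "ab%c" | awk '{ printf "%s\n", $1 }'
}
printf_demo
```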

As with most coding standards, the above doesn't matter much for a one-
line throw-away awk script, but when you're storing it in a
repository....

>
> > 13) If using getline, always use one of these forms (see http://tinyurl.com/yn9ka9):


> >     if/while ( (getline var < file) > 0)
> >     if/while ( (command | getline var) > 0)
> >     if/while ( (command |& getline var) > 0)
>
> er... so you are saying always check for errors?

I'm saying check for errors using the right, unambiguous syntax, and
use the form of getline that doesn't change any builtin variables.
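
A sketch of that point, assuming a POSIX awk and a mktemp command; the function name and the sample data are made up for illustration. Reading a side file with "getline var < file" fills only the named variable, so the current record and NR are left alone:

```sh
getline_var_demo() {
  tmp=$(mktemp) || return 1
  printf 'one\ntwo\n' > "$tmp"
  echo "main record" | awk -v file="$tmp" '
    {
      n = 0
      # Explicit "> 0" test: stops on end of file (0) and on error (-1).
      while ((getline line < file) > 0) n++
      close(file)
      print n " lines read; $0 is still [" $0 "]; NR is " NR
    }'
  rm -f "$tmp"
}
getline_var_demo
```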

Ed.

Aharon Robbins

Jan 8, 2009, 2:35:18 PM

In article <slrngm9q8n...@vermouth.dreamhost.com>,

Tim Menzies <t...@menzies.us> wrote:
>but, fyi, the fix proposed by arnold won't work. he suggested using
>a[length(a)+1]=x for a "push" operation but length(a) assumes a is a
>string for uninitialized "a".

Me, or someone else? I don't remember this.

>i tried the obvious fix but it did not
>work. in the following code push1 crashes for uninitialized "a" but
>not push2. so i would still argue for using a[0] to store the length
>
>function empty(a, i) { for(i in a) return 0; return 1}

function empty(a, i) { return (i in a) }

>function push1(a,x) {
> if (empty(a)) {print 1; split("",a,"")};
> a[length(a)+1]=x
>}

length(a) where a is an array is not portable.
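
One portable alternative is to count the elements with a for-in loop. This is a sketch only; the names count and count_demo are made up for illustration and are not part of any awk:

```sh
count_demo() {
  awk '
    # Hypothetical helper: number of elements in a, using only POSIX awk.
    function count(a,    k, n) {
      n = 0
      for (k in a) n++
      return n
    }
    BEGIN {
      split("a b c", arr, " ")
      print count(arr)
    }'
}
count_demo
```

The loop costs one pass over the array per call, which is the trade-off against keeping a running size in a[0].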

>> So, I'm recommending that globals start with a capital letter, locals
>> with lower case.
>
>i concur

This is a very good convention.

>> 13) If using getline, always use one of these forms (see
>http://tinyurl.com/yn9ka9):
>> if/while ( (getline var < file) > 0)
>> if/while ( (command | getline var) > 0)
>> if/while ( (command |& getline var) > 0)
>
>er... so you are saying always check for errors?

More than that:

if (getline var)

where getline returns -1 acts as a "true" value. Not what
you typically want. :-)
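
A short sketch of the difference, assuming the path below does not exist on the machine running it (the path and the name getline_demo are made up for illustration):

```sh
getline_demo() {
  awk '
    BEGIN {
      file = "/no/such/dir/no-such-file"    # assumed not to exist
      rc = (getline line < file)
      print "getline returned " rc
      # -1 is non-zero, so a bare test treats the failure as success.
      if (rc)     print "unchecked test: treated as success"
      if (rc > 0) print "checked test: treated as success"
    }'
}
getline_demo
```

In a while loop the bare test would never terminate on such a file, since every call keeps returning -1.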

Aharon Robbins

Jan 8, 2009, 2:36:52 PM

In article <gk5khl$hu4$1...@localhost.localdomain>,

Aharon Robbins <arn...@skeeve.com> wrote:
>>function empty(a, i) { for(i in a) return 0; return 1}
>
>function empty(a, i) { return (i in a) }

Grr. It's too late at night:

function empty(a, i) { return !(i in a) }

Hmmm. Now that I think about it, the original is
probably best. Never mind.

Tim Menzies

Jan 8, 2009, 9:02:23 PM

On 2009-01-08, Aharon Robbins <arn...@skeeve.com> wrote:
> In article <gk5khl$hu4$1...@localhost.localdomain>,
> Aharon Robbins <arn...@skeeve.com> wrote:
>>>
>>
>>function empty(a, i) { return (i in a) }
>
> Grr. It's too late at night:
>
>
>
> Hmmm. Now that I think about it, the original is
> probably best. Never mind.


beg pardon? your blessing is on...

function empty(a, i) { for(i in a) return 0; return 1}

why? what is wrong with

function empty(a, i) { return !(i in a) }

timm

jh

Jan 8, 2009, 11:04:59 PM

Tim Menzies wrote:
> On 2009-01-08, Aharon Robbins <arn...@skeeve.com> wrote:
>> In article <gk5khl$hu4$1...@localhost.localdomain>,
>> Aharon Robbins <arn...@skeeve.com> wrote:
>>> function empty(a, i) { return (i in a) }
>> Grr. It's too late at night:
>>
>>
>>
>> Hmmm. Now that I think about it, the original is
>> probably best. Never mind.
>
>
> beg pardon? your blessing is on...
>
> function empty(a, i) { for(i in a) return 0; return 1}
>
> why? what is wrong with
>
> function empty(a, i) { return !(i in a) }
>
> timm
>
>

Test results:

$ gawk 'BEGIN {print (i in a)}'
0
$ gawk 'BEGIN {a[1] = "";print (i in a)}'
0
$ gawk 'BEGIN {a[1] = 0;print (i in a)}'
0
$ gawk 'BEGIN {a[1] = "1";print (i in a)}'
0


$ gawk 'BEGIN {for(i in a) print "not empty"; print "done"}'
done
$ gawk 'BEGIN {a[1] = "";for(i in a) print "not empty"; print "done"}'
not empty
done
$ gawk 'BEGIN {a[1] = 0;for(i in a) print "not empty"; print "done"}'
not empty
done

Aleksey Cheusov

Jan 12, 2009, 11:16:10 AM

> Tim Menzies and I have been talking about an AWK code library system.

I love the idea in general but I have one question.

AWK modules are not isolated. One module depends on another, and so on.
RUNAWK uses a #use directive for this, which works recursively, and
I think it is better than @include and some others for a number of
reasons.

Until the #use/@include incompatibility is resolved, an AWK module library
project is not possible. I think this is question #1.

--
Best regards, Aleksey Cheusov.

Aleksey Cheusov

Jan 12, 2009, 11:22:32 AM

>> 13) If using getline, always use one of these forms (see
>> http://tinyurl.com/yn9ka9):
>> if/while ( (getline var < file) > 0)
>> if/while ( (command | getline var) > 0)
>> if/while ( (command |& getline var) > 0)
>
>
> er... so you are saying always check for errors?

This is the only way to create robust software :-)
See runawk's xgetline.awk and other x<function>.awk modules.

Simple example below. Checks are made automatically.
This is a good default for most small scripts.

0 ~>cat ~/tmp/2.awk
#!/usr/bin/env runawk

#use "xgetline.awk"

BEGIN {
    while (xgetline0(ARGV [1])){
        print $0
    }
}
0 ~>yes | head -5 | ~/tmp/2.awk -
y
y
y
y
y
141 0 0 ~>~/tmp/2.awk /do/not/exis
error: assertion failed: getline < /do/not/exis failed
ARGV[0]=2.awk
$0=``
NF=0
FNR=0
FILENAME=
1 ~>

Aharon Robbins

Jan 12, 2009, 2:17:22 PM

Once again, it's late at night here, so the probability of my
being completely coherent is lower than desirable. That said...

In article <slrngmdc1f...@vermouth.dreamhost.com>,


Tim Menzies <t...@menzies.us> wrote:
>On 2009-01-08, Aharon Robbins <arn...@skeeve.com> wrote:
>> In article <gk5khl$hu4$1...@localhost.localdomain>,
>> Aharon Robbins <arn...@skeeve.com> wrote:
>>>
>>>function empty(a, i) { return (i in a) }
>>
>> Grr. It's too late at night:
>>
>> Hmmm. Now that I think about it, the original is
>> probably best. Never mind.
>
>beg pardon? your blessing is on...
>
> function empty(a, i) { for(i in a) return 0; return 1}

Yes.

>why? what is wrong with
>
> function empty(a, i) { return !(i in a) }

This can be incorrect if someone did

a[""] = 1 # or any value

The null string is a valid (albeit unusual) index, and there's no
guarantee that some fool won't call

x = empty(a, "foo")

just to be ornery. However,

function empty(a, i) { for(i in a) return 0; return 1}

will work, since if there's nothing at all in a, it'll return 1,
which is what's desired. It's an unusual use of the for, but
I rather like it overall. :-)
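
The two versions can be compared side by side with a short sketch, assuming any POSIX awk; the names empty_in, empty_demo and the test arrays are made up for illustration. In line with jh's test results above, !(i in a) only checks whether the one key held in i (the null string when i is unset) exists, so it reports a non-empty array as empty unless that key happens to be present:

```sh
empty_demo() {
  awk '
    function empty(a,    i)    { for (i in a) return 0; return 1 }
    function empty_in(a,    i) { return !(i in a) }   # the rejected variant
    BEGIN {
      print empty(none), empty_in(none)           # truly empty array
      one[1] = "x"
      print empty(one), empty_in(one)             # one ordinary element
      null_key[""] = 1
      print empty(null_key), empty_in(null_key)   # only the null-string key
    }'
}
empty_demo
```

Each line prints the for-loop result first and the rejected variant second; the two disagree only on the middle line.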

Arnold
