why does patsplit() exist?

Ed Morton

Apr 17, 2021, 11:09:02 AM
In gawk 4.0 two similar changes were introduced:

1) patsplit() - a new function to split a string into array elements
that match a regexp
2) split() was given a 4th argument to store the strings that match the
separator regexp in an array.

For example:

$ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]+/) { for (i in vals)
print vals[i] }'
13
27

$ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]+/,vals) { for (i in vals)
print vals[i] }'
13
27

Given the awk language traditionally only provides constructs that are
hard to implement with other existing constructs and that both items
were introduced in the same release there must be something I'm missing
- what is it that patsplit() provides that's hard to implement with split()?

Ed.

Kenny McCormack

Apr 17, 2021, 11:31:55 AM
In article <s5etmc$jjm$1...@dont-email.me>,
Ed Morton <morto...@gmail.com> wrote:
...
>Given the awk language traditionally only provides constructs that are
>hard to implement with other existing constructs and that both items
>were introduced in the same release there must be something I'm missing
>- what is it that patsplit() provides that's hard to implement with split()?

You need to re-read the parts of the GAWK documentation that explain the
FPAT concept.

Note: I do get what you mean about how, with enough gymnastics, you could
(sort of) do with split() what FPAT/patsplit() do, but who needs that kind
of pain?

--
The coronavirus is the first thing, in his 74 pathetic years of existence,
that the orange menace has come into contact with, that he couldn't browbeat,
bully, bullshit, bribe, sue, legally harass, get Daddy to fix, get his
siblings to bail him out of, or, if all else fails, simply wish it away.

Janis Papanagnou

Apr 17, 2021, 12:48:56 PM
I think it's probably a convenience function, although it's also
conceptually clearer to distinguish the two cases, and finally you can
construct use-cases that produce different output:

echo 'Hello,,world!' |
awk 'x=patsplit($0,vals,/[^,]*/) {
print x ; for (i in vals) print vals[i]
}'

echo 'Hello,,world!' |
awk 'x=split($0,tmp,/[^,]*/,vals) {
print x ; for (i in vals) print vals[i]
}'

(where you can't even rely on the return value, e.g. to iterate over
the fields; for some data the values are the same, for other data they
are off by one).
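
A sketch of the same comparison with the counts made explicit. It assumes gawk 4.0 or later is installed as `gawk`, since both patsplit() and the 4-argument split() are gawk extensions, and it uses a numeric loop because the order of `for (i in vals)` is not guaranteed. The shell variable names are illustrative only.

```shell
# patsplit(): vals[] holds the regexp matches, including the empty one
# between the two commas; the return value is the number of matches.
pat_out=$(echo 'Hello,,world!' | gawk '{
    n = patsplit($0, vals, /[^,]*/)
    printf "%d:", n
    for (i = 1; i <= n; i++) printf "[%s]", vals[i]
    print ""
}')

# split() with a 4th argument: the return value counts the fields in tmp[],
# while vals[] receives the separators, of which there is one fewer.
split_out=$(echo 'Hello,,world!' | gawk '{
    n = split($0, tmp, /[^,]*/, vals)
    printf "%d:", n
    for (i = 1; i < n; i++) printf "[%s]", vals[i]
    print ""
}')

printf '%s\n%s\n' "$pat_out" "$split_out"
```

With this input both calls should return the same number, yet patsplit() fills one more element of vals[] than split() does, so the return value of split() cannot be used as the loop bound for its separator array.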

Janis


Ben Bacarisse

Apr 17, 2021, 5:20:33 PM
Change the + to a *. I don't think split will ever see an empty separator.

--
Ben.

Ben Bacarisse

Apr 17, 2021, 5:25:08 PM
Change the + to *. I don't think split will ever see an empty
separator, but patsplit is happy with empty fields.

--
Ben.

Ed Morton

Apr 17, 2021, 8:18:07 PM
Thanks for the response Ben & Janis. You both gave an example of a case
I hadn't considered which is where an empty string could match the regexp:

Janis:
-----
$ echo 'Hello,,World' | awk 'patsplit($0,vals,/[^,]*/) { for (i in vals)
print i, vals[i] }'
1 Hello
2
3 World

$ echo 'Hello,,World' | awk 'split($0,tmp,/[^,]*/,vals) { for (i in
vals) print i, vals[i] }'
1 Hello
2 World
-----

Ben:
-----
$ echo 'foo13bar27' | awk 'patsplit($0,vals,/[0-9]*/) { for (i in vals)
print i, vals[i] }'
1
2
3
4 13
5
6
7 27

$ echo 'foo13bar27' | awk 'split($0,tmp,/[0-9]*/,vals) { for (i in vals)
print i, vals[i] }'
1 13
2 27
-----

Is that the only difference - whether or not an empty string can match
the regexp?

Ed.

J Naman

Apr 17, 2021, 11:39:11 PM
Maybe patsplit() is a convenience, but it is very important to me. In addition to CSV files, I use patsplit() to extract all numeric percentages (e.g. 12.3%) and ALL embedded dates (mm/dd/yy or yyyy) from highly UNSTRUCTURED text files that are aggregations of text lines from multiple sources. The text lines that I get have misspellings, non-standard abbreviations, bizarre punctuation -- "unNatural Language Processing". The extracted numeric data then give me clues on how to process the seps[] text data.
Example: patsplit($0, arr, /[0-9]*[.][0-9]*%/,seps); # first, extract all embedded yields (none, some, a lot)
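
A minimal sketch of that call on a made-up input line, assuming gawk is installed as `gawk`. It shows where the matches and the surrounding text end up: arr[1..n] holds the percentages, seps[0] the text before the first one, and seps[i] the text after arr[i]. Note that, as written, the regexp would also accept a bare `.%`.

```shell
# Assumes gawk (patsplit() is a gawk extension); the sample line is made up.
line='bond A yields 12.3% and bond B 4.5% this year'

pct_out=$(printf '%s\n' "$line" | gawk '{
    n = patsplit($0, arr, /[0-9]*[.][0-9]*%/, seps)
    # seps[0] is the text before the first match,
    # seps[i] is the text that follows arr[i].
    printf "seps[0]=<%s>\n", seps[0]
    for (i = 1; i <= n; i++)
        printf "arr[%d]=<%s> seps[%d]=<%s>\n", i, arr[i], i, seps[i]
}')
printf '%s\n' "$pct_out"
```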

Ed Morton

Apr 18, 2021, 9:53:22 AM
Wouldn't you get the same output from

split($0, seps, /[0-9]*[.][0-9]*%/, arr)

though?

I'm just trying to understand what patsplit() does differently from
split() with the array names swapped and so far Ben and Janis gave an
example where it handles null strings differently - best I can tell that
wouldn't apply in the case you describe so is there some other difference?
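
A sketch that checks the swap on a made-up line, assuming gawk is installed as `gawk`. For a regexp like this one, which cannot match the empty string, the two calls appear to collect the same strings; the differences are bookkeeping: the return values differ by one, and patsplit() numbers the surrounding text from 0 while split() numbers its fields from 1. The array names are illustrative only.

```shell
line='bond A yields 12.3% and bond B 4.5% this year'

swap_out=$(printf '%s\n' "$line" | gawk '{
    np = patsplit($0, p_arr, /[0-9]*[.][0-9]*%/, p_seps)
    ns = split($0, s_seps, /[0-9]*[.][0-9]*%/, s_arr)
    same = 1
    for (i = 1; i <= np; i++)
        if (p_arr[i] != s_arr[i]) same = 0        # the matches themselves
    for (i = 0; i <= np; i++)
        if (p_seps[i] != s_seps[i + 1]) same = 0  # surrounding text, shifted by one
    print np, ns, same
}')
printf '%s\n' "$swap_out"
```

The three numbers printed are the patsplit() return value, the split() return value, and 1 if every string compared equal.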

Ed.

Ed Morton

Apr 24, 2021, 8:46:18 AM
Well, best I can tell that handling of null strings that match the
regexp is the only difference between the 3rd arg for patsplit() and the
3rd arg for split() other than the cases where split() is using either
of the special-case FSs of "" or " ".

So the key is that split() takes an FS for the 3rd arg while patsplit()
takes a regexp and, while an FS is regexp-like, it has 3 special cases
that make it different from regexps:

1) FS = "" -> unspecified by POSIX, some awks split into chars.
2) FS = " " -> leading/trailing spaces ignored, split on contiguous spaces.
3) FS = a regexp that can match a null string -> treat it like a regexp
that cannot match a null string (e.g. `,*` gets treated like `,+`).
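
The three cases can be sketched with split() directly. This assumes gawk installed as `gawk`; cases 1 and 3 in particular describe gawk's behavior and other awks may differ. The array names are throwaway.

```shell
fs_out=$(gawk 'BEGIN {
    # 1) "" : gawk splits into individual characters (unspecified by POSIX).
    n1 = split("abc", a, "")
    # 2) " " : leading/trailing blanks ignored, runs of blanks are one separator.
    n2 = split("  a   b  ", b, " ")
    # 3) a regexp that can match the null string: per the description above,
    #    /,*/ behaves like /,+/ here.
    n3 = split("a,,b", c, /,*/)
    n4 = split("a,,b", d, /,+/)
    print n1, n2, n3, n4
}')
printf '%s\n' "$fs_out"
```

The last two counts being equal is the point of case 3: the empty matches that /,*/ could make at every non-comma position are not treated as separators.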

While that 3rd point makes sense I couldn't actually find anything
documenting the fact that a field separator isn't allowed to match a
null string (except in the case of FS="" in some awks). POSIX says:

---------
The following describes FS behavior:

  If FS is a null string, the behavior is unspecified.

  If FS is a single character:

    If FS is <space>, skip leading and trailing <blank> and <newline>
    characters; fields shall be delimited by sets of one or more <blank>
    or <newline> characters.

    Otherwise, if FS is any other character c, fields shall be delimited
    by each single occurrence of c.

  Otherwise, the string value of FS shall be considered to be an extended
  regular expression. Each occurrence of a sequence matching the extended
  regular expression shall delimit fields.
---------

so in the case of `-F'[^,]*'`, for example, that falls into the final
case above. It should really say "...a sequence _of 1 or more
characters_ matching..." I suppose.
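
A small sketch of the last two rules in the quoted text, using only split() with string separators, so it should run under any POSIX-style awk. The sample strings are made up.

```shell
posix_out=$(awk 'BEGIN {
    # Single character other than space: taken literally, not as a regexp,
    # so "|" and "." need no escaping and adjacent separators give empty fields.
    n1 = split("a|b||c", f, "|")
    n2 = split("1.2.3", g, ".")
    # More than one character: treated as an extended regular expression.
    n3 = split("a, b ,c", h, " *, *")
    print n1, n2, n3, "[" f[3] "]", h[2]
}')
printf '%s\n' "$posix_out"
```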

That difference makes it non-trivial to implement patsplit() using
existing functionality (i.e. split() with the args swapped). Thanks to
all who replied.
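
For comparison, a minimal sketch of what an emulation takes in plain awk using match() instead of split(). `pat_split` is a hypothetical helper name, not part of any awk. It drops an empty match that directly follows a non-empty one, which reproduces the outputs posted earlier in this thread; how gawk treats an empty match at the very end of the string, or anchored regexps, has not been checked here, so treat those cases as open.

```shell
emu_out=$(awk '
# pat_split() is a hypothetical helper, not part of any awk.
# It stores every match of re in s into arr[1..n] and returns n.
function pat_split(s, arr, re,    n, pos, len, start, after_match) {
    split("", arr)                 # portable way to empty arr
    n = 0; pos = 1; len = length(s)
    after_match = 0                # 1 when pos sits just past a non-empty match
    while (pos <= len && match(substr(s, pos), re)) {
        start = pos + RSTART - 1   # match position within the whole string
        if (RLENGTH > 0) {
            arr[++n] = substr(s, start, RLENGTH)
            pos = start + RLENGTH
            after_match = 1
        } else {
            # Empty match: keep it unless it directly follows a real match.
            if (!(after_match && start == pos))
                arr[++n] = ""
            pos = start + 1
            after_match = 0
        }
    }
    return n
}
BEGIN {
    n = pat_split("foo13bar27", v, "[0-9]*")
    printf "%d:", n
    for (i = 1; i <= n; i++) printf "[%s]", v[i]
    print ""
}')
printf '%s\n' "$emu_out"
```

The loop has to track positions and empty matches by hand, which is the bookkeeping that patsplit() hides.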

Ed.

Kpop 2GM

Sep 6, 2021, 11:44:37 AM
I wrote this proof-of-concept for emulating patsplit functionality even without gawk :

mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"; OFS=" :: "; } END { mypat="\352[\\200-\\277][\\200-\\277]|[\\353\\354][\\200-\\277][\\200-\\277]|\355[\\200-\\235][\\200-\\277]"; print gsub(mypat "|(" mypat ")( |"mypat")*("mypat")", sepFS "&" sepFS); gsub(sepFS "("sepFS")+", ""); print nx=split($0,arr, sepFS );for(x=1;x<=nx; x++) { print "\t" x ": /." arr[x] "./" ; }}'

The test pattern here is all 11,172 Korean Hangul syllables. At the same time, I didn't want it to chop Korean phrases up at space characters, while still keeping Latin ASCII text from being split at spaces. mawk2, which isn't Unicode-aware whatsoever, can split out a nearly 72,000-cell array in 3.24 seconds, with all the Hangul in the even-numbered cells and all the "non-pat" text, if you will, in the odd-numbered ones.

The trick is simply to use a sep string that almost never occurs in proper UTF-8 input. I only included 2 bytes that are illegal in UTF-8, xC1 xFA (\301\372). You can do a quick scan of the data, and if xC1 (\301) doesn't show up at all then just use the single byte xC1 as your sep. If it *does* show up, it's possible you're working with binary data streams, in which case keep padding the sep string with bytes you deem very unlikely to show up.

(tip : don't bother with x00 \000 and xFF \377. those 2 bytes are *very* common in a variety of binary file formats)
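
The core of the trick, stripped down to a readable sketch on a tiny made-up input. It assumes the byte \001 never occurs in the data (a stand-in for the sentinel bytes discussed above) and uses a regexp that cannot match the empty string, so it does not address the empty-match case from earlier in the thread.

```shell
sent_out=$(printf 'foo13bar27\n' | awk '{
    sep = "\001"                    # assumed sentinel: a byte taken to be absent from the input
    gsub(/[0-9]+/, sep "&" sep)     # wrap every match in sentinels
    n = split($0, cell, sep)        # odd cells: text between matches, even cells: matches
    printf "%d:", n
    for (i = 1; i <= n; i++) printf "[%s]", cell[i]
    print ""
}')
printf '%s\n' "$sent_out"
```

Because the record ends with a match, the last cell is an empty string; the matches themselves sit in the even-numbered cells.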

70940: /.베리베리./
70941: /.XTOO @3./
70942: /.차 경연./
70943: /. <./
70944: /.컬래버레이션 무대./
70945: /.>=artist=14958011=VOD 657 ▸ ./
70946: /.로드 투 킹덤./
70947: /.=year=2020=05-29=secs=251=mstr=NoF=tile=t=info=1280=720=00:04:11=gnr=31219=2908-vod1=clipID=MA_306857=song=.=[Full CAM] ♬ ON - ./
70948: /.베리베리./
70949: /.XTOO @3./
70950: /.차 경연./
70951: /. <./
70952: /.컬래버레이션 무대./
70953: /.>./
70954: /.킹덤으로 가려는자./
70955: /., ./
70956: /.살아남아라./
70957: /.!Mnet <./
70958: /.로드 투 킹덤./
70959: /. (Road to Kingdom) >./
70960: /.매주 목요일 저녁./
70961: /. 8./

I think a similar approach should work for CSV files, using just about any awk variant in circulation. I've personally tested it in mawk 1.3.4, mawk2-beta 1.9.9.6, gawk 5.1.0, and the awk built into macOS 11.5.2.

Kpop 2GM

Sep 6, 2021, 11:53:07 AM
706571: /. 8./
706572: /.시./
706573: /.!!./
706574: /.꿀케미 커넥션./
706575: /. <./
706576: /.랜선친구 아이오아이./
706577: /.>./
706578: /.엠넷 본방사수./
706579: /.=albm=.=277 ▸ ./
706580: /.프로듀스./
706581: /. 101=G0636~ADV009~P2015~E0933647=VOD Various Artists=mnetA-487082=
./

real 0m2.368s
user 0m0.757s
sys 0m0.315s

just tested with mawk 1.3.4 : only 2.4 seconds to split out array with over 700K cells, and the korean strings just by themselves. at lucky times, the english translated names will be conveniently placed in adjacent cells, e.g.

691357: /., ./
691358: /.원포유./
691359: /. (14U) , ./
691360: /.에이프릴./
691361: /. (APRIL) , ./
691362: /.혜이니./
691363: /. (HEYNE) , ./


Janis Papanagnou

Sep 6, 2021, 12:14:34 PM
On 06.09.2021 17:44, Kpop 2GM wrote:
>
> mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"; OFS=" :: "; } END { mypat="\352[\\200-\\277][\\200-\\277]|[\\353\\354][\\200-\\277][\\200-\\277]|\355[\\200-\\235][\\200-\\277]"; print gsub(mypat "|(" mypat ")( |"mypat")*("mypat")", sepFS "&" sepFS); gsub(sepFS "("sepFS")+", ""); print nx=split($0,arr, sepFS );for(x=1;x<=nx; x++) { print "\t" x ": /." arr[x] "./" ; }}'

Even if the standard screen width in Korea is 370+ columns, is
there any advantage writing all in one line, unformatted? Or
is there a prize for doing that? Or are you assuming that this
newsgroup is read solely by awk interpreter software (and not
by humans)? - I wonder what folks are thinking when doing so.
Or whether they are thinking at all. Or whether that is some
regional mindset. Or a mental disease. Or a religious dogma.
Or a statement of political or social protest. - Why the heck!

Ed Morton

Sep 6, 2021, 1:16:12 PM
On 9/6/2021 10:44 AM, Kpop 2GM wrote:
<snip>
> I wrote this proof-of-concept for emulating patsplit functionality even without gawk :
>
> mawk2 'BEGIN { sepFS="\301\372"; FS=RS="^$"

That's still relying on an extension to POSIX awk for multi-char RS. A
POSIX awk would treat that like `RS="^"`. I'm not going to try to read
the rest of the script since it was all crammed onto 1 line. Janis's
response covers that situation well!
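
One way to read the whole input without a multi-char RS, sketched under the assumption that rebuilding it from newline-terminated records is acceptable: accumulate the records and do the work in END. The variable name `buf` is illustrative.

```shell
slurp_out=$(printf 'one\ntwo\nthree\n' | awk '
    { buf = (NR > 1 ? buf "\n" : "") $0 }   # rebuild the input, newline-joined
    END { print NR, length(buf) }
')
printf '%s\n' "$slurp_out"
```

This loses the distinction between input that ends with a newline and input that does not, which may or may not matter for a given job.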

Ed.