Extended ASCII chars: Could not match text

Daniel Gersten

29.09.2022, 07:22:36

to Jep Java Users

The parser only matches ASCII variable names, but I would like to extend it to accept variable names that contain extended ASCII symbols.
https://theasciicode.com.ar/

What is the regex for extended ASCII symbols?

cp.addTokenMatcher(new IdentifierTokenMatcher("???"));

Richard Morris

30.09.2022, 12:48:28

to Jep Java Users

The Pattern class https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#jcc

has a character class \p{IsLatin} which is probably what you want.

Among the code points up to U+00FF, this matches: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ

The corresponding IdentifierTokenMatcher will be

var itm = new IdentifierTokenMatcher("[\\p{IsLatin}_][\\p{IsLatin}\\p{Digit}_]*");

Allowing Latin characters and underscores for the first character and those plus digits for the other characters.
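The regex itself can be tried out with plain java.util.regex before wiring it into Jep. The following is only a sketch: the class and method names are made up for illustration, and it exercises the pattern string alone, not IdentifierTokenMatcher. Non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

public class LatinIdentifierDemo {
    // Same pattern string as passed to IdentifierTokenMatcher above.
    static final Pattern IDENTIFIER =
            Pattern.compile("[\\p{IsLatin}_][\\p{IsLatin}\\p{Digit}_]*");

    // True if the whole string is a valid identifier under the pattern.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "x1",                 // plain ASCII
            "gr\u00f6\u00dfe",    // Latin letters with o-umlaut and sharp s
            "_\u00f1and\u00fa",   // leading underscore, n-tilde, u-acute
            "1abc",               // starts with a digit
            "\u03b1"              // Greek alpha, not in the Latin script
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```

The first three samples are accepted; the last two are rejected, because a digit may not come first and Greek letters are outside the Latin script.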

A simple Jep session might be

Actually, working with files encoded in extended ASCII is tricky. I think you will probably need the java.nio.charset.Charset class: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/charset/Charset.html

Richard Morris

02.10.2022, 07:11:56

to Jep Java Users

A bit more digging.

It looks like the charset you reference is IBM code page 437, one of many charsets referred to as extended ASCII. https://en.wikipedia.org/wiki/Code_page_437

You can get the Java charset for this using

Charset charset = Charset.forName("Cp437");

You'll probably need to convert the raw bytes to Unicode

CharBuffer chars = charset.decode(ByteBuffer.wrap(array));
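Put together, the two lines above could look like the following sketch. The class name, the method name and the sample bytes are made up for illustration. It also assumes the JVM ships the Cp437 charset: it is not one of the charsets every Java platform is required to support, although common JDK builds include it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class Cp437DecodeDemo {
    // Decodes raw code page 437 bytes into a Java (UTF-16) string.
    static String decodeCp437(byte[] array) {
        // Throws UnsupportedCharsetException if this JVM lacks Cp437.
        Charset charset = Charset.forName("Cp437");
        CharBuffer chars = charset.decode(ByteBuffer.wrap(array));
        return chars.toString();
    }

    public static void main(String[] args) {
        // In code page 437, 0x81 is u-umlaut and 0xE3 is Greek pi.
        byte[] raw = {(byte) 0x81, (byte) 0xE3, 'r', '2'};
        String text = decodeCp437(raw);
        // Print code points so the result is readable on any console.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

For whole files, passing the same Charset to java.nio.file.Files.newBufferedReader(path, charset) should avoid handling the byte array by hand.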

Then

var itm = new IdentifierTokenMatcher("[\\p{L}_][\\p{L}\\p{Digit}_]*");

is probably the best, matching all Unicode letters.
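Code page 437 also contains a handful of Greek letters, which is one reason the broader \p{L} class may be the better fit here. A small comparison of the two pattern strings, again as a sketch with hypothetical class and method names and using only java.util.regex:

```java
import java.util.regex.Pattern;

public class UnicodeLetterIdentifierDemo {
    // Pattern from the earlier reply: Latin script only.
    static final Pattern LATIN_ONLY =
            Pattern.compile("[\\p{IsLatin}_][\\p{IsLatin}\\p{Digit}_]*");

    // Pattern from this reply: any Unicode letter.
    static final Pattern ANY_LETTER =
            Pattern.compile("[\\p{L}_][\\p{L}\\p{Digit}_]*");

    static boolean latinOnly(String s) {
        return LATIN_ONLY.matcher(s).matches();
    }

    static boolean anyLetter(String s) {
        return ANY_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String latin = "\u00fcber";  // u-umlaut followed by "ber"
        String greek = "\u03c0r2";   // Greek pi followed by "r2"
        System.out.println("latin word, Latin pattern: " + latinOnly(latin));
        System.out.println("latin word, \\p{L} pattern: " + anyLetter(latin));
        System.out.println("greek word, Latin pattern: " + latinOnly(greek));
        System.out.println("greek word, \\p{L} pattern: " + anyLetter(greek));
    }
}
```

Both patterns accept the Latin word, but only the \p{L} version accepts the identifier that starts with pi.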

Daniel Gersten

17.10.2022, 08:02:23

to Jep Java Users

Works like a charm, thank you.

If anybody is looking for other specific symbols, look up their Unicode code points and add them to the regular expression:

> A Unicode character can also be represented by using its Hex notation (hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html
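As an illustration of that approach, the sketch below adds individual code points to both character classes of the \p{L} identifier pattern using the \x{...} construct. The helper names and the choice of the infinity sign U+221E (a symbol, not a letter, that also appears in code page 437) are assumptions made for this example. The resulting string could then be passed to IdentifierTokenMatcher in the same way as the patterns earlier in the thread.

```java
import java.util.regex.Pattern;

public class ExtraSymbolIdentifierDemo {
    // Builds the identifier regex, allowing the given extra code points
    // anywhere in the name in addition to letters and underscores.
    static String identifierRegex(int... extraCodePoints) {
        StringBuilder extra = new StringBuilder();
        for (int cp : extraCodePoints) {
            // Hex code point notation, e.g. \x{221e}
            extra.append("\\x{").append(Integer.toHexString(cp)).append('}');
        }
        return "[\\p{L}_" + extra + "][\\p{L}\\p{Digit}_" + extra + "]*";
    }

    // True if the input is an identifier once the extra code points are allowed.
    static boolean accepts(String input, int... extraCodePoints) {
        return Pattern.matches(identifierRegex(extraCodePoints), input);
    }

    public static void main(String[] args) {
        String name = "x\u221e"; // "x" followed by the infinity sign
        System.out.println(identifierRegex(0x221E));
        System.out.println("without extras: " + accepts(name));
        System.out.println("with U+221E:    " + accepts(name, 0x221E));
    }
}
```

Without the extra code point the name is rejected, since the infinity sign is not a letter; with it, the name is accepted.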
