Extended ASCII chars: Could not match text


Daniel Gersten

29.09.2022, 07:22:36
to Jep Java Users
The parser only matches ASCII variable names, but I would like to extend it so that variable names can also contain extended ASCII symbols.
https://theasciicode.com.ar/

What is the regex for extended ASCII symbols?

cp.addTokenMatcher(new IdentifierTokenMatcher("???"));

 

Richard Morris

30.09.2022, 12:48:28
to Jep Java Users
The Pattern class https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#jcc
has a character class, \p{IsLatin}, which is probably what you want.

Among the first 256 Unicode code points, this matches: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ

The corresponding IdentifierTokenMatcher will be

        var itm = new IdentifierTokenMatcher("[\\p{IsLatin}_][\\p{IsLatin}\\p{Digit}_]*");

This allows Latin characters and underscores for the first character, and those plus digits for the remaining characters.
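To check what this expression accepts without involving Jep at all, it can be tried with plain java.util.regex. The following is only a minimal sketch: the class IdentifierPatternDemo and the helper isIdentifier are made-up names for illustration, not part of Jep, and it assumes that IdentifierTokenMatcher treats the expression the same way Pattern does. Non-ASCII characters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Jep.
class IdentifierPatternDemo {
    // Same expression as the one passed to IdentifierTokenMatcher.
    static final Pattern IDENTIFIER =
            Pattern.compile("[\\p{IsLatin}_][\\p{IsLatin}\\p{Digit}_]*");

    // True if the whole string is an identifier under the pattern.
    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("caf\u00e923")); // café23 -> true
        System.out.println(isIdentifier("_a\u00f1o"));   // _año   -> true
        System.out.println(isIdentifier("2caf\u00e9"));  // 2café  -> false, starts with a digit
        System.out.println(isIdentifier("\u03c0"));      // π      -> false, Greek is not Latin script
    }
}
```

The last case shows the limit of \p{IsLatin}: it covers the Latin script only, so Greek letters such as π are rejected.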

A simple Jep session might be:

        var itm = new IdentifierTokenMatcher("[\\p{IsLatin}_][\\p{IsLatin}\\p{Digit}_]*");
       
        ConfigurableParser cp = new ConfigurableParser();
        cp.addHashComments();
        cp.addSlashComments();
        cp.addSingleQuoteStrings();
        cp.addDoubleQuoteStrings();
        cp.addWhiteSpace();
        cp.addExponentNumbers();
        cp.addSymbols("(",")","[","]",","); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$ //$NON-NLS-4$ //$NON-NLS-5$
        cp.setImplicitMultiplicationSymbols("(","["); //$NON-NLS-1$ //$NON-NLS-2$
        cp.addOperatorTokenMatcher();
        cp.addTokenMatcher(itm);
        cp.addSemiColonTerminator();
        cp.addWhiteSpaceCommentFilter();
        cp.addBracketMatcher("(",")"); //$NON-NLS-1$ //$NON-NLS-2$
        cp.addFunctionMatcher("(",")",","); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
        cp.addListMatcher("[","]",","); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$
        cp.addArrayAccessMatcher("[","]"); //$NON-NLS-1$ //$NON-NLS-2$
        Jep jep = new Jep(cp);
                     
        Node n1 = jep.parse("café23 = 5");
        jep.evaluate(n1);
        Object val = jep.getVariableValue("café23");
        assertEquals(5.0, val);

Actually, working with files encoded in extended ASCII is tricky. I think you will probably need the java.nio.charset.Charset class: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/nio/charset/Charset.html

Richard Morris

02.10.2022, 07:11:56
to Jep Java Users
A bit more digging.

It looks like the charset you reference is IBM code page 437, one of many charsets referred to as extended ASCII. https://en.wikipedia.org/wiki/Code_page_437

You can get the Java charset for this using
        Charset charset = Charset.forName("Cp437");

You'll probably need to convert the bytes to Unicode, where array is the byte[] holding the encoded text:
        CharBuffer chars = charset.decode(ByteBuffer.wrap(array));

Then
        var itm = new IdentifierTokenMatcher("[\\p{L}_][\\p{L}\\p{Digit}_]*");
is probably the best choice, since it matches all Unicode letters.
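The decoding step and the \p{L} expression can be tried together without Jep. This is a rough sketch under a few assumptions: the class Cp437DecodeDemo and its helpers are hypothetical names, the byte values are taken from the code page 437 table, and the "Cp437" charset must be available in your Java runtime.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Jep.
class Cp437DecodeDemo {
    // Same expression as suggested for IdentifierTokenMatcher: all Unicode letters.
    static final Pattern IDENTIFIER =
            Pattern.compile("[\\p{L}_][\\p{L}\\p{Digit}_]*");

    // Decode bytes encoded in code page 437 into a Java (Unicode) string.
    static String decodeCp437(byte[] bytes) {
        Charset charset = Charset.forName("Cp437");
        CharBuffer chars = charset.decode(ByteBuffer.wrap(bytes));
        return chars.toString();
    }

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        byte[] cafe = { 0x63, 0x61, 0x66, (byte) 0x82 }; // "café": 0x82 is é in CP437
        byte[] pi   = { (byte) 0xE3 };                   // π in CP437
        byte[] line = { (byte) 0xC4 };                   // box-drawing character, not a letter

        System.out.println(isIdentifier(decodeCp437(cafe))); // true
        System.out.println(isIdentifier(decodeCp437(pi)));   // true, \p{L} also covers Greek
        System.out.println(isIdentifier(decodeCp437(line))); // false
    }
}
```

Code page 437 also contains Greek letters and box-drawing symbols, so \p{L} accepts more of its letters than \p{IsLatin} does, while still rejecting the non-letter symbols.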

Daniel Gersten

17.10.2022, 08:02:23
to Jep Java Users
Works like a charm, thank you.

If somebody is looking for other specific symbols, look up their Unicode code points and add them to the regular expression:

> A Unicode character can also be represented by using its Hex notation (hexadecimal code point value) directly as described in construct \x{...}, for example a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html
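As an illustration of the \x{...} notation, here is a small sketch that adds one extra symbol, the degree sign U+00B0, to the identifier expression. The class HexNotationDemo and its fields are hypothetical names, and whether a given symbol is sensible in a Jep variable name depends on the rest of your parser configuration (it must not clash with an operator, for example).

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Jep.
class HexNotationDemo {
    // Unicode letters only.
    static final Pattern LETTERS_ONLY =
            Pattern.compile("[\\p{L}_][\\p{L}\\p{Digit}_]*");

    // Unicode letters plus the degree sign U+00B0, given in \x{...} hex notation.
    static final Pattern WITH_DEGREE =
            Pattern.compile("[\\p{L}_\\x{B0}][\\p{L}\\p{Digit}_\\x{B0}]*");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // \x{2011F} names the supplementary character directly; in a Java string
    // the same character is the surrogate pair \uD840\uDD1F.
    static boolean supplementaryMatches() {
        return Pattern.matches("\\x{2011F}", "\uD840\uDD1F");
    }

    public static void main(String[] args) {
        System.out.println(matches(LETTERS_ONLY, "T\u00b0")); // T° -> false, ° is not a letter
        System.out.println(matches(WITH_DEGREE, "T\u00b0"));  // T° -> true
        System.out.println(supplementaryMatches());           // true
    }
}
```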
