Re: [TSEPro] Massaging data into columns


knud van eeden

May 22, 2024, 7:22:46 PM
to TSE editor list, SemWare TSE Pro Text Editor
This is the latest version; it does what was asked for.

1. Fred Olsson wrote:

> reminded me of something I have long wished I had, namely
> a robust macro that would convert tabular data into columns.
> The thought came to me when I grabbed a Wikipedia article's edit
> history and pasted it into TSE. I'd like to have columns like:
> date  user-who-did-edit  size  summary
>

2. See the text here:


3. Copy part of that text directly from the screen and paste it into TSE.



4. Then highlight that text in TSE 


5. Then run the TSE parser program below. It creates aligned output in CSV format, which you can then load into a spreadsheet, for example.


6. Here is the working TSE parser program used (tested in TSE 4.50 RC23):

// Forward declarations of functions
forward integer proc parse_page()
forward integer proc parse_line()
forward integer proc match_curprev()
forward integer proc match_spaces()
forward integer proc match_hour()
forward integer proc match_minute()
forward integer proc match_day()
forward integer proc match_month()
forward integer proc match_year()
forward integer proc match_username()
forward integer proc match_filesize()
forward integer proc match_text()
forward integer proc match_literal(string s)
forward integer proc match_optional_m()
forward integer proc match_optional_questionmark()
forward integer proc match_optional_plusminus()
forward integer proc match_optional_blocksign()
forward integer proc match_regex(string s)
forward integer proc isSpaceB(integer I)
forward integer proc isRegexMatchB(string s)
forward integer proc matchLength(string s)
forward integer proc match_optional(string s)

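// Output buffer that receives the aligned CSV lines
INTEGER buffer1I = 0

// Scratch buffer that receives the text of each regex match
INTEGER buffer2I = 0

// Result of the most recent Down() call; false once no further line exists
INTEGER downB = TRUE

// Extra padding (in columns) used when computing the output columns
INTEGER extraI = 5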

// Define tokens based on the EBNF notation
string CURPREV[255] = "curprev"
string COLON[255] = ":"
string SPACE[255] = " "
string COMMA[255] = ","
string QUESTIONMARK[255] = "\?"
string BLOCKSIGN[255] = "B"
string HOUR[255] = "{[01][0-9]}|{2[0-3]}"
string MINUTE[255] = "[0-5][0-9]"
string DAY[255] = "{{[12][0-9]}|{3[01]}|{[0-9]}}"
string MONTH[255] = "{January}|{February}|{March}|{April}|{May}|{June}|{July}|{August}|{September}|{October}|{November}|{December}"
string YEAR[255] = "{{19}|{20}}[0-9][0-9]"
string USERNAME[255] = "{[a-z]#}|{[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\.[0-9][0-9]?[0-9]?\c}"
string FILESIZE[255] = "[0-9]#[,]?[0-9]#"
string TEXT[255] = ".#"

// Function to check if a character is a space
integer proc isSpaceB(integer I)
    return(I == Asc(SPACE))
end

// Parse a single line based on the EBNF rules
integer proc parse_line()

    BegLine()

    if not match_spaces()
     return(false)
    endif

    if not match_curprev()
     return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( 1 )
     InsertText( CURPREV, _INSERT_ )
     EndLine()
     InsertText( ",", _OVERWRITE_ )
     PopBlock()
     PopPosition()
    endif

    if not match_spaces()
     return(false)
    endif

    if not match_hour()
     return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI )
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     PopBlock()
     PopPosition()
    endif

    if not match_literal(COLON)
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 )
     InsertText( COLON, _INSERT_ )
     PopBlock()
     PopPosition()
    endif

    if not match_minute()
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) )
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     PopBlock()
     PopPosition()
    endif

    if not match_literal(COMMA)
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 )
     InsertText( COMMA, _INSERT_ )
     PopBlock()
     PopPosition()
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_day()
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI )
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     EndLine()
     InsertText( ",", _OVERWRITE_ )
     PopBlock()
     PopPosition()
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_month()
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI + 3 + 1 )
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     EndLine()
     InsertText( ",", _OVERWRITE_ )
     PopBlock()
     PopPosition()
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_year()
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI + 3 + 1 + 10 )
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     EndLine()
     InsertText( ",", _OVERWRITE_ )
     PopBlock()
     PopPosition()
    endif

    if not match_literal(QUESTIONMARK)
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI + 3 + 1 + 10 )
     EndLine()
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     EndLine()
     InsertText( ",", _OVERWRITE_ )
     PopBlock()
     PopPosition()
    endif

    if not match_username()
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI + 3 + 1 + 10 + 15 )
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     EndLine()
     InsertText( ",", _OVERWRITE_ )
     PopBlock()
     PopPosition()
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_literal("talk")
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI + 3 + 1 + 10 + 15 + 20 )
     InsertText( "talk", _INSERT_ )
     EndLine()
     InsertText( ",", _OVERWRITE_ )
     PopBlock()
     PopPosition()
    endif

    if not match_optional_questionmark()
        return(false)
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_optional("contribs")
        return(false)
    endif

    if not match_optional("?")
        return(false)
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_optional_m()
        return(false)
    endif

    if not match_filesize()
        return(false)
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_literal("bytes")
        return(false)
    endif

    if not match_optional_plusminus()
        return(false)
    endif

    if not match_literal(QUESTIONMARK)
        return(false)
    endif

    if not match_spaces()
        return(false)
    endif

    if not match_optional_blocksign()
        return(false)
    endif

    if not match_text()
        return(false)
    else
     PushPosition()
     PushBlock()
     GotoBufferId( buffer1I )
     GotoColumn( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI + 3 + 1 + 10 + 15 + 40 )
     //
     PushBlock()
     PushPosition()
     GotoBufferId( buffer2I )
     BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
     DelLine()
     PopBlock()
     PopPosition()
     Paste()
     //
     // PasteFromWinClip()
     //
     PopBlock()
     PopPosition()
    endif


    downB = down()

    AddLine( "", buffer1I )

    BegLine()

    return(true)
end

// Function to skip zero or more spaces (always succeeds)
integer proc match_spaces()
    while isSpaceB(CurrChar())
        Right()
    endwhile
    return(true)
end

// Function to match the minute part based on the MINUTE regex
integer proc match_minute()
    return(match_regex(MINUTE))
end

// Function to match the day part based on the DAY regex
integer proc match_day()
    return(match_regex(DAY))
end

// Function to match the month part based on the MONTH regex
integer proc match_month()
    return(match_regex(MONTH))
end

// Function to match the year part based on the YEAR regex
integer proc match_year()
    return(match_regex(YEAR))
end

// Function to match the username based on the USERNAME regex
integer proc match_username()
    return(match_regex(USERNAME))
end

// Function to match the filesize based on the FILESIZE regex
integer proc match_filesize()
    return(match_regex(FILESIZE))
end

// Function to match the text based on the TEXT regex
integer proc match_text()
    return(match_regex(TEXT))
end

// Function to match a specific literal string
integer proc match_literal(string s)
    return(match_regex(s))
end

// Function to match an optional 'm'
integer proc match_optional_m()
    if CurrChar() == Asc('m')
        Right()
    endif
    return(true)
end

// Function to match an optional string
integer proc match_optional(string s)
    string line[255] = ""
    integer I = 0
    integer B = false
    line = GetText(1, 255)
    I = CurrCol()
    B = (SubStr(line, I, Length(s)) == s)
    if B
        Right(Length(s))
    endif
    return(true)
end

// Function to match an optional '?'
integer proc match_optional_questionmark()
    if CurrChar() == Asc('?')
        Right()
    endif
    return(true)
end

// Function to match an optional '+' or '-'
integer proc match_optional_plusminus()
    if CurrChar() == Asc('+') or CurrChar() == Asc('-')
        Right()
    endif
    return(true)
end

// Function to match an optional BLOCKSIGN
integer proc match_optional_blocksign()
    string line[255] = ""
    line = GetText(1, 255)  // read the current line before comparing
    if SubStr(line, CurrCol(), Length(BLOCKSIGN)) == BLOCKSIGN
        Right(Length(BLOCKSIGN))
    endif
    return(true)
end

// Define the main parsing procedure
integer proc parse_page()
    GotoBlockBegin()
    BegLine()
    REPEAT
       IF LFind( "curprev", "cig" )
        if not parse_line()
            return(false)
        endif
       ELSE
        downB = Down()
        BegLine()
       ENDIF
    UNTIL NOT isCursorInBlock() OR NOT downB
    return(true)
end

// Function to calculate the length of a regex match
integer proc matchLength(string regex)
    integer beginI = CurrCol()
    integer endI = 0
    PushPosition()
    PushBlock()
    LFind(Format(regex, "\c"), "cx")
    endI = CurrCol()
    PopBlock()
    PopPosition()
    return( endI - beginI )
end

// Function to check if the current position matches a regex
integer proc isRegexMatchB(string s)
    integer B = false
    string s1[255] = ""
    string line[255] = ""
    integer beginI = 0
    integer endI = 0
    PushPosition()
    PushBlock()
    line = GetText( 1, 255 )
    beginI = CurrCol()
    B = LFind(Format( s, "\c" ), "cx")
    endI = CurrCol()
    IF B
     s1 = SubStr( line, beginI, endI - beginI )
     // CopyToWinClip( s1 )
     AddLine( s1, buffer2I )
    ENDIF
    PopBlock()
    PopPosition()
    return( B )
end

// Function to match a regex and move the cursor appropriately
integer proc match_regex(string regex)
    if isRegexMatchB(regex)
        Right(matchLength(regex))  // Move cursor past the matched regex
        return(true)
    endif
    return(false)
end

// Function to match the curprev marker. Assumes the cursor is at the beginning of the line.
// The match result is ignored, so this function always returns true.
integer proc match_curprev()
    match_regex(CURPREV)
    return(true)
end

// Function to match the hour part based on the HOUR regex
integer proc match_hour()
    return(match_regex(HOUR))
end

// Main entry point for the script
proc main()
    //
    PushPosition()
    buffer1I = CreateTempBuffer()
    PopPosition()
    //
    PushPosition()
    buffer2I = CreateTempBuffer()
    PopPosition()
    //
    if not parse_page()
        Warn("Parsing failed.")
    else
        Message("Parsing completed successfully.")
        GotoBufferId( buffer1I )
    endif
end
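
A possible simplification, offered only as an untested sketch: most of the else-branches in parse_line() repeat the same sequence (go to a column in buffer1I, copy a line out of buffer2I, paste it, optionally end the line with a comma). That sequence could be factored into one helper. The name paste_token and its parameters colI and commaB are hypothetical and not part of the program above; the body only rearranges calls the program already uses and relies on its globals buffer1I and buffer2I.

```sal
// Hypothetical helper (an assumption, not part of the original program).
// Copies the current line of buffer2I to column colI of the current line
// in buffer1I; if commaB is true, it also ends that line with a comma.
proc paste_token(integer colI, integer commaB)
    PushPosition()
    PushBlock()
    GotoBufferId( buffer1I )
    GotoColumn( colI )
    // Copy the line from the scratch buffer, then delete it there
    PushBlock()
    PushPosition()
    GotoBufferId( buffer2I )
    BegLine() UnMarkBlock() MarkStream() EndLine() Left() MarkStream() Copy()
    DelLine()
    PopBlock()
    PopPosition()
    Paste()
    if commaB
        EndLine()
        InsertText( ",", _OVERWRITE_ )
    endif
    PopBlock()
    PopPosition()
end
```

With such a helper defined before parse_line(), the branch for the day field, for example, might shrink to:

```sal
    if not match_day()
        return(false)
    else
        paste_token( Length( CURPREV ) + extraI + 2 + Length( COLON ) + 2 + 1 + extraI, TRUE )
    endif
```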


===

7. And here is the EBNF-like notation used to create that TSE parser program:

[page] := [line]*

[line] := (curprev) (space)+ [hour](colon)[minute] (comma) (space)+ [day] (space)+ [month] (space)+ [year] (questionmark)
          [username] (space)+ (talk) (questionmark)? (space)+ (contribs) (questionmark) (space)+ (m)? [filesize] (space)+
          (bytes) {(+)|(-)} (questionmark) (space)+ (blocksign)? [text]

[hour] := 00..23

[day] := 01..31

[year] := 1900..2024

[minute] := 00..59

[username] := a..z+

[filesize] := 0..100000

[text] := a..z+

(colon) := ":"

(space) := " "

(questionmark) := "?"

(blocksign) := "[]"
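
To illustrate how this notation maps onto the parser: a sequence of elements in a rule becomes a chain of match_... calls that returns false at the first failure, while an optional element such as (m)? becomes a proc that advances only if the element is present and always returns true. A minimal sketch of the first case follows. The proc match_time is hypothetical (the program inlines these steps in parse_line()); it relies on match_hour, match_literal, match_minute and COLON as defined in the program.

```sal
// Hypothetical rule (an assumption, for illustration only):
// [time] := [hour] (colon) [minute]
integer proc match_time()
    if not match_hour()            // [hour]
        return(false)
    endif
    if not match_literal(COLON)    // (colon)
        return(false)
    endif
    return(match_minute())         // [minute]
end
```

As in the program, each successful match also appends the matched text to buffer2I, so a caller would still have to move those tokens to the output buffer.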

with friendly greetings
Knud van Eeden