Gscan 3

Егор Ульянов
Aug 4, 2024, 10:55:20 PM
The general story is this: Alex provides a basic interface to the generated lexer (described in the next section), which you can use to parse tokens given an abstract input type with operations over it. You also have the option of including a wrapper, which provides a higher-level abstraction over the basic interface; Alex comes with several wrappers.

Depending on how you use Alex, the fact that Alex uses UTF-8 encoding internally may or may not affect you. If you use one of the wrappers (below) that takes input from a Haskell String, then the UTF-8 encoding is handled automatically. However, if you take input from a ByteString, then it is your responsibility to ensure that the input is properly UTF-8 encoded.


None of this applies if you use the --latin1 option to Alex or specify a Latin-1 encoding via a %encoding declaration. In that case, the input is just a sequence of 8-bit bytes, interpreted as characters in the Latin-1 character set.


If you compile your Alex file without a %wrapper declaration, then you get access to the lowest-level API to the lexer. You must provide definitions for the following, either in the same module or imported from another module:
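As a concrete illustration, here is a minimal sketch of the definitions the lowest-level API expects. It assumes a Latin-1 lexer (via --latin1 or a %encoding declaration) so that each Char maps to a single byte; a UTF-8 lexer would need to buffer the multi-byte encoding of each character.

```haskell
import Data.Word (Word8)
import Data.Char (ord)

-- The input state: the previously read character and the remaining input.
type AlexInput = (Char, String)

-- Return the next byte of input, or Nothing at end of input.
alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte (_, [])       = Nothing
alexGetByte (_, c : rest) = Just (fromIntegral (ord c), (c, rest))

-- The character immediately before the current position
-- (used by patterns with a left context).
alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar (prev, _) = prev
```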


Once you have the action, it is up to you what to do with it. The type of action could be a function which takes the String representation of the token and returns a value in some token type, or it could be a continuation that takes the new input and calls alexScan again, building a list of tokens as it goes.
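A driver in the second style might look like the following sketch. It assumes AlexInput is a pair of (previous Char, remaining String), and a hypothetical Token type whose actions have type String -> Token; alexScan and the AlexReturn constructors come from the generated lexer.

```haskell
lexAll :: AlexInput -> [Token]
lexAll inp@(_, str) =
  case alexScan inp 0 of            -- 0 is the default startcode
    AlexEOF                -> []
    AlexError _            -> error "lexical error"
    AlexSkip  inp' _       -> lexAll inp'                        -- e.g. whitespace
    AlexToken inp' len act -> act (take len str) : lexAll inp'
```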


This is pretty low-level stuff; you have complete flexibility about how you use the lexer, but there might be a fair amount of support code to write before you can actually use it. For this reason, we also provide a selection of wrappers that add some common functionality to this basic scheme. Wrappers are described in the next section.


The basic wrapper provides definitions for AlexInput, alexGetByte and alexInputPrevChar that are suitable for lexing a String input. It also provides a function alexScanTokens which takes a String input and returns a list of the tokens it contains.
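Using the basic wrapper is then a one-liner; here Token stands for whatever token type the lexer's own Haskell code defines:

```haskell
-- Hypothetical use of the basic wrapper's generated entry point.
main :: IO ()
main = print (alexScanTokens "let x = 5" :: [Token])
```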


The monad wrapper is the most flexible of the wrappers provided with Alex. It includes a state monad which keeps track of the current input, the text position, and the startcode. It is intended to be a template for building your own monads: feel free to copy the code and modify it to build a monad with the facilities you need.
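A typical driver for the monad wrapper repeatedly calls alexMonadScan until end of input. This sketch assumes a hypothetical Token type with a TokenEOF constructor returned by the user-supplied alexEOF action; runAlex and alexMonadScan are provided by the wrapper.

```haskell
scanAll :: Alex [Token]
scanAll = do
  tok <- alexMonadScan
  if tok == TokenEOF
    then return []
    else (tok :) <$> scanAll

-- Left is a lexical error message, Right the token list.
lexProgram :: String -> Either String [Token]
lexProgram s = runAlex s scanAll
```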


The gscan wrapper is provided mainly for historical reasons: it exposes an interface which is very similar to that provided by Alex version 1.x. The interface is intended to be very general, allowing actions to modify the startcode, and pass around an arbitrary state value.


The basic-bytestring, posn-bytestring and monad-bytestring wrappers are variations on the basic, posn and monad wrappers that use lazy ByteStrings as the input and token types instead of an ordinary String.


The point of using these wrappers is that ByteStrings provide a more memory-efficient representation of an input stream. They can also be somewhat faster to process. Note that using these wrappers adds a dependency on the ByteString modules, which live in the bytestring package (or in the base package in ghc-6.6).


As mentioned earlier (Unicode and UTF-8), Alex lexers internally process a UTF-8 encoded string of bytes. This means that the ByteString supplied as input when using one of the ByteString wrappers should be UTF-8 encoded (or use either the --latin1 option or the %encoding declaration).
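One way (an assumption on our part, not the only route) to obtain correctly UTF-8 encoded lazy ByteString input is to go through the text package:

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Encode a Haskell String as a UTF-8 lazy ByteString,
-- suitable for a *-bytestring wrapper's entry point.
utf8Input :: String -> BL.ByteString
utf8Input = TLE.encodeUtf8 . TL.pack
```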


All of the actions in your lexical specification have the same type as in the monadUserState wrapper. Only the type of the function that runs the monad and the type of the token function change.


The point of using these wrappers is that Texts provide a more memory-efficient representation of an input stream. They can also be somewhat faster to process. Note that using these wrappers adds a dependency on the Data.Text modules, which live in the text package.


The %token directive can be used to specify the token type when any kind of %wrapper directive has been given. Whenever %token is used, the %typeclass directive can also be used to specify one or more typeclass constraints. The following shows a simple lexer that makes use of this to interpret the meaning of tokens using the Read typeclass:
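The example itself did not survive in this copy; the following is a hedged reconstruction of what such a lexer might look like, with the token type, constructor names and rules being our own illustrative choices:

```haskell
%wrapper "basic"
%token "Token s"
%typeclass "Read s"

tokens :-

  [0-9a-zA-Z]+  { mkToken }
  [ \t\n]+      ;

{
-- A polymorphic token type: the payload is read from the matched text.
data Token s = Tok s
  deriving Show

mkToken :: Read s => String -> Token s
mkToken = Tok . read
}
```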
