Parsing discussion (#16981)


Scott Jones

Jun 17, 2016, 11:52:43 AM
to julia-dev
I'd like to bring up some important parsing functionality (which can also be critical for performance) that hasn't been addressed yet in the discussion on parsing (https://github.com/JuliaLang/julia/pull/16981).

One important case is being able to start parsing from a particular position in an abstract string and to keep parsing until a valid value has been parsed, the end of the string is reached, or a terminating character is reached.
For parsing numbers in particular, some things should be optional, such as whether there is a separator character that should be ignored (for example ',', '.' (European style), '_', or '\'').

Without this sort of functionality, you end up having to create substrings (which can be expensive) just to call the parse function, and you have to figure out where the end of the integer is before doing so, so you effectively parse it twice.
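
To make the request concrete, here is a minimal sketch of what I mean, not a proposed API. The name `parse_digits` and the `separator` keyword are made up for illustration and exist neither in Base nor in the PR. It handles only unsigned base-10 digits, does no overflow checking, and does not validate group sizes.

```julia
# Hypothetical helper (not in Base or the PR): parse unsigned base-10 digits
# from `s`, starting at index `pos`, without creating a substring.
# Returns (value, next), where `next` is the index just past the last digit
# used, and `value` is `nothing` if no digit was found at `pos`.
function parse_digits(s::AbstractString, pos::Integer = firstindex(s);
                      separator::Union{Char,Nothing} = nothing)
    value = 0        # accumulated value (a plain Int, overflow is not checked)
    stop = pos       # index just past the last digit consumed
    seen = false     # have we seen at least one digit?
    i = pos
    while i <= lastindex(s)
        c = s[i]
        if isdigit(c)
            value = 10 * value + (c - '0')
            seen = true
            stop = nextind(s, i)
        elseif !(seen && c == separator)
            # Anything that is not a digit or an ignorable separator
            # terminates the number.
            break
        end
        i = nextind(s, i)
    end
    result = seen ? value : nothing
    return (result, stop)
end
```

With this, `parse_digits("id=1,234;rest", 4; separator=',')` gives `(1234, 9)`, so the caller can continue at the `';'` without ever slicing the string, while the same call without a separator stops at the comma and gives `(1, 5)`.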

-Scott

Jacob Quinn

Jun 17, 2016, 11:59:39 AM
to juli...@googlegroups.com
Definitely a good point and one that is certainly on my own roadmap. The initial PR you're referencing was a deliberate attempt to start by changing as little in Base as possible, while still achieving *much* better performance and a slightly better framework (IO-based). Note in the PR the use of a

immutable Options{B}
end

type, which probably seems useless right now. The plan is to use it as a way to accumulate various "parsing options": delimiters, custom null values, etc.

-Jacob

Scott Jones

Jun 17, 2016, 12:13:37 PM
to julia-dev, quinn....@gmail.com
Yes, your POC is already a great advance over what is currently present!

I was just thinking of something that parses the value starting at a particular position (default = 1), possibly with an ending position, and returns a tuple with a status, the position past the last character used, and the value (if it fits in the type). That could then be used to implement the other, simpler parse methods.
The different statuses could be:
1) valid value returned, all characters consumed
2) valid value returned, not all characters consumed (position after the last character used is returned)
3) syntactically valid value found that doesn't fit in the requested type, all characters consumed
4) same as 3, but not all characters consumed (position after the last character used is returned)
5) no valid value found (position still updated, which is useful if, as in your PR, skipping whitespace is allowed)
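
A rough sketch of how those five statuses might look in code, purely as an illustration. `ParseStatus`, its member names, and `tryparse_at` are hypothetical names, not anything from the PR, and the sketch only handles unsigned base-10 digits with optional leading whitespace.

```julia
# Hypothetical status codes, numbered as in the list above.
@enum ParseStatus begin
    OK_ALL            # 1) valid value, all characters consumed
    OK_PARTIAL        # 2) valid value, stopped before the end
    OVERFLOW_ALL      # 3) valid syntax, doesn't fit in T, all consumed
    OVERFLOW_PARTIAL  # 4) valid syntax, doesn't fit in T, stopped early
    NO_VALUE          # 5) no valid value found
end

# Hypothetical function: returns (status, next_position, value_or_nothing).
function tryparse_at(::Type{T}, s::AbstractString,
                     pos::Integer = firstindex(s)) where {T<:Integer}
    i = pos
    stop = lastindex(s)
    # Skip leading whitespace, so the returned position is updated
    # even when no value is found.
    while i <= stop && isspace(s[i])
        i = nextind(s, i)
    end
    start = i
    value = zero(T)
    overflowed = false
    while i <= stop && isdigit(s[i])
        if !overflowed
            # Checked arithmetic so a value that doesn't fit is reported,
            # not silently wrapped.
            value, f1 = Base.Checked.mul_with_overflow(value, T(10))
            value, f2 = Base.Checked.add_with_overflow(value, T(s[i] - '0'))
            overflowed = f1 | f2
        end
        # Keep consuming digits after overflow so the position is still right.
        i = nextind(s, i)
    end
    i == start && return (NO_VALUE, i, nothing)
    at_end = i > stop
    if overflowed
        return (at_end ? OVERFLOW_ALL : OVERFLOW_PARTIAL, i, nothing)
    end
    return (at_end ? OK_ALL : OK_PARTIAL, i, value)
end
```

A simpler `parse(T, s)` could then be a thin wrapper that accepts only `OK_ALL` and throws otherwise, while a delimited-file reader would accept `OK_PARTIAL` and check the character at the returned position.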

-Scott