Parse/parsing some text


Aleksandar

Feb 28, 2016, 12:40:45 PM
to automa...@googlegroups.com
I would like to know what you would suggest as the best way to parse text efficiently and extract part of it into a variable.
The most general approach would take some text as an input variable, no matter what the source was (SMS body text, notification title, HTTP response, ...).

In my case, I have to extract a number from some random text.

The text would be something like this (without the quotes):

"Hey dude, this number you need to extract: 123456. Thanks."

A more useful and more complicated target would be a mixed sequence, let's say:

"Hey dude, this sequence you need to extract: 1a2b3c4d5e6f. Thanks."

The most important for me is the first example, only the number, but if you have a more general solution for the second sequence, I'd like to see your suggestion.
Common sense tells me that character counting would be a very good solution when the wanted sequence has a fixed position and a fixed length (no matter what it is).
So, if I'm right, please tell me how to do that.

Also, if you agree, we could use this topic for all kinds of parsing, not only for my questions.

Best regards,
Aleksandar

Henrik "The Developer" Lindqvist

Feb 28, 2016, 1:54:24 PM
to Automate
Use regular expressions, i.e. the matches function.
matches("Hey dude, this number you need to extract: 123456. Thanks.", ".*([0-9]+).*")

The above expression will only match the first occurrence. To find all occurrences you'll need to use the split function (in version 1.1.11):
split("Hey dude, this number you need to extract: 123456. Thanks.", "([0-9]+)")

Aleksandar

Mar 1, 2016, 6:40:05 PM
to automa...@googlegroups.com
Hi Henrik,

I have managed to extract the wanted sequence, because it had a fixed position and a fixed length, using the substr function:

substr(variable_name, index, length)
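
For comparison outside Automate, fixed-position extraction corresponds to plain string slicing. The following is a minimal Python sketch, not Automate code: the helper name `substr` only mirrors the Automate function by analogy, a 0-based index is assumed, and the offset 43 applies to the sample sentence from the first post only.

```python
def substr(text, index, length):
    """Return `length` characters of `text`, starting at 0-based `index`."""
    return text[index:index + length]

text = "Hey dude, this number you need to extract: 123456. Thanks."

# In this particular sentence the digits start at offset 43 and are 6 long.
number = substr(text, 43, 6)
print(number)
```

This only works while the position and length never change; as soon as the surrounding text varies, a regular expression is the more robust choice.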

The examples you gave (the matches and split functions) didn't work.

I'm not very familiar with regular expressions, and I'm not sure why you defined the first regex that way.
Do you have any idea why they didn't give me any output?

Thanks.

Henrik "The Developer" Lindqvist

Mar 2, 2016, 2:38:29 PM
to Automate
Weird that it didn't work. This does:
matches("Hey dude, this number you need to extract: 123456. Thanks.", ".*?([0-9]+).*")[0]

Split does work as expected. Note that it will return an array where every odd element is the value, i.e. 1, 3, 5, etc.
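
The "odd elements" layout can be reproduced with Python's `re.split`, which keeps the text of a capturing group in its result. This is a sketch by analogy and assumes nothing about how Automate implements split; the second number in the sentence is a made-up addition to show more than one occurrence.

```python
import re

# Hypothetical sentence with two numbers, to show several occurrences.
text = "Hey dude, this number you need to extract: 123456. Call 789 too."

# Because the pattern has a capturing group, the captured digits are kept
# in the result, interleaved with the text between them.
parts = re.split(r"([0-9]+)", text)

# Elements at odd indices (1, 3, 5, ...) are the captured numbers.
numbers = parts[1::2]
print(numbers)
```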

Aleksandar

Mar 10, 2016, 8:11:26 AM
to Automate
I have tried the examples you offered, but none of them worked as expected.
I have managed to do the parsing using the "matches" and "split" functions, but in a totally different manner.
It looks a bit more complicated, and I owe you an explanation:

I used the "split" function to separate the input sentence into substrings, which become an array, and then a "for each" block to process the substrings.

The substrings contain the symbols ",", ":" and ".", which is why I had to clean every substring using the "replaceAll" function (replacing comma, colon and dot with spaces).

After that, I had to use the "trim" function to cut off the spaces produced by "replaceAll"; at that point, every element was a single word or number string.

Only at that moment (after breaking the input into substrings by space, replacing punctuation with spaces, and trimming the spaces again) was it possible to use the "matches" function inside the "for each" loop:

matches(split_and_trimmed_element, ".*?([0-9.]+).*")[0]
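
The workaround described above (split, replaceAll, trim, then matches per element) can be sketched in Python as follows. This is an illustration by analogy, not the actual flow: `extract_numbers` is a hypothetical helper, and `re.fullmatch` stands in for "matches" on the assumption that the whole element has to conform to the pattern.

```python
import re

def extract_numbers(text):
    """Split on spaces, strip punctuation, keep elements that contain digits."""
    found = []
    for token in text.split(" "):
        # Replace comma, colon and dot with spaces, then trim them away.
        cleaned = re.sub(r"[,:.]", " ", token).strip()
        # The whole cleaned element must conform to the pattern.
        m = re.fullmatch(r".*?([0-9.]+).*", cleaned)
        if m:
            # Index 0 is the whole element, which by now is just the number.
            found.append(m.group(0))
    return found

print(extract_numbers("Hey dude, this number you need to extract: 123456. Thanks."))
```

Index 0 is enough here only because each element has already been reduced to a bare word or number before matching.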

In conclusion, I would suggest adding a block that does the parsing without the need to adapt/adjust/break/trim the input string so much.
Some kind of simple regex parser that gives its output according to a regular expression.
Is that possible?

Thank you.

Henrik "The Developer" Lindqvist

Mar 10, 2016, 2:53:09 PM
to Automate
Why do you need to split, replaceAll, trim, etc.?
Using the matches function should suffice to extract the digits from the text as you requested.

Aleksandar

Mar 10, 2016, 5:23:59 PM
to Automate
Honestly, that is what surprised me, and it is also my bad experience with the "matches" function.
You have already answered what I wanted to know: it should extract only the digits, but it doesn't.

I have made test case(s), and the conclusions are:


matches("Hey dude, this number you need to extract: 123456. Thanks.", ".*?([0-9]+).*")[0]

gives the whole input text as output (not just the digits).

Maybe I misunderstood initially (I thought "matches" outputs just the product of the regex processing, e.g. only the extracted digits, as in my case),
but the output of "matches" is the whole input text, and ONLY IF the input text conforms to the regex.
I had to adapt the input string to be exactly what the regex describes. My impression was that I was constructing a (sub)string so that "matches" only made a decision and confirmed that I had found a number.

Please, could you make your own test case(s), using input string(s) with mixed characters, and try as many regexes as you can.
I'd also like to debug that, and that's the reason why I'm writing this.

Also, an observation and a question about the regex syntax of Automate functions:
Why is it not possible to use syntax such as \d, \D, \s, as given here:

https://en.wikipedia.org/wiki/Regular_expression#Examples

That would be much easier syntax in a lot of cases.
The backslash is dropped after entering text in the field of any block, so that syntax can't be used.

This could be a suggestion for upcoming updates (both the behaviour of "matches" and regex syntax with backslashes).
What do you think?

Thanks a lot.

Henrik "The Developer" Lindqvist

Mar 10, 2016, 6:43:31 PM
to Automate
Sorry, my bad, again. It should end in [1]

matches("Hey dude, this number you need to extract: 123456. Thanks.", ".*?([0-9]+).*")[1]
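
To see what the index and the extra `?` change, here is a small Python sketch by analogy. `re.fullmatch` is assumed to behave like Automate's matches in that the whole input must conform to the pattern; group 0 and group 1 play the role of `[0]` and `[1]`.

```python
import re

text = "Hey dude, this number you need to extract: 123456. Thanks."

lazy = re.fullmatch(r".*?([0-9]+).*", text)    # pattern from this post
greedy = re.fullmatch(r".*([0-9]+).*", text)   # pattern from the first reply

print(lazy.group(0))    # the entire matched text, i.e. the whole sentence
print(lazy.group(1))    # 123456
print(greedy.group(1))  # 6
```

The greedy `.*` swallows as much as it can and leaves only the last digit for the capture group, while the lazy `.*?` stops at the first digit, so the group receives the complete number.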

Aleksandar

Mar 11, 2016, 5:06:06 AM
to Automate
This really works :) !
Thank you very much, now I have reduced my flow significantly.

I have two questions; please help me understand what is going on:

1. What is the difference between using [0] and [1] after the "matches" function?

I can't see any explanation in the documentation:
http://llamalab.com/automate/doc/function/matches.html

It looks like [0] or [1] indexes the output array, but I don't understand how many elements the output array of "matches" has, nor that of other functions whose output is an array.

Please, could you explain that, or even update the documentation (to clarify it for everyone)?

2. Do you have a plan/possibility to support regex in the format \d, \D, \s, and so on?

Backslashes are not kept after confirming the block edit.

Again, thank you very much.
I hope you'll take into account the doubts I posted here, and explain (to me, or even better to everyone through the documentation) some things about the data/array format returned by each function.

Best regards,
Aleksandar

Henrik "The Developer" Lindqvist

Mar 11, 2016, 2:40:28 PM
to Automate
  1. The matches function returns an array in case of a match; the first element of the array is the entire matched text, and the remaining elements are any capture groups. I'll add this to the doc.
  2. Since text literals also use \ for escaping, you have to use \\ in regular expressions, i.e. \\d
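
Both points can be sketched in Python, again only by analogy. The helper `matches` below is hypothetical and simply builds the array layout described in point 1; returning `None` when nothing matches is an assumption of this sketch. The pattern for the mixed sequence from the first post is likewise an assumption, written in a normal (non-raw) string so that `\\d` is needed for the same reason as in point 2.

```python
import re

def matches(text, pattern):
    """Return [entire match, group 1, group 2, ...] or None if no match."""
    m = re.fullmatch(pattern, text)
    if m is None:
        return None
    return [m.group(0), *m.groups()]

text = "Hey dude, this sequence you need to extract: 1a2b3c4d5e6f. Thanks."

# In a normal string literal the backslash must be doubled: "\\d" is regex \d.
pattern = ".*?((?:\\d[a-z])+).*"

result = matches(text, pattern)
print(len(result))   # one entry for the entire match plus one per capture group
print(result[1])     # the mixed digit/letter sequence
```

The inner `(?:...)` is a non-capturing group, so the array has exactly two elements: index 0 for the entire text and index 1 for the sequence.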

Twiz a

Nov 19, 2017, 11:35:07 AM
to Automate
Hi,

I'm having problems with this regex:

".*?([0-9]+).*"

I get this error:

"illegal character ''x"

I am using the set value block; am I using the wrong block?

I'm not too familiar with regular expressions.

Thanks

Twiz a

Nov 19, 2017, 11:46:49 AM
to Automate
Sorry, using the set variable block

Aleksandar

Nov 20, 2017, 4:35:05 AM
to automa...@googlegroups.com
The Variable set block works fine for regex.
This flow works fine; try comparing it with yours and drawing some conclusions:

http://llamalab.com/automate/community/flows/11290

Try to cross-check for other possible syntax errors, and please share with us anything you notice.

Twiz a

Nov 21, 2017, 6:17:30 PM
to Automate
Thanks Aleksandar,

I actually found another way; I was overthinking my method.

So I must have had a typo somewhere.

Thanks again.