Read a number from a string


AG

Oct 14, 2021, 6:59:03 AM
to idl-pvwave
Hello,
I have a file that contains data like this:
******************************************
 3d6.4s2
 7s6.(4F).3p1
3t6.6d3
 2l6.(5G).4s2
 3d6.(5D).4s.3p.(5P*)
 3d7.(2G).2s
 3d6.(5D).4s.4p.(3P*)
******************************************
I want to extract the number next to the last letter, except when the last letter has a star; in that case I want the number next to the letter before it. For example, I want to extract the following numbers from the data above:
4
3
6
4
3
2
4

Is it possible to do this using IDL?
If not, can I do this using Linux commands?

Brian G

Oct 14, 2021, 4:36:43 PM
to idl-pvwave
You can extract the desired number with some regular expression wizardry.  You can use either STREGEX or IDL_String.EXTRACT, they both work with the same regex.
The regular expression I found that works is '.*([0-9][a-z][0-9]?)(\.\([0-9][A-Z]\*\))?$'.  Let's break this down and explain what it is doing.
There are 4 components to the expression:
  • .*
  • ([0-9][a-z][0-9]?)
  • (\.\([0-9][A-Z]\*\))?
  • $
The first allows any number of characters to precede the rest.  This handles the case of multiple pieces separated by dots in your strings.

The second will match any "number/letter" or "number/letter/number" string.  Some of your examples were only 2 characters, number/letter, while others were 3 number/letter/number, so I made the third character optional with the ?.  If you need upper case support too, then you can replace [a-z] with [a-zA-Z].  This entire subexpression is surrounded in parentheses, so that it can be extracted.

The third will match any ".(number/letter*)" string.  I had to escape the dot, since that normally means any character.  It looks like the * strings were always surrounded in parentheses, so I added them, escaping each with \ to make it mean just that parenthesis.  I also had to escape the *, since that is normally a cardinality modifier meaning 0 or more occurrences of the previous character.  In this case, I made the character uppercase only, but if you need both upper and lower case, then replace [A-Z] with [a-zA-Z].  This subexpression is also surrounded in parentheses, so that it can be made optional by the ?.  It will also be extracted, but we don't need that string.

The fourth is $, which anchors the match to the end of the string: the string must end with either the second subexpression alone or the second followed by the third.
For strings that do not have a starred substring, the third subexpression will not be present, and we will extract only the second subexpression, which is what we want.
For strings that do have the starred substring, the third subexpression will be present, and we will extract both the second and third subexpressions, but you only want the second.
In both cases, STREGEX(/EXTRACT, /SUBEXPR) will return the full string first, and the subexpression(s) after.  In both cases, it is the first subexpression that you want, which would be the second element returned by STREGEX.  Once you have that subexpression, you can just pass it into FIX() to convert to an int.  It will stop when it sees the first non-digit, which will yield the number preceding that letter.
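To make the return value described above concrete, here is a minimal sketch that inspects the raw STREGEX result for one of the sample strings. It uses only the regular expression and keywords already discussed; the variable names s and res are just for illustration.

```idl
; Sketch: look at each element returned by STREGEX(/EXTRACT, /SUBEXPR).
s = '3d6.(5D).4s.3p.(5P*)'
res = STREGEX(s, '.*([0-9][a-z][0-9]?)(\.\([0-9][A-Z]\*\))?$', /EXTRACT, /SUBEXPR)

print, res[0]        ; the full matched string
print, res[1]        ; first parenthesized subexpression: the stanza we want
print, res[2]        ; second subexpression: the starred stanza, empty if absent
print, FIX(res[1])   ; leading digit of the wanted stanza, converted to an integer
```

If the pattern does not match at all, every element of res is an empty string, so res[0] is the element to test before converting.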
Here is the whole code (showing how to use IDL_STRING::EXTRACT instead of STREGEX if you choose):

function get_last_number_from_string, str
  compile_opt idl2

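  ; Subexpression 1 is the last <digit><lowercase letter>[<digit>] stanza;
  ; subexpression 2 is the optional trailing .(<digit><LETTER>*) stanza.
  res = STREGEX(str, '.*([0-9][a-z][0-9]?)(\.\([0-9][A-Z]\*\))?$', /EXTRACT, /SUBEXPR)
;  res = str.Extract('.*([0-9][a-z][0-9]?)(\.\([0-9][A-Z]\*\))?$', /SUBEXPR)
  ; With /SUBEXPR, STREGEX returns one element per subexpression even when
  ; nothing matches, so test for an empty match instead of the element count.
  if (res[0] eq '') then begin
    Message, 'No match found.'
  endif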
  return, FIX(res[1])
end

pro newsgroup_read_number_from_string
  compile_opt idl2

  lines = ['3d6.4s2', $
           '7s6.(4F).3p1', $
           '3t6.6d3', $
           '2l6.(5G).4s2', $
           '3d6.(5D).4s.3p.(5P*)', $
           '3d7.(2G).2s', $
           '3d6.(5D).4s.4p.(3P*)']

  foreach l, lines do begin
    print, l, get_last_number_from_string(l)
  endforeach
end

Brian Griglak
IDL Tech Lead

Bernat

Oct 18, 2021, 8:34:19 AM
to idl-pvwave
In my library I use a more complete regular expression that also takes exponent notation into account (e.g. 6e-9):
is_number=stregex(str,'^(\+|-)?[0-9]+(\.[0-9]*)?((e|E|d|D){1}(\+|-)?[0-9]+)?$',/boolean)
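A short usage sketch of that test, assuming a handful of made-up sample strings (the samples array is hypothetical, not from the thread). STREGEX accepts a string array, so the check can be applied to all of them at once.

```idl
; Sketch: apply the number test to several hypothetical strings in one call.
pattern = '^(\+|-)?[0-9]+(\.[0-9]*)?((e|E|d|D){1}(\+|-)?[0-9]+)?$'
samples = ['42', '-3.14', '6e-9', '1.5D+3', '3d6.4s2', 'abc']

; One flag per input string: 1 where the whole string is a number, 0 otherwise.
is_number = STREGEX(samples, pattern, /BOOLEAN)
print, is_number
```

Because the pattern is anchored with ^ and $, the whole string has to be a number. The first four samples are meant to pass, while '3d6.4s2' and 'abc' are meant to fail. This is a different task from pulling a digit out of a longer configuration string.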

Best,
B

AG

Oct 18, 2021, 9:17:13 AM
to idl-pvwave
Thank you very much Brian Griglak.

Thanks, Bernat

AG

Nov 20, 2021, 3:04:24 AM
to idl-pvwave
If one of the lines is '3d6.(5D).4s.4p.(3P)', the function returns 0, but the correct number is 4. How can I modify the function to give the correct value in this case?

Brian G

Nov 22, 2021, 8:34:09 PM
to idl-pvwave
AG -
  Your original post said to skip the last stanza if it contained an asterisk.  This new string does not contain any asterisks, but you want it to skip the stanza with the capital P, correct?  If you want to be able to skip the last stanza when it has either an asterisk or a capital letter, then a slight modification can be made to the regular expression to make that asterisk optional:
function get_last_number_from_string, str
  compile_opt idl2

  res = STREGEX(str, '.*([0-9][a-z][0-9]?)(\.\([0-9][A-Z]\*?\))?$', /EXTRACT, /SUBEXPR)
  if (res[0] eq "") then begin
    Message, 'No match found.'
  endif
  return, FIX(res[1])
end

If you need to skip the last stanza with lower case letters too, then this would need to be modified to account for that.  It will currently skip stanzas of the form "(<digit><letter>*)" and "(<digit><letter>)" only.
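One possible way to cover the lower-case stanzas as well is sketched below, using the [A-Z] to [a-zA-Z] substitution mentioned earlier in the thread. The function name get_last_number_any_case is hypothetical, and this variant has not been run against real data files.

```idl
; Sketch: also skip a trailing "(<digit><letter>)" or "(<digit><letter>*)"
; stanza when the letter is lower case. The function name is hypothetical.
function get_last_number_any_case, str
  compile_opt idl2

  ; Only the character class inside the parenthesized stanza changes.
  res = STREGEX(str, '.*([0-9][a-z][0-9]?)(\.\([0-9][a-zA-Z]\*?\))?$', /EXTRACT, /SUBEXPR)
  if (res[0] eq '') then begin
    Message, 'No match found.'
  endif
  return, FIX(res[1])
end
```

The wanted stanza is still restricted to lower-case letters via [a-z], so a string whose last unparenthesized stanza uses an upper-case letter would still fail to match.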

AG

Nov 27, 2021, 6:13:03 AM
to idl-pvwave
Thank you very much