Example for:
-10 we get 10 but want -10 of course
I am new to Regular Expressions and have played with the syntax but cannot
get it work properly. Currently we are using
"\D+(\d+)"
I note that your expression will only capture integers. And also will not
capture an integer at the start of a line.
For integers with an optional sign, and to allow capture of positive integers
at the beginning of a line, try:
"[-+]?\b\d+\b"
--ron
Based on my researches, if we want to identify the negative number, we can
use the following regular expressions,
^(-)?\D+(\d+)
Best regards,
Ji Zhou
Microsoft Online Community
Hope this helps!
Best regards,
Ji Zhou
Microsoft Online Community Support
c.Offset(0, i + 1).Value = RegexMid(myStr, sURLdate, "bl gb")
c.Offset(0, i + 2).Value = RegexMid(myStr, sURLdate, "br gb")
c.Offset(0, i + 3).Value = RegexMid(myStr, sURLdate, "class=gb")
Private Function RegexMid(s As String, sDate As String, sTempType As String)
As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Global = True
re.Pattern = "\b" & sDate & "/DailyHistory[\s\S]+?" & sTempType & "\D+(\d+)"
If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc(0).submatches(0)
End If
Set re = Nothing
End Function
***Here is an example of the data from the source page that we are trying to
pull from
</tr>
</tbody>
<tbody>
<tr>
<td><a href="/history/airport/KSTP/2010/1/1/DailyHistory.html">1</a></td>
<td class="bl gb">
8
</td>
<td class="gb">
0
</td>
<td class="br gb">
-8
</td>
Thank you!
"(\d+|([-+]\d+))"
the pipe character is an or function to either have the minus sign or
not have the minus sign
--
joel
------------------------------------------------------------------------
joel's Profile: 229
View this thread: http://www.thecodecage.com/forumz/showthread.php?t=168082
[url="http://www.thecodecage.com"]Microsoft Office Help[/url]
I tried using your suggestion, but it won't pull any numbers, much less the
negatives.
The closest I got was using:
"(\D+(-)?\d+)"
but it pulls
">-8
instead of
-8
or
>-8
instead of
-8
"joel" wrote:
>
> You need to OR to expressions like I did below
>
> "(\d+|([-+]\d+))"
>
> the pipe character is an or function to either have the minus sign or
> not have the minus sign
>
>
> --
> joel
> ------------------------------------------------------------------------
> joel's Profile: 229
> View this thread: http://www.thecodecage.com/forumz/showthread.php?t=168082
>
> [url="http://www.thecodecage.com"]Microsoft Office Help[/url]
>
> .
>
"(\D+(-)?\d+)"
1) \d+ the backslash d say take one digit, the plus says take all
didgits folowing the first digit (digit is 0 to 9)
2) ? says match any character before the string of digits 0 to 9.
3) "(\d+|([-+]\d+))" the first part is the same as the \d+ above. I
added the pipe to have an Or'd conditions. Te square brackets say to
match either character the plus sign or the minus sign.
I think you have a space in front of the minus sign. Then you need
this
"(\d+|([-+]\b\d+))"
You could also try this which will accept either a space between the
plus and minus sign of no space.
"(\d+|([-+]\b\d+))|([-+]\d+)"
--
joel
------------------------------------------------------------------------
joel's Profile: 229
View this thread: http://www.thecodecage.com/forumz/showthread.php?t=168082
[url="http://www.thecodecage.com"]Microsoft Office Help[/url]
>Joel,
>
>I tried using your suggestion, but it won't pull any numbers, much less the
>negatives.
>
>The closest I got was using:
>"(\D+(-)?\d+)"
>
>but it pulls
>">-8
>instead of
>-8
>
>or
>>-8
>instead of
>-8
I'm just curious about what happened with my suggestion?
This doesn't seem to grab any numbers at all. Perhaps there is more info I
can provide from the script below:
Sorry I posted this reply earlier.
"[-+]?\b\d+\b"
This doesn't seem to grab any numbers at all. Perhaps there is more info I
can provide from the script below:
Sub UseScript()
Const ReadFile = "c:\temp\test.txt"
Const ForReading = 1, ForWriting = 2, _
ForAppending = 3
Set fs = CreateObject("Scripting.FileSystemObject")
Set fin = fs.OpenTextFile(ReadFile, _
ForReading, TristateFalse)
MyStr = fin.readall
fin.Close
URLdate = DateValue("1/1/2010")
sURLdate = Format(URLdate, "/yyyy/d/m")
Set c = Range("A1")
'find html header
FindStr = "/history/airport/KSTP" & sURLdate & "/DailyHistory.html"
StartSearch = 1
SearchStr = Mid(MyStr, StartSearch)
Set Results = RegexMid(SearchStr, FindStr)
If IsNull(Results) Then
MsgBox ("Did not find string : " & FindStr)
Exit Sub
End If
'find Bl gb number
Firstbyte = Results.Item(0).firstIndex
NumberBytes = Results.Item(0).Length
FindStr = "bl gb"
StartSearch = StartSearch + Firstbyte + NumberBytes
SearchStr = Mid(MyStr, StartSearch)
Set Results = RegexMid(SearchStr, FindStr)
If IsNull(Results) Then
MsgBox ("Did not find string : " & FindStr)
Exit Sub
End If
Firstbyte = Results.Item(0).firstIndex
NumberBytes = Results.Item(0).Length
FindStr = "(\d+|([-+]\d+))"
StartSearch = StartSearch + Firstbyte + NumberBytes
SearchStr = Mid(MyStr, StartSearch)
Set Results = RegexMid(Mid(MyStr, StartSearch), FindStr)
If IsNull(Results) Then
MsgBox ("Did not find string : " & FindStr)
Exit Sub
End If
c.Offset(0, i + 1).Value = Results.Item(0).Value
'find class=gb number
Firstbyte = Results.Item(0).firstIndex
NumberBytes = Results.Item(0).Length
FindStr = "class=""gb"""
StartSearch = StartSearch + Firstbyte + NumberBytes
SearchStr = Mid(MyStr, StartSearch)
Set Results = RegexMid(SearchStr, FindStr)
If IsNull(Results) Then
MsgBox ("Did not find string : " & FindStr)
Exit Sub
End If
Firstbyte = Results.Item(0).firstIndex
NumberBytes = Results.Item(0).Length
FindStr = "(\d+|([-+]\d+))"
StartSearch = StartSearch + Firstbyte + NumberBytes
SearchStr = Mid(MyStr, StartSearch)
Set Results = RegexMid(Mid(MyStr, StartSearch), FindStr)
If IsNull(Results) Then
MsgBox ("Did not find string : " & FindStr)
Exit Sub
End If
c.Offset(0, i + 2).Value = Results.Item(0).Value
'find Br gb number
Firstbyte = Results.Item(0).firstIndex
NumberBytes = Results.Item(0).Length
FindStr = "Br gb"
StartSearch = StartSearch + Firstbyte + NumberBytes
SearchStr = Mid(MyStr, StartSearch)
Set Results = RegexMid(SearchStr, FindStr)
If Results Is Nothing Then
MsgBox ("Did not find string : " & FindStr)
Exit Sub
End If
Firstbyte = Results.Item(0).firstIndex
NumberBytes = Results.Item(0).Length
FindStr = "(\d+|([-+]\d+))"
StartSearch = StartSearch + Firstbyte + NumberBytes
SearchStr = Mid(MyStr, StartSearch)
Set Results = RegexMid(Mid(MyStr, StartSearch), FindStr)
If Results Is Nothing Then
MsgBox ("Did not find string : " & FindStr)
Exit Sub
End If
c.Offset(0, i + 3).Value = Results.Item(0).Value
End Sub
Private Function RegexMid( _
ByVal s As String, _
ByVal FindStr) As Variant
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Global = True
re.Pattern = FindStr
If re.test(s) = True Then
Set mc = re.Execute(s)
Set RegexMid = mc
Else
Set RefexMid = Nothing
End If
Set re = Nothing
End Function
>Ron,
There is no question that the regex I provided should return all signed and
unsigned integers, as I understand things.
There is also no question that your regex of \D+(\d+) will never capture a
negative integer into the capturing group, since the "-" will always be
captured by the \D.
The problem is in the rest of your regex. It could be that either sDate or
sTempType is not properly formed.
For example, the following three regex's return the desired numbers in
capturing group 1 when run against your test data:
"\b/2010/1/1/DailyHistory[\s\S]+?""bl gb""[\s\S]+?-?\b(\d+)\b"
--> 8
"\b/2010/1/1/DailyHistory[\s\S]+?""gb""[\s\S]+?-?\b(\d+)\b"
--> 0
"\b/2010/1/1/DailyHistory[\s\S]+?""br gb""[\s\S]+?(-?\b\d+)\b"
--> -8
Above is what your regex should look like after you finish concatenating things
together.
So in addition to your problem with the part of your regex that captures the
integer, which should be corrected by using my suggestion, you need to evaluate
the rest of the regex and how it is being constructed.
--ron
I finally figure it out. You need to put parenthsis around the two sub
parts of the search string. the first part is the tag and the second
part is the number. Then use the submatches prperty to get the 2nd
submatch.
See code below
Sub UseScript()
Const ReadFile = "c:\temp\test.txt"
Const ForReading = 1, ForWriting = 2, _
ForAppending = 3
Set fs = CreateObject("Scripting.FileSystemObject")
Set fin = fs.OpenTextFile(ReadFile, _
ForReading, TristateFalse)
MyStr = fin.readall
fin.Close
URLdate = DateValue("1/1/2010")
sURLdate = Format(URLdate, "/yyyy/d/m")
Set c = Range("A1")
FindStr = "(bl gb[\s\S]+?)(\d+|([-+]\d+))"
SearchStr = MyStr
Results = RegexMid(SearchStr, FindStr)
c.Offset(0, i + 1).Value = Results
FindStr = "(class=""gb""[\s\S]+?)(\d+|([-+]\d+))"
SearchStr = MyStr
Results = RegexMid(SearchStr, FindStr)
c.Offset(0, i + 2).Value = Results
FindStr = "(Br gb[\s\S]+?)(\d+|([-+]\d+))"
SearchStr = MyStr
Results = RegexMid(SearchStr, FindStr)
c.Offset(0, i + 3).Value = Results
End Sub
Private Function RegexMid( _
ByVal s As String, _
ByVal FindStr)
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Global = True
re.Pattern = FindStr
If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc.Item(0).submatches(1)
End If
Set re = Nothing
End Function
--
joel
------------------------------------------------------------------------
joel's Profile: 229
View this thread: http://www.thecodecage.com/forumz/showthread.php?t=168082
[url="http://www.thecodecage.com"]Microsoft Office Help[/url]
>
>I tested Ron's code and it didn't return what you really need. Ron's
>string returns everthing to the end of the number you are looking for.
>
My "code" was merely a regex which is quite different from your code.
If implemented properly, it returns only the number. If you are returning only
what you say, then you probably are not implementing it properly.
To demonstrate some code segments, we could set up the following, with the data
in an Excel worksheet. And code similar to what the OP posted, with a few
minor changes to correct his errors.
I just put MyStr into A1 for testing, not having access to the rest of the OP's
code. And, by setting c=A2 and leaving i=0, the OP's sub would place the
results in B2:D2.
A1: (the sample text given by the OP
</tr>
</tbody>
<tbody>
<tr>
<td><a href="/history/airport/KSTP/2010/1/1/DailyHistory.html">1</a></td>
<td class="bl gb">
8
</td>
<td class="gb">
0
</td>
<td class="br gb">
-8
</td>
Then use this routine -- quite similar to that of the OP:
====================================
Option Explicit
Sub TestExtract()
Dim i As Long
Dim c As Range
Dim myStr As String
Dim sURLdate As String
Set c = Range("A2")
myStr = Range("A1").Value
sURLdate = Format(CDate("1/1/2010"), "/yyyy/m/d/")
c.Offset(0, i + 1).Value = RegexMid(myStr, sURLdate, "bl gb")
c.Offset(0, i + 2).Value = RegexMid(myStr, sURLdate, "br gb")
c.Offset(0, i + 3).Value = RegexMid(myStr, sURLdate, "class=""gb""")
End Sub
'------------------------------------------------------------------
Private Function RegexMid(s As String, sDate As String, sTempType As String) _
As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.ignorecase = True
re.Pattern = sDate & "DailyHistory[\s\S]+?" & _
sTempType & "[\s\S]+?(-?\b\d+)\b"
If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc(0).submatches(0)
End If
End Function
=================================
As expected, this returns into B2, C2 and D2 the signed integers following the
sTempType variables.
Note also the extra quote marks required in the third call, c/w the OP's.
Also note the terminal regex representing the signed integer.
>
>I finally figure it out. You need to put parenthsis around the two sub
>parts of the search string. the first part is the tag and the second
>part is the number. Then use the submatches prperty to get the 2nd
>submatch.
>
Note that your code for a signed integer:
(\d+|([-+]\d+))
could be more simply expressed as
([-+]?\d+)
Why did you choose to use alternation?
--ron
In order to get a proper match I have to use the \ character before
the / characters in the Format function, like this
sURLdate = Format(CDate("1/1/2010"), "\/yyyy\/m\/d\/")
Maybe this has something to do with the Regional and Language
settings.
I also have a question. What is the function of the two \b in the
regexp? I get a match even without them.
And the negative number is also matched.
Lars-�ke
I found the explanation to \b (word boundary)
http://msdn.microsoft.com/en-us/library/ms974570.aspx
Lars-�ke
>In order to get a proper match I have to use the \ character before
>the / characters in the Format function, like this
>
>sURLdate = Format(CDate("1/1/2010"), "\/yyyy\/m\/d\/")
>
>Maybe this has something to do with the Regional and Language
>settings.
I don't know. AFAIK, all characters except
[\^$.|?*+()
should get matched literally.
Preceding a character (that has no special meaning) by a backslash ("\") merely
represents a single instance of the second character.
If you type, into the immediate window:
?CDate("1/1/2010")
what is returned?
>
>I also have a question. What is the function of the two \b in the
>regexp? I get a match even without them.
>And the negative number is also matched.
\b represents a word boundary, or, more specifically, it matches at the
position between a word character (anything matched by \w) and a non-word
character (anything matched by [^\w] or \W) as well as at the start and/or end
of the string (or line) if the first and/or last characters in the string (or
line) are word characters.
--ron
Webpage
'7th Edition Manual PDF'
(http://plan9.bell-labs.com/7thEdMan/bswv7.html)
pdf file
http://plan9.bell-labs.com/7thEdMan/v7vol2b.pdf
It is the only good description of pattern matching that I have ever
seen.
The question mark indicates any single character.
"[-+]?\b\d+\b"
There are a number of problems with the above search string
1) why do you care that a blank occurs immediately after the last
digit. what happens if there is a return, tab or end of file
2) The match is looking for a positive or negative sign folowed by any
character followed by a blank. Most people don't have a blank.
\d+|([-+]\d+)
It is where I found the saying
If you can't bring Mohammed to the mountain, bring the mountain to
Mohammed.
If I write ?CDate("1/1/2010") in the immediate window the following is
returned
2010-01-01
If I write ?Format(CDate("1/1/2010"), "/yyyy/m/d/") the following is
returned
-2010-1-1-
Only if I write ?Format(CDate("1/1/2010"), "\/yyyy\/m\/d\/" the wanted
result
/2010/1/1/
is achieved.
So, with my settings the / character seems to generate a - .
To have a / generated the / has to be escaped with \.
I found the explanation in the Excel help for Format:
"(/) Date separator. In some locales, other characters may be used to
represent the date separator. The date separator separates the day,
month, and year when date values are formatted. The actual character
used as the date separator in formatted output is determined by your
system settings."
And in my settings - (hyphen) is used as the date separator, in
accordance with ISO 8601 extended format
http://en.wikipedia.org/wiki/ISO_8601
Lars-�ke
Joel, How do you conclude that? (I expect "." to match any single
character within a line and ".|\s" to match any single character).
I found this table on page 49/250 (Joel is probably looking elsewhere):
Regular expressions in Lex use the following operators:
...
x? an optional x.
x* 0,1,2, ... instances of x.
x+ 1,2,3, ... instances of x.
x?y an x or a y.
...
I am confused by that x?y. I think it means an optional x followed by a
literal y. I think there may be a glyph confusion. I suspect it is
intended to be x|y where the character - for which I know no name -
between x and y means "or" as & - ampersand - means "and".
The symbols used by Set RE = CreateObject("VBScript.RegExp")
RE.Pattern = ... are a superset of the BRE described in <http://opengrou
p.org/onlinepubs/007908775/xbd/re.html>. My own view is that thinks like
\d is an unnecessary shorthand for [0-9]. I concede it is reasonable if
a range is not allowed in a character set. i.e. if [0123456789] is
needed. In VBA, I refer to <http://msdn.microsoft.com/en-
us/library/ms974570.aspx> which specifies everything I want to know
other than the meanings of the exceptions given.
<http://msdn.microsoft.com/en-us/library/xe43cc8d%28VS.85%29.aspx>
specifies 5019, which I hit yesterday. ;)
For my part, I would start with "-?\d+" to grab a whole number. i.e a
whole number is a minus which is optional followed by a digit one or
more times. I found the OP's situation too complicated to want to follow
and offer a suggestion.
--
Walter Briscoe
Well, that is an interesting difference, and clearly related to
Excel/VBA/Regional Settings and not to the regex engine per se, as I originally
thought.
When I place ?Cdate("1/1/2010") in the immediate window, it returns
1/1/2010.
So it would seem that for code that will run properly, regardless of the
Windows Regional settings, it might be best to use "\/yyyy\/m\/d" in the Format
command to properly format the date string to match with the URL.
Here in the US, I don't generally have to deal with international requirements.
Thanks for pointing that out.
--ron
>
>Ron: have you ever looked at the 1970's unix manual volume 2b under the
>YACC topic. See this webpage. Look for the link for the PDF files and
>use the link : v7vol2b.pdf (819KB)
>
>Webpage
>'7th Edition Manual PDF'
>(http://plan9.bell-labs.com/7thEdMan/bswv7.html)
>
>pdf file
>http://plan9.bell-labs.com/7thEdMan/v7vol2b.pdf
>
>
>
>It is the only good description of pattern matching that I have ever
>seen.
>
>The question mark indicates any single character.
I have not looked at that reference. However, if that is what it says, it is
wrong.
The question mark "?": makes the preceding item optional. Greedy, so the
optional item is included in the match if possible.
=================================
Looking at the reference you provided, I see that you have misread it. It does
*NOT* read, as you claim, that "the question mark indicates any single
character". It states the almost the SAME definition as *I* posted above:
"Optional Expressions . The operator ? indicates an optional element of an
expression. Thus ab?c matches either ac or abc."
===================================
There are some instances where "?" represents any single character, but it does
NOT do this with regard to regular expressions. Perhaps that is where you got
confused.
>
>
>There are a number of problems with the above search string
>
>1) why do you care that a blank occurs immediately after the last
>digit. what happens if there is a return, tab or end of file
I don't know what you mean by a "blank". Perhaps you are confusing the use of
the "\b" token? Check that definition (which I previously posted). It is also
does NOT seem to be part of the LEX regular expression flavor which you are
quoting. Rather, in LEX, it would mean a backspace; but that is not the case
with more commonly used flavors.
>2) The match is looking for a positive or negative sign folowed by any
>character followed by a blank. Most people don't have a blank.
No it is not. See above. You are misinterpreting "\b".\
>
>
>\d+|([-+]\d+)
>
>It is where I found the saying
>
>If you can't bring Mohammed to the mountain, bring the mountain to
>Mohammed.
If Mohammed goes to the wrong mountain, he may get the wrong information.
You might try some of these "mountains", if the links are still valid:
Regular Expressions
http://www.regular-expressions.info/reference.html
http://support.microsoft.com/default.aspx?scid=kb;en-us;818802&Product=vbb
http://msdn2.microsoft.com/en-us/library/6wzad2b2.aspx
http://msdn2.microsoft.com/en-us/library/ms974619.aspx
http://www.regex-guru.info/
--ron
>For my part, I would start with "-?\d+" to grab a whole number. i.e a
>whole number is a minus which is optional followed by a digit one or
>more times. I found the OP's situation too complicated to want to follow
>and offer a suggestion.
I agree with a lot of what you wrote.
I would point out that the above construct will also extract whole numbers that
are embedded within other strings. That may or may not be desirable.
In other words, it can extract both integers from the string below:
abc123abc 123
--ron
Walter Briscoe: There are utilities in the Unix manual where the ?
means a single character. In YACC expresions it meas an optional
character. So you can use
[-+]?\d+
Lars-�ke Aspelin: The backslash before for special characters comes
from Bell Labs (the Unix manuals). The scientist at Bell Labs developed
a lot of the standards that are used today for regular expressions. the
reasearch was needed to develope algorithms that cold be used to find
peole names in phone books and to develope softwae compilers for
programming languages. both tasks the phone company was interested in
doing and used similar algorithms. Computers in the 1970's were
expensive to operate and developing efficient fast algortihms that use
minimum amount of memory (memory very expensive).
Companiers today don't have the same concerns. The people are more
expensive than the hardware which is fast and inexpensive so algorithms
are developed as quickly as possible with lots of errors that sometimes
gets fixed if enough people complain.