Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Extract a number from a text string

17 views
Skip to first unread message

paul crowley

unread,
Jan 9, 2003, 8:45:34 AM1/9/03
to
I wonder if anybody could help me with this. I have a column of text strings
of the form "Tillage 130 Acres" and
"35Acres dairying" etc. I want to extract the the number of acres but I
can't find any function or macro to do this.
Does anybody have any ideas. Any help wouild be much appreciated.


Paul

Peo Sjoblom

unread,
Jan 9, 2003, 9:02:34 AM1/9/03
to
Here's one way although it can only extract the numbers like
in your example, if there would be something like this it will not work

aaaa25bbbb35

=MID(A1,MATCH(FALSE,ISERROR(1*MID(A1,ROW(INDIRECT("1:100")),1)),0),100-SUM(1
*ISERROR(1*MID(A1,ROW(INDIRECT("1:100")),1))))*1

note that this is an array formula and has to be entered
with

Ctrl + Shift & Enter

the above assumes your values start in A1

Regards,

Peo Sjoblom

paul crowley <crowl...@eircom.net> wrote in message
news:F2fT9.1857$V6....@news.indigo.ie...

Ron Rosenfeld

unread,
Jan 9, 2003, 12:04:14 PM1/9/03
to
On Thu, 9 Jan 2003 13:45:34 -0000, "paul crowley" <crowl...@eircom.net>
wrote:

Here is a UDF that will extract all the numbers from a text string. It
uses the period [.] as the decimal separator, so would not be appropriate
for use in other locales.

If that is an issue, substitute Application.DecimalSeparator for "." in the
routine.

=============================
Option Explicit

Function GetValue(str)
Dim n As Integer, i As String
i = ""
For n = 1 To Len(str)
If IsNumeric(Mid(str, n, 1)) Then
i = i & Mid(str, n, 1)
If Mid(str, n + 1, 1) = "." Then i = i & "."
End If
Next
If i = "" Then
GetValue = i
Exit Function
End If
GetValue = CDbl(i)
End Function
====================


--ron

Norman Harker

unread,
Jan 9, 2003, 12:18:54 PM1/9/03
to
Hi Peo!

That's not only beautiful but it's useful.

Sorry to impose, but can you explain the logic? I just hate to use these
things without really understanding why it works. On this one, I'm just
getting confused.
Regards


--
Norman Harker
Sydney, Australia.
njha...@optusnet.com.au

"Peo Sjoblom" <ter...@mvps.org> wrote in message
news:OgerOf#tCHA.1620@TK2MSFTNGP11...

Harlan Grove

unread,
Jan 9, 2003, 1:39:49 PM1/9/03
to
"Peo Sjoblom" wrote...

>Here's one way although it can only extract the numbers like
>in your example, if there would be something like this it will not work
>
>aaaa25bbbb35
>
>=MID(A1,MATCH(FALSE,ISERROR(1*MID(A1,ROW(INDIRECT("1:100")),1)),0),100-SUM(1
>*ISERROR(1*MID(A1,ROW(INDIRECT("1:100")),1))))*1
>
>note that this is an array formula and has to be entered
>with
>
>Ctrl + Shift & Enter
>
>the above assumes your values start in A1
..

Been a while since I've advocated for regular expressions, so here goes.

If one have either Outlook Express 5.0 or higher or VBScript 5.0 or later
installed (you have the latter if you're running Windows 2000 or higher), then
you could use the following user-defined functions.


'Similar to awk's sub()/gsub() but using VBScript's perl-like regular
'expressions
'Required: matchpat - regexp to find, replacepat - replacement pattern,
' s - string to search through
'Optional global sets global properties:
' True - replace all matched patterns - DEFAULT
' False - replace only the first matched pattern
'
Function psub(matchpat As String, replacepat As String, s As String, _
Optional re_global As Boolean = True) As String
Dim regex As Object
Set regex = CreateObject("vbscript.regexp")
regex.Pattern = matchpat
regex.Global = re_global
psub = regex.Replace(s, replacepat)
End Function


'similar to awk's split, but returns the array rather than using a pointer
'to an array or range passed as the second argument
Function psplit(s As String, sep As String) As Variant
psplit = Evaluate("={""" & psub(sep, """,""", s) & """}")
End Function


Then the formula

=psplit(INDEX(psplit(" "&SomeString,"^[^0-9]+"),2),"[^0-9]+")

returns an array of all numeric substrings in SomeString. To extract the first
(leftmost), last (rightmost), or longest, respectively, use

=INDEX(psplit(" "&SomeString,"[^0-9]+"),2)

=INDEX(psplit(" "&SomeString,"[^0-9]+"),
COUNTA(psplit(" "&SomeString,"[^0-9]+")))

=INDEX(psplit(" "&SomeString,"[^0-9]+"),
MATCH(MAX(LEN(psplit(" "&SomeString,"[^0-9]+"))),
LEN(psplit(" "&SomeString,"[^0-9]+")),0))

Note that the last formula is an array formula.

If all that's needed is the leftmost numeric substring, then with Seq_1_1024
defined as =ROW(INDIRECT("1:1024")), the array formula

=MID(SomeString,MATCH(TRUE,ISNUMBER(0+MID(SomeString,Seq_1_1024,1)),0),
MATCH(FALSE,ISNUMBER(0+MID(SomeString,MATCH(TRUE,ISNUMBER(0+
MID(SomeString,Seq_1_1024,1)),0)+Seq_1_1024,1)),0))

will pull the leftmost numeric substring no matter what's in SomeString.

The main point, however, is that spreadsheets in general (and Excel is no
exception) aren't provided with sophisticated text processing tools. Regular
expressions are, to date, the most efficient mechanism for parsing text.

--
Public Service Announcement
Don't attach files to postings in nonbinary newsgroups like this one.

Harlan Grove

unread,
Jan 9, 2003, 1:57:03 PM1/9/03
to
"Norman Harker" wrote...
..
> . . . but can you explain the logic? I just hate to use these

>things without really understanding why it works. On this one, I'm just
>getting confused.
..

>"Peo Sjoblom" <ter...@mvps.org> wrote in message
..

>>=MID(A1,MATCH(FALSE,ISERROR(1*MID(A1,ROW(INDIRECT("1:100")),1)),0),100-SUM(1
>>*ISERROR(1*MID(A1,ROW(INDIRECT("1:100")),1))))*1
..

If your intent was sycophancy, my sincerest appologies for buttin in.

C'mon, Norman. Don't you know how to deconstruct formulas?

=MID(A1,

So far, so simple - extracting a substring

MATCH(
FALSE,
ISERROR(
1*
MID(
A1,
ROW(
INDIRECT("1:100")
),
1
)
),
0
),

The inner mid breaks up A1 into an array of the first 100 characters in A1 as
length 1 substrings. The leading 1* converts decimal digits to numeric values
and everything else to errors. Wrapping that in ISERROR makes the entries
corresponding to decimal digits FALSE, so the MATCH returns the character
position of the leftmost decimal digit in the first 100 characters of A1.

100-
SUM(
1*
ISERROR(
1*
MID(
A1,
ROW(
INDIRECT("1:100")
),
1
)
)
)

This should be simple enough for you to figure out the details. It gives the
number of decimal digit characters in the first 100 characters in A1.

)*1

Since the MID returns the numeric substring (if there's only one in A1's first
100 characters), you get to figure out what the trailing *1 does.

Sheesh, this idiom appears about once a month. You should know what it means by
now.

adetaylor

unread,
Jan 9, 2003, 5:38:15 PM1/9/03
to

Or, if the string were in A1 and if the number will
always be within the first 255 characters.

=--MID(A1,LeftDigit,MATCH(1
,(COLUMN(1:1)>LeftDigit)*ISERROR(--MID(A1,COLUMN(1:1),1))
,0)-LeftDigit)

After pasting that formula into a cell, enter it
using Ctrl+Shift+Enter. While that cell is the
active cell...

Insert>Name>Define>
NamesInWorkbook:LeftDigit
RefersTo:
=MATCH(TRUE,ISNUMBER(--MID(A1,COLUMN(1:1),1)),0)

+ - - - - - - - - - - - - - - - - - -
! To place the formula into a cell:
! Drag to select the formula provided and copy.
! Double-click a cell (or use F2).
! Paste the formula into the cell.
!
! Accept it using Ctrl+Shift+Enter (not just Enter).
! Then copy to new cells -- not into the same
! original cell.
+ - - - - - - - - - - - - - - - - - -

--Brian Taylor
Refinate (C)2001
Click to: Rearrange Cells or Cell Content,
Find Duplicates, Count Items, Import, and more.
www.adetaylor.com


>>"Harlan Grove" wrote:------------
> ...


> If all that's needed is the leftmost numeric
> substring, then with Seq_1_1024 defined as
> =ROW(INDIRECT("1:1024")), the array formula
>
=MID(SomeString,MATCH(TRUE,ISNUMBER(0+MID(SomeString
,Seq_1_1024,1)),0),MATCH(FALSE,ISNUMBER(0+MID(SomeString
,MATCH(TRUE,ISNUMBER(0+MID(SomeString,Seq_1_1024,1))
,0)+Seq_1_1024,1)),0))
>
> will pull the leftmost numeric substring no matter
> what's in SomeString.

> ...


Ken Wright

unread,
Jan 9, 2003, 5:42:03 PM1/9/03
to

> C'mon, Norman. Don't you know how to deconstruct formulas?

> Sheesh, this idiom appears about once a month. You should know what it
means by
> now.

I just don't get it, been curious for days now (but not being a cat,
hopefully it won't be detrimental to my health), is there some kind of
history with you two guys that I'm just too late to the party to be aware
of??

Regards
Ken...............................

"Harlan Grove" <hrl...@aol.com> wrote in message
news:3CjT9.4983$15....@www.newsranger.com...

Harlan Grove

unread,
Jan 9, 2003, 6:33:21 PM1/9/03
to
"Ken Wright" wrote...
..

>I just don't get it, been curious for days now (but not being a cat,
>hopefully it won't be detrimental to my health), is there some kind of
>history with you two guys that I'm just too late to the party to be aware
>of??
..

No history. Norman is clever enough to have done what I did. He knows how to
break formulas into their pieces and figure out how the pieces are connected to
form the whole. Since this is a recurring formula idiom which has appeared
several times in the past year, a period during which Norman seems to have been
quite active, he's had ample opportunity to have seen it before. There must have
been some other motive behind his response.

Now Peo deserves a lot of thanks for many things. However, the over-the-top
praise Norman was heaping on him for something that's been appearing for years
(e.g., http://makeashorterlink.com/?W61B16703) in these newsgroups was just a
bit nauseating.

Norman Harker

unread,
Jan 9, 2003, 7:20:54 PM1/9/03
to
Hi Ken!

I don't bother any more. After telling me and others to get stuffed and
admitting to being a troll followed up with gratuitous insults of Australia
and Australians I decided that I wouldn't be missing a lot. Harlan is now
the sole occupant of my blocked senders list. Not a bad record for the
newsgroup and the many thousands of contributors.

His contributions tend to be condescending, derisive and negative. I believe
that quite a few others have the same view, which is a great pity as with a
different approach he has a lot he could contribute that people would both
enjoy and benefit from.

Your unsnipped follow on was a typical example. A helpful explanation but
delivered in a manner that detracts from the sender rather than the
questioner.

Regards


--
Norman Harker
Sydney, Australia.
njha...@optusnet.com.au

"Ken Wright" <ken.w...@NOSPAMntlworld.com> wrote in message
news:#q918BDuCHA.2556@TK2MSFTNGP10...

Dana DeLouis

unread,
Jan 9, 2003, 7:44:02 PM1/9/03
to

Just two cents. If you haven't been there in a while, version 5.5 has some
nice things.
I find it powerful, but it did prove slow on a large project I was working
on.
Just another version to your excellent Regular Expression function might be
like this...

Function FindNumber(s As String) As Variant
' = = = = = = = = = = = = = = = = = = = = = = = = =
'// Dana DeLouis
'// Microsoft VBScript Regular Expressions 5.5
'// Returns the first group of numbers from a string
' = = = = = = = = = = = = = = = = = = = = = = = = =
Dim re As RegExp
Dim Matches As MatchCollection
Const ConsecutiveNumbers As String = "\D*(\d+)\D*" '// Remember Digits

Set re = New RegExp
With re
.IgnoreCase = True
.Global = False
.Pattern = ConsecutiveNumbers
If .Test(s) Then
Set Matches = .Execute(s)
FindNumber = CDbl(Matches(0).SubMatches(0))
Else
FindNumber = "No Number Found!"
End If
End With
Set re = Nothing
End Function


Sub Test_It()
Debug.Print FindNumber("Tillage 130 Acres")
Debug.Print FindNumber("35Acres dairying")
Debug.Print FindNumber("This has 1digit")
Debug.Print FindNumber("This has no digits")
End Sub

returns...
130
35
1
No Number Found!

--
Dana DeLouis
Windows XP & Office XP
= = = = = = = = = = = = = = = = =

<snip>

Dana DeLouis

unread,
Jan 9, 2003, 7:29:23 PM1/9/03
to

If it's just 1 whole number in the string, a slight change to this excellent
idea could be...

Function GetValue(s As String) As Double
Dim n As Long
For n = 1 To Len(s)
GetValue = WorksheetFunction.Max(GetValue, Val(Mid$(s, n)))
Next
End Function

Debug.Print GetValue("Tillage 130 Acres")
Debug.Print GetValue("35Acres dairying")

returns 130, and 35.


--
Dana DeLouis
Windows XP & Office XP
= = = = = = = = = = = = = = = = =


"Ron Rosenfeld" <ronros...@nospam.org> wrote in message
news:mp9r1vobg9e3871gs...@4ax.com...

Harlan Grove

unread,
Jan 9, 2003, 8:25:28 PM1/9/03
to
"Dana DeLouis" wrote...

>Just two cents. If you haven't been there in a while, version 5.5 has some
>nice things.

Indeed, such as look-ahead asertions. But ver 5.0 is adequate for most things.

>I find it powerful, but it did prove slow on a large project I was working
>on.

OK. It's possible the WSH regex engine isn't particularly efficient. You could
use Perl's if you have Perl/PScript installed. Note: if you're mixing
parenthesized subexpressions and closure operators (?, * and +), you may be
running into backtracking problems.

If you're serious about using regexs as efficiently as possible, consider buying
'Mastering Regular Expressions, 2nd Ed.' by Jeffrey E. F. Friedl (O'Reilly,
ISBN: 0-596-00289-0). It has a chapter on the .NET flavor, which are presumably
what WSH provides.

..


> Const ConsecutiveNumbers As String = "\D*(\d+)\D*" '// Remember Digits

..
> FindNumber = CDbl(Matches(0).SubMatches(0))
..

Since default matching is leftmost maximal, avoid the backtracking overhead of
global closures and parenthesized subexpressions.

Const ConsecutiveNumbers As String = "\d+"
..
FindNumber = CDbl(Matches(0).Value)

Ron Rosenfeld

unread,
Jan 9, 2003, 9:24:08 PM1/9/03
to
On Thu, 9 Jan 2003 19:29:23 -0500, "Dana DeLouis" <ng_...@hotmail.com>
wrote:

>Function GetValue(s As String) As Double
> Dim n As Long
> For n = 1 To Len(s)
> GetValue = WorksheetFunction.Max(GetValue, Val(Mid$(s, n)))
> Next
>End Function


Nice approach.


--ron

Dana DeLouis

unread,
Jan 10, 2003, 2:40:28 PM1/10/03
to

Thanks Harlan. I didn't see the simpler solution. :>)
I think the slowness I was having was most likely my fault. I'm sure I can
speed it up by doing it a little smarter.
I think I know what you mean by the possibility of "Backtracking" causing a
little slowdown. However, I have seen textbook examples that seem to
suggest that it "should" handle this ok.
For example, I have seen an example in determining if a string contained a
valid Web address. The repeat character "*" was not only used inside the
parentheses, but also just outside the parentheses as well.
A partial example would be (\w+[\w-]*\w+\.)* . At least I know it's
faster than anything I could come up with using standard VBA commands within
a loop.

Just for fun, by removing the explicit Set and Matches keywords, the
following code ran twice as fast.

Function FindNumber(s As String) As Variant
' = = = = = = = = = = = = = = = = = = = = = = = = =

'// Returns the first group of numbers from a string
' = = = = = = = = = = = = = = = = = = = = = = = = =

With New RegExp
.Global = False
.Pattern = "\d+"
If .Test(s) Then
FindNumber = CDbl(.Execute(s).Item(0))


Else
FindNumber = "No Number Found!"
End If
End With

End Function


Sub Test_It()
Debug.Print FindNumber("Tillage 130 Acres")
Debug.Print FindNumber("35Acres dairying")
Debug.Print FindNumber("This has 1digit")
Debug.Print FindNumber("This has no digits")

Debug.Print FindNumber("Number 23 here, and 45 there")
End Sub

Returns...


130
35
1
No Number Found!

23

--
Dana DeLouis
Windows XP & Office XP
= = = = = = = = = = = = = = = = =

"Harlan Grove" <hrl...@aol.com> wrote in message

news:cipT9.5068$15....@www.newsranger.com...

Harlan Grove

unread,
Jan 10, 2003, 3:56:58 PM1/10/03
to
"Dana DeLouis" wrote...
..

>I think I know what you mean by the possibility of "Backtracking" causing a
>little slowdown. However, I have seen textbook examples that seem to
>suggest that it "should" handle this ok.
..

Get Friedl's book! Read the ENTIRE discusion on DFA vs NFA regex engines.
Closure ops in DFA engines are no big deal. Closure ops in NFA engines (which
WSH's must be) can be a killer.

>Just for fun, by removing the explicit Set and Matches keywords, the
>following code ran twice as fast.
>
>Function FindNumber(s As String) As Variant
>' = = = = = = = = = = = = = = = = = = = = = = = = =
>'// Returns the first group of numbers from a string
>' = = = = = = = = = = = = = = = = = = = = = = = = =
> With New RegExp
> .Global = False
> .Pattern = "\d+"
> If .Test(s) Then
> FindNumber = CDbl(.Execute(s).Item(0))
> Else
> FindNumber = "No Number Found!"
> End If
> End With
>End Function

..

Sure the speedup wasn't due to the different regex?

0 new messages