How would I match a substring of a URL?

Mohanaraj

Mar 31, 2008, 3:25:41 AM
to loghetti-dev
Hello,

Just a quick question - say I have logs like the following:

192.168.3.1 - - [31/Mar/2008:14:09:02 +0800] "PUT /feeds/user1/pics/test.atom HTTP/1.1" 200 1543 "-" "client"
192.168.3.1 - - [31/Mar/2008:14:39:04 +0800] "POST /feeds/user1/pics HTTP/1.1" 201 1968 "-" "client"
192.168.3.1 - - [31/Mar/2008:14:39:05 +0800] "POST /feeds/user2/pics HTTP/1.1" 201 1968 "-" "client"

What I would like to do is pull out any interaction that involves the
path '/feeds/user1'. Hence I would like to construct a query that
would essentially pull out the first two rows of the log above.

I try:

loghetti.py --debug --urlbase="feeds/user1" /var/log/apache2/access.log

but it only pulls out

192.168.3.1 - - [31/Mar/2008:14:39:04 +0800] "POST /feeds/user1/pics HTTP/1.1" 201 1968 "-" "client"

Then I try

loghetti.py --debug --urlbase="feeds/user1/pics" /var/log/apache2/access.log

but it only pulls out

192.168.3.1 - - [31/Mar/2008:14:09:02 +0800] "PUT /feeds/user1/pics/test.atom HTTP/1.1" 200 1543 "-" "client"

Any light on how I would do the above would be greatly appreciated.

Thank you.

Mohan

Brian Jones

Mar 31, 2008, 8:12:43 AM
to loghet...@googlegroups.com
I actually see a couple of problems here. One is that the base is not
being defined the way that I had intended. The second is that, even if
it was defined as I initially intended, that way is probably not
ideal.

The 'urlbase' definition in the code is supposed to be everything in
the url up to the first occurrence of "?" or "/". So if you have
"/feeds/pics/user1/stuff.php?foo=bar", the code should actually call,
simply, "feeds" the urlbase. That was the plan at the outset, but I'll
freely admit to not thinking deeply about this feature until now (and
I still haven't had time to give it due diligence). I did a quick
test, and it appears to me that the regex is not working properly,
because it's matching parts of a request beyond the first occurrence
of "/" (though "?" seems to work ok).

Regardless, what you really want (and I think would be nice too) is
not the urlbase, but to filter the lines based on what the code would
consider arbitrary substring components of the request :-)

I'll poke at this more in my late night coding sessions during this
week. Hopefully I can get a fix in place in the next couple of days.

brian.

--
Brian K. Jones
Python Magazine http://www.pythonmagazine.com
My Blog http://www.protocolostomy.com

Mohanaraj

Apr 1, 2008, 12:01:40 AM
to loghetti-dev


On Mar 31, 8:12 pm, "Brian Jones" <bkjo...@gmail.com> wrote:

>
> The 'urlbase' definition in the code is supposed to be everything in
> the url up to the first occurrence of "?" or "/". So if you have
> "/feeds/pics/user1/stuff.php?foo=bar", the code should actually call,
> simply, "feeds" the urlbase.

I would be more inclined to think that the urlbase would be
/feeds/pics/user1/stuff.php, that is, everything up to the query
parameters. If we were to follow your line of thinking, how is /feeds
being the urlbase different from /feeds/pics or /feeds/pics/user1
being the urlbase?

Instead, if we were to say that the urlbase is the part of the URL that
does not include the query parameters, it's pretty clear what it means,
IMHO. What say you?

> That was the plan at the outset, but I'll
> freely admit to not thinking deeply about this feature until now (and
> I still haven't had time to give it due diligence). I did a quick
> test, and it appears to me that the regex is not working properly,
> because it's matching parts of a request beyond the first occurrence
> of "/" (though "?" seems to work ok).

In all honesty, I find the line.base attribute to be somewhat
superfluous and confusing.

> Regardless, what you really want (and I think would be nice too) is
> not the urlbase, but to filter the lines based on what the code would
> consider arbitrary substring components of the request :-)

Correct :)

> I'll poke at this more in my late night coding sessions during this
> week. Hopefully I can get a fix in place in the next couple of days.

The way I have implemented it in my local code base now is by using the
Python 'in' operator, which is really easy to implement if we apply the
changes to filter I mentioned in my last email. What you would do is set
the rule's cmp attribute to 'in', and in the __call__ method of the rule
you have the 'in' case do the following:

def __call__(self, line):
    #print line.__dict__[self.attr]+self.cmp+ self.val
    if self.cmp == "==":
        return (line.__dict__[self.attr] == self.val)
    elif self.cmp == "in":
        return (self.val in line.__dict__[self.attr])

Hence now you have substring comparisons :).
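
For anyone who wants to try the idea outside of loghetti, here is a minimal, self-contained sketch. The Line and Rule classes below are hypothetical stand-ins written for illustration (they are not loghetti's actual classes), and the request parsing is deliberately naive. The log lines are the three from the first message in this thread.

```python
class Line(object):
    """Hypothetical stand-in for a parsed log line (not loghetti's class)."""

    def __init__(self, raw):
        self.raw = raw
        # The request is the first double-quoted field: 'METHOD URL PROTOCOL'.
        request = raw.split('"')[1]
        self.method, self.url, self.protocol = request.split()


class Rule(object):
    """Hypothetical rule: compares one attribute of a line against a value."""

    def __init__(self, attr, cmp, val):
        self.attr = attr
        self.cmp = cmp
        self.val = val

    def __call__(self, line):
        actual = getattr(line, self.attr)
        if self.cmp == "==":
            return actual == self.val
        elif self.cmp == "in":
            # Substring test: is self.val contained in the attribute?
            return self.val in actual
        raise ValueError("unsupported comparison: %r" % self.cmp)


LOG = [
    '192.168.3.1 - - [31/Mar/2008:14:09:02 +0800] "PUT /feeds/user1/pics/test.atom HTTP/1.1" 200 1543 "-" "client"',
    '192.168.3.1 - - [31/Mar/2008:14:39:04 +0800] "POST /feeds/user1/pics HTTP/1.1" 201 1968 "-" "client"',
    '192.168.3.1 - - [31/Mar/2008:14:39:05 +0800] "POST /feeds/user2/pics HTTP/1.1" 201 1968 "-" "client"',
]

rule = Rule("url", "in", "/feeds/user1")
matches = [raw for raw in LOG if rule(Line(raw))]
for raw in matches:
    print(raw)
```

One caveat with a plain substring test: '/feeds/user1' is also a substring of '/feeds/user10', so a trailing '/' or an anchored pattern may be needed if user names can share prefixes.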

Kent Johnson

Apr 1, 2008, 7:26:46 AM
to loghet...@googlegroups.com
Mohanaraj wrote:
>
>
> On Mar 31, 8:12 pm, "Brian Jones" <bkjo...@gmail.com> wrote:
>
>> The 'urlbase' definition in the code is supposed to be everything in
>> the url up to the first occurrence of "?" or "/". So if you have
>> "/feeds/pics/user1/stuff.php?foo=bar", the code should actually call,
>> simply, "feeds" the urlbase.

Looking at the code, it seems that urlbase is the path and base is
supposed to be the first path component.

I suggest using the terminology from urlparse() for the components it
creates.
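
To make the terminology concrete, here is a small sketch of what urlparse() gives you for the example URL. It uses the Python 3 import; in Python 2 the same function lives in the urlparse module. The variable name first_component is my own label, not something from the loghetti code.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

parts = urlparse("/feeds/pics/user1/stuff.php?foo=bar")

path = parts.path    # everything before the '?'
query = parts.query  # everything after the '?'

# First path component: strip the leading '/' and cut at the next '/'.
first_component = path.lstrip("/").split("/", 1)[0]

print(path)             # /feeds/pics/user1/stuff.php
print(query)            # foo=bar
print(first_component)  # feeds
```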


> Instead if we were to say that the urlbase is part of the url that
> does not include the query parameters
> its pretty clear what it mean IMHO. What say you ?

Yes, that is the path.


>
>> That was the plan at the outset, but I'll
>> freely admit to not thinking deeply about this feature until now (and
>> I still haven't had time to give it due diligence). I did a quick
>> test, and it appears to me that the regex is not working properly,
>> because it's matching parts of a request beyond the first occurrence
>> of "/" (though "?" seems to work ok).

Regex quantifiers are greedy by default. You have "^\/.*[\?\/]" which
will match through the *last* / or ?. You could use r"^\/[^\?\/]*" to
match up to the first ? or /.
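
A quick sketch comparing the two patterns on the request paths from the original question. Only the two regexes come from this thread; the variable names are mine.

```python
import re

GREEDY = r"^\/.*[\?\/]"      # '.*' grabs as much as it can, then backs up
FIRST_ONLY = r"^\/[^\?\/]*"  # stops before the first '?' or '/'

long_path = "/feeds/user1/pics/test.atom"
short_path = "/feeds/user1/pics"

greedy_long = re.match(GREEDY, long_path).group(0)
greedy_short = re.match(GREEDY, short_path).group(0)
first_long = re.match(FIRST_ONLY, long_path).group(0)
first_short = re.match(FIRST_ONLY, short_path).group(0)

print(greedy_long)   # /feeds/user1/pics/
print(greedy_short)  # /feeds/user1/
print(first_long)    # /feeds
print(first_short)   # /feeds
```

If the base is derived from the greedy match, that would be consistent with Mohan's results: "feeds/user1" only lines up with the POST to /feeds/user1/pics, and "feeds/user1/pics" only with the PUT to /feeds/user1/pics/test.atom.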

I don't think you will actually see a ? in the path returned from
urlparse() though.

Some unit tests would help here, both to clearly express your intent and
to make sure you have correctly implemented it.
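
As a rough idea of what such tests could look like, here is a sketch using the standard unittest module. The helper first_path_component() is hypothetical; it stands in for whatever function ends up computing the base, and the expected values encode one possible reading of the intent (base is the first path component).

```python
import re
import unittest

_FIRST = re.compile(r"^/([^?/]*)")


def first_path_component(url):
    """Hypothetical helper: text between the leading '/' and the first '/' or '?'."""
    match = _FIRST.match(url)
    return match.group(1) if match else ""


class FirstPathComponentTest(unittest.TestCase):
    def test_nested_path(self):
        self.assertEqual(first_path_component("/feeds/user1/pics/test.atom"), "feeds")

    def test_query_string(self):
        self.assertEqual(first_path_component("/feeds?foo=bar"), "feeds")

    def test_single_component(self):
        self.assertEqual(first_path_component("/feeds"), "feeds")

    def test_root(self):
        self.assertEqual(first_path_component("/"), "")
```

Saved in a file, this can be run with "python -m unittest <module name>".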

> The way I have implemented it in my local code base now is by using
> the
> python 'in' operator, which is really easy to implement if we apply
> the changes to filter
> I mentioned in my last email. What you would do is set the rules cmp
> attribute to the 'in' operator
> and in the __call__ method of the rule you have the 'in' case do the
> following
>
> def __call__(self, line):
>     #print line.__dict__[self.attr]+self.cmp+ self.val
>     if self.cmp == "==":
>         return (line.__dict__[self.attr] == self.val)
>     elif self.cmp == "in":
>         return (self.val in line.__dict__[self.attr])
>
> Hence now you have substring comparisons :).

'in' is useful; a regex search might be useful as well.
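
One possible way to support both is a lookup table of comparison functions instead of an if/elif chain. This is only a sketch: the "~" key for regex search and the compare() function are names I made up here, not anything loghetti defines.

```python
import operator
import re

# Maps a comparison name to a function taking (actual, wanted).
COMPARATORS = {
    "==": operator.eq,
    "in": lambda actual, wanted: wanted in actual,
    # Hypothetical "~" comparison: regex search anywhere in the value.
    "~": lambda actual, pattern: re.search(pattern, actual) is not None,
}


def compare(actual, cmp, val):
    """Apply the named comparison; raises KeyError for unknown names."""
    return COMPARATORS[cmp](actual, val)


print(compare("/feeds/user1/pics", "in", "/feeds/user1"))
print(compare("/feeds/user1/pics/test.atom", "~", r"^/feeds/user1(/|$)"))
print(compare("/feeds/user10/pics", "~", r"^/feeds/user1(/|$)"))
```

The regex form lets you anchor the match, so user1 does not also pick up user10, which a bare substring test would.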

How do you express this in the command-line parameters?

Kent
