Index variable fields per model.


Julián Porta

Aug 27, 2012, 1:57:12 PM
to picky...@googlegroups.com
I have a web application that saves HTTP requests. I'm using Ohm, a Redis-based ORM.

My model has a few fields, nothing fancy. One of the fields contains the raw web request as I received it, conveniently parsed as JSON.
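
For context, a rough sketch of the model (the class and attribute names here are just placeholders, not my exact code):

require 'ohm'

# Rough sketch only; class and attribute names are illustrative.
class SavedRequest < Ohm::Model
  attribute :url               # e.g. "/hello/test"
  attribute :raw_request_json  # the raw request, stored as a JSON string
end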

What I'd like to do is index the contents of the original request, using each key in the parsed JSON as a category in the Picky index.

Possible?

Thoughts?

Picky / Florian Hanke

Aug 27, 2012, 2:13:21 PM
to picky...@googlegroups.com
Hi Julián,

I'm not perfectly sure I understand your setup.

You have a model, with a few fields. One of the fields is JSON text, for example '{"a":"b","c":"d"}'.
And you'd like to index, in this case, "b" in category "a", and "d" in category "c"?

Also, is it the request you index (so you always have the same categories), or do you index anything, needing categories to be added dynamically?

I'm pretty sure I understood, but not 100%, which is why I ask.

Cheers and thanks for the question,
   Florian

Julián Porta

Aug 27, 2012, 2:28:52 PM
to picky...@googlegroups.com
Answers inline:


On Monday, August 27, 2012 3:13:21 PM UTC-3, Picky / Florian Hanke wrote:
Hi Julián,

I'm not perfectly sure I understand your setup.

You have a model, with a few fields. One of the fields is JSON text, for example '{"a":"b","c":"d"}'.
And you'd like to index, in this case, "b" in category "a", and "d" in category "c"?

Exactly.
 

Also, is it the request you index (so you always have the same categories), or do you index anything, needing categories to be added dynamically?

Also, I'd need to add categories to be indexed based on the contents of some requests. Here's the use case:

Some requests I save have special headers added. There are a few that have an X-Forwarded-For header; I'd like to use that as a category. Other requests have PayPal-related headers (the PayPal API uses header key/values for authentication and other data); I'd like to use those as search criteria.
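
For example, the parsed headers of one of those requests might look roughly like this (header names and values made up for illustration):

# Purely illustrative; header names and values are made up.
parsed_headers = {
  "X-Forwarded-For"      => "203.0.113.7",
  "X-Some-Paypal-Header" => "some-auth-value"
}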

 

I'm pretty sure I understood, but not 100%, which is why I ask.


You got it perfectly.

Thanks!

Picky / Florian Hanke

Aug 27, 2012, 3:06:10 PM
to picky...@googlegroups.com
It seems to me it would make sense to simply define these categories as fixed, in the index definition – since it seems they are not truly dynamic, but do have expected values.
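
A minimal sketch of that fixed variant (the extra category names are just guesses based on your description):

fixed = Picky::Index.new :request do
  category :url,             indexing: { splits_text_on: /[\/\?\&\=]/ }
  category :x_forwarded_for  # filled from the X-Forwarded-For header
  category :paypal_auth      # filled from the PayPal-related headers
end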

If this is not the case, check out the following. You can copy it into a Ruby script and run it. Note that there is a problem at the end, which I will need to fix (explanation at the very end).

require 'picky'
require 'json'

# Some request.
#
Request = Struct.new :id, :url, :some_headers_as_json

# Define the url category.
#
data = Picky::Index.new :request do
  category :url, indexing: { splits_text_on: /[\/\?\&\=]/ }
end

# Add a request.
#
data.add Request.new(1, '/hello/test', '{"a":"b","c":"d"}')

# Check out some internals (just for fun).
#
p data[:url].exact.inverted # The index
p data[:url].tokenizer.split("a/b/c") # How the tokenizer of url splits.

# Prepare a search interface.
#
requests = Picky::Search.new data

# Try a search.
#
p requests.search 'test'

# Dynamically add category.
#
request = Request.new(2, '/with/json', '{"a":"b","c":"d"}')
json = JSON.parse request.some_headers_as_json

# Check it out.
#
p json

json.each do |key, value|
  key = key.to_sym
  p "Adding value #{value} to category #{key}"
  data[key] rescue data.category(key)
  data[key].add_text request.id, value 
end

# Let's search for it:
#
p requests.search 'b'
p requests.search 'c:d' # Whoops. This is a problem!

I get this as output:
~/temp/picky-examples $ ruby google_groups_julian_porta.rb 
{"hello"=>[1], "test"=>[1]}
["a", "b", "c"]
>|2012-08-27 21:00:59|0.000098|test                                              |       1|   0| 1|
{"a"=>"b", "c"=>"d"}
"Adding value b to category a"
"Adding value d to category c"
>|2012-08-27 21:00:59|0.000046|b                                                 |       1|   0| 1|
>|2012-08-27 21:00:59|0.000017|c:d                                               |       0|   0| 0|

Regarding the error:
Why does "c:d" fail? Picky sets up the "category mapping" once, at the beginning, so only categories that are defined at that point are included in the qualifiers. A qualifier is the thing that tells "d" where it must be found: "c:d" tells Picky that you want to find "d" in category "c", and that it must find it there and give up if it isn't.

So in the next version this needs to be done dynamically, after adding each category instead of at the beginning. Sorry about that – Picky has only recently become a realtime system, so this has not yet been a concern.

Please tell me if all this helps.

Cheers,
   Florian

Picky / Florian Hanke

Aug 27, 2012, 3:12:40 PM
to picky...@googlegroups.com
P.S.: There are, of course, far funkier solutions, but I left them out for clarity.

Julián Porta

Aug 27, 2012, 3:36:08 PM
to picky...@googlegroups.com
Yes, it helps a lot. It's clear that dynamically adding categories isn't an option. I'm now thinking of the following solution:

When a new entry is saved to the database, parse the JSON fields and save to the database the list of categories to add to the index.

Then have a cron job retrieve that list of categories, define them for indexing, and reindex the whole thing.

Is that possible, to add categories AND reindex?

Picky / Florian Hanke

Aug 27, 2012, 4:29:44 PM
to picky...@googlegroups.com
Hi Julián,

See below.

On Monday, 27 August 2012 21:36:08 UTC+2, Julián Porta wrote:
Yes, it helps a lot. It's clear that dynamically adding categories isn't an option.

Because of the bug?
(Although, I guess I see the problem – the category definition would not be persistent, and Picky, as of yet, does not save the category definition, just the index/category data)

I'm wondering – don't these categories each need specific indexing (tokenizing options etc.)?
 
I'm now thinking of the following solution:

When a new entry is saved to the database, parse the JSON fields and save to the database the list of categories to add to the index.

Then have a cron job retrieve that list of categories, define them for indexing, and reindex the whole thing.

Is that possible, to add categories AND reindex?

That would be very possible, yes. It's all Ruby, so whatever you can do with objects etc. is possible with Picky.
So yes, reloading the categories on start, then reindexing sounds good :)
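
Very roughly, such a reload-and-reindex step could look like this (stored_category_names and each_saved_request are placeholders for however you load that data, not existing API):

# Very rough sketch; stored_category_names and each_saved_request are
# placeholders for however you load that data, not existing API.
stored_category_names.each do |name|
  name = name.to_sym
  data[name] rescue data.category(name)   # create the category unless it already exists
end

each_saved_request do |request|
  json = JSON.parse request.some_headers_as_json
  json.each do |key, value|
    data[key.to_sym].add_text request.id, value
  end
end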

Cheers,
   Florian 

Julián Porta

Aug 27, 2012, 4:46:55 PM
to picky...@googlegroups.com


On Monday, August 27, 2012 5:29:44 PM UTC-3, Picky / Florian Hanke wrote:
Hi Julián,

See below.

On Monday, 27 August 2012 21:36:08 UTC+2, Julián Porta wrote:
Yes, it helps a lot. It's clear that dynamically adding categories isn't an option.

Because of the bug?
(Although, I guess I see the problem – the category definition would not be persistent, and Picky, as of yet, does not save the category definition, just the index/category data)

I'm wondering – don't these categories each need specific indexing (tokenizing options etc.)?

The problem is the following: I receive different requests from different sources, and those sources change all the time. Each new request from a new source comes with a lot of specific headers and params, which also change all the time. So, as new sources are added, I need to add specific categories to the index (while all the categories will look into the same JSON data).

So, having a cron job retrieve an already processed list of categories to add, and reindexing a few times a day, seems like a reasonable tradeoff: no realtime indexing, but if a new source appears this morning, after the reindex I'll be able to search for things like "X-Some-Weird-Header:TheValue" that came in a request from the new source.

I'll share the final implementation as soon as I have it.

Thanks for the help!

Picky / Florian Hanke

Aug 27, 2012, 4:56:37 PM
to picky...@googlegroups.com
My pleasure.

Looks good. Another idea might be to simply store each request in a database anyway, with a timestamp, and then tell Picky to index just the new entries (by remembering the last time you indexed; right after startup that would be 01.01.1970 or so). Then set a signal trap, something like Signal.trap("USR1") { data.index # index everything since the last timestamp }, to add the new entries. That way you simply kill -USR1 <Picky PID> to index the latest entries.
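
A tiny sketch of that idea (new_requests_since is a placeholder for your own database query, not Picky API):

# Tiny sketch; new_requests_since stands in for your own DB query.
last_indexed_at = Time.at 0              # i.e. 01.01.1970, right after startup

Signal.trap("USR1") do
  new_requests_since(last_indexed_at).each do |request|
    data.add request
  end
  last_indexed_at = Time.now
end

# Then, from the shell: kill -USR1 <Picky PID>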

Thanks for sharing!

Cheers,
   Florian