Re: [bookie] [GSoC] Solr backend; Monitor 3rd party sites


Richard Harding

Mar 9, 2014, 11:37:54 AM
to bookie_b...@googlegroups.com
On Sun, 09 Mar 2014, Paolo Coffetti wrote:

> Dear Rick and Bookie-rs,
>
> I'm Paolo Coffetti, a software engineer living in Amsterdam, the
> Netherlands.
> I'm very close to a master's degree at the University of Bergamo, Italy: I've
> finished all the courses and am currently working on my thesis, which I'll be
> defending on June 9.

Thanks for the intro and very cool. I'm glad others find the ideas in
Bookie interesting. :)


> So, coming to GSoC, I'm particularly interested in 2 ideas proposed by
> Bookie and I would like to ask you more details.
>
>
> *Update Bookie to permit an ElasticSearch/Solr backend for full text
> indexing of content*

Very cool, let us know if you have any questions about the project idea.


> Yesterday I took a look at the way Bookie performs the full text indexing
> of bookmarks and I've got some questions.
> Is the search actually performing a full text search against the textual
> content of bookmarks? If so, why doesn't it seem to be working for me?
> The details are as follows.
>
> In my local dev I added the following bookmarks:
> http://chimera.labs.oreilly.com/books/1234000000754/ch08.html
> http://science.nasa.gov/science-news/science-at-nasa/2006/17jan_jack/
> After a while Celery triggered the fetching and indexing of those bookmarks
> and I could see the "readable" version of the pages at
> <localhost>/bmark/readable/{{hash_id}}
> Then I tested the search page at <localhost>/search and the one at
> <localhost>/<my_user_name>/recent with keywords found in those bookmarks:
> *Gunicorn*
> *provisioning*
> *astronaut*
> *wonderland*
> but I got no results.
>
> So it seemed to me that the indexing somehow failed and I investigated a
> bit more:
> >>> from whoosh.index import open_dir
> >>> ix = open_dir('bookie_index')
> >>> ix.schema
> <Schema: ['bid', 'description', 'extended', 'readable', 'tags']>
> >>> searcher = ix.searcher()
> >>> list(searcher.lexicon("tags"))
> [u'bookmarks', u'development', u'django', u'driven', u'eating', u'moon',
> u'nasa', u'science', u'tdd', u'test']
> >>> list(searcher.lexicon("description"))
> [u'bookie', u'development', u'driven', u'nasa', u'skiing', u'test',
> u'water', u'website']
> >>> list(searcher.lexicon("readable"))
> []
> Clearly the "tags" and "description" fields were indexed correctly while
> "readable" (which seems to be meant to perform a StemmingAnalyzer of the
> HTML content of bookmarks) is just empty.

Hmm, so this is populated via the Readable model. When it's changed, it
triggers a SqlAlchemy event.

https://github.com/bookieio/Bookie/blob/develop/bookie/models/__init__.py#L288

And then that triggers the celery job you mentioned that should send that
readable content to the indexer.

https://github.com/bookieio/Bookie/blob/develop/bookie/bcelery/tasks.py#L218

If the content is empty it could be because the readable backend isn't
running, but I thought you mentioned that it did fetch the readable content
of the page. If you want to poke further, I'd be curious what you find if you
put a breakpoint in the readable hook (check out pdb in Python if you haven't
used it yet; it's a powerful tool).

I wonder if the content is getting lost along the way from the Readable
object, to the celery task, and finally to the index.
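One way to find where it goes missing is to log the content length at each hop of that chain. Here's a toy sketch of the idea; the stage names and the pipeline functions are hypothetical stand-ins, not Bookie's real hooks:

```python
# Hypothetical tracing helper: record how long the readable content is
# at each stage, so the stage that logs length 0 is where it was lost.
def traced(stage, log):
    """Decorator that appends (stage, len(content returned)) to log."""
    def wrap(fn):
        def inner(content):
            result = fn(content)
            log.append((stage, len(result or "")))
            return result
        return inner
    return wrap

log = []

@traced("readable_hook", log)
def readable_hook(content):
    return content  # stand-in for the SqlAlchemy event firing

@traced("celery_task", log)
def celery_task(content):
    return content  # stand-in for the fulltext index celery task

@traced("indexer", log)
def indexer(content):
    return content  # stand-in for the whoosh index write

indexer(celery_task(readable_hook("<p>astronaut</p>")))
```

Any stage that logs a zero length is the one dropping the content.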


> *Create ability to monitor 3rd party sites for bookmark content. (Twitter,
> pocket, etc)*

Very cool, and it's definitely something that would be useful to Bookie
users.


> This is not my official proposal: I haven't deeply studied the code yet,
> nor made a detailed plan; this is only a first approach in order to get a
> clearer idea of what the aims are and see if I am on the right track.
> I believe we should try to estimate the effort for each task and give them
> a priority.
> I've never worked on an Open Source project but I've wanted to for many
> years, so I'm excited to finally have the chance to do so.

Yes, the idea is to submit the application and then we'll work with the
applicants over the next couple of weeks to bring more detail to the
project proposals, take a better look at time estimates, and make sure
we're clear (as a project) how we're expecting to operate the work over the
months of GSoC.


> Also, please consider that I'd really love to take part in a Google Summer
> of Code project, so I will apply for more than one project (mainly Python
> projects). I also see that Bookie is very popular and I'll be competing
> with many smart students, so could you please suggest which idea I should
> focus on? I have a slight preference for the first one, because I find it
> more challenging; shall I go for that, or do you already have a designated
> student?

Honestly, we don't have anything designated yet. I think either is valid
and good to work on.

>
>
> PS: I think I found a bug...
> Searching against the keyword "#fedora-devel"
> (https://bmark.us/results?search=%23fedora-devel&content=true&submit=Search)
> returns a page with the spinner image rotating forever (tested with Chrome
> and Firefox on Mac OS)

Yep, it looks like the # is picked up and encoded, and that breaks the URLs
in the API request. If you could file a bug that'd be great.

Thanks for the email and let us know if there's anything we can do to help
with things.

--

Rick Harding
@mitechie
http://blog.mitechie.com
http://lococast.net

Rick Harding

Mar 9, 2014, 4:38:13 PM
to bookie_b...@googlegroups.com


On Sunday, March 9, 2014 11:07:21 AM UTC-4, Paolo Coffetti wrote:
In my local dev I added the following bookmarks:
http://chimera.labs.oreilly.com/books/1234000000754/ch08.html
http://science.nasa.gov/science-news/science-at-nasa/2006/17jan_jack/
After a while Celery triggered the fetching and indexing of those bookmarks and I could see the "readable" version of the pages at <localhost>/bmark/readable/{{hash_id}}
Then I tested the search page at <localhost>/search and the one at <localhost>/<my_user_name>/recent with keywords found in those bookmarks:
Gunicorn
provisioning
astronaut
wonderland
but I got no results.

So it seemed to me that the indexing somehow failed and I investigated a bit more:
>>> from whoosh.index import open_dir
>>> ix = open_dir('bookie_index')
>>> ix.schema
<Schema: ['bid', 'description', 'extended', 'readable', 'tags']>
>>> searcher = ix.searcher()
>>> list(searcher.lexicon("tags"))
[u'bookmarks', u'development', u'django', u'driven', u'eating', u'moon', u'nasa', u'science', u'tdd', u'test']
>>> list(searcher.lexicon("description"))
[u'bookie', u'development', u'driven', u'nasa', u'skiing', u'test', u'water', u'website']
>>> list(searcher.lexicon("readable"))
[]
Clearly the "tags" and "description" fields were indexed correctly while "readable" (which seems to be meant to perform a StemmingAnalyzer of the HTML content of bookmarks) is just empty.



This was bugging me so I checked it out. Locally, the code you posted works fine after I run 'make celery' for a bit: it goes off, fetches the bookmark contents, and then comes back and indexes them.

Now, I also tested this on the main bmark.us website. That's empty: no lexicon for the readable, and I've noticed that recently. We've been pushing whoosh and I think we've made it angry. I have a celery task to reindex all bookmarks, but it hammers whoosh too hard and makes it unhappy. I'm going to work on adding a script to see if I can slowly reindex things in a single-threaded way and let it run for a while. Then I'll see if the index keeps the content indexed.
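For what it's worth, such a slow single-threaded reindex loop might be sketched like this; `batched`, `reindex_all`, and the `index_one` callable are hypothetical names, not an existing Bookie admin script:

```python
import time

def batched(ids, size):
    """Yield ids in fixed-size chunks."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def reindex_all(bookmark_ids, index_one, batch_size=10, pause=0.0):
    """Walk every bookmark in a single thread, indexing a small batch
    and then pausing, so the whoosh writer is never hammered by a flood
    of concurrent updates. index_one stands in for whatever call
    actually pushes one bookmark into the fulltext index."""
    for batch in batched(bookmark_ids, batch_size):
        for bid in batch:
            index_one(bid)
        time.sleep(pause)

indexed = []
reindex_all(list(range(25)), indexed.append, batch_size=10)
```

In production the pause would be non-zero (say a second per batch) so the index gets breathing room between writes.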

Thanks for the heads up and description of the problem. I'd love to hear if running 'make celery' helps you get content for the readable index in your dev environment.

Rick

Paolo Coffetti

Mar 9, 2014, 6:23:33 PM
to bookie_b...@googlegroups.com
You mean 'make run_celery', right? Unfortunately it doesn't work for me: no lexicon for the readable. But maybe it is a problem limited to my local dev.
Every minute Celery logs something like:
[2014-03-09 23:03:37,372: INFO/MainProcess] Task bookie.bcelery.tasks.fetch_unfetched_bmark_content[8d4344c3-8b66-470a-9112-e89e6882686e] succeeded in 0.0108418464661s: None
That None stinks... I will debug more tomorrow!

By the way, at bmark.us we already have 100k bookmarks, I guess it is really time to move from Whoosh to Solr!

This sounds like a lot of fun!
Paolo

Richard Harding

Mar 9, 2014, 8:36:47 PM
to bookie_b...@googlegroups.com
On Sun, 09 Mar 2014, Paolo Coffetti wrote:

> You mean 'make *run_*celery' right? Unfortunately it doesn't work for me,
> no lexicon for the readable. But maybe it is a problem limited to my local
> dev.
> Every minute Celery logs something like:
> [2014-03-09 23:03:37,372: INFO/MainProcess] Task
> bookie.bcelery.tasks.fetch_unfetched_bmark_content[8d4344c3-8b66-470a-9112-e89e6882686e]
> succeeded in 0.0108418464661s: None
> That None stinks... I will debug more tomorrow!

That's ok. Most of the tasks will have a None there. The tasks talk
directly to the data store (database or fulltext index) and the tasks don't
return any value.

The real issue to look for is that when you store a new bookmark the
process should be:

1) Save a new bookmark in the webui - it triggers a 'fetch bookmark content' celery task
2) That task goes off, fetches the html for that url, parses it, and
creates a Readable object in the database. That triggers a fulltext_index
task.
3) The fulltext index task is given the parsed html stripped of html tags
for indexing. It updates the whoosh index.

At this point, it should be there.
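The fetch-parse-index steps above can be sketched with stdlib pieces. This is illustrative only: the function names (`strip_tags`, `fulltext_index`) and the dict-backed index are stand-ins for Bookie's actual readable parser and whoosh writer:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, dropping the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return " ".join(c.strip() for c in self.chunks if c.strip())

def strip_tags(html):
    """Step 2: parse the fetched page down to tag-free text."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

# Step 3: hand the stripped text to the indexer (a dict keyed by
# bookmark id here, where Bookie would update the whoosh index).
index = {}

def fulltext_index(bid, html):
    index[bid] = strip_tags(html)

fulltext_index(5, "<html><body><p>Gunicorn provisioning</p></body></html>")
```

If any step hands the next one an empty string, the readable lexicon ends up empty exactly as observed.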

I'll be very interested in your testing tomorrow. If it's not working in a
local dev case then there's a bigger issue and I'll keep trying to
duplicate it. It *should* work as it's a big reason I started Bookie years
ago :)

> By the way, at bmark.us we already have 100k bookmarks, I guess it is
> really time to move from Whoosh to Solr!

Definitely! I'm excited to hopefully have a student work on it. I think
it'll be good to show off what Bookie can really do with a more powerful
search tool available.

Rick

Paolo Coffetti

Mar 10, 2014, 1:47:04 PM
to bookie_b...@googlegroups.com
A quick update.

I didn't have much time to debug it today; I'll have more time tomorrow though.
Anyway, bookmarks are correctly fetched and Readable objects are created in the database.
But I noticed an error raised by fulltext_index_bookmark celery task:
[2014-03-10 18:09:55,647: ERROR/MainProcess] Could not load bookmark to fulltext index: 5
So that could be a clue that the content is not indexed.
But somehow (sorry for the vagueness... I need more time to investigate) I managed to make it work and now bookmarks have been indexed and I see the lexicons in Whoosh's index.
But I still cannot find any results when searching in the webui.

I will work more on it tomorrow!

Paolo

Paolo Coffetti

Mar 13, 2014, 5:34:26 PM
to bookie_b...@googlegroups.com
Hi Rick,

I couldn't make the full text search work properly in my local env, nor at bmark.us, so I investigated.
I think I either found some issues or I am missing something; could you please give me your advice?

The details are as follows.
I'm being verbose because I want to be clear.

At bmark.us I have an account with username arigato.
I have the following bookmarks: http://postimg.org/image/9e9e3fmt1/
Let's focus on the second one:
    Description: Django TDD
    Url: http://chimera.labs.oreilly.com/books/1234000000754/ch09.html
The readable has been created: http://postimg.org/image/jg96uwaat/
and it contains the keyword "server_address"
If I perform a search: http://postimg.org/image/w9zqxj6pl/
against the keyword "server_address" (w/out quotes) I get no results: http://postimg.org/image/bayux1e7v/

The same happens in my local environment.

I made sure that the keyword is listed in the lexicon of the index in my local env:
>>> from whoosh.index import open_dir
>>> ix = open_dir('bookie_index')
>>> searcher = ix.searcher()
>>> l = list(searcher.lexicon("readable"))
>>> "server_address" in l
True
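That lexicon check matters: if the term is in the index, a direct query against the index must find the document, so an empty web result points at the query path rather than the index. A toy stdlib inverted index (standing in for whoosh, not its API) illustrates the reasoning; note that a `\w+` tokenizer keeps the underscore, so "server_address" stays one token:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens; \\w+ keeps underscores."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Build a term -> set-of-bookmark-ids inverted index."""
    inverted = {}
    for bid, text in docs.items():
        for tok in set(tokenize(text)):
            inverted.setdefault(tok, set()).add(bid)
    return inverted

# One toy document standing in for the readable content of bookmark 2.
docs = {2: "the server_address used by the test server"}
inverted = build_index(docs)

assert "server_address" in inverted       # the lexicon check above
assert inverted["server_address"] == {2}  # a direct query finds the doc
```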

Plus I ran a search against that keyword using the API in my local env and got the result I was expecting: http://paste.ubuntu.com/7086752/
The same at bmark.us: http://paste.ubuntu.com/7086796/
(my result is not there at bmark.us, not clear why, but anyway there are some results)

So I was wondering why there is no result in the web page.

First I noticed that the template does receive the right search_results variable i.e. a list of 1 Bmark object: http://postimg.org/image/49fkj5mf7/
I double checked the code (file bookie/views/utils.py) and confirmed that:
#173 'search_results': res_list,
So it is correct: the right search_results is passed to the template.

Then I thought that the problem must have been in the template: bookie/templates/utils/result_wrap.mako.
I noticed that the variable search_results is never used in that template.
Instead there is some JavaScript code that is complicated (at least for my limited JavaScript experience).
After some debugging I realized that the JavaScript code is making an API call like:
/api/v1/bmarks/search/server_address?count=1&page=0&with_content=1&api_key=123456&username=admin&resource=

Thus here we have 2 problems:
1) We are hitting the database (and the index) 2 times: first when we instantiate the template variable search_results, and second when we make the API call. We can skip one of these operations.
2) In the API call the parameter search_content=true is missing, so it is not a full text search (the search is not performed against the content of the bookmarked page).
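The corrected query string from fix n.2 can be built like this; the parameter names mirror the captured URL above, while the api_key and username values are just the placeholders from it:

```python
from urllib.parse import urlencode

# Same parameters as the call the JS front end makes, plus the
# missing search_content flag that enables the fulltext search.
params = {
    "count": 10,
    "page": 0,
    "with_content": "true",
    "search_content": "true",  # the missing flag
    "api_key": "123456",       # placeholder value
    "username": "admin",       # placeholder value
}
url = "/api/v1/bmarks/search/server_address?" + urlencode(params)
```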

I debugged a little bit more (a lot actually ;) and found out where the API call is generated: in the function Api.route.Search of bookie/static/js/bookie/api.js
I fixed this adding to that file 2 lines:
#985 with_content: true,
#986 search_content: true /* FIX */
...
#1053 with_content: true,
#1054 search_content: true /* FIX */

And compiled the js files with: make js

Now the problem n.2 is solved, but the n.1 of course is still there.
Problem n.1 is not too complicated to solve though, but we have to think about why we got this issue (hitting the db 2 times) and what we actually want to use: the template variable or the API call?


PS: during my investigation I first used an Ubuntu 12.04 LTS machine, then a Mac OS laptop.
I usually work on a Mac OS laptop with PyCharm and I really love PyCharm's integrated debugger.
In order to make Bookie work on Mac OS I had to do some tricks; do you want me to write some guidelines about how to build Bookie on Mac OS for development?

Paolo

Richard Harding

Mar 14, 2014, 10:28:20 PM
to bookie_b...@googlegroups.com
On Thu, 13 Mar 2014, Paolo Coffetti wrote:

> Hi Rick,
>
> I couldn't make the full text search properly working on my local env and
> not even at bmark.us so I investigated.
> I think either I found some issues or I am missing something, could you
> please give me your advice?
>
> The details are as follow.
> I'm being verbose cause I want to be clear.
>
> At bmark.us I have an account with username arigato.
> I have the following bookmarks: http://postimg.org/image/9e9e3fmt1/
> Let's focus on the second one:
> Description: Django TDD
> Url: http://chimera.labs.oreilly.com/books/1234000000754/ch09.html
> The readable has been created: http://postimg.org/image/jg96uwaat/
> and it contains the keyword "server_address"
> If I perform a search: http://postimg.org/image/w9zqxj6pl/
> against the keyword "server_address" (w/out quotes) I get no results:
> http://postimg.org/image/bayux1e7v/


Right, the bmark.us site's whoosh index does not have anything in the
readable field. I've not gotten a chance to write up an admin script to try
to repopulate it yet. I hope to see if that will bootstrap it.

> The same happens in my local environment.

This is :(

> I made sure that the keyword is listed in the lexicon of the index in my
> local env:
> >>> from whoosh.index import open_dir
> >>> ix = open_dir('bookie_index')
> >>> searcher = ix.searcher()
> >>> l = list(searcher.lexicon("readable"))
> >>> "server_address" in l
> True
>
> Plus I run a search against that keyword using the API in my local env and
> got the result I was expecting: http://paste.ubuntu.com/7086752/
> The same at bmark.us: http://paste.ubuntu.com/7086796/
> (there is no my result at bmark.us, not clear why, but anyway there are
> some results)
>
> So I was wondering why there is no result in the web page.
>
> First I noticed that the template does receive the right search_results
> variable i.e. a list of 1 Bmark object: http://postimg.org/image/49fkj5mf7/
> I double checked the code (file bookie/views/utils.py) and confirmed that:
> #173 'search_results': res_list,
> So it is correct: the right search_results is passed to the template.

Ok cool, great work on parsing the bits that work here. Bookie uses the API
itself from the JS front end. This helps make sure the api functions
properly and if anything is off, Bmark.us users feel the pain and I'm
motivated to fix it, which benefits all API users.

> Then I thought that the problem must have been in the template:
> bookie/templates/utils/result_wrap.mako.
> I noticed that the variable search_results is never used in that template.
> Instead there is some complicated (for my not too long javascript
> experience) javascript code.
> After some debug I realized that the javascript code is making an API call
> like:
> /api/v1/bmarks/search/server_address?count=1&page=0&with_content=1&api_key=123456&username=admin&resource=
>
> Thus here we have 2 problems:
> 1) We are hitting the database (and the index) 2 times. First when we
> instantiate the template variable search_results and second when we make
> the API call.
> We can skip one of these operations.

Very true, and I thought we had an issue for this. I revamped how the
results wrapping worked for the main page (all, mine, etc) and the search
pages needed to be updated as well. This is part of that. I'll add a bug.
Thanks for chasing this down.

> 2) In the API call the parameter search_content=true is missing so it is
> not a full text search (the search is not performed against the content of
> the bookmarked page).

Yep, that makes sense. Originally it was a flag because searching the
fulltext index was slower than a normal search. Honestly, we should remove
this flag. If you're doing search, fulltext search all the things. If it's
slow, we should make it less slow by doing things like moving from whoosh
to ElasticSearch or Solr. :)
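If search did move to ElasticSearch, a per-bookmark document could simply mirror the existing whoosh fields. This is a sketch of the payload only, with field names assumed from the whoosh schema shown earlier ('bid', 'description', 'extended', 'readable', 'tags'); it is not a real client call or Bookie's actual mapping:

```python
import json

def es_doc(bmark):
    """Build the JSON body for one bookmark; the field names mirror
    the whoosh schema and are assumptions, not a real ES mapping."""
    return json.dumps({
        "bid": bmark["bid"],
        "description": bmark["description"],
        "extended": bmark.get("extended", ""),
        "readable": bmark.get("readable", ""),
        "tags": bmark.get("tags", []),
    })

doc = es_doc({
    "bid": 5,
    "description": "Django TDD",
    "readable": "server_address and the rest of the page text",
    "tags": ["django", "tdd"],
})
```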

> I debugged a little bit more (a lot actually ;) and found out where the API
> call is generated: in the function Api.route.Search of
> bookie/static/js/bookie/api.js
> I fixed this adding to that file 2 lines:
> #985 with_content: true,
> #986 search_content: true /* FIX */
> ...
> #1053 with_content: true,
> #1054 search_content: true /* FIX */
>
> And compiled the js files with: make js
>
> Now the problem n.2 is solved, but the n.1 of course is still there.
> Not too complicated to solve the problem n.1 tho but we have to think why
> we got this issue (hitting 2 times the db) and what we actually want to
> use: template variable or api call?

The goal is to use the API. Back in the history of Bookie I saw a talk at
PyOhio about writing good apis and got motivated to make Bookie even more
of a JS heavy app so that we flexed the muscles of the completely rewritten
API. The search page didn't get the love and update attention it should
have and that's what you're seeing.


> PS: during my investigation I used first an ubuntu 12.04LTS machine, then a
> Mac OS laptop.
> I usually work on a Mac OS laptop with PyCharm and I really love PyCharm's
> integrated debugger.
> In order to make Bookie work on Mac OS I had to do some tricks, do you want
> me to write some guidelines about how to build Bookie on Mac OS for
> development?

Definitely! I know a couple of students have tried to get it running on OSX
and it's something that's been on my todo list now that I have a Mac as
well.

I'll be happy to help debug, QA, and land OSX friendly changes.