On Sun, 09 Mar 2014, Paolo Coffetti wrote:
> Dear Rick and Bookie-rs,
>
> I'm Paolo Coffetti, a software engineer living in Amsterdam, the
> Netherlands.
> I'm very close to a master degree at University of Bergamo, Italy: I've
> finished all the courses and currently working on my thesis which I'll be
> defending on June 9.
Thanks for the intro and very cool. I'm glad others find the ideas in
Bookie interesting. :)
> So, coming to GSoC, I'm particularly interested in 2 ideas proposed by
> Bookie and I would like to ask you more details.
>
>
> *Update Bookie to permit an ElasticSearch/Solr backend for full text
> indexing of content*
Very cool, let us know if you have any questions about the project idea.
> Yesterday I took a look at the way Bookie performs the full text indexing
> of bookmarks and I've got some questions.
> Is the search actually performing a full text search against the textual
> content of bookmarks? If so, why it doesn't seem to be working for me.
> The details are as follows.
>
> In my local dev I added the following bookmarks:
>
> http://chimera.labs.oreilly.com/books/1234000000754/ch08.html
> http://science.nasa.gov/science-news/science-at-nasa/2006/17jan_jack/
>
> After a while Celery triggered the fetching and indexing of those bookmarks
> and I could see the "readable" version of the pages at
> <localhost>/bmark/readable/{{hash_id}}
> Then I tested the search page at <localhost>/search and the one at
> <localhost>/<my_user_name>/recent with keywords found in those bookmarks:
> *Gunicorn*
> *provisioning*
> *astronaut*
> *wonderland*
> but I got no results.
>
> So it seemed to me that the indexing somehow failed and I investigated a
> bit more:
> >>> from whoosh.index import open_dir
> >>> ix = open_dir('bookie_index')
> >>> ix.schema
> <Schema: ['bid', 'description', 'extended', 'readable', 'tags']>
> >>> searcher = ix.searcher()
> >>> list(searcher.lexicon("tags"))
> [u'bookmarks', u'development', u'django', u'driven', u'eating', u'moon',
> u'nasa', u'science', u'tdd', u'test']
> >>> list(searcher.lexicon("description"))
> [u'bookie', u'development', u'driven', u'nasa', u'skiing', u'test',
> u'water', u'website']
> >>> list(searcher.lexicon("readable"))
> []
> Clearly the "tags" and "description" fields were indexed correctly while
> "readable" (which seems to be meant to perform a StemmingAnalyzer of the
> HTML content of bookmarks) is just empty.
Hmm, so this is populated via the Readable model. When it's changed, it
triggers a SQLAlchemy event.
https://github.com/bookieio/Bookie/blob/develop/bookie/models/__init__.py#L288
And then that triggers the celery job you mentioned that should send that
readable content to the indexer.
https://github.com/bookieio/Bookie/blob/develop/bookie/bcelery/tasks.py#L218
If the content is empty it could be because the readable backend isn't
running, but I thought you mentioned that it did fetch the readable content
of the page. If you want to poke further, I'd be curious what you see if
you put a breakpoint in the readable hook (check out pdb in Python if you
haven't yet, it's a powerful tool).
I wonder if the content is getting lost along the way from the Readable
object, to the celery task, and finally to the index.
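Something like the following is what I have in mind, as a rough sketch
only. The function names are made up for illustration and are not Bookie's
actual hook, task, or indexer names; the idea is just to log how much
readable content is present at each hop so you can see where it goes
missing.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("readable_trace")


def trace(stage, content):
    """Log how many characters of readable content reached this stage."""
    size = len(content) if content else 0
    log.debug("%s: %d chars of readable content", stage, size)
    return size


# Hypothetical stand-ins for the three hops (model event, task, indexer).
def on_readable_change(content):
    trace("model event", content)
    # import pdb; pdb.set_trace()  # uncomment to inspect interactively
    return index_task(content)


def index_task(content):
    trace("celery task", content)
    return write_to_index(content)


def write_to_index(content):
    # Returns the size that would be handed to the index.
    return trace("indexer", content)
```

If the size drops to 0 at one of the stages, that's the place to put the
pdb breakpoint.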
> *Create ability to monitor 3rd party sites for bookmark content. (Twitter,
> pocket, etc)*
Very cool, and it's definitely something that would be useful to Bookie
users.
> This is not my official proposal, I haven't deeply studied the code yet,
> nor made a detailed plan, but only a first approach in order to get a
> clearer idea on what the aims are and see if I am on the right track.
> I believe we should try to estimate the effort for each task and give them
> a priority.
> I've never worked for an Open Source project but I've been willing to do
> that since many years, so I'm excited to finally have the chance to do so.
Yes, the idea is to submit the application and then we'll work with the
applicants over the next couple of weeks to bring more detail to the
project proposals, take a better look at time estimates, and make sure
we're clear (as a project) on how we expect the work to run over the
months of GSoC.
> Also, please consider that I'd really love to take part of a Google Summer
> of Code project, so I will apply for more than one project (mainly Python
> projects). I also see that Bookie is very popular and I'll be competing
> with many smart students, so could you please suggest me which idea I
> should focus more on? I have a little preference for the first one, because
> I find it more challenging, shall I go for that or you already have a
> designated student?
Honestly, we don't have anything designated yet. I think either is valid
and good to work on.
>
>
> PS: I think I found a bug...
> Searching against the keyword "#fedora-devel"
> (https://bmark.us/results?search=%23fedora-devel&content=true&submit=Search)
> returns a page with the spinner image rotating forever (tested with Chrome
> and Firefox on Mac OS)
Yep, it looks like the # is picked up and decoded and turns into a #,
breaking the URLs in the API request. If you could file a bug that'd be
great.
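To show why a raw # is a problem in a URL, here's a minimal sketch using
Python 3's urllib.parse. The example.com URL is made up for illustration
and is not Bookie's actual API route; the point is only that everything
after an unencoded # is treated as the fragment, so the rest of the path
and query never make it into the request.

```python
from urllib.parse import quote, urlsplit

term = "#fedora-devel"

# Hypothetical URL, not Bookie's real API route.
base = "http://example.com/search/"

# Raw '#': the search term and the query string end up in the fragment.
raw = urlsplit(base + term + "?content=true")

# Percent-encoded '#': the term stays in the path and the query survives.
encoded = urlsplit(base + quote(term, safe="") + "?content=true")

print(repr(raw.path), repr(raw.query), repr(raw.fragment))
print(repr(encoded.path), repr(encoded.query), repr(encoded.fragment))
```

So whatever builds the API URL needs to keep the term encoded as %23 all
the way through.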
Thanks for the email and let us know if there's anything we can do to help
with things.
--
Rick Harding
@mitechie
http://blog.mitechie.com
http://lococast.net