excel file term indexing

Hala Gamal

Feb 17, 2013, 3:02:20 PM

to who...@googlegroups.com

How can I use Whoosh to index the CSV file shown in the image, so that I get each word and where it is mentioned?

Philippe Ombredanne

Feb 17, 2013, 9:55:38 PM

to who...@googlegroups.com

On Sun, Feb 17, 2013 at 12:02 PM, Hala Gamal <halaga...@gmail.com> wrote:
> how to use whoosh to make indexing of csv file in the image to get each
> word and where it is mentioned

If I understand correctly what you want (though I am not sure), I
would do this: treat each row as if it were a document. Therefore:
- Create a schema for your index that has one field for each column
of your CSV, and possibly one extra field where you index all the
columns together.
- At indexing time, read the CSV and, for each row, add a document
to the index, properly fielded per the schema and identified, for
example, by the row number or some other field that is unique.
- This way, when searching, you can know exactly which row and
possibly which column was found.

--
Philippe Ombredanne

+1 650 799 0949 | pombr...@nexB.com
DejaCode Enterprise at http://www.dejacode.com
nexB Inc. at http://www.nexb.com
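
Philippe's row-per-document suggestion can be sketched with the standard library alone, before any Whoosh code is involved. This is only a conceptual illustration under stated assumptions: the column names `c1` to `c3`, the extra `row` and `all_columns` keys, the helper names, and the sample data are all hypothetical, and the small word-to-location map merely mimics what a search index records; it is not how Whoosh stores its data.

```python
import csv
import io
from collections import defaultdict


def rows_to_documents(csvfile, columns):
    """Yield one dict per CSV row: one entry per column, a row number,
    and a catch-all entry that joins the text of every column."""
    for row_number, row in enumerate(csv.reader(csvfile), start=1):
        values = [value.strip() for value in row]
        doc = dict(zip(columns, values))
        doc["row"] = row_number                 # unique identifier (assumed name)
        doc["all_columns"] = " ".join(values)   # catch-all field (assumed name)
        yield doc


def build_postings(documents, columns):
    """Map each lowercased word to the (row, column) pairs where it occurs."""
    postings = defaultdict(list)
    for doc in documents:
        for column in columns:
            for word in doc.get(column, "").lower().split():
                postings[word].append((doc["row"], column))
    return postings


# Hypothetical sample data standing in for the real CSV file
columns = ["c1", "c2", "c3"]
sample = io.StringIO("apple, red, sweet fruit\nlemon, yellow, sour fruit\n")
docs = list(rows_to_documents(sample, columns))
postings = build_postings(docs, columns)
```

Each dict in `docs` has the keyword-per-field shape that a Whoosh writer's `add_document(**doc)` call accepts, provided the schema defines fields with the same names.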

Hala Gamal

Feb 18, 2013, 1:33:12 PM

to who...@googlegroups.com

But can you give me a link or tutorial that makes it easier, as I'm new to using Whoosh?

--
You received this message because you are subscribed to the Google Groups "Whoosh" group.
To unsubscribe from this group and stop receiving emails from it, send an email to whoosh+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Philippe Ombredanne

Feb 18, 2013, 2:01:17 PM

to who...@googlegroups.com

On Mon, Feb 18, 2013 at 10:33 AM, Hala Gamal <halaga...@gmail.com> wrote:
> but can you give me link or tutorial that makes i easier as i'm new un using
> whoosh?

You will have to read the docs at
http://whoosh.readthedocs.org/en/latest/

If you think they are not good enough for a new Whoosh user, then
please submit a ticket so they can be improved:
https://bitbucket.org/mchaput/whoosh/issues ... and even better,
suggest improvements and provide updated documentation.

Hala Gamal

Feb 19, 2013, 8:17:07 AM

to who...@googlegroups.com

Can I ask how to make a row field when building the schema?

Hala Gamal

Feb 19, 2013, 8:19:37 AM

to who...@googlegroups.com

Sorry, I mean column -> field?

Philippe Ombredanne

Feb 19, 2013, 11:56:04 AM

to who...@googlegroups.com

On Tue, Feb 19, 2013 at 5:19 AM, Hala Gamal <halaga...@gmail.com> wrote:
> sorry i mean column ->field?
> On Tue, Feb 19, 2013 at 3:17 PM, Hala Gamal <halaga...@gmail.com> wrote:
>> can i ask you how to make row field in building the schema?

See the doc here:
http://whoosh.readthedocs.org/en/latest/quickstart.html#the-index-and-schema-objects

Say you have 5 columns, c1 to c5, in your CSV and they are all text; use this:
from whoosh.fields import Schema, TEXT
schema = Schema(c1=TEXT, c2=TEXT, c3=TEXT, c4=TEXT, c5=TEXT)

Hala Gamal

Feb 19, 2013, 2:14:17 PM

to who...@googlegroups.com

Can I ask your opinion on where the mistake is here:

import whoosh
import xldr
import csv
from xldr import open_workbook,XL_CELL_TEXT
from whoosh import *
schema = Schema(A1=NUMERIC,B1=NUMERIC,C1=NUMERIC,D1=TEXT,E1=KEYWORD,G1=TEXT)
ix = create_in("indexdir", schema)
writer=ix.writer()
book=open_workbook('q.csv')
sheet=book.sheet_by_index(0)
for row_index in range(sheet.nrows):
    writer.add_document(sheet.cell(row_index,0),sheet.cell(row_index,1),sheet.cell(row_index,2),sheet.cell(row_index,3),sheet.cell(row_index,4),sheet.cell(row_index,5),sheet.cell(row_index,6))
writer.commit()

Hala Gamal

Feb 19, 2013, 2:47:14 PM

to who...@googlegroups.com

And I got this error:

  File "D:/Python27/try.py", line 14, in <module>
    writer.add_document(A1=sheet.cell(row_index,0),B1=sheet.cell(row_index,1),C1=sheet.cell(row_index,2),D1=sheet.cell(row_index,3),E1=sheet.cell(row_index,4),F1=sheet.cell(row_index,5),G1=sheet.cell(row_index,6))
  File "D:/Python27\whoosh\filedb\filewriting.py", line 369, in add_document
    items = field.index(value)
  File "D:/Python27\whoosh\fields.py", line 466, in index
    return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
  File "D:/Python27\whoosh\fields.py", line 454, in _tiers
    yield self.to_text(num, shift=shift)
  File "D:/Python27\whoosh\fields.py", line 487, in to_text
    return self._to_text(self.prepare_number(x), shift=shift,
  File "D:/Python27\whoosh\fields.py", line 476, in prepare_number
    x = self.type(x)
TypeError: int() argument must be a string or a number, not 'Cell'

Hala Gamal

Feb 19, 2013, 3:40:48 PM

to who...@googlegroups.com

I'm very sorry for all of that. I modified the code:

import whoosh
import xlrd
import os
import csv
from xlrd import open_workbook,XL_CELL_TEXT
from whoosh import *
from whoosh.fields import *
from whoosh.index import create_in
schema = Schema(A1=NUMERIC,B1=NUMERIC,C1=NUMERIC,D1=TEXT,E1=KEYWORD,F1=TEXT,G1=TEXT)
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index",schema)
schema = ix.schema
writer=ix.writer()
book=open_workbook('q.csv')
sheet=book.sheet_by_index(0)
for row_index in range(sheet.nrows):
    write_cmd = "writer.add_document("
    i = 0
    for field in schema.field_names():
        f, v = field, cell[row_index][i]
        if v.__class__ == str: write_cmd += f + "=u'" + unicode( v ) + "',"
        elif v.__class__ == unicode: write_cmd += f + "=u'" + unicode( v ) + "',"
        elif v.__class__ == int: write_cmd += f + "=" + unicode( v ) + "," #must change 1 to 0001
        else: pass
        i += 1
    write_cmd = write_cmd[:-1] + ")"
writer.commit()

And the error: AttributeError: 'Schema' object has no attribute 'field_names'

Matt Chaput

Feb 19, 2013, 3:43:52 PM

to who...@googlegroups.com

> AND The error:AttributeError: 'Schema' object has no attribute
> 'field_names'

The method is actually Schema.names(). If you saw field_names()
somewhere in the docs, please let me know so I can fix it.

Thanks,

Matt

Hala Gamal

Feb 19, 2013, 3:51:09 PM

to who...@googlegroups.com

I saw it on the internet, but if I see it in the documentation I will tell you.

But can I ask where you think the error is? I changed it to schema.names() and got this error:

Traceback (most recent call last):
  File "D:\Python27\try.py", line 15, in <module>
    writer=ix.writer()
  File "build\bdist.win32\egg\whoosh\filedb\fileindex.py", line 258, in writer
    return SegmentWriter(self, **kwargs)
  File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 137, in __init__
    raise LockError
LockError


Matt Chaput

Feb 19, 2013, 3:52:58 PM

to who...@googlegroups.com

On 19/02/2013 3:51 PM, Hala Gamal wrote:
> LockError "

This indicates that two processes are trying to open a writer at the
same time, or that the same process is trying to open a writer twice.
Only one writer is allowed at a time.

Matt

Hala Gamal

Feb 19, 2013, 3:55:55 PM

to who...@googlegroups.com

OK, but can you recommend a solution?

Matt Chaput

Feb 19, 2013, 4:01:00 PM

to who...@googlegroups.com

On 19/02/2013 3:55 PM, Hala Gamal wrote:
> ok but can you recommend any solution?

Every time you put your code in an email message it loses indentation.
Can you put it in a pastebin somewhere and send the URL?

Matt

Hala Gamal

Feb 19, 2013, 4:07:41 PM

to who...@googlegroups.com

OK, I have attached my code and the file I am working on.

I have also pasted it on Pastebin: http://pastebin.com/kZQ1pAVG

try.py

q.csv

Matt Chaput

Feb 19, 2013, 4:49:44 PM

to who...@googlegroups.com

On 19/02/2013 4:07 PM, Hala Gamal wrote:
> ok, i have attached my code and the file i work on
> and also pasted it in pastebin : http://pastebin.com/kZQ1pAVG

The code you pasted there doesn't do anything, and I don't see how it
could generate a LockError. But here's a quick sketch of how you could
use the csv module and whoosh.

I'll assume a CSV file formatted WITHOUT column headings in the first
row. For example:

iPod, 5, Plays music
iPad, 10, Shows websites
Mac Pro, 2, Makes websites

Here's how you could read the rows using csv and associate each column
value with a field name:

from whoosh import fields, index
import os.path
import csv

# This list associates a name with each position in a row
columns = ["name", "quantity", "description"]

schema = fields.Schema(name=fields.TEXT,
                       quantity=fields.NUMERIC,
                       description=fields.TEXT)

# Create the Whoosh index
indexname = "index"
if not os.path.exists(indexname):
    os.mkdir(indexname)
ix = index.create_in(indexname, schema)

# Open a writer for the index
with ix.writer() as writer:
    # Open the CSV file
    with open("stuff.csv", "rb") as csvfile:
        # Create a csv reader object for the file
        csvreader = csv.reader(csvfile)

        # Read each row in the file
        for row in csvreader:

            # Create a dictionary to hold the document values for this row
            doc = {}

            # Read the values for the row enumerated like
            # (0, "name"), (1, "quantity"), etc.
            for colnum, value in enumerate(row):

                # Get the field name from the "columns" list
                fieldname = columns[colnum]

                # Strip any whitespace and convert to unicode
                # NOTE: you need to pass the right encoding here!
                value = unicode(value.strip(), "utf-8")

                # Put the value in the dictionary
                doc[fieldname] = value

            # Pass the dictionary to the add_document method
            writer.add_document(**doc)

(This could be much more compact but I've expanded some things for
clarity.) Of course if you want/need to use xlrd instead of csv, you
have to modify the example to use it. I'm not familiar with xlrd.

Cheers,

Matt
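
The more compact form Matt alludes to might look like the following under Python 3, where the csv module already yields text, so the explicit `unicode(...)` conversion is unnecessary. This is a sketch under stated assumptions, not Matt's code: `read_documents` is a hypothetical helper name, and the in-memory sample stands in for stuff.csv. With a real file, the csv documentation recommends opening it as `open("stuff.csv", newline="", encoding="utf-8")`, with the encoding adjusted to match the file.

```python
import csv
import io


def read_documents(csvfile, columns):
    """Yield one dict per CSV row, keyed by the names in `columns`."""
    # Passing fieldnames= tells DictReader the file has no header row
    reader = csv.DictReader(csvfile, fieldnames=columns)
    for row in reader:
        # Strip surrounding whitespace; drop surplus columns and the
        # None placeholders DictReader uses for short rows
        yield {name: value.strip() for name, value in row.items()
               if name in columns and value is not None}


# Sample rows taken from the example CSV above
columns = ["name", "quantity", "description"]
sample = io.StringIO(
    "iPod, 5, Plays music\n"
    "iPad, 10, Shows websites\n"
    "Mac Pro, 2, Makes websites\n"
)
docs = list(read_documents(sample, columns))
```

Each resulting dict can then be handed to the writer as `writer.add_document(**doc)`, exactly as in the expanded version.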

Hala Gamal

Feb 21, 2013, 3:45:39 PM

to who...@googlegroups.com

Thanks. I want to point out a simple error in the quick start in the documentation:

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"first")

I think you must add "u" before "first" :)

Matt Chaput

Feb 21, 2013, 4:01:42 PM

to who...@googlegroups.com

On 21/02/2013 3:45 PM, Hala Gamal wrote:
> thanks,i wanna tell you simple error in quick start in the documentation:
>
> with ix.searcher() as searcher:
> query = QueryParser("content", ix.schema).parse(u"first")
>
> i think you must add "u" before first :)

I was trying to update the examples to look like Python 3. Maybe now
that Python 3 has u"" again I should put them back.

Matt

Matt Billenstein

Feb 19, 2013, 3:46:40 PM

to who...@googlegroups.com

On Tue, Feb 19, 2013 at 10:40:48PM +0200, Hala Gamal wrote:
> i'm very sorry for all of that,,i modified the code:

Really not a forum for debugging your script - if you have specific problems
with Whoosh, it's best to distill those down to a smaller example script so
people can help.

m

--
Matt Billenstein
ma...@vazor.com
http://www.vazor.com/

seanieb

Feb 23, 2013, 2:36:47 PM

to who...@googlegroups.com, Matt Billenstein

Might I also suggest that we direct specific implementation issues like this to Stack Overflow (sorted by Whoosh):

http://stackoverflow.com/questions/tagged/whoosh

And if any of you have some spare time there are a number of unanswered questions there:

http://stackoverflow.com/questions/tagged/whoosh?sort=unanswered&pagesize=15

LIU XIAOYU

Apr 1, 2017, 3:54:48 AM

to Whoosh, ma...@whoosh.ca

Thank you for providing the code.

I have tried to use and modify this code for documents containing 49,000 rows, and the part "Open a writer for the index" is very time-consuming: it took me over 10 minutes. Is there a way to use the writer only once for the indexing?

Thank you!
