Excel file term indexing


Hala Gamal

Feb 17, 2013, 3:02:20 PM
to who...@googlegroups.com

How can I use Whoosh to index the CSV file in the image, so that I get each word and where it is mentioned?

Philippe Ombredanne

Feb 17, 2013, 9:55:38 PM
to who...@googlegroups.com
On Sun, Feb 17, 2013 at 12:02 PM, Hala Gamal <halaga...@gmail.com> wrote:
> how to use whoosh to make indexing of csv file in the image to get each
> word and where it is mentioned
If I understand correctly what you want (though I am not sure), I
would do this: treat each row as if it were a document. Therefore:
- Create a schema for your index that has one field for each column
of your CSV, and possibly one extra field in which you also index all
the columns together.
- At indexing time, read the CSV and, for each row, add a document
to the index, properly fielded per the schema and identified, possibly
by the row number or some other field that is unique.
- This way, when searching, you can know exactly which row and
possibly which column was found.

--
Philippe Ombredanne

+1 650 799 0949 | pombr...@nexB.com
DejaCode Enterprise at http://www.dejacode.com
nexB Inc. at http://www.nexb.com

Hala Gamal

Feb 18, 2013, 1:33:12 PM
to who...@googlegroups.com
But can you give me a link or tutorial that makes it easier, as I'm new to using Whoosh?


--
You received this message because you are subscribed to the Google Groups "Whoosh" group.
To unsubscribe from this group and stop receiving emails from it, send an email to whoosh+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



Philippe Ombredanne

Feb 18, 2013, 2:01:17 PM
to who...@googlegroups.com
On Mon, Feb 18, 2013 at 10:33 AM, Hala Gamal <halaga...@gmail.com> wrote:
> but can you give me link or tutorial that makes i easier as i'm new un using
> whoosh?
You will have to read the docs at
http://whoosh.readthedocs.org/en/latest/

If you think they are not good enough for a new Whoosh user, then
please submit a ticket so they can be improved:
https://bitbucket.org/mchaput/whoosh/issues ... and, even better,
suggest improvements and provide updated documentation.

Hala Gamal

Feb 19, 2013, 8:17:07 AM
to who...@googlegroups.com
Can I ask how to make a row field when building the schema?

Hala Gamal

Feb 19, 2013, 8:19:37 AM
to who...@googlegroups.com
Sorry, I mean column -> field?

Philippe Ombredanne

Feb 19, 2013, 11:56:04 AM
to who...@googlegroups.com
On Tue, Feb 19, 2013 at 5:19 AM, Hala Gamal <halaga...@gmail.com> wrote:
> sorry i mean column ->field?
> On Tue, Feb 19, 2013 at 3:17 PM, Hala Gamal <halaga...@gmail.com> wrote:
>> can i ask you how to make row field in building the schema?
See the doc here:
http://whoosh.readthedocs.org/en/latest/quickstart.html#the-index-and-schema-objects

Say you have 5 columns, c1 to c5, in your CSV and they are all text. Then use this:
schema = Schema(c1=TEXT, c2=TEXT, c3=TEXT, c4=TEXT, c5=TEXT)

Hala Gamal

Feb 19, 2013, 2:14:17 PM
to who...@googlegroups.com
Can I ask your opinion on where the mistake is here:
import whoosh
import xldr
import csv
from xldr import open_workbook,XL_CELL_TEXT
from whoosh import *
schema = Schema(A1=NUMERIC,B1=NUMERIC,C1=NUMERIC,D1=TEXT,E1=KEYWORD,G1=TEXT)
ix = create_in("indexdir", schema)
writer=ix.writer()
book=open_workbook('q.csv')
sheet=book.sheet_by_index(0)
for row_index in range(sheet.nrows):
    writer.add_document(sheet.cell(row_index,0),sheet.cell(row_index,1),sheet.cell(row_index,2),sheet.cell(row_index,3),sheet.cell(row_index,4),sheet.cell(row_index,5),sheet.cell(row_index,6))
writer.commit()


Hala Gamal

Feb 19, 2013, 2:47:14 PM
to who...@googlegroups.com
And I got this error:
File "D:/Python27/try.py", line 14, in <module>
    writer.add_document(A1=sheet.cell(row_index,0),B1=sheet.cell(row_index,1),C1=sheet.cell(row_index,2),D1=sheet.cell(row_index,3),E1=sheet.cell(row_index,4),F1=sheet.cell(row_index,5),G1=sheet.cell(row_index,6))
  File "D:/Python27\whoosh\filedb\filewriting.py", line 369, in add_document
    items = field.index(value)
  File "D:/Python27\whoosh\fields.py", line 466, in index
    return [(txt, 1, 1.0, '') for txt in self._tiers(num)]
  File "D:/Python27\whoosh\fields.py", line 454, in _tiers
    yield self.to_text(num, shift=shift)
  File "D:/Python27\whoosh\fields.py", line 487, in to_text
    return self._to_text(self.prepare_number(x), shift=shift,
  File "D:/Python27\whoosh\fields.py", line 476, in prepare_number
    x = self.type(x)
TypeError: int() argument must be a string or a number, not 'Cell'
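
The traceback suggests that a NUMERIC field received an xlrd `Cell` object rather than a number: `sheet.cell()` returns a wrapper, and the plain value is available as `cell.value` or through `sheet.cell_value(row, col)`. If the data really is a CSV file, the standard `csv` module is enough and xlrd is not needed. The sketch below (Python 3) shows one way to turn a row of strings into keyword arguments of the right types; the column-to-type mapping and the sample row are assumptions for illustration:

```python
import csv
import io

# Field name and converter for each column position (assumed layout).
COLUMNS = [("A1", int), ("B1", int), ("C1", int),
           ("D1", str), ("E1", str), ("F1", str), ("G1", str)]


def row_to_fields(row):
    """Turn one CSV row (a list of strings) into add_document() keyword args."""
    doc = {}
    for (fieldname, convert), raw in zip(COLUMNS, row):
        raw = raw.strip()
        if raw:  # skip empty cells instead of indexing empty values
            doc[fieldname] = convert(raw)
    return doc


# Hypothetical sample row; with a real file, iterate over csv.reader(f) instead.
sample = io.StringIO("1, 20,300,some text,tag1 tag2,,more text\n")
docs = [row_to_fields(row) for row in csv.reader(sample)]
print(docs[0])
```

Each dictionary can then be passed on as `writer.add_document(**doc)`.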

Hala Gamal

Feb 19, 2013, 3:40:48 PM
to who...@googlegroups.com
I'm very sorry for all of that. I modified the code:
import whoosh
import xlrd
import os
import csv
from xlrd import open_workbook,XL_CELL_TEXT
from whoosh import *
from whoosh.fields import *
from whoosh .index import create_in
schema = Schema(A1=NUMERIC,B1=NUMERIC,C1=NUMERIC,D1=TEXT,E1=KEYWORD,F1=TEXT,G1=TEXT)
if not os.path.exists("index"):
    os.mkdir("index")

ix = create_in("index",schema)
schema = ix.schema
writer=ix.writer()
book=open_workbook('q.csv')
sheet=book.sheet_by_index(0)
for row_index in range(sheet.nrows):
   
            write_cmd = "writer.add_document("
            i = 0
            for field in schema.field_names():
                f, v = field, cell[row_index][i]
                if v.__class__ == str: write_cmd += f + "=u'" + unicode( v ) + "',"
                elif v.__class__ == unicode: write_cmd += f + "=u'" + unicode( v ) + "',"
                elif v.__class__ == int:write_cmd += f + "=" + unicode( v ) + ","        #must change 1 to 0001
                else:pass
                i += 1

            write_cmd = write_cmd[:-1] + ")"
writer.commit()
And the error: AttributeError: 'Schema' object has no attribute 'field_names'

Matt Chaput

Feb 19, 2013, 3:43:52 PM
to who...@googlegroups.com
> AND The error:AttributeError: 'Schema' object has no attribute
> 'field_names'

The method is actually Schema.names(). If you saw field_names()
somewhere in the docs, please let me know so I can fix it.

Thanks,

Matt

Hala Gamal

Feb 19, 2013, 3:51:09 PM
to who...@googlegroups.com
I saw it on the internet, but if I see it in the documentation I will tell you.
But can I ask where you think the error is? I corrected it to schema.names() and got this error:
Traceback (most recent call last):
  File "D:\Python27\try.py", line 15, in <module>
    writer=ix.writer()
  File "build\bdist.win32\egg\whoosh\filedb\fileindex.py", line 258, in writer
    return SegmentWriter(self, **kwargs)
  File "build\bdist.win32\egg\whoosh\filedb\filewriting.py", line 137, in __init__
    raise LockError
LockError


Matt Chaput

Feb 19, 2013, 3:52:58 PM
to who...@googlegroups.com
On 19/02/2013 3:51 PM, Hala Gamal wrote:
> LockError "

This indicates that two processes are trying to open a writer at the
same time, or that the same process is trying to open a writer twice.
Only one writer is allowed at a time.

Matt


Hala Gamal

Feb 19, 2013, 3:55:55 PM
to who...@googlegroups.com
OK, but can you recommend any solution?





Matt Chaput

Feb 19, 2013, 4:01:00 PM
to who...@googlegroups.com
On 19/02/2013 3:55 PM, Hala Gamal wrote:
> ok but can you recommend any solution?

Every time you put your code in an email message it loses indentation.
Can you put it in a pastebin somewhere and send the URL?

Matt

Hala Gamal

Feb 19, 2013, 4:07:41 PM
to who...@googlegroups.com
OK, I have attached my code and the file I work on,
and also pasted it in pastebin: http://pastebin.com/kZQ1pAVG




try.py
q.csv

Matt Chaput

Feb 19, 2013, 4:49:44 PM
to who...@googlegroups.com
On 19/02/2013 4:07 PM, Hala Gamal wrote:
> ok, i have attached my code and the file i work on
> and also pasted it in pastebin : http://pastebin.com/kZQ1pAVG

The code you pasted there doesn't do anything, and I don't see how it
could generate a LockError. But here's a quick sketch of how you could
use the csv module and whoosh.

I'll assume a CSV file formatted WITHOUT column headings in the first
row. For example:


iPod, 5, Plays music
iPad, 10, Shows websites
Mac Pro, 2, Makes websites


Here's how you could read the rows using csv and associate each column
value with a field name:


from whoosh import fields, index
import os.path
import csv

# This list associates a name with each position in a row
columns = ["name", "quantity", "description"]

schema = fields.Schema(name=fields.TEXT,
                       quantity=fields.NUMERIC,
                       description=fields.TEXT)

# Create the Whoosh index
indexname = "index"
if not os.path.exists(indexname):
    os.mkdir(indexname)
ix = index.create_in(indexname, schema)

# Open a writer for the index
with ix.writer() as writer:
    # Open the CSV file
    with open("stuff.csv", "rb") as csvfile:
        # Create a csv reader object for the file
        csvreader = csv.reader(csvfile)

        # Read each row in the file
        for row in csvreader:

            # Create a dictionary to hold the document values for this row
            doc = {}

            # Read the values for the row enumerated like
            # (0, "name"), (1, "quantity"), etc.
            for colnum, value in enumerate(row):

                # Get the field name from the "columns" list
                fieldname = columns[colnum]

                # Strip any whitespace and convert to unicode
                # NOTE: you need to pass the right encoding here!
                value = unicode(value.strip(), "utf-8")

                # Put the value in the dictionary
                doc[fieldname] = value

            # Pass the dictionary to the add_document method
            writer.add_document(**doc)


(This could be much more compact but I've expanded some things for
clarity.) Of course if you want/need to use xlrd instead of csv, you
have to modify the example to use it. I'm not familiar with xlrd.

Cheers,

Matt




Hala Gamal

Feb 21, 2013, 3:45:39 PM
to who...@googlegroups.com
Thanks. I want to point out a small error in the quick start in the documentation:
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"first")
I think you must add "u" before "first" :)






Matt Chaput

Feb 21, 2013, 4:01:42 PM
to who...@googlegroups.com
On 21/02/2013 3:45 PM, Hala Gamal wrote:
> thanks,i wanna tell you simple error in quick start in the documentation:
>
> with ix.searcher() as searcher:
> query = QueryParser("content", ix.schema).parse(u"first")
>
> i think you must add "u" before first :)

I was trying to update the examples to look like Python 3. Maybe now
that Python 3 has u"" again I should put them back.

Matt


Matt Billenstein

Feb 19, 2013, 3:46:40 PM
to who...@googlegroups.com
On Tue, Feb 19, 2013 at 10:40:48PM +0200, Hala Gamal wrote:
> i'm very sorry for all of that,,i modified the code:

Really not a forum for debugging your script - if you have specific problems
with Whoosh, it's best to distill those down to a smaller example script so
people can help.

m

--
Matt Billenstein
ma...@vazor.com
http://www.vazor.com/

seanieb

Feb 23, 2013, 2:36:47 PM
to who...@googlegroups.com, Matt Billenstein
Might I also suggest that we direct specific implementation issues like this to Stack Overflow (sorted by Whoosh):

And if any of you have some spare time there are a number of unanswered questions there:

LIU XIAOYU

Apr 1, 2017, 3:54:48 AM
to Whoosh, ma...@whoosh.ca
Thank you for providing the code.
I have tried to use and modify this code for a document containing 49,000 rows, and the part under "Open a writer for the index" is very time-consuming: it took over 10 minutes. Is there a way to use the writer only once for the indexing?
Thank you!