Confused how PyTables indexing compares to pandas HDFStore (PyTables) indexing

327 views
Skip to first unread message

Evan

unread,
Oct 12, 2016, 12:11:22 AM10/12/16
to pytables-users
The PyTables library and the HDFStore object (based on PyTables) both provide indexing for the user. 

For the latter case, users instantiate an HDFStore object and then chose which columns to index. 

    store = HDFStore('file1.hd5')
    key = "key_name"
    index_columns = ["column1", "column2"]
    store.append(key,... data_columns=index_columns)

Here we index on two columns, which should optimize our search. 

For PyTables alone, we create an HDF5 file as follows (from the documentation):

from tables import *

class Particle(IsDescription):
    identity = StringCol(itemsize=22, dflt=" ", pos=0)  # character String
    idnumber = Int16Col(dflt=1, pos = 1)  # short integer
    speed    = Float32Col(dflt=1, pos = 2)  # single-precision

# Open a file in "w"rite mode
fileh = open_file("objecttree.h5", mode = "w")

# Get the HDF5 root group
root = fileh.root

# Create the groups
group1 = fileh.create_group(root, "group1")
group2 = fileh.create_group(root, "group2")

# Now, create an array in root group
array1 = fileh.create_array(root, "array1", ["string", "array"], "String array")

# Create 1 new tables in group1
table1 = fileh.create_table(group1, "table1", Particle)

# Get the record object associated with the table:
row = table1.row

# Fill the table with 10 records
for i in xrange(10):
    # First, assign the values to the Particle record
    row['identity']  = 'This is particle: %2d' % (i)
    row['idnumber'] = i
    row['speed']  = i * 2.

    # This injects the Record values
    row.append()

# Flush the table buffers
table.flush()

# Finally, close the file (this also will flush all the remaining buffers!)
fileh.close()


Users index columns by using "Column.create_index()"

For example: 

    indexrows = table.cols.var1.create_index() 
    indexrows = table.cols.var2.create_index() 
    indexrows = table.cols.var3.create_index()

Two questions:

(1) I'm afraid in our PyTables example above, it is not clear to me how to set the indices (indexes). There are no columns defined. To my mind, there are three fields: identity, idnumber, speed. Let's say I wanted to place an index on speed and identity. How would one do this? 

(2) Are there any benchmarks between the the pandas based indexing and PyTables based indexing? Is one faster than the other? Does one take up more disk space (i.e. larger HDF5 file) than the other? 

Thank you for any help! Apologies for so many questions recently

Thanks, Evan










Francesc Alted

unread,
Oct 12, 2016, 3:33:31 AM10/12/16
to Evan, pytables-users
indexedrows = table.cols.identity.create_index()
 

(2) Are there any benchmarks between the the pandas based indexing and PyTables based indexing? Is one faster than the other? Does one take up more disk space (i.e. larger HDF5 file) than the other?

pandas just uses PyTables for doing the indexing, so they are going to behave very close.
 
 

Thank you for any help! Apologies for so many questions recently

No problem!
 

Thanks, Evan










--
You received this message because you are subscribed to the Google Groups "pytables-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+unsubscribe@googlegroups.com.
To post to this group, send email to pytables-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Francesc Alted

Francesc Alted

unread,
Oct 12, 2016, 3:38:01 AM10/12/16
to Evan, pytables-users
Also, you might be interested in a comparison between PyTables and other data container (including relational databases):

https://github.com/FrancescAlted/DataContainersTutorials/blob/master/4-On-Disk-Tables.ipynb

This is for a relatively small database (1 million entries), but it could give you an idea on what to expect.

 

Thanks, Evan










To post to this group, send email to pytable...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Francesc Alted



--
Francesc Alted

Evan

unread,
Oct 12, 2016, 10:12:36 AM10/12/16
to pytables-users, evan.bie...@gmail.com, fal...@gmail.com
Thank you for the prompt response. 

So just to be clear, users should index the columns using 

indexedrows = table.cols.identity.create_index()

before filling the tables with records? That is, users how instantiate the `tables` and `rows` object, then use 

indexedrows = table.cols.identity.create_index(), e.g. with the example below

...
...
# Create 1 new tables in group1
table1 = fileh.create_table(group1, "table1", Particle)

# Get the record object associated with the table:
row = table1.
row

# now index columns

indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()

Does the order when columns are indexed matter?

Question 2:

So, `indexedrows` is not a traditional python variable? 

If I were to index on threes columns 'col1', 'col2', 'col3', I would use

indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()
indexedrows = table.cols3.identity.create_index()

`indexedrows` must automatically execute these commands. For a traditional "variable", the only thing saved is `table.cols3.identity.create_index()`---the other two have been written over. 

Thanks for the help!

Thanks, Evan










To unsubscribe from this group and stop receiving emails from it, send an email to pytables-user...@googlegroups.com.
To post to this group, send email to pytable...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Francesc Alted

Francesc Alted

unread,
Oct 12, 2016, 10:40:05 AM10/12/16
to Evan, pytables-users
2016-10-12 16:12 GMT+02:00 Evan <evan.bie...@gmail.com>:
Thank you for the prompt response. 

So just to be clear, users should index the columns using 

indexedrows = table.cols.identity.create_index()

before filling the tables with records?

Its use is the same than relational databases, you can index before or after, although the time to create the index is faster if you use it after.  Also, there are some cases where indexing after allows better index optimization.  So yes, creating the index after is recommended.
 
That is, users how instantiate the `tables` and `rows` object, then use 

indexedrows = table.cols.identity.create_index(), e.g. with the example below

...
...
# Create 1 new tables in group1
table1 = fileh.create_table(group1, "table1", Particle)

# Get the record object associated with the table:
row = table1.row

# now index columns

indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()

Does the order when columns are indexed matter?

Nope.  The columns are independent, the same way than the indexes.
 

Question 2:

So, `indexedrows` is not a traditional python variable? 

If I were to index on threes columns 'col1', 'col2', 'col3', I would use

indexedrows = table.cols1.identity.create_index()
indexedrows = table.cols2.identity.create_index()
indexedrows = table.cols3.identity.create_index()

`indexedrows` must automatically execute these commands. For a traditional "variable", the only thing saved is `table.cols3.identity.create_index()`---the other two have been written over. 

I don't totally understand the question.  IIRC indexedrows is the number of rows indexed, but this is indeed not documented:

http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_index

In general, you should not care about this, because the number of rows indexed will be the same than the number of rows in the table.  I suppose you have got this usage from http://www.pytables.org/usersguide/optimization.html?highlight=create_index#indexed-searches, and it would be nice if these lines would be replaced by just:
table.cols.var1.create_index()
table.cols.var2.create_index()
table.cols.var3.create_index()

Hope this helps,

Francesc

To unsubscribe from this group and stop receiving emails from it, send an email to pytables-users+unsubscribe@googlegroups.com.
To post to this group, send email to pytables-users@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Francesc Alted

Evan

unread,
Oct 12, 2016, 1:28:58 PM10/12/16
to pytables-users
Thanks for the response. 

One further clarification based on the recent reply:


""""
In general, you should not care about this, because the number of rows indexed will be the same than the number of rows in the table.  I suppose you have got this usage from http://www.pytables.org/usersguide/optimization.html?highlight=create_index#indexed-searches, and it would be nice if these lines would be replaced by just:
table.cols.var1.create_index()
table.cols.var2.create_index()
table.cols.var3.create_index()
""""

So, in the example shown previously from the documentation with fields "identity", "idnumber", and "speed", the correct way to index on (as an example) both "idnumber" and "speed" would be for users to execute

table.cols.idnumber.create_index()
table.cols.speed.create_index()


After `row.append()` and before `table.flush()` for optimal use and performance of indexes? I'm still slightly unclear. 

Thank you for the help. I appreciate it, Evan

Evan

unread,
Oct 12, 2016, 6:38:13 PM10/12/16
to pytables-users, evan.bie...@gmail.com, fal...@gmail.com
Let me provide a concrete example why I'm asking:

""""

class MyTable(IsDescription):

        COL1 = Int16Col()

        COL2  = Int16Col()

        COL3= StringCol(64)

        COL4= StringCol(64)

        COL5= StringCol(64)

        COL6= StringCol(64)

        COL7   = Int32Col()

       


# Open a file in write mode

h5file = open_file("file1.h5", mode = "w")


my_key = "key"


# Create group

group = h5file.create_group("/", "my_table")


table = h5file.create_table(group, my_key, MyTable, "table of values")


row = table.row

# user decides which indices to create

field1  = "COL1"   # create index on column 1, COL1

field2  = "COL2"   # create index on column 2, COL2


#  import dictionary 'dictionary1"
for dict in dictionary1: 

    row["COL1"] =  dict["COL1"]

    row["COL2"]  = dict["COL2"

    row["COL3"] =  dict["COL3"

    row["COL4"]  = dict["COL4"]   

    row["COL5"]  = dict["COL5"

    row["COL6"] =  dict["COL6"

    row["COL7"] =  dict["COL7"

    # This injects the Record values

    table.cols.field1.create_index()

    table.cols.field2.create_index()

    row.append()

# Flush the table buffers

table.flush()
"""

Now, when I run this, I get the following error: 

"
ValueError: Index(6, medium, shuffle, zlib(1)).is_csi=False for column 'COL1' already exists. If you want to re-create it, please, try with reindex() method better"

Where above am I indexing on all columns? Surely I would have most faster queries if I only indexed on one/two columns, and queried those, right? 

So...I'm making a mistake somewhere. 

Thanks for the help, we'll soon have this figured out, Evan

Reply all
Reply to author
Forward
0 new messages