Infobright storage loader : wrong data or column definition

693 views
Skip to first unread message

Gilles Moncaubeig

unread,
Dec 7, 2012, 3:46:09 AM12/7/12
to snowpl...@googlegroups.com
Hi, 

I am trying loading millions of rows in infobright coming from my adserver.
I have two recurring errors that seems to cause almost half of downloaded files are not loaded.

LOAD ERROR: mysql-ib error code 256: ERROR 2 (HY000) at line 1: Wrong data or column definition. Row: 3222, field: 48.
LOAD ERROR: mysql-ib error code 256: ERROR 2 (HY000) at line 1: Wrong data or column definition. Row: 8508, field: 15.

As of 0.0.3 table definition

Field 48 is `br_lang` varchar(10)
 some examples of values causing errors:
English(en)
es-ES_tradnl
*30790cc430790a1130790cd430790ca830790b20 
 
 

Field 15 is `page_url` varchar(2083)
 one example of row causing errors:

Ebuzzing        \N      2012-12-03      17:24:48        \N      \N      102048  \N      cf      serde-0.5.2     d7be8224d9ec4241        88.76.125.233   \N      2       
http://ads.adk2.com/player.html?a=8357137&size=300x250&cid=9194680&adviewurl=http%253A%252F%252Fad.media-servers.net%252Fclk%253F3%252CeJy1k8mSqzYUhp.GuzaFkJiqKwvAA2CDbS42hh0zYjZgMzx95E7XvUk2SRY5JemofklHf5U-AfgJEo4FDOQZngUBl4BPAAMOhCFPM-iD.vz8hAjRCDIAoo8yvUjSuQVmKiOQbWXpHVqfXaQ.h.adkSSlkrLlo9763pD1O.17MerBX0.9t1B8Jbv-um8kV20E6cFQhmv38s9tmysw9xrjVtvxaJuFWV0Hw96VBqZZ0zGLo51Cw9Zmw9GQl5P5L0u.fWTD0K6gtGJ2pPlRTz17Kl63pV.XuE6pOh6IHmMyQNJ5AdAkMSKCMRCZGIWQpQN2BXfhsIIbsGK4NiCTkGcTmovCREQ8zzMi0RNMdI4OEzaiBRSAMAIhS.Sujt4nIYtYlkUCIhJsv6r93Vrcvd6WwnbC9RB3fjjgV0yFTUWWw7IgJohFJdbn0u6HuVmkCyW3zzztiz1tWqcwn7RR3CD8CKdjVVjpYes2F4tf550E2Jddv4IuL2kR2vdcu4qmamCrMu66wIeRXM.H7TTgZR8c24PO3Yt4wPT9SC.HSAmg0IEkuHFQCkQ-VmJVt-ppffpBFUfR5dOSfZ41z7nr8wD9NVir6xydRwvmJh8ib21RkgsfAv3cxdaYJaeLNfygBPfWvxJwatNzJSbGIhYsRobdbAWgcmi-TxmON12jRq.0Pq0dGKmvyJfPp8eZcqglynWd45vR2KgXL8vkdtqUiX0rvaU-TIYSnPbh.t7fROBdIPbAnscau0Sm160PmADtDqpdsAsFa6b2Dl53Fqbzg-H8E2YW2ZnqrNuFY9rdvGlfnmTgdF6-2U7XEcs-3zOKk.vm01WGhsnRDoiau5QItfSoprhX3e3YNjfnZR4R06bO7FjC2ZbbJn9M-7kxwV3vuN3cHoRMP2wOSmeahsSOi9CpB1ST5yV4RE8Cx4r823.HbvCP7DakXvJ.oJjP5fCFov-ggj9QTOm6awiKeBSjLxTLqujSIp6bR.dG0ScoDj9RHKYcP8U6q3BXVVP-RjGo5zJ-o5gGvwMK6oHQ%252C&SOURCEURL=http%253A%252F%252Fads.us.e-planning.net%252Feb%252F3%252F7810%252F2943e192e4c350b5%253Fo%253Df%2526rnd%253D1354555484%25263pct%253Dhttp%253A%252F%252Fadserving.cpxinteractive.com%252Fclk%253F3%252Cejyltstyozaq.bpujgskg0nrocjxiw9d4iqcxlmkrgkeyoqr7-jra15vtnvbrjl093txjiu9nhmirmmxj87cdbnylextizgb&ci=1&u=http%253A%252F%252Fads.us.e-planning.net%252Feb%252F3%252F7810%252F2943e192e4c350b5%253Fo%253Df%2526rnd%253D1354555484%25263pct%253Dhttp%253A%252F%252Fadserving.cpxinteractive.com%252Fclk%253F3%252CeJylTstyozAQ.BpujgskG0NROcjxIw9D4iqcxLmkRgKEYoQR7-jrA15vTnvbrjl093TXjIU9NHMiRmMXJ87cdBnyLExtizGbLpKJ6Xketi0XL0zLdCb38r1fbV63Ab97eCeHJRnx-OS.kL9Y7gl5uPIZWXJyt3a-1-H-j4PwR3jN7c4Z-R.AY3q80uFeRwhfOQRtS.8YVsvf1OpgPm9fMz9k5i4MToE81H64yXxhieDroHdvgXx-W3dHvdaBPOqP.W.zdjJJ67owMDHQZhhBpxDlfTVlZznKxMCbOGcGXsV91ZQ3iZ1G7iI5zdNZr-KiA4AYtHTk5z.3n2nZKZrP8xPq26aOi2zBWxnhrFcwgrVZxGlOB1WrZjDExUwiBa7s2CWjaNuYCto2j4F19IYzl44p0wHgisHYEwpoVWvNL42pgWyWR8PTBpojqxGyoN1XJr6FypK8hJKDKCrNNMA5wz8rHK4n%252C%2526p%253D5d1d19a2c3d6a394&r=  \N      http://ad.media-servers.net/st?ad_type=iframe&ad_size=300x250&section=3569146&pub_url=${PUB_URL}&pub_redirect_unencoded=1&pub_redirect=http://ads.us.e-planning.net/ei/3/7810/2943e192e4c350b5?ct=1&pb=c75f06dcf9477729&fi=60cf5d084b1cd1c5&rnd=1354555484&3pct=http://adserving.cpxinteractive.com/clk?3,eJylTstyozAQ.BpujgskG0NROcjxIw9D4iqcxLmkRgKEYoQR7-jrA15vTnvbrjl093TXjIU9NHMiRmMXJ87cdBnyLExtizGbLpKJ6Xketi0XL0zLdCb38r1fbV63Ab97eCeHJRnx-OS.kL9Y7gl5uPIZWXJyt3a-1-H-j4PwR3jN7c4Z-R.AY3q80uFeRwhfOQRtS.8YVsvf1OpgPm9fMz9k5i4MToE81H64yXxhieDroHdvgXx-W3dHvdaBPOqP.W.zdjJJ67owMDHQZhhBpxDlfTVlZznKxMCbOGcGXsV91ZQ3iZ1G7iI5zdNZr-KiA4AYtHTk5z.3n2nZKZrP8xPq26aOi2zBWxnhrFcwgrVZxGlOB1WrZjDExUwiBa7s2CWjaNuYCto2j4F19IYzl44p0wHgisHYEwpoVWvNL42pgWyWR8PTBpojqxGyoN1XJr6FypK8hJKDKCrNNMA5wz8rHK4n,&du=     \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11    Chrome  Chrome  23.0.1271.95    Browser WEBKIT  de      1       1       1       1       0       0       0       0       1       1       \N      Windows Windows Microsoft Corporation   \N      Computer        0       1280    720


As you can see, values are too long, exceeding VARCHAR limits in table definition
Maybe these limitations on VARCHAR are unnecassary, even in terms of performances (size on disk, read speed) ?

Any feedback appreciated

cheers
g




Yali

unread,
Dec 7, 2012, 4:57:54 AM12/7/12
to snowpl...@googlegroups.com
Hi Gilles,

It looks like we need to bump up the limits on both those columns. 

What if we make the limit on `br_lang` 256? 

On `page_url`, it's a bit more tricky - URLs from ad servers in particular can be very long. Do you think a limit of 4095 would be enough? (This is the limit of what IE8 supports.) We have to be careful as there is a limit of 65532 bytes per row in ICE. (See http://www.dbms2.com/2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/.)  

More broadly - we're thinking about how to evolve the `page_url` field. We've seen (thanks to your feedback) that having lookups on many of the fields increases load times. For this reason, we were thinking of removing the lookup on `page_url` in the next release. However, if you think about a typical SnowPlow user, they are tracking activity across a website where the number of distinct page_urls is much smaller than the total page views - so a lookup should be an excellent optimisation. What we're thinking is that we should split the current field in Infobright into a URL and a query string - have the lookup on the URL only. We could also then limit the number of characters for the URL more tightly, and use a bigger limit for the query string.

Would that work for you Gilles? Any feedback from other SnowPlow users?

Many thanks!

Yali @ SnowPlow

Alex Dean

unread,
Dec 7, 2012, 5:03:52 AM12/7/12
to snowpl...@googlegroups.com
Yep, and another thing to flag is that, because our table is UTF8 formatted, each character maps onto 3 bytes:


So 4,095*3 = 12,285 = 19% of our total available table width in one field. Here's a good discussion of URL length:


Alex

--
You received this message because you are subscribed to the Google Groups "SnowPlow" group.
To unsubscribe from this group, send email to snowplow-use...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Alex Dean

unread,
Dec 7, 2012, 5:06:19 AM12/7/12
to Alex Dean, snowpl...@googlegroups.com
Just a quick one - I have updated the gist for the ICE table definition with fewer lookups to include the new column widths:


A
Reply all
Reply to author
Forward
0 new messages