Using Multiple Languages in Single Database Field

mikej

Sep 6, 2019, 9:44:08 AM
to Thinking Sphinx
A client recently was unable to save a record as it contained Polish characters.  In order to accommodate these records (rightly or wrongly) I changed the db collation with a migration (in development only so far).  First for the whole db, then by table. i.e.

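class ChangeEncoding < ActiveRecord::Migration[5.2]
  # Raw ALTER statements cannot be reversed automatically, so use up/down
  # rather than change.
  def up
    config = Rails.configuration.database_configuration
    db_name = config[Rails.env]["database"]
    collate = 'utf8_general_ci'
    char_set = 'utf8'
    # Set the default for the database, then convert each existing table.
    execute("ALTER DATABASE `#{db_name}` CHARACTER SET #{char_set} COLLATE #{collate};")
    ActiveRecord::Base.connection.tables.each do |table|
      execute("ALTER TABLE `#{table}` CONVERT TO CHARACTER SET #{char_set} COLLATE #{collate};")
    end
  end

  def down
    raise ActiveRecord::IrreversibleMigration
  end
end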

Records can now be saved.

I am using delta indexing.  If you search for a record containing the Polish characters, no records are found until I run rake ts:index.  Once a record is found, the Polish characters are not included in the highlighted part of the excerpt.  All other characters are returned within span.match.

Any ideas on how I can configure sphinx to find records without indexing and include all characters in the excerpt? 

Many thanks,

Mike

Sphinx 2.2.9
thinking-sphinx 4.2.0

Pat Allan

Sep 9, 2019, 2:29:40 AM
to thinkin...@googlegroups.com
Heya Mike,

This sounds like an odd problem indeed.

Firstly: I think UTF8 is the way to go for the database - and indeed, I’d expect it to be the same for Sphinx/TS (unless you’ve configured it to be something else?). Can you confirm that delta indexing is occurring when a record with Polish characters is updated? (I presume it’s being done in a standard way that’s firing validations and callbacks)

The fact things aren’t coming through in excerpts is surprising as well… I’m not yet sure what the cause of that might be.

Can you confirm if you’ve any character/encoding related settings in `config/thinking_sphinx.yml`? Also, it may be worth upgrading to Sphinx 2.2.11 - it seems there’s one documented fix in that for excerpts and UTF8 (when using wildcards).

Cheers,

— 
Pat

--
You received this message because you are subscribed to the Google Groups "Thinking Sphinx" group.
To unsubscribe from this group and stop receiving emails from it, send an email to thinking-sphi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/thinking-sphinx/f58b6b0d-52b2-4afe-853b-0bfe6fe2babe%40googlegroups.com.

mikej

Sep 9, 2019, 11:42:30 AM
to Thinking Sphinx
Thanks Pat.

Yes UTF8 is good.  I only changed the collation to get the record saving at all.

Right.  First problem solved, my muppetry.  The delta is working fine.  I was saving a related record, so delta indexing was never going to be relevant.  This is fine.  Sorry.

I don't have any character/encoding related settings in `config/thinking_sphinx.yml`.  I have upgraded to 2.2.11 but still have the problem with the excerpts.  Not a disastrous situation, but would be good to fix.

THANK YOU,

Mike

Pat Allan

Sep 15, 2019, 3:24:47 AM
to thinkin...@googlegroups.com
Hi Mike,

Good to know the deltas issue is resolved :)

As for excerpts… can you give me an example of a word that’s not matching appropriately? I can then try reproducing it locally. It may end up being a bug in Sphinx itself, but I’m not yet sure.

Cheers,

— 
Pat

mikej

Sep 16, 2019, 5:03:06 AM
to Thinking Sphinx
ŚRODOWISKA

Many thanks,

Mike

Pat Allan

Sep 29, 2019, 12:01:25 AM
to thinkin...@googlegroups.com
Hi Mike,

Sorry for not getting back to you on this promptly.

Here’s the code I’ve been testing with (within the TS test suite):

    Article.create! :title => "ochrona środowiska"
    index

    search = Article.search("środowiska")
    search.context[:panes] << ThinkingSphinx::Panes::ExcerptsPane

    expect(search.first.excerpts.title).
      to eq(%q{ochrona <span class="match">środowiska</span>})


And, when I first ran it, it didn’t pass:

    Failure/Error:
      expect(search.first.excerpts.title).
        to eq(%q{ochrona <span class="match">środowiska</span>})

      expected: "ochrona <span class=\"match\">środowiska</span>"
           got: "ochrona ś<span class=\"match\">rodowiska</span>"


As you may notice, it’s the leading ś that doesn’t get matched correctly.
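One way to read that failure, assuming Sphinx's default charset_table does not list ś and that unlisted characters are treated as word separators: both the stored text and the query lose the leading ś, so the indexed word is effectively "rodowiska" and only that part gets highlighted. The Ruby below is a toy model of that behaviour, not Sphinx's actual tokenizer; the constants and the `highlight` helper are hypothetical names used only for illustration.

```ruby
# Toy model only: a "word" is a run of characters matched by the pattern,
# and every other character acts as a separator.
ASCII_ONLY  = /[0-9A-Za-z_]+/
WITH_POLISH = /[0-9A-Za-z_ĄąĆćĘęŁłŃńÓóŚśŹźŻż]+/

# Wrap every word of +text+ that also appears in +query+ in a match span.
def highlight(text, query, word_pattern)
  wanted = query.scan(word_pattern).map(&:downcase)
  text.gsub(word_pattern) do |word|
    wanted.include?(word.downcase) ? %(<span class="match">#{word}</span>) : word
  end
end

puts highlight("ochrona środowiska", "środowiska", ASCII_ONLY)
puts highlight("ochrona środowiska", "środowiska", WITH_POLISH)
```

With the ASCII-only pattern the ś falls outside the highlighted span, as in the failing test; once the Polish letters count as word characters the whole word is wrapped.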

I tested this against Sphinx 2.2.11, 3.1.1, and Manticore 3.1.2, and all of them failed. Then I did a bit of searching, and came across this post:

I can’t read Polish, but I took the suggested charset_table settings and added them to my Article index, and the test passes. So perhaps that’s worth adding to your app’s config? Either in config/thinking_sphinx.yml or on a per-index basis with set_property :charset_table => "..."

Mind you, I’m not across what each of those transformations covers - and you may want a more extensive set (as covered here: https://yob.id.au/2008/05/08/thinking-sphinx-and-unicode.html).
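As a rough sketch of what such a setting could look like for Polish only (this is not the list from the post mentioned above, and the names `POLISH_PAIRS`, `codepoint` and `polish_charset_table` are made up for illustration), the plain Ruby below builds a charset_table string that keeps the usual ASCII entries and folds each Polish upper-case letter to its lower-case form:

```ruby
# Upper-case => lower-case pairs for the nine Polish letters outside ASCII.
POLISH_PAIRS = {
  "Ą" => "ą", "Ć" => "ć", "Ę" => "ę", "Ł" => "ł", "Ń" => "ń",
  "Ó" => "ó", "Ś" => "ś", "Ź" => "ź", "Ż" => "ż"
}.freeze

# Format a single character as a Sphinx-style code point, e.g. "U+015B".
def codepoint(char)
  format("U+%04X", char.ord)
end

# Build a charset_table value: ASCII basics plus Polish case folding.
# Assumption: this minimal set is enough; a real app may need more entries.
def polish_charset_table
  base = ["0..9", "A..Z->a..z", "_", "a..z"]
  polish = POLISH_PAIRS.flat_map do |upper, lower|
    ["#{codepoint(upper)}->#{codepoint(lower)}", codepoint(lower)]
  end
  (base + polish).join(", ")
end

puts polish_charset_table
```

The resulting string could then be passed as set_property :charset_table => polish_charset_table inside the index definition, followed by a full reindex. Whether this minimal set is enough for a given app has not been tested here.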

If this doesn’t help, though, do let me know!

— 
Pat

mikej

Sep 30, 2019, 7:50:37 AM
to Thinking Sphinx
Thanks very much for running those tests and researching the issue.  I'll check out the articles and see what happens.

Mike 
 