Proposal: Kernel.is_string/1 or Kernel.is_utf8/1

Kobayakawa Ken

Nov 15, 2015, 11:11:14 PM
to elixir-lang-core
I faced a problem when I created a library for web scraping. ( https://github.com/nekova/metainvestigator )

In metainvestigator.ex:
def fetch(html) when is_binary(html), do: parse_html

iex(1)> html = HTTPoison.get!("http://portal.nifty.com/").body
iex(2)> page = MetaInvestigator.fetch(html)

It fails.

fetch/1 expects a String (a UTF-8 encoded binary), but it received an arbitrary binary, because the page's charset is Shift_JIS rather than UTF-8.

is_binary/1 is not sufficient as a string check, so I would like an is_string/1 or is_utf8/1 that can be used inside guards.

def is_string(<<_ :: utf8, t :: binary>>), do: is_string(t)
def is_string(<<>>), do: true
def is_string(_), do: false

This implementation is taken from String.valid?/1.

Ben Wilson

Nov 15, 2015, 11:26:33 PM
to elixir-lang-core
Are you sure that what's happening is what you think is happening? I can't seem to create a binary in iex that doesn't return true when passed to is_binary. Elixir Strings are binaries; there isn't any difference.

Eric Meadows-Jönsson

Nov 16, 2015, 12:10:03 AM
to elixir-l...@googlegroups.com
The VM restricts what kinds of expressions are allowed in guards; only a predefined set of functions can be called. So unfortunately it is impossible to create a new function that can be called from a guard.
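
One way to work within this restriction is to keep the guard on is_binary/1 and do the UTF-8 check inside the function body, where String.valid?/1 can be called freely. This is a minimal sketch only: the module name GuardSketch and the return tuples are hypothetical and do not come from MetaInvestigator.

```elixir
defmodule GuardSketch do
  # Hypothetical module for illustration; not part of MetaInvestigator.

  # is_binary/1 is allowed in a guard. String.valid?/1 is not,
  # so the UTF-8 check is done in the function body instead.
  def fetch(html) when is_binary(html) do
    if String.valid?(html) do
      {:ok, html}
    else
      {:error, :invalid_utf8}
    end
  end

  # Anything that is not a binary at all ends up here.
  def fetch(_other), do: {:error, :not_a_binary}
end
```

With this shape, a Shift_JIS body still passes the guard but is rejected explicitly in the body, so the caller can decide whether to convert the encoding or give up.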

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/d545e2a0-f90a-48c6-bd15-5fd33cab5be5%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Eric Meadows-Jönsson

Kobayakawa Ken

Nov 16, 2015, 12:31:18 AM
to elixir-lang-core
Sorry for my poor explanation.

As I understand it, an Elixir String is a UTF-8 encoded binary.

There is no problem if the html is encoded in UTF-8.

iex(1)> html = HTTPoison.get!("http://elixir-lang.org/").body
"<!DOCTYPE html>\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n  <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n......"
iex(2)> is_binary(html)
true
iex(3)> String.valid?(html)
true

But there is a problem if the html is in some other encoding.

iex(1)> html = HTTPoison.get!("http://portal.nifty.com/").body
<<60, 33, 68, 79, 67, 84, 89, 80, 69, 32, 72, 84, 77, 76, 32, 80, 85, 66, 76, 73, 67, 32, 34, 45, 47, 47, 87, 51, 67, 47, 47, 68, 84, 68, 32, 72, 84, 77, 76, 32, 52, 46, 48, 49, 32, 84, 114, 97, 110, 115, ...>>
iex(2)> is_binary(html)
true
iex(3)> String.valid?(html)
false

The html is not a valid String; it is just a binary.

An is_string/1 or is_utf8/1 could confirm that an argument is a UTF-8 encoded binary.
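
The same difference can be reproduced without a network request. The sketch below assumes two hand-written byte sequences for the character "あ" (one UTF-8, one Shift_JIS); the module name EncodingSketch and its function names are made up for illustration.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper module; the names are illustrative only.

  # "あ" (U+3042) encoded as UTF-8: three bytes.
  def utf8_sample, do: <<227, 129, 130>>

  # "あ" encoded as Shift_JIS: two bytes, not valid UTF-8.
  def sjis_sample, do: <<0x82, 0xA0>>

  # Returns {is_binary?, valid_utf8?} for any binary.
  def check(bin) when is_binary(bin) do
    {is_binary(bin), String.valid?(bin)}
  end
end

EncodingSketch.check(EncodingSketch.utf8_sample())  # {true, true}
EncodingSketch.check(EncodingSketch.sjis_sample())  # {true, false}
```

Both samples satisfy is_binary/1; only the UTF-8 one satisfies String.valid?/1.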

Ben Wilson

Nov 16, 2015, 12:42:43 AM
to elixir-lang-core
What would those functions do that String.valid? does not already do?

Kobayakawa Ken

Nov 16, 2015, 12:50:32 AM
to elixir-lang-core
"So unfortunately it is impossible to create a new function that can be called from a guard."
Oh, I hadn't considered the VM. Thank you so much.

Ben Wilson

Nov 16, 2015, 12:52:01 AM
to elixir-lang-core
I grant that it's a bit confusing, but a String isn't its own data type any more than a char list is. Strings are implemented on binaries, and char lists on lists of integers. The Kernel functions tie into low-level VM instructions that are concerned with the underlying data type, not particular uses thereof. The String module, however, is concerned with that particular use, which is why valid?/1 belongs there.
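
To illustrate the point, here is a small sketch showing that guards only see the underlying data type. The module TypeSketch and its kind/1 function are hypothetical names used for this example.

```elixir
defmodule TypeSketch do
  # Hypothetical module: classifies a term by its underlying VM type,
  # using only functions that are allowed in guards.
  def kind(term) when is_binary(term), do: :binary
  def kind(term) when is_list(term), do: :list
  def kind(_term), do: :other
end

TypeSketch.kind("hello")          # :binary (a String is a binary)
TypeSketch.kind(<<0x82, 0xA0>>)   # :binary (invalid UTF-8 is still a binary)
TypeSketch.kind(~c"hello")        # :list   (a char list is a list of integers)
~c"hello" == [104, 101, 108, 108, 111]  # true
```

Whether a binary happens to hold valid UTF-8 is invisible at this level, which is why that check is a String module function rather than a guard.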

Kobayakawa Ken

Nov 16, 2015, 1:14:22 AM
to elixir-lang-core
I see now why String.valid?/1 is not a Kernel function. Thanks a lot.
