Unicode in webpage addresses


Earl Brown

Mar 27, 2011, 2:56:33 AM
to CorpLing with R
Hello R-ists. I have another encoding/unicode problem.

I'm trying to get synonyms of Spanish words from WordReference.com by
creating the appropriate URL for each word with a "for" loop and then
loading into R each word's webpage with scan(). However, when I have a
word with an accented vowel, for example "rápido", R doesn't load the
webpage that I see in my web browser. Even though my browser address
bar says

"http://www.wordreference.com/sinonimos/rápido"

when I look at the source code with my browser, the title of the
source code webpage is

"http://www.wordreference.com/sinonimos/r%C3%A1pido"

It seems that this change doesn't allow R to load the page correctly
because it's looking for "http://.../rápido" when it should be looking
for "http://.../r%C3%A1pido". When I manually changed "rápido" to
"r%C3%A1pido" in scan(), it loaded correctly.

So, I'm trying to find a way to automate the change of my accented
vowels into unicode hex codes in my URLs. Using iconv(), with the
"from" argument as "UTF-8", I tried all possible "to" arguments (which
I got from iconvlist()), but still no luck.

My question for you is: Does anyone see a way to automate the change
of accented vowel to unicode hex code? Is my only option using sub()
to find and replace, as in:

url.2 <- gsub("á", "%C3%A1", url, fixed = TRUE)
url.2 <- gsub("é", "%C3%A9", url.2, fixed = TRUE)
url.2 <- gsub("í", "%C3%AD", url.2, fixed = TRUE)
etc.

Thanks for your help. Earl Brown

Hardie, Andrew

Mar 27, 2011, 5:53:54 AM
to corplin...@googlegroups.com
This is called percent-encoding or URL-encoding, and there are R functions for it: URLencode() and URLdecode() in the utils package.

You must *start* with a string in the same encoding that the website uses, whether that's Latin-1, UTF-8 as in this case, or whatever; otherwise the URL-encoding will produce an incorrect address.
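
A minimal sketch of that advice, assuming a small made-up word list and the WordReference base URL quoted earlier in the thread. The variable names are illustrative only, and the case of the hex digits in the result may differ between R versions.

```r
# Hypothetical word list; "\u00e1" is á, written as an escape so the
# script does not depend on the encoding of the source file.
words <- c("r\u00e1pido", "feliz")

# Base URL as quoted in the thread.
base <- "http://www.wordreference.com/sinonimos/"

# Make sure each word is stored as UTF-8 (the encoding this site uses),
# then percent-encode it. vapply() calls URLencode() one word at a time,
# so this does not rely on URLencode() being vectorised.
encoded <- vapply(enc2utf8(words), URLencode, character(1), USE.NAMES = FALSE)

# One URL per word.
urls <- paste0(base, encoded)
```

Each element of urls can then be passed to scan() inside the loop in place of the hand-built address.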

Best

Andrew.

> --
> You received this message because you are subscribed to the
> Google Groups "CorpLing with R" group.
> To post to this group, send email to corplin...@googlegroups.com.
> To unsubscribe from this group, send email to
> corpling-with...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/corpling-with-r?hl=en.
>
>

Earl Brown

Mar 27, 2011, 12:04:09 PM
to CorpLing with R
Yeah, that works well. Thank you for your unfailing help. Earl Brown

Earl Brown

Jan 12, 2013, 10:17:46 PM
to corplin...@googlegroups.com
Hello Corpus Linguistics R-ists.

I have another encoding problem. I'm not sure what the name of the URL encoding I need is, but it's not what URLencode() gives. For example, I need "México" to come out as "M%E9xico" but URLencode("México") gives me "M%c3%a9xico". I noticed the following metadata in the <head> of the webpage I'm trying to get:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

so I tried:

URLencode(iconv("México", "WINDOWS-1252", "UTF-8"))
and
URLencode(iconv("México", "LATIN1", "UTF-8"))

but that made it worse, giving me "M%c3%83%c2%a9xico". I need the URL-encoded letters listed here:

http://www.w3schools.com/tags/ref_urlencode.asp

but I'm not sure how to do it in R with either URLencode() or something else. I assume I could manually change accented vowels and other Spanish-language graphemes with gsub(), but I also assume there's an easier way.

Any ideas? Thanks in advance. Earl Brown

John Newman

Jan 13, 2013, 12:22:12 AM
to corplin...@googlegroups.com
Earl

One thing I did which got the result (I think) you are looking for is to modify the URLencode() function by changing the line of the function definition

y <- sapply(x[z], function(x) paste0("%", as.character(charToRaw(x)), collapse = ""))

to 

y <- sapply(x[z], function(x) paste0("%", as.hexmode(utf8ToInt(x)), collapse = ""))

So, I define myURLencode as follows:

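myURLencode = function(URL, reserved = FALSE) {
  # Characters that are left alone; anything else is percent-encoded.
  OK <- paste("[^-ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz0123456789$_.+!*'(),", if (!reserved) ";/?:@=&", "]", sep = "")
  x <- strsplit(URL, "")[[1L]]  # split into single characters
  z <- grep(OK, x)              # positions that need encoding
  if (length(z)) {
    # Use the Unicode code point (not the UTF-8 bytes) as the hex value.
    y <- sapply(x[z], function(x) paste("%", as.hexmode(utf8ToInt(x)), sep = "", collapse = ""))
    x[z] <- y
  }
  paste(x, collapse = "")
}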

Then:

myURLencode("México")

[1] "M%e9xico"
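
A caveat on this approach, as a small sketch rather than a definitive statement about the target site: utf8ToInt() returns the Unicode code point, and that number equals the Latin-1 byte only for code points up to U+00FF. The variable names below are illustrative.

```r
e_acute <- "\u00e9"  # é
euro    <- "\u20ac"  # €

# For é the code point and the Latin-1 byte agree (both 0xE9).
cp_e   <- utf8ToInt(e_acute)
byte_e <- iconv(e_acute, "UTF-8", "latin1", toRaw = TRUE)[[1L]]

# For € the code point is 0x20AC, which is not a single byte, so
# pasting it after "%" would not give a valid percent-escape.
cp_euro <- utf8ToInt(euro)
format(as.hexmode(cp_euro))
```

So the modified function suits Spanish accented vowels and ñ, but characters outside that range would need a real conversion to the page's charset first.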

John




--
John Newman
Professor 
Department of Linguistics, 4-32 Assiniboia Hall, University of Alberta
Edmonton T6G 2E7 CANADA
Fax: (780) 492-0806, Tel: (780) 492-0804
Homepage: http://johnnewm.jimdo.com

Hardie, Andrew

Jan 13, 2013, 1:56:16 AM
to corplin...@googlegroups.com

Hi Earl,

I think you have your iconv arguments the wrong way round.

Your original text is clearly UTF-8 because when you url-encode it without using iconv you get a UTF-8 byte sequence. So, you need to convert from UTF-8 to Win1252 before url-encoding if you want e-acute to come out as just %e9. The order of arguments for iconv is string-from-to.

In other words,

URLencode(iconv("México", "UTF-8", "WINDOWS-1252"))

should do the trick.
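
To see why the direction matters, here is a small sketch that looks at the raw bytes instead of the printed strings. It uses "latin1" in place of "WINDOWS-1252" as an assumption for portability; the two charsets agree for é. Variable names are illustrative.

```r
x <- "M\u00e9xico"  # "México", stored as UTF-8

# Wrong direction: the UTF-8 bytes c3 a9 are read as two Latin-1
# characters and each is converted to UTF-8 again (double encoding).
wrong <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(wrong)    # 4d c3 83 c2 a9 78 69 63 6f

# Right direction: é becomes the single byte e9.
right_bytes <- iconv(x, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1L]]
right_bytes         # 4d e9 78 69 63 6f
```

The doubled bytes c3 83 c2 a9 are exactly the "%c3%83%c2%a9" seen in the earlier attempt.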

 

best

Andrew.


Earl Brown

Jan 15, 2013, 5:22:28 PM
to corplin...@googlegroups.com
Thank you John and Andrew. I went with John's suggestion, as I couldn't get Andrew's simpler one to work. Here's what happens when I try Andrew's:


> URLencode(iconv("México", "UTF-8", "WINDOWS-1252"))
[1] "NA"
Warning message:
In strsplit(URL, "") : input string 1 is invalid in this locale

And when I try just:


> iconv("México", "UTF-8", "WINDOWS-1252")
[1] "M\xe9xico"

the percentage sign doesn't come through.

Thanks for your help.

Hardie, Andrew

Jan 15, 2013, 5:35:27 PM
to corplin...@googlegroups.com

I’m not a specialist on R internals by any means, but that looks like a deficiency (I shan’t say bug) in URLencode – basically, it refuses to URL-encode a string in an encoding other than the current locale’s encoding (which is UTF-8). (To be more specific, strsplit refuses to deal with strings that aren’t in the current locale’s encoding, i.e. UTF-8, and URLencode uses strsplit.)

You might be able to get round this by stashing the iconv return value in a variable, setting the locale to a Win-1252 charset, and then calling URLencode on the stored value (and re-setting the locale afterwards, of course).
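
An alternative that avoids touching the locale, offered as a sketch under assumptions rather than a tested fix for this site: convert to the target charset as raw bytes with iconv(..., toRaw = TRUE) and percent-encode the bytes directly, so no re-encoded string ever goes through strsplit. The function name url_encode_bytes is hypothetical, and the charset names accepted by iconv depend on the platform.

```r
# Hypothetical helper: percent-encode one string in a given charset.
url_encode_bytes <- function(x, to = "WINDOWS-1252") {
  stopifnot(is.character(x), length(x) == 1L)
  # toRaw = TRUE returns a list of raw vectors (NULL if conversion fails).
  bytes <- iconv(enc2utf8(x), from = "UTF-8", to = to, toRaw = TRUE)[[1L]]
  if (is.null(bytes)) stop("input cannot be represented in ", to)
  # Bytes left as they are: ASCII letters, digits, "-", ".", "_", "~".
  safe <- charToRaw(paste0(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "abcdefghijklmnopqrstuvwxyz",
    "0123456789-._~"
  ))
  keep <- bytes %in% safe
  # Encode every byte as %XX, then restore the safe ones.
  out <- sprintf("%%%02X", as.integer(bytes))
  out[keep] <- rawToChar(bytes[keep], multiple = TRUE)
  paste(out, collapse = "")
}
```

With this, url_encode_bytes("México") should give "M%E9xico" where the platform's iconv knows the name "WINDOWS-1252"; passing to = "latin1" is a more portable choice for Spanish text. Note that it encodes reserved characters such as "/" too, so it is meant for a single word or path segment, not a whole URL.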

 

Chalk this up as reason number seventeen or so that I don’t like or trust locales...

best

Andrew.
