This page shows a little function that converts a Unicode string into ASCII. For example, “passé” becomes “passe” and “voilà” becomes “voila”.
When refactoring my elisp code last week, i split out this little function. It turns unicode chars into roughly equivalent ASCII ones. I needed this because the open source dictionary will choke on words with unicode chars. (See: Emacs Dictionary Lookup ◇ Problems of Open Source Dictionaries.)
I remember that BBEdit, the popular Mac editor i used 10 years ago before emacs, has such a command in the menu, called “zap gremlins”. I'm not aware of one in emacs, though there might be. Anyway, here's the code:
(defun asciify-string (inputstr)
  "Make unicode string into equivalent ASCII ones.
Todo: this command is not exhaustive."
  (let ()
    (setq inputstr (replace-regexp-in-string "á\\|à\\|â\\|ä" "a" inputstr))
    (setq inputstr (replace-regexp-in-string "é\\|è\\|ê\\|ë" "e" inputstr))
    (setq inputstr (replace-regexp-in-string "í\\|ì\\|î\\|ï" "i" inputstr))
    (setq inputstr (replace-regexp-in-string "ó\\|ò\\|ô\\|ö" "o" inputstr))
    (setq inputstr (replace-regexp-in-string "ú\\|ù\\|û\\|ü" "u" inputstr))
    inputstr))
You might improve this code, as right now it's puny. Right now it's a function that takes in a string. You might also create a version that works on region, or better yet, works on text selection if there's one, else on current word (or line, or paragraph, or buffer, your design call). (For how, see: Emacs Lisp: Using thing-at-point.)
Here are some common non-English letters: ÀÁÂÃÄÅÆ Ç ÈÉÊË ÌÍÎÏ ÐÑ ÒÓÔÕÖ ØÙÚÛÜÝÞß àáâãäåæç èéêë ìíîï ðñòóôõö øùúûüýþÿ. You might also consider changing the unicode bullet “•” to “*”, and others such as “→” to “->”, “≥” to “>=”, etc.
Or, perhaps you know someone has written this somewhere?
────────────────────
Accumulator vs Parallel Programing
When looking at my code, another thing that piqued my interest is how sequential the algorithm is. The paradigm is similar to what's called “accumulator” or “iteration”. Recently, i watched Guy Steele's talk on parallel programing (See: Guy Steele on Parallel Programing.) and learned that the iteration style makes it very difficult for a compiler to automatically generate parallel code.
A better way to write it for parallel programing is to “map” a char-transform function over the string. (In elisp, a string is also a sequence, and can be the second argument to “mapcar”.) It will probably become slower, but it'll be good in n years when someday emacs lisp becomes Scheme Lisp or something.
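To make the “map” idea concrete, here is a minimal sketch. The names `my-asciify-char-table`, `my-asciify-char` and `my-asciify-string-mapped` are made up for illustration, the table is deliberately tiny, and it assumes the input uses precomposed characters. Each character is transformed on its own, so no step depends on the result of a previous one.

```elisp
;; Hypothetical lookup table: character -> ASCII replacement string.
;; Deliberately incomplete; extend as needed.
(defvar my-asciify-char-table
  '((?á . "a") (?à . "a") (?â . "a") (?ä . "a")
    (?é . "e") (?è . "e") (?ê . "e") (?ë . "e")
    (?í . "i") (?ì . "i") (?î . "i") (?ï . "i")
    (?ó . "o") (?ò . "o") (?ô . "o") (?ö . "o")
    (?ú . "u") (?ù . "u") (?û . "u") (?ü . "u"))
  "Alist mapping accented characters to ASCII strings.")

(defun my-asciify-char (char)
  "Return the ASCII string for CHAR, or CHAR itself as a string."
  (or (cdr (assq char my-asciify-char-table))
      (char-to-string char)))

(defun my-asciify-string-mapped (string)
  "Map `my-asciify-char' over every character of STRING."
  ;; `mapconcat' accepts any sequence, including a string.
  (mapconcat #'my-asciify-char string ""))
```

For example, `(my-asciify-string-mapped "passé")` should give “passe”. Whether this is faster or slower than the regexp version in practice is not something this sketch tries to settle.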
> (defun asciify-string (inputstr)
>   "Make unicode string into equivalent ASCII ones.
> Todo: this command is not exhaustive."
>   (let ()
>     (setq inputstr (replace-regexp-in-string "á\\|à\\|â\\|ä" "a"
> etc.
Yep, it sure is not exhaustive.
The right way to do this is to fix the "open source dictionary", whatever that is, as anything that doesn't handle Unicode is pretty much useless in today's world. Failing that, then the right way to do asciify-string is to use the Unicode Character Database: convert to NFKD, and then remove all the combining characters, and all the remaining non-ASCII characters (how are you going to ascii-fy Russian, Greek, Chinese?). This is (probably) a couple of lines of perl, but it's not built into Emacs.
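For what it's worth, the NFKD approach described above can be sketched in elisp too, assuming your Emacs ships the `ucs-normalize` library and its `ucs-normalize-NFKD-string` function (an assumption about your installation, so check before relying on it). The function name `my-asciify-string-nfkd` is made up for illustration.

```elisp
(require 'ucs-normalize)

(defun my-asciify-string-nfkd (string)
  "Decompose STRING with NFKD, then drop every non-ASCII character.
Combining marks are themselves non-ASCII, so a single filter removes
both the marks and whatever else could not be decomposed to ASCII."
  (replace-regexp-in-string
   "[[:nonascii:]]" ""
   (ucs-normalize-NFKD-string string)))
```

With this, `(my-asciify-string-nfkd "passé")` should give “passe”. Note the consequence of the last step: letters with no decomposition (such as “ß” or Greek letters) are simply deleted, not transliterated.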
> On 2011-03-07, Xah Lee <xah...@gmail.com> wrote:
> > (defun asciify-string (inputstr)
> >   "Make unicode string into equivalent ASCII ones.
> > Todo: this command is not exhaustive."
> >   (let ()
> >     (setq inputstr (replace-regexp-in-string "á\\|à\\|â\\|ä" "a"
> etc.
> Yep, it sure is not exhaustive.
> The right way to do this is to fix the "open source dictionary", > whatever that is, as anything that doesn't handle Unicode is pretty > much useless in today's world. > Failing that, then the right way to do asciify-string is to use the > Unicode Character Database: convert to NFKD, and then remove all the > combining characters, and all the remaining non-ASCII characters (how > are you going to ascii-fy Russian, Greek, Chinese?).
that seems a big job.
> This is (probably) a couple of lines of perl, but it's not built into Emacs.
hum... not sure there's a lib for it but i haven't done perl for years. Is it really a few lines of perl? but maybe hours to write it? i suppose so if one specializes in unicode processing with perl.
> Failing that, then the right way to do asciify-string is to use the > Unicode Character Database: convert to NFKD, and then remove all the > combining characters, and all the remaining non-ASCII characters (how > are you going to ascii-fy Russian, Greek, Chinese?). This is > (probably) a couple of lines of perl, but it's not built into Emacs.
Also, "iconv" utility can be used for that:
(defun string-to-ascii (string)
  (with-temp-buffer
    (insert string)
    (call-process-region (point-min) (point-max) "iconv" t t nil
                         "--to-code=ASCII//TRANSLIT")
    (buffer-substring-no-properties (point-min) (point-max))))
> On Mar 8, 12:22 am, Julian Bradfield <j...@inf.ed.ac.uk> wrote: >> This is (probably) a couple of lines of perl, but it's not built into Emacs.
> hum... not sure there's a lib for it but i haven't done perl for > years. Is it really a few lines of perl? but maybe hours to write it? > i suppose so if one specializes in unicode processing with perl.
Oh, all right. I've done it. Here's the filter that just removes the accents from stdin:
perl -e 'use encoding utf8; use Unicode::Normalize; while ( <> ) { $_ = NFKD($_); s/\pM//g; print; }'
This is the first time I've used Perl's Unicode facilities, and it took me 5 minutes to look up the stuff to write that. (But I do know Unicode, which helps.)
> On 2011-03-08, Xah Lee <xah...@gmail.com> wrote:
> ──────────────────── > Accumulator vs Parallel Programing
> When looking at my code, another thing that piqued my interest is > that, notice how the algorithm is of sequential nature? The paradigm > is similar to what's called “accumulator” or “iteration”. Recently, i > watched Guy Steele's talk on parallel programing (See: Guy Steele on > Parallel Programing.) and learned that the iteration style is very > difficult for compiler to automatically generate parallel code.
> A better way to write it for parallel programing, is to “map” a char- > transform function to the string. (in elisp, a string datatype is also > a sequence, and can be the second argument to “mapcar”.) It will > probably become slower, but it'll be good in n years when someday > emacs lisp becomes Scheme Lisp or something.
> Xah
thought a bit more about this yesterday. I think this is actually a great example of the parallel programing Guy Steele is talking about.
if we do it by mapping a transcode function to each char in the string, then it'll probably become 100 times slower than it is now. However, suppose the string is a few million chars long (which is just a few megabytes), then using map will certainly be much faster, provided that the elisp compiler/interpreter of the future has become parallelism aware...
-------------------
... about calling an external util... it has the problem of IO limitations, especially on Windows. e.g. you have to make sure the encoding sent matches the external util's input spec, then make sure the size is within the limit of IO allowed... (i don't know the details, but i often have problems on Windows when calling a util in cygwin. e.g. my rgrep is even broken. I get the error message “find: unknown predicate `-nam'”. Apparently the long input sent to the shell has been truncated.)
\pM means characters with the "mark" property. This includes accents, as well as some other additions to characters. (So that script will also remove the vowels from devanagari, which may or may not be intended.) There are ways to restrict it to Latin characters. If we only want to remove accents from Latin characters, then:
s/(\p{Latin})\pM+/$1/g;
> nice. Though, i can't use that cause i don't have iconv installed on > my cygwin (Windows). It wasn't in OS X 10.4.x but not sure what about > today.
> PS added your code on my page if you don't mind.
I don't mind having my code there but it annoys me a bit that you changed the symbol named "string" to "inputStr" while still saying it's my code. CamelCaps is not good Lisp style and I don't stand behind that change. Also, having the word "input" in such function's argument is redundant. Obviously it's about some kind of input because it's in the function's lambda list (arguments).
> On Mar 8, 2:13 am, Teemu Likonen <tliko...@iki.fi> wrote:
>> Also, "iconv" utility can be used for that:
>> (defun string-to-ascii (string)
>>   (with-temp-buffer
>>     (insert string)
>>     (call-process-region (point-min) (point-max) "iconv" t t nil
>>                          "--to-code=ASCII//TRANSLIT")
>>     (buffer-substring-no-properties (point-min) (point-max))))
> nice. Though, i can't use that cause i don't have iconv installed on
> my cygwin (Windows). It wasn't in OS X 10.4.x but not sure what about
> today.
> PS added your code on my page if you don't mind.
On Mar 11, 2:44 am, Teemu Likonen <tliko...@iki.fi> wrote:
> I don't mind having my code there but it annoys me a bit that you > changed the symbol named "string" to "inputStr" while still saying it's > my code.
I wrote “Code originally by Teemu Likonen.”. I added the word “originally” there precisely about this worry. ☺
i changed the “inputStr” to “string” now.
> CamelCaps is not good Lisp style
yes. I'm aware. But that's really a point of view.
> and I don't stand behind that > change. > Also, having the word "input" in such function's argument is > redundant. Obviously it's about some kind of input because it's in the > function's lambda list (arguments).
i don't think it's a big issue, but here's the reason why i'm using camelCase.
(1) it provides an easy way to distinguish variables from built-in symbols. Particularly because emacs-lisp-mode's coloring scheme is not complete. (i wrote about this problem here 〈Emacs Lisp Mode Syntax Coloring Problem〉 http://xahlee.org/emacs/modernization_elisp_syntax_color.html )
i developed a habit of using camelCase in particular for local variables. I could've used under_score but that's more typing and less visually distinguishable from lisp's hyphen-word-style.
(2) For variables, in recent years i developed a habit of avoiding any standard english word, partly as an experiment. So, i'd name “file” as “myFile” or “aFile”. “string” would be “str”, “myString”, etc. A good solution in this regard is to append or prepend a random number in var names. So, “string” would be “string-5w77o” or something like that. But the problem with this is that it's too long and disruptive in reading and typing. Recently i've been toying with the idea of attaching a unicode char to all vars. e.g. all my vars would start with ξ. So “string” would be “ξstring”. (works fine in elisp btw). This way solves the random string readability issue.
The reason for this “avoiding english words” is for easy source code transformation. The idea is similar to the idea of referential transparency.
imagine if every local variable (or every symbol) is a unique identifier in the source code. This way, you could locate any variable in the whole source code, and you can freely change their names.
(3) another reason that somewhat pushed me in this naming experiment is that... instead of naming your vars with some meaningful english words, the opposite is to name them completely arbitrarily, as in math's x, y, z.
So, i'd name “counter” as just “i” or “n”. (since these are 1-letter string and too common, so with the unique naming idea above, i usually name them “ii” or “nn” or might be “ξi”)
the idea with abstract naming is that it forces you to understand the code as a math expression that specifies an algorithm, instead of as english prose. Readability of source code is helped by coding in a pure functional programing style (e.g. functions, input, output), and good documentation of each function. So, to understand a function, you should just read the doc about its input and output. Inside a code snippet, it is understood by simple functional style programing constructs.
to illustrate from the opposite view, the problem with english naming is that it often interferes with what the code is actually doing. For example, in normal convention you'll often see names like “thisObject”, “thatTree”, “fileList”, or “files”. Your focus is on the meaning of these words, but not on what the data types actually are or the function's actual mathematical behavior. The words can be deceptive. e.g. “file” can be a file handle, file path, or file content. This is especially a problem when you are reading source code of a lang you do not know. e.g. when you encounter the word “object”, you don't know if that's a keyword in the language, a pattern spec, something else, or just a variable name. When you read normal source code, half of the words are like that unless the editor does syntax coloring that distinguishes the lang's keywords.
to view this idea in another way ... when you read math, you never see mathematicians name their variables with a multi-letter descriptive word, but usually a single symbol, yet there's no problem understanding the expression. Your focus and understanding is on the abstract process and structure.
again, the above ideas are just an experiment. Without actually doing it, one never knows what's really good or bad.
Here's a more polished solution. The “asciify-word-or-selection” is a command. It works on current word or text selection.
(defun asciify-string (inputstr)
  "Make Unicode string into equivalent ASCII ones.
For example, “passé” becomes “passe”.
This function works on chars in European languages, and does not transcode arbitrary unicode chars (such as Greek). Un-transformed unicode char remains in the string."
  (let ()
    (setq inputstr (replace-regexp-in-string "á\\|à\\|â\\|ä\\|ã\\|å" "a" inputstr))
    (setq inputstr (replace-regexp-in-string "é\\|è\\|ê\\|ë" "e" inputstr))
    (setq inputstr (replace-regexp-in-string "í\\|ì\\|î\\|ï" "i" inputstr))
    (setq inputstr (replace-regexp-in-string "ó\\|ò\\|ô\\|ö\\|õ\\|ø" "o" inputstr))
    (setq inputstr (replace-regexp-in-string "ú\\|ù\\|û\\|ü" "u" inputstr))
    (setq inputstr (replace-regexp-in-string "ñ" "n" inputstr))
    (setq inputstr (replace-regexp-in-string "ç" "c" inputstr))
    (setq inputstr (replace-regexp-in-string "ð" "d" inputstr))
    (setq inputstr (replace-regexp-in-string "þ" "th" inputstr))
    (setq inputstr (replace-regexp-in-string "ß" "ss" inputstr))
    (setq inputstr (replace-regexp-in-string "æ" "ae" inputstr))
    inputstr))

(defun asciify-word-or-selection ()
  "Make Unicode string into equivalent ASCII ones.
For example, “passé” becomes “passe”.
This command works on chars in European languages, and does not transcode arbitrary unicode chars (such as Greek). They remain in the string.
This command calls `asciify-string' to do the string transformation."
  (interactive)
  (let (bds p1 p2 inputstr)
    (setq bds (get-selection-or-unit 'word))
    (setq inputstr (elt bds 0)
          p1 (elt bds 1)
          p2 (elt bds 2))
    (setq inputstr (asciify-string inputstr))
    (delete-region p1 p2)
    (insert inputstr)))
The command uses “get-selection-or-unit”. You can get the code for that at Emacs Lisp: Using thing-at-point.
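If you'd rather not depend on “get-selection-or-unit”, a rough equivalent can be sketched with only built-in functions (`use-region-p` and `bounds-of-thing-at-point`). This is an illustration under that assumption, not the code from the linked page; the name `my-transform-word-or-selection` and its TRANSFORM argument are made up here.

```elisp
(require 'thingatpt)

(defun my-transform-word-or-selection (transform)
  "Replace the active region, or else the word at point.
The new text is the result of calling TRANSFORM on the old text.
Do nothing when there is neither an active region nor a word at point."
  (let ((bounds (if (use-region-p)
                    (cons (region-beginning) (region-end))
                  (bounds-of-thing-at-point 'word))))
    (when bounds
      (let* ((beg (car bounds))
             (end (cdr bounds))
             (text (buffer-substring-no-properties beg end)))
        (save-excursion
          (delete-region beg end)
          (goto-char beg)
          (insert (funcall transform text)))))))
```

A command could then wrap it, for example `(my-transform-word-or-selection #'asciify-string)` inside an `(interactive)` defun.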
> again, the above ideas is just a experiment. Without actually doing > it, one never know what's really good or bad.
It is good to experiment - it's how we learn. However, I think it is unlikely that the experimentation and experiences of a single person are going to be of any real benefit to anyone other than the individual concerned. The first stage in any experimentation should be to scan the literature and become familiar with the current body of knowledge in the area. Failing to do so means the experiments are unlikely to be of much real interest to others. Writing good clear code is a real skill that takes years to develop and refine. Reading and learning from others' code is extremely beneficial in helping to develop good technique. It is rare that an author's first book or an artist's first painting represents their best work. It is even rarer for an author or artist to create a masterpiece the first time without also having studied the works of others and having a solid grasp of both the theoretical and practical aspects of the discipline. Practice/experience and appreciation of others' work is essential. In many ways, this is an example of where you really need to know and understand the existing conventions/rules before you can break/change them.
There are lots of articles and books concerning this topic and a lot of research has been done in this area. There are many 'formal' techniques that have been developed with varying levels of success. You could gain some valuable insight by looking at some of the research done in the area of software engineering rather than working from 'first principles', and by reading about techniques used in large software projects - for example, while I don't like MS conventions in this area, reading about them provides some good insight into the problems they are attempting to solve and how they feel their solution achieves this.
I would suggest many of your ideas have a consistent weakness in that they overlook one of the main objectives of code - to communicate your ideas, algorithm etc to others as well as to yourself. One way to evaluate your technique is to regularly re-visit code you wrote some time ago and see how easily you can understand it without using comments and documentation, and to seek feedback from others regarding how easily they can understand what you have written.
I also think technique and style cannot be divorced from the language being used. The way I write depends very much on the dialect being used. For example, in perl, I have used variable naming techniques where the name indicates whether the variable is a scalar, hash, reference etc, in C I may use names that indicate whether the variable is a pointer or not and in Java - well, actually I just avoid java because of the ridiculously verbose nature of its naming conventions, which just end up being bloated boilerplate noise.
In lisp, I do tend to avoid using variable names that are the same as important keywords. For example, last week I was debugging some code written by someone else that had code along the lines of
which I found an unfortunate bit of code because the name and position of the 'cons' variable meant I had to stop and re-read this declaration to recognise that in this context, 'cons' is a variable, not a function. Worse still, at this point, I have no idea what either 'cons' or 'vec' are supposed to represent. All I know is that 'cons' is probably a cons cell (a guess at this point) and vec is a vector (obvious from its definition). I don't know what the cons cell represents or what the vec is used for. Matters were made worse by the unfortunate layout/indenting used. Either putting 'cons' as the last variable or putting the definition on its own line may have helped make it easier to read (putting it as the first variable could also have made it even harder!)
The point is, the above code is readable, but it requires additional mental effort that could easily have been avoided by using better layout, meaningful variable names and avoiding names that are the same as a frequently used function name.
I don't use camel case in lisp because it is a case insensitive language. I want my function names and variables to have the same level of distinctiveness in the repl, debugger and backtraces as they do in source code. I do use the ear-muff convention for special variables to remind me they are special. I tend to avoid using variable names that are the same as keyword/function names, unless not doing so would make the code harder to understand (i.e. I've been known to use the
...
On Mar 11, 4:12 pm, Tim X wrote:
│ It is good to experiment - its how we learn. However, I think it is
│ unlikely that the experimentation and experiences of a single person are
│ going to be of any real benefit to anyone other than the individual
│ concerned.
yes. But for every undertaking there's a first step. ☺
│ The first stage in any experimentation should be to scan the
│ literature and become familiar with the current body of knowledge in the
│ area. Failing to do so means the experiments are unlikely to be of much
│ real interest to others.
That's the thought pattern of many hard core tech geekers, and was my approach too for much of the 1990s and early 2000s.
Bertrand Russell has written about this. When he was young, he thought that to study anything he would first do a complete survey of existing knowledge, then venture on discovery. However, he found out that this is actually not effective. Rather, if you just start exploring and push out your findings, you'll have more impact, and perhaps more understanding overall.
He wrote something to that effect, i forgot in what essay or lecture or publication, and it's hard to web search. (might be part of his collected auto-bio) Anyone know?
if you look at programing languages or the computer industry... you'll find this to be true (albeit not in some scientific way). Namely, langs such as perl, python, ruby, php, C-turd, or whatnot that cropped up and became popular are not the result of the designer having a deep understanding of programing langs or a massive survey of the varieties out there. Rather, they just pushed themselves forward, then later on concocted “philosophies” to back it up. Meanwhile, langs from designers who really studied langs thoroughly, or tried to, before they published their invention are usually not the successful ones. Typically the academicians. (of course their lack of salesmanship probably has much to do with it too.)
thinking about this, i think actually most lang inventors did not know the varieties of prog langs when they invented their own. This makes sense too. If everyone takes the tech geeker stance of studying existing knowledge before doing, the world wouldn't move.
I know you Tim X. I remember you from the beginning, maybe 4 or 5 years ago, as someone who thinks i'm a dork for my emacs opinions. I think in recent years my image improved, perhaps slightly, and i appreciate it. ☺
Will probably write a separate post about some thoughts on programing style.
one problem i thought is interesting is about a feedback loop. That is, if your code does the find/replace recursively (each replacement running on the output of the previous one), then some string in the input text may be unexpectedly replaced, even if that string isn't anywhere in the find string.
For example, if the input string is “abcd”, and the pairs are “a → c” and “c → d”, then the result is “dbdd”, though most of the time you want “cbdd”. This is especially important if you use regex in your find string.
how does your code handle this?
the solution i did is to do an intermediate replacement. That is, take the find string, replace it with some random string that's not likely to occur in the text, then replace this random string with the replacement string.
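As a point of comparison, here is a sketch of a different technique from the intermediate-placeholder one described above: do all the replacements in a single pass, so replaced text is never scanned again. It only handles literal find strings, not regexps, and the name `my-replace-pairs-once` is made up for illustration.

```elisp
(defun my-replace-pairs-once (string pairs)
  "Replace literal strings in STRING in one pass.
PAIRS is an alist of (FIND . REPLACEMENT) strings.  Text produced by
one replacement is never matched again by another pair."
  ;; Match case-sensitively so every match is an exact key of PAIRS.
  (let ((case-fold-search nil))
    (replace-regexp-in-string
     ;; One regexp that matches any of the find strings.
     (regexp-opt (mapcar #'car pairs))
     ;; Look up the replacement for whatever was matched.
     (lambda (match) (cdr (assoc match pairs)))
     string
     t      ; FIXEDCASE: do not adjust case of the replacement
     t)))   ; LITERAL: insert the replacement verbatim
```

With the example above, `(my-replace-pairs-once "abcd" '(("a" . "c") ("c" . "d")))` gives “cbdd”, because the “c” produced from “a” is not looked at a second time.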
> On Mar 11, 11:02 pm, Teemu Likonen <tliko...@iki.fi> wrote:
>> (defmacro replace-regexp-series (string &rest clauses)
>>   (declare (indent 1))
>>   (let ((value (make-symbol "--value--")))
>>     `(let ((,value ,string))
>>        ,@(let (forms)
>>            (dolist (clause clauses (nreverse forms))
>>              (push `(setq ,value (replace-regexp-in-string
>>                                   ,(car clause) ,(cadr clause) ,value))
>>                    forms))))))
> egads, macros. Unreadable, or, at least i don't think i'll ever
> understand it. ☺
The point behind my version was to hide boring repetitive code away. I made a syntactic abstraction (a macro) but of course there are other ways too. I mean, why did you write a lot of expressions like
(setq inputstr (replace-regexp-in-string "ñ" "n" inputstr))
when you could have written a loop, for example?
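A loop version might look like the sketch below. It is only an illustration of the idea, not code posted in this thread: the name `my-asciify-string-loop` and the short table are made up, and it uses character classes in place of the “\\|” alternations.

```elisp
(defun my-asciify-string-loop (string)
  "Apply a table of (REGEXP . REPLACEMENT) pairs to STRING in order."
  ;; `dolist' returns its third argument, here the updated STRING.
  (dolist (pair '(("[áàâä]" . "a")
                  ("[éèêë]" . "e")
                  ("[íìîï]" . "i")
                  ("[óòôö]" . "o")
                  ("[úùûü]" . "u")
                  ("ñ" . "n"))
                string)
    (setq string (replace-regexp-in-string (car pair) (cdr pair) string))))
```

Like the series of `setq` forms, this still applies the pairs one after another, so it has the same sequential behavior; it only removes the repetition.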
> one problem i thought is interetsing is about a feedback loop. That > is, if your code does the find/replace recursively, then some string > in the input text may be unexpectedly replaced, even if that string > isn't anywhere in the find string.
> For example, if the input string is “abcd”, and the pairs are “a → c” > and “c → d”, then, result is “dbdd”, though most of the time you want > “cbdd”. This is especially important if you use regex in your find > string.
> how does your code handle this?
The macro was just a syntactic abstraction of your code. That is, it expands to series of expressions which are equal to yours. No change in the functionality.
• the way i did it is also by a loop, but using “while”, not by macro.
• there is an issue with multi-pair find/replace. Namely, if you simply do each find/replace pair one after another, you may get unexpected results. I detailed this problem above. I named it the feedback loop problem.
Was wondering about your thoughts on these.
also, about macros, opinions differ. For this particular problem, I think the macro solution is rather ugly and hard to understand. I think the explicit sequential replace string call is easier to understand. I think the only advantage of the macro solution is shrinking the size of the source code.
i started functional programing in Mathematica around 1992. Like most FP geeks, i've read and explored the FP paradigms extensively, especially on toy problems of lists. I really love all the abstract ways, but i realized sometime in the mid 2000s that many of these abstractions that functional programers love to chat about (e.g. Schemers) are actually harmful. (on the other hand, i'm not sure non-academic production code in functional langs actually does that much abstraction for the sake of abstraction/elegance/purity)
...
on rewriting this particular piece of code, there's this way i rather find more interesting:
but there are 2 problems with it that's interesting to note.
• typical lispers will probably find the code abominable. This is, in my opinion, a problem caused by nesting syntax. The code is really a program chaining paradigm (aka pipe, filter).
• the length of the code and its deep nesting is again a problem of nesting syntax when there are no automatic formatting facilities.
> one problem i thought is interetsing is about a feedback loop. That > is, if your code does the find/replace recursively, then some string > in the input text may be unexpectedly replaced, even if that string > isn't anywhere in the find string. > For example, if the input string is “abcd”, and the pairs are “a → c” > and “c → d”, then, result is “dbdd”, though most of the time you want > “cbdd”. This is especially important if you use regex in your find > string. > the solution i did is to do a intermediate replacement. That is, take > find string, replace it with some random string that's not likely to > occure in text, then replace this random string to the replacement > string.
The problem should be defined more accurately. So, in string “abcd” the first regexp-replacement pair is “b” → “/” and the second is “^.+$” → “XXX”. What should the result be? What string should the second regexp match use? Maybe it should use all the (sub)strings which were left untouched by the first replace? Is the result “XXX/XXX” or something else?