Does anyone have (or know of) accurate totals and percentages on how
Python is used? I'm particularly interested in the following
groupings:
- new development vs. stable code-bases
- categories (web, scripts, "big data", computation, etc.)
- "bare metal" vs. on top of some framework
- regional usage
I'm thinking about this partly because of the discussion on
python-ideas about the perceived challenges of Unicode in Python 3.
All the rhetoric, anecdotal evidence, and use-cases there have little
meaning to me, in regards to Python as a whole, without an
understanding of who is actually affected.
For instance, if frameworks (like django and numpy) could completely
hide the arguable challenges of Unicode in Python 3--and most projects
were built on top of frameworks--then general efforts for making
Unicode easier in Python 3 should go toward helping framework writers.
Such usage numbers are useful not only for the Unicode discussion
(which I wish would get resolved and die so we could move on to more
interesting stuff :) ). They would also help us know where efforts could be
focused in general to make Python more powerful and easier to use
where it's already used extensively. They can show us the areas where
Python isn't used much, thus exposing a targeted opportunity to change
that.
Realistically, it's not entirely feasible to compile such information
at a comprehensive level, but even generally accurate numbers would be
a valuable resource. If the numbers aren't out there, what would be some
good approaches to discovering them? Thanks!
> I'm thinking about this partly because of the discussion on
> python-ideas about the perceived challenges of Unicode in Python 3.
> For instance, if frameworks (like django and numpy) could completely
> hide the arguable challenges of Unicode in Python 3--and most projects
> were built on top of frameworks--then general efforts for making
> Unicode easier in Python 3 should go toward helping framework writers.
Huh? I'll admit I'm a novice, but isn't Unicode mostly trivial in py3k
compared to 2.x? Or are you referring to porting 2.x to 3.x? I've been
under the impression that Unicode in 2.x can be painful at times, but
easy in 3.x.
I've been using 3.2 and Unicode hasn't been much of an issue.
-- CPython 3.2.2 | Windows NT 6.1.7601.17640
> Does anyone have (or know of) accurate totals and percentages on how
> Python is used? I'm particularly interested in the following
> groupings:
> - new development vs. stable code-bases
> - categories (web, scripts, "big data", computation, etc.)
> - "bare metal" vs. on top of some framework
> - regional usage
> I'm thinking about this partly because of the discussion on
> python-ideas about the perceived challenges of Unicode in Python 3.
> All the rhetoric, anecdotal evidence, and use-cases there have little
> meaning to me, in regards to Python as a whole, without an
> understanding of who is actually affected.
> For instance, if frameworks (like django and numpy) could completely
> hide the arguable challenges of Unicode in Python 3--and most projects
> were built on top of frameworks--then general efforts for making
> Unicode easier in Python 3 should go toward helping framework writers.
> Such usage numbers are useful not only for the Unicode discussion
> (which I wish would get resolved and die so we could move on to more
> interesting stuff :) ). They would also help us know where efforts could be
> focused in general to make Python more powerful and easier to use
> where it's already used extensively. They can show us the areas where
> Python isn't used much, thus exposing a targeted opportunity to change
> that.
> Realistically, it's not entirely feasible to compile such information
> at a comprehensive level, but even generally accurate numbers would be
> a valuable resource. If the numbers aren't out there, what would be some
> good approaches to discovering them? Thanks!
> -eric
As others have said on other Python newsgroups, it ain't a problem. The only time I've ever had a problem was with matplotlib, which couldn't print a £ sign. I used a U to force Unicode and the job was done. If I had a major problem, I reckon that a search on c.l.p would give me an answer easy peasy.
On Sat, Feb 11, 2012 at 2:51 PM, Andrew Berg <bahamutzero8...@gmail.com> wrote:
> On 2/11/2012 3:02 PM, Eric Snow wrote:
>> I'm thinking about this partly because of the discussion on
>> python-ideas about the perceived challenges of Unicode in Python 3.
>> For instance, if frameworks (like django and numpy) could completely
>> hide the arguable challenges of Unicode in Python 3--and most projects
>> were built on top of frameworks--then general efforts for making
>> Unicode easier in Python 3 should go toward helping framework writers.
> Huh? I'll admit I'm a novice, but isn't Unicode mostly trivial in py3k
> compared to 2.x? Or are you referring to porting 2.x to 3.x? I've been
> under the impression that Unicode in 2.x can be painful at times, but
> easy in 3.x.
> I've been using 3.2 and Unicode hasn't been much of an issue.
My expectation is that yours is the common experience. However, in at
least one current thread (on python-ideas) and at a variety of times
in the past, _some_ people have found Unicode in Python 3 to make more
work. So that got me to thinking about whose experience is the
general case, and if any concerns broadly apply to more than
framework/library writers (like django, jinja, twisted, etc.). Having
usage statistics would be helpful in identifying the impact of things
like Unicode in Python 3.
On Sun, Feb 12, 2012 at 12:21 PM, Eric Snow <ericsnowcurren...@gmail.com> wrote:
> However, in at
> least one current thread (on python-ideas) and at a variety of times
> in the past, _some_ people have found Unicode in Python 3 to make more
> work.
If Unicode in Python is causing you more work, isn't it most likely
that the issue would have come up anyway? For instance, suppose you
have a web form and you accept customer names, which you then store in
a database. You could assume that the browser submits it in UTF-8 and
that your database back-end can accept UTF-8, and then pretend that
it's all ASCII, but if you then want to upper-case the name for a
heading, somewhere you're going to need to deal with Unicode; and when
your programming language has facilities like str.upper(), that's
going to make it easier, not harder. Sure, the simple case is easier if
you pretend it's all ASCII, but it's still better to have language
facilities.
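A minimal sketch of the upper-casing point, assuming a made-up customer name and UTF-8 as the wire encoding (neither is specified in the thread). It contrasts `str.upper()`, which works on characters, with `bytes.upper()`, which only touches ASCII letters:

```python
# Hypothetical customer name; any non-ASCII letter shows the difference.
name = "zo\u00eb"                 # 'zoë' as text (str)
raw = name.encode("utf-8")        # the same name as UTF-8 bytes: b'zo\xc3\xab'

# str.upper() knows about Unicode letters, so the whole name is upper-cased.
upper_text = name.upper()         # 'ZOË'

# bytes.upper() only maps ASCII a-z, so the accented letter is left alone.
upper_raw = raw.upper()           # b'ZO\xc3\xab', which decodes to 'ZOë'
```

Treating the data as "just ASCII" works until the first name that is not, at which point the byte-level result is quietly wrong rather than loudly broken.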
On Sat, Feb 11, 2012 at 6:28 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Sun, Feb 12, 2012 at 12:21 PM, Eric Snow <ericsnowcurren...@gmail.com> wrote:
>> However, in at
>> least one current thread (on python-ideas) and at a variety of times
>> in the past, _some_ people have found Unicode in Python 3 to make more
>> work.
> If Unicode in Python is causing you more work, isn't it most likely
> that the issue would have come up anyway? For instance, suppose you
> have a web form and you accept customer names, which you then store in
> a database. You could assume that the browser submits it in UTF-8 and
> that your database back-end can accept UTF-8, and then pretend that
> it's all ASCII, but if you then want to upper-case the name for a
> heading, somewhere you're going to need to deal with Unicode; and when
> your programming language has facilities like str.upper(), that's
> going to make it easier, not harder. Sure, the simple case is easier if
> you pretend it's all ASCII, but it's still better to have language
> facilities.
Yeah, that's how I see it too. However, my sample size is much too
small to have any sense of the broader Python 3 experience. That's
what I'm going for with those Python usage statistics (if it's even
feasible).
On Sun, 12 Feb 2012 12:28:30 +1100, Chris Angelico wrote:
> On Sun, Feb 12, 2012 at 12:21 PM, Eric Snow
> <ericsnowcurren...@gmail.com> wrote:
>> However, in at
>> least one current thread (on python-ideas) and at a variety of times in
>> the past, _some_ people have found Unicode in Python 3 to make more
>> work.
> If Unicode in Python is causing you more work, isn't it most likely that
> the issue would have come up anyway?
The argument being made is that in Python 2, if you try to read a file that contains Unicode characters encoded with some unknown codec, you don't have to think about it. Sure, you get moji-bake rubbish in your database, but that's the fault of people who insist on not being American. Or who spell Zoe with an umlaut.
In Python 3, if you try the same thing, you get an error. Fixing the error requires thought, and even if that is only a minuscule amount of thought, that's too much for some developers who are scared of Unicode. Hence the FUD that Python 3 is too hard because it makes you learn Unicode.
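A small sketch of the behaviour described above, assuming a throwaway temporary file and a made-up payload. Reading bytes never complains; reading text with the wrong codec fails at once; naming the right codec is the small amount of thought required:

```python
import os
import tempfile

# Hypothetical input: a name saved as UTF-8 by some other program.
payload = "Zo\u00eb".encode("utf-8")          # b'Zo\xc3\xab'

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode: just bytes, no questions asked (and no characters either).
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with the wrong codec: the mistake surfaces immediately.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        failed = False
    except UnicodeDecodeError:
        failed = True

    # Text mode with the right codec: three characters come back.
    with open(path, encoding="utf-8") as f:
        as_text = f.read()
finally:
    os.remove(path)
```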
I know this isn't exactly helpful, but I wish they'd just HTFU. I'm with Joel Spolsky on this one: if you're a programmer in 2003 who doesn't have at least a basic working knowledge of Unicode, you're the equivalent of a doctor who doesn't believe in germs.
Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Sun, 12 Feb 2012 12:28:30 +1100, Chris Angelico wrote:
> > On Sun, Feb 12, 2012 at 12:21 PM, Eric Snow
> > <ericsnowcurren...@gmail.com> wrote:
> >> However, in at
> >> least one current thread (on python-ideas) and at a variety of times in
> >> the past, _some_ people have found Unicode in Python 3 to make more
> >> work.
> > If Unicode in Python is causing you more work, isn't it most likely that
> > the issue would have come up anyway?
> The argument being made is that in Python 2, if you try to read a file
> that contains Unicode characters encoded with some unknown codec, you
> don't have to think about it. Sure, you get moji-bake rubbish in your
> database, but that's the fault of people who insist on not being
> American. Or who spell Zoe with an umlaut.
That's not the worst of it... I have many times had a block of text
that was valid ASCII except for some intermixed Unicode white-space.
Who the hell would even consider inserting Unicode white-space!!!
> the most obvious answer would be to read the file WITHOUT worrying
> about asinine encoding.
What this statement misunderstands, though, is that ASCII is itself an
encoding. Files contain bytes, and it's only what's external to those
bytes that gives them meaning. The famous "bush hid the facts" trick
with Windows Notepad shows the folly of trying to use internal
evidence to identify meaning from bytes.
Everything that displays text to a human needs to translate bytes into
glyphs, and the usual way to do this conceptually is to go via
characters. Pretending that it's all the same thing really means
pretending that one byte represents one character and that each
character is depicted by one glyph. And that's doomed to failure,
unless everyone speaks English with no foreign symbols - so, no
mathematical notations.
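To make the byte / character / glyph distinction concrete, here is a short sketch using only the standard `unicodedata` module. The choice of 'é' is an illustration, not something from the thread: both forms below normally render as one glyph, yet the counts differ at every level.

```python
import unicodedata

composed = "\u00e9"                                   # 'é' as one code point
decomposed = unicodedata.normalize("NFD", composed)   # 'e' + combining acute

# Same glyph on screen, different numbers of code points and bytes.
counts = {
    "composed code points": len(composed),                       # 1
    "decomposed code points": len(decomposed),                   # 2
    "composed UTF-8 bytes": len(composed.encode("utf-8")),       # 2
    "decomposed UTF-8 bytes": len(decomposed.encode("utf-8")),   # 3
}
for label, n in counts.items():
    print(label, n)
```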
On Sun, 12 Feb 2012 15:38:37 +1100, Chris Angelico wrote:
> Everything that displays text to a human needs to translate bytes into
> glyphs, and the usual way to do this conceptually is to go via
> characters. Pretending that it's all the same thing really means
> pretending that one byte represents one character and that each
> character is depicted by one glyph. And that's doomed to failure, unless
> everyone speaks English with no foreign symbols - so, no mathematical
> notations.
Pardon me, but you can't even write *English* in ASCII.
You can't say that it cost you £10 to courier your résumé to the head office of Encyclopædia Britannica to apply for the position of Staff Coördinator. (Admittedly, the umlaut on the second "o" looks a bit stuffy and old-fashioned, but it is traditional English.)
ASCII truly is a blight on the world, and the sooner it fades into obscurity, like EBCDIC, the better.
Even if everyone did change to speak ASCII, you still have all the historical records and documents and files to deal with. Encodings are not going away.
Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> You can't say that it cost you £10 to courier your résumé to the head
> office of Encyclopædia Britannica to apply for the position of Staff
> Coördinator.
True, but if it cost you $10 (or 10 GBP) to courier your curriculum
vitae to the head office of Encyclopaedia Britannica to become Staff
Coordinator, then you'd be fine. And if it cost you $10 to post your
work summary to Britannica's administration to apply for this Staff
Coordinator position, you could say it without 'e' too. Doesn't mean
you don't need Unicode!
> the most obvious answer would be to read the file WITHOUT worrying about
> asinine encoding.
Your mad leet reading comprehension skillz leave me in awe Rick.
If you try to read a file containing non-ASCII characters encoded using UTF8 on Windows without explicitly specifying either UTF8 as the encoding, or an error handler, you will get an exception.
It's not just UTF8 either, but nearly all encodings. You can't even expect to avoid problems if you stick to nothing but Windows, because Windows' default encoding is localised: a file generated in (say) Israel or Japan or Germany will use a different code page (encoding) by default than one generated in (say) the US, Canada or UK.
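A rough illustration of the code-page point, using three Windows code pages chosen as examples (the thread names countries, not specific code pages). The same byte means a different letter in each. Note also that UTF-8 read through a single-byte code page often yields silent mojibake rather than an exception; the error only appears when a byte happens to be undefined in that code page.

```python
import locale

byte = b"\xe9"   # one byte, as a legacy Windows program might write it

decoded = {
    "cp1252": byte.decode("cp1252"),   # Western European: U+00E9, e with acute
    "cp1251": byte.decode("cp1251"),   # Cyrillic: U+0439, short i
    "cp1255": byte.decode("cp1255"),   # Hebrew: U+05D9, yod
}

# The locale's preferred codec on this machine (varies by OS and settings).
default_codec = locale.getpreferredencoding(False)

# UTF-8 bytes read through cp1252: this pair decodes, but to the wrong text...
mojibake = "\u00e9".encode("utf-8").decode("cp1252")   # b'\xc3\xa9' -> 'Ã©'

# ...while this pair contains 0x81, which cp1252 leaves undefined.
try:
    "\u00c1".encode("utf-8").decode("cp1252")          # b'\xc3\x81'
    undefined_byte_failed = False
except UnicodeDecodeError:
    undefined_byte_failed = True
```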
> It's not just UTF8 either, but nearly all encodings. You can't even
> expect to avoid problems if you stick to nothing but Windows, because
> Windows' default encoding is localised: a file generated in (say) Israel
> or Japan or Germany will use a different code page (encoding) by default
> than one generated in (say) the US, Canada or UK.
Generated by what? Windows will store a locale value for programs to
use, but programs use Unicode internally by default (i.e., API calls are
Unicode unless they were built for old versions of Windows), and the
default filesystem (NTFS) uses Unicode for file names. AFAIK, only the
terminal has a localized code page by default.
Perhaps Notepad will write text files with the localized code page by
default, but that's an application choice...
> - Try decoding with UTF8 or Latin1. Even if you don't get the right
> characters, you'll get *something*.
> - Use open(filename, encoding='ascii', errors='surrogateescape')
> (Or possibly errors='ignore'.)
These are not good answers, IMHO. The only answer I can think of, really, is:
- pack your luggage; your submarine is waiting for you to peel onions in it (a reference to Joel's article). Meaning, really, you should learn your craft and pull your head out of the sand. There is a wider world around you.
(and yes, I am a Czech, so I need at least latin-2 for my language).
>> - Try decoding with UTF8 or Latin1. Even if you don't get the right
>> characters, you'll get *something*.
>> - Use open(filename, encoding='ascii', errors='surrogateescape')
>> (Or possibly errors='ignore'.)
> These are not good answers, IMHO. The only answer I can think of, really,
> is:
A slightly less flameish answer to the question “What should I do, really?” is a tough one: all these suggested answers are bad because they don’t deal with the fact that your input data are obviously broken. The rest is just pure GIGO … without fixing (and I mean really fixing, not ignoring the problem, which is what the previous answers suggest) your input, you’ll get garbage on output. And you should be thankful to py3k that it showed the issue to you.
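For reference, the error handlers in the quoted suggestions behave roughly as sketched below. The input bytes are made up for illustration. None of the handlers repairs broken input; they differ only in whether the original bytes can still be recovered afterwards.

```python
raw = b"caf\xc3\xa9 \xff"       # UTF-8 'café' plus one stray byte (made up)

# Lossy: undecodable bytes vanish without a trace.
ignored = raw.decode("ascii", errors="ignore")             # 'caf '

# Lossless: each bad byte is carried through as a lone surrogate...
escaped = raw.decode("ascii", errors="surrogateescape")
# ...so the original bytes can be written back out unchanged.
round_trip = escaped.encode("ascii", errors="surrogateescape")

# Latin-1 never fails either, but it guesses: here the guess is mojibake.
latin1 = raw.decode("latin-1")                             # 'cafÃ© ÿ'
```

The same `errors=` values can be passed to `open()`, as in the quoted suggestion.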
On Sun, 12 Feb 2012 01:05:35 -0600, Andrew Berg wrote:
> On 2/12/2012 12:10 AM, Steven D'Aprano wrote:
>> It's not just UTF8 either, but nearly all encodings. You can't even
>> expect to avoid problems if you stick to nothing but Windows, because
>> Windows' default encoding is localised: a file generated in (say)
>> Israel or Japan or Germany will use a different code page (encoding) by
>> default than one generated in (say) the US, Canada or UK.
> Generated by what? Windows will store a locale value for programs to
> use, but programs use Unicode internally by default
Which programs? And we're not talking about what they use internally, but what they write to files.
> (i.e., API calls are
> Unicode unless they were built for old versions of Windows), and the
> default filesystem (NTFS) uses Unicode for file names.
No. File systems do not use Unicode for file names. Unicode is an abstract mapping between code points and characters. File systems are written using bytes.
Suppose you're a fan of Russian punk band Наӥв and you have a directory of their music. The file system doesn't store the Unicode code points 1053 1072 1253 1074, it has to be encoded to a sequence of bytes first.
NTFS by default uses the UTF-16 encoding, which means the actual bytes written to disk are \x1d\x040\x04\xe5\x042\x04 (possibly with a leading byte-order mark \xff\xfe).
Windows has two separate APIs, one for "wide" characters, the other for single bytes. Depending on which one you use, the directory will appear to be called Наӥв or 0å2.
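The numbers above can be checked in a few lines. This sketch assumes little-endian UTF-16 and uses Latin-1 as a stand-in for "one byte per character" (the actual narrow code page depends on the system):

```python
name = "\u041d\u0430\u04e5\u0432"          # the band name from the example

code_points = [ord(ch) for ch in name]     # [1053, 1072, 1253, 1074]
utf16 = name.encode("utf-16-le")           # explicit byte order, so no BOM

# What a byte-oriented (one byte = one character) view makes of those bytes.
narrow = utf16.decode("latin-1")
# Dropping the unprintable control characters leaves the visible residue.
visible = "".join(ch for ch in narrow if ch.isprintable())   # '0å2'
```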
But in any case, we're not talking about the file name encoding. We're talking about the contents of files.
> AFAIK, only the
> terminal has a localized code page by default. Perhaps Notepad will
> write text files with the localized code page by default, but that's an
> application choice...
Exactly. And unless you know what encoding the application chooses, you will likely get an exception trying to read the file.
> NTFS by default uses the UTF-16 encoding, which means the actual bytes
> written to disk are \x1d\x040\x04\xe5\x042\x04 (possibly with a leading
> byte-order mark \xff\xfe).
That's what I meant. Those bytes will be interpreted consistently across
all locales.
> Windows has two separate APIs, one for "wide" characters, the other for
> single bytes. Depending on which one you use, the directory will appear
> to be called Наӥв or 0å2.
Yes, and AFAIK, the wide API is the default. The other one only exists
to support programs that don't support the wide API (generally, such
programs were intended to be used on older platforms that lack that API).
> But in any case, we're not talking about the file name encoding. We're
> talking about the contents of files.
Okay then. As I stated, this has nothing to do with the OS since
programs are free to interpret bytes any way they like.
> On 12.2.2012 09:14, Matej Cepl wrote:
>>> Obvious answers:
>>> - Try decoding with UTF8 or Latin1. Even if you don't get the right
>>> characters, you'll get *something*.
>>> - Use open(filename, encoding='ascii', errors='surrogateescape')
>>> (Or possibly errors='ignore'.)
>> These are not good answers, IMHO. The only answer I can think of, really,
>> is:
> A slightly less flameish answer to the question “What should I do,
> really?” is a tough one: all these suggested answers are bad because
> they don’t deal with the fact that your input data are obviously
> broken. The rest is just pure GIGO … without fixing (and I mean really
> fixing, not ignoring the problem, which is what the previous answers
> suggest) your input, you’ll get garbage on output. And you should be
> thankful to py3k that it showed the issue to you.
> BTW, can you display the following line?
> Příliš žluťoučký kůň úpěl ďábelské ódy.
> Best,
> Matěj
Yes in Thunderbird, Notepad, Wordpad and Notepad++ on Windows Vista, can't be bothered to try any other apps.
> > the most obvious answer would be to read the file WITHOUT worrying
> > about asinine encoding.
> What this statement misunderstands, though, is that ASCII is itself an
> encoding. Files contain bytes, and it's only what's external to those
> bytes that gives them meaning.
Exactly. <soapbox class="wise-old-geezer">. ASCII was so successful at becoming a universal standard which lasted for decades, people who grew up with it don't realize there was once any other way. Not just EBCDIC, but also SIXBIT, RAD-50, tilt/rotate, packed card records, and so on.
Transcoding was a way of life, and if you didn't know what you were starting with and aiming for, it was hopeless. Kind of like now where we are again with Unicode. </soapbox>
In article <4f375347$0$29986$c3e8da3$54964...@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> ASCII truly is a blight on the world, and the sooner it fades into
> obscurity, like EBCDIC, the better.
That's a fair statement, but it's also fair to say that at the time it came out (49 years ago!) it was a revolutionary improvement on the extant state of affairs (every manufacturer inventing their own code, and often different codes for different machines). Given the cost of both computer memory and CPU cycles at the time, sticking to a 7-bit code (the 8th bit was for parity) was a necessary evil.
And, before people complain about the character set being US-Centric, keep in mind that the A in ASCII stands for American, and it was published by ANSI (whose A also stands for American). I'm not trying to wave the flag here, just pointing out that it was never intended to be anything other than a national character set.
Part of the complexity of Unicode is that when people switch from working with ASCII to working with Unicode, they're really having to master two distinct things at the same time (and often conflate them into a single confusing mess). One is the Unicode character set. The other is a specific encoding (UTF-8, UTF-16, etc). Not to mention silly things like BOM (Byte Order Mark). I expect that some day, storage costs will become so cheap that we'll all just be using UTF-32, and programmers of the day will wonder how their poor parents and grandparents ever managed in a world where nobody quite knew what you meant when you asked, "how long is that string?".
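A quick sketch of why "how long is that string?" has several answers. The sample word is chosen for illustration; the codec names are the standard Python ones:

```python
import codecs

s = "r\u00e9sum\u00e9"                        # 'résumé': 6 code points

sizes = {
    "code points": len(s),                     # 6
    "utf-8": len(s.encode("utf-8")),           # 8: each accented letter takes 2 bytes
    "utf-16-le": len(s.encode("utf-16-le")),   # 12: 2 bytes per character here
    "utf-32-le": len(s.encode("utf-32-le")),   # 24: always 4 bytes per code point
}

# The generic "utf-16" codec prepends a byte-order mark in native byte order.
with_bom = s.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Even UTF-32 only fixes the bytes-per-code-point question; combining characters still mean that code points and displayed glyphs need not match.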
On Sun, 12 Feb 2012 17:08:24 +1100, Chris Angelico wrote:
> On Sun, Feb 12, 2012 at 4:51 PM, Steven D'Aprano
> <steve+comp.lang.pyt...@pearwood.info> wrote:
>> You can't say that it cost you £10 to courier your résumé to the head
>> office of Encyclopædia Britannica to apply for the position of Staff
>> Coördinator.
> True, but if it cost you $10 (or 10 GBP) to courier your curriculum
> vitae to the head office of Encyclopaedia Britannica to become Staff
> Coordinator, then you'd be fine. And if it cost you $10 to post your
> work summary to Britannica's administration to apply for this Staff
> Coordinator position, you could say it without 'e' too. Doesn't mean you
> don't need Unicode!
Back in the late 1970's, the economy and the outlook in the USA sucked, and the following joke made the rounds:
Mr. Smith: Good morning, Mr. Jones. How are you?
Mr. Jones: I'm fine.
(The humor is that Mr. Jones had his head so far [in the sand] that he thought that things were fine.)
American English is my first spoken language, but I know enough French, Greek, math, and other languages that I am very happy to have more than ASCII these days. I imagine that even Steven's surname should be spelled D’Aprano rather than D'Aprano.
On Feb 12, 10:51 am, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Sun, 12 Feb 2012 15:38:37 +1100, Chris Angelico wrote:
> > Everything that displays text to a human needs to translate bytes into
> > glyphs, and the usual way to do this conceptually is to go via
> > characters. Pretending that it's all the same thing really means
> > pretending that one byte represents one character and that each
> > character is depicted by one glyph. And that's doomed to failure, unless
> > everyone speaks English with no foreign symbols - so, no mathematical
> > notations.
> Pardon me, but you can't even write *English* in ASCII.
> You can't say that it cost you £10 to courier your résumé to the head
> office of Encyclopædia Britannica to apply for the position of Staff
> Coördinator. (Admittedly, the umlaut on the second "o" looks a bit stuffy
> and old-fashioned, but it is traditional English.)
That's interesting. When I wrote that, it showed on my screen as a cent symbol and a copyright symbol. What I see in your response is an upper case "A" with a hat accent (circumflex?) over it followed by a cent symbol, and likewise an upper case "A" with a hat accent over it followed by copyright symbol.
Oh, for the days of ASCII again :-)
Not to mention, of course, that I wrote <colon><dash><close-paren>, but I fully expect some of you will be reading this with absurd clients which turn that into some kind of smiley-face image.
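The "A with an accent followed by another symbol" effect described above is the classic signature of UTF-8 bytes being displayed as Latin-1. A minimal reproduction follows, assuming those two encodings (the thread does not say which ones the mail clients actually used):

```python
original = "\u00a3 and \u00e9"             # '£ and é'

wire = original.encode("utf-8")            # what the sender's client transmits
garbled = wire.decode("latin-1")           # what a Latin-1 reader displays: 'Â£ and Ã©'

# Reversing the wrong step recovers the text, as long as nothing was lost.
repaired = garbled.encode("latin-1").decode("utf-8")
```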
> Any volunteers to create an Extended Baudot... Instead of "letter
> shift" and "number shift" we could have a generic "encoding shift" which
> uses the following characters to identify which 7-bit subset of Unicode
> is to be represented <G>
rusi <rustompm...@gmail.com> wrote:
> On Feb 12, 10:51 am, Steven D'Aprano <steve
> +comp.lang.pyt...@pearwood.info> wrote:
> > On Sun, 12 Feb 2012 15:38:37 +1100, Chris Angelico wrote:
> > > Everything that displays text to a human needs to translate bytes into
> > > glyphs, and the usual way to do this conceptually is to go via
> > > characters. Pretending that it's all the same thing really means
> > > pretending that one byte represents one character and that each
> > > character is depicted by one glyph. And that's doomed to failure, unless
> > > everyone speaks English with no foreign symbols - so, no mathematical
> > > notations.
> > Pardon me, but you can't even write *English* in ASCII.
> > You can't say that it cost you £10 to courier your résumé to the head
> > office of Encyclopædia Britannica to apply for the position of Staff
> > Coördinator. (Admittedly, the umlaut on the second "o" looks a bit stuffy
> > and old-fashioned, but it is traditional English.)
> [Quite OT but...] How do you type all this?
> [Note: I grew up on APL so unlike Rick I am genuinely asking :-) ]
What I do (on a Mac) is open the Keyboard Viewer thingie and try various combinations of shift-control-option-command-function until the thing I'm looking for shows up on a keycap. A few of them I've got memorized (for example, option-8 gets you a bullet •). I would imagine if you commonly type in a language other than English, you would quickly memorize the ones you use a lot.
Or, open the Character Viewer thingie and either hunt around the various drill-down menus (North American Scripts / Canadian Aboriginal Syllabics, for example) or type in some guess at the official unicode name into the search box.
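From inside Python, the same lookup by official Unicode name can be done with the standard `unicodedata` module. A short sketch; the characters picked are the ones that came up earlier in the thread:

```python
import unicodedata

# Name -> character, much like typing a name into a character viewer's search box.
bullet = unicodedata.lookup("BULLET")                 # U+2022
pound = unicodedata.lookup("POUND SIGN")              # U+00A3
ae = unicodedata.lookup("LATIN SMALL LETTER AE")      # U+00E6

# The same names work directly inside string literals.
same_bullet = "\N{BULLET}"

# Character -> name, for when you have the symbol but not what it is called.
names = [unicodedata.name(ch) for ch in "\u00e9\u00f6"]
```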