Hi ups,
The language of a status is determined when new statuses flow
into the search system. The code itself is custom but uses statistics
about which characters are used [1], as well as some other information.
There are some tuning parameters and such, which is all very boring
(but excites me to no end; that code is my pride and joy). Because of
the method used, we're pretty good at determining language, but when
I tried locales (I tested en-us/en-gb and de-de/de-ch) the code was
wrong too often to be useful. I don't think we'll be able to add
locale detection any time in the near future.
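
For readers who want a feel for the character-statistics approach, here
is a minimal sketch in Python. It is not the search system's code, which
is custom and uses additional information. It assumes a common variant
of the idea: compare character-trigram frequencies against per-language
profiles using cosine similarity. The SAMPLES text and the names
trigram_profile, cosine, and guess_language are all made up for
illustration.

```python
from collections import Counter
import math

# Tiny illustrative training text (an assumption for this sketch);
# a real system would build profiles from far more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "what the weather is like today",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "das ist wie das wetter heute ist",
}


def trigram_profile(text):
    """Return the relative frequency of each character trigram in text."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * b.get(gram, 0.0) for gram, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Pick the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(guess_language("the weather is nice today"))
print(guess_language("das wetter ist heute schön"))
```

A sketch like this also hints at why locales are hard: en-us and en-gb
text share nearly all of their common trigrams, so their profiles would
likely be too close to tell apart on a short status.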
The location information in the search system is based on the
user's location at the time of the status. Because Twitter does not ask
for structured location information at registration, the free-text
location is geocoded when it enters the search system. Because of this
lack of structured location information (like a separate country
field), it's very difficult to search within a boundary like a city or
country. The point/radius method for search is optimized for small-area
searches. As you've seen, large-area searches often take in
more than you want. We settled on the point/radius method
because the vast majority of location searches are for a very small
area.
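
To make the point/radius trade-off concrete, here is a rough Python
sketch. It is not how the search system is implemented. It assumes each
status already carries a geocoded latitude/longitude and simply filters
by great-circle (haversine) distance. The function names, the dict
layout, and the sample statuses are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(statuses, lat, lon, radius_km):
    """Return the statuses whose geocoded point lies inside the circle."""
    return [s for s in statuses
            if haversine_km(lat, lon, s["lat"], s["lon"]) <= radius_km]


# Hypothetical statuses with already-geocoded locations.
STATUSES = [
    {"place": "San Francisco, CA", "lat": 37.7749, "lon": -122.4194},
    {"place": "Oakland, CA", "lat": 37.8044, "lon": -122.2712},
    {"place": "Los Angeles, CA", "lat": 34.0522, "lon": -118.2437},
    {"place": "Reno, NV", "lat": 39.5296, "lon": -119.8138},
]

# Small circle around San Francisco: a city-sized search.
small = within_radius(STATUSES, 37.7749, -122.4194, 25)
# Large circle meant to reach Los Angeles: it also sweeps in Reno.
large = within_radius(STATUSES, 37.7749, -122.4194, 600)

print([s["place"] for s in small])
print([s["place"] for s in large])
```

The 25 km circle returns only the two Bay Area statuses. The 600 km
circle, sized to reach Los Angeles, also picks up Reno in Nevada, which
is the "more than you want" effect: a circle cannot follow a state or
country border.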
Thanks;
— Matt Sanford
[1] - Something like
http://tnlessone.wordpress.com/2007/05/13/how-to-detect-which-language-a-text-is-written-in-or-when-science-meets-human/