API Server - JSON content-type

Nigel

Dec 30, 2016, 3:07:50 PM

to TheyWorkForYou

Hi,

I have just started looking at creating some Haskell bindings for the API and have run up against an issue almost immediately.

The Content-Type that the API currently returns for JSON indicates ISO 8859-1 which is apparently not allowed by the RFC for JSON content.

See section 8.1 - http://rfc7159.net/rfc7159 / https://www.rfc-editor.org/rfc/rfc7159.txt

I am going to see if I can get the Servant library to accept it by creating a new MIME type for JSON received from this service, but it would be good if people didn't have to. Additionally, it could be confusing some other libraries and causing decoding issues if they actually attempt to honour the charset and decode UTF-8 data as ISO 8859-1.
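
To illustrate the decoding concern, here is a minimal Python sketch of what happens when UTF-8 bytes are decoded as ISO 8859-1 because a client honours the declared charset. The sample string is made up for illustration and is not taken from the API.

```python
# Hypothetical sample value; not real API output.
original = "Siân"

# Suppose the server actually sent UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# A client that trusts "charset=iso-8859-1" decodes the bytes like this.
# ISO 8859-1 maps every byte to a character, so no error is raised and
# the damage goes unnoticed.
misread = utf8_bytes.decode("iso-8859-1")

# Decoding with the encoding that was actually used round-trips cleanly.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

The misread value has one character more than the original, because the two-byte UTF-8 sequence for "â" is turned into two separate characters.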

If this was a conscious decision on the part of the maintainers I'd be interested to hear the reasoning.

Example from curl, with the API key blocked out:

---
bash$ curl -i -H"Accept: application/json" -X GET "https://www.theyworkforyou.com/api/getConstituency?key=XXXXXXXXXXXXXXXXXXXXXXXX&name=Stretford+and+Urmston"
HTTP/1.1 200 OK
Server: nginx
Date: Fri, 30 Dec 2016 19:55:33 GMT
Content-Type: text/javascript; charset=iso-8859-1
Content-Length: 240
Connection: keep-alive
Access-Control-Allow-Origin: *
x-url: /api/getConstituency?key=XXXXXXXXXXXXXXXXXXXXXXXX&name=Stretford+and+Urmston
Accept-Ranges: bytes
Age: 0

{"bbc_constituency_id":"551","guardian_election_results":"http://www.guardian.co.uk/politics/constituency/1347/stretford-and-urmston","guardian_id":"1347","guardian_name":"Stretford and Urmston","pa_id":"545","name":"Stretford and Urmston"}
---

Thanks,

Nigel

Matthew Somerville

Dec 30, 2016, 5:19:03 PM

to TheyWorkForYou

Hi,

On 30 December 2016 at 20:07, Nigel <nigel....@gmail.com> wrote:
> The Content-Type that the API currently returns for JSON indicates ISO
> 8859-1 which is apparently not allowed by the RFC for JSON content.

This is why the API does not at present claim the results are JSON :-)
I believe it uses “JS” in the documentation, and the Content-Type
header is “text/javascript”, not “application/json”.

> I am going to see if I can get the Servant library to accept it by creating
> a new MIME type for JSON received from this service but it would be good if
> people didn't have to, additionally it could be confusing some other
> libraries and causing decoding issues if they actually attempt to honour the
> charset and decode UTF8 data as ISO 8859-1.

I fully agree; do note the data *is* (for MP data) ISO-8859-1, e.g.
see the output of getMP with id=11148 (constituency), or id=11863
(given name). So there shouldn’t be decoding issues if that is
honoured (though I would of course understand if a JSON library
refused to act on non-UTF-8 data). There are a tiny number of speeches
that do appear to have ISO-8859-1 or UTF-8 data, but those could be
classed as mistakes in either case – almost all special characters are
stored as HTML entities, which of course isn’t great either by any
means!
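
As a rough client-side sketch of the point above (honour the declared charset, parse the result, then deal with the HTML entities), something like the following Python could work. The function names and the sample payload are hypothetical, and the ISO-8859-1 fallback is an assumption based on the header shown earlier in the thread.

```python
import html
import json
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


def decode_api_body(body, content_type):
    """Decode raw bytes with the declared charset, then parse as JSON."""
    text = body.decode(charset_from_content_type(content_type))
    return json.loads(text)


def unescape_entities(value):
    """Recursively turn HTML entities in string values into characters."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_entities(v) for v in value]
    if isinstance(value, dict):
        return {k: unescape_entities(v) for k, v in value.items()}
    return value


# Hypothetical payload, encoded the way the header claims.
sample_body = '{"constituency": "Ynys M\u00f4n", "body": "caf&eacute; &amp; bar"}'.encode("iso-8859-1")
sample_header = "text/javascript; charset=iso-8859-1"

parsed = unescape_entities(decode_api_body(sample_body, sample_header))
print(parsed)
```

Decoding before parsing keeps the JSON parser out of the charset question entirely, which sidesteps libraries that refuse non-UTF-8 input.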

> If this was a conscious decision on the part of the maintainers I'd be
> interested to hear the reasoning.

It is only a conscious decision in so much as it is historic attrition
due to the age of the site and source data. It would obviously be
better if the output were in UTF-8 and it could therefore be true
JSON, but it is not currently so, we haven’t had the time to do
anything about it, and clearly no-one else has yet either :) Perhaps a
straightforward conversion before API output would be all that is
necessary for the API, rather than trying to solve it at a deeper
internal level (though I think the name/constituency source data is in
UTF-8, so that at least would hopefully not be complex).
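
The "straightforward conversion before API output" idea could look roughly like the Python sketch below. This is only an illustration of the transcoding step, not the site's actual code; the function name and the choice of response headers are assumptions.

```python
def to_utf8_response(body, source_encoding="iso-8859-1"):
    """Transcode a response body to UTF-8 and build matching headers."""
    utf8_body = body.decode(source_encoding).encode("utf-8")
    headers = {
        # RFC 7159 defines no charset parameter for application/json.
        "Content-Type": "application/json",
        # Non-ASCII characters take more bytes in UTF-8, so the length
        # has to be recomputed after conversion.
        "Content-Length": str(len(utf8_body)),
    }
    return headers, utf8_body


# Hypothetical ISO-8859-1 payload, for illustration only.
latin1_body = '{"name": "M\u00f4n"}'.encode("iso-8859-1")
headers, utf8_body = to_utf8_response(latin1_body)
print(headers, utf8_body)
```

This only works if the stored data really is uniformly in the source encoding; any UTF-8 strays would be double-encoded by the conversion.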

ATB,
Matthew

Nigel

Dec 31, 2016, 6:51:51 AM

to TheyWorkForYou

Hi, thanks for the quick reply :)

On Friday, 30 December 2016 22:19:03 UTC, matthew wrote:

> Hi,
>
> On 30 December 2016 at 20:07, Nigel <nigel....@gmail.com> wrote:
> > The Content-Type that the API currently returns for JSON indicates ISO
> > 8859-1 which is apparently not allowed by the RFC for JSON content.
>
> This is why the API does not at present claim the results are JSON :-)
> I believe it uses “JS” in the documentation, and the Content-Type
> header is “text/javascript”, not “application/json”.

Okay, I missed this. I guess I have seen other people serve text/javascript and call it JSON too.

> > I am going to see if I can get the Servant library to accept it by creating
> > a new MIME type for JSON received from this service but it would be good if
> > people didn't have to, additionally it could be confusing some other
> > libraries and causing decoding issues if they actually attempt to honour the
> > charset and decode UTF8 data as ISO 8859-1.
>
> I fully agree; do note the data *is* (for MP data) ISO-8859-1, e.g.
> see the output of getMP with id=11148 (constituency), or id=11863
> (given name). So there shouldn’t be decoding issues if that is
> honoured (though I would of course understand if a JSON library
> refused to act on non-UTF-8 data). There are a tiny number of speeches
> that do appear to have ISO-8859-1 or UTF-8 data, but those could be
> classed as mistakes in either case – almost all special characters are
> stored as HTML entities, which of course isn’t great either by any
> means!

Okay. I haven't played with more than two of the API resources yet, so I guess I will find this out, but it sounds like you are saying there is a mix of encodings in the data returned from the API? Maybe I'll get back to you on this after looking some more.

> > If this was a conscious decision on the part of the maintainers I'd be
> > interested to hear the reasoning.
>
> It is only a conscious decision in so much as it is historic attrition
> due to the age of the site and source data. It would obviously be
> better if the output were in UTF-8 and it could therefore be true
> JSON, but it is not currently so, we haven’t had the time to do
> anything about it, and clearly no-one else has yet either :) Perhaps a
> straightforward conversion before API output would be all that is
> necessary for the API, rather than trying to solve it at a deeper
> internal level (though I think the name/constituency source data is in
> UTF-8, so that at least would hopefully not be complex).

Well, if the source data has both then I understand why you would maybe just pass it through unmodified. Given time and resources I would suggest normalising into UTF-8 but I know those resources might not be available.

My only concern would be mixing up encodings if both exist in different places. I will look further at the output of some other resources and get back to you on this I guess.

Thanks,

Nigel

Matthew Somerville

Jan 3, 2017, 8:02:54 AM

to TheyWorkForYou

> Okay, so I haven't played with more than two of the API resources yet so I
> guess I will find this out but it sounds like you are saying there is a mix
> of encodings in the data returned from the API?

No, there should not be. The name/constituency data is in ISO-8859-1,
and the speech data should all be ASCII with HTML entities. If there
is anything that is in another encoding, that will have been a bug in
that data import.

From a search, 0.06% of entries in the epobject table contain a
non-ASCII character; it looks like those are in one of three
categories:
* Specially-imported things, e.g. October 1981 written answers (not
sure where they came from!), UTF-8;
* Unknown speaker names in Scottish Parliament or historic (1937-1950,
1997-2001) Commons debates, UTF-8;
* Division MP names in some old Public Bill committees, ISO-8859-1.
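
For anyone wanting to sort stored values into those categories, a simple heuristic is sketched below in Python. It is not how the search above was done; the function name is hypothetical, and treating "not valid UTF-8" as ISO-8859-1 is an assumption that only holds because those are the two encodings expected in this data.

```python
def classify_encoding(raw):
    """Return 'ascii', 'utf-8', or 'iso-8859-1 (assumed)' for a byte string."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Text in ISO-8859-1 with accented letters is very rarely valid
        # UTF-8, so a successful strict decode is a strong signal.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this is a fallback
        # guess rather than a positive identification.
        return "iso-8859-1 (assumed)"


# Hypothetical sample values, for illustration only.
for value in (b"plain text", "Si\u00e2n".encode("utf-8"), "Si\u00e2n".encode("iso-8859-1")):
    print(value, classify_encoding(value))
```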

Now that I've done that search, I think a UTF-8 switch might be easier
than previously thought, because all that would need to change would
be the name/constituency data (which as I mentioned is UTF-8 at source
anyway), and then converting those old Public Bill division MP names.
I've updated https://github.com/mysociety/theyworkforyou/issues/815
with details from this thread, perhaps someone will volunteer or have
time to take a look.

ATB,
Matthew
