I've been testing the practical JSON-stat knowledge of several leading AI models (Llama 3.1, Mistral Large 2, GPT-4o, Claude Sonnet...). The goal was not to check whether they know what JSON-stat is (most of them do) but whether general-purpose AI models can be used as conversion or analysis tools for JSON-stat datasets. In my tests I did not use versions of AI models specialized in data analysis or code generation (like Codestral, CodeLlama, Codex/Copilot...).
Unfortunately, Llama and Mistral could not (in my preliminary tests) correctly unflatten the data. With Claude I couldn't proceed (for free) because it considered that I had sent too much data.
GPT-4o was the winner: it seems to understand perfectly how to apply ids and labels to dimensions and categories, and it is capable of unflattening the "value" array as long as the prompt reminds the AI that JSON-stat flattens values in row-major order (the last dimension listed in "id" varies fastest).
GPT-4o was able to convert a JSON-stat dataset into a CSV dataset by just providing the following prompt:
Convert dataset.json which is a JSON-stat dataset into a CSV taking into account that values are stored in the "value" array according to the row-major order.
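For comparison, the unflattening this prompt asks for can be sketched in a few lines of Python. This is only a minimal sketch, assuming a JSON-stat 2.0 dataset with "id", "dimension" and "value" properties; the function names (category_ids, jsonstat_to_csv) and the sample dataset are hypothetical and made up for illustration, and the output uses category ids rather than labels.

```python
import csv
import io
from itertools import product


def category_ids(dim):
    """Return the category ids of one dimension, in cube order."""
    cat = dim["category"]
    index = cat.get("index")
    if index is None:
        # "index" may be omitted when the dimension has a single category
        return list(cat["label"])
    if isinstance(index, dict):
        # object form: {category id: position}
        return sorted(index, key=index.get)
    return list(index)  # array form: ids already in order


def jsonstat_to_csv(ds):
    """Unflatten the row-major "value" array into one CSV row per cell."""
    dim_ids = ds["id"]
    cats = [category_ids(ds["dimension"][d]) for d in dim_ids]
    values = ds["value"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(dim_ids + ["value"])
    # itertools.product varies the last dimension fastest: row-major order
    for pos, combo in enumerate(product(*cats)):
        if isinstance(values, dict):
            v = values.get(str(pos))  # sparse "value" object, keyed by position
        else:
            v = values[pos]
        writer.writerow(list(combo) + [v])
    return out.getvalue()


# Hypothetical two-dimensional dataset, for illustration only
sample = {
    "version": "2.0",
    "class": "dataset",
    "id": ["year", "sex"],
    "size": [2, 2],
    "dimension": {
        "year": {"category": {"index": ["2022", "2023"]}},
        "sex": {"category": {"index": {"M": 0, "F": 1},
                             "label": {"M": "men", "F": "women"}}},
    },
    "value": [10, 20, 30, 40],
}

print(jsonstat_to_csv(sample))
```

With the sample above, the rows come out as 2022/M, 2022/F, 2023/M, 2023/F, which is the order in which the values are stored.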
It also understood prompts like:
You have the following dataset in the JSON-stat format where dimensions order in the cube is stored in the "id" property, dimensions information is stored in the "dimension" property and values are stored in a flatten array in the row-major order in the "value" property: ...
Return a JSON object with the index in the "value" array corresponding to "2023" "from 60 to 64 years old" "foreigner" "men" "population" and the corresponding value.
The reason I asked GPT-4o to return the index in the array as well as the actual value is that, for some odd reason, at least in my tests, the AI correctly computed the index (the difficult part of the request) but was unable to retrieve the actual value from the array (the trivial part). In some cases it even hallucinated, returning a value that was not in the dataset at all (even though the position in the "value" array was totally correct!).
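The index computation and the lookup can be checked independently of any AI with a short script. The following is a minimal Python sketch, assuming the dataset has "id", "size", "dimension" and "value" properties; the function names (category_position, value_index, lookup) and the sample dataset are hypothetical. The selection uses category ids, so labels such as "men" would have to be mapped to ids first.

```python
def category_position(dim, cat_id):
    """Position of a category inside its dimension."""
    index = dim["category"].get("index")
    if index is None:
        return 0  # single-category dimension with no "index"
    if isinstance(index, dict):
        return index[cat_id]  # object form: {category id: position}
    return index.index(cat_id)  # array form


def value_index(ds, selection):
    """Row-major offset of the cell picked by {dimension id: category id}."""
    offset = 0
    for dim_id, size in zip(ds["id"], ds["size"]):
        pos = category_position(ds["dimension"][dim_id], selection[dim_id])
        offset = offset * size + pos
    return offset


def lookup(ds, selection):
    """Return both the index and the value stored at that index."""
    i = value_index(ds, selection)
    values = ds["value"]
    v = values.get(str(i)) if isinstance(values, dict) else values[i]
    return {"index": i, "value": v}


# Hypothetical 2 x 3 x 2 dataset, for illustration only
sample = {
    "id": ["year", "age", "sex"],
    "size": [2, 3, 2],
    "dimension": {
        "year": {"category": {"index": ["2022", "2023"]}},
        "age": {"category": {"index": {"a": 0, "b": 1, "c": 2}}},
        "sex": {"category": {"index": ["M", "F"]}},
    },
    "value": list(range(100, 112)),
}

print(lookup(sample, {"year": "2023", "age": "b", "sex": "F"}))
# prints {'index': 9, 'value': 109}
```

Here the offset is ((1 * 3) + 1) * 2 + 1 = 9: each step multiplies the running offset by the size of the next dimension and adds the position of the category in that dimension.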
I hope this info is useful,
Xavier