FWIW: I haven’t seen this before, but during development there were times when Dataverse would write input to the MDC log file that counter-processor couldn’t process. I’d suggest checking the MDC log file from that day for anything unusual, e.g. special characters such as tabs in user-supplied text. It’s possible that Dataverse needs to escape or remove some character. If so, editing the MDC log file to fix that character is a workaround that should let you keep processing. I think counter-processor produces a log as well, though I haven’t checked recently; that may help pinpoint which entry is problematic.
I would not suspect the database constraint; it’s not really related. Also, QDR applied that constraint and has no problem running counter-processor: I just checked, and QDR is current w.r.t. daily counter processing.
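As a rough illustration of that kind of check, here is a minimal Python sketch. It assumes the MDC log is tab-separated and starts with a `#Fields:` header naming the columns; the function name `find_suspect_lines` is made up for this example and is not part of Dataverse or counter-processor. It flags lines that contain control characters or whose field count differs from the header, which is what a stray tab in user-supplied text would cause.

```python
import re

# Control characters other than tab; line endings are stripped before checking.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0a-\x1f\x7f]")


def find_suspect_lines(lines):
    """Return (line_number, reason) pairs for lines that look malformed.

    Assumption: a tab-separated log whose '#Fields:' header names the columns.
    """
    expected = None
    suspects = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("#Fields:"):
            # Count the column names listed in the header.
            expected = len(line[len("#Fields:"):].split())
            continue
        if not line or line.startswith("#"):
            continue
        if CONTROL_CHARS.search(line):
            suspects.append((number, "control character"))
            continue
        count = len(line.split("\t"))
        if expected is not None and count != expected:
            suspects.append((number, f"{count} fields, expected {expected}"))
    return suspects
```

Passing an open file handle for the day's log (opened with `encoding="utf-8"`) should list the line numbers worth inspecting by hand.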
Hope that helps,
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/4f15d8f0-7293-4f17-b9a2-98949a2b5b26n%40googlegroups.com.
I think what you propose should work. Counter-processor can be forced to rerun things by resetting its state (in the ./state directory), and the Dataverse API called in the counter_daily script overwrites the monthly value. So when you fix the one file and rerun, it should update the Dataverse db without issue.
FWIW: if you can find the counter-processor log file, it might point you to a specific line in the bad log file.
#Fields: event_time client_ip session_cookie_id user_cookie_id user_id request_url identifier filename size user-agent title publisher publisher_id authors publication_date version other_id target_url publication_year
publication_year is the last item, so scanning for any lines that do not have a year at the end should help you find the bad entry or entries. Removing those lines, or adding a year to them, should allow processing. The next question would be how and why those entries are ending up in the log; hopefully once you see what they are, it will suggest some next steps. Either there’s a bug in the MDC logging or you have some datasets that don’t have a publication date for some reason. (As far as I know there are no MDC log issues in Dataverse 5.10.1 or later.)
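The scan described above could be sketched in Python roughly as follows. This assumes the log fields are tab-separated and that comment/header lines start with `#`; the helper name `lines_missing_year` is hypothetical.

```python
import re

FOUR_DIGIT_YEAR = re.compile(r"\d{4}")


def lines_missing_year(lines):
    """Yield (line_number, last_field) for data lines not ending in a year.

    Assumption: fields are tab-separated and publication_year is the last one.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#Fields:' header
        last_field = line.split("\t")[-1].strip()
        if not FOUR_DIGIT_YEAR.fullmatch(last_field):
            yield number, last_field
```

Each reported line number can then be inspected in the log and either removed or given a year, as suggested above.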
Counter-processor has some info on reprocessing at https://github.com/CDLUC3/counter-processor#maintaining-state-between-runs . My recollection is that removing the monthly file and the entry for the month in the JSON file will make counter-processor forget that month. To reprocess, you’d then have to set year_month in the config file to the month you want processed (such as 2022-09). By default, the counter_daily.sh script overwrites that value with the current month, so one easy way to handle a past month is to add a line to counter_daily.sh that overrides YEAR_MONTH and then run the script. In the example below, the first line is the one already in the script and the second, added line resets the value to April 2020, so that month is rerun instead of the current month.
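# Line already in the script: takes the month from yesterday's date
YEAR_MONTH=$(date -d "yesterday 13:00" '+%Y-%m')
# Added line: overrides the value to rerun a past month (here 2020-04)
YEAR_MONTH=$(date -d "2020-04-30" '+%Y-%m')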
It is always good to keep a backup but, since counter processor doesn’t delete the Dataverse mdc logs, deleting its state for a month and rerunning for that month is unlikely to break anything.
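For anyone who prefers to script the "forget a month" step, here is a hedged Python sketch. It is based on assumptions rather than on the counter-processor source: that the JSON state file is a top-level object keyed by `YYYY-MM`, and that it is named `statefile.json` (passed as a parameter so it can be changed). The monthly database name follows the `counter_db_<YYYY-MM>.sqlite3` pattern. The function `forget_month` is made up for this example, and it keeps `.bak` copies of whatever it changes.

```python
import json
import shutil
from pathlib import Path


def forget_month(state_dir, year_month, json_name="statefile.json"):
    """Back up and remove counter-processor state for one month, e.g. '2022-09'.

    Assumptions (not confirmed here): the JSON state file is a top-level
    object keyed by 'YYYY-MM', and json_name is its actual file name.
    """
    state_dir = Path(state_dir)
    removed = []

    # Monthly SQLite database, e.g. counter_db_2022-09.sqlite3
    db_file = state_dir / f"counter_db_{year_month}.sqlite3"
    if db_file.exists():
        shutil.copy2(db_file, db_file.with_name(db_file.name + ".bak"))
        db_file.unlink()
        removed.append(db_file.name)

    # Entry for the month in the JSON state file, if there is one
    json_file = state_dir / json_name
    if json_file.exists():
        state = json.loads(json_file.read_text(encoding="utf-8"))
        if year_month in state:
            shutil.copy2(json_file, json_file.with_name(json_file.name + ".bak"))
            del state[year_month]
            json_file.write_text(json.dumps(state, indent=2), encoding="utf-8")
            removed.append(f"{json_name}[{year_month}]")
    return removed
```

After something like `forget_month("./state", "2022-09")`, the month would still need to be selected via year_month (or the YEAR_MONTH override above) before rerunning.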
So, two directions to explore:
The nulls appearing in the log mean there’s either a bug in the code or a problem in your database. If you can check the datasetversion table for entries with dataset_id = one of the affected datasets, e.g. 96182 or 96183, you can see whether they do have a version, etc. In comparison, it looks like the same /api/v1/datasets/<id> call is being made for datasets such as id = 95474 without getting a null, so any difference between the entries for that dataset and one that is failing could be a clue. If things look OK there, it would be useful to track down exactly what command is being run and whether there are any differences (in apikey, extra query params, etc.) between the calls that fail and those that succeed.
>>>> Jeya: The problematic dataset entries in the datasetversion table contain version=1, versionnumber as null, and versionstate='RELEASED'. For the other, correct entries, either versionnumber is present or, if it is not, versionstate is 'DRAFT'. What could be the reason for versionnumber being null, and what is the difference between version and versionnumber?
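The pattern reported here (versionstate 'RELEASED' with a null versionnumber) can be searched for across the whole table. The sketch below uses only the table and column names mentioned in this thread; the rest is assumption. In particular, it runs the query against an in-memory SQLite table with made-up rows purely so the example is self-contained, whereas a real Dataverse installation would run the same SELECT against its own database.

```python
import sqlite3

# Query built only from the table/column names mentioned in this thread.
SUSPECT_VERSIONS_SQL = """
SELECT dataset_id, version, versionnumber, versionstate
FROM datasetversion
WHERE versionstate = 'RELEASED' AND versionnumber IS NULL
ORDER BY dataset_id
"""


def find_suspect_versions(connection):
    """Return rows for released versions that have no version number."""
    return connection.execute(SUSPECT_VERSIONS_SQL).fetchall()


def demo_connection():
    """Build a tiny in-memory stand-in table with made-up rows (not real data)."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE datasetversion ("
        "dataset_id INTEGER, version INTEGER, "
        "versionnumber INTEGER, versionstate TEXT)"
    )
    connection.executemany(
        "INSERT INTO datasetversion VALUES (?, ?, ?, ?)",
        [
            (1, 1, 1, "RELEASED"),     # released version with a number
            (2, 1, None, "DRAFT"),     # draft without a number
            (3, 1, None, "RELEASED"),  # the pattern reported above
        ],
    )
    return connection
```

Any dataset_id values returned by the real query would be candidates for the entries that show up with nulls in the MDC log.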
A simpler workaround may be to just update to Python 3.9. My guess is that the peewee library for it is better at handling the null, and that’s why I’ve been able to process a file. The version does not appear to make it into the final make-data-count-report file, so the fact that it is null should be OK if 3.9 manages to process it. It would be nice to figure out the root issue as well. I know there were some problems early on with draft entries getting into the MDC log, but those were fixed many releases ago.
>>>>Jeya: In parallel, we will also look into upgrading to a higher counter-processor version (counter-processor-0.1.04) or to Python 3.9.
Hope that helps,
To rerun counter-processor for the September log files, do we need to remove the counter_db_2022-09.sqlite3 file from the folder, even if no entry was created for "2022-09" in the JSON file?
>> If it is really empty, as it should be if September was never processed, it should be OK to leave it. That said, it will be recreated when Sept is processed so there’s no harm in removing it either.