Make Data Count not writing JSON report


Yuyun W

Oct 12, 2022, 5:52:02 AM
to Dataverse Users Community
Our MDC has been stuck since Sep 2: it is no longer updating make-data-count-report.json, although the counter logs are still being produced normally.

The output from counter-processor should show the stats calculation followed by "Writing JSON Report". Instead, it stops before the calculation (see the log snapshot below).
I don't think we have any records that might trip up counter-processor (i.e. all records should be utf-8 encoded).
We did, however, run the script to add a database constraint on Sep 2, to address a cybersecurity vulnerability alert.
Has anyone seen a similar issue?
Any advice is much appreciated.




Fri Sep  2 00:00:01 +08 2022

Running report for 2022-09-01T00:00:00 to 2022-10-01T00:00:00
1 daily log file(s) will be added to the database
Last processed date: not processed yet for 2022-09
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-09-01.log

Sat Sep  3 00:00:01 +08 2022

Running report for 2022-09-01T00:00:00 to 2022-10-01T00:00:00
2 daily log file(s) will be added to the database
Last processed date: not processed yet for 2022-09
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-09-01.log

thanks
yuyun 
NTU Library

James Myers

Oct 12, 2022, 9:35:29 AM
to dataverse...@googlegroups.com

FWIW: I haven't seen this before, but in development there were times when Dataverse would send input to the MDC log file that counter couldn't process. I'd suggest checking the mdc log file from that day for anything unusual, e.g. special characters such as tabs in user-supplied text. It's possible that Dataverse needs to escape or remove some character. If so, a workaround of editing the mdc log file to fix that character should let you keep processing. I think counter produces a log as well, but I haven't checked recently – that may help pinpoint which entry is problematic.
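That scan can be automated; a rough sketch (not part of counter-processor – just one way to look for the kind of input Jim describes, assuming the mdc log is tab-separated, which is its default format):

```python
import unicodedata

def find_suspect_lines(path):
    """Flag mdc log lines that counter-processor may choke on: a field
    count that differs from the #Fields header (e.g. a literal tab in a
    user-supplied value), or an embedded control character."""
    expected = None
    with open(path, encoding="utf-8", errors="replace") as f:
        for n, raw in enumerate(f, 1):
            line = raw.rstrip("\n")
            if line.startswith("#Fields:"):
                # derive the expected field count from the header names
                expected = len(line[len("#Fields:"):].split())
                continue
            if not line:
                continue
            fields = line.split("\t")
            if expected is not None and len(fields) != expected:
                yield n, f"{len(fields)} fields, expected {expected}"
            elif any(unicodedata.category(c) == "Cc" for c in line if c != "\t"):
                yield n, "embedded control character"
```

Running it over the mdc log for Sep 1 would list candidate lines to inspect or fix by hand.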

 

I would not suspect the database constraint – it's not really related. Also, QDR applied that change and has no problem running counter – I just checked, and QDR is current w.r.t. counter daily processing.

 

Hope that helps,

 

-- Jim

--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/4f15d8f0-7293-4f17-b9a2-98949a2b5b26n%40googlegroups.com.

jeyala...@yahoo.com

Oct 14, 2022, 1:51:40 AM
to Dataverse Users Community
Hi Jim,
     Thank you for your input. Since the mdc log file for Sep 1 is very big, it takes some time to check. A quick check shows no unusual patterns or characters. In the meantime, we plan to do the following:
  1. Rename the existing counter_2022-09-01.log file to backup_counter_2022-09-01.log and create an empty counter_2022-09-01.log in its place.
  2. Rerun the counter_daily.sh script.
   In this case we will lose one day of statistics. Is it OK to proceed as above? Later, if we fix the issue in backup_counter_2022-09-01.log, is there a way to make counter include that file again for processing?
 
Thanks and regards,
Jeya

James Myers

Oct 14, 2022, 9:46:59 AM
to dataverse...@googlegroups.com

Jeya,

I think what you propose should work. Counter-processor can be forced to rerun things by resetting its state (in the ./state directory), and the Dataverse API called in the counter_daily script overwrites the monthly value, so when you fix the one file and rerun, it should update the Dataverse db without issue.

 

FWIW – if you can find the counter log file, it might point you to a specific line in the bad log file.

 

-- Jim

Jeyalakshmi Sambasivam

Oct 20, 2022, 3:02:16 AM
to dataverse...@googlegroups.com

Hi Jim,
     Sorry for the late reply; I was on leave.
Regarding the log check: counter_daily.log contains the record of log-file processing, and we do not see any finer detail beyond the following:

Sat Oct  1 00:00:01 +08 2022

Running report for 2022-09-01T00:00:00 to 2022-10-01T00:00:00
30 daily log file(s) will be added to the database
Last processed date: not processed yet for 2022-09
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-09-01.log

Sun Oct  2 00:00:01 +08 2022

Running report for 2022-10-01T00:00:00 to 2022-11-01T00:00:00
1 daily log file(s) will be added to the database
Last processed date: not processed yet for 2022-10
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-01.log


The counter_daily.sh script processes log files starting from counter_2022-10-01.log only (the current month's). So I first applied the empty-file approach to that file. It processed up to 2022-10-10 and then stopped with errors while processing 2022-10-11. I used an empty file for 2022-10-11 as well, but got the same error message. I can see that a counter_db_2022-10.sqlite3 file is generated with the latest timestamp.

    
counter_daily.log :
=============
Running report for 2022-10-01T00:00:00 to 2022-11-01T00:00:00
19 daily log file(s) will be added to the database
Last processed date: not processed yet for 2022-10
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-01.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-02.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-03.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-04.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-05.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-06.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-07.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-08.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-09.log
processing /usr/local/payara5/glassfish/domains/domain1/logs/mdc/counter_2022-10-10.log

Error:
=====
Traceback (most recent call last):
  File "main.py", line 31, in <module>
    ll.populate()
  File "/usr/local/counter-processor-0.0.1/input_processor/log_line.py", line 38, in populate
    md_item = self.find_or_create_metadata()
  File "/usr/local/counter-processor-0.0.1/input_processor/log_line.py", line 83, in find_or_create_metadata
    publication_year=self.publication_year
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 5050, in create
    inst.save(force_insert=True)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 5173, in save
    pk_from_cursor = self.insert(**field_dict).execute()
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 1447, in inner
    return method(self, database, *args, **kwargs)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 1518, in execute
    return self._execute(database)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 2151, in _execute
    return super(Insert, self)._execute(database)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 1928, in _execute
    cursor = database.execute(self)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 2496, in execute
    sql, params = ctx.sql(query).query()
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 525, in sql
    return obj.__sql__(self)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 2127, in __sql__
    self._simple_insert(ctx)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 2024, in _simple_insert
    return self._generate_insert((self._insert,), ctx)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 2099, in _generate_insert
    return ctx.sql(CommaNodeList(all_values))
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 525, in sql
    return obj.__sql__(self)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 1369, in __sql__
    ctx.sql(self.nodes[n_nodes - 1])
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 525, in sql
    return obj.__sql__(self)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 1367, in __sql__
    ctx.sql(self.nodes[i])
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 525, in sql
    return obj.__sql__(self)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 1099, in __sql__
    .value(self.value, self.converter))
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 539, in value
    value = converter(value)
  File "/usr/local/counter-processor-0.0.1/python36_env/lib64/python3.6/site-packages/peewee.py", line 3593, in db_value
    return value if value is None else self.coerce(value)
ValueError: invalid literal for int() with base 10: 'null'
{"status":"OK","data":{"message":"Usage Metrics Data has been added to all datasets from file  /tmp/make-data-count-report.json"}}(python36_env) [counter@researchdata ltg2]$ exit
exit


1. Can you please let us know the probable reason for the above error?

2. How do I force counter to process the log files starting from last month, i.e. from counter_2022-09-01?
You mentioned resetting the state in the /state folder. I can find only counter*.sqlite3 files and a statefile.json file under /usr/local/counter-processor-0.0.1/state. The statefile.json contains one last-processed-day entry per month (please see attached), as below:

.......
.......
"2022-07": {
        "id": "b77f4d10-d66d-4a20-ab69-e9a10f0f7c23",
        "last_processed_day": 31
    },
    "2022-08": {
        "id": "bcf5e308-3caa-4f50-ba15-fa7aa3645388",
        "last_processed_day": 31
    }
}

3. Successfully running the daily script while ignoring a few defective days should be OK for the time being. All your input/help is much appreciated.

Thanks and regards,
Jeya


statefile.json

James Myers

Oct 20, 2022, 1:23:12 PM
to dataverse...@googlegroups.com

Jeya,

 

  1. The error seems to indicate that some entry in the log does not have a publication date. If you look at the daily log file, you should see the fields at the top:

#Fields: event_time     client_ip       session_cookie_id       user_cookie_id  user_id request_url     identifier      filename        size    user-agent      title   publisher       publisher_id    authors publication_date        version other_id        target_url      publication_year

publication_year is the last item, so scanning for any lines that do not have a year at the end should help you find the bad entry(ies). Removing those lines, or adding a year, should allow processing. The next question is how/why those entries are ending up in the log – hopefully once you see what they are, it will suggest some next steps: either there's a bug in the mdc logging or you have some datasets that don't have a publication date for some reason. (As far as I know there are no mdc log issues in Dataverse 5.10.1 or later.)
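That scan for a missing year can be scripted; a rough sketch, assuming tab-separated fields and a 4-digit publication_year as in the header above:

```python
def lines_missing_year(path):
    """Return the line numbers of mdc log entries whose last
    tab-separated field is not a 4-digit year -- candidates for the
    entries that the 'invalid literal for int()' traceback points at."""
    bad = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for n, raw in enumerate(f, 1):
            line = raw.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank and header lines
            last = line.split("\t")[-1]
            if not (len(last) == 4 and last.isdigit()):
                bad.append(n)
    return bad
```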

 

Counter-processor has some info on reprocessing at https://github.com/CDLUC3/counter-processor#maintaining-state-between-runs . My recollection is that removing the monthly file and the entry for the month in the json file will make counter-processor forget that month. To reprocess, you'd then have to update the year_month setting in the config file to have it process that month (such as 2022-09). By default, the counter_daily script overwrites that value with the current month, so one easy way to handle a past month is to add a line in the counter_daily.sh script and then run it. This example resets it to April 2022, so that month is rerun instead of the current month.

 

YEAR_MONTH=$(date -d "yesterday 13:00" '+%Y-%m')

YEAR_MONTH=$(date -d "2022-04-30" '+%Y-%m')

#echo $YEAR_MONTH

 

It is always good to keep a backup but, since counter-processor doesn't delete the Dataverse mdc logs, deleting its state for a month and rerunning for that month is unlikely to break anything.
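The reset described above – drop the month's entry from statefile.json and delete that month's sqlite database – can be sketched as follows, assuming the state-directory layout shown earlier in this thread (state/statefile.json, state/counter_db_YYYY-MM.sqlite3):

```python
import json
import os

def forget_month(state_dir, year_month):
    """Remove counter-processor's record of a month so it can be
    reprocessed: drop the month's entry from statefile.json and delete
    the month's sqlite database if present."""
    statefile = os.path.join(state_dir, "statefile.json")
    with open(statefile) as f:
        state = json.load(f)
    state.pop(year_month, None)          # forget e.g. "2022-09"
    with open(statefile, "w") as f:
        json.dump(state, f, indent=4)
    db = os.path.join(state_dir, f"counter_db_{year_month}.sqlite3")
    if os.path.exists(db):
        os.remove(db)
```

After this, setting year_month (or YEAR_MONTH in the daily script) to the forgotten month and rerunning should reprocess it from the mdc logs, which are left untouched.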

 

  1. The error log is reporting the problem for 2022-10-10.log, so that is the date with the problem, even if you were processing on 10-11 or later.
  2. FWIW: I also see from the log that you're still on v0.0.1 – as of Dataverse 5.11, you can update to v0.1.04. I don't know whether that version is any better at skipping or reporting the error you're seeing.

 

-- Jim

Philip Durbin

Nov 7, 2022, 9:43:29 AM
to dataverse...@googlegroups.com
Hi Jeya,

I'm just checking if you're all set. Thank you, Jim, for the detailed reply!

Thanks,

Phil




jeyala...@yahoo.com

Nov 7, 2022, 11:37:36 PM
to Dataverse Users Community
Hi Phil,
      Yes, I am working with Jim and have shared the log files to figure out the values causing the issue. Thank you for checking in.

Hi Jim,
      Thank you very much for guiding me in narrowing down the problematic records.
      As you correctly pointed out, the records with version=null are causing the issues (they are only a subset of the target_url="localhost:8080" entries that I found). All the log files that were not processed contain the same set of records with version=null.
They started coming in from September only. They are the ones harvested from another repository within NTU. We are liaising with them to correct it.

For rerunning counter-processor on the September log files, do we need to remove the counter_db_2022-09.sqlite3 file from the folder, even though there is no entry for "2022-09" in the json file?
statefile.json content:
::::
    "2022-06": {
        "id": "dcf9a536-bc59-4c04-84c3-4a614d26d1ee",
        "last_processed_day": 30
    },

    "2022-07": {
        "id": "b77f4d10-d66d-4a20-ab69-e9a10f0f7c23",
        "last_processed_day": 31
    },
    "2022-08": {
        "id": "bcf5e308-3caa-4f50-ba15-fa7aa3645388",
        "last_processed_day": 31
    },
    "2022-10": {
        "id": "c483b31f-94e8-4c5f-85da-117c91db5da4",
        "last_processed_day": 31
    },
    "2022-11": {
        "id": "4e213d3d-ddfd-451e-89c5-294a421ae79c",
        "last_processed_day": 6
    }
}

=====================================

So – two directions to explore:

The nulls appearing in the log mean there's either a bug in the code or a problem in your database. If you check the datasetversion table for entries with dataset_id equal to one of the affected datasets, e.g. 96182 or 96183, you can see whether they do have a version, etc. In comparison, it looks like the same /api/v1/datasets/<id> call is being made for datasets such as id=95474 without getting a null, so any difference between the entries for that and one that is failing could be a clue. If things look OK there, it would be useful to track down exactly what command is being run and whether there are any differences (in apikey, extra query params, etc.) between the calls that fail and those that succeed.

>>>> Jeya: the problematic dataset entries in the datasetversion table contain version=1, versionnumber as null, and versionstate='RELEASED'. For the other, correct entries, either versionnumber is present or, if not, versionstate is 'DRAFT'. What could be the reason versionnumber is null, and what is the difference between version and versionnumber?

A simpler workaround may be to just update to Python 3.9. My guess is that the peewee library for it handles the null better, and that's why I've been able to process a file. The version does not appear to make it into the final make-data-count-report file, so the fact that it is null should be OK if 3.9 manages to process it. It would be nice to figure out the root issue as well. I know there were some problems early on with draft entries getting into the mdc log, but those changes were made many releases ago.

>>>> Jeya: In parallel, we will also look into upgrading to a higher counter-processor version (counter-processor-0.1.04) or Python 3.9.

Hope that helps,

-- Jim

=================================

Thanks and regards,
Jeya

James Myers

Nov 14, 2022, 9:58:21 AM
to dataverse...@googlegroups.com

For rerunning the counter processor for September log files, whether we need to remove the counter_db_2022-09.sqlite3 file from the folder, even if there is no entry created for "2022-09" in json file?

 

>> If it is really empty, as it should be if September was never processed, it should be OK to leave it. That said, it will be recreated when Sept is processed so there’s no harm in removing it either.

jeyala...@yahoo.com

Aug 24, 2023, 5:10:53 AM
to Dataverse Users Community
Hi Jim,
   Hope you are well. Just an update on this issue.
   We upgraded to v5.11.1 and migrated to counter-processor-0.1.04 with python 3.8.14.
   The null values created by harvested dataset references in the log files are now ignored by counter-processor, the make-data-count-report.json file is being generated, and so the original issue is solved.
   But when we try to upload the .json report to DataCite, we get an HTTP 409 response code logged in datacite_response_body.txt (details at the bottom). The new make-data-count-report.json contains a warning, "Report is compressed using gzip".
    I thought the compressed data/warning message might be creating a conflict,
    so I tested replacing gzip.compress with the original data.encode() and removed the addition of the warning to the .json content.
    Still the 409 error is returned.
   Can you please advise what the probable reason could be? Do we need to delete the August statistics on the DataCite side? If so, how do I do it?

 Note: We did the counter-processor migration on 8th August, and upload to api.datacite.org was only enabled from 23rd August.
         Since make-data-count-report.json contains the metrics for the period from 1st August until yesterday, that shouldn't be a problem, right?

datacite_response_body.txt as follows:   
===========
after 0.1.04:
409
{'Date': 'Wed, 23 Aug 2023 16:15:42 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked',
'Connection': 'keep-alive', 'Status': '409 Conflict', 'Cache-Control': 'no-cache',
'Vary': 'Accept-Encoding,Origin', 'Content-Encoding': 'gzip', 'X-Runtime': '19.412374',
'X-Request-Id': '46d322ea-2f31-4011-8194-dfe4584fba40', 'X-Powered-By': 'Phusion Passenger(R) 6.0.12',
'Server': 'nginx/1.18.0 + Phusion Passenger(R) 6.0.12'}
application/json; charset=utf-8
{"errors":[{"status":"409","title":"The resource already exists."}]}
before 0.1.04:
200
{'Date': 'Mon, 07 Aug 2023 16:04:54 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked',
'Connection': 'keep-alive', 'Status': '200 OK', 'Cache-Control': 'max-age=0, private, must-revalidate',
'Vary': 'Accept-Encoding, Origin', 'Content-Encoding': 'gzip', 'ETag': 'W/"a1b63ebf456565ac246695c030b81931"',
'X-Runtime': '75.327318', 'X-Credential-Username': 'gdcc.ntu', 'X-Request-Id': '488fcc49-767d-4178-8718-3b8e60bad627',
'X-Powered-By': 'Phusion Passenger(R) 6.0.12', 'Server': 'nginx/1.18.0 + Phusion Passenger(R) 6.0.12'}
application/json; charset=utf-8
===========

Thanks and regards,
Jeya

James Myers

Aug 24, 2023, 1:03:26 PM
to dataverse...@googlegroups.com

Hmm – I'm not sure what could cause a 409. Counter just submits updated reports – with one more day of data – each time, and one shouldn't need to delete anything. If I run the counter_daily.sh script on a QDR dev machine more than once on the same day, it still sends the report to DataCite OK.

 

The API call being used is the POST/PUT /reports/<report_id> version of https://support.datacite.org/docs/usage-reports-api-guide#register-a-usage-report . The report_id should come from the ./state/statefile.json entry for the current month (after the initial POST call on the first of the month, which generates the id). The Python code sending the report is in the upload/upload.py file.
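For reference, the request shape looks roughly like this – a sketch built with urllib rather than counter-processor's own upload code; the base URL, placeholder token, and gzip headers are assumptions taken from the DataCite usage-reports guide:

```python
import gzip
import json
import urllib.request

def build_report_request(report, report_id=None, token="PLACEHOLDER_JWT",
                         base="https://api.datacite.org/reports"):
    """Build (without sending) the usage-report request: POST /reports on
    the month's first run, then PUT /reports/<report_id> -- the id that
    counter-processor stores in ./state/statefile.json -- on later runs."""
    body = gzip.compress(json.dumps(report).encode("utf-8"))
    url = f"{base}/{report_id}" if report_id else base
    return urllib.request.Request(
        url,
        data=body,
        method="PUT" if report_id else "POST",
        headers={
            "Content-Type": "application/gzip",   # compressed COUNTER report
            "Content-Encoding": "gzip",
            "Authorization": f"Bearer {token}",
        },
    )
```

Inspecting (or replaying with curl) the same URL and id from statefile.json is one way to check whether a report with that id already exists on the DataCite side.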

 

Other than trying the DataCite API via curl – to check that a report with your id exists, to see whether you can send an update without getting the 409, etc. – I'm not sure what else to suggest. It's possible that DataCite could provide some insight; they don't actually list 409 among the returned error codes in their docs.

 

--Jim

 

 


Jeyalakshmi Sambasivam

Aug 24, 2023, 10:39:34 PM
to dataverse...@googlegroups.com
Hi Jim,
        Thank you very much for your reply. I understand that the statefile id could be the problem,
because until 7th August the POST call was using one id, and after the migration to 0.1.04 it was using another. If the statefile id generated on the 1st of the month is used for all POSTs in that month, then when we post with a different id, DataCite presumably rejects it as a conflict.

/usr/local/counter-processor-0.1.04/state/statefile.json:
{
    "2023-08": {
        "id": "5495c4e5-e09e-4dc4-8800-7bbea7d47d16",
        "last_processed_day": 24
    }
}
 /usr/local/counter-processor-0.0.1/state/statefile.json:
   
    "2023-08": {
        "id": "1e2c56a1-ef96-4e33-9202-cb8d67f9ae99",
        "last_processed_day": 7
    }
 To reprocess a month's data and submit it, we need to request deletion of the old id "1e2c56a1-ef96-4e33-9202-cb8d67f9ae99" at the hub (ref: https://github.com/CDLUC3/counter-processor).

We are checking with datacite also, FYI.

If you have any references/guides for an API to delete an id, please let me know.

Thanks and regards,
Jeya

James Myers

Aug 25, 2023, 7:37:38 AM
to dataverse...@googlegroups.com

Jeya,

 

You might also just be able to edit the statefile to use the old id. The id is created on day 1 and then reused, so I think swapping the old one into your 0.1.04 statefile should allow the 0.1.04 code to update the report created by the old code.
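That swap can be done by hand in an editor, or scripted; a sketch using the two statefile paths shown in this thread:

```python
import json

def carry_over_report_id(old_statefile, new_statefile, year_month):
    """Copy the DataCite report id recorded by the old counter-processor
    install into the new install's statefile, so the next run updates
    the existing report (PUT) instead of trying to create a duplicate
    (the 409 'resource already exists')."""
    with open(old_statefile) as f:
        old_id = json.load(f)[year_month]["id"]
    with open(new_statefile) as f:
        state = json.load(f)
    state[year_month]["id"] = old_id     # keep last_processed_day as-is
    with open(new_statefile, "w") as f:
        json.dump(state, f, indent=4)
    return old_id
```

For example, calling it with the 0.0.1 and 0.1.04 statefile paths and "2023-08" would put the pre-migration id back in place.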

 

-- Jim

Jeyalakshmi Sambasivam

Aug 25, 2023, 10:28:12 AM
to dataverse...@googlegroups.com
Hi Jim,
         Thanks for the valuable advice. Actually, I had guessed the same (happy to be thinking in line with a wise person) and had already made the change and rerun the report. The report on the DataCite side was updated, but contained report-datasets as []. DataCite support confirmed that this is normal for a compressed report.
          Thank you very much for all your support. Have a great weekend!
Regards,
Jeya