ElasticSearch and large datasets / Data optimisation


bap...@gmail.com

Nov 13, 2020, 8:12:17 AM
to Arches Development

Dear All

The CAAL group is running an Arches Version 5.0 / 5.1 system for the display of cultural heritage data from Central Asia.

We are uploading a large number of records into our Arches system through CSVs.  We have 20k+ records already in the system, using both point data and polygons that display correctly.  We are increasing our record number by about 20k, and are suddenly running into problems with elasticsearch:

1)      Our palaeolithic data is being rejected because of the very early start date figure (-3000000).

2)      We are receiving a large number of geometry errors (unable to tessellate shape/duplicate consecutive coords etc.). 

To my knowledge, we haven't received these errors before, and we have an imminent deadline to honour.

Has anyone else come across these problems before? And if so, do you have a solution? We are investigating to see if we can configure the tolerance of elasticsearch. We are also pursuing geometry corrections through QGIS geometry simplifications and if we find anything we will keep you posted - but if anyone has come across these problems before, your help would be appreciated.
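
For the "duplicate consecutive coords" part of the geometry errors, one possible pre-processing step is to drop repeated points before building the CSV. The following is only a minimal sketch: the function name `drop_consecutive_duplicates` and the sample ring are hypothetical, it assumes each ring is available as a list of (x, y) pairs, and it does not address the "unable to tessellate shape" errors, which may need a proper geometry repair in QGIS.

```python
def drop_consecutive_duplicates(ring):
    """Return a copy of `ring` with consecutive repeated coordinates removed.

    `ring` is assumed to be a sequence of (x, y) pairs. A closing point that
    repeats the first point is kept, because it is not adjacent to it.
    """
    cleaned = []
    for coord in ring:
        point = tuple(coord)
        # Only append when the point differs from the one just before it.
        if not cleaned or cleaned[-1] != point:
            cleaned.append(point)
    return cleaned


# Hypothetical ring with one duplicated vertex at (1.0, 0.0).
ring = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
cleaned = drop_consecutive_duplicates(ring)
print(cleaned)
```

Exact equality is used here on purpose; points that differ only by a tiny rounding error would need a tolerance-based comparison instead.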

Does anyone know what the smallest (negative) number is that Arches will accept as a date before it falls over?

 

Best wishes,

 

Bryan Alvey

 

Lindsey Gant

Nov 17, 2020, 5:23:19 PM
to Arches Development
Hi Bryan,

It appears that this question has gone unnoticed; apologies for that! Please let me know if you are still experiencing this issue and I can find a community member to help troubleshoot further.

Best,

Lindsey

ape...@fargeo.com

Nov 17, 2020, 6:35:27 PM
to Arches Development
Hi Bryan,
I had assumed that the minimum date you could put in ES would be limited to what JavaScript allows, which would have been somewhere around the year -271821, but I just tested with my local instance of ES and was able to add a date in the range of -30,000,000.

This is the document that I saved.
{
   "_index": "test",
   "_type": "_doc",
   "_id": "fHGI2HUBT6EUyuNemZ-h",
   "_version": 1,
   "_score": 1,
   "_source": {
      "earlydate": "-30000000"
   }
}
What specific error are you seeing?
Maybe it has to do with the range of dates you're saving.  What is the maximum date you have in the system?

Cheers,
Alexei

dwut...@fargeo.com

Nov 17, 2020, 6:51:37 PM
to Arches Development
Bryan,


Cheers,

Dennis

Bryan Alvey

Nov 26, 2020, 6:32:12 AM
to arche...@googlegroups.com

Hi Alexei

Thank you for taking the time to help me with this!

I am uploading 13k+ records for a resource type via CSV.

My problem occurred when I was trying to describe a site with Palaeolithic origins, and therefore a start date of around 3,000,000 BCE (entered as -3000000). When uploading the data, I was getting this error:

'type': 'mapper_parsing_exception',
'reason': "failed to parse field [date_ranges.date_range] of type [integer_range] in document with id '1ec942a7-e76c-4ea6-8e21-9e39ed6fd4b9'. Preview of field's value: '-12999999899'",
'caused_by': {
    'type': 'json_parse_exception',
    'reason': 'Numeric value (-12999999899) out of range of int\n at [Source: org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper@1a41a2d0; line: 1, column: 237
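
The "out of range of int" message is consistent with the Elasticsearch `integer_range` field type, which holds signed 32-bit integers. As a rough illustration (the helper name `fits_int32` is hypothetical), the check below shows that the value previewed in the error does not fit in that range, while the raw year -3000000 would. This suggests the indexed value is a derived number rather than the raw year; how it is derived is not shown in this thread.

```python
# Bounds of a signed 32-bit integer, the range an integer_range field accepts.
INT32_MIN = -2**31       # -2147483648
INT32_MAX = 2**31 - 1    #  2147483647


def fits_int32(value):
    """Return True if `value` can be stored as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


rejected_value = -12999999899   # value previewed in the error message
raw_year = -3000000             # start year quoted in the original post

print(fits_int32(rejected_value))  # False
print(fits_int32(raw_year))        # True
```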

 

What I didn't outline in my plea for help (mea culpa) was that dates are stored in our Arches application as EDTF fields. Buried in the standards document on the Library of Congress website (https://www.loc.gov/standards/datetime/) is the section below:

Letter-prefixed calendar year

'Y' may be used at the beginning of the date string to signify that the date is a year, when (and only when) the year exceeds four digits, i.e. for years later than 9999 or earlier than -9999.

  • Example 1             'Y170000002' is the year 170000002
  • Example 2             'Y-170000002' is the year -170000002

 

So when loading date data with more than four digits for the year, you must add the letter Y as a prefix.

When I add the prefix Y to the start and end dates whose figures have more than four digits (i.e. less than -9999 or greater than 9999), the data is uploaded correctly, and the response times of Arches improve markedly!
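
For anyone preparing a CSV in the same situation, the prefixing rule could be applied in bulk before upload. This is only a sketch under stated assumptions: the function name `to_edtf_year` is hypothetical, it only touches values that are a bare year (an optional minus sign followed by digits), and it leaves everything else, including values that already carry the Y prefix, unchanged.

```python
import re

# Matches a bare year: an optional minus sign followed only by digits.
YEAR_ONLY = re.compile(r"^(-?)(\d+)$")


def to_edtf_year(value):
    """Prefix 'Y' to a year-only value whose year has more than four digits.

    Values that are not a bare year (for example '2020-11' or 'Y-170000002')
    are returned unchanged.
    """
    text = str(value).strip()
    match = YEAR_ONLY.match(text)
    if match is None:
        return text
    sign, digits = match.groups()
    # Ignore leading zeros when counting the digits of the year.
    significant = digits.lstrip("0") or "0"
    if len(significant) > 4:
        return f"Y{sign}{significant}"
    return text


print(to_edtf_year(-3000000))       # Y-3000000
print(to_edtf_year("170000002"))    # Y170000002
print(to_edtf_year("-9999"))        # -9999
```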

Result!

Thanks guys for all your help (and thanks Mahmoud!)

Best wishes,

 

Bryan

