h2o.importfolder error with s3


a chandran

Oct 6, 2021, 1:48:24 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

Can someone help with an h2o.importFolder error on h2o 3.29 and above? It works with h2o 3.28 and below.

test.hex <- h2o.importFolder(path = 's3://h2o_test/', pattern=".*\\.snappy\\.parquet$")

ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://xxxxxxxxx:45820/3/ParseSetup) water.exceptions.H2OIllegalArgumentException [1] "water.exceptions.H2OIllegalArgumentException: Column separator mismatch. One file seems to use \" \" and the other uses \"\001\"." [2] " 

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : ERROR MESSAGE: Column separator mismatch. One file seems to use " " and the other uses " ".

Paul Donnelly

Jan 11, 2022, 1:00:42 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

I'm encountering the same issue. Parquet files on S3 that import correctly on h2o version 3.28.1.1 do not parse correctly on 3.28.1.2. In the following code I import the same parquet file on both versions with the log level set to DEBUG. It shows that 3.28.1.2 attempts to parse the parquet as a CSV.

____________________________________________________________________

library(data.table)
library(h2o, lib.loc = file.path(.libPaths()[1], "h2o-3.28.1.1"))

h2o.init()

# 1. Downloaded copy of parquet file from: https://s3.amazonaws.com/h2o-public-test-data/e2e-testing/dataset/file-format/parquet_file.parquet
# 2. Moved to Scality/S3
# 3. Import works on h2o <= 3.28.1.1
x <- h2o.importFile(path = "s3://path/to/file/parquet_file.parquet")
#> x
# X                                  customer_id is_churn city age gender registration_method registration_date plan_list_price avg_amount_paid auto_renew times_under_paid num_transactions num_payment_methods common_payment_method     age2
# 1  4 m5uYkcpwZnNbCVoeT2OBWQGoCbWoQJfnpT6zxZMKjlY=        0   13  26 female                   3      1.383091e+12        143.2692        149.0000        YES                0               26                   1                    40 3.258097
# 2  7 8+E0KjVTEOWqQ+WMpoz/Zx1j3zNlaRGh0MQKbyP/NeQ=        0   22  32   male                   4      1.456358e+12        154.8182        154.8182        YES                0               11                   2                    40 3.465736
# 3  8 gcXPAp5mnOJYYAXx0DUGPH5WiFUynJgFpAqCkEHMF6w=        0    1  30 female                   9      1.439510e+12        141.1579        141.1579         NO                0               19                   2                    38 3.401197
# 4  9 B+MnfvGmATwy6wDMXNR43lCKPathx6RaFNNc0fxe9L0=        0    8  32   male                   3      1.447546e+12        246.5000        246.5000        YES                0                4                   3                    36 3.465736
# 5 10 awHJPaDlKLZL1rQtCLAbv/JHEBienVYVULrYmDgzdpo=        0    5  22   male                   7      1.456531e+12        149.0000        149.0000        YES                0               13                   1                    41 3.091042
# 6 11 FzfNguMdUBc79Iiwd9qWWg0GAH3leUVwQHa2RCwdyoA=        0   15  23   male                   7      1.396483e+12        144.7857        150.1071        YES                0               28                   2                    41 3.135494
#
# [14976 rows x 16 columns]

# From h2o.flow
# DEBUG view
# 01-11 08:31:39.314 10.194.118.110:24947  #9803  #33492-34 INFO: GET /3/ImportFiles, parms: {pattern=, path=s3://path/to/file/parquet_file.parquet}
# 01-11 08:31:39.315 10.194.118.110:24947  #9803  #33492-34 INFO: ImportS3 processing (s3://path/to/file/parquet_file.parquet)
# 01-11 08:31:39.586 10.194.118.110:24947  #9803  #33492-34 DEBUG: S3 endpoint specified: https://obs/path/to/endpoint.com
# 01-11 08:31:39.586 10.194.118.110:24947  #9803  #33492-34 DEBUG: S3 path style access enabled
# 01-11 08:31:39.894 10.194.118.110:24947  #9803  #33492-34 DEBUG: write-lock s3://path/to/file/parquet_file.parquet by job null
# 01-11 08:31:39.910 10.194.118.110:24947  #9803  #33492-34 DEBUG: update write-locked s3://path/to/file/parquet_file.parquet by job null
# 01-11 08:31:39.911 10.194.118.110:24947  #9803  #33492-34 DEBUG: unlock s3://path/to/file/parquet_file.parquet by job null
# 01-11 08:31:39.940 10.194.118.110:24947  #9803  #33492-34 INFO: POST /3/ParseSetup, parms: {skipped_columns=[], source_frames=["s3://path/to/file/parquet_file.parquet"], check_header=0}
# 01-11 08:31:40.588 10.194.118.110:24947  #9803  #e Thread DEBUG: GC CALLBACK: 1641911500588, USED:38.2 MB, CRIT: false
# 01-11 08:31:40.588 10.194.118.110:24947  #9803  #e Thread DEBUG: MemGood:   GC CALLBACK, (K/V: 126  B + POJO:38.2 MB + FREE:8.85 GB == MEM_MAX:8.89 GB), desiredKV=5.88 GB NO-OOM
# 01-11 08:31:40.951 10.194.118.110:24947  #9803  #33492-34 INFO: ParseSetup heuristic: cloudSize: 1, cores: 192, numCols: 16, maxLineLength: 16383, totalSize: 991319, localParseSize: 991319, chunkSize: 163830, numChunks: 6, numChunks * cols: 96
# 01-11 08:31:41.006 10.194.118.110:24947  #9803  #33492-34 INFO: POST /3/Parse, parms: {number_columns=16, source_frames=["s3://path/to/file/parquet_file.parquet"], column_types=["Numeric","String","Numeric","Numeric","Numeric","Enum","Numeric","Time","Numeric","Numeric","Enum","Numeric","Numeric","Numeric","Numeric","Numeric"], single_quotes=TRUE, parse_type=PARQUET, destination_frame=parquet_file_parquet.hex_sid_8a5a_1, column_names=["X","customer_id","is_churn","city","age","gender","registration_method","registration_date","plan_list_price","avg_amount_paid","auto_renew","times_under_paid","num_transactions","num_payment_methods","common_payment_method","age2"], delete_on_done=TRUE, check_header=1, separator=124, blocking=FALSE, skipped_columns=[], na_strings=[], chunk_size=163830, decrypt_tool=NULL}
# 01-11 08:31:41.018 10.194.118.110:24947  #9803  #33492-34 INFO: Total file size: 968.1 KB
# 01-11 08:31:41.033 10.194.118.110:24947  #9803  #33492-34 INFO: Parse chunk size 163830
# 01-11 08:31:41.038 10.194.118.110:24947  #9803  #33492-34 DEBUG: write-lock parquet_file_parquet.hex_sid_8a5a_1 by job $03010ac2766e7461ffffffff$_ab94c2db132cd017b853b2ab63d24435
# 01-11 08:31:41.038 10.194.118.110:24947  #9803  #33492-34 DEBUG: shared-read-lock s3://path/to/file/parquet_file.parquet by job $03010ac2766e7461ffffffff$_ab94c2db132cd017b853b2ab63d24435
# 01-11 08:31:41.288 10.194.118.110:24947  #9803  FJ-2-15   DEBUG: Key s3://path/to/file/parquet_file.parquet will be parsed using method DistributedParse.
# 01-11 08:31:42.005 10.194.118.110:24947  #9803  FJ-3-29   INFO: Processing 1 blocks of chunk #2
# 01-11 08:31:43.054 10.194.118.110:24947  #9803  FJ-3-29   DEBUG: lock-then-delete s3://path/to/file/parquet_file.parquet by job $03010ac2766e7461ffffffff$_ab94c2db132cd017b853b2ab63d24435
# 01-11 08:31:43.070 10.194.118.110:24947  #9803  FJ-1-15   DEBUG: update write-locked parquet_file_parquet.hex_sid_8a5a_1 by job $03010ac2766e7461ffffffff$_ab94c2db132cd017b853b2ab63d24435
# 01-11 08:31:43.101 10.194.118.110:24947  #9803  FJ-1-15   INFO: Parse result for parquet_file_parquet.hex_sid_8a5a_1 (14976 rows, 16 columns):
# 01-11 08:31:43.151 10.194.118.110:24947  #9803  FJ-1-15   INFO:                  ColV2    type          min          max         mean        sigma         NAs constant cardinality
# 01-11 08:31:43.152 10.194.118.110:24947  #9803  FJ-1-15   INFO:                      X: numeric      4.00000      39993.0      20133.0      11603.7
# 01-11 08:31:43.152 10.194.118.110:24947  #9803  FJ-1-15   INFO:            customer_id:  string
# 01-11 08:31:43.153 10.194.118.110:24947  #9803  FJ-1-15   INFO:               is_churn: numeric      0.00000      1.00000    0.0922810     0.289432
# 01-11 08:31:43.153 10.194.118.110:24947  #9803  FJ-1-15   INFO:                   city: numeric      1.00000      22.0000      10.9635      5.90846
# 01-11 08:31:43.153 10.194.118.110:24947  #9803  FJ-1-15   INFO:                    age: numeric     -43.0000      45.0000      27.2676      5.98002
# 01-11 08:31:43.153 10.194.118.110:24947  #9803  FJ-1-15   INFO:                 gender:  factor       female         male                                   360               2
# 01-11 08:31:43.153 10.194.118.110:24947  #9803  FJ-1-15   INFO:    registration_method: numeric      3.00000      13.0000      6.82365      2.54578
# 01-11 08:31:43.154 10.194.118.110:24947  #9803  FJ-1-15   INFO:      registration_date:    time 2004-03-25 1 2017-02-24 1
# 01-11 08:31:43.154 10.194.118.110:24947  #9803  FJ-1-15   INFO:        plan_list_price: numeric      0.00000      1788.00      176.994      190.111
# 01-11 08:31:43.154 10.194.118.110:24947  #9803  FJ-1-15   INFO:        avg_amount_paid: numeric      0.00000      1788.00      183.403      189.063
# 01-11 08:31:43.154 10.194.118.110:24947  #9803  FJ-1-15   INFO:             auto_renew:  factor           NO          YES                                                     2
# 01-11 08:31:43.154 10.194.118.110:24947  #9803  FJ-1-15   INFO:       times_under_paid: numeric      0.00000      10.0000     0.245059      1.14378
# 01-11 08:31:43.155 10.194.118.110:24947  #9803  FJ-1-15   INFO:       num_transactions: numeric      1.00000      61.0000      17.2105      8.30895
# 01-11 08:31:43.155 10.194.118.110:24947  #9803  FJ-1-15   INFO:    num_payment_methods: numeric      1.00000      6.00000      1.36311     0.650844
# 01-11 08:31:43.155 10.194.118.110:24947  #9803  FJ-1-15   INFO:  common_payment_method: numeric      3.00000      41.0000      37.3642      4.08064
# 01-11 08:31:43.155 10.194.118.110:24947  #9803  FJ-1-15   INFO:                   age2: numeric      1.60944      3.80666      3.28246     0.218947           3
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO: Chunk compression summary:
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:   Chunk Type                 Chunk Name       Count  Count Percentage        Size  Size Percentage
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          C0D            Constant double          32          66.667 %      2.5 KB          0.201 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          CBS                     Binary           1           2.083 %      1.9 KB          0.153 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          CXI            Sparse Integers           2           4.167 %      7.5 KB          0.607 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:           C1            1-Byte Integers           1           2.083 %     14.7 KB          1.183 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          C1N  1-Byte Integers (w/o NAs)           5          10.417 %     73.5 KB          5.916 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          C1S           1-Byte Fractions           1           2.083 %     14.7 KB          1.184 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          C2S           2-Byte Fractions           1           2.083 %     29.3 KB          2.362 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:           C8            8-byte Integers           1           2.083 %    117.1 KB          9.428 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:         CStr                    Strings           1           2.083 %    716.7 KB         57.722 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          CUD               Unique Reals           1           2.083 %     29.6 KB          2.386 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:          C8D               64-bit Reals           2           4.167 %    234.1 KB         18.857 %
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO: Frame distribution summary:
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:                             Size  Number of Rows  Number of Chunks per Column  Number of Chunks
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO: 10.194.118.110:24947      1.2 MB           14976                            3                48
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:                 mean      1.2 MB    14976.000000                     3.000000         48.000000
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:                  min      1.2 MB    14976.000000                     3.000000         48.000000
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:                  max      1.2 MB    14976.000000                     3.000000         48.000000
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:               stddev        0  B        0.000000                     0.000000          0.000000
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   INFO:                total      1.2 MB           14976                            3                48
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   DEBUG: update write-locked parquet_file_parquet.hex_sid_8a5a_1 by job $03010ac2766e7461ffffffff$_ab94c2db132cd017b853b2ab63d24435
# 01-11 08:31:43.168 10.194.118.110:24947  #9803  FJ-1-15   DEBUG: unlock parquet_file_parquet.hex_sid_8a5a_1 by job $03010ac2766e7461ffffffff$_ab94c2db132cd017b853b2ab63d24435
# 01-11 08:31:44.161 10.194.118.110:24947  #9803  #33492-34 INFO: GET /3/Frames/parquet_file_parquet.hex_sid_8a5a_1, parms: {row_count=10}
# 01-11 08:31:47.164 10.194.118.110:24947  #9803  #mCleaner DEBUG: MemGood:   preclean, (K/V: 126  B + POJO:38.2 MB + FREE:8.85 GB == MEM_MAX:8.89 GB), desiredKV=5.89 GB NO-OOM
# 01-11 08:31:47.164 10.194.118.110:24947  #9803  #mCleaner DEBUG: H(cached:1M, eldest:1641911501038L < +4ms <...{47ms}...< +6016ms < +6126) DESIRED=6031M dirtysince=7261 force=false clean2age=5000
# 01-11 08:31:47.165 10.194.118.110:24947  #9803  #mCleaner DEBUG: MemGood:   postclean, (K/V: 126  B + POJO:38.2 MB + FREE:8.85 GB == MEM_MAX:8.89 GB), desiredKV=5.89 GB NO-OOM
# 01-11 08:31:47.165 10.194.118.110:24947  #9803  #mCleaner DEBUG: Cleaner pass took:  0.001 sec, spilled Zero   in   0 usecH(cached:1M, eldest:1641911501042L < +0ms <...{47ms}...< +6016ms < +6123) diski_o=Zero  , freed=0M, DESIRED=6031M
# 01-11 08:31:52.166 10.194.118.110:24947  #9803  #mCleaner DEBUG: MemGood:   preclean, (K/V: 126  B + POJO:38.2 MB + FREE:8.85 GB == MEM_MAX:8.89 GB), desiredKV=5.90 GB NO-OOM
# 01-11 08:31:52.166 10.194.118.110:24947  #9803  #mCleaner DEBUG: H(cached:1M, eldest:1641911501042L < +0ms <...{86ms}...< +11008ms < +11124) DESIRED=6039M dirtysince=9492 force=false clean2age=5000
# 01-11 08:31:52.166 10.194.118.110:24947  #9803  #mCleaner DEBUG: MemGood:   postclean, (K/V: 126  B + POJO:38.2 MB + FREE:8.85 GB == MEM_MAX:8.89 GB), desiredKV=5.90 GB NO-OOM
# 01-11 08:31:52.166 10.194.118.110:24947  #9803  #mCleaner DEBUG: Cleaner pass took:  0.001 sec, spilled Zero   in   0 usecH(cached:1M, eldest:1641911501042L < +0ms <...{86ms}...< +11008ms < +11124) diski_o=Zero  , freed=0M, DESIRED=6039M


library(data.table)
library(h2o, lib.loc = file.path(.libPaths()[1], "h2o-3.28.1.2")) # incrementing forward one version from 3.28.1.1

h2o.init()

x <- h2o.importFile(path = "s3://path/to/file/parquet_file.parquet")
# > x
# C1     C2
# 1 PAR1Q\004Q<0x80D0>\016Q<0xC8B0>\aLQ<0x80EA> Q\004!
#   2                                      \a\004   <NA>
#   3                                          \b   <NA>
#   4                                          \b   <NA>
#   5                                          \b   <NA>
#   6                                          \b   <NA>
#
#   [14230 rows x 2 columns]

# From h2o.flow
# DEBUG view **** Note that the POST /3/Parse parms specify parse_type=CSV
# 01-11 09:15:14.449 10.194.118.110:28483  #0344  #33492-29 INFO: GET /3/ImportFiles, parms: {pattern=, path=s3://path/to/file/parquet_file.parquet}
# 01-11 09:15:14.451 10.194.118.110:28483  #0344  #33492-29 INFO: ImportS3 processing (s3://path/to/file/parquet_file.parquet)
# 01-11 09:15:14.743 10.194.118.110:28483  #0344  #33492-29 DEBUG: S3 endpoint specified: https://obs/path/to/endpoint.com
# 01-11 09:15:14.743 10.194.118.110:28483  #0344  #33492-29 DEBUG: S3 path style access enabled
# 01-11 09:15:15.045 10.194.118.110:28483  #0344  #33492-29 DEBUG: write-lock s3://path/to/file/parquet_file.parquet by job null
# 01-11 09:15:15.062 10.194.118.110:28483  #0344  #33492-29 DEBUG: update write-locked s3://path/to/file/parquet_file.parquet by job null
# 01-11 09:15:15.063 10.194.118.110:28483  #0344  #33492-29 DEBUG: unlock s3://path/to/file/parquet_file.parquet by job null
# 01-11 09:15:15.099 10.194.118.110:28483  #0344  #33492-29 INFO: POST /3/ParseSetup, parms: {skipped_columns=[], source_frames=["s3://path/to/file/parquet_file.parquet"], check_header=0}
# 01-11 09:15:15.853 10.194.118.110:28483  #0344  #33492-29 INFO: ParseSetup heuristic: cloudSize: 1, cores: 192, numCols: 2, maxLineLength: 16383, totalSize: 991319, localParseSize: 991319, chunkSize: 163830, numChunks: 6, numChunks * cols: 12
# 01-11 09:15:15.910 10.194.118.110:28483  #0344  #33492-29 INFO: POST /3/Parse, parms: {number_columns=2, source_frames=["s3://path/to/file/parquet_file.parquet"], column_types=["Enum","Enum"], single_quotes=FALSE, parse_type=CSV, destination_frame=parquet_file_parquet.hex_sid_8014_1, column_names=[""], delete_on_done=TRUE, check_header=-1, separator=1, blocking=FALSE, skipped_columns=[], na_strings=[], chunk_size=163830, decrypt_tool=NULL}
# 01-11 09:15:16.048 10.194.118.110:28483  #0344  #e Thread DEBUG: GC CALLBACK: 1641914116048, USED:30.7 MB, CRIT: false
# 01-11 09:15:16.049 10.194.118.110:28483  #0344  #e Thread DEBUG: MemGood:   GC CALLBACK, (K/V: 126  B + POJO:30.7 MB + FREE:8.86 GB == MEM_MAX:8.89 GB), desiredKV=5.90 GB NO-OOM
# 01-11 09:15:16.052 10.194.118.110:28483  #0344  #33492-29 INFO: Total file size: 968.1 KB
# 01-11 09:15:16.073 10.194.118.110:28483  #0344  #33492-29 INFO: Parse chunk size 163830
# 01-11 09:15:16.079 10.194.118.110:28483  #0344  #33492-29 DEBUG: write-lock parquet_file_parquet.hex_sid_8014_1 by job $03010ac2766e446fffffffff$_a2686455dbd382667f61804300a94f1b
# 01-11 09:15:16.079 10.194.118.110:28483  #0344  #33492-29 DEBUG: shared-read-lock s3://path/to/file/parquet_file.parquet by job $03010ac2766e446fffffffff$_a2686455dbd382667f61804300a94f1b
# 01-11 09:15:16.332 10.194.118.110:28483  #0344  FJ-2-15   INFO: Key s3://path/to/file/parquet_file.parquet will be parsed using method DistributedParse.
# 01-11 09:15:16.773 10.194.118.110:28483  #0344  FJ-3-99   DEBUG: lock-then-delete s3://path/to/file/parquet_file.parquet by job $03010ac2766e446fffffffff$_a2686455dbd382667f61804300a94f1b
# 01-11 09:15:17.157 10.194.118.110:28483  #0344  FJ-1-15   INFO: Found categoricals with non-UTF-8 characters or NULL character in the 1st column. Converting unrecognized characters into hex:  <0xAD>84
# h, @ p <0xB890>, UA!<0xC0>9<0xA4>, 'w&a<0xD3>( @, <0xA8>m<0xAA>!<0xB2>* !EhUv # 5<0x96>, <0xB0>, <0xB2>
# , <0xA8>NJpIgpOqJ62SFYzprBFfnCLtrp4+xprioky0cboz5sI `<0xA8>q8afzBt+7oqmpK7/fpznxYKBYpFwJWbvgQV0WpQ10m0 0<0xA0>GVHGNSIPc+U+g6VuYgvy37XnG9CRKR2Wtukwwj4RN¡<0xB0>ñ<0xA8>SUf7FhWzPUuHwP+EDUvnICiMPrgv4IyDPga85gBTEWc `<0xA8>j23m1mbbbGlfHZZQNAd+iRFRcA4JUiI6p86IxhUXH8E 0<0xA0>PFxBO8i1WN3rI3JgZRHbm5cg5u0mTb/qpp0/ct6T2¡<0xC0>-<0xA8>AsVn82nJW85A+baRkMgJG+ASgavX8dJTQ3UiW+gIoUU        `<0xA0>fKl/cWvQlU/IAAnH3k302fr96HJ8SsPhkTtKOqF9Di0<0xA8>9qqfaUUbZFGAFJFjwcIFkB+8SUlQEPaWpLDJaBcjpdI `<0xA0>0nsJ2bIhg5nw5qdiJsnufSy0locv+EtOsragknp0O¡<0x80>^<0xA8>MI5abbyRlW7BRRHOzXFklSRBHQ1FS+j+doLz8/GPTC8 `<0xA0>KYVUFFy0q5xp9goCWgmg26yxRxrOwupdaUXUyF/4S¡0x<0xA0>K5pQIecSrsN3DMJA17ZkaLU2bOuL5dM7hi1Y46o0H¡pG<0xA8>377OnBOaML06gSTrejJ9IyoT4zG585qrdpRNH7U+Kgg <0x90A0>24+ikkIhsBgupUsEPOdTbyK01p3OjjHGGBEjr0+BH<0xADA0A8>Ga/yKzJ80Bck5nbLQd01RVeB2Y7TrTWjpLMvnikl1xU `<0xA8>CrVHQiJfE+/WbiKTkZV2V9b1qNQFYHtmPkokimffnvc 0<0xA8>4EeVCYYQriV17UdqHKtkPmrh7sTEXiKE3MFplh3uhMM 0<0xA8>bGlrbF9TMYjZZPcvJdmv5tMjY30J3wDjVT7sW60IoLk 0dYJK6rnm8KYJky+Str1+yGq78iB <0xC7>±0GNQIHf35FV7aI 0<0xA8>kLYm/Uev1wxs5B+w4ZAU0TrOJ5a0o83O3nJf+W1NFV4 0<0xA0>s9dFzA5pGgisO1cOZ9c6My721O9O4DPCY8vAgac5W¡<0x8082A0>avyzIcU8nJYuAphjY3IrhDvYjddYdJATZrPzUjclDá<0xD0> <0xA0>fF5UhqC6LKOBHSDqcPjhI5N8eyJmtAM87XvOs6AAya a<0xA8>saqSgR8Yzk6Ys86hJcXwzEj8gWhrGdrOqkJT1e7VtGM <0xC0A8>yXgF5CnWY8P8+47G2zQZjh96ONGmcX0AoezlFWLRbyc 0<0xA0>ObqwxDHjuAifDuZ25hee0MViBceuH4Tuwu/4SGsSj¡`á<0xA8>9/OxmiKIceqgdxX1VOFqIuu256St/oZdUx41qIieNI4 `<0xA0>StlGl0CKDgIA6fZMpGgkhj7FF3yp+hzCPTkV5S6Bg¡<0xB0>a<0xA8>Qiga0i6KHCF1cwsR5X7j5iN7ACowzU+/lofLQfSFekY `<0xA0>n2706L9g53tPP0oClx+c93qWfIXbT0XlWWGgb7Txk¡P, M<0x88><0xC4>¡8Q, - à W         <0xA3>-<0xA0> dÑ         , ...
# 01-11 09:15:17.214 10.194.118.110:28483  #0344  FJ-1-15   INFO: Found categoricals with non-UTF-8 characters or NULL character in the 2nd column. Converting unrecognized characters into hex:  (La,<0xC3C3>@<0xD0>p(, <0xE4A1>Kq<0xD8> 0 <0xF0>¡A2, * E )P <0xCE> &<0x96>+<0xC1BA>*l+ <0xC9> uH!^aK        8!<0xB2A1>{, <0xC4>7U<0x84C1>X} ar
# <0xB6>) 5<0xEAA5>@Qb<0xA9A2>Q<0xAE> à "<0xA2>1        <0x8CCD> F E MvE<0xD8>5 %<0xDC>U<0xD8>!<0x96> Q8<0xE5>à Qb<0xC9>J1<0xCEA1>Y9<0x88>%&u<0x9CC5> 54!<0x8A>‘U>e<0xF0>5<0xB2> <0xC4>.8, <0x9C> @ <0x80>E<0x8F> , <0xA5>        <0xA8>, 6D3<0x8982> A$X !, _*<0x9C>&<0xA1>u‘<0xFCE1>‘<0xB9>x b.T&<0xCA>á!<0xEA>= <0xA1>Y<0xB9A2>a mr F}:E<0xAE>Q<0xD2>%<0xEA>Q*!& <0x95>|<0xA9>a1<0x88>a<0xE6>        <0xED>b <0xC4>&
# 01-11 09:15:17.214 10.194.118.110:28483  #0344  FJ-1-15   INFO: !<0x96BDF6>E<0xBC>Qb!> <0xF1E0>%<0xAF>Q<0xEE> F<0xF9>à C*<0xA6> !<0xC0> u,<0x89>¡M<0xE6>        pQ <0xA1>a9 AL]Z<0x85DE>Q<0xE0>e<0x80>Uh!P "<0xC6>
# 01-11 09:15:17.214 10.194.118.110:28483  #0344  FJ-1-15   INFO: EZQ<0xE0>!<0x86>¡q *<A Tu<0x80898A>áx CQ&>        an -<0xF8E9>p<0x95B4E1F9>Q<0xC4>!l&H AAa<0xBB>I0 b&T*!&, <0x81>H<D        <0xD0>C        @, <0x9B80>, ...
# 01-11 09:15:17.238 10.194.118.110:28483  #0344  FJ-1-15   DEBUG: update write-locked parquet_file_parquet.hex_sid_8014_1 by job $03010ac2766e446fffffffff$_a2686455dbd382667f61804300a94f1b
# 01-11 09:15:17.280 10.194.118.110:28483  #0344  FJ-1-15   INFO: Parse result for parquet_file_parquet.hex_sid_8014_1 (14230 rows, 2 columns):
# 01-11 09:15:17.295 10.194.118.110:28483  #0344  FJ-1-15   INFO:  ColV2    type          min          max         mean        sigma         NAs constant cardinality
# 01-11 09:15:17.296 10.194.118.110:28483  #0344  FJ-1-15   INFO:  C1:  factor             ñƒ‹³ @X& P AHa                                   346            7120
# 01-11 09:15:17.296 10.194.118.110:28483  #0344  FJ-1-15   INFO:  C2:  factor Q Q<0x8         Ì¡ ‘4                                 12535            1613
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO: Chunk compression summary:
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:   Chunk Type       Chunk Name       Count  Count Percentage        Size  Size Percentage
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:          CXI  Sparse Integers           4          28.571 %      532  B          1.245 %
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:           C2  2-Byte Integers          10          71.429 %     41.2 KB         98.755 %
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO: Frame distribution summary:
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:                             Size  Number of Rows  Number of Chunks per Column  Number of Chunks
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO: 10.194.118.110:28483     41.7 KB           14230                            7                14
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:                 mean     41.7 KB    14230.000000                     7.000000         14.000000
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:                  min     41.7 KB    14230.000000                     7.000000         14.000000
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:                  max     41.7 KB    14230.000000                     7.000000         14.000000
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:               stddev        0  B        0.000000                     0.000000          0.000000
# 01-11 09:15:17.315 10.194.118.110:28483  #0344  FJ-1-15   INFO:                total     41.7 KB           14230                            7                14
# 01-11 09:15:17.316 10.194.118.110:28483  #0344  FJ-1-15   DEBUG: update write-locked parquet_file_parquet.hex_sid_8014_1 by job $03010ac2766e446fffffffff$_a2686455dbd382667f61804300a94f1b
# 01-11 09:15:17.316 10.194.118.110:28483  #0344  FJ-1-15   DEBUG: unlock parquet_file_parquet.hex_sid_8014_1 by job $03010ac2766e446fffffffff$_a2686455dbd382667f61804300a94f1b
# 01-11 09:15:18.223 10.194.118.110:28483  #0344  #33492-29 INFO: GET /3/Frames/parquet_file_parquet.hex_sid_8014_1, parms: {row_count=10}
# 01-11 09:15:23.870 10.194.118.110:28483  #0344  #mCleaner DEBUG: MemGood:   preclean, (K/V: 126  B + POJO:30.7 MB + FREE:8.86 GB == MEM_MAX:8.89 GB), desiredKV=5.90 GB NO-OOM
# 01-11 09:15:23.871 10.194.118.110:28483  #0344  #mCleaner DEBUG: H(cached:2M, eldest:1641914116085L < +0ms <...{60ms}...< +7680ms < +7786) DESIRED=6044M dirtysince=8814 force=false clean2age=5000
# 01-11 09:15:23.872 10.194.118.110:28483  #0344  #mCleaner DEBUG: MemGood:   postclean, (K/V: 126  B + POJO:30.7 MB + FREE:8.86 GB == MEM_MAX:8.89 GB), desiredKV=5.90 GB NO-OOM
# 01-11 09:15:23.872 10.194.118.110:28483  #0344  #mCleaner DEBUG: Cleaner pass took:  0.001 sec, spilled Zero   in   0 usecH(cached:2M, eldest:1641914116085L < +0ms <...{60ms}...< +7680ms < +7787) diski_o=Zero  , freed=0M, DESIRED=6044M

Paul Donnelly

Jan 11, 2022, 1:11:05 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Also worth noting that version 3.28.1.2 closed an issue related to parquet parsing: https://h2oai.atlassian.net/browse/PUBDEV-7293

Paul Donnelly

Jan 12, 2022, 2:58:04 PM
to H2O Open Source Scalable Machine Learning - h2ostream
FYI I have cross-posted this thread with some additional information to the JIRA page: https://h2oai.atlassian.net/browse/PUBDEV-8513

Michal Kurka

Jan 13, 2022, 9:14:14 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi Paul,

Thank you for the detailed issue description and your investigation of what changed between the versions.

Would you be able to re-run your experiment with the TRACE log level? The trace level should reveal why the guesser failed via this message: Log.trace("Guesser failed for parser type", pp.info(), ignore);
Does the same file parse correctly if it is first downloaded locally and then imported from local storage?
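
For reference, a minimal sketch of that local check in R, using the public test file linked earlier in the thread (the temporary path is only an illustration):

library(h2o)
h2o.init()

# Download the parquet file to local disk first, then import it from local
# storage instead of S3, taking the S3 code path out of the picture.
local_path <- file.path(tempdir(), "parquet_file.parquet")
download.file(
  url = "https://s3.amazonaws.com/h2o-public-test-data/e2e-testing/dataset/file-format/parquet_file.parquet",
  destfile = local_path, mode = "wb"
)
x_local <- h2o.importFile(path = local_path)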

We will try to reproduce it on our end as well.

Thank you for your report,
Michal Kurka

Michal Kurka

Jan 13, 2022, 9:36:26 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Paul,

I just noticed that your report in JIRA does contain TRACE already. I was able to reproduce the issue and I am working on a fix.

Please follow the JIRA; I am also looking for a possible workaround.

Thank you for your help!

MK

Paul Donnelly

Jan 28, 2022, 3:17:51 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Thank you, Michal! Your fix resolved my issues with importing parquet files from S3.

I am experiencing similar error messages when trying to export files and I made a JIRA ticket with the TRACE. Perhaps you could take a look: https://h2oai.atlassian.net/jira/software/c/projects/PUBDEV/issues/PUBDEV-8559

Thank you!

Michal Kurka

Jan 28, 2022, 4:02:33 PM
to Paul Donnelly, H2O Open Source Scalable Machine Learning - h2ostream
Hi Paul,

I am glad the fix worked.

I am experiencing similar error messages when trying to export files and I made a JIRA ticket with the TRACE. Perhaps you could take a look: https://h2oai.atlassian.net/jira/software/c/projects/PUBDEV/issues/PUBDEV-8559

This might be a known issue. I will give you the background - it might clarify things. We implemented high-performance data ingest from S3 in our custom PersistS3 subsystem; however, this subsystem doesn't allow for large-data export (export of H2O Frames). When you use h2o.exportFile, it will fall back to Hadoop FS export - this is why you see PersistHdfs in both stack traces in your JIRA.

The solution right now is to configure the Hadoop subsystem independently by adding the argument -hdfs_config core-site.xml, where core-site.xml should look like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.s3a.access.key</name>
        <value>...</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>...</value>
    </property>
</configuration>

In this case you would use h2o.exportFile("s3a://...").
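
For example, a minimal sketch (the bucket and key here are placeholders, not a real destination):

# With -hdfs_config core-site.xml supplied at launch, an s3a:// export path
# picks up the fs.s3a.* credentials configured above.
h2o.exportFile(data = x, path = "s3a://my-bucket/exports/frame.csv")  # x is an H2O frame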

The export feature certainly needs our attention; I will work on an improvement to make the user experience better.

Please let me know if the suggestion worked for now.

Thank you,
MK


Paul Donnelly

Jan 28, 2022, 5:39:56 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Thanks for helping to look into this!

I tried setting up that HDFS configuration file and received a different error, one that really felt more like a timeout (it just sat there for a little while before erroring out):

01-28 16:13:59.624 10.194.118.118:11048  #5470        main DEBUG water.default: resource /home/pdonn/core-site.xml added to the hadoop configuration

01-28 16:14:37.347 10.194.118.118:11048  #5470  0551034-39  INFO water.default: POST /3/Frames/parquet_file_parquet.hex_sid_b809_2/export, parms: {num_parts=1, quote_header=TRUE, force=FALSE, separator=44, path=s3a://path/to/file/parquet_file_export.csv, header=TRUE}
01-28 16:14:37.349 10.194.118.118:11048  #5470  0551034-39  INFO water.default: ExportFiles processing (s3a://path/to/file/parquet_file_export.csv)
01-28 16:16:58.306 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:
water.api.HDFSIOException: HDFS IO Failure:
  accessed URI : s3a://path/to/file/parquet_file_export.csv
configuration: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, /home/pdonn/core-site.xml
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on bucket-name: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection reset: Unable to execute HTTP request: Connection reset
at water.persist.PersistHdfs.exists(PersistHdfs.java:456) ~[h2o.jar:?]
at water.persist.PersistManager.exists(PersistManager.java:518) ~[h2o.jar:?]
at water.fvec.Frame.export(Frame.java:1543) ~[h2o.jar:?]
at water.api.FramesHandler.export(FramesHandler.java:251) ~[h2o.jar:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_232]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_232]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_232]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_232]
at water.api.Handler.handle(Handler.java:60) ~[h2o.jar:?]
at water.api.RequestServer.serve(RequestServer.java:470) [h2o.jar:?]
at water.api.RequestServer.doGeneric(RequestServer.java:301) [h2o.jar:?]
at water.api.RequestServer.doPost(RequestServer.java:227) [h2o.jar:?]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) [h2o.jar:?]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) [h2o.jar:?]
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865) [h2o.jar:?]
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535) [h2o.jar:?]
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) [h2o.jar:?]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317) [h2o.jar:?]
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) [h2o.jar:?]
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) [h2o.jar:?]
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) [h2o.jar:?]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219) [h2o.jar:?]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) [h2o.jar:?]
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126) [h2o.jar:?]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) [h2o.jar:?]
at water.webserver.jetty9.Jetty9ServerAdapter$LoginHandler.handle(Jetty9ServerAdapter.java:130) [h2o.jar:?]
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126) [h2o.jar:?]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) [h2o.jar:?]
at org.eclipse.jetty.server.Server.handle(Server.java:531) [h2o.jar:?]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352) [h2o.jar:?]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260) [h2o.jar:?]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281) [h2o.jar:?]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102) [h2o.jar:?]
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118) [h2o.jar:?]
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333) [h2o.jar:?]
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310) [h2o.jar:?]
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) [h2o.jar:?]
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126) [h2o.jar:?]
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366) [h2o.jar:?]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762) [h2o.jar:?]
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680) [h2o.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
01-28 16:16:58.327 10.194.118.118:11048  #5470  0551034-39 ERROR water.default: Caught exception:
01-28 16:16:58.327 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:
01-28 16:16:58.327 10.194.118.118:11048  #5470  0551034-39 ERROR water.default: ERROR MESSAGE:
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default: HDFS IO Failure:
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:  accessed URI : s3a://path/to/file/parquet_file_export.csv
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:  configuration: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, /home/pdonn/core-site.xml
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:  org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on bucket-name: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection reset: Unable to execute HTTP request: Connection reset
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default: ; Stacktrace: [water.api.HDFSIOException: HDFS IO Failure:
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:  accessed URI : s3a://path/to/file/parquet_file_export.csv
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:  configuration: Configuration: core-default.xml, core-site.xml, hdfs-default.xml, hdfs-site.xml, /home/pdonn/core-site.xml
01-28 16:16:58.328 10.194.118.118:11048  #5470  0551034-39 ERROR water.default:  org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on bucket-name: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connection reset: Unable to execute HTTP request: Connection reset,     water.persist.PersistHdfs.exists(PersistHdfs.java:456),     water.persist.PersistManager.exists(PersistManager.java:518),     water.fvec.Frame.export(Frame.java:1543),     water.api.FramesHandler.export(FramesHandler.java:251),     sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method),     sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62),     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43),     java.lang.reflect.Method.invoke(Method.java:498),     water.api.Handler.handle(Handler.java:60),     water.api.RequestServer.serve(RequestServer.java:470),     water.api.RequestServer.doGeneric(RequestServer.java:301),     water.api.RequestServer.doPost(RequestServer.java:227),     javax.servlet.http.HttpServlet.service(HttpServlet.java:707),     javax.servlet.http.HttpServlet.service(HttpServlet.java:790),     org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865),     org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:535),     org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255),     org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317),     org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203),     org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473),     org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201),     org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219),     org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144),     org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126),     org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132),     water.webserver.jetty9.Jetty9ServerAdapter$LoginHandler.handle(Jetty9ServerAdapter.java:130),     org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126),     org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132),     org.eclipse.jetty.server.Server.handle(Server.java:531),     org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352),     org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260),     org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281),     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102),     org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118),     org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333),     org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310),     org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168),     org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126),     org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366),     org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762),     org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680),     java.lang.Thread.run(Thread.java:748)];parms={num_parts=1, 
frame_id=parquet_file_parquet.hex_sid_b809_2, quote_header=TRUE, force=FALSE, separator=44, path=s3a://path/to/file/parquet_file_export.csv, header=TRUE}

Michal Kurka

Feb 3, 2022, 12:19:31 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi Paul,

I was not able to reproduce the same issue. In my case, the export worked just fine.

It seems to me there really is a connection issue. Where are you running the H2O instance? Is it on your local laptop? On EC2? 

This issue https://stackoverflow.com/questions/52434691/sdkclientexception-unable-to-execute-http-request-connection-reset suggests a proxy could be at fault.

MK

Paul Donnelly

Feb 17, 2022, 1:01:40 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi Michal,

Here is some info about my h2o setup:

The h2o instance is an on-premises cluster running Red Hat Enterprise Linux Server 7.9 (Maipo). h2o.jar is launched across the nodes, which are joined into the h2o cluster using a topology flatfile before the cluster formation is locked. I'm not using Minio or AWS S3 products, but rather a local representation of S3, and I have been launching with the documented Minio options -Dsys.ai.h2o.persist.s3.endPoint=/path/to/my/endpoint and -Dsys.ai.h2o.persist.s3.enable.path.style=true to redirect to the local S3 endpoint, while also including the -hdfs_config core-site.xml option.
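
For context, here is a rough sketch of that launch-and-connect pattern (the hostname, port, and paths are placeholders rather than my real values):

# On each node, h2o.jar is started with the Minio-style S3 options plus the
# Hadoop config, roughly:
#   java -Dsys.ai.h2o.persist.s3.endPoint=https://my/endpoint/url \
#        -Dsys.ai.h2o.persist.s3.enable.path.style=true \
#        -jar h2o.jar -flatfile h2o.topology.flatfile -hdfs_config core-site.xml
# R then connects to the already-running cluster instead of starting its own:
library(h2o)
h2o.init(ip = "10.0.0.1", port = 54321, startH2O = FALSE)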

 To my knowledge I’m not running into a proxy/firewall.

Appreciate any additional thoughts/ideas you may have.

Thanks,
Paul

Paul Donnelly

Oct 26, 2022, 11:03:46 AM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi Michal,

I wanted to follow up on this issue because I saw you have done a lot of work related to S3 + h2o in recent months. I did a series of import/export tests and things are much improved! The only issue I'm encountering now is the specific case of exporting a parquet directly to S3. Thanks in advance for your help with this!

Summary of tests:

s3_tbl <- h2o.importFolder("s3://path/to/file") # does not work
s3a_tbl <- h2o.importFolder("s3a://path/to/file") # works

h2o.exportFile(data = s3a_tbl, "s3a://path/to/file/test.csv") # works
h2o.exportFile(data = s3a_tbl, "s3://path/to/file/test.csv") # works
h2o.exportFile(data = s3a_tbl, "s3://path/to/file/test", format = "parquet") # does not work
h2o.exportFile(data = s3a_tbl, "s3a://path/to/file/test", format = "parquet") # does not work
h2o.exportFile(data = s3a_tbl, "/path/to/file/test", format = "parquet") # works (not an S3 destination)

A miscellaneous note: when reviewing the log of the successful s3a import, I noticed many warnings that said the following:

10-26 07:32:10.142 10.193.240.243:41954  43080    FJ-1-501  WARN com.amazonaws.services.s3.internal.S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

Here is the start-up information in case that is helpful.

10-26 07:31:11.714 10.193.240.243:41954  43080        main  INFO water.default: ----- H2O started  -----
10-26 07:31:11.715 10.193.240.243:41954  43080  ice Thread DEBUG water.default: GC CALLBACK: 1666787464919, USED:591.5 MB, CRIT: false
10-26 07:31:11.715 10.193.240.243:41954  43080        main  INFO water.default: Build git branch: rel-zygmund
10-26 07:31:11.715 10.193.240.243:41954  43080        main  INFO water.default: Build git hash: 7d606463d8c778614e09c47c953ab65e9967b5af
10-26 07:31:11.715 10.193.240.243:41954  43080        main  INFO water.default: Build git describe: jenkins-master-5959-2-g7d60646
10-26 07:31:11.715 10.193.240.243:41954  43080  ice Thread DEBUG water.default: MemGood:   GC CALLBACK, (K/V:Zero   + POJO:591.5 MB + FREE:43.87 GB == MEM_MAX:44.44 GB), desiredKV=28.76 GB NO-OOM
10-26 07:31:11.715 10.193.240.243:41954  43080        main  INFO water.default: Build project version: 3.38.0.1
10-26 07:31:11.716 10.193.240.243:41954  43080        main  INFO water.default: Build age: 1 month and 6 days
10-26 07:31:11.716 10.193.240.243:41954  43080        main  INFO water.default: Built by: 'jenkins'
10-26 07:31:11.716 10.193.240.243:41954  43080        main  INFO water.default: Built on: '2022-09-19 14:10:44'
10-26 07:31:11.716 10.193.240.243:41954  43080        main  INFO water.default: Found H2O Core extensions: [XGBoost, KrbStandalone, Infogram]
10-26 07:31:11.717 10.193.240.243:41954  43080        main  INFO water.default: Processed H2O arguments: [-ip, 10.193.240.243, -port, 41954, -flatfile, /home/pdonn/h2o.topology.flatfile, -nthreads, -1, -ice_root, /home/pdonn/h2o-pdonn, -log_dir, /home/pdonn/h2o-pdonn-log, -log_level, TRACE, -max_log_file_size, 100MB, -hdfs_config, /home/pdonn/core-site.xml]
10-26 07:31:11.717 10.193.240.243:41954  43080        main  INFO water.default: Java availableProcessors: 192
10-26 07:31:11.717 10.193.240.243:41954  43080        main  INFO water.default: Java heap totalMemory: 1.92 GB
10-26 07:31:11.717 10.193.240.243:41954  43080        main  INFO water.default: Java heap maxMemory: 44.44 GB
10-26 07:31:11.717 10.193.240.243:41954  43080        main  INFO water.default: Java version: Java 1.8.0_342 (from Red Hat, Inc.)
10-26 07:31:11.719 10.193.240.243:41954  43080        main  INFO water.default: JVM launch parameters: [-Dsys.ai.h2o.persist.s3.endPoint=https://my/endpoint/url, -Dsys.ai.h2o.persist.s3.enable.path.style=true, -Dsys.ai.h2o.persist.s3.maxErrorRetry=10, -Dsys.ai.h2o.persist.s3.socketTimeout=100000, -Dsys.ai.h2o.persist.s3.connectionTimeout=20000, -Dsys.ai.h2o.persist.s3.maxHttpConnections=50, -Xmx50g]
10-26 07:31:11.719 10.193.240.243:41954  43080        main  INFO water.default: JVM process id: 43080@my_server.com
10-26 07:31:11.719 10.193.240.243:41954  43080        main  INFO water.default: OS version: Linux 3.10.0-1160.76.1.el7.x86_64 (amd64)
10-26 07:31:11.719 10.193.240.243:41954  43080        main  INFO water.default: Machine physical memory: 2.953 TB
10-26 07:31:11.720 10.193.240.243:41954  43080        main  INFO water.default: Machine locale: en_US
10-26 07:31:11.720 10.193.240.243:41954  43080        main  INFO water.default: X-h2o-cluster-id: 1666787462264
10-26 07:31:11.720 10.193.240.243:41954  43080        main  INFO water.default: User name: 'pdonn'
10-26 07:31:11.720 10.193.240.243:41954  43080        main  INFO water.default: IPv6 stack selected: false
10-26 07:31:11.720 10.193.240.243:41954  43080        main  INFO water.default: Possible IP Address: bond0 (bond0), fe80:0:0:0:266e:96ff:fe1f:a28%bond0
10-26 07:31:11.721 10.193.240.243:41954  43080        main  INFO water.default: Possible IP Address: bond0 (bond0), 10.193.240.243
10-26 07:31:11.721 10.193.240.243:41954  43080        main  INFO water.default: Possible IP Address: lo (lo), 0:0:0:0:0:0:0:1%lo
10-26 07:31:11.721 10.193.240.243:41954  43080        main  INFO water.default: Possible IP Address: lo (lo), 127.0.0.1
10-26 07:31:11.721 10.193.240.243:41954  43080        main  INFO water.default: H2O node running in unencrypted mode.
10-26 07:31:11.723 10.193.240.243:41954  43080        main  INFO water.default: Internal communication uses port: 41955
10-26 07:31:11.723 10.193.240.243:41954  43080        main  INFO water.default: Listening for HTTP and REST traffic on http://10.193.240.243:41954/
10-26 07:31:11.738 10.193.240.243:41954  43080        main DEBUG water.default: Interface MTU: 1500
10-26 07:31:11.743 10.193.240.243:41954  43080        main  INFO water.default: H2O cloud name: 'pdonn' on /10.193.240.243:41954, static configuration based on -flatfile /home/pdonn/h2o.topology.flatfile
10-26 07:31:11.743 10.193.240.243:41954  43080        main  INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
10-26 07:31:11.743 10.193.240.243:41954  43080        main  INFO water.default:   1. Open a terminal and run 'ssh -L 55555:localhost:41954 pd...@10.193.240.243'
10-26 07:31:11.743 10.193.240.243:41954  43080        main  INFO water.default:   2. Point your browser to http://localhost:55555
10-26 07:31:12.584 10.193.240.243:41954  43080        main DEBUG water.default: resource /home/pdonn/core-site.xml added to the hadoop configuration
10-26 07:31:12.597 10.193.240.243:41954  43080        main  INFO water.default: Kerberos not configured
10-26 07:31:12.598 10.193.240.243:41954  43080        main  INFO water.default: Log dir: '/home/pdonn/h2o-pdonn-log'
10-26 07:31:12.598 10.193.240.243:41954  43080        main  INFO water.default: Cur dir: '/home/pdonn'
10-26 07:31:12.599 10.193.240.243:41954  43080        main DEBUG water.default: H2O launch parameters: [ SYSTEM_PROP_PREFIX: sys.ai.h2o., SYSTEM_DEBUG_CORS: sys.ai.h2o.debug.cors, help: false, version: false, name: pdonn, flatfile: /home/pdonn/h2o.topology.flatfile, ice_root: /home/pdonn/h2o-pdonn, cleaner: false, nthreads: 192, log_dir: /home/pdonn/h2o-pdonn-log, flow_dir: null, disable_web: false, disable_net: false, disable_flow: false, client: false, allow_clients: false, allow_unsupported_java: false, rest_api_ping_timeout: 0, notify_local: null, off_heap_memory_ratio: 0.0, hdfs_config: [Ljava.lang.String;@6d467c87, hdfs_skip: false, aws_credentials: null, configure_s3_using_s3a: false, ga_hadoop_ver: null, hadoop_properties: {}, auto_recovery_dir: null, log_level: TRACE, max_log_file_size: 100MB, random_udp_drop: false, md5skip: false, quiet: false, clientDisconnectTimeout: 20000, embedded: false, features_level: Experimental ]
10-26 07:31:12.600 10.193.240.243:41954  43080        main DEBUG water.default: Boot class path: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/lib/resources.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/lib/rt.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/lib/jsse.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/lib/jce.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/lib/charsets.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/lib/jfr.jar:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.342.b07-1.el7_9.x86_64/jre/classes
10-26 07:31:12.600 10.193.240.243:41954  43080        main DEBUG water.default: Java class path: /path/to/R/packages/renv/cache/v5/R-3.5/x86_64-pc-linux-gnu/h2o/3.38.0.1/5326f4e992019499550e2c6e2324504f/h2o/java/h2o.jar
10-26 07:31:12.600 10.193.240.243:41954  43080        main DEBUG water.default: Java library path: /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
10-26 07:31:12.616 10.193.240.243:41954  43080        main  INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
10-26 07:31:12.616 10.193.240.243:41954  43080        main  INFO water.default: HDFS subsystem successfully initialized
10-26 07:31:12.621 10.193.240.243:41954  43080        main  INFO water.default: S3 subsystem successfully initialized
10-26 07:31:12.625 10.193.240.243:41954  43080        main  INFO water.default: GCS subsystem successfully initialized
10-26 07:31:12.625 10.193.240.243:41954  43080        main  INFO water.default: Flow dir: '/home/pdonn/h2oflows'
10-26 07:31:12.634 10.193.240.243:41954  43080        main DEBUG water.default: Announcing new Cloud Membership: [/10.193.240.243:41954]
10-26 07:31:12.634 10.193.240.243:41954  43080        main  INFO water.default: Cloud of size 1 formed [/10.193.240.243:41954]
10-26 07:31:12.647 10.193.240.243:41954  43080        main  INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
10-26 07:31:12.648 10.193.240.243:41954  43080        main DEBUG water.default: Timing within H2O.main():
10-26 07:31:12.648 10.193.240.243:41954  43080        main DEBUG water.default:     Args parsing & validation: 1ms
10-26 07:31:12.648 10.193.240.243:41954  43080        main DEBUG water.default:     Get ICE root: 0ms
10-26 07:31:12.649 10.193.240.243:41954  43080        main DEBUG water.default:     Print log version: 53ms
10-26 07:31:12.649 10.193.240.243:41954  43080        main DEBUG water.default:     Detect network address: 9ms
10-26 07:31:12.649 10.193.240.243:41954  43080        main DEBUG water.default:     Start local node: 7851ms
10-26 07:31:12.649 10.193.240.243:41954  43080        main DEBUG water.default:     Extensions onLocalNodeStarted(): 206ms
10-26 07:31:12.649 10.193.240.243:41954  43080        main DEBUG water.default:     RuntimeMxBean: 2ms
10-26 07:31:12.650 10.193.240.243:41954  43080        main DEBUG water.default:     Initialize persistence layer: 26ms
10-26 07:31:12.650 10.193.240.243:41954  43080        main DEBUG water.default:     Start network services: 7ms
10-26 07:31:12.650 10.193.240.243:41954  43080        main DEBUG water.default:     Cloud up: 2ms
10-26 07:31:12.650 10.193.240.243:41954  43080        main DEBUG water.default:     Start GA: 13ms
10-26 07:31:12.650 10.193.240.243:41954  43080        main  INFO water.default: XGBoost extension initialized
10-26 07:31:12.650 10.193.240.243:41954  43080        main  INFO water.default: KrbStandalone extension initialized
10-26 07:31:12.650 10.193.240.243:41954  43080        main  INFO water.default: Infogram extension initialized
10-26 07:31:12.651 10.193.240.243:41954  43080        main  INFO water.default: Registered 3 core extensions in: 2033ms
10-26 07:31:12.651 10.193.240.243:41954  43080        main  INFO water.default: Registered H2O core extensions: [XGBoost, KrbStandalone, Infogram]
10-26 07:31:12.659 10.193.240.243:41954  43080        main  INFO hex.tree.xgboost.XGBoostExtension: Found XGBoost backend with library: xgboost4j_gpu
10-26 07:31:12.660 10.193.240.243:41954  43080        main  INFO hex.tree.xgboost.XGBoostExtension: XGBoost supported backends: [WITH_GPU, WITH_OMP]
10-26 07:31:12.896 10.193.240.243:41954  43080  ice Thread DEBUG water.default: GC CALLBACK: 1666787472895, USED:33.6 MB, CRIT: false
10-26 07:31:12.896 10.193.240.243:41954  43080  ice Thread DEBUG water.default: MemGood:   GC CALLBACK, (K/V:Zero   + POJO:33.6 MB + FREE:44.41 GB == MEM_MAX:44.44 GB), desiredKV=29.71 GB NO-OOM
10-26 07:31:13.198 10.193.240.243:41954  43080        main  INFO water.default: Registered: 276 REST APIs in: 545ms
10-26 07:31:13.198 10.193.240.243:41954  43080        main  INFO water.default: Registered REST API extensions: [XGBoost, Amazon S3, Algos, Infogram, AutoML, Core V3, TargetEncoder, Core V4]
 
This log corresponds to this export test:

h2o.exportFile(data = s3a_tbl, "s3://path/to/file/test", format = "parquet")

10-26 07:40:02.905 10.193.240.243:41954  43080  8931371-31  INFO water.default: POST /3/Frames/part_00000_a8db5fb1_46d9_49f7_9142_8be7ce5505de_c000_1cb4b261_4f52_4acc_8e8e_37cc5a4d7910_snappy_parquet.hex_sid_be6f_1/export, parms: {num_parts=1, quote_header=TRUE, force=FALSE, separator=44, format=parquet, path=s3://path/to/file/test, header=TRUE}
10-26 07:40:02.912 10.193.240.243:41954  43080  8931371-31  INFO water.default: ExportFiles processing (s3://path/to/file/test)
10-26 07:40:02.912 10.193.240.243:41954  43080  8931371-31  WARN water.default: Format is 'parquet', csv parameter values: separator, header, quote_header will be ignored!
10-26 07:40:02.913 10.193.240.243:41954  43080  8931371-31  WARN water.default: Format is 'parquet', H2O itself determines the optimal number of files (1 file per chunk). Parts parameter value will be ignored!
10-26 07:40:02.927 10.193.240.243:41954  43080  8931371-31 DEBUG water.persist.PersistS3: S3 endpoint specified: https://my/endpoint/url
10-26 07:40:02.927 10.193.240.243:41954  43080  8931371-31 DEBUG water.persist.PersistS3: S3 path style access enabled
10-26 07:40:03.303 10.193.240.243:41954  43080    FJ-1-165  WARN org.apache.hadoop.fs.FileSystem: S3FileSystem is deprecated and will be removed in future releases. Use NativeS3FileSystem or S3AFileSystem instead.
10-26 07:40:03.525 10.193.240.243:41954  43080    FJ-1-413 ERROR water.default:
java.lang.IllegalArgumentException: Invalid hostname in URI s3:/path/to/file/test/part-m-00299
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:52) ~[h2o.jar:?]
        at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:94) ~[h2o.jar:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_342]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_342]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_342]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_342]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) ~[h2o.jar:?]
        at com.sun.proxy.$Proxy28.initialize(Unknown Source) ~[?:?]
        at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:111) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) ~[h2o.jar:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:209) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:266) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:489) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.buildWriter(FrameParquetExporter.java:167) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.access$000(FrameParquetExporter.java:29) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask.map(FrameParquetExporter.java:85) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:819) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.H2O$H2OCountedCompleter.compute1(H2O.java:1680) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask$Icer.compute1(FrameParquetExporter$PartExportParquetTask$Icer.java) ~[?:?]
        at water.H2O$H2OCountedCompleter.compute(H2O.java:1676) ~[h2o.jar:?]
        at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) ~[h2o.jar:?]
        at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) [h2o.jar:?]
        at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976) [h2o.jar:?]
        at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) [h2o.jar:?]
        at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [h2o.jar:?]
10-26 07:40:03.556 10.193.240.243:41954  43080  8931371-32 TRACE water.default:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Invalid hostname in URI s3:/path/to/file/test/part-m-00299
        at jsr166y.ForkJoinTask.get(ForkJoinTask.java:1066) ~[h2o.jar:?]
        at water.Job.blockingWaitForDone(Job.java:523) [h2o.jar:?]
        at water.Job.tryGetDoneJob(Job.java:506) [h2o.jar:?]
        at water.api.JobsHandler.fetch(JobsHandler.java:39) [h2o.jar:?]
        at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_342]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_342]
        at water.api.Handler.handle(Handler.java:60) [h2o.jar:?]
        at water.api.RequestServer.serve(RequestServer.java:472) [h2o.jar:?]
        at water.api.RequestServer.doGeneric(RequestServer.java:303) [h2o.jar:?]
        at water.api.RequestServer.doGet(RequestServer.java:225) [h2o.jar:?]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) [h2o.jar:?]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
Caused by: java.lang.IllegalArgumentException: Invalid hostname in URI s3:/path/to/file/test/part-m-00299
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:52) ~[h2o.jar:?]
        at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:94) ~[h2o.jar:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_342]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_342]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_342]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_342]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) ~[h2o.jar:?]
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346) ~[h2o.jar:?]
        at com.sun.proxy.$Proxy28.initialize(Unknown Source) ~[?:?]
        at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:111) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) ~[h2o.jar:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:209) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:266) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:489) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.buildWriter(FrameParquetExporter.java:167) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.access$000(FrameParquetExporter.java:29) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask.map(FrameParquetExporter.java:85) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:819) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.H2O$H2OCountedCompleter.compute1(H2O.java:1680) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask$Icer.compute1(FrameParquetExporter$PartExportParquetTask$Icer.java) ~[?:?]
        at water.H2O$H2OCountedCompleter.compute(H2O.java:1676) ~[h2o.jar:?]
        at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) ~[h2o.jar:?]
        at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) ~[h2o.jar:?]
        at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976) ~[h2o.jar:?]
        at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) ~[h2o.jar:?]
        at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) ~[h2o.jar:?]
10-26 07:40:03.557 10.193.240.243:41954  43080  8931371-32 TRACE water.default: Waited for job result for 476ms.
10-26 07:40:12.597 10.193.240.243:41954  43080  MemCleaner DEBUG water.default: MemGood:   preclean, (K/V:2.1 MB + POJO:5.60 GB + FREE:38.84 GB == MEM_MAX:44.44 GB), desiredKV=29.77 GB NO-OOM
10-26 07:40:12.598 10.193.240.243:41954  43080  MemCleaner DEBUG water.default: H(cached:1513M, eldest:1666787613144L < +0ms <...{3119ms}...< +399232ms < +399454) DESIRED=30489M dirtysince=9532 force=false clean2age=5000
10-26 07:40:12.765 10.193.240.243:41954  43080  MemCleaner DEBUG water.default: MemGood:   postclean, (K/V:2.1 MB + POJO:5.60 GB + FREE:38.84 GB == MEM_MAX:44.44 GB), desiredKV=29.77 GB NO-OOM
10-26 07:40:12.766 10.193.240.243:41954  43080  MemCleaner DEBUG water.default: Cleaner pass took:  0.081 sec, spilled Zero   in   0 usecH(cached:1513M, eldest:1666787613144L < +0ms <...{3121ms}...< +399488ms < +399622) diski_o=Zero  , freed=0M, DESIRED=30489M
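
One detail worth calling out in the trace above: the failing URI is logged as "s3:/path/to/file/test/part-m-00299" with a single slash, i.e. the bucket/host component has already been lost by the time Hadoop's S3Credentials validates it. The stack also shows the parquet writer going through Hadoop's deprecated S3FileSystem rather than H2O's own water.persist.PersistS3, so the endpoint and path-style settings logged at 07:40:02.927 appear to cover only the PersistS3 side. As a diagnostic experiment (not a confirmed fix), the Hadoop layer inside h2o.jar can be handed the custom endpoint through a core-site.xml passed with the standard -hdfs_config launch flag; this is a minimal sketch where the fs.s3a.* keys are stock Hadoop property names and the endpoint URL is a placeholder:

# Write a minimal core-site.xml for the embedded Hadoop FileSystem layer,
# then restart the standalone node pointing at it:
#   java -jar h2o.jar -hdfs_config core-site.xml
writeLines(c(
  "<configuration>",
  "  <property><name>fs.s3a.endpoint</name><value>https://my/endpoint/url</value></property>",
  "  <property><name>fs.s3a.path.style.access</name><value>true</value></property>",
  "</configuration>"
), "core-site.xml")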

And this log corresponds to the same export test, repeated with the s3a:// scheme:

h2o.exportFile(data = s3a_tbl, "s3a://path/to/file/test", format = "parquet")

10-26 07:41:35.168 10.193.240.243:41954  43080  8931371-34  INFO water.default: POST /3/Frames/part_00000_a8db5fb1_46d9_49f7_9142_8be7ce5505de_c000_1cb4b261_4f52_4acc_8e8e_37cc5a4d7910_snappy_parquet.hex_sid_be6f_1/export, parms: {num_parts=1, quote_header=TRUE, force=FALSE, separator=44, format=parquet, path=s3a://path/to/file/test, header=TRUE}
10-26 07:41:35.170 10.193.240.243:41954  43080  8931371-34  INFO water.default: ExportFiles processing (s3a://path/to/file/test)
10-26 07:41:35.170 10.193.240.243:41954  43080  8931371-34  WARN water.default: Format is 'parquet', csv parameter values: separator, header, quote_header will be ignored!
10-26 07:41:35.171 10.193.240.243:41954  43080  8931371-34  WARN water.default: Format is 'parquet', H2O itself determines the optimal number of files (1 file per chunk). Parts parameter value will be ignored!
10-26 07:41:35.465 10.193.240.243:41954  43080    FJ-1-163 ERROR water.default:
java.lang.NullPointerException: null uri host.
        at java.util.Objects.requireNonNull(Objects.java:228) ~[?:1.8.0_342]
        at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:72) ~[h2o.jar:?]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:165) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) ~[h2o.jar:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:209) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:266) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:489) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.buildWriter(FrameParquetExporter.java:167) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.access$000(FrameParquetExporter.java:29) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask.map(FrameParquetExporter.java:85) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:819) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.H2O$H2OCountedCompleter.compute1(H2O.java:1680) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask$Icer.compute1(FrameParquetExporter$PartExportParquetTask$Icer.java) ~[?:?]
        at water.H2O$H2OCountedCompleter.compute(H2O.java:1676) ~[h2o.jar:?]
        at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) ~[h2o.jar:?]
        at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) [h2o.jar:?]
        at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976) [h2o.jar:?]
        at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) [h2o.jar:?]
        at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) [h2o.jar:?]
10-26 07:41:35.521 10.193.240.243:41954  43080  8931371-40 TRACE water.default:
java.util.concurrent.ExecutionException: java.lang.NullPointerException: null uri host.
        at jsr166y.ForkJoinTask.get(ForkJoinTask.java:1066) ~[h2o.jar:?]
        at water.Job.blockingWaitForDone(Job.java:523) [h2o.jar:?]
        at water.Job.tryGetDoneJob(Job.java:506) [h2o.jar:?]
        at water.api.JobsHandler.fetch(JobsHandler.java:39) [h2o.jar:?]
        at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) ~[?:?]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_342]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_342]
        at water.api.Handler.handle(Handler.java:60) [h2o.jar:?]
        at water.api.RequestServer.serve(RequestServer.java:472) [h2o.jar:?]
        at water.api.RequestServer.doGeneric(RequestServer.java:303) [h2o.jar:?]
        at water.api.RequestServer.doGet(RequestServer.java:225) [h2o.jar:?]
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) [h2o.jar:?]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
Caused by: java.lang.NullPointerException: null uri host.
        at java.util.Objects.requireNonNull(Objects.java:228) ~[?:1.8.0_342]
        at org.apache.hadoop.fs.s3native.S3xLoginHelper.buildFSURI(S3xLoginHelper.java:72) ~[h2o.jar:?]
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:165) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831) ~[h2o.jar:?]
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) ~[h2o.jar:?]
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:209) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter.<init>(ParquetWriter.java:266) ~[h2o.jar:?]
        at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:489) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.buildWriter(FrameParquetExporter.java:167) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter.access$000(FrameParquetExporter.java:29) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask.map(FrameParquetExporter.java:85) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:819) ~[h2o.jar:?]
        at water.MRTask.compute2(MRTask.java:775) ~[h2o.jar:?]
        at water.H2O$H2OCountedCompleter.compute1(H2O.java:1680) ~[h2o.jar:?]
        at water.parser.parquet.FrameParquetExporter$PartExportParquetTask$Icer.compute1(FrameParquetExporter$PartExportParquetTask$Icer.java) ~[?:?]
        at water.H2O$H2OCountedCompleter.compute(H2O.java:1676) ~[h2o.jar:?]
        at jsr166y.CountedCompleter.exec(CountedCompleter.java:468) ~[h2o.jar:?]
        at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263) ~[h2o.jar:?]
        at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:976) ~[h2o.jar:?]
        at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479) ~[h2o.jar:?]
        at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104) ~[h2o.jar:?]
10-26 07:41:35.522 10.193.240.243:41954  43080  8931371-40 TRACE water.default: Waited for job result for 86ms.
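
Both schemes fail during FileSystem initialization, before any request reaches the object store, which points at URI construction in the exporter rather than at the store itself. Until that's resolved, here is a workaround sketch that bypasses the Hadoop layer entirely: export the parquet parts to local disk and push them with the aws CLI. It assumes the CLI is installed and can reach the Scality endpoint (--recursive and --endpoint-url are standard CLI flags; the staging directory is hypothetical):

# Stage the export locally (H2O writes one parquet part per chunk under this path),
# then copy the directory up to the bucket through the custom endpoint.
local_dir <- "/tmp/export_test"   # hypothetical staging directory
h2o.exportFile(data = s3a_tbl, local_dir, format = "parquet")
system(paste("aws s3 cp", local_dir, "s3://path/to/file/test",
             "--recursive --endpoint-url https://my/endpoint/url"))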