Getting gc to work with GCS


David Muto

Nov 10, 2023, 11:05:21 AM
to projectnessie
Hey there! I've been struggling to get Nessie GC to work when the FileIO should be GCSFileIO (rather than S3FileIO, which is supported out of the box).

I'm using the following script to run gc:

```
#!/usr/bin/env bash
set -euo pipefail

NESSIE_HOST="${NESSIE_HOST:-localhost:19120}"
WORKING_DIR="${WORKING_DIR:-./tmp}"

main() {
  # fetch all the jars and mark nessie as executable
  mkdir -p "${WORKING_DIR}"
  fetch_jar "gcs-connector-hadoop3-latest.jar" "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar"
  fetch_jar "iceberg-gcp-1.4.1.jar" "https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-gcp/1.4.1/iceberg-gcp-1.4.1.jar"
  fetch_jar "nessie-gc" "https://github.com/projectnessie/nessie/releases/download/nessie-0.73.0/nessie-gc-0.73.0"
  chmod +x "${WORKING_DIR}/nessie-gc"

  cd "${WORKING_DIR}"
  run_gc "https://${NESSIE_HOST}/api/v2"
}

run_gc() {
  local api="${1}"

  # I know inmemory isn't for production use cases...just a demo.
  # The -H flags set Hadoop configuration properties so that HadoopFileIO
  # can resolve gs:// URIs via the GCS connector.
  ./nessie-gc gc \
    --inmemory \
    -u "${api}" \
    -H fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
    -H fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
}

fetch_jar() {
  local jar="${1}"
  local url="${2}"

  # download only if we don't already have the file locally
  if ! test -f "${WORKING_DIR}/${jar}"; then
    echo "Fetching ${jar} from ${url}"
    curl -fsSL -o "${WORKING_DIR}/${jar}" "${url}"
  else
    echo "Found ${jar} in ${WORKING_DIR}"
  fi
}

main "$@"
```
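
For what it's worth, the class Iceberg complains about below can be checked for directly in the downloaded jar, to rule out a corrupted download (a quick sketch, assuming `unzip` is available):

```
# List the jar contents and look for the GCSFileIO class
# (package path taken from the stack trace below).
unzip -l "${WORKING_DIR}/iceberg-gcp-1.4.1.jar" | grep 'org/apache/iceberg/gcp/gcs/GCSFileIO'
```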

From what I could tell, passing the `-H` params is necessary when not using S3FileIO, since Nessie GC falls back to HadoopFileIO by default. However, when I run this, I see the following, which seems to indicate that neither Iceberg's GCSFileIO nor the Google Hadoop filesystem classes can be found.

```
2023-11-10 10:56:32,938 [ForkJoinPool-1-worker-1] INFO  o.p.gc.identify.IdentifyLiveContents - live-set#63965210-652d-48de-9e28-534bbd861bdf: Start walking the commit log of Branch{name=main, metadata=null, hash=56d92b26e2db94f1ca8ff22c6c6c104ba83c02f0496b76c8356e1b665f9bbdb1} using no cutoff (keep everything).
2023-11-10 10:56:33,328 [ForkJoinPool-1-worker-1] INFO  o.p.gc.identify.IdentifyLiveContents - live-set#63965210-652d-48de-9e28-534bbd861bdf: Finished walking the commit log of Branch{name=main, metadata=null, hash=56d92b26e2db94f1ca8ff22c6c6c104ba83c02f0496b76c8356e1b665f9bbdb1} using no cutoff (keep everything) after 323 commits, no more commits.
2023-11-10 10:56:33,328 [ForkJoinPool-1-worker-1] INFO  o.p.gc.identify.IdentifyLiveContents - live-set#63965210-652d-48de-9e28-534bbd861bdf: Finished walking all named references, took PT0.764379S: numReferences=1, numCommits=323, numContents=316, shortCircuits=0.
Finished Nessie-GC identify phase finished with status IDENTIFY_SUCCESS after PT0.764443S, live-content-set ID is 63965210-652d-48de-9e28-534bbd861bdf.
2023-11-10 10:56:33,362 [main] INFO  o.p.g.e.local.DefaultLocalExpire - live-set#63965210-652d-48de-9e28-534bbd861bdf: Starting expiry.
2023-11-10 10:56:33,369 [ForkJoinPool-3-worker-3] INFO  org.apache.iceberg.CatalogUtil - Loading custom FileIO implementation: org.apache.iceberg.gcp.gcs.GCSFileIO
2023-11-10 10:56:33,370 [ForkJoinPool-3-worker-3] WARN  o.apache.iceberg.io.ResolvingFileIO - Failed to load FileIO implementation: org.apache.iceberg.gcp.gcs.GCSFileIO, falling back to org.apache.iceberg.hadoop.HadoopFileIO
java.lang.IllegalArgumentException: Cannot initialize FileIO, missing no-arg constructor: org.apache.iceberg.gcp.gcs.GCSFileIO
        at org.apache.iceberg.CatalogUtil.loadFileIO(CatalogUtil.java:312)
        at org.apache.iceberg.io.ResolvingFileIO.lambda$io$1(ResolvingFileIO.java:185)
        at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1708)
        at org.apache.iceberg.io.ResolvingFileIO.io(ResolvingFileIO.java:174)
        at org.apache.iceberg.io.ResolvingFileIO.newInputFile(ResolvingFileIO.java:82)
        at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:266)
        at org.projectnessie.gc.iceberg.IcebergContentToFiles.extractFiles(IcebergContentToFiles.java:78)
        at org.projectnessie.gc.expire.PerContentDeleteExpired.lambda$identifyLiveFiles$2(PerContentDeleteExpired.java:125)
        at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
        at java.base/java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1707)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
        at java.base/java.util.stream.ReduceOps$5.evaluateSequential(ReduceOps.java:258)
        at java.base/java.util.stream.ReduceOps$5.evaluateSequential(ReduceOps.java:248)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.count(ReferencePipeline.java:709)
        at org.projectnessie.gc.expire.PerContentDeleteExpired.identifyLiveFiles(PerContentDeleteExpired.java:133)
        at org.projectnessie.gc.expire.PerContentDeleteExpired.expire(PerContentDeleteExpired.java:73)
        at org.projectnessie.gc.expire.local.DefaultLocalExpire.expireSingleContent(DefaultLocalExpire.java:104)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
        at java.base/java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3573)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
        at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
        at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
        at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
        at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
Caused by: java.lang.NoSuchMethodException: Cannot find constructor for interface org.apache.iceberg.io.FileIO
        Missing org.apache.iceberg.gcp.gcs.GCSFileIO [java.lang.ClassNotFoundException: org.apache.iceberg.gcp.gcs.GCSFileIO]
        at org.apache.iceberg.common.DynConstructors.buildCheckedException(DynConstructors.java:250)
        at org.apache.iceberg.common.DynConstructors.access$200(DynConstructors.java:32)
        at org.apache.iceberg.common.DynConstructors$Builder.buildChecked(DynConstructors.java:220)
        at org.apache.iceberg.CatalogUtil.loadFileIO(CatalogUtil.java:309)
        ... 32 common frames omitted
        Suppressed: java.lang.ClassNotFoundException: org.apache.iceberg.gcp.gcs.GCSFileIO
                at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
...
...
```
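
My best guess is that downloading the jars next to `nessie-gc` isn't enough: the `nessie-gc` executable is self-contained, so the GCS classes never actually land on its classpath. Maybe something along these lines is needed instead (a rough sketch; the main class name `org.projectnessie.gc.tool.cli.CLI` is an assumption on my part, and the downloaded executable may need to be usable as a plain jar):

```
# Sketch: put the GCS jars on the classpath explicitly instead of relying
# on the self-contained nessie-gc executable to find them.
# NOTE: the main class below is an assumption, not verified.
java -cp "nessie-gc:iceberg-gcp-1.4.1.jar:gcs-connector-hadoop3-latest.jar" \
  org.projectnessie.gc.tool.cli.CLI gc \
  --inmemory \
  -u "https://${NESSIE_HOST}/api/v2" \
  -H fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  -H fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```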

I'm wondering if anyone can either spot what's wrong here and offer some help, or share their experience using Nessie GC with GCS data.

Thanks in advance for reading!
- David

Dmitri Bourlatchkov

Nov 13, 2023, 1:38:30 PM
to David Muto, projectnessie
