As described in the blog below,
I was trying to read a file from Google Cloud Storage using Spark with Scala. For that, I imported the Google Cloud Storage connector and the Google Cloud Storage library as dependencies, as shown below:
// https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage
compile group: 'com.google.cloud', name: 'google-cloud-storage', version: '0.7.0'
// https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector
compile group: 'com.google.cloud.bigdataoss', name: 'gcs-connector', version: '1.6.0-hadoop2'
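(For a Scala project built with sbt rather than Gradle, the equivalent coordinates would be something like the following sketch; this is not part of the original question.)
// build.sbt — assumed sbt equivalent of the Gradle dependencies above
libraryDependencies ++= Seq(
  "com.google.cloud" % "google-cloud-storage" % "0.7.0",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "1.6.0-hadoop2"
)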
After that, I created a simple Scala object like the one below (a SparkSession is created first):
val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
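For context, here is a minimal sketch of the object around that line. The SparkSession setup is assumed; the stack trace below only confirms that the object is named test and that spark.read.csv is the failing call.
import org.apache.spark.sql.SparkSession

object test {
  def main(args: Array[String]): Unit = {
    // Assumed setup; the question only states that a SparkSession was created.
    val spark = SparkSession.builder()
      .appName("gcs-csv-read") // hypothetical app name
      .master("local[*]")      // running locally from IntelliJ, per the edit below
      .getOrCreate()

    val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
    csvData.show()
  }
}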
But it throws the following error:
17/03/01 20:16:02 INFO GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
17/03/01 20:16:23 WARN HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:158)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:205)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:70)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:317)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
at test$.main(test.scala:41)
at test.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
I have set up all the authentication as well, and I'm not sure why this timeout is being thrown.
Edit
I am trying to run the above code through IntelliJ IDEA on Windows. A JAR built from the same code works fine on Google Cloud Dataproc, but it gives the above error when I run it on my local system. I have installed the Spark, Scala, and Google Cloud plugins in IntelliJ.
One more thing: I created a Dataproc instance and tried to connect to its external IP address as described in the documentation, https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh
That also failed with a timeout error; I was not able to connect to the server.
You need to set google.cloud.auth.service.account.json.keyfile to the local path of a JSON credential file for a service account you create, following these instructions for generating a private key. The stack trace shows that the connector thinks it is on a GCE VM and is trying to obtain a credential from the local metadata server, which is unreachable outside GCE and so times out. If that doesn't work, try setting fs.gs.auth.service.account.json.keyfile instead.
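In Spark code, a minimal sketch of that setup (assuming a SparkSession named spark; <KeyFile Path> is a placeholder):
// Use a local service-account JSON key instead of the GCE metadata server.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("google.cloud.auth.service.account.enable", "true") // usually the default; shown for completeness
hadoopConf.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>")
// Fallback property name, if the one above has no effect:
// hadoopConf.set("fs.gs.auth.service.account.json.keyfile", "<KeyFile Path>")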
When trying to SSH, have you tried gcloud compute ssh <instance name>? You may also need to check your Compute Engine firewall rules to make sure you're allowing inbound connections on port 22.
Thank you, Dennis, for pointing me toward the problem. Since I am using Windows, there is no core-site.xml to edit, because Hadoop is not installed on Windows.
I downloaded a pre-built Spark distribution and configured the parameter you mentioned directly in the code, as shown below.
I created a SparkSession and used it to set the Hadoop configuration parameter,
spark.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>")
together with all the other parameters that would normally be set up in core-site.xml.
After setting all of these, the program could access the files from Google Cloud Storage.
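For anyone reproducing this, a minimal sketch of the programmatic setup described above (property names as used by the gcs-connector; the fs.gs.impl and project-id settings are my assumption about which "other parameters" are needed):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gcs-csv-read") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

val conf = spark.sparkContext.hadoopConfiguration
// Register the GCS filesystem implementation (normally done in core-site.xml).
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", "<project-id>") // placeholder: your GCP project ID
// Authenticate with a local service-account key instead of the metadata server.
conf.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>")

val csvData = spark.read.csv("gs://my-bucket/project-data/csv")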