As described in the blog below,
I was trying to read a file from Google Cloud Storage using Spark with Scala. For that, I imported the Google Cloud Storage connector and the Google Cloud Storage library as dependencies, as shown below:
// https://mvnrepository.com/artifact/com.google.cloud/google-cloud-storage
compile group: 'com.google.cloud', name: 'google-cloud-storage', version: '0.7.0'
// https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector
compile group: 'com.google.cloud.bigdataoss', name: 'gcs-connector', version: '1.6.0-hadoop2'
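(For a Scala project built with sbt rather than Gradle, the equivalent coordinates would be something like the following sketch; this is not part of the original question.)
// build.sbt — assumed sbt equivalent of the Gradle dependencies above
libraryDependencies ++= Seq(
  "com.google.cloud" % "google-cloud-storage" % "0.7.0",
  "com.google.cloud.bigdataoss" % "gcs-connector" % "1.6.0-hadoop2"
)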
After that, I created a simple Scala object like the one below (a SparkSession is created first):
val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
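For context, here is a minimal sketch of the object around that line. The SparkSession setup is assumed; the stack trace below only confirms that the object is named test and that spark.read.csv is the failing call.
import org.apache.spark.sql.SparkSession

object test {
  def main(args: Array[String]): Unit = {
    // Assumed setup; the question only states that a SparkSession was created.
    val spark = SparkSession.builder()
      .appName("gcs-csv-read") // hypothetical app name
      .master("local[*]")      // running locally from IntelliJ, per the edit below
      .getOrCreate()

    val csvData = spark.read.csv("gs://my-bucket/project-data/csv")
    csvData.show()
  }
}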
But it throws the following error:
17/03/01 20:16:02 INFO GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2
17/03/01 20:16:23 WARN HttpTransport: exception thrown while executing request
java.net.SocketTimeoutException: connect timed out
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:158)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:205)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:70)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.configure(GoogleHadoopFileSystemBase.java:1816)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:1003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.initialize(GoogleHadoopFileSystemBase.java:966)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:317)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
at test$.main(test.scala:41)
at test.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
I have set up all the authentication as well, and I'm not sure why this timeout is being thrown.
Edit
I am trying to run the above code through IntelliJ IDEA on Windows. A JAR built from the same code works fine on Google Cloud Dataproc, but it gives the above error when I run it on my local system. I have installed the Spark, Scala, and Google Cloud plugins in IntelliJ.
One more thing: I created a Dataproc instance and tried to connect to its external IP address as described in the documentation, https://cloud.google.com/compute/docs/instances/connecting-to-instance#standardssh
That also failed with a timeout error; I was not able to connect to the server.
You need to set google.cloud.auth.service.account.json.keyfile to the local path of a JSON credential file for a service account you create, following these instructions for generating a private key. The stack trace shows that the connector thinks it is on a GCE VM and is trying to obtain a credential from the local metadata server, which is unreachable outside GCE and so times out. If that doesn't work, try setting fs.gs.auth.service.account.json.keyfile instead.
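In Spark code, a minimal sketch of that setup (assuming a SparkSession named spark; <KeyFile Path> is a placeholder):
// Use a local service-account JSON key instead of the GCE metadata server.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("google.cloud.auth.service.account.enable", "true") // usually the default; shown for completeness
hadoopConf.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>")
// Fallback property name, if the one above has no effect:
// hadoopConf.set("fs.gs.auth.service.account.json.keyfile", "<KeyFile Path>")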
When trying to SSH, have you tried gcloud compute ssh <instance name>? You may also need to check your Compute Engine firewall rules to make sure you're allowing inbound connections on port 22.
Thank you, Dennis, for pointing me toward the problem. Since I am using Windows, there is no core-site.xml to edit, because Hadoop is not installed on Windows.
I downloaded a pre-built Spark distribution and configured the parameter you mentioned directly in the code, as shown below.
I created a SparkSession and used it to set the Hadoop configuration parameter,
spark.sparkContext.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>")
together with all the other parameters that would normally be set up in core-site.xml.
After setting all of these, the program could access the files from Google Cloud Storage.
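For anyone reproducing this, a minimal sketch of the programmatic setup described above (property names as used by the gcs-connector; the fs.gs.impl and project-id settings are my assumption about which "other parameters" are needed):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gcs-csv-read") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

val conf = spark.sparkContext.hadoopConfiguration
// Register the GCS filesystem implementation (normally done in core-site.xml).
conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
conf.set("fs.gs.project.id", "<project-id>") // placeholder: your GCP project ID
// Authenticate with a local service-account key instead of the metadata server.
conf.set("google.cloud.auth.service.account.json.keyfile", "<KeyFile Path>")

val csvData = spark.read.csv("gs://my-bucket/project-data/csv")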