How to convert Spark DataFrame columns with list elements into double or float


lkkris...@gmail.com

unread,
Sep 12, 2018, 12:39:45 PM9/12/18
to BigDL User Group
I am working on a multi-label transfer-learning example, and I get an error when running correct = predictionDF.filter("label=prediction").count(). Also, can anyone suggest how to convert a DataFrame column of lists into double (or float)?

Error:

[screenshot of the error attached: 1.JPG]





Yiheng Wang

unread,
Sep 12, 2018, 9:19:34 PM9/12/18
to lkkris...@gmail.com, BigDL User Group
Can you post more detailed code?

--
You received this message because you are subscribed to the Google Groups "BigDL User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigdl-user-gro...@googlegroups.com.
To post to this group, send email to bigdl-us...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigdl-user-group/378585cd-f50e-4ef5-8936-dcdea43c990c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

lkkris...@gmail.com

unread,
Sep 12, 2018, 9:52:04 PM9/12/18
to BigDL User Group
I am trying to compare the "label" and "prediction" columns and keep the rows where the values match. Here is my code:

classifier = NNEstimator(lrModel, MultiLabelSoftMarginCriterion(), transformer, SeqToTensor([15])) \
    .setLearningRate(0.003).setBatchSize(736).setMaxEpoch(2).setFeaturesCol("image")
creating: createMultiLabelSoftMarginCriterion
creating: createSeqToTensor
creating: createFeatureLabelPreprocessing
creating: createNNEstimator
pipeline = Pipeline(stages=[classifier])
nnModel = pipeline.fit(trainingDF)
creating: createToTuple
creating: createChainedPreprocessing
nnModel.transform(trainingDF).show(10)
delete key = 1be1d69b-963b-41f2-ab23-967ca68d9d7a 3
+----------------+--------------------+--------------+--------------------+--------------------+
|     Image Index|               image|Finding Labels|               label|          prediction|
+----------------+--------------------+--------------+--------------------+--------------------+
|00000005_001.png|[hdfs://pNameNode...|    No Finding|[0.0, 0.0, 0.0, 0...|[0.122650035, 0.0...|
|00000005_004.png|[hdfs://pNameNode...|    No Finding|[0.0, 0.0, 0.0, 0...|[0.012328666, 0.0...|
|00000008_001.png|[hdfs://pNameNode...|    No Finding|[0.0, 0.0, 0.0, 0...|[0.0055147237, 0....|
|00000011_007.png|[hdfs://pNameNode...|  Infiltration|[0.0, 0.0, 1.0, 0...|[0.003118499, 0.0...|
|00000015_000.png|[hdfs://pNameNode...|    No Finding|[0.0, 0.0, 0.0, 0...|[0.01412236, 0.00...|
|00000022_000.png|[hdfs://pNameNode...|    No Finding|[0.0, 0.0, 0.0, 0...|[0.02084559, 0.01...|
|00000022_001.png|[hdfs://pNameNode...|      Fibrosis|[0.0, 0.0, 0.0, 0...|[0.032384016, 0.0...|
|00000039_001.png|[hdfs://pNameNode...|    No Finding|[0.0, 0.0, 0.0, 0...|[0.0017941386, 0....|
|00000040_001.png|[hdfs://pNameNode...|     Emphysema|[0.0, 0.0, 0.0, 0...|[0.23027827, 0.00...|
|00000042_005.png|[hdfs://pNameNode...|    No Finding|[0.0, 0.0, 0.0, 0...|[0.005233792, 8.9...|
+----------------+--------------------+--------------+--------------------+--------------------+
only showing top 10 rows

predictionDF = nnModel.transform(validationDF).cache()
delete key = e017f9ce-af62-4653-8b85-01f7d3374be5 3
predictionDF.select("Image Index","label","prediction").sort("label", ascending=False).show(10)
+----------------+--------------------+--------------------+
|     Image Index|               label|          prediction|
+----------------+--------------------+--------------------+
|00011379_040.png|[1.0, 1.0, 1.0, 1...|[0.013405059, 0.0...|
|00025262_000.png|[1.0, 1.0, 1.0, 0...|[0.0056996667, 0....|
|00014871_012.png|[1.0, 1.0, 1.0, 0...|[0.0013419994, 0....|
|00020826_011.png|[1.0, 1.0, 1.0, 0...|[0.0135408975, 0....|
|00020826_012.png|[1.0, 1.0, 1.0, 0...|[0.003606446, 0.0...|
|00017055_001.png|[1.0, 1.0, 1.0, 0...|[0.0052660527, 0....|
|00016805_014.png|[1.0, 1.0, 1.0, 0...|[0.005817896, 0.0...|
|00009286_003.png|[1.0, 1.0, 1.0, 0...|[0.019827748, 0.0...|
|00019395_016.png|[1.0, 1.0, 1.0, 0...|[0.0021730824, 0....|
|00020860_001.png|[1.0, 1.0, 1.0, 0...|[0.03898971, 0.00...|
+----------------+--------------------+--------------------+
only showing top 10 rows

predictionDF.select("Image Index","label","prediction").show(10)
+----------------+--------------------+--------------------+
|     Image Index|               label|          prediction|
+----------------+--------------------+--------------------+
|00000005_003.png|[0.0, 0.0, 0.0, 0...|[0.006877544, 0.0...|
|00000010_000.png|[0.0, 0.0, 1.0, 0...|[0.0013657764, 0....|
|00000011_006.png|[1.0, 0.0, 0.0, 0...|[0.046066444, 0.0...|
|00000031_000.png|[0.0, 0.0, 0.0, 0...|[0.0026110203, 0....|
|00000034_001.png|[0.0, 0.0, 0.0, 0...|[0.05101261, 0.04...|
|00000046_000.png|[0.0, 0.0, 0.0, 0...|[0.005264407, 0.0...|
|00000047_001.png|[0.0, 0.0, 0.0, 0...|[0.0036213757, 0....|
|00000047_003.png|[1.0, 0.0, 0.0, 0...|[0.042358775, 0.0...|
|00000050_002.png|[0.0, 0.0, 0.0, 0...|[0.118123546, 0.0...|
|00000054_006.png|[0.0, 0.0, 1.0, 0...|[0.0027013535, 0....|
+----------------+--------------------+--------------------+
only showing top 10 rows

correct = predictionDF.filter("label=prediction").count()

And the Error I am having is mentioned in my first post.



She Bowen

unread,
Sep 14, 2018, 12:42:33 AM9/14/18
to BigDL User Group

Thanks for posting your code! We've been looking into this.


The suggested solution is to convert the array<double> column into an array<float> column using cast():

 

from pyspark.sql.types import *

predictionDF = predictionDF.withColumn("label", predictionDF["label"].cast(ArrayType(FloatType())))

 

Then we can compare and filter out the rows where the "label" column and the "prediction" column hold the same value.

correct = predictionDF.filter("label=prediction").count()

 

Converting array<float> to array<double> is not recommended here. When a float is widened to double, it is mapped to the nearest double representation of its 32-bit binary value, so a decimal like 0.1 stored as a float becomes 0.10000000149011612 as a double. Values that should compare equal may then no longer be equal, and you would have to compare with a tolerance threshold instead, which is more cumbersome.
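You can see this effect in plain Python, independent of Spark, by using the struct module to emulate 32-bit IEEE-754 storage (a minimal sketch; the variable names are only for illustration):

```python
import struct

def as_float32(x):
    """Round-trip a Python float (a 64-bit double) through 32-bit
    IEEE-754 storage, returning the nearest representable float32
    value as a Python double."""
    return struct.unpack('f', struct.pack('f', x))[0]

label = 0.1                      # stored as double in the "label" column
prediction32 = as_float32(0.1)   # the float32 value the model actually holds

# Widening the float32 back to double does NOT recover 0.1:
print(prediction32)              # 0.10000000149011612
print(label == prediction32)     # False -> exact equality fails

# Narrowing the double side down to float32 makes both sides agree:
print(as_float32(label) == prediction32)  # True
```

This is why casting the double-typed "label" column down to array<float>, rather than widening "prediction" up to array<double>, keeps exact equality meaningful.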



On Wednesday, September 12, 2018 at 6:52:04 PM UTC-7, lkkris...@gmail.com wrote:

lkkris...@gmail.com

unread,
Sep 14, 2018, 11:43:52 AM9/14/18
to BigDL User Group
Thanks for your suggestion, She Bowen. This should fix my issue, so I can count the matching values in the "label" and "prediction" columns.