Best way of using split and groupBy to handle csv file input?


Denis Papathanasiou

Oct 1, 2015, 8:16:12 PM
to scala-user
I have a csv file consisting of three attributes per line:

id,position,team

This is an "exploded" view, in that one id can have multiple position +
team assignments simultaneously.

The following code works, but it seems clunky, not to mention there's
no real validation of the csv input data:

val data = scala.io.Source.fromFile("playas.csv").getLines.toList
val idMap = data.groupBy(_.split(",")(0))
for( id <- idMap.keys ) {
  val posMap = idMap(id).groupBy(_.split(",")(1))
  for( pos <- posMap.keys ) {
    val teams = posMap(pos).map(_.split(",")(2))
    println("%s played %s for (%s)".format(id, pos, teams mkString "|"))
  }
}

For a sample csv file containing this:

tom,c,lakers
tom,pf,lakers
tom,c,rockets
tom,pf,knicks
dick,sg,cavs
dick,sf,bulls
dick,sg,heat
harry,pg,mavs
harry,pg,wizards
harry,pg,clippers

I get this result:

harry played pg for (mavs|wizards|clippers)
tom played c for (lakers|rockets)
tom played pf for (lakers|knicks)
dick played sg for (cavs|heat)
dick played sf for (bulls)

Which, again, is exactly what I wanted, but is there a cleaner, more
idiomatic scala way of doing this?

Brian Maso

Oct 1, 2015, 11:52:06 PM
to Denis Papathanasiou, scala-user

First, I'd consider grouping by the first 2 elements of each line in one swoop, rather than a nested operation. And I'd use a regex to parse each line, which I find a lot easier to look at than "split":

scala> val input =
     |   """tom,c,lakers
     | tom,pf,lakers
     | tom,c,rockets
     | tom,pf,knicks
     | dick,sg,cavs
     | dick,sf,bulls
     | dick,sg,heat
     | harry,pg,mavs
     | harry,pg,wizards
     | harry,pg,clippers"""

input: String =
tom,c,lakers
tom,pf,lakers
tom,c,rockets
tom,pf,knicks
dick,sg,cavs
dick,sf,bulls
dick,sg,heat
harry,pg,mavs
harry,pg,wizards
harry,pg,clippers

scala> val linePattern = "([^,]+),([^,]+),([^,]+)".r
linePattern: scala.util.matching.Regex = ([^,]+),([^,]+),([^,]+)

scala> val tuples = scala.io.Source.fromString(input).getLines.toList.collect {
     |   case linePattern(name, pos, team) => (name, pos, team)
     | }
tuples: List[(String, String, String)] = List((tom,c,lakers), (tom,pf,lakers), (tom,c,rockets), (tom,pf,knicks), (dick,sg,cavs), (dick,sf,bulls), (dick,sg,heat), (harry,pg,mavs), (harry,pg,wizards), (harry,pg,clippers))

scala> for( ((name, pos), teams) <- tuples.groupBy(t => (t._1, t._2)).mapValues(_.map(_._3).mkString(","))) {
     |   println(s"${name} played ${pos} for ${teams}")
     | }
harry played pg for mavs,wizards,clippers
tom played c for lakers,rockets
tom played pf for lakers,knicks
dick played sg for cavs,heat
dick played sf for bulls

Brian Maso



Lanny Ripple

Oct 2, 2015, 1:20:16 AM
to scala-user, denis.pap...@gmail.com
It's hard to beat Brian's answer so I'll only add some considerations.

First, for simple input there's not much to choose between line.split(",", 3) and a regex.  Either way you should still trim the resulting fields, and both are defeated by commas inside quoted fields.  With slightly more complicated input the regex pulls ahead.  Beyond that you could write your own parser, but OpenCSV is a simple CSV library for Java, and there's an extension (I believe) that makes it nice to use from Scala.  It's just as easy to start with OpenCSV and not have to worry about parsing correctly.
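To make the trimming and validation points concrete, here is a minimal sketch of a split-based parser that rejects malformed lines instead of throwing. The names CsvLines, Row, parseLine and parseAll are hypothetical and not from the thread. It uses a split limit of -1 rather than the 3 mentioned above, on the assumption that an extra field should be reported as an error rather than folded into the team. Like plain split, it does not handle commas inside quoted fields.

```scala
object CsvLines {
  // Hypothetical record type for one id,position,team row.
  final case class Row(id: String, position: String, team: String)

  // Split on commas, trim each field, and accept the line only if it has
  // exactly three non-empty fields. The limit of -1 keeps trailing empty
  // strings, so "tom,c," still yields three fields and is rejected for its
  // empty team rather than shrinking to two fields.
  def parseLine(line: String): Either[String, Row] =
    line.split(",", -1).map(_.trim) match {
      case Array(id, pos, team) if id.nonEmpty && pos.nonEmpty && team.nonEmpty =>
        Right(Row(id, pos, team))
      case _ =>
        Left("malformed line: " + line)
    }

  // Separate the error messages for rejected lines from the parsed rows.
  def parseAll(lines: Seq[String]): (Seq[String], Seq[Row]) = {
    val parsed = lines.map(parseLine)
    (parsed.collect { case Left(err) => err },
     parsed.collect { case Right(row) => row })
  }
}
```

Calling CsvLines.parseAll on the lines of the file gives the bad lines and the good rows separately, so the caller can decide whether to log, skip, or abort on bad input before grouping.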

Grouping by elements in order (or reverse order) is very close to a sorting problem.  Consider

scala> input.split("\n").map{line => val Array(a,b,c) = line.split(",",3); (a,b,c)}.sorted  // via Ordering[Tuple3]

and you have the exact results you want as a sorted Array.  A .foreach or a bit of foldLeft would easily give you the output you want.  Sometimes a .sorted or .sortBy is all you really need.
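One way the "bit of foldLeft" could look is sketched below. This is an illustration, not Lanny's code: SortedGroups and render are hypothetical names, and the output format is borrowed from the original question. Because the rows are sorted first, all rows for one (id, position) pair are adjacent, so a single pass can group them.

```scala
object SortedGroups {
  // Sort (id, position, team) triples, then fold over them, starting a new
  // output group whenever the (id, position) pair changes.
  def render(rows: Seq[(String, String, String)]): List[String] = {
    val groups =
      rows.sorted.foldLeft(List.empty[((String, String), List[String])]) {
        case ((key, teams) :: rest, (id, pos, team)) if key == ((id, pos)) =>
          (key, team :: teams) :: rest     // same group: prepend the team
        case (acc, (id, pos, team)) =>
          ((id, pos), List(team)) :: acc   // first row of a new group
      }
    // Both the groups and the teams inside them were built in reverse order.
    groups.reverse.map { case ((id, pos), teams) =>
      id + " played " + pos + " for (" + teams.reverse.mkString("|") + ")"
    }
  }
}
```

Note that, unlike the groupBy versions, this prints players, positions, and teams in sorted order rather than in file order.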

Denis Papathanasiou

Oct 4, 2015, 11:59:33 AM
to scala-user, denis.pap...@gmail.com
Brian, this is great, thank you!

I didn't realize I could use collect in conjunction with pattern matching, which will clean up my code in other places as well.

I should also study what mapValues does, since I didn't know it existed.
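For anyone else meeting these two methods for the first time, here is a small sketch of what each does, reusing the regex from Brian's answer. The object and method names are hypothetical. collect behaves like a filter and a map in one pass: elements the partial function does not match are dropped. mapValues transforms only the values of a Map and leaves the keys alone. The sketch spells that step out with a plain map over key/value pairs, which behaves the same on any Scala version; in Scala 2.13 and later, mapValues on a strict Map is deprecated in favor of .view.mapValues(...).toMap.

```scala
object CollectAndMapValues {
  val linePattern = "([^,]+),([^,]+),([^,]+)".r

  // collect keeps only the lines the regex matches in full, and converts
  // each match to a triple; non-matching lines are silently skipped.
  def parse(lines: List[String]): List[(String, String, String)] =
    lines.collect { case linePattern(name, pos, team) => (name, pos, team) }

  // groupBy produces Map[(name, pos), List[triple]]. The map step below is
  // the strict equivalent of mapValues: same keys, each value replaced by
  // the comma-joined team names of its group.
  def teamsByPlayerPosition(
      rows: List[(String, String, String)]): Map[(String, String), String] =
    rows.groupBy(t => (t._1, t._2)).map { case (key, group) =>
      key -> group.map(_._3).mkString(",")
    }
}
```

Because collect drops lines that fail to match, malformed input disappears without any error; that is convenient here but worth keeping in mind if validation matters.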

Denis Papathanasiou

Oct 4, 2015, 12:02:25 PM
to scala-user, denis.pap...@gmail.com
Lanny, thank you for the sorting observation.

I also didn't realize OpenCSV existed, so I'll look into that as well (though you're right, I probably don't need it here).

Nick Stanchenko

Oct 5, 2015, 5:38:47 AM
to scala-user
Denis,

One thing I commonly do when reading CSV is create a small case class with the corresponding data model:

case class Entry(
  player: String,
  position: String,
  team: String
)

scala.io.Source.fromFile("playas.csv").getLines.toSeq
  .map(_.split(","))
  .map { case Array(player, position, team) => Entry(player, position, team) }
  .groupBy(entry => (entry.player, entry.position)) // see how this is already much nicer?
  .foreach {
    case ((player, position), entries) =>
      val teams = entries.map(_.team).distinct
      println(s"$player played $position for (${teams.mkString("|")})")
  }

Nick