When when is chained, evaluation stops at the first test that returns true: that branch's value is returned and the remaining whens are ignored.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, lit}

val isTrue = lit(true)

def getWithChainedWhen(): Column = {
  when(isTrue, "1st")
    .when(isTrue, "2nd")
    .when(isTrue, "3rd")
}
val df = sc.parallelize(List[(String)](("A")))
  .toDF("a")
  .withColumn("chained", getWithChainedWhen())

df.show(false)
The result of running the above code is as follows:
+---+-------+
|a |chained|
+---+-------+
|A |1st |
+---+-------+
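To see the fall-through behaviour more clearly, here is a minimal sketch that reuses the same spark-shell session, df and isTrue as above; the isFalse literal is introduced purely for illustration:

// isFalse is a hypothetical literal added only for this sketch.
val isFalse = lit(false)

def getFirstMatch(): Column = {
  when(isFalse, "1st")   // test is false, so this branch is skipped
    .when(isTrue, "2nd") // first test that returns true: its value is used
    .when(isTrue, "3rd") // never reached
}

// df.withColumn("chained2", getFirstMatch()).show(false) should show "2nd",
// confirming that the chain returns the value of the first matching test.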
Only the first when takes effect. This, it could be argued, is logically inconsistent. Each subsequent when is being called on a particular object: the Column returned by the previous when. It could be argued that the output of the 1st when should be evaluated by the 2nd when, and the 2nd value returned, and so on until we reach the 3rd value.
The following code explicitly describes these logical forks and returns the same result.
def getWithChainedWhen(): Column = {
  when(isTrue, "1st").otherwise(
    when(isTrue, "2nd").otherwise(
      when(isTrue, "3rd")))
}
This is logically consistent and makes it easy to anticipate the outcome of the function, but creates a deeply nested code block.
With the first strategy, chained "when"s, the function is syntactically simpler and the source code more compact.
Moral: For explicit logic, use when with otherwise. For succinct code, use chained whens.
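In practice the two styles are often combined: chained whens for compactness, with a single otherwise supplying a default for rows that match no test. A sketch follows; the score column, thresholds and labels are hypothetical and not part of the example above:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, col}

// The first matching test supplies the value; otherwise() covers
// every row that matches none of the tests.
def gradeBand(score: Column): Column = {
  when(score >= 70, "distinction")
    .when(score >= 50, "pass")
    .otherwise("fail")
}

// Usage, assuming a DataFrame df2 with an integer column "score":
// df2.withColumn("band", gradeBand(col("score"))).show(false)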