Join Two DataFrames without a Duplicated Column

A common question Databricks customers have is how best to join two DataFrames without having a duplicated column. If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns down the road.

In order to perform this join “correctly” (although both kinds will work) you’ll need to specify your join columns as an Array instead of an expression. The notebook and code below will show you how to fix this.

Problem - Duplicated Columns.

%scala

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")

val df = left.join(right, left.col("name") === right.col("name"))

display(df)
%python

llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)]
left = sqlContext.createDataFrame(llist, ['name','date','duration'])
right = sqlContext.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])

df = left.join(right, left.name == right.name)

display(df)

Solution, specify the columns as an array type or string to avoid having this problem.

Here’s solutions in Scala:

%scala

val df = left.join(right, Seq("name"))
 display(df)
%scala

val df = left.join(right, "name")
 display(df)

Here are solutions in Python:

%python
df = left.join(right, ["name"])

display(df)
%python
df = left.join(right, "name")

display(df)

Now to do the same thing in R. First we register these DataFrames as tables.

%python

left.createOrReplaceTempView("left_test_table")
right.createOrReplaceTempView("right_test_table")
left <- sql(sqlContext, "SELECT * FROM left_test_table")
right <- sql(sqlContext, "SELECT * FROM right_test_table")

head(join(left, right, left$name == right$name))

The above code results in duplicate columns. The below code does not.

%r

head(drop(join(left, right), left$name))