join-two-dataframes-duplicated-column-notebook.html(Scala)

Problem

If you join on columns, you get duplicated columns.

%scala

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10)) 
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","upload")

val df = left.join(right, left.col("name") === right.col("name")) 

display(df)
bob2015-01-134bob23
alice2015-04-2310alice100
%python

llist = [('bob', '2015-01-13', 4), ('alice', '2015-04-23',10)] 
left = spark.createDataFrame(llist, ['name','date','duration']) 
right = spark.createDataFrame([('alice', 100),('bob', 23)],['name','upload'])

df = left.join(right, left.name == right.name) 

display(df)
alice2015-04-2310alice100
bob2015-01-134bob23

Solution

Specify the join column as an array type or string.

Scala

val df = left.join(right, Seq("name"))
 display(df)
bob2015-01-13423
alice2015-04-2310100
val df = left.join(right, "name")
 display(df)
bob2015-01-13423
alice2015-04-2310100

Python

%python
df = left.join(right, ["name"]) 

display(df)
alice2015-04-2310100
bob2015-01-13423
%python
df = left.join(right, "name") 

display(df)
alice2015-04-2310100
bob2015-01-13423

R

left.createOrReplaceTempView("left_test_table")
right.createOrReplaceTempView("right_test_table")
%r
library(SparkR)
sparkR.session()

left <- sql("SELECT * FROM left_test_table")
right <- sql("SELECT * FROM right_test_table")

head(join(left, right, left$name == right$name))
   name       date duration  name upload
1 alice 2015-04-23       10 alice    100
2   bob 2015-01-13        4   bob     23

In R you must drop the duplicated column.

%r
head(drop(join(left, right, left$name == right$name), left$name))
        date duration  name upload
1 2015-01-13        4   bob     23
2 2015-04-23       10 alice    100