Databricks Runtime 7.0 (Beta)

Beta

Databricks Runtime 7.0 is in Beta. The contents of the supported environments may change in upcoming Beta releases. Changes can include the list of packages or versions of installed packages.

The following release notes provide information about Databricks Runtime 7.0 Beta, powered by Apache Spark.

Beta testing

Thank you for participating in the Databricks Runtime 7.0 Beta. This is an exciting Databricks Runtime release for us, the first to include the upcoming Apache Spark 3.0. Thank you for partnering with us to help test and provide feedback.

To get started with the Databricks Runtime 7.0 Beta, create a cluster that runs it, either through the UI or the REST API.

  • From the Cluster creation UI, select 7.0 Beta from the Databricks Runtime Version drop-down:

    Select DBR 7.0 Beta
  • In a Clusters API 2.0/clusters/create request, specify "spark_version": "7.0.x-scala2.12"

Feedback

Your feedback will ensure that we provide the most stable and performant GA release of Databricks Runtime 7.0. To provide feedback, reach out to your Databricks account team. For support issues, contact help@databricks.com.

Important

It is important for us to understand how your existing workloads will behave with this new release. Use the Beta to test compatibility with existing workloads, but do not run production workloads on Databricks Runtime 7.0 Beta clusters.

New features

New Apache Spark features

Databricks Runtime 7.0 Beta includes all of the new features and fixes in the preview version of the Apache Spark 3.0 release. There are too many to list here, but some highlights include:

  • Scala 2.12

  • Adaptive Query Execution, a flexible framework for adaptive execution in Spark SQL that supports changing the number of reducers at runtime.

  • Redesigned pandas UDFs with type hints: [SPARK-28264]

    In Spark 3.0, pandas UDFs were redesigned to leverage Python type hints. With type hints you can express a pandas UDF naturally, without specifying the evaluation type. pandas UDFs are now more “Pythonic”: the function definition itself states what the UDF takes as input and returns as output. This clearer definition brings several benefits, such as easier static analysis. (A minimal example appears after this list.)

  • Structured Streaming web UI: [SPARK-29543]

  • Better compatibility with ANSI SQL standards

    Among other compatibility improvements, Spark 3.0 uses the Proleptic Gregorian calendar, adds ANSI-compliant reserved keywords to the parser, enforces store assignment in INSERT, and checks the overflow at runtime.

  • Join Hints: [SPARK-27225]

    Spark 3.0 extends the existing BROADCAST join hint (for both broadcast-hash joins and broadcast-nested-loop joins) by implementing additional join strategy hints corresponding to Spark’s existing join strategies: shuffle-hash, sort-merge, cartesian-product. The hint names SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL are slightly different from the code names in order to make them clearer and better reflect the actual algorithms.

  • New EXPLAIN FORMATTED command: [SPARK-27395]

    The new EXPLAIN command makes the complex SQL execution plan readable by decluttering the EXPLAIN outputs to separate sections.
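
As an illustration of the redesigned pandas UDFs highlighted above, here is a minimal PySpark sketch. It assumes a Databricks notebook where the spark session is predefined; the function and column names are made up for the example.

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Spark 3.0 infers the evaluation type from the Python type hints,
    # so no explicit PandasUDFType argument is needed.
    @pandas_udf("double")
    def plus_one(s: pd.Series) -> pd.Series:
        return s + 1

    df = spark.range(3).selectExpr("CAST(id AS DOUBLE) AS x")
    df.select(plus_one("x")).show()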

New Databricks Runtime features

The following are some of the new features included in Databricks Runtime 7.0 in addition to the upcoming Spark 3.0 features:

  • Auto Loader (Public Preview), released in Databricks Runtime 6.4, has been improved in Databricks Runtime 7.0

    Auto Loader gives you a more efficient way to process new data files incrementally as they arrive on a cloud blob store during ETL. This is an improvement over file-based structured streaming, which identifies new files by repeatedly listing the cloud directory and tracking the files that have been seen, and can be very inefficient as the directory grows. Auto Loader is also more convenient and effective than file-notification-based structured streaming, which requires that you manually configure file-notification services on the cloud and doesn’t let you backfill existing files. For details, see Load files from S3 using Auto Loader.

  • COPY INTO (Public Preview), which lets you load data into Delta Lake with idempotent retries, has been improved in Databricks Runtime 7.0

    Released as a Public Preview in Databricks Runtime 6.4, the COPY INTO SQL command lets you load data into Delta Lake with idempotent retries. Without it, you have to load data with the Apache Spark DataFrame APIs and handle any load failures yourself. The COPY INTO command provides a familiar declarative interface for loading data in SQL: it keeps track of previously loaded files, so you can safely re-run it after a failure. For details, see Copy Into (Delta Lake on Databricks).
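
As a rough illustration of COPY INTO, here is a hedged sketch run through spark.sql from a notebook; the Delta target path, source location, and format options are hypothetical and should be adapted to your data.

    # Hypothetical Delta target and source location; adjust to your environment.
    spark.sql("""
      COPY INTO delta.`/mnt/delta/events`
      FROM '/mnt/raw/events/'
      FILEFORMAT = CSV
      FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    """)

Because the command tracks files it has already loaded, re-running the same statement after a failure does not load the same files twice.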

Databricks Runtime changes

Library changes

This is a partial list of libraries included in Databricks Runtime 6.5 that are updated in Databricks Runtime 7.0.

Major Python packages upgraded:

  • boto3 1.9.162 -> 1.12.0
  • matplotlib 3.0.3 -> 3.1.3
  • numpy 1.16.2 -> 1.18.1
  • pandas 0.24.2 -> 1.0.1
  • pip 19.0.3 -> 20.0.2
  • pyarrow 0.13.0 -> 0.15.1
  • scikit-learn 0.20.3 -> 0.22.1
  • scipy 1.2.1 -> 1.4.1
  • seaborn 0.9.0 -> 0.10.0

Python packages removed:

  • boto
  • pycurl

Major R packages newly added:

  • tidyverse

R packages removed:

  • MatrixModels
  • R.methodsS3
  • R.oo
  • R.utils
  • RCurl
  • RcppEigen
  • SparseM
  • abind
  • bitops
  • car
  • carData
  • doMC
  • gbm
  • h2o
  • littler
  • lme4
  • mapproj
  • maps
  • maptools
  • minqa
  • mvtnorm
  • nloptr
  • openxlsx
  • pbkrtest
  • pkgKitten
  • quantreg
  • rio
  • sp
  • statmod
  • zip

Other changes

Data skipping index was deprecated in Databricks Runtime 4.3 and removed in Databricks Runtime 7.0. We recommend that you use Delta tables instead, which offer improved data skipping capabilities.

Migration instructions

The following notes list changes that may require updates to jobs running on previous Databricks Runtime versions. Specifically, they cover differences between Spark 2.4 and Spark 3.0.

Core Spark

  • The org.apache.spark.ExecutorPlugin interface and related configuration have been replaced with org.apache.spark.plugin.SparkPlugin, which adds new functionality. Plugins that use the old interface must be modified to extend the new interface.
  • The deprecated method TaskContext.isRunningLocally has been removed. Local execution was removed and the method always returned false.
  • The deprecated methods shuffleBytesWritten, shuffleWriteTime, and shuffleRecordsWritten in ShuffleWriteMetrics have been removed. Use bytesWritten, writeTime, and recordsWritten instead.
  • The deprecated method AccumulableInfo.apply has been removed because creating AccumulableInfo is disallowed.
  • Event log files are now written in UTF-8 encoding, and the Spark History Server replays event log files as UTF-8. Previously Spark wrote the event log file in the default charset of the driver JVM process, so the Spark History Server of Spark 2.x is needed to read old event log files in case of incompatible encoding.
  • A new protocol for fetching shuffle blocks is used. It’s recommended that external shuffle services be upgraded when running Spark 3.0 applications. You can still use old external shuffle services by setting the configuration spark.shuffle.useOldFetchProtocol to true. Otherwise, Spark may run into errors with messages like IllegalArgumentException: Unexpected message type: <number>.
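
If you must keep an older external shuffle service while testing, the legacy fetch protocol mentioned in the last item can be enabled when building the session. This is only a minimal sketch; on Databricks you would normally set the value in the cluster’s Spark config instead.

    from pyspark.sql import SparkSession

    # Fall back to the pre-3.0 shuffle fetch protocol so that an old
    # external shuffle service can still serve shuffle blocks.
    spark = (
        SparkSession.builder
        .config("spark.shuffle.useOldFetchProtocol", "true")
        .getOrCreate()
    )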

Spark SQL

Dataset and DataFrame APIs

  • In Spark 3.0, the Dataset and DataFrame API unionAll is no longer deprecated. It is an alias for union (a quick check appears after this list).
  • In Spark 2.4 and below, Dataset.groupByKey results in a grouped dataset with the key attribute wrongly named value when the key is a non-struct type, for example, int, string, array, and so on. This is counterintuitive and makes the schema of aggregation queries unexpected. For example, the schema of ds.groupByKey(...).count() is (value, count). Since Spark 3.0, the grouping attribute is named key. The old behavior is preserved under a newly added configuration spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue with a default value of false.
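
A quick PySpark check of the unionAll alias mentioned above, assuming the notebook-provided spark session:

    df1 = spark.range(3)
    df2 = spark.range(3, 6)

    # unionAll is no longer deprecated in Spark 3.0; it is simply an alias for union.
    assert df1.unionAll(df2).collect() == df1.union(df2).collect()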

DDL statements

  • In Spark 3.0, CREATE TABLE without a specific provider uses the value of spark.sql.sources.default as its provider. In Spark version 2.4 and below, it was Hive. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createHiveTableByDefault.enabled to true.

  • In Spark 3.0, when inserting a value into a table column with a different data type, type coercion is performed per the ANSI SQL standard. Certain unreasonable type conversions, such as converting string to int and double to boolean, are disallowed. A runtime exception is thrown if the value is out of range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are a valid Cast. When inserting an out-of-range value into an integral field, the low-order bits of the value are inserted (the same as Java/Scala numeric type casting). For example, if 257 is inserted into a field of byte type, the result is 1. The behavior is controlled by the option spark.sql.storeAssignmentPolicy, with a default value of ANSI. Setting the option to Legacy restores the previous behavior. (A short sketch appears after this list.)

  • The ADD JAR command previously returned a result set with the single value 0. It now returns an empty result set.

  • In Spark 2.4 and below, the SET command works without any warnings even if the specified key is for a SparkConf entry, and it has no effect because the command does not update SparkConf. In Spark 3.0, the command fails if a SparkConf key is used. You can disable such a check by setting spark.sql.legacy.setCommandRejectsSparkCoreConfs to false.

  • Refreshing a cached table triggers a table uncache operation and then a table cache (lazily) operation. In Spark version 2.4 and below, the cache name and storage level are not preserved before the uncache operation. Therefore, the cache name and storage level could be changed unexpectedly. In Spark 3.0, cache name and storage level are first preserved for cache recreation. It helps to maintain a consistent cache behavior upon table refreshing.

  • In Spark 3.0, the properties listed below become reserved; commands fail if you specify reserved properties in statements like CREATE DATABASE ... WITH DBPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES. You need their specific clauses to specify them, for example, CREATE DATABASE test COMMENT 'any comment' LOCATION 'some path'. You can set spark.sql.legacy.notReserveProperties to true to ignore the ParseException; in that case, these properties are silently removed, for example, SET DBPROPERTIES('location'='/tmp') has no effect. In Spark version 2.4 and below, these properties are neither reserved nor have side effects; for example, SET DBPROPERTIES('location'='/tmp') does not change the location of the database but only creates a headless property, just like 'a'='b'.

    Property (case sensitive)   Database Reserved   Table Reserved   Remarks
    provider                    no                  yes              For tables, use the USING clause to specify it. Once set, it can't be changed.
    location                    yes                 yes              For databases and tables, use the LOCATION clause to specify it.
    owner                       yes                 yes              For databases and tables, it is determined by the user who runs Spark and creates the database or table.
  • In Spark 3.0, you can use ADD FILE to add file directories as well. Earlier you could add only single files using this command. To restore the behavior of earlier versions, set spark.sql.legacy.addSingleFileInAddFile to true.

  • In Spark 3.0, SHOW TBLPROPERTIES throws AnalysisException if the table does not exist. In Spark version 2.4 and below, this scenario causes NoSuchTableException. Also, SHOW TBLPROPERTIES on a temporary view causes AnalysisException. In Spark version 2.4 and below, it returns an empty result.

  • In Spark 3.0, SHOW CREATE TABLE always returns Spark DDL, even when the given table is a Hive SerDe table. For generating Hive DDL, use SHOW CREATE TABLE AS SERDE command instead.

  • In Spark 3.0, columns of CHAR type are not allowed in non-Hive-SerDe tables, and CREATE/ALTER TABLE commands fail if the CHAR type is detected. Use STRING type instead. In Spark version 2.4 and below, the CHAR type is treated as STRING type and the length parameter is simply ignored.
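
To see the ANSI store assignment behavior described earlier in this list, here is a hedged PySpark sketch; the table name is hypothetical and spark is the notebook-provided session.

    # Hypothetical table used only to illustrate the ANSI store assignment policy.
    spark.sql("CREATE TABLE IF NOT EXISTS bytes_demo (b BYTE) USING parquet")

    try:
        # 257 is out of range for BYTE, so the default ANSI policy raises a runtime error.
        spark.sql("INSERT INTO bytes_demo VALUES (257)")
    except Exception as err:
        print("Rejected by ANSI store assignment:", err)

    # The pre-3.0 behavior (keep the low-order bits, so 257 becomes 1) can be restored with:
    # spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")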

UDFs and built-in functions

  • In Spark 3.0, the date_add and date_sub functions accept only int, smallint, tinyint as the 2nd argument. Fractional and non-literal strings are not valid anymore. For example: date_add(cast('1964-05-23' as date), '12.34') causes an AnalysisException. Note that string literals are still allowed, but Spark will throw an AnalysisException if the string content is not a valid integer. In Spark version 2.4 and below, if the second argument is a fractional or string value, it is coerced to int value, and the result is a date value of 1964-06-04.

  • In Spark 3.0, the function percentile_approx and its alias approx_percentile only accept an integral value in the range [1, 2147483647] as the 3rd argument, accuracy; fractional and string types are disallowed. For example, percentile_approx(10.0, 0.2, 1.8D) causes an AnalysisException. In Spark version 2.4 and below, if accuracy is a fractional or string value, it is coerced to an int value, so percentile_approx(10.0, 0.2, 1.8D) is operated as percentile_approx(10.0, 0.2, 1), which results in 10.0.

  • In Spark 3.0, an analysis exception is thrown when hash expressions are applied on elements of MapType. To restore the behavior before Spark 3.0, set spark.sql.legacy.allowHashOnMapType to true.

  • In Spark 3.0, when the array/map function is called without any parameters, it returns an empty collection with NullType as element type. In Spark version 2.4 and below, it returns an empty collection with StringType as element type. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.createEmptyCollectionUsingStringType to true.

  • In Spark 3.0, the from_json function supports two modes: PERMISSIVE and FAILFAST. The modes can be set via the mode option, and the default mode is now PERMISSIVE. In previous versions, the behavior of from_json did not conform to either PERMISSIVE or FAILFAST, especially in the processing of malformed JSON records. For example, the JSON string {"a" 1} with the schema a INT is converted to null by previous versions, but Spark 3.0 converts it to Row(null). (A short sketch appears after this list.)

  • In Spark version 2.4 and below, you can create map values with a map-type key via built-in functions such as CreateMap, MapFromArrays, and so on. In Spark 3.0, it’s not allowed to create map values with a map-type key with these built-in functions. As a workaround, you can use the map_entries function to convert a map to array<struct<key, value>>. In addition, you can still read map values with a map-type key from data sources or Java and Scala collections, though it is discouraged.

  • In Spark version 2.4 and below, you can create a map with duplicated keys via built-in functions like CreateMap, StringToMap, and so on. The behavior of a map with duplicated keys is undefined: for example, map lookup respects the duplicated key that appears first, Dataset.collect only keeps the duplicated key that appears last, MapKeys returns duplicated keys, and so on. In Spark 3.0, Spark throws a RuntimeException when duplicated keys are found. You can set spark.sql.mapKeyDedupPolicy to LAST_WIN to deduplicate map keys with the last-wins policy. You may still read map values with duplicated keys from data sources that do not enforce it (for example, Parquet); the behavior is undefined.

  • In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with a primitive-type argument, the returned UDF returns null if the input value is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, with val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and returns 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.

  • In Spark 3.0, a higher-order function exists follows the three-valued boolean logic, that is, if the predicate returns any nulls and no true is obtained, then exists returns null instead of false. For example, exists(array(1, null, 3), x -> x % 2 == 0) is null. You can restore the previous behavior by setting spark.sql.legacy.followThreeValuedLogicInArrayExists to false.

  • In Spark 3.0, the add_months function does not adjust the resulting date to the last day of the month if the original date is the last day of a month. For example, select add_months(DATE'2019-02-28', 1) results in 2019-03-28. In Spark version 2.4 and below, the resulting date is adjusted when the original date is the last day of a month. For example, adding a month to 2019-02-28 results in 2019-03-31.

  • In Spark version 2.4 and below, the current_timestamp function returns a timestamp with millisecond resolution only. In Spark 3.0, the function can return the result with microsecond resolution if the underlying clock available on the system offers such resolution.

  • In Spark 3.0, a 0-argument Java UDF is executed on the executor side identically to other UDFs. In Spark version 2.4 and below, a 0-argument Java UDF alone was executed on the driver side, and the result was propagated to executors, which might be more performant in some cases but caused inconsistency, with correctness issues in some cases.

  • The result of java.lang.Math’s log, log1p, exp, expm1, and pow may vary across platforms. In Spark 3.0, the equivalent SQL functions (including related SQL functions like LOG10) return values consistent with java.lang.StrictMath. In virtually all cases this makes no difference in the return value, and the difference is very small, but the result may not exactly match java.lang.Math on x86 platforms in cases such as log(3.0), whose value varies between Math.log() and StrictMath.log().

  • In Spark 3.0, the Cast function processes string literals such as 'Infinity', '+Infinity', '-Infinity', 'NaN', 'Inf', '+Inf', '-Inf' in a case-insensitive manner when casting the literals to Double or Float type to ensure greater compatibility with other database systems. This behavior change is illustrated in the table below:

    Operation                     Result before Spark 3.0   Result in Spark 3.0
    CAST('infinity' AS DOUBLE)    NULL                      Double.PositiveInfinity
    CAST('+infinity' AS DOUBLE)   NULL                      Double.PositiveInfinity
    CAST('inf' AS DOUBLE)         NULL                      Double.PositiveInfinity
    CAST('+inf' AS DOUBLE)        NULL                      Double.PositiveInfinity
    CAST('-infinity' AS DOUBLE)   NULL                      Double.NegativeInfinity
    CAST('-inf' AS DOUBLE)        NULL                      Double.NegativeInfinity
    CAST('infinity' AS FLOAT)     NULL                      Float.PositiveInfinity
    CAST('+infinity' AS FLOAT)    NULL                      Float.PositiveInfinity
    CAST('inf' AS FLOAT)          NULL                      Float.PositiveInfinity
    CAST('+inf' AS FLOAT)         NULL                      Float.PositiveInfinity
    CAST('-infinity' AS FLOAT)    NULL                      Float.NegativeInfinity
    CAST('-inf' AS FLOAT)         NULL                      Float.NegativeInfinity
    CAST('nan' AS DOUBLE)         NULL                      Double.NaN
    CAST('nan' AS FLOAT)          NULL                      Float.NaN
  • In Spark 3.0, when casting interval values to string type, there is no interval prefix, for example, 1 days 2 hours. In Spark version 2.4 and below, the string contains the “interval” prefix like interval 1 days 2 hours.

  • In Spark 3.0, when casting a string value to integral types (tinyint, smallint, int and bigint), datetime types (date, timestamp and interval) and boolean type, the leading and trailing whitespace (<= ASCII 32) is trimmed before being converted to these type values; for example, cast(' 1\t' as int) results in 1, cast(' 1\t' as boolean) results in true, and cast('2019-10-10\t' as date) results in the date value 2019-10-10. In Spark version 2.4 and below, when casting string to integrals and booleans, the whitespace is not trimmed from both ends, so the foregoing results are null, while for datetimes, only the trailing spaces (= ASCII 32) are removed.
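
A small sketch of the from_json modes discussed earlier in this list, assuming the notebook-provided spark session; the sample JSON strings are made up.

    from pyspark.sql.functions import from_json, col

    df = spark.createDataFrame([('{"a" 1}',), ('{"a": 2}',)], ["json"])

    # PERMISSIVE (the default): the malformed first record becomes Row(null)
    # instead of a plain null, so the struct column itself is preserved.
    df.select(from_json(col("json"), "a INT").alias("parsed")).show()

    # FAILFAST raises an error on the malformed record instead:
    # df.select(from_json(col("json"), "a INT", {"mode": "FAILFAST"}).alias("parsed")).show()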

Query engine

  • In Spark version 2.4 and below, SQL queries such as FROM <table> or FROM <table> UNION ALL FROM <table> are supported by accident. In Hive-style FROM <table> SELECT <expr>, the SELECT clause is not negligible. Neither Hive nor Presto supports this syntax. These queries are treated as invalid in Spark 3.0.

  • In Spark 3.0, the interval literal syntax does not allow multiple from-to units anymore. For example, SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH throws a parser exception.

  • In Spark 3.0, numbers written in scientific notation (for example, 1E2) are parsed as Double. In Spark version 2.4 and below, they’re parsed as Decimal. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.exponentLiteralAsDecimal.enabled to true.

  • In Spark 3.0, day-time interval strings are converted to intervals with respect to the from and to bounds. If an input string does not match the pattern defined by the specified bounds, the ParseException exception is thrown. For example, interval '2 10:20' hour to minute raises the exception because the expected format is [+|-]h[h]:[m]m. In Spark version 2.4, the from bound was not taken into account, and the to bound was used to truncate the resulting interval. For instance, the day-time interval string from the example shown is converted to interval 10 hours 20 minutes. To restore the behavior before Spark 3.0, you can set spark.sql.legacy.fromDayTimeString.enabled to true.

  • In Spark 3.0, negative scale of decimal is not allowed by default, for example, data type of literal like 1E10BD is DecimalType(11, 0). In Spark version 2.4 and below, it was DecimalType(2, -9). To restore the behavior before Spark 3.0, you can set spark.sql.legacy.allowNegativeScaleOfDecimal to true.

  • In Spark 3.0, the unary arithmetic operator plus (+) only accepts string, numeric and interval type values as inputs. Besides, + with an integral string representation is coerced to a double value; for example, +'1' returns 1.0. In Spark version 2.4 and below, this operator is ignored. There is no type checking for it, thus all type values with a + prefix are valid; for example, + array(1, 2) is valid and results in [1, 2]. Besides, there is no type coercion for it at all; for example, in Spark 2.4, the result of +'1' is the string 1.

  • In Spark 3.0, a Dataset query fails if it contains an ambiguous column reference caused by a self join. A typical example: val df1 = ...; val df2 = df1.filter(...); then df1.join(df2, df1("a") > df2("a")) returns an empty result, which is quite confusing. This is because Spark cannot resolve Dataset column references that point to tables being self-joined, and df1("a") is exactly the same as df2("a") in Spark. To restore the behavior before Spark 3.0, you can set spark.sql.analyzer.failAmbiguousSelfJoin to false.

  • In Spark 3.0, spark.sql.legacy.ctePrecedencePolicy is introduced to control the behavior for name conflicts in nested WITH clauses. With the default value EXCEPTION, Spark throws an AnalysisException and forces you to choose the specific substitution order you want. If set to CORRECTED (which is recommended), inner CTE definitions take precedence over outer definitions. For example, with the config set to CORRECTED, WITH t AS (SELECT 1), t2 AS (WITH t AS (SELECT 2) SELECT * FROM t) SELECT * FROM t2 returns 2, while setting it to LEGACY returns 1, which is the behavior in version 2.4 and below.

  • In Spark 3.0, the configuration spark.sql.crossJoin.enabled became an internal configuration and is true by default, so Spark no longer raises an exception on SQL with an implicit cross join.

  • In Spark version 2.4 and below, float/double -0.0 is semantically equal to 0.0, but -0.0 and 0.0 are considered as different values when used in aggregate grouping keys, window partition keys, and join keys. In Spark 3.0, this bug is fixed. For example, Seq(-0.0, 0.0).toDF("d").groupBy("d").count() returns [(0.0, 2)] in Spark 3.0, and [(0.0, 1), (-0.0, 1)] in Spark 2.4 and below.

  • In Spark version 2.4 and below, invalid time zone ids are silently ignored and replaced by GMT time zone, for example, in the from_utc_timestamp function. In Spark 3.0, such time zone IDs are rejected, and Spark throws java.time.DateTimeException.

  • In Spark 3.0, the Proleptic Gregorian calendar is used in parsing, formatting, and converting dates and timestamps, as well as in extracting sub-components like years, days and so on. Spark 3.0 uses Java 8 API classes from the java.time packages that are based on ISO chronology. In Spark version 2.4 and below, those operations are performed using the hybrid calendar (Julian + Gregorian). The changes impact the results for dates before October 15, 1582 (Gregorian) and affect the following Spark 3.0 APIs:

    • Parsing/formatting of timestamp/date strings. This affects CSV and JSON data sources and the unix_timestamp, date_format, to_unix_timestamp, from_unixtime, to_date, to_timestamp functions when patterns are used for parsing and formatting. In Spark 3.0, we define our own pattern strings in sql-ref-datetime-pattern.md, which is implemented via java.time.format.DateTimeFormatter under the hood. The new implementation performs strict checking of its input. For example, the 2015-07-22 10:00:00 timestamp cannot be parsed if the pattern is yyyy-MM-dd because the parser does not consume the whole input. Another example: the 31/01/2015 00:00 input cannot be parsed by the dd/MM/yyyy hh:mm pattern because hh expects hours in the range 1-12. In Spark version 2.4 and below, java.text.SimpleDateFormat is used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat. You can restore the previous behavior by setting spark.sql.legacy.timeParserPolicy to LEGACY.

    • The weekofyear, weekday, dayofweek, date_trunc, from_utc_timestamp, to_utc_timestamp, and unix_timestamp functions use the java.time API for calculating the week number of the year and the day number of the week, as well as for conversion from/to TimestampType values in the UTC time zone.

    • The JDBC options lowerBound and upperBound are converted to TimestampType or DateType values in the same way as casting strings to TimestampType and DateType values. The conversion is based on the Proleptic Gregorian calendar and the time zone defined by the SQL config spark.sql.session.timeZone. In Spark version 2.4 and below, the conversion is based on the hybrid calendar (Julian + Gregorian) and the default system time zone.

    • Formatting TIMESTAMP and DATE literals.

    • Creating typed TIMESTAMP and DATE literals from strings. In Spark 3.0, string conversion to typed TIMESTAMP/DATE literals is performed via casting to TIMESTAMP/DATE values. For example, TIMESTAMP '2019-12-23 12:59:30' is semantically equal to CAST('2019-12-23 12:59:30' AS TIMESTAMP). When the input string does not contain information about the time zone, the time zone from the SQL config spark.sql.session.timeZone is used. In Spark version 2.4 and below, the conversion is based on the JVM system time zone. The different sources of the default time zone may change the behavior of typed TIMESTAMP and DATE literals.

    • In Spark 3.0, TIMESTAMP literals are converted to strings using the SQL config spark.sql.session.timeZone. In Spark version 2.4 and below, the conversion uses the default time zone of the Java virtual machine.

    • In Spark 3.0, Spark casts String to Date/Timestamp in binary comparisons with dates/timestamps. You can restore the previous behavior of casting Date/Timestamp to String by setting spark.sql.legacy.typeCoercion.datetimeToString.enabled to true.

    • In Spark 3.0, special values are supported in conversion from strings to dates and timestamps. Those values are simply notational shorthands that are converted to ordinary date or timestamp values when read. The following string values are supported for dates:

      • epoch [zoneId] - 1970-01-01
      • today [zoneId] - the current date in the time zone specified by spark.sql.session.timeZone
      • yesterday [zoneId] - the current date - 1
      • tomorrow [zoneId] - the current date + 1
      • now - the date of running the current query. It has the same meaning as today.

      For example: SELECT date 'tomorrow' - date 'yesterday'; should output 2.

      Here are special timestamp values:

      • epoch [zoneId] - 1970-01-01 00:00:00+00 (Unix system time zero)
      • today [zoneId] - midnight today
      • yesterday [zoneId] - midnight yesterday
      • tomorrow [zoneId] - midnight tomorrow
      • now - current query start time

      For example: SELECT timestamp 'tomorrow';
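
These special values can be checked directly from a notebook; a minimal sketch using the predefined spark session:

    # 'yesterday', 'tomorrow', and 'now' are resolved when the query runs,
    # using the time zone from spark.sql.session.timeZone.
    spark.sql("SELECT date 'tomorrow' - date 'yesterday' AS diff").show()
    spark.sql("SELECT timestamp 'now' AS query_start").show(truncate=False)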

Data sources

  • In Spark version 2.4 and below, when reading a Hive SerDe table with Spark native data sources such as Parquet and ORC, Spark infers the actual file schema and updates the table schema in the metastore. In Spark 3.0, Spark doesn’t infer the schema anymore. This should not cause any problems for end users, but if it does, set spark.sql.hive.caseSensitiveInferenceMode to INFER_AND_SAVE.
  • In Spark version 2.4 and below, a partition column value is converted to null if it can’t be cast to the corresponding user-provided schema. In 3.0, a partition column value is validated against the user-provided schema. An exception is thrown if the validation fails. You can disable such validation by setting spark.sql.sources.validatePartitionColumns to false.
  • In Spark 3.0, if files or subdirectories disappear during recursive directory listing (that is, they appear in an intermediate listing but then cannot be read or listed during later phases of the recursive directory listing, due to either concurrent file deletions or object store consistency issues) then the listing will fail with an exception unless spark.sql.files.ignoreMissingFiles is true (default false). In previous versions, these missing files or subdirectories would be ignored. Note that this change of behavior only applies during initial table file listing (or during REFRESH TABLE), not during query execution: the net change is that spark.sql.files.ignoreMissingFiles is now obeyed during table file listing / query planning, not only at query execution time.
  • In Spark version 2.4 and below, the parser of JSON data source treats empty strings as null for some data types such as IntegerType. For FloatType, DoubleType, DateType and TimestampType, it fails on empty strings and throws exceptions. Spark 3.0 disallows empty strings and will throw an exception for data types except for StringType and BinaryType. You can restore the previous behavior of allowing an empty string by setting spark.sql.legacy.json.allowEmptyString.enabled to true.
  • In Spark version 2.4 and below, JSON datasource and JSON functions like from_json convert a bad JSON record to a row with all nulls in the PERMISSIVE mode when specified schema is StructType. In Spark 3.0, the returned row can contain non-null fields if some of JSON column values were parsed and converted to desired types successfully.
  • In Spark 3.0, JSON datasource and JSON function schema_of_json infer TimestampType from string values if they match to the pattern defined by the JSON option timestampFormat. Set JSON option inferTimestamp to false to disable such type inference.
  • In Spark version 2.4 and below, CSV datasource converts a malformed CSV string to a row with all nulls in the PERMISSIVE mode. In Spark 3.0, the returned row can contain non-null fields if some of CSV column values were parsed and converted to desired types successfully.
  • In Spark 3.0, the parquet logical type TIMESTAMP_MICROS is used by default when saving TIMESTAMP columns. In Spark version 2.4 and below, TIMESTAMP columns are saved as INT96 in parquet files. Note that some SQL systems such as Hive 1.x and Impala 2.x can only read INT96 timestamps; you can set spark.sql.parquet.outputTimestampType to INT96 to restore the previous behavior and keep interoperability (see the sketch after this list).
  • In Spark 3.0, when Avro files are written with user provided schema, the fields are matched by field names between catalyst schema and Avro schema instead of positions.
  • In Spark 3.0, when Avro files are written with a user-provided non-nullable schema, even if the catalyst schema is nullable, Spark is still able to write the files. However, Spark throws a runtime NullPointerException if any of the records contains null.
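
If downstream readers still require INT96 Parquet timestamps (see the Parquet note earlier in this list), the output type can be switched back. A minimal sketch with a hypothetical output path, using the notebook-provided spark session:

    # Keep interoperability with readers that only understand INT96 timestamps,
    # such as Hive 1.x and Impala 2.x.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")

    df = spark.sql("SELECT current_timestamp() AS ts")
    df.write.mode("overwrite").parquet("/tmp/int96_demo")  # hypothetical output path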

Others

  • In Spark version 2.4, when a Spark session is created via cloneSession(), the newly created Spark session inherits its configuration from its parent SparkContext even though the same configuration may exist with a different value in its parent Spark session. In Spark 3.0, the configurations of a parent SparkSession have a higher precedence over the parent SparkContext. You can restore the old behavior by setting spark.sql.legacy.sessionInitWithConfigDefaults to true.

  • In Spark 3.0, if hive.default.fileformat is not found in Spark SQL configuration then it falls back to the hive-site.xml file present in the Hadoop configuration of SparkContext.

  • In Spark 3.0, we pad decimal numbers with trailing zeros to the scale of the column for the spark-sql interface, for example:

    Query                               Spark 2.4   Spark 3.0
    SELECT CAST(1 AS decimal(38, 18));  1           1.000000000000000000
  • In Spark 3.0, we upgraded the built-in Hive from 1.2 to 2.3, which brings the following impacts:

    • You may need to set spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars according to the version of the Hive metastore you want to connect to. For example, set spark.sql.hive.metastore.version to 1.2.1 and spark.sql.hive.metastore.jars to maven if your Hive metastore version is 1.2.1 (a sketch appears after this list).
    • You need to migrate your custom SerDes to Hive 2.3 or build your own Spark with hive-1.2 profile. See HIVE-15167 for more details.
    • The decimal string representation can be different between Hive 1.2 and Hive 2.3 when using the TRANSFORM operator in SQL for script transformation, which depends on Hive’s behavior. In Hive 1.2, the string representation omits trailing zeros. But in Hive 2.3, it is always padded to 18 digits with trailing zeros if necessary.
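
A hedged sketch of pointing Spark back at a Hive 1.2.1 metastore, as mentioned in this list; on Databricks these values would normally be applied as cluster Spark configs rather than set in code.

    from pyspark.sql import SparkSession

    # Connect to an existing Hive 1.2.1 metastore after the built-in Hive
    # upgrade to 2.3; "maven" tells Spark to download the matching client jars.
    spark = (
        SparkSession.builder
        .config("spark.sql.hive.metastore.version", "1.2.1")
        .config("spark.sql.hive.metastore.jars", "maven")
        .enableHiveSupport()
        .getOrCreate()
    )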

Structured Streaming

  • In Spark 3.0, Structured Streaming forces the source schema into nullable when file-based data sources such as text, json, csv, parquet and orc are used via spark.readStream(...). Previously, it respected the nullability in the source schema; however, this caused issues that were tricky to debug with NPEs. To restore the previous behavior, set spark.sql.streaming.fileSource.schema.forceNullable to false. (A minimal sketch appears after this list.)

  • Spark 3.0 fixes the correctness issue on stream-stream outer join, which changes the schema of state (see SPARK-26154 for more details). If you start your query from a checkpoint constructed from Spark 2.x that uses stream-stream outer join, Spark 3.0 fails the query. To recalculate outputs, discard the checkpoint and replay previous inputs.
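
A minimal sketch of the forced-nullable file source schema described above, assuming the notebook-provided spark session and an existing (hypothetical) input directory:

    from pyspark.sql.types import StringType, StructField, StructType

    # Schema declared with a non-nullable field; the file source forces it to nullable.
    schema = StructType([StructField("value", StringType(), nullable=False)])

    stream_df = spark.readStream.schema(schema).json("/mnt/stream/input")  # hypothetical path
    stream_df.printSchema()  # "value" is reported as nullable = true in Spark 3.0

    # To restore the Spark 2.x behavior (at the risk of hard-to-debug NPEs):
    # spark.conf.set("spark.sql.streaming.fileSource.schema.forceNullable", "false")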

MLlib

Breaking changes

  • OneHotEncoder, which was deprecated in 2.3, is removed in 3.0, and OneHotEncoderEstimator is now renamed to OneHotEncoder.
  • org.apache.spark.ml.image.ImageSchema.readImages, which was deprecated in 2.3, is removed in 3.0. Use spark.read.format('image') instead.
  • org.apache.spark.mllib.clustering.KMeans.train with the Int param runs, which was deprecated in 2.1, is removed in 3.0. Use the train method without runs instead.
  • org.apache.spark.mllib.classification.LogisticRegressionWithSGD, which was deprecated in 2.0, is removed in 3.0. Use org.apache.spark.ml.classification.LogisticRegression or spark.mllib.classification.LogisticRegressionWithLBFGS instead.
  • org.apache.spark.mllib.feature.ChiSqSelectorModel.isSorted, which was deprecated in 2.1, is removed in 3.0; it is not intended for subclasses to use.
  • org.apache.spark.mllib.regression.RidgeRegressionWithSGD, which was deprecated in 2.0, is removed in 3.0. Use org.apache.spark.ml.regression.LinearRegression with elasticNetParam = 0.0 instead. Note the default regParam is 0.01 for RidgeRegressionWithSGD, but 0.0 for LinearRegression.
  • org.apache.spark.mllib.regression.LassoWithSGD, which was deprecated in 2.0, is removed in 3.0. Use org.apache.spark.ml.regression.LinearRegression with elasticNetParam = 1.0 instead. Note the default regParam is 0.01 for LassoWithSGD, but 0.0 for LinearRegression.
  • org.apache.spark.mllib.regression.LinearRegressionWithSGD, which was deprecated in 2.0, is removed in 3.0. Use org.apache.spark.ml.regression.LinearRegression or LBFGS instead.
  • org.apache.spark.mllib.clustering.KMeans.getRuns and setRuns, which were deprecated in 2.1, are removed in 3.0; they have had no effect since Spark 2.0.0.
  • org.apache.spark.ml.LinearSVCModel.setWeightCol, which was deprecated in 2.4, is removed in 3.0.
  • From 3.0, org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel extends MultilayerPerceptronParams to expose the training params. As a result, layers in MultilayerPerceptronClassificationModel has been changed from Array[Int] to IntArrayParam. You should use MultilayerPerceptronClassificationModel.getLayers instead of MultilayerPerceptronClassificationModel.layers to retrieve the size of layers.
  • org.apache.spark.ml.classification.GBTClassifier.numTrees, which was deprecated in 2.4.5, is removed in 3.0. Use getNumTrees instead.
  • org.apache.spark.ml.clustering.KMeansModel.computeCost, which was deprecated in 2.4, is removed in 3.0. Use ClusteringEvaluator instead.
  • The member variable precision in org.apache.spark.mllib.evaluation.MulticlassMetrics, which was deprecated in 2.0, is removed in 3.0. Use accuracy instead.
  • The member variable recall in org.apache.spark.mllib.evaluation.MulticlassMetrics, which was deprecated in 2.0, is removed in 3.0. Use accuracy instead.
  • The member variable fMeasure in org.apache.spark.mllib.evaluation.MulticlassMetrics, which was deprecated in 2.0, is removed in 3.0. Use accuracy instead.
  • org.apache.spark.ml.util.GeneralMLWriter.context, which was deprecated in 2.0, is removed in 3.0. Use session instead.
  • org.apache.spark.ml.util.MLWriter.context, which was deprecated in 2.0, is removed in 3.0. Use session instead.
  • org.apache.spark.ml.util.MLReader.context, which was deprecated in 2.0, is removed in 3.0. Use session instead.
  • The abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]] is changed to abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]] in 3.0.

Deprecations

  • SPARK-11215: labels in StringIndexerModel is deprecated and will be removed in 3.1.0. Use labelsArray instead.
  • SPARK-25758: computeCost in BisectingKMeansModel is deprecated and will be removed in future versions. Use ClusteringEvaluator instead.

Changes of behavior

  • SPARK-11215: In Spark 2.4 and previous versions, when specifying frequencyDesc or frequencyAsc as the stringOrderType param in StringIndexer, in case of equal frequency the order of strings is undefined. Since Spark 3.0, strings with equal frequency are further sorted alphabetically. Also since Spark 3.0, StringIndexer supports encoding multiple columns (a short sketch appears after this list).

  • SPARK-20604: In releases prior to 3.0, Imputer requires the input column to be Double or Float. In 3.0, this restriction is lifted so Imputer can handle all numeric types.

  • SPARK-23469: In Spark 3.0, the HashingTF Transformer uses a corrected implementation of the murmur3 hash function to hash elements to vectors. HashingTF in Spark 3.0 maps elements to different positions in vectors than in Spark 2. However, HashingTF created with Spark 2.x and loaded with Spark 3.0 will still use the previous hash function and will not change behavior.

  • SPARK-28969: The setClassifier method in PySpark’s OneVsRestModel has been removed in 3.0 for parity with the Scala implementation. Callers should not need to set the classifier in the model after creation.

  • SPARK-25790: PCA adds support for matrices with more than 65535 columns in Spark 3.0.

  • SPARK-28927: When fitting an ALS model on nondeterministic input data, previously, if a rerun happened, you would see an ArrayIndexOutOfBoundsException caused by a mismatch between In/Out user/item blocks.

    From 3.0, a SparkException with a clearer message is thrown, and the original ArrayIndexOutOfBoundsException is wrapped.

  • SPARK-29232: In releases prior to 3.0, RandomForestRegressionModel doesn’t update the parameter maps of the DecisionTreeRegressionModels underneath. This is fixed in 3.0.
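
A minimal sketch of the multi-column StringIndexer support mentioned earlier in this list; the column names are made up and spark is the notebook-provided session.

    from pyspark.ml.feature import StringIndexer

    df = spark.createDataFrame(
        [("a", "x"), ("b", "y"), ("a", "y")], ["col1", "col2"]
    )

    # Since Spark 3.0, StringIndexer can encode several columns in a single pass.
    indexer = StringIndexer(
        inputCols=["col1", "col2"],
        outputCols=["col1_idx", "col2_idx"],
    )
    indexer.fit(df).transform(df).show()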

PySpark

  • In Spark 3.0, PySpark requires a pandas version of 0.23.2 or higher to use pandas related functionality, such as toPandas, createDataFrame from pandas DataFrame, and so on.

  • In Spark 3.0, PySpark requires a PyArrow version of 0.12.1 or higher to use PyArrow related functionality, such as pandas_udf, toPandas and createDataFrame with “spark.sql.execution.arrow.enabled=true”, and so on.

  • In PySpark, when creating a SparkSession with SparkSession.builder.getOrCreate(), if there is an existing SparkContext, the builder tried to update the SparkConf of the existing SparkContext with configurations specified in the builder; but the SparkContext is shared by all SparkSessions, so they should not be updated. In 3.0, the builder no longer updates the configurations. This is the same behavior as the Java/Scala API in 2.3 and above. If you want to update them, you need to do so prior to creating a SparkSession.

  • In PySpark, when Arrow optimization is enabled, if the Arrow version is higher than 0.11.0, Arrow can perform safe type conversion when converting pandas.Series to an Arrow array during serialization. Arrow raises errors when it detects an unsafe type conversion, such as overflow. You enable it by setting spark.sql.execution.pandas.convertToArrowArraySafely to true. The default setting is false. PySpark behavior for Arrow versions is illustrated in the following table:

    PyArrow version                          Integer overflow   Floating point truncation
    0.11.0 and below                         Raise error        Silently allows
    > 0.11.0, arrowSafeTypeConversion=false  Silent overflow    Silently allows
    > 0.11.0, arrowSafeTypeConversion=true   Raise error        Raise error
  • In Spark 3.0, createDataFrame(..., verifySchema=True) validates LongType as well in PySpark. Previously, LongType was not verified and resulted in None when the value overflowed. To restore the previous behavior, set verifySchema to False to disable the validation.

  • In Spark 3.0, Column.getItem is fixed such that it does not call Column.apply. Consequently, if Column is used as an argument to getItem, the indexing operator should be used. For example, map_col.getItem(col('id')) should be replaced with map_col[col('id')].

  • As of Spark 3.0, Row field names are no longer sorted alphabetically when constructed with named arguments for Python versions 3.6 and above; the order of fields matches the order in which they were entered. To enable sorted fields by default, as in Spark 2.4, set the environment variable PYSPARK_ROW_FIELD_SORTING_ENABLED to true for both executors and the driver. This environment variable must be consistent on all executors and the driver; otherwise, it may cause failures or incorrect answers. For Python versions lower than 3.6, the field names are sorted alphabetically as the only option.
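
A quick check of the Row field-ordering change described above, assuming Python 3.6+ and the notebook-provided spark session:

    from pyspark.sql import Row

    # Fields keep the order in which the keyword arguments were entered
    # instead of being sorted alphabetically.
    r = Row(zebra=1, apple=2)
    print(r)                                    # Row(zebra=1, apple=2)
    print(spark.createDataFrame([r]).columns)   # ['zebra', 'apple']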

SparkR

  • The deprecated methods parquetFile, saveAsParquetFile, jsonFile, and jsonRDD have been removed. Use read.parquet, write.parquet, and read.json instead.

System environment

  • Operating System: Ubuntu 18.04.4 LTS
  • Java: 1.8.0_242
  • Scala: 2.12.10
  • Python: 3.7.6
  • R: R version 3.6.3 (2020-02-29)

Installed Python libraries

Package Version
asn1crypto 1.3.0
backcall 0.1.0
boto3 1.12.0
botocore 1.15.0
certifi 2020.4.5
cffi 1.14.0
chardet 3.0.4
cryptography 2.8
cycler 0.10.0
Cython 0.29.15
decorator 4.4.1
docutils 0.15.2
entrypoints 0.3
idna 2.8
ipykernel 5.1.4
ipython 7.12.0
ipython-genutils 0.2.0
jedi 0.14.1
jmespath 0.9.4
joblib 0.14.1
jupyter-client 5.3.4
jupyter-core 4.6.1
kiwisolver 1.1.0
matplotlib 3.1.3
numpy 1.18.1
pandas 1.0.1
parso 0.5.2
patsy 0.5.1
pexpect 4.8.0
pickleshare 0.7.5
pip 20.0.2
prompt-toolkit 3.0.3
psycopg2 2.8.4
ptyprocess 0.6.0
pyarrow 0.15.1
pycparser 2.19
Pygments 2.5.2
PyGObject 3.26.1
pyOpenSSL 19.1.0
pyparsing 2.4.6
PySocks 1.7.1
python-apt 1.6.5+ubuntu0.2
python-dateutil 2.8.1
pytz 2019.3
pyzmq 18.1.1
requests 2.22.0
s3transfer 0.3.3
scikit-learn 0.22.1
scipy 1.4.1
seaborn 0.10.0
setuptools 45.2.0
six 1.14.0
ssh-import-id 5.7
statsmodels 0.11.0
tornado 6.0.3
traitlets 4.3.3
unattended-upgrades 0.1
urllib3 1.25.8
virtualenv 16.7.10
wcwidth 0.1.8
wheel 0.34.2

Installed R libraries

R libraries are installed from Microsoft CRAN snapshot on 2020-04-07.

Package Version
askpass 1.1
assertthat 0.2.1
backports 1.1.6
base 3.6.3
base64enc 0.1-3
BH 1.72.0-3
bit 1.1-15.2
bit64 0.9-7
blob 1.2.1
boot 1.3-24
brew 1.0-6
broom 0.5.6
callr 3.4.3
caret 6.0-86
cellranger 1.1.0
chron 2.3-55
class 7.3-16
cli 2.0.2
clipr 0.7.0
cluster 2.1.0
codetools 0.2-16
colorspace 1.4-1
commonmark 1.7
compiler 3.6.3
config 0.3
covr 3.5.0
crayon 1.3.4
crosstalk 1.1.0.1
curl 4.3
data.table 1.12.8
datasets 3.6.3
DBI 1.1.0
dbplyr 1.4.3
desc 1.2.0
devtools 2.3.0
digest 0.6.25
dplyr 0.8.5
DT 0.13
ellipsis 0.3.0
evaluate 0.14
fansi 0.4.1
farver 2.0.3
fastmap 1.0.1
forcats 0.5.0
foreach 1.5.0
foreign 0.8-75
forge 0.2.0
fs 1.4.1
generics 0.0.2
ggplot2 3.3.0
gh 1.1.0
git2r 0.26.1
glmnet 3.0-2
globals 0.12.5
glue 1.4.0
gower 0.2.1
graphics 3.6.3
grDevices 3.6.3
grid 3.6.3
gridExtra 2.3
gsubfn 0.7
gtable 0.3.0
haven 2.2.0
highr 0.8
hms 0.5.3
htmltools 0.4.0
htmlwidgets 1.5.1
httpuv 1.5.2
httr 1.4.1
hwriter 1.3.2
hwriterPlus 1.0-3
ini 0.3.1
ipred 0.9-9
isoband 0.2.1
iterators 1.0.12
jsonlite 1.6.1
KernSmooth 2.23-16
knitr 1.28
labeling 0.3
later 1.0.0
lattice 0.20-41
lava 1.6.7
lazyeval 0.2.2
lifecycle 0.2.0
lubridate 1.7.8
magrittr 1.5
markdown 1.1
MASS 7.3-51.5
Matrix 1.2-18
memoise 1.1.0
methods 3.6.3
mgcv 1.8-31
mime 0.9
ModelMetrics 1.2.2.2
modelr 0.1.6
munsell 0.5.0
nlme 3.1-144
nnet 7.3-13
numDeriv 2016.8-1.1
openssl 1.4.1
parallel 3.6.3
pillar 1.4.3
pkgbuild 1.0.6
pkgconfig 2.0.3
pkgload 1.0.2
plogr 0.2.0
plyr 1.8.6
praise 1.0.0
prettyunits 1.1.1
pROC 1.16.2
processx 3.4.2
prodlim 2019.11.13
progress 1.2.2
promises 1.1.0
proto 1.0.0
ps 1.3.2
purrr 0.3.4
r2d3 0.2.3
R6 2.4.1
randomForest 4.6-14
rappdirs 0.3.1
rcmdcheck 1.3.3
RColorBrewer 1.1-2
Rcpp 1.0.4.6
readr 1.3.1
readxl 1.3.1
recipes 0.1.10
rematch 1.0.1
rematch2 2.1.1
remotes 2.1.1
reprex 0.3.0
reshape2 1.4.4
rex 1.2.0
rjson 0.2.20
rlang 0.4.5
rmarkdown 2.1
RODBC 1.3-16
roxygen2 7.1.0
rpart 4.1-15
rprojroot 1.3-2
Rserve 1.8-6
RSQLite 2.2.0
rstudioapi 0.11
rversions 2.0.1
rvest 0.3.5
scales 1.1.0
selectr 0.4-2
sessioninfo 1.1.1
shape 1.4.4
shiny 1.4.0.2
sourcetools 0.1.7
sparklyr 1.2.0
SparkR 3.0.0
spatial 7.3-11
splines 3.6.3
sqldf 0.4-11
SQUAREM 2020.2
stats 3.6.3
stats4 3.6.3
stringi 1.4.6
stringr 1.4.0
survival 2.44-1.1
sys 3.3
tcltk 3.6.3
TeachingDemos 2.10
testthat 2.3.2
tibble 3.0.1
tidyr 1.0.2
tidyselect 1.0.0
tidyverse 1.3.0
timeDate 3043.102
tinytex 0.22
tools 3.6.3
usethis 1.6.0
utf8 1.1.4
utils 3.6.3
vctrs 0.2.4
viridisLite 0.3.0
whisker 0.4
withr 2.2.0
xfun 0.13
xml2 1.3.1
xopen 1.0.0
xtable 1.8-4
yaml 2.2.1