Use ADAM in Databricks

ADAM is a library for genomic data processing on Apache Spark. It is used to implement pipelines that operate on genomic read data such as BAM, SAM, and CRAM files.

To use ADAM in Databricks:

  1. Launch a Databricks Runtime cluster with these Spark configurations:

    # Spark and Hadoop configs
    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator org.bdgenomics.adam.serialization.ADAMKryoRegistrator
    spark.hadoop.hadoopbam.bam.enable-bai-splitter true
    
  2. Install these libraries on the cluster:

    • Maven: org.bdgenomics.adam:adam-apis-spark3_2.12:<version>
    • PyPI: bdgenomics.adam
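
Once the cluster is running, it can be useful to confirm that the settings from step 1 actually took effect before you start a long job. The following is a minimal sketch, not part of ADAM or Databricks: the helper name find_conf_mismatches and the dictionary EXPECTED_ADAM_CONFS are hypothetical and introduced here for illustration only. It assumes that spark.serializer is the key under which the Kryo serializer class is set, and that the notebook provides the usual spark session.

```python
# Expected values for the three settings listed in step 1.
# Assumption: the Kryo serializer class is set under the key spark.serializer.
EXPECTED_ADAM_CONFS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.bdgenomics.adam.serialization.ADAMKryoRegistrator",
    "spark.hadoop.hadoopbam.bam.enable-bai-splitter": "true",
}


def find_conf_mismatches(get_conf):
    """Return {key: (expected, actual)} for every setting that differs.

    get_conf is any callable that maps a config key to its current value,
    or to None when the key is not set. (Hypothetical helper.)
    """
    mismatches = {}
    for key, expected in EXPECTED_ADAM_CONFS.items():
        actual = get_conf(key)
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches


# In a notebook attached to the cluster (assumes the predefined `spark` session):
#   conf = spark.sparkContext.getConf()
#   print(find_conf_mismatches(lambda key: conf.get(key, None)))
```

An empty dictionary means all three settings match. Any entry in the result names a setting to correct in the cluster's Spark configuration; because these values are read at cluster startup, restart the cluster after changing them.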