Many small Spark jobs

If you see many small jobs, it's likely that you're running many operations on relatively small data (less than about 10 GB). Each operation takes only a few seconds, but the operations add up, and so does the per-operation overhead.

The best approach to speeding up small jobs is to run multiple operations in parallel. Delta Live Tables does this for you automatically.

Other options include:

  • Separate your operations into multiple notebooks and run them in parallel on the same cluster by using multi-task jobs.

  • Use Python’s ThreadPoolExecutor or another multi-threading approach to run queries in parallel (see the sketch after this list).

  • Use SQL warehouses if all of your queries are written in SQL. SQL warehouses scale well when many queries run in parallel, because they were designed for this type of workload.
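
For the multi-threading option, here is a minimal sketch using Python's ThreadPoolExecutor. The table names and the `region` column are placeholders for your own data, and `spark` is the SparkSession that Databricks notebooks provide automatically.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder list of small tables to process; replace with your own.
table_names = ["sales_2021", "sales_2022", "sales_2023"]

def summarize(table_name):
    # "region" is a placeholder column. Each collect() triggers its
    # own Spark job.
    return (
        spark.table(table_name)
        .groupBy("region")
        .count()
        .collect()
    )

# Run up to 8 queries at a time; tune max_workers to your cluster size.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(summarize, table_names))
```

Because the queries are submitted from separate threads, the cluster can schedule their jobs concurrently instead of running them one after another, which hides much of the per-operation overhead.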