Collecting logs in Azure Databricks

Reading Time: 3 minutes

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. In this blog, we are going to see how we can collect logs from Azure Databricks into Azure Log Analytics (ALA). Before going further, we need to look at how to set up a Spark cluster in Azure.

Create a Spark cluster in Databricks

  1. In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace.
  2. You are redirected to the Azure Databricks portal. From the portal, click New Cluster.
  3. Under Advanced Options, click the Init Scripts tab. Under the Destination dropdown, select DBFS and enter dbfs:/databricks/spark-monitoring/spark-monitoring.sh in the text box, then click Add. This spark-monitoring.sh script is explained later in this blog.

Run a Spark SQL job

  1. In the left pane, select Azure Databricks. From the Common Tasks, select New Notebook.
  2. In the Create Notebook dialog box, enter a name, select a language, and select the Spark cluster that you created earlier.

Create a notebook

  1. Click the Workspace button.
  2. In the Create Notebook dialog, enter a name and select the notebook’s default language.
  3. If there are running clusters, the Cluster drop-down displays them. Select the cluster to attach the notebook to.

Adding a Logger to the Databricks Notebook
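
The cell below writes a cluster-named init script (here for a cluster called dev-heb-spark-cluster; substitute your own cluster name) that appends a LogAnalyticsAppender configuration to the executors’ log4j.properties: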

dbutils.fs.put("/databricks/init/dev-heb-spark-cluster/verbose_logging.sh", """
#!/bin/bash
echo "log4j.appender.A1=com.microsoft.pnp.logging.loganalytics.LogAnalyticsAppender
log4j.appender.A1.layout=com.microsoft.pnp.logging.JSONLayout
log4j.appender.A1.layout.LocationInfo=false
log4j.additivity.com.knoldus.pnp.samplejob=false
log4j.logger.com.knoldus.pnp.samplejob=INFO, A1" >> /home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties
""", true)

import com.microsoft.pnp.logging.Log4jConfiguration
import org.apache.spark.internal.Logging
import org.apache.spark.metrics.UserMetricsSystems
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}
import org.apache.log4j.Logger

object SumNumbers extends Logging {
  private final val METRICS_NAMESPACE = "SumNumbers"
  private final val COUNTER_NAME = "counter1"

  // An explicit log4j logger, in addition to the logX helpers from the Logging trait
  val logger = Logger.getLogger(getClass.getName)

  def computeSumOfNumbersFromOneTo(value: Long, spark: SparkSession): Long = {

  //  Log4jConfiguration.configure("/databricks/spark-monitoring/log4j.properties")

    logger.info("Hello There")

    // One message per log level; which of these are forwarded depends on the
    // log level configured for the appender in log4j.properties
    logTrace("data testing")
    logDebug("data testing")
    logInfo("data testing")
    logWarning("data testing")
    logError("data testing")

    // Register a custom counter with the library's metrics system and increment it
    val driverMetricsSystem = UserMetricsSystems
        .getMetricSystem(METRICS_NAMESPACE, builder => {
          builder.registerCounter(COUNTER_NAME)
        })
    driverMetricsSystem.counter(COUNTER_NAME).inc()

    val sumOfNumbers = spark.range(value + 1).reduce(_ + _)
    driverMetricsSystem.counter(COUNTER_NAME).inc(5)
    sumOfNumbers
  }
}
SumNumbers.computeSumOfNumbersFromOneTo(100, spark)
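
When this cell runs, the log4j messages are forwarded by the LogAnalyticsAppender and the counter goes through the library’s metrics system; with the spark-monitoring library these typically land in the custom tables SparkLoggingEvent_CL and SparkMetric_CL (names based on the library’s defaults).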

Now that the notebook is set up, let’s configure the cluster to send logs to an Azure Log Analytics workspace.

For that, we will create a Log Analytics workspace in Azure.

Deploy the spark-monitoring library to the Azure Databricks cluster using https://github.com/mspnp/spark-monitoring

Clone or download this GitHub repository https://github.com/mspnp/spark-monitoring.git

Install the Azure Databricks CLI.

  • A personal access token is required to use the CLI. For instructions, see token management.
  • You can also use the CLI from the Azure Cloud Shell. A minimal install-and-configure sketch follows this list.
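
As a sketch, installing and configuring the legacy Databricks CLI from a shell looks like this (the configure step prompts for your workspace URL and the personal access token mentioned above):

# Install the Databricks CLI
pip install databricks-cli

# Pair it with your workspace using a personal access token
databricks configure --token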

Build the Azure Databricks monitoring library using Docker

Linux:

chmod +x spark-monitoring/build.sh
docker run -it --rm -v `pwd`/spark-monitoring:/spark-monitoring -v "$HOME/.m2":/root/.m2 maven:3.6.1-jdk-8 /spark-monitoring/build.sh

Windows:

docker run -it --rm -v %cd%/spark-monitoring:/spark-monitoring -v "%USERPROFILE%/.m2":/root/.m2 maven:3.6.1-jdk-8 /spark-monitoring/build.sh
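
Either command builds the monitoring JAR files into the spark-monitoring/src/target folder, which we copy to DBFS in the next section.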

Configure the Azure Databricks workspace

Copy the JAR files and init scripts to Databricks.

  1. Use the Azure Databricks CLI to create a directory named dbfs:/databricks/spark-monitoring:
     dbfs mkdirs dbfs:/databricks/spark-monitoring
  2. Open the /src/spark-listeners/scripts/spark-monitoring.sh script file and add your Log Analytics Workspace ID and Key to the lines below:
     export LOG_ANALYTICS_WORKSPACE_ID=
     export LOG_ANALYTICS_WORKSPACE_KEY=
  3. Use the Azure Databricks CLI to copy /src/spark-listeners/scripts/spark-monitoring.sh to the directory created in step 1:
     dbfs cp <local path to spark-monitoring.sh> dbfs:/databricks/spark-monitoring/spark-monitoring.sh
  4. Use the Azure Databricks CLI to copy all of the JAR files from the spark-monitoring/src/target folder to the directory created in step 1 (a quick verification sketch follows this list):
     dbfs cp --overwrite --recursive <local path to target folder> dbfs:/databricks/spark-monitoring/
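
To verify that the script and JARs landed where the init script expects them, you can list the directory with the same CLI:

dbfs ls dbfs:/databricks/spark-monitoring/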

Now everything is set up, and we can query the Log Analytics workspace to get the logs.

Event | search "error"

This query will fetch all the error-level logs of the generated events. Similarly, we can fetch logs for different classes.
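
Since the spark-monitoring library sends driver and executor logs to a custom table in the workspace, the same kind of search can be run against it. Assuming the library’s default table name SparkLoggingEvent_CL:

SparkLoggingEvent_CL | search "error"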

References

  1. https://github.com/mspnp/spark-monitoring
  2. https://blog.knoldus.com/unveiling-the-mystery-of-serverless
