Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. In this blog, we will see how to collect logs from Azure Databricks and send them to Azure Log Analytics (ALA). Before going further, we need to look at how to set up a Spark cluster in Azure.
Create a Spark cluster in Databricks
- In the Azure portal, go to the Databricks workspace that you created, and then click Launch Workspace
- You are redirected to the Azure Databricks portal. From the portal, click New Cluster
- Under “Advanced Options”, click the “Init Scripts” tab. Go to the last line under the “Init Scripts” section. Under the “destination” dropdown, select “DBFS”. Enter “dbfs:/databricks/spark-monitoring/spark-monitoring.sh” in the text box and click the “add” button. This script is explained later in this blog
Run a Spark SQL job
- In the left pane, select Azure Databricks. From the Common Tasks, select New Notebook
- In the Create Notebook dialog box, enter a name, select language, and select the Spark cluster that you created earlier
Create a notebook
- Click the Workspace button
- In the Create Notebook dialog, enter a name and select the notebook’s default language
- If there are running clusters, the Cluster drop-down displays. Select the cluster
Adding a Logger to the Databricks Notebook
dbutils.fs.put("/databricks/init/dev-heb-spark-cluster/verbose_logging.sh", """
#!/bin/bash
echo "log4j.appender.A1=com.microsoft.pnp.logging.loganalytics.LogAnalyticsAppender
log4j.appender.A1.layout=com.microsoft.pnp.logging.JSONLayout
log4j.appender.A1.layout.LocationInfo=false
log4j.additivity.com.knoldus.pnp.samplejob=false
log4j.logger.com.knoldus.pnp.samplejob=INFO, A1" >> /home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties
""", true)
import com.microsoft.pnp.logging.Log4jConfiguration
import org.apache.spark.internal.Logging
import org.apache.spark.metrics.UserMetricsSystems
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}
import org.apache.log4j.Logger
object SumNumbers extends Logging {
  private final val METRICS_NAMESPACE = "SumNumbers"
  private final val COUNTER_NAME = "counter1"
  // Plain log4j logger, used alongside the log* helpers inherited from Spark's Logging trait
  val test = Logger.getLogger(getClass.getName)

  def computeSumOfNumbersFromOneTo(value: Long, spark: SparkSession): Long = {
    // Log4jConfiguration.configure("/databricks/spark-monitoring/log4j.properties")
    test.info("Hello There")
    // Emit one message at each log level
    logTrace("data testing")
    logDebug("data testing")
    logInfo("data testing")
    logWarning("data testing")
    logError("data testing")

    // Register a custom counter under the SumNumbers metrics namespace
    val driverMetricsSystem = UserMetricsSystems
      .getMetricSystem(METRICS_NAMESPACE, builder => {
        builder.registerCounter(COUNTER_NAME)
      })
    driverMetricsSystem.counter(COUNTER_NAME).inc
    val sumOfNumbers = spark.range(value + 1).reduce(_ + _)
    driverMetricsSystem.counter(COUNTER_NAME).inc(5)
    sumOfNumbers
  }
}

SumNumbers.computeSumOfNumbersFromOneTo(100, spark)
Now that the notebook is set up, let’s configure the cluster to send logs to an Azure Log Analytics workspace.
For that, we first need to create a Log Analytics workspace in Azure.
Deploy the Spark monitoring library to the Azure Databricks cluster using https://github.com/mspnp/spark-monitoring
Clone or download the GitHub repository https://github.com/mspnp/spark-monitoring.git
Install the Azure Databricks CLI.
- A personal access token is required to use the CLI. For instructions, see token management.
- You can also use the CLI from the Azure Cloud Shell
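Before moving on, it can help to confirm that the CLI is actually available. The snippet below is a minimal sketch, assuming the CLI was installed from the `databricks-cli` pip package, which provides the `databricks` and `dbfs` commands used in this post. The function name `check_databricks_cli` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the CLI commands used in this post are on PATH.
# Returns 0 when both are found, 1 otherwise.
check_databricks_cli() {
  local missing=0
  local cmd
  for cmd in databricks dbfs; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "found: $cmd"
    else
      # Assumption: the CLI is installed from the databricks-cli pip package.
      echo "missing: $cmd (try: pip install databricks-cli)"
      missing=1
    fi
  done
  return "$missing"
}

if check_databricks_cli; then
  echo "Next: run 'databricks configure --token' and paste your personal access token."
fi
```

If both commands are found, `databricks configure --token` prompts for the workspace URL and the personal access token mentioned above.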
Build the Azure Databricks monitoring library using Docker
Linux:
chmod +x spark-monitoring/build.sh
docker run -it --rm -v `pwd`/spark-monitoring:/spark-monitoring -v "$HOME/.m2":/root/.m2 maven:3.6.1-jdk-8 /spark-monitoring/build.sh
Windows:
docker run -it --rm -v %cd%/spark-monitoring:/spark-monitoring -v "%USERPROFILE%/.m2":/root/.m2 maven:3.6.1-jd
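Once the build finishes, a quick check that JAR files were produced can save a failed upload later. This is only a sketch: it assumes the JARs land directly in spark-monitoring/src/target (the folder named in the copy step of this post), and `list_built_jars` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JAR files found directly under a build output
# directory. Returns non-zero when none are found.
list_built_jars() {
  local target_dir="$1"
  local jars
  jars="$(find "$target_dir" -maxdepth 1 -type f -name '*.jar' 2>/dev/null)"
  if [ -z "$jars" ]; then
    echo "no JAR files in $target_dir; did the build finish?" >&2
    return 1
  fi
  echo "$jars"
}

# Assumed output folder; adjust the path if your checkout differs.
list_built_jars "spark-monitoring/src/target" || true
```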
Configure the Azure Databricks workspace
Copy the JAR files and init scripts to Databricks.
- Use the Azure Databricks CLI to create a directory named dbfs:/databricks/spark-monitoring:
  dbfs mkdirs dbfs:/databricks/spark-monitoring
- Open the /src/spark-listeners/scripts/spark-monitoring.sh script file and add your Log Analytics Workspace ID and Key to the lines below:
  export LOG_ANALYTICS_WORKSPACE_ID=
  export LOG_ANALYTICS_WORKSPACE_KEY=
- Use the Azure Databricks CLI to copy /src/spark-listeners/scripts/spark-monitoring.sh to the directory created in the first step:
  dbfs cp <local path to spark-monitoring.sh> dbfs:/databricks/spark-monitoring/spark-monitoring.sh
- Use the Azure Databricks CLI to copy all of the JAR files from the spark-monitoring/src/target folder to the same directory:
  dbfs cp --overwrite --recursive <local path to target folder> dbfs:/databricks/spark-monitoring/
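The manual edit of the two export lines can also be scripted. The following is a rough sketch, not something the spark-monitoring repository provides: `set_workspace_credentials` is a hypothetical helper, it assumes both export lines start at the beginning of a line as shown above, and the ID and key in the demo are placeholders. The demo runs against a throwaway stub file, so nothing real is modified.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the spark-monitoring repo): rewrite the two
# export lines in a local copy of spark-monitoring.sh.
# Usage: set_workspace_credentials <script path> <workspace id> <workspace key>
set_workspace_credentials() {
  local script="$1" workspace_id="$2" workspace_key="$3"
  # '|' is used as the sed delimiter because base64 keys may contain '/'.
  sed -i.bak \
    -e "s|^export LOG_ANALYTICS_WORKSPACE_ID=.*|export LOG_ANALYTICS_WORKSPACE_ID=${workspace_id}|" \
    -e "s|^export LOG_ANALYTICS_WORKSPACE_KEY=.*|export LOG_ANALYTICS_WORKSPACE_KEY=${workspace_key}|" \
    "$script"
  rm -f "${script}.bak"
}

# Demo on a stub file containing only the two export lines; values are placeholders.
demo_script="$(mktemp)"
printf '%s\n' \
  'export LOG_ANALYTICS_WORKSPACE_ID=' \
  'export LOG_ANALYTICS_WORKSPACE_KEY=' > "$demo_script"

set_workspace_credentials "$demo_script" "my-workspace-id" "bXkta2V5Lw=="
cat "$demo_script"
```

Pointing the helper at your local copy of spark-monitoring.sh before running the `dbfs cp` commands above keeps the credentials out of your shell history only if you read them from a file or prompt rather than typing them as arguments.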
Everything is now set up, and you can query the Log Analytics workspace to get the logs.
Event | search "error"
This query returns every record in the Event table that contains the term "error". Similarly, we can search for the logs of different classes.
References
- https://github.com/mspnp/spark-monitoring
- https://blog.knoldus.com/unveiling-the-mystery-of-serverless