Enter your keyword

Training Machine Learning Models in Azure Databricks

Training Databricks ML Model on Azure

Prerequisites

We’ll use Databricks Notebook I’ve created to train a RandomForestRegressor model.

Here are the details about the model and Databricks Notebook referenced above:

  • Code: It is written in Scala, but it can be easily ported to Python
  • Hyperparameters: For simplicity, I did not tune any hyperparameters — fine tuning hyperparameters is highly recommended before deploying models in production environments.
  • Dataset: The model is trained on a classic dataset that contains advertising budgets for media channels — TV, radio, and newspapers — and their sales.
  • Inference: The model is trained to predict sales (number of units sold) based on advertising budgets allocated to TV, radio and newspapers media channels.
  • Model Export: After the model is trained, it’s exported using Databricks ML Model Export (You’ll need to fill in your AWS credentials and uncomment code to save the model in your AWS S3 bucket).
Getting Started in Azure Databricks
STEP 1. Follow the instructions outlined here to upload Advertising dataset. (Note: You don’t need to create a table as long as the file is uploaded and can be accessed at /FileStore/tables/Advertising.csv)

STEP 2. Follow the instructions outlined here to import the Databricks Notebook. (And make sure it is attached to a Spark cluster running in Azure Databricks.)

Before moving on, run all commands/cells in the Notebook to make sure everything checks out and that there are no errors.

STEP 3. Follow the instructions outlined here to create a job.

STEP 4. Follow the instructions outlined here to generate an authentication token. (The auth token will be required to execute the job from StreamSets.)

Execute Databricks ML Job in Azure
Before we look at how to execute this job using StreamSets Databricks Executor, let’s do a quick test using the curl command and Azure Databricks Jobs API.
curl 'https://BASE_AZURE_URL/api/2.0/jobs/run-now' -X POST -H "Authorization: Bearer BEARER_TOKEN" -d "{\"job_id\": JOB_ID}"

Replace BASE_AZURE_URL, BEARER_TOKEN and JOB_ID and execute the command. So in my case, it looked like this:

curl 'https://westus.azuredatabricks.net/api/2.0/jobs/run-now' -X POST -H "Authorization: Bearer dapi53916e671db41XXXXXXXXXXXXXXX" -d "{\"job_id\": 1}"

If all goes well, the JSON response will look something like this:

{"run_id":25,"number_in_job":25}
Execute Databricks ML Job in Azure Using StreamSets Databricks Executor

Now let’s see how to execute the same job using StreamSets Databricks Executor. Assume there’s a dataflow pipeline with a data source/origin, optional processors to perform transformations, a destination and some logic or condition(s) to trigger a task in response to events that occur in the pipeline. In our case, that task is to execute the Databricks ML job in Azure using StreamSets Databricks Executor. (For more information on dataflow triggers, refer to the documentation.)

For simplicity, let’s focus on the following fragments of the dataflow pipeline.

Job tab of StreamSets Databricks Executor:

Where:
Cluster Base URL: Your Azure Databricks Service URL
Job Type: Notebook Job
Job ID: Id of the job created in step 3
Parameters: key = NUM_OF_TREES; value = ${record:value(‘/tune_trees’)}. Note: In this example, its value is dynamically set to what’s in the record field named ‘tune_trees.’

Credentials tab of StreamSets Databricks Executor

Where:
Credential Type: Token
Token: Auth token created in step 4

Running the Pipeline

Assuming all goes well with no errors including processor logic kicking off in response to event(s) configured in the pipeline, the job will start running in Azure Databricks. As a result, the associated Databricks Notebook in Azure will execute all the commands, which will effectively (re)train the RandomForestRegressor model using NUM_OF_TREES as its n_estimators hyperparameter value passed in from StreamSets Databricks Executor. (See Notebook code snippet below from Cmd 12.)

You can view the job transitioning from Pending, Running to Succeeded states in the Jobs interface as shown below.

Summary

In this post, you learned how to execute jobs in Azure Databricks using StreamSets Databricks Executor. In particular, we looked at automating the task of (re)training Databricks ML model using different hyperparameters for evaluating and comparing model accuracies. Note: It goes without saying that training models, evaluating them, model versioning, and serving different versions of the model are not trivial undertakings by any means, and that is not the focus of this post.

If you’re interested in learning how to use trained models to achieve low-latency inference in StreamSets, check out the tech blogs, Low-Latency Inference Using Databricks ML In StreamSets and Real-Time Machine Learning With TensorFlow In Data Collector.

StreamSets Data Collector is open source, under the Apache v2 license.

source:dzone.com

  • Sign up
Lost your password? Please enter your username or email address. You will receive a link to create a new password via email.