Track training loss to Azure when running file locally


Frank

Jun 16, 2022, 12:35:24 PM
to mlflow-users
Hi all :)

I have followed the official Microsoft tutorial (https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow-cli-runs?tabs=mlflow) to log parameters and metrics to the Azure workspace when I run something locally. That works just fine.

However, what I want to do is run a train.py file via os.system, like
os.system("python ./src/train.py --data_path ./data"), instead of the command used in the tutorial.

This works fine in general. However, inside the train.py file I cannot access the Azure Machine Learning experiment that I created in the file which issues the os.system call. As a result, I am not able to track the loss from train.py to Azure.

Do you have any suggestions on how I could solve this?

Thanks a lot already!

Best regards

Facundo Santiago

Jun 18, 2022, 12:01:28 PM
to mlflow-users
Hi Frank,
The os.system() call starts a new process in a separate shell; consequently, any information you want to make available to the subprocess needs to be passed as arguments (to keep things simple). On top of that, if you are using virtual environments to install Python packages, the libraries you installed won't be available there (I suspect this is why you can't access Azure ML: the other shell doesn't have azureml-mlflow installed). To give a concrete answer I would need a few more details about what you are doing. Generally speaking:
  • If the train.py file handles the whole training process, then that file should be the one configuring the experiment and doing all the logging. Don't run it in a separate shell. If you want to separate responsibilities into different files (say, one file for configuring the experiment and another for the training, which I wouldn't recommend, but still), then use Python's import mechanism.
  • If you are creating multiple training executions in separate processes (for instance, for hyper-parameter tuning, or because you train multiple "things"), then you need to use the concept of nested runs (or child runs).
Hope it helps,
Facundo.

Frank

Jun 20, 2022, 3:27:30 AM
to mlflow-users
Hi Facundo,

Thank you very much for your fast and detailed reply! I understand the problem with os.system() now.

My main idea is that for smaller tasks I sometimes want to train my network locally on my computer, while for bigger tasks I would like to train it with Azure Machine Learning. Therefore, I thought it would be best to have one train.py script, which I could call either from a run-on-azure.py file or from a run-locally.py file. In the train.py script, I use mlflow.log_metric() to log my loss.

In the run-on-azure.py file, I use the "command" function (Azure ML v2) to start the compute cluster with the train.py script. This automatically creates a new run in the experiment and successfully tracks the loss using MLflow.
I was therefore looking for a way to start the train.py script locally with something similar to the "command" function from Azure ML v2. In run-locally.py I would set a tracking URI so that the local run is still tracked on Azure using MLflow, while everything is computed locally.

Would you solve this differently? Or do you see a way to use the run-locally.py file to start the train.py script while still tracking the metrics?

Thank you so much for your help!

Best regards

Facundo Santiago

Jun 20, 2022, 1:08:57 PM
to mlflow-users
Hi Frank,
Thanks for the detailed scenario. It makes a lot of sense to me, and MLflow is indeed the right tool to make code portable from local to cloud compute. I would solve the problem in a different way. Basically:
  • Keep one train.py file that leverages MLflow for tracking and logging. Any data path needed by the training script should be passed as a parameter.
  • For running the file locally
    • Configure the environment variable MLFLOW_TRACKING_URI to point to the workspace you are using.
    • Run your jobs by invoking Python: "python train.py --training-data-path /my/path/to/where/data/is"
  • For running the file on Azure:
      • Create a job definition YAML file using:
        • Inputs for indicating the data paths needed by your script
        • In the "command" section of the file, put the same thing you run locally, but replace the data paths with the "inputs" specified above. Paths will then be resolved to cloud locations. For instance, "python train.py --training-data-path ${{ inputs.training_data_path }}".
        • Use compute clusters for running the job, and configure the compute cluster with a minimum number of nodes = 0. That will do exactly what you are doing manually: turning the compute cluster on when there are jobs to run, and turning it off once there are none.
      • Submit jobs using the Azure ML CLI v2: "az ml job create -f train.yml".
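For reference, a job definition along those lines might look roughly like this (a sketch only; the environment, compute, and datastore path names are placeholders, not something from this thread):

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: python train.py --training-data-path ${{inputs.training_data_path}}
inputs:
  training_data_path:
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/my-data
environment: azureml:my-training-env@latest
compute: azureml:my-cpu-cluster
experiment_name: pytorch-test
```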
I actually have a repository that demonstrates how to do this: https://github.com/santiagxf/mlproject-sample. The repo is about something different (how to structure an ML project), but it does the same thing you are trying to achieve (which is a good practice, by the way).

Hope it helps,
Facundo.

Frank

Jun 21, 2022, 8:36:51 AM
to mlflow-users
Hi Facundo,

Thanks again! I have managed to get my code running on Azure using the job definition YAML file as you proposed.
However, I am still struggling to use the same train.py file both locally and on Azure.

If I place the following code inside the train.py file, I can run train.py locally with the command you proposed, and everything is logged on Azure just as it should be.

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)
tracking_uri = ml_client.workspaces.get(name=workspace).mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)
experiment_name = 'pytorch-test-yml'
mlflow.set_experiment(experiment_name)

However, if I have this code snippet inside my train.py file, I can no longer run train.py on Azure using the YAML file. I get some errors on Azure about the credentials that I do not understand. I have included a screenshot of the error.

Is your idea to place this code snippet inside the train.py file, or where do you mean to place it? If I understood correctly, I do not need this code when I run with the YAML file on Azure, but I do need it when I run the file locally.
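One way to keep that snippet out of the Azure path (an assumption on my part, not something from the tutorial) would be to guard it: an Azure ML job injects MLFLOW_TRACKING_URI into the job environment, so the MLClient/DefaultAzureCredential block should only run locally, where that variable is absent. A minimal sketch of such a guard:

```python
import os

def needs_local_tracking_setup(environ=None) -> bool:
    """True when train.py must configure the MLflow tracking URI itself.

    Azure ML jobs inject MLFLOW_TRACKING_URI into the job environment,
    so the MLClient/DefaultAzureCredential snippet should only run for
    local executions, where that variable is not set.
    """
    env = os.environ if environ is None else environ
    return "MLFLOW_TRACKING_URI" not in env
```

In train.py you would then wrap the workspace snippet in "if needs_local_tracking_setup(): ...", so the same file works in both places.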

Best regards and thank you so much!
Bildschirmfoto 2022-06-21 um 14.33.42.png