Hugging face code throwing error when running using vertex-ai

271 views

Skip to first unread message

Pratik

unread,

Mar 15, 2022, 6:59:35 AM3/15/22

to cloud-nl-discuss

I am training a NLP Hugging face model in vertex-ai with custom image.

The same code works in local machine.

Here is my code and the error.

import torch
import numpy as np
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification
import transformers as tr
from sentence_transformers import SentenceTransformer
from transformers import XLMRobertaTokenizer, XLMRobertaForMaskedLM
from transformers import AdamW
from transformers import AutoTokenizer
from transformers import BertTokenizerFast as BertTokenizer, BertModel, AdamW, get_linear_schedule_with_warmup,BertForMaskedLM
from transformers import DataCollatorForLanguageModeling
from scipy.special import softmax
import scipy
import random
import pickle
import os

print("package imported completed")

os.environ['TRANSFORMERS_OFFLINE']='1'
os.environ['HF_MLFLOW_LOG_ARTIFACTS']='TRUE'

print("env setup completed")
print( tr.__version__)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print("Using", device)
torch.backends.cudnn.deterministic = True

tr.trainer_utils.set_seed(0)

print("here")

tokenizer = tr.XLMRobertaTokenizer.from_pretrained("xlm-roberta-large",local_files_only=True)
model = tr.XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-large", return_dict=True,local_files_only=True)

model.to(device)
print("Model loaded successfully")

df=pd.read_csv("gs://****bucket***/data.csv")
print("read csv")
# ,engine='openpyxl',sheet_name="master_data"
train_df=df.text.tolist()
print(len(train_df))

train_df=list(set(train_df))
train_df = [x for x in train_df if str(x) != 'nan']
train_df=train_df[:50]

print("Length of training data is \n ",len(train_df))
print("DATA LOADED successfully")

train_encodings = tokenizer(train_df, truncation=True, padding=True, max_length=512, return_tensors="pt")
print("encoding done")

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
print("data collector done")

class SEDataset(torch.utils.data.Dataset):
def __init__(self, encodings):
self.encodings = encodings

def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
return item

def __len__(self):
return len(self.encodings["attention_mask"])

train_data = SEDataset(train_encodings)

print("train data created")

training_args = tr.TrainingArguments(
output_dir='gs://****bucket***/results_mlm_exp1',
overwrite_output_dir=True,
num_train_epochs=2,
per_device_train_batch_size=4,
# per_device_train_batch_size
# per_gpu_train_batch_size
prediction_loss_only=True
# ,save_strategy="epoch"
# ,run_name="MLM_Exp1"
,learning_rate=2e-5
# logging_dir='gs://****bucket***/logs_mlm_exp1', # directory for storing logs
# logging_steps=32000,
)

trainer = tr.Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=train_data,
)
print("training to start")
trainer.train()
print("model training finished")
trainer.save_model("gs://****bucket***/model_mlm_exp1")

print("training finished")

The error that I get is:

None INFO train data created
None INFO training to start
None ERROR 0%| | 0/8 [00:00<?, ?it/s]train.py:70: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
None ERROR item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
None ERROR /opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
None ERROR warnings.warn('Was asked to gather along dimension 0, but all '
/var/sitecustomize/sitecustomize.py INFO None
None ERROR 0%| | 0/8 [00:09<?, ?it/s]

Most of them are warning but still my code stops with error.

tielve

unread,

Mar 16, 2022, 4:42:02 AM3/16/22

to cloud-nl-discuss

Hello,

Please remind that Google Groups are reserved for general product discussion, StackOverflow for technical questions whereas Issue Tracker for product issues (unexpected behaviors) and feature requests, see [1].

I can see that you already raised your issue on Stackoverflow (reference [2]).

But, if you have reasons to believe that this could be an unexpected behavior from Vertex AI, you could you can create an issue report on issue-tracker [3]. But please remind to provide reproduction steps on how you run your code on Vertex AI, custom image specs,... And if possible, mock/public data to train on.

Best regards,

Kevin
__________________
[1] https://cloud.google.com/support/docs/community
[2] https://stackoverflow.com/questions/71480531/hugging-face-code-throwing-error-when-running-using-vertex-ai-in-gcp

[3] https://cloud.google.com/support/docs/issue-trackers

Reply all

Reply to author

Forward

0 new messages