Hi, I am trying to conduct sentiment analysis on the Sentiment140 dataset using SVM and Naive Bayes, and I am currently stuck on preprocessing the data. Below is the preprocessing function I have used successfully on smaller datasets, but on Sentiment140 it runs for upwards of 4 hours before crashing. Is this normal, and does anyone have tips on how to reduce the computation time? Thank you!
import html
import re

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, word_tokenize

tk = TweetTokenizer()  # assumed: `tk` is used below but was not defined in the original snippet

def flatten(l):
    # collapse a list of lists into a single flat list
    return [item for sublist in l for item in sublist]
def preprocessor(df):
    for i in range(len(df)):
        x = df['text'][i].replace('\n', ' ')  # clean newline '\n' characters from the tweets
        df['text'][i] = html.unescape(x)      # decode HTML entities such as &amp;
    for i in range(len(df)):
        # remove <br /><br />, @mentions, #hashtags, punctuation, and URLs
        df['text'][i] = re.sub(r'<br /><br />|(@[A-Za-z0-9_]+)|(#[A-Za-z0-9_]+)|[^\w\s]|http\S+', ' ', df['text'][i])
    tweets_to_token = df['text']
    sw = stopwords.words('english')  # you can adjust the language as you desire
    sw.remove('not')  # keep 'not', since removing it from the text would change its meaning
    for i in range(len(tweets_to_token)):
        tweets_to_token[i] = word_tokenize(tweets_to_token[i])  # word-tokenize the tweet
        for token in tweets_to_token[i]:
            tweets_to_token[i] = tk.tokenize(token)
        flatten(tweets_to_token[i])
    for i in range(len(tweets_to_token)):
        # join the token list back into a string to fit the input format for CountVectorizer()
        tweets_to_token[i] = ' '.join([word for word in tweets_to_token[i] if word not in sw])
    return tweets_to_token
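
For reference, this is the direction I have been considering for a rewrite: a rough sketch that replaces the row-by-row loops with pandas .str methods and a single .apply per tweet, keeping the same cleaning rules. The name preprocessor_vectorized and the use of TweetTokenizer are my own placeholders, and I have not verified that it produces identical output:

import html
import re

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

def preprocessor_vectorized(df):
    sw = set(stopwords.words('english'))  # set membership is O(1) vs. O(n) for a list
    sw.discard('not')                     # keep 'not' so negations are preserved
    tk = TweetTokenizer()

    text = df['text'].str.replace('\n', ' ', regex=False)  # strip newlines
    text = text.apply(html.unescape)                        # decode HTML entities
    text = text.str.replace(
        r'<br /><br />|(@[A-Za-z0-9_]+)|(#[A-Za-z0-9_]+)|[^\w\s]|http\S+',
        ' ', regex=True)                                    # drop mentions, hashtags, URLs, punctuation
    # tokenize each tweet once, drop stopwords, and join back into a string for CountVectorizer()
    return text.apply(lambda t: ' '.join(w for w in tk.tokenize(t) if w not in sw))

My guess is that avoiding the repeated element-wise DataFrame writes and the re-tokenizing inside the inner loop is where most of the time would be saved, but I would appreciate confirmation.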