Hi, I am trying to conduct sentiment analysis on the Sentiment140 dataset using SVM and Naive Bayes, and I am currently stuck on preprocessing the data. Below is the preprocessing function I have used successfully on smaller datasets, but on Sentiment140 it runs for upwards of 4 hours before crashing. Is this normal, and does anyone have tips on how to reduce the computation time? Thank you!
import html
import re

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, word_tokenize

tk = TweetTokenizer()  # assumed: `tk` is used below but was not defined in the original snippet

def flatten(l):
    # collapse a list of lists into a single flat list
    return [item for sublist in l for item in sublist]
def preprocessor(df):
    for i in range(len(df)):
        x = df['text'][i].replace('\n', ' ')  # clean newline '\n' characters from the tweets
        df['text'][i] = html.unescape(x)      # decode HTML entities such as &amp;
    for i in range(len(df)):
        # remove <br /><br />, @mentions, #hashtags, punctuation, and URLs
        df['text'][i] = re.sub(r'<br /><br />|(@[A-Za-z0-9_]+)|(#[A-Za-z0-9_]+)|[^\w\s]|http\S+', ' ', df['text'][i])
    tweets_to_token = df['text']
    sw = stopwords.words('english')  # you can adjust the language as you desire
    sw.remove('not')  # keep 'not', since removing it from the text would change its meaning
    for i in range(len(tweets_to_token)):
        tweets_to_token[i] = word_tokenize(tweets_to_token[i])  # word-tokenize the tweet
        for token in tweets_to_token[i]:
            tweets_to_token[i] = tk.tokenize(token)
        flatten(tweets_to_token[i])
    for i in range(len(tweets_to_token)):
        # join the token list back into a string to fit the input format for CountVectorizer()
        tweets_to_token[i] = ' '.join([word for word in tweets_to_token[i] if word not in sw])
    return tweets_to_token
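
For reference, this is the direction I have been considering for a rewrite: a rough sketch that replaces the row-by-row loops with pandas .str methods and a single .apply per tweet, keeping the same cleaning rules. The name preprocessor_vectorized and the use of TweetTokenizer are my own placeholders, and I have not verified that it produces identical output:

import html
import re

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

def preprocessor_vectorized(df):
    sw = set(stopwords.words('english'))  # set membership is O(1) vs. O(n) for a list
    sw.discard('not')                     # keep 'not' so negations are preserved
    tk = TweetTokenizer()

    text = df['text'].str.replace('\n', ' ', regex=False)  # strip newlines
    text = text.apply(html.unescape)                        # decode HTML entities
    text = text.str.replace(
        r'<br /><br />|(@[A-Za-z0-9_]+)|(#[A-Za-z0-9_]+)|[^\w\s]|http\S+',
        ' ', regex=True)                                    # drop mentions, hashtags, URLs, punctuation
    # tokenize each tweet once, drop stopwords, and join back into a string for CountVectorizer()
    return text.apply(lambda t: ' '.join(w for w in tk.tokenize(t) if w not in sw))

My guess is that avoiding the repeated element-wise DataFrame writes and the re-tokenizing inside the inner loop is where most of the time would be saved, but I would appreciate confirmation.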