Build corpus using CSV

alok tanna

Feb 9, 2017, 11:31:52 AM

to gensim

I am new to Python and need help building the corpus from a CSV file. In the example below, it is built from a JSON file.

import json
from time import time

start = time()

# Business IDs of the restaurants.
ids = ['4bEjOyTaDG24SY5TxsaUNQ', '2e2e7WgqU1BnpxmQL5jbfw', 'zt1TpTuJ6y9n551sw9TaEg',
      'Xhg93cMdemu5pAMkDoEdtQ', 'sIyHTizqAiGu12XMLX3N3g', 'YNQgak-ZLtYJQxlDwN-qIg']

w2v_corpus = []  # Documents to train word2vec on (all 6 restaurants).
wmd_corpus = []  # Documents to run queries against (only one restaurant).
documents = []  # wmd_corpus, with no pre-processing (so we can see the original documents).
with open('/data/yelp_academic_dataset_review.json') as data_file:
    for line in data_file:
        json_line = json.loads(line)
        
        if json_line['business_id'] not in ids:
            # Not one of the 6 restaurants.
            continue
        
        # Pre-process document.
        text = json_line['text']  # Extract text from JSON object.
        text = preprocess(text)
        
        # Add to corpus for training Word2Vec.
        w2v_corpus.append(text)
        
        if json_line['business_id'] == ids[0]:
            # Add to corpus for similarity queries.
            wmd_corpus.append(text)
            documents.append(json_line['text'])

Joe Brennan

Feb 10, 2017, 3:52:47 PM

to gensim

I'm loading my texts from a CSV file, so I'm familiar with this issue.

I first build a list of dictionaries, because each item has three possible text fields to choose from:

import csv

# Collect one dictionary per CSV row.
items = []

# Open the items CSV file and cycle through it.
# (The with block closes the file, so no explicit cleanup is needed.)
with open(s_ItemsFile, newline='', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, dialect='excel')
    # Skip the header row
    next(reader)
    for row in reader:
        an_item = dict(Id=row[0],
                       Text1=row[1],
                       Text2=row[2],
                       Text3=row[3])
        items.append(an_item)

texts = []
counta = 0  # Number of items that had a non-empty text field.

for item in items:
    text = ''
    # Use the first non-empty text field, in order of preference.
    if item['Text1'].strip() != '':
        text = item['Text1'].strip()
        counta = counta + 1
    elif item['Text2'].strip() != '':
        text = item['Text2'].strip()
        counta = counta + 1
    elif item['Text3'].strip() != '':
        text = item['Text3'].strip()
        counta = counta + 1
    # Same preprocess() helper as in the JSON example.
    text = preprocess(text)
    texts.append(text)

Then you can go ahead and create your Dictionary from the texts.
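
For the original question, the JSON loop can probably be carried over almost line for line with csv.DictReader. The following is only a rough sketch under some assumptions: the CSV has a header row with columns named business_id and text (mirroring the JSON keys, not confirmed anywhere in this thread), the sample data is made up, and preprocess() here is a simple placeholder standing in for whatever preprocessing you already use.

```python
import csv
import io

# Made-up sample data standing in for a real CSV file. The column names
# 'business_id' and 'text' are assumptions that mirror the JSON keys.
SAMPLE_CSV = """business_id,text
4bEjOyTaDG24SY5TxsaUNQ,"Great food, friendly staff."
2e2e7WgqU1BnpxmQL5jbfw,Slow service but tasty pizza.
not-a-target-id,This review is skipped.
4bEjOyTaDG24SY5TxsaUNQ,Would come back again.
"""

# Business IDs to keep (shortened to two for the sketch).
ids = ['4bEjOyTaDG24SY5TxsaUNQ', '2e2e7WgqU1BnpxmQL5jbfw']


def preprocess(text):
    # Placeholder only: lowercase and split on whitespace.
    # Swap in your own tokenization and stop-word removal.
    return text.lower().split()


def build_corpora(csv_file, ids):
    """Read rows from an open CSV file and build the three lists."""
    w2v_corpus = []  # Documents to train word2vec on (all listed IDs).
    wmd_corpus = []  # Documents to run queries against (first ID only).
    documents = []   # wmd_corpus without pre-processing.
    for row in csv.DictReader(csv_file):
        if row['business_id'] not in ids:
            # Not one of the restaurants we care about.
            continue
        tokens = preprocess(row['text'])
        w2v_corpus.append(tokens)
        if row['business_id'] == ids[0]:
            wmd_corpus.append(tokens)
            documents.append(row['text'])
    return w2v_corpus, wmd_corpus, documents


w2v_corpus, wmd_corpus, documents = build_corpora(io.StringIO(SAMPLE_CSV), ids)
```

With a real file you would pass open(path, newline='', encoding='utf-8') in place of the io.StringIO object, inside a with block as in the code above.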
