Build corpus using CSV

alok tanna

Feb 9, 2017, 11:31:52 AM
to gensim
I am new to Python and need help building the corpus from a CSV file. The example below builds it from a JSON file.
from time import time
import json

start = time()

# Business IDs of the restaurants.
ids = ['4bEjOyTaDG24SY5TxsaUNQ', '2e2e7WgqU1BnpxmQL5jbfw', 'zt1TpTuJ6y9n551sw9TaEg',
      'Xhg93cMdemu5pAMkDoEdtQ', 'sIyHTizqAiGu12XMLX3N3g', 'YNQgak-ZLtYJQxlDwN-qIg']

w2v_corpus = []  # Documents to train word2vec on (all 6 restaurants).
wmd_corpus = []  # Documents to run queries against (only one restaurant).
documents = []  # wmd_corpus, with no pre-processing (so we can see the original documents).
with open('/data/yelp_academic_dataset_review.json') as data_file:
    for line in data_file:
        json_line = json.loads(line)
        
        if json_line['business_id'] not in ids:
            # Not one of the 6 restaurants.
            continue
        
        # Pre-process document.
        text = json_line['text']  # Extract text from JSON object.
        text = preprocess(text)
        
        # Add to corpus for training Word2Vec.
        w2v_corpus.append(text)
        
        if json_line['business_id'] == ids[0]:
            # Add to corpus for similarity queries.
            wmd_corpus.append(text)
            documents.append(json_line['text'])

Joe Brennan

Feb 10, 2017, 3:52:47 PM
to gensim
I load my texts from a CSV file, so I'm familiar with this issue.
I first build a list of dictionaries, because each item has three candidate text fields and I use the first one that is not empty:

    import csv

    # Creating the lists needed to get data from the items
    items = []

    # Open the items CSV file and cycle through it
    with open(s_ItemsFile, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, dialect='excel')
        # Skip the header row
        next(reader)
        for row in reader:
            an_item = dict(Id = row[0],
                           Text1 = row[1],
                           Text2 = row[2],
                           Text3 = row[3],)
            items.append(an_item)

    # Get rid of the csvfile file object
    del csvfile

    texts = []

    # Number of items that had at least one non-empty text field
    counta = 0
    for item in items:
        # Use the first non-empty field, in order of preference
        text = ''
        if item['Text1'].strip() != '':
            text = item['Text1'].strip()
            counta = counta + 1
        elif item['Text2'].strip() != '':
            text = item['Text2'].strip()
            counta = counta + 1
        elif item['Text3'].strip() != '':
            text = item['Text3'].strip()
            counta = counta + 1

        # preprocess() is the same tokenizing function as in your example
        text = preprocess(text)
        texts.append(text)

Then you can go ahead and create your Dictionary from the texts.
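
For the loop in the original question, a rough CSV equivalent could use csv.DictReader, which keys each row by the header line, so row['text'] plays the role of json_line['text']. This is only a sketch under some assumptions: the column names 'business_id' and 'text' are guesses at what the CSV header contains, build_corpora is a made-up helper name, and the preprocess() here is a trivial stand-in for the real one from the example.

```python
import csv

# Business IDs to keep (the first two from the original example).
ids = ['4bEjOyTaDG24SY5TxsaUNQ', '2e2e7WgqU1BnpxmQL5jbfw']


def preprocess(text):
    # Stand-in only: lowercase and split on whitespace.
    # Replace with the preprocess() function from your own example.
    return text.lower().split()


def build_corpora(csv_path, ids):
    """Build the three lists from a CSV file.

    Assumes the file has a header row with 'business_id' and 'text'
    columns (hypothetical names; adjust to match your file).
    """
    w2v_corpus = []  # Documents to train word2vec on.
    wmd_corpus = []  # Documents to run queries against (first ID only).
    documents = []   # wmd_corpus without pre-processing.
    with open(csv_path, newline='', encoding='utf-8') as csvfile:
        for row in csv.DictReader(csvfile):
            if row['business_id'] not in ids:
                # Not one of the selected restaurants.
                continue

            text = preprocess(row['text'])
            w2v_corpus.append(text)

            if row['business_id'] == ids[0]:
                wmd_corpus.append(text)
                documents.append(row['text'])
    return w2v_corpus, wmd_corpus, documents
```

Each entry in w2v_corpus and wmd_corpus is a list of tokens, which is the shape gensim's Dictionary and Word2Vec expect as input.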