Zotero API: pulling PDFs to auto-tag

Sophie Kemprecos

Jun 12, 2024, 12:34:42 PM
to zotero-dev
Hi all, I keep getting errors in my code, which reads PDFs through the Zotero API and passes them to the OpenAI API to get a list of tags to add to the items in the database automatically. Please let me know if anything in this code stands out as something I could construct better.
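import os
import re
import time

import fitz  # PyMuPDF
import openai
import requests
from pyzotero import zotero

# Placeholders: fill in your own library details and keys
library_id = 'library ID'
library_type = 'user'  # or 'group'
api_key = 'Zotero API key'
zot = zotero.Zotero(library_id, library_type, api_key)
BASE_URL = f"https://api.zotero.org/{library_type}s/{library_id}"
openai.api_key = 'API key'
headers = {'Zotero-API-Key': api_key}
OUTPUT_DIR = "pdfs"
os.makedirs(OUTPUT_DIR, exist_ok=True)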

# Define tag categories and prompts
tags_and_prompts = [
    ("publication_tags", ["Review", "Research Article", "Case Report"], "Read this paper and from this list of publication tags {tags_list}, decide which tag is directly mentioned. Only use terms from this list {tags_list} and match them as closely as possible. Choose only one tag from the list that is directly mentioned in the paper. Return only the accurate tag."),
    ("study_type_tags_review", ["Clinical Trial", "Preclinical"], "Read this paper and from this list of study type tags (review) {tags_list}, decide which tag is directly mentioned. Only use terms from this list {tags_list} and match them as closely as possible. Tags must be mentioned in the paper with respect to type of publication. Return only the accurate tag."),
    ("study_type_tags_other", ["Clinical Trial", "Preclinical", "In Vitro"], "Read this paper and from this list of study type tags (other) {tags_list}, decide which tag is directly mentioned. Only use terms from this list {tags_list} and match them as closely as possible. Only include tags that are mentioned in the paper. Return only the accurate tag."),
    ("fus_modality_tags", ["FUS-Thermal Ablation", "Focused Ultrasound Thermal Ablation", "Mechanical-FUS", "Mechanical Focused Ultrasound", "Histotripsy", "Pulsed-FUS", "Pulsed Focused Ultrasound", "FUS-Hyperthermia", "Focused Ultrasound Hyperthermia", "LOFU", "MB", "SDT", "FUS-BBBo", "Focused Ultrasound Blood-Brain Barrier Opening"], "Read this paper and from this list of FUS modality tags {tags_list}, decide which tags are directly mentioned in defining the modalities used. Only use terms from this list {tags_list} and match them as closely as possible, based only on the text from the paper. Return only the accurate tags, separated by commas."),
    ("combo_tags", ["FUS-alone", "Focused Ultrasound Alone", "FUS-Drug", "Focused Ultrasound and Drug Combination", "FUS-RT", "FUS-Cells" ], "Read this paper and from this list of combo tags {tags_list}, decide which tags are directly mentioned if focused ultrasound technology was used in concert with another type of treatment. Only use terms from this list {tags_list} and match them as closely as possible using the paper as your only reference. Return only the accurate tags, separated by commas."),
    ("drugs_therapy_tags", ["Chemotherapy", "IO", "aPD-1", "aPD-L1", "aCTLA-4", "Ipilimumab", "a-CD40", "Glycated Chitosan (GC)", "NPs", "CRT-NP", "ncMB", "cGAMP", "Gene Circuit", "STING", "aCD47", "CpG", "TSL", "Doxorubicin", "Liposomes", "R837", "Imiquimod", "HMME", "Sonosensitizer", "HiPorfin"], "Read this paper and from this list of drugs and therapy tags {tags_list}, decide which tags are directly mentioned in the research methods. Look at the methodology section with particular interest for this part. Only use terms from this list {tags_list} and match them as closely as possible using the paper as your only reference. Return only the accurate tags, separated by commas."),
    ("cells_tags", ["NK-92", "CarTcells", "EcN", "Bacteria"], "Read this paper and from this list of cells tags {tags_list}, decide which tags are directly mentioned in defining which cells were used as a therapy. Only use terms from this list {tags_list} and match them as closely as possible, pay attention to the methodology section to inform tagging. Only use keyword-tags from the paper. Return only the accurate tags, separated by commas."),
    ("therapy_type_tags", ["Gene Therapy", "IO", "Chemotherapy", "Cellular Therapy", "Targeted Therapy"], "Read this paper and from this list of therapy type tags {tags_list}, decide which tags are directly mentioned. Only use terms from this list {tags_list} and match them to the therapies discussed in the paper. Use the keyword list and paper's text as your only reference. Return only the accurate tags, separated by commas."),
    ("assays_types_tags", ["IHC", "Histology", "H&E", "Cytotoxicity Assay", "Flow Cytometry", "Proteomics", "ELISA", "Cytokine Assay", "ImmunoFluorescence", "qRT-PCR", "Western Blot", "RNA seq", "MTT", "Genomics", "Functional Assay", "T cell Depletion", "MassSpec", "EVs Isolation", "cryoEM", "Clonogenic Assay", "TCR seq", "scRNA seq"], "Read this paper and from this list of assay type tags {tags_list}, decide which tags are directly mentioned. Only use terms from this list {tags_list} and match them to the assay types discussed in the paper. Use the keyword list and paper's text as your only reference. Return only the accurate tags, separated by commas."),
    ("preclinical_tumor", ["Orthotopic", "Heterotopic"], "Read this paper. If this is pre-clinical research, then from this list of preclinical tumor tags {tags_list}, decide which tags are directly mentioned. Only use terms from this list {tags_list} and match them as closely as possible. Only include tags that are mentioned in the paper. Return only the accurate tags, separated by commas. If it is not preclinical do not tag it with the words."),
    ("preclinical_cellline", ["CT26", "MC38", "B16", "B16F1cOVA", "EL4", "HCT-116", "RM-9", "Pan02", "Neuro2a", "H22", "4T1", "B16-F10", "B16-F1", "MDA-MB-231", "H1-N1", "GL261", "TPSA23", "B16GP33", "Hepa1-6", "McA-RH7777", "N1-S1", "GL261", "KPC4662", "KPC", "E0771", "A20", "PC3", "Nalm-6", "NDL", "B16-OVA", "MM3MG-HER2", "LNCaP", "Hep3b", "H22", "MMTV-PyVT", "MMTV-hHER3"], "Read this paper. If this is pre-clinical research, then from this list of preclinical cell line tags {tags_list}, decide which tags are directly mentioned. Only use terms from this list {tags_list} and match them as closely as possible. Only include tags that are mentioned in the paper. Return only the accurate tags, separated by commas. If it is not preclinical do not tag it with the words.")
]

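def fetch_new_paper():
    """Return the most recently added top-level item, or None."""
    try:
        # top() skips child attachments and notes, which have no children of their own
        items = zot.top(sort="dateAdded", direction="desc", limit=1)
        return items[0] if items else None
    except Exception as e:  # pyzotero has no zotero.RequestException; its errors subclass pyzotero.zotero_errors.PyZoteroError
        print(f"Error fetching latest item: {e}")
        return None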

def get_item_children(item_key):
    try:
        url = f'{BASE_URL}/items/{item_key}/children'
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Error fetching item children: {e}")
        return []

def download_pdf(attachment_key, filename):
    try:
        url = f'{BASE_URL}/items/{attachment_key}/file'
        response = requests.get(url, headers=headers, stream=True)
        response.raise_for_status()
        
        file_path = os.path.join(OUTPUT_DIR, filename)
        with open(file_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        return file_path
    except requests.RequestException as e:
        print(f"Error downloading PDF: {e}")
        return None

def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        document = fitz.open(pdf_path)
        for page_num in range(document.page_count):
            page = document.load_page(page_num)
            text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from PDF: {e}")
    return text

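def gpt_analyze_paper(paper_content, prompt, tags_list):
    # The prompts have no {paper_content} placeholder, so format() never inserted the paper text; append it explicitly.
    # Long PDFs may still exceed the model's context limit.
    prompt_text = prompt.format(tags_list=', '.join(tags_list)) + "\n\nPaper text:\n" + paper_content
    print(f"GPT Prompt: {prompt_text[:500]}...")  # Debug print, truncated because the paper text is long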
    
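    try:
        # openai.ChatCompletion exists only in openai<1.0; on openai>=1.0 call
        # openai.chat.completions.create(...) and read response.choices[0].message.content
        response = openai.ChatCompletion.create(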
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful research assistant."},
                {"role": "user", "content": prompt_text}
            ],
            max_tokens=500,
            temperature=0.7
        )
        
        tags = response.choices[0].message['content'].strip()
        print(f"GPT Response: {tags}")  # Debug print to check the response
        return [tag.strip() for tag in tags.split(',') if tag.strip()]
    except Exception as e:
        print(f"Error analyzing paper with GPT: {e}")
        return []

def match_tags(gpt_tags, all_tags_list, paper_content):
    """Keep only GPT tags that are in the allowed list and appear verbatim in the paper."""
    matched_tags = []

    for gpt_tag in gpt_tags:
        normalized_gpt_tag = normalize_tag(gpt_tag)
        for list_tag in all_tags_list:
            # Lookarounds instead of \b so tags ending in punctuation, e.g. "Glycated Chitosan (GC)", can match
            pattern = r'(?<!\w)' + re.escape(list_tag) + r'(?!\w)'
            if normalize_tag(list_tag) == normalized_gpt_tag and re.search(pattern, paper_content, re.IGNORECASE):
                print(f"Match found: {list_tag}")  # Debug print for matching tag
                matched_tags.append(list_tag)
                break
    return matched_tags

def normalize_tag(tag):
    return re.sub(r'[-\s]', '', tag).lower()

def update_tags(paper, new_tags):
    """Merge new_tags into the item's existing tags and save the item."""
    current_tags = {tag['tag'] for tag in paper['data'].get('tags', [])}
    current_tags.update(new_tags)
    paper['data']['tags'] = [{"tag": tag} for tag in sorted(current_tags)]
    try:
        zot.update_item(paper)
        print(f"Updated paper {paper['data'].get('title', paper['key'])} with tags {new_tags}")
    except Exception as e:  # includes pyzotero errors, e.g. from a read-only API key
        print(f"Error updating tags: {e}. Check your API key and library permissions.")

def main():
    while True:
        paper = fetch_new_paper()
        if paper:
            print(f"New paper found: {paper['data']['title']}")
            paper_content = (paper['data'].get('abstractNote', '') + " " + 
                             paper['data'].get('title', '') + " " + 
                             paper['data'].get('DOI', '')).strip()
            # Fetch PDF attachment
            children = get_item_children(paper['key'])
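            # Notes and other children have no contentType, so use .get() to avoid a KeyError
            pdf_attachment = next((child for child in children if child['data'].get('contentType') == 'application/pdf'), None)
            if pdf_attachment:
                filename = pdf_attachment['data'].get('filename') or f"{pdf_attachment['key']}.pdf"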
                pdf_path = download_pdf(pdf_attachment['key'], filename)
                if pdf_path:
                    pdf_text = extract_text_from_pdf(pdf_path)
                    paper_content += " " + pdf_text
                    print(f"Extracted PDF text for paper: {pdf_text[:500]}...")  # Debug print for PDF text
                else:
                    print("Failed to download the PDF.")
            else:
                print("No PDF attachment found for the paper.")

            final_tags = []
            if paper_content.strip():
                for _, tags_list, prompt in tags_and_prompts:
                    gpt_tags = gpt_analyze_paper(paper_content, prompt, tags_list)
                    matched_tags = match_tags(gpt_tags, tags_list, paper_content)
                    final_tags.extend(matched_tags)
                final_tags = list(set(final_tags))  # Remove duplicates
                if final_tags:
                    update_tags(paper, final_tags)
                    print(f"Assigned tags {final_tags} to paper {paper['data']['title']}")
                else:
                    print(f"No applicable tags found for paper {paper['data']['title']}")
        else:
            print("No new papers found. Stopping the program.")
            break  # Stop the loop if no new papers are found
        time.sleep(60)  # Wait for one minute before checking again

if __name__ == "__main__":
    main()
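One thing that stands out in the loop above: fetch_new_paper() always returns the newest item, so the same paper is fetched and sent to GPT again every 60 seconds, and the break is only reached when the library is empty. A minimal sketch of one way around that, assuming it is acceptable to remember processed item keys in a local JSON file. The file name processed_keys.json and the three helper names are hypothetical and not part of pyzotero:

```python
import json
import os

SEEN_FILE = "processed_keys.json"  # hypothetical file name


def load_seen(path=SEEN_FILE):
    """Return the set of item keys that were already tagged."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))


def save_seen(seen, path=SEEN_FILE):
    """Persist the processed keys so a restart does not repeat work."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(seen), f)


def pick_unprocessed(items, seen):
    """Return the items whose 'key' is not in seen, preserving order."""
    return [item for item in items if item["key"] not in seen]
```

In main(), this would mean fetching a handful of recent items (a larger limit), passing them through pick_unprocessed(), and adding each key to the set only after update_tags() has run for it.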
Sebastian Karcher

Jun 12, 2024, 12:47:59 PM
to zoter...@googlegroups.com
If that's the whole code, it seems to assume pyzotero but neither imports it nor sets it up. Is this your code or GPT's?
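For reference, a minimal sketch of the setup pyzotero needs before calls like zot.items() or zot.update_item() can work, assuming the credentials live in environment variables. The variable names ZOTERO_LIBRARY_ID, ZOTERO_LIBRARY_TYPE, and ZOTERO_API_KEY and both helper functions are made up for illustration; only zotero.Zotero(library_id, library_type, api_key) is pyzotero's own constructor:

```python
import os


def load_zotero_settings(env=os.environ):
    """Collect the three values pyzotero needs (variable names are hypothetical)."""
    settings = {
        "library_id": env.get("ZOTERO_LIBRARY_ID", ""),
        "library_type": env.get("ZOTERO_LIBRARY_TYPE", "user"),  # "user" or "group"
        "api_key": env.get("ZOTERO_API_KEY", ""),
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing Zotero settings: {', '.join(missing)}")
    return settings


def make_client(settings):
    """Create the pyzotero client that the posted code refers to as zot."""
    from pyzotero import zotero  # third-party: pip install pyzotero
    return zotero.Zotero(
        settings["library_id"], settings["library_type"], settings["api_key"]
    )
```

With that in place, zot = make_client(load_zotero_settings()) near the top of the script would give the other functions the zot object they expect.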
Sebastian Karcher, PhD
www.sebastiankarcher.com
Abe Jellinek

Jun 13, 2024, 12:24:20 PM
to zoter...@googlegroups.com
> I keep getting errors thrown on my code
What are the errors?
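The posted script catches every exception and prints only its message, which hides the line that failed. A minimal sketch of one way to capture the full traceback so it can be pasted into a reply, using only the standard library. The names run_and_report and divide are hypothetical stand-ins:

```python
import traceback


def run_and_report(func, *args, **kwargs):
    """Call func; return (result, None) on success or (None, traceback text) on failure."""
    try:
        return func(*args, **kwargs), None
    except Exception:
        # format_exc() includes the exception type, the message, and the failing line
        return None, traceback.format_exc()


def divide(a, b):
    return a / b  # stand-in for a step such as download_pdf()


result, report = run_and_report(divide, 1, 0)
if report:
    print(report)
```

Wrapping each step of main() this way, or simply removing the broad except blocks while debugging, should show which call is actually failing.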