BioCreative/OHNLP Challenge 2018
[Apologies for cross-postings]
The application of Natural Language Processing (NLP) methods and resources to clinical and biomedical text has received growing attention in recent years, but progress has been limited by difficulties in accessing shared tools and resources, caused in part by patient privacy and data confidentiality constraints. Efforts to increase the sharing and interoperability of the few existing resources are needed to enable the kind of progress observed in the general NLP domain. Leveraging our research in corpus analysis and de-identification, we have created synthetic data sets, based on real clinical sentences, for two NLP tasks. We are organizing a challenge workshop to promote community efforts toward advancing clinical NLP.
The challenge workshop will have two tasks:
1. Family History Information Extraction (Winner Prize: $500)
2. Clinical Semantic Textual Similarity (Winner Prize: $500)
Task 1 – Family History Information Extraction:
The fact that many care process models use family history (FH) information highlights the importance of FH in the decision-making process of diagnosis and treatment. However, acquiring accurate and complete FH information remains challenging for the clinical NLP community. The main source of FH data is Patient Provided Information (PPI) questionnaires, which are usually stored in semi-structured or unstructured format in electronic health records. To provide comprehensive patient-provided FH data to physicians, there is a need for NLP systems that are able to extract FH from text. The elements of FH data are not pre-determined or limited. They depend on the information that patients provide about their relatives’ health during visits. FH elements may include disease, family member, cause, medication, age of onset, length of disease, etc. This variety of FH elements makes extraction from unstructured data challenging. Although several systems have been proposed and implemented for this purpose, their number remains quite limited. To address this issue, we are organizing a shared task to encourage researchers in relevant areas to propose and develop FH extraction (FHE) systems.
We have divided Task 1 into two subtasks:
1) Entity recognition (family members and disease names)
2) Relation extraction: participant systems are expected to extract family members and their corresponding observations. For each element, the systems should also extract modifiers: for family member mentions, e.g., “side of family” and “adoption status”; for observations, contextual information such as “negation” and “certainty”.
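As a rough illustration of what the two subtasks ask a system to produce, the following Python sketch models family members, observations, and the relations between them. All class names, field names, and label values here are hypothetical and are not the official submission format; the example sentence is invented and does not come from the challenge data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FamilyMember:
    mention: str                           # e.g. "grandmother"
    side_of_family: Optional[str] = None   # e.g. "maternal"; None if not stated
    adoption_status: Optional[str] = None  # None if not stated


@dataclass(frozen=True)
class Observation:
    mention: str                 # e.g. a disease name
    negated: bool = False        # "negation" context
    certainty: str = "certain"   # hypothetical label, not an official value


@dataclass(frozen=True)
class FamilyHistoryRelation:
    member: FamilyMember
    observation: Observation


def entities(relations: List[FamilyHistoryRelation]) -> Tuple[List[str], List[str]]:
    """Subtask 1 view: the unique family-member and observation mentions."""
    members = sorted({r.member.mention for r in relations})
    observations = sorted({r.observation.mention for r in relations})
    return members, observations


# Invented sentence: "Her maternal grandmother had breast cancer;
# her father has no history of diabetes."
relations = [
    FamilyHistoryRelation(
        FamilyMember("grandmother", side_of_family="maternal"),
        Observation("breast cancer"),
    ),
    FamilyHistoryRelation(
        FamilyMember("father"),
        Observation("diabetes", negated=True),
    ),
]
```

In this sketch, subtask 1 corresponds to the output of `entities(relations)`, while subtask 2 corresponds to the full list of relations with their modifiers.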
To participate in Task 1:
Task 2 – Clinical Semantic Textual Similarity:
The wide adoption of electronic health records (EHRs) has provided a way to electronically document patients’ medical conditions, thoughts, and actions among the care team. While the use of EHRs has led to improvements in the quality of healthcare, it has also introduced new challenges. One such challenge is the growing use of copy-and-paste, templates, and smart phrases, which are easy to use but lead to bloated clinical notes and poorly organized or erroneous documentation, among many other problems. EHRs are no longer optimized for tracking multiple complex medical problems or for maintaining the continuity and quality of the clinical decision-making process. There is a growing need for automated methods that better synthesize patient data from EHRs and reduce the cognitive burden on providers during clinical decision-making. Patient data can be scattered across several heterogeneous sources. Tools are needed that can aggregate data from diverse sources, minimize data redundancy, and organize and present the data in a user-friendly way.
One necessary task for extracting and consolidating information is to compute the semantic similarity between text snippets. In the general English domain, the SemEval Semantic Textual Similarity (STS) shared tasks have been organized since 2012 to develop automated methods for this task. Clinical text contains highly domain-specific terminology, so domain-specific NLP tools and resources are needed for the analysis, interpretation, and management of clinical text. In the clinical domain, there is no existing resource for the study of STS.
Constructing a dataset by gathering naturally occurring pairs of sentences with different degrees of semantic equivalence is itself a very challenging task. The objective of this shared task is to build systems for clinical STS.
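To make the notion of similarity between two text snippets concrete, here is a minimal Python sketch of a naive lexical baseline, the Jaccard overlap of word tokens. It is an assumption chosen for illustration only: it is not the challenge's scoring method or annotation scale, the function names are hypothetical, and the example sentences are invented.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a, b):
    """Share of tokens common to both snippets, from 0.0 to 1.0."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty snippets are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Two invented sentences with the same clinical meaning.
score = jaccard_similarity(
    "Patient denies chest pain.",
    "The patient reports no chest pain.",
)
```

The two sentences share only three of seven distinct tokens, so the score is low even though they mean the same thing. Pure word overlap misses paraphrase and negation, which is one reason systems built for clinical STS need to go beyond lexical matching.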
To participate in Task 2:
Timeline of the challenges:
Training data release: Per request
Testing data release: August 1, 2018
System submission: August 5, 2018 (11:59PM, Central Time)
Paper submission: August 12, 2018 (11:59PM, Central Time)
Workshop: August 29, 2018 (in conjunction with ACM-BCB 2018)
Location & time of the workshop:
ACM-BCB 2018, which will be held in August 2018 at the JW Marriott, Washington, DC
Organizers and Contact information:
Majid Rastegar-Mojarad (mojarad.majid at mayo dot edu)
Sijia Liu (Liu.Sijia at mayo dot edu) (Corresponding for Task 1)
Yanshan Wang (Wang.Yanshan at mayo dot edu) (Corresponding for Task 2)
Naveed Afzal (Afzal.Naveed at mayo dot edu)
Liwei Wang (Wang.Liwei at mayo dot edu)
Feichen Shen (Shen.Feichen at mayo dot edu)
Sunyang Fu (fu.sunyang at mayo dot edu)
Hongfang Liu (Liu.Hongfang at mayo dot edu)