OpenSherlock: Status Report

Jack Park

Nov 15, 2015, 2:01:00 PM
to qa-...@googlegroups.com
Below is a "plan" list copied from my master ToDo list:
PARALLEL
        DOWNLOAD PUBMED
        POLISH ONTOLOGY IMPORT
        POLISH PUBMED IMPORT
                PHASE3 (Abstracts->JSON) working fine
                PHASE4 (Clusters->JSON) ??
                READING ??
                        Wire Author networking
                        Wire Author pivot on IDocument
                        Wire publisher pivot on IDocument, author ???
                        POLISH Sentence Reading
                        WIRE Triple Detection
                        WIRE CG generation
                                FINISH CG platform
        WIRE BOOK IMPORT
                Ability to detect:
                        Named Entities
                        Relations
                        Processes
                                THE BIG ONE!!!
                                Conceptual Graphs
READ
        Build vocabulary
                Ontologies
                MeSH, etc.
        HAND MERGE topics
                SAME-LABEL LISTS
                BUILDS a solid foundation
        READ BOOKS
        READ PUBMED

The PubMed download is hovering around 5 million documents (it is really hard to get an exact count on a Windows box: the file browser gets really grumpy when you open a directory with that many files). Three computers are doing the downloading, clustering along the way. There are perhaps another couple of million documents left to download before I've exhausted the (always growing) list of queries.

OpenSherlock now runs on the TQElasticKnowledgeSystem platform:
https://github.com/OpenSherlock/TQElasticKnowledgeSystem

The working code is already two version levels (0.4.0) above what's on GitHub. The primary unfinished code in it is the conceptual graph (CG) platform. There are a number of BigDecisions(tm) left to make; the goal is to federate conceptual graphs with topic maps -- that is, each role player in a CG is, in fact, a topic in the topic map. One huge question is this: should each graph also be a topic in the topic map, or should it be a Vertex in a graph structure? I go back and forth on that.

One decision I have made is this: a SubjectProxy (topic object) in the topic map is not, itself, a role player. Rather, a *casting* of that proxy (another proxy) is the role player. The reason behind that decision is this: topics in the topic map are largely containers of specific kinds of properties (key-value pairs), some of which are, in fact, *pivots*, and others of which are *relations*. I distinguish between the two thus: a pivot is a kind of relation which entails some act, such as authorship or tagging. The practical difference is one of numbers: a given topic, say, a person, will have few if any marital, causal, or employment relationships, but might have a huge number of authored documents, social bookmarks, etc.

So, casting a given topic into a role (perhaps one of many) is being treated as a pivot relationship: we create a casting proxy to carry that role into the CG. We must do that because the casting proxy will have other properties related to the specific needs of the CG structure, such as that topic's position in a concept lattice (a kind of taxonomy or ontology specific to conceptual graphs).
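To make the casting idea concrete, here is a minimal sketch in Python. It is not the TQElasticKnowledgeSystem code; the field names, the locator scheme, and the `cast_into_role` helper are all assumptions made up for illustration. It only shows the shape of the decision: the topic stays untouched except for a new pivot, and the CG-specific properties live on a separate casting proxy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubjectProxy:
    """A topic: a container of key-value properties (hypothetical field names)."""
    locator: str
    label: str
    properties: Dict[str, str] = field(default_factory=dict)
    pivots: Dict[str, List[str]] = field(default_factory=dict)  # e.g. authored documents, castings


def cast_into_role(topic: SubjectProxy, role: str, lattice_position: str) -> SubjectProxy:
    """Create a casting proxy for `topic` and record it on the topic as a pivot."""
    casting = SubjectProxy(
        locator=f"{topic.locator}#cast:{role}",  # assumed locator scheme
        label=f"{topic.label} as {role}",
    )
    # CG-specific properties go on the casting proxy, not on the topic itself.
    casting.properties["castOf"] = topic.locator
    casting.properties["role"] = role
    casting.properties["latticePosition"] = lattice_position
    # The topic only gains one more entry in a (possibly long) pivot list.
    topic.pivots.setdefault("castings", []).append(casting.locator)
    return casting


gads = SubjectProxy(locator="topic:gads", label="Gads")
binder = cast_into_role(gads, role="binder", lattice_position="Protein/AdaptorProtein")
```

The same topic can be cast again into a different role; each call yields a distinct casting proxy, and it is that proxy, not `gads`, that a conceptual graph would reference as the role player.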

Conceptual graphs are important to OpenSherlock because there needs to be an underlying collection of process models, which further augment reading processes.

The list clearly identifies vocabulary building as crucial. OpenSherlock implements a kind of pattern recognition system based on extensible template matching. Each noun or noun phrase detected helps identify the topic to which it belongs.

When reading ontologies, MeSH terms, and other vocabulary sources, there will be ambiguities: a given string might be found in more than one source, each assigning it a different identifier. Frequently, those identifiers are effectively synonyms for the same object (different tribes use different identifiers). Thus, the topic map platform issues a NewTopic event, which is analyzed, at the very least, for the label(s) put on that topic. The system maintains a map from strings (labels) to lists of topic locators. When the cardinality of a list is > 1, we presently have to go in by hand, study those objects, and decide whether or not to merge. Eventually, OpenSherlock will be able to offload much of that work (in fact, that's its intended core job!).
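The label bookkeeping described above can be sketched roughly as follows. This is an illustration in Python, not the platform's actual listener: the class name, the `on_new_topic` method, the example locators, and the case-folding of labels are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


class LabelIndex:
    """Map each label to the locators of every topic carrying that label."""

    def __init__(self) -> None:
        self._by_label: Dict[str, List[str]] = defaultdict(list)

    def on_new_topic(self, locator: str, labels: List[str]) -> None:
        """Handle a (hypothetical) NewTopic event by indexing the topic's labels."""
        for label in labels:
            key = label.strip().lower()  # assumption: labels match case-insensitively
            locators = self._by_label[key]
            if locator not in locators:
                locators.append(locator)

    def merge_candidates(self) -> Dict[str, List[str]]:
        """Return the same-label lists with more than one locator, for hand review."""
        return {label: locs for label, locs in self._by_label.items() if len(locs) > 1}


index = LabelIndex()
index.on_new_topic("sourceA:example-1", ["Tyrosine"])
index.on_new_topic("sourceB:example-7", ["tyrosine", "Tyr"])
index.on_new_topic("sourceA:example-2", ["Phosphoproteins"])
candidates = index.merge_candidates()
```

Here `candidates` holds only the "tyrosine" entry, listing both locators; whether those two topics really are the same subject is still the hand-merge decision.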

There are a number of tasks listed as "parallel". Right now, parallelism is implemented as round-robin multitasking by me.  Perhaps, as platforms like OpenSherlock and others grow in importance, there will emerge developer communities.  One can hope...

Below is the JSON string for one of the PubMed abstracts as it was harvested and converted to JSON. That conversion is Phase 3, so the record does not include any further tags added by clustering; clustering is Phase 4, mentioned above.

{
    "crtr": "Carrot2AgentUser",
    "issue": "2",
    "pubMonth": "Jan",
    "abstract": "[BACKGROUND] The adaptor protein Gads is a Grb2-related protein originally identified on the basis of its interaction with the tyrosine-phosphorylated form of the docking protein Shc. Gads protein expression is restricted to hematopoietic tissues and cell lines. Gads contains a Src homology 2 (SH2) domain, which has previously been shown to have a similar binding specificity to that of Grb2. Gads also possesses two SH3 domains, but these have a distinct binding specificity to those of Grb2, as Gads does not bind to known Grb2 SH3 domain targets. Here, we investigated whether Gads is involved in T-cell signaling.\n[RESULTS] We found that Gads is highly expressed in T cells and that the SLP-76 adaptor protein is a major Gads-associated protein in vivo. The constitutive interaction between Gads and SLP-76 was mediated by the carboxy-terminal SH3 domain of Gads and a 20 amino-acid proline-rich region in SLP-76. Gads also coimmunoprecipitated the tyrosine-phosphorylated form of the linker for activated T cells (LAT) adaptor protein following cross-linking of the T-cell receptor; this interaction was mediated by the Gads SH2 domain. Overexpression of Gads and SLP-76 resulted in a synergistic augmentation of T-cell signaling, as measured by activation of nuclear factor of activated T cells (NFAT), and this cooperation required a functional Gads SH2 domain.\n[CONCLUSIONS] These results demonstrate that Gads plays an important role in T-cell signaling via its association with SLP-76 and LAT. Gads may promote cross-talk between the LAT and SLP-76 signaling complexes, thereby coupling membrane-proximal events to downstream signaling pathways.\n",
    "pmid": "10021361",
    "publicationTitle": [
        "Current biology : CB",
        "Curr. Biol."
    ],
    "title": "The hematopoietic-specific adaptor protein gads functions in T-cell signaling via interactions with the SLP-76 and LAT adaptors.",
    "volume": "9",
    "tagList": [
        "Adaptor Proteins, Signal Transducing",
        "Carrier Proteins",
        "DNA-Binding Proteins",
        "GRAP2 protein, human",
        "LAT protein, human",
        "Membrane Proteins",
        "NFATC Transcription Factors",
        "Nuclear Proteins",
        "Phosphoproteins",
        "Receptors, Antigen, T-Cell",
        "SLP-76 signal Transducing adaptor proteins",
        "Transcription Factors",
        "Tyrosine",
        "metabolism",
        "physiology",
        "Humans",
        "Jurkat Cells",
        "chemistry",
        "Phosphorylation",
        "Protein Binding",
        "Signal Transduction",
        "T-Lymphocytes"
    ],
    "pages": "67-75",
    "lox": "PubMed10021361",
    "issn": "0960-9822",
    "pubYear": "1999",
    "substanceList": [
        "Adaptor Proteins, Signal Transducing",
        "Carrier Proteins",
        "DNA-Binding Proteins",
        "GRAP2 protein, human",
        "LAT protein, human",
        "Membrane Proteins",
        "NFATC Transcription Factors",
        "Nuclear Proteins",
        "Phosphoproteins",
        "Receptors, Antigen, T-Cell",
        "SLP-76 signal Transducing adaptor proteins",
        "Transcription Factors",
        "Tyrosine"
    ],
    "publisher": "Curr Biol",
    "pubType": "RshSupNonGovType",
    "authors": [
        {
            "lName": "Liu",
            "fName": "S K",
            "affiliation": "Department of Medical Biophysics, University of Toronto, The Arthur and Sonia Labatt Brain Tumour Research Centre, Hospital for Sick Children, Research Institute, 555 University Ave, Toronto, Ontario M5G 1X8, Canada.",
            "initials": "SK"
        },
        {
            "lName": "Fang",
            "fName": "N",
            "initials": "N"
        },
        {
            "lName": "Koretzky",
            "fName": "G A",
            "initials": "GA"
        },
        {
            "lName": "McGlade",
            "fName": "C J",
            "initials": "CJ"
        }
    ]
}
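
As a rough illustration of how a downstream step might consume a Phase 3 record like the one above, here is a small Python sketch. The record embedded in it is a trimmed copy of the one above (only a few keys and list entries are kept), and the `summarize` helper and its output keys are hypothetical, not part of OpenSherlock.

```python
import json

# Trimmed copy of the record above; only a few keys and entries are kept.
RECORD_JSON = """
{
    "pmid": "10021361",
    "lox": "PubMed10021361",
    "tagList": ["Tyrosine", "metabolism", "Humans", "Jurkat Cells"],
    "substanceList": ["Tyrosine"],
    "authors": [
        {"lName": "Liu", "fName": "S K", "initials": "SK"},
        {"lName": "Fang", "fName": "N", "initials": "N"}
    ]
}
"""


def summarize(record: dict) -> dict:
    """Pull out the locator, author names, and the tags that are not substances."""
    substances = set(record.get("substanceList", []))
    other_tags = [tag for tag in record.get("tagList", []) if tag not in substances]
    authors = [f'{a["lName"]} {a["initials"]}' for a in record.get("authors", [])]
    return {"lox": record["lox"], "authors": authors, "otherTags": other_tags}


summary = summarize(json.loads(RECORD_JSON))
```

Separating substance tags from the remaining tags is just one possible use; in this record every entry of substanceList also appears in tagList, so the difference leaves the non-substance terms.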



