Merging ISS (RiskMetrics) and Capital IQ Compustat; TUTORIAL; Directors Data

39 views
Skip to first unread message

Kamron Eck

unread,
Jul 8, 2024, 2:18:30 PM7/8/24
to wrdssas

For those revisiting this thread in the future, I resolved my issue with merging directors' data by utilizing the SPlink Python package. SPlink efficiently merges datasets based on multiple criteria such as CUSIP, year, and director names. It performed well, especially with exact matches on CUSIP and year, while probabilistically matching names between the Compustat and RiskMetrics datasets. Despite the initial learning curve, SPlink’s comprehensive documentation, highlighted by this pivotal guide, proved invaluable.

To enhance matching efficiency, I split full names into first, middle, and last components and created a concatenated 'cy' column from the CUSIP and year fields, which I used for blocking in SPlink. This strategy significantly reduced the iterations required during the matching process. My SPlink configuration included settings for linkage type and blocking rules, along with a series of comparisons and deterministic rules based on Jaro-Winkler similarity for robust name matching.

Sidenote: It's crucial to standardize data fields for effective matching. For name fields (first, middle, last, fullname), I removed all non-alphabetic characters, including spaces. For CUSIPs, I eliminated any leading zeros. Additionally, to reconcile the differing formats between datasets, I adjusted the CUSIPs: ISS uses 9-digit codes while Compustat uses 8-digit ones. I addressed this by removing the farthest right digit from the ISS CUSIPs after stripping the leading zeros. The order of these operations does not impact the results.

settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        block_on("cy",salting_partitions=2)
    ],
    "comparisons": [
        ctl.name_comparison("first"),
        ctl.name_comparison("full"),
        ctl.name_comparison("last"),
        ctl.name_comparison("middle"),
        cl.exact_match("year"),
        cl.exact_match("cusip"),
    ],      
}

deterministic_rules = [
   
    "jaro_winkler_similarity(r.full, l.full) >= .8 and r.cy = l.cy and jaro_winkler_similarity(r.first, l.first) >= .8 and jaro_winkler_similarity(r.last, l.last) >= .8",

]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.7)


Best,

Kameron Eck

Central Washington University

Cultural and Environmental Resource Management


Contact me on linkedin for quick help

https://www.linkedin.com/in/eckkam11/



Reply all
Reply to author
Forward
0 new messages