Help with merging Compustat and RiskMetrics

163 views
Skip to first unread message

Long Lucas

unread,
Jan 25, 2021, 4:51:56 PM1/25/21
to wrdssas
Hello everyone,

I would like to ask about the issues I faced during my research.

I try to merge two datasets, one is from Execucomp and one is from RiskMetrics.

Here are what happened,

The data from Execucomp are based on firms, for example
WeChat Screenshot_20210125165035.png
However, the data from RiskMetrics are based on a single person (directors), for examples

WeChat Screenshot_20210125165056.png

So as you may find out, one CEO or CFO may serve as director for many companies in one year. And RiskMetrics only provide the cusip for the firm that he/she served as Director. For the firm that served as CEO/CFO it only provides a name.

I try to merge using Director's Name, but these two datasets use different formats to record names. For example, one including the middle name and one without the full name.

Can anyone provide some suggestions or feedback? Many thanks in advance.

Best,

Lucas  

Kamron Eck

unread,
Jul 3, 2024, 3:27:11 AM7/3/24
to wrdssas
Did you ever figure this out? The only solution I can posit is doing a fuzzy match with a column that has the company cusip-year-director name merged.

Kamron Eck

unread,
Jul 8, 2024, 2:17:49 PM7/8/24
to wrdssas

For those revisiting this thread in the future, I resolved my issue with merging directors' data by utilizing the SPlink Python package. SPlink efficiently merges datasets based on multiple criteria such as CUSIP, year, and director names. It performed well, especially with exact matches on CUSIP and year, while probabilistically matching names between the Compustat and RiskMetrics datasets. Despite the initial learning curve, SPlink’s comprehensive documentation, highlighted by this pivotal guide, proved invaluable.

To enhance matching efficiency, I split full names into first, middle, and last components and created a concatenated 'cy' column from the CUSIP and year fields, which I used for blocking in SPlink. This strategy significantly reduced the iterations required during the matching process. My SPlink configuration included settings for linkage type and blocking rules, along with a series of comparisons and deterministic rules based on Jaro-Winkler similarity for robust name matching.

settings = {
    "link_type": "link_only",
    "blocking_rules_to_generate_predictions": [
        block_on("cy",salting_partitions=2)
    ],
    "comparisons": [
        ctl.name_comparison("first"),
        ctl.name_comparison("full"),
        ctl.name_comparison("last"),
        ctl.name_comparison("middle"),
        cl.exact_match("year"),
        cl.exact_match("cusip"),
    ],      
}

deterministic_rules = ["jaro_winkler_similarity(r.full, l.full) >= .8 and r.cy = l.cy and jaro_winkler_similarity(r.first, l.first) >= .8 and jaro_winkler_similarity(r.last, l.last) >= .8",]

Hopefully, this word vomit will be helpful to people one day. I am an amateur in Python.
On Monday, January 25, 2021 at 1:51:56 PM UTC-8 Long Lucas wrote:
Reply all
Reply to author
Forward
0 new messages