Here is a small sample of what my data look like:
speaker = c("N0005", "N0012", "N0101", "N0014", "N0036", "N0014", "N0005", "N0005", "N0031")
a = c("N00059", "N00059", "N01019", "N00059", "N00181", "N00059", "N00059", "N00059", "N00206")
b = c("N00112", "N00120", "N01020", "N00143", "N00241", "N00147", "N00147", "N00149", "N00316")
c = c("NA", "NA", "N01021", "NA", "N00363", "NA", "NA", "NA", "N00318")
df = data.frame(speaker, a, b, c); df
The factor 'speaker' holds the speaker codes, but only their first five characters. These codes should normally be six characters long, but the CGN corpus software seems to cut off the last digit. Factors a, b and c (there are actually several more; I'm showing just three for the example) list the participants in the conversation with their full six-character codes, so the speaker is always among them.
I now want to check whether the five-character string in 'speaker' matches the first five characters of a, b or c in the same row. If it matches exactly one string, I want that six-character string to go into a new factor 'speaker2'. If it matches two or more strings (as in row 9, where "N0031" could match both "N00316" and "N00318"), I want 'speaker2' to contain "ambiguous" instead.
In other words, I want to know how I can get this output:
speaker2 = c("N00059", "N00120", "N01019", "N00143", "N00363", "N00147", "N00059", "N00059", "ambiguous")
df_result = data.frame(df, speaker2); df_result
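To pin down the matching rule, here is a rough row-by-row sketch in base R. The helper name resolve_speaker is made up for illustration, and returning NA when nothing matches is an assumption, since the description above does not cover that case. It is a plain loop over rows rather than anything clever, so a more idiomatic approach is still what I'm after.

```r
# Sample data from above; stringsAsFactors = FALSE keeps the codes as plain strings
df <- data.frame(
  speaker = c("N0005", "N0012", "N0101", "N0014", "N0036", "N0014", "N0005", "N0005", "N0031"),
  a = c("N00059", "N00059", "N01019", "N00059", "N00181", "N00059", "N00059", "N00059", "N00206"),
  b = c("N00112", "N00120", "N01020", "N00143", "N00241", "N00147", "N00147", "N00149", "N00316"),
  c = c("NA", "NA", "N01021", "NA", "N00363", "NA", "NA", "NA", "N00318"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: resolve one truncated code against one row's participants
resolve_speaker <- function(prefix, candidates) {
  # Compare the leading characters of each candidate with the truncated code;
  # real NA values and the literal string "NA" both fail to match
  is_hit <- !is.na(candidates) & substr(candidates, 1, nchar(prefix)) == prefix
  hits <- unique(candidates[is_hit])
  if (length(hits) == 1) {
    hits
  } else if (length(hits) > 1) {
    "ambiguous"
  } else {
    NA_character_  # assumption: no match at all is reported as NA
  }
}

# Participant columns as a character matrix, so each row is a character vector
participants <- as.matrix(df[c("a", "b", "c")])
speaker_codes <- as.character(df$speaker)

df$speaker2 <- vapply(
  seq_len(nrow(df)),
  function(i) resolve_speaker(speaker_codes[i], participants[i, ]),
  character(1)
)
df
```

On the sample data this should reproduce the speaker2 column listed above, including "ambiguous" for row 9. With more participant columns, only the column selection in df[c("a", "b", "c")] would need to change.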
I tried this several ways with ifelse(), within(), pmatch() and gregexpr(), but I could never get any of them to work completely. There must be a fairly easy way to do this, but I can't come up with it. Does anyone have any ideas?