Problem: "Exclude" argument in minimize does not exclude rows - manual exclusion does


Nicolai Schulz

Jun 10, 2020, 10:52:24 AM
to QCA with R
Hello everyone,

I'm fairly new to using QCA in R and have the following issue, which can be replicated with the attached script and data.

The issue I face is that, whereas I can effectively exclude truth table rows manually in R, using the QCA package's "exclude" argument during the minimization process does not always work.

The example data will show that I:
  1. I identify a row (#2) in the truth table which is fully made up of a deviant case in kind (dcc), and which I therefore want to exclude from the minimization (whether that is in itself methodologically justifiable is something I just asked in the QCA Facebook group, but I am glad for any input here as well).
  2. I then use "exclude = c(2)" to exclude that row during the minimization process.
  3. Yet that row/case shows up in the final solution-case table as its own path (although it is factually represented only by one deviant case).
  4. When manually excluding the row from the truth table (i.e. "ttGDP_VPHP$tt['2', 'OUT'] <- 0") and then minimizing, the "deviant path" indeed disappears (as I hoped/thought it would).
In conclusion, my question is: why does my attempt at using the "exclude" argument not have the intended effect of actually excluding said row in the minimization process? Where am I going wrong?
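
For reference, here is a condensed sketch of the two attempts (only the object name ttGDP_VPHP and the row number come from my script; the data object, outcome name, and cutoff are placeholders):

library(QCA)

# Truth table for the outcome (placeholder data and outcome names)
ttGDP_VPHP <- truthTable(fsdata, outcome = "GDP", incl.cut = 0.8)

# Attempt 1: pass the row number to minimize(); the deviant path remains
sol1 <- minimize(ttGDP_VPHP, include = "?", exclude = c(2))

# Attempt 2: manually set the row's output value to 0; the path disappears
ttGDP_VPHP$tt['2', 'OUT'] <- 0
sol2 <- minimize(ttGDP_VPHP, include = "?")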

Many thanks for your advice! 

All the best,
Nicolai



Schulz_Replication_Problem_Script2.R
Schulz_ReplicationFile_exclude_issue.csv

Adrian Dușa

Jun 10, 2020, 4:04:56 PM
to Nicolai Schulz, QCA with R
Hello Nicolai,

The truth table object in the QCA package was designed so that no manual intervention should be needed. The function has so many arguments, allowing such a fine-grained array of options, that manually changing the output values should never be necessary.

This is of course possible, but at the cost of breaking a number of features. If I remember correctly, the esa() function in the package SetMethods has a problem with the case labels after such a manual intervention.

Now, as to why the argument exclude = c(2) does not seem to work: I believe this is explained in the help for that argument (at least in the newest versions of the package on CRAN):

exclude: A vector of (remainder) row numbers from the truth table, to code as negative output configurations.

So, what it excludes is remainders only. This is a methodological rather than a programming choice, because I believe observed evidence should not be touched in this way. The argument "exclude" is useful for (and was introduced specifically for) obtaining the so-called enhanced parsimonious solution, which is achieved by excluding untenable remainders from the minimization process.
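
As a sketch (assuming a truth table object tt and placeholder data/outcome names; findRows() is the usual helper for locating untenable remainders, and the type value below is only one possibility):

tt <- truthTable(fsdata, outcome = "GDP", incl.cut = 0.8)

# remainders that are simultaneous subsets of the outcome and its negation
untenable <- findRows(obj = tt, type = 2)

# enhanced parsimonious solution: remainders minus the untenable ones
eps <- minimize(tt, include = "?", exclude = untenable)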

Observed data, on the other hand, can be dealt with in more detail by analysing the calibration process. If those cases are problematic, as they seem to be, I would first check whether something is wrong with the calibration.

It would be immensely interesting to also provide information about how (specifically) you assigned set membership scores to those cases. What were the raw scores? Which calibration method did you apply?

You might find some unexpected answers when following these questions...

I hope this helps,
Adrian


Adrian Dusa
University of Bucharest
Romanian Social Data Archive
Soseaua Panduri nr. 90-92
050663 Bucharest sector 5
Romania
https://adriandusa.eu

Nicolai Schulz

Jun 11, 2020, 8:33:21 AM
to QCA with R
Hello Adrian,

many thanks for your quick and very helpful reply.

That definitely solves the technical problem. If the exclude argument only excludes remainders, then clearly my observed row 2 can't be excluded by it. Fair enough.

With regard to the methodological question, matters seem to be more complex. But to answer your questions: I used the direct method of calibration. To identify the crossover and the full non-/membership thresholds, I either referred to clear theoretical reference points (where the variables permitted) or, more commonly, used a mix of relatively clear-cut gaps in the data and statistical reference points such as the mean or the 75th percentile, and finally checked the cases surrounding such thresholds to see whether they fairly accurately capture my understanding of the empirical space.
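
In code, my procedure looked roughly like this (the variable name and threshold values here are placeholders, not the actual ones):

library(QCA)

# direct calibration with exclusion (e), crossover (c) and inclusion (i)
# anchors taken from theory or from gaps/means/percentiles in the raw data;
# my script also rounded the resulting scores to two decimals
fsX <- round(calibrate(raw$X, type = "fuzzy",
                       thresholds = "e=-2.37, c=0.15, i=2.06"), 2)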

If helpful, I can post a version of the raw data and the calibration here.

Once again, many thanks and all the best,
Nicolai







Adrian Dușa

Jun 11, 2020, 10:36:41 AM
to Nicolai Schulz, QCA with R
I think that would be very instructive to see.
Theoretical thresholds are very good, but data-driven ones (such as the mean, or percentiles, etc.) definitely warrant more inspection. If I am not mistaken, theory actually warns against such measures.
But let us take it one step at a time: please do attach a new script containing the raw data and the calibration commands.

Best,
Adrian



Nicolai Schulz

Jun 11, 2020, 12:15:11 PM
to QCA with R
Dear Adrian,

many thanks again. Attached are the raw data and the calibration.

Four preliminary notes:
  1. The attached data is a sub-sample of cases I selected from a larger set of data (that I collected). The selection/reduction was done to create a more homogeneous sample that holds a range of other potential explanatory conditions constant (rather than also having to include them in the QCA; there is more to all that, but I think that is a bit more than I need/want to get into at this point).
  2. The "empirically motivated" cut-off points (i.e. means, etc.) were derived from the larger sample (as we still regard it as our reference population).
  3. As a consequence, it might be that conditions which seemed fairly balanced in the larger sample are more skewed in the sub-sample. Interestingly, however, the only heavily skewed condition, HP, has a theoretical cut-off point (which happens to be similar to the mean in the larger sample, but not in the sub-sample).
  4. Now, perhaps the reason for the awkward deviant cases in the final truth table is the skewness of HP? I do, however, want to note that I faced similar issues of surprising deviant cases even when using the well-balanced large sample (or other sub-samples), making me slightly sceptical (but still open!) about this being the main cause. But given that we've started working with this dataset/sub-sample, I would propose to continue working with it, provided you agree.
Once again, many thanks for your great support!

All the best,
Nicolai


Schulz_Replication_Problem_Script_Raw.R
SchulzQCA2020RepDataRaw.csv

Adrian Dușa

Jun 13, 2020, 6:53:48 AM
to Nicolai Schulz, QCA with R
Hi Nicolai,

First, let us do a bit of housekeeping to minimize the code and make it quick to follow. Just like for variables, I would use very short names for objects as well; for example, instead of "SchulzQCA2020RepDataRaw" I'd use something like "srdr" (a sort of acronym).

Then, I see your data was most likely exported from R, as it has the "rownames" of the original R data frame in its first column. At the same time, the real row names (case names) are in the last column, called "CountCodeP". You can assign that column as row.names directly:

srdr <- read.csv2("SchulzQCA2020RepDataRaw.csv", row.names = "CountCodeP")[, -1]

The [, -1] part eliminates the irrelevant first column. If you later want to save this new data frame, I recommend the function export() from my package, which automatically assigns the name "Cases" to the first column containing the row names.
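
For example (the file name is just an illustration):

# writes srdr with its row names in a leading "Cases" column
export(srdr, file = "srdr_clean.csv")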

Next, I am looking at your calibration commands. First of all, I don't really understand why you insist on rounding your set membership scores, and why to two decimals only.
This only introduces imprecision: for instance, on the condition EPOS, the second and third values are 0.5032453678 and 0.5032453678. Rounding these to two decimals produces the value 0.5 for both, which is very bad (and you do get a warning in the truth table procedure because of that).
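
To illustrate with the EPOS values just mentioned:

epos <- c(0.5032453678, 0.5032453678)
round(epos, 2)
# [1] 0.5 0.5
# A score of exactly 0.5 is maximally ambiguous: such a case belongs to
# neither the presence nor the absence of the condition, hence the warning.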

But more importantly, I don't really buy the direct calibration of the raw condition invQ2_OB. That looks like a Likert-type response scale from 1 to 5, and direct calibration cannot be applied to this type of raw data: it is a categorical variable, while direct calibration expects a numerical (interval) variable. That is surely problematic, and I would warmly suggest reading section 4.3 of my book (freely available on bookdown.org).
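
One alternative for such a scale, sketched here with an illustrative (not prescribed) mapping of categories to fuzzy scores:

# map the five ordinal categories to fuzzy scores directly,
# avoiding 0.5 for any observed category
likert <- c(1, 2, 3, 4, 5, 3, 2)            # placeholder responses
fs_ob  <- c(0, 0.25, 0.4, 0.75, 1)[likert]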

I really cannot say anything else about the choice of thresholds for the other conditions, short of more informed theoretical background, but they too strike me as odd. For instance, the raw variable QuinPeRelAnEcGrFlex ranges from -6.172504 to +4.3759. Setting the full exclusion point at -2.366304 only excludes the outlier value -6.172504, as the second-lowest value is -2.275266.

Besides being unnecessary, such precision for the full exclusion threshold raises eyebrows anyway. Then the full inclusion threshold of 2.060684 (again, very precise) does not seem to be near the far end of the sorted values. This being a subset of your actual data, there might be information I am missing, but in any case threshold values should attempt to differentiate between qualitatively different cases (using theory), rather than mechanically using empirical distribution points like the mean or the quartiles (data-driven, and definitely not recommended).

My suggestion would be to use the interactive threshold setter in the graphical user interface, or at least to use the function Xplot() to visually inspect the raw distributions.
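
For instance, with the variable discussed above:

Xplot(srdr$QuinPeRelAnEcGrFlex, jitter = TRUE)

# findTh() suggests data-driven thresholds via cluster analysis; treat these
# as starting points to be checked against theory, not as final anchors
findTh(srdr$QuinPeRelAnEcGrFlex, n = 3)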

Methodologically, I wonder why you have variables containing negative numbers: are those scores obtained through some sort of factor analysis, perhaps?

I continue to believe your analysis (including the discussion of the deviant cases) should not continue until you've sorted out the calibration phase. I don't mean to say your calibration is completely wrong, but it seems to me somewhat subjective and/or data-driven.

I hope this helps,
Adrian


Nicolai Schulz

Jun 17, 2020, 4:43:43 PM
to QCA with R
Hi Adrian,

once again, many thanks for your fast and comprehensive reply! Allow me to respond inline below; my answers follow each quoted passage.

All the best,
Nicolai



On Saturday, 13 June 2020 at 12:53:48 UTC+2, Adrian Dușa wrote:

> First, let us do a bit of housekeeping to minimize the code and make it quick to follow. [...] If you later want to save this new data frame, I recommend the function export() from my package, which automatically assigns the name "Cases" to the first column containing the row names.

Many thanks, that is all very useful.

> Next, I am looking at your calibration commands. First of all, I don't really understand why you insist on rounding your set membership scores, and why to two decimals only. [...]

Good point! I don't insist on doing that; indeed, you make a good case that I should not. I blindly followed a QCA course script in doing so, but I agree, I shouldn't have.

> But more importantly, I don't really buy the direct calibration of the raw condition invQ2_OB. That looks like a Likert-type response scale from 1 to 5, and direct calibration cannot be applied to this type of raw data [...]

Good catch. This is somewhat of an artefact. I usually work with data from an online survey where we have three experts per country. They respond on ordinal scales, but when aggregating the data across the three country experts, the data becomes interval-scaled. Hence my initial decision to use direct calibration (as I initially used that aggregated data). In the sub-sample shared here, however, I used disaggregated expert data from expert replies we had already validated. Thus the scale here remained ordinal (which I missed). So I guess a different calibration would have been more adequate.

> I really cannot say anything else about the choice of thresholds for the other conditions [...] Setting the full exclusion point at -2.366304 only excludes the outlier value -6.172504, as the second-lowest value is -2.275266.

I can see why that looks odd, but these values are, as explained in my prior notes, derived from the full sample. There the full exclusion point is empirically meaningful and, I would have argued, remains theoretically meaningful in the sub-sample (given that it should relate to the full sample). Apologies if my explanations confuse more than they help. Overall, I'm planning to work more with the full sample directly, avoiding such oddities (while, I believe, likely still facing some of the issues initially raised; but let me share the evidence when I get there before making any further claims).

> Besides being unnecessary, such precision for the full exclusion threshold raises eyebrows anyway. Then the full inclusion threshold of 2.060684 (again, very precise) does not seem to be near the far end of the sorted values. [...] threshold values should attempt to differentiate between qualitatively different cases (using theory), rather than mechanically using empirical distribution points like the mean or the quartiles (data-driven, and definitely not recommended).

The precise values are simply empirical numbers from means and percentiles (although I accept the challenge that one should consider whether these are good reference points). As for the full inclusion threshold: a similar issue as above, since this is the 75th percentile in the full sample, where it also looks like a meaningful threshold.

> My suggestion would be to use the interactive threshold setter in the graphical user interface, or at least to use the function Xplot() to visually inspect the raw distributions.

Thank you for the suggestions; will do (though that resembles the approach I took... yet on the full sample).

> Methodologically, I wonder why you have variables containing negative numbers: are those scores obtained through some sort of factor analysis, perhaps?

Vdem_e_polity2 is the Polity IV regime type score, ranging from -10 to 10. And QuinPeRelAnEcGrFlex is the annual economic growth of a country relative to countries at similar levels of development during the same period; thus it can be negative (slower than comparable countries) or positive (faster).

> I continue to believe your analysis (including the discussion of the deviant cases) should not continue until you've sorted out the calibration phase. [...]

That makes sense. As described above, I should probably concentrate on the full sample. And if I find similar issues there, on the basis of a hopefully more comprehensible calibration, I will make sure to share them here.

Once again, many thanks for your help!

Adrian Dușa

Jun 17, 2020, 4:57:27 PM
to Nicolai Schulz, QCA with R
Glad it helps. Your example, even if by accident, is fascinating given the truth table anomaly we discussed.
It is well worth investigating, and I will keep an eye open for your subsequent messages.

Best,
Adrian
