article on factor analysis with statsmodels and scikit-learn


Jun 15, 2021, 6:17:03 PM
to pystatsmodels

David Waterworth

Jun 15, 2021, 6:51:11 PM
The authors of the linked paper concluded:

The three Python packages offer a range of functionality, both in specifying and reporting EFA models. The FactorAnalyzer and statsmodels packages present large toolsets for conducting EFA and reporting interpretable measurement models. On the other hand, scikit-learn offers more limited EFA functionality that seems to be primarily geared toward reducing dimensionality of data and enhancing predictive capabilities of machine learning models.

At present, FactorAnalyzer stands out as the most comprehensive and reliable Python package for conducting EFA, because it offers necessary tests of assumptions that are overlooked by other packages and its EFA results align with those from the psych package in R. While statsmodels’ documentation describes similar functionality for EFA, it struggles to deliver accurate results. Finally, scikit-learn comes in at third place due to its limited set of options for estimating, modifying and reporting EFA models.

Regarding ease-of-use, package developers should consider adding the option to output results as data frames instead of arrays. Data frames are more interpretable since they organize data into tabular form with descriptive metadata such as column and row names. The statsmodels package outperforms the others in this regard, by offering a summary function that returns a selection of commonly reported EFA results as data frames like the psych package does in R. Otherwise, users must use the pandas package to convert the packages’ output arrays into a data frame format. While not particularly difficult, manually converting arrays into data frames with corresponding metadata can present an unnecessarily tedious intermediate step in the data analytic workflow. This extra step may discourage new Python users who are not well-versed in data manipulation techniques and users who wish to perform quick analyses with minimal code modification.

In terms of intended functionality, all of the packages could be improved by adding methods for identifying the optimal number of factors for EFA models. None of the packages provide tools for conducting a parallel analysis, and only one offers a scree plot function. Fortunately, programming a scree plot does not require extensive coding ability or custom-built functions. Users who wish to code a scree plot using eigenvalues from their output can try the Matplotlib visualization package, which has a number of online tutorials (Hunter, 2007; Navlani, 2019; St-Amant, 2020; Toth, 2020).
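For anyone wondering what the "scree plot from eigenvalues" step might look like, here is a minimal sketch. It is not taken from the paper or from any of the three packages: the helper names `correlation_eigenvalues` and `scree_plot`, the output file name, and the synthetic two-factor data are all assumptions made for illustration. Matplotlib is imported inside the plotting helper so the eigenvalue part runs with NumPy alone.

```python
import numpy as np


def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix of `data`, largest first."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse them.
    return np.linalg.eigvalsh(corr)[::-1]


def scree_plot(eigenvalues, path="scree.png"):
    """Save a scree plot to `path` (hypothetical default file name)."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, writes to file only
    import matplotlib.pyplot as plt

    ranks = np.arange(1, len(eigenvalues) + 1)
    fig, ax = plt.subplots()
    ax.plot(ranks, eigenvalues, marker="o")
    # Reference line at 1.0 (the eigenvalue-greater-than-one rule of thumb).
    ax.axhline(1.0, linestyle="--", color="gray")
    ax.set_xlabel("Factor number")
    ax.set_ylabel("Eigenvalue")
    ax.set_title("Scree plot")
    fig.savefig(path)
    plt.close(fig)


# Synthetic data (an assumption): 6 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
loadings = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                     [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
data = latent @ loadings.T + 0.5 * rng.normal(size=(500, 6))
eigenvalues = correlation_eigenvalues(data)
```

Calling `scree_plot(eigenvalues)` would then write the figure to disk. With real data, `data` would be replaced by the observed variables, and the eigenvalues could equally come from a fitted model's output.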


Jun 15, 2021, 10:05:50 PM
to pystatsmodels
Thanks David

It doesn't sound negative, unlike some of those reviews.
I think we need to check what it means that statsmodels "struggles to deliver accurate results" and see if we can improve that.
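One possible way to start such a check, offered only as a rough sketch and not as what the paper's authors did: simulate data from a known loading matrix and compare the recovered loadings against it. The helper names, the loading values, the sample size and the use of Tucker congruence are all assumptions for illustration. The statsmodels call (`Factor(...).fit()` followed by `rotate("varimax")`) is imported lazily, so the simulation part runs with NumPy alone.

```python
import numpy as np


def simulate_factor_data(loadings, n_obs, seed=0):
    """Draw data with unit-variance variables from an orthogonal factor model."""
    rng = np.random.default_rng(seed)
    n_vars, n_factors = loadings.shape
    # Uniqueness = 1 - communality, so each variable has population variance 1.
    noise_sd = np.sqrt(1.0 - (loadings ** 2).sum(axis=1))
    factors = rng.normal(size=(n_obs, n_factors))
    noise = rng.normal(size=(n_obs, n_vars)) * noise_sd
    return factors @ loadings.T + noise


def congruence(a, b):
    """Tucker congruence coefficients between the columns of a and b."""
    a = a / np.linalg.norm(a, axis=0)
    b = b / np.linalg.norm(b, axis=0)
    return a.T @ b


def fit_statsmodels(data, n_factor):
    """Varimax-rotated principal-axis loadings from statsmodels."""
    from statsmodels.multivariate.factor import Factor

    results = Factor(endog=data, n_factor=n_factor, method="pa").fit()
    results.rotate("varimax")
    return results.loadings


# Hypothetical "true" loadings with simple structure (illustration only).
true_loadings = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                          [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
data = simulate_factor_data(true_loadings, n_obs=2000)
```

With statsmodels installed, `np.abs(congruence(true_loadings, fit_statsmodels(data, 2)))` gives a 2x2 matrix; entries close to 1 in one cell per row and column would suggest the loadings are recovered up to sign and column order. The same `data` could be passed to other packages to compare results.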

