Elimination of singular variables from linear regressions


James Muller

Jun 3, 2024, 2:18:44 PM
to pystatsmodels
For all my searching, I haven't found an equivalent implementation of the standard procedure for eliminating singular variables from linear models, à la R and SAS. If this has been implemented, please let me know, as I'd rather use the official version than mine.

However, below is my working algorithm implementing the same behavior. If this hasn't been integrated into statsmodels:
  1.  Could it be integrated into the statsmodels linear models?
      - Since there are probably plenty of existing usage patterns that already work around this, it may make sense to default to not doing it but allow it to be switched on.
  2.  What changes would you make to my draft?
      - E.g., the default could be to exclude variables in the order they appear in the dataframe or matrix, while allowing the user to specify a different preference for which variable gets kicked out.
  3.  How does one formally request this? I'm strongly inclined not to hack the statsmodels code myself, but if that's the better path, I'm happy to do so.
Here's the draft:

import numpy as np
import pandas as pd

def identify_singular_variables(X, target=None, return_included=False):
    """
    Identify variables that cause singularity in a design matrix.

    This function checks for singular variables in a design matrix and
    returns either the list of variables to drop to resolve singularity
    or the list of variables to include to maintain a non-singular matrix.

    Parameters
    ----------
    X : pandas.DataFrame, numpy.ndarray, or patsy.DesignMatrix
        The input design matrix.
    target : str or int, optional
        The name or index of the target column to exclude from the analysis.
        If None, no column is excluded.
    return_included : bool, optional
        If True, returns the variables to include in the design matrix.
        If False (default), returns the variables to exclude.

    Returns
    -------
    list
        A list of column names or indices of variables to drop (if
        return_included is False) or to include (if return_included is True).

    Examples
    --------
    >>> import pandas as pd
    >>> X = pd.DataFrame({
    ...     'A': [1, 2, 3],
    ...     'B': [2, 4, 6],
    ...     'C': [1, 0, 1]
    ... })
    >>> identify_singular_variables(X)
    ['B']
   
    >>> identify_singular_variables(X, return_included=True)
    ['A', 'C']
   
    >>> import numpy as np
    >>> X = np.array([
    ...     [1, 2, 1],
    ...     [2, 4, 0],
    ...     [3, 6, 1]
    ... ])
    >>> identify_singular_variables(X)
    [1]
   
    >>> identify_singular_variables(X, return_included=True)
    [0, 2]
    """

    def is_singular(matrix):
        """Check whether the columns of the matrix are linearly dependent."""
        return np.linalg.matrix_rank(matrix) < matrix.shape[1]

    def prepare_matrix(X, target):
        """Prepare the matrix X and handle the target column."""
        # Try to import patsy, but continue if not available.
        try:
            from patsy import DesignMatrix
        except ImportError:
            DesignMatrix = None

        # Check if X is a pandas DataFrame.
        if isinstance(X, pd.DataFrame):
            if target is not None:
                # Drop the target column if specified.
                X = X.drop(columns=target)
            # Save column names.
            col_names = X.columns
            # Convert DataFrame to numpy array.
            X = X.values
        elif isinstance(X, np.ndarray):
            col_names = None
            if target is not None:
                # Drop the target column if specified (for numpy array).
                X = np.delete(X, target, axis=1)
        elif DesignMatrix is not None and isinstance(X, DesignMatrix):
            # Handle patsy DesignMatrix.
            col_names = list(X.design_info.column_names)
            # Convert to a plain numpy array.
            X = np.asarray(X)
            if target is not None:
                # Allow the target to be given by name or by index.
                idx = col_names.index(target) if isinstance(target, str) else target
                del col_names[idx]
                X = np.delete(X, idx, axis=1)
        else:
            raise ValueError(
                "Unsupported input type. Expected pandas DataFrame, numpy array, "
                "or patsy DesignMatrix."
            )
        return X, col_names

    # Prepare the matrix X and handle the target column.
    X, col_names = prepare_matrix(X, target)

    # Check if the original design matrix is singular.
    if not is_singular(X):
        # If not singular, return either an empty list (to drop) or all
        # column names/indices (to include).
        return [] if not return_included else (
            list(col_names) if col_names is not None else list(range(X.shape[1]))
        )
   
    # List to keep track of variables to drop.
    to_drop = []
    # Indices of the columns kept so far.
    kept = []

    # Iterate over the variables in order, keeping each one only if it adds
    # rank to the columns already kept.
    for i in range(X.shape[1]):
        rank_with_i = np.linalg.matrix_rank(X[:, kept + [i]])
        if rank_with_i <= len(kept):
            # The column is linearly dependent on the columns already kept:
            # mark it for removal.
            if col_names is not None:
                to_drop.append(col_names[i])
            else:
                to_drop.append(i)
        else:
            kept.append(i)
   
    if return_included:
        # Return variables to include, i.e., those not in to_drop.
        if col_names is not None:
            to_include = [col for col in col_names if col not in to_drop]
        else:
            to_include = [i for i in range(X.shape[1]) if i not in to_drop]
        return to_include
   
    return to_drop
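
For context, a minimal usage sketch of how I imagine this slotting in (assuming a DataFrame X of regressors and a response y already exist; nothing here is an existing statsmodels feature):

import statsmodels.api as sm

# Drop the flagged columns, then fit as usual. `X` (a DataFrame of
# regressors) and `y` are assumed to already exist.
drop = identify_singular_variables(X)
X_clean = X.drop(columns=drop)
res = sm.OLS(y, sm.add_constant(X_clean)).fit()
print(res.summary())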

Best,

James

josef...@gmail.com

Jun 3, 2024, 3:05:23 PM
to pystat...@googlegroups.com
Hi,

It is an old debate, and it is still undecided how we could support an option to identify and drop collinear columns.
However, dropping collinear variables will never become the default in statsmodels. We will stay with the generalized inverse (pinv) to handle collinearity in linear regression.
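
To illustrate the idea (just the gist, not the exact internal code path): with a singular design, the pseudoinverse still returns a (minimum-norm) coefficient vector instead of raising an error.

import numpy as np

# The second column duplicates the first, so the design is singular,
# yet pinv-based least squares still returns the minimum-norm solution.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 1.0, 3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
beta = np.linalg.pinv(X) @ y   # same idea as OLS(y, X).fit(method="pinv")
print(beta)                    # approximately [0, 0, 1]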

Here is a related summary issue with links to other related issues and discussions. 

I haven't looked through your code yet. 
AFAIK, the simplest way to identify collinear columns up to some numerical threshold is QR.
Pivoted QR looks too arbitrary in which columns it drops, at least in my experiments the last time I looked at it.
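
For concreteness, a rough sketch of the pivoted-QR check (the tolerance and threshold here are my own choices, not an existing statsmodels API):

import numpy as np
from scipy.linalg import qr

def collinear_columns_qr(X, tol=1e-10):
    # Pivoted QR reorders the columns so that |diag(R)| is non-increasing;
    # diagonal entries below the tolerance belong to columns that are
    # numerically dependent on the columns pivoted in before them.
    _, r, piv = qr(np.asarray(X, dtype=float), mode="economic", pivoting=True)
    diag = np.abs(np.diag(r))
    rank = int((diag > tol * diag[0]).sum()) if diag.size else 0
    # Columns pivoted past the numerical rank are the candidates to drop.
    return sorted(piv[rank:].tolist())

The arbitrariness is visible here: pivoting picks columns by their remaining norm, so which of two collinear columns gets flagged can change just by rescaling them.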

related

Here is an old comment of mine on why I (or we) did not like automatic dropping of variables.

I would like to have an option for the user, but it is still not clear where and how to add (partial) support for this.


Josef
 


josef...@gmail.com

Jun 6, 2024, 4:44:03 PM
to pystat...@googlegroups.com
Actually, I had completely forgotten about `_fit_collinear`.
I added it a long time ago and never made it a public method.
I have not looked at it in a very long time and don't know what its status is.
It has unit tests for the main results, so it should still work as intended.

Example: the last two columns repeat earlier variables.
This is equivalent to imposing zero constraints on the later, collinear variables (an explicit fit_constrained version is sketched after the summary output below).

import numpy as np
from statsmodels.regression.linear_model import OLS

# Simulate data: a constant plus three random regressors.
nobs, k_var = 100, 3
x = np.random.randn(nobs, k_var)
x = np.column_stack((np.ones(nobs), x))
y = 0.5 * x.sum(1) + np.random.randn(nobs)

# Duplicate the last two columns to make the design matrix singular.
x2 = np.column_stack((x, x[:, -2:]))

mod = OLS(y, x2)

res2 = mod._fit_collinear()
print(res2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.330
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     15.73
Date:                Thu, 06 Jun 2024   Prob (F-statistic):           2.14e-08
Time:                        16:33:56   Log-Likelihood:                -147.51
No. Observations:                 100   AIC:                             303.0
Df Residuals:                      96   BIC:                             313.4
Df Model:                           3                                        
Covariance Type:            nonrobust                                        
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3387      0.110      3.093      0.003       0.121       0.556
x1             0.2373      0.107      2.227      0.028       0.026       0.449
x2             0.3340      0.109      3.074      0.003       0.118       0.550
x3             0.5719      0.115      4.966      0.000       0.343       0.800
x4                  0          0        nan        nan           0           0
x5                  0          0        nan        nan           0           0
==============================================================================
Omnibus:                        2.200   Durbin-Watson:                   1.688
Prob(Omnibus):                  0.333   Jarque-Bera (JB):                2.149
Skew:                           0.292   Prob(JB):                        0.341
Kurtosis:                       2.582   Cond. No.                     2.66e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.23e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
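
For comparison, here is a sketch of the explicit zero-constraint fit this is equivalent to, using fit_constrained with the default parameter names (const, x1, ..., x5) shown in the summary above; this is only illustrative, not necessarily how _fit_collinear works internally:

# Constrain the redundant parameters to zero and refit the same model.
res_zero = mod.fit_constrained("x4 = 0, x5 = 0")
print(res_zero.params)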
