Despite a lot of searching, I haven't found a statsmodels equivalent of the standard procedure for eliminating singular (linearly dependent) variables from linear models, à la R and SAS. If this has already been implemented, please let me know, as I'd rather use the official version than mine.
In the meantime, below is my working draft of the same behavior. If this hasn't been integrated into statsmodels, I have a few questions:
- Could it be integrated as default behavior into the statsmodels linear models?
  - Since users have probably built plenty of workarounds around the current behavior, maybe start with it off by default but available as an option.
- What changes would you make to my draft?
  - E.g., by default exclude variables in the order they appear in the DataFrame or matrix, but allow the caller to specify a different preference for which variables get dropped first.
- How does one formally request this? I'm strongly inclined not to hack the statsmodels code myself, but if that's the better path, I'm happy to do so.
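To make the ordering question concrete, here is a minimal sketch of what a preference order could look like. The name `independent_columns` and the `priority` argument are hypothetical and not part of statsmodels. Columns listed earlier in `priority` are considered first, and a column is dropped only when it is a linear combination of the columns already kept.

```python
import numpy as np


def independent_columns(X, priority=None):
    """Return sorted indices of columns to keep (hypothetical helper).

    priority : sequence of int, optional
        Order in which columns are considered; earlier entries are
        preferred. Defaults to left-to-right column order.
    """
    X = np.asarray(X, dtype=float)
    order = list(range(X.shape[1])) if priority is None else list(priority)
    kept = []
    for j in order:
        # Keep column j only if it raises the rank of the kept set.
        if np.linalg.matrix_rank(X[:, kept + [j]]) > len(kept):
            kept.append(j)
    return sorted(kept)


# Column 1 is exactly twice column 0.
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 0.0],
              [3.0, 6.0, 1.0]])

print(independent_columns(X))                      # [0, 2]
print(independent_columns(X, priority=[1, 0, 2]))  # [1, 2]
```

With the default order, the later of two dependent columns is the one that goes; passing a different `priority` flips which one survives without changing the rank of the result.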
Here's the draft:
import numpy as np
import pandas as pd


def identify_singular_variables(X, target=None, return_included=False):
    """
    Identify variables that cause singularity in a design matrix.

    Columns are scanned in order, and a column is flagged when it is a
    linear combination of the columns kept before it. The function
    returns either the list of variables to drop to resolve singularity
    or the list of variables to include to maintain a non-singular matrix.

    Parameters
    ----------
    X : pandas.DataFrame, numpy.ndarray, or patsy.DesignMatrix
        The input design matrix.
    target : str or int, optional
        The name or index of the target column to exclude from the analysis.
        If None, no column is excluded.
    return_included : bool, optional
        If True, returns the variables to include in the design matrix.
        If False (default), returns the variables to exclude.

    Returns
    -------
    list
        A list of column names or indices of variables to drop (if
        return_included is False) or to include (if return_included is True).
        For a numpy array, indices refer to column positions after the
        target column has been removed.

    Examples
    --------
    >>> import pandas as pd
    >>> X = pd.DataFrame({
    ...     'A': [1, 2, 3],
    ...     'B': [2, 4, 6],
    ...     'C': [1, 0, 1]
    ... })
    >>> identify_singular_variables(X)
    ['B']
    >>> identify_singular_variables(X, return_included=True)
    ['A', 'C']
    >>> import numpy as np
    >>> X = np.array([
    ...     [1, 2, 1],
    ...     [2, 4, 0],
    ...     [3, 6, 1]
    ... ])
    >>> identify_singular_variables(X)
    [1]
    >>> identify_singular_variables(X, return_included=True)
    [0, 2]
    """

    def is_singular(matrix):
        """Check for column rank deficiency using the SVD-based rank."""
        return np.linalg.matrix_rank(matrix) < matrix.shape[1]

    def prepare_matrix(X, target):
        """Prepare the matrix X and handle the target column."""
        # Try to import patsy, but continue if not available.
        try:
            from patsy import DesignMatrix
        except ImportError:
            DesignMatrix = None

        if isinstance(X, pd.DataFrame):
            if target is not None:
                # Drop the target column if specified.
                X = X.drop(columns=target)
            # Save column names, then convert to a numpy array.
            col_names = list(X.columns)
            X = X.to_numpy()
        elif DesignMatrix is not None and isinstance(X, DesignMatrix):
            # Checked before np.ndarray because DesignMatrix subclasses it.
            col_names = list(X.design_info.column_names)
            X = np.asarray(X)
            if target is not None:
                # Accept either a column name or a positional index.
                idx = col_names.index(target) if isinstance(target, str) else target
                del col_names[idx]
                X = np.delete(X, idx, axis=1)
        elif isinstance(X, np.ndarray):
            col_names = None
            if target is not None:
                # Drop the target column if specified (for numpy array).
                X = np.delete(X, target, axis=1)
        else:
            raise ValueError(
                "Unsupported input type. Expected pandas DataFrame, numpy array, "
                "or patsy DesignMatrix."
            )
        return X, col_names

    # Prepare the matrix X and handle the target column.
    X, col_names = prepare_matrix(X, target)
    # Labels are column names when available, positional indices otherwise.
    labels = col_names if col_names is not None else list(range(X.shape[1]))

    # If the design matrix has full column rank, nothing needs to be dropped.
    if not is_singular(X):
        return labels if return_included else []

    kept = []     # positions of the columns retained so far
    to_drop = []  # labels of the columns that depend on earlier ones
    for i in range(X.shape[1]):
        # Keep column i only if it raises the rank of the retained set;
        # otherwise it is a linear combination of columns already kept.
        if np.linalg.matrix_rank(X[:, kept + [i]]) > len(kept):
            kept.append(i)
        else:
            to_drop.append(labels[i])

    if return_included:
        # Return variables to include, i.e., those not in to_drop.
        return [labels[i] for i in kept]
    return to_drop
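For the opt-in idea, here is a rough sketch of how a default-off flag might behave. It uses plain `np.linalg.lstsq` as a stand-in for a statsmodels fit, and `fit_ols` and `drop_singular` are hypothetical names, not existing API. Dropped coefficients come back as NaN, similar to how R reports NA for aliased terms.

```python
import numpy as np


def fit_ols(X, y, drop_singular=False):
    """Least-squares fit with optional removal of dependent columns.

    Hypothetical wrapper: drop_singular defaults to False, so nothing
    changes unless the caller opts in.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    kept = list(range(X.shape[1]))
    if drop_singular:
        kept = []
        for j in range(X.shape[1]):
            # Keep column j only if it is independent of the kept columns.
            if np.linalg.matrix_rank(X[:, kept + [j]]) > len(kept):
                kept.append(j)
    # Coefficients of dropped columns stay NaN.
    coef = np.full(X.shape[1], np.nan)
    solution = np.linalg.lstsq(X[:, kept], y, rcond=None)[0]
    coef[kept] = solution
    return coef


# Intercept, x, and 2*x; the response is exactly 1 + 2*x.
X = np.array([[1, 0, 0],
              [1, 1, 2],
              [1, 2, 4],
              [1, 3, 6]])
y = np.array([1, 3, 5, 7])

print(fit_ols(X, y))                      # all finite, minimum-norm solution
print(fit_ols(X, y, drop_singular=True))  # last coefficient is NaN
```

Without the flag, the fit silently splits the slope between the two dependent columns; with it, the first column keeps the full slope and the redundant one is reported as missing.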
Best,
James