Weights for domain imbalance across time in a Conditional Random Forest

cbechet90

Aug 14, 2024, 7:48:13 PM
to StatForLing with R

Hello everyone,

I am currently working on a project that involves analyzing the use of two constructions. The goal is to determine which linguistic and contextual features (e.g., period, domain, syntactic position) influence the choice between these constructions using a conditional random forest in R.

The corpus I'm working with is imbalanced in terms of domain distribution across different historical periods. For instance, religious texts are predominant in the 1300-1499 period, whereas administrative texts are more frequent in the 1500-1699 period. This imbalance could potentially bias the model if not accounted for.

Does it make sense to incorporate weights in the conditional random forest to adjust for this domain imbalance across time? Specifically, I'm considering weighting observations based on the relative proportion of each domain within its respective period.
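
To make the weighting rule concrete, here is a minimal sketch of the arithmetic for a single period, using the 1300-1499 word counts from the simulation in this post. The object names (words_1300, prop_1300, weight_1300) are only illustrative.

```r
# Word counts per domain for the 1300-1499 period
# (figures taken from the simulated corpus in this post)
words_1300 <- c(Religious = 7403351, Historical = 2986153,
                Administrative = 2430874, Literary = 2408425)

# Relative size of each domain within the period
prop_1300 <- words_1300 / sum(words_1300)

# Inverse-proportion weight: the smaller a domain's share, the larger its weight
weight_1300 <- 1 / prop_1300

print(round(prop_1300, 3))
print(round(weight_1300, 2))
```

Under this rule the dominant domain in a period (here, religious texts) receives the smallest weight, and weight times proportion equals 1 for every domain.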

Below is an R example that simulates this scenario. It includes:

  1. A simulated dataset of observations with two constructions and various features.
  2. The calculation of weights based on the domain and period distributions.
  3. The fitting of a conditional random forest model with these weights.

# Load necessary libraries
library(party)
library(dplyr)

# Simulate the word count distribution across periods and domains
word_counts <- data.frame(
  Period = rep(c("1300-1499", "1500-1699", "1700-1899", "1900-today"), each = 4),
  Domain = rep(c("Religious", "Historical", "Administrative", "Literary"), times = 4),
  Word_Count = c(7403351, 2986153, 2430874, 2408425,
                 307029, 3421156, 3533387, 1350994,
                 462784, 1471740, 438508, 4094127,
                 1172234, 725136, 1248272, 1545830)
)

# Calculate the total word count per period
period_totals <- word_counts %>% group_by(Period) %>% summarize(Total_Words = sum(Word_Count))

# Merge period totals with word_counts to calculate relative domain size within each period
word_counts <- merge(word_counts, period_totals, by = "Period")

# Calculate the expected frequency (as a proportion of the total words in that period)
word_counts <- word_counts %>%
  mutate(Expected_Proportion = Word_Count / Total_Words)

# Simulate the data frame with 800 observations.
# Period and Domain are sampled jointly (one corpus cell per observation),
# so the simulated data reproduce the domain imbalance across periods;
# sampling the two columns independently would lose that association.
set.seed(42)
cell <- sample(nrow(word_counts), 800, replace = TRUE,
               prob = word_counts$Word_Count / sum(word_counts$Word_Count))
observations <- data.frame(
  Complex_Prep = rep(c("cx1", "cx2"), each = 400),
  Period = word_counts$Period[cell],
  Domain = word_counts$Domain[cell],
  Lexical_Semantics = sample(c("temporal", "spatial", "causal", "modal"), 800, replace = TRUE),
  Grammatical_Category = sample(c("NP", "gerund", "pro-form"), 800, replace = TRUE),
  Syntactic_Position = sample(c("initial", "medial", "final"), 800, replace = TRUE)
)

# Merge observations with word_counts to assign weights
observations_with_weights <- merge(observations, word_counts, by = c("Period", "Domain"))

# Calculate weights: Inverse of the expected proportion
observations_with_weights$Weight <- 1 / observations_with_weights$Expected_Proportion

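# cforest() requires factors, so convert the character columns first
observations_with_weights[] <- lapply(
  observations_with_weights,
  function(x) if (is.character(x)) factor(x) else x
)

# Fit the conditional random forest model with period- and domain-adjusted weights
# (the response column is Complex_Prep)
crf_model_weighted <- cforest(Complex_Prep ~ Period + Domain + Lexical_Semantics + Grammatical_Category + Syntactic_Position,
                              data = observations_with_weights,
                              weights = observations_with_weights$Weight,
                              controls = cforest_unbiased(ntree = 500, mtry = 3))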

# Calculate and display variable importance with the weighted model
weighted_varimp <- varimp(crf_model_weighted, conditional = TRUE)
print(weighted_varimp)

The output from varimp(crf_model_weighted, conditional = TRUE) provides variable importance scores, but my concern is whether the weighting approach appropriately addresses the domain imbalance across time.
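
One check I could run on the weights themselves is to compare the weighted domain shares within each period: if the weighting works as intended, every domain should end up with the same weighted share in a given period. The sketch below uses a small hypothetical data set (the toy data frame, its counts, and the P1/P2 labels are made up for illustration). It also derives the weights from the observed shares in the sample rather than from corpus word counts, which is a simplifying assumption that differs from the code above.

```r
# Hypothetical toy data: two periods with opposite domain imbalance
toy <- data.frame(
  Period = c(rep("P1", 10), rep("P2", 10)),
  Domain = c(rep("Religious", 8), rep("Administrative", 2),
             rep("Religious", 3), rep("Administrative", 7))
)

# Observed share of each domain within its period (rows sum to 1)
cell_n    <- table(toy$Period, toy$Domain)
cell_prop <- prop.table(cell_n, margin = 1)

# Inverse-proportion weight for every observation, looked up by (Period, Domain)
toy$Weight <- 1 / cell_prop[cbind(as.character(toy$Period),
                                  as.character(toy$Domain))]

# Weighted share of each domain within each period
weighted_n     <- xtabs(Weight ~ Period + Domain, data = toy)
weighted_share <- prop.table(weighted_n, margin = 1)
print(round(weighted_share, 2))
```

In this toy case both domains come out with equal weighted shares in each period. With weights based on corpus word counts, the shares would only be approximately equal, depending on how closely the sampled observations follow the corpus proportions.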

Does this approach of incorporating weights seem valid given the domain imbalance across periods?

Are there alternative methods or considerations that I should be aware of to ensure the model accounts for this imbalance effectively?

Thank you in advance for your insights!

All the best,
Christophe.
