Hi,
You are right, and this point was mentioned in the lecture slides as a disadvantage of the filter method with mutual information.
That's why Fisher Score and Correlation were introduced as alternatives to mutual information for the filter method.
But if we really want to use mutual information -- estimating the joint density p(x, y) is indeed a hard task, usually even harder than estimating p(y|x), which is the typical goal in many classification approaches.
And it's true that if one knows p(x, y), then one knows basically everything, and can classify data points, generate new samples, perform data completion (e.g. if some x_i is unknown), etc.
However, one can still get *some* approximation of p(x, y), which might be useful in some cases. For example, if x and y are continuous, one can fit, e.g., a mixture of Gaussians to the training data and thereby get an estimate of p(x, y). Or one can discretize x and y and then simply count, as one would do in the discrete case.
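To make the continuous case a bit more concrete, here is a rough sketch (not from the slides; the function name, the number of mixture components, and the Monte Carlo sample size are arbitrary choices) that fits scikit-learn's GaussianMixture to the joint samples and estimates I(x; y) by Monte Carlo, using the fact that the marginals of a full-covariance Gaussian mixture are again Gaussian mixtures:

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.mixture import GaussianMixture

def gmm_mutual_information(x, y, n_components=5, n_mc=20000, seed=0):
    """Rough MI estimate (in nats) for two continuous 1-D variables via a GMM fit."""
    xy = np.column_stack([x, y])
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(xy)

    def marginal_logpdf(v, dim):
        # The marginal of a full-covariance GMM along one dimension is a 1-D GMM
        # with the same weights and the corresponding means / variances.
        mu = gmm.means_[:, dim]
        var = gmm.covariances_[:, dim, dim]
        log_comp = -0.5 * (v[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)
        return logsumexp(log_comp + np.log(gmm.weights_), axis=1)

    samples, _ = gmm.sample(n_mc)                  # draw from the fitted joint
    log_joint = gmm.score_samples(samples)         # log p(x, y)
    log_px = marginal_logpdf(samples[:, 0], 0)     # log p(x)
    log_py = marginal_logpdf(samples[:, 1], 1)     # log p(y)
    return np.mean(log_joint - log_px - log_py)    # E[log p(x,y) / (p(x) p(y))]
```

Of course the quality of the estimate then depends on how well the mixture actually fits the data; the same recipe works with any density model whose marginals are tractable.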
And also note that although the discrete case sounds conceptually simpler, this kind of counting is directly affected by the curse of dimensionality, so it may again not be very practical.
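Still, in the typical filter-method setting where each feature is scored individually against the label, a simple binning estimate is easy to write down. A minimal sketch (the function name, the bin count, and the toy data are just for illustration):

```python
import numpy as np

def binned_mutual_information(x, y, n_bins=10):
    """Estimate I(x; y) in nats for one continuous feature x and a discrete label y."""
    edges = np.histogram_bin_edges(x, bins=n_bins)
    x_idx = np.digitize(x, edges[1:-1])            # bin indices 0 .. n_bins-1
    classes, y_idx = np.unique(y, return_inverse=True)

    joint = np.zeros((n_bins, len(classes)))
    np.add.at(joint, (x_idx, y_idx), 1)            # joint counts
    p_xy = joint / joint.sum()                     # empirical p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)          # empirical p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)          # empirical p(y)

    nz = p_xy > 0                                  # skip empty cells
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x * p_y)[nz]))

# quick sanity check on toy data
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5000)
x_informative = y + 0.5 * rng.standard_normal(5000)   # related to the label
x_noise = rng.standard_normal(5000)                    # unrelated to the label
print(binned_mutual_information(x_informative, y))     # clearly above zero
print(binned_mutual_information(x_noise, y))           # close to zero
```

Doing the same with the full feature vector instead of a single x_i is exactly where the curse of dimensionality kicks in: the number of cells grows exponentially with the number of features, so most counts end up being zero or very noisy.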
Best,
Maksym