I was having some trouble understanding how everything fit together, and I found working through the following example (suggested by Drew) helpful:
You have a six-sided die and a feature f indicating whether the roll is odd (e.g. f(x=1) = 1, f(x=4) = 0). You know that the mean of the feature, F, is 0.7. Find the maxent distribution given only this information (the p(x) for x = 1...6 and the lambda used when computing it). The actual solution is simple and obviously has the property p(x=1) = p(x=3) = p(x=5) and p(x=2) = p(x=4) = p(x=6), so it's easy to check you got the right answer.
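For anyone who wants to check their work, here is a sketch of the closed-form answer. It assumes the usual exponential-family form of the maxent solution with a single multiplier lambda on the odd feature; the symbol Z for the normalizer is my own notation, not something from the original exercise.

```latex
% Maxent form with one feature f(x) = 1 if x is odd, 0 otherwise
p(x) = \frac{e^{\lambda f(x)}}{Z(\lambda)}, \qquad
Z(\lambda) = \sum_{x=1}^{6} e^{\lambda f(x)} = 3e^{\lambda} + 3

% Constraint: the expected feature value must equal 0.7
\mathbb{E}_p[f] = \frac{3e^{\lambda}}{3e^{\lambda} + 3}
               = \frac{e^{\lambda}}{e^{\lambda} + 1} = 0.7
\;\Longrightarrow\;
\lambda = \ln\frac{0.7}{0.3} = \ln\frac{7}{3} \approx 0.847

% Resulting distribution
p(1) = p(3) = p(5) = \frac{0.7}{3} \approx 0.233, \qquad
p(2) = p(4) = p(6) = \frac{0.3}{3} = 0.1
```

The odd faces share the 0.7 of probability mass equally and the even faces share the remaining 0.3, which matches the symmetry noted above.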
Also, once you find each p(x) as a function of lambda, you can use gradient descent to find the optimal lambda.
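Here is a minimal sketch of that gradient descent step in Python. It assumes the objective is the convex dual log Z(lambda) - 0.7 * lambda, whose gradient is E[f] - 0.7. The function names, learning rate, and step count are my own illustrative choices, not part of the original exercise.

```python
import math

FACES = range(1, 7)

def feature(x):
    """Odd indicator: 1 for odd faces, 0 for even faces."""
    return x % 2

def maxent_die(lam):
    """Return p(x) for x = 1..6, with p(x) proportional to exp(lam * f(x))."""
    weights = [math.exp(lam * feature(x)) for x in FACES]
    z = sum(weights)
    return [w / z for w in weights]

def fit_lambda(target=0.7, lr=1.0, steps=2000):
    """Gradient descent on the dual log Z(lam) - lam * target (assumed objective)."""
    lam = 0.0
    for _ in range(steps):
        p = maxent_die(lam)
        expected = sum(px * feature(x) for x, px in zip(FACES, p))
        # Gradient of the dual is the model's expected feature minus the target.
        lam -= lr * (expected - target)
    return lam

lam = fit_lambda()
p = maxent_die(lam)
print("lambda =", lam)
for x, px in zip(FACES, p):
    print(f"p(x={x}) = {px:.4f}")
```

The resulting lambda can be compared against the closed-form value ln(0.7/0.3), and the odd and even faces should each come out with equal probabilities.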
~Shervin