I'm getting very different results from my offline evaluations (on a small subset of the training dataset) compared to my online evaluations (the ones computed by you). What's weird is that my offline evaluations are worse than the online ones, which means the difference is not due to overfitting.
Is it possible that the test set is not from the same distribution as the train set?
Short answer:
It is a (somewhat) different distribution. Because Lichess is open source, and all its puzzles are publicly available, we were not able to use the same puzzles for training and evaluation.
Long answer:
Instead, we tried to replicate the environment as much as possible:
1) We generated the puzzles using the same algorithm.
2) We made a clone of the puzzle server and let people solve the puzzles.
3) We used people's Lichess puzzle rating for the Glicko calculations.
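To make point 3 concrete, here is a minimal sketch (not our actual code) of how a puzzle's rating can be updated from solver attempts using the solvers' existing Lichess puzzle ratings. It uses the simpler Glicko-1 formulas rather than the Glicko-2 system Lichess runs, and all names and parameters are illustrative assumptions:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant


def g(rd):
    """Dampening factor: a higher opponent deviation makes the result count for less."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)


def expected_score(rating, opp_rating, opp_rd):
    """Expected probability that the puzzle 'beats' (i.e. is failed by) the solver."""
    return 1 / (1 + 10 ** (-g(opp_rd) * (rating - opp_rating) / 400))


def update_puzzle(rating, rd, attempts):
    """
    One Glicko-1 rating period for a puzzle.

    attempts: list of (solver_rating, solver_rd, score) tuples, where
    score = 1.0 if the solver failed the puzzle (a 'win' for the puzzle)
    and 0.0 if the solver solved it.
    """
    if not attempts:
        return rating, rd

    d2_inv = 0.0
    delta_sum = 0.0
    for opp_rating, opp_rd, score in attempts:
        e = expected_score(rating, opp_rating, opp_rd)
        gj = g(opp_rd)
        d2_inv += (Q ** 2) * (gj ** 2) * e * (1 - e)
        delta_sum += gj * (score - e)

    denom = 1 / rd ** 2 + d2_inv
    new_rating = rating + (Q / denom) * delta_sum
    new_rd = math.sqrt(1 / denom)
    return new_rating, new_rd


# Hypothetical usage: a new puzzle starts at 1500 with deviation 500 and is
# attempted by a handful of solvers with known Lichess puzzle ratings.
attempts = [(1800, 60, 0.0), (1450, 80, 1.0), (1620, 70, 1.0), (1900, 50, 0.0)]
print(update_puzzle(1500, 500, attempts))
```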
However, there are big differences between our set and Lichess's set:
1) When solving our puzzles, people can't see their puzzle rating (as that would let the data leak again), so psychologically it's a very different experience. In fact, simply knowing that they are taking part in a study can influence their problem solving.
2) When you solve puzzles on Lichess, the level of the puzzles is usually fairly stable, whereas here people might get "matched" with easy and difficult puzzles alternately, which again impacts their solving process and skews the data in some ways.
3) Most Lichess puzzles have been solved by thousands or even tens of thousands of people; ours mostly have 20-50 solvers. While there is always inherent uncertainty in a Glicko rating, the average rating deviation of Lichess puzzles is, if I recall correctly, around 90, whereas for our set it's ~130, so almost 50% higher (for reference, the starting deviation is 500).
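To see why 20-50 attempts leave a much higher deviation than thousands do, here is a rough, standalone Glicko-1-style calculation of how the deviation shrinks with the number of results. The exact numbers will not match Lichess's Glicko-2 output (which also handles rating periods and volatility), but the direction of the effect is the same:

```python
import math

Q = math.log(10) / 400


def g(rd):
    """Same dampening factor as in the Glicko-1 sketch above."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)


def rd_after(n_results, rd_start=500, opp_rd=80, expected=0.5):
    """Approximate deviation after n roughly evenly matched results (Glicko-1)."""
    # Each result contributes q^2 * g(RD_j)^2 * E * (1 - E) to 1/d^2.
    per_result = (Q ** 2) * g(opp_rd) ** 2 * expected * (1 - expected)
    d2_inv = n_results * per_result
    return math.sqrt(1 / (1 / rd_start ** 2 + d2_inv))


# The deviation keeps shrinking as more people attempt the puzzle,
# which is why puzzles with only a few dozen solves stay noisier.
for n in (10, 30, 50, 1000, 10000):
    print(n, round(rd_after(n)))
```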
This definitely adds to the challenge, and we encourage you to look for ways to account for the bias.
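For instance, if your offline evaluation compares predictions against the Glicko ratings of individual puzzles (an assumption about your setup, not a description of the official metric), one simple heuristic is to weight each puzzle by the inverse of its squared rating deviation, so that noisier labels count for less:

```python
import numpy as np


def deviation_weighted_rmse(y_true, y_pred, rating_deviation):
    """
    RMSE where each puzzle is weighted by 1 / RD^2, so puzzles whose
    Glicko rating is itself uncertain contribute less to the score.
    This is just one possible bias-reduction heuristic.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    w = 1.0 / np.asarray(rating_deviation, dtype=float) ** 2
    return float(np.sqrt(np.average((y_true - y_pred) ** 2, weights=w)))


# Hypothetical usage with made-up numbers: the third puzzle has a large
# deviation, so its big error barely moves the weighted score.
print(deviation_weighted_rmse(
    y_true=[1500, 1800, 2100],
    y_pred=[1520, 1750, 2600],
    rating_deviation=[75, 90, 450],
))
```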