Will the final evaluation be done on the same dataset state as the current evaluation? Or will you use updated ratings?
To clarify: presumably there was some cutoff date for collecting puzzle ratings, and the ratings as of that date are what the current evaluation uses. However, the puzzle ratings may still be updating (and thus converging further toward their true distribution). Will these updated ratings be used, or will the evaluation stay with the ratings from that cutoff point?
Thanks!
Whatever ratings were used for the leaderboard will not be used for final evaluation.
We will use updated puzzle ratings, but don't let that mislead you: the final number of attempts at solving the puzzles will be similar in the leaderboard and final test sets. It's simply that we prioritised a first batch back then when tagging (in other words, we got enough taggings on those puzzles to bring their RD below 130, and now we're trying to do the same with the rest).
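For anyone who wants to mirror that criterion on their own data, here is a minimal sketch of the "RD below 130" split. It assumes RD means a Glicko-style rating deviation stored per puzzle. The `Puzzle` class, its field names, and `split_by_rd` are hypothetical and only for illustration; this is not the organizers' actual pipeline.

```python
from dataclasses import dataclass

RD_THRESHOLD = 130  # threshold quoted in the reply above


@dataclass
class Puzzle:
    # Hypothetical field names, chosen for illustration only.
    puzzle_id: str
    rating: float
    rating_deviation: float


def split_by_rd(puzzles, threshold=RD_THRESHOLD):
    """Return (settled, pending).

    settled: puzzles whose rating deviation is strictly below the threshold.
    pending: puzzles that would still need more attempts.
    """
    settled = [p for p in puzzles if p.rating_deviation < threshold]
    pending = [p for p in puzzles if p.rating_deviation >= threshold]
    return settled, pending


# Made-up example values, not real competition data.
puzzles = [
    Puzzle("a", 1500.0, 95.0),
    Puzzle("b", 1820.0, 130.0),
    Puzzle("c", 1240.0, 210.0),
]
settled, pending = split_by_rd(puzzles)
print([p.puzzle_id for p in settled])
print([p.puzzle_id for p in pending])
```

The comparison is strict, so a puzzle sitting exactly at 130 is treated as still pending. Whether the real cutoff is strict is an assumption on my part.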