Training set order
by ayeright - Wednesday, August 7, 2019, 15:12:56

Is the training set ordered by time? If so, are the top or bottom rows of the dataset the most recent?

RE: Training set order
by andrzej - Wednesday, August 07, 2019, 21:50:51

Neither the training nor the test data table is ordered by time. We don't want to reveal those orderings for now.

Andrzej

RE: Training set order
by ayeright - Tuesday, August 13, 2019, 14:25:53

Is the test set from a period after the training set?

Not knowing the order makes it impossible to construct a sound validation strategy. I am stuck with standard K-fold cross-validation, which inevitably leads to training on samples that were generated after samples in the validation set. In real life, of course, we always train on samples from the past and predict the future. If we don't know the order of the training set, our validation scores will inevitably be overoptimistic and we will overfit.

RE: Training set order
by daniel_kaluza - Tuesday, August 13, 2019, 16:51:01

Yes, the test data is from a period after the training set.

I understand the issue. I will ask the rest of the organizing committee what we can do about it.

Thank you for your feedback,

Daniel

RE: Training set order
by andrzej - Wednesday, August 14, 2019, 01:47:11

@ayeright,

I understand your point perfectly, but unfortunately we can't reveal the ordering of the training data, as it could potentially undermine the anonymization process. We cannot risk that at this point.

Regarding the validation procedure, please note that splitting data based on time is not the only viable option. Since data from each of SoD's clients can be assumed to be independent, you may validate your models using splits by client codes. I realize that it is not a perfect solution, but it is better than using random validation folds.
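
A minimal sketch of such a client-based split, in plain Python. It assumes each training row carries a client code; the `client_codes` list and the `group_k_fold` helper below are hypothetical and only illustrate the idea. Every client's rows end up entirely in one fold, so no client appears on both the training and the validation side of a split.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, valid_idx) pairs in which no group is shared."""
    # Collect the row indices belonging to each group (here: client code).
    rows_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        rows_by_group[group].append(idx)

    if n_splits > len(rows_by_group):
        raise ValueError("n_splits cannot exceed the number of distinct groups")

    # Greedy balancing: place the largest groups first, each into the fold
    # that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    ordered = sorted(rows_by_group, key=lambda g: (-len(rows_by_group[g]), str(g)))
    for group in ordered:
        smallest = min(range(n_splits), key=lambda i: len(folds[i]))
        folds[smallest].extend(rows_by_group[group])

    # Each fold serves as the validation set exactly once.
    for i in range(n_splits):
        valid_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, valid_idx


# Hypothetical client codes, one per training row.
client_codes = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]

splits = list(group_k_fold(client_codes, n_splits=3))
for train_idx, valid_idx in splits:
    train_clients = {client_codes[i] for i in train_idx}
    valid_clients = {client_codes[i] for i in valid_idx}
    # No client contributes rows to both sides of the split.
    assert train_clients.isdisjoint(valid_clients)
```

If scikit-learn is available, its `GroupKFold` splitter in `sklearn.model_selection` does the same kind of grouping when the client codes are passed as the `groups` argument to `split`.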

Best,

Andrzej

RE: Training set order
by ayeright - Wednesday, August 14, 2019, 12:42:43

OK, thanks. I'll try that.