
## FedCSIS 2020 Challenge: Network Device Workload Prediction

The FedCSIS 2020 Data Mining Challenge: Network Device Workload Prediction is the seventh data mining competition organized in association with the Conference on Computer Science and Information Systems (https://fedcsis.org/). This year's task is related to the monitoring of large IT infrastructures and the estimation of their resource allocation. The challenge is sponsored by EMCA Software and the Polish Information Processing Society (PTI).

EMCA Software is a Polish vendor of Energy Logserver, a system that collects data from various log sources to provide in-depth data analysis and alerting to its end users. EMCA is based in Poland but also operates in the Nordics, APAC, and the USA through partner channels. The company focuses on cybersecurity and IT infrastructure monitoring use cases, aiming to deliver a ready-to-use system with built-in correlations and predictions on monitored data.

With this challenge, we want to help EMCA answer the question of whether it is possible to reliably predict workload-related characteristics of monitored devices based on historical data gathered from those devices. This task is of paramount importance for IT and technical teams, who would thereby gain a tool for managing the capacity of their infrastructure.

An additional difficulty within this challenge, and also the reason why it might be especially interesting for the data science community, arises from the fact that devices considered in the data are not uniform. In essence, logs cover readings from various types of hardware. Some of them are cross-dependent, as they are a part of the same IT system. Moreover, some devices have multiple interfaces for which the data is aggregated.

More details regarding the task and a description of the challenge data can be found in the Task description section.

Special session at FedCSIS 2020: As in previous years, a special session devoted to the competition will be held at the conference. We will invite authors of selected challenge reports to extend them for publication in the conference proceedings (after reviews by Organizing Committee members) and presentation at the conference. The papers will be indexed by the IEEE Digital Library and Web of Science. The invited teams will be chosen based on their final rank, innovativeness of their approach, and quality of the submitted report.


Training data in this challenge are hourly aggregated values of various workload characteristics extracted from device logs. They were made available as a CSV table with ten columns. The first three columns are identifiers; they are followed by the mean, the standard deviation, and a candlestick aggregation of the corresponding values. The meanings of the columns are:

• hostname: an ID of the device
• series: a name of the considered characteristic
• time_window: a timestamp of the aggregation window; the row aggregates values from an hour starting at the indicated timestamp
• Mean: the mean of the values
• SD: the standard deviation of the values
• Open: a value of the first reading during the corresponding hour
• High: the maximum of values
• Low: the minimum of values
• Close: a value of the last reading during the corresponding hour
• Volume: the number of values

For each hostname-series pair in the data, values can be arranged into a time series spanning over 80 days. Note, however, that some values may be missing for some pairs. Moreover, hostnames correspond to heterogeneous types of devices, for which different sets of characteristics are monitored. Some of these devices are part of the same system, and their workloads are likely to be highly correlated.
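As a minimal sketch of how the columns above can be arranged into per-device time series, the snippet below loads a tiny inline stand-in for the training CSV (the real file name and values are assumptions) and reindexes each hostname-series pair onto a full hourly grid so that missing hours show up explicitly as NaN:

```python
import io

import pandas as pd

# A tiny stand-in for the training CSV; in the real challenge this would be
# something like pd.read_csv("training_data.csv") -- file name assumed.
csv_text = """hostname,series,time_window,Mean,SD,Open,High,Low,Close,Volume
host_1,cpu_usage,2020-01-01 00:00:00,0.42,0.05,0.40,0.55,0.35,0.44,60
host_1,cpu_usage,2020-01-01 02:00:00,0.47,0.04,0.45,0.52,0.41,0.46,60
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time_window"])

# Arrange the Mean column of each hostname-series pair into an hourly time
# series; reindexing onto a complete hourly range exposes gaps as NaN.
for (host, series), group in df.groupby(["hostname", "series"]):
    ts = group.set_index("time_window")["Mean"].sort_index()
    full_range = pd.date_range(ts.index.min(), ts.index.max(), freq="h")
    ts = ts.reindex(full_range)  # the missing 01:00 hour becomes NaN
    print(host, series, len(ts), int(ts.isna().sum()))
```

The same pattern applies to any of the aggregate columns (SD, Open, High, Low, Close, Volume), and the resulting NaNs can then be imputed or masked depending on the modeling approach.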

The task and the format of submissions: The task in this challenge is to predict future values of workload characteristics for a number of devices from the training data. The IDs of the devices (hostname) and the characteristics for which predictions are to be made (series) are listed in the solution_template.csv file, available in the Data files section. Participants are asked to predict 168 consecutive values of each indicated time series (one full week) and upload their predictions through the submission system.

Submissions should follow the same format as the solution_template.csv file: CSV files with 170 columns. The first two columns should contain the device ID (hostname) and the characteristic ID (series), respectively, followed by 168 numeric columns with the predictions, i.e., the mean values of the corresponding characteristics for the next 168 hours (one week starting at 2020-02-20 12:00:00) after the end of the training data. The file exemplary_solution.csv contains an example of a correctly formatted submission file.
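A hedged sketch of assembling such a 170-column submission is shown below. The template rows, the constant "forecasts", and the internal column names (h1..h168) are all illustrative; whether a header row should be included is best checked against exemplary_solution.csv, so here it is omitted:

```python
import io

import pandas as pd

# Hypothetical stand-in for solution_template.csv: the first two columns
# enumerate the hostname-series pairs that must be predicted.
template = pd.DataFrame({
    "hostname": ["host_1", "host_2"],
    "series": ["cpu_usage", "mem_usage"],
})

# A naive constant baseline: repeat one value per series for all 168 hours.
# Real forecasts would come from a trained model.
horizon = 168
preds = {f"h{i}": [0.42, 0.13] for i in range(1, horizon + 1)}

submission = pd.concat([template, pd.DataFrame(preds)], axis=1)
assert submission.shape == (2, 170)  # hostname + series + 168 forecasts

# Write without header, matching the exemplary file's presumed layout.
buf = io.StringIO()
submission.to_csv(buf, index=False, header=False)
```

Each output row then reads `hostname,series,v1,...,v168`, where v1 corresponds to the hour starting at 2020-02-20 12:00:00.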

Evaluation: the quality of submissions will be evaluated using the $R^2$ measure, i.e., for each time series, the forecasts will be compared to ground truth values, and their quality will be assessed using the formula:

$$R^2(f, y) = 1 - \frac{RSS(f, y)}{TSS(y)},$$

where $RSS(f, y)$ is the residual sum of squares of forecasts:

$$RSS(f, y) = \sum_i (y_i - f_i)^2,$$

$TSS(y)$ is the total sum of squares:

$$TSS(y) = \sum_i (y_i - \bar{y})^2,$$

and $\bar{y}$ is the mean value of time series $y$ estimated using the available training data. The submission score is the average $R^2$ value over all time series from the test set.
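One subtlety worth noting: $\bar{y}$ is estimated from the training data, so off-the-shelf implementations such as scikit-learn's `r2_score` (which center on the mean of the evaluated window) would give a different number. A minimal sketch of the measure as defined above, with hypothetical toy values:

```python
import numpy as np

def r2_vs_training_mean(forecast, actual, train_mean):
    """R^2 as defined in the challenge: TSS is computed against the mean
    of the TRAINING part of the series, not the test-window mean."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.asarray(actual, dtype=float)
    rss = np.sum((actual - forecast) ** 2)   # residual sum of squares
    tss = np.sum((actual - train_mean) ** 2)  # total sum of squares
    return 1.0 - rss / tss

# Toy example (values are illustrative, not challenge data):
actual = np.array([1.0, 2.0, 3.0, 4.0])
forecast = np.array([2.0, 2.0, 3.0, 3.0])
train_mean = 2.5  # mean estimated from the (hypothetical) training series

# RSS = 1 + 0 + 0 + 1 = 2; TSS = 2.25 + 0.25 + 0.25 + 2.25 = 5.0
score = r2_vs_training_mean(forecast, actual, train_mean)
print(score)  # 0.6
```

The overall submission score would then be the plain average of this quantity over all evaluated time series.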

Solutions will be evaluated online, and preliminary results will be published on the public leaderboard. The preliminary score will be computed on a small subset of the test time series (10%), fixed for all participants. The final evaluation will be performed after the competition ends, using the remaining part of the test data; those results will also be published online. Note that only teams that submit a report describing their approach before the end of the challenge will qualify for the final evaluation. Moreover, to be eligible for the awards, the winning teams must exceed the score of the baseline solution by at least 10%.

In case of any questions, please post on the competition forum or write an email to contact {at} knowledgepit.ml

| Rank | Team Name | Score | Submission Date |
|---|---|---|---|
| 1 | Dymitr | 0.3223 | 2020-06-03 09:54:49 |
| 2 | amy | 0.3098 | 2020-06-03 12:21:16 |
| 3 | cdata | 0.2990 | 2020-06-02 18:49:52 |
| 4 | RandomGenerator | 0.2575 | 2020-05-25 18:05:36 |
| 5 | baseline solution | 0.2267 | 2020-03-24 16:25:13 |
| 6 | Climber | 0.2066 | 2020-06-03 18:03:13 |
| 7 | Karol Waszczuk | 0.1955 | 2020-05-29 23:24:31 |
| 8 | hieuvq | 0.1648 | 2020-05-14 18:22:53 |
| 9 | kajetan | 0.1512 | 2020-06-04 15:19:25 |
| 10 | Stanisław Kaźmierowski | 0.1464 | 2020-06-05 12:14:23 |
| 11 | little_skynet | 0.1113 | 2020-05-22 16:23:28 |
| 12 | berlin | 0.1049 | 2020-05-29 17:10:48 |
| 13 | Alex | 0.0185 | 2020-06-01 18:00:41 |
| 14 | -_- | 0.0109 | 2020-05-19 01:42:03 |
| 15 | pszulc | 0.0096 | 2020-06-05 22:58:14 |
| 16 | dataloader | -0.0013 | 2020-04-08 15:15:51 |
| 17 | mathurin | -0.0013 | 2020-05-12 14:27:27 |
| 18 | go | -0.0013 | 2020-05-19 18:16:53 |
| 19 | vbhargav875 | -0.0013 | 2020-05-29 10:01:37 |
| 20 | IME | -0.0013 | 2020-05-29 23:52:46 |
| 21 | joe | -0.0013 | 2020-06-01 15:46:56 |
| 22 | TeamName | -0.0013 | 2020-06-05 02:35:39 |
| 23 | Kirov reporting | -0.0474 | 2020-05-28 19:20:11 |
| 24 | Les Trois Mousquetaires | -0.0475 | 2020-06-06 02:36:59 |
| 25 | noidea | -0.0774 | 2020-06-06 02:57:01 |
| 26 | pesto | -0.1022 | 2020-03-27 23:31:38 |
| 27 | Michal | -0.1399 | 2020-04-11 09:45:35 |
| 28 | SELECT name FROM competition.losers | -0.1857 | 2020-06-05 22:14:14 |
| 29 | MultiPandas | -0.1923 | 2020-05-10 12:44:09 |
| 30 | papiez69 | -0.6648 | 2020-06-04 13:09:55 |
| 31 | Wrong Team Name | -0.7323 | 2020-05-25 23:24:40 |
| 32 | M | -1.4472 | 2020-05-30 20:21:49 |
| 33 | NJJ | -3.0328 | 2020-05-28 16:54:04 |
| 34 | One_n_Only | -7.7509 | 2020-06-04 05:25:34 |
| 35 | DenisVorotyntsev | -318.4680 | 2020-04-29 00:56:23 |
| 36 | onemanarmy | -999.0000 | 2020-05-05 20:23:09 |
| 37 | Azul | -999.0000 | 2020-05-25 15:14:24 |
| 38 | Niko | -999.0000 | 2020-06-06 02:06:52 |

• March 23, 2020: start of the challenge, the data set becomes available
• March 25, 2020: submission system opens
• June 8, 2020 (23:59:59 GMT): submission system closes
• June 10, 2020 (23:59:59 GMT): sending reports due
• June 17, 2020: online publication of the final results, sending invitations for submitting papers
• July 1, 2020: deadline for submitting invited papers
• July 8, 2020: notification of paper acceptance
• July 15, 2020: camera-ready of accepted papers, and registration to the conference due

Authors of the top-ranked solutions (based on the final evaluation scores) will be awarded prizes funded by our sponsors:

• First Prize: 1500 USD + one free FedCSIS'20 conference registration,
• Second Prize: 1000 USD + one free FedCSIS'20 conference registration,
• Third Prize: 500 USD + one free FedCSIS'20 conference registration.

The award ceremony will take place during the FedCSIS'20 conference. Please note that the winners will be eligible for the money prizes only if their final score exceeds the baseline solution score by at least 10%.

• Andrzej Janusz, QED Software & University of Warsaw
• Piotr Biczyk, QED Software
• Artur Bicki, EMCA Software
• Mateusz Przyborowski, QED Software & University of Warsaw
