IML2022project1

First semester project for the Interactive Machine Learning 2021/2022 course

The task is to choose optimal batches of queries for training a multi-label prediction model.

Overview

The goal of this competition is to choose three subsets of samples from the data pool such that they bring the largest improvement to a prediction model.

Competition rules are given in the Terms & Conditions section.

The description of the task, data, and evaluation metric is in the Task description section.

Terms & Conditions

Participants of the challenge are obliged to follow the competition rules:

This challenge is organized by Andrzej Janusz and Daniel Kałuża (the Organizers) for students enrolled in the Interactive Machine Learning 2021/2022 course at the Faculty of Mathematics, Informatics, and Mechanics at the University of Warsaw.
The provided data sets are the property of the Organizers and the KnowledgePit platform. It is forbidden to share or redistribute provided data sets to any third party without explicit consent from the Organizers.
Each team in the competition may consist of only one person. Working in larger groups or sharing solutions with other teams is strictly forbidden.
Each team has a limited number of submissions - the limit is set to 100.
The number of submissions per day is limited to 5.
Participants can use data that was made available in the challenge - using any external resources is forbidden. Queries regarding the external resources need to be issued through the competition forum.
It is strictly forbidden to hack the provided data or to exploit any unfair data leak that can improve the solution score. All attempts at making predictions for any test instance using information extracted from other test instances will result in disqualification.
The deadline for submitting the solutions is April 24, 2022 (23:59 GMT). Late submissions will not be accepted.
Each team is obliged to provide a short report describing their final solution. The report must contain information such as the name of the team, the names of all team members, and a brief overview of the used approach. It should be submitted in the KnowledgePit submission system by April 24, 2022 (23:59 GMT).
By enrolling in this competition, you grant the Organizers the right to process your submissions and reports for the purpose of evaluation and post-competition research.
The final project score will depend on the quality of the solution (the score obtained in the final evaluation), and on the quality of the submitted report.

Final results

Rank	Team Name	Is Report		Preliminary Score	Final Score	Submissions
1	mpacek	True	True	0.0738	0.075700	2
2	1._The_Industrial_Revolution_and_its_con	True	True	0.0948	0.059800	6
3	baseline	True	True	0.0538	0.058100	3
4	maciek.pioro	True	True	0.0946	0.055800	21
5	Maciej.D	True	True	0.1036	0.055300	11
6	Polska Gurom	True	True	0.0964	0.049000	15
7	150 g masła 200 g mąki ziemniaczanej 100 g mąki pszennej 3 jajka 150 g drobnego cukru do wypieków lub więcej 15 płaskiej łyżeczki proszku do pieczenia	True	True	0.0591	0.045100	16
8	aleksandra	True	True	0.0822	0.044700	28
9	I_sold_my_mum_for_ects	True	True	0.0861	0.043700	20
10	Tieru	True	True	0.0813	0.042500	44
11	Paweł	True	True	0.0620	0.042300	13
12	Michał Siennicki	True	True	0.0717	0.040400	18
13	smacznej kawusi	True	True	0.0451	0.038100	17
14	krzywicki_piotr	True	True	0.0480	0.037600	20
15	ksztenderski	True	True	0.0713	0.037000	24
16	jakubpw	True	True	0.0587	0.035200	3
17	mrgr	True	True	0.0751	0.029200	21

Task description

The task in this project is to choose three subsets of samples from the data pool such that they bring the largest improvement to a prediction model. The subsets should contain 50, 200, and 500 samples, respectively.

The initial batch of samples and the data pool are given as sparse matrices in the MatrixMarket format. Files in this format are simple text files with a header and three columns. The header consists of two lines: the first one is irrelevant, and the second one contains the dimensions of the sparse matrix, i.e., three integers giving the number of rows, the number of columns, and the total number of non-zero values in the matrix. After the header, there are three data columns giving the actual matrix values as entity-attribute-value (EAV) triples. In particular, each line of the file contains exactly three numbers: the row number, the column number, and the value. Keep in mind that the rows and columns are indexed from 1 (not 0).
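The file layout described above can be read with a few lines of Python. The following is only a minimal sketch using the standard library: the function name `read_matrix_market` is hypothetical, and it assumes that lines starting with `%` are header or comment lines and that the first remaining line holds the dimensions. Indices are converted from 1-based to 0-based on the way in.

```python
from collections import defaultdict


def read_matrix_market(path):
    """Parse a coordinate-format MatrixMarket file.

    Returns ((n_rows, n_cols, n_nonzero), rows), where rows maps a
    0-based row index to a dict {0-based column index: value}.
    """
    shape = None
    rows = defaultdict(dict)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Assumption: '%' marks the banner line and any comment lines.
            if not line or line.startswith("%"):
                continue
            parts = line.split()
            if shape is None:
                # First data-bearing line: rows, columns, non-zero count.
                shape = tuple(int(p) for p in parts[:3])
                continue
            row, col, value = int(parts[0]), int(parts[1]), float(parts[2])
            # The file is 1-indexed; store 0-based indices.
            rows[row - 1][col - 1] = value
    return shape, dict(rows)
```

In practice, `scipy.io.mmread` reads the same format directly into a SciPy sparse matrix, which is usually more convenient for model training. Either way, remember to add 1 back to row indices when referring to pool samples in a submission.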

Each row in the data corresponds to a preprocessed textual document, represented as a vector of TF-IDF values of its terms.

Additionally, there is a file with labels for the documents from the initial data batch. Each document can be labeled with one or more labels from the set A, B, C, D, E, F, G, H, I, J, K.

Format of submissions: solutions should be submitted as text files with three lines. The first line should contain exactly 50 integers - the indices of samples from the data pool (samples are indexed starting from 1), separated by commas. The second and the third line should contain analogous indices for the second and the third set of samples, with sizes 200, and 500, respectively.
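A submission in this format could be produced along the following lines. This is a sketch, not an official tool: the function name `format_submission` is hypothetical, and the checks for duplicates and for indices below 1 are assumptions based on the wording "subsets" and on the 1-based indexing, not rules stated explicitly by the Organizers.

```python
EXPECTED_SIZES = (50, 200, 500)


def format_submission(batch_50, batch_200, batch_500):
    """Return the three-line submission text for 1-based pool indices."""
    lines = []
    for batch, size in zip((batch_50, batch_200, batch_500), EXPECTED_SIZES):
        indices = [int(i) for i in batch]
        if len(indices) != size:
            raise ValueError(f"expected {size} indices, got {len(indices)}")
        # Assumption: each subset must consist of distinct samples.
        if len(set(indices)) != size:
            raise ValueError("duplicate indices in a batch")
        # Samples in the data pool are indexed starting from 1.
        if min(indices) < 1:
            raise ValueError("indices must start from 1")
        lines.append(",".join(str(i) for i in indices))
    return "\n".join(lines) + "\n"
```

The returned string can be written to a text file as-is. If the selection code works with 0-based row positions, add 1 to every index before calling the function.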

Evaluation: the evaluation of submitted solutions will be done using an XGBoost model, trained independently on each of the three sets of samples added to the initial data batch which was made available to participants. Each model will be evaluated on a separate test set (hidden from participants). The quality metric used for the evaluation will be the average F1 score. From each score, we will subtract the average F1 score obtained by a model trained only on the initial data batch, and the results for the three sets will be summed with weights 10, 2.5, and 1 for the subset sizes 50, 200, and 500, respectively.
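The weighting described above can be written out as a short function. This sketch covers only the final aggregation step; the names `weighted_score` and `WEIGHTS` are hypothetical, and the F1 values passed in are assumed to have been computed elsewhere (the exact averaging used for the F1 score is not specified here).

```python
# Weights for the subsets of size 50, 200, and 500, as given in the task.
WEIGHTS = (10.0, 2.5, 1.0)


def weighted_score(f1_scores, baseline_f1, weights=WEIGHTS):
    """Sum the weighted F1 improvements over the baseline model.

    f1_scores: average F1 of the models trained with the 50-, 200-, and
               500-sample subsets added to the initial batch.
    baseline_f1: average F1 of the model trained on the initial batch only.
    """
    if len(f1_scores) != len(weights):
        raise ValueError("one F1 score is needed per subset")
    return sum(w * (f1 - baseline_f1) for w, f1 in zip(weights, f1_scores))
```

Because the smallest subset carries a weight of 10, an improvement of 0.01 in F1 from the 50-sample batch contributes as much to the total as an improvement of 0.1 from the 500-sample batch.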

During the challenge, your solutions will be evaluated on a small fraction of the test set, and your best preliminary score will be displayed on the public Leaderboard. After the end of the competition, the selected solutions will be evaluated on the remaining part of the test data and this result will be used for the evaluation of the project.
