Data Mining Course 2023

Second semester project for Data Mining 2022/2023 course

This is the second project for students enrolled in the Data Mining course 2022/2023 (and 2023/2024) at the Faculty of Mathematics, Informatics, and Mechanics at the University of Warsaw. The task is to predict topics of scientific publications based on their abstracts.

Overview

The task in this challenge is to predict topics of scientific articles from the ACM Digital Library. The topics correspond to classes from the ACM Computing Classification System (the old version from 1998). This is a multi-label classification problem because each text is assigned to one or more classes.

More details regarding the available data, submission format, and evaluation can be found in the Task description section.

Terms & Conditions

Participants of the challenge are obliged to follow the competition rules:

This challenge is organized by Andrzej Janusz (the Organizer) for students enrolled in the Data Mining 2022/2023 course at the Faculty of Mathematics, Informatics, and Mechanics at the University of Warsaw.
The provided data sets are the property of the Organizer and the KnowledgePit platform. It is forbidden to share the provided data sets with, or redistribute them to, any third party without explicit consent from the Organizers.
Each team in the competition may consist of only one person. Working in larger groups or sharing solutions with other teams is strictly forbidden.
Each team may make at most 100 submissions in total.
The number of submissions per day is limited to 10.
Participants may use only the data made available in the challenge; using any external resources is forbidden. Queries regarding external resources need to be issued through the competition forum.
It is strictly forbidden to hack the provided data or to exploit any unfair data leak that can improve the solution score. All attempts at making predictions for any test instance using information extracted from other sources will result in disqualification.
The deadline for submitting the solutions is June 11, 2023 (23:59 GMT). Late submissions will not be accepted.
Each team is obliged to provide a short report describing their final solution. The report must contain information such as the name of the team, the names of all team members, and a brief overview of the used approach. It should be submitted using the KnowledgePit submission system by June 11, 2023 (23:59 GMT).
By enrolling in this competition, you grant the Organizers the right to process your submissions and reports for the purpose of evaluation and post-competition research.
The final project score will depend on the quality of the solution (the score obtained in the final evaluation), and on the quality of the submitted report.

Task description

Data for this project consists of two tab-separated tables. Each row in those files corresponds to an abstract of a scientific article from the ACM Digital Library, which was assigned to one or more topics from the ACM Computing Classification System.

The training data (DM2023_training_docs_and_labels.tsv) has three columns: the first one is an identifier of a document, the second one stores the text of the abstract, and the third one contains a list of comma-separated topic labels.

The test data (DM2023_test_docs.tsv) has a similar format, but the labels in the third column are missing.
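
The file layout described above can be read with a short parser. The following is a minimal sketch in Python using only the standard library. It assumes that the files are UTF-8 encoded, have no header row, and do not use quoting; none of this is stated in the task description, so check it against the actual files. The function names parse_rows and load_tsv and the label strings in the usage example are hypothetical.

```python
import csv
from typing import Iterable, List, Optional, Tuple

# (document id, abstract text, labels or None when no labels are given)
Row = Tuple[str, str, Optional[List[str]]]


def parse_rows(lines: Iterable[str]) -> List[Row]:
    """Parse tab-separated lines into (doc_id, abstract, labels) tuples.

    Assumption: no header row and no quoting, so quote characters
    inside an abstract are treated as ordinary text.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows: List[Row] = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        doc_id = fields[0]
        abstract = fields[1] if len(fields) > 1 else ""
        raw_labels = fields[2] if len(fields) > 2 else ""
        # Split the comma-separated label list; an absent or empty
        # third column (as in the test data) yields None.
        labels = [part.strip() for part in raw_labels.split(",") if part.strip()]
        rows.append((doc_id, abstract, labels or None))
    return rows


def load_tsv(path: str) -> List[Row]:
    """Read a whole file, e.g. DM2023_training_docs_and_labels.tsv."""
    with open(path, encoding="utf-8", newline="") as handle:
        return parse_rows(handle)
```

For example, parse_rows(["d1\tSome abstract\tX.1,Y.2\n"]) returns one tuple whose label list is ["X.1", "Y.2"], while a row without a third column gets None in place of the labels.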

Task and submission format: predict the labels of the documents in the test data and submit them to the evaluation system. A correctly formatted submission is a text file with exactly 100000 lines. Each line should correspond to one document from the test data set (the order matters!) and contain a list of one or more predicted labels, separated by commas.
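
A submission in this format can be produced and checked before upload. The sketch below, in Python, assumes that predictions are held as one list of label strings per test document, in the same order as the test file. The function name format_submission and the output file name in the usage note are hypothetical; the line count of 100000 comes from the description above.

```python
from typing import Sequence

EXPECTED_LINES = 100000  # required number of lines, per the task description


def format_submission(
    predictions: Sequence[Sequence[str]],
    expected_lines: int = EXPECTED_LINES,
) -> str:
    """Return the submission text: one comma-separated label list per line.

    Raises ValueError if the number of documents is wrong or if any
    document has no predicted label (each line needs at least one).
    """
    if len(predictions) != expected_lines:
        raise ValueError(
            f"expected {expected_lines} lines, got {len(predictions)}"
        )
    lines = []
    for position, labels in enumerate(predictions):
        if not labels:
            raise ValueError(f"no predicted label at position {position}")
        lines.append(",".join(labels))
    return "\n".join(lines) + "\n"
```

The returned text can then be written out with, for instance, open("submission.txt", "w", encoding="utf-8").write(format_submission(predictions)). Failing early on an empty line or a wrong line count avoids spending one of the limited submissions on a malformed file.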

Evaluation: the quality of submissions will be evaluated using the average F1-score measure, i.e., for each test document, the F1-score between the predicted and true labels will be computed, and the values obtained for all test cases will be averaged.
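
For local validation, the measure described above can be approximated as follows. This is a sketch in Python based only on that description; the official evaluation script is not published here, so details such as the handling of duplicate labels are assumptions (labels are treated as sets). The function names document_f1 and average_f1 are hypothetical.

```python
from typing import Sequence


def document_f1(predicted: Sequence[str], true: Sequence[str]) -> float:
    """F1-score between the predicted and true label sets of one document."""
    predicted_set, true_set = set(predicted), set(true)
    overlap = len(predicted_set & true_set)
    if overlap == 0:
        return 0.0  # no correct label: precision and recall are both zero
    precision = overlap / len(predicted_set)
    recall = overlap / len(true_set)
    return 2 * precision * recall / (precision + recall)


def average_f1(
    predictions: Sequence[Sequence[str]],
    truths: Sequence[Sequence[str]],
) -> float:
    """Mean of the per-document F1-scores over all documents."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not predictions:
        raise ValueError("at least one document is required")
    total = sum(document_f1(p, t) for p, t in zip(predictions, truths))
    return total / len(predictions)
```

As a worked case, predicting labels {A, B} for a document whose true label is {A} gives precision 1/2 and recall 1, so the F1-score for that document is 2/3. Because the score is averaged per document, every document carries the same weight regardless of how many labels it has.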

Solutions will be evaluated online and the preliminary results will be published on the public leaderboard. The preliminary score will be computed on a small subset of the test documents (10%), fixed for all participants. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published online. Note that only teams that submit a report describing their approach before the end of the challenge will qualify for the final evaluation. Participants can submit many solutions, but before the competition ends each team needs to indicate up to two final solutions that will undergo the final evaluation (on the remaining part of the test data).

In case of additional questions, please post them on the competition forum.
