2025 Poland Tech Arena
AI Innovation
Lightweight Sentence Embedding for Similarity Measurement Data Extraction
The goal of this challenge is to develop a lightweight model (at most 20 MiB) that converts the token-level embeddings of a sentence into a single, combined sentence-level embedding.
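A quick way to sanity-check the size limit locally (a sketch only; it assumes the limit is checked against the uploaded weights file, and "your_model.bin" is an illustrative path):

import os

# Assumption: the 20 MiB budget is measured on the saved weights file.
size_mib = os.path.getsize("your_model.bin") / (1024 * 1024)
print(f"Weights file size: {size_mib:.2f} MiB (limit: 20 MiB)")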
Requirements:
Your model should be trained with, and be compatible with, the following dependency versions:
- Python 3.12
- torch 2.3.0
Please also note that only 1 CPU core is used for inference. Multithreading will be disabled in torch when your submission is running.
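To benchmark locally under comparable conditions, you may mimic this restriction; a minimal sketch, assuming the evaluator simply disables torch's intra-op and inter-op threading:

import torch

# Reproduce the single-core inference setting locally.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)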
Dataset:
The training dataset is split into four files:
File 1: Each row contains two sentences and a label, all separated by tabs. The label shows how well the two sentences in the row align with each other.
File 2: Tokenization of the first sentence of every row from File 1, represented as token-level vectors, with the row index corresponding to the row index in File 1. Every row contains:
- the number of tokens in the sentence, in the first position
- the dimension of the token embeddings, in the second position (always 768)
- the token embedding coordinates, laid out as follows: the first 768 coordinates form the first embedding, the next 768 coordinates form the second embedding, and so on.
The number of coordinates in every row is the product of its first two elements (number of tokens times dimension). All numbers are separated by a single space character; see the parsing sketch after this list.
File 3: Tokenization of the second sentence of every row from File 1, represented analogously to File 2.
File 4: Similarity labels (0–5) for each sentence pair, with the row index corresponding to the row index in File 1.
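A minimal parsing sketch for Files 2 and 3, assuming the row layout described above (the function name and file path are illustrative, not part of the challenge):

import torch

def parse_token_embedding_rows(path: str) -> list[torch.Tensor]:
    # Hypothetical helper; assumes each line is:
    # num_tokens dim coord_1 coord_2 ... coord_{num_tokens * dim}
    sentences = []
    with open(path) as f:
        for line in f:
            values = line.split()
            num_tokens, dim = int(values[0]), int(values[1])
            coords = torch.tensor([float(v) for v in values[2:]])
            assert coords.numel() == num_tokens * dim
            sentences.append(coords.view(num_tokens, dim))
    return sentences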
Model inference:
- The model is expected to take as input a tensor of shape (batch_size, seq_len, 768), where seq_len varies and corresponds to the largest number of tokens among all sentences in the batch; shorter sentences are right-padded with zeros.
- The model produces a tensor of shape (batch_size, output_dim), with one embedding for each example in the batch. output_dim can be set to an arbitrary value, but must always be the same. A minimal baseline illustrating this shape contract is sketched after this list.
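As an illustration of the shape contract only (not a reference solution), a mean-pooling baseline that ignores the zero-padded positions described above:

import torch
import torch.nn as nn

class MeanPoolingModel(nn.Module):
    # Averages token embeddings, skipping right-padded all-zero vectors.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, 768)
        mask = (x.abs().sum(dim=-1) > 0).float()                 # (batch, seq_len)
        lengths = mask.sum(dim=1, keepdim=True).clamp(min=1.0)   # (batch, 1)
        summed = (x * mask.unsqueeze(-1)).sum(dim=1)             # (batch, 768)
        return summed / lengths                                  # output_dim = 768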
Model submission:
Create a Python file exposing get_model() that returns a torch.nn.Module with the shape contract (batch, seq_len, 768) -> (batch, output_dim). The trained model weights must also be uploaded next to the source file, using the same basename and a .bin extension.
Example (your_model.py):
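The original example is not reproduced here; the following is a minimal sketch of such a file, assuming the weights are stored as a state dict in your_model.bin next to the source file (the architecture is purely illustrative):

import os
import torch
import torch.nn as nn

class SentenceEmbedder(nn.Module):
    # Illustrative module; any architecture within the size limit works.
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(768, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, 768) -> (batch_size, hidden_dim)
        mask = (x.abs().sum(dim=-1, keepdim=True) > 0).float()
        pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.proj(pooled)

def get_model() -> nn.Module:
    model = SentenceEmbedder()
    weights_path = os.path.join(os.path.dirname(__file__), "your_model.bin")
    model.load_state_dict(torch.load(weights_path, map_location="cpu"))
    model.eval()
    return model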
Both files should be uploaded together as a single .zip archive. For each subsequent submission, edit the previous submission by replacing its .zip file.
Model evaluation:
First, the model converts pairs of sentences from the test dataset into pairs of embeddings. Second, the closeness of each obtained pair is compared against the corresponding label using a built-in algorithm, which returns a 'quality score' in the range (0, 80). Third, the inference time is measured and converted to a 'time score' in the range (0, 20). Finally, the overall score is the sum of the quality score and the time score, so it lies in the range (0, 100).
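The built-in scoring algorithm is not specified. As a hedged illustration only, a common way to relate embedding pairs to similarity labels is the Spearman correlation between cosine similarities and the labels; the proxy below is an assumption for local experimentation, not the official metric:

import torch
from scipy.stats import spearmanr

def example_quality_proxy(emb1: torch.Tensor, emb2: torch.Tensor,
                          labels: torch.Tensor) -> float:
    # Hypothetical proxy: Spearman correlation of cosine similarity vs. labels,
    # scaled to the (0, 80) quality-score range. Not the official algorithm.
    cos = torch.nn.functional.cosine_similarity(emb1, emb2, dim=1)
    rho, _ = spearmanr(cos.detach().numpy(), labels.detach().numpy())
    return max(0.0, float(rho)) * 80.0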