The goal of the Avito competition was to predict whether two adverts were duplicates. Think of Avito as a Russian eBay where users post their items for sale. The problem is that over-zealous users may post the same advert several times, or even create several user accounts and post the advert as different users.
We were provided with the whole advert: Title, Description, Location, Category, Price and all the associated images. The advert pairs had been labelled as duplicates either manually or by computer-generated methods. This made the problem slightly harder, because the labelling method itself added noise.
The two main problems with this task were the number of images (over 10 million) and the fact that all the text is in Russian. That is not a problem if you're Russian and, by the way, a Russian did actually win the competition!
I am an image-phobe. I have a rubbish GPU (graphics processing unit) that is the car equivalent of an Only Fools and Horses Robin Reliant. I decided to just compute image hashes with the Python Imaging Library and then calculate the Hamming distance between all the images for the two adverts in each pair. The distances were aggregated in R to get the min, max, mean, standard deviation etc. of the image similarities.
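The hashing-and-distance idea can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses a simple average hash on a small grayscale grid (the post does not say which hash was used), the function names are hypothetical, and the aggregation is done in Python here rather than in R. In practice the grid would come from something like Pillow's `Image.open(path).convert("L").resize((8, 8))`.

```python
from statistics import mean, pstdev


def average_hash(pixels):
    """Average hash of a 2D grid of grayscale values, returned as an int.

    Each pixel contributes one bit: 1 if it is brighter than the grid's
    mean brightness, 0 otherwise. (Assumed hash type, for illustration.)
    """
    flat = [p for row in pixels for p in row]
    threshold = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > threshold else 0)
    return bits


def hamming(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def image_similarity_features(hashes_a, hashes_b):
    """Aggregate all pairwise Hamming distances between two adverts' images."""
    if not hashes_a or not hashes_b:
        return None  # at least one advert has no images
    distances = [hamming(a, b) for a in hashes_a for b in hashes_b]
    return {
        "min": min(distances),
        "max": max(distances),
        "mean": mean(distances),
        "std": pstdev(distances),
    }
```

A minimum distance of 0 means the two adverts share at least one image with an identical hash, which is a strong hint that they are duplicates.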
For the text, I removed Russian stop-words (common words like “the”, “and”, “in”) using the R package tm and calculated cosine similarity using the R package stringdist. I also used a pretrained Russian Word2vec model to count the synonyms shared between the adverts.
I created a total of 100 features which, when run through DataRobot, gave a single XGBoost model with a score of 0.93. That took me to the top of the leaderboard (it didn’t last long).
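As a rough illustration of the text similarity step, here is a minimal Python sketch of cosine similarity over n-gram counts after stop-word removal. It is not the original R code: the stop-word set is a tiny hand-picked sample (the tm package ships a full Russian list), the function names are hypothetical, and it works on word n-grams, whereas stringdist computes its cosine distance on character q-grams.

```python
import math
from collections import Counter

# Tiny illustrative sample of Russian stop-words ("and", "in", "on").
STOP_WORDS = {"и", "в", "на"}


def ngram_counts(text, n):
    """Count word n-grams in a text after lowercasing and stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


title_a = "Продам велосипед в Москве"
title_b = "Продам велосипед"
sim_1gram = cosine_similarity(ngram_counts(title_a, 1), ngram_counts(title_b, 1))
sim_2gram = cosine_similarity(ngram_counts(title_a, 2), ngram_counts(title_b, 2))
```

The same two functions can be applied to the Description and to any other text field, giving one feature per field and n-gram size.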
The features I created fell into the following categories:
## Length, first word, last word, number of shared words in Title and Description and sorted Title and sorted Description
## Cosine similarity of 1-grams and 2-grams in Title, Description and JSON attributes
## How many items have the same images_array?
## Same region? Same location? How far apart are the locations?
## Popular price? Difference in price between adverts and difference between the prices and the median price for that category
## Differences in images at 16 pixels and 32 pixels
## How many synonyms do they share?
These 100 features could get a score of about 0.933 on the Public Leaderboard.
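Two of the feature categories above, location distance and price differences, are simple enough to sketch. The following Python is an assumption-laden illustration, not the original code: the field names `lat`, `lon` and `price` are hypothetical, the distance uses the haversine formula (the post does not say how distance was measured), and the category median price is assumed to be precomputed and passed in.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def location_price_features(ad_a, ad_b, category_median_price):
    """Location and price features for a pair of adverts.

    ad_a and ad_b are dicts with hypothetical keys 'lat', 'lon' and 'price'.
    """
    return {
        "distance_km": haversine_km(ad_a["lat"], ad_a["lon"],
                                    ad_b["lat"], ad_b["lon"]),
        "price_diff": abs(ad_a["price"] - ad_b["price"]),
        "price_a_vs_median": ad_a["price"] - category_median_price,
        "price_b_vs_median": ad_b["price"] - category_median_price,
    }


# Example pair: one advert in Moscow, one in St Petersburg.
moscow_ad = {"lat": 55.7558, "lon": 37.6173, "price": 1000}
petersburg_ad = {"lat": 59.9343, "lon": 30.3351, "price": 1200}
features = location_price_features(moscow_ad, petersburg_ad, category_median_price=1100)
```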
A week before the end of the competition, team DataMinders asked me and my teammate to join up. They were ahead of us on the leaderboard, but by merging teams we managed to give them some uplift in their score, and we remained in the top 10 for the rest of the week.
It was a really good data-munging competition. Thanks to NxGTR, Oleksii Renov, Inversion, DataGeek and David Shinn. Our team spanned 5 international time zones, so we did pretty well.