Top Ten (9th) finish in Kaggle Avito

The goal of the Avito competition was to predict whether two adverts were duplicates or not. Think of Avito as a Russian eBay where users post their items for sale. The problem is that over-zealous users may post the same advert several times, or even hold several user accounts and post the advert as different users.

We were provided with the whole advert: Title, Description, Location, Category, Price and all the associated images. The adverts had been labelled as duplicates either manually or by computer-generated methods. This added slight complexity to the problem, because the labels carried extra noise depending on how they had been generated.

The two main problems with this task were the number of images (over 10 million) and the fact that all the text is in Russian. (Not a problem if you’re Russian. By the way, a Russian did actually win the competition!)

I am an image-phobe. I have a rubbish GPU (graphics processing unit) that is the car equivalent of an Only Fools and Horses Robin Reliant. So I decided to just compute image hashes using the Python Imaging Library and then calculate the Hamming distance between every pair of images across the two adverts. The distances were aggregated in R to get the min, max, mean, standard deviation etc. of the image similarities.
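To make the hash-and-compare idea concrete, here is a minimal pure-Python sketch. It assumes an "average hash" (the post doesn’t say which hash was used) and takes images as small grids of grayscale values rather than loading real files, so the function names and toy data are purely illustrative. It also uses the population standard deviation so that a single image pair doesn’t raise an error, which differs from R’s `sd`.

```python
from itertools import product
from statistics import mean, pstdev


def average_hash(pixels):
    """Return a bit string: 1 where a pixel is above the grid's mean brightness."""
    flat = [value for row in pixels for value in row]
    threshold = sum(flat) / len(flat)
    return "".join("1" if value > threshold else "0" for value in flat)


def hamming(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def image_pair_features(hashes_a, hashes_b):
    """Summarise Hamming distances between every image of advert A and advert B."""
    distances = [hamming(a, b) for a, b in product(hashes_a, hashes_b)]
    if not distances:
        return None  # at least one advert has no images
    return {
        "min": min(distances),
        "max": max(distances),
        "mean": mean(distances),
        "sd": pstdev(distances),  # population SD (assumption, see note above)
    }


# Toy 2x2 "images" standing in for real downscaled photos.
advert_a = [average_hash([[10, 200], [220, 15]])]
advert_b = [
    average_hash([[12, 190], [210, 20]]),  # near-duplicate of advert A's image
    average_hash([[200, 10], [15, 220]]),  # inverted pattern
]
features = image_pair_features(advert_a, advert_b)
```

A minimum distance of zero is the interesting signal here: it means at least one image in each advert hashes identically, even if the other images differ.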

For the text, I removed Russian stop-words (common words, the equivalent of “the”, “and”, “in”) using the R package tm and calculated cosine similarity using the R package stringdist. I also used a pretrained Russian Word2vec model to assess how many Russian synonyms the adverts shared.
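As a rough illustration of the text side, the sketch below drops stop-words and computes cosine similarity over n-gram counts. It is written in Python rather than R, and it makes two assumptions: the stop-word list is a tiny hypothetical one, and the n-grams are word n-grams (the post doesn’t say whether words or characters were used).

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"и", "в", "на", "с", "по"}


def tokens(text):
    """Lowercase, split on whitespace and drop stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def ngram_counts(words, n):
    """Count word n-grams, e.g. n=2 gives adjacent word pairs."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(text_a, text_b, n=1):
    """Cosine similarity between the n-gram count vectors of two texts."""
    a = ngram_counts(tokens(text_a), n)
    b = ngram_counts(tokens(text_b), n)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # one side has no n-grams left after cleaning
    return dot / (norm_a * norm_b)


# Made-up titles: "Selling a bicycle in good condition" with and without "children's".
title_a = "Продам велосипед в хорошем состоянии"
title_b = "Продам детский велосипед в хорошем состоянии"
unigram_score = cosine_similarity(title_a, title_b, n=1)
bigram_score = cosine_similarity(title_a, title_b, n=2)
```

The bigram score drops faster than the unigram score when a word is inserted, because one extra word breaks two adjacent pairs. That is one reason to keep both as separate features.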

I created a total of 100 features which, when run through DataRobot, achieved a single XGB model score of 0.93. That took me to the top of the leaderboard (it didn’t last long).


The features I created fell into the following categories:

## Length, first word, last word, number of shared words in Title and Description and sorted Title and sorted Description

## Cosine similarity of 1-grams and 2-grams in Title, Description and JSON attributes

## How many items have the same images_array?

## Same region? Same location? How far apart are the locations?

## Popular price? Difference in price between adverts and difference between the prices and the median price for that category

## Differences in images at 16 pixels and 32 pixels

## How many synonyms do they share?

These 100 features could get a score of about 0.933 on the Public Leaderboard.
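A few of the pairwise features listed above (shared words, price differences and location distance) might be sketched like this. The dictionary keys, the sample adverts and the use of the haversine formula for distance are all assumptions for illustration; the post doesn’t specify how distance was measured or how the data was laid out.

```python
import math
from statistics import median


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(h))


def pair_features(ad_a, ad_b, category_prices):
    """Build a few pairwise features; the dict keys are hypothetical column names."""
    category_median = median(category_prices)
    words_a = set(ad_a["title"].lower().split())
    words_b = set(ad_b["title"].lower().split())
    return {
        "shared_title_words": len(words_a & words_b),
        "price_diff": abs(ad_a["price"] - ad_b["price"]),
        "price_a_vs_median": ad_a["price"] - category_median,
        "price_b_vs_median": ad_b["price"] - category_median,
        "same_region": ad_a["region"] == ad_b["region"],
        "distance_km": haversine_km(ad_a["lat"], ad_a["lon"], ad_b["lat"], ad_b["lon"]),
    }


# Made-up adverts and category prices, purely for illustration.
ad_a = {"title": "iPhone 6 16GB", "price": 30000, "region": "Moscow",
        "lat": 55.75, "lon": 37.62}
ad_b = {"title": "iPhone 6 16GB silver", "price": 29500, "region": "Moscow",
        "lat": 55.75, "lon": 37.62}
category_prices = [25000, 29500, 30000, 32000, 40000]
features = pair_features(ad_a, ad_b, category_prices)
```

Each feature is cheap to compute on its own; the value comes from handing the model many such weak signals at once.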

A week before the competition ended, team DataMinders asked me and my teammate to join up. They were ahead of us on the leaderboard, but by merging teams we managed to give them some uplift in their score and we remained in the top 10 for the rest of the week.

It was a really good data-munging competition. Thanks to NxGTR, Oleksii Renov, Inversion, DataGeek and David Shinn. Our team spanned 5 international time zones, so we did pretty well.

So, you’ve hacked into Nigel Farage’s Twitter account

Hypothetically speaking, of course, if you managed to hack into Nigel Farage’s Twitter account, what would you write? There is a constraint though: you have covered your tracks well enough to be hard to find, but if you do something bad enough to result in warrants or subpoenas to IP providers, then you’re going to get caught. So you want to do something bad, but not so bad that $$$ lawyers get involved.

Once the initial excitement of hacking in wears off, you are left with the common problem of “how to intelligently frape”. Posting that Nigel Farage likes bums and willies, or that he smells, just doesn’t cut the mustard. Maybe something nearly-racist that he could possibly say? Maybe something stupid about the economy? Something sly about other UKIP members? Then you realise that the man is just a fool and fools always have followers regardless.

Maybe causing them the slight 5-minute inconvenience of resetting the password will suffice. *Slowly backs away*