Data Matters

A big data company’s approach to Wheels vs Doors

The internet is a wonderful place, full of massively important conversations and debates such as: was Will Smith right to slap Chris Rock in the face on arguably the most prestigious stage in the film-making industry? Truly groundbreaking stuff.

But shortly before the internet was watching “Slap-that-Celebrity”, there was a raging debate over whether there are more wheels or doors in the world.

It all started on March 5th, 2022, when Twitter user NewYorkNixon tweeted a poll to his followers, telling them that he and his mates were having the “stupidest debate” over whether there were more doors or wheels in the world.

Over the course of 24 hours, the poll received 223,437 votes with the winner being “wheels,” earning 53.6 percent of the vote compared to “doors,” which received 46.4 percent. Over the course of four days, the tweet received roughly 12,300 likes and 7,100 retweets.

Wonderful, isn’t it? But we have to grant Ryan (the NewYorkNixon behind the poll) the following: it truly is a mind-boggling question. So naturally, the debate took over the internet.

We’re a big data company, so we decided we’d try our hand at an answer.

 

Step 1. Analyzing the conversation in its current state

We’ve found a couple of interesting approaches to the topic across the internet.

The debate overall is pretty calm, but sometimes it can get out of hand, and people really find many paths towards a good old confirmation bias.

In short, the internet’s current approach was to:

  • create a debate around what a wheel is and what a door is
  • view the problem as simply a statistical problem and answer with heuristics.

Leaving aside the definitions issue, which, while it might be fun, in the grand scheme of things is a moot point, the statistical + heuristics approach is a very decent one.

 

Our approach

Conveniently for us, we’re not only a big data company. We’re a big data company that focuses on company data. We use advanced scraping and machine learning to build and maintain a source of truth database with 50+ data points on 80M+ companies from around the world.

It became clear quite early in our thinking that the wheels vs doors debate is a very tricky one.

We didn’t expect to actually solve it (spoiler alert: we didn’t really, but we came interestingly close). We just wanted to take a crack at it and maybe push the conversation further, provide a new perspective, or offer the data necessary for someone more driven than us to find the answer. (We do have APIs to sell, we can’t spend all our time on an internet debate. So… if you want some company data… call us?)

 

First rough attempt

Our first attempt was very simplistic and mostly done out of curiosity. We thought we’d solve wheels vs doors not for the objects but for the words, and how often they appear on the internet. Well, on the parts of the internet we save. Literally: how many times does the word ‘door’ / ‘doors’ appear vs the word ‘wheel’ / ‘wheels’?

To build our main product, we crawl hundreds of billions of web pages and skip hundreds of billions of others through some simple heuristics. It’s safe to assume, however, that we ran this query on around a third of all the sentences on the internet.

The results were:

Wheel(s) - 7,573,463
Door(s) - 33,183,578

A considerable advantage for doors.
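As a rough illustration of the kind of counting described above, here is a minimal Python sketch. It is not the query we actually ran: the regular expression, the function name `count_mentions` and the sample sentences are all assumptions made up for the example.

```python
import re
from collections import Counter

# Whole-word, case-insensitive match for door/doors and wheel/wheels.
# (Assumed pattern for illustration; the real query is not shown in this post.)
PATTERN = re.compile(r"\b(door|wheel)s?\b", re.IGNORECASE)

def count_mentions(sentences):
    """Count whole-word mentions of door(s) and wheel(s) across sentences."""
    counts = Counter()
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts

# Hypothetical sample sentences, not taken from the crawl.
sample = [
    "The car has four wheels and two doors.",
    "She opened the door to my heart.",
    "Outdoor seating is available.",
    "Doors open at 8 PM.",
]

counts = count_mentions(sample)
print(counts["door"], counts["wheel"])
```

Note that a word count like this cannot tell a physical door from a figurative one: the second and fourth sample sentences both add to the door tally.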

At a quick first glance, given that the poll consensus on Twitter was slightly inclined towards wheels, it appeared to be an interesting conclusion. The large discrepancy however was peculiar.

As far as we can tell, it comes from the fact that the word door is used with meanings other than the object itself (e.g. the door to my heart ❤ – awww). So the initial results are rather pointless.

The interesting conclusion to draw from this is that while the internet looks at this as a statistical question (which is true), in reality it’s a procurement problem.

Time to adjust for that.

 

A new perspective

“Wheels vs Doors” is a statistics question but in reality it’s a procurement problem.

 

Attempt #2

Now that we’ve established that it’s a procurement problem, let’s handle it the procurement way: focusing on what products exist, and how they stack up in terms of wheels vs doors.

In principle, we’d take a similar approach to the one where we looked for wheels and doors on the internet, but we’d only search through what our ML models identify as products or product descriptions. (That way we avoid the “the door to my heart ❤ – awww” problem, as well as either word appearing in blog posts, where a mention says nothing about whether the object exists in reality.)

Here’s what we did:

  • Defined a simplistic, non-fancy definition of wheels and doors. Basically, a wheel or a door is what you would say it is if woken up at 3 AM with a bucket of cold water – no mitochondria, no quantum doors.
  • Filtered for all the products in our database that contain the word wheel or door – we found:

| dataset  | doors | wheels |
=============================
| dataset1 | 45475 | 27239  |
| dataset2 | 46708 | 27602  |
| dataset3 | 44845 | 28486  |
| dataset4 | 46240 | 28226  |
| dataset5 | 45930 | 28486  |
=============================

 

Across the five datasets, that comes to 229,198 door products and 140,039 wheel products: 369,237 out of 40,189,399 products (about 0.92%) contain the word wheel or door.

  • Cleaned it up a bit: manually selected a small dataset of wheels and doors (obviously, not all products that have door or wheel in their name are doors or wheels; take outdoor poppy pants or the magic square wheel), then translated the product names into multiple languages (French, German, Russian, Romanian, Arabic, Chinese)
  • Trained a binary XLM-RoBERTa (🏆) ML model to decide what is a “wheel” or a “door” based on our hand-picked data, so that we can tell more accurately which products tend to have wheels and which tend to have doors, and do so across multilingual websites. (While it might come as a shock to Brits and Americans, money also moves around outside of countries that speak their language.)
  • Asked the model to predict what is a door or a wheel on our entire dataset
  • Also used cosine similarity between the embedded datasets, to better visualize the data and do a double-check for the model
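To see why the keyword filter alone was not enough, here is a minimal Python sketch. The product names, `substring_filter` and `whole_word_filter` are hypothetical and are not our actual pipeline; they only show how keyword matching lets decoys through.

```python
import re

# Hypothetical product names; two echo the decoys mentioned in the list above.
products = [
    "Solid oak interior door",
    "Outdoor poppy pants",
    "Alloy wheel 17 inch",
    "Magic square wheel",
    "Garage door opener",
]

def substring_filter(names, keyword):
    """Naive filter: keep any name that contains the keyword anywhere."""
    return [name for name in names if keyword in name.lower()]

def whole_word_filter(names, keyword):
    """Stricter filter: keep names where the keyword appears as its own word."""
    pattern = re.compile(rf"\b{re.escape(keyword)}s?\b", re.IGNORECASE)
    return [name for name in names if pattern.search(name)]

door_substring = substring_filter(products, "door")
door_whole = whole_word_filter(products, "door")
wheel_whole = whole_word_filter(products, "wheel")

print(door_substring)
print(door_whole)
print(wheel_whole)
```

The substring filter lets "Outdoor poppy pants" through. The whole-word filter drops it but still keeps "Garage door opener" and "Magic square wheel", which a keyword cannot judge. That gap is what the hand-picked dataset and the classifier are meant to close.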

 

Results

THE GRAND WINNER IS … DOORS?

  • there are on average 1041 wheels for every 1M products (0.1%)
  • there are on average 2485 doors for every 1M products (roughly 0.25%)
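For anyone who wants to play with the numbers, the short Python sketch below converts the two per-million rates above into percentages and a doors-to-wheels ratio. The variable names are ours for illustration; only the two rates come from the results.

```python
PER_MILLION = 1_000_000

# Average rates reported above.
wheels_per_million = 1041
doors_per_million = 2485

# Share of products, expressed as a percentage.
wheel_share_pct = wheels_per_million / PER_MILLION * 100
door_share_pct = doors_per_million / PER_MILLION * 100

# How many door products there are for every wheel product.
doors_per_wheel = doors_per_million / wheels_per_million

print(f"wheels: {wheel_share_pct:.4f}% of products")
print(f"doors:  {door_share_pct:.4f}% of products")
print(f"doors per wheel: {doors_per_wheel:.2f}")
```

In other words, the model found a bit more than two door products for every wheel product.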

Disclaimer: there is a high likelihood, and by high I mean certain, that the model made some wrong predictions, so the numbers could be a little inflated or deflated depending on how the model perceived an object (e.g. labelling a “couch with wheels” listed in Russian as a wheel).

Below is a scatter plot of a data sample. The X-axis is the similarity of a product (a dot) with doors, and the Y-axis is the similarity of a product (a dot) with wheels. The orange dot is the (perfect) similarity of the average vector of words of all the doors we manually found (this sentence has a lot of words, sorry). The green dot is the (perfect) similarity of the average vector of words of all the wheels we manually found.
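For readers who want to see how coordinates like these can be computed, here is a minimal Python sketch of cosine similarity against an averaged vector. The tiny 3-dimensional vectors and the helper names are made up for the example; real embeddings have hundreds of dimensions, and this is not our production code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical embeddings of hand-picked doors and wheels.
door_examples = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]]
wheel_examples = [[0.1, 0.9, 0.2], [0.2, 0.8, 0.3]]

door_centroid = mean_vector(door_examples)
wheel_centroid = mean_vector(wheel_examples)

# Hypothetical embedding of one product: one dot on the plot.
product = [0.7, 0.3, 0.1]
x = cosine_similarity(product, door_centroid)   # similarity with doors
y = cosine_similarity(product, wheel_centroid)  # similarity with wheels

print(x, y)
```

An averaged vector compared with itself scores a similarity of 1, which is the sense in which the orange and green dots are "perfect".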

Taking a look at this chart, a few things might seem weird, starting with how close the green dot and orange dot are to each other. This raises the question: is there an overlap between a wheel and a door in the mind of the AI? Basically, yes. As we keep training the model, it’s going to become clearer and clearer to it that wheels and doors are completely different things. (Welcome to the world of big data.)

You might ask yourself if this invalidates the entire conclusion. The answer is no. Since we’re playing with relatively big numbers here (40 million unique products), the overall numbers, which point towards door dominance, should be right.

For now, this is our conclusion! 


Post Scriptum

Here’s a link to the full approach, made by a Soleadify Data Scientist:

https://soleadify.notion.site/Doors-Vs-Wheels-d972d92a6f624255a8d6d220a274d2a0

Obviously, this is still not complete and we’re considering working on this further. We’d like to implement a layer to help us determine the frequency of each of these products vs the others in real life.

At Soleadify, though, we only work with digital, public data, so it would be impossible to provide a 100% accurate answer. What we can do is define a heuristic, such as weighting the products by domain authority, traffic, and other such factors.
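To make that idea a little more concrete, here is a purely hypothetical Python sketch of such a heuristic. Nothing here is implemented on our side: the field names, the example values and the `weight` formula are assumptions chosen only to show how a weighted tally can differ from a raw count.

```python
import math
from collections import defaultdict

# All records, fields and values below are hypothetical.
products = [
    {"label": "door", "domain_authority": 90, "monthly_visits": 99_999},
    {"label": "wheel", "domain_authority": 10, "monthly_visits": 999},
    {"label": "wheel", "domain_authority": 10, "monthly_visits": 999},
]

def weight(product):
    """Assumed weighting: authority (0 to 1) times log-scaled traffic."""
    authority = product["domain_authority"] / 100
    reach = math.log10(1 + product["monthly_visits"])
    return authority * reach

raw = defaultdict(int)        # plain product counts per label
weighted = defaultdict(float) # counts scaled by the assumed weight
for p in products:
    raw[p["label"]] += 1
    weighted[p["label"]] += weight(p)

print(dict(raw))
print(dict(weighted))
```

In this toy example wheels win the raw count while doors win the weighted tally, which is exactly why the choice of weights would need careful validation before trusting any such result.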

We’ll keep thinking about it. In any case, treat this answer as an MVP: we trained the models quickly and used smaller datasets to validate certain assumptions in order to move fast.

Progress can be made on what was stated above. We do hope that it helped offer a new perspective and shed some extra light on the problem.

In the meantime, to help the debate make further progress, we’ve decided to publish a sample list of 508 out of the 78,034 door manufacturers and 501 out of the 49,898 wheel manufacturers we have in our database, in case this is helpful for anyone else willing to take up the torch in the quest for the wheels vs doors truth.

As you might notice, our coverage is global so it’s safe to assume there are very few manufacturers we might have missed. If anyone wants to pursue this, contact us to get the full list.

Here’s the data sample. Give it a go?
