The need to make sense of images and extract meaningful insights from them appears to be growing rapidly. Images are the second most popular content type, and a wide range of companies deal with images in one form or another on a regular basis. This drives demand for image recognition that is more accurate and more automated.
At Infrrd, we handled one such use case: a customer wanted us to build a platform where users could sell their watches. A user uploads an image of the watch they want to sell, and the system eases their work by correctly identifying the watch, auto-filling its details, and listing it on the site for sale. Given an image, our system needed to identify the watch, recommend similar models, and finally classify the watch into the correct model.
What most recognition platforms can do:
Given an image of a meeting room, they can identify objects in the room such as tables, chairs, and a whiteboard. Similarly, given an image of a single object, they can identify what that object is. Like other solutions, ours can identify the object as a watch. However, ours goes a level further and can identify the watch's likely features, from a tachometer to a white or black dial, the strap material, and the minute and second hands.
Other solutions are not domain-specific. They are generic, work only at a broad level, and cannot be customized to a particular use case. Our platform is domain-aware, so it can pull out the right insights for each product category. In this use case, the requirement was to identify the make and model of the watch, which other recognition systems are not able to do.
Because this use case was specific, we extended our platform with a custom solution built around the client's needs.
The challenges we faced:
To an untrained eye, all watches look similar. However, models within the same brand differ in subtle ways. The challenge was to go two levels deeper and identify both the make and the model of the watch. We had to account for all the intricate details and design the algorithm carefully, so that it handles the corner cases and can differentiate between models within a brand. Some models also come in several variants. For example, one Omega model came in three strap variations of color or material (such as brown, black, metallic, or leather straps), as well as different dial and outer rim colors. In cases like these, we categorized the variants under the same model rather than showing each one as a different watch.
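The distinction between a model and its variants can be expressed as a simple grouping rule. The sketch below is illustrative only: it assumes the visual recognition has already produced a brand, a model, and variant attributes for each catalog entry, and the `WatchListing` record, its field names, and `group_variants` are hypothetical, not part of Infrrd's platform. It shows only the bookkeeping decision that strap, dial, and rim differences do not create a new model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WatchListing:
    """One catalog entry. Field names are hypothetical, for illustration only."""
    brand: str
    model: str
    strap: str       # variant attribute, e.g. "brown", "black", "metallic", "leather"
    dial_color: str  # variant attribute
    rim_color: str   # variant attribute


def group_variants(listings):
    """Group listings by (brand, model), so that strap, dial and rim
    variants are collected under one model instead of appearing as
    separate watches."""
    groups = defaultdict(list)
    for item in listings:
        groups[(item.brand, item.model)].append(item)
    return dict(groups)


if __name__ == "__main__":
    # "Model A" and "Model B" are placeholder names, not real Omega references.
    catalog = [
        WatchListing("Omega", "Model A", "brown", "white", "silver"),
        WatchListing("Omega", "Model A", "black", "black", "silver"),
        WatchListing("Omega", "Model A", "metallic", "white", "gold"),
        WatchListing("Omega", "Model B", "leather", "black", "black"),
    ]
    for (brand, model), variants in group_variants(catalog).items():
        print(brand, model, len(variants), "variant(s)")
```

Keying on `(brand, model)` alone is the design choice here: variant attributes stay on each record for display, but they never split a model into separate listings.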
The approach we take for watches has to be very different from the one we take for clothing. For a dress, the first pass identifies it as a dress, and the next pass looks at its other traits. For example, if the dress ends at the knee, we can classify it as knee-length, and from the shape of the neckline we can classify it as round, V-neck, or square. For watches, by contrast, we need to examine the intricate details of the dial: the text or logo that identifies the brand, the sub-dials, and so on.
Given an image, we first identify which class it belongs to. Because a watch can plausibly belong to several classes, the algorithm considers the top few candidate classes, identifies the models that match the extracted insights, and displays the top 3 results ranked by probability score. Once the user confirms a result, we retrain the system with this image assigned to that class. Continuous learning then lets the system learn from the confirmation and retain it for future transactions. In the occasional case where the extracted insights are wrong, the user enters the correct details of the watch. Our system gathers this data and saves it in the database, so that if the same watch is uploaded again, the system identifies it accurately with a high probability score.
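The ranking and feedback loop described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Infrrd's implementation: it assumes a classifier that outputs one raw score (logit) per class, a hypothetical `catalog` that maps each class to candidate models with known attributes, and a hypothetical in-memory `FeedbackStore` standing in for the database. The names `recommend` and `FeedbackStore` and the catalog layout are illustrative, and retraining itself is not shown; confirmed labels are only queued for a later training run.

```python
import hashlib
import math


def softmax(logits):
    """Turn raw per-class scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def recommend(logits, class_names, catalog, insights, top_classes=5, top_n=3):
    """Return up to top_n (model, probability) pairs, best first.

    catalog: hypothetical mapping of class name -> list of (model, attributes dict).
    insights: attributes extracted from the image, e.g. {"dial": "black"}.
    """
    probs = softmax(logits)
    # Consider only the few most probable classes.
    ranked = sorted(zip(class_names, probs), key=lambda cp: cp[1], reverse=True)
    candidates = []
    for cls, p in ranked[:top_classes]:
        for model, attrs in catalog.get(cls, []):
            # Keep the model unless a known attribute contradicts an extracted insight.
            if all(attrs.get(k, v) == v for k, v in insights.items()):
                candidates.append((model, p))
    candidates.sort(key=lambda mp: mp[1], reverse=True)
    return candidates[:top_n]


class FeedbackStore:
    """In-memory stand-in for the database of user-confirmed labels (hypothetical)."""

    def __init__(self):
        self._confirmed = {}      # image fingerprint -> confirmed model
        self.training_queue = []  # (image bytes, model) pairs for a later retraining run

    @staticmethod
    def fingerprint(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def confirm(self, image_bytes, model):
        """Record a user's confirmation or correction of the predicted model."""
        self._confirmed[self.fingerprint(image_bytes)] = model
        self.training_queue.append((image_bytes, model))

    def lookup(self, image_bytes):
        """Return the confirmed model for a previously seen image, or None."""
        return self._confirmed.get(self.fingerprint(image_bytes))


if __name__ == "__main__":
    catalog = {  # placeholder classes and models
        "chronograph": [("Model X", {"tachometer": True, "dial": "black"}),
                        ("Model Y", {"tachometer": True, "dial": "white"})],
        "diver": [("Model Z", {"tachometer": False, "dial": "black"})],
    }
    store = FeedbackStore()
    image = b"placeholder image bytes"
    if store.lookup(image) is None:
        results = recommend([2.0, 0.5], ["chronograph", "diver"], catalog, {"dial": "black"})
        print(results)
        store.confirm(image, results[0][0])  # the user accepts the top suggestion
    print(store.lookup(image))
```

Matching re-uploads by a SHA-256 hash of the image bytes only catches byte-identical files. Recognizing the same watch in a new photo would need something like similarity search over image embeddings, which this sketch does not attempt.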
We achieved an accuracy rate of 80%. Over time, the platform has learned through continuous feedback and has become considerably more accurate than it was at the start.