Zero-shot image classification means a system can make judgments about the contents of an image without the user first having to train it on what to look for. Watch it in action with this online demo, which uses WebGPU to run CLIP (Contrastive Language–Image Pre-training) in the browser, using input from an attached camera.
Give the program some natural-language labels for visual concepts (such as ‘person’ or ‘cat’) that slot into a template describing the image content, and the system will output, in real time, its judgment of how well each label fits what the camera sees. Again, all of this runs locally.
It may be a little unintuitive, but what the demo is really doing is deciding which of the user-provided labels (“a photo of a cat” vs. “a photo of a bald man”, for example) best matches what the camera sees. The better a particular label is judged to fit the image, the higher the number beside it.
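To make that label-ranking idea concrete, here is a minimal Python sketch of how CLIP-style scoring is commonly done: compare an image embedding against the embedding of each text label, then turn the similarities into scores that sum to one. This is not the demo's actual code. The embedding vectors are made-up toy values (a real model would produce them from the camera frame and the label text), and the function names and the `scale` value of 100 are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(values):
    """Turn arbitrary scores into positive numbers that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_labels(image_embedding, label_embeddings, scale=100.0):
    """Rank labels against one image (hypothetical helper).

    `scale` sharpens the differences between similarities before the
    softmax; the value here is an assumption, not read from the demo.
    """
    labels = list(label_embeddings)
    sims = [
        scale * cosine_similarity(image_embedding, label_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(sims)))


# Toy embeddings, invented for this example only.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a bald man": [0.1, 0.9, 0.3],
}

scores = score_labels(image_embedding, label_embeddings)
best = max(scores, key=scores.get)

for label, score in scores.items():
    print(f"{label}: {score:.3f}")
print("best match:", best)
```

Note that the scores are relative: they only say which of the supplied labels fits best, so a poor set of labels will still produce a "winner".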
This kind of process benefits greatly from shoveling the hard parts of the computation onto compatible graphics cards, which is exactly what WebGPU provides by allowing the browser access to a local GPU. WebGPU is relatively recent, but we’ve already seen it used to run LLMs (Large Language Models) directly in the browser.
Wondering what makes GPUs so very useful for AI-type applications? It’s all about their ability to work with enormous amounts of data very quickly.
Might come in handy cruising Pinterest.
Or Tinder.
Must be getting hit hard, it just errors out.