Gemini 2.5 Unveils Pixel-Perfect Image Segmentation at Scale

Google’s Gemini 2.5 series of models (including Pro and the cost-effective Flash variant) continues to impress, moving beyond simple bounding-box object detection to a significantly more granular capability: image segmentation. As highlighted by AI researcher Max Woolf and explored in detail by Simon Willison, this feature allows the model to generate precise pixel-level masks for identified objects within an image, opening up sophisticated computer vision tasks to a wider audience through the Gemini API.

Beyond Bounding Boxes: Understanding Segmentation

While previous models could draw rectangular bounding boxes around objects, image segmentation provides a much richer understanding of an image’s content. It involves classifying each pixel, effectively creating a detailed outline or mask for specific objects. This allows for tasks like background removal, precise object isolation, and more nuanced image analysis, far surpassing the capabilities of simple bounding boxes.
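The difference can be made concrete with a toy sketch (not the Gemini API itself): a bounding box only names a rectangle, while a mask labels each pixel, so applying the mask isolates the object exactly. The `isolate` helper and the tiny grid below are illustrative inventions.

```python
# Illustrative sketch: a segmentation mask classifies every pixel,
# unlike a bounding box, which only gives a rectangle.
image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
# Binary mask: 1 = pixel belongs to the object, 0 = background.
mask = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

def isolate(image, mask, background=0):
    """Keep only the pixels the mask marks as object (simple background removal)."""
    return [
        [px if m else background for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

print(isolate(image, mask))
# -> [[0, 20, 0], [40, 50, 60], [0, 80, 0]]
# A bounding box around the same plus-shaped object would cover the full
# 3x3 rectangle, including the four background corners the mask excludes.
```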

Simon Willison’s Exploration and Tooling

Simon Willison, known for his insightful explorations of AI capabilities, quickly developed a browser-based tool, the Gemini API Image Mask Visualization, to demonstrate and interact with this new feature. His tool allows users to provide an image and a prompt (instructing the model to return segmentation masks in JSON format) directly to the Gemini API using their own API key.

The API returns the segmentation data, including 2D bounding boxes and the crucial segmentation masks. These masks are delivered as base64-encoded PNG image strings. Willison’s tool cleverly visualizes this JSON output, overlaying the generated masks onto the original image, providing an immediate understanding of the model’s segmentation accuracy. His development process, documented as “vibe coding” with assistance from models like Claude and ChatGPT, underscores the evolving landscape of AI-assisted programming.
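A minimal sketch of handling such a response might look like the following. The exact field names (`box_2d`, `label`, `mask`) and the stand-in PNG bytes are assumptions for illustration; Willison's post documents the real payload his tool parses.

```python
import base64
import json

# Stand-in bytes carrying only the PNG signature -- a real response would
# contain a complete PNG image you would hand to an image library.
fake_png = b"\x89PNG\r\n\x1a\n" + b"...image data..."

# Hypothetical model output in the shape the article describes: per-object
# entries with a 2D bounding box and a base64-encoded PNG mask string.
response_text = json.dumps([
    {
        "box_2d": [100, 120, 400, 560],  # assumed [y_min, x_min, y_max, x_max]
        "label": "dog",
        "mask": base64.b64encode(fake_png).decode(),
    }
])

# Parse the JSON and recover the raw PNG bytes for each mask.
for obj in json.loads(response_text):
    png_bytes = base64.b64decode(obj["mask"])
    assert png_bytes.startswith(b"\x89PNG\r\n\x1a\n")
    # A real mask would now be decoded, scaled to the bounding box, and
    # composited over the original image -- which is what Willison's
    # visualization tool does in the browser.
    print(obj["label"], obj["box_2d"], len(png_bytes), "bytes")
```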

Astonishing Cost-Effectiveness

Perhaps the most striking aspect of this new capability is its affordability. Willison’s tests revealed remarkably low costs:

  • Gemini 2.5 Pro: Segmenting a test image cost less than a quarter of a US cent.
  • Gemini 2.5 Flash (Thinking Mode): The same task cost less than a tenth of a cent.
  • Gemini 2.5 Flash (Non-Thinking Mode): Leveraging the cheaper output pricing tier, the cost plummeted to just over 1/100th of a cent per image segmentation.

This incredibly low price point, especially for the Flash model in non-thinking mode, democratizes access to advanced image segmentation capabilities that were previously the domain of specialized models or more expensive APIs.
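Scaling those per-image figures up makes the affordability concrete. The numbers below are rough upper bounds taken from the article's wording, not official pricing:

```python
# Back-of-envelope scaling of the costs reported above (approximate
# figures from the article, not official Google pricing).
cost_per_image_cents = {
    "gemini-2.5-pro": 0.25,             # "less than a quarter of a cent"
    "gemini-2.5-flash-thinking": 0.10,  # "less than a tenth of a cent"
    "gemini-2.5-flash": 0.011,          # "just over 1/100th of a cent"
}

for model, cents in cost_per_image_cents.items():
    dollars_per_10k = cents / 100 * 10_000  # cents -> dollars, x10,000 images
    print(f"{model}: ~${dollars_per_10k:.2f} per 10,000 images")
# Even at the Pro tier's upper bound, 10,000 segmentations cost about $25;
# non-thinking Flash brings that to roughly a dollar.
```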

Technical Insights and Implementation Notes

Willison notes the surprisingly low output token count reported by the API, despite the large size of the base64-encoded PNG strings. He hypothesizes that Gemini 2.5 likely uses an efficient internal token representation for masks, similar to the <seg> tokens used by Google’s open-weights PaliGemma model. This internal representation is likely translated into the more verbose (but standard) base64 PNG format by the API layer before being returned to the user.

He also details upgrading his tool’s JavaScript code to use Google’s newer recommended generative AI library, a process significantly aided by prompting OpenAI’s o4-mini model, showcasing the power of modern LLMs in code migration tasks.

Furthermore, Willison demonstrates using his llm command-line tool with JSON schemas to enforce a specific output structure from the Gemini API, ensuring reliable parsing of the bounding box and mask data.
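One way such a schema could be shaped is sketched below. This is an illustrative guess at the structure, not the exact schema from Willison's post; the field names mirror the bounding-box-plus-mask payload described earlier.

```python
import json

# A hypothetical JSON schema constraining the model's output to a list of
# objects, each with a 4-integer bounding box, a label, and a mask string.
schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "box_2d": {
                "type": "array",
                "items": {"type": "integer"},
                "minItems": 4,
                "maxItems": 4,
            },
            "label": {"type": "string"},
            "mask": {"type": "string"},
        },
        "required": ["box_2d", "label", "mask"],
    },
}

print(json.dumps(schema, indent=2))
```

Supplying a schema like this lets the calling code assume every returned object carries parseable box and mask fields, rather than defensively handling free-form output.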

Conclusion: A Leap Forward for Vision AI

The addition of affordable, high-quality image segmentation to the Gemini 2.5 API marks a significant step forward. It lowers the barrier to entry for developers seeking to integrate sophisticated image understanding into their applications. From creative tools needing precise object isolation to data analysis pipelines requiring detailed image parsing, the potential applications are vast. As these powerful vision capabilities become increasingly accessible and cost-effective, we can expect a new wave of innovation in how applications see and interact with the visual world.
