GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. In this system card, we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs. From the official blog
How to use?
GPT-4 with vision is currently available to all developers who have access to GPT-4 via the
gpt-4-vision-previewmodel and the Chat Completions API which has been updated to support image inputs.
This example is from this video.
When input the following image to the GPT 4 and ask:
describe the image in details. we can get:
Another example. This is super impressive.
This is quite impressive.