How to Use Google Gemini Models for Computer Vision Tasks

51CTO Featured Content

Published 2025-5-26 08:31


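This article walks through the steps of using Google Gemini models for computer vision tasks, including environment setup, sending images, and interpreting model output. It also looks at data annotation tools to give context for custom training scenarios.

Since the rise of AI chatbots, Google Gemini has emerged as one of the major players driving the evolution of intelligent systems. Beyond its conversational abilities, Gemini also opens up practical computer vision applications, letting systems see, interpret, and describe the world around them.

This article explains step by step how to use Google Gemini for computer vision tasks: how to set up the environment, send images together with instructions, and interpret the model's output for object detection, caption generation, and OCR. It also looks at data annotation tools (such as those used with YOLO) to give context for custom training scenarios.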

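Introduction to Google Gemini

Google Gemini is a family of AI models built to handle multiple data types, such as text, images, audio, and code. This means it can handle tasks that involve understanding both pictures and text.

Key Features of Gemini 2.5 Pro

• Multimodal input: Accepts a combination of text and images in a single request.

• Reasoning: The model can analyze the input to perform tasks such as identifying objects or describing a scene.

• Instruction following: Responds to text instructions (prompts) that guide how it analyzes the image.

These features let developers use Google Gemini for vision-related tasks through an API, without training a separate model for each task.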

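The Role of Data Annotation: YOLO Annotators

Although Gemini models have strong zero-shot and few-shot capabilities for computer vision tasks, building a highly specialized computer vision model requires training on a dataset tailored to the specific problem. This is where data annotation becomes essential, especially for supervised learning tasks such as training a custom object detector.

YOLO annotators (a term that usually refers to tools compatible with the YOLO format, such as LabelImg, CVAT, or Roboflow) are designed for creating labeled datasets.

What Is Data Annotation?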


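For object detection, annotation involves drawing a bounding box around each object of interest in an image and assigning a class label (for example "car", "person", or "dog"). This annotated data tells the model what to look for, and where, during training.

Key Features of Annotation Tools (Such as YOLO Annotators)

• User interface: They provide a graphical interface for loading images, drawing boxes (or polygons, keypoints, and so on), and assigning labels efficiently.
• Format compatibility: Tools designed for YOLO models save annotations in the specific text format that YOLO training scripts expect (typically one .txt file per image, containing class indices and normalized bounding box coordinates).
• Efficiency features: Many tools include hotkeys, autosave, and model-assisted labeling to speed up the often time-consuming annotation process. Batch processing makes it easier to work through large image sets.
• Integration: Using a standard format such as YOLO ensures the annotated data works easily with popular training frameworks, including Ultralytics YOLO.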
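
To make the YOLO text format mentioned in the list above more concrete, the sketch below converts one pixel-space box into a single label line. It assumes the commonly used layout of class index followed by box center x, center y, width, and height, each normalized to the 0-1 range. The function name to_yolo_line and the sample numbers are illustrative only and do not come from any annotation tool.

```python
def to_yolo_line(class_id, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space box (x1, y1, x2, y2) into one YOLO label line.

    Assumed layout: "class x_center y_center width height", all normalized to 0-1.
    """
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: a 200x100 px box in a 640x480 image, class index 0
line = to_yolo_line(0, 100, 50, 300, 150, img_w=640, img_h=480)
print(line)
```

An annotation tool would write one such line per object into the .txt file that accompanies each image.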

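Although Google Gemini for computer vision can detect objects without prior annotation, if you need a model that detects specific custom objects (for example, a unique type of industrial equipment or a particular product defect), you will likely need to collect images and annotate them with a tool such as a YOLO annotator in order to train a dedicated YOLO model.

Code Implementation: Google Gemini for Computer Vision

First, install the required libraries.

Step 1: Install Prerequisites

(1) Install the libraries

Run the following command in a notebook cell (drop the leading ! to run it in a terminal):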

!uv pip install -U -q google-genai ultralytics

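This command installs the google-genai library, which communicates with the Gemini API, and the ultralytics library, which includes useful helpers for handling images and drawing on them.

(2) Import the modules

Add these lines to your Python notebook:

import json
import cv2
import matplotlib.pyplot as plt  # used to display the image in Task 3
import ultralytics
from google import genai
from google.genai import types
from PIL import Image
from ultralytics.utils.downloads import safe_download
from ultralytics.utils.plotting import Annotator, colors
ultralytics.checks()

This code imports libraries for reading images (cv2, PIL), displaying them (matplotlib), handling JSON data (json), interacting with the API (google.genai), and utility functions (ultralytics).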

(3) Configure the API key

Initialize the client with your Google AI API key.

# Replace "your_api_key" with your actual key
# Initialize the Gemini client with your API key
client = genai.Client(api_key="your_api_key")

This step prepares the script to send authenticated requests.
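
Hard-coding a key in a notebook makes it easy to leak by accident. As a minimal sketch, the helper below reads the key from an environment variable instead. The helper name load_api_key and the variable name GEMINI_API_KEY are assumptions chosen for illustration, not something the original walkthrough prescribes.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key


# Usage (commented out because it needs the google-genai package and a real key):
# client = genai.Client(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the first API call.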

Step 2: Interacting with Gemini

Create a function that sends a request to the model. It takes an image and a text prompt and returns the model's text output.

def inference(image, prompt, temp=0.5):
    """
    Performs inference using the Google Gemini 2.5 Pro Experimental model.

    Args:
        image (PIL.Image.Image): The image input; this walkthrough passes the PIL image returned by read_image.
        prompt (str): A text prompt to guide the model's response.
        temp (float, optional): Sampling temperature for response randomness. Default is 0.5.

    Returns:
        str: The text response generated by the Gemini model based on the prompt and image.
    """
    response = client.models.generate_content(
        model="gemini-2.5-pro-exp-03-25",
        contents=[prompt, image],  # Provide both the text prompt and image as input
        config=types.GenerateContentConfig(
            temperature=temp,  # Controls creativity vs. determinism in output
        ),
    )
    return response.text  # Return the generated textual response

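Explanation

(1) The function sends the image and the text instruction (prompt) to the Gemini model named in the model argument.

(2) The temperature setting (temp) affects the randomness of the output; lower values give more predictable results.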
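
The effect of temperature can be illustrated with a small, generic sketch. It is not a description of Gemini's internals: it only shows the standard idea of dividing scores by a temperature before a softmax, using made-up scores for three candidate tokens. The function name and the numbers are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for temp in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At low temperatures almost all probability mass sits on the top-scoring candidate, which is why low values tend to give more repeatable output for structured tasks such as returning bounding boxes.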

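Step 3: Prepare the Image Data

An image must be loaded correctly before it is sent to the model. The function below downloads the image if necessary, reads it, converts the color format, and returns a PIL Image object along with its dimensions.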

def read_image(filename):
    # Download the image (if needed) and get its local path
    image_path = safe_download(filename)
    # Read the image with OpenCV (BGR order) and convert it to RGB
    image = cv2.cvtColor(cv2.imread(str(image_path)), cv2.COLOR_BGR2RGB)
    # Extract height and width
    h, w = image.shape[:2]
    # Convert the array to a PIL image and return it with its width and height
    return Image.fromarray(image), w, h

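Explanation

(1) The function reads the image file with OpenCV (cv2).

(2) It converts the image's color order from OpenCV's default BGR to the standard RGB.

(3) It returns the image as a PIL object, which suits the inference function, together with its width and height.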

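Step 4: Format the Results

def clean_results(results):
    """Clean the results for visualization."""
    return results.strip().removeprefix("```json").removesuffix("```").strip()

This function strips the Markdown code-fence markers that the model often wraps around its answer, so that the remaining text can be parsed as JSON.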
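
As a quick check of what this cleaning step does, the sketch below runs the same logic on a made-up reply and parses the result. The sample reply is hypothetical, and the fence marker is built from a constant purely to keep the example readable.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def clean_results(results):
    """Strip a surrounding Markdown JSON fence, if present."""
    return results.strip().removeprefix(FENCE + "json").removesuffix(FENCE).strip()


# Hypothetical model reply, wrapped in a fenced JSON block
raw = FENCE + 'json\n[{"box_2d": [100, 200, 300, 400], "label": "laptop"}]\n' + FENCE
parsed = json.loads(clean_results(raw))
print(parsed[0]["label"], parsed[0]["box_2d"])
```

Text that is not wrapped in a fence passes through unchanged, so the function is safe to call on every reply. Note that str.removeprefix and str.removesuffix require Python 3.9 or later.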

Task 1: Object Detection

Gemini can find objects in an image and report their locations (bounding boxes) based on text instructions.

# Define the text prompt
prompt = """
Detect the 2d bounding boxes of objects in image.
"""
# Fixed, plotting function depends on this.
output_prompt = "Return just box_2d and labels, no additional text."
image, w, h = read_image("https://media-cldnry.s-nbcnews.com/image/upload/t_fit-1000w,f_auto,q_auto:best/newscms/2019_02/2706861/190107-messy-desk-stock-cs-910a.jpg")  # Read image, extract width and height
results = inference(image, prompt + output_prompt)  # Perform inference
cln_results = json.loads(clean_results(results))  # Clean results and parse them into a list
annotator = Annotator(image)  # Initialize the Ultralytics annotator
for idx, item in enumerate(cln_results):
    # By default, the Gemini model returns boxes with the y coordinates first.
    # Scale normalized box coordinates (0-1000) to image dimensions
    y1, x1, y2, x2 = item["box_2d"]  # bbox post-processing
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w
    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed
    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))
Image.fromarray(annotator.result())  # Display the output

Output

[Figure: input image with the detected boxes and labels drawn on it]

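Explanation

(1) The prompt tells the model what to look for and how to format the output (JSON).

(2) The code uses the image width (w) and height (h) to convert the normalized bounding box coordinates (0-1000) into pixel coordinates.

(3) The Annotator tool draws the boxes and labels on a copy of the image.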
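
The coordinate conversion in the loop above can be isolated into a small helper, which makes it easier to test without calling the API. This is a sketch under the same assumptions as the original code (boxes arrive as [y1, x1, y2, x2] on a 0-1000 scale); the name box_to_pixels and the sample values are hypothetical.

```python
def box_to_pixels(box_2d, img_w, img_h):
    """Convert [y1, x1, y2, x2] on a 0-1000 scale to pixel [x1, y1, x2, y2]."""
    y1, x1, y2, x2 = box_2d
    # Scale each axis by the matching image dimension; sorting handles swapped corners
    x1, x2 = sorted((x1 / 1000 * img_w, x2 / 1000 * img_w))
    y1, y2 = sorted((y1 / 1000 * img_h, y2 / 1000 * img_h))
    return [x1, y1, x2, y2]


# Hypothetical box on a 1200x800 image
print(box_to_pixels([250, 100, 750, 500], img_w=1200, img_h=800))
```

The returned list is in the x-first order that the annotator.box_label call in the loop above receives.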

Task 2: Testing Reasoning Ability

With Gemini models, you can tackle complex tasks using advanced reasoning that understands context and gives more precise results.

# Define the text prompt
prompt = """
Detect the 2d bounding box around:
highlight the area of morning light +
PC on table
potted plant
coffee cup on table
"""
# Fixed, plotting function depends on this.
output_prompt = "Return just box_2d and labels, no additional text."
image, w, h = read_image("https://thumbs.dreamstime.com/b/modern-office-workspace-laptop-coffee-cup-cityscape-sunrise-sleek-desk-featuring-stationery-organized-neatly-city-345762953.jpg")  # Read image and extract width, height
results = inference(image, prompt + output_prompt)
# Clean the results and load them in list format
cln_results = json.loads(clean_results(results))
annotator = Annotator(image)  # Initialize the Ultralytics annotator
for idx, item in enumerate(cln_results):
    # By default, the Gemini model returns boxes with the y coordinates first.
    # Scale normalized box coordinates (0-1000) to image dimensions
    y1, x1, y2, x2 = item["box_2d"]  # bbox post-processing
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w
    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed
    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))
Image.fromarray(annotator.result())  # Display the output

Output

[Figure: input image with boxes and labels drawn around the requested regions]

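Explanation

(1) This code block contains a more complex prompt that tests the model's reasoning ability.

(2) The code uses the image width (w) and height (h) to convert the normalized bounding box coordinates (0-1000) into pixel coordinates.

(3) The Annotator tool draws the boxes and labels on a copy of the image.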
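
The plotting loops in this walkthrough assume that every parsed entry has a four-element box_2d and a label. Because the model's reply is free text, that may not always hold, especially with more open-ended prompts. The sketch below filters out entries that do not match the expected shape before plotting; the function name valid_detections and the sample data are hypothetical.

```python
def valid_detections(items):
    """Keep only entries with a string label and a box_2d of four numbers in 0-1000."""
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box = item.get("box_2d")
        label = item.get("label")
        if not isinstance(label, str) or not isinstance(box, (list, tuple)) or len(box) != 4:
            continue
        if all(isinstance(v, (int, float)) and 0 <= v <= 1000 for v in box):
            kept.append(item)
    return kept


# Hypothetical parsed reply: the second entry lacks a label, the third has a short box
sample = [
    {"box_2d": [10, 20, 300, 400], "label": "potted plant"},
    {"box_2d": [10, 20, 300, 400]},
    {"box_2d": [10, 20, 300], "label": "coffee cup"},
]
print(valid_detections(sample))
```

Iterating over valid_detections(cln_results) instead of cln_results would skip malformed entries rather than raising a KeyError or ValueError partway through drawing.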

Task 3: Image Captioning

Gemini can create a text description of a picture.

# Define the text prompt
prompt = """
What's inside the image, generate a detailed captioning in the form of short
story, Make 4-5 lines and start each sentence on a new line.
"""
image, _, _ = read_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")  # Read image (width and height are not needed here)
plt.imshow(image)  # Requires matplotlib.pyplot imported as plt
plt.axis('off')  # Hide axes
plt.show()
print(inference(image, prompt))  # Display the results

Output

[Figure: input image, followed by the generated caption]

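Explanation

(1) The prompt asks the model for a description in a specific style (a short story of 4-5 lines, with each sentence starting on a new line).

(2) The supplied image is displayed in the output.

(3) The function returns the generated text. This is useful for creating alt text or summaries.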

Task 4: Optical Character Recognition (OCR)

Gemini can read the text in an image and report where it found that text.

# Define the text prompt
prompt = """
Extract the text from the image
"""
# Fixed, plotting function depends on this.
output_prompt = """
Return just box_2d which will be location of detected text areas + label"""
image, w, h = read_image("https://cdn.mos.cms.futurecdn.net/4sUeciYBZHaLoMa5KiYw7h-1200-80.jpg")  # Read image and extract width, height
results = inference(image, prompt + output_prompt)
# Clean the results and load them in list format
cln_results = json.loads(clean_results(results))
print([item["label"] for item in cln_results])  # Print the extracted text
annotator = Annotator(image)  # Initialize the Ultralytics annotator
for idx, item in enumerate(cln_results):
    # By default, the Gemini model returns boxes with the y coordinates first.
    # Scale normalized box coordinates (0-1000) to image dimensions
    y1, x1, y2, x2 = item["box_2d"]  # bbox post-processing
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w
    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed
    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))
Image.fromarray(annotator.result())  # Display the output

Output

[Figure: input image with boxes drawn around the detected text]

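Explanation

(1) It uses a prompt similar to the one for object detection, but asks for the text itself (as the label) rather than an object name.

(2) The code extracts the text and its location, prints the text content, and draws the corresponding bounding boxes on the image.

(3) This is useful for digitizing documents or for reading text from signs or labels in photos.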
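
For document-style use cases, the detected text regions usually need to be put back into reading order. The sketch below is a deliberately naive approach that assumes a simple single-column layout and the same box_2d / label shape used above: it sorts regions by their top edge, then by their left edge. The function name text_in_reading_order and the sample data are hypothetical.

```python
def text_in_reading_order(items):
    """Join OCR labels sorted by top edge (y1), then left edge (x1)."""
    # box_2d is [y1, x1, y2, x2], so indices 0 and 1 are the top and left edges
    ordered = sorted(items, key=lambda item: (item["box_2d"][0], item["box_2d"][1]))
    return "\n".join(item["label"] for item in ordered)


# Hypothetical OCR result in the box_2d / label shape used above
sample = [
    {"box_2d": [500, 100, 560, 400], "label": "second line"},
    {"box_2d": [100, 100, 160, 400], "label": "first line"},
]
print(text_in_reading_order(sample))
```

Multi-column pages or rotated text would need a more careful grouping step than this simple sort.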

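Conclusion

With simple API calls, Google Gemini for computer vision can handle tasks such as object detection, image captioning, and OCR. By sending an image together with clear text instructions, you can guide the model's understanding and get usable, real-time results.

That said, while Gemini is well suited to general-purpose tasks and quick experiments, it is not always the best fit for highly specialized use cases. For example, when you need to recognize niche objects or require higher accuracy, the traditional approach still has the edge: collect a dataset, annotate it with a tool such as a YOLO labeler, and train a custom model for your needs.

Original title: How to Use Google Gemini Models for Computer Vision Tasks, by Harsh Mishra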


Last modified 2025-5-26 10:16:00
