Getting Started with Gemma 4

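The release of Gemma 4 marks another important step in the evolution of open, efficient, and capable AI models. Built by Google, Gemma models are designed to be lightweight yet powerful, which makes them a good fit for developers who want strong performance without massive infrastructure.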

Gemma 4 comes in four model sizes:

E2B, E4B: maximum compute and memory efficiency
26B, 31B: highest intelligence per parameter

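In this hands-on guide, we use a Kaggle Notebook to explore the Gemma 4 E2B model, the 2B-class multimodal, multilingual variant. The goal is simple: get you from setup to real use cases as quickly as possible, so that you can evaluate these models yourself.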

1. Step-by-Step Guide

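To run Gemma in a Kaggle Notebook, we first need to upgrade the transformers library. Gemma 4 is very new (it had been out for only a few hours at the time of writing), so we need the latest version of the library to run it.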

!pip install -U transformers

This upgrades your environment to transformers 5.5.0 (at the time of writing).
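Because the pipeline depends on a recent transformers release, it can help to confirm that the upgrade took effect before downloading the model. The following is a minimal sketch using only the standard library. The helper names `version_tuple` and `meets_minimum` are hypothetical, and treating 5.5.0 as the minimum is an assumption based on the version mentioned above, not a documented requirement. Pre-release suffixes such as `dev0` are ignored in the comparison.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '5.5.0' or '5.6.0.dev0' into a comparable tuple such as (5, 5, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # stop at non-numeric suffixes such as 'dev0'
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad so that '5.5' compares equal to '5.5.0'
        parts.append(0)
    return tuple(parts)


def meets_minimum(package: str, minimum: str) -> bool:
    """Return True if `package` is installed at version `minimum` or newer."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


# Assumption: 5.5.0 is simply the version this guide was written against.
if not meets_minimum("transformers", "5.5.0"):
    print("transformers is missing or older than 5.5.0; rerun the upgrade and restart the kernel.")
```

In a notebook, a kernel restart is usually needed after upgrading a package that has already been imported.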

Next, we import the library and initialize a transformers pipeline.

from transformers import pipeline
pipe = pipeline("any-to-any", model="google/gemma-4-e2b-it")

We initialize the pipeline with the any-to-any task so that it accepts multimodal input.

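Once the model has downloaded, we can start testing. We will not reproduce every test here. Instead, we pick two image tests and one video test from the examples in the Kaggle Notebook (which come from the Gemma 4 technical report on HuggingFace).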

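2. Testing with an Image and a German Prompt

The image we will interpret:

Candies of different colors (image from the HuggingFace examples)

We ask the model, in German, what colors the candies are.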

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "Welche Farben haben die Bombons?"}
        ]
    }
]
output = pipe(text=messages, max_new_tokens=200)

Let's check the answer we got:

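# Markdown is not available by default in a notebook, so import it explicitly.
from IPython.display import display, Markdown

# The last message in the returned conversation is the model's reply.
display(Markdown(output[0]["generated_text"][-1]["content"]))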

The result is:

Die Bombons auf dem Bild haben folgende Farben:

Türkisgrün/Blaugrün (zwei Stück)

Orange (ein Stück)

Grün (ein Stück)

The result is good: both the candy colors and the number of candies of each color were identified correctly.
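The expression `output[0]["generated_text"][-1]["content"]` recurs in every test in this guide, so it may be worth wrapping in a small helper. The sketch below assumes the pipeline output has the shape implied by that expression: a list whose first element holds a `generated_text` conversation, with the model's reply as the last message. The function name `extract_reply`, the `mock_output` data, and the `role` keys are hypothetical and used only for illustration.

```python
def extract_reply(output: list) -> str:
    """Return the text of the last message in the first generated conversation."""
    conversation = output[0]["generated_text"]  # assumed: list of message dicts
    return conversation[-1]["content"]


# Hypothetical stand-in for a pipeline result, shaped like the expression used above.
mock_output = [
    {
        "generated_text": [
            {"role": "user", "content": "Welche Farben haben die Bombons?"},
            {"role": "assistant", "content": "Orange (ein Stück)"},
        ]
    }
]

print(extract_reply(mock_output))
```

With a real result, `display(Markdown(extract_reply(output)))` would render the reply in the notebook.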

3. Testing with a Video

Next, let's look at a video of a rocket before launch. The video comes from HuggingFace.

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/rockets.mp4",
            },
            {"type": "text", "text": "What is happening in this video?"},
        ],
    }
]

output = pipe(messages, load_audio_from_video=True)
display(Markdown(output[0]["generated_text"][-1]["content"]))

The model's answer is:

This video shows a crowd of people gathered on a tarmac, observing a large rocket, which appears to be a SpaceX Falcon 9 rocket, likely before or after a launch event. The sky is cloudy with soft, diffused light, suggesting either sunrise or sunset. There are also other aircraft and what look like exhibition booths in the background.
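The image and video requests in this guide differ only in the media item: images are passed under a `"url"` key and videos under a `"video"` key. A small builder can hide that difference. This is a sketch that mirrors the message dictionaries shown in this guide; the function name `build_user_message` is hypothetical, and the key names are taken from the examples here rather than from any API reference.

```python
def build_user_message(media_type: str, media_url: str, prompt: str) -> list:
    """Build a single-turn message list with one media item and one text item."""
    # Key names follow the examples in this guide: images use "url", videos use "video".
    url_key = {"image": "url", "video": "video"}
    if media_type not in url_key:
        raise ValueError(f"unsupported media type: {media_type!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": media_type, url_key[media_type]: media_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


video_messages = build_user_message(
    "video",
    "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/rockets.mp4",
    "What is happening in this video?",
)
```

The resulting list can be passed to the pipeline in place of the hand-written `messages` variable.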

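4. Testing with a Japanese Landmark

Now let's test the model with a Japanese landmark. We will also ask the question in Japanese. The image is shown below.

Tokyo Tower (image from Advantour)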

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.advantour.com/img/japan/tokyo/tokyo-tower.jpg"},
            {"type": "text", "text": "この画像には何が見えますか?"}
        ]
    }
]
output = pipe(text=messages, max_new_tokens=200)
display(Markdown(output[0]["generated_text"][-1]["content"]))

The model's answer is:

この画像は、晴れた日の都市のパノラマビューを捉えています。

主な特徴:

東京のスカイライン: 多くの高層ビルが立ち並ぶ都市の景色が広がっています。

東京タワー: 画像の中央やや左寄りに、象徴的な赤い東京タワーがそびえ立っています。

空: 非常に青く澄んだ快晴の空が広がり、明るい日差しの下で撮影されたことがわかります。

植生: 都市の中にも緑が多く、特に前景や中景には木々（秋の紅葉を思わせる色合いも見られます）が見られ、自然と都市が共存している様子がうかがえます。

建築物: 様々な高さとデザインの近代的なビル群が密集しています。

Translated into English:

This image captures a panoramic view of the city on a sunny day.

Key Features:

Tokyo Skyline: A cityscape of numerous skyscrapers unfolds.

Tokyo Tower: The iconic red Tokyo Tower stands tall slightly to the left of the center of the image.

Sky: A very clear, blue sky stretches overhead, indicating that the image was taken under bright sunshine.

Vegetation: There is plenty of greenery even within the city, with trees (some with autumnal colors) visible in the foreground and midground, suggesting a coexistence of nature and urban life.

Architecture: A dense cluster of modern buildings of varying heights and designs.

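Returning to the video test in section 3: the answer correctly identified the details in the video. There was a small error in the rocket's manufacturer and type, but the two rockets look very similar.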

In the Kaggle Notebook you can find more examples that test the model's image interpretation and language understanding in French, German, Spanish, and Japanese.
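To try several languages against the same image, the requests can be prepared in one pass. The sketch below only builds the message lists and does not call the model. The Japanese prompt is the one used in this guide; the French, German, and Spanish prompts are illustrative assumptions and are not taken from the Kaggle Notebook. The names `PROMPTS`, `messages_for`, and `batch` are hypothetical.

```python
IMAGE_URL = (
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/p-blog/candy.JPG"
)

# Illustrative prompts; only the Japanese one appears in this guide.
PROMPTS = {
    "French": "Que voit-on sur cette image ?",
    "German": "Was ist auf diesem Bild zu sehen?",
    "Spanish": "¿Qué se ve en esta imagen?",
    "Japanese": "この画像には何が見えますか?",
}


def messages_for(prompt: str, image_url: str = IMAGE_URL) -> list:
    """Build a one-turn image question in the message format used in this guide."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


batch = {language: messages_for(prompt) for language, prompt in PROMPTS.items()}

# With the pipeline from this guide loaded as `pipe`, each entry could be run as:
# for language, messages in batch.items():
#     output = pipe(text=messages, max_new_tokens=200)
```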

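5. Closing Thoughts

Despite having "only" 2B parameters, Gemma 4 is surprisingly capable:

Fast inference in the Kaggle environment
Low memory footprint compared with larger models
Impressive multimodal and multilingual capabilities

This balance makes it a strong choice for prototyping AI features, running demos, and building lightweight applications that can run on mobile platforms.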
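The memory point can be made concrete with back-of-the-envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is a rough estimate under stated assumptions. The parameter counts are the nominal figures from the model names (2B and 31B), the stored parameter count of the E2B model may differ from its nominal size, and the estimate covers weights only, not activations or the KV cache. The names `BYTES_PER_PARAM` and `weight_memory_gb` are hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal sizes taken from the model names; real checkpoints may differ.
for name, params in [("2B-class", 2e9), ("31B", 31e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB in bfloat16")
```

Under these assumptions, the smaller model needs about 4 GB for weights in bfloat16 versus about 62 GB for the 31B model.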

Original article: Getting Started with Gemma 4: A Practical Guide to Multimodal, Multilingual AI
