Downloading YouTube Subtitles with yt-dlp

I ran into some problems using yt, so I decided to switch to yt-dlp.

Here is the prompt I wrote for Gemini:

write a python script cli script that execute this subprocess

yt-dlp --write-subs --sub-format vtt  --skip-download <youtube video url>  -o /tmp/trans1.out

then find the output in the /tmp directory ( linux obviously )

then parse with vtt module.

output should be send to stdout.

Okay, here's a Python CLI script that does what you've requested.

First, you'll need to install the `webvtt-py` library if you haven't already:
pip install webvtt-py

Also make sure yt-dlp is installed and on your system PATH.

Here is the script:

#!/usr/bin/env python3

import subprocess
import argparse
import sys
from pathlib import Path
import webvtt # For parsing VTT files (pip install webvtt-py)
import os

def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube video subtitles using yt-dlp, parse the VTT, and print to stdout."
    )
    parser.add_argument("youtube_url", help="The URL of the YouTube video.")
    parser.add_argument(
        "--output-base",
        default="/tmp/trans1.out",
        help="Base path for the temporary subtitle file (default: /tmp/trans1.out). "
             "yt-dlp will append language and .vtt extension to this base."
    )
    parser.add_argument(
        "--keep-file",
        action="store_true",
        help="Keep the downloaded .vtt file in /tmp instead of deleting it."
    )

    args = parser.parse_args()

    output_base_path_template = Path(args.output_base)
    output_directory = output_base_path_template.parent
    output_filename_stem_for_glob = output_base_path_template.name

    try:
        output_directory.mkdir(parents=True, exist_ok=True)
    except OSError as e:
        print(f"Error creating output directory {output_directory}: {e}", file=sys.stderr)
        sys.exit(1)
        
    command = [
        "yt-dlp",
        "--write-subs",
        "--sub-format", "vtt",
        "--skip-download",
        args.youtube_url,
        "-o", str(output_base_path_template)
    ]

    print(f"Executing: {' '.join(command)}", file=sys.stderr)
    try:
        process = subprocess.run(command, check=True, capture_output=True, text=True)
        if process.stdout:
            print(f"yt-dlp STDOUT:\n{process.stdout.strip()}", file=sys.stderr)
        if process.stderr:
            print(f"yt-dlp STDERR:\n{process.stderr.strip()}", file=sys.stderr)

    except subprocess.CalledProcessError as e:
        print(f"Error executing yt-dlp. Return code: {e.returncode}", file=sys.stderr)
        if e.stdout:
            print(f"yt-dlp STDOUT:\n{e.stdout.strip()}", file=sys.stderr)
        if e.stderr:
            print(f"yt-dlp STDERR:\n{e.stderr.strip()}", file=sys.stderr)
        sys.exit(1)
    except FileNotFoundError:
        print("Error: yt-dlp command not found. Please ensure it's installed and in your PATH.", file=sys.stderr)
        sys.exit(1)

    glob_pattern = f"{output_filename_stem_for_glob}.*.vtt"
    vtt_files_found = list(output_directory.glob(glob_pattern))

    if not vtt_files_found:
        print(f"Error: No .vtt file found in {output_directory} matching pattern '{glob_pattern}'", file=sys.stderr)
        print("This could be because no subtitles were available or yt-dlp had an issue.", file=sys.stderr)
        sys.exit(1)

    vtt_file_path = vtt_files_found[0]
    print(f"Found VTT file: {vtt_file_path}", file=sys.stderr)

    if len(vtt_files_found) > 1:
        print(f"Warning: Multiple VTT files found ({[str(f.name) for f in vtt_files_found]}). Processing the first one: {vtt_file_path.name}", file=sys.stderr)

    try:
        vtt_content = webvtt.read(str(vtt_file_path))
    except Exception as e:
        print(f"Error parsing VTT file {vtt_file_path}: {e}", file=sys.stderr)
        if not args.keep_file:
            try:
                os.remove(vtt_file_path)
            except OSError as del_e:
                print(f"Error deleting {vtt_file_path} after parse error: {del_e}", file=sys.stderr)
        sys.exit(1)

    print(f"\n--- Parsed Subtitles from {vtt_file_path.name} ---", file=sys.stderr)
    for caption in vtt_content:
        print(f"{caption.start} --> {caption.end}")
        cleaned_text = caption.text.replace('\n', ' ').strip()
        print(cleaned_text)
        print()

    if not args.keep_file:
        try:
            os.remove(vtt_file_path)
            print(f"Cleaned up {vtt_file_path}", file=sys.stderr)
        except OSError as e:
            print(f"Error deleting {vtt_file_path}: {e}", file=sys.stderr)
    else:
        print(f"Kept VTT file: {vtt_file_path}", file=sys.stderr)

if __name__ == "__main__":
    main()

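Usage:

  1. Save the code above as a Python file (for example, get_subs.py).
  2. Make it executable: chmod +x get_subs.py
  3. Run it with a YouTube video URL:
./get_subs.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

(For best results, substitute the URL of a video that actually has subtitles.)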

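Explanation:

1. #!/usr/bin/env python3: the shebang line that lets the script be run directly.

2. import ...: imports the required modules:

  • subprocess: runs the yt-dlp command.
  • argparse: builds a user-friendly command-line interface.
  • sys: provides sys.exit() and access to sys.stderr.
  • pathlib.Path: modern, object-oriented path handling.
  • webvtt: the library that parses .vtt files.
  • os: deletes the temporary file.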

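3. argparse setup

  • Defines one positional argument, youtube_url.
  • Adds an optional --output-base argument that defaults to /tmp/trans1.out. This matters because yt-dlp treats its -o option as an output template rather than a final file name: it appends the language code (such as .en) and the extension (.vtt) to this base path.
  • Adds an optional --keep-file flag that prevents the downloaded .vtt file from being deleted.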
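
As a rough illustration of the argument setup described above, the parser can be exercised with an explicit argument list instead of the real command line. The build_parser helper name and the example.com URL are assumptions made for this sketch; the three arguments mirror the ones in the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same three arguments the script defines."""
    parser = argparse.ArgumentParser(description="Subtitle downloader (sketch).")
    parser.add_argument("youtube_url", help="The URL of the YouTube video.")
    parser.add_argument("--output-base", default="/tmp/trans1.out")
    parser.add_argument("--keep-file", action="store_true")
    return parser


# Passing a list to parse_args() bypasses sys.argv, which is handy for testing.
args = build_parser().parse_args(["https://example.com/watch?v=ID", "--keep-file"])
print(args.youtube_url, args.output_base, args.keep_file)
```

Note that argparse turns --output-base into the attribute args.output_base and --keep-file into args.keep_file, which is why the script reads them under those names.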

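4. yt-dlp command execution

  • The yt-dlp command is built as a list of strings.
  • output_base_path_template is passed to yt-dlp's -o option.
  • subprocess.run() executes the command.
  • check=True: raises CalledProcessError if yt-dlp exits with a non-zero status.
  • capture_output=True, text=True: captures yt-dlp's stdout and stderr as text, which helps with debugging.
  • Error handling covers CalledProcessError (yt-dlp failed) and FileNotFoundError (yt-dlp is not installed).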
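
The same subprocess.run() pattern can be tried without yt-dlp installed. The sketch below is only an illustration: run_command is a hypothetical helper, and the Python interpreter itself stands in for yt-dlp so the example runs anywhere.

```python
import subprocess
import sys


def run_command(command: list[str]) -> str:
    """Run a command, return its stdout, and raise RuntimeError on failure."""
    try:
        result = subprocess.run(command, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        # check=True turns a non-zero exit status into this exception.
        raise RuntimeError(f"command failed with code {e.returncode}: {e.stderr.strip()}") from e
    except FileNotFoundError as e:
        # Raised when the executable itself cannot be found on PATH.
        raise RuntimeError(f"command not found: {command[0]}") from e
    return result.stdout


# Stand-in for yt-dlp: the current Python interpreter printing one line.
print(run_command([sys.executable, "-c", "print('ok')"]).strip())
```

Passing the command as a list, rather than a single shell string, means the URL is handed to the program as one argument and never interpreted by a shell.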

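5. Finding the output file

  • yt-dlp typically names the subtitle file <output_template>.LANG.vtt (for example, /tmp/trans1.out.en.vtt).
  • output_directory.glob(f"{output_filename_stem_for_glob}.*.vtt") finds files matching this pattern. By default, output_filename_stem_for_glob is trans1.out.
  • The script takes the first file found. If the video has subtitles in several languages and yt-dlp downloaded all of them, it picks one of them.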
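
Because glob() does not promise any particular order, "the first file found" can vary when several languages are present. One possible way to make the choice deterministic is sketched below. The pick_subtitle helper and its preferred parameter are hypothetical and not part of the script; the sketch assumes file names follow the <base>.LANG.vtt shape described above.

```python
from pathlib import Path


def pick_subtitle(files: list[Path], base_name: str, preferred: tuple[str, ...] = ("en",)) -> Path:
    """Pick the file whose language tag best matches `preferred`.

    File names are assumed to look like <base_name>.<LANG>.vtt.
    Falls back to the alphabetically first file so the choice is deterministic.
    """
    if not files:
        raise ValueError("no subtitle files to choose from")

    def lang_of(path: Path) -> str:
        # Strip "<base_name>." from the front and ".vtt" from the end.
        return path.name[len(base_name) + 1 : -len(".vtt")]

    ordered = sorted(files)
    for wanted in preferred:
        for path in ordered:
            lang = lang_of(path)
            # Match "en" exactly or a regional variant such as "en-GB".
            if lang == wanted or lang.startswith(wanted + "-"):
                return path
    return ordered[0]


files = [Path("/tmp/trans1.out.fr.vtt"), Path("/tmp/trans1.out.en-GB.vtt")]
print(pick_subtitle(files, "trans1.out").name)
```

In the script, this would replace vtt_files_found[0]. Restricting the download itself with yt-dlp's --sub-langs option may be a simpler alternative when only one language is wanted.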

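6. Parsing with webvtt

  • webvtt.read(str(vtt_file_path)) parses the VTT file that was found.
  • The script then iterates over vtt_content.
  • caption.start, caption.end, and caption.text give the timing and content of each caption cue.
  • The parsed output goes to stdout. Diagnostic messages ("Executing...", "Found VTT file...", and so on) go to stderr, which keeps stdout clean for piping.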

7. Cleanup

  • Unless --keep-file is given, os.remove(vtt_file_path) deletes the downloaded VTT file from /tmp.
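
The script repeats the deletion in two places: once on the parse-error path and once after normal output. A try/finally block is one way to express the same cleanup only once. The sketch below is an assumption-laden illustration, not the script's code: process_then_cleanup is a hypothetical helper, and a plain file read stands in for the webvtt parsing step.

```python
import os
import tempfile
from pathlib import Path


def process_then_cleanup(path: Path, keep_file: bool = False) -> str:
    """Read a file and remove it afterwards, even if reading fails."""
    try:
        return path.read_text(encoding="utf-8")
    finally:
        # The finally block runs on success and on error alike.
        if not keep_file and path.exists():
            os.remove(path)


# Demonstrate with a throwaway file standing in for the downloaded .vtt.
fd, name = tempfile.mkstemp(suffix=".vtt")
os.close(fd)
demo = Path(name)
demo.write_text("WEBVTT\n", encoding="utf-8")
content = process_then_cleanup(demo)
print(repr(content), demo.exists())
```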

This did not give me the result I wanted. The output included timestamps, so I asked:

can we eliminate time stamps

00:43:56.840 --> 00:44:00.080
so at least I've got some snacks for the road.

It came back with the following:

    # Output parsed content (text only) to stdout
    print(f"\n--- Subtitle Text from {vtt_file_path.name} (timestamps omitted) ---", file=sys.stderr)
    for caption in vtt_content:
        cleaned_text = caption.text.replace('\n', ' ').strip()
        if cleaned_text: # Only print if there's actual text
            print(cleaned_text) # Output actual data (text only) to stdout

This produces output like:

First line of subtitles.
Second line of subtitles, perhaps originally on multiple lines in VTT.
And so on.

All diagnostic messages still go to stderr, so stdout contains only the extracted subtitle text, one line per caption.
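
The text-only logic can be checked on its own, without downloading anything. In the sketch below, Caption is a hypothetical stand-in for the caption objects returned by webvtt (only the .text attribute is used), and caption_lines is an illustrative name, not something from the script.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Caption:
    """Hypothetical stand-in for a webvtt caption; only .text is used here."""
    text: str


def caption_lines(captions: Iterable[Caption]) -> Iterator[str]:
    """Yield one cleaned line per caption, skipping captions with no text."""
    for caption in captions:
        # Join multi-line cues into a single line, then trim whitespace.
        cleaned = caption.text.replace("\n", " ").strip()
        if cleaned:
            yield cleaned


sample = [Caption("so at least I've got\nsome snacks for the road."), Caption("  \n ")]
for line in caption_lines(sample):
    print(line)
```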

The output is similar to that of the original yt. yt-dlp does produce some noise, but all of it goes to stderr; stdout is just the subtitle text.

[jon2allen@jons-bad-ass-fedora-server-37 python]$ ./yt_trans_dlp.py https://www.youtube.com/watch?v=XmD1eGuX4GU
Executing: yt-dlp --write-subs --sub-format vtt --skip-download https://www.youtube.com/watch?v=XmD1eGuX4GU -o /tmp/trans1.out
yt-dlp STDOUT:
[youtube] Extracting URL: https://www.youtube.com/watch?v=XmD1eGuX4GU
[youtube] XmD1eGuX4GU: Downloading webpage
[youtube] XmD1eGuX4GU: Downloading tv client config
[youtube] XmD1eGuX4GU: Downloading tv player API JSON
[youtube] XmD1eGuX4GU: Downloading ios player API JSON
[youtube] XmD1eGuX4GU: Downloading player fc2a56a5-main
[youtube] XmD1eGuX4GU: Downloading m3u8 information
[info] XmD1eGuX4GU: Downloading subtitles: en-GB
[info] XmD1eGuX4GU: Downloading 1 format(s): 625+234
[info] Writing video subtitles to: /tmp/trans1.out.en-GB.vtt
[download] Destination: /tmp/trans1.out.en-GB.vtt
...
Found VTT file: /tmp/trans1.out.en-GB.vtt

--- Subtitle Text from trans1.out.en-GB.vtt (timestamps omitted) ---
In a land far, far away.
For the fifth time, I embarked on a journey of a lifetime,

On Linux, you can alias the yt command to the script:

alias yt='/home/jon2allen/python/yt_trans_dlp.py'

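That gives you a complete replacement for yt. A few small improvements are still possible, such as putting the PID in the output name so that concurrent runs of the script do not collide.

Here is the final code: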
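
A minimal sketch of the PID idea follows. The per_process_output_base helper and its directory and prefix parameters are assumptions for illustration; the script itself does not contain this function.

```python
import os
from pathlib import Path


def per_process_output_base(directory: str = "/tmp", prefix: str = "trans1") -> Path:
    """Return an output base that embeds this process's PID."""
    return Path(directory) / f"{prefix}.{os.getpid()}.out"


base = per_process_output_base()
# The glob pattern used to find the subtitle file changes with the base name.
glob_pattern = f"{base.name}.*.vtt"
print(base, glob_pattern)
```

Since the script already accepts --output-base, the same effect may be achievable from a shell without changing the code, for example --output-base /tmp/trans1.$$.out, where $$ expands to the shell's PID.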

#!/usr/bin/env python3

import subprocess
import argparse
import sys
from pathlib import Path
import webvtt # For parsing VTT files (pip install webvtt-py)
import os

def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube video subtitles using yt-dlp, parse the VTT, and print only the text to stdout."
    )
    parser.add_argument("youtube_url", help="The URL of the YouTube video.")
    parser.add_argument(
        "--output-base",
        default="/tmp/trans1.out",
        help="Base path for the temporary subtitle file (default: /tmp/trans1.out). "
             "yt-dlp will append language and .vtt extension to this base."
    )
    parser.add_argument(
        "--keep-file",
        action="store_true",
        help="Keep the downloaded .vtt file in /tmp instead of deleting it."
    )

    args = parser.parse_args()

    output_base_path_template = Path(args.output_base)
    output_directory = output_base_path_template.parent
    output_filename_stem_for_glob = output_base_path_template.name

    try:
        output_directory.mkdir(parents=True, exist_ok=True)
    except OSError as e:
        print(f"Error creating output directory {output_directory}: {e}", file=sys.stderr)
        sys.exit(1)
        
    command = [
        "yt-dlp",
        "--write-subs",
        "--sub-format", "vtt",
        "--skip-download",
        args.youtube_url,
        "-o", str(output_base_path_template)
    ]

    print(f"Executing: {' '.join(command)}", file=sys.stderr)
    try:
        process = subprocess.run(command, check=True, capture_output=True, text=True)
        if process.stdout:
            print(f"yt-dlp STDOUT:\n{process.stdout.strip()}", file=sys.stderr)
        if process.stderr:
            print(f"yt-dlp STDERR:\n{process.stderr.strip()}", file=sys.stderr)

    except subprocess.CalledProcessError as e:
        print(f"Error executing yt-dlp. Return code: {e.returncode}", file=sys.stderr)
        if e.stdout:
            print(f"yt-dlp STDOUT:\n{e.stdout.strip()}", file=sys.stderr)
        if e.stderr:
            print(f"yt-dlp STDERR:\n{e.stderr.strip()}", file=sys.stderr)
        sys.exit(1)
    except FileNotFoundError:
        print("Error: yt-dlp command not found. Please ensure it's installed and in your PATH.", file=sys.stderr)
        sys.exit(1)

    glob_pattern = f"{output_filename_stem_for_glob}.*.vtt"
    vtt_files_found = list(output_directory.glob(glob_pattern))

    if not vtt_files_found:
        print(f"Error: No .vtt file found in {output_directory} matching pattern '{glob_pattern}'", file=sys.stderr)
        print("This could be because no subtitles were available or yt-dlp had an issue.", file=sys.stderr)
        sys.exit(1)

    vtt_file_path = vtt_files_found[0]
    print(f"Found VTT file: {vtt_file_path}", file=sys.stderr)

    if len(vtt_files_found) > 1:
        print(f"Warning: Multiple VTT files found ({[str(f.name) for f in vtt_files_found]}). Processing the first one: {vtt_file_path.name}", file=sys.stderr)

    try:
        vtt_content = webvtt.read(str(vtt_file_path))
    except Exception as e:
        print(f"Error parsing VTT file {vtt_file_path}: {e}", file=sys.stderr)
        if not args.keep_file:
            try:
                os.remove(vtt_file_path)
            except OSError as del_e:
                print(f"Error deleting {vtt_file_path} after parse error: {del_e}", file=sys.stderr)
        sys.exit(1)

    # Output parsed content (text only) to stdout
    print(f"\n--- Subtitle Text from {vtt_file_path.name} (timestamps omitted) ---", file=sys.stderr)
    for caption in vtt_content:
        cleaned_text = caption.text.replace('\n', ' ').strip()
        if cleaned_text:
            print(cleaned_text)

    if not args.keep_file:
        try:
            os.remove(vtt_file_path)
            print(f"Cleaned up {vtt_file_path}", file=sys.stderr)
        except OSError as e:
            print(f"Error deleting {vtt_file_path}: {e}", file=sys.stderr)
    else:
        print(f"Kept VTT file: {vtt_file_path}", file=sys.stderr)

if __name__ == "__main__":
    main()

Original article: Using yt-dlp to download youtube transcript
