Using yt-dlp to download YouTube transcripts
I ran into some problems with yt, so I decided to switch to yt-dlp.
Here is the prompt I wrote for Gemini:
write a python script cli script that execute this subprocess
yt-dlp --write-subs --sub-format vtt --skip-download <youtube video url> -o /tmp/trans1.out
then find the output in the /tmp directory ( linux obviously )
then parse with vtt module.
output should be send to stdout.
Okay, here's a Python CLI script that does what you've requested.
First, you'll need to install the `webvtt-py` library if you haven't already:
pip install webvtt-py
Also make sure yt-dlp is installed and on your system PATH.
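A quick way to confirm that yt-dlp is reachable is to look it up on PATH from Python. This is only a sketch using shutil.which from the standard library; the helper name find_tool is my own and is not part of the script.

```python
import shutil
import sys


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("yt-dlp")
    if path is None:
        print("yt-dlp not found on PATH", file=sys.stderr)
    else:
        print(f"yt-dlp found at {path}", file=sys.stderr)
```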
Here is the script:
#!/usr/bin/env python3
import subprocess
import argparse
import sys
from pathlib import Path
import webvtt  # For parsing VTT files (pip install webvtt-py)
import os


def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube video subtitles using yt-dlp, parse the VTT, and print to stdout."
    )
    parser.add_argument("youtube_url", help="The URL of the YouTube video.")
    parser.add_argument(
        "--output-base",
        default="/tmp/trans1.out",
        help="Base path for the temporary subtitle file (default: /tmp/trans1.out). "
             "yt-dlp will append language and .vtt extension to this base."
    )
    parser.add_argument(
        "--keep-file",
        action="store_true",
        help="Keep the downloaded .vtt file in /tmp instead of deleting it."
    )
    args = parser.parse_args()

    output_base_path_template = Path(args.output_base)
    output_directory = output_base_path_template.parent
    output_filename_stem_for_glob = output_base_path_template.name

    try:
        output_directory.mkdir(parents=True, exist_ok=True)
    except OSError as e:
        print(f"Error creating output directory {output_directory}: {e}", file=sys.stderr)
        sys.exit(1)

    command = [
        "yt-dlp",
        "--write-subs",
        "--sub-format", "vtt",
        "--skip-download",
        args.youtube_url,
        "-o", str(output_base_path_template)
    ]
    print(f"Executing: {' '.join(command)}", file=sys.stderr)

    try:
        process = subprocess.run(command, check=True, capture_output=True, text=True)
        if process.stdout:
            print(f"yt-dlp STDOUT:\n{process.stdout.strip()}", file=sys.stderr)
        if process.stderr:
            print(f"yt-dlp STDERR:\n{process.stderr.strip()}", file=sys.stderr)
    except subprocess.CalledProcessError as e:
        print(f"Error executing yt-dlp. Return code: {e.returncode}", file=sys.stderr)
        if e.stdout:
            print(f"yt-dlp STDOUT:\n{e.stdout.strip()}", file=sys.stderr)
        if e.stderr:
            print(f"yt-dlp STDERR:\n{e.stderr.strip()}", file=sys.stderr)
        sys.exit(1)
    except FileNotFoundError:
        print("Error: yt-dlp command not found. Please ensure it's installed and in your PATH.", file=sys.stderr)
        sys.exit(1)

    glob_pattern = f"{output_filename_stem_for_glob}.*.vtt"
    vtt_files_found = list(output_directory.glob(glob_pattern))
    if not vtt_files_found:
        print(f"Error: No .vtt file found in {output_directory} matching pattern '{glob_pattern}'", file=sys.stderr)
        print("This could be because no subtitles were available or yt-dlp had an issue.", file=sys.stderr)
        sys.exit(1)

    vtt_file_path = vtt_files_found[0]
    print(f"Found VTT file: {vtt_file_path}", file=sys.stderr)
    if len(vtt_files_found) > 1:
        print(f"Warning: Multiple VTT files found ({[str(f.name) for f in vtt_files_found]}). Processing the first one: {vtt_file_path.name}", file=sys.stderr)

    try:
        vtt_content = webvtt.read(str(vtt_file_path))
    except Exception as e:
        print(f"Error parsing VTT file {vtt_file_path}: {e}", file=sys.stderr)
        if not args.keep_file:
            try:
                os.remove(vtt_file_path)
            except OSError as del_e:
                print(f"Error deleting {vtt_file_path} after parse error: {del_e}", file=sys.stderr)
        sys.exit(1)

    print(f"\n--- Parsed Subtitles from {vtt_file_path.name} ---", file=sys.stderr)
    for caption in vtt_content:
        print(f"{caption.start} --> {caption.end}")
        cleaned_text = caption.text.replace('\n', ' ').strip()
        print(cleaned_text)
        print()

    if not args.keep_file:
        try:
            os.remove(vtt_file_path)
            print(f"Cleaned up {vtt_file_path}", file=sys.stderr)
        except OSError as e:
            print(f"Error deleting {vtt_file_path}: {e}", file=sys.stderr)
    else:
        print(f"Kept VTT file: {vtt_file_path}", file=sys.stderr)


if __name__ == "__main__":
    main()
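Usage:
- Save the code above as a Python file (for example, get_subs.py).
- Make it executable:
  chmod +x get_subs.py
- Run it with a YouTube video URL:
  ./get_subs.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
  (Replace this with the URL of a video that actually has subtitles for best results.)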
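Explanation:
1. #!/usr/bin/env python3: the shebang line, which tells the system to run the script with python3 when it is executed directly.
2. import ...: imports the necessary modules:
   - subprocess: runs the yt-dlp command.
   - argparse: builds a user-friendly command-line interface.
   - sys: for sys.exit() and for writing to sys.stderr.
   - pathlib.Path: for modern, object-oriented path handling.
   - webvtt: the library that parses .vtt files.
   - os: for deleting the temporary file.
3. argparse setup:
   - Defines a positional argument, youtube_url.
   - Adds an optional --output-base argument that defaults to /tmp/trans1.out. This matters because yt-dlp treats its -o option as an output template, not a final filename: it appends the language code (such as .en) and the extension (.vtt) to this base path.
   - Adds an optional --keep-file flag that prevents the downloaded .vtt file from being deleted.
4. yt-dlp command execution:
   - The yt-dlp command is built as a list of strings.
   - output_base_path_template is passed to yt-dlp's -o option.
   - subprocess.run() executes the command.
   - check=True: raises CalledProcessError if yt-dlp exits with a non-zero status.
   - capture_output=True, text=True: captures yt-dlp's stdout and stderr as text, which helps with debugging.
   - Error handling is included for CalledProcessError (yt-dlp failed) and FileNotFoundError (yt-dlp is not installed).
5. Finding the output file:
   - yt-dlp typically names subtitle files <output_template>.LANG.vtt (for example, /tmp/trans1.out.en.vtt).
   - output_directory.glob(f"{output_filename_stem_for_glob}.*.vtt") finds files matching this pattern. With the default, output_filename_stem_for_glob is trans1.out.
   - The script takes the first file found. If the video has subtitles in several languages and yt-dlp downloads them all, it picks one of them.
6. Parsing with webvtt:
   - webvtt.read(str(vtt_file_path)) parses the VTT file that was found.
   - The script then iterates over vtt_content.
   - caption.start, caption.end, and caption.text provide the timing and content of each caption cue.
   - The parsed information goes to stdout. Diagnostic messages (such as "Executing...", "Found VTT file...") go to stderr, which keeps stdout clean for piping.
7. Cleanup:
   - Unless --keep-file is given, os.remove(vtt_file_path) deletes the downloaded VTT file from /tmp.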
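The error handling described in step 4 can be tried without yt-dlp at all. The following is a minimal sketch that uses the current Python interpreter (sys.executable) as a stand-in child process. The run_tool helper and the stand-in commands are illustrative assumptions, not part of the script above.

```python
import subprocess
import sys


def run_tool(command):
    """Run a command; return (ok, stdout, stderr) instead of raising."""
    try:
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        return True, result.stdout, result.stderr
    except subprocess.CalledProcessError as e:
        # Raised by check=True when the child exits with a non-zero status.
        return False, e.stdout, e.stderr
    except FileNotFoundError:
        # Raised when the executable itself cannot be found.
        return False, "", f"{command[0]}: command not found"


# Stand-in for a successful yt-dlp run.
ok, out, err = run_tool([sys.executable, "-c", "print('hello')"])
# Stand-in for a failing yt-dlp run (exit status 3).
failed = run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
# Stand-in for yt-dlp not being installed.
missing = run_tool(["no-such-command-for-this-demo"])
```

Returning a tuple instead of calling sys.exit() is just a convenience for the demo; the script itself exits with status 1 in both failure cases.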
This did not give me the result I wanted, because the output included timestamps. So I asked:
can we eliminate time stamps
00:43:56.840 --> 00:44:00.080
so at least I've got some snacks for the road.
Gemini came back with the following:
    # Output parsed content (text only) to stdout
    print(f"\n--- Subtitle Text from {vtt_file_path.name} (timestamps omitted) ---", file=sys.stderr)
    for caption in vtt_content:
        cleaned_text = caption.text.replace('\n', ' ').strip()
        if cleaned_text:  # Only print if there's actual text
            print(cleaned_text)  # Output actual data (text only) to stdout
This will produce output like:
First line of subtitles.
Second line of subtitles, perhaps originally on multiple lines in VTT.
And so on.
All diagnostic messages still go to stderr, so stdout contains only the extracted subtitle text, one line per caption.
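To see the stdout/stderr split in isolation, here is a small sketch that runs the same text-only loop over stand-in captions. The Caption tuple and the print_text_only function are hypothetical names used only for illustration; with webvtt-py, the caption objects would come from webvtt.read() instead.

```python
import contextlib
import io
import sys
from collections import namedtuple

# Hypothetical stand-in for a parsed caption; only the attributes the loop uses.
Caption = namedtuple("Caption", ["start", "end", "text"])


def print_text_only(captions):
    """Print caption text to stdout (one line per caption), diagnostics to stderr."""
    print("--- Subtitle Text (timestamps omitted) ---", file=sys.stderr)
    for caption in captions:
        cleaned_text = caption.text.replace("\n", " ").strip()
        if cleaned_text:  # skip cues with no text
            print(cleaned_text)


captions = [
    Caption("00:00:01.000", "00:00:03.000", "First line\nof subtitles."),
    Caption("00:00:03.000", "00:00:04.000", "   "),
    Caption("00:00:04.000", "00:00:06.000", "Second line."),
]

# Capture stdout only, the way a pipe would; the header still goes to stderr.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print_text_only(captions)
stdout_text = buffer.getvalue()
```

Here stdout_text holds only the two non-empty caption lines, which is what a downstream command in a pipe would receive.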
The output is similar to the original yt. yt-dlp does produce some noisy output, but all of it goes to stderr; stdout is just the subtitle text.
[jon2allen@jons-bad-ass-fedora-server-37 python]$ ./yt_trans_dlp.py https://www.youtube.com/watch?v=XmD1eGuX4GU
Executing: yt-dlp --write-subs --sub-format vtt --skip-download https://www.youtube.com/watch?v=XmD1eGuX4GU -o /tmp/trans1.out
yt-dlp STDOUT:
[youtube] Extracting URL: https://www.youtube.com/watch?v=XmD1eGuX4GU
[youtube] XmD1eGuX4GU: Downloading webpage
[youtube] XmD1eGuX4GU: Downloading tv client config
[youtube] XmD1eGuX4GU: Downloading tv player API JSON
[youtube] XmD1eGuX4GU: Downloading ios player API JSON
[youtube] XmD1eGuX4GU: Downloading player fc2a56a5-main
[youtube] XmD1eGuX4GU: Downloading m3u8 information
[info] XmD1eGuX4GU: Downloading subtitles: en-GB
[info] XmD1eGuX4GU: Downloading 1 format(s): 625+234
[info] Writing video subtitles to: /tmp/trans1.out.en-GB.vtt
[download] Destination: /tmp/trans1.out.en-GB.vtt
...
Found VTT file: /tmp/trans1.out.en-GB.vtt
--- Subtitle Text from trans1.out.en-GB.vtt (timestamps omitted) ---
In a land far, far away.
For the fifth time, I embarked on a journey of a lifetime,
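The log above shows yt-dlp writing /tmp/trans1.out.en-GB.vtt, which is the <output_template>.LANG.vtt naming that the script's glob relies on. Below is a small sketch of that lookup, run against a throwaway directory of empty placeholder files. The file names other than the en-GB one are made up for the demo, and sorting the matches is my own addition so that the "first" file is predictable.

```python
import tempfile
from pathlib import Path


def find_subtitle_files(output_base):
    """Return the .vtt files matching <output_base>.LANG.vtt, sorted by name."""
    base = Path(output_base)
    pattern = f"{base.name}.*.vtt"
    return sorted(base.parent.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "trans1.out"
    # Empty placeholder files standing in for yt-dlp output and unrelated files.
    for name in ("trans1.out.en-GB.vtt", "trans1.out.de.vtt",
                 "trans1.out.en.srt", "other.fr.vtt"):
        (Path(tmp) / name).touch()
    found_names = [p.name for p in find_subtitle_files(base)]
```

Only the two trans1.out.*.vtt files are matched; the .srt file and the file with a different base name are ignored.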
You can set up an alias for the yt command on Linux:
alias yt='/home/jon2allen/python/yt_trans_dlp.py'
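That gives you a complete replacement for yt. A few small improvements are still possible, such as using the pid in the output name so that several copies of the script can run concurrently.
Here is the final code: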
#!/usr/bin/env python3
import subprocess
import argparse
import sys
from pathlib import Path
import webvtt  # For parsing VTT files (pip install webvtt-py)
import os


def main():
    parser = argparse.ArgumentParser(
        description="Download YouTube video subtitles using yt-dlp, parse the VTT, and print only the text to stdout."
    )
    parser.add_argument("youtube_url", help="The URL of the YouTube video.")
    parser.add_argument(
        "--output-base",
        default="/tmp/trans1.out",
        help="Base path for the temporary subtitle file (default: /tmp/trans1.out). "
             "yt-dlp will append language and .vtt extension to this base."
    )
    parser.add_argument(
        "--keep-file",
        action="store_true",
        help="Keep the downloaded .vtt file in /tmp instead of deleting it."
    )
    args = parser.parse_args()

    output_base_path_template = Path(args.output_base)
    output_directory = output_base_path_template.parent
    output_filename_stem_for_glob = output_base_path_template.name

    try:
        output_directory.mkdir(parents=True, exist_ok=True)
    except OSError as e:
        print(f"Error creating output directory {output_directory}: {e}", file=sys.stderr)
        sys.exit(1)

    command = [
        "yt-dlp",
        "--write-subs",
        "--sub-format", "vtt",
        "--skip-download",
        args.youtube_url,
        "-o", str(output_base_path_template)
    ]
    print(f"Executing: {' '.join(command)}", file=sys.stderr)

    try:
        process = subprocess.run(command, check=True, capture_output=True, text=True)
        if process.stdout:
            print(f"yt-dlp STDOUT:\n{process.stdout.strip()}", file=sys.stderr)
        if process.stderr:
            print(f"yt-dlp STDERR:\n{process.stderr.strip()}", file=sys.stderr)
    except subprocess.CalledProcessError as e:
        print(f"Error executing yt-dlp. Return code: {e.returncode}", file=sys.stderr)
        if e.stdout:
            print(f"yt-dlp STDOUT:\n{e.stdout.strip()}", file=sys.stderr)
        if e.stderr:
            print(f"yt-dlp STDERR:\n{e.stderr.strip()}", file=sys.stderr)
        sys.exit(1)
    except FileNotFoundError:
        print("Error: yt-dlp command not found. Please ensure it's installed and in your PATH.", file=sys.stderr)
        sys.exit(1)

    glob_pattern = f"{output_filename_stem_for_glob}.*.vtt"
    vtt_files_found = list(output_directory.glob(glob_pattern))
    if not vtt_files_found:
        print(f"Error: No .vtt file found in {output_directory} matching pattern '{glob_pattern}'", file=sys.stderr)
        print("This could be because no subtitles were available or yt-dlp had an issue.", file=sys.stderr)
        sys.exit(1)

    vtt_file_path = vtt_files_found[0]
    print(f"Found VTT file: {vtt_file_path}", file=sys.stderr)
    if len(vtt_files_found) > 1:
        print(f"Warning: Multiple VTT files found ({[str(f.name) for f in vtt_files_found]}). Processing the first one: {vtt_file_path.name}", file=sys.stderr)

    try:
        vtt_content = webvtt.read(str(vtt_file_path))
    except Exception as e:
        print(f"Error parsing VTT file {vtt_file_path}: {e}", file=sys.stderr)
        if not args.keep_file:
            try:
                os.remove(vtt_file_path)
            except OSError as del_e:
                print(f"Error deleting {vtt_file_path} after parse error: {del_e}", file=sys.stderr)
        sys.exit(1)

    # Output parsed content (text only) to stdout
    print(f"\n--- Subtitle Text from {vtt_file_path.name} (timestamps omitted) ---", file=sys.stderr)
    for caption in vtt_content:
        cleaned_text = caption.text.replace('\n', ' ').strip()
        if cleaned_text:
            print(cleaned_text)

    if not args.keep_file:
        try:
            os.remove(vtt_file_path)
            print(f"Cleaned up {vtt_file_path}", file=sys.stderr)
        except OSError as e:
            print(f"Error deleting {vtt_file_path}: {e}", file=sys.stderr)
    else:
        print(f"Kept VTT file: {vtt_file_path}", file=sys.stderr)


if __name__ == "__main__":
    main()
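The pid idea mentioned earlier is not in the final script. One way it might look, as a rough sketch: derive the output base from os.getpid() so that two runs at the same time do not overwrite or pick up each other's files. The helper name default_output_base and the exact file name layout are my assumptions.

```python
import os
import tempfile
from pathlib import Path


def default_output_base(prefix="trans1"):
    """Build a per-process output base such as /tmp/trans1.12345.out."""
    return Path(tempfile.gettempdir()) / f"{prefix}.{os.getpid()}.out"


base = default_output_base()
# Glob pattern that matches only this process's subtitle files,
# for example trans1.12345.out.*.vtt
glob_pattern = f"{base.name}.*.vtt"
```

The returned path could serve as the default for --output-base. tempfile.TemporaryDirectory from the standard library would be another option, and it also takes care of cleanup.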
Original article: Using yt-dlp to download youtube transcript
Translated and compiled by 汇智网; please credit the source when reposting.