1. Project Overview
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a fully end-to-end TTS model: it maps text directly to a speech waveform within a single model, with no separately trained vocoder and no hand-crafted intermediate acoustic features.
Its inference flow is roughly: the input text is first normalized and converted into a phoneme sequence; a text encoder then turns the phonemes into a latent representation; finally, the decoder generates the output waveform directly from that latent representation.
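As a toy sketch of that text-to-latent-to-waveform flow (the class and module choices below are illustrative stand-ins, not the actual VITS architecture, which uses a prior encoder, normalizing flows, a stochastic duration predictor, and a HiFi-GAN-style decoder):

import torch
import torch.nn as nn

class ToyVITS(nn.Module):
    """Illustrative only: phoneme IDs -> latent -> waveform."""
    def __init__(self, n_phonemes=64, latent=192):
        super().__init__()
        # stand-in for the text (prior) encoder
        self.text_encoder = nn.Embedding(n_phonemes, latent)
        # stand-in for the upsampling waveform decoder
        self.decoder = nn.ConvTranspose1d(latent, 1, kernel_size=256, stride=64)

    def forward(self, phoneme_ids):
        z = self.text_encoder(phoneme_ids).transpose(1, 2)  # (B, latent, T_text)
        return self.decoder(z)                               # (B, 1, T_audio)

phonemes = torch.randint(0, 64, (1, 10))  # a fake 10-phoneme input
wav = ToyVITS()(phonemes)
print(wav.shape)  # torch.Size([1, 1, 832])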
Its strength is that it can generate speech whose quality rivals a real human voice; the downsides are that training demands a large speech corpus and a fairly involved training pipeline.
VITS-fast-fine-tuning is therefore a one-stop multi-speaker fine-tuning tool built on top of VITS. By fine-tuning a pretrained VITS model, it lets a user finish fine-tuning in under an hour; the resulting model can then perform speech synthesis and voice cloning in the target timbre.
[Project] https://github.com/Plachtaa/VITS-fast-fine-tuning
[Data format] https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD
2. Local Deployment
Make sure you have Python 3.8, CMake, a C/C++ compiler, and ffmpeg installed.
pip install -r requirements.txt
# libraries needed for processing video data
pip install imageio==2.4.1
pip install moviepy
Build monotonic align (necessary for training)
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
cd ..
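To confirm the extension actually built, a quick import check from the repo root should succeed (this assumes the upstream VITS layout, where monotonic_align/__init__.py wraps the compiled core and exports maximum_path):

python -c "from monotonic_align import maximum_path; print('monotonic_align OK')"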
Download the auxiliary training data:
mkdir pretrained_models
# download data for fine-tuning
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/sampled_audio4ft_v2.zip
unzip sampled_audio4ft_v2.zip
# create necessary directories
mkdir video_data
mkdir raw_audio
mkdir denoised_audio
mkdir custom_character_voice
mkdir segmented_character_voice
Download one of the pretrained models; the available options are:
CJE: Trilingual (Chinese, Japanese, English)
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json -O ./configs/finetune_speaker.json

CJ: Bilingual (Chinese, Japanese)
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/D_0-p.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/G_0-p.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/config.json -O ./configs/finetune_speaker.json

C: Chinese only
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/D_0.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/G_0.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/config.json -O ./configs/finetune_speaker.json
Put your custom data under custom_character_voice, one sub-folder per speaker:
custom_character_voice
- XiJun
  - XiJun_1.wav
  - XiJun_2.wav
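Before transcribing, it can help to sanity-check that layout; a minimal sketch, assuming one sub-folder per speaker containing .wav files:

from pathlib import Path

# list each speaker folder under custom_character_voice and count its wavs,
# so misplaced files are caught before preprocessing
root = Path("custom_character_voice")
for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    wavs = sorted(speaker_dir.glob("*.wav"))
    print(f"{speaker_dir.name}: {len(wavs)} wav file(s)")
    assert wavs, f"no .wav files found for speaker {speaker_dir.name}"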
3. Local Training
[Speech recognition] Transcribe the audio with whisper-large; run only the scripts that match the kind of data you actually have. Here {PRETRAINED_MODEL} is the language option of the pretrained model you downloaded (CJE, CJ, or C):
python scripts/video2audio.py
python scripts/denoise_audio.py
python scripts/long_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/short_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
# run this if you use the auxiliary training data; remember to adjust the directory
python scripts/resample.py
[Error] Given groups=1, weight of size [1280, 128, 3], expected input[1, 80, 3000]
[Fix] In short_audio_transcribe.py, line 24:
mel = whisper.log_mel_spectrogram(audio).to(model.device)
👇
mel = whisper.log_mel_spectrogram(audio, n_mels=128).to(model.device)
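The mismatch happens because the large-v3 checkpoint expects 128 mel bins while log_mel_spectrogram defaults to 80. If you want the fix to keep working across whisper sizes, the bin count can instead be read off the loaded model (model and audio are the variables already in scope at that line; dims.n_mels is part of openai-whisper's ModelDimensions):

# read the expected mel-bin count from the checkpoint itself:
# 128 for large-v3, 80 for earlier whisper models
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)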
[Data preprocessing] python preprocess_v2.py --add_auxiliary_data True --languages "{PRETRAINED_MODEL}"
[Training] python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed True
- At least 100 epochs is recommended.
- Silencing some of the noisier logs helps; add near the top of the script:
import warnings
import logging

warnings.simplefilter(action='ignore', category=FutureWarning)
logging.getLogger('numba').setLevel(logging.WARNING)
warnings.filterwarnings(
    "ignore",
    message="stft with return_complex=False is deprecated"
)
[Error] Could not find module 'libtorio_ffmpeg6.pyd' (or one of its dependencies).
[Fix] Add at the very top of finetune_speaker_v2.py:
from torchaudio._extension.utils import _init_dll_path
_init_dll_path()
[Error] RuntimeError: use_libuv was requested but PyTorch was build without libuv support
[Fix] In main() of finetune_speaker_v2.py, add:
os.environ['USE_LIBUV'] = '0'
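For this to take effect, it has to run before torch.distributed sets up its rendezvous store; a sketch of the intended placement (the surrounding code is abridged):

import os

def main():
    # must be set before the distributed TCP store is created,
    # otherwise the libuv backend is still requested
    os.environ['USE_LIBUV'] = '0'
    ...  # rest of the original main()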
[Error] size mismatch for enc_p.emb.weight: copying a param with shape torch.Size([50, 192]) from checkpoint, the shape in current model is torch.Size([78, 192]).
[Fix 1] The downloaded pretrained model and its config file may be mismatched, e.g. from downloading different variants several times; make sure G_0.pth, D_0.pth, and finetune_speaker.json all come from the same option above.
[Fix 2] Change the default config path in utils.py:
parser.add_argument('-c', '--config', type=str, default="./configs/modified_finetune_speaker.json", help='JSON file for configuration')
👇
parser.add_argument('-c', '--config', type=str, default="D:\\PyCharmWorkSpace\\TTS\\VITS-fast-fine-tuning\\configs\\finetune_speaker.json", help='JSON file for configuration')
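Hard-coding an absolute path works but is brittle. Since this is the same parser the training script goes through, an equivalent fix is to leave utils.py alone and pass the config explicitly via the -c flag it already exposes:

python finetune_speaker_v2.py -m ./OUTPUT_MODEL -c ./configs/finetune_speaker.json --max_epochs "{Maximum_epochs}" --drop_speaker_embed True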
[Error] mel() takes 0 positional arguments but 5 were given
[Fix] pip install librosa==0.8.0 (newer librosa releases made mel()'s arguments keyword-only, which breaks the positional call in this codebase)
4. Inference Results
VITS from scratch: a week of training on four V100 GPUs, and the output was still barely intelligible.
VITS-fast-fine-tuning: 20 minutes (200 epochs) on a single 4070, and the result is quite decent.
[Note] Use the fine-tuned config.json for inference; it is configured mainly in VC_inference.py.
python VC_inference.py
[Error] __init__() got an unexpected keyword argument 'source'
[Fix] Edit VC_inference.py:
record_audio = gr.Audio(label="record your voice", source="microphone")
upload_audio = gr.Audio(label="or upload audio here", source="upload")
👇
record_audio = gr.Audio(label="record your voice")
upload_audio = gr.Audio(label="or upload audio here")
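This simply drops the removed keyword. If you are on Gradio 4.x and still want to restrict one widget to microphone recording and the other to file upload, the replacement parameter is sources (a list):

record_audio = gr.Audio(label="record your voice", sources=["microphone"])
upload_audio = gr.Audio(label="or upload audio here", sources=["upload"])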