After installation, Rasa does not support Chinese conversations by default.
Learning and configuration strategy
The examples I found all use different pipeline configurations, and without trying them hands-on it is hard to tell which is better.
So I start from the simplest configuration that actually runs, for example the Chinese pipeline recommended in the book 《Rasa 实战:构建开源对话机器人》 (Rasa in Action: Building an Open-Source Dialogue Bot), which includes an NLU configuration example for a medical bot. Note that it only covers the NLU part, i.e. intent and entity recognition, with no response configuration.
Result
The demo is implemented with a web widget based on Rasa's WebSocket channel.
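For reference, the WebSocket (socket.io) channel is enabled in credentials.yml; a minimal sketch with Rasa's default event names (the web widget itself is not shown here):
socketio:
  user_message_evt: user_uttered      # event the widget emits for user messages
  bot_message_evt: bot_uttered        # event the bot emits for replies
  session_persistence: true           # keep one conversation per browser session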
The simplest Chinese configuration
Open the config.yml file in the project root and modify it as follows:
recipe: default.v1
language: zh
pipeline:
  - name: JiebaTokenizer
  - name: LanguageModelFeaturizer
    model_name: "bert"
    model_weights: "bert-base-chinese"
  - name: "DIETClassifier"
- language needs to be changed from en to zh, i.e. Chinese.
- For the pipeline, you can refer to the list of Rasa NLU pipeline components I compiled; a lighter alternative without BERT is also sketched right after this list.
- For what each component does and how they differ, see my notes on the respective roles of JiebaTokenizer, LanguageModelFeaturizer, and DIETClassifier in Rasa.
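If the multi-gigabyte BERT download mentioned below is too heavy for your machine, a sparse-feature-only pipeline also handles simple Chinese intents reasonably well. This is my own sketch, not the book's recommendation:
recipe: default.v1
language: zh
pipeline:
  - name: JiebaTokenizer              # Chinese word segmentation
  - name: CountVectorsFeaturizer      # word-level bag-of-words features
  - name: CountVectorsFeaturizer      # character n-gram features, robust to segmentation errors
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100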
What is NLU
NLU is short for Natural Language Understanding.
The role of NLU in Rasa:
The main job of the Rasa NLU module is to parse user input and extract key information such as intents and entities; custom components such as sentiment analysis can also be added.
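The nlu.yml below only exercises intents. Entities are annotated inline in the training examples; a sketch with a made-up inform_city intent and city entity, just to show the syntax:
nlu:
- intent: inform_city
  examples: |
    - 我在[北京](city)
    - 我住在[上海](city)
    - I live in [Beijing](city)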
Configuring nlu.yml
Edit data/nlu.yml and add some Chinese examples alongside the existing English ones.
version: "3.1"
nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning
    - good evening
    - moin
    - hey there
    - let's go
    - hey dude
    - goodmorning
    - goodevening
    - good afternoon
    - 你好!
    - 您好!
    - 在么!
    - 在吗!
    - 喂!
- intent: goodbye
  examples: |
    - cu
    - good by
    - cee you later
    - good night
    - bye
    - goodbye
    - have a nice day
    - see you around
    - bye bye
    - see you later
    - 拜拜!
    - 再见!
    - 拜!
    - 退出。
    - 结束。
    - exit
- intent: affirm
  examples: |
    - yes
    - y
    - indeed
    - of course
    - that sounds good
    - correct
    - 是的
    - 是
- intent: deny
  examples: |
    - no
    - n
    - never
    - I don't think so
    - don't like that
    - no way
    - not really
    - 不
    - 不是的
    - 不是
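A side note: the Chinese examples do not have to live in nlu.yml itself. Rasa merges every training file under data/, so they could also sit in a separate file, e.g. a hypothetical data/nlu_zh.yml:
version: "3.1"
nlu:
- intent: greet
  examples: |
    - 早上好
    - 晚上好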
Retraining the model
The yml files under the data directory are the training data, e.g. nlu.yml.
rasa train nlu
During training it downloads tf_model.h5, 1.88 GB. Why so big? (This file is pulled in by the BERT model. BERT, Bidirectional Encoder Representations from Transformers, uses the Transformer architecture to learn text representations and is used for NLP tasks such as text classification, named entity recognition, and question answering. tf_model.h5 is the TensorFlow version of the pretrained weights; the .h5 extension indicates an HDF5 file.)
The trained model archive, however, is only 20 MB: it packages the DIET weights and pipeline metadata, while the large pretrained featurizer weights stay in the local Hugging Face cache and are loaded again whenever the model is loaded (as the rasa shell logs later show).
> ls -lah models/
total 44M
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr 7 10:35 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr 7 10:03 ../
-rwxrwxrwx 1 zhongwei zhongwei 20M Apr 7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*
Test:
rasa shell nlu
Test results
The greet intent, i.e. greeting:
Next message:
你好
{
  "text": "你好",
  "intent": {
    "name": "greet",
    "confidence": 0.9999979734420776
  },
The goodbye intent, i.e. saying goodbye:
Next message:
再见
{
  "text": "再见",
  "intent": {
    "name": "goodbye",
    "confidence": 0.9999972581863403
  },
Both results are as expected and at least show that Chinese is now supported; with the default en setting, Chinese input got no reply at all.
What surprised me more is the following intent recognition:
Next message:
我拒绝
{
  "text": "我拒绝",
  "intent": {
    "name": "deny",
    "confidence": 0.9226003289222717
  },
I never put the word "拒绝" (refuse) in the deny intent examples, yet it was still recognized correctly. This is the pretrained Chinese-capable language model generalizing beyond my training data, presumably loaded by the LanguageModelFeaturizer in the pipeline; I will look into it in more detail later.
There are also cases I am not happy with:
Next message:
你好啊
{
  "text": "好啊",
  "intent": {
    "name": "affirm",
    "confidence": 0.4897577464580536
  },
  "entities": [],
  "text_tokens": [
    [
      0,
      1
    ],
    [
      1,
      2
    ]
  ],
  "intent_ranking": [
    {
      "name": "affirm",
      "confidence": 0.4897577464580536
    },
    {
      "name": "greet",
      "confidence": 0.34744495153427124
    },
In fact the top candidate intent should be greet, yet it was recognized as affirm. Not smart enough yet, but it basically meets my needs.
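The cheapest improvement is usually more data rather than a different pipeline: add the misclassified phrasing and a few similar colloquial greetings to the greet intent and retrain. The extra examples below are my own additions, just a sketch:
- intent: greet
  examples: |
    - 你好啊
    - 您好啊
    - 嗨
    - 哈喽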
Supporting Chinese responses
The NLU training above only gives us Chinese parsing; the bot still cannot reply in Chinese.
Add Chinese responses in domain.yml:
version: "3.1"
intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
  - mood_unhappy
  - bot_challenge
responses:
  utter_greet:
  - text: "你好!吃了么?"
  utter_cheer_up:
  - text: "Here is something to cheer you up:"
    image: "https://i.imgur.com/nGF1K8f.jpg"
  utter_did_that_help:
  - text: "Did that help you?"
  utter_happy:
  - text: "Great, carry on!"
  utter_goodbye:
  - text: "再见"
  utter_iamabot:
  - text: "我是一个机器人,你可以叫我小远子"
session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true
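For a response such as utter_greet to actually be sent, a policy has to pick it once the intent is recognized; in the default project this comes from the stories and rules shipped under data/. For example, the rules.yml generated by rasa init contains a rule like this, shown here only to make the intent-to-response flow explicit:
version: "3.1"
rules:
  - rule: Say goodbye anytime the user says goodbye
    steps:
      - intent: goodbye
      - action: utter_goodbye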
Retraining
The model trained earlier with rasa train nlu only parses input and contains no response logic, so we need to retrain.
Note: this time do not pass the nlu argument:
> rasa train
The configuration for policies was chosen automatically. It was written into the config file at 'config.yml'.
2023-04-08 09:43:08 INFO rasa.engine.training.hooks - Starting to train component 'JiebaTokenizer'.
2023-04-08 09:43:08 INFO rasa.engine.training.hooks - Finished training component 'JiebaTokenizer'.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.493 seconds.
Prefix dict has been built successfully.
2023-04-08 09:43:10 INFO rasa.nlu.featurizers.dense_featurizer.lm_featurizer - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.
All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2023-04-08 09:43:39 INFO rasa.engine.training.hooks - Starting to train component 'DIETClassifier'.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
rasa.shared.utils.io.raise_warning(
Epochs: 100% 300/300 [00:32<00:00, 9.16it/s, t_loss=0.282, i_acc=1]
2023-04-08 09:44:12 INFO rasa.engine.training.hooks - Finished training component 'DIETClassifier'.
2023-04-08 09:44:12 INFO rasa.engine.training.hooks - Restored component 'MemoizationPolicy' from cache.
2023-04-08 09:44:12 INFO rasa.engine.training.hooks - Restored component 'RulePolicy' from cache.
2023-04-08 09:44:12 INFO rasa.engine.training.hooks - Restored component 'TEDPolicy' from cache.
2023-04-08 09:44:12 INFO rasa.engine.training.hooks - Restored component 'UnexpecTEDIntentPolicy' from cache.
Your Rasa model is trained and saved at 'models/20230408-094308-burning-dessert.tar.gz'.
Looking at the models directory, there is now an extra model file whose name does not start with nlu-; it is about 4 MB larger than the previous one.
> ls -lah models/
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr 8 09:44 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr 7 17:28 ../
-rwxrwxrwx 1 zhongwei zhongwei 24M Apr 8 09:44 20230408-094308-burning-dessert.tar.gz*
-rwxrwxrwx 1 zhongwei zhongwei 20M Apr 7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*
rasa shell
Start rasa shell again and you will see that it also starts the Rasa server and loads the newly trained model file.
> rasa shell
2023-04-08 09:46:57 INFO root - Connecting to channel 'cmdline' which was specified by the '--connector' argument. Any other channels will be ignored. To connect to all given channels, omit the '--connector' argument.
2023-04-08 09:46:57 INFO root - Starting Rasa server on http://0.0.0.0:5005
2023-04-08 09:46:57 INFO rasa.core.processor - Loading model models/20230408-094308-burning-dessert.tar.gz...
2023-04-08 09:46:59 INFO rasa.nlu.featurizers.dense_featurizer.lm_featurizer - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.
All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
rasa.shared.utils.io.raise_warning(
2023-04-08 09:47:43 WARNING rasa.shared.utils.common - The UnexpecTED Intent Policy is currently experimental and might change or be removed in the future 🔬 Please share your feedback on it in the forum (https://forum.rasa.com) to help us make this feature ready for production.
2023-04-08 09:47:50 INFO root - Rasa server is up and running.
Bot loaded. Type a message and press enter (use '/stop' to exit):
Testing a Chinese conversation
Your input -> 你好
你好!吃了么?
Your input -> 你是机器人么
我是一个机器人,你可以叫我小远子
Your input -> 你是谁
我是一个机器人,你可以叫我小远子
Chinese responses are indeed working now.
Errors during rasa train nlu
rasa.engine.exceptions.GraphSchemaValidationException: Component 'JiebaTokenizer' requires the following packages which are currently not installed: jieba.
Fix:
pip3 install jieba
rasa.engine.exceptions.GraphSchemaValidationException: Component 'LanguageModelFeaturizer' requires the following packages which are currently not installed: transformers.
Fix:
pip3 install transformers
Workaround when huggingface is unreachable
In mainland China you may find that models cannot be downloaded from huggingface; see this answer:
https://www.zhihu.com/question/599683557/answer/3352307859
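I have not copied the linked answer here; one common workaround is to download the model files by some other means (for example from a mirror site) and point the featurizer at the local copy. A sketch with hypothetical local paths:
pipeline:
  - name: JiebaTokenizer
  - name: LanguageModelFeaturizer
    model_name: "bert"
    # model_weights may point at a local directory holding the manually downloaded files
    model_weights: "/data/models/bert-base-chinese"
    # cache_dir controls where downloaded weights are cached
    cache_dir: "/data/hf_cache"
  - name: "DIETClassifier"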
References
- https://rasa.com/docs/rasa/language-support/