Chatbot Rasa (2): Chinese Support

Out of the box, a fresh Rasa installation does not support Chinese conversations.

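Strategy for learning and configuring

The examples I found each use a different pipeline configuration, and without trying them hands-on it is hard to tell which ones are better.

So I start with the simplest configuration that actually runs, for example the Chinese pipeline recommended in the book 《Rasa 实战：构建开源对话机器人》, which includes an NLU configuration for a medical bot. That example covers only the NLU part, i.e. intent and entity recognition, and has no response configuration.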

Result

Rasa Chinese chatbot

Implemented as a web page component based on Rasa websocket.

The simplest Chinese configuration

Open config.yml in the project root and change it as follows:

recipe: default.v1

language: zh

pipeline:
  - name: JiebaTokenizer
  - name: LanguageModelFeaturizer
    model_name: "bert"
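    # The key is model_weights (plural). A misspelled key is ignored, and Rasa
    # then falls back to its default weights for "bert" (rasa/LaBSE).
    model_weights: "bert-base-chinese"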
  - name: "DIETClassifier"

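language needs to change from en to zh, i.e. Chinese.
For the pipeline, see my list of Rasa NLU pipeline components.
For what each component does and how they differ, see my article on the roles of and differences between JiebaTokenizer, LanguageModelFeaturizer, and DIETClassifier in Rasa.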
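
Why a word segmenter such as JiebaTokenizer matters can be shown with a small sketch. This is plain Python, not Rasa's tokenizer code: whitespace_tokens is a hypothetical helper that only mimics splitting on spaces. That works for English, but a Chinese sentence has no spaces, so it comes back as one undivided token.

```python
from typing import List


def whitespace_tokens(text: str) -> List[str]:
    """Split on whitespace, the way a naive English-oriented tokenizer would."""
    return text.split()


english = whitespace_tokens("good morning")
chinese = whitespace_tokens("早上好")

print(english)  # ['good', 'morning']
print(chinese)  # ['早上好'] (a single token, no word boundaries found)
```

A segmenter has to find the word boundaries itself, which is the job JiebaTokenizer takes over in the pipeline above.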

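What is NLU

NLU stands for Natural Language Understanding.

What NLU does in Rasa:

The Rasa NLU module parses user input and extracts key information such as intents and entities. Custom components, for example sentiment analysis, can also be added.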

Configuring nlu.yml

Edit data/nlu.yml and add some Chinese examples on top of the existing English ones.

version: "3.1"

nlu:
- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning
    - good evening
    - moin
    - hey there
    - let's go
    - hey dude
    - goodmorning
    - goodevening
    - good afternoon
    - 你好！
    - 您好！
    - 在么！
    - 在吗！
    - 喂！

- intent: goodbye
  examples: |
    - cu
    - good by
    - cee you later
    - good night
    - bye
    - goodbye
    - have a nice day
    - see you around
    - bye bye
    - see you later
    - 拜拜！
    - 再见！
    - 拜！
    - 退出。
    - 结束。
    - exit

- intent: affirm
  examples: |
    - yes
    - y
    - indeed
    - of course
    - that sounds good
    - correct
    - 是的
    - 是

- intent: deny
  examples: |
    - no
    - n
    - never
    - I don't think so
    - don't like that
    - no way
    - not really
    - 不
    - 不是的
    - 不是
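
When mixing languages in one file, it helps to know how many Chinese examples each intent actually has. Below is a rough sketch for that. The helper names has_cjk and chinese_example_counts are my own, the sample dictionary is a hand-copied subset of the data above (not parsed from nlu.yml), and the check only looks at the basic CJK Unified Ideographs block.

```python
from typing import Dict, List


def has_cjk(text: str) -> bool:
    """Return True if the text contains a character in U+4E00..U+9FFF."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def chinese_example_counts(examples: Dict[str, List[str]]) -> Dict[str, int]:
    """Count, per intent, how many examples contain Chinese characters."""
    return {
        intent: sum(1 for item in items if has_cjk(item))
        for intent, items in examples.items()
    }


# Hand-copied subset of the examples above, for illustration only.
sample = {
    "greet": ["hey", "hello", "你好！", "您好！"],
    "affirm": ["yes", "是的", "是"],
    "deny": ["no", "不", "不是的", "不是"],
}

counts = chinese_example_counts(sample)
print(counts)
```

An intent with very few Chinese examples is a candidate for more data before retraining.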

Retraining the model

The yml files under the data directory, such as nlu.yml, hold the training data.

rasa train nlu

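Training downloaded a 1.88 GB tf_model.h5, which surprised me. The file comes with the pretrained BERT model. BERT (Bidirectional Encoder Representations from Transformers) is a language model built on the Transformer architecture. It learns text representations and can be used for tasks such as text classification, named entity recognition, and question answering. BERT itself is not tied to TensorFlow; tf_model.h5 is simply the TensorFlow version of the weights, stored in HDF5 format (hence the .h5 extension). The training log later in this post shows Rasa loading the rasa/LaBSE weights, which would likely account for the size.

The trained model file, however, is only 20M.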

> ls -lah models/
total 44M
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:35 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 10:03 ../
-rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*

Test:

rasa shell nlu

Test results

The greet intent, i.e. a greeting:

Next message:
你好
{
  "text": "你好",
  "intent": {
    "name": "greet",
    "confidence": 0.9999979734420776
  },

The goodbye intent, i.e. saying goodbye:

Next message:
再见
{
  "text": "再见",
  "intent": {
    "name": "goodbye",
    "confidence": 0.9999972581863403
  },

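These two are as expected, and they at least show that Chinese is now supported. With the default language en, Chinese input got no response at all.

What surprised me more was this intent prediction: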

Next message:
我拒绝
{
  "text": "我拒绝",
  "intent": {
    "name": "deny",
    "confidence": 0.9226003289222717
  },

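I never put the word "拒绝" (refuse) in the examples for the deny intent, yet it was still recognized correctly. That points to a pretrained language model being pulled in. Of the three pipeline components, LanguageModelFeaturizer is the one that loads pretrained weights, so it is the likely source. I will look into it further.

There are also unsatisfying cases: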

Next message:
你好啊
{
  "text": "好啊",
  "intent": {
    "name": "affirm",
    "confidence": 0.4897577464580536
  },
  "entities": [],
  "text_tokens": [
    [
      0,
      1
    ],
    [
      1,
      2
    ]
  ],
  "intent_ranking": [
    {
      "name": "affirm",
      "confidence": 0.4897577464580536
    },
    {
      "name": "greet",
      "confidence": 0.34744495153427124
    },

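The top candidate should have been greet, but the message was recognized as affirm. Note, though, that the parsed text in the output is "好啊" (two tokens) rather than "你好啊", so the leading "你" may have been lost on input; for "好啊" alone, affirm is a plausible reading. Still not very smart, but it basically meets my needs.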
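
One common way to deal with shaky predictions like this is to accept the top intent only above a confidence threshold. Rasa has a FallbackClassifier component for that purpose; the sketch below is not that component, just the idea in plain Python. The function name pick_intent and the 0.6 threshold are my own assumptions, and the two parse results are trimmed copies of the outputs above.

```python
from typing import Any, Dict


def pick_intent(
    parse_result: Dict[str, Any],
    threshold: float = 0.6,  # assumed value, tune it on your own data
    fallback: str = "nlu_fallback",
) -> str:
    """Return the top intent name, or the fallback when confidence is too low."""
    intent = parse_result.get("intent", {})
    if intent.get("confidence", 0.0) >= threshold:
        return intent["name"]
    return fallback


# Trimmed copies of the parse results shown above.
confident = {"text": "你好", "intent": {"name": "greet", "confidence": 0.9999979734420776}}
uncertain = {"text": "好啊", "intent": {"name": "affirm", "confidence": 0.4897577464580536}}

print(pick_intent(confident))  # greet
print(pick_intent(uncertain))  # nlu_fallback
```

With a rule like this, the 0.49 prediction would trigger a clarifying question instead of a wrong answer.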

Supporting Chinese responses

The NLU training above only enables parsing Chinese input; it does not add Chinese responses.

Add Chinese responses to domain.yml:

version: "3.1"

intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
  - mood_unhappy
  - bot_challenge

responses:
  utter_greet:
  - text: "你好！吃了么？"

  utter_cheer_up:
  - text: "Here is something to cheer you up:"
    image: "https://i.imgur.com/nGF1K8f.jpg"

  utter_did_that_help:
  - text: "Did that help you?"

  utter_happy:
  - text: "Great, carry on!"

  utter_goodbye:
  - text: "再见"

  utter_iamabot:
  - text: "我是一个机器人，你可以叫我小远子"

session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true

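Retraining

The model trained earlier with rasa train nlu only does parsing and contains no response logic, so the model needs to be retrained.

Note: do not pass the nlu argument this time: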

> rasa train

The configuration for policies was chosen automatically. It was written into the config file at 'config.yml'.
2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Starting to train component 'JiebaTokenizer'.
2023-04-08 09:43:08 INFO     rasa.engine.training.hooks  - Finished training component 'JiebaTokenizer'.
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.493 seconds.
Prefix dict has been built successfully.
2023-04-08 09:43:10 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2023-04-08 09:43:39 INFO     rasa.engine.training.hooks  - Starting to train component 'DIETClassifier'.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
  rasa.shared.utils.io.raise_warning(
Epochs: 100% 300/300 [00:32<00:00,  9.16it/s, t_loss=0.282, i_acc=1]
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Finished training component 'DIETClassifier'.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'MemoizationPolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'RulePolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'TEDPolicy' from cache.
2023-04-08 09:44:12 INFO     rasa.engine.training.hooks  - Restored component 'UnexpecTEDIntentPolicy' from cache.
Your Rasa model is trained and saved at 'models/20230408-094308-burning-dessert.tar.gz'.

In the models directory there is now a new model file whose name does not start with nlu, and it is 4M larger than the earlier one.

> ls -lah models/
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  8 09:44 ./
drwxrwxrwx 1 zhongwei zhongwei 4.0K Apr  7 17:28 ../
-rwxrwxrwx 1 zhongwei zhongwei  24M Apr  8 09:44 20230408-094308-burning-dessert.tar.gz*
-rwxrwxrwx 1 zhongwei zhongwei  20M Apr  7 10:35 nlu-20230407-100759-obtuse-rack.tar.gz*
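
With both kinds of archive in one directory, it is easy to mix them up. Here is a small sketch that sorts file names into NLU-only and full models. It relies only on the "nlu-" prefix visible in the listing above; group_models is a hypothetical helper of mine, not part of Rasa.

```python
from typing import Dict, List


def group_models(filenames: List[str]) -> Dict[str, List[str]]:
    """Group model archives by whether their name starts with 'nlu-'."""
    groups: Dict[str, List[str]] = {"nlu_only": [], "full": []}
    for name in filenames:
        key = "nlu_only" if name.startswith("nlu-") else "full"
        groups[key].append(name)
    return groups


# File names copied from the listing above.
models = [
    "20230408-094308-burning-dessert.tar.gz",
    "nlu-20230407-100759-obtuse-rack.tar.gz",
]

groups = group_models(models)
print(groups["full"])      # ['20230408-094308-burning-dessert.tar.gz']
print(groups["nlu_only"])  # ['nlu-20230407-100759-obtuse-rack.tar.gz']
```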

rasa shell

Starting rasa shell again also starts the Rasa server and loads the newly trained model file.

> rasa shell
2023-04-08 09:46:57 INFO     root  - Connecting to channel 'cmdline' which was specified by the '--connector' argument. Any other channels will be ignored. To connect to all given channels, omit the '--connector' argument.
2023-04-08 09:46:57 INFO     root  - Starting Rasa server on http://0.0.0.0:5005
2023-04-08 09:46:57 INFO     rasa.core.processor  - Loading model models/20230408-094308-burning-dessert.tar.gz...
2023-04-08 09:46:59 INFO     rasa.nlu.featurizers.dense_featurizer.lm_featurizer  - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
  rasa.shared.utils.io.raise_warning(
2023-04-08 09:47:43 WARNING  rasa.shared.utils.common  - The UnexpecTED Intent Policy is currently experimental and might change or be removed in the future 🔬 Please share your feedback on it in the forum (https://forum.rasa.com) to help us make this feature ready for production.
2023-04-08 09:47:50 INFO     root  - Rasa server is up and running.
Bot loaded. Type a message and press enter (use '/stop' to exit):
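
The log shows a server on port 5005, but rasa shell only connects the cmdline channel. As a sketch, assuming the server is instead started with rasa run --enable-api (which exposes Rasa's HTTP API), a parse request to the /model/parse endpoint could be built with the standard library as below. The host localhost and the helper names build_parse_request and parse_message are my own assumptions; nothing is sent until parse_message is called.

```python
import json
import urllib.request
from typing import Any, Dict

# Port taken from the log above; the host is an assumption.
RASA_URL = "http://localhost:5005"


def build_parse_request(text: str, base_url: str = RASA_URL) -> urllib.request.Request:
    """Build a POST request for the /model/parse endpoint without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/model/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_message(text: str) -> Dict[str, Any]:
    """Send the request; needs a running server started with --enable-api."""
    with urllib.request.urlopen(build_parse_request(text), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_parse_request("你好")
print(request.full_url)
print(request.get_method())
```

Calling parse_message("你好") against a running server should return the same kind of JSON that rasa shell nlu printed earlier.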

Chinese conversation test

Your input ->  你好
你好！吃了么？

Your input ->  你是机器人么
我是一个机器人，你可以叫我小远子

Your input ->  你是谁
我是一个机器人，你可以叫我小远子

Chinese responses now work, as expected.

Errors from rasa train nlu

rasa.engine.exceptions.GraphSchemaValidationException: Component 'JiebaTokenizer' requires the following packages which are currently not installed: jieba.

Fix:

pip3 install jieba

rasa.engine.exceptions.GraphSchemaValidationException: Component 'LanguageModelFeaturizer' requires the following packages which are currently not installed: transformers.

Fix:

pip3 install transformers
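
To catch both errors before training starts, a quick check for the two packages can be run first. This is a minimal sketch using only the standard library; missing_packages is a hypothetical helper, and it only tests whether the modules can be found, not whether their versions suit Rasa.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages named in the error messages above.
missing = missing_packages(["jieba", "transformers"])
if missing:
    print("pip3 install " + " ".join(missing))
else:
    print("jieba and transformers are both installed")
```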

What to do when huggingface is unreachable

In mainland China, downloading models from huggingface may fail. See this answer:

https://www.zhihu.com/question/599683557/answer/3352307859

References

https://rasa.com/docs/rasa/language-support/

I'm a developer from Yantai, Shandong. If you have a topic you'd like to discuss or a software development need, feel free to add me on WeChat: zhongwei.

tags: rasa
