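A very typical use case: use rasa to extract the city and time from an incoming message, call a third-party weather API, and return the result to the user.
However, whether I followed the official docs, the e-book, or configurations found online, and whether I used a jieba dict or a lookup table, DIETClassifier kept throwing strange exceptions. I suspected a version bug but had not pinned down the cause.
After two days of tinkering I was getting nowhere. I read part of the rasa source code and was not impressed either: many configuration rules are not covered in the documentation and only become clear from the source. With the project's acceptance deadline approaching, I gave up on the proper route, read the raw message text directly in actions, and extracted the entities with Python regular expressions, which is the job DIETClassifier is supposed to do.
So I don't recommend reading this post; it is only a record of the pitfalls I hit. Don't waste your time reading further.
Update, 2024-02-22
Thanks to a tip from a mysterious and very knowledgeable netizen, 郑🔞🔞, it turned out to be a pipeline problem. Replacing the pipeline with the following fixes it.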
# # No configuration for the NLU pipeline was provided. The following default pipeline was used to train your model.
# # If you'd like to customize it, uncomment and adjust the pipeline.
# # See https://rasa.com/docs/rasa/tuning-your-model for more information.
# - name: WhitespaceTokenizer
- name: JiebaTokenizer  # supports Chinese
- name: RegexFeaturizer
- name: LexicalSyntacticFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 200
  constrain_similarities: true
  entity_recognition: false
- name: EntitySynonymMapper
- name: ResponseSelector
  epochs: 200
  constrain_similarities: true
- name: FallbackClassifier
  threshold: 0.3
  ambiguity_threshold: 0.1
- name: RegexFeaturizer
- name: RegexEntityExtractor
  case_sensitive: False
  use_lookup_tables: False
  use_regexes: True
  use_word_boundaries: True
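Utterly frustrated
Staring at unreadable exception messages and cycling through one pointless configuration after another; I've had enough.
My head was buzzing, so today I sorted out my thoughts on paper and then implemented the feature by bypassing the rasa configuration.
Over the past week I spent every spare moment going through the official rasa docs, the only e-book I could find, and all sorts of odd articles, and not a single one of them targeted the same version. So many fragmented versions with mutually incompatible configuration; it is really off-putting.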
entity
NLU training data consists of example user utterances categorized by intent. Training examples can also include entities. Entities are structured pieces of information that can be extracted from a user's message. You can also add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly.
Entities are parsed out of the messages that users send. The rules for them are configured under the intents in nlu.yml.
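To make the annotation format concrete, here is a minimal sketch of how one annotated training example maps to entity spans. The `[text](entity)` syntax is Rasa's inline annotation format; the function name `parse_annotated_example` and the sample sentence are made up for illustration, and Rasa's own parser handles more cases (roles, groups, synonyms) than this does.

```python
import re
from typing import Dict, List, Tuple

# Matches the simple inline annotation form: [entity text](entity_name)
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def parse_annotated_example(example: str) -> Tuple[str, List[Dict]]:
    """Return the plain text and the entity spans of one annotated example."""
    plain_parts: List[str] = []
    entities: List[Dict] = []
    cursor = 0  # position in the annotated string
    offset = 0  # position in the plain text built so far
    for match in ANNOTATION.finditer(example):
        before = example[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        value, name = match.group(1), match.group(2)
        entities.append(
            {"entity": name, "value": value, "start": offset, "end": offset + len(value)}
        )
        plain_parts.append(value)
        offset += len(value)
        cursor = match.end()
    plain_parts.append(example[cursor:])
    return "".join(plain_parts), entities


# Hypothetical training example, not taken from the project's nlu.yml.
text, entities = parse_annotated_example("[今天](time)[上海](city)的天气")
print(text)
for entity in entities:
    print(entity)
```

The start and end offsets computed here are the values that later have to line up with the tokenizer's token boundaries.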
slot
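A slot is the bot's memory mechanism. Slots exist as key-value pairs (for example, city: Shanghai) and record the key information that has come up during the conversation. This information may come from user input (intents or entities) or from the backend (for example, the result of a takeout order: success or failure). These values usually have a decisive influence on where the conversation goes; in other words, the dialogue management system uses them to predict the next action.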
form
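A form is a conversation whose core purpose is to complete a task; it can be understood as guiding the user through filling in a form. (1) The bot asks the user what they want to do. (2) The user states their need (intent and entities). (3) Based on the user's intent, the bot picks the appropriate form and fills in the entity information the user has already provided. It then checks which fields are still missing and asks the user about them according to some strategy (field order). (4) The user supplies the missing information. (5) The bot fills it in and asks about the next missing field. (6) This repeats until the bot finds the form complete, at which point it carries out the actual task.
In simple scenarios a form is not actually needed. For example, if I don't want to ask the user to supply the city and time, there is no need for a form.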
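The form loop described above can be simulated in a few lines of plain Python. This is only an illustration of the idea, not Rasa's form implementation; `REQUIRED_SLOTS`, `next_missing_slot`, and `run_form` are hypothetical names, and each "turn" is assumed to be the dict of entities already extracted from one user message.

```python
from typing import Dict, List, Optional

REQUIRED_SLOTS = ["city", "time"]  # hypothetical form fields, asked in this order


def next_missing_slot(slots: Dict[str, Optional[str]], required: List[str]) -> Optional[str]:
    """Return the first required slot that has no value yet, or None if complete."""
    for name in required:
        if not slots.get(name):
            return name
    return None


def run_form(turns: List[Dict[str, str]]) -> List[str]:
    """Simulate form filling and return what the bot does after each user turn."""
    slots: Dict[str, Optional[str]] = {name: None for name in REQUIRED_SLOTS}
    bot_actions: List[str] = []
    for entities in turns:
        # Fill in whatever the user provided in this turn.
        for name, value in entities.items():
            if name in slots:
                slots[name] = value
        missing = next_missing_slot(slots, REQUIRED_SLOTS)
        if missing is None:
            bot_actions.append(f"run task with {slots}")
            break
        bot_actions.append(f"ask for {missing}")
    return bot_actions


print(run_form([{"city": "上海"}, {"time": "明天"}]))
```

If the first message already contains both entities, the loop runs the task immediately and never asks a follow-up question, which is the case where a form adds nothing.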
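Differences between form, inform, and slot
You still have to read the official docs; articles found online are not enough:
- Form: A form is an interactive dialogue-management mechanism for collecting the several pieces of information, or slot values, needed for a specific goal. When the bot needs specific information, it prompts the user for it. A form can contain multiple slots, each corresponding to a question that the user must answer before the conversation can continue.
- Inform: Inform is a step in a flow that provides specific information to the user. It usually happens when the user actively asks for information or expresses an intent, for example "I'd like to know more about the product", at which point the bot answers with the relevant product information.
- Slot: A slot is where information is stored during a conversation. When Rasa encounters an intent and extracts the necessary entities, it stores that information in a slot. In later stages of the conversation, Rasa can use this data to decide what action to take next.
In short, form and inform are both ways of collecting information through conversation, while a slot is where the conversation's information is kept.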
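Weather API
Seniverse (心知天气) weather API, free tier:
https://www.seniverse.com/
- 370 major cities in China
- Rate limit of 20 requests per minute, with no cap on total requests
- 99% availability
After registering you get an API key.
Calling the Seniverse weather API from Python to get today's weather for Yantai: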
import requests

url = "https://api.seniverse.com/v3/weather/now.json"
params = {
    # Replace with your API key. Note that it must be the private key; otherwise the
    # API returns: {"status":"The API key is invalid.","status_code":"AP010003"}
    "key": "your_api_key",
    "location": "烟台",
    "language": "zh-Hans",
    "unit": "c",
}
response = requests.get(url, params=params)
if response.status_code == 200:
    data = response.json()
    weather = data["results"][0]["now"]["text"]
    temperature = data["results"][0]["now"]["temperature"]
    print("Weather in Yantai today: " + weather + ", temperature: " + temperature + "℃")
else:
    print("Request failed")
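When this call moves into a bot action, it may help to keep the response parsing separate from the HTTP request so it can be checked without network access. The sketch below assumes only the response shape the snippet above already relies on (`results[0].now.text` and `results[0].now.temperature`); `format_weather` is a hypothetical helper and the sample values are invented.

```python
from typing import Any, Dict


def format_weather(city: str, data: Dict[str, Any]) -> str:
    """Turn a parsed now.json response into a reply, tolerating error payloads."""
    try:
        now = data["results"][0]["now"]
        return f"{city}: {now['text']}, {now['temperature']}°C"
    except (KeyError, IndexError, TypeError):
        # Error responses (such as an invalid key) carry no "results" list.
        return f"{city}: no weather data in response"


# Invented sample payload with the same shape as a successful response.
sample = {"results": [{"now": {"text": "晴", "temperature": "21"}}]}
print(format_weather("烟台", sample))

# The invalid-key payload quoted in the comment above.
error_payload = {"status": "The API key is invalid.", "status_code": "AP010003"}
print(format_weather("烟台", error_payload))
```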
Errors
2023-05-27 13:50:32 INFO rasa.nlu.featurizers.dense_featurizer.lm_featurizer - Model weights not specified. Will choose default model weights: rasa/LaBSE
All model checkpoint layers were used when initializing TFBertModel.
All the layers of TFBertModel were initialized from the model checkpoint at rasa/LaBSE.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
2023-05-27 13:50:53 INFO rasa.engine.training.hooks - Starting to train component 'DIETClassifier'.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/utils/train_utils.py:528:
UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
rasa.shared.utils.io.raise_warning(
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/shared/utils/io.py:98: UserWarning: Misaligned entity annotation in message '今天天气会不会晴朗' with intent 'weather'.
Make sure the start and end values of entities ([(0, 2, '今天')]) in the training data match the token boundaries ([(0, 4, '今天天气'), (4, 5, '会'), (5, 7, '不会'), (7, 9, '晴朗')]).
Common causes:
1) entities include trailing whitespaces or punctuation
2) the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
More info at https://rasa.com/docs/rasa/training-data-format#nlu-training-data
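My guess was that this sentence was not being segmented so that "今天" (today) became its own token, and that a dictionary might be needed.
Adding a dict to JiebaTokenizer did not help either. In the end I worked around it by manually inserting spaces.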
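The warning above boils down to a boundary check: an entity span is usable only if it starts where some token starts and ends where some token ends. A minimal sketch of that check, using the exact offsets printed in the warning; `is_aligned` is a hypothetical helper rather than Rasa's own function, and the second token list is an assumed segmentation in which "今天" is split off.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, text)


def is_aligned(entity: Span, tokens: List[Span]) -> bool:
    """True if the entity starts at a token start and ends at a token end."""
    starts = {start for start, _, _ in tokens}
    ends = {end for _, end, _ in tokens}
    return entity[0] in starts and entity[1] in ends


entity = (0, 2, "今天")

# Token boundaries exactly as reported in the warning.
tokens_from_log = [(0, 4, "今天天气"), (4, 5, "会"), (5, 7, "不会"), (7, 9, "晴朗")]
print(is_aligned(entity, tokens_from_log))  # False: no token ends at offset 2

# Assumed segmentation where "今天" is a token of its own.
tokens_if_split = [(0, 2, "今天"), (2, 4, "天气"), (4, 5, "会"), (5, 7, "不会"), (7, 9, "晴朗")]
print(is_aligned(entity, tokens_if_split))  # True
```

Anything that makes the tokenizer cut after "今天", whether a dictionary entry or a manually inserted space, turns the first case into the second.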
https://github.com/vba34520/Rasa-Weather/blob/master/config.yml
language: zh
pipeline:
  - name: JiebaTokenizer  # jieba tokenizer
    dictionary_path: 'data/dictionary_path/'
  - name: RegexFeaturizer  # regular-expression featurizer
  - name: LexicalSyntacticFeaturizer  # lexical and syntactic featurizer
  - name: CountVectorsFeaturizer  # bag-of-words featurizer
  - name: DIETClassifier  # dual intent and entity transformer (intent classification and entity extraction)
    epochs: 100
  - name: "extractors.match_entity_extractor.MatchEntityExtractor"  # exact-match entity extractor
    dictionary_path: "data/dictionary_path/"
    take_long: True
  - name: EntitySynonymMapper  # maps synonymous entity values
Put city.txt and time.txt in the data/dict directory
- city.txt: a list of city names
- time.txt: date-related words such as 今天 (today) and 明天 (tomorrow)
It made no difference whatsoever.
https://blog.csdn.net/rav009/article/details/119613186
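Rasa currently has a Jieba-based tokenizer component but no entity extraction component. This post introduces a library I wrote that implements a Chinese entity extraction component for Rasa projects.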
https://github.com/rav009/Rasa-Jieba-Ner/blob/main/rasa_jieba_ner.py
But it does not support rasa 3. It threw all sorts of errors, and I couldn't be bothered to fix it.
If you e.g. run the CLI from /Users/ /my-rasa-project and your module MyComponent is in /Users/ /my-rasa-project/custom_components/my_component.py then the module path is custom_components.my_component.MyComponent. Everything except the name entry will be passed as config to your component.
File "/mnt/d/work/py_rasa/extractor/rasa_jieba_ner.py", line 6, in <module>
from rasa.nlu.extractors.extractor import EntityExtractor
ImportError: cannot import name 'EntityExtractor' from 'rasa.nlu.extractors.extractor' (/home/zhongwei/.local/lib/python3.8/site-packages/rasa/nlu/extractors/extractor.py)
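The import from rasa.nlu fails, so I had to search the installed rasa code to find out where the class this third-party component imports is supposed to come from.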
> grep "class EntityExtractor" -r .
./nlu/extractors/extractor.py:class EntityExtractorMixin(abc.ABC):
zhongwei@DESKTOP-FST75GU ~/.l/l/p/s/rasa> pwd
/home/zhongwei/.local/lib/python3.8/site-packages/rasa
Indeed, that class no longer exists; only EntityExtractorMixin is left.
https://rasa.com/docs/rasa/reference/rasa/nlu/extractors/_extractor
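The same check can be done from Python instead of grep, which is handy when the package is installed somewhere awkward. This is a generic sketch using only the standard library; `find_classes` is a hypothetical helper, and it is demonstrated on the built-in `collections` module so it runs without Rasa.

```python
import importlib
import inspect
from typing import List


def find_classes(module_name: str, prefix: str) -> List[str]:
    """List the classes visible in a module whose names start with the given prefix."""
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name, _ in inspect.getmembers(module, inspect.isclass)
        if name.startswith(prefix)
    )


print(find_classes("collections", "Ordered"))  # ['OrderedDict']

# With Rasa installed, the equivalent of the grep above would be:
# find_classes("rasa.nlu.extractors.extractor", "EntityExtractor")
```

Note that this also lists classes the module merely imports, so it answers "what can I import from here" rather than "what is defined here".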
Lookup Tables
This should be the solution. The rasa book on WeChat Read (微信读书) also uses this regex-based approach. Unfortunately, DIETClassifier kept throwing errors.
https://rasa.com/docs/rasa/nlu-training-data/
Lookup tables are lists of words used to generate case-insensitive regular expression patterns. They can be used in the same ways as regular expressions are used, in combination with the RegexFeaturizer and RegexEntityExtractor components in the pipeline.
You can use lookup tables to help extract entities which have a known set of possible values. Keep your lookup tables as specific as possible. For example, to extract country names, you could add a lookup table of all countries in the world:
nlu:
- lookup: country
  examples: |
    - Afghanistan
    - Albania
    - ...
    - Zambia
    - Zimbabwe
When using lookup tables with RegexFeaturizer, provide enough examples for the intent or entity you want to match so that the model can learn to use the generated regular expression as a feature. When using lookup tables with RegexEntityExtractor, provide at least two annotated examples of the entity so that the NLU model can register it as an entity at training time.
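The idea of turning a lookup table into a case-insensitive regular expression can be sketched as follows. This is a simplified illustration, not Rasa's implementation: `lookup_to_pattern` and `extract` are hypothetical names, plain `\b` stands in for whatever boundary handling Rasa really uses, and the `use_word_boundaries` parameter only mirrors the name of the pipeline option.

```python
import re
from typing import List, Optional


def lookup_to_pattern(words: List[str], use_word_boundaries: bool = True) -> "re.Pattern":
    """Build one case-insensitive alternation from a list of lookup entries."""
    # Longest entries first so that longer names win over their prefixes.
    escaped = [re.escape(w) for w in sorted(words, key=len, reverse=True)]
    if use_word_boundaries:
        escaped = [rf"\b{w}\b" for w in escaped]
    return re.compile("|".join(escaped), re.IGNORECASE)


def extract(text: str, words: List[str], use_word_boundaries: bool = True) -> Optional[str]:
    """Return the first lookup entry found in the text, or None."""
    match = lookup_to_pattern(words, use_word_boundaries).search(text)
    return match.group(0) if match else None


print(extract("weather in albania tomorrow", ["Afghanistan", "Albania"]))  # albania
print(extract("今天上海天气", ["上海", "烟台"]))  # None
print(extract("今天上海天气", ["上海", "烟台"], use_word_boundaries=False))  # 上海
```

In this sketch the Chinese lookup only matches once word boundaries are switched off, because text without spaces has no boundary on either side of "上海". That may be worth keeping in mind when setting `use_word_boundaries` for a Chinese pipeline.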
Endless DIETClassifier errors
rasa.engine.exceptions.GraphComponentException: Error running graph component for node train_DIETClassifier3.
Node: 'rasa_sequence_layer_text/rasa_feature_combining_layer_text/concatenate_sparse_dense_features_text_sequence/concat' ConcatOp : Dimension 1 in both shapes must be equal: shape[0] = [64,12,128] vs. shape[1] = [64,8,768] [[{{node rasa_sequence_layer_text/rasa_feature_combining_layer_text/concatenate_sparse_dense_features_text_sequence/concat}}]] [Op:__inference_train_function_32218] rasa.engine.exceptions.GraphComponentException: Error running graph component for node train_DIETClassifier3.
Deleting the lookup table made the error go away... but with it deleted, is there any point to all this?
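The ConcatOp message itself is easier to read with a small shape check. Concatenating along the last axis requires every other dimension to match; the sketch below reports the first one that does not. `concat_mismatch` is a hypothetical helper, and reading the shapes as [batch, sequence length, feature size] is an assumption, not something the log states.

```python
from typing import Optional, Sequence


def concat_mismatch(shape_a: Sequence[int], shape_b: Sequence[int], axis: int = -1) -> Optional[int]:
    """Return the first non-concat dimension where the shapes differ, or None."""
    if len(shape_a) != len(shape_b):
        raise ValueError("shapes must have the same rank")
    axis = axis % len(shape_a)
    for index, (a, b) in enumerate(zip(shape_a, shape_b)):
        if index != axis and a != b:
            return index
    return None


# The two shapes quoted in the error message.
print(concat_mismatch([64, 12, 128], [64, 8, 768]))  # 1
```

Under that assumed reading, dimension 1 is the sequence length, so the two feature sets being concatenated disagree on how many positions a message has (12 versus 8).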
No lookup tables
Defining a few entities in the intent examples solves this. I only figured that out by reading the source code.
- Starting to train component 'RegexEntityExtractor'.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/shared/utils/io.py:98: UserWarning: No lookup tables or regexes defined in the training data that have a name equal to any entity in the training data. In order for this component to work you need to define valid lookup tables or regexes in the training data.
2023-05-29 15:23:34 INFO rasa.engine.training.hooks - Finished training component 'RegexEntityExtractor'.
2023-05-29 15:41:19 INFO rasa.engine.training.hooks - Starting to train component 'DIETClassifier'.
Epochs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:52<00:00, 5.67it/s, t_loss=0.717, i_acc=1]
2023-05-29 15:42:12 INFO rasa.engine.training.hooks - Finished training component 'DIETClassifier'.
2023-05-29 15:42:12 INFO rasa.engine.training.hooks - Starting to train component 'RegexEntityExtractor'.
/home/zhongwei/.local/lib/python3.8/site-packages/rasa/shared/utils/io.py:98: UserWarning: No lookup tables or regexes defined in the training data that have a name equal to any entity in the training data. In order for this component to work you need to define valid lookup tables or regexes in the training data.
2023-05-29 15:42:12 INFO rasa.engine.training.hooks - Finished training component 'RegexEntityExtractor'.
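This order doesn't look right either: why does RegexEntityExtractor come after DIETClassifier?
Removing the lookup section from nlu.yml stops DIETClassifier from raising the error.
Combined with the earlier "No lookup tables" message, my guess was that the lookup table was declared in the wrong form. Was DIETClassifier treating it as training data?
The lookup data could indeed be read, yet DIETClassifier still raised the error.
Only after printing logs did I find the problem: the entities must be defined in nlu.yml, that is, as annotated data; otherwise it fails.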
********** use_lookup_tables ************
use_only_entities: True
training_data.entities: {'time', 'city'}
table name: city
table name: time
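The finding above suggests a simple pre-training check: every lookup table name should also appear as an annotated entity in at least one training example. A minimal sketch of such a check over plain Python data; the helper names and the sample examples are hypothetical, and in a real project the examples would be read from nlu.yml.

```python
import re
from typing import List, Set

# Captures the entity name in Rasa's inline form: [entity text](entity_name)
ANNOTATION = re.compile(r"\[[^\]]+\]\(([^)]+)\)")


def annotated_entities(examples: List[str]) -> Set[str]:
    """Collect the entity names annotated anywhere in the training examples."""
    return {name for example in examples for name in ANNOTATION.findall(example)}


def lookups_without_annotations(lookup_names: List[str], examples: List[str]) -> List[str]:
    """Return lookup table names that no training example annotates."""
    known = annotated_entities(examples)
    return sorted(name for name in lookup_names if name not in known)


lookups = ["city", "time"]

# Unannotated examples: both lookup tables have nothing to attach to.
print(lookups_without_annotations(lookups, ["今天天气会不会晴朗", "上海的天气"]))

# Annotated examples: every lookup table name matches an entity.
print(lookups_without_annotations(lookups, ["[今天](time)天气会不会晴朗", "[上海](city)的天气"]))
```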
Solving it with Python regular expressions
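Your input -> 天气 (weather)
日间多云,夜间晴,最高温度21度,最低温度15度,相对湿度百分之74 (cloudy by day, clear at night, high 21 degrees, low 15 degrees, relative humidity 74%)
Your input -> 今天的天气 (today's weather)
日间多云,夜间晴,最高温度21度,最低温度15度,相对湿度百分之74 (cloudy by day, clear at night, high 21 degrees, low 15 degrees, relative humidity 74%)
Your input -> 明天的天气 (tomorrow's weather)
日间多云,夜间多云,最高温度25度,最低温度15度,相对湿度百分之68 (cloudy by day, cloudy at night, high 25 degrees, low 15 degrees, relative humidity 68%)
Your input -> 后天的天气 (weather the day after tomorrow)
日间晴,夜间多云,最高温度30度,最低温度17度,相对湿度百分之70 (clear by day, cloudy at night, high 30 degrees, low 17 degrees, relative humidity 70%)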
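A regex-based extraction of the kind used for the session above could look roughly like this. It is a sketch under assumptions, not the code that produced that output: the city list, the default city, the day-offset mapping, and the name `extract_city_and_day` are all invented for illustration. Inside a custom action the text would typically come from the tracker's latest message.

```python
import re
from typing import Tuple

CITIES = ["烟台", "上海", "北京"]  # hypothetical city list
DAY_OFFSETS = {"今天": 0, "明天": 1, "后天": 2}  # today, tomorrow, day after tomorrow

CITY_RE = re.compile("|".join(map(re.escape, CITIES)))
DAY_RE = re.compile("|".join(map(re.escape, DAY_OFFSETS)))


def extract_city_and_day(text: str, default_city: str = "烟台") -> Tuple[str, int]:
    """Pull a city and a day offset out of the raw message text.

    Falls back to the default city and to today (offset 0) when nothing matches.
    """
    city_match = CITY_RE.search(text)
    day_match = DAY_RE.search(text)
    city = city_match.group(0) if city_match else default_city
    day = DAY_OFFSETS[day_match.group(0)] if day_match else 0
    return city, day


for message in ["天气", "今天的天气", "明天的天气", "后天上海的天气"]:
    print(message, "->", extract_city_and_day(message))
```

The day offset can then select which day of a forecast response to read, which would be consistent with "天气" and "今天的天气" returning the same answer in the session above.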
References
- Tutorial: https://medium.com/analytics-vidhya/building-a-simple-weather-chatbot-using-rasa-54eaf97daa82
- Corpus: https://github.com/vba34520/Rasa-Weather/blob/master/data/nlu.md
- entity: https://rasa.com/docs/rasa/training-data-format
- Lookup table not taking effect: https://stackoverflow.com/questions/65622756/lookup-table-not-working-after-training-the-model-in-rasa
I'm a developer from Yantai, Shandong. If you have a topic you'd like to discuss or a software development need, feel free to reach me on WeChat: zhongwei.