
          (7) RASA NLU Entity Extractors

          7,141 characters · ~15 min read

          2021-09-11 20:13


          About the Author





          Original: https://zhuanlan.zhihu.com/p/333641672

          Reposted by: 楊夕

          Interview notes: https://github.com/km1994/NLP-Interview-Notes

          Personal notes: https://github.com/km1994/nlp_paper_study


                            


          A dialogue bot needs more than understanding the user's intent: it must also collect the information required to act on it. The variables used for information retrieval are called slots. Slot values are mostly filled from named entities in the user's messages, though occasionally a user's intent itself serves as a slot. For example, if the user's intent is booking a train ticket, the bot must know the departure and destination cities, which requires extracting place-name entities from the conversation. RASA's entity extractors implement this function. The extractors RASA currently supports are:

          MitieEntityExtractor

          Extracts named entities using MitieNLP, so the MitieNLP language model must be loaded. Although MitieTokenizer and MitieFeaturizer also have to be configured in the pipeline, MitieEntityExtractor actually regenerates its own features when it runs.

          As mentioned earlier, Mitie performs entity extraction with a multi-class linear SVM and provides no confidence score in its output.
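          A minimal pipeline sketch for this extractor; the model path below is an example and should point at wherever your Mitie word-feature file actually lives:

```yaml
pipeline:
- name: "MitieNLP"
  # path to the Mitie word-feature file (example path, adjust to your setup)
  model: "data/total_word_feature_extractor.dat"
- name: "MitieTokenizer"
- name: "MitieEntityExtractor"
```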

          SpacyEntityExtractor

          Extracts named entities using SpacyNLP; requires the SpacyNLP language model plus SpacyTokenizer and SpacyFeaturizer.
          spaCy uses a statistical BILOU transition model. So far, SpacyEntityExtractor can only use spaCy's built-in NER models and cannot be retrained on new data, and the model output carries no confidence score either.

          When configuring SpacyEntityExtractor, the dimensions parameter specifies which entity types to extract. The full set is:

          PERSON: People, including fictional.
          NORP: Nationalities or religious or political groups.
          FAC: Buildings, airports, highways, bridges, etc.
          ORG: Companies, agencies, institutions, etc.
          GPE: Countries, cities, states.
          LOC: Non-GPE locations, mountain ranges, bodies of water.
          PRODUCT: Objects, vehicles, foods, etc. (Not services.)
          EVENT: Named hurricanes, battles, wars, sports events, etc.
          WORK_OF_ART: Titles of books, songs, etc.
          LAW: Named documents made into laws.
          LANGUAGE: Any named language.
          DATE: Absolute or relative dates or periods.
          TIME: Times smaller than a day.
          PERCENT: Percentage, including "%".
          MONEY: Monetary values, including unit.
          QUANTITY: Measurements, as of weight or distance.
          ORDINAL: "first", "second", etc.
          CARDINAL: Numerals that do not fall under another type.

          If dimensions is not specified, all of them are returned by default. Configuration:

          pipeline:
          - name: "SpacyEntityExtractor"
            # dimensions to extract
            dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]

          CRFEntityExtractor

          A conditional random field (CRF) entity extractor. CRFs are currently the most widely used NER tool; combined with an LSTM or with BERT they achieve very good results.

          To pass custom features (for example pre-trained word embeddings) to CRFEntityExtractor, add any featurizer that outputs dense features to the pipeline before CRFEntityExtractor. CRFEntityExtractor looks for the dense feature vectors automatically and checks that they form an iterable of len(tokens) entries, each of which is a vector. If the check fails, a warning is shown, and CRFEntityExtractor continues training while discarding the custom feature vectors. If the custom features pass the check, CRFEntityExtractor hands the dense feature vectors to sklearn_crfsuite for training.

          Because the CRF must estimate, for each token, the probability that it belongs to an entity, the dense features for a sentence should form a matrix of shape [number of tokens × feature dimension per token].
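          The shape requirement above can be sketched in plain Python. The helper below is a hypothetical illustration of the check described, not RASA's actual API:

```python
tokens = ["book", "a", "ticket", "to", "Beijing"]
feature_dim = 4

# One feature vector per token: a [len(tokens) x feature_dim] matrix.
dense_features = [[0.0] * feature_dim for _ in tokens]

def looks_like_valid_dense_features(features, tokens):
    """Hypothetical sketch of the check described above: the custom
    dense features must be an iterable of len(tokens) entries,
    each entry itself a vector."""
    rows = list(features)
    return len(rows) == len(tokens) and all(
        hasattr(row, "__len__") and not isinstance(row, str) for row in rows
    )
```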

          CRFEntityExtractor has a default feature list, which can be overridden with the following options:

          Feature Name   Description
          -------------  ----------------------------------------------------------------
          low            Checks if the token is lower case.
          upper          Checks if the token is upper case.
          title          Checks if the token starts with an uppercase character and
                         all remaining characters are lowercased.
          digit          Checks if the token contains just digits.
          prefix5        Take the first five characters of the token.
          prefix2        Take the first two characters of the token.
          suffix5        Take the last five characters of the token.
          suffix3        Take the last three characters of the token.
          suffix2        Take the last two characters of the token.
          suffix1        Take the last character of the token.
          pos            Take the Part-of-Speech tag of the token
                         (SpacyTokenizer required).
          pos2           Take the first two characters of the Part-of-Speech tag
                         of the token (SpacyTokenizer required).
          pattern        Take the patterns defined by RegexFeaturizer.
          bias           Add an additional "bias" feature to the list of features.

          As the featurizer's sliding window moves over the tokens of the user message, feature templates can be defined for the previous token, the current token, and the next token in the window, in the array format [before, token, after]. In addition, a BILOU_flag can be set to decide whether to use the BILOU tagging scheme (an encoding that marks the beginning, inside, and last tokens of an entity).

          pipeline:
          - name: "CRFEntityExtractor"
            # BILOU_flag determines whether to use BILOU tagging or not.
            "BILOU_flag": True
            # features to extract in the sliding window
            "features": [
              ["low", "title", "upper"],
              [
                "bias",
                "low",
                "prefix5",
                "prefix2",
                "suffix5",
                "suffix3",
                "suffix2",
                "upper",
                "title",
                "digit",
                "pattern",
              ],
              ["low", "title", "upper"],
            ]
            # The maximum number of iterations for optimization algorithms.
            "max_iterations": 50
            # weight of the L1 regularization
            "L1_c": 0.1
            # weight of the L2 regularization
            "L2_c": 0.1
            # Names of dense featurizers to use.
            # If the list is empty, all available dense features are used.
            "featurizers": []
            # Indicates whether a list of extracted entities should be split
            # into individual entities for a given entity type
            "split_entities_by_comma":
              address: False
              email: True

          If the POS features (pos or pos2) are used, SpacyTokenizer must be in the pipeline.

          If the pattern feature is used, RegexFeaturizer must be in the pipeline.
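          For example, a sketch of a pipeline that makes both feature types available to the CRF; component order matters, since the tokenizer and RegexFeaturizer must run before CRFEntityExtractor:

```yaml
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
  "features": [["pos"], ["pos", "pattern"], ["pos"]]
```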

          DucklingHTTPExtractor

          This component lets Rasa call a remote HTTP service, known as a Duckling server, to extract named entities.

          A Duckling server can be started as a docker container:

          docker run -p 8000:8000 rasa/duckling

          Alternatively, Duckling can be installed directly and the server started from there.

          Duckling recognizes dates, numbers, distances, and other structured entities, and normalizes them. Duckling tries to extract as many entity types as possible, without ranking them. For example, for the sentence "I will be there in 10 minutes", if both number and time are configured as Duckling dimensions, Duckling extracts two entities: 10 as a number and in 10 minutes as a time. In such cases the application must decide which entity type is the right one. Duckling is a rule-based system, so extracted entities always come back with a confidence of 1.0.
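          That application-side disambiguation can be sketched as a simple overlap filter. The PRIORITY table and resolve_overlaps helper below are hypothetical, not part of Rasa or Duckling:

```python
# Hypothetical Duckling output for "I will be there in 10 minutes"
# with both "number" and "time" dimensions enabled.
duckling_entities = [
    {"start": 19, "end": 21, "value": 10, "entity": "number"},
    {"start": 16, "end": 29, "value": "in 10 minutes", "entity": "time"},
]

# Application-specific preference: lower rank wins (hypothetical).
PRIORITY = {"time": 0, "number": 1}

def resolve_overlaps(entities, priority):
    """Keep at most one entity per character span, preferring the
    dimension ranked highest in the priority table."""
    kept = []
    for ent in sorted(entities, key=lambda e: priority.get(e["entity"], 99)):
        # Keep this entity only if its span does not overlap a kept one.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"]
               for k in kept):
            kept.append(ent)
    return kept
```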

          The list of supported languages can be found in the Duckling GitHub repository.

          Configuration:

          pipeline:
          - name: "DucklingHTTPExtractor"
            # url of the running duckling server
            url: "http://localhost:8000"
            # dimensions to extract
            dimensions: ["time", "number", "amount-of-money", "distance"]
            # allows you to configure the locale; by default the language is used
            locale: "de_DE"
            # if not set, the default timezone of Duckling is used;
            # needed to calculate dates from relative expressions like "tomorrow"
            timezone: "Europe/Berlin"
            # timeout for receiving a response from the duckling server;
            # if not set, the default timeout is 3 seconds
            timeout: 3

          DIETClassifier

          As introduced earlier, DIET performs entity recognition together with intent classification.

          RegexEntityExtractor

          This component extracts entities using the lookup tables and regular expressions defined in the training data. It checks whether the user message contains an entry from a lookup table or matches one of the regular expressions; if a match is found, the value is extracted as an entity.

          The component only uses regex patterns whose name equals one of the entities defined in the training data, so make sure each entity is annotated in at least one training example.

          case_sensitive: specifies whether matching is case sensitive.

          use_word_boundaries: has no effect for Chinese; it is used with WhitespaceTokenizer.

          pipeline:
          - name: RegexEntityExtractor
            # text will be processed case-insensitively by default
            case_sensitive: False
            # use lookup tables to extract entities
            use_lookup_tables: True
            # use regexes to extract entities
            use_regexes: True
            # use word-boundary matching for lookup tables
            "use_word_boundaries": True

          EntitySynonymMapper

          Entity synonym mapping. This component takes entities extracted by other extractors and, using a synonym table, normalizes different surface forms to a single canonical value, which simplifies downstream processing. For example:

          [
            {
              "text": "I moved to New York City",
              "intent": "inform_relocation",
              "entities": [{
                "value": "nyc",
                "start": 11,
                "end": 24,
                "entity": "city"
              }]
            },
            {
              "text": "I got a new flat in NYC.",
              "intent": "inform_relocation",
              "entities": [{
                "value": "nyc",
                "start": 20,
                "end": 23,
                "entity": "city"
              }]
            }
          ]

          Whether the user message contains New York City or NYC, both are mapped to nyc. Note that EntitySynonymMapper does not extract entities itself; it only maps entities extracted by other extractors.
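          Besides annotating values in examples as above, synonyms can also be declared directly in the training data; a sketch in RASA training-data format:

```yaml
nlu:
- synonym: nyc
  examples: |
    - New York City
    - NYC
```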

          If none of these meets your requirements, RASA allows you to extend it with custom components, which will be covered in a dedicated chapter.

           

