A Roundup of Essential NLP Libraries for Data Competitions!
Must-have NLP libraries
Last week, after we introduced the OpenMMLab family of CV libraries, many readers asked whether we could recommend NLP libraries as well. So this week we have put together the NLP libraries most useful for machine learning and competitions, ready for you to pick up and use. Consider it another installment of the "save it, then study it" series.
jieba is the most widely used Chinese word segmentation library. It supports three segmentation modes (accurate mode, full mode, and search-engine mode) as well as user-defined dictionaries.

import jieba

seg_list = jieba.cut("我來到北京清華大學(xué)", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # full mode
# [Full mode]: 我/ 來到/ 北京/ 清華/ 清華大學(xué)/ 華大/ 大學(xué)

seg_list = jieba.cut("我來到北京清華大學(xué)", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # accurate mode
# [Accurate mode]: 我/ 來到/ 北京/ 清華大學(xué)

seg_list = jieba.cut("他來到了網(wǎng)易杭研大廈")  # accurate mode is the default
print(", ".join(seg_list))
# [New-word recognition]: 他, 來到, 了, 網(wǎng)易, 杭研, 大廈
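jieba also offers a search-engine mode, which additionally emits the shorter sub-words contained in long words and thus improves recall when building a search index. A small sketch, based on the example in jieba's README:

seg_list = jieba.cut_for_search("小明碩士畢業(yè)于中國科學(xué)院計(jì)算所,后在日本京都大學(xué)深造")  # search-engine mode
print(", ".join(seg_list))
# long words such as 中國科學(xué)院 and 日本京都大學(xué) appear both whole and as their sub-words (中國/ 科學(xué)/ 學(xué)院, 日本/ 京都/ 大學(xué), ...)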
spaCy is an industrial-strength NLP library that can be used alongside deep learning frameworks. It provides standard components for most NLP tasks (tokenization, POS tagging, parsing, named entity recognition), interoperates with existing deep learning frameworks, and ships pre-trained models for common languages.
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
spaCy project homepage: https://spacy.io/
Gensim is a library for topic modeling and for training word and document vectors (Word2Vec, FastText, Doc2Vec, LDA) on large corpora.

from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

# Train a small Word2Vec model on gensim's bundled toy corpus
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, vector_size=100, window=5, min_count=1, workers=4)  # gensim >= 4.0 renamed "size" to "vector_size"
model.save(path)  # save to the temporary path created above
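Once training finishes, the learned vectors are exposed on model.wv. A minimal sketch of how you might query them (the word "computer" appears in gensim's toy common_texts corpus, so the neighbours are only illustrative):

vec = model.wv['computer']                        # the 100-dimensional vector for "computer"
print(vec.shape)                                  # (100,)
print(model.wv.most_similar('computer', topn=3))  # nearest words by cosine similarity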
NLTK is the classic Python NLP toolkit; it bundles tokenizers, POS taggers, parsers, and a large collection of corpora.

import nltk

# first-time setup may require, e.g.: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""

tokens = nltk.word_tokenize(sentence)
print(tokens)
# ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
#  'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']

tagged = nltk.pos_tag(tokens)
print(tagged[0:6])
# [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
#  ('Thursday', 'NNP'), ('morning', 'NN')]
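The tagged tokens can also be passed to NLTK's named-entity chunker. A small sketch, assuming the chunker model and word-list resources have already been fetched with nltk.download():

entities = nltk.ne_chunk(tagged)  # returns a Tree whose subtrees mark entities (e.g. PERSON)
print(entities)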
TextBlob is a lightweight NLP library built on top of NLTK and pattern; it exposes a simple, uniform API for POS tagging, noun phrase extraction, sentiment analysis, and more.

from textblob import TextBlob

text = '''The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.'''

blob = TextBlob(text)

blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #           'ultimate movie monster',
                    #           'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341
TextBlob official site: https://textblob.readthedocs.io/en/dev/



Transformers is the most popular NLP library today. It implements a wide range of Transformer models, from BERT and GPT-2 to BART and Reformer. Hugging Face's code is highly readable and its documentation is clear and easy to follow. The official GitHub repository even organizes its example Python scripts by task, such as language modeling, text generation, question answering, and multiple choice.
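As a quick taste of the API, here is a minimal sketch using the pipeline interface for one of those tasks (sentiment analysis; a default pre-trained model is downloaded from the Hub on first use):

from transformers import pipeline

# build a ready-made inference pipeline; a default pre-trained model is pulled automatically
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face Transformers makes NLP competitions much easier."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# the other tasks mentioned above work the same way, e.g.
# pipeline("text-generation"), pipeline("question-answering"), pipeline("fill-mask")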

Hugging Face official site: https://huggingface.co/
OpenNMT is a convenient and powerful toolkit for machine translation and other sequence-learning tasks. Its highly configurable models and training procedures make it a very approachable framework, and because it is open source and simple to use, OpenNMT is a good choice for all kinds of sequence-learning tasks. In its PyTorch implementation (OpenNMT-py), data and model options typically live in a YAML config file, and vocabulary building, training, and translation are driven by the onmt_build_vocab, onmt_train, and onmt_translate command-line tools.

OpenNMT official site: https://opennmt.net/
