
A Compilation of Medical Natural Language Processing Resources


          2021-04-01 11:14





          Chinese_medical_NLP

Evaluation datasets, papers, and other resources for medical NLP, with a focus on Chinese.



Chinese Evaluation Datasets


1. Yidu-S4K: Yidu Cloud structured 4K dataset

Dataset description:

The Yidu-S4K dataset comes from CCKS 2019 Evaluation Task 1, "Named Entity Recognition for Chinese Electronic Medical Records", and comprises two subtasks:

1) Medical named entity recognition: because no publicly available medical entity recognition dataset for Chinese electronic medical records existed, this edition of the evaluation retained the medical NER task and released a revised version of the 2017 dataset along with the task. The subtask's data include a training set and a test set.

2) Medical entity and attribute extraction (cross-hospital transfer): building on medical entity recognition, this subtask extracts predefined attributes of the recognized entities. It is a transfer-learning task: given only a small amount of labeled data from the target setting, a system must perform recognition in the target setting using labeled and unlabeled data from other settings. The subtask's data include a training set (labeled data from non-target and target settings, plus unlabeled data from all settings) and a test set (labeled data from the target setting).
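A minimal reading sketch for the NER subtask follows. The per-line JSON layout, the field names (originalText, entities, start_pos, end_pos, label_type), and the file name are assumptions based on common descriptions of this release, so verify them against the downloaded files.

```python
# Minimal reader for Yidu-S4K-style NER data (assumed: one JSON object
# per line; all field names below are assumptions, not guaranteed).
import json

with open("subtask1_training.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        text = record["originalText"]
        for ent in record["entities"]:
            span = text[ent["start_pos"]:ent["end_pos"]]
            print(ent["label_type"], span)
        break  # inspect the first record only
```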


Dataset address

http://openkg.cn/dataset/yidu-s4k


Baidu Netdisk download: https://pan.baidu.com/s/1QqYtqDwhc_S51F3SYMChBQ

Extraction code: flql


2. Ruijin Hospital diabetes dataset

Dataset description:


The dataset comes from a Tianchi competition aimed at mining the diabetes literature and building a diabetes knowledge graph from diabetes-related textbooks and research papers. Participants were asked to design accurate and efficient algorithms for this challenge. The Season 1 task was "entity annotation based on diabetes clinical guidelines and research papers"; the Season 2 task was "construction of inter-entity relations based on diabetes clinical guidelines and research papers".


Only the training set was released officially; the test set used for the final ranking was not made available.


Dataset address

https://tianchi.aliyun.com/competition/entrance/231687/information


Baidu Netdisk download: https://pan.baidu.com/s/1CWKblBNBqR-vs2h0xiXSdQ

Extraction code: 0c54


3. Yidu-N7K: Yidu Cloud normalized 7K dataset

Dataset description:

The Yidu-N7K dataset comes from CHIP 2019 Evaluation Task 1, "clinical terminology normalization".

Clinical terminology normalization is an indispensable step in medical statistics. In clinical practice, the same diagnosis, operation, drug, examination, lab test, or symptom can be written in hundreds or thousands of different ways; normalization maps each of these surface forms to its standard term. Only on top of normalized terminology can researchers run statistical analyses over electronic medical records. In essence, clinical terminology normalization is a kind of semantic similarity matching task, but because the source terms are phrased so diversely, a single matching model rarely performs well.
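To make the matching view of the task concrete, here is a minimal character n-gram TF-IDF retrieval baseline. The standard vocabulary and queries are toy examples, not drawn from Yidu-N7K, and this is far simpler than what competitive systems do.

```python
# Toy term-normalization baseline: retrieve the nearest standard term
# by character n-gram TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

standard_terms = ["急性上呼吸道感染", "2型糖尿病", "社区获得性肺炎"]  # made-up vocabulary
queries = ["上感", "II型糖尿病"]                                  # made-up surface forms

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
term_matrix = vec.fit_transform(standard_terms)

for q in queries:
    sims = cosine_similarity(vec.transform([q]), term_matrix)[0]
    best = sims.argmax()
    print(q, "->", standard_terms[best], f"(score={sims[best]:.2f})")
```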


Dataset address

http://openkg.cn/dataset/yidu-n7k






4. Chinese medical question answering dataset

Dataset description:

A question answering dataset for Chinese medicine with more than 100,000 entries.


Data files:

questions.csv: all questions and their content. answers.csv: the answers to all questions.

train_candidates.txt, dev_candidates.txt, test_candidates.txt: train/dev/test splits of the two files above.
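A minimal loading sketch for these files follows. The file names come from the description above, but the column names and join key are not documented here, so inspect the headers after downloading; the merge line is a hypothetical illustration.

```python
import pandas as pd

# File names are from the dataset description; adjust paths as needed.
questions = pd.read_csv("questions.csv")
answers = pd.read_csv("answers.csv")

print(questions.columns.tolist())  # inspect the real schema first
print(answers.columns.tolist())

# Assuming both files share a question-id column (name hypothetical):
# qa = answers.merge(questions, on="question_id")
```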


Dataset address

https://www.kesci.com/home/dataset/5d313070cf76a60036e4b023/document


Dataset GitHub address

https://github.com/zhangsheng93/cMedQA2


5. Ping An Healthcare Technology disease QA transfer learning competition

Dataset description:

This competition was Evaluation Task 2 of CHIP 2019, hosted by Ping An Healthcare Technology. CHIP 2019 details: http://cips-chip.org.cn/evaluation

Transfer learning is an important part of natural language processing. Its main aim is to improve learning on a new task by transferring knowledge from related, already-learned tasks, thereby improving a model's ability to generalize.

The goal of this evaluation is cross-disease transfer learning on Chinese disease question answering data. Concretely, given question pairs drawn from 5 different diseases, a system must judge whether the two sentences are the same or close in meaning. All of the text comes from real patient questions posted online, filtered and manually annotated for intent matching.


Dataset address (registration required)

https://www.biendata.com/competition/chip2019/


6. Tianchi COVID-19 question-pair matching competition

Dataset description:


The competition data consist of de-identified pairs of medical questions with annotations. The questions cover 10 diseases: pneumonia, mycoplasma pneumonia, bronchitis, upper respiratory tract infection, pulmonary tuberculosis, asthma, pleurisy, emphysema, the common cold, and hemoptysis.

The data comprise three files: train.csv, dev.csv, and test.csv. Participants receive the training set train.csv and the validation set dev.csv; the test set test.csv is withheld.

Each record consists of Category, Query1, Query2, and Label: the question category, the two questions, and the label. Label indicates whether the two questions have the same meaning, 1 if they do and 0 if they do not. Labels are provided for the training set and withheld for the validation and test sets. A minimal modeling sketch follows the example below.

Example:

Category: pneumonia

Query1: What causes inflammation of the lungs?

Query2: What is inflammation of the lungs caused by?

Label: 1

Category: pneumonia

Query1: What causes inflammation of the lungs?

Query2: What are the symptoms of lung inflammation?

Label: 0
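Since each record is a labeled question pair, a standard baseline is BERT-style sentence-pair classification. This is a sketch, not a competition solution: the stock bert-base-chinese checkpoint stands in for whatever encoder you prefer, and the single full-batch forward pass stands in for a real mini-batched training loop.

```python
# Sentence-pair classification sketch for the train.csv format above
# (columns: Category, Query1, Query2, Label).
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

df = pd.read_csv("train.csv")

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)

# Encode each pair as a single "[CLS] q1 [SEP] q2 [SEP]" input.
enc = tok(df["Query1"].tolist(), df["Query2"].tolist(),
          truncation=True, padding=True, max_length=64,
          return_tensors="pt")
labels = torch.tensor(df["Label"].tolist())

out = model(**enc, labels=labels)  # one full-batch forward pass only;
print(float(out.loss))             # real training would batch and optimize
```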


Dataset address (registration required)

https://tianchi.aliyun.com/competition/entrance/231776/information



Chinese Medical Knowledge Graphs

CMeKG


Address

http://cmekg.pcl.ac.cn/


Introduction: CMeKG (Chinese Medical Knowledge Graph) is a Chinese medical knowledge graph developed with natural language processing and text-mining techniques, in a human-machine collaborative fashion, from large-scale medical text. Its construction draws on authoritative international medical standards such as ICD, ATC, SNOMED, and MeSH, together with large, multi-source, heterogeneous medical texts including clinical guidelines, industry standards, treatment protocols, and medical encyclopedias. CMeKG 1.0 contains structured knowledge for 6,310 diseases, 19,853 drugs (Western medicines, Chinese patent medicines, and Chinese herbal medicines), and 1,237 treatment techniques and devices. It covers more than 30 common relation types, including a disease's clinical symptoms, sites of onset, drug treatments, surgical treatments, differential diagnoses, imaging examinations, risk factors, transmission routes, susceptible populations, and relevant hospital departments, as well as a drug's ingredients, indications, dosage, shelf life, and contraindications. In total, CMeKG describes over one million concept-relation instances and attribute triples.


English Datasets

PubMedQA: A Dataset for Biomedical Research Question Answering

Dataset description: a medical question answering dataset built from PubMed. PubMedQA has 1k expert-annotated, 61.2k unlabeled, and 211.3k artificially generated QA instances.


Paper address

https://arxiv.org/abs/1909.06146


Related Papers

1. Pretrained embeddings for the medical domain

Note: no open-source pretrained models for the Chinese medical domain have been found so far, so the English-language papers below are listed for reference.


BioBERT

Paper title: BioBERT: a pre-trained biomedical language representation model for biomedical text mining


Paper address

https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz682/5566506


Project address

https://github.com/dmis-lab/biobert


Paper summary: starting from general-domain pretrained BERT weights, BioBERT continues pretraining on a large corpus of English biomedical papers from PubMed, and it surpasses prior state-of-the-art models on several biomedical downstream tasks.
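For reference, such a checkpoint can be loaded with the Hugging Face transformers library. The hub ids below are the ones commonly published for these models and are assumptions here; the project repositories above distribute the original weights.

```python
# Load a domain-specific BERT and extract contextual embeddings.
from transformers import AutoTokenizer, AutoModel

name = "dmis-lab/biobert-base-cased-v1.1"  # assumed hub id; verify before use
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

enc = tok("EGFR mutations predict response to gefitinib.",
          return_tensors="pt")
hidden = model(**enc).last_hidden_state  # one vector per input token
print(hidden.shape)

# The same two calls load the other domain models discussed below by
# swapping the hub id, e.g. "allenai/scibert_scivocab_uncased" (also assumed).
```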


Paper abstract:


Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.

          Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.

          Availability and implementation: We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.


SciBERT

Paper title: SciBERT: A Pretrained Language Model for Scientific Text


Paper address

https://arxiv.org/abs/1903.10676


Project address

https://github.com/allenai/scibert/


Paper summary: from the AllenAI team; a BERT for the scientific domain trained on 1.1M+ papers from Semantic Scholar.


Paper abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SCIBERT, a pretrained language model based on BERT (Devlin et al., 2019) to address the lack of high-quality, large-scale labeled scientific data. SCIBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.


Clinical BERT

Paper title: Publicly Available Clinical BERT Embeddings


Paper address

https://www.aclweb.org/anthology/W19-1909/


Project address

https://github.com/EmilyAlsentzer/clinicalBERT


Paper summary: from the NAACL 2019 Clinical NLP Workshop; clinical-domain BERT models trained on 2 million medical records from the MIMIC-III database.


Paper abstract: Contextual word embedding models such as ELMo and BERT have dramatically improved performance for many natural language processing (NLP) tasks in recent months. However, these models have been minimally explored on specialty corpora, such as clinical text; moreover, in the clinical domain, no publicly-available pre-trained BERT models yet exist. In this work, we address this need by exploring and releasing BERT models for clinical text: one for generic clinical text and another for discharge summaries specifically. We demonstrate that using a domain-specific model yields performance improvements on 3/5 clinical NLP tasks, establishing a new state-of-the-art on the MedNLI dataset. We find that these domain-specific models are not as performant on 2 clinical de-identification tasks, and argue that this is a natural consequence of the differences between de-identified source text and synthetically non de-identified task text.


ClinicalBERT (a different team's version)

Paper title: ClinicalBert: Modeling Clinical Notes and Predicting Hospital Readmission


Paper address

https://arxiv.org/abs/1904.05342


Project address

https://github.com/kexinhuang12345/clinicalBERT


Paper summary: also based on the MIMIC-III database, but trained on a random sample of only 100,000 medical records.


Paper abstract: Clinical notes contain information about patients that goes beyond structured data like lab values and medications. However, clinical notes have been underused relative to structured data, because notes are high-dimensional and sparse. This work develops and evaluates representations of clinical notes using bidirectional transformers (ClinicalBert). ClinicalBert uncovers high-quality relationships between medical concepts as judged by humans. ClinicalBert outperforms baselines on 30-day hospital readmission prediction using both discharge summaries and the first few days of notes in the intensive care unit. Code and model parameters are available.


          BEHRT

Paper title: BEHRT: Transformer for Electronic Health Records


Paper address

https://arxiv.org/abs/1907.09538


Project address: not yet open-sourced


Paper summary: in this paper, the embeddings are trained over medical entities rather than over words.


Paper abstract: Today, despite decades of developments in medicine and the growing interest in precision healthcare, vast majority of diagnoses happen once patients begin to show noticeable signs of illness. Early indication and detection of diseases, however, can provide patients and carers with the chance of early intervention, better disease management, and efficient allocation of healthcare resources. The latest developments in machine learning (more specifically, deep learning) provides a great opportunity to address this unmet need. In this study, we introduce BEHRT: A deep neural sequence transduction model for EHR (electronic health records), capable of multitask prediction and disease trajectory mapping. When trained and evaluated on the data from nearly 1.6 million individuals, BEHRT shows a striking absolute improvement of 8.0-10.8%, in terms of Average Precision Score, compared to the existing state-of-the-art deep EHR models (in terms of average precision, when predicting for the onset of 301 conditions). In addition to its superior prediction power, BEHRT provides a personalised view of disease trajectories through its attention mechanism; its flexible architecture enables it to incorporate multiple heterogeneous concepts (e.g., diagnosis, medication, measurements, and more) to improve the accuracy of its predictions; and its (pre-)training results in disease and patient representations that can help us get a step closer to interpretable predictions.


2. Surveys

A survey in Nature Medicine

Paper title: A guide to deep learning in healthcare


Paper address

https://www.nature.com/articles/s41591-018-0316-z


Paper summary: published in Nature Medicine; surveys applications of computer vision, NLP, reinforcement learning, and related methods in medicine.


Paper abstract: Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of computer vision focuses largely on medical imaging, and we describe the application of natural language processing to domains such as electronic health record data. Similarly, reinforcement learning is discussed in the context of robotic-assisted surgery, and generalized deep-learning methods for genomics are reviewed.


3. Papers on electronic health records

Transfer Learning from Medical Literature for Section Prediction in Electronic Health Records


Paper address

https://www.aclweb.org/anthology/D19-1492/


Paper summary: published at EMNLP 2019; EHR-oriented transfer learning from a small amount of in-domain data and a large amount of out-of-domain data.


Paper abstract: Sections such as Assessment and Plan, Social History, and Medications help physicians find information easily and can be used by an information retrieval system to return specific information sought by a user. However, it is common that the exact format of sections in a particular EHR does not adhere to known patterns. Therefore, being able to predict sections and headers in EHRs automatically is beneficial to physicians. Prior approaches in EHR section prediction have only used text data from EHRs and have required significant manual annotation. We propose using sections from medical literature (e.g., textbooks, journals, web content) that contain content similar to that found in EHR sections. Our approach uses data from a different kind of source where labels are provided without the need of a time-consuming annotation effort. We use this data to train two models: an RNN and a BERT-based model. We apply the learned models along with source data via transfer learning to predict sections in EHRs. Our results show that medical literature can provide helpful supervision signal for this classification task.


4. Medical relation extraction

Leveraging Dependency Forest for Neural Medical Relation Extraction


Paper address

https://www.aclweb.org/anthology/D19-1020/


Paper summary: published at EMNLP 2019. Uses dependency forests to raise the recall of dependency relations in medical sentences, at the cost of introducing some noise, and extracts features with a graph recurrent network; this offers one way to exploit dependency information in medical relation extraction while limiting error propagation.


Paper abstract: Medical relation extraction discovers relations between entity mentions in text, such as research articles. For this task, dependency syntax has been recognized as a crucial source of features. Yet in the medical domain, 1-best parse trees suffer from relatively low accuracies, diminishing their usefulness. We investigate a method to alleviate this problem by utilizing dependency forests. Forests contain more than one possible decision and therefore have higher recall but more noise compared with 1-best outputs. A graph neural network is used to represent the forests, automatically distinguishing the useful syntactic information from parsing noise. Results on two benchmarks show that our method outperforms the standard tree-based methods, giving the state-of-the-art results in the literature.


5. Medical knowledge graphs

Learning a Health Knowledge Graph from Electronic Medical Records


Paper address

https://www.nature.com/articles/s41598-017-05778-z


Paper summary: published in Scientific Reports (2017). A disease-symptom knowledge graph built from more than 270,000 electronic medical records.


Paper abstract: Demand for clinical decision support systems in medicine and self-diagnostic symptom checkers has substantially increased in recent years. Existing platforms rely on knowledge bases manually compiled through a labor-intensive process or automatically derived using simple pairwise statistics. This study explored an automated process to learn high quality knowledge bases linking diseases and symptoms directly from electronic medical records. Medical concepts were extracted from 273,174 de-identified patient records and maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier and a Bayesian network using noisy OR gates. A graph of disease-symptom relationships was elicited from the learned parameters and the constructed knowledge graphs were evaluated and validated, with permission, against Google's manually-constructed knowledge graph and against expert physician opinions. Our study shows that direct and automated construction of high quality health knowledge graphs from medical records using rudimentary concept extraction is feasible. The noisy OR model produces a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation. Noisy OR significantly outperforms all tested models across evaluation frameworks (p < 0.01).
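The noisy-OR gate behind the best-performing model is simple to state: each present disease independently causes the symptom with some probability, plus a leak term for unmodeled causes. A toy illustration follows; all probabilities are invented for the example.

```python
# Noisy-OR gate: P(symptom) = 1 - (1 - leak) * prod(1 - f_i), where the
# f_i are the activation probabilities of the diseases that are present.
def noisy_or(leak, activations):
    p_off = 1.0 - leak
    for f in activations:
        p_off *= 1.0 - f
    return 1.0 - p_off

# Two present diseases causing the symptom with probability 0.7 and 0.2,
# plus a 0.05 leak (the symptom appearing with no modeled cause):
print(noisy_or(0.05, [0.7, 0.2]))  # 1 - 0.95*0.3*0.8 = 0.772
```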


6. Diagnostic decision support

Evaluation and accurate diagnoses of pediatric diseases using artificial intelligence


Paper address

https://www.nature.com/articles/s41591-018-0335-9


Paper summary: a joint effort of Guangzhou Women and Children's Medical Center, Yitu Healthcare, and other companies and research institutes. Using machine-learning-based natural language processing (NLP), the system achieves diagnostic performance on par with human physicians and supports multiple application scenarios. It is reported to be the first study published in a top medical journal on NLP-based clinical diagnosis from electronic health records (EHRs), and a landmark result for AI-based diagnosis of pediatric diseases.


Paper abstract: Artificial intelligence (AI)-based methods have emerged as powerful tools to transform medical care. Although machine learning classifiers (MLCs) have already demonstrated strong performance in image-based diagnoses, analysis of diverse and massive electronic health record (EHR) data remains challenging. Here, we show that MLCs can query EHRs in a manner similar to the hypothetico-deductive reasoning used by physicians and unearth associations that previous statistical methods have not found. Our model applies an automated natural language processing system using deep learning techniques to extract clinically relevant information from EHRs. In total, 101.6 million data points from 1,362,559 pediatric patient visits presenting to a major referral center were analyzed to train and validate the framework. Our model demonstrates high diagnostic accuracy across multiple organ systems and is comparable to experienced pediatricians in diagnosing common childhood diseases. Our study provides a proof of concept for implementing an AI-based system as a means to aid physicians in tackling large amounts of data, augmenting diagnostic evaluations, and to provide clinical decision support in cases of diagnostic uncertainty or complexity. Although this impact may be most evident in areas where healthcare providers are in relative shortage, the benefits of such an AI system are likely to be universal.


Chinese Medical-Domain Corpora

Medical textbooks + training and examination materials (57 GB in total)

Corpus notes: compiled from a Douban post (link not preserved here) and consolidated into a single folder for easier storage; the video portion was removed.


Baidu Netdisk download: https://pan.baidu.com/s/1P2WHX7hNTqErZ3j1vhkr_Q

Extraction code: xd0c


HIT's Da Cilin (《大词林》): 750,000 core entity terms with related concept and relation lists (including Chinese medicine, hospital, and biology categories)

Corpus notes: Harbin Institute of Technology has open-sourced 750,000 core entity terms from Da Cilin (《大词林》), together with their fine-grained concept terms (18,000 concept terms; 3 million entity-concept tuples) and related relation triples (3 million in total). The 750,000 core entities cover common terms such as person names, place names, and object names, while the concept list carries fine-grained entity-concept information. With its fine-grained hypernym hierarchy and rich inter-entity relations, the released data can support applications such as human-machine dialogue and intelligent recommendation.


Official corpus download address

http://101.200.120.155/browser/


Baidu Netdisk download: https://pan.baidu.com/s/1NG8xybrEGTVYPepMM12xNw

Extraction code: mwmj


Open-Source Toolkits

Word segmentation tools

PKUSEG


Project address

https://github.com/lancopku/pkuseg-python


Project notes: a multi-domain Chinese word segmentation toolkit from Peking University, with a selectable medical-domain model.
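A minimal usage sketch, following the project's documented interface for selecting the medicine-domain model (the sample sentence is made up):

```python
import pkuseg  # pip install pkuseg

# model_name="medicine" selects the medical-domain model, which is
# downloaded on first use per the project README.
seg = pkuseg.pkuseg(model_name="medicine")
print(seg.cut("患者三天前出现发热伴咳嗽咳痰"))
```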


Industrial-Grade Product Solutions


Lingyi Zhihui (灵医智慧)

https://01.baidu.com/index.html


Zuoshou Yisheng (左手医生)

https://open.zuoshouyisheng.com/


Related Links

awesome_Chinese_medical_NLP

https://github.com/GanjinZero/awesome_Chinese_medical_NLP


Chinese NLP dataset search

https://www.cluebenchmarks.com/dataSet_search.html


GitHub address

https://github.com/lrs1353281004/Chinese_medical_NLP



