Efficiently Extracting Text from HTML in Python

The get_text method from the BeautifulSoup package (which uses lxml under the hood) is a well-tested solution, but it can be very slow when you have to process many thousands of HTML documents. Replace BeautifulSoup with selectolax and you get a 5-30x speedup almost for free! Here is a simple benchmark that parses 10,000 HTML pages from Common Crawl (https://commoncrawl.org/):
# coding: utf-8
from time import time
import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser

def get_text_bs(html):
    # Strip HTML down to plain text with BeautifulSoup (lxml backend).
    tree = BeautifulSoup(html, 'lxml')
    body = tree.body
    if body is None:
        return None
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()
    text = body.get_text(separator='\n')
    return text

def get_text_selectolax(html):
    # Strip HTML down to plain text with selectolax.
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    text = tree.body.text(separator='\n')
    return text

def read_doc(record, parser=get_text_selectolax):
    # Extract the URL and plain text from a single WARC record.
    url = record.url
    text = None
    if url:
        payload = record.payload.read()
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()
        if len(html) > 0:
            text = parser(html)
    return url, text

def process_warc(file_name, parser, limit=10000):
    # Parse up to `limit` records from a WARC file and report the timing.
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)
        if not doc or not url:
            continue
        n_documents += 1
        if i > limit:
            break
    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))
>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents
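To compare the two parsers on a single page rather than a whole WARC file, a quick check with timeit could look like the sketch below; the variable html (one page's markup) is an assumption and is not defined here.
# quick micro-benchmark sketch using the two functions defined above;
# `html` (one page's markup) is assumed to be loaded elsewhere
import timeit

for func in (get_text_selectolax, get_text_bs):
    seconds = timeit.timeit(lambda: func(html), number=100)
    print('%s: %.4f s for 100 runs' % (func.__name__, seconds))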
selectolax is sometimes up to 30 times faster than lxml, and it is particularly well suited to stripping HTML down to plain text. Suppose I have 10,000+ HTML snippets that need to be indexed into Elasticsearch as plain text. (Elasticsearch does have an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML down to plain text at this scale can be surprisingly slow, so which approach is the most efficient?
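For reference, the html_strip filter mentioned above is an Elasticsearch character filter that removes tags during analysis, inside Elasticsearch itself. A rough sketch of how it would be configured is shown below; the index layout and the names "html_text" and "body" are made up for illustration, and the sketch only shows why it does not help when you want the plain text in Python before indexing.
# For illustration only: index settings using Elasticsearch's built-in
# html_strip character filter. The analyzer name "html_text" and the
# "body" field are assumptions, not from the original post.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["html_strip"],  # strips tags at analysis time
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "html_text"}
        }
    },
}
# The stripping happens inside Elasticsearch's analysis chain, so the
# plain text is never available to the Python code doing the indexing.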
PyQuery
from pyquery import PyQuery as pq
text = pq(html).text()
selectolax
from selectolax.parser import HTMLParser
text = HTMLParser(html).text()
Regex
import re
regex = re.compile(r'<.*?>')
text = regex.sub('', html)
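The SUM/MEAN/MEDIAN figures below come from timing each approach over all of the snippets. A minimal harness along these lines reproduces that output format; the snippets list and the harness itself are assumptions, not the original benchmark code.
# a sketch of a timing harness that prints results in the SUM/MEAN/MEDIAN
# format used below; `snippets` (the 10,000 HTML fragments) is assumed to
# be loaded elsewhere
import statistics
from time import perf_counter

def benchmark(strip_func, snippets):
    timings_ms = []
    for snippet in snippets:
        t0 = perf_counter()
        strip_func(snippet)
        timings_ms.append((perf_counter() - t0) * 1000)
    print('  SUM:    %.2f seconds' % (sum(timings_ms) / 1000))
    print('  MEAN:   %.4f ms' % statistics.mean(timings_ms))
    print('  MEDIAN: %.4f ms' % statistics.median(timings_ms))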
The documents are not complete HTML documents (with <html>, <head> and so on); they are only small fragments of HTML. The average size is 10,314 bytes (the median is 5,138 bytes). Here are the results:
pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms
selectolax is about 7 times faster than PyQuery here. The regex approach is crude, but for plain and simple HTML blobs it might work fine. In practice, though, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar.
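If entity decoding is the only thing missing, the regex approach could be combined with html.unescape from the standard library. A minimal sketch, not part of the original benchmark:
# a minimal sketch: strip tags with the regex, then decode HTML entities
# with the standard library (not part of the original benchmark)
import html
import re

tag_regex = re.compile(r'<.*?>')

def regex_strip(raw_html):
    return html.unescape(tag_regex.sub('', raw_html))

print(regex_strip('<p>Foo &amp; Bar</p>'))  # -> Foo & Bar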
Now let's make the test a bit harder. The snippets also contain markup like this:
<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>
The task is to strip out the warning elements, anything marked as hidden, and anything styled with display: none. So let's implement that:
PyQuery
import re
from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
# Drop the warning/hidden elements, then any div styled display: none.
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()
selectolax
import re
from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
# Drop the warning/hidden elements, then any div styled display: none.
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()
This actually works. When I now run the same benchmark over the 10,000 snippets, the new results look like this:
pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip
Again, selectolax beats PyQuery, this time by roughly 6x.

Conclusion

Regular expressions are fast but not very capable. selectolax's efficiency is impressive.
