Efficiently Extracting Text from HTML in Python

The get_text method from the BeautifulSoup package (which uses lxml under the hood) is a well-tested solution, but it can be very slow when you have to process many thousands of HTML documents. Replace BeautifulSoup with selectolax and you get a 5-30x speedup almost for free! Here is a simple benchmark that parses 10,000 HTML pages from Common Crawl (https://commoncrawl.org/):
# coding: utf-8
from time import time
import warc
from bs4 import BeautifulSoup
from selectolax.parser import HTMLParser

def get_text_bs(html):
    # Strip HTML down to plain text with BeautifulSoup (lxml backend).
    tree = BeautifulSoup(html, 'lxml')
    body = tree.body
    if body is None:
        return None
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()
    text = body.get_text(separator='\n')
    return text

def get_text_selectolax(html):
    # Strip HTML down to plain text with selectolax.
    tree = HTMLParser(html)
    if tree.body is None:
        return None
    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()
    text = tree.body.text(separator='\n')
    return text

def read_doc(record, parser=get_text_selectolax):
    # Extract the URL and plain text from a single WARC record.
    url = record.url
    text = None
    if url:
        payload = record.payload.read()
        header, html = payload.split(b'\r\n\r\n', maxsplit=1)
        html = html.strip()
        if len(html) > 0:
            text = parser(html)
    return url, text

def process_warc(file_name, parser, limit=10000):
    # Parse up to `limit` records from a WARC file and report the timing.
    warc_file = warc.open(file_name, 'rb')
    t0 = time()
    n_documents = 0
    for i, record in enumerate(warc_file):
        url, doc = read_doc(record, parser)
        if not doc or not url:
            continue
        n_documents += 1
        if i > limit:
            break
    warc_file.close()
    print('Parser: %s' % parser.__name__)
    print('Parsing took %s seconds and produced %s documents\n' % (time() - t0, n_documents))
>>> ! wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
>>> file_name = "CC-MAIN-20180116070444-20180116090444-00000.warc.gz"
>>> process_warc(file_name, get_text_selectolax, 10000)
Parser: get_text_selectolax
Parsing took 16.170367002487183 seconds and produced 3317 documents
>>> process_warc(file_name, get_text_bs, 10000)
Parser: get_text_bs
Parsing took 432.6902508735657 seconds and produced 3283 documents
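To compare the two parsers on a single page rather than a whole WARC file, a quick check with timeit could look like the sketch below; the variable html (one page's markup) is an assumption and is not defined here.
# quick micro-benchmark sketch using the two functions defined above;
# `html` (one page's markup) is assumed to be loaded elsewhere
import timeit

for func in (get_text_selectolax, get_text_bs):
    seconds = timeit.timeit(lambda: func(html), number=100)
    print('%s: %.4f s for 100 runs' % (func.__name__, seconds))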
selectolax is sometimes up to 30 times faster than lxml, and it is particularly well suited to stripping HTML down to plain text. Suppose I have 10,000+ HTML snippets that need to be indexed into Elasticsearch as plain text. (Elasticsearch does have an html_strip character filter, but it is not what I want or need in this context.) It turns out that stripping HTML down to plain text at this scale can be surprisingly slow, so which approach is the most efficient?
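For reference, the html_strip filter mentioned above is an Elasticsearch character filter that removes tags during analysis, inside Elasticsearch itself. A rough sketch of how it would be configured is shown below; the index layout and the names "html_text" and "body" are made up for illustration, and the sketch only shows why it does not help when you want the plain text in Python before indexing.
# For illustration only: index settings using Elasticsearch's built-in
# html_strip character filter. The analyzer name "html_text" and the
# "body" field are assumptions, not from the original post.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": ["html_strip"],  # strips tags at analysis time
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "analyzer": "html_text"}
        }
    },
}
# The stripping happens inside Elasticsearch's analysis chain, so the
# plain text is never available to the Python code doing the indexing.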
PyQuery
from pyquery import PyQuery as pq
text = pq(html).text()
selectolax
from selectolax.parser import HTMLParser
text = HTMLParser(html).text()
Regex
import re
regex = re.compile(r'<.*?>')
text = regex.sub('', html)
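The SUM/MEAN/MEDIAN figures below come from timing each approach over all of the snippets. A minimal harness along these lines reproduces that output format; the snippets list and the harness itself are assumptions, not the original benchmark code.
# a sketch of a timing harness that prints results in the SUM/MEAN/MEDIAN
# format used below; `snippets` (the 10,000 HTML fragments) is assumed to
# be loaded elsewhere
import statistics
from time import perf_counter

def benchmark(strip_func, snippets):
    timings_ms = []
    for snippet in snippets:
        t0 = perf_counter()
        strip_func(snippet)
        timings_ms.append((perf_counter() - t0) * 1000)
    print('  SUM:    %.2f seconds' % (sum(timings_ms) / 1000))
    print('  MEAN:   %.4f ms' % statistics.mean(timings_ms))
    print('  MEDIAN: %.4f ms' % statistics.median(timings_ms))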
The documents are not complete HTML documents (with <html>, <head> and so on); they are only small fragments of HTML. The average size is 10,314 bytes (the median is 5,138 bytes). Here are the results:
pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms
selectolax is about 7 times faster than PyQuery here. The regex approach is crude, but for plain and simple HTML blobs it might work fine. In practice, though, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain-text conversion to be Foo & Bar, not Foo &amp; Bar.
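If entity decoding is the only thing missing, the regex approach could be combined with html.unescape from the standard library. A minimal sketch, not part of the original benchmark:
# a minimal sketch: strip tags with the regex, then decode HTML entities
# with the standard library (not part of the original benchmark)
import html
import re

tag_regex = re.compile(r'<.*?>')

def regex_strip(raw_html):
    return html.unescape(tag_regex.sub('', raw_html))

print(regex_strip('<p>Foo &amp; Bar</p>'))  # -> Foo & Bar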
Now let's make the test a bit harder. The snippets also contain markup like this:
<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>
The task is to strip out the warning elements, anything marked as hidden, and anything styled with display: none. So let's implement that:
PyQuery
import re
from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
# Drop the warning/hidden elements, then any div styled display: none.
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()
selectolax
import re
from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
# Drop the warning/hidden elements, then any div styled display: none.
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()
This actually works. When I now run the same benchmark over the 10,000 snippets, the new results look like this:
pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip
Again, selectolax beats PyQuery, this time by roughly 6x.

Conclusion

Regular expressions are fast but not very capable. selectolax's efficiency is impressive.
