Build a Full-Text Search Engine in Pure Python in 10 Minutes
A member of one of my chat groups asked how to quickly set up a search engine. After searching around, I found searx.

Where the code lives
Git: https://github.com/asciimoo/searx
Conveniently, a ready-made Docker image is already available, so you can pull it and use it almost straight away. Run the following commands:
cid=$(sudo docker ps -a | grep searx | awk '{print $1}')
echo searx cid is $cid
if [ "$cid" != "" ];then
    sudo docker stop $cid
    sudo docker rm $cid
fi
sudo docker run -d --name searx -e IMAGE_PROXY=True -e BASE_URL=http://yourdomain.com -p 7777:8888 wonderfall/searx
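After that it is ready to use. Check the container's status with docker as usual; once it is running, the search engine is available on host port 7777, the port published by the docker run command.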
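To go beyond the browser, the running instance can also be queried from Python. The following is a minimal sketch, not part of the original setup: it assumes the container is reachable at http://localhost:7777 and that the instance allows the `format=json` search parameter (some searx configurations disable it). The names `BASE_URL`, `build_search_url`, and `search` are made up for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: host port 7777 is the one published by the docker run command.
BASE_URL = "http://localhost:7777"


def build_search_url(query, pageno=1, base_url=BASE_URL):
    """Build a search URL that asks the instance for JSON output."""
    params = urlencode({"q": query, "format": "json", "pageno": pageno})
    return f"{base_url}/search?{params}"


def search(query, pageno=1, base_url=BASE_URL, timeout=10):
    """Send the query and return the decoded JSON body (needs a running instance)."""
    url = build_search_url(query, pageno, base_url)
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `search("python")` would then return the parsed JSON, provided the container is up and the JSON format is enabled on that instance.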
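Thoughts
Pretty convenient, isn't it? Let's first look at how the source code implements this.

Opening up the code, we find that at its core it simply aggregates the results returned by requests into one combined result set. As for the data source, it could just as well be a database or a file. Here is the core code (note that this excerpt is Python 2):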
from urllib import urlencode
from json import loads
from collections import Iterable

search_url = None
url_query = None
content_query = None
title_query = None
# whether the engine supports paging (referenced in request() below)
paging = False
suggestion_query = ''
results_query = ''

# parameters for engines with paging support
#
# number of results on each page
# (only needed if the site requires not a page number, but an offset)
page_size = 1
# number of the first page (usually 0 or 1)
first_page_num = 1

def iterate(iterable):
    if type(iterable) == dict:
        it = iterable.iteritems()
    else:
        it = enumerate(iterable)
    for index, value in it:
        yield str(index), value


def is_iterable(obj):
    if type(obj) == str:
        return False
    if type(obj) == unicode:
        return False
    return isinstance(obj, Iterable)


def parse(query):
    q = []
    for part in query.split('/'):
        if part == '':
            continue
        else:
            q.append(part)
    return q


def do_query(data, q):
    ret = []
    if not q:
        return ret
    qkey = q[0]
    for key, value in iterate(data):
        if len(q) == 1:
            if key == qkey:
                ret.append(value)
            elif is_iterable(value):
                ret.extend(do_query(value, q))
        else:
            if not is_iterable(value):
                continue
            if key == qkey:
                ret.extend(do_query(value, q[1:]))
            else:
                ret.extend(do_query(value, q))
    return ret


def query(data, query_string):
    q = parse(query_string)
    return do_query(data, q)

def request(query, params):
    query = urlencode({'q': query})[2:]
    fp = {'query': query}
    if paging and search_url.find('{pageno}') >= 0:
        fp['pageno'] = (params['pageno'] - 1) * page_size + first_page_num
    params['url'] = search_url.format(**fp)
    params['query'] = query
    return params


def response(resp):
    results = []
    json = loads(resp.text)
    if results_query:
        for result in query(json, results_query)[0]:
            url = query(result, url_query)[0]
            title = query(result, title_query)[0]
            content = query(result, content_query)[0]
            results.append({'url': url, 'title': title, 'content': content})
    else:
        for url, title, content in zip(
            query(json, url_query),
            query(json, title_query),
            query(json, content_query)
        ):
            results.append({'url': url, 'title': title, 'content': content})
    if not suggestion_query:
        return results
    for suggestion in query(json, suggestion_query):
        results.append({'suggestion': suggestion})
    return results
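To see what the slash-separated query strings (`url_query`, `title_query`, and so on) actually select, here is a small Python 3 port of the `query` helper run against a sample JSON document. This is a sketch for experimentation rather than the project's current code; the `sample` data and its field names (`hits`, `link`, `name`, `summary`) are invented for this example.

```python
from collections.abc import Iterable


def iterate(iterable):
    # Dicts yield (key, value); other iterables yield (position, value).
    # Keys are converted to strings so they can be compared with path parts.
    items = iterable.items() if isinstance(iterable, dict) else enumerate(iterable)
    for index, value in items:
        yield str(index), value


def is_iterable(obj):
    # Strings are iterable in Python, but here they are leaf values.
    if isinstance(obj, (str, bytes)):
        return False
    return isinstance(obj, Iterable)


def parse(query_string):
    # 'hits/link' -> ['hits', 'link']; empty parts are dropped.
    return [part for part in query_string.split('/') if part]


def do_query(data, q):
    ret = []
    if not q:
        return ret
    qkey = q[0]
    for key, value in iterate(data):
        if len(q) == 1:
            if key == qkey:
                ret.append(value)  # last path part matched: collect the value
            elif is_iterable(value):
                ret.extend(do_query(value, q))  # keep searching deeper
        else:
            if not is_iterable(value):
                continue
            if key == qkey:
                ret.extend(do_query(value, q[1:]))  # consume one path part
            else:
                ret.extend(do_query(value, q))
    return ret


def query(data, query_string):
    return do_query(data, parse(query_string))


# Hypothetical API response used only for illustration.
sample = {
    "meta": {"total": 2},
    "hits": [
        {"link": "https://example.com/a", "name": "First", "summary": "alpha"},
        {"link": "https://example.com/b", "name": "Second", "summary": "beta"},
    ],
}

urls = query(sample, "hits/link")
titles = query(sample, "name")
print(urls)    # ['https://example.com/a', 'https://example.com/b']
print(titles)  # ['First', 'Second']
```

The path does not have to start at the root: a key is searched for at any depth, which is why the single-part query `name` still finds both titles inside the `hits` list.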
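Results
With each response we can easily customise the data that is returned (it can come from the network, from a database, or from a file). Taking this one step further: if we can hack the response result, we can return data we crawled ourselves as the search results. For something like 1024, you could build your own little "hobby" engine. I won't paste the code here; try it out and play with it yourself. Combining it with jieba word segmentation makes it even more fun.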
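As one possible starting point for that experiment, here is a minimal sketch of a response function backed by local data instead of a remote site. It builds a small inverted index over a hard-coded list of documents and returns matches in the same `{'url', 'title', 'content'}` shape that `response` produces. Everything here is hypothetical: `DOCUMENTS`, `tokenize`, `build_index`, and `local_response` are illustrative names, and the regex tokenizer only suits space-separated text. For Chinese text a segmenter such as jieba would take the place of `tokenize`.

```python
import re
from collections import defaultdict

# Hypothetical crawled documents; in practice they might be loaded from a DB or files.
DOCUMENTS = [
    {"url": "https://example.com/python", "title": "Python basics",
     "content": "Python is a programming language"},
    {"url": "https://example.com/docker", "title": "Docker intro",
     "content": "Run a search engine in a Docker container"},
    {"url": "https://example.com/search", "title": "Search engines",
     "content": "A full-text search engine builds an inverted index"},
]


def tokenize(text):
    # Lowercase word split; swap in a real segmenter for Chinese text.
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    # Inverted index: token -> set of positions in the document list.
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for token in tokenize(doc["title"] + " " + doc["content"]):
            index[token].add(doc_id)
    return index


INDEX = build_index(DOCUMENTS)


def local_response(query_text):
    """Return documents containing every query token, in result-dict form."""
    tokens = tokenize(query_text)
    if not tokens:
        return []
    matching = set.intersection(*(INDEX.get(t, set()) for t in tokens))
    return [
        {"url": DOCUMENTS[i]["url"],
         "title": DOCUMENTS[i]["title"],
         "content": DOCUMENTS[i]["content"]}
        for i in sorted(matching)
    ]


for hit in local_response("search engine"):
    print(hit["title"], hit["url"])
```

Requiring every token to match keeps the example short; ranking by term frequency or allowing partial matches would be natural next steps.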
Original article: https://brucedone.com/archives/838
