
          RMB 100,000 total prize pool | Baseline officially released for the "Future Cup AI Challenge" (未來杯 AI 挑戰賽)


          2021-07-27 10:01


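           Can an algorithm tell whether a paper tucked away in an obscure corner is a better read for you than one that is flooding social media feeds?


          Closely tied to this question, this year's Future Cup AI Challenge asks participants to predict which scholars a given paper will appeal to. The competition, co-organized by Zhipu AI (智譜 AI) and AI TIME, has officially started. The competition page can be reached via "Read the original" (閱讀原文).

           

          The competition is still in progress, with a total prize pool of RMB 100,000, and uses a dataset provided by AMiner. The task, the dataset and solution ideas are explained in the 數據實戰派 article 《10萬元獎金:2021未來杯人工智能科技探索大賽賽題詳解》 ("RMB 100,000 in prizes: the 2021 Future Cup AI technology exploration competition task explained"). Building on that article, this post introduces the baseline for the competition.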


          Baseline address:

          https://fs.163.com/fs/display/?file=uC5xvCqblbWSGR0PpITsZx8lDrLyAiNgbAEpnVKGwqRlugVbRQA-vZO5H4uvsTi23KynHa_UhDxa_aE1_KSbGQ


          I. Data processing


          1. train/valid/test split:


          Split the given train data into train/valid/test sets for offline training and testing.


          Splitting rule: the samples are paper-expert pairs, which we use to learn a paper embedding and an expert embedding. As the competition requires, these embeddings then drive a top-K recall that returns 500 relevant experts for each paper. For each paper, we therefore take all the experts that interacted with it, randomly pick one as test and another as valid, and keep the rest as train. If a paper has fewer than 3 interacting experts, the paper and all of its experts go into the train data.
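
          The splitting rule can be sketched as follows. This is a minimal illustration rather than the baseline's own code: the function name split_paper_experts, the input format (a dict mapping each paper id to the list of experts that interacted with it) and the fixed random seed are all assumptions made for the example.

```python
import random


def split_paper_experts(paper_to_experts, seed=0):
    """Split {paper_id: [expert_id, ...]} into train/valid/test pair lists.

    Hypothetical helper: the name, input format and seed are assumptions.
    """
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for pid, experts in paper_to_experts.items():
        experts = list(experts)
        # Papers with fewer than 3 experts go entirely into train.
        if len(experts) < 3:
            train.extend((pid, eid) for eid in experts)
            continue
        # Shuffle, then hold out one expert for test and one for valid.
        rng.shuffle(experts)
        test.append((pid, experts[0]))
        valid.append((pid, experts[1]))
        train.extend((pid, eid) for eid in experts[2:])
    return train, valid, test


if __name__ == "__main__":
    toy = {"p1": ["e1", "e2", "e3", "e4"], "p2": ["e5", "e6"]}
    for name, pairs in zip(("train", "valid", "test"), split_paper_experts(toy)):
        print(name, pairs)
```

          Every pair lands in exactly one of the three lists, so the held-out experts of a paper never leak into its training pairs.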


          2. Data preprocessing


          As described in the split above, our training samples are paper-expert pairs. Each side of a pair needs preprocessing, described in turn below.


          Paper: in the given dataset each paper has several attributes, including id, title, abstract, keywords and so on (in Chinese and English versions). We use title, abstract and keywords (Chinese version) as the paper features. Handling these features calls for NLP techniques, and we feed them to oag-bert. Note that oag-bert takes a sequence as input, whereas keywords is a list, so we turn it into a sequence by concatenating the keywords in order, separated by spaces. title and abstract need no further processing. Our code for processing the paper data is given below, for reference only.

              def get_papers(self, data_dict):
                  paper_infos = {}
                  for iid in data_dict:
                      paper_id, title, abstract, keywords, year = (
                          iid["id"], iid["title"], iid["abstract"], iid["keywords"], iid["year"])
                      # process keywords: join the list into one space-separated string
                      str_keywords = " ".join(keywords) if keywords else ""
                      # check data: drop papers with no title, abstract or keywords
                      if title == "" and abstract == "" and str_keywords == "":
                          print("unexisting title and abstract and keywords paper data....")
                      else:
                          infos = {"title": title, "abstract": abstract,
                                   "keywords": str_keywords, "year": year}
                          if paper_id in paper_infos:
                              print("repeat paper id")
                          else:
                              paper_infos[paper_id] = infos
                  return paper_infos


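          Expert: in the given dataset each expert has several attributes, including id, interests, tags, pub_info and so on. We use interests as the expert's research interests; if interests is missing we fall back to tags, and if tags is missing too, interests is left empty. We also use pub_info as a feature: it contains the papers the expert has published, from which we again take title, abstract and keywords. The preprocessing code for the expert data: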


              def get_experts(self, data_dict):
                  expert_infos = {}
                  for expert in data_dict:
                      expert_id = expert["id"]
                      pub_info = expert["pub_info"]
                      # interests, falling back to tags, falling back to empty
                      if expert.get("interests", None) is not None:
                          interests = expert["interests"]
                      elif expert.get("tags", None) is not None:
                          interests = expert["tags"]
                      else:
                          interests = []
                      # process interests: join the 't' field of every entry with spaces
                      str_interests = " ".join(interest["t"] for interest in interests)
                      # process pub_info
                      pub_infos = {}
                      for iid in pub_info:
                          pid = iid["id"]
                          title = iid["title"] if iid.get("title", None) is not None else ""
                          abstract = iid["abstract"] if iid.get("abstract", None) is not None else ""
                          keywords = iid["keywords"] if iid.get("keywords", None) is not None else []
                          infos = {"title": title, "abstract": abstract, "keywords": keywords}
                          if pid in pub_infos:
                              continue
                          if title == "" and abstract == "" and keywords == []:
                              print("pub_info", iid)
                          else:
                              pub_infos[pid] = infos
                      # check expert data: drop experts with neither interests nor pub_info
                      if str_interests == "" and pub_infos == {}:
                          print("unexisting_id", expert_id)
                      elif expert_id not in expert_infos:
                          expert_infos[expert_id] = {"interests": str_interests, "pub_info": pub_infos}
                  return expert_infos



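          II. Model selection


          We use oag-bert as the model for learning embeddings.


          oag-bert takes each paper's title, keywords and abstract as input and produces an embedding for that paper.


          For this we first need the token representation of each paper's title, keywords, abstract and so on. The details are covered in the batch data section.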


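          III. Batch data


          We need batch data to finetune oag-bert. Note that the given dataset contains only positive paper-expert interactions and no negative samples, so for every positive paper-expert pair we draw negs_num negatives (the neg_num argument in the code): all experts that never interacted with the paper form the candidate set, from which negs_num are sampled at random. This yields batch data with both positive and negative samples. The code follows. Note that each training sample has the form [one paper, one positive expert, negs_num negative experts]: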


              import random

              def generate_batch_data(self, paper_infos, experts_infos, batch, neg_num):
                  batch_infos_anchor = []
                  batch_infos_pos = []
                  batch_infos_neg = []
                  # generate anchor, pos, neg
                  for pid, eid in batch:
                      # anchor: the paper; pos: the expert that interacted with it
                      batch_infos_anchor.append({pid: paper_infos[pid]})
                      batch_infos_pos.append({pid: experts_infos[eid]})
                      # negs: random.sample K experts that never interacted with this paper
                      candidates = set(experts_infos.keys()) - set(self.train_data_dict[pid])
                      negs_id = random.sample(list(candidates), neg_num)
                      used = set(negs_id)
                      negs_infos = {}
                      for neg in negs_id:
                          # unexisting interests and pub_info (the information for this
                          # expert is useless): sample another neg
                          while (experts_infos[neg]["interests"] == ""
                                 and experts_infos[neg]["pub_info"] == {}):
                              print("experts_empty", neg)
                              neg = random.sample(list(candidates - used), 1)[0]
                              used.add(neg)
                              print("neg_another", neg)
                          negs_infos[neg] = experts_infos[neg]
                      batch_infos_neg.append(negs_infos)
                  return batch_infos_anchor, batch_infos_pos, batch_infos_neg


          The batch data obtained above cannot be fed to oag-bert directly: the sequences first have to be tokenized, that is, converted to tokens. The code for obtaining the tokens:


              def get_batch_tokens(self, infos, flag):
                  batch_tokens = []
                  for info in infos:
                      tokens_dict = {}
                      for p_id, p_info in info.items():
                          # get tokens
                          tokens = self.build_bert_inputs(p_info, flag)
                          if p_id in tokens_dict:
                              print("repeat p_id")
                          else:
                              tokens_dict[p_id] = tokens
                      batch_tokens.append(tokens_dict)
                  return batch_tokens

              def build_bert_inputs(self, p_info, flag):
                  if flag == "anchor":
                      # title & abstract & keywords
                      abstract = p_info["abstract"] if p_info.get("abstract", None) is not None else ""
                      title = p_info["title"] if p_info.get("title", None) is not None else " "
                      keywords = p_info["keywords"] if p_info.get("keywords", None) is not None else ""
                      return self.oagbert.build_inputs(title=title, abstract=abstract, concepts=keywords)
                  elif flag == "pos" or flag == "neg":
                      # experts
                      interests = p_info["interests"] if p_info.get("interests", None) is not None else ""
                      e_tokens = self.oagbert.build_inputs(title=interests, abstract="")
                      # process experts' pub infos
                      expert_pub_infos = p_info["pub_info"]
                      pub_tokens = {}
                      for pid, expert_pub_info in expert_pub_infos.items():
                          title = expert_pub_info["title"]
                          abstract = expert_pub_info["abstract"]
                          # join the keyword list into one space-separated string
                          keywords = " ".join(expert_pub_info["keywords"])
                          p_tokens = self.oagbert.build_inputs(title=title, abstract=abstract, concepts=keywords)
                          if pid in pub_tokens:
                              print("repeat pid")
                          else:
                              pub_tokens[pid] = p_tokens
                      tokens = {"interests": e_tokens, "pub_info": pub_tokens}
                      return tokens
                  else:
                      raise Exception("undefine flag")



          IV. Obtaining the paper embedding


          Once we have a paper's tokens, we can feed them to oag-bert to obtain the paper embedding. A brief walkthrough:


          Calling get_batch_tokens() returns the paper's tokens, which consist of input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions and num_spans. oag-bert takes input_ids, input_masks, token_type_ids, position_ids and position_ids_second as input and outputs the paper embedding pooled_output:


              (input_ids, input_masks, token_type_ids, masked_lm_labels,
               position_ids, position_ids_second, masked_positions, num_spans) = token
              # the pooled output is unpacked the same way as in the expert snippets of this article
              _, pooled_output = model.bert.forward(
                  input_ids=torch.LongTensor(input_ids).unsqueeze(0).cuda(),
                  token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0).cuda(),
                  attention_mask=torch.LongTensor(input_masks).unsqueeze(0).cuda(),
                  output_all_encoded_layers=False,
                  checkpoint_activations=False,
                  position_ids=torch.LongTensor(position_ids).unsqueeze(0).cuda(),
                  position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0).cuda())



          V. Obtaining the expert embedding


          Once we have an expert's tokens, we can feed them to oag-bert to obtain the expert embedding. Note that the expert embedding is made up of two parts:


          1. the embedding of the expert's interests;


          2. the embedding of the expert's pub_info.


          Let us look at the interests embedding first.


          Calling get_batch_tokens() returns the tokens of the expert's interests, which again consist of input_ids, input_masks, token_type_ids, masked_lm_labels, position_ids, position_ids_second, masked_positions and num_spans. oag-bert takes input_ids, input_masks, token_type_ids, position_ids and position_ids_second as input and outputs the embedding of the expert's interests, pooled_output_expert_interests. Note that if the expert's interests are empty, the interests embedding is None:

              # interests process
              interests = token["interests"]
              (input_ids, input_masks, token_type_ids, masked_lm_labels,
               position_ids, position_ids_second, masked_positions, num_spans) = interests
              if input_ids != []:
                  _, pooled_output_expert_interests = model.bert.forward(
                      input_ids=torch.LongTensor(input_ids).unsqueeze(0).cuda(),
                      token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0).cuda(),
                      attention_mask=torch.LongTensor(input_masks).unsqueeze(0).cuda(),
                      output_all_encoded_layers=False,
                      checkpoint_activations=False,
                      position_ids=torch.LongTensor(position_ids).unsqueeze(0).cuda(),
                      position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0).cuda())
              else:
                  # empty interests: no interests embedding
                  pooled_output_expert_interests = None


          Next comes the pub_info embedding. In essence, it embeds every paper the expert has published, in the same way the paper embedding is obtained:


              # pub_info of experts process
              pub_info = token["pub_info"]
              pub_info_embed = []
              for pid, p_token in pub_info.items():
                  (input_ids, input_masks, token_type_ids, masked_lm_labels,
                   position_ids, position_ids_second, masked_positions, num_spans) = p_token
                  if input_ids != []:
                      _, pooled_output_expert_pub = model.bert.forward(
                          input_ids=torch.LongTensor(input_ids).unsqueeze(0).cuda(),
                          token_type_ids=torch.LongTensor(token_type_ids).unsqueeze(0).cuda(),
                          attention_mask=torch.LongTensor(input_masks).unsqueeze(0).cuda(),
                          output_all_encoded_layers=False,
                          checkpoint_activations=False,
                          position_ids=torch.LongTensor(position_ids).unsqueeze(0).cuda(),
                          position_ids_second=torch.LongTensor(position_ids_second).unsqueeze(0).cuda())
                      pub_info_embed.append(pooled_output_expert_pub)


          Note that experts have published different numbers of papers, and we need to use all the papers an expert has published.


          Finally, we combine the interests embedding and the pub_info embeddings. Here we simply use torch.mean() to merge them into the final expert embedding.

              # merge interests_embed and pub_info_embed
              # here, we use torch.mean() for merge
              # check
              if pooled_output_expert_interests is None:
                  if len(pub_info_embed) != 0:
                      pooled_output_expert_cat = torch.cat(pub_info_embed)
                  else:
                      # experts with neither interests nor pub_info are dropped in get_experts
                      pooled_output_expert_cat = None
              else:
                  if len(pub_info_embed) != 0:
                      pooled_output_expert_cat = torch.cat(
                          (pooled_output_expert_interests, torch.cat(pub_info_embed)), 0)
                  else:
                      pooled_output_expert_cat = pooled_output_expert_interests
              # average over all rows (interests + every publication) to get one vector
              pooled_output_expert_final = torch.mean(pooled_output_expert_cat, 0).view(1, config["output_dim"])


          Participants can improve this merge step to obtain a better expert embedding.
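
          One possible direction, shown purely as an illustration and not as part of the baseline, is to weight the interests vector separately from the publications so that an expert with many papers does not drown out the interests signal. The sketch below works on plain Python lists to stay framework-free; the function name merge_expert_embedding and the interests_weight parameter are hypothetical.

```python
def merge_expert_embedding(interests_emb, pub_embs, interests_weight=None):
    """Merge an optional interests vector with a list of publication vectors.

    Hypothetical helper. With interests_weight=None every vector counts
    equally (a plain mean, like the baseline's torch.mean merge). Otherwise
    the interests vector gets interests_weight and the mean of the
    publication vectors gets 1 - interests_weight.
    """
    def mean(vectors):
        # Column-wise average of equally long vectors.
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    if interests_emb is None and not pub_embs:
        return None  # nothing to merge
    if interests_emb is None:
        return mean(pub_embs)
    if not pub_embs:
        return list(interests_emb)
    if interests_weight is None:
        return mean([list(interests_emb)] + [list(p) for p in pub_embs])
    pub_mean = mean(pub_embs)
    w = interests_weight
    return [w * a + (1 - w) * b for a, b in zip(interests_emb, pub_mean)]


if __name__ == "__main__":
    interests = [1.0, 1.0]
    pubs = [[3.0, 5.0], [5.0, 3.0]]
    print(merge_expert_embedding(interests, pubs))
    print(merge_expert_embedding(interests, pubs, interests_weight=0.5))
```

          Whether a weighted merge actually helps is an empirical question that would have to be checked on the offline valid set.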


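          VI. Loss and training (finetune)


          With the embeddings in hand, we define the loss used for training. We use the InfoNCE loss, which treats the positive expert as the correct class among the positive and the K sampled negatives.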
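
          Read off the implementation in this section, the loss for a single paper can be written as below. The symbols are notation introduced here for illustration: q is the paper embedding, k^+ the positive expert embedding and k^-_i the i-th of K negative expert embeddings, all L2-normalized, and \tau is the temperature (self.T = 0.07 in the code). The batch loss is the mean of this quantity over the batch.

```latex
\mathcal{L}(q) = -\log
  \frac{\exp\left(q \cdot k^{+} / \tau\right)}
       {\exp\left(q \cdot k^{+} / \tau\right)
        + \sum_{i=1}^{K} \exp\left(q \cdot k^{-}_{i} / \tau\right)}
```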

          For a deeper understanding of the InfoNCE loss, participants can study it on their own; we do not go into it here.

              import torch
              import torch.nn as nn
              import torch.nn.functional as F

              # infoNCE loss
              class infoNCE(nn.Module):
                  def __init__(self):
                      super(infoNCE, self).__init__()
                      self.T = 0.07  # temperature
                      self.cross_entropy = nn.CrossEntropyLoss().cuda()

                  def forward(self, query, pos, neg):
                      # batch_size, K (negatives per sample) and dim are set elsewhere in the baseline
                      query = F.normalize(query.view(batch_size, 1, dim), p=2, dim=2)
                      pos = F.normalize(pos.view(batch_size, 1, dim), p=2, dim=2)
                      neg = F.normalize(neg.view(batch_size, K, dim), p=2, dim=2)
                      pos_score = torch.bmm(query, pos.transpose(1, 2))  # B*1*1
                      neg_score = torch.bmm(query, neg.transpose(1, 2))  # B*1*K
                      # logits: B*(K+1); squeeze only dim 1 so a batch of size 1 keeps its shape
                      logits = torch.cat([pos_score, neg_score], dim=2).squeeze(1)
                      logits /= self.T
                      # the positive always sits at index 0
                      labels = torch.zeros(logits.shape[0], dtype=torch.long).cuda()
                      info_loss = self.cross_entropy(logits, labels)
                      return info_loss
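
          To get a feel for the numbers, the same loss can be computed for a single sample without PyTorch. This is a minimal sketch that assumes the cosine similarities are already available; the function name info_nce_single is hypothetical, and the default temperature mirrors self.T above.

```python
import math


def info_nce_single(pos_score, neg_scores, temperature=0.07):
    """InfoNCE loss for one anchor, given cosine similarities (hypothetical helper)."""
    # The positive sits at index 0, as in the labels of the class above.
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # log-sum-exp with the max subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    # cross-entropy with the positive as the target class
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    print(info_nce_single(0.9, [0.1, 0.0, -0.2]))  # positive clearly ahead: small loss
    print(info_nce_single(0.1, [0.9, 0.0, -0.2]))  # a negative ahead: large loss
```

          When the positive and all K negatives score the same, the loss equals log(K + 1), which is a handy sanity check for the first training steps.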

          Note: we add an MLP that applies a non-linear transformation to the embeddings produced by oag-bert, which makes finetuning work better.

              class MLP(nn.Module):
                  def __init__(self, in_dim):
                      super(MLP, self).__init__()
                      # two linear layers with a ReLU in between, keeping the dimension
                      self.projection = nn.Sequential(
                          nn.Linear(in_dim, in_dim),
                          nn.ReLU(),
                          nn.Linear(in_dim, in_dim))

                  def forward(self, x):
                      x = self.projection(x)
                      return x


          The code for the whole training process:

              import random
              from time import time

              # infoNCE loss
              criterion = infoNCE().cuda()
              optimizer = torch.optim.Adam(
                  [{"params": model.parameters()}, {"params": projection.parameters()}],
                  lr=config["learning_rate"])
              model.train()
              projection.train()
              best_mrr = -1
              patience = 0
              # finetuning
              for epoch in range(epochs):
                  random.shuffle(train_batches)
                  batch_loss = []
                  batch_num = 0
                  for batch in train_batches:
                      batch_num += 1
                      # get anchor pos neg
                      anchor, pos, neg = data_loader.generate_batch_data(
                          paper_infos, experts_infos, batch, config["neg_num"])
                      anchor_tokens = data_loader.get_batch_tokens(anchor, "anchor")
                      pos_tokens = data_loader.get_batch_tokens(pos, "pos")
                      neg_tokens = data_loader.get_batch_tokens(neg, "neg")
                      anchor_emb = get_batch_embed(anchor_tokens, "anchor")
                      pos_emb = get_batch_embed(pos_tokens, "pos")
                      neg_emb = get_batch_embed(neg_tokens, "neg")
                      # add MLP
                      # infoNCE loss
                      loss = criterion(projection(anchor_emb), projection(pos_emb), projection(neg_emb))
                      print("loss...", loss.item())
                      # compute gradient and do Adam step
                      optimizer.zero_grad()
                      loss.backward()
                      optimizer.step()
                      if batch_num > 1 and batch_num % valid_step == 0:
                          print("evaluate")
                          t_valid = time()
                          mrr = evaluate(model, valid_batches, data_loader, paper_infos, experts_infos)
                          print("time for valid", time() - t_valid)
                          print("Epoch:{} batch:{} loss:{} mrr:{}".format(epoch, batch_num, loss.item(), mrr))
                          if mrr > best_mrr:
                              best_mrr = mrr
                              # save model
                              torch.save(model.state_dict(), output_dir + "oagbert")
                              print("Best Epoch:{} batch:{} loss:{} mrr:{}".format(epoch, batch_num, loss.item(), mrr))
                          else:
                              patience += 1
                              if patience > config["patience"]:
                                  print("Best Epoch:{} batch:{} loss:{} mrr:{}".format(epoch, batch_num, loss.item(), mrr))
                          # evaluate() switches the model to eval mode, so switch back
                          model.train()
                          projection.train()

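          VII. Test/valid


          Finally, we validate and test on the valid/test sets: valid is used to tune the model's hyperparameters, and test measures how well the model generalizes. Both are offline evaluations. At the end, we save the best-performing model parameters and then run the online valid and test.


          We use MRR as the evaluation metric here; participants can of course use the metric specified by the competition.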
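
          For reference, MRR (mean reciprocal rank) averages 1/rank of the positive over all test pairs. The sketch below is a framework-free illustration; the function name mean_reciprocal_rank and the row layout (positive score first, negatives after) are assumptions, and ties are broken in favour of the positive, which may differ from an argsort-based ranking.

```python
def mean_reciprocal_rank(score_rows):
    """MRR over rows that hold the positive's score first, then the negatives'.

    Hypothetical helper for illustration only.
    """
    total = 0.0
    for scores in score_rows:
        pos = scores[0]
        # rank = 1 + number of negatives scoring strictly higher than the positive
        rank = 1 + sum(1 for s in scores[1:] if s > pos)
        total += 1.0 / rank
    return total / len(score_rows)


if __name__ == "__main__":
    rows = [
        [0.9, 0.1, 0.2],  # positive ranked 1st
        [0.3, 0.8, 0.1],  # positive ranked 2nd
        [0.2, 0.5, 0.6],  # positive ranked 3rd
    ]
    print(mean_reciprocal_rank(rows))
```

          In this toy input the positives are ranked 1st, 2nd and 3rd, so the result is the average of 1, 1/2 and 1/3.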


          Note: at test time, running a global recall for every paper takes too long, so we randomly sample 100 negatives and rank them together with the positive from test/valid.


          To cut test time further, we use two test modes:


          1. Each batch shares 100 negs.


          2. The whole test shares 100 negs.

              import numpy as np

              def evaluate(model, valid_batches, data_loader, paper_infos, experts_infos):
                  model.eval()
                  mrr = 0.0
                  total_count = 0
                  with torch.no_grad():
                      # negs are embedded once here and reused by every batch below
                      negs = data_loader.generate_negs_data(paper_infos, experts_infos, config["Negs"])
                      neg_tokens = data_loader.get_batch_tokens(negs, "neg")
                      neg_emb_candidates = get_batch_embed(neg_tokens, "neg")
                      # share batch negs
                      for batch in valid_batches:
                          anchor, pos, _ = data_loader.generate_batch_data_test(
                              paper_infos, experts_infos, batch, config["Negs"])
                          # use too much time
                          anchor_tokens = data_loader.get_batch_tokens(anchor, "anchor")
                          pos_tokens = data_loader.get_batch_tokens(pos, "pos")
                          # use too much time
                          anchor_emb = get_batch_embed(anchor_tokens, "anchor")
                          pos_emb = get_batch_embed(pos_tokens, "pos")
                          neg_emb = neg_emb_candidates.repeat(len(batch), 1)

                          # batch share negs (alternative, commented out: negs sampled per batch)
                          # anchor, pos, negs = data_loader.generate_batch_data_test(
                          #     paper_infos, experts_infos, batch, config["Negs"])
                          # anchor_tokens = data_loader.get_batch_tokens(anchor, "anchor")
                          # pos_tokens = data_loader.get_batch_tokens(pos, "pos")
                          # neg_tokens = data_loader.get_batch_tokens(negs, "neg")
                          # anchor_emb = get_batch_embed(anchor_tokens, "anchor")
                          # pos_emb = get_batch_embed(pos_tokens, "pos")
                          # neg_emb_candidates = get_batch_embed(neg_tokens, "neg")
                          # neg_emb = neg_emb_candidates.repeat(len(batch), 1)

                          # anchor & pos_embed
                          anchor_emb = F.normalize(anchor_emb.view(-1, 1, dim), p=2, dim=2)
                          pos_emb = F.normalize(pos_emb.view(-1, 1, dim), p=2, dim=2)
                          neg_emb = F.normalize(neg_emb.view(-1, config["Negs"], dim), p=2, dim=2)
                          pos_score = torch.bmm(anchor_emb, pos_emb.transpose(1, 2))  # B*1*1
                          neg_score = torch.bmm(anchor_emb, neg_emb.transpose(1, 2))  # B*1*Negs
                          # logits: B*(1+Negs), with the positive at index 0
                          logits = torch.cat([pos_score, neg_score], dim=2).squeeze(1)
                          logits = logits.cpu().numpy()
                          # iterate over the actual batch size so a short last batch is handled
                          for i in range(logits.shape[0]):
                              total_count += 1
                              logits_single = logits[i]
                              rank = np.argsort(-logits_single)
                              # position of the positive (index 0) in the descending ranking
                              true_index = np.where(rank == 0)[0][0]
                              mrr += np.divide(1.0, true_index + 1)
                  mrr /= total_count
                  return mrr


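          We recommend joining the 2021 Future Cup AI competition discussion group (【2021未來杯人工智能賽交流群】), where you can get answers to competition questions, find teammates and talk with other participants. (If the group has reached 200 members, add the assistant on WeChat, ID hhming98, who will bring you into the group.)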


