Spring Boot + Apache Tika: Easily Parse the Content of All Kinds of Documents
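This article demonstrates how to parse documents by adding Tika to a Spring Boot project. The steps are as follows:
Add the dependencies
Add the following dependencies to the Spring Boot project: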
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-bom</artifactId>
      <version>2.8.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
  </dependency>
</dependencies>
Create the configuration
Place a tika-config.xml file in the resources directory. Its content is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
      <params>
        <param name="markLimit" type="int">64000</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
      <params>
        <param name="markLimit" type="int">64001</param>
      </params>
    </encodingDetector>
    <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
      <params>
        <param name="markLimit" type="int">64002</param>
      </params>
    </encodingDetector>
  </encodingDetectors>
</properties>
Create the configuration class MyTikaConfig
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;
/**
 * Tika configuration class.
 */
@Configuration
public class MyTikaConfig {

    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        // Load tika-config.xml from the classpath (src/main/resources).
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        // try-with-resources closes the stream once the config has been read.
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);
            // Build the facade from the configured detector and parser.
            return new Tika(detector, autoDetectParser);
        }
    }
}
The Tika class provides detect, translate, and parse functions for documents. Once the Tika bean is injected, these functions are available anywhere in the project.
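To make the facade methods mentioned above more concrete, here is a minimal standalone sketch of detect and parseToString. The class name TikaFacadeDemo and its helper methods are hypothetical and not from the original article. It uses a default `new Tika()` so it can run outside Spring, whereas the project itself would use the bean from MyTikaConfig. The translate method is left out because it presumably needs a translation backend to be configured first.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical demo class, for illustration only.
public class TikaFacadeDemo {

    // Default facade for a standalone run (assumption: inside the Spring
    // project you would inject the bean from MyTikaConfig instead).
    private static final Tika TIKA = new Tika();

    // detect(byte[]) inspects the leading bytes and returns a MIME type string.
    static String detectType(byte[] content) {
        return TIKA.detect(content);
    }

    // parseToString(InputStream) extracts the document's plain text.
    static String extractText(byte[] content) throws IOException, TikaException {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return TIKA.parseToString(in);
        }
    }

    public static void main(String[] args) throws IOException, TikaException {
        byte[] sample = "Hello, Tika!".getBytes(StandardCharsets.UTF_8);
        System.out.println("type: " + detectType(sample));
        System.out.println("text: " + extractText(sample).trim());
    }
}
```

Extracting text this way relies on tika-parsers-standard-package being on the classpath. With tika-core alone, detection still works but no parsers are registered.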
Use it in the project
Once the configuration is complete, documents can be parsed simply by injecting Tika, as shown in the figure below:
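As a rough sketch of what such usage might look like, the service below receives the Tika bean through constructor injection and exposes two small methods. The class name DocumentParseService and its method names are assumptions made for illustration and do not come from the original article.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

// Hypothetical service; the name and methods are illustrative only.
@Service
public class DocumentParseService {

    private final Tika tika;

    // Spring injects the Tika bean declared in MyTikaConfig.
    public DocumentParseService(Tika tika) {
        this.tika = tika;
    }

    // Returns the detected MIME type; the file name serves as an extra hint.
    public String detectType(InputStream in, String fileName) throws IOException {
        return tika.detect(in, fileName);
    }

    // Returns the extracted plain text; Tika closes the stream when it finishes.
    public String extractText(InputStream in) throws IOException, TikaException {
        return tika.parseToString(in);
    }
}
```

A controller could pass an uploaded file's input stream and original file name to these methods. Note that parseToString truncates its output at a default maximum length. If longer documents are expected, the limit can be raised with `setMaxStringLength` on the Tika bean.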

