README.md
I used Apache Spark to extract more than 6 million phrases from 200,000 English Wikipedia pages. The process of cleaning the text, extracting keywords, and training the Word2Vec model was as follows:
Merging page's Title and its Text
Sentence detection (spark-nlp)
Tokenizer (spark-nlp)
Normalizer (spark-nlp)
POS Tagger (spark-nlp)
Chunking with grammar rules to detect both uni-grams and multi-grams (spark-nlp)
Stop words remover (Spark ML)
Training and transforming Word2Vec Model (Spark ML)
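The last two steps above can be sketched with Spark ML classes alone. This is only an illustration under stated assumptions, not the author's actual pipeline: the spark-nlp stages (sentence detection, tokenizing, normalizing, POS tagging, chunking) are left out, the toy input and the `phrases` column name are hypothetical, and the vector size and minimum count are shrunk so that two tiny rows are enough to train on.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StopWordsRemover, Word2Vec}
import org.apache.spark.sql.SparkSession

object PhraseWord2VecSketch {

  // Builds a two-stage pipeline: stop-word removal followed by Word2Vec.
  def buildPipeline(): Pipeline = {
    // Drops entries that match Spark's default English stop-word list
    val stopWordsRemover = new StopWordsRemover()
      .setInputCol("phrases") // hypothetical column holding extracted phrases
      .setOutputCol("filteredPhrases")

    // Toy settings (assumptions); the README's model uses vectorSize 300, minCount 10
    val word2Vec = new Word2Vec()
      .setInputCol("filteredPhrases")
      .setOutputCol("word2vec")
      .setVectorSize(8)
      .setMinCount(1)

    new Pipeline().setStages(Array(stopWordsRemover, word2Vec))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("phrase-word2vec-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows: each one stands in for the phrases extracted from one page
    val phrases = Seq(
      Array("climate change", "the", "global warming", "greenhouse gas emissions"),
      Array("football team", "a", "soccer", "league championships")
    ).map(Tuple1.apply).toDF("phrases")

    val model = buildPipeline().fit(phrases)
    model.transform(phrases).select("filteredPhrases", "word2vec").show(false)

    spark.stop()
  }
}
```

Because each multi-word phrase is a single array element, Word2Vec treats it as one vocabulary entry, which is what makes queries such as "climate change" possible later on.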
Content
Word2Vec model details:
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("filteredPhrases")
  .setOutputCol("word2vec")
  .setVectorSize(300)
  .setMinCount(10)
  .setMaxIter(1)
  .setNumPartitions(1)
Usage
You can simply download this model and load it into your Apache Spark ML pipeline:
import org.apache.spark.ml._
import org.apache.spark.ml.feature.Word2VecModel

// Load the saved pipeline; the Word2Vec model is its last stage
val pipeLineWord2VecModel = PipelineModel.read.load("/tmp/multivac_nlp_ml_200k")
val word2VecModel = pipeLineWord2VecModel.stages.last.asInstanceOf[Word2VecModel]

word2VecModel.findSynonyms("climate change", 10).show(false)
+--------------------------+------------------+
|word                      |similarity        |
+--------------------------+------------------+
|global warming            |0.7534363269805908|
|intergovernmental panel   |0.7303586602210999|
|sustainable development   |0.714561939239502 |
|greenhouse gas emissions  |0.6958430409431458|
|food security             |0.6919037103652954|
|development policy        |0.6879498958587646|
|environmental policy      |0.6868311166763306|
|energy security           |0.681218147277832 |
|multinational corporations|0.6769515872001648|
|tax policy                |0.671006977558136 |
+--------------------------+------------------+

word2VecModel.findSynonyms("football", 10).show(false)
+--------------------------+------------------+
|word                      |similarity        |
+--------------------------+------------------+
|football team             |0.7648624181747437|
|football soccer           |0.7647290229797363|
|field hockey              |0.745803952217102 |
|football teams            |0.7442964911460876|
|soccer                    |0.7377723455429077|
|professional football     |0.7375280261039734|
|youth academy             |0.7372391819953918|
|national basketball league|0.7333077788352966|
|coach                     |0.7324917912483215|
|league championships      |0.7308306694030762|
+--------------------------+------------------+

word2VecModel.findSynonyms("cancer", 10).show(false)
+-----------------------+------------------+
|word                   |similarity        |
+-----------------------+------------------+
|climate change         |0.7534365057945251|
|literature review      |0.7533518075942993|
|minimize               |0.7510043382644653|
|categorization         |0.7404615879058838|
|health effects         |0.7371178269386292|
|genetic information    |0.7362238168716431|
|scientific basis       |0.7347298860549927|
|intergovernmental panel|0.734147846698761 |
|recent study           |0.7333264350891113|
|food security          |0.7322153449058533|
+-----------------------+------------------+

word2VecModel.findSynonyms("london", 10).show(false)
+----------------------+------------------+
|word                  |similarity        |
+----------------------+------------------+
|edinburgh             |0.6135260462760925|
|glasgow               |0.5734920501708984|
|bristol               |0.5710445642471313|
|edinburgh scotland    |0.5306239724159241|
|kensington            |0.5289728045463562|
|islington             |0.5218709707260132|
|clapham               |0.5164309144020081|
|leicester             |0.5161707401275635|
|cambridge             |0.5141464471817017|
|royal scottish academy|0.508998453617096 |
+----------------------+------------------+
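To my understanding, the similarity column returned by findSynonyms is the cosine similarity between the query's vector and each vocabulary vector, so values closer to 1 mean the two phrases point in nearly the same direction in the 300-dimensional space. The following is a minimal plain-Scala sketch of that measure, assuming this reading is right; the short vectors are made up for illustration and are not taken from the model.

```scala
object CosineSimilaritySketch {

  // Cosine similarity: dot(a, b) / (||a|| * ||b||); defined here as 0.0 if either vector is all zeros.
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 3-dimensional embeddings, for illustration only
    val phraseA = Array(0.9, 0.1, 0.3)
    val phraseB = Array(0.8, 0.2, 0.4)
    val phraseC = Array(-0.5, 0.9, 0.0)

    println(s"A vs B: ${cosineSimilarity(phraseA, phraseB)}")
    println(s"A vs C: ${cosineSimilarity(phraseA, phraseC)}")
  }
}
```

With the real model, the vectors to compare could be read from word2VecModel.getVectors, which returns a DataFrame with a word column and a vector column.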
Environment
Cloudera CDH 5.15.1
Apache Spark 2.3.1
Ubuntu 16.04.x
Acknowledgements
This work was carried out using the ISC-PIF/CNRS (UPS3611) and Multivac Platform infrastructure.