Wikipedia Word2Vec: an Apache Spark Word2Vec model trained on 200K Wikipedia pages


Dataset size: 132.74M
NLP,Business,Earth and Nature,Text Mining Classification



    README.md

    I used Apache Spark to extract more than 6 million phrases from 200,000 English Wikipedia pages. The process of cleaning the text, extracting keywords, and training the Word2Vec model was as follows:

    1. Merging each page's title and text

    2. Sentence detection (spark-nlp)

    3. Tokenizer (spark-nlp)

    4. Normalizer (spark-nlp)

    5. POS tagger (spark-nlp)

    6. Chunking with grammar rules to detect both unigrams and multi-grams (spark-nlp)

    7. Stop words remover (Spark ML)

    8. Training and transforming the Word2Vec model (Spark ML)
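The two stages that do not depend on spark-nlp, merging title and text and removing stop words, might look roughly like the sketch below. It is not the original pipeline code: the column names `title`, `text`, `phrases`, and `content`, the object `PreprocessSketch`, and the toy input row are all assumptions made for illustration. Only `filteredPhrases` matches the input column named in the model details.

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

object PreprocessSketch {
  // Merge each page's title and text into a single "content" column.
  // Column names here are assumptions, not taken from the original pipeline.
  def mergeTitleAndText(pages: DataFrame): DataFrame =
    pages.withColumn("content", concat_ws(". ", col("title"), col("text")))

  // Drop English stop words from an array-of-strings column with Spark ML.
  // In the real pipeline the "phrases" column would come from the spark-nlp chunker.
  def removeStopWords(df: DataFrame): DataFrame =
    new StopWordsRemover()
      .setInputCol("phrases")
      .setOutputCol("filteredPhrases")
      .transform(df)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("preprocess-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical single-page input with hand-written phrases.
    val pages = Seq(
      ("Climate change", "Climate change is a shift in weather patterns.",
        Seq("climate change", "is", "a", "shift", "in", "weather patterns"))
    ).toDF("title", "text", "phrases")

    removeStopWords(mergeTitleAndText(pages))
      .select("content", "filteredPhrases")
      .show(false)

    spark.stop()
  }
}
```

The spark-nlp stages (sentence detection, tokenization, normalization, POS tagging, chunking) are left out on purpose, since their exact configuration is not given in this README.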

    Content

    Word2Vec model details:

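    import org.apache.spark.ml.feature.Word2Vec

    val word2Vec = new Word2Vec()
      .setInputCol("filteredPhrases") // phrases left after stop word removal
      .setOutputCol("word2vec")
      .setVectorSize(300)             // dimension of each phrase vector
      .setMinCount(10)                // skip phrases seen fewer than 10 times
      .setMaxIter(1)                  // a single training iteration
      .setNumPartitions(1)            // train on a single partition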
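To show how an estimator configured this way is fitted and applied, here is a minimal sketch on a toy corpus. The object `Word2VecFitSketch`, the method `fitToy`, and the sample phrases are hypothetical. The vector size and minimum count are deliberately smaller than the values above so that six phrases are enough to build a vocabulary, and the seed is added only to make runs repeatable.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object Word2VecFitSketch {
  // Same column names as the README; sizes are reduced for a toy corpus.
  def fitToy(phrases: DataFrame): Word2VecModel =
    new Word2Vec()
      .setInputCol("filteredPhrases")
      .setOutputCol("word2vec")
      .setVectorSize(8)    // assumption: the README uses 300
      .setMinCount(1)      // assumption: the README uses 10
      .setMaxIter(1)
      .setNumPartitions(1)
      .setSeed(42L)        // assumption: fixed seed for repeatability
      .fit(phrases)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word2vec-fit-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical documents, each already reduced to a list of phrases.
    val corpus = Seq(
      Seq("climate change", "global warming", "greenhouse gas emissions"),
      Seq("football", "football team", "coach")
    ).map(Tuple1.apply).toDF("filteredPhrases")

    val model = fitToy(corpus)

    // One row per vocabulary entry: the phrase and its learned vector.
    model.getVectors.show(false)

    // transform() adds a "word2vec" column holding the average of each
    // document's phrase vectors.
    model.transform(corpus).show(false)

    spark.stop()
  }
}
```

With a corpus this small the vectors carry no real meaning; the sketch only illustrates the fit and transform calls.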

    Usage

    You can simply download this model and load it into your Apache Spark ML pipeline:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.feature.Word2VecModel
    
    val pipeLineWord2VecModel = PipelineModel.read.load("/tmp/multivac_nlp_ml_200k")
    val word2VecModel = pipeLineWord2VecModel.stages.last.asInstanceOf[Word2VecModel]
    
    word2VecModel.findSynonyms("climate change", 10).show(false)
    +--------------------------+------------------+
    |word                      |similarity        |
    +--------------------------+------------------+
    |global warming            |0.7534363269805908|
    |intergovernmental panel   |0.7303586602210999|
    |sustainable development   |0.714561939239502 |
    |greenhouse gas emissions  |0.6958430409431458|
    |food security             |0.6919037103652954|
    |development policy        |0.6879498958587646|
    |environmental policy      |0.6868311166763306|
    |energy security           |0.681218147277832 |
    |multinational corporations|0.6769515872001648|
    |tax policy                |0.671006977558136 |
    +--------------------------+------------------+
    
    word2VecModel.findSynonyms("football", 10).show(false)
    +--------------------------+------------------+
    |word                      |similarity        |
    +--------------------------+------------------+
    |football team             |0.7648624181747437|
    |football soccer           |0.7647290229797363|
    |field hockey              |0.745803952217102 |
    |football teams            |0.7442964911460876|
    |soccer                    |0.7377723455429077|
    |professional football     |0.7375280261039734|
    |youth academy             |0.7372391819953918|
    |national basketball league|0.7333077788352966|
    |coach                     |0.7324917912483215|
    |league championships      |0.7308306694030762|
    +--------------------------+------------------+
    
    word2VecModel.findSynonyms("cancer", 10).show(false)
    +-----------------------+------------------+
    |word                   |similarity        |
    +-----------------------+------------------+
    |climate change         |0.7534365057945251|
    |literature review      |0.7533518075942993|
    |minimize               |0.7510043382644653|
    |categorization         |0.7404615879058838|
    |health effects         |0.7371178269386292|
    |genetic information    |0.7362238168716431|
    |scientific basis       |0.7347298860549927|
    |intergovernmental panel|0.734147846698761 |
    |recent study           |0.7333264350891113|
    |food security          |0.7322153449058533|
    +-----------------------+------------------+
    
    word2VecModel.findSynonyms("london", 10).show(false)
    +----------------------+------------------+
    |word                  |similarity        |
    +----------------------+------------------+
    |edinburgh             |0.6135260462760925|
    |glasgow               |0.5734920501708984|
    |bristol               |0.5710445642471313|
    |edinburgh scotland    |0.5306239724159241|
    |kensington            |0.5289728045463562|
    |islington             |0.5218709707260132|
    |clapham               |0.5164309144020081|
    |leicester             |0.5161707401275635|
    |cambridge             |0.5141464471817017|
    |royal scottish academy|0.508998453617096 |
    +----------------------+------------------+
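According to the Spark ML documentation, the `similarity` column returned by `findSynonyms` is the cosine similarity between the query vector and each candidate vector. The plain Scala sketch below shows that computation on small hand-made vectors. The object `CosineSketch`, the helpers `cosine` and `topK`, and the example vectors are hypothetical and are not part of Spark.

```scala
object CosineSketch {
  // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0.0 for zero vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Rank every vocabulary entry by similarity to the query and keep the top k.
  // Unlike findSynonyms, this helper does not exclude the query phrase itself.
  def topK(query: Array[Double],
           vocab: Map[String, Array[Double]],
           k: Int): Seq[(String, Double)] =
    vocab.toSeq
      .map { case (word, vec) => (word, cosine(query, vec)) }
      .sortBy { case (_, score) => -score }
      .take(k)

  def main(args: Array[String]): Unit = {
    // Made-up 2-dimensional vectors, for illustration only.
    val vocab = Map(
      "edinburgh" -> Array(1.0, 0.2),
      "glasgow"   -> Array(0.8, 0.6),
      "football"  -> Array(0.0, 1.0)
    )
    val london = Array(1.0, 0.0)
    topK(london, vocab, 2).foreach { case (w, s) => println(s"$w\t$s") }
  }
}
```

On the real model, the same ranking could be reproduced from the vectors exposed by `word2VecModel.getVectors`, although `findSynonyms` is the more direct route.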

    Environment

    Acknowledgements

    This work was carried out using the ISC-PIF/CNRS (UPS3611) and Multivac Platform infrastructure.

