Cal Poly Syllabi Corpus

Education, Universities and Colleges, NLP, Text Data, Text Mining, spaCy Classification

Data structure: 15.26M


    README.md

# Context

This corpus of syllabi aims to support the [Nimbus Assistant](https://www.github.com/calpoly-csai/api), an AI similar to Siri/Alexa that answers students’ questions. In the context of syllabi, students may ask questions like:

* What textbook does MATH 143 need?
* Do I need to buy a new book after MATH 142?
* What’s the course website for Anton Kaul’s 143?
* What’s Dr. Kaul’s grading policy?
* What’s the bare minimum I need to do to pass Kaul’s 143 class?
* How do I ace Kaul’s math 143 class?

# Content

Data was scraped using [Thruuu, an awesome and easy to use SERP (search engine result pages) scraper](https://app.samuelschmitt.com/).

# Thruuu

* `thruuu.xlsx` - the data exported from Thruuu.
* `thruuu.pdf` - the preliminary analysis exported from Thruuu.

# Notebooks/Process

* `step-1-get-documents-from-sheet-urls.ipynb` - a notebook that **inputs** `thruuu.xlsx` and **outputs** `downloads.tar.gz` along with `downloads.csv`
* `step-2-extract-document-data-with-OCR.ipynb` - a notebook that **inputs** `downloads.tar.gz` along with `downloads.csv` and **outputs** `extracted.csv`
* `step-3-get-simple-logistical-information.ipynb` - a notebook that **inputs** `extracted.csv` and **outputs** `logistical_info.csv`

# Notebook Outputs

* `downloads.tar.gz` - 100 PDF files (some files are corrupted).
* `downloads.csv` - a table associating search result positions with individual PDF files for a syllabus.
* `extracted.csv` - a table associating each PDF file with the extracted OCR text (also the plain text but OCR is preferred).
* `logistical_info.csv` - a table associating each PDF file with the logistical info (instructor/office/email/etc) that is found through regular expressions.

# Acknowledgements

Thank you [Samuel Schmitt](samuelschmitt.com) for making Thruuu!

# Inspiration

* What kinds of factoids could you mine from the syllabus text?
* What are common phrases used by Cal Poly professors in their syllabi?
* What are the rarest phrases found in syllabi?
* Can you identify a professor’s writing style from their syllabus?
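
As a starting point for the common-phrase and rare-phrase questions, one possible approach is to count how many syllabi contain each word n-gram. The sketch below is only an illustration: the function names (`ngrams`, `phrase_document_counts`) and the sample sentences are hypothetical, and because the column names of `extracted.csv` are not documented here, it works on a plain list of strings rather than reading the file.

```python
import re
from collections import Counter


def ngrams(text, n=3):
    """Yield lowercase word n-grams from one document.

    Punctuation is dropped, so n-grams may span sentence boundaries.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def phrase_document_counts(texts, n=3):
    """Count, for each n-gram, how many documents contain it at least once."""
    counts = Counter()
    for text in texts:
        # set() so a phrase repeated inside one syllabus counts only once
        counts.update(set(ngrams(text, n)))
    return counts


# Hypothetical stand-ins for the OCR text of three syllabi (not real data).
sample_texts = [
    "Late work will not be accepted. Office hours are by appointment.",
    "Late work will not be accepted without prior approval.",
    "Office hours are held weekly. Attendance is required.",
]

counts = phrase_document_counts(sample_texts, n=3)

# Phrases shared by at least two documents, and phrases seen in only one.
common = [phrase for phrase, c in counts.most_common() if c >= 2]
rare = [phrase for phrase, c in counts.items() if c == 1]
```

Counting documents rather than raw occurrences keeps one long syllabus from dominating the "common" list. On real OCR text, some cleanup of recognition noise would probably be needed before the rare-phrase list becomes meaningful.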