
Favicons

2755.58M
Earth and Nature, Arts and Entertainment, Internet, Online Communities, Software, Image Data Classification


    README.md

Context

Favicons are the (usually tiny) image files that browsers may use to represent websites in tabs, in the URL bar, or for bookmarks. Kaggle, for example, uses an image of a blue lowercase "k" as its favicon. This dataset contains about 360,000 favicons from popular websites.

Content and Acknowledgements

These favicons were scraped in July 2016. I wrote a crawler that went through Alexa's top 1 million sites, and made a request for 'favicon.ico' at the site root. If I got a 200 response code, I saved the result as `${site_url}.ico`. For domains that were identical but for the TLD (e.g. google.com, google.ca, google.jp...), I scraped only one favicon. My scraping/cleaning code is on GitHub [here](https://github.com/colinmorris/favicon-scraper).

Of 1m sites crawled, 540k responded with a 200 code. The dataset has 360k images, which were the remains after filtering out:

- empty files (-140k)
- non-image files, according to the [`file`](https://en.wikipedia.org/wiki/File_(command)) command (-40k). These mostly had type HTML, ASCII, or UTF-*.
- corrupt/malformed image files - i.e. those that were sufficiently messed up that ImageMagick failed to parse them. (-1k)

The remaining files are exactly as I received them from the site. They are mostly [ICO files](https://en.wikipedia.org/wiki/ICO_(file_format)), with the most common sizes being 16x16, 32x32, and 48x48. But there's a long tail of more exotic formats and sizes (there is at least one person living among us who thought that 88x31 was a fine size for a favicon).

The favicon files are divided among 6 zip files, `full-0.zip, full-1.zip... full-5.zip`. (If you wish to download the full dataset as a single tarball, you can do so from the [Internet Archive](https://archive.org/details/favicons_201708))

`favicon_metadata.csv` is a csv file with one row per favicon in the dataset. The `split_index` says which of the zip files the image landed in.
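The filtering steps described above can be roughly approximated without the `file` command or ImageMagick. The sketch below is not the author's cleaning code (that lives in the linked GitHub repository). It is a minimal illustration that sorts a downloaded payload into 'empty', a guessed image type, or 'non-image' by checking a few well-known magic bytes. The function name `classify_payload` and the signature table are assumptions made for this example, and magic-byte sniffing is much cruder than `file`: it covers only a handful of formats and cannot detect corrupt images.

```python
# Magic-byte prefixes for a few common image formats (deliberately not exhaustive).
IMAGE_SIGNATURES = {
    b"\x00\x00\x01\x00": "ico",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\xff\xd8\xff": "jpeg",
}


def classify_payload(data: bytes) -> str:
    """Return 'empty', a guessed image type, or 'non-image' for a response body."""
    if not data:
        return "empty"
    for magic, name in IMAGE_SIGNATURES.items():
        if data.startswith(magic):
            return name
    # Anything else (HTML error pages, plain text, ...) is treated as non-image.
    return "non-image"


if __name__ == "__main__":
    for sample in (b"", b"<!DOCTYPE html>", b"\x00\x00\x01\x00\x01\x00"):
        print(classify_payload(sample))
```

A payload that passes this check may still be malformed, which is why the original pipeline had a separate parsing step.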
For an example of loading and interacting with particular favicons in a kernel context, check out the [Favicon helper functions](https://www.kaggle.com/colinmorris/favicon-helper-functions) kernel.

As mentioned above, the full dataset is a dog's breakfast of different file formats and dimensions. I've created 'standardized' subsets of the data that may be easier to work with (particularly for machine learning applications, where it's necessary to have fixed dimensions).

**16_16.tar.gz** is a tarball containing all 16x16 favicons in the dataset, converted to PNG. It has 290k images. ICO is a container format, and many of the ico files in the raw dataset contain several versions of the same favicon at different resolutions. 16x16 favicons that were stuffed together in an ICO file with images of other sizes are included in this set. But I did no resizing - if a favicon has no 'native' 16x16 version, it isn't in this set.

**16_16_distinct.tar.gz** is identical to the above, but with 70k duplicate or near-duplicate images removed. There are a small number of commonly repeated favicons like the Blogger "B" that occur thousands of times, which could be an annoyance depending on the use case - e.g. a generative model might get stuck in a local maximum of spitting out Blogger Bs.

Alexa's top 1-million list includes 'adult' sites, so some URLs and favicons may be NSFW or offensive. (It's pretty hard to make a credible depiction of nudity in 256 pixels, but there are some occasional attempts.)

Inspiration

I hope this dataset might be especially useful for small-scale deep learning experiments. Scaling photographs down to 16x16 would render many of them unintelligible, but these favicons were born tiny. The `16_16` fold has more instances than MNIST, and the images are even smaller! (Though, unlike MNIST, most of the images in this dataset are not grayscale.)
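Because ICO is a container, checking whether a raw file has a 'native' 16x16 version only requires reading its directory, not decoding any pixels. The sketch below parses the ICO header (a 6-byte header followed by one 16-byte directory entry per image) and lists the embedded sizes. The helper names `ico_sizes` and `has_native_16` are hypothetical, and this is not necessarily how the `16_16` subset was built. It only handles genuine ICO files, so a PNG or GIF saved under an `.ico` name raises `ValueError`.

```python
import struct
from typing import List, Tuple


def ico_sizes(data: bytes) -> List[Tuple[int, int]]:
    """List (width, height) for each image in an ICO file's directory."""
    if len(data) < 6:
        raise ValueError("too short to be an ICO file")
    # Header: reserved (must be 0), type (1 = icon), number of images.
    reserved, kind, count = struct.unpack_from("<HHH", data, 0)
    if reserved != 0 or kind != 1:
        raise ValueError("not an ICO file")
    sizes = []
    for i in range(count):
        offset = 6 + 16 * i
        if offset + 16 > len(data):
            raise ValueError("truncated ICO directory")
        # The first two bytes of each entry are width and height.
        width, height = struct.unpack_from("<BB", data, offset)
        # A stored value of 0 means 256 pixels.
        sizes.append((width or 256, height or 256))
    return sizes


def has_native_16(data: bytes) -> bool:
    """True if the ICO directory lists at least one 16x16 image."""
    return (16, 16) in ico_sizes(data)
```

Given the bytes of a favicon extracted from one of the zip files, `has_native_16(raw_bytes)` indicates whether a 16x16 image could be pulled out without resizing.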
If you liked this, you should also check out the recently released [Large Logo Dataset](https://data.vision.ee.ethz.ch/cvl/lld/). They've currently made available 550k favicons resized to 32x32. Their data was collected more recently, and their scraping process was more robust, so their dataset should probably be preferred (though you might still want to use this one if you need the raw favicon files, or if you prefer to use 16x16 non-resized images).