# Code Images
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1286417.svg)](https://doi.org/10.5281/zenodo.1286417)
# Context
This is a subset of the [Zenodo-ML Dinosaur Dataset](https://vsoch.github.io/datasets/2018/zenodo)
[[GitHub](https://www.github.com/vsoch/zenodo-ml)] that has been converted to small png files
and organized in folders by language, so you can jump right into using machine learning
methods that assume image input.
# Content
Included are .tar.gz files, each named after a file extension. When extracted,
each produces a folder of the same name.

    $ tree -L 1
    .
    ├── c
    ├── cc
    ├── cpp
    ├── cs
    ├── css
    ├── csv
    ├── cxx
    ├── data
    ├── f90
    ├── go
    ├── html
    ├── java
    ├── js
    ├── json
    ├── m
    ├── map
    ├── md
    ├── txt
    └── xml
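To unpack the archives programmatically, something like the following should work. This is a minimal sketch using only Python's standard `tarfile` module; the function name `extract_archives` and its `archive_dir` and `output_dir` parameters are illustrative and not part of the dataset tooling.

```python
import tarfile
from pathlib import Path


def extract_archives(archive_dir, output_dir):
    """Extract every <extension>.tar.gz in archive_dir into output_dir.

    Returns the sorted names of the top-level folders found in output_dir.
    """
    archive_dir, output_dir = Path(archive_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for archive in sorted(archive_dir.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Only extract archives you trust: members may contain unsafe paths.
            tar.extractall(output_dir)
    return sorted(p.name for p in output_dir.iterdir() if p.is_dir())
```

Calling `extract_archives("downloads", "data")` on a directory holding `csv.tar.gz` and `m.tar.gz` would be expected to leave `data/csv` and `data/m` behind (the directory names here are placeholders).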
We can peek inside one of the (somewhat smaller) folders to see that its subfolders are
Zenodo identifiers. A Zenodo identifier corresponds to a single GitHub repository, so
the png files under it are chunks of code with that extension from that particular
repository.

    $ tree map -L 1
    map
    ├── 1001104
    ├── 1001659
    ├── 1001793
    ├── 1008839
    ├── 1009700
    ├── 1033697
    ├── 1034342
    ...
    ├── 836482
    ├── 838329
    ├── 838961
    ├── 840877
    ├── 840881
    ├── 844050
    ├── 845960
    ├── 848163
    ├── 888395
    ├── 891478
    └── 893858
    154 directories, 0 files
Within each folder (Zenodo ID), each file name consists of the Zenodo ID followed
by the index into the original image set array that is provided with the full
[dinosaur dataset archive](https://vsoch.github.io/datasets/2018/zenodo).

    $ tree m/891531/ -L 1
    m/891531/
    ├── 891531_0.png
    ├── 891531_10.png
    ├── 891531_11.png
    ├── 891531_12.png
    ├── 891531_13.png
    ├── 891531_14.png
    ├── 891531_15.png
    ├── 891531_16.png
    ├── 891531_17.png
    ├── 891531_18.png
    ├── 891531_19.png
    ├── 891531_1.png
    ├── 891531_20.png
    ├── 891531_21.png
    ├── 891531_22.png
    ├── 891531_23.png
    ├── 891531_24.png
    ├── 891531_25.png
    ├── 891531_26.png
    ├── 891531_27.png
    ├── 891531_28.png
    ├── 891531_29.png
    ├── 891531_2.png
    ├── 891531_30.png
    ├── 891531_3.png
    ├── 891531_4.png
    ├── 891531_5.png
    ├── 891531_6.png
    ├── 891531_7.png
    ├── 891531_8.png
    └── 891531_9.png
    0 directories, 31 files
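Note that the listing above is sorted as text, so `891531_10.png` comes before `891531_1.png`. If the order of the chunks matters for your use, the file names can be parsed and sorted numerically. A minimal sketch follows; `parse_name` and `sort_chunks` are hypothetical helper names, and the pattern assumes every file is named `<zenodo id>_<index>.png` as shown.

```python
import re
from pathlib import Path

# Assumed naming scheme: <zenodo id>_<index>.png
NAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.png$")


def parse_name(path):
    """Return (zenodo_id, index) as integers for a file like 891531_10.png."""
    match = NAME_PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1)), int(match.group(2))


def sort_chunks(paths):
    """Sort paths by Zenodo ID, then by numeric chunk index."""
    return sorted(paths, key=parse_name)
```

For example, `sort_chunks(Path("m/891531").glob("*.png"))` would return the files of that repository in index order.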
> So what's the difference?
The difference is that these files are organized by extension type, and provided as
actual png images. The original data is provided as numpy arrays, and is organized
by Zenodo ID. Both are useful for different things. This particular version is cool
because we can actually see what a code image looks like.
> How many images total?
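We can count the total number of images. The pattern is quoted so that the shell
passes it to find instead of expanding it in the current directory:

    $ find . -type f -name '*.png' | wc -l
    3,026,993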
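The same count can be broken down per extension folder in Python. This is a sketch under the assumption that the archives were extracted into one root directory laid out as `<extension>/<zenodo id>/<file>.png`; `count_images` is an illustrative name.

```python
from collections import Counter
from pathlib import Path


def count_images(root):
    """Count png files under root, keyed by top-level (extension) folder."""
    root = Path(root)
    counts = Counter()
    for png in root.rglob("*.png"):
        # The first path component below root is the extension folder.
        counts[png.relative_to(root).parts[0]] += 1
    return counts
```

Summing the values, `sum(count_images(".").values())`, gives the overall total, and the per-folder counts show how balanced the classes are before training a classifier.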
# Dataset Curation
The script used to create the dataset is [provided here](https://github.com/vsoch/zenodo-ml/blob/master/preprocess/2.organize_by_language.py).
Essentially, we start with the top extensions as identified by [this work](https://vsoch.github.io/2018/extension-counts/)
(excluding actual image files), then write each 80x80 image to a png file, organized by extension and then by
Zenodo ID (as shown above).
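To make the idea of a "code image" concrete, here is a rough sketch of how a text file could be cut into 80x80 grids of character codes. It is not the linked script: the padding with spaces, the truncation of long lines, and the replacement of characters outside the 0-255 range are assumptions made for illustration, and `text_to_grids` is a hypothetical name.

```python
SIZE = 80  # images in this dataset are 80x80
PAD = ord(" ")  # 32, used here to fill empty cells


def text_to_grids(text, size=SIZE):
    """Split text into size x size grids of character codes (lists of lists)."""
    rows = []
    for line in text.splitlines():
        # Assumption: keep the first `size` characters and map anything
        # that does not fit in a single byte to "?".
        codes = [ord(c) if ord(c) < 256 else ord("?") for c in line[:size]]
        rows.append(codes + [PAD] * (size - len(codes)))
    # Assumption: pad the final grid with blank rows to a full size x size block.
    while len(rows) % size != 0 or not rows:
        rows.append([PAD] * size)
    return [rows[i:i + size] for i in range(0, len(rows), size)]
```

Each returned grid has the same shape as one png in this dataset, with one row per line of text and one cell per character.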
# Saving the Image
I tested a few methods to write the single-channel 80x80 arrays as png images,
and settled on cv2's imwrite function because the saved file loads back with
exactly the same content.

    import cv2
    # image is a single-channel 80x80 uint8 array; image_path ends in .png
    cv2.imwrite(image_path, image)
# Loading the Image
Given the above, it's pretty easy to load an image! Here is an example using imageio,
followed by the older scipy approach (scipy.misc.imread was deprecated in SciPy 1.0 and removed in 1.2).

    image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
    from imageio import imread
    image = imread(image_path)
    image
    # array([[116, 105, 109, ..., 32, 32, 32],
    #        [ 48, 44, 48, ..., 32, 32, 32],
    #        [ 48, 46, 49, ..., 32, 32, 32],
    #        ...,
    #        [ 32, 32, 32, ..., 32, 32, 32],
    #        [ 32, 32, 32, ..., 32, 32, 32],
    #        [ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
    image.shape
    # (80, 80)
    # Deprecated: only works with older SciPy releases
    from scipy import misc
    misc.imread(image_path)
    # Image([[116, 105, 109, ..., 32, 32, 32],
    #        [ 48, 44, 48, ..., 32, 32, 32],
    #        [ 48, 46, 49, ..., 32, 32, 32],
    #        ...,
    #        [ 32, 32, 32, ..., 32, 32, 32],
    #        [ 32, 32, 32, ..., 32, 32, 32],
    #        [ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted
to their ordinal values. Can you guess what 32 is?

    ord(' ')
    # 32
    # And thus if you wanted to convert it back...
    chr(32)
    # ' '
So how do we reconstruct a line from the file? Let's try it! Here is the header
for our file, followed by the first line. Based on its location in the folder
called csv, we know that this is a csv file.

    ''.join([chr(x) for x in image[0]])
    # 'time,S1,S2 '  (the row is padded with spaces to 80 characters)
    ''.join([chr(x) for x in image[1]])
    # '0,0.00015,0 '
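The same idea extends to the whole image. The sketch below works on any 80x80 grid of integers (a numpy array or a plain list of lists) and strips the trailing whitespace from each row; `image_to_text` is an illustrative helper, not part of the dataset tooling.

```python
def image_to_text(image):
    """Convert a grid of character codes back into text, one line per row."""
    lines = ["".join(chr(int(value)) for value in row).rstrip() for row in image]
    # Drop the blank rows that pad the bottom of the image.
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)
```

Because padding and genuine trailing whitespace both show up as 32, this recovers the visible text but not necessarily the exact original bytes.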
# Inspiration
These datasets can answer some of the same questions [discussed here](https://vsoch.github.io/datasets/2018/zenodo/#what-can-i-learn-from-this-dataset).
Essentially, if we are able to classify languages based on patterns in the images, we can generate signatures for groupings of scripts (software
packages). We can then detect these signatures in containers, associate signatures with domains of research, or with some quality or usage metric of
the code.