Context
[Beginning from the beginning][1]. Normal matter, the kind that planets, humans, and stars are made of, makes up only 5% of the mass in the Universe. The rest is invisible dark matter and dark energy, whose existence is hinted at by their gravitational effects. One way of studying these mysteries is to recreate the conditions just after the Big Bang with particle accelerators. To use a very rough analogy, we collide automobiles at supersonic speed and try to learn how they work by looking at photos of the collisions. One such camera is the LHCb detector.
Here is a typical collision event recorded by the LHCb detector, one of the four big experiments at the Large Hadron Collider. The point on the left is where the protons collided; the lines are the tracks of the secondary particles.
![A typical collision event recorded by the LHCb detector][2]
The Muon subdetector (see the figure below) consists of five stations (sensitive planes perpendicular to the beam pipe). Only four of them (M2-M5) are used in our competition. The green parallelepipeds in the 3D figure above are the detector pads that registered a charged particle passing through them. The physical idea is that only muons have a penetration ability high enough to pass through the lead shielding that separates the Muon subdetector from the rest of the detector. Of course, in the real world not all hits are generated by muons, which is why we need machine learning.
![Muon subdetector ][3]
You are given tracks of three types: muon, pion, and proton. Pions may decay in flight into genuine muons, so some of their tracks are very muon-like; you want to reject them as well.
The data is real (i.e. not simulated), so the particle types cannot be known with certainty. To account for that, we use a statistical method called sPlot ([original paper][4], [blog post][5]). Each example is assigned a weight; when the examples are used with those weights, the distribution of the features matches the distribution over type-pure samples. Some of the weights are negative; this is expected.
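As a minimal sketch of what "used with those weights" means in practice, the snippet below fills a weighted histogram and computes a weighted mean. The values and weights are toy numbers invented for illustration, not taken from the dataset, and `weighted_histogram` is a hypothetical helper name.

```python
import numpy as np

def weighted_histogram(values, weights, bin_edges):
    """Histogram in which each example contributes its weight, which may be negative."""
    counts, edges = np.histogram(values, bins=bin_edges, weights=weights)
    return counts, edges

# Toy feature values and sPlot-like weights (illustrative only).
values = np.array([0.5, 1.5, 1.5, 2.5])
weights = np.array([1.2, 0.9, -0.3, 1.0])

counts, edges = weighted_histogram(values, weights, [0.0, 1.0, 2.0, 3.0])

# The weighted mean also takes negative weights; only the weight sum must be non-zero.
weighted_mean = np.average(values, weights=weights)
```

Because a negative weight subtracts from a bin, individual bins can become negative in small samples. Libraries that accept a `sample_weight` argument do not all handle negative values, so it is worth checking the behaviour of the chosen model before training.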
Since the data for different particle types have been obtained from different decays, the distributions of the tracks' kinematic observables differ. In the end, however, we need an algorithm that differentiates particle types in general, not only in these specific decays. In ML terms, this can be viewed as domain adaptation. To achieve that, we reweighted the sample so that the momentum distributions of signal and background match.
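One common way to equalize a one-dimensional distribution is to weight each background example by the signal-to-background ratio of its histogram bin. The sketch below illustrates that idea on toy momenta; it is an assumption about how such a reweighting could be done, not the procedure actually used to produce `kinWeight`, and the function name and bin edges are hypothetical.

```python
import numpy as np

def momentum_reweight(p_signal, p_background, bin_edges):
    """Per-example background weights that make the binned momentum
    distribution of the background match that of the signal."""
    sig_hist, _ = np.histogram(p_signal, bins=bin_edges)
    bkg_hist, _ = np.histogram(p_background, bins=bin_edges)
    # Bin-by-bin ratio; bins without background examples get ratio 0.
    ratio = np.divide(sig_hist.astype(float), bkg_hist,
                      out=np.zeros(len(sig_hist), dtype=float),
                      where=bkg_hist > 0)
    # Bin index of each background example. Values outside the binning
    # are assigned to the nearest edge bin by the clip.
    idx = np.clip(np.digitize(p_background, bin_edges) - 1, 0, len(ratio) - 1)
    return ratio[idx]

# Toy momenta in MeV/c (illustrative only).
edges = np.array([3000.0, 5000.0, 10000.0, 20000.0])
p_sig = np.array([3500.0, 3600.0, 7000.0, 7500.0, 8000.0, 12000.0])
p_bkg = np.array([4000.0, 4500.0, 4800.0, 4900.0, 6000.0, 15000.0, 16000.0])

w_bkg = momentum_reweight(p_sig, p_bkg, edges)
reweighted_bkg, _ = np.histogram(p_bkg, bins=edges, weights=w_bkg)
```

After reweighting, the weighted background histogram equals the signal histogram in every bin that contains background examples.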
Content
The data is used for [IDAO 2019][6]. For convenience, the training dataset is split into two files. In the first phase of the competition (which we call public), the models are scored using 20% of the test data (test_public). The data is provided in two formats: csv and hdf. Both have been created with pandas (see [environment.yml][7] for versions). The hdf files contain pickled numpy arrays, so they might not be readable outside Python.
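Since the training set comes in two files, it has to be concatenated after loading. The following is a minimal sketch with pandas; `load_parts` is a hypothetical helper, and the file names in the trailing comment are placeholders, because the actual names are not given here.

```python
from pathlib import Path

import pandas as pd

def load_parts(paths):
    """Read several csv or hdf files and stack them into one DataFrame."""
    frames = []
    for path in map(Path, paths):
        if path.suffix == ".csv":
            frames.append(pd.read_csv(path))
        else:
            # Reading hdf requires the optional PyTables dependency.
            frames.append(pd.read_hdf(path))
    return pd.concat(frames, axis=0, ignore_index=True)

# Hypothetical usage; replace the placeholders with the real file names:
# train = load_parts(["train_part_1.csv", "train_part_2.csv"])
```

Reading the hdf files with the pandas version listed in environment.yml is the safer choice, since pickled arrays are sensitive to library versions.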
Features
WARNING. The description on Kaggle might be out of date. IDAO participants, please see the competition problem statement for the up-to-date version.
1. label, integer in {0,1} - you need to predict it. 0 is background (pions and protons), 1 is signal (muons)
2. particle_type, integer in {0,1,2} - type of the particle. 0 - pion, 1 - muon, 2 - proton. Available only for the training dataset.
3. weight, float - example weight, used in both training and evaluation. Product of sWeight and kinWeight.
4. sWeight, float - a component of the example weight that accounts for uncertainty in labeling
5. kinWeight, float ≥ 0 - a component of the example weight that equalizes kinematic observables between signal and background
6. id, integer - example id
7. Lextra_{X,Y}[N], float - coordinates of the track linear extrapolation intersection with the Nth station. The extrapolation uses the following station Z coordinates: [15270, 16470, 17670, 18870]
8. Mextra_D{X,Y}2[N], float - uncertainty for squared {X, Y} coordinate of the track extrapolation.
9. MatchedHit_{X,Y,Z}[N], float - coordinates of the hit in the Nth station that a physics-based tracking algorithm associated with the track. [Poster about the algorithm][8] (χ2COR)
10. MatchedHit_TYPE[N], categorical in {0, 1, 2} - whether the Matched hit is crossed. 1 means uncrossed, 2 means crossed. 0 means there is no matched hit in the station (missing value). See pages 6-8 [here][9]
11. MatchedHit_T[N], integer in {255}∪ [1,20] - timing of the Matched hit, 255 is missing value (no matched hit in the station)
12. MatchedHit_D{X,Y,Z}[N], float in {-9999}∪ (0, +∞) - uncertainty of the Matched hit coordinates
13. MatchedHit_DT[N], integer - delta time for the matched hit in the Nth station
14. FOI_hits_N, integer ≥ 0 - number of hits inside a physics-defined cone around the track (aka Field Of Interest, FOI)
15. FOI_hits_{,D}{X,Y,Z,T}, array of float of size FOI_hits_N - same as MatchedHit_{,D}{X,Y,Z,T}, per hit
16. FOI_hits_S, array of integers in {0, 1, 2, 3} - stations of the FOI hits
17. ncl[N], integer - number of clusters in the Nth station. A high-level variable computed by an experimental, undocumented algorithm; the code for it is [here][10]
18. avg_cs[N], float ≥ 0 - average cluster size in the Nth station, computed by the same algorithm as ncl[N]
19. ndof, integer in {4, 6, 8} - number of degrees of freedom used in χ2 computation, a function of momentum
20. NShared, integer ≥ 0 - number of closest hits shared with the neighbouring tracks. See pages 4-5 [here][11] and pages 10-11 [here][12]
21. P, float ≥ 3000 - magnitude of the momentum, MeV/c
22. PT, float ≥ 800 - component of the momentum transverse (i.e. perpendicular) to the beam line, MeV/c
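The missing-value conventions above (MatchedHit_TYPE equal to 0, MatchedHit_T equal to 255) are easy to overlook, so a small preprocessing sketch may help. It masks the sentinels with NaN and computes the distance between the matched hit and the track extrapolation in each station. It assumes that the bracket notation expands to column names such as `MatchedHit_X[0]` with N running from 0 to 3, as the FOI_hits_S values suggest; `add_hit_features` and the `residual_*` columns are hypothetical names, and the toy values are invented.

```python
import numpy as np
import pandas as pd

N_STATIONS = 4  # assumption: station index N runs over 0..3

def add_hit_features(df):
    """Mask missing matched hits and add hit-minus-extrapolation residuals."""
    out = df.copy()
    for n in range(N_STATIONS):
        # MatchedHit_TYPE == 0 means there is no matched hit in the station.
        has_hit = out[f"MatchedHit_TYPE[{n}]"] != 0
        dx = out[f"MatchedHit_X[{n}]"].where(has_hit) - out[f"Lextra_X[{n}]"]
        dy = out[f"MatchedHit_Y[{n}]"].where(has_hit) - out[f"Lextra_Y[{n}]"]
        out[f"residual_X[{n}]"] = dx
        out[f"residual_Y[{n}]"] = dy
        out[f"residual_R[{n}]"] = np.hypot(dx, dy)
        # 255 is the missing-value code for the hit timing.
        t_col = f"MatchedHit_T[{n}]"
        out[t_col] = out[t_col].where(out[t_col] != 255)
    return out

# Toy frame: the first track has hits everywhere, the second has none.
toy = {}
for n in range(N_STATIONS):
    toy[f"Lextra_X[{n}]"] = [100.0, 200.0]
    toy[f"Lextra_Y[{n}]"] = [-50.0, 75.0]
    toy[f"MatchedHit_X[{n}]"] = [103.0, 0.0]
    toy[f"MatchedHit_Y[{n}]"] = [-54.0, 0.0]
    toy[f"MatchedHit_TYPE[{n}]"] = [2, 0]
    toy[f"MatchedHit_T[{n}]"] = [7, 255]

features = add_hit_features(pd.DataFrame(toy))
```

Replacing the sentinels with NaN keeps them from being read as real timings or coordinates. Whether NaN or a dedicated indicator column works better depends on the model.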
Acknowledgements
Produced by the LHCb collaboration at CERN
Inspiration
The goal behind the dataset is to build an algorithm that distinguishes the muon tracks (green) from the tracks of the other particle types, using the information from the Muon subdetector. This is an extremely important problem: muon identification is used, one way or another, in the majority of physics analyses at LHCb.
[1]: https://home.cern/science/physics/dark-matter
[2]: https://contest.yandex.ru/testsys/statement-image?imageId=3c92616c62794048a69c0cd38e6577c311f37c2d81c049bbbcaf3fb3e1bac8b1
[3]: https://contest.yandex.ru/testsys/statement-image?imageId=32c2a694f9e6d3510ee5b9d5f074a6b2cce82455515026fef7069cd8e037468e
[4]: https://arxiv.org/abs/physics/0402083
[5]: https://arogozhnikov.github.io/2015/10/07/splot.html
[6]: https://idao.world/
[7]: https://github.com/yandexdataschool/IDAO-2019-muon-id/blob/master/environment.yml
[8]: https://indico.cern.ch/event/491582/contributions/1168914/attachments/1236304/1815447/LHCC_Cogoni_v4.pdf
[9]: https://cds.cern.ch/record/2063310/files/CERN-THESIS-2015-181.pdf
[10]: https://gitlab.cern.ch/lhcb/Rec/blob/26b3eb5e69c673f771e5a0882eb2443ec62678f4/Muon/MuonID/src/component/MuonClusterRec2.cpp
[11]: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=2ahUKEwjk_qvJrJDgAhUqmYsKHUDBB3AQFjABegQICRAC&url=https%3A%2F%2Fcds.cern.ch%2Frecord%2F2253050%2Ffiles%2FLHCb-PUB-2017-007.pdf&usg=AOvVaw1Brv53oaelpFaVVlnuJu4l
[12]: https://cds.cern.ch/record/2063310/files/CERN-THESIS-2015-181.pdf