Cite as: Jomhari, Nur Zulaiha; Geiser, Achim; Bin Anuar, Afiq Aizuddin; (2017). Higgs-to-four-lepton analysis example using 2011-2012 data. CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.JKB8.RR42
This research-level example is a strongly simplified reimplementation of parts of the original CMS Higgs-to-four-lepton analysis published in Phys.Lett. B716 (2012) 30-61, arXiv:1207.7235.
The published reference plot which is being approximated in this example is https://inspirehep.net/record/1124338/files/H4l_mass_3.png. Other Higgs final states (e.g. Higgs to two photons), which were also part of the same CMS paper and strongly contributed to the Higgs boson discovery, are not covered by this example.
The example consists of different levels of complexity. The highest level of this example addresses users who feel they have at least some minimal understanding of the content of this paper and of the meaning of this reference plot, which can be reached via (separate) educational exercises. The lower levels might also be interesting for educational applications. The example requires a minimal acquaintance with the Linux operating system and the ROOT analysis tool.
The example uses legacy versions of the original CMS data sets in the CMS AOD format, which slightly differ from the ones used for the publication due to improved calibrations. It also uses legacy versions of the corresponding Monte Carlo simulations, which are again close to, but not identical to, the ones in the original publication. These legacy data and MC sets listed below were used in practice, exactly as they are, in many later CMS publications.
According to the CMS Open Data policy, only 50% of the available LHC Run I data are public (and used here), so the statistical significance is reduced with respect to what can be achieved with the full dataset. However, the result of the original paper, Phys.Lett. B716 (2012) 30-61, arXiv:1207.7235, was also obtained with only part of the Run I statistics, roughly equivalent to the luminosity of the public sets, but with only partial statistical overlap.
The provided analysis code follows the spirit of the original analysis and reimplements many of the original cuts on the original data objects, but it is not the original analysis code itself. Also, for the sake of simplicity, it skips some of the more advanced analysis methods of the original paper. Nevertheless, it provides qualitative insight into how the original result was obtained. In addition to the documented core results, the resulting ROOT files also contain many undocumented plots which arose as a side product of setting up this example and earlier examples. The significance of the Higgs 'excess' is about 2 standard deviations in this example, while it was 3.2 standard deviations in this channel alone in the original publication. The difference is attributed to the less sophisticated background suppression. In more recent (not yet public) CMS data sets with higher statistics, a preliminary analysis observes the signal with more than 5 standard deviations in this channel alone (CMS-PAS-HIG-16-041).
The analysis strategy is the following: Get the 4mu and 2mu2e final states from the DoubleMuParked datasets and the 4e final state from the DoubleElectron dataset. This avoids double counting due to trigger overlaps. All MC contributions except top use data-driven normalization: The DY (Z/gamma^*) contribution is scaled to the Z peak. The ZZ contribution is scaled to describe the data in the independent mass range 180-600 GeV. The Higgs contribution is scaled to describe the data in the signal region. The (very small) top contribution remains scaled to the MC generator cross section.
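The data-driven normalization described above can be illustrated with a minimal sketch. It assumes that histograms are plain lists of bin contents with known bin edges. The function names (integral_in_range, scale_to_data) and all numbers are hypothetical toy values, not taken from the analysis macros or data in this record.

```python
def integral_in_range(contents, edges, lo, hi):
    """Sum the contents of all bins lying fully inside [lo, hi]."""
    total = 0.0
    for i, value in enumerate(contents):
        if edges[i] >= lo and edges[i + 1] <= hi:
            total += value
    return total


def scale_to_data(mc, data, edges, lo, hi):
    """Scale an MC template so its integral matches the data in [lo, hi].

    Returns the scaled bin contents and the scale factor.
    """
    n_data = integral_in_range(data, edges, lo, hi)
    n_mc = integral_in_range(mc, edges, lo, hi)
    if n_mc <= 0.0:
        raise ValueError("MC template is empty in the control region")
    factor = n_data / n_mc
    return [factor * value for value in mc], factor


# Toy histograms (hypothetical numbers): three four-lepton mass bins in GeV.
edges = [100.0, 180.0, 390.0, 600.0]
zz_mc = [4.0, 10.0, 6.0]
data = [7.0, 13.0, 7.0]

# Normalize the ZZ template in the sideband 180-600 GeV, as in the strategy above.
zz_scaled, k = scale_to_data(zz_mc, data, edges, 180.0, 600.0)
print(k)          # 1.25
print(zz_scaled)  # [5.0, 12.5, 7.5]
```

The scale factor is derived only from bins inside the control range but is applied to the whole template, so the prediction in the signal region is fixed by the sideband rather than by the generator cross section.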
In addition to the instructions below, which guide you through the example in detail, a GitHub repository based on this original example is also provided. The ROOT files needed for the Level 3 exercise can be found here.
There are four levels of increasing complexity for this example:
cd rootfiles and download the preproduced *.root histogram files given in this record for all relevant samples to this directory
wget http://opendata.web.cern.ch/record/5501/files/rootfilelist.txt and then
wget -i rootfilelist.txt
root, and on the ROOT prompt, type
TBrowser t, then double-click on the relevant file
root -l M4Lnormdatall.cc
file->Quit ROOT or, on the ROOT prompt, type
Demo/DemoAnalyzer/ directory, which is created following Step 2: How to test and validate, replace
BuildFile.xml by the version downloaded from this record
mkdir datasets and change to this directory
rootfiles and download all the level 2 ROOT files to this directory (see level 2)
cmsRun demoanalyzer_cfg_level3data.py will produce output file
DoubleMuParked2012C_10000_Higgs.root containing 1 Higgs candidate from the data
cmsRun demoanalyzer_cfg_level3MC.py will produce output file
Higgs4L1file.root containing the Higgs signal distributions with reduced statistics
rootfiles directory, together with the predefined files
mv DoubleMuParked2012C_10000_Higgs.root rootfiles/.
mv Higgs4L1file.root rootfiles/.
cd rootfiles and download the macro M4Lnormdatall_lvl3.cc to this directory
root -l M4Lnormdatall_lvl3.cc
file->Quit ROOT or, on the ROOT prompt, type
datasets directory (you can find the links to the datasets in this record)
datasets directory (in which you should already have the 2012 one)
MCsets directory (after having created it)
cmsRun demoanalyzer_cfg_level4...) sequentially on all the input samples listed in
List_indexfile.txt, i.e. produce all ROOT output files yourself. If you have access to a computer farm with local support for the installation of the CMS software (the Open Data team can only provide support for the single virtual machine mode), you may also run the analysis in parallel on different CPUs, which speeds up the processing accordingly.
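Running the analysis sequentially over a list of input samples could be scripted along the following lines. This is only a sketch under stated assumptions: the list file is assumed to hold one entry per line (blank lines and lines starting with '#' are skipped), and make_command is a hypothetical user-supplied function that turns an entry into the command to run, for example a cmsRun call with the matching configuration file. How each configuration selects its input sample is not specified here and has to be taken from the configuration files in this record.

```python
import subprocess


def read_sample_list(path):
    """Return the non-empty, non-comment lines of a text file.

    Assumption: one sample entry per line; lines starting with '#' are skipped.
    """
    with open(path) as handle:
        lines = [line.strip() for line in handle]
    return [line for line in lines if line and not line.startswith("#")]


def run_sequentially(samples, make_command, runner=subprocess.run):
    """Run one job per sample, one after the other, and collect return codes.

    make_command is a user-supplied (hypothetical) function that maps a
    sample entry to an argument list suitable for subprocess.run.
    """
    return_codes = {}
    for sample in samples:
        result = runner(make_command(sample))
        return_codes[sample] = result.returncode
    return return_codes
```

Collecting the return codes per sample makes it easy to see which jobs need to be rerun. On a computer farm, the same list of samples could instead be split across several workers.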
GNU General Public License (GPL) version 3