ALOJA Project

The ALOJA research project is an initiative of the Barcelona Supercomputing Center (BSC) to explore new hardware architectures for Big Data processing. One of the main goals of the project is to produce a systematic study of software and hardware configuration and deployment options, analyzing the cost-effectiveness of different cloud services as well as on-premise hardware, both commodity and high-end.

ALOJA + Machine Learning

ALOJA-ML is the set of autonomous machine learning scripts built to run within the ALOJA project. ALOJA-ML performs data mining, modeling, and prediction on the datasets generated by the ALOJA project.

Here you can find links to the GitHub pages for ALOJA and ALOJA-ML, and to the Barcelona Supercomputing Center. You can also find the project publications and the structured datasets associated with those publications.

  • ALOJA Website
  • ALOJA-ML GitHub Page
  • Barcelona Supercomputing Center

ALOJA & ALOJA-ML Publications

David Buchaca, Josep Ll. Berral, David Carrera. Automatic Generation of Workload Profiles using Unsupervised Learning Pipelines. IEEE Transactions on Network and Service Management (TNSM), vol.15 issue.1 pp.142-155 (2017). ISSN 1932-4537. Open Access.

Josep Ll. Berral, Nicolas Poggi, David Carrera, Aaron Call, Rob Reinauer, Daron Green. ALOJA: A Framework for Benchmarking and Predictive Analytics in Big Data Deployments. IEEE Transactions on Emerging Topics in Computing (TETC), vol.5 issue.4 pp.480-493 (2017). ISSN 2168-6750. arXiv:1511.02037.

Nicolas Poggi, Josep Ll. Berral, David Carrera. ALOJA: a Benchmarking and Predictive Platform for Big Data Performance Analysis. The Sixth Workshop on Big Data Benchmarking (6th WBDB). June 16-17, 2015 in Toronto, Canada.

Nicolas Poggi, Josep Ll. Berral, David Carrera, Aaron Call, Rob Reinauer, Nikola Vujic, Daron Green, José Blakeley, Fabrizio Gagliardi. From Performance Profiling to Predictive Analytics while Evaluating Hadoop Cost-Efficiency in ALOJA. The IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara (CA), USA, Oct. 29-Nov. 1 2015.

Josep Ll. Berral, Nicolas Poggi, David Carrera, Aaron Call, Rob Reinauer, Daron Green. ALOJA-ML: A Framework for Automating Characterization and Knowledge Discovery in Hadoop Deployments. The 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2015), Sydney, Australia, August 10-13 2015. arXiv:1511.02030.

Nicolas Poggi, David Carrera, Aaron Call, Rob Reinauer, Nikola Vujic, Daron Green and Jose Blakeley, et al. ALOJA: a Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost-Effectiveness. IEEE BigData 2014. 27-30 Oct. 2014. Washington DC, USA.

Data-sets

The following datasets are simplified CSV versions of the data stored in ALOJA. You can download them to perform machine learning tests, workload reproduction, and simulation, as long as you cite us (check each dataset's references for the corresponding publication). If you obtain good (or interesting) results when applying your methods and techniques to predict elements of our datasets, contact us to publish the results in our method rankings, providing the details, notes, evidence, or a link to your corresponding publication.

Here you can find the LEGAL NOTICE covering BSC-CNS and the content of this site.

ALOJA Spark Time-Series Dataset v1

This dataset comprises 900 executions of 30 different Spark applications from the TPCx-BB (BigBench) benchmark, covering different types of workloads (NLP, SQL, MapReduce, Machine Learning, UDTFs...), different data types (structured, semi-structured, and unstructured), and different data scales (1, 10, and 100 GB). All jobs were run on Microsoft's Azure cloud using Spark 2 as the engine and the HDInsight PaaS to spawn the Spark clusters, on a cluster of 16 slave nodes, with data stored in Azure Data Lake Store. Reference: the Workload Profiles paper at IEEE-TNSM'17.

Dataset details:
Number of entries: 900 executions, 30 different Spark applications (7121338 time entries)
Notes: The 'simplified' files contain the aggregate information for all nodes. The 'complete' files contain the information separated for each headnode and datanode.
Some attributes may have missing values, marked as -1 when Not Available.
Dataset features:
Time Attributes: timestamp, interval, instant
Execution Attributes: job_name, disk (type), query_name, cached, platform, engine, query (number)
Performance Attributes: X.{usr, nice, sys, iowait, steal, irq, soft, guest, gnice, idle} (cpu percent), kbmem{used, free}, X.{memused, commit} (mem percent), kb{buffers, cached, commit, active, inact, dirty, anonpg, slab, kstack, pgtbl, vmused}, {rx, tx}pck.s, {rx, tx}Kb.s, {rx, tx}cmp.s, {rx, tx}cst.s, X.ifutil (iface percent), tps, {rd, wr}_sec.s, avgrq.sz, avgqu.sz, await, svctm, X.util (disk percent)
Download Dataset »
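Since some attributes in these CSV files use -1 as a Not Available marker, it is usually worth mapping that marker to an explicit missing value before analysis. A minimal Python sketch; the column names and values below are an illustrative excerpt, not the real file contents:

```python
import csv
import io

# Illustrative excerpt of a time-series CSV; -1 marks Not Available,
# following the dataset notes above.
sample = io.StringIO(
    "timestamp,interval,instant,kbmemused,tps\n"
    "1490000000,5,0,1048576,37.2\n"
    "1490000005,5,1,-1,41.0\n"
)

def load_rows(fh):
    """Read a CSV and replace the -1 'Not Available' marker with None."""
    return [
        {k: (None if v == "-1" else v) for k, v in row.items()}
        for row in csv.DictReader(fh)
    ]

rows = load_rows(sample)
print(rows[1]["kbmemused"])  # None
```

The same function works unchanged on a file handle opened over the downloaded CSV.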

ALOJA Hadoop Time-Series Dataset v1

This dataset contains 182 series from Hadoop executions of the Intel HiBench benchmark suite, with MapReduce algorithms for sorting, word counting, machine learning, input-output stress testing, etc. All jobs were run on on-premise infrastructures, with similar Hadoop configurations. Reference: the ALOJA paper at IEEE-TETC'17.

Dataset details:
Number of entries: 182 executions (368683 time entries)
Notes: Contains Hadoop execution logs per time unit, indicating timestamp, number of workers, and consumed resources.
Some of these attributes may have missing values, marked as -1 when Not Available.
Dataset features:
Time Attributes: instant, date
Execution Attributes: id_JOB_job_status, id_exec, job_name, JOBD, bench
Hadoop Workers: maps, shuffle, merge, reduce, waste
Performance Attributes: pc.{user, system, iowait}, kbmemused, {rx, tx}pck.s, tps, rtps, wtps
Download Dataset »
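A common first step with this kind of per-time-unit data is to aggregate a metric into one value per execution. A small sketch over hypothetical rows keyed by id_exec (the field names follow the feature list above, but the values are made up for illustration):

```python
from collections import defaultdict

# Hypothetical per-time-unit rows; id_exec and pc.user follow the
# feature names listed above, values are illustrative.
rows = [
    {"id_exec": "exec_1", "instant": 0, "pc.user": 12.0},
    {"id_exec": "exec_1", "instant": 1, "pc.user": 55.5},
    {"id_exec": "exec_2", "instant": 0, "pc.user": 20.0},
]

def mean_per_execution(rows, field):
    """Average a per-time-unit metric over each execution,
    skipping the -1 Not Available marker."""
    acc = defaultdict(list)
    for r in rows:
        if r[field] != -1:
            acc[r["id_exec"]].append(r[field])
    return {k: sum(v) / len(v) for k, v in acc.items()}

print(mean_per_execution(rows, "pc.user"))  # {'exec_1': 33.75, 'exec_2': 20.0}
```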

ALOJA Hadoop Dataset v6

This dataset contains traces of Hadoop executions. It is a slice of the ALOJA Dataset v5, including the aggregated resource performance per execution. Executions include, at least, performance information for CPU and memory; some executions miss network or disk information. Reference: the ALOJA-ML paper at KDD'15.

Dataset details:
Number of entries: 33147 executions
Notes: Some of these attributes (Valid, Filter, Outlier) are not completely reliable, as they are based on automatic filtering of executions. Beware when using them.
Comp (compression) is coded as [0: None, 1: ZLIB, 2: BZIP2, 3: Snappy]
Dataset features:
Execution Attributes: ID, Start.Time, End.Time, Valid, Filter, Outlier, Perf.Details, Run.Num.
Configuration Attributes: Benchmark, Net, Disk, Bench.Type, Maps, IO.SFac, Rep, IO.FBuf, Comp, Blk.Size, Hadoop.Version, Exec.Type, Datasize, Scale.Factor, Java.XMS, Java.XMX
Cluster Attributes: Cluster (ID), Cl.Name, Service.Type, Datanodes, Headnodes, VM.Size, VM.OS, VM.Cores, VM.RAM, Provider
Cost values: Cost.Remote, Cost.SSD, Cost.IB, Cost.Hour
Time Performance Attributes: Exe.Time
Resource Performance Attributes:
CPU features (percentage from single CPUs): {avg, max, min, stdev_pop, var_pop}.user, {avg, max, min, stdev_pop, var_pop}.nice, {avg, max, min, stdev_pop, var_pop}.system, {avg, max, min, stdev_pop, var_pop}.iowait, {avg, max, min, stdev_pop, var_pop}.steal, {avg, max, min, stdev_pop, var_pop}.idle
Memory features: {avg, max, min, stdev_pop, var_pop}.kbmemfree, {avg, max, min, stdev_pop, var_pop}.kbmemused, {avg, max, min, stdev_pop, var_pop}.memused (percentage from total mem), {avg, max, min, stdev_pop, var_pop}.kbbuffers, {avg, max, min, stdev_pop, var_pop}.kbcached, {avg, max, min, stdev_pop, var_pop}.kbcommit, {avg, max, min, stdev_pop, var_pop}.commit, {avg, max, min, stdev_pop, var_pop}.kbactive, {avg, max, min, stdev_pop, var_pop}.kbinact
Network features: {avg, max, min, stdev_pop, var_pop, sum}.rxpck.s, {avg, max, min, stdev_pop, var_pop, sum}.txpck.s, {avg, max, min, stdev_pop, var_pop, sum}.rxkB.s, {avg, max, min, stdev_pop, var_pop, sum}.txkB.s, {avg, max, min, stdev_pop, var_pop, sum}.rxcmp.s, {avg, max, min, stdev_pop, var_pop, sum}.txcmp.s, {avg, max, min, stdev_pop, var_pop, sum}.rxmcst.s
Disk features: {avg, max, min}tps, {avg, max, min, stdev_pop, var_pop, sum}rd_sec.s, {avg, max, min, stdev_pop, var_pop, sum}wr_sec.s, {avg, max, min, stdev_pop, var_pop}rq_sz, {avg, max, min, stdev_pop, var_pop}qu_sz, {avg, max, min, stdev_pop, var_pop}await, {avg, max, min, stdev_pop, var_pop}.util, {avg, max, min, stdev_pop, var_pop}svctm
Download Dataset »
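The integer-coded Comp column can be decoded with a small lookup table. A sketch, assuming the compression code 3 in the notes above refers to the Snappy codec:

```python
# Compression codec codes for the Comp column, per the dataset notes
# (code 3 is assumed to denote the Snappy codec).
COMP_CODEC = {0: "None", 1: "ZLIB", 2: "BZIP2", 3: "Snappy"}

def decode_comp(value):
    """Map an integer Comp code (possibly given as a string) to a codec name."""
    code = int(value)
    if code not in COMP_CODEC:
        raise ValueError(f"unknown Comp code: {value!r}")
    return COMP_CODEC[code]

print(decode_comp("2"))  # BZIP2
```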

ALOJA Hadoop Dataset v5

This dataset contains traces of Hadoop executions. It is the same dataset as ALOJA Dataset v4, with more executions. Reference: the ALOJA-ML paper at KDD'15.

Dataset details:
Number of entries: 43649 executions
Notes: Some of these attributes (Valid, Filter, Outlier) are not completely reliable, as they are based on automatic filtering of executions. Beware when using them.
Comp (compression) is coded as [0: None, 1: ZLIB, 2: BZIP2, 3: Snappy]
Dataset features:
Execution Attributes: ID, Valid, Filter, Outlier
Configuration Attributes: Benchmark, Net, Disk, Bench.Type, Maps, IO.SFac, Rep, IO.FBuf, Comp, Blk.Size, Hadoop.Version
Cluster Attributes: Cluster (ID), Cl.Name, (Service) Type, Datanodes, Headnodes, VM.Size, VM.OS, VM.Cores, VM.RAM, Provider
Time Performance Attributes: Exe.Time
Download Dataset »

Machine Learning results for this dataset:

  • Method: Regression Tree (M5P); Target: Execution Time; Val RAE: 0.16615; Test RAE: 0.18718; Notes: best M = 5, random sample 0.50 train vs. 0.25 validation vs. 0.25 test, averaged error from 10 trials
  • Method: Nearest Neighbors; Target: Execution Time; Val RAE: 0.18968; Test RAE: 0.18478; Notes: best k = 3, random sample 0.50 train vs. 0.25 validation vs. 0.25 test, averaged error from 10 trials
  • Method: Single Layer Perceptron; Target: Execution Time; Val RAE: 0.24541; Test RAE: 0.26099; Notes: best num. hidden units = 5, 1000 max. iterations, decay = 5e-4, random sample 0.50 train vs. 0.25 validation vs. 0.25 test, averaged error from 10 trials
  • Method: Polynomial Regression; Target: Execution Time; Val RAE: 0.23217; Test RAE: 0.25414; Notes: 3 degree multinomial, random sample 0.50 train vs. 0.25 validation vs. 0.25 test, averaged error from 10 trials
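The RAE figures above are, presumably, the standard relative absolute error: the model's total absolute error divided by the total absolute error of always predicting the mean of the target. A sketch of that metric, with toy values for illustration only:

```python
def rae(y_true, y_pred):
    """Relative Absolute Error: sum |y - yhat| / sum |y - mean(y)|."""
    mean_y = sum(y_true) / len(y_true)
    num = sum(abs(t, ) if False else abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t - mean_y) for t in y_true)
    return num / den

# Toy execution times (e.g. seconds) and model predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
print(rae(y_true, y_pred))  # 0.1
```

An RAE below 1.0 means the model beats the mean-only baseline; the 0.16-0.26 values in the table correspond to roughly 16-26% of the baseline's error.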


Acknowledgements

Contact

Barcelona Supercomputing Center

Barcelona Supercomputing Center-Centro Nacional de Supercomputación (BSC-CNS) is the national supercomputing centre in Spain. We specialise in high performance computing (HPC) and manage MareNostrum, one of the most powerful supercomputers in Europe, located in the Torre Girona chapel.