A survey of open source tools for machine learning with big data in the Hadoop ecosystem
Abstract
Keywords
References
International Data Corporation. Digital Universe Study. 2014. http://www.emc.com/leadership/digital-universe/index.htm . Accessed 1 Jun 2015.
Ancestry.com Fact Sheet. http://corporate.ancestry.com/press/company-facts/ . Accessed 1 Jun 2015.
The R Project for Statistical Computing. http://www.r-project.org/ .
Weka. http://www.cs.waikato.ac.nz/ml/weka/ .
Apache Hadoop. https://hadoop.apache.org/ .
Laney D. 3D data management: controlling data volume, velocity and variety. META Group; 2001.
Demchenko Y, Grosso P, de Laat C, Membrey P. Addressing big data issues in scientific data infrastructure. In: 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, 2013. IEEE, pp. 48–55.
Cox M, Ellsworth D. Managing big data for scientific visualization. In: ACM Siggraph '97 course #4 exploring gigabyte datasets in real-time: algorithms, data management, and time-critical design, August, 1997.
Bekkerman R, Bilenko M, Langford J. Scaling up machine learning: parallel and distributed approaches. Cambridge: Cambridge University Press; 2011.
White T. Hadoop: The Definitive Guide, 3rd edn. Sebastopol, CA: O’Reilly Media, Inc.; 2012.
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E. Apache Hadoop YARN: Yet Another Resource Negotiator. In: Proceedings of the 4th annual Symposium on Cloud Computing; 2013.
Apache Hadoop 2.7.0 Documentation. http://hadoop.apache.org/docs/current/ . Accessed 5 Jan 2015.
Cloudera. http://www.cloudera.com/ .
Hortonworks. http://hortonworks.com/ .
MapR. https://www.mapr.com .
Project Voldemort. http://www.project-voldemort.com/voldemort/ .
Redis. http://redis.io/ .
Apache CouchDB. http://couchdb.apache.org/ .
MongoDB. https://www.mongodb.org/ .
Apache HBase. http://hbase.apache.org/ .
Apache Cassandra. http://cassandra.apache.org/ .
Titan Distributed Graph Database. http://thinkaurelius.github.io/titan/ .
Neo4j. http://neo4j.com/ .
OrientDB. http://orientdb.com/orientdb/ .
Apache Flume. https://flume.apache.org/ .
Apache Kafka. http://kafka.apache.org/ .
Apache Sqoop. http://sqoop.apache.org/ .
Apache Hive. http://hive.apache.org/ .
Apache Drill. http://drill.apache.org/ .
Fernández A, del Río S, López V, Bawakid A, del Jesus MJ, Benítez JM, Herrera F. Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks. Wiley Interdiscip Rev Data Min Knowl Discov. 2014;4(5):380–409.
Cascading. http://www.cascading.org/ .
Apache Pig. http://pig.apache.org/ .
Lin J, Kolcz A. Large-scale machine learning at twitter. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data; 2012. pp. 793–804.
Apache Tez. http://tez.apache.org/ .
Apache Oozie Workflow Scheduler for Hadoop. http://oozie.apache.org/ .
Apache Zookeeper. https://zookeeper.apache.org/ .
Hue. http://gethue.com/ .
MOA (Massive Online Analysis). http://moa.cs.waikato.ac.nz/ .
Hellerstein JM, Schoppmann F, Wang DZ, Fratkin E, Welton C. The MADlib Analytics Library or MAD Skills, the SQL. In: VLDB Endowment; 2012. pp. 1700–11.
Dato Core. https://github.com/dato-code/Dato-Core .
Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation; 2004.
DeWitt D, Stonebraker M. MapReduce: a major step backwards. Database Column; 2008.
DeMichillie G. Reimagining developer productivity and data analytics in the cloud—news from Google IO. 2014. http://googlecloudplatform.blogspot.com/2014/06/reimagining-developer-productivity-and-data-analytics-in-the-cloud-news-from-google-io.html . Accessed 5 Jan 2015.
Apache Giraph. http://giraph.apache.org/ .
Apache Hama. https://hama.apache.org/ .
Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A System for Large-Scale Graph Processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data; 2010. pp. 135–45.
Amazon EC2. http://aws.amazon.com/ec2/ .
Microsoft Azure. http://azure.microsoft.com/ .
Attenberg J. Conjecture: Scalable Machine Learning in Hadoop with Scalding. 2014. https://codeascraft.com/2014/06/18/conjecture-scalable-machine-learning-in-hadoop-with-scalding/ . Accessed 1 Jun 2015.
Zaharia M, Chowdhury M, Das T, Dave A. Fast and interactive analytics over Hadoop data with Spark. USENIX Login. 2012;37(4):45–51.
Bu Y, Howe B, Balazinska M, Ernst MD. HaLoop: efficient Iterative Data Processing on Large Clusters. Proceedings VLDB Endowment. 2010;3(1):285–96.
Jakovits P, Srirama SN. Evaluating MapReduce frameworks for iterative Scientific Computing applications. In: 2014 International Conference on High Performance Computing & Simulation; 2014. pp. 226–33.
Spark. https://spark.apache.org/ .
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing; 2010.
Ni Z. Comparative Evaluation of Spark and Stratosphere. Thesis, KTH Royal Institute of Technology; 2013.
Xin R. DataFrames for Large-Scale Data Science. Databricks TechTalk; 2015. https://www.youtube.com/watch?v=Hvke1f10dL0 .
Sort Benchmark Home Page. http://sortbenchmark.org/ . Accessed 1 Jun 2015.
Xin R. Spark officially sets a new record in large-scale sorting. 2014. http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html . Accessed 1 Jun 2015.
Cai Z, Gao J, Luo S, Perez LL, Vagena Z, Jermaine C. A comparison of platforms for implementing and running very large scale machine learning algorithms. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data (SIGMOD’14); 2014. pp. 1371–82.
MLlib. https://spark.apache.org/mllib/ .
GraphX. https://spark.apache.org/graphx/ .
Mahout. http://mahout.apache.org/ .
Zhang H, Tudor BM, Chen G, Ooi BC. Efficient in-memory data management: an Analysis. Proceedings VLDB Endowment. 2014;7(10):6–9.
Singh J. Big Data Analytic and Mining with Machine Learning Algorithm. Int J Inform Comput Technol. 2014;4(1):33–40.
Ousterhout K, Rasti R, Ratnasamy S, Shenker S, Chun B. Making Sense of Performance in Data Analytics Frameworks. In: Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15); 2015.
Shahrivari S, Jalili S. Beyond batch processing: towards real-time and streaming big data. Computers. 2014;3(4):117–29.
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I. Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing. University of California at Berkeley Technical Report No. UCB/EECS-2012-259; 2012.
Apache Storm. https://storm.apache.org/ .
Marz N. History of Apache Storm and lessons learned. 2014. http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html . Accessed 12 Apr 2015.
Khudairi S. The Apache Software Foundation Announces Apache™ Storm™ as a Top-Level Project. 2014. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces64 .
Lorica B. A real-time processing revival. Radar. 2015. http://radar.oreilly.com/2015/04/a-real-time-processing-revival.html .
Apache Thrift. http://thrift.apache.org/ .
Toshniwal A, Donham J, Bhagat N, Mittal S, Ryaboy D, Taneja S, Shukla A, Ramasamy K, Patel JM, Kulkarni S, Jackson J, Gade K, Fu M. Storm @Twitter. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data (SIGMOD’14); 2014. pp. 147–56.
Gradvohl ALS, Senger H, Arantes L, Sens P. Comparing Distributed Online Stream Processing Systems Considering Fault Tolerance Issues. J Emerg Technol Web Intell. 2014;6(2):174–9.
Feng A, Evans R, Dagit D, Roberts N. Storm-yarn. https://github.com/yahoo/storm-yarn .
Marz N, Warren J. Big data: principles and best practices of scalable realtime data systems. Manning Publications; 2015.
H2O. http://h2o.ai/ .
Real-time Predictions with H2O on Storm. https://github.com/h2oai/h2o-training/blob/master/tutorials/streaming/storm/README.md#real-time-predictions-with-h2o-on-storm .
Wasson T, Sales AP. Application-Agnostic Streaming Bayesian Inference via Apache Storm. In: The 2014 International Conference on Big Data Analytics; 2014.
Merienne P. Trident-ml. http://github.com/pmerienne/trident-ml .
Apache Flink. https://flink.apache.org/ .
Alexandrov A, Bergmann R, Ewen S, Freytag JC, Hueske F, Heise A, Kao O, Leich M, Leser U, Markl V, Naumann F, Peters M, Rheinländer A, Sax MJ, Schelter S, Höger M, Tzoumas K, Warneke D. The Stratosphere platform for big data analytics. VLDB J Int J Very Large Data Bases. 2014;23(6):939–64.
Ewen S, Schelter S, Tzoumas K, Warneke D, Markl V. Iterative Parallel Data Processing with Stratosphere: An Inside Look. In: Proceedings of the 2013 International Conference on Management of Data (SIGMOD’13); 2013. pp. 1053–6.
Leich M, Adamek J, Schubotz M, Heise A, Rheinländer A, Markl V. Applying Stratosphere for Big Data Analytics. In: 15th Conference on Database Systems for Business, Technology and Web (BTW 2013); 2013. pp. 507–10.
Flink-ML. https://github.com/apache/flink/tree/master/flink-staging/flink-ml .
Metzger R, Celebi U. Introducing Apache Flink—A new approach to distributed data processing. In: Silicon Valley Hands On Programming Events; 2014.
Chalmers S, Bothorel C, Picot-Clemente R. Big Data—State of the Art. Technical Report, Telecom Bretagne; 2013.
Singh D, Reddy CK. A survey on platforms for big data analytics. J Big Data. 2014;1:8.
Collier K, Carey B, Sautter D, Marjaniemi C. A methodology for evaluating and selecting data mining software. In: Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences, Maui, HI; 1999. IEEE, pp. 11.
Zhong S, Khoshgoftaar TM, Seliya N. Clustering-based network intrusion detection. Int J Reliab Qual Saf Eng. 2007;14(02):169–87.
Khoshgoftaar TM, Dittman DJ, Wald R, Awada W. A review of ensemble classification for DNA microarrays data. In: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI); 2013. pp. 381–9.
Apr 2015—Apache Mahout’s next generation version 0.10.0 released. http://mahout.apache.org/ . Accessed 16 Apr 2015.
Miller J. Recommender System for Animated Video. Issues Inform Syst. 2014;15(2):321–7.
Wegener D, Mock M, Adranale D, Wrobel S. Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters. In: 2009 IEEE International Conference on Data Mining Workshops; 2009. pp. 296–301.
Zeng C, Jiang Y, Zheng L, Li J, Li L, Li H, Shen C, Zhou W, Li T, Duan B, Lei M, Wang P. FIU-Miner: A Fast, Integrated, and User-Friendly System for Data Mining in Distributed Environment. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 2013. pp. 1506–9.
Geng X, Yang Z. Data Mining in Cloud Computing. In: Proceedings of the 2013 International Conference on Information Science and Computer Applications (ISCA 2013); 2013.
De Souza RG, Chiky R, Aoul ZK. Open Source Recommendation Systems for Mobile Application. In: Workshop on the Practical Use of Recommender Systems, Algorithms and Technologies (PRSAT 2010); 2010. pp. 55–8.
Seminario CE, Wilson DC. Case Study Evaluation of Mahout as a Recommender Platform. In: 6th ACM conference on recommender systems (RecSys 2012); 2012. pp. 45–50.
Lemnaru C, Cuibus M, Bona A, Alic A, Potolea R. A Distributed Methodology for Imbalanced Classification Problems. In: 2012 11th International Symposium on Parallel and Distributed Computing (ISPDC); 2012. pp. 164–71.
Hammond K, Varde AS. Cloud based predictive analytics: text classification, recommender systems and decision support. In: 2013 IEEE 13th International Conference on Data Mining Workshops; Dallas, TX, 2013, pp. 607–12.
Esteves RM, Pais R, Rong C. K-means Clustering in the Cloud—A Mahout Test. In: 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications; 2011. pp. 514–9.
Metz C. Mahout, There It Is! Open Source Algorithms Remake Overstock.com. Wired Magazine. 2012. http://www.wired.com/2012/12/mahout/ . Accessed 18 Dec 2014.
Jack K. Mahout becomes a researcher: Large Scale Recommendations at Mendeley. In: Big Data Week, Hadoop User Group UK; 2012.
Sumbaly R, Kreps J, Shah S. The big data ecosystem at LinkedIn. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD’13); 2013. pp. 1125–34.
Ingersoll G. Apache Mahout: Scalable machine learning for everyone. IBM Corporation; 2011.
Sparks ER, Talwalkar A, Smith V, Kottalam J, Pan X, Gonzalez J, Franklin MJ, Jordan MI, Kraska T. MLI: An API for Distributed Machine Learning. In: 2013 IEEE 13th International Conference on Data Mining; 2013. pp. 1187–92.
Zhao H. High Performance Machine Learning through Codesign and Rooflining. Dissertation, University of California at Berkeley; 2014.
Peng H, Liang D, Choi C. Evaluating Parallel Logistic Regression Models. In: 2013 IEEE International Conference on Big Data; 2013. pp. 119–26.
Rennie JDM, Shih L, Teevan J, Karger DR. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003); 2003.
Ingersoll G. Introducing Apache Mahout: Scalable, commercial-friendly machine learning for building intelligent applications. IBM Corporation; 2009.
Owen S, Anil R, Dunning T, Friedman E. Mahout in Action. Shelter Island, NY: Manning Publications; 2011.
Wang Y, Wei J, Srivatsa M, Duan Y, Du W. IntegrityMR: Integrity assurance framework for big data analytics and management applications. In: 2013 IEEE International Conference on Big Data; 2013. pp. 33–40.
Verma A, Cherkasova L, Campbell RH. Play It Again, SimMR! In: 2011 IEEE International Conference on Cluster Computing; 2011. pp. 253–61.
Janeja VP, Azari A, Namayanja JM, Heilig B. B-dIDS: Mining Anomalies in a Big-distributed Intrusion Detection System. In: 2014 IEEE International Conference on Big Data; 2014. pp. 32–4.
Singh K, Guntuku SC, Thakur A, Hota C. Big Data Analytics framework for Peer-to-Peer Botnet detection using Random Forests. Inf Sci. 2014;278:488–97.
Racette MP, Smith CT, Cunningham MP, Heekin TA, Lemley JP, Mathieu RS. Improving situational awareness for humanitarian logistics through predictive modeling. In: Systems and Information Engineering Design Symposium (SIEDS); 2014. pp. 334–9.
Ko KD, El-Ghazawi T, Kim D, Morizono H. Predicting the severity of motor neuron disease progression using electronic health record data with a cloud computing Big Data approach. In: 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology; 2014.
Li L, Bagheri S, Goote H, Hasan A, Hazard G. Risk Adjustment of Patient Expenditures: A Big Data Analytics Approach. In: 2013 IEEE International Conference on Big Data; 2013. pp. 12–4.
Zolfaghar K, Meadem N, Teredesai A, Roy SB, Chin S, Muckian B. Big data solutions for predicting risk-of-readmission for congestive heart failure patients. In: 2013 IEEE International Conference on Big Data; 2013. pp. 64–71.
Mylaraswamy D, Xu B, Dietrich P, Murugan A. Case Studies: Big Data Analytics for System Health Monitoring. In: 2014 International Conference on Artificial Intelligence (ICAI’14); 2014.
Esteves RM, Rong C. Using Mahout for Clustering Wikipedia’s Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud. In: 2011 IEEE Third International Conference on Cloud Computing Technology and Science; 2011. pp. 565–9.
Filimon D. Clustering of Real-time Data at Scale. In: Berlin Buzzwords; 2013.
Gao F, Abd-Almageed W, Hefeeda M. Distributed approximate spectral clustering for large-scale datasets. In: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing (HPDC’12); 2012. pp. 223–34.
Hussein T, Linder T, Gaulke W, Ziegler J. Hybreed: a software framework for developing context-aware hybrid recommender systems. User Model User-Adap Inter. 2014;24(1–2):121–74.
Yu H, Hsieh C, Si S, Dhillon IS. Parallel matrix factorization for recommender systems. Knowl Inf Syst. 2013;41(3):793–819.
Said A, Bellogín A. Comparative Recommender System Evaluation: Benchmarking Recommendation Frameworks. In: Proceedings of the 8th ACM Conference on Recommender systems (RecSys’14); 2014. pp. 129–36.
Zheng J, Dagnino A. An initial study of predictive machine learning analytics on large volumes of historical data for power system applications. In: 2014 IEEE International Conference on Big Data; 2014. pp. 952–9.
Katsipoulakis NR, Tian Y, Reinwald B, Pirahesh H. A Generic Solution to Integrate SQL and Analytics for Big Data. In: 18th International Conference on Extending Database Technology (EDBT); 2015. pp. 671–6.
Alber M. Big Data and Machine Learning: A Case Study with Bump Boost. Thesis, Free University of Berlin; 2014.
Lin CY, Tsai CH, Lee CP, Lin CJ. Large-scale logistic regression and linear support vector machines using spark. In: 2014 IEEE International Conference on Big Data; 2014. pp. 519–28.
Zhang C. DimmWitted: A Study of Main-Memory Statistical Analytics. 2014. arXiv Preprint arXiv:1403.7550.
Koutsoumpakis G. Spark-based Application for Abnormal Log Detection. Thesis, Uppsala University; 2014.
Powered By Spark. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark . Accessed 15 Dec 2014.
Talwalkar A, Kraska T, Griffith R, Duchi J, Gonzalez J, Britz D, Pan X, Smith V, Sparks E, Wibisono A, Franklin MJ, Jordan MI. MLbase: A Distributed Machine Learning Wrapper. In: NIPS Big Learning Workshop; 2012.
Kraska T, Talwalkar A, Duchi J, Griffith R, Franklin MJ, Jordan M. MLbase: A Distributed Machine-learning System. In: 6th Biennial Conference on Innovative Data Systems Research; 2013.
Pan X, Sparks ER, Wibisono A. MLbase: Distributed Machine Learning Made Easy. University of California Berkeley Technical Report; 2013.
Sparks ER, Talwalkar A, Franklin MJ, Jordan MI, Kraska T. TuPAQ: an efficient planner for large-scale predictive analytic queries. 2015. arXiv Preprint arXiv:1502.00068.
Sparks E. Scalable Automated Model Search. University of California at Berkeley Technical Report UCB/EECS-2014-122; 2014.
Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1–21.
Deeplearning4j. http://www.skymind.io/deeplearning4j/ .
KNIME. http://www.knime.org/ .
RapidMiner. https://rapidminer.com/ .
H2O. Algorithms Roadmap; 2015.
Kejela G, Esteves RM, Rong C. Predictive Analytics of Sensor Data Using Distributed Machine Learning Techniques. In: 2014 IEEE 6th International Conference on Cloud Computing Technology and Science; 2014. pp. 626–31.
Morales GDF, Bifet A. SAMOA: Scalable Advanced Massive Online Analysis. J Mach Learn Res. 2015;16:149–53.
Bifet A, Morales GDF. Big Data Stream Learning with SAMOA. In: 2014 IEEE International Conference on Data Mining Workshop (ICDMW); 2014. pp. 1199–202.
Severien AL. Scalable Distributed Real-Time Clustering for Big Data Streams. Thesis, Polytechnic University of Catalonia; 2013.
Romsaiyud W. Automatic Extraction of Topics on Big Data Streams through Scalable Advanced Analysis. In: 2014 International Computer Science and Engineering Conference (ICSEC); 2014. pp. 255–60.
Riondato M, DeBrabant JA, Fonseca R, Upfal E. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM’12); 2012. pp. 85–94.
SAMOA-MOA. https://github.com/samoa-moa/samoa-moa .
Kourtellis N, Morales GDF, Bonchi F. Scalable Online Betweenness Centrality in Evolving Graphs; 2015. arXiv Preprint arXiv:1401.6981.
Rahnama AHA. Real-time Sentiment Analysis of Twitter Public Stream. Thesis, University of Jyväskylä; 2015.
Rahnama AHA. Sentinel. http://ambodi.github.io/sentinel/ .
Qi B, Ma G, Shi Z, Wang W. Efficiently Finding Top-K Items from Evolving Distributed Data Streams. In: 2014 10th International Conference on Semantics, Knowledge and Grids (SKG); 2014.
Di Mauro M, Di Sarno C. A framework for Internet data real-time processing: A machine-learning approach. In: 2014 International Carnahan Conference on Security Technology (ICCST); 2014.
Distributed Weka. http://www.cs.waikato.ac.nz/ml/weka/bigdata.html .
Oryx. https://github.com/cloudera/oryx .
Oryx 2. https://github.com/OryxProject/oryx .
Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit .
Van Hulse J, Khoshgoftaar T. Knowledge discovery from imbalanced and noisy data. Data Knowl Eng. 2009;68(12):1513–42.
Khoshgoftaar TM, Van Hulse J. Imputation techniques for multivariate missingness in software measurement data. Software Quality J. 2008;16(4):563–600. http://dx.doi.org/10.1007/s11219-008-9054-7 .
Khoshgoftaar TM, Van Hulse J, Napolitano A. Comparing boosting and bagging techniques with noisy and imbalanced data. Syst Man Cybern Part A Syst Hum IEEE Trans. 2011;41(3):552–68.
Van Hulse J, Khoshgoftaar TM, Napolitano A, Wald R. Feature selection with high-dimensional imbalanced data. In: IEEE International Conference on Data Mining Workshops (ICDMW’09); 2009. pp. 507–14.
Van Hulse J, Khoshgoftaar TM, Napolitano A. Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning; 2007. pp. 935–42.
Hogan JM, Peut T. Large Scale Read Classification for Next Generation Sequencing. Procedia Comput Sci. 2014;29:2003–12.
Sun K, Miao W, Zhang X, Rao R. An Improvement to Feature Selection of Random Forests on Spark. In: 2014 IEEE 17th International Conference on Computational Science and Engineering (CSE); 2014. pp. 774–9.
Kandel S, Paepcke A, Hellerstein JM, Heer J. Enterprise data analysis and visualization: an interview study. IEEE Trans Visual Comput Graphics. 2012;18(12):2917–26.
Kelley I, Blumenstock J. Computational Challenges in the Analysis of Large, Sparse, Spatiotemporal Data. In: Proceedings of the sixth international workshop on Data intensive distributed computing; 2014. pp. 41–5.