QI$^2$: an interactive tool for data quality assurance

AI and Ethics - Pages 1-9 - 2024
Simon Geerkens1, Christian Sieberichs1, Alexander Braun1, Thomas Waschulzik2
1University of Applied Sciences Düsseldorf, Düsseldorf, Germany
2Siemens Mobility GmbH, Erlangen, Germany

Abstract

The importance of high data quality is increasing with the growing impact and distribution of ML systems and big data. In addition, the planned AI Act from the European Commission defines challenging legal requirements for data quality, especially for the market introduction of safety-relevant ML systems. In this paper, we introduce a novel approach that supports the data quality assurance process for multiple data quality aspects. This approach enables the verification of quantitative data quality requirements. The concept and its benefits are introduced and explained using small example data sets. The application of the method is demonstrated on the well-known MNIST data set of handwritten digits.
