Acta Univ. Agric. Silvic. Mendelianae Brun. 2013, 61(4), 973-979 | DOI: 10.11118/actaun201361040973

Data pre-processing for web log mining: Case study of commercial bank website usage analysis

Jozef Kapusta1, Anna Pilková2, Michal Munk1, Peter Švec1
1 Department of Computer Science, Constantine the Philosopher University in Nitra, Tr. A. Hlinku 1, 949 74 Nitra, Slovakia
2 Department of Strategy and Entrepreneurship, Comenius University in Bratislava, Šafárikovo nám. 6, 818 06 Bratislava, Slovakia

In the pre-processing stage of data analysis we use data cleaning, integration, reduction and conversion methods. These techniques improve the overall quality of the patterns mined. The paper describes the use of standard pre-processing methods to prepare data from a commercial bank website, in the form of a log file obtained from the web server. Data cleaning, normally the simplest step of data pre-processing, is non-trivial here because the analysed content is highly specific. We had to deal with frequent changes of the content and even of the structure of the website. Regular changes in the structure make it impossible to use a sitemap. We present approaches to dealing with this problem: we were able to create the sitemap dynamically, based solely on the content of the log file. In this case study we also examined only one part of the website rather than performing the standard analysis of an entire website, as we did not have access to all log files for security reasons. As a result, the traditional practices had to be adapted for this special case. Analysing only a small fraction of the website resulted in short session times for regular visitors. We were not able to use the recommended methods to determine the optimal value of session time. Therefore, in this paper we propose new methods, based on outlier identification, for improving the accuracy of the session length.
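
The idea of deriving the session boundary from outliers, rather than from a fixed timeout, can be illustrated with a minimal Python sketch. The abstract does not state which outlier rule the authors applied, so the sketch assumes Tukey's upper fence (Q3 + 1.5 × IQR) computed over the gaps between consecutive requests of the same visitor. The function names `gap_threshold` and `sessionize` and the sample records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import inf
from statistics import quantiles


def gap_threshold(gaps):
    """Return Tukey's upper fence, Q3 + 1.5 * IQR, for a list of gaps (seconds).

    Assumption: this outlier rule is chosen for illustration only.
    """
    if len(gaps) < 2:
        return inf  # too little data to estimate quartiles; never split
    q1, _median, q3 = quantiles(gaps, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def sessionize(records):
    """Group (visitor, timestamp) pairs into per-visitor sessions.

    A new session starts whenever the gap to the previous request is an
    outlier, i.e. exceeds the threshold estimated from all observed gaps.
    """
    by_visitor = defaultdict(list)
    for visitor, timestamp in records:
        by_visitor[visitor].append(timestamp)
    for times in by_visitor.values():
        times.sort()

    # Gaps between consecutive requests of the same visitor.
    gaps = [b - a for times in by_visitor.values() for a, b in zip(times, times[1:])]
    threshold = gap_threshold(gaps)

    sessions = {}
    for visitor, times in by_visitor.items():
        current = [times[0]]
        sessions[visitor] = [current]
        for prev, cur in zip(times, times[1:]):
            if cur - prev > threshold:
                current = [cur]  # outlier gap: open a new session
                sessions[visitor].append(current)
            else:
                current.append(cur)
    return sessions


# Hypothetical, already-cleaned log records: (visitor id, time in seconds).
records = [("a", t) for t in (0, 10, 25, 40, 50, 3000, 3010, 3020)]
records += [("b", 5), ("b", 20)]
sessions = sessionize(records)
```

With these sample records, only the long pause of visitor "a" exceeds the fence, so that visitor's requests are split into two sessions while visitor "b" keeps a single one. Because the threshold is estimated from the data, it adapts to the short visits that result from analysing only a fraction of a website.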

Keywords: association rules, web log mining, business intelligence, financial regulation, market discipline, data preprocessing methodology
Grants and funding:

This paper is supported by the project VEGA 1/0392/13 "Modelling of Stakeholders' Behaviour in Commercial Bank during the Recent Financial Crisis and Expectations of Basel Regulations under Pillar 3 - Market Discipline".

Received: April 11, 2013; Published: July 13, 2013

Kapusta, J., Pilková, A., Munk, M., & Švec, P. (2013). Data pre-processing for web log mining: Case study of commercial bank website usage analysis. Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 61(4), 973-979. doi: 10.11118/actaun201361040973


This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY NC ND 4.0), which permits non-commercial use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.