Acta Univ. Agric. Silvic. Mendelianae Brun. 2013, 61, 973-979
Published online 2013-07-13

Data pre-processing for web log mining: Case study of commercial bank website usage analysis

Jozef Kapusta1, Anna Pilková2, Michal Munk1, Peter Švec1

1Department of Computer Science, Constantine the Philosopher University in Nitra, Tr. A. Hlinku 1, 949 74 Nitra, Slovakia
2Department of Strategy and Entrepreneurship, Comenius University in Bratislava, Šafárikovo nám. 6, 818 06 Bratislava, Slovakia

Data cleaning, integration, reduction and conversion methods are used at the pre-processing stage of data analysis. Data pre-processing techniques improve the overall quality of the patterns mined. This paper describes the use of standard pre-processing methods to prepare data from a commercial bank website, in the form of a log file obtained from the web server. Data cleaning, usually the simplest step of pre-processing, is non-trivial here because the analysed content is highly specific. We had to deal with frequent changes to the content and even to the structure of the website. Regular changes in the structure make it impossible to use a sitemap. We present approaches to this problem; in particular, we were able to create the sitemap dynamically from the content of the log file alone. In this case study we also examined only one part of the website, rather than the entire website as in a standard analysis, because for security reasons we did not have access to all log files. As a result, traditional practices had to be adapted to this special case. Analysing only a small fraction of the website resulted in short session times for regular visitors. We were not able to use the recommended methods to determine the optimal value of the session time. Therefore, in this paper we propose new methods, based on outlier identification, to improve the accuracy of the session length.
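The idea of deriving a session time threshold from outlier identification can be illustrated with a minimal Python sketch. The abstract does not specify the outlier rule, so the sketch assumes the common Tukey fence (Q3 + 1.5 × IQR) applied to the gaps between consecutive requests of one visitor. The function names gap_threshold and sessionize, the parameter k and the sample timestamps are hypothetical and are not taken from the paper.

```python
from statistics import quantiles


def gap_threshold(gaps, k=1.5):
    """Upper Tukey fence Q3 + k * IQR over inter-request gaps (seconds).

    Assumption: the paper's actual outlier rule is not given in the
    abstract; the Tukey fence is used here only as a stand-in.
    """
    q1, _, q3 = quantiles(gaps, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def sessionize(timestamps, threshold):
    """Split one visitor's sorted timestamps into sessions.

    A new session starts whenever the gap to the previous request
    exceeds the threshold.
    """
    if not timestamps:
        return []
    sessions = [[timestamps[0]]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            sessions.append([cur])
        else:
            sessions[-1].append(cur)
    return sessions


# Hypothetical request times (seconds) of a single visitor.
timestamps = [0, 10, 25, 40, 50, 2000, 2010, 2030]
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
threshold = gap_threshold(gaps)
sessions = sessionize(timestamps, threshold)
```

With these sample values the single long pause of 1950 seconds lies above the fence, so the requests are split into two sessions. A fixed timeout chosen for whole-site analysis would not adapt in this way to the short visits observed on one part of a website.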

