Selecting text entries using a few positive samples and similarity ranking

doi:10.11118/actaun201159040399

Acta Univ. Agric. Silvic. Mendelianae Brun. 2011, 59(4), 399-408 | DOI: 10.11118/actaun201159040399

Jan Žižka¹, Arnošt Svoboda², František Dařena¹: ¹ Department of Informatics, Mendel University in Brno, Zemědělská 1, 613 00 Brno, Czech Republic; ² Department of Applied Mathematics and Computer Science, Faculty of Economics and Administration, Masaryk University, Lipová 41a, 602 00 Brno, Czech Republic

This research was inspired by the procedure that human bibliographic searchers use: given a few textual examples that are all 'positive' (relevant, interesting) and come from a single category, quickly and simply find, in an available collection of various unlabeled documents, the most similar ones, which belong to the relevant topic defined by the user. The problem of separating relevant from irrelevant unlabeled text documents is solved here by using a small subset of relevant patterns that were labeled manually in advance. Unlabeled text items are compared with these labeled patterns and then ranked according to their degree of similarity to them. The most similar (relevant) items appear at the top of the ranking; entries further from the top are progressively less similar. The authors emphasize that this simple method, aimed at processing large volumes of text entries, provides initial filtering results with respect to accuracy. Users avoid the demanding task of labeling many training examples just to be able to apply a chosen classifier, and at the same time they obtain the relevant items quickly. The results of the ranking-based approach can then be used in subsequent text-item processing, where the number of irrelevant items is no longer as high as at the beginning. Although this relatively simple automatic search is not error-free, because documents overlap, it can help in processing very large volumes of unstructured textual data.
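The ranking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure or term weighting the authors used, so the cosine similarity over raw term counts below is an assumption, as are the function names (`tokenize`, `cosine`, `rank_by_similarity`) and the toy documents, which are invented purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(positives, unlabeled):
    """Rank unlabeled documents by their mean similarity to the positive samples.

    Assumption: averaging the similarities over all positive samples is one
    possible scoring rule; the paper may use a different one.
    """
    pos_vecs = [Counter(tokenize(p)) for p in positives]
    scored = []
    for doc in unlabeled:
        vec = Counter(tokenize(doc))
        score = sum(cosine(vec, p) for p in pos_vecs) / len(pos_vecs)
        scored.append((score, doc))
    # Most similar (presumably relevant) documents come first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical toy data: a few manually labeled positive samples ...
positives = [
    "the hotel room was clean and quiet",
    "friendly staff and a clean hotel",
]
# ... and a small unlabeled collection to be filtered.
unlabeled = [
    "quiet hotel with clean rooms and friendly staff",
    "graphics card driver crashes on boot",
    "hotel staff lost my luggage",
]

ranked = rank_by_similarity(positives, unlabeled)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

In this sketch, a user would keep only the top of the ranking for further processing; where to cut the list is left open here, since the abstract does not specify a threshold.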

Keywords: unlabeled text documents, one-class categorization, text similarity, ranking by similarity, pattern recognition, machine learning, natural language processing, non-semantic documents

Grants and funding:

This paper was supported by the Research program of the Czech Ministry of Education, No. MSM 6215648904.

Received: February 25, 2011; Published: May 29, 2014

Žižka, J., Svoboda, A., & Dařena, F. (2011). Selecting text entries using a few positive samples and similarity ranking. Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 59(4), 399-408. doi: 10.11118/actaun201159040399

This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY NC ND 4.0), which permits non-commercial use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.
