The rapid growth of Internet resources brings the problem of information overload: people want to read the most interesting messages and to avoid low-quality or uninteresting ones. Web filtering is the activity of classifying a stream of incoming web pages dispatched in an asynchronous way by an information producer to an information consumer (Belkin & Croft 1992). It helps people find the most interesting and valuable information and saves Internet users from drowning in the information flood.
In recent years the machine learning (ML) paradigm, rather than knowledge engineering by domain experts, has become the more popular way to solve this problem because of its ability to learn automatically and to analyze relevance. The typical procedure for applying a machine learning algorithm to web filtering can be described as follows: a general inductive process automatically builds a web page filter by learning from a set of pre-assigned pages, namely from the characteristics of the different categories of user interest, and this filter then decides whether an incoming web page accords with the user interest.
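To make this procedure concrete, the following is a minimal sketch assuming scikit-learn; the pages, labels, and test text are hypothetical placeholders, and the paper's own algorithm (introduced below) would replace the plain linear SVM used here.

```python
# Minimal sketch of the inductive procedure, assuming scikit-learn.
# The pages and labels are hypothetical stand-ins for pre-assigned pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

pages = [
    "financial news page about stock markets and earnings",
    "celebrity gossip page with photo galleries",
]
labels = [1, -1]  # 1 = matches the user interest, -1 = does not

# The inductive process: learn the characteristics of each category
# from the pre-assigned pages and build a page filter.
page_filter = make_pipeline(TfidfVectorizer(), LinearSVC())
page_filter.fit(pages, labels)

# The learned filter decides whether a new page accords with the interest.
print(page_filter.predict(["new page discussing quarterly earnings"]))
```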
An accurate description of the user interest is the critical precondition of web filtering, and the precision and recall of the page filter are its main problem. In practice, each web filtering task stands for a different user interest. For example, in a search engine task, all web pages containing the key words are of interest. In a harmful-information filtering task, only the web pages with the user-specified orientation should be selected, even though many more pages may be related to the user interest. Thus, for different filtering tasks, which imply different user interests, the result sets are of different sizes, and all of them are subsets of the set of related web pages. The web filtering results can therefore be divided into three categories: related pages, similar pages, and homologous pages, each of which corresponds to a kind of user interest.
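Since precision and recall are the quantities at stake, a plain-Python sketch of the two standard measures, computed over hypothetical sets of page IDs, may help fix terms:

```python
# Precision and recall of a page filter, sketched over hypothetical
# sets of page IDs; these are the standard definitions, not paper-specific.
def precision_recall(selected: set, relevant: set) -> tuple:
    hits = len(selected & relevant)  # correctly selected pages
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: the filter selects 4 pages, 3 of which are truly relevant.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))  # (0.75, 0.6)
```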
To achieve a more precise filtering result, the inductive process is improved so that its precision and recall better match the user interest. The improved machine learning algorithm in this paper is based on the Support Vector Machine (SVM) because, among the generic machine learning algorithms (decision trees, rule induction, Bayesian methods, and SVM), SVM has been shown to be superior to the others, resting on the solid foundation of Statistical Learning Theory (SLT).
The improved algorithm is called Biased Support Vector Machine (BSVM). It introduces a stimulant function and uses the training example distribution n+/n− together with a user-adjustable parameter k to penalize the different classes of pre-assigned pages unequally, so that the filtering result can be adjusted to best fit the user interest.
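The paper's exact stimulant-function formulation is not reproduced in this excerpt; as a hedged sketch, the biased-penalty idea can be realized with scikit-learn's per-class weights, scaling the positive (user-interest) penalty by the class ratio n−/n+ and the parameter k:

```python
# Sketch of biased per-class penalties, assuming scikit-learn's SVC;
# this illustrates the idea behind BSVM, not the paper's exact algorithm.
from sklearn.svm import SVC

def biased_svm(X, y, k=1.0, C=1.0):
    """Fit an SVM whose error penalty differs per class.

    y uses 1 for pages of user interest and -1 otherwise; the positive
    penalty is scaled by the class ratio n-/n+ and the user parameter k,
    which biases the filter toward recall (k > 1) or precision (k < 1).
    """
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    weights = {1: k * n_neg / max(n_pos, 1), -1: 1.0}
    clf = SVC(kernel="linear", C=C, class_weight=weights)
    clf.fit(X, y)
    return clf
```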
Biased Support Vector Machine for Web Filtering

This section gives a detailed analysis of the user interest and the filtering result.

Analysis of User Interest

In practical web filtering applications, the set of web pages related to the user interest is considerably large, but users may be interested in only a few homologous pages or in all the related ones, depending on page subject, the writer's viewpoint, and the orientation of expression. Web filtering tasks can therefore be divided into three levels according to the user interest. First, relativity-filtering: the result contains all web pages with the same key phrases or key sentences. These pages express the same subject but may not be consistent in viewpoint or orientation.
Typical applications of relativity-filtering include erotic web page filtering and hot-topic tracking, which aim to collect all pages related to a topic regardless of whether they approve of it. Second, similarity-filtering: the result contains all web pages that share the user's subject, viewpoint, and orientation. Typical applications of similarity-filtering include filtering web pages on racism or separatism. Similarity-filtering is stricter than relativity-filtering, since not only key words or sentences but also orientation is taken into consideration.
Last, homology-filtering: the result contains only web pages that share a large number of identical sentences or paragraphs. Such results are almost identical to the pages of user interest, usually because articles from official or authoritative websites have been redistributed by other websites with little modification. An example of homology-filtering is counting which article is reprinted most often on Bulletin Board Systems.
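The excerpt does not give a concrete homology measure; one simple illustrative sketch, not the paper's method, is to compare the sets of sentences two pages share:

```python
# A simple way to score homology between two pages (an illustrative
# sketch, not the paper's method): Jaccard overlap of their sentence sets.
import re

def sentence_set(text: str) -> set:
    return {s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()}

def homology_score(page_a: str, page_b: str) -> float:
    """Close to 1.0 when pages share most sentences, e.g. a reprint."""
    a, b = sentence_set(page_a), sentence_set(page_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0
```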
On the other hand, with the proliferation of harmful Internet content such as pornography, violence, hate messages, and other objectionable material, effective content-filtering systems are essential. At present, according to Lee, Hui, and Fong (2002: 48-57), there are four content-filtering approaches: Platform for Internet Content Selection (PICS), URL blocking, keyword filtering, and intelligent content analysis. PICS is a voluntary self-labeling system: each Web content publisher is responsible for rating its own content, so it is very difficult to filter Web pages reliably by the embedded PICS rating labels, since not every publisher provides them. URL blocking restricts or allows access by comparing the requested Web page's URL against a stored list.
This technique requires maintaining a URL list and can identify only the sites on that list, and keeping the list up to date is very difficult: unless the list is updated constantly, the system's accuracy will decrease over time owing to the explosive growth of new Web sites. Keyword filtering compares the words and phrases on a retrieved Web page against those in a dictionary of prohibited words and phrases; blocking occurs when the number of matches reaches a predefined threshold.
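A minimal sketch of this mechanism in plain Python, with a hypothetical dictionary and threshold:

```python
# Threshold-based keyword filtering, sketched in plain Python.
# PROHIBITED and THRESHOLD are hypothetical placeholders.
import re

PROHIBITED = {"badword1", "badword2", "badword3"}
THRESHOLD = 3  # block once this many prohibited matches are seen

def should_block(page_text: str) -> bool:
    words = re.findall(r"[\w']+", page_text.lower())
    matches = sum(1 for word in words if word in PROHIBITED)
    return matches >= THRESHOLD
```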
However, keyword filtering is well known for overblocking: a high overblocking rate is often unacceptable and greatly undermines the system's usefulness. Intelligent content analysis, which can automatically classify Web content, is therefore needed. Meanwhile, web-filtering systems are either client-based or server-based. A client-based system performs Web content filtering solely on the computer where it is installed, without consulting remote servers about the nature of the Web content a user tries to access. A server-based system provides filtering to the computers on the local area network where it is installed.