Edward Tessen Tanaka
Jan 20, 2012
Featured

Criteria-based crowdsourcing in search engine algorithms and data mining

Data mining -- the seamless, hidden process by which algorithms present us with relevant search results based on the keywords we enter -- is reaching a critical juncture in the evolution of search technologies. Most of us, despite using search on a daily basis, have not noticed that the quality -- or in search terms, the relevancy -- of the information we receive has been drastically declining.

Therefore, competitors and disruptive inventors take note: The Achilles' heel of the entire search industry is the relevancy component of the algorithms utilized by all major search engines and of course those products that license their technologies. Unfortunately, this list includes all three primary international search engines: Google, Bing and Baidu, along with a host of specialized search engines and secondary specialized data mining tools.

Welcome to McGoogle

This major deficiency is becoming more apparent because the short-term solution used by Google (and many others) was to expand their content universe so they could claim technical proficiency and leadership through speed (of their algorithms) and size (of their data pool). From a quality standpoint, Google bragging about the number of pages it indexes (and the time it takes) is on par with the McDonald's tagline of over 100 billion served. While high numbers sound impressive, neither measure says anything about the quality of the product being presented, and both conveniently ignore the degradation that happens in the name of "efficiency."

In non-technical terms, these algorithms are flawed because, on a fundamental level, they rate relevancy through measures that -- once translated into mathematics -- ultimately rank the popularity of the content being sought. This internal validation system, unseen by users but felt by researchers, corporations and billions of internet users across the globe every time we search for information, is a legacy result both of the technology -- built by engineers who understand mathematics but not necessarily people -- and of the pay-per-click business model which supported this path.
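The popularity bias described above can be illustrated with a toy sketch of a simplified, PageRank-style link-popularity calculation -- one common mathematical translation of "relevancy." All page names and link data below are hypothetical:

```python
# Toy sketch (hypothetical data): simplified power-iteration link-popularity
# ranking. Nothing here measures the quality of a page's content --
# only how often other pages point to it.

links = {  # page -> pages it links to
    "popular-blog": ["encyclopedia"],
    "encyclopedia": ["popular-blog"],
    "expert-paper": ["encyclopedia"],  # cites others, is rarely cited itself
    "fan-site": ["popular-blog"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # iterate until the ranks settle
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += damping * rank[p] / len(outs)
    rank = new
```

In this toy graph the heavily linked pages accumulate rank while the uncited expert page does not, no matter which page actually contains the better information.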

Remember in high school when the idiot jock and the overrated cheerleader won homecoming King and Queen every single year? This is an apt analogy for search and data mining results, in particular those related to specialized domains of technical and culture-specific knowledge. Private corporate data mining tools suffer the same shortcomings because, on some level, the expertise of such communities and their moderators is restrained by their own knowledge of their own disciplines and by internal politics. Regardless of why the walls exist, the outcome is the same.

Criteria-based Crowdsourcing Algorithms

The solution to this epidemic may lie in the re-application of a concept called crowdsourcing.

Crowdsourcing, on a basic level, was devised as a way for companies to harness the creativity, specialized expertise and/or labor of the general public by incorporating their feedback (and bodies) into internal projects. Announcing a public contest around a product redesign and asking for feedback, for example, is crowdsourcing as originally intended. However, judging by the feedback of companies that have utilized the crowdsourcing model, results from this process have been very hit or miss. The original intent -- innovation with a massive multiplier -- has been left on the sidelines by organizations using crowdsourcing as a model for acquiring cheap labor or as a public relations platform for reasons of marketing and product visibility.

A better reapplication of crowdsourcing would be called 'criteria-based crowdsourcing.' An algorithm -- using behavioral heuristics -- would focus on mining information results within a set of declared criteria metrics by targeting public communities of knowledge outside of the sponsoring company. The system would then have an internal rating scale -- additional defined criteria -- that ranks the relevancy of the individuals within that community using traditional metrics, but also ranks those in the community who are not as relevant but perhaps seeking the same outcome in their research.
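As a rough sketch of what such a two-track rating scale might look like -- the metrics, weights and names below are hypothetical assumptions, not a defined specification:

```python
# Hypothetical sketch: score community members against declared criteria,
# keeping two tracks -- established experts, and less-credentialed members
# who are nonetheless seeking the same outcome.

from dataclasses import dataclass

@dataclass
class Member:
    name: str
    citations: int         # a traditional relevance metric
    criteria_match: float  # 0..1, overlap with the declared criteria set
    goal_match: float      # 0..1, similarity to the sponsor's research goal

def expert_score(m: Member) -> float:
    # Traditional track: weight established credentials plus criteria fit.
    return 0.6 * min(m.citations / 100, 1.0) + 0.4 * m.criteria_match

def seeker_score(m: Member) -> float:
    # Second track: few credentials, but pursuing the same outcome.
    return 0.7 * m.goal_match + 0.3 * m.criteria_match

community = [
    Member("established researcher", citations=250,
           criteria_match=0.9, goal_match=0.5),
    Member("passionate outsider", citations=3,
           criteria_match=0.6, goal_match=0.95),
]

experts = sorted(community, key=expert_score, reverse=True)
seekers = sorted(community, key=seeker_score, reverse=True)
```

The point of keeping both rankings is that the "passionate outsider" -- invisible to the traditional metric -- still surfaces on the second track.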

History often shows that paradigm shifts are discovered by those who have fundamentals in one subject area but are able to apply their foundation skills across numerous disciplines or recognize the impact that a discovery has on their own research despite being from a different science. Crowdsourcing addresses this issue by opening the playing field to individuals with passion and knowledge, but who may lack the shared vocabulary used by professionals in that specific sphere of influence.

A crowdsourcing-based algorithm would then communicate with another crowdsourcing-based algorithm -- in another sphere of knowledge -- and look for the commonality, or lack of commonality, between solution sets and trends in that field -- outside of domain-specific vocabulary -- and present results based not only on internal relevancy but also on data from a totally different crowdsourcing community with similar or complementary problems.

For example, data mining of established communities of knowledge -- created through controlled crowdsourcing communities of genetic researchers -- could contrast their activities and thoughts with those of a community of bioethicists and philosophical moralists. These tension points, once surfaced, would be presented in a format showing the gaps between current popular thinking and potential disruptive thinking, thus pointing toward areas of research outside the mainstream (or the popularity measure of relevancy).
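One naive way to surface such tension points -- the community names and topic sets below are hypothetical -- is to contrast what each community discusses once the shared vocabulary is set aside:

```python
# Hypothetical sketch: surface "tension points" between two communities
# by contrasting the topics each one raises, minus their common ground.

geneticists = {"crispr", "gene-editing", "germline", "sequencing", "off-target"}
bioethicists = {"consent", "germline", "gene-editing", "justice", "enhancement"}

shared = geneticists & bioethicists         # topics both communities raise
only_research = geneticists - bioethicists  # not yet examined by ethicists
only_ethics = bioethicists - geneticists    # not yet addressed by researchers

# Topics raised by only one community mark the gaps between current
# popular thinking and potential disruptive directions.
tension_points = sorted(only_research | only_ethics)
```

A production system would of course compare extracted trends rather than hand-picked keyword sets, but the principle -- ranking by cross-community divergence instead of internal popularity -- is the same.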

Criteria-based crowdsourcing using behavioral-based algorithms for insight, business drive and defined criteria, plus the ability to contrast thinking between complementary and non-complementary external groups to provide validation, is a better measure of true relevancy. By contrast, the relevancy component of existing algorithms is easily misled, yet it is used with a high degree of success in supporting the pay-per-click revenue model that it helped establish. These older systems have an innate bias toward helping individuals shop for consumer-oriented goods, find locations, read news and stay abreast of changes in popular culture, regardless of the accuracy (or need) of such information.

The implementation of criteria-based algorithms that incorporate the advantages of crowdsourcing offers a superior value proposition to those who capitalize on research, work in intellectual property or want to drive sales, by providing immediate access to quality data and information rather than sheer quantity. In addition, it offers a potential venue to disrupt the existing status quo in a variety of business and technology areas, because the most relevant data is not necessarily the most popular, but the most useful. Finally, it adds another layer of security to the internet, because those who deliberately manipulate search results to obtain high rankings or spread disinformation now have to overcome the idea attributed to Lincoln: “You can fool some of the people some of the time but you can’t fool all of the people all of the time.”