Urdu Search Engine being developed to meet linguistic needs
ISLAMABAD: The authorities concerned are developing an Urdu search engine to help address national and linguistic needs and to build much-needed expertise in this area of research and development.
The project covers three areas: high-performance distributed computing, content search optimization and local content management.
Online content will be crawled and can optionally be filtered according to users' opt-in requests. This will require developing both language identification and filtering algorithms.
Once the information is sifted, the indexing scheme will be tuned for efficient retrieval. The content will also be summarized for quicker access through computers and mobile phones. It will be stored initially on Amazon Web Services and later on local computing infrastructure. Presentation of the Urdu content will be tuned for access across various devices.
The National ICT Research and Development Fund of the Ministry of Information Technology and Telecommunications is funding the project, which is being developed by the Al-Khwarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, at a cost of Rs. 33.22 million. The project is expected to be completed during 2019.
Official sources said on Friday that, although the project is based on open-source search technology, the work still presents multi-faceted challenges.
Automatic language identification is needed to ensure that Urdu content is appropriately tagged after crawling and not mixed with content in Persian, Arabic, Pushto and other languages that share vocabulary with Urdu. Further, search should be linguistically intelligent: it should ignore Urdu stop words, tokenize text properly and search across the different morphologically related forms of Urdu keywords.
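As a rough illustration of the language identification, tokenization and stop-word steps described above, here is a minimal Python sketch. It is not the project's method, which the sources did not detail. It assumes a simple heuristic: a handful of letters typical of Urdu orthography and not normally used in standard Arabic or Persian text are counted, and a page is tagged as Urdu when they make up a sufficient share of its letters. The marker set, the 5% threshold, the short stop-word list and all function names are assumptions made for this example. The heuristic would not separate Urdu from other languages that use the same letters, and it does not handle morphological variants.

```python
import re

# Letters typical of Urdu spelling (assumed marker set for this sketch):
# tte, ddal, rre, noon ghunna, he doachashmee, he goal, bari ye.
URDU_MARKERS = set("\u0679\u0688\u0691\u06BA\u06BE\u06C1\u06D2")

# A tiny illustrative stop-word list; a real list would be much longer.
URDU_STOP_WORDS = {
    "\u06A9\u0627",        # ka
    "\u06A9\u06CC",        # ki
    "\u06A9\u06D2",        # ke
    "\u06C1\u06D2",        # hai
    "\u0645\u06CC\u06BA",  # mein
    "\u0627\u0648\u0631",  # aur
    "\u0633\u06D2",        # se
}

# Runs of Arabic-script letters and combining marks. Digits and
# punctuation such as the Urdu full stop (U+06D4) are left out,
# so they act as token boundaries.
TOKEN_RE = re.compile(r"[\u0621-\u065F\u066E-\u06D3\u06D5\u06FA-\u06FF]+")


def tokenize(text):
    """Split text into Arabic-script word tokens."""
    return TOKEN_RE.findall(text)


def looks_like_urdu(text, threshold=0.05):
    """Return True if Urdu marker letters reach the given share of letters."""
    letters = "".join(tokenize(text))
    if not letters:
        return False
    markers = sum(1 for ch in letters if ch in URDU_MARKERS)
    return markers / len(letters) >= threshold


def index_terms(text):
    """Tokenize text and drop stop words before indexing."""
    return [tok for tok in tokenize(text) if tok not in URDU_STOP_WORDS]
```

In a crawler, a check like `looks_like_urdu(page_text)` would run before indexing, and `index_terms(page_text)` would supply the terms to store. A production system would more likely rely on a statistical model, such as one built on character n-grams, rather than a fixed marker set.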
The sources said that, in addition, the resulting pages must be ranked and ordered according to the optional user-initiated filtering. Finally, summarizing the results for mobile phone access and determining users' preferences and linguistically acceptable presentation forms both require detailed analysis before implementation.
To build an initial user base for the search engine, a marketing campaign will be organized. As the user base grows, online contextual advertising and other services will also be introduced to generate revenue for the sustainability and growth of the project.
There will also be an opportunity to make user search trend data available for commercial use and policy development. In addition, the language technology for language identification, text summarization, content filtering and related tasks can be commercialized independently.
The ability to obtain relevant Urdu information from online sources, with access through mobile phones, offers a great opportunity to the general public across Pakistan. It also opens further research opportunities to provide similar services in other Pakistani languages. The data will be crucial in enabling better online marketing at more affordable rates and in driving policy on online content development and presentation. Thus, the project holds both social and economic promise at a national scale.
Research indicates that indigenously developed search engines are more successful in communities accessing localized content, primarily because they offer language- and culture-specific services.
For example, Google held only 8%, 22% and 31% of the search market in South Korea, China and Japan respectively as of 2012, considerably smaller shares than those of locally developed search engines.