Building a Web Crawler for Text Data Indexing on Online Newspaper Web

Jamaludin Hakim, Andrian Sah, Siti Nurhayati, Wahyu Ciptaningrum, Damar Suryo Sasono

Abstract


The Internet has become a vast repository of information, but it is often filled with distractions that hinder the user experience. News content, for example, is usually interspersed with advertisements that interrupt the flow of reading. The fast pace of news publication is a further challenge: potentially more than 50 new articles can appear within 20 minutes. This high-speed data flow is valuable for various applications, including Social Media Analytics Services. In this context, the speed and efficiency of data acquisition (crawling) and processing (scraping) are critical. Both processes must be optimized so that data collection is comprehensive, has no gaps, and focuses on the latest information. To meet this need, we propose an application that captures news data in its entirety, minimizing the risk of missing important information. At the core of this solution is a web crawler, a program that automatically traverses the hyperlink structure of the web and systematically downloads linked pages to local storage. This crawling methodology is often the basis for web mining initiatives and search engine development. Since web information is distributed across billions of pages hosted on millions of servers worldwide, our application uses the PHP programming language to capture and process this data effectively. The main goal is to present pure news content to users without any irrelevant elements. We use a Data Flow Diagram (DFD) to model the system architecture and data flow. This approach provides a clear visualization of how web users can navigate through hyperlinks to efficiently access the desired news information. By implementing this system, we aim to improve the user experience of consuming news content, facilitate more effective data analysis, and contribute to the broader field of web information retrieval and processing.
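
The crawling step described in the abstract (following hyperlinks and saving each linked page locally) can be illustrated with a minimal sketch. This is not the authors' implementation: their application is written in PHP, while Python's standard library is used here purely for illustration. The names `LinkParser`, `fetch_http`, and `crawl`, the `max_pages` limit, and the same-host restriction are all assumptions introduced for the example, and an in-memory dictionary stands in for local storage.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl from seed_url, staying on the seed's host.

    Returns a dict mapping each visited URL to its HTML. The dict is an
    assumed stand-in for the local storage mentioned in the abstract.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Passing the fetcher as a parameter keeps the traversal logic separate from network access, so the crawl order can be checked against a small in-memory site before it is pointed at a real newspaper. A production crawler would also need politeness delays and robots.txt handling, which this sketch omits.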


Keywords


Web Crawler, Scraping, News Content, Data Flow Diagram, PHP.


DOI: https://doi.org/10.52088/ijesty.v4i4.677


Copyright (c) 2024 Jamaludin Hakim, Andrian Sah, Siti Nurhayati, Wahyu Ciptaningrum, Damar Suryo Sasono

International Journal of Engineering, Science and Information Technology (IJESTY) eISSN 2775-2674