Presentation of Degree Project, first cycle

Title: Detection of web scraping
Respondents: Natan Teferi Asegehegn and Amar Hodzic
Date and time: Monday, 2020-06-08 (June 8), at 10:00 a.m.
Location: Online meeting through Zoom: https://kth-se.zoom.us/j/63400530999
Opponents: (up to 3), please contact the respondents directly if you want to be an opponent
Examiner: Fredrik Lundevall, supervisor is Fadil Galjic
Language: English
Registration: No registration is needed for listeners.

Abstract

In today’s connected society, information gathering has become vitally important. Web scraping is one way of amassing information; it is the act of extracting and restructuring information from web pages. Since web pages hold a great deal of data, any tool for collecting such data can be useful. However, the owners of a web page may want to prevent others from getting hold of such information, or at least know when it happens. Therefore, a system for detecting web scraping may be of great use to website owners.

In order to detect web scraping, a system needs to first differentiate between legitimate users and bots and then take the appropriate measures. The first step is a difficult task, since there are many tools that allow bots to appear like real users.

The aim of this thesis was to research the particulars of web scraping and gain insights into how one can detect such behavior, as well as to develop a small prototype for a non-intrusive detection system. The research method was qualitative, based on a literature study, an interview, and experience gained from creating scrapers at the company Panprices.

Finally, the prototype was tested with web traffic generated by both human and non-human interaction. The web traffic representing the human data set was gathered from Panprices’ official website, while the data set representing web scrapers was created by us. The prototype performed fairly well against less advanced scrapers but poorly against the more sophisticated ones.

Keywords: detection, prototype, request, web scraping, web page, website