Project Details
Description
Many systems that are central to modern society, such as web search engines, smart assistants, generative AI, and web archives, rely on the ability to automatically load (a.k.a. "crawl") large numbers of web pages quickly. However, the "web crawler" software traditionally used to crawl the web is now insufficient for three reasons. First, many pages require users to be logged in. As a result, a traditional crawler sees only the login page and is blind to content that actual users would see. Second, the number of web pages is ever-increasing, and interactive pages and web applications have significantly increased the amount of computation necessary for a client to identify all the resources on a typical page. In combination, these factors make it significantly more expensive than before either to crawl a large corpus of sites or to recrawl pages frequently to capture changes. Third, many pages are dynamic or interactive, and many use embedded third-party services, such as maps, social media widgets, and language translation, that are either hampered or fail to work on crawled page copies. As a result, systems and studies that rely on content crawled from the web lack visibility into a large portion of the web, are unable to keep up with the rate at which they need to crawl pages, and end up replaying crawled pages with poor fidelity.
To address these challenges, this project will develop Sprinter, a modern web crawler capable of capturing the web and its rich services as seen and experienced by users. Sprinter will crawl any page such that the content crawled is representative of what users see on the page. Its overheads will grow sub-linearly with the number of pages and the frequency of monitoring. Any page crawled using Sprinter will be renderable in a manner that closely approximates the original page, both visually and functionally. To develop Sprinter, the project will make research contributions along three dimensions.
First, to crawl pages that require users to be logged in, the project will leverage the widespread support for authentication via single sign-on (SSO) providers such as Google and Facebook, and it will generate representative browsing profiles from privacy-preserving network traces. Second, to make Sprinter’s crawling efficient, the project will devise techniques to reuse application computations across similar pages and to identify a small representative subset of pages that Sprinter needs to measure at high frequency. Lastly, to enable high-fidelity replay of the crawled copy of a page, the project will develop methods to crawl all of the page’s resources that will be needed to serve any common load of that copy. A major broader impact of the project lies in the research and use cases that Sprinter will enable for the community. Further, Sprinter and the results of its crawls will be made available to other researchers.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Status | Active |
---|---|
Effective start/end date | 10/1/24 → 9/30/28 |
ASJC Scopus Subject Areas
- Information Systems
- Computer Networks and Communications
- Engineering (all)