54 Free Open-source Web Spiders, Crawlers and Scraping Solutions for Data Collection


Web crawling, scraping, and spiders are all related to the process of extracting data from websites.

Web crawling is the process of automatically gathering data from the internet, usually with the goal of building a database of information. This is often done by searching for links within web pages, and following those links to other pages.

Web scraping is similar, but focuses on extracting specific data from websites, rather than gathering data broadly. This can be done manually or with the use of specialized software.

Spiders are a type of software used for web crawling and web scraping. They are designed to automate the process of following links and extracting data from websites.
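
To make that loop concrete, here is a minimal illustrative sketch in Python (using the requests and BeautifulSoup libraries and a placeholder seed URL): it fetches a page, extracts its links, and follows them breadth-first up to a fixed depth. The spiders listed below automate this same idea at scale, with politeness, deduplication, and error handling built in.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed_url, max_depth=2):
    """Breadth-first crawl: fetch a page, collect its links, follow them."""
    visited = set()
    queue = [(seed_url, 0)]

    while queue:
        url, depth = queue.pop(0)
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages

        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.title.string if soup.title else "no title"
        print(f"[depth {depth}] {url} - {title}")

        # Scraping step: extract the specific data you care about here.
        # Crawling step: collect links and queue them for the next round.
        for anchor in soup.find_all("a", href=True):
            queue.append((urljoin(url, anchor["href"]), depth + 1))

    return visited

if __name__ == "__main__":
    crawl("https://example.org/")  # placeholder seed URL
```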

The best use-cases for web crawling, scraping, and spiders include:

  • Market research: gathering data on competitors, pricing, and industry trends.
  • Lead generation: finding potential customers and their contact information.
  • SEO: analyzing website structure and content to optimize search engine rankings.
  • Content aggregation: gathering data from multiple sources to create a database or comparison tool.

The target audience for web crawling, scraping, and spiders varies depending on the specific use-case. However, it is typically used by businesses, marketers, and researchers who are looking to gather and analyze large amounts of data from the internet.

1- Crawlab

Crawlab is a distributed web crawler management platform based on Golang that supports a wide range of languages, including Python, Node.js, Go, Java, and PHP. It also has the flexibility to work with various web crawler frameworks, such as Scrapy, Puppeteer, and Selenium.

It can be installed using Docker and Docker Compose in minutes.

GitHub - crawlab-team/crawlab: Distributed web crawler admin platform for spiders management regardless of languages and frameworks.

2- Gerapy

Gerapy is a self-hosted Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. It is available to install using Docker or from source.

Gerapy comes with developer-friendly documentation that allows developers to quickly start coding their scraping and crawling scenarios.

GitHub - Gerapy/Gerapy: Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

3- Serritor (Java)

Serritor is an open-source web crawler framework written in Java and built upon Selenium, which gives it the ability to crawl dynamic web pages that require JavaScript to render data.

GitHub - peterbencze/serritor: Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render data.

4- WBot

WBot is a configurable, thread-safe web crawler that provides a minimal interface for crawling and downloading web pages.

Features

  • Clean minimal API.
  • Configurable: MaxDepth, MaxBodySize, Rate Limit, Parallelism, User Agent & Proxy rotation.
  • Memory-efficient, thread-safe.
  • Provides built-in interface: Fetcher, Store, Queue & a Logger.
GitHub - twiny/wbot: A simple & efficient web crawler.

5- Abot

Abot is a C# web crawler framework that handles tasks like multithreading, HTTP requests, and link parsing. It's fast and flexible, and supports custom implementations of core interfaces. The Abot NuGet package is compatible with many .NET Framework/Core implementations: versions >= 2.0 target .NET Standard 2.0, while versions < 2.0 target .NET Framework 4.0.

Features

  • Open Source (Free for commercial and personal use)
  • It's fast, really fast!!
  • Easily customizable (Pluggable architecture allows you to decide what gets crawled and how)
  • Heavily unit tested (High code coverage)
  • Very lightweight (not over engineered)
  • No out of process dependencies (no databases, no installed services, etc...)
GitHub - sjdirect/abot: Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

6- ACHE Focused Crawler

ACHE is a focused web crawler. It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern.

ACHE differs from generic crawlers in that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain. A page classifier can be as simple as a regular expression that matches every page containing a specific word or as complex as a machine-learning based classification model.

Moreover, ACHE can automatically learn how to prioritize links to efficiently locate relevant content while avoiding the retrieval of irrelevant content.

ACHE supports many features, such as:

  • Regular crawling of a fixed list of websites
  • Discovery and crawling of new relevant websites through automatic link prioritization
  • Configuration of different types of page classifiers (machine learning, regex, etc.)
  • Continuous re-crawling of sitemaps to discover new pages
  • Indexing of crawled pages using Elasticsearch
  • Web interface for searching crawled pages in real-time
  • REST API and web-based user interface for crawler monitoring
  • Crawling of hidden services using TOR proxies
GitHub - VIDA-NYU/ache: ACHE is a web crawler for domain-specific search.

7- crawlframej

A simple framework for a focused web-crawler in Java.

The framework has been successfully used to build a focused web-crawler for a major blogging site.

GitHub - davidpasch1/crawlframej: Simple crawl framework for a focused web-crawler in Java.

8- Crawlee

Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.

Features

  • Single interface for HTTP and headless browser crawling
  • Persistent queue for URLs to crawl (breadth & depth first)
  • Pluggable storage of both tabular data and files
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling and retries
  • Dockerfiles ready to deploy
  • Written in TypeScript with generics

HTTP Crawling

  • Zero config HTTP2 support, even for proxies
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Integrated fast HTML parsers: Cheerio and JSDOM
  • Yes, you can scrape JSON APIs as well

Real Browser Crawling

  • JavaScript rendering and screenshots
  • Headless and headful support
  • Zero-config generation of human-like fingerprints
  • Automatic browser management
  • Use Playwright and Puppeteer with the same interface
  • Chrome, Firefox, Webkit and many others
GitHub - apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.

9- Apache Nutch (Java)

Apache Nutch is an extensible and scalable web crawler.

GitHub - apache/nutch: Apache Nutch is an extensible and scalable web crawler

10- Spidr (Ruby)

Spidr is a Ruby web spidering library that is capable of spidering a single site, multiple domains, specified links, or infinitely. Spidr's design prioritizes speed and ease of use.

Features:

  • Follows: a, iframe, frame, and cookie-protected links
  • Meta refresh direct support
  • HTTP basic auth protected links
  • Black-list or white-list URLs based upon: URL scheme, host name, port number, full link, or URL extension; optional /robots.txt support
  • HTTPS support
  • Custom proxy settings
GitHub - postmodern/spidr: A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

11- GOPA

GOPA is an open-source web spider written in Go. Its goals are focused on:

  • Lightweight, low footprint; memory requirement should be < 100 MB
  • Easy to deploy, no runtime or dependency required
  • Easy to use, no programming or scripting skills needed, out-of-the-box features
GitHub - infinitbyte/gopa: GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

12- ant

ant is a Go package that provides functions to scan data from a page directly into your structs or slices of structs, which reduces noise and complexity in your source code.

GitHub - yields/ant: A web crawler for Go

13- Scrapy (Python)

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
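
To give a feel for Scrapy's spider model, here is a short sketch in the spirit of Scrapy's own tutorial: it scrapes quotes from the quotes.toscrape.com demo site and follows pagination links (the selectors and field names are specific to that demo site).

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured records from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link and parse the next page the same way
        for href in response.css("li.next a::attr(href)"):
            yield response.follow(href, callback=self.parse)

# Run with: scrapy runspider quotes_spider.py -O quotes.json
```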

GitHub - scrapy/scrapy: Scrapy, a fast high-level web crawling & scraping framework for Python.

14- Creeper

This is a web crawler and scraper built in Python 3.9. It works with HTTP(S) and FTP(S) links and also allows you to scrape for emails and phone numbers if desired.

GitHub - z7r1k3/creeper: Web Crawler and Scraper

15- Python Web Crawler

This Python script is a web crawler that starts from a specific URL, follows the links it discovers while navigating the site, and keeps detailed logs for each URL it visits.

Features

  • Retrieves and searches HTML content from a URL.
  • Finds links in the explored HTML content and adds the URLs to the queue of pages to visit.
  • You can set the maximum depth level.
  • Keeps a list of visited URLs and does not revisit the same URL.
  • Provides appropriate error messages and exception handling.
  • Uses color logging.
GitHub - 0MeMo07/Web-Crawler: Web Crawler with Python

16- Colly

Colly is a scraping framework for Gophers that provides a clean interface for writing crawlers, scrapers, and spiders. It allows for easy extraction of structured data from websites, which can be used for data mining, processing, or archiving.

Features

  • Clean API
  • Fast (>1k request/sec on a single core)
  • Manages request delays and maximum concurrency per domain
  • Automatic cookie and session handling
  • Sync/async/parallel scraping
  • Caching
  • Automatic encoding of non-unicode responses
  • Robots.txt support
  • Distributed scraping
  • Configuration via environment variables
  • Extensions

GitHub - gocolly/colly: Elegant Scraper and Crawler Framework for Golang

17- pyspider

pyspider is a Robust Python Spider (Web Crawler) System.

Its features include:

  • Powerful WebUI with script editor, task monitor, project manager, and result viewer
  • Database backend options include: MySQL, MongoDB, Redis, SQLite, and Elasticsearch; PostgreSQL with SQLAlchemy
  • Message queue options include: RabbitMQ, Redis, and Kombu
  • Advanced task management features such as task priority, retry, periodic crawling, recrawl by age, etc.
  • Distributed architecture, with support for crawling JavaScript pages and Python 2.{6,7} and 3.{3,4,5,6}
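
A typical pyspider handler, closely following the example in the project's README (the seed URL and selectors are purely illustrative), looks roughly like this:

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Seed the crawl once a day
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Queue every outgoing http(s) link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Return a structured result that pyspider stores in its result backend
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```
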
GitHub - binux/pyspider: A Powerful Spider(Web Crawler) System in Python.

18- Katana

Katana is a powerful and versatile crawling and spidering framework written entirely in Go, which keeps it fast and efficient.

It is highly configurable, so you can tailor its crawling scope and output to your specific needs, making it a solid choice for web scraping and reconnaissance work, whether you are a seasoned professional or a beginner.

Features

  • Fast And fully configurable web crawling
  • Standard and Headless mode support
  • JavaScript parsing / crawling
  • Customizable automatic form filling
  • Scope control - Preconfigured field / Regex
  • Customizable output - Preconfigured fields
  • INPUT - STDIN, URL and LIST
  • OUTPUT - STDOUT, FILE and JSON
GitHub - projectdiscovery/katana: A next-generation crawling and spidering framework.

19- Hakrawler (Go)

Hakrawler is a fast Golang web crawler that gathers URLs and JavaScript file locations. It is a simple implementation of the awesome Gocolly library.

GitHub - hakluke/hakrawler: Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application

20- Scrapyteer (Browser + Node.js)

Scrapyteer is a Node.js web scraping tool that uses Puppeteer to scrape both plain HTML pages and JavaScript-generated content, including SPAs. It offers a small set of functions to define a crawling workflow and output data shape.

GitHub - miroshnikov/scrapyteer: Web crawling & scraping framework for Node.js on top of headless Chrome browser

21- Kankra (Python)

Kankra is a free, open-source website spider/crawler written in Python 3.x.

Kankra features include:

  • Crawls a website for hrefs, js & img files
  • Detects links that use a full URL and those that use only a relative path
    -> e.g. <a href="https://www.ssllabs.com/index.html"> vs. <a href="/projects/index.html">
  • Adjusts the results for a useful output
  • Removes duplicates
  • Automatic out of Scope checking
  • Configurable
GitHub - Ak-wa/Kankra: A Website Spider/Crawler, Python 3.x

22- DirFinder

DirFinder is a simple, user-friendly tool for brute-forcing directories with a dedicated wordlist. It is written in Python.

Features

  • Multi-threading on demand
  • Supports php, asp and html extensions
  • Four different types of wordlist
  • Checks for potential EAR vulnerabilities
  • User-friendly
GitHub - CyberPlatoon/DirFinder: The DirFinder tool is user for bruteforce directory with dedicated Wordlist is very simple user-friendly to use

23- Bose Framework

Bose is a feature-rich Python framework for Web Scraping and Bot Development. 🤖

GitHub - omkarcloud/bose: ✨ Bose is a a feature-rich Python framework for Web Scraping and Bot Development. 🤖

24- Web Crawler (Jupyter Notebook)

This is a web crawling program that connects to a specific website and crawls its main content (data).

Once the main content of a website is known, it can be used to determine the type of website and categorize it.

Therefore, an algorithm is proposed that crawls a specific website using this program and determines the similarity between two websites according to the data each one contains.

GitHub - EunBinChoi/Web-Crawler-master: This is a web crawler program without any library related to crawling.

25- AutoScraper

AutoScraper is an excellent tool for Python developers looking to extract specific data from web pages. With its Smart, Automatic, Fast, and Lightweight features, it can easily retrieve URLs or HTML content and extract a wide range of data types, including text, URLs, and any HTML tag value from a page.

In addition to its web scraping capabilities, AutoScraper can learn the scraping rules and return similar elements, allowing developers to use the learned object with new URLs to extract similar content or exact elements from those pages. With its powerful capabilities and ease of use, AutoScraper is an essential tool for anyone looking to extract valuable information from the web.
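
The sketch below, adapted from the project's README, shows that workflow: hand AutoScraper a page plus sample values you want, let it infer the scraping rules, then reuse the learned rules on structurally similar pages (the URLs and sample text are the README's own examples).

```python
from autoscraper import AutoScraper

url = "https://stackoverflow.com/questions/2081586/web-scraping-with-python"

# Example values that appear on the page; AutoScraper learns rules that match them
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

# Reuse the learned rules on a structurally similar page
similar = scraper.get_result_similar(
    "https://stackoverflow.com/questions/606191/convert-bytes-to-a-string"
)
print(similar)
```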

GitHub - alirezamika/autoscraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python

26- Ferret

Ferret is a web scraping system that simplifies data extraction from the web for various purposes, including UI testing, machine learning, and analytics. It uses a declarative language to abstract away technical details and is portable, extensible, and fast.

Features

  • Declarative language
  • Support of both static and dynamic web pages
  • Embeddable
  • Extensible

27- DotnetSpider

DotnetSpider is a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling and scraping framework.

GitHub - dotnetcore/DotnetSpider: DotnetSpider, a .NET standard web crawling library. It is lightweight, efficient and fast high-level web crawling & scraping framework

28- crawlergo

Crawlergo is a browser crawler that uses Chrome headless mode for URL collection, automatically filling and submitting forms and collecting as many entries as possible. It includes a URL de-duplication module and maintains fast parsing and crawling speed for large websites to get high-quality collection of request results.

Features

  • Chrome browser environment rendering
  • Intelligent form filling, automated submission
  • Full DOM event collection with automated triggering
  • Smart URL de-duplication to remove most duplicate requests
  • Intelligent analysis of web pages and collection of URLs, including javascript file content, page comments, robots.txt files and automatic Fuzz of common paths
  • Support Host binding, automatically fix and add Referer
  • Support browser request proxy
  • Support pushing the results to passive web vulnerability scanners
GitHub - Qianlitp/crawlergo: A powerful browser crawler for web vulnerability scanners

29- Grab Framework Project

Grab is a free open-source Python-based library for website scraping.

GitHub - lorien/grab: Web Scraping Framework

30- Geziyor

Geziyor is a blazing fast web crawling and web scraping framework. It can be used to crawl websites and extract structured data from them. Geziyor is useful for a wide range of purposes such as data mining, monitoring and automated testing.


Features

  • JS Rendering
  • 5,000+ Requests/Sec
  • Caching (Memory/Disk/LevelDB)
  • Automatic Data Exporting (JSON, CSV, or custom)
  • Metrics (Prometheus, Expvar, or custom)
  • Limit Concurrency (Global/Per Domain)
  • Request Delays (Constant/Randomized)
  • Cookies, Middlewares, robots.txt
  • Automatic response decoding to UTF-8
  • Proxy management (Single, Round-Robin, Custom)


GitHub - geziyor/geziyor: Geziyor, blazing fast web crawling & scraping framework for Go. Supports JS rendering.

31- FEAPDER (Python)

feapder is an easy-to-use yet powerful Python crawler framework. It ships with four built-in spider types (AirSpider, Spider, TaskSpider, and BatchSpider) for different scenarios and supports resumable crawls, monitoring and alerting, browser rendering, and large-scale deduplication; the companion feaplat management system handles deployment and scheduling.

GitHub - Boris-code/feapder: feapder is an easy to use, powerful crawler framework

32- Trafilatura

Trafilatura is a Python package and command-line tool for web crawling, downloads, scraping, and extraction of main texts, metadata, and comments. It aims to eliminate noise caused by recurring elements and includes author and date information. It is useful for quantitative research in corpus linguistics, natural language processing, and computational social science, among others.
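
A minimal usage sketch (fetch_url and extract are trafilatura's documented entry points; the URL and the output options shown here are illustrative):

```python
import trafilatura

# Download a page and extract the main text, stripping boilerplate such as
# navigation, ads, and footers
downloaded = trafilatura.fetch_url("https://example.org/article.html")
text = trafilatura.extract(downloaded)
print(text)

# Structured output with metadata (author, date, title) is also available
as_json = trafilatura.extract(downloaded, output_format="json", with_metadata=True)
print(as_json)
```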

Features

  • Web crawling
  • Text discovery
  • Sitemap support
  • Seamless and parallel processing, online and offline
  • Robust and efficient extraction
  • Output formats: Text, CSV, JSON, XML
  • Optional add-ons.
  • Trafilatura is distributed under the GNU General Public License v3.0.
GitHub - adbar/trafilatura: Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

33- Infinity Crawler

Infinity Crawler is a simple but powerful web crawler library for .NET.

The crawler is built around fast but "polite" crawling of websites. This is accomplished through a number of settings that allow adjustment of delays and throttles.

You can control:

  • Number of simultaneous requests
  • The delay between requests starting (Note: If a crawl-delay is defined for the User-agent, that will be the minimum)
  • Artificial "jitter" in request delays (requests seem less "robotic")
  • Timeout for a request before throttling will apply for new requests
  • Throttling request backoff: The amount of time added to the delay to throttle requests (this is cumulative)
  • Minimum number of requests under the throttle timeout before the throttle is gradually removed

Features:

  • Obeys robots.txt (crawl delay & allow/disallow)
  • Obeys in-page robots rules (X-Robots-Tag header and <meta name="robots" /> tag)
  • Uses sitemap.xml to seed the initial crawl of the site
  • Built around a parallel task async/await system
  • Swappable request and content processors, allowing greater customisation
  • Auto-throttling (see the throttle settings above)
GitHub - TurnerSoftware/InfinityCrawler: A simple but powerful web crawler library for .NET

34- SpiderX

This is a simple web-crawler development framework based on .Net Core.

GitHub - LeaFrock/SpiderX: A simple web-crawler development framework based on .Net Core.

35- Onion-Crawler

Onion-Crawler is a C#-based Tor/Onion web crawler. The author notes that it may still contain some errors and bugs, and contributions are welcome.

GitHub - OzelTam/onion-crawler: C# based Tor/Onion Web crawler.

36- Webcrawler

Webcrawler is a web crawler that crawls all of a site's inner links, continuing until no links are left and ignoring links that have already been crawled.

Open an IPython console and run the command below to see the extracted URLs: python sitemapcrawl.py

GitHub - mounicmadiraju/Webcrawler: 💌🏫 Web Crawler , that crawls all the inner links and the process goes on till no link is left and ignoring repeatedly crawled links. .WebCrawler also provides users the option to search for images, audio, video, news, yellow pages and white pages.

37- GoTor

This repository contains an HTTP REST API and a command-line program designed for efficient data gathering and analysis through web crawling using the TOR network. While the program is primarily designed to work seamlessly with TorBot, the API and CLI can also operate independently.

GitHub - DedSecInside/gotor: This program provides efficient web scraping services for Tor and non-Tor sites. The program has both a CLI and REST API.

38- Spidey

Spidey is a multi-threaded web crawler library that is generic enough to allow different engines to be swapped in.

GitHub - JaCraig/Spidey: A multi threaded web crawler library that is generic enough to allow different engines to be swapped in.

39- Mimo Crawler (JavaScript)

Mimo is a web crawler that uses non-headless Firefox and JavaScript injection to crawl webpages. It uses WebSockets to communicate between the non-headless browser and the client, allowing webpages to be interacted with and crawled by evaluating JavaScript code in the page's context.

Features

  • Simple Client API
  • Interactive crawling
  • Extremely fast compared to similar tools.
  • Fully operated by your javascript code
  • Web spidering
GitHub - NikosRig/Mimo-Crawler: A web crawler that uses Firefox and js injection to interact with webpages and crawl their content, written in nodejs.

40- WebReaper

WebReaper is a declarative, high-performance web scraper, crawler, and parser in C#. It is designed as a simple, extensible, and scalable web scraping solution: easily crawl any website, parse the data, and save the structured result to a file, a database, or pretty much anywhere you want.

GitHub - pavlovtech/WebReaper: Web scraper, crawler and parser in C#. Designed as simple, declarative and scalable web scraping solution.

41- Qaahl

Qaahl is a simple web crawler that can generate a graphical view of the crawled path.

GitHub - surbhitt/qaahl: a crawler that can scrap and visualize the path qrawled

42- Web Wanderer (Python)

Web Wanderer is a Python-based web crawler that uses concurrent.futures.ThreadPoolExecutor and Playwright to crawl and download web pages. It is designed to handle dynamically rendered websites and can extract content from modern web applications.

GitHub - biraj21/web-wanderer: A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.

43- Scrapio

Scrapio is an asyncio web scraping framework. The project aims to make it easy to write high-performance crawlers with little knowledge of asyncio, while giving enough flexibility for users to customise the behaviour of their scrapers. It also supports uvloop and can be used in conjunction with custom clients, allowing for browser-based rendering.

GitHub - EdmundMartin/Scrapio: Asyncio web crawling framework. Work in progress.

44- CobWeb

This is a project carried out for the collection of climatic data, as well as data on bugs and diseases that affect crops in agriculture.

GitHub - Lucs1590/cobWeb: 🌧 🐛.🌿 Web crawler to get data from weather, bugs and plant!

45- Spring Boot Web Crawler and Search Engine

This project provides a REST API that allows users to submit URLs for crawling. The app internally uses RabbitMQ to publish the URLs, and then listens back to fetch the contents of the URLs using Jsoup.

The app also scrapes links and indexes the content using Apache Lucene. Finally, the app recursively publishes the links to RabbitMQ.

GitHub - deviknitkkr/Jemini: This project provides a REST API that allows users to submit URLs for crawling. The app internally uses RabbitMQ to publish the URLs, and then listens back to fetch the contents of the URLs using Jsoup. The app also scrapes links and indexes the content using Apache Lucene.

46- uSearch (Go)

uSearch is an open-source website crawler, indexer and text search engine.

GitHub - mycok/uSearch: webpage crawler and mini search engine

47- Dolors

Dolors is a Java-based web scraper with a built-in web crawler. It lets you index a website and its contents, including media and text-related information, and store the output in a SQL database or simply export the data to a text file for quick processing.

48- Webash

Webash serves as a scanning tool for web-based applications, primarily utilized in bug bounty programs and penetration testing. Its framework-like design allows effortless integration of vulnerability detection scripts.

49- CryCrawler

CryCrawler is a portable, cross-platform web crawler used to crawl websites and download files that match specified criteria.

Features

  • Portable - Single executable file you can run anywhere without any extra requirements
  • Cross-platform - Works on Windows/Linux/OSX and other platforms supported by .NET Core
  • Multi-threaded - Single crawler can use as many threads as specified by the user for crawling.
  • Distributed - Connect multiple crawlers to a host; or connect multiple hosts to a master host. (using SSL)
  • Extensible - Extend CryCrawler's functionality by using plugins/extensions.
  • Web GUI - Check status, change configuration, start/stop program, all through your browser - remotely.
  • Breadth/Depth-first crawling - Two different modes for crawling websites (FIFO or LIFO for storing URLs)
  • Robots Exclusion Standard - Option for crawler to respect or ignore 'robots.txt' provided by websites
  • Custom User-Agent - User can provide a custom user-agent for crawler to use when getting websites.
  • File Criteria Configuration - Decide which files to download based on extension, media type, file size, filename or URL.
  • Domain Whitelist/Blacklist - Force the crawler to stay only on certain domains or simply blacklist domains you don't need.
  • Duplicate file detection - Files with same names are compared using MD5 checksums to ensure no duplicates.
  • Persistent - CryCrawler will keep retrying to crawl failed URLs until they are crawled (up to a certain time limit)
GitHub - CryShana/CryCrawler: Cross-platform distributed multi-threaded web crawler

50- Catch

Catch is a web crawler with built-in parsers, using the latest Python technologies.

GitHub - umrashrf/catch: Web crawler with built in parsers using latest Python technologies

51- WebCrawler

WebCrawler is a JavaScript web crawling application that generates an "internal links" report for any website by crawling each page of the site.

GitHub - a29g/WebCrawler: JavaScript Web crawling application generates an “internal links” report for any website on the internet by crawling each page of the site.

52- Google Parser

Google Parser is a lightweight yet powerful, HTTP-client-based Google Search Results scraper/parser that sends browser-like requests out of the box. This is essential in web scraping to blend in with regular website traffic.

Features:

  • Proxy support
  • Custom headers support
GitHub - nrjdalal/google-parser: HTTP based Google Search Results scraper/parser

53- Senginta.js Search Engine Scraper

Senginta is a versatile search engine scraper that can extract results from any search engine and convert them to JSON. Currently, it supports Google Product Search Engine and Baidu Search Engine, but the developer welcomes contributions to support other search engines.

GitHub - michael-act/Senginta.js: All in one Search Engine Scrapper for used by API or JS library. It’s Free & Lightweight!

54- SERPMaster

SERPMaster is a SERP scraping tool that delivers real-time data from Google Search Engine Results Pages. It utilizes multiple Google APIs to acquire large volumes of data, including rich results for both paid and organic search.

Features

  • 100% success rate
  • Device and browser-specific requests
  • City, country, or coordinate-level data
  • Data in CSV, JSON, and HTML
  • CAPTCHAs, retries, and proxy management are taken care of
  • JavaScript rendering
  • Python, JavaScript, PHP languages
GitHub - serp-master/serpmaster: SERPMaster is an all-in-one Google search result scraper. Our profile consists of tutorials detailing how to scrape data from different Google data points.





