news-please is an open-source news crawler that extracts structured information from news websites. It builds on libraries such as Scrapy, Newspaper, and readability, and it can follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles.

It also features a library mode for Python developers and can extract articles from the large news archive at commoncrawl.org.

Features

  • works out of the box: install with pip, add URLs of your pages, run
  • run news-please conveniently using its CLI mode
  • use it as a library within your own software
  • extract articles from commoncrawl.org's news archive
  • store extracted results in JSON files, PostgreSQL, Elasticsearch, or your own storage
  • simple but extensive configuration (if you want to tweak the results)
  • revisions: crawl articles multiple times and track changes
  • crawl and extract information given a list of article URLs
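As a rough illustration of library mode, the sketch below wraps the `NewsPlease.from_url` and `NewsPlease.from_urls` entry points of the `newsplease` package. The helpers `fetch_article`, `fetch_many`, and `one_line_summary` are hypothetical names introduced here, the attribute names (`title`, `authors`, `language`) are assumed to match the fields news-please exposes on an extracted article, and the demo values are made up.

```python
from types import SimpleNamespace


def fetch_article(url):
    """Download one page and run the news-please extractors on it.

    Needs `pip3 install news-please` and network access.
    """
    # Imported lazily so the rest of this sketch can be tried without the package.
    from newsplease import NewsPlease
    return NewsPlease.from_url(url)


def fetch_many(urls):
    """Extract several articles; the result maps each URL to its article."""
    from newsplease import NewsPlease
    return NewsPlease.from_urls(urls)


def one_line_summary(article):
    """Format a short summary from an article-like object (hypothetical helper)."""
    title = article.title or "(no headline found)"
    authors = ", ".join(article.authors or []) or "unknown author"
    return f"{title} | {authors} | {article.language}"


# Stand-in object so the helper can be tried offline; the values are invented.
demo = SimpleNamespace(title="Example headline", authors=["Jane Doe"], language="en")
print(one_line_summary(demo))
```

With the package installed, `one_line_summary(fetch_article(url))` would apply the same formatting to a real article; extraction can fail or return missing fields, which is why the helper falls back to placeholder text.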

Extracted information

news-please extracts the following attributes from news articles. An example JSON file as extracted by news-please can be found here.

  • headline
  • lead paragraph
  • main text
  • main image
  • name(s) of author(s)
  • publication date
  • language
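The attributes above can be read back from a stored JSON result with the standard `json` module. This is a minimal sketch: the key names in `FIELD_KEYS` (`title`, `description`, `maintext`, `image_url`, `authors`, `date_publish`, `language`) are assumed to match the project's JSON output, `read_attributes` is a hypothetical helper, and the sample record is invented for illustration.

```python
import json

# Assumed mapping from the attributes listed above to keys in the JSON output.
FIELD_KEYS = {
    "headline": "title",
    "lead paragraph": "description",
    "main text": "maintext",
    "main image": "image_url",
    "name(s) of author(s)": "authors",
    "publication date": "date_publish",
    "language": "language",
}


def read_attributes(json_text):
    """Parse one JSON record and pick out the listed attributes.

    Keys missing from the record come back as None instead of raising.
    """
    record = json.loads(json_text)
    return {label: record.get(key) for label, key in FIELD_KEYS.items()}


# Invented record; a real file would be produced by a crawl.
sample = json.dumps({
    "title": "Example headline",
    "description": "A one-sentence lead.",
    "maintext": "Body of the article.",
    "authors": ["Jane Doe"],
    "date_publish": "2020-01-01 12:00:00",
    "language": "en",
})

attrs = read_attributes(sample)
for label, value in attrs.items():
    print(f"{label}: {value}")
```

Using `dict.get` keeps the sketch tolerant of articles where an attribute, such as the main image here, could not be extracted.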

Install

$ pip3 install news-please

License

Apache-2.0 License

Resources & Downloads

GitHub - fhamborg/news-please: news-please - an integrated web crawler and information extractor for news that just works