Xidel is an open-source data extraction tool
Xidel is a command-line tool to download web pages and extract data from HTML/XML pages or JSON APIs, using CSS 3 selectors, XPath 3.0, XQuery 3.0, JSONiq, or pattern-matching templates. It can also create new or transformed XML/HTML/JSON documents. It also serves as an example application of the author's Internet Tools.
It is a platform-independent package which runs on Windows, Linux, and macOS.
Features
- Easy to set up and use
- Zero configuration required
- Works smoothly on Windows, Linux, macOS and Android
- Well documented
- Packed with dozens of examples
- Lightweight package
Supported expression languages
- CSS 3 Selectors: to extract elements unchanged
- XPath 3.0: to extract values and calculate things with them.
- XQuery 3.0: to create new documents from the extracted values and to build Turing-complete scripts.
- Pattern matching: to extract several values at once, using an annotated copy of the input page as the pattern.
- XPath 2.0/XQuery 1.0: compatibility mode for old XPath/XQuery versions.
- JSONiq: to work with JSON APIs (deprecated in favor of XPath 3.1).
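As a minimal sketch of how the CSS and XPath modes differ, the commands below run against a small sample page written to a temporary file. The file name and its contents are assumptions made for illustration. The sketch also assumes that a `xidel` binary is on the PATH and that it accepts the `--css` and `--extract` options.

```shell
# Write a tiny sample page so the queries have something to read.
# (The path and the page content are made up for this example.)
sample="${TMPDIR:-/tmp}/xidel-sample.html"
cat > "$sample" <<'EOF'
<html><head><title>Sample</title></head>
<body><a href="a.html">First</a> <a href="b.html">Second</a></body></html>
EOF

# Run the queries only when xidel is installed.
if command -v xidel >/dev/null 2>&1; then
  # CSS selector: select the <a> elements.
  xidel "$sample" --css 'a'
  # XPath: extract attribute values from the same elements.
  xidel "$sample" --extract '//a/@href'
  # XPath: calculate with the extracted nodes.
  xidel "$sample" --extract 'count(//a)'
else
  echo "xidel not found on PATH; skipping" >&2
fi
```

The same `--extract` option can be given several times in one call, so a single download can feed multiple queries.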
Following
- HTTP codes: Redirects (30x) are followed automatically, and state such as cookies is kept across requests.
- Links: It can follow (all) links on a page, meta refreshes, or any extracted value.
- HTML Forms: It can fill in arbitrary data in the input elements and submit the form.
- Arbitrary HTTP requests: In any query, you can call a function to make other requests.
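The following behaviour can be sketched with two small shell helpers. The function names `linked_titles` and `submit_form` are hypothetical and exist only for this example. The sketch assumes Xidel's `--follow` option and its `form()` function, and the URL and field value in the usage line are placeholders.

```shell
# Hypothetical helper: print the <title> of every page linked from a start URL.
# --follow picks the links to visit; --extract then runs on each followed page.
linked_titles() {
  xidel "$1" --follow '//a' --extract '//title'
}

# Hypothetical helper: override one field of the first form on a page,
# submit the form, and print the title of the resulting page.
# $2 is a url-encoded "name=value" string and must not contain single quotes.
submit_form() {
  xidel "$1" --follow "form((//form)[1], '$2')" --extract '//title'
}
```

A call would look like `linked_titles http://example.org` or `submit_form http://example.org 'q=xidel'`. Both need network access and an installed `xidel`.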
Output formats
- Adhoc: just prints the data in a human-readable format.
- XML: encodes the data as XML.
- HTML: encodes the data as HTML.
- JSON: encodes the data as JSON.
- bash/cmd: exports the data as shell variables.
- fn:serialize: implements the W3C XQuery Serialization standard.
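To illustrate the output formats, here is a hedged sketch that runs one query twice: once with JSON output and once exporting the result as a shell variable. It assumes the `--output-format` values `json-wrapped` and `bash`, the `--silent` option, and the `name:=expression` syntax for naming a result. The sample file is made up for the example.

```shell
# Sample page (path and content are assumptions for this example).
sample="${TMPDIR:-/tmp}/xidel-output-sample.html"
cat > "$sample" <<'EOF'
<html><head><title>Sample</title></head>
<body><a href="a.html">First</a></body></html>
EOF

if command -v xidel >/dev/null 2>&1; then
  # Encode the extracted value as JSON.
  xidel "$sample" --extract 'title:=//title' --output-format=json-wrapped
  # Export the named value into the current shell, then use it.
  eval "$(xidel "$sample" --silent --extract 'title:=//title' --output-format=bash)"
  echo "title is: $title"
else
  echo "xidel not found on PATH; skipping" >&2
fi
```

On Windows, the `cmd` format plays the same role for cmd.exe that `bash` plays here.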
Connections
- HTTP / HTTPS, as well as local files or stdin.
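Reading from stdin might look like the sketch below, which assumes that `-` as the input argument tells Xidel to read the document from standard input. The helper name `sum_items` and the sample XML in the usage line are hypothetical.

```shell
# Hypothetical helper: read an XML document from stdin and sum its <item> values.
# The lone "-" stands in for the file name or URL.
sum_items() {
  xidel - --silent --extract 'sum(//item)'
}
```

It could be used as `printf '<root><item>1</item><item>2</item></root>' | sum_items`, or fed from another downloader through a pipe.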
License
Xidel is released under the GNU General Public License v3.0.