Website input

The Website Input app provides a mechanism for scraping web-pages for data and indexing it in your Splunk instance to make it searchable.


  • Website Data Extraction: set up an input that will extract data from a web-page and get it into Splunk
  • Data Preview: select data from a web-page that you would like to extract and preview the results to get a sample of what the output would look like before you save the configuration
  • Website crawling: you can have the input crawl web-pages to automatically discover related content in other pages


Initial setup

Once you install the app, it will ask you to set it up on the app configuration page. The setup only contains options related to configuring a proxy server. If no proxy server is used, you can just press save.

Creating an input

You will need to create an input to define the websites that you would like to extract information from. You can set up a new input using the wizard, using the page in Splunk's manager at Settings » Data Inputs » Web-pages, or using the GUI provided in the app itself. The most difficult part of configuring the app is writing the CSS selector that will capture the data you want. See W3schools for information on how to create CSS selectors.

You can usually ignore the "Output" section. This is only necessary if you want to name the fields that the input will get based on content within the page (see "Can I use attributes to set the field names?" for details).

The "Authentication" section can be left blank unless the web-page requires authentication. Only HTTP authentication is supported at the current time.
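For readers unsure what HTTP authentication means here (as opposed to a login form inside the page), the following is a minimal sketch of the HTTP Basic scheme, one common form of HTTP authentication. It is not the app's own code: the function name basic_auth_header and the sample credentials are hypothetical, and the app may also handle other schemes.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value for the HTTP Basic scheme.

    Hypothetical helper for illustration; not part of the app.
    """
    # The credentials are joined with a colon and base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Sample credentials (made up); the result is sent as the
# "Authorization" request header.
print(basic_auth_header("user", "pass"))
```

If the page asks for credentials through a browser pop-up rather than a form in the page, it is most likely using this kind of authentication.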

Known Issues

The UI shows matches for a selector but the preview shows none and the input matches nothing

The UI may show that a selector matches even though it matches nothing when executed in the preview, because web-browsers sometimes manipulate the HTML before rendering it. This can happen when tables do not have a tbody element (which they are supposed to have): the browser adds the tbody element even though it doesn't exist in the original HTML.

To fix this, you can do one of the following:

  1. Use a selector that matches the original HTML even though it doesn't match in the preview page
  2. Make your selector more generic (like converting "font > table > tr" to "font table tr")
  3. Make a selector that matches both (like "font > table > tr, font > table > tbody > tr")
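To check whether the original HTML really lacks a tbody element, you can inspect the raw page source outside of a browser. The following is a minimal sketch using only the Python standard library; the sample HTML, the RowParentFinder class and the row_parents function are hypothetical and are not part of the app.

```python
from html.parser import HTMLParser

# Hypothetical page source: the table has no tbody element.
RAW_HTML = """
<font>
  <table>
    <tr><td>alpha</td></tr>
    <tr><td>beta</td></tr>
  </table>
</font>
"""


class RowParentFinder(HTMLParser):
    """Records the parent tag of every <tr> found in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []        # tags that are currently open
        self.row_parents = []  # parent tag of each <tr> seen

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_parents.append(self.stack[-1] if self.stack else None)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed tags).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def row_parents(html):
    """Return the parent tag name of each <tr> in the given HTML."""
    finder = RowParentFinder()
    finder.feed(html)
    return finder.row_parents


parents = row_parents(RAW_HTML)
print(parents)
```

If the parents reported for the rows are "table" rather than "tbody", a selector such as "font > table > tbody > tr" matches in the browser but not against the original HTML, so one of the three fixes above applies.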


See the links below for answers to frequently asked questions:

Can I specify more than one selector (to match different things on a single page)?

Can I use attributes to set the field names?

I changed the sourcetype and now the match field is no longer a multi-value field; what do I do?

The input isn't extracting content, even though I can see it in my web-browser

More Information

This project is open source. See GitHub for the source or LukeMurphey.net for more information.

Release Notes

Version 4.5.10
July 5, 2020

1) Updated the code to be more compliant with Python 3
2) Fixed issue where the results could be in the wrong order

Version 4.5.9
Jan. 31, 2020

1) Added link to open URL in new tab
2) Improved code for communicating to the preview iframe

Version 4.5.8
Nov. 14, 2019

1) Adding support for Python 3
2) Fixing issues on Splunk 8.0.0
3) Updated the Geckodriver for Mac and Linux to version 0.26.0

Version 4.5.7
June 14, 2019

1) Fixed another error that occurred when outputting values as multi-valued fields
2) Updated the geckodriver to 0.24 so that newer versions of Firefox work
3) Added link to search logs to determine why browser test failed
4) Fixed issue where integrated browser test failed on the input wizard

Version 4.5.6
Feb. 22, 2019

1) Fixed error that occurred when outputting values as multi-valued fields
2) Fixed issue where proxy password from secure storage was not being used

Version 4.5.5
Nov. 10, 2018

1) Fixed issue where passwords were not loaded if there were more than 30
2) Improved styling on Splunk 7.0+

Version 4.5.4
July 9, 2018

1) Fixed issue where the "when_matches_change" setting of "output_results" output results even when the matches hadn't changed
2) Fixed issue where the severity chart on the health page filtered based on the severity filter and thus didn't show all entries

Version 4.5.3
June 15, 2018

Updating the styling to work better on Splunk 7.0 and 7.1

Version 4.5.2
March 14, 2018

1) Input now handles large files much better by only downloading the first 512 KB of the file
2) Updated the Chrome driver so that the input works with newer versions of Chrome
3) The input creation wizard auto-suggests a URL filter now when using spidering
4) Output is now streamed (as opposed to being cached) in order to reduce memory usage
5) The input now gracefully handles websites that return a bad encoding
6) Fixed issue where you could not drill-down on logs from the health dashboard

Version 4.5.1
Oct. 5, 2017

1) Input is now resilient to transient Splunkd outages
2) Fixed issue where index selection input was super-wide on Splunk 7.0

Version 4.5
Sept. 2, 2017

1) Added support for forms authentication with browsers
2) Fixed issue where user-agent string was not set for Firefox and Chrome
3) Fixed issue where the browser testing functionality on the UI didn't use the proxy server

Version 4.4
Aug. 7, 2017

1) Added support for forms authentication
2) Added ability to set a default value for the user-agent globally
3) Removed support for proxy authentication on Splunk Cloud

Version 4.3.0
July 21, 2017

1) Passwords are now stored using Splunk secure storage
2) Setup page has been updated to make it easier to use
3) Pages can now be rendered using Google Chrome
4) Added help page to guide users on how to use a web browser for rendering; added browser test to input page
5) Fixed a couple small bugs on the Overview dashboard

Version 4.2.1
May 4, 2017

1) Improved compatibility with Splunk 6.6
2) Fixed issue where users sometimes could not enable inputs

Version 4.2
April 7, 2017

Adding ability to only output results when they change

Version 4.1.3
April 3, 2017

1) Fixed issue where the host field could not be overridden
2) Reduced some unimportant log messages to debug level

Version 4.1.2
March 19, 2017

Added support for running the app on a Splunk free license

Version 4.1.1
March 13, 2017

Fixed issue where Firefox driver was not correctly added to the path on Windows

Version 4.1
March 9, 2017

1) Fixed issue where some sites could not be previewed
2) Fixed issue where selectors would not match an ID that was not lowercase
3) Added ability to include empty matches
4) Added ability to delete inputs

Version 4.0.2
Jan. 18, 2017

1) Fixed issue where HTTP authentication didn't work with Firefox
2) Fixed issue where Firefox rendering didn't work on headless environments
3) Other minor changes

Version 4.0.1
Dec. 3, 2016

Various bug fixes and minor improvements

Version 4.0
Dec. 1, 2016

Vastly updated UI, various bug fixes and lots of smaller enhancements

Version 3.2.1
Nov. 24, 2016

1) Improved compatibility with versions of Splunk
2) Fixed overly restrictive URL validation
3) Fixed issue where some parts of the stash file may not have been indexed, losing parts of large result sets
4) Fixed controller logs which were not sourcetyped correctly

Version 3.2
Sept. 21, 2016

* Added ability to view results in search from the modular input creation page
* Improved documentation on the search command options

Version 3.1.2
Sept. 20, 2016

Fixed problem where matches were not visible when the content is very long

Version 3.1.1
July 14, 2016

Fixed problem where you could not create new inputs

Version 3.1
July 11, 2016

Added ability to grant access to make inputs to non-admin users

Version 3.0
May 26, 2016

* Added ability to render pages using a browser (to get page contents after JS rendering has executed)
* MD5 and SHA224 hashes are now included in the results
* Added ability to output matches as separate fields
* Matches are now listed in results in the order that they were discovered

Version 2.1
May 24, 2016

* Simplified the data input configuration screen
* Added ability to include the raw content in case you want to do your own parsing in SPL
* Added ability to specify a custom string that will separate extracted values
* Fixed incorrect reporting of matches count

Version 2.0
May 3, 2016

* Added ability to crawl websites

Version 1.2.0
Jan. 3, 2016

* Added the ability to use the tag names as the field names
* Fixed issue where the selector would sometimes not match if the content was upper-case and the selector wasn't
* Added a BNF file for the search command

Version 1.1.3
Dec. 16, 2015

Password no longer must be re-typed every time an input is modified

Version 1.1.2
Nov. 30, 2015

Fixed issue where fields without spaces were not being extracted as multi-value fields by default

Version 1.1.1
Sept. 7, 2015

Updated to the latest version of the modular input library; should fix problems where the input crashes

Version 1.1
Aug. 24, 2015

Added ability to specify the user-agent string

Version 1.0.5
June 22, 2015

* Fixed issue where web input controller used the incorrect logger name
* Fixed issue where you could not select the sourcetype correctly in some cases
* Added a search command for performing web scrapes from the search page

Version 1.0.4
March 28, 2015

* Fixed issue where some files could not be parsed because lxml sometimes won't parse correctly encoded files
* Enhanced logging for when interval gap is too large and when checkpoint file could not be found

Version 1.0.3
Jan. 9, 2015

* Fixed issue where the input would not stay on the interval because it included processing time in the interval
* Fixed issue where the modular input logs were not sourcetyped correctly

Version 1.0.2
Nov. 29, 2014

Fixed issue where the input would:
* sometimes fail due to exception thrown from sleep() being interrupted
* sometimes fail due to splunkd connection failure
* ignore the host field that was set on the configuration page

Version 1.0.1
Nov. 12, 2014

Fixed issue where preview did not work

Version 1.0
Oct. 28, 2014

Added ability to use a proxy server

Version 0.9
Aug. 17, 2014

* Fixed issue where not all matches were returned
* Added preview dialog to modular input page
* Added raw_match_count to output which counts CSS matches, even if they included no text
* Fixed incompatibility with other apps that also import the modular_input base class
* Fixed issue where entering and then clearing the sourcetype causes an error
* Added ability to specify attributes that should be used for the field names

Version 0.8
July 13, 2014

Fixed problem where websites in non-ASCII encoding did not get decoded correctly

Version 0.7
July 11, 2014

Version 0.6
July 8, 2014

* Switched to multi-value output of matches and added transform for parsing match field
* Fixed exception that could happen if the web-page was not available
* Put authentication fields in a separate location on the manager page

Version 0.5
July 7, 2014

A Splunk input for retrieving and indexing information from web-pages


