NLP Text Analytics

See related Splunk blog https://www.splunk.com/blog/2019/04/11/let-s-talk-about-text-baby.html

The intent of this app is to provide a simple interface for analyzing text in Splunk using Python natural language processing libraries (currently just NLTK 3.4.5) and Splunk's Machine Learning Toolkit. The app provides custom commands and dashboards that demonstrate their use.

Version: 1.1.0

Author: Nathan Worsham
Created for MSDS692 Data Science Practicum I at Regis University, 2018
See the associated blog for detailed information on the project's creation.

Update
Additional content (combined features algorithms) created for MSDS696 Data Science Practicum II at Regis University, 2018
See the associated blog for detailed information on the project's creation, as well as the related Splunk blog post.
This app formed part of the basis for a breakout session I was lucky enough to present at Splunk Conf18: Extending Splunk MLTK using GitHub Community.
Session Slides
Session Recording

Description and Use-cases

Have you ever wanted to perform advanced text analytics inside Splunk? Splunk has some ways to handle text, but it lacks some of the more advanced features that NLP libraries can offer. These capabilities can also benefit use cases that involve Splunk's Machine Learning Toolkit.

Requirements

Splunk ML Toolkit 3.2 or greater https://splunkbase.splunk.com/app/2890/
Wordcloud Custom Visualization https://splunkbase.splunk.com/app/3212/ (preferred) OR Splunk Dashboard Examples https://splunkbase.splunk.com/app/1603/
Parallel Coordinates Custom Visualization https://splunkbase.splunk.com/app/3137/
Force Directed App For Splunk https://splunkbase.splunk.com/app/3767/
Halo - Custom Visualization https://splunkbase.splunk.com/app/3514/
Sankey Diagram - Custom Visualization https://splunkbase.splunk.com/app/3112/

How to use

Install

Normal app installation instructions can be found at https://docs.splunk.com/Documentation/AddOns/released/Overview/AboutSplunkadd-ons. Essentially, download the app and install it from the Web UI, or extract the file into the $SPLUNK_HOME/etc/apps folder.

Example Texts

The app comes with example Gutenberg texts formatted as CSV lookups, along with the popular "20 newsgroups" dataset. Load them with the syntax | inputlookup <filename.csv>, as shown in the example after the list below.

Text Names
20newsgroups.csv
moby_dick.csv
peter_pan.csv
pride_prejudice.csv
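
For example, to preview the first few rows of the Moby Dick text (head is standard SPL; the field names inside each lookup may vary):

| inputlookup moby_dick.csv | head 5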

Detailed Documentation

Documentation for the app is kept up to date on GitHub. Due to the character limits of Splunkbase, it is recommended to view the documentation on GitHub.

Summary Documentation

Custom Commands

bs4

Description

A wrapper script that brings some of the functionality of BeautifulSoup4 to Splunk, extracting HTML/XML tags and their text for use in searches. The default is to extract the text and send it to a new field 'get_text'; otherwise the selection is returned in a field named 'soup'. The default parser is 'lxml', though you can specify others ('html5lib' is not currently included). The find methods can be used in conjunction; their order of operation is find > find_all > find_child > find_children. Each option has a similarly named option with '_attrs' appended that accepts inner- and outer-quoted key:value pairs for more precise selections.

Syntax

*| bs4 textfield=<field> [get_text=<bool>] [get_text_label=<string>] [get_attr=<attribute_name_string>] [parser=<string>] [find=<tag>] [find_attrs=<quoted_key:value_pairs>] [find_all=<tag>] [find_all_attrs=<quoted_key:value_pairs>] [find_child=<tag>] [find_child_attrs=<quoted_key:value_pairs>] [find_children=<tag>] [find_children_attrs=<quoted_key:value_pairs>]
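
Example

A minimal sketch, assuming the raw HTML sits in a field named html (a hypothetical field name) and we want the visible text of every paragraph tag:

* | bs4 textfield=html find_all=p

Because get_text defaults to true, the extracted text is returned in the get_text field; adding a find_all_attrs option would narrow the selection to tags with matching attributes.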

cleantext

Description

Tokenize and normalize text (remove punctuation and digits, and reduce each word to a base form). Different options result in better but slower cleaning: base_type="lemma_pos" is the slowest option, while base_type="lemma" assumes every word is a noun, which is faster but still results in decent lemmatization. Most options have a default already set; textfield is the only required option. By default the command returns a multi-valued field that is ready for use with stats count by. It can optionally return special fields for analysis--pos_tags and ngrams.

Syntax

*| cleantext textfield=<field> [keep_orig=<bool>] [default_clean=<bool>] [remove_urls=<bool>] [remove_stopwords=<bool>] [base_word=<bool>] [base_type=<string>] [mv=<bool>] [force_nltk_tokenize=<bool>] [pos_tagset=<string>] [custom_stopwords=<comma_separated_string_list>] [term_min_len=<int>] [ngram_range=<int>-<int>] [ngram_mix=<bool>]
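
Example

A minimal sketch against one of the bundled lookups, assuming the lookup stores its text in a field named text (an assumption; check the lookup's actual field names):

| inputlookup peter_pan.csv | cleantext textfield=text base_type=lemma remove_stopwords=true

Because the default output is multi-valued, piping the cleaned field into stats count by gives a quick term-frequency table.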

similarity

Description

A wrapper for NLTK distance metrics for comparing text in Splunk. Similarity (and distance) metrics can be used to tell how far apart two pieces of text are, and some algorithms also return the number of steps required to make the texts the same. These metrics do not extract meaning, but are often used in text analytics to detect plagiarism, conduct fuzzy searching, perform spell checking, and more. Defaults to the Levenshtein distance algorithm but includes several other algorithms (Damerau-Levenshtein, Jaro, Jaro-Winkler), including some set-based algorithms (Jaccard, MASI). Can handle multi-valued comparisons with an option to limit to a given number of top matches. Multi-valued output can be zipped together or returned separately.

Syntax

*| similarity textfield=<field> comparefield=<field> [algo=<string>] [limit=<int>] [mvzip=<bool>]
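
Example

A minimal sketch comparing two text fields, assuming fields named a and b and assuming the Jaro-Winkler option is spelled jaro_winkler (the exact algo strings are not listed here):

* | similarity textfield=a comparefield=b algo=jaro_winkler limit=3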

vader

Description

Sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner). Using the option full_output will return the neutral, positive, and negative scores that make up the compound score (which is returned as the field "sentiment"). It is best to feed in uncleaned data, since the algorithm takes capitalization and punctuation into account.

Syntax

* | vader textfield=sentence [full_output=<bool>]
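
Example

For instance, scoring raw (uncleaned) sentences and displaying each sentence next to its compound score:

* | vader textfield=sentence full_output=true | table sentence sentiment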

ML Algorithms

TruncatedSVD

Description

From sklearn. Used for dimension reduction (especially on a TFIDF matrix). In text analytics this is also known as Latent Semantic Analysis (LSA). Returns fields prepended with "SVD_". See http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

Syntax

fit TruncatedSVD <fields> [into <model name>] k=<int>
The k option sets the number of components to reduce the data to. It is important that the value is less than the number of features or documents. The algorithm's documentation recommends at least 100 components for LSA.
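
Example

A sketch of an LSA-style pipeline, assuming the lookup's text lives in a field named text and that the MLTK TFIDF algorithm emits fields matching text_tfidf_* (an assumed naming pattern; adjust the wildcard to your actual TFIDF output fields):

| inputlookup 20newsgroups.csv | fit TFIDF text | fit TruncatedSVD text_tfidf_* k=100 into lsa_model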

LatentDirichletAllocation

Description

From sklearn. Used for dimension reduction. This is also known as LDA. Returns fields prepended with "LDA_". See http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

Syntax

fit LatentDirichletAllocation <fields> [into <model name>] k=<int>
The k option sets the number of components (topics) to reduce the data to. It is important that the value is less than the number of features or documents.
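
Example

A sketch of a topic-modeling run over the same assumed document-term fields as in the TruncatedSVD example (field names are assumptions):

| inputlookup 20newsgroups.csv | fit TFIDF text | fit LatentDirichletAllocation text_tfidf_* k=5

The resulting LDA_* fields hold each document's weight for each of the 5 topics.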

NMF

Description

From sklearn. Used for dimension reduction. This is also known as Non-Negative Matrix Factorization. Returns fields prepended with "NMF_". See http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

Syntax

fit NMF <fields> [into <model name>] [k=<int>]
The k option sets the number of components (topics) to reduce the data to. It is important that the value is less than the number of features or documents.
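
Example

A sketch that fits NMF and saves the model, so it can later be applied to new data with the MLTK apply command (field names are assumptions, as above):

| inputlookup 20newsgroups.csv | fit TFIDF text | fit NMF text_tfidf_* k=5 into nmf_model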

TFBinary

Description

A modified implementation of TfidfVectorizer from sklearn. The current MLTK version includes TfidfVectorizer, but it does not allow turning off IDF or setting binary to True. This implementation creates a document-term matrix indicating whether or not each document contains a given term. See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Syntax

fit TFBinary <fields> [into <model name>] [max_features=<int>] [max_df=<int>] [min_df=<int>] [ngram_range=<int>-<int>] [analyzer=<str>] [norm=<str>] [token_pattern=<str>] [stop_words=english] [use_idf=<true|false>] [binary=<true|false>]
In this implementation, the following settings are preset in order to create a binary output: use_idf is set to False, binary is set to True, and norm is set to None. The rest of the settings and options are exactly like the MLTK implementation.
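
Example

A sketch that builds a binary document-term matrix from an assumed text field named text, capped at 200 terms:

| inputlookup 20newsgroups.csv | fit TFBinary text max_features=200 stop_words=english

Each returned term column is 1 if the document contains the term and 0 otherwise.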

MinMaxScaler

Description

From sklearn. Transforms each feature to a given range. Returns fields prepended with "MMS_". See http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

Syntax

fit MinMaxScaler <fields> [into <model name>] [copy=<true|false>] [feature_range=<int>-<int>]
Defaults are feature_range=0-1 and copy=true.
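
Example

For instance, rescaling the SVD_* outputs of TruncatedSVD into the 0-1 range (the field wildcard assumes TruncatedSVD was run earlier in the search):

* | fit MinMaxScaler SVD_* feature_range=0-1 into scaler_model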

LinearSVC

Description

From sklearn. Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. See http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

Syntax

fit LinearSVC <fields> [into <model name>] [gamma=<int>] [C=<int>] [tol=<int>] [intercept_scaling=<int>] [random_state=<int>] [max_iter=<int>] [penalty=<l1|l2>] [loss=<hinge|squared_hinge>] [multi_class=<ovr|crammer_singer>] [dual=<true|false>] [fit_intercept=<true|false>]
The C option sets the penalty parameter of the error term.
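
Example

A sketch of training a text classifier on binary term features, assuming a label field named category and feature fields matching text_tfbinary_* (both names are assumptions), and assuming the standard MLTK fit <algo> <target> from <features> form:

| inputlookup 20newsgroups.csv | fit TFBinary text max_features=200 | fit LinearSVC category from text_tfbinary_* into svc_model

The saved model can then score new documents with | apply svc_model.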

ExtraTreesClassifier

Description

From sklearn. This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

Syntax

fit ExtraTreesClassifier <fields> [into <model name>] [random_state=<int>] [n_estimators=<int>] [max_depth=<int>] [max_leaf_nodes=<int>] [max_features=<int|auto|sqrt|log2|None>] [criterion=<gini|entropy>]
The n_estimators option sets the number of trees in the forest, defaults to 10.
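
Example

A sketch using the same assumed fields as the LinearSVC example; the fitted forest can afterwards be inspected with the MLTK summary command:

| inputlookup 20newsgroups.csv | fit TFBinary text max_features=200 | fit ExtraTreesClassifier category from text_tfbinary_* n_estimators=50 into trees_model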

Support

Support is provided through Splunkbase (click on Contact Developer), through Splunk Answers, or by submitting an issue on GitHub. Response times will depend on the issue and on time available, but every attempt will be made to provide a fix within 2 weeks.

Release Notes

Version 1.1.0
Feb. 27, 2020

Upgraded to support Python 3 and Python 2 concurrently. Upgraded the following packages to their latest versions -- NLTK (3.4.5), splunklib (1.6.12), bs4 (4.8.2). Added the singledispatch package required by the updated NLTK. Updated the Counts dashboard to support grouping by sentiment and to fall back to a tag cloud when the wordcloud app is not available. Updated the Sentiment dashboard to support drilldown from the line chart and pie chart to the specific text. Added the new similarity command and dashboard.

Version 1.0.3
Jan. 15, 2020

Change to allow the UI to present multiple-choice values for some options on the cleantext and bs4 commands. Added the get_attr option to the bs4 command to retrieve attributes of elements. Change to allow bs4 find_all to search multiple elements using a comma. Updated javascript to highlight specific terms when shown in the example text from the Counts dashboard. Fix for a Splunk 8.x deprecated Python 2 lib (still no Python 3 compatibility, but it is coming! Version 1.0.3 will work with Splunk 8.x using Python 2 as the interpreter, but the app can't officially claim 8.0 compatibility because of this). Fix for the topic-modeling algorithms' use of xrange. Fix for 2 Sentiment dashboard panels.

Version 1.0.2
March 19, 2019

Version bump for appinspect

Version 1.0.1
March 12, 2019

Minor fix for file permissions found from appinspect. Updated the 20newsgroups dataset to not contain an index column.

Version 1.0.0
March 8, 2019

Fix to the Counts dashboard when searching for usage of a term (thank you dalward!). Fix to the cleantext command for consistent output on POS tagging when there is only one result in the text block. Added named entities to the Counts dashboard. Added Themes category, renamed the Themes dashboard to Clustering. Added Named Entities dashboard under Themes. Updated visualization app requirements. Added the 20newsgroups.csv dataset. Added Classification dashboard. Updated documentation. Updated text cleaning option to require a minimum term length of 2.

Version 0.9.5
Nov. 9, 2018

Minor redundant fix to algos.conf. Fixed ngram output on text that has cleaned itself empty. Added an option to maintain a copy of the original text, which makes it faster for the Counts dashboard to search the original text. Updated the Counts dashboard to use this capability.

Version 0.9.4
Aug. 16, 2018

Added related combined features algorithms--TFBinary, MinMaxScaler, LinearSVC, ExtraTreesClassifier
