Sentiment/Text Mining Tools

I worked at a company that was attempting to corner the market on topic-based sentiment classification of forum and blog content.  While I was there I had the opportunity to work with some of the most intelligent and inspirational code, tools and software for probabilistic and NLP sentiment analysis.


  • Autonomy - A massive, sprawling multi-tiered system used for comprehension and storage of unstructured data.  Autonomy can crawl, index, federate, partition, etc., and does so while requiring a ton of horsepower, a ton of money and a pile of scientists to use.  Well, at least that's how it was implemented when I joined.  On one hand I got the impression that it was implemented improperly, yet on the other I observed that the system was a clever accumulation of WordNet, the GeneralInquirer and OpenNLP.
  • Lexalytics - A scarily accurate topic-based sentiment parsing engine, with confidences.  And, it's trainable, just like adding Frames to WordNet.  Peeking under the covers reveals WordNet and OpenNLP.


  • RapidMiner - Where to begin, where to begin?  My favorite application next to SQL Server, ever.  This little ball of wax has enough features, twists and turns to place it alongside a Final Fantasy game at your local GameStop.  Examples are provided for Text Mining, and the CrossValidation and K-Folds evaluators are top notch for displaying class recall and precision.  Through and through, RapidMiner makes SSAS seem like a toy.
  • NClassifier - The best damn little Bayesian classifier out there, for .NET.  Well, actually, it's the only Bayesian classifier I would use.  It's fast, easy to use and you can supply your own word weights, or have the classifier calculate them for you.  Combining NClassifier with the weighted word lists installed by Lexalytics yields surprisingly accurate results.  A basic tokenizer is provided, and it works well.  Look to the SNOWBALL package for an upgrade.  Also, a basic summarizer is provided, but it isn't as complete or thorough as the OpenTextSummarizer.
  • OpenTextSummarizer - Yet another great find.  Excellent text summarization includes templates for multiple languages, which could probably be extended to detect language.  The synonym rollup is easily extended and overall, works really, really well when provided with clean, tokenized text.
  • WordNetSQLServer - WordNet for SQL Server.  What more is there to say?  :)
  • SharpNLP - As probabilistic approaches will only take you so far, you're going to need a parser.  Includes a WordNet interface.
  • Lucene.NET - The SNOWBALL stemmer is the evolution of the Porter Stemmer, as written by Martin Porter.
  • LingPipe - Not .NET but close enough.  (Java)
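
To make the Bayesian classifier idea from the list above concrete, here is a minimal sketch of a multinomial naive Bayes classifier in Python.  It is not NClassifier's code or API; the class name, the whitespace tokenizer and the tiny training set are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer or stemmer."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocab = set()

    def train(self, label, text):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def scores(self, text):
        """Return log(prior * likelihood) for each label seen in training."""
        total_docs = sum(self.doc_counts.values())
        result = {}
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            result[label] = score
        return result

    def classify(self, text):
        scores = self.scores(text)
        return max(scores, key=scores.get)


# Hypothetical training data, invented for this sketch.
clf = NaiveBayes()
clf.train("positive", "great phone love the battery")
clf.train("positive", "love this camera great pictures")
clf.train("negative", "terrible battery hate the screen")
clf.train("negative", "awful support hate this phone")

print(clf.classify("love the great camera"))
print(clf.classify("hate the awful screen"))
```

Supplying your own word weights, as described above, would amount to replacing the raw counts with externally provided values before scoring.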

Training Sets:

Overall, if you aim to solve the problem of topic-based sentiment, and have money to spend, purchase Lexalytics and extend its already broad understanding by teaching it new grammar.

If you're into a challenge, the open source tools and the training sets listed above are all you need to solve the problem.  To be continued.

Here's a nice list of links from LingPipe.


ABNER is a statistical named entity recognizer using linear-chain conditional random fields (CRFs) with a variety of orthographic and contextual features. It also has a UI for annotation. Written by Burr Settles at the University of Wisconsin-Madison. Released with source under the Common Public License.


The Baseline Information Extraction (BALIE) system is a Java natural language toolkit developed at the University of Ottawa and released under the GNU General Public License. BALIE provides language ID, sentence detection, part of speech and named-entity recognition. Here's the BALIE javadoc.


FreeLing is a set of C++ tools developed at the Universitat Politècnica de Catalunya and released under the GNU Lesser General Public License. Freeling provides sentence detection, morphological analysis, named entities, POS tagging, shallow parsing, dependency parsing and word sense disambiguation. Here's a link to their user manual.

The Dragon Toolkit


The Dragon Toolkit is a Java-based development package for academic use in information retrieval (IR) and text mining (TM, including text classification, text clustering, text summarization, and topic modeling). It is tailored for researchers who work on large-scale IR and TM and prefer Java programming.



GATE is a Java text mining toolkit developed at the University of Sheffield and released under the GNU Lesser General Public License. GATE provides a general offset-oriented development/deployment environment/framework and some rule-based tools to run within that framework. Many other GATE plugins have been contributed by Sheffield and third parties. Here is a link to their user guide and javadoc.


"The JULIE Lab here offers a comprehensive NLP tool suite for the application purposes of semantic search, information extraction and text mining. Most of our continuously expanding tool suite is based on machine learning methods and thus is domain- and language independent.

One main feature is that we offer our tools both as stand-alone programs and wrapped within the UIMA framework. UIMA is an open-source, industrial-strength, scaleable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components."

Apache Lucene Mahout

"Mahout's goal is to build scalable, Apache licensed machine learning libraries. Initially, we are interested in building out the ten machine learning libraries detailed in [Chu et al.'s 2006 NIPS paper Map-Reduce for Machine Learning on Multicore] using [Apache] Hadoop."


MALLET is a Java natural language toolkit developed at the University of Massachusetts and released under the Common Public License. MALLET is most widely used for classification and sequence modeling. It also includes clustering. It provides maximum entropy training, including conditional random fields, general undirected graphical models, finite-state transducers, and some general numerical optimization classes. Here's a link to their javadoc and their tutorials.

Minor Third

Minor Third is a Java natural language toolkit developed at Carnegie Mellon University and released under the BSD License. Minor Third provides an extensive suite of general and sequence classifiers, including KNN, active learning, SVMs, decision trees, CRFs, CMMs, boosting, perceptrons, etc. Here's a link to their javadoc and a link to their install/tutorial page.


MontyLingua is a free for research use, commonsense-enriched, end-to-end natural language understander for English. Feed raw English text into MontyLingua, and the output will be a semantic interpretation of that text. Perfect for information retrieval and extraction, request processing, and question answering. From English sentences, it extracts subject/verb/object tuples, extracts adjectives, noun phrases and verb phrases, and extracts people's names, places, events, dates and times, and other semantic information. MontyLingua makes traditionally difficult language processing tasks trivial!


The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester with close collaboration with the University of Tokyo.

On our website, you can find pointers to sources of information about text mining such as links to

  • text mining services provided by NaCTeM
  • software tools, both those developed by the NaCTeM team and by other text mining groups
  • seminars, general events, conferences and workshops
  • tutorials and demonstrations
  • text mining publications


The Natural Language Toolkit (NLTK) is a general Python toolkit developed at the University of Melbourne for natural language processing released under the GNU General Public License. NLTK contains modules for heuristic and statistical tagging (including the Brill tagger) and chunking, full parsing (CFG), and clustering (including K-means and EM). The documentation page contains pointers to tutorials and API documentation. It's also distributed with a range of interesting data.


OpenNLP is a heterogeneous collection of projects distributed under a variety of open source licenses. The main projects being developed for OpenNLP itself include a general Java maximum entropy package released under the GNU Lesser General Public License. Here's the maxent javadoc. There's a tools API to go with it; here's the tools javadoc. The tools include statistical tokenizers, sentence detection, name finders, part-of-speech taggers and full syntactic (PCFG) parsing. This is one of the few packages to do coreference resolution.


Actually, it's hard to tell if this is academic or commercial. They offer a suite of .NET tools for natural language processing called Antelope. It looks like much of the work is based on wrappers to other packages.


RASP is an NLP framework for "robust accurate statistical parsing". It is trained using the British English corpora Susanne, LOB and BNC. RASP includes a tokenizer, part-of-speech tagger, hand-built FST-based morphological analyzer for English, grammar-based parser and parse reranking model.

RASP is distributed only as an executable and licensed under its own non-commercial license for education or research.

There is a RASP white paper describing the system, including dependency parser accuracy evaluation, and a longer technical report about the grammar formalism used.

Stanford NLP Software

This is many pieces of software with Java source licensed under the GNU General Public License. Tools include a part-of-speech tagger, text classifier, and PCFG parser. They no longer provide public access to their full JavaNLP toolkit.


TreeTagger is Helmut Schmid's multilingual part-of-speech tagger. It has a research-only license.


WEKA is the University of Waikato's data mining software released under the GNU General Public License. It's not a natural language processing toolkit, but a very extensive general machine learning toolkit. It provides a nice graphical interface for evaluation and supports just about every machine learning algorithm known to research. It's based on Ian Witten and Eibe Frank's book Data Mining. Here's a pointer to the WEKA documentation, which includes tutorials and pointers to the javadoc.


UIMA is IBM Research's unstructured information management architecture released under the Common Public License. UIMA's not really a competitor to LingPipe so much as a general framework into which natural language processing toolkits may be embedded. UIMA is the integration platform for DARPA's Global Autonomous Language Exploitation (GALE) project. We have some integrations of LingPipe's chunkers into UIMA if anyone's interested. Here's the UIMA Component Repository and a document on the architectural highlights.

Industrial Competition

The following is a list of competitors with quotes from their own web pages. We've listed technology components where we could find them.


"Accelovation's market discovery software helps business professionals answer complex questions faster, more accurately and with greater confidence by distilling deep meaning, purpose and insight drawn from the Internet and subscription content."

Among the "solutions" they cite is text analytics, which they describe as "technologies and processes for synthesizing insights from natural language data. These tools make it possible for a business analyst to perform a variety of research tasks, such as drawing conclusions about consumer sentiments or discovering new solutions to technical problems. Accelovation provides the text analytics tools that business analysts need to conduct research throughout the innovation process."

Accenture Technology Labs

Accenture is offering a sentiment monitoring service, "Sentiment Monitoring Services searches preferred sites or newsgroups on the Internet for opinions. Using advanced language technologies, it interprets the sentiment of the text towards a specified product or service and then provides the user with an analysis of the results. Sentiment Monitoring Services combines a search agent and a perception engine to present users with an instant gauge of market perception of any feature, product, brand or organization. The natural language processor of the perception engine achieves an accuracy of approximately 90 percent compared to opinion ratings ranked manually."

A-Life Medical

"A-Life Medical’s patented Natural Language Processing (NLP) technology utilizes proprietary knowledge-bases of more than ten million facts to automate the coding process. Our technology combined with our software solutions and services are dramatically changing the way healthcare codes, submits claims, collects reimbursement, as well as improving patient care."

A-Life produces Alacer, "the first end-to-end practice management system that integrates document management, real-time NLP coding, billing, collections, denials management, and auditing into one streamlined Windows-based platform."

Alitora Systems

"Alitora System provides comprehensive software solutions for biotech research, management, compliance, intellectual property management and competitive intelligence. Our software enables users to search, annotate and collaborate, seamlessly, allowing the annotation of information as simple as a click, and collaboration as simple as a drag-and-drop."

Alitora offers kHarmony, which "allows users to identify concepts that are of interest, and then search for information relating specifically to those concepts...". Alitora describes kHarmony as "using proprietary graph-theoretic and information retrieval techniques to provide structure to unstructured data, perform data clustering, and enable visual data exploration."


"Appen develops and markets sophisticated computer-based speech and language technology products and services for major international information and communication companies and government organizations."

Appen supplies a range of corpora, as well as tools for morphology, sentiment, and authorship.

Ariadne Genomics

"Ariadne develops software tools for biologists in the areas of pathway analysis and automated scientific text processing. Ariadne products incorporate proprietary Natural Language Processing (NLP) and statistical algorithms designed to functionally interpret novel genetic information."

Although they don't really sell NLP software per se, they are making applications like their Medscan Reader for text data mining over biomedical research articles. It uses entity extraction (e.g. genes and diseases), specific relation extraction (e.g. binding or regulation) and sentence-level search and summarization.


"Attensity's breakthrough Text Analytics solutions enable computers to understand and process free-form text, offering organizations the opportunity to leverage the vast amounts of information contained in non-structured formats. The technology allows users to extract and analyze facts like who, what, where, why, under what conditions and to whom, as well as opinions and events found in unstructured data."

"Attensity offers a complete suite of products for Text Analytics. The suite includes both targeted and Exhaustive Extraction Engines that pull the information out of text and put it into a usable format, analysis and discovery applications that allow you to explore and make sense of the data, knowledge libraries and knowledge engineering tools that provide the ability to define what to extract and categories to put data in, and an integration toolkit."


" Autonomy is the acknowledged leader in the rapidly growing area of Meaning Based Computing (MBC)."

"Meaning Based Computing not only uncovers, but also makes sense of, the 85% of enterprise information that is hidden to all other technologies including keyword search engines and relational databases. ... Meaning Based Computing enables organizations to automatically form a contextual understanding of people's interests, behavior and ongoing interaction with any type of information. ... Meaning Based Computing enables organizations to extract meaningful evidence from terabytes of email, documents, spreadsheets and other unstructured information."

Basis Technology

"Basis Technology provides software solutions for extracting meaningful intelligence from unstructured text in Asian, European and Middle Eastern languages. We help technology companies and government organizations improve the accuracy of information retrieval, text mining and other applications through advanced linguistics."

Basis provides entity extraction in ten languages, does language identification, as well as Chinese and Japanese character-level support.


"BBN is the leader in the development of the new "Semantic Web," which will enable powerful searches and automated agents. BBN has been coordinating the work of 23 US and international research teams in conjunction with the World Wide Web Consortium and European Union collaborators to drive the transition to a semantic web."

They're currently selling IdentiFinder, a named entity recognizer.


"Providing natural language understanding to any application. We have already succeeded with search engines, databases, virtual assistants..."

Bitext (for "bits and text") provides NaturalFinder, "the essential complement for any search engine for Internet and intranets which allows users to query in natural language (Spanish or English) without using booleans or wildcards."


"Brainware helps the world's leading companies automatically extract, process, and retrieve data from any source. Our Data Capture solutions virtually eliminate manual data entry while our Enterprise Search solutions allow you to stop searching and start finding..."

The technology behind Brainware's Intelligent Data Capture Distiller involves "patented neural network-based classification" as well as "pattern recognition technologies and "fuzzy logic" to accurately sort documents and extract key data fields even when fields shift positions from document to document."


"Gain a more complete view of organizational performance by accessing your unstructured text data with the first and only native text analysis capabilities designed specifically to complement your BusinessObjects Enterprise and Data Integrator deployments."

BusinessObjects is part of SAP. Among its operations is BusinessObjects Text Analysis, which does entity extraction, ontology-driven categorization and document summarization. They offer dozens of languages.

Butler Hill Group

"The Butler Hill Group is a streamlined network of linguists, computer scientists, language experts and research librarians with expertise in the natural language issues of computer technology. We maintain solid relationships with highly skilled consultants ... our past projects include machine translation, web search, lexicon evaluation, product usability studies, speech and product localization ..."

Butler Hill primarily provides services for corpus creation, evaluation and internationalization.

Carrot Search

Carrot Search offers "professional installation, customization, clustering and text mining consulting services based on Open Source and proprietary software." They also offer Lingo3G, a "Document Clustering Engine that can organize collections of text documents into clearly labeled thematic groups. Accurately and on-the-fly."

They also offer the open source framework Carrot2, which provides federated search with clustering over popular search engines and APIs, including Lucene.


"Clarabridge's text mining software transforms text into actionable insight to improve market research, customer care, product development, quality assurance and risk management. Clarabridge's award-winning software links the worlds of text mining, search and business intelligence (BI) to enable enterprises to more quickly and intuitively leverage all of their data to make better business decisions."

They have a diagram of their offerings, which include data deduplication/cleansing, data linkage/merging, document segmentation and categorization, entity extraction, event, relationship and fact extraction, table parsing and image processing, as well as search and visualization on top of these.


"ClearForest's text-driven business intelligence solutions help organizations make more informed business decisions by doing what search technologies do not--extract free text for use within analytics applications and BI systems. We provide the analytical bridge between two previously disconnected worlds of information--unstructured text and enterprise data. In allowing both to be analyzed simultaneously, ClearForest makes unified business intelligence a reality."

"ClearForest Tags' open and flexible platform supports statistical, structural and semantic tagging as well as custom taggers, industry and custom taxonomies, and information agents."


"Trust, context and confidence anchor CodeRyte’s natural language processing (NLP) technology."

"The technology ‘reads’ medical reports and identifies accurate CPT and ICD codes from the text of a physician’s documentation." They very helpfully point to the American Health Information Management Association's page on Delving into Computer-assisted Coding.


"Cognia develops and distributes information solutions for pharmaceutical and biotechnology companies that cost-effectively accelerate discovery and research processes."

"The Cognia Molecular system is based on conclusions from our market research showing a critical unmet need of pharmaceutical, biotechnology, and non-profit research centers to make sense of and manage the growing tidal wave of information now available."


"CognitionSearch, the Company's patented meaning-based linguistic Search architecture, is able to deliver significantly higher levels of relevant search results than is possible with currently used Search technologies."

"The technology employs a unique mix of linguistics and mathematical algorithms which has, in effect, 'taught' the computer the meanings (or associated concepts) of nearly all the words and frequently used phrases within the common English language. It also has knowledge of the relations between words and phrases, especially paraphrase and taxonomy."

"CognitionSearch is the only commercially available technology that combines natural language queries with linguistic meaning-based Search and semantics. It incorporates statistical algorithms with linguistically mapped coverage of the English language,..."


"Connexor provides linguistic technologies and expertise to software houses and solution providers who tackle the challenge of how to derive useful information from unstructured digital text for different kinds of consumers and analysts."

"Connexor's Machinese software discovers the grammar and semantics of natural language. Machinese enriches text with linguistic markup: a uniform programmer's interface that enables use of text content in software applications and solutions."

Connexor makes some excellent heuristic/rule-based part-of-speech taggers, named entity extractors and dependency parsers.


"Go Beyond Search... Access, Share and Deliver Intelligence and Awareness, not just Links". "At Connotate we believe in working smarter, not harder."

"Utilizing patented machine learning algorithms, Agents are easily trained in minutes using a simple point-and-click process that requires NO programming. Agents can be deployed to do anything a human can do to mine, monitor, survey, collect, aggregate and normalize dynamic financial content deep within the Web or in the Enterprise into actionable business intelligence."

Content Analyst

"Stop searching, start doing." Content Analyst supplies a range of text analytics applications, including classification, named entity coreference, and relationship discovery.

Their technology is based on latent semantic indexing, a dimensionality reduction technique based on singular value decomposition of a matrix of co-occurrences. The technology was acquired when Content Analyst spun out from SAIC.
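
As a rough illustration of latent semantic indexing (not Content Analyst's implementation), the sketch below builds a tiny document-term matrix, finds the top two singular directions by power iteration on the document-by-document Gram matrix, and compares documents in the reduced space. The four documents, the rank of 2 and all function names are assumptions chosen for the example; a real system would use a proper SVD routine.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_eigenvectors(gram, k, iterations=200, seed=0):
    """Power iteration with deflation: top-k eigenvectors of a symmetric matrix."""
    rng = random.Random(seed)
    found = []
    for _ in range(k):
        vec = normalize([rng.random() + 0.1 for _ in gram])
        for _ in range(iterations):
            vec = matvec(gram, vec)
            # Remove components along eigenvectors that were already found.
            for prev in found:
                dot = sum(a * b for a, b in zip(vec, prev))
                vec = [a - dot * b for a, b in zip(vec, prev)]
            vec = normalize(vec)
        found.append(vec)
    return found


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical corpus: "car" and "automobile" never co-occur but share "engine".
docs = ["car engine", "automobile engine", "flower petal", "garden petal"]
vocab = sorted({w for d in docs for w in d.split()})
rows = [[d.split().count(w) for w in vocab] for d in docs]

# Document-by-document Gram matrix; its eigenvectors are singular vectors of `rows`.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]
vectors = top_eigenvectors(gram, k=2)
sigmas = [math.sqrt(sum(a * b for a, b in zip(v, matvec(gram, v)))) for v in vectors]

# Each document's coordinates in the 2-dimensional latent space.
coords = [[s * v[j] for s, v in zip(sigmas, vectors)] for j in range(len(docs))]

print("raw cosine, docs 0 and 1:", cosine(rows[0], rows[1]))
print("latent cosine, docs 0 and 1:", cosine(coords[0], coords[1]))
print("latent cosine, docs 0 and 2:", cosine(coords[0], coords[2]))
```

In the reduced space the two vehicle documents end up pointing in the same direction even though they share only one word, which is the effect concept search products are selling.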

Corpora Software

"Corpora's Applied Linguistics software helps knowledge workers find, read and understand textual information faster."

In addition to enterprise search and collaboration, they do document summarization, news document topic clustering, and sentiment analysis.

Crawdad Technologies

"Crawdad Technologies, LLC provides software and services to analysts and research professionals who need to transform unstructured text into insight."

Crawdad supplies Crawdad Desktop, which does scraping and some natural language processing involving classification and terminology extraction.

Crawdad also builds Listening Posts, which "listens to blogs, discussion boards, chat rooms, social networking sites, and online media for news or opinion about products, brands, celebrities, and issues. Users view a daily dashboard which uses patented natural language processing technology to analyze the buzz on the Web and make sense of it."

Dolores Labs

"We make crowdsourcing easy." Dolores Labs is involved in classification including topic and sentiment, document de-duplication, and other natural language tasks.

Dolores Labs are not so much competing as providing a complementary service: data annotation via crowdsourcing, for which they've used Amazon's Mechanical Turk.


"Engenium is a pioneer in conceptual search technology that increases the effectiveness of electronic information retrieval. Unlike keyword searching that is limited to precisely matching the language of a given query, Engenium's Semetric concept search engines and integrated Autometric clustering engine analyze documents by meaning, concept and context. This yields better, faster search results --- uncovers information that otherwise would remain buried --- and enables organizations to work smarter."

They seem to be applying latent semantic analysis (a kind of principal component analysis) to search and clustering.


"Using semantic understanding of content, Evri is building the data graph of the web. We'll use this to create interesting and meaningful connections without having to search."

Evri acquired the InFact product from Insightful when Evri was still named 'Hypertext Solutions'.


"We do not simply search, we find. We filter out all the irrelevant, peripheral data and provide the exact information end users are looking for."

"We have solutions that monitor competitive intelligence, provide brand and litigation protection, support regulatory and policy compliance, and investigate criminal and terrorist activity. They don't just return results, they return confidence and protection."

General Dynamics Advanced Information Systems

"General Dynamics Advanced Information Systems designs, develops, manufactures, and integrates information solutions for defense, intelligence, space and homeland security communities."

"General Dynamics Advanced Information Systems uses data mining technologies to help customers find new correlations, patterns and trends. We use advanced technology to sift through large amounts of data (structured data, text, audio, video, etc.) stored in repositories and use pattern recognition and statistical and mathematical techniques." In particular, their system "successfully performs entity extraction, a natural language processing technique, to derive facts such as names, places, organizations, locations and time from text."


"Different than a familiar R&D agenda in a search engine company, we undertook highly specific research tasks solely dedicated to the advancement of the core-competency in Web search. The main challenge is to make science work in a constrained deployment environment where speed, coverage, accuracy, and ease-of-use are high priority considerations."

hakia-Lab provides several technologies, including OntoSem, "a formal and comprehensive linguistic theory of meaning in natural language", QDEX, "Query Detection and Extraction (QDEX) system was invented to bypass the limitations of the inverted index approach when dealing with semantically rich data", SemanticRank, "a collection of methods to score and rank paragraphs", and Dialogue, "the conversational (dialogue) systems where the search engine communicates with the user in an elevated level of confidence".

Expert System

Expert System is a "leading provider of Semantic Intelligence software to discover, classify, and understand information contained in unstructured text. Expert System technology, COGITO enables natural language processing. It leverages full semantic analysis to automatically understand the content from any textual document, including the retrieval of meanings and the comprehension of natural language.

Semantic Intelligence enables you to read, understand, and extract the most relevant concepts present in the huge amount of documents, websites, presentations, emails and blogs that are accessible to us everyday."


"Fetch Technologies provides innovative solutions for integrating and accessing heterogeneous data sources."

Fetch isn't so much a direct competitor, but more of a complementary technology aimed at scraping web pages and record linkage (also known as database deduplication).


"GrammarSoft ApS is a small company specializing in Language Technology." Product offerings (for multiple European languages) include morphological analyzers, part-of-speech taggers, syntactic/dependency parsing, named-entity recognition, translation, and tools for teaching language and spell checking. They are a spinoff of the Visual Interactive Syntax Learning (VISL) project.


"Infogistics are one of the leading companies providing text-analysis, content extraction and document retrieval solutions across multiple areas of industry including HR, law enforcement, knowledge management and CRM."

"Using advanced Natural Language processing technology developed at Edinburgh University, Infogistics solutions enable information and data contained in structured or unstructured text documents to be retrieved, categorised, extracted and delivered to the right people at the right time." Their NLP offerings include sentence detection, part of speech tagging, and light syntactic chunking. They also offer higher-level products for specialized search, relationship extraction and document parsing.


Inform does "precise topic-based search for related content". Their technology involves text classification and entity extraction for more-like-this applications. You can also check out the Inform News Demo. Here's a paper explaining Inform's approach to text classification (regularized logistic regression).
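
For readers unfamiliar with regularized logistic regression for text classification, here is a minimal sketch in Python. It is not Inform's system; the bag-of-words features, the L2 penalty strength, the learning rate and the four training documents are all assumptions picked to keep the example small.

```python
import math


def features(text):
    """Bag-of-words feature counts for one document."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return counts


def predict(weights, bias, feats):
    """Probability that the document belongs to the topic."""
    z = bias + sum(weights.get(w, 0.0) * c for w, c in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(docs, epochs=200, rate=0.1, l2=0.01):
    """Stochastic gradient descent on log loss with an L2 penalty on the weights.

    The penalty is applied only to the weights of words in the current
    document, a common shortcut for sparse text features.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in docs:
            feats = features(text)
            error = predict(weights, bias, feats) - label
            for w, c in feats.items():
                old = weights.get(w, 0.0)
                weights[w] = old - rate * (error * c + l2 * old)
            bias -= rate * error
    return weights, bias


# Hypothetical training set: label 1 = sports, label 0 = finance.
training = [
    ("team wins the final match", 1),
    ("striker scores in the match", 1),
    ("market falls as rates rise", 0),
    ("bank raises interest rates", 0),
]
weights, bias = train(training)

print(predict(weights, bias, features("the team scores")))
print(predict(weights, bias, features("interest rates rise")))
```

The L2 term shrinks weights toward zero, which matters for text because most words are rare and would otherwise be fit too confidently from one or two examples.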


"InQuira helps companies deliver more effective customer service through their Web sites and contact centers." Their product features "integrated capabilities for natural language search, knowledge base management, and analytics".

Their product line includes InQuira Intelligent Search, "a unified system that combines advanced linguistic techniques and contextual understanding to provide unparalleled capabilities for understanding, and responding to, the true intent behind a user’s inquiry and browsing behavior."

Intelligent Results

"The PREDIGY platform expedites the process of building, simulating and deploying analytical decision applications. PREDIGY integrates data exploration, predictive model development and strategic planning and analysis. Once strategies are defined, the platform generates fully functional scoring or decision applications."

PREDIGY consists of five modules, including IR Discover, which finds "patterns hidden in various types of data: structured (fields), semi-structured (codes) and unstructured (text). Use discovered text patterns in predictive models and for concept-based decision criteria in strategies." through the extraction of "key topics and concepts from unstructured content using patented unstructured-text analysis and advanced data-mining algorithms."


"IntelliResponse delivers on the promise of web self-service by providing one right answer to visitor questions."

IntelliResponse mainly works in question answering and classification in the context of customer relationship management (CRM). About this they say their "Patented 'one right answer' solution understands precisely what the visitor wants, regardless of the hundreds of ways a specific question can be asked." They hedge this a bit later by saying, "While it's not possible (or a good idea) to create an IntelliResponse knowledge base that would answer every possible question that anyone would ask, it is very possible to create one that will answer upwards of 90% of incoming questions."

You can even try it by entering a question on their home page.


"Inxight is the leader in federated search, high-fidelity extraction and visualization, enabling enterprise, government, and OEM customers to discover, organize, and understand information contained in unstructured text in all major languages. Inxight solutions allow its customers to access, cluster, and be alerted to relevant information contained in the open Web, deep Web (patent databases, SEC filings), subscription, and internal sources."

"Inxight's deep understanding of all major languages, including English, Arabic, German, French, Farsi, Spanish, and Simplified Chinese, powers the ability to automatically identify and tag named entities in a document -- such as persons, companies, places, weapons, addresses and dates. It also identifies events - such as M&A activity and travel events - as well as relationships between entities."

They also have a software development kit (SDK) which includes entity extraction and visualization components.


"Irion Technologies has succesfully picked up the challenge to make computer programmes that make sense out of text, and really understand human language."

"Irion's software improves any web communication involving human language, and applies to any organization in the world dealing with textual information. This includes conceptual search, knowledge management, E-commerce, customer support, and many other applications." Their technology seems to be organized around classification.


"Janya provides products and services to support information discovery from unstructured and semi-structured data. With more than a decade of experience developing and integrating this technology, Janya works with customers and system integrators to incorporate information discovery capability in both unclassified and classified environments. By leveraging existing search technology and tools for structured data analysis, Janya's solutions enable users to increase their effective bandwidth to analyze text data streams."

Janya builds Semantex, "an enterprise-class information extraction system that supports the automatic or semi-automatic analysis of large volumes of electronic information in order to detect entities, attributes, relationships and events. Semantex represents a hybrid model for information extraction, combining machine-learning and grammatical approaches to achieve better results than any of the techniques could individually." They also build tools for case restoration and "text zoning".

Note that Janya was a spinoff of Cymfony, and Cymfony is now part of TNS Media Intelligence.

LCC (Language Computer Corporation)

LCC has a suite of "Cicero" tools (CiceroLite, CiceroCustom) which do content extraction.


"Predicting outcomes and planning strategy through deeper understanding of opinion holders' sentiment over time."

They supply an extensive list of whitepapers.

Language and Computing

"Natural Language Processing (NLP) and Natural Language Understanding (NLU) are technologies that can extract data and information from free text documents for further processing. Language and Computing (L&C) is unique in delivering this level of understanding through its integration of the world's largest medical ontology with sophisticated linguistic processing algorithms."

In addition to their own proprietary structured medical "knowledge bases", they offer search. Their main offering in NLP seems to be concept-based search and template filling, as exemplified by their Terminology Supported Semantic Indexing product, for which they've provided an architecture diagram.

LXA Lexalytics

"Designed to help our customers address the basic problem of making their loosely structured information more valuable. We have created a set of products that attack the problem of discovering, understanding and acting on information that affects their business."

Lexalytics's products include entity extraction, relation extraction, document summarization and sentiment analysis; judging from their press releases, they also do classification.

Lexalytics has an Entity Extraction Demo online, which seems more focused on precision than recall.

Lexicography Master Class

"Lexicography MasterClass Ltd is a company specializing in lexicography and lexical computing. We run training courses, design and build language corpora, supply lexicographic software, and provide a complete project-management service for lexicographic projects, from conception to delivery."

Lextek International

"Lextek International supplies advanced information retrieval and natural language processing technology."

"Our technologies are used in a wide variety of business solutions. These range from document management systems to custom web based applications."

Lextek supplies a general document classification engine, a language identifier, and a document summarizer.


"Linguamatics enables organizations to reap maximum return from their available knowledge assets, by the effective deployment of advanced natural language processing technology."

They extract entities and relations from text, resolving them against ontologies.


Linguit

Linguit's Nuggets is "a natural language enabled search engine for the mobile user and is the first service of its kind."

Linguit's services page indicates they are "designing and developing software components and applications in the domain of natural language processing". They mention stemmers, morphological and syntactic analyzers, information retrieval, language understanding, and text mining.


Lingway

Lingway is a "specialized search engine company". "Lingway solutions are built around a set of NLP components. These enable users to develop specific applications or to enhance existing applications by adding linguistic capabilities."

The list of components they provide is "chunking, clustering, parsing, semantic expander, spell checker, tagging, word sense disambiguation ..."


"Leximancer is a software tool that enables users to find meaning from text-based documents. It automatically identifies key themes, concepts and ideas from unstructured text with little or no guidance."

Leximancer is focusing on word-based classification and automatic taxonomy/synonym generation.

Lockheed Martin

"The AeroTextTM product suite provides a fast, agile information extraction system for developing knowledge-based content analysis applications. Possible applications include automatic database generation, routing, browsing, summarizing and searching."

They do named entity extraction, coreference resolution, part-of-speech and phrase extraction, clustering, topic categorization, and "event" extraction. They train by example and have special capabilities for reasoning about locations and times. They offer language support for English, Arabic, Chinese, Spanish and Indonesian.


"Utilizing and implementing up to date research results in the fields of computer science, language technology, and information theory we at Matrixware enable our customers to skilfully navigate the endless sea of patent literature."

Meaningful Machines

"Meaningful Machines develops, patents, and commercializes language technologies based on a unique suite of methods that automate machine understanding of natural language. The company is developing technologies for use in machine translation (MT), text mining, machine learning, and other applications that benefit from machine understanding."

Although they cite very general problems, their technologies page only addresses machine translation.


"TextAnalyst will help you quickly summarize, efficiently navigate, and cluster documents in your textbase."

Megaputer's TextAnalyst product extracts semantic networks and summarizes text using a "balanced combination of linguistic and neural network investigation methods". Their notion of semantic network is "the most important concepts from the text and the relations between these concepts weighted by their relative importance". They also use these semantic networks for clustering and exploration/search.


"GTS [Geographic Text Search] identifies implied and explicit references to geographic locations within documents, assigns latitude/longitude coordinates to the references, indexes the document, and then enables a search for indexed documents through Graphical User Interfaces (GUIs)."

"MetaCarta GeoTagger is a web service that geographically tags documents. The same core natural language processing and GDM technology underlies both GTS and GeoTagger."

MetaCarta specializes in location mention recognition and resolution in text. They use probabilistic models with confidence ranking.
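To make "location mention recognition and resolution" with confidence ranking concrete, here is a rough sketch of a gazetteer lookup. It is not MetaCarta's method or API: the gazetteer entries, the prior values used as confidences and the `geotag` function are assumptions invented for illustration.

```python
import re

# Hypothetical toy gazetteer: lowercased name -> candidate places as
# (label, latitude, longitude, prior). The priors are made-up numbers.
GAZETTEER = {
    "paris": [
        ("Paris, France", 48.8566, 2.3522, 0.9),
        ("Paris, Texas, USA", 33.6609, -95.5555, 0.1),
    ],
    "cambridge": [
        ("Cambridge, UK", 52.2053, 0.1218, 0.6),
        ("Cambridge, Massachusetts, USA", 42.3736, -71.1097, 0.4),
    ],
}

def geotag(text):
    """Return (mention, label, lat, lon, confidence) tuples.

    Mentions appear in document order; the candidates for each mention
    are ranked from most to least confident.
    """
    results = []
    for match in re.finditer(r"[A-Za-z]+", text):
        candidates = GAZETTEER.get(match.group().lower(), [])
        for label, lat, lon, prior in sorted(candidates, key=lambda c: -c[3]):
            results.append((match.group(), label, lat, lon, prior))
    return results

for row in geotag("She flew from Paris to Cambridge."):
    print(row)
```

A real system would also use the surrounding context (nearby place names, cue words such as "in" or "near") to adjust these scores. This sketch ranks by the prior alone.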

Mnemonic Technology

"At Mnemonic, we can help you realize the full value of your information, whether structured or unstructured."

"Our relevance models help you get the right information to the right people at the right time. They automatically learn to prioritize, categorize, monitor and summarize large volumes of unstructured textual information according to the unique requirements of individual users."

From what we can tell from their web site, their "relevance models" learn scored text classifiers by example. The only application mentioned for text analytics is search query refinement.


Morphologic

Morphologic provides a range of products, mostly arranged around morphologically sensitive bilingual translation dictionaries, including thesauri. Applications include copy editing such as spelling checkers, hyphenators, grammar and style checkers; text search tools including stemmers; and translation of full documents. Tools include morphological analyzers including stemmers based on unification grammars, syntactic analyzers, spell checkers and language identifiers.


NetOwl

Extractor: "Accurately perform entity extraction from unstructured texts using advanced computational linguistics and natural language processing."

Summarizer: "Reliably generate abstracts and summaries of long and complex documents."

TextMiner: "Empower users to find, organize, analyze, and mine a large volume of unstructured information using the most advanced text analysis technology available."

They also run in several languages. They have some kind of cross-document coreference. They spun out of SRA International. They claim to have "best-of-breed entity extraction" and "unique link and event extraction", but don't explain what the breed is and don't list any unique features of their link and event extractors. They even claim "NetOwl posted the highest score ever achieved for name extraction from unformatted text, a score which has never been equaled by another system", but don't provide any details.

Northern Light

"What if your search engine could read all the market intelligence reports and articles your company creates or licenses and tell you what is in them, suggest to you what the business issues are that they report on, and direct you to the documents that are the most interesting to you, not from a search relevance perspective, but from a meaning perspective?"

Northern Light offers Market Intelligence Analyst, which contains entity extraction, sentiment analysis, relationship identification, meaning extraction and trend analysis. There's a bit more information in their MI Analyst Product Sheet.


"We’re developing software that understands English, and can converse with people about a body of facts.". Looks like they're still in the development phase, but plan to use parsing to logical representations, ontologies and natural language generation."


"Ntelligent Enterprise Search by Nstein is a powerful search solution built to increase the efficiency and productivity of your employees on intranets and portals. For public websites, it will guide your customers in the most advanced discovery process. It quickly delivers highly accurate search results in all circumstances."

"One suspicious letter. One dangerous passenger boarding a routine flight. One viral infection in a small village, in a distant country. These events have led to tragedies we all wish had never happened. Critical information preceeded these events. Critical information that could have been flagged by Nstein Technologies."


"Ontotext is a leading developer of core semantic technology, which delivers applications in domains like Web Mining, EAI, KM, BI, and Media Research."

"Ontotext is a laboratory of Sirma, active in several research areas, including: Ontology Management; Information Extraction and Retrieval (IE, IR); Semantic Web Services." Their products include the KIM Platform for semantic annotation driven by a "semantic repository". The semantic annotation includes named-entity extraction from a specified ontology. They have contributed extensively to GATE (for more info on GATE, see above).


"Panscient is a content supplier for vertical search engines." They have the interesting business model of supplying lists of people and businesses scraped from the entire .com domain of corporate web sites and updated monthly. They also develop vertical search applications.

Parity Computing

" Parity Computing's unstructured data management and knowledge discovery solutions transform disparate data and content into a knowledge network of actionable profiles and linked relationships."

Parity offers the Profiler System, which "assembles and analyzes detailed profiles of key entities such as people, institutions, and products, from disparate unstructured documents and semi-structured data sources. ... The key entities are extracted and assembled into distinct profiles using advanced machine learning heuristics. This includes normalization of spelling variations together with disambiguation of similarly-named entities (e.g. two people with the same name)." Additional functionality cited includes home page finding, extracting patent references from web pages, etc.

Parity also offers a lower level tool, the Reference Processor, a "fully automated software engine for high-accuracy reference processing and linking of publication databases and bibliographies in arbitrary formats." Technology includes extraction, deduplication, clustering and correction.


"We're creating natural language technology that grows and improves as it collects simple human judgments about language. Using this technology, we're building tools to search blog posts, feeds, and other texts for key concepts, not just keywords and phrases. Think of it as tagging meets natural language processing." As of February 2007, they only offered a mailing list.


"Natural Language Search".

Q-go offers the Q-go Natural Language Search Product Suite. "Q-go's Natural Language Search gives organizations insight into visitors' expectations and wishes and helps them adapt their online information accordingly." "The answers provided by Q-go are comparable to those of call centers in terms of consistency, completeness and quality, which is not only cheaper but also faster and easier for organizations."


"QL2 Software's tools and solutions deliver business critical data seamlessly and in real-time. QL2's technology integrates data from virtually any source, inside and outside the firewall, with existing applications and solutions. The result is better analytics and smarter, more profitable decisions."

It wasn't clear from the web site whether any natural language processing was involved in their products.


"RapidMiner (formerly YALE) is the world-leading open-source system for knowledge discovery and data mining. It is available in different flavours: a free open-source version licensed under the GPL, a free version with an improved user interface, and under a developer license (OEM) which allows the integration of RapidMiner as a powerful library even into proprietary products. Enhance your products with adaptability and innovative analytical features. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge."


"Sophisticated Search Review and Analysis Made Simple"

"MindServer Categorization automatically maps structured and unstructured information into an information structure - taxonomy, ontology, or subject heading classification."

"The core technology powering Recommind's MindServer platform is based on patented, proprietary machine learning techniques including the Probabilistic Latent Semantic Analysis (PLSA) algorithms."

Reel Two

"Reel Two is tackling the tough problems in search and data analysis. Our software products and custom solutions provide scientists, analysts and managers with quick, intuitive access to the information that is most relevant to their work."

Reel Two spun out of the WEKA group at Waikato, and is primarily focused on text classification and entity extraction as well as additional biomedical applications aimed at chemical name resolution.


". Our groundbreaking data analytics technology is designed to deliver intelligent monitoring and analysis of unstructured data, all within the context of the analysis environment. With RiverGlass tools, organizations can make unstructured data sources like the Internet into true strategic information resources to help drive their success."

From their technology capabilities page, they appear to be doing search with ontology integration for topic monitoring, entity extraction and link analysis.


SAS

SAS Text Miner "provides a rich suite of tools for discovering and extracting knowledge from text documents. It transforms textual data into a useable, intelligible format that facilitates classifying documents, finding explicit relationships or associations between documents, clustering documents into categories and incorporating text with other structured data to enrich predictive modeling endeavors."

Text Miner includes format parsers, integrated GUIs and databases, text parsing for terms or phrases, stemming, part-of-speech tagging, and "distillation". It also includes "data cleaning" and spelling correction. Text Miner is also integrated with their larger suite, Enterprise Miner, which supports classification and clustering.


"We are a semantic search startup with a passion for organizing information to facilitate how people discover and filter information to make better decisions." Not much info, as they say "we are committed to launching in stealth mode".


"Semantra extends traditional BI and enterprise search applications, by empowering users to quickly and easily access precise, critical information from enterprise databases through a familiar search box and natural language."


"Sinequa is an innovative leading global provider of Enterprise Search Solutions."

"Sinequa CS has been developed by Sinequa as the ultimate, multi-lingual knowledge access platform. Featuring cutting-edge semantic and linguistic technologies, Sinequa CS is one of the most advanced Enterprise Search solutions available today.


"Soliloquy is the world's first company to offer 'intelligent,' fully automated solutions that enable end users to find the information, services and products they desire through targeted online dialogs."

Soliloquy is in the business of dialog mining, a kind of text data mining over dialogues. They claim to be "the world's first company to offer turnkey solutions that enable end users to find the information and products they desire through intelligent, targeted dialogs."


"Text Mining for Clementine is a text mining workbench that enables you to extract key concepts, sentiments, and relationships from textual or "unstructured" data and convert them to a structured format that can be used to create predictive models."

STIL Language Technology

"Consultancy and software. Bringing cademic research to business." "STIL can offer solutions to businesses and non-profit organizations that have a need to explore what language technology could offer them, or with a need to integrate language technology components in their information systems."

STIL offers software based on the TiMBL memory-based learning system, including sequence taggers, shallow parsers, word-sense disambiguation and morphological analysis.
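Memory-based learning, the idea behind TiMBL, stores the training instances and classifies a new instance by finding the most similar stored ones. Here is a minimal sketch of that idea using an unweighted overlap distance and a majority vote. It is not TiMBL's API or file format, it leaves out TiMBL's feature weighting, and the toy tagging data and function names are assumptions for illustration.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(memory, instance, k=1):
    """Majority label among the k stored instances nearest to `instance`."""
    nearest = sorted(
        memory, key=lambda item: overlap_distance(item[0], instance)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy memory for tagging the word "run":
# (previous word, word, next word) -> part-of-speech tag
MEMORY = [
    (("the", "run", "was"), "NOUN"),
    (("a", "run", "is"), "NOUN"),
    (("to", "run", "fast"), "VERB"),
    (("will", "run", "home"), "VERB"),
]

print(classify(MEMORY, ("to", "run", "home")))
print(classify(MEMORY, ("the", "run", "is")))
```

Nothing is abstracted away at training time, so exceptions and rare cases stay available at classification time; the cost is that every lookup has to search the stored instances.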

Summer Eyes Software

"Summer Eyes Software automates the extraction of value from English text. Summer Eyes automatically 'reads' articles, emails, news group postings, web pages, message boards, reviews, resumes, recipes and other text. It identifies topics, senses emotional tone, builds outline structures, and notes references to people, companies, products, relationships and events."

"To make a long story short .... Summer Eyes makes a long story, short."

They seem to be using Danny Sleator's Link Grammar Parser. Among other things, they're working on the PulseTrak blog sentiment analyzer.

Talking Dolphin

"Talking Dolphin develops commercial software that enables sophisticated processing of text documents, using state-of-the-art technologies from the fields of artificial intelligence, machine learning, and statistical natural language processing."

"For example, our technology can be used to automatically classify web pages into taxonomies, route customer support emails to appropriate recipients, or extract structured information from web pages."

They're mainly pushing their natural language classification software, though they mention beta versions of clustering and future versions of products such as named entity extraction and spelling correction. Dan Klein and Ted Grenager, two NLP researchers, founded the company and form its management team.


"TEMIS develops and markets corporate Text Mining solutions. Our software unlocks knowledge from unstructured data."

Temis's core products include an "information extraction server dedicated to the analysis of text documents", a hierarchical clusterer that "proposes the most relevant classification for a given document collection", and a classifier that "classifies unstructured documents into pre-defined categories, combining statistical and linguistic analysis rules". This is all based on "XeLDA", their "multilingual linguistic engine". They have an impressive list of clients.


"Teragram Corporation is the market leader in multilingual natural language processing technologies that use the meaning of text to distill relevant information from vast amounts of data."

Teragram provides, working top down, question answering, classification, entity recognition, part-of-speech tagging, morphological stemming, spelling correction, and various search support tools such as language/charset ID.

Text Analysis International

"VisualText is the premier integrated development environment for building information extraction systems, natural language processing systems, and text analyzers."

TAI offers VisualText, "an Integrated Development Environment for deep text analysis applications. Think of it as Visual C++ for Natural Language Processing applications." They also provide TAIParse, which includes part-of-speech tagging and noun-phrase chunking. The basic technology appears to be a multi-pass rule-based approach.


"TextMap is a search engine for entities: the important (and not so important)people, places, and things in the news.".

"TextMap analyzes both the temporal and geographical distribution of news entities."

"TextMap uses natural language processing techniques to track entity references in news sources, and a variety of statistical techniques to analyze the relationships between them."


"We provide business-to-business (B2B) analytical software and services to accurately examine and extract information from large volumes of unstructured text."

"TextOre has the ability to perform searches that are highly detailed, using multiple queries and in multiple languages, while providing easily understood results. The results are provided through an advanced visualization profile tool that identifies and visually depicts the intensity of relationships in unstructured data sources (letters, documents, e-mail and web pages), including real-time news and information feeds. Our technology not only identifies anomalies missed by competitive technologies, but also identifies specific sentences, paragraphs and relationships, taking into account the precise terms applied by a user."


"TextWise energizes your existing advertising portfolio by offering high-resolution targeting, sophisticated media placement, and hassle-free automation of both ad creation and placement."

"Semantic Signatures are TextWise's patented contextual targeting technology. They innovate beyond simple keyword-based or category-based models currently used in so-called "contextual advertising" and deliver a new level of context-driven advertisment matching." They also "capture meaning through concepts, not keywords -- including multiple meanings and topics within a single document."

TNS Media Intelligence/Cymfony

" Cymfony, a division of TNS Media Intelligence, is a market influence analytics company that sifts and interprets the millions of voices at the intersection of traditional and social media such as blogs and social networks to gain consumer insight and develop stronger bonds with influencers."

"Cymfony's core is an advanced information extraction engine that combines information retrieval and Natural Language Processing (NLP) technologies to identify important people, places, companies, concepts, relationships and events in documents." From the web site, this looks like the latest version of Cymfony's "InfoXtract Engine".

Cymfony spun off the government systems business to form Janya.


"In a fully automated process BullDoc(tm) server will crawl your organization resources (shared directories, submitted emails, specific web sites), feed them to the information extraction engine that will save the extracted data into the database....The system comes with plug-ins for many applications (MSWord, outlook, numerous web browsers) that enable the user to view the documents/emails/web pages in the way he use to, only gives her the ability to browse and navigate within a document to the relevant information. "

Vantage Linguistics

"As a world leader in the development of linguistic software solutions, Vantage Linguistics continues to set the benchmark for innovation and excellence in language-based research and artificial intelligence."

Vantage offers a range of products, including language identifiers, spell checkers, grammar checkers, and linguistically informed search.

Xerox European Research Centre

"With the multiplication of on-line document repositories and the phenomenal growth of the Web, a fantastic amount of information is available at our fingertips. The central problem becomes that of quickly accessing, within that mass, the arbitrary pieces of information that are needed at any given time. As a large proportion of the data is made up of natural language texts, any comprehensive solution will rely heavily on natural language processing (NLP). Our research agenda concerns theories, methods, tools and systems that make it possible to uncover the content of natural language texts."

XRCE provides demos and licenses for software for finite-state automata, machine learning for categorization and clustering, robust parsing and semantics. You can find some online demos and links to research software from the above link.


" YooName is Named Entity Recognition software based on semi-supervised learning. It identifies nine named entity categories that are split into more than 100 sub-categories."

"The YooName database and rule system are built using semi-supervised learning techniques."


"What we do: Collect and analyze Market Intelligence"

They appear to do text document classification for sentiment, survey analysis, email data mining and discussion/chat mining.

ZoomInfo

ZoomInfo "is the premier business information search engine, with profiles on more than 35 million people and 3.8 million companies. ZoomInfo delivers a single site for quick and easy access to in-depth information on industries, companies, people, products, services and jobs." "ZoomInfo, a semantic search engine, uses its patented Natural Language Processing algorithms to understand and organize the business web."

ZoomInfo focuses on search for people, companies or jobs on

Lists of Tools and Corpora

Lots of other groups have put together lists like this. They contain many links to one-off packages, and almost all of them are more comprehensive than this list on such packages (like Adwait Ratnaparkhi's tagger, Michael Collins's parser, Eric Brill's tagger, the YamCha SVM tagger, the Cambridge-CMU language toolkit, etc.)

Posted Fri, Mar 13 2009 8:37 PM by


lingyun2003 wrote re: Sentiment/Text Mining Tools
on Fri, May 8 2009 9:57 PM

thanks a lot

Quora wrote Which platform / tool / language should be good for text mining ?
on Wed, May 23 2012 10:11 PM

These tools may be helpful to you:

Commercial: Autonomy Lexalytics SAS/SPSS SQLServer 2008+

OpenSource: RapidMiner NClassifier OpenTextSummarizer WordNet OpenNLP SharpNLP Lucene/Lucene.NET LingPipe Weka



copyright 2004-2017, LLC