Topic Modelling

Promoss Topic Modelling Toolbox

The Promoss topic modelling toolbox is free software, developed by the Institute for Web Science and Technologies at the University of Koblenz-Landau and GESIS, Leibniz Institute for the Social Sciences in Cologne.


Download jar | Java source code


Latent Dirichlet Allocation (LDA)

Promoss implements LDA with an efficient online stochastic variational inference scheme, so memory consumption is lower than for standard implementations and inference is significantly faster.

Usage is simple: create a corpus.txt file in which each line corresponds to one document, then run promoss.jar with

 

java -Xmx11000M -jar ./promoss.jar -method "LDA" PATH_TO_DIRECTORY/ \
    -MIN_DICT_WORDS 100 -T 50

 

Here, -T 50 sets the number of topics to 50, and -MIN_DICT_WORDS 100 sets the minimum number of occurrences a word needs in order to be included in the analysis (in this case 100). An alternative input format, based on a dictionary and documents given in SVMlight format, is documented in the readme file.
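As a rough illustration of the input layout described above, the following sketch creates a working directory containing a three-document corpus.txt. The directory name promoss_demo and the example sentences are hypothetical and chosen only for illustration; the only thing taken from the description is that each line of corpus.txt holds one document.

```shell
# Hypothetical working directory; this is the folder Promoss would be pointed at.
mkdir -p promoss_demo

# One document per line, as described above. The sentences are made up.
cat > promoss_demo/corpus.txt <<'EOF'
The harbour was crowded with fishing boats this morning.
New topic models can exploit timestamps and locations.
Fishing quotas are a recurring topic in coastal politics.
EOF

# Sanity check: the number of lines equals the number of documents.
doc_count=$(wc -l < promoss_demo/corpus.txt | tr -d ' ')
echo "documents: $doc_count"
```

Note that in a toy corpus of this size no word occurs 100 times, so the -MIN_DICT_WORDS 100 threshold from the command above would presumably leave an empty vocabulary; a much lower threshold would be needed for small test runs.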

Hierarchical Multi-Dirichlet Process Topic Model (HMDP)

Do you want to include several kinds of document metadata in your topic model, such as geographical locations, timestamps or ordinal variables? And would you rather not spend weeks writing your own topic model, while still getting efficient inference?

Store the metadata of each document, separated by semicolons, in a file named meta.txt. Put the documents in a file named corpus.txt, in which each line corresponds to one document. Documents can be raw text and will be processed by Promoss. You have to specify which metadata are geographical locations, timestamps, ordinal data or nominal data. Timestamps can be used to extract yearly, monthly, weekly or daily cycles.
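The file layout described above can be sketched as follows. Several details here are assumptions rather than documented behaviour: that line i of meta.txt belongs to line i of corpus.txt, that timestamps are given as UNIX seconds, and that a timestamp followed by a nominal label is a valid combination. The directory name promoss_hmdp_demo and all sample values are hypothetical; the readme file is the authoritative source for the exact format.

```shell
# Hypothetical working directory holding both input files.
mkdir -p promoss_hmdp_demo

# One raw document per line (made-up sentences).
cat > promoss_hmdp_demo/corpus.txt <<'EOF'
The harbour was crowded with fishing boats this morning.
New topic models can exploit timestamps and locations.
Fishing quotas are a recurring topic in coastal politics.
EOF

# Assumed layout: one metadata line per document, fields separated by semicolons.
# First field: timestamp in UNIX seconds (assumption); second field: nominal label.
cat > promoss_hmdp_demo/meta.txt <<'EOF'
1420106400;news
1435744800;blog
1451642400;news
EOF

# Check that every document has a metadata line.
docs=$(wc -l < promoss_hmdp_demo/corpus.txt | tr -d ' ')
metas=$(wc -l < promoss_hmdp_demo/meta.txt | tr -d ' ')
if [ "$docs" -eq "$metas" ]; then
  echo "ok: $docs documents, $metas metadata lines"
else
  echo "mismatch: $docs documents, $metas metadata lines" >&2
fi

# Count metadata lines that do not have exactly two semicolon-separated fields.
bad_lines=$(awk -F';' 'NF != 2' promoss_hmdp_demo/meta.txt | wc -l | tr -d ' ')
echo "malformed metadata lines: $bad_lines"
```

Checking the two line counts before starting a long inference run is a cheap way to catch misaligned metadata early.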

Then you just have to execute the .jar file with a few parameters (documented in the readme file). Example command line usage:

 

java -Xmx11000M -jar promoss.jar -directory PATH_TO_DIRECTORY/ \
    -meta_params "T(L1000,W1000,D10,Y100,M20);N" -MIN_DICT_WORDS 1000

 

If you need any support in using Promoss, feel free to contact us:
topicmodels (ät) c-kling.de