This is a guide to using the Shiny web ‘Viewer’ application created as part of The Concept Lab project. The purpose of the app is to visualize and explore the architecture of concepts inferred from large text corpora by means of statistical measures of the co-association of words in the text.
This document refers to the ‘October’ version of the app, completed in October 2018. A paper describing in full the natural language processing methods and some of the implementation details is available in the proceedings of IWCS 2017 (Nulty, Paul, 2017, “Network Visualizations for Exploring Political Concepts”, Proceedings of the 12th International Conference on Computational Semantics (IWCS)). For a more theoretical perspective on applying these methods to the study of the history of concepts, see the Concept Lab’s article published in Contributions to the History of Concepts (de Bolla, P., Jones, E., Recchia, G., Regan, J., & Nulty, P., 2019, “Distributional Concept Analysis: A Computational Model for Parsing Conceptual Forms”, Contributions to the History of Concepts).
The app is now hosted on Cambridge University Library servers. Older versions were hosted on Amazon Web Services and served through port 3838. On some public wireless networks, this port may be restricted. If the app fails to load, try accessing it from an internet connection that does not restrict this port.
The app pane is composed of a sidebar (on the left) and a main panel, with a tab menu along the top of the screen to switch between panels showing different aspects of the app.
When the app is first opened in a browser, the sidebar and Configuration Panel are displayed.
The Configuration Panel is the main screen from which the dataset and several universal preferences are selected. When the app is opened, the ECCO dataset will be loaded by default. This takes a few seconds, and when it is complete the sidebar text display will show ‘Co-occurrence counts loaded’ as well as several properties of the loaded data (see Figure ). The bottom of the sidebar shows the data file name from which the current co-occurrence counts are loaded, in this case ‘ECCO_100_dist_100_cut_10_2’. Also available are two sets of data constructed from libertarian and socialist text from reddit.
From the top down, the options available in the configuration panel are as follows:
Measure: The measure by which the association between words is scored. Each of these measures is based on the number of times words co-occur in a certain context, adjusted for the number of times each word occurs independently.
\[
DPF(A,B) = \frac{Co\mbox{-}occurrences(A, B)}{Freq(A) \cdot Freq(B)}
\]
DPF (Distributional Probability Factor) is a measure similar to pointwise mutual information, with an extra parameter to downweight the score of very infrequent words. By default the log-dpf option is selected, as the association scores calculated by DPF tend to have a power-law distribution.
Decade Starting: For the ECCO corpus, only one decade at a time can be loaded.
Dataset: Several datasets are available, with Eighteenth Century Collections Online the default.
Concreteness threshold / filter concrete words: If the filter concrete words checkbox is ticked, then only words which appear in a contemporary vocabulary rated for concreteness by human annotators will appear (Brysbaert, Marc, Amy Beth Warriner, and Victor Kuperman. ‘Concreteness ratings for 40 thousand generally known English word lemmas.’ Behavior Research Methods 46.3 (2014): 904-911). Note that this annotated word list excludes many archaic terms and proper nouns. The words are rated on a scale from 1 (most abstract) to 5 (most concrete), and only words below the concreteness threshold set by the slider will be included. That is, if the slider is set to 4.5, words rated 4.5 and greater (the most concrete) will be filtered out.
2D node label size: This slider controls the size of the node labels for network panels that use 2D output. This is useful for setting an appropriate size for capturing readable screenshots for figures.
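The measure and concreteness options above can be illustrated with a minimal Python sketch. It is not the app's own code. It implements the DPF formula only as written above, so the extra parameter that downweights infrequent words is omitted because its form is not given here. The natural logarithm, the function names, and the example counts and ratings are all assumptions made for illustration.

```python
import math


def dpf(cooccurrences, freq_a, freq_b):
    """Raw DPF as in the formula above: co-occurrences over the product of frequencies."""
    return cooccurrences / (freq_a * freq_b)


def log_dpf(cooccurrences, freq_a, freq_b):
    """Log-scaled DPF. The base of the logarithm is an assumption (natural log)."""
    return math.log(dpf(cooccurrences, freq_a, freq_b))


def filter_concrete(words, ratings, threshold=4.5):
    """Keep only words that have a rating and are rated below the threshold.

    Words missing from the ratings (e.g. archaic terms, proper nouns) are dropped.
    """
    return [w for w in words if w in ratings and ratings[w] < threshold]


# Hypothetical counts: the pair co-occurs 50 times; the words occur 1000 and 2000 times.
raw_score = dpf(50, 1000, 2000)
log_score = log_dpf(50, 1000, 2000)

# Hypothetical ratings, not taken from the published word list.
example_ratings = {"liberty": 2.0, "table": 4.9}
kept = filter_concrete(["liberty", "table", "prorogue"], example_ratings)
```

Here `kept` contains only `"liberty"`: `"table"` is rated above the threshold and `"prorogue"` has no rating at all.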
The following options are only displayed in the side panel when one of the Network Visualisation panels is selected.
Prune nodes of degree < : Nodes with fewer than this number of connections are not displayed. The default setting is two, meaning that nodes that are only connected to one other node in the network are not shown.
Steps from search nodes: The display shows a neighbourhood network, constructed by starting with nodes specified by the search terms and expanding to include their neighbours n hops away, where n is specified with this slider.
Centrality sample size: The centrality panel shows a table of nodes ranked by their centrality score in the co-occurrence network specified by the dataset, thresholds, options, and search terms specified in the sidebar and configuration pane. The betweenness centrality is calculated by finding the length of shortest paths between random nodes in the network (Ulrik Brandes, A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25(2):163-177, 2001). As this algorithm is computationally intensive, only paths of length less than the value specified here are counted in the estimation. The higher the sample size, the more accurate the betweenness estimation but the longer the time taken to run.
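The neighbourhood expansion and degree pruning described in the options above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the app's implementation: the graph is represented as a dictionary mapping each word to a set of neighbours, pruning is applied in a single pass, and degree is counted only among the nodes currently displayed. The function names and the example graph are hypothetical.

```python
from collections import deque


def neighbourhood(graph, search_terms, steps):
    """Return all nodes within `steps` hops of any search term (breadth-first)."""
    depth = {term: 0 for term in search_terms if term in graph}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        if depth[node] == steps:
            continue  # do not expand beyond the requested number of hops
        for neighbour in graph[node]:
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return set(depth)


def prune(graph, nodes, min_degree=2):
    """Drop nodes with fewer than `min_degree` links to other displayed nodes."""
    return {n for n in nodes if len(graph[n] & nodes) >= min_degree}


# Hypothetical co-occurrence graph (undirected, stored as adjacency sets).
example_graph = {
    "democracy": {"liberty", "republic"},
    "liberty": {"democracy", "republic", "freedom"},
    "republic": {"democracy", "liberty"},
    "freedom": {"liberty", "slavery"},
    "slavery": {"freedom"},
}

one_hop = neighbourhood(example_graph, ["democracy"], steps=1)
two_hops = neighbourhood(example_graph, ["democracy"], steps=2)
displayed = prune(example_graph, two_hops, min_degree=2)
```

In this example `"freedom"` is reached at two hops but is then pruned, because within the displayed nodes it is linked only to `"liberty"`.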
In the 2D network visualisation panels, a button labeled “Export as png” is available in the bottom right. The default filename of the downloaded image encodes the parameters used as follows:
keywords-dataset-distance-measure-threshold-rank-steps-pruned-concrete
For example, in the filename democracy-prorogued-ecco-100-log-dpf-2.6-20-1-2-none.png, the final part (2-none) indicates that nodes with fewer than two links are pruned and concrete words are not filtered out. If you check the “filter concrete words” box, this “none” will be replaced by the abstract/concrete filter threshold (default 4.5, i.e. exclude words that are > 4.5 on a five-point scale from abstract (1) to concrete (5)).
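As a rough illustration of the naming scheme, the sketch below assembles a filename from its parts in Python. The function and its parameter names are hypothetical, and the mapping of the example's fields to the pattern (dataset `ecco`, distance `100`, measure `log-dpf`, threshold `2.6`, rank `20`, steps `1`, pruned `2`) is inferred from the example filename rather than taken from the app's code.

```python
def export_filename(keywords, dataset, distance, measure, threshold,
                    rank, steps, pruned, concrete=None):
    """Join the parameters with hyphens in the order given by the pattern above.

    `concrete` is the concreteness threshold, or None when the filter is off.
    """
    parts = list(keywords) + [
        dataset, distance, measure, threshold, rank, steps, pruned,
        "none" if concrete is None else concrete,
    ]
    return "-".join(str(part) for part in parts) + ".png"


unfiltered = export_filename(
    ["democracy", "prorogued"], "ecco", 100, "log-dpf", 2.6, 20, 1, 2
)
filtered = export_filename(
    ["democracy", "prorogued"], "ecco", 100, "log-dpf", 2.6, 20, 1, 2, concrete=4.5
)
```

Because keywords and measure names can themselves contain hyphens, a filename is easier to build than to parse unambiguously. If a name has to be read back, reading the fixed fields from the right-hand end is the safer direction.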
This pane compares two search terms with the following method: The common items from the lists of word1 and word2 are retrieved, and score_diff shows the score for each term in word2 subtracted from the score for the same term in word1. The result should be that words more associated with word1 get a higher score.
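The comparison can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the app's code. It assumes each search term's list is available as a dictionary from associated word to score, and the scores shown are invented.

```python
def compare(scores_word1, scores_word2):
    """Return (term, score_diff) pairs for the terms common to both lists.

    score_diff is the word1 score minus the word2 score, sorted so that
    terms more associated with word1 come first.
    """
    common = scores_word1.keys() & scores_word2.keys()
    diffs = {term: scores_word1[term] - scores_word2[term] for term in common}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical association scores for two search terms.
word1_scores = {"people": 3.0, "king": 1.0, "vote": 2.5}
word2_scores = {"people": 2.0, "king": 4.0, "crown": 3.0}

comparison = compare(word1_scores, word2_scores)
```

Only `"people"` and `"king"` appear in both lists, so only they are compared. `"people"` receives a positive score_diff (more associated with word1) and `"king"` a negative one.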
This pane shows the network plot of shortest route between two nodes. Exactly two search terms must be specified.
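For intuition, a shortest route between two nodes can be found with a breadth-first search, as in the Python sketch below. It treats the network as unweighted, which is an assumption: the app may take association scores into account when choosing a route. The function name and the example graph are hypothetical.

```python
from collections import deque


def shortest_route(graph, source, target):
    """Return one shortest path from source to target as a list of nodes.

    Returns None when the two nodes are not connected.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source via parent links
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical co-occurrence graph stored as adjacency lists.
route_graph = {
    "democracy": ["liberty", "republic"],
    "liberty": ["democracy", "freedom"],
    "republic": ["democracy"],
    "freedom": ["liberty", "slavery"],
    "slavery": ["freedom"],
    "tea": [],
}

route = shortest_route(route_graph, "democracy", "slavery")
no_route = shortest_route(route_graph, "democracy", "tea")
```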
This tab shows a table of nodes ranked by their centrality score in the co-occurrence network specified by the dataset, thresholds, options, and search terms specified in the sidebar and configuration pane.