Wikifier documentation
Contents
- Wikification (document annotation)
- Sample code (in Python)
- Part-of-speech tagging
- Cosine similarity measure
- Neighbourhood subgraph
- Concept information
If you use our Wikifier in your work, please cite our paper:
- Janez Brank, Gregor Leban, Marko Grobelnik. Annotating Documents with Relevant Wikipedia Concepts. Proceedings of the Slovenian Conference on Data Mining and Data Warehouses (SiKDD 2017), Ljubljana, Slovenia, 9 October 2017.
Wikification
To call the JSI Wikifier, send an HTTP GET request to a URL of the following form:
http://www.wikifier.org/annotate-article?text=...&lang=...&...
Alternatively, you can send a POST request with a Content-Type of application/x-www-form-urlencoded and pass the parameters (everything after the ? character) in the request body.
The server is currently still in development, so it is occasionally down.
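For example, a minimal GET request could look like this in Python 3 (a sketch; the userKey value is a placeholder, and only the required and most basic parameters are shown):

import urllib.parse, urllib.request, json

# A minimal sketch: annotate a short text with only the basic parameters.
# Replace the userKey placeholder with your own key.
params = urllib.parse.urlencode([
    ("userKey", "insert your user key here"),
    ("text", "New York City is the most populous city in the United States."),
    ("lang", "en")])
url = "http://www.wikifier.org/annotate-article?" + params
with urllib.request.urlopen(url, timeout=60) as f:
    response = json.loads(f.read().decode("utf8"))
for annotation in response["annotations"]:
    print(annotation["title"], annotation["url"])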
The following parameters are supported:
- userKey: a 30-character string that uniquely identifies each user (register here to get one). This parameter is required.
- text: the text of the document that you want to annotate. Use UTF-8 and %-encoding for non-ASCII characters (e.g. text=Beyonc%C3%A9).
- lang: the ISO-639 code of the language of the document. Both 2- and 3-letter codes are supported (e.g. en or eng for English, sl or slv for Slovenian, etc.). See also: list of all the languages currently supported by the JSI Wikifier. You can also use lang=auto to autodetect the language (using CLD2). In this case, the resulting JSON object will also contain a value named languageAutodetectDetails with more information about the results of the autodetection. Note that some of the languages supported by the Wikifier cannot be autodetected by CLD2 (and vice versa).
- secondaryAnnotLanguage: for each annotation, the Wikifier can report, in addition to its name and link in the Wikipedia for the language of the input document, the name and link of the corresponding page in the Wikipedia for a "secondary" language; the secondaryAnnotLanguage parameter specifies the code of this secondary language (default: en, i.e. English).
- wikiDataClasses: should be true or false; determines whether to include, for each annotation, a list of WikiData (concept ID, concept name) pairs for all classes to which this concept belongs (directly or indirectly).
- wikiDataClassIds: like wikiDataClasses, but generates a list of concept IDs only (which makes the resulting JSON output shorter).
- support: should be true or false; determines whether to include, for each annotation, a list of subranges in the input document that support this particular annotation.
- ranges: should be true or false; determines whether to include, for each subrange in the document that looks like a possible mention of a concept, a list of all candidate annotations for that subrange. This will significantly increase the size of the resulting JSON output, so it should only be used if there is a strong need for this data.
- includeCosines: should be true or false; determines whether to include, for each annotation, the cosine similarity between the input document and the Wikipedia page corresponding to that annotation. Currently the cosine similarities are provided for informational purposes only and are not used in choosing the annotations, so you should set this to false to conserve some CPU time if you don't need the cosines for your application.
- maxMentionEntropy: set this to a real number x to ignore all highly ambiguous mentions (i.e. they will contribute no candidate annotations to the process). The heuristic used is to ignore mentions where H(link target | anchor text = this mention) > x. (Default value: −1, which disables this heuristic.)
- maxTargetsPerMention: set this to an integer x to use only the x most frequent candidate annotations for each mention (default value: 20). Note that some mentions appear as the anchor text of links to many different Wikipedia pages, so disabling this heuristic (by setting x = −1) can increase the number of candidate annotations significantly and make the annotation process slower.
- minLinkFrequency: if a link with a particular combination of anchor text and target occurs in very few Wikipedia pages (fewer than the value of minLinkFrequency), this link is completely ignored and the target page is not considered as a candidate annotation for the phrase that matches the anchor text of this link. (Default value: 1, which effectively disables this heuristic.)
- pageRankSqThreshold: set this to a real number x to calculate a threshold for pruning the annotations on the basis of their pagerank score. The Wikifier will compute the sum of the squared pageranks of all the annotations (call it S), sort the annotations by decreasing pagerank, and calculate a threshold such that keeping only the annotations whose pagerank exceeds this threshold brings the sum of their squared pageranks to S · x. Thus, a lower x results in a higher threshold and fewer annotations; see the illustrative sketch after this list. (Default value: −1, which disables this mechanism.) The resulting threshold is reported in the minPageRank field of the JSON result object. If you want the Wikifier to actually discard the annotations whose pagerank is < minPageRank instead of including them in the JSON result object, set the applyPageRankSqThreshold parameter to true (its default value is false).
- partsOfSpeech: should be true or false; determines whether to include information about parts of speech (nouns, verbs, adjectives, adverbs) and their corresponding WordNet synsets. This feature is only supported for English documents; the Brill tagger is used for POS tagging. (Default value: false.)
- verbs: like partsOfSpeech, but only reports verbs.
- nTopDfValuesToIgnore: if a phrase consists entirely of very frequent words, it will be ignored and will not generate any candidate annotations. A word is considered frequent for this purpose if it is one of the nTopDfValuesToIgnore most frequent words (in terms of document frequency) in the Wikipedia of the corresponding language. (Default value: 0, which disables this heuristic. If you want to use this heuristic, we recommend a value of 200 as a good starting point.)
- nWordsToIgnoreFromList: works like nTopDfValuesToIgnore, except that instead of taking the most frequent words from all the words, it uses only the most frequent words from a list provided to the Wikifier at startup. Currently we have such lists for about half of the languages, obtained by taking the most frequent words and manually removing some that seemed actually useful and shouldn't be ignored, e.g. place names. If both this parameter and nTopDfValuesToIgnore are provided, the Wikifier will use the manual ignore list if it is available for the language of the input document; otherwise it will fall back on the behaviour specified by nTopDfValuesToIgnore. (Default value: −1, which disables this heuristic. If you want to use this heuristic, we recommend passing 200 both for this parameter and for nTopDfValuesToIgnore.)
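The pageRankSqThreshold mechanism can be illustrated with the following client-side sketch. This is only our reading of the description above, not the server's actual implementation, and the exact treatment of ties at the threshold may differ:

def pagerank_sq_threshold(pageranks, x):
    # Illustrative sketch only: compute a pruning threshold from a list of
    # annotation pageranks, following the description of pageRankSqThreshold.
    prs = sorted(pageranks, reverse=True)
    total = sum(p * p for p in prs)        # S: sum of squared pageranks
    target = x * total                     # keep annotations until their squared sum reaches S * x
    cumulative, threshold = 0.0, 0.0
    for p in prs:
        if cumulative >= target:
            threshold = p                  # annotations with pagerank <= threshold would be pruned
            break
        cumulative += p * p
    return threshold

# A lower x gives a higher threshold and therefore fewer annotations:
print(pagerank_sq_threshold([0.4, 0.3, 0.2, 0.1], 0.5))   # 0.3
print(pagerank_sq_threshold([0.4, 0.3, 0.2, 0.1], 0.9))   # 0.1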
Output format
The Wikifier returns a JSON response of the following form:
{ "annotations": [ ... ], "spaces":["", " ", " ", "."], "words":["New", "York", "City"], "ranges": [ ... ] }
The spaces
and words
arrays show how the input document has been split into words.
It is always the case that spaces
has exactly 1 more element than words
and that
concatenating spaces[0] + words[0] + spaces[1] + words[1] + ... + spaces[N-1] + words[N-1] + spaces[N]
(where N
is the length of words
) is exactly equal to the input document (the one that
was passed as the &text=...
parameter).
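This invariant can be checked programmatically; a small sketch, assuming response is the JSON object parsed from the Wikifier's output:

def reconstruct(response):
    # Interleave spaces and words; the result equals the original input document.
    spaces, words = response["spaces"], response["words"]
    assert len(spaces) == len(words) + 1
    parts = []
    for i, w in enumerate(words):
        parts.append(spaces[i])
        parts.append(w)
    parts.append(spaces[-1])
    return "".join(parts)

# With the values from the example above:
print(reconstruct({"spaces": ["", " ", " ", "."], "words": ["New", "York", "City"]}))
# -> New York City.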
annotations
is an array of objects of the following form:
{ "title":"New York City", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "lang":"en", "pageRank":0.102831, "cosine":0.662925, "secLang": "en", "secTitle":"New York City", "secUrl":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "wikiDataClasses": [ {"itemId":"Q515", "enLabel":"city"}, {"itemId":"Q1549591", "enLabel":"big city"}, ... ], "wikiDataClassIds": ["Q515", "Q1549591", ...], "dbPediaTypes":["City", "Settlement", "PopulatedPlace", ...], "dbPediaIri":"http:\/\/dbpedia.org\/resource\/New_York_City", "supportLen":2.000000, "support": [ {"wFrom":0, "wTo":1, "chFrom": 0, "chTo": 7, "pMentionGivenSurface":0.122591, "pageRank":0.018634}, {"wFrom":0, "wTo":2, "chFrom": 0, "chTo": 12, pMentionGivenSurface":0.483354, "pageRank":0.073469} ] }
- url is the URL of the Wikipedia page corresponding to this annotation, and title is its title;
- lang is the language code of the Wikipedia from which this annotation is taken (currently this is always the language of the input document);
- secUrl and secTitle refer to the equivalent page of the Wikipedia in the language secLang (which is the same language that was passed as the secondaryAnnotLanguage input parameter). (This is more useful when lang != secLang.)
- wikiDataClasses and wikiDataClassIds are lists of the classes to which this concept belongs according to WikiData (using the instanceOf property, and then all their ancestors that can be reached with the subclassOf property);
- dbPediaIri is (one of) the DBPedia IRIs corresponding to this annotation, and dbPediaTypes are the types to which this DBPedia IRI is connected via the http://www.w3.org/1999/02/22-rdf-syntax-ns#type property;
- support is an array of all the subranges in the document that support this particular annotation (see the sketch after this list); for each such subrange, wFrom and wTo are the indices (into words) of the first and last word of the subrange; chFrom and chTo are the indices (into the input document) of the first and last character of the subrange; pageRank is the pagerank of this subrange (not necessarily a very useful value for the user), and pMentionGivenSurface is the probability that, when a link appears in the Wikipedia with this particular subrange as its anchor text, it points to the Wikipedia page corresponding to the current annotation.
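For example, the following sketch prints the surface form of every supporting mention, assuming response is the parsed JSON object (from a request with support=true) and text is the original input document; note that chFrom and chTo are inclusive indices:

def print_support(response, text):
    # For each annotation, show the surface form of every supporting mention.
    for annotation in response["annotations"]:
        for mention in annotation.get("support", []):
            surface = text[mention["chFrom"] : mention["chTo"] + 1]
            print("%s  <-  %r (pMentionGivenSurface = %.4f)" %
                  (annotation["title"], surface, mention["pMentionGivenSurface"]))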
ranges
is an array of objects of the following form:
{ "wFrom": 0, "wTo": 1, "pageRank":0.018634, "pMentionGivenSurface":0.122591, "candidates": [ {"title":"New York", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York", "cosine":0.578839, "linkCount":63626, "pageRank":0.049533}, {"title":"New York City", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_City", "cosine":0.662925, "linkCount":11589, "pageRank":0.102831}, {"title":"New York (magazine)", "url":"http:\/\/en.wikipedia.org\/wiki\/New_York_(magazine)", "cosine":0.431092, "linkCount":2159, "pageRank":0.030795}, ... ] }
The first four members are the same as in support
; in this particular example,
we have wFrom
= 0 and wTo
= 1, so this object refers to the phrase "New York".
The candidates
array is a list of all the pages in the Wikipedia that are pointed to
by links (from other pages in the Wikipedia) whose anchor text is the same as this phrase; for each
such page, we have an object giving its title, Wikipedia URL, cosine similarity with the input document,
number of links with this anchor text pointing to this particular page, and the pagerank score of this
candidate annotation. For phrases that generate too many candidates, some of these candidates might
not participate in the pagerank computation; in that case pageRank is shown as -1 instead.
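If the request was made with ranges=true, the candidates can be inspected with a sketch like the following (response is the parsed JSON object):

def print_candidates(response):
    # For each range, print its surface form and its candidate annotations,
    # sorted by decreasing pagerank.
    words, spaces = response["words"], response["spaces"]
    for rng in response.get("ranges", []):
        # Reconstruct the surface form; spaces[i + 1] separates words[i] from words[i + 1].
        parts = []
        for i in range(rng["wFrom"], rng["wTo"] + 1):
            parts.append(words[i])
            if i < rng["wTo"]:
                parts.append(spaces[i + 1])
        print("Mention:", "".join(parts))
        for cand in sorted(rng["candidates"], key=lambda c: c["pageRank"], reverse=True):
            print("  %s (pageRank %.4f, linkCount %d)" %
                  (cand["title"], cand["pageRank"], cand["linkCount"]))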
Part-of-speech output
If the partsOfSpeech=true
parameter is used to request part-of-speech information,
the resulting JSON object will additionally contain four arrays called verbs
,
nouns
, adjectives
, and adverbs
. Each of these arrays is
of the following form:
"verbs": [ {"iFrom":27, "iTo":32, "normForm":"offer", "synsetIds":["200706557", "200871623", ...]}, {"iFrom":78, "iTo":80, "normForm":"have", "synsetIds":["200056930", "200065370", ...]}, ... ]
For each entry, iFrom
and iTo
are the indices of the first and last
character of that verb. The indices refer to the input text as a
sequence of Unicode codepoints (i.e. not as a sequence of bytes that is the result of UTF-8 encoding).
You can use these indices to recover the surface form of this verb as it appears in the input text.
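For instance (a sketch, assuming response is the parsed JSON object from a request with partsOfSpeech=true and text is the original input string):

def print_verbs(response, text):
    # iFrom/iTo are inclusive codepoint indices, so slicing with iTo + 1
    # recovers the surface form of each verb as it appears in the input text.
    for verb in response.get("verbs", []):
        surface = text[verb["iFrom"] : verb["iTo"] + 1]
        print("%s (normForm: %s, %d synsets)" %
              (surface, verb["normForm"], len(verb["synsetIds"])))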
By contrast, normForm is the lemmatized form (e.g. have instead of has), and synsetIds is a list of all the WordNet synsets that contain this verb.
The nouns, adjectives, and adverbs arrays have the same form. If you use the verbs=true input parameter instead of partsOfSpeech=true, only the verbs array is included in the response and the other three are omitted.
Part-of-speech processing is only supported on English documents.
Sample code in Python 3
Note: the following sample uses POST; if your input document is short, you can also use GET instead.
import urllib.parse, urllib.request, json

def CallWikifier(text, lang="en", threshold=0.8):
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("userKey", "insert your user key here"),
        ("pageRankSqThreshold", "%g" % threshold),
        ("applyPageRankSqThreshold", "true"),
        ("nTopDfValuesToIgnore", "200"),
        ("nWordsToIgnoreFromList", "200"),
        ("wikiDataClasses", "true"),
        ("wikiDataClassIds", "false"),
        ("support", "true"),
        ("ranges", "false"),
        ("minLinkFrequency", "2"),
        ("includeCosines", "false"),
        ("maxMentionEntropy", "3")
        ])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Output the annotations.
    for annotation in response["annotations"]:
        print("%s (%s)" % (annotation["title"], annotation["url"]))

CallWikifier("Syria's foreign minister has said Damascus is ready " +
             "to offer a prisoner exchange with rebels.")
Additional functions
The Wikifier also supports the following functions that are not directly related to Wikification but may be of interest anyway.
Cosine similarity
This function measures the similarity between the text of two Wikipedia pages. All markup etc. is ignored for the purposes of this comparison, and the pages are represented as feature vectors under the bag-of-words model (also known as the vector space model). To use this function, make an HTTP request of the following form:
http://www.wikifier.org/get-cosine-similarity?lang=...&title1=...&title2=...
As parameters, provide the language code of the Wikipedia that your pages are from, and the titles of both pages. The result will be a small JSON object of the following form:
{ "nNonzeroComponents1":3769, "nNonzeroComponents2":5641, "cosBinVec":0.4847149433737823, "cosTfVec":0.9702968299849217, "cosTfIdfVec":0.8467604323717902 }
This gives the number of nonzero components in each of the two feature vectors (i.e. the number of distinct words on each page) and the cosine similarity measure between them. In fact three versions of the cosine measure are provided: one for binary feature vectors, one for TF vectors and one for TF-IDF vectors.
Sample code in Python 3
import urllib.parse, urllib.request, json

def CosineSimilarity(lang, title1, title2):
    # Prepare the URL.
    data = urllib.parse.urlencode([("lang", lang), ("title1", title1), ("title2", title2)])
    url = "http://www.wikifier.org/get-cosine-similarity?" + data
    # Call the Wikifier and read the response.
    with urllib.request.urlopen(url, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Return the cosine similarity between the TF-IDF vectors.
    return response["cosTfIdfVec"]

print(CosineSimilarity("en", "New York", "New York City"))
Neighbourhood subgraph
For a given language, its Wikipedia can be thought of as a large graph with a vertex corresponding to each page, and with a directed edge (u, v) wherever a hyperlink exists from a page u to another page v. Where such a link exists, we say that u is a predecessor of v, and v is a successor of u.
The following function will return a subgraph of this graph:
http://www.wikifier.org/get-neigh-graph?lang=...&title=...&nPredLevels=...&nSuccLevels=...
The subgraph consists of the following vertices:
- The vertex t representing the page whose title is specified by the value of the title parameter.
- Any vertex u such that a directed path, at most nPredLevels edges long, exists from u to t.
- Any vertex v such that a directed path, at most nSuccLevels edges long, exists from t to v.
As is usual with subgraphs, it also contains all those edges of the original graph where both endpoints of the edge are vertices belonging to the subgraph.
The result will be a JSON object of the following form:
{ "nVertices": 123, "nEdges": 1234, "titles": [...], "successors": [[...], [...], ...] }
For the purposes of representing this subgraph, its vertices are numbered
in an arbitrary order from 0 to nVertices − 1. Then titles[k] gives the title of the Wikipedia page that corresponds to vertex k, and successors[k] is an array containing the numbers of the successors of this vertex. This array may be empty if the vertex does not contain any links to other vertices in the subgraph.
To save space, lists of predecessors are not provided explicitly, but you can easily generate them from the lists of successors.
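A minimal sketch of this inversion, assuming response is the parsed JSON object returned by get-neigh-graph:

def predecessors_from_successors(response):
    # Invert the successor lists: if v appears in successors[u], then u is a predecessor of v.
    preds = [[] for _ in range(response["nVertices"])]
    for u, succs in enumerate(response["successors"]):
        for v in succs:
            preds[v].append(u)
    return preds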
Sample code in Python 3
import sys, io, urllib.parse, urllib.request, json

def NeighGraph(lang, title, nPredLevels, nSuccLevels):
    # Prepare the URL.
    data = urllib.parse.urlencode([("lang", lang), ("title", title),
        ("nPredLevels", nPredLevels), ("nSuccLevels", nSuccLevels)])
    url = "http://www.wikifier.org/get-neigh-graph?" + data
    # Call the Wikifier and read the response.
    with urllib.request.urlopen(url, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Print the edges of the graph.
    nVertices = response["nVertices"]
    titles = response["titles"]
    nEdges = 0
    for u in range(nVertices):
        for v in response["successors"][u]:
            print("%s -> %s" % (titles[u], titles[v]))
            nEdges += 1
    assert nEdges == response["nEdges"]

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf8",
    errors="ignore", line_buffering=True)
NeighGraph("sl", "Ljubljana", 0, 2)
Concept information
This function returns a small JSON object with information about a given concept. This is a subset of information that appears in an annotation object when annotating a document. The request should be of the form:
http://www.wikifier.org/concept-info?lang=...&title=...&secLang=...
The result will be of the following form (this example is for lang=en, title=Rome and secLang=it):
{ "title":"Rome", "url":"http:\/\/en.wikipedia.org\/wiki\/Rome", "lang":"en", "secLang":"it", "secTitle":"Roma", "secUrl":"http:\/\/it.wikipedia.org\/wiki\/Roma", "wikiDataItemId": "Q220", "wikiDataClasses": [{"itemId":"Q5119", "enLabel":"capital"}, ...., ], "wikiDataClassIds": ["Q5119", ...], "dbPediaTypes": ["City", "Settlement", "PopulatedPlace", ....], "dbPediaIri": "http:\/\/dbpedia.org\/resource\/Rome"} }