(belongs to mineracao-de-dados)
- private repo: https://codeberg.org/era/inf0611
Related:
- Image Retrieval image-retrieval
- Video Retrieval video-retrieval
- Meta Search Engines meta-search-engines
- Relevance Feedback relevance-feedback
- Time series analysis time-series-analysis
- geographic-information-retrieval
Information Retrieval
Other Resources
- https://www.inf.ufsc.br/~vania.bogorny/teaching/INE5644/DistanceSimilarity-AndreSalvaroFurtado.pdf
- R lang
Indexes
- Can be done automatically
- Today we can index the entire content
Terms
- Information
- made by humans, “processed data”
- Data
- “raw”
- Needs a human to process it to produce information
- Searching for data means searching for a “raw”, precise value; normally the user knows what they are searching for (e.g. the height of a person)
- When searching for information the user may not know exactly what they want
- Structured data = organised data (e.g. a table)
- Non-structured data is, for example, a book
- SRI = Sistema de Recuperação de Informação (Portuguese for Information Retrieval System)
Topics
In order to find relevant information we need to transform the query into a discrete value of D dimensions. We also need to do the same to the objects in our library. I will call that a function f, whose input is the object and whose output is a vector of D numbers.
In order to find the relevant information for a query q we need to compute the similarity or distance between the query and the objects in our library. For that we will use a function σ.
The pair f and σ is called a “descriptor”.
Similarity is a value in [0,1], where two identical objects have value 1.
Distance has a minimum value of zero (the objects are as close as possible), while the upper value is Inf.
Distance and similarity functions
- Euclidean Distance
- Manhattan Distance
- Minkowski distance
- Generalisation formula d(x, y) = (sum_i |x_i - y_i|^p)^(1/p), covering Manhattan (L1, p = 1), Euclidean (L2, p = 2) and the max distance (L-Inf, p = Inf).
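A minimal sketch of this generalisation in R (the course language); the vectors x and y below are made-up examples, and base R's dist() implements the same distances via method = "minkowski".

```r
# Minkowski distance: p = 1 is Manhattan, p = 2 is Euclidean,
# p = Inf is the max (Chebyshev) distance.
minkowski <- function(x, y, p) {
  if (is.infinite(p)) {
    max(abs(x - y))
  } else {
    sum(abs(x - y)^p)^(1 / p)
  }
}

x <- c(1, 2, 3)  # made-up feature vectors
y <- c(4, 0, 3)
minkowski(x, y, 1)    # Manhattan: 3 + 2 + 0 = 5
minkowski(x, y, 2)    # Euclidean: sqrt(9 + 4 + 0) ~= 3.61
minkowski(x, y, Inf)  # max distance: 3
```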
For binary data we have similarity coefficients, such as the Simple Matching Coefficient (SMC) and the Jaccard Coefficient.
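A small sketch of both coefficients for 0/1 vectors, again with made-up data: SMC counts agreements of both kinds, while Jaccard ignores positions where both vectors are 0.

```r
# Simple Matching Coefficient: fraction of positions where the vectors agree.
smc <- function(x, y) mean(x == y)

# Jaccard Coefficient: 1-1 matches over positions where at least one is 1.
jaccard <- function(x, y) {
  both   <- sum(x == 1 & y == 1)
  either <- sum(x == 1 | y == 1)
  both / either
}

x <- c(1, 0, 0, 1, 1)  # made-up binary vectors
y <- c(1, 0, 1, 0, 1)
smc(x, y)      # 3 agreements out of 5 = 0.6
jaccard(x, y)  # 2 shared 1s / 4 positions with any 1 = 0.5
```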
Evaluation
It’s possible to compare systems using an annotated set, so we know what is relevant or not for a given input. Then the result of our system is compared with the “ground truth”.
Some sources:
- Text REtrieval Conference (TREC): https://trec.nist.gov/
- Oxford Buildings: https://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
- Paris Buildings: https://www.robots.ox.ac.uk/~vgg/data/parisbuildings/
- Flickr 100K: http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/flickr100k.html
- TREC Video Retrieval: https://open-video.org/collection_detail.php?cid=7MDC
We can split the evaluation methods in three (a small R sketch of these measures follows the list):
- binary relevance
- Precision (fraction of the recovered objects that are relevant)
- Recall (fraction of all the relevant objects in the collection that were recovered; a perfect system would recover all of them)
- F1 (harmonic mean that captures the trade-off between precision and recall)
- Average Precision (Average precision is a measure that combines recall and precision for ranked retrieval results. For one information need, the average precision is the mean of the precision scores after each relevant document is retrieved.)
- Mean Average Precision (the average of AP from different rankings generated by the system)
- relevance levels
- Two documents can have different relevance levels; in that case we want the most relevant items to show first in our ranking. Some of the evaluation methods:
- Cumulative gain
- Discounted cumulative gain
- Normalised discounted cumulative gain
- comparison between rankings
- Sometimes we don’t know what is relevant or not in a domain, so we need to compare our ranking with existing ones. We can use rank correlation measures, such as Spearman’s and Kendall’s coefficients.
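A minimal R sketch of the three groups of measures, with made-up data; note that some definitions of Average Precision divide by the total number of relevant documents rather than averaging only over the hits.

```r
retrieved <- c("d3", "d1", "d7", "d2", "d9")  # made-up ranked system output
relevant  <- c("d1", "d2", "d5")              # made-up ground truth

# Binary relevance measures
hits      <- retrieved %in% relevant
precision <- sum(hits) / length(retrieved)  # fraction of retrieved that is relevant
recall    <- sum(hits) / length(relevant)   # fraction of relevant that was retrieved
f1        <- 2 * precision * recall / (precision + recall)

# Average Precision: mean of the precision values at the rank of each hit
ap <- mean((cumsum(hits) / seq_along(hits))[hits])

# Relevance levels: DCG and nDCG over graded gains (0 = irrelevant)
gains <- c(3, 0, 2, 1, 0)  # made-up relevance level at each rank
dcg   <- function(g) sum(g / log2(seq_along(g) + 1))
ndcg  <- dcg(gains) / dcg(sort(gains, decreasing = TRUE))  # ideal ranking in denominator

# Comparison between rankings: rank correlation between two orderings
cor(1:5, c(2, 1, 3, 5, 4), method = "spearman")
cor(1:5, c(2, 1, 3, 5, 4), method = "kendall")
```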
Text processing
In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text. Text usually refers to all the alphanumeric characters specified on the keyboard of the person engaging the practice, but in general text means the abstraction layer immediately above the standard character encoding of the target text. The term processing refers to automated (or mechanized) processing, as opposed to the same manipulation done manually. - https://en.wikipedia.org/wiki/Text_processing
Terms
Vocabulary: the set of representative terms of the documents. We can extract it from the documents automatically or it can be generated by domain specialists.
We could in theory use all the words of the documents as representative terms, but depending on the collection size, that may not be the best approach. Instead we can reduce the vocabulary by:
- Removing stopwords (examples in Portuguese: a, e, estes)
- Doing stemming (keeping the radical of the words): humanos, humanas, humano, etc.
The workflow of an information retrieval system
- We collect the documents (the collection is called the corpus)
- We process the text (normalising it, stemming, removing stopwords)
- The text is indexed and saved
- Queries are issued to retrieve the documents
Processing the text
- We extract the tokens (words, n-grams, symbols)
- We remove the stopwords for the language
- Stemming the tokens
- We normalise the words (remove punctuation, make it all the same case)
- After all this we have the terms that represent the text (document).
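A minimal sketch of this pipeline in R, assuming the tm and SnowballC packages are installed; the two Portuguese sentences are made-up input.

```r
library(tm)  # also requires SnowballC for stemDocument

# Made-up corpus of two Portuguese documents
corpus <- VCorpus(VectorSource(c(
  "Os humanos processam dados.",
  "O humano processa a informacao."
)))

corpus <- tm_map(corpus, content_transformer(tolower))           # same case
corpus <- tm_map(corpus, removePunctuation)                      # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("portuguese"))   # stopwords
corpus <- tm_map(corpus, stemDocument, language = "portuguese")  # stemming

# Index the resulting terms in a document-term matrix
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```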
Indexing and storage
Models for text retrieval
Zipf’s law (from this law we can infer that less frequent words are the best at identifying a text)
Zipf’s law (/zɪf/, not /tsɪpf/ as in German) is an empirical law formulated using mathematical statistics that refers to the fact that for many types of data studied in the physical and social sciences, the rank-frequency distribution is an inverse relation. The Zipfian distribution is one of a family of related discrete power law probability distributions. It is related to the zeta distribution, but is not identical. - https://en.wikipedia.org/wiki/Zipf’s_law
F_i = C / i, where F_i is the frequency of the i-th most frequent word and C is a constant that depends on the document.
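A quick way to check the law empirically in R: on a log-log plot the rank-frequency curve should be roughly a line of slope -1. The file corpus.txt is hypothetical; any whitespace-separated text works.

```r
# Read a text, count word frequencies and rank them
words <- scan("corpus.txt", what = character(), quiet = TRUE)  # hypothetical file
freq  <- sort(table(tolower(words)), decreasing = TRUE)
rank  <- seq_along(freq)

# Observed rank-frequency curve vs the Zipf prediction F_i = C / i,
# taking C as the frequency of the most frequent word
plot(log(rank), log(as.numeric(freq)), xlab = "log rank", ylab = "log frequency")
lines(log(rank), log(as.numeric(freq[1]) / rank), col = "red")
```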
Some models are based on the Zipf’s law distribution (a small R sketch of tf-idf and BM25 follows this list), such as:
- Vector Space Model
- Term Frequency - Inverse Document Frequency (tf-idf)
- Basically tries to use less frequent words to build the ranking
- Very frequent words (and, or, but) probably score zero and do not make a difference in the retrieval
- Binary Independence Model (BIM)
- Best Match 25 (BM25)
- Normalises the frequency of the term
- It’s based on tf-idf.
- Uses constants k and b and the document length L to balance distortions caused by word repetition and by different document sizes.
- BM25 allows small documents to “fight” for a place in the ranking.
- The b constant controls the degree of normalisation of document sizes
- k controls how the repetition of a given word is scored (give some points for some repetition, but penalise documents which only repeat the word, for example)
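A minimal R sketch of both scoring schemes over a made-up term-count matrix. The idf here is the plain log(N/df) variant, and k = 1.2, b = 0.75 are common defaults rather than values from the course.

```r
# Rows = documents, columns = terms (made-up counts)
counts <- matrix(c(3, 0, 1,
                   0, 2, 2,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("d", 1:3), c("cat", "dog", "bird")))

# tf-idf: weight each term count by how rare the term is across documents
N     <- nrow(counts)
idf   <- log(N / colSums(counts > 0))  # rare terms get a higher weight
tfidf <- sweep(counts, 2, idf, "*")

# BM25: k caps the reward for repeating a term, b controls how strongly
# term frequency is normalised by document length
bm25 <- function(counts, k = 1.2, b = 0.75) {
  idf    <- log(nrow(counts) / colSums(counts > 0))
  len    <- rowSums(counts)   # document lengths (the L above)
  avglen <- mean(len)
  tfnorm <- counts * (k + 1) / (counts + k * (1 - b + b * len / avglen))
  sweep(tfnorm, 2, idf, "*")
}

tfidf
bm25(counts)
```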