
Information Retrieval



In order to find relevant information, we need to transform the query into a numeric vector of D dimensions. We also need to do the same for the objects in our library. I will call this transformation a function f, whose input is an object and whose output is a vector of D numbers.

In order to find the relevant information for the query q, we need to compute the similarity or distance between the query and the objects in our library. For that we will use a function σ.

The pair (f, σ) is called a “descriptor”.

Similarity is a value in [0,1], where two equal objects have value 1.

Distance has a minimum value of zero (the objects are identical or very close) and no upper bound: it can grow to infinity.
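
As a minimal sketch of the pair (f, σ) described above, the example below assumes a bag-of-words f over a small fixed vocabulary, cosine similarity as one possible σ, and Euclidean distance as an alternative. The vocabulary, the documents and all function names are made up for illustration; they are not part of these notes.

```python
import math
from collections import Counter


def f(text, vocabulary):
    """Hypothetical descriptor function: map a text to a vector of D term counts."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Similarity in [0, 1] for non-negative vectors; 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Distance with minimum 0 (identical vectors) and no upper bound."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy data, assumed for the example (D = 4).
vocabulary = ["data", "mining", "retrieval", "text"]
library = {
    "doc1": "text retrieval and text mining",
    "doc2": "data mining",
}

q_vec = f("text retrieval", vocabulary)

# Rank the library from most to least similar to the query.
ranking = sorted(
    library,
    key=lambda name: cosine_similarity(q_vec, f(library[name], vocabulary)),
    reverse=True,
)
print(ranking)
```

With a distance function the ranking would be sorted in ascending order instead, since a smaller distance means a closer object.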

Distance and similarity functions

For binary data we have similarity coefficients, such as the Simple Matching Coefficient (SMC) and the Jaccard Coefficient.
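
The two coefficients can be sketched as follows, assuming the objects are equal-length lists of 0s and 1s. SMC counts every position where the vectors agree (including 0/0 matches), while Jaccard ignores positions where both are 0. The function names and the sample vectors are illustrative only.

```python
def smc(x, y):
    """Simple Matching Coefficient: matching positions / all positions."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Jaccard Coefficient: positions where both are 1 / positions where any is 1."""
    if len(x) != len(y):
        raise ValueError("vectors must be of equal length")
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: two all-zero vectors are treated as equal (similarity 1).
    return both / either if either else 1.0


x = [1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 0]
print(smc(x, y))      # 4 of 5 positions agree
print(jaccard(x, y))  # 1 shared "1" out of 2 positions that have a "1"
```

For sparse data, where most positions are 0 in every object, SMC tends to be close to 1 for almost any pair, which is why Jaccard is often preferred there.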

Evaluation

It’s possible to evaluate the system using an annotated set, so we know what is relevant or not for a given input. Then the result of our system is compared with the “ground truth”.
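
One common way to make this comparison is with precision and recall. These two measures are not named in these notes, so take the sketch below as an assumption about how the comparison could be done; the document ids are invented.

```python
def precision_recall(retrieved, relevant):
    """Compare a system's output with the annotated ground truth.

    precision = relevant retrieved items / all retrieved items
    recall    = relevant retrieved items / all relevant items
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical annotation for one query and a hypothetical system answer.
ground_truth = {"d1", "d3", "d4"}
system_output = ["d1", "d2", "d3"]

precision, recall = precision_recall(system_output, ground_truth)
print(precision, recall)
```

Here two of the three returned documents are relevant, and two of the three relevant documents were returned.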

Some sources:

We can split the evaluation methods into three:

Text processing

In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text. Text usually refers to all the alphanumeric characters specified on the keyboard of the person engaging the practice, but in general text means the abstraction layer immediately above the standard character encoding of the target text. The term processing refers to automated (or mechanized) processing, as opposed to the same manipulation done manually. - https://en.wikipedia.org/wiki/Text_processing

Terms

Vocabulary: The set of representative terms of the documents. We can extract it from the documents automatically, or it can be generated by domain specialists.
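
A minimal sketch of automatic extraction, assuming that a term is any run of lowercase letters or digits. This tokenization rule and the function name are simplifications chosen for the example.

```python
import re


def extract_vocabulary(documents):
    """Collect the distinct terms that appear in a collection of documents."""
    vocabulary = set()
    for doc in documents:
        # Lowercase first so "Mining" and "mining" become the same term.
        vocabulary.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return sorted(vocabulary)


print(extract_vocabulary(["Text mining.", "Mining data"]))
```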

We could in theory use all the words of the documents as representative terms, but depending on the collection size, that may not be the best approach. Instead we can reduce the vocabulary with:

The workflow of an information retrieval system

Processing the text

Indexing and storage

Models for text retrieval

Zipf’s law (this law suggests that less frequent words are better at identifying a text)

Zipf’s law (/zɪf/, not /tsɪpf/ as in German) is an empirical law formulated using mathematical statistics that refers to the fact that for many types of data studied in the physical and social sciences, the rank-frequency distribution is an inverse relation. The Zipfian distribution is one of a family of related discrete power law probability distributions. It is related to the zeta distribution, but is not identical. - https://en.wikipedia.org/wiki/Zipf’s_law

F_i = C / i, where F_i is the frequency of the i-th most frequent word and C is a constant that depends on the document.
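
To see what F_i = C / i predicts, the sketch below ranks the words of a text by frequency and compares the observed counts with C / i. It assumes C is taken from the most frequent word (since F_1 = C / 1), which is only one simple way to pick the constant. The sample sentence is far too short to follow the law and is there only to show the mechanics.

```python
from collections import Counter


def rank_frequency(text):
    """Return (word, count) pairs sorted from most to least frequent."""
    counts = Counter(text.lower().split())
    return counts.most_common()


def zipf_prediction(ranked):
    """Predicted frequency C / i for each rank i, with C = count at rank 1."""
    c = ranked[0][1]  # assumption: estimate C from the top-ranked word
    return [c / i for i in range(1, len(ranked) + 1)]


observed = rank_frequency("the cat and the dog and the bird")
predicted = zipf_prediction(observed)

for rank, ((word, count), expected) in enumerate(zip(observed, predicted), start=1):
    print(rank, word, count, round(expected, 2))
```

On a large collection, plotting rank against frequency on log-log axes should give roughly a straight line if the law holds.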

Some models are based on the Zipfian distribution, such as:

Tools