0

I need to determine the topics of a large body of text. For example, suppose you have 1 million arbitrary phrases or sentences, and I want to extract the main topics from this corpus. Ordinary factor analysis works with continuous data. Is there an analogue of factor analysis for text mining tasks? Ideally, I would factorize the corpus and then select some of the factors (a semantic core), for instance F1, F2 over topic 1, topic 2, topic 3, topic 4.
Or maybe you can help me find the best way to solve my task, i.e. I want to understand what the main topics of interest to people are.

Julia
  • 21
  • 1
    Your question is unclear. You should improve the grammar in your post and add as many details as possible so that it will be easier for someone to help you. – Mike Pierce Feb 06 '15 at 16:09

2 Answers

0

I did this for Chinese-language text. I am guessing yours is in English? The methods I used may work for you too. First, you need to define the concept of a "topic", which could be a family of related keywords. Second, you need a database of synonyms and antonyms, and of word-form changes such as "calculate, -ing, -ed". Then you count how often each word is used, sort the counts by category, and so on. You also need to take out words such as "of" and "with". Hope it helps.

For example, the text of your question contains about 120 words. The counts show topic (7), factor (6), text (5), task (4), analysis (3). The other words do not occur with high frequency. If you got these counts from a computer without having read the text, you might guess that it is about "an analysis task of topic or factor of text".
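The counting described above might be sketched roughly as follows in Python. This is only a minimal illustration, not the code actually used for the Chinese text: the stop list `STOP_WORDS` and the word-form lookup `BASE_FORMS` are tiny hypothetical stand-ins for a real stop-word list and a real synonym/word-form database, and `keyword_counts` is a made-up helper name.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop list; a real one would be much longer.
STOP_WORDS = {
    "a", "an", "the", "of", "with", "for", "to", "in", "is",
    "i", "and", "or", "but", "from", "this", "any", "you", "me", "my",
}

# Assumption: a hypothetical lookup that maps word forms to one base form.
# In practice this role would be played by a synonym/word-form database.
BASE_FORMS = {
    "topics": "topic",
    "factors": "factor",
    "factorize": "factor",
    "analyze": "analysis",
}


def keyword_counts(text, top_n=5):
    """Count the most frequent words after dropping stop words."""
    # Lowercase the text and keep runs of ASCII letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    # Collapse known word forms onto a single base form.
    normalized = [BASE_FORMS.get(w, w) for w in words]
    # Remove words that carry no topical meaning.
    kept = [w for w in normalized if w not in STOP_WORDS]
    return Counter(kept).most_common(top_n)


sample = "The main topics of the text. Factor analysis of text topics is the task."
print(keyword_counts(sample, top_n=3))
```

The highest counts then serve as a rough guess at what the text is about, in the same way as the counts listed above.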

PdotWang
  • 749
  • PdotWang, thank you. Do you know whether your algorithm has been implemented in R? Yes, my text is in English :) – Julia Feb 06 '15 at 16:33
  • R for programming? Mine is in Python. I am still searching for a mathematical expression for it. No, my work is very shallow so far. I would like to learn from people like you. – PdotWang Feb 06 '15 at 16:35
  • By a topic I mean a generalization of the themes that people wrote about. For example, from 1,000,000 sentences I might conclude that people wrote about broken parts, changing the treatment plan, and so on. – Julia Feb 06 '15 at 16:36
  • What do you think about this article? http://iase-web.org/documents/papers/icots7/5E2_MORI.pdf Can it help me with my task? – Julia Feb 06 '15 at 16:39
  • Thanks. It is good. It talks about post-processing of the search results. – PdotWang Feb 06 '15 at 16:58
0

For others landing on this page more recently, I wanted to provide an updated reply. If I understand the question correctly, the asker was looking for effective, all-purpose algorithms for sorting and grouping a large corpus of text by common themes or subjects. A few approaches to this task might be useful, such as TF-IDF, Latent Semantic Analysis, Non-negative Matrix Factorization, and Latent Dirichlet Allocation. Also of interest might be keyword extraction and automatic summarization methods such as Rapid Automatic Keyword Extraction (RAKE), TextTeaser, TextRank, and some deep learning or convolutional neural network approaches. Many of these are implemented in R and/or Python. See also Mehdi Allahyari et al., "Text Summarization Techniques: A Brief Survey," arXiv:1707.02268 [cs], July 7, 2017, http://arxiv.org/abs/1707.02268.
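To give a feel for the simplest of these, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes a simplified variant (term frequency divided by document length, inverse document frequency as log(N / df), no smoothing), so the numbers will not match what library implementations produce. The function name `tfidf_top_terms` and the three sample sentences are made up for illustration, loosely following the themes mentioned in the comments above.

```python
import math
import re
from collections import Counter


def tfidf_top_terms(docs, top_n=3):
    """Return the top_n highest-weighted TF-IDF terms for each document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Simplified weighting (an assumption): relative term frequency
        # times log(N / df), with no smoothing.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


# Hypothetical sample sentences, for illustration only.
docs = [
    "the part is broken and the broken part must be replaced",
    "the doctor will change the treatment plan",
    "the treatment plan is new",
]
for doc, terms in zip(docs, tfidf_top_terms(docs)):
    print(terms, "<-", doc)
```

A word that occurs in every document (here "the") gets a weight of zero, which is why TF-IDF is often used as a first filtering step before methods such as NMF or LDA are applied to the weighted term matrix.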

Matt L.
  • 101