I have two matrices derived from one matrix of original data: a training set and a validation set. In each matrix, rows are examples and columns are features. The split is 65% training and 35% validation.

Given that the data is high-dimensional and cannot be visualized, what would you suggest using to make predictions?

I was initially thinking about a polynomial fit, but how does one know which of the 65 features to square, cube, etc?

1 Answer

You are describing the very complex problem of feature selection (as it is known in the machine learning and statistics community).

http://en.wikipedia.org/wiki/Feature_selection

What do you want to predict, anyway? Do you have a classification for each example (each row) or a numerical outcome?

rcorty
  • The first X features are results at a given time, the next Y are a different type of result, and the last W dimensions are binary indicators of the day of the week and the hour of the day at which the results were obtained. I would like to predict the value of result X given some of the other features. I am totally new to the area and not really sure how to approach it. I would imagine that, given the training set, I can fit a multi-dimensional function describing it (regression) and later supply arguments to it to obtain the prediction. – user2827159 Nov 20 '13 at 15:22
  • You didn't state the data type of X, so I can't say which link function to use, but the most straightforward thing you could do is linear regression. I recommend using R. http://en.wikipedia.org/wiki/Linear_regression#Simple_and_multiple_regression – rcorty Nov 20 '13 at 15:27
  • Ah, they are all numerical. Both X and Y are numbers, measured at a given time. I happen to be using R :-). I have been able to fit a linear regression using gradient descent on smaller, 2D data. But I suspect that a linear model will not be good enough for this type of problem, so I think I will need a more complex model. Could you give me some advice on how to approach this, and which R functions might be helpful? :) – user2827159 Nov 20 '13 at 15:31
  • Use the lm() function: include the covariates whose effects you would find believable if they came back significant, include interaction terms, and prune the model by removing non-significant terms. – rcorty Nov 20 '13 at 15:32
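
The lm() workflow suggested in the last comment could look roughly like the sketch below. It is only an illustration under several assumptions: the data is simulated, the column names (x1, y1, noise, weekend, target) are hypothetical stand-ins for the real features, and the pruning step uses step(), which drops terms by AIC rather than by p-value, so it is a swapped-in substitute for "removing non-significant terms" and not the exact procedure described.

```r
# Simulated stand-in for the real data; all column names are hypothetical.
set.seed(1)
n <- 400
dat <- data.frame(
  x1      = rnorm(n),            # a numeric "result" feature
  y1      = rnorm(n),            # a second type of numeric result
  noise   = rnorm(n),            # a feature unrelated to the target
  weekend = rbinom(n, 1, 2 / 7)  # a binary day-of-week indicator
)
dat$target <- 2 + 1.5 * dat$x1 - 0.8 * dat$y1 +
  1.2 * dat$x1 * dat$weekend + rnorm(n, sd = 0.5)

# 65% training / 35% validation split, as in the question.
train_idx <- sample(n, size = round(0.65 * n))
train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

# Full model: all main effects plus all pairwise interaction terms.
full <- lm(target ~ .^2, data = train)
summary(full)  # inspect coefficients and their p-values

# Prune by backward elimination on AIC (assumption: AIC in place of p-values).
pruned <- step(full, direction = "backward", trace = 0)

# Judge the pruned model on the held-out validation set.
pred <- predict(pruned, newdata = valid)
rmse <- sqrt(mean((valid$target - pred)^2))
rmse
```

Comparing the validation RMSE of the pruned model against that of a plain `lm(target ~ ., data = train)` is one way to check whether the interaction terms actually help on data the model has not seen.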