Questions tagged [machine-learning]

How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?

From The Discipline of Machine Learning by Tom Mitchell:

The field of Machine Learning seeks to answer the question "How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?" This question covers a broad range of learning tasks, such as how to design autonomous mobile robots that learn to navigate from their own experience, how to data mine historical medical records to learn which future patients will respond best to which treatments, and how to build search engines that automatically customize to their user's interests. To be more precise, we say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E. Depending on how we specify T, P, and E, the learning task might also be called by names such as data mining, autonomous discovery, database updating, programming by example, etc.

3322 questions
1
vote
0 answers

How do you choose the learning rate such that stochastic gradient descent provably converges?

Recall stochastic gradient descent in the case of regression for $n$ training points: randomly select $t \in [1,n]$, then update $\theta^{(k+1)} = \theta^{(k)} + \eta_k (y^{(t)} - \theta^{(k)} \cdot x^{(t)}) x^{(t)}$. In some notes I am…
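As a rough illustration of the update quoted above, here is a minimal sketch in plain Python. The step-size schedule $\eta_k = \eta_0 / (1 + \text{decay} \cdot k)$ is an assumption chosen for illustration, not something taken from the question: it satisfies the classical Robbins-Monro conditions $\sum_k \eta_k = \infty$ and $\sum_k \eta_k^2 < \infty$, which is one commonly cited sufficient condition for convergence. The function name `sgd_least_squares`, its parameters, and the synthetic noiseless data are all hypothetical.

```python
import random

def sgd_least_squares(X, y, eta0=0.05, decay=0.01, n_steps=20000, seed=0):
    """SGD for least-squares regression with a decaying step size (illustrative only)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for k in range(n_steps):
        t = rng.randrange(n)                    # randomly select one training point
        eta_k = eta0 / (1.0 + decay * k)        # assumed schedule: sum diverges, sum of squares converges
        residual = y[t] - sum(th * xi for th, xi in zip(theta, X[t]))
        # theta^(k+1) = theta^(k) + eta_k * (y_t - theta^(k) . x_t) * x_t
        theta = [th + eta_k * residual * xi for th, xi in zip(theta, X[t])]
    return theta

# Synthetic, noiseless data generated from a known parameter vector (hypothetical example).
data_rng = random.Random(1)
true_theta = [1.0, -2.0, 0.5]
X = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
y = [sum(w * xi for w, xi in zip(true_theta, row)) for row in X]

theta_hat = sgd_least_squares(X, y)
```

With noisy targets the same schedule still applies, but the iterates then approach the least-squares solution rather than a generating parameter vector.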
1
vote
1 answer

F1 from ROC curve

Given an ROC curve (http://en.wikipedia.org/wiki/Receiver_operating_characteristic), how do you read off the maximum F1 score, if that is possible at all?
Daniel
  • 2,630
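One way to see what is involved in the question above: F1 depends on the class balance as well as on the ROC point, because $F_1 = 2\,TP / (2\,TP + FP + FN)$ with $TP = TPR \cdot P$, $FP = FPR \cdot N$ and $FN = (1 - TPR) \cdot P$. The sketch below assumes the numbers of positives and negatives are known; the function names and the sample ROC points are hypothetical.

```python
def f1_from_roc_point(fpr, tpr, n_pos, n_neg):
    """F1 at a single ROC point, given the class counts (assumed known)."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    fn = (1.0 - tpr) * n_pos
    denom = 2.0 * tp + fp + fn
    return 0.0 if denom == 0 else 2.0 * tp / denom

def max_f1(roc_points, n_pos, n_neg):
    """Largest F1 over a list of (fpr, tpr) points."""
    return max(f1_from_roc_point(fpr, tpr, n_pos, n_neg) for fpr, tpr in roc_points)

# Hypothetical ROC points, evaluated under two different class balances.
roc_points = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.9), (1.0, 1.0)]
best_balanced = max_f1(roc_points, n_pos=100, n_neg=100)
best_skewed = max_f1(roc_points, n_pos=100, n_neg=1000)
```

The same curve gives different maxima, attained at different points, under the two class balances, so the ROC curve alone does not determine the maximum F1.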
1
vote
0 answers

VC dimension of two simple neural models with three neurons

There are two simple neural models with three neurons each, as in the picture. The only learnable parameter in each neuron is a threshold $\theta$. Each neuron provides an output of $+1$ if the sum of its inputs exceeds the neuron's threshold unless…
1
vote
0 answers

Weight decay combined with conjugate gradient

We can use the weight decay method as a stopping condition to avoid overfitting when training a neural network. This method has been applied with gradient descent learning and Bayesian learning, but I want to combine it with scaled conjugate gradient. But I…
Beginner
  • 171
1
vote
1 answer

Understanding Sutton's definition of the Projected Bellman Error

I am reading Richard Sutton's textbook Reinforcement Learning, chapter $11.4$, and I am confused by his definition of the Projected Bellman Error. He defines a norm on value functions $v: S \to \mathbb{R}$ where $S$ is the set of states,…
1
vote
1 answer

Why do you take (1-the area under the ROC curve) when the area is less than .5?

I'm taking a course in which the ROC curve is specified by plotting points on an XY plane such that x is the false positive rate and y is the true positive rate at a certain threshold in binary classification. Then these points are joined into…
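One common reading of the title question is that a classifier with area below 0.5 is ranking the classes backwards, so reversing its scores reflects the ROC curve and gives area $1 - \text{AUC}$. The sketch below illustrates this, computing the area as the fraction of (positive, negative) pairs ranked correctly with ties counted as one half, which matches the area obtained by joining the ROC points with straight lines. The function name `auc` and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy scores that mostly rank negatives above positives (hypothetical data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.4, 0.3, 0.9, 0.7, 0.35]

auc_original = auc(scores, labels)                # below 0.5
auc_flipped = auc([-s for s in scores], labels)   # same classifier, decisions inverted
```

Here `auc_original + auc_flipped` equals 1, so the inverted classifier has area above 0.5.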
1
vote
1 answer

Understanding a statement in Sutton's Reinforcement Learning Section 5.5 on Importance Sampling

I am trying to understand section $5.5$ of Sutton's book on reinforcement learning, in particular a statement on page $104$ related to off-policy prediction via importance sampling. Supposing $b$ is a behavior policy and $\pi$ is a target policy,…
1
vote
0 answers

Machine Learning gradient descent clarification

I was watching Andrew Ng's CS229 ML lectures on YouTube and I noticed something when he was explaining gradient descent using contour plots. He's showing what theta gets updated to at each iteration of gradient descent. He says that it…
frank
  • 11
1
vote
1 answer

Understanding a statement from Elements of Statistical Learning about noisy linear regression

I'm reading section $2.5$ of Elements of Statistical Learning by Hastie et al. (second edition), and there's an equation I don't quite understand (here we have $N$ training samples, where the $X$ are vectors in some dimension and the $Y$ are scalars). The authors…
1
vote
0 answers

SVM derivative with respect to X

I am currently working my way through the CS231n class and I got stuck on the partial derivative of the SVM loss with respect to X. The loss function is defined as: [equation not shown] The partial derivatives with respect to $w_{y_i}$ and $w_j$ make sense to me, and also allow a quite…
Guenter
  • 111
1
vote
1 answer

Appropriate problems for different types of ML classifiers

I have learned several machine learning classifiers: decision trees, neural networks, SVMs, Bayesian classifiers, k-NN, Markov processes, etc. Can anyone please help me understand when I should prefer one classifier over another? For example…
Ronin
  • 111
1
vote
0 answers

Optimization of Softmax Regression

I'm trying to learn the mathematics behind softmax regression. Optimization of softmax regression is generally discussed in the context of deep learning but I'm looking for an explanation in the context of multinomial logistic regression. In…
1
vote
2 answers

Geometry of 2-D Linear Discriminant Function

I am studying ML on my own from Bishop's book, and I could not understand why the distance of a given point $\mathbf{x}$ from the decision surface is given by $y(\mathbf{x})/\|\mathbf{w}\|$. Could you help me derive this formula from scratch? Thank you.
John
  • 103
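A sketch of the usual argument, assuming the linear discriminant has the form $y(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + w_0$ (the excerpt does not state it): write $\mathbf{x}$ as its orthogonal projection $\mathbf{x}_{\perp}$ onto the decision surface plus a signed step of length $r$ along the unit normal $\mathbf{w}/\|\mathbf{w}\|$. The symbols $\mathbf{x}_{\perp}$ and $r$ are introduced here for illustration.

```latex
\begin{align*}
y(\mathbf{x}) &= \mathbf{w}^{\top}\mathbf{x} + w_0, \qquad
\mathbf{x} = \mathbf{x}_{\perp} + r\,\frac{\mathbf{w}}{\|\mathbf{w}\|}, \qquad
y(\mathbf{x}_{\perp}) = 0 \\
y(\mathbf{x}) &= \underbrace{\mathbf{w}^{\top}\mathbf{x}_{\perp} + w_0}_{=\,0}
  + r\,\frac{\mathbf{w}^{\top}\mathbf{w}}{\|\mathbf{w}\|}
  = r\,\|\mathbf{w}\| \\
r &= \frac{y(\mathbf{x})}{\|\mathbf{w}\|}
\end{align*}
```

The sign of $r$ indicates on which side of the decision surface the point lies.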
1
vote
0 answers

Cross-entropy of softmax function derivative explanation

Following these calculations: https://sebastianraschka.com/faq/docs/softmax_regression.html, I am a bit confused about the last equation. Imagine I have an X of shape (300, 7), a one-hot encoded y of shape (300, 3), and…
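To make the shapes in the question concrete, here is a minimal NumPy sketch of the mean cross-entropy loss and its gradient for softmax regression. It assumes a weight matrix `W` of shape (7, 3) and a bias `b` of shape (3,), which the excerpt does not specify; with those assumptions the gradient with respect to `W` is $X^{\top}(P - Y)/n$ and has the same shape as `W`. The function names and the random data are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(W, b, X, Y):
    """Mean cross-entropy and its gradients for softmax regression."""
    n = X.shape[0]
    P = softmax(X @ W + b)               # (300, 3) predicted class probabilities
    loss = -np.sum(Y * np.log(P)) / n    # scalar
    dZ = (P - Y) / n                     # (300, 3) gradient w.r.t. the logits
    grad_W = X.T @ dZ                    # (7, 300) @ (300, 3) -> (7, 3)
    grad_b = dZ.sum(axis=0)              # (3,)
    return loss, grad_W, grad_b

# Random data with the shapes from the question (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
Y = np.eye(3)[rng.integers(3, size=300)]   # one-hot labels, shape (300, 3)
W = rng.normal(scale=0.1, size=(7, 3))     # assumed weight shape
b = np.zeros(3)                            # assumed bias shape

loss, grad_W, grad_b = loss_and_grad(W, b, X, Y)
```

A finite-difference check on a single entry of `W` is a quick way to confirm the gradient formula before relying on it.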
1
vote
1 answer

Consider the case of a binary classification problem in which the response is sampled from a non-linear complex function.

Consider a binary classification problem in which the response is sampled from a complex non-linear function. Which of the following algorithms has the potential to perform best? a) Logistic Regression b) K-nearest neighbors c)…