Part 1: ELM (Extreme Learning Machine) - Academic Talk by Guang-Bin Huang - Transcript
>> Zhengyou Zhang: Okay. So it's my pleasure to introduce Professor Guang-Bin Huang from Nanyang Technological University of Singapore, NTU. He is talking about his long-term research interest, the extreme learning machine. He graduated from Northeastern University in China in applied mathematics and has been with NTU for quite a few years. He is an associate editor of Neurocomputing and also of IEEE Transactions on Systems, Man and Cybernetics. And he's organizing a workshop. You can advertise your conference later. So please.

>> Guang-Bin Huang: Thanks, Dr. Zhang, for inviting me here. It's my honor to give a talk to introduce the extreme learning machine. The idea was actually initiated in 2003; that was the first time we submitted papers. It has become recognized by more and more researchers recently. We just had a workshop in Australia last December, and we are going to have an international symposium on extreme learning machine this coming December in China, in Dr. Zhang's hometown, or near his hometown. So we hope you can join.

Okay. So what is the extreme learning machine? This is a technical talk about a kind of learning method. Such a learning method is different from the traditional learning methods: tuning is not required. If you wish to understand the extreme learning machine, I think it's better to go back and review the traditional feedforward neural networks, including support vector machines. I assume many people consider support vector machines not to belong to neural networks, but in my opinion they have the same target and the same architecture in some sense. After we review feedforward neural networks, we can go on to introduce the so-called extreme learning machine. Actually, the extreme learning machine is very, very simple. During my talk I also wish to give a comparison between the ELM and the least-square SVM. Finally, I wish to show the linkage between the ELM and the traditional SVM, so we know what the difference is and what the relationship is.

Okay. So for feedforward neural networks, we have several types of frameworks, architectures. One of the popular ones is multi-layer feedforward neural networks. But in theory and also in applications, people found a single hidden layer is enough to handle all applications. That means, given any application, we can design a single-hidden-layer feedforward network. This single-hidden-layer feedforward network can be used to approximate any continuous target function, and it can be used to classify any disjoint regions.

Okay. So for single-hidden-layer feedforward networks, we usually have two popular types of architectures. The first one is the so-called sigmoid type of feedforward network. That means the hidden layer uses sigmoid-type nodes. Sometimes I call them additive hidden nodes; that means the input of each hidden node is a weighted sum of the network input. Of course, for g here, people usually use a sigmoid-type function. But you can also write the output of each hidden node as an uppercase G(a_i, b_i, x). Here x is the input of the network, and a_i, b_i are the hidden node parameters: for hidden node i, a_i and b_i are the parameters of node i. All right, so this is the sigmoid type of feedforward network. Another very popular one is the RBF network. In the RBF network, the hidden node output function is an RBF function. If we rewrite it in this compact format, we actually have the same so-called output function of the single-hidden-layer feedforward network as for the sigmoid-type network. So here it is always the uppercase G.

These two types of networks have been very interesting in the past two to three decades. Two groups of researchers wrote papers on these two areas, and they considered them separate; they use different learning methods for these two types of networks. But generally, for both types of networks, we have this so-called theorem: given any continuous target function f(x), the output of this single-hidden-layer network can be as close as desired to the continuous target function, for any given error epsilon. And in theory we can find such a network, so that the output of the network is as close as desired to the target function f(x). Of course, in real applications we do not know the target function f(x). In real applications we have sampling; we can only sample discrete samples, and we wish to learn from these discrete training samples. So we wish to adjust the parameters of the hidden layer and also the weights between the hidden layer and the output layer; we try to find some algorithm to learn these parameters so that the output of the network approximates the target function.

Okay. From the learning point of view: we are given, say, N training samples (x_i, t_i), and we wish the output of the network with respect to the input x_j to be equal to the target output t_j. Of course, in most cases the output is not exactly the same as the target; there is some error. Suppose the output of the network is o_j; then we wish to minimize this cost function. In order to do this, many people actually spent time finding different methods on how to tune the hidden layer parameters a_i, b_i and also the weights between the hidden layer and the output layer, that is beta_i, which I call the output weights. All right, so that is the situation for learning, for approximation. But how about classification?
So in my theory published in the year 2000, we say that as long as this kind of single-hidden-layer feedforward network can approximate any target function, the network in theory can be used to classify any disjoint regions. So this is the classification case. Yeah.

>>: So there's a very well-known result in neural networks that you probably know, which says that in order for the [inaudible] to be valid, the number of units has to be very, very large, and that if you go to infinity [inaudible] what's called a Gaussian process.

>> Guang-Bin Huang: Yeah, you're right.

>>: So that gives [inaudible] indication of processing [inaudible].

>> Guang-Bin Huang: Ah, okay. That is actually a very useful theory. That one was actually used for our further development; that is why we come to the extreme learning machine. It is a guide. But an infinite number of hidden nodes is usually not required in real applications. In theory, in order to approximate any target function with the error epsilon reaching zero, then in that sense an infinite number of hidden nodes needs to be given. But in real applications we do not need that; I will mention this later. But that theory is very, very important. Okay. Without--um-hmm.

>>: Also, recently many people observe that if you have many layers, actually you can [inaudible] single-layer system, even if you have a very large number of hidden units in a single-layer system.

>> Guang-Bin Huang: You're right. We have another paper--actually, I didn't mention it here--where we prove, for instance for, say, three hidden layers--just talk about two hidden layers compared with a one-hidden-layer architecture: a two-hidden-layer network usually needs a much smaller number of hidden nodes than a single hidden layer.

>>: [inaudible].

>> Guang-Bin Huang: Yeah. I proved it theoretically and also showed it in simulations. So that means that from the learning capability point of view, multi-hidden-layer networks look more powerful. But that is actually not shown here; that we can discuss later.

Okay. So if you want to learn these kinds of networks, in the past two decades most people used gradient-based learning methods. One of the most popular ones is back-propagation, and of course its variants--so many variants; people change some parameter and generate another learning method. Okay. Another method, for the RBF network, is the least-square method. But that least-square method is something different from ELM, which I will introduce later: a single impact factor is used in all hidden nodes. That means all the hidden nodes use the same impact factor [inaudible]. Sometimes it is called sigma. Right?
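To make the two hidden-node types discussed above concrete, the output of a single-hidden-layer feedforward network with L hidden nodes is usually written as follows (a sketch of the standard notation; L denotes the number of hidden nodes, a convention used later in the talk):

```latex
% Output of a single-hidden-layer feedforward network with L hidden nodes
f(x) = \sum_{i=1}^{L} \beta_i \, G(a_i, b_i, x)

% Additive (sigmoid-type) hidden node: a_i is the input weight vector, b_i the bias
G(a_i, b_i, x) = g(a_i \cdot x + b_i), \qquad \text{e.g. } g(u) = \frac{1}{1 + e^{-u}}

% RBF hidden node: a_i is the center, b_i > 0 the impact factor
G(a_i, b_i, x) = g\left(b_i \, \lVert x - a_i \rVert\right), \qquad \text{e.g. } g(u) = e^{-u^{2}}
```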
Okay. So what are the drawbacks of those gradient-based methods? Usually we feel it is very difficult in research: different groups of researchers work on different networks. Intuitively speaking, the networks have similar architectures, but usually RBF researchers work on RBF networks and feedforward-network people work on sigmoid feedforward networks, and they consider them different. So we actually sometimes waste resources. And also, in all these networks, for users there are sometimes so many parameters to tune manually, right? It's case by case. So it's sometimes inconvenient for non-expert users. Okay. We also usually face overfitting issues: if too many hidden nodes are used in the hidden layer, we are afraid of the overfitting problem. Right?
And also local minima for gradient methods, right? You often cannot get the optimum solution; you usually get a local minimum solution that is better for that local area but not for the entire application. Of course it is time-consuming--time-consuming not only in learning [inaudible] but also in human effort: a human has to spend time to find the so-called proper parameters, the user-specified parameters. So we wish to overcome all these limitations and constraints in the original learning methods.

Now, let's look at the support vector machine. Is there any relationship between the support vector machine and the traditional feedforward network? Of course, when SVM people talk about SVM, they never talk about neural networks; they say they are separate. This is why there is a story. When I joined [inaudible] in 2004--before 2004, SVM papers seldom appeared in neural network conferences. Then in 2004, at the top we have a committee, and in the committee meeting the organizers asked why so many SVM papers came to a neural network conference that year. So people consider them different. But what I found is that actually they are very close to each other; generally speaking, they have the same network architecture. Let's look at the SVM. For SVM, of course, we talk a little bit about the optimization objective. SVM is to minimize this formula, this objective function: minimize the norm of the output weights plus the training error, subject to these [inaudible] conditions. Right?
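For reference, the soft-margin SVM primal being paraphrased here--minimize the norm of the weights plus the training error, subject to the separation conditions--can be sketched as:

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \qquad
t_i \left( w \cdot \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N .
```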
But looking at the final solution, the decision function of SVM is this. What is this? K here is the kernel of the SVM, and [inaudible] are the parameters we want to find. Looking at this formula, this is actually exactly a single-hidden-layer feedforward network. What is the single hidden layer? The single hidden layer is formed by the hidden nodes with these kernels, right? So K(x, x_1), K(x, x_2), ..., K(x, x_N). This is the hidden layer--the hidden layer with this kernel. So what is the output weight, then? The output weight is alpha_1 t_1, ..., alpha_i t_i, ..., alpha_N t_N. That is the output weight; that is the beta [inaudible] in the feedforward network. Okay. Yeah, please.

>>: This is [inaudible].

>> Guang-Bin Huang: Okay. Yeah. This is not the objective of the SVM, but finally it turns out to be in this formula. So I am speaking from the architecture point of view: finally they have a similar architecture.

>>: [inaudible] was called new ways to train neural networks. So that was a connection from the beginning [inaudible].

>> Guang-Bin Huang: Yeah, yeah. You are talking about the [inaudible] paper published in 1995. So Vapnik--what was the so-called inspiration for [inaudible]? Vapnik says actually we have to go back to the original neural network. So [inaudible] actually in 1962 [inaudible] studied the feedforward network a lot. In those days, people had no idea how to tune the hidden layers. So then [inaudible] said: how about this? Consider the last hidden layer. You could consider the last hidden layer's output mapping function--look at the output of that mapping function. Then can we find some way to just determine the last hidden layer's output function, that is, the feature mapping function? Then you can [inaudible]. So then Vapnik says, okay, the output function of the last hidden layer can be considered as phi(x_i). But what is phi(x_i)? I have no idea. So then Vapnik's paper says, how about letting phi(x) [inaudible]. So we have this kind of constraint, because this is very important for classification. Under this constraint they finally get this. So although the output mapping function of the hidden layer, that is the SVM feature mapping function phi(x), is [inaudible], we can find the corresponding kernel. That is why we come to this stage.

But I should say I am talking from a structural point of view: if we go to this stage, it finally turns out to be in the same format. Of course, from here the question is how to find alpha_i t_i. How to find alpha_i t_i is how to find this output [inaudible]. You can consider the single-hidden-layer feedforward network: the feedforward network also tries to find these parameters. So SVM and the traditional BP are, in this sense, just different ways to find the parameters of the single-hidden-layer feedforward network. SVM finds them in the SVM way, BP finds them in the BP way, RBF finds them in the RBF way. Different people find the parameters in different ways. Then the question is: can we unify them? Okay. So this is one of my so-called research works. Actually, in order to see the linkage between the SVM and the single-hidden-layer feedforward network, I think people had better read these two papers, because these two papers also gave me some ideas, inspired my ideas. They built the linkage between the ELM and SVM first, and then I found this linkage also.

Okay. So now let's talk about the extreme learning machine. Of course the extreme learning machine originally started from neural networks; we come from neural networks. In those days we started from the BP network, and we found BP is inefficient. But what is the original expectation for neural networks? We try to emulate human thinking. A human can think of something, can find a solution very quickly. So the question is: do we have tuning in the human brain? Usually, to me, there is no tuning there in most cases, right? We just find a solution directly, or we can learn very fast--either very fast, or without tuning. But go back to the original learning methods of neural networks: whatever machine you see, tuning is always there, right? There is always a parameter you have to find. Can we have some way to simplify the implementation and the computation involved in these methods?
So then we found--from all this [inaudible]--that this can be simplified in single-hidden-layer feedforward networks. So we have a hidden layer with parameters a_i, b_i. But the hidden node here may not be a neuron; it can be a kernel, it can even be some other subnetwork. Each hidden node has parameters a_i, b_i; those are the parameters. Instead of writing the output function of each hidden node in detail, we just write them in this compact format. Then the output function of this network is f(x) equals the summation of beta_i G(a_i, b_i, x). G(a_i, b_i, x) is actually the output function of the i-th hidden node. It can be an RBF kernel, it can be a so-called sigmoid-type node, or something else; it can even be a non-differentiable [inaudible] function. Okay. So then look at the entire hidden layer: what is the output of the entire hidden layer? Suppose the entire hidden layer has L hidden nodes. Then what is the output of the hidden layer? It is h(x), the vector of L elements, the output functions of these L hidden nodes. Right? It's a vector. h(x) here is the feature mapping, the hidden layer's so-called output mapping. Right?
Okay. So in this case, the question is that in the traditional methods, a_1, b_1, ..., a_L, b_L have to be tuned. This is why we have to have the gradient-based methods or the SVM method. Because, you see, if the parameters of the hidden layer have been tuned, the result you obtained in the earlier stage in the output layer--beta_i here--may not be optimal anymore; you have to adjust the output layer again. And if you adjust the output layer, beta_1 to beta_L, then the parameters in the hidden layer may not be optimal anymore, so you have to adjust them again. This is why you always have to adjust iteratively. Luckily, in our theory, we find that tuning in the hidden layer is actually not required: all the parameters in the hidden layer can be randomly generated. This goes back to your question: does this assume an infinite number of hidden nodes can be used? Actually, we do not need that assumption. Right? This also goes back to the SVM: SVM goes to the feature space, and the dimension of the feature space can be very large--how large? It can go to infinity, right? That is why they have the kernel concept there. So if the number of hidden nodes in the hidden layer can be infinite, or as large as needed, then we have this conclusion: tuning is not required, and all the parameters in the hidden layer can be randomly generated. This is based on an assumption: suppose, for a given output function G of the hidden layer, there exists a method which can be used, with tuning, to find the proper parameters a_i, b_i--a tuning method, the BP method, the RBF method, or the SVM method, whatever method. As long as there exists such a tuning method which can be used to train our single-hidden-layer feedforward network, then such tuning is not required, such a method is not required; the hidden layer can be randomly generated. That is given in this paper. Okay. And another assumption is that G here should be nonlinear piecewise continuous; you cannot just use any arbitrary function. Then--yeah, please.

>>: Can you explain the difference between the a_i and b_i?
>> Guang-Bin Huang: Okay. So a_i and b_i can sometimes be put together, as just one parameter. But a_i, b_i here come from the traditional neural network. You see the RBF network: you have the center and the impact factor. So for an RBF hidden node, a_i is the center and b_i is the impact factor. If you have a sigmoid node, a_i is the input weight and b_i is the bias of the hidden node. So this comes from the traditional neural network; it is familiar to neural network researchers. Usually they have two parameters, and this is why I write a_i, b_i. But if you go to frequency-type [inaudible] nodes, usually you have the position and the frequency, right--that is, the location and the frequency. So usually you have the two.

>>: What is it for SVMs?
>> Guang-Bin Huang: Oh. For SVMs, you talk about the kernel [inaudible]. For a kernel, a_i is the input, that is, the training sample, and b_i is the so-called kernel parameter, that is, sigma. But of course here we talk about random generation first; later on we can build the linkage between SVM and this method. Okay? So--yeah.

>>: [inaudible] any application do we need to tune a_i [inaudible]?
>> Guang-Bin Huang: I will address this issue later. Yeah, it is an important question; I will come back to it. Okay. So since all the hidden node parameters need not be tuned--not tuned in the sense that before you see the training data you can randomly generate the parameters first--then for a given application you can use the already randomly generated nodes. That means the parameters can be generated independently of the training data of your applications. Since a_i, b_i--all the hidden node parameters--can be randomly generated, we go here. Suppose you have N training samples. Given any training sample x_j, we wish the output of this network to be [inaudible] equal to the target t_j, right? The left-hand side is the output of the network; we wish this to equal the target. We have L hidden nodes, so here the only unknown parameters are the beta_i, because we have no idea what beta_i is; that is what we want to find. The hidden layer can be randomly generated, and t_j is the target. We have N training samples, so from here we know we have N equations with L unknown parameters, right?
So then these N equations can be written in the compact format H beta = T. H is what we call the hidden layer output matrix, right? It is also the so-called hidden layer mapping, the feature mapping. We have N training data, so each row represents the output of the hidden layer with respect to one training input x_i. T is already given; we want to find the parameter beta. So for H beta = T, H is already known, T is already known, and then we can find the parameter beta easily. Three steps, then. The first step: randomly assign the hidden node parameters a_i, b_i. Then calculate the hidden layer output matrix H. Then just apply a pseudo-inverse to get beta. That's all. So these are the three steps. Okay. The source code can be found here; we find that the simple math is enough in many cases. Of course, for this single-hidden-layer feedforward network, the hidden node need not be of sigmoid type, and it need not be of RBF type; it can be something else. [inaudible] hidden node, because previously [inaudible] could not be trained directly; with this method, [inaudible] it can also be trained. But that is the basic idea. So now, what relationship--yeah.

>>: [inaudible] generalized linear models, where you take the input data, you map it to some nonlinear thing, and then you take a linear combination, which looks exactly like this. So how is it related? How is [inaudible]?
>> Guang-Bin Huang: What is a nonlinear--
>>: Generalized linear [inaudible].

>> Guang-Bin Huang: Uh-huh.

>>: Where you basically map the data to a nonlinear form and then take the linear combination [inaudible]. So it looks very similar to this [inaudible] what the difference is.

>> Guang-Bin Huang: Yeah. Many methods actually look similar, but the key point is that here the hidden nodes are all randomly generated, [inaudible] other methods. In other methods, the user either presets some parameters or tunes them in some way, and then gets the hidden layer output; then they use the least-square method--even the original RBF method. So they have to do something before they get the hidden layer output; it is not randomly generated. Yeah.

>>: When you say that taking a random hidden layer is enough, I don't understand what enough means. I mean, does it mean that you get uniformly good approximation across the input space? I mean, this can't be a distribution-dependent thing. You can't get actual learning from this. This has to be assuming some type of uniform distribution on your input space.

>> Guang-Bin Huang: Okay. Enough here means: given any continuous target function f(x), for this kind of single-hidden-layer feedforward network--it may not be a neural network--as long as the number of hidden nodes is large enough, the output of this single-hidden-layer network can be as close as desired to your target function. So given any continuous target function, we can make the output of the network as close as desired to the target. This is from the learning point of view: in theory we wish to have this universal approximation property.

>>: I think what you want is to get--to learn the specific distribution from that end, so--
>> Guang-Bin Huang: Oh.

>>: This is some kind of extreme L-infinity type of--or, I mean, you're talking here about uniform convergence, right, across the entire space, whereas we may actually care about some little part of the space which is governed by the distribution [inaudible]. I mean, that's what [inaudible].

>> Guang-Bin Huang: You're right. So here we talk about the error over the entire [inaudible] space. Of course, if we talk about some so-called [inaudible] subspace, that could be different. But this problem is actually true for many, many methods. Even BP talks about the general case. Even SVM talks about the whole space: given this region, we wish to learn and see how good we are in this entire region instead of in some specific small area. But that is a problem for general learning methods, because a learning method usually says: give us this space, give us this training data, any training data, and we wish to learn the overall performance on it. Of course, somehow we are also interested in some particular subspace, but that is a different setting.

>>: I don't think you understood what I said. When we get a training set, you care about the distribution from which that data set came. You don't care about [inaudible] specifically has an analysis, so the only learning and generalization [inaudible] SVM talks about is generalizing with respect to the distribution of the training data, not uniformly across the space.

>> Guang-Bin Huang: Okay. You're talking about the distribution of the training data. In which sense?

>>: In which sense. You're assuming that your sample is [inaudible] from some distribution. That's the distribution with respect to which you measure risk and you want to generalize. You don't really care about doing anything uniformly across the entire space. In essence [inaudible] some infinite-dimensional thing, and all you really care about is the subspace spanned by your data and the distribution order [inaudible].

>> Guang-Bin Huang: Okay.

>>: Here this is--this is kind of a--you know--

>> Guang-Bin Huang: Okay. I will come back to your [inaudible], so why we find the least-square solution can even be simplified to this stage. I will come back later. Yeah. So those are the three steps. After we get beta, beta can be calculated through the pseudo-inverse. You have different ways to calculate the pseudo-inverse; one of the ways is the [inaudible] method. So the pseudo-inverse of H can be written in this way or in this way, right?
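A minimal sketch of the three steps just recapped, in Python (not the released ELM source code; the sigmoid node type, uniform random ranges, and function names are assumptions for illustration):

```python
import numpy as np

def elm_train(X, T, L=1000, seed=0):
    """Basic ELM sketch: random hidden layer + least-squares output weights.
    X: (N, d) inputs, T: (N, m) targets, L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Step 1: randomly assign hidden node parameters a_i (input weights) and b_i (biases)
    A = rng.uniform(-1.0, 1.0, size=(d, L))
    b = rng.uniform(-1.0, 1.0, size=L)
    # Step 2: calculate the hidden layer output matrix H (sigmoid additive nodes here)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    # Step 3: output weights beta = pseudo-inverse(H) @ T
    beta = np.linalg.pinv(H) @ T
    return A, b, beta

def elm_predict(X, A, b, beta):
    # f(x) = h(x) @ beta, with the same randomly generated hidden layer
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

One call to elm_train on the training set is the whole training procedure; there is no iterative tuning loop.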
But from the ridge regression theory point of view, in order to improve the stability and the performance of the pseudo-inverse, we usually wish to add a so-called [inaudible] constant to the diagonal entries of this part. So then we have this here. After we add this term--because this is the usual way for us to handle the pseudo-inverse--there is a relationship with SVM. After we add this, the issue becomes very interesting. If h(x) is known to us, this is different from SVM; in SVM, h(x) is usually unknown. So if h(x) is known to us, then we have this. If h(x) is unknown, we can also use the same format, a kernel format: from the first formula, H times H transpose, we get this kernel format. In this case, as long as K is known to us, even if the feature mapping function h(x) is unknown to us, we can get this formula. So this is another formula of ELM, the ELM we mentioned in this paper.

So then what happens? Compare these two: this is the solution of ELM, and this is the solution of the least-square SVM. For the least-square SVM, you can remove the first row and first column of the least-square SVM solution; this part is actually very similar to this part of the ELM solution. Of course, here we have K times K and there it is Z times Z, so they have this one. So roughly it looks similar, but this is actually a simplified version of the least-square SVM. Why a simplified version? [inaudible] in this paper, because we found there may be some flaw in the least-square SVM and the SVM series, which I will mention later. Look at the least-square SVM: we have the T. Omega here is actually a kernel matrix. But a kernel is related to the feature mapping; it should not be related to the target. In their kernel mapping, however, the matrix is actually related to the target. This is a problem: the feature mapping should come from the input, instead of having feedback from the output. But in their solution, they have t_i--the target--inside the kernel mapping matrix. So this is a problem there. So from here we simplified it by removing the first row and first column to make it simpler. I will come back later to why this term can be dropped; it may be something which can be improved somewhere.

Okay. Now let's look at the classification performance of ELM. Let's look at the [inaudible] problem. This is the boundary found by ELM. Let's look at another case; this is also a boundary from the so-called ELM method. Now go back to your other question: do we need an infinite number of hidden nodes? Usually not. We say that as long as the number of hidden nodes is large enough, it is okay. So in what sense large enough? In all our simulations--even for this case we only have four training data and 1,000 hidden nodes; here maybe 80,000 data, 1,000 training data or 1,000 hidden nodes; even for other cases with a very large number of training data, we only needed 1,000 hidden nodes. So that means usually 1,000 is enough; 1,000 to 2,000 should be fine for ELM usually. This is what we find--yes?
>>: Is that an empirical observation [inaudible] theory that actually suggests that that's--
>> Guang-Bin Huang: Oh, no, we do not have a theory; this is just from experimental simulation. And it was found not only by us, but also by some other teams from Europe and China. Yeah. So--
>>: [inaudible].

>> Guang-Bin Huang: Ah, okay. For the moment we found it looks likely to be independent of the dimensions. Later on there were cases where some had higher dimension, some had lower dimension, and some were even [inaudible]; it was also okay. Yeah.

>>: Also, when you said 1,000, do you mean you need 1,000 to match the performance of a [inaudible]? What do you--
>> Guang-Bin Huang: We also compare with SVM, regular SVM, later.

>>: But what I'm asking is what performance do you try to get? I mean, I've had two.

>> Guang-Bin Huang: No. We wish to compare with the traditional SVM in the same setting. So if there is dimension reduction, then we compare in the same way; if there is no dimension reduction, we also compare the performance without dimension reduction. We want to see the classification rate on the testing data. Yeah.

>>: [inaudible].

>> Guang-Bin Huang: Yeah.

>>: With that style of method. So for about a thousand hidden nodes [inaudible].

>> Guang-Bin Huang: Sorry?
>>: If I was [inaudible] comparing these methods [inaudible] actual error rate [inaudible] same error, you need a thousand hidden nodes with ELM.

>> Guang-Bin Huang: Yeah.

>>: So what's the [inaudible]? What's the largest problem you can apply this to?
>> Guang-Bin Huang: Okay. For this problem--you see, ELM's so-called scalability is actually very, very good, because the calculation is very, very simple. Look here: this formula is actually unified for [inaudible]-label classification as well, while this one, similar to SVM, is just for binary. For very large cases, SVM and least-square SVM usually need so-called [inaudible] kinds of methods, and usually you have to find a supercomputer for very large cases. But we usually run it on a home computer, or just the desktop in the lab. The largest one we ran is [however many] data, with each data instance having, say, 59 input [inaudible]. For the rest, due to the computational [inaudible], it is somehow very difficult; they have to wait, sometimes for months, to get the result. This is a problem. But for us, in just several minutes we get a result. So somehow we [inaudible] we also had to wait for all the rest.

>>: [inaudible] you reach probably up to however many--
>> Guang-Bin Huang: Yeah, the maximum. I will let you--I have one slide to show.

>>: One that you have [inaudible] variability when you try a different set of random parameters.

>> Guang-Bin Huang: For the moment we have a--
>>: [inaudible].

>> Guang-Bin Huang: Yeah, yeah, yeah, that's right. In order to compare SVM and other methods--not only SVM, also BP--for us, we just set the number of hidden nodes; that one usually can be just given and then run for all. But for the other methods, you have to tune this method and tune that method. Even for one given method, they have to spend a lot--it costs a lot of time for them to compute.

>>: Procedure-wise [inaudible] when you actually use a random generator to get this A and B, how many runs do you have to [inaudible]?
>> Guang-Bin Huang: Usually just one.

>>: But you get a different [inaudible].

>> Guang-Bin Huang: For the results here--later I will show we can generate many times and take the average. We can see the average. Yeah.

>>: I see. Do they differ a lot?
>> Guang-Bin Huang: I will show you that. There is some [inaudible]. You will see a very interesting result. Okay. Let's look at these several data cases. Actually, they are not very large, okay? Say we have this case: the number of training data for letter recognition is 13,000, and testing is 6,000. Then here are the number of input features and the number of classes. This column means whether the data can be reshuffled: the training data and the testing data are reshuffled for every trial, or "no" means no shuffling, following the simulation setup done by others--the training data are fixed and the testing data are fixed. Okay.

Now, let's look at this case. Let's look at the lower part first: SVM, least-square SVM, and ELM all use the Gaussian kernel. The Gaussian kernel here is not randomly generated. Look at the earlier solution: I said if h(x) is really known to you, you can use these two formulas; if h(x) is unknown, similar to SVM--but for us no tuning is needed--we can use the same format, the kernel format. This is different from the least-square SVM. We can also use a kernel here, because usually here we just use one parameter, and of course for any method you definitely do something for that one parameter. So if we look at the Gaussian kernel methods, you see here: this is the testing rate for SVM, this is the training time, this is the least-square method, and here is ELM. For all the methods, for all the data, ELM achieved better generalization, better testing accuracy. You see here, all better. And look at the time, with the same Gaussian kernel: the ELM times are all shorter than SVM and least-square SVM, and least-square SVM is much faster than SVM. Okay. So this is using the kernel format. Right?
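As an aside, a minimal sketch of the kernel format used in this comparison (a Gaussian kernel is assumed; C is the constant added to the diagonal as described earlier; the function names are illustrative, not the released implementation):

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2), computed pairwise
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * d2)

def kernel_elm_train(X, T, C=1.0, gamma=1.0):
    """Kernel ELM sketch: solve (I/C + Omega) alpha = T, with Omega_ij = K(x_i, x_j)."""
    N = X.shape[0]
    Omega = gaussian_kernel(X, X, gamma)
    alpha = np.linalg.solve(np.eye(N) / C + Omega, T)
    return alpha

def kernel_elm_predict(X_test, X_train, alpha, gamma=1.0):
    # f(x) = [K(x, x_1), ..., K(x, x_N)] @ alpha
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```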
But you see the standard deviation here is also zero, because the training data are fixed, right? The training data are fixed, so these three methods have almost the same standard deviation.

>>: [inaudible] Gaussian kernel likely choosing the [inaudible]?
>> Guang-Bin Huang: Yeah. No, this is not--we can't. Here we intentionally compare SVM, least-square SVM, and ELM when they use the same kernel, so that we compare them in the same environment. Of course, you have to choose the parameter gamma, the impact factor, for all these three methods, right? So this is one comparison. But for ELM here, we say the hidden layer output function is unknown to us, yet we can use the kernel format similar to SVM--just kernel learning. But if h(x) is known to us, as in the RBF network, the sigmoid network, the traditional neural networks, then we just use one node type, say sigmoid hidden nodes, all the parameters are randomly generated, and the number of hidden nodes is set as 1,000. Then what happens? Look here; it's very interesting. You see the classification rate here is also higher, almost better than all the rest. And then, of course, because they are randomly generated, there is some [inaudible] deviation here: even though the training data are fixed, you still have some kind of [inaudible], because every time you randomly generate, there is some kind of [inaudible] there. But you see the deviation is not so high. All right. Yeah.

>>: [inaudible] support vectors [inaudible].

>> Guang-Bin Huang: The support vectors for this--actually, this one--it can be found in the paper. I cannot really remember. But usually--
>>: [inaudible] runtime. It's going to be slower if you have a thousand versus ten.

>> Guang-Bin Huang: [inaudible] so I will go back to that later. But you see the training time: the training time here is much, much shorter than SVM and least-square SVM, even shorter than ELM with the Gaussian kernel. Because when you go to the Gaussian kernel, usually we consider the feature space going to infinity--I mean, in theory. But here--

>>: --is comparing the sigmoid even though [inaudible] do you have a general statement as to which one is better in terms of performance versus--
>> Guang-Bin Huang: Okay. In terms of classification--in terms of stability, the Gaussian kernel is better, because the standard deviation is always zero: whenever the training data are fixed, the result is deterministic. But in terms of training time, the sigmoid, the so-called non-kernel-based approach, is better, because without the kernel we need not give too many hidden nodes--usually 1,000 is enough. And 1,000 is very good for very large applications. If you go to a very large application, the scalability of SVM and least-square SVM would be very hectic, troublesome for us, and then ELM is better. Right? So this is--
>>: So I'm thinking what you are doing now there is to use the random [inaudible] at the top level.

>> Guang-Bin Huang: Yeah, yeah, yeah.

>>: So why is that feature a good one?
>> Guang-Bin Huang: I've already explained to you--
>>: It's not intuitive.

>> Guang-Bin Huang: Yeah, it is not. Even in the beginning, when I thought of this idea, I asked myself: can it work? I could not find any natural phenomenon at that stage. But now I have found the natural phenomenon, so I say this is a natural learning scheme. I will go back to it later; it's very interesting.

>>: By the way, the way you use SVM, [inaudible] automatically determined, right? [inaudible] will be considered [inaudible] hidden units in your ELM. So do you have an idea what the number of support vectors you get with the SVM is?
>> Guang-Bin Huang: Okay. Usually for SVM, what I found is that it is in some way related to the number of training data. For the [inaudible] regression method, usually half of the training data will be used as support vectors. Go to this page: we just ran a difficult case, the SinC function. That is a very important benchmark problem for testing SVM performance, right? We have 5,000 training data and 5,000 testing data. To the training data we intentionally add some noise; the testing data are noise-free. If we go to SVM--of course here we are using support vector regression, since it is a regression case--we have 5,000 training data, and nearly half of them are used as support vectors. This is why, even in other cases, we usually say the number of support vectors is comparable to the number of training data. So for a very large dataset, the number of support vectors will be very large. And this goes to the runtime issue that was raised by this gentleman, right?
So when you do testing, the testing time will be very large for the support vector machine and the least-square SVM, while the number of hidden nodes in ELM can be very small. Look at this case: instead of fixing it at 1,000, I just chose 20. You see the testing error for this case: ELM is already smaller, better than SVM--SVR. The training time here is actually 0.1 seconds; you see SVR is nearly 20 minutes. So this is--yeah, please.

>>: If you use, say, a thousand on ELM in this case, would it maintain the same accuracy, or would it overtrain?
>> Guang-Bin Huang: Okay. These are different methods, because we have different versions [inaudible] under development. Here, for this case--because it was probably in the year 2006, quite a long time ago--we did not include the constant [inaudible] on the diagonal entries of H; we did not include this I over C here. We just used the formula without I over C, this part. For that result, without that diagonal entry, you can tune the number of hidden nodes. For the number of hidden nodes, you can go back to see this paper: it is different from the back-propagation method. The back-propagation method is very sensitive to the number of hidden nodes: it goes to the peak performance easily, quickly, and then drops down quickly, because of overfitting. ELM usually has a very wide range: the generalization performance is very stable over a very large range [inaudible]. Of course, the number of hidden nodes required in ELM is higher than the number of hidden nodes in BP, because every hidden node in BP has to be tuned carefully, while for us they are randomly generated. But why is it so stable over a very wide range [inaudible]? Because every node is randomly generated, so one more or one less usually matters little. In that sense, intuitively, the original intention of ELM is that it should have some kind of tolerance--we should have some kind of redundancy, some kind of error tolerance, right? But for [inaudible] BP, this is not a feature. Yes, please.

>>: In that sense there's no real tuning parameter?
>> Guang-Bin Huang: Yeah. So in this case we just use [inaudible]. Say ten hidden nodes is too few, incapable of learning; if 1,000 seems too many, just say 500--that usually reaches it in our simulations. We do not have to carefully tune everything. Yeah. So this is why, even if in this case the number of hidden nodes has to be given, it usually can be given very easily, just using [inaudible]. Okay. Of course, for this [inaudible] case I won't repeat any more the comparison with BP, the comparison of ELM and the support vector machine, okay?
Now, look here at the [inaudible] parameters. BP has to tune the parameters here, and of course [inaudible] you also have to tune the number of learning [inaudible]. Then SVR has [inaudible] parameters. For ELM, the traditional one, you have to give the number of hidden nodes. Of course, after we add the diagonal entry in the formula, the number of hidden nodes can be fixed, such as at 1,000, without doing any tuning here.

>>: Now you need to tune C, right?
>> Guang-Bin Huang: You are right. But C--
>>: Yeah, it does. Definitely.

>> Guang-Bin Huang: That means either you're tuning the number of hidden nodes here--but of course this is not sensitive--or you're tuning C. But C is also not so sensitive compared with SVM. I mean, you have to do tuning, but it's not sensitive compared to SVM. Okay.

>>: So this slide seems to suggest you need to use a much larger number of hidden units than BP.

>> Guang-Bin Huang: You're right. As I just mentioned, the role of each hidden node in BP has to be calculated carefully by BP, while here it is just random. So sometimes there is random redundancy there. But redundancy is good; it helps us remain stable in some cases.

>>: So my question is, for large problems, for huge problems, will this cause any problem?
>> Guang-Bin Huang: No. You can see here: not too many, even compared--if you compare with SVR, not too many. But as I just mentioned, after you add the diagonal entry, the number of hidden nodes can be fixed; usually we just pick 1,000. In almost all the test cases we have so far, just [inaudible].

>>: [inaudible] so suppose you compare the same network [inaudible] the same level parameters, all that, and when you are doing that you fix the earlier parameters by random numbers [inaudible] and then of course you are not going to get as low an error as if I optimize both. If I fix whatever you have at the top level, and I have the freedom to train the other one, I get a better result in training. So [inaudible] about the training result [inaudible].

>> Guang-Bin Huang: Correct. Great question. This depends--it depends on which method we are using to train your network. Say we are using BP: then you get stuck in a local minimum; you cannot go to the global optimum.

>>: [inaudible] all I'm saying is that you try [inaudible]--I think I would certainly do that--you use whatever method to get [inaudible] a good result, and then you fix [inaudible], and then I fix those and I do back-propagation, BP, to train my A and B, so that I won't be [inaudible] of what you randomly [inaudible]. I actually get [inaudible].

>> Guang-Bin Huang: No, BP will not get a better result in most cases. It may get a better result in some cases. What I say is, BP won't get a better result in most cases in terms of accuracy, okay?
>>: No, I'm saying that given that I use the same parameters that you learned and had, I [inaudible] to adjust the other parameters. I mean, have [inaudible] I will just use any [inaudible] I might get some lower error. Is that going to give you--
>> Guang-Bin Huang: No, definitely--
>>: The issue will be generalization [inaudible].

>> Guang-Bin Huang: No. Definitely BP will get a lower generalization performance. You ask why? Because BP has to tune all the hidden nodes--while ELM hidden nodes are randomly generated, okay, in most cases BP has to tune all the parameters. Say we have 1,000 hidden nodes randomly generated: for ELM it works. Then go to BP: overfitting. Because, say, you just have 100 data; using 1,000 hidden nodes, this network definitely cannot learn these 100 data properly, because the number of hidden nodes is larger than the number of training data. It's impossible. Right?
The training error can go to zero, but then the testing error can become very large--even go to infinity in some sense [inaudible]--there is overfitting there. But ELM is randomly generated. So in which sense is ELM better than BP? Of course there are other reasons, but one reason is here: when we generate the hidden nodes, we do not give preference to the training data. That means the hidden nodes can be generated without seeing the training data. So the testing data and the training data are treated by the same rule, given the same priority [inaudible]. But in BP, the hidden nodes are tuned based on the training data. In other words, you give the priority, the advantage, to the training data and a bias against the testing data. In this sense BP is actually over-focused on the training data, right? So the training and the testing performance may not be balanced. In this sense BP is not so good. Okay. Similar for other methods. Okay?
Now, look at a larger case; this also appeared probably five or six years ago. We have--actually it's [inaudible] 600,000 data, so 100,000 for training. Actually, later on we also tested with [however many]. Using these here, SVM spent 12 hours for one trial. That means, suppose we have already found the best parameters, the optimal user-specified parameters. Then how many support vectors? Here you have 100,000 training data, and one third of them are used as support vectors. This goes back to that gentleman's point: for the runtime case, the testing time will be very large. Even if you have the trained network, given an application, SVM may respond very slowly compared to ELM. ELM needs only 200 hidden nodes for this case, and then each training takes just 1.6 minutes. All right? But look here: it is roughly 12 hours for one trial, and to find the appropriate parameters you may have to run maybe 200 times to find the best parameters. So usually one has to spend one month to get the SVM study complete. But for ELM, as I said, just several trials, okay? So this one.

Now, go back to here: why is ELM better? The key idea of ELM is that the hidden layer need not be tuned; it can be generated before we see the training data. This is the first key point. The second key point is that the hidden layer should actually satisfy the universal approximation condition. As I just mentioned, even if h(x) is unknown, we assume h(x) satisfies this condition. If the hidden layer h(x) does not satisfy this condition, then for some applications you cannot approximate the target--you cannot solve that application. Then why would we use this method, right? Okay. So then we want to minimize these two terms. Actually, this is similar to SVM, but it comes from neural network theory, actually from quite a long time ago. This was proved by a so-called--actually, a UC Berkeley professor. The UC Berkeley professor says that if the training error is already small, we should try to minimize the norm of the weights of the whole network. So then I combine these two: why not minimize the training error as well as the norm of the weights of the whole network? Right? So this is why we come here. In order to address why ELM is better than SVM in most cases, we [inaudible] compare with the least-square SVM; from that we can kind of see the reasons. Okay. We go back to the original SVM. The original SVM talks about this kind of optimization target subject to [inaudible] conditions. Right?
If we go back to ELM, we also use [inaudible] conditions. But we know h(x) satisfies the universal approximation condition, so in ELM we have this target function--I mean, this is analyzed from the optimization aspect--and we also have this condition. Note here, this is different from SVM: SVM has a b there, right? But in ELM there is no b, because h(x) satisfies universal approximation: for any target in the feature space, the separating boundary can pass through the origin of the feature space. So without b, we have this dual optimization problem for ELM, and we only have this condition. That means, to get a solution for ELM, we only [inaudible] the optimal parameters alpha_i in this cube, from 0 to C. But look at [inaudible] in this paper, look at SVM: the problem comes because phi(x) in SVM is unknown. Unknown--they may even be using a polynomial method--means it may not satisfy the universal approximation condition. So they have a b, because in the feature space the separating so-called boundary may not pass through the origin. This is why they add a b there. After they add a b, in the dual optimization problem this condition is generated. And once we have this condition, SVM has a problem: in order to find the optimal alpha_i [inaudible] parameters, how do we find them? Not only in this cube; you also have to find the parameters on this hyperplane, the hyperplane defined by this condition. This is why you have the SMO method: find the best parameters along this hyperplane within this cube. Look here: ELM finds the optimal solution alpha_i in the whole cube; SVM finds the solution on the hyperplane within this cube. So which one has the better optimal solution? This one should be better, right? Because the optimal solution found over the whole cube is definitely no worse than the solution found on the hyperplane. So then why does this problem come? This is what is in my mind--of course, I just added this this morning, and it is not published yet, so I wish to bring this idea here for discussion: is there any flaw in SVM theory?
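To make the contrast concrete, the two dual problems being compared can be sketched as follows (the standard SVM dual versus the ELM-style dual without the bias term; this is a restatement of the argument above, not a new result):

```latex
% SVM dual: the bias b in the primal generates an equality constraint
\max_{\alpha} \;\; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j t_i t_j K(x_i, x_j)
\qquad \text{s.t.} \qquad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{N} \alpha_i t_i = 0 .

% ELM-style dual (no bias b, since h(x) is assumed to satisfy universal approximation):
% the same objective, but the search is only over the box
0 \le \alpha_i \le C \quad \text{(no hyperplane constraint)} .
```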
Look at that. SVM is very great. Actually, I appreciate this work. Without SVM, computational intelligence research might not be so attractive, right?
Okay. But the problem is that SVM always searches for the optimum solution on this hyperplane within this cube, right, as I mentioned before. So then the problem is [inaudible]: two irrelevant applications may have to search for the optimal parameters on the same hyperplane. Say I give two applications, (x_1, t_1), (x_2, t_2), ..., okay, application 1 and application 2. Say one is weather forecasting and one is human face recognition--completely different, irrelevant applications--but their targets may be the same: weather good or not; for face recognition, this man, yes or no. Right? So the targets are always, say, 0 or 1. But according to the SVM condition, they have to search for the so-called [inaudible] solution on the same plane. Even though these two applications are different, their target class labels t_i may be the same. Right?
So that means the constraint, the summation of alpha_i t_i equal to 0, is the same when the t_i are the same. Okay: weather forecasting, 500 data; face recognition, 500 data. In weather forecasting, half of the t_i are 1 and half are 0. In face recognition, possibly half of the targets are 1 and half are 0. Right? So the t_i are the same in these two applications, but the applications are totally irrelevant: the inputs x_i have quite different dimensions, different distributions. And yet the solution area for the alpha_i is the same. This does not look so reasonable: different applications--why should the solution area be the same?

>>: But if the--if you [inaudible] separate the two causes, why is it--is it not [inaudible]?
>> Guang-Bin Huang: This is the problem. They are different applications; the input dimensions and distributions are totally different. But here this constraint--which should be some parameter for the tuning--is only related to the target. It should not only be related to the target; it should also be related to the training data. It should relate to the target and also, in some way, to the input.

>>: [inaudible] training data, because if you optimize for another [inaudible].

>> Guang-Bin Huang: You're right. So this is only related to the target in this sense. In other words, it gives the preference to the target. This is what I want to say: it would be good to relate it to the training data, but it is only related to the target at this stage. Either you--
>>: I mean, [inaudible] to the training data because clearly this is just a constraint. You have the optimization [inaudible].

>> Guang-Bin Huang: Ah, you're right. But look at the optimization goal here. You see here, right, this is optimized; and from here, this part is only related to the target. Of course I cannot say this is totally wrong. What I found myself is that it gives a preference, like BP does: it gives the preference to the target. Two different applications, whose input distributions and dimensions are totally different, may be linked, made close to each other, just due to the similarity in the target. This is why I have some kind of reservation, some kind of hesitation about this part, right? This may not be reasonable. This is why I put a question here: is this reasonable enough? Okay. But this is [inaudible] for discussion only, okay; I haven't published this. It came from this morning, okay, after I woke up. Because SVM is too general about the feature mapping--phi(x) is unknown--in all the SVM literature you cannot see that phi(x) has to satisfy the universal approximation condition. It can be anything: say a polynomial; even x squared can be used sometimes. But as you can see, then [inaudible] the output of the network can go to infinity or so, right? So that may not be reasonable. And because the feature mapping is unknown, they have a bias in the output: the target is here, your feature mapping is unknown, so they have some bias between [inaudible] your real target and your actual output. Because that bias is there, I think this is a kind of contradiction. Of course, this has to be verified further; this is just for internal discussion, okay?
Yeah. We can discuss later. Okay. So let's go back to this method. Even with this analysis, we go to this method for ELM. It also works in this sense, as you see here. We give some classification test cases here. So then we see the SVM method and the ELM method. Of course for this case, for ELM, we set the number of hidden nodes to 1,000 [inaudible]; C has to be tuned there, has to be given. For SVM of course you have C and gamma. Of course [inaudible] we also can use a Gaussian kernel, but in this comparison I didn't use the Gaussian kernel for ELM. So you see here the testing accuracy is roughly always better than SVM. So that means we are comparing with the same kind of method here; [inaudible] we compare with least-square SVM, because those use [inaudible] constraints. If you go back to the earlier slides, we compare with the same method, the least-square method, right, in the same space. >>: So the data size roughly is how much in those tasks? >> Guang-Bin Huang: Yeah. Training data. Yeah. This is just the [inaudible]. This has not been shown on large datasets. Even here, for the high-dimension gene selection case, you see we had this result. Okay. Of course that is for ELM [inaudible] where all the data are ready. But in real applications, especially in so-called [inaudible] applications, right, many data come in one by one or chunk by chunk. You do not know how many data you will have or when all the data will come. So this is the situation. So how do we implement something online? This is why we have another one called online sequential ELM; [inaudible] one slide here to introduce this feature. So the data may come in one by one. Okay. Or they come in a chunk. The chunk size may vary [inaudible]: maybe just one, maybe 20 data come in at one time, sometimes the next time you have 500 data come in. Right?
So we wish to have a [inaudible] which can just [inaudible] the arriving data. Okay. As long as this [inaudible] data has been learned, the data should be discarded, should be removed from the system. Right? So next time when new data come in, you learn the new data together with the knowledge learned from the past data, from the history of the data. But after the data have been learned, they should be discarded. It's very similar to our own learning. Say we go from kindergarten to the PhD. Okay. Every semester we have a new textbook. We learn the new textbook and we go to the examination; this is a chunk of data. After we pass the examination, we discard the old textbook. We learn the next textbook, meaning a new chunk of data. Right? But when we learn the new chunk of data, we learn the new material with the knowledge learned from the earlier textbooks. We do not need to go back to our old textbooks. Of course you can review the old text, but usually once you have learned it, you set it aside. So this is the very natural way. So in this sense, we can implement this online sequential learning based on the original batch-mode ELM. Okay. I just skip this part. Online sequential learning is very [inaudible] compared to SVM and other methods. There, usually many parameters have to be tuned manually, so it is very difficult to handle. Okay. Even the training times are very large; in many, many cases it is [inaudible] very troublesome. With ELM, this can be done very, very easily. Okay. So I just skip this part. Okay. So now to the summary -- or, sorry, [inaudible]. So usually it can be done without tuning. Of course, when we say without tuning, we mean you have one insensitive parameter, the number of hidden nodes or the so-called complexity [inaudible]. But usually, compared with SVM and other methods, this is not so sensitive; just pick one, just try one or two times, usually it's okay. Right?
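To make the online sequential idea above a bit more concrete, here is a rough sketch of how such an update can be done with recursive least squares over the hidden-layer output, discarding each chunk after it is learned. This is my own illustration, not the exact published algorithm; the class name, the sigmoid hidden layer, and the small ridge term are assumptions.

```python
import numpy as np

def hidden_output(X, W, b):
    # sigmoid hidden layer: H[i, j] = g(a_j . x_i + b_j)
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

class OSELMSketch:
    """Online sequential ELM, sketched: random hidden nodes fixed once,
    output weights beta updated chunk by chunk, past chunks discarded."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_inputs, n_hidden))  # random, never tuned
        self.b = rng.standard_normal(n_hidden)
        self.P = None      # running (H^T H)^{-1}, kept instead of the raw data
        self.beta = None   # output weights

    def partial_fit(self, X, T):
        H = hidden_output(X, self.W, self.b)
        if self.beta is None:
            # initial chunk: ordinary least squares (small ridge term for stability)
            self.P = np.linalg.inv(H.T @ H + 1e-6 * np.eye(H.shape[1]))
            self.beta = self.P @ H.T @ T
        else:
            # later chunks: recursive least-squares update using only the new chunk,
            # like moving to a new textbook without rereading the old ones
            S = np.linalg.inv(np.eye(H.shape[0]) + H @ self.P @ H.T)
            self.P = self.P - self.P @ H.T @ S @ H @ self.P
            self.beta = self.beta + self.P @ H.T @ (T - H @ self.beta)
        return self

    def predict(self, X):
        return hidden_output(X, self.W, self.b) @ self.beta
```

Usage would simply be to construct the model once and call partial_fit on each arriving chunk, of whatever size, then predict at any time in between.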
It can be efficient for batch-mode learning or sequential learning. One part, incremental learning, was not mentioned here. Actually, it is also very interesting. It comes from our ELM theory. ELM is different from other methods: any major algorithm of ELM comes from theory. So that means every algorithm has a proof in theory. Okay. And any major theory in ELM generally results in a learning method. So this is different from other methods. Other methods usually come from an idea, but without that kind of proof. Okay. Sometimes you can prove something in neural networks, but there is no algorithm to support it. Say the number of hidden nodes has to go to infinity -- how do you implement that? That's a problem, right?
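Since incremental learning only gets a passing mention here (it comes up again in the closing Q&A), here is a small sketch of the general idea as I understand it: hidden nodes are added one at a time, each new node's output weight is fitted to the current residual, and existing nodes are never retuned. The function name, the sigmoid node, and the stopping rule are my own assumptions, not the speaker's exact algorithm.

```python
import numpy as np

def incremental_elm(X, t, max_nodes=200, tol=1e-3, seed=0):
    """Incremental-style ELM sketch (single output): grow the network node by node."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    e = t.astype(float).copy()                 # residual error of the current network
    nodes, betas = [], []
    for _ in range(max_nodes):
        a = rng.standard_normal(d)             # new random hidden node (a, b)
        b = rng.standard_normal()
        h = 1.0 / (1.0 + np.exp(-(X @ a + b))) # its output on the training data
        beta = (e @ h) / (h @ h)               # least-squares fit to the residual only
        e -= beta * h                          # existing nodes stay untouched
        nodes.append((a, b))
        betas.append(beta)
        if np.linalg.norm(e) < tol:            # stop once the residual is small enough
            break
    return nodes, np.array(betas)
```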
So of course ELM works for different hidden nodes, also including kernels. If h(x) is not randomly generated -- if h(x) is unknown -- you can also use a kernel, the same as in the SVM method, just using a kernel. Right? So h(x) can be [inaudible]. Okay. So [inaudible] learning is very fast. Usually, say, SVM or least-squares SVM may take three hours, seven hours, right, to complete one case. ELM takes even less than one second. So in this sense we can do something like real-time learning, okay, and online sequential learning especially can also be used in real-time cases. Then the next two slides I added this morning. I just mention: we talked about traditional SVM, okay, and we even talked about least-square SVM. Traditional SVM searches for the best result on the hyperplane within the cube. If we compare with the same method, ELM searches for the best result in the entire cube. So in this sense, ELM always provides better generalization performance, because at worst the result found on the hyperplane within that cube is just one of the candidates ELM already considers. Right?
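For the kernel case just mentioned, where h(x) is not explicitly generated, the usual trick is to work with a kernel matrix in place of the random hidden layer. A hedged sketch in my own notation; the RBF kernel and the parameter names C and gamma are illustrative choices, not taken from the slides:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_elm_fit(X, T, C=1.0, gamma=1.0):
    """Fit output coefficients with a kernel instead of an explicit hidden layer:
    alpha = (I/C + Omega)^{-1} T, where Omega is the kernel matrix on the training data."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(N) / C + Omega, T)

def kernel_elm_predict(X_new, X_train, alpha, gamma=1.0):
    # f(x) = [K(x, x_1), ..., K(x, x_N)] alpha
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```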
And also--
>>: But the parameter space is different. >> Guang-Bin Huang: SVM. SVM. >>: No, no, SVM usually has a [inaudible]. >> Guang-Bin Huang: SVM. You look here, this actually --
>>: [inaudible] you have more [inaudible]. >> Guang-Bin Huang: You see here, you look at this: this is the optimization target of SVM. Right? So T_i, T_i, phi_i, okay, usually these are the same here. >>: [inaudible]. >> Guang-Bin Huang: N here is the same. N here is the number of training data. So say the training data are the same. Then we use the same kernel, the same feature mapping phi, in SVM, and H in ELM uses the same one, say, for example. And then the condition for ELM is here. The condition for SVM is -- you add one more. Right? So ELM is usually better. Because in the ELM case we require the feature mapping to satisfy universal approximation, even if it is unknown. It should. [inaudible] you cannot verify the universal approximation condition. That means there is some black hole there. Right? In some cases you can't --
>>: [inaudible] I haven't [inaudible] in your case, for the universal approximation to hold, do you -- I remember a long time ago, when people [inaudible] universal approximation [inaudible], they had to assume that you have [inaudible]. >> Zhengyou Zhang: Well, maybe we can [inaudible]. >> Guang-Bin Huang: Yeah. Yeah. Okay. So then, similar to least-square SVM, the same case, okay, the same case -- so then, going back to your earlier question: why does ELM's random generation work well? Okay. [inaudible] So in the beginning, five years ago, I had this new idea, and I always tried to think of some natural phenomenon [inaudible]. I have already told this story to my students and also to other researchers. Okay. In the beginning [inaudible]. So what natural phenomenon is very similar to this case? Say Chinese. China, right? In China there is a [inaudible] lake, right? A lake. We want to fill up the lake, okay, or a river. You see the bottom. The bottom is a curve; the bottom is a function. So two types of people come to fill up the lake. One is an engineer. The other is a farmer. How does the engineer fill the lake? The engineer has to pump out all the water and measure the curve of the bottom to see what the size is, what the curve is [inaudible] -- get the training data first. Okay. So that is the training data. Then do another study: how many stones do I need to fill up the lake? 1,000 stones, 10,000 stones, or 100 stones? Then you move the stones in to fill the lake. Of course [inaudible]. Okay. So each stone used by the engineer is like a hidden node. So then the engineer has to measure, has to calculate, where to put each stone. This one looks too big; put it here and it may be overfitting, may fit too nicely. Put it there -- so they spend a lot of time finding the locations. Right? And if 1,000 of his stones are not enough, [inaudible] where to put more. So this is the BP learning method. Now the farmer comes. The farmer has no knowledge, right? He does not know how to do the calculation. Very simple. We have a large mountain. [inaudible] mountain, using [inaudible] to explode the whole mountain. That generates random-sized stones, rocks, big ones, small ones, even sand. So how does the farmer fill in the lake? Using a truck. He picks a random number of stones, of random sizes, pushes them into the lake, puts them randomly here, here, here. What is that? That is random generation. So when the farmer uses the truck to carry the stones, does he need to know the size, the training -- any training data from the lake? No need. Just randomly generate the stones and put them randomly here, here, here. So which one is faster? Of course the farmer's method is much more efficient. And does the farmer care about how many stones need to be used? No need, as long as the least-square error -- the error -- is small. That's okay. Right?
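The farmer's procedure corresponds to the basic batch ELM step: generate the hidden nodes at random without looking at the data, then fit only the output weights beta by least squares; beta is exactly the "height of the stone" discussed next. A minimal sketch, with my own function names and a sigmoid hidden layer assumed:

```python
import numpy as np

def elm_train(X, T, n_hidden=1000, seed=0):
    """Batch ELM sketch: random hidden layer ("dump the stones"), least-squares beta."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_hidden))   # hidden parameters chosen blindly,
    b = rng.standard_normal(n_hidden)        # no measuring of the lake bottom
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ T             # only beta is fitted to the data
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```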
So then what is the beta used by the farmer? For the engineer, he has to calculate the size, the height, of each stone. The farmer can go there and just smash a stone if the stone [inaudible]. Suppose beta indicates the height of a stone. Right? If beta equals 0, the height of the stone equals 0, meaning nothing: even if you put a stone there -- [inaudible] a hidden node put there -- you can smash the stone to make its height 0. So definitely the farmer's method is more efficient. Am I right? So then, even for the farmer's method, does it matter whether you have 1,000 stones or 1,001 hidden nodes? It doesn't matter. Even if you have more, it doesn't matter; I can smash a stone and make its beta become 0. Right? Even if you have too many, it doesn't matter -- they all become 0. So this is why ELM works. >>: [inaudible] I appreciate what you said earlier, that when you randomize your lower-layer parameters, you don't need to have all the data. In that case you minimize some of the early position in terms of each simulation. And there's actually a similar kind of technique now [inaudible] mostly in the NIPS [inaudible] they actually talk about this [inaudible]. They have a similar philosophy of not using the target in initializing the parameters. But they actually use the data for reconstruction, to make sure that whatever [inaudible] you have, you get minimum error in [inaudible] the data, but [inaudible] entirely the output. And once they do that, [inaudible] to do better [inaudible]. >> Guang-Bin Huang: Yeah. This is possibly right. This is why --
>>: [inaudible] similar kind of--
>> Guang-Bin Huang: I think so. They may have some kind of common sense behind them. So that means when we build the hidden layer, some parameters may not need to be related to your target. >>: Well, but the whole power of that approach is that it allows me to build the [inaudible] layer by layer [inaudible]. So that comes back to some of the things [inaudible] -- whether you actually use this not only for one single layer, but keep building up the layers using a similar kind of philosophy [inaudible]. >> Guang-Bin Huang: Yeah, yeah. Great idea. So here we talk about a hidden layer. We say the hidden layer is H(x). Sometimes we say H(x) is unknown, or H(x) is something else, not just one hidden node. It may be a larger network there. Maybe there are a lot of hidden layers there. >>: [inaudible]. >> Guang-Bin Huang: That's right. We [inaudible] layer by layer [inaudible]. In our theory we say H(x) may not be just a single hidden node. It can be any kind of computational hidden node. It can be a larger network, a larger layer. The theory already proves it. >>: You still have to try to see whether you build --
>> Guang-Bin Huang: Ah. >>: -- from the lower [inaudible] better result. If you do that [inaudible]. >> Guang-Bin Huang: [inaudible]. But in this case, we haven't tried yet. That is one of the good directions, though. In theory, we have something related. That means H(x) may not be just a so-called RBF hidden node or some kind of kernel. It may be another computation [inaudible]; it can even be a system. So this is why I do not intentionally use [inaudible] here. I wish to use a node, a computational node. It can be a system. It can be a single output function, whatever. >> Zhengyou Zhang: Okay. Last question. >>: [inaudible] online learning scheme, so would you have to, like, incrementally adjust the number of hidden nodes?
>> Guang-Bin Huang: Okay. So actually we have two types of learning methods for online learning. One is called online sequential learning. That is the [inaudible] case where the data come in one by one or chunk by chunk. In this case the [inaudible] is fixed -- the network is fixed. So whenever data come in, only some parameters [inaudible] are updated, especially beta, the output weights. The other technique we call incremental learning. Of course, this definition may be different from other definitions of incremental learning. There the data are fixed, or the data keep coming, but the [inaudible] can also be updated -- both the parameters and the network [inaudible] can be updated. That is the other one. That is also very, very recent. We proved that in all the cases it is better than SVR, support vector machine for regression. That is actually already published in the so-called -- actually in the [inaudible]. That also comes from -- okay, let me show here. That is published here. This theory actually generated that algorithm. Okay. Actually, the theory also came from that area. We could see intuitively that the algorithm works; now, how to prove it? That is very, very difficult to prove. We have to appreciate the peer review, the professional review, from this journal and also the editor, the past editor-in-chief of this journal: the review took three years to complete, seven reviews. Of course, in the first version of the paper, the proof came later on. But then in review, [inaudible] the result is right. So then I worked with my student and we tried to correct the proof, based on [inaudible] a published paper. Actually, the proof in that paper was also wrong. So we then noted that the proof in that paper was wrong -- it took us a long time to find that out -- and we tried to correct it, [inaudible] also someone from the math department; he said [inaudible] able to use the proof in that paper, so we tried to correct the proof in that paper. So then we tried to do it. But just one week before we submitted this paper, we also found out that, no, the earlier correction was also not correct. So then we [inaudible], then we used the [inaudible] method to correct it. So finally it worked. So it's very interesting. Finally, the algorithm generated from that theory is actually very, very efficient. I just took out those slides this morning; I found there were too many pages here. Yeah. >> Zhengyou Zhang: Let's thank the --
>> Guang-Bin Huang: Thank you so much. [applause]