News feed
Do you have questions about the course?
If you are registered on a current course offering, see the course room in Canvas. You will find the right course room under "Courses" in the personal menu.
If you are not registered, see the course memo (Kurs-PM) for DD2434 or contact your student office, study counsellor, or programme office.
In the News feed you will find page updates, schedule changes, and posts from teachers (when they also need to reach previously registered students).
Sorry, now it's the correct years.
Hello!
Should the report be at most 7 pages or 10 pages?
Thanks!
7
Hi. I have an exam 08:00-12:00 on January 16th in Kista. What does the schedule look like for the project presentations?
Afternoon.
Dear students
Please find below the project assignments (on a personal and group level) concerned with the project papers suggested by Pawel. Since I am away until January 6, I suggest that we communicate via email. I also propose that you create a Google Doc where I can address some of your questions and leave comments. As for your work, it is not necessary for you to reproduce all the results, but what you create has to be self-consistent and has to stand on its own. For example, in the paper "Text classification using string kernels" it is possible to focus on the first part (without Approximating kernels) as it constitutes a logical entity on its own.
To start with, could each group email me with a plan on what they are going to do and a list of potential questions? Later on, we can share the document and I will certainly answer further questions that will inevitably emerge (please try to group them in packets rather than sending a plethora of single-question emails).
"Text classification using string kernels"
Group 2: JJ, DS, GT, PM
Group 3: PB, PK, ED, BC
Group 4: MV, JH, DM, TB
Group 13: VP, BG, WK, FF
Group 15: JA, NBR, SL, ELM
Group 23: MG, PM, SA, IJ
Group 36: MD, JD, FH, ÞDG
“The Relevance Vector Machine”
Group 8: JH, AD, MM
Group 16: HH, DK, OXL, ML, HA
Group 37: OH, PF, HN, OB
"Sparse Gaussian processes using pseudo-inputs"
Group 9: SL, LM, MN, HH, CM
"Probabilistic principal components analysis"
Group 22: GC, AT, PM, AD
Hi,
I have a question regarding the projects.
In order to implement the method, we need to have access to the data used in the paper.
Do any of you professors have a license to access it?
Or can we create some synthetic data for numerical simulation purposes?
Thank you!
Motoya
The data is sometimes accessible in the paper or in a supplement to it, which you can access from KTH just as any of us can. If you cannot find the data, please go ahead and create synthetic data. This is also a good idea if the original paper uses synthetic data.
Given that the criteria for the higher grades are rather vague, what can those of us who are aiming for a high grade do to ensure that we get the grade we are aiming for?
Also, will the difficulty (theoretical or practical) of the paper be taken into account during grading, assuming that there is a significant difference in difficulty between the papers?
Is the 7-page limit for a two-column or a single-column report?
Single column :).
For those of you implementing "Kernel PCA and de-noising in feature spaces": some of the versions found on Google do not display some of the figures correctly (for example figure 1, where lots of data has gone missing). Here is a link to a version of the paper with a correct figure 1.
http://doc.ml.tu-berlin.de/publications/publications/MikSchSmoMueRaeSch99.pdf
Hello!
I have a question regarding the oral presentations taking place on Jan 16 between 14:00-19:00. Do we have to attend all 5 hours? And is there a presenting schedule we can have access to in advance?
Best,
Michael
Sorry, it seems that I forgot to add:
You have to participate during "your hour".
So you only have to be there for one hour.
May people not physically present record a video, or is Skype presence specifically what you are after? How will this work?
Where can I see which "hour" my group has?
The schedule can be found under Schedule. I repeat it below.
We will run the presentations in parallel sessions. Hedvig will be in charge of the presentations in Q2 and Pawel of those in Q31. I will be running back and forth between Q2 and Q31. Five groups will present in each lecture hall during each hour. You have to participate during "your hour". The schedule is designed without considering who your supervisor is. Check the format under Projects (to the left). I'll make sure that we can use the lecture halls after 18.00 (but if you are scheduled late, keep an eye on this page).
Q31:
14.00-15.00 group 1-5
15.00-16.00 group 11-15
16.00-17.00 group 21-25
17.00-18.00 group 31-35
18.00-18.20 group 41-42
Q2:
14.00-15.00 group 6-10
15.00-16.00 group 16-20
16.00-17.00 group 26-30
17.00-18.00 group 36-40
18.00-18.20 group 43-44
Yes Nikos, missing members should do their part over Skype. The group is in charge of organizing it.
Have I understood it correctly that the missing member should present their part live during the presentation? Is it OK to show a recorded video?
It should be Skype.
When can we expect to get the results for home assignments 1 & 2 and the project?
Hello, my results have not shown up in Ladok yet.
Hi! With only a week to solve the assignments, and collisions with assignments and projects in other classes, what is your recommendation for getting a satisfying grade, besides reading the chapters from the book? Previous assignments from mladv15 are over twenty pages long, and it seems like we would need at least a full-time week to solve the problems well enough to get an A.
Also do exercises from the book. We will try to extend the time window available for each assignment. The first will probably be less extensive than last year's, no guarantees though.
Will they focus more on implementation with Python, numpy, scikit-learn, pandas, etc., or more on theory and math problems? Yeah, the assignments from last year looked very extensive.
The first assignment will be very similar to last year's assignment.
What was last year's assignment?
The assignment will be posted this week.
With the same deadline?
@Zhao
2015:
https://www.kth.se/social/course/DD2434/subgroup/ht-2015-mladv15/page/assignments-51/
2014:
https://www.kth.se/social/course/DD2434/subgroup/ht-2014-51466/page/assignments-41/
In assignment 1, equation (8), how are we meant to interpret the probability density of the prior \(p(\mathbf W) = \mathcal N(\mathbf W_0, \tau^2 \mathbf I)\) considering that \(\mathbf W\) is a matrix?
Please read the remarks attached at the top of the assignment. You can think about this prior for sets of weights corresponding to each output Y (with different Wo), which is analogous to the vectorised form of matrix W. Most important is the independence assumption that you can deduce from the nature of the covariance matrix.
Hi!
Do you mean that every \(\mathbf{W}_0\) represents a row of \(\mathbf{W}\)?
Why don't you write \(\mathbf{w}_i\) instead, so that
$$p(\mathbf{W}) = \prod_{i=1}^{N} \mathcal{N}(\mathbf{w}_i \mid \tau^2 \mathbf{I})\,?$$
It is a matter of notation. In essence this is a normal distribution spanned over an array of variables (matrix normal distribution) so you can also think about it in the vectorized form (Wo can be treated as a long vector containing all parameters w in the matrix).
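To spell this out (one standard way to write it, using \(\mathbf{w}_i\) and \(\mathbf{w}_{0,i}\) for the corresponding columns of \(\mathbf{W}\) and \(\mathbf{W}_0\); the factorisation holds because the covariance \(\tau^2\mathbf{I}\) is isotropic):
$$p(\mathbf{W}) = \mathcal{N}\!\left(\operatorname{vec}(\mathbf{W}) \mid \operatorname{vec}(\mathbf{W}_0),\ \tau^2 \mathbf{I}\right) = \prod_i \mathcal{N}\!\left(\mathbf{w}_i \mid \mathbf{w}_{0,i},\ \tau^2 \mathbf{I}\right)$$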
I am wondering about the dimensions of the matrices X and Y in assignment 1. From Question 2 I understand that the Y matrix is D x N, so I thought the X matrix was stacked from vectors in the same way, which would indicate dimension q x N, but then the computations in Question 6 get really tricky.
Ok, I have now simplified this last question. Please see the text of the assignment (Q6 in Part I) and focus on a single output variable case.
Eq. (23) has been corrected in Assignment 1.
Hi Pawel!
To make sure that I understood your change in question 6 correctly (single output variable). Are we supposed to derive the posterior for the linear regression problem:
$$\mathbf{y} = \mathbf{X} \mathbf{w} + \mathbf{\epsilon}$$
with \({\mathbf{y}} \in {\mathbb{R}}^N\), \({\mathbf{X}} \in {\mathbb{R}}^{N \times D}\), \({\mathbf{w}} \in {\mathbb{R}}^{D}\), \({\mathbf{\epsilon}} \sim \mathcal{N}(\mathbf{0}, {\sigma}^2 \mathbf{I} )\) where \(\mathbf{I}\) denotes the \(N \times N\) identity matrix.
Is that what you mean?
Yes, please derive and comment.
Wait, should it be \(\textbf{y} = \textbf{Xw} + \epsilon\) or \(\textbf{y} = \textbf{Wx} + \epsilon\) ?
The latter is the definition in the assignment.
@Erik
That's indeed the definition in the assignment, but Pawel changed question 6 so that \(\mathbf{y}\) is a vector instead of a matrix \(\mathbf{Y}\) which implies that \(\mathbf{w}\) is also a vector instead of a matrix \(\mathbf{W}\). I think one would run into dimension problems when one doesn't change the order to \(\mathbf{y} = \mathbf{X} \mathbf{w} + \mathbf{\epsilon}\).
@Pawel
At the beginning of the nonlinear regression part you defined that \(f_i = f(\mathbf{x_i})\) which might imply that \(f\) denotes the actual function.
In question 9 you define the joint likelihood of the full model as \(p(\mathbf{Y},f,\mathbf{X},\mathbf{\theta})\) and the marginal likelihood as
$$p(\mathbf{Y}|\mathbf{X}, \mathbf{\theta}) = \int_{}^{} p(\mathbf{Y}|f) p(f|\mathbf{X},\mathbf{\theta}) \textit{d}f$$
Do you mean by \(f\) now the vector \(f = (f_1, ..., f_N) = \mathbf{f}\) according to Murphy or the actual function? So, shouldn't it be e.g.
$$p(\mathbf{Y}|\mathbf{X}, \mathbf{\theta}) = \int_{}^{} p(\mathbf{Y}|\mathbf{f}) p(\mathbf{f}|\mathbf{X},\mathbf{\theta}) \textit{d}{\mathbf{f}}$$
with \(\mathbf{f} = (f_1, ..., f_N)\)
We think of f as a function. However, since we take a nonparametric approach we define it as a collection of points fi (we do not have any model for the function that accounts for data). So, to cut a long story short, vector f implies a collection of samples (f1,..,fn).
@Erik
The bottom line is that for the derivation it is fine to think about a univariate output y, which implies that w is a weight (parameter) vector. How you define the dimensionality of your vectors, matrices is of secondary importance as long as you stay consistent and all the algebraic operations you demonstrate match.
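For completeness, the textbook single-output result being pointed to (e.g. Bishop 3.3.1), assuming the prior \(\mathcal{N}(\mathbf{w}_0, \tau^2\mathbf{I})\) and noise variance \(\sigma^2\) used above; you are of course still expected to derive and comment on it yourself:
$$p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N), \qquad \boldsymbol{\Sigma}_N = \left(\tau^{-2}\mathbf{I} + \sigma^{-2}\mathbf{X}^{\top}\mathbf{X}\right)^{-1}, \qquad \boldsymbol{\mu}_N = \boldsymbol{\Sigma}_N\left(\tau^{-2}\mathbf{w}_0 + \sigma^{-2}\mathbf{X}^{\top}\mathbf{y}\right)$$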
Eq 11 is currently stated as \(p(f|\bold{X},\bold{\theta})=N(\bold{0},k(\bold{X,X})).\) Should this be \(p(f|\bold{X},\bold{\theta})=N(\bold{0},k(\bold{X,X'}))\)?
Yes, this is a matter of notation.
To be very precise, the arguments of the kernel function are vectors, not matrices, so one could write \(\mathcal{N}(\mathbf{0}, \mathbf{K})\), where \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\).
Hi Paweł,
Could you please, once again, explain how we are supposed to understand \(\boldsymbol{W}_0\) in the prior, BOTH in the context of Question 6 and in general? I have a hard time understanding where it came from.
Thank you,
Wojciech
Hi Wojciech
I suggest that you clearly point out what you do not understand; it will be easier for me to help you with a specific answer. So, if you look through the discussion above (numerous comments), could you let me know which remarks/comments remain unclear? I will then do my best to straighten them out. As a general comment I can say that the difference, or exception, in Q6 is that we consider our regression problem to have only one output variable, which implies that our W0 becomes a vector, w0. In the rest of this part of the assignment we consider a more general case where the input vector x is mapped to an output vector y (not a scalar y as in Q6), which implies that the regression parameters form a matrix W. The problem can be vectorised and matrix W can then be treated as a vector (given the assumptions that are met here). Then you do not really have to worry about it and you can treat the problem in a similar way to what we did in the lectures.
Please let me know then where specifically more clarification is needed.
Next year I am going to make it look more straightforward.
Pawel, thank you very much for the fast response. I guess my question was too general, sorry about that, but I had a hard time wrapping my head around this.
To straighten out the confusing parts, if you could please confirm that my understanding is correct:
When you wrote: "You can think about this prior for sets of weights corresponding to each output Y (with different Wo)"
Here "each output Y" means each output variable in Y - in this assignment (using the notation from Theory 2.1) we would have D such output variables, as we can split the problem in to D separate regression tasks?
Using the notation introduced by Peter Langenberg, we could understand the Prior as:
\(p(\mathbf{W}) = \prod_{i=1}^{N} \mathcal{N}(\mathbf{w}_i \mid \mathbf{W}_0^{(i)}, \tau^2 \mathbf{I})\), where \(\mathbf{W}_0^{(i)}\) is a vector from \(\mathbf{W}_0\) that holds the means of the distribution corresponding to \(\mathbf{w}_i\).
And the last part:
The elements in \(\bf{W_0}\) are the means of the distribution of the Prior corresponding to weights \(\bf{W}\).
The values of the elements in \(\bf{W_0}\) are arbitrarily chosen by us (according to our prior beliefs) and could be seen as "hyperparameters" of the regression problem we are solving?
Using the notation from the Bishop book we would have: \(p(\bf{W} | \bf{W_0}, \tau^2) = \mathcal{N}(\bf{W} | \bf{W}_0, \tau^2\bf{I})\)
Thanks in advance! :)
Each output implies each univariate component, y_j (j=1,..,D), of the multivariate output vector y.
I would not necessarily split it into independent tasks (though it is often done this way) but rather interpret your Wo as a vector (notation is not very obvious here) obtained from matrix vectorisation (a simplified approach to matrix normal distribution). This will imply then that you operate in the space spanned by all the parameters. Bear in mind however that these details are not really all that relevant to the questions (except Q5 where you just need to take into account this multidimensional aspect of W representation).
We can assume we incorporate our beliefs in Wo so the parameters of the prior (tau and mean values in Wo) are fixed.
Hi Pawel, can we make use of a Jupyter/IPython notebook for the entire assignment(s) and hand this in as the report as well, or is a PDF with LaTeX preferred? IPython notebooks are great for having everything in one place: explaining theory, computing in Python, visualising the data, etc.
Hi Oktay, please submit a pdf.
You can generate a PDF directly from an IPython notebook with LaTeX markdown cells. To hide the code and just show the figures, follow the guide in the link below (requires Pandoc!).
https://ga7g08.github.io/2015/06/25/ipython-nbconvert-latex-template-to-hide-code/
Awesome, thanks Fabian! :D
Hi Pawel!
In question 17 we are supposed to perform the marginalisation according to
$$p(\mathbf{Y}|\mathbf{W}) = \int_{}^{} p(\mathbf{Y}|{\mathbf{X}},{\mathbf{W}}) p({\mathbf{X}}) \textit{\rm d}{\mathbf{X}}$$
Is it ok, when I do that for vectors with \(\mathbf{x} \in {\mathbb{R}}^M\) (latent variable), \(\mathbf{y} \in {\mathbb{R}}^D\) (observed variable) and \(\mathbf{W} \in {\mathbb{R}}^{D \times M}\) so that
$$p(\mathbf{y}|\mathbf{W}) = \int_{}^{} p(\mathbf{y}|{\mathbf{x}},{\mathbf{W}}) p({\mathbf{x}}) \textit{\rm d}{\mathbf{x}}$$
\Peter
Yes, most definitely!
Hi! Do you want a specific level of simplification in the expression for question 19.2 (gradients of the objective with respect to the parameters)?
My expression should be correct but it looks kind of messy (a very long term including a lot of linear algebra such as trace, inverse matrices ...).
\Peter
Hi Pawel,
I spent a lot of time trying to understand what create_index_set was doing, and it seems that it's simply returning the order of indices that sorts the input list (except the first index, btw). Why not simply use numpy.argsort in that case? Did I miss something?
Thanks,
Alexandre
Is anyone else getting that the result of \(k(\mathbf X, \mathbf X)\) is a singular matrix? I tried generating 500 evenly spaced points [0, 1, ..., 499] and calculating the covariance function for every pair of points (which reduces to \(\sigma_f^2 \exp(-(x_i - x_j)^2/l^2)\) in the one-dimensional case), but unless I choose a rather small \(l^2\) (seemingly at most 9), the resulting matrix is singular (it is both symmetric and positive semi-definite, however). This makes it difficult to sample more "smooth" functions. Changing \(\sigma_f^2\) doesn't seem to help, and I've mostly stuck to a value of 1.
I also tried generating 500 evenly spaced points in the [0, 1) range instead, but when I did that I got a singular matrix with an \(l^2\) of anything above around \(10^{-5}\).
If we are supposed to do Question 6 as well as Question 17 for a single output variable y, how are we then supposed to be able to do the practical part 2.4 where we do not have a single output variable y?
Also, the dimensions do not seem to add up in part 2.4. Should it perhaps be Y^T instead of Y in Eq. (28)?
Yes, you may end up with a nearly singular covariance matrix if you estimate it in a high-dimensional space (a relatively large number of points used to construct your Gram, or more generally kernel, matrix). Still, you can draw samples from it; most packages are capable of doing that, resorting to different approaches, e.g. Cholesky decomposition of the covariance matrix, its eigenvector structure, etc. Take a look at scipy.stats.multivariate_normal.
I managed to solve it by using numpy.random.multivariate_normal instead of creating a scipy.stats.multivariate_normal object as I was doing previously. It also seems to work if you call scipy.stats.multivariate_normal.rvs as a static method rather than calling rvs on a multivariate_normal object.
Wonderful! Great job.
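For anyone else stuck on this, here is a minimal sketch of the sampling discussed above (the function and variable names and the hyperparameter values are only illustrative, not taken from the assignment):

import numpy as np

# Squared-exponential kernel on a 1-D grid: K[i, j] = sigma_f2 * exp(-(x_i - x_j)^2 / l2)
def rbf_gram(x, sigma_f2, l2):
    diff = x[:, None] - x[None, :]
    return sigma_f2 * np.exp(-diff ** 2 / l2)

x = np.linspace(0, 1, 500)              # 500 evenly spaced inputs
K = rbf_gram(x, sigma_f2=1.0, l2=0.1)   # nearly singular for large l2, as discussed above

# Option 1: numpy's sampler copes with a numerically singular covariance.
f1 = np.random.multivariate_normal(np.zeros(len(x)), K)

# Option 2: add a small diagonal "jitter" so that a Cholesky factor exists.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
f2 = L @ np.random.randn(len(x))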
Dear students
Having talked to some of you, I have made some cosmetic changes in the document. Please find the new version uploaded. In short, in Q11 it says clearly now that you should set the prior (cov) and visualise it. The longest text update has been made in Q17, where a hint is provided as to how the marginalisation can be done. I think it is more productive for you to address it in a simpler way rather than get bogged down in the derivations (which was my original plan). In Q19 there is an explicit reference to Eq.23. In practical 2.4 I am asking you now to discuss (explain) any invariance you observe. Two equations are also refined since the dimensions did not match.
Hi! In the part 2.4.1 (question 20), are we only supposed to learn the linear mapping, so matrix A that generates the data? Or are we also supposed to learn the non-linear mapping (function f_nonlin)?
In question 20 we are supposed to visualize X. Do you mean by X the true underlying parameters or is it just a symbol for what we learn from performing the optimization?
You do not need to learn f_nonlin since you know what it is.
Please visualise the learned representation X of the data you observe.
Whenever the questions don't explicitly tell us to derive a result ourselves, is it OK to just use formulas as given in the lectures/book?
Yes, with a comment on where the equation comes from and what it means (if applicable).
Two more questions:
1. In question 24, should we talk about marginalization in general and what it does, or what the implication of marginalization is in this specific context, or both?
2. The text after question 24 mentions the prior \(p(\boldsymbol \theta \mid M_i)\), but the marginalization in (40) uses \(p(\boldsymbol \theta)\); shouldn't the latter be \(p(\boldsymbol \theta \mid M_i)\) as well?
Also, is there any limit on how low S can be when doing the Monte Carlo sampling? Because generating the evidence for all of the data sets turned out to be unbearably slow for S larger than 1000, and I'm wondering if it's worth spending the time to try to optimize my naive approach.
There is no need for fine tuning over a massive number of sample data sets.
I see that it did not come across clearly in the lecture. In Q17, when you marginalise, please do not think about the distributions in terms of likelihood. You deal with a prior and a posterior that are integrated over all samples of x (X). Therefore, it might be easier for you to think of P(y_i|W) (y_i = W x_i + mu + noise), which becomes in fact a distribution over random variables, not individual samples (in the spirit of marginalisation).
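For reference, with a standard-normal prior on the latent variable (an illustrative assumption here; use whatever prior the assignment actually specifies), this is the usual linear-Gaussian marginalisation:
$$\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\quad \mathbf{y} \mid \mathbf{x} \sim \mathcal{N}(\mathbf{W}\mathbf{x} + \boldsymbol{\mu},\ \sigma^2\mathbf{I}) \;\;\Rightarrow\;\; p(\mathbf{y} \mid \mathbf{W}) = \mathcal{N}\!\left(\mathbf{y} \mid \boldsymbol{\mu},\ \mathbf{W}\mathbf{W}^{\top} + \sigma^2\mathbf{I}\right)$$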
@ Arvid Fahlström Myrman
Ad.1) It is about marginalisation in the Bayesian context, and certainly the equation in Q24 is a good representative here. You might then want to explain it within the framework of this equation (it has a more general meaning in Bayesian modelling, though, where we average variables out). What does it mean?
Ad.2) Yes, to be strict, it should. However, the interpretation from a sampling perspective is that the marginal likelihood is the probability of generating the data set D from a model whose parameters are sampled at random from the prior \(p(\boldsymbol{\theta})\), hence sometimes we just write \(p(\boldsymbol{\theta})\). To stick with the principle, you may want to write it as you suggest, though.
Dear all
I am just writing to let you know that I will not be able to address your questions over the weekend.
Good luck with the assignment and enjoy the weekend!
I think create_index_set is buggy. When populating N, it checks if D[ind] == L[-1], but L[-1] has already been removed from D previously, so the expression will always be false, and as a result N will always be empty.
@Alexander While I do have an updated local version, I'm still not confident that it's 100% correct (I already found one other bug other than the one I mentioned). You're probably better off either writing your own implementation based on the original paper, or coming up with your own sorting scheme.
@Arvid et al.
It is strange; this piece of code has worked for me. However, I certainly cannot guarantee it, as I currently have no access to a proper PC to re-run the assignment. I inherited this piece of code from last year's course round and it did the job for me.
Anyway, if you find it buggy, please (anyone) implement these few lines of code in line with the original report by Murray et al. "A note on the evidence and Bayesian Occam’s razor" (as far as I remember) and post it. I apologise for the trouble and will certainly extend the deadline for such effort.
@Pawel I think the reason it looks like it works is that it (inadvertently) does something similar to simply sorting the data sets by the sum of probabilities from each model, which actually ends up being a fairly good way to visualize the data sets.
Hi,
I want to make sure about the deadline: is it 23:59 on 11.29 in the Stockholm time zone, or noon on 11.29?
Thanks.
Sure, midnight is perfectly fine.
Does that mean that the deadline given in the assignment PDF is incorrect?
It just means that 12 hours of difference does not play any role, especially if there is anyone who travels and is currently working from a very distant location in a different time zone.
Hello @Pawel,
I'm a little bit unsure about the formulation of question 20. As I understand the topic of the section, we want to recover our one-dimensional x values. But the question states that we should plot it in a two-dimensional space. Are we supposed to do the PCA backwards and get the y's from the estimated x's, or should I learn a 2- or higher-dimensional representation and plot the first two dimensions? I get reasonable results in both cases, but I can't state them here without revealing some possible answers to the question... Can you give me a hint as to how the question was intended?
Thank you in advance.
Hi @Pawel,
In question 23 there's a subquestion "What does this actually imply in terms of uncertainty?" I'm a bit confused about the kind of uncertainty meant here. Is it an uncertainty inside each particular model, an uncertainty about model selection, or some other kind of uncertainty?
Thank you in advance.
@Simon
The focus is on the latent variable that generated Y. So you could plot x in one-dimensional space. Then, in order to discuss invariances, you can also plot the full (nonlinear) mapping you have recovered (this is where the hint to plot X as a two-dimensional representation comes in).
@Dmytro
I am asking about probabilistic uncertainty associated with the selection of a model based on the evidence from data (posterior) and prior. Naturally, it involves individual models since we select from them.
Hi!
Did somebody update the create_index_set script or is it still buggy?
@Hanna I think you're on the right track, but as D_org points to the same object as D, any changes to D in your code will be reflected in D_org, so I think the code still works identically. Try adding some debug output to see if the "else" clause is ever actually run.
I attempted to implement the algorithm myself, but I find that the resulting graph contains terrible oscillations. While it's quite possible that I have a bug in the code, I'd say it's probably better to simply run argsort on the summed evidence to get a nice graph.
Dear students
Just a quick update: given extenuating circumstances, I have decided to extend the deadline to November 30 (midnight). Those who have already submitted their work can resubmit it if they want to. Again, thank you for the good work and solid assignment reports.
I am afraid that tomorrow I will be unable to respond to your queries.
Good luck!
@ Arvid
Yes, you are totally right. Thank you. Even when I change to np.copy(D) I only get to the else part once in the beginning. So I decided to just remove my script since it might just be confusing and I don't have the time to fix it.
Hi!
For me, the code worked well and could plot correctly so I think it is ok now.
@Motoya No the code is still broken, but after fixing the bugs Arvid found I get an almost identical output to the original buggy code anyway. So it should suffice for the purposes of this assignment.
@Arvid I also got those oscillations at first, until I defined the distance metric in terms of abs(sum_i p(D|H_i) - p(D'|H_i)). Also, the line "L.append(N[dist[L[-1],N].argmin()])" should use argmax() instead, since the appended point should be the furthest point from L[-1] in N according to the paper. This resulted in a plot very similar to the ones in the paper.
@Axel
Yeah, it's oscillating, but you mean we can still infer the meaning of the evidence plots from them, and thus it's no problem for the assignment, right?
@Axel I'd noticed the argmin/argmax bug, but I never got around to testing different distances. I tried your suggestion now, and it turns out that the result is actually exactly the same as what I get when I simply run argsort on the summed evidence. However, you inspired me to see what happens when I use \(\sum_i |P(D \mid H_i) - P(D' \mid H_i)|\) instead, which actually yields something more similar to the graphs in the original paper. Then I tried changing this to the Euclidean distance instead, i.e. \(\sum_i (P(D \mid H_i) - P(D' \mid H_i))^2\), and the result is almost identical to the original paper.
Here is the code I used. Note that this code is written for Python 3 (though it may or may not work with Python 2), and that it assumes that the evidence matrix has one row per data set, and one column per model. Pass the whole evidence matrix to the function rather than summing over the evidence for each data set.
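For later readers of this thread, a minimal sketch (not the original attachment) of the greedy ordering described above, assuming an evidence matrix with one row per data set and one column per model, and the squared Euclidean distance between rows; the starting index is an arbitrary choice here:

import numpy as np

def order_datasets(evidence):
    # pairwise squared Euclidean distances between evidence rows (data sets)
    dist = ((evidence[:, None, :] - evidence[None, :, :]) ** 2).sum(axis=2)
    order = [int(np.argmax(evidence.sum(axis=1)))]   # start from the "largest" data set
    remaining = [i for i in range(len(evidence)) if i != order[0]]
    while remaining:
        # append the remaining data set furthest from the last one chosen
        nxt = max(remaining, key=lambda i: dist[order[-1], i])
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)

# usage: idx = order_datasets(evidence), then plot the columns of evidence[idx]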
@Arvid Brilliant work! This seems to be what the authors intended and I find this visualization a lot cleaner than simply running argsort.
@Axel
Hi, I think the sorting code does not have a bug.
The problem is, as you guys say, the definition of the distance.
The graphs in the paper seem to be based on 1) the maximum evidence (l-infinity norm), and 2) the difference between the maximum and the second largest value.
Therefore, a heuristic but easy way is to just change the line
"index = create_index_set(np.sum(evidence,axis=0))" to
"npmax = np.max(evidence,axis=0)
index = create_index_set(npmax+np.max(npmax-evidence,axis=0))", for example.
The result is still oscillating but this is because of the smaller number of samples I think.
But the point is not the definition of the distance.
@Axel
Yeah, your code works perfectly even if the number of samples is small.
But I cannot understand the inner structure of your code....
Btw, thanks very much!
Wow! Huge difference, thanx a lot @Arvid :)
@Pawel Is it okay if I use the index function (CH Ek's py script) in the assignment?
Will there be a confirmation after receiving the report?
Unfortunately, I am not physically capable of doing that given the number of submissions.
Hi Jens!
Just before questions 1-3 you mention "Answer the following questions (in this problem you do not have to motivate your answers)". By that, do you mean that we just need to write down the results without even mentioning the blocked/unblocked paths? Or that we need to write down only the basic assumptions that we made?
Thank you,
Christos
Hi,
When can we expect to have assignment 1 graded? This would be nice to know in order to have the grade requirements in check to set a target for assignment 2.
Secondly, you should really consider a proper hand-in system if you cannot manually give confirmations. Given the impact of a failed hand-in, no matter which end (or middle) is at fault, it is quite an uneasy experience.
Thank you,
David
I don't really know, but we are trying to do this as fast as possible. Although I realize that it is advantageous to have full information, sometimes we have to deal with uncertainty. Moreover, the deadline for the second assignment is not imminent.
As an answer to Christos's question: a correct answer is enough, that is, stating which set has which property.
@Jens
Thank you
Hi Jens,
"All problems except those concerned with VI can be attacked right now", does that include the exercise 2.3 as well, or only 2.6 and 2.7?
Thanks,
Alexandre
Hi Jens!
In Question 5 (Implement the Casino model (in Matlab or Python).): Are we supposed to write down anything regarding this question in our report?
Thanks,
Peter
Regarding Q10: How specific should the algorithm description be? Is it sufficient to show a mathematical expression that's easy to compute given the parameters and definitions given in the lecture, or are we supposed to e.g. provide pseudo code for computing the entire expression, or for things defined during the lectures?
Alexandre: Yes you correctly spotted that 2.3 is mature enough to be attacked at this point. So, wait with the last two, but the other problems are OK to work on.
Peter: You should be ready to show your implementation if we ask you to, otherwise you should only use it.
Nikos: Yes, mathematical expressions that are easy to compute are sufficient. So do what I did: start with what you need to compute and stop whenever you reach an expression that is easy to compute (a bit of judgment is required here; I have to believe that you know how to compute it, but there is no reason to redo what you can find in the book or what I have done during lectures).
In the casino model, how is the starting state (\(t_1\) or \(t_1'\)) chosen?
Let's say with equal probability.
Arvid, I initialized it with equal probability of either table as the start.
Jens, for q6: do you want the sums for each table for some players, player sums over all tables, or both in the report?
The sum for each table.
In section 2.6 VI:
Can we assume that \(\mu\) and \(\lambda\) (not with k) are given parameters (i.e. parameters for the prior distribution)?
And also that \(\epsilon\) and \(l\) (not with k) are given (i.e. parameters for the prior distribution)?
In question 6 when you say to provide generated data, do you mean that we should actually include samples from the model in the report? Also, do you expect actual written answers to questions 5 and 7?
Not "raw" samples, but a visualisation of what you get. I have earlier answered no, concerning text answers on coding questions. I will, however, update this and give the following instruction: state the language you used for the implementation and the number of code lines.
Hi Jens,
For the casino model, the introduction paragraph says we should note the outcome of the player's dice Z_k. Why is the player's dice indexed by a table k and not by a player i? Isn't it the categorical distribution of the dice? It seems more natural to me to index the player's dice by i. Otherwise, S^i_k = Z_k + X_k with no dependence on i.
Hi Jens,
In question 10, it states that we are given "a sequence of tables R1 ... Rn". Am I correct to assume that n < K in this case?
Yes Bertrand, the notation does not fully describe the situation. Writing S_k = Z_k + X_k would specify a one-player model (the k for the player shows that it is the value the player obtains at the kth table; all X_k are sampled from the same distribution), which can then be repeated for each player. By writing S^i_k = Z^i_k + X^i_k all the players become explicit. OK?
Joonatan: In fact, this must be a typo and n should be K, that is a sequence of tables r_1,...,r_K.
A warning to those doing task 2.3, depending on what edition of the book you are using, the result given for \(\ln q^*_\tau(\tau)\) (equations 10.28-10.30 on page 471) might be incorrect; see the list of errata for the first printing of the book. I spent way too much time trying to figure out why I didn't get the same expression as the book.
Dear Jens
I have some trouble applying the theory for Q10, which requires checking the probability of a sequence of hidden states, given observations and parameters. Bishop explains how to find p(z_n = k | X_{1:T}) through the forward-backward algorithm, but I am not sure how to go from this (the probability of a hidden state at one particular time) to a joint conditional probability of the whole sequence (in our case all tables r). In your slides, in sampling the posterior (Q11), you explain how a sample can be acquired one timestep (table) at a time through smoothing, which fits better with these methods because it's stepwise, but how can I find the probability of the whole sequence given its steps? In short, I don't understand how to estimate the right-most expression here:
PS: I was also thinking of using the Viterbi algorithm, following the path of the hidden states I am interested in rather than doing argmax, but since this algorithm is mainly used for decoding (rather than evaluation), I feel like the idea is a bit off-track.
Amund: Q10 is easier than the sampling and there is no reason to use Viterbi (in fact it cannot be used for this purpose). Your expression looks really fishy, i.e., look at the limits of the product. Anyway, you need to think about what you want to compute here.
Just to confirm, for questions 1-3 on A2, it is sufficient with an answer like "3,4,5"?
Hi Jens,
At the beginning of section 2.1, you use variables \(x\) and \(y\) to define the plus/minus notation used in the figure. What are these variables?
What are we supposed to actually write in our reports for tasks 2.6 and 2.7? Or will the questions be added later?
You should merely derive the algorithms. I should have added that as a question.
In section 2.6, are the posterior parameters given (i.e. epsilon and lambda, without k)?
Oktay: yes.
Lucas: I had a quick look (I'm in a cab ...) and the notation is not crystal clear. However, y must be a child, x a specific parent, and c values for the other parents.
Erik: I don't really get your question; the parameters have indices as far as I can see. Perhaps I'm missing something and you can expand on your question.
I think I was a bit unclear. In the question it says that the prior distribution has the form \(\varepsilon_n \sim N(\varepsilon, \varsigma^{-1})\). I was wondering if \(\varepsilon\) and \(\varsigma\) are given constants (the same for \(\mu_k \sim N(\mu, \lambda)\)).
In questions 10 and 11, we are only concerned with a single person's walk through the casino, right? So the distribution of the player-owned dice is the same throughout?
Yes it is a single player with a single dice.
Erik: Yes these hyper parameters are known constants.
Hello Jens,
In question 8 we are asked to "describe the exact posterior". What exactly does "describe" mean in this context? The formula for a Normal-Gamma posterior can be found in books, so is it enough if we just write it down, or should we show the derivation?
Thanks,
Wojciech
Yes it is enough.
There seems to be a typo in the top box of the assignment 2 PDF; it says "These grades are valid for review December 18th, 2015. Delayed assignments can only receive the grade E." Shouldn't this instead read December 22nd, 2016?
It should definitely. Thanks for catching this.
In 2.6 it seems like Z has been mixed up with Y. Could you confirm this, Jens?
Thanks.
In 2.6 are we supposed to implement VI, or is the mathematical solution enough?
I just had a quick look and my impression is that 2.6 is (self-)consistent, but uses a different notation from the earlier problems (i.e., Y rather than Z).
A mathematical solution is sufficient.
I think the following sentences could raise a bit of confusion:
1. "...and \(\mu_k\) has prior distribution \(N(Y|\mu, \lambda^{-1})\), ..." (using \(Y\) in another context rather than the outcome of the player)
2. "... and \(P_n\) is equipped with \(N(Z|\xi_n;\iota_n)\), ..." (using \(Z\) instead of \(Y\) to denote the outcome of the player)
Jens, the thing is that both Z and Y occur in 2.6. Or maybe there is something you wish to convey by this?
It was a bit late yesterday :). I have now uploaded a new version; I believe that I've managed to weed out the Y's.
Hi, in the first part of the current assignment, we are supposed to answer questions regarding the probability pairs with respect to the given DAG. I am running into a - perhaps semantic - issue with the third question. What do we mean when we state that a pair is incomparable? I am hesitant to provide more clarification as it might answer the question, but @Jens got my question by email and asked me to ask it here, so I assume any answer will factor in the appropriate amount of detail!
Any and all input is received with gratitude!
The explanation in the text is "the two values can not be compared based on the information available in the DAG". It should perhaps be "no relation between the two values can be inferred from the information provided".
In 2.2 it says "We then observe the sum S^i_k of the two dice, while the outcome of the table’s dice X_k and the player’s dice Z_k are hidden variables."
Does this mean that the graphical model should show the sums as observed?
Hi Jens,
I have a quick question about the drawing of the Casino. I would somehow like to represent that the table-dice value (which is an RV) depends on the Categorical Distribution parameters (which are constant) but they are different between primed/unprimed tables (which is an RV). Will the notation shown below be considered proper (circle is RV, dot is a constant) or is there a different way of representing this type of relation?
Thanks,
Wojtek
Well this is the type of question I normally avoid answering. However, in this case it seems appropriate to point out that a constant which depends on a random variable, or rather that dependence, is not so useful.
Andre: focus on the relation between the variables. That is the most important part, but of course adding the information concerning hidden and observed is good.
Hi, are we not allowed to assume that "explaining away" is the interaction for all cases of intercausal reasoning in Question 1 - Question 3?
The formulation is "You should also assume the parents in any V-structure are conditionally dependent given the common child". That answers your question, right?
Is it as with assignment 1 that if we hand it in after the deadline (22nd of December), we can only get an E? I would really appreciate to get a few more days after christmas to finish the assignment and I assume the examiners would not start revising the assignments before christmas?
In task 2.5 it says that we should find the parameter \(\Theta\) that maximizes \(P(s_1^i, \ldots, s_K^i|\Theta)\). I'm a bit confused about the \(s_k^i\) terms: does this mean we should only analyze the one-player case, or does the question imply we should find the \(\Theta\) that maximizes the joint probability over all \(i\)?
Thanks Axel. Yes it is clearly confusing. It is the joint probability over all the players (i:s) that should be maximised.
I agree with Viktor. Many of us have exams this week too, and have been studying for them for weeks parallel with the assignment. I'd love to have an extension until the end of this week on Sunday night so I can work on more problems.
Is it okay to do Task 2.3 using only broad non-informative priors since nothing is known about the parameters?
I'm somewhat puzzled by your question. Equations (10.22) and (10.24) specify the prior distributions, and you know the hyperparameters. Please ask again if this didn't resolve your issue.
So I wonder if I can consider only the case with all the hyperparameters set to zero, \(\mu_0 = a_0 = b_0 = \lambda_0 = 0\), and use Eq. (10.31)-(10.33) in Bishop to compute the distribution of the factors.
Your solution should work for any choice of hyper parameters. So, you cannot assume that they are all zero.
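For reference, the model in that Bishop section (10.1.3) places a Gaussian-Gamma prior on the mean and precision, with \(\mu_0, \lambda_0, a_0, b_0\) as the known constants referred to above (the exact equation numbers depend on the printing):
$$p(\mu \mid \tau) = \mathcal{N}\!\left(\mu \mid \mu_0, (\lambda_0\tau)^{-1}\right), \qquad p(\tau) = \operatorname{Gam}(\tau \mid a_0, b_0)$$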
Hi!
My implementation works fine for part 2.3. I'm just wondering whether you want to see something special in question 9? I'm able to reproduce the images from Bishop. Is that enough for this question?
\Peter
I have a question regarding 2.6, should Xk and Zn be included in the joint distribution used for the variational inference approximation? I.e. should it be p(S,X,Z,\mu,\xi) or p(S,\mu,\xi)?
/Markus
Peter: one image cannot be classified as a couple of interesting cases.
Markus: I would advise you to consider the tip "what distribution do you get from the sum of two Gaussian random variables? What is the relation between the means?"
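(For reference, the standard fact behind that tip, for independent Gaussian variables:)
$$X \sim \mathcal{N}(\mu_X, \sigma_X^2),\ \ Z \sim \mathcal{N}(\mu_Z, \sigma_Z^2)\ \text{independent} \;\;\Rightarrow\;\; X + Z \sim \mathcal{N}\!\left(\mu_X + \mu_Z,\ \sigma_X^2 + \sigma_Z^2\right)$$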
Yes, I have the distribution of S, but I got two different answers depending on whether X and Z are included in the joint or not. I think both have the correct mean but different precisions, and I cannot find in the book which is the correct joint.
/Markus
This is somewhat hard to answer. You should probably take a decision.
For question 13: Do I have to use the data generated in Task 2.2, or can I create extra (more interesting) data for this question?
More interesting is, well, more interesting.
I really do not want to be annoying, but I would like to get an answer to my question from yesterday as well. We do not even have our grades for the first assignment, so it's pretty tricky for us to balance the effort we put into this assignment against the other tasks that we have to finish...
Since I'm really interested in putting more time into the ML assignment, an extended deadline would be awesome.
Extended deadline would be much appreciated, as I too want to learn as much as possible. Given circumstances much like others here - a lot of other tasks in other courses and the extensiveness of the ML assignments - time has been hard to find for full completion of this assignment. Motivational rewards (i.e. an extended deadline giving the possibility to complete more tasks) for more in-depth learning should be in your best interest as well? I believe a lot of students are about to write thesis work in the area too, and are therefore eager to gain as much knowledge as possible.
So in other words: Please help us in the process of making good thesis work in order to reflect good results upon your institution! :)
You have been granted a very substantial deadline extension.
Thank you, Jens!
Thanks for the extension Jens! A question about 2.6: do we have to figure out the functional form of \(q^{*}\) or is an expression of the type (10.11) in Bishop with a normalizing term enough?
Please specify the actual distribution.
I don't know when you were planning on grading the assignments originally, but is there any chance you could start grading early assignment submissions ahead of time? If possible I'd like to know how I did on the assignments before the end of the year.
Question 13 says that we should test the EM algorithm on data generated in task 2.2. Does this mean that even though in task 2.5 we are modelling the casino as having only one set of tables \(t_1, \ldots, t_K\), we are supposed to test the EM algorithm on data generated from having two sets of tables, \(t_1, \ldots, t_K, t_1', \ldots, t_K'\)?
Yes. How do you handle that?
Hi, Jens.
I have some questions about EM algorithm in general
- As far as I understand, the EM algorithm guarantees an increase of the log likelihood at each step but says nothing about the complete log likelihood, so the complete log likelihood can decrease. Is that correct?
- For the complete log likelihood there's the notation \(l(\Theta; D)\). Is there any semantic meaning to this semicolon? I mean, why is a simple comma not used instead?
- Is there any way to verify that EM derivations are correct? I mean, they look reasonable to me, but there should exist a less subjective sanity check :)
Thanks!
1. Well, the expected complete log-likelihood may decrease during the execution. I guess that's what you wanted to ask about.
2. It is used in order to emphasise that it is a function of \(\Theta\), but not a probability (so in general integrating over \(\Theta\) doesn't give 1).
3. I guess the easiest way to test it is to implement it and see whether the log-likelihood increases. Sure, you may implement something slightly incorrect and it is not an absolute proof, but it will make you more confident.
Merry Christmas!
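(For reference, the quantity in point 1 is the expected complete-data log-likelihood that EM maximises in the M step:)
$$Q(\Theta, \Theta^{\text{old}}) = \mathbb{E}_{Z \mid X,\, \Theta^{\text{old}}}\!\left[\ln p(X, Z \mid \Theta)\right]$$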
Hello,
When can we expect to have the assignments' grades released (at least for the first)?
Any news on the grading Pawel and Jens?
Hello!
A question about what's expected of the students in terms of the chapters in the book.
For example, regarding chapter 2 and the comment "The entire chapter is background but you need to know a lot of this, in particular 2.3 and 2.4": am I supposed to be able to solve most of the exercises in the book that are related to the specific chapters? And would this be the best way to prepare for the assignments?
I would estimate that it is too time consuming to actually solve most of the exercises related to the chapters we will be covering. Whether you are supposed to be able to solve most of them is of course another question, and I believe that the answer is no. Doing exercises (selected and as many as possible) as well as actively asking questions about and testing yourself on the material you read is the way to learn this material properly. Knowledge is constructed, not received.
Ok, thanks for your answer Jens!
This information seems to be outdated; is there any chance you could update it, in particular the information concerning important dates for the project?