In a day short of research (because: getting ready to teach again!), I spent some time working with the Simons Foundation to prepare for the #GaiaSprint, which is coming up in 8 weeks. After that I had lunch with Ekta Patel (Arizona), who has been working on the dynamics of the Local Group, and especially understanding the orbits of M31 and M33.
In the morning I had a long and overdue conversation with Alex Malz, who is attempting to determine galaxy one-point statistics given probabilistic photometric redshift information. That is, each galaxy (as in, say, the LSST plan and some SDSS outputs) is given a posterior probability over redshifts rather than a strict redshift determination. How are these responsibly used? It turns out that the answer is not trivial: They have to be incorporated into a hierarchical inference, in which the (often implicit) interim priors used to make the p(z) outputs are replaced by a model for the distribution of galaxies. That requires (a) mathematics of probability, and (b) knowing the interim priors. One big piece of advice or warning we have for current and future surveys is: Don't produce probabilistic redshifts unless you can produce the exact priors too! Some photometric redshift schemes don't even really know what their priors are, and this is death.
In the afternoon, I discussed various projects with John Moustakas (Siena), around Gaia and large galaxies. He mentioned that he is creating a diameter-limited catalog and atlas of galaxies. I am very interested in this, but we had to part ways before discussing further.
Coming back in from a short vacation, it was a low research day. John Moustakas (Siena) is in town this week, and we discussed the state of some of his projects. In particular, we discussed Guangtun Zhu's paper on discrete optimization for making archetype sets, and the awesomeness of that tool, which Moustakas and I intend to use in various galaxy contexts.
On the airplane home from MPIA (boo hoo!) I wrote the shortest piece of code I could that can take interim posterior p(z) redshift probability distributions from a set of galaxies and produce N(z) (and maybe other one-point statistics). I can make pathological cases in which there are terrible photometric-redshift outliers that are structured to cause havoc for N(z). But as long as you have a good generative model (and that is a big ask, I hate to admit), and as long as the providers of the p(z) information also provide the effective prior on z that was used to generate the p(z)s (another big ask, apparently), you can infer the true N(z) surprisingly accurately. This is work with Alex Malz and Boris Leistedt.
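To make the idea concrete, here is a minimal sketch of this kind of calculation. It is not the code written on the airplane. It assumes (my assumptions, not anything from the project) that the p(z)s are delivered on a common redshift grid, that N(z) is a simple histogram on that grid, and that a maximum-likelihood point estimate found by expectation-maximization is good enough. The function name `infer_nz` and all the toy numbers are hypothetical.

```python
import numpy as np

def infer_nz(interim_posteriors, interim_prior, n_iter=500):
    """Maximum-likelihood histogram N(z) from gridded interim posteriors.

    interim_posteriors: shape (n_gal, n_bins); row i is p_i(z_k) under the interim prior.
    interim_prior: shape (n_bins,); the prior that was used to make those rows.
    Returns the bin fractions f_k, which sum to one.
    """
    p = np.asarray(interim_posteriors, dtype=float)
    prior = np.asarray(interim_prior, dtype=float)
    # Dividing out the interim prior leaves rows proportional to the likelihoods.
    like = p / prior
    f = np.full(p.shape[1], 1.0 / p.shape[1])  # flat starting guess
    for _ in range(n_iter):
        # E-step: per-galaxy bin responsibilities under the current N(z).
        resp = like * f
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: the new N(z) is the mean responsibility in each bin.
        f = resp.mean(axis=0)
    return f

# Toy demonstration; every number below is made up for illustration.
rng = np.random.default_rng(42)
true_nz = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
n_bins = true_nz.size
n_gal = 20000

# confusion[j, k] = P(observed bin j | true bin k): a crude photo-z noise model.
confusion = 0.8 * np.eye(n_bins) + 0.1 * np.eye(n_bins, k=1) + 0.1 * np.eye(n_bins, k=-1)
confusion /= confusion.sum(axis=0, keepdims=True)

z_true = rng.choice(n_bins, size=n_gal, p=true_nz)
z_obs = np.empty(n_gal, dtype=int)
for k in range(n_bins):
    members = z_true == k
    z_obs[members] = rng.choice(n_bins, size=members.sum(), p=confusion[:, k])

# The "survey" reports posteriors made with an interim prior that is wrong.
interim_prior = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
posteriors = confusion[z_obs, :] * interim_prior
posteriors /= posteriors.sum(axis=1, keepdims=True)

nz_hat = infer_nz(posteriors, interim_prior)   # prior divided out, then inferred
nz_stacked = posteriors.mean(axis=0)           # naive stacking, for comparison
```

In this toy, stacking the p(z)s inherits the shape of the interim prior, while dividing the prior out and inferring the histogram does not. Nothing here works if the interim prior is unknown, which is the whole point.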
Christina Eilers (MPIA) and I spent a long time today pair-coding her extension to The Cannon in which we marginalize over the true labels of the training data, under the assumption of small, known, Gaussian, label noise. Our job was to vastly speed up optimization by getting correct derivatives (the gradient) of the objective function (a likelihood function) with respect to the parameters, and inserting these into a proper optimizer. We built tests, did some experimental coding, and then fully succeeded! Eilers's Cannon is slower than other implementations, but more scientifically conservative. We showed by the end of the day that the model becomes a better fit to the data as the label variances are made realistic. Stars really do have simple spectra!
While we were working, Anna Y. Q. Ho and Sven Buder (MPIA) were discovering non-trivial covariances between stellar radial velocity (or possibly radial velocity mis-estimation) and alpha abundances, with Ho working in LAMOST data and Buder working in GALAH data. Both are using The Cannon. After some investigation, we think the issue is probably related to the collision of alpha-estimating spectral features and ISM and telluric features. We discussed methods for mitigation, which range from censoring data at one end to fully modeling velocity along with the model parameters at the other.
Late in the day, I finished my response to referee and submitted it.
At Galaxy Coffee, Ben Weiner (Arizona) gave a talk about his great project (with many collaborators) to study very faint satellites around Milky-Way-like galaxies using overwhelming force: They are taking spectra of everything within the projected virial radius! That's thousands of targets, among which (for a typical parent galaxy), they find a handful of satellites. The punchline is that the Milky Way appears to be typical in its number of satellites, though there is certainly a range.
I spoke with Glenn Van de Ven (MPIA) about the possibility that he could upgrade his state-of-the-art Schwarzschild modeling of external-galaxy integral field data to something that would do chemo-dynamics. I suggested ways that he could keep the problem convex, but use regularization to reduce model complexity. We discussed baby steps towards the goal.
I also wrote a title and abstract (paper scope) for Adrian Price-Whelan and started on the same for Andy Casey.
As always, MPIA Milky Way group meeting was a pleasure today, featuring short discussions led by Nicholas Martin (Strasbourg), Adrian Price-Whelan, and Andy Casey. Casey showed his developments on The Cannon and applications to new surveys. Price-Whelan spoke about our ability to see possible warps (coherent responses) in the Milky Way disk from interactions with satellites. Martin showed amazing color-magnitude diagrams of stars in Andromeda satellite galaxies. So. Much. Detail.
Chaos reigned around me. Jonathan Bird and Melissa Ness worked on the Disco concept. Anna Y. Q. Ho, working on a suggestion from Casey, found Li lines in (a rare subsample of) LAMOST giants, leading to a whole new insta-project on Li. Price-Whelan figured out multiple methods for initializing and running MCMC on our single-line binary stars, initializing from either the prior or from literature orbits. It looks like many (or maybe all) of the APOGEE variable-velocity stars have multiple qualitatively different but nonetheless plausible orbital solutions. Casey and I conceived of a totally new way to build The Cannon as a local model for every test-step object; a non-parametric Cannon if you wish? I spoke with Jeroen Bouwman (MPIA) about his (very promising) work using Dun Wang's Causal Pixel Model to fit the Spitzer data on transit spectroscopy for a hot Jupiter.
Anna Y. Q. Ho is in town to finish two—yes, two—papers on what can be learned about stellar properties from (relatively) low-resolution LAMOST spectroscopy. She has amazing results on ages and chemical abundances, which challenge long-held beliefs about what can be done at medium to low resolution. One of her two papers is about using C and N abundances to infer red-giant ages, as we did with APOGEE and The Cannon earlier. Ho and I met with Rix today to discuss error propagation from abundances to ages, and all the possible sources of scatter, including the unknown unknowns.
Adrian Price-Whelan started running our probabilistic inference of single-line spectroscopic binaries on the Troup et al. sample. We had to complexify our noise model, since clearly there are variations larger than the error bars. We also had to reparameterize our binary-star parameters to a better set. In this process, we wanted to go from a phase angle to a time and back. Going from time to phase angle is a numerically stable mod() operation. Going from phase angle back to time can naively involve adding and subtracting huge numbers. We re-cast the function so no large subtractions ever happen. That was not totally trivial!
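For the curious, here is one way such a conversion can be written; it is a sketch of the general idea and not necessarily the function we ended up with. The names `time_to_phase` and `phase_to_time`, and the choice to return the time nearest a reference epoch `t_ref`, are my assumptions for illustration.

```python
import math

TWO_PI = 2.0 * math.pi

def time_to_phase(t, period, t_zero=0.0):
    """Phase angle in [0, 2 pi) at time t: a single mod() operation."""
    return TWO_PI * (((t - t_zero) / period) % 1.0)

def phase_to_time(phase, period, t_ref, t_zero=0.0):
    """Time nearest to t_ref at which the orbit has the given phase.

    The only arithmetic done on t_ref itself is the mod() inside
    time_to_phase; after that we work with a phase difference wrapped
    into [-pi, pi), so the final step adds at most half a period to t_ref.
    """
    dphase = (phase - time_to_phase(t_ref, period, t_zero) + math.pi) % TWO_PI - math.pi
    return t_ref + period * dphase / TWO_PI

# Round trip with a large (Julian-date-sized) time; numbers are illustrative.
t = 2457000.25
period = 3.7
phase = time_to_phase(t, period)
t_back = phase_to_time(phase, period, t_ref=2457001.0)
```

The design choice is that the answer is built as a small offset from a nearby reference time, rather than by multiplying a huge cycle count by the period and subtracting.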
Late in the day, Melissa Ness and Jonathan Bird interviewed Price-Whelan about ideas potentially going into the nascent Disco proposal.
All hell broke loose in Heidelberg today, as Andy Casey got done with his meeting downtown, Jonathan Bird (Vanderbilt) showed up to work on the Disco proposal for the next big thing with the SDSS hardware, Ben Weiner (Arizona) showed up to talk science, and Anna Ho came in to finish her new set of papers about the LAMOST data. And even with these distractions, Price-Whelan and I “decided” (I use scare quotes because our decision was heavily influenced by Rix!) to work on the single-line binaries in the APOGEE data.
Price-Whelan and I joined up my celestial mechanics code from June with the simulated APOGEE single-visit velocities through a likelihood function and got MCMC sampling working. We showed that you can say significant things about binary stars even with only a few observations; you don't need full coverage of the orbit to make substantial statements. Though it sure helps if you want very specific orbital parameters! Tomorrow we will hit real data; we will have to put in a noise model and some outlier modeling (probably).
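The pieces being joined up here look roughly like the following sketch. It is not our actual code: the function names, the parameter ordering, the Newton's-method Kepler solver, and the optional `jitter` term (a simple stand-in for a more complex noise model) are all assumptions made for illustration.

```python
import numpy as np

def solve_kepler(mean_anomaly, ecc, n_iter=50):
    """Solve E - e sin(E) = M for the eccentric anomaly E by Newton's method."""
    M = np.mod(np.asarray(mean_anomaly, dtype=float), 2.0 * np.pi)
    E = np.full_like(M, np.pi)  # starting at pi keeps the iteration well behaved
    for _ in range(n_iter):
        E = E - (E - ecc * np.sin(E) - M) / (1.0 - ecc * np.cos(E))
    return E

def rv_model(t, period, K, ecc, omega, t_peri, v0):
    """Radial velocity of the primary in a Keplerian orbit at times t."""
    M = 2.0 * np.pi * (np.asarray(t, dtype=float) - t_peri) / period
    E = solve_kepler(M, ecc)
    # True anomaly from the eccentric anomaly.
    f = 2.0 * np.arctan2(np.sqrt(1.0 + ecc) * np.sin(0.5 * E),
                         np.sqrt(1.0 - ecc) * np.cos(0.5 * E))
    return v0 + K * (np.cos(omega + f) + ecc * np.cos(omega))

def ln_likelihood(params, t, rv, rv_err, jitter=0.0):
    """Gaussian log-likelihood; jitter is added in quadrature to the error bars."""
    model = rv_model(t, *params)
    var = np.asarray(rv_err, dtype=float) ** 2 + jitter ** 2
    return -0.5 * np.sum((rv - model) ** 2 / var + np.log(2.0 * np.pi * var))

# A handful of fake, noise-free epochs; all numbers are made up.
truth = (10.0, 3.0, 0.3, 0.5, 1.0, 2.0)  # period, K, ecc, omega, t_peri, v0
t_obs = np.linspace(0.0, 30.0, 8)
rv_obs = rv_model(t_obs, *truth)
rv_err = np.full_like(t_obs, 0.1)
lnl_truth = ln_likelihood(truth, t_obs, rv_obs, rv_err)
```

Add a log-prior to `ln_likelihood` and hand the sum to any MCMC sampler, and you have the skeleton of the inference.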
Bird and I discussed the high-level point of the Disco proposal: We need it to express, clearly, an idea (or set of ideas) that is worth many tens of millions of dollars. That's hard; the project is very valuable and will have huge impact per dollar, but crystallizing a complex project into one bullet point is never trivial.
I worked on the weekend to get my “Chemical tagging can work” paper ready for resubmission to the ApJ, incorporating referee and co-author comments, both of which made the paper much better. By Sunday it was good enough to send to the co-authors for final comments. In case it is some comfort to my loyal reader, it took me a full six months to get to this, which is embarrassing, but normal. And even then—when I sent it to the co-authors—it was missing a paragraph about the abundances in cluster M5. While Andy Casey and I were relaxing in a Heidelberg pub, he (Casey) wrote that final paragraph. I love my job!
Yesterday at Milky Way group meeting, Adrian Price-Whelan brought up the possibility that the halo might be made up of many disrupted globular clusters. Sarah Martell (UNSW) showed up today and said more along these lines, based on chemical arguments. That got me thinking about the birthday paradox: If you have 30 people in the room, it is more likely than not that two of them share the same birthday. The implication of this paradox for the Galaxy is the following:
Imagine that the Milky Way halo (or even better, bulge) is made up of 1000 disrupted stellar clusters that fell in. If we look at even 100-ish stars, we would expect to find pairs of stars with identical abundances, with very good confidence. And this confidence can be kept high even if there is a smooth background of stars that doesn't participate in the cluster origin, and even if there are multiple populations in the original clusters. Conversely, if we can show that no pairs of stars in such a sample are co-eval (chemically), we can rule out all of these hypotheses with far less data than we already have in hand. Awesome! I wrote code to check this, but am far from having a real-data test.
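The arithmetic behind this is short enough to show. The following is a sketch, not the code I wrote; it assumes equal-sized clusters, stars drawn at random, perfectly identifiable same-cluster pairs, and a smooth background whose stars never match anything. The function names and the `cluster_fraction` parameter are made up for illustration.

```python
import math

def prob_no_shared(n, k):
    """P(no two of n draws share a category), for k equally likely categories."""
    p = 1.0
    for j in range(n):
        p *= max(0.0, 1.0 - j / k)
    return p

def prob_shared_pair(n_stars, n_clusters, cluster_fraction=1.0):
    """P(at least one pair of stars comes from the same cluster).

    cluster_fraction is the fraction of stars that come from clusters at all;
    the rest belong to a smooth background and never form a matching pair.
    """
    p_none = 0.0
    for m in range(n_stars + 1):
        # Binomial probability that exactly m of the stars are cluster stars.
        weight = (math.comb(n_stars, m)
                  * cluster_fraction ** m
                  * (1.0 - cluster_fraction) ** (n_stars - m))
        p_none += weight * prob_no_shared(m, n_clusters)
    return 1.0 - p_none

p_birthday = prob_shared_pair(30, 365)            # the classic version
p_halo = prob_shared_pair(100, 1000)              # 100 stars, 1000 clusters
p_halo_half = prob_shared_pair(100, 1000, 0.5)    # half the stars are background
```

Under these assumptions the 30-person birthday probability comes out near 0.71, and 100 stars from 1000 clusters gives a matching pair with probability above 99 percent. Putting half the stars in a smooth background lowers that, but not to anything small.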
Andy Casey had the afternoon off from #FirstStarsV; that and the presence of Adrian Price-Whelan inspired me to suggest that we structure the afternoon like a hack day, in an undisclosed garden location in the Heidelberger Neuenheim. We were joined by Christina Eilers (MPIA), Melissa Ness (MPIA), Hans-Walter Rix, Branimir Sesar (MPIA), and Gail Zasowski (JHU). Various projects were pitched and executed. My own work was on my response-to-referee (boring I know!) and helping Eilers with coding up the objective function and derivatives for a version of The Cannon that permits the inclusion of stars with noisy and missing labels at training time. Casey worked on building giant-branch and main-sequence Cannon models and mixing them or switching between them. It appears to work amazingly well.
In the morning before that, MPIA Milky Way group meeting hosted a discussion by Price-Whelan of the possibility of understanding what original population of globular clusters was ground up and stripped into the present-day Milky-Way halo, and a discussion by Andy Casey of an amazingly low metallicity, amazingly rapidly moving star, that appears to have just fallen in from somewhere. These led to excited discussions, and, indeed, framed some of the projects performed at the above-mentioned hack day. For example, at the hack day, Price-Whelan made predictions for other stars that might be part of whatever cluster, group, or galaxy fell in with Casey's crazy star.
I got troubled this morning by the so-many-projects problem! In the subdomain of my life that is about modeling spectra of stars, and within that the subdomain that is thinking about APOGEE data, there are these, which I don't know how to prioritize!
- Fit for velocity widths and velocity offsets (redshifts) simultaneously with the star labels, to remove projections of velocity errors and line-spread-function (or microturbulence) variations onto parameters of interest.
- Fit stars as linear combinations of stars at different velocities to find the double-lined spectroscopic binaries. Combine this with Kepler data to get the full properties of eclipsing binaries. We have many examples, and I expect we will find many more! We might put Adrian Price-Whelan onto parts of this this week.
- Build (train) models for all parts of the H-R diagram, especially the subgiant and dwarf parts, where we have never produced good models. These are particularly important in the era of Gaia. We might convince Andy Casey to do some of this this week, and Sven Buder (MPIA) is also doing some of this in GALAH.
- Project residuals onto (theoretically determined) derivatives with respect to element abundances, to get or check element abundances. This might also be used to build an element-abundance measuring system that doesn't require a full training set of abundances that we believe. Yuan-Sen Ting (UCSC) is producing the relevant derivatives right now.
- Marginalize out noisy labels at training time, and marginalize out noisy internal parameters at test time. We have Christina Eilers (MPIA) on that one right now.
- Look at going fully probabilistic, where we get posteriors over all labels and all internal parameters. I owe Jonathan Weare (Chicago) elements for this.
- Include photometry into the training and test data to break the temperature–gravity degeneracies. And maybe also extinction! This is easy to do and ought to have a big impact.
- Include priors on stellar structure and evolution to prevent results from departing from physically reasonable solutions. This is anathema to the stellar spectroscopy world (or most of it), but much desired by the customers of stellar parameter pipelines!
- Add in latent variables to capture variations in stellar spectra not captured by the quadratic-on-labels model. Are the learned latent variables interpretable?
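For reference, the quadratic-on-labels model that all of these projects extend can be sketched in a few lines. This is a stripped-down illustration and not any released implementation of The Cannon: it leaves out the per-pixel intrinsic scatter and the label rescaling that a real implementation would include, and the function names are hypothetical.

```python
import numpy as np

def quadratic_design_matrix(labels):
    """Columns: a constant, each label, and every pairwise product of labels.

    labels: shape (n_star, n_label).
    """
    labels = np.asarray(labels, dtype=float)
    n_star, n_label = labels.shape
    cols = [np.ones(n_star)]
    cols += [labels[:, i] for i in range(n_label)]
    cols += [labels[:, i] * labels[:, j]
             for i in range(n_label) for j in range(i, n_label)]
    return np.stack(cols, axis=1)

def train(labels, fluxes, ivars):
    """Training step: independent weighted least squares at every pixel.

    fluxes, ivars: shape (n_star, n_pixel); ivars are inverse variances.
    Returns coefficients of shape (n_pixel, n_coeff).
    """
    A = quadratic_design_matrix(labels)
    coeffs = []
    for flux, ivar in zip(fluxes.T, ivars.T):
        ATA = A.T @ (A * ivar[:, None])
        ATy = A.T @ (ivar * flux)
        coeffs.append(np.linalg.solve(ATA, ATy))
    return np.array(coeffs)

def predict(labels, coeffs):
    """Model spectra, shape (n_star, n_pixel), for the given labels."""
    return quadratic_design_matrix(labels) @ coeffs.T
```

Most items in the list above amount to adding columns to this design matrix, changing the noise model, or changing what gets optimized or marginalized at training and test time.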
Rix and I discussed many data analysis problems in the morning today. We have been discussing the possibility of measuring the bolometric fluxes of stars with very little (possibly vanishing) dependence on spectral assumptions. (The idea is: If you have enough bands, the spectrum or SED is very strongly tied down.) If we can combine these with other kinds of measurements (of, say, effective temperature), we can make predictions for interferometry without (heavy use of) stellar models! One constant that comes up in these discussions is 4πGσ, which combines the gravitational constant with the Stefan-Boltzmann constant (have I mentioned that I hate it when constants are named after people?). Is this a new fundamental, astronomical constant?
We also discussed with Coryn Bailer-Jones (MPIA), Morgan Fouesneau (MPIA), and Rene Andre (MPIA) actually putting the bolometric-flux project into practice with real data. Fouesneau and Andre seem to have working code!
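The prediction for interferometry is just the Stefan-Boltzmann law rearranged: a bolometric flux and an effective temperature together fix an angular diameter, with no distance and no stellar model. Here is a minimal sketch; the function name and the SI units are my choices for illustration, and this is not the code Fouesneau and Andre have working.

```python
import math

SIGMA_SB = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def angular_diameter(f_bol, t_eff):
    """Angular diameter in radians.

    f_bol: bolometric flux received at the observer, in W m^-2.
    t_eff: effective temperature, in K.
    Follows from f_bol = sigma * t_eff**4 * (theta / 2)**2.
    """
    return 2.0 * math.sqrt(f_bol / (SIGMA_SB * t_eff ** 4))

# Sanity check on the Sun, using the solar constant and nominal T_eff.
theta_sun = angular_diameter(1361.0, 5772.0)
```

For the Sun this gives roughly 0.0093 radians (about 32 arcminutes), as it should.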
I also had the pleasure of reading and giving comments on some writing Bailer-Jones has been doing for a possible new textbook on computational data analysis. This is exciting!
I had a great conversation with Markus Pössel (MPIA) and Johannes Fröschle (MPIA) about Fröschle's work re-analyzing data on the expansion of the Universe. He is looking at when the expansion of the Universe was clearly discovered, and (subsequently) when the acceleration was clearly discovered. His approach is to reanalyze historical data sets with clear, simple hypotheses, and perform Bayesian evidence tests. He finds that even in 1924 the expansion of the Universe was clearly and firmly established, by such a large factor that in fact it was probably known much earlier.
This conversation got me thinking about a more general question, which is simple: Imagine you have measured a set of galaxy redshifts but know nothing about distances. How much data do you need to infer that the Universe is expanding? The two hypotheses are: (1) galaxies have random velocities but with a well-defined rest frame, with respect to which we are moving, and (2) the same, but with expansion added. You don't know any distances nor any expansion parameter. Go! I bet that once you have good sky coverage, you are done, even without any distance information at all.
In the afternoon, Melissa Ness and I worked on fiber-number and LSF issues in the APOGEE data. There are clear trends of abundance measurements with fiber number (presumably mainly because of the variation in spectrograph resolution). We worked on testing methods to remove them, which involve correcting the training set going into The Cannon and also giving The Cannon information to simultaneously fit the relevant (nuisance) trends.
At the end of the day, I gave my talk at MPIA on probabilistic graphical models.