Weekly Head Voices #148: Data stylist.

Ridiculously fun trail in Paarl somewhere. (Photo taken by Trail Friend #1. Trail Friend #2 cropped from the picture, because no permission to appear on the internets!)

This post covers the week from Monday July 9 to Sunday July 15.

The business part of my week was unfairly dominated by far too much after-work obsessing over programming languages, with which I seem to have an unhealthy (or perhaps not) fascination.

I will externalise some of these thoughts further down in this post.

I’m starting with a weekend / running update, which should be reasonably safe for non-nerds to read. However, after that, the nerd dial will go up to 11 with stuff about tools and programming languages right up to the end of the post.

I would have liked to use the adjective “face-melting”, but I’m not sure if any intensity of nerdery could ever reach that level.

We can dream.

Weekend running update

Most fortunately the weekend had other plans and supplied us with at least 2.5 parties, the first of which even culminated in a ridiculously fun trail run in the mountains on the winter morning after.

The winter morning sun was just perfect, the company was great, and I had forgotten all forms of performance tracking devices at home.

Readers with bionic eyes might notice the Lunas on my feet.

I have now run just over 260 km in them but, in a surprise twist for the regular readers of this blog, my biological equipment has still not completely adjusted to the new style of locomotion.

The latest victim seems to be one of Tom, Dick and Harry, the tendons running under the medial malleolus of my left foot, also known as that big knob on your inside ankle. Tom (the primary suspect in this case according to Trail Friend #1, who is knowledgeable in these matters, being a running foot surgeon and all), Dick and Harry are better known as the *T*ibialis posterior, the flexor *D*igitorum longus and the flexor *H*allucis longus.

They currently have to work extra hard to stabilise my feet while running, because, you know, no shoes.

Because doing this thing was not hard enough already, and because the Lunas are perhaps still a bit too cushiony, and because my friend the Very Flat Cat forgot that I’m very suggestible after 11:00 in the morning when my prefrontal cortex takes the rest of the day off, I am now also the very shy owner of a pair of Xero Genesis running sandals:

(Image: Xero Genesis running sandals.)

The soles are only 5mm thick, and quite hard, being rated for a few thousand miles and all. The upshot of this is that one’s feet have to work even harder than in the Lunas.

My first run in these was amazing: I could feel my feet reacting to every little pebble, and my running style having to adapt even more to the terrain.

However, there was a price to pay for all of that additional terrain feel (and the fact that I took a much longer maiden run than I should have): The next day, the tendons in my feet felt even more (ab)used than usual.

WITH GREAT POWER COMES GREAT RESPONSIBILITY, it seems.

Due to these shoes being so powerful, I have had to resign myself to introducing Xero running far more gradually than I had initially thought.

Vacation-based-thinking-driven tool sharpening aka The WHV 2018 Data Science Toolbox(tm).

During the previously blogged-about Mpumalanga vacation, the lack of alarms, devices and other work accoutrements resulted in ample time for staring-into-space-grade thinking sessions.

During one of these thinking sessions, I realised that I had somehow neglected my data science toolbox for a while.

At some point a few years back, I was so into ipython notebooks (what has now become jupyter) that I used them as my main work lab notes modality.

However, in the meantime I had fallen slightly out of love with the computational notebook style of data programming, because I had begun to develop doubts about its role in the analysis pipeline.

interlude 1: jupyter notebooks are nice for initial data exploration, and they’re especially useful for remote computation with embedded graphics. However, that initial momentum of discovery risks devolving into an unwieldy monolith of code snippets, data transformations and experiments. There’s a fine line to be walked between flexible experimentation on the one hand, and version-controlled, time-stamped, permutational and scientific rigour on the other.

interlude 2: I have to apologise for using the term “data science” in a non-comedic context. In spite of the inherent humour, it has turned into a usable blanket term for computational data understanding.

Due to my growing doubts in the order of Jupyter, and due to being occupied with less traditionally data sciencey work projects, I had unfortunately let my data science toolbox gather perhaps a bit too much dust.

Slightly more worrying than falling out of love with Jupyter Notebooks (I still like them, I’m just not that madly in love anymore) was the more specific issue that I’d even let the datavis parts get a bit dusty.

Anyways.

Although I should probably write a more complete post about this, here is the list of ingredients of the official 2018 WHV Data Science Toolbox(tm):

Programming language and library ecosystem: Python.

This language, in spite of its shortcomings, dominates the data science / machine learning world thanks to its STELLAR ecosystem.

numpy, pandas, scipy, scikit-*, tensorflow, pytorch, keras, cython… this snowball has turned into a pretty sizeable planet.

For this reason, it would be hard to justify any other choice for data science.

However, since I’ve been seeing more of Lisp and the rest of the ever-expanding programming language landscape, I can see (Python’s shortcomings as a programming language) clearly now.

In terms of interactive programming, Python beats the majority of practical programming languages, with Common Lisp being one notable exception.

However, it’s not functional enough, which engenders unnecessarily imperative, side-effecting code. More specifically, it’s not expression-oriented.
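
To make “expression-oriented” concrete, here’s a toy snippet, invented purely for illustration:

```python
score = 0.7

# In Python, if/else is a statement: the block itself produces no value,
# so the result has to escape via assignment to a variable.
if score >= 0.5:
    label = "positive"
else:
    label = "negative"

# The conditional expression helps for the simple cases...
label = "positive" if score >= 0.5 else "negative"

# ...but multi-statement blocks, try/except, with and friends never
# evaluate to a value, whereas in an expression-oriented language the
# whole if/else block would itself be the value.
print(label)
```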

More about this slightly further down. Maybe.

Datavis: Anything, as long as it’s Vega or Vega-Lite.

I spent a few years of my life wrangling d3.js, down to INNARD-LEVEL.

Mike Bostock’s idea of data-element-joins is genius, and internalising it was intellectually satisfying.

I thought that these d3 skillz would serve me well for decades (that’s WEEKS in javascript-time), but it turns out that there’s a new, even smarter kid in town.

(if it’s any consolation, the new kid can be considered the grandchild of d3.js.)

vega and vega-lite are so-called visualization grammars, or visualization DSLs (domain specific languages).

The upshot is that one codes up a chart, or a whole set of linked charts and their interactive behaviour, using a language that was designed for this purpose.

This chart code can be easily shared, or converted into interactive visual representations that can be embedded in applications, online or in print-quality documents.

Genius!

With Altair, you can even send your pandas dataframes to vega and vega-lite charts all from the comfort of your slightly defective Python armchair.
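
To show what that looks like in practice, here is a minimal sketch of the pandas-to-Vega-Lite flow; the data and column names are invented for illustration:

```python
import altair as alt
import pandas as pd

# a tiny, fabricated dataframe standing in for real analysis output
df = pd.DataFrame({
    "distance_km": [5, 10, 15, 21],
    "pace_min_per_km": [5.2, 5.5, 5.9, 6.3],
})

# the chart is declared grammar-style: data -> marks -> encodings
chart = (
    alt.Chart(df)
    .mark_line(point=True)
    .encode(x="distance_km", y="pace_min_per_km")
)

# chart.to_json() yields the shareable Vega-Lite spec itself;
# chart.save(...) writes an embeddable HTML rendering of it.
chart.save("pace.html")
```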

Development Environment: PyCharm.

You knew it was not going to be Jupyter Notebooks, but you probably expected it to be Emacs.

Well it’s not. Surprise!

The remote interpreter support in PyCharm enables me to connect to a Python virtual environment anywhere on the planet, which I often do.

The JetBrains wizards have optimised the remote communication of code intelligence, so completion, documentation and general code understanding are almost indistinguishable from those on a completely local project.

Being able to step through a remote PyTorch neural-network training iteration, or any other remote Python algorithmics, with the PyCharm debugger is insightful.
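
As an example of the sort of thing I mean, here is a minimal, self-contained training iteration one could step through; the model, data and hyperparameters are toy placeholders, not from any real project:

```python
import torch

# toy model, optimiser and loss, purely for illustration
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# one fake mini-batch of 32 samples with 10 features each
x = torch.randn(32, 10)
y = torch.randn(32, 1)

# a breakpoint on any of these lines lets you inspect every tensor
# flowing through the step, even when the interpreter is remote
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```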

Two notable drawbacks are visualization and long-running jobs.

For long-running jobs I do tend to use Jupyter Notebooks or, when at all possible, mosh, which is amazing. However, because the primary modality is not the notebook, my code is versioned and organised into separate libraries which I can call into from the notebook or from mosh.

For visualization, it’s either connecting to the Altair chart server via an SSH pipe, dumping the chart to the unison-synced project, or using a Jupyter Notebook.

The rest.

Of course you use Postgres on an SSD for your data, and of course you know enough SQL to make short work of most of the heavy-weight transformations often required at the start of your data-crunching pipeline.
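
As a sketch of what I mean (with hypothetical table and column names): let Postgres grind through the raw rows, so that pandas only ever sees the small aggregate.

```python
import pandas as pd
from sqlalchemy import create_engine

# connection details are placeholders
engine = create_engine("postgresql://user:password@localhost/mydb")

# Postgres does the heavy group-by over the millions of raw rows...
query = """
    SELECT sensor_id,
           date_trunc('day', recorded_at) AS day,
           avg(value)                     AS mean_value
    FROM measurements
    GROUP BY sensor_id, day
    ORDER BY sensor_id, day
"""

# ...and pandas receives only the compact per-day aggregate.
df = pd.read_sql(query, engine)
```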

For all of my lab notes, reports, books, papers and blog posts, I use Emacs Org mode.

LaTeX math with live preview, live code snippets, SVG graphics, bibtex references, export to anything. This is one of the best ways to document your science.

Programming language addiction update.

I spend far too much time obsessing over programming languages, old and new.

For the past two weeks, I wasted even more precious time than usual reading up about programming languages.

Because I would really like to spend more of my time on other, perhaps more valuable activities, I’ve been trying to better define what it is I’m actually looking for.

Of course there is no single best programming language, but a whole set of good languages that map in intricate ways to different problem domains.

In spite of this, I have been pining for a language with, in order of importance:

  1. A Functional Programming DNA, by which I’m referring to a) expression-orientedness, b) a preference for pure functions, and, at a higher level, c) the modelling of reality as more or less explicit dataflows.
  2. Interactive programming, with Common Lisp being the textbook example of this.
  3. Great tooling and IDEs, meaning first-class support by something from JetBrains, Microsoft or Emacs.
  4. Great concurrency and parallelism stories.
  5. A great library ecosystem.
  6. Modest memory use.

Having just explicitly written this down for the first time (!! – it was consuming so much glucose just being kept amorphously swirling around in my brain), I can now mentally map some of my most recent language dalliances to these points.

go

This language is far too simple for my taste, but probably really great for teams.

I did recently take a more serious look when setting up a telegram bot using tbot, and was amazed at how simple it was to build web services like this using goroutines and channels.

Go satisfies points 3 to 6 from the list above. It makes sense, then, that I decided to file this experiment away under “check when you need to put a webservice together REALLY QUICKLY”.

rust

When I read that rust, surprisingly, is an expression-oriented language, I flew through the O’Reilly Programming Rust book I had bought previously as part of a bundle.

Evaluating rust by the list above, we award it a fractional 1 (it is expression-oriented, but not much more), a 3 (JetBrains plugin, amongst others), a 4(ish) (great memory safety, but compared to clojure its concurrency and parallelism stories still have much room to grow), a solid 5 thanks to cargo, and a very strong 6.

I filed this one away under “re-evaluate whenever you reach for your trusty C++”. (also, actix-web looks amazing for super high performance microservices.)

f#

You didn’t see this one coming, did you?

Very strong 1 to 5 and a solid 6.

WAT?!

I’m currently working my way through Domain Modeling Made Functional by Scott Wlaschin, who is also the author of the brilliant f# for fun and profit website.

In addition to f# hitting all 6 of my 2018 PL-requirements above, I’m slowly starting to see the advantages of having a real type system under the hood.

f# is a member of the ML-family of functional languages, which have their origin in Lisp (some very naughty person removed all of the lovely parentheses I’m afraid…).

I hope that at some point I’ll have the opportunity to use f# in anger, at which point I’ll be able to report more concretely as to its suitability.

The End

Let me know in the comments what you think about any of this, or anything else.

I hope to meet you again in a few days, here or elsewhere.

Weekly Head Voices #117: Dissimilar.

The week of Monday February 13 to Sunday February 19, 2017 might have appeared to be really pretty boring to any inter-dimensional and also more mundane onlookers.

(I mention both groups, because I’m almost sure I would have detected the second group watching, whereas the first group, being interdimensional, would probably have been able to escape detection. As far as I know, nobody watched.)

I just went through my orgmode journals. They are filled with a mix of notes on the following mostly very nerdy and quite boring topics.

Warning: If you’re not an emacs, python or machine learning nerd, there is a high probability that you might not enjoy this post. Please feel free to skip to the pretty mountain at the end!

Advanced Emacs configuration

I finally migrated my whole configuration away from Emacs Prelude.

Prelude is a fantastic Emacs “distribution” (it’s a simple git clone away!) that truly upgrades one’s Emacs experience in terms of look and feel, and functionality. It played a central role in my return to the Emacs fold after a decade-long hiatus spent with JED, VIM (there was more really weird stuff going on during that time…) and Sublime.

However, constructing one’s own Emacs configuration from scratch is a sort of rite of passage, and my time had come.

In parallel with Day Job, I extricated Prelude from my configuration, and filled up the gaps it left with my own constructs. There is something quite addictive about using emacs-lisp to weave together whatever you need in your computing environment.

To celebrate, I decided that it was also time to move my todo system away from todoist (a really great ecosystem) and into Emacs orgmode.

From this… (beautiful multi-platform graphical app)
… to this!! (YOUR LIFE IN PLAINTEXT. DUN DUN DUUUUUN!)

I had sort of settled with todoist for the past few years. However, my yearly subscription is about to end on March 5, and I’ve realised that with the above-mentioned Emacs-lisp weaving and orgmode, there is almost unlimited flexibility in managing my todo list as well.

Anyways, I have it set up so that tasks are extracted right from their context in various orgfiles, including my current monthly journal, and shown in a special view. I can add arbitrary metadata, such as attachments and just plain text, and more esoteric tidbits such as live queries into my email database.

The advantage of having the bulk of the tasks in my monthly journal is that I am forced to review all of the remaining tasks at the end of the month before transferring them to the new month’s journal.

We’ll see how this goes!

Jupyter Notebook usage

Due to an interesting machine learning project at work, I had a great excuse to spend some quality time with the Jupyter Notebook (formerly known as IPython Notebook) and the scipy family of packages.

Because Far Too Much Muscle-Memory, I tried interfacing to my notebook server using Emacs IPython Notebook (EIN), which looked like this:

However, the initial exhilaration quickly fizzled out as EIN exhibits some flakiness (primarily broken indentation in cells, which makes it hard to interact with), and I had no time to try to fix or work around this, because day job deadlines. (When I have a little more time, I will have to get back to the EIN! Apparently they were planning to call this new fork Zwei. Now that would have been awesome.)

So it was back to the Jupyter Notebook. This time I made an effort to learn all of the new hotkeys. (Things have gone modal since I last used this intensively.)

The Notebook is an awe-inspiringly useful tool.

However, the cell-based execution model definitely has its drawbacks. I often wish to re-execute a single line or a few lines after changing something. With the notebook, I have to split the cell at least once to do this, resulting in multiple cells that I now have to manage.

In certain other languages, which I cannot mention anymore because I have utterly exhausted my monthly quota, you can easily re-execute any sub-expression interactively, which makes for a more effective interactive coding experience.

The notebook is a good and practical way to document one’s analytical path. However, I sometimes wonder if there are less linear (graph-oriented?) ways of representing the often branching routes one follows during an analysis session.

Dissimilarity representation

Some years ago, I attended a fantastic talk by Prof. Robert P.W. Duin about the history and future of pattern recognition.

In this talk, he introduced the idea of dissimilarity representation.

In much of pattern recognition, it was pretty much the norm that you had to reduce your training samples (and later, unseen samples) to feature vectors. The core idea of building a classifier is constructing hypersurfaces that divide the high-dimensional feature space into classes. An unseen sample can then be positioned in feature space, and its class simply determined by checking on which side of the hypersurface(s) it finds itself.

However, for many types of (heterogeneous) data, determining these feature vectors can be prohibitively difficult.

With the dissimilarity representation, one only has to determine a suitable function that can be used to calculate the dissimilarity between any two samples in the population. Especially for heterogeneous data, or data such as geometric shapes, this is a much more tractable exercise.

More importantly, it’s often easier to talk with domain experts about similarity than about feature spaces.

Due to the machine learning project mentioned above, I had to work with categorical data that will probably later also prove to be of heterogeneous modality. This was of course the best (THE BEST) excuse to get out the old dissimilarity toolbox (in my case, that’s SciPy and friends), and to read a bunch of dissimilarity papers that were still on my list.
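
To make the approach concrete, here is a toy sketch using random data and a deliberately simplistic dissimilarity function (invented for illustration, and not taken from any of those papers):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import KNeighborsClassifier

# fabricated data standing in for real (possibly heterogeneous) samples
rng = np.random.default_rng(42)
X_train = rng.random((100, 5))
y_train = rng.integers(0, 2, 100)
X_test = rng.random((20, 5))

def dissimilarity(a, b):
    # any symmetric "how different are these two samples?" function will
    # do, which is exactly what makes this workable for heterogeneous data
    return float(np.abs(a - b).mean())

# represent every sample purely by its dissimilarities to the training set
D_train = cdist(X_train, X_train, metric=dissimilarity)
D_test = cdist(X_test, X_train, metric=dissimilarity)

# a k-NN classifier can consume the precomputed dissimilarity matrices
clf = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
clf.fit(D_train, y_train)
print(clf.predict(D_test))
```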

Besides the fact that much fun was had by all (me), I am cautiously optimistic, based on first experiments, that this approach might be a good one. I was especially impressed by how much I could put together in a relatively short time with the SciPy ecosystem.

Machine learning peeps in the audience, what is your experience with the dissimilarity representation?

A mountain at the end

By the end of a week filled with nerdery, it was high time to walk up a mountain(ish), and so I did, in the sun and the wind, up a piece of the Kogelberg in Betty’s Bay.

At the top, I made you this panorama of the view:

Click for the 7738 x 2067 full resolution panorama!

At that point, the wind was doing its best to blow me off the mountain, which served as a visceral reminder of my mortality, and thus also kept the big M (for mindfulness) dial turned up to 11.

I was really only planning to go up and down in brisk hike mode due to a whiny knee, but I could not help turning parts of the up and the largest part of the down into an exhilarating lope.

When I grow up, I’m going to be a trail runner.

Have fun (nerdy) kids, I hope to see you soon!