A collection of frivolities...

say your name ⦿

Baltimore is not that large of a town. Occasionally, the following situation happens to me. You, a friend or acquaintance of mine, will see me on the sidewalk in a random part of town as you are driving by, and will recognize me. In the excitement of the moment, you will yell out my name. “Hey, Matt!” or “Matt Post!” or “Matt!” By the time I can register what’s happened and look up, you have disappeared around the corner or down the road. Unless I am able to recognize your voice or process what’s happening quickly enough to see your face or car, I’m left wondering who you are. Almost no information was exchanged and, unless you think to bring it up in the near future, before we have forgotten the event, I will never know that it was you who passed me by.

I would like to suggest that, in such situations, instead of yelling my name, you yell your own name. That way, we both have the pleasure of the recognition.1


  1. A tip o’ the hat to Connor C., a friend of mine who executed on this a month or so after I shared this idea with a small group of friends, some time back. 

running the Olmsted parkways ⦿

The other day I decided to run a loop from my house, following the Olmsted Parkways. These are wide grassy medians lined with trees, originally designed by the Olmsted brothers to be pedestrian connectors between city parks. The Olmsteds were a pioneering family of landscape architects: the father designed Central Park, and his sons’ firm planned many of the parks and neighborhoods in Baltimore.

Route for the paths with picture locations, courtesy of Strava

It was really quite lovely. The trees in many of these places are big enough that you can almost feel you’re not surrounded by cars.

Picture of lovely path in road's median (The Alameda)
Picture of lovely path in road's median (33rd St)

This vision is broken any time you have to cross an intersection, of course. This is especially tricky at the junction of 33rd and University Parkway. My strategy is to follow the light for my traffic direction, and to take special care to maximize my visibility, so that drivers turning left across the intersection are able to see me during those brief snatches of time when they glance up from their phones. I found that I could navigate this in a way where I felt reasonably safe. However, I think this viewpoint was particular to my status as a bold male jogger strengthened by a healthy daily feeding of anger and resentment against Baltimore traffic culture. If I’d been with my kids, I wouldn’t have felt the same way.

Of course, doing this with kids wouldn’t really be possible; the parkways are not paved and as such are not suitable for bike traffic. There are also frequent situations where one has to navigate around traffic infrastructure that was clearly placed in defiance of the Olmsted vision. Here is one egregious example.

A traffic post in the center of a path

(At the same time, the curb cutout provides a hint to the intended use. Baltimore’s creative use of pedestrian sidewalks as a home for utility works is pretty common, unfortunately. Many sidewalks cannot even be navigated because of a fire hydrant set squat in their middle.)

JHU makes a minor nod to the paths, having actually realized the walkway for one short block. But others, especially on Charles, are undeveloped.

Olmsted parkway near JHU
Undeveloped median on Charles

This is all we have after years of Hopkins-directed renovations on both St. Paul and Charles. In addition to ignoring this rich vision that was handed to us, there are no protected bike lanes until you get down beneath Wyman Park Dell, where Maryland Ave and its semi-protected lane starts.

I’m grateful for these parkways, including their well-maintained treelines. But as with much of this city, the implementation suffers from a clear lack of consideration for getting anywhere other than in a car. It’s especially sad here, though, because the vision exists, and the groundwork has already been laid. And the vision isn’t limited to a writeup from some Baltimore non-profit, sitting unread at City Hall. You can actually see it, if you care to look; and you can even almost use it, if you’re willing to put up with the city’s impediments.

The Batman movie I want to see ⦿

So there’s a new Batman movie coming out. I watched the trailer, and it looks fine. I’m attached to The Dark Knight series and am not too eager to have it supplanted, but whatever, I’ll probably watch it. They’re clearly after my demographic with the plaintive Nirvana soundtrack. What I like about Batman is that there’s no pretense of higher principle, just atavism; no truth, justice, and the American way, but Odyssean score-settling meted out quickly and violently against violators of the moral law, whose badness is clear and unambiguous and therefore licenses punishment. It’s an escape; justice is never this simple; we are all to some extent liars and thieves, and rooting out the truth of any matter requires a complicated system which still only badly approximates it.

BUT however implausible instant justice may be, in the spirit of the fantasy, here is a sketch of a few scenes from The Batman movie I’d like to see.

  1. It is raining in Gotham City. A pedestrian waits to cross a street. A car approaches from far off at a high speed, and the pedestrian decides to wait for it to pass. At the last minute, the car turns right into the cross street, such that it never crosses the pedestrian’s intended path. No blinker. The pedestrian could have safely crossed the entire time. Suddenly, there is a barely perceptible flash, followed by a large crash. It is the Batmobile. It hits the rear of the car at 100 MPH, hurling it into the side of a deserted building, where it explodes.

  2. A family pulls up to a red light at an intersection. A moment later, a flashy low-riding car pulls up next to them, blasting the owner’s terrible taste in music, drowning out all other sound, including the family’s happy conversation. The owner glares at them through his open window, and flicks his cigarette into the street. He reaches for a knob on his radio to turn up the volume. The family’s car starts rumbling—is it from the bass of the car’s stereo? No, it is fallout from The Batman, who has appeared seemingly out of nowhere, landing on the car’s rear trunk with enough force to instantly blow out all tires and flip the car vertically. He jumps off, and the car completes its flip, landing upside down. The owner is trapped beneath the car. Gas is dripping everywhere, slowly trickling towards the still-glowing cigarette.

  3. It is a holiday in a city neighborhood. Citizens tend to their yards and prepare food for the evening festivities, taking advantage of nice weather and a well-deserved break from their productive daily lives. Every eight minutes, however, a deafening noise from the city’s panopticon sky service blasts from overhead, forcing everyone each time to stop everything they’re doing and wait for it to pass. A barely discernible black shape streaks across the sky. Is it a bird? Is it a plane? Yes, it is the Batplane, or rather, a missile from the Batplane. Gotham citizens have their peace disturbed once again as the panopticon plane explodes and debris falls from the sky. But it is the last time.

  4. It is another peaceful day. Citizens chat in their yards, enjoying the new sustained peace in the absence of the surveillance machine in the sky, but are forced to pause as the noise from a rapidly accelerating car cascades across the neighborhood. The owner of the car, perhaps perceiving himself to be very important, has intentionally made his car very loud, including removing the muffler. Suddenly the noise is cut short. The Batman has dropped onto the hood of the car, eviscerated the still-running engine from its chassis, and hurled it through the auto’s front windshield.

  5. A mother and her child ride down the road in one of Gotham’s few, poorly maintained bike lanes. A friendly driver maintains a safe speed and distance from them, to the fury of the car behind it, which whips around and illegally passes at 45 mph in this two-lane residential area, before slamming on the brakes at the red light 50 feet up the road. The driver looks up from his phone to see The Batman materialize next to him, but has no time to think before he is yanked from his seat and tossed out onto the road into the busy cross traffic. The camera cuts away. There is an ambiguous thump.

Many questions remain; for example, what is the plot? And who should direct it? There are many capable people, but my vote is George Miller, chiefly for his jarring, creeptastic work on Babe: Pig in the City.

An appreciation of the Bonne Maman jar ⦿

Is there any finer jar than the one used by Bonne Maman jam? Consider the features of a traditional condiment jar:

  • A narrow mouth with a steep lip1 helps inhibit efficient extraction of the jar’s contents2
  • Opaque sides hide the jar’s contents
  • Construction from plastic imbues the condiment with a faint chemical aroma

A Bonne Maman jar has none of these flaws. It has a wide mouth and no lip, meaning every bit of excellent jam can be extracted with ease. One wash in the dishwasher neatly removes the label. It is well-proportioned and made from glass, adding heft that lowers the center of gravity and inhibits tipping when the jar is repurposed after the jam is gone. Finally, its colorful lid provides a watertight seal and marks you as a person with fine taste; the brand’s name has a nice bit of inflection, providing pleasure to the customer hoping to exude an air of worldly erudition by correctly pronouncing the name.

Every time I am scooping out mayonnaise from a tall, plastic, narrow-mouthed tub, I think, as I wipe condiment from high up on the handle of the stupid utensil, the world does not need to be this way. But for some reason it is.


  1. Why do any jars have lips? What purpose do they serve, other than to hinder removal of their contents? Is there some unquestioned tradition from the early days of container manufacturing? Some trick of mass production that requires it? Are lids super expensive? It is remarkable that Bonne Maman alone seems to have escaped this design flaw. 

  2. It is worth pointing out that a spoon is a much more effective tool than a knife for both removing and spreading nearly any condiment. 

On English ⦿

A testament to the inscrutable wonder that is English orthography, enabler of spelling bees and crossword puzzles. (This really needs an IPA gloss).

On English

I take it you already know,
Of tough and bough and cough and dough.
Others may stumble, but not you,
On hiccough, thorough, laugh and through.
Well done! And now you wish, perhaps,
To learn of less familiar traps.

Beware of heard, a dreadful word,
That looks like beard and sounds like bird.
And dead—it’s said like bed, not bead,
For goodness’ sake, don’t call it ‘deed’!
Watch out for meat and great and threat,
(They rhyme with suite and straight and debt).

A moth is not a moth in mother,
Nor both in bother, broth in brother.
And here is not a match for there,
Nor dear and fear for bear and pear.
And then there’s dose and rose and lose –
Just look them up—and goose and choose.
And cork and work and card and ward,
And font and front and word and sword.
And do and go and thwart and cart –
Come, come, I’ve hardly made a start!

A dreadful language? Why man alive!
I’d mastered it when I was five.
And yet to write it, the more I tried,
I hadn’t learned at fifty-five.
— T.S. Watt, 1954 (attributed)

That linguistics meme ⦿

This Linguistics Opinions Meme was making its way around Twitter. I’m not a linguist but I really like this meme.

Linguistics meme image

1. Brilliant, amazing, future-looking, a paean to sweating the details and taking the time to get them right. Surely it carries weight that this 40+ year-old software is still the standard for academic publishing, despite its imperfections.

2. Above my pay grade, but it seems like the law of the excluded middle is relevant here.

3. Never heard of it, but speaking of department stores, unless it’s KaDeWe, I’d rather visit smaller, sidewalk-accessible stores than a monolithic windowless one.

4. Among the better variable names, great when working with synchronous grammar formalisms, pairs nicely with $u$.

5. I’ve only been there once, on a panel in 2011, and felt like I was in a different world.

6. Twitter is pretty much the furthest thing from the real world, but I suppose it could be useful in some instances.

7. A great tool for producing a quick generic waveform when you need it for a slide on ASR.

8. One shouldn’t complain about public misperceptions of one’s field, and anyway shouldn’t linguists speak lots of languages? I’d favor a foreign language requirement for all doctoral degrees (especially for engineers).

9. Great movie, I’ve loved Amy Adams since Junebug. The visual language is highly inventive and could be a good wedge for helping laypeople understand sign language as a real language.

10. No idea what this is and not going to look it up.

11. If I made up a language it’d have French prosody, the Devanagari writing system, phonetic spelling, German word order, English inflectional morphology, and no tones.

12. Passing on this one.

13. Silly, often encumbered by politics.

14. Even better than DP is DP for DP.

15. It seems like he should pay back a lot of money to whatever missionary organization sent him to Brazil.

16. Deplatforming is a mob tool; bad ideas should be countered on the merits with better ones.

17. As an elective, where local interests match teaching ability, why not? Maybe best offered through community colleges instead of high schools.

18. I think popular science books are a service to the general public.

19. Does I can haz cheezburger count? That was a decent meme.

20. Why would you spend all the time it takes to learn a conlang instead of a natural one? I do not see the appeal.

21. Not looking this one up.

22. I endorse any beard.

23. I’m not on FB.

memento mori ⦿

I don’t have it in me to write a general reflection on mortality, at least not at the moment. But death is of course a reality, perhaps the reality, albeit often a distant one for many of us on many days. And realities are best acknowledged and faced. Another reality is that things that are remote are difficult to acknowledge and face, in part because it is easy to forget them. For such matters, periodic, perhaps even daily, reminders are appropriate. It is for this reason that I wrote a program that generates a chart of all the weeks of your life, one year per row, at A4 dimensions, with an image optionally rasterized into the table. If you like, you can use it to generate one for yourself. Hang it somewhere prominent and reflect.

Image of the weeks of a 90 year life arranged in a grid with a skull background
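
For the curious, here is a minimal sketch of the idea, not the original program: it draws one empty cell per week, one row per year, on an A4-sized canvas (the optional image rasterization is left out).

    import matplotlib.pyplot as plt

    YEARS, WEEKS = 90, 52
    fig, ax = plt.subplots(figsize=(8.27, 11.69))   # A4 portrait, in inches
    for year in range(YEARS):
        for week in range(WEEKS):
            # one small box per week of a 90-year life
            ax.add_patch(plt.Rectangle((week, YEARS - 1 - year), 0.9, 0.9,
                                       fill=False, linewidth=0.3))
    ax.set_xlim(0, WEEKS)
    ax.set_ylim(0, YEARS)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig("memento_mori.pdf", bbox_inches="tight")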

List of regrets ⦿

We all have regrets. But do we all make lists of them and publish them on the web? Probably not, for good reason. But to the extent that these experiences bind us in our humanity, maybe we should. Here are some of mine.


I was swimming in a roped-off region in a lake when I saw a small duckling that had been separated from its family. I swam after it, thinking I’d easily catch it, but it paddled away surprisingly fast out of the swimming region and towards the large open lake. Wanting to help, I kept up the chase for a while, but I only drove it further out into the lake, and I eventually gave up, partially out of uncertainty about my goal. Later that night when walking towards the parking lot, I came across a mother duck with its ducklings and was sure this was the intended home. It brought to mind that terrible scene in Planet Earth (I think) where photographers in a helicopter watch a baby elephant, somehow separated from the herd, run helplessly off into the desert in the wrong direction. The narrator sadly explains their heartless but rational non-intervention policy. I should have left it alone.

In fourth grade, at my small school in my small, rural town in Michigan, a new kid came to class. He was from Indonesia and had been adopted. We made fun of him because he smelled bad. All I remember is him running away on the playground and probably crying.

Decades ago I was at my parents’ house in Wisconsin and was out in the yard. A young neighbor kid from a rough family living in a house nearby was hanging around, asking questions about what I was doing and so on. At some point I had to go in to dinner so I got rid of him. I wish I’d invited him in to eat with us.

My wife and I were out on a date and found ourselves in a karaoke dive bar. It was early and there were just a few takers. I had this very strong urge to participate and sing a Pearl Jam song, but I chickened out.

We had started a dinner group with some close friends of ours that met quarterly or so, rotating among our houses. I missed one of these dinners because I was traveling for work. A while later, one of those friends, who had hosted the dinner I missed, died in a tragic accident. I miss him all the time. It wasn’t the last time I saw him, but I wish I had been at that dinner.

New neighbors moved in next door and were busy painting and getting their house ready. We’d chatted in the yard and hit it off a bit. One night we had family in town and I was headed out to pick up dinner. It was late and I saw them painting through the front window. They’d been working all day. I thought only too late that I could have offered to pick something up for them. It would have been easy and I know how much I would have loved having someone think of me that way.

I came across this piece of artwork at a flea market and texted a picture of it to my wife. But I did not buy it.

In my sophomore year of college I was the dorm president. The next year, I moved to college apartments, and had this great idea to make a sign next to the best parking spot reading “Reserved for President of [DORM NAME]”. I would have done it in the proper style so as to look official, and seated it in poured concrete. But I didn’t.

One of the main tasks of the dorm president was to organize the spring banquet. We didn’t plan for audio, and didn’t realize it till the last minute. One guy on one of the dorm floors had a nice speaker set, but we couldn’t get ahold of him. I made an executive decision to borrow them on the assumption he would have said yes. It turned out okay but was the wrong call.

I decided to attend a conference where I had a paper but was not presenting, and missed my cousin’s wedding. It was a big family gathering and a sort of goodbye to my grandparents’ place, since they were moving into a retirement home. So my wife and kids never saw a place that I had visited a lot as a child and which meant a lot to me.

I repeatedly put off getting an oil change on a dying old car. It eventually caught up with me: the engine seized while someone else was driving it, totaling the car. (I can barely bring myself to acknowledge this but feel it is a measure of penance.)

This Be The Verse ⦿

Ages ago the movie 49 Up was released in theaters and, having no children and abundant free time, my wife and I watched it at Rochester’s excellent little independent local theatre. (Amazing, from my current perspective, is that we were actual members of this place. Once we watched a movie, and then decided to stay for another one, just because we wanted to.) The film is a fascinating case study on the role of class in determining fate. Too much time has passed for me to feel comfortable commenting on its content (and I have not seen the two followups in the intervening years), but one thing really stuck with me from the movie. The introduction to the movie presents the following Jesuit proverb as motivation for the documentary:

Give me a child until the age of seven, and I will give you the man.

I have been haunted by that thought ever since, especially this year, when my youngest child will turn eight. According to the proverb, maybe, my chance to form my children has passed. I’m not ready at this moment to speculate on its veracity, or on the more interesting question of how much influence I ever wielded to begin with, but any father is bound to wonder to himself, What kind of father am I?, in moments of urgency and crisis and regret.

There’s no special magic in this poem, but it is one that often comes to mind in light of this question. I take it to be the anthropologist’s version of the Old Testament idea of the “sins of the father” being passed down to the great grandkids.1

This Be The Verse

They fuck you up, your mum and dad.
They may not mean to, but they do.
They fill you with the faults they had
And add some extra, just for you.

But they were fucked up in their turn
By fools in old-style hats and coats,
Who half the time were soppy-stern
And half at one another’s throats.

Man hands on misery to man.
It deepens like a coastal shelf.
Get out as early as you can,
And don’t have any kids yourself.
— Philip Larkin

That written, I can’t sign on to the third stanza of this poem, and in fact would advise the opposite to most people. And for that reason, I love the idea of rebuttals to this poem. I believe this trend began with This be the converse (“They tuck you up, your mum and dad”). But the best of them that I’ve come across belongs to Richard Kell:

This Be The Converse

They buck you up, your mum and dad,
Or if they don’t they clearly should.
No decent parents let the bad
They’ve handed on defeat the good.

Forebears you reckon daft old farts,
Bucked up in their turn by a creed
Whose homely mixture warmed their hearts,
Were just the counsellors you need.

Life is no continental shelf:
It lifts and falls as mountains do.
So, if you have some kids yourself,
They could reach higher ground than you.

This falls apart at the third stanza, which begins by hewing too close to the original in a way that makes little sense, and misses the point with the last two lines. It feels incomplete, despite the strong start. The best rebuttal remains to be written.

  1. If you apply it inductively, you can prove Original Sin. 

To a friend whose work has come to nothing ⦿

I like this poem for its terse testament to the futility of prevailing against fundamentally dishonest people. It speaks to the cost of honesty, of submission to rules laid down by higher principle, and of the frustration and despair that can result when coming up against enemies who don’t. I’d say it’s an anthem for our times, but the poem is a hundred years old, and Boethius among many others already said it anyway, so really it is an anthem to the permanence of the human condition.

To a friend whose work has come to nothing

Now all the truth is out
Be secret and take defeat
From any brazen throat
For how can you compete—
Being honor bred—with one
Who, were it proved he lies,
Were neither shamed in his own,
Nor in his neighbor’s eyes?
Bred to a harder thing
Than Triumph, turn away
And like a laughing string
Whereon mad fingers play
Amid a place of stone,
Be secret and exult,
Because of all things known,
That is most difficult.
— W.B. Yeats

There is also the note about the consolation of private virtue (“be secret and exult”), which is true but can also lead to a kind of preening that undermines it. But the more interesting piece of this is the brief line about “breeding”. Yeats could have written “Holding” or “Given” or any number of words, but went with “Bred”. It reminds me of C.S. Lewis’ comment in The Abolition of Man (in a longer discussion about objective value and the role of education):

…no justification of virtue will enable a man to be virtuous. Without the aid of trained emotions the intellect is powerless against the animal organism. I had sooner play cards against a man who was quite sceptical about ethics, but bred to believe that ‘a gentleman does not cheat’, than against an irreproachable moral philosopher who had been brought up among sharpers.

Virtues like honesty and courage are muscles that have to be taught, cultivated, and trained. I don’t know whether Yeats meant to invoke this idea, or whether he simply chose it unconsciously, but it’s significant.

Binsey Poplars ⦿

Hopkins is probably my favorite poet. His dense poetry makes you work hard to understand it, almost requiring you to read it aloud and play with the prosody. This work is rewarded richly, even if you don’t share the beliefs that motivated it. His reflections on nature, beauty, and piety are wonderful and distinctive.

Binsey Poplars

My aspens dear, whose airy cages quelled—
Quelled and quenched in leaves the leaping sun,
Are felled, felled, are all felled—
  Of a fresh and following folded rank.
      Not spared, not one,
      That dandled a sandalled
    Shadow that swam or sank
On meadow and river and wind-wandering, weed-winding bank.

Oh if we but knew what we do
    When we delve or hew
  Hack and rack the growing green!
  Since country is so tender
  To touch, her being so slender, that
    Like this sleek and seeing ball,
    But a prick will make no eye at all,
  That we, even where we mean
    To mend her we end her,
  When we hew or delve;
Aftercomers cannot guess the beauty been.

Ten or twelve, just ten or twelve
    Strokes of havoc unselve
  The sweet especial scene.
Rural scene, a rural scene—
Sweet, especial, rural scene.
— Gerard Manley Hopkins

On the subject of nature, another of his that I love (but haven’t memorized) is Inversnaid, which ends with this lovely stanza:

What would the world be, once bereft
Of wet and of wildness? Let them be left,
O let them be left, wildness and wet;
Long live the weeds and the wilderness yet.

It would be a fitting caption for the Half-Earth Project.

an abecedary of poetry ⦿

In college I saw a performance by Garrison Keillor and Roland Flint entitled “An Abecedary of Poetry”, in which the two performers went back and forth reciting poems across the alphabet. If you ever saw or even listened to Keillor perform you know how dynamic and delightful he was. At the time it seemed to me that having this kind of command of poetry was beyond reach, but, older now, I see it’s largely just a matter of interest, time, and application. Memorizing poetry and prose is a good thing and the alphabet is a great organizational tool. Here is my own abecedary, some of which is aspirational. My favorite poems are unquestionably those of Gerard Manley Hopkins, in part because achieving a good understanding of them requires you to read them almost enough to memorize anyway. (I welcome suggestions for Y and Z!)

  • A - As Kingfishers Catch Fire; Annunciation; Andrea del Sarto
  • B - Binsey Poplars
  • C - Child Logic; Church Monuments
  • D - Dulce et Decorum Est; The Darkling Thrush
  • E - The Ecstasy; Ecclesiastes 3:1–8
  • F - Funeral Blues; Foolish Questions; For the Time Being
  • G - God’s Grandeur
  • H - High Windows; Harlem
  • I - It is the Duty of the Student; Inviting a Friend to Supper; Invictus; Is there a baby in the house?; Innocence Abroad
  • J - Jabberwocky (I needed a J!)
  • K - Kubla Khan; The Kraken
  • L - Little Abigail and the Beautiful Pony; Litany
  • M - Musée des Beaux Arts; Mandalay
  • N - No man is an island; The New Colossus
  • O - On English; Ozymandias; On his seventy-fifth birthday
  • P - On Pooping; Psalm 23; Portions of Paradise Lost
  • Q - Foolish Questions
  • R - Romans 12; A Red, Red Rose
  • S - Sick; The Second Coming; Sonnet 116; Sonnet 23 (Milton)
  • T - To What Serves Mortal Beauty?; To a Friend Whose Work Has Come to Nothing
  • U - Ulysses
  • V - This Be the Verse
  • W - When I Consider How My Light is Spent; When Great Trees Fall; The World
  • X - I Wake and Feel the Fell of Dark Not Day
  • Y -
  • Z -

I intend to add links to each of these as I find time and interest to write about them.

Joni Mitchell's "All I want" as sung by a preternaturally talented chicken ⦿

Buck buck buck buck bu-uck buck bgawk buck buck buck bugaaawk
bawk buck buck, bgawk buck buck, bgawwwk buck buck—
buck-buck buck buck buck, bgawk braaawwk!
Buck guck baaawk buck buck, baaawk buck buck, bgaaaawk
Buck bgaawk buck, buck buck bgawk buck.

Buck buck buck buhgerk! Buck buck buck buck bugHERK!
Buck buck buck braaaaugh buck buck buck beeergh buck!
Braaugh, Braaugh, buck buck buck buck-bruck buck begawk
Buck-buck-buck-buck-buck bugheeergh buck-buck buck BUCK BUCK!
Braagh buck—braugh buck—braugh buck buck-buck bugheerk buck-buck
Bruck buck buck buck buck-buck
buck buck-buck buck-buck buck BRUCK braugh buck buck braugh-bugherk!
Buck buck brauuuugh.

Braaaaugh buck buck-buck buck-buck braugh buck buck buck braauuugh
buck-buck buck-braugh buck-buck-buck buck-buuuuck buck buuck
loses attention

Sing along! TODO: post recording.

Chicken Tikka Masala ⦿

It feels pretty lame posting a recipe for a very common food dish, but this is my go-to meal for potlucks or visiting guests and is usually well-received, even by children. It has served me well as a meal trick and I highly recommend adopting it as your own. I’ve improved it in small ways from the base recipe I found a while back at Epicurious. I’ve also added German spice names since I made this a number of times in Germany.

Chicken Tikka Masala

Originally by Julie Sahni via Epicurious (May 2013). Presented here quadrupled.

Prep time 1 hour. Serves 16 (freezes well). 

Marinade for the chicken

The original (quartered) recipe called for exactly this much chicken, which was about four times too much. With the quadrupled sauce below, it’s just right.

  • A pound or so of boneless, skinless chicken breast (1 or 2 breast halves total)
  • 1/4 cup (50 g) plain whole-milk Greek-style yogurt
  • 2 tablespoons peanut, vegetable, or canola oil
  • 2 teaspoons fresh lime or lemon juice
  • 1 large clove garlic, minced

Flatten the chicken with a mallet or rolling pin. Then prick it with a fork on both sides. Dump the chicken and all the ingredients into a quart or gallon bag, mash together to mix, and put in the refrigerator. Do this well ahead of time if you can (e.g., the night before). It really makes a difference!

Tomato sauce

In a large (e.g., 6 quart) pot, melt at medium-low heat:

  • 1 cup (2 sticks; 224 g) unsalted butter

Then add:

  • 3 large white onions, finely chopped

Brown the onions in the butter at medium-low heat. It will take 20–30 minutes. While this is happening, mix the following in a small bowl:

  • 4 T (20 g) ground coriander (Koriander)
  • 2 T (24 g) ground cumin (Kreuzkümmel)
  • 2 tsp (8 g) ground cardamom (Kardamom)
  • 2 tsp (8 g) ground nutmeg (Muskatnuss)
  • 2 T (24 g) paprika (Paprika)
  • 1 teaspoon (2 g) cayenne (1/2 teaspoon / 1 g for northerners) (Cayenne Pfeffer)
  • 4 T grated peeled fresh ginger

Add this mixture to the onions when they’re ready. Mix in. I think it helps to mix this and let it cook for a bit, but it will start to gum up. When that starts to happen, add:

  • 1 28 oz. can and 1 14.5 oz. can (1,197 g) of canned tomato purée or diced tomatoes
  • 2 cups (456 g) heavy cream or half-and-half
  • 3 cups (170 g) water
  • 5 tsp (30 g) kosher salt

and bring to a boil. Stir frequently to prevent burning on the bottom of the pan. Reduce the heat to gently simmer the sauce, uncovered, until thickened slightly, about 20 minutes. DO AHEAD: The sauce can be prepared ahead and refrigerated.

Chicken

Cook the chicken on a griddle or skillet. You can use peanut oil if you like but you shouldn’t need anything. Don’t overcook it.

Cut the chicken into small (~10mm) cubes and add to the simmering sauce. Simmer 10–15 minutes and remove from heat. (You can also just add the chicken to sauce that’s already been moved to the fridge. It’s good to let the meat soak in the sauce as long as possible.)

Serve

If frozen, thaw and heat on low heat. When ready to serve, remove from heat, and stir in

  • 1/2 teaspoon freshly ground black pepper

(Note: this is from the original recipe, but I have never actually done this).

Serve with naan (Indian flatbread) and/or cooked Basmati rice. Optionally garnish with 1/2 cup chopped fresh cilantro plus additional sprigs. (Note: I have never done this either).

the ideal public restroom ⦿

The first goal in using any public restroom is of course to relieve oneself. However, this goal only barely edges out a second, which is to accomplish #1 while touching as little as possible. Of course, this is often not possible, but the biggest obstacle appears to be ignorance of this general principle on the part of restroom architects. In hopes of correcting this I offer this checklist of features of the ideal public restroom:

  1. The bathroom door should open outward. A kick plate on the bottom of the door, or a knee-high electronic switch near it, allows it to be opened with your foot. An acceptable alternative is to have a winding passage instead of a door (airport-style), provided it is wide enough to accommodate two-way traffic and wheelchairs. If the exit door must open inward, there is no reason not to install a toe grab. If all else fails, a waste basket should be placed next to the door, such that visitors can use a piece of paper towel to open the door and then toss it into the basket.

  2. In men’s bathrooms, first build two stalls. If there is more room, add a urinal, then another. Continue on, alternating between adding a stall and then a urinal.

  3. Toilets should be flushed by foot. This is what many people do anyway. The lever should be easily cleaned.

  4. There should be a place to secure a small child, both for purposes of changing the child, and of having a place to set the child while the adult gets their own job done. This includes the men’s bathroom!

  5. The purpose of stalls is to provide privacy. It is madness, therefore, to have stalls that are 5’10” tall, which seems to be a common practice, at least in America. Stall walls should be no shorter than 6’6” tall, and should start at roughly 1’ off the ground. There should be no visible gap in the door once it is closed.

    Urinals need dividers, too. Contrary to what seems to be common practice, the purpose of a divider is to break visual contact between adjacent visitors. They should therefore be at least 6’6” tall, just like stalls, and should start maybe 2’ off the floor. (If the budget is tight, a divider from shoulder to head level is preferable.)

  6. Stall doors should lock securely. The lock should be designed such that the locking mechanism works in the direction of gravity and will therefore fail gracefully should the latch internals break down. A falling deadbolt is ideal. Doors should open in so that if a person forgets to lock the door, they can reflexively block the door opening even while engaged in getting the job done.

    A good stall lock A bad stall lock

  7. There needs to be an obvious external indicator that notes whether a stall is in use. This indicator is engaged when the visitor locks the stall.

    A stall lock indicator

  8. Sinks should be both child- and wheelchair-accessible. For children, this means either a very low sink or a stepstool, and soap dispensers that are within reach.

    Badly placed soap dispenser

  9. The faucet should be controlled automatically via a reliable sensor or, ideally, via an easily-cleanable foot lever.

  10. Soap should also be touchlessly distributed and should not be scented (though almond is okay).

  11. No hand dryers. They are oppressively loud, they spread disease, and they don’t do the job. This especially applies to Dyson’s latest abomination, which is integrated over the sink and blows dirty gray water from the sink all over everything. Only automatically-dispensed, rough, unbleached paper towel should be distributed.

Additionally, there should be something interesting to look at. Advertising is okay. Do not under any circumstances put a mirror in front of a seated toilet. A thoughtful modern approach is a polite sign asking people to put away their phones, particularly if there is a shortage of stalls (see rule #2).

The Krinner Christmas tree stand ⦿

American Christmas tree stands are an abomination. A typical home store will have a display of stands like these, which I saw at Home Depot.

They range in price from $15 to $40 but they are all variations on the same failure and frustration. They require you to lie on your stomach under the tree, screwing in little pins, an imprecise, slow, and uncomfortable if not painful operation. When you are done, your spouse will inform you that the tree is not plumb. If you don’t want to ruin the evening you will curse inwardly only.

In stark contrast is the Krinner Christmas tree stand, which I paid €30 for at Bauhaus in Berlin. Opening it each year is like an early Christmas present.

A metal pin at the base holds the tree stump in place. A stainless steel wire operated by a crank clamps around the tree. Here I am opening it.

To work it, you just drop the tree in, and, while standing, level the tree by grabbing its trunk. When you are ready, you just step down on the lever with your foot. The arms pull in and automatically adjust to the contours of the tree. No fiddling, frustration, or freaking out. When you are done, your spouse will tell you how nice the tree looks.

The box says “SEHR GUT”.

Can confirm.

This removes enough stress from my life that I’d gladly pay €30 for it every year, but I only had to pay it once, and expect to pass this thing down in my will.

Anyway, fröhliche Weihnachten to the German engineers who built this thing. It is a huge failure of the free market that this thing is not sold here in the States. I would expect stacks of these at nice stores like Ace Hardware.

mon premier compte linux ⦿

I was in high school when I got my first Linux account. I grew up in a small town of only 5,000 people, and we were late getting the Internet in our community. But when it arrived, I got the chance to work at the place where the Internet came into our town, because the man who ran the program taught my class at church.

I don’t remember my first days working there, but I remember well when I received my first Linux account.

The man who created my account was very cool. I’ll call him “Lars”. I was fifteen at the time, just starting high school, and he was two or three years older than me. Back then I admired, and was a little afraid of, guys who were cool and older than I was. Lars had a lot of ink in his skin, and also a lot of designs made with a knife (I think it’s called “cutting” in English). He listened to electronic music, which I had never even heard of before. He knew everything about the Linux system, which I was just beginning to learn. And, more than anything, he was kind to everyone. He was sitting at his computer, and asked me, “What username do you want?” I thought about it. It’s an important question, the username! Lars had a very cool username. I won’t say it here, to help him keep his identity secret, but it was a mysterious name, one that evoked foreign lands. It was a name even cooler than “zero cool”. You can’t choose a simple name, like “post” or “mpost”; that’s not cool! For a small-town guy like me, I didn’t know what name I wanted, but I knew it had to be a name like that. It wasn’t just a name. It was an identity. I told him I would have to think about it, and I went home that night in search of the perfect name.

After school the next day, I drove to work. I had thought for a good part of the night, and I had found the perfect name. I was excited. Before the end of the day, I would have my first Linux account, with a very cool identity! I found Lars and asked him to make me the account. He sat down at his computer, typed his very cool login name, and asked me, “What name do you want?” I answered with the name I had discovered in the small hours of the night. “Manimal.”

I heard a sound come from Lars’s throat, but his brow didn’t move. Today, when I picture that day, I can see the corners of his mouth going perfectly straight. A moment passed. “Manimal?” “Yes, manimal.” He created my account, handed me the keyboard, and I typed my password. Fortunately, I don’t remember that password.

And that is how I became “manimal” for the next three years, while I worked at the Internet company. It was the beginning of several years of great jokes with a colleague my own age, who had a normal username but a very cool hostname. But that’s a story for another day. Now that I’m an adult, I’d like to see Lars again and ask him what he thought of my decision, but I doubt he remembers me. And in any case, I know what he thought. It’s the same thing I think now. I was an idiot. But don’t forget what I said about Lars: he was kind. Even idiots like me deserve kind treatment. And situations like this one make me wonder: what am I doing now that I will despise in twenty years?

french deux fois ⦿

I have been learning French for four or five years. “French Deux Fois” is a method for improving my ability in the following way. I take a story, a text, or an account, and I write it twice (as you have guessed). The first time, I write it without any help. I cannot consult dictionaries, translation services, friends who speak the language, or French enemies who owe me. Everything comes from my brain. Afterwards, I correct the first text with all the help I want. I make it as good as I can. Many problems will remain, but if that were not the case, there would be no reason to do this exercise.


I have been learning French for four or five years. “French Deux Fois” is a method for improving my ability in the following way. I will take a story, a text, or an account, and I will write it twice (as you have guessed). The first time, I will write it without any help. I cannot consult dictionaries, translation services, friends who speak the language, or French enemies who are in my debt. Everything must come from my brain. Then I will correct the first text with all the help I would like. I make it as good as I can. Many problems will remain, but if that were not the case, there would be no reason to do this exercise.

An encounter with nature ⦿

[Setting: A sunny Sunday morning at Elk Neck State Park in Maryland. Summer has come a bit early but a fresh storm the night before has brought in a cool breeze. A short distance away the small waves of the Chesapeake beat upon the rocky shore of a beach, imperceptibly continuing a process that eons from now will break the beach’s large, painful rocks into pleasant sand. The sun shines down upon the ranger station. The trees are verdant. In the distance, birds chirp. Ranger 1 is looking over the reservations database: it looks like a light day ahead. Ranger 2 is preparing her eighth grade graduation speech. A camper walks into the ranger office, holding a bag, with a cup inside it.]

RANGER 2: Good morning.

CAMPER: Good morning. I found this black widow spider.

RANGER 2 [face brightening]: Cool!

CAMPER [pausing]: I’d, uh, never seen one before.

[RANGER 1 walks away from his computer to have a look]

RANGER 1: Yep, that’s a nice one.

RANGER 2: They’re everywhere in Maryland.

[An awkward pause ensues.]

CAMPER: I just thought you should know.

RANGER 2: Thanks! You can just let it go in the woods.

CAMPER: Actually, uh, I was going to kill it!

RANGER 2: Oh, no! Don’t do that!

RANGER 1: He’s our little friend.

RANGER 2: We’ll just take that from you, thanks.

CAMPER: Um, okay.

[The camper walks away. A suspicion is dawning on him that he has played a stereotype. Like the middle schooler excitedly informing her long-haired, flip-flop wearing social studies teacher about this band she’d just discovered called “The Grateful Dead”, the naïve suburban camper has just brought an everyday specimen from nature and introduced it to the staff as if it were remarkable. The camper returns to his car and proceeds to deliver a lecture to his young daughter on the importance of respecting nature and living with it in precarious balance.]

free startup ideas ⦿

Here are some free startup ideas that I just sent to a friend. I told him he could use them, but did not give him exclusive rights, so you can maybe beat him to the patent office. They are free of charge, but consider hiring me as a consultant at an egregious hourly rate. After all, these three amazing ideas are actually the very worst ones that I could think of. Imagine what I could come up with if you were paying me!

  • You’re on a car trip to visit family and it’s a long drive so you have to stop, but every stop with kids takes at least 45 minutes, especially meals. Wouldn’t it be nice if you could have the meal delivered to you—without having to get off the highway? Why not? This app lets you order from restaurants approximately 30 minutes ahead of you along your route. When it’s ready, a driver will pick it up, and wait near an onramp until you begin to approach. Then he or she will get on the highway and maneuver near you. At this point, you lower your window, and they just toss it right in. Payment is all by app of course so there’s no need to negotiate tossing the tip back into their open window (sounds dangerous!). Business model: Sell to Uber, cash out. Risks: A potential problem is that the drivers will want a lot of money and might need training (expensive), but I think we could work around this by selling the excitement factor.

  • Flavored seltzer water is all the rage. But sometimes you pick a flavor and get bored of it before you finish the whole can, or maybe you want to add your own twist. This idea is to have a straw with an inlet where you can attach flavor packs. The straw has bluetooth connectivity. After connecting it to your phone, whenever you want a little flavor boost, you just tap a button in the app, and a valve opens, releasing some flavor into the drink as it ascends the straw. Instant f-f-f-f-flavor boost (the app will optionally say that when you push the button)! We could also gamify this for kids (e.g., no flavor boost for you till you solve this math problem!). Later as society progresses and social norms loosen we could also add drugs. One potential problem is that the current sentiment is against straws in some places and hence it might have to be biodegradable, but that just means people would buy more. Collect the whole set! Business model: sell flavor packs at a loss, burn through VC money advocating for looser drug laws, sell drug packs. Risks: competitors will make their own flavor packs; Canadian companies will sell drugs first (mail order). The first flavor would be vanilla.

  • This one is a little gross but we’re all adults here. Social networking is all the rage and it’s helped connect people in ways they have never been connected before, shining sunlight into the darkness of our formerly lonely and private lives. Before then, people had a hard time knowing just where they fit amongst their peers, but now it is obvious. But there are still areas of our lives that are beyond their reach. One area of people’s lives you don’t know anything about is their bowel movements, i.e., how often they go, what a “normal” one looks like, what consistencies you should expect, etc. This is a problem because you never get to know how you stack up. Until now. With this app, you snap a picture of every bowel movement before you flush. It is uploaded to our servers where it is analyzed with deep learning / neural nets / AI. It is then summarized and used to make an “anonymous” profile and you can compare yourself to any peer group you like. Business model: data is not actually anonymous, is sold to advertisers. Risks: people might object at first but they’ll quickly change.

What I love / hate about Berlin ⦿

What I love most about Berlin

  • Streets are built with pedestrians and bicyclists in mind.
  • Pedestrians stop for lights at crosswalks, even if there is no traffic in sight. (Note: increasingly less true in Berlin).
  • The bread is amazing and almost free. (The kinds of breads you might be inclined to avoid in the US—whole-wheat breads, dark breads—are in fact the best.)
  • I love euros; they are fun to stack and count.
  • Döner kebaps: cheap, delicious, everywhere. (Though best not to inquire about the source of the meat log they shave it from.)
  • New Year’s.
  • The next morning, city employees are out first thing, cleaning it all up, dressed in bright orange jumpsuits.
  • There are parks and playgrounds everywhere and they are all beautiful.
  • They build child play structures that would take the breath away from American personal injury lawyers.
  • Christmas markets.
  • In the winter, at any sign of sunshine, people spill out onto the streets, dressed head to toe for warmth, sunning their faces.
  • Despite being much maligned, the German language is beautiful.
  • Kindergeld! (strangely, paid out also to temporary residents)
  • Schools are laid out so kids can walk to them. The school day is out around 1:45 in the afternoon (this can be extended till 6 for working parents).
  • School vacations are spread out throughout the year.
  • You can get most anywhere via public transportation.
  • The windows are all six feet tall, and can either swing open or tilt from the top.
  • Mosquitos are few, move slowly, and congregate in visible places.
  • Strangers will correct your behavior.
  • There are an uncountable number of bike rental services.
  • The ceilings are all 12 feet high.
  • Working spaces even in open offices are large enough for three American people.
  • You are never far from a cheap ice cream cone (€1.50 max).
  • Everything is closed on Sunday.
  • Emergency vehicles turn off their sirens except when at intersections or otherwise necessary.

What I don’t love

  • There are no drinking fountains anywhere.
  • The Mexican food is best avoided.
  • The bureaucratic vocabulary is large, disjoint from that of everyday life, and mostly noncompositional.
  • The streets are covered in dog excrement.
  • There are prominent ads for adult shops even in the nice parts of town.
  • Alexanderplatz—where you find the TV tower that is a symbol of Berlin—is kind of a dump.
  • They are quite fastidious about schooling. There are reports about families trying to cut out a day early for vacation being stopped by police at the train station!
  • Glühwein is overrated and overpriced (except at the Alt-Rixdorf Christmas Market in Neukölln).
  • The painted line separating lanes on the road is the same color (white) as the outer line. It makes it harder to distinguish them.

With apologies to my undergraduate advisor.

Sockeye Code Walkthrough ⦿

This document describes how Sockeye implements the Transformer model. It begins with data preparation, then moves into training, and finally decoding. At the heart of Sockeye is a generic structure that implements all three major architectures in neural machine translation: the RNN, the Transformer, and the Convolutional approach.

The document is split into three major sections:

  1. Data Preparation: raw iterators, prepared iterators, generic interface, terminology, summary
  2. Training: skeleton, modules, the transformer, encoder, decoder [incomplete]
  3. Inference [incomplete]

Like most NMT systems, Sockeye’s codebase changes daily. I will be basing this tutorial on Sockeye v1.18.54 (commit 985c97edecaa93c3ab45e0b93b7a6493a1c5d7c7); all links to the code will be within this branch.

Data Preparation

The basic operation of the training algorithm is to consume sentences of the training data and run forward and backward operations over the computation graph. Sockeye can process this training data in two ways: a traditional way in which the trainer works directly with raw text, or an efficient manner that requires a separate pre-preparation step to organize the raw data into training shards. Both of these approaches are abstracted away beneath Sockeye’s data iterators interface. This API returns DataBatch objects that are fed into training; what varies is which of these two data preparation approaches the user chooses.

It is important to understand the basics of how training works. During training, sentences are batched together, meaning that the losses and gradients are computed for many sentences at the same time. Because these computations are done in parallel, each runs to the length of the longest item in the batch. For this reason, NMT systems try to group sentences of similar lengths into buckets. The buckets are keyed by a tuple (M, N) giving the maximum source and target sentence lengths. These buckets can be defined in any way the user chooses.
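
To make the bucketing idea concrete, here is a toy sketch (not Sockeye’s code; the bucket list and lengths are invented for illustration) of assigning a sentence pair to the smallest bucket that can hold both sides:

    # Buckets are keyed by (max_source_len, max_target_len); a pair goes into the
    # smallest bucket that fits both sides, or is discarded if none does.
    from typing import List, Optional, Tuple

    def smallest_bucket(buckets: List[Tuple[int, int]],
                        src_len: int, tgt_len: int) -> Optional[Tuple[int, int]]:
        fitting = [(m, n) for (m, n) in buckets if src_len <= m and tgt_len <= n]
        return min(fitting, default=None)

    buckets = [(10, 10), (20, 20), (30, 30), (40, 40)]
    print(smallest_bucket(buckets, 17, 22))   # -> (30, 30)
    print(smallest_bucket(buckets, 55, 12))   # -> None: longer than any bucket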

This section will begin with the raw data approach. It is the most intuitive, and its discussion will bring to light issues that arise in training neural machine translation systems. This will motivate the use of prepared data iterators.

Raw data iterators

This “traditional” approach to data iterators makes two passes over the raw data. It is engaged by passing --source and --target to sockeye.train. During the first pass, it builds statistics and calculates information about what the batches will look like. In the second, it assigns data to batches. It then creates an iterator that iterates over them.

The entry point is get_training_data_iters() in data_io.py. The steps are laid out fairly clearly:

  • Calculate length ratio (analyze_sequence_lengths). This is a linear pass over the data. After running it, Sockeye knows the number of valid training sentences as well as statistics (mean and standard deviation) of their length ratios.

  • Define the parallel buckets (define_parallel_buckets). A bucket is a pair of integers determining the maximum length (source and target) of a sentence pair in the training data. Sockeye uses --bucket-width (call it B) to iterate over the target data in steps (B, 2B, 3B, …), defining the lengths of the target buckets, up to --max-seq-len. It then uses the ratio computed above to determine the relative size of the source bucket. At training time, a sentence pair is fit into the smallest bucket that can fit both sides. (A rough sketch of this step appears after this list.)

  • Compute statistics (get_data_statistics). With the buckets defined, Sockeye takes another pass over the training data and computes statistics using a DataStatisticsAccumulator. Afterwards, it knows how many data points (sentence pairs) belong in each bucket, among other things.

  • Define batch sizes (define_bucket_batch_sizes). Two types of batching are available via --batch-type: word- and sentence-based. When using sentence-based batching, all batches have the same number of sentences (and thus batches with shorter sentences take up less memory, since the graph doesn’t need to be rolled out as far). With word-based batching, batches with shorter sentences can fit more of them. After this function returns, we know the batch sizes for each bucket.

  • Load data (RawParallelDatasetLoader.load) and ParallelDataSet.fill_up. Finally, the data is loaded into memory with a third pass over the training data. The raw data iterators load the entire training corpus into memory. For each bucket, we know exactly how many data points there are, so we can directly initialize an MXNet NDArray object (though actually, we use numpy first, and then initialize MXNet at the end). As each data point is read in, it is padded with constants.PAD_ID (i.e., 0) so that it reaches the length of the bucket it’s in.

    The complete set of data is returned as a ParallelDataSet. This data all resides in main (i.e., not GPU) memory. There is one small remaining issue. The number of samples for each bucket (i.e., the number of sentences in the training data that fell into, say, bucket (29, 30)) is unlikely to be a multiple of the batch size. In order to have full batches, the remainder must be row-padded, i.e., dummy sentences are added so that the final batch is the correct size. This is accomplished by a call to fill_up().

  • Create iterator. Sockeye returns a ParallelSampleIter. This is a custom iterator whose ultimate job is to return mx.io.DataBatch objects at each call to next(). It also supports shuffling both the batches and the data within each batch.
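
As promised above, here is a rough sketch of the bucket-definition step (a hedged stand-in, not the real define_parallel_buckets; the numbers are invented): target bucket lengths grow in steps of --bucket-width up to --max-seq-len, and each source length is scaled by the empirical mean length ratio.

    # Hypothetical stand-in for define_parallel_buckets: step the target lengths by
    # the bucket width and scale the source side by the mean source/target ratio.
    def define_buckets(bucket_width, max_seq_len, length_ratio):
        buckets = []
        for target_len in range(bucket_width, max_seq_len + 1, bucket_width):
            source_len = min(max_seq_len, max(1, round(target_len * length_ratio)))
            buckets.append((source_len, target_len))
        return buckets

    print(define_buckets(10, 40, 0.9))
    # -> [(9, 10), (18, 20), (27, 30), (36, 40)]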

Prepared data iterators

There are a number of problems with the raw data iterators:

  • They require the entire training data to be loaded into memory.
  • They waste a lot of time at training startup—time when the GPU(s) are not being used.

These problems are exacerbated by the fact that many experiments will make use of the exact same prepared data. Because of this, Sockeye provides a second, more efficient way to iterate over the training data. This approach essentially offloads the training data preparation steps described above to a separate program (sockeye.prepare_data), which divides the data into shards and then applies that process to each shard. Training with prepared data then uses a new iterator, which traverses the shards one at a time, iterating over all the data in each. With this:

  • Memory use is at most the size of the largest shard (default: 1m data points).
  • There is almost no startup cost, apart from directly loading the first shard into memory.

Preparing the training data

Preparing the training data is accomplished by calling

    python3 -m sockeye.prepare_data --source ... --target ... [--shared-vocab] [other args]

The CLI just wraps a call to data_io.prepare_data(). This function proceeds along the following steps:

  1. As before, it iterates over the training data to collect the mean ratio statistics, and then uses that information to define the bucket sizes.

  2. As a second step, the data is randomly assigned to shards. By default, each shard has 1,000,000 data points (sentence pairs), though this can be overridden with --num-samples-per-shard. These shards are plain text. (A rough sketch of this step appears after this list.)

  3. Finally, each shard is read and processed into raw mx.nd.NDArray objects, which makes them quite fast to load at runtime. This is accomplished by using RawParallelDatasetLoader.load() as before. These datasets are then immediately dumped to disk using mx.nd.save, which simply concatenates the source, target, and label streams.
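Here is the rough sketch of the shard-assignment step (item 2) promised above. It just writes each pair to a randomly chosen shard; the helper name and fixed seed are my own, and the real sockeye.prepare_data also serializes the shards as described in step 3.

import math
import random

# Rough sketch of random shard assignment (illustration only).
def shard_pairs(pairs, samples_per_shard=1_000_000, seed=13):
    """pairs: a list of (source line, target line) tuples; returns a list of shards."""
    rng = random.Random(seed)
    num_shards = max(1, math.ceil(len(pairs) / samples_per_shard))
    shards = [[] for _ in range(num_shards)]
    for pair in pairs:
        shards[rng.randrange(num_shards)].append(pair)
    return shards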

Using prepared iterators

Prepared data iterators are activated in Sockeye via the --prepared-data flag to sockeye.train. (This flag is incompatible with --source ... --target ....) Shards are accessed via the ShardedParallelSampleIter, which is effectively just a shard-aware wrapper around the earlier-described ParallelSampleIter. In addition to shuffling batches, it shuffles shards: at each epoch, reset() is called, which randomizes the order of the shards as well as the data within each shard. However, all the data within a shard is processed before the next shard is loaded. Data loading is much faster, because (a) the shards are much smaller and (b) each shard has already been serialized to integer-based NDArray objects (via the vocabulary), and therefore doesn’t have to be constructed.
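Conceptually (a sketch of the idea only, not the ShardedParallelSampleIter code, and with assumed helper callables), an epoch of shard-aware iteration looks like this:

import random

# Conceptual sketch of shard-aware iteration (not Sockeye's ShardedParallelSampleIter).
def epoch_batches(shard_files, load_shard, shard_batches, seed=1):
    """load_shard: reads one prepared shard from disk; shard_batches: yields
    shuffled batches from an in-memory shard."""
    rng = random.Random(seed)
    order = list(shard_files)
    rng.shuffle(order)                  # randomize the shard order each epoch
    for path in order:
        shard = load_shard(path)        # only one shard resides in memory at a time
        for batch in shard_batches(shard):
            yield batch                 # all of this shard before the next is loaded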

A generic interface

Sockeye brings this together into a generic API. In train.py, the function create_data_iters_and_vocabs uses Sockeye’s training parameters to figure out the correct thing to do, hiding the complexity of:

  • Choosing raw or prepared data iterators
  • Starting training anew or continuing from an earlier aborted run

Terminology

Here is some of the terminology used in the previous section. Most of this is generic NMT terminology and not specific to Sockeye.

  • Buckets group together sentences in some fashion, usually ones of similar length. Buckets are determined by using increments of --bucket-width on the target side, with the source sides computed using the empirical source/target sentence mean ratio.

  • A batch takes a group of sentences and runs training over them in parallel. Batches operate on a subset of a bucket; the whole bucket probably doesn’t fit, since buckets are usually much larger than batches. Batching can be either sentence-based (with a fixed number of sentences per batch) or word-based, which fixes the number of words per batch and determines the number of sentences via sentence length. Word-based batching makes sense since more sentences can fit in the batch when they are shorter.

  • A shard is a random subset of the entire training data. By default, a shard contains a million sentence pairs, but this can be changed with the --num-samples-per-shard flag to sockeye.prepare_data.

Summary

The following table compares training using raw iterators and prepared iterators.

                          raw iterators                         prepared iterators
  Entry point             get_training_data_iters()             get_prepared_data_iters()
  Iterator                ParallelSampleIter                    ShardedParallelSampleIter
  Memory                  loads all training data into memory   shard-by-shard
  Training startup cost   three passes over the data            roughly constant

Training

The basic function of Sockeye’s training module is to construct a symbolic computation graph and to run data through it. The assembly of the graph is separated from its execution, which does not occur until a batch of input examples is passed in via a call to a module’s forward() method. These concepts will be familiar to you if you have used MXNet or another deep learning toolkit. To make them concrete in MXNet, the important steps are the following:

  1. The graph is assembled by linking together various mxnet.sym.Symbols into a computation graph and creating a module.
  2. The module is placed into GPU memory when module.bind() is called. This call provides the module with the input shapes, from which it can infer all shapes (and required memory usage) throughout the graph.
  3. The graph is executed when a call to module.forward() is made.
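If you haven’t used MXNet’s module API before, here is a small, self-contained toy example (mine, not Sockeye code) that runs through those three steps on the CPU: assemble a tiny symbolic graph, bind it to concrete input shapes, and execute it with forward().

import mxnet as mx
import numpy as np

# 1. Assemble a tiny symbolic graph and wrap it in a module.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data=data, num_hidden=4, name='fc')
net = mx.sym.softmax(data=net, name='out')
module = mx.mod.Module(symbol=net, data_names=['data'], label_names=None, context=mx.cpu())

# 2. bind() allocates memory for the graph once the input shapes are known.
module.bind(data_shapes=[('data', (2, 8))], for_training=False)
module.init_params(initializer=mx.init.Xavier())

# 3. forward() executes the graph on an actual batch of data.
batch = mx.io.DataBatch(data=[mx.nd.array(np.random.rand(2, 8))])
module.forward(batch)
print(module.get_outputs()[0].shape)   # (2, 4)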

Of course, there are many other pieces required to run training. The data preparation section above is one important piece of it. Sockeye contains implementations of all three major NMT architectures, and there are a host of parameters affecting each of them, as well as architecture-agnostic values such as the optimizer to use or the learning rate.

The skeleton

It is often difficult to determine what the most important aspect of an entity is: its essential core, its quiddity. For example, what is the central or most important part of an automobile? Well, it couldn’t go anywhere without an engine, so that is certainly a candidate. But then, it couldn’t go anywhere without wheels, either, or for that matter, without seats to hold the driver, who serves the core function of directing the vehicle (since driverless cars are a pipe dream). Anyone who has spent time around pot smokers may be familiar with arguments of this nature and their notorious circularity and unresolvability. I have no problem, however, in identifying the golden core of Sockeye, its white-hot center, the engine that drives the entire codebase, its most important and essential piece. And since there are no MT researchers smoking pot nearby to argue with me, I can even proceed directly to identifying the exact line numbers without fear of contradiction. They are lines 107–146 of training.py, in the TrainingModel initialization, where the symbolic graph for a particular sentence-pair length is lazily defined, to be later passed to the BucketingModule, which will unroll it on demand.

The function is short enough that it’s worth repeating here:

def sym_gen(seq_lens):
    """
    Returns a (grouped) loss symbol given source & target input lengths.
    Also returns data and label names for the BucketingModule.
    """
    source_seq_len, target_seq_len = seq_lens

    # source embedding
    (source_embed,
     source_embed_length,
     source_embed_seq_len) = self.embedding_source.encode(source, source_length, source_seq_len)

    # target embedding
    (target_embed,
     target_embed_length,
     target_embed_seq_len) = self.embedding_target.encode(target, target_length, target_seq_len)

    # encoder
    # source_encoded: (batch_size, source_encoded_length, encoder_depth)
    (source_encoded,
     source_encoded_length,
     source_encoded_seq_len) = self.encoder.encode(source_embed,
                                                   source_embed_length,
                                                   source_embed_seq_len)

    # decoder
    # target_decoded: (batch-size, target_len, decoder_depth)
    target_decoded = self.decoder.decode_sequence(source_encoded, source_encoded_length, source_encoded_seq_len,
                                                  target_embed, target_embed_length, target_embed_seq_len)

    # target_decoded: (batch_size * target_seq_len, decoder_depth)
    target_decoded = mx.sym.reshape(data=target_decoded, shape=(-3, 0))

    # output layer
    # logits: (batch_size * target_seq_len, target_vocab_size)
    logits = self.output_layer(target_decoded)

    loss_output = self.model_loss.get_loss(logits, labels)

    return mx.sym.Group(loss_output), data_names, label_names

It is not included in the comments, but this code is the high-level skeleton which manages training for all three architectures in Sockeye. There are five pieces or layers:

  1. The embedding layer, which computes the source and target word embeddings.
  2. The encoder layer, which encodes the source sentence, producing a sequence of encoder hidden states.
  3. The decoder layer, which runs the decoder to produce a sequence of target hidden states.
  4. The output layer, which produces, for each target word position, a distribution over the target language vocabulary (in the form of raw logits).
  5. The loss computation, which computes the loss of the output distributions relative to the target labels (the correct answers).

This has been a fairly generic view of how training works. Having defined this central piece, I will work both outward and inward to explain Sockeye’s internals in grounded detail. The “outward” explanation will describe how we get here: that is, the role this code plays in the construction of the graph, and how the graph is executed with data fed into the training module. Some of this will be a review of information from the previous section on data preparation.

I will follow that with an “inward” explanation. Three of these skeletal pieces (embedding, the output layer, and loss computation) are shared among all three of Sockeye’s auto-regressive NMT architectures: the RNN, the Transformer, and the convolutional model. The architectures differ in how they construct the encoder and decoder layers. I will explain below how one particular instantiation of this generic skeleton—the Transformer—proceeds with training a model. Along the way you will learn lots of intricacies and secrets of both Sockeye and MXNet.

Modules

Sockeye relies on MXNet’s modules, which implement and execute programs defined by computation graphs; these graphs are built from operations on MXNet Symbols, as in the sym_gen() function above. Modules have already come up a number of times. While the basic idea is simple, in my opinion the clarity is muddied a bit by the particulars, so it is worthwhile going over them once again. Modules have a few basic functions:

  • receiving a computation graph definition, in the form of a Symbol;
  • initializing or loading parameters for the computation graph;
  • executing the graph with actual data.

An important concept in training neural systems is bucketing. This is the process by which similar-sized inputs are grouped together and executed as a batch. MXNet provides some support for bucketing, by allowing the user to provide a function which generates the symbolic graph for a bucket on the fly. This function is keyed to a bucket key, which is a (source length, target length) pair: whenever the bucketing module gets a group of data under a certain bucket key, it generates that graph, caching the result.

When run, the sym_gen() function generates a symbolic graph. The graph is not actually executed at this point; in fact, at this point in the code, the function has only been defined, not yet called. Even when it is called, the graph is only created; it is not yet laid out in memory. That doesn’t occur until data is passed to MXNet’s “bucketing module” system, which rolls out the graph to different lengths on demand, sharing parameters between the rollouts and saving time and computation by letting buckets with shorter sentences finish their rollout earlier. This module is defined a few lines later, where we see:

self.module = mx.mod.BucketingModule(sym_gen=sym_gen,
                                     logger=logger,
                                     default_bucket_key=default_bucket_key,
                                     context=self.context,
                                     ...)

Sockeye’s default behavior is to use buckets, but you can turn that off by passing --no-bucketing to sockeye.train (and sockeye.prepare_data, if you are using data preparation). In that case, Sockeye runs the following code instead:

symbol, _, __ = sym_gen(default_bucket_key)
self.module = mx.mod.Module(symbol=symbol,
                            data_names=data_names,
                            label_names=label_names,
                            logger=logger,
                            context=self.context,
                            compression_params=self._gradient_compression_params,
                            fixed_param_names=fixed_param_names)

Here, you can see how the sym_gen() function is called with the default bucket key. Since the bucket keys are pairs of (source, target) lengths, the default is to roll out to the longest possible length. So when bucketing is turned off, Sockeye creates a single graph, and every training instance gets executed all the way to the end.

The bucketing module doesn’t run sym_gen() now, but later, on demand, as it encounters particular bucket keys (e.g., (30, 27) for a bucket with a maximum source length of 30 and a maximum target length of 27). The bucketing module is basically just a hash table mapping bucket keys to unrolled computation graphs. Each time a new batch comes in, the bucketing module creates the graph for that batch’s bucket key if it is not already present. The graph is created by calling sym_gen(). (The default bucket key is used as the maximum length, and the context is the CPU or GPU(s).) This is similar to the following design pattern, which you have probably written yourself:

def get_dictionary_value(self, key):
    if key not in self.dict:
        self.dict[key] = 0
    return self.dict[key]

A few lines after module creation, the module is allocated in memory on the specified device (e.g., a GPU). (Sockeye defaults to looking for a GPU, unless you specify --use-cpu to training or inference).

self.module.bind(data_shapes=provide_data,
                 label_shapes=provide_label,
                 for_training=True,
                 force_rebind=True,
                 grad_req='write')

Here, the data and label shapes are provided, which allows the computation graph to figure out how much memory it needs. The computation graph is executed later, in the training.fit() function, where the data iterators from the previous section are iterated over. (MXNet provides a generic fit() implementation, but Sockeye uses its own in order to have more control over stopping conditions and epochs.)

In train.py, Sockeye creates a training data iterator. This object iterates over the training data, returning a mx.io.DataBatch at each call to next() from the EarlyStoppingTrainer:

def fit(...):

    [snip]

    while True:
        batch = next_data_batch
        self._step(self.model, batch, checkpoint_frequency, metric_train, metric_loss)

        [snip]

        next_data_batch = train_iter.next()
        self.model.prepare_batch(next_data_batch)

This batch object is passed directly to TrainingModel.run_forward_backward(), which passes the call to the internal module:

def run_forward_backward(self, batch: mx.io.DataBatch, metric: mx.metric.EvalMetric):
    """
    Runs forward/backward pass and updates training metric(s).
    """
    self.module.forward_backward(batch)
    self.module.update_metric(metric, batch.label)

That’s basically it.

The Transformer

This section introduces you to Sockeye’s implementation of the Transformer model. The goal is not to teach you how the Transformer works (for that, there are a number of good tutorials by Michał Chromiak, Jay Alammar, and Sasha Rush), but to show how to follow its implementation in Sockeye’s code.

As noted above, with respect to Sockeye’s graph skeleton, the Transformer model is distinct from the other models only in its implementation of the encoder and decoder phases. These implementations are spread across three files: encoder.py, decoder.py, and transformer.py. The first two files are the main ones, with the third containing the TransformerConfiguration as well as support routines used by the encoder or decoder or both.

Transformer Encoder

The top level of the transformer encoder is expressed succinctly in the following code (and below, in Figure 1). It receives data as input, which is a group of source sentences encoded in a batch. The batch has a max length (the bucket key), and data_length records the actual length of each sentence in the batch, for masking purposes.

def encode(self,
           data: mx.sym.Symbol,
           data_length: mx.sym.Symbol,
           seq_len: int) -> Tuple[mx.sym.Symbol, mx.sym.Symbol, int]:
    """
    Encodes data given sequence lengths of individual examples and maximum sequence length.

    :param data: Input data.
    :param data_length: Vector with sequence lengths.
    :param seq_len: Maximum sequence length.
    :return: Encoded versions of input data data, data_length, seq_len.
    """
    data = utils.cast_conditionally(data, self.dtype)
    if self.config.dropout_prepost > 0.0:
        data = mx.sym.Dropout(data=data, p=self.config.dropout_prepost)

    # (batch_size * heads, 1, max_length)
    bias = mx.sym.expand_dims(transformer.get_variable_length_bias(lengths=data_length,
                                                                   max_length=seq_len,
                                                                   num_heads=self.config.attention_heads,
                                                                   fold_heads=True,
                                                                   name="%sbias" % self.prefix), axis=1)
    bias = utils.cast_conditionally(bias, self.dtype)
    for i, layer in enumerate(self.layers):
        # (batch_size, seq_len, config.model_size)
        data = layer(data, bias)
    data = self.final_process(data=data, prev=None)
    data = utils.uncast_conditionally(data, self.dtype)
    return data, data_length, seq_len

There are a few small items here. The data is cast to 32-bit floats (effectively a NOOP). Dropout is enabled, if requested, via an MXNet primitive. Next, the bias is created, with dimensions (batch_size * heads, 1, max_length). (This is done because the “self-attention” block works with this dimension, and this way, the bias doesn’t have to be reshaped.)
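As far as I can tell, the bias is essentially a mask holding a very large negative value at every padded source position; adding it to the attention logits before the softmax drives the attention weights for padding to (effectively) zero. Here is a tiny numpy sketch of the idea (not Sockeye's get_variable_length_bias, and ignoring the folding over heads):

import numpy as np

# Illustration of a variable-length attention bias (not Sockeye's implementation).
def length_bias(lengths, max_length, neg=-1e18):
    """lengths: valid lengths, shape (batch,); returns a bias of shape (batch, 1, max_length)."""
    positions = np.arange(max_length)                             # (max_length,)
    padded = positions[None, :] >= np.asarray(lengths)[:, None]   # True where position is padding
    return np.where(padded, neg, 0.0)[:, None, :]

bias = length_bias(lengths=[3, 5], max_length=5)   # for the first sentence, positions 3 and 4 get -1e18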

Next, we create and link the layers by iterating over them. Each layer is a TransformerEncoderBlock:

    self.layers = [transformer.TransformerEncoderBlock(
        config, prefix="%s%d_" % (prefix, i)) for i in range(config.num_layers)]

When data = layer(data, bias) is called, the __call__() method is invoked on each of these layers. This method builds a new layer and links it up with the layer below it, which is passed in as an argument. This new layer is then returned for further chaining to arbitrary depths. The complete implementation is here:

def __call__(self, data: mx.sym.Symbol, bias: mx.sym.Symbol) -> mx.sym.Symbol:
    # self-attention
    data_self_att = self.self_attention(inputs=self.pre_self_attention(data, None),
                                        bias=bias,
                                        cache=None)
    data = self.post_self_attention(data_self_att, data)

    # feed-forward
    data_ff = self.ff(self.pre_ff(data, None))
    data = self.post_ff(data_ff, data)

    if self.lhuc:
        data = self.lhuc(data)

    return data

When executed, this code constructs the symbol graph for a single layer, linking up the following pieces: a pre-processing block feeding multi-head self-attention, a post-processing block, a second pre-processing block feeding the feed-forward layer, another post-processing block, and (optionally) an LHUC layer.

This is visualized in Figure 1 (except for LHUC). The blue boxes denote the dimensions of the tensors that are output from each sub-block (unannotated lines keep the same shape as their inputs). Each block is also labeled with the Sockeye class that processes it.

Figure 1. Sockeye's Transformer encoder block

Getting back to the code, these items are all defined in the TransformerEncoderBlock initializer:

        self.pre_self_attention = TransformerProcessBlock(sequence=config.preprocess_sequence,
                                                          dropout=config.dropout_prepost,
                                                          prefix="%satt_self_pre_" % prefix)
        self.self_attention = layers.MultiHeadSelfAttention(depth_att=config.model_size,
                                                            heads=config.attention_heads,
                                                            depth_out=config.model_size,
                                                            dropout=config.dropout_attention,
                                                            prefix="%satt_self_" % prefix)
        self.post_self_attention = TransformerProcessBlock(sequence=config.postprocess_sequence,
                                                           dropout=config.dropout_prepost,
                                                           prefix="%satt_self_post_" % prefix)

        self.pre_ff = TransformerProcessBlock(sequence=config.preprocess_sequence,
                                              dropout=config.dropout_prepost,
                                              prefix="%sff_pre_" % prefix)
        self.ff = TransformerFeedForward(num_hidden=config.feed_forward_num_hidden,
                                         num_model=config.model_size,
                                         act_type=config.act_type,
                                         dropout=config.dropout_act,
                                         prefix="%sff_" % prefix)
        self.post_ff = TransformerProcessBlock(sequence=config.postprocess_sequence,
                                               dropout=config.dropout_prepost,
                                               prefix="%sff_post_" % prefix)
        self.lhuc = None
        if config.use_lhuc:
            self.lhuc = layers.LHUC(config.model_size, prefix=prefix)

The TransformerProcessBlock appears many times throughout this code. It is a layer which performs pre- and post-processing of data sequences within the Transformer, applying any subset of layer (n)ormalization, (r)esidual connections, and (d)ropout. The sequences are controlled with the --transformer-preprocess and --transformer-postprocess flags, documented here, which default to ‘n’ and ‘dr’, respectively.
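A rough sketch of how such a sequence string is interpreted (an illustration of the idea only; see TransformerProcessBlock for the real code, which works on MXNet symbols and assumes these callables exist):

# Conceptual sketch of an 'n'/'r'/'d' processing sequence (not TransformerProcessBlock).
def process(data, prev, sequence, layer_norm, dropout):
    """sequence: a string over {'n', 'r', 'd'}; layer_norm and dropout are assumed callables."""
    for step in sequence:
        if step == 'n':
            data = layer_norm(data)     # layer normalization
        elif step == 'r':
            data = data + prev          # residual connection to the block's input
        elif step == 'd':
            data = dropout(data)        # dropout
    return data

# e.g. the default post-processing sequence 'dr' applies dropout, then the residual connection.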

Pulling this all together, we have something that is very similar to the encoder side of Figure 1 in the Transformer paper. Differences from that diagram are:

  • Sockeye adds explicit pre-processing blocks before the multi-head self-attention and feed-forward layers.
  • Sockeye (by default) applies layer normalization before the multi-head attention and feed-forward layers, and applies dropout afterwards; residual connections remain in place.

Multi-headed Self Attention

Most of the layers above are clear enough. However, multi-head attention, and all its variants—multi-head self-attention in the encoder, and multi-head (source) attention and masked multi-head (self) attention in the decoder—benefit from some further explanation. Recall that the goal of attention is to (a) compute a distribution across source words and (b) use this distribution to produce a weighted sum of representations. The first part (a) is computed with a softmax over the comparison of a hidden state against each of the source encodings, and the second part (b) is computed by multiplying this distribution against those same source encodings.

First, a note on terminology, specifically related to Vaswani et al.’s generalization (Section 3.2) of queries, keys, and values. They write:

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

This is a nice generalization over the kinds of attention used by the Transformer. Writing specifically of multi-head encoder self-attention, it might not be clear from their paper that the queries, keys, and values are all the same thing: for layer \(l\), the queries, keys, and values are all the encoder hidden states from the previous layer. The hidden state representation for each source word is compared to every other word, including itself, to produce a distribution over all the words. That distribution is then used to produce a weighted sum of the source representations.

Here we focus just on multi-head self-attention, which is the only variant used in the encoder. Self-attention provides two main benefits: (a) direct access to the encoding for any word, and (b) removal of the source-side recurrent computations. Direct access to encodings is possible because each state is computed as a weighted sum of the states of the layer below it. (For the first encoder layer, the layer beneath is the positionally-encoded embeddings.) Contrast this to an RNN, where the encoding for each word is a function of (a) the encoder output for the previous word in the same layer and (b) the encoder output for the same word in the previous layer (again, with the embeddings acting as a 0th layer). In the RNN setting, for source word i, information about the states of words other than i must be filtered through the encoder state mechanism. This setup attenuates the influence of long-distance dependencies. The transformer allows the encoder state at each position to look directly at any word in the source.

A consequence of this is that all the encoder states in a layer can be computed simultaneously, which speeds up training. Sockeye accomplishes this in layers.py in the MultiHeadSelfAttention class. When the computation graph for the layer is constructed, the following code block (MultiHeadSelfAttention.__call__()) is executed:

# inputs shape (batch, max_length, depth)
combined = mx.sym.FullyConnected(data=inputs,
                                 weight=self.w_i2h,
                                 no_bias=True,
                                 num_hidden=self.depth * 3,
                                 flatten=False,
                                 name="%sqkv_transform" % self.prefix)
# Shape: (batch, max_length, depth)
queries, keys, values = mx.sym.split(data=combined, num_outputs=3, axis=2)

if cache is not None:
    # append new keys & values to cache, update the cache
    keys = cache['k'] = keys if cache['k'] is None else mx.sym.concat(cache['k'], keys, dim=1)
    values = cache['v'] = values if cache['v'] is None else mx.sym.concat(cache['v'], values, dim=1)

return self._attend(queries,
                    keys,
                    values,
                    lengths=input_lengths,
                    bias=bias)

This takes the input (variable inputs with shape (batch, max_source_len, input_depth)), and projects the entire input through a fully connected layer that triples its size. (Note that this means that the queries, keys, and values are not passed directly, but instead go first through this feed-forward layer). Symbol inputs has shape (batch_size, max_length, input_depth) and combined has shape (batch_size, max_length, input_depth * 3). These projections then become the queries, keys, and values with the call to mx.sym.split(). Next is caching, which I will skip because it is only used by decoder self attention. The final piece is computing the context vector via a call to _attend() in MultiHeadAttentionBase. It is defined as follows. I will walk through it:

def _attend(self,
            queries: mx.sym.Symbol,
            keys: mx.sym.Symbol,
            values: mx.sym.Symbol,
            lengths: Optional[mx.sym.Symbol] = None,
            bias: Optional[mx.sym.Symbol] = None) -> mx.sym.Symbol:
    """
    Returns context vectors of multi-head dot attention.

    :param queries: Query tensor. Shape: (batch_size, query_max_length, depth).
    :param keys: Keys. Shape: (batch_size, memory_max_length, depth).
    :param values: Values. Shape: (batch_size, memory_max_length, depth).
    :param lengths: Optional lengths of keys. Shape: (batch_size,).
    :param bias: Optional 3d bias.
    :return: Context vectors. Shape: (batch_size, query_max_length, output_depth).
    """

The queries are first scaled down by dividing them by the square root of the per-head depth (depth_per_head):

# scale by sqrt(depth_per_head)
queries = queries * (self.depth_per_head ** -0.5)

Next, split_heads() is called, which reshapes and transforms the input symbol from (batch_size, source length, depth) to (batch * num_heads, source length, model_size / heads). The multiple heads are just multiple independent attention mechanisms, each computed over the full source and distinguished in training only by their random initializations. Below, they will be projected down to a smaller dimension and concatenated back together, such that the original input dimension, called the “model size”, is restored. As a result of this downward projection and concatenation, the model size (Sockeye: --transformer-model-size) must be divisible by the number of heads (--transformer-attention-heads). The defaults are 512 and 8. The model size must also be equal to the embedding size, since the embeddings serve as the input to the first layer’s self-attention.
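To make the reshape concrete, here is a rough numpy version of a split_heads-style operation (an illustration only, not Sockeye's split_heads):

import numpy as np

# Illustration of splitting the depth dimension into heads (not Sockeye's split_heads).
def split_heads_np(x, heads):
    """(batch, length, depth) -> (batch * heads, length, depth // heads)."""
    batch, length, depth = x.shape
    x = x.reshape(batch, length, heads, depth // heads)
    x = x.transpose(0, 2, 1, 3)          # (batch, heads, length, depth per head)
    return x.reshape(batch * heads, length, depth // heads)

print(split_heads_np(np.zeros((2, 7, 512)), heads=8).shape)   # (16, 7, 64)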

This is done to the queries, the keys, and the values.

# (batch*heads, length, depth/heads)
queries = split_heads(queries, self.depth_per_head, self.heads)
keys = split_heads(keys, self.depth_per_head, self.heads)
values = split_heads(values, self.depth_per_head, self.heads)

The last operation basically multiplies out the first dimension of the tensor to (batch * num_heads). Next, Sockeye broadcasts the lengths of the input sentences in the batch across this flattened axis. The lengths are used in the decoder for masking the attention mechanism so that it cannot see timesteps in the future:

lengths = broadcast_to_heads(lengths, self.heads, ndim=1, fold_heads=True) if lengths is not None else lengths

Finally, we compute dot attention. This corresponds to Equation 1 in Vaswani et al.. It is itself somewhat involved, but I am not going to go into it because I am tired. You can read the short function for yourself here. Basically, it makes use of MXNet primitives (mx.sym.batch_dot, mx.sym.SequenceMask, and mx.sym.softmax) to compute the context vector for each attention head.

# (batch*heads, query_max_length, depth_per_head)
contexts = dot_attention(queries, keys, values,
                         lengths=lengths, dropout=self.dropout, bias=bias, prefix=self.prefix)
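For reference, here is what Equation 1 boils down to in plain numpy, ignoring the bias, length masking, and dropout (an illustration of dot-product attention, not Sockeye's dot_attention; the 1/sqrt(depth_per_head) scaling was already applied to the queries above):

import numpy as np

# Illustration of (pre-scaled) dot-product attention (not Sockeye's dot_attention).
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_attention_np(queries, keys, values):
    """queries: (batch*heads, query_len, d); keys, values: (batch*heads, key_len, d)."""
    scores = queries @ keys.transpose(0, 2, 1)    # (batch*heads, query_len, key_len)
    weights = softmax(scores, axis=-1)            # one distribution per query position
    return weights @ values                       # weighted sum of the values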

The results of the attention heads are then rearranged to transform shape (batch * heads, source_len, depth_per_head) to (batch, source_len, depth). From that, another feed-forward layer yields the contexts for the layer.

# (batch, query_max_length, depth)
contexts = combine_heads(contexts, self.depth_per_head, self.heads)

# contexts: (batch, query_max_length, output_depth)
contexts = mx.sym.FullyConnected(data=contexts,
                                 weight=self.w_h2o,
                                 no_bias=True,
                                 num_hidden=self.depth_out,
                                 flatten=False)

return contexts

In this way, the attention mechanisms for all heads, and for all words in the sentence, and all sentences in the batch, are computed in parallel.

Transformer Decoder

This section is incomplete.

The Transformer decoder is a lot like the Transformer encoder, with two main differences. The first is that each decoder layer’s self-attention is masked. Analogous to self-attention in the encoder, it is an attention block over the target-side word representations one layer down; but masking is needed to mirror the inference-time scenario, where words are generated one by one, left to right. We therefore need to ensure that the decoder does not attend to words that haven’t been generated yet, which is accomplished by masking out those positions in the fixed-length vector.
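Conceptually, that mask is just another additive bias, this time over future target positions. A tiny numpy sketch of such a causal bias (an illustration only, not Sockeye's code):

import numpy as np

# Illustration of a causal ("no peeking ahead") attention bias.
def causal_bias(length, neg=-1e18):
    """Position i may attend to positions <= i; future positions get a large negative bias."""
    future = np.triu(np.ones((length, length), dtype=bool), k=1)   # True above the diagonal
    return np.where(future, neg, 0.0)     # (length, length), added to the attention logits

print(causal_bias(3))   # row i has 0.0 up to column i and -1e18 afterwards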

The second difference is that each decoder layer also includes an attention block over the source sentence. The decoder source-attention and self-attention blocks differ in their inputs: masked (target-side) self-attention takes as input the positionally-encoded target-side sequence, while the source-attention block takes the output of that block together with the top layer of the encoder. This way, the decoder is able to attend both to the source and to the target words as they are generated.

Inference

This section is incomplete.

In training, we created (via the bucketing module) a symbolic graph that was “rolled out” to the lengths defined by the bucket key, a tuple containing the maximum source and target length of any sentence in the batch. It was then executed via a single call. This is a nice scenario that we are no longer able to use at inference time, due to the following differences from training:

  • We don’t know the length of the target sentence, so we can’t construct a single symbolic graph.
  • We don’t know the words of the target sentence, so we don’t have the words needed for decoder self-attention.

Inference in Sockeye and MXNet, therefore, differs from training. We can still compute the embeddings and run the encoder stage for all three architectures, but decoding is quite different. Instead of rolling out the entire decoder graph, we repeatedly construct a decoder graph that is rolled out just a single step. We will then effectively run this single decoding step again and again, each time feeding it the relevant outputs from the previous time step, until some stopping criterion is reached. We will also expand this mechanism to enable beam search.
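In rough pseudocode (a greedy, beam-size-1 sketch of this idea with assumed helper callables, not Sockeye's _beam_search):

# Greedy sketch of step-by-step decoding (illustration only; Sockeye uses beam search).
def greedy_decode(run_encoder, decode_step, bos_id, eos_id, max_steps):
    """run_encoder(): returns the initial decoder states;
    decode_step(word, states): returns (scores over the target vocabulary, new states)."""
    states = run_encoder()
    word, output = bos_id, []
    for _ in range(max_steps):
        scores, states = decode_step(word, states)   # one single-step pass of the decoder
        word = int(scores.argmax())                  # pick the most probable next word
        if word == eos_id:
            break
        output.append(word)
    return output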

This section will walk through how Sockeye accomplishes inference. Perhaps surprisingly, a single generic interface is used at inference time, as well, enabling decoding with all three neural architectures. This is possible because they are all auto-regressive in nature.

Build the network

This happens in InferenceModel.initialize(). There is one InferenceModel per “-m” switch from the command line.

  • Construct the encoder and decoder modules
  • Get the encoder and decoder shapes. Encoder shapes are the same for all models: an mx.io.DataDesc object with C.SOURCE_NAME. The decoder states vary by model and are passed from step to step.
  • bind() the module to the shapes [TODO: understand this]
  • Initialize the parameters of each of the modules [which can now be read because of the data descriptor that was bound to them?]

Once the models are all loaded, translate.py calls translate(), which in turn calls Translator.translate() in inference.py. Eventually we get to _beam_search().

Run the encoder

Here we call translator._encode(), which calls model.run_encoder() for every model. Each one of these returns a ModelState, which wraps together each model’s decoder states. Each model has only a single encoder output, but many different decoder outputs.

After running the encoder by calling encoder_module.forward(), we get its outputs by calling encoder_module.get_outputs(). We then repeat these states over and over to fill the entire beam, and initialize a ModelState object with them.

writing in another language ⦿

I recently read the little book “Language”, written by Xiaolu Guo. It is really an excerpt from a longer book, part of a series of little books called “Vintage Minis”. I should say that I quite like little books like this one. When I see a book that I want to read, but it has a thousand pages, I think: I am too busy, I don’t have time for all those words. But when I see a book with a hundred pages, I think: aha! I certainly have the time to read this little book. And I say that about ten different books. Perhaps I simply cannot give my attention to a thousand pages.

In any case, this little book was not my favorite of the series, but I liked it. The author’s English is bad, but her observations are very interesting and funny, and the broken language adds to the interest. For example, here the author has just arrived in London for the first time, alone, to learn English in order to help her parents with their business. She writes:

I going back quickly to Nuttington House. Red old carpet, red old curtain, red old blanket. Better switch off light.

Night long and lonely, staying nervously in tacky room. London should be like emperor’s city. But I cannot feel it. Noise coming from other room. Laughing in drunkenly way. Upstairs TV news speaking instensely nonsense. Often the man shouting like mad in the street. I worry. I worry I getting lost and nobody in China can find me anymore. How I finding important places like Buckingham Palace or Big Stupid Clock? I looking everywhere but not seeing big posters of David Beckham, Spicy Girls, or President Margaret Thatcher. In China we hanging them everywhere. English person not respecting their heroes or what?

Her distaste for the country, her misuse of words: these are like poetry. I suspect it is an affectation, one that plays on the prejudices and expectations of a person like me, but I appreciate it.

In any case, this little book gave me permission to write my own bad French. I cannot hope to climb to her level, but I no longer fear failure. My goal here comes in two parts: for now, to write, for the joy of it; and in the future, to write something that Google Translate cannot translate.

replanting a grape vine ⦿

I’ve been impressed often enough at the combinatoric availability of documents and especially videos on the web for even the most specific of the nearly limitless tasks one can do, that I’m at least mildly disappointed when I can’t find one that I need. By combinatoric, I mean that one is able to have highly specific questions answered in a user-friendly manner, implying that variations of those questions might also have answers. For example, I recently wanted to change the struts on our 18 year-old minivan, and found a walk-through video on Youtube for our particular model, complete with annotations of specific tools I’d need. (That video saved me $500 and I was quick to send “Fix It Angel” a token of my appreciation). When Google or Youtube fail to deliver I sometimes have this gnawing feeling that maybe I should add to this wealth of information instead of just taking from it, so what follows is my small contribution.

I am building a pergola at my rear property boundary to lend some privacy and provide a structure for grape vines, and had made what seemed to me a mistake of planting one of the vines in a small cavity in the alley wall, which I feared would limit its growth. Poking around was inconclusive about whether a Zone 7b one year-old grape vine about five to seven feet in length could be safely moved once in full leaf bloom. But it seemed to me that it could be done if one took care to carefully dig out the roots, and I am happy to report that this is in fact the case.

Some notes:

  • I took great care to gently dig out as many of the roots as I could, particularly what I thought of as the smaller “capillary” roots. This required me to follow many branches along and dig out fairly deep. I did break a few of them, including one that I didn’t want to bother going too deep with.
  • The root ball was out of water for maybe half an hour. I dug a big hole and put down a layer of good bag soil, then laid out the roots, covered them with more soil, and doused everything with water. Were I to do this again, I’d soak it in a bucket or cover it in damp rags to minimize the dry time.
  • After transplanting, the leaves were all limp within four hours. I covered the whole vine with a tarp, on the idea that until the roots could adjust and supply enough water, it would be best to keep the hot sun off them.
  • I poured about 2–3 gallons of water over the roots twice a day for two days afterward.

Two days later, the leaves had recovered their zest, and I only lost a few of them.

So the answer is: yes, you can transplant a one-year-old grape vine in full bloom in USDA Zone 7b, and it will recover within a day or two, provided you preserve most of the roots, shield the leaves, and water it regularly.

[Update: it seems I didn’t look hard enough, as my-grape-vine.com has this covered, so consider this a corroboration.]