The UX of Sharing Photos with Google Photos and Instagram

I had one of the most frustrating technology-related moments of my life on Mother’s Day. My aim was simple: share a few old and new photos of my mother and me on Instagram. All of the photos are already in Google Photos, so this should be a relatively simple task, right? With the pieces all set, let’s see what actually happened.

Google Photos actually made it quite easy to find photos of my mom. I simply typed her name in the search bar and many years’ worth of photos popped up. I selected a few and created a new album called “Mother’s Day 2018”. This held 7 photos that I thought would be fun to share.

Popping over to Instagram, I hit the + symbol at the bottom of the app and am greeted with a grid of options on the bottom third of the screen with the currently selected photo taking up ~55%. None of the photos I selected in Google Photos are visible. At the top I notice a dropdown that says “Gallery”. I tap it and am greeted with a long list of items including Videos, Instagram, IMG_20180120_162026, Hangouts, FaceApp, Download Camera, 100APPLE, Other…, Photos, Twitter, Other Photos, Allo Images, and a few more options. Some of these I can guess what they are; others are less obvious. The option “Other Photos” appears to be a weird mix of screenshots, gifs, and scans. “Photos” appears to match what’s in “Gallery”. The APPLE ones seem to be a mix of photos from my old iPhone. None of them, from what I can tell, have any ability to give me the 7 photos I selected in Google Photos.

I hop back over to Google Photos and attempt to share from the app. Selecting a single photo will allow me to share with Instagram as a post, but as soon as I select multiple photos, that option disappears and is replaced with a DM link instead. Presumably that would send a link to the Google Photos album to a user of my choice.

This is the first major roadblock I hit. I’m not going to scroll through the entire gallery to find the correct 7 photos I want to post. Even if I did do that, the UX would be miserable as well because selecting multiple photos on Instagram is generally an annoying experience.

My new goal becomes figuring out how to get these 7 photos as the first ones that appear when you open Instagram. To do this, I download the album to my PC. I unzip the photos and then upload them right back to Google Photos. I’m greeted with an option to add to album or to share. I add them to a new album called “delete”. Back in the main view of the app, the photos I just uploaded aren’t present. They’re in the album, but they’ve been sorted based on the date they were taken.

I remember that photos have EXIF data stored with them, so I attempt to remove the EXIF data. First I copy the files to new ones. This does not remove the data. After some googling, I learn you can go to the properties of a photo, navigate to details, and click “Remove Properties and Personal Information”. There’s an option to make a copy of the photo and remove as much information as possible. I do that and now I have copies of the 7 photos in the same folder. These copies are then uploaded to Google Photos.

I’m expecting these photos to be the first 7 that appear in the “Photos” tab within Google Photos. Contrary to expectations, only three appear. This is deeply frustrating. I immediately want to see what’s different about those photos that would cause Google Photos to organize them in an unexpected order. The mission then becomes how to find my most recently uploaded photos. Nothing in the main UI indicates how I might do that. The hamburger menu is unhelpful. I search for “recent” and get no results. “Recently added” returns no results. “Latest” gives me two photos that don’t appear to be connected to that word in any way whatsoever. I happen to click the “X” button on my search and realize that some options pop up when there is no text in the search bar. This wasn’t obvious to me because I typed my initial query and hit enter rather quickly.

In the third modal-esque box I see Videos, Selfies, Movies, Animations, and Show More. Clicking “Show More” drops down a bunch more options and at the bottom I see “Recently Added”. Clicking this shows all of the photos I was hoping to find. The url has the word “_tra_” after /search/, but searching for “_tra_” returns no results as does searching for “tra”.

Clicking on one of the photos that showed up at its original date in the timeline reveals that Google Photos still has the original date the photo was taken. When I look at the properties of the photo on Windows, I see that the Created, Modified, and Accessed dates are all today’s date. It’s unclear to me why there is this discrepancy. I’m guessing there’s something in the EXIF data that I’m not aware of, but I don’t know how to fix it. Even more frustrated, I proceed to manually change the date on the photos that retained the original date. Now within the web version of Google Photos, I see all of the photos at the top of the page under “Today”. My assumption here is that these will now sync to Google Photos on my phone and then I will be able to upload them to Instagram.

On my phone (Pixel 2XL), these photos don’t appear. As a user, I want to force a sync to occur. On my PC I do a hard refresh (Control + Shift + R) and confirm that the photos are there. On my phone, I close all apps, restart my phone, and even turn sync on and off and the photos don’t appear. Going into Instagram, I don’t see these photos on any of the options.

At this point I’m becoming frustrated and angry. I almost stop trying to make this happen but I want to keep going because the photos are meaningful to me and I legitimately want to share them. My sisters probably don’t remember some of these photos and I think they’d like it. My wife notices I’m upset and after telling her my frustrations, she suggests I just text them to myself. To me that’s a ridiculous way of doing it because, after all, there must be a better way. But unfortunately, I know it will work. So I do it anyways.

I open Google Inbox and use hangouts to type in my phone number and open a new chat window. I copy the photos from my PC into the chat and text them to myself.

Sidebar: Trying to recreate the chain of events for this post leads me to another quirk. Evidently, my phone number is different in Google’s eyes depending on whether I include the country code (+1) or not. With it included, my PC shows the invite as pending. On my phone, I see the chat with my number and the country code at the top and a message displaying “+1 hasn’t accepted your invitation”. Digging a little deeper, I find that I have a separate hangout for my phone number without the country code. This one is correctly associated with my name and Project Fi account. I don’t know what to even make of this.

Now that the photos are on my phone, I proceed to manually download each photo via the triple-dot option at the top and “Save Media”. Evidently, these go directly to my Google Photos library, which, after all of this, is surprising to me. I do this for each of the seven photos and they all successfully appear at the top of my Google Photos library and at the top of the options list when I’m in Instagram. I’m then able to proceed to crafting a multi-photo post.


This entire process is insane and I feel like it should not be happening in 2018. It seems crazy to me that a simple task like “choose multiple photos from the primary photo app on a flagship phone and share them on Instagram” is such a complicated process. I get it, Google and Facebook are two big competing companies with different goals. But the difficulty and lack of interoperability between them make me hate them both. This process has left me with so many questions.

Why was this so difficult to do?

Is this really such an uncommon thing to want to do?

Why can’t I directly share motion photos on Instagram? This seems like a perfect thing to share!

Why did my PC not properly strip out all of the data on some photos but do it on others? Or is this really Google’s fault?

Why are my photos not syncing? (As of this writing, I have a photo I uploaded to Google Photos on my PC that is not on my phone and a photo that I saved to my phone that is not synced to Google Photos on my PC.)

Why don’t products want to give users a “Sync Now” button? (1Password does not want to do this either as it should “just work”).

Why does a modern web app like Google Photos eschew easily accessible and common photo app features like “latest upload”?

What are all of these photo folder options in Instagram? Where can I access those? What does “Gallery” actually mean?

Most of all, why does it feel like these frustrating UX experiences are becoming more common and not less?

Perhaps I don’t have great data but it certainly feels like the bad experiences are trumping the good ones lately. I do notice good UI and UX when they happen, but the bad UX experiences have stood out so much more to me lately. This isn’t the first bad experience I’ve had but it’s the first one I’ve taken the time to write at length about. There are other experiences I’ve had recently that I’m tempted to write about (Nest’s billing vs home address, student loan refinancing docs, and a billing payment error that requires me to call them to finish paying).

I’m not a UX, design, or even product expert. But I am a tech-savvy consumer and I know that it’s possible to make these experiences better. The technical challenges are solvable; it’s the people and process around these features that have failed. But that’s fixable too, I think. My hope in writing about these is to release some of my own tension about the process and to highlight areas for improvement. Also, I hope to encourage and inspire more people to care about the little things more. A little love goes a long way.

New Job

After nearly four and a half years at Concert Genetics, I moved on to a new position. I’m at GitLab working as a Data Engineer on their BizOps project. I’m very excited about the opportunity and have really enjoyed my time at the company since starting on January 29.

I wanted to reflect a bit on the advice I gave in an earlier post about finding a new job. My secret when writing that post was that I was actively looking for a new position, but I had to frame it as a conversation with another person because I wasn’t public with that fact. The conversation really did happen in real life, but I was also sharing my own personal strategy for job hunting.

Create a Portfolio: C-

I didn’t have much of a portfolio initially. Most of my code wasn’t open source, and since I wasn’t really on the engineering side of things I wasn’t able to contribute as much as I’d like. I did have my blog, some Kaggle competitions, and some GitHub contributions, but it wasn’t what I would consider impressive or reflective of what I wanted to do. The cool thing about my new gig is that nearly all of my work is open source.

Work on Projects, not Books: B

I was working my way through the fast.ai course and I found it to be a useful way to learn and showcase my skill. The course was advanced for the jobs I was applying for, but it showed interest in the field.

Write a Blog: B-

I wrote two posts while job hunting. Applying for jobs is a lot of work. I was spending my mornings and evenings researching jobs, writing cover letters, and tweaking my resume. After I got the job I didn’t write as much, but I want to change that.

Apply even if you aren’t perfect: A

I applied to a lot of jobs (close to 40, I think). With each one, minus a few early misfires, I took a look at the requirements and specifically crafted my resume and cover letter to the job description. Most of the jobs I applied to were Data Science and Data Engineering jobs, with a few Data Manager jobs thrown in. For some, I met <50% of the requirements, but I felt like I could learn on the job. For others, I knew there was no chance I’d get the position, but I threw a line out anyways because I’d rather they tell me no instead of me preemptively doing it.

Don’t be desperate: A+

I really honed this skill when I was first looking for a job after graduate school. Back then I was desperate for a job, but I couldn’t act like it. It’s a big turn-off. I took the same approach here: I really wanted a new position, but in all of my conversations I portrayed myself as someone eager to take a position, but only if it made sense for me. This was an accurate portrayal, but it came from a position of power: I already had a job and didn’t have to take something that wasn’t a good fit. In fairness, I was much less desperate than back in 2013, but the strategy is still valid.

On top of these recommendations, I have a few more I’d probably add.

For remote positions, aim a little lower

If you’re applying for remote positions (which I was) aim slightly below your skill level. You may be hot shit in your local market but once you’re competing with the whole world, your skills may not be competitive. I took a pure Data Engineer position, which is kind of a step down from a Director level position, but it works with my current career strategy and I’m very happy in my current role.

Don’t be cheap, pay for help

Get some kind of help in your search. There are people that do this work professionally and they know more than you. If you’re serious and can afford it, pay some money to make your resume and cover letter look better or to have somebody look them over and recommend changes. I paid for a service called Up to Work which gave me a nice way to make my cover letter and resume look much better. They also have a ton of free blog posts recommending how to write effective resumes and cover letters. Use them to your advantage!

Get lucky

I mentioned this briefly in the other post linked above, but sometimes you’re going to have to just get lucky. I feel incredibly lucky that I was able to score this position with GitLab: It’s a great opportunity and a great company. My timing was good and I checked enough of the boxes. So as much as it sucks, yeah, you’re probably going to have to get a little lucky to get the job you want. But the above strategies will help you.

Break a leg out there!

Pop Quiz - Algorithms

I was recently quizzed over the phone about my algorithmic knowledge. I’m, evidently, not the best at defining terms or concepts on the spot, and my answers certainly reflected that. I’ve always struggled with pure memorization of facts, and the pressure of accurately defining them on the spot makes it even worse. But, with that said, these are things I should know, so it’s worth it to take the time to explain and define them in a way that hopefully sticks in my brain better.

Linear vs Logistic Regression

There’s something about algorithms and math and the people who name things that makes them want to obfuscate what’s really going on. These should really just be called Continuous Regression and Binary Regression (or Categorical Regression).

In Linear (Continuous) Regression, the outcome (i.e. your dependent variable - dependent meaning its value depends on the other variable) can have any value. An example of this is the work I did in grad school: we’d adjust the concentration of analyte (say, glucose) in our culture medium (the analyte being the independent variable) and see what effect it had on the reaction rates, or fluxes (this was the dependent variable). You would then use linear regression to figure out how glucose predicts the flux.

In Logistic (Binary/Categorical) Regression, you’re trying to predict which of a discrete set of outcomes occurs. Using cell growth as an example, a good use for Logistic Regression would be to predict whether the cells grow above or below a certain viability threshold given the glucose concentration in the medium.
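
For concreteness, here’s a minimal sketch with scikit-learn and made-up glucose numbers (not my actual grad school data): linear regression predicts a continuous flux value, logistic regression predicts a yes/no viability outcome.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

glucose = np.array([[1.0], [2.0], [5.0], [10.0], [20.0]])  # independent variable
flux = np.array([0.8, 1.5, 3.9, 7.2, 14.1])                # continuous outcome
viable = np.array([0, 0, 1, 1, 1])                         # binary outcome

linear = LinearRegression().fit(glucose, flux)
print(linear.predict([[8.0]]))          # predicted flux at a glucose level of 8.0

logistic = LogisticRegression().fit(glucose, viable)
print(logistic.predict_proba([[8.0]]))  # probability of each viability class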

Random Forests

I’ve never actually used random forests for my professional work, but they aren’t that bad to grok once you get the hang of it. And they’re actually named fairly well.

At a high level, random forests are essentially a bunch of decision trees that enable you to predict specific categories (classification) for your data. You build a bunch of these trees, each trained on a random subset of your data, and individually a lot of them will perform poorly. But enough of them won’t! The predictions of all the trees are then combined to give a pretty good prediction model (this is called ensemble learning). The idea being that the bad models will cancel each other out and the good models will win.
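
To make the ensemble idea concrete, here’s a rough sketch with scikit-learn and toy generated data (not anything from my own work):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset: 200 samples, 10 features, 2 classes
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 100 decision trees, each trained on a random bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))                 # the combined (ensemble) prediction
print(forest.estimators_[0].predict(X[:5]))  # what a single tree thinks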

I need to spend more time with it, but it seems like it works pretty well for many use cases.

Regularization

I’m not a fan of how this is named because it doesn’t bring to mind what it’s actually doing. I’ve been using regularization regularly (yuk yuk yuk) in the Fast.ai course I’m working my way through, but I didn’t have a strong definition of it in my head. Regularization is essentially a tool for reducing overfitting (overfitting being the inability of a model to generalize its predictions well).

For neural networks, we’ve been using random dropout (literally setting random weights to zero) on the network for regularization. The regularization that most people talk about is L1 vs L2. Simply put, these regularization techniques add to the objective function of the model (probably sum of squared errors) either the sum of the absolute values of the weights or the sum of the squares of the weights, respectively. This controls the model’s complexity by penalizing large weights, which keeps overfitting in check. Maybe a better name for it is penalization?
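
As a toy illustration (this isn’t the Fast.ai code, just the penalty terms written out), the two penalties look like this:

import numpy as np

weights = np.array([0.5, -2.0, 3.0])   # model weights
errors = np.array([0.1, -0.3, 0.2])    # prediction errors
lam = 0.01                             # regularization strength

sse = np.sum(errors ** 2)                      # base objective: sum of squared errors
l1_loss = sse + lam * np.sum(np.abs(weights))  # L1: penalize sum of absolute weights
l2_loss = sse + lam * np.sum(weights ** 2)     # L2: penalize sum of squared weights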

Regularization is actually one of 5 recommended techniques for reducing overfitting. The recommendation comes from the fast.ai course:

  1. Add more data
  2. Data Augmentation - (for images this would be shifting, rotating, and skewing the images)
  3. Using generalizable architectures - meaning batch normalization in the scope of this course
  4. Regularization
  5. Reduce architecture complexity

So, you can see that it’s actually lower on the list of things you can do to reduce overfitting. All of the other methods could be their own post / explanation (except, I guess, adding more data…).

Well, there you have it. Things I should know and could reasonably explain given time, but the spur of the moment questions just aren’t my bag. Hopefully, this is helpful to people in the future that need a more human-understandable definition.

Advice for Job Hunters

I was having coffee the other day with a woman, who I’ll call Sally, who’d reached out to me after a panel I was on at Vanderbilt. Every time I do one of those events, I tell everyone in the audience that I’m definitely up to grab coffee with anyone. Rarely does somebody reach out. Which makes it extra-special when somebody finally does.

Sally recently left her post-doc and was looking to get a job within the Data Science field. That, of course, is a loaded term encompassing different skill sets depending on who’s asking. But generally it means somebody who’s able to play with data, ask interesting questions, actually get the results, and then communicate them to a wider audience.

Our relatively quick coffee meeting was mainly centered around her asking for advice, and I happily obliged. I thought it’d be good to capture my thoughts as well since I think it’s valid for more than just Data Science.

Create a Portfolio

A portfolio, whether in art or data science, is a body of work that you can point to and say “I did that”. It’s a way for other people to see examples of your skillset and prove to them that you’re capable of shipping work. Especially in fields where there are no credentials, you need to have some way to show people that you have certain skills beyond just telling them.

Work on Projects, not Books

Related to the portfolio, is the idea that you need to learn by doing things. Sally asked whether it’d be a good idea to learn more R since she only had basic skills. I told her that yes, it’d be valuable, but don’t go off and read a book and just do the exercises. Pick a project that you can work on that will challenge you but also create something that you can show others. A simple R Shiny project with free data would be perfect.

Keep Meeting People

Coffee meetings around town are some of the best things you can do to meet people. Meetups, conferences, and similar events are ok, but in my experience they’re best used as a way to find interesting people and then have 1-on-1 meetings with them. Keep in mind that people are busy and so you might not be able to meet everyone, and also think about the delta between where you are and where they are. Cold-emailing the CTO of a large company might not work, but find somebody who’s newer to the company and see if you can reach out to them. I’ve had a few people randomly contact me through LinkedIn and I actually responded because their messages were friendly and unique.

Write a Blog

This blog post by Julia Evans is a great set of advice on the how of actually blogging. This post by Rachel Thomas is another great one. They say basically everything I want to say, but writing, for me, helps me think better.

Apply even if you aren’t perfect

Most job descriptions are describing the company’s perfect candidate. Those people rarely exist. If you meet ~50% of the criteria on a job description and you think it’s interesting, go ahead and apply. You’ll have to spend a little more time on your resume and cover letter, but it’s worth a shot. I’d rather have somebody who was really excited for the position but needed some training than somebody who checked all the boxes but just viewed the position as “meh”.

Don’t be desperate

No matter how much you need a job, it’s important not to appear desperate when talking to people or applying for positions. Nothing is more of a turnoff than somebody who is begging for a job. Pretend like you don’t need a job and you’ll come across as more appealing and easy to get along with. It’s tough, believe me. Back when I was desperate for a job I’d have to calm myself down in the car for about 5 minutes before meeting with people. But it pays off. Don’t be desperate.


Those are the main points we hit. All of this advice feels directionally right to me and it’s hard to see how any of them could backfire. It’s all about showing other people you’re interested and maximizing your opportunities for luck to happen. My Data Science career started with a ton of luck, and I’ll always admit that. But luck can’t happen unless you’re putting yourself out there.

Good luck to everyone on the job hunt!

Creating an Alexa Skill

One of the many great talks I went to at PyTennessee 2017 was Jason Bynum’s “Alexa Doesn’t Even Have Any Skillz”. I opted to go to his talk instead of some potentially more relevant ones (sorry Elasticsearch!) but I’m glad I made that choice.

Jason walked us through the basics of setting up an Alexa skill using Python. He then teased us with the promotion Amazon is currently offering, which is a hoodie if you publish a skill before the end of February.

The Start

Amazon is trying to convince as many people as possible to develop skills for the Alexa device. To encourage everyone to publish something, they’ve created a simple guide to create your own Fact Skill. As I am highly motivated by hoodies, I started from this Fact Skill and decided to tweak it as needed.

The first thing you need to do is create an account on developer.amazon.com. I linked my personal Amazon account because it was easier. Within the portal, you then select “Alexa” and then “Add a New Skill”.

Here’s where the first decision has to be made. You need to create a name for the skill and create an invocation name. This is the phrase that users will say to launch your skill. I knew I wanted to make my skill specifically for querying the HUGO Gene Nomenclature Committee website, so I called it “Gene Facts”.

Defining the Interaction

Because we’re working off a template, you don’t have to do much work to define the interaction model. But you could if you really wanted to. The awesome thing about what Amazon has done with Alexa is the way they’re defining the interaction model for humans and machines. The simplified version is you have intents and slots.

Intents are user-defined and represent the requests the skill can handle. Slots are user-provided arguments that are passed to the intent. The Fact skill is set up with the basic ones we need. We have GetNewFactIntent, which is our primary trigger for the skill, and we have 3 built-ins for Help, Stop, and Cancel. The cool thing about these is that you can add intent-specific data to them. For example, in the docs, they discuss a Horoscope example which has the primary intent GetHoroscope which then has two slots for sign and date. The date slot references an Amazon built-in dataset of available dates while the “sign” slot is built from a list of developer-specified words.

A decent analogy to the intent/slot idea is that of a function to which you pass specific arguments. The intent is the function definition and the slots are the data that you are passing to the function. Our fact skill doesn’t have any slots, or function arguments if you will, because we’re not making a request with specific data. It picks a random number and selects a fact from a subset of available data.
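
To make that analogy concrete, here’s a tiny sketch using the Horoscope example from the docs (hypothetical names, and written in Python rather than the Node the skill actually runs on):

# The intent is the function definition; the slots are the arguments.
def get_horoscope(sign, date):
    return "Here is the horoscope for {} on {}.".format(sign, date)

# "Alexa, ask Horoscope Reader what the horoscope for Gemini is today"
# roughly maps to:
get_horoscope(sign="Gemini", date="today")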

There is a lot you could do with the intents and slots model definition. Certainly more than the basic fact skill is showing. I’m excited for what people will come up with in the future, but I wanted to keep my own skill simple for now.

Once you’ve defined the intent schema and any custom slot types, you can specify sample utterances that people would use to interact with your skill. The utterances provided by the fact example are all pretty obvious. I simply changed the word “space” to “gene” and went with what they had.

AWS Lambda

The next step of the tutorial has you setting up a Lambda function. Lambda functions are quite cool because you don’t have to constantly run an EC2 instance to be watching and waiting for a trigger. The lambda function will get triggered and a box is spun up to quickly execute your code.
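
My skill uses the Node.js template, but just to show the shape of the handler model, here’s a minimal sketch of an Alexa-style Lambda handler in Python (the fact text and intent handling are simplified assumptions, not my actual skill):

def lambda_handler(event, context):
    # Lambda calls this function each time the skill is triggered;
    # `event` carries the Alexa request, including the intent name.
    intent = event.get("request", {}).get("intent", {}).get("name", "GetNewFactIntent")

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": "BRAF is a protein-coding gene."},
            "shouldEndSession": True,
        },
    }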

The tutorial is quite explicit about what you need to do to set up a Lambda function so I won’t go into detail about that. The problem wasn’t setting up the Lambda function; it was when I started messing with the code.

Node.js

I’m not a Node developer. Technically, I’m not really a Python developer either since I play with data mostly. But I have a much better intuitive grasp of Python than JavaScript and, in particular, Node. I wasn’t content with just replacing the space facts in the example skill with some crappy gene facts. I wanted to actually query an API and do something interesting.

The HGNC API is open and doesn’t require any authentication, so it’s a quick and easy way to dynamically pull data from the internet. Because I didn’t want to deal with user requested data (slots) I stuck with the randomized model of picking data. My initial attempt at the skill would pick a random number between 1 and 20000 and attempt to use that as the HGNC ID. If that wasn’t a valid ID, then it returned the information for BRAF. This turned out to be an unsatisfying solution because every third request was for BRAF.

I then decided to just download the entire protein-coding gene dataset from HGNC and get all the actual IDs. Then at least I wouldn’t always be returning BRAF. I saved this into an array and adjusted the gene selector to pick a random number from 0 to the length of the gene_ids array minus 1. This at least was a more valid way of randomly selecting a gene.

My next struggle was with making an HTTP request in Node. The cool thing about Node is that it’s asynchronous by nature. The frustrating thing with Node is that it’s asynchronous by nature. I initially had the speech results returning outside the request function because that was the basic format of the initial code. But by doing this I was constantly returning an incomplete sentence that had no data. The final emit statement wasn’t waiting for data to be returned from the request.

Of course the solution to this is to use callbacks, but I still don’t have a great understanding of how that really works, so I just shoved the emit statement into the request to make my life easier. Ugly? Yes. Does it work? Sure!

Testing and Running

With my code mostly working, I now had to test that I was returning what I expected from the program. The “quickest” way to do this was to zip everything up, upload it to AWS Lambda, and test it within the skill development portal. I’m sure I could’ve tested this locally better, but going through this circuitous route was faster in the moment. I selected the folder with all of my code and node modules in it and compressed it using the Mac’s built in zipping feature.

I selected my zip file and uploaded and tested it in Lambda. Immediately I was getting errors. Unfortunately, to figure out exactly what the errors were I needed to go to Cloudwatch to actually see them. There I saw the error was: Unable to import module 'index': Error. Everything seemed like it should be working fine based on what I could tell, so I turned to my handy-dandy reference guides known as Google and StackOverflow. Thankfully, I quickly found the issue via this thread. Basically, compressing the files via the built-in was causing some weird problems. Using zip -r on the command line fixed it. Though now that brings up even more questions about why the initial compression wouldn’t work.

With everything seemingly squared away, I was able to test that Alexa could speak the responses. But now I had new problems: she was trying to say the gene abbreviations, which was hilarious, and she spoke way too damn fast.

Slow Down Alexa!

My goal then became making Alexa easier to understand. I wanted her to spell out the gene abbreviation and to do it slowly. Google again was my friend and it quickly became obvious that you could use Speech Synthesis Markup Language to get her to say words in a particular manner. In this case, it’s as easy as adding the markup <say-as interpret-as="spell-out">YOUR PHRASE</say-as> around the word you want spelled out.

I added this to the code, zipped it, uploaded it, and tested it (again, not the quickest method of testing these things…) and Alexa dutifully spelled out the gene name. But she did it damn quick and there wasn’t an obvious method of slowing her down using the say-as markup. StackOverflow was my friend again and I found this thread which basically says to insert breaks between every letter.

With this new-found knowledge, I adjusted the code for the gene abbreviation like so:

the_gene.split('').join('<break time="100ms"/>') + '<break time="100ms"/></say-as>'

This worked surprisingly well and I decided that was enough fiddling with the skill.

Finishing Touches

With everything how I liked it, I then moved on to the rest of the tutorial, which is all about the user-facing experience: writing out the long and short descriptions, selecting some icons, and figuring out example phrases for the skill. This was all pretty trivial, but I should’ve spent a little more time with it because this step is what kicked back my skill initially. The review team had concerns about my use of HGNC in the name and they didn’t like my example phrases. Evidently the first phrase is supposed to be the full one and the other two examples are just the final “action” words (e.g. “Alexa, ask Gene Facts to tell me a fact” is the first example and the other two are “tell me trivia” and “give me a fact”).

With that fix in place, my skill was accepted and officially published! It’s available on the Alexa Skills Store for free and it already has a review for 4/5 stars. Not too shabby.

Give it a try and tell me what you think. Or better yet, publish your own! Definitely use the Amazon tutorial as a starting point, but here’s my own code if you’re interested.

Data Manipulation with jq

I had a data manipulation challenge the other day that I was able to easily solve with the wonderful command line tool jq. As described on the homepage for jq:

jq is like sed for JSON data

Which basically means that you can do some very powerful transformations all without having to open the file.

My Problem

I had JSON data of the form:

[{
	"value": "GENE:d",
	"other": "Data1"
},{
	"value": "GENE2:d",
	"other": "Data2"
},{
	"value": "GENE:d",
	"other": "Data3"
}]

and I needed to get it into the form of:

[{
	"term": "GENE:d",
	"synonym": "GENE"
},{
	"term": "GENE2:d",
	"synonym": "GENE2"
}]

You’ll notice some key points here. First, I needed to ensure uniqueness of all terms (in this case they happen to be genes). Second, I needed to remove the :d from each of the terms and store it with the synonym key.

It’s possible to do this in a one-liner with jq, but it took me a lot of trial and error (at least 30 attempts based on my bash history). My first successful attempt actually had an intermediate step in Sublime Text where I permuted the data to determine the unique values. I then used jq to carry it the rest of the way. After playing with it a bit more when the task at hand was no longer urgent, I was able to solve it all in one step and build a better mental model of how the tool works. What really tripped me up was a basic lack of understanding about how the data flows from one jq filter to another.

jq works on a stream of JSON data. If you have a properly formatted JSON file, the easiest way to make it a stream is to call cat on the file and then pipe it (|) to jq. It would look like this:

cat your_json_file.json | jq '.[]'

For all subsequent examples, assume the cat command is present and is piping the data to jq.

The Steps

The first thing to do is call unique_by on each object. This is easily done via:

jq 'unique_by(.value)'

With uniqueness guaranteed across the dataset, we’re now able to focus on a single object for all subsequent steps. You’ll notice above that all of my data is in a single array with each item being a JSON object. To “unpack” the array, the next command is simply .[]. All together this looks like:

jq 'unique_by(.value) | .[]'

With that, we’re now able to process each object in isolation. The next part I had to do was to take the string from the “value” key and store it into a variable. I figured this out after a lot of trial and error. In my earlier attempts I was unable to access the value I needed after the following steps. Saving it to a variable worked, so I’m going with it. It looks like this:

jq 'unique_by(.value) | .[] | .value as $term'

.value is accessing the string at the key “value” and it’s storing it to the variable $term. The next step is to parse our string to remove the :d from each term. I used the match function but in theory I could have used the sub function to remove the :d. Matching was more useful because I wanted to store the match in my final object. To access the “value” term we have to filter our individual object by doing this:

jq 'unique_by(.value) | .[] | .value as $term | .value'

So that now we’re working with a string. Each string can then be passed to the match function like so:

jq 'unique_by(.value) | .[] | .value as $term | .value | match("[A-Z0-9orf\\-]{1,}") as $synonym'

That regular expression should match all the genes I have in my dataset. The above code won’t work though because you’re not doing anything with the match. The match object looks like this:

{
  "offset": 0,
  "length": 6,
  "string": "OR14I1",
  "captures": []
}

So the value I need is stored in the “string” key. Now, with both values I need accessible via variables, I’m able to generate the final object:

jq 'unique_by(.value) | .[] | .value as $term | .value | match("[A-Z0-9orf\\-]{1,}") as $synonym | {term:$term, synonym:$synonym.string}'

Running the above will output each object in the proper format. The only problem is that because I started with an array I need to end with an array. This is done trivially by wrapping the entire jq expression with brackets:

jq '[unique_by(.value) | .[] | .value as $term | .value | match("[A-Z0-9orf\\-]{1,}") as $synonym | {term:$term, synonym:$synonym.string}]'

With all of that in place, we’re then able to write to a file by simply adding > your_output_file.json to the end of the above command.
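
For comparison, and because I think more naturally in Python, here’s roughly the same transformation done with the standard library (same input file and regex assumptions as above):

import json
import re

with open("your_json_file.json") as f:
    data = json.load(f)

seen = set()
result = []
for item in data:
    term = item["value"]
    if term in seen:          # the unique_by(.value) step
        continue
    seen.add(term)
    synonym = re.match(r"[A-Z0-9orf\-]+", term).group(0)  # the match() step
    result.append({"term": term, "synonym": synonym})

with open("your_output_file.json", "w") as f:
    json.dump(result, f, indent=2)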

And that’s that! Hopefully that helps explain the workings of jq a bit. It’s certainly helped me have a better mental model of how data is passed to each filter.

PyTennessee 2016

This past weekend was PyTennessee 2016 here in Nashville, TN. It’s by far the best bang for your buck: high quality talks for less than $50. I wanted to briefly share some of the things I learned and some of my takeaways from the conference.

Type python, press enter, what happens? by Philip James

This was a fun talk that answered the question of what’s actually happening when you enter the python interpreter from the command line. I’ll briefly share my notes.

The bash shell is the command language for interacting with a Unix operating system. It’s really a program for finding and running other programs and doing some other work as well. To figure out where bash is going to look for programs, you can do

$ echo $PATH

My computer spits back:

/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/sbin:/Users/tayloramurphy/bin

The next step was to figure out which python you were running and where it is.

$ which python returns /usr/local/bin/python. If you then do:

$ file /usr/local/bin/python you get back /usr/local/bin/python: Mach-O 64-bit executable x86_64.

He then did hexdump -C /usr/local/bin/python which spits out the contents of the file in Hexadecimal. He then talked about “ELF” but I sort of lost him at that point.

We then started diving into more of the bash internals. When you type

exec python

it’s actually converting the current process to a python process. This means that when you quit the python process, it ends the session and, in the case of iTerm, closes the window. But when you call python normally, bash first calls fork(), which creates a copy of itself, then calls exec(), which transforms the fork into a new process.
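
Here’s a toy way to see that same fork-then-exec pattern from Python itself (this assumes a Unix-like system and that python actually lives at that path; it’s an illustration, not bash’s internals):

import os

pid = os.fork()            # like bash, copy the current process
if pid == 0:
    # the child replaces itself with python, just like exec() in bash
    os.execv("/usr/local/bin/python", ["python", "-v"])
else:
    os.waitpid(pid, 0)     # the parent waits for python to exit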

He also briefly talked about pstree which is a package that allows you to see the current processes as a tree. Install it via brew install pstree.

He then did a short sidenote on verbose mode in python. This can be done by python -v which is basically equivalent to exec('/usr/local/bin/python', ['python','-v']). The array part at the end is cool to me because it’s just passing the arguments as a list to exec.

The next part started to lose me a little bit. He talked briefly about how much of how a computer works internally is derived from old teletype machines which had to send commands to and from a server and all you had was, essentially, a keyboard to work with. I’m just going to copy my notes from that as is and if I ever feel inspired I’ll look into this more.

Input/Output
file descriptor 0 (stdin)
file descriptor 1 (stdout)
file descriptor 2 (errors)

0: /dev/pts/3 is stdin (standard input) pseudoterminal
termios layer mimics telephone line connection to a server and back

so why does CTRL+C not work? Sends the signal to sigint (signal interrupt)

Overall, it was a very good talk and I do feel like I learned some things.


I then attended three talks, one with a guy from NASA which, while it had a cool premise, wasn’t terribly interesting. He wasn’t the best presenter, unfortunately, and it took away from the (meager) content. The next talk was about Jupyter Kernels. The speaker was much more engaging and very clearly explained how kernels work behind the scenes. It just didn’t really catch my attention and I wasn’t that interested in it (though the messaging system was cool). The last talk was about elasticsearch but I didn’t get as much out of that one as I’d hoped either.


Duck, Duck, Moose. by Brian Costlow

This was a good talk and my main takeaway is that I think I finally get what duck typing is. Duck typing is based off the quote attributed to James Whitcomb Riley:

When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck.

What this means is that, particularly with python, it’s better to do a try and except statement instead of explicitly checking for the type of the object. This is more relevant for object-oriented programming, but it makes sense in general, I think.

Another thing I learned is that you can do exceptions like this

try:
	duck.quack()

except AttributeError:
	print("you can't quack")

except Exception as err:
	print("Got this error {}".format(err))
	raise 

which seems like a cool way to handle a specific error you care about and then deal with all the other ones in another manner.

I also learned a bit about function annotations which can be added to a function without having an effect at run time. An example would be:

def adder(first: int, second: int) -> int:
	return first + second

which just gives more information about type hinting. These can be used in some linters and code editors (but I’m not sure what else you gain by this versus good docstrings).

I also learned about this little tidbit which will allow you to get back the stack trace and do whatever you want with it on an error: https://docs.python.org/2/library/sys.html#sys.exc_info

Overall, an enjoyable talk with some actionable takeaways.


Test-Driven Development with Python by Kevin Harvey

This was a good talk given by the guy who wrote Test-Driven Development with Django. Not too many takeaways because I’m already writing some tests, but the big thing was the Three A’s:

  • Arrange
  • Act
  • Assert

This means that you first arrange the data however you need it to be for your test. Then do the action that you’re testing, and finally assert that the result is what you expect it to be.
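
A minimal sketch of the pattern, using a made-up adder function:

def add(first, second):
    return first + second

def test_add():
    # Arrange: set up the data the test needs
    first, second = 2, 3
    # Act: do the thing you're testing
    result = add(first, second)
    # Assert: check the result is what you expected
    assert result == 5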

Some other points are to:

  1. Write tests for bug reports - meaning if you see a bug, write a test for it, fix it, and then you’ll never have to deal with it again.

  2. Use TDD for new features - write your tests as you’re going. A test is basically a hypothesis of what should happen.

  3. Write unit tests for existing code you use while doing 1 and 2 - This will help you clear out some of the technical debt.


PyTennessee is a great budget conference and I look forward to it every year. Hopefully I can give a talk at the next one…

Removing newlines from CSV

A small csv annoyance at work necessitated me writing a script to automate the fix for said annoyance.

The problem:

I had CSV files that were generated from the PDF reading app Tabula. When these files were opened in Sublime Text 2 or Excel, the newline characters in between double quotes were rendered and data that was supposed to be on a single line would end up on multiple lines. This was annoying and I wanted to remove this so that the data could be displayed and manipulated properly in Excel. (Yes, I know, Excel is rough but we do what we must when the business demands it).

My desired end result is that I would be able to use an Alfred script to make it simple to execute this. At first I thought I would copy in the text I needed fixed and then I could put the corrected data back into my clipboard. This worked quite well for my JSONify Alfred script which converts a single line of JSON into human-readable JSON. But I’m generally always working with actual CSV files, so I decided to have it read from a file and write to a new one.

In Alfred, the workflow is quite simple:

  1. Create a File Filter — In my case, the files are always on my desktop. My file filter is triggered with “cn” and it searches my desktop for .csv files.

  2. Pass file to Run Script — This file is then passed to a Python script that I wrote. The actual Alfred script runs python csvnewlineremove.py {query} in bash.

Here’s the Python script:

# -*- coding: utf-8 -*-

import csv, os, sys
import StringIO


input_file = sys.argv[1]

with open(input_file, "rb") as csv_file:
    csv_data = StringIO.StringIO(csv_file.read())

final_location = os.path.expanduser("~/Desktop/alfredfixed.csv")

with open(final_location, "wb") as output:
    mywriter = csv.writer(output)
    filtered = (line.replace('\r', '') for line in csv_data)

    for record in csv.reader(filtered):
        mywriter.writerow(tuple(record))

sys.argv[1] grabs the input file path, and the file’s contents are then read into a string by StringIO. I did this after a bunch of googling as a way to pull in the contents of a file as a string. I’m not 100% sure this is necessary but it works. I need to look into this more.

final_location is a way for me to output the file to the desktop of whoever happens to be using the workflow.

The final with sets up a csv writer instance in mywriter. The magic happens in filtered. Originally I thought I needed to remove excess newlines (\n). But when I opened up the file in a hex editor (I recommend Hex Fiend) I discovered that it was actually carriage returns (0D in hex) that were causing the problem. So by switching \n to \r in the replace call, it worked.

The Stack Overflow question I referenced most closely was this one. As you’ll notice, the accepted answer had the final record wrapped in a tuple call. I’m not sure why they did that except to make the data immutable. I’ve left it in as it seems safer.

All in all, this was a fun little project that I’ve now automated and it won’t slow me up anymore.

Python in Memory

The following is a simple concept that I was struggling with last week but managed to figure out with the help of some pair programming.

Here’s the setup. I have a few tables in RethinkDB that were ~100 MB each. RethinkDB is a NoSQL database and it stores data in JSON. The bits I specifically needed were nested inside an array. So something like this

[
  {
    "id": "32-character GUID",
    "something": "some string",
    "other": [
      {
         "bit_I_care_about": "string"
      },
      {
         "bit_I_care_about": "string"
      }
    ]
  }
]

To get at the bits I care about, I’d have to iterate through each top level object, get the other key, then iterate through each element of the array in other. My initial thought was to do this procedurally.

Read the data without modifications to a variable. Then iterate through each element of the array in the key other. Then get the bits I need.

Doing this on one table was fine. But I needed to compare the bits_I_need between multiple tables. It consistently was bombing out whenever I tried to do this due to memory constraints. Even when I tried to use iPython’s magic commands to clear memory each time, it kept bombing out.

The solution was, of course, to get the bits_I_need as I’m reading the data into memory. Instead of doing something like this:

things_i_need = []
data = r.db('my_db').table('my_table').run(conn)
in_memory = [flatten(thing) for thing in data]
# flatten() is a custom function that brings everything in `other` to the top level

for thing in in_memory:
    things_i_need.append(thing['bit_I_care_about'])

which is way too memory intensive and has much more than I need, I could do something like this:

things_i_need = []
data = r.db('my_db').table('my_table').run(conn)
for thing in data:
    for stuff in thing['other']:
        things_i_need.append(stuff['bit_I_care_about'])

Which runs like a breeze. Even better, since I was comparing between tables, I could wrap this in a function to get exactly what I want:

def my_function(table_1, table_2):
    things_i_need = []
    data = r.db('my_db').table(table_1).run(conn)
    for thing in data:
        for stuff in thing['other']:
            things_i_need.append(stuff['bit_I_care_about'])

    things_i_need2 = []
    data2 = r.db('my_db').table(table_2).run(conn)
    for thing in data2:
        for stuff in thing['other']:
            things_i_need2.append(stuff['bit_I_care_about'])

    return list(set(things_i_need2) - set(things_i_need))

Then I can just iterate through all the tables I need to compare, call this function with the appropriate tables, and be on my merry way.

Of course, this could be optimized further, but it can also be extended. If I needed to only add to my list after satisfying a certain condition, I could do so. I should also probably use thing.get('other', []) instead of thing['other'] because it allows the program to continue even if that key doesn’t exist.
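
Something like this rough sketch, where the extra condition is hypothetical and just shows where it would go:

for thing in data:
    for stuff in thing.get('other', []):       # keep going even if 'other' is missing
        if stuff.get('bit_I_care_about'):      # only keep records that have a value
            things_i_need.append(stuff['bit_I_care_about'])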

This was a fun little challenge that was solved with the help of a second set of eyes and some insight into what I was actually trying to accomplish. Instead of getting all the data in case I needed it, we took a look at the goal I was trying to achieve and coded it in that manner.

It's all magic

It’s frustrating to me when people claim that anything done with computers is “magical”. It’s probably more frustrating when it’s done in the context of a technology company.

My frustration is borne out of the fact that anything that has been created by humans or is done by humans isn’t, by definition, magical. If something was actually magical then it would break every rule of the universe as we know it and be impossible for any one person to actually accomplish. But we don’t live in that world. People are bound to obey the laws of the universe whether they know they exist or not.

Because of this fact, anything a human does is understandable, if not replicable, by another human.

Let me unpack that a little bit.

I was fortunate enough to see a Cirque du Soleil show recently. The acrobats and gymnasts were amazing. Almost magical, if you will. And while I will not be able to replicate any of their feats in my lifetime, I have at least a passing understanding of how they are able to do what they can do. They have practiced, exercised, and worked hard for years to achieve this level of skill.

Working with computers is similarly not magical, although it can be impressive to the inexperienced. Claiming that something you don’t understand is magical puts a barrier between you and any potential learning. It’s a not-so-subtle way of saying “I will never be able to understand this, therefore I don’t have to put any effort into understanding it.” It makes it okay to be lazy and ignorant.

It’s unreasonable to expect everybody to be comfortable with the command line and multiple different programming languages. But I think it’s entirely reasonable for people to say “I don’t understand how this works, but I’m sure I could if I tried.” And I really believe they could if they really did try. Will it be easier for some? Of course! Do they have to? Of course not! There’s nothing wrong with saying that you’re not going to learn about something right now or even ever. I will probably never learn how to captain a ship on the ocean and I have no intention to learn. That’s ok. I don’t interact with ships everyday.

But it does seem a shame to me that people rely so heavily on computers to run much of their lives but are unwilling to try and understand how they work under the hood. I’m not asking everyone to understand how RAM works or what the JVM is. But learning the basics of computing and having the skill to attempt to solve your own problems when you’re stuck really isn’t that much to ask.

I guess all I’m really asking for is for people to not put an imaginary barrier between themselves and learning something new. I’m not that interested in cars or internal combustion engines, even though I use one every day, but I’ve learned a thing or two about how they work and how to maintain them. It’s almost irresponsible for me not to have at least a basic understanding of how they work. And I have the fundamental belief that if I wanted to learn more about my car, that I would be able to do so. It’s a simple belief that’s easy to have and makes your life better.

So please, remember this:

There’s nothing magical about anything. You just don’t know how it works yet.