Refactoring with Loops and Collection Pipelines
The loop is the classic way of processing collections, but with the greater adoption of first-class functions in programming languages the collection pipeline is an appealing alternative. In this article I look at refactoring loops to collection pipelines with a series of small examples.
14 July 2015
A common task in programming is processing a list of objects. Most programmers naturally do this with a loop, as it's one of the basic control structures we learn with our very first programs. But loops aren't the only way to represent list processing, and in recent years more people are making use of another approach, which I call the collection pipeline. This style is often considered to be part of functional programming, but I used it heavily in Smalltalk. As OO languages support lambdas and libraries that make first class functions easier to program with, then collection pipelines become an appealing choice.
Refactoring a simple loop into a pipeline
I'll start with a simple example of a loop and show the basic way I refactor one into a collection pipeline.
Let's imagine we have a list of authors, each of which has the following data structure.
class Author...
public string Name { get; set; } public string TwitterHandle { get; set;} public string Company { get; set;}
This example uses C#
Here is the loop.
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { var result = new List<String> (); foreach (Author a in authors) { if (a.Company == company) { var handle = a.TwitterHandle; if (handle != null) result.Add(handle); } } return result; }
My first step in refactoring a loop into a collection pipeline is to apply
1: Well, actually, my first move is to consider applying Extract Method on the loop, as it's often easier to manipulate a loop if its isolated into its own function.
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { var result = new List<String> (); var loopStart = authors; foreach (Author a in loopStart) { if (a.Company == company) { var handle = a.TwitterHandle; if (handle != null) result.Add(handle); } } return result; }
This variable gives me a starting point for pipeline operations. I don't have a good name for it right now, so I'll use one that makes sense for the moment, expecting to rename it later.
I then start looking at bits of behavior in the loop. The first thing I see is a conditional check, I can move this to the pipeline with a filter operation.
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { var result = new List<String> (); var loopStart = authors .Where(a => a.Company == company); foreach (Author a in loopStart) { if (a.Company == company) { var handle = a.TwitterHandle; if (handle != null) result.Add(handle); } } return result; }
I see the next part of the loop operates on the twitter handle, rather than the author, so I can use a a map operation 2.
2: For me, it's odd to see the map operator called “Select”. The reason is that the pipeline methods in C# come from Linq, whose main aim is abstracting database access, so the method names were chosen to be similar to SQL. “Select” is the projection operator in SQL, which makes sense when you think of it as selecting columns, but is an odd name if you think of it of using a function mapping.
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { var result = new List<String> (); var loopStart = authors .Where(a => a.Company == company) .Select(a => a.TwitterHandle); foreach (string handle in loopStart) { var handle = a.TwitterHandle; if (handle != null) result.Add(handle); } return result; }
Next in the loop as another conditional, which again I can move to a filter operation.
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { var result = new List<String> (); var loopStart = authors .Where(a => a.Company == company) .Select(a => a.TwitterHandle) .Where (h => h != null); foreach (string handle in loopStart) { if (handle != null) result.Add(handle); } return result; }
All the loop now does is add everything in its loop collection into the result collection, so I can remove the loop and just return the pipeline result.
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { var result = new List<String> (); return authors .Where(a => a.Company == company) .Select(a => a.TwitterHandle) .Where (h => h != null); foreach (string handle in loopStart) { result.Add(handle); } return result; }
Here's the final state of the code
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { return authors .Where(a => a.Company == company) .Select(a => a.TwitterHandle) .Where (h => h != null); }
What I like about collection pipelines is that I can see the flow of logic as the elements of the list pass through the pipeline. For me it reads very closely to how I'd define the outcome of the loop “take the authors, choose those who have a company, and get their twitter handles removing any null handles”.
Furthermore, this style of code is familiar even in different languages who have different syntaxes and different names for pipeline operators. 3
3: Of course this is not a comprehensive list of all the languages that can do collection pipelines, so I expect the usual rush of complaints that I'm not showing Javascript, Scala, or Whatever++. I don't want a huge list of languages here, just a small set that's varied enough to convey the notion that collection pipelines are easy to follow in an unfamiliar language.
Java
public List<String> twitterHandles(List<Author> authors, String company) { return authors.stream() .filter(a -> a.getCompany().equals(company)) .map(a -> a.getTwitterHandle()) .filter(h -> null != h) .collect(toList()); }
Ruby
def twitter_handles authors, company authors .select {|a| company == a.company} .map {|a| a.twitter_handle} .reject {|h| h.nil?} end
while this matches the other examples, I would replace
the final reject
with compact
Clojure
(defn twitter-handles [authors company] (->> authors (filter #(= company (:company %))) (map :twitter-handle) (remove nil?)))
F#
let twitterHandles (authors : seq<Author>, company : string) = authors |> Seq.filter(fun a -> a.Company = company) |> Seq.map(fun a -> a.TwitterHandle) |> Seq.choose (fun h -> h)
again, if I wasn't concerned about matching the structure of the other examples I would combine the map and choose into a single step
I've found that once I got used to thinking in terms of pipelines I could apply them quickly even in an unfamiliar language. Since the fundamental approach is the same it's relatively easy to translate from even unfamiliar syntax and function names.
Refactoring within the pipeline, and to a comprehension
Once you have some behavior expressed as a pipeline, there are potential refactorings you can do by reordering steps in the pipeline. One such move is that if you have a map followed by a filter, you can usually move the filter before the map like this.
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) {
return authors
.Where(a => a.Company == company)
.Where (a => a.TwitterHandle != null)
.Select(a => a.TwitterHandle);
}
When you have two adjacent filters, you can combine them using a conjunction. 4
4: in some cases it will need to be a short-circuiting one, although that's not the case here
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) {
return authors
.Where(a => a.Company == company && a.TwitterHandle != null)
.Select(a => a.TwitterHandle);
}
Once I have a C# collection pipeline in the form of a simple filter and map like this, I can replace it with a Linq expression
class Author...
static public IEnumerable<String> TwitterHandles(IEnumerable<Author> authors, string company) { return from a in authors where a.Company == company && a.TwitterHandle != null select a.TwitterHandle; }
I consider Linq expressions to be a form of list comprehension, and similarly you can do something like this with any language that supports list comprehensions. It's a matter of taste whether you prefer the list comprehension form, or the pipeline form (I prefer pipelines). In general pipelines are more powerful, in that you can't refactor all pipelines into comprehensions.
Nested loop - readers of books
For a second example, I'll refactor a simple, doubly nested loop. I'll assume I have a online system that allows readers to read books. I have a data service that will tell me which books each reader has read during a particular day. This service returns a hash whose keys are identifiers of readers and values are a collection of identifiers of books
interface DataService…
Map<String, Collection<String>> getBooksReadOn(Date date);
for this example, I'll switch to Java because I'm sick of method names with a capitalized first letter
Here's the loop
public Set<String> getReadersOfBooks(Collection<String> readers, Collection<String> books, Date date) { Set<String> result = new HashSet<>(); Map<String, Collection<String>> data = dataService.getBooksReadOn(date); for (Map.Entry<String, Collection<String>> e : data.entrySet()) { for (String bookId : books) { if (e.getValue().contains(bookId) && readers.contains(e.getKey())) { result.add(e.getKey()); } } } return result; }
I'll use my usual first step, which is to apply
public Set<String> getReadersOfBooks(Collection<String> readers, Collection<String> books, Date date) { Set<String> result = new HashSet<>(); Map<String, Collection<String>> data = dataService.getBooksReadOn(date); final Set<Map.Entry<String, Collection<String>>> entries = data.entrySet(); for (Map.Entry<String, Collection<String>> e : entries) { for (String bookId : books) { if (e.getValue().contains(bookId) && readers.contains(e.getKey())) { result.add(e.getKey()); } } } return result; }
Moves like this make me so glad that IntelliJ's automated refactoring saves me from typing that gnarly type expression.
Once I've got the initial collection into a variable, I can work on elements of the loop behavior. All the work is going on inside the conditional so I'll begin with the second clause in that conditional and move its logic to a filter.
public Set<String> getReadersOfBooks(Collection<String> readers, Collection<String> books, Date date) { Set<String> result = new HashSet<>(); Map<String, Collection<String>> data = dataService.getBooksReadOn(date); final Set<Map.Entry<String, Collection<String>>> entries = data.entrySet().stream() .filter(e -> readers.contains(e.getKey())) .collect(Collectors.toSet()); for (Map.Entry<String, Collection<String>> e : entries) { for (String bookId : books) { if (e.getValue().contains(bookId) && readers.contains(e.getKey()))) { result.add(e.getKey()); } } } return result; }
With Java's streams library, the pipeline has to end with a terminal (such as a collector)
The other clause is more tricky to move since it refers to the inner loop variable. This clause is testing to see if the value of the map entry contains any book that's also in the list of books in the method parameter. I can do this by using a set intersection. The java core classes don't include a set intersection method. I can get by using one of the common collection oriented add-ins to Java such as Guava or Apache Commons. Since this is a simple pedagogical example I'll write a crude implementation.
class Utils...
public static <T> Set<T> intersection (Collection<T> a, Collection<T> b) { Set<T> result = new HashSet<T>(a); result.retainAll(b); return result; }
This works here, but for any substantial project, I'd use a common library.
Now I can move the clause from the loop into the pipeline.
public Set<String> getReadersOfBooks(Collection<String> readers, Collection<String> books, Date date) { Set<String> result = new HashSet<>(); Map<String, Collection<String>> data = dataService.getBooksReadOn(date); final Set<Map.Entry<String, Collection<String>>> entries = data.entrySet().stream() .filter(e -> readers.contains(e.getKey())) .filter(e -> !Utils.intersection(e.getValue(), books).isEmpty()) .collect(Collectors.toSet()); for (Map.Entry<String, Collection<String>> e : entries) { for (String bookId : books) { if (e.getValue().contains(bookId) ) { result.add(e.getKey()); } } } return result; }
Now all the loop is doing is returning the key of map entry, so I can kill the remainder of the loop by adding a map operation to the pipeline
public Set<String> getReadersOfBooks(Collection<String> readers, Collection<String> books, Date date) { Set<String> result = new HashSet<>(); Map<String, Collection<String>> data = dataService.getBooksReadOn(date); result = data.entrySet().stream() .filter(e -> readers.contains(e.getKey())) .filter(e -> !Utils.intersection(e.getValue(), books).isEmpty()) .map(e -> e.getKey()) .collect(Collectors.toSet()); for (Map.Entry<String, Collection<String>> e : entries) { for (String bookId : books) { result.add(e.getKey()); } } return result; }
Then I can use Inline Temp on result
.
public Set<String> getReadersOfBooks(Collection<String> readers, Collection<String> books, Date date) { Set<String> result = new HashSet<>(); Map<String, Collection<String>> data = dataService.getBooksReadOn(date); return data.entrySet().stream() .filter(e -> readers.contains(e.getKey())) .filter(e -> !Utils.intersection(e.getValue(), books).isEmpty()) .map(e -> e.getKey()) .collect(Collectors.toSet()); return result; }
Looking that use of intersection, I find it's rather complicated, I have to figure out what it does when I read it - which means I should extract it. 5
5:
I often find this with negative booleans. This is because the
negation (!
) is at the start of the expression, while the predicate
(isEmpty
) is at the end. With any substantive
expression between the two, the result is hard to parse. (At
least for me.)
class Utils...
public static <T> boolean hasIntersection(Collection<T> a, Collection<T> b) { return !intersection(a,b).isEmpty(); }
class Service...
public Set<String> getReadersOfBooks(Collection<String> readers, Collection<String> books, Date date) {
Map<String, Collection<String>> data = dataService.getBooksReadOn(date);
return data.entrySet().stream()
.filter(e -> readers.contains(e.getKey()))
.filter(e -> Utils.hasIntersection(e.getValue(), books))
.map(e -> e.getKey())
.collect(Collectors.toSet());
}
The object-oriented approach is at a disadvantage with when you need to do something like this. Switching between static utility methods and normal methods on objects is awkward. With some languages I have a way to make it read like a method on the stream class, but I don't have that option in Java.
Despite this issue, I still find the pipeline version easier to comprehend. I could combine the filters into a single conjunction, but I usually find it easier to understand each filter as a single element.
Equipment Offerings
The next bit example uses some simple criteria to mark preferred equipment for particular regions. To understand what it does, I need to explain the domain model. We have an organization that makes equipment available across different regions. When you request some equipment, you may be able to get the exact thing you want, but often you'll end up being offerred a substitute, which will do almost what you want, but perhaps not quite so well. To take a rather stretched example: you are in Boston and want a snow blower, but if there aren't any in the store, you may get offerred a snow-shovel instead. However if you're in Miami, they don't even offer snow blowers, so all you can get is the snow shovel. The data model captures this through three classes.
Figure 1: Each instance of offering represents the fact that a particular kind of equipment is offered in a region to support the need for a kind of equipment.
So we might see data like this:
products: ['snow-blower', 'snow-shovel'] regions: ['boston', 'miami'] offerings: - {region: 'boston', supported: 'snow-blower', supplied: 'snow-blower'} - {region: 'boston', supported: 'snow-blower', supplied: 'snow-shovel'} - {region: 'miami', supported: 'snow-blower', supplied: 'snow-shovel'}
The code we're going to look at is some code that marks some of these offerings as preferred, meaning that this offering is the preferred offering for some equipment in a region.
class Service...
var checkedRegions = new HashSet<Region>(); foreach (Offering o1 in equipment.AllOfferings()) { Region r = o1.Region; if (checkedRegions.Contains(r)) { continue; } Offering possPref = null; foreach(var o2 in equipment.AllOfferings(r)) { if (o2.isPreferred) { possPref = o2; break; } else { if (o2.isMatch || possPref == null) { possPref = o2; } } } possPref.isPreferred = true; checkedRegions.Add(r); }
This example is in C#, because I'm a flip-flopper.
My suspicion is that there's some easily comprehensible logic for what this loop is doing, by refactoring it I can hopefully surface that logic.
My first step is to begin with the outer loop and apply
class Service...
var loopStart = equipment.AllOfferings(); var checkedRegions = new HashSet<Region>(); foreach (Offering o1 in loopStart) { Region r = o1.Region; if (checkedRegions.Contains(r)) { continue; } Offering possPref = null; foreach(var o2 in equipment.AllOfferings(r)) { if (o2.isPreferred) { possPref = o2; break; } else { if (o2.isMatch || possPref == null) { possPref = o2; } } } possPref.isPreferred = true; checkedRegions.Add(r); }
I then look at the first part of the loop. It has a control
variable, checkedRegions
which the loop uses to keep
a track of regions it's already processed to avoid processing the
same region more than once. That's a smell right there, but it
also suggests that the loop variable o1
is only a
stepping stone to the get to the region r
. I confirm
that by highlighting o1
in my editor. Once I know
that, I know I can simplify this with a map.
class Service...
var loopStart = equipment.AllOfferings() .Select(o => o.Region); var checkedRegions = new HashSet<Region>(); foreach (Region r in loopStart) { Region r = o1.Region; if (checkedRegions.Contains(r)) { continue; } Offering possPref = null; foreach(var o2 in equipment.AllOfferings(r)) { if (o2.isPreferred) { possPref = o2; break; } else { if (o2.isMatch || possPref == null) { possPref = o2; } } } possPref.isPreferred = true; checkedRegions.Add(r); }
Now I can talk about the checkedRegions
control
variable. The loop uses this to avoid processing a region more
than once. It's not clear to me yet whether
checking a region is idempotent, if it is I might get away with
avoiding this check completely (and measure to see if there is an
appreciable impact on performance). Since I'm not sure I decide to
retain that logic, especially since avoiding duplicates is
trivial with a pipeline.
class Service...
var loopStart = equipment.AllOfferings() .Select(o => o.Region) .Distinct(); var checkedRegions = new HashSet<Region>(); foreach (Region r in loopStart) { if (checkedRegions.Contains(r)) { continue; } Offering possPref = null; foreach(var o2 in equipment.AllOfferings(r)) { if (o2.isPreferred) { possPref = o2; break; } else { if (o2.isMatch || possPref == null) { possPref = o2; } } } possPref.isPreferred = true; checkedRegions.Add(r); }
The next step is about determining the possPref
variable. I think this will be easier to deal with as its own
method, so apply Extract Method
class Service...
var loopStart = equipment.AllOfferings()
.Select(o => o.Region)
.Distinct();
foreach (Region r in loopStart) {
var possPref = possiblePreference(equipment, r);
possPref.isPreferred = true;
}
static Offering possiblePreference(Equipment equipment, Region region) {
Offering possPref = null;
foreach (var o2 in equipment.AllOfferings(region)) {
if (o2.isPreferred) {
possPref = o2;
break;
}
else {
if (o2.isMatch || possPref == null) {
possPref = o2;
}
}
}
return possPref;
}
I extract the loop collection into a variable.
class Service...
static Offering possiblePreference(Equipment equipment, Region region) { Offering possPref = null; var allOfferings = equipment.AllOfferings(region); foreach (var o2 in allOfferings) { if (o2.isPreferred) { possPref = o2; break; } else { if (o2.isMatch || possPref == null) { possPref = o2; } } } return possPref; }
Since the loop is now in its own function, I can use a return rather than a break.
class Service...
static Offering possiblePreference(Equipment equipment, Region region) { Offering possPref = null; var allOfferings = equipment.AllOfferings(region); foreach (var o2 in allOfferings) { if (o2.isPreferred) { return o2; break; } else { if (o2.isMatch || possPref == null) { possPref = o2; } } } return possPref; }
The first condition is looking for the first offering, if any,
that passes a conditional. That's a job for the detect operation
(called First
in C#.) 6
6: I haven't put this in the operator catalog yet.
class Service...
static Offering possiblePreference(Equipment equipment, Region region) { Offering possPref = null; var allOfferings = equipment.AllOfferings(region); possPref = allOfferings.FirstOrDefault(o => o.isPreferred); if (null != possPref) return possPref; foreach (var o2 in allOfferings) { if (o2.isPreferred) { return o2; } else { if (o2.isMatch || possPref == null) { possPref = o2; } } } return possPref; }
The last conditional is rather tricky. It sets
possPref
to the first offering in the list, but will
overwrite that value if any offering passes the
isMatch
test. But the loop doesn't break with that
isMatch
pass, so any later isMatch
offerings will overwrite that match. So to replicate that
behavior, I need to use LastOrDefault
. 7
7: If I'm using a language that doesn't have an operation that detects the last item that passes a predicate, I can reverse it first and then detect the first one.
class Service...
static Offering possiblePreference(Equipment equipment, Region region) { Offering possPref = null; var allOfferings = equipment.AllOfferings(region); possPref = allOfferings.FirstOrDefault(o => o.isPreferred); if (null != possPref) return possPref; possPref = allOfferings.LastOrDefault(o => o.isMatch); if (null != possPref) return possPref; foreach (var o2 in allOfferings) { if (o2.isMatch || possPref == null) { possPref = o2; } } return possPref; }
The last remaining bit of the loop just returns the first item.
class Service...
static Offering possiblePreference(Equipment equipment, Region region) { Offering possPref = null; var allOfferings = equipment.AllOfferings(region); possPref = allOfferings.FirstOrDefault(o => o.isPreferred); if (null != possPref) return possPref; possPref = allOfferings.LastOrDefault(o => o.isMatch); if (null != possPref) return possPref; return allOfferings.First(); foreach (var o2 in allOfferings) { if (possPref == null) { possPref = o2; } } return possPref; }
My personal convention is to use result
for the
name of any variable used for returns in a function, so I rename it.
class Service...
static Offering possiblePreference(Equipment equipment, Region region) { Offering result = null; var allOfferings = equipment.AllOfferings(region); result = allOfferings.FirstOrDefault(o => o.isPreferred); if (null != result) return result; result = allOfferings.LastOrDefault(o => o.isMatch); if (null != result) return result; return allOfferings.First(); }
I'm reasonably happy now with possiblePreference
, I think it
pretty clearly states the logic in a way that makes sense within
the domain. I no longer need to figure out what the code is doing
in order to understand its intent.
However since I'm in C#, I can make it read even better by
using the null-coalescing operator (??
). This allows me to chain
several expressions together and return the first one that isn't a
null.
class Service...
static Offering possiblePreference(Equipment equipment, Region region) { var allOfferings = equipment.AllOfferings(region); return allOfferings.FirstOrDefault(o => o.isPreferred) ?? allOfferings.LastOrDefault(o => o.isMatch) ?? allOfferings.First(); }
In less strictly typed languages that treat null as a falsey value, you do the same thing with an “or” operator. Another alternative is to compose first-class functions (but that's a whole other topic).
Now I go back to the outer loop, which currently looks like this.
class Service...
var loopStart = equipment.AllOfferings() .Select(o => o.Region) .Distinct(); foreach (Region r in loopStart) { var possPref = possiblePreference(equipment, r); possPref.isPreferred = true; }
I can use my possiblePreference
in the pipeline.
class Service...
var loopStart = equipment.AllOfferings() .Select(o => o.Region) .Distinct() .Select(r => possiblePreference(equipment, r)) ; foreach (Offering o in loopStart) { var possPref = possiblePreference(product, r); o.isPreferred = true; }
Note the style of putting the semi-colon on its own line. I often do that with longer pipelines as it makes it easier to manipulate the pipeline.
With renaming the initial loop variable, result reads nicely clear.
class Service...
var preferredOfferings = equipment.AllOfferings() .Select(o => o.Region) .Distinct() .Select(r => possiblePreference(equipment, r)) ; foreach (Offering o in preferredOfferings) { o.isPreferred = true; }
I'd be happy leaving it at that, but I can also move the
forEach
behavior into the pipeline like this.
class Service...
equipment.AllOfferings() .Select(o => o.Region) .Distinct() .Select(r => possiblePreference(equipment, r)) .ToList() .ForEach(o => o.isPreferred = true) ;
This is a more controversial step. Many people don't like using
functions with side-effects in a pipeline. That's why I
have to use the intermediate ToList
since
ForEach
isn't available on IEnumerable
.
As well the issue around side-effects, using ToList
also reminds us that whenever we use side-effects, we'll also lose
any laziness in the evaluation of the pipe (which is not an issue
here since the whole point of the pipe is to select some objects
for modification).
Either way, however, I find this much clearer than the original
loop. The earlier loop examples were reasonably clear to follow,
but this one took a some thinking to figure out what it was doing.
Certainly extracting possiblePreference
is a big
factor in making it clearer, and you could do that and retain a
loop, although I'd certainly want to avoid futzing with the logic
to ensure I avoided duplicate regions.
Grouping flight records
With this example I'll look at some code that summarizes flight delay information. The code starts with records of ontime flight performance, originally sourced from U.S. Department of Transportation Bureau of Transportation Statistics. After some preliminary massaging, the resulting data looks something like this
[ { "origin":"BOS","dest":"LAX","date":"2015-01-12", "number":"25","carrier":"AA","delay":0.0,"cancelled":false }, { "origin":"BOS","dest":"LAX","date":"2015-01-13", "number":"25","carrier":"AA","delay":0.0,"cancelled":false }, …
Here's the loop that processes it
export function airportData() { const data = flightData(); const count = {}; const cancellations = {}; const totalDelay = {}; for (let row of data) { const airport = row.dest; if (count[airport] === undefined) { count[airport] = 0; cancellations[airport] = 0; totalDelay[airport] = 0; } count[airport]++; if (row.cancelled) { cancellations[airport]++ ; } else { totalDelay[airport] += row.delay; } } const result = {}; for (let i in count) { result[i] = {}; result[i].meanDelay = totalDelay[i] / (count[i] - cancellations[i]); result[i].cancellationRate = cancellations[i] / count[i]; } return result; }
this example uses Javascript (es6 on node), since everything has to be written in Javascript these days.
The loop summarizes the flight data by destination airport
(dest
) and calculates the cancellation rate and mean
delay. The central activity here is grouping the flight data by
destination, which is well suited to the group
operator in a pipeline. So my first move is to make a
variable that captures this grouping.
import _ from 'underscore';
export function airportData() {
const data = flightData();
const working = _.groupBy(data, r => r.dest);
const count = {};
const cancellations = {};
const totalDelay = {};
for (let row of data) {
const airport = row.dest;
if (count[airport] === undefined) {
count[airport] = 0;
cancellations[airport] = 0;
totalDelay[airport] = 0;
}
count[airport]++;
if (row.cancelled) {
cancellations[airport]++;
}
else {
totalDelay[airport] += row.delay;
}
}
const result = {};
for (let i in count) {
result[i] = {};
result[i].meanDelay = totalDelay[i] / (count[i] - cancellations[i]);
result[i].cancellationRate = cancellations[i] / count[i];
}
return result;
}
A couple of things about this step. Firstly I can't think of a
good name for it yet, so I just call it working
.
Secondly, while javascript has a good set of collection pipeline
operators on Array
, it lacks a grouping operator. I
could write one myself, but instead I'll make use of the
underscore library, which has long been a useful tool in Javascript-land.
The count variable captures how many flight records there are for each destination airport. I can easily calculate this in the pipeline with a map operation.
export function airportData() { const data = flightData(); const working = _.chain(data) .groupBy(r => r.dest) .mapObject((val, key) => {return {count: val.length}}) .value() ; const count = {}; const cancellations = {}; const totalDelay = {}; for (let row of data) { const airport = row.dest; if (count[airport] === undefined) { count[airport] = 0; cancellations[airport] = 0; totalDelay[airport] = 0; } count[airport]++; if (row.cancelled) { cancellations[airport]++; } else { totalDelay[airport] += row.delay; } } const result = {}; for (let i in count) { result[i] = {}; result[i].meanDelay = totalDelay[i] / (working[i].count - cancellations[i]); result[i].cancellationRate = cancellations[i] / working[i].count; } return result; }
To do a multi-step pipeline like this in underscore, I have to
start the pipeline with the chain
function. This
ensures that each step in the pipeline is wrapped inside
underscore, so I can use a method
chain to build the pipeline. The downside is that I have to
use value
at the end to get the underlying array out
of it.
The map operations isn't the standard map, since it operates on
the contents of a Javascript object, essentially a hash map, so
the mapping function acts on a key/value pair. In underscore I do
this with the mapObject
function.
Usually when I move behavior into the pipeline, I like to remove the control variable entirely, but it's also playing a role in keeping track of the needed keys, so I'll leave it for a while until I've dealt with the other calculations.
Next I'll deal with the cancellations variable, which this time I can remove.
export function airportData() { const data = flightData(); const working = _.chain(data) .groupBy(r => r.dest) .mapObject((val, key) => { return { count: val.length, cancellations: val.filter(r => r.cancelled).length } }) .value() ; const count = {}; const cancellations = {}; const totalDelay = {}; const cancellations = {}; for (let row of data) { const airport = row.dest; if (count[airport] === undefined) { count[airport] = 0; cancellations[airport] = 0; totalDelay[airport] = 0; } count[airport]++; if (row.cancelled) { cancellations[airport]++; } else { totalDelay[airport] += row.delay; } } const result = {}; for (let i in count) { result[i] = {}; result[i].meanDelay = totalDelay[i] / (working[i].count - working[i].cancellations); result[i].cancellationRate = working[i].cancellations / working[i].count; } return result; }
The mapping function is now getting rather long, so I think it's time to use Extract Method on it.
export function airportData() { const data = flightData(); const summarize = function(rows) { return { count: rows.length, cancellations: rows.filter(r => r.cancelled).length } } const working = _.chain(data) .groupBy(r => r.dest) .mapObject((val, key) => summarize(val)) .value() ; const count = {}; const totalDelay = {} for (let row of data) { const airport = row.dest; if (count[airport] === undefined) { count[airport] = 0; totalDelay[airport] = 0; } count[airport]++; if (row.cancelled) { } else { totalDelay[airport] += row.delay; } } const result = {}; for (let i in count) { result[i] = {}; result[i].meanDelay = totalDelay[i] / (working[i].count - working[i].cancellations); result[i].cancellationRate = working[i].cancellations / working[i].count; } return result; }
Assigning the function to a variable within the overall
function is javascript's way of nesting the function definition to
limit its scope to the airportData
function. I could
imagine this function being more widely useful, but that's a later
refactoring to consider.
Now to handle the total delay calculation.
export function airportData() { const data = flightData(); const summarize = function(rows) { return { count: rows.length, cancellations: rows.filter(r => r.cancelled).length, totalDelay: rows.filter(r => !r.cancelled).reduce((acc,each) => acc + each.delay, 0) } } const working = _.chain(data) .groupBy(r => r.dest) .mapObject((val, key) => summarize(val)) .value() ; const count = {}; const totalDelay = {} for (let row of data) { const airport = row.dest; if (count[airport] === undefined) { count[airport] = 0; totalDelay[airport] = 0; } count[airport]++; if (row.cancelled) { } else { totalDelay[airport] += row.delay; } } const result = {}; for (let i in count) { result[i] = {}; result[i].meanDelay = working[i].totalDelay / (working[i].count - working[i].cancellations); result[i].cancellationRate = working[i].cancellations / working[i].count; } return result; }
The expression in the lambda for the total delay mirrors the original formulation, using a reduce operation to calculate a sum. I often find it reads better to use a map first.
export function airportData() {
const data = flightData();
const summarize = function(rows) {
return {
count: rows.length,
cancellations: rows.filter(r => r.cancelled).length,
totalDelay: rows.filter(r => !r.cancelled).map(r => r.delay).reduce((a,b) => a + b)
}
}
const working = _.chain(data)
.groupBy(r => r.dest)
.mapObject((val, key) => summarize(val))
.value()
;
const count = {};
for (let row of data) {
const airport = row.dest;
if (count[airport] === undefined) {
count[airport] = 0;
}
count[airport]++;
}
const result = {};
for (let i in count) {
result[i] = {};
result[i].meanDelay = working[i].totalDelay / (working[i].count - working[i].cancellations);
result[i].cancellationRate = working[i].cancellations / working[i].count;
}
return result;
}
That reformulation isn't a big deal, but I increasingly prefer it. That lambda is also a bit long, but it's not quite at the point where I feel the need to extract it.
I also took the opportunity to replace the lambda to invoke
summarize
with just naming the function.
export function airportData() {
const data = flightData();
const summarize = function(rows) {
return {
count: rows.length,
cancellations: rows.filter(r => r.cancelled).length,
totalDelay: rows.filter(r => !r.cancelled).map(r => r.delay).reduce((a,b) => a + b)
}
}
const working = _.chain(data)
.groupBy(r => r.dest)
.mapObject(summarize)
.value()
;
const count = {};
for (let row of data) {
const airport = row.dest;
if (count[airport] === undefined) {
count[airport] = 0;
}
count[airport]++;
}
const result = {};
for (let i in count) {
result[i] = {};
result[i].meanDelay = working[i].totalDelay / (working[i].count - working[i].cancellations);
result[i].cancellationRate = working[i].cancellations / working[i].count;
}
return result;
}
Now with all the dependent data removed, I'm ready to remove count
.
export function airportData() { const data = flightData(); const summarize = function(rows) { return { count: rows.length, cancellations: rows.filter(r => r.cancelled).length, totalDelay: rows.filter(r => !r.cancelled).map(r => r.delay).reduce((a,b) => a + b) } } const working = _.chain(data) .groupBy(r => r.dest) .mapObject(summarize) .value() ; const count = {}; for (let row of data) { const airport = row.dest; if (count[airport] === undefined) { count[airport] = 0; } count[airport]++; } const result = {}; for (let i in working) { result[i] = {}; result[i].meanDelay = working[i].totalDelay / (working[i].count - working[i].cancellations); result[i].cancellationRate = working[i].cancellations / working[i].count; } return result; }
Now I turn my attention to the second loop, which is essentially doing a map to calculate its two values.
export function airportData() { const data = flightData(); const summarize = function(rows) { return { count: rows.length, cancellations: rows.filter(r => r.cancelled).length, totalDelay: rows.filter(r => !r.cancelled).map(r => r.delay).reduce((a,b) => a + b) } } const formResult = function(row) { return { meanDelay: row.totalDelay / (row.count - row.cancellations), cancellationRate: row.cancellations / row.count } } let working = _.chain(data) .groupBy(r => r.dest) .mapObject(summarize) .mapObject(formResult) .value() ; return working; let result = {}; for (let i in working) { result[i] = {}; result[i].meanDelay = working[i].totalDelay / (working[i].count - working[i].cancellations); result[i].cancellationRate = working[i].cancellations / working[i].count; } return result; }
With all that done, I can use Inline Temp on working
and do a little
more renaming and tidying.
export function airportData() { const data = flightData(); const summarize = function(flights) { return { numFlights: flights.length, numCancellations: flights.filter(f => f.cancelled).length, totalDelay: flights.filter(f => !f.cancelled).map(f => f.delay).reduce((a,b) => a + b) } } const formResult = function(airport) { return { meanDelay: airport.totalDelay / (airport.numFlights - airport.numCancellations), cancellationRate: airport.numCancellations / airport.numFlights } } return _.chain(data) .groupBy(r => r.dest) .mapObject(summarize) .mapObject(formResult) .value() ; }
As is often the case, a good bit of the greater readability of the final function comes from extracting functions. But I find the grouping operator does much to clarify the purpose of the function and helps set up the extractions.
There is another potential benefit from this refactoring if the data comes from a relational database and I'm experiencing performance problems. By refactoring from a loop to a collection pipeline I'm representing the transformation in a form that's much more similar to SQL. In the case of this task I might be pulling a lot of data from the database, but the refactored code makes it easier to consider moving the grouping and first level summarization logic into the SQL, which would reduce how much data I need to ship over the wire. Usually I prefer to keep logic in application code rather than SQL, so I would treat such a move as a performance optimization and only do this if I can measure a significant performance improvement. But this reinforces the point that it's much easier to do optimizations when you have clear code to work with, which is why all the performance wizards I know stress the importance of clarity-first as a foundation for performant code.
Identifiers
For our next example, I'll take a look at some code that checks that a person has a set of required identifiers. It's common for systems to rely on identifying people through some kind of hopefully-unique id, such as a customer id. In many domains, you have to deal with many different identification schemes, and a person should have identifiers with multiple schemes. So a town government might expect a person to have a town id, a state id, and a national id.
Figure 2: A data model for a person having identifiers with different schemes
The data structure for this situation is pretty simple. The person class has a collection of identifier objects. Identifiers have a field for the scheme and some value. But usually there are further constraints that can't be enforced solely by the data model, such constraints are checked by a validation function such as this.
class Person...
def check_valid_ids required_schemes, note: nil note ||= Notification.new note.add_error "has no ids" if @ids.size < 1 used = [] found_required = [] dups = [] for id in @ids next if id.void? if used.include?(id.scheme) dups << id.scheme else for req in required_schemes if id.scheme == req found_required << req required_schemes.delete req next end end end used << id.scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
This example is in Ruby, because I like programming in Ruby
There's a couple of other objects that support this loop. The Identifier class knows its scheme, value, and whether it is void - meaning it's been logically deleted (but is still kept in the database). There is also a notification to keep track of any errors.
class Identifier attr_reader :scheme, :value def void? … class Notification def add_error e …
The biggest smell for me in this routine is that the loop is
doing two things at once. It is both finding duplicate
identifiers (collected in dups
) and finding required schemes
that are missing (in required_schemes
). It's quite
common that programmers, faced with two things to do with the
same collection of objects, decide to do both things in the same
loop. One reason is the code required to set up the loop, it
seems a shame to write it twice. Modern loop constructs and
pipelines remove that burden. The more pernicious reason is
concerns over performance. Certainly many performance hotspots
involve a loop, and there are cases where loops can be fused to
improve matters. But these are only a tiny fraction of all the
loops we write, so we should follow the usual principle of
programming. Focus on clarity over performance, unless you have
a measured, significant performance problem. If you have such a
problem, then fixing it takes priority over clarity, but such
cases are rare.
Focus on clarity over performance, unless you have a measured, significant performance problem.
So faced with a loop doing two things, I have no hesitation in duplicating the loop to improve clarity. It's extremely rare that performance analysis will cause me to reverse that refactoring.
So my first move is use a refactoring I'll call Split Loop. When I do this I start by getting the loop and the code that's connecting it into a coherent block of code, and apply Extract Method to it.
class Person...
def check_valid_ids required_schemes, note: nil
note ||= Notification.new
note.add_error "has no ids" if @ids.size < 1
return inner_check_valid_ids required_schemes, note: note
end
def inner_check_valid_ids required_schemes, note: nil
used = []
found_required = []
dups = []
for id in @ids
next if id.void?
if used.include?(id.scheme)
dups << id.scheme
else
for req in required_schemes
if id.scheme == req
found_required << req
required_schemes.delete req
next
end
end
end
used << id.scheme
end
if dups.size > 0
note.add_error "duplicate schemes: " + dups.join(", ")
end
if required_schemes.size > 0
missing_names = ""
for req in required_schemes
missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s
end
note.add_error "missing schemes: " + missing_names
end
return note
end
This extracted method is doing two things, so now I want to duplicate it to form the scaffolding for the two separate methods, each of which will do just one thing. If I duplicate it and call each one, then my accumulating notification will get twice the errors, I can avoid this by removing the irrelevent update from each of the duplicates.
class Person...
def check_valid_ids required_schemes, note: nil note ||= Notification.new note.add_error "has no ids" if @ids.size < 1 check_no_duplicate_ids required_schemes, note: note check_all_required_schemes required_schemes, note: note end def check_no_duplicate_ids required_schemes, note: nil used = [] found_required = [] dups = [] for id in @ids next if id.void? if used.include?(id.scheme) dups << id.scheme else for req in required_schemes if id.scheme == req found_required << req required_schemes.delete req next end end end used << id.scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end def check_all_required_schemes required_schemes, note: nil used = [] found_required = [] dups = [] for id in @ids next if id.void? if used.include?(id.scheme) dups << id.scheme else for req in required_schemes if id.scheme == req found_required << req required_schemes.delete req next end end end used << id.scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
It's important to remove the double-update, that way my tests all keep passing while I'm refactoring.
The result is rather ugly, but now I can work on each method independently, removing anything that doesn't involve the purpose of each method.
Refactoring the no-duplicates check
I'll start with the no-duplicates case,
I can cut out hunks of code in several steps, testing after each
one to ensure I don't make a mistake. I start by removing the
code that uses required_schemes
at the end
class Person...
def check_no_duplicate_ids required_schemes, note: nil used = [] found_required = [] dups = [] for id in @ids next if id.void? if used.include?(id.scheme) dups << id.scheme else for req in required_schemes if id.scheme == req found_required << req required_schemes.delete req next end end end used << id.scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end end return note end
I then take out the unneeded branch of the conditional
class Person...
def check_no_duplicate_ids required_schemes, note: nil used = [] found_required = [] dups = [] for id in @ids next if id.void? if used.include?(id.scheme) dups << id.scheme else for req in required_schemes if id.scheme == req found_required << req required_schemes.delete req next end end end used << id.scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end return note end
At this point I could, and perhaps should, have
removed the now unneeded required_schemes
parameter. I
didn't, and you'll see it gets sorted out in the end.
I do my usual
class Person...
def check_no_duplicate_ids required_schemes, note: nil used = [] dups = [] input = @ids for id in input next if id.void? if used.include?(id.scheme) dups << id.scheme end used << id.scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end return note end
I can then add a filter to the input variable and remove the line that skips over void identifiers.
class Person...
def check_no_duplicate_ids required_schemes, note: nil used = [] dups = [] input = @ids.reject{|id| id.void?} for id in input next if id.void? if used.include?(id.scheme) dups << id.scheme end used << id.scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end return note end
Looking further in the loop I can see that it uses the scheme rather than the id, so I can add the pipeline step to map ids to schemes.
class Person...
def check_no_duplicate_ids required_schemes, note: nil used = [] dups = [] input = @ids .reject{|id| id.void?} .map {|id| id.scheme} for scheme in input if used.include?(scheme) dups << scheme end used << scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end return note end
At this point, I've reduced the loop body to the simple removing duplicates behavior. There is a pipeline way to find duplicates, this is to group the schemes by themselves and filter those that appear more than once.
class Person...
def check_no_duplicate_ids required_schemes, note: nil used = [] dups = [] input = @ids .reject{|id| id.void?} .map {|id| id.scheme} .group_by {|s| s} .select {|k,v| v.size > 1} .keys for scheme in input if used.include?(scheme) dups << scheme end used << scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end return note end
Now the output of the pipeline is the duplicates, so I can
remove the input variable and assign the pipeline to the
variable (and remove the now unneeded used
variable).
class Person...
def check_no_duplicate_ids required_schemes, note: nil used = [] dups = [] dups = @ids .reject{|id| id.void?} .map {|id| id.scheme} .group_by {|s| s} .select {|k,v| v.size > 1} .keys for scheme in input dups << scheme used << scheme end if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end return note end
That gives us a nice pipeline, but there is a troubling element to it. The last three steps in the pipeline are there to remove duplicates, but that knowledge is in my head, not in the code. I need to move it into the code by using Extract Method.
class Person...
def check_no_duplicate_ids required_schemes, note: nil
dups = @ids
.reject{|id| id.void?}
.map {|id| id.scheme}
.duplicates
if dups.size > 0
note.add_error "duplicate schemes: " + dups.join(", ")
end
return note
end
class Array…
def duplicates
self
.group_by {|s| s}
.select {|k,v| v.size > 1}
.keys
end
Here I've used Ruby's ability to add a method to an existing class (known as monkey-patching). I could also use Ruby's refinement feature in the latest Ruby versions. However many OO languages don't support monkey-patching, in which case I have to use a locally defined function, along the lines of this
class Person...
def check_no_duplicate_ids required_schemes, note: nil
schemes = @ids
.reject{|id| id.void?}
.map {|id| id.scheme}
if duplicates(schemes).size > 0
note.add_error "duplicate schemes: " + duplicates(schemes).join(", ")
end
return note
end
def duplicates anArray
anArray
.group_by {|s| s}
.select {|k,v| v.size > 1}
.keys
end
Defining a method on person doesn't work so well for the pipeline as putting it on the array. But often we can't put a method on the array because our language doesn't involve monkey patching, or project standards don't make it easy, or it's a method that isn't generic enough to sit on a generic list class. This is a case where the object-oriented approach gets in the way and a functional approach, that doesn't bind methods to objects, works better.
Whenever I have a local variable like this, I always consider using Replace Temp with Query to turn the variable into a method - resulting in something like this.
class Person...
def check_no_duplicate_ids required_schemes, note: nil if duplicate_identity_schemes.size > 0 note.add_error "duplicate schemes: " + duplicate_identity_schemes.join(", ") end return note end def duplicate_identity_schemes @ids .reject{|id| id.void?} .map {|id| id.scheme} .duplicates end
I base this decision on whether I think the
duplicate_identity_schemes
behavior is likely to
be useful to other methods in the person class. But despite
the fact that I prefer to err in making the query method, for
this case I prefer to keep it as a local variable.
Refactoring the check for all required schemes
Now I've cleaned up the check for no duplicates I can work on the check that we have all the required schemes. Here's the current method.
class Person...
def check_all_required_schemes required_schemes, note: nil used = [] found_required = [] dups = [] for id in @ids next if id.void? if used.include?(id.scheme) dups << id.scheme else for req in required_schemes if id.scheme == req found_required << req required_schemes.delete req next end end end used << id.scheme end if dups.size > 0 end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
As with the earlier method, my first steps are to remove anything that's to do with checking for duplicates.
class Person...
def check_all_required_schemes required_schemes, note: nil used = [] found_required = [] dups = [] for id in @ids next if id.void? if used.include?(id.scheme) dups << id.scheme else for req in required_schemes if id.scheme == req found_required << req required_schemes.delete req next end end end used << id.scheme end if dups.size > 0 end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
To get into the meat of this function, I start by looking
at the found_required
variable. As with the
checking for duplicates case, it's primarily interested in
the schemes for which we have non-void identifiers, so I'm
inclined to start by capturing the schemes into a variable and
using those rather than the ids.
class Person...
def check_all_required_schemes required_schemes, note: nil found_required = [] schemes = @ids .reject{|i| i.void?} .map {|i| i.scheme} for s in schemes next if id.void? for req in required_schemes if s == req found_required << req required_schemes.delete req next end end end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
The purpose of found_required
is
to capture schemes that are both in the required_schemes
list
and among the schemes from our ids. That sounds like a set
intersection to me, and that's a function I should expect on
any self-respecting collection. So I should be able to
determine found_required with that.
class Person...
def check_all_required_schemes required_schemes, note: nil found_required = [] schemes = @ids .reject{|i| i.void?} .map {|i| i.scheme} for s in schemes for req in required_schemes if s == req found_required << req required_schemes.delete req next end end end found_required = schemes & required_schemes if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
Sadly that move failed the tests. Now I look harder at the
code and realize that found_required
isn't used at
all by the code later on, it's a zombie variable that was
probably used for something once, but that use was abandoned
later and the variable was never removed from the code. So I
back out the change I just made and remove it.
class Person...
def check_all_required_schemes required_schemes, note: nil found_required = [] schemes = @ids .reject{|i| i.void?} .map {|i| i.scheme} for s in schemes for req in required_schemes if s == req found_required << req required_schemes.delete req next end end end if required_schemes.size > 0 missing_names = "" for req in required_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
Now I see that the loop is removing elements from the
parameter required_schemes
. Modifying a parameter
like this is a strict no-no for me, unless it's a collecting
parameter (such as the note). So I immediately apply
Remove Assignments to Parameters
class Person...
def check_all_required_schemes required_schemes, note: nil missing_schemes = required_schemes.dup schemes = @ids .reject{|i| i.void?} .map {|i| i.scheme} for s in schemes for req in required_schemes if s == req missing_schemes.delete req next end end end if missing_schemes.size > 0 missing_names = "" for req in missing_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
Doing this also revealed that the loop was deleting items from the list it's enumerating - an even worse thing than modifying the parameter.
Now that this is clarified, I can see that a set operation is appropriate, but what I need to do is remove the schemes we have from the required list - using a set difference operation.
class Person...
def check_all_required_schemes required_schemes, note: nil missing_schemes = required_schemes.dup schemes = @ids .reject{|i| i.void?} .map {|i| i.scheme} missing_schemes = required_schemes - schemes for s in schemes for req in required_schemes if s == req missing_schemes.delete req next end end end if missing_schemes.size > 0 missing_names = "" for req in missing_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
Now I look at the second loop, forming the error message. This just converts the schemes to string and joins them with commas - which is the work of the string join operation.
class Person...
def check_all_required_schemes required_schemes, note: nil schemes = @ids .reject{|i| i.void?} .map {|i| i.scheme} missing_schemes = required_schemes - schemes if missing_schemes.size > 0 missing_names = missing_schemes.join(", ") for req in missing_schemes missing_names += (missing_names.size > 0) ? ", " + req.to_s : req.to_s end note.add_error "missing schemes: " + missing_names end return note end
Consolidating the two methods
Both methods are cleaned up now, they do just one thing and are clear in what they are doing. Both of them need the list of schemes for non-void identifiers, so I'm inclined to use Extract Method
class Person...
def check_no_duplicate_ids required_schemes, note: nil dups = @ids .reject{|id| id.void?} .map {|id| id.scheme} .duplicates dups = identity_schemes.duplicates if dups.size > 0 note.add_error "duplicate schemes: " + dups.join(", ") end return note end
def check_all_required_schemes required_schemes, note: nil schemes = @ids .reject{|i| i.void?} .map {|i| i.scheme} missing_schemes = required_schemes - identity_schemes if missing_schemes.size > 0 missing_names = missing_schemes.join(", ") note.add_error "missing schemes: " + missing_names end return note end def identity_schemes @ids .reject{|i| i.void?} .map {|i| i.scheme} end
I then fancy a couple of small clean-ups. Firstly I'm testing to see if a collection is empty by checking its size. I always prefer a more intention-revealing empty method.
class Person...
def check_no_duplicate_ids required_schemes, note: nil dups = identity_schemes.duplicates unless dups.empty? note.add_error "duplicate schemes: " + dups.join(", ") end return note end def check_all_required_schemes required_schemes, note: nil missing_schemes = required_schemes - identity_schemes unless missing_schemes.empty? missing_names = missing_schemes.join(", ") note.add_error "missing schemes: " + missing_names end return note end
I don't have a name for this refactoring, it should be something like “replace implementation-revealing method with intention-revealing method”.
The missing_names
variable isn't pulling its
weight, so I'll use Inline Temp on
it.
class Person...
def check_no_duplicate_ids required_schemes, note: nil dups = identity_schemes.duplicates unless dups.empty? note.add_error "duplicate schemes: " + dups.join(", ") end return note end def check_all_required_schemes required_schemes, note: nil missing_schemes = required_schemes - identity_schemes unless missing_schemes.empty? missing_names = missing_schemes.join(", ") note.add_error "missing schemes: " + missing_schemes.join(", ") end return note end
I also fancy converting both of these to use the single line conditional syntax
class Person...
def check_no_duplicate_ids required_schemes, note: nil dups = identity_schemes.duplicates unless dups.empty? note.add_error "duplicate schemes: " + dups.join(", ") unless dups.empty? end return note end def check_all_required_schemes required_schemes, note: nil missing_schemes = required_schemes - identity_schemes unless missing_schemes.empty? note.add_error "missing schemes: " + missing_schemes.join(", ") unless missing_schemes.empty? end return note end
Again there's no defined refactoring for that, and it would be very much a ruby-specific one.
With that, I don't think the methods are worthwhile any
more, so I inline them, and the extracted identity_schemes
method, back into caller
class Person...
def check_valid_ids required_schemes, note: nil note ||= Notification.new note.add_error "has no ids" if @ids.size < 1 identity_schemes = @ids.reject{|i| i.void?}.map {|i| i.scheme} dups = identity_schemes.duplicates note.add_error("duplicate schemes: " + dups.join(", ")) unless dups.empty? missing_schemes = required_schemes - identity_schemes note.add_error "missing schemes: " + missing_schemes.join(", ") unless missing_schemes.empty? return note end
The final method is a bit longer than I usually go for, but I like its cohesiveness. If it grew much bigger I'd want to split it up, perhaps using Replace Method with Method Object Even as long as it is, I find it much clearer in communicating what errors the validation is checking for.
Final Thoughts
That concludes this set of refactoring examples. I hope it's given you a good sense of how collection pipelines can clarify the logic of code that manipulates a collection, and how it's often quite straightforward to refactor a loop into a collection pipeline.
As with any refactoring, there is a similar inverse refactoring to turn a collection pipeline into a loop, but I hardly ever do that.
These days most modern languages offer first class functions and a collection library that includes the necessary operations for collection pipelines. If you're unused to collection pipelines, it is a good exercise to take loops that you run into and refactor them like this. If you find the final pipeline isn't clearer than the original loop, you can always revert the refactoring when you're done. Even if you do revert the refactoring, the exercise can teach you a lot about this technique. I've been using this programming pattern for a long time and find it a valuable way of helping me read my own code. As such I think it's worth the effort of exploring, to see if your team comes to a similar conclusion.
Footnotes
1: Well, actually, my first move is to consider applying Extract Method on the loop, as it's often easier to manipulate a loop if its isolated into its own function.
2: For me, it's odd to see the map operator called “Select”. The reason is that the pipeline methods in C# come from Linq, whose main aim is abstracting database access, so the method names were chosen to be similar to SQL. “Select” is the projection operator in SQL, which makes sense when you think of it as selecting columns, but is an odd name if you think of it of using a function mapping.
3: Of course this is not a comprehensive list of all the languages that can do collection pipelines, so I expect the usual rush of complaints that I'm not showing Javascript, Scala, or Whatever++. I don't want a huge list of languages here, just a small set that's varied enough to convey the notion that collection pipelines are easy to follow in an unfamiliar language.
4: in some cases it will need to be a short-circuiting one, although that's not the case here
5:
I often find this with negative booleans. This is because the
negation (!
) is at the start of the expression, while the predicate
(isEmpty
) is at the end. With any substantive
expression between the two, the result is hard to parse. (At
least for me.)
6: I haven't put this in the operator catalog yet.
7: If I'm using a language that doesn't have an operation that detects the last item that passes a predicate, I can reverse it first and then detect the first one.
Acknowledgements
Kit Eason helped make the F# example a bit more idiomatic. Les Ramer helped me improve my C#. Richard Warburton corrected some loose wording in my Java. Daniel Sandbecker spotted an error in the ruby example. Andrew Kiellor, Bruno Trecenti, David Johnston, Duncan Cragg, Karel Alfonso, Korny Sietsma, Matteo Vaccari, Pete Hodgson, Piyush Srivastava, Scott Robinson, Steven Lowe, and Vijay Aravamudhan discussed drafts of this article on the Thoughtworks mailing list.
Significant Revisions
14 July 2015: Added the identifiers example and final thoughts
07 July 2015: Added the grouping flight data example
30 June 2015: Added the equipment offering example.
25 June 2015: Added the nested loop installment
23 June 2015: Published first installment