How to Split Strings and Join Values

Fairly often someone tells us they have a combination of values and delimited strings, and they want to split up the strings but keep the values. That is, they have something like this:

https://preview.redd.it/yxpr6ijjfxig1.png?width=273&format=png&auto=webp&s=ce56eae127fd88e24239d89fb34d0ee5527a2b47

Values	Strings
1	a,b,c
2	d,e
3	f
4	g,h,i,j
5
6	a,,b
7	,b,c

And they want to produce something like this:

https://preview.redd.it/lfhj6taz9yig1.png?width=191&format=png&auto=webp&s=d96eec129f3e16d2b2836b318cf67dc6853e2d79

V	S
1	a
1	b
1	c
2	d
2	e
3	f
4	g
4	h
4	i
4	j
6	a
6
6	b
7
7	b
7	c

In this post, I'm going to describe how to do this and explain why it works. This can all be done in a single formula, as shown in this screenshot, which shows the input (columns A and B) and the output (columns Z and AA) plus the formula that generates it. For educational purposes, all the intermediate values are also shown, and I'll discuss them below, but you do not have to create all these extra columns in normal use!

https://preview.redd.it/gijk8k0ahxig1.png?width=2553&format=png&auto=webp&s=d4e7b5ea0e41a98c6df87d99513ff44c78ef9331

For those who just want to jump straight to the solution and save the explanations for another day, here's the final formula. Again, it uses nothing but the data in columns A and B, and it is a single formula (in cell Z2) which spills the entire output.

=LET(input, A:.B, body, DROP(input,1), values, TAKE(body,,1), strings, DROP(body,,1), split_strings, LAMBDA(ss, TEXTAFTER(TEXTBEFORE(ss,",",SEQUENCE(,MAX(LEN(REGEXREPLACE(ss,"[^,]+",)))+1),,1),",",-1,,1)), flood, LAMBDA(vv,ww, IF(vv<>ww,vv,ww)), ss_sp, split_strings(strings), vv_fl, flood(values, ss_sp), out, HSTACK(TOCOL(vv_fl,2),TOCOL(ss_sp,2)), out )

The top line is pretty standard boilerplate to parse the first two columns into values and strings. Only strings and values are used below this line.

So how does this work? First let's look at the LAMBDA function, split_strings, which does what we wish BYCOL+TEXTSPLIT (or just TEXTSPLIT with a column of strings) would do. Here's what that looks like exploded into steps:

=LET(input, A:.B, strings, DROP(input,1,1), ss_rx, REGEXREPLACE(strings,"[^,]+",), ss_l, LEN(ss_rx), mm, SEQUENCE(,MAX(ss_l)+1), ss_bef, TEXTBEFORE(strings,",",mm,,1), out, TEXTAFTER(ss_bef,",",-1,,1), out )

Column C shows the value of ss_rx. What this regular expression does is match to everything except a comma. REGEXREPLACE uses this regular expression to delete everything in every string except the commas. Column D gives us ss_l, which is just the number of commas in each line.

The reason we do this is that we need to know, in advance, how many columns our result is going to be. This is precisely why TEXTSPLIT on a column of strings fails; Excel can't know how wide the output is going to be without examining every string first, so it only returns the first string on each row. In our case, we know the output array would be width 1 if there were no commas at all, and each comma implies another column, so the required output will be the max number of commas plus one.

Given that calculation, the mm array is a row vector which is the same width as the array we intend to produce. It's in columns F through I. The numbers correspond to which element from each string we want to extract. That is, 1 is for the first string (i.e. the substring in front of the first comma), 2 is for the second string (the substring between the first two commas), etc. We'll get this information in two steps, as follows.

When we call TEXTBEFORE(strings,",",mm,,1) , we get a separate result for each string and for each value in mm. (See columns K through N, rows 2 through 8.) That means text before the first comma, text before the second comma, text before the third comma, etc. The 1 at the end tells TEXTBEFORE to treat the end of the string as an honorary comma, which is why "a,b,c" is a valid result on the first row. We get errors for the slots that are out-of-range, but that's okay; HSTACK and VSTACK do something similar for arrays with misaligned dimensions, after all.

Finally, TEXTAFTER(ss_bef,",",-1,,1) gives us precisely the result we want, in columns P through S, rows 2 through 8. The -1 says that we want text after the last comma in the string, and the 1 at the end says to consider the start of the string as an honorary comma. Again, the errors are not a problem; TOCOL(ss_sp,2) will discard all those errors and turn this into what we'll see later in column AA.

So, going back to the the original formula, this gets us to ss_sp--an array of split strings with errors for the ones that don't exist. How do we join this with the column of values? This is what the flood function does: LAMBDA(vv,ww, IF(vv<>ww,vv,ww)). Notice that this IF statement is kind of stupid; it's going to produce vv no matter what the value of ww is! Except that if vv and ww have different dimensions, Excel will expand one or the other one first to make them have the same dimensions and then execute the IF on every element separately. In particular, if one of them (vv in this case) is a row and the other (ww) is an array, it expands vv by copying the row over and over until it matches the size of ww. In parallel-programming parlance, Excel floods vv to match the dimensions of ww. Better, whenever ww has an error, the flooded array will also have an error in the same place.

Columns U through X show what the flooded array looks like. It really is just mm copied over and over but with the same error pattern as ss_sp had.

To get our final result (in columns Z and AA), we just convert vv_fl and ss_sp to columns, stripping out the errors, and then stack them side-by-side:

HSTACK(TOCOL(vv_fl,2),TOCOL(ss_sp,2))

As mentioned above, the error patterns are the same, so all the values line up perfectly. (N.B. this will not work if mm itself has errors!)

And that's all there is to it! Well, here are some variations you might care about:

Changing this to support a different single-character delimiter is very easy; you just need to change it at three points: in the parameters to TEXTBEFORE, TEXTAFTER, and REGEXREPLACE.

If you want a multi-character delimiter, I suggest a prior call to REGEXREPLACE to convert your desired delimiter to a single character you're sure won't be used in your input text.

If you want to use (or permit) comma+space instead of just comma, I suggest calling TRIM on the final result.

In the very last example (row 8 in the screenshot) you'll see a bug in Excel, where TEXTAFTER returns a #VALUE error for an empty string, despite the flag telling it to match to the start. (It should return an empty string.) If it's important to you that a leading comma return a blank value, you'll have to kludge around it. E.g. insert an obscure Unicode character at the start of every string and then delete it from the final results.

Feedback, comments, corrections, and superior approaches are all welcome! :-)

submitted by /u/GregHullender
[link] [comments]

How to Split Strings and Join Values

Want to read more?

Tagged with