Using Java regexes

Escaping \ #

A regex in Java is a string.

Recall that within a Java string, a \ character must be escaped (i.e. written \\) to be read as a normal character. So in a Java string that represents a regex, every \ should be written \\.

Example. The string "[A-Z]\\d+" represents the regex [A-Z]\d+

Hint. This is another good reason for using a regex validation tool (e.g. regex101) to test and debug a regex, before incorporating it into a program.

Warning. We saw earlier that when \ is treated as a normal character in a regex, it must be escaped (\\). Therefore in this specific case, the Java string should contain \\\\.

Example. The string "\\d+\\\\[A-Z]" represents the regex \d+\\[A-Z]
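The double escaping above can be checked directly. The following is only a minimal sketch: the class name EscapingDemo and the helper isValid are hypothetical, introduced for illustration.

```java
public class EscapingDemo {
    // The Java literal "\\d+\\\\[A-Z]" holds the 10-character regex \d+\\[A-Z]:
    // one or more digits, then a literal backslash, then an uppercase letter.
    static final String REGEX = "\\d+\\\\[A-Z]";

    // Hypothetical helper: true iff the whole input matches REGEX
    static boolean isValid(String input) {
        return input.matches(REGEX);
    }

    public static void main(String[] args) {
        // Prints the regex as the regex engine sees it: \d+\\[A-Z]
        System.out.println(REGEX);

        // The Java literal "42\\B" is the 4-character string 42\B
        // Outputs true
        System.out.println(isValid("42\\B"));

        // Outputs false (no backslash between the digits and the letter)
        System.out.println(isValid("42B"));
    }
}
```

Printing the string is a quick way to see which regex a Java literal actually represents.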

Regexes and String instance methods #

Some instance methods of the class String take a regex as input. Among these:

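  • boolean matches(String regex) returns true iff the whole string belongs to the language described by the regex, or equivalently iff there is a match for the regex ^(regex)$. The parentheses matter when the regex is a disjunction: ^a|ab$ is not the same regex as ^(a|ab)$.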
String input = "ab";

// Outputs false
System.out.println(input.matches("a"));

// Outputs true
System.out.println(input.matches("ab|a"));

// Outputs true
System.out.println(input.matches("a|ab"));
  • String replaceFirst(String regex, String replacement) replaces the best first match with the input replacement string:
String input = "aba";

// Outputs "ba"
System.out.println(input.replaceFirst("ab|a", "b"));

// Outputs "bba"
System.out.println(input.replaceFirst("a|ab", "b"));
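  • String replaceAll(String regex, String replacement) replaces every best first match with the input replacement string:
String input = "aba";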

// Outputs "bb"
System.out.println(input.replaceAll("ab|a", "b"));

// Outputs "bbb"
System.out.println(input.replaceAll("a|ab", "b"));

For both replaceFirst and replaceAll, the replacement string can use a reference to a group, written $n, where n is the group number:

String input = "The windows are open. The apples are green.";
String output = input.replaceAll("(\\w+)s\\s+are", "$1 is");

// Outputs "The window is open. The apple is green."
System.out.println(output);
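Group references can also reorder captured text. The sketch below is illustrative only: the class GroupReferenceDemo and its methods are hypothetical. It also shows that, because $ introduces a group reference in a replacement string, a literal $ has to be escaped with a backslash (written \\$ in a Java string).

```java
public class GroupReferenceDemo {
    // Swaps "Surname, Name" into "Name Surname" using two group references
    static String swapNames(String input) {
        return input.replaceAll("(\\w+), (\\w+)", "$2 $1");
    }

    // Rewrites "10 EUR" as "$10": in the replacement, \$ is a literal
    // dollar sign and $1 is the captured amount.
    static String toDollars(String input) {
        return input.replaceAll("(\\d+) EUR", "\\$$1");
    }

    public static void main(String[] args) {
        // Outputs "Ada Lovelace; Alan Turing"
        System.out.println(swapNames("Lovelace, Ada; Turing, Alan"));

        // Outputs "Price: $10"
        System.out.println(toDollars("Price: 10 EUR"));
    }
}
```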

Reminder. Strings in Java are immutable. So when the methods replaceFirst and replaceAll successfully “modify” the string, they return a different object.
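A common mistake that follows from this is to call a replacement method and discard its result. A minimal sketch (the class ImmutabilityDemo and the helper censor are hypothetical names):

```java
public class ImmutabilityDemo {
    // Returns a copy of the input where every digit is replaced with '#'
    static String censor(String input) {
        return input.replaceAll("\\d", "#");
    }

    public static void main(String[] args) {
        String input = "Room 42";

        // The result is discarded, so this line has no visible effect
        input.replaceAll("\\d", "#");
        // Outputs "Room 42"
        System.out.println(input);

        // The result must be stored (or reassigned to the same variable)
        String output = censor(input);
        // Outputs "Room ##"
        System.out.println(output);
        // Still outputs "Room 42"
        System.out.println(input);
    }
}
```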

  • String[] split(String regex) splits the string around the matches of the regex:
String input = "This is weird.\n" +
               "Or not.";

// Contains [ "This", "is", "weird.", "Or", "not." ]
String[] output = input.split("\\s+");
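Two details of split are easy to overlook: the separator regex can absorb surrounding whitespace, and a separator found at the very beginning of the string yields an empty first element (trailing empty strings are dropped). The sketch below illustrates both; the class SplitDemo and its helpers are hypothetical.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits on commas, discarding the whitespace around each comma
    static String[] fields(String line) {
        return line.split("\\s*,\\s*");
    }

    // Trims first, so that leading whitespace does not yield an empty element
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // Outputs [Alice, Bob, Carol]
        System.out.println(Arrays.toString(fields("Alice , Bob,Carol")));

        // Outputs [, This, is, weird.] (note the empty first element)
        System.out.println(Arrays.toString(" This is weird. ".split("\\s+")));

        // Outputs [This, is, weird.]
        System.out.println(Arrays.toString(words(" This is weird. ")));
    }
}
```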

java.util.regex #

The standard package java.util.regex contains, among others, the following classes:

Pattern #

A Pattern is a compiled regex.

A Pattern can be created with the static method Pattern Pattern.compile(String regex).

Matcher #

A Matcher is a “regex engine” for a specific regex and a specific string.

A Matcher can be created out of a Pattern, with the instance method Matcher matcher(String inputString) of the class Pattern.

Pattern pattern = Pattern.compile("\\d+[a-z]*");
Matcher matcher = pattern.matcher("Alice787@unibz");
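One reason to work with Pattern explicitly is that a regex can be compiled once and then used to create a Matcher for each of many input strings. The sketch below assumes a hypothetical class PatternReuseDemo and a made-up identifier format; it relies on the method matches() of Matcher, which checks the whole input string, like matches(String regex) does for the class String.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReuseDemo {
    // Compiled once, shared by all calls (illustrative format: a letter, then digits)
    private static final Pattern ID = Pattern.compile("[A-Z]\\d+");

    // Counts the inputs that are entirely matched by the pattern
    static long countValid(List<String> inputs) {
        long count = 0;
        for (String input : inputs) {
            // One Matcher per input string, all built from the same Pattern
            Matcher matcher = ID.matcher(input);
            if (matcher.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Outputs 2 ("A12" and "Z9" are valid)
        System.out.println(countValid(List.of("A12", "b7", "Z9", "A12x")));
    }
}
```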

Among others, the class Matcher provides the following instance methods:

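  • boolean matches(), String replaceFirst(String replacement) and String replaceAll(String replacement) behave analogously to their counterparts in the class String, described above. They do not take the regex as an argument, since it is already determined by the Matcher.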

  • boolean find() searches for the next best first match. If it succeeds, then information about the matched segment can be retrieved (see below). The next call to find() will look for the following best first match (if any).

Pattern pattern = Pattern.compile("[A-Z][a-z]*");
Matcher matcher = pattern.matcher("Alice and Bob are exhausted.");

// Outputs true and matches the segment with word "Alice"
System.out.println(matcher.find());

// Outputs true and matches the segment with word "Bob"
System.out.println(matcher.find());

// Outputs false
System.out.println(matcher.find());
  • String group(int i) returns the substring captured by group i in the latest match. Group 0 stands for the whole regex.

  • String group() is equivalent to group(0).

  • int start() returns the start index (included) of the latest matched segment.

  • int end() returns the end index (excluded) of the latest matched segment.

Pattern pattern = Pattern.compile("([A-Z])[a-z]*");
Matcher matcher = pattern.matcher("Alice and Bob are exhausted.");

// Find the best first match
matcher.find();
// Outputs "Alice"
System.out.println(matcher.group());
// Outputs "A"
System.out.println(matcher.group(1));
// Outputs 0
System.out.println(matcher.start());
// Outputs 5
System.out.println(matcher.end());

// Find the next best first match
matcher.find();
// Outputs "Bob"
System.out.println(matcher.group());
// Outputs "B"
System.out.println(matcher.group(1));
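  • Stream<MatchResult> results() returns a stream with one match result (see MatchResult below) for each best first match, in order of occurrence:
Pattern pattern = Pattern.compile("[A-Z][a-z]*");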
Matcher matcher = pattern.matcher("Alice and Bob are exhausted.");

// Contains two match results:
// - one for the segment with word "Alice",
// - one for the segment with word "Bob"
List<MatchResult> matches = matcher.results().toList();
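The methods find(), group(), start() and end() are typically combined in a loop that visits every match in turn. A minimal sketch, where the class FindLoopDemo, the method describeMatches and the "text@start-end" output format are all hypothetical choices made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindLoopDemo {
    // Describes each match as "text@start-end" (end index excluded)
    static List<String> describeMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> descriptions = new ArrayList<>();
        // Each successful call to find() moves on to the next match
        while (matcher.find()) {
            descriptions.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
        }
        return descriptions;
    }

    public static void main(String[] args) {
        // Outputs [Alice@0-5, Bob@10-13]
        System.out.println(describeMatches("[A-Z][a-z]*", "Alice and Bob are exhausted."));
    }
}
```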

MatchResult #

A MatchResult is a matched segment.

The interface MatchResult provides (among others) the instance methods String group(int i), String group(), int start() and int end(), which behave analogously to their counterparts in the class Matcher.

Pattern pattern = Pattern.compile("[A-Z][a-z]*");
Matcher matcher = pattern.matcher("Alice and Bob are exhausted.");

// Get all best first matches
List<MatchResult> matches = matcher.results().toList();
// Outputs "Alice"
System.out.println(matches.get(0).group());
// Outputs "Bob"
System.out.println(matches.get(1).group());
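Since results() returns a stream, match results can also be transformed directly with stream operations, without building an intermediate list of MatchResult. The following sketch assumes Java 9 or later; the class MatchResultDemo and its two methods are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MatchResultDemo {
    private static final Pattern CAPITALIZED = Pattern.compile("([A-Z])[a-z]*");

    // Collects what group 1 (the initial letter) captured in each match
    static List<String> initials(String input) {
        return CAPITALIZED.matcher(input)
                .results()
                .map(result -> result.group(1))
                .collect(Collectors.toList());
    }

    // Computes the length of each matched segment from its start and end indices
    static List<Integer> lengths(String input) {
        return CAPITALIZED.matcher(input)
                .results()
                .map(result -> result.end() - result.start())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "Alice and Bob are exhausted.";
        // Outputs [A, B]
        System.out.println(initials(input));
        // Outputs [5, 3]
        System.out.println(lengths(input));
    }
}
```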