Written documents created through dictation differ significantly from a true verbatim transcript of the recorded speech.
This poses an obstacle in automatic dictation systems as speech recognition output needs to undergo a fair amount of editing in order to turn it into a document that complies with the customary standards.
We present an approach that attempts to perform this edit from recognized words to final document automatically by learning the appropriate transformations from example documents.
This addresses, in an integrated way, a number of problems that have so far been studied independently, in particular automatic punctuation, text segmentation, error correction and disfluency repair.
We study two different learning methods, one based on rule induction and one based on a probabilistic sequence model.
Quantitative evaluation shows that the probabilistic method performs more accurately.
1 Introduction
Large vocabulary speech recognition today achieves a level of accuracy that makes it useful in the production of written documents.
Especially in the medical and legal domains large volumes of text are traditionally produced by means of dictation.
Here document creation is typically a "back-end" process.
The author dictates all necessary information into a telephone handset or a portable recording device and is not concerned with the actual production of the document any further.
A transcriptionist will then listen to the recorded dictation and produce a well-formed document using a word processor.
The goal of introducing speech recognition in this process is to create a draft document automatically, so that the transcriptionist only has to verify the accuracy of the document and to fix occasional recognition errors.
We observe that users try to spend as little time as possible dictating.
They usually focus only on the content and rely on the transcriptionist to compose a readable, syntactically correct, stylistically acceptable and formally compliant document.
For this reason there is a considerable discrepancy between the final document and what the speaker has said literally.
In particular in medical reports we see differences of the following kinds:
• Punctuation marks are typically not verbalized.
• No instructions on the formatting of the report are dictated.
Section headings are not identified as such.
• Frequently section headings are only implied.
("vitals are" — "Physical Examination: Vital Signs:")
• The dictation usually begins with a preamble (e.g. "This is doctor Xyz ...") which does not appear in the report.
Similarly there are typical phrases at the end of the dictation which should not be transcribed (e.g. "End of dictation. Thank you.").
• There are specific standards regarding the use of medical terminology.
Transcriptionists frequently expand dictated abbreviations (e.g. "CVA" → "cerebrovascular accident") or otherwise use equivalent terms (e.g. "nonicteric sclerae" → "no scleral icterus").
• The dictation typically has a more narrative style (e.g. "She has no allergies.", "I examined him").
In contrast, the report is normally more impersonal and structured (e.g. "Allergies: None.", "he was examined").
• The dictation may contain instructions to the transcriptionist as well as so-called normal reports: pre-defined text templates invoked by a short phrase like "This is a normal chest x-ray."
• In addition to the above, speech recognition output has the usual share of recognition errors some of which may occur systematically.
These phenomena pose a problem that goes beyond the speech recognition task which has traditionally focused on correctly identifying speech utterances.
Even with a perfectly accurate verbatim transcript of the user's utterances, the transcriptionist would need to perform a significant amount of editing to obtain a document conforming to the customary standards.
We need to look for what the user wants rather than what he says.
The method we present in the following attempts to address all this by a unified transformation model.
The goal is simply stated as transforming the recognition output into a text document.
We will first describe the general framework of learning transformations from example documents.
In the following two sections we will discuss a rule-induction-based and a probabilistic transformation method respectively.
Finally we present experimental results in the context of medical transcription and conclude with an assessment of both methods.
2 Text transformation
In dictation and transcription management systems corresponding pairs of recognition output and edited and corrected documents are readily available.
The idea of transformation modeling, outlined in figure 1, is to learn to emulate the transcriptionist.
To this end we first process archived dictations with the speech recognizer to create approximate verbatim transcriptions.
For each document this yields the spoken or source word sequence S = s_1 ... s_M, which is supposed to be a word-by-word transcription of the user's utterances, but which may actually contain recognition errors.
The corresponding final reports are cleaned (removal of page headers etc.), tagged (identification of section headings and enumerated lists) and tokenized, yielding the text or target token sequence T = t_1 ... t_N for each document.
Generally, the token sequence corresponds to the spoken form.
(E.g. "25mg" is tokenized as "twenty five milligrams".)
Tokens can be ordinary words or special symbols representing line breaks, section headings, etc. Specifically, we represent each section heading by a single indivisible token, even if the section name consists of multiple words.
Enumerations are represented by special tokens, too.
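For concreteness, a minimal tokenization sketch is given below (Python); the special-token names and the expansion rules are illustrative assumptions, not the system's actual inventory.

```python
# Minimal tokenization sketch; the special-token names and the expansion
# rules are illustrative assumptions, not the system's actual inventory.

import re

NEWLINE_TOKEN = "<nl>"                 # hypothetical line-break token
HEADING_FORMAT = "<heading:{}>"        # one indivisible token per heading

NUMBER_WORDS = {"25": "twenty five"}   # toy number-word lookup

def tokenize_report(lines):
    """Turn a cleaned, tagged report into a target token sequence T."""
    tokens = []
    for line in lines:
        heading = re.match(r"^([A-Za-z ]+):$", line)
        if heading:
            # a multi-word section name becomes a single indivisible token
            name = heading.group(1).lower().replace(" ", "_")
            tokens.append(HEADING_FORMAT.format(name))
            continue
        for word in line.split():
            m = re.match(r"^(\d+)mg$", word)
            if m:
                # "25mg" is tokenized as its spoken form
                tokens += NUMBER_WORDS[m.group(1)].split() + ["milligrams"]
            else:
                tokens.append(word.lower())
        tokens.append(NEWLINE_TOKEN)
    return tokens

print(tokenize_report(["Physical Examination:", "He was given 25mg daily"]))
# ['<heading:physical_examination>', 'he', 'was', 'given',
#  'twenty', 'five', 'milligrams', 'daily', '<nl>']
```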
Different techniques can be applied to learn and execute the actual transformation from S to T. Two options are discussed in the following.
With the transformation model at hand, a draft for a new document is created in three steps.
First the speech recognizer processes the audio recording and produces the source word sequence S. Next, the transformation step converts S into the target sequence T. Finally the transformation output T is formatted into a text document.
Formatting is the inverse of tokenization and includes conversion of number words to digits, rendition of paragraphs and section headings, etc.

Figure 1: Illustration of how text transformation is integrated into a speech-to-text system. (Training: archived dictations are recognized into transcripts, archived documents are tokenized, and the paired sequences train the transformation model. Runtime: a dictation is recognized, the transcript is transformed, and the result is formatted; manual correction yields the final document.)
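For concreteness, the three-step draft creation can be outlined as in the following schematic sketch; the component interfaces are placeholders for the modules described above, not the actual system's APIs.

```python
# Schematic of three-step draft creation; all component interfaces are
# placeholders for the modules described in the text, not actual APIs.

def create_draft(audio, recognizer, transformation_model, formatter):
    source = recognizer.recognize(audio)              # source word sequence S
    target = transformation_model.transform(source)   # target token sequence T
    return formatter.format(target)                   # formatted draft document
```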
Before we turn to concrete transformation techniques, we can make two general statements about this problem.
Firstly, in the absence of observations to the contrary, it is reasonable to leave words unchanged.
So, a priori the mapping should be the identity.
Secondly, the transformation is mostly monotonic: out-of-order sections do occur, but they are the exception rather than the rule.
3 Transformation based learning
Following Strzalkowski and Brandow (1997) and Peters and Drexel (2004) we have implemented a transformation-based learning (TBL) algorithm (Brill, 1995).
This method iteratively improves the match (as measured by token error rate) of a collection of corresponding source and target token sequences by positing and applying a sequence of substitution rules.
In each iteration the source and target tokens are aligned using a minimum edit distance criterion.
We refer to maximal contiguous subsequences of non-matching tokens as error regions.
These consist of paired sequences of source and target tokens, where either sequence may be empty.
Each error region serves as a candidate substitution rule.
Additionally we consider refinements of these rules with varying amounts of contiguous context tokens on either side.
Deviating from Peters and Drexel (2004), in the special case of an empty target sequence, i.e. a deletion rule, we consider deleting all (non-empty) contiguous subsequences of the source sequence as well.
For each candidate rule we accumulate two counts: the number of exactly matching error regions and the number of false alarms, i.e. when its left-hand-side matches a sequence of already correct tokens.
Rules are ranked by the difference in these counts scaled by the number of errors corrected by a single rule application, which is the length of the corresponding error region.
This is an approximation to the total number of errors corrected by a rule, ignoring rule interactions and non-local changes in the minimum edit distance alignment.
A subset of the top-ranked non-overlapping rules satisfying frequency and minimum impact constraints are selected and the source sequences are updated by applying the selected rules.
Again deviating from Peters and Drexel (2004), we consider two rules as overlapping if the left-hand-side of one is a contiguous subsequence of the other.
This procedure is iterated until no additional rules can be selected.
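A condensed sketch of one such iteration is given below (Python). The alignment itself is assumed given; context-refined rules and the deletion-subsequence variants are omitted, and false alarms are only counted within the document that proposed the rule, all simplifications relative to the description above.

```python
# Sketch of one TBL iteration. An alignment is a list of (source, target)
# token pairs with None marking gaps; matching positions have source == target.
# Simplifications: context-refined rules and deletion-subsequence variants
# are omitted; false alarms are counted only within the same document.

from collections import Counter

def error_regions(alignment):
    """Group non-matching positions into maximal contiguous error regions."""
    regions, src, tgt = [], [], []
    for s, t in alignment:
        if s == t:
            if src or tgt:
                regions.append((tuple(src), tuple(tgt)))
                src, tgt = [], []
        else:
            if s is not None:
                src.append(s)
            if t is not None:
                tgt.append(t)
    if src or tgt:
        regions.append((tuple(src), tuple(tgt)))
    return regions

def contains(seq, sub):
    """True if sub occurs as a contiguous subsequence of seq."""
    return any(tuple(seq[i:i + len(sub)]) == sub
               for i in range(len(seq) - len(sub) + 1))

def tbl_iteration(alignments, min_count=2, min_impact=2):
    matches, false_alarms = Counter(), Counter()
    for alignment in alignments:
        correct = [s for s, t in alignment if s == t]
        regions = error_regions(alignment)
        for rule in regions:                  # exactly matching error regions
            matches[rule] += 1
        for rule in set(regions):             # false alarms on correct tokens
            if rule[0] and contains(correct, rule[0]):
                false_alarms[rule] += 1

    def score(rule):
        # (matches - false alarms) scaled by errors corrected per application,
        # approximated by the length of the error region
        return (matches[rule] - false_alarms[rule]) * max(len(rule[0]),
                                                          len(rule[1]))

    selected = []
    for rule in sorted(matches, key=score, reverse=True):
        if matches[rule] < min_count or score(rule) < min_impact:
            continue
        # two rules overlap if the left-hand side of one is a contiguous
        # subsequence of the other
        if any(contains(r[0], rule[0]) or contains(rule[0], r[0])
               for r in selected):
            continue
        selected.append(rule)
    return selected
```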
The initial rule set is populated by a small sequence of hand-crafted rules (e.g. "impression colon" → "impression:").
A user-independent baseline rule set is generated by applying the algorithm to data from a collection of users.
We construct speaker-dependent models by initializing the algorithm with the speaker-independent rule set and applying it to data from the given user.
4 Probabilistic model
The canonical approach to text transformation following statistical decision theory is to maximize the text document posterior probability given the spoken document:

T̂ = argmax_T p(T | S)    (1)
Obviously, the global model p(T|S) must be constructed from smaller scale observations on the correspondence between source and target words.
We use a 1-to-n alignment scheme.
This means each source word is assigned to a sequence of zero, one or more target words.
We denote the target words assigned to source word s_i as T_i.
Each replacement T_i is a possibly empty sequence of target words.
A source word together with its replacement sequence will be called a segment.
We constrain the set of possible transformations by selecting a relatively small set of allowable replacements A(s) for each source word.
This means we require T_i ∈ A(s_i).
We use the usual m-gram approximation to model the joint probability of a transformation as a product over segments:

p(S, T) ≈ ∏_{i=1}^{M} p(s_i, T_i | s_{i-m+1}, T_{i-m+1}, ..., s_{i-1}, T_{i-1})    (2)
The work of Ringger and Allen (1996) is similar in spirit to this method, but uses a factored source-channel model.
Note that the decision rule (1) is over whole documents.
Therefore we process complete documents at a time without prior segmentation into sentences.
To estimate this model we first align all training documents.
That is, for each document, the target word sequence is segmented into M segments, T = T_1 ... T_M, one replacement per source word.
The criterion for this alignment is to maximize the likelihood of a segment unigram model.
The alignment is performed by an expectation maximization algorithm.
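A Viterbi-style (hard-EM) variant of this alignment can be sketched as follows; the maximum replacement length, the initialization and the smoothing are assumptions made for the sketch, not details taken from the system.

```python
# Viterbi-style (hard-EM) sketch of the alignment step: each source word
# is assigned zero or more consecutive target tokens so as to maximize
# the likelihood of a segment unigram model. MAX_LEN and the smoothing
# are assumptions; assumes len(target) <= MAX_LEN * len(source).

import math
from collections import Counter

MAX_LEN = 3  # assumed cap on the length of a replacement sequence

def viterbi_align(source, target, logp):
    """Best monotone 1-to-n segmentation of target over the source words."""
    M, N = len(source), len(target)
    NEG = float("-inf")
    best = [[NEG] * (N + 1) for _ in range(M + 1)]
    back = [[0] * (N + 1) for _ in range(M + 1)]
    best[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(N + 1):
            for k in range(min(MAX_LEN, j) + 1):
                seg = (source[i - 1], tuple(target[j - k:j]))
                score = best[i - 1][j - k] + logp(seg)
                if score > best[i][j]:
                    best[i][j], back[i][j] = score, k
    segments, j = [], N            # trace back the chosen segmentation
    for i in range(M, 0, -1):
        k = back[i][j]
        segments.append((source[i - 1], tuple(target[j - k:j])))
        j -= k
    return list(reversed(segments))

def align_corpus(pairs, iterations=5):
    """pairs: list of (source_words, target_tokens) training documents."""
    def logp(seg, counts, total):
        if counts:
            return math.log((counts[seg] + 0.1) / (total + 0.1))
        # first iteration: crude initialization favoring the identity mapping
        return 0.0 if seg[1] == (seg[0],) else -5.0

    counts, total = None, 1
    for _ in range(iterations):
        new_counts = Counter()
        for source, target in pairs:
            for seg in viterbi_align(source, target,
                                     lambda s: logp(s, counts, total)):
                new_counts[seg] += 1
        counts, total = new_counts, sum(new_counts.values())
    return counts  # segment unigram counts; alignments via viterbi_align
```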
Subsequent to the alignment step, m-gram probabilities are estimated by standard language modeling techniques.
We create speaker-specific models by linearly interpolating an m-gram model based on data from the user with a speaker-independent background m-gram model trained on data pooled from a collection of users.
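The interpolation itself is straightforward; in a sketch (with a hypothetical model interface and an assumed interpolation weight):

```python
# Sketch of speaker adaptation by linear interpolation of segment m-gram
# models; the model interface and the weight lam are assumptions.

def interpolated_prob(segment, history, user_model, background_model, lam=0.5):
    """p(segment | history) as a mixture of user-specific and background models."""
    return (lam * user_model.prob(segment, history)
            + (1 - lam) * background_model.prob(segment, history))
```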
To select the allowable replacements for each source word we count how often each particular target sequence is aligned to it in the training data.
A source-target pair is selected if it occurs two or more times.
Source words that were not observed in training are immutable, i.e. the word itself is its only allowable replacement A(s) = {(s)}.
As an example suppose "patient" was deleted 10 times, left unchanged 105 times, replaced by "the patient" 113 times and once replaced by "she".
The word "patient" would then have three allowables: A(patient) = {(), (patient), (the, patient)}.
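The selection step thus amounts to thresholded counting, as in the following sketch (Python), which reproduces the example above; the data structures are illustrative, not the system's actual representation.

```python
# Sketch of allowable-replacement selection by thresholded counting,
# reproducing the example above.

from collections import Counter, defaultdict

def build_allowables(aligned_segments, min_count=2):
    """aligned_segments: (source_word, replacement_tuple) pairs from training."""
    counts = Counter(aligned_segments)
    allow = defaultdict(set)
    for (source, replacement), n in counts.items():
        if n >= min_count:
            allow[source].add(replacement)
    # words not observed in training are immutable: A(s) = {(s,)}
    return lambda s: allow[s] or {(s,)}

segments = ([("patient", ())] * 10 + [("patient", ("patient",))] * 105
            + [("patient", ("the", "patient"))] * 113
            + [("patient", ("she",))])
A = build_allowables(segments)
print(sorted(A("patient")))  # [(), ('patient',), ('the', 'patient')]
print(A("unseen"))           # {('unseen',)}
```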
The decision rule (1) minimizes the document error rate.
A more appropriate loss function is the number of source words that are replaced incorrectly.
Therefore we use the following minimum word risk (MWR) decision strategy, which minimizes the source word loss:

T̂_i = argmax_{T_i ∈ A(s_i)} p(T_i | S),  i = 1, ..., M    (3)

This means for each source sequence position we choose the replacement that has the highest posterior probability p(T_i | S) given the entire source sequence.
To compute the posterior probabilities, first a graph is created representing alternatives "around" the most probable transform using beam search.
Then the forward-backward algorithm is applied to compute edge posterior probabilities.
Finally edge posterior probabilities for each source position are accumulated.
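A simplified version of this computation, replacing the beam-search graph with an n-best list of whole-document transformations, might look as follows; the probabilities in the example are illustrative.

```python
# Simplified minimum word risk decoding: posterior probabilities are
# accumulated over an n-best list of whole-document transformations
# instead of the beam-search graph used in the actual system.

import math
from collections import defaultdict

def mwr_decode(nbest):
    """nbest: list of (log_prob, replacements), one replacement per source word."""
    max_lp = max(lp for lp, _ in nbest)
    z = sum(math.exp(lp - max_lp) for lp, _ in nbest)
    posterior = defaultdict(lambda: defaultdict(float))
    for lp, replacements in nbest:
        p = math.exp(lp - max_lp) / z
        for i, r in enumerate(replacements):
            posterior[i][r] += p   # accumulate per source position
    # pick the highest-posterior replacement at every source position
    return [max(posterior[i], key=posterior[i].get)
            for i in range(len(nbest[0][1]))]

nbest = [(-1.0, [("the", "patient"), ("was",)]),
         (-1.2, [("patient",), ("was",)]),
         (-3.0, [("the", "patient"), ()])]
print(mwr_decode(nbest))  # [('the', 'patient'), ('was',)]
```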
5 Experimental evaluation
The methods presented were evaluated on a set of real-life medical reports dictated by 51 doctors.
For each doctor we use 30 reports as a test set.
Transformation models are trained on a disjoint set of reports that predated the evaluation reports.
The typical document length is between one hundred and one thousand words.
All dictations were recorded via telephone.
The speech recognizer works with acoustic models that are specifically adapted for each user, not using the test data, of course.
It is hard to quote the verbatim word error rate of the recognizer, because this would require a careful and time-consuming manual transcription of the test set.
The recognition output is auto-punctuated by a method similar in spirit to the one proposed by Liu et al. (2005) before being passed to the transformation model.
This was done because we considered the auto-punctuation output as the status quo ante to which transformation modeling was to be compared.
Neither transformation method actually relies on having auto-punctuated input.
The auto-punctuation step only inserts periods and commas and the document is not explicitly segmented into sentences.
(The transformation step always applies to entire documents and the interpretation of a period as a sentence boundary is left to the human reader of the document.)

Table 1: Experimental evaluation of different text transformation techniques with different amounts of user-specific data. Precision, recall, deletion, insertion and error rate values are given in percent and represent the average of 51 users, where the results for each user are the ratios of sums over 30 reports. (Columns cover precision and recall for section headings and punctuation as well as deletions, insertions and the error rate over all tokens; rows include the baseline "none (only auto-punct)" and a "3-gram without MWR" variant.)
For each doctor a background transformation model was constructed using 100 reports from each of the other users.
This is referred to as the speaker-independent (SI) model.
In the case of the probabilistic model, all models were 3-gram models.
User-specific models were created by augmenting the SI model with 25, 50 or 100 reports.
One report from the test set is shown as an example in the appendix.
5.1 Evaluation metric
The output of the text transformation is aligned with the corresponding tokenized report using a minimum edit cost criterion.
Alignments between section headings and non-section headings are not permitted.
Likewise no alignment of punctuation and non-punctuation tokens is allowed.
Using the alignment we compute precision and recall for section headings and punctuation marks as well as the overall token error rate.
It should be noted that the error rate derived in this way is not comparable to word error rates usually reported in speech recognition research.
All missing or erroneous section headings, punctuation marks and line breaks are counted as errors.
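The token-level part of this metric can be sketched as a constrained edit distance, as below; the heading and punctuation token conventions are assumptions for illustration, and the precision/recall bookkeeping over the alignment is omitted.

```python
# Sketch of the constrained alignment used for scoring: heading tokens may
# only align to heading tokens and punctuation only to punctuation. The
# token conventions are assumptions; precision/recall bookkeeping omitted.

INF = float("inf")

def is_heading(tok):
    return tok.startswith("<heading")   # assumed heading-token convention

def is_punct(tok):
    return tok in {".", ","}

def sub_cost(a, b):
    if is_heading(a) != is_heading(b) or is_punct(a) != is_punct(b):
        return INF                      # forbidden alignment
    return 0 if a == b else 1

def token_errors(hyp, ref):
    """Minimum edit cost between draft (hyp) and report (ref) tokens."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # insertion
                          d[i][j - 1] + 1,                      # deletion
                          d[i - 1][j - 1] + sub_cost(hyp[i - 1], ref[j - 1]))
    return d[m][n]

def token_error_rate(hyp, ref):
    return token_errors(hyp, ref) / len(ref)
```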
As pointed out in the introduction the reference texts do not represent a literal transcript of the dictation.
Furthermore the data were not cleaned manually.
There are, for example, instances of letter heads or page numbers that were not correctly removed when the text was extracted from the word processor's file format.
The example report shown in the appendix features some of the typical differences between the produced draft and the final report that may or may not be judged as errors.
(For example, the date of the report was not given in the dictation, the section names "laboratory data" and "laboratory evaluation" are presumably equivalent and whether "stable" is preceded by a hyphen or a period in the last section might not be important.)
Nevertheless, the numbers reported do permit a quantitative comparison between different methods.
Results are stated in table 1.
In the baseline setup no transformation is applied to the auto-punctuated recognition output.
Since many parts of the source data do not need to be altered, this constitutes the reference point for assessing the benefit of transformation modeling.
For obvious reasons precision and recall of section headings are zero.
A high rate of insertion errors is observed which can largely be attributed to preambles.
Both transformation methods reduce the discrepancy between the draft document and the final corrected document significantly.
With 100 training documents per user the mean token error rate is reduced by up to 40% relative by the probabilistic model.
When user specific data is used, the probabilistic approach performs consistently better than TBL on all accounts.
In particular it always has much lower insertion rates, reflecting its superior ability to remove utterances that are not typically part of the report.
On the other hand the probabilistic model suffers from a slightly higher deletion rate due to being overzealous in this regard.
In speaker independent mode, however, the deletion rate is excessively high and leads to inferior overall performance.
Interestingly the precision of the automatic punctuation is increased by the transformation step, without compromising on recall, at least when enough user specific training data is available.
The minimum word risk criterion (3) yields slightly better results than the simpler document risk criterion (1).
6 Conclusions
Automatic text transformation brings speech recognition output much closer to the end result desired by the user of a back-end dictation system.
It automatically punctuates, sections and rephrases the document and thereby greatly enhances transcriptionist productivity.
The holistic approach followed here is simpler and more comprehensive than a cascade of more specialized methods.
Whether or not the holistic approach is also more accurate is not an easy question to answer.
Clearly the outcome would depend on the specifics of the specialized methods one would compare to, as well as the complexity of the integrated transformation model one applies.
The simple models studied in this work admittedly make little provision for targeting specific transformation problems.
For example the typical length of a section is not taken into account.
However, this is not a limitation of the general approach.
We have observed that a simple probabilistic sequence model performs consistently better than the transformation-based learning approach.
Even though neither method is novel, we deem this an important finding, since none of the previous publications we know of in this domain allows this conclusion.
While the present experiments have used a separate auto-punctuation step, future work will aim to eliminate it by integrating the punctuation features into the transformation step.
In the future we plan to integrate additional knowledge sources into our statistical method in order to more specifically address each of the various phenomena encountered in spontaneous dictation.
