Voice to text in Android with Architecture Components

Let’s combine voice to text in Android with the new Architecture Components. This will make our speech recognizer class lifecycle aware - allowing us to keep recording while the screen rotates - and give us a proper separation of concerns. Our goal is to implement it through a mashup of MVVM and MVI. Let’s have a look at how it’s done!

Setup

This is all standard procedure, but in short we need to add the following to our build.gradle (Module: app) dependencies to access ViewModel and LiveData.

implementation "android.arch.lifecycle:extensions:1.0.0"
annotationProcessor "android.arch.lifecycle:compiler:1.0.0"

And we need to add two permissions to our AndroidManifest.xml.

<uses-permission android:name="android.permission.RECORD_AUDIO"/>
<uses-permission android:name="android.permission.INTERNET"/>

The project will consist of a SpeechRecognizerViewModel class, responsible for providing updates to the UI. The UI itself is managed through a single MainActivity.

The final result is available at this repository.

The Speech Recognizer

Let’s create the view model SpeechRecognizerViewModel to own the speech-to-text handling. We need to implement RecognitionListener, which will allow us to receive the recognition results.

import android.Manifest
import android.app.Application
import android.arch.lifecycle.AndroidViewModel
import android.arch.lifecycle.LiveData
import android.arch.lifecycle.MutableLiveData
import android.content.Intent
import android.content.pm.PackageManager
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer
import android.speech.SpeechRecognizer.*
import android.support.v4.content.ContextCompat

class SpeechRecognizerViewModel(application: Application) : AndroidViewModel(application), RecognitionListener {

    data class ViewState(
            val spokenText: String,
            val isListening: Boolean,
            val error: String?
    )

    private var viewState: MutableLiveData<ViewState>? = null

    private val speechRecognizer: SpeechRecognizer = createSpeechRecognizer(application.applicationContext).apply {
        setRecognitionListener(this@SpeechRecognizerViewModel)
    }

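    // Free-form language model suits general dictation; EXTRA_PARTIAL_RESULTS asks
    // the engine to deliver onPartialResults() while the user is still speaking.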
    private val recognizerIntent: Intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
        putExtra(RecognizerIntent.EXTRA_CALLING_PACKAGE, application.packageName)
        putExtra(RecognizerIntent.EXTRA_PARTIAL_RESULTS, true)
    }

    val isListening: Boolean
        get() = viewState?.value?.isListening ?: false

    var permissionToRecordAudio = checkAudioRecordingPermission(context = application)

    fun getViewState(): LiveData<ViewState> {
        if (viewState == null) {
            viewState = MutableLiveData<ViewState>().apply { value = initViewState() }
        }

        return viewState!!
    }

    private fun initViewState() = ViewState(spokenText = "", isListening = false, error = null)

    fun startListening() {
        speechRecognizer.startListening(recognizerIntent)
        notifyListening(isRecording = true)
    }

    fun stopListening() {
        speechRecognizer.stopListening()
        notifyListening(isRecording = false)
    }

    private fun notifyListening(isRecording: Boolean) {
        viewState?.value = viewState?.value?.copy(isListening = isRecording)
    }

    private fun updateResults(speechBundle: Bundle?) {
        val userSaid = speechBundle?.getStringArrayList(RESULTS_RECOGNITION)
        viewState?.value = viewState?.value?.copy(spokenText = userSaid?.get(0) ?: "")
    }

    private fun checkAudioRecordingPermission(context: Application) =
        ContextCompat.checkSelfPermission(context, Manifest.permission.RECORD_AUDIO) == PackageManager.PERMISSION_GRANTED

    override fun onPartialResults(results: Bundle?) = updateResults(speechBundle = results)
    override fun onResults(results: Bundle?) = updateResults(speechBundle = results)
    override fun onEndOfSpeech() = notifyListening(isRecording = false)

    override fun onError(errorCode: Int) {
        viewState?.value = viewState?.value?.copy(error = when (errorCode) {
            ERROR_AUDIO -> "error_audio_error"
            ERROR_CLIENT -> "error_client"
            ERROR_INSUFFICIENT_PERMISSIONS -> "error_permission"
            ERROR_NETWORK -> "error_network"
            ERROR_NETWORK_TIMEOUT -> "error_timeout"
            ERROR_NO_MATCH -> "error_no_match"
            ERROR_RECOGNIZER_BUSY -> "error_busy"
            ERROR_SERVER -> "error_server"
            ERROR_SPEECH_TIMEOUT -> "error_timeout"
            else -> "error_unknown"
        })
    }

    override fun onReadyForSpeech(params: Bundle?) {}
    override fun onRmsChanged(rmsdB: Float) {}
    override fun onBufferReceived(buffer: ByteArray?) {}
    override fun onEvent(eventType: Int, params: Bundle?) {}
    override fun onBeginningOfSpeech() {}
}

Notice that we extend AndroidViewModel since we need the application context.

The gist of the class is that we provide an observable, getViewState(): LiveData<ViewState>, to which others can subscribe to receive the spoken text. The complete view state is modeled by the ViewState data class.

We also expose startListening() and stopListening() to allow for manual control. The SpeechRecognizer engine will stop automatically when you stop talking, but this gives us extra control if needed.
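
The engine also has a cancel() call, which aborts recognition without delivering any results. It’s not used in this project, but a sketch of how it could be exposed alongside the two methods above:

    // Sketch (not in the sample project): abort recognition and discard any
    // pending results instead of waiting for them to be finalized.
    fun cancelListening() {
        speechRecognizer.cancel()
        notifyListening(isRecording = false)
    }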

That’s about it. We update our observed viewState on both partial and final results, allowing the UI to be updated while we speak.
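
One thing the class above leaves out is cleanup. The SpeechRecognizer holds a connection to the recognition service, and its methods must be called from the main thread, so releasing it in onCleared() is a natural fit. A minimal sketch:

    // Sketch: release the connection to the recognition service when this
    // ViewModel is destroyed; must be called from the main thread.
    override fun onCleared() {
        speechRecognizer.destroy()
        super.onCleared()
    }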

Rendering the UI

As desired, the UI can now focus on doing what it should - namely rendering a view state, and receiving input from the user (MVI).

We end up with a MainActivity that looks something like this (though I’ve shortened it slightly for this blog post):

import android.Manifest
import android.arch.lifecycle.Observer
import android.arch.lifecycle.ViewModelProviders
import android.content.pm.PackageManager
import android.os.Bundle
import android.support.v4.app.ActivityCompat
import android.support.v4.content.ContextCompat
import android.support.v7.app.AppCompatActivity
import android.view.View
import android.widget.Button
import android.widget.TextView

class MainActivity : AppCompatActivity() {
    private lateinit var speechRecognizerViewModel: SpeechRecognizerViewModel
    private lateinit var textField: TextView
    private lateinit var micButton: Button

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)

        textField = findViewById<TextView>(R.id.spoken_result_text_field)
        micButton = findViewById<Button>(R.id.mic_button).apply {
            setOnClickListener(micClickListener)
        }

        setupSpeechViewModel()
    }

    private val micClickListener = View.OnClickListener {
        if (!speechRecognizerViewModel.permissionToRecordAudio) {
            ActivityCompat.requestPermissions(this, permissions, REQUEST_RECORD_AUDIO_PERMISSION)
            return@OnClickListener
        }

        if (speechRecognizerViewModel.isListening) {
            speechRecognizerViewModel.stopListening()
        } else {
            speechRecognizerViewModel.startListening()
        }
    }

    private fun setupSpeechViewModel() {
        speechRecognizerViewModel = ViewModelProviders.of(this).get(SpeechRecognizerViewModel::class.java)
        speechRecognizerViewModel.getViewState().observe(this, Observer<SpeechRecognizerViewModel.ViewState> { viewState ->
            render(viewState)
        })
    }

    private fun render(uiOutput: SpeechRecognizerViewModel.ViewState?) {
        if (uiOutput == null) return

        textField.text = uiOutput.spokenText

        micButton.background = if (uiOutput.isListening) {
            ContextCompat.getDrawable(this, R.drawable.mic_red)
        } else {
            ContextCompat.getDrawable(this, R.drawable.mic_black)
        }
    }

    override fun onRequestPermissionsResult(requestCode: Int, permissions: Array<out String>, grantResults: IntArray) {
        super.onRequestPermissionsResult(requestCode, permissions, grantResults)

        if (requestCode == REQUEST_RECORD_AUDIO_PERMISSION) {
            speechRecognizerViewModel.permissionToRecordAudio =
                    grantResults.isNotEmpty() && grantResults[0] == PackageManager.PERMISSION_GRANTED
        }

        if (speechRecognizerViewModel.permissionToRecordAudio) {
            micButton.performClick()
        }
    }

    companion object {
        private const val REQUEST_RECORD_AUDIO_PERMISSION = 1
        private val permissions = arrayOf(Manifest.permission.RECORD_AUDIO)
    }
}

We set up the SpeechRecognizerViewModel in setupSpeechViewModel(), and render the UI in the render() method.

One side effect is that we need to request the user’s permission to record audio. If you have any suggestions on where to put that logic other than in the activity, I’m all ears.
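
If you’re on newer libraries than the ones used here, one option is the AndroidX Activity result API, which at least tucks the callback plumbing away. A hypothetical sketch - registerForActivityResult didn’t exist in the android.arch era, so this is not a drop-in change:

    // Hypothetical sketch using androidx.activity's result API - requires
    // migrating off the android.arch libraries used in this post.
    private val requestRecordAudioPermission =
            registerForActivityResult(ActivityResultContracts.RequestPermission()) { granted ->
                speechRecognizerViewModel.permissionToRecordAudio = granted
                if (granted) speechRecognizerViewModel.startListening()
            }

    // In the click listener:
    // requestRecordAudioPermission.launch(Manifest.permission.RECORD_AUDIO)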

Result

Nothing fancy going on here, but it gets the message through.

[Screenshot: voice-to-text.png]

Any suggestions on improvements are greatly appreciated! Just drop a comment below.