Can robots do my homework?

Hacking Online Foreign Language Tools

A Note on Ethics

All research shown here is ethical because no active attacks are performed on Duolingo or Yabla servers. All exploits are performed locally and do not modify HTTP requests at all compared to normal usage of either service. All data used in the following exploits was intentionally sent to clients by both services. Additionally, the tools created for this research were never used on official assignments that counted toward grades.

Abstract

After a brief examination of the network traffic used by the Yabla online learning tool, it became apparent that the way Yabla loads transcripts on the frontend is easy to exploit in order to automate the completion of its games. A proof-of-concept Chrome extension was written to take advantage of this exploit. Other online language tools, such as Duolingo, were evaluated, but similar exploits were not found.

Introduction

Recently, my Spanish teacher announced that the class would begin to receive upwards of three Yabla listening activities per week. The class has also been subject to Duolingo quotas, under which we are graded on completing hundreds of lessons in a given semester. Taking advantage of some software development experience, I decided to examine whether I could complete these assignments without spending vast amounts of time listening to native Spanish speakers repeatedly or typing for hours on end.

Meet Duolingo

Duolingo is a much newer and much more advanced online foreign language tool than Yabla. It is written in React, it is quick, and it makes use of animations. It is also a much more complicated application than Yabla.

Analysis of Duolingo

I began by examining the possibility of automating the answering of multiple-choice questions.

Yabla uses a PHP backend and a loose policy of one HTML page per application view. A detailed analysis of XHR traffic was not necessary, as almost all of the data and code needed to run the game is provided in the first load of the PHP page player_cdn.php. Dynamic content appears to be provided through global JavaScript variables that PHP populates in a script tag at the top of the page. A quick search for the first lyric of the song revealed the transcript information in the global JavaScript variable CAPTIONS.

It is worth noting that upon opening the development console, Duolingo advertises that they are hiring software engineers.

It is immediately apparent that Duolingo uses a Single Page Application (SPA) architecture, so I chose to focus the analysis on XHR requests only.

Type	URL	Method	Status	Notes
XHR	2x Requests to `https://api.mixpanel.com/track/?data={large base64 encoded string}`	GET	200	The data parameter in the URL contained base64-encoded key-value tracking data. I found it interesting that Duolingo tracks user interaction in such detail; however, the data itself was not very interesting. It did contain sensitive-looking user IDs. It seems to me that this could be done much more efficiently by POSTing a JSON blob.
XHR	2x Requests to `https://ljknv3sref.execute-api.us-east-1.amazonaws.com/prod/events`	POST	200	This appears to be an additional tracking endpoint that does, in fact, POST a JSON blob. Duolingo is likely experimenting with two different user-interaction tracking systems.
XHR	`https://www.duolingo.com/sessions/394864751773?name=Basics-1&language=es&lesson_number=3&type=lesson&speak_capable=true&web_speech_enabled=true&kc_strength_model_version=2&client_grading_data_version=2&_=1489702221619`	GET	200	This response contained a very large amount of information about the lesson, including references to “surface_forms,” “lexemes,” and other data. See below for a more detailed analysis of this data.
XHR	`https://www.duolingo.com/log_grading_result`	POST	200	This endpoint is called whenever a question is answered. It accepts lengthy form data as input and returns only a status code. See below for a detailed analysis.
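
To inspect a tracking request like the Mixpanel one in the table above, the data parameter can be decoded locally. The following is only a minimal sketch: it assumes the decoded payload is JSON, and both the helper name decodeTrackingData and the sample payload are hypothetical, not captured from Duolingo.

```javascript
// Hypothetical helper: decode the base64 `data` query parameter of a tracking URL.
// Assumes the decoded payload is ASCII JSON text.
function decodeTrackingData(url) {
  var encoded = new URL(url).searchParams.get('data');
  if (encoded === null) return null;
  // atob() turns base64 into a binary string, which is enough for ASCII payloads.
  return JSON.parse(atob(encoded));
}

// Made-up payload for illustration only.
var samplePayload = { event: 'lesson_start', properties: { language: 'es' } };
var sampleUrl = 'https://api.mixpanel.com/track/?data=' +
  encodeURIComponent(btoa(JSON.stringify(samplePayload)));

console.log(decodeTrackingData(sampleUrl));
```

Because the decoding happens entirely on data the browser already received, this stays within the local-only approach described in the ethics note.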

The response to the URL https://www.duolingo.com/sessions/394864751773?name=Basics-1&language=es&lesson_number=3&type=lesson… appeared to contain all of the information relevant to the lesson, including all of the questions, which Duolingo calls session_elements. A field titled solution_keys was present; however, its values did not appear to match any data currently held in the browser. They could be hashes of the correct solution or database keys. In any case, the correct answer is not held in an obvious fashion in the session information.

The property corresponding to the correct answer may not be immediately apparent from reading this data. A further review of traffic and code makes it far more clear. In this capture, the user is asked to translate “the bread” to Spanish. The correct answer is “el pan.” The correct answer was provided.

In this capture the user is asked to translate “the water” to Spanish. The correct answer is “el agua.” An incorrect answer of “el pan” was provided.

Notice that the result appears to be graded before being sent to the server. It would appear that Duolingo performs the grading on the frontend and then sends the result to the backend for logging. This opens the door to a potential exploit. However, it raises the question of which property is used to validate the phrase.

In order to answer these questions, I decided to examine the code. I downloaded the JavaScript file that appeared to contain the application from http://d7mj4aqfscim2.cloudfront.net/proxy/js2/230720168c0b82eedd46-duolingo.js. Using free online tools, I de-minified the code. The code appeared to be only partially obfuscated, so I was able to read all of the original function names. The code has some interesting features. For instance, every instance of “true” has been replaced with “!0” and every instance of “false” with “!1.” This was presumably done by a minifier in the build process to shrink the file. The /log_grading_result endpoint is called from the following function logGradedSolution():

Mitigation

These issues could be mitigated by creating a backend endpoint to validate question responses and ensuring that the correct answer is not passed to the client. If the backend validates responses, there is no way for the client to know what the correct response is until the user figures it out.
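
As a rough illustration of this mitigation, the sketch below keeps the answer key on the server and returns only a verdict to the client. Everything in it is hypothetical (the ANSWER_KEY object, the question ids, and the gradeOnServer function); it does not describe Duolingo's actual backend.

```javascript
// Hypothetical server-side answer key; in this sketch it never leaves the server.
var ANSWER_KEY = {
  q1: 'el pan',
  q2: 'el agua'
};

// Normalize so that case and extra whitespace do not affect grading.
function normalize(text) {
  return String(text).trim().replace(/\s+/g, ' ').toLowerCase();
}

// Hypothetical handler body: the client submits only a question id and a
// response, and receives only a verdict, never the expected answer.
function gradeOnServer(questionId, response) {
  if (!Object.prototype.hasOwnProperty.call(ANSWER_KEY, questionId)) {
    return { ok: false, error: 'unknown question' };
  }
  var expected = normalize(ANSWER_KEY[questionId]);
  return { ok: true, correct: normalize(response) === expected };
}

console.log(gradeOnServer('q2', 'el pan'));     // graded as incorrect
console.log(gradeOnServer('q2', ' El  Agua ')); // graded as correct
```

The trade-off is an extra round trip per question, since the client can no longer grade locally.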

Meet Yabla

I decided to analyze the video-based fill-in-the-blank game for this research because it appeared to be a unique challenge to exploit and offered the possibility of large time savings for students using an automated tool. The objective of the game is to fill in blanks in the transcript of a song in order to practice Spanish listening skills. Compared to Duolingo, Yabla is a fairly simple application.

Analysis of Yabla

Because the full transcript is delivered to the browser in the global JavaScript variable CAPTIONS, the game can very easily be made to give up its answers to a student or an automation tool. I created a proof-of-concept Chrome extension that plays the game automatically using a content script. Chrome content scripts run in an isolated environment and cannot read the page's JavaScript variables directly, but they can read the DOM. The script works around this by finding the script tag in the DOM and parsing its text content to acquire the transcript data. Then, it uses DOM manipulation to complete the blank textboxes automatically.
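
The transcript extraction step can be sketched as a small pure function that takes the text of a script tag and returns the parsed array. This is an illustrative variant, not the extension's code: the helper name extractJsonArray is hypothetical, and it assumes the variable is assigned a JSON array literal. Unlike splitting the script on ';' and '=', it scans for the matching closing bracket, so those characters can appear inside caption text.

```javascript
// Hypothetical helper: pull the JSON array assigned to a named global out of
// inline script text. Assumes the value is a JSON array literal.
function extractJsonArray(scriptText, variableName) {
  var nameIndex = scriptText.indexOf(variableName);
  if (nameIndex < 0) return null;
  var start = scriptText.indexOf('[', nameIndex);
  if (start < 0) return null;

  var depth = 0, inString = false, escaped = false;
  for (var i = start; i < scriptText.length; i++) {
    var ch = scriptText.charAt(i);
    if (inString) {
      // Inside a JSON string: only track escapes and the closing quote.
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') {
      inString = true;
    } else if (ch === '[') {
      depth++;
    } else if (ch === ']') {
      depth--;
      // The bracket that closes the outermost array ends the literal.
      if (depth === 0) return JSON.parse(scriptText.slice(start, i + 1));
    }
  }
  return null;
}

// Made-up script text for illustration only.
var sampleScript = 'var OTHER = 1; var CAPTIONS = [{"transcript":"Hola; amigo",' +
  '"translation":"Hi; friend [x = y]"}]; var MORE = 2;';
console.log(extractJsonArray(sampleScript, 'CAPTIONS'));
```

JSON.parse still throws on text that is not valid JSON, so a caller would wrap this in try/catch as the listing below does.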

See the source code listing of the proof of concept here: Yabla Helper Chrome Content Script

Mitigation

Yabla Helper Chrome Content Script

(function YablaInject(document) {
//API
function levenshteinDistance(a, b) {
if (a.length == 0) return b.length;
if (b.length == 0) return a.length;

var matrix = [];

// increment along the first column of each row
var i;
for (i = 0; i <= b.length; i++) {
matrix[i] = [i];
}

// increment each column in the first row
var j;
for (j = 0; j <= a.length; j++) {
matrix[0][j] = j;
}

// Fill in the rest of the matrix
for (i = 1; i <= b.length; i++) {
for (j = 1; j <= a.length; j++) {
if (b.charAt(i - 1) == a.charAt(j - 1)) {
matrix[i][j] = matrix[i - 1][j - 1];
} else {
matrix[i][j] = Math.min(matrix[i - 1][j - 1] + 1, // substitution
Math.min(matrix[i][j - 1] + 1, // insertion
matrix[i - 1][j] + 1)); // deletion
}
}
}

return matrix[b.length][a.length];
}

//Config
var SCRIPT_SEARCH_TERM = 'CAPTIONS';
var EVENT_LOOP_TIMEOUT = 2000;
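//Note: despite its name, this pattern matches everything except ASCII letters and
//underscores, so digits, spaces, and accented characters such as á or ñ are stripped too
var ALPHANUMERIC_REGEX = /[^A-Za-z_]/g;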

//Find the script tag containing the captions
var query = document.getElementsByTagName('script');
var scriptContents;
for (var i = 0; i < query.length; i++) {
if (query[i].innerHTML.indexOf(SCRIPT_SEARCH_TERM) >= 0) {
scriptContents = query[i].innerHTML;
}
}
if (!scriptContents) return console.warn('YablaInject could not find Yabla script on this page',
scriptContents);


//Process the contents of the script tag to find the JSON blob representing the captions
//Find the line with the captions JSON
var lines = scriptContents.split(';');
var line;
lines.forEach(function (l) {
if (l.indexOf(SCRIPT_SEARCH_TERM) >= 0) {
line = l;
}
});
if (!line) return console.warn('YablaInject could not find a line with captions in this script',
line);

//Process the line: keep everything after the first '=' so that '=' characters inside the JSON survive
var captionJsonString = line.slice(line.indexOf('=') + 1).trim();

//Parse the JSON from the line
var CAPTIONS;

try {
CAPTIONS = JSON.parse(captionJsonString);
} catch (e) {
return console.error('Error parsing captionsJsonString', e);
}

if (!CAPTIONS) return console.warn('YablaInject could not find valid captions on this page',
CAPTIONS);


//Process the captions into readable transcript
var TRANSCRIPT, TRANSCRIPT_LINES, TRANSLATION_LINES, TRANSLATION, PROCESSED_TRANSCRIPT_LINES;

TRANSCRIPT_LINES = CAPTIONS.map(function (a) {
return a.transcript;
});
PROCESSED_TRANSCRIPT_LINES = CAPTIONS.map(function (a) {
return {
processed: a.transcript.replace(ALPHANUMERIC_REGEX, '').toLowerCase().trim(),
unprocessed: a.transcript
}
});

TRANSCRIPT = TRANSCRIPT_LINES.join('\n\n');

TRANSLATION_LINES = CAPTIONS.map(function (a) {
return a.translation;
});

TRANSLATION = TRANSLATION_LINES.join('\n\n');

//Dump the transcript to the console before doing anything funky
console.info('Dumping Yabla transcript for your reading pleasure...');
console.info('English: ', TRANSLATION);
console.info('Spanish: ', TRANSCRIPT);


//DOM Manipulation loops
function loop() {
var question_wrap = document.getElementsByClassName('question_wrap')[0];
var question_answer = document.getElementById('cloze_answer');

if (question_wrap&&question_answer&&(question_answer.value==='')) {
var QUESTION_TEXT_BAD_WORDS = ['slow', 'replay', 'submit', 'answer', 'next', 'question'];
var processedQuestionText = question_wrap.textContent.replace(ALPHANUMERIC_REGEX, '').toLowerCase().trim();
QUESTION_TEXT_BAD_WORDS.forEach(function (word) {
processedQuestionText = processedQuestionText.replace(word, '');
});

//Remove any complete lines of transcript from the processed text because a complete
//transcript line will never hold the answer
PROCESSED_TRANSCRIPT_LINES.forEach(function (processedTranscriptLine) {
processedQuestionText=processedQuestionText.replace(processedTranscriptLine.processed, '');
});

//Sort transcript entries by closeness and remove dupes
var candidates = PROCESSED_TRANSCRIPT_LINES
.filter(function (item, pos) {
var index;
PROCESSED_TRANSCRIPT_LINES.forEach(function (a, p) {
if (a.processed === item.processed) index = p;
});
return index === pos;
})
.sort(function (a, b) {
return levenshteinDistance(a.processed, processedQuestionText) - levenshteinDistance(b.processed, processedQuestionText);
});

var processedTranscriptLine = candidates[0];


//Find the missing word
var missingWords = [];
processedTranscriptLine.unprocessed.replace(/[,?;':.\\\/\(\)]/g,'').split(' ').forEach(function (word) {
var processedWord = word.replace(ALPHANUMERIC_REGEX, '').toLowerCase().trim();
if (processedQuestionText.indexOf(processedWord)>=0) return;
if (missingWords.indexOf(processedWord)>=0) return;
missingWords.push(word);
});

var missingPhrase = missingWords.join(' ');

question_answer.value=missingPhrase;
}


//Another iteration
return setTimeout(loop, EVENT_LOOP_TIMEOUT);
}

//Start the loop
loop();
})(document);

See more at https://github.com/jake-billings/YablaHelper