The Arcology Garden

Is my phone running an ad-fraud campaign while I sleep?


Something on my phone is waiting for me to go to sleep to contact ad servers

I realized last week that my phone's DNS wasn't going through the pihole – part of this is Comcast xFi's fault, part is Android's fault, and some of it was simply my fault for making some overly-effective firewall rules to prevent my server from being an open DNS relay. Ultimately, I "fixed" it properly by pointing Tailscale's MagicDNS at one of my Tailscale nodes running a Pi-hole server and leaving Tailscale running on my phone. A bit of extra battery usage, but not significant enough to care about unless I'm also using my server as an Exit Node.

But then I looked at my traffic graphs:

Something on my phone makes thousands of requests overnight to look up googleadservices.com and ad.doubleclick.net … I smell an ad impression fraud playbook, or at least someone doing too many DNS lookups. While my pihole does have a nearly constant background chatter of ad-server lookups on the order of 1-3 per minute, overnight these jump up to nearly 25 per minute.
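For the record, per-minute lookup rates like these don't have to be eyeballed from the dashboard; they can be pulled straight out of Pi-hole's long-term query database. A sketch, assuming the stock `/etc/pihole/pihole-FTL.db` path and its `queries` table schema (timestamp, domain, …) – check your install before trusting either:

```python
import sqlite3
from collections import Counter

# ASSUMPTION: the default Pi-hole long-term database path and its `queries`
# table schema (timestamp, domain, ...); adjust for your install.
FTL_DB = "/etc/pihole/pihole-FTL.db"

AD_DOMAINS = ("googleadservices.com", "ad.doubleclick.net")

def ad_lookups_per_minute(conn, domains=AD_DOMAINS):
    """Count matching DNS queries, bucketed into minutes since the epoch."""
    per_minute = Counter()
    cur = conn.execute("SELECT timestamp, domain FROM queries")
    for ts, domain in cur:
        # match the domain itself or any subdomain of it
        if any(domain == d or domain.endswith("." + d) for d in domains):
            per_minute[ts // 60] += 1
    return per_minute

# usage, on the real database:
# conn = sqlite3.connect(FTL_DB)
# rates = ad_lookups_per_minute(conn)
# print(max(rates.values()))  # the worst minute overnight
```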

On the one hand, this is expected and accepted behavior on mobile devices despite being a huge freakin' problem, but this one in particular raised my hackles: those periods where it went crazy are more or less the exact times I was asleep. It's really disturbing that something might be monitoring me specifically to hide its activity. This is, at the very least, spooky, and at most a pretty blatant violation of privacy norms. Or maybe it's only running when my phone is plugged in, so that nobody worries about the battery usage.

But what is normal after all?

Can I figure out which app is doing this?

I poked around in the network data usage and battery usage screens, but of course that didn't elucidate much. Maybe tomorrow night I'll leave my phone unplugged and see if the battery usage reflects that.

But is there anything in the phone's debug logs? I ran adb logcat and saved the output to a file for analysis:

link to adb logs attachment

It's easy enough to slurp this file into Python to analyze it. If I want to do more I can pull in Pandas and NumPy to do some Jupyter-style exploration, but first I want to see some simple distributions of the messages. Logcat can give me some of this.

export LOG_FILE=~/org/data/20/220412T115028.569214/adb-output
adb logcat -d -v long > $LOG_FILE   # dump the whole buffer, one block per message
adb logcat -S 2>/dev/null           # print logcat's built-in per-tag statistics

None of this is particularly … exciting or elucidating, is it? The biggest logging systems which jump out at me are my keyboard input method, the media scanner (which gets confused trying to analyze all the files I sync to the device via Syncthing), YouTube, my calendar app… hmm, not much "there there".

import pathlib

in_file = pathlib.Path("data/20/220412T115028.569214/adb-output")
assert in_file.exists()

output = in_file.read_text()
assert len(output) > 0

# -v long output separates each message with a blank line
blocks = output.split("\n\n")
[[block] for block in blocks[0:5]]

So using adb with -v long like I did breaks each message into blocks separated by a blank line. Good enough to parse like this, especially since all I really care about is the metadata header, usually the first line. The first record looks corrupted, but who knows… let's see:

import re

headers = []

for block in blocks:
    lines = re.split("\n+", block)
    # the metadata header is the line wrapped in square brackets
    header = list(filter(lambda line: line.startswith("[") and line.endswith("]"), lines))
    # everything after the header is the message body
    body = "\n".join(lines[2:])
    headers.append(header + [body])

# only print the uhh header
[[header[0]] for header in headers[0:10]]

Cool, so now all my headers are extracted; let's parse this! It's not exactly trivial. Of course these lines aren't regular, but you can see that they are generally set to the same columns, especially the fields I care about (the time, and that last field, the emitting module). We'll be ugly about it.

import re
from datetime import datetime

export = []

for header in headers:
    try:
        working_header = header[0]
    except IndexError:
        print("bad header: {}".format(header))
        continue
    parts = re.split(r"\s+", working_header)
    try:
        # pad the fractional seconds out to microseconds for strptime
        burp = "2022-{} {}000".format(parts[1], parts[2]) # (ref:burp)
        assembly = [datetime.strptime(burp, "%Y-%m-%d %H:%M:%S.%f"), parts[-2], header[-1]]
        export.append(assembly)
    except (IndexError, ValueError):
        # short headers and the corrupted first record land here
        continue

len(export)

I am lazy with intermediate variable names… (burp) takes the date and time and crams them into a string I can call datetime.strptime on, and parts[-2] is the second-to-last element from that header, the Android module which emitted it.

And so I want all the messages between 22:00 and 09:00… I could do this within the logcat invocation, and probably should, but let's do some more nasty shit with datetime since I'm already invoking strptime.

start_window = datetime(2022, 4, 11, 22, 0)
end_window = datetime(2022, 4, 12, 9, 0)

def date_filter_fn(record):
    return start_window < record[0] < end_window

filtered = list(filter(date_filter_fn, export))
len(filtered)

That's not as many as I would have hoped!! Let's see where they come from.

# tally messages per emitting module
sum_map = {}
for msg in filtered:
    module = msg[1]
    sum_map[module] = sum_map.get(module, 0) + 1

sum_map

So chatty is … chatty? grr.

grep -A1 "I/chatty" $LOG_FILE | head

Well, that's not so useful, is it… chatty is Android's log rate-limiter, and it's telling me that some things are logging so much that their messages are being elided from the logs.

Same string-extraction gymnastics… Tired of ugly Python? So am I… enjoy an ugly shell pipeline! One day maybe I'll just learn awk outright.

grep -A1 "I/chatty" $LOG_FILE | grep expire | awk '{print $2}' | sed -e 's/:.*//' | sort | uniq -c | sort -nr
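The pipeline counts how many chatty records each tag produced; summing the dropped-line counts instead would estimate how much was actually elided. A Python sketch, assuming chatty message bodies look roughly like `uid=10123(com.example.app) RenderThread expire 13 lines` – the exact shape varies between Android builds:

```python
import re

# ASSUMPTION: the rough shape of chatty's collapse messages; this
# varies between Android builds, so treat the regex as a sketch.
EXPIRE_RE = re.compile(r"uid=\d+\((?P<pkg>[^)]+)\).*?expire (?P<n>\d+) lines?")

def elided_lines_by_package(lines):
    """Sum how many log lines chatty dropped, per package name."""
    totals = {}
    for line in lines:
        m = EXPIRE_RE.search(line)
        if m:
            pkg = m.group("pkg")
            totals[pkg] = totals.get(pkg, 0) + int(m.group("n"))
    return totals
```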

So this is a dead end for tracking this down, it seems – I'm sure I gave "informed consent" when I installed whatever app is tracking my sleep schedule to run advertising fraud schemes.

What's next, must I install network monitoring software directly on my phone to track this down?

What about dumpsys output?

Okay, that was a dead end… Poking around in some poorly-answered Stack Overflow questions and SEO'd Android dev blog Google results, I found that there is a command on Android systems called dumpsys which will … dump system information to the terminal. In particular, it can dump detailed network statistics broken down by the process UID responsible for the traffic. Since my phone isn't doing a whole lot overnight, in theory those 20,000+ DNS requests should wind up in here somehow, one hopes…

adb shell dumpsys netstats detail > /tmp/netstats
echo "[[file:/tmp/netstats]]"

The output of the dumpsys command is just barely structured…

What if I just YAML-load it, lul:

import yaml

# dumpsys output is indented like YAML, but it isn't YAML
try:
    with open("/tmp/netstats", "r") as f:
        dictmaybe = yaml.safe_load(f.read())
except Exception:
    dictmaybe = "lol no"

dictmaybe

Cool, so I gotta parse this word soup myself.

import pathlib
import re
from datetime import datetime

lines = pathlib.Path("/tmp/netstats").read_text().split("\n")

PARSE_STATE = None
curr_uid = None

parsed_data = []

for line in lines:
  if line.startswith("UID stats"):
    PARSE_STATE = "UID"
    continue

  if PARSE_STATE is not None:
    match = re.search(r"^(?P<whitespace> +)", line)

    if match is None:
      print("nope")
      level = 0
      PARSE_STATE = None
      curr_uid = None
      continue

    level = int(len(match.group("whitespace")) / 2)
    if level == 1 and line.startswith("  ident"):
      # extract UID
      m = re.search(r"uid=(?P<uid>\d+)", line)
      if not m:
        pass
        #print("line bad {}".format(line))
      else:
        curr_uid = m.group("uid") 

    if level == 3:
      # example line
      # st=1648792800 rb=0 rp=0 tb=76 tp=1 op=0
      m = re.search(r"st=(?P<st>\d+) rb=(?P<rb>\d+).*tb=(?P<tb>\d+).*", line)
      if m is None:
        print("uhh {}".format(line))
        continue

      ts = datetime.fromtimestamp(int(m.group("st")))

      parsed_data.append([ts, curr_uid, m.group("rb"), m.group("tb")])

len(parsed_data)

Now we can shove the data into a Pandas DataFrame and plot the output.

import pandas as pd

# index the frame by timestamp so we can slice it by date range
time_index = [r[0] for r in parsed_data]

df = pd.DataFrame(data=parsed_data, columns=["ts", "uid", "rx", "tx"], index=time_index)
df = df.sort_index()
df['ts'] = pd.to_datetime(df['ts'])
df['rx'] = pd.to_numeric(df['rx'])
df['tx'] = pd.to_numeric(df['tx'])
df['uid'] = pd.to_numeric(df['uid'])
df.head()
downsize = df.loc["2022/04/11 02:00":"2022/04/12 09:00"]
downsize.head()
import matplotlib
import matplotlib.pyplot as plt
import seaborn
uids = downsize["uid"].unique()
fig, ax = plt.subplots()
seaborn.set_style("ticks")
for uid in uids:
    this_uid_df = downsize.loc[downsize["uid"]==uid]
    total_bytes = this_uid_df["tx"].sum()
    if total_bytes > 64000:
        seaborn.lineplot(
            ax=ax,
            data=this_uid_df.groupby("ts").sum(),
            legend="full",
            y="tx", x="ts",
            palette="deep",
            label=f"{uid} ({total_bytes})", 
        )

h, l = ax.get_legend_handles_labels()
ax.legend_.remove()

fig.autofmt_xdate()
fig.legend(h, l, loc="center right", ncol=4, fontsize=6)

loc = 'data/20/220412T115028.569214/plt1.png'
fig.savefig(loc, bbox_inches=matplotlib.transforms.Bbox([[0,0],[11,5]]))
loc

Some interesting outliers, but no idea which apps they belong to… let's see.

maxis = downsize.groupby("uid")["tx"].sum().sort_values().iloc[::-1]
maxis[0:19]

And the top 50 UIDs are….

top_uids = maxis.index[0:50].astype(str)
"\n".join(top_uids)
10009
10316
10329
10263
10195
10282
10167
10276
10325
0
10123
10250
10125
10281
10202
1010123
1010125
10274
10218
1010228
10318
1010167
10247
10191
10264
10343
10286
10272
10307
1073
10185
10186
10285
1010205
10270
10278
1010185
10337
10287
10136
10279
10269
10204
10205
10254
10112
10459
10209
1000
1010223

Okay, so … this gives me the transmitted bytes per UID, summed over those 2-hour sampling buckets, which is quite useful. Using a different dumpsys command, I can export information about the installed packages, including the UID mapping, to figure out which app each UID belongs to.

Run in a shell:

adb shell dumpsys package > packages_dump.txt

export PKG_FILE=data/20/220412T115028.569214/packages_dump.txt

function finduid() {
  echo -ne "$1 "
  # the package name sits on the line just above "userId=NNNN" in the dump
  (grep -F -B1 "userId=$1" $PKG_FILE || echo "unknown") | head -1
}

finduid <<top-10-network-uids()>>
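Since the rest of this exploration lives in Python anyway, the same UID-to-package join can be sketched there too. The section layout assumed here – a `Package [com.example.app] (…):` line followed a few lines later by `userId=10123` – varies across Android releases, so adjust the regexes to what your dump actually contains:

```python
import re

# ASSUMPTION: each package section in `dumpsys package` output starts with
# "Package [com.example.app] (hash):" and contains "userId=NNNNN" shortly
# after; the exact layout differs between Android versions.
def uid_to_packages(dump_text):
    """Build a {uid: [package, ...]} mapping from a dumpsys package dump."""
    mapping = {}
    current_pkg = None
    for line in dump_text.splitlines():
        pkg_match = re.search(r"Package \[(?P<pkg>[^\]]+)\]", line)
        if pkg_match:
            current_pkg = pkg_match.group("pkg")
            continue
        uid_match = re.search(r"userId=(?P<uid>\d+)", line)
        if uid_match and current_pkg:
            # shared UIDs mean several packages can map to one uid
            mapping.setdefault(uid_match.group("uid"), []).append(current_pkg)
            current_pkg = None
    return mapping

# usage:
# dump = open("packages_dump.txt").read()
# uid_to_packages(dump).get("10123", ["unknown"])
```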

Analyzing the list of "network-noisy" packages

Okay, so we've narrowed it down – let's skim through this list and see if anything that shouldn't be there hops out at me…

com.urbandroid.sleep.addon.port is the sleep tracker I use, and it does make a shitload of network calls. This isn't surprising, as I have their cloud backup enabled with data going back to 2012. I've been paying for it, so I'll be pretty sour if they're running ad-fraud on my phone.

Tailscale, Nextcloud, Syncthing, and Firefox (with uBlock Origin and other addons installed) all have a lot of natural network traffic and shouldn't be doing these things. I sure hope I can trust them, as I can't feasibly replace them.

reddit.news is Relay for Reddit, the Reddit mobile app I use. I wouldn't put it past them, or past some JS loaded in a site's web frame, but this traffic happens awfully regularly.

com.amtrak.rider could be doing mean things – their site already runs a bunch of spying bullshit… but background ad-fraud? idk.

net.daylio is a mood/habit-tracking app which has a nightly automated backup function. I hope they're not doing anything untoward.

Some com.google packages; they wouldn't run ad fraud against themselves, right?

com.jumboprivacy is a paid service which runs little privacy-preserving scripts against web services on your device… uh huh…

The Google quick search is a bit surprising to me, I suppose.

And of course there are some things being run as root (uid 0), and some other high-level UIDs which aren't reflected in my package list….
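Those seven-digit UIDs are decodable, though: Android composes kernel UIDs from an Android user ID and an app ID (PER_USER_RANGE is 100000 in android.os.UserHandle, and regular apps start at app ID 10000). So 1010123 is app ID 10123 running as Android user 10 – e.g. a work profile – the same package as plain UID 10123. A quick sketch:

```python
# Android kernel UID = android_user * PER_USER_RANGE + app_id
PER_USER_RANGE = 100000          # android.os.UserHandle.PER_USER_RANGE
FIRST_APPLICATION_UID = 10000    # below this are system UIDs like 0 (root) and 1000 (system)

def decode_uid(uid):
    """Split a kernel UID into (android_user, app_id)."""
    return uid // PER_USER_RANGE, uid % PER_USER_RANGE

def is_app(uid):
    """True if this UID belongs to an installed app rather than the system."""
    return decode_uid(uid)[1] >= FIRST_APPLICATION_UID
```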

Something approaching a conclusion

All of this leaves me where I started, confused and a bit disoriented, unsure of what is happening on my own device in my own home and basically powerless to do anything about it.

But it raises a lot of questions, of course the main one being "what did I learn?"

Did I actually learn anything useful in these stats? Remember that this all started with tracking down DNS traffic, which of course is minuscule compared to even a single JPG or /r/formula1 comment thread loaded by Relay. In reality, any app on my system could be running a campaign like this, and there are not a lot of tools to explore this data usage.
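To put numbers on "minuscule", a back-of-envelope comparison – both byte counts here are assumed round figures, not measurements:

```python
# ASSUMED sizes: a DNS query + response is on the order of a couple hundred
# bytes on the wire, while a single mobile-sized JPEG is easily hundreds of KB.
DNS_EXCHANGE_BYTES = 200        # assumed average, query + response
JPEG_BYTES = 300 * 1024         # assumed mid-size photo

# ~20,000 overnight lookups amounts to about 4 MB...
overnight_dns = 20_000 * DNS_EXCHANGE_BYTES

# ...which is the byte-equivalent of only about 13 photos
print(overnight_dns / JPEG_BYTES)
```

So the night's entire DNS barrage would be invisible next to one scroll through an image-heavy feed.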

Of course, you gave your "informed consent" when you installed these apps!

What is there to do for someone who doesn't consent?

dark laughter

how do you feel about the unabomber manifesto?

I'll disable some of these apps or at least force close them tonight and see how many times we try to load ad-impressions overnight.

I also noticed within dumpsys a way to dump the active "activities" on an Android device. Perhaps I'll set up ADB in a cron job and see what sort of interesting things run overnight, and write some awful plain-text parser again to extract the useful bits of data out of it.
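A config-fragment sketch of what that cron job could look like. Everything here is an assumption: it presumes the phone stays reachable over adb-on-TCP all night (`adb tcpip 5555; adb connect <phone>:5555`), and the script path and output directory are hypothetical names, not anything that exists yet:

```shell
# crontab entry: every 10 minutes between 22:00 and 08:59
# */10 22-23,0-8 * * * /home/rrix/bin/snapshot-activities.sh

# snapshot-activities.sh (hypothetical): dump the activity stack
# into a timestamped file for later parsing
OUT_DIR=~/org/data/20/220412T115028.569214/activities
mkdir -p "$OUT_DIR"
adb shell dumpsys activity activities > "$OUT_DIR/$(date +%Y%m%dT%H%M%S).txt"
```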

My Librem 5 will supposedly ship soon, and while it'll be nice to have something approaching a libre, or at least introspectable, userspace and a set of mostly-functional free apps, there are parts of my life that won't fit on that device. Even if this phone ends up living in a desk drawer 18 hours a day, if it's just running weird ad campaigns while I sleep or while it charges, should I feel okay with that?

How should I?

Note to self on opening this doc:

Make sure to hack in a Python for the session; I don't feel like putting this somewhere for direnv.

(setq org-babel-python-command (concat (s-chomp (shell-command-to-string "nix-build ~/org/nix-shells/python-pandas.nix"))
                                       "/bin/python"))